linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [RFC PATCH] fork: add CLONE_PIDFD
@ 2019-04-10 23:40 Christian Brauner
  2019-04-10 23:40 ` [RFC-1 PATCH 1/1] fork: add CLONE_PIDFD via /proc/<pid> Christian Brauner
                   ` (6 more replies)
  0 siblings, 7 replies; 10+ messages in thread
From: Christian Brauner @ 2019-04-10 23:40 UTC (permalink / raw)
  To: torvalds, viro, jannh, dhowells, linux-api, linux-kernel
  Cc: serge, luto, arnd, ebiederm, keescook, adobriyan, tglx,
	mtk.manpages, bl0pbl33p, ldv, akpm, oleg, cyphar, joel, dancol,
	Christian Brauner

Hey Linus,

This is an RFC for adding a new CLONE_PIDFD flag to clone() as
previously discussed.
While implementing this Jann and I ran into additional complexity that
prompted us to send out an initial RFC patchset to make sure we still
think going forward with the current implementation is a good idea and
also provide an alternative approach:

RFC-1:
This is an RFC for the implementation of pidfds as /proc/<pid> file
descriptors.
The tricky part here is that we need to retrieve a file descriptor for
/proc/<pid> before clone's point of no return. Otherwise, we need to fail
the creation of a process that has already passed all barriers and is
visible in userspace. Getting that file descriptor then becomes a rather
intricate dance including allocating a detached dentry that we need to
commit once attach_pid() has been called.
Note that this RFC only includes the logic we think is needed to return
/proc/<pid> file descriptors from clone. It does *not* yet include the even
more complex logic needed to restrict procfs itself. And the additional
logic needed to prevent attacks such as openat(pidfd, "..", ...) and access
to /proc/<pid>/net/ on top of the procfs restriction.

There are a couple of reasons why we stopped short of this and decided to
sent out an RFC first:
- Even the initial part of getting file descriptors from /proc/<pid> out
  of clone() required rather complex code that struck us as very
  inelegant and heavy (which granted, might partially caused by not seeing
  a cleaner way to implement this). Thus, it felt like we needed to see
  whether this is even remotely considered acceptable.
- While discussing further aspects of this approach with Al we received
  rather substantiated opposition to exposing even more codepaths to
  procfs.
- Restricting access to procfs properly requires a lot of invasive work
  even touching core vfs functions such as
  follow_dotdot()/follow_dotdot_rcu() which also caused 2.

RFC-2:
This alternative patchset uses anonymous file descriptors instead of
file descriptors from /proc/<pid>.
A pidfd in this style comes with additional information in fdinfo: the pid
of the process it refers to in the current pid namespace (Pid: %d).

We have chosen to implement this alternative to illustrate how
strikingly simple this patchset is in comparision to the original
approach.
To remove worries about missing metadata access we have written a POC
that illustrates how a combination of CLONE_PIDFD, fdinfo, and
pidfd_send_signal() can be used to gain race-free access to process
metadata through /proc/<pid>. The sample program can easily be
translated into a helper that would be suitable for inclusion in libc so
that users don't have to worry about writing it themselves.

Thanks!
Jann & Christian

RFC-1:
Christian Brauner (1):
  fork: add CLONE_PIDFD via /proc/<pid>

 fs/proc/base.c             | 130 ++++++++++++++++++++++++++++++++++---
 fs/proc/fd.c               |   4 +-
 fs/proc/internal.h         |   2 +-
 fs/proc/namespaces.c       |   2 +-
 include/linux/proc_fs.h    |  19 ++++++
 include/uapi/linux/sched.h |   1 +
 kernel/fork.c              |  40 ++++++++++--
 7 files changed, 180 insertions(+), 18 deletions(-)

RFC-2:
Christian Brauner (3):
  fork: add CLONE_PIDFD via anonymous inode
  signal: support CLONE_PIDFD with pidfd_send_signal
  samples: show race-free pidfd metadata access

David Howells (1):
  Make anon_inodes unconditional

 arch/arm/kvm/Kconfig           |   1 -
 arch/arm64/kvm/Kconfig         |   1 -
 arch/mips/kvm/Kconfig          |   1 -
 arch/powerpc/kvm/Kconfig       |   1 -
 arch/s390/kvm/Kconfig          |   1 -
 arch/x86/Kconfig               |   1 -
 arch/x86/kvm/Kconfig           |   1 -
 drivers/base/Kconfig           |   1 -
 drivers/char/tpm/Kconfig       |   1 -
 drivers/dma-buf/Kconfig        |   1 -
 drivers/gpio/Kconfig           |   1 -
 drivers/iio/Kconfig            |   1 -
 drivers/infiniband/Kconfig     |   1 -
 drivers/vfio/Kconfig           |   1 -
 fs/Makefile                    |   2 +-
 fs/notify/fanotify/Kconfig     |   1 -
 fs/notify/inotify/Kconfig      |   1 -
 include/linux/pid.h            |   2 +
 include/uapi/linux/sched.h     |   1 +
 init/Kconfig                   |  10 --
 kernel/fork.c                  |  94 +++++++++++++++++-
 kernel/signal.c                |  14 ++-
 kernel/sys_ni.c                |   3 -
 samples/Makefile               |   2 +-
 samples/pidfd/Makefile         |   6 ++
 samples/pidfd/pidfd-metadata.c | 169 +++++++++++++++++++++++++++++++++
 26 files changed, 279 insertions(+), 40 deletions(-)
 create mode 100644 samples/pidfd/Makefile
 create mode 100644 samples/pidfd/pidfd-metadata.c

-- 
2.21.0


^ permalink raw reply	[flat|nested] 10+ messages in thread

* [RFC-1 PATCH 1/1] fork: add CLONE_PIDFD via /proc/<pid>
  2019-04-10 23:40 [RFC PATCH] fork: add CLONE_PIDFD Christian Brauner
@ 2019-04-10 23:40 ` Christian Brauner
  2019-04-10 23:40 ` [RFC-2 PATCH 1/4] Make anon_inodes unconditional Christian Brauner
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 10+ messages in thread
From: Christian Brauner @ 2019-04-10 23:40 UTC (permalink / raw)
  To: torvalds, viro, jannh, dhowells, linux-api, linux-kernel
  Cc: serge, luto, arnd, ebiederm, keescook, adobriyan, tglx,
	mtk.manpages, bl0pbl33p, ldv, akpm, oleg, cyphar, joel, dancol,
	Christian Brauner

This is an RFC for the implementation of pidfds as /proc/<pid> file
descriptors. They can be retrieved through the clone() with the addition of
the CLONE_PIDFD flag.
The tricky part here is that we need to retrieve a file descriptor for
/proc/<pid> before clone's point of no return. Otherwise, we need to fail
the creation of a process that has already passed all barriers and is
visible in userspace. Getting that file descriptor then becomes a rather
intricate dance including allocating a detached dentry that we need to
commit once attach_pid() has been called.
Note that this RFC only includes the logic we think is needed to return
/proc/<pid> file descriptors from clone. It does *not* yet include the even
more complex logic needed to restrict procfs itself. And the additional
logic needed to prevent attacks such as openat(pidfd, "..", ...) and access
to /proc/<pid>/net/.

There are a couple of reasons why we stopped short of this and decided to
sent out an RFC first.
- Even the initial part of getting file descriptors from /proc/<pid> out
  out clone() required rather complex code that struck us as very
  inelegant and heavy (which granted, might partially caused by not seeing
  a cleaner way to implement this). Thus, it felt like we needed to see
  whether this is even remotely considered acceptable.
- While discussion further aspects of this approach with Al we received
  rather substantiated opposition to exposing even more codepaths to
  procfs.
- Restricting access to procfs properly requires a lot of invasive work
  even touching core vfs functions such as
  follow_dotdot()/follow_dotdot_rcu() which also caused 2.

Jann and I are providing a second RFC alongside this one that shows an
alternative and rather much simpler approach we think might be preferable.

Signed-off-by: Christian Brauner <christian@brauner.io>
Signed-off-by: Jann Horn <jann@thejh.net>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: David Howells <dhowells@redhat.com>
Cc: "Michael Kerrisk (man-pages)" <mtk.manpages@gmail.com>
Cc: Jonathan Kowalski <bl0pbl33p@gmail.com>
Cc: "Dmitry V. Levin" <ldv@altlinux.org>
Cc: Andy Lutomirsky <luto@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Aleksa Sarai <cyphar@cyphar.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
---
 fs/proc/base.c             | 130 ++++++++++++++++++++++++++++++++++---
 fs/proc/fd.c               |   4 +-
 fs/proc/internal.h         |   2 +-
 fs/proc/namespaces.c       |   2 +-
 include/linux/proc_fs.h    |  19 ++++++
 include/uapi/linux/sched.h |   1 +
 kernel/fork.c              |  40 ++++++++++--
 7 files changed, 180 insertions(+), 18 deletions(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index 6a803a0b75df..2f5d7bd5d047 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -1731,7 +1731,7 @@ void task_dump_owner(struct task_struct *task, umode_t mode,
 	*rgid = gid;
 }
 
-struct inode *proc_pid_make_inode(struct super_block * sb,
+struct inode *proc_pid_make_inode(struct super_block * sb, struct pid *pid,
 				  struct task_struct *task, umode_t mode)
 {
 	struct inode * inode;
@@ -1753,7 +1753,7 @@ struct inode *proc_pid_make_inode(struct super_block * sb,
 	/*
 	 * grab the reference to task.
 	 */
-	ei->pid = get_task_pid(task, PIDTYPE_PID);
+	ei->pid = pid ? get_pid(pid) : get_task_pid(task, PIDTYPE_PID);
 	if (!ei->pid)
 		goto out_unlock;
 
@@ -2070,7 +2070,7 @@ proc_map_files_instantiate(struct dentry *dentry,
 	struct proc_inode *ei;
 	struct inode *inode;
 
-	inode = proc_pid_make_inode(dentry->d_sb, task, S_IFLNK |
+	inode = proc_pid_make_inode(dentry->d_sb, NULL, task, S_IFLNK |
 				    ((mode & FMODE_READ ) ? S_IRUSR : 0) |
 				    ((mode & FMODE_WRITE) ? S_IWUSR : 0));
 	if (!inode)
@@ -2428,7 +2428,7 @@ static struct dentry *proc_pident_instantiate(struct dentry *dentry,
 	struct inode *inode;
 	struct proc_inode *ei;
 
-	inode = proc_pid_make_inode(dentry->d_sb, task, p->mode);
+	inode = proc_pid_make_inode(dentry->d_sb, NULL, task, p->mode);
 	if (!inode)
 		return ERR_PTR(-ENOENT);
 
@@ -3184,12 +3184,13 @@ void proc_flush_task(struct task_struct *task)
 	}
 }
 
-static struct dentry *proc_pid_instantiate(struct dentry * dentry,
-				   struct task_struct *task, const void *ptr)
+static struct inode *proc_pid_dentry_init(struct dentry *dentry,
+					  struct task_struct *task)
 {
 	struct inode *inode;
 
-	inode = proc_pid_make_inode(dentry->d_sb, task, S_IFDIR | S_IRUGO | S_IXUGO);
+	inode = proc_pid_make_inode(dentry->d_sb, NULL, task,
+				    S_IFDIR | S_IRUGO | S_IXUGO);
 	if (!inode)
 		return ERR_PTR(-ENOENT);
 
@@ -3201,9 +3202,122 @@ static struct dentry *proc_pid_instantiate(struct dentry * dentry,
 	pid_update_inode(task, inode);
 
 	d_set_d_op(dentry, &pid_dentry_operations);
+	return inode;
+}
+
+static struct dentry *proc_pid_instantiate(struct dentry * dentry,
+				   struct task_struct *task, const void *ptr)
+{
+	struct inode *inode = proc_pid_dentry_init(dentry, task);
+	if (IS_ERR(inode))
+		return ERR_CAST(inode);
 	return d_splice_alias(inode, dentry);
 }
 
+/*
+ * Open /proc/$pid before clone()'s point of no return, i.e. before
+ * attach_pid() has been called.
+ */
+int proc_pid_open_early(struct task_struct *task, struct pid *pid,
+			struct early_proc_pid *info)
+{
+	struct pid_namespace *ns = task_active_pid_ns(current);
+	struct dentry *root = ns->proc_mnt->mnt_root;
+	pid_t vpid = pid_nr_ns(pid, ns);
+	struct inode *inode;
+	char pid_str[12];
+	int res;
+
+	if (WARN_ON(vpid == 0))
+		return -ESRCH;
+
+	/*
+	 * We can't use lookup_one_len() here. When this function is called
+	 * attach_pid() will not have been called which means that
+	 * proc_pid_lookup() will fail with ENOENT as it can't successfully
+	 * find_task_by_pid_ns().
+	 * We can just use d_alloc_name() though.
+	 */
+	snprintf(pid_str, sizeof(pid_str), "%d", vpid);
+	info->dentry = d_alloc_name(root, pid_str);
+	if (IS_ERR(info->dentry))
+		return PTR_ERR(info->dentry);
+
+	info->fd = __alloc_fd(current->files, 1, rlimit(RLIMIT_NOFILE),
+			      O_CLOEXEC);
+	if (info->fd < 0) {
+		res = info->fd;
+		goto out_put_dentry;
+	}
+
+	info->tmp_dentry = d_alloc_anon(root->d_sb);
+	if (!info->tmp_dentry) {
+		res = -ENOMEM;
+		goto out_put_fd;
+	}
+
+	inode = proc_pid_dentry_init(info->tmp_dentry, task);
+	if (IS_ERR(inode)) {
+		res = PTR_ERR(inode);
+		goto out_put_tmp_dentry;
+	}
+
+	d_instantiate(info->tmp_dentry, inode);
+	info->file = file_open_root(info->tmp_dentry, ns->proc_mnt, "/",
+				    O_RDONLY | O_NOFOLLOW | O_DIRECTORY, 0);
+	if (IS_ERR(info->file)) {
+		res = PTR_ERR(info->file);
+		goto out_put_tmp_dentry;
+	}
+
+	return 0;
+
+out_put_tmp_dentry:
+	dput(info->tmp_dentry);
+
+out_put_fd:
+	put_unused_fd(info->fd);
+
+out_put_dentry:
+	dput(info->dentry);
+
+	return res;
+}
+
+void proc_pid_dentry_commit_lock(struct early_proc_pid *info)
+{
+	lock_rename(info->tmp_dentry, info->dentry->d_parent);
+}
+
+/*
+ * Commit /proc/$pid after clone()'s point of no return, and install the file
+ * descriptor.
+ * Drops the locks acquired by proc_pid_dentry_commit_lock().
+ * Returns the file descriptor.
+ */
+int proc_pid_dentry_commit_unlock(struct early_proc_pid *info)
+{
+	/* commit the dentry */
+	d_move(info->tmp_dentry, info->dentry);
+	unlock_rename(info->tmp_dentry, info->dentry->d_parent);
+
+	/* release extra references */
+	dput(info->tmp_dentry);
+	dput(info->dentry);
+
+	/* install fd */
+	fd_install(info->fd, info->file);
+	return info->fd;
+}
+
+void proc_pid_dentry_abort(struct early_proc_pid *info)
+{
+	fput(info->file);
+	dput(info->tmp_dentry);
+	put_unused_fd(info->fd);
+	dput(info->dentry);
+}
+
 struct dentry *proc_pid_lookup(struct dentry *dentry, unsigned int flags)
 {
 	struct task_struct *task;
@@ -3480,7 +3594,7 @@ static struct dentry *proc_task_instantiate(struct dentry *dentry,
 	struct task_struct *task, const void *ptr)
 {
 	struct inode *inode;
-	inode = proc_pid_make_inode(dentry->d_sb, task, S_IFDIR | S_IRUGO | S_IXUGO);
+	inode = proc_pid_make_inode(dentry->d_sb, NULL, task, S_IFDIR | S_IRUGO | S_IXUGO);
 	if (!inode)
 		return ERR_PTR(-ENOENT);
 
diff --git a/fs/proc/fd.c b/fs/proc/fd.c
index 81882a13212d..7e624695db5a 100644
--- a/fs/proc/fd.c
+++ b/fs/proc/fd.c
@@ -186,7 +186,7 @@ static struct dentry *proc_fd_instantiate(struct dentry *dentry,
 	struct proc_inode *ei;
 	struct inode *inode;
 
-	inode = proc_pid_make_inode(dentry->d_sb, task, S_IFLNK);
+	inode = proc_pid_make_inode(dentry->d_sb, NULL, task, S_IFLNK);
 	if (!inode)
 		return ERR_PTR(-ENOENT);
 
@@ -325,7 +325,7 @@ static struct dentry *proc_fdinfo_instantiate(struct dentry *dentry,
 	struct proc_inode *ei;
 	struct inode *inode;
 
-	inode = proc_pid_make_inode(dentry->d_sb, task, S_IFREG | S_IRUSR);
+	inode = proc_pid_make_inode(dentry->d_sb, NULL, task, S_IFREG | S_IRUSR);
 	if (!inode)
 		return ERR_PTR(-ENOENT);
 
diff --git a/fs/proc/internal.h b/fs/proc/internal.h
index d1671e97f7fe..9b4cb85b96be 100644
--- a/fs/proc/internal.h
+++ b/fs/proc/internal.h
@@ -159,7 +159,7 @@ extern int proc_pid_statm(struct seq_file *, struct pid_namespace *,
 extern const struct dentry_operations pid_dentry_operations;
 extern int pid_getattr(const struct path *, struct kstat *, u32, unsigned int);
 extern int proc_setattr(struct dentry *, struct iattr *);
-extern struct inode *proc_pid_make_inode(struct super_block *, struct task_struct *, umode_t);
+extern struct inode *proc_pid_make_inode(struct super_block *, struct pid *pid, struct task_struct *, umode_t);
 extern void pid_update_inode(struct task_struct *, struct inode *);
 extern int pid_delete_dentry(const struct dentry *);
 extern int proc_pid_readdir(struct file *, struct dir_context *);
diff --git a/fs/proc/namespaces.c b/fs/proc/namespaces.c
index dd2b35f78b09..b77e4234a892 100644
--- a/fs/proc/namespaces.c
+++ b/fs/proc/namespaces.c
@@ -94,7 +94,7 @@ static struct dentry *proc_ns_instantiate(struct dentry *dentry,
 	struct inode *inode;
 	struct proc_inode *ei;
 
-	inode = proc_pid_make_inode(dentry->d_sb, task, S_IFLNK | S_IRWXUGO);
+	inode = proc_pid_make_inode(dentry->d_sb, NULL, task, S_IFLNK | S_IRWXUGO);
 	if (!inode)
 		return ERR_PTR(-ENOENT);
 
diff --git a/include/linux/proc_fs.h b/include/linux/proc_fs.h
index 52a283ba0465..e801481a8a24 100644
--- a/include/linux/proc_fs.h
+++ b/include/linux/proc_fs.h
@@ -75,6 +75,18 @@ struct proc_dir_entry *proc_create_net_single_write(const char *name, umode_t mo
 						    void *data);
 extern struct pid *tgid_pidfd_to_pid(const struct file *file);
 
+struct early_proc_pid {
+	struct dentry *dentry;
+	struct dentry *tmp_dentry;
+	struct file *file;
+	int fd;
+};
+int proc_pid_open_early(struct task_struct *task, struct pid *pid,
+			struct early_proc_pid *info);
+void proc_pid_dentry_commit_lock(struct early_proc_pid *info);
+int proc_pid_dentry_commit_unlock(struct early_proc_pid *info);
+void proc_pid_dentry_abort(struct early_proc_pid *info);
+
 #else /* CONFIG_PROC_FS */
 
 static inline void proc_root_init(void)
@@ -120,6 +132,13 @@ static inline struct pid *tgid_pidfd_to_pid(const struct file *file)
 	return ERR_PTR(-EBADF);
 }
 
+struct early_proc_pid {};
+static inline int proc_pid_open_early(struct task_struct *task,
+		struct pid *pid, struct early_proc_pid *info) { return -EINVAL; }
+void proc_pid_dentry_commit_lock(struct early_proc_pid *info) { }
+static inline int proc_pid_dentry_commit_unlock(struct early_proc_pid *info) { return -EINVAL; }
+static inline void proc_pid_dentry_abort(struct early_proc_pid *info) { }
+
 #endif /* CONFIG_PROC_FS */
 
 struct net;
diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
index 22627f80063e..cd9bd14ce56d 100644
--- a/include/uapi/linux/sched.h
+++ b/include/uapi/linux/sched.h
@@ -10,6 +10,7 @@
 #define CLONE_FS	0x00000200	/* set if fs info shared between processes */
 #define CLONE_FILES	0x00000400	/* set if open files shared between processes */
 #define CLONE_SIGHAND	0x00000800	/* set if signal handlers and blocked signals shared */
+#define CLONE_PIDFD	0x00001000	/* create new pid file descriptor */
 #define CLONE_PTRACE	0x00002000	/* set if we want to let tracing continue on the child too */
 #define CLONE_VFORK	0x00004000	/* set if the parent wants the child to wake it up on mm_release */
 #define CLONE_PARENT	0x00008000	/* set if we want to have the same parent as the cloner */
diff --git a/kernel/fork.c b/kernel/fork.c
index 9dcd18aa210b..31b405eee020 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -32,6 +32,7 @@
 #include <linux/sem.h>
 #include <linux/file.h>
 #include <linux/fdtable.h>
+#include <linux/fsnotify.h>
 #include <linux/iocontext.h>
 #include <linux/key.h>
 #include <linux/binfmts.h>
@@ -1678,11 +1679,13 @@ static __latent_entropy struct task_struct *copy_process(
 					struct pid *pid,
 					int trace,
 					unsigned long tls,
-					int node)
+					int node,
+					int *pidfd)
 {
 	int retval;
 	struct task_struct *p;
 	struct multiprocess_signals delayed;
+	struct early_proc_pid proc_pid_info;
 
 	/*
 	 * Don't allow sharing the root directory with processes in a different
@@ -1936,6 +1939,17 @@ static __latent_entropy struct task_struct *copy_process(
 		}
 	}
 
+	/*
+	 * This has to happen after we've potentially unshared the file
+	 * descriptor table (so that the pidfd doesn't leak into the child if
+	 * the fd table isn't shared).
+	 */
+	if (clone_flags & CLONE_PIDFD) {
+		retval = proc_pid_open_early(p, pid, &proc_pid_info);
+		if (retval)
+			goto bad_fork_free_pid;
+	}
+
 #ifdef CONFIG_BLOCK
 	p->plug = NULL;
 #endif
@@ -1996,7 +2010,7 @@ static __latent_entropy struct task_struct *copy_process(
 	 */
 	retval = cgroup_can_fork(p);
 	if (retval)
-		goto bad_fork_free_pid;
+		goto bad_fork_abort_cgroup;
 
 	/*
 	 * From this point on we must avoid any synchronous user-space
@@ -2009,6 +2023,9 @@ static __latent_entropy struct task_struct *copy_process(
 	p->start_time = ktime_get_ns();
 	p->real_start_time = ktime_get_boot_ns();
 
+	if (clone_flags & CLONE_PIDFD)
+		proc_pid_dentry_commit_lock(&proc_pid_info);
+
 	/*
 	 * Make it visible to the rest of the system, but dont wake it up yet.
 	 * Need tasklist lock for parent etc handling!
@@ -2091,12 +2108,16 @@ static __latent_entropy struct task_struct *copy_process(
 		attach_pid(p, PIDTYPE_PID);
 		nr_threads++;
 	}
+
 	total_forks++;
 	hlist_del_init(&delayed.node);
 	spin_unlock(&current->sighand->siglock);
 	syscall_tracepoint_update(p);
 	write_unlock_irq(&tasklist_lock);
 
+	if (clone_flags & CLONE_PIDFD)
+		*pidfd = proc_pid_dentry_commit_unlock(&proc_pid_info);
+
 	proc_fork_connector(p);
 	cgroup_post_fork(p);
 	cgroup_threadgroup_change_end(current);
@@ -2111,8 +2132,11 @@ static __latent_entropy struct task_struct *copy_process(
 	spin_unlock(&current->sighand->siglock);
 	write_unlock_irq(&tasklist_lock);
 	cgroup_cancel_fork(p);
-bad_fork_free_pid:
+bad_fork_abort_cgroup:
 	cgroup_threadgroup_change_end(current);
+	if (clone_flags & CLONE_PIDFD)
+		proc_pid_dentry_abort(&proc_pid_info);
+bad_fork_free_pid:
 	if (pid != &init_struct_pid)
 		free_pid(pid);
 bad_fork_cleanup_thread:
@@ -2177,7 +2201,7 @@ struct task_struct *fork_idle(int cpu)
 {
 	struct task_struct *task;
 	task = copy_process(CLONE_VM, 0, 0, NULL, &init_struct_pid, 0, 0,
-			    cpu_to_node(cpu));
+			    cpu_to_node(cpu), NULL);
 	if (!IS_ERR(task)) {
 		init_idle_pids(task);
 		init_idle(task, cpu);
@@ -2202,7 +2226,7 @@ long _do_fork(unsigned long clone_flags,
 	struct completion vfork;
 	struct pid *pid;
 	struct task_struct *p;
-	int trace = 0;
+	int trace = 0, pidfd;
 	long nr;
 
 	/*
@@ -2224,7 +2248,7 @@ long _do_fork(unsigned long clone_flags,
 	}
 
 	p = copy_process(clone_flags, stack_start, stack_size,
-			 child_tidptr, NULL, trace, tls, NUMA_NO_NODE);
+			 child_tidptr, NULL, trace, tls, NUMA_NO_NODE, &pidfd);
 	add_latent_entropy();
 
 	if (IS_ERR(p))
@@ -2260,6 +2284,10 @@ long _do_fork(unsigned long clone_flags,
 	}
 
 	put_pid(pid);
+
+	if (clone_flags & CLONE_PIDFD)
+		nr = pidfd;
+
 	return nr;
 }
 
-- 
2.21.0


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [RFC-2 PATCH 1/4] Make anon_inodes unconditional
  2019-04-10 23:40 [RFC PATCH] fork: add CLONE_PIDFD Christian Brauner
  2019-04-10 23:40 ` [RFC-1 PATCH 1/1] fork: add CLONE_PIDFD via /proc/<pid> Christian Brauner
@ 2019-04-10 23:40 ` Christian Brauner
  2019-04-10 23:40 ` [RFC-2 PATCH 2/4] fork: add CLONE_PIDFD via anonymous inode Christian Brauner
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 10+ messages in thread
From: Christian Brauner @ 2019-04-10 23:40 UTC (permalink / raw)
  To: torvalds, viro, jannh, dhowells, linux-api, linux-kernel
  Cc: serge, luto, arnd, ebiederm, keescook, adobriyan, tglx,
	mtk.manpages, bl0pbl33p, ldv, akpm, oleg, cyphar, joel, dancol,
	Christian Brauner

From: David Howells <dhowells@redhat.com>

Make the anon_inodes facility unconditional so that it can be used by core
VFS code and the pidfd_open() syscall.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
[christian@brauner.io: adapt commit message to mention pidfd_open()]
Signed-off-by: Christian Brauner <christian@brauner.io>
---
 arch/arm/kvm/Kconfig       |  1 -
 arch/arm64/kvm/Kconfig     |  1 -
 arch/mips/kvm/Kconfig      |  1 -
 arch/powerpc/kvm/Kconfig   |  1 -
 arch/s390/kvm/Kconfig      |  1 -
 arch/x86/Kconfig           |  1 -
 arch/x86/kvm/Kconfig       |  1 -
 drivers/base/Kconfig       |  1 -
 drivers/char/tpm/Kconfig   |  1 -
 drivers/dma-buf/Kconfig    |  1 -
 drivers/gpio/Kconfig       |  1 -
 drivers/iio/Kconfig        |  1 -
 drivers/infiniband/Kconfig |  1 -
 drivers/vfio/Kconfig       |  1 -
 fs/Makefile                |  2 +-
 fs/notify/fanotify/Kconfig |  1 -
 fs/notify/inotify/Kconfig  |  1 -
 init/Kconfig               | 10 ----------
 18 files changed, 1 insertion(+), 27 deletions(-)

diff --git a/arch/arm/kvm/Kconfig b/arch/arm/kvm/Kconfig
index 3f5320f46de2..f591026347a5 100644
--- a/arch/arm/kvm/Kconfig
+++ b/arch/arm/kvm/Kconfig
@@ -22,7 +22,6 @@ config KVM
 	bool "Kernel-based Virtual Machine (KVM) support"
 	depends on MMU && OF
 	select PREEMPT_NOTIFIERS
-	select ANON_INODES
 	select ARM_GIC
 	select ARM_GIC_V3
 	select ARM_GIC_V3_ITS
diff --git a/arch/arm64/kvm/Kconfig b/arch/arm64/kvm/Kconfig
index a3f85624313e..a67121d419a2 100644
--- a/arch/arm64/kvm/Kconfig
+++ b/arch/arm64/kvm/Kconfig
@@ -23,7 +23,6 @@ config KVM
 	depends on OF
 	select MMU_NOTIFIER
 	select PREEMPT_NOTIFIERS
-	select ANON_INODES
 	select HAVE_KVM_CPU_RELAX_INTERCEPT
 	select HAVE_KVM_ARCH_TLB_FLUSH_ALL
 	select KVM_MMIO
diff --git a/arch/mips/kvm/Kconfig b/arch/mips/kvm/Kconfig
index 4528bc9c3cb1..eac25aef21e0 100644
--- a/arch/mips/kvm/Kconfig
+++ b/arch/mips/kvm/Kconfig
@@ -21,7 +21,6 @@ config KVM
 	depends on MIPS_FP_SUPPORT
 	select EXPORT_UASM
 	select PREEMPT_NOTIFIERS
-	select ANON_INODES
 	select KVM_GENERIC_DIRTYLOG_READ_PROTECT
 	select HAVE_KVM_VCPU_ASYNC_IOCTL
 	select KVM_MMIO
diff --git a/arch/powerpc/kvm/Kconfig b/arch/powerpc/kvm/Kconfig
index bfdde04e4905..f53997a8ca62 100644
--- a/arch/powerpc/kvm/Kconfig
+++ b/arch/powerpc/kvm/Kconfig
@@ -20,7 +20,6 @@ if VIRTUALIZATION
 config KVM
 	bool
 	select PREEMPT_NOTIFIERS
-	select ANON_INODES
 	select HAVE_KVM_EVENTFD
 	select HAVE_KVM_VCPU_ASYNC_IOCTL
 	select SRCU
diff --git a/arch/s390/kvm/Kconfig b/arch/s390/kvm/Kconfig
index 767453faacfc..1816ee48eadd 100644
--- a/arch/s390/kvm/Kconfig
+++ b/arch/s390/kvm/Kconfig
@@ -21,7 +21,6 @@ config KVM
 	prompt "Kernel-based Virtual Machine (KVM) support"
 	depends on HAVE_KVM
 	select PREEMPT_NOTIFIERS
-	select ANON_INODES
 	select HAVE_KVM_CPU_RELAX_INTERCEPT
 	select HAVE_KVM_VCPU_ASYNC_IOCTL
 	select HAVE_KVM_EVENTFD
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 5ad92419be19..7a70fb58b2d0 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -44,7 +44,6 @@ config X86
 	#
 	select ACPI_LEGACY_TABLES_LOOKUP	if ACPI
 	select ACPI_SYSTEM_POWER_STATES_SUPPORT	if ACPI
-	select ANON_INODES
 	select ARCH_32BIT_OFF_T			if X86_32
 	select ARCH_CLOCKSOURCE_DATA
 	select ARCH_CLOCKSOURCE_INIT
diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
index 72fa955f4a15..fc042419e670 100644
--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -27,7 +27,6 @@ config KVM
 	depends on X86_LOCAL_APIC
 	select PREEMPT_NOTIFIERS
 	select MMU_NOTIFIER
-	select ANON_INODES
 	select HAVE_KVM_IRQCHIP
 	select HAVE_KVM_IRQFD
 	select IRQ_BYPASS_MANAGER
diff --git a/drivers/base/Kconfig b/drivers/base/Kconfig
index 059700ea3521..03f067da12ee 100644
--- a/drivers/base/Kconfig
+++ b/drivers/base/Kconfig
@@ -174,7 +174,6 @@ source "drivers/base/regmap/Kconfig"
 config DMA_SHARED_BUFFER
 	bool
 	default n
-	select ANON_INODES
 	select IRQ_WORK
 	help
 	  This option enables the framework for buffer-sharing between
diff --git a/drivers/char/tpm/Kconfig b/drivers/char/tpm/Kconfig
index 536e55d3919f..f3e4bc490cf0 100644
--- a/drivers/char/tpm/Kconfig
+++ b/drivers/char/tpm/Kconfig
@@ -157,7 +157,6 @@ config TCG_CRB
 config TCG_VTPM_PROXY
 	tristate "VTPM Proxy Interface"
 	depends on TCG_TPM
-	select ANON_INODES
 	---help---
 	  This driver proxies for an emulated TPM (vTPM) running in userspace.
 	  A device /dev/vtpmx is provided that creates a device pair
diff --git a/drivers/dma-buf/Kconfig b/drivers/dma-buf/Kconfig
index 2e5a0faa2cb1..3fc9c2efc583 100644
--- a/drivers/dma-buf/Kconfig
+++ b/drivers/dma-buf/Kconfig
@@ -3,7 +3,6 @@ menu "DMABUF options"
 config SYNC_FILE
 	bool "Explicit Synchronization Framework"
 	default n
-	select ANON_INODES
 	select DMA_SHARED_BUFFER
 	---help---
 	  The Sync File Framework adds explicit syncronization via
diff --git a/drivers/gpio/Kconfig b/drivers/gpio/Kconfig
index 3f50526a771f..0f91600c27ae 100644
--- a/drivers/gpio/Kconfig
+++ b/drivers/gpio/Kconfig
@@ -12,7 +12,6 @@ config ARCH_HAVE_CUSTOM_GPIO_H
 
 menuconfig GPIOLIB
 	bool "GPIO Support"
-	select ANON_INODES
 	help
 	  This enables GPIO support through the generic GPIO library.
 	  You only need to enable this, if you also want to enable
diff --git a/drivers/iio/Kconfig b/drivers/iio/Kconfig
index d08aeb41cd07..1dec0fecb6ef 100644
--- a/drivers/iio/Kconfig
+++ b/drivers/iio/Kconfig
@@ -4,7 +4,6 @@
 
 menuconfig IIO
 	tristate "Industrial I/O support"
-	select ANON_INODES
 	help
 	  The industrial I/O subsystem provides a unified framework for
 	  drivers for many different types of embedded sensors using a
diff --git a/drivers/infiniband/Kconfig b/drivers/infiniband/Kconfig
index a1fb840de45d..d318bab25860 100644
--- a/drivers/infiniband/Kconfig
+++ b/drivers/infiniband/Kconfig
@@ -25,7 +25,6 @@ config INFINIBAND_USER_MAD
 
 config INFINIBAND_USER_ACCESS
 	tristate "InfiniBand userspace access (verbs and CM)"
-	select ANON_INODES
 	depends on MMU
 	---help---
 	  Userspace InfiniBand access support.  This enables the
diff --git a/drivers/vfio/Kconfig b/drivers/vfio/Kconfig
index 9de5ed38da83..3798d77d131c 100644
--- a/drivers/vfio/Kconfig
+++ b/drivers/vfio/Kconfig
@@ -22,7 +22,6 @@ menuconfig VFIO
 	tristate "VFIO Non-Privileged userspace driver framework"
 	depends on IOMMU_API
 	select VFIO_IOMMU_TYPE1 if (X86 || S390 || ARM || ARM64)
-	select ANON_INODES
 	help
 	  VFIO provides a framework for secure userspace device drivers.
 	  See Documentation/vfio.txt for more details.
diff --git a/fs/Makefile b/fs/Makefile
index 427fec226fae..35945f8139e6 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -25,7 +25,7 @@ obj-$(CONFIG_PROC_FS) += proc_namespace.o
 
 obj-y				+= notify/
 obj-$(CONFIG_EPOLL)		+= eventpoll.o
-obj-$(CONFIG_ANON_INODES)	+= anon_inodes.o
+obj-y				+= anon_inodes.o
 obj-$(CONFIG_SIGNALFD)		+= signalfd.o
 obj-$(CONFIG_TIMERFD)		+= timerfd.o
 obj-$(CONFIG_EVENTFD)		+= eventfd.o
diff --git a/fs/notify/fanotify/Kconfig b/fs/notify/fanotify/Kconfig
index 735bfb2e9190..521dc91d2cb5 100644
--- a/fs/notify/fanotify/Kconfig
+++ b/fs/notify/fanotify/Kconfig
@@ -1,7 +1,6 @@
 config FANOTIFY
 	bool "Filesystem wide access notification"
 	select FSNOTIFY
-	select ANON_INODES
 	select EXPORTFS
 	default n
 	---help---
diff --git a/fs/notify/inotify/Kconfig b/fs/notify/inotify/Kconfig
index b981fc0c8379..0161c74e76e2 100644
--- a/fs/notify/inotify/Kconfig
+++ b/fs/notify/inotify/Kconfig
@@ -1,6 +1,5 @@
 config INOTIFY_USER
 	bool "Inotify support for userspace"
-	select ANON_INODES
 	select FSNOTIFY
 	default y
 	---help---
diff --git a/init/Kconfig b/init/Kconfig
index 4592bf7997c0..be8f97e37a76 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1171,9 +1171,6 @@ config LD_DEAD_CODE_DATA_ELIMINATION
 config SYSCTL
 	bool
 
-config ANON_INODES
-	bool
-
 config HAVE_UID16
 	bool
 
@@ -1378,14 +1375,12 @@ config HAVE_FUTEX_CMPXCHG
 config EPOLL
 	bool "Enable eventpoll support" if EXPERT
 	default y
-	select ANON_INODES
 	help
 	  Disabling this option will cause the kernel to be built without
 	  support for epoll family of system calls.
 
 config SIGNALFD
 	bool "Enable signalfd() system call" if EXPERT
-	select ANON_INODES
 	default y
 	help
 	  Enable the signalfd() system call that allows to receive signals
@@ -1395,7 +1390,6 @@ config SIGNALFD
 
 config TIMERFD
 	bool "Enable timerfd() system call" if EXPERT
-	select ANON_INODES
 	default y
 	help
 	  Enable the timerfd() system call that allows to receive timer
@@ -1405,7 +1399,6 @@ config TIMERFD
 
 config EVENTFD
 	bool "Enable eventfd() system call" if EXPERT
-	select ANON_INODES
 	default y
 	help
 	  Enable the eventfd() system call that allows to receive both
@@ -1516,7 +1509,6 @@ config KALLSYMS_BASE_RELATIVE
 # syscall, maps, verifier
 config BPF_SYSCALL
 	bool "Enable bpf() system call"
-	select ANON_INODES
 	select BPF
 	select IRQ_WORK
 	default n
@@ -1533,7 +1525,6 @@ config BPF_JIT_ALWAYS_ON
 
 config USERFAULTFD
 	bool "Enable userfaultfd() system call"
-	select ANON_INODES
 	depends on MMU
 	help
 	  Enable the userfaultfd() system call that allows to intercept and
@@ -1600,7 +1591,6 @@ config PERF_EVENTS
 	bool "Kernel performance events and counters"
 	default y if PROFILING
 	depends on HAVE_PERF_EVENTS
-	select ANON_INODES
 	select IRQ_WORK
 	select SRCU
 	help
-- 
2.21.0


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [RFC-2 PATCH 2/4] fork: add CLONE_PIDFD via anonymous inode
  2019-04-10 23:40 [RFC PATCH] fork: add CLONE_PIDFD Christian Brauner
  2019-04-10 23:40 ` [RFC-1 PATCH 1/1] fork: add CLONE_PIDFD via /proc/<pid> Christian Brauner
  2019-04-10 23:40 ` [RFC-2 PATCH 1/4] Make anon_inodes unconditional Christian Brauner
@ 2019-04-10 23:40 ` Christian Brauner
  2019-04-10 23:40 ` [RFC-2 PATCH 3/4] signal: support CLONE_PIDFD with pidfd_send_signal Christian Brauner
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 10+ messages in thread
From: Christian Brauner @ 2019-04-10 23:40 UTC (permalink / raw)
  To: torvalds, viro, jannh, dhowells, linux-api, linux-kernel
  Cc: serge, luto, arnd, ebiederm, keescook, adobriyan, tglx,
	mtk.manpages, bl0pbl33p, ldv, akpm, oleg, cyphar, joel, dancol,
	Christian Brauner

This patchset makes it possible to retrieve pid file descriptors at process
creation time by introducing a new flag CLONE_PIDFD. As spotted by Linus,
there is exactly one bit left.

In this version of CLONE_PIDFD anonymous inode file descriptors are used.
They serve as a simple opaque handle on pids. Logically, this makes it
possible to interpret a pidfd differently, narrowing or widening the scope
of various operations (e.g. signal sending). Thus, a pidfd cannot just
refer to a tgid, but also a tid, or in theory - given appropriate flag
arguments in relevant syscalls - a process group or session.

This patchset uses anonymous file descriptors instead of file descriptors
from /proc/<pid>.
A pidfd in this style comes with additional information in fdinfo: the pid
of the process it refers to in the current pid namespace (Pid: %d).

Even though originally file descriptors to /proc/<pid> were preferred we
discovered the associated complexity while implementing this solution which
prompted us to implement an alternative and put it up for debate. We have
chosen to implement this alternative to illustrate how strikingly simple
this patchset is in comparision to the original approach.
To remove worries about missing metadata access we have written a POC that
illustrates how a combination of CLONE_PIDFD, fdinfo, and
pidfd_send_signal() can be used to gain race-free access to process
metadata through /proc/<pid>. The sample program can easily be translated
into a helper that would be suitable for inclusion in libc so that users
don't have to worry about writing it themselves.

We hope that this ultimately will be the approach the community prefers.

Signed-off-by: Christian Brauner <christian@brauner.io>
Signed-off-by: Jann Horn <jann@thejh.net>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: David Howells <dhowells@redhat.com>
Cc: "Michael Kerrisk (man-pages)" <mtk.manpages@gmail.com>
Cc: Jonathan Kowalski <bl0pbl33p@gmail.com>
Cc: "Dmitry V. Levin" <ldv@altlinux.org>
Cc: Andy Lutomirsky <luto@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Aleksa Sarai <cyphar@cyphar.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
---
 include/linux/pid.h        |  2 +
 include/uapi/linux/sched.h |  1 +
 kernel/fork.c              | 94 ++++++++++++++++++++++++++++++++++++--
 3 files changed, 92 insertions(+), 5 deletions(-)

diff --git a/include/linux/pid.h b/include/linux/pid.h
index b6f4ba16065a..3c8ef5a199ca 100644
--- a/include/linux/pid.h
+++ b/include/linux/pid.h
@@ -66,6 +66,8 @@ struct pid
 
 extern struct pid init_struct_pid;
 
+extern const struct file_operations pidfd_fops;
+
 static inline struct pid *get_pid(struct pid *pid)
 {
 	if (pid)
diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
index 22627f80063e..cd9bd14ce56d 100644
--- a/include/uapi/linux/sched.h
+++ b/include/uapi/linux/sched.h
@@ -10,6 +10,7 @@
 #define CLONE_FS	0x00000200	/* set if fs info shared between processes */
 #define CLONE_FILES	0x00000400	/* set if open files shared between processes */
 #define CLONE_SIGHAND	0x00000800	/* set if signal handlers and blocked signals shared */
+#define CLONE_PIDFD	0x00001000	/* create new pid file descriptor */
 #define CLONE_PTRACE	0x00002000	/* set if we want to let tracing continue on the child too */
 #define CLONE_VFORK	0x00004000	/* set if the parent wants the child to wake it up on mm_release */
 #define CLONE_PARENT	0x00008000	/* set if we want to have the same parent as the cloner */
diff --git a/kernel/fork.c b/kernel/fork.c
index 9dcd18aa210b..5716ea8c32e5 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -11,6 +11,7 @@
  * management can be a bitch. See 'mm/memory.c': 'copy_page_range()'
  */
 
+#include <linux/anon_inodes.h>
 #include <linux/slab.h>
 #include <linux/sched/autogroup.h>
 #include <linux/sched/mm.h>
@@ -21,8 +22,10 @@
 #include <linux/sched/task.h>
 #include <linux/sched/task_stack.h>
 #include <linux/sched/cputime.h>
+#include <linux/seq_file.h>
 #include <linux/rtmutex.h>
 #include <linux/init.h>
+#include <linux/fsnotify.h>
 #include <linux/unistd.h>
 #include <linux/module.h>
 #include <linux/vmalloc.h>
@@ -1662,6 +1665,64 @@ static inline void rcu_copy_process(struct task_struct *p)
 #endif /* #ifdef CONFIG_TASKS_RCU */
 }
 
+static int pidfd_release(struct inode *inode, struct file *file)
+{
+	struct pid *pid = file->private_data;
+	file->private_data = NULL;
+	put_pid(pid);
+	return 0;
+}
+
+#ifdef CONFIG_PROC_FS
+static void pidfd_show_fdinfo(struct seq_file *m, struct file *f)
+{
+	struct pid_namespace *ns = proc_pid_ns(file_inode(m->file));
+	struct pid *pid = f->private_data;
+
+	seq_put_decimal_ull(m, "Pid:\t", pid_nr_ns(pid, ns));
+	seq_putc(m, '\n');
+}
+#endif
+
+const struct file_operations pidfd_fops = {
+	.release = pidfd_release,
+#ifdef CONFIG_PROC_FS
+	.show_fdinfo = pidfd_show_fdinfo,
+#endif
+};
+
+static int pidfd_create_cloexec(struct pid *pid, struct file **file)
+{
+	unsigned int flags = O_RDWR | O_CLOEXEC;
+	int error, fd;
+	struct file *f;
+
+	error = __alloc_fd(current->files, 1, rlimit(RLIMIT_NOFILE), flags);
+	if (error < 0)
+		return error;
+	fd = error;
+
+	f = anon_inode_getfile("pidfd", &pidfd_fops, get_pid(pid), flags);
+	if (IS_ERR(f)) {
+		put_pid(pid);
+		error = PTR_ERR(f);
+		goto err_put_unused_fd;
+	}
+
+	*file = f;
+	return fd;
+
+err_put_unused_fd:
+	put_unused_fd(fd);
+	return error;
+}
+
+static inline void pidfd_put_cloexec(struct pid *pid, int fd, struct file *file)
+{
+        put_unused_fd(fd);
+	fput(file);
+}
+
 /*
  * This creates a new process as a copy of the old one,
  * but does not actually start it yet.
@@ -1678,11 +1739,12 @@ static __latent_entropy struct task_struct *copy_process(
 					struct pid *pid,
 					int trace,
 					unsigned long tls,
-					int node)
+					int node, int *pidfd)
 {
 	int retval;
 	struct task_struct *p;
 	struct multiprocess_signals delayed;
+	struct file *pidfdf = NULL;
 
 	/*
 	 * Don't allow sharing the root directory with processes in a different
@@ -1936,6 +1998,18 @@ static __latent_entropy struct task_struct *copy_process(
 		}
 	}
 
+	/*
+	 * This has to happen after we've potentially unshared the file
+	 * descriptor table (so that the pidfd doesn't leak into the child if
+	 * the fd table isn't shared).
+	 */
+	if (clone_flags & CLONE_PIDFD) {
+		retval = pidfd_create_cloexec(pid, &pidfdf);
+		if (retval < 0)
+			goto bad_fork_free_pid;
+		*pidfd = retval;
+	}
+
 #ifdef CONFIG_BLOCK
 	p->plug = NULL;
 #endif
@@ -1996,7 +2070,7 @@ static __latent_entropy struct task_struct *copy_process(
 	 */
 	retval = cgroup_can_fork(p);
 	if (retval)
-		goto bad_fork_free_pid;
+		goto bad_fork_put_pidfd;
 
 	/*
 	 * From this point on we must avoid any synchronous user-space
@@ -2097,6 +2171,9 @@ static __latent_entropy struct task_struct *copy_process(
 	syscall_tracepoint_update(p);
 	write_unlock_irq(&tasklist_lock);
 
+	if (clone_flags & CLONE_PIDFD)
+		fd_install(*pidfd, pidfdf);
+
 	proc_fork_connector(p);
 	cgroup_post_fork(p);
 	cgroup_threadgroup_change_end(current);
@@ -2111,6 +2188,9 @@ static __latent_entropy struct task_struct *copy_process(
 	spin_unlock(&current->sighand->siglock);
 	write_unlock_irq(&tasklist_lock);
 	cgroup_cancel_fork(p);
+bad_fork_put_pidfd:
+	if (clone_flags & CLONE_PIDFD)
+		pidfd_put_cloexec(pid, *pidfd, pidfdf);
 bad_fork_free_pid:
 	cgroup_threadgroup_change_end(current);
 	if (pid != &init_struct_pid)
@@ -2177,7 +2257,7 @@ struct task_struct *fork_idle(int cpu)
 {
 	struct task_struct *task;
 	task = copy_process(CLONE_VM, 0, 0, NULL, &init_struct_pid, 0, 0,
-			    cpu_to_node(cpu));
+			    cpu_to_node(cpu), NULL);
 	if (!IS_ERR(task)) {
 		init_idle_pids(task);
 		init_idle(task, cpu);
@@ -2202,7 +2282,7 @@ long _do_fork(unsigned long clone_flags,
 	struct completion vfork;
 	struct pid *pid;
 	struct task_struct *p;
-	int trace = 0;
+	int pidfd, trace = 0;
 	long nr;
 
 	/*
@@ -2224,7 +2304,7 @@ long _do_fork(unsigned long clone_flags,
 	}
 
 	p = copy_process(clone_flags, stack_start, stack_size,
-			 child_tidptr, NULL, trace, tls, NUMA_NO_NODE);
+			 child_tidptr, NULL, trace, tls, NUMA_NO_NODE, &pidfd);
 	add_latent_entropy();
 
 	if (IS_ERR(p))
@@ -2260,6 +2340,10 @@ long _do_fork(unsigned long clone_flags,
 	}
 
 	put_pid(pid);
+
+	if (clone_flags & CLONE_PIDFD)
+		nr = pidfd;
+
 	return nr;
 }
 
-- 
2.21.0


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [RFC-2 PATCH 3/4] signal: support CLONE_PIDFD with pidfd_send_signal
  2019-04-10 23:40 [RFC PATCH] fork: add CLONE_PIDFD Christian Brauner
                   ` (2 preceding siblings ...)
  2019-04-10 23:40 ` [RFC-2 PATCH 2/4] fork: add CLONE_PIDFD via anonymous inode Christian Brauner
@ 2019-04-10 23:40 ` Christian Brauner
  2019-04-10 23:40 ` [RFC-2 PATCH 4/4] samples: show race-free pidfd metadata access Christian Brauner
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 10+ messages in thread
From: Christian Brauner @ 2019-04-10 23:40 UTC (permalink / raw)
  To: torvalds, viro, jannh, dhowells, linux-api, linux-kernel
  Cc: serge, luto, arnd, ebiederm, keescook, adobriyan, tglx,
	mtk.manpages, bl0pbl33p, ldv, akpm, oleg, cyphar, joel, dancol,
	Christian Brauner

Let pidfd_send_signal() use pidfds retrieved via CLONE_PIDFD. With this
patch pidfd_send_signal() becomes independent of procfs. This fullfils the
request made when we merged the pidfd_send_signal() patchset. The
pidfd_send_signal() syscall is now always available allowing for it to be
used by users without procfs mounted or even users without procfs support
compiled into the kernel.

Signed-off-by: Christian Brauner <christian@brauner.io>
Signed-off-by: Jann Horn <jann@thejh.net>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: David Howells <dhowells@redhat.com>
Cc: "Michael Kerrisk (man-pages)" <mtk.manpages@gmail.com>
Cc: Jonathan Kowalski <bl0pbl33p@gmail.com>
Cc: "Dmitry V. Levin" <ldv@altlinux.org>
Cc: Andy Lutomirsky <luto@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Aleksa Sarai <cyphar@cyphar.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
---
 kernel/signal.c | 14 ++++++++++----
 kernel/sys_ni.c |  3 ---
 2 files changed, 10 insertions(+), 7 deletions(-)

diff --git a/kernel/signal.c b/kernel/signal.c
index f98448cf2def..cd83cc376767 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -3513,7 +3513,6 @@ SYSCALL_DEFINE2(kill, pid_t, pid, int, sig)
 	return kill_something_info(sig, &info, pid);
 }
 
-#ifdef CONFIG_PROC_FS
 /*
  * Verify that the signaler and signalee either are in the same pid namespace
  * or that the signaler's pid namespace is an ancestor of the signalee's pid
@@ -3550,6 +3549,14 @@ static int copy_siginfo_from_user_any(kernel_siginfo_t *kinfo, siginfo_t *info)
 	return copy_siginfo_from_user(kinfo, info);
 }
 
+static struct pid *pidfd_to_pid(const struct file *file)
+{
+	if (file->f_op == &pidfd_fops)
+		return file->private_data;
+
+	return tgid_pidfd_to_pid(file);
+}
+
 /**
  * sys_pidfd_send_signal - send a signal to a process through a task file
  *                          descriptor
@@ -3581,12 +3588,12 @@ SYSCALL_DEFINE4(pidfd_send_signal, int, pidfd, int, sig,
 	if (flags)
 		return -EINVAL;
 
-	f = fdget_raw(pidfd);
+	f = fdget(pidfd);
 	if (!f.file)
 		return -EBADF;
 
 	/* Is this a pidfd? */
-	pid = tgid_pidfd_to_pid(f.file);
+	pid = pidfd_to_pid(f.file);
 	if (IS_ERR(pid)) {
 		ret = PTR_ERR(pid);
 		goto err;
@@ -3620,7 +3627,6 @@ SYSCALL_DEFINE4(pidfd_send_signal, int, pidfd, int, sig,
 	fdput(f);
 	return ret;
 }
-#endif /* CONFIG_PROC_FS */
 
 static int
 do_send_specific(pid_t tgid, pid_t pid, int sig, struct kernel_siginfo *info)
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index d21f4befaea4..4d9ae5ea6caf 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -167,9 +167,6 @@ COND_SYSCALL(syslog);
 
 /* kernel/sched/core.c */
 
-/* kernel/signal.c */
-COND_SYSCALL(pidfd_send_signal);
-
 /* kernel/sys.c */
 COND_SYSCALL(setregid);
 COND_SYSCALL(setgid);
-- 
2.21.0


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [RFC-2 PATCH 4/4] samples: show race-free pidfd metadata access
  2019-04-10 23:40 [RFC PATCH] fork: add CLONE_PIDFD Christian Brauner
                   ` (3 preceding siblings ...)
  2019-04-10 23:40 ` [RFC-2 PATCH 3/4] signal: support CLONE_PIDFD with pidfd_send_signal Christian Brauner
@ 2019-04-10 23:40 ` Christian Brauner
  2019-04-11  0:08   ` Daniel Colascione
  2019-04-11  0:12 ` [RFC PATCH] fork: add CLONE_PIDFD Daniel Colascione
  2019-04-11 16:50 ` Linus Torvalds
  6 siblings, 1 reply; 10+ messages in thread
From: Christian Brauner @ 2019-04-10 23:40 UTC (permalink / raw)
  To: torvalds, viro, jannh, dhowells, linux-api, linux-kernel
  Cc: serge, luto, arnd, ebiederm, keescook, adobriyan, tglx,
	mtk.manpages, bl0pbl33p, ldv, akpm, oleg, cyphar, joel, dancol,
	Christian Brauner

This is an sample program to show userspace how to get race-free access to
process metadata from a pidfd.
It is really not that difficult and instead of burdening the kernel with
this task by using fds to /proc/<pid> we can simply add a helper to libc
that does it for the user.

Signed-off-by: Christian Brauner <christian@brauner.io>
Signed-off-by: Jann Horn <jann@thejh.net>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: David Howells <dhowells@redhat.com>
Cc: "Michael Kerrisk (man-pages)" <mtk.manpages@gmail.com>
Cc: Jonathan Kowalski <bl0pbl33p@gmail.com>
Cc: "Dmitry V. Levin" <ldv@altlinux.org>
Cc: Andy Lutomirsky <luto@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Aleksa Sarai <cyphar@cyphar.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
---
 samples/Makefile               |   2 +-
 samples/pidfd/Makefile         |   6 ++
 samples/pidfd/pidfd-metadata.c | 169 +++++++++++++++++++++++++++++++++
 3 files changed, 176 insertions(+), 1 deletion(-)
 create mode 100644 samples/pidfd/Makefile
 create mode 100644 samples/pidfd/pidfd-metadata.c

diff --git a/samples/Makefile b/samples/Makefile
index b1142a958811..fadadb1c3b05 100644
--- a/samples/Makefile
+++ b/samples/Makefile
@@ -3,4 +3,4 @@
 obj-$(CONFIG_SAMPLES)	+= kobject/ kprobes/ trace_events/ livepatch/ \
 			   hw_breakpoint/ kfifo/ kdb/ hidraw/ rpmsg/ seccomp/ \
 			   configfs/ connector/ v4l/ trace_printk/ \
-			   vfio-mdev/ statx/ qmi/ binderfs/
+			   vfio-mdev/ statx/ qmi/ binderfs/ pidfd/
diff --git a/samples/pidfd/Makefile b/samples/pidfd/Makefile
new file mode 100644
index 000000000000..0ff97784177a
--- /dev/null
+++ b/samples/pidfd/Makefile
@@ -0,0 +1,6 @@
+# SPDX-License-Identifier: GPL-2.0
+
+hostprogs-y := pidfd-metadata
+always := $(hostprogs-y)
+HOSTCFLAGS_pidfd-metadata.o += -I$(objtree)/usr/include
+all: pidfd-metadata
diff --git a/samples/pidfd/pidfd-metadata.c b/samples/pidfd/pidfd-metadata.c
new file mode 100644
index 000000000000..c46c6c34a012
--- /dev/null
+++ b/samples/pidfd/pidfd-metadata.c
@@ -0,0 +1,169 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#define _GNU_SOURCE
+#include <err.h>
+#include <errno.h>
+#include <fcntl.h>
+#include <inttypes.h>
+#include <limits.h>
+#include <sched.h>
+#include <signal.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/stat.h>
+#include <sys/syscall.h>
+#include <sys/types.h>
+#include <sys/wait.h>
+#include <unistd.h>
+
+#ifndef CLONE_PIDFD
+#define CLONE_PIDFD 0x00001000
+#endif
+
+static int raw_clone_pidfd(void)
+{
+	unsigned long flags = CLONE_PIDFD;
+
+#if defined(__s390x__) || defined(__s390__) || defined(__CRIS__)
+	/* On s390/s390x and cris the order of the first and second arguments
+         * of the system call is reversed.
+         */
+	return (int)syscall(__NR_clone, NULL, flags | SIGCHLD);
+#elif defined(__sparc__) && defined(__arch64__)
+	{
+		/*
+                 * sparc64 always returns the other process id in %o0, and a
+                 * boolean flag whether this is the child or the parent in %o1.
+                 * Inline assembly is needed to get the flag returned in %o1.
+                 */
+		int in_child;
+		int child_pid;
+		asm volatile("mov %2, %%g1\n\t"
+			     "mov %3, %%o0\n\t"
+			     "mov 0 , %%o1\n\t"
+			     "t 0x6d\n\t"
+			     "mov %%o1, %0\n\t"
+			     "mov %%o0, %1"
+			     : "=r"(in_child), "=r"(child_pid)
+			     : "i"(__NR_clone), "r"(flags | SIGCHLD)
+			     : "%o1", "%o0", "%g1");
+
+		if (in_child)
+			return 0;
+		else
+			return child_pid;
+	}
+#elif defined(__ia64__)
+	/* On ia64 the stack and stack size are passed as separate arguments. */
+	return (int)syscall(__NR_clone, flags | SIGCHLD, NULL, prctl_arg(0));
+#else
+	return (int)syscall(__NR_clone, flags | SIGCHLD, NULL);
+#endif
+}
+
+static inline int sys_pidfd_send_signal(int pidfd, int sig, siginfo_t *info,
+					unsigned int flags)
+{
+	return syscall(__NR_pidfd_send_signal, pidfd, sig, info, flags);
+}
+
+static int pidfd_metadata_fd(int pidfd)
+{
+	int procfd, ret;
+	char path[100];
+	FILE *f;
+	size_t n = 0;
+	char *line = NULL;
+
+	snprintf(path, sizeof(path), "/proc/self/fdinfo/%d", pidfd);
+
+	f = fopen(path, "re");
+	if (!f)
+		return -1;
+
+	ret = 0;
+	while (getline(&line, &n, f) != -1) {
+		char *numstr;
+		size_t len;
+
+		if (strncmp(line, "Pid:\t", 5))
+			continue;
+
+		numstr = line + 5;
+		len = strlen(numstr);
+		if (len > 0 && numstr[len - 1] == '\n')
+			numstr[len - 1] = '\0';
+		ret = snprintf(path, sizeof(path), "/proc/%s", numstr);
+		break;
+	}
+	free(line);
+	fclose(f);
+
+	if (!ret) {
+		errno = ENOENT;
+		warn("Failed to parse pid from fdinfo\n");
+		return -1;
+	}
+
+	procfd = open(path, O_DIRECTORY | O_RDONLY | O_CLOEXEC);
+	if (procfd < 0) {
+		warn("Failed to open %s\n", path);
+		return -1;
+	}
+
+	/*
+	 * Verify that the pid has not been recycled and our /proc/<pid> handle
+	 * is still valid.
+	 */
+	if (sys_pidfd_send_signal(pidfd, 0, NULL, 0) < 0) {
+		/* process does not exist */
+		if (errno == ESRCH) {
+			warn("The pid was recycled\n");
+			close(procfd);
+			return -1;
+		}
+
+		/* just not allowed to signal it */
+	}
+
+	return procfd;
+}
+
+int main(int argc, char *argv[])
+{
+	int procfd, ret = EXIT_FAILURE;
+	ssize_t bytes;
+	char buf[4096] = { 0 };
+
+	int pidfd = raw_clone_pidfd();
+	if (pidfd < 0)
+		return -1;
+
+	if (pidfd == 0) {
+		printf("%d\n", getpid());
+		exit(EXIT_SUCCESS);
+	}
+
+	procfd = pidfd_metadata_fd(pidfd);
+	close(pidfd);
+	if (procfd < 0)
+		goto out;
+
+	int statusfd = openat(procfd, "status", O_RDONLY | O_CLOEXEC);
+	close(procfd);
+	if (statusfd < 0)
+		goto out;
+
+	bytes = read(statusfd, buf, sizeof(buf));
+	if (bytes > 0)
+		bytes = write(STDOUT_FILENO, buf, bytes);
+	close(statusfd);
+
+out:
+	(void)wait(NULL);
+	if (bytes < 0 || ret)
+		exit(EXIT_FAILURE);
+
+	exit(EXIT_SUCCESS);
+}
-- 
2.21.0


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [RFC-2 PATCH 4/4] samples: show race-free pidfd metadata access
  2019-04-10 23:40 ` [RFC-2 PATCH 4/4] samples: show race-free pidfd metadata access Christian Brauner
@ 2019-04-11  0:08   ` Daniel Colascione
  0 siblings, 0 replies; 10+ messages in thread
From: Daniel Colascione @ 2019-04-11  0:08 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Linus Torvalds, Al Viro, Jann Horn, David Howells, Linux API,
	linux-kernel, Serge E. Hallyn, Andy Lutomirski, Arnd Bergmann,
	Eric W. Biederman, Kees Cook, Alexey Dobriyan, Thomas Gleixner,
	Michael Kerrisk-manpages, Jonathan Kowalski, Dmitry V. Levin,
	Andrew Morton, Oleg Nesterov, Aleksa Sarai, Joel Fernandes,
	Daniel Colascione

Thanks for providing this example. A new nits below.

On Wed, Apr 10, 2019 at 4:43 PM Christian Brauner <christian@brauner.io> wrote:
>
> This is an sample program to show userspace how to get race-free access to
> process metadata from a pidfd.
> It is really not that difficult and instead of burdening the kernel with
> this task by using fds to /proc/<pid> we can simply add a helper to libc
> that does it for the user.
>
> Signed-off-by: Christian Brauner <christian@brauner.io>
> Signed-off-by: Jann Horn <jann@thejh.net>
> Cc: Arnd Bergmann <arnd@arndb.de>
> Cc: "Eric W. Biederman" <ebiederm@xmission.com>
> Cc: Kees Cook <keescook@chromium.org>
> Cc: Alexey Dobriyan <adobriyan@gmail.com>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: David Howells <dhowells@redhat.com>
> Cc: "Michael Kerrisk (man-pages)" <mtk.manpages@gmail.com>
> Cc: Jonathan Kowalski <bl0pbl33p@gmail.com>
> Cc: "Dmitry V. Levin" <ldv@altlinux.org>
> Cc: Andy Lutomirsky <luto@kernel.org>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Oleg Nesterov <oleg@redhat.com>
> Cc: Aleksa Sarai <cyphar@cyphar.com>
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> Cc: Al Viro <viro@zeniv.linux.org.uk>
> ---
>  samples/Makefile               |   2 +-
>  samples/pidfd/Makefile         |   6 ++
>  samples/pidfd/pidfd-metadata.c | 169 +++++++++++++++++++++++++++++++++
>  3 files changed, 176 insertions(+), 1 deletion(-)
>  create mode 100644 samples/pidfd/Makefile
>  create mode 100644 samples/pidfd/pidfd-metadata.c
>
> diff --git a/samples/Makefile b/samples/Makefile
> index b1142a958811..fadadb1c3b05 100644
> --- a/samples/Makefile
> +++ b/samples/Makefile
> @@ -3,4 +3,4 @@
>  obj-$(CONFIG_SAMPLES)  += kobject/ kprobes/ trace_events/ livepatch/ \
>                            hw_breakpoint/ kfifo/ kdb/ hidraw/ rpmsg/ seccomp/ \
>                            configfs/ connector/ v4l/ trace_printk/ \
> -                          vfio-mdev/ statx/ qmi/ binderfs/
> +                          vfio-mdev/ statx/ qmi/ binderfs/ pidfd/
> diff --git a/samples/pidfd/Makefile b/samples/pidfd/Makefile
> new file mode 100644
> index 000000000000..0ff97784177a
> --- /dev/null
> +++ b/samples/pidfd/Makefile
> @@ -0,0 +1,6 @@
> +# SPDX-License-Identifier: GPL-2.0
> +
> +hostprogs-y := pidfd-metadata
> +always := $(hostprogs-y)
> +HOSTCFLAGS_pidfd-metadata.o += -I$(objtree)/usr/include
> +all: pidfd-metadata
> diff --git a/samples/pidfd/pidfd-metadata.c b/samples/pidfd/pidfd-metadata.c
> new file mode 100644
> index 000000000000..c46c6c34a012
> --- /dev/null
> +++ b/samples/pidfd/pidfd-metadata.c
> @@ -0,0 +1,169 @@
> +// SPDX-License-Identifier: GPL-2.0
> +
> +#define _GNU_SOURCE
> +#include <err.h>
> +#include <errno.h>
> +#include <fcntl.h>
> +#include <inttypes.h>
> +#include <limits.h>
> +#include <sched.h>
> +#include <signal.h>
> +#include <stdio.h>
> +#include <stdlib.h>
> +#include <string.h>
> +#include <sys/stat.h>
> +#include <sys/syscall.h>
> +#include <sys/types.h>
> +#include <sys/wait.h>
> +#include <unistd.h>
> +
> +#ifndef CLONE_PIDFD
> +#define CLONE_PIDFD 0x00001000
> +#endif
> +
> +static int raw_clone_pidfd(void)
> +{
> +       unsigned long flags = CLONE_PIDFD;
> +
> +#if defined(__s390x__) || defined(__s390__) || defined(__CRIS__)
> +       /* On s390/s390x and cris the order of the first and second arguments
> +         * of the system call is reversed.
> +         */
> +       return (int)syscall(__NR_clone, NULL, flags | SIGCHLD);
> +#elif defined(__sparc__) && defined(__arch64__)
> +       {
> +               /*
> +                 * sparc64 always returns the other process id in %o0, and a
> +                 * boolean flag whether this is the child or the parent in %o1.
> +                 * Inline assembly is needed to get the flag returned in %o1.
> +                 */
> +               int in_child;
> +               int child_pid;
> +               asm volatile("mov %2, %%g1\n\t"
> +                            "mov %3, %%o0\n\t"
> +                            "mov 0 , %%o1\n\t"
> +                            "t 0x6d\n\t"
> +                            "mov %%o1, %0\n\t"
> +                            "mov %%o0, %1"
> +                            : "=r"(in_child), "=r"(child_pid)
> +                            : "i"(__NR_clone), "r"(flags | SIGCHLD)
> +                            : "%o1", "%o0", "%g1");
> +
> +               if (in_child)
> +                       return 0;
> +               else
> +                       return child_pid;
> +       }
> +#elif defined(__ia64__)
> +       /* On ia64 the stack and stack size are passed as separate arguments. */
> +       return (int)syscall(__NR_clone, flags | SIGCHLD, NULL, prctl_arg(0));
> +#else
> +       return (int)syscall(__NR_clone, flags | SIGCHLD, NULL);
> +#endif
> +}
> +
> +static inline int sys_pidfd_send_signal(int pidfd, int sig, siginfo_t *info,
> +                                       unsigned int flags)
> +{
> +       return syscall(__NR_pidfd_send_signal, pidfd, sig, info, flags);
> +}
> +
> +static int pidfd_metadata_fd(int pidfd)
> +{
> +       int procfd, ret;
> +       char path[100];
> +       FILE *f;
> +       size_t n = 0;
> +       char *line = NULL;
> +
> +       snprintf(path, sizeof(path), "/proc/self/fdinfo/%d", pidfd);
> +
> +       f = fopen(path, "re");
> +       if (!f)
> +               return -1;
> +
> +       ret = 0;
> +       while (getline(&line, &n, f) != -1) {
> +               char *numstr;
> +               size_t len;
> +
> +               if (strncmp(line, "Pid:\t", 5))
> +                       continue;
> +
> +               numstr = line + 5;
> +               len = strlen(numstr);
> +               if (len > 0 && numstr[len - 1] == '\n')
> +                       numstr[len - 1] = '\0';
> +               ret = snprintf(path, sizeof(path), "/proc/%s", numstr);
> +               break;
> +       }
> +       free(line);
> +       fclose(f);
> +
> +       if (!ret) {
> +               errno = ENOENT;
> +               warn("Failed to parse pid from fdinfo\n");
> +               return -1;
> +       }
> +
> +       procfd = open(path, O_DIRECTORY | O_RDONLY | O_CLOEXEC);
> +       if (procfd < 0) {
> +               warn("Failed to open %s\n", path);
> +               return -1;
> +       }
> +
> +       /*
> +        * Verify that the pid has not been recycled and our /proc/<pid> handle
> +        * is still valid.
> +        */
> +       if (sys_pidfd_send_signal(pidfd, 0, NULL, 0) < 0) {
> +               /* process does not exist */
> +               if (errno == ESRCH) {
> +                       warn("The pid was recycled\n");

ITYM that the process was reaped.

> +                       close(procfd);
> +                       return -1;
> +               }
> +
> +               /* just not allowed to signal it */

I'd look for EPERM specifically instead of just assuming that any
error indicates that a permission failure. I'd also explicitly state
that EPERM still implies process existence.

> +       }
> +
> +       return procfd;
> +}
> +
> +int main(int argc, char *argv[])
> +{
> +       int procfd, ret = EXIT_FAILURE;
> +       ssize_t bytes;
> +       char buf[4096] = { 0 };
> +
> +       int pidfd = raw_clone_pidfd();
> +       if (pidfd < 0)
> +               return -1;
> +
> +       if (pidfd == 0) {
> +               printf("%d\n", getpid());
> +               exit(EXIT_SUCCESS);
> +       }
> +
> +       procfd = pidfd_metadata_fd(pidfd);
> +       close(pidfd);
> +       if (procfd < 0)
> +               goto out;
> +
> +       int statusfd = openat(procfd, "status", O_RDONLY | O_CLOEXEC);
> +       close(procfd);
> +       if (statusfd < 0)
> +               goto out;
> +
> +       bytes = read(statusfd, buf, sizeof(buf));
> +       if (bytes > 0)
> +               bytes = write(STDOUT_FILENO, buf, bytes);
> +       close(statusfd);
> +
> +out:
> +       (void)wait(NULL);
> +       if (bytes < 0 || ret)
> +               exit(EXIT_FAILURE);
> +
> +       exit(EXIT_SUCCESS);
> +}
> --
> 2.21.0
>

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [RFC PATCH] fork: add CLONE_PIDFD
  2019-04-10 23:40 [RFC PATCH] fork: add CLONE_PIDFD Christian Brauner
                   ` (4 preceding siblings ...)
  2019-04-10 23:40 ` [RFC-2 PATCH 4/4] samples: show race-free pidfd metadata access Christian Brauner
@ 2019-04-11  0:12 ` Daniel Colascione
  2019-04-11 16:50 ` Linus Torvalds
  6 siblings, 0 replies; 10+ messages in thread
From: Daniel Colascione @ 2019-04-11  0:12 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Linus Torvalds, Al Viro, Jann Horn, David Howells, Linux API,
	linux-kernel, Serge E. Hallyn, Andy Lutomirski, Arnd Bergmann,
	Eric W. Biederman, Kees Cook, Alexey Dobriyan, Thomas Gleixner,
	Michael Kerrisk-manpages, Jonathan Kowalski, Dmitry V. Levin,
	Andrew Morton, Oleg Nesterov, Aleksa Sarai, Joel Fernandes,
	Daniel Colascione

Thanks for trying it both ways.

On Wed, Apr 10, 2019 at 4:43 PM Christian Brauner <christian@brauner.io> wrote:
>
> Hey Linus,
>
> This is an RFC for adding a new CLONE_PIDFD flag to clone() as
> previously discussed.
> While implementing this Jann and I ran into additional complexity that
> prompted us to send out an initial RFC patchset to make sure we still
> think going forward with the current implementation is a good idea and
> also provide an alternative approach:
>
> RFC-1:
> This is an RFC for the implementation of pidfds as /proc/<pid> file
> descriptors.
> The tricky part here is that we need to retrieve a file descriptor for
> /proc/<pid> before clone's point of no return. Otherwise, we need to fail
> the creation of a process that has already passed all barriers and is
> visible in userspace. Getting that file descriptor then becomes a rather
> intricate dance including allocating a detached dentry that we need to
> commit once attach_pid() has been called.
> Note that this RFC only includes the logic we think is needed to return
> /proc/<pid> file descriptors from clone. It does *not* yet include the even
> more complex logic needed to restrict procfs itself. And the additional
> logic needed to prevent attacks such as openat(pidfd, "..", ...) and access
> to /proc/<pid>/net/ on top of the procfs restriction.

Why would filtering proc be all that complicated? Wouldn't it just be
adding a "sensitive" flag to struct pid_entry and skipping entries
with that flag when constructing proc entries?

> There are a couple of reasons why we stopped short of this and decided to
> sent out an RFC first:
> - Even the initial part of getting file descriptors from /proc/<pid> out
>   of clone() required rather complex code that struck us as very
>   inelegant and heavy (which granted, might partially caused by not seeing
>   a cleaner way to implement this). Thus, it felt like we needed to see
>   whether this is even remotely considered acceptable.
> - While discussing further aspects of this approach with Al we received
>   rather substantiated opposition to exposing even more codepaths to
>   procfs.
> - Restricting access to procfs properly requires a lot of invasive work
>   even touching core vfs functions such as
>   follow_dotdot()/follow_dotdot_rcu() which also caused 2.

Wasn't an internal bind mount supposed to take care of the parent
traversal problem?

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [RFC PATCH] fork: add CLONE_PIDFD
  2019-04-10 23:40 [RFC PATCH] fork: add CLONE_PIDFD Christian Brauner
                   ` (5 preceding siblings ...)
  2019-04-11  0:12 ` [RFC PATCH] fork: add CLONE_PIDFD Daniel Colascione
@ 2019-04-11 16:50 ` Linus Torvalds
  2019-04-11 18:09   ` Christian Brauner
  6 siblings, 1 reply; 10+ messages in thread
From: Linus Torvalds @ 2019-04-11 16:50 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Al Viro, Jann Horn, David Howells, Linux API,
	Linux List Kernel Mailing, Serge E. Hallyn, Andrew Lutomirski,
	Arnd Bergmann, Eric W. Biederman, Kees Cook, Alexey Dobriyan,
	Thomas Gleixner, Michael Kerrisk-manpages, Jonathan Kowalski,
	Dmitry V. Levin, Andrew Morton, Oleg Nesterov, Aleksa Sarai,
	Joel Fernandes, Daniel Colascione

On Wed, Apr 10, 2019 at 4:43 PM Christian Brauner <christian@brauner.io> wrote:
>
> RFC-2:
> This alternative patchset uses anonymous file descriptors instead of
> file descriptors from /proc/<pid>.

I think I prefer this one. Your diffstat makes it look bigger, but
that's because you added the example code to use this to that rfc..

Plus making anon_inodes unconditional makes sense anyway.  They are
always selected in practice to begin with.

               Linus

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [RFC PATCH] fork: add CLONE_PIDFD
  2019-04-11 16:50 ` Linus Torvalds
@ 2019-04-11 18:09   ` Christian Brauner
  0 siblings, 0 replies; 10+ messages in thread
From: Christian Brauner @ 2019-04-11 18:09 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Al Viro, Jann Horn, David Howells, Linux API,
	Linux List Kernel Mailing, Serge E. Hallyn, Andrew Lutomirski,
	Arnd Bergmann, Eric W. Biederman, Kees Cook, Alexey Dobriyan,
	Thomas Gleixner, Michael Kerrisk-manpages, Jonathan Kowalski,
	Dmitry V. Levin, Andrew Morton, Oleg Nesterov, Aleksa Sarai,
	Joel Fernandes, Daniel Colascione

On April 11, 2019 6:50:47 PM GMT+02:00, Linus Torvalds <torvalds@linux-foundation.org> wrote:
>On Wed, Apr 10, 2019 at 4:43 PM Christian Brauner
><christian@brauner.io> wrote:
>>
>> RFC-2:
>> This alternative patchset uses anonymous file descriptors instead of
>> file descriptors from /proc/<pid>.
>
>I think I prefer this one. Your diffstat makes it look bigger, but
>that's because you added the example code to use this to that rfc..

Jann and I think this is the correct way to proceed as well.
And this will be way more acceptable for Al too we think.
I have spoken to Joel just now and he's
happy to change RFC patchsets related to
pidfds he has to the anon inode approach as well.
I'll take care to do the necessary coordination before anything
gets sent your way before the next merge window.

>
>Plus making anon_inodes unconditional makes sense anyway.  They are
>always selected in practice to begin with.

Yes, indeed. I coordinated this with David and Al.
They need anonymous inodes for the mount API as well
and thus need anon inodes useable in core vfs functions.

Christian


^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2019-04-11 18:09 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-04-10 23:40 [RFC PATCH] fork: add CLONE_PIDFD Christian Brauner
2019-04-10 23:40 ` [RFC-1 PATCH 1/1] fork: add CLONE_PIDFD via /proc/<pid> Christian Brauner
2019-04-10 23:40 ` [RFC-2 PATCH 1/4] Make anon_inodes unconditional Christian Brauner
2019-04-10 23:40 ` [RFC-2 PATCH 2/4] fork: add CLONE_PIDFD via anonymous inode Christian Brauner
2019-04-10 23:40 ` [RFC-2 PATCH 3/4] signal: support CLONE_PIDFD with pidfd_send_signal Christian Brauner
2019-04-10 23:40 ` [RFC-2 PATCH 4/4] samples: show race-free pidfd metadata access Christian Brauner
2019-04-11  0:08   ` Daniel Colascione
2019-04-11  0:12 ` [RFC PATCH] fork: add CLONE_PIDFD Daniel Colascione
2019-04-11 16:50 ` Linus Torvalds
2019-04-11 18:09   ` Christian Brauner

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).