linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] nsfs: add NS_GET_INIT_PID ioctl
@ 2020-06-18  8:45 Christian Brauner
  2020-06-18  9:03 ` Michael Kerrisk (man-pages)
  2020-07-10 11:58 ` Qian Cai
  0 siblings, 2 replies; 5+ messages in thread
From: Christian Brauner @ 2020-06-18  8:45 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel
  Cc: Wolfgang Bumiller, Serge Hallyn, Michael Kerrisk, Alexander Viro,
	Christian Brauner

Add an ioctl() to return the PID of the init process/child reaper of a pid
namespace as seen in the caller's pid namespace.

LXCFS is a tiny fuse filesystem used to virtualize various aspects of
procfs. It is used actively by a large number of users including ChromeOS
and cloud providers. LXCFS is run on the host. The files and directories it
creates can be bind-mounted by e.g. a container at startup and mounted over
the various procfs files the container wishes to have virtualized. When
e.g. a read request for uptime is received, LXCFS will receive the pid of
the reader. In order to virtualize the corresponding read, LXCFS needs to
know the pid of the init process of the reader's pid namespace. In order to
do this, LXCFS first needs to fork() two helper processes. The first helper
process setns() to the readers pid namespace. The second helper process is
needed to create a process that is a proper member of the pid namespace.
The second helper process then creates a ucred message with ucred.pid set
to 1 and sends it back to LXCFS. The kernel will translate the ucred.pid
field to the corresponding pid number in LXCFS's pid namespace. This way
LXCFS can learn the init pid number of the reader's pid namespace and can
go on to virtualize. Since these two forks() are costly LXCFS maintains an
init pid cache that caches a given pid for a fixed amount of time. The
cache is pruned during new read requests. However, even with the cache the
hit of the two forks() is singificant when a very large number of
containers are running. With this simple patch we add an ns ioctl that
let's a caller retrieve the init pid nr of a pid namespace through its
pid namespace fd. This _significantly_ improves our performance with a very
simple change. A caller should do something like:
- pid_t init_pid = ioctl(pid_ns_fd, NS_GET_INIT_PID);
- verify init_pid is still valid (not necessarily both but recommended):
  - opening a pidfd to get a stable reference
  - opening /proc/<init_pid>/ns/pid and verifying that <pid_ns_fd>
    and the pid namespace fd of <init_pid> refer to the same pid namespace

Note, it is possible for the init process of the pid namespace (identified
via the child_reaper member in the relevant pid namespace) to die and get
reaped right after the ioctl returned. If that happens there are two cases
to consider:
- if the init process was single threaded, all other processes in the pid
  namespace will be zapped and any new process creation in there will fail;
  A caller can detect this case since either the init pid is still around
  but it is a zombie, or it already has exited and not been recycled, or it
  has exited, been reaped, and also been recycled. The last case is the
  most interesting one but a caller would then be able to detect that the
  recycled process lives in a different pid namespace.
- if the init process was multi-threaded, then the kernel will try to make
  one of the threads in the same thread-group - if any are still alive -
  the new child_reaper. In this case the caller can detect that the thread
  which exited and used to be the child_reaper is no longer alive. If it's
  tid has been recycled in the same pid namespace a caller can detect this
  by parsing through /proc/<tid>/stat, looking at the Nspid: field and if
  there's a entry with pid nr 1 in the respective pid namespace it can be
  sure that it hasn't been recycled.
Both options can be combined with pidfd_open() to make sure that a stable
reference is maintained.

Cc: Wolfgang Bumiller <w.bumiller@proxmox.com>
Cc: Serge Hallyn <serge@hallyn.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
---
 fs/nsfs.c                 | 29 +++++++++++++++++++++++++++++
 include/uapi/linux/nsfs.h |  2 ++
 2 files changed, 31 insertions(+)

diff --git a/fs/nsfs.c b/fs/nsfs.c
index 800c1d0eb0d0..5a7de1ee6df0 100644
--- a/fs/nsfs.c
+++ b/fs/nsfs.c
@@ -8,6 +8,7 @@
 #include <linux/magic.h>
 #include <linux/ktime.h>
 #include <linux/seq_file.h>
+#include <linux/pid_namespace.h>
 #include <linux/user_namespace.h>
 #include <linux/nsfs.h>
 #include <linux/uaccess.h>
@@ -189,6 +190,10 @@ static long ns_ioctl(struct file *filp, unsigned int ioctl,
 			unsigned long arg)
 {
 	struct user_namespace *user_ns;
+	struct pid_namespace *pid_ns;
+	struct task_struct *child_reaper;
+	struct pid *pid_struct;
+	pid_t pid;
 	struct ns_common *ns = get_proc_ns(file_inode(filp));
 	uid_t __user *argp;
 	uid_t uid;
@@ -209,6 +214,30 @@ static long ns_ioctl(struct file *filp, unsigned int ioctl,
 		argp = (uid_t __user *) arg;
 		uid = from_kuid_munged(current_user_ns(), user_ns->owner);
 		return put_user(uid, argp);
+	case NS_GET_INIT_PID:
+		if (ns->ops->type != CLONE_NEWPID)
+			return -EINVAL;
+
+		pid_ns = container_of(ns, struct pid_namespace, ns);
+
+		/*
+		 * If we're asking for the init pid of our own pid namespace
+		 * that's of course silly but no need to fail this since we can
+		 * both infer or find out our own pid namespaces's init pid
+		 * trivially. In all other cases, we require the same
+		 * privileges as for setns().
+		 */
+		if (task_active_pid_ns(current) != pid_ns &&
+		    !ns_capable(pid_ns->user_ns, CAP_SYS_ADMIN))
+			return -EPERM;
+
+		pid = -ESRCH;
+		read_lock(&tasklist_lock);
+		if (likely(pid_ns->child_reaper))
+			pid = task_pid_vnr(pid_ns->child_reaper);
+		read_unlock(&tasklist_lock);
+
+		return pid;
 	default:
 		return -ENOTTY;
 	}
diff --git a/include/uapi/linux/nsfs.h b/include/uapi/linux/nsfs.h
index a0c8552b64ee..29c775f42bbe 100644
--- a/include/uapi/linux/nsfs.h
+++ b/include/uapi/linux/nsfs.h
@@ -15,5 +15,7 @@
 #define NS_GET_NSTYPE		_IO(NSIO, 0x3)
 /* Get owner UID (in the caller's user namespace) for a user namespace */
 #define NS_GET_OWNER_UID	_IO(NSIO, 0x4)
+/* Get init PID (in the caller's pid namespace) of a pid namespace */
+#define NS_GET_INIT_PID		_IO(NSIO, 0x5)
 
 #endif /* __LINUX_NSFS_H */

base-commit: b3a9e3b9622ae10064826dccb4f7a52bd88c7407
-- 
2.27.0


^ permalink raw reply related	[flat|nested] 5+ messages in thread

* Re: [PATCH] nsfs: add NS_GET_INIT_PID ioctl
  2020-06-18  8:45 [PATCH] nsfs: add NS_GET_INIT_PID ioctl Christian Brauner
@ 2020-06-18  9:03 ` Michael Kerrisk (man-pages)
  2020-06-18  9:18   ` Christian Brauner
  2020-07-10 11:58 ` Qian Cai
  1 sibling, 1 reply; 5+ messages in thread
From: Michael Kerrisk (man-pages) @ 2020-06-18  9:03 UTC (permalink / raw)
  To: Christian Brauner
  Cc: linux-fsdevel, lkml, Wolfgang Bumiller, Serge Hallyn, Alexander Viro

On Thu, 18 Jun 2020 at 10:45, Christian Brauner
<christian.brauner@ubuntu.com> wrote:
>
> Add an ioctl() to return the PID of the init process/child reaper of a pid
> namespace as seen in the caller's pid namespace.

What are the pros and cons of returning a PID FD instead of a PID?

Thanks,

Michael

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH] nsfs: add NS_GET_INIT_PID ioctl
  2020-06-18  9:03 ` Michael Kerrisk (man-pages)
@ 2020-06-18  9:18   ` Christian Brauner
  0 siblings, 0 replies; 5+ messages in thread
From: Christian Brauner @ 2020-06-18  9:18 UTC (permalink / raw)
  To: Michael Kerrisk (man-pages)
  Cc: linux-fsdevel, lkml, Wolfgang Bumiller, Serge Hallyn, Alexander Viro

On Thu, Jun 18, 2020 at 11:03:25AM +0200, Michael Kerrisk (man-pages) wrote:
> On Thu, 18 Jun 2020 at 10:45, Christian Brauner
> <christian.brauner@ubuntu.com> wrote:
> >
> > Add an ioctl() to return the PID of the init process/child reaper of a pid
> > namespace as seen in the caller's pid namespace.
> 
> What are the pros and cons of returning a PID FD instead of a PID?

A pidfd doesn't buy you much here since you can race-free turn the PID
into a pidfd via pidfd_open() right after.
But mostly, I don't want to introduce the pattern of returning pidfds
from all corners of the kernel especially when it's not strictly
required. The central entrypoints should remain clone{3}() and
pidfd_open() for now. I want to remain conservative with this until we
have had more of userspace rely on them for a while and the bugs and
features requests come trickling in. We've seen a good portion of that
but we'll likely see more. If we need to do global changes (e.g. sending
signals outside of your own pid namespace hierarchy or attaching
capabilities to them) we will be in better shape if we don't return them
from everywhere just yet.

Christian

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH] nsfs: add NS_GET_INIT_PID ioctl
  2020-06-18  8:45 [PATCH] nsfs: add NS_GET_INIT_PID ioctl Christian Brauner
  2020-06-18  9:03 ` Michael Kerrisk (man-pages)
@ 2020-07-10 11:58 ` Qian Cai
  2020-07-10 12:01   ` Christian Brauner
  1 sibling, 1 reply; 5+ messages in thread
From: Qian Cai @ 2020-07-10 11:58 UTC (permalink / raw)
  To: Christian Brauner
  Cc: linux-fsdevel, linux-kernel, Wolfgang Bumiller, Serge Hallyn,
	Michael Kerrisk, Alexander Viro

On Thu, Jun 18, 2020 at 10:45:43AM +0200, Christian Brauner wrote:
> Add an ioctl() to return the PID of the init process/child reaper of a pid
> namespace as seen in the caller's pid namespace.
> 
> LXCFS is a tiny fuse filesystem used to virtualize various aspects of
> procfs. It is used actively by a large number of users including ChromeOS
> and cloud providers. LXCFS is run on the host. The files and directories it
> creates can be bind-mounted by e.g. a container at startup and mounted over
> the various procfs files the container wishes to have virtualized. When
> e.g. a read request for uptime is received, LXCFS will receive the pid of
> the reader. In order to virtualize the corresponding read, LXCFS needs to
> know the pid of the init process of the reader's pid namespace. In order to
> do this, LXCFS first needs to fork() two helper processes. The first helper
> process setns() to the readers pid namespace. The second helper process is
> needed to create a process that is a proper member of the pid namespace.
> The second helper process then creates a ucred message with ucred.pid set
> to 1 and sends it back to LXCFS. The kernel will translate the ucred.pid
> field to the corresponding pid number in LXCFS's pid namespace. This way
> LXCFS can learn the init pid number of the reader's pid namespace and can
> go on to virtualize. Since these two forks() are costly LXCFS maintains an
> init pid cache that caches a given pid for a fixed amount of time. The
> cache is pruned during new read requests. However, even with the cache the
> hit of the two forks() is singificant when a very large number of
> containers are running. With this simple patch we add an ns ioctl that
> let's a caller retrieve the init pid nr of a pid namespace through its
> pid namespace fd. This _significantly_ improves our performance with a very
> simple change. A caller should do something like:
> - pid_t init_pid = ioctl(pid_ns_fd, NS_GET_INIT_PID);
> - verify init_pid is still valid (not necessarily both but recommended):
>   - opening a pidfd to get a stable reference
>   - opening /proc/<init_pid>/ns/pid and verifying that <pid_ns_fd>
>     and the pid namespace fd of <init_pid> refer to the same pid namespace
> 
> Note, it is possible for the init process of the pid namespace (identified
> via the child_reaper member in the relevant pid namespace) to die and get
> reaped right after the ioctl returned. If that happens there are two cases
> to consider:
> - if the init process was single threaded, all other processes in the pid
>   namespace will be zapped and any new process creation in there will fail;
>   A caller can detect this case since either the init pid is still around
>   but it is a zombie, or it already has exited and not been recycled, or it
>   has exited, been reaped, and also been recycled. The last case is the
>   most interesting one but a caller would then be able to detect that the
>   recycled process lives in a different pid namespace.
> - if the init process was multi-threaded, then the kernel will try to make
>   one of the threads in the same thread-group - if any are still alive -
>   the new child_reaper. In this case the caller can detect that the thread
>   which exited and used to be the child_reaper is no longer alive. If it's
>   tid has been recycled in the same pid namespace a caller can detect this
>   by parsing through /proc/<tid>/stat, looking at the Nspid: field and if
>   there's a entry with pid nr 1 in the respective pid namespace it can be
>   sure that it hasn't been recycled.
> Both options can be combined with pidfd_open() to make sure that a stable
> reference is maintained.
> 
> Cc: Wolfgang Bumiller <w.bumiller@proxmox.com>
> Cc: Serge Hallyn <serge@hallyn.com>
> Cc: Michael Kerrisk <mtk.manpages@gmail.com>
> Cc: Alexander Viro <viro@zeniv.linux.org.uk>
> Cc: linux-fsdevel@vger.kernel.org
> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>

fs/nsfs.c: In function ‘ns_ioctl’:
fs/nsfs.c:195:14: warning: unused variable ‘pid_struct’ [-Wunused-variable]
  struct pid *pid_struct;
              ^~~~~~~~~~
fs/nsfs.c:194:22: warning: unused variable ‘child_reaper’ [-Wunused-variable]
  struct task_struct *child_reaper;
                      ^~~~~~~~~~~~

> ---
>  fs/nsfs.c                 | 29 +++++++++++++++++++++++++++++
>  include/uapi/linux/nsfs.h |  2 ++
>  2 files changed, 31 insertions(+)
> 
> diff --git a/fs/nsfs.c b/fs/nsfs.c
> index 800c1d0eb0d0..5a7de1ee6df0 100644
> --- a/fs/nsfs.c
> +++ b/fs/nsfs.c
> @@ -8,6 +8,7 @@
>  #include <linux/magic.h>
>  #include <linux/ktime.h>
>  #include <linux/seq_file.h>
> +#include <linux/pid_namespace.h>
>  #include <linux/user_namespace.h>
>  #include <linux/nsfs.h>
>  #include <linux/uaccess.h>
> @@ -189,6 +190,10 @@ static long ns_ioctl(struct file *filp, unsigned int ioctl,
>  			unsigned long arg)
>  {
>  	struct user_namespace *user_ns;
> +	struct pid_namespace *pid_ns;
> +	struct task_struct *child_reaper;
> +	struct pid *pid_struct;
> +	pid_t pid;
>  	struct ns_common *ns = get_proc_ns(file_inode(filp));
>  	uid_t __user *argp;
>  	uid_t uid;
> @@ -209,6 +214,30 @@ static long ns_ioctl(struct file *filp, unsigned int ioctl,
>  		argp = (uid_t __user *) arg;
>  		uid = from_kuid_munged(current_user_ns(), user_ns->owner);
>  		return put_user(uid, argp);
> +	case NS_GET_INIT_PID:
> +		if (ns->ops->type != CLONE_NEWPID)
> +			return -EINVAL;
> +
> +		pid_ns = container_of(ns, struct pid_namespace, ns);
> +
> +		/*
> +		 * If we're asking for the init pid of our own pid namespace
> +		 * that's of course silly but no need to fail this since we can
> +		 * both infer or find out our own pid namespaces's init pid
> +		 * trivially. In all other cases, we require the same
> +		 * privileges as for setns().
> +		 */
> +		if (task_active_pid_ns(current) != pid_ns &&
> +		    !ns_capable(pid_ns->user_ns, CAP_SYS_ADMIN))
> +			return -EPERM;
> +
> +		pid = -ESRCH;
> +		read_lock(&tasklist_lock);
> +		if (likely(pid_ns->child_reaper))
> +			pid = task_pid_vnr(pid_ns->child_reaper);
> +		read_unlock(&tasklist_lock);
> +
> +		return pid;
>  	default:
>  		return -ENOTTY;
>  	}
> diff --git a/include/uapi/linux/nsfs.h b/include/uapi/linux/nsfs.h
> index a0c8552b64ee..29c775f42bbe 100644
> --- a/include/uapi/linux/nsfs.h
> +++ b/include/uapi/linux/nsfs.h
> @@ -15,5 +15,7 @@
>  #define NS_GET_NSTYPE		_IO(NSIO, 0x3)
>  /* Get owner UID (in the caller's user namespace) for a user namespace */
>  #define NS_GET_OWNER_UID	_IO(NSIO, 0x4)
> +/* Get init PID (in the caller's pid namespace) of a pid namespace */
> +#define NS_GET_INIT_PID		_IO(NSIO, 0x5)
>  
>  #endif /* __LINUX_NSFS_H */
> 
> base-commit: b3a9e3b9622ae10064826dccb4f7a52bd88c7407
> -- 
> 2.27.0
> 

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH] nsfs: add NS_GET_INIT_PID ioctl
  2020-07-10 11:58 ` Qian Cai
@ 2020-07-10 12:01   ` Christian Brauner
  0 siblings, 0 replies; 5+ messages in thread
From: Christian Brauner @ 2020-07-10 12:01 UTC (permalink / raw)
  To: Qian Cai
  Cc: linux-fsdevel, linux-kernel, Wolfgang Bumiller, Serge Hallyn,
	Michael Kerrisk, Alexander Viro

On Fri, Jul 10, 2020 at 07:58:36AM -0400, Qian Cai wrote:
> On Thu, Jun 18, 2020 at 10:45:43AM +0200, Christian Brauner wrote:
> > Add an ioctl() to return the PID of the init process/child reaper of a pid
> > namespace as seen in the caller's pid namespace.
> > 
> > LXCFS is a tiny fuse filesystem used to virtualize various aspects of
> > procfs. It is used actively by a large number of users including ChromeOS
> > and cloud providers. LXCFS is run on the host. The files and directories it
> > creates can be bind-mounted by e.g. a container at startup and mounted over
> > the various procfs files the container wishes to have virtualized. When
> > e.g. a read request for uptime is received, LXCFS will receive the pid of
> > the reader. In order to virtualize the corresponding read, LXCFS needs to
> > know the pid of the init process of the reader's pid namespace. In order to
> > do this, LXCFS first needs to fork() two helper processes. The first helper
> > process setns() to the readers pid namespace. The second helper process is
> > needed to create a process that is a proper member of the pid namespace.
> > The second helper process then creates a ucred message with ucred.pid set
> > to 1 and sends it back to LXCFS. The kernel will translate the ucred.pid
> > field to the corresponding pid number in LXCFS's pid namespace. This way
> > LXCFS can learn the init pid number of the reader's pid namespace and can
> > go on to virtualize. Since these two forks() are costly LXCFS maintains an
> > init pid cache that caches a given pid for a fixed amount of time. The
> > cache is pruned during new read requests. However, even with the cache the
> > hit of the two forks() is singificant when a very large number of
> > containers are running. With this simple patch we add an ns ioctl that
> > let's a caller retrieve the init pid nr of a pid namespace through its
> > pid namespace fd. This _significantly_ improves our performance with a very
> > simple change. A caller should do something like:
> > - pid_t init_pid = ioctl(pid_ns_fd, NS_GET_INIT_PID);
> > - verify init_pid is still valid (not necessarily both but recommended):
> >   - opening a pidfd to get a stable reference
> >   - opening /proc/<init_pid>/ns/pid and verifying that <pid_ns_fd>
> >     and the pid namespace fd of <init_pid> refer to the same pid namespace
> > 
> > Note, it is possible for the init process of the pid namespace (identified
> > via the child_reaper member in the relevant pid namespace) to die and get
> > reaped right after the ioctl returned. If that happens there are two cases
> > to consider:
> > - if the init process was single threaded, all other processes in the pid
> >   namespace will be zapped and any new process creation in there will fail;
> >   A caller can detect this case since either the init pid is still around
> >   but it is a zombie, or it already has exited and not been recycled, or it
> >   has exited, been reaped, and also been recycled. The last case is the
> >   most interesting one but a caller would then be able to detect that the
> >   recycled process lives in a different pid namespace.
> > - if the init process was multi-threaded, then the kernel will try to make
> >   one of the threads in the same thread-group - if any are still alive -
> >   the new child_reaper. In this case the caller can detect that the thread
> >   which exited and used to be the child_reaper is no longer alive. If it's
> >   tid has been recycled in the same pid namespace a caller can detect this
> >   by parsing through /proc/<tid>/stat, looking at the Nspid: field and if
> >   there's a entry with pid nr 1 in the respective pid namespace it can be
> >   sure that it hasn't been recycled.
> > Both options can be combined with pidfd_open() to make sure that a stable
> > reference is maintained.
> > 
> > Cc: Wolfgang Bumiller <w.bumiller@proxmox.com>
> > Cc: Serge Hallyn <serge@hallyn.com>
> > Cc: Michael Kerrisk <mtk.manpages@gmail.com>
> > Cc: Alexander Viro <viro@zeniv.linux.org.uk>
> > Cc: linux-fsdevel@vger.kernel.org
> > Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
> 
> fs/nsfs.c: In function ‘ns_ioctl’:
> fs/nsfs.c:195:14: warning: unused variable ‘pid_struct’ [-Wunused-variable]
>   struct pid *pid_struct;
>               ^~~~~~~~~~
> fs/nsfs.c:194:22: warning: unused variable ‘child_reaper’ [-Wunused-variable]
>   struct task_struct *child_reaper;
>                       ^~~~~~~~~~~~

Thanks. Fyi, this has been reported by Stephen already.
Christian

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2020-07-10 12:01 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-06-18  8:45 [PATCH] nsfs: add NS_GET_INIT_PID ioctl Christian Brauner
2020-06-18  9:03 ` Michael Kerrisk (man-pages)
2020-06-18  9:18   ` Christian Brauner
2020-07-10 11:58 ` Qian Cai
2020-07-10 12:01   ` Christian Brauner

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).