linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v1 0/4] pid: add pidctl()
@ 2019-03-26 15:55 Christian Brauner
  2019-03-26 15:55 ` [PATCH v1 1/4] Make anon_inodes unconditional Christian Brauner
                   ` (3 more replies)
  0 siblings, 4 replies; 31+ messages in thread
From: Christian Brauner @ 2019-03-26 15:55 UTC (permalink / raw)
  To: jannh, khlebnikov, luto, dhowells, serge, ebiederm, linux-api,
	linux-kernel
  Cc: arnd, keescook, adobriyan, tglx, mtk.manpages, bl0pbl33p, ldv,
	akpm, oleg, nagarathnam.muthusamy, cyphar, viro, joel, dancol,
	Christian Brauner

This is v1 of this patchset with various minor fixes which are listed in
the individual commits. Notably, pidfds are now O_CLOEXEC by default.

The pidctl() syscalls builds on, extends, and improves translate_pid()
[4] and serves as the natural connection between the pid-based and the
pidfd-based api.

I quote Konstantins original patchset first that has already been acked
and picked up by Eric before and whose functionality is preserved in
this syscall. Multiple people have asked when this patchset will be sent
in for merging (cf. [1], [2]). It has recently been revived by
Nagarathnam Muthusamy from Oracle [3].

The intention of the original translate_pid() syscall was twofold:
1. Provide translation of pids between pid namespaces especially for the
   case of deeply nested pid namespaces.
   The most obvious use-case is strace which has been waiting for this
   feature for a while.
2. Provide implicit pid namespace introspection

Both functionalities are preserved. The latter task has been improved
upon though. In the original version of the pachset passing pid as 1
would allow to deterimine the relationship between the pid namespaces.
This is inherhently racy. If pid 1 inside a pid namespace has died it
would report false negatives. For example, if pid 1 inside of the target
pid namespace already died, it would report that the target pid
namespace cannot be reached from the source pid namespace because it
couldn't find the pid inside of the target pid namespace and thus
falsely report to the user that the two pid namespaces are not related.
This problem is simple to avoid. In the new version we simply walk the
list of ancestors and check whether the namespace are related to each
other. By doing it this way we can reliably report what the relationship
between two pid namespace file descriptors looks like.

Additionally, this syscall has been extended to allow the retrieval of
pidfds independent of procfs. These pidfds can e.g. be used with the new
pidfd_send_signal() syscall we recently merged. The ability to retrieve
pidfds independent of procfs had already been requested in the
pidfd_send_signal patchset by e.g. Andrew [4] and later again by Alexey
[5]. A use-case where a kernel is compiled without procfs but where
pidfds are still useful has been outlined by Andy in [6]. Regular
anon-inode based file descriptors are used that stash a reference to
struct pid in file->private_data and drop that reference on close.

With this pidctl() has three closely related functionalities that
provide a natural connection between the pid-based and the pidfd-based
api. To clarify the semantics and to make it easier for userspace to use
the syscall it has a command argument and three commands clearly
reflecting the functionalities (PIDCMD_QUERY_PID, PIDCMD_QUERY_PIDNS,
PIDCMD_GET_PIDFD).

Embedding the retrieval of pidfds into this syscall has two main
advantages:
- pidctl provides a natural and clean connection between the traditional
  pid-based and the newer pidfd-based process API
- allows the retrieval of pidfds for other pid namespaces while
  enforcing that
  - the caller must have been given access to two file descriptors
    referring to target and source pid namespace
  - the source pid namespace must be an ancestor of the target pid
    namespace
  - the pid must be translatable from the source pid namespace into the
    target pid namespace

Note that this patchset also includes Al's and David's commit to make anon
inodes unconditional. The original intention is to make it possible to use
anon inodes in core vfs functions. pidctl() has the same requirement so
David suggested I sent this in alongside this patch. Both are informed of
this.

The syscall comes with extensive testing for all functionalities.

/* References */
[1]: https://lore.kernel.org/lkml/37b17950-b130-7933-99a1-4846c61c8555@oracle.com/
[2]: https://lore.kernel.org/lkml/20181109034919.GA21681@altlinux.org/
[3]: https://lore.kernel.org/lkml/37b17950-b130-7933-99a1-4846c61c8555@oracle.com/
[4]: 3eb39f47934f9d5a3027fe00d906a45fe3a15fad
[5]: https://lore.kernel.org/lkml/20190320203910.GA2842@avx2/
[6]: https://lore.kernel.org/lkml/CALCETrXO=V=+qEdLDVPf8eCgLZiB9bOTrUfe0V-U-tUZoeoRDA@mail.gmail.com/

Thanks!
Christian

Christian Brauner (3):
  pid: add pidctl()
  signal: support pidctl() with pidfd_send_signal()
  tests: add pidctl() tests

David Howells (1):
  Make anon_inodes unconditional

 arch/arm/kvm/Kconfig                        |   1 -
 arch/arm64/kvm/Kconfig                      |   1 -
 arch/mips/kvm/Kconfig                       |   1 -
 arch/powerpc/kvm/Kconfig                    |   1 -
 arch/s390/kvm/Kconfig                       |   1 -
 arch/x86/Kconfig                            |   1 -
 arch/x86/entry/syscalls/syscall_32.tbl      |   1 +
 arch/x86/entry/syscalls/syscall_64.tbl      |   1 +
 arch/x86/kvm/Kconfig                        |   1 -
 drivers/base/Kconfig                        |   1 -
 drivers/char/tpm/Kconfig                    |   1 -
 drivers/dma-buf/Kconfig                     |   1 -
 drivers/gpio/Kconfig                        |   1 -
 drivers/iio/Kconfig                         |   1 -
 drivers/infiniband/Kconfig                  |   1 -
 drivers/vfio/Kconfig                        |   1 -
 fs/Makefile                                 |   2 +-
 fs/notify/fanotify/Kconfig                  |   1 -
 fs/notify/inotify/Kconfig                   |   1 -
 include/linux/pid.h                         |   2 +
 include/linux/pid_namespace.h               |   8 +
 include/linux/syscalls.h                    |   2 +
 include/uapi/linux/wait.h                   |  14 +
 init/Kconfig                                |  10 -
 kernel/pid.c                                | 161 ++++++
 kernel/pid_namespace.c                      |  25 +
 kernel/signal.c                             |  29 +-
 kernel/sys_ni.c                             |   3 -
 tools/testing/selftests/pidfd/Makefile      |   2 +-
 tools/testing/selftests/pidfd/pidctl_test.c | 537 ++++++++++++++++++++
 30 files changed, 765 insertions(+), 48 deletions(-)
 create mode 100644 tools/testing/selftests/pidfd/pidctl_test.c

-- 
2.21.0


^ permalink raw reply	[flat|nested] 31+ messages in thread
* [PATCH v1 0/4] pidfd_open()
@ 2019-03-27 22:19 Christian Brauner
  2019-03-27 22:19 ` [PATCH v1 1/4] Make anon_inodes unconditional Christian Brauner
  0 siblings, 1 reply; 31+ messages in thread
From: Christian Brauner @ 2019-03-27 22:19 UTC (permalink / raw)
  To: jannh, khlebnikov, luto, dhowells, serge, ebiederm, linux-api,
	linux-kernel
  Cc: arnd, keescook, adobriyan, tglx, mtk.manpages, bl0pbl33p, ldv,
	akpm, oleg, nagarathnam.muthusamy, cyphar, viro, joel, dancol,
	Christian Brauner

Hey,

This is v1 of this patchset. No major changes. Just fixing nits that
Jann detected.

After the discussion over the last days, this is a fresh approach to
getting pidfds independent of the translate_pid() patchset.

pidfd_open() allows to retrieve pidfds for processes and removes the
dependency of pidfd on procfs.
These pidfds are allocated using anon_inode_getfd(), are O_CLOEXEC by
default and can be used with the pidfd_send_signal() syscall. They are not
dirfds and as such have the advantage that we can make them pollable or
readable in the future if we see a need to do so. Currently they do not
support any advanced operations. The pidfds are not associated with a
specific pid namespaces but rather only reference struct pid of a given
process in their private_data member.

One of the oustanding issues has been how to get information about a given
process if pidfds are regular file descriptors and do not provide access to
the process /proc/<pid> directory.
Various solutions have been proposed. The one that most people prefer is to
be able to retrieve a file descriptor to /proc/<pid> based on a pidfd (and
the other way around).
IF PROCFD_TO_PIDFD is passed as a flag together with a file descriptor to a
/proc mount in a given pid namespace and a pidfd pidfd_open() will return a
file descriptor to the corresponding /proc/<pid> directory in procfs
mounts' pid namespace. pidfd_open() is very careful to verify that the pid
hasn't been recycled in between.
IF PIDFD_TO_PROCFD is passed as a flag together with a file descriptor
referencing a /proc/<pid> directory a pidfd referencing the struct pid
stashed in /proc/<pid> of the process will be returned.
The pidfd_open() syscalls in that manner resembles openat() as it uses a
flag argument to modify what type of file descriptor will be returned.

The pidfd_open() implementation together with the flags argument strikes me
as an elegant compromise between splitting this into multiple syscalls and
avoiding ioctls().

Note that this patchset also includes Al's and David's commit to make anon
inodes unconditional. The original intention is to make it possible to use
anon inodes in core vfs functions. pidctl() has the same requirement so
David suggested I sent this in alongside this patch. Both are informed of
this.

The syscall comes with appropriate basic testing.

/* Examples */
// Retrieve pidfd
int pidfd = pidfd_open(1234, -1, -1, 0);

// Retrieve /proc/<pid> handle for pidfd
int procfd = open("/proc", O_DIRECTORY | O_RDONLY | O_CLOEXEC);
int procpidfd = pidfd_open(-1, procfd, pidfd, PIDFD_TO_PROCFD);

// Retrieve pidfd for /proc/<pid>
int procpidfd = open("/proc/1234", O_DIRECTORY | O_RDONLY | O_CLOEXEC);
int pidfd = pidfd_open(-1, procpidfd, -1, PROCFD_TO_PIDFD);

Thanks!
Christian

Christian Brauner (3):
  pid: add pidfd_open()
  signal: support pidfd_open() with pidfd_send_signal()
  tests: add pidfd_open() tests

David Howells (1):
  Make anon_inodes unconditional

 arch/arm/kvm/Kconfig                          |   1 -
 arch/arm64/kvm/Kconfig                        |   1 -
 arch/mips/kvm/Kconfig                         |   1 -
 arch/powerpc/kvm/Kconfig                      |   1 -
 arch/s390/kvm/Kconfig                         |   1 -
 arch/x86/Kconfig                              |   1 -
 arch/x86/entry/syscalls/syscall_32.tbl        |   1 +
 arch/x86/entry/syscalls/syscall_64.tbl        |   1 +
 arch/x86/kvm/Kconfig                          |   1 -
 drivers/base/Kconfig                          |   1 -
 drivers/char/tpm/Kconfig                      |   1 -
 drivers/dma-buf/Kconfig                       |   1 -
 drivers/gpio/Kconfig                          |   1 -
 drivers/iio/Kconfig                           |   1 -
 drivers/infiniband/Kconfig                    |   1 -
 drivers/vfio/Kconfig                          |   1 -
 fs/Makefile                                   |   2 +-
 fs/notify/fanotify/Kconfig                    |   1 -
 fs/notify/inotify/Kconfig                     |   1 -
 include/linux/pid.h                           |   2 +
 include/linux/syscalls.h                      |   2 +
 include/uapi/linux/wait.h                     |   3 +
 init/Kconfig                                  |  10 -
 kernel/pid.c                                  | 242 ++++++++++++++++++
 kernel/signal.c                               |  14 +-
 kernel/sys_ni.c                               |   3 -
 tools/testing/selftests/pidfd/Makefile        |   2 +-
 .../testing/selftests/pidfd/pidfd_open_test.c | 201 +++++++++++++++
 28 files changed, 464 insertions(+), 35 deletions(-)
 create mode 100644 tools/testing/selftests/pidfd/pidfd_open_test.c

-- 
2.21.0


^ permalink raw reply	[flat|nested] 31+ messages in thread
* [PATCH v1 0/4] clone: add CLONE_PIDFD
@ 2019-04-16 17:02 Christian Brauner
  2019-04-16 17:02 ` [PATCH v1 1/4] Make anon_inodes unconditional Christian Brauner
  0 siblings, 1 reply; 31+ messages in thread
From: Christian Brauner @ 2019-04-16 17:02 UTC (permalink / raw)
  To: torvalds, viro, jannh, dhowells, linux-api, linux-kernel
  Cc: serge, luto, arnd, ebiederm, keescook, tglx, mtk.manpages, akpm,
	oleg, cyphar, joel, dancol, Christian Brauner

Hey,

This is v1.
The only significant change is to have pidfds returned in the fourth
argument of clone allowing us to return a pidfd and its pid to the
caller at the same time. This has various advantages:
- callers get the associated pid for the pidfd without additional
  parsing 
  This makes it easier for userspce to get metadata access through
  procfs.
- the type of the return value for clone() remains unchanged
  (was changed to return an fd in the previous iteration)
- pid file descriptor numbering can start at 0 as is customary for
  file descriptors
  (was changed to start at 1 in the previous patchset to not break
   fork()-like error checking when returning pidfds)
- finally, the patchset has gotten smaller

The patchset makes it possible to retrieve pid file descriptors at
process creation time by introducing the new flag CLONE_PIDFD to the
clone() system call as previously discussed.

As decided last week [1] Jann and I have refined the implementation of
pidfds as anonymous inodes. Based on last weeks RFC we have only tweaked
documentation and naming, as well as making the sample program how to
get easy metadata access from a pidfd a little cleaner and more paranoid
when checking for errors.
The sample program can also serve as a test for the patchset.

When clone is called with CLONE_PIDFD a pidfd will be returned in the
fourth argument of clone. This is based on an idea from Oleg. It allows
us to return a pidfd and the associated pid to the caller at the same
time.

We have taken care that pidfds are created *after* the fd table has
been unshared to not leak pidfds into child processes.
pidfd creation during clone is split into two distinct steps:
1. preparing both an fd and a file referencing struct pid for fd_install()
2. fd_install()ing the pidfd
Step 1. is performed before clone's point of no return and especially
before write_lock_irq(&tasklist_lock) is taken.
Performing 1. before clone's point of no return ensures that we don't
need to fail a process that is already visible to userspace when pidfd
creation fails. Step 2. happens after attach_pid() is performed and the
process is visible to userspace.
Technically, we could have also performed step 1. and 2. together before
clone's point of no return and then calling close on the file descriptor
on failure. This would slightly increase code-locality but it is
semantically more correct and clean to bring the pidfd into existence
once the process is fully attached and not before.

The actual code for CLONE_PIDFD in patch 2 is completely confined to
fork.c (apart from the CLONE_PIDFD definition of course) and is rather
small and hopefully good to review.

The additional changes listed under David's name in the diffstat below
are here to make anon_inodes available unconditionally. They are needed
for the new mount api and thus for core vfs code in addition to pidfds.
David knows this and he has informed Al that this patch is sent out
here. The changes themselves are rather automatic.

As promised I have also contacted Joel who has sent a patchset to make
pidfds pollable. He has been informed and is happy to port his patchset
once we have moved forward [2].
Jann and I currently plan to target this patchset for inclusion in the 5.2
merge window.

Thanks!
Jann & Christian

[1]: https://lore.kernel.org/lkml/CAHk-=wifyY+XGNW=ZC4MyTHD14w81F8JjQNH-GaGAm2RxZ_S8Q@mail.gmail.com/
[2]: https://lore.kernel.org/lkml/20190411200059.GA75190@google.com/

Christian Brauner (3):
  clone: add CLONE_PIDFD
  signal: support CLONE_PIDFD with pidfd_send_signal
  samples: show race-free pidfd metadata access

David Howells (1):
  Make anon_inodes unconditional

 arch/arm/kvm/Kconfig           |   1 -
 arch/arm64/kvm/Kconfig         |   1 -
 arch/mips/kvm/Kconfig          |   1 -
 arch/powerpc/kvm/Kconfig       |   1 -
 arch/s390/kvm/Kconfig          |   1 -
 arch/x86/Kconfig               |   1 -
 arch/x86/kvm/Kconfig           |   1 -
 drivers/base/Kconfig           |   1 -
 drivers/char/tpm/Kconfig       |   1 -
 drivers/dma-buf/Kconfig        |   1 -
 drivers/gpio/Kconfig           |   1 -
 drivers/iio/Kconfig            |   1 -
 drivers/infiniband/Kconfig     |   1 -
 drivers/vfio/Kconfig           |   1 -
 fs/Makefile                    |   2 +-
 fs/notify/fanotify/Kconfig     |   1 -
 fs/notify/inotify/Kconfig      |   1 -
 include/linux/pid.h            |   2 +
 include/uapi/linux/sched.h     |   1 +
 init/Kconfig                   |  10 ---
 kernel/fork.c                  | 121 +++++++++++++++++++++++++++++++--
 kernel/signal.c                |  14 ++--
 kernel/sys_ni.c                |   3 -
 samples/Makefile               |   2 +-
 samples/pidfd/Makefile         |   6 ++
 samples/pidfd/pidfd-metadata.c | 112 ++++++++++++++++++++++++++++++
 26 files changed, 250 insertions(+), 39 deletions(-)
 create mode 100644 samples/pidfd/Makefile
 create mode 100644 samples/pidfd/pidfd-metadata.c

-- 
2.21.0


^ permalink raw reply	[flat|nested] 31+ messages in thread

end of thread, other threads:[~2019-04-16 17:04 UTC | newest]

Thread overview: 31+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-03-26 15:55 [PATCH v1 0/4] pid: add pidctl() Christian Brauner
2019-03-26 15:55 ` [PATCH v1 1/4] Make anon_inodes unconditional Christian Brauner
2019-03-26 15:55 ` [PATCH v1 2/4] pid: add pidctl() Christian Brauner
2019-03-26 16:17   ` Daniel Colascione
2019-03-26 16:23     ` Christian Brauner
2019-03-26 16:31       ` Christian Brauner
2019-03-26 16:34         ` Christian Brauner
2019-03-26 16:38           ` Daniel Colascione
2019-03-26 16:43             ` Christian Brauner
2019-03-26 16:50               ` Daniel Colascione
2019-03-26 17:05                 ` Christian Brauner
2019-03-26 16:42           ` Andy Lutomirski
2019-03-26 16:46             ` Christian Brauner
2019-03-26 17:00               ` Daniel Colascione
2019-03-26 16:33       ` Daniel Colascione
2019-03-26 17:06   ` Joel Fernandes
2019-03-26 17:08     ` Christian Brauner
2019-03-26 17:15       ` Joel Fernandes
2019-03-26 17:17         ` Christian Brauner
2019-03-26 17:18           ` Joel Fernandes
2019-03-26 17:22     ` Christian Brauner
2019-03-26 18:10       ` Joel Fernandes
2019-03-26 18:19         ` Christian Brauner
2019-03-26 18:47           ` Joel Fernandes
2019-03-26 19:19         ` Christian Brauner
2019-03-26 19:51           ` Joel Fernandes
2019-03-26 19:57             ` Christian Brauner
2019-03-26 15:55 ` [PATCH v1 3/4] signal: support pidctl() with pidfd_send_signal() Christian Brauner
2019-03-26 15:55 ` [PATCH v1 4/4] tests: add pidctl() tests Christian Brauner
2019-03-27 22:19 [PATCH v1 0/4] pidfd_open() Christian Brauner
2019-03-27 22:19 ` [PATCH v1 1/4] Make anon_inodes unconditional Christian Brauner
2019-04-16 17:02 [PATCH v1 0/4] clone: add CLONE_PIDFD Christian Brauner
2019-04-16 17:02 ` [PATCH v1 1/4] Make anon_inodes unconditional Christian Brauner

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).