linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 0/4] pid: add pidctl()
@ 2019-03-25 16:20 Christian Brauner
  2019-03-25 16:20 ` [PATCH 1/4] Make anon_inodes unconditional Christian Brauner
                   ` (5 more replies)
  0 siblings, 6 replies; 42+ messages in thread
From: Christian Brauner @ 2019-03-25 16:20 UTC (permalink / raw)
  To: jannh, khlebnikov, luto, dhowells, serge, ebiederm, linux-api,
	linux-kernel
  Cc: arnd, keescook, adobriyan, tglx, mtk.manpages, bl0pbl33p, ldv,
	akpm, oleg, nagarathnam.muthusamy, cyphar, viro, joel, dancol,
	Christian Brauner

The pidctl() syscalls builds on, extends, and improves translate_pid() [4].
I quote Konstantins original patchset first that has already been acked and
picked up by Eric before and whose functionality is preserved in this
syscall. Multiple people have asked when this patchset will be sent in
for merging (cf. [1], [2]). It has recently been revived by Nagarathnam
Muthusamy from Oracle [3].

The intention of the original translate_pid() syscall was twofold:
1. Provide translation of pids between pid namespaces
2. Provide implicit pid namespace introspection

Both functionalities are preserved. The latter task has been improved
upon though. In the original version of the pachset passing pid as 1
would allow to deterimine the relationship between the pid namespaces.
This is inherhently racy. If pid 1 inside a pid namespace has died it
would report false negatives. For example, if pid 1 inside of the target
pid namespace already died, it would report that the target pid
namespace cannot be reached from the source pid namespace because it
couldn't find the pid inside of the target pid namespace and thus
falsely report to the user that the two pid namespaces are not related.
This problem is simple to avoid. In the new version we simply walk the
list of ancestors and check whether the namespace are related to each
other. By doing it this way we can reliably report what the relationship
between two pid namespace file descriptors looks like.

Additionally, this syscall has been extended to allow the retrieval of
pidfds independent of procfs. These pidfds can e.g. be used with the new
pidfd_send_signal() syscall we recently merged. The ability to retrieve
pidfds independent of procfs had already been requested in the
pidfd_send_signal patchset by e.g. Andrew [4] and later again by Alexey
[5]. A use-case where a kernel is compiled without procfs but where
pidfds are still useful has been outlined by Andy in [6]. Regular
anon-inode based file descriptors are used that stash a reference to
struct pid in file->private_data and drop that reference on close.

With this translate_pid() has three closely related but still distinct
functionalities. To clarify the semantics and to make it easier for
userspace to use the syscall it has:
- gained a command argument and three commands clearly reflecting the
  distinct functionalities (PIDCMD_QUERY_PID, PIDCMD_QUERY_PIDNS,
  PIDCMD_GET_PIDFD).
- been renamed to pidctl()

By gaining support for cleanly retrieving pidfds this syscall connects the
traditional pid-based and the newer pidfd-based process API in a natural
and clean way. Another advantage is that embedding this functionality into
pidctl() let's us avoid adding another syscall just serving the single
purpose of retrieving a pidfd.
The flag argument allows to atomically set the cloexec when retrieving
pidfds.

Note that this patchset also includes Al's and David's commit to make anon
inodes unconditional. The original intention is to make it possible to use
anon inodes in core vfs functions. pidctl() has the same requirement so
David suggested I sent this in alongside this patch. Both are informed of
this.

The syscall comes with extensive testing for all functionalities.

/* References */
[1]: https://lore.kernel.org/lkml/37b17950-b130-7933-99a1-4846c61c8555@oracle.com/
[2]: https://lore.kernel.org/lkml/20181109034919.GA21681@altlinux.org/
[3]: https://lore.kernel.org/lkml/37b17950-b130-7933-99a1-4846c61c8555@oracle.com/
[4]: 3eb39f47934f9d5a3027fe00d906a45fe3a15fad
[5]: https://lore.kernel.org/lkml/20190320203910.GA2842@avx2/
[6]: https://lore.kernel.org/lkml/CALCETrXO=V=+qEdLDVPf8eCgLZiB9bOTrUfe0V-U-tUZoeoRDA@mail.gmail.com/

Thanks!
Christian

Christian Brauner (3):
  pid: add pidctl()
  signal: support pidctl() with pidfd_send_signal()
  tests: add pidctl() tests

David Howells (1):
  Make anon_inodes unconditional

 arch/arm/kvm/Kconfig                        |   1 -
 arch/arm64/kvm/Kconfig                      |   1 -
 arch/mips/kvm/Kconfig                       |   1 -
 arch/powerpc/kvm/Kconfig                    |   1 -
 arch/s390/kvm/Kconfig                       |   1 -
 arch/x86/Kconfig                            |   1 -
 arch/x86/entry/syscalls/syscall_32.tbl      |   1 +
 arch/x86/entry/syscalls/syscall_64.tbl      |   1 +
 arch/x86/kvm/Kconfig                        |   1 -
 drivers/base/Kconfig                        |   1 -
 drivers/char/tpm/Kconfig                    |   1 -
 drivers/dma-buf/Kconfig                     |   1 -
 drivers/gpio/Kconfig                        |   1 -
 drivers/iio/Kconfig                         |   1 -
 drivers/infiniband/Kconfig                  |   1 -
 drivers/vfio/Kconfig                        |   1 -
 fs/Makefile                                 |   2 +-
 fs/notify/fanotify/Kconfig                  |   1 -
 fs/notify/inotify/Kconfig                   |   1 -
 include/linux/pid.h                         |   2 +
 include/linux/pid_namespace.h               |   8 +
 include/linux/syscalls.h                    |   2 +
 include/uapi/linux/wait.h                   |  17 +
 init/Kconfig                                |  10 -
 kernel/pid.c                                | 162 ++++++
 kernel/pid_namespace.c                      |  25 +
 kernel/signal.c                             |  20 +-
 kernel/sys_ni.c                             |   3 -
 tools/testing/selftests/pidfd/Makefile      |   2 +-
 tools/testing/selftests/pidfd/pidctl_test.c | 553 ++++++++++++++++++++
 30 files changed, 782 insertions(+), 42 deletions(-)
 create mode 100644 tools/testing/selftests/pidfd/pidctl_test.c

-- 
2.21.0


^ permalink raw reply	[flat|nested] 42+ messages in thread

* [PATCH 1/4] Make anon_inodes unconditional
  2019-03-25 16:20 [PATCH 0/4] pid: add pidctl() Christian Brauner
@ 2019-03-25 16:20 ` Christian Brauner
  2019-03-25 16:20 ` [PATCH 2/4] pid: add pidctl() Christian Brauner
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 42+ messages in thread
From: Christian Brauner @ 2019-03-25 16:20 UTC (permalink / raw)
  To: jannh, khlebnikov, luto, dhowells, serge, ebiederm, linux-api,
	linux-kernel
  Cc: arnd, keescook, adobriyan, tglx, mtk.manpages, bl0pbl33p, ldv,
	akpm, oleg, nagarathnam.muthusamy, cyphar, viro, joel, dancol,
	Christian Brauner

From: David Howells <dhowells@redhat.com>

Make the anon_inodes facility unconditional so that it can be used by core
VFS code and the pidctl() syscall.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
[christian@brauner.io: adapt commit message to mention pidctl()]
Signed-off-by: Christian Brauner <christian@brauner.io>
---
 arch/arm/kvm/Kconfig       |  1 -
 arch/arm64/kvm/Kconfig     |  1 -
 arch/mips/kvm/Kconfig      |  1 -
 arch/powerpc/kvm/Kconfig   |  1 -
 arch/s390/kvm/Kconfig      |  1 -
 arch/x86/Kconfig           |  1 -
 arch/x86/kvm/Kconfig       |  1 -
 drivers/base/Kconfig       |  1 -
 drivers/char/tpm/Kconfig   |  1 -
 drivers/dma-buf/Kconfig    |  1 -
 drivers/gpio/Kconfig       |  1 -
 drivers/iio/Kconfig        |  1 -
 drivers/infiniband/Kconfig |  1 -
 drivers/vfio/Kconfig       |  1 -
 fs/Makefile                |  2 +-
 fs/notify/fanotify/Kconfig |  1 -
 fs/notify/inotify/Kconfig  |  1 -
 init/Kconfig               | 10 ----------
 18 files changed, 1 insertion(+), 27 deletions(-)

diff --git a/arch/arm/kvm/Kconfig b/arch/arm/kvm/Kconfig
index 3f5320f46de2..f591026347a5 100644
--- a/arch/arm/kvm/Kconfig
+++ b/arch/arm/kvm/Kconfig
@@ -22,7 +22,6 @@ config KVM
 	bool "Kernel-based Virtual Machine (KVM) support"
 	depends on MMU && OF
 	select PREEMPT_NOTIFIERS
-	select ANON_INODES
 	select ARM_GIC
 	select ARM_GIC_V3
 	select ARM_GIC_V3_ITS
diff --git a/arch/arm64/kvm/Kconfig b/arch/arm64/kvm/Kconfig
index a3f85624313e..a67121d419a2 100644
--- a/arch/arm64/kvm/Kconfig
+++ b/arch/arm64/kvm/Kconfig
@@ -23,7 +23,6 @@ config KVM
 	depends on OF
 	select MMU_NOTIFIER
 	select PREEMPT_NOTIFIERS
-	select ANON_INODES
 	select HAVE_KVM_CPU_RELAX_INTERCEPT
 	select HAVE_KVM_ARCH_TLB_FLUSH_ALL
 	select KVM_MMIO
diff --git a/arch/mips/kvm/Kconfig b/arch/mips/kvm/Kconfig
index 4528bc9c3cb1..eac25aef21e0 100644
--- a/arch/mips/kvm/Kconfig
+++ b/arch/mips/kvm/Kconfig
@@ -21,7 +21,6 @@ config KVM
 	depends on MIPS_FP_SUPPORT
 	select EXPORT_UASM
 	select PREEMPT_NOTIFIERS
-	select ANON_INODES
 	select KVM_GENERIC_DIRTYLOG_READ_PROTECT
 	select HAVE_KVM_VCPU_ASYNC_IOCTL
 	select KVM_MMIO
diff --git a/arch/powerpc/kvm/Kconfig b/arch/powerpc/kvm/Kconfig
index bfdde04e4905..f53997a8ca62 100644
--- a/arch/powerpc/kvm/Kconfig
+++ b/arch/powerpc/kvm/Kconfig
@@ -20,7 +20,6 @@ if VIRTUALIZATION
 config KVM
 	bool
 	select PREEMPT_NOTIFIERS
-	select ANON_INODES
 	select HAVE_KVM_EVENTFD
 	select HAVE_KVM_VCPU_ASYNC_IOCTL
 	select SRCU
diff --git a/arch/s390/kvm/Kconfig b/arch/s390/kvm/Kconfig
index 767453faacfc..1816ee48eadd 100644
--- a/arch/s390/kvm/Kconfig
+++ b/arch/s390/kvm/Kconfig
@@ -21,7 +21,6 @@ config KVM
 	prompt "Kernel-based Virtual Machine (KVM) support"
 	depends on HAVE_KVM
 	select PREEMPT_NOTIFIERS
-	select ANON_INODES
 	select HAVE_KVM_CPU_RELAX_INTERCEPT
 	select HAVE_KVM_VCPU_ASYNC_IOCTL
 	select HAVE_KVM_EVENTFD
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index c1f9b3cf437c..18f2c954464e 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -44,7 +44,6 @@ config X86
 	#
 	select ACPI_LEGACY_TABLES_LOOKUP	if ACPI
 	select ACPI_SYSTEM_POWER_STATES_SUPPORT	if ACPI
-	select ANON_INODES
 	select ARCH_32BIT_OFF_T			if X86_32
 	select ARCH_CLOCKSOURCE_DATA
 	select ARCH_CLOCKSOURCE_INIT
diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
index 72fa955f4a15..fc042419e670 100644
--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -27,7 +27,6 @@ config KVM
 	depends on X86_LOCAL_APIC
 	select PREEMPT_NOTIFIERS
 	select MMU_NOTIFIER
-	select ANON_INODES
 	select HAVE_KVM_IRQCHIP
 	select HAVE_KVM_IRQFD
 	select IRQ_BYPASS_MANAGER
diff --git a/drivers/base/Kconfig b/drivers/base/Kconfig
index 059700ea3521..03f067da12ee 100644
--- a/drivers/base/Kconfig
+++ b/drivers/base/Kconfig
@@ -174,7 +174,6 @@ source "drivers/base/regmap/Kconfig"
 config DMA_SHARED_BUFFER
 	bool
 	default n
-	select ANON_INODES
 	select IRQ_WORK
 	help
 	  This option enables the framework for buffer-sharing between
diff --git a/drivers/char/tpm/Kconfig b/drivers/char/tpm/Kconfig
index 536e55d3919f..f3e4bc490cf0 100644
--- a/drivers/char/tpm/Kconfig
+++ b/drivers/char/tpm/Kconfig
@@ -157,7 +157,6 @@ config TCG_CRB
 config TCG_VTPM_PROXY
 	tristate "VTPM Proxy Interface"
 	depends on TCG_TPM
-	select ANON_INODES
 	---help---
 	  This driver proxies for an emulated TPM (vTPM) running in userspace.
 	  A device /dev/vtpmx is provided that creates a device pair
diff --git a/drivers/dma-buf/Kconfig b/drivers/dma-buf/Kconfig
index 2e5a0faa2cb1..3fc9c2efc583 100644
--- a/drivers/dma-buf/Kconfig
+++ b/drivers/dma-buf/Kconfig
@@ -3,7 +3,6 @@ menu "DMABUF options"
 config SYNC_FILE
 	bool "Explicit Synchronization Framework"
 	default n
-	select ANON_INODES
 	select DMA_SHARED_BUFFER
 	---help---
 	  The Sync File Framework adds explicit syncronization via
diff --git a/drivers/gpio/Kconfig b/drivers/gpio/Kconfig
index 3f50526a771f..0f91600c27ae 100644
--- a/drivers/gpio/Kconfig
+++ b/drivers/gpio/Kconfig
@@ -12,7 +12,6 @@ config ARCH_HAVE_CUSTOM_GPIO_H
 
 menuconfig GPIOLIB
 	bool "GPIO Support"
-	select ANON_INODES
 	help
 	  This enables GPIO support through the generic GPIO library.
 	  You only need to enable this, if you also want to enable
diff --git a/drivers/iio/Kconfig b/drivers/iio/Kconfig
index d08aeb41cd07..1dec0fecb6ef 100644
--- a/drivers/iio/Kconfig
+++ b/drivers/iio/Kconfig
@@ -4,7 +4,6 @@
 
 menuconfig IIO
 	tristate "Industrial I/O support"
-	select ANON_INODES
 	help
 	  The industrial I/O subsystem provides a unified framework for
 	  drivers for many different types of embedded sensors using a
diff --git a/drivers/infiniband/Kconfig b/drivers/infiniband/Kconfig
index a1fb840de45d..d318bab25860 100644
--- a/drivers/infiniband/Kconfig
+++ b/drivers/infiniband/Kconfig
@@ -25,7 +25,6 @@ config INFINIBAND_USER_MAD
 
 config INFINIBAND_USER_ACCESS
 	tristate "InfiniBand userspace access (verbs and CM)"
-	select ANON_INODES
 	depends on MMU
 	---help---
 	  Userspace InfiniBand access support.  This enables the
diff --git a/drivers/vfio/Kconfig b/drivers/vfio/Kconfig
index 9de5ed38da83..3798d77d131c 100644
--- a/drivers/vfio/Kconfig
+++ b/drivers/vfio/Kconfig
@@ -22,7 +22,6 @@ menuconfig VFIO
 	tristate "VFIO Non-Privileged userspace driver framework"
 	depends on IOMMU_API
 	select VFIO_IOMMU_TYPE1 if (X86 || S390 || ARM || ARM64)
-	select ANON_INODES
 	help
 	  VFIO provides a framework for secure userspace device drivers.
 	  See Documentation/vfio.txt for more details.
diff --git a/fs/Makefile b/fs/Makefile
index 427fec226fae..35945f8139e6 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -25,7 +25,7 @@ obj-$(CONFIG_PROC_FS) += proc_namespace.o
 
 obj-y				+= notify/
 obj-$(CONFIG_EPOLL)		+= eventpoll.o
-obj-$(CONFIG_ANON_INODES)	+= anon_inodes.o
+obj-y				+= anon_inodes.o
 obj-$(CONFIG_SIGNALFD)		+= signalfd.o
 obj-$(CONFIG_TIMERFD)		+= timerfd.o
 obj-$(CONFIG_EVENTFD)		+= eventfd.o
diff --git a/fs/notify/fanotify/Kconfig b/fs/notify/fanotify/Kconfig
index 735bfb2e9190..521dc91d2cb5 100644
--- a/fs/notify/fanotify/Kconfig
+++ b/fs/notify/fanotify/Kconfig
@@ -1,7 +1,6 @@
 config FANOTIFY
 	bool "Filesystem wide access notification"
 	select FSNOTIFY
-	select ANON_INODES
 	select EXPORTFS
 	default n
 	---help---
diff --git a/fs/notify/inotify/Kconfig b/fs/notify/inotify/Kconfig
index b981fc0c8379..0161c74e76e2 100644
--- a/fs/notify/inotify/Kconfig
+++ b/fs/notify/inotify/Kconfig
@@ -1,6 +1,5 @@
 config INOTIFY_USER
 	bool "Inotify support for userspace"
-	select ANON_INODES
 	select FSNOTIFY
 	default y
 	---help---
diff --git a/init/Kconfig b/init/Kconfig
index 4592bf7997c0..be8f97e37a76 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1171,9 +1171,6 @@ config LD_DEAD_CODE_DATA_ELIMINATION
 config SYSCTL
 	bool
 
-config ANON_INODES
-	bool
-
 config HAVE_UID16
 	bool
 
@@ -1378,14 +1375,12 @@ config HAVE_FUTEX_CMPXCHG
 config EPOLL
 	bool "Enable eventpoll support" if EXPERT
 	default y
-	select ANON_INODES
 	help
 	  Disabling this option will cause the kernel to be built without
 	  support for epoll family of system calls.
 
 config SIGNALFD
 	bool "Enable signalfd() system call" if EXPERT
-	select ANON_INODES
 	default y
 	help
 	  Enable the signalfd() system call that allows to receive signals
@@ -1395,7 +1390,6 @@ config SIGNALFD
 
 config TIMERFD
 	bool "Enable timerfd() system call" if EXPERT
-	select ANON_INODES
 	default y
 	help
 	  Enable the timerfd() system call that allows to receive timer
@@ -1405,7 +1399,6 @@ config TIMERFD
 
 config EVENTFD
 	bool "Enable eventfd() system call" if EXPERT
-	select ANON_INODES
 	default y
 	help
 	  Enable the eventfd() system call that allows to receive both
@@ -1516,7 +1509,6 @@ config KALLSYMS_BASE_RELATIVE
 # syscall, maps, verifier
 config BPF_SYSCALL
 	bool "Enable bpf() system call"
-	select ANON_INODES
 	select BPF
 	select IRQ_WORK
 	default n
@@ -1533,7 +1525,6 @@ config BPF_JIT_ALWAYS_ON
 
 config USERFAULTFD
 	bool "Enable userfaultfd() system call"
-	select ANON_INODES
 	depends on MMU
 	help
 	  Enable the userfaultfd() system call that allows to intercept and
@@ -1600,7 +1591,6 @@ config PERF_EVENTS
 	bool "Kernel performance events and counters"
 	default y if PROFILING
 	depends on HAVE_PERF_EVENTS
-	select ANON_INODES
 	select IRQ_WORK
 	select SRCU
 	help
-- 
2.21.0


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH 2/4] pid: add pidctl()
  2019-03-25 16:20 [PATCH 0/4] pid: add pidctl() Christian Brauner
  2019-03-25 16:20 ` [PATCH 1/4] Make anon_inodes unconditional Christian Brauner
@ 2019-03-25 16:20 ` Christian Brauner
  2019-03-25 17:20   ` Mika Penttilä
  2019-03-25 18:18   ` Jann Horn
  2019-03-25 16:20 ` [PATCH 3/4] signal: support pidctl() with pidfd_send_signal() Christian Brauner
                   ` (3 subsequent siblings)
  5 siblings, 2 replies; 42+ messages in thread
From: Christian Brauner @ 2019-03-25 16:20 UTC (permalink / raw)
  To: jannh, khlebnikov, luto, dhowells, serge, ebiederm, linux-api,
	linux-kernel
  Cc: arnd, keescook, adobriyan, tglx, mtk.manpages, bl0pbl33p, ldv,
	akpm, oleg, nagarathnam.muthusamy, cyphar, viro, joel, dancol,
	Christian Brauner

The pidctl() syscalls builds on, extends, and improves translate_pid() [4].
I quote Konstantins original patchset first that has already been acked and
picked up by Eric before and whose functionality is preserved in this
syscall:

"Each process have different pids, one for each pid namespace it belongs.
 When interaction happens within single pid-ns translation isn't required.
 More complicated scenarios needs special handling.

 For example:
 - reading pid-files or logs written inside container with pid namespace
 - writing logs with internal pids outside container for pushing them into
 - attaching with ptrace to tasks from different pid namespace

 Generally speaking, any cross pid-ns API with pids needs translation.

 Currently there are several interfaces that could be used here:

 Pid namespaces are identified by device and inode of /proc/[pid]/ns/pid.

 Pids for nested pid namespaces are shown in file /proc/[pid]/status.
 In some cases pid translation could be easily done using this information.
 Backward translation requires scanning all tasks and becomes really
 complicated for deeper namespace nesting.

 Unix socket automatically translates pid attached to SCM_CREDENTIALS.
 This requires CAP_SYS_ADMIN for sending arbitrary pids and entering
 into pid namespace, this expose process and could be insecure."

The original patchset allowed two distinct operations implicitly:
- discovering whether pid namespaces (pidns) have a parent-child
  relationship
- translating a pid from a source pidns into a target pidns

Both tasks are accomplished in the original patchset by passing a pid
along. If the pid argument is passed as 1 the relationship between two pid
namespaces can be discovered.
The syscall will gain a lot clearer syntax and will be easier to use for
userspace if the task it is asked to perform is passed through a
command argument. Additionally, it allows us to remove an intrinsic race
caused by using the pid argument as a way to discover the relationship
between pid namespaces.
This patch introduces three commands:

/* PIDCMD_QUERY_PID */
PIDCMD_QUERY_PID allows to translate a pid between pid namespaces.
Given a source pid namespace fd return the pid of the process in the target
namespace:
1. pidctl(PIDCMD_QUERY_PID, pid, source_fd, -1, 0)
  - retrieve pidns identified by source_fd
  - retrieve struct pid identifed by pid in pidns identified by source_fd
  - retrieve callers pidns
  - return pid in callers pidns

2. pidctl(PIDCMD_QUERY_PID, pid, -1, target_fd, 0)
  - retrieve callers pidns
  - retrieve struct pid identifed by pid in callers pidns
  - retrieve pidns identified by target_fd
  - return pid in pidns identified by target_fd

3. pidctl(PIDCMD_QUERY_PID, 1, source_fd, -1, 0)
  - retrieve pidns identified by source_fd
  - retrieve struct pid identifed by init task in pidns identified by source_fd
  - retrieve callers pidns
  - return pid of init task of pidns identified by source_fd in callers pidns

4. pidctl(PIDCMD_QUERY_PID, pid, source_fd, target_fd, 0)
  - retrieve pidns identified by source_fd
  - retrieve struct pid identifed by pid in pidns identified by source_fd
  - retrieve pidns identified by target_fd
  - check whether struct pid can be found in pidns identified by target_fd
  - return pid in pidns identified by target_fd

/* PIDCMD_QUERY_PIDNS */
PIDCMD_QUERY_PIDNS allows to determine the relationship between pid
namespaces.
In the original version of the pachset passing pid as 1 would allow to
deterimine the relationship between the pid namespaces. This is inherhently
racy. If pid 1 inside a pid namespace has died it would report false
negatives. For example, if pid 1 inside of the target pid namespace already
died, it would report that the target pid namespace cannot be reached from
the source pid namespace because it couldn't find the pid inside of the
target pid namespace and thus falsely report to the user that the two pid
namespaces are not related. This problem is simple to avoid. In the new
version we simply walk the list of ancestors and check whether the
namespace are related to each other. By doing it this way we can reliably
report what the relationship between two pid namespace file descriptors
looks like.

1. pidctl(PIDCMD_QUERY_PIDNS, 0, ns_fd1, ns_fd1, 0) == 0
   - pidns_of(ns_fd1) and pidns_of(ns_fd2) are unrelated to each other

2. pidctl(PIDCMD_QUERY_PIDNS, 0, ns_fd1, ns_fd2, 0) == 1
   - pidns_of(ns_fd1) == pidns_of(ns_fd2)

3. pidctl(PIDCMD_QUERY_PIDNS, 0, ns_fd1, ns_fd2, 0) == 2
   - pidns_of(ns_fd1) is ancestor of pidns_of(ns_fd2)

4. pidctl(PIDCMD_QUERY_PIDNS, 0, ns_fd1, ns_fd2, 0) == 3
   - pidns_of(ns_fd2) is ancestor of pidns_of(ns_fd1)

These two commands - PIDCMD_QUERY_PID and PIDCMD_QUERY_PIDNS - cover and
improve the functionality expressed implicitly in translate_pid() before.

/* PIDCMD_GET_PIDFD */
This command allows to retrieve file descriptors for processes and removes
the dependency of pidfds and thereby the pidfd_send_signal() syscall on
procfs. First, multiple people have expressed a desire to do this even when
pidfd_send_signal() was merged. It is even recorded in the commit message
for pidfd_send_signal() itself
(cf. commit 3eb39f47934f9d5a3027fe00d906a45fe3a15fad):
Q-06: (Andrew Morton)
      Is there a cleaner way of obtaining the fd? Another syscall perhaps.
A-06: Userspace can already trivially retrieve file descriptors from procfs
      so this is something that we will need to support anyway. Hence,
      there's no immediate need to add another syscalls just to make
      pidfd_send_signal() not dependent on the presence of procfs. However,
      adding a syscalls to get such file descriptors is planned for a
      future patchset).
Alexey made a similar request (cf. [2]).
Additionally, Andy made an additional, idependent argument that we should
go forward with non-proc-dirfd file descriptors for the sake of security
and extensibility (cf. [3]).

The pidfds are not associated with a specific pid namespaces but rather
only with struct pid. What the pidctl() syscall enforces is that when a
caller wants to retrieve a pidfd file descriptor for a pid in a given
target pid namespace the caller needs to proof it has access to a pid
namespace in which the pid he wants a pidfd for a) exists and b) that this
pid namespace is an ancestor of the target pid namespace, i.e. that the pid
can be translated into the target pid namespaces.

1. pidctl(PIDCMD_GET_PIDFD, pid, source_fd, -1, PIDCTL_CLOEXEC)
  - retrieve pidns identified by source_fd
  - retrieve struct pid identifed by pid in pidns identified by source_fd
  - retrieve callers pidns
  - return pidfd

2. pidctl(PIDCMD_GET_PIDFD, pid, -1, target_fd, PIDCTL_CLOEXEC)
  - retrieve callers pidns
  - retrieve struct pid identifed by pid in callers pidns
  - retrieve pidns identified by target_fd
  - return pidfd

3. pidctl(PIDCMD_GET_PIDFD, 1, source_fd, -1, PIDCTL_CLOEXEC)
  - retrieve pidns identified by source_fd
  - retrieve struct pid identifed by init task in pidns identified by
    source_fd
  - retrieve callers pidns
  - return pidfd

4. pidctl(PIDCMD_GET_PIDFD, pid, source_fd, target_fd, PIDCTL_CLOEXEC)
  - retrieve pidns identified by source_fd
  - retrieve struct pid identifed by pid in pidns identified by source_fd
  - retrieve pidns identified by target_fd
  - check whether struct pid can be found in pidns identified by target_fd
  - return pidfd

These pidfds are allocated using anon_inode_getfd() and can be used with
the pidfd_send_signal() syscall. They are not dirfds and as such have the
advantage that we can make them pollable or readable in the future if we
see a need to do so (which I'm pretty sure we will eventually). Currently
they do not support any advanced operations.

/* References */
[1]: https://lore.kernel.org/lkml/20181228233725.722tdfgijxcssg76@brauner.io/
[2]: https://lore.kernel.org/lkml/20190320203910.GA2842@avx2/
[3]: https://lore.kernel.org/lkml/CALCETrXO=V=+qEdLDVPf8eCgLZiB9bOTrUfe0V-U-tUZoeoRDA@mail.gmail.com/
[4]: https://lore.kernel.org/lkml/20181109034919.GA21681@altlinux.org/

Signed-off-by: Christian Brauner <christian@brauner.io>
Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Serge Hallyn <serge@hallyn.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Jann Horn <jannh@google.com
Cc: "Michael Kerrisk (man-pages)" <mtk.manpages@gmail.com>
Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Cc: Jonathan Kowalski <bl0pbl33p@gmail.com>
Cc: "Dmitry V. Levin" <ldv@altlinux.org>
Cc: Andy Lutomirsky <luto@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Nagarathnam Muthusamy <nagarathnam.muthusamy@oracle.com>
Cc: Aleksa Sarai <cyphar@cyphar.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
---
 arch/x86/entry/syscalls/syscall_32.tbl |   1 +
 arch/x86/entry/syscalls/syscall_64.tbl |   1 +
 include/linux/pid.h                    |   2 +
 include/linux/pid_namespace.h          |   8 ++
 include/linux/syscalls.h               |   2 +
 include/uapi/linux/wait.h              |  17 +++
 kernel/pid.c                           | 162 +++++++++++++++++++++++++
 kernel/pid_namespace.c                 |  25 ++++
 8 files changed, 218 insertions(+)

diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index 1f9607ed087c..afc5a8997140 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -433,3 +433,4 @@
 425	i386	io_uring_setup		sys_io_uring_setup		__ia32_sys_io_uring_setup
 426	i386	io_uring_enter		sys_io_uring_enter		__ia32_sys_io_uring_enter
 427	i386	io_uring_register	sys_io_uring_register		__ia32_sys_io_uring_register
+428	i386	pidctl			sys_pidctl			__ia32_sys_pidctl
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index 92ee0b4378d4..8ef25d9a003d 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -349,6 +349,7 @@
 425	common	io_uring_setup		__x64_sys_io_uring_setup
 426	common	io_uring_enter		__x64_sys_io_uring_enter
 427	common	io_uring_register	__x64_sys_io_uring_register
+428	common	pidctl			__x64_sys_pidctl
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/include/linux/pid.h b/include/linux/pid.h
index b6f4ba16065a..3c8ef5a199ca 100644
--- a/include/linux/pid.h
+++ b/include/linux/pid.h
@@ -66,6 +66,8 @@ struct pid
 
 extern struct pid init_struct_pid;
 
+extern const struct file_operations pidfd_fops;
+
 static inline struct pid *get_pid(struct pid *pid)
 {
 	if (pid)
diff --git a/include/linux/pid_namespace.h b/include/linux/pid_namespace.h
index 49538b172483..c5aae8151287 100644
--- a/include/linux/pid_namespace.h
+++ b/include/linux/pid_namespace.h
@@ -64,6 +64,8 @@ extern struct pid_namespace *copy_pid_ns(unsigned long flags,
 extern void zap_pid_ns_processes(struct pid_namespace *pid_ns);
 extern int reboot_pid_ns(struct pid_namespace *pid_ns, int cmd);
 extern void put_pid_ns(struct pid_namespace *ns);
+extern int pidnscmp(struct pid_namespace *ancestor,
+		    struct pid_namespace *descendant);
 
 #else /* !CONFIG_PID_NS */
 #include <linux/err.h>
@@ -94,6 +96,12 @@ static inline int reboot_pid_ns(struct pid_namespace *pid_ns, int cmd)
 {
 	return 0;
 }
+
+static inline int pidnscmp(struct pid_namespace *ancestor,
+			   struct pid_namespace *descendant)
+{
+	return 0;
+}
 #endif /* CONFIG_PID_NS */
 
 extern struct pid_namespace *task_active_pid_ns(struct task_struct *tsk);
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index e446806a561f..a4c8c59f7c8f 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -929,6 +929,8 @@ asmlinkage long sys_clock_adjtime32(clockid_t which_clock,
 				struct old_timex32 __user *tx);
 asmlinkage long sys_syncfs(int fd);
 asmlinkage long sys_setns(int fd, int nstype);
+asmlinkage long sys_pidctl(unsigned int cmd, pid_t pid, int source, int target,
+			   unsigned int flags);
 asmlinkage long sys_sendmmsg(int fd, struct mmsghdr __user *msg,
 			     unsigned int vlen, unsigned flags);
 asmlinkage long sys_process_vm_readv(pid_t pid,
diff --git a/include/uapi/linux/wait.h b/include/uapi/linux/wait.h
index ac49a220cf2a..4f3dfb528b20 100644
--- a/include/uapi/linux/wait.h
+++ b/include/uapi/linux/wait.h
@@ -18,5 +18,22 @@
 #define P_PID		1
 #define P_PGID		2
 
+/* Commands to pass to pidctl() */
+enum pidcmd {
+	PIDCMD_QUERY_PID   = 0, /* Get pid in target pid namespace */
+	PIDCMD_QUERY_PIDNS = 1, /* Determine relationship between pid namespaces */
+	PIDCMD_GET_PIDFD   = 2, /* Get pidfd for a process */
+};
+
+/* Return values of PIDCMD_QUERY_PIDNS */
+enum pidcmd_query_pidns {
+	PIDNS_UNRELATED          = 0, /* The pid namespaces are unrelated */
+	PIDNS_EQUAL              = 1, /* The pid namespaces are equal */
+	PIDNS_SOURCE_IS_ANCESTOR = 2, /* Source pid namespace is ancestor of target pid namespace */
+	PIDNS_TARGET_IS_ANCESTOR = 3, /* Target pid namespace is ancestor of source pid namespace */
+};
+
+/* Flags to pass to pidctl() */
+#define PIDCTL_CLOEXEC 1
 
 #endif /* _UAPI_LINUX_WAIT_H */
diff --git a/kernel/pid.c b/kernel/pid.c
index 20881598bdfa..b3702bd38bdd 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -26,6 +26,7 @@
  *
  */
 
+#include <linux/anon_inodes.h>
 #include <linux/mm.h>
 #include <linux/export.h>
 #include <linux/slab.h>
@@ -40,6 +41,7 @@
 #include <linux/proc_fs.h>
 #include <linux/sched/task.h>
 #include <linux/idr.h>
+#include <linux/wait.h>
 
 struct pid init_struct_pid = {
 	.count 		= ATOMIC_INIT(1),
@@ -451,6 +453,166 @@ struct pid *find_ge_pid(int nr, struct pid_namespace *ns)
 	return idr_get_next(&ns->idr, &nr);
 }
 
+static int pidfd_release(struct inode *inode, struct file *file)
+{
+	struct pid *pid = file->private_data;
+
+	if (pid) {
+		file->private_data = NULL;
+		put_pid(pid);
+	}
+
+	return 0;
+}
+
+const struct file_operations pidfd_fops = {
+	.release = pidfd_release,
+	.llseek	 = no_llseek,
+};
+
+static int pidfd_create_fd(struct pid *pid, unsigned int o_flags)
+{
+	int fd;
+
+	fd = anon_inode_getfd("pidfd", &pidfd_fops, get_pid(pid),
+			      O_RDWR | o_flags);
+	if (fd < 0)
+		put_pid(pid);
+
+	return fd;
+}
+
+static struct pid_namespace *get_pid_ns_by_fd(int fd)
+{
+	struct pid_namespace *pidns = ERR_PTR(-EINVAL);
+
+	if (fd >= 0) {
+#ifdef CONFIG_PID_NS
+		struct ns_common *ns;
+		struct file *file = proc_ns_fget(fd);
+		if (IS_ERR(file))
+			return ERR_CAST(file);
+
+		ns = get_proc_ns(file_inode(file));
+		if (ns->ops->type == CLONE_NEWPID)
+			pidns = get_pid_ns(
+				container_of(ns, struct pid_namespace, ns));
+
+		fput(file);
+#endif
+	} else {
+		pidns = task_active_pid_ns(current);
+	}
+
+	return pidns;
+}
+
+static int pidns_related(struct pid_namespace *source,
+			 struct pid_namespace *target)
+{
+	int query;
+
+	query = pidnscmp(source, target);
+	switch (query) {
+	case 0:
+		return PIDNS_EQUAL;
+	case 1:
+		return PIDNS_SOURCE_IS_ANCESTOR;
+	}
+
+	query = pidnscmp(target, source);
+	if (query == 1)
+		return PIDNS_TARGET_IS_ANCESTOR;
+
+	return PIDNS_UNRELATED;
+}
+
+/*
+ * pidctl - perform operations on pids
+ * @cmd:    command to execute
+ * @pid:    pid for translation
+ * @source: pid-ns file descriptor or -1 for active namespace
+ * @target: pid-ns file descriptor or -1 for active namesapce
+ * @flags:  flags to pass
+ *
+ * If cmd is PIDCMD_QUERY_PID translates pid between pid namespaces
+ * identified by @source and @target. Returns pid if process has pid in
+ * @target, 0 if process has no pid in @target, -ESRCH if process does
+ * not have a pid in @source.
+ *
+ * If cmd is PIDCMD_QUERY_PIDNS determines the relations between two pid
+ * namespaces. Returns 2 if @source is an ancestor pid namespace
+ * of @target, 1 if @source and @target refer to the same pid namespace,
+ * 3 if @target is an ancestor pid namespace of @source, 0 if they have
+ * no parent-child relationship in either direction.
+ *
+ * If cmd is PIDCMD_GET_PIDFD returns pidfd for process in @target pid
+ * namespace. Returns pidfd if process has pid in @target, -ESRCH if
+ * process does not have a pid in @source, -ENOENT if process does not have a
+ * pid in @target pid namespace.
+ *
+ */
+SYSCALL_DEFINE5(pidctl, unsigned int, cmd, pid_t, pid, int, source, int, target,
+		unsigned int, flags)
+{
+	struct pid_namespace *source_ns = NULL, *target_ns = NULL;
+	struct pid *struct_pid;
+	pid_t result;
+
+	switch (cmd) {
+	case PIDCMD_QUERY_PIDNS:
+		if (pid != 0)
+			return -EINVAL;
+		pid = 1;
+		/* fall through */
+	case PIDCMD_QUERY_PID:
+		if (flags != 0)
+			return -EINVAL;
+		break;
+	case PIDCMD_GET_PIDFD:
+		if (flags & ~PIDCTL_CLOEXEC)
+			return -EINVAL;
+		break;
+	default:
+		return -EOPNOTSUPP;
+	}
+
+	source_ns = get_pid_ns_by_fd(source);
+	result = PTR_ERR(source_ns);
+	if (IS_ERR(source_ns))
+		goto err_source;
+
+	target_ns = get_pid_ns_by_fd(target);
+	result = PTR_ERR(target_ns);
+	if (IS_ERR(target_ns))
+		goto err_target;
+
+	if (cmd == PIDCMD_QUERY_PIDNS) {
+		result = pidns_related(source_ns, target_ns);
+	} else {
+		rcu_read_lock();
+		struct_pid = find_pid_ns(pid, source_ns);
+		result = struct_pid ? pid_nr_ns(struct_pid, target_ns) : -ESRCH;
+		rcu_read_unlock();
+
+		if (cmd == PIDCMD_GET_PIDFD) {
+			int cloexec = (flags & PIDCTL_CLOEXEC) ? O_CLOEXEC : 0;
+			if (result > 0)
+				result = pidfd_create_fd(struct_pid, cloexec);
+			else if (result == 0)
+				result = -ENOENT;
+		}
+	}
+
+	if (target)
+		put_pid_ns(target_ns);
+err_target:
+	if (source)
+		put_pid_ns(source_ns);
+err_source:
+	return result;
+}
+
 void __init pid_idr_init(void)
 {
 	/* Verify no one has done anything silly: */
diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
index aa6e72fb7c08..1c863fb3d55a 100644
--- a/kernel/pid_namespace.c
+++ b/kernel/pid_namespace.c
@@ -429,6 +429,31 @@ static struct ns_common *pidns_get_parent(struct ns_common *ns)
 	return &get_pid_ns(pid_ns)->ns;
 }
 
+/**
+ * pidnscmp - Determine if @ancestor is ancestor of @descendant
+ * @ancestor:   pidns suspected to be the ancestor of @descendant
+ * @descendant: pidns suspected to be the descendant of @ancestor
+ *
+ * Returns -1 if @ancestor is not an ancestor of @descendant,
+ * 0 if @ancestor is the same pidns as @descendant, 1 if @ancestor
+ * is an ancestor of @descendant.
+ */
+int pidnscmp(struct pid_namespace *ancestor, struct pid_namespace *descendant)
+{
+	if (ancestor == descendant)
+		return 0;
+
+	for (;;) {
+		if (!descendant)
+			return -1;
+		if (descendant == ancestor)
+			break;
+		descendant = descendant->parent;
+	}
+
+	return 1;
+}
+
 static struct user_namespace *pidns_owner(struct ns_common *ns)
 {
 	return to_pid_ns(ns)->user_ns;
-- 
2.21.0


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH 3/4] signal: support pidctl() with pidfd_send_signal()
  2019-03-25 16:20 [PATCH 0/4] pid: add pidctl() Christian Brauner
  2019-03-25 16:20 ` [PATCH 1/4] Make anon_inodes unconditional Christian Brauner
  2019-03-25 16:20 ` [PATCH 2/4] pid: add pidctl() Christian Brauner
@ 2019-03-25 16:20 ` Christian Brauner
  2019-03-25 18:28   ` Jonathan Kowalski
  2019-03-25 18:39   ` Jann Horn
  2019-03-25 16:20 ` [PATCH 4/4] tests: add pidctl() tests Christian Brauner
                   ` (2 subsequent siblings)
  5 siblings, 2 replies; 42+ messages in thread
From: Christian Brauner @ 2019-03-25 16:20 UTC (permalink / raw)
  To: jannh, khlebnikov, luto, dhowells, serge, ebiederm, linux-api,
	linux-kernel
  Cc: arnd, keescook, adobriyan, tglx, mtk.manpages, bl0pbl33p, ldv,
	akpm, oleg, nagarathnam.muthusamy, cyphar, viro, joel, dancol,
	Christian Brauner

Let pidfd_send_signal() use pidfds retrieved via pidctl(). With this patch
pidfd_send_signal() becomes independent of procfs. This fullfils the
request made when we merged the pidfd_send_signal() patchset. The
pidfd_send_signal() syscall is now always available allowing for it to be
used by users without procfs mounted or even users without procfs support
compiled into the kernel.

Signed-off-by: Christian Brauner <christian@brauner.io>
Reviewed-by: David Howells <dhowells@redhat.com>
Acked-by: Serge Hallyn <serge@hallyn.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Jann Horn <jannh@google.com
Cc: "Michael Kerrisk (man-pages)" <mtk.manpages@gmail.com>
Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Cc: Jonathan Kowalski <bl0pbl33p@gmail.com>
Cc: "Dmitry V. Levin" <ldv@altlinux.org>
Cc: Andy Lutomirsky <luto@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Nagarathnam Muthusamy <nagarathnam.muthusamy@oracle.com>
Cc: Aleksa Sarai <cyphar@cyphar.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
---
 kernel/signal.c | 20 +++++++++-----------
 kernel/sys_ni.c |  3 ---
 2 files changed, 9 insertions(+), 14 deletions(-)

diff --git a/kernel/signal.c b/kernel/signal.c
index b7953934aa99..d77183be1677 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -3513,7 +3513,6 @@ SYSCALL_DEFINE2(kill, pid_t, pid, int, sig)
 	return kill_something_info(sig, &info, pid);
 }
 
-#ifdef CONFIG_PROC_FS
 /*
  * Verify that the signaler and signalee either are in the same pid namespace
  * or that the signaler's pid namespace is an ancestor of the signalee's pid
@@ -3521,16 +3520,13 @@ SYSCALL_DEFINE2(kill, pid_t, pid, int, sig)
  */
 static bool access_pidfd_pidns(struct pid *pid)
 {
+	int ret;
 	struct pid_namespace *active = task_active_pid_ns(current);
 	struct pid_namespace *p = ns_of_pid(pid);
 
-	for (;;) {
-		if (!p)
-			return false;
-		if (p == active)
-			break;
-		p = p->parent;
-	}
+	ret = pidnscmp(active, p);
+	if (ret < 0)
+		return false;
 
 	return true;
 }
@@ -3581,12 +3577,15 @@ SYSCALL_DEFINE4(pidfd_send_signal, int, pidfd, int, sig,
 	if (flags)
 		return -EINVAL;
 
-	f = fdget_raw(pidfd);
+	f = fdget(pidfd);
 	if (!f.file)
 		return -EBADF;
 
 	/* Is this a pidfd? */
-	pid = tgid_pidfd_to_pid(f.file);
+	if (f.file->f_op == &pidfd_fops)
+		pid = f.file->private_data;
+	else
+		pid = tgid_pidfd_to_pid(f.file);
 	if (IS_ERR(pid)) {
 		ret = PTR_ERR(pid);
 		goto err;
@@ -3625,7 +3624,6 @@ SYSCALL_DEFINE4(pidfd_send_signal, int, pidfd, int, sig,
 	fdput(f);
 	return ret;
 }
-#endif /* CONFIG_PROC_FS */
 
 static int
 do_send_specific(pid_t tgid, pid_t pid, int sig, struct kernel_siginfo *info)
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index d21f4befaea4..4d9ae5ea6caf 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -167,9 +167,6 @@ COND_SYSCALL(syslog);
 
 /* kernel/sched/core.c */
 
-/* kernel/signal.c */
-COND_SYSCALL(pidfd_send_signal);
-
 /* kernel/sys.c */
 COND_SYSCALL(setregid);
 COND_SYSCALL(setgid);
-- 
2.21.0


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH 4/4] tests: add pidctl() tests
  2019-03-25 16:20 [PATCH 0/4] pid: add pidctl() Christian Brauner
                   ` (2 preceding siblings ...)
  2019-03-25 16:20 ` [PATCH 3/4] signal: support pidctl() with pidfd_send_signal() Christian Brauner
@ 2019-03-25 16:20 ` Christian Brauner
  2019-03-25 16:48 ` [PATCH 0/4] pid: add pidctl() Daniel Colascione
  2019-03-25 16:56 ` David Howells
  5 siblings, 0 replies; 42+ messages in thread
From: Christian Brauner @ 2019-03-25 16:20 UTC (permalink / raw)
  To: jannh, khlebnikov, luto, dhowells, serge, ebiederm, linux-api,
	linux-kernel
  Cc: arnd, keescook, adobriyan, tglx, mtk.manpages, bl0pbl33p, ldv,
	akpm, oleg, nagarathnam.muthusamy, cyphar, viro, joel, dancol,
	Christian Brauner

This adds test cases for all three subcommands and verifies that they
succeed and fail as expected.
Additionally, the tests verify that pidctl() pidfds are correctly useable
with pidfd_send_signal().

Signed-off-by: Christian Brauner <christian@brauner.io>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Jann Horn <jannh@google.com
Cc: David Howells <dhowells@redhat.com>
Cc: "Michael Kerrisk (man-pages)" <mtk.manpages@gmail.com>
Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Cc: Jonathan Kowalski <bl0pbl33p@gmail.com>
Cc: "Dmitry V. Levin" <ldv@altlinux.org>
Cc: Andy Lutomirsky <luto@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Nagarathnam Muthusamy <nagarathnam.muthusamy@oracle.com>
Cc: Aleksa Sarai <cyphar@cyphar.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
---
 tools/testing/selftests/pidfd/Makefile      |   2 +-
 tools/testing/selftests/pidfd/pidctl_test.c | 553 ++++++++++++++++++++
 2 files changed, 554 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/pidfd/pidctl_test.c

diff --git a/tools/testing/selftests/pidfd/Makefile b/tools/testing/selftests/pidfd/Makefile
index deaf8073bc06..29dfa29b3afa 100644
--- a/tools/testing/selftests/pidfd/Makefile
+++ b/tools/testing/selftests/pidfd/Makefile
@@ -1,6 +1,6 @@
 CFLAGS += -g -I../../../../usr/include/
 
-TEST_GEN_PROGS := pidfd_test
+TEST_GEN_PROGS := pidfd_test pidctl_test
 
 include ../lib.mk
 
diff --git a/tools/testing/selftests/pidfd/pidctl_test.c b/tools/testing/selftests/pidfd/pidctl_test.c
new file mode 100644
index 000000000000..e274be42acfb
--- /dev/null
+++ b/tools/testing/selftests/pidfd/pidctl_test.c
@@ -0,0 +1,553 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#define _GNU_SOURCE
+#include <errno.h>
+#include <fcntl.h>
+#include <linux/types.h>
+#include <linux/wait.h>
+#include <sched.h>
+#include <signal.h>
+#include <stdbool.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <syscall.h>
+#include <sys/mount.h>
+#include <sys/prctl.h>
+#include <sys/wait.h>
+#include <unistd.h>
+
+#include "../kselftest.h"
+
+static int parent_pidns_fd = -1;
+static pid_t parent_pidns_pid = 0;
+
+static int child_pidns_fd = -1;
+static pid_t child_pidns_pid = 0;
+
+static int cousin_pidns_fd = -1;
+static pid_t cousin_pidns_pid = 0;
+
+static bool pidns_supported = false;
+
+static inline int sys_pidfd_send_signal(int pidfd, int sig, siginfo_t *info,
+					unsigned int flags)
+{
+	return syscall(__NR_pidfd_send_signal, pidfd, sig, info, flags);
+}
+
+static inline int sys_pidctl(unsigned int cmd, pid_t pid, int source,
+			     int target, unsigned int flags)
+{
+	return syscall(__NR_pidctl, cmd, pid, source, target, flags);
+}
+
+struct cr_clone_arg {
+	char stack[128] __attribute__((aligned(16)));
+	char stack_ptr[0];
+};
+
+static int child_pidns_creator(void *args)
+{
+	(void)prctl(PR_SET_PDEATHSIG, SIGKILL);
+	while (1)
+		sleep(5);
+
+	exit(0);
+}
+
+static int prepare_pid_namespaces(void)
+{
+	char path[512];
+	struct cr_clone_arg ca;
+	pid_t pid;
+
+	parent_pidns_fd = open("/proc/self/ns/pid", O_RDONLY | O_CLOEXEC);
+	if (parent_pidns_fd < 0) {
+		ksft_print_msg("failed to open current pid namespace");
+		return -1;
+	}
+	parent_pidns_pid = getpid();
+
+	pid = clone(child_pidns_creator, ca.stack_ptr, CLONE_NEWPID | SIGCHLD,
+		    NULL);
+	if (pid < 0) {
+		ksft_print_msg("failed to clone child-pidns process in new pid namespace");
+		return -1;
+	}
+
+	snprintf(path, sizeof(path), "/proc/%d/ns/pid", pid);
+
+	child_pidns_fd = open(path, O_RDONLY | O_CLOEXEC);
+	if (child_pidns_fd < 0) {
+		ksft_print_msg("failed to open pid namespace");
+		return -1;
+	}
+	child_pidns_pid = pid;
+
+	pid = clone(child_pidns_creator, ca.stack_ptr, CLONE_NEWPID | SIGCHLD,
+		    NULL);
+	if (pid < 0) {
+		ksft_print_msg("failed to clone cousin-pidns process in new pid namespace");
+		return -1;
+	}
+
+	snprintf(path, sizeof(path), "/proc/%d/ns/pid", pid);
+
+	cousin_pidns_fd = open(path, O_RDONLY | O_CLOEXEC);
+	if (cousin_pidns_fd < 0) {
+		ksft_print_msg("failed to open cousin pid namespace");
+		return -1;
+	}
+	cousin_pidns_pid = pid;
+
+	return 0;
+}
+
+static int test_pidcmd_query_pid(void)
+{
+	const char *test_name = "pidctl PIDCMD_QUERY_PID";
+	pid_t pid, self;
+	int parent_pidns_fd2;
+
+	self = getpid();
+
+	pid = sys_pidctl(PIDCMD_QUERY_PID, self, -1, -1, PIDCTL_CLOEXEC);
+	if (pid >= 0) {
+		ksft_print_msg("%s test %d: managed to pass invalid flag\n",
+			       test_name, ksft_test_num());
+		return -1;
+	}
+
+	pid = sys_pidctl(PIDCMD_QUERY_PID, self, -1, -1, 0);
+	if (!pid || (pid != self)) {
+		ksft_print_msg("%s test %d: argument pid %d, translated pid %d\n",
+			       test_name, ksft_test_num(), self, pid);
+		return -1;
+	}
+	ksft_inc_pass_cnt();
+
+	if (!pidns_supported)
+		goto out;
+
+	parent_pidns_fd2 = open("/proc/self/ns/pid", O_RDONLY | O_CLOEXEC);
+	if (parent_pidns_fd2 < 0) {
+		ksft_print_msg("%s test %d: Failed to open current pid namespace\n",
+			       test_name, ksft_test_num());
+		return -1;
+	}
+
+	pid = sys_pidctl(PIDCMD_QUERY_PID, self, parent_pidns_fd,
+			 parent_pidns_fd2, 0);
+	if (!pid || (pid != self)) {
+		ksft_print_msg("%s test %d: argument pid %d, translated pid %d\n",
+			       test_name, ksft_test_num(), self, pid);
+		close(parent_pidns_fd2);
+		return -1;
+	}
+	ksft_inc_pass_cnt();
+
+	pid = sys_pidctl(PIDCMD_QUERY_PID, self, -1, parent_pidns_fd2, 0);
+	if (!pid || (pid != self)) {
+		ksft_print_msg("%s test %d: argument pid %d, translated pid %d\n",
+			       test_name, ksft_test_num(), self, pid);
+		close(parent_pidns_fd2);
+		return -1;
+	}
+	ksft_inc_pass_cnt();
+
+	pid = sys_pidctl(PIDCMD_QUERY_PID, self, parent_pidns_fd, -1, 0);
+	if (!pid || (pid != self)) {
+		ksft_print_msg("%s test %d: argument pid %d, translated pid %d\n",
+			       test_name, ksft_test_num(), self, pid);
+		close(parent_pidns_fd2);
+		return -1;
+	}
+	ksft_inc_pass_cnt();
+
+	close(parent_pidns_fd2);
+
+	pid = sys_pidctl(PIDCMD_QUERY_PID, self, parent_pidns_fd,
+			 child_pidns_fd, 0);
+	if (pid != 0) {
+		ksft_print_msg("%s test %d: argument pid %d, translated pid %d\n",
+			       test_name, ksft_test_num(), self, pid);
+		return -1;
+	}
+	ksft_inc_pass_cnt();
+
+	pid = sys_pidctl(PIDCMD_QUERY_PID, self, child_pidns_fd,
+			 parent_pidns_fd, 0);
+	if (pid >= 0 || ((pid < 0) && (errno != ESRCH))) {
+		ksft_print_msg("%s test %d: argument pid %d, translated pid %d\n",
+			       test_name, ksft_test_num(), self, pid);
+		return -1;
+	}
+	ksft_inc_pass_cnt();
+
+	pid = sys_pidctl(PIDCMD_QUERY_PID, child_pidns_pid, parent_pidns_fd,
+			 child_pidns_fd, 0);
+	if (pid != 1) {
+		ksft_print_msg("%s test %d: argument pid %d, translated pid %d\n",
+			       test_name, ksft_test_num(), child_pidns_pid, pid);
+		return -1;
+	}
+	ksft_inc_pass_cnt();
+
+	pid = sys_pidctl(PIDCMD_QUERY_PID, 1, child_pidns_fd, parent_pidns_fd,
+			 0);
+	if (pid != child_pidns_pid) {
+		ksft_print_msg("%s test %d: argument pid %d, translated pid %d\n",
+			       test_name, ksft_test_num(), 1, pid);
+		return -1;
+	}
+	ksft_inc_pass_cnt();
+
+	pid = sys_pidctl(PIDCMD_QUERY_PID, 1, child_pidns_fd, cousin_pidns_fd,
+			 0);
+	if (pid != 0) {
+		ksft_print_msg("%s test %d: argument pid %d, translated pid %d\n",
+			       test_name, ksft_test_num(), 1, pid);
+		return -1;
+	}
+	ksft_inc_pass_cnt();
+
+	pid = sys_pidctl(PIDCMD_QUERY_PID, cousin_pidns_pid, child_pidns_fd,
+			 cousin_pidns_fd, 0);
+	if (pid >= 0 || ((pid < 0) && (errno != ESRCH))) {
+		ksft_print_msg("%s test %d: argument pid %d, translated pid %d\n",
+			       test_name, ksft_test_num(), cousin_pidns_pid, pid);
+		return -1;
+	}
+	ksft_inc_pass_cnt();
+
+out:
+	ksft_test_result_pass("%s test: passed\n", test_name);
+	return 0;
+}
+
+static int test_pidcmd_query_pidns(void)
+{
+	const char *test_name = "pidctl PIDCMD_QUERY_PIDNS";
+	int parent_pidns_fd2;
+	int query;
+
+	query = sys_pidctl(PIDCMD_QUERY_PIDNS, 0, -1, -1, PIDCTL_CLOEXEC);
+	if (query >= 0) {
+		ksft_print_msg("%s test %d: managed to pass invalid flag\n",
+			       test_name, ksft_test_num());
+		return -1;
+	}
+	ksft_inc_pass_cnt();
+
+	query = sys_pidctl(PIDCMD_QUERY_PIDNS, 1234, -1, -1, 0);
+	if (query >= 0)
+		ksft_print_msg("%s test %d: managed to pass invalid pid argument\n",
+			test_name, ksft_test_num());
+	ksft_inc_pass_cnt();
+
+	if (!pidns_supported)
+		goto out;
+
+	parent_pidns_fd2 = open("/proc/self/ns/pid", O_RDONLY | O_CLOEXEC);
+	if (parent_pidns_fd2 < 0) {
+		ksft_print_msg("%s test %d: Failed to open second pid namespace file descriptor\n",
+			       test_name, ksft_test_num());
+		return -1;
+	}
+
+	query = sys_pidctl(PIDCMD_QUERY_PIDNS, 0, parent_pidns_fd,
+			   parent_pidns_fd2, 0);
+	close(parent_pidns_fd2);
+	if (query != PIDNS_EQUAL) {
+		ksft_print_msg("%s test %d: failed to detect that pid namespaces are identical %d\n",
+			       test_name, ksft_test_num(), query);
+		return -1;
+	}
+	ksft_inc_pass_cnt();
+
+	query = sys_pidctl(PIDCMD_QUERY_PIDNS, 0, parent_pidns_fd,
+			   child_pidns_fd, 0);
+	if (query != PIDNS_SOURCE_IS_ANCESTOR) {
+		ksft_print_msg("%s test %d: failed to detect that source pid namespace is ancestor of target pid namespace %d\n",
+			       test_name, ksft_test_num(), query);
+		return -1;
+	}
+	ksft_inc_pass_cnt();
+
+	query = sys_pidctl(PIDCMD_QUERY_PIDNS, 0, child_pidns_fd,
+			   parent_pidns_fd, 0);
+	if (query != PIDNS_TARGET_IS_ANCESTOR) {
+		ksft_print_msg("%s test %d: failed to detect that target pid namespace is ancestor of source pid namespace %d\n",
+			       test_name, ksft_test_num(), query);
+		return -1;
+	}
+	ksft_inc_pass_cnt();
+
+	query = sys_pidctl(PIDCMD_QUERY_PIDNS, 0, child_pidns_fd,
+			   cousin_pidns_fd, 0);
+	if (query != PIDNS_UNRELATED) {
+		ksft_print_msg("%s test %d: failed to detect that pid namespace are not related %d\n",
+			       test_name, ksft_test_num(), query);
+		return -1;
+	}
+	ksft_inc_pass_cnt();
+
+	query = sys_pidctl(PIDCMD_QUERY_PIDNS, 0, child_pidns_fd,
+			   cousin_pidns_fd, 0);
+	if (query != PIDNS_UNRELATED) {
+		ksft_print_msg("%s test %d: failed to detect that pid namespace are not related %d\n",
+			       test_name, ksft_test_num(), query);
+		return -1;
+	}
+	ksft_inc_pass_cnt();
+
+out:
+	ksft_test_result_pass("%s test: passed\n", test_name);
+	return 0;
+}
+
+static int test_pidcmd_get_pidfd(void)
+{
+	const char *test_name = "pidctl PIDCMD_GET_PIDFD";
+	pid_t self;
+	int pidfd, parent_pidns_fd2;
+
+	self = getpid();
+
+	pidfd = sys_pidctl(PIDCMD_GET_PIDFD, self, -1, -1, 0);
+	if (pidfd < 0) {
+		ksft_print_msg("%s test %d: failed to retrieve pidfd\n",
+			       test_name, ksft_test_num());
+		return -1;
+	}
+	close(pidfd);
+	ksft_inc_pass_cnt();
+
+	pidfd = sys_pidctl(PIDCMD_GET_PIDFD, self, -1, -1, PIDCTL_CLOEXEC);
+	if (pidfd < 0) {
+		ksft_print_msg("%s test %d: failed to pass valid flag\n",
+			       test_name, ksft_test_num());
+		return -1;
+	}
+	close(pidfd);
+	ksft_inc_pass_cnt();
+
+	if (!pidns_supported)
+		goto out;
+
+	parent_pidns_fd2 = open("/proc/self/ns/pid", O_RDONLY | O_CLOEXEC);
+	if (parent_pidns_fd2 < 0) {
+		ksft_print_msg("%s test %d: Failed to open current pid namespace\n",
+			       test_name, ksft_test_num());
+		return -1;
+	}
+
+	pidfd = sys_pidctl(PIDCMD_GET_PIDFD, self, parent_pidns_fd,
+			   parent_pidns_fd2, PIDCTL_CLOEXEC);
+	if (pidfd < 0) {
+		ksft_print_msg("%s test %d: failed to retrieve pidfd\n",
+			       test_name, ksft_test_num());
+		close(parent_pidns_fd2);
+		return -1;
+	}
+	close(pidfd);
+	ksft_inc_pass_cnt();
+
+	pidfd = sys_pidctl(PIDCMD_GET_PIDFD, self, -1, parent_pidns_fd2,
+			   PIDCTL_CLOEXEC);
+	if (pidfd < 0) {
+		ksft_print_msg("%s test %d: failed to retrieve pidfd\n",
+			       test_name, ksft_test_num());
+		close(parent_pidns_fd2);
+		return -1;
+	}
+	close(pidfd);
+	ksft_inc_pass_cnt();
+
+	pidfd = sys_pidctl(PIDCMD_GET_PIDFD, self, parent_pidns_fd, -1,
+			   PIDCTL_CLOEXEC);
+	if (pidfd < 0) {
+		ksft_print_msg("%s test %d: failed to retrieve pidfd\n",
+			       test_name, ksft_test_num());
+		close(parent_pidns_fd2);
+		return -1;
+	}
+	close(pidfd);
+	ksft_inc_pass_cnt();
+
+	close(parent_pidns_fd2);
+
+	pidfd = sys_pidctl(PIDCMD_GET_PIDFD, self, parent_pidns_fd,
+			   child_pidns_fd, PIDCTL_CLOEXEC);
+	if (pidfd >= 0 || ((pidfd < 0) && (errno != ENOENT))) {
+		ksft_print_msg("%s test %d: succeeded to retrieve pidfd but should've failed %s\n",
+			       test_name, ksft_test_num(), strerror(errno));
+		return -1;
+	}
+	ksft_inc_pass_cnt();
+
+	pidfd = sys_pidctl(PIDCMD_GET_PIDFD, self, child_pidns_fd,
+			   parent_pidns_fd, PIDCTL_CLOEXEC);
+	if (pidfd >= 0 || ((pidfd < 0) && (errno != ESRCH))) {
+		ksft_print_msg("%s test %d: succeeded to retrieve pidfd but should've failed %s\n",
+			       test_name, ksft_test_num(), strerror(errno));
+		return -1;
+	}
+	ksft_inc_pass_cnt();
+
+	pidfd = sys_pidctl(PIDCMD_GET_PIDFD, 1, child_pidns_fd, parent_pidns_fd,
+			   PIDCTL_CLOEXEC);
+	if (pidfd < 0) {
+		ksft_print_msg("%s test %d: failed to retrieve pidfd\n",
+			       test_name, ksft_test_num());
+		return -1;
+	}
+	close(pidfd);
+	ksft_inc_pass_cnt();
+
+	pidfd = sys_pidctl(PIDCMD_GET_PIDFD, 1, child_pidns_fd, cousin_pidns_fd,
+			   PIDCTL_CLOEXEC);
+	if (pidfd >= 0 || ((pidfd < 0) && (errno != ENOENT))) {
+		ksft_print_msg("%s test %d: succeeded to retrieve pidfd but should've failed %s\n",
+			       test_name, ksft_test_num(), strerror(errno));
+		return -1;
+	}
+	ksft_inc_pass_cnt();
+
+	pidfd = sys_pidctl(PIDCMD_GET_PIDFD, cousin_pidns_pid, child_pidns_fd,
+			   cousin_pidns_fd, PIDCTL_CLOEXEC);
+	if (pidfd >= 0 || ((pidfd < 0) && (errno != ESRCH))) {
+		ksft_print_msg("%s test %d: succeeded to retrieve pidfd but should've failed %s\n",
+			       test_name, ksft_test_num(), strerror(errno));
+		return -1;
+	}
+	ksft_inc_pass_cnt();
+
+out:
+	ksft_test_result_pass("%s test: passed\n", test_name);
+	return 0;
+}
+
+static void test_pidctl_pidfd_send_signal(void)
+{
+	const char *test_name = "pidctl with pidfd_send_signal";
+	int child_pidfd, cousin_pidfd, ret;
+
+	child_pidfd = sys_pidctl(PIDCMD_GET_PIDFD, child_pidns_pid, -1, -1,
+				 PIDCTL_CLOEXEC);
+	if (child_pidfd < 0)
+		ksft_print_msg("%s test %d: failed to retrieve pidfd\n",
+			       test_name, ksft_test_num());
+	ksft_inc_pass_cnt();
+
+	ret = sys_pidfd_send_signal(child_pidfd, SIGKILL, NULL, 0);
+	if (ret < 0) {
+		kill(child_pidns_pid, SIGKILL);
+		ksft_print_msg("%s test %d: failed to send signal via pidfd\n",
+			       test_name, ksft_test_num());
+	}
+	ksft_inc_pass_cnt();
+
+	cousin_pidfd = sys_pidctl(PIDCMD_GET_PIDFD, cousin_pidns_pid, -1, -1,
+				  PIDCTL_CLOEXEC);
+	if (cousin_pidfd < 0)
+		ksft_print_msg("%s test %d: failed to retrieve pidfd\n",
+			       test_name, ksft_test_num());
+	ksft_inc_pass_cnt();
+
+	ret = sys_pidfd_send_signal(cousin_pidfd, SIGKILL, NULL, 0);
+	if (ret < 0) {
+		kill(cousin_pidfd, SIGKILL);
+		ksft_print_msg("%s test %d: failed to send signal via pidfd\n",
+			       test_name, ksft_test_num());
+	}
+	ksft_inc_pass_cnt();
+
+	ksft_test_result_pass("%s test: passed\n", test_name);
+}
+
+int wait_for_pid(pid_t pid)
+{
+	int status, ret;
+
+again:
+	ret = waitpid(pid, &status, 0);
+	if (ret == -1) {
+		if (errno == EINTR)
+			goto again;
+
+		return -1;
+	}
+
+	if (ret != pid)
+		goto again;
+
+	if (!WIFEXITED(status) || WEXITSTATUS(status) != 0)
+		return -1;
+
+	return 0;
+}
+
+int main(int argc, char **argv)
+{
+	int ret;
+	pid_t pid;
+
+	pid = fork();
+	if (pid < 0)
+		ksft_exit_fail_msg("Failed to create new process\n");
+
+	if (pid == 0) {
+		if (unshare(CLONE_NEWPID) < 0)
+			exit(EXIT_FAILURE);
+
+		exit(EXIT_SUCCESS);
+	}
+
+	if (!wait_for_pid(pid))
+		pidns_supported = true;
+
+	ksft_print_header();
+
+	if (pidns_supported)
+		prepare_pid_namespaces();
+	else
+		ksft_print_msg(
+			"kernel does not support pid namespaces: skipping pid namespace parts of testsuite");
+
+	ret = test_pidcmd_query_pid();
+	if (ret < 0) {
+		ksft_print_msg("PIDCMD_QUERY_PID tests failed");
+		goto on_error;
+	}
+
+	ret = test_pidcmd_query_pidns();
+	if (ret < 0) {
+		ksft_print_msg("PIDCMD_QUERY_PIDNS tests failed");
+		goto on_error;
+	}
+
+	ret = test_pidcmd_get_pidfd();
+	if (ret < 0) {
+		ksft_print_msg("PIDCMD_GET_PIDFD tests failed");
+		goto on_error;
+	}
+
+	ret = 0;
+
+on_error:
+	if (pidns_supported)
+		test_pidctl_pidfd_send_signal();
+
+	if (parent_pidns_fd >= 0)
+		close(parent_pidns_fd);
+
+	if (child_pidns_fd >= 0)
+		close(child_pidns_fd);
+
+	if (cousin_pidns_fd >= 0)
+		close(cousin_pidns_fd);
+
+	return !ret ? ksft_exit_pass() : ksft_exit_fail();
+}
-- 
2.21.0


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* Re: [PATCH 0/4] pid: add pidctl()
  2019-03-25 16:20 [PATCH 0/4] pid: add pidctl() Christian Brauner
                   ` (3 preceding siblings ...)
  2019-03-25 16:20 ` [PATCH 4/4] tests: add pidctl() tests Christian Brauner
@ 2019-03-25 16:48 ` Daniel Colascione
  2019-03-25 17:05   ` Konstantin Khlebnikov
  2019-03-25 17:36   ` Joel Fernandes
  2019-03-25 16:56 ` David Howells
  5 siblings, 2 replies; 42+ messages in thread
From: Daniel Colascione @ 2019-03-25 16:48 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Jann Horn, khlebnikov, Andy Lutomirski, David Howells,
	Serge E. Hallyn, Eric W. Biederman, Linux API, linux-kernel,
	Arnd Bergmann, Kees Cook, Alexey Dobriyan, Thomas Gleixner,
	Michael Kerrisk-manpages, bl0pbl33p, Dmitry V. Levin,
	Andrew Morton, Oleg Nesterov, nagarathnam.muthusamy,
	Aleksa Sarai, Al Viro, Joel Fernandes

On Mon, Mar 25, 2019 at 9:21 AM Christian Brauner <christian@brauner.io> wrote:
> The pidctl() syscalls builds on, extends, and improves translate_pid() [4].
> I quote Konstantins original patchset first that has already been acked and
> picked up by Eric before and whose functionality is preserved in this
> syscall. Multiple people have asked when this patchset will be sent in
> for merging (cf. [1], [2]). It has recently been revived by Nagarathnam
> Muthusamy from Oracle [3].
>
> The intention of the original translate_pid() syscall was twofold:
> 1. Provide translation of pids between pid namespaces
> 2. Provide implicit pid namespace introspection
>
> Both functionalities are preserved. The latter task has been improved
> upon though. In the original version of the pachset passing pid as 1
> would allow to deterimine the relationship between the pid namespaces.
> This is inherhently racy. If pid 1 inside a pid namespace has died it
> would report false negatives. For example, if pid 1 inside of the target
> pid namespace already died, it would report that the target pid
> namespace cannot be reached from the source pid namespace because it
> couldn't find the pid inside of the target pid namespace and thus
> falsely report to the user that the two pid namespaces are not related.
> This problem is simple to avoid. In the new version we simply walk the
> list of ancestors and check whether the namespace are related to each
> other. By doing it this way we can reliably report what the relationship
> between two pid namespace file descriptors looks like.
>
> Additionally, this syscall has been extended to allow the retrieval of
> pidfds independent of procfs. These pidfds can e.g. be used with the new
> pidfd_send_signal() syscall we recently merged. The ability to retrieve
> pidfds independent of procfs had already been requested in the
> pidfd_send_signal patchset by e.g. Andrew [4] and later again by Alexey
> [5]. A use-case where a kernel is compiled without procfs but where
> pidfds are still useful has been outlined by Andy in [6]. Regular
> anon-inode based file descriptors are used that stash a reference to
> struct pid in file->private_data and drop that reference on close.
>
> With this translate_pid() has three closely related but still distinct
> functionalities. To clarify the semantics and to make it easier for
> userspace to use the syscall it has:
> - gained a command argument and three commands clearly reflecting the
>   distinct functionalities (PIDCMD_QUERY_PID, PIDCMD_QUERY_PIDNS,
>   PIDCMD_GET_PIDFD).
> - been renamed to pidctl()

Having made these changes, you've built a general-purpose command
command multiplexer, not one operation that happens to be flexible.
The general-purpose command multiplexer is a common antipattern:
multiplexers make it hard to talk about different kernel-provided
operations using the common vocabulary we use to distinguish
kernel-related operations, the system call number. socketcall, for
example, turned out to be cumbersome for users like SELinux policy
writers. People had to do work work later to split socketcall into
fine-grained system calls. Please split the pidctl system call so that
the design is clean from the start and we avoid work later. System
calls are cheap.

Also, I'm still confused about how metadata access is supposed to work
for these procfs-less pidfs. If I use PIDCMD_GET_PIDFD on a process,
You snipped out a portion of a previous email in which I asked about
your thoughts on this question. With the PIDCMD_GET_PIDFD command in
place, we have two different kinds of file descriptors for processes,
one derived from procfs and one that's independent. The former works
with openat(2). The latter does not. To be very specific; if I'm
writing a function that accepts a pidfd and I get a pidfd that comes
from PIDCMD_GET_PIDFD, how am I supposed to get the equivalent of
smaps or oom_score_adj or statm for the named process in a race-free
manner?

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 0/4] pid: add pidctl()
  2019-03-25 16:20 [PATCH 0/4] pid: add pidctl() Christian Brauner
                   ` (4 preceding siblings ...)
  2019-03-25 16:48 ` [PATCH 0/4] pid: add pidctl() Daniel Colascione
@ 2019-03-25 16:56 ` David Howells
  2019-03-25 16:58   ` Daniel Colascione
  2019-03-25 23:39   ` Andy Lutomirski
  5 siblings, 2 replies; 42+ messages in thread
From: David Howells @ 2019-03-25 16:56 UTC (permalink / raw)
  To: Daniel Colascione
  Cc: dhowells, Christian Brauner, Jann Horn, khlebnikov,
	Andy Lutomirski, Serge E. Hallyn, Eric W. Biederman, Linux API,
	linux-kernel, Arnd Bergmann, Kees Cook, Alexey Dobriyan,
	Thomas Gleixner, Michael Kerrisk-manpages, bl0pbl33p,
	Dmitry V. Levin, Andrew Morton, Oleg Nesterov,
	nagarathnam.muthusamy, Aleksa Sarai, Al Viro, Joel Fernandes

Daniel Colascione <dancol@google.com> wrote:

> System calls are cheap.

Only to a point.  x86_64 will have an issue when we hit syscall 512.  We're
currently at 427.

David

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 0/4] pid: add pidctl()
  2019-03-25 16:56 ` David Howells
@ 2019-03-25 16:58   ` Daniel Colascione
  2019-03-25 23:39   ` Andy Lutomirski
  1 sibling, 0 replies; 42+ messages in thread
From: Daniel Colascione @ 2019-03-25 16:58 UTC (permalink / raw)
  To: David Howells
  Cc: Christian Brauner, Jann Horn, khlebnikov, Andy Lutomirski,
	Serge E. Hallyn, Eric W. Biederman, Linux API, linux-kernel,
	Arnd Bergmann, Kees Cook, Alexey Dobriyan, Thomas Gleixner,
	Michael Kerrisk-manpages, bl0pbl33p, Dmitry V. Levin,
	Andrew Morton, Oleg Nesterov, nagarathnam.muthusamy,
	Aleksa Sarai, Al Viro, Joel Fernandes

On Mon, Mar 25, 2019 at 9:56 AM David Howells <dhowells@redhat.com> wrote:
> Daniel Colascione <dancol@google.com> wrote:
>
> > System calls are cheap.
>
> Only to a point.  x86_64 will have an issue when we hit syscall 512.  We're
> currently at 427.

IIRC, a while ago, someone proposed restarting system call numbering
above (again, IIRC) 1024 for both 32- and 64-bit varieties to
establish a clean slate and sidestep this problem.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 0/4] pid: add pidctl()
  2019-03-25 16:48 ` [PATCH 0/4] pid: add pidctl() Daniel Colascione
@ 2019-03-25 17:05   ` Konstantin Khlebnikov
  2019-03-25 17:07     ` Daniel Colascione
  2019-03-25 17:36   ` Joel Fernandes
  1 sibling, 1 reply; 42+ messages in thread
From: Konstantin Khlebnikov @ 2019-03-25 17:05 UTC (permalink / raw)
  To: Daniel Colascione, Christian Brauner
  Cc: Jann Horn, Andy Lutomirski, David Howells, Serge E. Hallyn,
	Eric W. Biederman, Linux API, linux-kernel, Arnd Bergmann,
	Kees Cook, Alexey Dobriyan, Thomas Gleixner,
	Michael Kerrisk-manpages, bl0pbl33p, Dmitry V. Levin,
	Andrew Morton, Oleg Nesterov, nagarathnam.muthusamy,
	Aleksa Sarai, Al Viro, Joel Fernandes



On 25.03.2019 19:48, Daniel Colascione wrote:
> On Mon, Mar 25, 2019 at 9:21 AM Christian Brauner <christian@brauner.io> wrote:
>> The pidctl() syscalls builds on, extends, and improves translate_pid() [4].
>> I quote Konstantins original patchset first that has already been acked and
>> picked up by Eric before and whose functionality is preserved in this
>> syscall. Multiple people have asked when this patchset will be sent in
>> for merging (cf. [1], [2]). It has recently been revived by Nagarathnam
>> Muthusamy from Oracle [3].
>>
>> The intention of the original translate_pid() syscall was twofold:
>> 1. Provide translation of pids between pid namespaces
>> 2. Provide implicit pid namespace introspection
>>
>> Both functionalities are preserved. The latter task has been improved
>> upon though. In the original version of the pachset passing pid as 1
>> would allow to deterimine the relationship between the pid namespaces.
>> This is inherhently racy. If pid 1 inside a pid namespace has died it
>> would report false negatives. For example, if pid 1 inside of the target
>> pid namespace already died, it would report that the target pid
>> namespace cannot be reached from the source pid namespace because it
>> couldn't find the pid inside of the target pid namespace and thus
>> falsely report to the user that the two pid namespaces are not related.
>> This problem is simple to avoid. In the new version we simply walk the
>> list of ancestors and check whether the namespace are related to each
>> other. By doing it this way we can reliably report what the relationship
>> between two pid namespace file descriptors looks like.
>>
>> Additionally, this syscall has been extended to allow the retrieval of
>> pidfds independent of procfs. These pidfds can e.g. be used with the new
>> pidfd_send_signal() syscall we recently merged. The ability to retrieve
>> pidfds independent of procfs had already been requested in the
>> pidfd_send_signal patchset by e.g. Andrew [4] and later again by Alexey
>> [5]. A use-case where a kernel is compiled without procfs but where
>> pidfds are still useful has been outlined by Andy in [6]. Regular
>> anon-inode based file descriptors are used that stash a reference to
>> struct pid in file->private_data and drop that reference on close.
>>
>> With this translate_pid() has three closely related but still distinct
>> functionalities. To clarify the semantics and to make it easier for
>> userspace to use the syscall it has:
>> - gained a command argument and three commands clearly reflecting the
>>    distinct functionalities (PIDCMD_QUERY_PID, PIDCMD_QUERY_PIDNS,
>>    PIDCMD_GET_PIDFD).
>> - been renamed to pidctl()
> 
> Having made these changes, you've built a general-purpose command
> command multiplexer, not one operation that happens to be flexible.
> The general-purpose command multiplexer is a common antipattern:
> multiplexers make it hard to talk about different kernel-provided
> operations using the common vocabulary we use to distinguish
> kernel-related operations, the system call number. socketcall, for
> example, turned out to be cumbersome for users like SELinux policy
> writers. People had to do work work later to split socketcall into
> fine-grained system calls. Please split the pidctl system call so that
> the design is clean from the start and we avoid work later. System
> calls are cheap.
> 
> Also, I'm still confused about how metadata access is supposed to work
> for these procfs-less pidfs. If I use PIDCMD_GET_PIDFD on a process,
> You snipped out a portion of a previous email in which I asked about
> your thoughts on this question. With the PIDCMD_GET_PIDFD command in
> place, we have two different kinds of file descriptors for processes,
> one derived from procfs and one that's independent. The former works
> with openat(2). The latter does not. To be very specific; if I'm
> writing a function that accepts a pidfd and I get a pidfd that comes
> from PIDCMD_GET_PIDFD, how am I supposed to get the equivalent of
> smaps or oom_score_adj or statm for the named process in a race-free
> manner?
> 

Task metadata could be exposed via "pages" identified by offset:

struct pidfd_stats stats;

pread(pidfd, &stats, sizeof(stats), PIDFD_STATS_OFFSET);

I'm not sure that we need yet another binary procfs.
But it will be faster than current text-based for sure.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 0/4] pid: add pidctl()
  2019-03-25 17:05   ` Konstantin Khlebnikov
@ 2019-03-25 17:07     ` Daniel Colascione
  0 siblings, 0 replies; 42+ messages in thread
From: Daniel Colascione @ 2019-03-25 17:07 UTC (permalink / raw)
  To: Konstantin Khlebnikov
  Cc: Christian Brauner, Jann Horn, Andy Lutomirski, David Howells,
	Serge E. Hallyn, Eric W. Biederman, Linux API, linux-kernel,
	Arnd Bergmann, Kees Cook, Alexey Dobriyan, Thomas Gleixner,
	Michael Kerrisk-manpages, bl0pbl33p, Dmitry V. Levin,
	Andrew Morton, Oleg Nesterov, nagarathnam.muthusamy,
	Aleksa Sarai, Al Viro, Joel Fernandes

On Mon, Mar 25, 2019 at 10:05 AM Konstantin Khlebnikov
<khlebnikov@yandex-team.ru> wrote:
> On 25.03.2019 19:48, Daniel Colascione wrote:
> > On Mon, Mar 25, 2019 at 9:21 AM Christian Brauner <christian@brauner.io> wrote:
> >> The pidctl() syscalls builds on, extends, and improves translate_pid() [4].
> >> I quote Konstantins original patchset first that has already been acked and
> >> picked up by Eric before and whose functionality is preserved in this
> >> syscall. Multiple people have asked when this patchset will be sent in
> >> for merging (cf. [1], [2]). It has recently been revived by Nagarathnam
> >> Muthusamy from Oracle [3].
> >>
> >> The intention of the original translate_pid() syscall was twofold:
> >> 1. Provide translation of pids between pid namespaces
> >> 2. Provide implicit pid namespace introspection
> >>
> >> Both functionalities are preserved. The latter task has been improved
> >> upon though. In the original version of the pachset passing pid as 1
> >> would allow to deterimine the relationship between the pid namespaces.
> >> This is inherhently racy. If pid 1 inside a pid namespace has died it
> >> would report false negatives. For example, if pid 1 inside of the target
> >> pid namespace already died, it would report that the target pid
> >> namespace cannot be reached from the source pid namespace because it
> >> couldn't find the pid inside of the target pid namespace and thus
> >> falsely report to the user that the two pid namespaces are not related.
> >> This problem is simple to avoid. In the new version we simply walk the
> >> list of ancestors and check whether the namespace are related to each
> >> other. By doing it this way we can reliably report what the relationship
> >> between two pid namespace file descriptors looks like.
> >>
> >> Additionally, this syscall has been extended to allow the retrieval of
> >> pidfds independent of procfs. These pidfds can e.g. be used with the new
> >> pidfd_send_signal() syscall we recently merged. The ability to retrieve
> >> pidfds independent of procfs had already been requested in the
> >> pidfd_send_signal patchset by e.g. Andrew [4] and later again by Alexey
> >> [5]. A use-case where a kernel is compiled without procfs but where
> >> pidfds are still useful has been outlined by Andy in [6]. Regular
> >> anon-inode based file descriptors are used that stash a reference to
> >> struct pid in file->private_data and drop that reference on close.
> >>
> >> With this translate_pid() has three closely related but still distinct
> >> functionalities. To clarify the semantics and to make it easier for
> >> userspace to use the syscall it has:
> >> - gained a command argument and three commands clearly reflecting the
> >>    distinct functionalities (PIDCMD_QUERY_PID, PIDCMD_QUERY_PIDNS,
> >>    PIDCMD_GET_PIDFD).
> >> - been renamed to pidctl()
> >
> > Having made these changes, you've built a general-purpose command
> > command multiplexer, not one operation that happens to be flexible.
> > The general-purpose command multiplexer is a common antipattern:
> > multiplexers make it hard to talk about different kernel-provided
> > operations using the common vocabulary we use to distinguish
> > kernel-related operations, the system call number. socketcall, for
> > example, turned out to be cumbersome for users like SELinux policy
> > writers. People had to do work work later to split socketcall into
> > fine-grained system calls. Please split the pidctl system call so that
> > the design is clean from the start and we avoid work later. System
> > calls are cheap.
> >
> > Also, I'm still confused about how metadata access is supposed to work
> > for these procfs-less pidfs. If I use PIDCMD_GET_PIDFD on a process,
> > You snipped out a portion of a previous email in which I asked about
> > your thoughts on this question. With the PIDCMD_GET_PIDFD command in
> > place, we have two different kinds of file descriptors for processes,
> > one derived from procfs and one that's independent. The former works
> > with openat(2). The latter does not. To be very specific; if I'm
> > writing a function that accepts a pidfd and I get a pidfd that comes
> > from PIDCMD_GET_PIDFD, how am I supposed to get the equivalent of
> > smaps or oom_score_adj or statm for the named process in a race-free
> > manner?
> >
>
> Task metadata could be exposed via "pages" identified by offset:
>
> struct pidfd_stats stats;
>
> pread(pidfd, &stats, sizeof(stats), PIDFD_STATS_OFFSET);
>
> I'm not sure that we need yet another binary procfs.
> But it will be faster than current text-based for sure.

There are many options. I have some detailed thoughts on several of
them, but long design emails seem rather unwelcome. Right now, I'd
like to hear how Christian intends his patch set to address this use
case.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 2/4] pid: add pidctl()
  2019-03-25 16:20 ` [PATCH 2/4] pid: add pidctl() Christian Brauner
@ 2019-03-25 17:20   ` Mika Penttilä
  2019-03-25 19:59     ` Christian Brauner
  2019-03-25 18:18   ` Jann Horn
  1 sibling, 1 reply; 42+ messages in thread
From: Mika Penttilä @ 2019-03-25 17:20 UTC (permalink / raw)
  To: Christian Brauner, jannh, khlebnikov, luto, dhowells, serge,
	ebiederm, linux-api, linux-kernel
  Cc: arnd, keescook, adobriyan, tglx, mtk.manpages, bl0pbl33p, ldv,
	akpm, oleg, nagarathnam.muthusamy, cyphar, viro, joel, dancol

Hi!


> +SYSCALL_DEFINE5(pidctl, unsigned int, cmd, pid_t, pid, int, source, int, target,
> +		unsigned int, flags)
> +{
> +	struct pid_namespace *source_ns = NULL, *target_ns = NULL;
> +	struct pid *struct_pid;
> +	pid_t result;
> +
> +	switch (cmd) {
> +	case PIDCMD_QUERY_PIDNS:
> +		if (pid != 0)
> +			return -EINVAL;
> +		pid = 1;
> +		/* fall through */
> +	case PIDCMD_QUERY_PID:
> +		if (flags != 0)
> +			return -EINVAL;
> +		break;
> +	case PIDCMD_GET_PIDFD:
> +		if (flags & ~PIDCTL_CLOEXEC)
> +			return -EINVAL;
> +		break;
> +	default:
> +		return -EOPNOTSUPP;
> +	}
> +
> +	source_ns = get_pid_ns_by_fd(source);
> +	result = PTR_ERR(source_ns);
> +	if (IS_ERR(source_ns))
> +		goto err_source;
> +
> +	target_ns = get_pid_ns_by_fd(target);
> +	result = PTR_ERR(target_ns);
> +	if (IS_ERR(target_ns))
> +		goto err_target;
> +
> +	if (cmd == PIDCMD_QUERY_PIDNS) {
> +		result = pidns_related(source_ns, target_ns);
> +	} else {
> +		rcu_read_lock();
> +		struct_pid = find_pid_ns(pid, source_ns);
> +		result = struct_pid ? pid_nr_ns(struct_pid, target_ns) : -ESRCH;

Should you do get_pid(struct_pid) here to keep it alive till 
pidfd_create_fd() ?

> +		rcu_read_unlock();
> +
> +		if (cmd == PIDCMD_GET_PIDFD) {
> +			int cloexec = (flags & PIDCTL_CLOEXEC) ? O_CLOEXEC : 0;
> +			if (result > 0)
> +				result = pidfd_create_fd(struct_pid, cloexec);
> +			else if (result == 0)
> +				result = -ENOENT;
> +		}
> +	}
> +
> +	if (target)
> +		put_pid_ns(target_ns);
> +err_target:
> +	if (source)
> +		put_pid_ns(source_ns);
> +err_source:
> +	return result;
> +}
> +
>  void __init pid_idr_init(void)
>  {
>  	/* Verify no one has done anything silly: */
> diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
> index aa6e72fb7c08..1c863fb3d55a 100644
> --- a/kernel/pid_namespace.c
> +++ b/kernel/pid_namespace.c
> @@ -429,6 +429,31 @@ static struct ns_common *pidns_get_parent(struct ns_common *ns)
>  	return &get_pid_ns(pid_ns)->ns;
>  }
>  
> +/**
> + * pidnscmp - Determine if @ancestor is ancestor of @descendant
> + * @ancestor:   pidns suspected to be the ancestor of @descendant
> + * @descendant: pidns suspected to be the descendant of @ancestor
> + *
> + * Returns -1 if @ancestor is not an ancestor of @descendant,
> + * 0 if @ancestor is the same pidns as @descendant, 1 if @ancestor
> + * is an ancestor of @descendant.
> + */
> +int pidnscmp(struct pid_namespace *ancestor, struct pid_namespace *descendant)
> +{
> +	if (ancestor == descendant)
> +		return 0;
> +
> +	for (;;) {
> +		if (!descendant)
> +			return -1;
> +		if (descendant == ancestor)
> +			break;
> +		descendant = descendant->parent;
> +	}
> +
> +	return 1;
> +}
> +
>  static struct user_namespace *pidns_owner(struct ns_common *ns)
>  {
>  	return to_pid_ns(ns)->user_ns;

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 0/4] pid: add pidctl()
  2019-03-25 16:48 ` [PATCH 0/4] pid: add pidctl() Daniel Colascione
  2019-03-25 17:05   ` Konstantin Khlebnikov
@ 2019-03-25 17:36   ` Joel Fernandes
  2019-03-25 17:53     ` Daniel Colascione
  2019-03-25 20:15     ` Christian Brauner
  1 sibling, 2 replies; 42+ messages in thread
From: Joel Fernandes @ 2019-03-25 17:36 UTC (permalink / raw)
  To: Daniel Colascione
  Cc: Christian Brauner, Jann Horn, khlebnikov, Andy Lutomirski,
	David Howells, Serge E. Hallyn, Eric W. Biederman, Linux API,
	linux-kernel, Arnd Bergmann, Kees Cook, Alexey Dobriyan,
	Thomas Gleixner, Michael Kerrisk-manpages, bl0pbl33p,
	Dmitry V. Levin, Andrew Morton, Oleg Nesterov,
	nagarathnam.muthusamy, Aleksa Sarai, Al Viro

On Mon, Mar 25, 2019 at 09:48:43AM -0700, Daniel Colascione wrote:
> On Mon, Mar 25, 2019 at 9:21 AM Christian Brauner <christian@brauner.io> wrote:
> > The pidctl() syscalls builds on, extends, and improves translate_pid() [4].
> > I quote Konstantins original patchset first that has already been acked and
> > picked up by Eric before and whose functionality is preserved in this
> > syscall. Multiple people have asked when this patchset will be sent in
> > for merging (cf. [1], [2]). It has recently been revived by Nagarathnam
> > Muthusamy from Oracle [3].
> >
> > The intention of the original translate_pid() syscall was twofold:
> > 1. Provide translation of pids between pid namespaces
> > 2. Provide implicit pid namespace introspection
> >
> > Both functionalities are preserved. The latter task has been improved
> > upon though. In the original version of the pachset passing pid as 1
> > would allow to deterimine the relationship between the pid namespaces.
> > This is inherhently racy. If pid 1 inside a pid namespace has died it
> > would report false negatives. For example, if pid 1 inside of the target
> > pid namespace already died, it would report that the target pid
> > namespace cannot be reached from the source pid namespace because it
> > couldn't find the pid inside of the target pid namespace and thus
> > falsely report to the user that the two pid namespaces are not related.
> > This problem is simple to avoid. In the new version we simply walk the
> > list of ancestors and check whether the namespace are related to each
> > other. By doing it this way we can reliably report what the relationship
> > between two pid namespace file descriptors looks like.
> >
> > Additionally, this syscall has been extended to allow the retrieval of
> > pidfds independent of procfs. These pidfds can e.g. be used with the new
> > pidfd_send_signal() syscall we recently merged. The ability to retrieve
> > pidfds independent of procfs had already been requested in the
> > pidfd_send_signal patchset by e.g. Andrew [4] and later again by Alexey
> > [5]. A use-case where a kernel is compiled without procfs but where
> > pidfds are still useful has been outlined by Andy in [6]. Regular
> > anon-inode based file descriptors are used that stash a reference to
> > struct pid in file->private_data and drop that reference on close.
> >
> > With this translate_pid() has three closely related but still distinct
> > functionalities. To clarify the semantics and to make it easier for
> > userspace to use the syscall it has:
> > - gained a command argument and three commands clearly reflecting the
> >   distinct functionalities (PIDCMD_QUERY_PID, PIDCMD_QUERY_PIDNS,
> >   PIDCMD_GET_PIDFD).
> > - been renamed to pidctl()
> 
[snip]
> Also, I'm still confused about how metadata access is supposed to work
> for these procfs-less pidfs. If I use PIDCMD_GET_PIDFD on a process,
> You snipped out a portion of a previous email in which I asked about
> your thoughts on this question. With the PIDCMD_GET_PIDFD command in
> place, we have two different kinds of file descriptors for processes,
> one derived from procfs and one that's independent. The former works
> with openat(2). The latter does not. To be very specific; if I'm
> writing a function that accepts a pidfd and I get a pidfd that comes
> from PIDCMD_GET_PIDFD, how am I supposed to get the equivalent of
> smaps or oom_score_adj or statm for the named process in a race-free
> manner?

This is true, that such usecase will not be supportable.  But the advantage
on the other hand, is that suchs "pidfd" can be made pollable or readable in
the future. Potentially allowing us to return exit status without a new
syscall (?). And we can add IOCTLs to the pidfd descriptor which we cannot do
with proc.

But.. one thing we could do for Daniel usecase is if a /proc/pid directory fd
can be translated into a "pidfd" using another syscall or even a node, like
/proc/pid/handle or something. I think this is what Christian suggested in
the previous threads.

And also for the translation the other way, add a syscall or modify
translate_fd or something, to covert a anon_inode pidfd into a /proc/pid
directory fd. Then the user is welcomed to do openat(2) on _that_ directory fd.
Then we modify pidfd_send_signal to only send signals to pure pidfd fds, not
to /proc/pid directory fds.

Should we work on patches for these? Please let us know if this idea makes
sense and thanks a lot for adding us to the review as well.

Best,

 - Joel

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 0/4] pid: add pidctl()
  2019-03-25 17:36   ` Joel Fernandes
@ 2019-03-25 17:53     ` Daniel Colascione
  2019-03-25 18:19       ` Jonathan Kowalski
  2019-03-25 20:15     ` Christian Brauner
  1 sibling, 1 reply; 42+ messages in thread
From: Daniel Colascione @ 2019-03-25 17:53 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: Christian Brauner, Jann Horn, Konstantin Khlebnikov,
	Andy Lutomirski, David Howells, Serge E. Hallyn,
	Eric W. Biederman, Linux API, linux-kernel, Arnd Bergmann,
	Kees Cook, Alexey Dobriyan, Thomas Gleixner,
	Michael Kerrisk-manpages, Jonathan Kowalski, Dmitry V. Levin,
	Andrew Morton, Oleg Nesterov, Nagarathnam Muthusamy,
	Aleksa Sarai, Al Viro

On Mon, Mar 25, 2019 at 10:36 AM Joel Fernandes <joel@joelfernandes.org> wrote:
>
> On Mon, Mar 25, 2019 at 09:48:43AM -0700, Daniel Colascione wrote:
> > On Mon, Mar 25, 2019 at 9:21 AM Christian Brauner <christian@brauner.io> wrote:
> > > The pidctl() syscalls builds on, extends, and improves translate_pid() [4].
> > > I quote Konstantins original patchset first that has already been acked and
> > > picked up by Eric before and whose functionality is preserved in this
> > > syscall. Multiple people have asked when this patchset will be sent in
> > > for merging (cf. [1], [2]). It has recently been revived by Nagarathnam
> > > Muthusamy from Oracle [3].
> > >
> > > The intention of the original translate_pid() syscall was twofold:
> > > 1. Provide translation of pids between pid namespaces
> > > 2. Provide implicit pid namespace introspection
> > >
> > > Both functionalities are preserved. The latter task has been improved
> > > upon though. In the original version of the pachset passing pid as 1
> > > would allow to deterimine the relationship between the pid namespaces.
> > > This is inherhently racy. If pid 1 inside a pid namespace has died it
> > > would report false negatives. For example, if pid 1 inside of the target
> > > pid namespace already died, it would report that the target pid
> > > namespace cannot be reached from the source pid namespace because it
> > > couldn't find the pid inside of the target pid namespace and thus
> > > falsely report to the user that the two pid namespaces are not related.
> > > This problem is simple to avoid. In the new version we simply walk the
> > > list of ancestors and check whether the namespace are related to each
> > > other. By doing it this way we can reliably report what the relationship
> > > between two pid namespace file descriptors looks like.
> > >
> > > Additionally, this syscall has been extended to allow the retrieval of
> > > pidfds independent of procfs. These pidfds can e.g. be used with the new
> > > pidfd_send_signal() syscall we recently merged. The ability to retrieve
> > > pidfds independent of procfs had already been requested in the
> > > pidfd_send_signal patchset by e.g. Andrew [4] and later again by Alexey
> > > [5]. A use-case where a kernel is compiled without procfs but where
> > > pidfds are still useful has been outlined by Andy in [6]. Regular
> > > anon-inode based file descriptors are used that stash a reference to
> > > struct pid in file->private_data and drop that reference on close.
> > >
> > > With this translate_pid() has three closely related but still distinct
> > > functionalities. To clarify the semantics and to make it easier for
> > > userspace to use the syscall it has:
> > > - gained a command argument and three commands clearly reflecting the
> > >   distinct functionalities (PIDCMD_QUERY_PID, PIDCMD_QUERY_PIDNS,
> > >   PIDCMD_GET_PIDFD).
> > > - been renamed to pidctl()
> >
> [snip]
> > Also, I'm still confused about how metadata access is supposed to work
> > for these procfs-less pidfs. If I use PIDCMD_GET_PIDFD on a process,
> > You snipped out a portion of a previous email in which I asked about
> > your thoughts on this question. With the PIDCMD_GET_PIDFD command in
> > place, we have two different kinds of file descriptors for processes,
> > one derived from procfs and one that's independent. The former works
> > with openat(2). The latter does not. To be very specific; if I'm
> > writing a function that accepts a pidfd and I get a pidfd that comes
> > from PIDCMD_GET_PIDFD, how am I supposed to get the equivalent of
> > smaps or oom_score_adj or statm for the named process in a race-free
> > manner?
>
> This is true, that such usecase will not be supportable.  But the advantage
> on the other hand, is that suchs "pidfd" can be made pollable or readable in
> the future. Potentially allowing us to return exit status without a new
> syscall (?). And we can add IOCTLs to the pidfd descriptor which we cannot do
> with proc.

I don't like the idea of having one kind of pollfd be pollable and
another not. Such an interface would be confusing for users. If, as
you suggest below, we instead make the procfs-less FD the only thing
we call a "pidfd", then this proposal becomes less confusing and more
viable.

> But.. one thing we could do for Daniel usecase is if a /proc/pid directory fd
> can be translated into a "pidfd" using another syscall or even a node, like
> /proc/pid/handle or something. I think this is what Christian suggested in
> the previous threads.

/proc/pid/handle, if I understand correctly, "translates" a
procfs-based pidfd to a non-pidfd one. The needed interface would have
to perform the opposite translation, providing a procfs directory for
a given pidfd.

> And also for the translation the other way, add a syscall or modify
> translate_fd or something, to covert a anon_inode pidfd into a /proc/pid
> directory fd. Then the user is welcomed to do openat(2) on _that_ directory fd.
> Then we modify pidfd_send_signal to only send signals to pure pidfd fds, not
> to /proc/pid directory fds.

This approach would work, but there's one subtlety to take into
account: which procfs? You'd want to take, as inputs, 1) the procfs
root you want, and 2) the pidfd, this way you could return to the user
a directory FD in the right procfs.

> Should we work on patches for these? Please let us know if this idea makes
> sense and thanks a lot for adding us to the review as well.

I would strongly prefer that we not merge pidctl (or whatever it is)
without a story for metadata access, be it your suggestion or
something else.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 2/4] pid: add pidctl()
  2019-03-25 16:20 ` [PATCH 2/4] pid: add pidctl() Christian Brauner
  2019-03-25 17:20   ` Mika Penttilä
@ 2019-03-25 18:18   ` Jann Horn
  2019-03-25 19:58     ` Christian Brauner
  2019-03-26 16:07     ` Joel Fernandes
  1 sibling, 2 replies; 42+ messages in thread
From: Jann Horn @ 2019-03-25 18:18 UTC (permalink / raw)
  To: Christian Brauner, Andy Lutomirski
  Cc: Konstantin Khlebnikov, David Howells, Serge E. Hallyn,
	Eric W. Biederman, Linux API, kernel list, Arnd Bergmann,
	Kees Cook, Alexey Dobriyan, Thomas Gleixner,
	Michael Kerrisk-manpages, bl0pbl33p, Dmitry V. Levin,
	Andrew Morton, Oleg Nesterov, Nagarathnam Muthusamy, cyphar,
	Al Viro, Joel Fernandes (Google),
	Daniel Colascione

On Mon, Mar 25, 2019 at 5:21 PM Christian Brauner <christian@brauner.io> wrote:
> The pidctl() syscalls builds on, extends, and improves translate_pid() [4].
> I quote Konstantins original patchset first that has already been acked and
> picked up by Eric before and whose functionality is preserved in this
> syscall:
[...]
> +
> +static struct pid_namespace *get_pid_ns_by_fd(int fd)
> +{
> +       struct pid_namespace *pidns = ERR_PTR(-EINVAL);
> +
> +       if (fd >= 0) {
> +#ifdef CONFIG_PID_NS
> +               struct ns_common *ns;
> +               struct file *file = proc_ns_fget(fd);
> +               if (IS_ERR(file))
> +                       return ERR_CAST(file);
> +
> +               ns = get_proc_ns(file_inode(file));
> +               if (ns->ops->type == CLONE_NEWPID)
> +                       pidns = get_pid_ns(
> +                               container_of(ns, struct pid_namespace, ns));

This increments the refcount of the pidns...

> +
> +               fput(file);
> +#endif
> +       } else {
> +               pidns = task_active_pid_ns(current);

... but this doesn't. That's pretty subtle; could you please put a
comment on top of this function that points this out? Or even better,
change the function to always take a reference, so that the caller
doesn't have to worry about figuring this out.

> +       }
> +
> +       return pidns;
> +}
[...]
> +SYSCALL_DEFINE5(pidctl, unsigned int, cmd, pid_t, pid, int, source, int, target,
> +               unsigned int, flags)
> +{
> +       struct pid_namespace *source_ns = NULL, *target_ns = NULL;
> +       struct pid *struct_pid;
> +       pid_t result;
> +
> +       switch (cmd) {
> +       case PIDCMD_QUERY_PIDNS:
> +               if (pid != 0)
> +                       return -EINVAL;
> +               pid = 1;
> +               /* fall through */
> +       case PIDCMD_QUERY_PID:
> +               if (flags != 0)
> +                       return -EINVAL;
> +               break;
> +       case PIDCMD_GET_PIDFD:
> +               if (flags & ~PIDCTL_CLOEXEC)
> +                       return -EINVAL;
> +               break;
> +       default:
> +               return -EOPNOTSUPP;
> +       }
> +
> +       source_ns = get_pid_ns_by_fd(source);
> +       result = PTR_ERR(source_ns);

I very much dislike using PTR_ERR() on pointers before checking
whether they contain an error value or not. I understand that the
result of this won't actually be used, but it still seems weird to
have what is essentially a cast of a potentially valid pointer to a
potentially smaller integer here.

Could you maybe move the PTR_ERR() into the error branch? Like so:

if (IS_ERR(source_ns)) {
  result = PTR_ERR(source_ns);
  goto err_source;
}

> +       if (IS_ERR(source_ns))
> +               goto err_source;
> +
> +       target_ns = get_pid_ns_by_fd(target);
> +       result = PTR_ERR(target_ns);
> +       if (IS_ERR(target_ns))
> +               goto err_target;
> +
> +       if (cmd == PIDCMD_QUERY_PIDNS) {
> +               result = pidns_related(source_ns, target_ns);
> +       } else {
> +               rcu_read_lock();
> +               struct_pid = find_pid_ns(pid, source_ns);

find_pid_ns() doesn't take a reference on its return value, the return
value is only pinned into memory by the RCU read-side critical
section...

> +               result = struct_pid ? pid_nr_ns(struct_pid, target_ns) : -ESRCH;
> +               rcu_read_unlock();

... which ends here, making struct_pid a dangling pointer...

> +
> +               if (cmd == PIDCMD_GET_PIDFD) {
> +                       int cloexec = (flags & PIDCTL_CLOEXEC) ? O_CLOEXEC : 0;
> +                       if (result > 0)
> +                               result = pidfd_create_fd(struct_pid, cloexec);

... and then here you continue to use struct_pid. That seems bogus.

> +                       else if (result == 0)
> +                               result = -ENOENT;

You don't need to have flags for this for new syscalls, you can just
make everything O_CLOEXEC; if someone wants to preserve an fd across
execve(), they can just clear that bit with fcntl(). (I thiiink it was
Andy Lutomirski who said that before regarding another syscall? But my
memory of that is pretty foggy, might've been someone else.)

> +               }
> +       }
> +
> +       if (target)
> +               put_pid_ns(target_ns);
> +err_target:
> +       if (source)
> +               put_pid_ns(source_ns);

I think you probably intended to check for "if (target >= 0)" and "if
(source >= 0)" instead of these conditions, matching the condition in
get_pid_ns_by_fd()? The current code looks as if it will leave the
refcount one too high when used with target==0 or source==0, and leave
the refcount one too low when used with target==-1 or source==-1.
Anyway, as stated above, I think it would be simpler to
unconditionally take a reference instead.

> +err_source:
> +       return result;
> +}

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 0/4] pid: add pidctl()
  2019-03-25 17:53     ` Daniel Colascione
@ 2019-03-25 18:19       ` Jonathan Kowalski
  2019-03-25 18:57         ` Daniel Colascione
  0 siblings, 1 reply; 42+ messages in thread
From: Jonathan Kowalski @ 2019-03-25 18:19 UTC (permalink / raw)
  To: Daniel Colascione
  Cc: Joel Fernandes, Christian Brauner, Jann Horn,
	Konstantin Khlebnikov, Andy Lutomirski, David Howells,
	Serge E. Hallyn, Eric W. Biederman, Linux API, linux-kernel,
	Arnd Bergmann, Kees Cook, Alexey Dobriyan, Thomas Gleixner,
	Michael Kerrisk-manpages, Dmitry V. Levin, Andrew Morton,
	Oleg Nesterov, Nagarathnam Muthusamy, Aleksa Sarai, Al Viro

On Mon, Mar 25, 2019 at 5:53 PM Daniel Colascione <dancol@google.com> wrote:
>
> [..snip..]
>
> I don't like the idea of having one kind of pollfd be pollable and
> another not. Such an interface would be confusing for users. If, as
> you suggest below, we instead make the procfs-less FD the only thing
> we call a "pidfd", then this proposal becomes less confusing and more
> viable.

That's certainly on the table, we remove the ability to open
/proc/<PID> and use the dir fd with pidfd_send_signal. I'm in favor of
this.

> > But.. one thing we could do for Daniel usecase is if a /proc/pid directory fd
> > can be translated into a "pidfd" using another syscall or even a node, like
> > /proc/pid/handle or something. I think this is what Christian suggested in
> > the previous threads.
>
> /proc/pid/handle, if I understand correctly, "translates" a
> procfs-based pidfd to a non-pidfd one. The needed interface would have
> to perform the opposite translation, providing a procfs directory for
> a given pidfd.
>
> > And also for the translation the other way, add a syscall or modify
> > translate_fd or something, to covert a anon_inode pidfd into a /proc/pid
> > directory fd. Then the user is welcomed to do openat(2) on _that_ directory fd.
> > Then we modify pidfd_send_signal to only send signals to pure pidfd fds, not
> > to /proc/pid directory fds.
>
> This approach would work, but there's one subtlety to take into
> account: which procfs? You'd want to take, as inputs, 1) the procfs
> root you want, and 2) the pidfd, this way you could return to the user
> a directory FD in the right procfs.
>

I don't think metadata access is in the scope of such a pidfd itself.

> > Should we work on patches for these? Please let us know if this idea makes
> > sense and thanks a lot for adding us to the review as well.
>
> I would strongly prefer that we not merge pidctl (or whatever it is)
> without a story for metadata access, be it your suggestion or
> something else.

IMO, this is out of scope for a process handle. Hence, the need for
metadata access musn't stall it as is.

If you really need this, the pid this pidfd is mapped to can be
exposed through fdinfo (which is perfectly fine for your case as you
use /proc anyway). This means you can reliably determine if you're
opening it for the same pid (and it hasn't been recycled) because this
pidfd will be pollable for state change (in the future). Exposing it
through fdinfo isn't a problem, pid ns is bound to the process during
its lifetime, it's a process creation time property, so the value
isn't going to change anyway. So you can do all metadata access you
want subsequently.

The upshot is that this same pidfd can then be returned from a future
extension of clone(2) Christian is planning to work on. It is still
undecided whether this fd should be only readable by the parent
process or everyone (which would just be a matter of changing the
access mode depending on the caller), but that we can discuss all that
when the patchset is posted.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 3/4] signal: support pidctl() with pidfd_send_signal()
  2019-03-25 16:20 ` [PATCH 3/4] signal: support pidctl() with pidfd_send_signal() Christian Brauner
@ 2019-03-25 18:28   ` Jonathan Kowalski
  2019-03-25 20:05     ` Christian Brauner
  2019-03-25 18:39   ` Jann Horn
  1 sibling, 1 reply; 42+ messages in thread
From: Jonathan Kowalski @ 2019-03-25 18:28 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Jann Horn, Konstantin Khlebnikov, Andy Lutomirski, David Howells,
	Serge E. Hallyn, Eric W. Biederman, Linux API, linux-kernel,
	Arnd Bergmann, Kees Cook, Alexey Dobriyan, Thomas Gleixner,
	Michael Kerrisk-manpages, Dmitry V. Levin, Andrew Morton,
	Oleg Nesterov, Nagarathnam Muthusamy, Aleksa Sarai, Al Viro,
	Joel Fernandes, Daniel Colascione

On Mon, Mar 25, 2019 at 4:21 PM Christian Brauner <christian@brauner.io> wrote:
>
> Let pidfd_send_signal() use pidfds retrieved via pidctl(). With this patch
> pidfd_send_signal() becomes independent of procfs. This fullfils the
> request made when we merged the pidfd_send_signal() patchset. The
> pidfd_send_signal() syscall is now always available allowing for it to be
> used by users without procfs mounted or even users without procfs support
> compiled into the kernel.
>
> Signed-off-by: Christian Brauner <christian@brauner.io>
> Reviewed-by: David Howells <dhowells@redhat.com>
> Acked-by: Serge Hallyn <serge@hallyn.com>
> Cc: Arnd Bergmann <arnd@arndb.de>
> Cc: "Eric W. Biederman" <ebiederm@xmission.com>
> Cc: Kees Cook <keescook@chromium.org>
> Cc: Alexey Dobriyan <adobriyan@gmail.com>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Jann Horn <jannh@google.com
> Cc: "Michael Kerrisk (man-pages)" <mtk.manpages@gmail.com>
> Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
> Cc: Jonathan Kowalski <bl0pbl33p@gmail.com>
> Cc: "Dmitry V. Levin" <ldv@altlinux.org>
> Cc: Andy Lutomirsky <luto@kernel.org>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Oleg Nesterov <oleg@redhat.com>
> Cc: Nagarathnam Muthusamy <nagarathnam.muthusamy@oracle.com>
> Cc: Aleksa Sarai <cyphar@cyphar.com>
> Cc: Al Viro <viro@zeniv.linux.org.uk>
> ---
>  kernel/signal.c | 20 +++++++++-----------
>  kernel/sys_ni.c |  3 ---
>  2 files changed, 9 insertions(+), 14 deletions(-)
>
> diff --git a/kernel/signal.c b/kernel/signal.c
> index b7953934aa99..d77183be1677 100644
> --- a/kernel/signal.c
> +++ b/kernel/signal.c
> @@ -3513,7 +3513,6 @@ SYSCALL_DEFINE2(kill, pid_t, pid, int, sig)
>         return kill_something_info(sig, &info, pid);
>  }
>
> -#ifdef CONFIG_PROC_FS
>  /*
>   * Verify that the signaler and signalee either are in the same pid namespace
>   * or that the signaler's pid namespace is an ancestor of the signalee's pid
> @@ -3521,16 +3520,13 @@ SYSCALL_DEFINE2(kill, pid_t, pid, int, sig)
>   */
>  static bool access_pidfd_pidns(struct pid *pid)

If it is agreed upon to remove the ability of using /proc/<PID> as a
pidfd pidfd_send_signal accepts, is lifting this check acceptable?

The system call as is does not allow for a process to acquire a pidfd
without being in an ancestor namespace or the same namespace. Thus,
there are good reasons to allow for this to work and be able to work
around the limitations imposed by pid namespaces if userspace
explicitly decides to do so, by passing around the pidfd to some other
process.

Also, you would need to translate this only once, when filling in the
structure. The other process is bound to its pid ns as long as it is
alive, therefore it would be simple without having to do anything
fancy at read side (unlike unix sockets).

>  {
> +       int ret;
>         struct pid_namespace *active = task_active_pid_ns(current);
>         struct pid_namespace *p = ns_of_pid(pid);
>
> -       for (;;) {
> -               if (!p)
> -                       return false;
> -               if (p == active)
> -                       break;
> -               p = p->parent;
> -       }
> +       ret = pidnscmp(active, p);
> +       if (ret < 0)
> +               return false;
>
>         return true;
>  }
> @@ -3581,12 +3577,15 @@ SYSCALL_DEFINE4(pidfd_send_signal, int, pidfd, int, sig,
>         if (flags)
>                 return -EINVAL;
>
> -       f = fdget_raw(pidfd);
> +       f = fdget(pidfd);
>         if (!f.file)
>                 return -EBADF;
>
>         /* Is this a pidfd? */
> -       pid = tgid_pidfd_to_pid(f.file);
> +       if (f.file->f_op == &pidfd_fops)
> +               pid = f.file->private_data;
> +       else
> +               pid = tgid_pidfd_to_pid(f.file);
>         if (IS_ERR(pid)) {
>                 ret = PTR_ERR(pid);
>                 goto err;
> @@ -3625,7 +3624,6 @@ SYSCALL_DEFINE4(pidfd_send_signal, int, pidfd, int, sig,
>         fdput(f);
>         return ret;
>  }
> -#endif /* CONFIG_PROC_FS */
>
>  static int
>  do_send_specific(pid_t tgid, pid_t pid, int sig, struct kernel_siginfo *info)
> diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
> index d21f4befaea4..4d9ae5ea6caf 100644
> --- a/kernel/sys_ni.c
> +++ b/kernel/sys_ni.c
> @@ -167,9 +167,6 @@ COND_SYSCALL(syslog);
>
>  /* kernel/sched/core.c */
>
> -/* kernel/signal.c */
> -COND_SYSCALL(pidfd_send_signal);
> -
>  /* kernel/sys.c */
>  COND_SYSCALL(setregid);
>  COND_SYSCALL(setgid);
> --
> 2.21.0
>

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 3/4] signal: support pidctl() with pidfd_send_signal()
  2019-03-25 16:20 ` [PATCH 3/4] signal: support pidctl() with pidfd_send_signal() Christian Brauner
  2019-03-25 18:28   ` Jonathan Kowalski
@ 2019-03-25 18:39   ` Jann Horn
  2019-03-25 19:41     ` Christian Brauner
  1 sibling, 1 reply; 42+ messages in thread
From: Jann Horn @ 2019-03-25 18:39 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Konstantin Khlebnikov, Andy Lutomirski, David Howells,
	Serge E. Hallyn, Eric W. Biederman, Linux API, kernel list,
	Arnd Bergmann, Kees Cook, Alexey Dobriyan, Thomas Gleixner,
	Michael Kerrisk-manpages, bl0pbl33p, Dmitry V. Levin,
	Andrew Morton, Oleg Nesterov, Nagarathnam Muthusamy, cyphar,
	Al Viro, Joel Fernandes (Google),
	Daniel Colascione

On Mon, Mar 25, 2019 at 5:21 PM Christian Brauner <christian@brauner.io> wrote:
> Let pidfd_send_signal() use pidfds retrieved via pidctl(). With this patch
> pidfd_send_signal() becomes independent of procfs. This fullfils the
> request made when we merged the pidfd_send_signal() patchset. The
> pidfd_send_signal() syscall is now always available allowing for it to be
> used by users without procfs mounted or even users without procfs support
> compiled into the kernel.
[...]
>  static bool access_pidfd_pidns(struct pid *pid)
>  {
> +       int ret;
>         struct pid_namespace *active = task_active_pid_ns(current);
>         struct pid_namespace *p = ns_of_pid(pid);
>
> -       for (;;) {
> -               if (!p)
> -                       return false;
> -               if (p == active)
> -                       break;
> -               p = p->parent;
> -       }
> +       ret = pidnscmp(active, p);
> +       if (ret < 0)
> +               return false;
>
>         return true;
>  }

Nit, if we keep this function: "if (...) return false; return true;"
seems like an antipattern to me. How about "return ret >= 0", or even
"return pidnscmp(active, p) >= 0"?

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 0/4] pid: add pidctl()
  2019-03-25 18:19       ` Jonathan Kowalski
@ 2019-03-25 18:57         ` Daniel Colascione
  2019-03-25 19:42           ` Jonathan Kowalski
  0 siblings, 1 reply; 42+ messages in thread
From: Daniel Colascione @ 2019-03-25 18:57 UTC (permalink / raw)
  To: Jonathan Kowalski
  Cc: Joel Fernandes, Christian Brauner, Jann Horn,
	Konstantin Khlebnikov, Andy Lutomirski, David Howells,
	Serge E. Hallyn, Eric W. Biederman, Linux API, linux-kernel,
	Arnd Bergmann, Kees Cook, Alexey Dobriyan, Thomas Gleixner,
	Michael Kerrisk-manpages, Dmitry V. Levin, Andrew Morton,
	Oleg Nesterov, Nagarathnam Muthusamy, Aleksa Sarai, Al Viro

On Mon, Mar 25, 2019 at 11:19 AM Jonathan Kowalski <bl0pbl33p@gmail.com> wrote:
>
> On Mon, Mar 25, 2019 at 5:53 PM Daniel Colascione <dancol@google.com> wrote:
> >
> > [..snip..]
> >
> > I don't like the idea of having one kind of pollfd be pollable and
> > another not. Such an interface would be confusing for users. If, as
> > you suggest below, we instead make the procfs-less FD the only thing
> > we call a "pidfd", then this proposal becomes less confusing and more
> > viable.
>
> That's certainly on the table, we remove the ability to open
> /proc/<PID> and use the dir fd with pidfd_send_signal. I'm in favor of
> this.
>
> > > But.. one thing we could do for Daniel usecase is if a /proc/pid directory fd
> > > can be translated into a "pidfd" using another syscall or even a node, like
> > > /proc/pid/handle or something. I think this is what Christian suggested in
> > > the previous threads.
> >
> > /proc/pid/handle, if I understand correctly, "translates" a
> > procfs-based pidfd to a non-pidfd one. The needed interface would have
> > to perform the opposite translation, providing a procfs directory for
> > a given pidfd.
> >
> > > And also for the translation the other way, add a syscall or modify
> > > translate_fd or something, to covert a anon_inode pidfd into a /proc/pid
> > > directory fd. Then the user is welcomed to do openat(2) on _that_ directory fd.
> > > Then we modify pidfd_send_signal to only send signals to pure pidfd fds, not
> > > to /proc/pid directory fds.
> >
> > This approach would work, but there's one subtlety to take into
> > account: which procfs? You'd want to take, as inputs, 1) the procfs
> > root you want, and 2) the pidfd, this way you could return to the user
> > a directory FD in the right procfs.
> >
>
> I don't think metadata access is in the scope of such a pidfd itself.

It should be possible to write a race-free pkill(1) using pidfds.
Without metadata access tied to the pidfd somehow, that can't be done.

> > > Should we work on patches for these? Please let us know if this idea makes
> > > sense and thanks a lot for adding us to the review as well.
> >
> > I would strongly prefer that we not merge pidctl (or whatever it is)
> > without a story for metadata access, be it your suggestion or
> > something else.
>
> IMO, this is out of scope for a process handle. Hence, the need for
> metadata access musn't stall it as is.

On the contrary, rather than metadata being out of scope, metadata
access is essential. Given a file handle (an FD), you can learn about
the file backing that handle by using fstat(2). Given a socket handle
(a socket FD), you can learn about the socket with getsockopt(2) and
ioctl(2). It would be strange not to be able, similarly, to learn
things about a process given a handle (a pidfd) to that process. I
want process handles to be "boring" in that they let users query and
manipulate processes mostly like they manipulate other resources.

> If you really need this, the pid this pidfd is mapped to can be
> exposed through fdinfo (which is perfectly fine for your case as you
> use /proc anyway). This means you can reliably determine if you're
> opening it for the same pid (and it hasn't been recycled) because this
> pidfd will be pollable for state change (in the future). Exposing it
> through fdinfo isn't a problem, pid ns is bound to the process during
> its lifetime, it's a process creation time property, so the value
> isn't going to change anyway. So you can do all metadata access you
> want subsequently.

Thanks for the proposal. I'm a bit confused. Are you suggesting that
we report the numeric FD in fdinfo? Are you saying it should work
basically like this?

int get_oom_score_adj(int pidfd) {
  unique_fd fdinfo_fd = open(fmt("/proc/self/fdinfo/%d", pidfd));
  string fdinfo = read_all(fdinfo_fd);
  numeric_pid = parse_fdinfo_for_line("pid");
  unique_fd procdir_fd = open(fmt("/proc/%d", numeric_pid), O_DIRECTORY);
  if(!is_pidfd_alive(pidfd)) { return -1; /* process died */ } /*
LIVENESS CHECK */
  // We know the process named by pidfd was called NUMERIC_PID
  // in our namespace (because fdinfo told us) and we know that the
named process
  // was alive after we successfully opened its /proc/pid directory, therefore,
  // PROCDIR_FD and PIDFD must refer to the same underlying process.
  unique_fd oom_adj_score_fd = openat(procdir_fd, "oom_score_adj");
  return parse_int(read_all(oom_adj_score_fd));
}

While this interface functions correctly, I have two concerns: 1) the
number of system calls necessary is very large -- by my count,
starting with the pifd, we need six, not counting the follow-on
close(2) calls (which would bring the total to nine[1]), and 2) it's
"fail unsafe": IMHO, most users in practice will skip the line marked
"LIVENESS CHECK", and as a result, their code will appear to work but
contain subtle race conditions. An explicit interface to translate
from a (PIDFD, PROCFS_ROOT) tuple to a /proc/pid directory file
descriptor would be both more efficient and fail-safe.

[1] as a separate matter, it'd be nice to have a batch version of close(2).

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 3/4] signal: support pidctl() with pidfd_send_signal()
  2019-03-25 18:39   ` Jann Horn
@ 2019-03-25 19:41     ` Christian Brauner
  0 siblings, 0 replies; 42+ messages in thread
From: Christian Brauner @ 2019-03-25 19:41 UTC (permalink / raw)
  To: Jann Horn
  Cc: Konstantin Khlebnikov, Andy Lutomirski, David Howells,
	Serge E. Hallyn, Eric W. Biederman, Linux API, kernel list,
	Arnd Bergmann, Kees Cook, Alexey Dobriyan, Thomas Gleixner,
	Michael Kerrisk-manpages, bl0pbl33p, Dmitry V. Levin,
	Andrew Morton, Oleg Nesterov, Nagarathnam Muthusamy, cyphar,
	Al Viro, Joel Fernandes (Google),
	Daniel Colascione

On Mon, Mar 25, 2019 at 07:39:25PM +0100, Jann Horn wrote:
> On Mon, Mar 25, 2019 at 5:21 PM Christian Brauner <christian@brauner.io> wrote:
> > Let pidfd_send_signal() use pidfds retrieved via pidctl(). With this patch
> > pidfd_send_signal() becomes independent of procfs. This fullfils the
> > request made when we merged the pidfd_send_signal() patchset. The
> > pidfd_send_signal() syscall is now always available allowing for it to be
> > used by users without procfs mounted or even users without procfs support
> > compiled into the kernel.
> [...]
> >  static bool access_pidfd_pidns(struct pid *pid)
> >  {
> > +       int ret;
> >         struct pid_namespace *active = task_active_pid_ns(current);
> >         struct pid_namespace *p = ns_of_pid(pid);
> >
> > -       for (;;) {
> > -               if (!p)
> > -                       return false;
> > -               if (p == active)
> > -                       break;
> > -               p = p->parent;
> > -       }
> > +       ret = pidnscmp(active, p);
> > +       if (ret < 0)
> > +               return false;
> >
> >         return true;
> >  }
> 
> Nit, if we keep this function: "if (...) return false; return true;"
> seems like an antipattern to me. How about "return ret >= 0", or even
> "return pidnscmp(active, p) >= 0"?

Yip, sounds good.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 0/4] pid: add pidctl()
  2019-03-25 18:57         ` Daniel Colascione
@ 2019-03-25 19:42           ` Jonathan Kowalski
  2019-03-25 20:14             ` Daniel Colascione
  0 siblings, 1 reply; 42+ messages in thread
From: Jonathan Kowalski @ 2019-03-25 19:42 UTC (permalink / raw)
  To: Daniel Colascione
  Cc: Joel Fernandes, Christian Brauner, Jann Horn,
	Konstantin Khlebnikov, Andy Lutomirski, David Howells,
	Serge E. Hallyn, Eric W. Biederman, Linux API, linux-kernel,
	Arnd Bergmann, Kees Cook, Alexey Dobriyan, Thomas Gleixner,
	Michael Kerrisk-manpages, Dmitry V. Levin, Andrew Morton,
	Oleg Nesterov, Nagarathnam Muthusamy, Aleksa Sarai, Al Viro

On Mon, Mar 25, 2019 at 6:57 PM Daniel Colascione <dancol@google.com> wrote:
>
> On Mon, Mar 25, 2019 at 11:19 AM Jonathan Kowalski <bl0pbl33p@gmail.com> wrote:
> >
> > On Mon, Mar 25, 2019 at 5:53 PM Daniel Colascione <dancol@google.com> wrote:
> > >
> > > [..snip..]
> > >
> > > I don't like the idea of having one kind of pollfd be pollable and
> > > another not. Such an interface would be confusing for users. If, as
> > > you suggest below, we instead make the procfs-less FD the only thing
> > > we call a "pidfd", then this proposal becomes less confusing and more
> > > viable.
> >
> > That's certainly on the table, we remove the ability to open
> > /proc/<PID> and use the dir fd with pidfd_send_signal. I'm in favor of
> > this.
> >
> > > > But.. one thing we could do for Daniel usecase is if a /proc/pid directory fd
> > > > can be translated into a "pidfd" using another syscall or even a node, like
> > > > /proc/pid/handle or something. I think this is what Christian suggested in
> > > > the previous threads.
> > >
> > > /proc/pid/handle, if I understand correctly, "translates" a
> > > procfs-based pidfd to a non-pidfd one. The needed interface would have
> > > to perform the opposite translation, providing a procfs directory for
> > > a given pidfd.
> > >
> > > > And also for the translation the other way, add a syscall or modify
> > > > translate_fd or something, to covert a anon_inode pidfd into a /proc/pid
> > > > directory fd. Then the user is welcomed to do openat(2) on _that_ directory fd.
> > > > Then we modify pidfd_send_signal to only send signals to pure pidfd fds, not
> > > > to /proc/pid directory fds.
> > >
> > > This approach would work, but there's one subtlety to take into
> > > account: which procfs? You'd want to take, as inputs, 1) the procfs
> > > root you want, and 2) the pidfd, this way you could return to the user
> > > a directory FD in the right procfs.
> > >
> >
> > I don't think metadata access is in the scope of such a pidfd itself.
>
> It should be possible to write a race-free pkill(1) using pidfds.
> Without metadata access tied to the pidfd somehow, that can't be done.
>
>
> > > > Should we work on patches for these? Please let us know if this idea makes
> > > > sense and thanks a lot for adding us to the review as well.
> > >
> > > I would strongly prefer that we not merge pidctl (or whatever it is)
> > > without a story for metadata access, be it your suggestion or
> > > something else.
> >
> > IMO, this is out of scope for a process handle. Hence, the need for
> > metadata access musn't stall it as is.
>
> On the contrary, rather than metadata being out of scope, metadata
> access is essential. Given a file handle (an FD), you can learn about
> the file backing that handle by using fstat(2). Given a socket handle
> (a socket FD), you can learn about the socket with getsockopt(2) and
> ioctl(2). It would be strange not to be able, similarly, to learn
> things about a process given a handle (a pidfd) to that process. I
> want process handles to be "boring" in that they let users query and
> manipulate processes mostly like they manipulate other resources.
>

Yes, but everything in /proc is not equivalent to an attribute, or an
option, and depending on its configuration, you may not want to allow
processes to even be able to see /proc for any PIDs other than those
running as their own user (hidepid). This means, even if this new
system call is added, to respect hidepid, it must, depending on if
/proc is mounted (and what hidepid is set to, and what gid= is set
to), return EPERM, because then there is a discrepancy between how the
two entrypoints to acquire a process handle do access control. This is
ugly, but ideally should work the way it is described above.
Otherwise, things would be able to bypass the restrictions made
therein, because we also want a system call.

PIDs however are addressable with kill(2) even with hidepid enabled.
For good reason, the capabilities you can exercise using a PID is
limited in that case. The same logic applies to a process reference
like pidfd.

I agree there should be some way (if you can take care of the hidepid
gotcha) to return a dir fd of /proc entry, but I don't think it should
be the pidfd itself. You can get it to work using the fdinfo thing for
now.

> > If you really need this, the pid this pidfd is mapped to can be
> > exposed through fdinfo (which is perfectly fine for your case as you
> > use /proc anyway). This means you can reliably determine if you're
> > opening it for the same pid (and it hasn't been recycled) because this
> > pidfd will be pollable for state change (in the future). Exposing it
> > through fdinfo isn't a problem, pid ns is bound to the process during
> > its lifetime, it's a process creation time property, so the value
> > isn't going to change anyway. So you can do all metadata access you
> > want subsequently.
>
> Thanks for the proposal. I'm a bit confused. Are you suggesting that
> we report the numeric FD in fdinfo? Are you saying it should work
> basically like this?
>

Yes, numeric PID that you would see from your namespace for the said
process the pidfd is for. It could be -1 if this process is not in any
of the namespaces (current or children) (which would happen, say if
you pass it to some other pidns residing process, during fd
installation on ithe target process we set that up properly).

> int get_oom_score_adj(int pidfd) {
>   unique_fd fdinfo_fd = open(fmt("/proc/self/fdinfo/%d", pidfd));
>   string fdinfo = read_all(fdinfo_fd);
>   numeric_pid = parse_fdinfo_for_line("pid");
>   unique_fd procdir_fd = open(fmt("/proc/%d", numeric_pid), O_DIRECTORY);
>   if(!is_pidfd_alive(pidfd)) { return -1; /* process died */ } /*
> LIVENESS CHECK */
>   // We know the process named by pidfd was called NUMERIC_PID
>   // in our namespace (because fdinfo told us) and we know that the
> named process
>   // was alive after we successfully opened its /proc/pid directory, therefore,
>   // PROCDIR_FD and PIDFD must refer to the same underlying process.
>   unique_fd oom_adj_score_fd = openat(procdir_fd, "oom_score_adj");
>   return parse_int(read_all(oom_adj_score_fd));
> }
>
> While this interface functions correctly, I have two concerns: 1) the
> number of system calls necessary is very large -- by my count,
> starting with the pifd, we need six, not counting the follow-on
> close(2) calls (which would bring the total to nine[1]),

But hey, that's a bit rich if you're planning to collect metadata from
/proc ;-).

> and 2) it's
> "fail unsafe": IMHO, most users in practice will skip the line marked
> "LIVENESS CHECK", and as a result, their code will appear to work but
> contain subtle race conditions. An explicit interface to translate
> from a (PIDFD, PROCFS_ROOT) tuple to a /proc/pid directory file
> descriptor would be both more efficient and fail-safe.
>
> [1] as a separate matter, it'd be nice to have a batch version of close(2).

Since /proc is full of gunk, how about adding more to it and making
the magic symlink of /proc/self/fd for the pidfd to lead to the dirfd
of the /proc entry of the process it maps to, when one uses
O_DIRECTORY while opening it? Otherwise, it behaves as it does today.
It would be equivalent to opening the proc entry with usual access
restrictions (and hidepid made to work) but without the races, and
because for processes outside your and children pid ns, it shouldn't
work anyway, and since they wouldn't have their entry on this procfs
instance, it would all just fit in nicely?

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 2/4] pid: add pidctl()
  2019-03-25 18:18   ` Jann Horn
@ 2019-03-25 19:58     ` Christian Brauner
  2019-03-26 16:07     ` Joel Fernandes
  1 sibling, 0 replies; 42+ messages in thread
From: Christian Brauner @ 2019-03-25 19:58 UTC (permalink / raw)
  To: Jann Horn
  Cc: Andy Lutomirski, Konstantin Khlebnikov, David Howells,
	Serge E. Hallyn, Eric W. Biederman, Linux API, kernel list,
	Arnd Bergmann, Kees Cook, Alexey Dobriyan, Thomas Gleixner,
	Michael Kerrisk-manpages, bl0pbl33p, Dmitry V. Levin,
	Andrew Morton, Oleg Nesterov, Nagarathnam Muthusamy, cyphar,
	Al Viro, Joel Fernandes (Google),
	Daniel Colascione

On Mon, Mar 25, 2019 at 07:18:42PM +0100, Jann Horn wrote:
> On Mon, Mar 25, 2019 at 5:21 PM Christian Brauner <christian@brauner.io> wrote:
> > The pidctl() syscalls builds on, extends, and improves translate_pid() [4].
> > I quote Konstantins original patchset first that has already been acked and
> > picked up by Eric before and whose functionality is preserved in this
> > syscall:
> [...]
> > +
> > +static struct pid_namespace *get_pid_ns_by_fd(int fd)
> > +{
> > +       struct pid_namespace *pidns = ERR_PTR(-EINVAL);
> > +
> > +       if (fd >= 0) {
> > +#ifdef CONFIG_PID_NS
> > +               struct ns_common *ns;
> > +               struct file *file = proc_ns_fget(fd);
> > +               if (IS_ERR(file))
> > +                       return ERR_CAST(file);
> > +
> > +               ns = get_proc_ns(file_inode(file));
> > +               if (ns->ops->type == CLONE_NEWPID)
> > +                       pidns = get_pid_ns(
> > +                               container_of(ns, struct pid_namespace, ns));
> 
> This increments the refcount of the pidns...

I didn't touch that code. That's taken over from the orginial patchset.
Thanks for catching this.

> 
> > +
> > +               fput(file);
> > +#endif
> > +       } else {
> > +               pidns = task_active_pid_ns(current);
> 
> ... but this doesn't. That's pretty subtle; could you please put a
> comment on top of this function that points this out? Or even better,
> change the function to always take a reference, so that the caller
> doesn't have to worry about figuring this out.

Always taking a reference sounds more correct. Will do.

> 
> > +       }
> > +
> > +       return pidns;
> > +}
> [...]
> > +SYSCALL_DEFINE5(pidctl, unsigned int, cmd, pid_t, pid, int, source, int, target,
> > +               unsigned int, flags)
> > +{
> > +       struct pid_namespace *source_ns = NULL, *target_ns = NULL;
> > +       struct pid *struct_pid;
> > +       pid_t result;
> > +
> > +       switch (cmd) {
> > +       case PIDCMD_QUERY_PIDNS:
> > +               if (pid != 0)
> > +                       return -EINVAL;
> > +               pid = 1;
> > +               /* fall through */
> > +       case PIDCMD_QUERY_PID:
> > +               if (flags != 0)
> > +                       return -EINVAL;
> > +               break;
> > +       case PIDCMD_GET_PIDFD:
> > +               if (flags & ~PIDCTL_CLOEXEC)
> > +                       return -EINVAL;
> > +               break;
> > +       default:
> > +               return -EOPNOTSUPP;
> > +       }
> > +
> > +       source_ns = get_pid_ns_by_fd(source);
> > +       result = PTR_ERR(source_ns);
> 
> I very much dislike using PTR_ERR() on pointers before checking
> whether they contain an error value or not. I understand that the
> result of this won't actually be used, but it still seems weird to
> have what is essentially a cast of a potentially valid pointer to a
> potentially smaller integer here.
> 
> Could you maybe move the PTR_ERR() into the error branch? Like so:

Will do!

> 
> if (IS_ERR(source_ns)) {
>   result = PTR_ERR(source_ns);
>   goto err_source;
> }
> 
> > +       if (IS_ERR(source_ns))
> > +               goto err_source;
> > +
> > +       target_ns = get_pid_ns_by_fd(target);
> > +       result = PTR_ERR(target_ns);
> > +       if (IS_ERR(target_ns))
> > +               goto err_target;
> > +
> > +       if (cmd == PIDCMD_QUERY_PIDNS) {
> > +               result = pidns_related(source_ns, target_ns);
> > +       } else {
> > +               rcu_read_lock();
> > +               struct_pid = find_pid_ns(pid, source_ns);
> 
> find_pid_ns() doesn't take a reference on its return value, the return
> value is only pinned into memory by the RCU read-side critical
> section...
> 
> > +               result = struct_pid ? pid_nr_ns(struct_pid, target_ns) : -ESRCH;
> > +               rcu_read_unlock();
> 
> ... which ends here, making struct_pid a dangling pointer...
> 
> > +
> > +               if (cmd == PIDCMD_GET_PIDFD) {
> > +                       int cloexec = (flags & PIDCTL_CLOEXEC) ? O_CLOEXEC : 0;
> > +                       if (result > 0)
> > +                               result = pidfd_create_fd(struct_pid, cloexec);
> 
> ... and then here you continue to use struct_pid. That seems bogus.

I'll just take a reference to struct pid once I found it to prevent that
from happening.

> 
> > +                       else if (result == 0)
> > +                               result = -ENOENT;
> 
> You don't need to have flags for this for new syscalls, you can just
> make everything O_CLOEXEC; if someone wants to preserve an fd across
> execve(), they can just clear that bit with fcntl(). (I thiiink it was
> Andy Lutomirski who said that before regarding another syscall? But my
> memory of that is pretty foggy, might've been someone else.)

If that's the way going forward this is fine with me!

> 
> > +               }
> > +       }
> > +
> > +       if (target)
> > +               put_pid_ns(target_ns);
> > +err_target:
> > +       if (source)
> > +               put_pid_ns(source_ns);
> 
> I think you probably intended to check for "if (target >= 0)" and "if
> (source >= 0)" instead of these conditions, matching the condition in
> get_pid_ns_by_fd()? The current code looks as if it will leave the
> refcount one too high when used with target==0 or source==0, and leave
> the refcount one too low when used with target==-1 or source==-1.
> Anyway, as stated above, I think it would be simpler to
> unconditionally take a reference instead.

Yep.

> 
> > +err_source:
> > +       return result;
> > +}

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 2/4] pid: add pidctl()
  2019-03-25 17:20   ` Mika Penttilä
@ 2019-03-25 19:59     ` Christian Brauner
  0 siblings, 0 replies; 42+ messages in thread
From: Christian Brauner @ 2019-03-25 19:59 UTC (permalink / raw)
  To: Mika Penttilä
  Cc: jannh, khlebnikov, luto, dhowells, serge, ebiederm, linux-api,
	linux-kernel, arnd, keescook, adobriyan, tglx, mtk.manpages,
	bl0pbl33p, ldv, akpm, oleg, nagarathnam.muthusamy, cyphar, viro,
	joel, dancol

On Mon, Mar 25, 2019 at 05:20:54PM +0000, Mika Penttilä wrote:
> Hi!
> 
> 
> > +SYSCALL_DEFINE5(pidctl, unsigned int, cmd, pid_t, pid, int, source, int, target,
> > +		unsigned int, flags)
> > +{
> > +	struct pid_namespace *source_ns = NULL, *target_ns = NULL;
> > +	struct pid *struct_pid;
> > +	pid_t result;
> > +
> > +	switch (cmd) {
> > +	case PIDCMD_QUERY_PIDNS:
> > +		if (pid != 0)
> > +			return -EINVAL;
> > +		pid = 1;
> > +		/* fall through */
> > +	case PIDCMD_QUERY_PID:
> > +		if (flags != 0)
> > +			return -EINVAL;
> > +		break;
> > +	case PIDCMD_GET_PIDFD:
> > +		if (flags & ~PIDCTL_CLOEXEC)
> > +			return -EINVAL;
> > +		break;
> > +	default:
> > +		return -EOPNOTSUPP;
> > +	}
> > +
> > +	source_ns = get_pid_ns_by_fd(source);
> > +	result = PTR_ERR(source_ns);
> > +	if (IS_ERR(source_ns))
> > +		goto err_source;
> > +
> > +	target_ns = get_pid_ns_by_fd(target);
> > +	result = PTR_ERR(target_ns);
> > +	if (IS_ERR(target_ns))
> > +		goto err_target;
> > +
> > +	if (cmd == PIDCMD_QUERY_PIDNS) {
> > +		result = pidns_related(source_ns, target_ns);
> > +	} else {
> > +		rcu_read_lock();
> > +		struct_pid = find_pid_ns(pid, source_ns);
> > +		result = struct_pid ? pid_nr_ns(struct_pid, target_ns) : -ESRCH;
> 
> Should you do get_pid(struct_pid) here to keep it alive till 
> pidfd_create_fd() ?

Yes, indeed. You and Jann both pointed this out! Thank you.

> 
> > +		rcu_read_unlock();
> > +
> > +		if (cmd == PIDCMD_GET_PIDFD) {
> > +			int cloexec = (flags & PIDCTL_CLOEXEC) ? O_CLOEXEC : 0;
> > +			if (result > 0)
> > +				result = pidfd_create_fd(struct_pid, cloexec);
> > +			else if (result == 0)
> > +				result = -ENOENT;
> > +		}
> > +	}
> > +
> > +	if (target)
> > +		put_pid_ns(target_ns);
> > +err_target:
> > +	if (source)
> > +		put_pid_ns(source_ns);
> > +err_source:
> > +	return result;
> > +}
> > +
> >  void __init pid_idr_init(void)
> >  {
> >  	/* Verify no one has done anything silly: */
> > diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
> > index aa6e72fb7c08..1c863fb3d55a 100644
> > --- a/kernel/pid_namespace.c
> > +++ b/kernel/pid_namespace.c
> > @@ -429,6 +429,31 @@ static struct ns_common *pidns_get_parent(struct ns_common *ns)
> >  	return &get_pid_ns(pid_ns)->ns;
> >  }
> >  
> > +/**
> > + * pidnscmp - Determine if @ancestor is ancestor of @descendant
> > + * @ancestor:   pidns suspected to be the ancestor of @descendant
> > + * @descendant: pidns suspected to be the descendant of @ancestor
> > + *
> > + * Returns -1 if @ancestor is not an ancestor of @descendant,
> > + * 0 if @ancestor is the same pidns as @descendant, 1 if @ancestor
> > + * is an ancestor of @descendant.
> > + */
> > +int pidnscmp(struct pid_namespace *ancestor, struct pid_namespace *descendant)
> > +{
> > +	if (ancestor == descendant)
> > +		return 0;
> > +
> > +	for (;;) {
> > +		if (!descendant)
> > +			return -1;
> > +		if (descendant == ancestor)
> > +			break;
> > +		descendant = descendant->parent;
> > +	}
> > +
> > +	return 1;
> > +}
> > +
> >  static struct user_namespace *pidns_owner(struct ns_common *ns)
> >  {
> >  	return to_pid_ns(ns)->user_ns;

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 3/4] signal: support pidctl() with pidfd_send_signal()
  2019-03-25 18:28   ` Jonathan Kowalski
@ 2019-03-25 20:05     ` Christian Brauner
  0 siblings, 0 replies; 42+ messages in thread
From: Christian Brauner @ 2019-03-25 20:05 UTC (permalink / raw)
  To: Jonathan Kowalski
  Cc: Jann Horn, Konstantin Khlebnikov, Andy Lutomirski, David Howells,
	Serge E. Hallyn, Eric W. Biederman, Linux API, linux-kernel,
	Arnd Bergmann, Kees Cook, Alexey Dobriyan, Thomas Gleixner,
	Michael Kerrisk-manpages, Dmitry V. Levin, Andrew Morton,
	Oleg Nesterov, Nagarathnam Muthusamy, Aleksa Sarai, Al Viro,
	Joel Fernandes, Daniel Colascione

On Mon, Mar 25, 2019 at 06:28:53PM +0000, Jonathan Kowalski wrote:
> On Mon, Mar 25, 2019 at 4:21 PM Christian Brauner <christian@brauner.io> wrote:
> >
> > Let pidfd_send_signal() use pidfds retrieved via pidctl(). With this patch
> > pidfd_send_signal() becomes independent of procfs. This fullfils the
> > request made when we merged the pidfd_send_signal() patchset. The
> > pidfd_send_signal() syscall is now always available allowing for it to be
> > used by users without procfs mounted or even users without procfs support
> > compiled into the kernel.
> >
> > Signed-off-by: Christian Brauner <christian@brauner.io>
> > Reviewed-by: David Howells <dhowells@redhat.com>
> > Acked-by: Serge Hallyn <serge@hallyn.com>
> > Cc: Arnd Bergmann <arnd@arndb.de>
> > Cc: "Eric W. Biederman" <ebiederm@xmission.com>
> > Cc: Kees Cook <keescook@chromium.org>
> > Cc: Alexey Dobriyan <adobriyan@gmail.com>
> > Cc: Thomas Gleixner <tglx@linutronix.de>
> > Cc: Jann Horn <jannh@google.com
> > Cc: "Michael Kerrisk (man-pages)" <mtk.manpages@gmail.com>
> > Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
> > Cc: Jonathan Kowalski <bl0pbl33p@gmail.com>
> > Cc: "Dmitry V. Levin" <ldv@altlinux.org>
> > Cc: Andy Lutomirsky <luto@kernel.org>
> > Cc: Andrew Morton <akpm@linux-foundation.org>
> > Cc: Oleg Nesterov <oleg@redhat.com>
> > Cc: Nagarathnam Muthusamy <nagarathnam.muthusamy@oracle.com>
> > Cc: Aleksa Sarai <cyphar@cyphar.com>
> > Cc: Al Viro <viro@zeniv.linux.org.uk>
> > ---
> >  kernel/signal.c | 20 +++++++++-----------
> >  kernel/sys_ni.c |  3 ---
> >  2 files changed, 9 insertions(+), 14 deletions(-)
> >
> > diff --git a/kernel/signal.c b/kernel/signal.c
> > index b7953934aa99..d77183be1677 100644
> > --- a/kernel/signal.c
> > +++ b/kernel/signal.c
> > @@ -3513,7 +3513,6 @@ SYSCALL_DEFINE2(kill, pid_t, pid, int, sig)
> >         return kill_something_info(sig, &info, pid);
> >  }
> >
> > -#ifdef CONFIG_PROC_FS
> >  /*
> >   * Verify that the signaler and signalee either are in the same pid namespace
> >   * or that the signaler's pid namespace is an ancestor of the signalee's pid
> > @@ -3521,16 +3520,13 @@ SYSCALL_DEFINE2(kill, pid_t, pid, int, sig)
> >   */
> >  static bool access_pidfd_pidns(struct pid *pid)
> 
> If it is agreed upon to remove the ability of using /proc/<PID> as a
> pidfd pidfd_send_signal accepts, is lifting this check acceptable?
> 
> The system call as is does not allow for a process to acquire a pidfd
> without being in an ancestor namespace or the same namespace. Thus,
> there are good reasons to allow for this to work and be able to work
> around the limitations imposed by pid namespaces if userspace
> explicitly decides to do so, by passing around the pidfd to some other
> process.

The original problem with this was - I think - that we hadn't settled
what certain values in struct siginfo_t should be set to when signaling
e.g. into ancestor pid namespaces. And if we did, we need to hook into
a widely used function in the kernel (I don't have the appropriate
thread handy atm) and we decided to punt on this until userspace really
needs this feature.

> 
> Also, you would need to translate this only once, when filling in the
> structure. The other process is bound to its pid ns as long as it is
> alive, therefore it would be simple without having to do anything
> fancy at read side (unlike unix sockets).
> 
> >  {
> > +       int ret;
> >         struct pid_namespace *active = task_active_pid_ns(current);
> >         struct pid_namespace *p = ns_of_pid(pid);
> >
> > -       for (;;) {
> > -               if (!p)
> > -                       return false;
> > -               if (p == active)
> > -                       break;
> > -               p = p->parent;
> > -       }
> > +       ret = pidnscmp(active, p);
> > +       if (ret < 0)
> > +               return false;
> >
> >         return true;
> >  }
> > @@ -3581,12 +3577,15 @@ SYSCALL_DEFINE4(pidfd_send_signal, int, pidfd, int, sig,
> >         if (flags)
> >                 return -EINVAL;
> >
> > -       f = fdget_raw(pidfd);
> > +       f = fdget(pidfd);
> >         if (!f.file)
> >                 return -EBADF;
> >
> >         /* Is this a pidfd? */
> > -       pid = tgid_pidfd_to_pid(f.file);
> > +       if (f.file->f_op == &pidfd_fops)
> > +               pid = f.file->private_data;
> > +       else
> > +               pid = tgid_pidfd_to_pid(f.file);
> >         if (IS_ERR(pid)) {
> >                 ret = PTR_ERR(pid);
> >                 goto err;
> > @@ -3625,7 +3624,6 @@ SYSCALL_DEFINE4(pidfd_send_signal, int, pidfd, int, sig,
> >         fdput(f);
> >         return ret;
> >  }
> > -#endif /* CONFIG_PROC_FS */
> >
> >  static int
> >  do_send_specific(pid_t tgid, pid_t pid, int sig, struct kernel_siginfo *info)
> > diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
> > index d21f4befaea4..4d9ae5ea6caf 100644
> > --- a/kernel/sys_ni.c
> > +++ b/kernel/sys_ni.c
> > @@ -167,9 +167,6 @@ COND_SYSCALL(syslog);
> >
> >  /* kernel/sched/core.c */
> >
> > -/* kernel/signal.c */
> > -COND_SYSCALL(pidfd_send_signal);
> > -
> >  /* kernel/sys.c */
> >  COND_SYSCALL(setregid);
> >  COND_SYSCALL(setgid);
> > --
> > 2.21.0
> >

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 0/4] pid: add pidctl()
  2019-03-25 19:42           ` Jonathan Kowalski
@ 2019-03-25 20:14             ` Daniel Colascione
  2019-03-25 20:34               ` Jann Horn
  0 siblings, 1 reply; 42+ messages in thread
From: Daniel Colascione @ 2019-03-25 20:14 UTC (permalink / raw)
  To: Jonathan Kowalski
  Cc: Joel Fernandes, Christian Brauner, Jann Horn,
	Konstantin Khlebnikov, Andy Lutomirski, David Howells,
	Serge E. Hallyn, Eric W. Biederman, Linux API, linux-kernel,
	Arnd Bergmann, Kees Cook, Alexey Dobriyan, Thomas Gleixner,
	Michael Kerrisk-manpages, Dmitry V. Levin, Andrew Morton,
	Oleg Nesterov, Nagarathnam Muthusamy, Aleksa Sarai, Al Viro

On Mon, Mar 25, 2019 at 12:42 PM Jonathan Kowalski <bl0pbl33p@gmail.com> wrote:
>
> On Mon, Mar 25, 2019 at 6:57 PM Daniel Colascione <dancol@google.com> wrote:
> >
> > On Mon, Mar 25, 2019 at 11:19 AM Jonathan Kowalski <bl0pbl33p@gmail.com> wrote:
> > >
> > > On Mon, Mar 25, 2019 at 5:53 PM Daniel Colascione <dancol@google.com> wrote:
> > > >
> > > > [..snip..]
> > > >
> > > > I don't like the idea of having one kind of pollfd be pollable and
> > > > another not. Such an interface would be confusing for users. If, as
> > > > you suggest below, we instead make the procfs-less FD the only thing
> > > > we call a "pidfd", then this proposal becomes less confusing and more
> > > > viable.
> > >
> > > That's certainly on the table, we remove the ability to open
> > > /proc/<PID> and use the dir fd with pidfd_send_signal. I'm in favor of
> > > this.
> > >
> > > > > But.. one thing we could do for Daniel usecase is if a /proc/pid directory fd
> > > > > can be translated into a "pidfd" using another syscall or even a node, like
> > > > > /proc/pid/handle or something. I think this is what Christian suggested in
> > > > > the previous threads.
> > > >
> > > > /proc/pid/handle, if I understand correctly, "translates" a
> > > > procfs-based pidfd to a non-pidfd one. The needed interface would have
> > > > to perform the opposite translation, providing a procfs directory for
> > > > a given pidfd.
> > > >
> > > > > And also for the translation the other way, add a syscall or modify
> > > > > translate_fd or something, to covert a anon_inode pidfd into a /proc/pid
> > > > > directory fd. Then the user is welcomed to do openat(2) on _that_ directory fd.
> > > > > Then we modify pidfd_send_signal to only send signals to pure pidfd fds, not
> > > > > to /proc/pid directory fds.
> > > >
> > > > This approach would work, but there's one subtlety to take into
> > > > account: which procfs? You'd want to take, as inputs, 1) the procfs
> > > > root you want, and 2) the pidfd, this way you could return to the user
> > > > a directory FD in the right procfs.
> > > >
> > >
> > > I don't think metadata access is in the scope of such a pidfd itself.
> >
> > It should be possible to write a race-free pkill(1) using pidfds.
> > Without metadata access tied to the pidfd somehow, that can't be done.
> >
> >
> > > > > Should we work on patches for these? Please let us know if this idea makes
> > > > > sense and thanks a lot for adding us to the review as well.
> > > >
> > > > I would strongly prefer that we not merge pidctl (or whatever it is)
> > > > without a story for metadata access, be it your suggestion or
> > > > something else.
> > >
> > > IMO, this is out of scope for a process handle. Hence, the need for
> > > metadata access musn't stall it as is.
> >
> > On the contrary, rather than metadata being out of scope, metadata
> > access is essential. Given a file handle (an FD), you can learn about
> > the file backing that handle by using fstat(2). Given a socket handle
> > (a socket FD), you can learn about the socket with getsockopt(2) and
> > ioctl(2). It would be strange not to be able, similarly, to learn
> > things about a process given a handle (a pidfd) to that process. I
> > want process handles to be "boring" in that they let users query and
> > manipulate processes mostly like they manipulate other resources.
> >
>
> Yes, but everything in /proc is not equivalent to an attribute, or an
> option, and depending on its configuration, you may not want to allow
> processes to even be able to see /proc for any PIDs other than those
> running as their own user (hidepid). This means, even if this new
> system call is added, to respect hidepid, it must, depending on if
> /proc is mounted (and what hidepid is set to, and what gid= is set
> to), return EPERM, because then there is a discrepancy between how the
> two entrypoints to acquire a process handle do access control.

That's why I proposed that this translation mechanism accept a procfs
root directory --- so you'd specify *which* procfs you want and let
the kernel apply whatever hidepid access restrictions it wants.

[snip]

> PIDs however are addressable with kill(2) even with hidepid enabled.
> For good reason, the capabilities you can exercise using a PID is
> limited in that case. The same logic applies to a process reference
> like pidfd.

Sure. I'm not proposing a mechanism that would grant anyone additional
access to anything --- I'm just suggesting that we provide a way to
"directly" open a procfs dirfd instead of having people parse an
fdinfo text file.

> I agree there should be some way (if you can take care of the hidepid
> gotcha) to return a dir fd of /proc entry, but I don't think it should
> be the pidfd itself.

It's fine that pidfds aren't procfs directory FDs so long as there's a
race-free way to go from the former to the latter.

> You can get it to work using the fdinfo thing for
> now.
>
> > > If you really need this, the pid this pidfd is mapped to can be
> > > exposed through fdinfo (which is perfectly fine for your case as you
> > > use /proc anyway). This means you can reliably determine if you're
> > > opening it for the same pid (and it hasn't been recycled) because this
> > > pidfd will be pollable for state change (in the future). Exposing it
> > > through fdinfo isn't a problem, pid ns is bound to the process during
> > > its lifetime, it's a process creation time property, so the value
> > > isn't going to change anyway. So you can do all metadata access you
> > > want subsequently.
> >
> > Thanks for the proposal. I'm a bit confused. Are you suggesting that
> > we report the numeric FD in fdinfo? Are you saying it should work
> > basically like this?
> >
>
> Yes, numeric PID that you would see from your namespace for the said
> process the pidfd is for.
> It could be -1 if this process is not in any
> of the namespaces (current or children) (which would happen, say if
> you pass it to some other pidns residing process, during fd
> installation on ithe target process we set that up properly).
>
> > int get_oom_score_adj(int pidfd) {
> >   unique_fd fdinfo_fd = open(fmt("/proc/self/fdinfo/%d", pidfd));
> >   string fdinfo = read_all(fdinfo_fd);
> >   numeric_pid = parse_fdinfo_for_line("pid");
> >   unique_fd procdir_fd = open(fmt("/proc/%d", numeric_pid), O_DIRECTORY);
> >   if(!is_pidfd_alive(pidfd)) { return -1; /* process died */ } /*
> > LIVENESS CHECK */
> >   // We know the process named by pidfd was called NUMERIC_PID
> >   // in our namespace (because fdinfo told us) and we know that the
> > named process
> >   // was alive after we successfully opened its /proc/pid directory, therefore,
> >   // PROCDIR_FD and PIDFD must refer to the same underlying process.
> >   unique_fd oom_adj_score_fd = openat(procdir_fd, "oom_score_adj");
> >   return parse_int(read_all(oom_adj_score_fd));
> > }
> >
> > While this interface functions correctly, I have two concerns: 1) the
> > number of system calls necessary is very large -- by my count,
> > starting with the pifd, we need six, not counting the follow-on
> > close(2) calls (which would bring the total to nine[1]),
>
> But hey, that's a bit rich if you're planning to collect metadata from
> /proc ;-).

Depends on which interface --- reading something like oom_score_adj is
pretty cheap.

> > and 2) it's
> > "fail unsafe": IMHO, most users in practice will skip the line marked
> > "LIVENESS CHECK", and as a result, their code will appear to work but
> > contain subtle race conditions. An explicit interface to translate
> > from a (PIDFD, PROCFS_ROOT) tuple to a /proc/pid directory file
> > descriptor would be both more efficient and fail-safe.
> >
> > [1] as a separate matter, it'd be nice to have a batch version of close(2).
>
> Since /proc is full of gunk,

People keep saying /proc is bad, but I haven't seen any serious
proposals for a clean replacement. :-)

> how about adding more to it and making
> the magic symlink of /proc/self/fd for the pidfd to lead to the dirfd
> of the /proc entry of the process it maps to, when one uses
> O_DIRECTORY while opening it? Otherwise, it behaves as it does today.
> It would be equivalent to opening the proc entry with usual access
> restrictions (and hidepid made to work) but without the races, and
> because for processes outside your and children pid ns, it shouldn't
> work anyway, and since they wouldn't have their entry on this procfs
> instance, it would all just fit in nicely?

Thanks. That'll work. It's a bit magical, but /proc/self/fd is magical
anyway, so that's okay.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 0/4] pid: add pidctl()
  2019-03-25 17:36   ` Joel Fernandes
  2019-03-25 17:53     ` Daniel Colascione
@ 2019-03-25 20:15     ` Christian Brauner
  2019-03-25 21:11       ` Joel Fernandes
  1 sibling, 1 reply; 42+ messages in thread
From: Christian Brauner @ 2019-03-25 20:15 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: Daniel Colascione, Jann Horn, khlebnikov, Andy Lutomirski,
	David Howells, Serge E. Hallyn, Eric W. Biederman, Linux API,
	linux-kernel, Arnd Bergmann, Kees Cook, Alexey Dobriyan,
	Thomas Gleixner, Michael Kerrisk-manpages, bl0pbl33p,
	Dmitry V. Levin, Andrew Morton, Oleg Nesterov,
	nagarathnam.muthusamy, Aleksa Sarai, Al Viro

On Mon, Mar 25, 2019 at 01:36:14PM -0400, Joel Fernandes wrote:
> On Mon, Mar 25, 2019 at 09:48:43AM -0700, Daniel Colascione wrote:
> > On Mon, Mar 25, 2019 at 9:21 AM Christian Brauner <christian@brauner.io> wrote:
> > > The pidctl() syscalls builds on, extends, and improves translate_pid() [4].
> > > I quote Konstantins original patchset first that has already been acked and
> > > picked up by Eric before and whose functionality is preserved in this
> > > syscall. Multiple people have asked when this patchset will be sent in
> > > for merging (cf. [1], [2]). It has recently been revived by Nagarathnam
> > > Muthusamy from Oracle [3].
> > >
> > > The intention of the original translate_pid() syscall was twofold:
> > > 1. Provide translation of pids between pid namespaces
> > > 2. Provide implicit pid namespace introspection
> > >
> > > Both functionalities are preserved. The latter task has been improved
> > > upon though. In the original version of the pachset passing pid as 1
> > > would allow to deterimine the relationship between the pid namespaces.
> > > This is inherhently racy. If pid 1 inside a pid namespace has died it
> > > would report false negatives. For example, if pid 1 inside of the target
> > > pid namespace already died, it would report that the target pid
> > > namespace cannot be reached from the source pid namespace because it
> > > couldn't find the pid inside of the target pid namespace and thus
> > > falsely report to the user that the two pid namespaces are not related.
> > > This problem is simple to avoid. In the new version we simply walk the
> > > list of ancestors and check whether the namespace are related to each
> > > other. By doing it this way we can reliably report what the relationship
> > > between two pid namespace file descriptors looks like.
> > >
> > > Additionally, this syscall has been extended to allow the retrieval of
> > > pidfds independent of procfs. These pidfds can e.g. be used with the new
> > > pidfd_send_signal() syscall we recently merged. The ability to retrieve
> > > pidfds independent of procfs had already been requested in the
> > > pidfd_send_signal patchset by e.g. Andrew [4] and later again by Alexey
> > > [5]. A use-case where a kernel is compiled without procfs but where
> > > pidfds are still useful has been outlined by Andy in [6]. Regular
> > > anon-inode based file descriptors are used that stash a reference to
> > > struct pid in file->private_data and drop that reference on close.
> > >
> > > With this translate_pid() has three closely related but still distinct
> > > functionalities. To clarify the semantics and to make it easier for
> > > userspace to use the syscall it has:
> > > - gained a command argument and three commands clearly reflecting the
> > >   distinct functionalities (PIDCMD_QUERY_PID, PIDCMD_QUERY_PIDNS,
> > >   PIDCMD_GET_PIDFD).
> > > - been renamed to pidctl()
> > 
> [snip]
> > Also, I'm still confused about how metadata access is supposed to work
> > for these procfs-less pidfs. If I use PIDCMD_GET_PIDFD on a process,
> > You snipped out a portion of a previous email in which I asked about
> > your thoughts on this question. With the PIDCMD_GET_PIDFD command in
> > place, we have two different kinds of file descriptors for processes,
> > one derived from procfs and one that's independent. The former works
> > with openat(2). The latter does not. To be very specific; if I'm
> > writing a function that accepts a pidfd and I get a pidfd that comes
> > from PIDCMD_GET_PIDFD, how am I supposed to get the equivalent of
> > smaps or oom_score_adj or statm for the named process in a race-free
> > manner?
> 
> This is true, that such usecase will not be supportable.  But the advantage
> on the other hand, is that suchs "pidfd" can be made pollable or readable in
> the future. Potentially allowing us to return exit status without a new
> syscall (?). And we can add IOCTLs to the pidfd descriptor which we cannot do
> with proc.
> 
> But.. one thing we could do for Daniel usecase is if a /proc/pid directory fd
> can be translated into a "pidfd" using another syscall or even a node, like
> /proc/pid/handle or something. I think this is what Christian suggested in
> the previous threads.

Andy - and Jann who I just talked to - have proposed solutions for this.
Jann's idea is similar to what you suggested, Joel. You could e.g. do an
ioctl() handler for /proc that would give you a dirfd back for a given
pidfd. The advantage is that pidfd_clone() can then give back pidfds
without having to care in what procfs the process is supposed to live.
That makes things a lot easier. But pidfds for the general case should
be anon inodes. It's clean, it's simple and it is way more secure.

> 
> And also for the translation the other way, add a syscall or modify
> translate_fd or something, to covert a anon_inode pidfd into a /proc/pid
> directory fd. Then the user is welcomed to do openat(2) on _that_ directory fd.
> Then we modify pidfd_send_signal to only send signals to pure pidfd fds, not
> to /proc/pid directory fds.
> 
> Should we work on patches for these? Please let us know if this idea makes
> sense and thanks a lot for adding us to the review as well.
> 
> Best,
> 
>  - Joel

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 0/4] pid: add pidctl()
  2019-03-25 20:14             ` Daniel Colascione
@ 2019-03-25 20:34               ` Jann Horn
  2019-03-25 20:40                 ` Jonathan Kowalski
  2019-03-25 20:40                 ` Christian Brauner
  0 siblings, 2 replies; 42+ messages in thread
From: Jann Horn @ 2019-03-25 20:34 UTC (permalink / raw)
  To: Daniel Colascione
  Cc: Jonathan Kowalski, Joel Fernandes, Christian Brauner,
	Konstantin Khlebnikov, Andy Lutomirski, David Howells,
	Serge E. Hallyn, Eric W. Biederman, Linux API, linux-kernel,
	Arnd Bergmann, Kees Cook, Alexey Dobriyan, Thomas Gleixner,
	Michael Kerrisk-manpages, Dmitry V. Levin, Andrew Morton,
	Oleg Nesterov, Nagarathnam Muthusamy, Aleksa Sarai, Al Viro

On Mon, Mar 25, 2019 at 9:15 PM Daniel Colascione <dancol@google.com> wrote:
> On Mon, Mar 25, 2019 at 12:42 PM Jonathan Kowalski <bl0pbl33p@gmail.com> wrote:
> > On Mon, Mar 25, 2019 at 6:57 PM Daniel Colascione <dancol@google.com> wrote:
[...]
> > Yes, but everything in /proc is not equivalent to an attribute, or an
> > option, and depending on its configuration, you may not want to allow
> > processes to even be able to see /proc for any PIDs other than those
> > running as their own user (hidepid). This means, even if this new
> > system call is added, to respect hidepid, it must, depending on if
> > /proc is mounted (and what hidepid is set to, and what gid= is set
> > to), return EPERM, because then there is a discrepancy between how the
> > two entrypoints to acquire a process handle do access control.
>
> That's why I proposed that this translation mechanism accept a procfs
> root directory --- so you'd specify *which* procfs you want and let
> the kernel apply whatever hidepid access restrictions it wants.
[...]
> > > and 2) it's
> > > "fail unsafe": IMHO, most users in practice will skip the line marked
> > > "LIVENESS CHECK", and as a result, their code will appear to work but
> > > contain subtle race conditions. An explicit interface to translate
> > > from a (PIDFD, PROCFS_ROOT) tuple to a /proc/pid directory file
> > > descriptor would be both more efficient and fail-safe.
> > >
> > > [1] as a separate matter, it'd be nice to have a batch version of close(2).
> >
> > Since /proc is full of gunk,
>
> People keep saying /proc is bad, but I haven't seen any serious
> proposals for a clean replacement. :-)
>
> > how about adding more to it and making
> > the magic symlink of /proc/self/fd for the pidfd to lead to the dirfd
> > of the /proc entry of the process it maps to, when one uses
> > O_DIRECTORY while opening it? Otherwise, it behaves as it does today.
> > It would be equivalent to opening the proc entry with usual access
> > restrictions (and hidepid made to work) but without the races, and
> > because for processes outside your and children pid ns, it shouldn't
> > work anyway, and since they wouldn't have their entry on this procfs
> > instance, it would all just fit in nicely?
>
> Thanks. That'll work. It's a bit magical, but /proc/self/fd is magical
> anyway, so that's okay.

Please don't do that. /proc/$pid/fd refers to the set of file
descriptors the process has open, and semantically doesn't have much
to do with the identity of the process. If you want to have a procfs
directory entry for getting a pidfd, please add a new entry. (Although
I don't see the point in adding a new procfs entry for this when you
could instead have an ioctl or syscall operating on the procfs
directory fd.)

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 0/4] pid: add pidctl()
  2019-03-25 20:34               ` Jann Horn
@ 2019-03-25 20:40                 ` Jonathan Kowalski
  2019-03-25 21:14                   ` Jonathan Kowalski
  2019-03-25 21:15                   ` Jann Horn
  2019-03-25 20:40                 ` Christian Brauner
  1 sibling, 2 replies; 42+ messages in thread
From: Jonathan Kowalski @ 2019-03-25 20:40 UTC (permalink / raw)
  To: Jann Horn
  Cc: Daniel Colascione, Joel Fernandes, Christian Brauner,
	Konstantin Khlebnikov, Andy Lutomirski, David Howells,
	Serge E. Hallyn, Eric W. Biederman, Linux API, linux-kernel,
	Arnd Bergmann, Kees Cook, Alexey Dobriyan, Thomas Gleixner,
	Michael Kerrisk-manpages, Dmitry V. Levin, Andrew Morton,
	Oleg Nesterov, Nagarathnam Muthusamy, Aleksa Sarai, Al Viro

On Mon, Mar 25, 2019 at 8:34 PM Jann Horn <jannh@google.com> wrote:
>
>  [...SNIP...]
>
> Please don't do that. /proc/$pid/fd refers to the set of file
> descriptors the process has open, and semantically doesn't have much
> to do with the identity of the process. If you want to have a procfs
> directory entry for getting a pidfd, please add a new entry. (Although
> I don't see the point in adding a new procfs entry for this when you
> could instead have an ioctl or syscall operating on the procfs
> directory fd.)

There is no new entry. What I was saying (and I should have been
clearer) is that the existing entry for the fd when open'd with
O_DIRECTORY makes the kernel resolve the symlink to /proc/<PID> of the
process it maps to, so it would become:

  int dirfd = open("/proc/self/fd/3", O_DIRECTORY|O_CLOEXEC);

This also means you cannot cross the filesystem boundry, the said
process needs to have a visible entry (which would mean hidepid= and
gid= based access controls are honored), and you can only open the
dirfd of a process in the current ns (as the PID will not map to an
existent process if the pidfd maps to a process not in the same or
children pid ns, in fdinfo it lists -1 in the pid field (we might not
even need fdinfo anymore)).

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 0/4] pid: add pidctl()
  2019-03-25 20:34               ` Jann Horn
  2019-03-25 20:40                 ` Jonathan Kowalski
@ 2019-03-25 20:40                 ` Christian Brauner
  1 sibling, 0 replies; 42+ messages in thread
From: Christian Brauner @ 2019-03-25 20:40 UTC (permalink / raw)
  To: Jann Horn
  Cc: Daniel Colascione, Jonathan Kowalski, Joel Fernandes,
	Konstantin Khlebnikov, Andy Lutomirski, David Howells,
	Serge E. Hallyn, Eric W. Biederman, Linux API, linux-kernel,
	Arnd Bergmann, Kees Cook, Alexey Dobriyan, Thomas Gleixner,
	Michael Kerrisk-manpages, Dmitry V. Levin, Andrew Morton,
	Oleg Nesterov, Nagarathnam Muthusamy, Aleksa Sarai, Al Viro

On Mon, Mar 25, 2019 at 09:34:00PM +0100, Jann Horn wrote:
> On Mon, Mar 25, 2019 at 9:15 PM Daniel Colascione <dancol@google.com> wrote:
> > On Mon, Mar 25, 2019 at 12:42 PM Jonathan Kowalski <bl0pbl33p@gmail.com> wrote:
> > > On Mon, Mar 25, 2019 at 6:57 PM Daniel Colascione <dancol@google.com> wrote:
> [...]
> > > Yes, but everything in /proc is not equivalent to an attribute, or an
> > > option, and depending on its configuration, you may not want to allow
> > > processes to even be able to see /proc for any PIDs other than those
> > > running as their own user (hidepid). This means, even if this new
> > > system call is added, to respect hidepid, it must, depending on if
> > > /proc is mounted (and what hidepid is set to, and what gid= is set
> > > to), return EPERM, because then there is a discrepancy between how the
> > > two entrypoints to acquire a process handle do access control.
> >
> > That's why I proposed that this translation mechanism accept a procfs
> > root directory --- so you'd specify *which* procfs you want and let
> > the kernel apply whatever hidepid access restrictions it wants.
> [...]
> > > > and 2) it's
> > > > "fail unsafe": IMHO, most users in practice will skip the line marked
> > > > "LIVENESS CHECK", and as a result, their code will appear to work but
> > > > contain subtle race conditions. An explicit interface to translate
> > > > from a (PIDFD, PROCFS_ROOT) tuple to a /proc/pid directory file
> > > > descriptor would be both more efficient and fail-safe.
> > > >
> > > > [1] as a separate matter, it'd be nice to have a batch version of close(2).
> > >
> > > Since /proc is full of gunk,
> >
> > People keep saying /proc is bad, but I haven't seen any serious
> > proposals for a clean replacement. :-)
> >
> > > how about adding more to it and making
> > > the magic symlink of /proc/self/fd for the pidfd to lead to the dirfd
> > > of the /proc entry of the process it maps to, when one uses
> > > O_DIRECTORY while opening it? Otherwise, it behaves as it does today.
> > > It would be equivalent to opening the proc entry with usual access
> > > restrictions (and hidepid made to work) but without the races, and
> > > because for processes outside your and children pid ns, it shouldn't
> > > work anyway, and since they wouldn't have their entry on this procfs
> > > instance, it would all just fit in nicely?
> >
> > Thanks. That'll work. It's a bit magical, but /proc/self/fd is magical
> > anyway, so that's okay.
> 
> Please don't do that. /proc/$pid/fd refers to the set of file
> descriptors the process has open, and semantically doesn't have much
> to do with the identity of the process. If you want to have a procfs
> directory entry for getting a pidfd, please add a new entry. (Although
> I don't see the point in adding a new procfs entry for this when you
> could instead have an ioctl or syscall operating on the procfs
> directory fd.)

Very much agreed!

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 0/4] pid: add pidctl()
  2019-03-25 20:15     ` Christian Brauner
@ 2019-03-25 21:11       ` Joel Fernandes
  2019-03-25 21:17         ` Daniel Colascione
  2019-03-25 21:19         ` Jann Horn
  0 siblings, 2 replies; 42+ messages in thread
From: Joel Fernandes @ 2019-03-25 21:11 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Daniel Colascione, Jann Horn, khlebnikov, Andy Lutomirski,
	David Howells, Serge E. Hallyn, Eric W. Biederman, Linux API,
	linux-kernel, Arnd Bergmann, Kees Cook, Alexey Dobriyan,
	Thomas Gleixner, Michael Kerrisk-manpages, bl0pbl33p,
	Dmitry V. Levin, Andrew Morton, Oleg Nesterov,
	nagarathnam.muthusamy, Aleksa Sarai, Al Viro

On Mon, Mar 25, 2019 at 09:15:45PM +0100, Christian Brauner wrote:
> On Mon, Mar 25, 2019 at 01:36:14PM -0400, Joel Fernandes wrote:
> > On Mon, Mar 25, 2019 at 09:48:43AM -0700, Daniel Colascione wrote:
> > > On Mon, Mar 25, 2019 at 9:21 AM Christian Brauner <christian@brauner.io> wrote:
> > > > The pidctl() syscalls builds on, extends, and improves translate_pid() [4].
> > > > I quote Konstantins original patchset first that has already been acked and
> > > > picked up by Eric before and whose functionality is preserved in this
> > > > syscall. Multiple people have asked when this patchset will be sent in
> > > > for merging (cf. [1], [2]). It has recently been revived by Nagarathnam
> > > > Muthusamy from Oracle [3].
> > > >
> > > > The intention of the original translate_pid() syscall was twofold:
> > > > 1. Provide translation of pids between pid namespaces
> > > > 2. Provide implicit pid namespace introspection
> > > >
> > > > Both functionalities are preserved. The latter task has been improved
> > > > upon though. In the original version of the pachset passing pid as 1
> > > > would allow to deterimine the relationship between the pid namespaces.
> > > > This is inherhently racy. If pid 1 inside a pid namespace has died it
> > > > would report false negatives. For example, if pid 1 inside of the target
> > > > pid namespace already died, it would report that the target pid
> > > > namespace cannot be reached from the source pid namespace because it
> > > > couldn't find the pid inside of the target pid namespace and thus
> > > > falsely report to the user that the two pid namespaces are not related.
> > > > This problem is simple to avoid. In the new version we simply walk the
> > > > list of ancestors and check whether the namespace are related to each
> > > > other. By doing it this way we can reliably report what the relationship
> > > > between two pid namespace file descriptors looks like.
> > > >
> > > > Additionally, this syscall has been extended to allow the retrieval of
> > > > pidfds independent of procfs. These pidfds can e.g. be used with the new
> > > > pidfd_send_signal() syscall we recently merged. The ability to retrieve
> > > > pidfds independent of procfs had already been requested in the
> > > > pidfd_send_signal patchset by e.g. Andrew [4] and later again by Alexey
> > > > [5]. A use-case where a kernel is compiled without procfs but where
> > > > pidfds are still useful has been outlined by Andy in [6]. Regular
> > > > anon-inode based file descriptors are used that stash a reference to
> > > > struct pid in file->private_data and drop that reference on close.
> > > >
> > > > With this translate_pid() has three closely related but still distinct
> > > > functionalities. To clarify the semantics and to make it easier for
> > > > userspace to use the syscall it has:
> > > > - gained a command argument and three commands clearly reflecting the
> > > >   distinct functionalities (PIDCMD_QUERY_PID, PIDCMD_QUERY_PIDNS,
> > > >   PIDCMD_GET_PIDFD).
> > > > - been renamed to pidctl()
> > > 
> > [snip]
> > > Also, I'm still confused about how metadata access is supposed to work
> > > for these procfs-less pidfs. If I use PIDCMD_GET_PIDFD on a process,
> > > You snipped out a portion of a previous email in which I asked about
> > > your thoughts on this question. With the PIDCMD_GET_PIDFD command in
> > > place, we have two different kinds of file descriptors for processes,
> > > one derived from procfs and one that's independent. The former works
> > > with openat(2). The latter does not. To be very specific; if I'm
> > > writing a function that accepts a pidfd and I get a pidfd that comes
> > > from PIDCMD_GET_PIDFD, how am I supposed to get the equivalent of
> > > smaps or oom_score_adj or statm for the named process in a race-free
> > > manner?
> > 
> > This is true, that such usecase will not be supportable.  But the advantage
> > on the other hand, is that suchs "pidfd" can be made pollable or readable in
> > the future. Potentially allowing us to return exit status without a new
> > syscall (?). And we can add IOCTLs to the pidfd descriptor which we cannot do
> > with proc.
> > 
> > But.. one thing we could do for Daniel usecase is if a /proc/pid directory fd
> > can be translated into a "pidfd" using another syscall or even a node, like
> > /proc/pid/handle or something. I think this is what Christian suggested in
> > the previous threads.
> 
> Andy - and Jann who I just talked to - have proposed solutions for this.
> Jann's idea is similar to what you suggested, Joel. You could e.g. do an
> ioctl() handler for /proc that would give you a dirfd back for a given
> pidfd. The advantage is that pidfd_clone() can then give back pidfds
> without having to care in what procfs the process is supposed to live.
> That makes things a lot easier. But pidfds for the general case should
> be anon inodes. It's clean, it's simple and it is way more secure.

That makes sense to me, it is clean and I agree let us do that.

Also for the "blocking on pid exit status" usecase, instead of adding a new
syscall like pidfd_wait, lets just make that a new IOCTL to the
file_operations of the anon_inode pidfd file. This will lets us specify
exactly what to wait on (wait on death or wait on zombie) and lets us avoid
having a new syscall and create new fd just for waiting. Let me know if you
disagree, but otherwise I am thinking of modifying my patches that way and
avoid adding a new syscall.

thanks!

 - Joel


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 0/4] pid: add pidctl()
  2019-03-25 20:40                 ` Jonathan Kowalski
@ 2019-03-25 21:14                   ` Jonathan Kowalski
  2019-03-25 21:15                   ` Jann Horn
  1 sibling, 0 replies; 42+ messages in thread
From: Jonathan Kowalski @ 2019-03-25 21:14 UTC (permalink / raw)
  To: Jann Horn
  Cc: Daniel Colascione, Joel Fernandes, Christian Brauner,
	Konstantin Khlebnikov, Andy Lutomirski, David Howells,
	Serge E. Hallyn, Eric W. Biederman, Linux API, linux-kernel,
	Arnd Bergmann, Kees Cook, Alexey Dobriyan, Thomas Gleixner,
	Michael Kerrisk-manpages, Dmitry V. Levin, Andrew Morton,
	Oleg Nesterov, Nagarathnam Muthusamy, Aleksa Sarai, Al Viro

Also,  extending this further, instead of new ioctl flags over to
translate a tidfd one might introduce later for thread targetted
signals (which would still be a pidfd in the struct pid terms, but
with a bit set in its reference to target the selected TID in
particular), you could resolve this neatly to the proc entry of the
task itself, which would be subject to restrictions similar to a
regular open call, minus all the races involved.

This also means you can get rid of having to support the /proc/<PID>
dir fd in pidfd_send_signal, because there is no incentive to, any
longer. The kernel now has just one pidfd object, well scoped in its
purpose, and this "feature" is tied to procfs itself, disabling which
takes away the feature as well.

Otherwise, the ioctl will be conditionally available and/or work only
when procfs is present, and you'd tie procfs to pidfds eternally as
ABI.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 0/4] pid: add pidctl()
  2019-03-25 20:40                 ` Jonathan Kowalski
  2019-03-25 21:14                   ` Jonathan Kowalski
@ 2019-03-25 21:15                   ` Jann Horn
  1 sibling, 0 replies; 42+ messages in thread
From: Jann Horn @ 2019-03-25 21:15 UTC (permalink / raw)
  To: Jonathan Kowalski
  Cc: Daniel Colascione, Joel Fernandes, Christian Brauner,
	Konstantin Khlebnikov, Andy Lutomirski, David Howells,
	Serge E. Hallyn, Eric W. Biederman, Linux API, linux-kernel,
	Arnd Bergmann, Kees Cook, Alexey Dobriyan, Thomas Gleixner,
	Michael Kerrisk-manpages, Dmitry V. Levin, Andrew Morton,
	Oleg Nesterov, Nagarathnam Muthusamy, Aleksa Sarai, Al Viro

On Mon, Mar 25, 2019 at 9:40 PM Jonathan Kowalski <bl0pbl33p@gmail.com> wrote:
> On Mon, Mar 25, 2019 at 8:34 PM Jann Horn <jannh@google.com> wrote:
> >
> >  [...SNIP...]
> >
> > Please don't do that. /proc/$pid/fd refers to the set of file
> > descriptors the process has open, and semantically doesn't have much
> > to do with the identity of the process. If you want to have a procfs
> > directory entry for getting a pidfd, please add a new entry. (Although
> > I don't see the point in adding a new procfs entry for this when you
> > could instead have an ioctl or syscall operating on the procfs
> > directory fd.)
>
> There is no new entry. What I was saying (and I should have been
> clearer) is that the existing entry for the fd when open'd with
> O_DIRECTORY makes the kernel resolve the symlink to /proc/<PID> of the
> process it maps to, so it would become:
>
>   int dirfd = open("/proc/self/fd/3", O_DIRECTORY|O_CLOEXEC);

That still seems really weird. This magically overloads O_DIRECTORY,
which means "fail if the thing is not a directory", to suddenly have
an entirely different meaning for one magical special type of file. On
top of that, unlike an ioctl or a new syscall, it doesn't convey
explicit intent and increases the risk of confused deputy issues.

> This also means you cannot cross the filesystem boundry, the said
> process needs to have a visible entry (which would mean hidepid= and
> gid= based access controls are honored), and you can only open the
> dirfd of a process in the current ns (as the PID will not map to an
> existent process if the pidfd maps to a process not in the same or
> children pid ns, in fdinfo it lists -1 in the pid field (we might not
> even need fdinfo anymore)).

AFAICS that doesn't have anything to do with whether you do this as a
syscall, as an ioctl, or as a jumped symlink. The kernel would have to
do the same security checks in any of those cases - only a classic,
non-jumped symlink would implicitly go through the existing permission
checks. And if you implement this with a non-jumped symlink, you get
races.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 0/4] pid: add pidctl()
  2019-03-25 21:11       ` Joel Fernandes
@ 2019-03-25 21:17         ` Daniel Colascione
  2019-03-25 21:19         ` Jann Horn
  1 sibling, 0 replies; 42+ messages in thread
From: Daniel Colascione @ 2019-03-25 21:17 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: Christian Brauner, Jann Horn, Konstantin Khlebnikov,
	Andy Lutomirski, David Howells, Serge E. Hallyn,
	Eric W. Biederman, Linux API, linux-kernel, Arnd Bergmann,
	Kees Cook, Alexey Dobriyan, Thomas Gleixner,
	Michael Kerrisk-manpages, Jonathan Kowalski, Dmitry V. Levin,
	Andrew Morton, Oleg Nesterov, Nagarathnam Muthusamy,
	Aleksa Sarai, Al Viro

On Mon, Mar 25, 2019 at 2:11 PM Joel Fernandes <joel@joelfernandes.org> wrote:
>
> On Mon, Mar 25, 2019 at 09:15:45PM +0100, Christian Brauner wrote:
> > On Mon, Mar 25, 2019 at 01:36:14PM -0400, Joel Fernandes wrote:
> > > On Mon, Mar 25, 2019 at 09:48:43AM -0700, Daniel Colascione wrote:
> > > > On Mon, Mar 25, 2019 at 9:21 AM Christian Brauner <christian@brauner.io> wrote:
> > > > > The pidctl() syscalls builds on, extends, and improves translate_pid() [4].
> > > > > I quote Konstantins original patchset first that has already been acked and
> > > > > picked up by Eric before and whose functionality is preserved in this
> > > > > syscall. Multiple people have asked when this patchset will be sent in
> > > > > for merging (cf. [1], [2]). It has recently been revived by Nagarathnam
> > > > > Muthusamy from Oracle [3].
> > > > >
> > > > > The intention of the original translate_pid() syscall was twofold:
> > > > > 1. Provide translation of pids between pid namespaces
> > > > > 2. Provide implicit pid namespace introspection
> > > > >
> > > > > Both functionalities are preserved. The latter task has been improved
> > > > > upon though. In the original version of the pachset passing pid as 1
> > > > > would allow to deterimine the relationship between the pid namespaces.
> > > > > This is inherhently racy. If pid 1 inside a pid namespace has died it
> > > > > would report false negatives. For example, if pid 1 inside of the target
> > > > > pid namespace already died, it would report that the target pid
> > > > > namespace cannot be reached from the source pid namespace because it
> > > > > couldn't find the pid inside of the target pid namespace and thus
> > > > > falsely report to the user that the two pid namespaces are not related.
> > > > > This problem is simple to avoid. In the new version we simply walk the
> > > > > list of ancestors and check whether the namespace are related to each
> > > > > other. By doing it this way we can reliably report what the relationship
> > > > > between two pid namespace file descriptors looks like.
> > > > >
> > > > > Additionally, this syscall has been extended to allow the retrieval of
> > > > > pidfds independent of procfs. These pidfds can e.g. be used with the new
> > > > > pidfd_send_signal() syscall we recently merged. The ability to retrieve
> > > > > pidfds independent of procfs had already been requested in the
> > > > > pidfd_send_signal patchset by e.g. Andrew [4] and later again by Alexey
> > > > > [5]. A use-case where a kernel is compiled without procfs but where
> > > > > pidfds are still useful has been outlined by Andy in [6]. Regular
> > > > > anon-inode based file descriptors are used that stash a reference to
> > > > > struct pid in file->private_data and drop that reference on close.
> > > > >
> > > > > With this translate_pid() has three closely related but still distinct
> > > > > functionalities. To clarify the semantics and to make it easier for
> > > > > userspace to use the syscall it has:
> > > > > - gained a command argument and three commands clearly reflecting the
> > > > >   distinct functionalities (PIDCMD_QUERY_PID, PIDCMD_QUERY_PIDNS,
> > > > >   PIDCMD_GET_PIDFD).
> > > > > - been renamed to pidctl()
> > > >
> > > [snip]
> > > > Also, I'm still confused about how metadata access is supposed to work
> > > > for these procfs-less pidfs. If I use PIDCMD_GET_PIDFD on a process,
> > > > You snipped out a portion of a previous email in which I asked about
> > > > your thoughts on this question. With the PIDCMD_GET_PIDFD command in
> > > > place, we have two different kinds of file descriptors for processes,
> > > > one derived from procfs and one that's independent. The former works
> > > > with openat(2). The latter does not. To be very specific; if I'm
> > > > writing a function that accepts a pidfd and I get a pidfd that comes
> > > > from PIDCMD_GET_PIDFD, how am I supposed to get the equivalent of
> > > > smaps or oom_score_adj or statm for the named process in a race-free
> > > > manner?
> > >
> > > This is true, that such usecase will not be supportable.  But the advantage
> > > on the other hand, is that suchs "pidfd" can be made pollable or readable in
> > > the future. Potentially allowing us to return exit status without a new
> > > syscall (?). And we can add IOCTLs to the pidfd descriptor which we cannot do
> > > with proc.
> > >
> > > But.. one thing we could do for Daniel usecase is if a /proc/pid directory fd
> > > can be translated into a "pidfd" using another syscall or even a node, like
> > > /proc/pid/handle or something. I think this is what Christian suggested in
> > > the previous threads.
> >
> > Andy - and Jann who I just talked to - have proposed solutions for this.
> > Jann's idea is similar to what you suggested, Joel. You could e.g. do an
> > ioctl() handler for /proc that would give you a dirfd back for a given
> > pidfd. The advantage is that pidfd_clone() can then give back pidfds
> > without having to care in what procfs the process is supposed to live.
> > That makes things a lot easier. But pidfds for the general case should
> > be anon inodes. It's clean, it's simple and it is way more secure.
>
> That makes sense to me, it is clean and I agree let us do that.
>
> Also for the "blocking on pid exit status" usecase, instead of adding a new
> syscall like pidfd_wait, lets just make that a new IOCTL to the

Please, no ioctls.

> file_operations of the anon_inode pidfd file. This will lets us specify
> exactly what to wait on (wait on death or wait on zombie) and lets us

I don't like per-open-file-description state. Ever try to set
O_NONBLOCK on standard input? It results in a broken terminal
configuration. pidfd wait mode would be similar. Processes and
intraprocess components share file descriptors all the time for
various reasons, and making the wait mode specific to the open file
description causes "spooky action at a distance" and bugs. If you need
a configurable wait mode, you should create a new open file
description that encodes that wait mode for its entire lifetime.

> avoid
> having a new syscall

Please stop using the "this lets us avoid making a new system call"
justification for interface design. System calls are cheap to add, and
going to lengths to avoid making a new system call frequently makes
interfaces worse in various ways.

> and create new fd just for waiting.

I think it's fine to make a new FD for waiting, especially if you only
need a new FD for a non-default wait mode.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 0/4] pid: add pidctl()
  2019-03-25 21:11       ` Joel Fernandes
  2019-03-25 21:17         ` Daniel Colascione
@ 2019-03-25 21:19         ` Jann Horn
  2019-03-25 21:43           ` Joel Fernandes
  1 sibling, 1 reply; 42+ messages in thread
From: Jann Horn @ 2019-03-25 21:19 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: Christian Brauner, Daniel Colascione, Konstantin Khlebnikov,
	Andy Lutomirski, David Howells, Serge E. Hallyn,
	Eric W. Biederman, Linux API, linux-kernel, Arnd Bergmann,
	Kees Cook, Alexey Dobriyan, Thomas Gleixner,
	Michael Kerrisk-manpages, Jonathan Kowalski, Dmitry V. Levin,
	Andrew Morton, Oleg Nesterov, Nagarathnam Muthusamy,
	Aleksa Sarai, Al Viro

On Mon, Mar 25, 2019 at 10:11 PM Joel Fernandes <joel@joelfernandes.org> wrote:
> On Mon, Mar 25, 2019 at 09:15:45PM +0100, Christian Brauner wrote:
> > On Mon, Mar 25, 2019 at 01:36:14PM -0400, Joel Fernandes wrote:
> > > On Mon, Mar 25, 2019 at 09:48:43AM -0700, Daniel Colascione wrote:
> > > > On Mon, Mar 25, 2019 at 9:21 AM Christian Brauner <christian@brauner.io> wrote:
> > > > > The pidctl() syscalls builds on, extends, and improves translate_pid() [4].
> > > > > I quote Konstantins original patchset first that has already been acked and
> > > > > picked up by Eric before and whose functionality is preserved in this
> > > > > syscall. Multiple people have asked when this patchset will be sent in
> > > > > for merging (cf. [1], [2]). It has recently been revived by Nagarathnam
> > > > > Muthusamy from Oracle [3].
> > > > >
> > > > > The intention of the original translate_pid() syscall was twofold:
> > > > > 1. Provide translation of pids between pid namespaces
> > > > > 2. Provide implicit pid namespace introspection
> > > > >
> > > > > Both functionalities are preserved. The latter task has been improved
> > > > > upon though. In the original version of the pachset passing pid as 1
> > > > > would allow to deterimine the relationship between the pid namespaces.
> > > > > This is inherhently racy. If pid 1 inside a pid namespace has died it
> > > > > would report false negatives. For example, if pid 1 inside of the target
> > > > > pid namespace already died, it would report that the target pid
> > > > > namespace cannot be reached from the source pid namespace because it
> > > > > couldn't find the pid inside of the target pid namespace and thus
> > > > > falsely report to the user that the two pid namespaces are not related.
> > > > > This problem is simple to avoid. In the new version we simply walk the
> > > > > list of ancestors and check whether the namespace are related to each
> > > > > other. By doing it this way we can reliably report what the relationship
> > > > > between two pid namespace file descriptors looks like.
> > > > >
> > > > > Additionally, this syscall has been extended to allow the retrieval of
> > > > > pidfds independent of procfs. These pidfds can e.g. be used with the new
> > > > > pidfd_send_signal() syscall we recently merged. The ability to retrieve
> > > > > pidfds independent of procfs had already been requested in the
> > > > > pidfd_send_signal patchset by e.g. Andrew [4] and later again by Alexey
> > > > > [5]. A use-case where a kernel is compiled without procfs but where
> > > > > pidfds are still useful has been outlined by Andy in [6]. Regular
> > > > > anon-inode based file descriptors are used that stash a reference to
> > > > > struct pid in file->private_data and drop that reference on close.
> > > > >
> > > > > With this translate_pid() has three closely related but still distinct
> > > > > functionalities. To clarify the semantics and to make it easier for
> > > > > userspace to use the syscall it has:
> > > > > - gained a command argument and three commands clearly reflecting the
> > > > >   distinct functionalities (PIDCMD_QUERY_PID, PIDCMD_QUERY_PIDNS,
> > > > >   PIDCMD_GET_PIDFD).
> > > > > - been renamed to pidctl()
> > > >
> > > [snip]
> > > > Also, I'm still confused about how metadata access is supposed to work
> > > > for these procfs-less pidfs. If I use PIDCMD_GET_PIDFD on a process,
> > > > You snipped out a portion of a previous email in which I asked about
> > > > your thoughts on this question. With the PIDCMD_GET_PIDFD command in
> > > > place, we have two different kinds of file descriptors for processes,
> > > > one derived from procfs and one that's independent. The former works
> > > > with openat(2). The latter does not. To be very specific; if I'm
> > > > writing a function that accepts a pidfd and I get a pidfd that comes
> > > > from PIDCMD_GET_PIDFD, how am I supposed to get the equivalent of
> > > > smaps or oom_score_adj or statm for the named process in a race-free
> > > > manner?
> > >
> > > This is true, that such usecase will not be supportable.  But the advantage
> > > on the other hand, is that suchs "pidfd" can be made pollable or readable in
> > > the future. Potentially allowing us to return exit status without a new
> > > syscall (?). And we can add IOCTLs to the pidfd descriptor which we cannot do
> > > with proc.
> > >
> > > But.. one thing we could do for Daniel usecase is if a /proc/pid directory fd
> > > can be translated into a "pidfd" using another syscall or even a node, like
> > > /proc/pid/handle or something. I think this is what Christian suggested in
> > > the previous threads.
> >
> > Andy - and Jann who I just talked to - have proposed solutions for this.
> > Jann's idea is similar to what you suggested, Joel. You could e.g. do an
> > ioctl() handler for /proc that would give you a dirfd back for a given
> > pidfd. The advantage is that pidfd_clone() can then give back pidfds
> > without having to care in what procfs the process is supposed to live.
> > That makes things a lot easier. But pidfds for the general case should
> > be anon inodes. It's clean, it's simple and it is way more secure.
>
> That makes sense to me, it is clean and I agree let us do that.
>
> Also for the "blocking on pid exit status" usecase, instead of adding a new
> syscall like pidfd_wait, lets just make that a new IOCTL to the
> file_operations of the anon_inode pidfd file. This will lets us specify
> exactly what to wait on (wait on death or wait on zombie) and lets us avoid
> having a new syscall and create new fd just for waiting. Let me know if you
> disagree, but otherwise I am thinking of modifying my patches that way and
> avoid adding a new syscall.

But often you don't just want to wait for a single thing to happen;
you want to wait for many things at once, and react as soon as any one
of them happens. This is why the kernel has epoll and all the other
"wait for event on FD" APIs. If waiting for a process isn't possible
with fd-based APIs like epoll, users of this API have to spin up
useless helper threads.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 0/4] pid: add pidctl()
  2019-03-25 21:19         ` Jann Horn
@ 2019-03-25 21:43           ` Joel Fernandes
  2019-03-25 21:54             ` Jonathan Kowalski
  0 siblings, 1 reply; 42+ messages in thread
From: Joel Fernandes @ 2019-03-25 21:43 UTC (permalink / raw)
  To: Jann Horn
  Cc: Christian Brauner, Daniel Colascione, Konstantin Khlebnikov,
	Andy Lutomirski, David Howells, Serge E. Hallyn,
	Eric W. Biederman, Linux API, linux-kernel, Arnd Bergmann,
	Kees Cook, Alexey Dobriyan, Thomas Gleixner,
	Michael Kerrisk-manpages, Jonathan Kowalski, Dmitry V. Levin,
	Andrew Morton, Oleg Nesterov, Nagarathnam Muthusamy,
	Aleksa Sarai, Al Viro

On Mon, Mar 25, 2019 at 10:19:26PM +0100, Jann Horn wrote:
> On Mon, Mar 25, 2019 at 10:11 PM Joel Fernandes <joel@joelfernandes.org> wrote:
> > On Mon, Mar 25, 2019 at 09:15:45PM +0100, Christian Brauner wrote:
> > > On Mon, Mar 25, 2019 at 01:36:14PM -0400, Joel Fernandes wrote:
> > > > On Mon, Mar 25, 2019 at 09:48:43AM -0700, Daniel Colascione wrote:
> > > > > On Mon, Mar 25, 2019 at 9:21 AM Christian Brauner <christian@brauner.io> wrote:
> > > > > > The pidctl() syscalls builds on, extends, and improves translate_pid() [4].
> > > > > > I quote Konstantins original patchset first that has already been acked and
> > > > > > picked up by Eric before and whose functionality is preserved in this
> > > > > > syscall. Multiple people have asked when this patchset will be sent in
> > > > > > for merging (cf. [1], [2]). It has recently been revived by Nagarathnam
> > > > > > Muthusamy from Oracle [3].
> > > > > >
> > > > > > The intention of the original translate_pid() syscall was twofold:
> > > > > > 1. Provide translation of pids between pid namespaces
> > > > > > 2. Provide implicit pid namespace introspection
> > > > > >
> > > > > > Both functionalities are preserved. The latter task has been improved
> > > > > > upon though. In the original version of the pachset passing pid as 1
> > > > > > would allow to deterimine the relationship between the pid namespaces.
> > > > > > This is inherhently racy. If pid 1 inside a pid namespace has died it
> > > > > > would report false negatives. For example, if pid 1 inside of the target
> > > > > > pid namespace already died, it would report that the target pid
> > > > > > namespace cannot be reached from the source pid namespace because it
> > > > > > couldn't find the pid inside of the target pid namespace and thus
> > > > > > falsely report to the user that the two pid namespaces are not related.
> > > > > > This problem is simple to avoid. In the new version we simply walk the
> > > > > > list of ancestors and check whether the namespace are related to each
> > > > > > other. By doing it this way we can reliably report what the relationship
> > > > > > between two pid namespace file descriptors looks like.
> > > > > >
> > > > > > Additionally, this syscall has been extended to allow the retrieval of
> > > > > > pidfds independent of procfs. These pidfds can e.g. be used with the new
> > > > > > pidfd_send_signal() syscall we recently merged. The ability to retrieve
> > > > > > pidfds independent of procfs had already been requested in the
> > > > > > pidfd_send_signal patchset by e.g. Andrew [4] and later again by Alexey
> > > > > > [5]. A use-case where a kernel is compiled without procfs but where
> > > > > > pidfds are still useful has been outlined by Andy in [6]. Regular
> > > > > > anon-inode based file descriptors are used that stash a reference to
> > > > > > struct pid in file->private_data and drop that reference on close.
> > > > > >
> > > > > > With this translate_pid() has three closely related but still distinct
> > > > > > functionalities. To clarify the semantics and to make it easier for
> > > > > > userspace to use the syscall it has:
> > > > > > - gained a command argument and three commands clearly reflecting the
> > > > > >   distinct functionalities (PIDCMD_QUERY_PID, PIDCMD_QUERY_PIDNS,
> > > > > >   PIDCMD_GET_PIDFD).
> > > > > > - been renamed to pidctl()
> > > > >
> > > > [snip]
> > > > > Also, I'm still confused about how metadata access is supposed to work
> > > > > for these procfs-less pidfs. If I use PIDCMD_GET_PIDFD on a process,
> > > > > You snipped out a portion of a previous email in which I asked about
> > > > > your thoughts on this question. With the PIDCMD_GET_PIDFD command in
> > > > > place, we have two different kinds of file descriptors for processes,
> > > > > one derived from procfs and one that's independent. The former works
> > > > > with openat(2). The latter does not. To be very specific; if I'm
> > > > > writing a function that accepts a pidfd and I get a pidfd that comes
> > > > > from PIDCMD_GET_PIDFD, how am I supposed to get the equivalent of
> > > > > smaps or oom_score_adj or statm for the named process in a race-free
> > > > > manner?
> > > >
> > > > This is true, that such usecase will not be supportable.  But the advantage
> > > > on the other hand, is that suchs "pidfd" can be made pollable or readable in
> > > > the future. Potentially allowing us to return exit status without a new
> > > > syscall (?). And we can add IOCTLs to the pidfd descriptor which we cannot do
> > > > with proc.
> > > >
> > > > But.. one thing we could do for Daniel usecase is if a /proc/pid directory fd
> > > > can be translated into a "pidfd" using another syscall or even a node, like
> > > > /proc/pid/handle or something. I think this is what Christian suggested in
> > > > the previous threads.
> > >
> > > Andy - and Jann who I just talked to - have proposed solutions for this.
> > > Jann's idea is similar to what you suggested, Joel. You could e.g. do an
> > > ioctl() handler for /proc that would give you a dirfd back for a given
> > > pidfd. The advantage is that pidfd_clone() can then give back pidfds
> > > without having to care in what procfs the process is supposed to live.
> > > That makes things a lot easier. But pidfds for the general case should
> > > be anon inodes. It's clean, it's simple and it is way more secure.
> >
> > That makes sense to me, it is clean and I agree let us do that.
> >
> > Also for the "blocking on pid exit status" usecase, instead of adding a new
> > syscall like pidfd_wait, lets just make that a new IOCTL to the
> > file_operations of the anon_inode pidfd file. This will lets us specify
> > exactly what to wait on (wait on death or wait on zombie) and lets us avoid
> > having a new syscall and create new fd just for waiting. Let me know if you
> > disagree, but otherwise I am thinking of modifying my patches that way and
> > avoid adding a new syscall.
> 
> But often you don't just want to wait for a single thing to happen;
> you want to wait for many things at once, and react as soon as any one
> of them happens. This is why the kernel has epoll and all the other
> "wait for event on FD" APIs. If waiting for a process isn't possible
> with fd-based APIs like epoll, users of this API have to spin up
> useless helper threads.

This is true. I almost forgot about the polling requirement, sorry. So then a
new syscall it is.. About what to wait for, that can be a separate parameter
to pidfd_wait then.

Thanks.

 - Joel


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 0/4] pid: add pidctl()
  2019-03-25 21:43           ` Joel Fernandes
@ 2019-03-25 21:54             ` Jonathan Kowalski
  2019-03-25 22:07               ` Daniel Colascione
  2019-03-26  3:03               ` Joel Fernandes
  0 siblings, 2 replies; 42+ messages in thread
From: Jonathan Kowalski @ 2019-03-25 21:54 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: Jann Horn, Christian Brauner, Daniel Colascione,
	Konstantin Khlebnikov, Andy Lutomirski, David Howells,
	Serge E. Hallyn, Eric W. Biederman, Linux API, linux-kernel,
	Arnd Bergmann, Kees Cook, Alexey Dobriyan, Thomas Gleixner,
	Michael Kerrisk-manpages, Dmitry V. Levin, Andrew Morton,
	Oleg Nesterov, Nagarathnam Muthusamy, Aleksa Sarai, Al Viro

On Mon, Mar 25, 2019 at 9:43 PM Joel Fernandes <joel@joelfernandes.org> wrote:
>
> On Mon, Mar 25, 2019 at 10:19:26PM +0100, Jann Horn wrote:
> > On Mon, Mar 25, 2019 at 10:11 PM Joel Fernandes <joel@joelfernandes.org> wrote:
> >
> > But often you don't just want to wait for a single thing to happen;
> > you want to wait for many things at once, and react as soon as any one
> > of them happens. This is why the kernel has epoll and all the other
> > "wait for event on FD" APIs. If waiting for a process isn't possible
> > with fd-based APIs like epoll, users of this API have to spin up
> > useless helper threads.
>
> This is true. I almost forgot about the polling requirement, sorry. So then a
> new syscall it is.. About what to wait for, that can be a separate parameter
> to pidfd_wait then.
>

This introduces a time window where the process changes state between
"get pidfd" and "setup waitfd", it would be simpler if the pidfd
itself supported .poll and on termination the exit status was made
readable from the file descriptor.

Further, in the clone4 patchset, there was a mechanism to autoreap
such a process so that it does not interfere with waiting a process
does normally. How do you intend to handle this case if anyone except
the parent is wanting to *wait* on the process (a second process,
without reaping, so as to not disrupt any waiting in the parent), and
for things like libraries that finally can manage their own set of
process when pidfd_clone is added, by excluding this process from the
process's normal wait handling logic.

Moreover, should anyone except the parent process be allowed to open a
readable pidfd (or waitfd), without additional capabilities?

> Thanks.
>
>  - Joel
>

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 0/4] pid: add pidctl()
  2019-03-25 21:54             ` Jonathan Kowalski
@ 2019-03-25 22:07               ` Daniel Colascione
  2019-03-25 22:37                 ` Jonathan Kowalski
  2019-03-26  3:03               ` Joel Fernandes
  1 sibling, 1 reply; 42+ messages in thread
From: Daniel Colascione @ 2019-03-25 22:07 UTC (permalink / raw)
  To: Jonathan Kowalski
  Cc: Joel Fernandes, Jann Horn, Christian Brauner,
	Konstantin Khlebnikov, Andy Lutomirski, David Howells,
	Serge E. Hallyn, Eric W. Biederman, Linux API, linux-kernel,
	Arnd Bergmann, Kees Cook, Alexey Dobriyan, Thomas Gleixner,
	Michael Kerrisk-manpages, Dmitry V. Levin, Andrew Morton,
	Oleg Nesterov, Nagarathnam Muthusamy, Aleksa Sarai, Al Viro

On Mon, Mar 25, 2019 at 2:55 PM Jonathan Kowalski <bl0pbl33p@gmail.com> wrote:
>
> On Mon, Mar 25, 2019 at 9:43 PM Joel Fernandes <joel@joelfernandes.org> wrote:
> >
> > On Mon, Mar 25, 2019 at 10:19:26PM +0100, Jann Horn wrote:
> > > On Mon, Mar 25, 2019 at 10:11 PM Joel Fernandes <joel@joelfernandes.org> wrote:
> > >
> > > But often you don't just want to wait for a single thing to happen;
> > > you want to wait for many things at once, and react as soon as any one
> > > of them happens. This is why the kernel has epoll and all the other
> > > "wait for event on FD" APIs. If waiting for a process isn't possible
> > > with fd-based APIs like epoll, users of this API have to spin up
> > > useless helper threads.
> >
> > This is true. I almost forgot about the polling requirement, sorry. So then a
> > new syscall it is.. About what to wait for, that can be a separate parameter
> > to pidfd_wait then.
> >
>
> This introduces a time window where the process changes state between
> "get pidfd" and "setup waitfd", it would be simpler if the pidfd
> itself supported .poll and on termination the exit status was made
> readable from the file descriptor.

I don't see a race here. Process state moves in one direction, so
there's no race. If you want the poll to end when a process exits and
the process exits before you poll, the poll should finish immediately.
If the process exits before you even create the polling FD, whatever
creates the polling FD can fail with ESRCH, which is what usually
happens when you try to do anything with a process that's gone. Either
way, whatever's trying to set up the poll knows the state of the
process and there's no possibility of a missed wakeup.

> Further, in the clone4 patchset, there was a mechanism to autoreap
> such a process so that it does not interfere with waiting a process
> does normally. How do you intend to handle this case if anyone except
> the parent is wanting to *wait* on the process (a second process,
> without reaping, so as to not disrupt any waiting in the parent), and
> for things like libraries that finally can manage their own set of
> process when pidfd_clone is added, by excluding this process from the
> process's normal wait handling logic.

I think that discussion is best had around pidfd_clone. pidfd waiting
functionality shouldn't affect wait* in any way nor affect a zombie
transition or reaping.

> Moreover, should anyone except the parent process be allowed to open a
> readable pidfd (or waitfd), without additional capabilities?

That's a separate discussion. See
https://lore.kernel.org/lkml/CAKOZueussVwABQaC+O9fW+MZayccvttKQZfWg0hh-cZ+1ykXig@mail.gmail.com/,
where we talked about permissions extensively. Anyone can hammer on
/proc today or hammer on kill(pid, 0) to learn about a process
exiting, so anyone should be able to wait for a process to die --- it
just automates what anyone can do manually. What's left is the
question of who should be able to learn a process's exit code. As I
mentioned in the linked email, process exit codes are public on
FreeBSD, and the simplest option is to make them public in Linux too.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 0/4] pid: add pidctl()
  2019-03-25 22:07               ` Daniel Colascione
@ 2019-03-25 22:37                 ` Jonathan Kowalski
  2019-03-25 23:14                   ` Daniel Colascione
  0 siblings, 1 reply; 42+ messages in thread
From: Jonathan Kowalski @ 2019-03-25 22:37 UTC (permalink / raw)
  To: Daniel Colascione
  Cc: Joel Fernandes, Jann Horn, Christian Brauner,
	Konstantin Khlebnikov, Andy Lutomirski, David Howells,
	Serge E. Hallyn, Eric W. Biederman, Linux API, linux-kernel,
	Arnd Bergmann, Kees Cook, Alexey Dobriyan, Thomas Gleixner,
	Michael Kerrisk-manpages, Dmitry V. Levin, Andrew Morton,
	Oleg Nesterov, Nagarathnam Muthusamy, Aleksa Sarai, Al Viro

On Mon, Mar 25, 2019 at 10:07 PM Daniel Colascione <dancol@google.com> wrote:
>
> On Mon, Mar 25, 2019 at 2:55 PM Jonathan Kowalski <bl0pbl33p@gmail.com> wrote:
> >
> > On Mon, Mar 25, 2019 at 9:43 PM Joel Fernandes <joel@joelfernandes.org> wrote:
> > >
> > > On Mon, Mar 25, 2019 at 10:19:26PM +0100, Jann Horn wrote:
> > > > On Mon, Mar 25, 2019 at 10:11 PM Joel Fernandes <joel@joelfernandes.org> wrote:
> > > >
> > > > But often you don't just want to wait for a single thing to happen;
> > > > you want to wait for many things at once, and react as soon as any one
> > > > of them happens. This is why the kernel has epoll and all the other
> > > > "wait for event on FD" APIs. If waiting for a process isn't possible
> > > > with fd-based APIs like epoll, users of this API have to spin up
> > > > useless helper threads.
> > >
> > > This is true. I almost forgot about the polling requirement, sorry. So then a
> > > new syscall it is.. About what to wait for, that can be a separate parameter
> > > to pidfd_wait then.
> > >
> >
> > This introduces a time window where the process changes state between
> > "get pidfd" and "setup waitfd", it would be simpler if the pidfd
> > itself supported .poll and on termination the exit status was made
> > readable from the file descriptor.
>
> I don't see a race here. Process state moves in one direction, so
> there's no race. If you want the poll to end when a process exits and
> the process exits before you poll, the poll should finish immediately.
> If the process exits before you even create the polling FD, whatever
> creates the polling FD can fail with ESRCH, which is what usually
> happens when you try to do anything with a process that's gone. Either
> way, whatever's trying to set up the poll knows the state of the
> process and there's no possibility of a missed wakeup.
>

poll will possibly work and return immediately, but you won't be able
to read back anything, because for the kernel there will be no waiter
before you open one. If you make pidfd pollable and readable (for
parents, as an example), the poll ends immediately but there will
something to read from the fd.

> > Further, in the clone4 patchset, there was a mechanism to autoreap
> > such a process so that it does not interfere with waiting a process
> > does normally. How do you intend to handle this case if anyone except
> > the parent is wanting to *wait* on the process (a second process,
> > without reaping, so as to not disrupt any waiting in the parent), and
> > for things like libraries that finally can manage their own set of
> > process when pidfd_clone is added, by excluding this process from the
> > process's normal wait handling logic.
>
> I think that discussion is best had around pidfd_clone. pidfd waiting
> functionality shouldn't affect wait* in any way nor affect a zombie
> transition or reaping.

So this is more akin to waitfd and traditional wait* and friends, and
a single notification fd for possibly multiple children? I just wanted
to be sure one would (should) be able to use a pidfd obtained from
pidfd_clone without affecting the existing waitfd, and other
primitives. The same application's components should be able to host
their own set of children using such an API (think libraries) and
without affecting the process.

>
> > Moreover, should anyone except the parent process be allowed to open a
> > readable pidfd (or waitfd), without additional capabilities?
>
> That's a separate discussion. See
> https://lore.kernel.org/lkml/CAKOZueussVwABQaC+O9fW+MZayccvttKQZfWg0hh-cZ+1ykXig@mail.gmail.com/, where we talked about permissions extensively. Anyone can hammer on
> /proc today or hammer on kill(pid, 0) to learn about a process
> exiting, so anyone should be able to wait for a process to die --- it
> just automates what anyone can do manually. What's left is the
> question of who should be able to learn a process's exit code. As I
> mentioned in the linked email, process exit codes are public on
> FreeBSD, and the simplest option is to make them public in Linux too.

People might be using them in ways that convey information between the
parent and child something else shouldn't be able to know. They might
be relying on this assumption that it is private. I don't think
opening it up without requiring *some* privileges is safe.

Also, taking this further, if the structure being returned from a read
isn't just the exit code but something more elaborate (you have
mentioned siginfo previously), or even statistics like getrusage, I
would be concerned if anyone except those with CAP_SYS_PTRACE or so
should be able to obtain such a readable pidfd/waitfd.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 0/4] pid: add pidctl()
  2019-03-25 22:37                 ` Jonathan Kowalski
@ 2019-03-25 23:14                   ` Daniel Colascione
  0 siblings, 0 replies; 42+ messages in thread
From: Daniel Colascione @ 2019-03-25 23:14 UTC (permalink / raw)
  To: Jonathan Kowalski
  Cc: Joel Fernandes, Jann Horn, Christian Brauner,
	Konstantin Khlebnikov, Andy Lutomirski, David Howells,
	Serge E. Hallyn, Eric W. Biederman, Linux API, linux-kernel,
	Arnd Bergmann, Kees Cook, Alexey Dobriyan, Thomas Gleixner,
	Michael Kerrisk-manpages, Dmitry V. Levin, Andrew Morton,
	Oleg Nesterov, Nagarathnam Muthusamy, Aleksa Sarai, Al Viro

On Mon, Mar 25, 2019 at 3:37 PM Jonathan Kowalski <bl0pbl33p@gmail.com> wrote:
>
> On Mon, Mar 25, 2019 at 10:07 PM Daniel Colascione <dancol@google.com> wrote:
> >
> > On Mon, Mar 25, 2019 at 2:55 PM Jonathan Kowalski <bl0pbl33p@gmail.com> wrote:
> > >
> > > On Mon, Mar 25, 2019 at 9:43 PM Joel Fernandes <joel@joelfernandes.org> wrote:
> > > >
> > > > On Mon, Mar 25, 2019 at 10:19:26PM +0100, Jann Horn wrote:
> > > > > On Mon, Mar 25, 2019 at 10:11 PM Joel Fernandes <joel@joelfernandes.org> wrote:
> > > > >
> > > > > But often you don't just want to wait for a single thing to happen;
> > > > > you want to wait for many things at once, and react as soon as any one
> > > > > of them happens. This is why the kernel has epoll and all the other
> > > > > "wait for event on FD" APIs. If waiting for a process isn't possible
> > > > > with fd-based APIs like epoll, users of this API have to spin up
> > > > > useless helper threads.
> > > >
> > > > This is true. I almost forgot about the polling requirement, sorry. So then a
> > > > new syscall it is.. About what to wait for, that can be a separate parameter
> > > > to pidfd_wait then.
> > > >
> > >
> > > This introduces a time window where the process changes state between
> > > "get pidfd" and "setup waitfd", it would be simpler if the pidfd
> > > itself supported .poll and on termination the exit status was made
> > > readable from the file descriptor.
> >
> > I don't see a race here. Process state moves in one direction, so
> > there's no race. If you want the poll to end when a process exits and
> > the process exits before you poll, the poll should finish immediately.
> > If the process exits before you even create the polling FD, whatever
> > creates the polling FD can fail with ESRCH, which is what usually
> > happens when you try to do anything with a process that's gone. Either
> > way, whatever's trying to set up the poll knows the state of the
> > process and there's no possibility of a missed wakeup.
> >
>
> poll will possibly work and return immediately, but you won't be able
> to read back anything, because for the kernel there will be no waiter
> before you open one. If you make pidfd pollable and readable (for
> parents, as an example), the poll ends immediately but there will
> something to read from the fd.

You can lose exit status this way. That's not a problem, though,
because *without* synchronization between the exiting process and the
waiting process, the waiting process can lose the race no matter what,
and *with* synchronization, there's no change of losing the
information. (It's just like a condition variable.) Today,
synchronization between a parent and child is automatic (because
zombies), but direct children work reliably with pidfd too: the
descriptor that pidfd_clone (or whatever it ends up being called)
returns will *always* be created before the child process and so will
*always* have exit status information available, even with some
non-SIGCHLD option applied.

> > > Further, in the clone4 patchset, there was a mechanism to autoreap
> > > such a process so that it does not interfere with waiting a process
> > > does normally. How do you intend to handle this case if anyone except
> > > the parent is wanting to *wait* on the process (a second process,
> > > without reaping, so as to not disrupt any waiting in the parent), and
> > > for things like libraries that finally can manage their own set of
> > > process when pidfd_clone is added, by excluding this process from the
> > > process's normal wait handling logic.
> >
> > I think that discussion is best had around pidfd_clone. pidfd waiting
> > functionality shouldn't affect wait* in any way nor affect a zombie
> > transition or reaping.
>
> So this is more akin to waitfd and traditional wait* and friends, and
> a single notification fd for possibly multiple children?

I don't know what gave you this idea. I'm talking about a model in
which you wait for one child with one open file description.

> I just wanted
> to be sure one would (should) be able to use a pidfd obtained from
> pidfd_clone without affecting the existing waitfd, and other
> primitives.

I don't think the use of pidfd *waiting*, by itself, should affect the
current fork/wait/zombie model of process exit delivery. That is, by
default, a process should transition to the zombie state, send SIGCHLD
to its parent, and then get reaped at exactly the same times it does
today, with exactly the same semantics we have today. Pidfd waiting
should be an "add on".

> The same application's components should be able to host
> their own set of children using such an API (think libraries) and
> without affecting the process.

Agreed. To be clear, you're talking about something slightly different
from pure pidfd waiting. There should be some way to *create* a
process in such a way that it's not visible to wait(-1), doesn't
generate SIGCHLD, and so on. It would be great for libraries to create
and manage worker processes transparently, even in the presence of
some other thread sitting on wait(-1). I don't think the specific form
of this option affects how pidfd waiting works though. Do you?

> > > Moreover, should anyone except the parent process be allowed to open a
> > > readable pidfd (or waitfd), without additional capabilities?
> >
> > That's a separate discussion. See
> > https://lore.kernel.org/lkml/CAKOZueussVwABQaC+O9fW+MZayccvttKQZfWg0hh-cZ+1ykXig@mail.gmail.com/, where we talked about permissions extensively. Anyone can hammer on
> > /proc today or hammer on kill(pid, 0) to learn about a process
> > exiting, so anyone should be able to wait for a process to die --- it
> > just automates what anyone can do manually. What's left is the
> > question of who should be able to learn a process's exit code. As I
> > mentioned in the linked email, process exit codes are public on
> > FreeBSD, and the simplest option is to make them public in Linux too.
>
> People might be using them in ways that convey information between the
> parent and child something else shouldn't be able to know.

In theory, yes, there's an information leak. In practice, I don't
think that it actually matters. Can you think of a concrete security
hole that would be opened by allowing anyone to inspect process exit
status? Having thought of it, have you filed for a FreeBSD CVE? :-)
That is, I'd expect that any security hole *in practice* created by
allowing public inspection of error states would have already appeared
on systems that allow public notification of error states (since most
programs are portable), so I don't think there's a problem here in
practice.

I'd much rather make exit codes knowable publicly than not: declaring
that public exit status is public sidesteps a lot of weird setuid
issues --- see the thread I linked --- and makes things like pkill(1)
more useful --- because they could report whether a process died due
to the signal pkill sent or due to some other reason.

> They might
> be relying on this assumption that it is private. I don't think
> opening it up without requiring *some* privileges is safe.

/proc/pid/stat already reports exit codes to non-parents when 1) the
process whose stat is being read has exited but has not yet been
reaped, and 2) the reader would pass a ptrace(2) exit status on the
zombie (or would have, before the zombie became a zombie). A similar
model might work as a compromise option for us here: a policy of "as
broad as /proc/pid/stat" discloses no new information except in very
minor edge case of a child of a process with SIGCHLD set to SIG_IGN.

I would still prefer completely public status information for
simplicity and flexibility reasons though.

> Also, taking this further, if the structure being returned from a read
> isn't just the exit code but something more elaborate (you have
> mentioned siginfo previously), or even statistics like getrusage, I
> would be concerned if anyone except those with CAP_SYS_PTRACE or so
> should be able to obtain such a readable pidfd/waitfd.

It seems reasonable to provide different waiters with different levels
of access.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 0/4] pid: add pidctl()
  2019-03-25 16:56 ` David Howells
  2019-03-25 16:58   ` Daniel Colascione
@ 2019-03-25 23:39   ` Andy Lutomirski
  1 sibling, 0 replies; 42+ messages in thread
From: Andy Lutomirski @ 2019-03-25 23:39 UTC (permalink / raw)
  To: David Howells
  Cc: Daniel Colascione, Christian Brauner, Jann Horn,
	Konstantin Khlebnikov, Andy Lutomirski, Serge E. Hallyn,
	Eric W. Biederman, Linux API, linux-kernel, Arnd Bergmann,
	Kees Cook, Alexey Dobriyan, Thomas Gleixner,
	Michael Kerrisk-manpages, bl0pbl33p, Dmitry V. Levin,
	Andrew Morton, Oleg Nesterov, Nagarathnam Muthusamy,
	Aleksa Sarai, Al Viro, Joel Fernandes

On Mon, Mar 25, 2019 at 9:56 AM David Howells <dhowells@redhat.com> wrote:
>
> Daniel Colascione <dancol@google.com> wrote:
>
> > System calls are cheap.
>
> Only to a point.  x86_64 will have an issue when we hit syscall 512.  We're
> currently at 427.
>

I don't consider this to be a problem.  I have patches to make this
problem go away forever.  I just need to tidy them up a bit and
upstream them.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 0/4] pid: add pidctl()
  2019-03-25 21:54             ` Jonathan Kowalski
  2019-03-25 22:07               ` Daniel Colascione
@ 2019-03-26  3:03               ` Joel Fernandes
  1 sibling, 0 replies; 42+ messages in thread
From: Joel Fernandes @ 2019-03-26  3:03 UTC (permalink / raw)
  To: Jonathan Kowalski
  Cc: Jann Horn, Christian Brauner, Daniel Colascione,
	Konstantin Khlebnikov, Andy Lutomirski, David Howells,
	Serge E. Hallyn, Eric W. Biederman, Linux API, linux-kernel,
	Arnd Bergmann, Kees Cook, Alexey Dobriyan, Thomas Gleixner,
	Michael Kerrisk-manpages, Dmitry V. Levin, Andrew Morton,
	Oleg Nesterov, Nagarathnam Muthusamy, Aleksa Sarai, Al Viro

On Mon, Mar 25, 2019 at 09:54:58PM +0000, Jonathan Kowalski wrote:
> On Mon, Mar 25, 2019 at 9:43 PM Joel Fernandes <joel@joelfernandes.org> wrote:
> >
> > On Mon, Mar 25, 2019 at 10:19:26PM +0100, Jann Horn wrote:
> > > On Mon, Mar 25, 2019 at 10:11 PM Joel Fernandes <joel@joelfernandes.org> wrote:
> > >
> > > But often you don't just want to wait for a single thing to happen;
> > > you want to wait for many things at once, and react as soon as any one
> > > of them happens. This is why the kernel has epoll and all the other
> > > "wait for event on FD" APIs. If waiting for a process isn't possible
> > > with fd-based APIs like epoll, users of this API have to spin up
> > > useless helper threads.
> >
> > This is true. I almost forgot about the polling requirement, sorry. So then a
> > new syscall it is.. About what to wait for, that can be a separate parameter
> > to pidfd_wait then.
> >
> 
> This introduces a time window where the process changes state between
> "get pidfd" and "setup waitfd", it would be simpler if the pidfd
> itself supported .poll and on termination the exit status was made
> readable from the file descriptor.

It is much cleaner to add a new pidfd_wait syscall for this as discussed in
the other thread. Adding .poll directly to the pidfd would seem to complicate
the blocking configuration. Do we block until the task is a zombie or until
it is dead?  That is not possible to specify easily. Also if we need to
return other types of information from the pidfd, not just exit state, then
it is not clear whether blocking on a pidfd just purely on exit status makes
sense. It is much cleaner to add a new pidfd_wait syscall giving it a pidfd,
specify what to block on (EXIT_DEAD or EXIT_ZOMBIE or both) and then return a
wait fd that can be read/blocked and returning all the needed information on
unblock.

> Further, in the clone4 patchset, there was a mechanism to autoreap
> such a process so that it does not interfere with waiting a process
> does normally. How do you intend to handle this case if anyone except
> the parent is wanting to *wait* on the process (a second process,
> without reaping, so as to not disrupt any waiting in the parent), and
> for things like libraries that finally can manage their own set of
> process when pidfd_clone is added, by excluding this process from the
> process's normal wait handling logic.

The pidfd_wait logic being discussed does not depend on or affects the
autoreap behavior. This wait is not the traditional wait family of calls.
It is for just getting notified about reading all the exit state of a task.
Once I post the code, it will be clear. And I'll CC you..

thanks,

 - Joel


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 2/4] pid: add pidctl()
  2019-03-25 18:18   ` Jann Horn
  2019-03-25 19:58     ` Christian Brauner
@ 2019-03-26 16:07     ` Joel Fernandes
  2019-03-26 16:15       ` Christian Brauner
  1 sibling, 1 reply; 42+ messages in thread
From: Joel Fernandes @ 2019-03-26 16:07 UTC (permalink / raw)
  To: Jann Horn
  Cc: Christian Brauner, Andy Lutomirski, Konstantin Khlebnikov,
	David Howells, Serge E. Hallyn, Eric W. Biederman, Linux API,
	kernel list, Arnd Bergmann, Kees Cook, Alexey Dobriyan,
	Thomas Gleixner, Michael Kerrisk-manpages, bl0pbl33p,
	Dmitry V. Levin, Andrew Morton, Oleg Nesterov,
	Nagarathnam Muthusamy, cyphar, Al Viro, Daniel Colascione

On Mon, Mar 25, 2019 at 07:18:42PM +0100, Jann Horn wrote:
> On Mon, Mar 25, 2019 at 5:21 PM Christian Brauner <christian@brauner.io> wrote:
> > The pidctl() syscalls builds on, extends, and improves translate_pid() [4].
> > I quote Konstantins original patchset first that has already been acked and
> > picked up by Eric before and whose functionality is preserved in this
> > syscall:
> [...]
> > +
> > +static struct pid_namespace *get_pid_ns_by_fd(int fd)
> > +{
> > +       struct pid_namespace *pidns = ERR_PTR(-EINVAL);
> > +
> > +       if (fd >= 0) {
> > +#ifdef CONFIG_PID_NS
> > +               struct ns_common *ns;
> > +               struct file *file = proc_ns_fget(fd);
> > +               if (IS_ERR(file))
> > +                       return ERR_CAST(file);
> > +
> > +               ns = get_proc_ns(file_inode(file));
> > +               if (ns->ops->type == CLONE_NEWPID)
> > +                       pidns = get_pid_ns(
> > +                               container_of(ns, struct pid_namespace, ns));
> 
> This increments the refcount of the pidns...
> 
> > +
> > +               fput(file);
> > +#endif
> > +       } else {
> > +               pidns = task_active_pid_ns(current);
> 
> ... but this doesn't. That's pretty subtle; could you please put a
> comment on top of this function that points this out? Or even better,
> change the function to always take a reference, so that the caller
> doesn't have to worry about figuring this out.
> 
> > +       }
> > +
> > +       return pidns;
> > +}
> [...]
> > +SYSCALL_DEFINE5(pidctl, unsigned int, cmd, pid_t, pid, int, source, int, target,
> > +               unsigned int, flags)
> > +{
> > +       struct pid_namespace *source_ns = NULL, *target_ns = NULL;
> > +       struct pid *struct_pid;
> > +       pid_t result;
> > +
> > +       switch (cmd) {
> > +       case PIDCMD_QUERY_PIDNS:
> > +               if (pid != 0)
> > +                       return -EINVAL;
> > +               pid = 1;
> > +               /* fall through */
> > +       case PIDCMD_QUERY_PID:
> > +               if (flags != 0)
> > +                       return -EINVAL;
> > +               break;
> > +       case PIDCMD_GET_PIDFD:
> > +               if (flags & ~PIDCTL_CLOEXEC)
> > +                       return -EINVAL;
> > +               break;
> > +       default:
> > +               return -EOPNOTSUPP;
> > +       }
> > +
> > +       source_ns = get_pid_ns_by_fd(source);
> > +       result = PTR_ERR(source_ns);
> 
> I very much dislike using PTR_ERR() on pointers before checking
> whether they contain an error value or not. I understand that the
> result of this won't actually be used, but it still seems weird to
> have what is essentially a cast of a potentially valid pointer to a
> potentially smaller integer here.
> 
> Could you maybe move the PTR_ERR() into the error branch? Like so:
> 
> if (IS_ERR(source_ns)) {
>   result = PTR_ERR(source_ns);
>   goto err_source;
> }

FWIW, thought of mentioning that once the get_pid_ns_by_fd can be modified to
always take a reference on the ns, a further simplifcation here could be:

if (IS_ERR(source_ns)) {
	result = PTR_ERR(source_ns);
	source_ns = NULL;
	goto error;
}

if (IS_ERR(target_ns)) {
	result = PTR_ERR(target_ns);
	target_ns = NULL;
	goto error;
}

And the error patch can be simplified as well which also avoids the "if (target)"
issues Jan mentioned in the error path:

error:
	if (source_ns)
		put_pid_ns(source_ns);
	if (target_ns)
		put_pid_ns(target_ns);
	return result;

 
> > +       if (IS_ERR(source_ns))
> > +               goto err_source;
> > +
> > +       target_ns = get_pid_ns_by_fd(target);
> > +       result = PTR_ERR(target_ns);
> > +       if (IS_ERR(target_ns))
> > +               goto err_target;
> > +
> > +       if (cmd == PIDCMD_QUERY_PIDNS) {
> > +               result = pidns_related(source_ns, target_ns);
> > +       } else {
> > +               rcu_read_lock();
> > +               struct_pid = find_pid_ns(pid, source_ns);
> 
> find_pid_ns() doesn't take a reference on its return value, the return
> value is only pinned into memory by the RCU read-side critical
> section...
> 
> > +               result = struct_pid ? pid_nr_ns(struct_pid, target_ns) : -ESRCH;
> > +               rcu_read_unlock();
> 
> ... which ends here, making struct_pid a dangling pointer...
> 
> > +
> > +               if (cmd == PIDCMD_GET_PIDFD) {
> > +                       int cloexec = (flags & PIDCTL_CLOEXEC) ? O_CLOEXEC : 0;
> > +                       if (result > 0)
> > +                               result = pidfd_create_fd(struct_pid, cloexec);
> 
> ... and then here you continue to use struct_pid. That seems bogus.

Absolutely.

> > +                       else if (result == 0)
> > +                               result = -ENOENT;
> 
> You don't need to have flags for this for new syscalls, you can just
> make everything O_CLOEXEC; if someone wants to preserve an fd across
> execve(), they can just clear that bit with fcntl(). (I thiiink it was
> Andy Lutomirski who said that before regarding another syscall? But my
> memory of that is pretty foggy, might've been someone else.)

Agreed, I also was going to say the same, about the flags.

thanks,

 - Joel


> 
> > +               }
> > +       }
> > +
> > +       if (target)
> > +               put_pid_ns(target_ns);
> > +err_target:
> > +       if (source)
> > +               put_pid_ns(source_ns);
> 
> I think you probably intended to check for "if (target >= 0)" and "if
> (source >= 0)" instead of these conditions, matching the condition in
> get_pid_ns_by_fd()? The current code looks as if it will leave the
> refcount one too high when used with target==0 or source==0, and leave
> the refcount one too low when used with target==-1 or source==-1.
> Anyway, as stated above, I think it would be simpler to
> unconditionally take a reference instead.
> 
> > +err_source:
> > +       return result;
> > +}

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 2/4] pid: add pidctl()
  2019-03-26 16:07     ` Joel Fernandes
@ 2019-03-26 16:15       ` Christian Brauner
  0 siblings, 0 replies; 42+ messages in thread
From: Christian Brauner @ 2019-03-26 16:15 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: Jann Horn, Andy Lutomirski, Konstantin Khlebnikov, David Howells,
	Serge E. Hallyn, Eric W. Biederman, Linux API, kernel list,
	Arnd Bergmann, Kees Cook, Alexey Dobriyan, Thomas Gleixner,
	Michael Kerrisk-manpages, bl0pbl33p, Dmitry V. Levin,
	Andrew Morton, Oleg Nesterov, Nagarathnam Muthusamy, cyphar,
	Al Viro, Daniel Colascione

> Agreed, I also was going to say the same, about the flags.

Please review the updated version I just sent out.

Christian

^ permalink raw reply	[flat|nested] 42+ messages in thread

end of thread, other threads:[~2019-03-26 16:15 UTC | newest]

Thread overview: 42+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-03-25 16:20 [PATCH 0/4] pid: add pidctl() Christian Brauner
2019-03-25 16:20 ` [PATCH 1/4] Make anon_inodes unconditional Christian Brauner
2019-03-25 16:20 ` [PATCH 2/4] pid: add pidctl() Christian Brauner
2019-03-25 17:20   ` Mika Penttilä
2019-03-25 19:59     ` Christian Brauner
2019-03-25 18:18   ` Jann Horn
2019-03-25 19:58     ` Christian Brauner
2019-03-26 16:07     ` Joel Fernandes
2019-03-26 16:15       ` Christian Brauner
2019-03-25 16:20 ` [PATCH 3/4] signal: support pidctl() with pidfd_send_signal() Christian Brauner
2019-03-25 18:28   ` Jonathan Kowalski
2019-03-25 20:05     ` Christian Brauner
2019-03-25 18:39   ` Jann Horn
2019-03-25 19:41     ` Christian Brauner
2019-03-25 16:20 ` [PATCH 4/4] tests: add pidctl() tests Christian Brauner
2019-03-25 16:48 ` [PATCH 0/4] pid: add pidctl() Daniel Colascione
2019-03-25 17:05   ` Konstantin Khlebnikov
2019-03-25 17:07     ` Daniel Colascione
2019-03-25 17:36   ` Joel Fernandes
2019-03-25 17:53     ` Daniel Colascione
2019-03-25 18:19       ` Jonathan Kowalski
2019-03-25 18:57         ` Daniel Colascione
2019-03-25 19:42           ` Jonathan Kowalski
2019-03-25 20:14             ` Daniel Colascione
2019-03-25 20:34               ` Jann Horn
2019-03-25 20:40                 ` Jonathan Kowalski
2019-03-25 21:14                   ` Jonathan Kowalski
2019-03-25 21:15                   ` Jann Horn
2019-03-25 20:40                 ` Christian Brauner
2019-03-25 20:15     ` Christian Brauner
2019-03-25 21:11       ` Joel Fernandes
2019-03-25 21:17         ` Daniel Colascione
2019-03-25 21:19         ` Jann Horn
2019-03-25 21:43           ` Joel Fernandes
2019-03-25 21:54             ` Jonathan Kowalski
2019-03-25 22:07               ` Daniel Colascione
2019-03-25 22:37                 ` Jonathan Kowalski
2019-03-25 23:14                   ` Daniel Colascione
2019-03-26  3:03               ` Joel Fernandes
2019-03-25 16:56 ` David Howells
2019-03-25 16:58   ` Daniel Colascione
2019-03-25 23:39   ` Andy Lutomirski

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).