linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v1 0/4] pid: add pidctl()
@ 2019-03-26 15:55 Christian Brauner
  2019-03-26 15:55 ` [PATCH v1 1/4] Make anon_inodes unconditional Christian Brauner
                   ` (3 more replies)
  0 siblings, 4 replies; 29+ messages in thread
From: Christian Brauner @ 2019-03-26 15:55 UTC (permalink / raw)
  To: jannh, khlebnikov, luto, dhowells, serge, ebiederm, linux-api,
	linux-kernel
  Cc: arnd, keescook, adobriyan, tglx, mtk.manpages, bl0pbl33p, ldv,
	akpm, oleg, nagarathnam.muthusamy, cyphar, viro, joel, dancol,
	Christian Brauner

This is v1 of this patchset with various minor fixes which are listed in
the individual commits. Notably, pidfds are now O_CLOEXEC by default.

The pidctl() syscalls builds on, extends, and improves translate_pid()
[4] and serves as the natural connection between the pid-based and the
pidfd-based api.

I quote Konstantins original patchset first that has already been acked
and picked up by Eric before and whose functionality is preserved in
this syscall. Multiple people have asked when this patchset will be sent
in for merging (cf. [1], [2]). It has recently been revived by
Nagarathnam Muthusamy from Oracle [3].

The intention of the original translate_pid() syscall was twofold:
1. Provide translation of pids between pid namespaces especially for the
   case of deeply nested pid namespaces.
   The most obvious use-case is strace which has been waiting for this
   feature for a while.
2. Provide implicit pid namespace introspection

Both functionalities are preserved. The latter task has been improved
upon though. In the original version of the pachset passing pid as 1
would allow to deterimine the relationship between the pid namespaces.
This is inherhently racy. If pid 1 inside a pid namespace has died it
would report false negatives. For example, if pid 1 inside of the target
pid namespace already died, it would report that the target pid
namespace cannot be reached from the source pid namespace because it
couldn't find the pid inside of the target pid namespace and thus
falsely report to the user that the two pid namespaces are not related.
This problem is simple to avoid. In the new version we simply walk the
list of ancestors and check whether the namespace are related to each
other. By doing it this way we can reliably report what the relationship
between two pid namespace file descriptors looks like.

Additionally, this syscall has been extended to allow the retrieval of
pidfds independent of procfs. These pidfds can e.g. be used with the new
pidfd_send_signal() syscall we recently merged. The ability to retrieve
pidfds independent of procfs had already been requested in the
pidfd_send_signal patchset by e.g. Andrew [4] and later again by Alexey
[5]. A use-case where a kernel is compiled without procfs but where
pidfds are still useful has been outlined by Andy in [6]. Regular
anon-inode based file descriptors are used that stash a reference to
struct pid in file->private_data and drop that reference on close.

With this pidctl() has three closely related functionalities that
provide a natural connection between the pid-based and the pidfd-based
api. To clarify the semantics and to make it easier for userspace to use
the syscall it has a command argument and three commands clearly
reflecting the functionalities (PIDCMD_QUERY_PID, PIDCMD_QUERY_PIDNS,
PIDCMD_GET_PIDFD).

Embedding the retrieval of pidfds into this syscall has two main
advantages:
- pidctl provides a natural and clean connection between the traditional
  pid-based and the newer pidfd-based process API
- allows the retrieval of pidfds for other pid namespaces while
  enforcing that
  - the caller must have been given access to two file descriptors
    referring to target and source pid namespace
  - the source pid namespace must be an ancestor of the target pid
    namespace
  - the pid must be translatable from the source pid namespace into the
    target pid namespace

Note that this patchset also includes Al's and David's commit to make anon
inodes unconditional. The original intention is to make it possible to use
anon inodes in core vfs functions. pidctl() has the same requirement so
David suggested I sent this in alongside this patch. Both are informed of
this.

The syscall comes with extensive testing for all functionalities.

/* References */
[1]: https://lore.kernel.org/lkml/37b17950-b130-7933-99a1-4846c61c8555@oracle.com/
[2]: https://lore.kernel.org/lkml/20181109034919.GA21681@altlinux.org/
[3]: https://lore.kernel.org/lkml/37b17950-b130-7933-99a1-4846c61c8555@oracle.com/
[4]: 3eb39f47934f9d5a3027fe00d906a45fe3a15fad
[5]: https://lore.kernel.org/lkml/20190320203910.GA2842@avx2/
[6]: https://lore.kernel.org/lkml/CALCETrXO=V=+qEdLDVPf8eCgLZiB9bOTrUfe0V-U-tUZoeoRDA@mail.gmail.com/

Thanks!
Christian

Christian Brauner (3):
  pid: add pidctl()
  signal: support pidctl() with pidfd_send_signal()
  tests: add pidctl() tests

David Howells (1):
  Make anon_inodes unconditional

 arch/arm/kvm/Kconfig                        |   1 -
 arch/arm64/kvm/Kconfig                      |   1 -
 arch/mips/kvm/Kconfig                       |   1 -
 arch/powerpc/kvm/Kconfig                    |   1 -
 arch/s390/kvm/Kconfig                       |   1 -
 arch/x86/Kconfig                            |   1 -
 arch/x86/entry/syscalls/syscall_32.tbl      |   1 +
 arch/x86/entry/syscalls/syscall_64.tbl      |   1 +
 arch/x86/kvm/Kconfig                        |   1 -
 drivers/base/Kconfig                        |   1 -
 drivers/char/tpm/Kconfig                    |   1 -
 drivers/dma-buf/Kconfig                     |   1 -
 drivers/gpio/Kconfig                        |   1 -
 drivers/iio/Kconfig                         |   1 -
 drivers/infiniband/Kconfig                  |   1 -
 drivers/vfio/Kconfig                        |   1 -
 fs/Makefile                                 |   2 +-
 fs/notify/fanotify/Kconfig                  |   1 -
 fs/notify/inotify/Kconfig                   |   1 -
 include/linux/pid.h                         |   2 +
 include/linux/pid_namespace.h               |   8 +
 include/linux/syscalls.h                    |   2 +
 include/uapi/linux/wait.h                   |  14 +
 init/Kconfig                                |  10 -
 kernel/pid.c                                | 161 ++++++
 kernel/pid_namespace.c                      |  25 +
 kernel/signal.c                             |  29 +-
 kernel/sys_ni.c                             |   3 -
 tools/testing/selftests/pidfd/Makefile      |   2 +-
 tools/testing/selftests/pidfd/pidctl_test.c | 537 ++++++++++++++++++++
 30 files changed, 765 insertions(+), 48 deletions(-)
 create mode 100644 tools/testing/selftests/pidfd/pidctl_test.c

-- 
2.21.0


^ permalink raw reply	[flat|nested] 29+ messages in thread

* [PATCH v1 1/4] Make anon_inodes unconditional
  2019-03-26 15:55 [PATCH v1 0/4] pid: add pidctl() Christian Brauner
@ 2019-03-26 15:55 ` Christian Brauner
  2019-03-26 15:55 ` [PATCH v1 2/4] pid: add pidctl() Christian Brauner
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 29+ messages in thread
From: Christian Brauner @ 2019-03-26 15:55 UTC (permalink / raw)
  To: jannh, khlebnikov, luto, dhowells, serge, ebiederm, linux-api,
	linux-kernel
  Cc: arnd, keescook, adobriyan, tglx, mtk.manpages, bl0pbl33p, ldv,
	akpm, oleg, nagarathnam.muthusamy, cyphar, viro, joel, dancol,
	Christian Brauner

From: David Howells <dhowells@redhat.com>

Make the anon_inodes facility unconditional so that it can be used by core
VFS code and the pidctl() syscall.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
[christian@brauner.io: adapt commit message to mention pidctl()]
Signed-off-by: Christian Brauner <christian@brauner.io>
---
 arch/arm/kvm/Kconfig       |  1 -
 arch/arm64/kvm/Kconfig     |  1 -
 arch/mips/kvm/Kconfig      |  1 -
 arch/powerpc/kvm/Kconfig   |  1 -
 arch/s390/kvm/Kconfig      |  1 -
 arch/x86/Kconfig           |  1 -
 arch/x86/kvm/Kconfig       |  1 -
 drivers/base/Kconfig       |  1 -
 drivers/char/tpm/Kconfig   |  1 -
 drivers/dma-buf/Kconfig    |  1 -
 drivers/gpio/Kconfig       |  1 -
 drivers/iio/Kconfig        |  1 -
 drivers/infiniband/Kconfig |  1 -
 drivers/vfio/Kconfig       |  1 -
 fs/Makefile                |  2 +-
 fs/notify/fanotify/Kconfig |  1 -
 fs/notify/inotify/Kconfig  |  1 -
 init/Kconfig               | 10 ----------
 18 files changed, 1 insertion(+), 27 deletions(-)

diff --git a/arch/arm/kvm/Kconfig b/arch/arm/kvm/Kconfig
index 3f5320f46de2..f591026347a5 100644
--- a/arch/arm/kvm/Kconfig
+++ b/arch/arm/kvm/Kconfig
@@ -22,7 +22,6 @@ config KVM
 	bool "Kernel-based Virtual Machine (KVM) support"
 	depends on MMU && OF
 	select PREEMPT_NOTIFIERS
-	select ANON_INODES
 	select ARM_GIC
 	select ARM_GIC_V3
 	select ARM_GIC_V3_ITS
diff --git a/arch/arm64/kvm/Kconfig b/arch/arm64/kvm/Kconfig
index a3f85624313e..a67121d419a2 100644
--- a/arch/arm64/kvm/Kconfig
+++ b/arch/arm64/kvm/Kconfig
@@ -23,7 +23,6 @@ config KVM
 	depends on OF
 	select MMU_NOTIFIER
 	select PREEMPT_NOTIFIERS
-	select ANON_INODES
 	select HAVE_KVM_CPU_RELAX_INTERCEPT
 	select HAVE_KVM_ARCH_TLB_FLUSH_ALL
 	select KVM_MMIO
diff --git a/arch/mips/kvm/Kconfig b/arch/mips/kvm/Kconfig
index 4528bc9c3cb1..eac25aef21e0 100644
--- a/arch/mips/kvm/Kconfig
+++ b/arch/mips/kvm/Kconfig
@@ -21,7 +21,6 @@ config KVM
 	depends on MIPS_FP_SUPPORT
 	select EXPORT_UASM
 	select PREEMPT_NOTIFIERS
-	select ANON_INODES
 	select KVM_GENERIC_DIRTYLOG_READ_PROTECT
 	select HAVE_KVM_VCPU_ASYNC_IOCTL
 	select KVM_MMIO
diff --git a/arch/powerpc/kvm/Kconfig b/arch/powerpc/kvm/Kconfig
index bfdde04e4905..f53997a8ca62 100644
--- a/arch/powerpc/kvm/Kconfig
+++ b/arch/powerpc/kvm/Kconfig
@@ -20,7 +20,6 @@ if VIRTUALIZATION
 config KVM
 	bool
 	select PREEMPT_NOTIFIERS
-	select ANON_INODES
 	select HAVE_KVM_EVENTFD
 	select HAVE_KVM_VCPU_ASYNC_IOCTL
 	select SRCU
diff --git a/arch/s390/kvm/Kconfig b/arch/s390/kvm/Kconfig
index 767453faacfc..1816ee48eadd 100644
--- a/arch/s390/kvm/Kconfig
+++ b/arch/s390/kvm/Kconfig
@@ -21,7 +21,6 @@ config KVM
 	prompt "Kernel-based Virtual Machine (KVM) support"
 	depends on HAVE_KVM
 	select PREEMPT_NOTIFIERS
-	select ANON_INODES
 	select HAVE_KVM_CPU_RELAX_INTERCEPT
 	select HAVE_KVM_VCPU_ASYNC_IOCTL
 	select HAVE_KVM_EVENTFD
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index c1f9b3cf437c..18f2c954464e 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -44,7 +44,6 @@ config X86
 	#
 	select ACPI_LEGACY_TABLES_LOOKUP	if ACPI
 	select ACPI_SYSTEM_POWER_STATES_SUPPORT	if ACPI
-	select ANON_INODES
 	select ARCH_32BIT_OFF_T			if X86_32
 	select ARCH_CLOCKSOURCE_DATA
 	select ARCH_CLOCKSOURCE_INIT
diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
index 72fa955f4a15..fc042419e670 100644
--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -27,7 +27,6 @@ config KVM
 	depends on X86_LOCAL_APIC
 	select PREEMPT_NOTIFIERS
 	select MMU_NOTIFIER
-	select ANON_INODES
 	select HAVE_KVM_IRQCHIP
 	select HAVE_KVM_IRQFD
 	select IRQ_BYPASS_MANAGER
diff --git a/drivers/base/Kconfig b/drivers/base/Kconfig
index 059700ea3521..03f067da12ee 100644
--- a/drivers/base/Kconfig
+++ b/drivers/base/Kconfig
@@ -174,7 +174,6 @@ source "drivers/base/regmap/Kconfig"
 config DMA_SHARED_BUFFER
 	bool
 	default n
-	select ANON_INODES
 	select IRQ_WORK
 	help
 	  This option enables the framework for buffer-sharing between
diff --git a/drivers/char/tpm/Kconfig b/drivers/char/tpm/Kconfig
index 536e55d3919f..f3e4bc490cf0 100644
--- a/drivers/char/tpm/Kconfig
+++ b/drivers/char/tpm/Kconfig
@@ -157,7 +157,6 @@ config TCG_CRB
 config TCG_VTPM_PROXY
 	tristate "VTPM Proxy Interface"
 	depends on TCG_TPM
-	select ANON_INODES
 	---help---
 	  This driver proxies for an emulated TPM (vTPM) running in userspace.
 	  A device /dev/vtpmx is provided that creates a device pair
diff --git a/drivers/dma-buf/Kconfig b/drivers/dma-buf/Kconfig
index 2e5a0faa2cb1..3fc9c2efc583 100644
--- a/drivers/dma-buf/Kconfig
+++ b/drivers/dma-buf/Kconfig
@@ -3,7 +3,6 @@ menu "DMABUF options"
 config SYNC_FILE
 	bool "Explicit Synchronization Framework"
 	default n
-	select ANON_INODES
 	select DMA_SHARED_BUFFER
 	---help---
 	  The Sync File Framework adds explicit syncronization via
diff --git a/drivers/gpio/Kconfig b/drivers/gpio/Kconfig
index 3f50526a771f..0f91600c27ae 100644
--- a/drivers/gpio/Kconfig
+++ b/drivers/gpio/Kconfig
@@ -12,7 +12,6 @@ config ARCH_HAVE_CUSTOM_GPIO_H
 
 menuconfig GPIOLIB
 	bool "GPIO Support"
-	select ANON_INODES
 	help
 	  This enables GPIO support through the generic GPIO library.
 	  You only need to enable this, if you also want to enable
diff --git a/drivers/iio/Kconfig b/drivers/iio/Kconfig
index d08aeb41cd07..1dec0fecb6ef 100644
--- a/drivers/iio/Kconfig
+++ b/drivers/iio/Kconfig
@@ -4,7 +4,6 @@
 
 menuconfig IIO
 	tristate "Industrial I/O support"
-	select ANON_INODES
 	help
 	  The industrial I/O subsystem provides a unified framework for
 	  drivers for many different types of embedded sensors using a
diff --git a/drivers/infiniband/Kconfig b/drivers/infiniband/Kconfig
index a1fb840de45d..d318bab25860 100644
--- a/drivers/infiniband/Kconfig
+++ b/drivers/infiniband/Kconfig
@@ -25,7 +25,6 @@ config INFINIBAND_USER_MAD
 
 config INFINIBAND_USER_ACCESS
 	tristate "InfiniBand userspace access (verbs and CM)"
-	select ANON_INODES
 	depends on MMU
 	---help---
 	  Userspace InfiniBand access support.  This enables the
diff --git a/drivers/vfio/Kconfig b/drivers/vfio/Kconfig
index 9de5ed38da83..3798d77d131c 100644
--- a/drivers/vfio/Kconfig
+++ b/drivers/vfio/Kconfig
@@ -22,7 +22,6 @@ menuconfig VFIO
 	tristate "VFIO Non-Privileged userspace driver framework"
 	depends on IOMMU_API
 	select VFIO_IOMMU_TYPE1 if (X86 || S390 || ARM || ARM64)
-	select ANON_INODES
 	help
 	  VFIO provides a framework for secure userspace device drivers.
 	  See Documentation/vfio.txt for more details.
diff --git a/fs/Makefile b/fs/Makefile
index 427fec226fae..35945f8139e6 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -25,7 +25,7 @@ obj-$(CONFIG_PROC_FS) += proc_namespace.o
 
 obj-y				+= notify/
 obj-$(CONFIG_EPOLL)		+= eventpoll.o
-obj-$(CONFIG_ANON_INODES)	+= anon_inodes.o
+obj-y				+= anon_inodes.o
 obj-$(CONFIG_SIGNALFD)		+= signalfd.o
 obj-$(CONFIG_TIMERFD)		+= timerfd.o
 obj-$(CONFIG_EVENTFD)		+= eventfd.o
diff --git a/fs/notify/fanotify/Kconfig b/fs/notify/fanotify/Kconfig
index 735bfb2e9190..521dc91d2cb5 100644
--- a/fs/notify/fanotify/Kconfig
+++ b/fs/notify/fanotify/Kconfig
@@ -1,7 +1,6 @@
 config FANOTIFY
 	bool "Filesystem wide access notification"
 	select FSNOTIFY
-	select ANON_INODES
 	select EXPORTFS
 	default n
 	---help---
diff --git a/fs/notify/inotify/Kconfig b/fs/notify/inotify/Kconfig
index b981fc0c8379..0161c74e76e2 100644
--- a/fs/notify/inotify/Kconfig
+++ b/fs/notify/inotify/Kconfig
@@ -1,6 +1,5 @@
 config INOTIFY_USER
 	bool "Inotify support for userspace"
-	select ANON_INODES
 	select FSNOTIFY
 	default y
 	---help---
diff --git a/init/Kconfig b/init/Kconfig
index 4592bf7997c0..be8f97e37a76 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1171,9 +1171,6 @@ config LD_DEAD_CODE_DATA_ELIMINATION
 config SYSCTL
 	bool
 
-config ANON_INODES
-	bool
-
 config HAVE_UID16
 	bool
 
@@ -1378,14 +1375,12 @@ config HAVE_FUTEX_CMPXCHG
 config EPOLL
 	bool "Enable eventpoll support" if EXPERT
 	default y
-	select ANON_INODES
 	help
 	  Disabling this option will cause the kernel to be built without
 	  support for epoll family of system calls.
 
 config SIGNALFD
 	bool "Enable signalfd() system call" if EXPERT
-	select ANON_INODES
 	default y
 	help
 	  Enable the signalfd() system call that allows to receive signals
@@ -1395,7 +1390,6 @@ config SIGNALFD
 
 config TIMERFD
 	bool "Enable timerfd() system call" if EXPERT
-	select ANON_INODES
 	default y
 	help
 	  Enable the timerfd() system call that allows to receive timer
@@ -1405,7 +1399,6 @@ config TIMERFD
 
 config EVENTFD
 	bool "Enable eventfd() system call" if EXPERT
-	select ANON_INODES
 	default y
 	help
 	  Enable the eventfd() system call that allows to receive both
@@ -1516,7 +1509,6 @@ config KALLSYMS_BASE_RELATIVE
 # syscall, maps, verifier
 config BPF_SYSCALL
 	bool "Enable bpf() system call"
-	select ANON_INODES
 	select BPF
 	select IRQ_WORK
 	default n
@@ -1533,7 +1525,6 @@ config BPF_JIT_ALWAYS_ON
 
 config USERFAULTFD
 	bool "Enable userfaultfd() system call"
-	select ANON_INODES
 	depends on MMU
 	help
 	  Enable the userfaultfd() system call that allows to intercept and
@@ -1600,7 +1591,6 @@ config PERF_EVENTS
 	bool "Kernel performance events and counters"
 	default y if PROFILING
 	depends on HAVE_PERF_EVENTS
-	select ANON_INODES
 	select IRQ_WORK
 	select SRCU
 	help
-- 
2.21.0


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH v1 2/4] pid: add pidctl()
  2019-03-26 15:55 [PATCH v1 0/4] pid: add pidctl() Christian Brauner
  2019-03-26 15:55 ` [PATCH v1 1/4] Make anon_inodes unconditional Christian Brauner
@ 2019-03-26 15:55 ` Christian Brauner
  2019-03-26 16:17   ` Daniel Colascione
  2019-03-26 17:06   ` Joel Fernandes
  2019-03-26 15:55 ` [PATCH v1 3/4] signal: support pidctl() with pidfd_send_signal() Christian Brauner
  2019-03-26 15:55 ` [PATCH v1 4/4] tests: add pidctl() tests Christian Brauner
  3 siblings, 2 replies; 29+ messages in thread
From: Christian Brauner @ 2019-03-26 15:55 UTC (permalink / raw)
  To: jannh, khlebnikov, luto, dhowells, serge, ebiederm, linux-api,
	linux-kernel
  Cc: arnd, keescook, adobriyan, tglx, mtk.manpages, bl0pbl33p, ldv,
	akpm, oleg, nagarathnam.muthusamy, cyphar, viro, joel, dancol,
	Christian Brauner

The pidctl() syscalls builds on, extends, and improves translate_pid() [4].
I quote Konstantins original patchset first that has already been acked and
picked up by Eric before and whose functionality is preserved in this
syscall:

"Each process have different pids, one for each pid namespace it belongs.
 When interaction happens within single pid-ns translation isn't required.
 More complicated scenarios needs special handling.

 For example:
 - reading pid-files or logs written inside container with pid namespace
 - writing logs with internal pids outside container for pushing them into
 - attaching with ptrace to tasks from different pid namespace

 Generally speaking, any cross pid-ns API with pids needs translation.

 Currently there are several interfaces that could be used here:

 Pid namespaces are identified by device and inode of /proc/[pid]/ns/pid.

 Pids for nested pid namespaces are shown in file /proc/[pid]/status.
 In some cases pid translation could be easily done using this information.
 Backward translation requires scanning all tasks and becomes really
 complicated for deeper namespace nesting.

 Unix socket automatically translates pid attached to SCM_CREDENTIALS.
 This requires CAP_SYS_ADMIN for sending arbitrary pids and entering
 into pid namespace, this expose process and could be insecure."

The original patchset allowed two distinct operations implicitly:
- discovering whether pid namespaces (pidns) have a parent-child
  relationship
- translating a pid from a source pidns into a target pidns

Both tasks are accomplished in the original patchset by passing a pid
along. If the pid argument is passed as 1 the relationship between two pid
namespaces can be discovered.
The syscall will gain a lot clearer syntax and will be easier to use for
userspace if the task it is asked to perform is passed through a
command argument. Additionally, it allows us to remove an intrinsic race
caused by using the pid argument as a way to discover the relationship
between pid namespaces.
This patch introduces three commands:

/* PIDCMD_QUERY_PID */
PIDCMD_QUERY_PID allows to translate a pid between pid namespaces.
Given a source pid namespace fd return the pid of the process in the target
namespace:
1. pidctl(PIDCMD_QUERY_PID, pid, source_fd, -1, 0)
  - retrieve pidns identified by source_fd
  - retrieve struct pid identifed by pid in pidns identified by source_fd
  - retrieve callers pidns
  - return pid in callers pidns

2. pidctl(PIDCMD_QUERY_PID, pid, -1, target_fd, 0)
  - retrieve callers pidns
  - retrieve struct pid identifed by pid in callers pidns
  - retrieve pidns identified by target_fd
  - return pid in pidns identified by target_fd

3. pidctl(PIDCMD_QUERY_PID, 1, source_fd, -1, 0)
  - retrieve pidns identified by source_fd
  - retrieve struct pid identifed by init task in pidns identified by source_fd
  - retrieve callers pidns
  - return pid of init task of pidns identified by source_fd in callers pidns

4. pidctl(PIDCMD_QUERY_PID, pid, source_fd, target_fd, 0)
  - retrieve pidns identified by source_fd
  - retrieve struct pid identifed by pid in pidns identified by source_fd
  - retrieve pidns identified by target_fd
  - check whether struct pid can be found in pidns identified by target_fd
  - return pid in pidns identified by target_fd

/* PIDCMD_QUERY_PIDNS */
PIDCMD_QUERY_PIDNS allows to determine the relationship between pid
namespaces.
In the original version of the pachset passing pid as 1 would allow to
deterimine the relationship between the pid namespaces. This is inherhently
racy. If pid 1 inside a pid namespace has died it would report false
negatives. For example, if pid 1 inside of the target pid namespace already
died, it would report that the target pid namespace cannot be reached from
the source pid namespace because it couldn't find the pid inside of the
target pid namespace and thus falsely report to the user that the two pid
namespaces are not related. This problem is simple to avoid. In the new
version we simply walk the list of ancestors and check whether the
namespace are related to each other. By doing it this way we can reliably
report what the relationship between two pid namespace file descriptors
looks like.

1. pidctl(PIDCMD_QUERY_PIDNS, 0, ns_fd1, ns_fd1, 0) == 0
   - pidns_of(ns_fd1) and pidns_of(ns_fd2) are unrelated to each other

2. pidctl(PIDCMD_QUERY_PIDNS, 0, ns_fd1, ns_fd2, 0) == 1
   - pidns_of(ns_fd1) == pidns_of(ns_fd2)

3. pidctl(PIDCMD_QUERY_PIDNS, 0, ns_fd1, ns_fd2, 0) == 2
   - pidns_of(ns_fd1) is ancestor of pidns_of(ns_fd2)

4. pidctl(PIDCMD_QUERY_PIDNS, 0, ns_fd1, ns_fd2, 0) == 3
   - pidns_of(ns_fd2) is ancestor of pidns_of(ns_fd1)

These two commands - PIDCMD_QUERY_PID and PIDCMD_QUERY_PIDNS - cover and
improve the functionality expressed implicitly in translate_pid() before.

/* PIDCMD_GET_PIDFD */
This command allows to retrieve file descriptors for processes and removes
the dependency of pidfds and thereby the pidfd_send_signal() syscall on
procfs. First, multiple people have expressed a desire to do this even when
pidfd_send_signal() was merged. It is even recorded in the commit message
for pidfd_send_signal() itself
(cf. commit 3eb39f47934f9d5a3027fe00d906a45fe3a15fad):
Q-06: (Andrew Morton [1])
      Is there a cleaner way of obtaining the fd? Another syscall perhaps.
A-06: Userspace can already trivially retrieve file descriptors from procfs
      so this is something that we will need to support anyway. Hence,
      there's no immediate need to add another syscalls just to make
      pidfd_send_signal() not dependent on the presence of procfs. However,
      adding a syscalls to get such file descriptors is planned for a
      future patchset (cf. [1]).
Alexey made a similar request (cf. [2]).
Additionally, Andy made an additional, idependent argument that we should
go forward with non-proc-dirfd file descriptors for the sake of security
and extensibility (cf. [3]).

The pidfds are not associated with a specific pid namespaces but rather
only with struct pid. What the pidctl() syscall enforces is that when a
caller wants to retrieve a pidfd file descriptor for a pid in a given
target pid namespace the caller 
- must have been given access to two file descriptors referring
  to target and source pid namespace
- the source pid namespace must be an ancestor of the target pid
  namespace
- the pid must be translatable from the source pid namespace into the
  target pid namespace

1. pidctl(PIDCMD_GET_PIDFD, pid, source_fd, -1, 0)
  - retrieve pidns identified by source_fd
  - retrieve struct pid identifed by pid in pidns identified by source_fd
  - retrieve callers pidns
  - return pidfd

2. pidctl(PIDCMD_GET_PIDFD, pid, -1, target_fd, 0)
  - retrieve callers pidns
  - retrieve struct pid identifed by pid in callers pidns
  - retrieve pidns identified by target_fd
  - return pidfd

3. pidctl(PIDCMD_GET_PIDFD, 1, source_fd, -1, 0)
  - retrieve pidns identified by source_fd
  - retrieve struct pid identifed by init task in pidns identified by
    source_fd
  - retrieve callers pidns
  - return pidfd

4. pidctl(PIDCMD_GET_PIDFD, pid, source_fd, target_fd, 0)
  - retrieve pidns identified by source_fd
  - retrieve struct pid identifed by pid in pidns identified by source_fd
  - retrieve pidns identified by target_fd
  - check whether struct pid can be found in pidns identified by target_fd
  - return pidfd

These pidfds are allocated using anon_inode_getfd(), are O_CLOEXEC by
default and can be used with the pidfd_send_signal() syscall. They are not
dirfds and as such have the advantage that we can make them pollable or
readable in the future if we see a need to do so (which I'm pretty sure we
will eventually). Currently they do not support any advanced operations.

/* References */
[1]: https://lore.kernel.org/lkml/20181228233725.722tdfgijxcssg76@brauner.io/
[2]: https://lore.kernel.org/lkml/20190320203910.GA2842@avx2/
[3]: https://lore.kernel.org/lkml/CALCETrXO=V=+qEdLDVPf8eCgLZiB9bOTrUfe0V-U-tUZoeoRDA@mail.gmail.com/
[4]: https://lore.kernel.org/lkml/20181109034919.GA21681@altlinux.org/

Signed-off-by: Christian Brauner <christian@brauner.io>
Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Serge Hallyn <serge@hallyn.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Jann Horn <jannh@google.com
Cc: "Michael Kerrisk (man-pages)" <mtk.manpages@gmail.com>
Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Cc: Jonathan Kowalski <bl0pbl33p@gmail.com>
Cc: "Dmitry V. Levin" <ldv@altlinux.org>
Cc: Andy Lutomirsky <luto@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Nagarathnam Muthusamy <nagarathnam.muthusamy@oracle.com>
Cc: Aleksa Sarai <cyphar@cyphar.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
---
/* changelog */
v1:
- Jann Horn <jannh@google.com> in [1]:
  - always increment refcount for pidns in get_pid_ns_by_fd()
  - use PTR_ERR() after IS_ERR() not before
  - take reference on struct pid to avoid dangling pointer after rcu
    read-side critical section
  - simplify logic to put_pid_ns() when exiting functions
  - make pidfds O_CLOEXEC by default
  - remove useless no_llseek assignment in pidfd_fops
- Christian Brauner <christian@brauner.io>
  - return -ESRCH if process cannot be found in source pidns and -ENOENT if
    it cannot be found in target pidns for both PIDCMD_QUERY_PID and
    PIDCMD_QUERY_PIDNS to unify semantics

[changelog-1]: https://lore.kernel.org/lkml/CAG48ez24xPM4f7ty=8k5p6_gVWd-bEUSxca5Ngy5svHgarhz+w@mail.gmail.com
---
 arch/x86/entry/syscalls/syscall_32.tbl |   1 +
 arch/x86/entry/syscalls/syscall_64.tbl |   1 +
 include/linux/pid.h                    |   2 +
 include/linux/pid_namespace.h          |   8 ++
 include/linux/syscalls.h               |   2 +
 include/uapi/linux/wait.h              |  14 +++
 kernel/pid.c                           | 161 +++++++++++++++++++++++++
 kernel/pid_namespace.c                 |  25 ++++
 8 files changed, 214 insertions(+)

diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index 1f9607ed087c..afc5a8997140 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -433,3 +433,4 @@
 425	i386	io_uring_setup		sys_io_uring_setup		__ia32_sys_io_uring_setup
 426	i386	io_uring_enter		sys_io_uring_enter		__ia32_sys_io_uring_enter
 427	i386	io_uring_register	sys_io_uring_register		__ia32_sys_io_uring_register
+428	i386	pidctl			sys_pidctl			__ia32_sys_pidctl
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index 92ee0b4378d4..8ef25d9a003d 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -349,6 +349,7 @@
 425	common	io_uring_setup		__x64_sys_io_uring_setup
 426	common	io_uring_enter		__x64_sys_io_uring_enter
 427	common	io_uring_register	__x64_sys_io_uring_register
+428	common	pidctl			__x64_sys_pidctl
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/include/linux/pid.h b/include/linux/pid.h
index b6f4ba16065a..3c8ef5a199ca 100644
--- a/include/linux/pid.h
+++ b/include/linux/pid.h
@@ -66,6 +66,8 @@ struct pid
 
 extern struct pid init_struct_pid;
 
+extern const struct file_operations pidfd_fops;
+
 static inline struct pid *get_pid(struct pid *pid)
 {
 	if (pid)
diff --git a/include/linux/pid_namespace.h b/include/linux/pid_namespace.h
index 49538b172483..c5aae8151287 100644
--- a/include/linux/pid_namespace.h
+++ b/include/linux/pid_namespace.h
@@ -64,6 +64,8 @@ extern struct pid_namespace *copy_pid_ns(unsigned long flags,
 extern void zap_pid_ns_processes(struct pid_namespace *pid_ns);
 extern int reboot_pid_ns(struct pid_namespace *pid_ns, int cmd);
 extern void put_pid_ns(struct pid_namespace *ns);
+extern int pidnscmp(struct pid_namespace *ancestor,
+		    struct pid_namespace *descendant);
 
 #else /* !CONFIG_PID_NS */
 #include <linux/err.h>
@@ -94,6 +96,12 @@ static inline int reboot_pid_ns(struct pid_namespace *pid_ns, int cmd)
 {
 	return 0;
 }
+
+static inline int pidnscmp(struct pid_namespace *ancestor,
+			   struct pid_namespace *descendant)
+{
+	return 0;
+}
 #endif /* CONFIG_PID_NS */
 
 extern struct pid_namespace *task_active_pid_ns(struct task_struct *tsk);
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index e446806a561f..a4c8c59f7c8f 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -929,6 +929,8 @@ asmlinkage long sys_clock_adjtime32(clockid_t which_clock,
 				struct old_timex32 __user *tx);
 asmlinkage long sys_syncfs(int fd);
 asmlinkage long sys_setns(int fd, int nstype);
+asmlinkage long sys_pidctl(unsigned int cmd, pid_t pid, int source, int target,
+			   unsigned int flags);
 asmlinkage long sys_sendmmsg(int fd, struct mmsghdr __user *msg,
 			     unsigned int vlen, unsigned flags);
 asmlinkage long sys_process_vm_readv(pid_t pid,
diff --git a/include/uapi/linux/wait.h b/include/uapi/linux/wait.h
index ac49a220cf2a..e9564ec06b07 100644
--- a/include/uapi/linux/wait.h
+++ b/include/uapi/linux/wait.h
@@ -18,5 +18,19 @@
 #define P_PID		1
 #define P_PGID		2
 
+/* Commands to pass to pidctl() */
+enum pidcmd {
+	PIDCMD_QUERY_PID   = 0, /* Get pid in target pid namespace */
+	PIDCMD_QUERY_PIDNS = 1, /* Determine relationship between pid namespaces */
+	PIDCMD_GET_PIDFD   = 2, /* Get pidfd for a process */
+};
+
+/* Return values of PIDCMD_QUERY_PIDNS */
+enum pidcmd_query_pidns {
+	PIDNS_UNRELATED          = 0, /* The pid namespaces are unrelated */
+	PIDNS_EQUAL              = 1, /* The pid namespaces are equal */
+	PIDNS_SOURCE_IS_ANCESTOR = 2, /* Source pid namespace is ancestor of target pid namespace */
+	PIDNS_TARGET_IS_ANCESTOR = 3, /* Target pid namespace is ancestor of source pid namespace */
+};
 
 #endif /* _UAPI_LINUX_WAIT_H */
diff --git a/kernel/pid.c b/kernel/pid.c
index 20881598bdfa..3213a137a63e 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -26,6 +26,7 @@
  *
  */
 
+#include <linux/anon_inodes.h>
 #include <linux/mm.h>
 #include <linux/export.h>
 #include <linux/slab.h>
@@ -40,6 +41,7 @@
 #include <linux/proc_fs.h>
 #include <linux/sched/task.h>
 #include <linux/idr.h>
+#include <linux/wait.h>
 
 struct pid init_struct_pid = {
 	.count 		= ATOMIC_INIT(1),
@@ -451,6 +453,165 @@ struct pid *find_ge_pid(int nr, struct pid_namespace *ns)
 	return idr_get_next(&ns->idr, &nr);
 }
 
+static int pidfd_release(struct inode *inode, struct file *file)
+{
+	struct pid *pid = file->private_data;
+
+	if (pid) {
+		file->private_data = NULL;
+		put_pid(pid);
+	}
+
+	return 0;
+}
+
+const struct file_operations pidfd_fops = {
+	.release = pidfd_release,
+};
+
+static int pidfd_create_fd(struct pid *pid, unsigned int o_flags)
+{
+	int fd;
+
+	fd = anon_inode_getfd("pidfd", &pidfd_fops, get_pid(pid), O_RDWR | o_flags);
+	if (fd < 0)
+		put_pid(pid);
+
+	return fd;
+}
+
+static struct pid_namespace *get_pid_ns_by_fd(int fd)
+{
+	struct pid_namespace *pidns = ERR_PTR(-EINVAL);
+
+	if (fd >= 0) {
+#ifdef CONFIG_PID_NS
+		struct ns_common *ns;
+		struct file *file = proc_ns_fget(fd);
+		if (IS_ERR(file))
+			return ERR_CAST(file);
+
+		ns = get_proc_ns(file_inode(file));
+		if (ns->ops->type == CLONE_NEWPID) {
+			pidns = container_of(ns, struct pid_namespace, ns);
+			get_pid_ns(pidns);
+		}
+
+		fput(file);
+#endif
+	} else {
+		pidns = task_active_pid_ns(current);
+		get_pid_ns(pidns);
+	}
+
+	return pidns;
+}
+
+static int pidns_related(struct pid_namespace *source,
+			 struct pid_namespace *target)
+{
+	int query;
+
+	query = pidnscmp(source, target);
+	switch (query) {
+	case 0:
+		return PIDNS_EQUAL;
+	case 1:
+		return PIDNS_SOURCE_IS_ANCESTOR;
+	}
+
+	query = pidnscmp(target, source);
+	if (query == 1)
+		return PIDNS_TARGET_IS_ANCESTOR;
+
+	return PIDNS_UNRELATED;
+}
+
+/*
+ * pidctl - perform operations on pids
+ * @cmd:    command to execute
+ * @pid:    pid for translation
+ * @source: pid-ns file descriptor or -1 for active namespace
+ * @target: pid-ns file descriptor or -1 for active namesapce
+ * @flags:  flags to pass
+ *
+ * If cmd is PIDCMD_QUERY_PID translates pid between pid namespaces
+ * identified by @source and @target. Returns pid if process has pid in
+ * @target, -ESRCH if process does not have a pid in @source, -ENOENT if
+ * process has no pid in @target.
+ *
+ * If cmd is PIDCMD_QUERY_PIDNS determines the relations between two pid
+ * namespaces. Returns 2 if @source is an ancestor pid namespace
+ * of @target, 1 if @source and @target refer to the same pid namespace,
+ * 3 if @target is an ancestor pid namespace of @source, 0 if they have
+ * no parent-child relationship in either direction.
+ *
+ * If cmd is PIDCMD_GET_PIDFD returns pidfd for process in @target pid
+ * namespace. Returns pidfd if process has pid in @target, -ESRCH if
+ * process does not have a pid in @source, -ENOENT if process does not
+ * have a pid in @target pid namespace.
+ *
+ */
+SYSCALL_DEFINE5(pidctl, unsigned int, cmd, pid_t, pid, int, source, int, target,
+		unsigned int, flags)
+{
+	struct pid_namespace *source_ns = NULL, *target_ns = NULL;
+	struct pid *struct_pid;
+	pid_t result;
+
+	if (flags)
+		return -EINVAL;
+
+	switch (cmd) {
+	case PIDCMD_QUERY_PID:
+		break;
+	case PIDCMD_QUERY_PIDNS:
+		if (pid)
+			return -EINVAL;
+		break;
+	case PIDCMD_GET_PIDFD:
+		break;
+	default:
+		return -EOPNOTSUPP;
+	}
+
+	source_ns = get_pid_ns_by_fd(source);
+	if (IS_ERR(source_ns))
+		return PTR_ERR(source_ns);
+
+	target_ns = get_pid_ns_by_fd(target);
+	if (IS_ERR(target_ns)) {
+		put_pid_ns(source_ns);
+		return PTR_ERR(target_ns);
+	}
+
+	if (cmd == PIDCMD_QUERY_PIDNS) {
+		result = pidns_related(source_ns, target_ns);
+	} else {
+		rcu_read_lock();
+		struct_pid = get_pid(find_pid_ns(pid, source_ns));
+		rcu_read_unlock();
+
+		if (struct_pid)
+			result = pid_nr_ns(struct_pid, target_ns);
+		else
+			result = -ESRCH;
+
+		if (cmd == PIDCMD_GET_PIDFD && (result > 0))
+			result = pidfd_create_fd(struct_pid, O_CLOEXEC);
+
+		if (!result)
+			result = -ENOENT;
+
+		put_pid(struct_pid);
+	}
+
+	put_pid_ns(target_ns);
+	put_pid_ns(source_ns);
+
+	return result;
+}
+
 void __init pid_idr_init(void)
 {
 	/* Verify no one has done anything silly: */
diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
index aa6e72fb7c08..1c863fb3d55a 100644
--- a/kernel/pid_namespace.c
+++ b/kernel/pid_namespace.c
@@ -429,6 +429,31 @@ static struct ns_common *pidns_get_parent(struct ns_common *ns)
 	return &get_pid_ns(pid_ns)->ns;
 }
 
+/**
+ * pidnscmp - Determine if @ancestor is ancestor of @descendant
+ * @ancestor:   pidns suspected to be the ancestor of @descendant
+ * @descendant: pidns suspected to be the descendant of @ancestor
+ *
+ * Returns -1 if @ancestor is not an ancestor of @descendant,
+ * 0 if @ancestor is the same pidns as @descendant, 1 if @ancestor
+ * is an ancestor of @descendant.
+ */
+int pidnscmp(struct pid_namespace *ancestor, struct pid_namespace *descendant)
+{
+	if (ancestor == descendant)
+		return 0;
+
+	for (;;) {
+		if (!descendant)
+			return -1;
+		if (descendant == ancestor)
+			break;
+		descendant = descendant->parent;
+	}
+
+	return 1;
+}
+
 static struct user_namespace *pidns_owner(struct ns_common *ns)
 {
 	return to_pid_ns(ns)->user_ns;
-- 
2.21.0


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH v1 3/4] signal: support pidctl() with pidfd_send_signal()
  2019-03-26 15:55 [PATCH v1 0/4] pid: add pidctl() Christian Brauner
  2019-03-26 15:55 ` [PATCH v1 1/4] Make anon_inodes unconditional Christian Brauner
  2019-03-26 15:55 ` [PATCH v1 2/4] pid: add pidctl() Christian Brauner
@ 2019-03-26 15:55 ` Christian Brauner
  2019-03-26 15:55 ` [PATCH v1 4/4] tests: add pidctl() tests Christian Brauner
  3 siblings, 0 replies; 29+ messages in thread
From: Christian Brauner @ 2019-03-26 15:55 UTC (permalink / raw)
  To: jannh, khlebnikov, luto, dhowells, serge, ebiederm, linux-api,
	linux-kernel
  Cc: arnd, keescook, adobriyan, tglx, mtk.manpages, bl0pbl33p, ldv,
	akpm, oleg, nagarathnam.muthusamy, cyphar, viro, joel, dancol,
	Christian Brauner

Let pidfd_send_signal() use pidfds retrieved via pidctl(). With this patch
pidfd_send_signal() becomes independent of procfs. This fullfils the
request made when we merged the pidfd_send_signal() patchset. The
pidfd_send_signal() syscall is now always available allowing for it to be
used by users without procfs mounted or even users without procfs support
compiled into the kernel.

Signed-off-by: Christian Brauner <christian@brauner.io>
Reviewed-by: David Howells <dhowells@redhat.com>
Acked-by: Serge Hallyn <serge@hallyn.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Jann Horn <jannh@google.com
Cc: "Michael Kerrisk (man-pages)" <mtk.manpages@gmail.com>
Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Cc: Jonathan Kowalski <bl0pbl33p@gmail.com>
Cc: "Dmitry V. Levin" <ldv@altlinux.org>
Cc: Andy Lutomirsky <luto@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Nagarathnam Muthusamy <nagarathnam.muthusamy@oracle.com>
Cc: Aleksa Sarai <cyphar@cyphar.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
---
/* changelog */
v1:
- Jann Horn <jannh@google.com> in [1]:
  - make access_pidfd_pidns() more readable
---
 kernel/signal.c | 29 ++++++++++++-----------------
 kernel/sys_ni.c |  3 ---
 2 files changed, 12 insertions(+), 20 deletions(-)

diff --git a/kernel/signal.c b/kernel/signal.c
index b7953934aa99..7bdeda8333c8 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -3513,26 +3513,14 @@ SYSCALL_DEFINE2(kill, pid_t, pid, int, sig)
 	return kill_something_info(sig, &info, pid);
 }
 
-#ifdef CONFIG_PROC_FS
 /*
  * Verify that the signaler and signalee either are in the same pid namespace
  * or that the signaler's pid namespace is an ancestor of the signalee's pid
  * namespace.
  */
-static bool access_pidfd_pidns(struct pid *pid)
+static inline bool access_pidfd_pidns(struct pid *pid)
 {
-	struct pid_namespace *active = task_active_pid_ns(current);
-	struct pid_namespace *p = ns_of_pid(pid);
-
-	for (;;) {
-		if (!p)
-			return false;
-		if (p == active)
-			break;
-		p = p->parent;
-	}
-
-	return true;
+	return pidnscmp(task_active_pid_ns(current), ns_of_pid(pid)) >= 0;
 }
 
 static int copy_siginfo_from_user_any(kernel_siginfo_t *kinfo, siginfo_t *info)
@@ -3550,6 +3538,14 @@ static int copy_siginfo_from_user_any(kernel_siginfo_t *kinfo, siginfo_t *info)
 	return copy_siginfo_from_user(kinfo, info);
 }
 
+static struct pid *pidfd_to_pid(const struct file *file)
+{
+	if (file->f_op == &pidfd_fops)
+		return file->private_data;
+
+	return tgid_pidfd_to_pid(file);
+}
+
 /**
  * sys_pidfd_send_signal - send a signal to a process through a task file
  *                          descriptor
@@ -3581,12 +3577,12 @@ SYSCALL_DEFINE4(pidfd_send_signal, int, pidfd, int, sig,
 	if (flags)
 		return -EINVAL;
 
-	f = fdget_raw(pidfd);
+	f = fdget(pidfd);
 	if (!f.file)
 		return -EBADF;
 
 	/* Is this a pidfd? */
-	pid = tgid_pidfd_to_pid(f.file);
+	pid = pidfd_to_pid(f.file);
 	if (IS_ERR(pid)) {
 		ret = PTR_ERR(pid);
 		goto err;
@@ -3625,7 +3621,6 @@ SYSCALL_DEFINE4(pidfd_send_signal, int, pidfd, int, sig,
 	fdput(f);
 	return ret;
 }
-#endif /* CONFIG_PROC_FS */
 
 static int
 do_send_specific(pid_t tgid, pid_t pid, int sig, struct kernel_siginfo *info)
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index d21f4befaea4..4d9ae5ea6caf 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -167,9 +167,6 @@ COND_SYSCALL(syslog);
 
 /* kernel/sched/core.c */
 
-/* kernel/signal.c */
-COND_SYSCALL(pidfd_send_signal);
-
 /* kernel/sys.c */
 COND_SYSCALL(setregid);
 COND_SYSCALL(setgid);
-- 
2.21.0


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH v1 4/4] tests: add pidctl() tests
  2019-03-26 15:55 [PATCH v1 0/4] pid: add pidctl() Christian Brauner
                   ` (2 preceding siblings ...)
  2019-03-26 15:55 ` [PATCH v1 3/4] signal: support pidctl() with pidfd_send_signal() Christian Brauner
@ 2019-03-26 15:55 ` Christian Brauner
  3 siblings, 0 replies; 29+ messages in thread
From: Christian Brauner @ 2019-03-26 15:55 UTC (permalink / raw)
  To: jannh, khlebnikov, luto, dhowells, serge, ebiederm, linux-api,
	linux-kernel
  Cc: arnd, keescook, adobriyan, tglx, mtk.manpages, bl0pbl33p, ldv,
	akpm, oleg, nagarathnam.muthusamy, cyphar, viro, joel, dancol,
	Christian Brauner

This adds test cases for all three subcommands and verifies that they
succeed and fail as expected.
Additionally, the tests verify that pidctl() pidfds are correctly useable
with pidfd_send_signal().

Signed-off-by: Christian Brauner <christian@brauner.io>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Jann Horn <jannh@google.com
Cc: David Howells <dhowells@redhat.com>
Cc: "Michael Kerrisk (man-pages)" <mtk.manpages@gmail.com>
Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Cc: Jonathan Kowalski <bl0pbl33p@gmail.com>
Cc: "Dmitry V. Levin" <ldv@altlinux.org>
Cc: Andy Lutomirsky <luto@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Nagarathnam Muthusamy <nagarathnam.muthusamy@oracle.com>
Cc: Aleksa Sarai <cyphar@cyphar.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
---
/* changelog */
v1:
- Christian Brauner <christian@brauner.io>:
  - adapt to changing pidfds to CLOEXEC by default
---
 tools/testing/selftests/pidfd/Makefile      |   2 +-
 tools/testing/selftests/pidfd/pidctl_test.c | 537 ++++++++++++++++++++
 2 files changed, 538 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/pidfd/pidctl_test.c

diff --git a/tools/testing/selftests/pidfd/Makefile b/tools/testing/selftests/pidfd/Makefile
index deaf8073bc06..29dfa29b3afa 100644
--- a/tools/testing/selftests/pidfd/Makefile
+++ b/tools/testing/selftests/pidfd/Makefile
@@ -1,6 +1,6 @@
 CFLAGS += -g -I../../../../usr/include/
 
-TEST_GEN_PROGS := pidfd_test
+TEST_GEN_PROGS := pidfd_test pidctl_test
 
 include ../lib.mk
 
diff --git a/tools/testing/selftests/pidfd/pidctl_test.c b/tools/testing/selftests/pidfd/pidctl_test.c
new file mode 100644
index 000000000000..a39d3cd81089
--- /dev/null
+++ b/tools/testing/selftests/pidfd/pidctl_test.c
@@ -0,0 +1,537 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#define _GNU_SOURCE
+#include <errno.h>
+#include <fcntl.h>
+#include <linux/types.h>
+#include <linux/wait.h>
+#include <sched.h>
+#include <signal.h>
+#include <stdbool.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <syscall.h>
+#include <sys/mount.h>
+#include <sys/prctl.h>
+#include <sys/wait.h>
+#include <unistd.h>
+
+#include "../kselftest.h"
+
+static int parent_pidns_fd = -1;
+static pid_t parent_pidns_pid = 0;
+
+static int child_pidns_fd = -1;
+static pid_t child_pidns_pid = 0;
+
+static int cousin_pidns_fd = -1;
+static pid_t cousin_pidns_pid = 0;
+
+static bool pidns_supported = false;
+
+static inline int sys_pidfd_send_signal(int pidfd, int sig, siginfo_t *info,
+					unsigned int flags)
+{
+	return syscall(__NR_pidfd_send_signal, pidfd, sig, info, flags);
+}
+
+static inline int sys_pidctl(unsigned int cmd, pid_t pid, int source,
+			     int target, unsigned int flags)
+{
+	return syscall(__NR_pidctl, cmd, pid, source, target, flags);
+}
+
+struct cr_clone_arg {
+	char stack[128] __attribute__((aligned(16)));
+	char stack_ptr[0];
+};
+
+static int child_pidns_creator(void *args)
+{
+	(void)prctl(PR_SET_PDEATHSIG, SIGKILL);
+	while (1)
+		sleep(5);
+
+	exit(0);
+}
+
+static int prepare_pid_namespaces(void)
+{
+	char path[512];
+	struct cr_clone_arg ca;
+	pid_t pid;
+
+	parent_pidns_fd = open("/proc/self/ns/pid", O_RDONLY | O_CLOEXEC);
+	if (parent_pidns_fd < 0) {
+		ksft_print_msg("failed to open current pid namespace");
+		return -1;
+	}
+	parent_pidns_pid = getpid();
+
+	pid = clone(child_pidns_creator, ca.stack_ptr, CLONE_NEWPID | SIGCHLD,
+		    NULL);
+	if (pid < 0) {
+		ksft_print_msg("failed to clone child-pidns process in new pid namespace");
+		return -1;
+	}
+
+	snprintf(path, sizeof(path), "/proc/%d/ns/pid", pid);
+
+	child_pidns_fd = open(path, O_RDONLY | O_CLOEXEC);
+	if (child_pidns_fd < 0) {
+		ksft_print_msg("failed to open pid namespace");
+		return -1;
+	}
+	child_pidns_pid = pid;
+
+	pid = clone(child_pidns_creator, ca.stack_ptr, CLONE_NEWPID | SIGCHLD,
+		    NULL);
+	if (pid < 0) {
+		ksft_print_msg("failed to clone cousin-pidns process in new pid namespace");
+		return -1;
+	}
+
+	snprintf(path, sizeof(path), "/proc/%d/ns/pid", pid);
+
+	cousin_pidns_fd = open(path, O_RDONLY | O_CLOEXEC);
+	if (cousin_pidns_fd < 0) {
+		ksft_print_msg("failed to open cousin pid namespace");
+		return -1;
+	}
+	cousin_pidns_pid = pid;
+
+	return 0;
+}
+
+static int test_pidcmd_query_pid(void)
+{
+	const char *test_name = "pidctl PIDCMD_QUERY_PID";
+	pid_t pid, self;
+	int parent_pidns_fd2;
+
+	self = getpid();
+
+	pid = sys_pidctl(PIDCMD_QUERY_PID, self, -1, -1, 1);
+	if (pid >= 0) {
+		ksft_print_msg("%s test %d: managed to pass invalid flag\n",
+			       test_name, ksft_test_num());
+		return -1;
+	}
+
+	pid = sys_pidctl(PIDCMD_QUERY_PID, self, -1, -1, 0);
+	if (!pid || (pid != self)) {
+		ksft_print_msg("%s test %d: argument pid %d, translated pid %d\n",
+			       test_name, ksft_test_num(), self, pid);
+		return -1;
+	}
+	ksft_inc_pass_cnt();
+
+	if (!pidns_supported)
+		goto out;
+
+	parent_pidns_fd2 = open("/proc/self/ns/pid", O_RDONLY | O_CLOEXEC);
+	if (parent_pidns_fd2 < 0) {
+		ksft_print_msg("%s test %d: Failed to open current pid namespace\n",
+			       test_name, ksft_test_num());
+		return -1;
+	}
+
+	pid = sys_pidctl(PIDCMD_QUERY_PID, self, parent_pidns_fd,
+			 parent_pidns_fd2, 0);
+	if (!pid || (pid != self)) {
+		ksft_print_msg("%s test %d: argument pid %d, translated pid %d\n",
+			       test_name, ksft_test_num(), self, pid);
+		close(parent_pidns_fd2);
+		return -1;
+	}
+	ksft_inc_pass_cnt();
+
+	pid = sys_pidctl(PIDCMD_QUERY_PID, self, -1, parent_pidns_fd2, 0);
+	if (!pid || (pid != self)) {
+		ksft_print_msg("%s test %d: argument pid %d, translated pid %d\n",
+			       test_name, ksft_test_num(), self, pid);
+		close(parent_pidns_fd2);
+		return -1;
+	}
+	ksft_inc_pass_cnt();
+
+	pid = sys_pidctl(PIDCMD_QUERY_PID, self, parent_pidns_fd, -1, 0);
+	if (!pid || (pid != self)) {
+		ksft_print_msg("%s test %d: argument pid %d, translated pid %d\n",
+			       test_name, ksft_test_num(), self, pid);
+		close(parent_pidns_fd2);
+		return -1;
+	}
+	ksft_inc_pass_cnt();
+
+	close(parent_pidns_fd2);
+
+	pid = sys_pidctl(PIDCMD_QUERY_PID, self, parent_pidns_fd,
+			 child_pidns_fd, 0);
+	if (pid >= 0 || ((pid < 0) && (errno != ENOENT))) {
+		ksft_print_msg("%s test %d: argument pid %d, translated pid %d\n",
+			       test_name, ksft_test_num(), self, pid);
+		return -1;
+	}
+	ksft_inc_pass_cnt();
+
+	pid = sys_pidctl(PIDCMD_QUERY_PID, self, child_pidns_fd,
+			 parent_pidns_fd, 0);
+	if (pid >= 0 || ((pid < 0) && (errno != ESRCH))) {
+		ksft_print_msg("%s test %d: argument pid %d, translated pid %d\n",
+			       test_name, ksft_test_num(), self, pid);
+		return -1;
+	}
+	ksft_inc_pass_cnt();
+
+	pid = sys_pidctl(PIDCMD_QUERY_PID, child_pidns_pid, parent_pidns_fd,
+			 child_pidns_fd, 0);
+	if (pid != 1) {
+		ksft_print_msg("%s test %d: argument pid %d, translated pid %d\n",
+			       test_name, ksft_test_num(), child_pidns_pid, pid);
+		return -1;
+	}
+	ksft_inc_pass_cnt();
+
+	pid = sys_pidctl(PIDCMD_QUERY_PID, 1, child_pidns_fd, parent_pidns_fd,
+			 0);
+	if (pid != child_pidns_pid) {
+		ksft_print_msg("%s test %d: argument pid %d, translated pid %d\n",
+			       test_name, ksft_test_num(), 1, pid);
+		return -1;
+	}
+	ksft_inc_pass_cnt();
+
+	pid = sys_pidctl(PIDCMD_QUERY_PID, 1, child_pidns_fd, cousin_pidns_fd, 0);
+	if (pid >= 0 || ((pid < 0) && (errno != ENOENT))) {
+		ksft_print_msg("%s test %d: argument pid %d, translated pid %d\n",
+			       test_name, ksft_test_num(), 1, pid);
+		return -1;
+	}
+	ksft_inc_pass_cnt();
+
+	pid = sys_pidctl(PIDCMD_QUERY_PID, cousin_pidns_pid, child_pidns_fd,
+			 cousin_pidns_fd, 0);
+	if (pid >= 0 || ((pid < 0) && (errno != ESRCH))) {
+		ksft_print_msg("%s test %d: argument pid %d, translated pid %d\n",
+			       test_name, ksft_test_num(), cousin_pidns_pid, pid);
+		return -1;
+	}
+	ksft_inc_pass_cnt();
+
+out:
+	ksft_test_result_pass("%s test: passed\n", test_name);
+	return 0;
+}
+
+static int test_pidcmd_query_pidns(void)
+{
+	const char *test_name = "pidctl PIDCMD_QUERY_PIDNS";
+	int parent_pidns_fd2;
+	int query;
+
+	query = sys_pidctl(PIDCMD_QUERY_PIDNS, 0, -1, -1, 1);
+	if (query >= 0) {
+		ksft_print_msg("%s test %d: managed to pass invalid flag\n",
+			       test_name, ksft_test_num());
+		return -1;
+	}
+	ksft_inc_pass_cnt();
+
+	query = sys_pidctl(PIDCMD_QUERY_PIDNS, 1234, -1, -1, 0);
+	if (query >= 0)
+		ksft_print_msg("%s test %d: managed to pass invalid pid argument\n",
+			test_name, ksft_test_num());
+	ksft_inc_pass_cnt();
+
+	if (!pidns_supported)
+		goto out;
+
+	parent_pidns_fd2 = open("/proc/self/ns/pid", O_RDONLY | O_CLOEXEC);
+	if (parent_pidns_fd2 < 0) {
+		ksft_print_msg("%s test %d: Failed to open second pid namespace file descriptor\n",
+			       test_name, ksft_test_num());
+		return -1;
+	}
+
+	query = sys_pidctl(PIDCMD_QUERY_PIDNS, 0, parent_pidns_fd,
+			   parent_pidns_fd2, 0);
+	close(parent_pidns_fd2);
+	if (query != PIDNS_EQUAL) {
+		ksft_print_msg("%s test %d: failed to detect that pid namespaces are identical %d\n",
+			       test_name, ksft_test_num(), query);
+		return -1;
+	}
+	ksft_inc_pass_cnt();
+
+	query = sys_pidctl(PIDCMD_QUERY_PIDNS, 0, parent_pidns_fd,
+			   child_pidns_fd, 0);
+	if (query != PIDNS_SOURCE_IS_ANCESTOR) {
+		ksft_print_msg("%s test %d: failed to detect that source pid namespace is ancestor of target pid namespace %d\n",
+			       test_name, ksft_test_num(), query);
+		return -1;
+	}
+	ksft_inc_pass_cnt();
+
+	query = sys_pidctl(PIDCMD_QUERY_PIDNS, 0, child_pidns_fd,
+			   parent_pidns_fd, 0);
+	if (query != PIDNS_TARGET_IS_ANCESTOR) {
+		ksft_print_msg("%s test %d: failed to detect that target pid namespace is ancestor of source pid namespace %d\n",
+			       test_name, ksft_test_num(), query);
+		return -1;
+	}
+	ksft_inc_pass_cnt();
+
+	query = sys_pidctl(PIDCMD_QUERY_PIDNS, 0, child_pidns_fd,
+			   cousin_pidns_fd, 0);
+	if (query != PIDNS_UNRELATED) {
+		ksft_print_msg("%s test %d: failed to detect that pid namespace are not related %d\n",
+			       test_name, ksft_test_num(), query);
+		return -1;
+	}
+	ksft_inc_pass_cnt();
+
+	query = sys_pidctl(PIDCMD_QUERY_PIDNS, 0, child_pidns_fd,
+			   cousin_pidns_fd, 0);
+	if (query != PIDNS_UNRELATED) {
+		ksft_print_msg("%s test %d: failed to detect that pid namespace are not related %d\n",
+			       test_name, ksft_test_num(), query);
+		return -1;
+	}
+	ksft_inc_pass_cnt();
+
+out:
+	ksft_test_result_pass("%s test: passed\n", test_name);
+	return 0;
+}
+
+static int test_pidcmd_get_pidfd(void)
+{
+	const char *test_name = "pidctl PIDCMD_GET_PIDFD";
+	pid_t self;
+	int pidfd, parent_pidns_fd2;
+
+	self = getpid();
+
+	pidfd = sys_pidctl(PIDCMD_GET_PIDFD, self, -1, -1, 0);
+	if (pidfd < 0) {
+		ksft_print_msg("%s test %d: failed to pass valid flag\n",
+			       test_name, ksft_test_num());
+		return -1;
+	}
+	close(pidfd);
+	ksft_inc_pass_cnt();
+
+	if (!pidns_supported)
+		goto out;
+
+	parent_pidns_fd2 = open("/proc/self/ns/pid", O_RDONLY | O_CLOEXEC);
+	if (parent_pidns_fd2 < 0) {
+		ksft_print_msg("%s test %d: Failed to open current pid namespace\n",
+			       test_name, ksft_test_num());
+		return -1;
+	}
+
+	pidfd = sys_pidctl(PIDCMD_GET_PIDFD, self, parent_pidns_fd,
+			   parent_pidns_fd2, 0);
+	if (pidfd < 0) {
+		ksft_print_msg("%s test %d: failed to retrieve pidfd\n",
+			       test_name, ksft_test_num());
+		close(parent_pidns_fd2);
+		return -1;
+	}
+	close(pidfd);
+	ksft_inc_pass_cnt();
+
+	pidfd = sys_pidctl(PIDCMD_GET_PIDFD, self, -1, parent_pidns_fd2, 0);
+	if (pidfd < 0) {
+		ksft_print_msg("%s test %d: failed to retrieve pidfd\n",
+			       test_name, ksft_test_num());
+		close(parent_pidns_fd2);
+		return -1;
+	}
+	close(pidfd);
+	ksft_inc_pass_cnt();
+
+	pidfd = sys_pidctl(PIDCMD_GET_PIDFD, self, parent_pidns_fd, -1, 0);
+	if (pidfd < 0) {
+		ksft_print_msg("%s test %d: failed to retrieve pidfd\n",
+			       test_name, ksft_test_num());
+		close(parent_pidns_fd2);
+		return -1;
+	}
+	close(pidfd);
+	ksft_inc_pass_cnt();
+
+	close(parent_pidns_fd2);
+
+	pidfd = sys_pidctl(PIDCMD_GET_PIDFD, self, parent_pidns_fd,
+			   child_pidns_fd, 0);
+	if (pidfd >= 0 || ((pidfd < 0) && (errno != ENOENT))) {
+		ksft_print_msg("%s test %d: succeeded to retrieve pidfd but should've failed %s\n",
+			       test_name, ksft_test_num(), strerror(errno));
+		return -1;
+	}
+	ksft_inc_pass_cnt();
+
+	pidfd = sys_pidctl(PIDCMD_GET_PIDFD, self, child_pidns_fd,
+			   parent_pidns_fd, 0);
+	if (pidfd >= 0 || ((pidfd < 0) && (errno != ESRCH))) {
+		ksft_print_msg("%s test %d: succeeded to retrieve pidfd but should've failed %s\n",
+			       test_name, ksft_test_num(), strerror(errno));
+		return -1;
+	}
+	ksft_inc_pass_cnt();
+
+	pidfd = sys_pidctl(PIDCMD_GET_PIDFD, 1, child_pidns_fd, parent_pidns_fd, 0);
+	if (pidfd < 0) {
+		ksft_print_msg("%s test %d: failed to retrieve pidfd\n",
+			       test_name, ksft_test_num());
+		return -1;
+	}
+	close(pidfd);
+	ksft_inc_pass_cnt();
+
+	pidfd = sys_pidctl(PIDCMD_GET_PIDFD, 1, child_pidns_fd, cousin_pidns_fd, 0);
+	if (pidfd >= 0 || ((pidfd < 0) && (errno != ENOENT))) {
+		ksft_print_msg("%s test %d: succeeded to retrieve pidfd but should've failed %s\n",
+			       test_name, ksft_test_num(), strerror(errno));
+		return -1;
+	}
+	ksft_inc_pass_cnt();
+
+	pidfd = sys_pidctl(PIDCMD_GET_PIDFD, cousin_pidns_pid, child_pidns_fd,
+			   cousin_pidns_fd, 0);
+	if (pidfd >= 0 || ((pidfd < 0) && (errno != ESRCH))) {
+		ksft_print_msg("%s test %d: succeeded to retrieve pidfd but should've failed %s\n",
+			       test_name, ksft_test_num(), strerror(errno));
+		return -1;
+	}
+	ksft_inc_pass_cnt();
+
+out:
+	ksft_test_result_pass("%s test: passed\n", test_name);
+	return 0;
+}
+
+static void test_pidctl_pidfd_send_signal(void)
+{
+	const char *test_name = "pidctl with pidfd_send_signal";
+	int child_pidfd, cousin_pidfd, ret;
+
+	child_pidfd = sys_pidctl(PIDCMD_GET_PIDFD, child_pidns_pid, -1, -1, 0);
+	if (child_pidfd < 0)
+		ksft_print_msg("%s test %d: failed to retrieve pidfd\n",
+			       test_name, ksft_test_num());
+	ksft_inc_pass_cnt();
+
+	ret = sys_pidfd_send_signal(child_pidfd, SIGKILL, NULL, 0);
+	if (ret < 0) {
+		kill(child_pidns_pid, SIGKILL);
+		ksft_print_msg("%s test %d: failed to send signal via pidfd\n",
+			       test_name, ksft_test_num());
+	}
+	ksft_inc_pass_cnt();
+
+	cousin_pidfd = sys_pidctl(PIDCMD_GET_PIDFD, cousin_pidns_pid, -1, -1, 0);
+	if (cousin_pidfd < 0)
+		ksft_print_msg("%s test %d: failed to retrieve pidfd\n",
+			       test_name, ksft_test_num());
+	ksft_inc_pass_cnt();
+
+	ret = sys_pidfd_send_signal(cousin_pidfd, SIGKILL, NULL, 0);
+	if (ret < 0) {
+		kill(cousin_pidfd, SIGKILL);
+		ksft_print_msg("%s test %d: failed to send signal via pidfd\n",
+			       test_name, ksft_test_num());
+	}
+	ksft_inc_pass_cnt();
+
+	ksft_test_result_pass("%s test: passed\n", test_name);
+}
+
+int wait_for_pid(pid_t pid)
+{
+	int status, ret;
+
+again:
+	ret = waitpid(pid, &status, 0);
+	if (ret == -1) {
+		if (errno == EINTR)
+			goto again;
+
+		return -1;
+	}
+
+	if (ret != pid)
+		goto again;
+
+	if (!WIFEXITED(status) || WEXITSTATUS(status) != 0)
+		return -1;
+
+	return 0;
+}
+
+int main(int argc, char **argv)
+{
+	int ret;
+	pid_t pid;
+
+	pid = fork();
+	if (pid < 0)
+		ksft_exit_fail_msg("Failed to create new process\n");
+
+	if (pid == 0) {
+		if (unshare(CLONE_NEWPID) < 0)
+			exit(EXIT_FAILURE);
+
+		exit(EXIT_SUCCESS);
+	}
+
+	if (!wait_for_pid(pid))
+		pidns_supported = true;
+
+	ksft_print_header();
+
+	if (pidns_supported)
+		prepare_pid_namespaces();
+	else
+		ksft_print_msg(
+			"kernel does not support pid namespaces: skipping pid namespace parts of testsuite");
+
+	ret = test_pidcmd_query_pid();
+	if (ret < 0) {
+		ksft_print_msg("PIDCMD_QUERY_PID tests failed");
+		goto on_error;
+	}
+
+	ret = test_pidcmd_query_pidns();
+	if (ret < 0) {
+		ksft_print_msg("PIDCMD_QUERY_PIDNS tests failed");
+		goto on_error;
+	}
+
+	ret = test_pidcmd_get_pidfd();
+	if (ret < 0) {
+		ksft_print_msg("PIDCMD_GET_PIDFD tests failed");
+		goto on_error;
+	}
+
+	ret = 0;
+
+on_error:
+	if (pidns_supported)
+		test_pidctl_pidfd_send_signal();
+
+	if (parent_pidns_fd >= 0)
+		close(parent_pidns_fd);
+
+	if (child_pidns_fd >= 0)
+		close(child_pidns_fd);
+
+	if (cousin_pidns_fd >= 0)
+		close(cousin_pidns_fd);
+
+	return !ret ? ksft_exit_pass() : ksft_exit_fail();
+}
-- 
2.21.0


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: [PATCH v1 2/4] pid: add pidctl()
  2019-03-26 15:55 ` [PATCH v1 2/4] pid: add pidctl() Christian Brauner
@ 2019-03-26 16:17   ` Daniel Colascione
  2019-03-26 16:23     ` Christian Brauner
  2019-03-26 17:06   ` Joel Fernandes
  1 sibling, 1 reply; 29+ messages in thread
From: Daniel Colascione @ 2019-03-26 16:17 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Jann Horn, Konstantin Khlebnikov, Andy Lutomirski, David Howells,
	Serge E. Hallyn, Eric W. Biederman, Linux API, linux-kernel,
	Arnd Bergmann, Kees Cook, Alexey Dobriyan, Thomas Gleixner,
	Michael Kerrisk-manpages, Jonathan Kowalski, Dmitry V. Levin,
	Andrew Morton, Oleg Nesterov, Nagarathnam Muthusamy,
	Aleksa Sarai, Al Viro, Joel Fernandes

Thanks for the patch.

On Tue, Mar 26, 2019 at 8:55 AM Christian Brauner <christian@brauner.io> wrote:
>
> The pidctl() syscalls builds on, extends, and improves translate_pid() [4].
> I quote Konstantins original patchset first that has already been acked and
> picked up by Eric before and whose functionality is preserved in this
> syscall:

We still haven't had a much-needed conversation about splitting this
system call into smaller logical operations. It's important that we
address this point before this patch is merged and becomes permanent
kernel ABI.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v1 2/4] pid: add pidctl()
  2019-03-26 16:17   ` Daniel Colascione
@ 2019-03-26 16:23     ` Christian Brauner
  2019-03-26 16:31       ` Christian Brauner
  2019-03-26 16:33       ` Daniel Colascione
  0 siblings, 2 replies; 29+ messages in thread
From: Christian Brauner @ 2019-03-26 16:23 UTC (permalink / raw)
  To: Daniel Colascione
  Cc: Jann Horn, Konstantin Khlebnikov, Andy Lutomirski, David Howells,
	Serge E. Hallyn, Eric W. Biederman, Linux API, linux-kernel,
	Arnd Bergmann, Kees Cook, Alexey Dobriyan, Thomas Gleixner,
	Michael Kerrisk-manpages, Jonathan Kowalski, Dmitry V. Levin,
	Andrew Morton, Oleg Nesterov, Nagarathnam Muthusamy,
	Aleksa Sarai, Al Viro, Joel Fernandes

On Tue, Mar 26, 2019 at 09:17:07AM -0700, Daniel Colascione wrote:
> Thanks for the patch.
> 
> On Tue, Mar 26, 2019 at 8:55 AM Christian Brauner <christian@brauner.io> wrote:
> >
> > The pidctl() syscalls builds on, extends, and improves translate_pid() [4].
> > I quote Konstantins original patchset first that has already been acked and
> > picked up by Eric before and whose functionality is preserved in this
> > syscall:
> 
> We still haven't had a much-needed conversation about splitting this
> system call into smaller logical operations. It's important that we
> address this point before this patch is merged and becomes permanent
> kernel ABI.

I don't particularly mind splitting this into an additional syscall like
e.g.  pidfd_open() but then we have - and yes, I know you'll say
syscalls are cheap - translate_pid(), and pidfd_open(). What I like
about this rn is that it connects both apis in a single syscall
and allows pidfd retrieval across pid namespaces. So I guess we'll see
what other people think.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v1 2/4] pid: add pidctl()
  2019-03-26 16:23     ` Christian Brauner
@ 2019-03-26 16:31       ` Christian Brauner
  2019-03-26 16:34         ` Christian Brauner
  2019-03-26 16:33       ` Daniel Colascione
  1 sibling, 1 reply; 29+ messages in thread
From: Christian Brauner @ 2019-03-26 16:31 UTC (permalink / raw)
  To: Daniel Colascione
  Cc: Jann Horn, Konstantin Khlebnikov, Andy Lutomirski, David Howells,
	Serge E. Hallyn, Eric W. Biederman, Linux API, linux-kernel,
	Arnd Bergmann, Kees Cook, Alexey Dobriyan, Thomas Gleixner,
	Michael Kerrisk-manpages, Jonathan Kowalski, Dmitry V. Levin,
	Andrew Morton, Oleg Nesterov, Nagarathnam Muthusamy,
	Aleksa Sarai, Al Viro, Joel Fernandes

On Tue, Mar 26, 2019 at 05:23:37PM +0100, Christian Brauner wrote:
> On Tue, Mar 26, 2019 at 09:17:07AM -0700, Daniel Colascione wrote:
> > Thanks for the patch.
> > 
> > On Tue, Mar 26, 2019 at 8:55 AM Christian Brauner <christian@brauner.io> wrote:
> > >
> > > The pidctl() syscalls builds on, extends, and improves translate_pid() [4].
> > > I quote Konstantins original patchset first that has already been acked and
> > > picked up by Eric before and whose functionality is preserved in this
> > > syscall:
> > 
> > We still haven't had a much-needed conversation about splitting this
> > system call into smaller logical operations. It's important that we
> > address this point before this patch is merged and becomes permanent
> > kernel ABI.
> 
> I don't particularly mind splitting this into an additional syscall like
> e.g.  pidfd_open() but then we have - and yes, I know you'll say
> syscalls are cheap - translate_pid(), and pidfd_open(). What I like
> about this rn is that it connects both apis in a single syscall
> and allows pidfd retrieval across pid namespaces. So I guess we'll see
> what other people think.

There's something to be said for

pidfd_open(pid_t pid, int pidfd, unsigned int flags);

/* get pidfd */
int pidfd = pidfd_open(1234, -1, 0);

/* convert to procfd */
int procfd = pidfd_open(-1, 4, 0);

/* convert to pidfd */
int pidfd = pidfd_open(4, -1, 0);

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v1 2/4] pid: add pidctl()
  2019-03-26 16:23     ` Christian Brauner
  2019-03-26 16:31       ` Christian Brauner
@ 2019-03-26 16:33       ` Daniel Colascione
  1 sibling, 0 replies; 29+ messages in thread
From: Daniel Colascione @ 2019-03-26 16:33 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Jann Horn, Konstantin Khlebnikov, Andy Lutomirski, David Howells,
	Serge E. Hallyn, Eric W. Biederman, Linux API, linux-kernel,
	Arnd Bergmann, Kees Cook, Alexey Dobriyan, Thomas Gleixner,
	Michael Kerrisk-manpages, Jonathan Kowalski, Dmitry V. Levin,
	Andrew Morton, Oleg Nesterov, Nagarathnam Muthusamy,
	Aleksa Sarai, Al Viro, Joel Fernandes

On Tue, Mar 26, 2019 at 9:23 AM Christian Brauner <christian@brauner.io> wrote:
>
> On Tue, Mar 26, 2019 at 09:17:07AM -0700, Daniel Colascione wrote:
> > Thanks for the patch.
> >
> > On Tue, Mar 26, 2019 at 8:55 AM Christian Brauner <christian@brauner.io> wrote:
> > >
> > > The pidctl() syscalls builds on, extends, and improves translate_pid() [4].
> > > I quote Konstantins original patchset first that has already been acked and
> > > picked up by Eric before and whose functionality is preserved in this
> > > syscall:
> >
> > We still haven't had a much-needed conversation about splitting this
> > system call into smaller logical operations. It's important that we
> > address this point before this patch is merged and becomes permanent
> > kernel ABI.
>
> I don't particularly mind splitting this into an additional syscall like
> e.g.  pidfd_open() but then we have - and yes, I know you'll say
> syscalls are cheap - translate_pid(), and pidfd_open(). What I like
> about this rn is that it connects both apis in a single syscall
> and allows pidfd retrieval across pid namespaces. So I guess we'll see
> what other people think.

Thanks. I also appreciate a clean unification of related
functionality, but I'm concerned that this API in particular --- due
in part to its *ctl() name --- will become a catch-all facility for
doing *anything* with processes. (Granted, heavy use of a new, good,
and clean API would be a good problem to have.) This
single-system-call state of affairs would make it more awkward than
necessary to do system-call level logging (say, strace -e), enable or
disable tracing of specific operations with ftrace, apply some kinds
of SELinux policy, and so on, and the only advantage of the single
system call design that I can see right now is the logical
cleanliness.

I'd propose splitting the call, or if we can't do that, renaming it to
something else --- pidfd_query --- so that it's less likely to become
a catch-all operation holder.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v1 2/4] pid: add pidctl()
  2019-03-26 16:31       ` Christian Brauner
@ 2019-03-26 16:34         ` Christian Brauner
  2019-03-26 16:38           ` Daniel Colascione
  2019-03-26 16:42           ` Andy Lutomirski
  0 siblings, 2 replies; 29+ messages in thread
From: Christian Brauner @ 2019-03-26 16:34 UTC (permalink / raw)
  To: Daniel Colascione
  Cc: Jann Horn, Konstantin Khlebnikov, Andy Lutomirski, David Howells,
	Serge E. Hallyn, Eric W. Biederman, Linux API, linux-kernel,
	Arnd Bergmann, Kees Cook, Alexey Dobriyan, Thomas Gleixner,
	Michael Kerrisk-manpages, Jonathan Kowalski, Dmitry V. Levin,
	Andrew Morton, Oleg Nesterov, Nagarathnam Muthusamy,
	Aleksa Sarai, Al Viro, Joel Fernandes

On Tue, Mar 26, 2019 at 05:31:42PM +0100, Christian Brauner wrote:
> On Tue, Mar 26, 2019 at 05:23:37PM +0100, Christian Brauner wrote:
> > On Tue, Mar 26, 2019 at 09:17:07AM -0700, Daniel Colascione wrote:
> > > Thanks for the patch.
> > > 
> > > On Tue, Mar 26, 2019 at 8:55 AM Christian Brauner <christian@brauner.io> wrote:
> > > >
> > > > The pidctl() syscalls builds on, extends, and improves translate_pid() [4].
> > > > I quote Konstantins original patchset first that has already been acked and
> > > > picked up by Eric before and whose functionality is preserved in this
> > > > syscall:
> > > 
> > > We still haven't had a much-needed conversation about splitting this
> > > system call into smaller logical operations. It's important that we
> > > address this point before this patch is merged and becomes permanent
> > > kernel ABI.
> > 
> > I don't particularly mind splitting this into an additional syscall like
> > e.g.  pidfd_open() but then we have - and yes, I know you'll say
> > syscalls are cheap - translate_pid(), and pidfd_open(). What I like
> > about this rn is that it connects both apis in a single syscall
> > and allows pidfd retrieval across pid namespaces. So I guess we'll see
> > what other people think.
> 
> There's something to be said for
> 
> pidfd_open(pid_t pid, int pidfd, unsigned int flags);
> 
> /* get pidfd */
> int pidfd = pidfd_open(1234, -1, 0);
> 
> /* convert to procfd */
> int procfd = pidfd_open(-1, 4, 0);
> 
> /* convert to pidfd */
> int pidfd = pidfd_open(4, -1, 0);

probably rather:

int pidfd = pidfd_open(-1, 4, PIDFD_TO_PROCFD);
int procfd = pidfd_open(-1, 4, PROCFD_TO_PIDFD);
int pidfd = pidfd_open(1234, -1, 0);

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v1 2/4] pid: add pidctl()
  2019-03-26 16:34         ` Christian Brauner
@ 2019-03-26 16:38           ` Daniel Colascione
  2019-03-26 16:43             ` Christian Brauner
  2019-03-26 16:42           ` Andy Lutomirski
  1 sibling, 1 reply; 29+ messages in thread
From: Daniel Colascione @ 2019-03-26 16:38 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Jann Horn, Konstantin Khlebnikov, Andy Lutomirski, David Howells,
	Serge E. Hallyn, Eric W. Biederman, Linux API, linux-kernel,
	Arnd Bergmann, Kees Cook, Alexey Dobriyan, Thomas Gleixner,
	Michael Kerrisk-manpages, Jonathan Kowalski, Dmitry V. Levin,
	Andrew Morton, Oleg Nesterov, Nagarathnam Muthusamy,
	Aleksa Sarai, Al Viro, Joel Fernandes

On Tue, Mar 26, 2019 at 9:34 AM Christian Brauner <christian@brauner.io> wrote:
>
> On Tue, Mar 26, 2019 at 05:31:42PM +0100, Christian Brauner wrote:
> > On Tue, Mar 26, 2019 at 05:23:37PM +0100, Christian Brauner wrote:
> > > On Tue, Mar 26, 2019 at 09:17:07AM -0700, Daniel Colascione wrote:
> > > > Thanks for the patch.
> > > >
> > > > On Tue, Mar 26, 2019 at 8:55 AM Christian Brauner <christian@brauner.io> wrote:
> > > > >
> > > > > The pidctl() syscalls builds on, extends, and improves translate_pid() [4].
> > > > > I quote Konstantins original patchset first that has already been acked and
> > > > > picked up by Eric before and whose functionality is preserved in this
> > > > > syscall:
> > > >
> > > > We still haven't had a much-needed conversation about splitting this
> > > > system call into smaller logical operations. It's important that we
> > > > address this point before this patch is merged and becomes permanent
> > > > kernel ABI.
> > >
> > > I don't particularly mind splitting this into an additional syscall like
> > > e.g.  pidfd_open() but then we have - and yes, I know you'll say
> > > syscalls are cheap - translate_pid(), and pidfd_open(). What I like
> > > about this rn is that it connects both apis in a single syscall
> > > and allows pidfd retrieval across pid namespaces. So I guess we'll see
> > > what other people think.
> >
> > There's something to be said for
> >
> > pidfd_open(pid_t pid, int pidfd, unsigned int flags);
> >
> > /* get pidfd */
> > int pidfd = pidfd_open(1234, -1, 0);
> >
> > /* convert to procfd */
> > int procfd = pidfd_open(-1, 4, 0);
> >
> > /* convert to pidfd */
> > int pidfd = pidfd_open(4, -1, 0);
>
> probably rather:
>
> int pidfd = pidfd_open(-1, 4, PIDFD_TO_PROCFD);
> int procfd = pidfd_open(-1, 4, PROCFD_TO_PIDFD);
> int pidfd = pidfd_open(1234, -1, 0);

These three operations look like three related but distinct functions
to me, and in the second case, the "pidfd_open" name is a bit of a
misnomer. IMHO, the presence of an "operation name" field in any API
is usually a good indication that we're looking at a family of related
APIs, not a single coherent operation.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v1 2/4] pid: add pidctl()
  2019-03-26 16:34         ` Christian Brauner
  2019-03-26 16:38           ` Daniel Colascione
@ 2019-03-26 16:42           ` Andy Lutomirski
  2019-03-26 16:46             ` Christian Brauner
  1 sibling, 1 reply; 29+ messages in thread
From: Andy Lutomirski @ 2019-03-26 16:42 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Daniel Colascione, Jann Horn, Konstantin Khlebnikov,
	Andy Lutomirski, David Howells, Serge E. Hallyn,
	Eric W. Biederman, Linux API, linux-kernel, Arnd Bergmann,
	Kees Cook, Alexey Dobriyan, Thomas Gleixner,
	Michael Kerrisk-manpages, Jonathan Kowalski, Dmitry V. Levin,
	Andrew Morton, Oleg Nesterov, Nagarathnam Muthusamy,
	Aleksa Sarai, Al Viro, Joel Fernandes

On Tue, Mar 26, 2019 at 9:34 AM Christian Brauner <christian@brauner.io> wrote:
>
> On Tue, Mar 26, 2019 at 05:31:42PM +0100, Christian Brauner wrote:
> > On Tue, Mar 26, 2019 at 05:23:37PM +0100, Christian Brauner wrote:
> > > On Tue, Mar 26, 2019 at 09:17:07AM -0700, Daniel Colascione wrote:
> > > > Thanks for the patch.
> > > >
> > > > On Tue, Mar 26, 2019 at 8:55 AM Christian Brauner <christian@brauner.io> wrote:
> > > > >
> > > > > The pidctl() syscalls builds on, extends, and improves translate_pid() [4].
> > > > > I quote Konstantins original patchset first that has already been acked and
> > > > > picked up by Eric before and whose functionality is preserved in this
> > > > > syscall:
> > > >
> > > > We still haven't had a much-needed conversation about splitting this
> > > > system call into smaller logical operations. It's important that we
> > > > address this point before this patch is merged and becomes permanent
> > > > kernel ABI.
> > >
> > > I don't particularly mind splitting this into an additional syscall like
> > > e.g.  pidfd_open() but then we have - and yes, I know you'll say
> > > syscalls are cheap - translate_pid(), and pidfd_open(). What I like
> > > about this rn is that it connects both apis in a single syscall
> > > and allows pidfd retrieval across pid namespaces. So I guess we'll see
> > > what other people think.
> >
> > There's something to be said for
> >
> > pidfd_open(pid_t pid, int pidfd, unsigned int flags);
> >
> > /* get pidfd */
> > int pidfd = pidfd_open(1234, -1, 0);
> >
> > /* convert to procfd */
> > int procfd = pidfd_open(-1, 4, 0);
> >
> > /* convert to pidfd */
> > int pidfd = pidfd_open(4, -1, 0);
>
> probably rather:
>
> int pidfd = pidfd_open(-1, 4, PIDFD_TO_PROCFD);

Do you mean:

int procrootfd = open("/proc", O_DIRECTORY | O_RDONLY);
int procfd = pidfd_open(procrootfd, pidfd, PIDFD_TO_PROCFD);

or do you have some other solution in mind to avoid the security problem?

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v1 2/4] pid: add pidctl()
  2019-03-26 16:38           ` Daniel Colascione
@ 2019-03-26 16:43             ` Christian Brauner
  2019-03-26 16:50               ` Daniel Colascione
  0 siblings, 1 reply; 29+ messages in thread
From: Christian Brauner @ 2019-03-26 16:43 UTC (permalink / raw)
  To: Daniel Colascione
  Cc: Jann Horn, Konstantin Khlebnikov, Andy Lutomirski, David Howells,
	Serge E. Hallyn, Eric W. Biederman, Linux API, linux-kernel,
	Arnd Bergmann, Kees Cook, Alexey Dobriyan, Thomas Gleixner,
	Michael Kerrisk-manpages, Jonathan Kowalski, Dmitry V. Levin,
	Andrew Morton, Oleg Nesterov, Nagarathnam Muthusamy,
	Aleksa Sarai, Al Viro, Joel Fernandes

On Tue, Mar 26, 2019 at 09:38:31AM -0700, Daniel Colascione wrote:
> On Tue, Mar 26, 2019 at 9:34 AM Christian Brauner <christian@brauner.io> wrote:
> >
> > On Tue, Mar 26, 2019 at 05:31:42PM +0100, Christian Brauner wrote:
> > > On Tue, Mar 26, 2019 at 05:23:37PM +0100, Christian Brauner wrote:
> > > > On Tue, Mar 26, 2019 at 09:17:07AM -0700, Daniel Colascione wrote:
> > > > > Thanks for the patch.
> > > > >
> > > > > On Tue, Mar 26, 2019 at 8:55 AM Christian Brauner <christian@brauner.io> wrote:
> > > > > >
> > > > > > The pidctl() syscalls builds on, extends, and improves translate_pid() [4].
> > > > > > I quote Konstantins original patchset first that has already been acked and
> > > > > > picked up by Eric before and whose functionality is preserved in this
> > > > > > syscall:
> > > > >
> > > > > We still haven't had a much-needed conversation about splitting this
> > > > > system call into smaller logical operations. It's important that we
> > > > > address this point before this patch is merged and becomes permanent
> > > > > kernel ABI.
> > > >
> > > > I don't particularly mind splitting this into an additional syscall like
> > > > e.g.  pidfd_open() but then we have - and yes, I know you'll say
> > > > syscalls are cheap - translate_pid(), and pidfd_open(). What I like
> > > > about this rn is that it connects both apis in a single syscall
> > > > and allows pidfd retrieval across pid namespaces. So I guess we'll see
> > > > what other people think.
> > >
> > > There's something to be said for
> > >
> > > pidfd_open(pid_t pid, int pidfd, unsigned int flags);
> > >
> > > /* get pidfd */
> > > int pidfd = pidfd_open(1234, -1, 0);
> > >
> > > /* convert to procfd */
> > > int procfd = pidfd_open(-1, 4, 0);
> > >
> > > /* convert to pidfd */
> > > int pidfd = pidfd_open(4, -1, 0);
> >
> > probably rather:
> >
> > int pidfd = pidfd_open(-1, 4, PIDFD_TO_PROCFD);
> > int procfd = pidfd_open(-1, 4, PROCFD_TO_PIDFD);
> > int pidfd = pidfd_open(1234, -1, 0);
> 
> These three operations look like three related but distinct functions
> to me, and in the second case, the "pidfd_open" name is a bit of a
> misnomer. IMHO, the presence of an "operation name" field in any API
> is usually a good indication that we're looking at a family of related
> APIs, not a single coherent operation.

So I'm happy to accommodate the need for a clean api even though I
disagree that what we have in pidctl() is unclean.
But I will not start sending a pile of syscalls. There is nothing
necessarily wrong to group related APIs together. By these standards the
new mount API would need to be like 30 different syscalls, same for
keyring management.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v1 2/4] pid: add pidctl()
  2019-03-26 16:42           ` Andy Lutomirski
@ 2019-03-26 16:46             ` Christian Brauner
  2019-03-26 17:00               ` Daniel Colascione
  0 siblings, 1 reply; 29+ messages in thread
From: Christian Brauner @ 2019-03-26 16:46 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Daniel Colascione, Jann Horn, Konstantin Khlebnikov,
	David Howells, Serge E. Hallyn, Eric W. Biederman, Linux API,
	linux-kernel, Arnd Bergmann, Kees Cook, Alexey Dobriyan,
	Thomas Gleixner, Michael Kerrisk-manpages, Jonathan Kowalski,
	Dmitry V. Levin, Andrew Morton, Oleg Nesterov,
	Nagarathnam Muthusamy, Aleksa Sarai, Al Viro, Joel Fernandes

On Tue, Mar 26, 2019 at 09:42:59AM -0700, Andy Lutomirski wrote:
> On Tue, Mar 26, 2019 at 9:34 AM Christian Brauner <christian@brauner.io> wrote:
> >
> > On Tue, Mar 26, 2019 at 05:31:42PM +0100, Christian Brauner wrote:
> > > On Tue, Mar 26, 2019 at 05:23:37PM +0100, Christian Brauner wrote:
> > > > On Tue, Mar 26, 2019 at 09:17:07AM -0700, Daniel Colascione wrote:
> > > > > Thanks for the patch.
> > > > >
> > > > > On Tue, Mar 26, 2019 at 8:55 AM Christian Brauner <christian@brauner.io> wrote:
> > > > > >
> > > > > > The pidctl() syscalls builds on, extends, and improves translate_pid() [4].
> > > > > > I quote Konstantins original patchset first that has already been acked and
> > > > > > picked up by Eric before and whose functionality is preserved in this
> > > > > > syscall:
> > > > >
> > > > > We still haven't had a much-needed conversation about splitting this
> > > > > system call into smaller logical operations. It's important that we
> > > > > address this point before this patch is merged and becomes permanent
> > > > > kernel ABI.
> > > >
> > > > I don't particularly mind splitting this into an additional syscall like
> > > > e.g.  pidfd_open() but then we have - and yes, I know you'll say
> > > > syscalls are cheap - translate_pid(), and pidfd_open(). What I like
> > > > about this rn is that it connects both apis in a single syscall
> > > > and allows pidfd retrieval across pid namespaces. So I guess we'll see
> > > > what other people think.
> > >
> > > There's something to be said for
> > >
> > > pidfd_open(pid_t pid, int pidfd, unsigned int flags);
> > >
> > > /* get pidfd */
> > > int pidfd = pidfd_open(1234, -1, 0);
> > >
> > > /* convert to procfd */
> > > int procfd = pidfd_open(-1, 4, 0);
> > >
> > > /* convert to pidfd */
> > > int pidfd = pidfd_open(4, -1, 0);
> >
> > probably rather:
> >
> > int pidfd = pidfd_open(-1, 4, PIDFD_TO_PROCFD);
> 
> Do you mean:
> 
> int procrootfd = open("/proc", O_DIRECTORY | O_RDONLY);
> int procfd = pidfd_open(procrootfd, pidfd, PIDFD_TO_PROCFD);
> 
> or do you have some other solution in mind to avoid the security problem?

Yes, we need the proc root obviously. I just jotted this down.

We probably would need where one of the fds can refer to the proc root.

pidfd_open(pid_t, int fd, int fd, 0)

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v1 2/4] pid: add pidctl()
  2019-03-26 16:43             ` Christian Brauner
@ 2019-03-26 16:50               ` Daniel Colascione
  2019-03-26 17:05                 ` Christian Brauner
  0 siblings, 1 reply; 29+ messages in thread
From: Daniel Colascione @ 2019-03-26 16:50 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Jann Horn, Konstantin Khlebnikov, Andy Lutomirski, David Howells,
	Serge E. Hallyn, Eric W. Biederman, Linux API, linux-kernel,
	Arnd Bergmann, Kees Cook, Alexey Dobriyan, Thomas Gleixner,
	Michael Kerrisk-manpages, Jonathan Kowalski, Dmitry V. Levin,
	Andrew Morton, Oleg Nesterov, Nagarathnam Muthusamy,
	Aleksa Sarai, Al Viro, Joel Fernandes

On Tue, Mar 26, 2019 at 9:44 AM Christian Brauner <christian@brauner.io> wrote:
>
> On Tue, Mar 26, 2019 at 09:38:31AM -0700, Daniel Colascione wrote:
> > On Tue, Mar 26, 2019 at 9:34 AM Christian Brauner <christian@brauner.io> wrote:
> > >
> > > On Tue, Mar 26, 2019 at 05:31:42PM +0100, Christian Brauner wrote:
> > > > On Tue, Mar 26, 2019 at 05:23:37PM +0100, Christian Brauner wrote:
> > > > > On Tue, Mar 26, 2019 at 09:17:07AM -0700, Daniel Colascione wrote:
> > > > > > Thanks for the patch.
> > > > > >
> > > > > > On Tue, Mar 26, 2019 at 8:55 AM Christian Brauner <christian@brauner.io> wrote:
> > > > > > >
> > > > > > > The pidctl() syscalls builds on, extends, and improves translate_pid() [4].
> > > > > > > I quote Konstantins original patchset first that has already been acked and
> > > > > > > picked up by Eric before and whose functionality is preserved in this
> > > > > > > syscall:
> > > > > >
> > > > > > We still haven't had a much-needed conversation about splitting this
> > > > > > system call into smaller logical operations. It's important that we
> > > > > > address this point before this patch is merged and becomes permanent
> > > > > > kernel ABI.
> > > > >
> > > > > I don't particularly mind splitting this into an additional syscall like
> > > > > e.g.  pidfd_open() but then we have - and yes, I know you'll say
> > > > > syscalls are cheap - translate_pid(), and pidfd_open(). What I like
> > > > > about this rn is that it connects both apis in a single syscall
> > > > > and allows pidfd retrieval across pid namespaces. So I guess we'll see
> > > > > what other people think.
> > > >
> > > > There's something to be said for
> > > >
> > > > pidfd_open(pid_t pid, int pidfd, unsigned int flags);
> > > >
> > > > /* get pidfd */
> > > > int pidfd = pidfd_open(1234, -1, 0);
> > > >
> > > > /* convert to procfd */
> > > > int procfd = pidfd_open(-1, 4, 0);
> > > >
> > > > /* convert to pidfd */
> > > > int pidfd = pidfd_open(4, -1, 0);
> > >
> > > probably rather:
> > >
> > > int pidfd = pidfd_open(-1, 4, PIDFD_TO_PROCFD);
> > > int procfd = pidfd_open(-1, 4, PROCFD_TO_PIDFD);
> > > int pidfd = pidfd_open(1234, -1, 0);
> >
> > These three operations look like three related but distinct functions
> > to me, and in the second case, the "pidfd_open" name is a bit of a
> > misnomer. IMHO, the presence of an "operation name" field in any API
> > is usually a good indication that we're looking at a family of related
> > APIs, not a single coherent operation.
>
> So I'm happy to accommodate the need for a clean api even though I
> disagree that what we have in pidctl() is unclean.
> But I will not start sending a pile of syscalls. There is nothing
> necessarily wrong to group related APIs together.

In the email I sent just now, I identified several specific technical
disadvantages arising from unnecessary grouping of system calls. We
have historical evidence in the form of socketcall that this grouping
tends to be regrettable. I don't recall your identifying any
offsetting technical advantages. Did I miss something?

> By these standards the
> new mount API would need to be like 30 different syscalls, same for
> keyring management.

Can you please point out the problem that would arise from splitting
the mount and keyring APIs this way? One could have made the same
argument about grouping socket operations, and this socket-operation
grouping ended up being a mistake.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v1 2/4] pid: add pidctl()
  2019-03-26 16:46             ` Christian Brauner
@ 2019-03-26 17:00               ` Daniel Colascione
  0 siblings, 0 replies; 29+ messages in thread
From: Daniel Colascione @ 2019-03-26 17:00 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Andy Lutomirski, Jann Horn, Konstantin Khlebnikov, David Howells,
	Serge E. Hallyn, Eric W. Biederman, Linux API, linux-kernel,
	Arnd Bergmann, Kees Cook, Alexey Dobriyan, Thomas Gleixner,
	Michael Kerrisk-manpages, Jonathan Kowalski, Dmitry V. Levin,
	Andrew Morton, Oleg Nesterov, Nagarathnam Muthusamy,
	Aleksa Sarai, Al Viro, Joel Fernandes

On Tue, Mar 26, 2019 at 9:46 AM Christian Brauner <christian@brauner.io> wrote:
>
> On Tue, Mar 26, 2019 at 09:42:59AM -0700, Andy Lutomirski wrote:
> > On Tue, Mar 26, 2019 at 9:34 AM Christian Brauner <christian@brauner.io> wrote:
> > >
> > > On Tue, Mar 26, 2019 at 05:31:42PM +0100, Christian Brauner wrote:
> > > > On Tue, Mar 26, 2019 at 05:23:37PM +0100, Christian Brauner wrote:
> > > > > On Tue, Mar 26, 2019 at 09:17:07AM -0700, Daniel Colascione wrote:
> > > > > > Thanks for the patch.
> > > > > >
> > > > > > On Tue, Mar 26, 2019 at 8:55 AM Christian Brauner <christian@brauner.io> wrote:
> > > > > > >
> > > > > > > The pidctl() syscalls builds on, extends, and improves translate_pid() [4].
> > > > > > > I quote Konstantins original patchset first that has already been acked and
> > > > > > > picked up by Eric before and whose functionality is preserved in this
> > > > > > > syscall:
> > > > > >
> > > > > > We still haven't had a much-needed conversation about splitting this
> > > > > > system call into smaller logical operations. It's important that we
> > > > > > address this point before this patch is merged and becomes permanent
> > > > > > kernel ABI.
> > > > >
> > > > > I don't particularly mind splitting this into an additional syscall like
> > > > > e.g.  pidfd_open() but then we have - and yes, I know you'll say
> > > > > syscalls are cheap - translate_pid(), and pidfd_open(). What I like
> > > > > about this rn is that it connects both apis in a single syscall
> > > > > and allows pidfd retrieval across pid namespaces. So I guess we'll see
> > > > > what other people think.
> > > >
> > > > There's something to be said for
> > > >
> > > > pidfd_open(pid_t pid, int pidfd, unsigned int flags);
> > > >
> > > > /* get pidfd */
> > > > int pidfd = pidfd_open(1234, -1, 0);
> > > >
> > > > /* convert to procfd */
> > > > int procfd = pidfd_open(-1, 4, 0);
> > > >
> > > > /* convert to pidfd */
> > > > int pidfd = pidfd_open(4, -1, 0);
> > >
> > > probably rather:
> > >
> > > int pidfd = pidfd_open(-1, 4, PIDFD_TO_PROCFD);
> >
> > Do you mean:
> >
> > int procrootfd = open("/proc", O_DIRECTORY | O_RDONLY);
> > int procfd = pidfd_open(procrootfd, pidfd, PIDFD_TO_PROCFD);
> >
> > or do you have some other solution in mind to avoid the security problem?
>
> Yes, we need the proc root obviously. I just jotted this down.
>
> We probably would need where one of the fds can refer to the proc root.
>
> pidfd_open(pid_t, int fd, int fd, 0)

Indeed. This is precisely the pidfd-procfd translation API I proposed
in the last paragraph of [1].

[1] https://lore.kernel.org/lkml/CAKOZuetCFgu0B53+mGmQ3+539MPT_tiu-PACx2ATvihHrrmUKg@mail.gmail.com/

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v1 2/4] pid: add pidctl()
  2019-03-26 16:50               ` Daniel Colascione
@ 2019-03-26 17:05                 ` Christian Brauner
  0 siblings, 0 replies; 29+ messages in thread
From: Christian Brauner @ 2019-03-26 17:05 UTC (permalink / raw)
  To: Daniel Colascione
  Cc: Jann Horn, Konstantin Khlebnikov, Andy Lutomirski, David Howells,
	Serge E. Hallyn, Eric W. Biederman, Linux API, linux-kernel,
	Arnd Bergmann, Kees Cook, Alexey Dobriyan, Thomas Gleixner,
	Michael Kerrisk-manpages, Jonathan Kowalski, Dmitry V. Levin,
	Andrew Morton, Oleg Nesterov, Nagarathnam Muthusamy,
	Aleksa Sarai, Al Viro, Joel Fernandes

On Tue, Mar 26, 2019 at 09:50:28AM -0700, Daniel Colascione wrote:
> On Tue, Mar 26, 2019 at 9:44 AM Christian Brauner <christian@brauner.io> wrote:
> >
> > On Tue, Mar 26, 2019 at 09:38:31AM -0700, Daniel Colascione wrote:
> > > On Tue, Mar 26, 2019 at 9:34 AM Christian Brauner <christian@brauner.io> wrote:
> > > >
> > > > On Tue, Mar 26, 2019 at 05:31:42PM +0100, Christian Brauner wrote:
> > > > > On Tue, Mar 26, 2019 at 05:23:37PM +0100, Christian Brauner wrote:
> > > > > > On Tue, Mar 26, 2019 at 09:17:07AM -0700, Daniel Colascione wrote:
> > > > > > > Thanks for the patch.
> > > > > > >
> > > > > > > On Tue, Mar 26, 2019 at 8:55 AM Christian Brauner <christian@brauner.io> wrote:
> > > > > > > >
> > > > > > > > The pidctl() syscalls builds on, extends, and improves translate_pid() [4].
> > > > > > > > I quote Konstantins original patchset first that has already been acked and
> > > > > > > > picked up by Eric before and whose functionality is preserved in this
> > > > > > > > syscall:
> > > > > > >
> > > > > > > We still haven't had a much-needed conversation about splitting this
> > > > > > > system call into smaller logical operations. It's important that we
> > > > > > > address this point before this patch is merged and becomes permanent
> > > > > > > kernel ABI.
> > > > > >
> > > > > > I don't particularly mind splitting this into an additional syscall like
> > > > > > e.g.  pidfd_open() but then we have - and yes, I know you'll say
> > > > > > syscalls are cheap - translate_pid(), and pidfd_open(). What I like
> > > > > > about this rn is that it connects both apis in a single syscall
> > > > > > and allows pidfd retrieval across pid namespaces. So I guess we'll see
> > > > > > what other people think.
> > > > >
> > > > > There's something to be said for
> > > > >
> > > > > pidfd_open(pid_t pid, int pidfd, unsigned int flags);
> > > > >
> > > > > /* get pidfd */
> > > > > int pidfd = pidfd_open(1234, -1, 0);
> > > > >
> > > > > /* convert to procfd */
> > > > > int procfd = pidfd_open(-1, 4, 0);
> > > > >
> > > > > /* convert to pidfd */
> > > > > int pidfd = pidfd_open(4, -1, 0);
> > > >
> > > > probably rather:
> > > >
> > > > int pidfd = pidfd_open(-1, 4, PIDFD_TO_PROCFD);
> > > > int procfd = pidfd_open(-1, 4, PROCFD_TO_PIDFD);
> > > > int pidfd = pidfd_open(1234, -1, 0);
> > >
> > > These three operations look like three related but distinct functions
> > > to me, and in the second case, the "pidfd_open" name is a bit of a
> > > misnomer. IMHO, the presence of an "operation name" field in any API
> > > is usually a good indication that we're looking at a family of related
> > > APIs, not a single coherent operation.
> >
> > So I'm happy to accommodate the need for a clean api even though I
> > disagree that what we have in pidctl() is unclean.
> > But I will not start sending a pile of syscalls. There is nothing
> > necessarily wrong to group related APIs together.
> 
> In the email I sent just now, I identified several specific technical
> disadvantages arising from unnecessary grouping of system calls. We
> have historical evidence in the form of socketcall that this grouping
> tends to be regrettable. I don't recall your identifying any
> offsetting technical advantages. Did I miss something?
> 
> > By these standards the
> > new mount API would need to be like 30 different syscalls, same for
> > keyring management.
> 
> Can you please point out the problem that would arise from splitting
> the mount and keyring APIs this way? One could have made the same
> argument about grouping socket operations, and this socket-operation
> grouping ended up being a mistake.

The main reasons why I am not responding to such mails is that I don't
want long tangents about very generic issues. If you can find support
from people that prefer to split this into three separate syscalls:

pidfd_open()
pidfd_procfd()
procfd_pidfd()

I'm happy to do it this way. But it seems we can find a compromise, e.g.
by having

pidfd_open(pid_t pid, int fd, int fd, unsigned int flags)

and avoid that whole email waterfall.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v1 2/4] pid: add pidctl()
  2019-03-26 15:55 ` [PATCH v1 2/4] pid: add pidctl() Christian Brauner
  2019-03-26 16:17   ` Daniel Colascione
@ 2019-03-26 17:06   ` Joel Fernandes
  2019-03-26 17:08     ` Christian Brauner
  2019-03-26 17:22     ` Christian Brauner
  1 sibling, 2 replies; 29+ messages in thread
From: Joel Fernandes @ 2019-03-26 17:06 UTC (permalink / raw)
  To: Christian Brauner
  Cc: jannh, khlebnikov, luto, dhowells, serge, ebiederm, linux-api,
	linux-kernel, arnd, keescook, adobriyan, tglx, mtk.manpages,
	bl0pbl33p, ldv, akpm, oleg, nagarathnam.muthusamy, cyphar, viro,
	dancol

On Tue, Mar 26, 2019 at 04:55:11PM +0100, Christian Brauner wrote:
> The pidctl() syscalls builds on, extends, and improves translate_pid() [4].
> I quote Konstantins original patchset first that has already been acked and
> picked up by Eric before and whose functionality is preserved in this
> syscall:
> 
> "Each process have different pids, one for each pid namespace it belongs.
>  When interaction happens within single pid-ns translation isn't required.
>  More complicated scenarios needs special handling.
> 
>  For example:
>  - reading pid-files or logs written inside container with pid namespace
>  - writing logs with internal pids outside container for pushing them into
>  - attaching with ptrace to tasks from different pid namespace
> 
>  Generally speaking, any cross pid-ns API with pids needs translation.
> 
>  Currently there are several interfaces that could be used here:
> 
>  Pid namespaces are identified by device and inode of /proc/[pid]/ns/pid.
> 
>  Pids for nested pid namespaces are shown in file /proc/[pid]/status.
>  In some cases pid translation could be easily done using this information.
>  Backward translation requires scanning all tasks and becomes really
>  complicated for deeper namespace nesting.
> 
>  Unix socket automatically translates pid attached to SCM_CREDENTIALS.
>  This requires CAP_SYS_ADMIN for sending arbitrary pids and entering
>  into pid namespace, this expose process and could be insecure."
> 
> The original patchset allowed two distinct operations implicitly:
> - discovering whether pid namespaces (pidns) have a parent-child
>   relationship
> - translating a pid from a source pidns into a target pidns
> 
> Both tasks are accomplished in the original patchset by passing a pid
> along. If the pid argument is passed as 1 the relationship between two pid
> namespaces can be discovered.
> The syscall will gain a lot clearer syntax and will be easier to use for
> userspace if the task it is asked to perform is passed through a
> command argument. Additionally, it allows us to remove an intrinsic race
> caused by using the pid argument as a way to discover the relationship
> between pid namespaces.
> This patch introduces three commands:
> 
> /* PIDCMD_QUERY_PID */
> PIDCMD_QUERY_PID allows to translate a pid between pid namespaces.
> Given a source pid namespace fd return the pid of the process in the target
> namespace:

Could we call this PIDCMD_TRANSLATE_PID please? QUERY is confusing/ambigious
IMO (see below).

> 1. pidctl(PIDCMD_QUERY_PID, pid, source_fd, -1, 0)
>   - retrieve pidns identified by source_fd
>   - retrieve struct pid identifed by pid in pidns identified by source_fd
>   - retrieve callers pidns
>   - return pid in callers pidns
> 
> 2. pidctl(PIDCMD_QUERY_PID, pid, -1, target_fd, 0)
>   - retrieve callers pidns
>   - retrieve struct pid identifed by pid in callers pidns
>   - retrieve pidns identified by target_fd
>   - return pid in pidns identified by target_fd
> 
> 3. pidctl(PIDCMD_QUERY_PID, 1, source_fd, -1, 0)
>   - retrieve pidns identified by source_fd
>   - retrieve struct pid identifed by init task in pidns identified by source_fd
>   - retrieve callers pidns
>   - return pid of init task of pidns identified by source_fd in callers pidns
> 
> 4. pidctl(PIDCMD_QUERY_PID, pid, source_fd, target_fd, 0)
>   - retrieve pidns identified by source_fd
>   - retrieve struct pid identifed by pid in pidns identified by source_fd
>   - retrieve pidns identified by target_fd
>   - check whether struct pid can be found in pidns identified by target_fd
>   - return pid in pidns identified by target_fd
> 
> /* PIDCMD_QUERY_PIDNS */
> PIDCMD_QUERY_PIDNS allows to determine the relationship between pid
> namespaces.
> In the original version of the pachset passing pid as 1 would allow to
> deterimine the relationship between the pid namespaces. This is inherhently
> racy. If pid 1 inside a pid namespace has died it would report false
> negatives. For example, if pid 1 inside of the target pid namespace already
> died, it would report that the target pid namespace cannot be reached from
> the source pid namespace because it couldn't find the pid inside of the
> target pid namespace and thus falsely report to the user that the two pid
> namespaces are not related. This problem is simple to avoid. In the new
> version we simply walk the list of ancestors and check whether the
> namespace are related to each other. By doing it this way we can reliably
> report what the relationship between two pid namespace file descriptors
> looks like.
> 
> 1. pidctl(PIDCMD_QUERY_PIDNS, 0, ns_fd1, ns_fd1, 0) == 0
>    - pidns_of(ns_fd1) and pidns_of(ns_fd2) are unrelated to each other
> 
> 2. pidctl(PIDCMD_QUERY_PIDNS, 0, ns_fd1, ns_fd2, 0) == 1
>    - pidns_of(ns_fd1) == pidns_of(ns_fd2)
> 
> 3. pidctl(PIDCMD_QUERY_PIDNS, 0, ns_fd1, ns_fd2, 0) == 2
>    - pidns_of(ns_fd1) is ancestor of pidns_of(ns_fd2)
> 
> 4. pidctl(PIDCMD_QUERY_PIDNS, 0, ns_fd1, ns_fd2, 0) == 3
>    - pidns_of(ns_fd2) is ancestor of pidns_of(ns_fd1)

Why not call this PIDCMD_COMPARE_PIDNS, since a comparison is what you're doing.

Again QUERY is ambigious here. Above you called QUERY to translate something,
now you're calling QUERY to mean compare something. I suggest to be explicit
about the operation PIDCMD_<OPERATION>_<OPERAND-TYPE>.

Arguably, 2 syscalls for this is cleaner:
pid_compare_namespaces(ns_fd1, ns_fd2);
pid_translate(pid, ns_fd1, nds_fd2);


> These two commands - PIDCMD_QUERY_PID and PIDCMD_QUERY_PIDNS - cover and
> improve the functionality expressed implicitly in translate_pid() before.
> 
> /* PIDCMD_GET_PIDFD */

And this can be a third syscall:
pidfd_translate(pid, ns_fd1, ns_fd2).

I am actually supportive of Daniel's view that by combining too many
arguments into a single syscall, becomes confusing and sometimes some
arguments have to be forced to 0 in the single shoe-horned syscall. Like you
don't need a pid to compare to pid-ns, so user has to set that to 0.

More comments below...

> This command allows to retrieve file descriptors for processes and removes
> the dependency of pidfds and thereby the pidfd_send_signal() syscall on
> procfs. First, multiple people have expressed a desire to do this even when
> pidfd_send_signal() was merged. It is even recorded in the commit message
> for pidfd_send_signal() itself
> (cf. commit 3eb39f47934f9d5a3027fe00d906a45fe3a15fad):
> Q-06: (Andrew Morton [1])
>       Is there a cleaner way of obtaining the fd? Another syscall perhaps.
> A-06: Userspace can already trivially retrieve file descriptors from procfs
>       so this is something that we will need to support anyway. Hence,
>       there's no immediate need to add another syscalls just to make
>       pidfd_send_signal() not dependent on the presence of procfs. However,
>       adding a syscalls to get such file descriptors is planned for a
>       future patchset (cf. [1]).
> Alexey made a similar request (cf. [2]).
> Additionally, Andy made an additional, idependent argument that we should
> go forward with non-proc-dirfd file descriptors for the sake of security
> and extensibility (cf. [3]).
> 
> The pidfds are not associated with a specific pid namespaces but rather
> only with struct pid. What the pidctl() syscall enforces is that when a
> caller wants to retrieve a pidfd file descriptor for a pid in a given
> target pid namespace the caller 
> - must have been given access to two file descriptors referring
>   to target and source pid namespace
> - the source pid namespace must be an ancestor of the target pid
>   namespace
> - the pid must be translatable from the source pid namespace into the
>   target pid namespace
> 
> 1. pidctl(PIDCMD_GET_PIDFD, pid, source_fd, -1, 0)
>   - retrieve pidns identified by source_fd
>   - retrieve struct pid identifed by pid in pidns identified by source_fd
>   - retrieve callers pidns
>   - return pidfd
> 
> 2. pidctl(PIDCMD_GET_PIDFD, pid, -1, target_fd, 0)
>   - retrieve callers pidns
>   - retrieve struct pid identifed by pid in callers pidns
>   - retrieve pidns identified by target_fd
>   - return pidfd
> 
> 3. pidctl(PIDCMD_GET_PIDFD, 1, source_fd, -1, 0)
>   - retrieve pidns identified by source_fd
>   - retrieve struct pid identifed by init task in pidns identified by
>     source_fd
>   - retrieve callers pidns
>   - return pidfd
> 
> 4. pidctl(PIDCMD_GET_PIDFD, pid, source_fd, target_fd, 0)
>   - retrieve pidns identified by source_fd
>   - retrieve struct pid identifed by pid in pidns identified by source_fd
>   - retrieve pidns identified by target_fd
>   - check whether struct pid can be found in pidns identified by target_fd
>   - return pidfd
> 
> These pidfds are allocated using anon_inode_getfd(), are O_CLOEXEC by
> default and can be used with the pidfd_send_signal() syscall. They are not
> dirfds and as such have the advantage that we can make them pollable or
> readable in the future if we see a need to do so (which I'm pretty sure we
> will eventually). Currently they do not support any advanced operations.
> 
> /* References */
> [1]: https://lore.kernel.org/lkml/20181228233725.722tdfgijxcssg76@brauner.io/
> [2]: https://lore.kernel.org/lkml/20190320203910.GA2842@avx2/
> [3]: https://lore.kernel.org/lkml/CALCETrXO=V=+qEdLDVPf8eCgLZiB9bOTrUfe0V-U-tUZoeoRDA@mail.gmail.com/
> [4]: https://lore.kernel.org/lkml/20181109034919.GA21681@altlinux.org/
[snip]
> diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
> index e446806a561f..a4c8c59f7c8f 100644
> --- a/include/linux/syscalls.h
> +++ b/include/linux/syscalls.h
> @@ -929,6 +929,8 @@ asmlinkage long sys_clock_adjtime32(clockid_t which_clock,
>  				struct old_timex32 __user *tx);
>  asmlinkage long sys_syncfs(int fd);
>  asmlinkage long sys_setns(int fd, int nstype);
> +asmlinkage long sys_pidctl(unsigned int cmd, pid_t pid, int source, int target,
> +			   unsigned int flags);
>  asmlinkage long sys_sendmmsg(int fd, struct mmsghdr __user *msg,
>  			     unsigned int vlen, unsigned flags);
>  asmlinkage long sys_process_vm_readv(pid_t pid,
> diff --git a/include/uapi/linux/wait.h b/include/uapi/linux/wait.h
> index ac49a220cf2a..e9564ec06b07 100644
> --- a/include/uapi/linux/wait.h
> +++ b/include/uapi/linux/wait.h
> @@ -18,5 +18,19 @@
>  #define P_PID		1
>  #define P_PGID		2
>  
> +/* Commands to pass to pidctl() */
> +enum pidcmd {
> +	PIDCMD_QUERY_PID   = 0, /* Get pid in target pid namespace */
> +	PIDCMD_QUERY_PIDNS = 1, /* Determine relationship between pid namespaces */
> +	PIDCMD_GET_PIDFD   = 2, /* Get pidfd for a process */
> +};
> +
> +/* Return values of PIDCMD_QUERY_PIDNS */
> +enum pidcmd_query_pidns {
> +	PIDNS_UNRELATED          = 0, /* The pid namespaces are unrelated */
> +	PIDNS_EQUAL              = 1, /* The pid namespaces are equal */
> +	PIDNS_SOURCE_IS_ANCESTOR = 2, /* Source pid namespace is ancestor of target pid namespace */
> +	PIDNS_TARGET_IS_ANCESTOR = 3, /* Target pid namespace is ancestor of source pid namespace */
> +};
>  
>  #endif /* _UAPI_LINUX_WAIT_H */
> diff --git a/kernel/pid.c b/kernel/pid.c
> index 20881598bdfa..3213a137a63e 100644
> --- a/kernel/pid.c
> +++ b/kernel/pid.c
> @@ -26,6 +26,7 @@
>   *
>   */
>  
> +#include <linux/anon_inodes.h>
>  #include <linux/mm.h>
>  #include <linux/export.h>
>  #include <linux/slab.h>
> @@ -40,6 +41,7 @@
>  #include <linux/proc_fs.h>
>  #include <linux/sched/task.h>
>  #include <linux/idr.h>
> +#include <linux/wait.h>
>  
>  struct pid init_struct_pid = {
>  	.count 		= ATOMIC_INIT(1),
> @@ -451,6 +453,165 @@ struct pid *find_ge_pid(int nr, struct pid_namespace *ns)
>  	return idr_get_next(&ns->idr, &nr);
>  }
>  
> +static int pidfd_release(struct inode *inode, struct file *file)
> +{
> +	struct pid *pid = file->private_data;
> +
> +	if (pid) {
> +		file->private_data = NULL;
> +		put_pid(pid);
> +	}
> +
> +	return 0;
> +}
> +
> +const struct file_operations pidfd_fops = {
> +	.release = pidfd_release,
> +};
> +
> +static int pidfd_create_fd(struct pid *pid, unsigned int o_flags)
> +{
> +	int fd;
> +
> +	fd = anon_inode_getfd("pidfd", &pidfd_fops, get_pid(pid), O_RDWR | o_flags);
> +	if (fd < 0)
> +		put_pid(pid);
> +
> +	return fd;
> +}
> +
> +static struct pid_namespace *get_pid_ns_by_fd(int fd)
> +{
> +	struct pid_namespace *pidns = ERR_PTR(-EINVAL);
> +
> +	if (fd >= 0) {
> +#ifdef CONFIG_PID_NS
> +		struct ns_common *ns;
> +		struct file *file = proc_ns_fget(fd);
> +		if (IS_ERR(file))
> +			return ERR_CAST(file);
> +
> +		ns = get_proc_ns(file_inode(file));
> +		if (ns->ops->type == CLONE_NEWPID) {
> +			pidns = container_of(ns, struct pid_namespace, ns);
> +			get_pid_ns(pidns);
> +		}
> +
> +		fput(file);
> +#endif
> +	} else {
> +		pidns = task_active_pid_ns(current);
> +		get_pid_ns(pidns);
> +	}
> +
> +	return pidns;
> +}
> +
> +static int pidns_related(struct pid_namespace *source,
> +			 struct pid_namespace *target)
> +{
> +	int query;
> +
> +	query = pidnscmp(source, target);
> +	switch (query) {
> +	case 0:
> +		return PIDNS_EQUAL;
> +	case 1:
> +		return PIDNS_SOURCE_IS_ANCESTOR;
> +	}
> +
> +	query = pidnscmp(target, source);
> +	if (query == 1)
> +		return PIDNS_TARGET_IS_ANCESTOR;
> +
> +	return PIDNS_UNRELATED;
> +}
> +
> +/*
> + * pidctl - perform operations on pids
> + * @cmd:    command to execute
> + * @pid:    pid for translation
> + * @source: pid-ns file descriptor or -1 for active namespace
> + * @target: pid-ns file descriptor or -1 for active namesapce
> + * @flags:  flags to pass
> + *
> + * If cmd is PIDCMD_QUERY_PID translates pid between pid namespaces
> + * identified by @source and @target. Returns pid if process has pid in
> + * @target, -ESRCH if process does not have a pid in @source, -ENOENT if
> + * process has no pid in @target.
> + *
> + * If cmd is PIDCMD_QUERY_PIDNS determines the relations between two pid
> + * namespaces. Returns 2 if @source is an ancestor pid namespace
> + * of @target, 1 if @source and @target refer to the same pid namespace,
> + * 3 if @target is an ancestor pid namespace of @source, 0 if they have
> + * no parent-child relationship in either direction.
> + *
> + * If cmd is PIDCMD_GET_PIDFD returns pidfd for process in @target pid
> + * namespace. Returns pidfd if process has pid in @target, -ESRCH if
> + * process does not have a pid in @source, -ENOENT if process does not
> + * have a pid in @target pid namespace.
> + *
> + */
> +SYSCALL_DEFINE5(pidctl, unsigned int, cmd, pid_t, pid, int, source, int, target,
> +		unsigned int, flags)

flags seems not needed since it is unused, but I get it that you may want to
have flags in the future? If yes, give example of future flags?

> +{
> +	struct pid_namespace *source_ns = NULL, *target_ns = NULL;

Setting these to NULL is no longer needed.

> +	struct pid *struct_pid;
> +	pid_t result;
> +
> +	if (flags)
> +		return -EINVAL;
> +
> +	switch (cmd) {
> +	case PIDCMD_QUERY_PID:
> +		break;
> +	case PIDCMD_QUERY_PIDNS:
> +		if (pid)
> +			return -EINVAL;
> +		break;
> +	case PIDCMD_GET_PIDFD:
> +		break;
> +	default:
> +		return -EOPNOTSUPP;
> +	}
> +
> +	source_ns = get_pid_ns_by_fd(source);
> +	if (IS_ERR(source_ns))
> +		return PTR_ERR(source_ns);
> +
> +	target_ns = get_pid_ns_by_fd(target);
> +	if (IS_ERR(target_ns)) {
> +		put_pid_ns(source_ns);
> +		return PTR_ERR(target_ns);
> +	}
> +
> +	if (cmd == PIDCMD_QUERY_PIDNS) {
> +		result = pidns_related(source_ns, target_ns);
> +	} else {
> +		rcu_read_lock();
> +		struct_pid = get_pid(find_pid_ns(pid, source_ns));
> +		rcu_read_unlock();
> +
> +		if (struct_pid)
> +			result = pid_nr_ns(struct_pid, target_ns);
> +		else
> +			result = -ESRCH;
> +
> +		if (cmd == PIDCMD_GET_PIDFD && (result > 0))
> +			result = pidfd_create_fd(struct_pid, O_CLOEXEC);

pidfd_create_fd already does put_pid on errors..

> +
> +		if (!result)
> +			result = -ENOENT;
> +
> +		put_pid(struct_pid);

so on error you would put_pid twice which seems odd..  I would suggest, don't
release the pid ref from within pidfd_create_fd, release the ref from the
caller. Speaking of which, I added to my list to convert the pid->count to
refcount_t at some point :)

> +	}
> +
> +	put_pid_ns(target_ns);
> +	put_pid_ns(source_ns);

This part looks more clean than before so good.

thanks,

 - Joel


> +	return result;
> +}
> +
>  void __init pid_idr_init(void)
>  {
>  	/* Verify no one has done anything silly: */
> diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
> index aa6e72fb7c08..1c863fb3d55a 100644
> --- a/kernel/pid_namespace.c
> +++ b/kernel/pid_namespace.c
> @@ -429,6 +429,31 @@ static struct ns_common *pidns_get_parent(struct ns_common *ns)
>  	return &get_pid_ns(pid_ns)->ns;
>  }
>  
> +/**
> + * pidnscmp - Determine if @ancestor is ancestor of @descendant
> + * @ancestor:   pidns suspected to be the ancestor of @descendant
> + * @descendant: pidns suspected to be the descendant of @ancestor
> + *
> + * Returns -1 if @ancestor is not an ancestor of @descendant,
> + * 0 if @ancestor is the same pidns as @descendant, 1 if @ancestor
> + * is an ancestor of @descendant.
> + */
> +int pidnscmp(struct pid_namespace *ancestor, struct pid_namespace *descendant)
> +{
> +	if (ancestor == descendant)
> +		return 0;
> +
> +	for (;;) {
> +		if (!descendant)
> +			return -1;
> +		if (descendant == ancestor)
> +			break;
> +		descendant = descendant->parent;
> +	}
> +
> +	return 1;
> +}
> +
>  static struct user_namespace *pidns_owner(struct ns_common *ns)
>  {
>  	return to_pid_ns(ns)->user_ns;
> -- 
> 2.21.0
> 

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v1 2/4] pid: add pidctl()
  2019-03-26 17:06   ` Joel Fernandes
@ 2019-03-26 17:08     ` Christian Brauner
  2019-03-26 17:15       ` Joel Fernandes
  2019-03-26 17:22     ` Christian Brauner
  1 sibling, 1 reply; 29+ messages in thread
From: Christian Brauner @ 2019-03-26 17:08 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: jannh, khlebnikov, luto, dhowells, serge, ebiederm, linux-api,
	linux-kernel, arnd, keescook, adobriyan, tglx, mtk.manpages,
	bl0pbl33p, ldv, akpm, oleg, nagarathnam.muthusamy, cyphar, viro,
	dancol

On Tue, Mar 26, 2019 at 01:06:01PM -0400, Joel Fernandes wrote:
> On Tue, Mar 26, 2019 at 04:55:11PM +0100, Christian Brauner wrote:
> > The pidctl() syscalls builds on, extends, and improves translate_pid() [4].
> > I quote Konstantins original patchset first that has already been acked and
> > picked up by Eric before and whose functionality is preserved in this
> > syscall:
> > 
> > "Each process have different pids, one for each pid namespace it belongs.
> >  When interaction happens within single pid-ns translation isn't required.
> >  More complicated scenarios needs special handling.
> > 
> >  For example:
> >  - reading pid-files or logs written inside container with pid namespace
> >  - writing logs with internal pids outside container for pushing them into
> >  - attaching with ptrace to tasks from different pid namespace
> > 
> >  Generally speaking, any cross pid-ns API with pids needs translation.
> > 
> >  Currently there are several interfaces that could be used here:
> > 
> >  Pid namespaces are identified by device and inode of /proc/[pid]/ns/pid.
> > 
> >  Pids for nested pid namespaces are shown in file /proc/[pid]/status.
> >  In some cases pid translation could be easily done using this information.
> >  Backward translation requires scanning all tasks and becomes really
> >  complicated for deeper namespace nesting.
> > 
> >  Unix socket automatically translates pid attached to SCM_CREDENTIALS.
> >  This requires CAP_SYS_ADMIN for sending arbitrary pids and entering
> >  into pid namespace, this expose process and could be insecure."
> > 
> > The original patchset allowed two distinct operations implicitly:
> > - discovering whether pid namespaces (pidns) have a parent-child
> >   relationship
> > - translating a pid from a source pidns into a target pidns
> > 
> > Both tasks are accomplished in the original patchset by passing a pid
> > along. If the pid argument is passed as 1 the relationship between two pid
> > namespaces can be discovered.
> > The syscall will gain a lot clearer syntax and will be easier to use for
> > userspace if the task it is asked to perform is passed through a
> > command argument. Additionally, it allows us to remove an intrinsic race
> > caused by using the pid argument as a way to discover the relationship
> > between pid namespaces.
> > This patch introduces three commands:
> > 
> > /* PIDCMD_QUERY_PID */
> > PIDCMD_QUERY_PID allows to translate a pid between pid namespaces.
> > Given a source pid namespace fd return the pid of the process in the target
> > namespace:
> 
> Could we call this PIDCMD_TRANSLATE_PID please? QUERY is confusing/ambigious
> IMO (see below).
> 
> > 1. pidctl(PIDCMD_QUERY_PID, pid, source_fd, -1, 0)
> >   - retrieve pidns identified by source_fd
> >   - retrieve struct pid identifed by pid in pidns identified by source_fd
> >   - retrieve callers pidns
> >   - return pid in callers pidns
> > 
> > 2. pidctl(PIDCMD_QUERY_PID, pid, -1, target_fd, 0)
> >   - retrieve callers pidns
> >   - retrieve struct pid identifed by pid in callers pidns
> >   - retrieve pidns identified by target_fd
> >   - return pid in pidns identified by target_fd
> > 
> > 3. pidctl(PIDCMD_QUERY_PID, 1, source_fd, -1, 0)
> >   - retrieve pidns identified by source_fd
> >   - retrieve struct pid identifed by init task in pidns identified by source_fd
> >   - retrieve callers pidns
> >   - return pid of init task of pidns identified by source_fd in callers pidns
> > 
> > 4. pidctl(PIDCMD_QUERY_PID, pid, source_fd, target_fd, 0)
> >   - retrieve pidns identified by source_fd
> >   - retrieve struct pid identifed by pid in pidns identified by source_fd
> >   - retrieve pidns identified by target_fd
> >   - check whether struct pid can be found in pidns identified by target_fd
> >   - return pid in pidns identified by target_fd
> > 
> > /* PIDCMD_QUERY_PIDNS */
> > PIDCMD_QUERY_PIDNS allows to determine the relationship between pid
> > namespaces.
> > In the original version of the pachset passing pid as 1 would allow to
> > deterimine the relationship between the pid namespaces. This is inherhently
> > racy. If pid 1 inside a pid namespace has died it would report false
> > negatives. For example, if pid 1 inside of the target pid namespace already
> > died, it would report that the target pid namespace cannot be reached from
> > the source pid namespace because it couldn't find the pid inside of the
> > target pid namespace and thus falsely report to the user that the two pid
> > namespaces are not related. This problem is simple to avoid. In the new
> > version we simply walk the list of ancestors and check whether the
> > namespace are related to each other. By doing it this way we can reliably
> > report what the relationship between two pid namespace file descriptors
> > looks like.
> > 
> > 1. pidctl(PIDCMD_QUERY_PIDNS, 0, ns_fd1, ns_fd1, 0) == 0
> >    - pidns_of(ns_fd1) and pidns_of(ns_fd2) are unrelated to each other
> > 
> > 2. pidctl(PIDCMD_QUERY_PIDNS, 0, ns_fd1, ns_fd2, 0) == 1
> >    - pidns_of(ns_fd1) == pidns_of(ns_fd2)
> > 
> > 3. pidctl(PIDCMD_QUERY_PIDNS, 0, ns_fd1, ns_fd2, 0) == 2
> >    - pidns_of(ns_fd1) is ancestor of pidns_of(ns_fd2)
> > 
> > 4. pidctl(PIDCMD_QUERY_PIDNS, 0, ns_fd1, ns_fd2, 0) == 3
> >    - pidns_of(ns_fd2) is ancestor of pidns_of(ns_fd1)
> 
> Why not call this PIDCMD_COMPARE_PIDNS, since a comparison is what you're doing.
> 
> Again QUERY is ambigious here. Above you called QUERY to translate something,
> now you're calling QUERY to mean compare something. I suggest to be explicit
> about the operation PIDCMD_<OPERATION>_<OPERAND-TYPE>.
> 
> Arguably, 2 syscalls for this is cleaner:
> pid_compare_namespaces(ns_fd1, ns_fd2);
> pid_translate(pid, ns_fd1, nds_fd2);
> 
> 
> > These two commands - PIDCMD_QUERY_PID and PIDCMD_QUERY_PIDNS - cover and
> > improve the functionality expressed implicitly in translate_pid() before.
> > 
> > /* PIDCMD_GET_PIDFD */
> 
> And this can be a third syscall:
> pidfd_translate(pid, ns_fd1, ns_fd2).
> 
> I am actually supportive of Daniel's view that by combining too many
> arguments into a single syscall, becomes confusing and sometimes some
> arguments have to be forced to 0 in the single shoe-horned syscall. Like you
> don't need a pid to compare to pid-ns, so user has to set that to 0.
> 
> More comments below...
> 
> > This command allows to retrieve file descriptors for processes and removes
> > the dependency of pidfds and thereby the pidfd_send_signal() syscall on
> > procfs. First, multiple people have expressed a desire to do this even when
> > pidfd_send_signal() was merged. It is even recorded in the commit message
> > for pidfd_send_signal() itself
> > (cf. commit 3eb39f47934f9d5a3027fe00d906a45fe3a15fad):
> > Q-06: (Andrew Morton [1])
> >       Is there a cleaner way of obtaining the fd? Another syscall perhaps.
> > A-06: Userspace can already trivially retrieve file descriptors from procfs
> >       so this is something that we will need to support anyway. Hence,
> >       there's no immediate need to add another syscalls just to make
> >       pidfd_send_signal() not dependent on the presence of procfs. However,
> >       adding a syscalls to get such file descriptors is planned for a
> >       future patchset (cf. [1]).
> > Alexey made a similar request (cf. [2]).
> > Additionally, Andy made an additional, idependent argument that we should
> > go forward with non-proc-dirfd file descriptors for the sake of security
> > and extensibility (cf. [3]).
> > 
> > The pidfds are not associated with a specific pid namespaces but rather
> > only with struct pid. What the pidctl() syscall enforces is that when a
> > caller wants to retrieve a pidfd file descriptor for a pid in a given
> > target pid namespace the caller 
> > - must have been given access to two file descriptors referring
> >   to target and source pid namespace
> > - the source pid namespace must be an ancestor of the target pid
> >   namespace
> > - the pid must be translatable from the source pid namespace into the
> >   target pid namespace
> > 
> > 1. pidctl(PIDCMD_GET_PIDFD, pid, source_fd, -1, 0)
> >   - retrieve pidns identified by source_fd
> >   - retrieve struct pid identifed by pid in pidns identified by source_fd
> >   - retrieve callers pidns
> >   - return pidfd
> > 
> > 2. pidctl(PIDCMD_GET_PIDFD, pid, -1, target_fd, 0)
> >   - retrieve callers pidns
> >   - retrieve struct pid identifed by pid in callers pidns
> >   - retrieve pidns identified by target_fd
> >   - return pidfd
> > 
> > 3. pidctl(PIDCMD_GET_PIDFD, 1, source_fd, -1, 0)
> >   - retrieve pidns identified by source_fd
> >   - retrieve struct pid identifed by init task in pidns identified by
> >     source_fd
> >   - retrieve callers pidns
> >   - return pidfd
> > 
> > 4. pidctl(PIDCMD_GET_PIDFD, pid, source_fd, target_fd, 0)
> >   - retrieve pidns identified by source_fd
> >   - retrieve struct pid identifed by pid in pidns identified by source_fd
> >   - retrieve pidns identified by target_fd
> >   - check whether struct pid can be found in pidns identified by target_fd
> >   - return pidfd
> > 
> > These pidfds are allocated using anon_inode_getfd(), are O_CLOEXEC by
> > default and can be used with the pidfd_send_signal() syscall. They are not
> > dirfds and as such have the advantage that we can make them pollable or
> > readable in the future if we see a need to do so (which I'm pretty sure we
> > will eventually). Currently they do not support any advanced operations.
> > 
> > /* References */
> > [1]: https://lore.kernel.org/lkml/20181228233725.722tdfgijxcssg76@brauner.io/
> > [2]: https://lore.kernel.org/lkml/20190320203910.GA2842@avx2/
> > [3]: https://lore.kernel.org/lkml/CALCETrXO=V=+qEdLDVPf8eCgLZiB9bOTrUfe0V-U-tUZoeoRDA@mail.gmail.com/
> > [4]: https://lore.kernel.org/lkml/20181109034919.GA21681@altlinux.org/
> [snip]
> > diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
> > index e446806a561f..a4c8c59f7c8f 100644
> > --- a/include/linux/syscalls.h
> > +++ b/include/linux/syscalls.h
> > @@ -929,6 +929,8 @@ asmlinkage long sys_clock_adjtime32(clockid_t which_clock,
> >  				struct old_timex32 __user *tx);
> >  asmlinkage long sys_syncfs(int fd);
> >  asmlinkage long sys_setns(int fd, int nstype);
> > +asmlinkage long sys_pidctl(unsigned int cmd, pid_t pid, int source, int target,
> > +			   unsigned int flags);
> >  asmlinkage long sys_sendmmsg(int fd, struct mmsghdr __user *msg,
> >  			     unsigned int vlen, unsigned flags);
> >  asmlinkage long sys_process_vm_readv(pid_t pid,
> > diff --git a/include/uapi/linux/wait.h b/include/uapi/linux/wait.h
> > index ac49a220cf2a..e9564ec06b07 100644
> > --- a/include/uapi/linux/wait.h
> > +++ b/include/uapi/linux/wait.h
> > @@ -18,5 +18,19 @@
> >  #define P_PID		1
> >  #define P_PGID		2
> >  
> > +/* Commands to pass to pidctl() */
> > +enum pidcmd {
> > +	PIDCMD_QUERY_PID   = 0, /* Get pid in target pid namespace */
> > +	PIDCMD_QUERY_PIDNS = 1, /* Determine relationship between pid namespaces */
> > +	PIDCMD_GET_PIDFD   = 2, /* Get pidfd for a process */
> > +};
> > +
> > +/* Return values of PIDCMD_QUERY_PIDNS */
> > +enum pidcmd_query_pidns {
> > +	PIDNS_UNRELATED          = 0, /* The pid namespaces are unrelated */
> > +	PIDNS_EQUAL              = 1, /* The pid namespaces are equal */
> > +	PIDNS_SOURCE_IS_ANCESTOR = 2, /* Source pid namespace is ancestor of target pid namespace */
> > +	PIDNS_TARGET_IS_ANCESTOR = 3, /* Target pid namespace is ancestor of source pid namespace */
> > +};
> >  
> >  #endif /* _UAPI_LINUX_WAIT_H */
> > diff --git a/kernel/pid.c b/kernel/pid.c
> > index 20881598bdfa..3213a137a63e 100644
> > --- a/kernel/pid.c
> > +++ b/kernel/pid.c
> > @@ -26,6 +26,7 @@
> >   *
> >   */
> >  
> > +#include <linux/anon_inodes.h>
> >  #include <linux/mm.h>
> >  #include <linux/export.h>
> >  #include <linux/slab.h>
> > @@ -40,6 +41,7 @@
> >  #include <linux/proc_fs.h>
> >  #include <linux/sched/task.h>
> >  #include <linux/idr.h>
> > +#include <linux/wait.h>
> >  
> >  struct pid init_struct_pid = {
> >  	.count 		= ATOMIC_INIT(1),
> > @@ -451,6 +453,165 @@ struct pid *find_ge_pid(int nr, struct pid_namespace *ns)
> >  	return idr_get_next(&ns->idr, &nr);
> >  }
> >  
> > +static int pidfd_release(struct inode *inode, struct file *file)
> > +{
> > +	struct pid *pid = file->private_data;
> > +
> > +	if (pid) {
> > +		file->private_data = NULL;
> > +		put_pid(pid);
> > +	}
> > +
> > +	return 0;
> > +}
> > +
> > +const struct file_operations pidfd_fops = {
> > +	.release = pidfd_release,
> > +};
> > +
> > +static int pidfd_create_fd(struct pid *pid, unsigned int o_flags)
> > +{
> > +	int fd;
> > +
> > +	fd = anon_inode_getfd("pidfd", &pidfd_fops, get_pid(pid), O_RDWR | o_flags);
> > +	if (fd < 0)
> > +		put_pid(pid);
> > +
> > +	return fd;
> > +}
> > +
> > +static struct pid_namespace *get_pid_ns_by_fd(int fd)
> > +{
> > +	struct pid_namespace *pidns = ERR_PTR(-EINVAL);
> > +
> > +	if (fd >= 0) {
> > +#ifdef CONFIG_PID_NS
> > +		struct ns_common *ns;
> > +		struct file *file = proc_ns_fget(fd);
> > +		if (IS_ERR(file))
> > +			return ERR_CAST(file);
> > +
> > +		ns = get_proc_ns(file_inode(file));
> > +		if (ns->ops->type == CLONE_NEWPID) {
> > +			pidns = container_of(ns, struct pid_namespace, ns);
> > +			get_pid_ns(pidns);
> > +		}
> > +
> > +		fput(file);
> > +#endif
> > +	} else {
> > +		pidns = task_active_pid_ns(current);
> > +		get_pid_ns(pidns);
> > +	}
> > +
> > +	return pidns;
> > +}
> > +
> > +static int pidns_related(struct pid_namespace *source,
> > +			 struct pid_namespace *target)
> > +{
> > +	int query;
> > +
> > +	query = pidnscmp(source, target);
> > +	switch (query) {
> > +	case 0:
> > +		return PIDNS_EQUAL;
> > +	case 1:
> > +		return PIDNS_SOURCE_IS_ANCESTOR;
> > +	}
> > +
> > +	query = pidnscmp(target, source);
> > +	if (query == 1)
> > +		return PIDNS_TARGET_IS_ANCESTOR;
> > +
> > +	return PIDNS_UNRELATED;
> > +}
> > +
> > +/*
> > + * pidctl - perform operations on pids
> > + * @cmd:    command to execute
> > + * @pid:    pid for translation
> > + * @source: pid-ns file descriptor or -1 for active namespace
> > + * @target: pid-ns file descriptor or -1 for active namesapce
> > + * @flags:  flags to pass
> > + *
> > + * If cmd is PIDCMD_QUERY_PID translates pid between pid namespaces
> > + * identified by @source and @target. Returns pid if process has pid in
> > + * @target, -ESRCH if process does not have a pid in @source, -ENOENT if
> > + * process has no pid in @target.
> > + *
> > + * If cmd is PIDCMD_QUERY_PIDNS determines the relations between two pid
> > + * namespaces. Returns 2 if @source is an ancestor pid namespace
> > + * of @target, 1 if @source and @target refer to the same pid namespace,
> > + * 3 if @target is an ancestor pid namespace of @source, 0 if they have
> > + * no parent-child relationship in either direction.
> > + *
> > + * If cmd is PIDCMD_GET_PIDFD returns pidfd for process in @target pid
> > + * namespace. Returns pidfd if process has pid in @target, -ESRCH if
> > + * process does not have a pid in @source, -ENOENT if process does not
> > + * have a pid in @target pid namespace.
> > + *
> > + */
> > +SYSCALL_DEFINE5(pidctl, unsigned int, cmd, pid_t, pid, int, source, int, target,
> > +		unsigned int, flags)
> 
> flags seems not needed since it is unused, but I get it that you may want to
> have flags in the future? If yes, give example of future flags?
> 
> > +{
> > +	struct pid_namespace *source_ns = NULL, *target_ns = NULL;
> 
> Setting these to NULL is no longer needed.
> 
> > +	struct pid *struct_pid;
> > +	pid_t result;
> > +
> > +	if (flags)
> > +		return -EINVAL;
> > +
> > +	switch (cmd) {
> > +	case PIDCMD_QUERY_PID:
> > +		break;
> > +	case PIDCMD_QUERY_PIDNS:
> > +		if (pid)
> > +			return -EINVAL;
> > +		break;
> > +	case PIDCMD_GET_PIDFD:
> > +		break;
> > +	default:
> > +		return -EOPNOTSUPP;
> > +	}
> > +
> > +	source_ns = get_pid_ns_by_fd(source);
> > +	if (IS_ERR(source_ns))
> > +		return PTR_ERR(source_ns);
> > +
> > +	target_ns = get_pid_ns_by_fd(target);
> > +	if (IS_ERR(target_ns)) {
> > +		put_pid_ns(source_ns);
> > +		return PTR_ERR(target_ns);
> > +	}
> > +
> > +	if (cmd == PIDCMD_QUERY_PIDNS) {
> > +		result = pidns_related(source_ns, target_ns);
> > +	} else {
> > +		rcu_read_lock();
> > +		struct_pid = get_pid(find_pid_ns(pid, source_ns));
> > +		rcu_read_unlock();
> > +
> > +		if (struct_pid)
> > +			result = pid_nr_ns(struct_pid, target_ns);
> > +		else
> > +			result = -ESRCH;
> > +
> > +		if (cmd == PIDCMD_GET_PIDFD && (result > 0))
> > +			result = pidfd_create_fd(struct_pid, O_CLOEXEC);
> 
> pidfd_create_fd already does put_pid on errors..

it also takes its own reference

> 
> > +
> > +		if (!result)
> > +			result = -ENOENT;
> > +
> > +		put_pid(struct_pid);
> 
> so on error you would put_pid twice which seems odd..  I would suggest, don't
> release the pid ref from within pidfd_create_fd, release the ref from the
> caller. Speaking of which, I added to my list to convert the pid->count to
> refcount_t at some point :)

as i said, pidfd_create_fd takes its own reference

> 
> > +	}
> > +
> > +	put_pid_ns(target_ns);
> > +	put_pid_ns(source_ns);
> 
> This part looks more clean than before so good.
> 
> thanks,
> 
>  - Joel
> 
> 
> > +	return result;
> > +}
> > +
> >  void __init pid_idr_init(void)
> >  {
> >  	/* Verify no one has done anything silly: */
> > diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
> > index aa6e72fb7c08..1c863fb3d55a 100644
> > --- a/kernel/pid_namespace.c
> > +++ b/kernel/pid_namespace.c
> > @@ -429,6 +429,31 @@ static struct ns_common *pidns_get_parent(struct ns_common *ns)
> >  	return &get_pid_ns(pid_ns)->ns;
> >  }
> >  
> > +/**
> > + * pidnscmp - Determine if @ancestor is ancestor of @descendant
> > + * @ancestor:   pidns suspected to be the ancestor of @descendant
> > + * @descendant: pidns suspected to be the descendant of @ancestor
> > + *
> > + * Returns -1 if @ancestor is not an ancestor of @descendant,
> > + * 0 if @ancestor is the same pidns as @descendant, 1 if @ancestor
> > + * is an ancestor of @descendant.
> > + */
> > +int pidnscmp(struct pid_namespace *ancestor, struct pid_namespace *descendant)
> > +{
> > +	if (ancestor == descendant)
> > +		return 0;
> > +
> > +	for (;;) {
> > +		if (!descendant)
> > +			return -1;
> > +		if (descendant == ancestor)
> > +			break;
> > +		descendant = descendant->parent;
> > +	}
> > +
> > +	return 1;
> > +}
> > +
> >  static struct user_namespace *pidns_owner(struct ns_common *ns)
> >  {
> >  	return to_pid_ns(ns)->user_ns;
> > -- 
> > 2.21.0
> > 

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v1 2/4] pid: add pidctl()
  2019-03-26 17:08     ` Christian Brauner
@ 2019-03-26 17:15       ` Joel Fernandes
  2019-03-26 17:17         ` Christian Brauner
  0 siblings, 1 reply; 29+ messages in thread
From: Joel Fernandes @ 2019-03-26 17:15 UTC (permalink / raw)
  To: Christian Brauner
  Cc: jannh, khlebnikov, luto, dhowells, serge, ebiederm, linux-api,
	linux-kernel, arnd, keescook, adobriyan, tglx, mtk.manpages,
	bl0pbl33p, ldv, akpm, oleg, nagarathnam.muthusamy, cyphar, viro,
	dancol

On Tue, Mar 26, 2019 at 06:08:28PM +0100, Christian Brauner wrote:
[snip]
> > > +	struct pid *struct_pid;
> > > +	pid_t result;
> > > +
> > > +	if (flags)
> > > +		return -EINVAL;
> > > +
> > > +	switch (cmd) {
> > > +	case PIDCMD_QUERY_PID:
> > > +		break;
> > > +	case PIDCMD_QUERY_PIDNS:
> > > +		if (pid)
> > > +			return -EINVAL;
> > > +		break;
> > > +	case PIDCMD_GET_PIDFD:
> > > +		break;
> > > +	default:
> > > +		return -EOPNOTSUPP;
> > > +	}
> > > +
> > > +	source_ns = get_pid_ns_by_fd(source);
> > > +	if (IS_ERR(source_ns))
> > > +		return PTR_ERR(source_ns);
> > > +
> > > +	target_ns = get_pid_ns_by_fd(target);
> > > +	if (IS_ERR(target_ns)) {
> > > +		put_pid_ns(source_ns);
> > > +		return PTR_ERR(target_ns);
> > > +	}
> > > +
> > > +	if (cmd == PIDCMD_QUERY_PIDNS) {
> > > +		result = pidns_related(source_ns, target_ns);
> > > +	} else {
> > > +		rcu_read_lock();
> > > +		struct_pid = get_pid(find_pid_ns(pid, source_ns));
> > > +		rcu_read_unlock();
> > > +
> > > +		if (struct_pid)
> > > +			result = pid_nr_ns(struct_pid, target_ns);
> > > +		else
> > > +			result = -ESRCH;
> > > +
> > > +		if (cmd == PIDCMD_GET_PIDFD && (result > 0))
> > > +			result = pidfd_create_fd(struct_pid, O_CLOEXEC);
> > 
> > pidfd_create_fd already does put_pid on errors..
> 
> it also takes its own reference
> 
> > 
> > > +
> > > +		if (!result)
> > > +			result = -ENOENT;
> > > +
> > > +		put_pid(struct_pid);
> > 
> > so on error you would put_pid twice which seems odd..  I would suggest, don't
> > release the pid ref from within pidfd_create_fd, release the ref from the
> > caller. Speaking of which, I added to my list to convert the pid->count to
> > refcount_t at some point :)
> 
> as i said, pidfd_create_fd takes its own reference

Oh. That was easy to miss. Fair enough. I take that comment back.

Please also reply to the other comments I posted, thanks. Generally on LKML,
I have seen there is an expectation to reply to all reviewer's review
comments even if you agree with them. This helps keep the review going
smoothly. Just my 2 cents.

thanks,

 - Joel


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v1 2/4] pid: add pidctl()
  2019-03-26 17:15       ` Joel Fernandes
@ 2019-03-26 17:17         ` Christian Brauner
  2019-03-26 17:18           ` Joel Fernandes
  0 siblings, 1 reply; 29+ messages in thread
From: Christian Brauner @ 2019-03-26 17:17 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: jannh, khlebnikov, luto, dhowells, serge, ebiederm, linux-api,
	linux-kernel, arnd, keescook, adobriyan, tglx, mtk.manpages,
	bl0pbl33p, ldv, akpm, oleg, nagarathnam.muthusamy, cyphar, viro,
	dancol

On Tue, Mar 26, 2019 at 01:15:25PM -0400, Joel Fernandes wrote:
> On Tue, Mar 26, 2019 at 06:08:28PM +0100, Christian Brauner wrote:
> [snip]
> > > > +	struct pid *struct_pid;
> > > > +	pid_t result;
> > > > +
> > > > +	if (flags)
> > > > +		return -EINVAL;
> > > > +
> > > > +	switch (cmd) {
> > > > +	case PIDCMD_QUERY_PID:
> > > > +		break;
> > > > +	case PIDCMD_QUERY_PIDNS:
> > > > +		if (pid)
> > > > +			return -EINVAL;
> > > > +		break;
> > > > +	case PIDCMD_GET_PIDFD:
> > > > +		break;
> > > > +	default:
> > > > +		return -EOPNOTSUPP;
> > > > +	}
> > > > +
> > > > +	source_ns = get_pid_ns_by_fd(source);
> > > > +	if (IS_ERR(source_ns))
> > > > +		return PTR_ERR(source_ns);
> > > > +
> > > > +	target_ns = get_pid_ns_by_fd(target);
> > > > +	if (IS_ERR(target_ns)) {
> > > > +		put_pid_ns(source_ns);
> > > > +		return PTR_ERR(target_ns);
> > > > +	}
> > > > +
> > > > +	if (cmd == PIDCMD_QUERY_PIDNS) {
> > > > +		result = pidns_related(source_ns, target_ns);
> > > > +	} else {
> > > > +		rcu_read_lock();
> > > > +		struct_pid = get_pid(find_pid_ns(pid, source_ns));
> > > > +		rcu_read_unlock();
> > > > +
> > > > +		if (struct_pid)
> > > > +			result = pid_nr_ns(struct_pid, target_ns);
> > > > +		else
> > > > +			result = -ESRCH;
> > > > +
> > > > +		if (cmd == PIDCMD_GET_PIDFD && (result > 0))
> > > > +			result = pidfd_create_fd(struct_pid, O_CLOEXEC);
> > > 
> > > pidfd_create_fd already does put_pid on errors..
> > 
> > it also takes its own reference
> > 
> > > 
> > > > +
> > > > +		if (!result)
> > > > +			result = -ENOENT;
> > > > +
> > > > +		put_pid(struct_pid);
> > > 
> > > so on error you would put_pid twice which seems odd..  I would suggest, don't
> > > release the pid ref from within pidfd_create_fd, release the ref from the
> > > caller. Speaking of which, I added to my list to convert the pid->count to
> > > refcount_t at some point :)
> > 
> > as i said, pidfd_create_fd takes its own reference
> 
> Oh. That was easy to miss. Fair enough. I take that comment back.
> 
> Please also reply to the other comments I posted, thanks. Generally on LKML,
> I have seen there is an expectation to reply to all reviewer's review
> comments even if you agree with them. This helps keep the review going
> smoothly. Just my 2 cents.

I tend to do it in multiple mails depending on whether or not I need to
think about a comment or not.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v1 2/4] pid: add pidctl()
  2019-03-26 17:17         ` Christian Brauner
@ 2019-03-26 17:18           ` Joel Fernandes
  0 siblings, 0 replies; 29+ messages in thread
From: Joel Fernandes @ 2019-03-26 17:18 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Jann Horn, Konstantin Khlebnikov, Andy Lutomirski, David Howells,
	Serge E. Hallyn, Eric W. Biederman, Linux API, LKML,
	Arnd Bergmann, Kees Cook, Alexey Dobriyan, Thomas Glexiner,
	Michael Kerrisk-manpages, Jonathan Kowalski, Dmitry V. Levin,
	Andrew Morton, Oleg Nesterov, Nagarathnam Muthusamy,
	Aleksa Sarai, Al Viro, Daniel Colascione

On Tue, Mar 26, 2019 at 1:17 PM Christian Brauner <christian@brauner.io> wrote:
>
> On Tue, Mar 26, 2019 at 01:15:25PM -0400, Joel Fernandes wrote:
> > On Tue, Mar 26, 2019 at 06:08:28PM +0100, Christian Brauner wrote:
> > [snip]
> > > >
> > > > > +
> > > > > +               if (!result)
> > > > > +                       result = -ENOENT;
> > > > > +
> > > > > +               put_pid(struct_pid);
> > > >
> > > > so on error you would put_pid twice which seems odd..  I would suggest, don't
> > > > release the pid ref from within pidfd_create_fd, release the ref from the
> > > > caller. Speaking of which, I added to my list to convert the pid->count to
> > > > refcount_t at some point :)
> > >
> > > as i said, pidfd_create_fd takes its own reference
> >
> > Oh. That was easy to miss. Fair enough. I take that comment back.
> >
> > Please also reply to the other comments I posted, thanks. Generally on LKML,
> > I have seen there is an expectation to reply to all reviewer's review
> > comments even if you agree with them. This helps keep the review going
> > smoothly. Just my 2 cents.
>
> I tend to do it in multiple mails depending on whether or not I need to
> think about a comment or not.

Ok, that's also fine with me. thanks,

 - Joel

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v1 2/4] pid: add pidctl()
  2019-03-26 17:06   ` Joel Fernandes
  2019-03-26 17:08     ` Christian Brauner
@ 2019-03-26 17:22     ` Christian Brauner
  2019-03-26 18:10       ` Joel Fernandes
  1 sibling, 1 reply; 29+ messages in thread
From: Christian Brauner @ 2019-03-26 17:22 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: jannh, khlebnikov, luto, dhowells, serge, ebiederm, linux-api,
	linux-kernel, arnd, keescook, adobriyan, tglx, mtk.manpages,
	bl0pbl33p, ldv, akpm, oleg, nagarathnam.muthusamy, cyphar, viro,
	dancol

On Tue, Mar 26, 2019 at 01:06:01PM -0400, Joel Fernandes wrote:
> On Tue, Mar 26, 2019 at 04:55:11PM +0100, Christian Brauner wrote:
> > The pidctl() syscalls builds on, extends, and improves translate_pid() [4].
> > I quote Konstantins original patchset first that has already been acked and
> > picked up by Eric before and whose functionality is preserved in this
> > syscall:
> > 
> > "Each process have different pids, one for each pid namespace it belongs.
> >  When interaction happens within single pid-ns translation isn't required.
> >  More complicated scenarios needs special handling.
> > 
> >  For example:
> >  - reading pid-files or logs written inside container with pid namespace
> >  - writing logs with internal pids outside container for pushing them into
> >  - attaching with ptrace to tasks from different pid namespace
> > 
> >  Generally speaking, any cross pid-ns API with pids needs translation.
> > 
> >  Currently there are several interfaces that could be used here:
> > 
> >  Pid namespaces are identified by device and inode of /proc/[pid]/ns/pid.
> > 
> >  Pids for nested pid namespaces are shown in file /proc/[pid]/status.
> >  In some cases pid translation could be easily done using this information.
> >  Backward translation requires scanning all tasks and becomes really
> >  complicated for deeper namespace nesting.
> > 
> >  Unix socket automatically translates pid attached to SCM_CREDENTIALS.
> >  This requires CAP_SYS_ADMIN for sending arbitrary pids and entering
> >  into pid namespace, this expose process and could be insecure."
> > 
> > The original patchset allowed two distinct operations implicitly:
> > - discovering whether pid namespaces (pidns) have a parent-child
> >   relationship
> > - translating a pid from a source pidns into a target pidns
> > 
> > Both tasks are accomplished in the original patchset by passing a pid
> > along. If the pid argument is passed as 1 the relationship between two pid
> > namespaces can be discovered.
> > The syscall will gain a lot clearer syntax and will be easier to use for
> > userspace if the task it is asked to perform is passed through a
> > command argument. Additionally, it allows us to remove an intrinsic race
> > caused by using the pid argument as a way to discover the relationship
> > between pid namespaces.
> > This patch introduces three commands:
> > 
> > /* PIDCMD_QUERY_PID */
> > PIDCMD_QUERY_PID allows to translate a pid between pid namespaces.
> > Given a source pid namespace fd return the pid of the process in the target
> > namespace:
> 
> Could we call this PIDCMD_TRANSLATE_PID please? QUERY is confusing/ambigious
> IMO (see below).

Yes, doesn't matter to me too much what we call it.

> 
> > 1. pidctl(PIDCMD_QUERY_PID, pid, source_fd, -1, 0)
> >   - retrieve pidns identified by source_fd
> >   - retrieve struct pid identifed by pid in pidns identified by source_fd
> >   - retrieve callers pidns
> >   - return pid in callers pidns
> > 
> > 2. pidctl(PIDCMD_QUERY_PID, pid, -1, target_fd, 0)
> >   - retrieve callers pidns
> >   - retrieve struct pid identifed by pid in callers pidns
> >   - retrieve pidns identified by target_fd
> >   - return pid in pidns identified by target_fd
> > 
> > 3. pidctl(PIDCMD_QUERY_PID, 1, source_fd, -1, 0)
> >   - retrieve pidns identified by source_fd
> >   - retrieve struct pid identifed by init task in pidns identified by source_fd
> >   - retrieve callers pidns
> >   - return pid of init task of pidns identified by source_fd in callers pidns
> > 
> > 4. pidctl(PIDCMD_QUERY_PID, pid, source_fd, target_fd, 0)
> >   - retrieve pidns identified by source_fd
> >   - retrieve struct pid identifed by pid in pidns identified by source_fd
> >   - retrieve pidns identified by target_fd
> >   - check whether struct pid can be found in pidns identified by target_fd
> >   - return pid in pidns identified by target_fd
> > 
> > /* PIDCMD_QUERY_PIDNS */
> > PIDCMD_QUERY_PIDNS allows to determine the relationship between pid
> > namespaces.
> > In the original version of the pachset passing pid as 1 would allow to
> > deterimine the relationship between the pid namespaces. This is inherhently
> > racy. If pid 1 inside a pid namespace has died it would report false
> > negatives. For example, if pid 1 inside of the target pid namespace already
> > died, it would report that the target pid namespace cannot be reached from
> > the source pid namespace because it couldn't find the pid inside of the
> > target pid namespace and thus falsely report to the user that the two pid
> > namespaces are not related. This problem is simple to avoid. In the new
> > version we simply walk the list of ancestors and check whether the
> > namespace are related to each other. By doing it this way we can reliably
> > report what the relationship between two pid namespace file descriptors
> > looks like.
> > 
> > 1. pidctl(PIDCMD_QUERY_PIDNS, 0, ns_fd1, ns_fd1, 0) == 0
> >    - pidns_of(ns_fd1) and pidns_of(ns_fd2) are unrelated to each other
> > 
> > 2. pidctl(PIDCMD_QUERY_PIDNS, 0, ns_fd1, ns_fd2, 0) == 1
> >    - pidns_of(ns_fd1) == pidns_of(ns_fd2)
> > 
> > 3. pidctl(PIDCMD_QUERY_PIDNS, 0, ns_fd1, ns_fd2, 0) == 2
> >    - pidns_of(ns_fd1) is ancestor of pidns_of(ns_fd2)
> > 
> > 4. pidctl(PIDCMD_QUERY_PIDNS, 0, ns_fd1, ns_fd2, 0) == 3
> >    - pidns_of(ns_fd2) is ancestor of pidns_of(ns_fd1)
> 
> Why not call this PIDCMD_COMPARE_PIDNS, since a comparison is what you're doing.

Same.

> 
> Again QUERY is ambigious here. Above you called QUERY to translate something,
> now you're calling QUERY to mean compare something. I suggest to be explicit
> about the operation PIDCMD_<OPERATION>_<OPERAND-TYPE>.
> 
> Arguably, 2 syscalls for this is cleaner:
> pid_compare_namespaces(ns_fd1, ns_fd2);
> pid_translate(pid, ns_fd1, nds_fd2);

I don't think we want to send out pid_compare_namespaces() as a separate
syscall. If that's the consensus I'd rather just drop this functionality
completely.

> 
> 
> > These two commands - PIDCMD_QUERY_PID and PIDCMD_QUERY_PIDNS - cover and
> > improve the functionality expressed implicitly in translate_pid() before.
> > 
> > /* PIDCMD_GET_PIDFD */
> 
> And this can be a third syscall:
> pidfd_translate(pid, ns_fd1, ns_fd2).

Sigh, yes. But I honestly want other people other than us to come out
and say "yes, please send a PR to Linus with three separate syscalls for
very closely related functionality". There's a difference between "this
is how we would ideally like to do it" and "this is what is common
practice and will likely get accepted". But I'm really not opposed to it
per se.

> 
> I am actually supportive of Daniel's view that by combining too many
> arguments into a single syscall, becomes confusing and sometimes some
> arguments have to be forced to 0 in the single shoe-horned syscall. Like you

There's a difference between an ioctl() and say seccomp() which this is
close to:
int seccomp(unsigned int operation, unsigned int flags, void *args);
The point is that the functionality is closely related not just randomly
unrelated stuff. But as I said I'm more than willing to compromise.

> don't need a pid to compare to pid-ns, so user has to set that to 0.

> 
> More comments below...
> 
> > This command allows to retrieve file descriptors for processes and removes
> > the dependency of pidfds and thereby the pidfd_send_signal() syscall on
> > procfs. First, multiple people have expressed a desire to do this even when
> > pidfd_send_signal() was merged. It is even recorded in the commit message
> > for pidfd_send_signal() itself
> > (cf. commit 3eb39f47934f9d5a3027fe00d906a45fe3a15fad):
> > Q-06: (Andrew Morton [1])
> >       Is there a cleaner way of obtaining the fd? Another syscall perhaps.
> > A-06: Userspace can already trivially retrieve file descriptors from procfs
> >       so this is something that we will need to support anyway. Hence,
> >       there's no immediate need to add another syscalls just to make
> >       pidfd_send_signal() not dependent on the presence of procfs. However,
> >       adding a syscalls to get such file descriptors is planned for a
> >       future patchset (cf. [1]).
> > Alexey made a similar request (cf. [2]).
> > Additionally, Andy made an additional, idependent argument that we should
> > go forward with non-proc-dirfd file descriptors for the sake of security
> > and extensibility (cf. [3]).
> > 
> > The pidfds are not associated with a specific pid namespaces but rather
> > only with struct pid. What the pidctl() syscall enforces is that when a
> > caller wants to retrieve a pidfd file descriptor for a pid in a given
> > target pid namespace the caller 
> > - must have been given access to two file descriptors referring
> >   to target and source pid namespace
> > - the source pid namespace must be an ancestor of the target pid
> >   namespace
> > - the pid must be translatable from the source pid namespace into the
> >   target pid namespace
> > 
> > 1. pidctl(PIDCMD_GET_PIDFD, pid, source_fd, -1, 0)
> >   - retrieve pidns identified by source_fd
> >   - retrieve struct pid identifed by pid in pidns identified by source_fd
> >   - retrieve callers pidns
> >   - return pidfd
> > 
> > 2. pidctl(PIDCMD_GET_PIDFD, pid, -1, target_fd, 0)
> >   - retrieve callers pidns
> >   - retrieve struct pid identifed by pid in callers pidns
> >   - retrieve pidns identified by target_fd
> >   - return pidfd
> > 
> > 3. pidctl(PIDCMD_GET_PIDFD, 1, source_fd, -1, 0)
> >   - retrieve pidns identified by source_fd
> >   - retrieve struct pid identifed by init task in pidns identified by
> >     source_fd
> >   - retrieve callers pidns
> >   - return pidfd
> > 
> > 4. pidctl(PIDCMD_GET_PIDFD, pid, source_fd, target_fd, 0)
> >   - retrieve pidns identified by source_fd
> >   - retrieve struct pid identifed by pid in pidns identified by source_fd
> >   - retrieve pidns identified by target_fd
> >   - check whether struct pid can be found in pidns identified by target_fd
> >   - return pidfd
> > 
> > These pidfds are allocated using anon_inode_getfd(), are O_CLOEXEC by
> > default and can be used with the pidfd_send_signal() syscall. They are not
> > dirfds and as such have the advantage that we can make them pollable or
> > readable in the future if we see a need to do so (which I'm pretty sure we
> > will eventually). Currently they do not support any advanced operations.
> > 
> > /* References */
> > [1]: https://lore.kernel.org/lkml/20181228233725.722tdfgijxcssg76@brauner.io/
> > [2]: https://lore.kernel.org/lkml/20190320203910.GA2842@avx2/
> > [3]: https://lore.kernel.org/lkml/CALCETrXO=V=+qEdLDVPf8eCgLZiB9bOTrUfe0V-U-tUZoeoRDA@mail.gmail.com/
> > [4]: https://lore.kernel.org/lkml/20181109034919.GA21681@altlinux.org/
> [snip]
> > diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
> > index e446806a561f..a4c8c59f7c8f 100644
> > --- a/include/linux/syscalls.h
> > +++ b/include/linux/syscalls.h
> > @@ -929,6 +929,8 @@ asmlinkage long sys_clock_adjtime32(clockid_t which_clock,
> >  				struct old_timex32 __user *tx);
> >  asmlinkage long sys_syncfs(int fd);
> >  asmlinkage long sys_setns(int fd, int nstype);
> > +asmlinkage long sys_pidctl(unsigned int cmd, pid_t pid, int source, int target,
> > +			   unsigned int flags);
> >  asmlinkage long sys_sendmmsg(int fd, struct mmsghdr __user *msg,
> >  			     unsigned int vlen, unsigned flags);
> >  asmlinkage long sys_process_vm_readv(pid_t pid,
> > diff --git a/include/uapi/linux/wait.h b/include/uapi/linux/wait.h
> > index ac49a220cf2a..e9564ec06b07 100644
> > --- a/include/uapi/linux/wait.h
> > +++ b/include/uapi/linux/wait.h
> > @@ -18,5 +18,19 @@
> >  #define P_PID		1
> >  #define P_PGID		2
> >  
> > +/* Commands to pass to pidctl() */
> > +enum pidcmd {
> > +	PIDCMD_QUERY_PID   = 0, /* Get pid in target pid namespace */
> > +	PIDCMD_QUERY_PIDNS = 1, /* Determine relationship between pid namespaces */
> > +	PIDCMD_GET_PIDFD   = 2, /* Get pidfd for a process */
> > +};
> > +
> > +/* Return values of PIDCMD_QUERY_PIDNS */
> > +enum pidcmd_query_pidns {
> > +	PIDNS_UNRELATED          = 0, /* The pid namespaces are unrelated */
> > +	PIDNS_EQUAL              = 1, /* The pid namespaces are equal */
> > +	PIDNS_SOURCE_IS_ANCESTOR = 2, /* Source pid namespace is ancestor of target pid namespace */
> > +	PIDNS_TARGET_IS_ANCESTOR = 3, /* Target pid namespace is ancestor of source pid namespace */
> > +};
> >  
> >  #endif /* _UAPI_LINUX_WAIT_H */
> > diff --git a/kernel/pid.c b/kernel/pid.c
> > index 20881598bdfa..3213a137a63e 100644
> > --- a/kernel/pid.c
> > +++ b/kernel/pid.c
> > @@ -26,6 +26,7 @@
> >   *
> >   */
> >  
> > +#include <linux/anon_inodes.h>
> >  #include <linux/mm.h>
> >  #include <linux/export.h>
> >  #include <linux/slab.h>
> > @@ -40,6 +41,7 @@
> >  #include <linux/proc_fs.h>
> >  #include <linux/sched/task.h>
> >  #include <linux/idr.h>
> > +#include <linux/wait.h>
> >  
> >  struct pid init_struct_pid = {
> >  	.count 		= ATOMIC_INIT(1),
> > @@ -451,6 +453,165 @@ struct pid *find_ge_pid(int nr, struct pid_namespace *ns)
> >  	return idr_get_next(&ns->idr, &nr);
> >  }
> >  
> > +static int pidfd_release(struct inode *inode, struct file *file)
> > +{
> > +	struct pid *pid = file->private_data;
> > +
> > +	if (pid) {
> > +		file->private_data = NULL;
> > +		put_pid(pid);
> > +	}
> > +
> > +	return 0;
> > +}
> > +
> > +const struct file_operations pidfd_fops = {
> > +	.release = pidfd_release,
> > +};
> > +
> > +static int pidfd_create_fd(struct pid *pid, unsigned int o_flags)
> > +{
> > +	int fd;
> > +
> > +	fd = anon_inode_getfd("pidfd", &pidfd_fops, get_pid(pid), O_RDWR | o_flags);
> > +	if (fd < 0)
> > +		put_pid(pid);
> > +
> > +	return fd;
> > +}
> > +
> > +static struct pid_namespace *get_pid_ns_by_fd(int fd)
> > +{
> > +	struct pid_namespace *pidns = ERR_PTR(-EINVAL);
> > +
> > +	if (fd >= 0) {
> > +#ifdef CONFIG_PID_NS
> > +		struct ns_common *ns;
> > +		struct file *file = proc_ns_fget(fd);
> > +		if (IS_ERR(file))
> > +			return ERR_CAST(file);
> > +
> > +		ns = get_proc_ns(file_inode(file));
> > +		if (ns->ops->type == CLONE_NEWPID) {
> > +			pidns = container_of(ns, struct pid_namespace, ns);
> > +			get_pid_ns(pidns);
> > +		}
> > +
> > +		fput(file);
> > +#endif
> > +	} else {
> > +		pidns = task_active_pid_ns(current);
> > +		get_pid_ns(pidns);
> > +	}
> > +
> > +	return pidns;
> > +}
> > +
> > +static int pidns_related(struct pid_namespace *source,
> > +			 struct pid_namespace *target)
> > +{
> > +	int query;
> > +
> > +	query = pidnscmp(source, target);
> > +	switch (query) {
> > +	case 0:
> > +		return PIDNS_EQUAL;
> > +	case 1:
> > +		return PIDNS_SOURCE_IS_ANCESTOR;
> > +	}
> > +
> > +	query = pidnscmp(target, source);
> > +	if (query == 1)
> > +		return PIDNS_TARGET_IS_ANCESTOR;
> > +
> > +	return PIDNS_UNRELATED;
> > +}
> > +
> > +/*
> > + * pidctl - perform operations on pids
> > + * @cmd:    command to execute
> > + * @pid:    pid for translation
> > + * @source: pid-ns file descriptor or -1 for active namespace
> > + * @target: pid-ns file descriptor or -1 for active namesapce
> > + * @flags:  flags to pass
> > + *
> > + * If cmd is PIDCMD_QUERY_PID translates pid between pid namespaces
> > + * identified by @source and @target. Returns pid if process has pid in
> > + * @target, -ESRCH if process does not have a pid in @source, -ENOENT if
> > + * process has no pid in @target.
> > + *
> > + * If cmd is PIDCMD_QUERY_PIDNS determines the relations between two pid
> > + * namespaces. Returns 2 if @source is an ancestor pid namespace
> > + * of @target, 1 if @source and @target refer to the same pid namespace,
> > + * 3 if @target is an ancestor pid namespace of @source, 0 if they have
> > + * no parent-child relationship in either direction.
> > + *
> > + * If cmd is PIDCMD_GET_PIDFD returns pidfd for process in @target pid
> > + * namespace. Returns pidfd if process has pid in @target, -ESRCH if
> > + * process does not have a pid in @source, -ENOENT if process does not
> > + * have a pid in @target pid namespace.
> > + *
> > + */
> > +SYSCALL_DEFINE5(pidctl, unsigned int, cmd, pid_t, pid, int, source, int, target,
> > +		unsigned int, flags)
> 
> flags seems not needed since it is unused, but I get it that you may want to
> have flags in the future? If yes, give example of future flags?
> 
> > +{
> > +	struct pid_namespace *source_ns = NULL, *target_ns = NULL;
> 
> Setting these to NULL is no longer needed.
> 
> > +	struct pid *struct_pid;
> > +	pid_t result;
> > +
> > +	if (flags)
> > +		return -EINVAL;
> > +
> > +	switch (cmd) {
> > +	case PIDCMD_QUERY_PID:
> > +		break;
> > +	case PIDCMD_QUERY_PIDNS:
> > +		if (pid)
> > +			return -EINVAL;
> > +		break;
> > +	case PIDCMD_GET_PIDFD:
> > +		break;
> > +	default:
> > +		return -EOPNOTSUPP;
> > +	}
> > +
> > +	source_ns = get_pid_ns_by_fd(source);
> > +	if (IS_ERR(source_ns))
> > +		return PTR_ERR(source_ns);
> > +
> > +	target_ns = get_pid_ns_by_fd(target);
> > +	if (IS_ERR(target_ns)) {
> > +		put_pid_ns(source_ns);
> > +		return PTR_ERR(target_ns);
> > +	}
> > +
> > +	if (cmd == PIDCMD_QUERY_PIDNS) {
> > +		result = pidns_related(source_ns, target_ns);
> > +	} else {
> > +		rcu_read_lock();
> > +		struct_pid = get_pid(find_pid_ns(pid, source_ns));
> > +		rcu_read_unlock();
> > +
> > +		if (struct_pid)
> > +			result = pid_nr_ns(struct_pid, target_ns);
> > +		else
> > +			result = -ESRCH;
> > +
> > +		if (cmd == PIDCMD_GET_PIDFD && (result > 0))
> > +			result = pidfd_create_fd(struct_pid, O_CLOEXEC);
> 
> pidfd_create_fd already does put_pid on errors..
> 
> > +
> > +		if (!result)
> > +			result = -ENOENT;
> > +
> > +		put_pid(struct_pid);
> 
> so on error you would put_pid twice which seems odd..  I would suggest, don't
> release the pid ref from within pidfd_create_fd, release the ref from the
> caller. Speaking of which, I added to my list to convert the pid->count to
> refcount_t at some point :)
> 
> > +	}
> > +
> > +	put_pid_ns(target_ns);
> > +	put_pid_ns(source_ns);
> 
> This part looks more clean than before so good.
> 
> thanks,
> 
>  - Joel
> 
> 
> > +	return result;
> > +}
> > +
> >  void __init pid_idr_init(void)
> >  {
> >  	/* Verify no one has done anything silly: */
> > diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
> > index aa6e72fb7c08..1c863fb3d55a 100644
> > --- a/kernel/pid_namespace.c
> > +++ b/kernel/pid_namespace.c
> > @@ -429,6 +429,31 @@ static struct ns_common *pidns_get_parent(struct ns_common *ns)
> >  	return &get_pid_ns(pid_ns)->ns;
> >  }
> >  
> > +/**
> > + * pidnscmp - Determine if @ancestor is ancestor of @descendant
> > + * @ancestor:   pidns suspected to be the ancestor of @descendant
> > + * @descendant: pidns suspected to be the descendant of @ancestor
> > + *
> > + * Returns -1 if @ancestor is not an ancestor of @descendant,
> > + * 0 if @ancestor is the same pidns as @descendant, 1 if @ancestor
> > + * is an ancestor of @descendant.
> > + */
> > +int pidnscmp(struct pid_namespace *ancestor, struct pid_namespace *descendant)
> > +{
> > +	if (ancestor == descendant)
> > +		return 0;
> > +
> > +	for (;;) {
> > +		if (!descendant)
> > +			return -1;
> > +		if (descendant == ancestor)
> > +			break;
> > +		descendant = descendant->parent;
> > +	}
> > +
> > +	return 1;
> > +}
> > +
> >  static struct user_namespace *pidns_owner(struct ns_common *ns)
> >  {
> >  	return to_pid_ns(ns)->user_ns;
> > -- 
> > 2.21.0
> > 

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v1 2/4] pid: add pidctl()
  2019-03-26 17:22     ` Christian Brauner
@ 2019-03-26 18:10       ` Joel Fernandes
  2019-03-26 18:19         ` Christian Brauner
  2019-03-26 19:19         ` Christian Brauner
  0 siblings, 2 replies; 29+ messages in thread
From: Joel Fernandes @ 2019-03-26 18:10 UTC (permalink / raw)
  To: Christian Brauner
  Cc: jannh, khlebnikov, luto, dhowells, serge, ebiederm, linux-api,
	linux-kernel, arnd, keescook, adobriyan, tglx, mtk.manpages,
	bl0pbl33p, ldv, akpm, oleg, nagarathnam.muthusamy, cyphar, viro,
	dancol

On Tue, Mar 26, 2019 at 06:22:33PM +0100, Christian Brauner wrote:
> On Tue, Mar 26, 2019 at 01:06:01PM -0400, Joel Fernandes wrote:
> > On Tue, Mar 26, 2019 at 04:55:11PM +0100, Christian Brauner wrote:
> > > The pidctl() syscalls builds on, extends, and improves translate_pid() [4].
> > > I quote Konstantins original patchset first that has already been acked and
> > > picked up by Eric before and whose functionality is preserved in this
> > > syscall:
> > > 
> > > "Each process have different pids, one for each pid namespace it belongs.
> > >  When interaction happens within single pid-ns translation isn't required.
> > >  More complicated scenarios needs special handling.
> > > 
> > >  For example:
> > >  - reading pid-files or logs written inside container with pid namespace
> > >  - writing logs with internal pids outside container for pushing them into
> > >  - attaching with ptrace to tasks from different pid namespace
> > > 
> > >  Generally speaking, any cross pid-ns API with pids needs translation.
> > > 
> > >  Currently there are several interfaces that could be used here:
> > > 
> > >  Pid namespaces are identified by device and inode of /proc/[pid]/ns/pid.
> > > 
> > >  Pids for nested pid namespaces are shown in file /proc/[pid]/status.
> > >  In some cases pid translation could be easily done using this information.
> > >  Backward translation requires scanning all tasks and becomes really
> > >  complicated for deeper namespace nesting.
> > > 
> > >  Unix socket automatically translates pid attached to SCM_CREDENTIALS.
> > >  This requires CAP_SYS_ADMIN for sending arbitrary pids and entering
> > >  into pid namespace, this expose process and could be insecure."
> > > 
> > > The original patchset allowed two distinct operations implicitly:
> > > - discovering whether pid namespaces (pidns) have a parent-child
> > >   relationship
> > > - translating a pid from a source pidns into a target pidns
> > > 
> > > Both tasks are accomplished in the original patchset by passing a pid
> > > along. If the pid argument is passed as 1 the relationship between two pid
> > > namespaces can be discovered.
> > > The syscall will gain a lot clearer syntax and will be easier to use for
> > > userspace if the task it is asked to perform is passed through a
> > > command argument. Additionally, it allows us to remove an intrinsic race
> > > caused by using the pid argument as a way to discover the relationship
> > > between pid namespaces.
> > > This patch introduces three commands:
> > > 
> > > /* PIDCMD_QUERY_PID */
> > > PIDCMD_QUERY_PID allows to translate a pid between pid namespaces.
> > > Given a source pid namespace fd return the pid of the process in the target
> > > namespace:
> > 
> > Could we call this PIDCMD_TRANSLATE_PID please? QUERY is confusing/ambigious
> > IMO (see below).
> 
> Yes, doesn't matter to me too much what we call it.

Ok.

> > > 1. pidctl(PIDCMD_QUERY_PID, pid, source_fd, -1, 0)
> > >   - retrieve pidns identified by source_fd
> > >   - retrieve struct pid identifed by pid in pidns identified by source_fd
> > >   - retrieve callers pidns
> > >   - return pid in callers pidns
> > > 
> > > 2. pidctl(PIDCMD_QUERY_PID, pid, -1, target_fd, 0)
> > >   - retrieve callers pidns
> > >   - retrieve struct pid identifed by pid in callers pidns
> > >   - retrieve pidns identified by target_fd
> > >   - return pid in pidns identified by target_fd
> > > 
> > > 3. pidctl(PIDCMD_QUERY_PID, 1, source_fd, -1, 0)
> > >   - retrieve pidns identified by source_fd
> > >   - retrieve struct pid identifed by init task in pidns identified by source_fd
> > >   - retrieve callers pidns
> > >   - return pid of init task of pidns identified by source_fd in callers pidns
> > > 
> > > 4. pidctl(PIDCMD_QUERY_PID, pid, source_fd, target_fd, 0)
> > >   - retrieve pidns identified by source_fd
> > >   - retrieve struct pid identifed by pid in pidns identified by source_fd
> > >   - retrieve pidns identified by target_fd
> > >   - check whether struct pid can be found in pidns identified by target_fd
> > >   - return pid in pidns identified by target_fd
> > > 
> > > /* PIDCMD_QUERY_PIDNS */
> > > PIDCMD_QUERY_PIDNS allows to determine the relationship between pid
> > > namespaces.
> > > In the original version of the pachset passing pid as 1 would allow to
> > > deterimine the relationship between the pid namespaces. This is inherhently
> > > racy. If pid 1 inside a pid namespace has died it would report false
> > > negatives. For example, if pid 1 inside of the target pid namespace already
> > > died, it would report that the target pid namespace cannot be reached from
> > > the source pid namespace because it couldn't find the pid inside of the
> > > target pid namespace and thus falsely report to the user that the two pid
> > > namespaces are not related. This problem is simple to avoid. In the new
> > > version we simply walk the list of ancestors and check whether the
> > > namespace are related to each other. By doing it this way we can reliably
> > > report what the relationship between two pid namespace file descriptors
> > > looks like.
> > > 
> > > 1. pidctl(PIDCMD_QUERY_PIDNS, 0, ns_fd1, ns_fd1, 0) == 0
> > >    - pidns_of(ns_fd1) and pidns_of(ns_fd2) are unrelated to each other
> > > 
> > > 2. pidctl(PIDCMD_QUERY_PIDNS, 0, ns_fd1, ns_fd2, 0) == 1
> > >    - pidns_of(ns_fd1) == pidns_of(ns_fd2)
> > > 
> > > 3. pidctl(PIDCMD_QUERY_PIDNS, 0, ns_fd1, ns_fd2, 0) == 2
> > >    - pidns_of(ns_fd1) is ancestor of pidns_of(ns_fd2)
> > > 
> > > 4. pidctl(PIDCMD_QUERY_PIDNS, 0, ns_fd1, ns_fd2, 0) == 3
> > >    - pidns_of(ns_fd2) is ancestor of pidns_of(ns_fd1)
> > 
> > Why not call this PIDCMD_COMPARE_PIDNS, since a comparison is what you're doing.
> 
> Same.

Ok, glad we agree on it.

> > 
> > Again QUERY is ambigious here. Above you called QUERY to translate something,
> > now you're calling QUERY to mean compare something. I suggest to be explicit
> > about the operation PIDCMD_<OPERATION>_<OPERAND-TYPE>.
> > 
> > Arguably, 2 syscalls for this is cleaner:
> > pid_compare_namespaces(ns_fd1, ns_fd2);
> > pid_translate(pid, ns_fd1, nds_fd2);
> 
> I don't think we want to send out pid_compare_namespaces() as a separate
> syscall. If that's the consensus I'd rather just drop this functionality
> completely.

I mean, if pid-namespace fd comparison functionality is not needed in
userspace, why add it all. I suggest to drop it if not needed and can be
considered for addition later when needed.

Then you're just left with GET_PID and TRANSLATE_PID which we know we do need.

> > > These two commands - PIDCMD_QUERY_PID and PIDCMD_QUERY_PIDNS - cover and
> > > improve the functionality expressed implicitly in translate_pid() before.
> > > 
> > > /* PIDCMD_GET_PIDFD */
> > 
> > And this can be a third syscall:
> > pidfd_translate(pid, ns_fd1, ns_fd2).
> 
> Sigh, yes. But I honestly want other people other than us to come out
> and say "yes, please send a PR to Linus with three separate syscalls for
> very closely related functionality". There's a difference between "this
> is how we would ideally like to do it" and "this is what is common
> practice and will likely get accepted". But I'm really not opposed to it
> per se.

Ok.

> > I am actually supportive of Daniel's view that by combining too many
> > arguments into a single syscall, becomes confusing and sometimes some
> > arguments have to be forced to 0 in the single shoe-horned syscall. Like you
> 
> There's a difference between an ioctl() and say seccomp() which this is
> close to:
> int seccomp(unsigned int operation, unsigned int flags, void *args);
> The point is that the functionality is closely related not just randomly
> unrelated stuff. But as I said I'm more than willing to compromise.

Sounds great, yeah whatever makes sense.

Thanks.


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v1 2/4] pid: add pidctl()
  2019-03-26 18:10       ` Joel Fernandes
@ 2019-03-26 18:19         ` Christian Brauner
  2019-03-26 18:47           ` Joel Fernandes
  2019-03-26 19:19         ` Christian Brauner
  1 sibling, 1 reply; 29+ messages in thread
From: Christian Brauner @ 2019-03-26 18:19 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: jannh, khlebnikov, luto, dhowells, serge, ebiederm, linux-api,
	linux-kernel, arnd, keescook, adobriyan, tglx, mtk.manpages,
	bl0pbl33p, ldv, akpm, oleg, nagarathnam.muthusamy, cyphar, viro,
	dancol

On Tue, Mar 26, 2019 at 02:10:12PM -0400, Joel Fernandes wrote:
> On Tue, Mar 26, 2019 at 06:22:33PM +0100, Christian Brauner wrote:
> > On Tue, Mar 26, 2019 at 01:06:01PM -0400, Joel Fernandes wrote:
> > > On Tue, Mar 26, 2019 at 04:55:11PM +0100, Christian Brauner wrote:
> > > > The pidctl() syscalls builds on, extends, and improves translate_pid() [4].
> > > > I quote Konstantins original patchset first that has already been acked and
> > > > picked up by Eric before and whose functionality is preserved in this
> > > > syscall:
> > > > 
> > > > "Each process have different pids, one for each pid namespace it belongs.
> > > >  When interaction happens within single pid-ns translation isn't required.
> > > >  More complicated scenarios needs special handling.
> > > > 
> > > >  For example:
> > > >  - reading pid-files or logs written inside container with pid namespace
> > > >  - writing logs with internal pids outside container for pushing them into
> > > >  - attaching with ptrace to tasks from different pid namespace
> > > > 
> > > >  Generally speaking, any cross pid-ns API with pids needs translation.
> > > > 
> > > >  Currently there are several interfaces that could be used here:
> > > > 
> > > >  Pid namespaces are identified by device and inode of /proc/[pid]/ns/pid.
> > > > 
> > > >  Pids for nested pid namespaces are shown in file /proc/[pid]/status.
> > > >  In some cases pid translation could be easily done using this information.
> > > >  Backward translation requires scanning all tasks and becomes really
> > > >  complicated for deeper namespace nesting.
> > > > 
> > > >  Unix socket automatically translates pid attached to SCM_CREDENTIALS.
> > > >  This requires CAP_SYS_ADMIN for sending arbitrary pids and entering
> > > >  into pid namespace, this expose process and could be insecure."
> > > > 
> > > > The original patchset allowed two distinct operations implicitly:
> > > > - discovering whether pid namespaces (pidns) have a parent-child
> > > >   relationship
> > > > - translating a pid from a source pidns into a target pidns
> > > > 
> > > > Both tasks are accomplished in the original patchset by passing a pid
> > > > along. If the pid argument is passed as 1 the relationship between two pid
> > > > namespaces can be discovered.
> > > > The syscall will gain a lot clearer syntax and will be easier to use for
> > > > userspace if the task it is asked to perform is passed through a
> > > > command argument. Additionally, it allows us to remove an intrinsic race
> > > > caused by using the pid argument as a way to discover the relationship
> > > > between pid namespaces.
> > > > This patch introduces three commands:
> > > > 
> > > > /* PIDCMD_QUERY_PID */
> > > > PIDCMD_QUERY_PID allows to translate a pid between pid namespaces.
> > > > Given a source pid namespace fd return the pid of the process in the target
> > > > namespace:
> > > 
> > > Could we call this PIDCMD_TRANSLATE_PID please? QUERY is confusing/ambigious
> > > IMO (see below).
> > 
> > Yes, doesn't matter to me too much what we call it.
> 
> Ok.
> 
> > > > 1. pidctl(PIDCMD_QUERY_PID, pid, source_fd, -1, 0)
> > > >   - retrieve pidns identified by source_fd
> > > >   - retrieve struct pid identifed by pid in pidns identified by source_fd
> > > >   - retrieve callers pidns
> > > >   - return pid in callers pidns
> > > > 
> > > > 2. pidctl(PIDCMD_QUERY_PID, pid, -1, target_fd, 0)
> > > >   - retrieve callers pidns
> > > >   - retrieve struct pid identifed by pid in callers pidns
> > > >   - retrieve pidns identified by target_fd
> > > >   - return pid in pidns identified by target_fd
> > > > 
> > > > 3. pidctl(PIDCMD_QUERY_PID, 1, source_fd, -1, 0)
> > > >   - retrieve pidns identified by source_fd
> > > >   - retrieve struct pid identifed by init task in pidns identified by source_fd
> > > >   - retrieve callers pidns
> > > >   - return pid of init task of pidns identified by source_fd in callers pidns
> > > > 
> > > > 4. pidctl(PIDCMD_QUERY_PID, pid, source_fd, target_fd, 0)
> > > >   - retrieve pidns identified by source_fd
> > > >   - retrieve struct pid identifed by pid in pidns identified by source_fd
> > > >   - retrieve pidns identified by target_fd
> > > >   - check whether struct pid can be found in pidns identified by target_fd
> > > >   - return pid in pidns identified by target_fd
> > > > 
> > > > /* PIDCMD_QUERY_PIDNS */
> > > > PIDCMD_QUERY_PIDNS allows to determine the relationship between pid
> > > > namespaces.
> > > > In the original version of the pachset passing pid as 1 would allow to
> > > > deterimine the relationship between the pid namespaces. This is inherhently
> > > > racy. If pid 1 inside a pid namespace has died it would report false
> > > > negatives. For example, if pid 1 inside of the target pid namespace already
> > > > died, it would report that the target pid namespace cannot be reached from
> > > > the source pid namespace because it couldn't find the pid inside of the
> > > > target pid namespace and thus falsely report to the user that the two pid
> > > > namespaces are not related. This problem is simple to avoid. In the new
> > > > version we simply walk the list of ancestors and check whether the
> > > > namespace are related to each other. By doing it this way we can reliably
> > > > report what the relationship between two pid namespace file descriptors
> > > > looks like.
> > > > 
> > > > 1. pidctl(PIDCMD_QUERY_PIDNS, 0, ns_fd1, ns_fd1, 0) == 0
> > > >    - pidns_of(ns_fd1) and pidns_of(ns_fd2) are unrelated to each other
> > > > 
> > > > 2. pidctl(PIDCMD_QUERY_PIDNS, 0, ns_fd1, ns_fd2, 0) == 1
> > > >    - pidns_of(ns_fd1) == pidns_of(ns_fd2)
> > > > 
> > > > 3. pidctl(PIDCMD_QUERY_PIDNS, 0, ns_fd1, ns_fd2, 0) == 2
> > > >    - pidns_of(ns_fd1) is ancestor of pidns_of(ns_fd2)
> > > > 
> > > > 4. pidctl(PIDCMD_QUERY_PIDNS, 0, ns_fd1, ns_fd2, 0) == 3
> > > >    - pidns_of(ns_fd2) is ancestor of pidns_of(ns_fd1)
> > > 
> > > Why not call this PIDCMD_COMPARE_PIDNS, since a comparison is what you're doing.
> > 
> > Same.
> 
> Ok, glad we agree on it.
> 
> > > 
> > > Again QUERY is ambigious here. Above you called QUERY to translate something,
> > > now you're calling QUERY to mean compare something. I suggest to be explicit
> > > about the operation PIDCMD_<OPERATION>_<OPERAND-TYPE>.
> > > 
> > > Arguably, 2 syscalls for this is cleaner:
> > > pid_compare_namespaces(ns_fd1, ns_fd2);
> > > pid_translate(pid, ns_fd1, nds_fd2);
> > 
> > I don't think we want to send out pid_compare_namespaces() as a separate
> > syscall. If that's the consensus I'd rather just drop this functionality
> > completely.
> 
> I mean, if pid-namespace fd comparison functionality is not needed in
> userspace, why add it all. I suggest to drop it if not needed and can be
> considered for addition later when needed.
> 
> Then you're just left with GET_PID and TRANSLATE_PID which we know we do need.
> 
> > > > These two commands - PIDCMD_QUERY_PID and PIDCMD_QUERY_PIDNS - cover and
> > > > improve the functionality expressed implicitly in translate_pid() before.
> > > > 
> > > > /* PIDCMD_GET_PIDFD */
> > > 
> > > And this can be a third syscall:
> > > pidfd_translate(pid, ns_fd1, ns_fd2).
> > 
> > Sigh, yes. But I honestly want other people other than us to come out
> > and say "yes, please send a PR to Linus with three separate syscalls for
> > very closely related functionality". There's a difference between "this
> > is how we would ideally like to do it" and "this is what is common
> > practice and will likely get accepted". But I'm really not opposed to it
> > per se.
> 
> Ok.
> 
> > > I am actually supportive of Daniel's view that by combining too many
> > > arguments into a single syscall, becomes confusing and sometimes some
> > > arguments have to be forced to 0 in the single shoe-horned syscall. Like you
> > 
> > There's a difference between an ioctl() and say seccomp() which this is
> > close to:
> > int seccomp(unsigned int operation, unsigned int flags, void *args);
> > The point is that the functionality is closely related not just randomly
> > unrelated stuff. But as I said I'm more than willing to compromise.
> 
> Sounds great, yeah whatever makes sense.

In case I haven't said this enough: I really appreciate the discussion
and in general the help on this. That probably sometimes gets lost in
mails sometimes. :)

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v1 2/4] pid: add pidctl()
  2019-03-26 18:19         ` Christian Brauner
@ 2019-03-26 18:47           ` Joel Fernandes
  0 siblings, 0 replies; 29+ messages in thread
From: Joel Fernandes @ 2019-03-26 18:47 UTC (permalink / raw)
  To: Christian Brauner
  Cc: jannh, khlebnikov, luto, dhowells, serge, ebiederm, linux-api,
	linux-kernel, arnd, keescook, adobriyan, tglx, mtk.manpages,
	bl0pbl33p, ldv, akpm, oleg, nagarathnam.muthusamy, cyphar, viro,
	dancol

On Tue, Mar 26, 2019 at 07:19:30PM +0100, Christian Brauner wrote:
[snip]
> > > > I am actually supportive of Daniel's view that by combining too many
> > > > arguments into a single syscall, becomes confusing and sometimes some
> > > > arguments have to be forced to 0 in the single shoe-horned syscall. Like you
> > > 
> > > There's a difference between an ioctl() and say seccomp() which this is
> > > close to:
> > > int seccomp(unsigned int operation, unsigned int flags, void *args);
> > > The point is that the functionality is closely related not just randomly
> > > unrelated stuff. But as I said I'm more than willing to compromise.
> > 
> > Sounds great, yeah whatever makes sense.
> 
> In case I haven't said this enough: I really appreciate the discussion
> and in general the help on this. That probably sometimes gets lost in
> mails sometimes. :)

I appreciate you saying that and thanks for the work on this :)

 - Joel


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v1 2/4] pid: add pidctl()
  2019-03-26 18:10       ` Joel Fernandes
  2019-03-26 18:19         ` Christian Brauner
@ 2019-03-26 19:19         ` Christian Brauner
  2019-03-26 19:51           ` Joel Fernandes
  1 sibling, 1 reply; 29+ messages in thread
From: Christian Brauner @ 2019-03-26 19:19 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: jannh, khlebnikov, luto, dhowells, serge, ebiederm, linux-api,
	linux-kernel, arnd, keescook, adobriyan, tglx, mtk.manpages,
	bl0pbl33p, ldv, akpm, oleg, nagarathnam.muthusamy, cyphar, viro,
	dancol

On Tue, Mar 26, 2019 at 02:10:12PM -0400, Joel Fernandes wrote:
> On Tue, Mar 26, 2019 at 06:22:33PM +0100, Christian Brauner wrote:
> > On Tue, Mar 26, 2019 at 01:06:01PM -0400, Joel Fernandes wrote:
> > > On Tue, Mar 26, 2019 at 04:55:11PM +0100, Christian Brauner wrote:
> > > > The pidctl() syscalls builds on, extends, and improves translate_pid() [4].
> > > > I quote Konstantins original patchset first that has already been acked and
> > > > picked up by Eric before and whose functionality is preserved in this
> > > > syscall:
> > > > 
> > > > "Each process have different pids, one for each pid namespace it belongs.
> > > >  When interaction happens within single pid-ns translation isn't required.
> > > >  More complicated scenarios needs special handling.
> > > > 
> > > >  For example:
> > > >  - reading pid-files or logs written inside container with pid namespace
> > > >  - writing logs with internal pids outside container for pushing them into
> > > >  - attaching with ptrace to tasks from different pid namespace
> > > > 
> > > >  Generally speaking, any cross pid-ns API with pids needs translation.
> > > > 
> > > >  Currently there are several interfaces that could be used here:
> > > > 
> > > >  Pid namespaces are identified by device and inode of /proc/[pid]/ns/pid.
> > > > 
> > > >  Pids for nested pid namespaces are shown in file /proc/[pid]/status.
> > > >  In some cases pid translation could be easily done using this information.
> > > >  Backward translation requires scanning all tasks and becomes really
> > > >  complicated for deeper namespace nesting.
> > > > 
> > > >  Unix socket automatically translates pid attached to SCM_CREDENTIALS.
> > > >  This requires CAP_SYS_ADMIN for sending arbitrary pids and entering
> > > >  into pid namespace, this expose process and could be insecure."
> > > > 
> > > > The original patchset allowed two distinct operations implicitly:
> > > > - discovering whether pid namespaces (pidns) have a parent-child
> > > >   relationship
> > > > - translating a pid from a source pidns into a target pidns
> > > > 
> > > > Both tasks are accomplished in the original patchset by passing a pid
> > > > along. If the pid argument is passed as 1 the relationship between two pid
> > > > namespaces can be discovered.
> > > > The syscall will gain a lot clearer syntax and will be easier to use for
> > > > userspace if the task it is asked to perform is passed through a
> > > > command argument. Additionally, it allows us to remove an intrinsic race
> > > > caused by using the pid argument as a way to discover the relationship
> > > > between pid namespaces.
> > > > This patch introduces three commands:
> > > > 
> > > > /* PIDCMD_QUERY_PID */
> > > > PIDCMD_QUERY_PID allows to translate a pid between pid namespaces.
> > > > Given a source pid namespace fd return the pid of the process in the target
> > > > namespace:
> > > 
> > > Could we call this PIDCMD_TRANSLATE_PID please? QUERY is confusing/ambigious
> > > IMO (see below).
> > 
> > Yes, doesn't matter to me too much what we call it.
> 
> Ok.
> 
> > > > 1. pidctl(PIDCMD_QUERY_PID, pid, source_fd, -1, 0)
> > > >   - retrieve pidns identified by source_fd
> > > >   - retrieve struct pid identifed by pid in pidns identified by source_fd
> > > >   - retrieve callers pidns
> > > >   - return pid in callers pidns
> > > > 
> > > > 2. pidctl(PIDCMD_QUERY_PID, pid, -1, target_fd, 0)
> > > >   - retrieve callers pidns
> > > >   - retrieve struct pid identifed by pid in callers pidns
> > > >   - retrieve pidns identified by target_fd
> > > >   - return pid in pidns identified by target_fd
> > > > 
> > > > 3. pidctl(PIDCMD_QUERY_PID, 1, source_fd, -1, 0)
> > > >   - retrieve pidns identified by source_fd
> > > >   - retrieve struct pid identifed by init task in pidns identified by source_fd
> > > >   - retrieve callers pidns
> > > >   - return pid of init task of pidns identified by source_fd in callers pidns
> > > > 
> > > > 4. pidctl(PIDCMD_QUERY_PID, pid, source_fd, target_fd, 0)
> > > >   - retrieve pidns identified by source_fd
> > > >   - retrieve struct pid identifed by pid in pidns identified by source_fd
> > > >   - retrieve pidns identified by target_fd
> > > >   - check whether struct pid can be found in pidns identified by target_fd
> > > >   - return pid in pidns identified by target_fd
> > > > 
> > > > /* PIDCMD_QUERY_PIDNS */
> > > > PIDCMD_QUERY_PIDNS allows to determine the relationship between pid
> > > > namespaces.
> > > > In the original version of the pachset passing pid as 1 would allow to
> > > > deterimine the relationship between the pid namespaces. This is inherhently
> > > > racy. If pid 1 inside a pid namespace has died it would report false
> > > > negatives. For example, if pid 1 inside of the target pid namespace already
> > > > died, it would report that the target pid namespace cannot be reached from
> > > > the source pid namespace because it couldn't find the pid inside of the
> > > > target pid namespace and thus falsely report to the user that the two pid
> > > > namespaces are not related. This problem is simple to avoid. In the new
> > > > version we simply walk the list of ancestors and check whether the
> > > > namespace are related to each other. By doing it this way we can reliably
> > > > report what the relationship between two pid namespace file descriptors
> > > > looks like.
> > > > 
> > > > 1. pidctl(PIDCMD_QUERY_PIDNS, 0, ns_fd1, ns_fd1, 0) == 0
> > > >    - pidns_of(ns_fd1) and pidns_of(ns_fd2) are unrelated to each other
> > > > 
> > > > 2. pidctl(PIDCMD_QUERY_PIDNS, 0, ns_fd1, ns_fd2, 0) == 1
> > > >    - pidns_of(ns_fd1) == pidns_of(ns_fd2)
> > > > 
> > > > 3. pidctl(PIDCMD_QUERY_PIDNS, 0, ns_fd1, ns_fd2, 0) == 2
> > > >    - pidns_of(ns_fd1) is ancestor of pidns_of(ns_fd2)
> > > > 
> > > > 4. pidctl(PIDCMD_QUERY_PIDNS, 0, ns_fd1, ns_fd2, 0) == 3
> > > >    - pidns_of(ns_fd2) is ancestor of pidns_of(ns_fd1)
> > > 
> > > Why not call this PIDCMD_COMPARE_PIDNS, since a comparison is what you're doing.
> > 
> > Same.
> 
> Ok, glad we agree on it.
> 
> > > 
> > > Again QUERY is ambigious here. Above you called QUERY to translate something,
> > > now you're calling QUERY to mean compare something. I suggest to be explicit
> > > about the operation PIDCMD_<OPERATION>_<OPERAND-TYPE>.
> > > 
> > > Arguably, 2 syscalls for this is cleaner:
> > > pid_compare_namespaces(ns_fd1, ns_fd2);
> > > pid_translate(pid, ns_fd1, nds_fd2);
> > 
> > I don't think we want to send out pid_compare_namespaces() as a separate
> > syscall. If that's the consensus I'd rather just drop this functionality
> > completely.
> 
> I mean, if pid-namespace fd comparison functionality is not needed in
> userspace, why add it all. I suggest to drop it if not needed and can be
> considered for addition later when needed.
> 
> Then you're just left with GET_PID and TRANSLATE_PID which we know we do need.
> 
> > > > These two commands - PIDCMD_QUERY_PID and PIDCMD_QUERY_PIDNS - cover and
> > > > improve the functionality expressed implicitly in translate_pid() before.
> > > > 
> > > > /* PIDCMD_GET_PIDFD */
> > > 
> > > And this can be a third syscall:
> > > pidfd_translate(pid, ns_fd1, ns_fd2).
> > 
> > Sigh, yes. But I honestly want other people other than us to come out
> > and say "yes, please send a PR to Linus with three separate syscalls for
> > very closely related functionality". There's a difference between "this
> > is how we would ideally like to do it" and "this is what is common
> > practice and will likely get accepted". But I'm really not opposed to it
> > per se.
> 
> Ok.
> 
> > > I am actually supportive of Daniel's view that by combining too many
> > > arguments into a single syscall, becomes confusing and sometimes some
> > > arguments have to be forced to 0 in the single shoe-horned syscall. Like you
> > 
> > There's a difference between an ioctl() and say seccomp() which this is
> > close to:
> > int seccomp(unsigned int operation, unsigned int flags, void *args);
> > The point is that the functionality is closely related not just randomly
> > unrelated stuff. But as I said I'm more than willing to compromise.
> 
> Sounds great, yeah whatever makes sense.

One option is to do the following which removes the cmd argument all
together in favor of a flag GET_PIDFD:

+SYSCALL_DEFINE4(pidfd_translate, pid_t, pid, int, source, int, target,
+               unsigned int, flags)
+{
+       struct pid_namespace *source_ns, *target_ns;
+       struct pid *struct_pid;
+       pid_t result;
+
+       if (flags & ~GET_PIDFD)
+               return -EINVAL;
+
+       source_ns = get_pid_ns_by_fd(source);
+       if (IS_ERR(source_ns))
+               return PTR_ERR(source_ns);
+
+       target_ns = get_pid_ns_by_fd(target);
+       if (IS_ERR(target_ns)) {
+               put_pid_ns(source_ns);
+               return PTR_ERR(target_ns);
+       }
+
+       rcu_read_lock();
+       struct_pid = get_pid(find_pid_ns(pid, source_ns));
+       rcu_read_unlock();
+
+       if (struct_pid)
+               result = pid_nr_ns(struct_pid, target_ns);
+       else
+               result = -ESRCH;
+
+       if ((flags & GET_PIDFD) && (result > 0))
+               result = pidfd_create_fd(struct_pid, O_CLOEXEC);
+
+       if (!result)
+               result = -ENOENT;
+
+       put_pid(struct_pid);
+       put_pid_ns(target_ns);
+       put_pid_ns(source_ns);
+
+       return result;
+}

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v1 2/4] pid: add pidctl()
  2019-03-26 19:19         ` Christian Brauner
@ 2019-03-26 19:51           ` Joel Fernandes
  2019-03-26 19:57             ` Christian Brauner
  0 siblings, 1 reply; 29+ messages in thread
From: Joel Fernandes @ 2019-03-26 19:51 UTC (permalink / raw)
  To: Christian Brauner
  Cc: jannh, khlebnikov, luto, dhowells, serge, ebiederm, linux-api,
	linux-kernel, arnd, keescook, adobriyan, tglx, mtk.manpages,
	bl0pbl33p, ldv, akpm, oleg, nagarathnam.muthusamy, cyphar, viro,
	dancol

On Tue, Mar 26, 2019 at 08:19:43PM +0100, Christian Brauner wrote:
> On Tue, Mar 26, 2019 at 02:10:12PM -0400, Joel Fernandes wrote:
> > On Tue, Mar 26, 2019 at 06:22:33PM +0100, Christian Brauner wrote:
> > > On Tue, Mar 26, 2019 at 01:06:01PM -0400, Joel Fernandes wrote:
> > > > On Tue, Mar 26, 2019 at 04:55:11PM +0100, Christian Brauner wrote:
> > > > > The pidctl() syscalls builds on, extends, and improves translate_pid() [4].
> > > > > I quote Konstantins original patchset first that has already been acked and
> > > > > picked up by Eric before and whose functionality is preserved in this
> > > > > syscall:
> > > > > 
> > > > > "Each process have different pids, one for each pid namespace it belongs.
> > > > >  When interaction happens within single pid-ns translation isn't required.
> > > > >  More complicated scenarios needs special handling.
> > > > > 
> > > > >  For example:
> > > > >  - reading pid-files or logs written inside container with pid namespace
> > > > >  - writing logs with internal pids outside container for pushing them into
> > > > >  - attaching with ptrace to tasks from different pid namespace
> > > > > 
> > > > >  Generally speaking, any cross pid-ns API with pids needs translation.
> > > > > 
> > > > >  Currently there are several interfaces that could be used here:
> > > > > 
> > > > >  Pid namespaces are identified by device and inode of /proc/[pid]/ns/pid.
> > > > > 
> > > > >  Pids for nested pid namespaces are shown in file /proc/[pid]/status.
> > > > >  In some cases pid translation could be easily done using this information.
> > > > >  Backward translation requires scanning all tasks and becomes really
> > > > >  complicated for deeper namespace nesting.
> > > > > 
> > > > >  Unix socket automatically translates pid attached to SCM_CREDENTIALS.
> > > > >  This requires CAP_SYS_ADMIN for sending arbitrary pids and entering
> > > > >  into pid namespace, this expose process and could be insecure."
> > > > > 
> > > > > The original patchset allowed two distinct operations implicitly:
> > > > > - discovering whether pid namespaces (pidns) have a parent-child
> > > > >   relationship
> > > > > - translating a pid from a source pidns into a target pidns
> > > > > 
> > > > > Both tasks are accomplished in the original patchset by passing a pid
> > > > > along. If the pid argument is passed as 1 the relationship between two pid
> > > > > namespaces can be discovered.
> > > > > The syscall will gain a lot clearer syntax and will be easier to use for
> > > > > userspace if the task it is asked to perform is passed through a
> > > > > command argument. Additionally, it allows us to remove an intrinsic race
> > > > > caused by using the pid argument as a way to discover the relationship
> > > > > between pid namespaces.
> > > > > This patch introduces three commands:
> > > > > 
> > > > > /* PIDCMD_QUERY_PID */
> > > > > PIDCMD_QUERY_PID allows to translate a pid between pid namespaces.
> > > > > Given a source pid namespace fd return the pid of the process in the target
> > > > > namespace:
> > > > 
> > > > Could we call this PIDCMD_TRANSLATE_PID please? QUERY is confusing/ambigious
> > > > IMO (see below).
> > > 
> > > Yes, doesn't matter to me too much what we call it.
> > 
> > Ok.
> > 
> > > > > 1. pidctl(PIDCMD_QUERY_PID, pid, source_fd, -1, 0)
> > > > >   - retrieve pidns identified by source_fd
> > > > >   - retrieve struct pid identifed by pid in pidns identified by source_fd
> > > > >   - retrieve callers pidns
> > > > >   - return pid in callers pidns
> > > > > 
> > > > > 2. pidctl(PIDCMD_QUERY_PID, pid, -1, target_fd, 0)
> > > > >   - retrieve callers pidns
> > > > >   - retrieve struct pid identifed by pid in callers pidns
> > > > >   - retrieve pidns identified by target_fd
> > > > >   - return pid in pidns identified by target_fd
> > > > > 
> > > > > 3. pidctl(PIDCMD_QUERY_PID, 1, source_fd, -1, 0)
> > > > >   - retrieve pidns identified by source_fd
> > > > >   - retrieve struct pid identifed by init task in pidns identified by source_fd
> > > > >   - retrieve callers pidns
> > > > >   - return pid of init task of pidns identified by source_fd in callers pidns
> > > > > 
> > > > > 4. pidctl(PIDCMD_QUERY_PID, pid, source_fd, target_fd, 0)
> > > > >   - retrieve pidns identified by source_fd
> > > > >   - retrieve struct pid identifed by pid in pidns identified by source_fd
> > > > >   - retrieve pidns identified by target_fd
> > > > >   - check whether struct pid can be found in pidns identified by target_fd
> > > > >   - return pid in pidns identified by target_fd
> > > > > 
> > > > > /* PIDCMD_QUERY_PIDNS */
> > > > > PIDCMD_QUERY_PIDNS allows to determine the relationship between pid
> > > > > namespaces.
> > > > > In the original version of the pachset passing pid as 1 would allow to
> > > > > deterimine the relationship between the pid namespaces. This is inherhently
> > > > > racy. If pid 1 inside a pid namespace has died it would report false
> > > > > negatives. For example, if pid 1 inside of the target pid namespace already
> > > > > died, it would report that the target pid namespace cannot be reached from
> > > > > the source pid namespace because it couldn't find the pid inside of the
> > > > > target pid namespace and thus falsely report to the user that the two pid
> > > > > namespaces are not related. This problem is simple to avoid. In the new
> > > > > version we simply walk the list of ancestors and check whether the
> > > > > namespace are related to each other. By doing it this way we can reliably
> > > > > report what the relationship between two pid namespace file descriptors
> > > > > looks like.
> > > > > 
> > > > > 1. pidctl(PIDCMD_QUERY_PIDNS, 0, ns_fd1, ns_fd1, 0) == 0
> > > > >    - pidns_of(ns_fd1) and pidns_of(ns_fd2) are unrelated to each other
> > > > > 
> > > > > 2. pidctl(PIDCMD_QUERY_PIDNS, 0, ns_fd1, ns_fd2, 0) == 1
> > > > >    - pidns_of(ns_fd1) == pidns_of(ns_fd2)
> > > > > 
> > > > > 3. pidctl(PIDCMD_QUERY_PIDNS, 0, ns_fd1, ns_fd2, 0) == 2
> > > > >    - pidns_of(ns_fd1) is ancestor of pidns_of(ns_fd2)
> > > > > 
> > > > > 4. pidctl(PIDCMD_QUERY_PIDNS, 0, ns_fd1, ns_fd2, 0) == 3
> > > > >    - pidns_of(ns_fd2) is ancestor of pidns_of(ns_fd1)
> > > > 
> > > > Why not call this PIDCMD_COMPARE_PIDNS, since a comparison is what you're doing.
> > > 
> > > Same.
> > 
> > Ok, glad we agree on it.
> > 
> > > > 
> > > > Again QUERY is ambigious here. Above you called QUERY to translate something,
> > > > now you're calling QUERY to mean compare something. I suggest to be explicit
> > > > about the operation PIDCMD_<OPERATION>_<OPERAND-TYPE>.
> > > > 
> > > > Arguably, 2 syscalls for this is cleaner:
> > > > pid_compare_namespaces(ns_fd1, ns_fd2);
> > > > pid_translate(pid, ns_fd1, nds_fd2);
> > > 
> > > I don't think we want to send out pid_compare_namespaces() as a separate
> > > syscall. If that's the consensus I'd rather just drop this functionality
> > > completely.
> > 
> > I mean, if pid-namespace fd comparison functionality is not needed in
> > userspace, why add it all. I suggest to drop it if not needed and can be
> > considered for addition later when needed.
> > 
> > Then you're just left with GET_PID and TRANSLATE_PID which we know we do need.
> > 
> > > > > These two commands - PIDCMD_QUERY_PID and PIDCMD_QUERY_PIDNS - cover and
> > > > > improve the functionality expressed implicitly in translate_pid() before.
> > > > > 
> > > > > /* PIDCMD_GET_PIDFD */
> > > > 
> > > > And this can be a third syscall:
> > > > pidfd_translate(pid, ns_fd1, ns_fd2).
> > > 
> > > Sigh, yes. But I honestly want other people other than us to come out
> > > and say "yes, please send a PR to Linus with three separate syscalls for
> > > very closely related functionality". There's a difference between "this
> > > is how we would ideally like to do it" and "this is what is common
> > > practice and will likely get accepted". But I'm really not opposed to it
> > > per se.
> > 
> > Ok.
> > 
> > > > I am actually supportive of Daniel's view that by combining too many
> > > > arguments into a single syscall, becomes confusing and sometimes some
> > > > arguments have to be forced to 0 in the single shoe-horned syscall. Like you
> > > 
> > > There's a difference between an ioctl() and say seccomp() which this is
> > > close to:
> > > int seccomp(unsigned int operation, unsigned int flags, void *args);
> > > The point is that the functionality is closely related not just randomly
> > > unrelated stuff. But as I said I'm more than willing to compromise.
> > 
> > Sounds great, yeah whatever makes sense.
> 
> One option is to do the following which removes the cmd argument all
> together in favor of a flag GET_PIDFD:
> 
> +SYSCALL_DEFINE4(pidfd_translate, pid_t, pid, int, source, int, target,
> +               unsigned int, flags)
> +{

Yes that's also fine. But would it not be better to just add the flags as an
extra parameter to translate_pid syscall, telling it what to translate to
(pidfd or pid)?

thanks,

 - Joel



> +       struct pid_namespace *source_ns, *target_ns;
> +       struct pid *struct_pid;
> +       pid_t result;
> +
> +       if (flags & ~GET_PIDFD)
> +               return -EINVAL;
> +
> +       source_ns = get_pid_ns_by_fd(source);
> +       if (IS_ERR(source_ns))
> +               return PTR_ERR(source_ns);
> +
> +       target_ns = get_pid_ns_by_fd(target);
> +       if (IS_ERR(target_ns)) {
> +               put_pid_ns(source_ns);
> +               return PTR_ERR(target_ns);
> +       }
> +
> +       rcu_read_lock();
> +       struct_pid = get_pid(find_pid_ns(pid, source_ns));
> +       rcu_read_unlock();
> +
> +       if (struct_pid)
> +               result = pid_nr_ns(struct_pid, target_ns);
> +       else
> +               result = -ESRCH;
> +
> +       if ((flags & GET_PIDFD) && (result > 0))
> +               result = pidfd_create_fd(struct_pid, O_CLOEXEC);
> +
> +       if (!result)
> +               result = -ENOENT;
> +
> +       put_pid(struct_pid);
> +       put_pid_ns(target_ns);
> +       put_pid_ns(source_ns);
> +
> +       return result;
> +}

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v1 2/4] pid: add pidctl()
  2019-03-26 19:51           ` Joel Fernandes
@ 2019-03-26 19:57             ` Christian Brauner
  0 siblings, 0 replies; 29+ messages in thread
From: Christian Brauner @ 2019-03-26 19:57 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: jannh, khlebnikov, luto, dhowells, serge, ebiederm, linux-api,
	linux-kernel, arnd, keescook, adobriyan, tglx, mtk.manpages,
	bl0pbl33p, ldv, akpm, oleg, nagarathnam.muthusamy, cyphar, viro,
	dancol

On Tue, Mar 26, 2019 at 03:51:51PM -0400, Joel Fernandes wrote:
> On Tue, Mar 26, 2019 at 08:19:43PM +0100, Christian Brauner wrote:
> > On Tue, Mar 26, 2019 at 02:10:12PM -0400, Joel Fernandes wrote:
> > > On Tue, Mar 26, 2019 at 06:22:33PM +0100, Christian Brauner wrote:
> > > > On Tue, Mar 26, 2019 at 01:06:01PM -0400, Joel Fernandes wrote:
> > > > > On Tue, Mar 26, 2019 at 04:55:11PM +0100, Christian Brauner wrote:
> > > > > > The pidctl() syscalls builds on, extends, and improves translate_pid() [4].
> > > > > > I quote Konstantins original patchset first that has already been acked and
> > > > > > picked up by Eric before and whose functionality is preserved in this
> > > > > > syscall:
> > > > > > 
> > > > > > "Each process have different pids, one for each pid namespace it belongs.
> > > > > >  When interaction happens within single pid-ns translation isn't required.
> > > > > >  More complicated scenarios needs special handling.
> > > > > > 
> > > > > >  For example:
> > > > > >  - reading pid-files or logs written inside container with pid namespace
> > > > > >  - writing logs with internal pids outside container for pushing them into
> > > > > >  - attaching with ptrace to tasks from different pid namespace
> > > > > > 
> > > > > >  Generally speaking, any cross pid-ns API with pids needs translation.
> > > > > > 
> > > > > >  Currently there are several interfaces that could be used here:
> > > > > > 
> > > > > >  Pid namespaces are identified by device and inode of /proc/[pid]/ns/pid.
> > > > > > 
> > > > > >  Pids for nested pid namespaces are shown in file /proc/[pid]/status.
> > > > > >  In some cases pid translation could be easily done using this information.
> > > > > >  Backward translation requires scanning all tasks and becomes really
> > > > > >  complicated for deeper namespace nesting.
> > > > > > 
> > > > > >  Unix socket automatically translates pid attached to SCM_CREDENTIALS.
> > > > > >  This requires CAP_SYS_ADMIN for sending arbitrary pids and entering
> > > > > >  into pid namespace, this expose process and could be insecure."
> > > > > > 
> > > > > > The original patchset allowed two distinct operations implicitly:
> > > > > > - discovering whether pid namespaces (pidns) have a parent-child
> > > > > >   relationship
> > > > > > - translating a pid from a source pidns into a target pidns
> > > > > > 
> > > > > > Both tasks are accomplished in the original patchset by passing a pid
> > > > > > along. If the pid argument is passed as 1 the relationship between two pid
> > > > > > namespaces can be discovered.
> > > > > > The syscall will gain a lot clearer syntax and will be easier to use for
> > > > > > userspace if the task it is asked to perform is passed through a
> > > > > > command argument. Additionally, it allows us to remove an intrinsic race
> > > > > > caused by using the pid argument as a way to discover the relationship
> > > > > > between pid namespaces.
> > > > > > This patch introduces three commands:
> > > > > > 
> > > > > > /* PIDCMD_QUERY_PID */
> > > > > > PIDCMD_QUERY_PID allows to translate a pid between pid namespaces.
> > > > > > Given a source pid namespace fd return the pid of the process in the target
> > > > > > namespace:
> > > > > 
> > > > > Could we call this PIDCMD_TRANSLATE_PID please? QUERY is confusing/ambigious
> > > > > IMO (see below).
> > > > 
> > > > Yes, doesn't matter to me too much what we call it.
> > > 
> > > Ok.
> > > 
> > > > > > 1. pidctl(PIDCMD_QUERY_PID, pid, source_fd, -1, 0)
> > > > > >   - retrieve pidns identified by source_fd
> > > > > >   - retrieve struct pid identifed by pid in pidns identified by source_fd
> > > > > >   - retrieve callers pidns
> > > > > >   - return pid in callers pidns
> > > > > > 
> > > > > > 2. pidctl(PIDCMD_QUERY_PID, pid, -1, target_fd, 0)
> > > > > >   - retrieve callers pidns
> > > > > >   - retrieve struct pid identifed by pid in callers pidns
> > > > > >   - retrieve pidns identified by target_fd
> > > > > >   - return pid in pidns identified by target_fd
> > > > > > 
> > > > > > 3. pidctl(PIDCMD_QUERY_PID, 1, source_fd, -1, 0)
> > > > > >   - retrieve pidns identified by source_fd
> > > > > >   - retrieve struct pid identifed by init task in pidns identified by source_fd
> > > > > >   - retrieve callers pidns
> > > > > >   - return pid of init task of pidns identified by source_fd in callers pidns
> > > > > > 
> > > > > > 4. pidctl(PIDCMD_QUERY_PID, pid, source_fd, target_fd, 0)
> > > > > >   - retrieve pidns identified by source_fd
> > > > > >   - retrieve struct pid identifed by pid in pidns identified by source_fd
> > > > > >   - retrieve pidns identified by target_fd
> > > > > >   - check whether struct pid can be found in pidns identified by target_fd
> > > > > >   - return pid in pidns identified by target_fd
> > > > > > 
> > > > > > /* PIDCMD_QUERY_PIDNS */
> > > > > > PIDCMD_QUERY_PIDNS allows to determine the relationship between pid
> > > > > > namespaces.
> > > > > > In the original version of the pachset passing pid as 1 would allow to
> > > > > > deterimine the relationship between the pid namespaces. This is inherhently
> > > > > > racy. If pid 1 inside a pid namespace has died it would report false
> > > > > > negatives. For example, if pid 1 inside of the target pid namespace already
> > > > > > died, it would report that the target pid namespace cannot be reached from
> > > > > > the source pid namespace because it couldn't find the pid inside of the
> > > > > > target pid namespace and thus falsely report to the user that the two pid
> > > > > > namespaces are not related. This problem is simple to avoid. In the new
> > > > > > version we simply walk the list of ancestors and check whether the
> > > > > > namespace are related to each other. By doing it this way we can reliably
> > > > > > report what the relationship between two pid namespace file descriptors
> > > > > > looks like.
> > > > > > 
> > > > > > 1. pidctl(PIDCMD_QUERY_PIDNS, 0, ns_fd1, ns_fd1, 0) == 0
> > > > > >    - pidns_of(ns_fd1) and pidns_of(ns_fd2) are unrelated to each other
> > > > > > 
> > > > > > 2. pidctl(PIDCMD_QUERY_PIDNS, 0, ns_fd1, ns_fd2, 0) == 1
> > > > > >    - pidns_of(ns_fd1) == pidns_of(ns_fd2)
> > > > > > 
> > > > > > 3. pidctl(PIDCMD_QUERY_PIDNS, 0, ns_fd1, ns_fd2, 0) == 2
> > > > > >    - pidns_of(ns_fd1) is ancestor of pidns_of(ns_fd2)
> > > > > > 
> > > > > > 4. pidctl(PIDCMD_QUERY_PIDNS, 0, ns_fd1, ns_fd2, 0) == 3
> > > > > >    - pidns_of(ns_fd2) is ancestor of pidns_of(ns_fd1)
> > > > > 
> > > > > Why not call this PIDCMD_COMPARE_PIDNS, since a comparison is what you're doing.
> > > > 
> > > > Same.
> > > 
> > > Ok, glad we agree on it.
> > > 
> > > > > 
> > > > > Again QUERY is ambigious here. Above you called QUERY to translate something,
> > > > > now you're calling QUERY to mean compare something. I suggest to be explicit
> > > > > about the operation PIDCMD_<OPERATION>_<OPERAND-TYPE>.
> > > > > 
> > > > > Arguably, 2 syscalls for this is cleaner:
> > > > > pid_compare_namespaces(ns_fd1, ns_fd2);
> > > > > pid_translate(pid, ns_fd1, nds_fd2);
> > > > 
> > > > I don't think we want to send out pid_compare_namespaces() as a separate
> > > > syscall. If that's the consensus I'd rather just drop this functionality
> > > > completely.
> > > 
> > > I mean, if pid-namespace fd comparison functionality is not needed in
> > > userspace, why add it all. I suggest to drop it if not needed and can be
> > > considered for addition later when needed.
> > > 
> > > Then you're just left with GET_PID and TRANSLATE_PID which we know we do need.
> > > 
> > > > > > These two commands - PIDCMD_QUERY_PID and PIDCMD_QUERY_PIDNS - cover and
> > > > > > improve the functionality expressed implicitly in translate_pid() before.
> > > > > > 
> > > > > > /* PIDCMD_GET_PIDFD */
> > > > > 
> > > > > And this can be a third syscall:
> > > > > pidfd_translate(pid, ns_fd1, ns_fd2).
> > > > 
> > > > Sigh, yes. But I honestly want other people other than us to come out
> > > > and say "yes, please send a PR to Linus with three separate syscalls for
> > > > very closely related functionality". There's a difference between "this
> > > > is how we would ideally like to do it" and "this is what is common
> > > > practice and will likely get accepted". But I'm really not opposed to it
> > > > per se.
> > > 
> > > Ok.
> > > 
> > > > > I am actually supportive of Daniel's view that by combining too many
> > > > > arguments into a single syscall, becomes confusing and sometimes some
> > > > > arguments have to be forced to 0 in the single shoe-horned syscall. Like you
> > > > 
> > > > There's a difference between an ioctl() and say seccomp() which this is
> > > > close to:
> > > > int seccomp(unsigned int operation, unsigned int flags, void *args);
> > > > The point is that the functionality is closely related not just randomly
> > > > unrelated stuff. But as I said I'm more than willing to compromise.
> > > 
> > > Sounds great, yeah whatever makes sense.
> > 
> > One option is to do the following which removes the cmd argument all
> > together in favor of a flag GET_PIDFD:
> > 
> > +SYSCALL_DEFINE4(pidfd_translate, pid_t, pid, int, source, int, target,
> > +               unsigned int, flags)
> > +{
> 
> Yes that's also fine. But would it not be better to just add the flags as an

Also fine. It's just a sketch. I'm working on another suggestion that
might be better.

> extra parameter to translate_pid syscall, telling it what to translate to
> (pidfd or pid)?
> 
> thanks,
> 
>  - Joel
> 
> 
> 
> > +       struct pid_namespace *source_ns, *target_ns;
> > +       struct pid *struct_pid;
> > +       pid_t result;
> > +
> > +       if (flags & ~GET_PIDFD)
> > +               return -EINVAL;
> > +
> > +       source_ns = get_pid_ns_by_fd(source);
> > +       if (IS_ERR(source_ns))
> > +               return PTR_ERR(source_ns);
> > +
> > +       target_ns = get_pid_ns_by_fd(target);
> > +       if (IS_ERR(target_ns)) {
> > +               put_pid_ns(source_ns);
> > +               return PTR_ERR(target_ns);
> > +       }
> > +
> > +       rcu_read_lock();
> > +       struct_pid = get_pid(find_pid_ns(pid, source_ns));
> > +       rcu_read_unlock();
> > +
> > +       if (struct_pid)
> > +               result = pid_nr_ns(struct_pid, target_ns);
> > +       else
> > +               result = -ESRCH;
> > +
> > +       if ((flags & GET_PIDFD) && (result > 0))
> > +               result = pidfd_create_fd(struct_pid, O_CLOEXEC);
> > +
> > +       if (!result)
> > +               result = -ENOENT;
> > +
> > +       put_pid(struct_pid);
> > +       put_pid_ns(target_ns);
> > +       put_pid_ns(source_ns);
> > +
> > +       return result;
> > +}

^ permalink raw reply	[flat|nested] 29+ messages in thread

end of thread, other threads:[~2019-03-26 19:57 UTC | newest]

Thread overview: 29+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-03-26 15:55 [PATCH v1 0/4] pid: add pidctl() Christian Brauner
2019-03-26 15:55 ` [PATCH v1 1/4] Make anon_inodes unconditional Christian Brauner
2019-03-26 15:55 ` [PATCH v1 2/4] pid: add pidctl() Christian Brauner
2019-03-26 16:17   ` Daniel Colascione
2019-03-26 16:23     ` Christian Brauner
2019-03-26 16:31       ` Christian Brauner
2019-03-26 16:34         ` Christian Brauner
2019-03-26 16:38           ` Daniel Colascione
2019-03-26 16:43             ` Christian Brauner
2019-03-26 16:50               ` Daniel Colascione
2019-03-26 17:05                 ` Christian Brauner
2019-03-26 16:42           ` Andy Lutomirski
2019-03-26 16:46             ` Christian Brauner
2019-03-26 17:00               ` Daniel Colascione
2019-03-26 16:33       ` Daniel Colascione
2019-03-26 17:06   ` Joel Fernandes
2019-03-26 17:08     ` Christian Brauner
2019-03-26 17:15       ` Joel Fernandes
2019-03-26 17:17         ` Christian Brauner
2019-03-26 17:18           ` Joel Fernandes
2019-03-26 17:22     ` Christian Brauner
2019-03-26 18:10       ` Joel Fernandes
2019-03-26 18:19         ` Christian Brauner
2019-03-26 18:47           ` Joel Fernandes
2019-03-26 19:19         ` Christian Brauner
2019-03-26 19:51           ` Joel Fernandes
2019-03-26 19:57             ` Christian Brauner
2019-03-26 15:55 ` [PATCH v1 3/4] signal: support pidctl() with pidfd_send_signal() Christian Brauner
2019-03-26 15:55 ` [PATCH v1 4/4] tests: add pidctl() tests Christian Brauner

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).