linux-fsdevel.vger.kernel.org archive mirror
* [PATCH v5 0/6] capabilities: Introduce CAP_CHECKPOINT_RESTORE
@ 2020-07-15 14:49 Adrian Reber
  2020-07-15 14:49 ` [PATCH v5 1/6] " Adrian Reber
                   ` (6 more replies)
  0 siblings, 7 replies; 17+ messages in thread
From: Adrian Reber @ 2020-07-15 14:49 UTC (permalink / raw)
  To: Christian Brauner, Eric Biederman, Pavel Emelyanov,
	Oleg Nesterov, Dmitry Safonov, Andrei Vagin, Nicolas Viennot,
	Michał Cłapiński, Kamil Yurtsever, Dirk Petersen,
	Christine Flood, Casey Schaufler
  Cc: Mike Rapoport, Radostin Stoyanov, Adrian Reber, Cyrill Gorcunov,
	Serge Hallyn, Stephen Smalley, Sargun Dhillon, Arnd Bergmann,
	linux-security-module, linux-kernel, selinux, Eric Paris,
	Jann Horn, linux-fsdevel

This is v5 of the 'Introduce CAP_CHECKPOINT_RESTORE' patchset. The
changes since v4 are:

 * split into more patches to have the introduction of
   CAP_CHECKPOINT_RESTORE and the actual usage in different
   patches
 * reduce the /proc/self/exe patch to only be about
   CAP_CHECKPOINT_RESTORE

Adrian Reber (5):
  capabilities: Introduce CAP_CHECKPOINT_RESTORE
  pid: use checkpoint_restore_ns_capable() for set_tid
  pid_namespace: use checkpoint_restore_ns_capable() for ns_last_pid
  proc: allow access in init userns for map_files with CAP_CHECKPOINT_RESTORE
  selftests: add clone3() CAP_CHECKPOINT_RESTORE test

Nicolas Viennot (1):
  prctl: Allow checkpoint/restore capable processes to change exe link

 fs/proc/base.c                                |   8 +-
 include/linux/capability.h                    |   6 +
 include/uapi/linux/capability.h               |   9 +-
 kernel/pid.c                                  |   2 +-
 kernel/pid_namespace.c                        |   2 +-
 kernel/sys.c                                  |  12 +-
 security/selinux/include/classmap.h           |   5 +-
 tools/testing/selftests/clone3/Makefile       |   4 +-
 .../clone3/clone3_cap_checkpoint_restore.c    | 203 ++++++++++++++++++
 9 files changed, 236 insertions(+), 15 deletions(-)
 create mode 100644 tools/testing/selftests/clone3/clone3_cap_checkpoint_restore.c


base-commit: d31958b30ea3b7b6e522d6bf449427748ad45822
-- 
2.26.2



* [PATCH v5 1/6] capabilities: Introduce CAP_CHECKPOINT_RESTORE
  2020-07-15 14:49 [PATCH v5 0/6] capabilities: Introduce CAP_CHECKPOINT_RESTORE Adrian Reber
@ 2020-07-15 14:49 ` Adrian Reber
  2020-07-15 15:06   ` Christian Brauner
  2020-07-15 14:49 ` [PATCH v5 2/6] pid: use checkpoint_restore_ns_capable() for set_tid Adrian Reber
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 17+ messages in thread
From: Adrian Reber @ 2020-07-15 14:49 UTC (permalink / raw)
  To: Christian Brauner, Eric Biederman, Pavel Emelyanov,
	Oleg Nesterov, Dmitry Safonov, Andrei Vagin, Nicolas Viennot,
	Michał Cłapiński, Kamil Yurtsever, Dirk Petersen,
	Christine Flood, Casey Schaufler
  Cc: Mike Rapoport, Radostin Stoyanov, Adrian Reber, Cyrill Gorcunov,
	Serge Hallyn, Stephen Smalley, Sargun Dhillon, Arnd Bergmann,
	linux-security-module, linux-kernel, selinux, Eric Paris,
	Jann Horn, linux-fsdevel

This patch introduces CAP_CHECKPOINT_RESTORE, a new capability facilitating
checkpoint/restore for non-root users.

Over the last few years, the CRIU (Checkpoint/Restore In Userspace) team has
been asked numerous times if it is possible to checkpoint/restore a process as
non-root. The answer usually was: 'almost'.

The main blocker to restoring a process as non-root was controlling the PID of
the restored process. This feature, available via the clone3 system call or via
/proc/sys/kernel/ns_last_pid, is unfortunately guarded by CAP_SYS_ADMIN.

In the past two years, requests for non-root checkpoint/restore have increased
due to the following use cases:
* Checkpoint/Restore in an HPC environment in combination with a resource
  manager distributing jobs where users are always running as non-root.
  There is a desire to provide a way to checkpoint and restore long-running
  jobs.
* Container migration as non-root
* We have been in contact with JVM developers who are integrating
  CRIU into a Java VM to decrease the startup time. These checkpoint/restore
  applications are not meant to be running with CAP_SYS_ADMIN.

We have seen the following workarounds:
* Use a setuid wrapper around CRIU:
  See https://github.com/FredHutch/slurm-examples/blob/master/checkpointer/lib/checkpointer/checkpointer-suid.c
* Use a setuid helper that writes to ns_last_pid.
  Unfortunately, this helper delegation technique is impossible to use with
  clone3, and is thus prone to races.
  See https://github.com/twosigma/set_ns_last_pid
* Cycle through PIDs with fork() until the desired PID is reached.
  This has been demonstrated to work with cycling rates of 100,000 PIDs/s.
  See https://github.com/twosigma/set_ns_last_pid
* Patch out the CAP_SYS_ADMIN check from the kernel
* Run the desired application in a new user and PID namespace to provide
  a local CAP_SYS_ADMIN for controlling PIDs. This technique has limited use in
  typical container environments (e.g., Kubernetes) as /proc is
  typically protected with read-only layers (e.g., /proc/sys) for hardening
  purposes. Read-only layers prevent additional /proc mounts (due to proc's
  SB_I_USERNS_VISIBLE property), making the use of new PID namespaces limited as
  certain applications need access to /proc matching their PID namespace.

The introduced capability allows a user to:
* Control PIDs when the current user is CAP_CHECKPOINT_RESTORE capable
  for the corresponding PID namespace via ns_last_pid/clone3.
* Open files in /proc/pid/map_files when the current user is
  CAP_CHECKPOINT_RESTORE capable in the root namespace, useful for recovering
  files that are unreachable via the file system such as deleted files or memfd
  files.

See corresponding selftest for an example with clone3().
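
As a rough userspace illustration of the capability number introduced here
(not part of the patch): a restore tool could check whether it currently holds
CAP_CHECKPOINT_RESTORE by testing bit 40 of the CapEff mask reported in
/proc/self/status. The function names are hypothetical, and the fallback
#define only assumes the value 40 from this patch.

```c
#include <stdbool.h>
#include <stdio.h>

/* Value taken from this patch; older headers may not define it yet. */
#ifndef CAP_CHECKPOINT_RESTORE
#define CAP_CHECKPOINT_RESTORE 40
#endif

/* Test one capability bit in a 64-bit mask such as CapEff. */
bool cap_mask_has(unsigned long long mask, int cap)
{
	return cap >= 0 && cap < 64 && ((mask >> cap) & 1ULL);
}

/* Parse the "CapEff:" line of /proc/self/status; returns 0 on success. */
int read_effective_caps(unsigned long long *mask)
{
	char line[256];
	int ret = -1;
	FILE *f = fopen("/proc/self/status", "r");

	if (!f)
		return -1;
	while (fgets(line, sizeof(line), f)) {
		/* The space in the format also matches the tab after the colon. */
		if (sscanf(line, "CapEff: %llx", mask) == 1) {
			ret = 0;
			break;
		}
	}
	fclose(f);
	return ret;
}
```

A caller would first fill a mask with read_effective_caps() and then pass it
to cap_mask_has(mask, CAP_CHECKPOINT_RESTORE). This only reports what the
process holds. Whether an operation is permitted is still decided by the
kernel-side checks in this series.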

Signed-off-by: Adrian Reber <areber@redhat.com>
Signed-off-by: Nicolas Viennot <Nicolas.Viennot@twosigma.com>
---
 include/linux/capability.h          | 6 ++++++
 include/uapi/linux/capability.h     | 9 ++++++++-
 security/selinux/include/classmap.h | 5 +++--
 3 files changed, 17 insertions(+), 3 deletions(-)

diff --git a/include/linux/capability.h b/include/linux/capability.h
index b4345b38a6be..1e7fe311cabe 100644
--- a/include/linux/capability.h
+++ b/include/linux/capability.h
@@ -261,6 +261,12 @@ static inline bool bpf_capable(void)
 	return capable(CAP_BPF) || capable(CAP_SYS_ADMIN);
 }
 
+static inline bool checkpoint_restore_ns_capable(struct user_namespace *ns)
+{
+	return ns_capable(ns, CAP_CHECKPOINT_RESTORE) ||
+		ns_capable(ns, CAP_SYS_ADMIN);
+}
+
 /* audit system wants to get cap info from files as well */
 extern int get_vfs_caps_from_disk(const struct dentry *dentry, struct cpu_vfs_cap_data *cpu_caps);
 
diff --git a/include/uapi/linux/capability.h b/include/uapi/linux/capability.h
index 48ff0757ae5e..395dd0df8d08 100644
--- a/include/uapi/linux/capability.h
+++ b/include/uapi/linux/capability.h
@@ -408,7 +408,14 @@ struct vfs_ns_cap_data {
  */
 #define CAP_BPF			39
 
-#define CAP_LAST_CAP         CAP_BPF
+
+/* Allow checkpoint/restore related operations */
+/* Allow PID selection during clone3() */
+/* Allow writing to ns_last_pid */
+
+#define CAP_CHECKPOINT_RESTORE	40
+
+#define CAP_LAST_CAP         CAP_CHECKPOINT_RESTORE
 
 #define cap_valid(x) ((x) >= 0 && (x) <= CAP_LAST_CAP)
 
diff --git a/security/selinux/include/classmap.h b/security/selinux/include/classmap.h
index e54d62d529f1..ba2e01a6955c 100644
--- a/security/selinux/include/classmap.h
+++ b/security/selinux/include/classmap.h
@@ -27,9 +27,10 @@
 	    "audit_control", "setfcap"
 
 #define COMMON_CAP2_PERMS  "mac_override", "mac_admin", "syslog", \
-		"wake_alarm", "block_suspend", "audit_read", "perfmon", "bpf"
+		"wake_alarm", "block_suspend", "audit_read", "perfmon", "bpf", \
+		"checkpoint_restore"
 
-#if CAP_LAST_CAP > CAP_BPF
+#if CAP_LAST_CAP > CAP_CHECKPOINT_RESTORE
 #error New capability defined, please update COMMON_CAP2_PERMS.
 #endif
 
-- 
2.26.2



* [PATCH v5 2/6] pid: use checkpoint_restore_ns_capable() for set_tid
  2020-07-15 14:49 [PATCH v5 0/6] capabilities: Introduce CAP_CHECKPOINT_RESTORE Adrian Reber
  2020-07-15 14:49 ` [PATCH v5 1/6] " Adrian Reber
@ 2020-07-15 14:49 ` Adrian Reber
  2020-07-15 15:08   ` Christian Brauner
  2020-07-15 14:49 ` [PATCH v5 3/6] pid_namespace: use checkpoint_restore_ns_capable() for ns_last_pid Adrian Reber
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 17+ messages in thread
From: Adrian Reber @ 2020-07-15 14:49 UTC (permalink / raw)
  To: Christian Brauner, Eric Biederman, Pavel Emelyanov,
	Oleg Nesterov, Dmitry Safonov, Andrei Vagin, Nicolas Viennot,
	Michał Cłapiński, Kamil Yurtsever, Dirk Petersen,
	Christine Flood, Casey Schaufler
  Cc: Mike Rapoport, Radostin Stoyanov, Adrian Reber, Cyrill Gorcunov,
	Serge Hallyn, Stephen Smalley, Sargun Dhillon, Arnd Bergmann,
	linux-security-module, linux-kernel, selinux, Eric Paris,
	Jann Horn, linux-fsdevel

Use the newly introduced capability CAP_CHECKPOINT_RESTORE to allow
using clone3() with set_tid set.

Signed-off-by: Adrian Reber <areber@redhat.com>
Signed-off-by: Nicolas Viennot <Nicolas.Viennot@twosigma.com>
---
 kernel/pid.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/pid.c b/kernel/pid.c
index de9d29c41d77..a9cbab0194d9 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -199,7 +199,7 @@ struct pid *alloc_pid(struct pid_namespace *ns, pid_t *set_tid,
 			if (tid != 1 && !tmp->child_reaper)
 				goto out_free;
 			retval = -EPERM;
-			if (!ns_capable(tmp->user_ns, CAP_SYS_ADMIN))
+			if (!checkpoint_restore_ns_capable(tmp->user_ns))
 				goto out_free;
 			set_tid_size--;
 		}
-- 
2.26.2



* [PATCH v5 3/6] pid_namespace: use checkpoint_restore_ns_capable() for ns_last_pid
  2020-07-15 14:49 [PATCH v5 0/6] capabilities: Introduce CAP_CHECKPOINT_RESTORE Adrian Reber
  2020-07-15 14:49 ` [PATCH v5 1/6] " Adrian Reber
  2020-07-15 14:49 ` [PATCH v5 2/6] pid: use checkpoint_restore_ns_capable() for set_tid Adrian Reber
@ 2020-07-15 14:49 ` Adrian Reber
  2020-07-15 15:08   ` Christian Brauner
  2020-07-15 14:49 ` [PATCH v5 4/6] proc: allow access in init userns for map_files with CAP_CHECKPOINT_RESTORE Adrian Reber
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 17+ messages in thread
From: Adrian Reber @ 2020-07-15 14:49 UTC (permalink / raw)
  To: Christian Brauner, Eric Biederman, Pavel Emelyanov,
	Oleg Nesterov, Dmitry Safonov, Andrei Vagin, Nicolas Viennot,
	Michał Cłapiński, Kamil Yurtsever, Dirk Petersen,
	Christine Flood, Casey Schaufler
  Cc: Mike Rapoport, Radostin Stoyanov, Adrian Reber, Cyrill Gorcunov,
	Serge Hallyn, Stephen Smalley, Sargun Dhillon, Arnd Bergmann,
	linux-security-module, linux-kernel, selinux, Eric Paris,
	Jann Horn, linux-fsdevel

Use the newly introduced capability CAP_CHECKPOINT_RESTORE to allow
writing to ns_last_pid.
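
For illustration only, a minimal sketch of the userspace side. It assumes the
usual convention that the next PID handed out is ns_last_pid + 1, so the
caller writes the desired PID minus one and then forks. request_next_pid() is
a hypothetical name, not something this patch adds.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Ask the kernel to hand out `wanted` as the next PID in the caller's PID
 * namespace by writing wanted - 1 to ns_last_pid.
 * Returns 0 on success or a negative errno value.
 */
int request_next_pid(pid_t wanted)
{
	char buf[32];
	ssize_t n;
	int fd, len, ret = 0;

	if (wanted < 1)
		return -EINVAL;

	fd = open("/proc/sys/kernel/ns_last_pid", O_WRONLY);
	if (fd < 0)
		return -errno;

	len = snprintf(buf, sizeof(buf), "%d", (int)(wanted - 1));
	n = write(fd, buf, len);
	if (n < 0)
		ret = -errno;	/* e.g. -EPERM without the capability */
	else if (n != len)
		ret = -EIO;	/* short write */
	close(fd);
	return ret;
}
```

Another process can still fork between this write and the caller's own
fork(), so the result has to be verified afterwards. clone3() with set_tid
avoids that window.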

Signed-off-by: Adrian Reber <areber@redhat.com>
Signed-off-by: Nicolas Viennot <Nicolas.Viennot@twosigma.com>
---
 kernel/pid_namespace.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
index 0e5ac162c3a8..ac135bd600eb 100644
--- a/kernel/pid_namespace.c
+++ b/kernel/pid_namespace.c
@@ -269,7 +269,7 @@ static int pid_ns_ctl_handler(struct ctl_table *table, int write,
 	struct ctl_table tmp = *table;
 	int ret, next;
 
-	if (write && !ns_capable(pid_ns->user_ns, CAP_SYS_ADMIN))
+	if (write && !checkpoint_restore_ns_capable(pid_ns->user_ns))
 		return -EPERM;
 
 	/*
-- 
2.26.2



* [PATCH v5 4/6] proc: allow access in init userns for map_files with CAP_CHECKPOINT_RESTORE
  2020-07-15 14:49 [PATCH v5 0/6] capabilities: Introduce CAP_CHECKPOINT_RESTORE Adrian Reber
                   ` (2 preceding siblings ...)
  2020-07-15 14:49 ` [PATCH v5 3/6] pid_namespace: use checkpoint_restore_ns_capable() for ns_last_pid Adrian Reber
@ 2020-07-15 14:49 ` Adrian Reber
  2020-07-15 21:17   ` Cyrill Gorcunov
  2020-07-16  8:51   ` Christian Brauner
  2020-07-15 14:49 ` [PATCH v5 5/6] prctl: Allow checkpoint/restore capable processes to change exe link Adrian Reber
                   ` (2 subsequent siblings)
  6 siblings, 2 replies; 17+ messages in thread
From: Adrian Reber @ 2020-07-15 14:49 UTC (permalink / raw)
  To: Christian Brauner, Eric Biederman, Pavel Emelyanov,
	Oleg Nesterov, Dmitry Safonov, Andrei Vagin, Nicolas Viennot,
	Michał Cłapiński, Kamil Yurtsever, Dirk Petersen,
	Christine Flood, Casey Schaufler
  Cc: Mike Rapoport, Radostin Stoyanov, Adrian Reber, Cyrill Gorcunov,
	Serge Hallyn, Stephen Smalley, Sargun Dhillon, Arnd Bergmann,
	linux-security-module, linux-kernel, selinux, Eric Paris,
	Jann Horn, linux-fsdevel

Opening files in /proc/pid/map_files when the current user is
CAP_CHECKPOINT_RESTORE capable in the root namespace is useful for
checkpointing and restoring to recover files that are unreachable via
the file system such as deleted files, or memfd files.
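
A small sketch of how a checkpointing tool might name the entry it wants to
open. Entries in map_files are symlinks named after the mapping's start and
end addresses in hexadecimal. map_files_path() is a hypothetical helper for
illustration and is not part of this patch.

```c
#include <stdio.h>
#include <sys/types.h>

/*
 * Build the map_files entry name for the mapping [start, end) of `pid`.
 * Returns the path length, or -1 if `buf` is too small.
 */
int map_files_path(char *buf, size_t size, pid_t pid,
		   unsigned long start, unsigned long end)
{
	int n = snprintf(buf, size, "/proc/%d/map_files/%lx-%lx",
			 (int)pid, start, end);

	return (n < 0 || (size_t)n >= size) ? -1 : n;
}
```

Following that symlink, for example with open(), is the step guarded by the
capability check in proc_map_files_get_link().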

Signed-off-by: Adrian Reber <areber@redhat.com>
Signed-off-by: Nicolas Viennot <Nicolas.Viennot@twosigma.com>
---
 fs/proc/base.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index 65893686d1f1..cada783f229e 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -2194,16 +2194,16 @@ struct map_files_info {
 };
 
 /*
- * Only allow CAP_SYS_ADMIN to follow the links, due to concerns about how the
- * symlinks may be used to bypass permissions on ancestor directories in the
- * path to the file in question.
+ * Only allow CAP_SYS_ADMIN and CAP_CHECKPOINT_RESTORE to follow the links, due
+ * to concerns about how the symlinks may be used to bypass permissions on
+ * ancestor directories in the path to the file in question.
  */
 static const char *
 proc_map_files_get_link(struct dentry *dentry,
 			struct inode *inode,
 		        struct delayed_call *done)
 {
-	if (!capable(CAP_SYS_ADMIN))
+	if (!capable(CAP_SYS_ADMIN) && !capable(CAP_CHECKPOINT_RESTORE))
 		return ERR_PTR(-EPERM);
 
 	return proc_pid_get_link(dentry, inode, done);
-- 
2.26.2



* [PATCH v5 5/6] prctl: Allow checkpoint/restore capable processes to change exe link
  2020-07-15 14:49 [PATCH v5 0/6] capabilities: Introduce CAP_CHECKPOINT_RESTORE Adrian Reber
                   ` (3 preceding siblings ...)
  2020-07-15 14:49 ` [PATCH v5 4/6] proc: allow access in init userns for map_files with CAP_CHECKPOINT_RESTORE Adrian Reber
@ 2020-07-15 14:49 ` Adrian Reber
  2020-07-15 15:20   ` Christian Brauner
  2020-07-15 14:49 ` [PATCH v5 6/6] selftests: add clone3() CAP_CHECKPOINT_RESTORE test Adrian Reber
  2020-07-18  3:24 ` [PATCH v5 0/6] capabilities: Introduce CAP_CHECKPOINT_RESTORE Serge E. Hallyn
  6 siblings, 1 reply; 17+ messages in thread
From: Adrian Reber @ 2020-07-15 14:49 UTC (permalink / raw)
  To: Christian Brauner, Eric Biederman, Pavel Emelyanov,
	Oleg Nesterov, Dmitry Safonov, Andrei Vagin, Nicolas Viennot,
	Michał Cłapiński, Kamil Yurtsever, Dirk Petersen,
	Christine Flood, Casey Schaufler
  Cc: Mike Rapoport, Radostin Stoyanov, Adrian Reber, Cyrill Gorcunov,
	Serge Hallyn, Stephen Smalley, Sargun Dhillon, Arnd Bergmann,
	linux-security-module, linux-kernel, selinux, Eric Paris,
	Jann Horn, linux-fsdevel

From: Nicolas Viennot <Nicolas.Viennot@twosigma.com>

Allow CAP_CHECKPOINT_RESTORE capable users to change /proc/self/exe.

This commit also changes the permission error code from -EINVAL to
-EPERM for consistency with the rest of the prctl() syscall when
checking capabilities.

Signed-off-by: Nicolas Viennot <Nicolas.Viennot@twosigma.com>
Signed-off-by: Adrian Reber <areber@redhat.com>
---
 kernel/sys.c | 12 +++++++-----
 1 file changed, 7 insertions(+), 5 deletions(-)

diff --git a/kernel/sys.c b/kernel/sys.c
index 00a96746e28a..dd59b9142b1d 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -2007,12 +2007,14 @@ static int prctl_set_mm_map(int opt, const void __user *addr, unsigned long data
 
 	if (prctl_map.exe_fd != (u32)-1) {
 		/*
-		 * Make sure the caller has the rights to
-		 * change /proc/pid/exe link: only local sys admin should
-		 * be allowed to.
+		 * Check if the current user is checkpoint/restore capable.
+		 * At the time of this writing, it checks for CAP_SYS_ADMIN
+		 * or CAP_CHECKPOINT_RESTORE.
+		 * Note that a user with access to ptrace can masquerade an
+		 * arbitrary program as any executable, even setuid ones.
 		 */
-		if (!ns_capable(current_user_ns(), CAP_SYS_ADMIN))
-			return -EINVAL;
+		if (!checkpoint_restore_ns_capable(current_user_ns()))
+			return -EPERM;
 
 		error = prctl_set_mm_exe_file(mm, prctl_map.exe_fd);
 		if (error)
-- 
2.26.2



* [PATCH v5 6/6] selftests: add clone3() CAP_CHECKPOINT_RESTORE test
  2020-07-15 14:49 [PATCH v5 0/6] capabilities: Introduce CAP_CHECKPOINT_RESTORE Adrian Reber
                   ` (4 preceding siblings ...)
  2020-07-15 14:49 ` [PATCH v5 5/6] prctl: Allow checkpoint/restore capable processes to change exe link Adrian Reber
@ 2020-07-15 14:49 ` Adrian Reber
  2020-07-15 15:24   ` Christian Brauner
  2020-07-18  3:24 ` [PATCH v5 0/6] capabilities: Introduce CAP_CHECKPOINT_RESTORE Serge E. Hallyn
  6 siblings, 1 reply; 17+ messages in thread
From: Adrian Reber @ 2020-07-15 14:49 UTC (permalink / raw)
  To: Christian Brauner, Eric Biederman, Pavel Emelyanov,
	Oleg Nesterov, Dmitry Safonov, Andrei Vagin, Nicolas Viennot,
	Michał Cłapiński, Kamil Yurtsever, Dirk Petersen,
	Christine Flood, Casey Schaufler
  Cc: Mike Rapoport, Radostin Stoyanov, Adrian Reber, Cyrill Gorcunov,
	Serge Hallyn, Stephen Smalley, Sargun Dhillon, Arnd Bergmann,
	linux-security-module, linux-kernel, selinux, Eric Paris,
	Jann Horn, linux-fsdevel

This adds a test that changes its UID, uses capabilities to
get CAP_CHECKPOINT_RESTORE and uses clone3() with set_tid to
create a process with a given PID as non-root.
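
The test sets the new capability by hand in the second 32-bit word of the
capability data instead of going through libcap's named capabilities. The
arithmetic behind "1 << (40 - 32)" can be sketched as below. The helper names
are made up for this note and mirror the idea of the CAP_TO_INDEX() and
CAP_TO_MASK() macros in the uapi header.

```c
#include <stdint.h>

/* Value taken from this series. */
#define SKETCH_CAP_CHECKPOINT_RESTORE 40

/* Index of the 32-bit word that holds capability `cap`. */
unsigned int cap_word(unsigned int cap)
{
	return cap >> 5;
}

/* Bit mask for capability `cap` inside its 32-bit word. */
uint32_t cap_bit(unsigned int cap)
{
	return (uint32_t)1 << (cap & 31);
}
```

For capability 40 this gives word 1 and bit 8, which is the same value the
test ORs into data[1].effective and data[1].permitted.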

Signed-off-by: Adrian Reber <areber@redhat.com>
Acked-by: Serge Hallyn <serge@hallyn.com>
---
 tools/testing/selftests/clone3/Makefile       |   4 +-
 .../clone3/clone3_cap_checkpoint_restore.c    | 203 ++++++++++++++++++
 2 files changed, 206 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/clone3/clone3_cap_checkpoint_restore.c

diff --git a/tools/testing/selftests/clone3/Makefile b/tools/testing/selftests/clone3/Makefile
index cf976c732906..ef7564cb7abe 100644
--- a/tools/testing/selftests/clone3/Makefile
+++ b/tools/testing/selftests/clone3/Makefile
@@ -1,6 +1,8 @@
 # SPDX-License-Identifier: GPL-2.0
 CFLAGS += -g -I../../../../usr/include/
+LDLIBS += -lcap
 
-TEST_GEN_PROGS := clone3 clone3_clear_sighand clone3_set_tid
+TEST_GEN_PROGS := clone3 clone3_clear_sighand clone3_set_tid \
+	clone3_cap_checkpoint_restore
 
 include ../lib.mk
diff --git a/tools/testing/selftests/clone3/clone3_cap_checkpoint_restore.c b/tools/testing/selftests/clone3/clone3_cap_checkpoint_restore.c
new file mode 100644
index 000000000000..2cc3d57b91f2
--- /dev/null
+++ b/tools/testing/selftests/clone3/clone3_cap_checkpoint_restore.c
@@ -0,0 +1,203 @@
+// SPDX-License-Identifier: GPL-2.0
+
+/*
+ * Based on Christian Brauner's clone3() example.
+ * These tests assume that they are running in the host's
+ * PID namespace.
+ */
+
+/* capabilities related code based on selftests/bpf/test_verifier.c */
+
+#define _GNU_SOURCE
+#include <errno.h>
+#include <linux/types.h>
+#include <linux/sched.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <stdbool.h>
+#include <sys/capability.h>
+#include <sys/prctl.h>
+#include <sys/syscall.h>
+#include <sys/types.h>
+#include <sys/un.h>
+#include <sys/wait.h>
+#include <unistd.h>
+#include <sched.h>
+
+#include "../kselftest.h"
+#include "clone3_selftests.h"
+
+#ifndef MAX_PID_NS_LEVEL
+#define MAX_PID_NS_LEVEL 32
+#endif
+
+static void child_exit(int ret)
+{
+	fflush(stdout);
+	fflush(stderr);
+	_exit(ret);
+}
+
+static int call_clone3_set_tid(pid_t *set_tid, size_t set_tid_size)
+{
+	int status;
+	pid_t pid = -1;
+
+	struct clone_args args = {
+		.exit_signal = SIGCHLD,
+		.set_tid = ptr_to_u64(set_tid),
+		.set_tid_size = set_tid_size,
+	};
+
+	pid = sys_clone3(&args, sizeof(struct clone_args));
+	if (pid < 0) {
+		ksft_print_msg("%s - Failed to create new process\n",
+			       strerror(errno));
+		return -errno;
+	}
+
+	if (pid == 0) {
+		int ret;
+		char tmp = 0;
+
+		ksft_print_msg
+		    ("I am the child, my PID is %d (expected %d)\n",
+		     getpid(), set_tid[0]);
+
+		if (set_tid[0] != getpid())
+			child_exit(EXIT_FAILURE);
+		child_exit(EXIT_SUCCESS);
+	}
+
+	ksft_print_msg("I am the parent (%d). My child's pid is %d\n",
+		       getpid(), pid);
+
+	if (waitpid(pid, &status, 0) < 0) {
+		ksft_print_msg("waitpid() failed: %s\n", strerror(errno));
+		return -errno;
+	}
+
+	if (!WIFEXITED(status))
+		return -1;
+
+	return WEXITSTATUS(status);
+}
+
+static int test_clone3_set_tid(pid_t *set_tid,
+			       size_t set_tid_size, int expected)
+{
+	int ret;
+
+	ksft_print_msg("[%d] Trying clone3() with set_tid set to %d\n",
+		       getpid(), set_tid[0]);
+	ret = call_clone3_set_tid(set_tid, set_tid_size);
+
+	ksft_print_msg
+	    ("[%d] clone3() with set_tid %d says: %d - expected %d\n",
+	     getpid(), set_tid[0], ret, expected);
+	if (ret != expected) {
+		ksft_test_result_fail
+		    ("[%d] Result (%d) is different than expected (%d)\n",
+		     getpid(), ret, expected);
+		return -1;
+	}
+	ksft_test_result_pass
+	    ("[%d] Result (%d) matches expectation (%d)\n", getpid(), ret,
+	     expected);
+
+	return 0;
+}
+
+struct libcap {
+	struct __user_cap_header_struct hdr;
+	struct __user_cap_data_struct data[2];
+};
+
+static int set_capability(void)
+{
+	cap_value_t cap_values[] = { CAP_SETUID, CAP_SETGID };
+	struct libcap *cap;
+	int ret = -1;
+	cap_t caps;
+
+	caps = cap_get_proc();
+	if (!caps) {
+		perror("cap_get_proc");
+		return -1;
+	}
+
+	/* Drop all capabilities */
+	if (cap_clear(caps)) {
+		perror("cap_clear");
+		goto out;
+	}
+
+	cap_set_flag(caps, CAP_EFFECTIVE, 2, cap_values, CAP_SET);
+	cap_set_flag(caps, CAP_PERMITTED, 2, cap_values, CAP_SET);
+
+	cap = (struct libcap *) caps;
+
+	/* 40 -> CAP_CHECKPOINT_RESTORE */
+	cap->data[1].effective |= 1 << (40 - 32);
+	cap->data[1].permitted |= 1 << (40 - 32);
+
+	if (cap_set_proc(caps)) {
+		perror("cap_set_proc");
+		goto out;
+	}
+	ret = 0;
+out:
+	if (cap_free(caps))
+		perror("cap_free");
+	return ret;
+}
+
+int main(int argc, char *argv[])
+{
+	pid_t pid;
+	int status;
+	int ret = 0;
+	pid_t set_tid[1];
+	uid_t uid = getuid();
+
+	ksft_print_header();
+	test_clone3_supported();
+	ksft_set_plan(2);
+
+	if (uid != 0) {
+		ksft_cnt.ksft_xskip = ksft_plan;
+		ksft_print_msg("Skipping all tests as non-root\n");
+		return ksft_exit_pass();
+	}
+
+	memset(&set_tid, 0, sizeof(set_tid));
+
+	/* Find the current active PID */
+	pid = fork();
+	if (pid == 0) {
+		ksft_print_msg("Child has PID %d\n", getpid());
+		child_exit(EXIT_SUCCESS);
+	}
+	if (waitpid(pid, &status, 0) < 0)
+		ksft_exit_fail_msg("Waiting for child %d failed\n", pid);
+
+	/* After the child has finished, its PID should be free. */
+	set_tid[0] = pid;
+
+	if (set_capability())
+		ksft_test_result_fail
+		    ("Could not set CAP_CHECKPOINT_RESTORE\n");
+	prctl(PR_SET_KEEPCAPS, 1, 0, 0, 0);
+	/* Dropping root clears the effective caps: expect -EPERM */
+	setgid(1000);
+	setuid(1000);
+	set_tid[0] = pid;
+	ret |= test_clone3_set_tid(set_tid, 1, -EPERM);
+	if (set_capability())
+		ksft_test_result_fail
+		    ("Could not set CAP_CHECKPOINT_RESTORE\n");
+	/* This should work as we have CAP_CHECKPOINT_RESTORE as non-root */
+	ret |= test_clone3_set_tid(set_tid, 1, 0);
+
+	return !ret ? ksft_exit_pass() : ksft_exit_fail();
+}
-- 
2.26.2



* Re: [PATCH v5 1/6] capabilities: Introduce CAP_CHECKPOINT_RESTORE
  2020-07-15 14:49 ` [PATCH v5 1/6] " Adrian Reber
@ 2020-07-15 15:06   ` Christian Brauner
  0 siblings, 0 replies; 17+ messages in thread
From: Christian Brauner @ 2020-07-15 15:06 UTC (permalink / raw)
  To: Adrian Reber
  Cc: Eric Biederman, Pavel Emelyanov, Oleg Nesterov, Dmitry Safonov,
	Andrei Vagin, Nicolas Viennot, Michał Cłapiński,
	Kamil Yurtsever, Dirk Petersen, Christine Flood, Casey Schaufler,
	Mike Rapoport, Radostin Stoyanov, Cyrill Gorcunov, Serge Hallyn,
	Stephen Smalley, Sargun Dhillon, Arnd Bergmann,
	linux-security-module, linux-kernel, selinux, Eric Paris,
	Jann Horn, linux-fsdevel

On Wed, Jul 15, 2020 at 04:49:49PM +0200, Adrian Reber wrote:
> This patch introduces CAP_CHECKPOINT_RESTORE, a new capability facilitating
> checkpoint/restore for non-root users.
> 
> Over the last few years, the CRIU (Checkpoint/Restore In Userspace) team has
> been asked numerous times if it is possible to checkpoint/restore a process as
> non-root. The answer usually was: 'almost'.
> 
> The main blocker to restoring a process as non-root was controlling the PID of
> the restored process. This feature, available via the clone3 system call or via
> /proc/sys/kernel/ns_last_pid, is unfortunately guarded by CAP_SYS_ADMIN.
> 
> In the past two years, requests for non-root checkpoint/restore have increased
> due to the following use cases:
> * Checkpoint/Restore in an HPC environment in combination with a resource
>   manager distributing jobs where users are always running as non-root.
>   There is a desire to provide a way to checkpoint and restore long-running
>   jobs.
> * Container migration as non-root
> * We have been in contact with JVM developers who are integrating
>   CRIU into a Java VM to decrease the startup time. These checkpoint/restore
>   applications are not meant to be running with CAP_SYS_ADMIN.
> 
> We have seen the following workarounds:
> * Use a setuid wrapper around CRIU:
>   See https://github.com/FredHutch/slurm-examples/blob/master/checkpointer/lib/checkpointer/checkpointer-suid.c
> * Use a setuid helper that writes to ns_last_pid.
>   Unfortunately, this helper delegation technique is impossible to use with
>   clone3, and is thus prone to races.
>   See https://github.com/twosigma/set_ns_last_pid
> * Cycle through PIDs with fork() until the desired PID is reached.
>   This has been demonstrated to work with cycling rates of 100,000 PIDs/s.
>   See https://github.com/twosigma/set_ns_last_pid
> * Patch out the CAP_SYS_ADMIN check from the kernel
> * Run the desired application in a new user and PID namespace to provide
>   a local CAP_SYS_ADMIN for controlling PIDs. This technique has limited use in
>   typical container environments (e.g., Kubernetes) as /proc is
>   typically protected with read-only layers (e.g., /proc/sys) for hardening
>   purposes. Read-only layers prevent additional /proc mounts (due to proc's
>   SB_I_USERNS_VISIBLE property), making the use of new PID namespaces limited as
>   certain applications need access to /proc matching their PID namespace.
> 
> The introduced capability allows a user to:
> * Control PIDs when the current user is CAP_CHECKPOINT_RESTORE capable
>   for the corresponding PID namespace via ns_last_pid/clone3.
> * Open files in /proc/pid/map_files when the current user is
>   CAP_CHECKPOINT_RESTORE capable in the root namespace, useful for recovering
>   files that are unreachable via the file system such as deleted files or memfd
>   files.
> 
> See corresponding selftest for an example with clone3().
> 
> Signed-off-by: Adrian Reber <areber@redhat.com>
> Signed-off-by: Nicolas Viennot <Nicolas.Viennot@twosigma.com>
> ---

Thanks!
This looks good now.
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>

>  include/linux/capability.h          | 6 ++++++
>  include/uapi/linux/capability.h     | 9 ++++++++-
>  security/selinux/include/classmap.h | 5 +++--
>  3 files changed, 17 insertions(+), 3 deletions(-)
> 
> diff --git a/include/linux/capability.h b/include/linux/capability.h
> index b4345b38a6be..1e7fe311cabe 100644
> --- a/include/linux/capability.h
> +++ b/include/linux/capability.h
> @@ -261,6 +261,12 @@ static inline bool bpf_capable(void)
>  	return capable(CAP_BPF) || capable(CAP_SYS_ADMIN);
>  }
>  
> +static inline bool checkpoint_restore_ns_capable(struct user_namespace *ns)
> +{
> +	return ns_capable(ns, CAP_CHECKPOINT_RESTORE) ||
> +		ns_capable(ns, CAP_SYS_ADMIN);
> +}
> +
>  /* audit system wants to get cap info from files as well */
>  extern int get_vfs_caps_from_disk(const struct dentry *dentry, struct cpu_vfs_cap_data *cpu_caps);
>  
> diff --git a/include/uapi/linux/capability.h b/include/uapi/linux/capability.h
> index 48ff0757ae5e..395dd0df8d08 100644
> --- a/include/uapi/linux/capability.h
> +++ b/include/uapi/linux/capability.h
> @@ -408,7 +408,14 @@ struct vfs_ns_cap_data {
>   */
>  #define CAP_BPF			39
>  
> -#define CAP_LAST_CAP         CAP_BPF
> +
> +/* Allow checkpoint/restore related operations */
> +/* Allow PID selection during clone3() */
> +/* Allow writing to ns_last_pid */
> +
> +#define CAP_CHECKPOINT_RESTORE	40
> +
> +#define CAP_LAST_CAP         CAP_CHECKPOINT_RESTORE
>  
>  #define cap_valid(x) ((x) >= 0 && (x) <= CAP_LAST_CAP)
>  
> diff --git a/security/selinux/include/classmap.h b/security/selinux/include/classmap.h
> index e54d62d529f1..ba2e01a6955c 100644
> --- a/security/selinux/include/classmap.h
> +++ b/security/selinux/include/classmap.h
> @@ -27,9 +27,10 @@
>  	    "audit_control", "setfcap"
>  
>  #define COMMON_CAP2_PERMS  "mac_override", "mac_admin", "syslog", \
> -		"wake_alarm", "block_suspend", "audit_read", "perfmon", "bpf"
> +		"wake_alarm", "block_suspend", "audit_read", "perfmon", "bpf", \
> +		"checkpoint_restore"
>  
> -#if CAP_LAST_CAP > CAP_BPF
> +#if CAP_LAST_CAP > CAP_CHECKPOINT_RESTORE
>  #error New capability defined, please update COMMON_CAP2_PERMS.
>  #endif
>  
> -- 
> 2.26.2
> 


* Re: [PATCH v5 2/6] pid: use checkpoint_restore_ns_capable() for set_tid
  2020-07-15 14:49 ` [PATCH v5 2/6] pid: use checkpoint_restore_ns_capable() for set_tid Adrian Reber
@ 2020-07-15 15:08   ` Christian Brauner
  0 siblings, 0 replies; 17+ messages in thread
From: Christian Brauner @ 2020-07-15 15:08 UTC (permalink / raw)
  To: Adrian Reber
  Cc: Eric Biederman, Pavel Emelyanov, Oleg Nesterov, Dmitry Safonov,
	Andrei Vagin, Nicolas Viennot, Michał Cłapiński,
	Kamil Yurtsever, Dirk Petersen, Christine Flood, Casey Schaufler,
	Mike Rapoport, Radostin Stoyanov, Cyrill Gorcunov, Serge Hallyn,
	Stephen Smalley, Sargun Dhillon, Arnd Bergmann,
	linux-security-module, linux-kernel, selinux, Eric Paris,
	Jann Horn, linux-fsdevel

On Wed, Jul 15, 2020 at 04:49:50PM +0200, Adrian Reber wrote:
> Use the newly introduced capability CAP_CHECKPOINT_RESTORE to allow
> using clone3() with set_tid set.
> 
> Signed-off-by: Adrian Reber <areber@redhat.com>
> Signed-off-by: Nicolas Viennot <Nicolas.Viennot@twosigma.com>
> ---

Looks good!
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>

>  kernel/pid.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/kernel/pid.c b/kernel/pid.c
> index de9d29c41d77..a9cbab0194d9 100644
> --- a/kernel/pid.c
> +++ b/kernel/pid.c
> @@ -199,7 +199,7 @@ struct pid *alloc_pid(struct pid_namespace *ns, pid_t *set_tid,
>  			if (tid != 1 && !tmp->child_reaper)
>  				goto out_free;
>  			retval = -EPERM;
> -			if (!ns_capable(tmp->user_ns, CAP_SYS_ADMIN))
> +			if (!checkpoint_restore_ns_capable(tmp->user_ns))
>  				goto out_free;
>  			set_tid_size--;
>  		}
> -- 
> 2.26.2
> 


* Re: [PATCH v5 3/6] pid_namespace: use checkpoint_restore_ns_capable() for ns_last_pid
  2020-07-15 14:49 ` [PATCH v5 3/6] pid_namespace: use checkpoint_restore_ns_capable() for ns_last_pid Adrian Reber
@ 2020-07-15 15:08   ` Christian Brauner
  0 siblings, 0 replies; 17+ messages in thread
From: Christian Brauner @ 2020-07-15 15:08 UTC (permalink / raw)
  To: Adrian Reber
  Cc: Eric Biederman, Pavel Emelyanov, Oleg Nesterov, Dmitry Safonov,
	Andrei Vagin, Nicolas Viennot, Michał Cłapiński,
	Kamil Yurtsever, Dirk Petersen, Christine Flood, Casey Schaufler,
	Mike Rapoport, Radostin Stoyanov, Cyrill Gorcunov, Serge Hallyn,
	Stephen Smalley, Sargun Dhillon, Arnd Bergmann,
	linux-security-module, linux-kernel, selinux, Eric Paris,
	Jann Horn, linux-fsdevel

On Wed, Jul 15, 2020 at 04:49:51PM +0200, Adrian Reber wrote:
> Use the newly introduced capability CAP_CHECKPOINT_RESTORE to allow
> writing to ns_last_pid.
> 
> Signed-off-by: Adrian Reber <areber@redhat.com>
> Signed-off-by: Nicolas Viennot <Nicolas.Viennot@twosigma.com>
> ---

Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
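
For context, a restorer typically uses ns_last_pid by writing the PID just
below the one it wants and then forking. The following is a minimal sketch of
that pattern, not code from this series. The helper name request_next_pid()
and its path parameter are made up for illustration (in practice the path
would be /proc/sys/kernel/ns_last_pid), and the sketch ignores races with
other processes forking in the same PID namespace.

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: ask for `wanted` to be the next PID handed out by
 * writing `wanted - 1` to the ns_last_pid file at `path`.
 * Returns 0 on success, -1 on any failure.
 */
static int request_next_pid(const char *path, pid_t wanted)
{
	char buf[32];
	ssize_t written;
	int len, fd;

	/* This sketch only handles PIDs >= 2. */
	if (wanted < 2)
		return -1;

	len = snprintf(buf, sizeof(buf), "%d", (int)(wanted - 1));
	if (len < 0 || (size_t)len >= sizeof(buf))
		return -1;

	/* With this series, the write needs CAP_SYS_ADMIN or
	 * CAP_CHECKPOINT_RESTORE in the owning user namespace. */
	fd = open(path, O_WRONLY);
	if (fd < 0)
		return -1;

	written = write(fd, buf, (size_t)len);
	close(fd);

	return written == (ssize_t)len ? 0 : -1;
}
```

A caller would invoke request_next_pid("/proc/sys/kernel/ns_last_pid", pid)
immediately before fork() and then verify the child's PID, since nothing
guarantees that another task did not fork in between.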


* Re: [PATCH v5 5/6] prctl: Allow checkpoint/restore capable processes to change exe link
  2020-07-15 14:49 ` [PATCH v5 5/6] prctl: Allow checkpoint/restore capable processes to change exe link Adrian Reber
@ 2020-07-15 15:20   ` Christian Brauner
  2020-07-15 15:49     ` Nicolas Viennot
  0 siblings, 1 reply; 17+ messages in thread
From: Christian Brauner @ 2020-07-15 15:20 UTC (permalink / raw)
  To: Adrian Reber
  Cc: Eric Biederman, Pavel Emelyanov, Oleg Nesterov, Dmitry Safonov,
	Andrei Vagin, Nicolas Viennot, Michał Cłapiński,
	Kamil Yurtsever, Dirk Petersen, Christine Flood, Casey Schaufler,
	Mike Rapoport, Radostin Stoyanov, Cyrill Gorcunov, Serge Hallyn,
	Stephen Smalley, Sargun Dhillon, Arnd Bergmann,
	linux-security-module, linux-kernel, selinux, Eric Paris,
	Jann Horn, linux-fsdevel

On Wed, Jul 15, 2020 at 04:49:53PM +0200, Adrian Reber wrote:
> From: Nicolas Viennot <Nicolas.Viennot@twosigma.com>
> 
> Allow CAP_CHECKPOINT_RESTORE capable users to change /proc/self/exe.
> 
> This commit also changes the permission error code from -EINVAL to
> -EPERM for consistency with the rest of the prctl() syscall when
> checking capabilities.

I agree that EINVAL seems weird here, but this is a potentially
user-visible change. It might be nice to have the EINVAL->EPERM change as
an additional patch on top of this one so we can revert it in case it
breaks someone (unlikely though). I can split this out myself, so there
is no need to resend for that alone.

What I would also prefer is to have some history in the commit message
tbh. The reason is that when we started discussing that specific change
I had to hunt down the history of changing /proc/self/exe and had to
dig up and read through ancient threads on lore to come up with the
explanation why this is placed under a capability. The commit message
should then also mention that there are other ways to change the
/proc/self/exe link that don't require capabilities and that
/proc/self/exe itself is not something userspace should rely on for
security. Mainly so that in a few months/years we can read through that
commit message and go "Weird, but ok.". :)

But maybe I can just rewrite this myself so you don't have to go through
the trouble. This is really not pedantry; it's just that it's a lot of
work digging up the reasons for a piece of code existing when it's
really not obvious. :)

Christian

> 
> Signed-off-by: Nicolas Viennot <Nicolas.Viennot@twosigma.com>
> Signed-off-by: Adrian Reber <areber@redhat.com>
> ---
>  kernel/sys.c | 12 +++++++-----
>  1 file changed, 7 insertions(+), 5 deletions(-)
> 
> diff --git a/kernel/sys.c b/kernel/sys.c
> index 00a96746e28a..dd59b9142b1d 100644
> --- a/kernel/sys.c
> +++ b/kernel/sys.c
> @@ -2007,12 +2007,14 @@ static int prctl_set_mm_map(int opt, const void __user *addr, unsigned long data
>  
>  	if (prctl_map.exe_fd != (u32)-1) {
>  		/*
> -		 * Make sure the caller has the rights to
> -		 * change /proc/pid/exe link: only local sys admin should
> -		 * be allowed to.
> +		 * Check if the current user is checkpoint/restore capable.
> +		 * At the time of this writing, it checks for CAP_SYS_ADMIN
> +		 * or CAP_CHECKPOINT_RESTORE.
> +		 * Note that a user with access to ptrace can masquerade an
> +		 * arbitrary program as any executable, even setuid ones.
>  		 */
> -		if (!ns_capable(current_user_ns(), CAP_SYS_ADMIN))
> -			return -EINVAL;
> +		if (!checkpoint_restore_ns_capable(current_user_ns()))
> +			return -EPERM;
>  
>  		error = prctl_set_mm_exe_file(mm, prctl_map.exe_fd);
>  		if (error)
> -- 
> 2.26.2
> 
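
The error-code change discussed above is visible to any caller that inspects
errno after a failed attempt to set the exe link through prctl(). A caller
that has to run on kernels both with and without this patch could treat
either value as a possible permission failure. A rough sketch follows; the
enum and the function name are invented here, and EINVAL is only a hint,
since prctl() also returns it for genuinely malformed arguments.

```c
#include <errno.h>

/* Hypothetical classification of a prctl() exe-link result. */
enum exe_link_status {
	EXE_LINK_OK,
	EXE_LINK_DENIED,	/* EPERM: capability check failed (with this patch) */
	EXE_LINK_MAYBE_DENIED,	/* EINVAL: older kernel denial, or bad arguments */
	EXE_LINK_FAILED,	/* any other error */
};

/*
 * `ret` is the return value of prctl(), `err` is the errno value read
 * right after the call.
 */
static enum exe_link_status classify_exe_link_result(int ret, int err)
{
	if (ret == 0)
		return EXE_LINK_OK;

	switch (err) {
	case EPERM:
		return EXE_LINK_DENIED;
	case EINVAL:
		return EXE_LINK_MAYBE_DENIED;
	default:
		return EXE_LINK_FAILED;
	}
}
```

Keeping EINVAL as a separate "maybe" state avoids reporting a permission
problem when the real cause is an invalid argument.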


* Re: [PATCH v5 6/6] selftests: add clone3() CAP_CHECKPOINT_RESTORE test
  2020-07-15 14:49 ` [PATCH v5 6/6] selftests: add clone3() CAP_CHECKPOINT_RESTORE test Adrian Reber
@ 2020-07-15 15:24   ` Christian Brauner
  0 siblings, 0 replies; 17+ messages in thread
From: Christian Brauner @ 2020-07-15 15:24 UTC (permalink / raw)
  To: Adrian Reber
  Cc: Eric Biederman, Pavel Emelyanov, Oleg Nesterov, Dmitry Safonov,
	Andrei Vagin, Nicolas Viennot, Michał Cłapiński,
	Kamil Yurtsever, Dirk Petersen, Christine Flood, Casey Schaufler,
	Mike Rapoport, Radostin Stoyanov, Cyrill Gorcunov, Serge Hallyn,
	Stephen Smalley, Sargun Dhillon, Arnd Bergmann,
	linux-security-module, linux-kernel, selinux, Eric Paris,
	Jann Horn, linux-fsdevel

On Wed, Jul 15, 2020 at 04:49:54PM +0200, Adrian Reber wrote:
> This adds a test that changes its UID, uses capabilities to
> get CAP_CHECKPOINT_RESTORE and uses clone3() with set_tid to
> create a process with a given PID as non-root.
> 
> Signed-off-by: Adrian Reber <areber@redhat.com>
> Acked-by: Serge Hallyn <serge@hallyn.com>
> ---

You need to add the new clone3_cap_checkpoint_restore binary to
.gitignore too. :)

And one more annoying request: can you port the selftests to use the
kselftest_harness.h infrastructure, please? It is way nicer to use, has
a more uniform feel, and generates better output. You can look at:

tools/testing/selftests/pidfd/pidfd_setns_test.c
tools/testing/selftests/seccomp/seccomp_bpf.c

and others for examples on how to use it.

Thanks!
Christian

>  tools/testing/selftests/clone3/Makefile       |   4 +-
>  .../clone3/clone3_cap_checkpoint_restore.c    | 203 ++++++++++++++++++
>  2 files changed, 206 insertions(+), 1 deletion(-)
>  create mode 100644 tools/testing/selftests/clone3/clone3_cap_checkpoint_restore.c
> 
> diff --git a/tools/testing/selftests/clone3/Makefile b/tools/testing/selftests/clone3/Makefile
> index cf976c732906..ef7564cb7abe 100644
> --- a/tools/testing/selftests/clone3/Makefile
> +++ b/tools/testing/selftests/clone3/Makefile
> @@ -1,6 +1,8 @@
>  # SPDX-License-Identifier: GPL-2.0
>  CFLAGS += -g -I../../../../usr/include/
> +LDLIBS += -lcap
>  
> -TEST_GEN_PROGS := clone3 clone3_clear_sighand clone3_set_tid
> +TEST_GEN_PROGS := clone3 clone3_clear_sighand clone3_set_tid \
> +	clone3_cap_checkpoint_restore
>  
>  include ../lib.mk
> diff --git a/tools/testing/selftests/clone3/clone3_cap_checkpoint_restore.c b/tools/testing/selftests/clone3/clone3_cap_checkpoint_restore.c
> new file mode 100644
> index 000000000000..2cc3d57b91f2
> --- /dev/null
> +++ b/tools/testing/selftests/clone3/clone3_cap_checkpoint_restore.c
> @@ -0,0 +1,203 @@
> +// SPDX-License-Identifier: GPL-2.0
> +
> +/*
> + * Based on Christian Brauner's clone3() example.
> + * These tests are assuming to be running in the host's
> + * PID namespace.
> + */
> +
> +/* capabilities related code based on selftests/bpf/test_verifier.c */
> +
> +#define _GNU_SOURCE
> +#include <errno.h>
> +#include <linux/types.h>
> +#include <linux/sched.h>
> +#include <stdio.h>
> +#include <stdlib.h>
> +#include <stdbool.h>
> +#include <sys/capability.h>
> +#include <sys/prctl.h>
> +#include <sys/syscall.h>
> +#include <sys/types.h>
> +#include <sys/un.h>
> +#include <sys/wait.h>
> +#include <unistd.h>
> +#include <sched.h>
> +
> +#include "../kselftest.h"
> +#include "clone3_selftests.h"
> +
> +#ifndef MAX_PID_NS_LEVEL
> +#define MAX_PID_NS_LEVEL 32
> +#endif
> +
> +static void child_exit(int ret)
> +{
> +	fflush(stdout);
> +	fflush(stderr);
> +	_exit(ret);
> +}
> +
> +static int call_clone3_set_tid(pid_t * set_tid, size_t set_tid_size)
> +{
> +	int status;
> +	pid_t pid = -1;
> +
> +	struct clone_args args = {
> +		.exit_signal = SIGCHLD,
> +		.set_tid = ptr_to_u64(set_tid),
> +		.set_tid_size = set_tid_size,
> +	};
> +
> +	pid = sys_clone3(&args, sizeof(struct clone_args));
> +	if (pid < 0) {
> +		ksft_print_msg("%s - Failed to create new process\n",
> +			       strerror(errno));
> +		return -errno;
> +	}
> +
> +	if (pid == 0) {
> +		int ret;
> +		char tmp = 0;
> +
> +		ksft_print_msg
> +		    ("I am the child, my PID is %d (expected %d)\n",
> +		     getpid(), set_tid[0]);
> +
> +		if (set_tid[0] != getpid())
> +			child_exit(EXIT_FAILURE);
> +		child_exit(EXIT_SUCCESS);
> +	}
> +
> +	ksft_print_msg("I am the parent (%d). My child's pid is %d\n",
> +		       getpid(), pid);
> +
> +	if (waitpid(pid, &status, 0) < 0) {
> +		ksft_print_msg("Child returned %s\n", strerror(errno));
> +		return -errno;
> +	}
> +
> +	if (!WIFEXITED(status))
> +		return -1;
> +
> +	return WEXITSTATUS(status);
> +}
> +
> +static int test_clone3_set_tid(pid_t * set_tid,
> +			       size_t set_tid_size, int expected)
> +{
> +	int ret;
> +
> +	ksft_print_msg("[%d] Trying clone3() with CLONE_SET_TID to %d\n",
> +		       getpid(), set_tid[0]);
> +	ret = call_clone3_set_tid(set_tid, set_tid_size);
> +
> +	ksft_print_msg
> +	    ("[%d] clone3() with CLONE_SET_TID %d says :%d - expected %d\n",
> +	     getpid(), set_tid[0], ret, expected);
> +	if (ret != expected) {
> +		ksft_test_result_fail
> +		    ("[%d] Result (%d) is different than expected (%d)\n",
> +		     getpid(), ret, expected);
> +		return -1;
> +	}
> +	ksft_test_result_pass
> +	    ("[%d] Result (%d) matches expectation (%d)\n", getpid(), ret,
> +	     expected);
> +
> +	return 0;
> +}
> +
> +struct libcap {
> +	struct __user_cap_header_struct hdr;
> +	struct __user_cap_data_struct data[2];
> +};
> +
> +static int set_capability()
> +{
> +	cap_value_t cap_values[] = { CAP_SETUID, CAP_SETGID };
> +	struct libcap *cap;
> +	int ret = -1;
> +	cap_t caps;
> +
> +	caps = cap_get_proc();
> +	if (!caps) {
> +		perror("cap_get_proc");
> +		return -1;
> +	}
> +
> +	/* Drop all capabilities */
> +	if (cap_clear(caps)) {
> +		perror("cap_clear");
> +		goto out;
> +	}
> +
> +	cap_set_flag(caps, CAP_EFFECTIVE, 2, cap_values, CAP_SET);
> +	cap_set_flag(caps, CAP_PERMITTED, 2, cap_values, CAP_SET);
> +
> +	cap = (struct libcap *) caps;
> +
> +	/* 40 -> CAP_CHECKPOINT_RESTORE */
> +	cap->data[1].effective |= 1 << (40 - 32);
> +	cap->data[1].permitted |= 1 << (40 - 32);
> +
> +	if (cap_set_proc(caps)) {
> +		perror("cap_set_proc");
> +		goto out;
> +	}
> +	ret = 0;
> +out:
> +	if (cap_free(caps))
> +		perror("cap_free");
> +	return ret;
> +}
> +
> +int main(int argc, char *argv[])
> +{
> +	pid_t pid;
> +	int status;
> +	int ret = 0;
> +	pid_t set_tid[1];
> +	uid_t uid = getuid();
> +
> +	ksft_print_header();
> +	test_clone3_supported();
> +	ksft_set_plan(2);
> +
> +	if (uid != 0) {
> +		ksft_cnt.ksft_xskip = ksft_plan;
> +		ksft_print_msg("Skipping all tests as non-root\n");
> +		return ksft_exit_pass();
> +	}
> +
> +	memset(&set_tid, 0, sizeof(set_tid));
> +
> +	/* Find the current active PID */
> +	pid = fork();
> +	if (pid == 0) {
> +		ksft_print_msg("Child has PID %d\n", getpid());
> +		child_exit(EXIT_SUCCESS);
> +	}
> +	if (waitpid(pid, &status, 0) < 0)
> +		ksft_exit_fail_msg("Waiting for child %d failed", pid);
> +
> +	/* After the child has finished, its PID should be free. */
> +	set_tid[0] = pid;
> +
> +	if (set_capability())
> +		ksft_test_result_fail
> +		    ("Could not set CAP_CHECKPOINT_RESTORE\n");
> +	prctl(PR_SET_KEEPCAPS, 1, 0, 0, 0);
> +	/* This would fail without CAP_CHECKPOINT_RESTORE */
> +	setgid(1000);
> +	setuid(1000);
> +	set_tid[0] = pid;
> +	ret |= test_clone3_set_tid(set_tid, 1, -EPERM);
> +	if (set_capability())
> +		ksft_test_result_fail
> +		    ("Could not set CAP_CHECKPOINT_RESTORE\n");
> +	/* This should work as we have CAP_CHECKPOINT_RESTORE as non-root */
> +	ret |= test_clone3_set_tid(set_tid, 1, 0);
> +
> +	return !ret ? ksft_exit_pass() : ksft_exit_fail();
> +}
> -- 
> 2.26.2
> 
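
The quoted test sets capability 40 by hand with "1 << (40 - 32)" on data[1].
The arithmetic behind that can be sketched as below: capabilities are stored
in 32-bit words, so the word index is the capability number divided by 32 and
the bit is the remainder. The function names and the EXAMPLE_ constant are
made up for this sketch; the kernel's uapi capability header provides
CAP_TO_INDEX() and CAP_TO_MASK() macros for the same purpose.

```c
#include <stdint.h>

/* Value used for CAP_CHECKPOINT_RESTORE in the quoted test. */
#define EXAMPLE_CAP_CHECKPOINT_RESTORE 40

/* Index of the 32-bit word that holds capability `cap`. */
static unsigned int cap_word(unsigned int cap)
{
	return cap >> 5;
}

/* Bit mask of capability `cap` within its 32-bit word. */
static uint32_t cap_bit(unsigned int cap)
{
	return UINT32_C(1) << (cap & 31);
}
```

For capability 40 this gives word 1 and bit 8, which matches the
"data[1]" and "1 << (40 - 32)" used in the test.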


* RE: [PATCH v5 5/6] prctl: Allow checkpoint/restore capable processes to change exe link
  2020-07-15 15:20   ` Christian Brauner
@ 2020-07-15 15:49     ` Nicolas Viennot
  0 siblings, 0 replies; 17+ messages in thread
From: Nicolas Viennot @ 2020-07-15 15:49 UTC (permalink / raw)
  To: Christian Brauner, Adrian Reber
  Cc: Eric Biederman, Pavel Emelyanov, Oleg Nesterov, Dmitry Safonov,
	Andrei Vagin, Michał Cłapiński, Kamil Yurtsever,
	Dirk Petersen, Christine Flood, Casey Schaufler, Mike Rapoport,
	Radostin Stoyanov, Cyrill Gorcunov, Serge Hallyn,
	Stephen Smalley, Sargun Dhillon, Arnd Bergmann,
	linux-security-module, linux-kernel, selinux, Eric Paris,
	Jann Horn, linux-fsdevel

> On Wed, Jul 15, 2020 at 04:49:53PM +0200, Adrian Reber wrote:
> > From: Nicolas Viennot <Nicolas.Viennot@twosigma.com>
> > 
> > Allow CAP_CHECKPOINT_RESTORE capable users to change /proc/self/exe.
> > 
> > This commit also changes the permission error code from -EINVAL to 
> > -EPERM for consistency with the rest of the prctl() syscall when 
> > checking capabilities.
> I agree that EINVAL seems weird here but this is a potentially user
> visible change. Might be nice to have the EINVAL->EPERM change be an
> additional patch on top after this one so we can revert it in case it
> breaks someone (unlikely though). I can split this out myself though so
> no need to resend for that alone.
>
> What I would also prefer is to have some history in the commit message
> tbh. The reason is that when we started discussing that specific change
> I had to hunt down the history of changing /proc/self/exe and had to
> dig up and read through ancient threads on lore to come up with the
> explanation why this is placed under a capability. The commit message
> should then also mention that there are other ways to change the
> /proc/self/exe link that don't require capabilities and that
> /proc/self/exe itself is not something userspace should rely on for
> security. Mainly so that in a few months/years we can read through that
> commit message and go "Weird, but ok.". :)
>
> But maybe I can just rewrite this myself so you don't have to go through
> the trouble. This is really not pedantry it's just that it's a lot of
> work digging up the reasons for a piece of code existing when it's
> really not obvious. :)

Hello Christian,

I agree.
Thank you for offering to do the work, but you've done plenty already. So we'll come back to you with:
1) A separate commit for EINVAL->EPERM
2) A full history of discussions in the commit message related to /proc/self/exe capability check

Thanks,
Nico


* Re: [PATCH v5 4/6] proc: allow access in init userns for map_files with CAP_CHECKPOINT_RESTORE
  2020-07-15 14:49 ` [PATCH v5 4/6] proc: allow access in init userns for map_files with CAP_CHECKPOINT_RESTORE Adrian Reber
@ 2020-07-15 21:17   ` Cyrill Gorcunov
  2020-07-16  8:51   ` Christian Brauner
  1 sibling, 0 replies; 17+ messages in thread
From: Cyrill Gorcunov @ 2020-07-15 21:17 UTC (permalink / raw)
  To: Adrian Reber
  Cc: Christian Brauner, Eric Biederman, Pavel Emelyanov,
	Oleg Nesterov, Dmitry Safonov, Andrei Vagin, Nicolas Viennot,
	Michał Cłapiński, Kamil Yurtsever, Dirk Petersen,
	Christine Flood, Casey Schaufler, Mike Rapoport,
	Radostin Stoyanov, Serge Hallyn, Stephen Smalley, Sargun Dhillon,
	Arnd Bergmann, linux-security-module, linux-kernel, selinux,
	Eric Paris, Jann Horn, linux-fsdevel

On Wed, Jul 15, 2020 at 04:49:52PM +0200, Adrian Reber wrote:
> Opening files in /proc/pid/map_files when the current user is
> CAP_CHECKPOINT_RESTORE capable in the root namespace is useful for
> checkpointing and restoring to recover files that are unreachable via
> the file system such as deleted files, or memfd files.
> 
> Signed-off-by: Adrian Reber <areber@redhat.com>
> Signed-off-by: Nicolas Viennot <Nicolas.Viennot@twosigma.com>

I still plan to make this code usable without any capability
requirements, but due to a lack of spare time for a deep
investigation this won't happen anytime soon.
Thus the patch looks OK to me, fwiw.

Reviewed-by: Cyrill Gorcunov <gorcunov@gmail.com>
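
As an illustration of what is being opened here: entries under
/proc/<pid>/map_files are named after the mapping's address range, in hex and
without a 0x prefix, as in /proc/<pid>/maps. A small sketch of building such
a path follows; the helper name map_files_path() is made up for this example
and is not part of the series.

```c
#include <stdio.h>
#include <sys/types.h>

/*
 * Hypothetical helper: format the map_files entry for the mapping
 * [start, end) of process `pid` into `buf`.
 * Returns 0 on success, -1 if the buffer is too small.
 */
static int map_files_path(char *buf, size_t size, pid_t pid,
			  unsigned long start, unsigned long end)
{
	int len;

	len = snprintf(buf, size, "/proc/%d/map_files/%lx-%lx",
		       (int)pid, start, end);
	if (len < 0 || (size_t)len >= size)
		return -1;

	return 0;
}
```

A checkpointing tool could then open() the resulting path to reach a file
that is no longer reachable through the file system, provided the capability
check in proc_map_files_get_link() passes.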


* Re: [PATCH v5 4/6] proc: allow access in init userns for map_files with CAP_CHECKPOINT_RESTORE
  2020-07-15 14:49 ` [PATCH v5 4/6] proc: allow access in init userns for map_files with CAP_CHECKPOINT_RESTORE Adrian Reber
  2020-07-15 21:17   ` Cyrill Gorcunov
@ 2020-07-16  8:51   ` Christian Brauner
  1 sibling, 0 replies; 17+ messages in thread
From: Christian Brauner @ 2020-07-16  8:51 UTC (permalink / raw)
  To: Adrian Reber
  Cc: Eric Biederman, Pavel Emelyanov, Oleg Nesterov, Dmitry Safonov,
	Andrei Vagin, Nicolas Viennot, Michał Cłapiński,
	Kamil Yurtsever, Dirk Petersen, Christine Flood, Casey Schaufler,
	Mike Rapoport, Radostin Stoyanov, Cyrill Gorcunov, Serge Hallyn,
	Stephen Smalley, Sargun Dhillon, Arnd Bergmann,
	linux-security-module, linux-kernel, selinux, Eric Paris,
	Jann Horn, linux-fsdevel

On Wed, Jul 15, 2020 at 04:49:52PM +0200, Adrian Reber wrote:
> Opening files in /proc/pid/map_files when the current user is
> CAP_CHECKPOINT_RESTORE capable in the root namespace is useful for
> checkpointing and restoring to recover files that are unreachable via
> the file system such as deleted files, or memfd files.
> 
> Signed-off-by: Adrian Reber <areber@redhat.com>
> Signed-off-by: Nicolas Viennot <Nicolas.Viennot@twosigma.com>
> ---
>  fs/proc/base.c | 8 ++++----
>  1 file changed, 4 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/proc/base.c b/fs/proc/base.c
> index 65893686d1f1..cada783f229e 100644
> --- a/fs/proc/base.c
> +++ b/fs/proc/base.c
> @@ -2194,16 +2194,16 @@ struct map_files_info {
>  };
>  
>  /*
> - * Only allow CAP_SYS_ADMIN to follow the links, due to concerns about how the
> - * symlinks may be used to bypass permissions on ancestor directories in the
> - * path to the file in question.
> + * Only allow CAP_SYS_ADMIN and CAP_CHECKPOINT_RESTORE to follow the links, due
> + * to concerns about how the symlinks may be used to bypass permissions on
> + * ancestor directories in the path to the file in question.
>   */
>  static const char *
>  proc_map_files_get_link(struct dentry *dentry,
>  			struct inode *inode,
>  		        struct delayed_call *done)
>  {
> -	if (!capable(CAP_SYS_ADMIN))
> +	if (!capable(CAP_SYS_ADMIN) || !capable(CAP_CHECKPOINT_RESTORE))

So right now, when I do

git grep checkpoint_restore_ns_capable

I don't hit that codepath, which isn't great. So I'd suggest using:

if (!checkpoint_restore_ns_capable(&init_user_ns))

At the end of the day, capable(<cap>) just calls
ns_capable(&init_user_ns, <cap>) anyway.

Thanks!
Christian
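
A side note that is not spelled out in the reply above: read literally, the
condition in the quoted hunk denies access unless both capabilities are held,
whereas the helper, per the comment in the prctl patch, accepts either
CAP_SYS_ADMIN or CAP_CHECKPOINT_RESTORE. The sketch below models the two
conditions with plain booleans to show where they differ. The function names
are invented for this illustration and do not exist in the kernel.

```c
#include <stdbool.h>

/* Condition as written in the quoted hunk:
 * !capable(CAP_SYS_ADMIN) || !capable(CAP_CHECKPOINT_RESTORE) */
static bool denied_as_posted(bool has_sys_admin, bool has_cr)
{
	return !has_sys_admin || !has_cr;
}

/* Condition with a helper that accepts either capability:
 * !checkpoint_restore_ns_capable(&init_user_ns) */
static bool denied_with_helper(bool has_sys_admin, bool has_cr)
{
	return !(has_cr || has_sys_admin);
}
```

With only CAP_CHECKPOINT_RESTORE held, denied_as_posted() reports a denial
while denied_with_helper() does not, so switching to the helper also changes
the outcome in that case.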


* Re: [PATCH v5 0/6] capabilities: Introduce CAP_CHECKPOINT_RESTORE
  2020-07-15 14:49 [PATCH v5 0/6] capabilities: Introduce CAP_CHECKPOINT_RESTORE Adrian Reber
                   ` (5 preceding siblings ...)
  2020-07-15 14:49 ` [PATCH v5 6/6] selftests: add clone3() CAP_CHECKPOINT_RESTORE test Adrian Reber
@ 2020-07-18  3:24 ` Serge E. Hallyn
  2020-07-18 17:47   ` Christian Brauner
  6 siblings, 1 reply; 17+ messages in thread
From: Serge E. Hallyn @ 2020-07-18  3:24 UTC (permalink / raw)
  To: Adrian Reber
  Cc: Christian Brauner, Eric Biederman, Pavel Emelyanov,
	Oleg Nesterov, Dmitry Safonov, Andrei Vagin, Nicolas Viennot,
	Michał Cłapiński, Kamil Yurtsever, Dirk Petersen,
	Christine Flood, Casey Schaufler, Mike Rapoport,
	Radostin Stoyanov, Cyrill Gorcunov, Serge Hallyn,
	Stephen Smalley, Sargun Dhillon, Arnd Bergmann,
	linux-security-module, linux-kernel, selinux, Eric Paris,
	Jann Horn, linux-fsdevel

On Wed, Jul 15, 2020 at 04:49:48PM +0200, Adrian Reber wrote:
> This is v5 of the 'Introduce CAP_CHECKPOINT_RESTORE' patchset. The
> changes to v4 are:
> 
>  * split into more patches to have the introduction of
>    CAP_CHECKPOINT_RESTORE and the actual usage in different
>    patches
>  * reduce the /proc/self/exe patch to only be about
>    CAP_CHECKPOINT_RESTORE
> 
> Adrian Reber (5):
>   capabilities: Introduce CAP_CHECKPOINT_RESTORE
>   pid: use checkpoint_restore_ns_capable() for set_tid
>   pid_namespace: use checkpoint_restore_ns_capable() for ns_last_pid
>   proc: allow access in init userns for map_files with CAP_CHECKPOINT_RESTORE
>   selftests: add clone3() CAP_CHECKPOINT_RESTORE test
> 
> Nicolas Viennot (1):
>   prctl: Allow checkpoint/restore capable processes to change exe link

(This is probably bad form, but)  All

Reviewed-by: Serge Hallyn <serge@hallyn.com>

Assuming you change patches 4 and 6 per Christian's suggestions,
I'd like to re-review those then.

> 
>  fs/proc/base.c                                |   8 +-
>  include/linux/capability.h                    |   6 +
>  include/uapi/linux/capability.h               |   9 +-
>  kernel/pid.c                                  |   2 +-
>  kernel/pid_namespace.c                        |   2 +-
>  kernel/sys.c                                  |  12 +-
>  security/selinux/include/classmap.h           |   5 +-
>  tools/testing/selftests/clone3/Makefile       |   4 +-
>  .../clone3/clone3_cap_checkpoint_restore.c    | 203 ++++++++++++++++++
>  9 files changed, 236 insertions(+), 15 deletions(-)
>  create mode 100644 tools/testing/selftests/clone3/clone3_cap_checkpoint_restore.c
> 
> 
> base-commit: d31958b30ea3b7b6e522d6bf449427748ad45822
> -- 
> 2.26.2


* Re: [PATCH v5 0/6] capabilities: Introduce CAP_CHECKPOINT_RESTORE
  2020-07-18  3:24 ` [PATCH v5 0/6] capabilities: Introduce CAP_CHECKPOINT_RESTORE Serge E. Hallyn
@ 2020-07-18 17:47   ` Christian Brauner
  0 siblings, 0 replies; 17+ messages in thread
From: Christian Brauner @ 2020-07-18 17:47 UTC (permalink / raw)
  To: Serge E. Hallyn, Adrian Reber, Nicolas Viennot
  Cc: Adrian Reber, Eric Biederman, Pavel Emelyanov, Oleg Nesterov,
	Dmitry Safonov, Andrei Vagin, Michał Cłapiński,
	Kamil Yurtsever, Dirk Petersen, Christine Flood, Casey Schaufler,
	Mike Rapoport, Radostin Stoyanov, Cyrill Gorcunov,
	Stephen Smalley, Sargun Dhillon, Arnd Bergmann,
	linux-security-module, linux-kernel, selinux, Eric Paris,
	Jann Horn, linux-fsdevel

On Fri, Jul 17, 2020 at 10:24:16PM -0500, Serge Hallyn wrote:
> On Wed, Jul 15, 2020 at 04:49:48PM +0200, Adrian Reber wrote:
> > This is v5 of the 'Introduce CAP_CHECKPOINT_RESTORE' patchset. The
> > changes to v4 are:
> > 
> >  * split into more patches to have the introduction of
> >    CAP_CHECKPOINT_RESTORE and the actual usage in different
> >    patches
> >  * reduce the /proc/self/exe patch to only be about
> >    CAP_CHECKPOINT_RESTORE
> > 
> > Adrian Reber (5):
> >   capabilities: Introduce CAP_CHECKPOINT_RESTORE
> >   pid: use checkpoint_restore_ns_capable() for set_tid
> >   pid_namespace: use checkpoint_restore_ns_capable() for ns_last_pid
> >   proc: allow access in init userns for map_files with CAP_CHECKPOINT_RESTORE
> >   selftests: add clone3() CAP_CHECKPOINT_RESTORE test
> > 
> > Nicolas Viennot (1):
> >   prctl: Allow checkpoint/restore capable processes to change exe link
> 
> (This is probably bad form, but)  All
> 
> Reviewed-by: Serge Hallyn <serge@hallyn.com>
> 
> Assuming you change patches 4 and 6 per Christian's suggestions,
> I'd like to re-review those then.

Thanks. Once Adrian has reposted the changes and you agree with them as
well, I'll pick them up, though I might end up pushing this into the next
merge window...

Christian


Thread overview: 17+ messages
2020-07-15 14:49 [PATCH v5 0/6] capabilities: Introduce CAP_CHECKPOINT_RESTORE Adrian Reber
2020-07-15 14:49 ` [PATCH v5 1/6] " Adrian Reber
2020-07-15 15:06   ` Christian Brauner
2020-07-15 14:49 ` [PATCH v5 2/6] pid: use checkpoint_restore_ns_capable() for set_tid Adrian Reber
2020-07-15 15:08   ` Christian Brauner
2020-07-15 14:49 ` [PATCH v5 3/6] pid_namespace: use checkpoint_restore_ns_capable() for ns_last_pid Adrian Reber
2020-07-15 15:08   ` Christian Brauner
2020-07-15 14:49 ` [PATCH v5 4/6] proc: allow access in init userns for map_files with CAP_CHECKPOINT_RESTORE Adrian Reber
2020-07-15 21:17   ` Cyrill Gorcunov
2020-07-16  8:51   ` Christian Brauner
2020-07-15 14:49 ` [PATCH v5 5/6] prctl: Allow checkpoint/restore capable processes to change exe link Adrian Reber
2020-07-15 15:20   ` Christian Brauner
2020-07-15 15:49     ` Nicolas Viennot
2020-07-15 14:49 ` [PATCH v5 6/6] selftests: add clone3() CAP_CHECKPOINT_RESTORE test Adrian Reber
2020-07-15 15:24   ` Christian Brauner
2020-07-18  3:24 ` [PATCH v5 0/6] capabilities: Introduce CAP_CHECKPOINT_RESTORE Serge E. Hallyn
2020-07-18 17:47   ` Christian Brauner
