Linux-Security-Module Archive on lore.kernel.org
 help / color / Atom feed
* [PATCH v4 0/3] capabilities: Introduce CAP_CHECKPOINT_RESTORE
@ 2020-07-01  6:49 Adrian Reber
  2020-07-01  6:49 ` [PATCH v4 1/3] " Adrian Reber
                   ` (2 more replies)
  0 siblings, 3 replies; 17+ messages in thread
From: Adrian Reber @ 2020-07-01  6:49 UTC (permalink / raw)
  To: Christian Brauner, Eric Biederman, Pavel Emelyanov,
	Oleg Nesterov, Dmitry Safonov, Andrei Vagin, Nicolas Viennot,
	Michał Cłapiński, Kamil Yurtsever, Dirk Petersen,
	Christine Flood, Casey Schaufler
  Cc: Mike Rapoport, Radostin Stoyanov, Adrian Reber, Cyrill Gorcunov,
	Serge Hallyn, Stephen Smalley, Sargun Dhillon, Arnd Bergmann,
	linux-security-module, linux-kernel, selinux, Eric Paris,
	Jann Horn, linux-fsdevel

This is v4 of the 'Introduce CAP_CHECKPOINT_RESTORE' patchset. There
is only one change from v3 to address Jann's comment on patch 3/3

 (That is not necessarily true in the presence of LSMs like SELinux:
 You'd have to be able to FILE__EXECUTE_NO_TRANS the target executable
 according to the system's security policy.)

Nicolas updated the last patch (3/3). The first two patches are
unchanged from v3.

Adrian Reber (2):
  capabilities: Introduce CAP_CHECKPOINT_RESTORE
  selftests: add clone3() CAP_CHECKPOINT_RESTORE test

Nicolas Viennot (1):
  prctl: Allow ptrace capable processes to change /proc/self/exe

 fs/proc/base.c                                |   8 +-
 include/linux/capability.h                    |   6 +
 include/linux/lsm_hook_defs.h                 |   1 +
 include/linux/security.h                      |   6 +
 include/uapi/linux/capability.h               |   9 +-
 kernel/pid.c                                  |   2 +-
 kernel/pid_namespace.c                        |   2 +-
 kernel/sys.c                                  |  12 +-
 security/commoncap.c                          |  26 +++
 security/security.c                           |   5 +
 security/selinux/hooks.c                      |  14 ++
 security/selinux/include/classmap.h           |   5 +-
 tools/testing/selftests/clone3/Makefile       |   4 +-
 .../clone3/clone3_cap_checkpoint_restore.c    | 203 ++++++++++++++++++
 14 files changed, 285 insertions(+), 18 deletions(-)
 create mode 100644 tools/testing/selftests/clone3/clone3_cap_checkpoint_restore.c


base-commit: f2b92b14533e646e434523abdbafddb727c23898
-- 
2.26.2


^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH v4 1/3] capabilities: Introduce CAP_CHECKPOINT_RESTORE
  2020-07-01  6:49 [PATCH v4 0/3] capabilities: Introduce CAP_CHECKPOINT_RESTORE Adrian Reber
@ 2020-07-01  6:49 ` Adrian Reber
  2020-07-01  8:27   ` Christian Brauner
  2020-07-01  6:49 ` [PATCH v4 2/3] selftests: add clone3() CAP_CHECKPOINT_RESTORE test Adrian Reber
  2020-07-01  6:49 ` [PATCH v4 3/3] prctl: Allow ptrace capable processes to change /proc/self/exe Adrian Reber
  2 siblings, 1 reply; 17+ messages in thread
From: Adrian Reber @ 2020-07-01  6:49 UTC (permalink / raw)
  To: Christian Brauner, Eric Biederman, Pavel Emelyanov,
	Oleg Nesterov, Dmitry Safonov, Andrei Vagin, Nicolas Viennot,
	Michał Cłapiński, Kamil Yurtsever, Dirk Petersen,
	Christine Flood, Casey Schaufler
  Cc: Mike Rapoport, Radostin Stoyanov, Adrian Reber, Cyrill Gorcunov,
	Serge Hallyn, Stephen Smalley, Sargun Dhillon, Arnd Bergmann,
	linux-security-module, linux-kernel, selinux, Eric Paris,
	Jann Horn, linux-fsdevel

This patch introduces CAP_CHECKPOINT_RESTORE, a new capability facilitating
checkpoint/restore for non-root users.

Over the last years, The CRIU (Checkpoint/Restore In Userspace) team has been
asked numerous times if it is possible to checkpoint/restore a process as
non-root. The answer usually was: 'almost'.

The main blocker to restore a process as non-root was to control the PID of the
restored process. This feature available via the clone3 system call, or via
/proc/sys/kernel/ns_last_pid is unfortunately guarded by CAP_SYS_ADMIN.

In the past two years, requests for non-root checkpoint/restore have increased
due to the following use cases:
* Checkpoint/Restore in an HPC environment in combination with a resource
  manager distributing jobs where users are always running as non-root.
  There is a desire to provide a way to checkpoint and restore long running
  jobs.
* Container migration as non-root
* We have been in contact with JVM developers who are integrating
  CRIU into a Java VM to decrease the startup time. These checkpoint/restore
  applications are not meant to be running with CAP_SYS_ADMIN.

We have seen the following workarounds:
* Use a setuid wrapper around CRIU:
  See https://github.com/FredHutch/slurm-examples/blob/master/checkpointer/lib/checkpointer/checkpointer-suid.c
* Use a setuid helper that writes to ns_last_pid.
  Unfortunately, this helper delegation technique is impossible to use with
  clone3, and is thus prone to races.
  See https://github.com/twosigma/set_ns_last_pid
* Cycle through PIDs with fork() until the desired PID is reached:
  This has been demonstrated to work with cycling rates of 100,000 PIDs/s
  See https://github.com/twosigma/set_ns_last_pid
* Patch out the CAP_SYS_ADMIN check from the kernel
* Run the desired application in a new user and PID namespace to provide
  a local CAP_SYS_ADMIN for controlling PIDs. This technique has limited use in
  typical container environments (e.g., Kubernetes) as /proc is
  typically protected with read-only layers (e.g., /proc/sys) for hardening
  purposes. Read-only layers prevent additional /proc mounts (due to proc's
  SB_I_USERNS_VISIBLE property), making the use of new PID namespaces limited as
  certain applications need access to /proc matching their PID namespace.

The introduced capability allows to:
* Control PIDs when the current user is CAP_CHECKPOINT_RESTORE capable
  for the corresponding PID namespace via ns_last_pid/clone3.
* Open files in /proc/pid/map_files when the current user is
  CAP_CHECKPOINT_RESTORE capable in the root namespace, useful for recovering
  files that are unreachable via the file system such as deleted files, or memfd
  files.

See corresponding selftest for an example with clone3().

Signed-off-by: Adrian Reber <areber@redhat.com>
Signed-off-by: Nicolas Viennot <Nicolas.Viennot@twosigma.com>
---
 fs/proc/base.c                      | 8 ++++----
 include/linux/capability.h          | 6 ++++++
 include/uapi/linux/capability.h     | 9 ++++++++-
 kernel/pid.c                        | 2 +-
 kernel/pid_namespace.c              | 2 +-
 security/selinux/include/classmap.h | 5 +++--
 6 files changed, 23 insertions(+), 9 deletions(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index d86c0afc8a85..ad806069c778 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -2189,16 +2189,16 @@ struct map_files_info {
 };
 
 /*
- * Only allow CAP_SYS_ADMIN to follow the links, due to concerns about how the
- * symlinks may be used to bypass permissions on ancestor directories in the
- * path to the file in question.
+ * Only allow CAP_SYS_ADMIN and CAP_CHECKPOINT_RESTORE to follow the links, due
+ * to concerns about how the symlinks may be used to bypass permissions on
+ * ancestor directories in the path to the file in question.
  */
 static const char *
 proc_map_files_get_link(struct dentry *dentry,
 			struct inode *inode,
 		        struct delayed_call *done)
 {
-	if (!capable(CAP_SYS_ADMIN))
+	if (!capable(CAP_SYS_ADMIN) && !capable(CAP_CHECKPOINT_RESTORE))
 		return ERR_PTR(-EPERM);
 
 	return proc_pid_get_link(dentry, inode, done);
diff --git a/include/linux/capability.h b/include/linux/capability.h
index b4345b38a6be..1e7fe311cabe 100644
--- a/include/linux/capability.h
+++ b/include/linux/capability.h
@@ -261,6 +261,12 @@ static inline bool bpf_capable(void)
 	return capable(CAP_BPF) || capable(CAP_SYS_ADMIN);
 }
 
+static inline bool checkpoint_restore_ns_capable(struct user_namespace *ns)
+{
+	return ns_capable(ns, CAP_CHECKPOINT_RESTORE) ||
+		ns_capable(ns, CAP_SYS_ADMIN);
+}
+
 /* audit system wants to get cap info from files as well */
 extern int get_vfs_caps_from_disk(const struct dentry *dentry, struct cpu_vfs_cap_data *cpu_caps);
 
diff --git a/include/uapi/linux/capability.h b/include/uapi/linux/capability.h
index 48ff0757ae5e..395dd0df8d08 100644
--- a/include/uapi/linux/capability.h
+++ b/include/uapi/linux/capability.h
@@ -408,7 +408,14 @@ struct vfs_ns_cap_data {
  */
 #define CAP_BPF			39
 
-#define CAP_LAST_CAP         CAP_BPF
+
+/* Allow checkpoint/restore related operations */
+/* Allow PID selection during clone3() */
+/* Allow writing to ns_last_pid */
+
+#define CAP_CHECKPOINT_RESTORE	40
+
+#define CAP_LAST_CAP         CAP_CHECKPOINT_RESTORE
 
 #define cap_valid(x) ((x) >= 0 && (x) <= CAP_LAST_CAP)
 
diff --git a/kernel/pid.c b/kernel/pid.c
index 5799ae54b89e..2d0a97b7ed7a 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -198,7 +198,7 @@ struct pid *alloc_pid(struct pid_namespace *ns, pid_t *set_tid,
 			if (tid != 1 && !tmp->child_reaper)
 				goto out_free;
 			retval = -EPERM;
-			if (!ns_capable(tmp->user_ns, CAP_SYS_ADMIN))
+			if (!checkpoint_restore_ns_capable(tmp->user_ns))
 				goto out_free;
 			set_tid_size--;
 		}
diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
index 0e5ac162c3a8..ac135bd600eb 100644
--- a/kernel/pid_namespace.c
+++ b/kernel/pid_namespace.c
@@ -269,7 +269,7 @@ static int pid_ns_ctl_handler(struct ctl_table *table, int write,
 	struct ctl_table tmp = *table;
 	int ret, next;
 
-	if (write && !ns_capable(pid_ns->user_ns, CAP_SYS_ADMIN))
+	if (write && !checkpoint_restore_ns_capable(pid_ns->user_ns))
 		return -EPERM;
 
 	/*
diff --git a/security/selinux/include/classmap.h b/security/selinux/include/classmap.h
index 98e1513b608a..40cebde62856 100644
--- a/security/selinux/include/classmap.h
+++ b/security/selinux/include/classmap.h
@@ -27,9 +27,10 @@
 	    "audit_control", "setfcap"
 
 #define COMMON_CAP2_PERMS  "mac_override", "mac_admin", "syslog", \
-		"wake_alarm", "block_suspend", "audit_read", "perfmon", "bpf"
+		"wake_alarm", "block_suspend", "audit_read", "perfmon", "bpf", \
+		"checkpoint_restore"
 
-#if CAP_LAST_CAP > CAP_BPF
+#if CAP_LAST_CAP > CAP_CHECKPOINT_RESTORE
 #error New capability defined, please update COMMON_CAP2_PERMS.
 #endif
 
-- 
2.26.2


^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH v4 2/3] selftests: add clone3() CAP_CHECKPOINT_RESTORE test
  2020-07-01  6:49 [PATCH v4 0/3] capabilities: Introduce CAP_CHECKPOINT_RESTORE Adrian Reber
  2020-07-01  6:49 ` [PATCH v4 1/3] " Adrian Reber
@ 2020-07-01  6:49 ` Adrian Reber
  2020-07-02 20:53   ` Serge E. Hallyn
  2020-07-01  6:49 ` [PATCH v4 3/3] prctl: Allow ptrace capable processes to change /proc/self/exe Adrian Reber
  2 siblings, 1 reply; 17+ messages in thread
From: Adrian Reber @ 2020-07-01  6:49 UTC (permalink / raw)
  To: Christian Brauner, Eric Biederman, Pavel Emelyanov,
	Oleg Nesterov, Dmitry Safonov, Andrei Vagin, Nicolas Viennot,
	Michał Cłapiński, Kamil Yurtsever, Dirk Petersen,
	Christine Flood, Casey Schaufler
  Cc: Mike Rapoport, Radostin Stoyanov, Adrian Reber, Cyrill Gorcunov,
	Serge Hallyn, Stephen Smalley, Sargun Dhillon, Arnd Bergmann,
	linux-security-module, linux-kernel, selinux, Eric Paris,
	Jann Horn, linux-fsdevel

This adds a test that changes its UID, uses capabilities to
get CAP_CHECKPOINT_RESTORE and uses clone3() with set_tid to
create a process with a given PID as non-root.

Signed-off-by: Adrian Reber <areber@redhat.com>
---
 tools/testing/selftests/clone3/Makefile       |   4 +-
 .../clone3/clone3_cap_checkpoint_restore.c    | 203 ++++++++++++++++++
 2 files changed, 206 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/clone3/clone3_cap_checkpoint_restore.c

diff --git a/tools/testing/selftests/clone3/Makefile b/tools/testing/selftests/clone3/Makefile
index cf976c732906..ef7564cb7abe 100644
--- a/tools/testing/selftests/clone3/Makefile
+++ b/tools/testing/selftests/clone3/Makefile
@@ -1,6 +1,8 @@
 # SPDX-License-Identifier: GPL-2.0
 CFLAGS += -g -I../../../../usr/include/
+LDLIBS += -lcap
 
-TEST_GEN_PROGS := clone3 clone3_clear_sighand clone3_set_tid
+TEST_GEN_PROGS := clone3 clone3_clear_sighand clone3_set_tid \
+	clone3_cap_checkpoint_restore
 
 include ../lib.mk
diff --git a/tools/testing/selftests/clone3/clone3_cap_checkpoint_restore.c b/tools/testing/selftests/clone3/clone3_cap_checkpoint_restore.c
new file mode 100644
index 000000000000..2cc3d57b91f2
--- /dev/null
+++ b/tools/testing/selftests/clone3/clone3_cap_checkpoint_restore.c
@@ -0,0 +1,203 @@
+// SPDX-License-Identifier: GPL-2.0
+
+/*
+ * Based on Christian Brauner's clone3() example.
+ * These tests are assuming to be running in the host's
+ * PID namespace.
+ */
+
+/* capabilities related code based on selftests/bpf/test_verifier.c */
+
+#define _GNU_SOURCE
+#include <errno.h>
+#include <linux/types.h>
+#include <linux/sched.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <stdbool.h>
+#include <sys/capability.h>
+#include <sys/prctl.h>
+#include <sys/syscall.h>
+#include <sys/types.h>
+#include <sys/un.h>
+#include <sys/wait.h>
+#include <unistd.h>
+#include <sched.h>
+
+#include "../kselftest.h"
+#include "clone3_selftests.h"
+
+#ifndef MAX_PID_NS_LEVEL
+#define MAX_PID_NS_LEVEL 32
+#endif
+
+static void child_exit(int ret)
+{
+	fflush(stdout);
+	fflush(stderr);
+	_exit(ret);
+}
+
+static int call_clone3_set_tid(pid_t * set_tid, size_t set_tid_size)
+{
+	int status;
+	pid_t pid = -1;
+
+	struct clone_args args = {
+		.exit_signal = SIGCHLD,
+		.set_tid = ptr_to_u64(set_tid),
+		.set_tid_size = set_tid_size,
+	};
+
+	pid = sys_clone3(&args, sizeof(struct clone_args));
+	if (pid < 0) {
+		ksft_print_msg("%s - Failed to create new process\n",
+			       strerror(errno));
+		return -errno;
+	}
+
+	if (pid == 0) {
+		int ret;
+		char tmp = 0;
+
+		ksft_print_msg
+		    ("I am the child, my PID is %d (expected %d)\n",
+		     getpid(), set_tid[0]);
+
+		if (set_tid[0] != getpid())
+			child_exit(EXIT_FAILURE);
+		child_exit(EXIT_SUCCESS);
+	}
+
+	ksft_print_msg("I am the parent (%d). My child's pid is %d\n",
+		       getpid(), pid);
+
+	if (waitpid(pid, &status, 0) < 0) {
+		ksft_print_msg("Child returned %s\n", strerror(errno));
+		return -errno;
+	}
+
+	if (!WIFEXITED(status))
+		return -1;
+
+	return WEXITSTATUS(status);
+}
+
+static int test_clone3_set_tid(pid_t * set_tid,
+			       size_t set_tid_size, int expected)
+{
+	int ret;
+
+	ksft_print_msg("[%d] Trying clone3() with CLONE_SET_TID to %d\n",
+		       getpid(), set_tid[0]);
+	ret = call_clone3_set_tid(set_tid, set_tid_size);
+
+	ksft_print_msg
+	    ("[%d] clone3() with CLONE_SET_TID %d says :%d - expected %d\n",
+	     getpid(), set_tid[0], ret, expected);
+	if (ret != expected) {
+		ksft_test_result_fail
+		    ("[%d] Result (%d) is different than expected (%d)\n",
+		     getpid(), ret, expected);
+		return -1;
+	}
+	ksft_test_result_pass
+	    ("[%d] Result (%d) matches expectation (%d)\n", getpid(), ret,
+	     expected);
+
+	return 0;
+}
+
+struct libcap {
+	struct __user_cap_header_struct hdr;
+	struct __user_cap_data_struct data[2];
+};
+
+static int set_capability()
+{
+	cap_value_t cap_values[] = { CAP_SETUID, CAP_SETGID };
+	struct libcap *cap;
+	int ret = -1;
+	cap_t caps;
+
+	caps = cap_get_proc();
+	if (!caps) {
+		perror("cap_get_proc");
+		return -1;
+	}
+
+	/* Drop all capabilities */
+	if (cap_clear(caps)) {
+		perror("cap_clear");
+		goto out;
+	}
+
+	cap_set_flag(caps, CAP_EFFECTIVE, 2, cap_values, CAP_SET);
+	cap_set_flag(caps, CAP_PERMITTED, 2, cap_values, CAP_SET);
+
+	cap = (struct libcap *) caps;
+
+	/* 40 -> CAP_CHECKPOINT_RESTORE */
+	cap->data[1].effective |= 1 << (40 - 32);
+	cap->data[1].permitted |= 1 << (40 - 32);
+
+	if (cap_set_proc(caps)) {
+		perror("cap_set_proc");
+		goto out;
+	}
+	ret = 0;
+out:
+	if (cap_free(caps))
+		perror("cap_free");
+	return ret;
+}
+
+int main(int argc, char *argv[])
+{
+	pid_t pid;
+	int status;
+	int ret = 0;
+	pid_t set_tid[1];
+	uid_t uid = getuid();
+
+	ksft_print_header();
+	test_clone3_supported();
+	ksft_set_plan(2);
+
+	if (uid != 0) {
+		ksft_cnt.ksft_xskip = ksft_plan;
+		ksft_print_msg("Skipping all tests as non-root\n");
+		return ksft_exit_pass();
+	}
+
+	memset(&set_tid, 0, sizeof(set_tid));
+
+	/* Find the current active PID */
+	pid = fork();
+	if (pid == 0) {
+		ksft_print_msg("Child has PID %d\n", getpid());
+		child_exit(EXIT_SUCCESS);
+	}
+	if (waitpid(pid, &status, 0) < 0)
+		ksft_exit_fail_msg("Waiting for child %d failed", pid);
+
+	/* After the child has finished, its PID should be free. */
+	set_tid[0] = pid;
+
+	if (set_capability())
+		ksft_test_result_fail
+		    ("Could not set CAP_CHECKPOINT_RESTORE\n");
+	prctl(PR_SET_KEEPCAPS, 1, 0, 0, 0);
+	/* This would fail without CAP_CHECKPOINT_RESTORE */
+	setgid(1000);
+	setuid(1000);
+	set_tid[0] = pid;
+	ret |= test_clone3_set_tid(set_tid, 1, -EPERM);
+	if (set_capability())
+		ksft_test_result_fail
+		    ("Could not set CAP_CHECKPOINT_RESTORE\n");
+	/* This should work as we have CAP_CHECKPOINT_RESTORE as non-root */
+	ret |= test_clone3_set_tid(set_tid, 1, 0);
+
+	return !ret ? ksft_exit_pass() : ksft_exit_fail();
+}
-- 
2.26.2


^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH v4 3/3] prctl: Allow ptrace capable processes to change /proc/self/exe
  2020-07-01  6:49 [PATCH v4 0/3] capabilities: Introduce CAP_CHECKPOINT_RESTORE Adrian Reber
  2020-07-01  6:49 ` [PATCH v4 1/3] " Adrian Reber
  2020-07-01  6:49 ` [PATCH v4 2/3] selftests: add clone3() CAP_CHECKPOINT_RESTORE test Adrian Reber
@ 2020-07-01  6:49 ` Adrian Reber
  2020-07-01  8:55   ` Christian Brauner
  2020-07-02 21:16   ` Serge E. Hallyn
  2 siblings, 2 replies; 17+ messages in thread
From: Adrian Reber @ 2020-07-01  6:49 UTC (permalink / raw)
  To: Christian Brauner, Eric Biederman, Pavel Emelyanov,
	Oleg Nesterov, Dmitry Safonov, Andrei Vagin, Nicolas Viennot,
	Michał Cłapiński, Kamil Yurtsever, Dirk Petersen,
	Christine Flood, Casey Schaufler
  Cc: Mike Rapoport, Radostin Stoyanov, Adrian Reber, Cyrill Gorcunov,
	Serge Hallyn, Stephen Smalley, Sargun Dhillon, Arnd Bergmann,
	linux-security-module, linux-kernel, selinux, Eric Paris,
	Jann Horn, linux-fsdevel

From: Nicolas Viennot <Nicolas.Viennot@twosigma.com>

Previously, the current process could only change the /proc/self/exe
link with local CAP_SYS_ADMIN.
This commit relaxes this restriction by permitting such change with
CAP_CHECKPOINT_RESTORE, and the ability to use ptrace.

With access to ptrace facilities, a process can do the following: fork a
child, execve() the target executable, and have the child use ptrace()
to replace the memory content of the current process. This technique
makes it possible to masquerade an arbitrary program as any executable,
even setuid ones.

Signed-off-by: Nicolas Viennot <Nicolas.Viennot@twosigma.com>
Signed-off-by: Adrian Reber <areber@redhat.com>
---
 include/linux/lsm_hook_defs.h |  1 +
 include/linux/security.h      |  6 ++++++
 kernel/sys.c                  | 12 ++++--------
 security/commoncap.c          | 26 ++++++++++++++++++++++++++
 security/security.c           |  5 +++++
 security/selinux/hooks.c      | 14 ++++++++++++++
 6 files changed, 56 insertions(+), 8 deletions(-)

diff --git a/include/linux/lsm_hook_defs.h b/include/linux/lsm_hook_defs.h
index 0098852bb56a..90e51d5e093b 100644
--- a/include/linux/lsm_hook_defs.h
+++ b/include/linux/lsm_hook_defs.h
@@ -211,6 +211,7 @@ LSM_HOOK(int, 0, task_kill, struct task_struct *p, struct kernel_siginfo *info,
 	 int sig, const struct cred *cred)
 LSM_HOOK(int, -ENOSYS, task_prctl, int option, unsigned long arg2,
 	 unsigned long arg3, unsigned long arg4, unsigned long arg5)
+LSM_HOOK(int, 0, prctl_set_mm_exe_file, struct file *exe_file)
 LSM_HOOK(void, LSM_RET_VOID, task_to_inode, struct task_struct *p,
 	 struct inode *inode)
 LSM_HOOK(int, 0, ipc_permission, struct kern_ipc_perm *ipcp, short flag)
diff --git a/include/linux/security.h b/include/linux/security.h
index 2797e7f6418e..0f594eb7e766 100644
--- a/include/linux/security.h
+++ b/include/linux/security.h
@@ -412,6 +412,7 @@ int security_task_kill(struct task_struct *p, struct kernel_siginfo *info,
 			int sig, const struct cred *cred);
 int security_task_prctl(int option, unsigned long arg2, unsigned long arg3,
 			unsigned long arg4, unsigned long arg5);
+int security_prctl_set_mm_exe_file(struct file *exe_file);
 void security_task_to_inode(struct task_struct *p, struct inode *inode);
 int security_ipc_permission(struct kern_ipc_perm *ipcp, short flag);
 void security_ipc_getsecid(struct kern_ipc_perm *ipcp, u32 *secid);
@@ -1124,6 +1125,11 @@ static inline int security_task_prctl(int option, unsigned long arg2,
 	return cap_task_prctl(option, arg2, arg3, arg4, arg5);
 }
 
+static inline int security_prctl_set_mm_exe_file(struct file *exe_file)
+{
+	return cap_prctl_set_mm_exe_file(exe_file);
+}
+
 static inline void security_task_to_inode(struct task_struct *p, struct inode *inode)
 { }
 
diff --git a/kernel/sys.c b/kernel/sys.c
index 00a96746e28a..bb53e8408c63 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -1851,6 +1851,10 @@ static int prctl_set_mm_exe_file(struct mm_struct *mm, unsigned int fd)
 	if (err)
 		goto exit;
 
+	err = security_prctl_set_mm_exe_file(exe.file);
+	if (err)
+		goto exit;
+
 	/*
 	 * Forbid mm->exe_file change if old file still mapped.
 	 */
@@ -2006,14 +2010,6 @@ static int prctl_set_mm_map(int opt, const void __user *addr, unsigned long data
 	}
 
 	if (prctl_map.exe_fd != (u32)-1) {
-		/*
-		 * Make sure the caller has the rights to
-		 * change /proc/pid/exe link: only local sys admin should
-		 * be allowed to.
-		 */
-		if (!ns_capable(current_user_ns(), CAP_SYS_ADMIN))
-			return -EINVAL;
-
 		error = prctl_set_mm_exe_file(mm, prctl_map.exe_fd);
 		if (error)
 			return error;
diff --git a/security/commoncap.c b/security/commoncap.c
index 59bf3c1674c8..663d00fe2ecc 100644
--- a/security/commoncap.c
+++ b/security/commoncap.c
@@ -1291,6 +1291,31 @@ int cap_task_prctl(int option, unsigned long arg2, unsigned long arg3,
 	}
 }
 
+/**
+ * cap_prctl_set_mm_exe_file - Determine whether /proc/self/exe can be changed
+ * by the current process.
+ * @exe_file: The new exe file
+ * Returns 0 if permission is granted, -ve if denied.
+ *
+ * The current process is permitted to change its /proc/self/exe link via two policies:
+ * 1) The current user can do checkpoint/restore. At the time of this writing,
+ *    this means CAP_SYS_ADMIN or CAP_CHECKPOINT_RESTORE capable.
+ * 2) The current user can use ptrace.
+ *
+ * With access to ptrace facilities, a process can do the following:
+ * fork a child, execve() the target executable, and have the child use
+ * ptrace() to replace the memory content of the current process.
+ * This technique makes it possible to masquerade an arbitrary program as the
+ * target executable, even if it is setuid.
+ */
+int cap_prctl_set_mm_exe_file(struct file *exe_file)
+{
+	if (checkpoint_restore_ns_capable(current_user_ns()))
+		return 0;
+
+	return security_ptrace_access_check(current, PTRACE_MODE_ATTACH_REALCREDS);
+}
+
 /**
  * cap_vm_enough_memory - Determine whether a new virtual mapping is permitted
  * @mm: The VM space in which the new mapping is to be made
@@ -1356,6 +1381,7 @@ static struct security_hook_list capability_hooks[] __lsm_ro_after_init = {
 	LSM_HOOK_INIT(mmap_file, cap_mmap_file),
 	LSM_HOOK_INIT(task_fix_setuid, cap_task_fix_setuid),
 	LSM_HOOK_INIT(task_prctl, cap_task_prctl),
+	LSM_HOOK_INIT(prctl_set_mm_exe_file, cap_prctl_set_mm_exe_file),
 	LSM_HOOK_INIT(task_setscheduler, cap_task_setscheduler),
 	LSM_HOOK_INIT(task_setioprio, cap_task_setioprio),
 	LSM_HOOK_INIT(task_setnice, cap_task_setnice),
diff --git a/security/security.c b/security/security.c
index 2bb912496232..13a1ed32f9e3 100644
--- a/security/security.c
+++ b/security/security.c
@@ -1790,6 +1790,11 @@ int security_task_prctl(int option, unsigned long arg2, unsigned long arg3,
 	return rc;
 }
 
+int security_prctl_set_mm_exe_file(struct file *exe_file)
+{
+	return call_int_hook(prctl_set_mm_exe_file, 0, exe_file);
+}
+
 void security_task_to_inode(struct task_struct *p, struct inode *inode)
 {
 	call_void_hook(task_to_inode, p, inode);
diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c
index ca901025802a..fca5581392b8 100644
--- a/security/selinux/hooks.c
+++ b/security/selinux/hooks.c
@@ -4156,6 +4156,19 @@ static int selinux_task_kill(struct task_struct *p, struct kernel_siginfo *info,
 			    secid, task_sid(p), SECCLASS_PROCESS, perm, NULL);
 }
 
+static int selinux_prctl_set_mm_exe_file(struct file *exe_file)
+{
+	u32 sid = current_sid();
+
+	struct common_audit_data ad = {
+		.type = LSM_AUDIT_DATA_FILE,
+		.u.file = exe_file,
+	};
+
+	return avc_has_perm(&selinux_state, sid, sid,
+			    SECCLASS_FILE, FILE__EXECUTE_NO_TRANS, &ad);
+}
+
 static void selinux_task_to_inode(struct task_struct *p,
 				  struct inode *inode)
 {
@@ -7057,6 +7070,7 @@ static struct security_hook_list selinux_hooks[] __lsm_ro_after_init = {
 	LSM_HOOK_INIT(task_getscheduler, selinux_task_getscheduler),
 	LSM_HOOK_INIT(task_movememory, selinux_task_movememory),
 	LSM_HOOK_INIT(task_kill, selinux_task_kill),
+	LSM_HOOK_INIT(prctl_set_mm_exe_file, selinux_prctl_set_mm_exe_file),
 	LSM_HOOK_INIT(task_to_inode, selinux_task_to_inode),
 
 	LSM_HOOK_INIT(ipc_permission, selinux_ipc_permission),
-- 
2.26.2


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v4 1/3] capabilities: Introduce CAP_CHECKPOINT_RESTORE
  2020-07-01  6:49 ` [PATCH v4 1/3] " Adrian Reber
@ 2020-07-01  8:27   ` Christian Brauner
  2020-07-03 11:11     ` Adrian Reber
  0 siblings, 1 reply; 17+ messages in thread
From: Christian Brauner @ 2020-07-01  8:27 UTC (permalink / raw)
  To: Adrian Reber
  Cc: Eric Biederman, Pavel Emelyanov, Oleg Nesterov, Dmitry Safonov,
	Andrei Vagin, Nicolas Viennot, Michał Cłapiński,
	Kamil Yurtsever, Dirk Petersen, Christine Flood, Casey Schaufler,
	Mike Rapoport, Radostin Stoyanov, Cyrill Gorcunov, Serge Hallyn,
	Stephen Smalley, Sargun Dhillon, Arnd Bergmann,
	linux-security-module, linux-kernel, selinux, Eric Paris,
	Jann Horn, linux-fsdevel

On Wed, Jul 01, 2020 at 08:49:04AM +0200, Adrian Reber wrote:
> This patch introduces CAP_CHECKPOINT_RESTORE, a new capability facilitating
> checkpoint/restore for non-root users.
> 
> Over the last years, The CRIU (Checkpoint/Restore In Userspace) team has been
> asked numerous times if it is possible to checkpoint/restore a process as
> non-root. The answer usually was: 'almost'.
> 
> The main blocker to restore a process as non-root was to control the PID of the
> restored process. This feature available via the clone3 system call, or via
> /proc/sys/kernel/ns_last_pid is unfortunately guarded by CAP_SYS_ADMIN.
> 
> In the past two years, requests for non-root checkpoint/restore have increased
> due to the following use cases:
> * Checkpoint/Restore in an HPC environment in combination with a resource
>   manager distributing jobs where users are always running as non-root.
>   There is a desire to provide a way to checkpoint and restore long running
>   jobs.
> * Container migration as non-root
> * We have been in contact with JVM developers who are integrating
>   CRIU into a Java VM to decrease the startup time. These checkpoint/restore
>   applications are not meant to be running with CAP_SYS_ADMIN.
> 
> We have seen the following workarounds:
> * Use a setuid wrapper around CRIU:
>   See https://github.com/FredHutch/slurm-examples/blob/master/checkpointer/lib/checkpointer/checkpointer-suid.c
> * Use a setuid helper that writes to ns_last_pid.
>   Unfortunately, this helper delegation technique is impossible to use with
>   clone3, and is thus prone to races.
>   See https://github.com/twosigma/set_ns_last_pid
> * Cycle through PIDs with fork() until the desired PID is reached:
>   This has been demonstrated to work with cycling rates of 100,000 PIDs/s
>   See https://github.com/twosigma/set_ns_last_pid
> * Patch out the CAP_SYS_ADMIN check from the kernel
> * Run the desired application in a new user and PID namespace to provide
>   a local CAP_SYS_ADMIN for controlling PIDs. This technique has limited use in
>   typical container environments (e.g., Kubernetes) as /proc is
>   typically protected with read-only layers (e.g., /proc/sys) for hardening
>   purposes. Read-only layers prevent additional /proc mounts (due to proc's
>   SB_I_USERNS_VISIBLE property), making the use of new PID namespaces limited as
>   certain applications need access to /proc matching their PID namespace.
> 
> The introduced capability allows to:
> * Control PIDs when the current user is CAP_CHECKPOINT_RESTORE capable
>   for the corresponding PID namespace via ns_last_pid/clone3.
> * Open files in /proc/pid/map_files when the current user is
>   CAP_CHECKPOINT_RESTORE capable in the root namespace, useful for recovering
>   files that are unreachable via the file system such as deleted files, or memfd
>   files.
> 
> See corresponding selftest for an example with clone3().
> 
> Signed-off-by: Adrian Reber <areber@redhat.com>
> Signed-off-by: Nicolas Viennot <Nicolas.Viennot@twosigma.com>
> ---

I think that now looks reasonable. A few comments.

Before we proceed, please split the addition of
checkpoint_restore_ns_capable() out into a separate patch.
In fact, I think the cleanest way of doing this would be:
- 0/n capability: add CAP_CHECKPOINT_RESTORE
- 1/n pid: use checkpoint_restore_ns_capable() for set_tid
- 2/n pid_namespace: use checkpoint_restore_ns_capable() for ns_last_pid
- 3/n: proc: require checkpoint_restore_ns_capable() in init userns for map_files

(commit subjects up to you of course) and a nice commit message for each
time we relax a permissions on something so we have a clear separate
track record for each change in case we need to revert something. Then
the rest of the patches in this series. Testing patches probably last.

>  fs/proc/base.c                      | 8 ++++----
>  include/linux/capability.h          | 6 ++++++
>  include/uapi/linux/capability.h     | 9 ++++++++-
>  kernel/pid.c                        | 2 +-
>  kernel/pid_namespace.c              | 2 +-
>  security/selinux/include/classmap.h | 5 +++--
>  6 files changed, 23 insertions(+), 9 deletions(-)
> 
> diff --git a/fs/proc/base.c b/fs/proc/base.c
> index d86c0afc8a85..ad806069c778 100644
> --- a/fs/proc/base.c
> +++ b/fs/proc/base.c
> @@ -2189,16 +2189,16 @@ struct map_files_info {
>  };
>  
>  /*
> - * Only allow CAP_SYS_ADMIN to follow the links, due to concerns about how the
> - * symlinks may be used to bypass permissions on ancestor directories in the
> - * path to the file in question.
> + * Only allow CAP_SYS_ADMIN and CAP_CHECKPOINT_RESTORE to follow the links, due
> + * to concerns about how the symlinks may be used to bypass permissions on
> + * ancestor directories in the path to the file in question.
>   */
>  static const char *
>  proc_map_files_get_link(struct dentry *dentry,
>  			struct inode *inode,
>  		        struct delayed_call *done)
>  {
> -	if (!capable(CAP_SYS_ADMIN))
> +	if (!capable(CAP_SYS_ADMIN) && !capable(CAP_CHECKPOINT_RESTORE))
>  		return ERR_PTR(-EPERM);

I think it's clearer if you just use:
checkpoint_restore_ns_capable(&init_user_ns)

> +static inline bool checkpoint_restore_ns_capable(struct user_namespace *ns)

>  
>  	return proc_pid_get_link(dentry, inode, done);
> diff --git a/include/linux/capability.h b/include/linux/capability.h
> index b4345b38a6be..1e7fe311cabe 100644
> --- a/include/linux/capability.h
> +++ b/include/linux/capability.h
> @@ -261,6 +261,12 @@ static inline bool bpf_capable(void)
>  	return capable(CAP_BPF) || capable(CAP_SYS_ADMIN);
>  }
>  
> +static inline bool checkpoint_restore_ns_capable(struct user_namespace *ns)
> +{
> +	return ns_capable(ns, CAP_CHECKPOINT_RESTORE) ||
> +		ns_capable(ns, CAP_SYS_ADMIN);
> +}
> +
>  /* audit system wants to get cap info from files as well */
>  extern int get_vfs_caps_from_disk(const struct dentry *dentry, struct cpu_vfs_cap_data *cpu_caps);
>  
> diff --git a/include/uapi/linux/capability.h b/include/uapi/linux/capability.h
> index 48ff0757ae5e..395dd0df8d08 100644
> --- a/include/uapi/linux/capability.h
> +++ b/include/uapi/linux/capability.h
> @@ -408,7 +408,14 @@ struct vfs_ns_cap_data {
>   */
>  #define CAP_BPF			39
>  
> -#define CAP_LAST_CAP         CAP_BPF
> +
> +/* Allow checkpoint/restore related operations */
> +/* Allow PID selection during clone3() */
> +/* Allow writing to ns_last_pid */
> +
> +#define CAP_CHECKPOINT_RESTORE	40
> +
> +#define CAP_LAST_CAP         CAP_CHECKPOINT_RESTORE
>  
>  #define cap_valid(x) ((x) >= 0 && (x) <= CAP_LAST_CAP)
>  
> diff --git a/kernel/pid.c b/kernel/pid.c
> index 5799ae54b89e..2d0a97b7ed7a 100644
> --- a/kernel/pid.c
> +++ b/kernel/pid.c
> @@ -198,7 +198,7 @@ struct pid *alloc_pid(struct pid_namespace *ns, pid_t *set_tid,
>  			if (tid != 1 && !tmp->child_reaper)
>  				goto out_free;
>  			retval = -EPERM;
> -			if (!ns_capable(tmp->user_ns, CAP_SYS_ADMIN))
> +			if (!checkpoint_restore_ns_capable(tmp->user_ns))
>  				goto out_free;
>  			set_tid_size--;
>  		}
> diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
> index 0e5ac162c3a8..ac135bd600eb 100644
> --- a/kernel/pid_namespace.c
> +++ b/kernel/pid_namespace.c
> @@ -269,7 +269,7 @@ static int pid_ns_ctl_handler(struct ctl_table *table, int write,
>  	struct ctl_table tmp = *table;
>  	int ret, next;
>  
> -	if (write && !ns_capable(pid_ns->user_ns, CAP_SYS_ADMIN))
> +	if (write && !checkpoint_restore_ns_capable(pid_ns->user_ns))
>  		return -EPERM;
>  
>  	/*
> diff --git a/security/selinux/include/classmap.h b/security/selinux/include/classmap.h
> index 98e1513b608a..40cebde62856 100644
> --- a/security/selinux/include/classmap.h
> +++ b/security/selinux/include/classmap.h
> @@ -27,9 +27,10 @@
>  	    "audit_control", "setfcap"
>  
>  #define COMMON_CAP2_PERMS  "mac_override", "mac_admin", "syslog", \
> -		"wake_alarm", "block_suspend", "audit_read", "perfmon", "bpf"
> +		"wake_alarm", "block_suspend", "audit_read", "perfmon", "bpf", \
> +		"checkpoint_restore"
>  
> -#if CAP_LAST_CAP > CAP_BPF
> +#if CAP_LAST_CAP > CAP_CHECKPOINT_RESTORE
>  #error New capability defined, please update COMMON_CAP2_PERMS.
>  #endif
>  
> -- 
> 2.26.2
> 

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v4 3/3] prctl: Allow ptrace capable processes to change /proc/self/exe
  2020-07-01  6:49 ` [PATCH v4 3/3] prctl: Allow ptrace capable processes to change /proc/self/exe Adrian Reber
@ 2020-07-01  8:55   ` Christian Brauner
  2020-07-02 21:58     ` Serge E. Hallyn
  2020-07-02 21:16   ` Serge E. Hallyn
  1 sibling, 1 reply; 17+ messages in thread
From: Christian Brauner @ 2020-07-01  8:55 UTC (permalink / raw)
  To: Adrian Reber
  Cc: Eric Biederman, Pavel Emelyanov, Oleg Nesterov, Dmitry Safonov,
	Andrei Vagin, Nicolas Viennot, Michał Cłapiński,
	Kamil Yurtsever, Dirk Petersen, Christine Flood, Casey Schaufler,
	Mike Rapoport, Radostin Stoyanov, Cyrill Gorcunov, Serge Hallyn,
	Stephen Smalley, Sargun Dhillon, Arnd Bergmann,
	linux-security-module, linux-kernel, selinux, Eric Paris,
	Jann Horn, linux-fsdevel

On Wed, Jul 01, 2020 at 08:49:06AM +0200, Adrian Reber wrote:
> From: Nicolas Viennot <Nicolas.Viennot@twosigma.com>
> 
> Previously, the current process could only change the /proc/self/exe
> link with local CAP_SYS_ADMIN.
> This commit relaxes this restriction by permitting such change with
> CAP_CHECKPOINT_RESTORE, and the ability to use ptrace.
> 
> With access to ptrace facilities, a process can do the following: fork a
> child, execve() the target executable, and have the child use ptrace()
> to replace the memory content of the current process. This technique
> makes it possible to masquerade an arbitrary program as any executable,
> even setuid ones.
> 
> Signed-off-by: Nicolas Viennot <Nicolas.Viennot@twosigma.com>
> Signed-off-by: Adrian Reber <areber@redhat.com>
> ---
>  include/linux/lsm_hook_defs.h |  1 +
>  include/linux/security.h      |  6 ++++++
>  kernel/sys.c                  | 12 ++++--------
>  security/commoncap.c          | 26 ++++++++++++++++++++++++++
>  security/security.c           |  5 +++++
>  security/selinux/hooks.c      | 14 ++++++++++++++
>  6 files changed, 56 insertions(+), 8 deletions(-)
> 
> diff --git a/include/linux/lsm_hook_defs.h b/include/linux/lsm_hook_defs.h
> index 0098852bb56a..90e51d5e093b 100644
> --- a/include/linux/lsm_hook_defs.h
> +++ b/include/linux/lsm_hook_defs.h
> @@ -211,6 +211,7 @@ LSM_HOOK(int, 0, task_kill, struct task_struct *p, struct kernel_siginfo *info,
>  	 int sig, const struct cred *cred)
>  LSM_HOOK(int, -ENOSYS, task_prctl, int option, unsigned long arg2,
>  	 unsigned long arg3, unsigned long arg4, unsigned long arg5)
> +LSM_HOOK(int, 0, prctl_set_mm_exe_file, struct file *exe_file)
>  LSM_HOOK(void, LSM_RET_VOID, task_to_inode, struct task_struct *p,
>  	 struct inode *inode)
>  LSM_HOOK(int, 0, ipc_permission, struct kern_ipc_perm *ipcp, short flag)
> diff --git a/include/linux/security.h b/include/linux/security.h
> index 2797e7f6418e..0f594eb7e766 100644
> --- a/include/linux/security.h
> +++ b/include/linux/security.h
> @@ -412,6 +412,7 @@ int security_task_kill(struct task_struct *p, struct kernel_siginfo *info,
>  			int sig, const struct cred *cred);
>  int security_task_prctl(int option, unsigned long arg2, unsigned long arg3,
>  			unsigned long arg4, unsigned long arg5);
> +int security_prctl_set_mm_exe_file(struct file *exe_file);
>  void security_task_to_inode(struct task_struct *p, struct inode *inode);
>  int security_ipc_permission(struct kern_ipc_perm *ipcp, short flag);
>  void security_ipc_getsecid(struct kern_ipc_perm *ipcp, u32 *secid);
> @@ -1124,6 +1125,11 @@ static inline int security_task_prctl(int option, unsigned long arg2,
>  	return cap_task_prctl(option, arg2, arg3, arg4, arg5);
>  }
>  
> +static inline int security_prctl_set_mm_exe_file(struct file *exe_file)
> +{
> +	return cap_prctl_set_mm_exe_file(exe_file);
> +}
> +
>  static inline void security_task_to_inode(struct task_struct *p, struct inode *inode)
>  { }
>  
> diff --git a/kernel/sys.c b/kernel/sys.c
> index 00a96746e28a..bb53e8408c63 100644
> --- a/kernel/sys.c
> +++ b/kernel/sys.c
> @@ -1851,6 +1851,10 @@ static int prctl_set_mm_exe_file(struct mm_struct *mm, unsigned int fd)
>  	if (err)
>  		goto exit;
>  
> +	err = security_prctl_set_mm_exe_file(exe.file);
> +	if (err)
> +		goto exit;
> +
>  	/*
>  	 * Forbid mm->exe_file change if old file still mapped.
>  	 */
> @@ -2006,14 +2010,6 @@ static int prctl_set_mm_map(int opt, const void __user *addr, unsigned long data
>  	}
>  
>  	if (prctl_map.exe_fd != (u32)-1) {
> -		/*
> -		 * Make sure the caller has the rights to
> -		 * change /proc/pid/exe link: only local sys admin should
> -		 * be allowed to.
> -		 */
> -		if (!ns_capable(current_user_ns(), CAP_SYS_ADMIN))
> -			return -EINVAL;
> -
>  		error = prctl_set_mm_exe_file(mm, prctl_map.exe_fd);
>  		if (error)
>  			return error;
> diff --git a/security/commoncap.c b/security/commoncap.c
> index 59bf3c1674c8..663d00fe2ecc 100644
> --- a/security/commoncap.c
> +++ b/security/commoncap.c
> @@ -1291,6 +1291,31 @@ int cap_task_prctl(int option, unsigned long arg2, unsigned long arg3,
>  	}
>  }
>  
> +/**
> + * cap_prctl_set_mm_exe_file - Determine whether /proc/self/exe can be changed
> + * by the current process.
> + * @exe_file: The new exe file
> + * Returns 0 if permission is granted, -ve if denied.
> + *
> + * The current process is permitted to change its /proc/self/exe link via two policies:
> + * 1) The current user can do checkpoint/restore. At the time of this writing,
> + *    this means CAP_SYS_ADMIN or CAP_CHECKPOINT_RESTORE capable.
> + * 2) The current user can use ptrace.
> + *
> + * With access to ptrace facilities, a process can do the following:
> + * fork a child, execve() the target executable, and have the child use
> + * ptrace() to replace the memory content of the current process.
> + * This technique makes it possible to masquerade an arbitrary program as the
> + * target executable, even if it is setuid.
> + */
> +int cap_prctl_set_mm_exe_file(struct file *exe_file)
> +{
> +	if (checkpoint_restore_ns_capable(current_user_ns()))
> +		return 0;
> +
> +	return security_ptrace_access_check(current, PTRACE_MODE_ATTACH_REALCREDS);
> +}

What is the reason for having this be a new security hook? Doesn't look
like it needs to be unless I'm missing something. This just seems more
complex than it needs to be.

I might be wrong here but if you look at the callsites for
security_ptrace_access_check() right now, you'll see that it's only
called from kernel/ptrace.c in __ptrace_may_access() and that function
checks right at the top:

	/* Don't let security modules deny introspection */
	if (same_thread_group(task, current))
		return 0;

since you're passing in same_thread_group(current, current) you're
passing this check and never even hitting
security_ptrace_access_check(). So the contract seems to be (as is
obvious from the comment) that a task can't be denied ptrace
introspection. But if you're using security_ptrace_access_check(current)
here and _if_ there would be any lsm that would deny ptrace
introspection to current you'd suddenly introduce a callsite where
ptrace introspection is denied. That seems wrong. So either you meant to
do something else here or you really just want:

checkpoint_restore_ns_capable(current_user_ns())

and none of the rest. But I might be missing the big picture in this
patch.

> +	if (checkpoint_restore_ns_capable(current_user_ns()))
> +		return 0;
> +
>  /**
>   * cap_vm_enough_memory - Determine whether a new virtual mapping is permitted
>   * @mm: The VM space in which the new mapping is to be made
> @@ -1356,6 +1381,7 @@ static struct security_hook_list capability_hooks[] __lsm_ro_after_init = {
>  	LSM_HOOK_INIT(mmap_file, cap_mmap_file),
>  	LSM_HOOK_INIT(task_fix_setuid, cap_task_fix_setuid),
>  	LSM_HOOK_INIT(task_prctl, cap_task_prctl),
> +	LSM_HOOK_INIT(prctl_set_mm_exe_file, cap_prctl_set_mm_exe_file),
>  	LSM_HOOK_INIT(task_setscheduler, cap_task_setscheduler),
>  	LSM_HOOK_INIT(task_setioprio, cap_task_setioprio),
>  	LSM_HOOK_INIT(task_setnice, cap_task_setnice),
> diff --git a/security/security.c b/security/security.c
> index 2bb912496232..13a1ed32f9e3 100644
> --- a/security/security.c
> +++ b/security/security.c
> @@ -1790,6 +1790,11 @@ int security_task_prctl(int option, unsigned long arg2, unsigned long arg3,
>  	return rc;
>  }
>  
> +int security_prctl_set_mm_exe_file(struct file *exe_file)
> +{
> +	return call_int_hook(prctl_set_mm_exe_file, 0, exe_file);
> +}
> +
>  void security_task_to_inode(struct task_struct *p, struct inode *inode)
>  {
>  	call_void_hook(task_to_inode, p, inode);
> diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c
> index ca901025802a..fca5581392b8 100644
> --- a/security/selinux/hooks.c
> +++ b/security/selinux/hooks.c
> @@ -4156,6 +4156,19 @@ static int selinux_task_kill(struct task_struct *p, struct kernel_siginfo *info,
>  			    secid, task_sid(p), SECCLASS_PROCESS, perm, NULL);
>  }
>  
> +static int selinux_prctl_set_mm_exe_file(struct file *exe_file)
> +{
> +	u32 sid = current_sid();
> +
> +	struct common_audit_data ad = {
> +		.type = LSM_AUDIT_DATA_FILE,
> +		.u.file = exe_file,
> +	};
> +
> +	return avc_has_perm(&selinux_state, sid, sid,
> +			    SECCLASS_FILE, FILE__EXECUTE_NO_TRANS, &ad);
> +}
> +
>  static void selinux_task_to_inode(struct task_struct *p,
>  				  struct inode *inode)
>  {
> @@ -7057,6 +7070,7 @@ static struct security_hook_list selinux_hooks[] __lsm_ro_after_init = {
>  	LSM_HOOK_INIT(task_getscheduler, selinux_task_getscheduler),
>  	LSM_HOOK_INIT(task_movememory, selinux_task_movememory),
>  	LSM_HOOK_INIT(task_kill, selinux_task_kill),
> +	LSM_HOOK_INIT(prctl_set_mm_exe_file, selinux_prctl_set_mm_exe_file),
>  	LSM_HOOK_INIT(task_to_inode, selinux_task_to_inode),
>  
>  	LSM_HOOK_INIT(ipc_permission, selinux_ipc_permission),
> -- 
> 2.26.2
> 

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v4 2/3] selftests: add clone3() CAP_CHECKPOINT_RESTORE test
  2020-07-01  6:49 ` [PATCH v4 2/3] selftests: add clone3() CAP_CHECKPOINT_RESTORE test Adrian Reber
@ 2020-07-02 20:53   ` Serge E. Hallyn
  2020-07-03 11:18     ` Adrian Reber
  0 siblings, 1 reply; 17+ messages in thread
From: Serge E. Hallyn @ 2020-07-02 20:53 UTC (permalink / raw)
  To: Adrian Reber
  Cc: Christian Brauner, Eric Biederman, Pavel Emelyanov,
	Oleg Nesterov, Dmitry Safonov, Andrei Vagin, Nicolas Viennot,
	Michał Cłapiński, Kamil Yurtsever, Dirk Petersen,
	Christine Flood, Casey Schaufler, Mike Rapoport,
	Radostin Stoyanov, Cyrill Gorcunov, Serge Hallyn,
	Stephen Smalley, Sargun Dhillon, Arnd Bergmann,
	linux-security-module, linux-kernel, selinux, Eric Paris,
	Jann Horn, linux-fsdevel

On Wed, Jul 01, 2020 at 08:49:05AM +0200, Adrian Reber wrote:
> This adds a test that changes its UID, uses capabilities to
> get CAP_CHECKPOINT_RESTORE and uses clone3() with set_tid to
> create a process with a given PID as non-root.

Seems worth also verifying that it fails if you have no capabilities.
I don't see that in the existing clone3/ test dir.


> Signed-off-by: Adrian Reber <areber@redhat.com>
> ---
>  tools/testing/selftests/clone3/Makefile       |   4 +-
>  .../clone3/clone3_cap_checkpoint_restore.c    | 203 ++++++++++++++++++
>  2 files changed, 206 insertions(+), 1 deletion(-)
>  create mode 100644 tools/testing/selftests/clone3/clone3_cap_checkpoint_restore.c
> 
> diff --git a/tools/testing/selftests/clone3/Makefile b/tools/testing/selftests/clone3/Makefile
> index cf976c732906..ef7564cb7abe 100644
> --- a/tools/testing/selftests/clone3/Makefile
> +++ b/tools/testing/selftests/clone3/Makefile
> @@ -1,6 +1,8 @@
>  # SPDX-License-Identifier: GPL-2.0
>  CFLAGS += -g -I../../../../usr/include/
> +LDLIBS += -lcap
>  
> -TEST_GEN_PROGS := clone3 clone3_clear_sighand clone3_set_tid
> +TEST_GEN_PROGS := clone3 clone3_clear_sighand clone3_set_tid \
> +	clone3_cap_checkpoint_restore
>  
>  include ../lib.mk
> diff --git a/tools/testing/selftests/clone3/clone3_cap_checkpoint_restore.c b/tools/testing/selftests/clone3/clone3_cap_checkpoint_restore.c
> new file mode 100644
> index 000000000000..2cc3d57b91f2
> --- /dev/null
> +++ b/tools/testing/selftests/clone3/clone3_cap_checkpoint_restore.c
> @@ -0,0 +1,203 @@
> +// SPDX-License-Identifier: GPL-2.0
> +
> +/*
> + * Based on Christian Brauner's clone3() example.
> + * These tests are assuming to be running in the host's
> + * PID namespace.
> + */
> +
> +/* capabilities related code based on selftests/bpf/test_verifier.c */
> +
> +#define _GNU_SOURCE
> +#include <errno.h>
> +#include <linux/types.h>
> +#include <linux/sched.h>
> +#include <stdio.h>
> +#include <stdlib.h>
> +#include <stdbool.h>
> +#include <sys/capability.h>
> +#include <sys/prctl.h>
> +#include <sys/syscall.h>
> +#include <sys/types.h>
> +#include <sys/un.h>
> +#include <sys/wait.h>
> +#include <unistd.h>
> +#include <sched.h>
> +
> +#include "../kselftest.h"
> +#include "clone3_selftests.h"
> +
> +#ifndef MAX_PID_NS_LEVEL
> +#define MAX_PID_NS_LEVEL 32
> +#endif
> +
> +static void child_exit(int ret)
> +{
> +	fflush(stdout);
> +	fflush(stderr);
> +	_exit(ret);
> +}
> +
> +static int call_clone3_set_tid(pid_t * set_tid, size_t set_tid_size)
> +{
> +	int status;
> +	pid_t pid = -1;
> +
> +	struct clone_args args = {
> +		.exit_signal = SIGCHLD,
> +		.set_tid = ptr_to_u64(set_tid),
> +		.set_tid_size = set_tid_size,
> +	};
> +
> +	pid = sys_clone3(&args, sizeof(struct clone_args));
> +	if (pid < 0) {
> +		ksft_print_msg("%s - Failed to create new process\n",
> +			       strerror(errno));
> +		return -errno;
> +	}
> +
> +	if (pid == 0) {
> +		int ret;
> +		char tmp = 0;
> +
> +		ksft_print_msg
> +		    ("I am the child, my PID is %d (expected %d)\n",
> +		     getpid(), set_tid[0]);
> +
> +		if (set_tid[0] != getpid())
> +			child_exit(EXIT_FAILURE);
> +		child_exit(EXIT_SUCCESS);
> +	}
> +
> +	ksft_print_msg("I am the parent (%d). My child's pid is %d\n",
> +		       getpid(), pid);
> +
> +	if (waitpid(pid, &status, 0) < 0) {
> +		ksft_print_msg("Child returned %s\n", strerror(errno));
> +		return -errno;
> +	}
> +
> +	if (!WIFEXITED(status))
> +		return -1;
> +
> +	return WEXITSTATUS(status);
> +}
> +
> +static int test_clone3_set_tid(pid_t * set_tid,
> +			       size_t set_tid_size, int expected)
> +{
> +	int ret;
> +
> +	ksft_print_msg("[%d] Trying clone3() with CLONE_SET_TID to %d\n",
> +		       getpid(), set_tid[0]);
> +	ret = call_clone3_set_tid(set_tid, set_tid_size);
> +
> +	ksft_print_msg
> +	    ("[%d] clone3() with CLONE_SET_TID %d says :%d - expected %d\n",
> +	     getpid(), set_tid[0], ret, expected);
> +	if (ret != expected) {
> +		ksft_test_result_fail
> +		    ("[%d] Result (%d) is different than expected (%d)\n",
> +		     getpid(), ret, expected);
> +		return -1;
> +	}
> +	ksft_test_result_pass
> +	    ("[%d] Result (%d) matches expectation (%d)\n", getpid(), ret,
> +	     expected);
> +
> +	return 0;
> +}
> +
> +struct libcap {
> +	struct __user_cap_header_struct hdr;
> +	struct __user_cap_data_struct data[2];
> +};
> +
> +static int set_capability()
> +{
> +	cap_value_t cap_values[] = { CAP_SETUID, CAP_SETGID };
> +	struct libcap *cap;
> +	int ret = -1;
> +	cap_t caps;
> +
> +	caps = cap_get_proc();
> +	if (!caps) {
> +		perror("cap_get_proc");
> +		return -1;
> +	}
> +
> +	/* Drop all capabilities */
> +	if (cap_clear(caps)) {
> +		perror("cap_clear");
> +		goto out;
> +	}
> +
> +	cap_set_flag(caps, CAP_EFFECTIVE, 2, cap_values, CAP_SET);
> +	cap_set_flag(caps, CAP_PERMITTED, 2, cap_values, CAP_SET);
> +
> +	cap = (struct libcap *) caps;
> +
> +	/* 40 -> CAP_CHECKPOINT_RESTORE */
> +	cap->data[1].effective |= 1 << (40 - 32);
> +	cap->data[1].permitted |= 1 << (40 - 32);
> +
> +	if (cap_set_proc(caps)) {
> +		perror("cap_set_proc");
> +		goto out;
> +	}
> +	ret = 0;
> +out:
> +	if (cap_free(caps))
> +		perror("cap_free");
> +	return ret;
> +}
> +
> +int main(int argc, char *argv[])
> +{
> +	pid_t pid;
> +	int status;
> +	int ret = 0;
> +	pid_t set_tid[1];
> +	uid_t uid = getuid();
> +
> +	ksft_print_header();
> +	test_clone3_supported();
> +	ksft_set_plan(2);
> +
> +	if (uid != 0) {
> +		ksft_cnt.ksft_xskip = ksft_plan;
> +		ksft_print_msg("Skipping all tests as non-root\n");
> +		return ksft_exit_pass();
> +	}
> +
> +	memset(&set_tid, 0, sizeof(set_tid));
> +
> +	/* Find the current active PID */
> +	pid = fork();
> +	if (pid == 0) {
> +		ksft_print_msg("Child has PID %d\n", getpid());
> +		child_exit(EXIT_SUCCESS);
> +	}
> +	if (waitpid(pid, &status, 0) < 0)
> +		ksft_exit_fail_msg("Waiting for child %d failed", pid);
> +
> +	/* After the child has finished, its PID should be free. */
> +	set_tid[0] = pid;
> +
> +	if (set_capability())
> +		ksft_test_result_fail
> +		    ("Could not set CAP_CHECKPOINT_RESTORE\n");
> +	prctl(PR_SET_KEEPCAPS, 1, 0, 0, 0);
> +	/* This would fail without CAP_CHECKPOINT_RESTORE */
> +	setgid(1000);
> +	setuid(1000);
> +	set_tid[0] = pid;
> +	ret |= test_clone3_set_tid(set_tid, 1, -EPERM);
> +	if (set_capability())
> +		ksft_test_result_fail
> +		    ("Could not set CAP_CHECKPOINT_RESTORE\n");
> +	/* This should work as we have CAP_CHECKPOINT_RESTORE as non-root */
> +	ret |= test_clone3_set_tid(set_tid, 1, 0);
> +
> +	return !ret ? ksft_exit_pass() : ksft_exit_fail();
> +}
> -- 
> 2.26.2

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v4 3/3] prctl: Allow ptrace capable processes to change /proc/self/exe
  2020-07-01  6:49 ` [PATCH v4 3/3] prctl: Allow ptrace capable processes to change /proc/self/exe Adrian Reber
  2020-07-01  8:55   ` Christian Brauner
@ 2020-07-02 21:16   ` Serge E. Hallyn
  2020-07-02 22:00     ` Paul Moore
  1 sibling, 1 reply; 17+ messages in thread
From: Serge E. Hallyn @ 2020-07-02 21:16 UTC (permalink / raw)
  To: Adrian Reber, Paul Moore
  Cc: Christian Brauner, Eric Biederman, Pavel Emelyanov,
	Oleg Nesterov, Dmitry Safonov, Andrei Vagin, Nicolas Viennot,
	Michał Cłapiński, Kamil Yurtsever, Dirk Petersen,
	Christine Flood, Casey Schaufler, Mike Rapoport,
	Radostin Stoyanov, Cyrill Gorcunov, Serge Hallyn,
	Stephen Smalley, Sargun Dhillon, Arnd Bergmann,
	linux-security-module, linux-kernel, selinux, Eric Paris,
	Jann Horn, linux-fsdevel

On Wed, Jul 01, 2020 at 08:49:06AM +0200, Adrian Reber wrote:
> From: Nicolas Viennot <Nicolas.Viennot@twosigma.com>
> 
> Previously, the current process could only change the /proc/self/exe
> link with local CAP_SYS_ADMIN.
> This commit relaxes this restriction by permitting such change with
> CAP_CHECKPOINT_RESTORE, and the ability to use ptrace.
> 
> With access to ptrace facilities, a process can do the following: fork a
> child, execve() the target executable, and have the child use ptrace()
> to replace the memory content of the current process. This technique
> makes it possible to masquerade an arbitrary program as any executable,
> even setuid ones.
> 
> Signed-off-by: Nicolas Viennot <Nicolas.Viennot@twosigma.com>
> Signed-off-by: Adrian Reber <areber@redhat.com>

This is scary.  But I believe it is safe.

Reviewed-by: Serge Hallyn <serge@hallyn.com>

I am a bit curious about the implications of the selinux patch.
IIUC you are using the permission of the tracing process to
execute the file without transition, so this is a way to work
around the policy which might prevent the tracee from doing so.
Given that SELinux wants to be MAC, I'm not *quite* sure that's
considered kosher.  You also are skipping the PROCESS__PTRACE
to SECCLASS_PROCESS check which selinux_bprm_set_creds does later
on.  Again I'm just not quite sure what's considered normal there
these days.

Paul, do you have input there?

> ---
>  include/linux/lsm_hook_defs.h |  1 +
>  include/linux/security.h      |  6 ++++++
>  kernel/sys.c                  | 12 ++++--------
>  security/commoncap.c          | 26 ++++++++++++++++++++++++++
>  security/security.c           |  5 +++++
>  security/selinux/hooks.c      | 14 ++++++++++++++
>  6 files changed, 56 insertions(+), 8 deletions(-)
> 
> diff --git a/include/linux/lsm_hook_defs.h b/include/linux/lsm_hook_defs.h
> index 0098852bb56a..90e51d5e093b 100644
> --- a/include/linux/lsm_hook_defs.h
> +++ b/include/linux/lsm_hook_defs.h
> @@ -211,6 +211,7 @@ LSM_HOOK(int, 0, task_kill, struct task_struct *p, struct kernel_siginfo *info,
>  	 int sig, const struct cred *cred)
>  LSM_HOOK(int, -ENOSYS, task_prctl, int option, unsigned long arg2,
>  	 unsigned long arg3, unsigned long arg4, unsigned long arg5)
> +LSM_HOOK(int, 0, prctl_set_mm_exe_file, struct file *exe_file)
>  LSM_HOOK(void, LSM_RET_VOID, task_to_inode, struct task_struct *p,
>  	 struct inode *inode)
>  LSM_HOOK(int, 0, ipc_permission, struct kern_ipc_perm *ipcp, short flag)
> diff --git a/include/linux/security.h b/include/linux/security.h
> index 2797e7f6418e..0f594eb7e766 100644
> --- a/include/linux/security.h
> +++ b/include/linux/security.h
> @@ -412,6 +412,7 @@ int security_task_kill(struct task_struct *p, struct kernel_siginfo *info,
>  			int sig, const struct cred *cred);
>  int security_task_prctl(int option, unsigned long arg2, unsigned long arg3,
>  			unsigned long arg4, unsigned long arg5);
> +int security_prctl_set_mm_exe_file(struct file *exe_file);
>  void security_task_to_inode(struct task_struct *p, struct inode *inode);
>  int security_ipc_permission(struct kern_ipc_perm *ipcp, short flag);
>  void security_ipc_getsecid(struct kern_ipc_perm *ipcp, u32 *secid);
> @@ -1124,6 +1125,11 @@ static inline int security_task_prctl(int option, unsigned long arg2,
>  	return cap_task_prctl(option, arg2, arg3, arg4, arg5);
>  }
>  
> +static inline int security_prctl_set_mm_exe_file(struct file *exe_file)
> +{
> +	return cap_prctl_set_mm_exe_file(exe_file);
> +}
> +
>  static inline void security_task_to_inode(struct task_struct *p, struct inode *inode)
>  { }
>  
> diff --git a/kernel/sys.c b/kernel/sys.c
> index 00a96746e28a..bb53e8408c63 100644
> --- a/kernel/sys.c
> +++ b/kernel/sys.c
> @@ -1851,6 +1851,10 @@ static int prctl_set_mm_exe_file(struct mm_struct *mm, unsigned int fd)
>  	if (err)
>  		goto exit;
>  
> +	err = security_prctl_set_mm_exe_file(exe.file);
> +	if (err)
> +		goto exit;
> +
>  	/*
>  	 * Forbid mm->exe_file change if old file still mapped.
>  	 */
> @@ -2006,14 +2010,6 @@ static int prctl_set_mm_map(int opt, const void __user *addr, unsigned long data
>  	}
>  
>  	if (prctl_map.exe_fd != (u32)-1) {
> -		/*
> -		 * Make sure the caller has the rights to
> -		 * change /proc/pid/exe link: only local sys admin should
> -		 * be allowed to.
> -		 */
> -		if (!ns_capable(current_user_ns(), CAP_SYS_ADMIN))
> -			return -EINVAL;
> -
>  		error = prctl_set_mm_exe_file(mm, prctl_map.exe_fd);
>  		if (error)
>  			return error;
> diff --git a/security/commoncap.c b/security/commoncap.c
> index 59bf3c1674c8..663d00fe2ecc 100644
> --- a/security/commoncap.c
> +++ b/security/commoncap.c
> @@ -1291,6 +1291,31 @@ int cap_task_prctl(int option, unsigned long arg2, unsigned long arg3,
>  	}
>  }
>  
> +/**
> + * cap_prctl_set_mm_exe_file - Determine whether /proc/self/exe can be changed
> + * by the current process.
> + * @exe_file: The new exe file
> + * Returns 0 if permission is granted, -ve if denied.
> + *
> + * The current process is permitted to change its /proc/self/exe link via two policies:
> + * 1) The current user can do checkpoint/restore. At the time of this writing,
> + *    this means CAP_SYS_ADMIN or CAP_CHECKPOINT_RESTORE capable.
> + * 2) The current user can use ptrace.
> + *
> + * With access to ptrace facilities, a process can do the following:
> + * fork a child, execve() the target executable, and have the child use
> + * ptrace() to replace the memory content of the current process.
> + * This technique makes it possible to masquerade an arbitrary program as the
> + * target executable, even if it is setuid.
> + */
> +int cap_prctl_set_mm_exe_file(struct file *exe_file)
> +{
> +	if (checkpoint_restore_ns_capable(current_user_ns()))
> +		return 0;
> +
> +	return security_ptrace_access_check(current, PTRACE_MODE_ATTACH_REALCREDS);
> +}
> +
>  /**
>   * cap_vm_enough_memory - Determine whether a new virtual mapping is permitted
>   * @mm: The VM space in which the new mapping is to be made
> @@ -1356,6 +1381,7 @@ static struct security_hook_list capability_hooks[] __lsm_ro_after_init = {
>  	LSM_HOOK_INIT(mmap_file, cap_mmap_file),
>  	LSM_HOOK_INIT(task_fix_setuid, cap_task_fix_setuid),
>  	LSM_HOOK_INIT(task_prctl, cap_task_prctl),
> +	LSM_HOOK_INIT(prctl_set_mm_exe_file, cap_prctl_set_mm_exe_file),
>  	LSM_HOOK_INIT(task_setscheduler, cap_task_setscheduler),
>  	LSM_HOOK_INIT(task_setioprio, cap_task_setioprio),
>  	LSM_HOOK_INIT(task_setnice, cap_task_setnice),
> diff --git a/security/security.c b/security/security.c
> index 2bb912496232..13a1ed32f9e3 100644
> --- a/security/security.c
> +++ b/security/security.c
> @@ -1790,6 +1790,11 @@ int security_task_prctl(int option, unsigned long arg2, unsigned long arg3,
>  	return rc;
>  }
>  
> +int security_prctl_set_mm_exe_file(struct file *exe_file)
> +{
> +	return call_int_hook(prctl_set_mm_exe_file, 0, exe_file);
> +}
> +
>  void security_task_to_inode(struct task_struct *p, struct inode *inode)
>  {
>  	call_void_hook(task_to_inode, p, inode);
> diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c
> index ca901025802a..fca5581392b8 100644
> --- a/security/selinux/hooks.c
> +++ b/security/selinux/hooks.c
> @@ -4156,6 +4156,19 @@ static int selinux_task_kill(struct task_struct *p, struct kernel_siginfo *info,
>  			    secid, task_sid(p), SECCLASS_PROCESS, perm, NULL);
>  }
>  
> +static int selinux_prctl_set_mm_exe_file(struct file *exe_file)
> +{
> +	u32 sid = current_sid();
> +
> +	struct common_audit_data ad = {
> +		.type = LSM_AUDIT_DATA_FILE,
> +		.u.file = exe_file,
> +	};
> +
> +	return avc_has_perm(&selinux_state, sid, sid,
> +			    SECCLASS_FILE, FILE__EXECUTE_NO_TRANS, &ad);
> +}
> +
>  static void selinux_task_to_inode(struct task_struct *p,
>  				  struct inode *inode)
>  {
> @@ -7057,6 +7070,7 @@ static struct security_hook_list selinux_hooks[] __lsm_ro_after_init = {
>  	LSM_HOOK_INIT(task_getscheduler, selinux_task_getscheduler),
>  	LSM_HOOK_INIT(task_movememory, selinux_task_movememory),
>  	LSM_HOOK_INIT(task_kill, selinux_task_kill),
> +	LSM_HOOK_INIT(prctl_set_mm_exe_file, selinux_prctl_set_mm_exe_file),
>  	LSM_HOOK_INIT(task_to_inode, selinux_task_to_inode),
>  
>  	LSM_HOOK_INIT(ipc_permission, selinux_ipc_permission),
> -- 
> 2.26.2

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v4 3/3] prctl: Allow ptrace capable processes to change /proc/self/exe
  2020-07-01  8:55   ` Christian Brauner
@ 2020-07-02 21:58     ` Serge E. Hallyn
  0 siblings, 0 replies; 17+ messages in thread
From: Serge E. Hallyn @ 2020-07-02 21:58 UTC (permalink / raw)
  To: Christian Brauner, Paul Moore
  Cc: Adrian Reber, Eric Biederman, Pavel Emelyanov, Oleg Nesterov,
	Dmitry Safonov, Andrei Vagin, Nicolas Viennot,
	Michał Cłapiński, Kamil Yurtsever, Dirk Petersen,
	Christine Flood, Casey Schaufler, Mike Rapoport,
	Radostin Stoyanov, Cyrill Gorcunov, Serge Hallyn,
	Stephen Smalley, Sargun Dhillon, Arnd Bergmann,
	linux-security-module, linux-kernel, selinux, Eric Paris,
	Jann Horn, linux-fsdevel

On Wed, Jul 01, 2020 at 10:55:37AM +0200, Christian Brauner wrote:
> On Wed, Jul 01, 2020 at 08:49:06AM +0200, Adrian Reber wrote:
> > From: Nicolas Viennot <Nicolas.Viennot@twosigma.com>
> > 
> > Previously, the current process could only change the /proc/self/exe
> > link with local CAP_SYS_ADMIN.
> > This commit relaxes this restriction by permitting such change with
> > CAP_CHECKPOINT_RESTORE, and the ability to use ptrace.
> > 
> > With access to ptrace facilities, a process can do the following: fork a
> > child, execve() the target executable, and have the child use ptrace()
> > to replace the memory content of the current process. This technique
> > makes it possible to masquerade an arbitrary program as any executable,
> > even setuid ones.
> > 
> > Signed-off-by: Nicolas Viennot <Nicolas.Viennot@twosigma.com>
> > Signed-off-by: Adrian Reber <areber@redhat.com>
> > ---
> >  include/linux/lsm_hook_defs.h |  1 +
> >  include/linux/security.h      |  6 ++++++
> >  kernel/sys.c                  | 12 ++++--------
> >  security/commoncap.c          | 26 ++++++++++++++++++++++++++
> >  security/security.c           |  5 +++++
> >  security/selinux/hooks.c      | 14 ++++++++++++++
> >  6 files changed, 56 insertions(+), 8 deletions(-)
> > 
> > diff --git a/include/linux/lsm_hook_defs.h b/include/linux/lsm_hook_defs.h
> > index 0098852bb56a..90e51d5e093b 100644
> > --- a/include/linux/lsm_hook_defs.h
> > +++ b/include/linux/lsm_hook_defs.h
> > @@ -211,6 +211,7 @@ LSM_HOOK(int, 0, task_kill, struct task_struct *p, struct kernel_siginfo *info,
> >  	 int sig, const struct cred *cred)
> >  LSM_HOOK(int, -ENOSYS, task_prctl, int option, unsigned long arg2,
> >  	 unsigned long arg3, unsigned long arg4, unsigned long arg5)
> > +LSM_HOOK(int, 0, prctl_set_mm_exe_file, struct file *exe_file)
> >  LSM_HOOK(void, LSM_RET_VOID, task_to_inode, struct task_struct *p,
> >  	 struct inode *inode)
> >  LSM_HOOK(int, 0, ipc_permission, struct kern_ipc_perm *ipcp, short flag)
> > diff --git a/include/linux/security.h b/include/linux/security.h
> > index 2797e7f6418e..0f594eb7e766 100644
> > --- a/include/linux/security.h
> > +++ b/include/linux/security.h
> > @@ -412,6 +412,7 @@ int security_task_kill(struct task_struct *p, struct kernel_siginfo *info,
> >  			int sig, const struct cred *cred);
> >  int security_task_prctl(int option, unsigned long arg2, unsigned long arg3,
> >  			unsigned long arg4, unsigned long arg5);
> > +int security_prctl_set_mm_exe_file(struct file *exe_file);
> >  void security_task_to_inode(struct task_struct *p, struct inode *inode);
> >  int security_ipc_permission(struct kern_ipc_perm *ipcp, short flag);
> >  void security_ipc_getsecid(struct kern_ipc_perm *ipcp, u32 *secid);
> > @@ -1124,6 +1125,11 @@ static inline int security_task_prctl(int option, unsigned long arg2,
> >  	return cap_task_prctl(option, arg2, arg3, arg4, arg5);
> >  }
> >  
> > +static inline int security_prctl_set_mm_exe_file(struct file *exe_file)
> > +{
> > +	return cap_prctl_set_mm_exe_file(exe_file);
> > +}
> > +
> >  static inline void security_task_to_inode(struct task_struct *p, struct inode *inode)
> >  { }
> >  
> > diff --git a/kernel/sys.c b/kernel/sys.c
> > index 00a96746e28a..bb53e8408c63 100644
> > --- a/kernel/sys.c
> > +++ b/kernel/sys.c
> > @@ -1851,6 +1851,10 @@ static int prctl_set_mm_exe_file(struct mm_struct *mm, unsigned int fd)
> >  	if (err)
> >  		goto exit;
> >  
> > +	err = security_prctl_set_mm_exe_file(exe.file);
> > +	if (err)
> > +		goto exit;
> > +
> >  	/*
> >  	 * Forbid mm->exe_file change if old file still mapped.
> >  	 */
> > @@ -2006,14 +2010,6 @@ static int prctl_set_mm_map(int opt, const void __user *addr, unsigned long data
> >  	}
> >  
> >  	if (prctl_map.exe_fd != (u32)-1) {
> > -		/*
> > -		 * Make sure the caller has the rights to
> > -		 * change /proc/pid/exe link: only local sys admin should
> > -		 * be allowed to.
> > -		 */
> > -		if (!ns_capable(current_user_ns(), CAP_SYS_ADMIN))
> > -			return -EINVAL;
> > -
> >  		error = prctl_set_mm_exe_file(mm, prctl_map.exe_fd);
> >  		if (error)
> >  			return error;
> > diff --git a/security/commoncap.c b/security/commoncap.c
> > index 59bf3c1674c8..663d00fe2ecc 100644
> > --- a/security/commoncap.c
> > +++ b/security/commoncap.c
> > @@ -1291,6 +1291,31 @@ int cap_task_prctl(int option, unsigned long arg2, unsigned long arg3,
> >  	}
> >  }
> >  
> > +/**
> > + * cap_prctl_set_mm_exe_file - Determine whether /proc/self/exe can be changed
> > + * by the current process.
> > + * @exe_file: The new exe file
> > + * Returns 0 if permission is granted, -ve if denied.
> > + *
> > + * The current process is permitted to change its /proc/self/exe link via two policies:
> > + * 1) The current user can do checkpoint/restore. At the time of this writing,
> > + *    this means CAP_SYS_ADMIN or CAP_CHECKPOINT_RESTORE capable.
> > + * 2) The current user can use ptrace.
> > + *
> > + * With access to ptrace facilities, a process can do the following:
> > + * fork a child, execve() the target executable, and have the child use
> > + * ptrace() to replace the memory content of the current process.
> > + * This technique makes it possible to masquerade an arbitrary program as the
> > + * target executable, even if it is setuid.
> > + */
> > +int cap_prctl_set_mm_exe_file(struct file *exe_file)
> > +{
> > +	if (checkpoint_restore_ns_capable(current_user_ns()))
> > +		return 0;
> > +
> > +	return security_ptrace_access_check(current, PTRACE_MODE_ATTACH_REALCREDS);
> > +}
> 
> What is the reason for having this be a new security hook? Doesn't look
> like it needs to be unless I'm missing something. This just seems more
> complex than it needs to be.

Yeah, agreed, and that in turn risks actually making it less safe.  Is
there a good reason not to do what Christian suggests?

> I might be wrong here but if you look at the callsites for
> security_ptrace_access_check() right now, you'll see that it's only
> called from kernel/ptrace.c in __ptrace_may_access() and that function
> checks right at the top:
> 
> 	/* Don't let security modules deny introspection */
> 	if (same_thread_group(task, current))
> 		return 0;
> 
> since you're passing in same_thread_group(current, current) you're
> passing this check and never even hitting
> security_ptrace_access_check(). So the contract seems to be (as is
> obvious from the comment) that a task can't be denied ptrace
> introspection. But if you're using security_ptrace_access_check(current)
> here and _if_ there would be any lsm that would deny ptrace
> introspection to current you'd suddenly introduce a callsite where
> ptrace introspection is denied. That seems wrong. So either you meant to
> do something else here or you really just want:
> 
> checkpoint_restore_ns_capable(current_user_ns())
> 
> and none of the rest. But I might be missing the big picture in this
> patch.
> 
> > +	if (checkpoint_restore_ns_capable(current_user_ns()))
> > +		return 0;
> > +
> >  /**
> >   * cap_vm_enough_memory - Determine whether a new virtual mapping is permitted
> >   * @mm: The VM space in which the new mapping is to be made
> > @@ -1356,6 +1381,7 @@ static struct security_hook_list capability_hooks[] __lsm_ro_after_init = {
> >  	LSM_HOOK_INIT(mmap_file, cap_mmap_file),
> >  	LSM_HOOK_INIT(task_fix_setuid, cap_task_fix_setuid),
> >  	LSM_HOOK_INIT(task_prctl, cap_task_prctl),
> > +	LSM_HOOK_INIT(prctl_set_mm_exe_file, cap_prctl_set_mm_exe_file),
> >  	LSM_HOOK_INIT(task_setscheduler, cap_task_setscheduler),
> >  	LSM_HOOK_INIT(task_setioprio, cap_task_setioprio),
> >  	LSM_HOOK_INIT(task_setnice, cap_task_setnice),
> > diff --git a/security/security.c b/security/security.c
> > index 2bb912496232..13a1ed32f9e3 100644
> > --- a/security/security.c
> > +++ b/security/security.c
> > @@ -1790,6 +1790,11 @@ int security_task_prctl(int option, unsigned long arg2, unsigned long arg3,
> >  	return rc;
> >  }
> >  
> > +int security_prctl_set_mm_exe_file(struct file *exe_file)
> > +{
> > +	return call_int_hook(prctl_set_mm_exe_file, 0, exe_file);
> > +}
> > +
> >  void security_task_to_inode(struct task_struct *p, struct inode *inode)
> >  {
> >  	call_void_hook(task_to_inode, p, inode);
> > diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c
> > index ca901025802a..fca5581392b8 100644
> > --- a/security/selinux/hooks.c
> > +++ b/security/selinux/hooks.c
> > @@ -4156,6 +4156,19 @@ static int selinux_task_kill(struct task_struct *p, struct kernel_siginfo *info,
> >  			    secid, task_sid(p), SECCLASS_PROCESS, perm, NULL);
> >  }
> >  
> > +static int selinux_prctl_set_mm_exe_file(struct file *exe_file)
> > +{
> > +	u32 sid = current_sid();
> > +
> > +	struct common_audit_data ad = {
> > +		.type = LSM_AUDIT_DATA_FILE,
> > +		.u.file = exe_file,
> > +	};
> > +
> > +	return avc_has_perm(&selinux_state, sid, sid,
> > +			    SECCLASS_FILE, FILE__EXECUTE_NO_TRANS, &ad);
> > +}
> > +
> >  static void selinux_task_to_inode(struct task_struct *p,
> >  				  struct inode *inode)
> >  {
> > @@ -7057,6 +7070,7 @@ static struct security_hook_list selinux_hooks[] __lsm_ro_after_init = {
> >  	LSM_HOOK_INIT(task_getscheduler, selinux_task_getscheduler),
> >  	LSM_HOOK_INIT(task_movememory, selinux_task_movememory),
> >  	LSM_HOOK_INIT(task_kill, selinux_task_kill),
> > +	LSM_HOOK_INIT(prctl_set_mm_exe_file, selinux_prctl_set_mm_exe_file),
> >  	LSM_HOOK_INIT(task_to_inode, selinux_task_to_inode),
> >  
> >  	LSM_HOOK_INIT(ipc_permission, selinux_ipc_permission),
> > -- 
> > 2.26.2
> > 

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v4 3/3] prctl: Allow ptrace capable processes to change /proc/self/exe
  2020-07-02 21:16   ` Serge E. Hallyn
@ 2020-07-02 22:00     ` Paul Moore
  2020-07-06 17:13       ` Nicolas Viennot
  0 siblings, 1 reply; 17+ messages in thread
From: Paul Moore @ 2020-07-02 22:00 UTC (permalink / raw)
  To: Serge E. Hallyn
  Cc: Adrian Reber, Christian Brauner, Eric Biederman, Pavel Emelyanov,
	Oleg Nesterov, Dmitry Safonov, Andrei Vagin, Nicolas Viennot,
	Michał Cłapiński, Kamil Yurtsever, Dirk Petersen,
	Christine Flood, Casey Schaufler, Mike Rapoport,
	Radostin Stoyanov, Cyrill Gorcunov, Stephen Smalley,
	Sargun Dhillon, Arnd Bergmann, linux-security-module,
	linux-kernel, selinux, Eric Paris, Jann Horn, linux-fsdevel

On Thu, Jul 2, 2020 at 5:16 PM Serge E. Hallyn <serge@hallyn.com> wrote:
> On Wed, Jul 01, 2020 at 08:49:06AM +0200, Adrian Reber wrote:
> > From: Nicolas Viennot <Nicolas.Viennot@twosigma.com>
> >
> > Previously, the current process could only change the /proc/self/exe
> > link with local CAP_SYS_ADMIN.
> > This commit relaxes this restriction by permitting such change with
> > CAP_CHECKPOINT_RESTORE, and the ability to use ptrace.
> >
> > With access to ptrace facilities, a process can do the following: fork a
> > child, execve() the target executable, and have the child use ptrace()
> > to replace the memory content of the current process. This technique
> > makes it possible to masquerade an arbitrary program as any executable,
> > even setuid ones.
> >
> > Signed-off-by: Nicolas Viennot <Nicolas.Viennot@twosigma.com>
> > Signed-off-by: Adrian Reber <areber@redhat.com>
>
> This is scary.  But I believe it is safe.
>
> Reviewed-by: Serge Hallyn <serge@hallyn.com>
>
> I am a bit curious about the implications of the selinux patch.
> IIUC you are using the permission of the tracing process to
> execute the file without transition, so this is a way to work
> around the policy which might prevent the tracee from doing so.
> Given that SELinux wants to be MAC, I'm not *quite* sure that's
> considered kosher.  You also are skipping the PROCESS__PTRACE
> to SECCLASS_PROCESS check which selinux_bprm_set_creds does later
> on.  Again I'm just not quite sure what's considered normal there
> these days.
>
> Paul, do you have input there?

I agree, the SELinux hook looks wrong.  Building on what Christian
said, this looks more like a ptrace operation than an exec operation.

-- 
paul moore
www.paul-moore.com

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v4 1/3] capabilities: Introduce CAP_CHECKPOINT_RESTORE
  2020-07-01  8:27   ` Christian Brauner
@ 2020-07-03 11:11     ` Adrian Reber
  0 siblings, 0 replies; 17+ messages in thread
From: Adrian Reber @ 2020-07-03 11:11 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Eric Biederman, Pavel Emelyanov, Oleg Nesterov, Dmitry Safonov,
	Andrei Vagin, Nicolas Viennot, Michał Cłapiński,
	Kamil Yurtsever, Dirk Petersen, Christine Flood, Casey Schaufler,
	Mike Rapoport, Radostin Stoyanov, Cyrill Gorcunov, Serge Hallyn,
	Stephen Smalley, Sargun Dhillon, Arnd Bergmann,
	linux-security-module, linux-kernel, selinux, Eric Paris,
	Jann Horn, linux-fsdevel

On Wed, Jul 01, 2020 at 10:27:08AM +0200, Christian Brauner wrote:
> On Wed, Jul 01, 2020 at 08:49:04AM +0200, Adrian Reber wrote:
> > This patch introduces CAP_CHECKPOINT_RESTORE, a new capability facilitating
> > checkpoint/restore for non-root users.
> > 
> > Over the last years, The CRIU (Checkpoint/Restore In Userspace) team has been
> > asked numerous times if it is possible to checkpoint/restore a process as
> > non-root. The answer usually was: 'almost'.
> > 
> > The main blocker to restore a process as non-root was to control the PID of the
> > restored process. This feature available via the clone3 system call, or via
> > /proc/sys/kernel/ns_last_pid is unfortunately guarded by CAP_SYS_ADMIN.
> > 
> > In the past two years, requests for non-root checkpoint/restore have increased
> > due to the following use cases:
> > * Checkpoint/Restore in an HPC environment in combination with a resource
> >   manager distributing jobs where users are always running as non-root.
> >   There is a desire to provide a way to checkpoint and restore long running
> >   jobs.
> > * Container migration as non-root
> > * We have been in contact with JVM developers who are integrating
> >   CRIU into a Java VM to decrease the startup time. These checkpoint/restore
> >   applications are not meant to be running with CAP_SYS_ADMIN.
> > 
> > We have seen the following workarounds:
> > * Use a setuid wrapper around CRIU:
> >   See https://github.com/FredHutch/slurm-examples/blob/master/checkpointer/lib/checkpointer/checkpointer-suid.c
> > * Use a setuid helper that writes to ns_last_pid.
> >   Unfortunately, this helper delegation technique is impossible to use with
> >   clone3, and is thus prone to races.
> >   See https://github.com/twosigma/set_ns_last_pid
> > * Cycle through PIDs with fork() until the desired PID is reached:
> >   This has been demonstrated to work with cycling rates of 100,000 PIDs/s
> >   See https://github.com/twosigma/set_ns_last_pid
> > * Patch out the CAP_SYS_ADMIN check from the kernel
> > * Run the desired application in a new user and PID namespace to provide
> >   a local CAP_SYS_ADMIN for controlling PIDs. This technique has limited use in
> >   typical container environments (e.g., Kubernetes) as /proc is
> >   typically protected with read-only layers (e.g., /proc/sys) for hardening
> >   purposes. Read-only layers prevent additional /proc mounts (due to proc's
> >   SB_I_USERNS_VISIBLE property), making the use of new PID namespaces limited as
> >   certain applications need access to /proc matching their PID namespace.
> > 
> > The introduced capability allows to:
> > * Control PIDs when the current user is CAP_CHECKPOINT_RESTORE capable
> >   for the corresponding PID namespace via ns_last_pid/clone3.
> > * Open files in /proc/pid/map_files when the current user is
> >   CAP_CHECKPOINT_RESTORE capable in the root namespace, useful for recovering
> >   files that are unreachable via the file system such as deleted files, or memfd
> >   files.
> > 
> > See corresponding selftest for an example with clone3().
> > 
> > Signed-off-by: Adrian Reber <areber@redhat.com>
> > Signed-off-by: Nicolas Viennot <Nicolas.Viennot@twosigma.com>
> > ---
> 
> I think that now looks reasonable. A few comments.
> 
> Before we proceed, please split the addition of
> checkpoint_restore_ns_capable() out into a separate patch.
> In fact, I think the cleanest way of doing this would be:
> - 0/n capability: add CAP_CHECKPOINT_RESTORE
> - 1/n pid: use checkpoint_restore_ns_capable() for set_tid
> - 2/n pid_namespace: use checkpoint_restore_ns_capable() for ns_last_pid
> - 3/n: proc: require checkpoint_restore_ns_capable() in init userns for map_files
> 
> (commit subjects up to you of course) and a nice commit message for each
> time we relax a permissions on something so we have a clear separate
> track record for each change in case we need to revert something. Then
> the rest of the patches in this series. Testing patches probably last.

Yes, makes sense. I was thinking about this already, but I was not sure
if it I should do it or not. But I had the same idea already.

		Adrian


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v4 2/3] selftests: add clone3() CAP_CHECKPOINT_RESTORE test
  2020-07-02 20:53   ` Serge E. Hallyn
@ 2020-07-03 11:18     ` Adrian Reber
  2020-07-03 18:12       ` Serge E. Hallyn
  0 siblings, 1 reply; 17+ messages in thread
From: Adrian Reber @ 2020-07-03 11:18 UTC (permalink / raw)
  To: Serge E. Hallyn
  Cc: Christian Brauner, Eric Biederman, Pavel Emelyanov,
	Oleg Nesterov, Dmitry Safonov, Andrei Vagin, Nicolas Viennot,
	Michał Cłapiński, Kamil Yurtsever, Dirk Petersen,
	Christine Flood, Casey Schaufler, Mike Rapoport,
	Radostin Stoyanov, Cyrill Gorcunov, Stephen Smalley,
	Sargun Dhillon, Arnd Bergmann, linux-security-module,
	linux-kernel, selinux, Eric Paris, Jann Horn, linux-fsdevel

On Thu, Jul 02, 2020 at 03:53:05PM -0500, Serge E. Hallyn wrote:
> On Wed, Jul 01, 2020 at 08:49:05AM +0200, Adrian Reber wrote:
> > This adds a test that changes its UID, uses capabilities to
> > get CAP_CHECKPOINT_RESTORE and uses clone3() with set_tid to
> > create a process with a given PID as non-root.
> 
> Seems worth also verifying that it fails if you have no capabilities.
> I don't see that in the existing clone3/ test dir.

Bit confused about what you mean. This test does:

 * switch UID to 1000
 * run clone3() with set_tid set and expect EPERM
 * set CAP_CHECKPOINT_RESTORE capability
 * run clone3() with set_tid set and expect success

So it already does what I think you are asking for. Did I misunderstand
your comment?

		Adrian

> > Signed-off-by: Adrian Reber <areber@redhat.com>
> > ---
> >  tools/testing/selftests/clone3/Makefile       |   4 +-
> >  .../clone3/clone3_cap_checkpoint_restore.c    | 203 ++++++++++++++++++
> >  2 files changed, 206 insertions(+), 1 deletion(-)
> >  create mode 100644 tools/testing/selftests/clone3/clone3_cap_checkpoint_restore.c
> > 
> > diff --git a/tools/testing/selftests/clone3/Makefile b/tools/testing/selftests/clone3/Makefile
> > index cf976c732906..ef7564cb7abe 100644
> > --- a/tools/testing/selftests/clone3/Makefile
> > +++ b/tools/testing/selftests/clone3/Makefile
> > @@ -1,6 +1,8 @@
> >  # SPDX-License-Identifier: GPL-2.0
> >  CFLAGS += -g -I../../../../usr/include/
> > +LDLIBS += -lcap
> >  
> > -TEST_GEN_PROGS := clone3 clone3_clear_sighand clone3_set_tid
> > +TEST_GEN_PROGS := clone3 clone3_clear_sighand clone3_set_tid \
> > +	clone3_cap_checkpoint_restore
> >  
> >  include ../lib.mk
> > diff --git a/tools/testing/selftests/clone3/clone3_cap_checkpoint_restore.c b/tools/testing/selftests/clone3/clone3_cap_checkpoint_restore.c
> > new file mode 100644
> > index 000000000000..2cc3d57b91f2
> > --- /dev/null
> > +++ b/tools/testing/selftests/clone3/clone3_cap_checkpoint_restore.c
> > @@ -0,0 +1,203 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +
> > +/*
> > + * Based on Christian Brauner's clone3() example.
> > + * These tests are assuming to be running in the host's
> > + * PID namespace.
> > + */
> > +
> > +/* capabilities related code based on selftests/bpf/test_verifier.c */
> > +
> > +#define _GNU_SOURCE
> > +#include <errno.h>
> > +#include <linux/types.h>
> > +#include <linux/sched.h>
> > +#include <stdio.h>
> > +#include <stdlib.h>
> > +#include <stdbool.h>
> > +#include <sys/capability.h>
> > +#include <sys/prctl.h>
> > +#include <sys/syscall.h>
> > +#include <sys/types.h>
> > +#include <sys/un.h>
> > +#include <sys/wait.h>
> > +#include <unistd.h>
> > +#include <sched.h>
> > +
> > +#include "../kselftest.h"
> > +#include "clone3_selftests.h"
> > +
> > +#ifndef MAX_PID_NS_LEVEL
> > +#define MAX_PID_NS_LEVEL 32
> > +#endif
> > +
> > +static void child_exit(int ret)
> > +{
> > +	fflush(stdout);
> > +	fflush(stderr);
> > +	_exit(ret);
> > +}
> > +
> > +static int call_clone3_set_tid(pid_t * set_tid, size_t set_tid_size)
> > +{
> > +	int status;
> > +	pid_t pid = -1;
> > +
> > +	struct clone_args args = {
> > +		.exit_signal = SIGCHLD,
> > +		.set_tid = ptr_to_u64(set_tid),
> > +		.set_tid_size = set_tid_size,
> > +	};
> > +
> > +	pid = sys_clone3(&args, sizeof(struct clone_args));
> > +	if (pid < 0) {
> > +		ksft_print_msg("%s - Failed to create new process\n",
> > +			       strerror(errno));
> > +		return -errno;
> > +	}
> > +
> > +	if (pid == 0) {
> > +		int ret;
> > +		char tmp = 0;
> > +
> > +		ksft_print_msg
> > +		    ("I am the child, my PID is %d (expected %d)\n",
> > +		     getpid(), set_tid[0]);
> > +
> > +		if (set_tid[0] != getpid())
> > +			child_exit(EXIT_FAILURE);
> > +		child_exit(EXIT_SUCCESS);
> > +	}
> > +
> > +	ksft_print_msg("I am the parent (%d). My child's pid is %d\n",
> > +		       getpid(), pid);
> > +
> > +	if (waitpid(pid, &status, 0) < 0) {
> > +		ksft_print_msg("Child returned %s\n", strerror(errno));
> > +		return -errno;
> > +	}
> > +
> > +	if (!WIFEXITED(status))
> > +		return -1;
> > +
> > +	return WEXITSTATUS(status);
> > +}
> > +
> > +static int test_clone3_set_tid(pid_t * set_tid,
> > +			       size_t set_tid_size, int expected)
> > +{
> > +	int ret;
> > +
> > +	ksft_print_msg("[%d] Trying clone3() with CLONE_SET_TID to %d\n",
> > +		       getpid(), set_tid[0]);
> > +	ret = call_clone3_set_tid(set_tid, set_tid_size);
> > +
> > +	ksft_print_msg
> > +	    ("[%d] clone3() with CLONE_SET_TID %d says :%d - expected %d\n",
> > +	     getpid(), set_tid[0], ret, expected);
> > +	if (ret != expected) {
> > +		ksft_test_result_fail
> > +		    ("[%d] Result (%d) is different than expected (%d)\n",
> > +		     getpid(), ret, expected);
> > +		return -1;
> > +	}
> > +	ksft_test_result_pass
> > +	    ("[%d] Result (%d) matches expectation (%d)\n", getpid(), ret,
> > +	     expected);
> > +
> > +	return 0;
> > +}
> > +
> > +struct libcap {
> > +	struct __user_cap_header_struct hdr;
> > +	struct __user_cap_data_struct data[2];
> > +};
> > +
> > +static int set_capability()
> > +{
> > +	cap_value_t cap_values[] = { CAP_SETUID, CAP_SETGID };
> > +	struct libcap *cap;
> > +	int ret = -1;
> > +	cap_t caps;
> > +
> > +	caps = cap_get_proc();
> > +	if (!caps) {
> > +		perror("cap_get_proc");
> > +		return -1;
> > +	}
> > +
> > +	/* Drop all capabilities */
> > +	if (cap_clear(caps)) {
> > +		perror("cap_clear");
> > +		goto out;
> > +	}
> > +
> > +	cap_set_flag(caps, CAP_EFFECTIVE, 2, cap_values, CAP_SET);
> > +	cap_set_flag(caps, CAP_PERMITTED, 2, cap_values, CAP_SET);
> > +
> > +	cap = (struct libcap *) caps;
> > +
> > +	/* 40 -> CAP_CHECKPOINT_RESTORE */
> > +	cap->data[1].effective |= 1 << (40 - 32);
> > +	cap->data[1].permitted |= 1 << (40 - 32);
> > +
> > +	if (cap_set_proc(caps)) {
> > +		perror("cap_set_proc");
> > +		goto out;
> > +	}
> > +	ret = 0;
> > +out:
> > +	if (cap_free(caps))
> > +		perror("cap_free");
> > +	return ret;
> > +}
> > +
> > +int main(int argc, char *argv[])
> > +{
> > +	pid_t pid;
> > +	int status;
> > +	int ret = 0;
> > +	pid_t set_tid[1];
> > +	uid_t uid = getuid();
> > +
> > +	ksft_print_header();
> > +	test_clone3_supported();
> > +	ksft_set_plan(2);
> > +
> > +	if (uid != 0) {
> > +		ksft_cnt.ksft_xskip = ksft_plan;
> > +		ksft_print_msg("Skipping all tests as non-root\n");
> > +		return ksft_exit_pass();
> > +	}
> > +
> > +	memset(&set_tid, 0, sizeof(set_tid));
> > +
> > +	/* Find the current active PID */
> > +	pid = fork();
> > +	if (pid == 0) {
> > +		ksft_print_msg("Child has PID %d\n", getpid());
> > +		child_exit(EXIT_SUCCESS);
> > +	}
> > +	if (waitpid(pid, &status, 0) < 0)
> > +		ksft_exit_fail_msg("Waiting for child %d failed", pid);
> > +
> > +	/* After the child has finished, its PID should be free. */
> > +	set_tid[0] = pid;
> > +
> > +	if (set_capability())
> > +		ksft_test_result_fail
> > +		    ("Could not set CAP_CHECKPOINT_RESTORE\n");
> > +	prctl(PR_SET_KEEPCAPS, 1, 0, 0, 0);
> > +	/* This would fail without CAP_CHECKPOINT_RESTORE */
> > +	setgid(1000);
> > +	setuid(1000);
> > +	set_tid[0] = pid;
> > +	ret |= test_clone3_set_tid(set_tid, 1, -EPERM);
> > +	if (set_capability())
> > +		ksft_test_result_fail
> > +		    ("Could not set CAP_CHECKPOINT_RESTORE\n");
> > +	/* This should work as we have CAP_CHECKPOINT_RESTORE as non-root */
> > +	ret |= test_clone3_set_tid(set_tid, 1, 0);
> > +
> > +	return !ret ? ksft_exit_pass() : ksft_exit_fail();
> > +}
> > -- 
> > 2.26.2
> 


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v4 2/3] selftests: add clone3() CAP_CHECKPOINT_RESTORE test
  2020-07-03 11:18     ` Adrian Reber
@ 2020-07-03 18:12       ` Serge E. Hallyn
  0 siblings, 0 replies; 17+ messages in thread
From: Serge E. Hallyn @ 2020-07-03 18:12 UTC (permalink / raw)
  To: Adrian Reber
  Cc: Serge E. Hallyn, Christian Brauner, Eric Biederman,
	Pavel Emelyanov, Oleg Nesterov, Dmitry Safonov, Andrei Vagin,
	Nicolas Viennot, Michał Cłapiński,
	Kamil Yurtsever, Dirk Petersen, Christine Flood, Casey Schaufler,
	Mike Rapoport, Radostin Stoyanov, Cyrill Gorcunov,
	Stephen Smalley, Sargun Dhillon, Arnd Bergmann,
	linux-security-module, linux-kernel, selinux, Eric Paris,
	Jann Horn, linux-fsdevel

On Fri, Jul 03, 2020 at 01:18:07PM +0200, Adrian Reber wrote:
> On Thu, Jul 02, 2020 at 03:53:05PM -0500, Serge E. Hallyn wrote:
> > On Wed, Jul 01, 2020 at 08:49:05AM +0200, Adrian Reber wrote:
> > > This adds a test that changes its UID, uses capabilities to
> > > get CAP_CHECKPOINT_RESTORE and uses clone3() with set_tid to
> > > create a process with a given PID as non-root.
> > 
> > Seems worth also verifying that it fails if you have no capabilities.
> > I don't see that in the existing clone3/ test dir.
> 
> Bit confused about what you mean. This test does:
> 
>  * switch UID to 1000
>  * run clone3() with set_tid set and expect EPERM
>  * set CAP_CHECKPOINT_RESTORE capability
>  * run clone3() with set_tid set and expect success
> 
> So it already does what I think you are asking for. Did I misunderstand
> your comment?

Ah, no, I missed that line doing the call with -EPERM.  Thanks!

Acked-by: Serge Hallyn <serge@hallyn.com>


> 		Adrian
> 
> > > Signed-off-by: Adrian Reber <areber@redhat.com>
> > > ---
> > >  tools/testing/selftests/clone3/Makefile       |   4 +-
> > >  .../clone3/clone3_cap_checkpoint_restore.c    | 203 ++++++++++++++++++
> > >  2 files changed, 206 insertions(+), 1 deletion(-)
> > >  create mode 100644 tools/testing/selftests/clone3/clone3_cap_checkpoint_restore.c
> > > 
> > > diff --git a/tools/testing/selftests/clone3/Makefile b/tools/testing/selftests/clone3/Makefile
> > > index cf976c732906..ef7564cb7abe 100644
> > > --- a/tools/testing/selftests/clone3/Makefile
> > > +++ b/tools/testing/selftests/clone3/Makefile
> > > @@ -1,6 +1,8 @@
> > >  # SPDX-License-Identifier: GPL-2.0
> > >  CFLAGS += -g -I../../../../usr/include/
> > > +LDLIBS += -lcap
> > >  
> > > -TEST_GEN_PROGS := clone3 clone3_clear_sighand clone3_set_tid
> > > +TEST_GEN_PROGS := clone3 clone3_clear_sighand clone3_set_tid \
> > > +	clone3_cap_checkpoint_restore
> > >  
> > >  include ../lib.mk
> > > diff --git a/tools/testing/selftests/clone3/clone3_cap_checkpoint_restore.c b/tools/testing/selftests/clone3/clone3_cap_checkpoint_restore.c
> > > new file mode 100644
> > > index 000000000000..2cc3d57b91f2
> > > --- /dev/null
> > > +++ b/tools/testing/selftests/clone3/clone3_cap_checkpoint_restore.c
> > > @@ -0,0 +1,203 @@
> > > +// SPDX-License-Identifier: GPL-2.0
> > > +
> > > +/*
> > > + * Based on Christian Brauner's clone3() example.
> > > + * These tests are assuming to be running in the host's
> > > + * PID namespace.
> > > + */
> > > +
> > > +/* capabilities related code based on selftests/bpf/test_verifier.c */
> > > +
> > > +#define _GNU_SOURCE
> > > +#include <errno.h>
> > > +#include <linux/types.h>
> > > +#include <linux/sched.h>
> > > +#include <stdio.h>
> > > +#include <stdlib.h>
> > > +#include <stdbool.h>
> > > +#include <sys/capability.h>
> > > +#include <sys/prctl.h>
> > > +#include <sys/syscall.h>
> > > +#include <sys/types.h>
> > > +#include <sys/un.h>
> > > +#include <sys/wait.h>
> > > +#include <unistd.h>
> > > +#include <sched.h>
> > > +
> > > +#include "../kselftest.h"
> > > +#include "clone3_selftests.h"
> > > +
> > > +#ifndef MAX_PID_NS_LEVEL
> > > +#define MAX_PID_NS_LEVEL 32
> > > +#endif
> > > +
> > > +static void child_exit(int ret)
> > > +{
> > > +	fflush(stdout);
> > > +	fflush(stderr);
> > > +	_exit(ret);
> > > +}
> > > +
> > > +static int call_clone3_set_tid(pid_t * set_tid, size_t set_tid_size)
> > > +{
> > > +	int status;
> > > +	pid_t pid = -1;
> > > +
> > > +	struct clone_args args = {
> > > +		.exit_signal = SIGCHLD,
> > > +		.set_tid = ptr_to_u64(set_tid),
> > > +		.set_tid_size = set_tid_size,
> > > +	};
> > > +
> > > +	pid = sys_clone3(&args, sizeof(struct clone_args));
> > > +	if (pid < 0) {
> > > +		ksft_print_msg("%s - Failed to create new process\n",
> > > +			       strerror(errno));
> > > +		return -errno;
> > > +	}
> > > +
> > > +	if (pid == 0) {
> > > +		int ret;
> > > +		char tmp = 0;
> > > +
> > > +		ksft_print_msg
> > > +		    ("I am the child, my PID is %d (expected %d)\n",
> > > +		     getpid(), set_tid[0]);
> > > +
> > > +		if (set_tid[0] != getpid())
> > > +			child_exit(EXIT_FAILURE);
> > > +		child_exit(EXIT_SUCCESS);
> > > +	}
> > > +
> > > +	ksft_print_msg("I am the parent (%d). My child's pid is %d\n",
> > > +		       getpid(), pid);
> > > +
> > > +	if (waitpid(pid, &status, 0) < 0) {
> > > +		ksft_print_msg("Child returned %s\n", strerror(errno));
> > > +		return -errno;
> > > +	}
> > > +
> > > +	if (!WIFEXITED(status))
> > > +		return -1;
> > > +
> > > +	return WEXITSTATUS(status);
> > > +}
> > > +
> > > +static int test_clone3_set_tid(pid_t * set_tid,
> > > +			       size_t set_tid_size, int expected)
> > > +{
> > > +	int ret;
> > > +
> > > +	ksft_print_msg("[%d] Trying clone3() with CLONE_SET_TID to %d\n",
> > > +		       getpid(), set_tid[0]);
> > > +	ret = call_clone3_set_tid(set_tid, set_tid_size);
> > > +
> > > +	ksft_print_msg
> > > +	    ("[%d] clone3() with CLONE_SET_TID %d says :%d - expected %d\n",
> > > +	     getpid(), set_tid[0], ret, expected);
> > > +	if (ret != expected) {
> > > +		ksft_test_result_fail
> > > +		    ("[%d] Result (%d) is different than expected (%d)\n",
> > > +		     getpid(), ret, expected);
> > > +		return -1;
> > > +	}
> > > +	ksft_test_result_pass
> > > +	    ("[%d] Result (%d) matches expectation (%d)\n", getpid(), ret,
> > > +	     expected);
> > > +
> > > +	return 0;
> > > +}
> > > +
> > > +struct libcap {
> > > +	struct __user_cap_header_struct hdr;
> > > +	struct __user_cap_data_struct data[2];
> > > +};
> > > +
> > > +static int set_capability()
> > > +{
> > > +	cap_value_t cap_values[] = { CAP_SETUID, CAP_SETGID };
> > > +	struct libcap *cap;
> > > +	int ret = -1;
> > > +	cap_t caps;
> > > +
> > > +	caps = cap_get_proc();
> > > +	if (!caps) {
> > > +		perror("cap_get_proc");
> > > +		return -1;
> > > +	}
> > > +
> > > +	/* Drop all capabilities */
> > > +	if (cap_clear(caps)) {
> > > +		perror("cap_clear");
> > > +		goto out;
> > > +	}
> > > +
> > > +	cap_set_flag(caps, CAP_EFFECTIVE, 2, cap_values, CAP_SET);
> > > +	cap_set_flag(caps, CAP_PERMITTED, 2, cap_values, CAP_SET);
> > > +
> > > +	cap = (struct libcap *) caps;
> > > +
> > > +	/* 40 -> CAP_CHECKPOINT_RESTORE */
> > > +	cap->data[1].effective |= 1 << (40 - 32);
> > > +	cap->data[1].permitted |= 1 << (40 - 32);
> > > +
> > > +	if (cap_set_proc(caps)) {
> > > +		perror("cap_set_proc");
> > > +		goto out;
> > > +	}
> > > +	ret = 0;
> > > +out:
> > > +	if (cap_free(caps))
> > > +		perror("cap_free");
> > > +	return ret;
> > > +}
> > > +
> > > +int main(int argc, char *argv[])
> > > +{
> > > +	pid_t pid;
> > > +	int status;
> > > +	int ret = 0;
> > > +	pid_t set_tid[1];
> > > +	uid_t uid = getuid();
> > > +
> > > +	ksft_print_header();
> > > +	test_clone3_supported();
> > > +	ksft_set_plan(2);
> > > +
> > > +	if (uid != 0) {
> > > +		ksft_cnt.ksft_xskip = ksft_plan;
> > > +		ksft_print_msg("Skipping all tests as non-root\n");
> > > +		return ksft_exit_pass();
> > > +	}
> > > +
> > > +	memset(&set_tid, 0, sizeof(set_tid));
> > > +
> > > +	/* Find the current active PID */
> > > +	pid = fork();
> > > +	if (pid == 0) {
> > > +		ksft_print_msg("Child has PID %d\n", getpid());
> > > +		child_exit(EXIT_SUCCESS);
> > > +	}
> > > +	if (waitpid(pid, &status, 0) < 0)
> > > +		ksft_exit_fail_msg("Waiting for child %d failed", pid);
> > > +
> > > +	/* After the child has finished, its PID should be free. */
> > > +	set_tid[0] = pid;
> > > +
> > > +	if (set_capability())
> > > +		ksft_test_result_fail
> > > +		    ("Could not set CAP_CHECKPOINT_RESTORE\n");
> > > +	prctl(PR_SET_KEEPCAPS, 1, 0, 0, 0);
> > > +	/* This would fail without CAP_CHECKPOINT_RESTORE */
> > > +	setgid(1000);
> > > +	setuid(1000);
> > > +	set_tid[0] = pid;
> > > +	ret |= test_clone3_set_tid(set_tid, 1, -EPERM);
> > > +	if (set_capability())
> > > +		ksft_test_result_fail
> > > +		    ("Could not set CAP_CHECKPOINT_RESTORE\n");
> > > +	/* This should work as we have CAP_CHECKPOINT_RESTORE as non-root */
> > > +	ret |= test_clone3_set_tid(set_tid, 1, 0);
> > > +
> > > +	return !ret ? ksft_exit_pass() : ksft_exit_fail();
> > > +}
> > > -- 
> > > 2.26.2
> > 

^ permalink raw reply	[flat|nested] 17+ messages in thread

* RE: [PATCH v4 3/3] prctl: Allow ptrace capable processes to change /proc/self/exe
  2020-07-02 22:00     ` Paul Moore
@ 2020-07-06 17:13       ` Nicolas Viennot
  2020-07-06 17:44         ` Christian Brauner
  0 siblings, 1 reply; 17+ messages in thread
From: Nicolas Viennot @ 2020-07-06 17:13 UTC (permalink / raw)
  To: Paul Moore, Serge E. Hallyn, Christian Brauner
  Cc: Adrian Reber, Eric Biederman, Pavel Emelyanov, Oleg Nesterov,
	Dmitry Safonov, Andrei Vagin, Michał Cłapiński,
	Kamil Yurtsever, Dirk Petersen, Christine Flood, Casey Schaufler,
	Mike Rapoport, Radostin Stoyanov, Cyrill Gorcunov,
	Stephen Smalley, Sargun Dhillon, Arnd Bergmann,
	linux-security-module, linux-kernel, selinux, Eric Paris,
	Jann Horn, linux-fsdevel

> > This is scary.  But I believe it is safe.
> >
> > Reviewed-by: Serge Hallyn <serge@hallyn.com>
> >
> > I am a bit curious about the implications of the selinux patch.
> > IIUC you are using the permission of the tracing process to execute
> > the file without transition, so this is a way to work around the
> > policy which might prevent the tracee from doing so.
> > Given that SELinux wants to be MAC, I'm not *quite* sure that's
> > considered kosher.  You also are skipping the PROCESS__PTRACE to
> > SECCLASS_PROCESS check which selinux_bprm_set_creds does later on.
> > Again I'm just not quite sure what's considered normal there these
> > days.
> >
> > Paul, do you have input there?
>
> I agree, the SELinux hook looks wrong.  Building on what Christian said, this looks more like a ptrace operation than an exec operation.

Serge, Paul, Christian,

I made a PoC to demonstrate the change of /proc/self/exe without CAP_SYS_ADMIN using only ptrace and execve.
You may find it here: https://github.com/nviennot/run_as_exe

What do you recommend to relax the security checks in the kernel when it comes to changing the exe link?

    Nico

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v4 3/3] prctl: Allow ptrace capable processes to change /proc/self/exe
  2020-07-06 17:13       ` Nicolas Viennot
@ 2020-07-06 17:44         ` Christian Brauner
  2020-07-07 15:45           ` Christian Brauner
  0 siblings, 1 reply; 17+ messages in thread
From: Christian Brauner @ 2020-07-06 17:44 UTC (permalink / raw)
  To: Nicolas Viennot
  Cc: Paul Moore, Serge E. Hallyn, Adrian Reber, Eric Biederman,
	Pavel Emelyanov, Oleg Nesterov, Dmitry Safonov, Andrei Vagin,
	Michał Cłapiński, Kamil Yurtsever, Dirk Petersen,
	Christine Flood, Casey Schaufler, Mike Rapoport,
	Radostin Stoyanov, Cyrill Gorcunov, Stephen Smalley,
	Sargun Dhillon, Arnd Bergmann, linux-security-module,
	linux-kernel, selinux, Eric Paris, Jann Horn, linux-fsdevel

On Mon, Jul 06, 2020 at 05:13:35PM +0000, Nicolas Viennot wrote:
> > > This is scary.  But I believe it is safe.
> > >
> > > Reviewed-by: Serge Hallyn <serge@hallyn.com>
> > >
> > > I am a bit curious about the implications of the selinux patch.
> > > IIUC you are using the permission of the tracing process to execute
> > > the file without transition, so this is a way to work around the
> > > policy which might prevent the tracee from doing so.
> > > Given that SELinux wants to be MAC, I'm not *quite* sure that's
> > > considered kosher.  You also are skipping the PROCESS__PTRACE to
> > > SECCLASS_PROCESS check which selinux_bprm_set_creds does later on.
> > > Again I'm just not quite sure what's considered normal there these
> > > days.
> > >
> > > Paul, do you have input there?
> >
> > I agree, the SELinux hook looks wrong.  Building on what Christian said, this looks more like a ptrace operation than an exec operation.
> 
> Serge, Paul, Christian,
> 
> I made a PoC to demonstrate the change of /proc/self/exe without CAP_SYS_ADMIN using only ptrace and execve.
> You may find it here: https://github.com/nviennot/run_as_exe
> 
> What do you recommend to relax the security checks in the kernel when it comes to changing the exe link?

Looks fun! Yeah, so that this is possible is known afaict. But you're
not really circumventing the kernel check but are mucking with the EFL
by changing the auxv, right?

Originally, you needed to be userns root, i.e. only uid 0 could
change the /proc/self/exe link (cf. [1]). This was changed to
ns_capable(CAP_SYS_ADMIN) in [2].

The original reasoning in [1] is interesting as it basically already
points to your poc:

"Still note that updating exe-file link now doesn't require sys-resource
 capability anymore, after all there is no much profit in preventing
 setup own file link (there are a number of ways to execute own code --
 ptrace, ld-preload, so that the only reliable way to find which exactly
 code is executed is to inspect running program memory).  Still we
 require the caller to be at least user-namespace root user."

There were arguments being made that /proc/<pid>/exe needs to be sm that
userspace can have a decent amount of trust in but I believe that that's
not a great argument.

But let me dig a little into the original discussion and see what the
thread-model was.
At this point I'm starting to believe that it was people being cautios
but better be sure.

[1]: f606b77f1a9e ("prctl: PR_SET_MM -- introduce PR_SET_MM_MAP operation")
[2]: 4d28df6152aa ("prctl: Allow local CAP_SYS_ADMIN changing exe_file")
[3]: https://lore.kernel.org/patchwork/patch/697304/

Christian

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v4 3/3] prctl: Allow ptrace capable processes to change /proc/self/exe
  2020-07-06 17:44         ` Christian Brauner
@ 2020-07-07 15:45           ` Christian Brauner
  2020-07-07 20:27             ` Cyrill Gorcunov
  0 siblings, 1 reply; 17+ messages in thread
From: Christian Brauner @ 2020-07-07 15:45 UTC (permalink / raw)
  To: Nicolas Viennot, Jann Horn
  Cc: Paul Moore, Serge E. Hallyn, Adrian Reber, Eric Biederman,
	Pavel Emelyanov, Oleg Nesterov, Dmitry Safonov, Andrei Vagin,
	Michał Cłapiński, Kamil Yurtsever, Dirk Petersen,
	Christine Flood, Casey Schaufler, Mike Rapoport,
	Radostin Stoyanov, Cyrill Gorcunov, Stephen Smalley,
	Sargun Dhillon, Arnd Bergmann, linux-security-module,
	linux-kernel, selinux, Eric Paris, Jann Horn, linux-fsdevel

On Mon, Jul 06, 2020 at 07:44:38PM +0200, Christian Brauner wrote:
> On Mon, Jul 06, 2020 at 05:13:35PM +0000, Nicolas Viennot wrote:
> > > > This is scary.  But I believe it is safe.
> > > >
> > > > Reviewed-by: Serge Hallyn <serge@hallyn.com>
> > > >
> > > > I am a bit curious about the implications of the selinux patch.
> > > > IIUC you are using the permission of the tracing process to execute
> > > > the file without transition, so this is a way to work around the
> > > > policy which might prevent the tracee from doing so.
> > > > Given that SELinux wants to be MAC, I'm not *quite* sure that's
> > > > considered kosher.  You also are skipping the PROCESS__PTRACE to
> > > > SECCLASS_PROCESS check which selinux_bprm_set_creds does later on.
> > > > Again I'm just not quite sure what's considered normal there these
> > > > days.
> > > >
> > > > Paul, do you have input there?
> > >
> > > I agree, the SELinux hook looks wrong.  Building on what Christian said, this looks more like a ptrace operation than an exec operation.
> > 
> > Serge, Paul, Christian,
> > 
> > I made a PoC to demonstrate the change of /proc/self/exe without CAP_SYS_ADMIN using only ptrace and execve.
> > You may find it here: https://github.com/nviennot/run_as_exe
> > 
> > What do you recommend to relax the security checks in the kernel when it comes to changing the exe link?
> 
> Looks fun! Yeah, so that this is possible is known afaict. But you're
> not really circumventing the kernel check but are mucking with the EFL
> by changing the auxv, right?
> 
> Originally, you needed to be userns root, i.e. only uid 0 could
> change the /proc/self/exe link (cf. [1]). This was changed to
> ns_capable(CAP_SYS_ADMIN) in [2].
> 
> The original reasoning in [1] is interesting as it basically already
> points to your poc:
> 
> "Still note that updating exe-file link now doesn't require sys-resource
>  capability anymore, after all there is no much profit in preventing
>  setup own file link (there are a number of ways to execute own code --
>  ptrace, ld-preload, so that the only reliable way to find which exactly
>  code is executed is to inspect running program memory).  Still we
>  require the caller to be at least user-namespace root user."
> 
> There were arguments being made that /proc/<pid>/exe needs to be sm that
> userspace can have a decent amount of trust in but I believe that that's
> not a great argument.
> 
> But let me dig a little into the original discussion and see what the
> thread-model was.
> At this point I'm starting to believe that it was people being cautios
> but better be sure.

Ok, so the original patch proposal was presented in [4] in 2014. The
final version of that patch added the PR_SET_MM_MAP we know today. The
initial version presented in [4] did not require _any_ privilege.

So the reasoning for only placing the /proc/<pid>/exe link under
ns_capable(CAP_SYS_ADMIN) is very thin. to quote from [5]:

"Controlling exe_fd without privileges may turn out to be dangerous. At
 least things like tomoyo examine it for making policy decisions (see
 tomoyo_manager())."

So yes, tomoyo_get_exe() is what this was retained for apparently:

const char *tomoyo_get_exe(void)
{
	struct file *exe_file;
	const char *cp;
	struct mm_struct *mm = current->mm;

	if (!mm)
		return NULL;
	exe_file = get_mm_exe_file(mm);
	if (!exe_file)
		return NULL;

	cp = tomoyo_realpath_from_path(&exe_file->f_path);
	fput(exe_file);
	return cp;
}

The exe path is literally used in tomoyo_manager() to verify that you
are allowed to change policy. That seems like a bad idea to me but then
again, I don't know enough about Tomoyo. In any case, I think that means
we can't remove CAP_SYS_ADMIN because that would make things worse than
they are right now for Tomoyo but I also don't see why placing this
under ns_capable(CAP_SYS_ADMIN) || ns_capable(CAP_CHECKPOINT_RESTORE)
would make this any worse.

And Cyrill (and later in that thread Andrei) already mentioned it in [6]:
"@exe_fd is just a hint and as I mentioned if we have ptrace/preload
 rights there damn a lot of ways to inject own code into any program so
 that a user won't even notice ;)"

Another place where the exe file is relevant is for the coredump with
the -E option. But it only uses the path when generating the coredump
pattern and if that's a security issue than your poc shows that this can
already be achieved today.

Christian

> 
> [1]: f606b77f1a9e ("prctl: PR_SET_MM -- introduce PR_SET_MM_MAP operation")
> [2]: 4d28df6152aa ("prctl: Allow local CAP_SYS_ADMIN changing exe_file")
> [3]: https://lore.kernel.org/patchwork/patch/697304/

[4]: https://lore.kernel.org/lkml/20140703151102.842945837@openvz.org/ 
[5]: https://lore.kernel.org/lkml/CAGXu5jL3exT4j+8rjMv1O54uJWQ5UHL69Z-24b61rhXROqZamQ@mail.gmail.com/
[6]: https://lore.kernel.org/lkml/20140722203614.GF838@moon/

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v4 3/3] prctl: Allow ptrace capable processes to change /proc/self/exe
  2020-07-07 15:45           ` Christian Brauner
@ 2020-07-07 20:27             ` Cyrill Gorcunov
  0 siblings, 0 replies; 17+ messages in thread
From: Cyrill Gorcunov @ 2020-07-07 20:27 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Nicolas Viennot, Jann Horn, Paul Moore, Serge E. Hallyn,
	Adrian Reber, Eric Biederman, Pavel Emelyanov, Oleg Nesterov,
	Dmitry Safonov, Andrei Vagin, Michał Cłapiński,
	Kamil Yurtsever, Dirk Petersen, Christine Flood, Casey Schaufler,
	Mike Rapoport, Radostin Stoyanov, Stephen Smalley,
	Sargun Dhillon, Arnd Bergmann, linux-security-module,
	linux-kernel, selinux, Eric Paris, linux-fsdevel

On Tue, Jul 07, 2020 at 05:45:04PM +0200, Christian Brauner wrote:
...
> 
> Ok, so the original patch proposal was presented in [4] in 2014. The
> final version of that patch added the PR_SET_MM_MAP we know today. The
> initial version presented in [4] did not require _any_ privilege.
> 

True. I still think that relyng on /proc/<pid>/exe being immutable (or
guarded by caps) in a sake of security is a bit misleading, this link
only a hint without any guarantees of what code is being executed once
we pass cs:rip to userspace right after exec is completed. Nowadays I rather
think we might need to call audit_log() here or something similar to point
that exe link is changed (by criu or someone else) and simply notify
node's administrator, that's all. But as you pointed tomoyo may be
affected if we simply drops all caps from here. Thus I agree that
the new cap won't make situation worse.

Still I'm not in touch with kernel code for a couple of years already
and might be missing something obvious here.

^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, back to index

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-07-01  6:49 [PATCH v4 0/3] capabilities: Introduce CAP_CHECKPOINT_RESTORE Adrian Reber
2020-07-01  6:49 ` [PATCH v4 1/3] " Adrian Reber
2020-07-01  8:27   ` Christian Brauner
2020-07-03 11:11     ` Adrian Reber
2020-07-01  6:49 ` [PATCH v4 2/3] selftests: add clone3() CAP_CHECKPOINT_RESTORE test Adrian Reber
2020-07-02 20:53   ` Serge E. Hallyn
2020-07-03 11:18     ` Adrian Reber
2020-07-03 18:12       ` Serge E. Hallyn
2020-07-01  6:49 ` [PATCH v4 3/3] prctl: Allow ptrace capable processes to change /proc/self/exe Adrian Reber
2020-07-01  8:55   ` Christian Brauner
2020-07-02 21:58     ` Serge E. Hallyn
2020-07-02 21:16   ` Serge E. Hallyn
2020-07-02 22:00     ` Paul Moore
2020-07-06 17:13       ` Nicolas Viennot
2020-07-06 17:44         ` Christian Brauner
2020-07-07 15:45           ` Christian Brauner
2020-07-07 20:27             ` Cyrill Gorcunov

Linux-Security-Module Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/linux-security-module/0 linux-security-module/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 linux-security-module linux-security-module/ https://lore.kernel.org/linux-security-module \
		linux-security-module@vger.kernel.org
	public-inbox-index linux-security-module

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.linux-security-module


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git