Linux-Security-Module Archive on lore.kernel.org
 help / color / Atom feed
* [PATCH] capabilities: Introduce CAP_RESTORE
@ 2020-05-22  5:53 Adrian Reber
  2020-05-22  7:53 ` Christian Brauner
                   ` (2 more replies)
  0 siblings, 3 replies; 8+ messages in thread
From: Adrian Reber @ 2020-05-22  5:53 UTC (permalink / raw)
  To: Christian Brauner, Eric Biederman, Pavel Emelyanov,
	Oleg Nesterov, Dmitry Safonov, Andrei Vagin, Nicolas Viennot,
	Michał Cłapiński, Kamil Yurtsever, Dirk Petersen,
	Christine Flood
  Cc: Mike Rapoport, Radostin Stoyanov, Adrian Reber, Cyrill Gorcunov,
	Serge Hallyn, Stephen Smalley, Sargun Dhillon, Arnd Bergmann,
	Aaron Goidel, linux-security-module, linux-kernel, selinux,
	Eric Paris, Jann Horn

This enables CRIU to checkpoint and restore a process as non-root.

Over the last years CRIU upstream has been asked a couple of time if it
is possible to checkpoint and restore a process as non-root. The answer
usually was: 'almost'.

The main blocker to restore a process was that selecting the PID of the
restored process, which is necessary for CRIU, is guarded by CAP_SYS_ADMIN.

In the last two years the questions about checkpoint/restore as non-root
have increased and especially in the last few months we have seen
multiple people inventing workarounds.

The use-cases so far and their workarounds:

 * Checkpoint/Restore in an HPC environment in combination with
   a resource manager distributing jobs. Users are always running
   as non root, but there was the desire to provide a way to
   checkpoint and restore long running jobs.
   Workaround: setuid wrapper to start CRIU as root as non-root
   https://github.com/FredHutch/slurm-examples/blob/master/checkpointer/lib/checkpointer/checkpointer-suid.c
 * Another use case to checkpoint/restore processes as non-root
   uses as workaround a non privileged process which cycles through
   PIDs by calling fork() as fast as possible with a rate of
   100,000 pids/s instead of writing to ns_last_pid
   https://github.com/twosigma/set_ns_last_pid
 * Fast Java startup using checkpoint/restore.
   We have been in contact with JVM developers who are integrating
   CRIU into a JVM to decrease the startup time.
   Workaround so far: patch out CAP_SYS_ADMIN checks in the kernel
 * Container migration as non root. There are people already
   using CRIU to migrate containers as non-root. The solution
   there is to run it in a user namespace. So if you are able
   to carefully setup your environment with the namespaces
   it is already possible to restore a container/process as non-root.
   Unfortunately it is not always possible to setup an environment
   in such a way and for easier access to non-root based container
   migration this patch is also required.

There are probably a few more things guarded by CAP_SYS_ADMIN required
to run checkpoint/restore as non-root, but by applying this patch I can
already checkpoint and restore processes as non-root. As there are
already multiple workarounds I would prefer to do it correctly in the
kernel to avoid that CRIU users are starting to invent more workarounds.

I have used the following tests to verify that this change works as
expected by setting the new capability CAP_RESTORE on the two resulting
test binaries:

$ cat ns_last_pid.c
 // http://efiop-notes.blogspot.com/2014/06/how-to-set-pid-using-nslastpid.html
 #include <stdio.h>
 #include <stdlib.h>
 #include <string.h>
 #include <sys/file.h>
 #include <sys/types.h>
 #include <unistd.h>

int main(int argc, char *argv[])
{
	pid_t pid, new_pid;
	char buf[32];
	int fd;

	if (argc != 2)
		return 1;

	printf("Opening ns_last_pid...\n");
	fd = open("/proc/sys/kernel/ns_last_pid", O_RDWR | O_CREAT, 0644);
	if (fd < 0) {
		perror("Cannot open ns_last_pid");
		return 1;
	}

	printf("Locking ns_last_pid...\n");
	if (flock(fd, LOCK_EX)) {
		close(fd);
		printf("Cannot lock ns_last_pid\n");
		return 1;
	}

	pid = atoi(argv[1]);
	snprintf(buf, sizeof(buf), "%d", pid - 1);
	printf("Writing pid-1 to ns_last_pid...\n");
	if (write(fd, buf, strlen(buf)) != strlen(buf)) {
		printf("Cannot write to buf\n");
		return 1;
	}

	printf("Forking...\n");
	new_pid = fork();
	if (new_pid == 0) {
		printf("I am the child!\n");
		exit(0);
	} else if (new_pid == pid)
		printf("I am the parent. My child got the pid %d!\n", new_pid);
	else
		printf("pid (%d) does not match expected pid (%d)\n", new_pid, pid);

	printf("Cleaning up...\n");
	if (flock(fd, LOCK_UN))
		printf("Cannot unlock\n");
	close(fd);
	return 0;
}
$ id -u; /home/libcap/ns_last_pid 300000
1001
Opening ns_last_pid...
Locking ns_last_pid...
Writing pid-1 to ns_last_pid...
Forking...
I am the parent. My child got the pid 300000!
I am the child!
Cleaning up...

For the clone3() based approach:
$ cat clone3_set_tid.c
 #define _GNU_SOURCE
 #include <linux/sched.h>
 #include <stdint.h>
 #include <stdio.h>
 #include <stdlib.h>
 #include <string.h>
 #include <sys/types.h>
 #include <sys/stat.h>
 #include <sys/syscall.h>
 #include <unistd.h>

 #define ptr_to_u64(ptr) ((__u64)((uintptr_t)(ptr)))

int main(int argc, char *argv[])
{
	struct clone_args c_args = { };
	pid_t pid, new_pid;

	if (argc != 2)
		return 1;

	pid = atoi(argv[1]);
	c_args.set_tid = ptr_to_u64(&pid);
	c_args.set_tid_size = 1;

	printf("Forking...\n");
	new_pid = syscall(__NR_clone3, &c_args, sizeof(c_args));
	if (new_pid == 0) {
		printf("I am the child!\n");
		exit(0);
	} else if (new_pid == pid)
		printf("I am the parent. My child got the pid %d!\n", new_pid);
	else
		printf("pid (%d) does not match expected pid (%d)\n", new_pid, pid);
	printf("Done\n");

	return 0;
}
$ id -u; /home/libcap/clone3_set_tid 300000
1001
Forking...
I am the parent. My child got the pid 300000!
Done
I am the child!

Signed-off-by: Adrian Reber <areber@redhat.com>
---
 include/linux/capability.h          | 5 +++++
 include/uapi/linux/capability.h     | 9 ++++++++-
 kernel/pid.c                        | 2 +-
 kernel/pid_namespace.c              | 2 +-
 security/selinux/include/classmap.h | 5 +++--
 5 files changed, 18 insertions(+), 5 deletions(-)

diff --git a/include/linux/capability.h b/include/linux/capability.h
index b4345b38a6be..1278313cb2bc 100644
--- a/include/linux/capability.h
+++ b/include/linux/capability.h
@@ -261,6 +261,11 @@ static inline bool bpf_capable(void)
 	return capable(CAP_BPF) || capable(CAP_SYS_ADMIN);
 }
 
+static inline bool restore_ns_capable(struct user_namespace *ns)
+{
+	return ns_capable(ns, CAP_RESTORE) || ns_capable(ns, CAP_SYS_ADMIN);
+}
+
 /* audit system wants to get cap info from files as well */
 extern int get_vfs_caps_from_disk(const struct dentry *dentry, struct cpu_vfs_cap_data *cpu_caps);
 
diff --git a/include/uapi/linux/capability.h b/include/uapi/linux/capability.h
index c7372180a0a9..4bcc4e3d41ff 100644
--- a/include/uapi/linux/capability.h
+++ b/include/uapi/linux/capability.h
@@ -406,7 +406,14 @@ struct vfs_ns_cap_data {
  */
 #define CAP_BPF			39
 
-#define CAP_LAST_CAP         CAP_BPF
+
+/* Allow checkpoint/restore related operations */
+/* Allow PID selection during clone3() */
+/* Allow writing to ns_last_pid */
+
+#define CAP_RESTORE		40
+
+#define CAP_LAST_CAP         CAP_RESTORE
 
 #define cap_valid(x) ((x) >= 0 && (x) <= CAP_LAST_CAP)
 
diff --git a/kernel/pid.c b/kernel/pid.c
index 3122043fe364..bbc26f2bcff6 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -198,7 +198,7 @@ struct pid *alloc_pid(struct pid_namespace *ns, pid_t *set_tid,
 			if (tid != 1 && !tmp->child_reaper)
 				goto out_free;
 			retval = -EPERM;
-			if (!ns_capable(tmp->user_ns, CAP_SYS_ADMIN))
+			if (!restore_ns_capable(tmp->user_ns))
 				goto out_free;
 			set_tid_size--;
 		}
diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
index 0e5ac162c3a8..f58186b31ce6 100644
--- a/kernel/pid_namespace.c
+++ b/kernel/pid_namespace.c
@@ -269,7 +269,7 @@ static int pid_ns_ctl_handler(struct ctl_table *table, int write,
 	struct ctl_table tmp = *table;
 	int ret, next;
 
-	if (write && !ns_capable(pid_ns->user_ns, CAP_SYS_ADMIN))
+	if (write && !restore_ns_capable(pid_ns->user_ns))
 		return -EPERM;
 
 	/*
diff --git a/security/selinux/include/classmap.h b/security/selinux/include/classmap.h
index 98e1513b608a..f8b8f12a6ebd 100644
--- a/security/selinux/include/classmap.h
+++ b/security/selinux/include/classmap.h
@@ -27,9 +27,10 @@
 	    "audit_control", "setfcap"
 
 #define COMMON_CAP2_PERMS  "mac_override", "mac_admin", "syslog", \
-		"wake_alarm", "block_suspend", "audit_read", "perfmon", "bpf"
+		"wake_alarm", "block_suspend", "audit_read", "perfmon", "bpf", \
+		"restore"
 
-#if CAP_LAST_CAP > CAP_BPF
+#if CAP_LAST_CAP > CAP_RESTORE
 #error New capability defined, please update COMMON_CAP2_PERMS.
 #endif
 

base-commit: e8f3274774b45b5f4e9e3d5cad7ff9f43ae3add5
-- 
2.26.2


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] capabilities: Introduce CAP_RESTORE
  2020-05-22  5:53 [PATCH] capabilities: Introduce CAP_RESTORE Adrian Reber
@ 2020-05-22  7:53 ` Christian Brauner
  2020-05-22 18:02   ` Andrei Vagin
  2020-05-22 13:41 ` Christian Brauner
  2020-05-22 16:40 ` Casey Schaufler
  2 siblings, 1 reply; 8+ messages in thread
From: Christian Brauner @ 2020-05-22  7:53 UTC (permalink / raw)
  To: Adrian Reber
  Cc: Eric Biederman, Pavel Emelyanov, Oleg Nesterov, Dmitry Safonov,
	Andrei Vagin, Nicolas Viennot, Michał Cłapiński,
	Kamil Yurtsever, Dirk Petersen, Christine Flood, Mike Rapoport,
	Radostin Stoyanov, Cyrill Gorcunov, Serge Hallyn,
	Stephen Smalley, Sargun Dhillon, Arnd Bergmann, Aaron Goidel,
	linux-security-module, linux-kernel, selinux, Eric Paris,
	Jann Horn

On Fri, May 22, 2020 at 07:53:50AM +0200, Adrian Reber wrote:
> This enables CRIU to checkpoint and restore a process as non-root.
> 
> Over the last years CRIU upstream has been asked a couple of time if it
> is possible to checkpoint and restore a process as non-root. The answer
> usually was: 'almost'.
> 
> The main blocker to restore a process was that selecting the PID of the
> restored process, which is necessary for CRIU, is guarded by CAP_SYS_ADMIN.
> 
> In the last two years the questions about checkpoint/restore as non-root
> have increased and especially in the last few months we have seen
> multiple people inventing workarounds.
> 
> The use-cases so far and their workarounds:
> 
>  * Checkpoint/Restore in an HPC environment in combination with
>    a resource manager distributing jobs. Users are always running
>    as non root, but there was the desire to provide a way to
>    checkpoint and restore long running jobs.
>    Workaround: setuid wrapper to start CRIU as root as non-root
>    https://github.com/FredHutch/slurm-examples/blob/master/checkpointer/lib/checkpointer/checkpointer-suid.c
>  * Another use case to checkpoint/restore processes as non-root
>    uses as workaround a non privileged process which cycles through
>    PIDs by calling fork() as fast as possible with a rate of
>    100,000 pids/s instead of writing to ns_last_pid
>    https://github.com/twosigma/set_ns_last_pid
>  * Fast Java startup using checkpoint/restore.
>    We have been in contact with JVM developers who are integrating
>    CRIU into a JVM to decrease the startup time.
>    Workaround so far: patch out CAP_SYS_ADMIN checks in the kernel
>  * Container migration as non root. There are people already
>    using CRIU to migrate containers as non-root. The solution
>    there is to run it in a user namespace. So if you are able
>    to carefully setup your environment with the namespaces
>    it is already possible to restore a container/process as non-root.
>    Unfortunately it is not always possible to setup an environment
>    in such a way and for easier access to non-root based container
>    migration this patch is also required.
> 
> There are probably a few more things guarded by CAP_SYS_ADMIN required
> to run checkpoint/restore as non-root, but by applying this patch I can
> already checkpoint and restore processes as non-root. As there are
> already multiple workarounds I would prefer to do it correctly in the
> kernel to avoid that CRIU users are starting to invent more workarounds.

It sounds ok to me as long as this feature is guarded by any sensible
capability. I don't want users to be able to randomly choose their pid
without any capability required.

We've heard the plea for unprivileged checkpoint/restore through the
grapevine and a few times about CAP_RESTORE at plumbers but it's one of
those cases where nobody pushed for it so it's urgency was questionable.
This is 5.9 material though and could you please add selftests?

It also seems you have future changes planned that would make certain
things accessible via CAP_RESTORE that are currently guarded by other
capabilities. Any specific things in mind? It might be worth knowing
what we'd be getting ourselves into if you're planning on flipping
switches in other places.

Christian

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] capabilities: Introduce CAP_RESTORE
  2020-05-22  5:53 [PATCH] capabilities: Introduce CAP_RESTORE Adrian Reber
  2020-05-22  7:53 ` Christian Brauner
@ 2020-05-22 13:41 ` Christian Brauner
  2020-05-22 16:40 ` Casey Schaufler
  2 siblings, 0 replies; 8+ messages in thread
From: Christian Brauner @ 2020-05-22 13:41 UTC (permalink / raw)
  To: Adrian Reber
  Cc: Eric Biederman, Pavel Emelyanov, Oleg Nesterov, Dmitry Safonov,
	Andrei Vagin, Nicolas Viennot, Michał Cłapiński,
	Kamil Yurtsever, Dirk Petersen, Christine Flood, Mike Rapoport,
	Radostin Stoyanov, Cyrill Gorcunov, Serge Hallyn,
	Stephen Smalley, Sargun Dhillon, Arnd Bergmann, Aaron Goidel,
	linux-security-module, linux-kernel, selinux, Eric Paris,
	Jann Horn

On Fri, May 22, 2020 at 07:53:50AM +0200, Adrian Reber wrote:
> This enables CRIU to checkpoint and restore a process as non-root.
> 
> Over the last years CRIU upstream has been asked a couple of time if it
> is possible to checkpoint and restore a process as non-root. The answer
> usually was: 'almost'.
> 
> The main blocker to restore a process was that selecting the PID of the
> restored process, which is necessary for CRIU, is guarded by CAP_SYS_ADMIN.
> 
> In the last two years the questions about checkpoint/restore as non-root
> have increased and especially in the last few months we have seen
> multiple people inventing workarounds.
> 
> The use-cases so far and their workarounds:
> 
>  * Checkpoint/Restore in an HPC environment in combination with
>    a resource manager distributing jobs. Users are always running
>    as non root, but there was the desire to provide a way to
>    checkpoint and restore long running jobs.
>    Workaround: setuid wrapper to start CRIU as root as non-root
>    https://github.com/FredHutch/slurm-examples/blob/master/checkpointer/lib/checkpointer/checkpointer-suid.c
>  * Another use case to checkpoint/restore processes as non-root
>    uses as workaround a non privileged process which cycles through
>    PIDs by calling fork() as fast as possible with a rate of
>    100,000 pids/s instead of writing to ns_last_pid
>    https://github.com/twosigma/set_ns_last_pid
>  * Fast Java startup using checkpoint/restore.
>    We have been in contact with JVM developers who are integrating
>    CRIU into a JVM to decrease the startup time.
>    Workaround so far: patch out CAP_SYS_ADMIN checks in the kernel
>  * Container migration as non root. There are people already
>    using CRIU to migrate containers as non-root. The solution
>    there is to run it in a user namespace. So if you are able
>    to carefully setup your environment with the namespaces
>    it is already possible to restore a container/process as non-root.
>    Unfortunately it is not always possible to setup an environment
>    in such a way and for easier access to non-root based container
>    migration this patch is also required.
> 
> There are probably a few more things guarded by CAP_SYS_ADMIN required
> to run checkpoint/restore as non-root, but by applying this patch I can
> already checkpoint and restore processes as non-root. As there are
> already multiple workarounds I would prefer to do it correctly in the
> kernel to avoid that CRIU users are starting to invent more workarounds.
> 
> I have used the following tests to verify that this change works as
> expected by setting the new capability CAP_RESTORE on the two resulting
> test binaries:
> 
> $ cat ns_last_pid.c
>  // http://efiop-notes.blogspot.com/2014/06/how-to-set-pid-using-nslastpid.html
>  #include <stdio.h>
>  #include <stdlib.h>
>  #include <string.h>
>  #include <sys/file.h>
>  #include <sys/types.h>
>  #include <unistd.h>
> 
> int main(int argc, char *argv[])
> {
> 	pid_t pid, new_pid;
> 	char buf[32];
> 	int fd;
> 
> 	if (argc != 2)
> 		return 1;
> 
> 	printf("Opening ns_last_pid...\n");
> 	fd = open("/proc/sys/kernel/ns_last_pid", O_RDWR | O_CREAT, 0644);
> 	if (fd < 0) {
> 		perror("Cannot open ns_last_pid");
> 		return 1;
> 	}
> 
> 	printf("Locking ns_last_pid...\n");
> 	if (flock(fd, LOCK_EX)) {
> 		close(fd);
> 		printf("Cannot lock ns_last_pid\n");
> 		return 1;
> 	}
> 
> 	pid = atoi(argv[1]);
> 	snprintf(buf, sizeof(buf), "%d", pid - 1);
> 	printf("Writing pid-1 to ns_last_pid...\n");
> 	if (write(fd, buf, strlen(buf)) != strlen(buf)) {
> 		printf("Cannot write to buf\n");
> 		return 1;
> 	}
> 
> 	printf("Forking...\n");
> 	new_pid = fork();
> 	if (new_pid == 0) {
> 		printf("I am the child!\n");
> 		exit(0);
> 	} else if (new_pid == pid)
> 		printf("I am the parent. My child got the pid %d!\n", new_pid);
> 	else
> 		printf("pid (%d) does not match expected pid (%d)\n", new_pid, pid);
> 
> 	printf("Cleaning up...\n");
> 	if (flock(fd, LOCK_UN))
> 		printf("Cannot unlock\n");
> 	close(fd);
> 	return 0;
> }
> $ id -u; /home/libcap/ns_last_pid 300000
> 1001
> Opening ns_last_pid...
> Locking ns_last_pid...
> Writing pid-1 to ns_last_pid...
> Forking...
> I am the parent. My child got the pid 300000!
> I am the child!
> Cleaning up...
> 
> For the clone3() based approach:
> $ cat clone3_set_tid.c
>  #define _GNU_SOURCE
>  #include <linux/sched.h>
>  #include <stdint.h>
>  #include <stdio.h>
>  #include <stdlib.h>
>  #include <string.h>
>  #include <sys/types.h>
>  #include <sys/stat.h>
>  #include <sys/syscall.h>
>  #include <unistd.h>
> 
>  #define ptr_to_u64(ptr) ((__u64)((uintptr_t)(ptr)))
> 
> int main(int argc, char *argv[])
> {
> 	struct clone_args c_args = { };
> 	pid_t pid, new_pid;
> 
> 	if (argc != 2)
> 		return 1;
> 
> 	pid = atoi(argv[1]);
> 	c_args.set_tid = ptr_to_u64(&pid);
> 	c_args.set_tid_size = 1;
> 
> 	printf("Forking...\n");
> 	new_pid = syscall(__NR_clone3, &c_args, sizeof(c_args));
> 	if (new_pid == 0) {
> 		printf("I am the child!\n");
> 		exit(0);
> 	} else if (new_pid == pid)
> 		printf("I am the parent. My child got the pid %d!\n", new_pid);
> 	else
> 		printf("pid (%d) does not match expected pid (%d)\n", new_pid, pid);
> 	printf("Done\n");
> 
> 	return 0;
> }
> $ id -u; /home/libcap/clone3_set_tid 300000
> 1001
> Forking...
> I am the parent. My child got the pid 300000!
> Done
> I am the child!
> 
> Signed-off-by: Adrian Reber <areber@redhat.com>
> ---
>  include/linux/capability.h          | 5 +++++
>  include/uapi/linux/capability.h     | 9 ++++++++-
>  kernel/pid.c                        | 2 +-
>  kernel/pid_namespace.c              | 2 +-
>  security/selinux/include/classmap.h | 5 +++--
>  5 files changed, 18 insertions(+), 5 deletions(-)
> 
> diff --git a/include/linux/capability.h b/include/linux/capability.h
> index b4345b38a6be..1278313cb2bc 100644
> --- a/include/linux/capability.h
> +++ b/include/linux/capability.h
> @@ -261,6 +261,11 @@ static inline bool bpf_capable(void)
>  	return capable(CAP_BPF) || capable(CAP_SYS_ADMIN);
>  }
>  
> +static inline bool restore_ns_capable(struct user_namespace *ns)
> +{
> +	return ns_capable(ns, CAP_RESTORE) || ns_capable(ns, CAP_SYS_ADMIN);
> +}
> +

When did everyone start putting their favorite little helpers in here?
I see both bpf_capable() and perform_capable() are going to be in
linux/capability.h if linux-next is to be believed. Seems they could
very well go in their own bpf headers.
Anyway, if everyone is fine with putting specific helpers in
capability.h then ok.


>  /* audit system wants to get cap info from files as well */
>  extern int get_vfs_caps_from_disk(const struct dentry *dentry, struct cpu_vfs_cap_data *cpu_caps);
>  
> diff --git a/include/uapi/linux/capability.h b/include/uapi/linux/capability.h
> index c7372180a0a9..4bcc4e3d41ff 100644
> --- a/include/uapi/linux/capability.h
> +++ b/include/uapi/linux/capability.h
> @@ -406,7 +406,14 @@ struct vfs_ns_cap_data {
>   */
>  #define CAP_BPF			39
>  
> -#define CAP_LAST_CAP         CAP_BPF
> +
> +/* Allow checkpoint/restore related operations */
> +/* Allow PID selection during clone3() */
> +/* Allow writing to ns_last_pid */
> +
> +#define CAP_RESTORE		40

Thinking about what Adrian said this should probably really be:

CAP_CHECKPOINT_RESTORE

I initially thought that might be too long but our config option is the
same name CONFIG_CHECKPOINT_RESTORE so seems ok and other capabilities
have similar long names (CAP_DAC_READ_SEARCH, CAP_NET_BIND_SERVICE come
to mind)

> +
> +#define CAP_LAST_CAP         CAP_RESTORE
>  
>  #define cap_valid(x) ((x) >= 0 && (x) <= CAP_LAST_CAP)
>  
> diff --git a/kernel/pid.c b/kernel/pid.c
> index 3122043fe364..bbc26f2bcff6 100644
> --- a/kernel/pid.c
> +++ b/kernel/pid.c
> @@ -198,7 +198,7 @@ struct pid *alloc_pid(struct pid_namespace *ns, pid_t *set_tid,
>  			if (tid != 1 && !tmp->child_reaper)
>  				goto out_free;
>  			retval = -EPERM;
> -			if (!ns_capable(tmp->user_ns, CAP_SYS_ADMIN))
> +			if (!restore_ns_capable(tmp->user_ns))
>  				goto out_free;
>  			set_tid_size--;
>  		}
> diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
> index 0e5ac162c3a8..f58186b31ce6 100644
> --- a/kernel/pid_namespace.c
> +++ b/kernel/pid_namespace.c
> @@ -269,7 +269,7 @@ static int pid_ns_ctl_handler(struct ctl_table *table, int write,
>  	struct ctl_table tmp = *table;
>  	int ret, next;
>  
> -	if (write && !ns_capable(pid_ns->user_ns, CAP_SYS_ADMIN))
> +	if (write && !restore_ns_capable(pid_ns->user_ns))
>  		return -EPERM;
>  
>  	/*
> diff --git a/security/selinux/include/classmap.h b/security/selinux/include/classmap.h
> index 98e1513b608a..f8b8f12a6ebd 100644
> --- a/security/selinux/include/classmap.h
> +++ b/security/selinux/include/classmap.h
> @@ -27,9 +27,10 @@
>  	    "audit_control", "setfcap"
>  
>  #define COMMON_CAP2_PERMS  "mac_override", "mac_admin", "syslog", \
> -		"wake_alarm", "block_suspend", "audit_read", "perfmon", "bpf"
> +		"wake_alarm", "block_suspend", "audit_read", "perfmon", "bpf", \
> +		"restore"
>  
> -#if CAP_LAST_CAP > CAP_BPF
> +#if CAP_LAST_CAP > CAP_RESTORE
>  #error New capability defined, please update COMMON_CAP2_PERMS.
>  #endif
>  
> 
> base-commit: e8f3274774b45b5f4e9e3d5cad7ff9f43ae3add5
> -- 
> 2.26.2
> 

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] capabilities: Introduce CAP_RESTORE
  2020-05-22  5:53 [PATCH] capabilities: Introduce CAP_RESTORE Adrian Reber
  2020-05-22  7:53 ` Christian Brauner
  2020-05-22 13:41 ` Christian Brauner
@ 2020-05-22 16:40 ` Casey Schaufler
  2020-05-23  4:27   ` Andrei Vagin
  2020-05-25  8:05   ` Adrian Reber
  2 siblings, 2 replies; 8+ messages in thread
From: Casey Schaufler @ 2020-05-22 16:40 UTC (permalink / raw)
  To: Adrian Reber, Christian Brauner, Eric Biederman, Pavel Emelyanov,
	Oleg Nesterov, Dmitry Safonov, Andrei Vagin, Nicolas Viennot,
	Michał Cłapiński, Kamil Yurtsever, Dirk Petersen,
	Christine Flood
  Cc: Mike Rapoport, Radostin Stoyanov, Cyrill Gorcunov, Serge Hallyn,
	Stephen Smalley, Sargun Dhillon, Arnd Bergmann, Aaron Goidel,
	linux-security-module, linux-kernel, selinux, Eric Paris,
	Jann Horn, Casey Schaufler

On 5/21/2020 10:53 PM, Adrian Reber wrote:
> This enables CRIU to checkpoint and restore a process as non-root.

I know it sounds pedantic, but could you spell out CRIU once?
While I know that everyone who cares either knows or can guess
what you're talking about, it may be a mystery to some of the
newer kernel developers.

> Over the last years CRIU upstream has been asked a couple of time if it
> is possible to checkpoint and restore a process as non-root. The answer
> usually was: 'almost'.
>
> The main blocker to restore a process was that selecting the PID of the
> restored process, which is necessary for CRIU, is guarded by CAP_SYS_ADMIN.

What are the other blockers? Are you going to suggest additional new
capabilities to clear them?

> In the last two years the questions about checkpoint/restore as non-root
> have increased and especially in the last few months we have seen
> multiple people inventing workarounds.

Giving a process CAP_SYS_ADMIN is a non-root solution.

> The use-cases so far and their workarounds:
>
>  * Checkpoint/Restore in an HPC environment in combination with
>    a resource manager distributing jobs. Users are always running
>    as non root, but there was the desire to provide a way to
>    checkpoint and restore long running jobs.
>    Workaround: setuid wrapper to start CRIU as root as non-root
>    https://github.com/FredHutch/slurm-examples/blob/master/checkpointer/lib/checkpointer/checkpointer-suid.c

This is a classic and well understood mechanism for dealing with
this kind of situation. You could have checkpointer-filecap-sys_admin.c
instead, if you want to reduce use of the super-user.

> * Another use case to checkpoint/restore processes as non-root
>    uses as workaround a non privileged process which cycles through
>    PIDs by calling fork() as fast as possible with a rate of
>    100,000 pids/s instead of writing to ns_last_pid
>    https://github.com/twosigma/set_ns_last_pid

Oh dear.

>  * Fast Java startup using checkpoint/restore.
>    We have been in contact with JVM developers who are integrating
>    CRIU into a JVM to decrease the startup time.
>    Workaround so far: patch out CAP_SYS_ADMIN checks in the kernel

That's not a workaround, it's a policy violation.
Bad JVM! No biscuit!

>  * Container migration as non root. There are people already
>    using CRIU to migrate containers as non-root. The solution
>    there is to run it in a user namespace. So if you are able
>    to carefully setup your environment with the namespaces
>    it is already possible to restore a container/process as non-root.

This is exactly the kind of situation that user namespaces are
supposed to address.

>    Unfortunately it is not always possible to setup an environment
>    in such a way and for easier access to non-root based container
>    migration this patch is also required.

If a user namespace solution is impossible or (more likely) too
expensive, there's always the checkpointer-filecap-sys_admin option.

> There are probably a few more things guarded by CAP_SYS_ADMIN required
> to run checkpoint/restore as non-root,

If you need CAP_SYS_ADMIN anyway you're not gaining anything by
separating out CAP_RESTORE.

>  but by applying this patch I can
> already checkpoint and restore processes as non-root. As there are
> already multiple workarounds I would prefer to do it correctly in the
> kernel to avoid that CRIU users are starting to invent more workarounds.

You've presented a couple of really inappropriate implementations
that would qualify as workarounds. But the other two are completely
appropriate within the system security policy. They don't "get around"
the problem, they use existing mechanisms as they are intended.



> I have used the following tests to verify that this change works as
> expected by setting the new capability CAP_RESTORE on the two resulting
> test binaries:
>
> $ cat ns_last_pid.c
>  // http://efiop-notes.blogspot.com/2014/06/how-to-set-pid-using-nslastpid.html
>  #include <stdio.h>
>  #include <stdlib.h>
>  #include <string.h>
>  #include <sys/file.h>
>  #include <sys/types.h>
>  #include <unistd.h>
>
> int main(int argc, char *argv[])
> {
> 	pid_t pid, new_pid;
> 	char buf[32];
> 	int fd;
>
> 	if (argc != 2)
> 		return 1;
>
> 	printf("Opening ns_last_pid...\n");
> 	fd = open("/proc/sys/kernel/ns_last_pid", O_RDWR | O_CREAT, 0644);
> 	if (fd < 0) {
> 		perror("Cannot open ns_last_pid");
> 		return 1;
> 	}
>
> 	printf("Locking ns_last_pid...\n");
> 	if (flock(fd, LOCK_EX)) {
> 		close(fd);
> 		printf("Cannot lock ns_last_pid\n");
> 		return 1;
> 	}
>
> 	pid = atoi(argv[1]);
> 	snprintf(buf, sizeof(buf), "%d", pid - 1);
> 	printf("Writing pid-1 to ns_last_pid...\n");
> 	if (write(fd, buf, strlen(buf)) != strlen(buf)) {
> 		printf("Cannot write to buf\n");
> 		return 1;
> 	}
>
> 	printf("Forking...\n");
> 	new_pid = fork();
> 	if (new_pid == 0) {
> 		printf("I am the child!\n");
> 		exit(0);
> 	} else if (new_pid == pid)
> 		printf("I am the parent. My child got the pid %d!\n", new_pid);
> 	else
> 		printf("pid (%d) does not match expected pid (%d)\n", new_pid, pid);
>
> 	printf("Cleaning up...\n");
> 	if (flock(fd, LOCK_UN))
> 		printf("Cannot unlock\n");
> 	close(fd);
> 	return 0;
> }
> $ id -u; /home/libcap/ns_last_pid 300000
> 1001
> Opening ns_last_pid...
> Locking ns_last_pid...
> Writing pid-1 to ns_last_pid...
> Forking...
> I am the parent. My child got the pid 300000!
> I am the child!
> Cleaning up...
>
> For the clone3() based approach:
> $ cat clone3_set_tid.c
>  #define _GNU_SOURCE
>  #include <linux/sched.h>
>  #include <stdint.h>
>  #include <stdio.h>
>  #include <stdlib.h>
>  #include <string.h>
>  #include <sys/types.h>
>  #include <sys/stat.h>
>  #include <sys/syscall.h>
>  #include <unistd.h>
>
>  #define ptr_to_u64(ptr) ((__u64)((uintptr_t)(ptr)))
>
> int main(int argc, char *argv[])
> {
> 	struct clone_args c_args = { };
> 	pid_t pid, new_pid;
>
> 	if (argc != 2)
> 		return 1;
>
> 	pid = atoi(argv[1]);
> 	c_args.set_tid = ptr_to_u64(&pid);
> 	c_args.set_tid_size = 1;
>
> 	printf("Forking...\n");
> 	new_pid = syscall(__NR_clone3, &c_args, sizeof(c_args));
> 	if (new_pid == 0) {
> 		printf("I am the child!\n");
> 		exit(0);
> 	} else if (new_pid == pid)
> 		printf("I am the parent. My child got the pid %d!\n", new_pid);
> 	else
> 		printf("pid (%d) does not match expected pid (%d)\n", new_pid, pid);
> 	printf("Done\n");
>
> 	return 0;
> }
> $ id -u; /home/libcap/clone3_set_tid 300000
> 1001
> Forking...
> I am the parent. My child got the pid 300000!
> Done
> I am the child!
>
> Signed-off-by: Adrian Reber <areber@redhat.com>
> ---
>  include/linux/capability.h          | 5 +++++
>  include/uapi/linux/capability.h     | 9 ++++++++-
>  kernel/pid.c                        | 2 +-
>  kernel/pid_namespace.c              | 2 +-
>  security/selinux/include/classmap.h | 5 +++--
>  5 files changed, 18 insertions(+), 5 deletions(-)
>
> diff --git a/include/linux/capability.h b/include/linux/capability.h
> index b4345b38a6be..1278313cb2bc 100644
> --- a/include/linux/capability.h
> +++ b/include/linux/capability.h
> @@ -261,6 +261,11 @@ static inline bool bpf_capable(void)
>  	return capable(CAP_BPF) || capable(CAP_SYS_ADMIN);
>  }
>  
> +static inline bool restore_ns_capable(struct user_namespace *ns)
> +{
> +	return ns_capable(ns, CAP_RESTORE) || ns_capable(ns, CAP_SYS_ADMIN);
> +}
> +
>  /* audit system wants to get cap info from files as well */
>  extern int get_vfs_caps_from_disk(const struct dentry *dentry, struct cpu_vfs_cap_data *cpu_caps);
>  
> diff --git a/include/uapi/linux/capability.h b/include/uapi/linux/capability.h
> index c7372180a0a9..4bcc4e3d41ff 100644
> --- a/include/uapi/linux/capability.h
> +++ b/include/uapi/linux/capability.h
> @@ -406,7 +406,14 @@ struct vfs_ns_cap_data {
>   */
>  #define CAP_BPF			39
>  
> -#define CAP_LAST_CAP         CAP_BPF
> +
> +/* Allow checkpoint/restore related operations */
> +/* Allow PID selection during clone3() */
> +/* Allow writing to ns_last_pid */
> +
> +#define CAP_RESTORE		40
> +
> +#define CAP_LAST_CAP         CAP_RESTORE
>  
>  #define cap_valid(x) ((x) >= 0 && (x) <= CAP_LAST_CAP)
>  
> diff --git a/kernel/pid.c b/kernel/pid.c
> index 3122043fe364..bbc26f2bcff6 100644
> --- a/kernel/pid.c
> +++ b/kernel/pid.c
> @@ -198,7 +198,7 @@ struct pid *alloc_pid(struct pid_namespace *ns, pid_t *set_tid,
>  			if (tid != 1 && !tmp->child_reaper)
>  				goto out_free;
>  			retval = -EPERM;
> -			if (!ns_capable(tmp->user_ns, CAP_SYS_ADMIN))
> +			if (!restore_ns_capable(tmp->user_ns))
>  				goto out_free;
>  			set_tid_size--;
>  		}
> diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
> index 0e5ac162c3a8..f58186b31ce6 100644
> --- a/kernel/pid_namespace.c
> +++ b/kernel/pid_namespace.c
> @@ -269,7 +269,7 @@ static int pid_ns_ctl_handler(struct ctl_table *table, int write,
>  	struct ctl_table tmp = *table;
>  	int ret, next;
>  
> -	if (write && !ns_capable(pid_ns->user_ns, CAP_SYS_ADMIN))
> +	if (write && !restore_ns_capable(pid_ns->user_ns))
>  		return -EPERM;
>  
>  	/*
> diff --git a/security/selinux/include/classmap.h b/security/selinux/include/classmap.h
> index 98e1513b608a..f8b8f12a6ebd 100644
> --- a/security/selinux/include/classmap.h
> +++ b/security/selinux/include/classmap.h
> @@ -27,9 +27,10 @@
>  	    "audit_control", "setfcap"
>  
>  #define COMMON_CAP2_PERMS  "mac_override", "mac_admin", "syslog", \
> -		"wake_alarm", "block_suspend", "audit_read", "perfmon", "bpf"
> +		"wake_alarm", "block_suspend", "audit_read", "perfmon", "bpf", \
> +		"restore"
>  
> -#if CAP_LAST_CAP > CAP_BPF
> +#if CAP_LAST_CAP > CAP_RESTORE
>  #error New capability defined, please update COMMON_CAP2_PERMS.
>  #endif
>  
>
> base-commit: e8f3274774b45b5f4e9e3d5cad7ff9f43ae3add5


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] capabilities: Introduce CAP_RESTORE
  2020-05-22  7:53 ` Christian Brauner
@ 2020-05-22 18:02   ` Andrei Vagin
  0 siblings, 0 replies; 8+ messages in thread
From: Andrei Vagin @ 2020-05-22 18:02 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Adrian Reber, Eric Biederman, Pavel Emelyanov, Oleg Nesterov,
	Dmitry Safonov, Nicolas Viennot, Michał Cłapiński,
	Kamil Yurtsever, Dirk Petersen, Christine Flood, Mike Rapoport,
	Radostin Stoyanov, Cyrill Gorcunov, Serge Hallyn,
	Stephen Smalley, Sargun Dhillon, Arnd Bergmann, Aaron Goidel,
	linux-security-module, linux-kernel, selinux, Eric Paris,
	Jann Horn

On Fri, May 22, 2020 at 09:53:31AM +0200, Christian Brauner wrote:
> On Fri, May 22, 2020 at 07:53:50AM +0200, Adrian Reber wrote:
> > 
> > There are probably a few more things guarded by CAP_SYS_ADMIN required
> > to run checkpoint/restore as non-root, but by applying this patch I can
> > already checkpoint and restore processes as non-root. As there are
> > already multiple workarounds I would prefer to do it correctly in the
> > kernel to avoid that CRIU users are starting to invent more workarounds.
> 
> It sounds ok to me as long as this feature is guarded by any sensible
> capability. I don't want users to be able to randomly choose their pid
> without any capability required.
> 
> We've heard the plea for unprivileged checkpoint/restore through the
> grapevine and a few times about CAP_RESTORE at plumbers but it's one of
> those cases where nobody pushed for it so it's urgency was questionable.
> This is 5.9 material though and could you please add selftests?
> 
> It also seems you have future changes planned that would make certain
> things accessible via CAP_RESTORE that are currently guarded by other
> capabilities. Any specific things in mind? It might be worth knowing
> what we'd be getting ourselves into if you're planning on flipping
> switches in other places.

/proc/pid/map_files is one of the first candidate what we need to think
about. CRIU opens files from /proc/pid/map_files to dump file mappings,
shared memory mappings, memfd files.

Right now, it is impossible to open these files without CAP_SYS_ADMIN in
the root user-namespace (proc_map_files_get_link).

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] capabilities: Introduce CAP_RESTORE
  2020-05-22 16:40 ` Casey Schaufler
@ 2020-05-23  4:27   ` Andrei Vagin
  2020-05-25  2:01     ` Casey Schaufler
  2020-05-25  8:05   ` Adrian Reber
  1 sibling, 1 reply; 8+ messages in thread
From: Andrei Vagin @ 2020-05-23  4:27 UTC (permalink / raw)
  To: Casey Schaufler
  Cc: Adrian Reber, Christian Brauner, Eric Biederman, Pavel Emelyanov,
	Oleg Nesterov, Dmitry Safonov, Nicolas Viennot,
	Michał Cłapiński, Kamil Yurtsever, Dirk Petersen,
	Christine Flood, Mike Rapoport, Radostin Stoyanov,
	Cyrill Gorcunov, Serge Hallyn, Stephen Smalley, Sargun Dhillon,
	Arnd Bergmann, Aaron Goidel, linux-security-module, linux-kernel,
	selinux, Eric Paris, Jann Horn

On Fri, May 22, 2020 at 09:40:37AM -0700, Casey Schaufler wrote:
> On 5/21/2020 10:53 PM, Adrian Reber wrote:
> > There are probably a few more things guarded by CAP_SYS_ADMIN required
> > to run checkpoint/restore as non-root,
> 
> If you need CAP_SYS_ADMIN anyway you're not gaining anything by
> separating out CAP_RESTORE.
> 
> >  but by applying this patch I can
> > already checkpoint and restore processes as non-root. As there are
> > already multiple workarounds I would prefer to do it correctly in the
> > kernel to avoid that CRIU users are starting to invent more workarounds.
> 
> You've presented a couple of really inappropriate implementations
> that would qualify as workarounds. But the other two are completely
> appropriate within the system security policy. They don't "get around"
> the problem, they use existing mechanisms as they are intended.
> 

With CAP_CHECKPOINT_RESTORE, we will need to use the same mechanisms.

The problem is that CAP_SYS_ADMIN is too wide. If a process has
CAP_SYS_ADMIN, it can do a lot of things and  the operation of forking a
process with a specified pid isn't the most dangerous one in this case.
Offten security policies don't allow to grant CAP_SYS_ADMIN to any
third-party tools even in non-root user namespaces.


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] capabilities: Introduce CAP_RESTORE
  2020-05-23  4:27   ` Andrei Vagin
@ 2020-05-25  2:01     ` Casey Schaufler
  0 siblings, 0 replies; 8+ messages in thread
From: Casey Schaufler @ 2020-05-25  2:01 UTC (permalink / raw)
  To: Andrei Vagin
  Cc: Adrian Reber, Christian Brauner, Eric Biederman, Pavel Emelyanov,
	Oleg Nesterov, Dmitry Safonov, Nicolas Viennot,
	Michał Cłapiński, Kamil Yurtsever, Dirk Petersen,
	Christine Flood, Mike Rapoport, Radostin Stoyanov,
	Cyrill Gorcunov, Serge Hallyn, Stephen Smalley, Sargun Dhillon,
	Arnd Bergmann, Aaron Goidel, linux-security-module, linux-kernel,
	selinux, Eric Paris, Jann Horn, Casey Schaufler

On 5/22/2020 9:27 PM, Andrei Vagin wrote:
> On Fri, May 22, 2020 at 09:40:37AM -0700, Casey Schaufler wrote:
>> On 5/21/2020 10:53 PM, Adrian Reber wrote:
>>> There are probably a few more things guarded by CAP_SYS_ADMIN required
>>> to run checkpoint/restore as non-root,
>> If you need CAP_SYS_ADMIN anyway you're not gaining anything by
>> separating out CAP_RESTORE.
>>
>>>  but by applying this patch I can
>>> already checkpoint and restore processes as non-root. As there are
>>> already multiple workarounds I would prefer to do it correctly in the
>>> kernel to avoid that CRIU users are starting to invent more workarounds.
>> You've presented a couple of really inappropriate implementations
>> that would qualify as workarounds. But the other two are completely
>> appropriate within the system security policy. They don't "get around"
>> the problem, they use existing mechanisms as they are intended.
>>
> With CAP_CHECKPOINT_RESTORE, we will need to use the same mechanisms.

Then why call them out as objectionable "workarounds"?

> The problem is that CAP_SYS_ADMIN is too wide.

This is well understood, and irrelevant.

If we broke out CAP_SYS_ADMIN properly we'd have hundreds of
capabilities, and no one would be able to manage the capability
sets on anything. Just breaking out of CAP_SYS_ADMIN, especially
if the process is going to need other capabilities anyway, gains
you nothing.

>  If a process has
> CAP_SYS_ADMIN, it can do a lot of things and  the operation of forking a
> process with a specified pid isn't the most dangerous one in this case.
> Offten security policies don't allow to grant CAP_SYS_ADMIN to any
> third-party tools even in non-root user namespaces.
>


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] capabilities: Introduce CAP_RESTORE
  2020-05-22 16:40 ` Casey Schaufler
  2020-05-23  4:27   ` Andrei Vagin
@ 2020-05-25  8:05   ` Adrian Reber
  1 sibling, 0 replies; 8+ messages in thread
From: Adrian Reber @ 2020-05-25  8:05 UTC (permalink / raw)
  To: Casey Schaufler
  Cc: Christian Brauner, Eric Biederman, Pavel Emelyanov,
	Oleg Nesterov, Dmitry Safonov, Andrei Vagin, Nicolas Viennot,
	Michał Cłapiński, Kamil Yurtsever, Dirk Petersen,
	Christine Flood, Mike Rapoport, Radostin Stoyanov,
	Cyrill Gorcunov, Serge Hallyn, Stephen Smalley, Sargun Dhillon,
	Arnd Bergmann, Aaron Goidel, linux-security-module, linux-kernel,
	selinux, Eric Paris, Jann Horn

On Fri, May 22, 2020 at 09:40:37AM -0700, Casey Schaufler wrote:
> On 5/21/2020 10:53 PM, Adrian Reber wrote:
> > This enables CRIU to checkpoint and restore a process as non-root.
> 
> I know it sounds pedantic, but could you spell out CRIU once?
> While I know that everyone who cares either knows or can guess
> what you're talking about, it may be a mystery to some of the
> newer kernel developers.

Sure. CRIU - Checkpoint/Restore In Userspace.

> > Over the last years CRIU upstream has been asked a couple of time if it
> > is possible to checkpoint and restore a process as non-root. The answer
> > usually was: 'almost'.
> >
> > The main blocker to restore a process was that selecting the PID of the
> > restored process, which is necessary for CRIU, is guarded by CAP_SYS_ADMIN.
> 
> What are the other blockers? Are you going to suggest additional new
> capabilities to clear them?

As mentioned somewhere else access to /proc/<pid>/map_files/ would be
helpful. Right now I am testing with a JVM and it works without root
just with the attached patch. Without access to /proc/<pid>/map_files/
not everything CRIU can do will actually work, but we are a lot closer
to what our users have been asking for.

> > In the last two years the questions about checkpoint/restore as non-root
> > have increased and especially in the last few months we have seen
> > multiple people inventing workarounds.
> 
> Giving a process CAP_SYS_ADMIN is a non-root solution.

Yes, but like mentioned somewhere else not a solution that actually
works, because CAP_SYS_ADMIN allows too much. Especially for the
checkpoint/restore case, we really need one (setting the PID of a new
process) and to make it complete a second (reading map_files).

Reading the comments in include/uapi/linux/capability.h concerning
CAP_SYS_ADMIN it allows the binary to do at least 35 things. The two
(three) I mentioned above (ns_last_pid (clone3) map_files) are not
mentioned in that list, so CAP_SYS_ADMIN allows probably much more.

To allow checkpoint/restore as non-root nobody will give CRIU
CAP_SYS_ADMIN because it is too wide.

> > The use-cases so far and their workarounds:
> >
> >  * Checkpoint/Restore in an HPC environment in combination with
> >    a resource manager distributing jobs. Users are always running
> >    as non root, but there was the desire to provide a way to
> >    checkpoint and restore long running jobs.
> >    Workaround: setuid wrapper to start CRIU as root as non-root
> >    https://github.com/FredHutch/slurm-examples/blob/master/checkpointer/lib/checkpointer/checkpointer-suid.c
> 
> This is a classic and well understood mechanism for dealing with
> this kind of situation. You could have checkpointer-filecap-sys_admin.c
> instead, if you want to reduce use of the super-user.
> 
> > * Another use case to checkpoint/restore processes as non-root
> >    uses as workaround a non privileged process which cycles through
> >    PIDs by calling fork() as fast as possible with a rate of
> >    100,000 pids/s instead of writing to ns_last_pid
> >    https://github.com/twosigma/set_ns_last_pid
> 
> Oh dear.
> 
> >  * Fast Java startup using checkpoint/restore.
> >    We have been in contact with JVM developers who are integrating
> >    CRIU into a JVM to decrease the startup time.
> >    Workaround so far: patch out CAP_SYS_ADMIN checks in the kernel
> 
> That's not a workaround, it's a policy violation.
> Bad JVM! No biscuit!

This was used as a proof of concept to see if we can checkpoint and
restore a JVM without root. Only the ns_last_pid check was removed to
see if it works and it does.

> >  * Container migration as non root. There are people already
> >    using CRIU to migrate containers as non-root. The solution
> >    there is to run it in a user namespace. So if you are able
> >    to carefully setup your environment with the namespaces
> >    it is already possible to restore a container/process as non-root.
> 
> This is exactly the kind of situation that user namespaces are
> supposed to address.
> 
> >    Unfortunately it is not always possible to setup an environment
> >    in such a way and for easier access to non-root based container
> >    migration this patch is also required.
> 
> If a user namespace solution is impossible or (more likely) too
> expensive, there's always the checkpointer-filecap-sys_admin option.

But then again we open up all of CAP_SYS_ADMIN, which is not necessary.

> > There are probably a few more things guarded by CAP_SYS_ADMIN required
> > to run checkpoint/restore as non-root,
> 
> If you need CAP_SYS_ADMIN anyway you're not gaining anything by
> separating out CAP_RESTORE.

No, as described we can checkpoint and restore a JVM with this patch and
it also solves the problem the set_ns_last_pid fork() loop daemon tries
to solve. It is not enough to support the full functionality of CRIU as
map_files is also important, but we do not need CAP_SYS_ADMIN and
CAP_RESTORE. Only CAP_RESTORE would be necessary.

With a new capability users can enable checkpoint/restore as non-root
without giving CRIU access to any of the other possibilities offered by
CAP_SYS_ADMIN. Setting a PID and map_files have been introduced for CRIU
and used to live behind CONFIG_CHECKPOINT_RESTORE. Having a capability
for checkpoint/restore would make it easier for CRIU users to run it as
non-root and make it very clear what is possible when giving CRIU the
new capability. No other things would be allowed than necessary for
checkpoint/restore. Setting a PID is most important for the restore part
and reading map_files would be helpful during checkpoint. So it actually
should be called CAP_CHECKPOINT_RESTORE as Christian mentioned in
another email.

> >  but by applying this patch I can
> > already checkpoint and restore processes as non-root. As there are
> > already multiple workarounds I would prefer to do it correctly in the
> > kernel to avoid that CRIU users are starting to invent more workarounds.
> 
> You've presented a couple of really inappropriate implementations
> that would qualify as workarounds. But the other two are completely
> appropriate within the system security policy. They don't "get around"
> the problem, they use existing mechanisms as they are intended.

I agree with the user namespace approach to be appropriate, but not the
CAP_SYS_ADMIN approach as CRIU only needs a tiny subset (2 things) of
what CAP_SYS_ADMIN allows.

> > I have used the following tests to verify that this change works as
> > expected by setting the new capability CAP_RESTORE on the two resulting
> > test binaries:
> >
> > $ cat ns_last_pid.c
> >  // http://efiop-notes.blogspot.com/2014/06/how-to-set-pid-using-nslastpid.html
> >  #include <stdio.h>
> >  #include <stdlib.h>
> >  #include <string.h>
> >  #include <sys/file.h>
> >  #include <sys/types.h>
> >  #include <unistd.h>
> >
> > int main(int argc, char *argv[])
> > {
> > 	pid_t pid, new_pid;
> > 	char buf[32];
> > 	int fd;
> >
> > 	if (argc != 2)
> > 		return 1;
> >
> > 	printf("Opening ns_last_pid...\n");
> > 	fd = open("/proc/sys/kernel/ns_last_pid", O_RDWR | O_CREAT, 0644);
> > 	if (fd < 0) {
> > 		perror("Cannot open ns_last_pid");
> > 		return 1;
> > 	}
> >
> > 	printf("Locking ns_last_pid...\n");
> > 	if (flock(fd, LOCK_EX)) {
> > 		close(fd);
> > 		printf("Cannot lock ns_last_pid\n");
> > 		return 1;
> > 	}
> >
> > 	pid = atoi(argv[1]);
> > 	snprintf(buf, sizeof(buf), "%d", pid - 1);
> > 	printf("Writing pid-1 to ns_last_pid...\n");
> > 	if (write(fd, buf, strlen(buf)) != strlen(buf)) {
> > 		printf("Cannot write to buf\n");
> > 		return 1;
> > 	}
> >
> > 	printf("Forking...\n");
> > 	new_pid = fork();
> > 	if (new_pid == 0) {
> > 		printf("I am the child!\n");
> > 		exit(0);
> > 	} else if (new_pid == pid)
> > 		printf("I am the parent. My child got the pid %d!\n", new_pid);
> > 	else
> > 		printf("pid (%d) does not match expected pid (%d)\n", new_pid, pid);
> >
> > 	printf("Cleaning up...\n");
> > 	if (flock(fd, LOCK_UN))
> > 		printf("Cannot unlock\n");
> > 	close(fd);
> > 	return 0;
> > }
> > $ id -u; /home/libcap/ns_last_pid 300000
> > 1001
> > Opening ns_last_pid...
> > Locking ns_last_pid...
> > Writing pid-1 to ns_last_pid...
> > Forking...
> > I am the parent. My child got the pid 300000!
> > I am the child!
> > Cleaning up...
> >
> > For the clone3() based approach:
> > $ cat clone3_set_tid.c
> >  #define _GNU_SOURCE
> >  #include <linux/sched.h>
> >  #include <stdint.h>
> >  #include <stdio.h>
> >  #include <stdlib.h>
> >  #include <string.h>
> >  #include <sys/types.h>
> >  #include <sys/stat.h>
> >  #include <sys/syscall.h>
> >  #include <unistd.h>
> >
> >  #define ptr_to_u64(ptr) ((__u64)((uintptr_t)(ptr)))
> >
> > int main(int argc, char *argv[])
> > {
> > 	struct clone_args c_args = { };
> > 	pid_t pid, new_pid;
> >
> > 	if (argc != 2)
> > 		return 1;
> >
> > 	pid = atoi(argv[1]);
> > 	c_args.set_tid = ptr_to_u64(&pid);
> > 	c_args.set_tid_size = 1;
> >
> > 	printf("Forking...\n");
> > 	new_pid = syscall(__NR_clone3, &c_args, sizeof(c_args));
> > 	if (new_pid == 0) {
> > 		printf("I am the child!\n");
> > 		exit(0);
> > 	} else if (new_pid == pid)
> > 		printf("I am the parent. My child got the pid %d!\n", new_pid);
> > 	else
> > 		printf("pid (%d) does not match expected pid (%d)\n", new_pid, pid);
> > 	printf("Done\n");
> >
> > 	return 0;
> > }
> > $ id -u; /home/libcap/clone3_set_tid 300000
> > 1001
> > Forking...
> > I am the parent. My child got the pid 300000!
> > Done
> > I am the child!
> >
> > Signed-off-by: Adrian Reber <areber@redhat.com>
> > ---
> >  include/linux/capability.h          | 5 +++++
> >  include/uapi/linux/capability.h     | 9 ++++++++-
> >  kernel/pid.c                        | 2 +-
> >  kernel/pid_namespace.c              | 2 +-
> >  security/selinux/include/classmap.h | 5 +++--
> >  5 files changed, 18 insertions(+), 5 deletions(-)
> >
> > diff --git a/include/linux/capability.h b/include/linux/capability.h
> > index b4345b38a6be..1278313cb2bc 100644
> > --- a/include/linux/capability.h
> > +++ b/include/linux/capability.h
> > @@ -261,6 +261,11 @@ static inline bool bpf_capable(void)
> >  	return capable(CAP_BPF) || capable(CAP_SYS_ADMIN);
> >  }
> >  
> > +static inline bool restore_ns_capable(struct user_namespace *ns)
> > +{
> > +	return ns_capable(ns, CAP_RESTORE) || ns_capable(ns, CAP_SYS_ADMIN);
> > +}
> > +
> >  /* audit system wants to get cap info from files as well */
> >  extern int get_vfs_caps_from_disk(const struct dentry *dentry, struct cpu_vfs_cap_data *cpu_caps);
> >  
> > diff --git a/include/uapi/linux/capability.h b/include/uapi/linux/capability.h
> > index c7372180a0a9..4bcc4e3d41ff 100644
> > --- a/include/uapi/linux/capability.h
> > +++ b/include/uapi/linux/capability.h
> > @@ -406,7 +406,14 @@ struct vfs_ns_cap_data {
> >   */
> >  #define CAP_BPF			39
> >  
> > -#define CAP_LAST_CAP         CAP_BPF
> > +
> > +/* Allow checkpoint/restore related operations */
> > +/* Allow PID selection during clone3() */
> > +/* Allow writing to ns_last_pid */
> > +
> > +#define CAP_RESTORE		40
> > +
> > +#define CAP_LAST_CAP         CAP_RESTORE
> >  
> >  #define cap_valid(x) ((x) >= 0 && (x) <= CAP_LAST_CAP)
> >  
> > diff --git a/kernel/pid.c b/kernel/pid.c
> > index 3122043fe364..bbc26f2bcff6 100644
> > --- a/kernel/pid.c
> > +++ b/kernel/pid.c
> > @@ -198,7 +198,7 @@ struct pid *alloc_pid(struct pid_namespace *ns, pid_t *set_tid,
> >  			if (tid != 1 && !tmp->child_reaper)
> >  				goto out_free;
> >  			retval = -EPERM;
> > -			if (!ns_capable(tmp->user_ns, CAP_SYS_ADMIN))
> > +			if (!restore_ns_capable(tmp->user_ns))
> >  				goto out_free;
> >  			set_tid_size--;
> >  		}
> > diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
> > index 0e5ac162c3a8..f58186b31ce6 100644
> > --- a/kernel/pid_namespace.c
> > +++ b/kernel/pid_namespace.c
> > @@ -269,7 +269,7 @@ static int pid_ns_ctl_handler(struct ctl_table *table, int write,
> >  	struct ctl_table tmp = *table;
> >  	int ret, next;
> >  
> > -	if (write && !ns_capable(pid_ns->user_ns, CAP_SYS_ADMIN))
> > +	if (write && !restore_ns_capable(pid_ns->user_ns))
> >  		return -EPERM;
> >  
> >  	/*
> > diff --git a/security/selinux/include/classmap.h b/security/selinux/include/classmap.h
> > index 98e1513b608a..f8b8f12a6ebd 100644
> > --- a/security/selinux/include/classmap.h
> > +++ b/security/selinux/include/classmap.h
> > @@ -27,9 +27,10 @@
> >  	    "audit_control", "setfcap"
> >  
> >  #define COMMON_CAP2_PERMS  "mac_override", "mac_admin", "syslog", \
> > -		"wake_alarm", "block_suspend", "audit_read", "perfmon", "bpf"
> > +		"wake_alarm", "block_suspend", "audit_read", "perfmon", "bpf", \
> > +		"restore"
> >  
> > -#if CAP_LAST_CAP > CAP_BPF
> > +#if CAP_LAST_CAP > CAP_RESTORE
> >  #error New capability defined, please update COMMON_CAP2_PERMS.
> >  #endif
> >  
> >
> > base-commit: e8f3274774b45b5f4e9e3d5cad7ff9f43ae3add5
> 


^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, back to index

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-05-22  5:53 [PATCH] capabilities: Introduce CAP_RESTORE Adrian Reber
2020-05-22  7:53 ` Christian Brauner
2020-05-22 18:02   ` Andrei Vagin
2020-05-22 13:41 ` Christian Brauner
2020-05-22 16:40 ` Casey Schaufler
2020-05-23  4:27   ` Andrei Vagin
2020-05-25  2:01     ` Casey Schaufler
2020-05-25  8:05   ` Adrian Reber

Linux-Security-Module Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/linux-security-module/0 linux-security-module/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 linux-security-module linux-security-module/ https://lore.kernel.org/linux-security-module \
		linux-security-module@vger.kernel.org
	public-inbox-index linux-security-module

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.linux-security-module


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git