[v3.4] capabilities: require CAP_SETFCAP to map uid 0
diff mbox series

Message ID 20210420134334.GA11582@mail.hallyn.com
State Accepted
Commit db2e718a47984b9d71ed890eb2ea36ecf150de18
Headers show
Series
  • [v3.4] capabilities: require CAP_SETFCAP to map uid 0
Related show

Commit Message

Serge E. Hallyn April 20, 2021, 1:43 p.m. UTC
cap_setfcap is required to create file capabilities.

Since 8db6c34f1dbc ("Introduce v3 namespaced file capabilities"), a
process running as uid 0 but without cap_setfcap is able to work around
this as follows: unshare a new user namespace which maps parent uid 0
into the child namespace.  While this task will not have new
capabilities against the parent namespace, there is a loophole due to
the way namespaced file capabilities are represented as xattrs.  File
capabilities valid in userns 1 are distinguished from file capabilities
valid in userns 2 by the kuid which underlies uid 0.  Therefore the
restricted root process can unshare a new self-mapping namespace, add a
namespaced file capability onto a file, then use that file capability in
the parent namespace.

To prevent that, do not allow mapping parent uid 0 if the process which
opened the uid_map file does not have CAP_SETFCAP, which is the capability
for setting file capabilities.

As a further wrinkle:  a task can unshare its user namespace, then
open its uid_map file itself, and map (only) its own uid.  In this
case we do not have the credential from before unshare,  which was
potentially more restricted.  So, when creating a user namespace, we
record whether the creator had CAP_SETFCAP.  Then we can use that
during map_write().

With this patch:

1. Unprivileged user can still unshare -Ur

ubuntu@caps:~$ unshare -Ur
root@caps:~# logout

2. Root user can still unshare -Ur

ubuntu@caps:~$ sudo bash
root@caps:/home/ubuntu# unshare -Ur
root@caps:/home/ubuntu# logout

3. Root user without CAP_SETFCAP cannot unshare -Ur:

root@caps:/home/ubuntu# /sbin/capsh --drop=cap_setfcap --
root@caps:/home/ubuntu# /sbin/setcap cap_setfcap=p /sbin/setcap
unable to set CAP_SETFCAP effective capability: Operation not permitted
root@caps:/home/ubuntu# unshare -Ur
unshare: write failed /proc/self/uid_map: Operation not permitted

Note: an alternative solution would be to allow uid 0 mappings by
processes without CAP_SETFCAP, but to prevent such a namespace from
writing any file capabilities.  This approach can be seen here:
    https://git.kernel.org/pub/scm/linux/kernel/git/sergeh/linux.git/log/?h=2021-04-15/setfcap-nsfscaps-v4

History:

Commit 95ebabde382 ("capabilities: Don't allow writing ambiguous v3 file
capabilities") tried to fix the issue by preventing v3 fscaps to be
written to disk when the root uid would map to the same uid in nested
user namespaces. This led to regressions for various workloads. For
example, see [1]. Ultimately this is a valid use-case we have to support
meaning we had to revert this change in 3b0c2d3eaa83 ("Revert
95ebabde382c ("capabilities: Don't allow writing ambiguous v3 file
capabilities")").

[1]: https://github.com/containers/buildah/issues/3071

Signed-off-by: Serge Hallyn <serge@hallyn.com>
Reviewed-by: Andrew G. Morgan <morgan@kernel.org>
Tested-by: Christian Brauner <christian.brauner@ubuntu.com>
Reviewed-by: Christian Brauner <christian.brauner@ubuntu.com>
Tested-by: Giuseppe Scrivano <gscrivan@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>

Changelog:
   * fix logic in the case of writing to another task's uid_map
   * rename 'ns' to 'map_ns', and make a file_ns local variable
   * use /* comments */
   * update the CAP_SETFCAP comment in capability.h
   * rename parent_unpriv to parent_can_setfcap (and reverse the
     logic)
   * remove printks
   * clarify (i hope) the code comments
   * update capability.h comment
   * renamed parent_can_setfcap to parent_could_setfcap
   * made the check its own disallowed_0_mapping() fn
   * moved the check into new_idmap_permitted
   * rename disallowed_0_mapping to verify_root_mapping
   * change verify_root_mapping to Christian's suggested flow
   * correct+clarify comments: parent uid 0 mapping to any
     child uid is a problem.
   * remove unused lower_first variable.
---
 include/linux/user_namespace.h  |  3 ++
 include/uapi/linux/capability.h |  3 +-
 kernel/user_namespace.c         | 65 +++++++++++++++++++++++++++++++--
 3 files changed, 67 insertions(+), 4 deletions(-)

Comments

Christian Brauner April 20, 2021, 4:47 p.m. UTC | #1
On Tue, Apr 20, 2021 at 08:43:34AM -0500, Serge Hallyn wrote:
> cap_setfcap is required to create file capabilities.
> 
> Since 8db6c34f1dbc ("Introduce v3 namespaced file capabilities"), a
> process running as uid 0 but without cap_setfcap is able to work around
> this as follows: unshare a new user namespace which maps parent uid 0
> into the child namespace.  While this task will not have new
> capabilities against the parent namespace, there is a loophole due to
> the way namespaced file capabilities are represented as xattrs.  File
> capabilities valid in userns 1 are distinguished from file capabilities
> valid in userns 2 by the kuid which underlies uid 0.  Therefore the
> restricted root process can unshare a new self-mapping namespace, add a
> namespaced file capability onto a file, then use that file capability in
> the parent namespace.
> 
> To prevent that, do not allow mapping parent uid 0 if the process which
> opened the uid_map file does not have CAP_SETFCAP, which is the capability
> for setting file capabilities.
> 
> As a further wrinkle:  a task can unshare its user namespace, then
> open its uid_map file itself, and map (only) its own uid.  In this
> case we do not have the credential from before unshare,  which was
> potentially more restricted.  So, when creating a user namespace, we
> record whether the creator had CAP_SETFCAP.  Then we can use that
> during map_write().
> 
> With this patch:
> 
> 1. Unprivileged user can still unshare -Ur
> 
> ubuntu@caps:~$ unshare -Ur
> root@caps:~# logout
> 
> 2. Root user can still unshare -Ur
> 
> ubuntu@caps:~$ sudo bash
> root@caps:/home/ubuntu# unshare -Ur
> root@caps:/home/ubuntu# logout
> 
> 3. Root user without CAP_SETFCAP cannot unshare -Ur:
> 
> root@caps:/home/ubuntu# /sbin/capsh --drop=cap_setfcap --
> root@caps:/home/ubuntu# /sbin/setcap cap_setfcap=p /sbin/setcap
> unable to set CAP_SETFCAP effective capability: Operation not permitted
> root@caps:/home/ubuntu# unshare -Ur
> unshare: write failed /proc/self/uid_map: Operation not permitted
> 
> Note: an alternative solution would be to allow uid 0 mappings by
> processes without CAP_SETFCAP, but to prevent such a namespace from
> writing any file capabilities.  This approach can be seen here:
>     https://git.kernel.org/pub/scm/linux/kernel/git/sergeh/linux.git/log/?h=2021-04-15/setfcap-nsfscaps-v4
> 
> History:
> 
> Commit 95ebabde382 ("capabilities: Don't allow writing ambiguous v3 file
> capabilities") tried to fix the issue by preventing v3 fscaps to be
> written to disk when the root uid would map to the same uid in nested
> user namespaces. This led to regressions for various workloads. For
> example, see [1]. Ultimately this is a valid use-case we have to support
> meaning we had to revert this change in 3b0c2d3eaa83 ("Revert
> 95ebabde382c ("capabilities: Don't allow writing ambiguous v3 file
> capabilities")").
> 
> [1]: https://github.com/containers/buildah/issues/3071
> 
> Signed-off-by: Serge Hallyn <serge@hallyn.com>
> Reviewed-by: Andrew G. Morgan <morgan@kernel.org>
> Tested-by: Christian Brauner <christian.brauner@ubuntu.com>
> Reviewed-by: Christian Brauner <christian.brauner@ubuntu.com>
> Tested-by: Giuseppe Scrivano <gscrivan@redhat.com>
> Cc: "Eric W. Biederman" <ebiederm@xmission.com>

If there's no objections then Linus can probably just pick up the single
patch here directly:
https://lore.kernel.org/lkml/20210420134334.GA11582@mail.hallyn.com

I'm not sure it's worth waiting and releasing another kernel with this
bug. This tigthens the semantics nicely and makes for a simple check at
userns creation time instead of repeatedly checking at setxattr(). With
all the testing done we can be quite confident the risk of regressions
is way lower than the old patch and even if we see one I think this
version of the fix is actually worth the risk.

Christian
Linus Torvalds April 20, 2021, 5:33 p.m. UTC | #2
On Tue, Apr 20, 2021 at 9:47 AM Christian Brauner
<christian.brauner@ubuntu.com> wrote:
>
> If there's no objections then Linus can probably just pick up the single
> patch here directly:

I'm ok with that assuming I don't hear any last-minute concerns..

I'll plan on appling it later today, so anybody with concerns please
holler quickly.

I don't want to leave it to much later in the week, and I may or may
not be functional tomorrow (getting my second vaccine shot, some
people react more strongly to it, so I'm leaving the possibility open
of not getting out of bed ;)

             Linus
Christian Brauner April 21, 2021, 8:26 a.m. UTC | #3
On Tue, Apr 20, 2021 at 10:33:54AM -0700, Linus Torvalds wrote:
> On Tue, Apr 20, 2021 at 9:47 AM Christian Brauner
> <christian.brauner@ubuntu.com> wrote:
> >
> > If there's no objections then Linus can probably just pick up the single
> > patch here directly:
> 
> I'm ok with that assuming I don't hear any last-minute concerns..
> 
> I'll plan on appling it later today, so anybody with concerns please
> holler quickly.
> 
> I don't want to leave it to much later in the week, and I may or may
> not be functional tomorrow (getting my second vaccine shot, some
> people react more strongly to it, so I'm leaving the possibility open
> of not getting out of bed ;)

We're a few short months away from that pleasure here... :)
At least the general plan is to still have 75% of the population
vaccinated by end of July (That's assuming supply holds up.).
And we're probably looking at another booster shot soon enough. With the
EU having cancelled J&J and AZ contracts for next year that's going to
be interesting as well. Let's hope they won't fumble it again.

Christian
Eric W. Biederman April 21, 2021, 7:16 p.m. UTC | #4
"Serge E. Hallyn" <serge@hallyn.com> writes:

> +/**
> + * verify_root_map() - check the uid 0 mapping
> + * @file: idmapping file
> + * @map_ns: user namespace of the target process
> + * @new_map: requested idmap
> + *
> + * If a process requests mapping parent uid 0 into the new ns, verify that the
> + * process writing the map had the CAP_SETFCAP capability as the target process
> + * will be able to write fscaps that are valid in ancestor user namespaces.
> + *
> + * Return: true if the mapping is allowed, false if not.
> + */
> +static bool verify_root_map(const struct file *file,
> +			    struct user_namespace *map_ns,
> +			    struct uid_gid_map *new_map)
> +{
> +	int idx;
> +	const struct user_namespace *file_ns = file->f_cred->user_ns;
> +	struct uid_gid_extent *extent0 = NULL;
> +
> +	for (idx = 0; idx < new_map->nr_extents; idx++) {
> +		if (new_map->nr_extents <= UID_GID_MAP_MAX_BASE_EXTENTS)
> +			extent0 = &new_map->extent[idx];
> +		else
> +			extent0 = &new_map->forward[idx];
> +		if (extent0->lower_first == 0)
> +			break;
> +
> +		extent0 = NULL;
> +	}
> +
> +	if (!extent0)
> +		return true;
> +
> +	if (map_ns == file_ns) {
> +		/* The process unshared its ns and is writing to its own
> +		 * /proc/self/uid_map.  User already has full capabilites in
> +		 * the new namespace.  Verify that the parent had CAP_SETFCAP
> +		 * when it unshared.
> +		 * */
> +		if (!file_ns->parent_could_setfcap)
> +			return false;
> +	} else {
> +		/* Process p1 is writing to uid_map of p2, who is in a child
> +		 * user namespace to p1's.  Verify that the opener of the map
> +		 * file has CAP_SETFCAP against the parent of the new map
> +		 * namespace */
> +		if (!file_ns_capable(file, map_ns->parent, CAP_SETFCAP))
> +			return false;
> +	}

Is there any reason this permission check is not simply:

	return map_ns->parent_could_setfcap ||
               file_ns_capable(file, map_ns->parent, CAP_SETFCAP);

That is why don't we allow any mapping (that is otherwise valid) in user
namespaces whose creator had the permission to call CAP_SETFCAP?

Why limit the case of using the creators permissions to only the case of
mapping just a single uid (that happens to be the current euid) in the
user namespace?

I don't see any safety reasons for the map_ns == file_ns test.



Is the file_ns_capable check for CAP_SETFCAP actually needed?  AKA could
the permission check be simplified to:

	return map_ns->parent_could_setfcap;

That would be a much easier rule to explain to people.

I seem to remember distributions at least trying to make newuidmap have
just CAP_SETUID and newgidmap have just CAP_SETGID.  Such a simplified
check would facilitate that.

Eric
Serge E. Hallyn April 22, 2021, 1:20 p.m. UTC | #5
On Wed, Apr 21, 2021 at 02:16:34PM -0500, Eric W. Biederman wrote:
> "Serge E. Hallyn" <serge@hallyn.com> writes:
> 
> > +/**
> > + * verify_root_map() - check the uid 0 mapping
> > + * @file: idmapping file
> > + * @map_ns: user namespace of the target process
> > + * @new_map: requested idmap
> > + *
> > + * If a process requests mapping parent uid 0 into the new ns, verify that the
> > + * process writing the map had the CAP_SETFCAP capability as the target process
> > + * will be able to write fscaps that are valid in ancestor user namespaces.
> > + *
> > + * Return: true if the mapping is allowed, false if not.
> > + */
> > +static bool verify_root_map(const struct file *file,
> > +			    struct user_namespace *map_ns,
> > +			    struct uid_gid_map *new_map)
> > +{
> > +	int idx;
> > +	const struct user_namespace *file_ns = file->f_cred->user_ns;
> > +	struct uid_gid_extent *extent0 = NULL;
> > +
> > +	for (idx = 0; idx < new_map->nr_extents; idx++) {
> > +		if (new_map->nr_extents <= UID_GID_MAP_MAX_BASE_EXTENTS)
> > +			extent0 = &new_map->extent[idx];
> > +		else
> > +			extent0 = &new_map->forward[idx];
> > +		if (extent0->lower_first == 0)
> > +			break;
> > +
> > +		extent0 = NULL;
> > +	}
> > +
> > +	if (!extent0)
> > +		return true;
> > +
> > +	if (map_ns == file_ns) {
> > +		/* The process unshared its ns and is writing to its own
> > +		 * /proc/self/uid_map.  User already has full capabilites in
> > +		 * the new namespace.  Verify that the parent had CAP_SETFCAP
> > +		 * when it unshared.
> > +		 * */
> > +		if (!file_ns->parent_could_setfcap)
> > +			return false;
> > +	} else {
> > +		/* Process p1 is writing to uid_map of p2, who is in a child
> > +		 * user namespace to p1's.  Verify that the opener of the map
> > +		 * file has CAP_SETFCAP against the parent of the new map
> > +		 * namespace */
> > +		if (!file_ns_capable(file, map_ns->parent, CAP_SETFCAP))
> > +			return false;
> > +	}
> 
> Is there any reason this permission check is not simply:
> 
> 	return map_ns->parent_could_setfcap ||
>                file_ns_capable(file, map_ns->parent, CAP_SETFCAP);
>
> That is why don't we allow any mapping (that is otherwise valid) in user
> namespaces whose creator had the permission to call CAP_SETFCAP?

Well I guess the question is exactly who has to have the privilege.

If task X does the unshare and its sibling task Y writes the mapping
(technically, opens the map file), do we want to say that it suffices
for *either* X or Y to have CAP_SETFCAP?  I was thinking we want to
require task Y to have the privilege.  Task X having it would not
suffice.

> Why limit the case of using the creators permissions to only the case of
> mapping just a single uid (that happens to be the current euid) in the
> user namespace?
> 
> I don't see any safety reasons for the map_ns == file_ns test.
> 
> 
> 
> Is the file_ns_capable check for CAP_SETFCAP actually needed?  AKA could
> the permission check be simplified to:
> 
> 	return map_ns->parent_could_setfcap;

Currently uid 1000 can create a new user namespace, then have a fully privileged
root process in uid 1000's peer userns write a 0 mapping.  With just this
check, that would not be possible.

> That would be a much easier rule to explain to people.
> 
> I seem to remember distributions at least trying to make newuidmap have
> just CAP_SETUID and newgidmap have just CAP_SETGID.  Such a simplified
> check would facilitate that.

Yes they would have to add an additional CAP_SETFCAP.

Patch
diff mbox series

diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
index 64cf8ebdc4ec..f6c5f784be5a 100644
--- a/include/linux/user_namespace.h
+++ b/include/linux/user_namespace.h
@@ -63,6 +63,9 @@  struct user_namespace {
 	kgid_t			group;
 	struct ns_common	ns;
 	unsigned long		flags;
+	/* parent_could_setfcap: true if the creator if this ns had CAP_SETFCAP
+	 * in its effective capability set at the child ns creation time. */
+	bool			parent_could_setfcap;
 
 #ifdef CONFIG_KEYS
 	/* List of joinable keyrings in this namespace.  Modification access of
diff --git a/include/uapi/linux/capability.h b/include/uapi/linux/capability.h
index c6ca33034147..2ddb4226cd23 100644
--- a/include/uapi/linux/capability.h
+++ b/include/uapi/linux/capability.h
@@ -335,7 +335,8 @@  struct vfs_ns_cap_data {
 
 #define CAP_AUDIT_CONTROL    30
 
-/* Set or remove capabilities on files */
+/* Set or remove capabilities on files.
+   Map uid=0 into a child user namespace. */
 
 #define CAP_SETFCAP	     31
 
diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
index af612945a4d0..9a4b980d695b 100644
--- a/kernel/user_namespace.c
+++ b/kernel/user_namespace.c
@@ -106,6 +106,7 @@  int create_user_ns(struct cred *new)
 	if (!ns)
 		goto fail_dec;
 
+	ns->parent_could_setfcap = cap_raised(new->cap_effective, CAP_SETFCAP);
 	ret = ns_alloc_inum(&ns->ns);
 	if (ret)
 		goto fail_free;
@@ -841,6 +842,60 @@  static int sort_idmaps(struct uid_gid_map *map)
 	return 0;
 }
 
+/**
+ * verify_root_map() - check the uid 0 mapping
+ * @file: idmapping file
+ * @map_ns: user namespace of the target process
+ * @new_map: requested idmap
+ *
+ * If a process requests mapping parent uid 0 into the new ns, verify that the
+ * process writing the map had the CAP_SETFCAP capability as the target process
+ * will be able to write fscaps that are valid in ancestor user namespaces.
+ *
+ * Return: true if the mapping is allowed, false if not.
+ */
+static bool verify_root_map(const struct file *file,
+			    struct user_namespace *map_ns,
+			    struct uid_gid_map *new_map)
+{
+	int idx;
+	const struct user_namespace *file_ns = file->f_cred->user_ns;
+	struct uid_gid_extent *extent0 = NULL;
+
+	for (idx = 0; idx < new_map->nr_extents; idx++) {
+		if (new_map->nr_extents <= UID_GID_MAP_MAX_BASE_EXTENTS)
+			extent0 = &new_map->extent[idx];
+		else
+			extent0 = &new_map->forward[idx];
+		if (extent0->lower_first == 0)
+			break;
+
+		extent0 = NULL;
+	}
+
+	if (!extent0)
+		return true;
+
+	if (map_ns == file_ns) {
+		/* The process unshared its ns and is writing to its own
+		 * /proc/self/uid_map.  User already has full capabilites in
+		 * the new namespace.  Verify that the parent had CAP_SETFCAP
+		 * when it unshared.
+		 * */
+		if (!file_ns->parent_could_setfcap)
+			return false;
+	} else {
+		/* Process p1 is writing to uid_map of p2, who is in a child
+		 * user namespace to p1's.  Verify that the opener of the map
+		 * file has CAP_SETFCAP against the parent of the new map
+		 * namespace */
+		if (!file_ns_capable(file, map_ns->parent, CAP_SETFCAP))
+			return false;
+	}
+
+	return true;
+}
+
 static ssize_t map_write(struct file *file, const char __user *buf,
 			 size_t count, loff_t *ppos,
 			 int cap_setid,
@@ -848,7 +903,7 @@  static ssize_t map_write(struct file *file, const char __user *buf,
 			 struct uid_gid_map *parent_map)
 {
 	struct seq_file *seq = file->private_data;
-	struct user_namespace *ns = seq->private;
+	struct user_namespace *map_ns = seq->private;
 	struct uid_gid_map new_map;
 	unsigned idx;
 	struct uid_gid_extent extent;
@@ -895,7 +950,7 @@  static ssize_t map_write(struct file *file, const char __user *buf,
 	/*
 	 * Adjusting namespace settings requires capabilities on the target.
 	 */
-	if (cap_valid(cap_setid) && !file_ns_capable(file, ns, CAP_SYS_ADMIN))
+	if (cap_valid(cap_setid) && !file_ns_capable(file, map_ns, CAP_SYS_ADMIN))
 		goto out;
 
 	/* Parse the user data */
@@ -965,7 +1020,7 @@  static ssize_t map_write(struct file *file, const char __user *buf,
 
 	ret = -EPERM;
 	/* Validate the user is allowed to use user id's mapped to. */
-	if (!new_idmap_permitted(file, ns, cap_setid, &new_map))
+	if (!new_idmap_permitted(file, map_ns, cap_setid, &new_map))
 		goto out;
 
 	ret = -EPERM;
@@ -1086,6 +1141,10 @@  static bool new_idmap_permitted(const struct file *file,
 				struct uid_gid_map *new_map)
 {
 	const struct cred *cred = file->f_cred;
+
+	if (cap_setid == CAP_SETUID && !verify_root_map(file, ns, new_map))
+		return false;
+
 	/* Don't allow mappings that would allow anything that wouldn't
 	 * be allowed without the establishment of unprivileged mappings.
 	 */