linux-api.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* The problem of setgroups and containers
       [not found]     ` <20201018102026.34jtxfheygowgejp@wittgenstein>
@ 2020-10-18 13:05       ` Eric W. Biederman
  2020-10-19  0:15         ` Eric W. Biederman
  0 siblings, 1 reply; 4+ messages in thread
From: Eric W. Biederman @ 2020-10-18 13:05 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Enrico Weigelt, metux IT consult, containers,
	Alexander Mihalicyn, Giuseppe Scrivano, Joseph Christopher Sible,
	Kees Cook, linux-kernel, Josh Triplett, Andy Lutomirski,
	Mickaël Salaün, Wat Lim, Mrunal Patel,
	Pavel Tikhomirov, Geoffrey Thomas, Serge E. Hallyn, linux-api

[ Added linux-api because we are talking about a subtle semantic
  change to the permission checks ]

Christian Brauner <christian.brauner@ubuntu.com> writes:

> On Sat, Oct 17, 2020 at 11:51:22AM -0500, Eric W. Biederman wrote:
>> "Enrico Weigelt, metux IT consult" <lkml@metux.net> writes:
>> 
>> > On 30.08.20 16:39, Christian Brauner wrote:
>> >> For mount points
>> >>    that originate from outside the namespace, everything will show as
>> >>    the overflow ids and access would be restricted to the most
>> >>    restricted permission bit for any path that can be accessed.
>> >
>> > So, I can't just take a btrfs snapshot as rootfs anymore ?
>> 
>> Interesting until reading through your commentary I had missed the
>> proposal to effectively effectively change the permissions to:
>> ((mode >> 3) & (mode >> 6) & mode & 7).
>> 
>> The challenge is that in a permission triple it is possible to set
>> lower permissions for the owner of the file, or for a specific group,
>> than for everyone else.
>> 
>> Today we require root permissions to be able to map users and groups in
>> /proc/<pid>/uid_map and /proc/<pid>/gid_map, and we require root
>> permissions to be able to drop groups with setgroups.
>> 
>> Now we are discussiong moving to a world where we can use users and
>> groups that don't map to any other user namespace in uid_map and
>> gid_map.  It should be completely safe to use those users and groups
>> except for negative permissions in filesystems.  So a big question is
>> how do we arrange the system so anyone can use those files without
>> negative permission causing problems.
>> 
>> 
>> I believe it is safe to not limit the owner of a file, as the
>> owner of a file can always chmode the file and remove any restrictions.
>> Which is no worse than calling setuid to a different uid.
>> 
>> Which leaves where we have been dealing with the ability to drop groups
>> with setgroups.
>> 
>> I guess the practical proposal is when the !in_group_p and we are
>> looking at the other permission.  Treat the permissions as:
>> ((mode >> 3) & mode & 7).  Instead of just (mode & 7).
>> 
>> Which for systems who don't use negative group permissions is a no-op.
>> So this should not effect your btrfs snapshots at all (unless you use
>> negative group permissions).
>> 
>> It denies things before we get to an NFS server or other interesting
>> case so it should work for pretty much everything the kernel deals with.
>> 
>> Userspace repeating permission checks could break.  But that is just a
>> problem of inconsistency, and will always be a problem.
>> 
>> We could make it more precise as Serge was suggesting with a set of that
>> were dropped from setgroups, but under the assumption that negative
>> groups are sufficient rare we can avoid that overhead.
>
> I'm tempted to agree and say that it's safe to assume that they are used
> very much. Negative acls have been brought up a couple of times in
> related contexts though. One being a potential bug in newgidmap which we
> discussed back in
> https://bugs.launchpad.net/ubuntu/+source/shadow/+bug/1729357
> But I think if we have this under a sysctl as proposed earlier is good
> enough.
>
>> 
>>  static int acl_permission_check(struct inode *inode, int mask)
>>  {
>>  	unsigned int mode = inode->i_mode;
>>  
>> - [irrelevant bits of this function]        
>>  
>>  	/* Only RWX matters for group/other mode bits */
>>  	mask &= 7;
>>  
>>  	/*
>>  	 * Are the group permissions different from
>>  	 * the other permissions in the bits we care
>>  	 * about? Need to check group ownership if so.
>>  	 */
>>  	if (mask & (mode ^ (mode >> 3))) {
>>  		if (in_group_p(inode->i_gid))
>>  			mode >>= 3;
>> +		/* Use the most restrictive permissions? */
>> +		else (current->user_ns->flags & USERNS_ALWAYS_DENY_GROUPS)
>> +			mode &= (mode >> 3);
>>  	}
>>  
>>  	/* Bits in 'mode' clear that we require? */
>>  	return (mask & ~mode) ? -EACCES : 0;
>>  }
>> 
>> As I read posix_acl_permission all of the posix acls for groups are
>> positive permissions.  So I think the only other code that would need to
>> be updated would be the filesystems that replace generic_permission with
>> something that doesn't call acl_permission check.
>> 
>> Userspace could then activate this mode with:
>> 	echo "safely_allow" > /proc/<pid>/setgroups
>> 
>> That looks very elegant and simple, and I don't think will cause
>> problems for anyone.  It might even make sense to make that the default
>> mode when creating a new user namespace.
>> 
>> I guess we owe this idea to Josh Triplett and Geoffrey Thomas.
>> 
>> Does anyone see any problems with tweaking the permissions this way so
>> that we can always allow setgroups in a user namespace?
>
> This looks sane and simple. I would still think that making it opt-in
> for a few kernel releases might be preferable to just making it the new
> default. We can then revisit flipping the default. Advanced enough
> container runtimes will quickly pick up on this and can make it the
> default for their unprivileged containers if they want to.

I think we can even do a little bit better than what I proposed above.
The downside of my code is that negtative acls won't work in containers
even if they do today.  (Not that I think negative acls are something to
encourage just that breaking them means we have to deal with the
question: "Does someone care?").

What we can very safely do is limit negative acls to filesystems that
are mounted in the same user namespace.  Like the code below.

 static int acl_permission_check(struct inode *inode, int mask)
 {
 	unsigned int mode = inode->i_mode;
 
- [irrelevant bits of this function]        
 
 	/* Only RWX matters for group/other mode bits */
 	mask &= 7;
 
 	/*
 	 * Are the group permissions different from
 	 * the other permissions in the bits we care
 	 * about? Need to check group ownership if so.
 	 */
 	if (mask & (mode ^ (mode >> 3))) {
 		if (in_group_p(inode->i_gid))
 			mode >>= 3;
+		/*
+		 * In a user namespace groups may have been dropped
+		 * so use the most restrictive permissions.
+		 */
+		else if (current->user_ns != inode->i_sb->user_ns)
+			mode &= (mode >> 3);
 	}
 
 	/* Bits in 'mode' clear that we require? */
 	return (mask & ~mode) ? -EACCES : 0;
 }

I would make the plan that we apply the fully fleshed out version of the
above (aka updating the permission methods that don't use
generic_permission), and then in a following kernel cycle we remove the
restrictions on setgroups because they are no longer needed.

The only possible user breaking issue I can see if a system with
negative acls where the containers rely on having access to the other
permissions for some reason.  If someone finds a system that does this
change would need to be reverted and another plan would need to be
found.  Otherwise I think/hope this is a safe semantic change.

Does anyone see any problems with my further simplification?

Eric

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: The problem of setgroups and containers
  2020-10-18 13:05       ` The problem of setgroups and containers Eric W. Biederman
@ 2020-10-19  0:15         ` Eric W. Biederman
  2020-10-19 20:07           ` [RFC][PATCH] userns: Limit process in a user namespace to what the creator is allowed Eric W. Biederman
  0 siblings, 1 reply; 4+ messages in thread
From: Eric W. Biederman @ 2020-10-19  0:15 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Enrico Weigelt, metux IT consult, containers,
	Alexander Mihalicyn, Giuseppe Scrivano, Joseph Christopher Sible,
	Kees Cook, linux-kernel, Josh Triplett, Andy Lutomirski,
	Mickaël Salaün, Wat Lim, Mrunal Patel,
	Pavel Tikhomirov, Geoffrey Thomas, Serge E. Hallyn, linux-api

ebiederm@xmission.com (Eric W. Biederman) writes:

> [ Added linux-api because we are talking about a subtle semantic
>   change to the permission checks ]
>
> Christian Brauner <christian.brauner@ubuntu.com> writes:
>
>> On Sat, Oct 17, 2020 at 11:51:22AM -0500, Eric W. Biederman wrote:
>>> "Enrico Weigelt, metux IT consult" <lkml@metux.net> writes:
>>> 
>>> > On 30.08.20 16:39, Christian Brauner wrote:
>>> >> For mount points
>>> >>    that originate from outside the namespace, everything will show as
>>> >>    the overflow ids and access would be restricted to the most
>>> >>    restricted permission bit for any path that can be accessed.
>>> >
>>> > So, I can't just take a btrfs snapshot as rootfs anymore ?
>>> 
>>> Interesting until reading through your commentary I had missed the
>>> proposal to effectively effectively change the permissions to:
>>> ((mode >> 3) & (mode >> 6) & mode & 7).
>>> 
>>> The challenge is that in a permission triple it is possible to set
>>> lower permissions for the owner of the file, or for a specific group,
>>> than for everyone else.
>>> 
>>> Today we require root permissions to be able to map users and groups in
>>> /proc/<pid>/uid_map and /proc/<pid>/gid_map, and we require root
>>> permissions to be able to drop groups with setgroups.
>>> 
>>> Now we are discussiong moving to a world where we can use users and
>>> groups that don't map to any other user namespace in uid_map and
>>> gid_map.  It should be completely safe to use those users and groups
>>> except for negative permissions in filesystems.  So a big question is
>>> how do we arrange the system so anyone can use those files without
>>> negative permission causing problems.
>>> 
>>> 
>>> I believe it is safe to not limit the owner of a file, as the
>>> owner of a file can always chmode the file and remove any restrictions.
>>> Which is no worse than calling setuid to a different uid.
>>> 
>>> Which leaves where we have been dealing with the ability to drop groups
>>> with setgroups.
>>> 
>>> I guess the practical proposal is when the !in_group_p and we are
>>> looking at the other permission.  Treat the permissions as:
>>> ((mode >> 3) & mode & 7).  Instead of just (mode & 7).
>>> 
>>> Which for systems who don't use negative group permissions is a no-op.
>>> So this should not effect your btrfs snapshots at all (unless you use
>>> negative group permissions).
>>> 
>>> It denies things before we get to an NFS server or other interesting
>>> case so it should work for pretty much everything the kernel deals with.
>>> 
>>> Userspace repeating permission checks could break.  But that is just a
>>> problem of inconsistency, and will always be a problem.
>>> 
>>> We could make it more precise as Serge was suggesting with a set of that
>>> were dropped from setgroups, but under the assumption that negative
>>> groups are sufficient rare we can avoid that overhead.
>>
>> I'm tempted to agree and say that it's safe to assume that they are used
>> very much. Negative acls have been brought up a couple of times in
>> related contexts though. One being a potential bug in newgidmap which we
>> discussed back in
>> https://bugs.launchpad.net/ubuntu/+source/shadow/+bug/1729357
>> But I think if we have this under a sysctl as proposed earlier is good
>> enough.
>>
>>> 
>>>  static int acl_permission_check(struct inode *inode, int mask)
>>>  {
>>>  	unsigned int mode = inode->i_mode;
>>>  
>>> - [irrelevant bits of this function]        
>>>  
>>>  	/* Only RWX matters for group/other mode bits */
>>>  	mask &= 7;
>>>  
>>>  	/*
>>>  	 * Are the group permissions different from
>>>  	 * the other permissions in the bits we care
>>>  	 * about? Need to check group ownership if so.
>>>  	 */
>>>  	if (mask & (mode ^ (mode >> 3))) {
>>>  		if (in_group_p(inode->i_gid))
>>>  			mode >>= 3;
>>> +		/* Use the most restrictive permissions? */
>>> +		else (current->user_ns->flags & USERNS_ALWAYS_DENY_GROUPS)
>>> +			mode &= (mode >> 3);
>>>  	}
>>>  
>>>  	/* Bits in 'mode' clear that we require? */
>>>  	return (mask & ~mode) ? -EACCES : 0;
>>>  }
>>> 
>>> As I read posix_acl_permission all of the posix acls for groups are
>>> positive permissions.  So I think the only other code that would need to
>>> be updated would be the filesystems that replace generic_permission with
>>> something that doesn't call acl_permission check.
>>> 
>>> Userspace could then activate this mode with:
>>> 	echo "safely_allow" > /proc/<pid>/setgroups
>>> 
>>> That looks very elegant and simple, and I don't think will cause
>>> problems for anyone.  It might even make sense to make that the default
>>> mode when creating a new user namespace.
>>> 
>>> I guess we owe this idea to Josh Triplett and Geoffrey Thomas.
>>> 
>>> Does anyone see any problems with tweaking the permissions this way so
>>> that we can always allow setgroups in a user namespace?
>>
>> This looks sane and simple. I would still think that making it opt-in
>> for a few kernel releases might be preferable to just making it the new
>> default. We can then revisit flipping the default. Advanced enough
>> container runtimes will quickly pick up on this and can make it the
>> default for their unprivileged containers if they want to.
>
> I think we can even do a little bit better than what I proposed above.
> The downside of my code is that negtative acls won't work in containers
> even if they do today.  (Not that I think negative acls are something to
> encourage just that breaking them means we have to deal with the
> question: "Does someone care?").
>
> What we can very safely do is limit negative acls to filesystems that
> are mounted in the same user namespace.  Like the code below.
>
>  static int acl_permission_check(struct inode *inode, int mask)
>  {
>  	unsigned int mode = inode->i_mode;
>  
> - [irrelevant bits of this function]        
>  
>  	/* Only RWX matters for group/other mode bits */
>  	mask &= 7;
>  
>  	/*
>  	 * Are the group permissions different from
>  	 * the other permissions in the bits we care
>  	 * about? Need to check group ownership if so.
>  	 */
>  	if (mask & (mode ^ (mode >> 3))) {
>  		if (in_group_p(inode->i_gid))
>  			mode >>= 3;
> +		/*
> +		 * In a user namespace groups may have been dropped
> +		 * so use the most restrictive permissions.
> +		 */
> +		else if (current->user_ns != inode->i_sb->user_ns)
> +			mode &= (mode >> 3);
>  	}
>  
>  	/* Bits in 'mode' clear that we require? */
>  	return (mask & ~mode) ? -EACCES : 0;
>  }
>
> I would make the plan that we apply the fully fleshed out version of the
> above (aka updating the permission methods that don't use
> generic_permission), and then in a following kernel cycle we remove the
> restrictions on setgroups because they are no longer needed.
>
> The only possible user breaking issue I can see if a system with
> negative acls where the containers rely on having access to the other
> permissions for some reason.  If someone finds a system that does this
> change would need to be reverted and another plan would need to be
> found.  Otherwise I think/hope this is a safe semantic change.
>
> Does anyone see any problems with my further simplification?

Ugh.  I do see a problem.  Not with the approach so much but with my
argument that it is fine to ignore users.

I have just re-read through posix_acl_permission, and the logic is just
like acl_permission_check except that instead of having one user (the
owner of the file) and one group.  There can be the owner of the file
and other users (each with their distinct permissions) followed by the
one or more groups each with their distinct permissions followed by a
mask of maximum permissions followed by permissions for other users.

Which means that we need to take the owner of the user namespace into
account to preserve the invariant that we have no more permissions than
that owner had.

So I am thinking for the other permission check we need to limit the
permissions that are available to based on recursively the owner of the
user namespace and the owner's groups when the user namespace was
created.


Eric

^ permalink raw reply	[flat|nested] 4+ messages in thread

* [RFC][PATCH] userns: Limit process in a user namespace to what the creator is allowed
  2020-10-19  0:15         ` Eric W. Biederman
@ 2020-10-19 20:07           ` Eric W. Biederman
  2020-10-20 14:11             ` Christian Brauner
  0 siblings, 1 reply; 4+ messages in thread
From: Eric W. Biederman @ 2020-10-19 20:07 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Enrico Weigelt, metux IT consult, containers,
	Alexander Mihalicyn, Giuseppe Scrivano, Joseph Christopher Sible,
	Kees Cook, linux-kernel, Josh Triplett, Andy Lutomirski,
	Mickaël Salaün, Wat Lim, Mrunal Patel,
	Pavel Tikhomirov, Geoffrey Thomas, Serge E. Hallyn, linux-api


Ordinary unix permissions and posix acls have the ability to
expression that processes show uid or gid match have fewer permissions
than processes without matches that use the other permissions.

The fact a root user in a user namespace can call setgroups and setuid
allows these restrictive permissions to be avoided.  To limit the problems
this can cause populationg the the set of uids that can be switched to,
and the set of gids that can be switched to is an operation that requires
priviliege outside of the user namespace.

This restriction is currently being reexamined as it appears that
there is a way to implement uids and gids that do not map outside of
the user namespace.  Such uids and gids would not require privilege
from outside of the usernamespace to use.  So it becomes important to
find a way to allow calling setuid and setgroups in a user namespace
without allowing processes in a user namespace to do more than the
creator of the user namespace.

To that end capture the groups set with setgroups of the creator of a
user_namespace.  Update the affected permission checks to notice when
something is being allowed with other permissions and only allow the
operation if the creator of the user namespace does not have a user or
group match that would disallow the operation.

The goal is to ensure that creating a user namespace and allowing
the user namespace root to setuid and setgroups does not result
in being permitted to do more than before the user namespace was
created, while still supporting explicitly specified users and
groups to have fewer permissions.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
---

So far only generic_permission is covered, but I think this demonstrates
that the goal is achievable.  Preserving negative acls while allowing
setuid and setgroups.

 fs/namei.c                     |  7 +++++
 fs/posix_acl.c                 | 51 ++++++++++++++++++++++++++++++++++
 include/linux/user_namespace.h | 16 +++++++++++
 kernel/user_namespace.c        | 29 +++++++++++++++++++
 4 files changed, 103 insertions(+)

diff --git a/fs/namei.c b/fs/namei.c
index e99e2a9da0f7..ca06bd81d4e4 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -322,6 +322,13 @@ static int acl_permission_check(struct inode *inode, int mask)
 	if (mask & (mode ^ (mode >> 3))) {
 		if (in_group_p(inode->i_gid))
 			mode >>= 3;
+		/*
+		 * Only allow the intersection of what the creator of
+		 * the user namespace is allowed and what everyone is
+		 * allowed.
+		 */
+		else if (userns_in_group_p(inode->i_sb->s_user_ns, inode->i_gid))
+			mode &= (mode >> 3);
 	}
 
 	/* Bits in 'mode' clear that we require? */
diff --git a/fs/posix_acl.c b/fs/posix_acl.c
index 95882b3f5f62..525803f8f70c 100644
--- a/fs/posix_acl.c
+++ b/fs/posix_acl.c
@@ -340,6 +340,53 @@ posix_acl_from_mode(umode_t mode, gfp_t flags)
 }
 EXPORT_SYMBOL(posix_acl_from_mode);
 
+static bool userns_creator_allowed(struct inode *inode,
+				   const struct posix_acl *acl, int want)
+{
+	/* Don't allow access the creator of the user namespace does not have */
+	struct user_namespace *ns = inode->i_sb->s_user_ns;
+	const struct posix_acl_entry *pa, *pe;
+	unsigned short min_perm;
+	bool found = false;
+
+	min_perm = MAY_READ | MAY_WRITE | MAY_EXEC;
+	FOREACH_ACL_ENTRY(pa, acl, pe) {
+                switch(pa->e_tag) {
+                        case ACL_USER_OBJ:
+				/* No need to limit the owner of a file */
+                                break;
+                        case ACL_USER:
+				if (is_userns_creator(ns, pa->e_uid)) {
+					found = true;
+					min_perm &= pa->e_perm;
+				}
+				break;
+                        case ACL_GROUP_OBJ:
+				if (userns_in_group_p(ns, inode->i_gid)) {
+					found = true;
+					min_perm &= pa->e_perm;
+				}
+				break;
+                        case ACL_GROUP:
+				if (userns_in_group_p(ns, pa->e_gid)) {
+					found = true;
+					min_perm &= pa->e_perm;
+				}
+                                break;
+                        case ACL_MASK:
+				if (found)
+					min_perm &= pa->e_perm;
+                                break;
+                        case ACL_OTHER:
+				if (found &&
+				    ((pa->e_perm & want & min_perm) != want))
+					return false;
+				return true;
+                }
+        }
+	return false;
+}
+
 /*
  * Return 0 if current is granted want access to the inode
  * by the acl. Returns -E... otherwise.
@@ -382,6 +429,10 @@ posix_acl_permission(struct inode *inode, const struct posix_acl *acl, int want)
                         case ACL_OTHER:
 				if (found)
 					return -EACCES;
+				else if ((current_user_ns() != inode->i_sb->s_user_ns) &&
+					 ((pa->e_perm & want) == want) &&
+					 userns_creator_allowed(inode, acl, want))
+					return 0;
 				else
 					goto check_perm;
 			default:
diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
index 6ef1c7109fc4..b4bcb49bed7a 100644
--- a/include/linux/user_namespace.h
+++ b/include/linux/user_namespace.h
@@ -62,6 +62,7 @@ struct user_namespace {
 	int			level;
 	kuid_t			owner;
 	kgid_t			group;
+	struct group_info 	*groups;
 	struct ns_common	ns;
 	unsigned long		flags;
 
@@ -137,6 +138,10 @@ extern bool in_userns(const struct user_namespace *ancestor,
 		       const struct user_namespace *child);
 extern bool current_in_userns(const struct user_namespace *target_ns);
 struct ns_common *ns_get_owner(struct ns_common *ns);
+
+bool is_userns_creator(struct user_namespace *ns, kuid_t uid);
+bool userns_in_group_p(struct user_namespace *ns, kgid_t group);
+
 #else
 
 static inline struct user_namespace *get_user_ns(struct user_namespace *ns)
@@ -181,6 +186,17 @@ static inline struct ns_common *ns_get_owner(struct ns_common *ns)
 {
 	return ERR_PTR(-EPERM);
 }
+
+static inline bool is_userns_creator(struct user_namespace *ns, kuid_t uid)
+{
+	return false;
+}
+
+static inline bool userns_in_group_p(struct user_namespace *ns, kgid_t group)
+{
+	return false;
+}
+
 #endif
 
 #endif /* _LINUX_USER_H */
diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
index 87804e0371fe..8a4949a32e36 100644
--- a/kernel/user_namespace.c
+++ b/kernel/user_namespace.c
@@ -117,6 +117,7 @@ int create_user_ns(struct cred *new)
 	ns->level = parent_ns->level + 1;
 	ns->owner = owner;
 	ns->group = group;
+	ns->groups = get_group_info(new->group_info);
 	INIT_WORK(&ns->work, free_user_ns);
 	for (i = 0; i < UCOUNT_COUNTS; i++) {
 		ns->ucount_max[i] = INT_MAX;
@@ -143,6 +144,7 @@ int create_user_ns(struct cred *new)
 	key_put(ns->persistent_keyring_register);
 #endif
 	ns_free_inum(&ns->ns);
+	put_group_info(ns->groups);
 fail_free:
 	kmem_cache_free(user_ns_cachep, ns);
 fail_dec:
@@ -194,6 +196,7 @@ static void free_user_ns(struct work_struct *work)
 		retire_userns_sysctls(ns);
 		key_free_user_ns(ns);
 		ns_free_inum(&ns->ns);
+		put_group_info(ns->groups);
 		kmem_cache_free(user_ns_cachep, ns);
 		dec_user_namespaces(ucounts);
 		ns = parent;
@@ -1317,6 +1320,32 @@ const struct proc_ns_operations userns_operations = {
 	.get_parent	= ns_get_owner,
 };
 
+bool is_userns_creator(struct user_namespace *target_ns, kuid_t uid)
+{
+	struct user_namespace *user_ns = current_user_ns();
+
+	for (user_ns = current_user_ns();
+	     (user_ns != target_ns) && user_ns;
+	     user_ns = user_ns->parent) {
+		if (uid_eq(uid, user_ns->owner))
+			return true;
+	}
+	return false;
+}
+
+bool userns_in_group_p(struct user_namespace *target_ns, kgid_t group)
+{
+	struct user_namespace *user_ns;
+
+	for (user_ns = current_user_ns();
+	     (user_ns != target_ns) && user_ns;
+	     user_ns = user_ns->parent) {
+		if (groups_search(user_ns->groups, group))
+			return true;
+	}
+	return false;
+}
+
 static __init int user_namespaces_init(void)
 {
 	user_ns_cachep = KMEM_CACHE(user_namespace, SLAB_PANIC);
-- 
2.20.1



^ permalink raw reply related	[flat|nested] 4+ messages in thread

* Re: [RFC][PATCH] userns: Limit process in a user namespace to what the creator is allowed
  2020-10-19 20:07           ` [RFC][PATCH] userns: Limit process in a user namespace to what the creator is allowed Eric W. Biederman
@ 2020-10-20 14:11             ` Christian Brauner
  0 siblings, 0 replies; 4+ messages in thread
From: Christian Brauner @ 2020-10-20 14:11 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Enrico Weigelt, metux IT consult, containers,
	Alexander Mihalicyn, Giuseppe Scrivano, Joseph Christopher Sible,
	Kees Cook, linux-kernel, Josh Triplett, Andy Lutomirski,
	Mickaël Salaün, Wat Lim, Mrunal Patel,
	Pavel Tikhomirov, Geoffrey Thomas, Serge E. Hallyn, linux-api

On Mon, Oct 19, 2020 at 03:07:02PM -0500, Eric W. Biederman wrote:
> Ordinary unix permissions and posix acls have the ability to
> expression that processes show uid or gid match have fewer permissions
> than processes without matches that use the other permissions.

I'm stumbling a bit reading that sentence but that may just me parsing
it wrong:

"[...] have the ability to express that processes whose uid or gid match
nonetheless have fewer permissions than processes without matching uid
or gid that use other permissions."

is how I'm understanding this.

> 
> The fact a root user in a user namespace can call setgroups and setuid
> allows these restrictive permissions to be avoided.  To limit the problems
> this can cause populationg the the set of uids that can be switched to,
> and the set of gids that can be switched to is an operation that requires
> priviliege outside of the user namespace.
> 
> This restriction is currently being reexamined as it appears that
> there is a way to implement uids and gids that do not map outside of
> the user namespace.  Such uids and gids would not require privilege
> from outside of the usernamespace to use.  So it becomes important to
> find a way to allow calling setuid and setgroups in a user namespace
> without allowing processes in a user namespace to do more than the
> creator of the user namespace.
> 
> To that end capture the groups set with setgroups of the creator of a
> user_namespace.  Update the affected permission checks to notice when
> something is being allowed with other permissions and only allow the
> operation if the creator of the user namespace does not have a user or
> group match that would disallow the operation.
> 
> The goal is to ensure that creating a user namespace and allowing
> the user namespace root to setuid and setgroups does not result
> in being permitted to do more than before the user namespace was
> created, while still supporting explicitly specified users and
> groups to have fewer permissions.
> 
> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
> ---
> 
> So far only generic_permission is covered, but I think this demonstrates
> that the goal is achievable.  Preserving negative acls while allowing
> setuid and setgroups.

I think that looks good. I have run our test-suite with this patch
applied and it survived no problem so I don't see any regressions for
current use-cases so far.

> 
>  fs/namei.c                     |  7 +++++
>  fs/posix_acl.c                 | 51 ++++++++++++++++++++++++++++++++++
>  include/linux/user_namespace.h | 16 +++++++++++
>  kernel/user_namespace.c        | 29 +++++++++++++++++++
>  4 files changed, 103 insertions(+)
> 
> diff --git a/fs/namei.c b/fs/namei.c
> index e99e2a9da0f7..ca06bd81d4e4 100644
> --- a/fs/namei.c
> +++ b/fs/namei.c
> @@ -322,6 +322,13 @@ static int acl_permission_check(struct inode *inode, int mask)
>  	if (mask & (mode ^ (mode >> 3))) {
>  		if (in_group_p(inode->i_gid))
>  			mode >>= 3;
> +		/*
> +		 * Only allow the intersection of what the creator of
> +		 * the user namespace is allowed and what everyone is
> +		 * allowed.
> +		 */
> +		else if (userns_in_group_p(inode->i_sb->s_user_ns, inode->i_gid))
> +			mode &= (mode >> 3);
>  	}
>  
>  	/* Bits in 'mode' clear that we require? */
> diff --git a/fs/posix_acl.c b/fs/posix_acl.c
> index 95882b3f5f62..525803f8f70c 100644
> --- a/fs/posix_acl.c
> +++ b/fs/posix_acl.c
> @@ -340,6 +340,53 @@ posix_acl_from_mode(umode_t mode, gfp_t flags)
>  }
>  EXPORT_SYMBOL(posix_acl_from_mode);
>  
> +static bool userns_creator_allowed(struct inode *inode,
> +				   const struct posix_acl *acl, int want)
> +{
> +	/* Don't allow access the creator of the user namespace does not have */
> +	struct user_namespace *ns = inode->i_sb->s_user_ns;
> +	const struct posix_acl_entry *pa, *pe;
> +	unsigned short min_perm;
> +	bool found = false;
> +
> +	min_perm = MAY_READ | MAY_WRITE | MAY_EXEC;
> +	FOREACH_ACL_ENTRY(pa, acl, pe) {
> +                switch(pa->e_tag) {
> +                        case ACL_USER_OBJ:
> +				/* No need to limit the owner of a file */
> +                                break;
> +                        case ACL_USER:
> +				if (is_userns_creator(ns, pa->e_uid)) {
> +					found = true;
> +					min_perm &= pa->e_perm;
> +				}
> +				break;
> +                        case ACL_GROUP_OBJ:
> +				if (userns_in_group_p(ns, inode->i_gid)) {
> +					found = true;
> +					min_perm &= pa->e_perm;
> +				}
> +				break;
> +                        case ACL_GROUP:
> +				if (userns_in_group_p(ns, pa->e_gid)) {
> +					found = true;
> +					min_perm &= pa->e_perm;
> +				}
> +                                break;
> +                        case ACL_MASK:
> +				if (found)
> +					min_perm &= pa->e_perm;
> +                                break;
> +                        case ACL_OTHER:
> +				if (found &&
> +				    ((pa->e_perm & want & min_perm) != want))
> +					return false;
> +				return true;
> +                }
> +        }
> +	return false;
> +}
> +
>  /*
>   * Return 0 if current is granted want access to the inode
>   * by the acl. Returns -E... otherwise.
> @@ -382,6 +429,10 @@ posix_acl_permission(struct inode *inode, const struct posix_acl *acl, int want)
>                          case ACL_OTHER:
>  				if (found)
>  					return -EACCES;
> +				else if ((current_user_ns() != inode->i_sb->s_user_ns) &&
> +					 ((pa->e_perm & want) == want) &&
> +					 userns_creator_allowed(inode, acl, want))
> +					return 0;
>  				else
>  					goto check_perm;
>  			default:
> diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
> index 6ef1c7109fc4..b4bcb49bed7a 100644
> --- a/include/linux/user_namespace.h
> +++ b/include/linux/user_namespace.h
> @@ -62,6 +62,7 @@ struct user_namespace {
>  	int			level;
>  	kuid_t			owner;
>  	kgid_t			group;
> +	struct group_info 	*groups;
>  	struct ns_common	ns;
>  	unsigned long		flags;
>  
> @@ -137,6 +138,10 @@ extern bool in_userns(const struct user_namespace *ancestor,
>  		       const struct user_namespace *child);
>  extern bool current_in_userns(const struct user_namespace *target_ns);
>  struct ns_common *ns_get_owner(struct ns_common *ns);
> +
> +bool is_userns_creator(struct user_namespace *ns, kuid_t uid);
> +bool userns_in_group_p(struct user_namespace *ns, kgid_t group);
> +
>  #else
>  
>  static inline struct user_namespace *get_user_ns(struct user_namespace *ns)
> @@ -181,6 +186,17 @@ static inline struct ns_common *ns_get_owner(struct ns_common *ns)
>  {
>  	return ERR_PTR(-EPERM);
>  }
> +
> +static inline bool is_userns_creator(struct user_namespace *ns, kuid_t uid)
> +{
> +	return false;
> +}
> +
> +static inline bool userns_in_group_p(struct user_namespace *ns, kgid_t group)
> +{
> +	return false;
> +}
> +
>  #endif
>  
>  #endif /* _LINUX_USER_H */
> diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
> index 87804e0371fe..8a4949a32e36 100644
> --- a/kernel/user_namespace.c
> +++ b/kernel/user_namespace.c
> @@ -117,6 +117,7 @@ int create_user_ns(struct cred *new)
>  	ns->level = parent_ns->level + 1;
>  	ns->owner = owner;
>  	ns->group = group;
> +	ns->groups = get_group_info(new->group_info);
>  	INIT_WORK(&ns->work, free_user_ns);
>  	for (i = 0; i < UCOUNT_COUNTS; i++) {
>  		ns->ucount_max[i] = INT_MAX;
> @@ -143,6 +144,7 @@ int create_user_ns(struct cred *new)
>  	key_put(ns->persistent_keyring_register);
>  #endif
>  	ns_free_inum(&ns->ns);
> +	put_group_info(ns->groups);
>  fail_free:
>  	kmem_cache_free(user_ns_cachep, ns);
>  fail_dec:
> @@ -194,6 +196,7 @@ static void free_user_ns(struct work_struct *work)
>  		retire_userns_sysctls(ns);
>  		key_free_user_ns(ns);
>  		ns_free_inum(&ns->ns);
> +		put_group_info(ns->groups);
>  		kmem_cache_free(user_ns_cachep, ns);
>  		dec_user_namespaces(ucounts);
>  		ns = parent;
> @@ -1317,6 +1320,32 @@ const struct proc_ns_operations userns_operations = {
>  	.get_parent	= ns_get_owner,
>  };
>  
> +bool is_userns_creator(struct user_namespace *target_ns, kuid_t uid)
> +{
> +	struct user_namespace *user_ns = current_user_ns();
> +
> +	for (user_ns = current_user_ns();
> +	     (user_ns != target_ns) && user_ns;
> +	     user_ns = user_ns->parent) {
> +		if (uid_eq(uid, user_ns->owner))
> +			return true;
> +	}
> +	return false;
> +}
> +
> +bool userns_in_group_p(struct user_namespace *target_ns, kgid_t group)
> +{
> +	struct user_namespace *user_ns;
> +
> +	for (user_ns = current_user_ns();
> +	     (user_ns != target_ns) && user_ns;
> +	     user_ns = user_ns->parent) {
> +		if (groups_search(user_ns->groups, group))
> +			return true;
> +	}
> +	return false;
> +}
> +
>  static __init int user_namespaces_init(void)
>  {
>  	user_ns_cachep = KMEM_CACHE(user_namespace, SLAB_PANIC);
> -- 
> 2.20.1

^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2020-10-20 14:11 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <20200830143959.rhosiunyz5yqbr35@wittgenstein>
     [not found] ` <cb6c6f1a-89ee-1762-3eec-4f69bd7739b1@metux.net>
     [not found]   ` <874kmsdcdx.fsf@x220.int.ebiederm.org>
     [not found]     ` <20201018102026.34jtxfheygowgejp@wittgenstein>
2020-10-18 13:05       ` The problem of setgroups and containers Eric W. Biederman
2020-10-19  0:15         ` Eric W. Biederman
2020-10-19 20:07           ` [RFC][PATCH] userns: Limit process in a user namespace to what the creator is allowed Eric W. Biederman
2020-10-20 14:11             ` Christian Brauner

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).