All of lore.kernel.org
 help / color / mirror / Atom feed
From: ebiederm@xmission.com (Eric W. Biederman)
To: Christian Brauner <christian.brauner@ubuntu.com>
Cc: "Enrico Weigelt, metux IT consult" <lkml@metux.net>,
	containers@lists.linux-foundation.org,
	"Alexander Mihalicyn" <alexander@mihalicyn.com>,
	"Giuseppe Scrivano" <gscrivan@redhat.com>,
	"Joseph Christopher Sible" <jcsible@cert.org>,
	"Kees Cook" <keescook@chromium.org>,
	linux-kernel@vger.kernel.org,
	"Josh Triplett" <josh@joshtriplett.org>,
	"Andy Lutomirski" <luto@amacapital.net>,
	"Mickaël Salaün" <mic@digikod.net>, "Wat Lim" <watl@google.com>,
	"Mrunal Patel" <mpatel@redhat.com>,
	"Pavel Tikhomirov" <ptikhomirov@virtuozzo.com>,
	"Geoffrey Thomas" <geofft@ldpreload.com>,
	"Serge E. Hallyn" <serge@hallyn.com>,
	linux-api@vger.kernel.org
Subject: The problem of setgroups and containers
Date: Sun, 18 Oct 2020 08:05:06 -0500	[thread overview]
Message-ID: <87h7qradml.fsf_-_@x220.int.ebiederm.org> (raw)
In-Reply-To: <20201018102026.34jtxfheygowgejp@wittgenstein> (Christian Brauner's message of "Sun, 18 Oct 2020 12:20:26 +0200")

[ Added linux-api because we are talking about a subtle semantic
  change to the permission checks ]

Christian Brauner <christian.brauner@ubuntu.com> writes:

> On Sat, Oct 17, 2020 at 11:51:22AM -0500, Eric W. Biederman wrote:
>> "Enrico Weigelt, metux IT consult" <lkml@metux.net> writes:
>> 
>> > On 30.08.20 16:39, Christian Brauner wrote:
>> >> For mount points
>> >>    that originate from outside the namespace, everything will show as
>> >>    the overflow ids and access would be restricted to the most
>> >>    restricted permission bit for any path that can be accessed.
>> >
>> > So, I can't just take a btrfs snapshot as rootfs anymore ?
>> 
>> Interesting until reading through your commentary I had missed the
>> proposal to effectively effectively change the permissions to:
>> ((mode >> 3) & (mode >> 6) & mode & 7).
>> 
>> The challenge is that in a permission triple it is possible to set
>> lower permissions for the owner of the file, or for a specific group,
>> than for everyone else.
>> 
>> Today we require root permissions to be able to map users and groups in
>> /proc/<pid>/uid_map and /proc/<pid>/gid_map, and we require root
>> permissions to be able to drop groups with setgroups.
>> 
>> Now we are discussiong moving to a world where we can use users and
>> groups that don't map to any other user namespace in uid_map and
>> gid_map.  It should be completely safe to use those users and groups
>> except for negative permissions in filesystems.  So a big question is
>> how do we arrange the system so anyone can use those files without
>> negative permission causing problems.
>> 
>> 
>> I believe it is safe to not limit the owner of a file, as the
>> owner of a file can always chmode the file and remove any restrictions.
>> Which is no worse than calling setuid to a different uid.
>> 
>> Which leaves where we have been dealing with the ability to drop groups
>> with setgroups.
>> 
>> I guess the practical proposal is when the !in_group_p and we are
>> looking at the other permission.  Treat the permissions as:
>> ((mode >> 3) & mode & 7).  Instead of just (mode & 7).
>> 
>> Which for systems who don't use negative group permissions is a no-op.
>> So this should not effect your btrfs snapshots at all (unless you use
>> negative group permissions).
>> 
>> It denies things before we get to an NFS server or other interesting
>> case so it should work for pretty much everything the kernel deals with.
>> 
>> Userspace repeating permission checks could break.  But that is just a
>> problem of inconsistency, and will always be a problem.
>> 
>> We could make it more precise as Serge was suggesting with a set of that
>> were dropped from setgroups, but under the assumption that negative
>> groups are sufficient rare we can avoid that overhead.
>
> I'm tempted to agree and say that it's safe to assume that they are used
> very much. Negative acls have been brought up a couple of times in
> related contexts though. One being a potential bug in newgidmap which we
> discussed back in
> https://bugs.launchpad.net/ubuntu/+source/shadow/+bug/1729357
> But I think if we have this under a sysctl as proposed earlier is good
> enough.
>
>> 
>>  static int acl_permission_check(struct inode *inode, int mask)
>>  {
>>  	unsigned int mode = inode->i_mode;
>>  
>> - [irrelevant bits of this function]        
>>  
>>  	/* Only RWX matters for group/other mode bits */
>>  	mask &= 7;
>>  
>>  	/*
>>  	 * Are the group permissions different from
>>  	 * the other permissions in the bits we care
>>  	 * about? Need to check group ownership if so.
>>  	 */
>>  	if (mask & (mode ^ (mode >> 3))) {
>>  		if (in_group_p(inode->i_gid))
>>  			mode >>= 3;
>> +		/* Use the most restrictive permissions? */
>> +		else (current->user_ns->flags & USERNS_ALWAYS_DENY_GROUPS)
>> +			mode &= (mode >> 3);
>>  	}
>>  
>>  	/* Bits in 'mode' clear that we require? */
>>  	return (mask & ~mode) ? -EACCES : 0;
>>  }
>> 
>> As I read posix_acl_permission all of the posix acls for groups are
>> positive permissions.  So I think the only other code that would need to
>> be updated would be the filesystems that replace generic_permission with
>> something that doesn't call acl_permission check.
>> 
>> Userspace could then activate this mode with:
>> 	echo "safely_allow" > /proc/<pid>/setgroups
>> 
>> That looks very elegant and simple, and I don't think will cause
>> problems for anyone.  It might even make sense to make that the default
>> mode when creating a new user namespace.
>> 
>> I guess we owe this idea to Josh Triplett and Geoffrey Thomas.
>> 
>> Does anyone see any problems with tweaking the permissions this way so
>> that we can always allow setgroups in a user namespace?
>
> This looks sane and simple. I would still think that making it opt-in
> for a few kernel releases might be preferable to just making it the new
> default. We can then revisit flipping the default. Advanced enough
> container runtimes will quickly pick up on this and can make it the
> default for their unprivileged containers if they want to.

I think we can even do a little bit better than what I proposed above.
The downside of my code is that negtative acls won't work in containers
even if they do today.  (Not that I think negative acls are something to
encourage just that breaking them means we have to deal with the
question: "Does someone care?").

What we can very safely do is limit negative acls to filesystems that
are mounted in the same user namespace.  Like the code below.

 static int acl_permission_check(struct inode *inode, int mask)
 {
 	unsigned int mode = inode->i_mode;
 
- [irrelevant bits of this function]        
 
 	/* Only RWX matters for group/other mode bits */
 	mask &= 7;
 
 	/*
 	 * Are the group permissions different from
 	 * the other permissions in the bits we care
 	 * about? Need to check group ownership if so.
 	 */
 	if (mask & (mode ^ (mode >> 3))) {
 		if (in_group_p(inode->i_gid))
 			mode >>= 3;
+		/*
+		 * In a user namespace groups may have been dropped
+		 * so use the most restrictive permissions.
+		 */
+		else if (current->user_ns != inode->i_sb->user_ns)
+			mode &= (mode >> 3);
 	}
 
 	/* Bits in 'mode' clear that we require? */
 	return (mask & ~mode) ? -EACCES : 0;
 }

I would make the plan that we apply the fully fleshed out version of the
above (aka updating the permission methods that don't use
generic_permission), and then in a following kernel cycle we remove the
restrictions on setgroups because they are no longer needed.

The only possible user breaking issue I can see if a system with
negative acls where the containers rely on having access to the other
permissions for some reason.  If someone finds a system that does this
change would need to be reverted and another plan would need to be
found.  Otherwise I think/hope this is a safe semantic change.

Does anyone see any problems with my further simplification?

Eric

  reply	other threads:[~2020-10-18 13:05 UTC|newest]

Thread overview: 53+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-08-30 14:39 LPC 2020 Hackroom Session: summary and next steps for isolated user namespaces Christian Brauner
2020-08-30 14:39 ` Christian Brauner
2020-10-10  4:26 ` Serge E. Hallyn
2020-10-10  4:26   ` Serge E. Hallyn
2020-10-11 20:53   ` Josh Triplett
2020-10-11 20:53     ` Josh Triplett
2020-10-12  0:38     ` Andy Lutomirski
2020-10-12  0:38       ` Andy Lutomirski
2020-10-12  5:01       ` Eric W. Biederman
2020-10-12  5:01         ` Eric W. Biederman
2020-10-12 15:00         ` Serge E. Hallyn
2020-10-12 15:00           ` Serge E. Hallyn
2020-10-14 19:46           ` Eric W. Biederman
2020-10-14 19:46             ` Eric W. Biederman
2020-10-15 14:27             ` Serge E. Hallyn
2020-10-15 14:27               ` Serge E. Hallyn
2020-10-17 15:04               ` Eric W. Biederman
2020-10-17 15:04                 ` Eric W. Biederman
2020-10-12 17:05     ` Giuseppe Scrivano
2020-10-12 17:05       ` Giuseppe Scrivano
2020-10-13 12:46       ` Serge E. Hallyn
2020-10-13 12:46         ` Serge E. Hallyn
2020-10-13 15:17         ` Giuseppe Scrivano
2020-10-13 15:17           ` Giuseppe Scrivano
2020-10-15 14:32           ` Serge E. Hallyn
2020-10-15 14:32             ` Serge E. Hallyn
2020-10-19 12:12             ` Giuseppe Scrivano
2020-10-19 12:12               ` Giuseppe Scrivano
2021-04-21 17:27               ` Snaipe via Containers
2021-04-21 17:27                 ` Snaipe
2021-04-22  9:18                 ` Giuseppe Scrivano
2021-04-22  9:18                   ` Giuseppe Scrivano
2021-04-23 14:36                   ` Franklin “Snaipe” Mathieu via Containers
2021-04-23 14:36                     ` Franklin “Snaipe” Mathieu
2021-05-07 13:37                   ` Serge E. Hallyn
2021-05-10 13:02                     ` Giuseppe Scrivano
2021-05-10 13:02                       ` Giuseppe Scrivano
2021-05-10 13:57                       ` Giuseppe Scrivano
2021-05-10 13:57                         ` Giuseppe Scrivano
2020-10-15 15:31 ` Enrico Weigelt, metux IT consult
2020-10-15 15:31   ` Enrico Weigelt, metux IT consult
2020-10-17 16:51   ` Eric W. Biederman
2020-10-17 16:51     ` Eric W. Biederman
2020-10-18 10:20     ` Christian Brauner
2020-10-18 10:20       ` Christian Brauner
2020-10-18 13:05       ` Eric W. Biederman [this message]
2020-10-19  0:15         ` The problem of setgroups and containers Eric W. Biederman
2020-10-19  0:15           ` Eric W. Biederman
2020-10-19 20:07           ` [RFC][PATCH] userns: Limit process in a user namespace to what the creator is allowed Eric W. Biederman
2020-10-19 20:07             ` Eric W. Biederman
2020-10-20 14:11             ` Christian Brauner
2020-10-20 14:11               ` Christian Brauner
2020-10-29 13:42     ` LPC 2020 Hackroom Session: summary and next steps for isolated user namespaces Enrico Weigelt, metux IT consult

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=87h7qradml.fsf_-_@x220.int.ebiederm.org \
    --to=ebiederm@xmission.com \
    --cc=alexander@mihalicyn.com \
    --cc=christian.brauner@ubuntu.com \
    --cc=containers@lists.linux-foundation.org \
    --cc=geofft@ldpreload.com \
    --cc=gscrivan@redhat.com \
    --cc=jcsible@cert.org \
    --cc=josh@joshtriplett.org \
    --cc=keescook@chromium.org \
    --cc=linux-api@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=lkml@metux.net \
    --cc=luto@amacapital.net \
    --cc=mic@digikod.net \
    --cc=mpatel@redhat.com \
    --cc=ptikhomirov@virtuozzo.com \
    --cc=serge@hallyn.com \
    --cc=watl@google.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.