LKML Archive on lore.kernel.org
 help / color / Atom feed
* LPC 2020 Hackroom Session: summary and next steps for isolated user namespaces
@ 2020-08-30 14:39 Christian Brauner
  2020-10-10  4:26 ` Serge E. Hallyn
  2020-10-15 15:31 ` Enrico Weigelt, metux IT consult
  0 siblings, 2 replies; 22+ messages in thread
From: Christian Brauner @ 2020-08-30 14:39 UTC (permalink / raw)
  To: containers
  Cc: Alexander Mihalicyn, Mrunal Patel, Wat Lim, Aleksa Sarai,
	Pavel Tikhomirov, Geoffrey Thomas, Eric W. Biederman,
	Joseph Christopher Sible, Mickaël Salaün, Vivek Goyal,
	Giuseppe Scrivano, Andy Lutomirski, Stephane Graber, Kees Cook,
	Sargun Dhillon, Josh Triplett, linux-kernel

Hello everyone,

## Preliminaries

This is the summary of the Hackroom session Stéphane and I led as a follow-up
to our presentations in the Containers & Checkpoint/Restore micro-conference at
Linux Plumbers 2020.

Please make sure to see the Action Items section below as it outlines the next
concrete steps that came up during the meeting and who seemed interested in
tackling them.

The background for this summary is:

1. Stéphane's and my talk "Isolated Dynamic User Namespaces"
   People interested in the full session can watch it on YouTube:
   https://youtu.be/fSyr_IXM21Y?t=8856

2. The Hackroom session on Wednesday, 25.08.2020 at 17:00 UTC
   This session has been recorded as well. It is not yet on YouTube because
   Hackroom sessions weren't streamed. However, I plan on cutting that video
   and putting it up on YouTube as well just so there's no chance of
   miscommunication.

All people that attended session 1. were asked to send me an e-mail if they
wanted to attend session 2. to hash out details. The following people requested
to attend session 2. and were informed either through the e-mail I sent out or IRC:

Aleksa Sarai
Alexander Mihalicyn
Andy Lutomirski
Christian Brauner
Eric W. Biederman
Geoffrey Thomas
Giuseppe Scrivano
Joseph Christopher Sible
Josh Triplett
Kees Cook
Mickaël Salaün
Mrunal Patel
Pavel Tikhomirov
Sargun Dhillon
Serge Hallyn
Stephane Graber
Vivek Goyal
Wat Lim

All of them should be Cced here. In case I forgot someone don't hesitate to
forward this mail to them!

## Summary

During the Containers & Checkpoint/Restore micro-conference and in the hackroom
session Stéphane Graber and I proposed a way to make using user namespaces
simpler and more isolated. The following current problems were identified:

P1. Isolated id mappings can only be guaranteed to be locally isolated.
    A container runtime/daemon can only guarantee non-overlapping id mappings
    when no other users on the system create containers.

P2. Enforcing isolated id mappings in userspace is difficult.
    It is always possible to create other processes with overlapping id
    mappings. Coordinating id mappings in userspace will always remain
    optional. Quite a few tools nowadays (including systemd) don't care about
    /etc/sub{g,u}id and actively advise against using it. This is made even
    more problematic since sub{g,u}iid delegation is done per-user rather than
    per-container-runtime.

P3. The range of the id mapping of a container can't be predetermined.
    While POSIX mandates that a standard system should use a range of 65536 ids
    reality is very different. Some programs allocate high ids for random
    processes or for network authentication. This means, in practice it is
    often necessary to assign a range of up to 10 million ids to a container.
    This limits a system to less than 500 containers total.

P4. Isolated id mappings severely restrict the number of containers that can be
    run on a system.
    This ties back to the point about pre-determining the id range of a
    container and how large range allocations tend to be on real systems. That
    becomes even more relevant when nesting containers.

P5. Container runtimes cannot reuse overlayfs lower directories if each
    container uses isolated ID mappings, leading to either needless storage
    overhead (LXD -- though the LXD folks don’t really mind), completely
    ignoring the benefits of isolating containers from each other (Docker), or
    not using them at all (Kubernetes). (This is a more general issue but bears
    repeating since it is closely tied to most userns proposals.)

P6. Rlimits pose a problem for containers that share the same id mapping.
    This means containers with overlapping id mappings can DOS each other by
    exhausting their rlimits. The reason for this lies with the current
    implementation of rlimits -- rlimits are currently tied to users and are
    not hierarchically limited like inotify limits are. This is a severe
    problem in unprivileged workloads. Eric and others identified that this
    issue can be fixed independently of the isolated user namespace proposal.

In response to these and other issues, we made the following proposal which was
floated around in less clear form already during Linux Plumber 2019 in Lisbon
during informal discussions:

## Proposal

Introduce an in-kernel concept of an isolated user namespace by switching the
id types in the kernel from 32 to 64 bits. Userspace will only get to see the
lower 32 bits as usual. The upper 32 bits are used for a unique, in-kernel user
namespace token. The owner of such a namespace will either be the effective id
of the creator of that namespace or optionally an owning id can be set (when
created by a privileged user).

The following advantages were identified by various people during the session:

S1. An isolated user namespace has access to the full 32 bit id range.
    This makes it compatible with every Linux workload and allows to support
    post-POSIX range users that allocate high-range ids (LDAP, systemd, etc). 
    This solves P3 and P4.

S2. Kernel-enforced user namespace isolation.
    This means, there is no need for different container runtimes to
    collaborate on id ranges with immediate benefits for everyone.
    This solves P1 and P2.

S3. The need to split existing id ranges is completely removed.
    Nested containers become trivial.

S4. Simplify the usage of user namespaces significantly for newcomers.
    This should hopefully finally increase adoption in userspace especially in
    the application container and Kubernetes space.

S5. The owning id concept of a user namespace makes monitoring and interacting
    with such containers way easier.

S6. When interacting with an isolated user namespace the owning id can be used
    as a credential when interacting with the container from an ancestor user
    namespace.

The need and desire for these features seemed to be expressed by most
participating parties.

### Issues

Two main issues were discussed:

1. How are interactions across isolated user namespaces handled?
   An isolated user namespace can interact with another isolated user namespace
   or an ancestor user namespace. A good example are socket credentials. They
   can be seen and received outside of the container. In those cases the id of
   the isolated user namespace needs to be represented.
   The proposals to solve this problem were:
   1.1. Use the owning id of the isolated user namespace.
	A parent user namespace would see the configured owning id of the
	isolated user namespace (mapped to that user namespace).
        A non ancestor user namespace would see the overflow ids.
   1.2. Always use the overflow id for isolated user namespaces.
	Any other user namespaces would see the overflow id configured on the
	system.
   Proposal 1.1 semmed prefered since it would allow an unprivileged
   user creating an isolated user namespace to kill/ptrace all processes
   in the isolated namespace they spawned. In contrast proposal 1.1
   would not allow for visible ownership of the container, not just
   making tracking down the container harder but also preventing the
   owner from accessing those processes through other APIs.
   
   Related to this proposal it was suggested to introduce a new sysctl
   which would allow the system administrator to prevent any id
   transitions to overflow ids, i.e. a process would not be able to
   set{g,u}id() to the overflow {g,u}id.  A distribution can then decide
   to select specific overflow ids (akin to a system id) and set them
   via the already existing /proc/sys/kernel/overflow{g,u}id sysctl
   interfaces. This increases the security that isolated user namespaces
   provide even more.

2. How is filesystem access in isolated user namespaces handled?
   (This is basically the problem outlined in P5).
   There were quite a few proposals pitched by Andy and some others and it
   would be difficult to summarize them all here, especially since a few of
   them were rather rudimentary sketches. Once the YouTube video of the
   Hackroom session is up people can listen to it in more detail.

   The first consensus reached seemed to be to decouple isolated user
   namespaces from shiftfs. The idea is to solely rely on tmpfs and fuse
   at the beginning as filesystems which can be mounted inside isolated
   user namespaces and so would have proper ownership. For mount points
   that originate from outside the namespace, everything will show as
   the overflow ids and access would be restricted to the most
   restricted permission bit for any path that can be accessed.

### Additional Requirements

Sargun pointed out that they make use of NFSv4 both id mapped, and non-id
mapped. Different id mappings between different filesystems in NFS is not part
of their use-case currently and so it is fine if the ids are passed through as
is. He additionally pointed out that they would like to be able use the
idmapper tool in such isolated containers. This tool maps a given process id to
the highest user id available. It seems that all of these use-cases would work
with the current setup.

It was proposed that for NFS an alternative solution should be considered,
namely making it possible to mount NFS inside of a user namespace. This work
would need to be done by someone well-versed in modern NFS.

### Action Items

The following consensus seemed to have been reached by the end of the session:

1. Fixing rlimits in user namespaces such that one namespace cannot affect
   another.
   This was identified as problem P6 above. During the session it seemed that
   Eric intended to investigate this.

2. Prototyping switching the kernel uid/gid types to 64bit, allowing for a
   hidden 32bit identifier and fully usable 32bit uid/gid range for the
   container.
   The consensus seemed to have been to implement a first version of this and
   do performance testing to see what the performance impact of this change
   would be.
   Aleksa Sarai and Christian Brauner stated they were interested in
   taking on this work jointly.

3. Find a way to allow setgroups() in a user namespace while keeping
   in mind the case of groups used for negative access control.
   This was suggested by Josh Triplett and Geoffrey Thomas. Their idea was to
   investigate adding a prctl() to allow setgroups() to be called in a user
   namespace at the cost of restricting paths to the most restrictive
   permission. So if something is 0707 it needs to be treated as if it's 0000
   even though the caller is not in its owning group which is used for negative
   access control (how these new semantics will interact with ACLs will also
   need to be looked into).

4. Add optional enforcement that overflow uid/gid as set through
   sysctl cannot be used as regular uid/gid on the system, which will allow
   userspace to disambiguate credential IDs which are unmapped versus the
   “nobody” user (which is actually used by distributions) It seemed that this
   idea was pitched by Geoffrey Thomas.

Special thanks to Stéphane and Aleksa for corrections and additions!

Thanks!
Christian

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: LPC 2020 Hackroom Session: summary and next steps for isolated user namespaces
  2020-08-30 14:39 LPC 2020 Hackroom Session: summary and next steps for isolated user namespaces Christian Brauner
@ 2020-10-10  4:26 ` Serge E. Hallyn
  2020-10-11 20:53   ` Josh Triplett
  2020-10-15 15:31 ` Enrico Weigelt, metux IT consult
  1 sibling, 1 reply; 22+ messages in thread
From: Serge E. Hallyn @ 2020-10-10  4:26 UTC (permalink / raw)
  To: Christian Brauner
  Cc: containers, Alexander Mihalicyn, Mrunal Patel, Wat Lim,
	Aleksa Sarai, Pavel Tikhomirov, Geoffrey Thomas,
	Eric W. Biederman, Joseph Christopher Sible,
	Mickaël Salaün, Vivek Goyal, Giuseppe Scrivano,
	Andy Lutomirski, Stephane Graber, Kees Cook, Sargun Dhillon,
	Josh Triplett, linux-kernel

> 3. Find a way to allow setgroups() in a user namespace while keeping
>    in mind the case of groups used for negative access control.
>    This was suggested by Josh Triplett and Geoffrey Thomas. Their idea was to
>    investigate adding a prctl() to allow setgroups() to be called in a user
>    namespace at the cost of restricting paths to the most restrictive
>    permission. So if something is 0707 it needs to be treated as if it's 0000
>    even though the caller is not in its owning group which is used for negative
>    access control (how these new semantics will interact with ACLs will also
>    need to be looked into).

I should probably think this through more, but for this problem, would it
not suffice to add a new prevgroups grouplist to the struct cred, maybe
struct group_info *locked_groups, and every time an unprivileged task creates
a new user namespace, add all its current groups to this list?

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: LPC 2020 Hackroom Session: summary and next steps for isolated user namespaces
  2020-10-10  4:26 ` Serge E. Hallyn
@ 2020-10-11 20:53   ` Josh Triplett
  2020-10-12  0:38     ` Andy Lutomirski
  2020-10-12 17:05     ` Giuseppe Scrivano
  0 siblings, 2 replies; 22+ messages in thread
From: Josh Triplett @ 2020-10-11 20:53 UTC (permalink / raw)
  To: Serge E. Hallyn
  Cc: Christian Brauner, containers, Alexander Mihalicyn, Mrunal Patel,
	Wat Lim, Aleksa Sarai, Pavel Tikhomirov, Geoffrey Thomas,
	Eric W. Biederman, Joseph Christopher Sible,
	Mickaël Salaün, Vivek Goyal, Giuseppe Scrivano,
	Andy Lutomirski, Stephane Graber, Kees Cook, Sargun Dhillon,
	linux-kernel

On Fri, Oct 09, 2020 at 11:26:06PM -0500, Serge E. Hallyn wrote:
> > 3. Find a way to allow setgroups() in a user namespace while keeping
> >    in mind the case of groups used for negative access control.
> >    This was suggested by Josh Triplett and Geoffrey Thomas. Their idea was to
> >    investigate adding a prctl() to allow setgroups() to be called in a user
> >    namespace at the cost of restricting paths to the most restrictive
> >    permission. So if something is 0707 it needs to be treated as if it's 0000
> >    even though the caller is not in its owning group which is used for negative
> >    access control (how these new semantics will interact with ACLs will also
> >    need to be looked into).
> 
> I should probably think this through more, but for this problem, would it
> not suffice to add a new prevgroups grouplist to the struct cred, maybe
> struct group_info *locked_groups, and every time an unprivileged task creates
> a new user namespace, add all its current groups to this list?

So, effectively, you would be allowed to drop permissions, but
locked_groups would still be checked for restrictions?

That seems like it'd introduce a new level of complexity (a new facet of
permission) to manage. Not opposed, but it does seem more complex than
just opting out of using groups for negative permissions.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: LPC 2020 Hackroom Session: summary and next steps for isolated user namespaces
  2020-10-11 20:53   ` Josh Triplett
@ 2020-10-12  0:38     ` Andy Lutomirski
  2020-10-12  5:01       ` Eric W. Biederman
  2020-10-12 17:05     ` Giuseppe Scrivano
  1 sibling, 1 reply; 22+ messages in thread
From: Andy Lutomirski @ 2020-10-12  0:38 UTC (permalink / raw)
  To: Josh Triplett
  Cc: Serge E. Hallyn, Christian Brauner, Linux Containers,
	Alexander Mihalicyn, Mrunal Patel, Wat Lim, Aleksa Sarai,
	Pavel Tikhomirov, Geoffrey Thomas, Eric W. Biederman,
	Joseph Christopher Sible, Mickaël Salaün, Vivek Goyal,
	Giuseppe Scrivano, Stephane Graber, Kees Cook, Sargun Dhillon,
	LKML

On Sun, Oct 11, 2020 at 1:53 PM Josh Triplett <josh@joshtriplett.org> wrote:
>
> On Fri, Oct 09, 2020 at 11:26:06PM -0500, Serge E. Hallyn wrote:
> > > 3. Find a way to allow setgroups() in a user namespace while keeping
> > >    in mind the case of groups used for negative access control.
> > >    This was suggested by Josh Triplett and Geoffrey Thomas. Their idea was to
> > >    investigate adding a prctl() to allow setgroups() to be called in a user
> > >    namespace at the cost of restricting paths to the most restrictive
> > >    permission. So if something is 0707 it needs to be treated as if it's 0000
> > >    even though the caller is not in its owning group which is used for negative
> > >    access control (how these new semantics will interact with ACLs will also
> > >    need to be looked into).
> >
> > I should probably think this through more, but for this problem, would it
> > not suffice to add a new prevgroups grouplist to the struct cred, maybe
> > struct group_info *locked_groups, and every time an unprivileged task creates
> > a new user namespace, add all its current groups to this list?
>
> So, effectively, you would be allowed to drop permissions, but
> locked_groups would still be checked for restrictions?
>
> That seems like it'd introduce a new level of complexity (a new facet of
> permission) to manage. Not opposed, but it does seem more complex than
> just opting out of using groups for negative permissions.

Is there any context other than regular UNIX DAC in which groups can
act as negative permissions or is this literally just an issue for
files with a more restrictive group mode than other mode?

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: LPC 2020 Hackroom Session: summary and next steps for isolated user namespaces
  2020-10-12  0:38     ` Andy Lutomirski
@ 2020-10-12  5:01       ` Eric W. Biederman
  2020-10-12 15:00         ` Serge E. Hallyn
  0 siblings, 1 reply; 22+ messages in thread
From: Eric W. Biederman @ 2020-10-12  5:01 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Josh Triplett, Serge E. Hallyn, Christian Brauner,
	Linux Containers, Alexander Mihalicyn, Mrunal Patel, Wat Lim,
	Aleksa Sarai, Pavel Tikhomirov, Geoffrey Thomas,
	Joseph Christopher Sible, Mickaël Salaün, Vivek Goyal,
	Giuseppe Scrivano, Stephane Graber, Kees Cook, Sargun Dhillon,
	LKML

Andy Lutomirski <luto@kernel.org> writes:

> On Sun, Oct 11, 2020 at 1:53 PM Josh Triplett <josh@joshtriplett.org> wrote:
>>
>> On Fri, Oct 09, 2020 at 11:26:06PM -0500, Serge E. Hallyn wrote:
>> > > 3. Find a way to allow setgroups() in a user namespace while keeping
>> > >    in mind the case of groups used for negative access control.
>> > >    This was suggested by Josh Triplett and Geoffrey Thomas. Their idea was to
>> > >    investigate adding a prctl() to allow setgroups() to be called in a user
>> > >    namespace at the cost of restricting paths to the most restrictive
>> > >    permission. So if something is 0707 it needs to be treated as if it's 0000
>> > >    even though the caller is not in its owning group which is used for negative
>> > >    access control (how these new semantics will interact with ACLs will also
>> > >    need to be looked into).
>> >
>> > I should probably think this through more, but for this problem, would it
>> > not suffice to add a new prevgroups grouplist to the struct cred, maybe
>> > struct group_info *locked_groups, and every time an unprivileged task creates
>> > a new user namespace, add all its current groups to this list?
>>
>> So, effectively, you would be allowed to drop permissions, but
>> locked_groups would still be checked for restrictions?
>>
>> That seems like it'd introduce a new level of complexity (a new facet of
>> permission) to manage. Not opposed, but it does seem more complex than
>> just opting out of using groups for negative permissions.
>
> Is there any context other than regular UNIX DAC in which groups can
> act as negative permissions or is this literally just an issue for
> files with a more restrictive group mode than other mode?

Just that.

The ideas kicked around in the conversation were some variant of having
a sysctl that says "This system never uses groups for negative
permissions".

It was also suggested that if the sysctl was set the the permission
checks would be altered such that even if someone tried to set a
negative permission, the more liberal permissions of other would be used
instead.

Given that creating /etc/subgid is effectively opting out of negative
permissions already have a sysctl that says that upfront feels like a
very clean solution.

Eric

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: LPC 2020 Hackroom Session: summary and next steps for isolated user namespaces
  2020-10-12  5:01       ` Eric W. Biederman
@ 2020-10-12 15:00         ` Serge E. Hallyn
  2020-10-14 19:46           ` Eric W. Biederman
  0 siblings, 1 reply; 22+ messages in thread
From: Serge E. Hallyn @ 2020-10-12 15:00 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Andy Lutomirski, Josh Triplett, Serge E. Hallyn,
	Christian Brauner, Linux Containers, Alexander Mihalicyn,
	Mrunal Patel, Wat Lim, Aleksa Sarai, Pavel Tikhomirov,
	Geoffrey Thomas, Joseph Christopher Sible,
	Mickaël Salaün, Vivek Goyal, Giuseppe Scrivano,
	Stephane Graber, Kees Cook, Sargun Dhillon, LKML

On Mon, Oct 12, 2020 at 12:01:09AM -0500, Eric W. Biederman wrote:
> Andy Lutomirski <luto@kernel.org> writes:
> 
> > On Sun, Oct 11, 2020 at 1:53 PM Josh Triplett <josh@joshtriplett.org> wrote:
> >>
> >> On Fri, Oct 09, 2020 at 11:26:06PM -0500, Serge E. Hallyn wrote:
> >> > > 3. Find a way to allow setgroups() in a user namespace while keeping
> >> > >    in mind the case of groups used for negative access control.
> >> > >    This was suggested by Josh Triplett and Geoffrey Thomas. Their idea was to
> >> > >    investigate adding a prctl() to allow setgroups() to be called in a user
> >> > >    namespace at the cost of restricting paths to the most restrictive
> >> > >    permission. So if something is 0707 it needs to be treated as if it's 0000
> >> > >    even though the caller is not in its owning group which is used for negative
> >> > >    access control (how these new semantics will interact with ACLs will also
> >> > >    need to be looked into).
> >> >
> >> > I should probably think this through more, but for this problem, would it
> >> > not suffice to add a new prevgroups grouplist to the struct cred, maybe
> >> > struct group_info *locked_groups, and every time an unprivileged task creates
> >> > a new user namespace, add all its current groups to this list?
> >>
> >> So, effectively, you would be allowed to drop permissions, but
> >> locked_groups would still be checked for restrictions?
> >>
> >> That seems like it'd introduce a new level of complexity (a new facet of
> >> permission) to manage. Not opposed, but it does seem more complex than
> >> just opting out of using groups for negative permissions.

Yeah, it would, but I basically hoped that we could catch most of this at
e.g. generic_permission(), and/or we could introduce a helper which
automatically adds a check for permission denied from locked_groups, so
it shouldn't be too wide-spread.  If it does end up showing up all over
the place, then that's a good reason not to do this.

> > Is there any context other than regular UNIX DAC in which groups can
> > act as negative permissions or is this literally just an issue for
> > files with a more restrictive group mode than other mode?
> 
> Just that.
> 
> The ideas kicked around in the conversation were some variant of having
> a sysctl that says "This system never uses groups for negative
> permissions".
> 
> It was also suggested that if the sysctl was set the the permission
> checks would be altered such that even if someone tried to set a
> negative permission, the more liberal permissions of other would be used
> instead.

So then this would touch all the same code points which the
locked_groups approach would have to touch?

> Given that creating /etc/subgid is effectively opting out of negative
> permissions already have a sysctl that says that upfront feels like a
> very clean solution.
> 
> Eric

That feels like a cop-out to me.  If some young admin at Roxxon Corp decides
she needs to run a container, so installs subuid package and sets that sysctl,
how does she know whether or not some previous admin, who has since retired and
did not keep good docs, set things up so that a negative acl is keeping nginx
from reading some supersecret doc?

Now personally I'm not a great believer in the negative acls so I think the
above is a very unlikely scenario, but if we're going to worry about it, then
we should worry about it :)

"Click this button if noone has ever used feature X on this server"

-serge

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: LPC 2020 Hackroom Session: summary and next steps for isolated user namespaces
  2020-10-11 20:53   ` Josh Triplett
  2020-10-12  0:38     ` Andy Lutomirski
@ 2020-10-12 17:05     ` Giuseppe Scrivano
  2020-10-13 12:46       ` Serge E. Hallyn
  1 sibling, 1 reply; 22+ messages in thread
From: Giuseppe Scrivano @ 2020-10-12 17:05 UTC (permalink / raw)
  To: Josh Triplett
  Cc: Serge E. Hallyn, Christian Brauner, containers,
	Alexander Mihalicyn, Mrunal Patel, Wat Lim, Aleksa Sarai,
	Pavel Tikhomirov, Geoffrey Thomas, Eric W. Biederman,
	Joseph Christopher Sible, Mickaël Salaün, Vivek Goyal,
	Andy Lutomirski, Stephane Graber, Kees Cook, Sargun Dhillon,
	linux-kernel

Josh Triplett <josh@joshtriplett.org> writes:

> On Fri, Oct 09, 2020 at 11:26:06PM -0500, Serge E. Hallyn wrote:
>> > 3. Find a way to allow setgroups() in a user namespace while keeping
>> >    in mind the case of groups used for negative access control.
>> >    This was suggested by Josh Triplett and Geoffrey Thomas. Their idea was to
>> >    investigate adding a prctl() to allow setgroups() to be called in a user
>> >    namespace at the cost of restricting paths to the most restrictive
>> >    permission. So if something is 0707 it needs to be treated as if it's 0000
>> >    even though the caller is not in its owning group which is used for negative
>> >    access control (how these new semantics will interact with ACLs will also
>> >    need to be looked into).
>> 
>> I should probably think this through more, but for this problem, would it
>> not suffice to add a new prevgroups grouplist to the struct cred, maybe
>> struct group_info *locked_groups, and every time an unprivileged task creates
>> a new user namespace, add all its current groups to this list?
>
> So, effectively, you would be allowed to drop permissions, but
> locked_groups would still be checked for restrictions?
>
> That seems like it'd introduce a new level of complexity (a new facet of
> permission) to manage. Not opposed, but it does seem more complex than
> just opting out of using groups for negative permissions.

I have played with something similar in the past.  At that time I've
discussed it only privately with Eric and we agreed it wasn't worth the
extra complexity:

https://github.com/giuseppe/linux/commit/7e0701b389c497472d11fab8570c153a414050af

instead of a prctl, I've added a new mode to /proc/PID/setgroups that
allows setgroups in a userns locking the current gids.

What do you think about using /proc/PID/setgroups instead of a new
prctl()?

Giuseppe


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: LPC 2020 Hackroom Session: summary and next steps for isolated user namespaces
  2020-10-12 17:05     ` Giuseppe Scrivano
@ 2020-10-13 12:46       ` Serge E. Hallyn
  2020-10-13 15:17         ` Giuseppe Scrivano
  0 siblings, 1 reply; 22+ messages in thread
From: Serge E. Hallyn @ 2020-10-13 12:46 UTC (permalink / raw)
  To: Giuseppe Scrivano
  Cc: Josh Triplett, Serge E. Hallyn, Christian Brauner, containers,
	Alexander Mihalicyn, Mrunal Patel, Wat Lim, Aleksa Sarai,
	Pavel Tikhomirov, Geoffrey Thomas, Eric W. Biederman,
	Joseph Christopher Sible, Mickaël Salaün, Vivek Goyal,
	Andy Lutomirski, Stephane Graber, Kees Cook, Sargun Dhillon,
	linux-kernel

On Mon, Oct 12, 2020 at 07:05:10PM +0200, Giuseppe Scrivano wrote:
> Josh Triplett <josh@joshtriplett.org> writes:
> 
> > On Fri, Oct 09, 2020 at 11:26:06PM -0500, Serge E. Hallyn wrote:
> >> > 3. Find a way to allow setgroups() in a user namespace while keeping
> >> >    in mind the case of groups used for negative access control.
> >> >    This was suggested by Josh Triplett and Geoffrey Thomas. Their idea was to
> >> >    investigate adding a prctl() to allow setgroups() to be called in a user
> >> >    namespace at the cost of restricting paths to the most restrictive
> >> >    permission. So if something is 0707 it needs to be treated as if it's 0000
> >> >    even though the caller is not in its owning group which is used for negative
> >> >    access control (how these new semantics will interact with ACLs will also
> >> >    need to be looked into).
> >> 
> >> I should probably think this through more, but for this problem, would it
> >> not suffice to add a new prevgroups grouplist to the struct cred, maybe
> >> struct group_info *locked_groups, and every time an unprivileged task creates
> >> a new user namespace, add all its current groups to this list?
> >
> > So, effectively, you would be allowed to drop permissions, but
> > locked_groups would still be checked for restrictions?
> >
> > That seems like it'd introduce a new level of complexity (a new facet of
> > permission) to manage. Not opposed, but it does seem more complex than
> > just opting out of using groups for negative permissions.
> 
> I have played with something similar in the past.  At that time I've
> discussed it only privately with Eric and we agreed it wasn't worth the
> extra complexity:
> 
> https://github.com/giuseppe/linux/commit/7e0701b389c497472d11fab8570c153a414050af

Hi, you linked the setgroups patch, but do you also have a link to the
attempt which you deemed was not worth it?

> instead of a prctl, I've added a new mode to /proc/PID/setgroups that
> allows setgroups in a userns locking the current gids.
> 
> What do you think about using /proc/PID/setgroups instead of a new
> prctl()?

It's better than not having it, but two concerns -

1. some userspace, especially testsuites, could become confused by the fact
that they can't drop groups no matter how hard they try, since these will all
still show up as regular groups.
2. whereas in my lockgroups proposal, lock_groups would only be taken into account
for permission denial, this proposal would count for permission grants too.  This
means that if I have a group which is permitted to read /foo/topsecret, and I
start a program in a new user namespace expecting it to drop that permission,
I can't have that, right?  The new program, will always have that permission?

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: LPC 2020 Hackroom Session: summary and next steps for isolated user namespaces
  2020-10-13 12:46       ` Serge E. Hallyn
@ 2020-10-13 15:17         ` Giuseppe Scrivano
  2020-10-15 14:32           ` Serge E. Hallyn
  0 siblings, 1 reply; 22+ messages in thread
From: Giuseppe Scrivano @ 2020-10-13 15:17 UTC (permalink / raw)
  To: Serge E. Hallyn
  Cc: Josh Triplett, Christian Brauner, containers,
	Alexander Mihalicyn, Mrunal Patel, Wat Lim, Aleksa Sarai,
	Pavel Tikhomirov, Geoffrey Thomas, Eric W. Biederman,
	Joseph Christopher Sible, Mickaël Salaün, Vivek Goyal,
	Andy Lutomirski, Stephane Graber, Kees Cook, Sargun Dhillon,
	linux-kernel

"Serge E. Hallyn" <serge@hallyn.com> writes:

> On Mon, Oct 12, 2020 at 07:05:10PM +0200, Giuseppe Scrivano wrote:
>> Josh Triplett <josh@joshtriplett.org> writes:
>> 
>> > On Fri, Oct 09, 2020 at 11:26:06PM -0500, Serge E. Hallyn wrote:
>> >> > 3. Find a way to allow setgroups() in a user namespace while keeping
>> >> >    in mind the case of groups used for negative access control.
>> >> >    This was suggested by Josh Triplett and Geoffrey Thomas. Their idea was to
>> >> >    investigate adding a prctl() to allow setgroups() to be called in a user
>> >> >    namespace at the cost of restricting paths to the most restrictive
>> >> >    permission. So if something is 0707 it needs to be treated as if it's 0000
>> >> >    even though the caller is not in its owning group which is used for negative
>> >> >    access control (how these new semantics will interact with ACLs will also
>> >> >    need to be looked into).
>> >> 
>> >> I should probably think this through more, but for this problem, would it
>> >> not suffice to add a new prevgroups grouplist to the struct cred, maybe
>> >> struct group_info *locked_groups, and every time an unprivileged task creates
>> >> a new user namespace, add all its current groups to this list?
>> >
>> > So, effectively, you would be allowed to drop permissions, but
>> > locked_groups would still be checked for restrictions?
>> >
>> > That seems like it'd introduce a new level of complexity (a new facet of
>> > permission) to manage. Not opposed, but it does seem more complex than
>> > just opting out of using groups for negative permissions.
>> 
>> I have played with something similar in the past.  At that time I've
>> discussed it only privately with Eric and we agreed it wasn't worth the
>> extra complexity:
>> 
>> https://github.com/giuseppe/linux/commit/7e0701b389c497472d11fab8570c153a414050af
>
> Hi, you linked the setgroups patch, but do you also have a link to the
> attempt which you deemed was not worth it?

it was just part of a private discussion; but was 4 years ago so we can
probably revisit and accept the additional complexity since setgroups()
is still an issue with user namespaces.


>> instead of a prctl, I've added a new mode to /proc/PID/setgroups that
>> allows setgroups in a userns locking the current gids.
>> 
>> What do you think about using /proc/PID/setgroups instead of a new
>> prctl()?
>
> It's better than not having it, but two concerns -
>
> 1. some userspace, especially testsuites, could become confused by the fact
> that they can't drop groups no matter how hard they try, since these will all
> still show up as regular groups.

I forgot to send a link to a second patch :-) that completes the feature:
https://github.com/giuseppe/linux/commit/1c5fe726346b216293a527719e64f34e6297f0c2

When the new mode is used, the gids that are not known in the userns do
not show up in userspace.

> 2. whereas in my lockgroups proposal, lock_groups would only be taken into account
> for permission denial, this proposal would count for permission grants too.  This
> means that if I have a group which is permitted to read /foo/topsecret, and I
> start a program in a new user namespace expecting it to drop that permission,
> I can't have that, right?  The new program, will always have that permission?

right.  The new mode I was working on cannot be used to drop granted permissions.

Giuseppe


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: LPC 2020 Hackroom Session: summary and next steps for isolated user namespaces
  2020-10-12 15:00         ` Serge E. Hallyn
@ 2020-10-14 19:46           ` Eric W. Biederman
  2020-10-15 14:27             ` Serge E. Hallyn
  0 siblings, 1 reply; 22+ messages in thread
From: Eric W. Biederman @ 2020-10-14 19:46 UTC (permalink / raw)
  To: Serge E. Hallyn
  Cc: Andy Lutomirski, Josh Triplett, Christian Brauner,
	Linux Containers, Alexander Mihalicyn, Mrunal Patel, Wat Lim,
	Aleksa Sarai, Pavel Tikhomirov, Geoffrey Thomas,
	Joseph Christopher Sible, Mickaël Salaün, Vivek Goyal,
	Giuseppe Scrivano, Stephane Graber, Kees Cook, Sargun Dhillon,
	LKML

"Serge E. Hallyn" <serge@hallyn.com> writes:

> On Mon, Oct 12, 2020 at 12:01:09AM -0500, Eric W. Biederman wrote:
>> Andy Lutomirski <luto@kernel.org> writes:
>> 
>> > On Sun, Oct 11, 2020 at 1:53 PM Josh Triplett <josh@joshtriplett.org> wrote:
>> >>
>> >> On Fri, Oct 09, 2020 at 11:26:06PM -0500, Serge E. Hallyn wrote:
>> >> > > 3. Find a way to allow setgroups() in a user namespace while keeping
>> >> > >    in mind the case of groups used for negative access control.
>> >> > >    This was suggested by Josh Triplett and Geoffrey Thomas. Their idea was to
>> >> > >    investigate adding a prctl() to allow setgroups() to be called in a user
>> >> > >    namespace at the cost of restricting paths to the most restrictive
>> >> > >    permission. So if something is 0707 it needs to be treated as if it's 0000
>> >> > >    even though the caller is not in its owning group which is used for negative
>> >> > >    access control (how these new semantics will interact with ACLs will also
>> >> > >    need to be looked into).
>> >> >
>> >> > I should probably think this through more, but for this problem, would it
>> >> > not suffice to add a new prevgroups grouplist to the struct cred, maybe
>> >> > struct group_info *locked_groups, and every time an unprivileged task creates
>> >> > a new user namespace, add all its current groups to this list?
>> >>
>> >> So, effectively, you would be allowed to drop permissions, but
>> >> locked_groups would still be checked for restrictions?
>> >>
>> >> That seems like it'd introduce a new level of complexity (a new facet of
>> >> permission) to manage. Not opposed, but it does seem more complex than
>> >> just opting out of using groups for negative permissions.
>
> Yeah, it would, but I basically hoped that we could catch most of this at
> e.g. generic_permission(), and/or we could introduce a helper which
> automatically adds a check for permission denied from locked_groups, so
> it shouldn't be too wide-spread.  If it does end up showing up all over
> the place, then that's a good reason not to do this.
>
>> > Is there any context other than regular UNIX DAC in which groups can
>> > act as negative permissions or is this literally just an issue for
>> > files with a more restrictive group mode than other mode?
>> 
>> Just that.
>> 
>> The ideas kicked around in the conversation were some variant of having
>> a sysctl that says "This system never uses groups for negative
>> permissions".
>> 
>> It was also suggested that if the sysctl was set the the permission
>> checks would be altered such that even if someone tried to set a
>> negative permission, the more liberal permissions of other would be used
>> instead.
>
> So then this would touch all the same code points which the
> locked_groups approach would have to touch?

No locked_groups would touch in_group_p and set_groups.  Especially what
set_groups means in that context.  It would have to handle what happens
when you start accumulating locked groups (because of multiple
namespaces).  How you dedup locked groups etc.

I was not able to convince myself that not being able to clear out
groups that a user has when they create a user namespace won't cause
other problems.  Especially as user namespaces had been in use for a
while at that point.

Not supporting negative groups would touch acl_permission and modify it
like:

 static int acl_permission_check(struct inode *inode, int mask)
 {
[irrelveant code snipped]
 	/* Only RWX matters for group/other mode bits */
 	mask &= 7;
 
 	/*
 	 * Are the group permissions different from
 	 * the other permissions in the bits we care
 	 * about? Need to check group ownership if so.
 	 */
 	if (mask & (mode ^ (mode >> 3))) {
-		if (in_group_p(inode->i_gid))
+		if (in_group_p(inode->i_gid) &&
+		    (!sysctl_force_positive_groups ||
+		    (mask & ~(mode >> 3)))
 			mode >>= 3;
 	}
 
 	/* Bits in 'mode' clear that we require? */
 	return (mask & ~mode) ? -EACCES : 0;
 }


I don't know that we need to do that.  But it would might be a good way
of flushing out the issues.


>> Given that creating /etc/subgid is effectively opting out of negative
>> permissions already have a sysctl that says that upfront feels like a
>> very clean solution.
>> 
>> Eric
>
> That feels like a cop-out to me.  If some young admin at Roxxon Corp decides
> she needs to run a container, so installs subuid package and sets that sysctl,
> how does she know whether or not some previous admin, who has since retired and
> did not keep good docs, set things up so that a negative acl is keeping nginx
> from reading some supersecret doc?
>
> Now personally I'm not a great believer in the negative acls so I think the
> above is a very unlikely scenario, but if we're going to worry about it, then
> we should worry about it :)

There is a different between guaranting we don't break existing setups
when a new feature is enabled, and supporting old very rare setups when
a new feature is enabled.


> "Click this button if noone has ever used feature X on this server"

My current thinking is that if we already don't honor negative groups
when /etc/subgid exists it would not hurt to make that more explicit.


From what we could tell at the time people that know negative groups are
honored much less systems that actually use negative groups are
exceedingly rare.

Eric


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: LPC 2020 Hackroom Session: summary and next steps for isolated user namespaces
  2020-10-14 19:46           ` Eric W. Biederman
@ 2020-10-15 14:27             ` Serge E. Hallyn
  2020-10-17 15:04               ` Eric W. Biederman
  0 siblings, 1 reply; 22+ messages in thread
From: Serge E. Hallyn @ 2020-10-15 14:27 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Serge E. Hallyn, Andy Lutomirski, Josh Triplett,
	Christian Brauner, Linux Containers, Alexander Mihalicyn,
	Mrunal Patel, Wat Lim, Aleksa Sarai, Pavel Tikhomirov,
	Geoffrey Thomas, Joseph Christopher Sible,
	Mickaël Salaün, Vivek Goyal, Giuseppe Scrivano,
	Stephane Graber, Kees Cook, Sargun Dhillon, LKML

On Wed, Oct 14, 2020 at 02:46:46PM -0500, Eric W. Biederman wrote:
> "Serge E. Hallyn" <serge@hallyn.com> writes:
> 
> > On Mon, Oct 12, 2020 at 12:01:09AM -0500, Eric W. Biederman wrote:
> >> Andy Lutomirski <luto@kernel.org> writes:
> >> 
> >> > On Sun, Oct 11, 2020 at 1:53 PM Josh Triplett <josh@joshtriplett.org> wrote:
> >> >>
> >> >> On Fri, Oct 09, 2020 at 11:26:06PM -0500, Serge E. Hallyn wrote:
> >> >> > > 3. Find a way to allow setgroups() in a user namespace while keeping
> >> >> > >    in mind the case of groups used for negative access control.
> >> >> > >    This was suggested by Josh Triplett and Geoffrey Thomas. Their idea was to
> >> >> > >    investigate adding a prctl() to allow setgroups() to be called in a user
> >> >> > >    namespace at the cost of restricting paths to the most restrictive
> >> >> > >    permission. So if something is 0707 it needs to be treated as if it's 0000
> >> >> > >    even though the caller is not in its owning group which is used for negative
> >> >> > >    access control (how these new semantics will interact with ACLs will also
> >> >> > >    need to be looked into).
> >> >> >
> >> >> > I should probably think this through more, but for this problem, would it
> >> >> > not suffice to add a new prevgroups grouplist to the struct cred, maybe
> >> >> > struct group_info *locked_groups, and every time an unprivileged task creates
> >> >> > a new user namespace, add all its current groups to this list?
> >> >>
> >> >> So, effectively, you would be allowed to drop permissions, but
> >> >> locked_groups would still be checked for restrictions?
> >> >>
> >> >> That seems like it'd introduce a new level of complexity (a new facet of
> >> >> permission) to manage. Not opposed, but it does seem more complex than
> >> >> just opting out of using groups for negative permissions.
> >
> > Yeah, it would, but I basically hoped that we could catch most of this at
> > e.g. generic_permission(), and/or we could introduce a helper which
> > automatically adds a check for permission denied from locked_groups, so
> > it shouldn't be too wide-spread.  If it does end up showing up all over
> > the place, then that's a good reason not to do this.
> >
> >> > Is there any context other than regular UNIX DAC in which groups can
> >> > act as negative permissions or is this literally just an issue for
> >> > files with a more restrictive group mode than other mode?
> >> 
> >> Just that.
> >> 
> >> The ideas kicked around in the conversation were some variant of having
> >> a sysctl that says "This system never uses groups for negative
> >> permissions".
> >> 
> >> It was also suggested that if the sysctl was set the the permission
> >> checks would be altered such that even if someone tried to set a
> >> negative permission, the more liberal permissions of other would be used
> >> instead.
> >
> > So then this would touch all the same code points which the
> > locked_groups approach would have to touch?
> 
> No locked_groups would touch in_group_p and set_groups.  Especially what
> set_groups means in that context.  It would have to handle what happens
> when you start accumulating locked groups (because of multiple
> namespaces).  How you dedup locked groups etc.

Well since group_info is sorted, you should be able to do a pretty
simple and quick merge of current->locked_groups and
current->group_info.  I suppose we'd have to consider a nasty user who
is allocated 100k groups, sticks them all in groupinfo, then unshare
twice, locking the kernel up for awhile, but that user can already hurt
us.

> I was not able to convince myself that not being able to clear out
> groups that a user has when they create a user namespace won't cause
> other problems.  Especially as user namespaces had been in use for a
> while at that point.

The locked_groups would *only* be considered for negative acls, right?
You would not *grant* any perms based on them.  It seems like exactly
what you want.  If any user is denied perms on account of it, then that
was the intent, and that's the whole reason we're having this problem.
We are discussing whether it's ok to let a new user_ns be a way to
bypass that restriction - not *looking* for a way to support bypassing
it.

I could state this as a more formal proof if you like.

> Not supporting negative groups would touch acl_permission and modify it
> like:
> 
>  static int acl_permission_check(struct inode *inode, int mask)
>  {
> [irrelveant code snipped]
>  	/* Only RWX matters for group/other mode bits */
>  	mask &= 7;
>  
>  	/*
>  	 * Are the group permissions different from
>  	 * the other permissions in the bits we care
>  	 * about? Need to check group ownership if so.
>  	 */
>  	if (mask & (mode ^ (mode >> 3))) {
> -		if (in_group_p(inode->i_gid))
> +		if (in_group_p(inode->i_gid) &&
> +		    (!sysctl_force_positive_groups ||
> +		    (mask & ~(mode >> 3)))
>  			mode >>= 3;
>  	}
>  
>  	/* Bits in 'mode' clear that we require? */
>  	return (mask & ~mode) ? -EACCES : 0;
>  }
> 
> 
> I don't know that we need to do that.  But it would might be a good way
> of flushing out the issues.
> 
> 
> >> Given that creating /etc/subgid is effectively opting out of negative
> >> permissions already have a sysctl that says that upfront feels like a
> >> very clean solution.
> >> 
> >> Eric
> >
> > That feels like a cop-out to me.  If some young admin at Roxxon Corp decides
> > she needs to run a container, so installs subuid package and sets that sysctl,
> > how does she know whether or not some previous admin, who has since retired and
> > did not keep good docs, set things up so that a negative acl is keeping nginx
> > from reading some supersecret doc?
> >
> > Now personally I'm not a great believer in the negative acls so I think the
> > above is a very unlikely scenario, but if we're going to worry about it, then
> > we should worry about it :)
> 
> There is a different between guaranting we don't break existing setups
> when a new feature is enabled, and supporting old very rare setups when
> a new feature is enabled.

If the new feature is enabled by default, then no.  (I know the kernel
kconfig =n, but these users likely are using a distribution, which
probably enables it)

And I also argue that if the new feature appears unrelated and is highly
desirable, then again no.  Everyone these days wants to "enable
container support" and it seems unrelated to using file permissions the
way they appear meant to be used (I know I'm contradicting myself here
:)  It's not like we are stopping support for an old architecture.  It's
a behavior resulting from a combination of ordinary configuration pieces
spread throught a system which is hard for a new admin to know is
actually being used.  I guess let me put it this way:  I don't think
that when a new admin is hired, she is told "Now remember, we use acls
to deny file permissions to certain users based on their groupid, watch
out for that."

> > "Click this button if noone has ever used feature X on this server"
> 
> My current thinking is that if we already don't honor negative groups
> when /etc/subgid exists it would not hurt to make that more explicit.

Ok, if that's how everyone feels, then I'll step back.

It was just an idea :)

I prefer, in this case, a simple host-wide sysctl to allow setgroups.

> >From what we could tell at the time people that know negative groups are
> honored much less systems that actually use negative groups are
> exceedingly rare.

You're probably right, but again I think if you did a survey, many
people who are using it probably answered no because they either
forgot or don't think of it in those terms.

-serge

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: LPC 2020 Hackroom Session: summary and next steps for isolated user namespaces
  2020-10-13 15:17         ` Giuseppe Scrivano
@ 2020-10-15 14:32           ` Serge E. Hallyn
  2020-10-19 12:12             ` Giuseppe Scrivano
  0 siblings, 1 reply; 22+ messages in thread
From: Serge E. Hallyn @ 2020-10-15 14:32 UTC (permalink / raw)
  To: Giuseppe Scrivano
  Cc: Serge E. Hallyn, Josh Triplett, Christian Brauner, containers,
	Alexander Mihalicyn, Mrunal Patel, Wat Lim, Aleksa Sarai,
	Pavel Tikhomirov, Geoffrey Thomas, Eric W. Biederman,
	Joseph Christopher Sible, Mickaël Salaün, Vivek Goyal,
	Andy Lutomirski, Stephane Graber, Kees Cook, Sargun Dhillon,
	linux-kernel

On Tue, Oct 13, 2020 at 05:17:36PM +0200, Giuseppe Scrivano wrote:
> "Serge E. Hallyn" <serge@hallyn.com> writes:
> 
> > On Mon, Oct 12, 2020 at 07:05:10PM +0200, Giuseppe Scrivano wrote:
> >> Josh Triplett <josh@joshtriplett.org> writes:
> >> 
> >> > On Fri, Oct 09, 2020 at 11:26:06PM -0500, Serge E. Hallyn wrote:
> >> >> > 3. Find a way to allow setgroups() in a user namespace while keeping
> >> >> >    in mind the case of groups used for negative access control.
> >> >> >    This was suggested by Josh Triplett and Geoffrey Thomas. Their idea was to
> >> >> >    investigate adding a prctl() to allow setgroups() to be called in a user
> >> >> >    namespace at the cost of restricting paths to the most restrictive
> >> >> >    permission. So if something is 0707 it needs to be treated as if it's 0000
> >> >> >    even though the caller is not in its owning group which is used for negative
> >> >> >    access control (how these new semantics will interact with ACLs will also
> >> >> >    need to be looked into).
> >> >> 
> >> >> I should probably think this through more, but for this problem, would it
> >> >> not suffice to add a new prevgroups grouplist to the struct cred, maybe
> >> >> struct group_info *locked_groups, and every time an unprivileged task creates
> >> >> a new user namespace, add all its current groups to this list?
> >> >
> >> > So, effectively, you would be allowed to drop permissions, but
> >> > locked_groups would still be checked for restrictions?
> >> >
> >> > That seems like it'd introduce a new level of complexity (a new facet of
> >> > permission) to manage. Not opposed, but it does seem more complex than
> >> > just opting out of using groups for negative permissions.
> >> 
> >> I have played with something similar in the past.  At that time I've
> >> discussed it only privately with Eric and we agreed it wasn't worth the
> >> extra complexity:
> >> 
> >> https://github.com/giuseppe/linux/commit/7e0701b389c497472d11fab8570c153a414050af
> >
> > Hi, you linked the setgroups patch, but do you also have a link to the
> > attempt which you deemed was not worth it?
> 
> it was just part of a private discussion; but was 4 years ago so we can
> probably revisit and accept the additional complexity since setgroups()
> is still an issue with user namespaces.
> 
> 
> >> instead of a prctl, I've added a new mode to /proc/PID/setgroups that
> >> allows setgroups in a userns locking the current gids.
> >> 
> >> What do you think about using /proc/PID/setgroups instead of a new
> >> prctl()?
> >
> > It's better than not having it, but two concerns -
> >
> > 1. some userspace, especially testsuites, could become confused by the fact
> > that they can't drop groups no matter how hard they try, since these will all
> > still show up as regular groups.
> 
> I forgot to send a link to a second patch :-) that completes the feature:
> https://github.com/giuseppe/linux/commit/1c5fe726346b216293a527719e64f34e6297f0c2
> 
> When the new mode is used, the gids that are not known in the userns do
> not show up in userspace.

Ah, right - and of course those gids better not be mapped into the namespace :)

But so, this is the patch you said you agreed was not worth the extra
complexity?

> > 2. whereas in my lockgroups proposal, lock_groups would only be taken into account
> > for permission denial, this proposal would count for permission grants too.  This
> > means that if I have a group which is permitted to read /foo/topsecret, and I
> > start a program in a new user namespace expecting it to drop that permission,
> > I can't have that, right?  The new program, will always have that permission?
> 
> right.  The new mode I was working on cannot be used to drop granted permissions.
> 
> Giuseppe

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: LPC 2020 Hackroom Session: summary and next steps for isolated user namespaces
  2020-08-30 14:39 LPC 2020 Hackroom Session: summary and next steps for isolated user namespaces Christian Brauner
  2020-10-10  4:26 ` Serge E. Hallyn
@ 2020-10-15 15:31 ` Enrico Weigelt, metux IT consult
  2020-10-17 16:51   ` Eric W. Biederman
  1 sibling, 1 reply; 22+ messages in thread
From: Enrico Weigelt, metux IT consult @ 2020-10-15 15:31 UTC (permalink / raw)
  To: Christian Brauner, containers
  Cc: Alexander Mihalicyn, Giuseppe Scrivano, Joseph Christopher Sible,
	Kees Cook, linux-kernel, Josh Triplett, Andy Lutomirski,
	Eric W. Biederman, Mickaël Salaün, Wat Lim,
	Mrunal Patel, Pavel Tikhomirov, Geoffrey Thomas

On 30.08.20 16:39, Christian Brauner wrote:

Hi Christian,

> P1. Isolated id mappings can only be guaranteed to be locally isolated.
>     A container runtime/daemon can only guarantee non-overlapping id mappings
>     when no other users on the system create containers.

Indeed. But couldn't we just record the mappings in some standardized
place (eg. some file) which all engines maintain ?

I'd guess other solutions would need changes in the runtimes, too.

Please keep in mind that some scenarios actually need some overlaps, eg.
application containers that shall have direct access to home dirs.

> P2. Enforcing isolated id mappings in userspace is difficult.
>     It is always possible to create other processes with overlapping id
>     mappings. Coordinating id mappings in userspace will always remain
>     optional. Quite a few tools nowadays (including systemd) don't care about
>     /etc/sub{g,u}id and actively advise against using it. This is made even
>     more problematic since sub{g,u}iid delegation is done per-user rather than
>     per-container-runtime.

I believe subusers aren't meant for tyical containers (like docker or
lxc), but unprivileged user programs that wanna have further isolation
for subprocesses (eg. a browser's renderer or js engine).

Correct me if I'm wrong.

> P3. The range of the id mapping of a container can't be predetermined.
>     While POSIX mandates that a standard system should use a range of 65536 ids
>     reality is very different. Some programs allocate high ids for random
>     processes or for network authentication. This means, in practice it is
>     often necessary to assign a range of up to 10 million ids to a container.
>     This limits a system to less than 500 containers total.

In 25+ years, haven't seen such an application in the field. I'd
consider this a horrible and dangerous bug. Sane applications create
specific user entries (/etc/passwd) for that.

I'd say we're safe w/ max 2^16 users per container, which should give us
space for about 2^16 containers.

> P4. Isolated id mappings severely restrict the number of containers that can be
>     run on a system.
>     This ties back to the point about pre-determining the id range of a
>     container and how large range allocations tend to be on real systems. That
>     becomes even more relevant when nesting containers.

IMHO, all we need is to maintain a list of active ranges (more precisely
the 16bit prefixes, just like class B networks ;-)). As said, I'd
declare the scenario #P3 as invalid and rather fix those few broken
applications.

> P5. Container runtimes cannot reuse overlayfs lower directories if each
>     container uses isolated ID mappings, leading to either needless storage
>     overhead (LXD -- though the LXD folks don’t really mind), completely
>     ignoring the benefits of isolating containers from each other (Docker), or
>     not using them at all (Kubernetes). (This is a more general issue but bears
>     repeating since it is closely tied to most userns proposals.)

Indeed. That's IMHO the main problem. We somehow need to map the UIDs.
Maybe a synthetic filesystem that just does exactly the same uid<->kuid
translations we're already doing in other places ?

> P6. Rlimits pose a problem for containers that share the same id mapping.
>     This means containers with overlapping id mappings can DOS each other by
>     exhausting their rlimits. The reason for this lies with the current
>     implementation of rlimits -- rlimits are currently tied to users and are
>     not hierarchically limited like inotify limits are. This is a severe
>     problem in unprivileged workloads. Eric and others identified that this
>     issue can be fixed independently of the isolated user namespace proposal.

Is this really an practical isssue, when we're using uid namespaces ?

> S2. Kernel-enforced user namespace isolation.
>     This means, there is no need for different container runtimes to
>     collaborate on id ranges with immediate benefits for everyone.
>     This solves P1 and P2.

Okay, but how to support scenarios where some of the UIDs should
overlap on purpose ? (eg. mounting some of the host's user homedirs
into namespaces ?)

> S5. The owning id concept of a user namespace makes monitoring and interacting
>     with such containers way easier.

What exactly is the owning id ? How is it created and managed ?
Some magic id or an cryptographic token =

> 1. How are interactions across isolated user namespaces handled?

What kind of interaction do you have in mind ?
Data transfers ? Process manipulaton ? Namespace destruction ?

Can you please illustrate some actual use cases ?

>    Proposal 1.1 semmed prefered since it would allow an unprivileged
>    user creating an isolated user namespace to kill/ptrace all processes
>    in the isolated namespace they spawned. 

Don't we already have this if this user is mapped as root inside the
container ?

>    The first consensus reached seemed to be to decouple isolated user
>    namespaces from shiftfs. The idea is to solely rely on tmpfs and fuse
>    at the beginning as filesystems which can be mounted inside isolated
>    user namespaces and so would have proper ownership. 

So, I'd essentially have to run the whole rootfs through fuse and a
userland fileserver, which probably has to track things like ownerships
in its own db (when running under unprivileged user) ?

> For mount points
>    that originate from outside the namespace, everything will show as
>    the overflow ids and access would be restricted to the most
>    restricted permission bit for any path that can be accessed.

So, I can't just take a btrfs snapshot as rootfs anymore ?


--mtx

-- 
---
Hinweis: unverschlüsselte E-Mails können leicht abgehört und manipuliert
werden ! Für eine vertrauliche Kommunikation senden Sie bitte ihren
GPG/PGP-Schlüssel zu.
---
Enrico Weigelt, metux IT consult
Free software and Linux embedded engineering
info@metux.net -- +49-151-27565287

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: LPC 2020 Hackroom Session: summary and next steps for isolated user namespaces
  2020-10-15 14:27             ` Serge E. Hallyn
@ 2020-10-17 15:04               ` Eric W. Biederman
  0 siblings, 0 replies; 22+ messages in thread
From: Eric W. Biederman @ 2020-10-17 15:04 UTC (permalink / raw)
  To: Serge E. Hallyn
  Cc: Andy Lutomirski, Josh Triplett, Christian Brauner,
	Linux Containers, Alexander Mihalicyn, Mrunal Patel, Wat Lim,
	Aleksa Sarai, Pavel Tikhomirov, Geoffrey Thomas,
	Joseph Christopher Sible, Mickaël Salaün, Vivek Goyal,
	Giuseppe Scrivano, Stephane Graber, Kees Cook, Sargun Dhillon,
	LKML

"Serge E. Hallyn" <serge@hallyn.com> writes:

> On Wed, Oct 14, 2020 at 02:46:46PM -0500, Eric W. Biederman wrote:
>> "Serge E. Hallyn" <serge@hallyn.com> writes:
>> 
>> > On Mon, Oct 12, 2020 at 12:01:09AM -0500, Eric W. Biederman wrote:
>> >> Andy Lutomirski <luto@kernel.org> writes:
>> >> 
>> >> > On Sun, Oct 11, 2020 at 1:53 PM Josh Triplett <josh@joshtriplett.org> wrote:
>> >> >>
>> >> >> On Fri, Oct 09, 2020 at 11:26:06PM -0500, Serge E. Hallyn wrote:
>> >> >> > > 3. Find a way to allow setgroups() in a user namespace while keeping
>> >> >> > >    in mind the case of groups used for negative access control.
>> >> >> > >    This was suggested by Josh Triplett and Geoffrey Thomas. Their idea was to
>> >> >> > >    investigate adding a prctl() to allow setgroups() to be called in a user
>> >> >> > >    namespace at the cost of restricting paths to the most restrictive
>> >> >> > >    permission. So if something is 0707 it needs to be treated as if it's 0000
>> >> >> > >    even though the caller is not in its owning group which is used for negative
>> >> >> > >    access control (how these new semantics will interact with ACLs will also
>> >> >> > >    need to be looked into).
>> >> >> >
>> >> >> > I should probably think this through more, but for this problem, would it
>> >> >> > not suffice to add a new prevgroups grouplist to the struct cred, maybe
>> >> >> > struct group_info *locked_groups, and every time an unprivileged task creates
>> >> >> > a new user namespace, add all its current groups to this list?
>> >> >>
>> >> >> So, effectively, you would be allowed to drop permissions, but
>> >> >> locked_groups would still be checked for restrictions?
>> >> >>
>> >> >> That seems like it'd introduce a new level of complexity (a new facet of
>> >> >> permission) to manage. Not opposed, but it does seem more complex than
>> >> >> just opting out of using groups for negative permissions.
>> >
>> > Yeah, it would, but I basically hoped that we could catch most of this at
>> > e.g. generic_permission(), and/or we could introduce a helper which
>> > automatically adds a check for permission denied from locked_groups, so
>> > it shouldn't be too wide-spread.  If it does end up showing up all over
>> > the place, then that's a good reason not to do this.
>> >
>> >> > Is there any context other than regular UNIX DAC in which groups can
>> >> > act as negative permissions or is this literally just an issue for
>> >> > files with a more restrictive group mode than other mode?
>> >> 
>> >> Just that.
>> >> 
>> >> The ideas kicked around in the conversation were some variant of having
>> >> a sysctl that says "This system never uses groups for negative
>> >> permissions".
>> >> 
>> >> It was also suggested that if the sysctl was set the the permission
>> >> checks would be altered such that even if someone tried to set a
>> >> negative permission, the more liberal permissions of other would be used
>> >> instead.
>> >
>> > So then this would touch all the same code points which the
>> > locked_groups approach would have to touch?
>> 
>> No locked_groups would touch in_group_p and set_groups.  Especially what
>> set_groups means in that context.  It would have to handle what happens
>> when you start accumulating locked groups (because of multiple
>> namespaces).  How you dedup locked groups etc.
>
> Well since group_info is sorted, you should be able to do a pretty
> simple and quick merge of current->locked_groups and
> current->group_info.  I suppose we'd have to consider a nasty user who
> is allocated 100k groups, sticks them all in groupinfo, then unshare
> twice, locking the kernel up for awhile, but that user can already hurt
> us.
>
>> I was not able to convince myself that not being able to clear out
>> groups that a user has when they create a user namespace won't cause
>> other problems.  Especially as user namespaces had been in use for a
>> while at that point.
>
> The locked_groups would *only* be considered for negative acls, right?

I had not seen that idea proposed.  I had assumed they would be
consulted in all cases for group membership in permission checks,
and that the only change would be to in_group_p and the code to
maintain the group lists.

> You would not *grant* any perms based on them.  It seems like exactly
> what you want.  If any user is denied perms on account of it, then that
> was the intent, and that's the whole reason we're having this problem.
> We are discussing whether it's ok to let a new user_ns be a way to
> bypass that restriction - not *looking* for a way to support bypassing
> it.
>
> I could state this as a more formal proof if you like.


If you modify the permission checks as you suggest it does seem easier
to reason about with respect to causing problems.  I would want to call
them denied_groups or something like that in the data structure for
clarity.

Howver there is a big question of how we deal with NFS.  NFS home
directories are currently a bit of a challenge for user namespaces
because the server performs the permission checks, and won't let
the userns root chown files from one user to another.

Negative groups would not be propogated to the NFS server (because they
are something new) and because they fail to propogate would fail to
preserve the property we are trying to preserve.

I think I need to reexamine the NFS issue.

Similarly user space processes outside the user namespace that check
if a process is a group for purposes of denying permissions might run
into the same issue if we had deny only groups.

>> Not supporting negative groups would touch acl_permission and modify it
>> like:
>> 
>>  static int acl_permission_check(struct inode *inode, int mask)
>>  {
>> [irrelveant code snipped]
>>  	/* Only RWX matters for group/other mode bits */
>>  	mask &= 7;
>>  
>>  	/*
>>  	 * Are the group permissions different from
>>  	 * the other permissions in the bits we care
>>  	 * about? Need to check group ownership if so.
>>  	 */
>>  	if (mask & (mode ^ (mode >> 3))) {
>> -		if (in_group_p(inode->i_gid))
>> +		if (in_group_p(inode->i_gid) &&
>> +		    (!sysctl_force_positive_groups ||
>> +		    (mask & ~(mode >> 3)))
>>  			mode >>= 3;
>>  	}
>>  
>>  	/* Bits in 'mode' clear that we require? */
>>  	return (mask & ~mode) ? -EACCES : 0;
>>  }
>> 
>> 
>> I don't know that we need to do that.  But it would might be a good way
>> of flushing out the issues.
>> 
>> 
>> >> Given that creating /etc/subgid is effectively opting out of negative
>> >> permissions already have a sysctl that says that upfront feels like a
>> >> very clean solution.
>> >> 
>> >> Eric
>> >
>> > That feels like a cop-out to me.  If some young admin at Roxxon Corp decides
>> > she needs to run a container, so installs subuid package and sets that sysctl,
>> > how does she know whether or not some previous admin, who has since retired and
>> > did not keep good docs, set things up so that a negative acl is keeping nginx
>> > from reading some supersecret doc?
>> >
>> > Now personally I'm not a great believer in the negative acls so I think the
>> > above is a very unlikely scenario, but if we're going to worry about it, then
>> > we should worry about it :)
>> 
>> There is a different between guaranting we don't break existing setups
>> when a new feature is enabled, and supporting old very rare setups when
>> a new feature is enabled.
>
> If the new feature is enabled by default, then no.  (I know the kernel
> kconfig =n, but these users likely are using a distribution, which
> probably enables it)
>
> And I also argue that if the new feature appears unrelated and is highly
> desirable, then again no.  Everyone these days wants to "enable
> container support" and it seems unrelated to using file permissions the
> way they appear meant to be used (I know I'm contradicting myself here
> :)  It's not like we are stopping support for an old architecture.  It's
> a behavior resulting from a combination of ordinary configuration pieces
> spread throught a system which is hard for a new admin to know is
> actually being used.  I guess let me put it this way:  I don't think
> that when a new admin is hired, she is told "Now remember, we use acls
> to deny file permissions to certain users based on their groupid, watch
> out for that."

I think it is a valid concern.

The question is do enough of those systems exist to matter.  Maybe.

I think our moral imperative is to honor the no regression rule.  You
are right that it is difficult to detect regressions in this area.

So far I have received exactly one report of a system using negative
groups.  That was Casey who was using this for some intel designed
distrubution that used smack.  I think the Nokia phones were playing
with it at the end before Nokia sold their phone division.

Casey still mentions on occasion that he is grumpy at the way setgroups
and user namespaces worked out.


>> > "Click this button if noone has ever used feature X on this server"
>> 
>> My current thinking is that if we already don't honor negative groups
>> when /etc/subgid exists it would not hurt to make that more explicit.
>
> Ok, if that's how everyone feels, then I'll step back.
>
> It was just an idea :)
>
> I prefer, in this case, a simple host-wide sysctl to allow setgroups.

Do you mean that you prefer the sysctl to just allow setgroups in a user
namespace not have the sysctl also disable negative groups?

>> >From what we could tell at the time people that know negative groups are
>> honored much less systems that actually use negative groups are
>> exceedingly rare.
>
> You're probably right, but again I think if you did a survey, many
> people who are using it probably answered no because they either
> forgot or don't think of it in those terms.

Quite possibly.

Still I haven't see any of the come forward and report problems.  Which
seems to indicate that the solution we came up with, with setgroups is
working and has not caused any regressions.

Eric

p.s.  One alternative that sticks slightly closer to what we have today
is to enable setgroups during the login process, like rlimits are set
today.  That has the same consequences for system security as
/etc/subuid and /etc/subgid and like a sysctl should just be a set and
forget option.





^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: LPC 2020 Hackroom Session: summary and next steps for isolated user namespaces
  2020-10-15 15:31 ` Enrico Weigelt, metux IT consult
@ 2020-10-17 16:51   ` Eric W. Biederman
  2020-10-18 10:20     ` Christian Brauner
  2020-10-29 13:42     ` LPC 2020 Hackroom Session: summary and next steps for isolated user namespaces Enrico Weigelt, metux IT consult
  0 siblings, 2 replies; 22+ messages in thread
From: Eric W. Biederman @ 2020-10-17 16:51 UTC (permalink / raw)
  To: Enrico Weigelt, metux IT consult
  Cc: Christian Brauner, containers, Alexander Mihalicyn,
	Giuseppe Scrivano, Joseph Christopher Sible, Kees Cook,
	linux-kernel, Josh Triplett, Andy Lutomirski,
	Mickaël Salaün, Wat Lim, Mrunal Patel,
	Pavel Tikhomirov, Geoffrey Thomas, Serge E. Hallyn

"Enrico Weigelt, metux IT consult" <lkml@metux.net> writes:

> On 30.08.20 16:39, Christian Brauner wrote:
>
> Hi Christian,
>
>> P1. Isolated id mappings can only be guaranteed to be locally isolated.
>>     A container runtime/daemon can only guarantee non-overlapping id mappings
>>     when no other users on the system create containers.
>
> Indeed. But couldn't we just record the mappings in some standardized
> place (eg. some file) which all engines maintain ?
>
> I'd guess other solutions would need changes in the runtimes, too.
>
> Please keep in mind that some scenarios actually need some overlaps, eg.
> application containers that shall have direct access to home dirs.
>
>> P2. Enforcing isolated id mappings in userspace is difficult.
>>     It is always possible to create other processes with overlapping id
>>     mappings. Coordinating id mappings in userspace will always remain
>>     optional. Quite a few tools nowadays (including systemd) don't care about
>>     /etc/sub{g,u}id and actively advise against using it. This is made even
>>     more problematic since sub{g,u}iid delegation is done per-user rather than
>>     per-container-runtime.
>
> I believe subusers aren't meant for tyical containers (like docker or
> lxc), but unprivileged user programs that wanna have further isolation
> for subprocesses (eg. a browser's renderer or js engine).
>
> Correct me if I'm wrong.

There is an on-going trend to make unprivileged containers typical
containers.

>> P3. The range of the id mapping of a container can't be predetermined.
>>     While POSIX mandates that a standard system should use a range of 65536 ids
>>     reality is very different. Some programs allocate high ids for random
>>     processes or for network authentication. This means, in practice it is
>>     often necessary to assign a range of up to 10 million ids to a container.
>>     This limits a system to less than 500 containers total.
>
> In 25+ years, haven't seen such an application in the field. I'd
> consider this a horrible and dangerous bug. Sane applications create
> specific user entries (/etc/passwd) for that.
>
> I'd say we're safe w/ max 2^16 users per container, which should give us
> space for about 2^16 containers.

I forget the details but systemd has a feature where it will randomly
allocate a uid for a service.  Calling them something like temporariy uids.

>> P4. Isolated id mappings severely restrict the number of containers that can be
>>     run on a system.
>>     This ties back to the point about pre-determining the id range of a
>>     container and how large range allocations tend to be on real systems. That
>>     becomes even more relevant when nesting containers.
>
> IMHO, all we need is to maintain a list of active ranges (more precisely
> the 16bit prefixes, just like class B networks ;-)). As said, I'd
> declare the scenario #P3 as invalid and rather fix those few broken
> applications.

Which is /etc/subuid and /etc/subgid, and it was very much inspired from
the same source.

>> P5. Container runtimes cannot reuse overlayfs lower directories if each
>>     container uses isolated ID mappings, leading to either needless storage
>>     overhead (LXD -- though the LXD folks don’t really mind), completely
>>     ignoring the benefits of isolating containers from each other (Docker), or
>>     not using them at all (Kubernetes). (This is a more general issue but bears
>>     repeating since it is closely tied to most userns proposals.)
>
> Indeed. That's IMHO the main problem. We somehow need to map the UIDs.
> Maybe a synthetic filesystem that just does exactly the same uid<->kuid
> translations we're already doing in other places ?
>
>> P6. Rlimits pose a problem for containers that share the same id mapping.
>>     This means containers with overlapping id mappings can DOS each other by
>>     exhausting their rlimits. The reason for this lies with the current
>>     implementation of rlimits -- rlimits are currently tied to users and are
>>     not hierarchically limited like inotify limits are. This is a severe
>>     problem in unprivileged workloads. Eric and others identified that this
>>     issue can be fixed independently of the isolated user namespace proposal.
>
> Is this really an practical isssue, when we're using uid namespaces ?

Very much so.  There are containers who otherwise would use the same uid
range. (AKA they have the same set of users).  But can't because there
are cases like daemons that set their RLIMIT_NPROC to 1.  Because the
daemon knows that user for that daemon will never run any other
processes.

Run two containers with the same mappings and that daemon DOS's itself.

>> S2. Kernel-enforced user namespace isolation.
>>     This means, there is no need for different container runtimes to
>>     collaborate on id ranges with immediate benefits for everyone.
>>     This solves P1 and P2.
>
> Okay, but how to support scenarios where some of the UIDs should
> overlap on purpose ? (eg. mounting some of the host's user homedirs
> into namespaces ?)

Just have a limited number of mappings for the cases that actually need
on-disk storage.  The key idea is adding uids that don't need to be
mapped.  Everything else stays the same.

>> S5. The owning id concept of a user namespace makes monitoring and interacting
>>     with such containers way easier.
>
> What exactly is the owning id ? How is it created and managed ?
> Some magic id or an cryptographic token =

Not a new thing.  Just the user that created the user namespace.
It is suggested to refine the idea so that users that don't map
anywhere show up as the creator of the user namespace.

>> 1. How are interactions across isolated user namespaces handled?
>
> What kind of interaction do you have in mind ?
> Data transfers ? Process manipulaton ? Namespace destruction ?
>
> Can you please illustrate some actual use cases ?
>
>>    Proposal 1.1 semmed prefered since it would allow an unprivileged
>>    user creating an isolated user namespace to kill/ptrace all processes
>>    in the isolated namespace they spawned. 
>
> Don't we already have this if this user is mapped as root inside the
> container ?

I think there were more concerns raised that I think actually exist.
The owner/creator of a user namespace can already manage an container
and send it signals.  That is built into the capability system call.
Nothing needs to change there.

The only real question I see is which uids and gids do we show to
processes that are outside of the user namespace, when the uids and gids
don't map.

>>    The first consensus reached seemed to be to decouple isolated user
>>    namespaces from shiftfs. The idea is to solely rely on tmpfs and fuse
>>    at the beginning as filesystems which can be mounted inside isolated
>>    user namespaces and so would have proper ownership. 
>
> So, I'd essentially have to run the whole rootfs through fuse and a
> userland fileserver, which probably has to track things like ownerships
> in its own db (when running under unprivileged user) ?

The consensus was to start with what is working now.

Users that don't map outside of the user namespace will show up and work
properly in on tmpfs.  Or a fuse implementation of ext4 on top of a
file.

>> For mount points
>>    that originate from outside the namespace, everything will show as
>>    the overflow ids and access would be restricted to the most
>>    restricted permission bit for any path that can be accessed.
>
> So, I can't just take a btrfs snapshot as rootfs anymore ?

Interesting until reading through your commentary I had missed the
proposal to effectively effectively change the permissions to:
((mode >> 3) & (mode >> 6) & mode & 7).

The challenge is that in a permission triple it is possible to set
lower permissions for the owner of the file, or for a specific group,
than for everyone else.

Today we require root permissions to be able to map users and groups in
/proc/<pid>/uid_map and /proc/<pid>/gid_map, and we require root
permissions to be able to drop groups with setgroups.

Now we are discussiong moving to a world where we can use users and
groups that don't map to any other user namespace in uid_map and
gid_map.  It should be completely safe to use those users and groups
except for negative permissions in filesystems.  So a big question is
how do we arrange the system so anyone can use those files without
negative permission causing problems.


I believe it is safe to not limit the owner of a file, as the
owner of a file can always chmode the file and remove any restrictions.
Which is no worse than calling setuid to a different uid.

Which leaves where we have been dealing with the ability to drop groups
with setgroups.

I guess the practical proposal is when the !in_group_p and we are
looking at the other permission.  Treat the permissions as:
((mode >> 3) & mode & 7).  Instead of just (mode & 7).

Which for systems who don't use negative group permissions is a no-op.
So this should not effect your btrfs snapshots at all (unless you use
negative group permissions).

It denies things before we get to an NFS server or other interesting
case so it should work for pretty much everything the kernel deals with.

Userspace repeating permission checks could break.  But that is just a
problem of inconsistency, and will always be a problem.

We could make it more precise as Serge was suggesting with a set of that
were dropped from setgroups, but under the assumption that negative
groups are sufficient rare we can avoid that overhead.

 static int acl_permission_check(struct inode *inode, int mask)
 {
 	unsigned int mode = inode->i_mode;
 
- [irrelevant bits of this function]        
 
 	/* Only RWX matters for group/other mode bits */
 	mask &= 7;
 
 	/*
 	 * Are the group permissions different from
 	 * the other permissions in the bits we care
 	 * about? Need to check group ownership if so.
 	 */
 	if (mask & (mode ^ (mode >> 3))) {
 		if (in_group_p(inode->i_gid))
 			mode >>= 3;
+		/* Use the most restrictive permissions? */
+		else (current->user_ns->flags & USERNS_ALWAYS_DENY_GROUPS)
+			mode &= (mode >> 3);
 	}
 
 	/* Bits in 'mode' clear that we require? */
 	return (mask & ~mode) ? -EACCES : 0;
 }

As I read posix_acl_permission all of the posix acls for groups are
positive permissions.  So I think the only other code that would need to
be updated would be the filesystems that replace generic_permission with
something that doesn't call acl_permission check.

Userspace could then activate this mode with:
	echo "safely_allow" > /proc/<pid>/setgroups

That looks very elegant and simple, and I don't think will cause
problems for anyone.  It might even make sense to make that the default
mode when creating a new user namespace.

I guess we owe this idea to Josh Triplett and Geoffrey Thomas.

Does anyone see any problems with tweaking the permissions this way so
that we can always allow setgroups in a user namespace?

Eric


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: LPC 2020 Hackroom Session: summary and next steps for isolated user namespaces
  2020-10-17 16:51   ` Eric W. Biederman
@ 2020-10-18 10:20     ` Christian Brauner
  2020-10-18 13:05       ` The problem of setgroups and containers Eric W. Biederman
  2020-10-29 13:42     ` LPC 2020 Hackroom Session: summary and next steps for isolated user namespaces Enrico Weigelt, metux IT consult
  1 sibling, 1 reply; 22+ messages in thread
From: Christian Brauner @ 2020-10-18 10:20 UTC (permalink / raw)
  To: Eric W. Biederman, Enrico Weigelt, metux IT consult
  Cc: containers, Alexander Mihalicyn, Giuseppe Scrivano,
	Joseph Christopher Sible, Kees Cook, linux-kernel, Josh Triplett,
	Andy Lutomirski, Mickaël Salaün, Wat Lim, Mrunal Patel,
	Pavel Tikhomirov, Geoffrey Thomas, Serge E. Hallyn

On Sat, Oct 17, 2020 at 11:51:22AM -0500, Eric W. Biederman wrote:
> "Enrico Weigelt, metux IT consult" <lkml@metux.net> writes:
> 
> > On 30.08.20 16:39, Christian Brauner wrote:
> >
> > Hi Christian,
> >
> >> P1. Isolated id mappings can only be guaranteed to be locally isolated.
> >>     A container runtime/daemon can only guarantee non-overlapping id mappings
> >>     when no other users on the system create containers.
> >
> > Indeed. But couldn't we just record the mappings in some standardized
> > place (eg. some file) which all engines maintain ?
> >
> > I'd guess other solutions would need changes in the runtimes, too.
> >
> > Please keep in mind that some scenarios actually need some overlaps, eg.
> > application containers that shall have direct access to home dirs.
> >
> >> P2. Enforcing isolated id mappings in userspace is difficult.
> >>     It is always possible to create other processes with overlapping id
> >>     mappings. Coordinating id mappings in userspace will always remain
> >>     optional. Quite a few tools nowadays (including systemd) don't care about
> >>     /etc/sub{g,u}id and actively advise against using it. This is made even
> >>     more problematic since sub{g,u}iid delegation is done per-user rather than
> >>     per-container-runtime.
> >
> > I believe subusers aren't meant for tyical containers (like docker or
> > lxc), but unprivileged user programs that wanna have further isolation
> > for subprocesses (eg. a browser's renderer or js engine).
> >
> > Correct me if I'm wrong.
> 
> There is an on-going trend to make unprivileged containers typical
> containers.

In general, this is something we all have been collectively pushing on
for years. Our users running LXD run unprivileged containers by default.
The daemon requires you to explicitly request running privileged
containers. All Linux workloads on Chromebooks are LXD-based and are
thus run in fully unprivileged containers so are all workloads on
ppc/arm64/s390x on Travis.
And now we're finally also see more runC based container managers like
Podman/cri-o adopting unprivileged containers too. So this is becoming
more and more common and in the interest of security we have an
obligation to help push for more adoption.

> 
> >> P3. The range of the id mapping of a container can't be predetermined.
> >>     While POSIX mandates that a standard system should use a range of 65536 ids
> >>     reality is very different. Some programs allocate high ids for random
> >>     processes or for network authentication. This means, in practice it is
> >>     often necessary to assign a range of up to 10 million ids to a container.
> >>     This limits a system to less than 500 containers total.
> >
> > In 25+ years, haven't seen such an application in the field. I'd
> > consider this a horrible and dangerous bug. Sane applications create
> > specific user entries (/etc/passwd) for that.
> >
> > I'd say we're safe w/ max 2^16 users per container, which should give us
> > space for about 2^16 containers.
> 
> I forget the details but systemd has a feature where it will randomly
> allocate a uid for a service.  Calling them something like temporariy uids.

and things like ldap, pam, or samba. The number is growing with
applications becoming more security aware. Here's an example from a user
reported bug:

Jun 13 02:05:39 xenial-template sshd[390]: Accepted password for sokoow from 10.21.34.100 port 37532 ssh2
Jun 13 02:05:39 xenial-template sshd[390]: pam_keyinit(sshd:session): Unable to change GID to 99000 temporarily
Jun 13 02:05:39 xenial-template sshd[390]: pam_unix(sshd:session): session opened for user sokoow by (uid=0)
Jun 13 02:05:39 xenial-template sshd[390]: pam_motd(sshd:session): pam_modutil_drop_priv: change_gid failed: Success
Jun 13 02:05:39 xenial-template sshd[390]: pam_motd(sshd:session): Unable to change UID to 10003 temporarily
Jun 13 02:05:39 xenial-template sshd[390]: pam_motd(sshd:session): pam_modutil_regain_priv: called with invalid state
Jun 13 02:05:39 xenial-template sshd[390]: pam_motd(sshd:session): Unable to change UID back to -1
Jun 13 02:05:39 xenial-template sshd[390]: pam_motd(sshd:session): pam_modutil_drop_priv: change_gid failed: Success
Jun 13 02:05:39 xenial-template sshd[390]: pam_motd(sshd:session): Unable to change UID to 10003 temporarily
Jun 13 02:05:39 xenial-template sshd[390]: pam_motd(sshd:session): pam_modutil_regain_priv: called with invalid state
Jun 13 02:05:39 xenial-template sshd[390]: pam_motd(sshd:session): Unable to change UID back to -1
Jun 13 02:05:39 xenial-template sshd[390]: pam_mail(sshd:session): pam_modutil_drop_priv: change_gid failed: Success

Maybe running application containers that problem is not as pressing
immediately but for containers running full systems bug reports
involving high id allocations are pretty common
https://github.com/lxc/lxd/issues/2111

There's nothing wrong with dropping to high ids technically and we can't
really enforce userspace stick to a 65536 range.

> 
> >> P4. Isolated id mappings severely restrict the number of containers that can be
> >>     run on a system.
> >>     This ties back to the point about pre-determining the id range of a
> >>     container and how large range allocations tend to be on real systems. That
> >>     becomes even more relevant when nesting containers.
> >
> > IMHO, all we need is to maintain a list of active ranges (more precisely
> > the 16bit prefixes, just like class B networks ;-)). As said, I'd
> > declare the scenario #P3 as invalid and rather fix those few broken
> > applications.
> 
> Which is /etc/subuid and /etc/subgid, and it was very much inspired from
> the same source.
> 
> >> P5. Container runtimes cannot reuse overlayfs lower directories if each
> >>     container uses isolated ID mappings, leading to either needless storage
> >>     overhead (LXD -- though the LXD folks don’t really mind), completely
> >>     ignoring the benefits of isolating containers from each other (Docker), or
> >>     not using them at all (Kubernetes). (This is a more general issue but bears
> >>     repeating since it is closely tied to most userns proposals.)
> >
> > Indeed. That's IMHO the main problem. We somehow need to map the UIDs.
> > Maybe a synthetic filesystem that just does exactly the same uid<->kuid
> > translations we're already doing in other places ?

I have another concrete proposal that I'm working on with Tycho, Serge,
and Aleksa and input from Seth hoping to have it ready somewhat soonish
after the merge window and I hope we can have a good discussion around
this.

> >
> >> P6. Rlimits pose a problem for containers that share the same id mapping.
> >>     This means containers with overlapping id mappings can DOS each other by
> >>     exhausting their rlimits. The reason for this lies with the current
> >>     implementation of rlimits -- rlimits are currently tied to users and are
> >>     not hierarchically limited like inotify limits are. This is a severe
> >>     problem in unprivileged workloads. Eric and others identified that this
> >>     issue can be fixed independently of the isolated user namespace proposal.
> >
> > Is this really an practical isssue, when we're using uid namespaces ?
> 
> Very much so.  There are containers who otherwise would use the same uid
> range. (AKA they have the same set of users).  But can't because there
> are cases like daemons that set their RLIMIT_NPROC to 1.  Because the
> daemon knows that user for that daemon will never run any other
> processes.
> 
> Run two containers with the same mappings and that daemon DOS's itself.

We have seen this problem and received bug reports about this. It is
especially a concern on shared infrastructure such as on aforementioned
Travis workloads.

> 
> >> S2. Kernel-enforced user namespace isolation.
> >>     This means, there is no need for different container runtimes to
> >>     collaborate on id ranges with immediate benefits for everyone.
> >>     This solves P1 and P2.
> >
> > Okay, but how to support scenarios where some of the UIDs should
> > overlap on purpose ? (eg. mounting some of the host's user homedirs
> > into namespaces ?)
> 
> Just have a limited number of mappings for the cases that actually need
> on-disk storage.  The key idea is adding uids that don't need to be
> mapped.  Everything else stays the same.
> 
> >> S5. The owning id concept of a user namespace makes monitoring and interacting
> >>     with such containers way easier.
> >
> > What exactly is the owning id ? How is it created and managed ?
> > Some magic id or an cryptographic token =
> 
> Not a new thing.  Just the user that created the user namespace.
> It is suggested to refine the idea so that users that don't map
> anywhere show up as the creator of the user namespace.
> 
> >> 1. How are interactions across isolated user namespaces handled?
> >
> > What kind of interaction do you have in mind ?
> > Data transfers ? Process manipulaton ? Namespace destruction ?
> >
> > Can you please illustrate some actual use cases ?
> >
> >>    Proposal 1.1 semmed prefered since it would allow an unprivileged
> >>    user creating an isolated user namespace to kill/ptrace all processes
> >>    in the isolated namespace they spawned. 
> >
> > Don't we already have this if this user is mapped as root inside the
> > container ?
> 
> I think there were more concerns raised that I think actually exist.
> The owner/creator of a user namespace can already manage an container
> and send it signals.  That is built into the capability system call.
> Nothing needs to change there.
> 
> The only real question I see is which uids and gids do we show to
> processes that are outside of the user namespace, when the uids and gids
> don't map.

The idea was to use the overflow uid and gid and to introduce a new
sysctl that would allow an administrator to enforce that processes can't
setuid()/setgid() to the overflow uid or gid. The overflow uid and gid
are already configurable. This really does not seem that different from
seeing a bunch of extremely high-range uid and gids for unprivileged
containers in the ps output as we do today or even looking at a
filesystem from within a user namespace where the ids don't map.

> 
> >>    The first consensus reached seemed to be to decouple isolated user
> >>    namespaces from shiftfs. The idea is to solely rely on tmpfs and fuse
> >>    at the beginning as filesystems which can be mounted inside isolated
> >>    user namespaces and so would have proper ownership. 
> >
> > So, I'd essentially have to run the whole rootfs through fuse and a
> > userland fileserver, which probably has to track things like ownerships
> > in its own db (when running under unprivileged user) ?
> 
> The consensus was to start with what is working now.
> 
> Users that don't map outside of the user namespace will show up and work
> properly in on tmpfs.  Or a fuse implementation of ext4 on top of a
> file.
> 
> >> For mount points
> >>    that originate from outside the namespace, everything will show as
> >>    the overflow ids and access would be restricted to the most
> >>    restricted permission bit for any path that can be accessed.
> >
> > So, I can't just take a btrfs snapshot as rootfs anymore ?
> 
> Interesting until reading through your commentary I had missed the
> proposal to effectively effectively change the permissions to:
> ((mode >> 3) & (mode >> 6) & mode & 7).
> 
> The challenge is that in a permission triple it is possible to set
> lower permissions for the owner of the file, or for a specific group,
> than for everyone else.
> 
> Today we require root permissions to be able to map users and groups in
> /proc/<pid>/uid_map and /proc/<pid>/gid_map, and we require root
> permissions to be able to drop groups with setgroups.
> 
> Now we are discussiong moving to a world where we can use users and
> groups that don't map to any other user namespace in uid_map and
> gid_map.  It should be completely safe to use those users and groups
> except for negative permissions in filesystems.  So a big question is
> how do we arrange the system so anyone can use those files without
> negative permission causing problems.
> 
> 
> I believe it is safe to not limit the owner of a file, as the
> owner of a file can always chmode the file and remove any restrictions.
> Which is no worse than calling setuid to a different uid.
> 
> Which leaves where we have been dealing with the ability to drop groups
> with setgroups.
> 
> I guess the practical proposal is when the !in_group_p and we are
> looking at the other permission.  Treat the permissions as:
> ((mode >> 3) & mode & 7).  Instead of just (mode & 7).
> 
> Which for systems who don't use negative group permissions is a no-op.
> So this should not effect your btrfs snapshots at all (unless you use
> negative group permissions).
> 
> It denies things before we get to an NFS server or other interesting
> case so it should work for pretty much everything the kernel deals with.
> 
> Userspace repeating permission checks could break.  But that is just a
> problem of inconsistency, and will always be a problem.
> 
> We could make it more precise as Serge was suggesting with a set of that
> were dropped from setgroups, but under the assumption that negative
> groups are sufficient rare we can avoid that overhead.

I'm tempted to agree and say that it's safe to assume that they are used
very much. Negative acls have been brought up a couple of times in
related contexts though. One being a potential bug in newgidmap which we
discussed back in
https://bugs.launchpad.net/ubuntu/+source/shadow/+bug/1729357
But I think if we have this under a sysctl as proposed earlier is good
enough.

> 
>  static int acl_permission_check(struct inode *inode, int mask)
>  {
>  	unsigned int mode = inode->i_mode;
>  
> - [irrelevant bits of this function]        
>  
>  	/* Only RWX matters for group/other mode bits */
>  	mask &= 7;
>  
>  	/*
>  	 * Are the group permissions different from
>  	 * the other permissions in the bits we care
>  	 * about? Need to check group ownership if so.
>  	 */
>  	if (mask & (mode ^ (mode >> 3))) {
>  		if (in_group_p(inode->i_gid))
>  			mode >>= 3;
> +		/* Use the most restrictive permissions? */
> +		else (current->user_ns->flags & USERNS_ALWAYS_DENY_GROUPS)
> +			mode &= (mode >> 3);
>  	}
>  
>  	/* Bits in 'mode' clear that we require? */
>  	return (mask & ~mode) ? -EACCES : 0;
>  }
> 
> As I read posix_acl_permission all of the posix acls for groups are
> positive permissions.  So I think the only other code that would need to
> be updated would be the filesystems that replace generic_permission with
> something that doesn't call acl_permission check.
> 
> Userspace could then activate this mode with:
> 	echo "safely_allow" > /proc/<pid>/setgroups
> 
> That looks very elegant and simple, and I don't think will cause
> problems for anyone.  It might even make sense to make that the default
> mode when creating a new user namespace.
> 
> I guess we owe this idea to Josh Triplett and Geoffrey Thomas.
> 
> Does anyone see any problems with tweaking the permissions this way so
> that we can always allow setgroups in a user namespace?

This looks sane and simple. I would still think that making it opt-in
for a few kernel releases might be preferable to just making it the new
default. We can then revisit flipping the default. Advanced enough
container runtimes will quickly pick up on this and can make it the
default for their unprivileged containers if they want to.

Christian

^ permalink raw reply	[flat|nested] 22+ messages in thread

* The problem of setgroups and containers
  2020-10-18 10:20     ` Christian Brauner
@ 2020-10-18 13:05       ` Eric W. Biederman
  2020-10-19  0:15         ` Eric W. Biederman
  0 siblings, 1 reply; 22+ messages in thread
From: Eric W. Biederman @ 2020-10-18 13:05 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Enrico Weigelt, metux IT consult, containers,
	Alexander Mihalicyn, Giuseppe Scrivano, Joseph Christopher Sible,
	Kees Cook, linux-kernel, Josh Triplett, Andy Lutomirski,
	Mickaël Salaün, Wat Lim, Mrunal Patel,
	Pavel Tikhomirov, Geoffrey Thomas, Serge E. Hallyn, linux-api

[ Added linux-api because we are talking about a subtle semantic
  change to the permission checks ]

Christian Brauner <christian.brauner@ubuntu.com> writes:

> On Sat, Oct 17, 2020 at 11:51:22AM -0500, Eric W. Biederman wrote:
>> "Enrico Weigelt, metux IT consult" <lkml@metux.net> writes:
>> 
>> > On 30.08.20 16:39, Christian Brauner wrote:
>> >> For mount points
>> >>    that originate from outside the namespace, everything will show as
>> >>    the overflow ids and access would be restricted to the most
>> >>    restricted permission bit for any path that can be accessed.
>> >
>> > So, I can't just take a btrfs snapshot as rootfs anymore ?
>> 
>> Interesting until reading through your commentary I had missed the
>> proposal to effectively effectively change the permissions to:
>> ((mode >> 3) & (mode >> 6) & mode & 7).
>> 
>> The challenge is that in a permission triple it is possible to set
>> lower permissions for the owner of the file, or for a specific group,
>> than for everyone else.
>> 
>> Today we require root permissions to be able to map users and groups in
>> /proc/<pid>/uid_map and /proc/<pid>/gid_map, and we require root
>> permissions to be able to drop groups with setgroups.
>> 
>> Now we are discussiong moving to a world where we can use users and
>> groups that don't map to any other user namespace in uid_map and
>> gid_map.  It should be completely safe to use those users and groups
>> except for negative permissions in filesystems.  So a big question is
>> how do we arrange the system so anyone can use those files without
>> negative permission causing problems.
>> 
>> 
>> I believe it is safe to not limit the owner of a file, as the
>> owner of a file can always chmode the file and remove any restrictions.
>> Which is no worse than calling setuid to a different uid.
>> 
>> Which leaves where we have been dealing with the ability to drop groups
>> with setgroups.
>> 
>> I guess the practical proposal is when the !in_group_p and we are
>> looking at the other permission.  Treat the permissions as:
>> ((mode >> 3) & mode & 7).  Instead of just (mode & 7).
>> 
>> Which for systems who don't use negative group permissions is a no-op.
>> So this should not effect your btrfs snapshots at all (unless you use
>> negative group permissions).
>> 
>> It denies things before we get to an NFS server or other interesting
>> case so it should work for pretty much everything the kernel deals with.
>> 
>> Userspace repeating permission checks could break.  But that is just a
>> problem of inconsistency, and will always be a problem.
>> 
>> We could make it more precise as Serge was suggesting with a set of that
>> were dropped from setgroups, but under the assumption that negative
>> groups are sufficient rare we can avoid that overhead.
>
> I'm tempted to agree and say that it's safe to assume that they are used
> very much. Negative acls have been brought up a couple of times in
> related contexts though. One being a potential bug in newgidmap which we
> discussed back in
> https://bugs.launchpad.net/ubuntu/+source/shadow/+bug/1729357
> But I think if we have this under a sysctl as proposed earlier is good
> enough.
>
>> 
>>  static int acl_permission_check(struct inode *inode, int mask)
>>  {
>>  	unsigned int mode = inode->i_mode;
>>  
>> - [irrelevant bits of this function]        
>>  
>>  	/* Only RWX matters for group/other mode bits */
>>  	mask &= 7;
>>  
>>  	/*
>>  	 * Are the group permissions different from
>>  	 * the other permissions in the bits we care
>>  	 * about? Need to check group ownership if so.
>>  	 */
>>  	if (mask & (mode ^ (mode >> 3))) {
>>  		if (in_group_p(inode->i_gid))
>>  			mode >>= 3;
>> +		/* Use the most restrictive permissions? */
>> +		else (current->user_ns->flags & USERNS_ALWAYS_DENY_GROUPS)
>> +			mode &= (mode >> 3);
>>  	}
>>  
>>  	/* Bits in 'mode' clear that we require? */
>>  	return (mask & ~mode) ? -EACCES : 0;
>>  }
>> 
>> As I read posix_acl_permission all of the posix acls for groups are
>> positive permissions.  So I think the only other code that would need to
>> be updated would be the filesystems that replace generic_permission with
>> something that doesn't call acl_permission check.
>> 
>> Userspace could then activate this mode with:
>> 	echo "safely_allow" > /proc/<pid>/setgroups
>> 
>> That looks very elegant and simple, and I don't think will cause
>> problems for anyone.  It might even make sense to make that the default
>> mode when creating a new user namespace.
>> 
>> I guess we owe this idea to Josh Triplett and Geoffrey Thomas.
>> 
>> Does anyone see any problems with tweaking the permissions this way so
>> that we can always allow setgroups in a user namespace?
>
> This looks sane and simple. I would still think that making it opt-in
> for a few kernel releases might be preferable to just making it the new
> default. We can then revisit flipping the default. Advanced enough
> container runtimes will quickly pick up on this and can make it the
> default for their unprivileged containers if they want to.

I think we can even do a little bit better than what I proposed above.
The downside of my code is that negtative acls won't work in containers
even if they do today.  (Not that I think negative acls are something to
encourage just that breaking them means we have to deal with the
question: "Does someone care?").

What we can very safely do is limit negative acls to filesystems that
are mounted in the same user namespace.  Like the code below.

 static int acl_permission_check(struct inode *inode, int mask)
 {
 	unsigned int mode = inode->i_mode;
 
- [irrelevant bits of this function]        
 
 	/* Only RWX matters for group/other mode bits */
 	mask &= 7;
 
 	/*
 	 * Are the group permissions different from
 	 * the other permissions in the bits we care
 	 * about? Need to check group ownership if so.
 	 */
 	if (mask & (mode ^ (mode >> 3))) {
 		if (in_group_p(inode->i_gid))
 			mode >>= 3;
+		/*
+		 * In a user namespace groups may have been dropped
+		 * so use the most restrictive permissions.
+		 */
+		else if (current->user_ns != inode->i_sb->user_ns)
+			mode &= (mode >> 3);
 	}
 
 	/* Bits in 'mode' clear that we require? */
 	return (mask & ~mode) ? -EACCES : 0;
 }

I would make the plan that we apply the fully fleshed out version of the
above (aka updating the permission methods that don't use
generic_permission), and then in a following kernel cycle we remove the
restrictions on setgroups because they are no longer needed.

The only possible user breaking issue I can see if a system with
negative acls where the containers rely on having access to the other
permissions for some reason.  If someone finds a system that does this
change would need to be reverted and another plan would need to be
found.  Otherwise I think/hope this is a safe semantic change.

Does anyone see any problems with my further simplification?

Eric

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: The problem of setgroups and containers
  2020-10-18 13:05       ` The problem of setgroups and containers Eric W. Biederman
@ 2020-10-19  0:15         ` Eric W. Biederman
  2020-10-19 20:07           ` [RFC][PATCH] userns: Limit process in a user namespace to what the creator is allowed Eric W. Biederman
  0 siblings, 1 reply; 22+ messages in thread
From: Eric W. Biederman @ 2020-10-19  0:15 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Enrico Weigelt, metux IT consult, containers,
	Alexander Mihalicyn, Giuseppe Scrivano, Joseph Christopher Sible,
	Kees Cook, linux-kernel, Josh Triplett, Andy Lutomirski,
	Mickaël Salaün, Wat Lim, Mrunal Patel,
	Pavel Tikhomirov, Geoffrey Thomas, Serge E. Hallyn, linux-api

ebiederm@xmission.com (Eric W. Biederman) writes:

> [ Added linux-api because we are talking about a subtle semantic
>   change to the permission checks ]
>
> Christian Brauner <christian.brauner@ubuntu.com> writes:
>
>> On Sat, Oct 17, 2020 at 11:51:22AM -0500, Eric W. Biederman wrote:
>>> "Enrico Weigelt, metux IT consult" <lkml@metux.net> writes:
>>> 
>>> > On 30.08.20 16:39, Christian Brauner wrote:
>>> >> For mount points
>>> >>    that originate from outside the namespace, everything will show as
>>> >>    the overflow ids and access would be restricted to the most
>>> >>    restricted permission bit for any path that can be accessed.
>>> >
>>> > So, I can't just take a btrfs snapshot as rootfs anymore ?
>>> 
>>> Interesting until reading through your commentary I had missed the
>>> proposal to effectively effectively change the permissions to:
>>> ((mode >> 3) & (mode >> 6) & mode & 7).
>>> 
>>> The challenge is that in a permission triple it is possible to set
>>> lower permissions for the owner of the file, or for a specific group,
>>> than for everyone else.
>>> 
>>> Today we require root permissions to be able to map users and groups in
>>> /proc/<pid>/uid_map and /proc/<pid>/gid_map, and we require root
>>> permissions to be able to drop groups with setgroups.
>>> 
>>> Now we are discussiong moving to a world where we can use users and
>>> groups that don't map to any other user namespace in uid_map and
>>> gid_map.  It should be completely safe to use those users and groups
>>> except for negative permissions in filesystems.  So a big question is
>>> how do we arrange the system so anyone can use those files without
>>> negative permission causing problems.
>>> 
>>> 
>>> I believe it is safe to not limit the owner of a file, as the
>>> owner of a file can always chmode the file and remove any restrictions.
>>> Which is no worse than calling setuid to a different uid.
>>> 
>>> Which leaves where we have been dealing with the ability to drop groups
>>> with setgroups.
>>> 
>>> I guess the practical proposal is when the !in_group_p and we are
>>> looking at the other permission.  Treat the permissions as:
>>> ((mode >> 3) & mode & 7).  Instead of just (mode & 7).
>>> 
>>> Which for systems who don't use negative group permissions is a no-op.
>>> So this should not effect your btrfs snapshots at all (unless you use
>>> negative group permissions).
>>> 
>>> It denies things before we get to an NFS server or other interesting
>>> case so it should work for pretty much everything the kernel deals with.
>>> 
>>> Userspace repeating permission checks could break.  But that is just a
>>> problem of inconsistency, and will always be a problem.
>>> 
>>> We could make it more precise as Serge was suggesting with a set of that
>>> were dropped from setgroups, but under the assumption that negative
>>> groups are sufficient rare we can avoid that overhead.
>>
>> I'm tempted to agree and say that it's safe to assume that they are used
>> very much. Negative acls have been brought up a couple of times in
>> related contexts though. One being a potential bug in newgidmap which we
>> discussed back in
>> https://bugs.launchpad.net/ubuntu/+source/shadow/+bug/1729357
>> But I think if we have this under a sysctl as proposed earlier is good
>> enough.
>>
>>> 
>>>  static int acl_permission_check(struct inode *inode, int mask)
>>>  {
>>>  	unsigned int mode = inode->i_mode;
>>>  
>>> - [irrelevant bits of this function]        
>>>  
>>>  	/* Only RWX matters for group/other mode bits */
>>>  	mask &= 7;
>>>  
>>>  	/*
>>>  	 * Are the group permissions different from
>>>  	 * the other permissions in the bits we care
>>>  	 * about? Need to check group ownership if so.
>>>  	 */
>>>  	if (mask & (mode ^ (mode >> 3))) {
>>>  		if (in_group_p(inode->i_gid))
>>>  			mode >>= 3;
>>> +		/* Use the most restrictive permissions? */
>>> +		else (current->user_ns->flags & USERNS_ALWAYS_DENY_GROUPS)
>>> +			mode &= (mode >> 3);
>>>  	}
>>>  
>>>  	/* Bits in 'mode' clear that we require? */
>>>  	return (mask & ~mode) ? -EACCES : 0;
>>>  }
>>> 
>>> As I read posix_acl_permission all of the posix acls for groups are
>>> positive permissions.  So I think the only other code that would need to
>>> be updated would be the filesystems that replace generic_permission with
>>> something that doesn't call acl_permission check.
>>> 
>>> Userspace could then activate this mode with:
>>> 	echo "safely_allow" > /proc/<pid>/setgroups
>>> 
>>> That looks very elegant and simple, and I don't think will cause
>>> problems for anyone.  It might even make sense to make that the default
>>> mode when creating a new user namespace.
>>> 
>>> I guess we owe this idea to Josh Triplett and Geoffrey Thomas.
>>> 
>>> Does anyone see any problems with tweaking the permissions this way so
>>> that we can always allow setgroups in a user namespace?
>>
>> This looks sane and simple. I would still think that making it opt-in
>> for a few kernel releases might be preferable to just making it the new
>> default. We can then revisit flipping the default. Advanced enough
>> container runtimes will quickly pick up on this and can make it the
>> default for their unprivileged containers if they want to.
>
> I think we can even do a little bit better than what I proposed above.
> The downside of my code is that negtative acls won't work in containers
> even if they do today.  (Not that I think negative acls are something to
> encourage just that breaking them means we have to deal with the
> question: "Does someone care?").
>
> What we can very safely do is limit negative acls to filesystems that
> are mounted in the same user namespace.  Like the code below.
>
>  static int acl_permission_check(struct inode *inode, int mask)
>  {
>  	unsigned int mode = inode->i_mode;
>  
> - [irrelevant bits of this function]        
>  
>  	/* Only RWX matters for group/other mode bits */
>  	mask &= 7;
>  
>  	/*
>  	 * Are the group permissions different from
>  	 * the other permissions in the bits we care
>  	 * about? Need to check group ownership if so.
>  	 */
>  	if (mask & (mode ^ (mode >> 3))) {
>  		if (in_group_p(inode->i_gid))
>  			mode >>= 3;
> +		/*
> +		 * In a user namespace groups may have been dropped
> +		 * so use the most restrictive permissions.
> +		 */
> +		else if (current->user_ns != inode->i_sb->user_ns)
> +			mode &= (mode >> 3);
>  	}
>  
>  	/* Bits in 'mode' clear that we require? */
>  	return (mask & ~mode) ? -EACCES : 0;
>  }
>
> I would make the plan that we apply the fully fleshed out version of the
> above (aka updating the permission methods that don't use
> generic_permission), and then in a following kernel cycle we remove the
> restrictions on setgroups because they are no longer needed.
>
> The only possible user breaking issue I can see if a system with
> negative acls where the containers rely on having access to the other
> permissions for some reason.  If someone finds a system that does this
> change would need to be reverted and another plan would need to be
> found.  Otherwise I think/hope this is a safe semantic change.
>
> Does anyone see any problems with my further simplification?

Ugh.  I do see a problem.  Not with the approach so much but with my
argument that it is fine to ignore users.

I have just re-read through posix_acl_permission, and the logic is just
like acl_permission_check except that instead of having one user (the
owner of the file) and one group.  There can be the owner of the file
and other users (each with their distinct permissions) followed by the
one or more groups each with their distinct permissions followed by a
mask of maximum permissions followed by permissions for other users.

Which means that we need to take the owner of the user namespace into
account to preserve the invariant that we have no more permissions than
that owner had.

So I am thinking for the other permission check we need to limit the
permissions that are available to based on recursively the owner of the
user namespace and the owner's groups when the user namespace was
created.


Eric

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: LPC 2020 Hackroom Session: summary and next steps for isolated user namespaces
  2020-10-15 14:32           ` Serge E. Hallyn
@ 2020-10-19 12:12             ` Giuseppe Scrivano
  0 siblings, 0 replies; 22+ messages in thread
From: Giuseppe Scrivano @ 2020-10-19 12:12 UTC (permalink / raw)
  To: Serge E. Hallyn
  Cc: Josh Triplett, Christian Brauner, containers,
	Alexander Mihalicyn, Mrunal Patel, Wat Lim, Aleksa Sarai,
	Pavel Tikhomirov, Geoffrey Thomas, Eric W. Biederman,
	Joseph Christopher Sible, Mickaël Salaün, Vivek Goyal,
	Andy Lutomirski, Stephane Graber, Kees Cook, Sargun Dhillon,
	linux-kernel

"Serge E. Hallyn" <serge@hallyn.com> writes:

> On Tue, Oct 13, 2020 at 05:17:36PM +0200, Giuseppe Scrivano wrote:
>> "Serge E. Hallyn" <serge@hallyn.com> writes:
>> 
>> > On Mon, Oct 12, 2020 at 07:05:10PM +0200, Giuseppe Scrivano wrote:
>> >> Josh Triplett <josh@joshtriplett.org> writes:
>> >> 
>> >> > On Fri, Oct 09, 2020 at 11:26:06PM -0500, Serge E. Hallyn wrote:
>> >> >> > 3. Find a way to allow setgroups() in a user namespace while keeping
>> >> >> >    in mind the case of groups used for negative access control.
>> >> >> >    This was suggested by Josh Triplett and Geoffrey Thomas. Their idea was to
>> >> >> >    investigate adding a prctl() to allow setgroups() to be called in a user
>> >> >> >    namespace at the cost of restricting paths to the most restrictive
>> >> >> >    permission. So if something is 0707 it needs to be treated as if it's 0000
>> >> >> >    even though the caller is not in its owning group which is used for negative
>> >> >> >    access control (how these new semantics will interact with ACLs will also
>> >> >> >    need to be looked into).
>> >> >> 
>> >> >> I should probably think this through more, but for this problem, would it
>> >> >> not suffice to add a new prevgroups grouplist to the struct cred, maybe
>> >> >> struct group_info *locked_groups, and every time an unprivileged task creates
>> >> >> a new user namespace, add all its current groups to this list?
>> >> >
>> >> > So, effectively, you would be allowed to drop permissions, but
>> >> > locked_groups would still be checked for restrictions?
>> >> >
>> >> > That seems like it'd introduce a new level of complexity (a new facet of
>> >> > permission) to manage. Not opposed, but it does seem more complex than
>> >> > just opting out of using groups for negative permissions.
>> >> 
>> >> I have played with something similar in the past.  At that time I've
>> >> discussed it only privately with Eric and we agreed it wasn't worth the
>> >> extra complexity:
>> >> 
>> >> https://github.com/giuseppe/linux/commit/7e0701b389c497472d11fab8570c153a414050af
>> >
>> > Hi, you linked the setgroups patch, but do you also have a link to the
>> > attempt which you deemed was not worth it?
>> 
>> it was just part of a private discussion; but was 4 years ago so we can
>> probably revisit and accept the additional complexity since setgroups()
>> is still an issue with user namespaces.
>> 
>> 
>> >> instead of a prctl, I've added a new mode to /proc/PID/setgroups that
>> >> allows setgroups in a userns locking the current gids.
>> >> 
>> >> What do you think about using /proc/PID/setgroups instead of a new
>> >> prctl()?
>> >
>> > It's better than not having it, but two concerns -
>> >
>> > 1. some userspace, especially testsuites, could become confused by the fact
>> > that they can't drop groups no matter how hard they try, since these will all
>> > still show up as regular groups.
>> 
>> I forgot to send a link to a second patch :-) that completes the feature:
>> https://github.com/giuseppe/linux/commit/1c5fe726346b216293a527719e64f34e6297f0c2
>> 
>> When the new mode is used, the gids that are not known in the userns do
>> not show up in userspace.
>
> Ah, right - and of course those gids better not be mapped into the namespace :)
>
> But so, this is the patch you said you agreed was not worth the extra
> complexity?

yes, these two patches are what looked too complex at that time.  The
problem still exists though, we could perhaps reconsider if the
extra-complexity is acceptable to address it.

Regards,
Giuseppe


^ permalink raw reply	[flat|nested] 22+ messages in thread

* [RFC][PATCH] userns: Limit process in a user namespace to what the creator is allowed
  2020-10-19  0:15         ` Eric W. Biederman
@ 2020-10-19 20:07           ` Eric W. Biederman
  2020-10-20 14:11             ` Christian Brauner
  0 siblings, 1 reply; 22+ messages in thread
From: Eric W. Biederman @ 2020-10-19 20:07 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Enrico Weigelt, metux IT consult, containers,
	Alexander Mihalicyn, Giuseppe Scrivano, Joseph Christopher Sible,
	Kees Cook, linux-kernel, Josh Triplett, Andy Lutomirski,
	Mickaël Salaün, Wat Lim, Mrunal Patel,
	Pavel Tikhomirov, Geoffrey Thomas, Serge E. Hallyn, linux-api


Ordinary unix permissions and posix acls have the ability to
expression that processes show uid or gid match have fewer permissions
than processes without matches that use the other permissions.

The fact a root user in a user namespace can call setgroups and setuid
allows these restrictive permissions to be avoided.  To limit the problems
this can cause populationg the the set of uids that can be switched to,
and the set of gids that can be switched to is an operation that requires
priviliege outside of the user namespace.

This restriction is currently being reexamined as it appears that
there is a way to implement uids and gids that do not map outside of
the user namespace.  Such uids and gids would not require privilege
from outside of the usernamespace to use.  So it becomes important to
find a way to allow calling setuid and setgroups in a user namespace
without allowing processes in a user namespace to do more than the
creator of the user namespace.

To that end capture the groups set with setgroups of the creator of a
user_namespace.  Update the affected permission checks to notice when
something is being allowed with other permissions and only allow the
operation if the creator of the user namespace does not have a user or
group match that would disallow the operation.

The goal is to ensure that creating a user namespace and allowing
the user namespace root to setuid and setgroups does not result
in being permitted to do more than before the user namespace was
created, while still supporting explicitly specified users and
groups to have fewer permissions.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
---

So far only generic_permission is covered, but I think this demonstrates
that the goal is achievable.  Preserving negative acls while allowing
setuid and setgroups.

 fs/namei.c                     |  7 +++++
 fs/posix_acl.c                 | 51 ++++++++++++++++++++++++++++++++++
 include/linux/user_namespace.h | 16 +++++++++++
 kernel/user_namespace.c        | 29 +++++++++++++++++++
 4 files changed, 103 insertions(+)

diff --git a/fs/namei.c b/fs/namei.c
index e99e2a9da0f7..ca06bd81d4e4 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -322,6 +322,13 @@ static int acl_permission_check(struct inode *inode, int mask)
 	if (mask & (mode ^ (mode >> 3))) {
 		if (in_group_p(inode->i_gid))
 			mode >>= 3;
+		/*
+		 * Only allow the intersection of what the creator of
+		 * the user namespace is allowed and what everyone is
+		 * allowed.
+		 */
+		else if (userns_in_group_p(inode->i_sb->s_user_ns, inode->i_gid))
+			mode &= (mode >> 3);
 	}
 
 	/* Bits in 'mode' clear that we require? */
diff --git a/fs/posix_acl.c b/fs/posix_acl.c
index 95882b3f5f62..525803f8f70c 100644
--- a/fs/posix_acl.c
+++ b/fs/posix_acl.c
@@ -340,6 +340,53 @@ posix_acl_from_mode(umode_t mode, gfp_t flags)
 }
 EXPORT_SYMBOL(posix_acl_from_mode);
 
+static bool userns_creator_allowed(struct inode *inode,
+				   const struct posix_acl *acl, int want)
+{
+	/* Don't allow access the creator of the user namespace does not have */
+	struct user_namespace *ns = inode->i_sb->s_user_ns;
+	const struct posix_acl_entry *pa, *pe;
+	unsigned short min_perm;
+	bool found = false;
+
+	min_perm = MAY_READ | MAY_WRITE | MAY_EXEC;
+	FOREACH_ACL_ENTRY(pa, acl, pe) {
+                switch(pa->e_tag) {
+                        case ACL_USER_OBJ:
+				/* No need to limit the owner of a file */
+                                break;
+                        case ACL_USER:
+				if (is_userns_creator(ns, pa->e_uid)) {
+					found = true;
+					min_perm &= pa->e_perm;
+				}
+				break;
+                        case ACL_GROUP_OBJ:
+				if (userns_in_group_p(ns, inode->i_gid)) {
+					found = true;
+					min_perm &= pa->e_perm;
+				}
+				break;
+                        case ACL_GROUP:
+				if (userns_in_group_p(ns, pa->e_gid)) {
+					found = true;
+					min_perm &= pa->e_perm;
+				}
+                                break;
+                        case ACL_MASK:
+				if (found)
+					min_perm &= pa->e_perm;
+                                break;
+                        case ACL_OTHER:
+				if (found &&
+				    ((pa->e_perm & want & min_perm) != want))
+					return false;
+				return true;
+                }
+        }
+	return false;
+}
+
 /*
  * Return 0 if current is granted want access to the inode
  * by the acl. Returns -E... otherwise.
@@ -382,6 +429,10 @@ posix_acl_permission(struct inode *inode, const struct posix_acl *acl, int want)
                         case ACL_OTHER:
 				if (found)
 					return -EACCES;
+				else if ((current_user_ns() != inode->i_sb->s_user_ns) &&
+					 ((pa->e_perm & want) == want) &&
+					 userns_creator_allowed(inode, acl, want))
+					return 0;
 				else
 					goto check_perm;
 			default:
diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
index 6ef1c7109fc4..b4bcb49bed7a 100644
--- a/include/linux/user_namespace.h
+++ b/include/linux/user_namespace.h
@@ -62,6 +62,7 @@ struct user_namespace {
 	int			level;
 	kuid_t			owner;
 	kgid_t			group;
+	struct group_info 	*groups;
 	struct ns_common	ns;
 	unsigned long		flags;
 
@@ -137,6 +138,10 @@ extern bool in_userns(const struct user_namespace *ancestor,
 		       const struct user_namespace *child);
 extern bool current_in_userns(const struct user_namespace *target_ns);
 struct ns_common *ns_get_owner(struct ns_common *ns);
+
+bool is_userns_creator(struct user_namespace *ns, kuid_t uid);
+bool userns_in_group_p(struct user_namespace *ns, kgid_t group);
+
 #else
 
 static inline struct user_namespace *get_user_ns(struct user_namespace *ns)
@@ -181,6 +186,17 @@ static inline struct ns_common *ns_get_owner(struct ns_common *ns)
 {
 	return ERR_PTR(-EPERM);
 }
+
+static inline bool is_userns_creator(struct user_namespace *ns, kuid_t uid)
+{
+	return false;
+}
+
+static inline bool userns_in_group_p(struct user_namespace *ns, kgid_t group)
+{
+	return false;
+}
+
 #endif
 
 #endif /* _LINUX_USER_H */
diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
index 87804e0371fe..8a4949a32e36 100644
--- a/kernel/user_namespace.c
+++ b/kernel/user_namespace.c
@@ -117,6 +117,7 @@ int create_user_ns(struct cred *new)
 	ns->level = parent_ns->level + 1;
 	ns->owner = owner;
 	ns->group = group;
+	ns->groups = get_group_info(new->group_info);
 	INIT_WORK(&ns->work, free_user_ns);
 	for (i = 0; i < UCOUNT_COUNTS; i++) {
 		ns->ucount_max[i] = INT_MAX;
@@ -143,6 +144,7 @@ int create_user_ns(struct cred *new)
 	key_put(ns->persistent_keyring_register);
 #endif
 	ns_free_inum(&ns->ns);
+	put_group_info(ns->groups);
 fail_free:
 	kmem_cache_free(user_ns_cachep, ns);
 fail_dec:
@@ -194,6 +196,7 @@ static void free_user_ns(struct work_struct *work)
 		retire_userns_sysctls(ns);
 		key_free_user_ns(ns);
 		ns_free_inum(&ns->ns);
+		put_group_info(ns->groups);
 		kmem_cache_free(user_ns_cachep, ns);
 		dec_user_namespaces(ucounts);
 		ns = parent;
@@ -1317,6 +1320,32 @@ const struct proc_ns_operations userns_operations = {
 	.get_parent	= ns_get_owner,
 };
 
+bool is_userns_creator(struct user_namespace *target_ns, kuid_t uid)
+{
+	struct user_namespace *user_ns = current_user_ns();
+
+	for (user_ns = current_user_ns();
+	     (user_ns != target_ns) && user_ns;
+	     user_ns = user_ns->parent) {
+		if (uid_eq(uid, user_ns->owner))
+			return true;
+	}
+	return false;
+}
+
+bool userns_in_group_p(struct user_namespace *target_ns, kgid_t group)
+{
+	struct user_namespace *user_ns;
+
+	for (user_ns = current_user_ns();
+	     (user_ns != target_ns) && user_ns;
+	     user_ns = user_ns->parent) {
+		if (groups_search(user_ns->groups, group))
+			return true;
+	}
+	return false;
+}
+
 static __init int user_namespaces_init(void)
 {
 	user_ns_cachep = KMEM_CACHE(user_namespace, SLAB_PANIC);
-- 
2.20.1



^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC][PATCH] userns: Limit process in a user namespace to what the creator is allowed
  2020-10-19 20:07           ` [RFC][PATCH] userns: Limit process in a user namespace to what the creator is allowed Eric W. Biederman
@ 2020-10-20 14:11             ` Christian Brauner
  0 siblings, 0 replies; 22+ messages in thread
From: Christian Brauner @ 2020-10-20 14:11 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Enrico Weigelt, metux IT consult, containers,
	Alexander Mihalicyn, Giuseppe Scrivano, Joseph Christopher Sible,
	Kees Cook, linux-kernel, Josh Triplett, Andy Lutomirski,
	Mickaël Salaün, Wat Lim, Mrunal Patel,
	Pavel Tikhomirov, Geoffrey Thomas, Serge E. Hallyn, linux-api

On Mon, Oct 19, 2020 at 03:07:02PM -0500, Eric W. Biederman wrote:
> Ordinary unix permissions and posix acls have the ability to
> expression that processes show uid or gid match have fewer permissions
> than processes without matches that use the other permissions.

I'm stumbling a bit reading that sentence but that may just me parsing
it wrong:

"[...] have the ability to express that processes whose uid or gid match
nonetheless have fewer permissions than processes without matching uid
or gid that use other permissions."

is how I'm understanding this.

> 
> The fact a root user in a user namespace can call setgroups and setuid
> allows these restrictive permissions to be avoided.  To limit the problems
> this can cause populationg the the set of uids that can be switched to,
> and the set of gids that can be switched to is an operation that requires
> priviliege outside of the user namespace.
> 
> This restriction is currently being reexamined as it appears that
> there is a way to implement uids and gids that do not map outside of
> the user namespace.  Such uids and gids would not require privilege
> from outside of the usernamespace to use.  So it becomes important to
> find a way to allow calling setuid and setgroups in a user namespace
> without allowing processes in a user namespace to do more than the
> creator of the user namespace.
> 
> To that end capture the groups set with setgroups of the creator of a
> user_namespace.  Update the affected permission checks to notice when
> something is being allowed with other permissions and only allow the
> operation if the creator of the user namespace does not have a user or
> group match that would disallow the operation.
> 
> The goal is to ensure that creating a user namespace and allowing
> the user namespace root to setuid and setgroups does not result
> in being permitted to do more than before the user namespace was
> created, while still supporting explicitly specified users and
> groups to have fewer permissions.
> 
> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
> ---
> 
> So far only generic_permission is covered, but I think this demonstrates
> that the goal is achievable.  Preserving negative acls while allowing
> setuid and setgroups.

I think that looks good. I have run our test-suite with this patch
applied and it survived no problem so I don't see any regressions for
current use-cases so far.

> 
>  fs/namei.c                     |  7 +++++
>  fs/posix_acl.c                 | 51 ++++++++++++++++++++++++++++++++++
>  include/linux/user_namespace.h | 16 +++++++++++
>  kernel/user_namespace.c        | 29 +++++++++++++++++++
>  4 files changed, 103 insertions(+)
> 
> diff --git a/fs/namei.c b/fs/namei.c
> index e99e2a9da0f7..ca06bd81d4e4 100644
> --- a/fs/namei.c
> +++ b/fs/namei.c
> @@ -322,6 +322,13 @@ static int acl_permission_check(struct inode *inode, int mask)
>  	if (mask & (mode ^ (mode >> 3))) {
>  		if (in_group_p(inode->i_gid))
>  			mode >>= 3;
> +		/*
> +		 * Only allow the intersection of what the creator of
> +		 * the user namespace is allowed and what everyone is
> +		 * allowed.
> +		 */
> +		else if (userns_in_group_p(inode->i_sb->s_user_ns, inode->i_gid))
> +			mode &= (mode >> 3);
>  	}
>  
>  	/* Bits in 'mode' clear that we require? */
> diff --git a/fs/posix_acl.c b/fs/posix_acl.c
> index 95882b3f5f62..525803f8f70c 100644
> --- a/fs/posix_acl.c
> +++ b/fs/posix_acl.c
> @@ -340,6 +340,53 @@ posix_acl_from_mode(umode_t mode, gfp_t flags)
>  }
>  EXPORT_SYMBOL(posix_acl_from_mode);
>  
> +static bool userns_creator_allowed(struct inode *inode,
> +				   const struct posix_acl *acl, int want)
> +{
> +	/* Don't allow access the creator of the user namespace does not have */
> +	struct user_namespace *ns = inode->i_sb->s_user_ns;
> +	const struct posix_acl_entry *pa, *pe;
> +	unsigned short min_perm;
> +	bool found = false;
> +
> +	min_perm = MAY_READ | MAY_WRITE | MAY_EXEC;
> +	FOREACH_ACL_ENTRY(pa, acl, pe) {
> +                switch(pa->e_tag) {
> +                        case ACL_USER_OBJ:
> +				/* No need to limit the owner of a file */
> +                                break;
> +                        case ACL_USER:
> +				if (is_userns_creator(ns, pa->e_uid)) {
> +					found = true;
> +					min_perm &= pa->e_perm;
> +				}
> +				break;
> +                        case ACL_GROUP_OBJ:
> +				if (userns_in_group_p(ns, inode->i_gid)) {
> +					found = true;
> +					min_perm &= pa->e_perm;
> +				}
> +				break;
> +                        case ACL_GROUP:
> +				if (userns_in_group_p(ns, pa->e_gid)) {
> +					found = true;
> +					min_perm &= pa->e_perm;
> +				}
> +                                break;
> +                        case ACL_MASK:
> +				if (found)
> +					min_perm &= pa->e_perm;
> +                                break;
> +                        case ACL_OTHER:
> +				if (found &&
> +				    ((pa->e_perm & want & min_perm) != want))
> +					return false;
> +				return true;
> +                }
> +        }
> +	return false;
> +}
> +
>  /*
>   * Return 0 if current is granted want access to the inode
>   * by the acl. Returns -E... otherwise.
> @@ -382,6 +429,10 @@ posix_acl_permission(struct inode *inode, const struct posix_acl *acl, int want)
>                          case ACL_OTHER:
>  				if (found)
>  					return -EACCES;
> +				else if ((current_user_ns() != inode->i_sb->s_user_ns) &&
> +					 ((pa->e_perm & want) == want) &&
> +					 userns_creator_allowed(inode, acl, want))
> +					return 0;
>  				else
>  					goto check_perm;
>  			default:
> diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
> index 6ef1c7109fc4..b4bcb49bed7a 100644
> --- a/include/linux/user_namespace.h
> +++ b/include/linux/user_namespace.h
> @@ -62,6 +62,7 @@ struct user_namespace {
>  	int			level;
>  	kuid_t			owner;
>  	kgid_t			group;
> +	struct group_info 	*groups;
>  	struct ns_common	ns;
>  	unsigned long		flags;
>  
> @@ -137,6 +138,10 @@ extern bool in_userns(const struct user_namespace *ancestor,
>  		       const struct user_namespace *child);
>  extern bool current_in_userns(const struct user_namespace *target_ns);
>  struct ns_common *ns_get_owner(struct ns_common *ns);
> +
> +bool is_userns_creator(struct user_namespace *ns, kuid_t uid);
> +bool userns_in_group_p(struct user_namespace *ns, kgid_t group);
> +
>  #else
>  
>  static inline struct user_namespace *get_user_ns(struct user_namespace *ns)
> @@ -181,6 +186,17 @@ static inline struct ns_common *ns_get_owner(struct ns_common *ns)
>  {
>  	return ERR_PTR(-EPERM);
>  }
> +
> +static inline bool is_userns_creator(struct user_namespace *ns, kuid_t uid)
> +{
> +	return false;
> +}
> +
> +static inline bool userns_in_group_p(struct user_namespace *ns, kgid_t group)
> +{
> +	return false;
> +}
> +
>  #endif
>  
>  #endif /* _LINUX_USER_H */
> diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
> index 87804e0371fe..8a4949a32e36 100644
> --- a/kernel/user_namespace.c
> +++ b/kernel/user_namespace.c
> @@ -117,6 +117,7 @@ int create_user_ns(struct cred *new)
>  	ns->level = parent_ns->level + 1;
>  	ns->owner = owner;
>  	ns->group = group;
> +	ns->groups = get_group_info(new->group_info);
>  	INIT_WORK(&ns->work, free_user_ns);
>  	for (i = 0; i < UCOUNT_COUNTS; i++) {
>  		ns->ucount_max[i] = INT_MAX;
> @@ -143,6 +144,7 @@ int create_user_ns(struct cred *new)
>  	key_put(ns->persistent_keyring_register);
>  #endif
>  	ns_free_inum(&ns->ns);
> +	put_group_info(ns->groups);
>  fail_free:
>  	kmem_cache_free(user_ns_cachep, ns);
>  fail_dec:
> @@ -194,6 +196,7 @@ static void free_user_ns(struct work_struct *work)
>  		retire_userns_sysctls(ns);
>  		key_free_user_ns(ns);
>  		ns_free_inum(&ns->ns);
> +		put_group_info(ns->groups);
>  		kmem_cache_free(user_ns_cachep, ns);
>  		dec_user_namespaces(ucounts);
>  		ns = parent;
> @@ -1317,6 +1320,32 @@ const struct proc_ns_operations userns_operations = {
>  	.get_parent	= ns_get_owner,
>  };
>  
> +bool is_userns_creator(struct user_namespace *target_ns, kuid_t uid)
> +{
> +	struct user_namespace *user_ns = current_user_ns();
> +
> +	for (user_ns = current_user_ns();
> +	     (user_ns != target_ns) && user_ns;
> +	     user_ns = user_ns->parent) {
> +		if (uid_eq(uid, user_ns->owner))
> +			return true;
> +	}
> +	return false;
> +}
> +
> +bool userns_in_group_p(struct user_namespace *target_ns, kgid_t group)
> +{
> +	struct user_namespace *user_ns;
> +
> +	for (user_ns = current_user_ns();
> +	     (user_ns != target_ns) && user_ns;
> +	     user_ns = user_ns->parent) {
> +		if (groups_search(user_ns->groups, group))
> +			return true;
> +	}
> +	return false;
> +}
> +
>  static __init int user_namespaces_init(void)
>  {
>  	user_ns_cachep = KMEM_CACHE(user_namespace, SLAB_PANIC);
> -- 
> 2.20.1

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: LPC 2020 Hackroom Session: summary and next steps for isolated user namespaces
  2020-10-17 16:51   ` Eric W. Biederman
  2020-10-18 10:20     ` Christian Brauner
@ 2020-10-29 13:42     ` Enrico Weigelt, metux IT consult
  1 sibling, 0 replies; 22+ messages in thread
From: Enrico Weigelt, metux IT consult @ 2020-10-29 13:42 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Christian Brauner, containers, Alexander Mihalicyn,
	Giuseppe Scrivano, Joseph Christopher Sible, Kees Cook,
	linux-kernel, Josh Triplett, Andy Lutomirski,
	Mickaël Salaün, Wat Lim, Mrunal Patel,
	Pavel Tikhomirov, Geoffrey Thomas, Serge E. Hallyn

On 17.10.20 18:51, Eric W. Biederman wrote:

Hi folks,

>> I believe subusers aren't meant for tyical containers (like docker or
>> lxc), but unprivileged user programs that wanna have further isolation
>> for subprocesses (eg. a browser's renderer or js engine).
>>
>> Correct me if I'm wrong.
> 
> There is an on-going trend to make unprivileged containers typical
> containers.

Yes, that's what I hope for :)
But I'm still unsure whether these files really fit into the scenarios
we're currently discussing.

What still puzzles me: we've got several quite different scenarios
related to uid allocation and mapping. Maybe we should first work out,
what they all have in common ?

Some quick examples:

a) arbitrary user wants to run certain programs (eg. daemons) with
   limited privileges (eg. can access only certain resources, eg.
   subdir of is homedir), possibly under some different UID, but still
   have full control over them (signals, strace, ...) - without any
   special help by root.

b) arbitrary user wants to run some programs with different mounts
   (plan9 style) w/o any special help by root. (unprivileged mount_ns
   still needs user_ns, right ?)

c) arbitrary user wants to run some (docker-style) containerized
   GUI application, which needs access to certain files in his homedir,
   just if it would run directly

d) classical container workload (really being root inside it) with
   shared images and possibly shared directories w/ the calling user.

Steps to care of are eg:

* allocate new user-visible UIDs (usually w/ names assigned), either
  permanently or temporarily
* sane mapping between several namespaces (which ones exactly shall
  appear from inside vs outside ?)
* map file system permissions and fs uids

Tricky. How can we decide which mappings an unprivileged user shall be
allowed to do under which circumstances ?

Scanario a) container is running with (parts of) the host fs
            --> we need to make sure it cannot escape and access some
                sensible files
            --> different fs-UID mappings per fs ?
         b) container is running with its own fs image
            --> the image could be entirely under the unprivileged
                user's control (maybe created by him itself)
            --> uids recorded in the fs probably should be exactly those
                visible inside the container

Maybe we should put in a separate UID/permission translation layer
into VFS, which would process different policies (not just plain range
shifting, more possibly more complex translations) depending on the
namespace ?

> I forget the details but systemd has a feature where it will randomly
> allocate a uid for a service.  Calling them something like temporariy uids.

I'd consider this an horrible bug - especially from operating
perspective. As operator, I really want to know what users (uids) I've
got on the system. and what's running under them.
(I never user systemd, for tons of other reasons, anyways)

>> IMHO, all we need is to maintain a list of active ranges (more precisely
>> the 16bit prefixes, just like class B networks ;-)). As said, I'd
>> declare the scenario #P3 as invalid and rather fix those few broken
>> applications.
> 
> Which is /etc/subuid and /etc/subgid, and it was very much inspired from
> the same source.

Why not just moving this into some common daemon or access pattern ?
(outside the kernel)

>> Is this really an practical isssue, when we're using uid namespaces ?
> 
> Very much so.  There are containers who otherwise would use the same uid
> range. (AKA they have the same set of users).  But can't because there
> are cases like daemons that set their RLIMIT_NPROC to 1.  Because the
> daemon knows that user for that daemon will never run any other
> processes.

Just curious: why are these containers (smells like typical server
workloads) running with the same UIDs in the first place ?

Maybe because the lack of proper mapping of fs-uids ? (see above).
Or are there any reasons why they should run oder the same uid.

>>> S2. Kernel-enforced user namespace isolation.
>>>     This means, there is no need for different container runtimes to
>>>     collaborate on id ranges with immediate benefits for everyone.
>>>     This solves P1 and P2.
>>
>> Okay, but how to support scenarios where some of the UIDs should
>> overlap on purpose ? (eg. mounting some of the host's user homedirs
>> into namespaces ?)
> 
> Just have a limited number of mappings for the cases that actually need
> on-disk storage.  The key idea is adding uids that don't need to be
> mapped.  Everything else stays the same.

Okay, but the interesting question becomes: what does not to be mapped,
what not ? How exactly shall find that out in a generic manner ?

I guess your proposal only helps for those UIDs which are really random
allocated - or anything outside the explicitly given ranges, which
(IMHO) now is mapped to -1. Correct ?

Just a weird though: shall we introduce an 'mapping policy' object,
with callbacks for doing the actual translation ? The default one would
be the the current implementation, but we could add others as well.

>> What exactly is the owning id ? How is it created and managed ?
>> Some magic id or an cryptographic token =
> 
> Not a new thing.  Just the user that created the user namespace.
> It is suggested to refine the idea so that users that don't map
> anywhere show up as the creator of the user namespace.

So, an random UID outside the mapped ranges would map to the 'root'
inside the namespace and seen as the user that created the ns outside
it ?

> I think there were more concerns raised that I think actually exist.
> The owner/creator of a user namespace can already manage an container
> and send it signals.  That is built into the capability system call.
> Nothing needs to change there.

Can he do that with all users inside the namespace ?

> The only real question I see is which uids and gids do we show to
> processes that are outside of the user namespace, when the uids and gids
> don't map.

At least they should not overlap with other users within the parent ns,
and they should be unique. If they weren't IDs but actual names (like in
Plan9), we could just add the owning user name as prefix.

>>>    The first consensus reached seemed to be to decouple isolated user
>>>    namespaces from shiftfs. The idea is to solely rely on tmpfs and fuse
>>>    at the beginning as filesystems which can be mounted inside isolated
>>>    user namespaces and so would have proper ownership. 
>>
>> So, I'd essentially have to run the whole rootfs through fuse and a
>> userland fileserver, which probably has to track things like ownerships
>> in its own db (when running under unprivileged user) ?
> 
> The consensus was to start with what is working now.
> 
> Users that don't map outside of the user namespace will show up and work
> properly in on tmpfs.  Or a fuse implementation of ext4 on top of a
> file.

At that point, we could run everything via 9P ... ;-)


--mtx

-- 
---
Hinweis: unverschlüsselte E-Mails können leicht abgehört und manipuliert
werden ! Für eine vertrauliche Kommunikation senden Sie bitte ihren
GPG/PGP-Schlüssel zu.
---
Enrico Weigelt, metux IT consult
Free software and Linux embedded engineering
info@metux.net -- +49-151-27565287

^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, back to index

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-08-30 14:39 LPC 2020 Hackroom Session: summary and next steps for isolated user namespaces Christian Brauner
2020-10-10  4:26 ` Serge E. Hallyn
2020-10-11 20:53   ` Josh Triplett
2020-10-12  0:38     ` Andy Lutomirski
2020-10-12  5:01       ` Eric W. Biederman
2020-10-12 15:00         ` Serge E. Hallyn
2020-10-14 19:46           ` Eric W. Biederman
2020-10-15 14:27             ` Serge E. Hallyn
2020-10-17 15:04               ` Eric W. Biederman
2020-10-12 17:05     ` Giuseppe Scrivano
2020-10-13 12:46       ` Serge E. Hallyn
2020-10-13 15:17         ` Giuseppe Scrivano
2020-10-15 14:32           ` Serge E. Hallyn
2020-10-19 12:12             ` Giuseppe Scrivano
2020-10-15 15:31 ` Enrico Weigelt, metux IT consult
2020-10-17 16:51   ` Eric W. Biederman
2020-10-18 10:20     ` Christian Brauner
2020-10-18 13:05       ` The problem of setgroups and containers Eric W. Biederman
2020-10-19  0:15         ` Eric W. Biederman
2020-10-19 20:07           ` [RFC][PATCH] userns: Limit process in a user namespace to what the creator is allowed Eric W. Biederman
2020-10-20 14:11             ` Christian Brauner
2020-10-29 13:42     ` LPC 2020 Hackroom Session: summary and next steps for isolated user namespaces Enrico Weigelt, metux IT consult

LKML Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/lkml/0 lkml/git/0.git
	git clone --mirror https://lore.kernel.org/lkml/1 lkml/git/1.git
	git clone --mirror https://lore.kernel.org/lkml/2 lkml/git/2.git
	git clone --mirror https://lore.kernel.org/lkml/3 lkml/git/3.git
	git clone --mirror https://lore.kernel.org/lkml/4 lkml/git/4.git
	git clone --mirror https://lore.kernel.org/lkml/5 lkml/git/5.git
	git clone --mirror https://lore.kernel.org/lkml/6 lkml/git/6.git
	git clone --mirror https://lore.kernel.org/lkml/7 lkml/git/7.git
	git clone --mirror https://lore.kernel.org/lkml/8 lkml/git/8.git
	git clone --mirror https://lore.kernel.org/lkml/9 lkml/git/9.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 lkml lkml/ https://lore.kernel.org/lkml \
		linux-kernel@vger.kernel.org
	public-inbox-index lkml

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.linux-kernel


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git