* [RFC] Per-user namespace process accounting
@ 2014-05-29  6:37 ` Marian Marinov
From: Marian Marinov @ 2014-05-29  6:37 UTC
  To: linux-kernel, Eric W. Biederman, LXC development mailing-list,
	Linux Containers

Hello,

I have the following proposition.

The number of currently running processes is accounted in the root user namespace. The problem I'm facing is that multiple
containers in different user namespaces share the process counters.

So if containerX runs 100 processes with UID 99, containerY should have an NPROC limit above 100 in order to execute any
processes with its own UID 99.
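
(For reference, the limit is enforced at fork time against the per-user_struct
process counter, which is shared by every user namespace that maps to the same
kuid. Approximately, from copy_process() in kernel/fork.c of a ~3.14 kernel:)

if (atomic_read(&p->real_cred->user->processes) >=
		task_rlimit(p, RLIMIT_NPROC)) {
	if (p->real_cred->user != INIT_USER &&
	    !capable(CAP_SYS_RESOURCE) && !capable(CAP_SYS_ADMIN))
		goto bad_fork_free;
}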

I know that some of you will tell me that I should not provision all of my containers with the same UID/GID maps, but
this brings another problem.

We are provisioning the containers from a template. The template has a lot of files, 500k and more, and chowning these
causes a lot of I/O and also slows down provisioning considerably.

The other problem is that when we migrate one container from one host machine to another the IDs may be already in use
on the new machine and we need to chown all the files again.

Finally if we use different UID/GID maps we can not do live migration to another node because the UIDs may be already
in use.

So I'm proposing one hack: modify unshare_userns() to allocate a new user_struct for the cred that is created for the
first task creating the user_ns, and free it in exit_creds().
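
Roughly, the idea is the sketch below (untested, not a patch; the helper
alloc_private_user_struct() is hypothetical, since the existing alloc_uid()
would just find the same shared user_struct again in the global uidhash):

/* in create_user_ns()/unshare_userns(), after the new namespace is set up */
struct user_struct *ns_user;

ns_user = alloc_private_user_struct(new->uid);	/* hypothetical: fresh, unhashed user_struct */
if (!ns_user)
	return -ENOMEM;
free_uid(new->user);	/* drop the reference to the globally shared counter */
new->user = ns_user;	/* processes forked in this namespace get counted here */

/* the existing free_uid(cred->user) in the cred destruction path (reached
 * from exit_creds() when the last reference goes away) then releases it */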

Can you please comment on that?

Or suggest a better solution?

Best regards,
Marian


-- 
Marian Marinov
Founder & CEO of 1H Ltd.
Jabber/GTalk: hackman@jabber.org
ICQ: 7556201
Mobile: +359 886 660 270

* Re: [RFC] Per-user namespace process accounting
@ 2014-05-29 10:06     ` Eric W. Biederman
From: Eric W. Biederman @ 2014-05-29 10:06 UTC
  To: Marian Marinov
  Cc: linux-kernel, LXC development mailing-list, Linux Containers

Marian Marinov <mm@1h.com> writes:

> Hello,
>
> I have the following proposition.
>
> The number of currently running processes is accounted in the root user namespace. The problem I'm facing is that multiple
> containers in different user namespaces share the process counters.

That is deliberate.

> So if containerX runs 100 processes with UID 99, containerY should have an NPROC limit above 100 in order to execute any
> processes with its own UID 99.
>
> I know that some of you will tell me that I should not provision all of my containers with the same UID/GID maps, but
> this brings another problem.
>
> We are provisioning the containers from a template. The template has a lot of files, 500k and more, and chowning these
> causes a lot of I/O and also slows down provisioning considerably.
>
> The other problem is that when we migrate one container from one host machine to another the IDs may be already in use
> on the new machine and we need to chown all the files again.

You should have the same uid allocations for all machines in your fleet
as much as possible.   That has been true ever since NFS was invented
and is not new here.  You can avoid the cost of chowning if you untar
your files inside of your user namespace.  You can have different maps
per machine if you are crazy enough to do that.  You can even have
shared uids that you use to share files between containers as long as
none of those files is setuid.  And map those shared files to some kind
of nobody user in your user namespace.

> Finally if we use different UID/GID maps we can not do live migration to another node because the UIDs may be already
> in use.
>
> So I'm proposing one hack: modify unshare_userns() to allocate a new user_struct for the cred that is created for the
> first task creating the user_ns, and free it in exit_creds().

I do not like the idea of having user_structs be per user namespace, and
deliberately made the code not work that way.

> Can you please comment on that?

I have been pondering having some recursive resource limits that are
per user namespace and if all you are worried about are process counts
that might work.  I don't honestly know what makes sense at the moment.

Eric

* Re: [RFC] Per-user namespace process accounting
@ 2014-05-29 10:40         ` Marian Marinov
From: Marian Marinov @ 2014-05-29 10:40 UTC
  To: Eric W. Biederman
  Cc: linux-kernel, LXC development mailing-list, Linux Containers

On 05/29/2014 01:06 PM, Eric W. Biederman wrote:
> Marian Marinov <mm@1h.com> writes:
> 
>> Hello,
>> 
>> I have the following proposition.
>> 
>> The number of currently running processes is accounted in the root user namespace. The problem I'm facing is that
>> multiple containers in different user namespaces share the process counters.
> 
> That is deliberate.

And I understand that very well ;)

> 
>> So if containerX runs 100 processes with UID 99, containerY should have an NPROC limit above 100 in order to execute any
>> processes with its own UID 99.
>> 
>> I know that some of you will tell me that I should not provision all of my containers with the same UID/GID maps,
>> but this brings another problem.
>> 
>> We are provisioning the containers from a template. The template has a lot of files, 500k and more, and chowning
>> these causes a lot of I/O and also slows down provisioning considerably.
>> 
>> The other problem is that when we migrate one container from one host machine to another the IDs may be already
>> in use on the new machine and we need to chown all the files again.
> 
> You should have the same uid allocations for all machines in your fleet as much as possible.   That has been true
> ever since NFS was invented and is not new here.  You can avoid the cost of chowning if you untar your files inside
> of your user namespace.  You can have different maps per machine if you are crazy enough to do that.  You can even
> have shared uids that you use to share files between containers as long as none of those files is setuid.  And map
> those shared files to some kind of nobody user in your user namespace.

We are not using NFS. We are using a shared block storage that offers us snapshots. So provisioning new containers is
extremely cheap and fast. Comparing that with untar is comparing a race car with a Smart. Yes it can be done and no, I
do not believe we should go backwards.

We do not share filesystems between containers, we offer them block devices.

> 
>> Finally if we use different UID/GID maps we can not do live migration to another node because the UIDs may be
>> already in use.
>> 
>> So I'm proposing one hack: modify unshare_userns() to allocate a new user_struct for the cred that is created for
>> the first task creating the user_ns, and free it in exit_creds().
> 
> I do not like the idea of having user_structs be per user namespace, and deliberately made the code not work that
> way.
> 
>> Can you please comment on that?
> 
> I have been pondering having some recursive resource limits that are per user namespace and if all you are worried
> about are process counts that might work.  I don't honestly know what makes sense at the moment.

It seems to me that the only limits (from RLIMIT) that are generally a problem for the namespaces are the number of processes
and pending signals.
This is why I proposed the above modification. However, I'm not sure if the places I have chosen are right, and I'm also
not really convinced that having a per-namespace user_struct is the right approach for the process counter.
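
(The counters in question live in struct user_struct, which is keyed by kuid
and therefore shared by all user namespaces that map to the same id; abridged
from include/linux/sched.h of a ~3.14 kernel:)

struct user_struct {
	atomic_t __count;	/* reference count */
	atomic_t processes;	/* How many processes does this user have? */
	atomic_t sigpending;	/* How many pending signals does this user have? */
	/* ... further per-user counters: files, mq_bytes, locked_shm, ... */
};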

> 
> Eric
> 
Marian

-- 
Marian Marinov
Founder & CEO of 1H Ltd.
Jabber/GTalk: hackman@jabber.org
ICQ: 7556201
Mobile: +359 886 660 270

* Re: [RFC] Per-user namespace process accounting
@ 2014-05-29 15:32             ` Serge Hallyn
From: Serge Hallyn @ 2014-05-29 15:32 UTC
  To: Marian Marinov
  Cc: Eric W. Biederman, Linux Containers, linux-kernel,
	LXC development mailing-list

Quoting Marian Marinov (mm@1h.com):
> 
> On 05/29/2014 01:06 PM, Eric W. Biederman wrote:
> > Marian Marinov <mm@1h.com> writes:
> > 
> >> Hello,
> >> 
> >> I have the following proposition.
> >> 
> >> The number of currently running processes is accounted in the root user namespace. The problem I'm facing is that
> >> multiple containers in different user namespaces share the process counters.
> > 
> > That is deliberate.
> 
> And I understand that very well ;)
> 
> > 
> >> So if containerX runs 100 processes with UID 99, containerY should have an NPROC limit above 100 in order to execute any
> >> processes with its own UID 99.
> >> 
> >> I know that some of you will tell me that I should not provision all of my containers with the same UID/GID maps,
> >> but this brings another problem.
> >> 
> >> We are provisioning the containers from a template. The template has a lot of files, 500k and more, and chowning
> >> these causes a lot of I/O and also slows down provisioning considerably.
> >> 
> >> The other problem is that when we migrate one container from one host machine to another the IDs may be already
> >> in use on the new machine and we need to chown all the files again.
> > 
> > You should have the same uid allocations for all machines in your fleet as much as possible.   That has been true
> > ever since NFS was invented and is not new here.  You can avoid the cost of chowning if you untar your files inside
> > of your user namespace.  You can have different maps per machine if you are crazy enough to do that.  You can even
> > have shared uids that you use to share files between containers as long as none of those files is setuid.  And map
> > those shared files to some kind of nobody user in your user namespace.
> 
> We are not using NFS. We are using a shared block storage that offers us snapshots. So provisioning new containers is
> extremely cheap and fast. Comparing that with untar is comparing a race car with a Smart. Yes it can be done and no, I
> do not believe we should go backwards.
> 
> We do not share filesystems between containers, we offer them block devices.

Yes, this is a real nuisance for openstack style deployments.

One nice solution to this imo would be a very thin stackable filesystem
which does uid shifting, or, better yet, a non-stackable way of shifting
uids at mount.

* Re: [RFC] Per-user namespace process accounting
@ 2014-06-03 17:01               ` Pavel Emelyanov
From: Pavel Emelyanov @ 2014-06-03 17:01 UTC
  To: Serge Hallyn
  Cc: Marian Marinov, Linux Containers, Eric W. Biederman,
	LXC development mailing-list, linux-kernel

On 05/29/2014 07:32 PM, Serge Hallyn wrote:
> Quoting Marian Marinov (mm@1h.com):
>>
>> On 05/29/2014 01:06 PM, Eric W. Biederman wrote:
>>> Marian Marinov <mm@1h.com> writes:
>>>
>>>> Hello,
>>>>
>>>> I have the following proposition.
>>>>
>>>> The number of currently running processes is accounted in the root user namespace. The problem I'm facing is that
>>>> multiple containers in different user namespaces share the process counters.
>>>
>>> That is deliberate.
>>
>> And I understand that very well ;)
>>
>>>
>>>> So if containerX runs 100 processes with UID 99, containerY should have an NPROC limit above 100 in order to execute any
>>>> processes with its own UID 99.
>>>>
>>>> I know that some of you will tell me that I should not provision all of my containers with the same UID/GID maps,
>>>> but this brings another problem.
>>>>
>>>> We are provisioning the containers from a template. The template has a lot of files, 500k and more, and chowning
>>>> these causes a lot of I/O and also slows down provisioning considerably.
>>>>
>>>> The other problem is that when we migrate one container from one host machine to another the IDs may be already
>>>> in use on the new machine and we need to chown all the files again.
>>>
>>> You should have the same uid allocations for all machines in your fleet as much as possible.   That has been true
>>> ever since NFS was invented and is not new here.  You can avoid the cost of chowning if you untar your files inside
>>> of your user namespace.  You can have different maps per machine if you are crazy enough to do that.  You can even
>>> have shared uids that you use to share files between containers as long as none of those files is setuid.  And map
>>> those shared files to some kind of nobody user in your user namespace.
>>
>> We are not using NFS. We are using a shared block storage that offers us snapshots. So provisioning new containers is
>> extremely cheap and fast. Comparing that with untar is comparing a race car with a Smart. Yes it can be done and no, I
>> do not believe we should go backwards.
>>
>> We do not share filesystems between containers, we offer them block devices.
> 
> Yes, this is a real nuisance for openstack style deployments.
> 
> One nice solution to this imo would be a very thin stackable filesystem
> which does uid shifting, or, better yet, a non-stackable way of shifting
> uids at mount.

I vote for non-stackable way too. Maybe on generic VFS level so that filesystems 
don't bother with it. From what I've seen, even simple stacking is quite a challenge.

Thanks,
Pavel

* Re: [RFC] Per-user namespace process accounting
@ 2014-06-03 17:26                   ` Serge Hallyn
From: Serge Hallyn @ 2014-06-03 17:26 UTC
  To: Pavel Emelyanov
  Cc: Marian Marinov, Linux Containers, Eric W. Biederman,
	LXC development mailing-list, linux-kernel

Quoting Pavel Emelyanov (xemul@parallels.com):
> On 05/29/2014 07:32 PM, Serge Hallyn wrote:
> > Quoting Marian Marinov (mm@1h.com):
> >>
> >> On 05/29/2014 01:06 PM, Eric W. Biederman wrote:
> >>> Marian Marinov <mm@1h.com> writes:
> >>>
> >>>> Hello,
> >>>>
> >>>> I have the following proposition.
> >>>>
> >>>> The number of currently running processes is accounted in the root user namespace. The problem I'm facing is that
> >>>> multiple containers in different user namespaces share the process counters.
> >>>
> >>> That is deliberate.
> >>
> >> And I understand that very well ;)
> >>
> >>>
> >>>> So if containerX runs 100 processes with UID 99, containerY should have an NPROC limit above 100 in order to execute any
> >>>> processes with its own UID 99.
> >>>>
> >>>> I know that some of you will tell me that I should not provision all of my containers with the same UID/GID maps,
> >>>> but this brings another problem.
> >>>>
> >>>> We are provisioning the containers from a template. The template has a lot of files, 500k and more, and chowning
> >>>> these causes a lot of I/O and also slows down provisioning considerably.
> >>>>
> >>>> The other problem is that when we migrate one container from one host machine to another the IDs may be already
> >>>> in use on the new machine and we need to chown all the files again.
> >>>
> >>> You should have the same uid allocations for all machines in your fleet as much as possible.   That has been true
> >>> ever since NFS was invented and is not new here.  You can avoid the cost of chowning if you untar your files inside
> >>> of your user namespace.  You can have different maps per machine if you are crazy enough to do that.  You can even
> >>> have shared uids that you use to share files between containers as long as none of those files is setuid.  And map
> >>> those shared files to some kind of nobody user in your user namespace.
> >>
> >> We are not using NFS. We are using a shared block storage that offers us snapshots. So provisioning new containers is
> >> extremely cheap and fast. Comparing that with untar is comparing a race car with a Smart. Yes it can be done and no, I
> >> do not believe we should go backwards.
> >>
> >> We do not share filesystems between containers, we offer them block devices.
> > 
> > Yes, this is a real nuisance for openstack style deployments.
> > 
> > One nice solution to this imo would be a very thin stackable filesystem
> > which does uid shifting, or, better yet, a non-stackable way of shifting
> > uids at mount.
> 
> I vote for non-stackable way too. Maybe on generic VFS level so that filesystems 
> don't bother with it. From what I've seen, even simple stacking is quite a challenge.

Do you have any ideas for how to go about it?  It seems like we'd have
to have separate inodes per mapping for each file, which is why of
course stacking seems "natural" here.

Trying to catch the uid/gid at every kernel-userspace crossing seems
like a design regression from the current userns approach.  I suppose we
could continue in the kuid theme and introduce an iuid/igid for the
in-kernel inode uid/gid owners.  Then allow a user privileged in some
ns to create a new mount associated with a different mapping for any
ids over which he is privileged.
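
To illustrate the shape of that (the kuid_t part below is the existing pattern
from include/linux/uidgid.h; the iuid_t type and its helpers are hypothetical,
nothing like them exists in the kernel today):

/* existing: kernel-internal ids, translated at the kernel/userspace boundary */
typedef struct { uid_t val; } kuid_t;
kuid_t make_kuid(struct user_namespace *from, uid_t uid);
uid_t  from_kuid(struct user_namespace *targ, kuid_t kuid);

/* hypothetical: on-disk inode owners, translated per mount (or per sb)
 * through a mapping set up by a user privileged over those ids */
typedef struct { uid_t val; } iuid_t;
kuid_t kuid_from_iuid(const struct vfsmount *mnt, iuid_t iuid);
iuid_t iuid_from_kuid(const struct vfsmount *mnt, kuid_t kuid);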

* Re: [RFC] Per-user namespace process accounting
@ 2014-06-03 17:39                     ` Pavel Emelyanov
From: Pavel Emelyanov @ 2014-06-03 17:39 UTC
  To: Serge Hallyn
  Cc: Marian Marinov, Linux Containers, Eric W. Biederman,
	LXC development mailing-list, linux-kernel

On 06/03/2014 09:26 PM, Serge Hallyn wrote:
> Quoting Pavel Emelyanov (xemul@parallels.com):
>> On 05/29/2014 07:32 PM, Serge Hallyn wrote:
>>> Quoting Marian Marinov (mm@1h.com):
>>>>
>>>> On 05/29/2014 01:06 PM, Eric W. Biederman wrote:
>>>>> Marian Marinov <mm@1h.com> writes:
>>>>>
>>>>>> Hello,
>>>>>>
>>>>>> I have the following proposition.
>>>>>>
>>>>>> The number of currently running processes is accounted in the root user namespace. The problem I'm facing is that
>>>>>> multiple containers in different user namespaces share the process counters.
>>>>>
>>>>> That is deliberate.
>>>>
>>>> And I understand that very well ;)
>>>>
>>>>>
>>>>>> So if containerX runs 100 processes with UID 99, containerY should have an NPROC limit above 100 in order to execute any
>>>>>> processes with its own UID 99.
>>>>>>
>>>>>> I know that some of you will tell me that I should not provision all of my containers with the same UID/GID maps,
>>>>>> but this brings another problem.
>>>>>>
>>>>>> We are provisioning the containers from a template. The template has a lot of files, 500k and more, and chowning
>>>>>> these causes a lot of I/O and also slows down provisioning considerably.
>>>>>>
>>>>>> The other problem is that when we migrate one container from one host machine to another the IDs may be already
>>>>>> in use on the new machine and we need to chown all the files again.
>>>>>
>>>>> You should have the same uid allocations for all machines in your fleet as much as possible.   That has been true
>>>>> ever since NFS was invented and is not new here.  You can avoid the cost of chowning if you untar your files inside
>>>>> of your user namespace.  You can have different maps per machine if you are crazy enough to do that.  You can even
>>>>> have shared uids that you use to share files between containers as long as none of those files is setuid.  And map
>>>>> those shared files to some kind of nobody user in your user namespace.
>>>>
>>>> We are not using NFS. We are using a shared block storage that offers us snapshots. So provisioning new containers is
>>>> extremely cheap and fast. Comparing that with untar is comparing a race car with a Smart. Yes it can be done and no, I
>>>> do not believe we should go backwards.
>>>>
>>>> We do not share filesystems between containers, we offer them block devices.
>>>
>>> Yes, this is a real nuisance for openstack style deployments.
>>>
>>> One nice solution to this imo would be a very thin stackable filesystem
>>> which does uid shifting, or, better yet, a non-stackable way of shifting
>>> uids at mount.
>>
>> I vote for non-stackable way too. Maybe on generic VFS level so that filesystems 
>> don't bother with it. From what I've seen, even simple stacking is quite a challenge.
> 
> Do you have any ideas for how to go about it?  It seems like we'd have
> to have separate inodes per mapping for each file, which is why of
> course stacking seems "natural" here.

I was thinking about "lightweight mapping" which is simple shifting. Since
we're trying to make this co-work with user-ns mappings, simple uid/gid shift
should be enough. Please, correct me if I'm wrong.

If I'm not, then it looks to be enough to have two per-sb or per-mnt values
for uid and gid shift. Per-mnt for now looks more promising, since container's
FS may be just a bind-mount from shared disk.
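
As a plain user-space illustration of that kind of constant shift (the struct
and helper below are made up for the example, they are not kernel code):

#include <stdio.h>
#include <sys/types.h>

struct mnt_shift {
	uid_t first;	/* first on-disk id covered by this mount's map */
	uid_t shift;	/* offset added when translating disk -> namespace */
	uid_t count;	/* size of the mapped range */
};

/* translate an on-disk uid into the id the container should see */
static uid_t shift_uid_from_disk(const struct mnt_shift *m, uid_t disk)
{
	if (disk >= m->first && disk - m->first < m->count)
		return disk - m->first + m->shift;
	return (uid_t)-1;	/* outside the map: report an overflow/nobody id */
}

int main(void)
{
	/* e.g. template files stored on disk as 0-65535, shown shifted to 100000+ */
	struct mnt_shift m = { .first = 0, .shift = 100000, .count = 65536 };

	printf("disk uid 99 -> %u\n", shift_uid_from_disk(&m, 99));
	printf("disk uid 70000 -> %d\n", (int)shift_uid_from_disk(&m, 70000));
	return 0;
}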

> Trying to catch the uid/gid at every kernel-userspace crossing seems
> like a design regression from the current userns approach.  I suppose we
> could continue in the kuid theme and introduce an iuid/igid for the
> in-kernel inode uid/gid owners.  Then allow a user privileged in some
> ns to create a new mount associated with a different mapping for any
> ids over which he is privileged.

User-space crossing? From my point of view it would be enough if we just turn
the uid/gid read from disk (well, from wherever the FS gets them) into uids that would
match the user-ns's ones; this should cover the VFS layer and related syscalls
only, which is, IIRC, the stat family and chown.

Ouch, and the whole quota engine :\

Thanks,
Pavel

* Re: [RFC] Per-user namespace process accounting
@ 2014-06-03 17:47                         ` Serge Hallyn
From: Serge Hallyn @ 2014-06-03 17:47 UTC
  To: Pavel Emelyanov
  Cc: linux-kernel, Linux Containers, LXC development mailing-list,
	Eric W. Biederman

Quoting Pavel Emelyanov (xemul@parallels.com):
> On 06/03/2014 09:26 PM, Serge Hallyn wrote:
> > Quoting Pavel Emelyanov (xemul@parallels.com):
> >> On 05/29/2014 07:32 PM, Serge Hallyn wrote:
> >>> Quoting Marian Marinov (mm@1h.com):
> >>>>
> >>>> On 05/29/2014 01:06 PM, Eric W. Biederman wrote:
> >>>>> Marian Marinov <mm@1h.com> writes:
> >>>>>
> >>>>>> Hello,
> >>>>>>
> >>>>>> I have the following proposition.
> >>>>>>
> >>>>>> The number of currently running processes is accounted in the root user namespace. The problem I'm facing is that
> >>>>>> multiple containers in different user namespaces share the process counters.
> >>>>>
> >>>>> That is deliberate.
> >>>>
> >>>> And I understand that very well ;)
> >>>>
> >>>>>
> >>>>>> So if containerX runs 100 processes with UID 99, containerY should have an NPROC limit above 100 in order to execute any
> >>>>>> processes with its own UID 99.
> >>>>>>
> >>>>>> I know that some of you will tell me that I should not provision all of my containers with the same UID/GID maps,
> >>>>>> but this brings another problem.
> >>>>>>
> >>>>>> We are provisioning the containers from a template. The template has a lot of files, 500k and more, and chowning
> >>>>>> these causes a lot of I/O and also slows down provisioning considerably.
> >>>>>>
> >>>>>> The other problem is that when we migrate one container from one host machine to another the IDs may be already
> >>>>>> in use on the new machine and we need to chown all the files again.
> >>>>>
> >>>>> You should have the same uid allocations for all machines in your fleet as much as possible.   That has been true
> >>>>> ever since NFS was invented and is not new here.  You can avoid the cost of chowning if you untar your files inside
> >>>>> of your user namespace.  You can have different maps per machine if you are crazy enough to do that.  You can even
> >>>>> have shared uids that you use to share files between containers as long as none of those files is setuid.  And map
> >>>>> those shared files to some kind of nobody user in your user namespace.
> >>>>
> >>>> We are not using NFS. We are using a shared block storage that offers us snapshots. So provisioning new containers is
> >>>> extremely cheap and fast. Comparing that with untar is comparing a race car with a Smart. Yes it can be done and no, I
> >>>> do not believe we should go backwards.
> >>>>
> >>>> We do not share filesystems between containers, we offer them block devices.
> >>>
> >>> Yes, this is a real nuisance for openstack style deployments.
> >>>
> >>> One nice solution to this imo would be a very thin stackable filesystem
> >>> which does uid shifting, or, better yet, a non-stackable way of shifting
> >>> uids at mount.
> >>
> >> I vote for non-stackable way too. Maybe on generic VFS level so that filesystems 
> >> don't bother with it. From what I've seen, even simple stacking is quite a challenge.
> > 
> > Do you have any ideas for how to go about it?  It seems like we'd have
> > to have separate inodes per mapping for each file, which is why of
> > course stacking seems "natural" here.
> 
> I was thinking about "lightweight mapping" which is simple shifting. Since
> we're trying to make this co-work with user-ns mappings, simple uid/gid shift
> should be enough. Please, correct me if I'm wrong.
> 
> If I'm not, then it looks to be enough to have two per-sb or per-mnt values
> for uid and gid shift. Per-mnt for now looks more promising, since container's
> FS may be just a bind-mount from shared disk.

per-sb would work.  per-mnt would as you say be nicer, but I don't see how it
can be done since parts of the vfs which get inodes but no mnt information
would not be able to figure out the shifts.

> > Trying to catch the uid/gid at every kernel-userspace crossing seems
> > like a design regression from the current userns approach.  I suppose we
> > could continue in the kuid theme and introduce an iuid/igid for the
> > in-kernel inode uid/gid owners.  Then allow a user privileged in some
> > ns to create a new mount associated with a different mapping for any
> > ids over which he is privileged.
> 
> User-space crossing? From my point of view it would be enough if we just turn
> the uid/gid read from disk (well, from wherever the FS gets them) into uids that would
> match the user-ns's ones; this should cover the VFS layer and related syscalls
> only, which is, IIRC, the stat family and chown.
> 
> Ouch, and the whole quota engine :\
> 
> Thanks,
> Pavel

* Re: [RFC] Per-user namespace process accounting
  2014-06-03 17:26                   ` Serge Hallyn
@ 2014-06-03 17:54                     ` Eric W. Biederman
From: Eric W. Biederman @ 2014-06-03 17:54 UTC
  To: Serge Hallyn
  Cc: Pavel Emelyanov, Marian Marinov, Linux Containers,
	LXC development mailing-list, linux-kernel

Serge Hallyn <serge.hallyn@ubuntu.com> writes:

> Quoting Pavel Emelyanov (xemul@parallels.com):
>> On 05/29/2014 07:32 PM, Serge Hallyn wrote:
>> > Quoting Marian Marinov (mm@1h.com):
>> >> We are not using NFS. We are using a shared block storage that offers us snapshots. So provisioning new containers is
>> >> extremely cheap and fast. Comparing that with untar is comparing a race car with a Smart. Yes it can be done and no, I
>> >> do not believe we should go backwards.
>> >>
>> >> We do not share filesystems between containers, we offer them block devices.
>> > 
>> > Yes, this is a real nuisance for openstack style deployments.
>> > 
>> > One nice solution to this imo would be a very thin stackable filesystem
>> > which does uid shifting, or, better yet, a non-stackable way of shifting
>> > uids at mount.
>> 
>> I vote for non-stackable way too. Maybe on generic VFS level so that filesystems 
>> don't bother with it. From what I've seen, even simple stacking is quite a challenge.
>
> Do you have any ideas for how to go about it?  It seems like we'd have
> to have separate inodes per mapping for each file, which is why of
> course stacking seems "natural" here.
>
> Trying to catch the uid/gid at every kernel-userspace crossing seems
> like a design regression from the current userns approach.  I suppose we
> could continue in the kuid theme and introduce a iiud/igid for the
> in-kernel inode uid/gid owners.  Then allow a user privileged in some
> ns to create a new mount associated with a different mapping for any
> ids over which he is privileged.

There is a simple solution.

We pick the filesystems we choose to support.
We add privileged mounting in a user namespace.
We create the user and mount namespace.
Global root goes into the target mount namespace with setns and performs
the mounts.

90% of that work is already done.

As long as we don't plan to support XFS (as it XFS likes to expose it's
implementation details to userspace) it should be quite straight
forward.

The permission check change would probably only need to be:


@@ -2180,6 +2245,10 @@ static int do_new_mount(struct path *path, const char *fstype, int flags,
                return -ENODEV;
 
        if (user_ns != &init_user_ns) {
+               if (!(type->fs_flags & FS_UNPRIV_MOUNT) && !capable(CAP_SYS_ADMIN)) {
+                       put_filesystem(type);
+                       return -EPERM;
+               }
                if (!(type->fs_flags & FS_USERNS_MOUNT)) {
                        put_filesystem(type);
                        return -EPERM;


There are also a few funnies with capturing the user namespace of the
filesystem when we perform the mount (in the superblock?), and not
allowing a mount of that same filesystem in a different user namespace.

But as long as the kuid conversions don't measurably slow down the
filesystem when mounted in the initial mount and user namespaces I don't
see how this would be a problem for anyone, and is very little code.


Eric

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC] Per-user namespace process accounting
@ 2014-06-03 17:54                     ` Eric W. Biederman
  0 siblings, 0 replies; 32+ messages in thread
From: Eric W. Biederman @ 2014-06-03 17:54 UTC (permalink / raw)
  To: Serge Hallyn
  Cc: Pavel Emelyanov, Marian Marinov, Linux Containers,
	LXC development mailing-list, linux-kernel

Serge Hallyn <serge.hallyn@ubuntu.com> writes:

> Quoting Pavel Emelyanov (xemul@parallels.com):
>> On 05/29/2014 07:32 PM, Serge Hallyn wrote:
>> > Quoting Marian Marinov (mm@1h.com):
>> >> We are not using NFS. We are using a shared block storage that offers us snapshots. So provisioning new containers is
>> >> extremely cheap and fast. Comparing that with untar is comparing a race car with a Smart. Yes it can be done and no, I
>> >> do not believe we should go backwards.
>> >>
>> >> We do not share filesystems between containers, we offer them block devices.
>> > 
>> > Yes, this is a real nuisance for openstack style deployments.
>> > 
>> > One nice solution to this imo would be a very thin stackable filesystem
>> > which does uid shifting, or, better yet, a non-stackable way of shifting
>> > uids at mount.
>> 
>> I vote for non-stackable way too. Maybe on generic VFS level so that filesystems 
>> don't bother with it. From what I've seen, even simple stacking is quite a challenge.
>
> Do you have any ideas for how to go about it?  It seems like we'd have
> to have separate inodes per mapping for each file, which is why of
> course stacking seems "natural" here.
>
> Trying to catch the uid/gid at every kernel-userspace crossing seems
> like a design regression from the current userns approach.  I suppose we
>> could continue in the kuid theme and introduce an iuid/igid for the
> in-kernel inode uid/gid owners.  Then allow a user privileged in some
> ns to create a new mount associated with a different mapping for any
> ids over which he is privileged.

There is a simple solution.

We pick the filesystems we choose to support.
We add privileged mounting in a user namespace.
We create the user and mount namespace.
Global root goes into the target mount namespace with setns and performs
the mounts.

90% of that work is already done.
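
[Illustrative aside, not from the original thread: a rough userspace sketch of the last step above, with global root joining the container's mount namespace via setns(2) and mounting on its behalf.  The pid, device and target paths are placeholders.]

#define _GNU_SOURCE
#include <fcntl.h>
#include <sched.h>
#include <stdio.h>
#include <sys/mount.h>
#include <unistd.h>

int main(void)
{
        /* placeholder pid of the container's init task */
        int fd = open("/proc/1234/ns/mnt", O_RDONLY);

        if (fd < 0) {
                perror("open");
                return 1;
        }
        if (setns(fd, CLONE_NEWNS) < 0) {       /* join its mount namespace */
                perror("setns");
                return 1;
        }
        close(fd);

        /* now perform the mount there, as global root */
        if (mount("/dev/vg0/ct1", "/mnt", "ext4", 0, NULL) < 0) {
                perror("mount");
                return 1;
        }
        return 0;
}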

As long as we don't plan to support XFS (as XFS likes to expose its
implementation details to userspace) it should be quite straightforward.

The permission check change would probably only need to be:


@@ -2180,6 +2245,10 @@ static int do_new_mount(struct path *path, const char *fstype, int flags,
                return -ENODEV;
 
        if (user_ns != &init_user_ns) {
+               if (!(type->fs_flags & FS_UNPRIV_MOUNT) && !capable(CAP_SYS_ADMIN)) {
+                       put_filesystem(type);
+                       return -EPERM;
+               }
                if (!(type->fs_flags & FS_USERNS_MOUNT)) {
                        put_filesystem(type);
                        return -EPERM;


There are also a few funnies with capturing the user namespace of the
filesystem when we perform the mount (in the superblock?), and not
allowing a mount of that same filesystem in a different user namespace.
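
[Illustrative aside, not from the original thread: one possible shape for "capturing the user namespace ... in the superblock".  sb->s_owner_ns is a hypothetical field used only in this sketch; current_user_ns()/get_user_ns() are existing helpers.]

#include <linux/cred.h>
#include <linux/fs.h>
#include <linux/user_namespace.h>

/* Sketch: remember which userns mounted the sb, refuse reuse elsewhere. */
static int example_capture_owner_ns(struct super_block *sb)
{
        if (!sb->s_owner_ns) {                          /* first mount */
                sb->s_owner_ns = get_user_ns(current_user_ns());
                return 0;
        }
        if (sb->s_owner_ns != current_user_ns())        /* other userns */
                return -EPERM;
        return 0;
}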

But as long as the kuid conversions don't measurably slow down the
filesystem when mounted in the initial mount and user namespaces I don't
see how this would be a problem for anyone, and is very little code.


Eric

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC] Per-user namespace process accounting
@ 2014-06-03 18:18                         ` Eric W. Biederman
  0 siblings, 0 replies; 32+ messages in thread
From: Eric W. Biederman @ 2014-06-03 18:18 UTC (permalink / raw)
  To: Pavel Emelyanov
  Cc: Serge Hallyn, Marian Marinov, Linux Containers,
	LXC development mailing-list, linux-kernel

Pavel Emelyanov <xemul@parallels.com> writes:

> On 06/03/2014 09:26 PM, Serge Hallyn wrote:
>> Quoting Pavel Emelyanov (xemul@parallels.com):
>>> On 05/29/2014 07:32 PM, Serge Hallyn wrote:
>>>> Quoting Marian Marinov (mm@1h.com):
>>>>> -----BEGIN PGP SIGNED MESSAGE-----
>>>>> Hash: SHA1
>>>>>
>>>>> On 05/29/2014 01:06 PM, Eric W. Biederman wrote:
>>>>>> Marian Marinov <mm@1h.com> writes:
>>>>>>
>>>>>>> Hello,
>>>>>>>
>>>>>>> I have the following proposition.
>>>>>>>
>>>>>>> Number of currently running processes is accounted at the root user namespace. The problem I'm facing is that
>>>>>>> multiple containers in different user namespaces share the process counters.
>>>>>>
>>>>>> That is deliberate.
>>>>>
>>>>> And I understand that very well ;)
>>>>>
>>>>>>
>>>>>>> So if containerX runs 100 processes with UID 99, containerY should have an NPROC limit of above 100 in order to execute any
>>>>>>> processes with its own UID 99.
>>>>>>>
>>>>>>> I know that some of you will tell me that I should not provision all of my containers with the same UID/GID maps,
>>>>>>> but this brings another problem.
>>>>>>>
>>>>>>> We are provisioning the containers from a template. The template has a lot of files 500k and more. And chowning
>>>>>>> these causes a lot of I/O and also slows down provisioning considerably.
>>>>>>>
>>>>>>> The other problem is that when we migrate one container from one host machine to another the IDs may be already
>>>>>>> in use on the new machine and we need to chown all the files again.
>>>>>>
>>>>>> You should have the same uid allocations for all machines in your fleet as much as possible.   That has been true
>>>>>> ever since NFS was invented and is not new here.  You can avoid the cost of chowning if you untar your files inside
>>>>>> of your user namespace.  You can have different maps per machine if you are crazy enough to do that.  You can even
>>>>>> have shared uids that you use to share files between containers as long as none of those files is setuid.  And map
>>>>>> those shared files to some kind of nobody user in your user namespace.
>>>>>
>>>>> We are not using NFS. We are using a shared block storage that offers us snapshots. So provisioning new containers is
>>>>> extremely cheap and fast. Comparing that with untar is comparing a race car with a Smart. Yes it can be done and no, I
>>>>> do not believe we should go backwards.
>>>>>
>>>>> We do not share filesystems between containers, we offer them block devices.
>>>>
>>>> Yes, this is a real nuisance for openstack style deployments.
>>>>
>>>> One nice solution to this imo would be a very thin stackable filesystem
>>>> which does uid shifting, or, better yet, a non-stackable way of shifting
>>>> uids at mount.
>>>
>>> I vote for non-stackable way too. Maybe on generic VFS level so that filesystems 
>>> don't bother with it. From what I've seen, even simple stacking is quite a challenge.
>> 
>> Do you have any ideas for how to go about it?  It seems like we'd have
>> to have separate inodes per mapping for each file, which is why of
>> course stacking seems "natural" here.
>
> I was thinking about "lightweight mapping" which is simple shifting. Since
> we're trying to make this co-work with user-ns mappings, simple uid/gid shift
> should be enough. Please, correct me if I'm wrong.
>
> If I'm not, then it looks to be enough to have two per-sb or per-mnt values
> for uid and gid shift. Per-mnt for now looks more promising, since container's
> FS may be just a bind-mount from shared disk.
>
>> Trying to catch the uid/gid at every kernel-userspace crossing seems
>> like a design regression from the current userns approach.  I suppose we
>> could continue in the kuid theme and introduce an iuid/igid for the
>> in-kernel inode uid/gid owners.  Then allow a user privileged in some
>> ns to create a new mount associated with a different mapping for any
>> ids over which he is privileged.
>
> User-space crossing? From my point of view it would be enough if we just turn
> uid/gid read from disk (well, from wherever the FS gets them) into uids that would
> match the user-ns's ones; this should cover the VFS layer and related syscalls
> only, which is, IIRC, the stat family and chown.
>
> Ouch, and the whole quota engine :\

And posix acls.

But all of this is 90% done already.  I think today we just have
conversions to the initial user namespace. We just need a few tweaks to
allow it and a per superblock user namespace setting.
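
[Illustrative aside, not from the original thread: roughly what a "per superblock user namespace setting" would change at the conversion points.  "sb_ns" stands in for that hypothetical setting; make_kuid() and uid_valid() are existing helpers.]

#include <linux/uidgid.h>
#include <linux/user_namespace.h>

/* Sketch: map a raw on-disk uid through the superblock's user namespace
 * instead of always through init_user_ns. */
static kuid_t example_disk_uid_to_kuid(struct user_namespace *sb_ns,
                                       uid_t raw_uid)
{
        kuid_t kuid = make_kuid(sb_ns, raw_uid);

        /* callers must cope with ids that have no mapping in sb_ns */
        return uid_valid(kuid) ? kuid : INVALID_UID;
}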

Eric


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC] Per-user namespace process accounting
@ 2014-06-03 21:39                         ` Marian Marinov
  0 siblings, 0 replies; 32+ messages in thread
From: Marian Marinov @ 2014-06-03 21:39 UTC (permalink / raw)
  To: Eric W. Biederman, Serge Hallyn
  Cc: Pavel Emelyanov, Linux Containers, LXC development mailing-list,
	linux-kernel

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 06/03/2014 08:54 PM, Eric W. Biederman wrote:
> Serge Hallyn <serge.hallyn@ubuntu.com> writes:
> 
>> Quoting Pavel Emelyanov (xemul@parallels.com):
>>> On 05/29/2014 07:32 PM, Serge Hallyn wrote:
>>>> Quoting Marian Marinov (mm@1h.com):
>>>>> We are not using NFS. We are using a shared block storage that offers us snapshots. So provisioning new
> >>>>> containers is extremely cheap and fast. Comparing that with untar is comparing a race car with a Smart. Yes
>>>>> it can be done and no, I do not believe we should go backwards.
>>>>> 
>>>>> We do not share filesystems between containers, we offer them block devices.
>>>> 
>>>> Yes, this is a real nuisance for openstack style deployments.
>>>> 
>>>> One nice solution to this imo would be a very thin stackable filesystem which does uid shifting, or, better
>>>> yet, a non-stackable way of shifting uids at mount.
>>> 
>>> I vote for non-stackable way too. Maybe on generic VFS level so that filesystems don't bother with it. From
>>> what I've seen, even simple stacking is quite a challenge.
>> 
>> Do you have any ideas for how to go about it?  It seems like we'd have to have separate inodes per mapping for
>> each file, which is why of course stacking seems "natural" here.
>> 
>> Trying to catch the uid/gid at every kernel-userspace crossing seems like a design regression from the current
> >> userns approach.  I suppose we could continue in the kuid theme and introduce an iuid/igid for the in-kernel inode
>> uid/gid owners.  Then allow a user privileged in some ns to create a new mount associated with a different
>> mapping for any ids over which he is privileged.
> 
> There is a simple solution.
> 
> We pick the filesystems we choose to support. We add privileged mounting in a user namespace. We create the user
> and mount namespace. Global root goes into the target mount namespace with setns and performs the mounts.
> 
> 90% of that work is already done.
> 
> > As long as we don't plan to support XFS (as XFS likes to expose its implementation details to userspace) it
> > should be quite straightforward.
> 
> The permission check change would probably only need to be:
> 
> 
> @@ -2180,6 +2245,10 @@ static int do_new_mount(struct path *path, const char *fstype, int flags, return -ENODEV;
> 
> if (user_ns != &init_user_ns) { +               if (!(type->fs_flags & FS_UNPRIV_MOUNT) && !capable(CAP_SYS_ADMIN))
> { +                       put_filesystem(type); +                       return -EPERM; +               } if
> (!(type->fs_flags & FS_USERNS_MOUNT)) { put_filesystem(type); return -EPERM;
> 
> 
> There are also a few funnies with capturing the user namespace of the filesystem when we perform the mount (in the
> superblock?), and not allowing a mount of that same filesystem in a different user namespace.
> 
> But as long as the kuid conversions don't measurably slow down the filesystem when mounted in the initial mount and
> user namespaces I don't see how this would be a problem for anyone, and is very little code.
> 

This may solve one of the problems, but it does not solve the issue with UID/GID maps that overlap in different user
namespaces.
In our case, this means breaking container migration mechanisms.

Will this be addressed at all, or am I the only one here with this sort of requirement?

Marian


> 
> Eric
> 


- -- 
Marian Marinov
Founder & CEO of 1H Ltd.
Jabber/GTalk: hackman@jabber.org
ICQ: 7556201
Mobile: +359 886 660 270
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)

iEYEARECAAYFAlOOQIgACgkQ4mt9JeIbjJRUWQCgsp/dN0WBy9iLJmsjO8KB+Bin
HiIAoIkm8TlcJr4UnbJOAHoYgPVHhg4P
=B9xA
-----END PGP SIGNATURE-----

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC] Per-user namespace process accounting
@ 2014-06-07 21:39                         ` James Bottomley
  0 siblings, 0 replies; 32+ messages in thread
From: James Bottomley @ 2014-06-07 21:39 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Serge Hallyn, linux-kernel, Linux Containers,
	LXC development mailing-list

On Tue, 2014-06-03 at 10:54 -0700, Eric W. Biederman wrote:
> Serge Hallyn <serge.hallyn@ubuntu.com> writes:
> 
> > Quoting Pavel Emelyanov (xemul@parallels.com):
> >> On 05/29/2014 07:32 PM, Serge Hallyn wrote:
> >> > Quoting Marian Marinov (mm@1h.com):
> >> >> We are not using NFS. We are using a shared block storage that offers us snapshots. So provisioning new containers is
> >> >> extremely cheap and fast. Comparing that with untar is comparing a race car with a Smart. Yes it can be done and no, I
> >> >> do not believe we should go backwards.
> >> >>
> >> >> We do not share filesystems between containers, we offer them block devices.
> >> > 
> >> > Yes, this is a real nuisance for openstack style deployments.
> >> > 
> >> > One nice solution to this imo would be a very thin stackable filesystem
> >> > which does uid shifting, or, better yet, a non-stackable way of shifting
> >> > uids at mount.
> >> 
> >> I vote for non-stackable way too. Maybe on generic VFS level so that filesystems 
> >> don't bother with it. From what I've seen, even simple stacking is quite a challenge.
> >
> > Do you have any ideas for how to go about it?  It seems like we'd have
> > to have separate inodes per mapping for each file, which is why of
> > course stacking seems "natural" here.
> >
> > Trying to catch the uid/gid at every kernel-userspace crossing seems
> > like a design regression from the current userns approach.  I suppose we
> > could continue in the kuid theme and introduce an iuid/igid for the
> > in-kernel inode uid/gid owners.  Then allow a user privileged in some
> > ns to create a new mount associated with a different mapping for any
> > ids over which he is privileged.
> 
> There is a simple solution.
> 
> We pick the filesystems we choose to support.
> We add privileged mounting in a user namespace.
> We create the user and mount namespace.
> Global root goes into the target mount namespace with setns and performs
> the mounts.
> 
> 90% of that work is already done.
> 
> As long as we don't plan to support XFS (as XFS likes to expose its
> implementation details to userspace) it should be quite straightforward.

Any implementation which doesn't support XFS is unviable from a distro
point of view.  The whole reason we're fighting to get USER_NS enabled
in distros goes back to lack of XFS support (they basically refused to
turn it on until it wasn't a choice between XFS and USER_NS).  If we put
them in a position where they have to choose between a namespace feature and
XFS, they'll choose XFS.

XFS developers aren't unreasonable ... they'll help if we ask.  I mean
it was them who eventually helped us get USER_NS turned on in the first
place.

James



^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC] Per-user namespace process accounting
@ 2014-06-08  3:25                             ` Eric W. Biederman
  0 siblings, 0 replies; 32+ messages in thread
From: Eric W. Biederman @ 2014-06-08  3:25 UTC (permalink / raw)
  To: James Bottomley
  Cc: Serge Hallyn, linux-kernel, Linux Containers,
	LXC development mailing-list

James Bottomley <James.Bottomley@HansenPartnership.com> writes:

> On Tue, 2014-06-03 at 10:54 -0700, Eric W. Biederman wrote:
>> 
>> 90% of that work is already done.
>> 
>> As long as we don't plan to support XFS (as XFS likes to expose its
>> implementation details to userspace) it should be quite straightforward.
>
> Any implementation which doesn't support XFS is unviable from a distro
> point of view.  The whole reason we're fighting to get USER_NS enabled
> in distros goes back to lack of XFS support (they basically refused to
> turn it on until it wasn't a choice between XFS and USER_NS).  If we put
> them in a position where they have to choose between a namespace feature and
> XFS, they'll choose XFS.

This isn't the same dichotomy.  This is a simple case of not being able
to use XFS mounted inside of a user namespace.  Which does not cause any
regression from the current use cases.  The previous case was that XFS
would not build at all.

There were valid technical reasons, but part of the reason the XFS
conversion took so long was my social engineering of the distros to not
enable the latest bling until there was a chance for the initial crop of
bugs to be fixed.

> XFS developers aren't unreasonable ... they'll help if we ask.  I mean
> it was them who eventually helped us get USER_NS turned on in the first
> place.

Fair enough.  But XFS is not the place to start.

For most filesystems the only really hard part is finding the handful of
places where we actually need some form of error handling when on-disk
uids don't map to in-core kuids, which ultimately should wind up as
maybe a 20-line patch for most filesystems.
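
[Illustrative aside, not from the original thread: a guess at what one of those "handful of places" might look like, failing cleanly when an on-disk id has no mapping.  The function name is invented; make_kuid()/make_kgid()/uid_valid()/gid_valid() are existing helpers.]

#include <linux/errno.h>
#include <linux/uidgid.h>
#include <linux/user_namespace.h>

/* Sketch: refuse (rather than mis-assign) ids that cannot be represented
 * in the namespace the filesystem was mounted with. */
static int example_map_disk_ids(struct user_namespace *fs_ns,
                                uid_t disk_uid, gid_t disk_gid,
                                kuid_t *uid, kgid_t *gid)
{
        *uid = make_kuid(fs_ns, disk_uid);
        *gid = make_kgid(fs_ns, disk_gid);
        if (!uid_valid(*uid) || !gid_valid(*gid))
                return -EOVERFLOW;
        return 0;
}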

For XFS there are two large obstacles to overcome. 

- XFS journal replay does not work when the XFS filesystem is moved from
  a host with one combination of wordsize and endianness to a host with
  a different combination of wordsize and endianness.  This makes XFS a
  bad choice of a filesystem to move between hosts in a sparse file.
  Every other filesystem in the kernel handles this better.

- The XFS code base has the largest number of ioctls of any filesystem in
  the Linux kernel.  This increases the amount of code that has to be
  converted.  Combine that with the fact that the XFS developers chose to
  convert from kuids and kgids at the VFS<->FS layer instead of at the
  FS<->disk layer (see the sketch below), and it becomes quite easy to miss
  changing code in an ioctl or a quota check by accident.  All of which
  adds up to the fact that converting XFS to be mountable with a non-1-1
  mapping of filesystem uids and system kuids is going to be a lot more
  than a simple 20-line patch.
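
[Illustrative aside, not from the original thread: what converting at the FS<->disk layer means in practice - the filesystem's own raw-inode reader does the conversion exactly once, so ioctl and quota paths above it never see unconverted ids.  "examplefs" and its on-disk layout are invented for the sketch.]

#include <linux/fs.h>
#include <linux/uidgid.h>
#include <linux/user_namespace.h>

struct examplefs_raw_inode {
        __le32 i_uid;
        __le32 i_gid;
};

/* Sketch: ids leave their raw on-disk form right here and nowhere else. */
static void examplefs_read_ids(struct user_namespace *fs_ns,
                               const struct examplefs_raw_inode *raw,
                               struct inode *inode)
{
        inode->i_uid = make_kuid(fs_ns, le32_to_cpu(raw->i_uid));
        inode->i_gid = make_kgid(fs_ns, le32_to_cpu(raw->i_gid));
}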

All of that said, what becomes attractive about this approach is that it
gets us to the point where people can ask questions about mounting
normal filesystems unprivileged, and the only reasons it won't be
allowed are the lack of block devices to mount from and concern that the
filesystem error handling code is not sufficient to ward off evil users
who create bad filesystem images.

Eric



^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC] Per-user namespace process accounting
@ 2014-06-12 14:37     ` Alin Dobre
  0 siblings, 0 replies; 32+ messages in thread
From: Alin Dobre @ 2014-06-12 14:37 UTC (permalink / raw)
  To: linux-kernel; +Cc: containers, lxc-devel

On 29/05/14 07:37, Marian Marinov wrote:
> Hello,
> 
> I have the following proposition.
> 
> Number of currently running processes is accounted at the root user namespace. The problem I'm facing is that multiple
> containers in different user namespaces share the process counters.
> 
> So if containerX runs 100 processes with UID 99, containerY should have an NPROC limit of above 100 in order to execute any
> processes with its own UID 99.
> 
> I know that some of you will tell me that I should not provision all of my containers with the same UID/GID maps, but
> this brings another problem.

If this matters, we also suffer from the same problem here, so we would
support any implementation that addresses it.

Cheers,
Alin.


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [lxc-devel] [RFC] Per-user namespace process accounting
@ 2014-06-12 15:08         ` Serge Hallyn
  0 siblings, 0 replies; 32+ messages in thread
From: Serge Hallyn @ 2014-06-12 15:08 UTC (permalink / raw)
  To: LXC development mailing-list; +Cc: containers, linux-kernel

Quoting Alin Dobre (alin.dobre@elastichosts.com):
> On 29/05/14 07:37, Marian Marinov wrote:
> > Hello,
> > 
> > I have the following proposition.
> > 
> > Number of currently running processes is accounted at the root user namespace. The problem I'm facing is that multiple
> > containers in different user namespaces share the process counters.

Most people here are probably aware of this, but the previous, never-completed
user namespace implementation provided this and only this.  We (mostly Eric
and I) spent years looking for clean ways to make that implementation, which
had some advantages (including this one), complete.  We did have a few POCs
which worked but were unsatisfying.  The two things which were never convincing
were (a) conversion of all uid checks to be namespace-safe, and (b) storing
namespace identifiers on disk.  (As I say we did have solutions to these, but
not satisfying ones).  These are the two things which the new implementation
addresses *beautifully*.

> > So if containerX runs 100 processes with UID 99, containerY should have an NPROC limit of above 100 in order to execute any
> > processes with its own UID 99.
> > 
> > I know that some of you will tell me that I should not provision all of my containers with the same UID/GID maps, but
> > this brings another problem.
> 
> If this matters, we also suffer from the same problem here. So we
> support any implementation that would address it.

ISTM the only reasonable answer here (at least for now) is to make it more
convenient to isolate uid ranges, by providing a way to shift uids at mount
time as has been discussed a bit.
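
[Illustrative aside, not from the original thread: roughly how "a way to shift uids at mount time" might look from userspace.  The "uidoffset"/"gidoffset" mount options are purely hypothetical - no kernel accepts them - and the paths are placeholders.]

#include <stdio.h>
#include <sys/mount.h>

int main(void)
{
        /* hypothetical: ask the kernel to shift all ids in this mount
         * up by 100000 */
        if (mount("/dev/vg0/ct1", "/var/lib/lxc/ct1/rootfs", "ext4", 0,
                  "uidoffset=100000,gidoffset=100000") < 0) {
                perror("mount");
                return 1;
        }
        return 0;
}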

If we go down the route of talking about uid 99 in ns 1 vs uid 99 in ns 2,
then people will also expect isolation at file access time, and we're back
to all the disadvantages of the first userns implementation.

(If someone proves me wrong by suggesting a clean solution, then awesome)

-serge

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC] Per-user namespace process accounting
@ 2014-06-23  4:07                             ` Serge E. Hallyn
  0 siblings, 0 replies; 32+ messages in thread
From: Serge E. Hallyn @ 2014-06-23  4:07 UTC (permalink / raw)
  To: Marian Marinov
  Cc: Eric W. Biederman, Serge Hallyn, Linux Containers, linux-kernel,
	LXC development mailing-list

Quoting Marian Marinov (mm@1h.com):
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
> 
> On 06/03/2014 08:54 PM, Eric W. Biederman wrote:
> > Serge Hallyn <serge.hallyn@ubuntu.com> writes:
> > 
> >> Quoting Pavel Emelyanov (xemul@parallels.com):
> >>> On 05/29/2014 07:32 PM, Serge Hallyn wrote:
> >>>> Quoting Marian Marinov (mm@1h.com):
> >>>>> We are not using NFS. We are using a shared block storage that offers us snapshots. So provisioning new
> >>>>> containers is extremely cheap and fast. Comparing that with untar is comparing a race car with a Smart. Yes
> >>>>> it can be done and no, I do not believe we should go backwards.
> >>>>> 
> >>>>> We do not share filesystems between containers, we offer them block devices.
> >>>> 
> >>>> Yes, this is a real nuisance for openstack style deployments.
> >>>> 
> >>>> One nice solution to this imo would be a very thin stackable filesystem which does uid shifting, or, better
> >>>> yet, a non-stackable way of shifting uids at mount.
> >>> 
> >>> I vote for non-stackable way too. Maybe on generic VFS level so that filesystems don't bother with it. From
> >>> what I've seen, even simple stacking is quite a challenge.
> >> 
> >> Do you have any ideas for how to go about it?  It seems like we'd have to have separate inodes per mapping for
> >> each file, which is why of course stacking seems "natural" here.
> >> 
> >> Trying to catch the uid/gid at every kernel-userspace crossing seems like a design regression from the current
> >> userns approach.  I suppose we could continue in the kuid theme and introduce an iuid/igid for the in-kernel inode
> >> uid/gid owners.  Then allow a user privileged in some ns to create a new mount associated with a different
> >> mapping for any ids over which he is privileged.
> > 
> > There is a simple solution.
> > 
> > We pick the filesystems we choose to support. We add privileged mounting in a user namespace. We create the user
> > and mount namespace. Global root goes into the target mount namespace with setns and performs the mounts.
> > 
> > 90% of that work is already done.
> > 
> > As long as we don't plan to support XFS (as XFS likes to expose its implementation details to userspace) it
> > should be quite straightforward.
> > 
> > The permission check change would probably only need to be:
> > 
> > 
> > @@ -2180,6 +2245,10 @@ static int do_new_mount(struct path *path, const char *fstype, int flags,
> >                 return -ENODEV;
> > 
> >         if (user_ns != &init_user_ns) {
> > +               if (!(type->fs_flags & FS_UNPRIV_MOUNT) && !capable(CAP_SYS_ADMIN)) {
> > +                       put_filesystem(type);
> > +                       return -EPERM;
> > +               }
> >                 if (!(type->fs_flags & FS_USERNS_MOUNT)) {
> >                         put_filesystem(type);
> >                         return -EPERM;
> > 
> > 
> > There are also a few funnies with capturing the user namespace of the filesystem when we perform the mount (in the
> > superblock?), and not allowing a mount of that same filesystem in a different user namespace.
> > 
> > But as long as the kuid conversions don't measurably slow down the filesystem when mounted in the initial mount and
> > user namespaces I don't see how this would be a problem for anyone, and is very little code.
> > 
> 
> This may solve one of the problems, but it does not solve the issue with UID/GID maps that overlap in different user
> namespaces.
> In our case, this means breaking container migration mechanisms.
> 
> Will this be addressed at all, or am I the only one here with this sort of requirement?

You're not.  The openstack scenario has the same problem.  So we have a
single base rootfs in a qcow2 or raw file which we want to mount into
multiple containers, each of which has a distinct set of uid mappings.

We'd like some way to identify uid mappings at mount time, without having
to walk the whole rootfs to chown every file.

(Of course safety would demand that the shared qcow2 use a set of high
subuids, NOT host uids - i.e. if we end up allowing a container to
own files owned by 0 on the host - even in a usually unmapped qcow2 -
there's danger we'd rather avoid, see again Andy's suggestions of
accidentally auto-mounted filesystem images which happen to share a
UUID with host's / or /etc.  So we'd want to map uids 100000-106536
in the qcow2 to uids 0-65536 in the container, which in turn map to
uids 200000-206536 on the host)
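
[Illustrative aside, not from the original thread: the double shift in the parenthesis above, restated as plain arithmetic.  The offsets are the illustrative ones used in the mail; nothing below is kernel code.]

#include <stdio.h>

#define QCOW2_BASE 100000u   /* ids as stored in the shared image          */
#define HOST_BASE  200000u   /* where the container's ids land on the host */

int main(void)
{
        unsigned int disk_uid = 100000u;                /* owner in the qcow2 */
        unsigned int ct_uid   = disk_uid - QCOW2_BASE;  /* 0 in the container */
        unsigned int host_uid = ct_uid + HOST_BASE;     /* 200000 on the host */

        printf("%u -> %u -> %u\n", disk_uid, ct_uid, host_uid);
        return 0;
}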

-serge

^ permalink raw reply	[flat|nested] 32+ messages in thread

end of thread, other threads:[~2014-06-23  4:07 UTC | newest]

Thread overview: 32+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-05-29  6:37 [RFC] Per-user namespace process accounting Marian Marinov
2014-05-29  6:37 ` Marian Marinov
     [not found] ` <5386D58D.2080809-108MBtLGafw@public.gmane.org>
2014-05-29 10:06   ` Eric W. Biederman
2014-05-29 10:06     ` Eric W. Biederman
     [not found]     ` <87tx88nbko.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
2014-05-29 10:40       ` Marian Marinov
2014-05-29 10:40         ` Marian Marinov
     [not found]         ` <53870EAA.4060101-108MBtLGafw@public.gmane.org>
2014-05-29 15:32           ` Serge Hallyn
2014-05-29 15:32             ` Serge Hallyn
2014-06-03 17:01             ` Pavel Emelyanov
2014-06-03 17:01               ` Pavel Emelyanov
     [not found]               ` <538DFF72.7000209-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
2014-06-03 17:26                 ` Serge Hallyn
2014-06-03 17:26                   ` Serge Hallyn
2014-06-03 17:39                   ` Pavel Emelyanov
2014-06-03 17:39                     ` Pavel Emelyanov
     [not found]                     ` <538E0848.6060900-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
2014-06-03 17:47                       ` Serge Hallyn
2014-06-03 17:47                         ` Serge Hallyn
2014-06-03 18:18                       ` Eric W. Biederman
2014-06-03 18:18                         ` Eric W. Biederman
2014-06-03 17:54                   ` Eric W. Biederman
2014-06-03 17:54                     ` Eric W. Biederman
     [not found]                     ` <8738flkhf0.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
2014-06-03 21:39                       ` Marian Marinov
2014-06-03 21:39                         ` Marian Marinov
     [not found]                         ` <538E4088.7010605-108MBtLGafw@public.gmane.org>
2014-06-23  4:07                           ` Serge E. Hallyn
2014-06-23  4:07                             ` Serge E. Hallyn
2014-06-07 21:39                       ` James Bottomley
2014-06-07 21:39                         ` James Bottomley
     [not found]                         ` <1402177144.2236.26.camel-sFMDBYUN5F8GjUHQrlYNx2Wm91YjaHnnhRte9Li2A+AAvxtiuMwx3w@public.gmane.org>
2014-06-08  3:25                           ` Eric W. Biederman
2014-06-08  3:25                             ` Eric W. Biederman
2014-06-12 14:37   ` Alin Dobre
2014-06-12 14:37     ` Alin Dobre
     [not found]     ` <5399BB42.60304-1hSFou9RDDldEee+Cai+ZQ@public.gmane.org>
2014-06-12 15:08       ` Serge Hallyn
2014-06-12 15:08         ` [lxc-devel] " Serge Hallyn
