From: Pavel Emelyanov <xemul@parallels.com>
To: Serge Hallyn <serge.hallyn@ubuntu.com>
Cc: Marian Marinov <mm@1h.com>,
	Linux Containers <containers@lists.linux-foundation.org>,
	"Eric W. Biederman" <ebiederm@xmission.com>,
	LXC development mailing-list
	<lxc-devel@lists.linuxcontainers.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>
Subject: Re: [RFC] Per-user namespace process accounting
Date: Tue, 3 Jun 2014 21:39:20 +0400
Message-ID: <538E0848.6060900@parallels.com>
In-Reply-To: <20140603172631.GL9714@ubuntumail>

On 06/03/2014 09:26 PM, Serge Hallyn wrote:
> Quoting Pavel Emelyanov (xemul@parallels.com):
>> On 05/29/2014 07:32 PM, Serge Hallyn wrote:
>>> Quoting Marian Marinov (mm@1h.com):
>>>> -----BEGIN PGP SIGNED MESSAGE-----
>>>> Hash: SHA1
>>>>
>>>> On 05/29/2014 01:06 PM, Eric W. Biederman wrote:
>>>>> Marian Marinov <mm@1h.com> writes:
>>>>>
>>>>>> Hello,
>>>>>>
>>>>>> I have the following proposition.
>>>>>>
>>>>>> Number of currently running processes is accounted at the root user namespace. The problem I'm facing is that
>>>>>> multiple containers in different user namespaces share the process counters.
>>>>>
>>>>> That is deliberate.
>>>>
>>>> And I understand that very well ;)
>>>>
>>>>>
>>>>>> So if containerX runs 100 processes with UID 99, containerY must have an NPROC limit above 100 in order to
>>>>>> execute any processes with its own UID 99.
>>>>>>
>>>>>> I know that some of you will tell me that I should not provision all of my containers with the same UID/GID maps,
>>>>>> but this brings another problem.
>>>>>>
>>>>>> We are provisioning the containers from a template. The template has a lot of files, 500k and more, and chowning
>>>>>> these causes a lot of I/O and also slows down provisioning considerably.
>>>>>>
>>>>>> The other problem is that when we migrate one container from one host machine to another the IDs may be already
>>>>>> in use on the new machine and we need to chown all the files again.
>>>>>
>>>>> You should have the same uid allocations for all machines in your fleet as much as possible.   That has been true
>>>>> ever since NFS was invented and is not new here.  You can avoid the cost of chowning if you untar your files inside
>>>>> of your user namespace.  You can have different maps per machine if you are crazy enough to do that.  You can even
>>>>> have shared uids that you use to share files between containers as long as none of those files is setuid.  And map
>>>>> those shared files to some kind of nobody user in your user namespace.
>>>>
>>>> We are not using NFS. We are using a shared block storage that offers us snapshots. So provisioning new containers is
>>>> extremely cheap and fast. Comparing that with untar is comparing a race car with a Smart. Yes it can be done and no, I
>>>> do not believe we should go backwards.
>>>>
>>>> We do not share filesystems between containers, we offer them block devices.
>>>
>>> Yes, this is a real nuisance for openstack style deployments.
>>>
>>> One nice solution to this imo would be a very thin stackable filesystem
>>> which does uid shifting, or, better yet, a non-stackable way of shifting
>>> uids at mount.
>>
>> I vote for non-stackable way too. Maybe on generic VFS level so that filesystems 
>> don't bother with it. From what I've seen, even simple stacking is quite a challenge.
> 
> Do you have any ideas for how to go about it?  It seems like we'd have
> to have separate inodes per mapping for each file, which is why of
> course stacking seems "natural" here.

I was thinking about a "lightweight mapping", which is simple shifting. Since
we're trying to make this work together with user-ns mappings, a simple uid/gid
shift should be enough. Please correct me if I'm wrong.

If I'm not, then it looks like it would be enough to have two per-sb or
per-mnt values, one for the uid shift and one for the gid shift. Per-mnt looks
more promising for now, since a container's FS may be just a bind-mount from a
shared disk.
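
To make the shifting idea concrete, here is a minimal userspace sketch,
assuming a plain additive offset. The struct and function names are
hypothetical and do not correspond to any existing kernel structure or API;
they only illustrate keeping two shift values per mount and applying them in
both directions.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-mount shift values; not an existing kernel structure. */
struct mnt_id_shift {
    uint32_t uid_shift;
    uint32_t gid_shift;
};

/* On-disk id -> id as seen through this mount. Fails on 32-bit overflow. */
static bool shift_id_from_disk(uint32_t disk_id, uint32_t shift, uint32_t *out)
{
    if (disk_id > UINT32_MAX - shift)
        return false;
    *out = disk_id + shift;
    return true;
}

/* Id seen through the mount -> on-disk id (e.g. for chown).
 * Fails if the id lies below the shift and so has no on-disk equivalent. */
static bool shift_id_to_disk(uint32_t mnt_id, uint32_t shift, uint32_t *out)
{
    if (mnt_id < shift)
        return false;
    *out = mnt_id - shift;
    return true;
}
```

With a shift of 100000, an on-disk uid 99 would appear as 100099 through the
mount, and a chown to 100099 would store 99 back on disk. Ids below the shift
are rejected on the write path in this sketch.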

> Trying to catch the uid/gid at every kernel-userspace crossing seems
> like a design regression from the current userns approach.  I suppose we
> could continue in the kuid theme and introduce a iuid/igid for the
> in-kernel inode uid/gid owners.  Then allow a user privileged in some
> ns to create a new mount associated with a different mapping for any
> ids over which he is privileged.

User-space crossing? From my point of view it would be enough if we just turn
the uid/gid read from disk (well, from wherever the FS gets them) into uids that
match the user-ns's ones. This should cover only the VFS layer and the related
syscalls, which are, IIRC, the stat family and chown.

Ouch, and the whole quota engine :\
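
The read-side translation described above can be sketched roughly as follows.
This is a simplified userspace model, not kernel code: the extent layout
follows the documented /proc/<pid>/uid_map line format (id inside the
namespace, id outside, length), while `map_id_up`, `stat_visible_id`, and the
fixed overflow value 65534 are assumptions made for illustration.

```c
#include <stddef.h>
#include <stdint.h>

#define OVERFLOW_ID 65534u  /* stand-in for the kernel's overflow uid/gid */

/* One extent, in the same field order as a /proc/<pid>/uid_map line. */
struct id_extent {
    uint32_t ns_first;    /* first id inside the namespace */
    uint32_t host_first;  /* first id outside the namespace */
    uint32_t count;       /* length of the range */
};

/* Host-side id -> id visible inside the namespace, or OVERFLOW_ID if unmapped. */
static uint32_t map_id_up(const struct id_extent *map, size_t n, uint32_t host_id)
{
    for (size_t i = 0; i < n; i++) {
        if (host_id >= map[i].host_first &&
            host_id - map[i].host_first < map[i].count)
            return map[i].ns_first + (host_id - map[i].host_first);
    }
    return OVERFLOW_ID;
}

/* What a stat-like call would report: shift the on-disk id first,
 * then map the result into the caller's namespace. */
static uint32_t stat_visible_id(const struct id_extent *map, size_t n,
                                uint32_t disk_id, uint32_t mnt_shift)
{
    if (disk_id > UINT32_MAX - mnt_shift)
        return OVERFLOW_ID;
    return map_id_up(map, n, disk_id + mnt_shift);
}
```

With a namespace map of "0 100000 65536" and a mount shift of 100000, a file
owned by uid 99 on disk would be reported as uid 99 inside the container.
Without the shift, the same file would fall outside the map and show up as
the overflow id. Quota accounting is not modeled here.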

Thanks,
Pavel
