From: Nikolay Borisov <kernel@kyup.com>
To: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Jan Kara <jack@suse.cz>,
	avagin@openvz.org, netdev@vger.kernel.org,
	Linux Containers <containers@lists.linux-foundation.org>,
	linux-kernel@vger.kernel.org, eparis@redhat.com,
	operations@siteground.com, gorcunov@openvz.org,
	john@johnmccutchan.com
Subject: Re: [RFC PATCH 0/4] Make inotify instance/watches be accounted per userns
Date: Mon, 6 Jun 2016 09:41:20 +0300
Message-ID: <57551B10.6080505@kyup.com>
In-Reply-To: <87inxqovho.fsf@x220.int.ebiederm.org>



On 06/03/2016 11:41 PM, Eric W. Biederman wrote:
> Nikolay Borisov <kernel@kyup.com> writes:
> 
>> On 06/02/2016 07:58 PM, Eric W. Biederman wrote:
>>>
>>> Nikolay please see my question for you at the end.
> [snip] 
>>> All of that said there is definitely a practical question that needs to
>>> be asked.  Nikolay how did you get into this situation?  A typical user
>>> namespace configuration will set up uid and gid maps with the help of a
>>> privileged program and not map the uid of the user who created the user
>>> namespace.  Thus avoiding exhausting the limits of the user who created
>>> the container.
>>
>> Right, but imagine having multiple containers with identical uid/gid
>> maps. For LXC-based setups imagine this:
>>
>> lxc.id_map = u 0 1337 65536
> 
> So I am only moderately concerned when the containers have overlapping
> ids.  Because at some level overlapping ids means they are the same
> user.  This is certainly true for file permissions and for other
> permissions.  To isolate one container from another it fundamentally
> needs to have separate uids and gids on the host system.
> 
>> Now all processes running as the same user in different containers
>> will actually share the underlying user_struct, and thus the inotify
>> limits. In such cases even running multiple instances of 'tail' in one
>> container will eventually use up all allowed inotify/mark instances.
>> For this to happen you don't even need a complete overlap of the uid
>> maps; it's enough for at least one UID to overlap between 2 containers.
>>
>>
>> So the risk of exhaustion doesn't apply to the privileged user that
>> created the container and the uid mapping, but rather the users under
>> which the various processes in the container are running. Does that make
>> it clear?
> 
> Yes.  That is clear.
> 
>>> Which makes me personally more worried about escaping the existing
>>> limits than exhausting the limits of a particular user.
>>
>> So I thought a bit about it and I guess a solution can be concocted which
>> utilizes the hierarchical nature of the page counter, and the inotify limits
>> are set per namespace if you have capable(CAP_SYS_ADMIN). That way the
>> admin can set one fairly large limit on the init_user_ns and then in every
>> namespace created one can set smaller limits. That way for a branch in
>> the tree (in the nomenclature you used in your previous reply to me) you
>> will really be upper-bounded by the limit set in the namespace which has
>> ->level = 1. For the width of the tree, you will be bound by the
>> "global" init_user_ns limits. How does that sound?
> 
> As an addendum to that design, I think there should be an additional
> sysctl or two that specifies how much the limit decreases when creating
> a new user namespace and when creating a new user in that user
> namespace.  That way, with a good selection of limits and a limit
> decrease, people can use the kernel defaults without needing to change
> them.

I agree that a sysctl which controls how the limits are set for new
namespaces is a good idea. I think it's best if this is expressed as a
percentage rather than an absolute value. I'm not sure about the second
sysctl, for when a user is added in a namespace, though: just adding a
new user should fall under the limits of the current userns.

Also, should those sysctls be global or per-namespace? At this point I'm
more inclined to have a global sysctl and maybe refine it in the future
if the need arises.
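
To make the idea concrete, here is a rough user-space sketch of what I
have in mind. It is only an illustration and not the patch itself:
struct ns_counter, child_limit_percent and the helper names are all made
up for this example, and the 50% default is an arbitrary assumption. A
new namespace derives its limit from its parent's limit via a single
global percentage, and a charge has to succeed at every level up to the
root, in the spirit of the hierarchical page counter.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical user-space model; none of these names are kernel APIs. */
struct ns_counter {
	struct ns_counter *parent;	/* NULL for the initial namespace */
	unsigned long limit;
	unsigned long usage;
};

/* Assumed global knob: child limit as a percentage of the parent limit. */
static unsigned int child_limit_percent = 50;

/* root_limit is only used when parent is NULL. */
static void ns_counter_init(struct ns_counter *c, struct ns_counter *parent,
			    unsigned long root_limit)
{
	c->parent = parent;
	c->usage = 0;
	c->limit = parent ? parent->limit * child_limit_percent / 100
			  : root_limit;
}

/* Charge one object at every level; undo and fail if any level is full. */
static int ns_counter_try_charge(struct ns_counter *c)
{
	struct ns_counter *p, *q;

	for (p = c; p; p = p->parent) {
		if (p->usage >= p->limit)
			goto rollback;
		p->usage++;
	}
	return 0;

rollback:
	/* Only the levels below the one that refused were charged. */
	for (q = c; q != p; q = q->parent)
		q->usage--;
	return -1;
}

/* Release one object at every level. */
static void ns_counter_uncharge(struct ns_counter *c)
{
	for (; c; c = c->parent) {
		assert(c->usage > 0);
		c->usage--;
	}
}
```

With a root limit of 4 and the 50% knob, each child namespace gets a
limit of 2. A single branch can never take more than its own limit, and
once two children are full a third one fails at the root, which is the
"width of the tree" bound discussed above.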


> 
> Having default settings that are good enough 99% of the time and that
> people don't need to tune, would be my biggest requirement (aside from
> being light-weight) for merging something like this.
> 
> If things are set-and-forget and even the container case does not need
> to be aware, then I think we have a design sufficiently robust and
> different from what cgroups is doing to make it worthwhile to have a
> userns-based solution.

Provided that we agree on the overall design (so far it seems we just
need to iron out the details of the sysctl), I'll be happy to implement
this.


> 
> I can see a lot of different limits implemented this way.
> 
> Eric

