Re: [RFC PATCH 0/4] Make inotify instance/watches be accounted per userns

From: Nikolay Borisov <kernel@kyup.com>
To: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Jan Kara <jack@suse.cz>,
	avagin@openvz.org, netdev@vger.kernel.org,
	Linux Containers <containers@lists.linux-foundation.org>,
	linux-kernel@vger.kernel.org, eparis@redhat.com,
	operations@siteground.com, gorcunov@openvz.org,
	john@johnmccutchan.com
Subject: Re: [RFC PATCH 0/4] Make inotify instance/watches be accounted per userns
Date: Mon, 6 Jun 2016 09:41:20 +0300	[thread overview]
Message-ID: <57551B10.6080505@kyup.com> (raw)
In-Reply-To: <87inxqovho.fsf@x220.int.ebiederm.org>

On 06/03/2016 11:41 PM, Eric W. Biederman wrote:
> Nikolay Borisov <kernel@kyup.com> writes:
> 
>> On 06/02/2016 07:58 PM, Eric W. Biederman wrote:
>>>
>>> Nikolay please see my question for you at the end.
> [snip] 
>>> All of that said there is definitely a practical question that needs to
>>> be asked.  Nikolay how did you get into this situation?  A typical user
>>> namespace configuration will set up uid and gid maps with the help of a
>>> privileged program and not map the uid of the user who created the user
>>> namespace.  Thus avoiding exhausting the limits of the user who created
>>> the container.
>>
>> Right but imagine having multiple containers with identical uid/gid maps
>> for LXC-based setups imagine this:
>>
>> lxc.id_map = u 0 1337 65536
> 
> So I am only moderately concerned when the containers have overlapping
> ids.  Because at some level overlapping ids means they are the same
> user.  This is certainly true for file permissions and for other
> permissions.  To isolate one container from another it fundamentally
> needs to have separate uids and gids on the host system.
> 
>> Now all processes which are running with the same user on different
>> containers will actually share the underlying user_struct thus the
>> inotify limits. In such cases even running multiple instances of 'tail'
>> in one container will eventually use all allowed inotify/mark instances.
>> For this to happen you needn't also have complete overlap of the uid
>> map, it's enough to have at least one UID between 2 containers overlap.
>>
>>
>> So the risk of exhaustion doesn't apply to the privileged user that
>> created the container and the uid mapping, but rather the users under
>> which the various processes in the container are running. Does that make
>> it clear?
> 
> Yes.  That is clear.
> 
>>> Which makes me personally more worried about escaping the existing
>>> limits than exhausting the limits of a particular user.
>>
>> So I thought bit about it and I guess a solution can be concocted which
>> utilize the hierarchical nature of page counter, and the inotify limits
>> are set per namespace if you have capable(CAP_SYS_ADMIN). That way the
>> admin can set one fairly large on the init_user_ns and then in every
>> namespace created one can set smaller limits. That way for a branch in
>> the tree (in the nomenclature you used in your previous reply to me) you
>> will really be upper-bound to the limit set in the namespace which have
>> ->level = 1. For the width of the tree, you will be bound by the
>> "global" init_user_ns limits. How does that sound?
> 
> As a addendum to that design.  I think there should be an additional
> sysctl or two that specifies how much the limit decreases when creating
> a new user namespace and when creating a new user in that user
> namespace.  That way with a good selection of limits and a limit
> decrease people can use the kernel defaults without needing to change
> them.

I agree that a sysctl which controls how the limits are set for new
namespaces is a good idea. I think it's best if this is in % rather than
some absolute value. Also I'm not sure about the sysctl when a user is
added in a namespace since just adding a new user should fall under the
limits of the current userns.

Also should those sysctls be global or should they be per-namespace? At
this point I'm more inclined to have global sysctl and maybe refine it
in the future if the need arises?

> 
> Having default settings that are good enough 99% of the time and that
> people don't need to tune, would be my biggest requirement (aside from
> being light-weight) for merging something like this.
> 
> If things are set and forget and even the continer case does not need to
> be aware then I think we have a design sufficiently robust and different
> from what cgroups is doing to make it worth while to have a userns based
> solution.

Provided that we agree on the overall design, so far it seems we just
need to iron out the details with the sysctl I'll be happy to implement
this.

> 
> I can see a lot of different limits implemented this way.
> 
> Eric
> _______________________________________________
> Containers mailing list
> Containers@lists.linux-foundation.org
> https://lists.linuxfoundation.org/mailman/listinfo/containers
>