Re: [RFC PATCH 0/4] Make inotify instance/watches be accounted per userns

From: ebiederm@xmission.com (Eric W. Biederman)
To: Nikolay Borisov <kernel@kyup.com>
Cc: Jan Kara <jack@suse.cz>,
	john@johnmccutchan.com, eparis@redhat.com,
	linux-kernel@vger.kernel.org, gorcunov@openvz.org,
	avagin@openvz.org, netdev@vger.kernel.org,
	operations@siteground.com,
	Linux Containers <containers@lists.linux-foundation.org>
Subject: Re: [RFC PATCH 0/4] Make inotify instance/watches be accounted per userns
Date: Fri, 03 Jun 2016 15:41:55 -0500
Message-ID: <87inxqovho.fsf@x220.int.ebiederm.org>
In-Reply-To: <5751667D.7010207@kyup.com> (Nikolay Borisov's message of "Fri, 3 Jun 2016 14:14:05 +0300")

Nikolay Borisov <kernel@kyup.com> writes:

> On 06/02/2016 07:58 PM, Eric W. Biederman wrote:
>> 
>> Nikolay please see my question for you at the end.
[snip] 
>> All of that said there is definitely a practical question that needs to
>> be asked.  Nikolay how did you get into this situation?  A typical user
>> namespace configuration will set up uid and gid maps with the help of a
>> privileged program and not map the uid of the user who created the user
>> namespace.  Thus avoiding exhausting the limits of the user who created
>> the container.
>
> Right but imagine having multiple containers with identical uid/gid maps
> for LXC-based setups imagine this:
>
> lxc.id_map = u 0 1337 65536

So I am only moderately concerned when the containers have overlapping
ids.  Because at some level overlapping ids means they are the same
user.  This is certainly true for file permissions and for other
permissions.  To isolate one container from another it fundamentally
needs to have separate uids and gids on the host system.
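
As a rough illustration of why identical maps mean "the same user", here
is a small userspace sketch (not kernel code) of the translation implied
by a line like "lxc.id_map = u 0 1337 65536".  The struct and function
names are hypothetical and chosen only for this example.

```c
#include <assert.h>
#include <stdint.h>

/* One extent of a uid map, e.g. "u 0 1337 65536" (hypothetical type). */
struct uid_extent {
	uint32_t first;        /* first uid as seen inside the container */
	uint32_t lower_first;  /* first uid on the host */
	uint32_t count;        /* number of uids covered by the extent */
};

/*
 * Translate a container uid to a host uid.
 * Returns (uint32_t)-1 when the uid is not covered by the extent.
 */
static uint32_t map_to_host(const struct uid_extent *e, uint32_t uid)
{
	assert(e != 0);
	if (uid < e->first || uid - e->first >= e->count)
		return (uint32_t)-1;
	return e->lower_first + (uid - e->first);
}
```

Two containers configured with the same extent translate every
container uid to the same host uid, so on the host they are
indistinguishable for anything accounted per user.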

> Now all processes which are running with the same user on different
> containers will actually share the underlying user_struct thus the
> inotify limits. In such cases even running multiple instances of 'tail'
> in one container will eventually use all allowed inotify/mark instances.
> For this to happen you needn't also have complete overlap of the uid
> map, it's enough to have at least one UID between 2 containers overlap.
>
>
> So the risk of exhaustion doesn't apply to the privileged user that
> created the container and the uid mapping, but rather the users under
> which the various processes in the container are running. Does that make
> it clear?

Yes.  That is clear.

>> Which makes me personally more worried about escaping the existing
>> limits than exhausting the limits of a particular user.
>
> So I thought bit about it and I guess a solution can be concocted which
> utilize the hierarchical nature of page counter, and the inotify limits
> are set per namespace if you have capable(CAP_SYS_ADMIN). That way the
> admin can set one fairly large on the init_user_ns and then in every
> namespace created one can set smaller limits. That way for a branch in
> the tree (in the nomenclature you used in your previous reply to me) you
> will really be upper-bound to the limit set in the namespace which have
> ->level = 1. For the width of the tree, you will be bound by the
> "global" init_user_ns limits. How does that sound?

As an addendum to that design.  I think there should be an additional
sysctl or two that specifies how much the limit decreases when creating
a new user namespace and when creating a new user in that user
namespace.  That way with a good selection of limits and a limit
decrease people can use the kernel defaults without needing to change
them.
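
A minimal userspace sketch of that idea, under stated assumptions: each
namespace gets a counter whose limit is derived from its parent's limit
minus a fixed decrease, and a charge must fit in every level up to the
root.  All names here (ns_counter, the "decrease" parameter) are
hypothetical, the subtraction is just one possible reading of "limit
decrease", and there is no locking; it is not the kernel's page_counter
API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-namespace counter; parent is NULL for the root. */
struct ns_counter {
	struct ns_counter *parent;
	unsigned long limit;
	unsigned long usage;
};

/*
 * Initialise a counter.  The root takes root_limit as-is; a child
 * ignores root_limit and gets its parent's limit minus "decrease"
 * (clamped at zero), so nobody has to tune the child by hand.
 */
static void ns_counter_init(struct ns_counter *c, struct ns_counter *parent,
			    unsigned long root_limit, unsigned long decrease)
{
	c->parent = parent;
	c->usage = 0;
	if (!parent)
		c->limit = root_limit;
	else
		c->limit = parent->limit > decrease ?
			   parent->limit - decrease : 0;
}

/* Charge one unit at every level; roll back if any level is full. */
static bool ns_counter_charge(struct ns_counter *c)
{
	struct ns_counter *p, *q;

	for (p = c; p; p = p->parent) {
		if (p->usage >= p->limit)
			goto undo;
		p->usage++;
	}
	return true;
undo:
	/* Undo only the levels below the one that refused the charge. */
	for (q = c; q != p; q = q->parent)
		q->usage--;
	return false;
}

/* Release one unit previously charged through this counter. */
static void ns_counter_uncharge(struct ns_counter *c)
{
	for (; c; c = c->parent) {
		assert(c->usage > 0);
		c->usage--;
	}
}
```

With a root limit of 3 and a decrease of 1, each child namespace is
capped at 2, and all children together can never exceed the root's 3.
That gives the "depth bounded by the namespace, width bounded by
init_user_ns" behaviour discussed above.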

Having default settings that are good enough 99% of the time and that
people don't need to tune, would be my biggest requirement (aside from
being light-weight) for merging something like this.

If things are set-and-forget and even the container case does not need
to be aware of the limits, then I think we have a design sufficiently
robust and different from what cgroups is doing to make it worthwhile
to have a userns-based solution.

I can see a lot of different limits implemented this way.

Eric