From: ebiederm@xmission.com (Eric W. Biederman)
To: Nikolay Borisov
Cc: Jan Kara, john@johnmccutchan.com, eparis@redhat.com,
	linux-kernel@vger.kernel.org, gorcunov@openvz.org, avagin@openvz.org,
	netdev@vger.kernel.org, operations@siteground.com, Linux Containers
Subject: Re: [RFC PATCH 0/4] Make inotify instance/watches be accounted per userns
Date: Fri, 03 Jun 2016 15:41:55 -0500
Message-ID: <87inxqovho.fsf@x220.int.ebiederm.org>
In-Reply-To: <5751667D.7010207@kyup.com> (Nikolay Borisov's message of "Fri, 3 Jun 2016 14:14:05 +0300")
References: <1464767580-22732-1-git-send-email-kernel@kyup.com>
	<8737ow7vcp.fsf@x220.int.ebiederm.org>
	<20160602074920.GG19636@quack2.suse.cz>
	<87bn3jy1cd.fsf@x220.int.ebiederm.org>
	<5751667D.7010207@kyup.com>

Nikolay Borisov writes:

> On 06/02/2016 07:58 PM, Eric W. Biederman wrote:
>>
>> Nikolay please see my question for you at the end.

[snip]

>> All of that said there is definitely a practical question that needs to
>> be asked.  Nikolay how did you get into this situation?  A typical user
>> namespace configuration will set up uid and gid maps with the help of a
>> privileged program and not map the uid of the user who created the user
>> namespace.  Thus avoiding exhausting the limits of the user who created
>> the container.
>
> Right, but imagine having multiple containers with identical uid/gid
> maps. For LXC-based setups imagine this:
>
> lxc.id_map = u 0 1337 65536

So I am only moderately concerned when the containers have overlapping
ids, because at some level overlapping ids means they are the same user.
This is certainly true for file permissions and for other permissions.
To isolate one container from another, a container fundamentally needs
to have separate uids and gids on the host system.

> Now all processes which are running as the same user in different
> containers will actually share the underlying user_struct and thus the
> inotify limits. In such cases even running multiple instances of 'tail'
> in one container will eventually use all allowed inotify/mark instances.
> For this to happen you needn't have complete overlap of the uid maps;
> it's enough for at least one UID to overlap between the 2 containers.
>
> So the risk of exhaustion doesn't apply to the privileged user that
> created the container and the uid mapping, but rather to the users under
> which the various processes in the container are running. Does that make
> it clear?

Yes. That is clear.

>> Which makes me personally more worried about escaping the existing
>> limits than exhausting the limits of a particular user.
>
> So I thought a bit about it and I guess a solution can be concocted
> which utilizes the hierarchical nature of the page counter, where the
> inotify limits are set per namespace if you have capable(CAP_SYS_ADMIN).
> That way the admin can set one fairly large limit on the init_user_ns
> and then in every namespace created one can set smaller limits. That
> way, for a branch in the tree (in the nomenclature you used in your
> previous reply to me) you will really be upper-bounded by the limit set
> in the namespace which has ->level = 1. For the width of the tree, you
> will be bound by the "global" init_user_ns limits. How does that sound?

As an addendum to that design, I think there should be an additional
sysctl or two that specifies how much the limit decreases when creating
a new user namespace and when creating a new user in that user
namespace. That way, with a good selection of limits and a limit
decrease, people can use the kernel defaults without needing to change
them.

Having default settings that are good enough 99% of the time, and that
people don't need to tune, would be my biggest requirement (aside from
being lightweight) for merging something like this.

If things are set and forget, and even the container case does not need
to be aware, then I think we have a design sufficiently robust and
different from what cgroups is doing to make it worthwhile to have a
userns based solution. I can see a lot of different limits implemented
this way.

Eric
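
To illustrate the sharing Nikolay describes, here is a heavily
simplified sketch in kernel-style C. It is not the actual fs/notify or
kernel/user.c code; the structure and function names are invented. The
point it captures is that the per-user inotify accounting hangs off a
structure looked up by the kernel-internal, host-wide uid, so any two
containers whose maps cover the same host uid charge the same counters.

/*
 * Simplified illustration only -- not the kernel's real code.
 * Per-user accounting is keyed on the host-wide kuid, so containers
 * with overlapping uid maps resolve to the same accounting structure.
 */
#include <linux/atomic.h>
#include <linux/errno.h>
#include <linux/uidgid.h>

struct user_acct {                      /* stand-in for user_struct */
	kuid_t uid;                     /* host uid: the lookup key */
	atomic_t inotify_instances;     /* charged on inotify_init() */
	atomic_t inotify_watches;       /* charged on inotify_add_watch() */
};

/* Find-or-create keyed purely on the kuid; no namespace in the key. */
static struct user_acct *user_acct_lookup(kuid_t uid);

static int charge_instance(struct user_acct *u, int max_instances)
{
	if (atomic_inc_return(&u->inotify_instances) > max_instances) {
		atomic_dec(&u->inotify_instances);
		return -EMFILE;  /* limit already eaten, possibly by a
				  * process in another container */
	}
	return 0;
}

With lxc.id_map = u 0 1337 65536 in two containers, uid 0 in both maps
to host uid 1337, so user_acct_lookup() returns the same structure for
both, and a burst of 'tail' processes in one container can exhaust the
instances available to the other.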
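
To make the hierarchical proposal and the decrement addendum concrete,
here is a rough sketch loosely modelled on the page_counter pattern.
Every name, field, and the sysctl below are invented for illustration;
this is not a proposed patch, only the shape the design discussed in
this thread seems to point at.

/*
 * Rough sketch only: a per-user-namespace counter, charged up the
 * parent chain the way page_counter does, with a hypothetical sysctl
 * that shrinks the inherited limit at every level.
 */
#include <linux/atomic.h>
#include <linux/errno.h>
#include <linux/kernel.h>

static int sysctl_inotify_ns_decrement = 128;  /* per new user namespace */
/* (a second sysctl would shrink the limit per new user inside the ns) */

struct ns_inotify_counter {
	struct ns_inotify_counter *parent;   /* NULL for init_user_ns */
	atomic_t usage;
	int limit;   /* settable with CAP_SYS_ADMIN within the namespace */
};

/* On user namespace creation: inherit a smaller limit by default. */
static void ns_counter_init(struct ns_inotify_counter *child,
			    struct ns_inotify_counter *parent)
{
	child->parent = parent;
	atomic_set(&child->usage, 0);
	child->limit = max(parent->limit - sysctl_inotify_ns_decrement, 0);
}

/* A charge must fit under the limit of every ancestor up to init_user_ns. */
static int ns_counter_try_charge(struct ns_inotify_counter *c)
{
	struct ns_inotify_counter *i, *failed;

	for (i = c; i; i = i->parent) {
		if (atomic_inc_return(&i->usage) > i->limit) {
			failed = i;
			goto undo;
		}
	}
	return 0;
undo:
	for (i = c; i != failed; i = i->parent)
		atomic_dec(&i->usage);
	atomic_dec(&failed->usage);
	return -EMFILE;
}

Under such a scheme a namespace at ->level = 1 can never exceed the
limit it inherited (the bound on a branch), while the init_user_ns
limit caps the total across all branches (the bound on the width of
the tree), matching the description in the exchange above.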