From: ebiederm@xmission.com (Eric W. Biederman)
To: "Enrico Weigelt, metux IT consult" <lkml@metux.net>
Cc: Christian Brauner, Alexander Mihalicyn, Giuseppe Scrivano,
    Joseph Christopher Sible, Wat Lim, Kees Cook,
    containers@lists.linux-foundation.org, linux-kernel@vger.kernel.org,
    Josh Triplett, Mickaël Salaün, Andy Lutomirski, Mrunal Patel,
    Pavel Tikhomirov, Geoffrey Thomas, "Serge E. Hallyn"
Date: Sat, 17 Oct 2020 11:51:22 -0500
Subject: Re: LPC 2020 Hackroom Session: summary and next steps for isolated user namespaces
Message-ID: <874kmsdcdx.fsf@x220.int.ebiederm.org>
In-Reply-To: (Enrico Weigelt's message of "Thu, 15 Oct 2020 17:31:45 +0200")
References: <20200830143959.rhosiunyz5yqbr35@wittgenstein>

"Enrico Weigelt, metux IT consult" <lkml@metux.net> writes:

> On 30.08.20 16:39, Christian Brauner wrote:
>
> Hi Christian,
>
>> P1. Isolated id mappings can only be guaranteed to be locally isolated.
>>     A container runtime/daemon can only guarantee non-overlapping id mappings
>>     when no other users on the system create containers.
>
> Indeed. But couldn't we just record the mappings in some standardized
> place (eg. some file) which all engines maintain ?
>
> I'd guess other solutions would need changes in the runtimes, too.
>
> Please keep in mind that some scenarios actually need some overlaps, eg.
> application containers that shall have direct access to home dirs.
>
>> P2. Enforcing isolated id mappings in userspace is difficult.
>>     It is always possible to create other processes with overlapping id
>>     mappings. Coordinating id mappings in userspace will always remain
>>     optional. Quite a few tools nowadays (including systemd) don't care about
>>     /etc/sub{g,u}id and actively advise against using it.
>>     This is made even more problematic since sub{g,u}id delegation is done
>>     per-user rather than per-container-runtime.
>
> I believe subusers aren't meant for typical containers (like docker or
> lxc), but unprivileged user programs that wanna have further isolation
> for subprocesses (eg. a browser's renderer or js engine).
>
> Correct me if I'm wrong.

There is an ongoing trend to make unprivileged containers typical
containers.

>> P3. The range of the id mapping of a container can't be predetermined.
>>     While POSIX mandates that a standard system should use a range of 65536 ids
>>     reality is very different. Some programs allocate high ids for random
>>     processes or for network authentication. This means, in practice it is
>>     often necessary to assign a range of up to 10 million ids to a container.
>>     This limits a system to less than 500 containers total.
>
> In 25+ years, haven't seen such an application in the field. I'd
> consider this a horrible and dangerous bug. Sane applications create
> specific user entries (/etc/passwd) for that.
>
> I'd say we're safe w/ max 2^16 users per container, which should give us
> space for about 2^16 containers.

I forget the details, but systemd has a feature where it will randomly
allocate a uid for a service.  It calls them something like temporary
uids.

>> P4. Isolated id mappings severely restrict the number of containers that can be
>>     run on a system.
>>     This ties back to the point about pre-determining the id range of a
>>     container and how large range allocations tend to be on real systems. That
>>     becomes even more relevant when nesting containers.
>
> IMHO, all we need is to maintain a list of active ranges (more precisely
> the 16bit prefixes, just like class B networks ;-)). As said, I'd
> declare the scenario #P3 as invalid and rather fix those few broken
> applications.

That is what /etc/subuid and /etc/subgid are, and they were very much
inspired by the same source.

>> P5. Container runtimes cannot reuse overlayfs lower directories if each
>>     container uses isolated ID mappings, leading to either needless storage
>>     overhead (LXD -- though the LXD folks don’t really mind), completely
>>     ignoring the benefits of isolating containers from each other (Docker), or
>>     not using them at all (Kubernetes). (This is a more general issue but bears
>>     repeating since it is closely tied to most userns proposals.)
>
> Indeed. That's IMHO the main problem. We somehow need to map the UIDs.
> Maybe a synthetic filesystem that just does exactly the same uid<->kuid
> translations we're already doing in other places ?
>
>> P6. Rlimits pose a problem for containers that share the same id mapping.
>>     This means containers with overlapping id mappings can DOS each other by
>>     exhausting their rlimits. The reason for this lies with the current
>>     implementation of rlimits -- rlimits are currently tied to users and are
>>     not hierarchically limited like inotify limits are. This is a severe
>>     problem in unprivileged workloads. Eric and others identified that this
>>     issue can be fixed independently of the isolated user namespace proposal.
>
> Is this really a practical issue, when we're using uid namespaces ?

Very much so.  There are containers that would otherwise use the same
uid range (AKA they have the same set of users), but can't, because
there are cases like daemons that set their RLIMIT_NPROC to 1, since the
daemon knows that the user for that daemon will never run any other
processes.

Run two containers with the same mappings and, because RLIMIT_NPROC is
counted per kernel uid, that daemon DOSes itself.

>> S2. Kernel-enforced user namespace isolation.
>>     This means, there is no need for different container runtimes to
>>     collaborate on id ranges with immediate benefits for everyone.
>>     This solves P1 and P2.
>
> Okay, but how to support scenarios where some of the UIDs should
> overlap on purpose ? (eg. mounting some of the host's user homedirs
> into namespaces ?)

Just have a limited number of mappings for the cases that actually need
on-disk storage.  The key idea is adding uids that don't need to be
mapped.  Everything else stays the same.

>> S5. The owning id concept of a user namespace makes monitoring and interacting
>>     with such containers way easier.
>
> What exactly is the owning id ? How is it created and managed ?
> Some magic id or a cryptographic token ?

Not a new thing.  Just the user that created the user namespace.
The suggestion is to refine the idea so that users that don't map
anywhere show up as the creator of the user namespace.

>> 1. How are interactions across isolated user namespaces handled?
>
> What kind of interaction do you have in mind ?
> Data transfers ? Process manipulation ? Namespace destruction ?
>
> Can you please illustrate some actual use cases ?
>
>>    Proposal 1.1 seemed preferred since it would allow an unprivileged
>>    user creating an isolated user namespace to kill/ptrace all processes
>>    in the isolated namespace they spawned.
>
> Don't we already have this if this user is mapped as root inside the
> container ?

I think more concerns were raised than actually exist.
The owner/creator of a user namespace can already manage a container
and send it signals.  That is built into the capability system.
Nothing needs to change there.

The only real question I see is which uids and gids we show to
processes outside of the user namespace when those uids and gids
don't map.

>>    The first consensus reached seemed to be to decouple isolated user
>>    namespaces from shiftfs. The idea is to solely rely on tmpfs and fuse
>>    at the beginning as filesystems which can be mounted inside isolated
>>    user namespaces and so would have proper ownership.
>
> So, I'd essentially have to run the whole rootfs through fuse and a
> userland fileserver, which probably has to track things like ownerships
> in its own db (when running under unprivileged user) ?

The consensus was to start with what is working now.

Users that don't map outside of the user namespace will show up and work
properly on tmpfs, or on a fuse implementation of ext4 on top of a file.

>> For mount points
>>    that originate from outside the namespace, everything will show as
>>    the overflow ids and access would be restricted to the most
>>    restricted permission bit for any path that can be accessed.
>
> So, I can't just take a btrfs snapshot as rootfs anymore ?

Interesting.  Until reading through your commentary I had missed the
proposal to effectively change the permissions to:
((mode >> 3) & (mode >> 6) & mode & 7).

The challenge is that in a permission triple it is possible to set
lower permissions for the owner of the file, or for a specific group,
than for everyone else.

Today we require root permissions (strictly, CAP_SETUID/CAP_SETGID in
the parent user namespace) to be able to map arbitrary users and groups
in /proc/<pid>/uid_map and /proc/<pid>/gid_map, and we require root
permissions to be able to drop groups with setgroups.

Now we are discussing moving to a world where we can use users and
groups that don't map to any other user namespace in uid_map and
gid_map.  It should be completely safe to use those users and groups
except for negative permissions in filesystems.  So a big question is
how do we arrange the system so anyone can use those files without
negative permissions causing problems.

I believe it is safe to not limit the owner of a file, as the owner of
a file can always chmod the file and remove any restrictions.  That is
no worse than calling setuid to a different uid.

That leaves the case we have been dealing with all along: the ability
to drop groups with setgroups.

I guess the practical proposal is: when !in_group_p() and we are
looking at the other permissions, treat the permissions as
((mode >> 3) & mode & 7) instead of just (mode & 7).

For systems that don't use negative group permissions this is a no-op,
so it should not affect your btrfs snapshots at all (unless you use
negative group permissions).
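The three rules being compared can be modelled in a few lines of userspace C. This is a sketch under stated assumptions, not kernel code. The helper names are hypothetical, only the low nine permission bits are considered, and the owner and ACL paths of the real check are ignored. For a caller who is neither the owner nor a group member, it computes the "other" rwx bits under today's rule, under the hackroom's most-restrictive rule and under the refined rule. It also checks the no-op claim exhaustively across all 512 modes.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical helpers: each returns the rwx bits (0..7) that a caller who
 * is neither the owner nor in the file's group effectively gets. */

/* Today's rule: just the "other" triple. */
static unsigned int other_today(unsigned int mode)
{
	return mode & 7;
}

/* Hackroom rule: the most restrictive of the owner, group and other triples. */
static unsigned int other_most_restrictive(unsigned int mode)
{
	return (mode >> 6) & (mode >> 3) & mode & 7;
}

/* Refined rule: the "other" triple, further limited by the group triple. */
static unsigned int other_refined(unsigned int mode)
{
	return (mode >> 3) & mode & 7;
}

/* True when some bit granted to everyone else is withheld from the group,
 * i.e. the mode uses a negative group permission. */
static bool has_negative_group_perm(unsigned int mode)
{
	return (mode & 7 & ~(mode >> 3)) != 0;
}

/* Walk every permission mode 0000..0777, assert that the refined rule
 * changes the outcome exactly for modes with negative group permissions,
 * and return how many modes it changes. */
static int count_modes_changed_by_refined_rule(void)
{
	int changed = 0;

	for (unsigned int mode = 0; mode <= 0777; mode++) {
		bool differs = other_refined(mode) != other_today(mode);

		assert(differs == has_negative_group_perm(mode));
		changed += differs;
	}
	return changed;
}
```

Under these assumptions, conventional modes such as 0755 or 0644 give the same result under today's rule and the refined rule. With 0705 (group explicitly denied), a non-member is left with no access. The most-restrictive rule additionally penalizes modes such as 0475, where the owner has fewer rights than everyone else.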

The proposed check denies access before we get to an NFS server or other
interesting case, so it should work for pretty much everything the
kernel deals with.

Userspace repeating permission checks could break.  But that is just a
problem of inconsistency, and will always be a problem.

We could make it more precise, as Serge was suggesting, with a set of
groups that were dropped with setgroups, but under the assumption that
negative groups are sufficiently rare we can avoid that overhead.

 static int acl_permission_check(struct inode *inode, int mask)
 {
 	unsigned int mode = inode->i_mode;
 
- [irrelevant bits of this function]
 
 	/* Only RWX matters for group/other mode bits */
 	mask &= 7;
 
 	/*
 	 * Are the group permissions different from
 	 * the other permissions in the bits we care
 	 * about? Need to check group ownership if so.
 	 */
 	if (mask & (mode ^ (mode >> 3))) {
 		if (in_group_p(inode->i_gid))
 			mode >>= 3;
+		/* Use the most restrictive permissions? */
+		else if (current_user_ns()->flags & USERNS_ALWAYS_DENY_GROUPS)
+			mode &= (mode >> 3);
 	}
 
 	/* Bits in 'mode' clear that we require? */
 	return (mask & ~mode) ? -EACCES : 0;
 }

As I read posix_acl_permission(), all of the POSIX ACLs for groups are
positive permissions.  So I think the only other code that would need to
be updated is the filesystems that replace generic_permission() with
something that doesn't call acl_permission_check().

Userspace could then activate this mode with:
	echo "safely_allow" > /proc/<pid>/setgroups

That looks very elegant and simple, and I don't think it will cause
problems for anyone.  It might even make sense to make that the default
mode when creating a new user namespace.

I guess we owe this idea to Josh Triplett and Geoffrey Thomas.

Does anyone see any problems with tweaking the permissions this way so
that we can always allow setgroups in a user namespace?

Eric
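The decision path of the patched acl_permission_check() can also be exercised outside the kernel. The sketch below is a userspace model under stated assumptions. in_group_p() and the hypothetical USERNS_ALWAYS_DENY_GROUPS flag become boolean parameters, and the owner and POSIX ACL paths elided in the patch are not modelled. model_group_other_check() is an illustrative name, not a kernel function.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/*
 * Userspace model of the group/other tail of acl_permission_check() with
 * the proposed change applied. 'in_group' stands in for in_group_p() and
 * 'always_deny_groups' for the hypothetical USERNS_ALWAYS_DENY_GROUPS flag.
 * 'mask' uses the kernel's MAY_* bit values (MAY_READ 4, MAY_WRITE 2,
 * MAY_EXEC 1). Returns 0 when access is allowed, -EACCES otherwise.
 */
static int model_group_other_check(unsigned int mode, int mask,
				   bool in_group, bool always_deny_groups)
{
	/* Only RWX matters for group/other mode bits */
	mask &= 7;

	/* Group membership only matters when the group and other triples
	 * differ in the requested bits. */
	if (mask & (mode ^ (mode >> 3))) {
		if (in_group)
			mode >>= 3;	/* members get the group triple */
		else if (always_deny_groups)
			mode &= (mode >> 3);	/* non-members also lose bits the group lacks */
	}

	/* Bits in 'mode' clear that we require? */
	return (mask & ~mode) ? -EACCES : 0;
}
```

For example, with mode 0705 and a read request, a caller outside the group is allowed today but gets -EACCES once the flag is set. Mode 0755 is unaffected because its group and other bits agree, so the flag is never consulted.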