From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <linux-kernel-owner@vger.kernel.org>
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1752976AbbKQBka (ORCPT <rfc822;w@1wt.eu>);
	Mon, 16 Nov 2015 20:40:30 -0500
Received: from h2.hallyn.com ([78.46.35.8]:53000 "EHLO h2.hallyn.com"
	rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
	id S1751421AbbKQBk2 (ORCPT <rfc822;linux-kernel@vger.kernel.org>);
	Mon, 16 Nov 2015 20:40:28 -0500
Date: Mon, 16 Nov 2015 19:40:26 -0600
From: "Serge E. Hallyn" <serge@hallyn.com>
To: "Serge E. Hallyn" <serge@hallyn.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>,
        Richard Weinberger <richard@nod.at>,
        Richard Weinberger <richard.weinberger@gmail.com>,
        LKML <linux-kernel@vger.kernel.org>,
        "open list:ABI/API" <linux-api@vger.kernel.org>,
        Linux Containers <containers@lists.linux-foundation.org>,
        LXC development mailing-list 
	<lxc-devel@lists.linuxcontainers.org>,
        Tejun Heo <tj@kernel.org>,
        cgroups mailinglist <cgroups@vger.kernel.org>,
        Andrew Morton <akpm@linux-foundation.org>
Subject: Re: CGroup Namespaces (v4)
Message-ID: <20151117014026.GA2331@mail.hallyn.com>
References: <1447703505-29672-1-git-send-email-serge@hallyn.com>
 <CAFLxGvzVmbZHrpaTmXUAK03hsnVPwEs3SJGNFNXfthh3NL8EDg@mail.gmail.com>
 <20151116204606.GA30681@mail.hallyn.com>
 <564A41AF.4040208@nod.at>
 <20151116205452.GA30975@mail.hallyn.com>
 <87y4dxh9b8.fsf@x220.int.ebiederm.org>
 <20151117011349.GA1958@mail.hallyn.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20151117011349.GA1958@mail.hallyn.com>
User-Agent: Mutt/1.5.21 (2010-09-15)
Sender: linux-kernel-owner@vger.kernel.org
List-ID: <linux-kernel.vger.kernel.org>
X-Mailing-List: linux-kernel@vger.kernel.org

On Mon, Nov 16, 2015 at 07:13:49PM -0600, Serge E. Hallyn wrote:
> On Mon, Nov 16, 2015 at 04:24:27PM -0600, Eric W. Biederman wrote:
> > "Serge E. Hallyn" <serge@hallyn.com> writes:
> > 
> > > On Mon, Nov 16, 2015 at 09:50:55PM +0100, Richard Weinberger wrote:
> > >> Am 16.11.2015 um 21:46 schrieb Serge E. Hallyn:
> > >> > On Mon, Nov 16, 2015 at 09:41:15PM +0100, Richard Weinberger wrote:
> > >> >> Serge,
> > >> >>
> > >> >> On Mon, Nov 16, 2015 at 8:51 PM,  <serge@hallyn.com> wrote:
> > >> >>> To summarize the semantics:
> > >> >>>
> > >> >>> 1. CLONE_NEWCGROUP re-uses 0x02000000, which was previously CLONE_STOPPED
> > >> >>>
> > >> >>> 2. unsharing a cgroup namespace makes all your current cgroups your new
> > >> >>> cgroup root.
> > >> >>>
> > >> >>> 3. /proc/pid/cgroup always shows cgroup paths relative to the reader's
> > >> >>> cgroup namespace root.  A task outside of your cgroup looks like
> > >> >>>
> > >> >>>         8:memory:/../../..
> > >> >>>
> > >> >>> 4. when a task mounts a cgroupfs, the cgroup which shows up as root depends
> > >> >>> on the mounting task's cgroup namespace.
> > >> >>>
> > >> >>> 5. setns to a cgroup namespace switches your cgroup namespace but not
> > >> >>> your cgroups.
> > >> >>>
> > >> >>> With this, using github.com/hallyn/lxc #2015-11-09/cgns (and
> > >> >>> github.com/hallyn/lxcfs #2015-11-10/cgns) we can start a container in a full
> > >> >>> proper cgroup namespace, avoiding either cgmanager or lxcfs cgroup bind mounts.
> > >> >>>
> > >> >>> This is completely backward compatible and will be completely invisible
> > >> >>> to any existing cgroup users (except for those running inside a cgroup
> > >> >>> namespace and looking at /proc/pid/cgroup of tasks outside their
> > >> >>> namespace.)
> > >> >>>    cgroupns-root.
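
To make item 3 concrete, here is a rough userspace sketch of how a path
relative to the reader's namespace root could be computed.  This is not the
kernel code; cgroup_path_from_ns_root() is a hypothetical name, and it
assumes both inputs are absolute paths in the same hierarchy with no
trailing slash.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper (not a kernel function): render the absolute cgroup
 * path `target` as it would look from a namespace rooted at `ns_root`.
 * Returns 0 on success, -1 if `out` is too small.
 */
static int cgroup_path_from_ns_root(const char *ns_root, const char *target,
                                    char *out, size_t outlen)
{
    size_t i = 0, common = 0, pos = 0, rest;
    const char *p;

    /* Treat "/" as the empty path so every component starts with '/'. */
    if (strcmp(ns_root, "/") == 0)
        ns_root = "";
    if (strcmp(target, "/") == 0)
        target = "";

    /* Longest common prefix that ends on a component boundary. */
    for (;;) {
        int r_end = (ns_root[i] == '\0' || ns_root[i] == '/');
        int t_end = (target[i] == '\0' || target[i] == '/');

        if (r_end && t_end)
            common = i;
        if (ns_root[i] == '\0' || target[i] == '\0' ||
            ns_root[i] != target[i])
            break;
        i++;
    }

    /* One "/.." for each root component below the common ancestor. */
    for (p = ns_root + common; *p; p++) {
        if (*p != '/')
            continue;
        if (pos + 3 >= outlen)
            return -1;
        memcpy(out + pos, "/..", 3);
        pos += 3;
    }

    /* Then descend into whatever is left of the target path. */
    rest = strlen(target + common);
    if (pos + rest >= outlen)
        return -1;
    memcpy(out + pos, target + common, rest);
    pos += rest;

    /* The namespace root itself is shown as "/". */
    if (pos == 0) {
        if (outlen < 2)
            return -1;
        out[pos++] = '/';
    }
    out[pos] = '\0';
    return 0;
}
```

With ns_root "/a/b/c" and target "/", this yields "/../../..", which matches
the memory line in the example above.
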
> > >> >>
> > >> >> IIRC one downside of this series was that only the new "sane" cgroup
> > >> >> layout was supported
> > >> >> and hence it was useless for everything which expected the default layout.
> > >> >> Hence, still no systemd for us. :)
> > >> >>
> > >> >> Is this now different?
> > >> > 
> > >> > Yes, all hierarchies are no supported.
> > >> > 
> > >> 
> > >> Should read "now"? :-)
> > >> If so, *awesome*!
> > >
> > > D'oh!  Yes, now :-)
> > 
> > I am glad to see multiple hierarchy support, that is something people
> > can use today.
> > 
> > A couple of quick questions before I delve into a review.
> > 
> > Does this allow mixing of cgroupfs and cgroupfs2?  That is, can I "mount
> > -t cgroupfs" inside a container and "mount -t cgroupfs2" outside a
> > container and still have reasonable things happen?  I suspect the
> > semantics of cgroups prevent this but I am interested to know what happens.
> 
> As Tejun said, this is not an issue.  There's not an actual separate cgroupfs2
> filesystem, it's just a separate hierarchy which controllers can be bound to
> or not, which has its own set of semantics (like no tasks on leafnodes).  So
> a legacy application would never be able to run on the unified hierarchy, but
> this does not change that.
> 
> > Similarly, have you considered what is required to be able to safely set
> > FS_USERNS_MOUNT?
> 
> I think the only thing we need to do is
> 
> 1. go through and make sure that any ability to change mount flags is under
> capable() (which I have not yet done).  The cgroup_mount() itself checks that
> flags are not changed, but there may be some subtle way to effect a change
> that I'm not aware of yet.
> 

At least the ability to change the clone_children and release agent settings
through remount needs to be restricted to init_user_ns root.
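
As a rough illustration of the rule I have in mind (a userspace model only,
not a patch): the struct and function names below are made up for this
sketch, and the init_userns_root flag stands in for whatever capable()-style
check the real remount path would use.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical model of the two remount-sensitive cgroup mount settings. */
struct cg_mount_opts {
    bool clone_children;
    const char *release_agent;  /* NULL when no agent is set */
};

/* Compare two release agent paths, treating NULL as "unset". */
static bool same_agent(const char *a, const char *b)
{
    if (a == NULL || b == NULL)
        return a == b;
    return strcmp(a, b) == 0;
}

/*
 * Allow a remount if it leaves clone_children and the release agent alone,
 * or if the caller is root in init_user_ns.
 */
static bool cg_remount_permitted(const struct cg_mount_opts *cur,
                                 const struct cg_mount_opts *req,
                                 bool init_userns_root)
{
    bool changes_sensitive =
        cur->clone_children != req->clone_children ||
        !same_agent(cur->release_agent, req->release_agent);

    return !changes_sensitive || init_userns_root;
}
```

So a container root remounting with identical options would still succeed,
while an attempt to set a new release agent from inside a user namespace
would be refused.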