linux-kernel.vger.kernel.org archive mirror
* CGroup Namespaces (v8)
@ 2015-12-23  4:23 serge.hallyn
  2015-12-23  4:23 ` [PATCH 1/8] kernfs: Add API to generate relative kernfs path serge.hallyn
                   ` (9 more replies)
  0 siblings, 10 replies; 32+ messages in thread
From: serge.hallyn @ 2015-12-23  4:23 UTC (permalink / raw)
  To: linux-kernel
  Cc: adityakali, tj, linux-api, containers, cgroups, lxc-devel, akpm,
	ebiederm, gregkh, lizefan, hannes

Hi,

following is a revised set of the CGroup Namespace patchset which Aditya
Kali has previously sent.  The code can also be found in the cgroupns.v8
branch of

https://git.kernel.org/cgit/linux/kernel/git/sergeh/linux-security.git/

To summarize the semantics:

1. CLONE_NEWCGROUP re-uses 0x02000000, which was previously CLONE_STOPPED

2. unsharing a cgroup namespace makes all your current cgroups your new
cgroup root.

3. /proc/pid/cgroup always shows cgroup paths relative to the reader's
cgroup namespace root.  A task outside of your cgroup looks like

	8:memory:/../../..

4. when a task mounts a cgroupfs, the cgroup which shows up as root depends
on the mounting task's cgroup namespace.

5. setns to a cgroup namespace switches your cgroup namespace but not
your cgroups.

With this, using github.com/hallyn/lxc #2015-11-09/cgns (and
github.com/hallyn/lxcfs #2015-11-10/cgns) we can start a container in a full
proper cgroup namespace, avoiding either cgmanager or lxcfs cgroup bind mounts.

This is completely backward compatible and will be completely invisible
to any existing cgroup users (except for those running inside a cgroup
namespace and looking at /proc/pid/cgroup of tasks outside their
namespace.)
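
As an illustration of point 3 above, here is a minimal userspace sketch
that splits one /proc/<pid>/cgroup line and counts the leading "/.."
components of its path.  The helper names cgroup_line_path() and
cgroup_path_levels_up() are made up for this example and are not part of
the patchset; the line format "ID:controllers:path" is taken from the
sample shown above.

```c
#include <assert.h>
#include <string.h>

/*
 * Split one /proc/<pid>/cgroup line of the form "ID:controllers:path"
 * and return a pointer to the path part, or NULL if the line is malformed.
 * (Hypothetical helper, for illustration only.)
 */
static const char *cgroup_line_path(const char *line)
{
	const char *first = strchr(line, ':');
	const char *second = first ? strchr(first + 1, ':') : NULL;

	return second ? second + 1 : NULL;
}

/*
 * Count the leading "/.." components of a path.  A non-zero result means
 * the task's cgroup lies outside the reader's cgroup namespace root.
 */
static int cgroup_path_levels_up(const char *path)
{
	int levels = 0;

	/* Only count whole ".." components, not names that start with "..". */
	while (strncmp(path, "/..", 3) == 0 &&
	       (path[3] == '/' || path[3] == '\0' || path[3] == '\n')) {
		levels++;
		path += 3;
	}
	return levels;
}
```

A tool running inside a cgroup namespace could use a check like this to
tell tasks inside its own root apart from tasks outside of it.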

Changes from V7:
1. Rework kernfs_path_from_node_locked to return the string length
2. Rename and reorder args to kernfs_path_from_node
3. cgroup.c: undo accidental conversions to inline
4. cgroup.h: move ns declarations to bottom.
5. Rework the documentation to fit the style of the rest of cgroup.txt

Changes from V6:
1. Switch to some WARN_ONs to provide stack traces
2. Rename kernfs_node_distance to kernfs_depth
3. Make sure kernfs_common_ancestor() nodes are from same root
4. Split kernfs changes for cgroup_mount into separate patch
5. Rename kernfs_obtain_root to kernfs_node_dentry
(And more, see patch changelogs)

Changes from V5:
1. To get a root dentry for cgroup namespace mount, walk the path from the
   kernfs root dentry.

Changes from V4:
1. Move the FS_USERNS_MOUNT flag to last patch
2. Rebase onto cgroup/for-4.5
3. Don't allow non-init user namespaces to bind new subsystems when mounting.
4. Address feedback from Tejun (thanks).  Specifically, not addressed:
   . kernfs_obtain_root - walking dentry from kernfs root.
     (I think that's the only piece)
5. Dropped unused get_task_cgroup fn/patch.
6. Reworked kernfs_path_from_node_locked() to try to simplify the logic.
   It now finds a common ancestor, walks from the source to it, then back
   up to the target.

Changes from V3:
1. Rebased onto latest cgroup changes.  In particular switch to
   css_set_lock and ns_common.
2. Support all hierarchies.

Changes from V2:
1. Added documentation in Documentation/cgroups/namespace.txt
2. Fixed a bug that caused crash
3. Incorporated some other suggestions from last patchset:
   - removed use of threadgroup_lock() while creating new cgroupns
   - use task_lock() instead of rcu_read_lock() while accessing
     task->nsproxy
   - optimized setns() to own cgroupns
   - simplified code around sane-behavior mount option parsing
4. Restored ACKs from Serge Hallyn from v1 on few patches that have
   not changed since then.

Changes from V1:
1. No pinning of processes within cgroupns. Tasks can be freely moved
   across cgroups even outside of their cgroupns-root. Usual DAC/MAC policies
   apply as before.
2. Path in /proc/<pid>/cgroup is now always shown and is relative to
   cgroupns-root. So path can contain '/..' strings depending on cgroupns-root
   of the reader and cgroup of <pid>.
3. setns() does not require the process to first move under target
   cgroupns-root.

Changes from RFC (V0):
1. setns support for cgroupns
2. 'mount -t cgroup cgroup <mntpt>' from inside a cgroupns now
   mounts the cgroup hierarchy with cgroupns-root as the filesystem root.
3. writes to cgroup files outside of cgroupns-root are not allowed
4. visibility of /proc/<pid>/cgroup is further restricted by not showing
   anything if the <pid> is in a sibling cgroupns and its cgroup falls outside
   your cgroupns-root.



* [PATCH 1/8] kernfs: Add API to generate relative kernfs path
  2015-12-23  4:23 CGroup Namespaces (v8) serge.hallyn
@ 2015-12-23  4:23 ` serge.hallyn
  2015-12-23 16:08   ` Tejun Heo
  2015-12-23 16:24   ` [PATCH 1/8] " Tejun Heo
  2015-12-23  4:23 ` [PATCH 2/8] sched: new clone flag CLONE_NEWCGROUP for cgroup namespace serge.hallyn
                   ` (8 subsequent siblings)
  9 siblings, 2 replies; 32+ messages in thread
From: serge.hallyn @ 2015-12-23  4:23 UTC (permalink / raw)
  To: linux-kernel
  Cc: adityakali, tj, linux-api, containers, cgroups, lxc-devel, akpm,
	ebiederm, gregkh, lizefan, hannes, Serge E. Hallyn

From: Aditya Kali <adityakali@google.com>

The new function kernfs_path_from_node() generates the kernfs path of a
given kernfs_node relative to another kernfs_node on the same kernfs
root, and returns the length of that path.

Signed-off-by: Aditya Kali <adityakali@google.com>
Signed-off-by: Serge E. Hallyn <serge.hallyn@canonical.com>
---
Changelog 20151125:
  - Fully-wing multiline comments
  - Rework kernfs_path_from_node_locked() logic
  - Replace BUG_ONs with returning NULL
  - Use a const char* for /.. and precalculate its size
Changelog 20151130:
  - Update kernfs_path_from_node_locked comment
Changelog 20151208:
  - kernfs_node_distance:
    * Remove BUG_ON(NULL)s
    * Rename kernfs_node_distance to kernfs_depth
  - kernfs_common_ancestor:
    * Remove useless checks for depth == 0
    * Add check to ensure nodes are from same root
  - kernfs_path_from_node_locked:
    * Remove needless __must_check
    * Put p;len on its own decl line.
    * Fix wrong WARN_ONCE usage
Changelog 20151209:
  - kernfs_path_from_node: change arguments to 'to' and 'from', and
    change their order.
Changelog 20151222:
  - kernfs_path_from_node{,_locked}: return the string length.
    kernfs_path is gpl-exported, so changing their return value seemed
    ill-advised, but if no one minds I can update it too.
---
 fs/kernfs/dir.c        |  205 ++++++++++++++++++++++++++++++++++++++++--------
 include/linux/kernfs.h |    9 ++-
 2 files changed, 179 insertions(+), 35 deletions(-)

diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
index 742bf4a..e82b9a1 100644
--- a/fs/kernfs/dir.c
+++ b/fs/kernfs/dir.c
@@ -44,28 +44,123 @@ static int kernfs_name_locked(struct kernfs_node *kn, char *buf, size_t buflen)
 	return strlcpy(buf, kn->parent ? kn->name : "/", buflen);
 }
 
-static char * __must_check kernfs_path_locked(struct kernfs_node *kn, char *buf,
-					      size_t buflen)
+/* kernfs_depth - compute depth from @from to @to */
+static size_t kernfs_depth(struct kernfs_node *from, struct kernfs_node *to)
 {
-	char *p = buf + buflen;
-	int len;
+	size_t depth = 0;
 
-	*--p = '\0';
+	while (to->parent && to != from) {
+		depth++;
+		to = to->parent;
+	}
+	return depth;
+}
 
-	do {
-		len = strlen(kn->name);
-		if (p - buf < len + 1) {
-			buf[0] = '\0';
-			p = NULL;
-			break;
-		}
-		p -= len;
-		memcpy(p, kn->name, len);
-		*--p = '/';
-		kn = kn->parent;
-	} while (kn && kn->parent);
+static struct kernfs_node *kernfs_common_ancestor(struct kernfs_node *a,
+						  struct kernfs_node *b)
+{
+	size_t da, db;
+	struct kernfs_root *ra = kernfs_root(a), *rb = kernfs_root(b);
+
+	if (ra != rb)
+		return NULL;
+
+	da = kernfs_depth(ra->kn, a);
+	db = kernfs_depth(rb->kn, b);
+
+	while (da > db) {
+		a = a->parent;
+		da--;
+	}
+	while (db > da) {
+		b = b->parent;
+		db--;
+	}
+
+	/* worst case b and a will be the same at root */
+	while (b != a) {
+		b = b->parent;
+		a = a->parent;
+	}
+
+	return a;
+}
+
+/**
+ * kernfs_path_from_node_locked - find a pseudo-absolute path to @kn_to,
+ * where kn_from is treated as root of the path.
+ * @kn_from: kernfs node which should be treated as root for the path
+ * @kn_to: kernfs node to which path is needed
+ * @buf: buffer to copy the path into
+ * @buflen: size of @buf
+ *
+ * We need to handle couple of scenarios here:
+ * [1] when @kn_from is an ancestor of @kn_to at some level
+ * kn_from: /n1/n2/n3
+ * kn_to:   /n1/n2/n3/n4/n5
+ * result:  /n4/n5
+ *
+ * [2] when @kn_from is on a different hierarchy and we need to find common
+ * ancestor between @kn_from and @kn_to.
+ * kn_from: /n1/n2/n3/n4
+ * kn_to:   /n1/n2/n5
+ * result:  /../../n5
+ * OR
+ * kn_from: /n1/n2/n3/n4/n5   [depth=5]
+ * kn_to:   /n1/n2/n3         [depth=3]
+ * result:  /../..
+ *
+ * return value: length of the string.  If greater than buflen,
+ * then contents of buf are undefined.  On error, -1 is returned.
+ */
+static int
+kernfs_path_from_node_locked(struct kernfs_node *kn_to,
+			     struct kernfs_node *kn_from, char *buf,
+			     size_t buflen)
+{
+	struct kernfs_node *kn, *common;
+	const char parent_str[] = "/..";
+	size_t depth_from, depth_to, len = 0, nlen = 0;
+	char *p;
+	int i;
+
+	if (!kn_from)
+		kn_from = kernfs_root(kn_to)->kn;
+
+	if (kn_from == kn_to)
+		return strlcpy(buf, "/", buflen);
+
+	common = kernfs_common_ancestor(kn_from, kn_to);
+	if (WARN_ON(!common))
+		return -1;
+
+	depth_to = kernfs_depth(common, kn_to);
+	depth_from = kernfs_depth(common, kn_from);
+
+	if (buf)
+		buf[0] = '\0';
+
+	for (i = 0; i < depth_from; i++)
+		len += strlcpy(buf + len, parent_str,
+			       len < buflen ? buflen - len : 0);
+
+	/* Calculate how many bytes we need for the rest */
+	for (kn = kn_to; kn != common; kn = kn->parent)
+		nlen += strlen(kn->name) + 1;
 
-	return p;
+	if (len + nlen >= buflen)
+		return len + nlen;
+
+	p = buf + len + nlen;
+	*p = '\0';
+	for (kn = kn_to; kn != common; kn = kn->parent) {
+		/* Keep nlen intact; it is part of the return value. */
+		p -= strlen(kn->name);
+		memcpy(p, kn->name, strlen(kn->name));
+		*(--p) = '/';
+	}
+
+	return len + nlen;
 }
 
 /**
@@ -115,6 +210,34 @@ size_t kernfs_path_len(struct kernfs_node *kn)
 }
 
 /**
+ * kernfs_path_from_node - build path of node @to relative to @from.
+ * @from: parent kernfs_node relative to which we need to build the path
+ * @to: kernfs_node of interest
+ * @buf: buffer to copy @to's path into
+ * @buflen: size of @buf
+ *
+ * Builds @to's path relative to @from in @buf. @from and @to must
+ * be on the same kernfs-root. If @from is not parent of @to, then a relative
+ * path (which includes '..'s) as needed to reach from @from to @to is
+ * returned.
+ * 
+ * If @buf isn't long enough, the return value will be greater than @buflen
+ * and @buf contents are undefined.
+ */
+int kernfs_path_from_node(struct kernfs_node *to, struct kernfs_node *from,
+			  char *buf, size_t buflen)
+{
+	unsigned long flags;
+	int ret;
+
+	spin_lock_irqsave(&kernfs_rename_lock, flags);
+	ret = kernfs_path_from_node_locked(to, from, buf, buflen);
+	spin_unlock_irqrestore(&kernfs_rename_lock, flags);
+	return ret;
+}
+EXPORT_SYMBOL_GPL(kernfs_path_from_node);
+
+/**
  * kernfs_path - build full path of a given node
  * @kn: kernfs_node of interest
  * @buf: buffer to copy @kn's name into
@@ -127,13 +250,12 @@ size_t kernfs_path_len(struct kernfs_node *kn)
  */
 char *kernfs_path(struct kernfs_node *kn, char *buf, size_t buflen)
 {
-	unsigned long flags;
-	char *p;
+	int ret;
 
-	spin_lock_irqsave(&kernfs_rename_lock, flags);
-	p = kernfs_path_locked(kn, buf, buflen);
-	spin_unlock_irqrestore(&kernfs_rename_lock, flags);
-	return p;
+	ret = kernfs_path_from_node(kn, NULL, buf, buflen);
+	if (ret < 0 || ret >= buflen)
+		return NULL;
+	return buf;
 }
 EXPORT_SYMBOL_GPL(kernfs_path);
 
@@ -164,18 +286,39 @@ void pr_cont_kernfs_name(struct kernfs_node *kn)
 void pr_cont_kernfs_path(struct kernfs_node *kn)
 {
 	unsigned long flags;
-	char *p;
+	char *p = NULL;
+	int sz1, sz2;
 
 	spin_lock_irqsave(&kernfs_rename_lock, flags);
 
-	p = kernfs_path_locked(kn, kernfs_pr_cont_buf,
-			       sizeof(kernfs_pr_cont_buf));
-	if (p)
-		pr_cont("%s", p);
-	else
-		pr_cont("<name too long>");
+	sz1 = kernfs_path_from_node_locked(kn, NULL, kernfs_pr_cont_buf,
+					   sizeof(kernfs_pr_cont_buf));
+	if (sz1 < 0) {
+		pr_cont("(error)");
+		goto out;
+	}
+
+	if (sz1 < sizeof(kernfs_pr_cont_buf)) {
+		pr_cont("%s", kernfs_pr_cont_buf);
+		goto out;
+	}
+
+	p = kmalloc(sz1 + 1, GFP_NOFS);
+	if (!p) {
+		pr_cont("(out of memory)");
+		goto out;
+	}
+	sz2 = kernfs_path_from_node_locked(kn, NULL, p, sz1 + 1);
+	if (sz2 > sz1 || sz2 < 0) {
+		pr_cont("(error)");
+		goto out;
+	}
+
+	pr_cont("%s", p);
 
+out:
 	spin_unlock_irqrestore(&kernfs_rename_lock, flags);
+	kfree(p);
 }
 
 /**
diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
index af51df3..716bfde 100644
--- a/include/linux/kernfs.h
+++ b/include/linux/kernfs.h
@@ -267,8 +267,9 @@ static inline bool kernfs_ns_enabled(struct kernfs_node *kn)
 
 int kernfs_name(struct kernfs_node *kn, char *buf, size_t buflen);
 size_t kernfs_path_len(struct kernfs_node *kn);
-char * __must_check kernfs_path(struct kernfs_node *kn, char *buf,
-				size_t buflen);
+int kernfs_path_from_node(struct kernfs_node *to, struct kernfs_node *from,
+			  char *buf, size_t buflen);
+char *kernfs_path(struct kernfs_node *kn, char *buf, size_t buflen);
 void pr_cont_kernfs_name(struct kernfs_node *kn);
 void pr_cont_kernfs_path(struct kernfs_node *kn);
 struct kernfs_node *kernfs_get_parent(struct kernfs_node *kn);
@@ -338,8 +339,8 @@ static inline int kernfs_name(struct kernfs_node *kn, char *buf, size_t buflen)
 static inline size_t kernfs_path_len(struct kernfs_node *kn)
 { return 0; }
 
-static inline char * __must_check kernfs_path(struct kernfs_node *kn, char *buf,
-					      size_t buflen)
+static inline char *kernfs_path(struct kernfs_node *kn, char *buf,
+				size_t buflen)
 { return NULL; }
 
 static inline void pr_cont_kernfs_name(struct kernfs_node *kn) { }
-- 
1.7.9.5
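
The path construction documented in kernfs_path_from_node_locked() can be
tried outside the kernel with a toy tree.  The following is only a sketch
under stated assumptions: struct node, node_depth(), common_ancestor(),
append() and path_from_node() are hypothetical userspace stand-ins, not
kernfs code, and the 64-entry stack is an arbitrary limit chosen for the
example.  It follows the same idea: find the common ancestor, emit one
"/.." per level between @from and that ancestor, then emit the names
leading down to @to.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for struct kernfs_node: a name and a parent link. */
struct node {
	const char *name;
	struct node *parent;
};

/* Number of parent links between @n and the root of its tree. */
static size_t node_depth(const struct node *n)
{
	size_t depth = 0;

	for (; n->parent; n = n->parent)
		depth++;
	return depth;
}

/* Deepest node that is an ancestor of (or equal to) both @a and @b. */
static const struct node *common_ancestor(const struct node *a,
					  const struct node *b)
{
	size_t da = node_depth(a), db = node_depth(b);

	for (; da > db; da--)
		a = a->parent;
	for (; db > da; db--)
		b = b->parent;
	/* Nodes from different trees meet at NULL, which signals an error. */
	while (a != b) {
		a = a->parent;
		b = b->parent;
	}
	return a;
}

/* Append @s at offset @len if it fits; always return the new total length. */
static size_t append(char *buf, size_t buflen, size_t len, const char *s)
{
	size_t slen = strlen(s);

	if (len + slen < buflen)
		memcpy(buf + len, s, slen + 1);
	return len + slen;
}

/*
 * Write the path of @to as seen from @from into @buf and return the length
 * the full path needs.  A result >= @buflen means the buffer was too small;
 * -1 means the nodes share no root or the path is deeper than this toy
 * implementation supports.
 */
static int path_from_node(const struct node *to, const struct node *from,
			  char *buf, size_t buflen)
{
	const struct node *common = common_ancestor(from, to);
	const struct node *stack[64];
	const struct node *n;
	size_t len = 0, top = 0;

	if (!common)
		return -1;
	if (buflen)
		buf[0] = '\0';
	if (to == from)
		return (int)append(buf, buflen, 0, "/");

	/* One "/.." for every level between @from and the common ancestor. */
	for (n = from; n != common; n = n->parent)
		len = append(buf, buflen, len, "/..");

	/* Collect the names below the common ancestor, deepest first. */
	for (n = to; n != common && top < 64; n = n->parent)
		stack[top++] = n;
	if (n != common)
		return -1;

	/* Emit them starting with the one closest to the common ancestor. */
	while (top > 0) {
		len = append(buf, buflen, len, "/");
		len = append(buf, buflen, len, stack[--top]->name);
	}
	return (int)len;
}
```

With a tree /n1/n2/n3/n4/n5 plus a second child /n1/n2/n5, the three
scenarios from the comment block in the patch can be replayed by calling
path_from_node() with the matching @to and @from nodes.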



* [PATCH 2/8] sched: new clone flag CLONE_NEWCGROUP for cgroup namespace
  2015-12-23  4:23 CGroup Namespaces (v8) serge.hallyn
  2015-12-23  4:23 ` [PATCH 1/8] kernfs: Add API to generate relative kernfs path serge.hallyn
@ 2015-12-23  4:23 ` serge.hallyn
  2015-12-23  4:23 ` [PATCH 3/8] cgroup: introduce cgroup namespaces serge.hallyn
                   ` (7 subsequent siblings)
  9 siblings, 0 replies; 32+ messages in thread
From: serge.hallyn @ 2015-12-23  4:23 UTC (permalink / raw)
  To: linux-kernel
  Cc: adityakali, tj, linux-api, containers, cgroups, lxc-devel, akpm,
	ebiederm, gregkh, lizefan, hannes, Serge Hallyn

From: Aditya Kali <adityakali@google.com>

CLONE_NEWCGROUP will be used to create a new cgroup namespace.

Signed-off-by: Aditya Kali <adityakali@google.com>
Signed-off-by: Serge Hallyn <serge.hallyn@canonical.com>
---
 include/uapi/linux/sched.h |    3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
index cc89dde..5f0fe01 100644
--- a/include/uapi/linux/sched.h
+++ b/include/uapi/linux/sched.h
@@ -21,8 +21,7 @@
 #define CLONE_DETACHED		0x00400000	/* Unused, ignored */
 #define CLONE_UNTRACED		0x00800000	/* set if the tracing process can't force CLONE_PTRACE on this clone */
 #define CLONE_CHILD_SETTID	0x01000000	/* set the TID in the child */
-/* 0x02000000 was previously the unused CLONE_STOPPED (Start in stopped state)
-   and is now available for re-use. */
+#define CLONE_NEWCGROUP		0x02000000	/* New cgroup namespace */
 #define CLONE_NEWUTS		0x04000000	/* New utsname namespace */
 #define CLONE_NEWIPC		0x08000000	/* New ipc namespace */
 #define CLONE_NEWUSER		0x10000000	/* New user namespace */
-- 
1.7.9.5
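
For reference, a userspace caller might exercise the new flag roughly as
follows.  This is a sketch, not part of the series: enter_new_cgroup_ns()
is a made-up helper name, and the fallback #define simply repeats the
value introduced by this patch for libc headers that do not carry it yet.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <sched.h>

/* Older libc headers may not define the flag; the value is from this patch. */
#ifndef CLONE_NEWCGROUP
#define CLONE_NEWCGROUP 0x02000000
#endif

/*
 * Try to move the calling process into a new cgroup namespace.
 * Returns 0 on success or a negative errno value.  With this series
 * applied, a caller lacking CAP_SYS_ADMIN should see -EPERM; kernels
 * that do not accept the flag in unshare() reject it with -EINVAL.
 */
static int enter_new_cgroup_ns(void)
{
	if (unshare(CLONE_NEWCGROUP) == -1)
		return -errno;
	return 0;
}
```

After a successful call, the cgroups the process was in at that moment
become the root of its cgroup namespace.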



* [PATCH 3/8] cgroup: introduce cgroup namespaces
  2015-12-23  4:23 CGroup Namespaces (v8) serge.hallyn
  2015-12-23  4:23 ` [PATCH 1/8] kernfs: Add API to generate relative kernfs path serge.hallyn
  2015-12-23  4:23 ` [PATCH 2/8] sched: new clone flag CLONE_NEWCGROUP for cgroup namespace serge.hallyn
@ 2015-12-23  4:23 ` serge.hallyn
  2015-12-23 16:15   ` Tejun Heo
  2015-12-23  4:23 ` [PATCH 4/8] cgroup: cgroup namespace setns support serge.hallyn
                   ` (6 subsequent siblings)
  9 siblings, 1 reply; 32+ messages in thread
From: serge.hallyn @ 2015-12-23  4:23 UTC (permalink / raw)
  To: linux-kernel
  Cc: adityakali, tj, linux-api, containers, cgroups, lxc-devel, akpm,
	ebiederm, gregkh, lizefan, hannes, Serge Hallyn

From: Aditya Kali <adityakali@google.com>

Introduce the ability to create a new cgroup namespace. The newly created
cgroup namespace remembers the cgroup of the process at the point
of creation of the cgroup namespace (referred to as cgroupns-root).
The main purpose of a cgroup namespace is to virtualize the contents
of the /proc/self/cgroup file. Processes inside a cgroup namespace
are only able to see paths relative to their namespace root
(unless they are moved outside of their cgroupns-root, at which point
they will see a relative path from their cgroupns-root).
For a correctly set up container this enables container tools
(like libcontainer, lxc, lmctfy, etc.) to create completely virtualized
containers without leaking the system-level cgroup hierarchy to the task.
This patch only implements the 'unshare' part of the cgroupns.

Signed-off-by: Aditya Kali <adityakali@google.com>
Signed-off-by: Serge Hallyn <serge.hallyn@canonical.com>
---
Changelog: 2015-11-24
	- move cgroup_namespace.c into cgroup.c (and .h)
	- reformatting
	- make get_cgroup_ns return void
	- rename ns->root_cgrps to root_cset.
Changelog: 2015-12-08
	- Move init_cgroup_ns to other variable declarations
	- Remove accidental conversion of put_css_set to inline
	- Drop BUG_ON(NULL)
	- Remove unneeded pre declaration of struct cgroupns_operations.
	- cgroup.h: collect common ns declarations
Changelog: 2015-12-09
	- cgroup.h: move ns declarations to bottom
	- cgroup.c: undo all accidental conversions to inline
Changelog: 2015-12-22
	- update for new kernfs_path_from_node() return value.  Since
	  cgroup_path was already gpl-exported, I abstained from updating
	  its return value.
---
 fs/proc/namespaces.c    |    3 +
 include/linux/cgroup.h  |   54 ++++++++++++++++--
 include/linux/nsproxy.h |    2 +
 include/linux/proc_ns.h |    4 ++
 kernel/cgroup.c         |  144 +++++++++++++++++++++++++++++++++++++++++++++++
 kernel/fork.c           |    2 +-
 kernel/nsproxy.c        |   21 ++++++-
 7 files changed, 221 insertions(+), 9 deletions(-)

diff --git a/fs/proc/namespaces.c b/fs/proc/namespaces.c
index f6e8354..bd61075 100644
--- a/fs/proc/namespaces.c
+++ b/fs/proc/namespaces.c
@@ -28,6 +28,9 @@ static const struct proc_ns_operations *ns_entries[] = {
 	&userns_operations,
 #endif
 	&mntns_operations,
+#ifdef CONFIG_CGROUPS
+	&cgroupns_operations,
+#endif
 };
 
 static const char *proc_ns_follow_link(struct dentry *dentry, void **cookie)
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 9d70b48..6d0992f 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -17,6 +17,11 @@
 #include <linux/seq_file.h>
 #include <linux/kernfs.h>
 #include <linux/jump_label.h>
+#include <linux/nsproxy.h>
+#include <linux/types.h>
+#include <linux/ns_common.h>
+#include <linux/nsproxy.h>
+#include <linux/user_namespace.h>
 
 #include <linux/cgroup-defs.h>
 
@@ -532,12 +537,6 @@ static inline int cgroup_name(struct cgroup *cgrp, char *buf, size_t buflen)
 	return kernfs_name(cgrp->kn, buf, buflen);
 }
 
-static inline char * __must_check cgroup_path(struct cgroup *cgrp, char *buf,
-					      size_t buflen)
-{
-	return kernfs_path(cgrp->kn, buf, buflen);
-}
-
 static inline void pr_cont_cgroup_name(struct cgroup *cgrp)
 {
 	pr_cont_kernfs_name(cgrp->kn);
@@ -570,4 +569,47 @@ static inline int cgroup_init(void) { return 0; }
 
 #endif /* !CONFIG_CGROUPS */
 
+struct cgroup_namespace {
+	atomic_t		count;
+	struct ns_common	ns;
+	struct user_namespace	*user_ns;
+	struct css_set          *root_cset;
+};
+
+extern struct cgroup_namespace init_cgroup_ns;
+
+#ifdef CONFIG_CGROUPS
+
+void free_cgroup_ns(struct cgroup_namespace *ns);
+
+struct cgroup_namespace *
+copy_cgroup_ns(unsigned long flags, struct user_namespace *user_ns,
+	       struct cgroup_namespace *old_ns);
+
+char *cgroup_path(struct cgroup *cgrp, char *buf, size_t buflen);
+
+#else /* !CONFIG_CGROUPS */
+
+static inline void free_cgroup_ns(struct cgroup_namespace *ns) { }
+static inline struct cgroup_namespace *
+copy_cgroup_ns(unsigned long flags, struct user_namespace *user_ns,
+	       struct cgroup_namespace *old_ns)
+{
+	return old_ns;
+}
+
+#endif /* !CONFIG_CGROUPS */
+
+static inline void get_cgroup_ns(struct cgroup_namespace *ns)
+{
+	if (ns)
+		atomic_inc(&ns->count);
+}
+
+static inline void put_cgroup_ns(struct cgroup_namespace *ns)
+{
+	if (ns && atomic_dec_and_test(&ns->count))
+		free_cgroup_ns(ns);
+}
+
 #endif /* _LINUX_CGROUP_H */
diff --git a/include/linux/nsproxy.h b/include/linux/nsproxy.h
index 35fa08f..ac0d65b 100644
--- a/include/linux/nsproxy.h
+++ b/include/linux/nsproxy.h
@@ -8,6 +8,7 @@ struct mnt_namespace;
 struct uts_namespace;
 struct ipc_namespace;
 struct pid_namespace;
+struct cgroup_namespace;
 struct fs_struct;
 
 /*
@@ -33,6 +34,7 @@ struct nsproxy {
 	struct mnt_namespace *mnt_ns;
 	struct pid_namespace *pid_ns_for_children;
 	struct net 	     *net_ns;
+	struct cgroup_namespace *cgroup_ns;
 };
 extern struct nsproxy init_nsproxy;
 
diff --git a/include/linux/proc_ns.h b/include/linux/proc_ns.h
index 42dfc61..de0e771 100644
--- a/include/linux/proc_ns.h
+++ b/include/linux/proc_ns.h
@@ -9,6 +9,8 @@
 struct pid_namespace;
 struct nsproxy;
 struct path;
+struct task_struct;
+struct inode;
 
 struct proc_ns_operations {
 	const char *name;
@@ -24,6 +26,7 @@ extern const struct proc_ns_operations ipcns_operations;
 extern const struct proc_ns_operations pidns_operations;
 extern const struct proc_ns_operations userns_operations;
 extern const struct proc_ns_operations mntns_operations;
+extern const struct proc_ns_operations cgroupns_operations;
 
 /*
  * We always define these enumerators
@@ -34,6 +37,7 @@ enum {
 	PROC_UTS_INIT_INO	= 0xEFFFFFFEU,
 	PROC_USER_INIT_INO	= 0xEFFFFFFDU,
 	PROC_PID_INIT_INO	= 0xEFFFFFFCU,
+	PROC_CGROUP_INIT_INO	= 0xEFFFFFFBU,
 };
 
 #ifdef CONFIG_PROC_FS
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 6b33631..72336f5 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -57,6 +57,9 @@
 #include <linux/vmalloc.h> /* TODO: replace with more sophisticated array */
 #include <linux/kthread.h>
 #include <linux/delay.h>
+#include <linux/proc_ns.h>
+#include <linux/nsproxy.h>
+#include <linux/proc_ns.h>
 
 #include <linux/atomic.h>
 
@@ -208,6 +211,15 @@ static unsigned long have_fork_callback __read_mostly;
 static unsigned long have_exit_callback __read_mostly;
 static unsigned long have_free_callback __read_mostly;
 
+/* Cgroup namespace for init task */
+struct cgroup_namespace init_cgroup_ns = {
+	.count		= { .counter = 1, },
+	.user_ns	= &init_user_ns,
+	.ns.ops		= &cgroupns_operations,
+	.ns.inum	= PROC_CGROUP_INIT_INO,
+	.root_cset	= &init_css_set,
+};
+
 /* Ditto for the can_fork callback. */
 static unsigned long have_canfork_callback __read_mostly;
 
@@ -2166,6 +2178,26 @@ static struct file_system_type cgroup2_fs_type = {
 	.kill_sb = cgroup_kill_sb,
 };
 
+static int cgroup_path_ns(struct cgroup *cgrp, char *buf, size_t buflen,
+			  struct cgroup_namespace *ns)
+{
+	struct cgroup *root;
+
+	root = cset_cgroup_from_root(ns->root_cset, cgrp->root);
+	return kernfs_path_from_node(cgrp->kn, root->kn, buf, buflen);
+}
+
+char *cgroup_path(struct cgroup *cgrp, char *buf, size_t buflen)
+{
+	int ret;
+	
+	ret = cgroup_path_ns(cgrp, buf, buflen, current->nsproxy->cgroup_ns);
+	if (ret < 0 || ret >= buflen)
+		return NULL;
+	return buf;
+}
+EXPORT_SYMBOL_GPL(cgroup_path);
+
 /**
  * task_cgroup_path - cgroup path of a task in the first cgroup hierarchy
  * @task: target task
@@ -5272,6 +5304,8 @@ int __init cgroup_init(void)
 	BUG_ON(cgroup_init_cftypes(NULL, cgroup_dfl_base_files));
 	BUG_ON(cgroup_init_cftypes(NULL, cgroup_legacy_base_files));
 
+	get_user_ns(init_cgroup_ns.user_ns);
+
 	mutex_lock(&cgroup_mutex);
 
 	/* Add init_css_set to the hash table */
@@ -5822,6 +5856,116 @@ struct cgroup *cgroup_get_from_path(const char *path)
 }
 EXPORT_SYMBOL_GPL(cgroup_get_from_path);
 
+/* cgroup namespaces */
+
+static struct cgroup_namespace *alloc_cgroup_ns(void)
+{
+	struct cgroup_namespace *new_ns;
+	int ret;
+
+	new_ns = kzalloc(sizeof(struct cgroup_namespace), GFP_KERNEL);
+	if (!new_ns)
+		return ERR_PTR(-ENOMEM);
+	ret = ns_alloc_inum(&new_ns->ns);
+	if (ret) {
+		kfree(new_ns);
+		return ERR_PTR(ret);
+	}
+	atomic_set(&new_ns->count, 1);
+	new_ns->ns.ops = &cgroupns_operations;
+	return new_ns;
+}
+
+void free_cgroup_ns(struct cgroup_namespace *ns)
+{
+	put_css_set(ns->root_cset);
+	put_user_ns(ns->user_ns);
+	ns_free_inum(&ns->ns);
+	kfree(ns);
+}
+EXPORT_SYMBOL(free_cgroup_ns);
+
+struct cgroup_namespace *
+copy_cgroup_ns(unsigned long flags, struct user_namespace *user_ns,
+	       struct cgroup_namespace *old_ns)
+{
+	struct cgroup_namespace *new_ns = NULL;
+	struct css_set *cset = NULL;
+	int err;
+
+	BUG_ON(!old_ns);
+
+	if (!(flags & CLONE_NEWCGROUP)) {
+		get_cgroup_ns(old_ns);
+		return old_ns;
+	}
+
+	/* Allow only sysadmin to create cgroup namespace. */
+	err = -EPERM;
+	if (!ns_capable(user_ns, CAP_SYS_ADMIN))
+		goto err_out;
+
+	cset = task_css_set(current);
+	get_css_set(cset);
+
+	err = -ENOMEM;
+	new_ns = alloc_cgroup_ns();
+	if (!new_ns)
+		goto err_out;
+
+	new_ns->user_ns = get_user_ns(user_ns);
+	new_ns->root_cset = cset;
+
+	return new_ns;
+
+err_out:
+	if (cset)
+		put_css_set(cset);
+	kfree(new_ns);
+	return ERR_PTR(err);
+}
+
+static int cgroupns_install(struct nsproxy *nsproxy, void *ns)
+{
+	pr_info("setns not supported for cgroup namespace");
+	return -EINVAL;
+}
+
+static struct ns_common *cgroupns_get(struct task_struct *task)
+{
+	struct cgroup_namespace *ns = NULL;
+	struct nsproxy *nsproxy;
+
+	task_lock(task);
+	nsproxy = task->nsproxy;
+	if (nsproxy) {
+		ns = nsproxy->cgroup_ns;
+		get_cgroup_ns(ns);
+	}
+	task_unlock(task);
+
+	return ns ? &ns->ns : NULL;
+}
+
+static void cgroupns_put(struct ns_common *ns)
+{
+	put_cgroup_ns(to_cg_ns(ns));
+}
+
+const struct proc_ns_operations cgroupns_operations = {
+	.name		= "cgroup",
+	.type		= CLONE_NEWCGROUP,
+	.get		= cgroupns_get,
+	.put		= cgroupns_put,
+	.install	= cgroupns_install,
+};
+
+static __init int cgroup_namespaces_init(void)
+{
+	return 0;
+}
+subsys_initcall(cgroup_namespaces_init);
+
 #ifdef CONFIG_CGROUP_DEBUG
 static struct cgroup_subsys_state *
 debug_css_alloc(struct cgroup_subsys_state *parent_css)
diff --git a/kernel/fork.c b/kernel/fork.c
index ba7d1c0..7982fee 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1880,7 +1880,7 @@ static int check_unshare_flags(unsigned long unshare_flags)
 	if (unshare_flags & ~(CLONE_THREAD|CLONE_FS|CLONE_NEWNS|CLONE_SIGHAND|
 				CLONE_VM|CLONE_FILES|CLONE_SYSVSEM|
 				CLONE_NEWUTS|CLONE_NEWIPC|CLONE_NEWNET|
-				CLONE_NEWUSER|CLONE_NEWPID))
+				CLONE_NEWUSER|CLONE_NEWPID|CLONE_NEWCGROUP))
 		return -EINVAL;
 	/*
 	 * Not implemented, but pretend it works if there is nothing
diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c
index 49746c8..64fe865 100644
--- a/kernel/nsproxy.c
+++ b/kernel/nsproxy.c
@@ -25,6 +25,7 @@
 #include <linux/proc_ns.h>
 #include <linux/file.h>
 #include <linux/syscalls.h>
+#include <linux/cgroup.h>
 
 static struct kmem_cache *nsproxy_cachep;
 
@@ -39,6 +40,9 @@ struct nsproxy init_nsproxy = {
 #ifdef CONFIG_NET
 	.net_ns			= &init_net,
 #endif
+#ifdef CONFIG_CGROUPS
+	.cgroup_ns		= &init_cgroup_ns,
+#endif
 };
 
 static inline struct nsproxy *create_nsproxy(void)
@@ -92,6 +96,13 @@ static struct nsproxy *create_new_namespaces(unsigned long flags,
 		goto out_pid;
 	}
 
+	new_nsp->cgroup_ns = copy_cgroup_ns(flags, user_ns,
+					    tsk->nsproxy->cgroup_ns);
+	if (IS_ERR(new_nsp->cgroup_ns)) {
+		err = PTR_ERR(new_nsp->cgroup_ns);
+		goto out_cgroup;
+	}
+
 	new_nsp->net_ns = copy_net_ns(flags, user_ns, tsk->nsproxy->net_ns);
 	if (IS_ERR(new_nsp->net_ns)) {
 		err = PTR_ERR(new_nsp->net_ns);
@@ -101,6 +112,9 @@ static struct nsproxy *create_new_namespaces(unsigned long flags,
 	return new_nsp;
 
 out_net:
+	if (new_nsp->cgroup_ns)
+		put_cgroup_ns(new_nsp->cgroup_ns);
+out_cgroup:
 	if (new_nsp->pid_ns_for_children)
 		put_pid_ns(new_nsp->pid_ns_for_children);
 out_pid:
@@ -128,7 +142,8 @@ int copy_namespaces(unsigned long flags, struct task_struct *tsk)
 	struct nsproxy *new_ns;
 
 	if (likely(!(flags & (CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWIPC |
-			      CLONE_NEWPID | CLONE_NEWNET)))) {
+			      CLONE_NEWPID | CLONE_NEWNET |
+			      CLONE_NEWCGROUP)))) {
 		get_nsproxy(old_ns);
 		return 0;
 	}
@@ -165,6 +180,8 @@ void free_nsproxy(struct nsproxy *ns)
 		put_ipc_ns(ns->ipc_ns);
 	if (ns->pid_ns_for_children)
 		put_pid_ns(ns->pid_ns_for_children);
+	if (ns->cgroup_ns)
+		put_cgroup_ns(ns->cgroup_ns);
 	put_net(ns->net_ns);
 	kmem_cache_free(nsproxy_cachep, ns);
 }
@@ -180,7 +197,7 @@ int unshare_nsproxy_namespaces(unsigned long unshare_flags,
 	int err = 0;
 
 	if (!(unshare_flags & (CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWIPC |
-			       CLONE_NEWNET | CLONE_NEWPID)))
+			       CLONE_NEWNET | CLONE_NEWPID | CLONE_NEWCGROUP)))
 		return 0;
 
 	user_ns = new_cred ? new_cred->user_ns : current_user_ns();
-- 
1.7.9.5
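
The get_cgroup_ns()/put_cgroup_ns() pair in this patch follows the usual
"last put frees" reference-counting pattern.  A small userspace model of
that pattern, using C11 atomics in place of the kernel's atomic_t, could
look like the sketch below.  struct toy_ns and the toy_ns_*() functions
are hypothetical names for illustration, and the freed_flag member exists
only so the release can be observed from outside.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical userspace stand-in for struct cgroup_namespace. */
struct toy_ns {
	atomic_int count;
	int *freed_flag;	/* set to 1 when the namespace is released */
};

static struct toy_ns *toy_ns_alloc(int *freed_flag)
{
	struct toy_ns *ns = calloc(1, sizeof(*ns));

	if (!ns)
		return NULL;
	atomic_init(&ns->count, 1);	/* the creator holds the first reference */
	ns->freed_flag = freed_flag;
	return ns;
}

/* Take an additional reference; a NULL namespace is silently ignored. */
static void toy_ns_get(struct toy_ns *ns)
{
	if (ns)
		atomic_fetch_add(&ns->count, 1);
}

/* Drop one reference; the put that sees the old value 1 releases the object. */
static void toy_ns_put(struct toy_ns *ns)
{
	if (ns && atomic_fetch_sub(&ns->count, 1) == 1) {
		*ns->freed_flag = 1;
		free(ns);
	}
}
```

In the patch the release step additionally drops the references on the
root css_set and the user namespace that the cgroup namespace holds.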



* [PATCH 4/8] cgroup: cgroup namespace setns support
  2015-12-23  4:23 CGroup Namespaces (v8) serge.hallyn
                   ` (2 preceding siblings ...)
  2015-12-23  4:23 ` [PATCH 3/8] cgroup: introduce cgroup namespaces serge.hallyn
@ 2015-12-23  4:23 ` serge.hallyn
  2015-12-23  4:23 ` [PATCH 5/8] kernfs: define kernfs_node_dentry serge.hallyn
                   ` (5 subsequent siblings)
  9 siblings, 0 replies; 32+ messages in thread
From: serge.hallyn @ 2015-12-23  4:23 UTC (permalink / raw)
  To: linux-kernel
  Cc: adityakali, tj, linux-api, containers, cgroups, lxc-devel, akpm,
	ebiederm, gregkh, lizefan, hannes, Serge E. Hallyn

From: Aditya Kali <adityakali@google.com>

setns on a cgroup namespace is allowed only if the
task has CAP_SYS_ADMIN in its current user namespace and
over the user namespace associated with the target cgroupns.
No implicit cgroup changes happen when attaching to another
cgroupns. It is expected that someone moves the attaching
process under the target cgroupns-root.

Signed-off-by: Aditya Kali <adityakali@google.com>
Signed-off-by: Serge E. Hallyn <serge.hallyn@canonical.com>
---
 kernel/cgroup.c |   24 +++++++++++++++++++++---
 1 file changed, 21 insertions(+), 3 deletions(-)

diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 72336f5..e85fbf9 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -5925,10 +5925,28 @@ err_out:
 	return ERR_PTR(err);
 }
 
-static int cgroupns_install(struct nsproxy *nsproxy, void *ns)
+static inline struct cgroup_namespace *to_cg_ns(struct ns_common *ns)
 {
-	pr_info("setns not supported for cgroup namespace");
-	return -EINVAL;
+	return container_of(ns, struct cgroup_namespace, ns);
+}
+
+static int cgroupns_install(struct nsproxy *nsproxy, struct ns_common *ns)
+{
+	struct cgroup_namespace *cgroup_ns = to_cg_ns(ns);
+
+	if (!ns_capable(current_user_ns(), CAP_SYS_ADMIN) ||
+	    !ns_capable(cgroup_ns->user_ns, CAP_SYS_ADMIN))
+		return -EPERM;
+
+	/* Don't need to do anything if we are attaching to our own cgroupns. */
+	if (cgroup_ns == nsproxy->cgroup_ns)
+		return 0;
+
+	get_cgroup_ns(cgroup_ns);
+	put_cgroup_ns(nsproxy->cgroup_ns);
+	nsproxy->cgroup_ns = cgroup_ns;
+
+	return 0;
 }
 
 static struct ns_common *cgroupns_get(struct task_struct *task)
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH 5/8] kernfs: define kernfs_node_dentry
  2015-12-23  4:23 CGroup Namespaces (v8) serge.hallyn
                   ` (3 preceding siblings ...)
  2015-12-23  4:23 ` [PATCH 4/8] cgroup: cgroup namespace setns support serge.hallyn
@ 2015-12-23  4:23 ` serge.hallyn
  2015-12-23 16:25   ` Tejun Heo
  2015-12-23  4:23 ` [PATCH 6/8] cgroup: mount cgroupns-root when inside non-init cgroupns serge.hallyn
                   ` (4 subsequent siblings)
  9 siblings, 1 reply; 32+ messages in thread
From: serge.hallyn @ 2015-12-23  4:23 UTC (permalink / raw)
  To: linux-kernel
  Cc: adityakali, tj, linux-api, containers, cgroups, lxc-devel, akpm,
	ebiederm, gregkh, lizefan, hannes, Serge E. Hallyn

From: Aditya Kali <adityakali@google.com>

Add a new kernfs API to look up the dentry for a particular
kernfs node.
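
The lookup walks from the root down to the target one component at a time.
A minimal userspace model of that walk is sketched below, assuming a
simplified node type with only a name and a parent pointer. The names
struct node, next_toward() and walk_from_root() are made up for this sketch,
and the string that is built stands in for the per-component dentry lookups
done in the kernel.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Simplified stand-in for struct kernfs_node. */
struct node {
	const char *name;
	struct node *parent;
};

/*
 * Return the ancestor of @child (or @child itself) whose parent is
 * @parent.  With @parent == NULL this is the root.  Returns NULL if
 * @parent is not a strict ancestor of @child.
 */
struct node *next_toward(struct node *child, struct node *parent)
{
	if (child == parent)
		return NULL;

	while (child->parent != parent) {
		if (!child->parent)
			return NULL;
		child = child->parent;
	}
	return child;
}

/*
 * Walk from the root down to @target, appending one path component to
 * @buf per step.  Returns the number of steps taken, or -1 if the walk
 * fails or @buf is too small.
 */
int walk_from_root(struct node *target, char *buf, size_t buflen)
{
	struct node *cur = next_toward(target, NULL);	/* the root */
	int steps = 0;

	if (!cur || buflen == 0)
		return -1;
	buf[0] = '\0';

	while (cur != target) {
		struct node *next = next_toward(target, cur);

		if (!next)
			return -1;
		/* room for '/', the name and the trailing NUL */
		if (strlen(buf) + strlen(next->name) + 2 > buflen)
			return -1;
		strcat(buf, "/");
		strcat(buf, next->name);
		cur = next;
		steps++;
	}
	return steps;
}
```

Each step rescans the parent chain from the target, so the walk is quadratic
in the path depth; that seems acceptable for the shallow trees involved.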

Signed-off-by: Aditya Kali <adityakali@google.com>
Signed-off-by: Serge E. Hallyn <serge.hallyn@canonical.com>
---
Changelog:
	20151116 - Don't allow user namespaces to bind new subsystems
	20151118 - postpone the FS_USERNS_MOUNT flag until the
	           last patch, until we can convince ourselves it
		   is safe.
	20151207 - Switch to walking up the kernfs path from kn root.
	20151208 - Split out the kernfs change
		 - Style changes
		 - Switch from pr_crit to WARN_ON
		 - Reorder arguments to kernfs_obtain_root
		 - rename kernfs_obtain_root to kernfs_node_dentry
---
 fs/kernfs/mount.c      |   67 ++++++++++++++++++++++++++++++++++++++++++++++++
 include/linux/kernfs.h |    2 ++
 2 files changed, 69 insertions(+)

diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c
index 8eaf417..7224296 100644
--- a/fs/kernfs/mount.c
+++ b/fs/kernfs/mount.c
@@ -14,6 +14,7 @@
 #include <linux/magic.h>
 #include <linux/slab.h>
 #include <linux/pagemap.h>
+#include <linux/namei.h>
 
 #include "kernfs-internal.h"
 
@@ -62,6 +63,72 @@ struct kernfs_root *kernfs_root_from_sb(struct super_block *sb)
 	return NULL;
 }
 
+/*
+ * find the next ancestor in the path down to @child, where @parent was the
+ * ancestor whose descendant we want to find.
+ *
+ * Say the path is /a/b/c/d.  @child is d, @parent is NULL.  We return the root
+ * node.  If @parent is b, then we return the node for c.
+ * Passing in d as @parent is not ok.
+ */
+static struct kernfs_node *
+find_next_ancestor(struct kernfs_node *child, struct kernfs_node *parent)
+{
+	if (child == parent) {
+		pr_crit_once("BUG in find_next_ancestor: called with parent == child");
+		return NULL;
+	}
+
+	while (child->parent != parent) {
+		if (!child->parent)
+			return NULL;
+		child = child->parent;
+	}
+
+	return child;
+}
+
+/**
+ * kernfs_node_dentry - get a dentry for the given kernfs_node
+ * @kn: kernfs_node for which a dentry is needed
+ * @sb: the kernfs super_block
+ */
+struct dentry *kernfs_node_dentry(struct kernfs_node *kn,
+				  struct super_block *sb)
+{
+	struct dentry *dentry;
+	struct kernfs_node *knparent = NULL;
+
+	BUG_ON(sb->s_op != &kernfs_sops);
+
+	dentry = dget(sb->s_root);
+
+	/* Check if this is the root kernfs_node */
+	if (!kn->parent)
+		return dentry;
+
+	knparent = find_next_ancestor(kn, NULL);
+	if (WARN_ON(!knparent))
+		return ERR_PTR(-EINVAL);
+
+	do {
+		struct dentry *dtmp;
+		struct kernfs_node *kntmp;
+
+		if (kn == knparent)
+			return dentry;
+		kntmp = find_next_ancestor(kn, knparent);
+		if (WARN_ON(!kntmp))
+			return ERR_PTR(-EINVAL);
+		dtmp = lookup_one_len(kntmp->name, dentry, strlen(kntmp->name));
+		dput(dentry);
+		if (IS_ERR(dtmp))
+			return dtmp;
+		knparent = kntmp;
+		dentry = dtmp;
+	} while (1);
+}
+
 static int kernfs_fill_super(struct super_block *sb, unsigned long magic)
 {
 	struct kernfs_super_info *info = kernfs_info(sb);
diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
index 716bfde..c06c442 100644
--- a/include/linux/kernfs.h
+++ b/include/linux/kernfs.h
@@ -284,6 +284,8 @@ struct kernfs_node *kernfs_node_from_dentry(struct dentry *dentry);
 struct kernfs_root *kernfs_root_from_sb(struct super_block *sb);
 struct inode *kernfs_get_inode(struct super_block *sb, struct kernfs_node *kn);
 
+struct dentry *kernfs_node_dentry(struct kernfs_node *kn,
+				  struct super_block *sb);
 struct kernfs_root *kernfs_create_root(struct kernfs_syscall_ops *scops,
 				       unsigned int flags, void *priv);
 void kernfs_destroy_root(struct kernfs_root *root);
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH 6/8] cgroup: mount cgroupns-root when inside non-init cgroupns
  2015-12-23  4:23 CGroup Namespaces (v8) serge.hallyn
                   ` (4 preceding siblings ...)
  2015-12-23  4:23 ` [PATCH 5/8] kernfs: define kernfs_node_dentry serge.hallyn
@ 2015-12-23  4:23 ` serge.hallyn
  2015-12-31 13:38   ` Sergey Senozhatsky
  2015-12-23  4:23 ` [PATCH 7/8] cgroup: Add documentation for cgroup namespaces serge.hallyn
                   ` (3 subsequent siblings)
  9 siblings, 1 reply; 32+ messages in thread
From: serge.hallyn @ 2015-12-23  4:23 UTC (permalink / raw)
  To: linux-kernel
  Cc: adityakali, tj, linux-api, containers, cgroups, lxc-devel, akpm,
	ebiederm, gregkh, lizefan, hannes, Serge Hallyn, Serge Hallyn

From: Serge Hallyn <serge.hallyn@ubuntu.com>

This patch enables cgroup mounting inside a user namespace when a
process has appropriate privileges. The cgroup filesystem mounted is
rooted at the cgroupns-root. Thus, in a container-setup, only
the hierarchy under the cgroupns-root is exposed inside the container.
This allows container management tools to run inside the containers
without depending on any global state.

Signed-off-by: Serge Hallyn <serge.hallyn@canonical.com>
---
Changelog:
	20151116 - Don't allow user namespaces to bind new subsystems
	20151118 - postpone the FS_USERNS_MOUNT flag until the
	           last patch, until we can convince ourselves it
		   is safe.
	20151207 - Switch to walking up the kernfs path from kn root.
		 - Group initialized variables
		 - Explain the capable(CAP_SYS_ADMIN) check
		 - Style fixes
---
 kernel/cgroup.c |   40 +++++++++++++++++++++++++++++++++++++++-
 1 file changed, 39 insertions(+), 1 deletion(-)

diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index e85fbf9..99c4443 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -1983,6 +1983,7 @@ static struct dentry *cgroup_mount(struct file_system_type *fs_type,
 {
 	bool is_v2 = fs_type == &cgroup2_fs_type;
 	struct super_block *pinned_sb = NULL;
+	struct cgroup_namespace *ns = current->nsproxy->cgroup_ns;
 	struct cgroup_subsys *ss;
 	struct cgroup_root *root;
 	struct cgroup_sb_opts opts;
@@ -1991,6 +1992,14 @@ static struct dentry *cgroup_mount(struct file_system_type *fs_type,
 	int i;
 	bool new_sb;
 
+	get_cgroup_ns(ns);
+
+	/* Check if the caller has permission to mount. */
+	if (!ns_capable(ns->user_ns, CAP_SYS_ADMIN)) {
+		put_cgroup_ns(ns);
+		return ERR_PTR(-EPERM);
+	}
+
 	/*
 	 * The first time anyone tries to mount a cgroup, enable the list
 	 * linking each css_set to its tasks and fix up all existing tasks.
@@ -2106,6 +2115,16 @@ static struct dentry *cgroup_mount(struct file_system_type *fs_type,
 		goto out_unlock;
 	}
 
+	/*
+	 * We know this subsystem has not yet been bound.  Users in a non-init
+	 * user namespace may only mount hierarchies with no bound subsystems,
+	 * i.e. 'none,name=user1'
+	 */
+	if (!opts.none && !capable(CAP_SYS_ADMIN)) {
+		ret = -EPERM;
+		goto out_unlock;
+	}
+
 	root = kzalloc(sizeof(*root), GFP_KERNEL);
 	if (!root) {
 		ret = -ENOMEM;
@@ -2124,12 +2143,30 @@ out_free:
 	kfree(opts.release_agent);
 	kfree(opts.name);
 
-	if (ret)
+	if (ret) {
+		put_cgroup_ns(ns);
 		return ERR_PTR(ret);
+	}
 out_mount:
 	dentry = kernfs_mount(fs_type, flags, root->kf_root,
 			      is_v2 ? CGROUP2_SUPER_MAGIC : CGROUP_SUPER_MAGIC,
 			      &new_sb);
+
+	/*
+	 * In non-init cgroup namespace, instead of root cgroup's
+	 * dentry, we return the dentry corresponding to the
+	 * cgroupns->root_cgrp.
+	 */
+	if (!IS_ERR(dentry) && ns != &init_cgroup_ns) {
+		struct dentry *nsdentry;
+		struct cgroup *cgrp;
+
+		cgrp = cset_cgroup_from_root(ns->root_cset, root);
+		nsdentry = kernfs_node_dentry(cgrp->kn, dentry->d_sb);
+		dput(dentry);
+		dentry = nsdentry;
+	}
+
 	if (IS_ERR(dentry) || !new_sb)
 		cgroup_put(&root->cgrp);
 
@@ -2142,6 +2179,7 @@ out_mount:
 		deactivate_super(pinned_sb);
 	}
 
+	put_cgroup_ns(ns);
 	return dentry;
 }
 
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH 7/8] cgroup: Add documentation for cgroup namespaces
  2015-12-23  4:23 CGroup Namespaces (v8) serge.hallyn
                   ` (5 preceding siblings ...)
  2015-12-23  4:23 ` [PATCH 6/8] cgroup: mount cgroupns-root when inside non-init cgroupns serge.hallyn
@ 2015-12-23  4:23 ` serge.hallyn
  2015-12-28 17:47   ` Tejun Heo
  2015-12-23  4:23 ` [PATCH 8/8] Add FS_USERNS_FLAG to cgroup fs serge.hallyn
                   ` (2 subsequent siblings)
  9 siblings, 1 reply; 32+ messages in thread
From: serge.hallyn @ 2015-12-23  4:23 UTC (permalink / raw)
  To: linux-kernel
  Cc: adityakali, tj, linux-api, containers, cgroups, lxc-devel, akpm,
	ebiederm, gregkh, lizefan, hannes, Serge Hallyn, Serge Hallyn

From: Aditya Kali <adityakali@google.com>

Signed-off-by: Aditya Kali <adityakali@google.com>
Signed-off-by: Serge Hallyn <serge.hallyn@canonical.com>
---
Changelog (2015-12-08):
  Merge into Documentation/cgroup.txt
Changelog (2015-12-22):
  Reformat to try to follow the style of the rest of the cgroup.txt file.

Signed-off-by: Serge Hallyn <serge.hallyn@ubuntu.com>
---
 Documentation/cgroup.txt |  150 ++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 150 insertions(+)

diff --git a/Documentation/cgroup.txt b/Documentation/cgroup.txt
index 31d1f7b..03ad757 100644
--- a/Documentation/cgroup.txt
+++ b/Documentation/cgroup.txt
@@ -47,6 +47,7 @@ CONTENTS
   5-3. IO
     5-3-1. IO Interface Files
     5-3-2. Writeback
+6. Namespaces
 P. Information on Kernel Programming
   P-1. Filesystem Support for Writeback
 D. Deprecated v1 Core Features
@@ -1013,6 +1014,155 @@ writeback as follows.
 	vm.dirty[_background]_ratio.
 
 
+6. Cgroup Namespaces
+
+Cgroup namespaces provide a mechanism to virtualize the view of the
+"/proc/$PID/cgroup" file. The CLONE_NEWCGROUP clone flag can be used with
+clone() and unshare() syscalls to create a new cgroup namespace.  The process
+running inside the cgroup namespace will have its "/proc/$PID/cgroup" output
+restricted to cgroupns root.  The cgroupns root is the cgroup of the process at
+the time of creation of the cgroup namespace.
+
+Prior to cgroup namespaces, the "/proc/$PID/cgroup" file showed the complete
+path of the cgroup of a process. In a container setup where a set of cgroups
+and namespaces are intended to isolate processes, the "/proc/$PID/cgroup" file
+may leak potential system level information to the isolated processes.
+
+For example:
+  # cat /proc/self/cgroup
+  0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/container_id1
+
+The path '/batchjobs/container_id1' can generally be considered as system-data
+and it is desirable to not expose it to the isolated process.
+
+Cgroup namespaces can be used to restrict visibility of this path.
+For example, before creating a cgroup namespace, one would see:
+
+  # ls -l /proc/self/ns/cgroup
+  lrwxrwxrwx 1 root root 0 2014-07-15 10:37 /proc/self/ns/cgroup -> cgroup:[4026531835]
+  # cat /proc/self/cgroup
+  0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/container_id1
+
+After unsharing a new namespace, the view has changed.
+
+  # ls -l /proc/self/ns/cgroup
+  lrwxrwxrwx 1 root root 0 2014-07-15 10:35 /proc/self/ns/cgroup -> cgroup:[4026532183]
+  # cat /proc/self/cgroup
+  0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/
+
+A task in the global cgroup namespace, meanwhile, sees the full path.
+
+  # cat /proc/$PID/cgroup
+  0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/container_id1
+
+If the user and mount namespaces are also unshared, then the root of a newly
+mounted cgroupfs will be the task's cgroup.
+
+  # lxc-usernsexec --unshare -m -c
+  # mount -t cgroup cgroup /tmp/cgroup
+  # ls -l /tmp/cgroup
+  total 0
+  -r--r--r-- 1 root root 0 2014-10-13 09:32 cgroup.controllers
+  -r--r--r-- 1 root root 0 2014-10-13 09:32 cgroup.populated
+  -rw-r--r-- 1 root root 0 2014-10-13 09:25 cgroup.procs
+  -rw-r--r-- 1 root root 0 2014-10-13 09:32 cgroup.subtree_control
+
+The cgroupns root (/batchjobs/container_id1 in above example) becomes the
+filesystem root for the namespace specific cgroupfs mount.
+
+The virtualization of /proc/self/cgroup file combined with restricting
+the view of cgroup hierarchy by namespace-private cgroupfs mount
+should provide a completely isolated cgroup view inside the container.
+
+In its current form, the cgroup namespaces patchset provides the following
+behavior:
+
+(1) The 'cgroupns root' for a cgroup namespace is the cgroup in which
+    the process calling unshare is running.
+    For example, if a process in the /batchjobs/container_id1 cgroup calls
+    unshare, cgroup /batchjobs/container_id1 becomes the cgroupns root.
+    For the init_cgroup_ns, this is the real root ('/') cgroup
+    (identified in code as cgrp_dfl_root.cgrp).
+
+(2) The cgroupns root cgroup does not change even if the namespace
+    creator process later moves to a different cgroup.
+    # ~/unshare -c # unshare cgroupns in some cgroup
+     # cat /proc/self/cgroup
+     0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/
+     # mkdir sub_cgrp_1
+     # echo 0 > sub_cgrp_1/cgroup.procs
+     # cat /proc/self/cgroup
+     0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/sub_cgrp_1
+
+(3) Each process gets its namespace-specific view of "/proc/$PID/cgroup"
+
+(a) Processes running inside the cgroup namespace will be able to see
+    cgroup paths (in /proc/self/cgroup) only inside their root cgroup.
+    From within an unshared cgroupns:
+    # sleep 100000 &
+    [1] 7353
+    # echo 7353 > sub_cgrp_1/cgroup.procs
+    # cat /proc/7353/cgroup
+    0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/sub_cgrp_1
+
+(b) From the initial cgroup namespace, the real cgroup path will be visible:
+    $ cat /proc/7353/cgroup
+    0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/container_id1/sub_cgrp_1
+
+(c) From a sibling cgroup namespace (that is, a namespace rooted at a
+    different cgroup), the cgroup path relative to its own cgroup namespace
+    root will be shown.  For instance, if PID 7353's cgroup namespace root is
+    at '/batchjobs/container_id2', then it will see
+
+    # cat /proc/7353/cgroup
+    0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/../container_id2/sub_cgrp_1
+
+    Note that the relative path always starts with '/' to indicate that it is
+    relative to the cgroup namespace root of the caller.
+
+(4) Processes inside a cgroup namespace can move into and out of the namespace
+    root if they have proper access to external cgroups.  So from inside a
+    namespace with cgroupns root at /batchjobs/container_id1, and
+    assuming that the global hierarchy is still accessible inside cgroupns:
+
+    # cat /proc/7353/cgroup
+    0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/sub_cgrp_1
+    # echo 7353 > batchjobs/container_id2/cgroup.procs
+    # cat /proc/7353/cgroup
+    0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/../container_id2
+
+    Note that this kind of setup is not encouraged. A task inside cgroup
+    namespace should only be exposed to its own cgroupns hierarchy. Otherwise
+    it makes the virtualization of "/proc/$PID/cgroup" less useful.
+
+(5) Setns to another cgroup namespace is allowed when:
+    (a) the process has CAP_SYS_ADMIN against its current user namespace
+    (b) the process has CAP_SYS_ADMIN against the target cgroup namespace's
+        userns
+    No implicit cgroup changes happen when attaching to another cgroup
+    namespace. It is expected that someone moves the attaching process under
+    the target cgroup namespace root.
+
+(6) When some thread from a multi-threaded process unshares its
+    cgroup namespace, the new cgroupns gets applied to the entire process (all
+    the threads). For the unified hierarchy this is expected as it only allows
+    process level containerization.  For the legacy hierarchies this may be
+    unexpected.  So all the threads in the process will have the same cgroup.
+
+(7) The cgroup namespace is alive as long as there is at least 1
+    process inside it.  When the last process exits, the cgroup
+    namespace is destroyed. The cgroupns root and the actual cgroups
+    remain.
+
+(8) Namespace specific cgroup hierarchy can be mounted by a process running
+    inside a non-init cgroup namespace:
+
+    # mount -t cgroup -o __DEVEL__sane_behavior cgroup $MOUNT_POINT
+
+    This will mount the unified cgroup hierarchy with cgroupns root as the
+    filesystem root. The process needs CAP_SYS_ADMIN against its user and
+    mounts namespaces.
+
 P. Information on Kernel Programming
 
 This section contains kernel programming information in the areas
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH 8/8] Add FS_USERNS_FLAG to cgroup fs
  2015-12-23  4:23 CGroup Namespaces (v8) serge.hallyn
                   ` (6 preceding siblings ...)
  2015-12-23  4:23 ` [PATCH 7/8] cgroup: Add documentation for cgroup namespaces serge.hallyn
@ 2015-12-23  4:23 ` serge.hallyn
  2015-12-28 17:46 ` CGroup Namespaces (v8) Tejun Heo
  2016-01-01  8:19 ` Dan Williams
  9 siblings, 0 replies; 32+ messages in thread
From: serge.hallyn @ 2015-12-23  4:23 UTC (permalink / raw)
  To: linux-kernel
  Cc: adityakali, tj, linux-api, containers, cgroups, lxc-devel, akpm,
	ebiederm, gregkh, lizefan, hannes, Serge Hallyn, Serge Hallyn

From: Serge Hallyn <serge.hallyn@ubuntu.com>

Add FS_USERNS_MOUNT to the cgroup filesystem types, allowing root in a
non-init user namespace to mount them.  This should now be safe, because

1. non-init-root cannot mount a previously unbound subsystem
2. the task doing the mount must be privileged with respect to the
   user namespace owning the cgroup namespace
3. the mounted subsystem will have its current cgroup as the root dentry.
   the permissions will be unchanged, so tasks will receive no new
   privilege over the cgroups which they did not have on the original
   mounts.

Signed-off-by: Serge Hallyn <serge.hallyn@canonical.com>
---
 kernel/cgroup.c |    2 ++
 1 file changed, 2 insertions(+)

diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 99c4443..587247e 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -2208,12 +2208,14 @@ static struct file_system_type cgroup_fs_type = {
 	.name = "cgroup",
 	.mount = cgroup_mount,
 	.kill_sb = cgroup_kill_sb,
+	.fs_flags = FS_USERNS_MOUNT,
 };
 
 static struct file_system_type cgroup2_fs_type = {
 	.name = "cgroup2",
 	.mount = cgroup_mount,
 	.kill_sb = cgroup_kill_sb,
+	.fs_flags = FS_USERNS_MOUNT,
 };
 
 static int cgroup_path_ns(struct cgroup *cgrp, char *buf, size_t buflen,
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* Re: [PATCH 1/8] kernfs: Add API to generate relative kernfs path
  2015-12-23  4:23 ` [PATCH 1/8] kernfs: Add API to generate relative kernfs path serge.hallyn
@ 2015-12-23 16:08   ` Tejun Heo
  2015-12-23 16:36     ` Serge E. Hallyn
  2015-12-23 19:33     ` [PATCH 1/8 v8.2] " Serge E. Hallyn
  2015-12-23 16:24   ` [PATCH 1/8] " Tejun Heo
  1 sibling, 2 replies; 32+ messages in thread
From: Tejun Heo @ 2015-12-23 16:08 UTC (permalink / raw)
  To: serge.hallyn
  Cc: linux-kernel, adityakali, linux-api, containers, cgroups,
	lxc-devel, akpm, ebiederm, gregkh, lizefan, hannes,
	Serge E. Hallyn

Hello, Serge.

On Tue, Dec 22, 2015 at 10:23:22PM -0600, serge.hallyn@ubuntu.com wrote:
> @@ -164,18 +286,39 @@ void pr_cont_kernfs_name(struct kernfs_node *kn)
>  void pr_cont_kernfs_path(struct kernfs_node *kn)
>  {
>  	unsigned long flags;
> -	char *p;
> +	char *p = NULL;
> +	int sz1, sz2;
>  
>  	spin_lock_irqsave(&kernfs_rename_lock, flags);
>  
> -	p = kernfs_path_locked(kn, kernfs_pr_cont_buf,
> -			       sizeof(kernfs_pr_cont_buf));
> -	if (p)
> -		pr_cont("%s", p);
> -	else
> -		pr_cont("<name too long>");
> +	sz1 = kernfs_path_from_node_locked(kn, NULL, kernfs_pr_cont_buf,
> +					   sizeof(kernfs_pr_cont_buf));
> +	if (sz1 < 0) {
> +		pr_cont("(error)");
> +		goto out;
> +	}
> +
> +	if (sz1 < sizeof(kernfs_pr_cont_buf)) {
> +		pr_cont("%s", kernfs_pr_cont_buf);
> +		goto out;
> +	}
> +
> +	p = kmalloc(sz1 + 1, GFP_NOFS);

We can't do GFP_NOFS allocation while holding a spinlock and we don't
want to do atomic allocation here either.  I think it'd be best to
keep using the static buffer.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 3/8] cgroup: introduce cgroup namespaces
  2015-12-23  4:23 ` [PATCH 3/8] cgroup: introduce cgroup namespaces serge.hallyn
@ 2015-12-23 16:15   ` Tejun Heo
  2015-12-23 19:34     ` [PATCH 3/8 v8.2] " Serge E. Hallyn
  0 siblings, 1 reply; 32+ messages in thread
From: Tejun Heo @ 2015-12-23 16:15 UTC (permalink / raw)
  To: serge.hallyn
  Cc: linux-kernel, adityakali, linux-api, containers, cgroups,
	lxc-devel, akpm, ebiederm, gregkh, lizefan, hannes, Serge Hallyn

Hello,

On Tue, Dec 22, 2015 at 10:23:24PM -0600, serge.hallyn@ubuntu.com wrote:
...
> +char *cgroup_path(struct cgroup *cgrp, char *buf, size_t buflen)
> +{
> +	int ret;
> +	
> +	ret = cgroup_path_ns(cgrp, buf, buflen, current->nsproxy->cgroup_ns);
> +	if (ret < 0 || ret >= buflen)
> +		return NULL;
> +	return buf;
> +}
> +EXPORT_SYMBOL_GPL(cgroup_path);

Shouldn't this function check whether it's being called from a task
context?  e.g. it could be being called from an irq handler and may
end up scoping the path incorrectly.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 1/8] kernfs: Add API to generate relative kernfs path
  2015-12-23  4:23 ` [PATCH 1/8] kernfs: Add API to generate relative kernfs path serge.hallyn
  2015-12-23 16:08   ` Tejun Heo
@ 2015-12-23 16:24   ` Tejun Heo
  2015-12-23 16:51     ` Greg KH
  1 sibling, 1 reply; 32+ messages in thread
From: Tejun Heo @ 2015-12-23 16:24 UTC (permalink / raw)
  To: serge.hallyn
  Cc: linux-kernel, adityakali, linux-api, containers, cgroups,
	lxc-devel, akpm, ebiederm, gregkh, lizefan, hannes,
	Serge E. Hallyn

On Tue, Dec 22, 2015 at 10:23:22PM -0600, serge.hallyn@ubuntu.com wrote:
> From: Aditya Kali <adityakali@google.com>
> 
> The new function kernfs_path_from_node() generates and returns kernfs
> path of a given kernfs_node relative to a given parent kernfs_node.
> 
> Signed-off-by: Aditya Kali <adityakali@google.com>
> Signed-off-by: Serge E. Hallyn <serge.hallyn@canonical.com>

Greg, can I route this together with other changes?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 5/8] kernfs: define kernfs_node_dentry
  2015-12-23  4:23 ` [PATCH 5/8] kernfs: define kernfs_node_dentry serge.hallyn
@ 2015-12-23 16:25   ` Tejun Heo
  2015-12-23 16:51     ` Greg KH
  0 siblings, 1 reply; 32+ messages in thread
From: Tejun Heo @ 2015-12-23 16:25 UTC (permalink / raw)
  To: serge.hallyn
  Cc: linux-kernel, adityakali, linux-api, containers, cgroups,
	lxc-devel, akpm, ebiederm, gregkh, lizefan, hannes,
	Serge E. Hallyn

On Tue, Dec 22, 2015 at 10:23:26PM -0600, serge.hallyn@ubuntu.com wrote:
> From: Aditya Kali <adityakali@google.com>
> 
> Add a new kernfs api is added to lookup the dentry for a particular
> kernfs path.
> 
> Signed-off-by: Aditya Kali <adityakali@google.com>
> Signed-off-by: Serge E. Hallyn <serge.hallyn@canonical.com>

Greg, this is the other kernfs change in the series.  Can I route this
through the cgroup tree with other changes?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 1/8] kernfs: Add API to generate relative kernfs path
  2015-12-23 16:08   ` Tejun Heo
@ 2015-12-23 16:36     ` Serge E. Hallyn
  2015-12-23 19:33     ` [PATCH 1/8 v8.2] " Serge E. Hallyn
  1 sibling, 0 replies; 32+ messages in thread
From: Serge E. Hallyn @ 2015-12-23 16:36 UTC (permalink / raw)
  To: Tejun Heo
  Cc: serge.hallyn, linux-api, containers, hannes, linux-kernel,
	ebiederm, lxc-devel, gregkh, cgroups, akpm

On Wed, Dec 23, 2015 at 11:08:54AM -0500, Tejun Heo wrote:
> Hello, Serge.
> 
> On Tue, Dec 22, 2015 at 10:23:22PM -0600, serge.hallyn@ubuntu.com wrote:
> > @@ -164,18 +286,39 @@ void pr_cont_kernfs_name(struct kernfs_node *kn)
> >  void pr_cont_kernfs_path(struct kernfs_node *kn)
> >  {
> >  	unsigned long flags;
> > -	char *p;
> > +	char *p = NULL;
> > +	int sz1, sz2;
> >  
> >  	spin_lock_irqsave(&kernfs_rename_lock, flags);
> >  
> > -	p = kernfs_path_locked(kn, kernfs_pr_cont_buf,
> > -			       sizeof(kernfs_pr_cont_buf));
> > -	if (p)
> > -		pr_cont("%s", p);
> > -	else
> > -		pr_cont("<name too long>");
> > +	sz1 = kernfs_path_from_node_locked(kn, NULL, kernfs_pr_cont_buf,
> > +					   sizeof(kernfs_pr_cont_buf));
> > +	if (sz1 < 0) {
> > +		pr_cont("(error)");
> > +		goto out;
> > +	}
> > +
> > +	if (sz1 < sizeof(kernfs_pr_cont_buf)) {
> > +		pr_cont("%s", kernfs_pr_cont_buf);
> > +		goto out;
> > +	}
> > +
> > +	p = kmalloc(sz1 + 1, GFP_NOFS);
> 
> We can't do GFP_NOFS allocation while holding a spinlock and we don't
> want to do atomic allocation here either.  I think it'd be best to
> keep using the static buffer.

D'oh, right.  Will update.
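
For reference, the static-buffer approach suggested above can be modeled in
userspace roughly as follows. This is only a sketch under stated
assumptions: pr_buf stands in for kernfs_pr_cont_buf, format_path() is a
made-up name, and the spinlock that the kernel holds around the formatting
is left out. The point is that an oversized path is reported with a
placeholder instead of triggering an allocation.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Stand-in for kernfs_pr_cont_buf; deliberately small for the example. */
static char pr_buf[16];

/*
 * Format @path into the static buffer and return the text that would
 * be printed.  No memory is allocated: a path that does not fit is
 * replaced by a fixed placeholder string.
 */
const char *format_path(const char *path)
{
	int sz = snprintf(pr_buf, sizeof(pr_buf), "%s", path);

	if (sz < 0)
		return "(error)";
	if ((size_t)sz >= sizeof(pr_buf))
		return "<name too long>";
	return pr_buf;
}
```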

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 5/8] kernfs: define kernfs_node_dentry
  2015-12-23 16:25   ` Tejun Heo
@ 2015-12-23 16:51     ` Greg KH
  0 siblings, 0 replies; 32+ messages in thread
From: Greg KH @ 2015-12-23 16:51 UTC (permalink / raw)
  To: Tejun Heo
  Cc: serge.hallyn, linux-kernel, adityakali, linux-api, containers,
	cgroups, lxc-devel, akpm, ebiederm, lizefan, hannes,
	Serge E. Hallyn

On Wed, Dec 23, 2015 at 11:25:15AM -0500, Tejun Heo wrote:
> On Tue, Dec 22, 2015 at 10:23:26PM -0600, serge.hallyn@ubuntu.com wrote:
> > From: Aditya Kali <adityakali@google.com>
> > 
> > Add a new kernfs api is added to lookup the dentry for a particular
> > kernfs path.
> > 
> > Signed-off-by: Aditya Kali <adityakali@google.com>
> > Signed-off-by: Serge E. Hallyn <serge.hallyn@canonical.com>
> 
> Greg, this is the other kernfs change in the series.  Can I route this
> through the cgroup tree with other changes?

Yes, please do.

Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 1/8] kernfs: Add API to generate relative kernfs path
  2015-12-23 16:24   ` [PATCH 1/8] " Tejun Heo
@ 2015-12-23 16:51     ` Greg KH
  0 siblings, 0 replies; 32+ messages in thread
From: Greg KH @ 2015-12-23 16:51 UTC (permalink / raw)
  To: Tejun Heo
  Cc: serge.hallyn, linux-kernel, adityakali, linux-api, containers,
	cgroups, lxc-devel, akpm, ebiederm, lizefan, hannes,
	Serge E. Hallyn

On Wed, Dec 23, 2015 at 11:24:33AM -0500, Tejun Heo wrote:
> On Tue, Dec 22, 2015 at 10:23:22PM -0600, serge.hallyn@ubuntu.com wrote:
> > From: Aditya Kali <adityakali@google.com>
> > 
> > The new function kernfs_path_from_node() generates and returns kernfs
> > path of a given kernfs_node relative to a given parent kernfs_node.
> > 
> > Signed-off-by: Aditya Kali <adityakali@google.com>
> > Signed-off-by: Serge E. Hallyn <serge.hallyn@canonical.com>
> 
> Greg, can I route this together with other changes?

Yes, please do:

Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 1/8 v8.2] kernfs: Add API to generate relative kernfs path
  2015-12-23 16:08   ` Tejun Heo
  2015-12-23 16:36     ` Serge E. Hallyn
@ 2015-12-23 19:33     ` Serge E. Hallyn
  1 sibling, 0 replies; 32+ messages in thread
From: Serge E. Hallyn @ 2015-12-23 19:33 UTC (permalink / raw)
  To: Tejun Heo
  Cc: serge.hallyn, linux-api, containers, hannes, linux-kernel,
	ebiederm, lxc-devel, gregkh, cgroups, akpm

The new function kernfs_path_from_node() generates and returns kernfs
path of a given kernfs_node relative to a given parent kernfs_node.
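
As a rough userspace model of the intended result (not the kernel code
itself), the sketch below builds the same kind of relative path over a
simplified node type. struct node, depth(), common_ancestor() and
path_from_node() are names made up for this illustration; the return
convention (full length of the path, or -1 when the nodes share no
ancestor) mirrors the one described for kernfs_path_from_node_locked().

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Simplified stand-in for struct kernfs_node. */
struct node {
	const char *name;
	struct node *parent;
};

/* Number of parent links from @n up to its root. */
size_t depth(const struct node *n)
{
	size_t d = 0;

	while (n->parent) {
		d++;
		n = n->parent;
	}
	return d;
}

/* Deepest node that is an ancestor of (or equal to) both @a and @b. */
const struct node *common_ancestor(const struct node *a,
				   const struct node *b)
{
	size_t da = depth(a), db = depth(b);

	while (da > db) {
		a = a->parent;
		da--;
	}
	while (db > da) {
		b = b->parent;
		db--;
	}
	/* different trees end with a == b == NULL */
	while (a != b) {
		a = a->parent;
		b = b->parent;
	}
	return a;
}

/*
 * Write the path of @to as seen from @from into @buf.  Returns the full
 * length of the path; if that is >= @buflen nothing useful was written.
 */
int path_from_node(const struct node *to, const struct node *from,
		   char *buf, size_t buflen)
{
	const struct node *common = common_ancestor(from, to);
	const struct node *kn;
	size_t len = 0, nlen = 0;
	char *p;

	if (!common)
		return -1;
	if (from == to)
		return snprintf(buf, buflen, "/");

	/* one "/.." per level between @from and the common ancestor */
	for (kn = from; kn != common; kn = kn->parent)
		len += 3;
	/* '/' plus name for each level between the ancestor and @to */
	for (kn = to; kn != common; kn = kn->parent)
		nlen += strlen(kn->name) + 1;

	if (len + nlen >= buflen)
		return (int)(len + nlen);

	for (p = buf, kn = from; kn != common; kn = kn->parent, p += 3)
		memcpy(p, "/..", 3);

	/* fill in the components of @to from right to left */
	p = buf + len + nlen;
	*p = '\0';
	for (kn = to; kn != common; kn = kn->parent) {
		size_t n = strlen(kn->name);

		p -= n;
		memcpy(p, kn->name, n);
		*--p = '/';
	}
	return (int)(len + nlen);
}
```

With from = /n1/n2/n3/n4 and to = /n1/n2/n5 this yields "/../../n5", which
matches scenario [2] in the kernel-doc comment of the patch.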

Signed-off-by: Aditya Kali <adityakali@google.com>
Signed-off-by: Serge E. Hallyn <serge.hallyn@canonical.com>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
---
Changelog 20151125:
  - Fully-wing multiline comments
  - Rework kernfs_path_from_node_locked() logic
  - Replace BUG_ONs with returning NULL
  - Use a const char* for /.. and precalculate its size
Changelog 20151130:
  - Update kernfs_path_from_node_locked comment
Changelog 20151208:
  - kernfs_node_distance:
    * Remove BUG_ON(NULL)s
    * Rename kernfs_node_distance to kernfs_depth
  - kernfs_common_ancestor:
    * Remove useless checks for depth == 0
    * Add check to ensure nodes are from same root
  - kernfs_path_from_node_locked:
    * Remove needless __must_check
    * Put p;len on its own decl line.
    * Fix wrong WARN_ONCE usage
Changelog 20151209:
  - kernfs_path_from_node: change arguments to 'to' and 'from', and
    change their order.
Changelog 20151222:
  - kernfs_path_from_node{,_locked}: return the string length.
    kernfs_path is gpl-exported, so changing its return value seemed
    ill-advised, but if no one minds I can update it too.
Changelog 20151223:
  - don't allocate memory in pr_cont_kernfs_path() under spinlock
---
 fs/kernfs/dir.c        |  192 ++++++++++++++++++++++++++++++++++++++++--------
 include/linux/kernfs.h |    9 ++-
 2 files changed, 166 insertions(+), 35 deletions(-)

diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
index 742bf4a..8cf0810 100644
--- a/fs/kernfs/dir.c
+++ b/fs/kernfs/dir.c
@@ -44,28 +44,123 @@ static int kernfs_name_locked(struct kernfs_node *kn, char *buf, size_t buflen)
 	return strlcpy(buf, kn->parent ? kn->name : "/", buflen);
 }
 
-static char * __must_check kernfs_path_locked(struct kernfs_node *kn, char *buf,
-					      size_t buflen)
+/* kernfs_depth - compute depth from @from to @to */
+static size_t kernfs_depth(struct kernfs_node *from, struct kernfs_node *to)
 {
-	char *p = buf + buflen;
-	int len;
+	size_t depth = 0;
 
-	*--p = '\0';
+	while (to->parent && to != from) {
+		depth++;
+		to = to->parent;
+	}
+	return depth;
+}
 
-	do {
-		len = strlen(kn->name);
-		if (p - buf < len + 1) {
-			buf[0] = '\0';
-			p = NULL;
-			break;
-		}
-		p -= len;
-		memcpy(p, kn->name, len);
-		*--p = '/';
-		kn = kn->parent;
-	} while (kn && kn->parent);
+static struct kernfs_node *kernfs_common_ancestor(struct kernfs_node *a,
+						  struct kernfs_node *b)
+{
+	size_t da, db;
+	struct kernfs_root *ra = kernfs_root(a), *rb = kernfs_root(b);
+
+	if (ra != rb)
+		return NULL;
+
+	da = kernfs_depth(ra->kn, a);
+	db = kernfs_depth(rb->kn, b);
+
+	while (da > db) {
+		a = a->parent;
+		da--;
+	}
+	while (db > da) {
+		b = b->parent;
+		db--;
+	}
+
+	/* worst case b and a will be the same at root */
+	while (b != a) {
+		b = b->parent;
+		a = a->parent;
+	}
+
+	return a;
+}
+
+/**
+ * kernfs_path_from_node_locked - find a pseudo-absolute path to @kn_to,
+ * where kn_from is treated as root of the path.
+ * @kn_from: kernfs node which should be treated as root for the path
+ * @kn_to: kernfs node to which path is needed
+ * @buf: buffer to copy the path into
+ * @buflen: size of @buf
+ *
+ * We need to handle a couple of scenarios here:
+ * [1] when @kn_from is an ancestor of @kn_to at some level
+ * kn_from: /n1/n2/n3
+ * kn_to:   /n1/n2/n3/n4/n5
+ * result:  /n4/n5
+ *
+ * [2] when @kn_from is on a different hierarchy and we need to find common
+ * ancestor between @kn_from and @kn_to.
+ * kn_from: /n1/n2/n3/n4
+ * kn_to:   /n1/n2/n5
+ * result:  /../../n5
+ * OR
+ * kn_from: /n1/n2/n3/n4/n5   [depth=5]
+ * kn_to:   /n1/n2/n3         [depth=3]
+ * result:  /../..
+ *
+ * return value: length of the string.  If greater than or equal to @buflen,
+ * then contents of @buf are undefined.  On error, -1 is returned.
+ */
+static int
+kernfs_path_from_node_locked(struct kernfs_node *kn_to,
+			     struct kernfs_node *kn_from, char *buf,
+			     size_t buflen)
+{
+	struct kernfs_node *kn, *common;
+	const char parent_str[] = "/..";
+	size_t depth_from, depth_to, len = 0, nlen = 0;
+	char *p;
+	int i;
+
+	if (!kn_from)
+		kn_from = kernfs_root(kn_to)->kn;
+
+	if (kn_from == kn_to)
+		return strlcpy(buf, "/", buflen);
+
+	common = kernfs_common_ancestor(kn_from, kn_to);
+	if (WARN_ON(!common))
+		return -1;
+
+	depth_to = kernfs_depth(common, kn_to);
+	depth_from = kernfs_depth(common, kn_from);
+
+	if (buf)
+		buf[0] = '\0';
 
-	return p;
+	for (i = 0; i < depth_from; i++)
+		len += strlcpy(buf + len, parent_str,
+			       len < buflen ? buflen - len : 0);
+
+	/* Calculate how many bytes we need for the rest */
+	for (kn = kn_to; kn != common; kn = kn->parent)
+		nlen += strlen(kn->name) + 1;
+
+	if (len + nlen >= buflen)
+		return len + nlen;
+
+	p = buf + len + nlen;
+	*p = '\0';
+	for (kn = kn_to; kn != common; kn = kn->parent) {
+		nlen = strlen(kn->name);
+		p -= nlen;
+		memcpy(p, kn->name, nlen);
+		*(--p) = '/';
+	}
+
+	return len + nlen;
 }
 
 /**
@@ -115,6 +210,34 @@ size_t kernfs_path_len(struct kernfs_node *kn)
 }
 
 /**
+ * kernfs_path_from_node - build path of node @to relative to @from.
+ * @from: parent kernfs_node relative to which we need to build the path
+ * @to: kernfs_node of interest
+ * @buf: buffer to copy @to's path into
+ * @buflen: size of @buf
+ *
+ * Builds @to's path relative to @from in @buf. @from and @to must
+ * be on the same kernfs-root. If @from is not an ancestor of @to, then a
+ * relative path (which includes '..'s) as needed to reach from @from to @to
+ * is returned.
+ *
+ * If @buf isn't long enough, the return value will be greater than or equal
+ * to @buflen and @buf contents are undefined.
+ */
+int kernfs_path_from_node(struct kernfs_node *to, struct kernfs_node *from,
+			  char *buf, size_t buflen)
+{
+	unsigned long flags;
+	int ret;
+
+	spin_lock_irqsave(&kernfs_rename_lock, flags);
+	ret = kernfs_path_from_node_locked(to, from, buf, buflen);
+	spin_unlock_irqrestore(&kernfs_rename_lock, flags);
+	return ret;
+}
+EXPORT_SYMBOL_GPL(kernfs_path_from_node);
+
+/**
  * kernfs_path - build full path of a given node
  * @kn: kernfs_node of interest
  * @buf: buffer to copy @kn's name into
@@ -127,13 +250,12 @@ size_t kernfs_path_len(struct kernfs_node *kn)
  */
 char *kernfs_path(struct kernfs_node *kn, char *buf, size_t buflen)
 {
-	unsigned long flags;
-	char *p;
+	int ret;
 
-	spin_lock_irqsave(&kernfs_rename_lock, flags);
-	p = kernfs_path_locked(kn, buf, buflen);
-	spin_unlock_irqrestore(&kernfs_rename_lock, flags);
-	return p;
+	ret = kernfs_path_from_node(kn, NULL, buf, buflen);
+	if (ret < 0 || ret >= buflen)
+		return NULL;
+	return buf;
 }
 EXPORT_SYMBOL_GPL(kernfs_path);
 
@@ -164,17 +286,25 @@ void pr_cont_kernfs_name(struct kernfs_node *kn)
 void pr_cont_kernfs_path(struct kernfs_node *kn)
 {
 	unsigned long flags;
-	char *p;
+	int sz;
 
 	spin_lock_irqsave(&kernfs_rename_lock, flags);
 
-	p = kernfs_path_locked(kn, kernfs_pr_cont_buf,
-			       sizeof(kernfs_pr_cont_buf));
-	if (p)
-		pr_cont("%s", p);
-	else
-		pr_cont("<name too long>");
+	sz = kernfs_path_from_node_locked(kn, NULL, kernfs_pr_cont_buf,
+					  sizeof(kernfs_pr_cont_buf));
+	if (sz < 0) {
+		pr_cont("(error)");
+		goto out;
+	}
+
+	if (sz >= sizeof(kernfs_pr_cont_buf)) {
+		pr_cont("(name too long)");
+		goto out;
+	}
+
+	pr_cont("%s", kernfs_pr_cont_buf);
 
+out:
 	spin_unlock_irqrestore(&kernfs_rename_lock, flags);
 }
 
diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
index af51df3..716bfde 100644
--- a/include/linux/kernfs.h
+++ b/include/linux/kernfs.h
@@ -267,8 +267,9 @@ static inline bool kernfs_ns_enabled(struct kernfs_node *kn)
 
 int kernfs_name(struct kernfs_node *kn, char *buf, size_t buflen);
 size_t kernfs_path_len(struct kernfs_node *kn);
-char * __must_check kernfs_path(struct kernfs_node *kn, char *buf,
-				size_t buflen);
+int kernfs_path_from_node(struct kernfs_node *to, struct kernfs_node *from,
+			  char *buf, size_t buflen);
+char *kernfs_path(struct kernfs_node *kn, char *buf, size_t buflen);
 void pr_cont_kernfs_name(struct kernfs_node *kn);
 void pr_cont_kernfs_path(struct kernfs_node *kn);
 struct kernfs_node *kernfs_get_parent(struct kernfs_node *kn);
@@ -338,8 +339,8 @@ static inline int kernfs_name(struct kernfs_node *kn, char *buf, size_t buflen)
 static inline size_t kernfs_path_len(struct kernfs_node *kn)
 { return 0; }
 
-static inline char * __must_check kernfs_path(struct kernfs_node *kn, char *buf,
-					      size_t buflen)
+static inline char *kernfs_path(struct kernfs_node *kn, char *buf,
+				size_t buflen)
 { return NULL; }
 
 static inline void pr_cont_kernfs_name(struct kernfs_node *kn) { }
-- 
1.7.9.5
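
As an illustration only (not part of the patch): the relative-path
scenarios listed in the kernfs_path_from_node_locked() comment can be
modelled in userspace.  The sketch below assumes a hypothetical
"struct node" holding just a name and a parent pointer in place of
struct kernfs_node; path_from_node() and its helpers are made-up names,
and locking is ignored.  It follows the same idea as the patch: find the
common ancestor, emit one "/.." per level from @from up to it, then fill
in the names leading down to @to from the back of the buffer.

```c
#include <string.h>

/* Hypothetical stand-in for struct kernfs_node: a name and a parent only. */
struct node {
	const char *name;
	const struct node *parent;
};

/* Number of parent links between @n and the root of its tree. */
static size_t node_depth(const struct node *n)
{
	size_t depth = 0;

	for (; n->parent; n = n->parent)
		depth++;
	return depth;
}

/* Deepest node that is @a, @b, or an ancestor of both; NULL if the two
 * nodes live in different trees. */
static const struct node *common_ancestor(const struct node *a,
					   const struct node *b)
{
	size_t da = node_depth(a), db = node_depth(b);

	for (; da > db; da--)
		a = a->parent;
	for (; db > da; db--)
		b = b->parent;
	while (a != b) {	/* both reach NULL together if roots differ */
		a = a->parent;
		b = b->parent;
	}
	return a;
}

/* Build the path of @to as seen from @from.  Returns the path length
 * (snprintf-style: a value >= buflen means @buf was too small and was not
 * filled in), or -1 if there is no common ancestor. */
static int path_from_node(const struct node *to, const struct node *from,
			  char *buf, size_t buflen)
{
	const struct node *common = common_ancestor(from, to), *n;
	size_t ups = 0, len, pos, i;

	if (!common)
		return -1;

	for (n = from; n != common; n = n->parent)
		ups += 3;			/* one "/.." per level up */
	len = ups;
	for (n = to; n != common; n = n->parent)
		len += strlen(n->name) + 1;	/* "/" plus the name */

	if (len == 0) {				/* @from == @to */
		if (buflen >= 2)
			strcpy(buf, "/");
		return 1;
	}
	if (len >= buflen)
		return (int)len;

	for (i = 0; i < ups; i += 3)
		memcpy(buf + i, "/..", 3);

	pos = len;
	buf[pos] = '\0';
	for (n = to; n != common; n = n->parent) {	/* fill back to front */
		size_t nlen = strlen(n->name);

		pos -= nlen;
		memcpy(buf + pos, n->name, nlen);
		buf[--pos] = '/';
	}
	return (int)len;
}
```

With a tree /n1/n2/n3/n4/n5 plus a second node /n1/n2/n5, calling
path_from_node() with the @kn_from/@kn_to pairs from the comment should
give "/n4/n5", "/../../n5" and "/../..".  A caller would compare the
return value against the buffer size in the same way kernfs_path() does.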


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* Re: [PATCH 3/8 v8.2] cgroup: introduce cgroup namespaces
  2015-12-23 16:15   ` Tejun Heo
@ 2015-12-23 19:34     ` Serge E. Hallyn
  0 siblings, 0 replies; 32+ messages in thread
From: Serge E. Hallyn @ 2015-12-23 19:34 UTC (permalink / raw)
  To: Tejun Heo
  Cc: serge.hallyn, linux-api, containers, hannes, linux-kernel,
	ebiederm, lxc-devel, gregkh, cgroups, akpm

Introduce the ability to create a new cgroup namespace. The newly created
cgroup namespace remembers the cgroup of the process at the point
of creation of the cgroup namespace (referred to as cgroupns-root).
The main purpose of the cgroup namespace is to virtualize the contents
of the /proc/self/cgroup file. Processes inside a cgroup namespace
are only able to see paths relative to their namespace root
(unless they are moved outside of their cgroupns-root, at which point
they will see a relative path from their cgroupns-root).
For a correctly set up container, this enables container tools
(like libcontainer, lxc, lmctfy, etc.) to create completely virtualized
containers without leaking the system-level cgroup hierarchy to the task.
This patch only implements the 'unshare' part of the cgroupns.
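
For illustration only (not part of the patch), the 'unshare' path
described above could be exercised from userspace roughly as follows.
This is a sketch under assumptions: read_own_cgroup() and
enter_new_cgroup_ns() are hypothetical helper names, the fallback value
for CLONE_NEWCGROUP is the one stated in the cover letter, and the call
is expected to fail (for example with EPERM or EINVAL) on kernels
without this series or for callers lacking CAP_SYS_ADMIN.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE		/* for unshare() in <sched.h> */
#endif
#include <errno.h>
#include <sched.h>
#include <stdio.h>
#include <string.h>

#ifndef CLONE_NEWCGROUP
#define CLONE_NEWCGROUP 0x02000000	/* value given in the cover letter */
#endif

/* Copy the first line of /proc/self/cgroup (without the newline) into @buf.
 * Returns 0 on success, -1 if the file is missing or empty. */
static int read_own_cgroup(char *buf, size_t buflen)
{
	FILE *f = fopen("/proc/self/cgroup", "r");

	if (!f)
		return -1;
	if (!fgets(buf, (int)buflen, f)) {
		fclose(f);
		return -1;
	}
	fclose(f);
	buf[strcspn(buf, "\n")] = '\0';
	return 0;
}

/* Unshare the cgroup namespace of the calling process.
 * Returns 0 on success or a negative errno value on failure. */
static int enter_new_cgroup_ns(void)
{
	if (unshare(CLONE_NEWCGROUP) < 0)
		return -errno;
	return 0;
}
```

A caller would typically read /proc/self/cgroup once before and once
after enter_new_cgroup_ns() to compare the two views; per the semantics
above, the second read is expected to show paths relative to the new
cgroupns-root.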

Signed-off-by: Aditya Kali <adityakali@google.com>
Signed-off-by: Serge Hallyn <serge.hallyn@canonical.com>
---
Changelog: 2015-11-24
	- move cgroup_namespace.c into cgroup.c (and .h)
	- reformatting
	- make get_cgroup_ns return void
	- rename ns->root_cgrps to root_cset.
Changelog: 2015-12-08
	- Move init_cgroup_ns to other variable declarations
	- Remove accidental conversion of put_css_set to inline
	- Drop BUG_ON(NULL)
	- Remove unneeded pre declaration of struct cgroupns_operations.
	- cgroup.h: collect common ns declarations
Changelog: 2015-12-09
	- cgroup.h: move ns declarations to bottom
	- cgroup.c: undo all accidental conversions to inline
Changelog: 2015-12-22
	- update for new kernfs_path_from_node() return value.  Since
	  cgroup_path was already gpl-exported, I abstained from updating
	  its return value.
Changelog: 2015-12-23
	- cgroup_path(): use init_cgroup_ns when in interrupt context.
---
 fs/proc/namespaces.c    |    3 +
 include/linux/cgroup.h  |   54 +++++++++++++++--
 include/linux/nsproxy.h |    2 +
 include/linux/proc_ns.h |    4 ++
 kernel/cgroup.c         |  148 +++++++++++++++++++++++++++++++++++++++++++++++
 kernel/fork.c           |    2 +-
 kernel/nsproxy.c        |   21 ++++++-
 7 files changed, 225 insertions(+), 9 deletions(-)

diff --git a/fs/proc/namespaces.c b/fs/proc/namespaces.c
index f6e8354..bd61075 100644
--- a/fs/proc/namespaces.c
+++ b/fs/proc/namespaces.c
@@ -28,6 +28,9 @@ static const struct proc_ns_operations *ns_entries[] = {
 	&userns_operations,
 #endif
 	&mntns_operations,
+#ifdef CONFIG_CGROUPS
+	&cgroupns_operations,
+#endif
 };
 
 static const char *proc_ns_follow_link(struct dentry *dentry, void **cookie)
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 9d70b48..6d0992f 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -17,6 +17,11 @@
 #include <linux/seq_file.h>
 #include <linux/kernfs.h>
 #include <linux/jump_label.h>
+#include <linux/nsproxy.h>
+#include <linux/types.h>
+#include <linux/ns_common.h>
+#include <linux/nsproxy.h>
+#include <linux/user_namespace.h>
 
 #include <linux/cgroup-defs.h>
 
@@ -532,12 +537,6 @@ static inline int cgroup_name(struct cgroup *cgrp, char *buf, size_t buflen)
 	return kernfs_name(cgrp->kn, buf, buflen);
 }
 
-static inline char * __must_check cgroup_path(struct cgroup *cgrp, char *buf,
-					      size_t buflen)
-{
-	return kernfs_path(cgrp->kn, buf, buflen);
-}
-
 static inline void pr_cont_cgroup_name(struct cgroup *cgrp)
 {
 	pr_cont_kernfs_name(cgrp->kn);
@@ -570,4 +569,47 @@ static inline int cgroup_init(void) { return 0; }
 
 #endif /* !CONFIG_CGROUPS */
 
+struct cgroup_namespace {
+	atomic_t		count;
+	struct ns_common	ns;
+	struct user_namespace	*user_ns;
+	struct css_set          *root_cset;
+};
+
+extern struct cgroup_namespace init_cgroup_ns;
+
+#ifdef CONFIG_CGROUPS
+
+void free_cgroup_ns(struct cgroup_namespace *ns);
+
+struct cgroup_namespace *
+copy_cgroup_ns(unsigned long flags, struct user_namespace *user_ns,
+	       struct cgroup_namespace *old_ns);
+
+char *cgroup_path(struct cgroup *cgrp, char *buf, size_t buflen);
+
+#else /* !CONFIG_CGROUPS */
+
+static inline void free_cgroup_ns(struct cgroup_namespace *ns) { }
+static inline struct cgroup_namespace *
+copy_cgroup_ns(unsigned long flags, struct user_namespace *user_ns,
+	       struct cgroup_namespace *old_ns)
+{
+	return old_ns;
+}
+
+#endif /* !CONFIG_CGROUPS */
+
+static inline void get_cgroup_ns(struct cgroup_namespace *ns)
+{
+	if (ns)
+		atomic_inc(&ns->count);
+}
+
+static inline void put_cgroup_ns(struct cgroup_namespace *ns)
+{
+	if (ns && atomic_dec_and_test(&ns->count))
+		free_cgroup_ns(ns);
+}
+
 #endif /* _LINUX_CGROUP_H */
diff --git a/include/linux/nsproxy.h b/include/linux/nsproxy.h
index 35fa08f..ac0d65b 100644
--- a/include/linux/nsproxy.h
+++ b/include/linux/nsproxy.h
@@ -8,6 +8,7 @@ struct mnt_namespace;
 struct uts_namespace;
 struct ipc_namespace;
 struct pid_namespace;
+struct cgroup_namespace;
 struct fs_struct;
 
 /*
@@ -33,6 +34,7 @@ struct nsproxy {
 	struct mnt_namespace *mnt_ns;
 	struct pid_namespace *pid_ns_for_children;
 	struct net 	     *net_ns;
+	struct cgroup_namespace *cgroup_ns;
 };
 extern struct nsproxy init_nsproxy;
 
diff --git a/include/linux/proc_ns.h b/include/linux/proc_ns.h
index 42dfc61..de0e771 100644
--- a/include/linux/proc_ns.h
+++ b/include/linux/proc_ns.h
@@ -9,6 +9,8 @@
 struct pid_namespace;
 struct nsproxy;
 struct path;
+struct task_struct;
+struct inode;
 
 struct proc_ns_operations {
 	const char *name;
@@ -24,6 +26,7 @@ extern const struct proc_ns_operations ipcns_operations;
 extern const struct proc_ns_operations pidns_operations;
 extern const struct proc_ns_operations userns_operations;
 extern const struct proc_ns_operations mntns_operations;
+extern const struct proc_ns_operations cgroupns_operations;
 
 /*
  * We always define these enumerators
@@ -34,6 +37,7 @@ enum {
 	PROC_UTS_INIT_INO	= 0xEFFFFFFEU,
 	PROC_USER_INIT_INO	= 0xEFFFFFFDU,
 	PROC_PID_INIT_INO	= 0xEFFFFFFCU,
+	PROC_CGROUP_INIT_INO	= 0xEFFFFFFBU,
 };
 
 #ifdef CONFIG_PROC_FS
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 6b33631..3d9a0bb 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -57,6 +57,9 @@
 #include <linux/vmalloc.h> /* TODO: replace with more sophisticated array */
 #include <linux/kthread.h>
 #include <linux/delay.h>
+#include <linux/proc_ns.h>
+#include <linux/nsproxy.h>
+#include <linux/proc_ns.h>
 
 #include <linux/atomic.h>
 
@@ -208,6 +211,15 @@ static unsigned long have_fork_callback __read_mostly;
 static unsigned long have_exit_callback __read_mostly;
 static unsigned long have_free_callback __read_mostly;
 
+/* Cgroup namespace for init task */
+struct cgroup_namespace init_cgroup_ns = {
+	.count		= { .counter = 2, },
+	.user_ns	= &init_user_ns,
+	.ns.ops		= &cgroupns_operations,
+	.ns.inum	= PROC_CGROUP_INIT_INO,
+	.root_cset	= &init_css_set,
+};
+
 /* Ditto for the can_fork callback. */
 static unsigned long have_canfork_callback __read_mostly;
 
@@ -2166,6 +2178,30 @@ static struct file_system_type cgroup2_fs_type = {
 	.kill_sb = cgroup_kill_sb,
 };
 
+static int cgroup_path_ns(struct cgroup *cgrp, char *buf, size_t buflen,
+			  struct cgroup_namespace *ns)
+{
+	struct cgroup *root;
+
+	root = cset_cgroup_from_root(ns->root_cset, cgrp->root);
+	return kernfs_path_from_node(cgrp->kn, root->kn, buf, buflen);
+}
+
+char *cgroup_path(struct cgroup *cgrp, char *buf, size_t buflen)
+{
+	int ret;
+	struct cgroup_namespace *ns = &init_cgroup_ns;
+
+	if (!in_interrupt())
+		ns = current->nsproxy->cgroup_ns;
+	
+	ret = cgroup_path_ns(cgrp, buf, buflen, ns);
+	if (ret < 0 || ret >= buflen)
+		return NULL;
+	return buf;
+}
+EXPORT_SYMBOL_GPL(cgroup_path);
+
 /**
  * task_cgroup_path - cgroup path of a task in the first cgroup hierarchy
  * @task: target task
@@ -5272,6 +5308,8 @@ int __init cgroup_init(void)
 	BUG_ON(cgroup_init_cftypes(NULL, cgroup_dfl_base_files));
 	BUG_ON(cgroup_init_cftypes(NULL, cgroup_legacy_base_files));
 
+	get_user_ns(init_cgroup_ns.user_ns);
+
 	mutex_lock(&cgroup_mutex);
 
 	/* Add init_css_set to the hash table */
@@ -5822,6 +5860,116 @@ struct cgroup *cgroup_get_from_path(const char *path)
 }
 EXPORT_SYMBOL_GPL(cgroup_get_from_path);
 
+/* cgroup namespaces */
+
+static struct cgroup_namespace *alloc_cgroup_ns(void)
+{
+	struct cgroup_namespace *new_ns;
+	int ret;
+
+	new_ns = kzalloc(sizeof(struct cgroup_namespace), GFP_KERNEL);
+	if (!new_ns)
+		return ERR_PTR(-ENOMEM);
+	ret = ns_alloc_inum(&new_ns->ns);
+	if (ret) {
+		kfree(new_ns);
+		return ERR_PTR(ret);
+	}
+	atomic_set(&new_ns->count, 1);
+	new_ns->ns.ops = &cgroupns_operations;
+	return new_ns;
+}
+
+void free_cgroup_ns(struct cgroup_namespace *ns)
+{
+	put_css_set(ns->root_cset);
+	put_user_ns(ns->user_ns);
+	ns_free_inum(&ns->ns);
+	kfree(ns);
+}
+EXPORT_SYMBOL(free_cgroup_ns);
+
+struct cgroup_namespace *
+copy_cgroup_ns(unsigned long flags, struct user_namespace *user_ns,
+	       struct cgroup_namespace *old_ns)
+{
+	struct cgroup_namespace *new_ns = NULL;
+	struct css_set *cset = NULL;
+	int err;
+
+	BUG_ON(!old_ns);
+
+	if (!(flags & CLONE_NEWCGROUP)) {
+		get_cgroup_ns(old_ns);
+		return old_ns;
+	}
+
+	/* Allow only sysadmin to create cgroup namespace. */
+	err = -EPERM;
+	if (!ns_capable(user_ns, CAP_SYS_ADMIN))
+		goto err_out;
+
+	cset = task_css_set(current);
+	get_css_set(cset);
+
+	err = -ENOMEM;
+	new_ns = alloc_cgroup_ns();
+	if (!new_ns)
+		goto err_out;
+
+	new_ns->user_ns = get_user_ns(user_ns);
+	new_ns->root_cset = cset;
+
+	return new_ns;
+
+err_out:
+	if (cset)
+		put_css_set(cset);
+	kfree(new_ns);
+	return ERR_PTR(err);
+}
+
+static int cgroupns_install(struct nsproxy *nsproxy, void *ns)
+{
+	pr_info("setns not supported for cgroup namespace\n");
+	return -EINVAL;
+}
+
+static struct ns_common *cgroupns_get(struct task_struct *task)
+{
+	struct cgroup_namespace *ns = NULL;
+	struct nsproxy *nsproxy;
+
+	task_lock(task);
+	nsproxy = task->nsproxy;
+	if (nsproxy) {
+		ns = nsproxy->cgroup_ns;
+		get_cgroup_ns(ns);
+	}
+	task_unlock(task);
+
+	return ns ? &ns->ns : NULL;
+}
+
+static void cgroupns_put(struct ns_common *ns)
+{
+	put_cgroup_ns(to_cg_ns(ns));
+}
+
+const struct proc_ns_operations cgroupns_operations = {
+	.name		= "cgroup",
+	.type		= CLONE_NEWCGROUP,
+	.get		= cgroupns_get,
+	.put		= cgroupns_put,
+	.install	= cgroupns_install,
+};
+
+static __init int cgroup_namespaces_init(void)
+{
+	return 0;
+}
+subsys_initcall(cgroup_namespaces_init);
+
 #ifdef CONFIG_CGROUP_DEBUG
 static struct cgroup_subsys_state *
 debug_css_alloc(struct cgroup_subsys_state *parent_css)
diff --git a/kernel/fork.c b/kernel/fork.c
index ba7d1c0..7982fee 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1880,7 +1880,7 @@ static int check_unshare_flags(unsigned long unshare_flags)
 	if (unshare_flags & ~(CLONE_THREAD|CLONE_FS|CLONE_NEWNS|CLONE_SIGHAND|
 				CLONE_VM|CLONE_FILES|CLONE_SYSVSEM|
 				CLONE_NEWUTS|CLONE_NEWIPC|CLONE_NEWNET|
-				CLONE_NEWUSER|CLONE_NEWPID))
+				CLONE_NEWUSER|CLONE_NEWPID|CLONE_NEWCGROUP))
 		return -EINVAL;
 	/*
 	 * Not implemented, but pretend it works if there is nothing
diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c
index 49746c8..64fe865 100644
--- a/kernel/nsproxy.c
+++ b/kernel/nsproxy.c
@@ -25,6 +25,7 @@
 #include <linux/proc_ns.h>
 #include <linux/file.h>
 #include <linux/syscalls.h>
+#include <linux/cgroup.h>
 
 static struct kmem_cache *nsproxy_cachep;
 
@@ -39,6 +40,9 @@ struct nsproxy init_nsproxy = {
 #ifdef CONFIG_NET
 	.net_ns			= &init_net,
 #endif
+#ifdef CONFIG_CGROUPS
+	.cgroup_ns		= &init_cgroup_ns,
+#endif
 };
 
 static inline struct nsproxy *create_nsproxy(void)
@@ -92,6 +96,13 @@ static struct nsproxy *create_new_namespaces(unsigned long flags,
 		goto out_pid;
 	}
 
+	new_nsp->cgroup_ns = copy_cgroup_ns(flags, user_ns,
+					    tsk->nsproxy->cgroup_ns);
+	if (IS_ERR(new_nsp->cgroup_ns)) {
+		err = PTR_ERR(new_nsp->cgroup_ns);
+		goto out_cgroup;
+	}
+
 	new_nsp->net_ns = copy_net_ns(flags, user_ns, tsk->nsproxy->net_ns);
 	if (IS_ERR(new_nsp->net_ns)) {
 		err = PTR_ERR(new_nsp->net_ns);
@@ -101,6 +112,9 @@ static struct nsproxy *create_new_namespaces(unsigned long flags,
 	return new_nsp;
 
 out_net:
+	if (new_nsp->cgroup_ns)
+		put_cgroup_ns(new_nsp->cgroup_ns);
+out_cgroup:
 	if (new_nsp->pid_ns_for_children)
 		put_pid_ns(new_nsp->pid_ns_for_children);
 out_pid:
@@ -128,7 +142,8 @@ int copy_namespaces(unsigned long flags, struct task_struct *tsk)
 	struct nsproxy *new_ns;
 
 	if (likely(!(flags & (CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWIPC |
-			      CLONE_NEWPID | CLONE_NEWNET)))) {
+			      CLONE_NEWPID | CLONE_NEWNET |
+			      CLONE_NEWCGROUP)))) {
 		get_nsproxy(old_ns);
 		return 0;
 	}
@@ -165,6 +180,8 @@ void free_nsproxy(struct nsproxy *ns)
 		put_ipc_ns(ns->ipc_ns);
 	if (ns->pid_ns_for_children)
 		put_pid_ns(ns->pid_ns_for_children);
+	if (ns->cgroup_ns)
+		put_cgroup_ns(ns->cgroup_ns);
 	put_net(ns->net_ns);
 	kmem_cache_free(nsproxy_cachep, ns);
 }
@@ -180,7 +197,7 @@ int unshare_nsproxy_namespaces(unsigned long unshare_flags,
 	int err = 0;
 
 	if (!(unshare_flags & (CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWIPC |
-			       CLONE_NEWNET | CLONE_NEWPID)))
+			       CLONE_NEWNET | CLONE_NEWPID | CLONE_NEWCGROUP)))
 		return 0;
 
 	user_ns = new_cred ? new_cred->user_ns : current_user_ns();
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* Re: CGroup Namespaces (v8)
  2015-12-23  4:23 CGroup Namespaces (v8) serge.hallyn
                   ` (7 preceding siblings ...)
  2015-12-23  4:23 ` [PATCH 8/8] Add FS_USERNS_FLAG to cgroup fs serge.hallyn
@ 2015-12-28 17:46 ` Tejun Heo
  2016-01-01  8:19 ` Dan Williams
  9 siblings, 0 replies; 32+ messages in thread
From: Tejun Heo @ 2015-12-28 17:46 UTC (permalink / raw)
  To: serge.hallyn
  Cc: linux-kernel, adityakali, linux-api, containers, cgroups,
	lxc-devel, akpm, ebiederm, gregkh, lizefan, hannes

Applied 1-6 and 8 to cgroup/for-4.5.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 7/8] cgroup: Add documentation for cgroup namespaces
  2015-12-23  4:23 ` [PATCH 7/8] cgroup: Add documentation for cgroup namespaces serge.hallyn
@ 2015-12-28 17:47   ` Tejun Heo
  2015-12-28 21:13     ` Serge Hallyn
  0 siblings, 1 reply; 32+ messages in thread
From: Tejun Heo @ 2015-12-28 17:47 UTC (permalink / raw)
  To: serge.hallyn
  Cc: linux-kernel, adityakali, linux-api, containers, cgroups,
	lxc-devel, akpm, ebiederm, gregkh, lizefan, hannes, Serge Hallyn

Hello,

I did some heavy editing of the documentation.  How does this look?
Did I miss anything?

Thanks.
---
 Documentation/cgroup.txt |  146 +++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 146 insertions(+)

--- a/Documentation/cgroup.txt
+++ b/Documentation/cgroup.txt
@@ -47,6 +47,11 @@ CONTENTS
   5-3. IO
     5-3-1. IO Interface Files
     5-3-2. Writeback
+6. Namespace
+  6-1. Basics
+  6-2. The Root and Views
+  6-3. Migration and setns(2)
+  6-4. Interaction with Other Namespaces
 P. Information on Kernel Programming
   P-1. Filesystem Support for Writeback
 D. Deprecated v1 Core Features
@@ -1013,6 +1018,147 @@ writeback as follows.
 	vm.dirty[_background]_ratio.
 
 
+6. Namespace
+
+6-1. Basics
+
+cgroup namespace provides a mechanism to virtualize the view of the
+"/proc/$PID/cgroup" file.  The CLONE_NEWCGROUP clone flag can be used
+with clone(2) and unshare(2) to create a new cgroup namespace.  The
+process running inside the cgroup namespace will have its
+"/proc/$PID/cgroup" output restricted to cgroupns root.  The cgroupns
+root is the cgroup of the process at the time of creation of the
+cgroup namespace.
+
+Without cgroup namespace, the "/proc/$PID/cgroup" file shows the
+complete path of the cgroup of a process.  In a container setup where
+a set of cgroups and namespaces are intended to isolate processes the
+"/proc/$PID/cgroup" file may leak potential system level information
+to the isolated processes.  For example:
+
+  # cat /proc/self/cgroup
+  0::/batchjobs/container_id1
+
+The path '/batchjobs/container_id1' can be considered as system-data
+and undesirable to expose to the isolated processes.  cgroup namespace
+can be used to restrict visibility of this path.  For example, before
+creating a cgroup namespace, one would see:
+
+  # ls -l /proc/self/ns/cgroup
+  lrwxrwxrwx 1 root root 0 2014-07-15 10:37 /proc/self/ns/cgroup -> cgroup:[4026531835]
+  # cat /proc/self/cgroup
+  0::/batchjobs/container_id1
+
+After unsharing a new namespace, the view changes.
+
+  # ls -l /proc/self/ns/cgroup
+  lrwxrwxrwx 1 root root 0 2014-07-15 10:35 /proc/self/ns/cgroup -> cgroup:[4026532183]
+  # cat /proc/self/cgroup
+  0::/
+
+When some thread from a multi-threaded process unshares its cgroup
+namespace, the new cgroupns gets applied to the entire process (all
+the threads).  This is natural for the v2 hierarchy; however, for the
+legacy hierarchies, this may be unexpected.
+
+A cgroup namespace is alive as long as there are processes inside it.
+When the last process exits, the cgroup namespace is destroyed.  The
+cgroupns root and the actual cgroups remain.
+
+
+6-2. The Root and Views
+
+The 'cgroupns root' for a cgroup namespace is the cgroup in which the
+process calling unshare(2) is running.  For example, if a process in
+/batchjobs/container_id1 cgroup calls unshare, cgroup
+/batchjobs/container_id1 becomes the cgroupns root.  For the
+init_cgroup_ns, this is the real root ('/') cgroup.
+
+The cgroupns root cgroup does not change even if the namespace creator
+process later moves to a different cgroup.
+
+  # ~/unshare -c # unshare cgroupns in some cgroup
+  # cat /proc/self/cgroup
+  0::/
+  # mkdir sub_cgrp_1
+  # echo 0 > sub_cgrp_1/cgroup.procs
+  # cat /proc/self/cgroup
+  0::/sub_cgrp_1
+
+Each process gets its namespace-specific view of "/proc/$PID/cgroup".
+
+Processes running inside the cgroup namespace will be able to see
+cgroup paths (in /proc/self/cgroup) only inside their root cgroup.
+From within an unshared cgroupns:
+
+  # sleep 100000 &
+  [1] 7353
+  # echo 7353 > sub_cgrp_1/cgroup.procs
+  # cat /proc/7353/cgroup
+  0::/sub_cgrp_1
+
+From the initial cgroup namespace, the real cgroup path will be
+visible:
+
+  $ cat /proc/7353/cgroup
+  0::/batchjobs/container_id1/sub_cgrp_1
+
+From a sibling cgroup namespace (that is, a namespace rooted at a
+different cgroup), the cgroup path relative to its own cgroup
+namespace root will be shown.  For instance, if PID 7353's cgroup
+namespace root is at '/batchjobs/container_id2', then it will see
+
+  # cat /proc/7353/cgroup
+  0::/../container_id2/sub_cgrp_1
+
+Note that the relative path always starts with '/' to indicate that
+it is relative to the cgroup namespace root of the caller.
+
+
+6-3. Migration and setns(2)
+
+Processes inside a cgroup namespace can move into and out of the
+namespace root if they have proper access to external cgroups.  For
+example, from inside a namespace with cgroupns root at
+/batchjobs/container_id1, and assuming that the global hierarchy is
+still accessible inside cgroupns:
+
+  # cat /proc/7353/cgroup
+  0::/sub_cgrp_1
+  # echo 7353 > batchjobs/container_id2/cgroup.procs
+  # cat /proc/7353/cgroup
+  0::/../container_id2
+
+Note that this kind of setup is not encouraged.  A task inside cgroup
+namespace should only be exposed to its own cgroupns hierarchy.
+
+setns(2) to another cgroup namespace is allowed when:
+
+(a) the process has CAP_SYS_ADMIN against its current user namespace
+(b) the process has CAP_SYS_ADMIN against the target cgroup
+    namespace's userns
+
+No implicit cgroup changes happen with attaching to another cgroup
+namespace.  It is expected that someone moves the attaching
+process under the target cgroup namespace root.
+
+
+6-4. Interaction with Other Namespaces
+
+Namespace specific cgroup hierarchy can be mounted by a process
+running inside a non-init cgroup namespace.
+
+  # mount -t cgroup2 none $MOUNT_POINT
+
+This will mount the unified cgroup hierarchy with cgroupns root as the
+filesystem root.  The process needs CAP_SYS_ADMIN against its user and
+mount namespaces.
+
+The virtualization of /proc/self/cgroup file combined with restricting
+the view of cgroup hierarchy by namespace-private cgroupfs mount
+provides a properly isolated cgroup view inside the container.
+
+
 P. Information on Kernel Programming
 
 This section contains kernel programming information in the areas

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 7/8] cgroup: Add documentation for cgroup namespaces
  2015-12-28 17:47   ` Tejun Heo
@ 2015-12-28 21:13     ` Serge Hallyn
  2015-12-28 21:48       ` [PATCH] " Tejun Heo
  0 siblings, 1 reply; 32+ messages in thread
From: Serge Hallyn @ 2015-12-28 21:13 UTC (permalink / raw)
  To: Tejun Heo
  Cc: linux-kernel, adityakali, linux-api, containers, cgroups,
	lxc-devel, akpm, ebiederm, gregkh, lizefan, hannes, Serge Hallyn

On Mon Dec 28 2015 09:47:35 AM PST, Tejun Heo <tj@kernel.org> wrote:

> Hello,
> 
> I did some heavy editing of the documentation.  How does this look?

Thanks Tejun, just three things (which come from my version):

> Did I miss anything?
> 
> Thanks.
> ---
>  Documentation/cgroup.txt |  146 +++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 146 insertions(+)
> 
> --- a/Documentation/cgroup.txt
> +++ b/Documentation/cgroup.txt
> @@ -47,6 +47,11 @@ CONTENTS
>          5-3. IO
>              5-3-1. IO Interface Files
>              5-3-2. Writeback
> +6. Namespace
> +      6-1. Basics
> +      6-2. The Root and Views
> +      6-3. Migration and setns(2)
> +      6-4. Interaction with Other Namespaces
>      P. Information on Kernel Programming
>          P-1. Filesystem Support for Writeback
>      D. Deprecated v1 Core Features
> @@ -1013,6 +1018,147 @@ writeback as follows.
>          vm.dirty[_background]_ratio.
>      
>      
> +6. Namespace
> +
> +6-1. Basics
> +
> +cgroup namespace provides a mechanism to virtualize the view of the
> +"/proc/$PID/cgroup" file

and cgroup mounts

>.      The CLONE_NEWCGROUP clone flag can be used
> +with clone(2) and unshare(2) to create a new cgroup namespace.      The
> +process running inside the cgroup namespace will have its
> +"/proc/$PID/cgroup" output restricted to cgroupns root.      The cgroupns
> +root is the cgroup of the process at the time of creation of the
> +cgroup namespace.
> +
> +Without cgroup namespace, the "/proc/$PID/cgroup" file shows the
> +complete path of the cgroup of a process.      In a container setup where
> +a set of cgroups and namespaces are intended to isolate processes the
> +"/proc/$PID/cgroup" file may leak potential system level information
> +to the isolated processes.      For Example:
> +
> +      # cat /proc/self/cgroup
> +      0::/batchjobs/container_id1
> +
> +The path '/batchjobs/container_id1' can be considered as system-data
> +and undesirable to expose to the isolated processes.      cgroup namespace
> +can be used to restrict visibility of this path.      For example, before
> +creating a cgroup namespace, one would see:
> +
> +      # ls -l /proc/self/ns/cgroup
> +      lrwxrwxrwx 1 root root 0 2014-07-15 10:37 /proc/self/ns/cgroup -> cgroup:[4026531835]
> +      # cat /proc/self/cgroup
> +      0::/batchjobs/container_id1
> +
> +After unsharing a new namespace, the view changes.
> +
> +      # ls -l /proc/self/ns/cgroup
> +      lrwxrwxrwx 1 root root 0 2014-07-15 10:35 /proc/self/ns/cgroup -> cgroup:[4026532183]
> +      # cat /proc/self/cgroup
> +      0::/
> +
> +When some thread from a multi-threaded process unshares its cgroup
> +namespace, the new cgroupns gets applied to the entire process (all
> +the threads).      This is natural for the v2 hierarchy; however, for the
> +legacy hierarchies, this may be unexpected.
> +
> +A cgroup namespace is alive as long as there are processes inside it.

Or mounts pinning it.

> +When the last process exits

or the last mount is umounted,

>, the cgroup namespace is destroyed.      The
> +cgroupns root and the actual cgroups remain.
> +
> +
> +6-2. The Root and Views
> +
> +The 'cgroupns root' for a cgroup namespace is the cgroup in which the
> +process calling unshare(2) is running.      For example, if a process in
> +/batchjobs/container_id1 cgroup calls unshare, cgroup
> +/batchjobs/container_id1 becomes the cgroupns root.      For the
> +init_cgroup_ns, this is the real root ('/') cgroup.
> +
> +The cgroupns root cgroup does not change even if the namespace creator
> +process later moves to a different cgroup.
> +
> +      # ~/unshare -c # unshare cgroupns in some cgroup
> +      # cat /proc/self/cgroup
> +      0::/
> +      # mkdir sub_cgrp_1
> +      # echo 0 > sub_cgrp_1/cgroup.procs
> +      # cat /proc/self/cgroup
> +      0::/sub_cgrp_1
> +
> +Each process gets its namespace-specific view of "/proc/$PID/cgroup"
> +
> +Processes running inside the cgroup namespace will be able to see
> +cgroup paths (in /proc/self/cgroup) only inside their root cgroup.
> +From within an unshared cgroupns:
> +
> +      # sleep 100000 &
> +      [1] 7353
> +      # echo 7353 > sub_cgrp_1/cgroup.procs
> +      # cat /proc/7353/cgroup
> +      0::/sub_cgrp_1
> +
> +From the initial cgroup namespace, the real cgroup path will be
> +visible:
> +
> +      $ cat /proc/7353/cgroup
> +      0::/batchjobs/container_id1/sub_cgrp_1
> +
> +From a sibling cgroup namespace (that is, a namespace rooted at a
> +different cgroup), the cgroup path relative to its own cgroup
> +namespace root will be shown.      For instance, if PID 7353's cgroup
> +namespace root is at '/batchjobs/container_id2', then it will see
> +
> +      # cat /proc/7353/cgroup
> +      0::/../container_id2/sub_cgrp_1
> +
> +Note that the relative path always starts with '/' to indicate that
> +its relative to the cgroup namespace root of the caller.
> +
> +
> +6-3. Migration and setns(2)
> +
> +Processes inside a cgroup namespace can move into and out of the
> +namespace root if they have proper access to external cgroups

This really means two things: write DAC access to the cgroupfs files, and
access to the directories through a cgroupfs mount.  Not sure whether that
should be spelled out.
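
To make the two requirements concrete, here is a rough userspace sketch
(illustration only).  move_pid_to_cgroup() is a hypothetical helper
name; the directory passed in is assumed to be a cgroup directory that
is reachable through some cgroupfs mount visible to the caller, and the
open() succeeds only if the caller has write DAC permission on that
directory's cgroup.procs file.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Write @pid into <cgroup_dir>/cgroup.procs, which asks the kernel to move
 * that process into the cgroup.  Returns 0 or a negative errno value. */
static int move_pid_to_cgroup(const char *cgroup_dir, pid_t pid)
{
	char path[4096], line[32];
	ssize_t written;
	int fd, len, ret = 0;

	len = snprintf(path, sizeof(path), "%s/cgroup.procs", cgroup_dir);
	if (len < 0 || (size_t)len >= sizeof(path))
		return -ENAMETOOLONG;

	/* Fails with -ENOENT if the directory is not visible through any
	 * mount, or -EACCES if DAC permissions deny write access. */
	fd = open(path, O_WRONLY);
	if (fd < 0)
		return -errno;

	len = snprintf(line, sizeof(line), "%ld\n", (long)pid);
	written = write(fd, line, (size_t)len);
	if (written < 0)
		ret = -errno;
	else if (written != len)
		ret = -EIO;

	close(fd);
	return ret;
}
```

For the example quoted below, this would correspond to something like
move_pid_to_cgroup("batchjobs/container_id2", 7353), assuming that
directory is visible from inside the namespace.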

>.      For
> +example, from inside a namespace with cgroupns root at
> +/batchjobs/container_id1, and assuming that the global hierarchy is
> +still accessible inside cgroupns:
> +
> +      # cat /proc/7353/cgroup
> +      0::/sub_cgrp_1
> +      # echo 7353 > batchjobs/container_id2/cgroup.procs
> +      # cat /proc/7353/cgroup
> +      0::/../container_id2
> +
> +Note that this kind of setup is not encouraged.  A task inside a cgroup
> +namespace should only be exposed to its own cgroupns hierarchy.
> +
> +setns(2) to another cgroup namespace is allowed when:
> +
> +(a) the process has CAP_SYS_ADMIN against its current user namespace
> +(b) the process has CAP_SYS_ADMIN against the target cgroup
> +          namespace's userns
> +
> +No implicit cgroup changes happen with attaching to another cgroup
> +namespace.  It is expected that someone moves the attaching
> +process under the target cgroup namespace root.
> +
> +
> +6-4. Interaction with Other Namespaces
> +
> +Namespace specific cgroup hierarchy can be mounted by a process
> +running inside a non-init cgroup namespace.
> +
> +      # mount -t cgroup2 none $MOUNT_POINT
> +
> +This will mount the unified cgroup hierarchy with cgroupns root as the
> +filesystem root.  The process needs CAP_SYS_ADMIN against its user and
> +mount namespaces.
> +
> +The virtualization of /proc/self/cgroup file combined with restricting
> +the view of cgroup hierarchy by namespace-private cgroupfs mount
> +provides a properly isolated cgroup view inside the container.
> +
> +
>      P. Information on Kernel Programming
>      
>      This section contains kernel programming information in the areas
> --



* [PATCH] cgroup: Add documentation for cgroup namespaces
  2015-12-28 21:13     ` Serge Hallyn
@ 2015-12-28 21:48       ` Tejun Heo
  0 siblings, 0 replies; 32+ messages in thread
From: Tejun Heo @ 2015-12-28 21:48 UTC (permalink / raw)
  To: Serge Hallyn
  Cc: linux-kernel, adityakali, linux-api, containers, cgroups,
	lxc-devel, akpm, ebiederm, gregkh, lizefan, hannes, Serge Hallyn

Hello, Serge.

I incorporated your suggestions and applied the following to
cgroup/for-4.5.  Let's improve it from here.

Thanks!

------ 8< ------
From 5dddabce1171af2990160e1cb90f69a1440dcfb7 Mon Sep 17 00:00:00 2001
From: Aditya Kali <adityakali@google.com>
Date: Tue, 22 Dec 2015 22:23:28 -0600

tj: Reorganized to better fit the documentation.

Signed-off-by: Aditya Kali <adityakali@google.com>
Signed-off-by: Serge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
---
 Documentation/cgroup.txt | 147 +++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 147 insertions(+)

diff --git a/Documentation/cgroup.txt b/Documentation/cgroup.txt
index 31d1f7b..983ba63 100644
--- a/Documentation/cgroup.txt
+++ b/Documentation/cgroup.txt
@@ -47,6 +47,11 @@ CONTENTS
   5-3. IO
     5-3-1. IO Interface Files
     5-3-2. Writeback
+6. Namespace
+  6-1. Basics
+  6-2. The Root and Views
+  6-3. Migration and setns(2)
+  6-4. Interaction with Other Namespaces
 P. Information on Kernel Programming
   P-1. Filesystem Support for Writeback
 D. Deprecated v1 Core Features
@@ -1013,6 +1018,148 @@ writeback as follows.
 	vm.dirty[_background]_ratio.
 
 
+6. Namespace
+
+6-1. Basics
+
+cgroup namespace provides a mechanism to virtualize the view of the
+"/proc/$PID/cgroup" file and cgroup mounts.  The CLONE_NEWCGROUP clone
+flag can be used with clone(2) and unshare(2) to create a new cgroup
+namespace.  The process running inside the cgroup namespace will have
+its "/proc/$PID/cgroup" output restricted to cgroupns root.  The
+cgroupns root is the cgroup of the process at the time of creation of
+the cgroup namespace.
+
+Without cgroup namespace, the "/proc/$PID/cgroup" file shows the
+complete path of the cgroup of a process.  In a container setup where
+a set of cgroups and namespaces are intended to isolate processes the
+"/proc/$PID/cgroup" file may leak potential system level information
+to the isolated processes.  For example:
+
+  # cat /proc/self/cgroup
+  0::/batchjobs/container_id1
+
+The path '/batchjobs/container_id1' can be considered system data that
+should not be exposed to the isolated processes.  A cgroup namespace
+can be used to restrict visibility of this path.  For example, before
+creating a cgroup namespace, one would see:
+
+  # ls -l /proc/self/ns/cgroup
+  lrwxrwxrwx 1 root root 0 2014-07-15 10:37 /proc/self/ns/cgroup -> cgroup:[4026531835]
+  # cat /proc/self/cgroup
+  0::/batchjobs/container_id1
+
+After unsharing a new namespace, the view changes.
+
+  # ls -l /proc/self/ns/cgroup
+  lrwxrwxrwx 1 root root 0 2014-07-15 10:35 /proc/self/ns/cgroup -> cgroup:[4026532183]
+  # cat /proc/self/cgroup
+  0::/
+
+When some thread from a multi-threaded process unshares its cgroup
+namespace, the new cgroupns gets applied to the entire process (all
+the threads).  This is natural for the v2 hierarchy; however, for the
+legacy hierarchies, this may be unexpected.
+
+A cgroup namespace is alive as long as there are processes inside or
+mounts pinning it.  When the last usage goes away, the cgroup
+namespace is destroyed.  The cgroupns root and the actual cgroups
+remain.
+
+
+6-2. The Root and Views
+
+The 'cgroupns root' for a cgroup namespace is the cgroup in which the
+process calling unshare(2) is running.  For example, if a process in
+/batchjobs/container_id1 cgroup calls unshare, cgroup
+/batchjobs/container_id1 becomes the cgroupns root.  For the
+init_cgroup_ns, this is the real root ('/') cgroup.
+
+The cgroupns root cgroup does not change even if the namespace creator
+process later moves to a different cgroup.
+
+  # ~/unshare -c # unshare cgroupns in some cgroup
+  # cat /proc/self/cgroup
+  0::/
+  # mkdir sub_cgrp_1
+  # echo 0 > sub_cgrp_1/cgroup.procs
+  # cat /proc/self/cgroup
+  0::/sub_cgrp_1
+
+Each process gets its namespace-specific view of "/proc/$PID/cgroup".
+
+Processes running inside the cgroup namespace will be able to see
+cgroup paths (in /proc/self/cgroup) only inside their root cgroup.
+From within an unshared cgroupns:
+
+  # sleep 100000 &
+  [1] 7353
+  # echo 7353 > sub_cgrp_1/cgroup.procs
+  # cat /proc/7353/cgroup
+  0::/sub_cgrp_1
+
+From the initial cgroup namespace, the real cgroup path will be
+visible:
+
+  $ cat /proc/7353/cgroup
+  0::/batchjobs/container_id1/sub_cgrp_1
+
+From a sibling cgroup namespace (that is, a namespace rooted at a
+different cgroup), the cgroup path relative to its own cgroup
+namespace root will be shown.  For instance, if PID 7353's cgroup
+namespace root is at '/batchjobs/container_id2', then it will see
+
+  # cat /proc/7353/cgroup
+  0::/../container_id2/sub_cgrp_1
+
+Note that the relative path always starts with '/' to indicate that
+it is relative to the cgroup namespace root of the caller.
+
+
+6-3. Migration and setns(2)
+
+Processes inside a cgroup namespace can move into and out of the
+namespace root if they have proper access to external cgroups.  For
+example, from inside a namespace with cgroupns root at
+/batchjobs/container_id1, and assuming that the global hierarchy is
+still accessible inside cgroupns:
+
+  # cat /proc/7353/cgroup
+  0::/sub_cgrp_1
+  # echo 7353 > batchjobs/container_id2/cgroup.procs
+  # cat /proc/7353/cgroup
+  0::/../container_id2
+
+Note that this kind of setup is not encouraged.  A task inside a cgroup
+namespace should only be exposed to its own cgroupns hierarchy.
+
+setns(2) to another cgroup namespace is allowed when:
+
+(a) the process has CAP_SYS_ADMIN against its current user namespace
+(b) the process has CAP_SYS_ADMIN against the target cgroup
+    namespace's userns
+
+No implicit cgroup changes happen with attaching to another cgroup
+namespace.  It is expected that someone moves the attaching
+process under the target cgroup namespace root.
+
+
+6-4. Interaction with Other Namespaces
+
+A namespace-specific cgroup hierarchy can be mounted by a process
+running inside a non-init cgroup namespace.
+
+  # mount -t cgroup2 none $MOUNT_POINT
+
+This will mount the unified cgroup hierarchy with cgroupns root as the
+filesystem root.  The process needs CAP_SYS_ADMIN against its user and
+mount namespaces.
+
+The virtualization of the /proc/self/cgroup file, combined with
+restricting the view of the cgroup hierarchy via a namespace-private
+cgroupfs mount, provides a properly isolated cgroup view in the container.
+
+
 P. Information on Kernel Programming
 
 This section contains kernel programming information in the areas
-- 
2.5.0



* Re: [PATCH 6/8] cgroup: mount cgroupns-root when inside non-init cgroupns
  2015-12-23  4:23 ` [PATCH 6/8] cgroup: mount cgroupns-root when inside non-init cgroupns serge.hallyn
@ 2015-12-31 13:38   ` Sergey Senozhatsky
  2016-01-01  0:58     ` Serge E. Hallyn
  0 siblings, 1 reply; 32+ messages in thread
From: Sergey Senozhatsky @ 2015-12-31 13:38 UTC (permalink / raw)
  To: serge.hallyn
  Cc: linux-kernel, adityakali, tj, linux-api, containers, cgroups,
	lxc-devel, akpm, ebiederm, gregkh, lizefan, hannes, Serge Hallyn,
	Stephen Rothwell, linux-next

On (12/22/15 22:23), serge.hallyn@ubuntu.com wrote:
[..]
>  	root = kzalloc(sizeof(*root), GFP_KERNEL);
>  	if (!root) {
>  		ret = -ENOMEM;
> @@ -2124,12 +2143,30 @@ out_free:
>  	kfree(opts.release_agent);
>  	kfree(opts.name);
>  
> -	if (ret)
> +	if (ret) {
> +		put_cgroup_ns(ns);
>  		return ERR_PTR(ret);
> +	}
>  out_mount:
>  	dentry = kernfs_mount(fs_type, flags, root->kf_root,
>  			      is_v2 ? CGROUP2_SUPER_MAGIC : CGROUP_SUPER_MAGIC,
>  			      &new_sb);
> +
> +	/*
> +	 * In non-init cgroup namespace, instead of root cgroup's
> +	 * dentry, we return the dentry corresponding to the
> +	 * cgroupns->root_cgrp.
> +	 */
> +	if (!IS_ERR(dentry) && ns != &init_cgroup_ns) {
> +		struct dentry *nsdentry;
> +		struct cgroup *cgrp;
> +
> +		cgrp = cset_cgroup_from_root(ns->root_cset, root);
> +		nsdentry = kernfs_node_dentry(cgrp->kn, dentry->d_sb);
> +		dput(dentry);
> +		dentry = nsdentry;
> +	}
> +
>  	if (IS_ERR(dentry) || !new_sb)
>  		cgroup_put(&root->cgrp);
>  
> @@ -2142,6 +2179,7 @@ out_mount:
>  		deactivate_super(pinned_sb);
>  	}
>  
> +	put_cgroup_ns(ns);
>  	return dentry;
>  }

Hello,

I haven't investigated this; just letting you know that I'm seeing a warning on linux-next 20151231.

Line 1159 is lockdep_assert_held(&css_set_lock).

[    2.880978] WARNING: CPU: 1 PID: 198 at kernel/cgroup.c:1159 cset_cgroup_from_root+0xa8/0xaf()
[    2.880981] Modules linked in: mousedev arc4 nls_iso8859_1 nls_cp437 vfat fat serio_raw iwlmvm psmouse i915 atkbd mac80211 libps2 mxm_wmi i2c_algo_bit drm_kms_helper coretemp cfbfillrect iwlwifi syscopyarea hwmon cfbimgblt crc32c_intel sysfillrect sysimgblt fb_sys_fops cfg80211 ie31200_edac cfbcopyarea evdev r8169 lpc_ich drm i2c_i801 mii intel_gtt edac_core i2c_core mfd_core thermal video i8042 wmi serio button acpi_cpufreq ext4 crc16 mbcache jbd2 sd_mod ehci_pci ehci_hcd ahci libahci libata xhci_pci scsi_mod xhci_hcd usbcore usb_common
[    2.881040] CPU: 1 PID: 198 Comm: kworker/1:2 Tainted: G        W       4.4.0-rc7-next-20151231-dbg-00012-gc1e64b7-dirty #316
[    2.881045] Workqueue: events cgroup_release_agent
[    2.881048]  0000000000000000 ffff88041b7f3cb8 ffffffff811e8bc6 0000000000000000
[    2.881053]  ffff88041b7f3cf0 ffffffff81040889 ffffffff810b2d27 ffffffff8163f3c0
[    2.881057]  ffff88041cc84000 ffff88041b7c0000 ffff88042fa58a00 ffff88041b7f3d00
[    2.881062] Call Trace:
[    2.881068]  [<ffffffff811e8bc6>] dump_stack+0x4b/0x63
[    2.881073]  [<ffffffff81040889>] warn_slowpath_common+0x99/0xb2
[    2.881076]  [<ffffffff810b2d27>] ? cset_cgroup_from_root+0xa8/0xaf
[    2.881080]  [<ffffffff81040955>] warn_slowpath_null+0x1a/0x1c
[    2.881083]  [<ffffffff810b2d27>] cset_cgroup_from_root+0xa8/0xaf
[    2.881086]  [<ffffffff810b58a4>] cgroup_path+0x51/0x83
[    2.881090]  [<ffffffff810b59fa>] cgroup_release_agent+0x75/0xed
[    2.881094]  [<ffffffff810566ea>] process_one_work+0x292/0x54b
[    2.881098]  [<ffffffff81057826>] worker_thread+0x291/0x383
[    2.881102]  [<ffffffff81057595>] ? rescuer_thread+0x2cf/0x2cf
[    2.881105]  [<ffffffff8105c603>] kthread+0xf8/0x100
[    2.881108]  [<ffffffff8105c50b>] ? kthread_create_on_node+0x1c7/0x1c7
[    2.881114]  [<ffffffff813b71bf>] ret_from_fork+0x3f/0x70
[    2.881117]  [<ffffffff8105c50b>] ? kthread_create_on_node+0x1c7/0x1c7
[    2.881119] ---[ end trace aa407df654d6b05d ]---

	-ss


* Re: [PATCH 6/8] cgroup: mount cgroupns-root when inside non-init cgroupns
  2015-12-31 13:38   ` Sergey Senozhatsky
@ 2016-01-01  0:58     ` Serge E. Hallyn
  2016-01-01  1:17       ` Sergey Senozhatsky
  2016-01-01  1:56       ` Tejun Heo
  0 siblings, 2 replies; 32+ messages in thread
From: Serge E. Hallyn @ 2016-01-01  0:58 UTC (permalink / raw)
  To: Sergey Senozhatsky
  Cc: serge.hallyn, Stephen Rothwell, linux-api, containers, hannes,
	linux-kernel, linux-next, ebiederm, lxc-devel, gregkh, tj,
	cgroups, akpm

On Thu, Dec 31, 2015 at 10:38:22PM +0900, Sergey Senozhatsky wrote:
> On (12/22/15 22:23), serge.hallyn@ubuntu.com wrote:
> [..]
> >  	root = kzalloc(sizeof(*root), GFP_KERNEL);
> >  	if (!root) {
> >  		ret = -ENOMEM;
> > @@ -2124,12 +2143,30 @@ out_free:
> >  	kfree(opts.release_agent);
> >  	kfree(opts.name);
> >  
> > -	if (ret)
> > +	if (ret) {
> > +		put_cgroup_ns(ns);
> >  		return ERR_PTR(ret);
> > +	}
> >  out_mount:
> >  	dentry = kernfs_mount(fs_type, flags, root->kf_root,
> >  			      is_v2 ? CGROUP2_SUPER_MAGIC : CGROUP_SUPER_MAGIC,
> >  			      &new_sb);
> > +
> > +	/*
> > +	 * In non-init cgroup namespace, instead of root cgroup's
> > +	 * dentry, we return the dentry corresponding to the
> > +	 * cgroupns->root_cgrp.
> > +	 */
> > +	if (!IS_ERR(dentry) && ns != &init_cgroup_ns) {
> > +		struct dentry *nsdentry;
> > +		struct cgroup *cgrp;
> > +
> > +		cgrp = cset_cgroup_from_root(ns->root_cset, root);
> > +		nsdentry = kernfs_node_dentry(cgrp->kn, dentry->d_sb);
> > +		dput(dentry);
> > +		dentry = nsdentry;
> > +	}
> > +
> >  	if (IS_ERR(dentry) || !new_sb)
> >  		cgroup_put(&root->cgrp);
> >  
> > @@ -2142,6 +2179,7 @@ out_mount:
> >  		deactivate_super(pinned_sb);
> >  	}
> >  
> > +	put_cgroup_ns(ns);
> >  	return dentry;
> >  }
> 
> Hello,
> 
> I haven't investigated this; just letting you know that I'm seeing a warning on linux-next 20151231.
> 
> Line 1159 is lockdep_assert_held(&css_set_lock).
> 
> [    2.880978] WARNING: CPU: 1 PID: 198 at kernel/cgroup.c:1159 cset_cgroup_from_root+0xa8/0xaf()
> [    2.880981] Modules linked in: mousedev arc4 nls_iso8859_1 nls_cp437 vfat fat serio_raw iwlmvm psmouse i915 atkbd mac80211 libps2 mxm_wmi i2c_algo_bit drm_kms_helper coretemp cfbfillrect iwlwifi syscopyarea hwmon cfbimgblt crc32c_intel sysfillrect sysimgblt fb_sys_fops cfg80211 ie31200_edac cfbcopyarea evdev r8169 lpc_ich drm i2c_i801 mii intel_gtt edac_core i2c_core mfd_core thermal video i8042 wmi serio button acpi_cpufreq ext4 crc16 mbcache jbd2 sd_mod ehci_pci ehci_hcd ahci libahci libata xhci_pci scsi_mod xhci_hcd usbcore usb_common
> [    2.881040] CPU: 1 PID: 198 Comm: kworker/1:2 Tainted: G        W       4.4.0-rc7-next-20151231-dbg-00012-gc1e64b7-dirty #316
> [    2.881045] Workqueue: events cgroup_release_agent
> [    2.881048]  0000000000000000 ffff88041b7f3cb8 ffffffff811e8bc6 0000000000000000
> [    2.881053]  ffff88041b7f3cf0 ffffffff81040889 ffffffff810b2d27 ffffffff8163f3c0
> [    2.881057]  ffff88041cc84000 ffff88041b7c0000 ffff88042fa58a00 ffff88041b7f3d00
> [    2.881062] Call Trace:
> [    2.881068]  [<ffffffff811e8bc6>] dump_stack+0x4b/0x63
> [    2.881073]  [<ffffffff81040889>] warn_slowpath_common+0x99/0xb2
> [    2.881076]  [<ffffffff810b2d27>] ? cset_cgroup_from_root+0xa8/0xaf
> [    2.881080]  [<ffffffff81040955>] warn_slowpath_null+0x1a/0x1c
> [    2.881083]  [<ffffffff810b2d27>] cset_cgroup_from_root+0xa8/0xaf
> [    2.881086]  [<ffffffff810b58a4>] cgroup_path+0x51/0x83
> [    2.881090]  [<ffffffff810b59fa>] cgroup_release_agent+0x75/0xed
> [    2.881094]  [<ffffffff810566ea>] process_one_work+0x292/0x54b
> [    2.881098]  [<ffffffff81057826>] worker_thread+0x291/0x383
> [    2.881102]  [<ffffffff81057595>] ? rescuer_thread+0x2cf/0x2cf
> [    2.881105]  [<ffffffff8105c603>] kthread+0xf8/0x100
> [    2.881108]  [<ffffffff8105c50b>] ? kthread_create_on_node+0x1c7/0x1c7
> [    2.881114]  [<ffffffff813b71bf>] ret_from_fork+0x3f/0x70
> [    2.881117]  [<ffffffff8105c50b>] ? kthread_create_on_node+0x1c7/0x1c7
> [    2.881119] ---[ end trace aa407df654d6b05d ]---

Thanks!  The following trivial patch should fix this:

From fc54592077533ff2ff90ed54b72bf03b4378ca9f Mon Sep 17 00:00:00 2001
From: Serge Hallyn <serge.hallyn@ubuntu.com>
Date: Thu, 31 Dec 2015 16:55:19 -0800
Subject: [PATCH 1/1] cgroup_release_agent: grab css_set_lock around
 cgroup_path()

Reported-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Serge Hallyn <serge.hallyn@ubuntu.com>
---
 kernel/cgroup.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 4812f59..d10f127 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -5769,7 +5769,9 @@ static void cgroup_release_agent(struct work_struct *work)
 	if (!pathbuf || !agentbuf)
 		goto out;
 
+	spin_lock_bh(&css_set_lock);
 	path = cgroup_path(cgrp, pathbuf, PATH_MAX);
+	spin_unlock_bh(&css_set_lock);
 	if (!path)
 		goto out;
 
-- 
2.5.0
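
As a rough illustration of why the warning fired and why taking the lock in
the caller silences it, here is a toy, single-threaded userspace model.
Everything in it (toy_lock, toy_assert_held(), toy_lookup() and the two
callers) is made up for this sketch; real lockdep tracks held locks per task
and does far more than checking a flag.

```c
#include <stdio.h>

struct toy_lock {
	int held;
};

int toy_warnings;			/* number of warnings emitted */
struct toy_lock toy_css_set_lock;	/* stand-in for css_set_lock */

static void toy_lock_acquire(struct toy_lock *l) { l->held = 1; }
static void toy_lock_release(struct toy_lock *l) { l->held = 0; }

/* Loose analogue of lockdep_assert_held(): warn, but keep running. */
static void toy_assert_held(const struct toy_lock *l, const char *where)
{
	if (!l->held) {
		toy_warnings++;
		fprintf(stderr, "WARNING: lock not held in %s\n", where);
	}
}

/* Callee that expects the lock to be held by its caller. */
const char *toy_lookup(void)
{
	toy_assert_held(&toy_css_set_lock, "toy_lookup");
	return "/batchjobs/container_id1";
}

/* Caller as it was before the fix: no lock taken, so the callee warns. */
const char *toy_release_agent_unfixed(void)
{
	return toy_lookup();
}

/* Caller after the fix: the lookup is bracketed by acquire/release. */
const char *toy_release_agent_fixed(void)
{
	const char *path;

	toy_lock_acquire(&toy_css_set_lock);
	path = toy_lookup();
	toy_lock_release(&toy_css_set_lock);
	return path;
}
```

The fixed caller mirrors the shape of the two added lines in the patch above:
the assertion lives in the callee, and the caller is responsible for holding
the lock across the call.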



* Re: [PATCH 6/8] cgroup: mount cgroupns-root when inside non-init cgroupns
  2016-01-01  0:58     ` Serge E. Hallyn
@ 2016-01-01  1:17       ` Sergey Senozhatsky
  2016-01-01  1:56       ` Tejun Heo
  1 sibling, 0 replies; 32+ messages in thread
From: Sergey Senozhatsky @ 2016-01-01  1:17 UTC (permalink / raw)
  To: Serge E. Hallyn
  Cc: Sergey Senozhatsky, Stephen Rothwell, linux-api, containers,
	hannes, linux-kernel, linux-next, ebiederm, lxc-devel, gregkh,
	tj, cgroups, akpm

On (12/31/15 18:58), Serge E. Hallyn wrote:
[..]
> Thanks!  The following trivial patch should fix this:

Confirmed, thanks!

Tested-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>

	-ss

> From fc54592077533ff2ff90ed54b72bf03b4378ca9f Mon Sep 17 00:00:00 2001
> From: Serge Hallyn <serge.hallyn@ubuntu.com>
> Date: Thu, 31 Dec 2015 16:55:19 -0800
> Subject: [PATCH 1/1] cgroup_release_agent: grab css_set_lock around
>  cgroup_path()
> 
> Reported-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
> Signed-off-by: Serge Hallyn <serge.hallyn@ubuntu.com>
> ---
>  kernel/cgroup.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/kernel/cgroup.c b/kernel/cgroup.c
> index 4812f59..d10f127 100644
> --- a/kernel/cgroup.c
> +++ b/kernel/cgroup.c
> @@ -5769,7 +5769,9 @@ static void cgroup_release_agent(struct work_struct *work)
>  	if (!pathbuf || !agentbuf)
>  		goto out;
>  
> +	spin_lock_bh(&css_set_lock);
>  	path = cgroup_path(cgrp, pathbuf, PATH_MAX);
> +	spin_unlock_bh(&css_set_lock);
>  	if (!path)
>  		goto out;
>  
> -- 
> 2.5.0
> 


* Re: [PATCH 6/8] cgroup: mount cgroupns-root when inside non-init cgroupns
  2016-01-01  0:58     ` Serge E. Hallyn
  2016-01-01  1:17       ` Sergey Senozhatsky
@ 2016-01-01  1:56       ` Tejun Heo
  1 sibling, 0 replies; 32+ messages in thread
From: Tejun Heo @ 2016-01-01  1:56 UTC (permalink / raw)
  To: Serge E. Hallyn
  Cc: Sergey Senozhatsky, Stephen Rothwell, linux-api, containers,
	hannes, linux-kernel, linux-next, ebiederm, lxc-devel, gregkh,
	cgroups, akpm

Hello,

> From fc54592077533ff2ff90ed54b72bf03b4378ca9f Mon Sep 17 00:00:00 2001
> From: Serge Hallyn <serge.hallyn@ubuntu.com>
> Date: Thu, 31 Dec 2015 16:55:19 -0800
> Subject: [PATCH 1/1] cgroup_release_agent: grab css_set_lock around
>  cgroup_path()
> 
> Reported-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
> Signed-off-by: Serge Hallyn <serge.hallyn@ubuntu.com>

Applied to cgroup/for-4.5.

Thanks.

-- 
tejun


* Re: CGroup Namespaces (v8)
  2015-12-23  4:23 CGroup Namespaces (v8) serge.hallyn
                   ` (8 preceding siblings ...)
  2015-12-28 17:46 ` CGroup Namespaces (v8) Tejun Heo
@ 2016-01-01  8:19 ` Dan Williams
  2016-01-01  8:59   ` Serge E. Hallyn
  9 siblings, 1 reply; 32+ messages in thread
From: Dan Williams @ 2016-01-01  8:19 UTC (permalink / raw)
  To: serge.hallyn
  Cc: Linux Kernel Mailing List, adityakali, Tejun Heo, linux-api,
	containers, cgroups, lxc-devel, Andrew Morton, Eric W. Biederman,
	Greg Kroah-Hartman, lizefan, hannes, Stephen Rothwell

On Tue, Dec 22, 2015 at 8:23 PM,  <serge.hallyn@ubuntu.com> wrote:
> Hi,
>
> following is a revised set of the CGroup Namespace patchset which Aditya
> Kali has previously sent.  The code can also be found in the cgroupns.v8
> branch of
>
> https://git.kernel.org/cgit/linux/kernel/git/sergeh/linux-security.git/

Happy New Year!

I'm hitting both a build and boot regression in -next with these
patches.  The bisect log reports:

git bisect skip f176ae3a5df610e0373c39538cc7744c7788b3df
# bad: [54b39d263704b66cc8006ce8f7dd4743fdabae70] cgroup: cgroup namespace setns support
git bisect bad 54b39d263704b66cc8006ce8f7dd4743fdabae70
# good: [bf9b3590879afbaa54148d817ed68a0b51180375] sched: new clone flag CLONE_NEWCGROUP for cgroup namespace
git bisect good bf9b3590879afbaa54148d817ed68a0b51180375
# only skipped commits left to test
# possible first bad commit: [54b39d263704b66cc8006ce8f7dd4743fdabae70] cgroup: cgroup namespace setns support
# possible first bad commit: [f176ae3a5df610e0373c39538cc7744c7788b3df] cgroup: introduce cgroup namespaces

Where compiling "f176ae3a5df6"  fails with:

kernel/cgroup.c: In function ‘cgroupns_put’:
kernel/cgroup.c:5956:2: error: implicit declaration of function ‘to_cg_ns’ [-Werror=implicit-function-declaration]
 put_cgroup_ns(to_cg_ns(ns));
 ^
kernel/cgroup.c:5956:16: warning: passing argument 1 of ‘put_cgroup_ns’ makes pointer from integer without a cast
 put_cgroup_ns(to_cg_ns(ns));
               ^
In file included from kernel/cgroup.c:31:0:
include/linux/cgroup.h:609:60: note: expected ‘struct cgroup_namespace *’ but argument is of type ‘int’
static inline void put_cgroup_ns(struct cgroup_namespace *ns)
                                                           ^
kernel/cgroup.c: At top level:
kernel/cgroup.c:5964:2: warning: initialization from incompatible pointer type
 .install = cgroupns_install,
 ^
kernel/cgroup.c:5964:2: warning: (near initialization for ‘cgroupns_operations.install’)
cc1: some warnings being treated as errors

...and booting "54b39d263704" fails with the following log:

[ 3.963105] BUG: unable to handle kernel NULL pointer dereference at 0000000000000030
[ 3.963110] IP: [<ffffffff81143f3d>] cgroup_path+0x3d/0x90

https://gist.github.com/djbw/1d30cd323d802e3d9d07


* Re: CGroup Namespaces (v8)
  2016-01-01  8:19 ` Dan Williams
@ 2016-01-01  8:59   ` Serge E. Hallyn
  2016-01-01  9:42     ` Dan Williams
  0 siblings, 1 reply; 32+ messages in thread
From: Serge E. Hallyn @ 2016-01-01  8:59 UTC (permalink / raw)
  To: Dan Williams
  Cc: serge.hallyn, Stephen Rothwell, linux-api, containers, hannes,
	Linux Kernel Mailing List, Eric W. Biederman, lxc-devel,
	Greg Kroah-Hartman, Tejun Heo, cgroups, Andrew Morton

On Fri, Jan 01, 2016 at 12:19:44AM -0800, Dan Williams wrote:
> On Tue, Dec 22, 2015 at 8:23 PM,  <serge.hallyn@ubuntu.com> wrote:
> > Hi,
> >
> > following is a revised set of the CGroup Namespace patchset which Aditya
> > Kali has previously sent.  The code can also be found in the cgroupns.v8
> > branch of
> >
> > https://git.kernel.org/cgit/linux/kernel/git/sergeh/linux-security.git/
> 
> Happy New Year!
> 
> I'm hitting both a build and boot regression in -next with these
> patches.  The bisect log reports:
> 
> git bisect skip f176ae3a5df610e0373c39538cc7744c7788b3df
> # bad: [54b39d263704b66cc8006ce8f7dd4743fdabae70] cgroup: cgroup namespace setns support
> git bisect bad 54b39d263704b66cc8006ce8f7dd4743fdabae70
> # good: [bf9b3590879afbaa54148d817ed68a0b51180375] sched: new clone flag CLONE_NEWCGROUP for cgroup namespace
> git bisect good bf9b3590879afbaa54148d817ed68a0b51180375
> # only skipped commits left to test
> # possible first bad commit: [54b39d263704b66cc8006ce8f7dd4743fdabae70] cgroup: cgroup namespace setns support
> # possible first bad commit: [f176ae3a5df610e0373c39538cc7744c7788b3df] cgroup: introduce cgroup namespaces
> 
> Where compiling "f176ae3a5df6"  fails with:
> 
> kernel/cgroup.c: In function ‘cgroupns_put’:
> kernel/cgroup.c:5956:2: error: implicit declaration of function ‘to_cg_ns’ [-Werror=implicit-function-declaration]
>  put_cgroup_ns(to_cg_ns(ns));

Hi,

Is it possible you have a bad tree?  Looking at
https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git/tree/kernel/cgroup.c?id=refs/tags/next-20151231
to_cg_ns should definitely be defined there.

-serge


* Re: CGroup Namespaces (v8)
  2016-01-01  8:59   ` Serge E. Hallyn
@ 2016-01-01  9:42     ` Dan Williams
  2016-01-01 18:06       ` Serge E. Hallyn
  0 siblings, 1 reply; 32+ messages in thread
From: Dan Williams @ 2016-01-01  9:42 UTC (permalink / raw)
  To: Serge E. Hallyn
  Cc: Stephen Rothwell, linux-api, containers, Johannes Weiner,
	Linux Kernel Mailing List, Eric W. Biederman, lxc-devel,
	Greg Kroah-Hartman, Tejun Heo, cgroups, Andrew Morton

On Fri, Jan 1, 2016 at 12:59 AM, Serge E. Hallyn
<serge.hallyn@ubuntu.com> wrote:
> On Fri, Jan 01, 2016 at 12:19:44AM -0800, Dan Williams wrote:
>> On Tue, Dec 22, 2015 at 8:23 PM,  <serge.hallyn@ubuntu.com> wrote:
>> > Hi,
>> >
>> > following is a revised set of the CGroup Namespace patchset which Aditya
>> > Kali has previously sent.  The code can also be found in the cgroupns.v8
>> > branch of
>> >
>> > https://git.kernel.org/cgit/linux/kernel/git/sergeh/linux-security.git/
>>
>> Happy New Year!
>>
>> I'm hitting both a build and boot regression in -next with these
>> patches.  The bisect log reports:
>>
>> git bisect skip f176ae3a5df610e0373c39538cc7744c7788b3df
>> # bad: [54b39d263704b66cc8006ce8f7dd4743fdabae70] cgroup: cgroup namespace setns support
>> git bisect bad 54b39d263704b66cc8006ce8f7dd4743fdabae70
>> # good: [bf9b3590879afbaa54148d817ed68a0b51180375] sched: new clone flag CLONE_NEWCGROUP for cgroup namespace
>> git bisect good bf9b3590879afbaa54148d817ed68a0b51180375
>> # only skipped commits left to test
>> # possible first bad commit: [54b39d263704b66cc8006ce8f7dd4743fdabae70] cgroup: cgroup namespace setns support
>> # possible first bad commit: [f176ae3a5df610e0373c39538cc7744c7788b3df] cgroup: introduce cgroup namespaces
>>
>> Where compiling "f176ae3a5df6"  fails with:
>>
>> kernel/cgroup.c: In function ‘cgroupns_put’:
>> kernel/cgroup.c:5956:2: error: implicit declaration of function ‘to_cg_ns’ [-Werror=implicit-function-declaration]
>>  put_cgroup_ns(to_cg_ns(ns));
>
> Hi,
>
> is it possible you have a bad tree?

No, I don't believe so...

>  Looking at
> https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git/tree/kernel/cgroup.c?id=refs/tags/next-20151231
> to_cg_ns should definitely be defined there.
>

Right, but it's not defined until commit 54b39d263704 "cgroup: cgroup
namespace setns support", whereas it is first used in the preceding
commit f176ae3a5df6 "cgroup: introduce cgroup namespaces".  The bad
patch ordering breaks bisection.

Commit 54b39d263704 "cgroup: cgroup namespace setns support" not
booting is a separate issue.


* Re: CGroup Namespaces (v8)
  2016-01-01  9:42     ` Dan Williams
@ 2016-01-01 18:06       ` Serge E. Hallyn
  2016-01-01 19:14         ` Dan Williams
  0 siblings, 1 reply; 32+ messages in thread
From: Serge E. Hallyn @ 2016-01-01 18:06 UTC (permalink / raw)
  To: Dan Williams
  Cc: Serge E. Hallyn, Stephen Rothwell, Greg Kroah-Hartman, linux-api,
	containers, Linux Kernel Mailing List, Eric W. Biederman,
	lxc-devel, Johannes Weiner, Tejun Heo, cgroups, Andrew Morton

On Fri, Jan 01, 2016 at 01:42:57AM -0800, Dan Williams wrote:
> On Fri, Jan 1, 2016 at 12:59 AM, Serge E. Hallyn
> <serge.hallyn@ubuntu.com> wrote:
> > On Fri, Jan 01, 2016 at 12:19:44AM -0800, Dan Williams wrote:
> >> On Tue, Dec 22, 2015 at 8:23 PM,  <serge.hallyn@ubuntu.com> wrote:
> >> > Hi,
> >> >
> >> > following is a revised set of the CGroup Namespace patchset which Aditya
> >> > Kali has previously sent.  The code can also be found in the cgroupns.v8
> >> > branch of
> >> >
> >> > https://git.kernel.org/cgit/linux/kernel/git/sergeh/linux-security.git/
> >>
> >> Happy New Year!
> >>
> >> I'm hitting both a build and boot regression in -next with these
> >> patches.  The bisect log reports:
> >>
> >> git bisect skip f176ae3a5df610e0373c39538cc7744c7788b3df
> >> # bad: [54b39d263704b66cc8006ce8f7dd4743fdabae70] cgroup: cgroup namespace setns support
> >> git bisect bad 54b39d263704b66cc8006ce8f7dd4743fdabae70
> >> # good: [bf9b3590879afbaa54148d817ed68a0b51180375] sched: new clone flag CLONE_NEWCGROUP for cgroup namespace
> >> git bisect good bf9b3590879afbaa54148d817ed68a0b51180375
> >> # only skipped commits left to test
> >> # possible first bad commit: [54b39d263704b66cc8006ce8f7dd4743fdabae70] cgroup: cgroup namespace setns support
> >> # possible first bad commit: [f176ae3a5df610e0373c39538cc7744c7788b3df] cgroup: introduce cgroup namespaces
> >>
> >> Where compiling "f176ae3a5df6"  fails with:
> >>
> >> kernel/cgroup.c: In function ‘cgroupns_put’:
> >> kernel/cgroup.c:5956:2: error: implicit declaration of function ‘to_cg_ns’ [-Werror=implicit-function-declaration]
> >>  put_cgroup_ns(to_cg_ns(ns));
> >
> > Hi,
> >
> > is it possible you have a bad tree?
> 
> No, I don't believe so...
> 
> >  Looking at
> > https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git/tree/kernel/cgroup.c?id=refs/tags/next-20151231
> > to_cg_ns should definitely be defined there.
> >
> 
> Right, but it's not defined until commit 54b39d263704 "cgroup: cgroup
> namespace setns support", whereas it is first used in the preceding
> commit f176ae3a5df6 "cgroup: introduce cgroup namespaces".  The bad
> patch ordering breaks bisection.
> 
> Commit 54b39d263704 "cgroup: cgroup namespace setns support" not
> booting is a separate issue.

Oh - been there since my first version of the set (v4).  Odd, I
thought that the automated korg testing caught those.

What is the simplest way to fix this?  Do I send new versions of
patches 3 and 4?  Does Tejun or Stephen just do it inline in the
git tree?  Do we leave it be?


* Re: CGroup Namespaces (v8)
  2016-01-01 18:06       ` Serge E. Hallyn
@ 2016-01-01 19:14         ` Dan Williams
  2016-01-02 11:52           ` Tejun Heo
  0 siblings, 1 reply; 32+ messages in thread
From: Dan Williams @ 2016-01-01 19:14 UTC (permalink / raw)
  To: Serge E. Hallyn
  Cc: Stephen Rothwell, Greg Kroah-Hartman, linux-api, containers,
	Linux Kernel Mailing List, Eric W. Biederman, lxc-devel,
	Johannes Weiner, Tejun Heo, cgroups, Andrew Morton

On Fri, Jan 1, 2016 at 10:06 AM, Serge E. Hallyn
<serge.hallyn@ubuntu.com> wrote:
> On Fri, Jan 01, 2016 at 01:42:57AM -0800, Dan Williams wrote:
>> Commit 54b39d263704 "cgroup: cgroup namespace setns support" not
>> booting is a separate issue.
>
> Oh - been there since my first version of the set (v4).  Odd, I
> thought that the automated korg testing caught those.
>
> What is the simplest way to fix this?  Do I send new versions of
> patches v3 and v4?  Does Tejun or Stephen just do it inline in the
> git tree?  Do we leave it be?

I'm assuming it can be fixed up when you re-spin the patches to fix
the boot failure.


* Re: CGroup Namespaces (v8)
  2016-01-01 19:14         ` Dan Williams
@ 2016-01-02 11:52           ` Tejun Heo
  0 siblings, 0 replies; 32+ messages in thread
From: Tejun Heo @ 2016-01-02 11:52 UTC (permalink / raw)
  To: Dan Williams
  Cc: Serge E. Hallyn, Stephen Rothwell, Greg Kroah-Hartman, linux-api,
	containers, Linux Kernel Mailing List, Eric W. Biederman,
	lxc-devel, Johannes Weiner, cgroups, Andrew Morton

On Fri, Jan 01, 2016 at 11:14:14AM -0800, Dan Williams wrote:
> On Fri, Jan 1, 2016 at 10:06 AM, Serge E. Hallyn
> <serge.hallyn@ubuntu.com> wrote:
> > On Fri, Jan 01, 2016 at 01:42:57AM -0800, Dan Williams wrote:
> >> Commit 54b39d263704 "cgroup: cgroup namespace setns support" not
> >> booting is a separate issue.
> >
> > Oh - been there since my first version of the set (v4).  Odd, I
> > thought that the automated korg testing caught those.
> >
> > What is the simplest way to fix this?  Do I send new versions of
> > patches 3 and 4?  Does Tejun or Stephen just do it inline in the
> > git tree?  Do we leave it be?
> 
> I'm assuming it can be fixed up when you re-spin the patches to fix
> the boot failure.

I reverted the cgroup namespace patchset for the time being.

Thanks.

-- 
tejun


Thread overview: 32+ messages
2015-12-23  4:23 CGroup Namespaces (v8) serge.hallyn
2015-12-23  4:23 ` [PATCH 1/8] kernfs: Add API to generate relative kernfs path serge.hallyn
2015-12-23 16:08   ` Tejun Heo
2015-12-23 16:36     ` Serge E. Hallyn
2015-12-23 19:33     ` [PATCH 1/8 v8.2] " Serge E. Hallyn
2015-12-23 16:24   ` [PATCH 1/8] " Tejun Heo
2015-12-23 16:51     ` Greg KH
2015-12-23  4:23 ` [PATCH 2/8] sched: new clone flag CLONE_NEWCGROUP for cgroup namespace serge.hallyn
2015-12-23  4:23 ` [PATCH 3/8] cgroup: introduce cgroup namespaces serge.hallyn
2015-12-23 16:15   ` Tejun Heo
2015-12-23 19:34     ` [PATCH 3/8 v8.2] " Serge E. Hallyn
2015-12-23  4:23 ` [PATCH 4/8] cgroup: cgroup namespace setns support serge.hallyn
2015-12-23  4:23 ` [PATCH 5/8] kernfs: define kernfs_node_dentry serge.hallyn
2015-12-23 16:25   ` Tejun Heo
2015-12-23 16:51     ` Greg KH
2015-12-23  4:23 ` [PATCH 6/8] cgroup: mount cgroupns-root when inside non-init cgroupns serge.hallyn
2015-12-31 13:38   ` Sergey Senozhatsky
2016-01-01  0:58     ` Serge E. Hallyn
2016-01-01  1:17       ` Sergey Senozhatsky
2016-01-01  1:56       ` Tejun Heo
2015-12-23  4:23 ` [PATCH 7/8] cgroup: Add documentation for cgroup namespaces serge.hallyn
2015-12-28 17:47   ` Tejun Heo
2015-12-28 21:13     ` Serge Hallyn
2015-12-28 21:48       ` [PATCH] " Tejun Heo
2015-12-23  4:23 ` [PATCH 8/8] Add FS_USERNS_FLAG to cgroup fs serge.hallyn
2015-12-28 17:46 ` CGroup Namespaces (v8) Tejun Heo
2016-01-01  8:19 ` Dan Williams
2016-01-01  8:59   ` Serge E. Hallyn
2016-01-01  9:42     ` Dan Williams
2016-01-01 18:06       ` Serge E. Hallyn
2016-01-01 19:14         ` Dan Williams
2016-01-02 11:52           ` Tejun Heo
