[PATCH 1/5] cgroup2: Always apply root flags on successful superblock obtainance

All of lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH 1/5] cgroup2: Always apply root flags on successful superblock obtainance
@ 2019-01-03 23:44 David Howells
  2019-01-03 23:44 ` [PATCH 2/5] kernfs: Provide a fill_super hook for the subclass filesystem David Howells
                   ` (4 more replies)
  0 siblings, 5 replies; 10+ messages in thread
From: David Howells @ 2019-01-03 23:44 UTC (permalink / raw)
  To: viro; +Cc: Andrei Vagin, avagin, dhowells, linux-fsdevel

Always apply the v2 cgroup root flag settings after successfully obtaining
a superblock, both when the superblock is new and when an already extant
superblock is being shared.

Currently there's a bug in commit b3678086951a5 whereby the flags are only
changed if the superblock wasn't new.  The intention was originally to
effect the change for a new superblock by having kernfs_fill_super() call
back into the subclassing filesystem, but that never got completed.

On further reflection, it's possibly better not to do that so that we don't
have to revert the flag change if there's a later failure.

This patch is on top of Andrei Vagin's v6 cgroup refcount patch.

Fixes: b3678086951a ("kernfs, sysfs, cgroup, intel_rdt: Support fs_context")
Reported-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Andrei Vagin <avagin@gmail.com>
---

 kernel/cgroup/cgroup.c |    7 ++-----
 1 file changed, 2 insertions(+), 5 deletions(-)

diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index a19f0fec9d82..2e5150412ae0 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -2044,12 +2044,9 @@ int cgroup_do_get_tree(struct fs_context *fc)
 		fc->root = nsdentry;
 	}

+	if (ctx->version == 2)
+		apply_cgroup_root_flags(ctx->flags);
 	ret = 0;
-	if (ctx->kfc.new_sb_created)
-		goto out_cgrp;
-	apply_cgroup_root_flags(ctx->flags);
-	return 0;
-
 out_cgrp:
 	return ret;
 }

^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [PATCH 2/5] kernfs: Provide a fill_super hook for the subclass filesystem
  2019-01-03 23:44 [PATCH 1/5] cgroup2: Always apply root flags on successful superblock obtainance David Howells
@ 2019-01-03 23:44 ` David Howells
  2019-01-03 23:44 ` [PATCH 3/5] cgroup: Fix refcounting David Howells
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 10+ messages in thread
From: David Howells @ 2019-01-03 23:44 UTC (permalink / raw)
  To: viro; +Cc: avagin, dhowells, linux-fsdevel

Provide a fill_super hook for a subclass filesystem to use to get a
reference on the resource it attaches to struct kernfs_super_info::root.

Without this, error handling becomes tricky in the event that, say,
kernfs_get_inode() fails in kernfs_fill_super() as the superblock
destructor will be invoked before kernfs_get_tree() returns.

Fixes: b3678086951a ("kernfs, sysfs, cgroup, intel_rdt: Support fs_context")
Signed-off-by: David Howells <dhowells@redhat.com>
---

 fs/kernfs/mount.c      |   13 +++++++++++--
 include/linux/kernfs.h |    1 +
 2 files changed, 12 insertions(+), 2 deletions(-)

diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c
index 3b61a4bb02c4..f04173d29845 100644
--- a/fs/kernfs/mount.c
+++ b/fs/kernfs/mount.c
@@ -217,11 +217,13 @@ struct dentry *kernfs_node_dentry(struct kernfs_node *kn,
 	} while (true);
 }
 
-static int kernfs_fill_super(struct super_block *sb, struct kernfs_fs_context *kfc)
+static int kernfs_fill_super(struct super_block *sb, struct fs_context *fc)
 {
+	struct kernfs_fs_context *kfc = fc->fs_private;
 	struct kernfs_super_info *info = kernfs_info(sb);
 	struct inode *inode;
 	struct dentry *root;
+	int ret;
 
 	info->sb = sb;
 	/* Userspace would break if executables or devices appear on sysfs */
@@ -238,6 +240,12 @@ static int kernfs_fill_super(struct super_block *sb, struct kernfs_fs_context *k
 	/* sysfs dentries and inodes don't require IO to create */
 	sb->s_shrink.seeks = 0;
 
+	if (kfc->fill_super) {
+		ret = kfc->fill_super(sb, fc);
+		if (ret < 0)
+			return ret;
+	}
+
 	/* get root inode, initialize and unlock it */
 	mutex_lock(&kernfs_mutex);
 	inode = kernfs_get_inode(sb, info->root->kn);
@@ -290,6 +298,7 @@ const void *kernfs_super_ns(struct super_block *sb)
 /**
  * kernfs_get_tree - kernfs filesystem access/retrieval helper
  * @fc: The filesystem context.
+ * @fill_super: The subclass's superblock initialiser function
  *
  * This is to be called from each kernfs user's fs_context->ops->get_tree()
  * implementation, which should set the specified ->@fs_type and ->@flags, and
@@ -321,7 +330,7 @@ int kernfs_get_tree(struct fs_context *fc)
 
 		kfc->new_sb_created = true;
 
-		error = kernfs_fill_super(sb, kfc);
+		error = kernfs_fill_super(sb, fc);
 		if (error) {
 			deactivate_locked_super(sb);
 			return error;
diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
index 60ae0f862a79..a68e3fdd428c 100644
--- a/include/linux/kernfs.h
+++ b/include/linux/kernfs.h
@@ -277,6 +277,7 @@ struct kernfs_fs_context {
 	struct kernfs_root	*root;		/* Root of the hierarchy being mounted */
 	void			*ns_tag;	/* Namespace tag of the mount (or NULL) */
 	unsigned long		magic;		/* File system specific magic number */
+	int (*fill_super)(struct super_block *sb, struct fs_context *fc);
 
 	/* The following are set/used by kernfs_mount() */
 	bool			new_sb_created;	/* Set to T if we allocated a new sb */

^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [PATCH 3/5] cgroup: Fix refcounting
  2019-01-03 23:44 [PATCH 1/5] cgroup2: Always apply root flags on successful superblock obtainance David Howells
  2019-01-03 23:44 ` [PATCH 2/5] kernfs: Provide a fill_super hook for the subclass filesystem David Howells
@ 2019-01-03 23:44 ` David Howells
  2019-01-04  8:16   ` Andrei Vagin
                     ` (2 more replies)
  2019-01-03 23:44 ` [PATCH 4/5] cgroup: refcount fixes David Howells
                   ` (2 subsequent siblings)
  4 siblings, 3 replies; 10+ messages in thread
From: David Howells @ 2019-01-03 23:44 UTC (permalink / raw)
  To: viro; +Cc: avagin, dhowells, linux-fsdevel

Fix cgroup refcounting by the following means:

 (1) Don't use PERCPU_REF_INIT_DEAD and percpu_ref_reinit().  Using this
     causes a problem should kernfs_get_tree() create a superblock and then
     fail to create a root inode.  The superblock destructor will be invoked
     before kernfs_get_tree() returns - and the refcount isn't reinit'd till
     after that.

     To this end, cgroup_setup_root() no longer needs a ref_flags argument.

 (2) Provide a flag, CSS_CREATING, that is used to prevent concurrent access
     to a cgroup root that is being set up and for which the superblock is
     still being set up.  This appears to be necessary to hide the fact that
     the cgroup is accessible before the superblock is fully initialised.

 (3) Set CSS_CREATING in cgroup1_get_tree() on a new cgroup_root.  This is
     then cleared in cgroup_do_get_tree().

 (4) cgroup_get_tree() is made to call cgroup_get() on the root it sets for a
     v2 cgroup.  Admittedly, this doesn't do anything because CSS_NO_REF is
     set, but it future proofs it in case this is changed in future.

 (5) cgroup_fill_super() is created and passed to kernfs_get_tree() in the
     kernfs_fs_context struct.  This takes an extra ref on the root for the
     superblock in the event that a superblock is created.

     struct cgroup_fs_context::root then holds a single ref on the root right
     through till it is freed.

Note that new_root is transferred into the cgroup_fs_context as
is_new_root, though this is probably unnecessary as it's only used to clear
CSS_CREATING - and no one else can gain access to the root until we've
cleared the flag.

Fixes: b3678086951a ("kernfs, sysfs, cgroup, intel_rdt: Support fs_context")
Signed-off-by: David Howells <dhowells@redhat.com>
---

 include/linux/cgroup-defs.h     |    1 +
 include/linux/cgroup.h          |    4 ++++
 kernel/cgroup/cgroup-internal.h |    3 ++-
 kernel/cgroup/cgroup-v1.c       |   25 +++++++++----------------
 kernel/cgroup/cgroup.c          |   24 +++++++++++++++++++++---
 5 files changed, 37 insertions(+), 20 deletions(-)

diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index 5e1694fe035b..f5da2396a809 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -52,6 +52,7 @@ enum {
 	CSS_RELEASED	= (1 << 2), /* refcnt reached zero, released */
 	CSS_VISIBLE	= (1 << 3), /* css is visible to userland */
 	CSS_DYING	= (1 << 4), /* css is dying */
+	CSS_CREATING	= (1 << 5), /* Root css is being constructed */
 };
 
 /* bits in struct cgroup flags field */
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index bb0c7da50ed2..5708ad663572 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -333,6 +333,8 @@ static inline void css_get_many(struct cgroup_subsys_state *css, unsigned int n)
  */
 static inline bool css_tryget(struct cgroup_subsys_state *css)
 {
+	if (css->flags & CSS_CREATING)
+		return false;
 	if (!(css->flags & CSS_NO_REF))
 		return percpu_ref_tryget(&css->refcnt);
 	return true;
@@ -350,6 +352,8 @@ static inline bool css_tryget(struct cgroup_subsys_state *css)
  */
 static inline bool css_tryget_online(struct cgroup_subsys_state *css)
 {
+	if (css->flags & CSS_CREATING)
+		return false;
 	if (!(css->flags & CSS_NO_REF))
 		return percpu_ref_tryget_live(&css->refcnt);
 	return true;
diff --git a/kernel/cgroup/cgroup-internal.h b/kernel/cgroup/cgroup-internal.h
index ff86943ea1c8..b22c3d95d8eb 100644
--- a/kernel/cgroup/cgroup-internal.h
+++ b/kernel/cgroup/cgroup-internal.h
@@ -46,6 +46,7 @@ struct cgroup_fs_context {
 	unsigned int	flags;			/* CGRP_ROOT_* flags */
 
 	/* cgroup1 bits */
+	bool		is_new_root;		/* ->root is new and needs refcnt init */
 	bool		cpuset_clone_children;
 	bool		none;			/* User explicitly requested empty subsystem */
 	bool		all_ss;			/* Seen 'all' option */
@@ -214,7 +215,7 @@ int cgroup_path_ns_locked(struct cgroup *cgrp, char *buf, size_t buflen,
 
 void cgroup_free_root(struct cgroup_root *root);
 void init_cgroup_root(struct cgroup_fs_context *ctx);
-int cgroup_setup_root(struct cgroup_root *root, u16 ss_mask, int ref_flags);
+int cgroup_setup_root(struct cgroup_root *root, u16 ss_mask);
 int rebind_subsystems(struct cgroup_root *dst_root, u16 ss_mask);
 int cgroup_do_get_tree(struct fs_context *fc);
 
diff --git a/kernel/cgroup/cgroup-v1.c b/kernel/cgroup/cgroup-v1.c
index 4b189e821cad..0fbbde86a64d 100644
--- a/kernel/cgroup/cgroup-v1.c
+++ b/kernel/cgroup/cgroup-v1.c
@@ -1156,7 +1156,6 @@ int cgroup1_get_tree(struct fs_context *fc)
 	struct cgroup_root *root;
 	struct cgroup_subsys *ss;
 	int i, ret = cgroup1_validate(fc);
-	bool new_root = false;
 
 	if (ret)
 		return ret;
@@ -1261,12 +1260,19 @@ int cgroup1_get_tree(struct fs_context *fc)
 		ret = -ENOMEM;
 		goto err_unlock;
 	}
-	new_root = true;
 	ctx->root = root;
+	ctx->is_new_root = true;
 
 	init_cgroup_root(ctx);
 
-	ret = cgroup_setup_root(root, ctx->subsys_mask, PERCPU_REF_INIT_DEAD);
+	/*
+	 * There's a race window after we release cgroup_mutex and before
+	 * allocating a superblock. Make sure a concurrent process won't be
+	 * able to re-use the root during this window by setting CSS_CREATING.
+	 */
+	root->cgrp.self.flags |= CSS_CREATING;
+
+	ret = cgroup_setup_root(root, ctx->subsys_mask);
 	if (ret)
 		goto err_unlock;
 
@@ -1275,19 +1281,6 @@ int cgroup1_get_tree(struct fs_context *fc)
 
 	ret = cgroup_do_get_tree(fc);
 
-	/*
-	 * There's a race window after we release cgroup_mutex and before
-	 * allocating a superblock. Make sure a concurrent process won't
-	 * be able to re-use the root during this window by delaying the
-	 * initialization of root refcnt.
-	 */
-	if (new_root) {
-		mutex_lock(&cgroup_mutex);
-		percpu_ref_reinit(&root->cgrp.self.refcnt);
-		mutex_unlock(&cgroup_mutex);
-		cgroup_get(&root->cgrp);
-	}
-
 	/*
 	 * If @pinned_sb, we're reusing an existing root and holding an
 	 * extra ref on its sb.  Mount is complete.  Put the extra ref.
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 2e5150412ae0..091e7eca3661 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -1909,7 +1909,7 @@ void init_cgroup_root(struct cgroup_fs_context *ctx)
 		set_bit(CGRP_CPUSET_CLONE_CHILDREN, &root->cgrp.flags);
 }
 
-int cgroup_setup_root(struct cgroup_root *root, u16 ss_mask, int ref_flags)
+int cgroup_setup_root(struct cgroup_root *root, u16 ss_mask)
 {
 	LIST_HEAD(tmp_links);
 	struct cgroup *root_cgrp = &root->cgrp;
@@ -1926,7 +1926,7 @@ int cgroup_setup_root(struct cgroup_root *root, u16 ss_mask, int ref_flags)
 	root_cgrp->ancestor_ids[0] = ret;
 
 	ret = percpu_ref_init(&root_cgrp->self.refcnt, css_release,
-			      ref_flags, GFP_KERNEL);
+			      0, GFP_KERNEL);
 	if (ret)
 		goto out;
 
@@ -2010,12 +2010,23 @@ int cgroup_setup_root(struct cgroup_root *root, u16 ss_mask, int ref_flags)
 	return ret;
 }
 
+static int cgroup_fill_super(struct super_block *sb, struct fs_context *fc)
+{
+	struct cgroup_fs_context *ctx = cgroup_fc2context(fc);
+
+	mutex_lock(&cgroup_mutex);
+	cgroup_get(&ctx->root->cgrp);
+	mutex_unlock(&cgroup_mutex);
+	return 0;
+}
+
 int cgroup_do_get_tree(struct fs_context *fc)
 {
 	struct cgroup_fs_context *ctx = cgroup_fc2context(fc);
 	int ret;
 
 	ctx->kfc.root = ctx->root->kf_root;
+	ctx->kfc.fill_super = cgroup_fill_super;
 
 	ret = kernfs_get_tree(fc);
 	if (ret < 0)
@@ -2046,6 +2057,12 @@ int cgroup_do_get_tree(struct fs_context *fc)
 
 	if (ctx->version == 2)
 		apply_cgroup_root_flags(ctx->flags);
+
+	if (ctx->is_new_root) {
+		mutex_lock(&cgroup_mutex);
+		ctx->root->cgrp.self.flags &= ~CSS_CREATING;
+		mutex_unlock(&cgroup_mutex);
+	}
 	ret = 0;
 out_cgrp:
 	return ret;
@@ -2071,6 +2088,7 @@ static int cgroup_get_tree(struct fs_context *fc)
 		cgroup_get_live(&cgrp_dfl_root.cgrp);
 
 		ctx->root = &cgrp_dfl_root;
+		cgroup_get(&ctx->root->cgrp);
 		return cgroup_do_get_tree(fc);
 
 	default:
@@ -5420,7 +5438,7 @@ int __init cgroup_init(void)
 	hash_add(css_set_table, &init_css_set.hlist,
 		 css_set_hash(init_css_set.subsys));
 
-	BUG_ON(cgroup_setup_root(&cgrp_dfl_root, 0, 0));
+	BUG_ON(cgroup_setup_root(&cgrp_dfl_root, 0));
 
 	mutex_unlock(&cgroup_mutex);
 

^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [PATCH 4/5] cgroup: refcount fixes
  2019-01-03 23:44 [PATCH 1/5] cgroup2: Always apply root flags on successful superblock obtainance David Howells
  2019-01-03 23:44 ` [PATCH 2/5] kernfs: Provide a fill_super hook for the subclass filesystem David Howells
  2019-01-03 23:44 ` [PATCH 3/5] cgroup: Fix refcounting David Howells
@ 2019-01-03 23:44 ` David Howells
  2019-01-03 23:44 ` [PATCH 5/5] percpu: Kill off PERCPU_REF_INIT_DEAD and percpu_ref_reinit() David Howells
  2019-01-03 23:56 ` [PATCH 4/5] cgroup: refcount fixes David Howells
  4 siblings, 0 replies; 10+ messages in thread
From: David Howells @ 2019-01-03 23:44 UTC (permalink / raw)
  To: viro; +Cc: avagin, dhowells, linux-fsdevel


---

 include/linux/cgroup-defs.h     |    2 ++
 include/linux/cgroup.h          |   32 ++++++++++++++++++++++++--------
 kernel/cgroup/cgroup-internal.h |    1 +
 kernel/cgroup/cgroup-v1.c       |    6 ++++--
 kernel/cgroup/cgroup.c          |   26 ++++++++++++++++++++++----
 5 files changed, 53 insertions(+), 14 deletions(-)

diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index f5da2396a809..34a284ef3a78 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -166,6 +166,8 @@ struct cgroup_subsys_state {
 	 * fields of the containing structure.
 	 */
 	struct cgroup_subsys_state *parent;
+
+	bool debug;
 };
 
 /*
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 5708ad663572..72c076920072 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -295,6 +295,8 @@ void css_task_iter_end(struct css_task_iter *it);
  * Inline functions.
  */
 
+extern void css_refcount(struct cgroup_subsys_state *css, const char *op);
+
 /**
  * css_get - obtain a reference on the specified css
  * @css: target css
@@ -303,8 +305,10 @@ void css_task_iter_end(struct css_task_iter *it);
  */
 static inline void css_get(struct cgroup_subsys_state *css)
 {
-	if (!(css->flags & CSS_NO_REF))
+	if (!(css->flags & CSS_NO_REF)) {
 		percpu_ref_get(&css->refcnt);
+		css_refcount(css, "GET");
+	}
 }
 
 /**
@@ -316,8 +320,10 @@ static inline void css_get(struct cgroup_subsys_state *css)
  */
 static inline void css_get_many(struct cgroup_subsys_state *css, unsigned int n)
 {
-	if (!(css->flags & CSS_NO_REF))
+	if (!(css->flags & CSS_NO_REF)) {
 		percpu_ref_get_many(&css->refcnt, n);
+		css_refcount(css, "GETM");
+	}
 }
 
 /**
@@ -335,8 +341,11 @@ static inline bool css_tryget(struct cgroup_subsys_state *css)
 {
 	if (css->flags & CSS_CREATING)
 		return false;
-	if (!(css->flags & CSS_NO_REF))
-		return percpu_ref_tryget(&css->refcnt);
+	if (css->flags & CSS_NO_REF)
+		return true;
+	if (!percpu_ref_tryget(&css->refcnt))
+		return false;
+	css_refcount(css, "TRY");
 	return true;
 }
 
@@ -354,8 +363,11 @@ static inline bool css_tryget_online(struct cgroup_subsys_state *css)
 {
 	if (css->flags & CSS_CREATING)
 		return false;
-	if (!(css->flags & CSS_NO_REF))
-		return percpu_ref_tryget_live(&css->refcnt);
+	if (css->flags & CSS_NO_REF)
+		return true;
+	if (!percpu_ref_tryget_live(&css->refcnt))
+		return false;
+	css_refcount(css, "TRYO");
 	return true;
 }
 
@@ -387,8 +399,10 @@ static inline bool css_is_dying(struct cgroup_subsys_state *css)
  */
 static inline void css_put(struct cgroup_subsys_state *css)
 {
-	if (!(css->flags & CSS_NO_REF))
+	if (!(css->flags & CSS_NO_REF)) {
 		percpu_ref_put(&css->refcnt);
+		css_refcount(css, "PUT");
+	}
 }
 
 /**
@@ -400,8 +414,10 @@ static inline void css_put(struct cgroup_subsys_state *css)
  */
 static inline void css_put_many(struct cgroup_subsys_state *css, unsigned int n)
 {
-	if (!(css->flags & CSS_NO_REF))
+	if (!(css->flags & CSS_NO_REF)) {
 		percpu_ref_put_many(&css->refcnt, n);
+		css_refcount(css, "PUTM");
+	}
 }
 
 static inline void cgroup_get(struct cgroup *cgrp)
diff --git a/kernel/cgroup/cgroup-internal.h b/kernel/cgroup/cgroup-internal.h
index b22c3d95d8eb..afe417c44602 100644
--- a/kernel/cgroup/cgroup-internal.h
+++ b/kernel/cgroup/cgroup-internal.h
@@ -46,6 +46,7 @@ struct cgroup_fs_context {
 	unsigned int	flags;			/* CGRP_ROOT_* flags */
 
 	/* cgroup1 bits */
+	bool		debug;
 	bool		is_new_root;		/* ->root is new and needs refcnt init */
 	bool		cpuset_clone_children;
 	bool		none;			/* User explicitly requested empty subsystem */
diff --git a/kernel/cgroup/cgroup-v1.c b/kernel/cgroup/cgroup-v1.c
index 0fbbde86a64d..281930390c6f 100644
--- a/kernel/cgroup/cgroup-v1.c
+++ b/kernel/cgroup/cgroup-v1.c
@@ -1033,6 +1033,8 @@ int cgroup1_parse_param(struct fs_context *fc, struct fs_parameter *param)
 			return cg_invalf(fc, "cgroup1: name respecified");
 		ctx->name = param->string;
 		param->string = NULL;
+		if (strcmp(ctx->name, "xxxxy") == 0)
+			ctx->debug = true;
 		return 0;
 	}
 
@@ -1174,7 +1176,7 @@ int cgroup1_get_tree(struct fs_context *fc)
 		    ss->root == &cgrp_dfl_root)
 			continue;
 
-		if (!percpu_ref_tryget_live(&ss->root->cgrp.self.refcnt)) {
+		if (!css_tryget_online(&ss->root->cgrp.self)) {
 			mutex_unlock(&cgroup_mutex);
 			goto err_restart;
 		}
@@ -1227,7 +1229,7 @@ int cgroup1_get_tree(struct fs_context *fc)
 		 */
 		pinned_sb = kernfs_pin_sb(root->kf_root, NULL);
 		if (IS_ERR(pinned_sb) ||
-		    !percpu_ref_tryget_live(&root->cgrp.self.refcnt)) {
+		    !css_tryget_online(&root->cgrp.self)) {
 			mutex_unlock(&cgroup_mutex);
 			if (!IS_ERR_OR_NULL(pinned_sb))
 				deactivate_super(pinned_sb);
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 091e7eca3661..a25c6dee8515 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -1907,6 +1907,8 @@ void init_cgroup_root(struct cgroup_fs_context *ctx)
 		strscpy(root->name, ctx->name, MAX_CGROUP_ROOT_NAMELEN);
 	if (ctx->cpuset_clone_children)
 		set_bit(CGRP_CPUSET_CLONE_CHILDREN, &root->cgrp.flags);
+	if (ctx->debug)
+		cgrp->self.debug = true;
 }
 
 int cgroup_setup_root(struct cgroup_root *root, u16 ss_mask)
@@ -1926,7 +1928,7 @@ int cgroup_setup_root(struct cgroup_root *root, u16 ss_mask)
 	root_cgrp->ancestor_ids[0] = ret;
 
 	ret = percpu_ref_init(&root_cgrp->self.refcnt, css_release,
-			      0, GFP_KERNEL);
+			      PERCPU_REF_INIT_ATOMIC, GFP_KERNEL);
 	if (ret)
 		goto out;
 
@@ -2025,6 +2027,8 @@ int cgroup_do_get_tree(struct fs_context *fc)
 	struct cgroup_fs_context *ctx = cgroup_fc2context(fc);
 	int ret;
 
+	printk("*** cgroup_do_get_tree()\n");
+
 	ctx->kfc.root = ctx->root->kf_root;
 	ctx->kfc.fill_super = cgroup_fill_super;
 
@@ -2072,6 +2076,8 @@ static int cgroup_get_tree(struct fs_context *fc)
 {
 	struct cgroup_fs_context *ctx = cgroup_fc2context(fc);
 
+	printk("*** cgroup_get_tree()\n");
+
 	/*
 	 * The first time anyone tries to mount a cgroup, enable the list
 	 * linking each css_set to its tasks and fix up all existing tasks.
@@ -2169,8 +2175,10 @@ static void cgroup_kill_sb(struct super_block *sb)
 	if (!list_empty(&root->cgrp.self.children) ||
 	    root == &cgrp_dfl_root)
 		cgroup_put(&root->cgrp);
-	else
+	else {
+		css_refcount(&root->cgrp.self, "KIL");
 		percpu_ref_kill(&root->cgrp.self.refcnt);
+	}
 
 	kernfs_kill_sb(sb);
 }
@@ -4894,7 +4902,7 @@ static struct cgroup_subsys_state *css_create(struct cgroup *cgrp,
 
 	init_and_link_css(css, ss, cgrp);
 
-	err = percpu_ref_init(&css->refcnt, css_release, 0, GFP_KERNEL);
+	err = percpu_ref_init(&css->refcnt, css_release, PERCPU_REF_INIT_ATOMIC, GFP_KERNEL);
 	if (err)
 		goto err_free_css;
 
@@ -4949,7 +4957,7 @@ static struct cgroup *cgroup_create(struct cgroup *parent)
 	if (!cgrp)
 		return ERR_PTR(-ENOMEM);
 
-	ret = percpu_ref_init(&cgrp->self.refcnt, css_release, 0, GFP_KERNEL);
+	ret = percpu_ref_init(&cgrp->self.refcnt, css_release, PERCPU_REF_INIT_ATOMIC, GFP_KERNEL);
 	if (ret)
 		goto out_free_cgrp;
 
@@ -5197,6 +5205,7 @@ static void kill_css(struct cgroup_subsys_state *css)
 	 * Use percpu_ref_kill_and_confirm() to get notifications as each
 	 * css is confirmed to be seen as killed on all CPUs.
 	 */
+	css_refcount(css, "KLC");
 	percpu_ref_kill_and_confirm(&css->refcnt, css_killed_ref_fn);
 }
 
@@ -5281,6 +5290,7 @@ static int cgroup_destroy_locked(struct cgroup *cgrp)
 	cgroup1_check_for_release(parent);
 
 	/* put the base reference */
+	css_refcount(&cgrp->self, "KIL");
 	percpu_ref_kill(&cgrp->self.refcnt);
 
 	return 0;
@@ -6109,3 +6119,11 @@ static int __init cgroup_sysfs_init(void)
 }
 subsys_initcall(cgroup_sysfs_init);
 #endif /* CONFIG_SYSFS */
+
+noinline void css_refcount(struct cgroup_subsys_state *css, const char *op)
+{
+	if (css->debug)
+		printk("*** %p %s %2ld: %pSR\n",
+		       css, op, atomic_long_read(&css->refcnt.count),
+		       __builtin_return_address(0));
+}

^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [PATCH 5/5] percpu: Kill off PERCPU_REF_INIT_DEAD and percpu_ref_reinit()
  2019-01-03 23:44 [PATCH 1/5] cgroup2: Always apply root flags on successful superblock obtainance David Howells
                   ` (2 preceding siblings ...)
  2019-01-03 23:44 ` [PATCH 4/5] cgroup: refcount fixes David Howells
@ 2019-01-03 23:44 ` David Howells
  2019-01-03 23:56 ` [PATCH 4/5] cgroup: refcount fixes David Howells
  4 siblings, 0 replies; 10+ messages in thread
From: David Howells @ 2019-01-03 23:44 UTC (permalink / raw)
  To: viro; +Cc: avagin, dhowells, linux-fsdevel

Kill off PERCPU_REF_INIT_DEAD and percpu_ref_reinit() for lack of users.

They were being used in cgroups, but this was causing problems with the
refcounting in the event that kernfs_get_tree() created a superblock and
then failed with the refcount not fully initialised.

Signed-off-by: David Howells <dhowells@redhat.com>
---

 include/linux/percpu-refcount.h |    7 -------
 lib/percpu-refcount.c           |   31 +++----------------------------
 2 files changed, 3 insertions(+), 35 deletions(-)

diff --git a/include/linux/percpu-refcount.h b/include/linux/percpu-refcount.h
index b297cd1cd4f1..fdd69b99f331 100644
--- a/include/linux/percpu-refcount.h
+++ b/include/linux/percpu-refcount.h
@@ -77,12 +77,6 @@ enum {
 	 * percpu_ref_switch_to_percpu() is invoked on it.
 	 */
 	PERCPU_REF_INIT_ATOMIC	= 1 << 0,
-
-	/*
-	 * Start dead w/ ref == 0 in atomic mode.  Must be revived with
-	 * percpu_ref_reinit() before used.  Implies INIT_ATOMIC.
-	 */
-	PERCPU_REF_INIT_DEAD	= 1 << 1,
 };
 
 struct percpu_ref {
@@ -109,7 +103,6 @@ void percpu_ref_switch_to_percpu(struct percpu_ref *ref);
 void percpu_ref_kill_and_confirm(struct percpu_ref *ref,
 				 percpu_ref_func_t *confirm_kill);
 void percpu_ref_resurrect(struct percpu_ref *ref);
-void percpu_ref_reinit(struct percpu_ref *ref);
 
 /**
  * percpu_ref_kill - drop the initial ref
diff --git a/lib/percpu-refcount.c b/lib/percpu-refcount.c
index de10b8c0bff6..7a56c36aec5d 100644
--- a/lib/percpu-refcount.c
+++ b/lib/percpu-refcount.c
@@ -70,15 +70,12 @@ int percpu_ref_init(struct percpu_ref *ref, percpu_ref_func_t *release,
 
 	ref->force_atomic = flags & PERCPU_REF_INIT_ATOMIC;
 
-	if (flags & (PERCPU_REF_INIT_ATOMIC | PERCPU_REF_INIT_DEAD))
+	if (flags & PERCPU_REF_INIT_ATOMIC)
 		ref->percpu_count_ptr |= __PERCPU_REF_ATOMIC;
 	else
 		start_count += PERCPU_COUNT_BIAS;
 
-	if (flags & PERCPU_REF_INIT_DEAD)
-		ref->percpu_count_ptr |= __PERCPU_REF_DEAD;
-	else
-		start_count++;
+	start_count++;
 
 	atomic_long_set(&ref->count, start_count);
 
@@ -282,14 +279,11 @@ EXPORT_SYMBOL_GPL(percpu_ref_switch_to_atomic_sync);
  * @ref: percpu_ref to switch to percpu mode
  *
  * There's no reason to use this function for the usual reference counting.
- * To re-use an expired ref, use percpu_ref_reinit().
  *
  * Switch @ref to percpu mode.  This function may be invoked concurrently
  * with all the get/put operations and can safely be mixed with kill and
  * reinit operations.  This function reverses the sticky atomic state set
- * by PERCPU_REF_INIT_ATOMIC or percpu_ref_switch_to_atomic().  If @ref is
- * dying or dead, the actual switching takes place on the following
- * percpu_ref_reinit().
+ * by PERCPU_REF_INIT_ATOMIC or percpu_ref_switch_to_atomic().
  *
  * This function may block if @ref is in the process of switching to atomic
  * mode.  If the caller ensures that @ref is not in the process of
@@ -343,25 +337,6 @@ void percpu_ref_kill_and_confirm(struct percpu_ref *ref,
 }
 EXPORT_SYMBOL_GPL(percpu_ref_kill_and_confirm);
 
-/**
- * percpu_ref_reinit - re-initialize a percpu refcount
- * @ref: perpcu_ref to re-initialize
- *
- * Re-initialize @ref so that it's in the same state as when it finished
- * percpu_ref_init() ignoring %PERCPU_REF_INIT_DEAD.  @ref must have been
- * initialized successfully and reached 0 but not exited.
- *
- * Note that percpu_ref_tryget[_live]() are safe to perform on @ref while
- * this function is in progress.
- */
-void percpu_ref_reinit(struct percpu_ref *ref)
-{
-	WARN_ON_ONCE(!percpu_ref_is_zero(ref));
-
-	percpu_ref_resurrect(ref);
-}
-EXPORT_SYMBOL_GPL(percpu_ref_reinit);
-
 /**
  * percpu_ref_resurrect - modify a percpu refcount from dead to live
  * @ref: perpcu_ref to resurrect

^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [PATCH 4/5] cgroup: refcount fixes
  2019-01-03 23:44 [PATCH 1/5] cgroup2: Always apply root flags on successful superblock obtainance David Howells
                   ` (3 preceding siblings ...)
  2019-01-03 23:44 ` [PATCH 5/5] percpu: Kill off PERCPU_REF_INIT_DEAD and percpu_ref_reinit() David Howells
@ 2019-01-03 23:56 ` David Howells
  4 siblings, 0 replies; 10+ messages in thread
From: David Howells @ 2019-01-03 23:56 UTC (permalink / raw)
  Cc: dhowells, viro, avagin, linux-fsdevel

This patch is just some debugging stuff, allowing the refcounting to be
traced; it's not intended for inclusion upstream, but rather included in case
it's useful.

David

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH 3/5] cgroup: Fix refcounting
  2019-01-03 23:44 ` [PATCH 3/5] cgroup: Fix refcounting David Howells
@ 2019-01-04  8:16   ` Andrei Vagin
  2019-01-05  6:37   ` Al Viro
  2019-01-05  7:21   ` David Howells
  2 siblings, 0 replies; 10+ messages in thread
From: Andrei Vagin @ 2019-01-04  8:16 UTC (permalink / raw)
  To: David Howells; +Cc: viro, linux-fsdevel

On Thu, Jan 03, 2019 at 11:44:16PM +0000, David Howells wrote:
> Fix cgroup refcounting by the following means:
> 
>  (1) Don't use PERCPU_REF_INIT_DEAD and percpu_ref_reinit().  Using this
>      causes a problem should kernfs_get_tree() create a superblock and then
>      fail to create a root inode.  The superblock destructor will be invoked
>      before kernfs_get_tree() returns - and the refcount isn't reinit'd till
>      after that.
> 
>      To this end, cgroup_setup_root() no longer needs a ref_flags argument.
> 
>  (2) Provide a flag, CSS_CREATING, that is used to prevent concurrent access
>      to a cgroup root that is being set up and for which the superblock is
>      still being set up.  This appears to be necessary to hide the fact that
>      the cgroup is accessible before the superblock is fully initialised.
> 
>  (3) Set CSS_CREATING in cgroup1_get_tree() on a new cgroup_root.  This is
>      then cleared in cgroup_do_get_tree().
> 
>  (4) cgroup_get_tree() is made to call cgroup_get() on the root it sets for a
>      v2 cgroup.  Admittedly, this doesn't do anything because CSS_NO_REF is
>      set, but it future proofs it in case this is changed in future.
> 
>  (5) cgroup_fill_super() is created and passed to kernfs_get_tree() in the
>      kernfs_fs_context struct.  This takes an extra ref on the root for the
>      superblock in the event that a superblock is created.
> 
>      struct cgroup_fs_context::root then holds a single ref on the root right
>      through till it is freed.
> 
> Note that new_root is transferred into the cgroup_fs_context as
> is_new_root, though this is probably unnecessary as it's only used to clear
> CSS_CREATING - and no one else can gain access to the root until we've
> cleared the flag.
> 

I haven't read the patch yet, but all my tests passed without any
visible problems with this series. Thanks!

https://travis-ci.org/avagin/linux/builds/475101517

Tested-by: Andrei Vagin <avagin@gmail.com>
Reported-by: Andrei Vagin <avagin@gmail.com>

> Fixes: b3678086951a ("kernfs, sysfs, cgroup, intel_rdt: Support fs_context")
> Signed-off-by: David Howells <dhowells@redhat.com>
> ---
> 
>  include/linux/cgroup-defs.h     |    1 +
>  include/linux/cgroup.h          |    4 ++++
>  kernel/cgroup/cgroup-internal.h |    3 ++-
>  kernel/cgroup/cgroup-v1.c       |   25 +++++++++----------------
>  kernel/cgroup/cgroup.c          |   24 +++++++++++++++++++++---
>  5 files changed, 37 insertions(+), 20 deletions(-)
> 
> diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
> index 5e1694fe035b..f5da2396a809 100644
> --- a/include/linux/cgroup-defs.h
> +++ b/include/linux/cgroup-defs.h
> @@ -52,6 +52,7 @@ enum {
>  	CSS_RELEASED	= (1 << 2), /* refcnt reached zero, released */
>  	CSS_VISIBLE	= (1 << 3), /* css is visible to userland */
>  	CSS_DYING	= (1 << 4), /* css is dying */
> +	CSS_CREATING	= (1 << 5), /* Root css is being constructed */
>  };
>  
>  /* bits in struct cgroup flags field */
> diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
> index bb0c7da50ed2..5708ad663572 100644
> --- a/include/linux/cgroup.h
> +++ b/include/linux/cgroup.h
> @@ -333,6 +333,8 @@ static inline void css_get_many(struct cgroup_subsys_state *css, unsigned int n)
>   */
>  static inline bool css_tryget(struct cgroup_subsys_state *css)
>  {
> +	if (css->flags & CSS_CREATING)
> +		return false;
>  	if (!(css->flags & CSS_NO_REF))
>  		return percpu_ref_tryget(&css->refcnt);
>  	return true;
> @@ -350,6 +352,8 @@ static inline bool css_tryget(struct cgroup_subsys_state *css)
>   */
>  static inline bool css_tryget_online(struct cgroup_subsys_state *css)
>  {
> +	if (css->flags & CSS_CREATING)
> +		return false;
>  	if (!(css->flags & CSS_NO_REF))
>  		return percpu_ref_tryget_live(&css->refcnt);
>  	return true;
> diff --git a/kernel/cgroup/cgroup-internal.h b/kernel/cgroup/cgroup-internal.h
> index ff86943ea1c8..b22c3d95d8eb 100644
> --- a/kernel/cgroup/cgroup-internal.h
> +++ b/kernel/cgroup/cgroup-internal.h
> @@ -46,6 +46,7 @@ struct cgroup_fs_context {
>  	unsigned int	flags;			/* CGRP_ROOT_* flags */
>  
>  	/* cgroup1 bits */
> +	bool		is_new_root;		/* ->root is new and needs refcnt init */
>  	bool		cpuset_clone_children;
>  	bool		none;			/* User explicitly requested empty subsystem */
>  	bool		all_ss;			/* Seen 'all' option */
> @@ -214,7 +215,7 @@ int cgroup_path_ns_locked(struct cgroup *cgrp, char *buf, size_t buflen,
>  
>  void cgroup_free_root(struct cgroup_root *root);
>  void init_cgroup_root(struct cgroup_fs_context *ctx);
> -int cgroup_setup_root(struct cgroup_root *root, u16 ss_mask, int ref_flags);
> +int cgroup_setup_root(struct cgroup_root *root, u16 ss_mask);
>  int rebind_subsystems(struct cgroup_root *dst_root, u16 ss_mask);
>  int cgroup_do_get_tree(struct fs_context *fc);
>  
> diff --git a/kernel/cgroup/cgroup-v1.c b/kernel/cgroup/cgroup-v1.c
> index 4b189e821cad..0fbbde86a64d 100644
> --- a/kernel/cgroup/cgroup-v1.c
> +++ b/kernel/cgroup/cgroup-v1.c
> @@ -1156,7 +1156,6 @@ int cgroup1_get_tree(struct fs_context *fc)
>  	struct cgroup_root *root;
>  	struct cgroup_subsys *ss;
>  	int i, ret = cgroup1_validate(fc);
> -	bool new_root = false;
>  
>  	if (ret)
>  		return ret;
> @@ -1261,12 +1260,19 @@ int cgroup1_get_tree(struct fs_context *fc)
>  		ret = -ENOMEM;
>  		goto err_unlock;
>  	}
> -	new_root = true;
>  	ctx->root = root;
> +	ctx->is_new_root = true;
>  
>  	init_cgroup_root(ctx);
>  
> -	ret = cgroup_setup_root(root, ctx->subsys_mask, PERCPU_REF_INIT_DEAD);
> +	/*
> +	 * There's a race window after we release cgroup_mutex and before
> +	 * allocating a superblock. Make sure a concurrent process won't be
> +	 * able to re-use the root during this window by setting CSS_CREATING.
> +	 */
> +	root->cgrp.self.flags |= CSS_CREATING;
> +
> +	ret = cgroup_setup_root(root, ctx->subsys_mask);
>  	if (ret)
>  		goto err_unlock;
>  
> @@ -1275,19 +1281,6 @@ int cgroup1_get_tree(struct fs_context *fc)
>  
>  	ret = cgroup_do_get_tree(fc);
>  
> -	/*
> -	 * There's a race window after we release cgroup_mutex and before
> -	 * allocating a superblock. Make sure a concurrent process won't
> -	 * be able to re-use the root during this window by delaying the
> -	 * initialization of root refcnt.
> -	 */
> -	if (new_root) {
> -		mutex_lock(&cgroup_mutex);
> -		percpu_ref_reinit(&root->cgrp.self.refcnt);
> -		mutex_unlock(&cgroup_mutex);
> -		cgroup_get(&root->cgrp);
> -	}
> -
>  	/*
>  	 * If @pinned_sb, we're reusing an existing root and holding an
>  	 * extra ref on its sb.  Mount is complete.  Put the extra ref.
> diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
> index 2e5150412ae0..091e7eca3661 100644
> --- a/kernel/cgroup/cgroup.c
> +++ b/kernel/cgroup/cgroup.c
> @@ -1909,7 +1909,7 @@ void init_cgroup_root(struct cgroup_fs_context *ctx)
>  		set_bit(CGRP_CPUSET_CLONE_CHILDREN, &root->cgrp.flags);
>  }
>  
> -int cgroup_setup_root(struct cgroup_root *root, u16 ss_mask, int ref_flags)
> +int cgroup_setup_root(struct cgroup_root *root, u16 ss_mask)
>  {
>  	LIST_HEAD(tmp_links);
>  	struct cgroup *root_cgrp = &root->cgrp;
> @@ -1926,7 +1926,7 @@ int cgroup_setup_root(struct cgroup_root *root, u16 ss_mask, int ref_flags)
>  	root_cgrp->ancestor_ids[0] = ret;
>  
>  	ret = percpu_ref_init(&root_cgrp->self.refcnt, css_release,
> -			      ref_flags, GFP_KERNEL);
> +			      0, GFP_KERNEL);
>  	if (ret)
>  		goto out;
>  
> @@ -2010,12 +2010,23 @@ int cgroup_setup_root(struct cgroup_root *root, u16 ss_mask, int ref_flags)
>  	return ret;
>  }
>  
> +static int cgroup_fill_super(struct super_block *sb, struct fs_context *fc)
> +{
> +	struct cgroup_fs_context *ctx = cgroup_fc2context(fc);
> +
> +	mutex_lock(&cgroup_mutex);
> +	cgroup_get(&ctx->root->cgrp);
> +	mutex_unlock(&cgroup_mutex);
> +	return 0;
> +}
> +
>  int cgroup_do_get_tree(struct fs_context *fc)
>  {
>  	struct cgroup_fs_context *ctx = cgroup_fc2context(fc);
>  	int ret;
>  
>  	ctx->kfc.root = ctx->root->kf_root;
> +	ctx->kfc.fill_super = cgroup_fill_super;
>  
>  	ret = kernfs_get_tree(fc);
>  	if (ret < 0)
> @@ -2046,6 +2057,12 @@ int cgroup_do_get_tree(struct fs_context *fc)
>  
>  	if (ctx->version == 2)
>  		apply_cgroup_root_flags(ctx->flags);
> +
> +	if (ctx->is_new_root) {
> +		mutex_lock(&cgroup_mutex);
> +		ctx->root->cgrp.self.flags &= ~CSS_CREATING;
> +		mutex_unlock(&cgroup_mutex);
> +	}
>  	ret = 0;
>  out_cgrp:
>  	return ret;
> @@ -2071,6 +2088,7 @@ static int cgroup_get_tree(struct fs_context *fc)
>  		cgroup_get_live(&cgrp_dfl_root.cgrp);
>  
>  		ctx->root = &cgrp_dfl_root;
> +		cgroup_get(&ctx->root->cgrp);
>  		return cgroup_do_get_tree(fc);
>  
>  	default:
> @@ -5420,7 +5438,7 @@ int __init cgroup_init(void)
>  	hash_add(css_set_table, &init_css_set.hlist,
>  		 css_set_hash(init_css_set.subsys));
>  
> -	BUG_ON(cgroup_setup_root(&cgrp_dfl_root, 0, 0));
> +	BUG_ON(cgroup_setup_root(&cgrp_dfl_root, 0));
>  
>  	mutex_unlock(&cgroup_mutex);
>  
> 

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH 3/5] cgroup: Fix refcounting
  2019-01-03 23:44 ` [PATCH 3/5] cgroup: Fix refcounting David Howells
  2019-01-04  8:16   ` Andrei Vagin
@ 2019-01-05  6:37   ` Al Viro
  2019-01-05 16:05     ` Al Viro
  2019-01-05  7:21   ` David Howells
  2 siblings, 1 reply; 10+ messages in thread
From: Al Viro @ 2019-01-05  6:37 UTC (permalink / raw)
  To: David Howells; +Cc: avagin, linux-fsdevel

On Thu, Jan 03, 2019 at 11:44:16PM +0000, David Howells wrote:
> Fix cgroup refcounting by the following means:
> 
>  (1) Don't use PERCPU_REF_INIT_DEAD and percpu_ref_reinit().  Using this
>      causes a problem should kernfs_get_tree() create a superblock and then
>      fail to create a root inode.  The superblock destructor will be invoked
>      before kernfs_get_tree() returns - and the refcount isn't reinit'd till
>      after that.
> 
>      To this end, cgroup_setup_root() no longer needs a ref_flags argument.
> 
>  (2) Provide a flag, CSS_CREATING, that is used to prevent concurrent access
>      to a cgroup root that is being set up and for which the superblock is
>      still being set up.  This appears to be necessary to hide the fact that
>      the cgroup is accessible before the superblock is fully initialised.
> 
>  (3) Set CSS_CREATING in cgroup1_get_tree() on a new cgroup_root.  This is
>      then cleared in cgroup_do_get_tree().
> 
>  (4) cgroup_get_tree() is made to call cgroup_get() on the root it sets for a
>      v2 cgroup.  Admittedly, this doesn't do anything because CSS_NO_REF is
>      set, but it future proofs it in case this is changed in future.
> 
>  (5) cgroup_fill_super() is created and passed to kernfs_get_tree() in the
>      kernfs_fs_context struct.  This takes an extra ref on the root for the
>      superblock in the event that a superblock is created.
> 
>      struct cgroup_fs_context::root then holds a single ref on the root right
>      through till it is freed.
> 
> Note that new_root is transferred into the cgroup_fs_context as
> is_new_root, though this is probably unnecessary as it's only used to clear
> CSS_CREATING - and no one else can gain access to the root until we've
> cleared the flag.

Umm...  It'll need reordering; could you do CSS_CREATING parts on top of
mainline, with the rest rediffed on top of that?  We'll need to backport
the fix, obviously...

Re (4) - that's over the top for backporting.  Sure, the analogue of the
problem does exist in mainline - 
        if (IS_ERR(dentry) || !new_sb)
                cgroup_put(&root->cgrp);
can't tell kernfs_mount_ns() failing very early from kernfs_fill_super()
failing and triggering ->kill_sb().  The latter case has a reference
dropped by cgroup_kill_sb(); in the former nothing of that sort
happens.  Said that, we have the following cases:
	1) early failure (anything up to sget_userns() in mainline).
ERR_PTR() is returned, reference not consumed.
	2) successful reuse of existing superblock.  Normal dentry
returned, new_sb set to false, reference not consumed.
	3) new superblock, with failing kernfs_fill_super().
ERR_PTR() returned, new_sb set to true.  Reference consumed.
	4) new superblock, successful setup.  Normal dentry returned,
new_sb set to true.  Reference consumed.

Cases (2) and (4) (!IS_ERR(dentry)) are fine - reference is consumed
iff new_sb is true.  Case (1) is also fine - reference is not consumed.
Case (3) is where we go wrong - the test comes up with "do cgroup_put()",
which we don't want.

Note that case (1) leaves new_sb uninitialized.  And *that* is the
root cause of this shit - if we initialized it (in cgroup_do_mount())
to false, the test would've been simply "if (!new_sb)".  In all
four cases.

That's precisely the same kind of bug as the one fixed in
7b745a4e4051 ("unfuck sysfs_mount()") last May...

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH 3/5] cgroup: Fix refcounting
  2019-01-03 23:44 ` [PATCH 3/5] cgroup: Fix refcounting David Howells
  2019-01-04  8:16   ` Andrei Vagin
  2019-01-05  6:37   ` Al Viro
@ 2019-01-05  7:21   ` David Howells
  2 siblings, 0 replies; 10+ messages in thread
From: David Howells @ 2019-01-05  7:21 UTC (permalink / raw)
  To: Al Viro; +Cc: dhowells, avagin, linux-fsdevel

Al Viro <viro@zeniv.linux.org.uk> wrote:

> Umm...  It'll need reordering; could you do CSS_CREATING parts on top of
> mainline, with the rest rediffed on top of that?  We'll need to backport
> the fix, obviously...

Not before Tuesday, probably:-(

David

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH 3/5] cgroup: Fix refcounting
  2019-01-05  6:37   ` Al Viro
@ 2019-01-05 16:05     ` Al Viro
  0 siblings, 0 replies; 10+ messages in thread
From: Al Viro @ 2019-01-05 16:05 UTC (permalink / raw)
  To: David Howells; +Cc: avagin, linux-fsdevel

On Sat, Jan 05, 2019 at 06:37:03AM +0000, Al Viro wrote:

> Cases (2) and (4) (!IS_ERR(dentry)) are fine - reference is consumed
> iff new_sb is true.  Case (1) is also fine - reference is not consumed.
> Case (3) is where we go wrong - the test comes up with "do cgroup_put()",
> which we don't want.
> 
> Note that case (1) leaves new_sb uninitialized.  And *that* is the
> root cause of this shit - if we initialized it (in cgroup_do_mount())
> to false, the test would've been simply "if (!new_sb)".  In all
> four cases.
> 
> That's precisely the same kind of bug as the one fixed in
> 7b745a4e4051 ("unfuck sysfs_mount()") last May...

Egads...  That's not all there is in there; kernfs_node_dentry() failures
are... not handled well.

To start with, kernfs_node_dentry() leaks.  Consider this:
struct dentry *kernfs_node_dentry(struct kernfs_node *kn,
                                  struct super_block *sb)
{
...
        dentry = dget(sb->s_root);

        /* Check if this is the root kernfs_node */
        if (!kn->parent)
                return dentry;
OK, it might (and normally will) return an extra reference to a dentry on
the given superblock.  However,
        knparent = find_next_ancestor(kn, NULL);
        if (WARN_ON(!knparent))
                return ERR_PTR(-EINVAL);
right after that will return ERR_PTR() *and* acquire an extra reference to
the root dentry of given superblock.  Then we have a loop:
        do {
                struct dentry *dtmp;
                struct kernfs_node *kntmp;

                if (kn == knparent)
                        return dentry;
                kntmp = find_next_ancestor(kn, knparent);
                if (WARN_ON(!kntmp))
                        return ERR_PTR(-EINVAL);
                dtmp = lookup_one_len_unlocked(kntmp->name, dentry,
                                               strlen(kntmp->name));
Get an extra reference to child of dentry or get ERR_PTR()
                dput(dentry);
... and drop the parent in either case.
                if (IS_ERR(dtmp))
                        return dtmp;
                knparent = kntmp;
                dentry = dtmp;
        } while (true);

So normally we acquire an return an extra reference to a dentry on our
superblock.  In case of error returned by lookup_one_len_unlocked()
we pass that error along; no effect on refcounts.  And in case of
that NULL from find_next_ancestor() we return an error *and* leave
behind an extra reference to some dentry on the superblock.  With
no way for caller to tell which dentry would that be.

IOW, these two if (WARN_ON(!...)) should do dput(dentry) as well,
but the trouble doesn't end there.  These, after all, appear to
be "never can happen" cases; failure of lookup_one_len_unlocked()
is really possible (with -ENOMEM, if nothing else).  And the caller
(cgroup_do_mount()) is doing this:
        if (!IS_ERR(dentry) && ns != &init_cgroup_ns) {
                struct dentry *nsdentry;
                struct cgroup *cgrp;

                mutex_lock(&cgroup_mutex);
                spin_lock_irq(&css_set_lock);

                cgrp = cset_cgroup_from_root(ns->root_cset, root);

                spin_unlock_irq(&css_set_lock);
                mutex_unlock(&cgroup_mutex);

                nsdentry = kernfs_node_dentry(cgrp->kn, dentry->d_sb);
                dput(dentry);
                dentry = nsdentry;
        }

Here dentry is equal to the root of our superblock.  The reference to
dentry is held, so's an active reference to superblock in question and
so is ->s_umount of that superblock.

In success case we end up acquiring an extra reference to dentry on
the same superblock, dropping the reference we used to hold on the
original and use that new dentry instead.

In failure case, though, we end up with
	* no extra references to any dentries on that superblock
	* active reference held to the superblock itself
	* ->s_umount held on it
... and the error is returned to caller and propagated back to
caller of ->mount().  Which has no way to tell which superblock
have we leaked (locked, at that).  IOW, deactivate_locked_super()
was missing not just in David's rewrite - it's needed in the
mainline as well...

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2019-01-05 16:05 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-01-03 23:44 [PATCH 1/5] cgroup2: Always apply root flags on successful superblock obtainance David Howells
2019-01-03 23:44 ` [PATCH 2/5] kernfs: Provide a fill_super hook for the subclass filesystem David Howells
2019-01-03 23:44 ` [PATCH 3/5] cgroup: Fix refcounting David Howells
2019-01-04  8:16   ` Andrei Vagin
2019-01-05  6:37   ` Al Viro
2019-01-05 16:05     ` Al Viro
2019-01-05  7:21   ` David Howells
2019-01-03 23:44 ` [PATCH 4/5] cgroup: refcount fixes David Howells
2019-01-03 23:44 ` [PATCH 5/5] percpu: Kill off PERCPU_REF_INIT_DEAD and percpu_ref_reinit() David Howells
2019-01-03 23:56 ` [PATCH 4/5] cgroup: refcount fixes David Howells

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.