* [RFC PATCH 0/2] support cgroup pool in v1
@ 2021-09-08 12:15 ` Yi Tao
  0 siblings, 0 replies; 35+ messages in thread
From: Yi Tao @ 2021-09-08 12:15 UTC (permalink / raw)
  To: gregkh, tj, lizefan.x, hannes, mcgrof, keescook, yzaikin
  Cc: linux-kernel, cgroups, linux-fsdevel, shanpeic

When containers are started with high concurrency, a corresponding
cgroup must be created for each container and its processes attached
in order to control the container's use of system resources. The
kernel serializes these operations with the global cgroup_mutex to
keep the data consistent, which leads to high long-tail latency for
cgroup-related operations during concurrent startup. For example, when
starting 400 containers, the long-tail latency of creating a cgroup
under each subsystem reaches 900ms, which becomes a performance
bottleneck. The delay consists of two parts: the time spent in the
critical section protected by cgroup_mutex, and the time spent
sleeping while waiting to be rescheduled. The latter grows as CPU
load increases.

To solve this long-tail latency problem, we designed a cgroup pool.
The pool creates a number of cgroups in advance, so that when a user
creates a cgroup through the mkdir system call, a clean cgroup can be
obtained quickly from the pool. The cgroup pool builds on the idea of
cgroup rename: by creating cgroups ahead of time and renaming them on
demand, it shrinks the critical section of cgroup creation and
replaces cgroup_mutex with a dedicated spinlock, which both reduces
scheduling overhead and eases contention with process attachment.

The core idea of the cgroup pool is a hidden kernfs tree. Cgroups are
implemented on top of the kernfs file system, and users manipulate a
cgroup through its kernfs files. We can therefore create a cgroup in
advance and place it in a hidden kernfs tree, where users cannot
operate on it. When the user needs to create one, we move a
pre-created cgroup to the requested location. Because this only
removes a node from one kernfs tree and inserts it into another, it
does not touch any other data of the cgroup or the related subsystems,
so the operation is fast and does not need to hold cgroup_mutex. This
removes the cgroup_mutex bottleneck and shrinks the critical section,
but kernfs_rwsem still protects the kernfs data structures, so the
sleep-and-reschedule time remains.
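
In essence, the fast path boils down to a rename under the parent's
spinlock; a condensed sketch of what patch 2/2 implements (error
handling omitted, rb_to_kn() as defined in the patch):

	/* condensed from cgroup_mkdir_fast_path() in patch 2/2 */
	struct kernfs_node *kn;
	int ret;

	spin_lock(&parent->lock);		/* not cgroup_mutex */
	kn = rb_to_kn(rb_first(&parent->hidden_place->kn->dir.children));
	ret = kernfs_rename(kn, parent_kn, name);	/* move into place */
	spin_unlock(&parent->lock);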

To avoid kernfs_rwsem as well, we introduce a pinned state for kernfs
nodes. When a node is pinned, the lock protecting its data changes
from kernfs_rwsem to a caller-supplied lock. In the cgroup pool
scenario, the parent cgroup owns a spinlock; when the pool is enabled,
the kernfs nodes of all cgroups under that parent are set to the
pinned state, and creating, deleting, and moving those nodes are
protected by the parent's spinlock, so data consistency is preserved.
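
Patch 1/2 below implements this lock selection; in essence, every
protected operation on a node kn does:

	/* as implemented by kernfs_lock()/kernfs_unlock() in patch 1/2 */
	if (!kn->pinned)
		down_write(&kernfs_rwsem);	/* default: global rwsem */
	else
		spin_lock(kn->lock);		/* pinned: caller's spinlock */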

Once the pool is enabled, creating a cgroup takes the fast path and
obtains one from the pool; deleting cgroups still takes the slow path.
When the pool runs low, a delayed work item is scheduled to replenish
it after a configurable interval; the delay avoids competing with
in-flight cgroup creation and hurting performance. If the pool is
exhausted before it is replenished, cgroup creation falls back to the
slow path, so users need to choose an appropriate pool size and
replenishment delay.
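
For illustration, a container runtime might set this up as follows
from user space (paths and values here are examples; the sysctl added
by patch 2/2 lives under /proc/sys/kernel in this series):

	#include <fcntl.h>
	#include <sys/stat.h>
	#include <unistd.h>

	int main(void)
	{
		int fd;

		/* pre-create cgroups under "parent"; pool fills to 2*400 */
		fd = open("/sys/fs/cgroup/cpu/parent/pool_size", O_WRONLY);
		write(fd, "400", 3);
		close(fd);

		/* replenish the pool 200ms after it runs low */
		fd = open("/proc/sys/kernel/cgroup_supply_delay_time", O_WRONLY);
		write(fd, "200", 3);
		close(fd);

		/* subsequent mkdirs take the fast path, drawing from the pool */
		mkdir("/sys/fs/cgroup/cpu/parent/ctr-1", 0755);
		return 0;
	}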

What the patches do:
	1. Add a pinned flag for kernfs nodes, so that they can get rid
	of kernfs_rwsem and be protected by another lock instead.
	2. Add a pool_size interface used to enable and disable the
	cgroup pool.
	3. Add an extra kernfs tree used to hide the cgroups in the pool.
	4. Add a spinlock to protect the kernfs nodes of the cgroups in
	the pool.


Yi Tao (2):
  add pinned flags for kernfs node
  support cgroup pool in v1

 fs/kernfs/dir.c             |  74 ++++++++++++++++-------
 include/linux/cgroup-defs.h |  16 +++++
 include/linux/cgroup.h      |   2 +
 include/linux/kernfs.h      |  14 +++++
 kernel/cgroup/cgroup-v1.c   | 139 ++++++++++++++++++++++++++++++++++++++++++++
 kernel/cgroup/cgroup.c      | 113 ++++++++++++++++++++++++++++++++++-
 kernel/sysctl.c             |   8 +++
 7 files changed, 345 insertions(+), 21 deletions(-)

-- 
1.8.3.1


^ permalink raw reply	[flat|nested] 35+ messages in thread

* [RFC PATCH 1/2] add pinned flags for kernfs node
  2021-09-08 12:15 ` Yi Tao
  (?)
@ 2021-09-08 12:15 ` Yi Tao
  2021-09-08 12:15     ` Yi Tao
                     ` (2 more replies)
  -1 siblings, 3 replies; 35+ messages in thread
From: Yi Tao @ 2021-09-08 12:15 UTC (permalink / raw)
  To: gregkh, tj, lizefan.x, hannes, mcgrof, keescook, yzaikin
  Cc: linux-kernel, cgroups, linux-fsdevel, shanpeic

This patch prepares for the implementation of the cgroup pool. If a
kernfs node is set to pinned, its data is no longer protected by
kernfs internally; for the following operations, the region normally
protected by kernfs_rwsem is instead protected by a caller-specified
spinlock (see the usage sketch after the list):
	1. renaming the node
	2. removing the node
	3. creating a child node
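
A minimal usage sketch (the lock and the call sites here are
hypothetical, not part of this patch):

	static DEFINE_SPINLOCK(my_subtree_lock);	/* hypothetical lock */

	/* take over locking for @parent and all of its descendants */
	kernfs_set_pinned(parent, &my_subtree_lock);

	/* ... rename/remove/create under @parent are now protected by
	 * my_subtree_lock instead of kernfs_rwsem ... */

	/* hand protection back to kernfs_rwsem */
	kernfs_clear_pinned(parent);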

Suggested-by: Shanpei Chen <shanpeic@linux.alibaba.com>
Signed-off-by: Yi Tao <escape@linux.alibaba.com>
---
 fs/kernfs/dir.c        | 74 ++++++++++++++++++++++++++++++++++++--------------
 include/linux/kernfs.h | 14 ++++++++++
 2 files changed, 68 insertions(+), 20 deletions(-)

diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
index ba581429bf7b..68b05b5bc1a2 100644
--- a/fs/kernfs/dir.c
+++ b/fs/kernfs/dir.c
@@ -26,7 +26,6 @@
 
 static bool kernfs_active(struct kernfs_node *kn)
 {
-	lockdep_assert_held(&kernfs_rwsem);
 	return atomic_read(&kn->active) >= 0;
 }
 
@@ -461,10 +460,9 @@ static void kernfs_drain(struct kernfs_node *kn)
 {
 	struct kernfs_root *root = kernfs_root(kn);
 
-	lockdep_assert_held_write(&kernfs_rwsem);
 	WARN_ON_ONCE(kernfs_active(kn));
 
-	up_write(&kernfs_rwsem);
+	kernfs_unlock(kn);
 
 	if (kernfs_lockdep(kn)) {
 		rwsem_acquire(&kn->dep_map, 0, 0, _RET_IP_);
@@ -483,7 +481,7 @@ static void kernfs_drain(struct kernfs_node *kn)
 
 	kernfs_drain_open_files(kn);
 
-	down_write(&kernfs_rwsem);
+	kernfs_lock(kn);
 }
 
 /**
@@ -722,7 +720,7 @@ int kernfs_add_one(struct kernfs_node *kn)
 	bool has_ns;
 	int ret;
 
-	down_write(&kernfs_rwsem);
+	kernfs_lock(parent);
 
 	ret = -EINVAL;
 	has_ns = kernfs_ns_enabled(parent);
@@ -753,7 +751,7 @@ int kernfs_add_one(struct kernfs_node *kn)
 		ps_iattr->ia_mtime = ps_iattr->ia_ctime;
 	}
 
-	up_write(&kernfs_rwsem);
+	kernfs_unlock(parent);
 
 	/*
 	 * Activate the new node unless CREATE_DEACTIVATED is requested.
@@ -767,7 +765,7 @@ int kernfs_add_one(struct kernfs_node *kn)
 	return 0;
 
 out_unlock:
-	up_write(&kernfs_rwsem);
+	kernfs_unlock(parent);
 	return ret;
 }
 
@@ -788,8 +786,6 @@ static struct kernfs_node *kernfs_find_ns(struct kernfs_node *parent,
 	bool has_ns = kernfs_ns_enabled(parent);
 	unsigned int hash;
 
-	lockdep_assert_held(&kernfs_rwsem);
-
 	if (has_ns != (bool)ns) {
 		WARN(1, KERN_WARNING "kernfs: ns %s in '%s' for '%s'\n",
 		     has_ns ? "required" : "invalid", parent->name, name);
@@ -1242,8 +1238,6 @@ static struct kernfs_node *kernfs_next_descendant_post(struct kernfs_node *pos,
 {
 	struct rb_node *rbn;
 
-	lockdep_assert_held_write(&kernfs_rwsem);
-
 	/* if first iteration, visit leftmost descendant which may be root */
 	if (!pos)
 		return kernfs_leftmost_descendant(root);
@@ -1299,8 +1293,6 @@ static void __kernfs_remove(struct kernfs_node *kn)
 {
 	struct kernfs_node *pos;
 
-	lockdep_assert_held_write(&kernfs_rwsem);
-
 	/*
 	 * Short-circuit if non-root @kn has already finished removal.
 	 * This is for kernfs_remove_self() which plays with active ref
@@ -1369,9 +1361,9 @@ static void __kernfs_remove(struct kernfs_node *kn)
  */
 void kernfs_remove(struct kernfs_node *kn)
 {
-	down_write(&kernfs_rwsem);
+	kernfs_lock(kn);
 	__kernfs_remove(kn);
-	up_write(&kernfs_rwsem);
+	kernfs_unlock(kn);
 }
 
 /**
@@ -1525,13 +1517,13 @@ int kernfs_remove_by_name_ns(struct kernfs_node *parent, const char *name,
 		return -ENOENT;
 	}
 
-	down_write(&kernfs_rwsem);
+	kernfs_lock(parent);
 
 	kn = kernfs_find_ns(parent, name, ns);
 	if (kn)
 		__kernfs_remove(kn);
 
-	up_write(&kernfs_rwsem);
+	kernfs_unlock(parent);
 
 	if (kn)
 		return 0;
@@ -1557,7 +1549,9 @@ int kernfs_rename_ns(struct kernfs_node *kn, struct kernfs_node *new_parent,
 	if (!kn->parent)
 		return -EINVAL;
 
-	down_write(&kernfs_rwsem);
+	/* if parent is pinned, parent->lock protects rename */
+	if (!kn->parent->pinned)
+		down_write(&kernfs_rwsem);
 
 	error = -ENOENT;
 	if (!kernfs_active(kn) || !kernfs_active(new_parent) ||
@@ -1576,7 +1570,8 @@ int kernfs_rename_ns(struct kernfs_node *kn, struct kernfs_node *new_parent,
 	/* rename kernfs_node */
 	if (strcmp(kn->name, new_name) != 0) {
 		error = -ENOMEM;
-		new_name = kstrdup_const(new_name, GFP_KERNEL);
+		/* use GFP_ATOMIC to avoid sleep */
+		new_name = kstrdup_const(new_name, GFP_ATOMIC);
 		if (!new_name)
 			goto out;
 	} else {
@@ -1611,10 +1606,49 @@ int kernfs_rename_ns(struct kernfs_node *kn, struct kernfs_node *new_parent,
 
 	error = 0;
  out:
-	up_write(&kernfs_rwsem);
+	if (!kn->parent->pinned)
+		up_write(&kernfs_rwsem);
 	return error;
 }
 
+/* Traverse all descendants and set pinned */
+void kernfs_set_pinned(struct kernfs_node *kn, spinlock_t *lock)
+{
+	struct kernfs_node *pos = NULL;
+
+	while ((pos = kernfs_next_descendant_post(pos, kn))) {
+		pos->pinned = true;
+		pos->lock = lock;
+	}
+}
+
+/* Traverse all descendants and clear pinned */
+void kernfs_clear_pinned(struct kernfs_node *kn)
+{
+	struct kernfs_node *pos = NULL;
+
+	while ((pos = kernfs_next_descendant_post(pos, kn))) {
+		pos->pinned = false;
+		pos->lock = NULL;
+	}
+}
+
+void kernfs_lock(struct kernfs_node *kn)
+{
+	if (!kn->pinned)
+		down_write(&kernfs_rwsem);
+	else
+		spin_lock(kn->lock);
+}
+
+void kernfs_unlock(struct kernfs_node *kn)
+{
+	if (!kn->pinned)
+		up_write(&kernfs_rwsem);
+	else
+		spin_unlock(kn->lock);
+}
+
 /* Relationship between mode and the DT_xxx types */
 static inline unsigned char dt_type(struct kernfs_node *kn)
 {
diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
index 1093abf7c28c..a70d96308c51 100644
--- a/include/linux/kernfs.h
+++ b/include/linux/kernfs.h
@@ -161,6 +161,13 @@ struct kernfs_node {
 	unsigned short		flags;
 	umode_t			mode;
 	struct kernfs_iattrs	*iattr;
+
+	/*
+	 * If pinned is true, use @lock instead of kernfs_rwsem to protect
+	 * removing or renaming this kernfs node and creating child nodes.
+	 */
+	bool			pinned;
+	spinlock_t		*lock;
 };
 
 /*
@@ -415,6 +422,11 @@ int kernfs_xattr_set(struct kernfs_node *kn, const char *name,
 
 struct kernfs_node *kernfs_find_and_get_node_by_id(struct kernfs_root *root,
 						   u64 id);
+
+void kernfs_set_pinned(struct kernfs_node *kn, spinlock_t *lock);
+void kernfs_clear_pinned(struct kernfs_node *kn);
+void kernfs_lock(struct kernfs_node *kn);
+void kernfs_unlock(struct kernfs_node *kn);
 #else	/* CONFIG_KERNFS */
 
 static inline enum kernfs_node_type kernfs_type(struct kernfs_node *kn)
@@ -528,6 +540,8 @@ static inline void kernfs_kill_sb(struct super_block *sb) { }
 
 static inline void kernfs_init(void) { }
 
+static inline void kernfs_set_pinned(struct kernfs_node *kn, spinlock_t *lock) {}
+static inline void kernfs_clear_pinned(struct kernfs_node *kn) {}
 #endif	/* CONFIG_KERNFS */
 
 /**
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [RFC PATCH 2/2] support cgroup pool in v1
@ 2021-09-08 12:15     ` Yi Tao
  0 siblings, 0 replies; 35+ messages in thread
From: Yi Tao @ 2021-09-08 12:15 UTC (permalink / raw)
  To: gregkh, tj, lizefan.x, hannes, mcgrof, keescook, yzaikin
  Cc: linux-kernel, cgroups, linux-fsdevel, shanpeic

Add a pool_size interface and a delay_time interface. When the user
writes pool_size, a cgroup pool is created; when the user then creates
a cgroup, it takes a fast path protected by a spinlock and obtains one
from the pool. Performance improves in three ways:
	1. the critical section for creating cgroups shrinks
	2. less time is spent sleeping and waiting to be rescheduled
	3. contention with other cgroup operations protected by
	   cgroup_mutex is avoided

Obtaining a cgroup from the pool is essentially a kernfs rename. Using
the pinned kernfs node support from the previous patch, the pooled
cgroups are in the pinned state while the pool is enabled, so the
kernfs data structures are protected by the specified spinlock instead
of cgroup_mutex and kernfs_rwsem.

To keep users from operating on pooled cgroups directly, their kernfs
nodes are placed under a hidden kernfs tree that users cannot touch.
When a user creates a cgroup, the fast path selects a node from the
hidden tree and moves it to the requested position.

As users keep drawing from the pool, the number of cgroups in it
decreases; when it drops below a threshold, the pool is replenished.
To avoid competing with cgroups being created at that moment, the
replenishment can be deferred via the delay_time setting (the
cgroup_supply_delay_time sysctl).
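
A worked example with the constants in this patch: writing
pool_size=400 pre-creates 2*400=800 hidden cgroups; once a fast-path
mkdir leaves fewer than 400 in the pool, the delayed work is scheduled
and refills the pool back to 800 after cgroup_supply_delay_time
milliseconds.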

Suggested-by: Shanpei Chen <shanpeic@linux.alibaba.com>
Signed-off-by: Yi Tao <escape@linux.alibaba.com>
---
 include/linux/cgroup-defs.h |  16 +++++
 include/linux/cgroup.h      |   2 +
 kernel/cgroup/cgroup-v1.c   | 139 ++++++++++++++++++++++++++++++++++++++++++++
 kernel/cgroup/cgroup.c      | 113 ++++++++++++++++++++++++++++++++++-
 kernel/sysctl.c             |   8 +++
 5 files changed, 277 insertions(+), 1 deletion(-)

diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index e1c705fdfa7c..e3e486d1b678 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -486,6 +486,22 @@ struct cgroup {
 	/* Used to store internal freezer state */
 	struct cgroup_freezer_state freezer;
 
+	/*
+	 * cgroup pool related members. lock protects the cgroup's kernfs
+	 * nodes in the pool. pool_index records the index of the next
+	 * cgroup to be put into the pool. pool_amount records how many
+	 * cgroups remain in the pool. pool_size is set by the user; when
+	 * pool_amount drops below pool_size, the pool is refilled until it
+	 * reaches 2*pool_size, so cgroup_mkdir keeps taking the fast path.
+	 */
+	spinlock_t lock;
+	atomic64_t pool_index;
+	atomic64_t pool_amount;
+	u64 pool_size;
+	bool enable_pool;
+	struct kernfs_root *hidden_place;
+	struct delayed_work supply_pool_work;
+
 	/* ids of the ancestors at each level including self */
 	u64 ancestor_ids[];
 };
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 7bf60454a313..ade614b78040 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -432,6 +432,8 @@ static inline void cgroup_put(struct cgroup *cgrp)
 	css_put(&cgrp->self);
 }
 
+extern unsigned int cgroup_supply_delay_time;
+
 /**
  * task_css_set_check - obtain a task's css_set with extra access conditions
  * @task: the task to obtain css_set for
diff --git a/kernel/cgroup/cgroup-v1.c b/kernel/cgroup/cgroup-v1.c
index 35b920328344..8964c4d0741b 100644
--- a/kernel/cgroup/cgroup-v1.c
+++ b/kernel/cgroup/cgroup-v1.c
@@ -609,6 +609,136 @@ static int cgroup_clone_children_write(struct cgroup_subsys_state *css,
 	return 0;
 }
 
+/*
+ * kernfs_open_file->mutex cannot prevent races when pool_size of a parent
+ * cgroup and of a child cgroup are written at the same time. use
+ * cgroup_pool_mutex to serialize all writes to pool_size.
+ */
+DEFINE_MUTEX(cgroup_pool_mutex);
+
+static u64 cgroup_pool_size_read(struct cgroup_subsys_state *css,
+				 struct cftype *cft)
+{
+	/* kernfs r/w access is serialized by kernfs_open_file->mutex */
+	return css->cgroup->pool_size;
+}
+
+extern struct kernfs_node *kernfs_get_active(struct kernfs_node *kn);
+extern void kernfs_put_active(struct kernfs_node *kn);
+
+static ssize_t cgroup_pool_size_write(struct kernfs_open_file *of,
+				  char *buf, size_t nbytes, loff_t off)
+{
+	char name[NAME_MAX + 1];
+	struct cgroup *cgrp;
+	ssize_t ret = -EPERM;
+	u64 val;
+	int i;
+
+	cgrp = of->kn->parent->priv;
+
+	if (kstrtoull(buf, 0, &val))
+		return -EINVAL;
+
+	if (!cgroup_tryget(cgrp))
+		return -ENODEV;
+
+	kernfs_break_active_protection(of->kn);
+	mutex_lock(&cgroup_pool_mutex);
+	mutex_lock(&cgroup_mutex);
+	spin_lock(&cgrp->lock);
+
+	/*
+	 * only zero -> non-zero or non-zero -> zero transitions are valid.
+	 */
+	if ((cgrp->pool_size && val) || (!cgrp->pool_size && !val))
+		goto out_fail;
+
+	if (cgroup_is_dead(cgrp)) {
+		ret = -ENODEV;
+		goto out_fail;
+	}
+
+	cgrp->pool_size = val;
+	spin_unlock(&cgrp->lock);
+	mutex_unlock(&cgroup_mutex);
+
+	if (val) {
+		/* create kernfs root to hide cgroup which belongs to pool */
+		cgrp->hidden_place = kernfs_create_root(NULL, 0, NULL);
+
+		/*
+		 * cgroups in the pool are named pool-*; creation may fail
+		 * if a cgroup with the same name already exists, in which
+		 * case we retry with a different name.
+		 *
+		 * cgroup_mkdir is called here from the context of writing
+		 * pool_size, so call kernfs_get_active to simulate the
+		 * kernfs mkdir context.
+		 *
+		 * a mode of 0xffff is normally rejected at the VFS layer
+		 * as invalid; use it here to tell cgroup_mkdir that it is
+		 * creating a cgroup for the cgroup pool.
+		 */
+		for (i = 0; i < val * 2;) {
+			sprintf(name, "pool-%llu", atomic64_add_return(1, &cgrp->pool_index));
+			kernfs_get_active(cgrp->kn);
+			if (!cgroup_mkdir(cgrp->kn, name, 0xffff))
+				i++;
+			kernfs_put_active(cgrp->kn);
+		}
+		atomic64_set(&cgrp->pool_amount, val * 2);
+
+		/* set kernfs node pinned after generating pool */
+		mutex_lock(&cgroup_mutex);
+		spin_lock(&cgrp->lock);
+		cgrp->enable_pool = true;
+		kernfs_set_pinned(cgrp->kn, &cgrp->lock);
+		kernfs_set_pinned(cgrp->hidden_place->kn, &cgrp->lock);
+		spin_unlock(&cgrp->lock);
+		mutex_unlock(&cgroup_mutex);
+	} else {
+		struct kernfs_node *child, *n;
+		struct rb_root *hidden_root = &cgrp->hidden_place->kn->dir.children;
+
+		/* clear kernfs node pinned before removing pool */
+		mutex_lock(&cgroup_mutex);
+		spin_lock(&cgrp->lock);
+		cgrp->enable_pool = false;
+		kernfs_clear_pinned(cgrp->kn);
+		kernfs_clear_pinned(cgrp->hidden_place->kn);
+		spin_unlock(&cgrp->lock);
+		mutex_unlock(&cgroup_mutex);
+
+		/* pool is disabled, cancel supply work */
+		cancel_delayed_work_sync(&cgrp->supply_pool_work);
+
+		/* traverse cgroup in pool and remove them */
+		while (hidden_root->rb_node) {
+			rbtree_postorder_for_each_entry_safe(child, n, hidden_root, rb) {
+				kernfs_get_active(child);
+				ret = cgroup_rmdir(child);
+				kernfs_put_active(child);
+			}
+		}
+		kernfs_destroy_root(cgrp->hidden_place);
+		atomic64_set(&cgrp->pool_amount, 0);
+	}
+
+
+	ret = nbytes;
+	goto out_success;
+
+out_fail:
+	spin_unlock(&cgrp->lock);
+	mutex_unlock(&cgroup_mutex);
+out_success:
+	mutex_unlock(&cgroup_pool_mutex);
+	kernfs_unbreak_active_protection(of->kn);
+	cgroup_put(cgrp);
+	return ret;
+}
+
 /* cgroup core interface files for the legacy hierarchies */
 struct cftype cgroup1_base_files[] = {
 	{
@@ -651,6 +781,11 @@ struct cftype cgroup1_base_files[] = {
 		.write = cgroup_release_agent_write,
 		.max_write_len = PATH_MAX - 1,
 	},
+	{
+		.name = "pool_size",
+		.read_u64 = cgroup_pool_size_read,
+		.write = cgroup_pool_size_write,
+	},
 	{ }	/* terminate */
 };
 
@@ -845,9 +980,13 @@ static int cgroup1_rename(struct kernfs_node *kn, struct kernfs_node *new_parent
 
 	mutex_lock(&cgroup_mutex);
 
+	if (kn->parent->pinned)
+		spin_lock(kn->parent->lock);
 	ret = kernfs_rename(kn, new_parent, new_name_str);
 	if (!ret)
 		TRACE_CGROUP_PATH(rename, cgrp);
+	if (kn->parent->pinned)
+		spin_unlock(kn->parent->lock);
 
 	mutex_unlock(&cgroup_mutex);
 
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 881ce1470beb..d259e3bdca2a 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -245,6 +245,7 @@ static struct cgroup_subsys_state *css_create(struct cgroup *cgrp,
 static int cgroup_addrm_files(struct cgroup_subsys_state *css,
 			      struct cgroup *cgrp, struct cftype cfts[],
 			      bool is_add);
+static void cgroup_supply_work(struct work_struct *work);
 
 /**
  * cgroup_ssid_enabled - cgroup subsys enabled test by subsys ID
@@ -1925,6 +1926,7 @@ static void init_cgroup_housekeeping(struct cgroup *cgrp)
 	INIT_LIST_HEAD(&cgrp->cset_links);
 	INIT_LIST_HEAD(&cgrp->pidlists);
 	mutex_init(&cgrp->pidlist_mutex);
+	spin_lock_init(&cgrp->lock);
 	cgrp->self.cgroup = cgrp;
 	cgrp->self.flags |= CSS_ONLINE;
 	cgrp->dom_cgrp = cgrp;
@@ -1938,6 +1940,7 @@ static void init_cgroup_housekeeping(struct cgroup *cgrp)
 
 	init_waitqueue_head(&cgrp->offline_waitq);
 	INIT_WORK(&cgrp->release_agent_work, cgroup1_release_agent);
+	INIT_DELAYED_WORK(&cgrp->supply_pool_work, cgroup_supply_work);
 }
 
 void init_cgroup_root(struct cgroup_fs_context *ctx)
@@ -5419,15 +5422,113 @@ static bool cgroup_check_hierarchy_limits(struct cgroup *parent)
 	return ret;
 }
 
+extern struct kernfs_node *kernfs_get_active(struct kernfs_node *kn);
+extern void kernfs_put_active(struct kernfs_node *kn);
+
+unsigned int cgroup_supply_delay_time;
+
+/* refill the pool until pool_amount reaches 2*pool_size */
+static void cgroup_supply_work(struct work_struct *work)
+{
+	char name[NAME_MAX + 1];
+	struct cgroup *parent = container_of((struct delayed_work *)work,
+				struct cgroup, supply_pool_work);
+	struct kernfs_node *parent_kn = parent->kn;
+
+	while (atomic64_read(&parent->pool_amount) < 2 * parent->pool_size) {
+		sprintf(name, "pool-%llu", atomic64_add_return(1, &parent->pool_index));
+		kernfs_get_active(parent_kn);
+		if (!cgroup_mkdir(parent_kn, name, 0xffff))
+			atomic64_add(1, &parent->pool_amount);
+		kernfs_put_active(parent_kn);
+	}
+}
+
+static int cgroup_mkdir_fast_path(struct kernfs_node *parent_kn, const char *name)
+{
+	struct cgroup *parent;
+	struct rb_root *hidden_root;
+	int ret;
+
+	parent = parent_kn->priv;
+
+	if (!cgroup_tryget(parent))
+		return -ENODEV;
+
+	/*
+	 * acquire spinlock outside kernfs_rename because choosing kernfs node
+	 * and renaming need to be atomic.
+	 */
+	spin_lock(&parent->lock);
+
+	/* if pool is disabled or empty, return and take the slowpath */
+	if (!parent->enable_pool) {
+		ret = 1;
+		goto out_unlock;
+	}
+
+	hidden_root = &parent->hidden_place->kn->dir.children;
+	if (!hidden_root->rb_node) {
+		ret = 1;
+		goto out_unlock;
+	}
+
+#define rb_to_kn(X) rb_entry((X), struct kernfs_node, rb)
+	ret = kernfs_rename(rb_to_kn(rb_first(hidden_root)), parent_kn, name);
+	if (ret)
+		goto out_unlock;
+
+	/* supply pool if pool_amount is less than pool_size */
+	if (atomic64_sub_return(1, &parent->pool_amount) < parent->pool_size)
+		schedule_delayed_work(&parent->supply_pool_work,
+				msecs_to_jiffies(cgroup_supply_delay_time));
+
+out_unlock:
+	spin_unlock(&parent->lock);
+	cgroup_put(parent);
+	return ret;
+}
+
+/* hide a cgroup that belongs to the pool */
+static void cgroup_hide(struct cgroup *parent, struct cgroup *cgrp, const char *name)
+{
+	/*
+	 * if cgroup_hide is called from cgroup_supply_work, the pool is
+	 * enabled and the spinlock must be held to protect kernfs_rename
+	 */
+	if (parent->enable_pool)
+		spin_lock(&parent->lock);
+	kernfs_get_active(parent->hidden_place->kn);
+	BUG_ON(kernfs_rename(cgrp->kn, parent->hidden_place->kn, name));
+	kernfs_put_active(parent->hidden_place->kn);
+	if (parent->enable_pool) {
+		spin_unlock(&parent->lock);
+		kernfs_set_pinned(cgrp->kn, &parent->lock);
+	}
+}
+
 int cgroup_mkdir(struct kernfs_node *parent_kn, const char *name, umode_t mode)
 {
 	struct cgroup *parent, *cgrp;
-	int ret;
+	struct kernfs_node *kn;
+	int ret = 1;
+	bool hide = false;
 
 	/* do not accept '\n' to prevent making /proc/<pid>/cgroup unparsable */
 	if (strchr(name, '\n'))
 		return -EINVAL;
 
+	/* 0xffff means the cgroup is created for the pool; use default mode 0x1ed */
+	if (mode == 0xffff) {
+		hide = true;
+		mode = 0x1ed;
+	}
+
+	if (!hide)
+		ret = cgroup_mkdir_fast_path(parent_kn, name);
+	if (ret <= 0)
+		return ret;
+
 	parent = cgroup_kn_lock_live(parent_kn, false);
 	if (!parent)
 		return -ENODEV;
@@ -5466,6 +5567,9 @@ int cgroup_mkdir(struct kernfs_node *parent_kn, const char *name, umode_t mode)
 	/* let's create and online css's */
 	kernfs_activate(cgrp->kn);
 
+	if (hide)
+		cgroup_hide(parent, cgrp, name);
+
 	ret = 0;
 	goto out_unlock;
 
@@ -5658,10 +5762,17 @@ int cgroup_rmdir(struct kernfs_node *kn)
 	if (!cgrp)
 		return 0;
 
+	/* a pool is enabled here and may still be creating cgroups */
+	if (cgrp->pool_size) {
+		ret = -EBUSY;
+		goto out_unlock;
+	}
+
 	ret = cgroup_destroy_locked(cgrp);
 	if (!ret)
 		TRACE_CGROUP_PATH(rmdir, cgrp);
 
+out_unlock:
 	cgroup_kn_unlock(kn);
 	return ret;
 }
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 083be6af29d7..9406111a30cb 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -73,6 +73,7 @@
 #include <linux/latencytop.h>
 #include <linux/pid.h>
 #include <linux/delayacct.h>
+#include <linux/cgroup.h>
 
 #include "../lib/kstrtox.h"
 
@@ -2718,6 +2719,13 @@ int proc_do_static_key(struct ctl_table *table, int write,
 		.extra2		= SYSCTL_ONE,
 	},
 #endif
+	{
+		.procname       = "cgroup_supply_delay_time",
+		.data           = &cgroup_supply_delay_time,
+		.maxlen         = sizeof(unsigned int),
+		.mode           = 0644,
+		.proc_handler   = proc_dointvec,
+	},
 	{ }
 };
 
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH 1/2] add pinned flags for kernfs node
@ 2021-09-08 12:35     ` Greg KH
  0 siblings, 0 replies; 35+ messages in thread
From: Greg KH @ 2021-09-08 12:35 UTC (permalink / raw)
  To: Yi Tao
  Cc: tj, lizefan.x, hannes, mcgrof, keescook, yzaikin, linux-kernel,
	cgroups, linux-fsdevel, shanpeic

On Wed, Sep 08, 2021 at 08:15:12PM +0800, Yi Tao wrote:
> This patch prepares for the implementation of the cgroup pool. If a
> kernfs node is set to pinned, its data is no longer protected by
> kernfs internally; for the following operations, the region normally
> protected by kernfs_rwsem is instead protected by a caller-specified
> spinlock:
> 	1. renaming the node
> 	2. removing the node
> 	3. creating a child node
> 
> Suggested-by: Shanpei Chen <shanpeic@linux.alibaba.com>
> Signed-off-by: Yi Tao <escape@linux.alibaba.com>
> ---
>  fs/kernfs/dir.c        | 74 ++++++++++++++++++++++++++++++++++++--------------
>  include/linux/kernfs.h | 14 ++++++++++
>  2 files changed, 68 insertions(+), 20 deletions(-)

Ugh, this is going to get messy fast.

Why are kernfs changes needed for this?  kernfs creation is not
necessarily supposed to be "fast", what benchmark needs this type of
change to require the addition of this complexity?

And how are we going to audit things to ensure the "pinning" is working
properly?

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH 2/2] support cgroup pool in v1
@ 2021-09-08 12:35       ` Greg KH
  0 siblings, 0 replies; 35+ messages in thread
From: Greg KH @ 2021-09-08 12:35 UTC (permalink / raw)
  To: Yi Tao
  Cc: tj, lizefan.x, hannes, mcgrof, keescook, yzaikin, linux-kernel,
	cgroups, linux-fsdevel, shanpeic

On Wed, Sep 08, 2021 at 08:15:13PM +0800, Yi Tao wrote:
> Add a pool_size interface and a delay_time interface. When the user
> writes pool_size, a cgroup pool is created; when the user then
> creates a cgroup, it takes a fast path protected by a spinlock and
> obtains one from the pool. Performance improves in three ways:
> 	1. the critical section for creating cgroups shrinks
> 	2. less time is spent sleeping and waiting to be rescheduled
> 	3. contention with other cgroup operations protected by
> 	   cgroup_mutex is avoided
> 
> Obtaining a cgroup from the pool is essentially a kernfs rename.
> Using the pinned kernfs node support from the previous patch, the
> pooled cgroups are in the pinned state while the pool is enabled, so
> the kernfs data structures are protected by the specified spinlock
> instead of cgroup_mutex and kernfs_rwsem.
> 
> To keep users from operating on pooled cgroups directly, their kernfs
> nodes are placed under a hidden kernfs tree that users cannot touch.
> When a user creates a cgroup, the fast path selects a node from the
> hidden tree and moves it to the requested position.
> 
> As users keep drawing from the pool, the number of cgroups in it
> decreases; when it drops below a threshold, the pool is replenished.
> To avoid competing with cgroups being created at that moment, the
> replenishment can be deferred via the delay_time setting (the
> cgroup_supply_delay_time sysctl).
> 
> Suggested-by: Shanpei Chen <shanpeic@linux.alibaba.com>
> Signed-off-by: Yi Tao <escape@linux.alibaba.com>
> ---
>  include/linux/cgroup-defs.h |  16 +++++
>  include/linux/cgroup.h      |   2 +
>  kernel/cgroup/cgroup-v1.c   | 139 ++++++++++++++++++++++++++++++++++++++++++++

I thought cgroup v1 was "obsolete" and not getting new features added to
it.  What is wrong with just using cgroups 2 instead if you have a
problem with the v1 interface?

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH 0/2] support cgroup pool in v1
  2021-09-08 12:15 ` Yi Tao
  (?)
  (?)
@ 2021-09-08 12:37 ` Greg KH
  2021-09-10  2:11   ` taoyi.ty
  -1 siblings, 1 reply; 35+ messages in thread
From: Greg KH @ 2021-09-08 12:37 UTC (permalink / raw)
  To: Yi Tao
  Cc: tj, lizefan.x, hannes, mcgrof, keescook, yzaikin, linux-kernel,
	cgroups, linux-fsdevel, shanpeic

On Wed, Sep 08, 2021 at 08:15:11PM +0800, Yi Tao wrote:
> When containers are started with high concurrency, a corresponding
> cgroup must be created for each container and its processes attached
> in order to control the container's use of system resources. The
> kernel serializes these operations with the global cgroup_mutex to
> keep the data consistent, which leads to high long-tail latency for
> cgroup-related operations during concurrent startup. For example,
> when starting 400 containers, the long-tail latency of creating a
> cgroup under each subsystem reaches 900ms, which becomes a
> performance bottleneck. The delay consists of two parts: the time
> spent in the critical section protected by cgroup_mutex, and the
> time spent sleeping while waiting to be rescheduled. The latter
> grows as CPU load increases.

Perhaps you shouldn't be creating that many containers all at once?
What normal workload requires this?

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH 2/2] support cgroup pool in v1
  2021-09-08 12:15     ` Yi Tao
  (?)
  (?)
@ 2021-09-08 15:30     ` kernel test robot
  -1 siblings, 0 replies; 35+ messages in thread
From: kernel test robot @ 2021-09-08 15:30 UTC (permalink / raw)
  To: kbuild-all

Hi Yi,

[FYI, it's a private test report for your RFC patch.]
[auto build test ERROR on driver-core/driver-core-testing]
[also build test ERROR on linus/master next-20210908]
[cannot apply to cgroup/for-next kees/for-next/pstore v5.14]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/Yi-Tao/support-cgroup-pool-in-v1/20210908-201642
base:   https://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core.git 4b93c544e90e2b28326182d31ee008eb80e02074
config: arm-randconfig-r013-20210908 (attached as .config)
compiler: arm-linux-gnueabi-gcc (GCC) 11.2.0
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # https://github.com/0day-ci/linux/commit/6fadbb12de926d6d4fd6d903304d5cbe52bbbc52
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review Yi-Tao/support-cgroup-pool-in-v1/20210908-201642
        git checkout 6fadbb12de926d6d4fd6d903304d5cbe52bbbc52
        # save the attached .config to linux build tree
        mkdir build_dir
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-11.2.0 make.cross O=build_dir ARCH=arm SHELL=/bin/bash

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All errors (new ones prefixed by >>):

>> kernel/sysctl.c:2724:36: error: 'cgroup_supply_delay_time' undeclared here (not in a function)
    2724 |                 .data           = &cgroup_supply_delay_time,
         |                                    ^~~~~~~~~~~~~~~~~~~~~~~~


vim +/cgroup_supply_delay_time +2724 kernel/sysctl.c

  2424	
  2425	#if (defined(CONFIG_X86_32) || defined(CONFIG_PARISC)) && \
  2426		defined(CONFIG_DEBUG_STACKOVERFLOW)
  2427		{
  2428			.procname	= "panic_on_stackoverflow",
  2429			.data		= &sysctl_panic_on_stackoverflow,
  2430			.maxlen		= sizeof(int),
  2431			.mode		= 0644,
  2432			.proc_handler	= proc_dointvec,
  2433		},
  2434	#endif
  2435	#if defined(CONFIG_X86)
  2436		{
  2437			.procname	= "panic_on_unrecovered_nmi",
  2438			.data		= &panic_on_unrecovered_nmi,
  2439			.maxlen		= sizeof(int),
  2440			.mode		= 0644,
  2441			.proc_handler	= proc_dointvec,
  2442		},
  2443		{
  2444			.procname	= "panic_on_io_nmi",
  2445			.data		= &panic_on_io_nmi,
  2446			.maxlen		= sizeof(int),
  2447			.mode		= 0644,
  2448			.proc_handler	= proc_dointvec,
  2449		},
  2450		{
  2451			.procname	= "bootloader_type",
  2452			.data		= &bootloader_type,
  2453			.maxlen		= sizeof (int),
  2454			.mode		= 0444,
  2455			.proc_handler	= proc_dointvec,
  2456		},
  2457		{
  2458			.procname	= "bootloader_version",
  2459			.data		= &bootloader_version,
  2460			.maxlen		= sizeof (int),
  2461			.mode		= 0444,
  2462			.proc_handler	= proc_dointvec,
  2463		},
  2464		{
  2465			.procname	= "io_delay_type",
  2466			.data		= &io_delay_type,
  2467			.maxlen		= sizeof(int),
  2468			.mode		= 0644,
  2469			.proc_handler	= proc_dointvec,
  2470		},
  2471	#endif
  2472	#if defined(CONFIG_MMU)
  2473		{
  2474			.procname	= "randomize_va_space",
  2475			.data		= &randomize_va_space,
  2476			.maxlen		= sizeof(int),
  2477			.mode		= 0644,
  2478			.proc_handler	= proc_dointvec,
  2479		},
  2480	#endif
  2481	#if defined(CONFIG_S390) && defined(CONFIG_SMP)
  2482		{
  2483			.procname	= "spin_retry",
  2484			.data		= &spin_retry,
  2485			.maxlen		= sizeof (int),
  2486			.mode		= 0644,
  2487			.proc_handler	= proc_dointvec,
  2488		},
  2489	#endif
  2490	#if	defined(CONFIG_ACPI_SLEEP) && defined(CONFIG_X86)
  2491		{
  2492			.procname	= "acpi_video_flags",
  2493			.data		= &acpi_realmode_flags,
  2494			.maxlen		= sizeof (unsigned long),
  2495			.mode		= 0644,
  2496			.proc_handler	= proc_doulongvec_minmax,
  2497		},
  2498	#endif
  2499	#ifdef CONFIG_SYSCTL_ARCH_UNALIGN_NO_WARN
  2500		{
  2501			.procname	= "ignore-unaligned-usertrap",
  2502			.data		= &no_unaligned_warning,
  2503			.maxlen		= sizeof (int),
  2504			.mode		= 0644,
  2505			.proc_handler	= proc_dointvec,
  2506		},
  2507	#endif
  2508	#ifdef CONFIG_IA64
  2509		{
  2510			.procname	= "unaligned-dump-stack",
  2511			.data		= &unaligned_dump_stack,
  2512			.maxlen		= sizeof (int),
  2513			.mode		= 0644,
  2514			.proc_handler	= proc_dointvec,
  2515		},
  2516	#endif
  2517	#ifdef CONFIG_DETECT_HUNG_TASK
  2518	#ifdef CONFIG_SMP
  2519		{
  2520			.procname	= "hung_task_all_cpu_backtrace",
  2521			.data		= &sysctl_hung_task_all_cpu_backtrace,
  2522			.maxlen		= sizeof(int),
  2523			.mode		= 0644,
  2524			.proc_handler	= proc_dointvec_minmax,
  2525			.extra1		= SYSCTL_ZERO,
  2526			.extra2		= SYSCTL_ONE,
  2527		},
  2528	#endif /* CONFIG_SMP */
  2529		{
  2530			.procname	= "hung_task_panic",
  2531			.data		= &sysctl_hung_task_panic,
  2532			.maxlen		= sizeof(int),
  2533			.mode		= 0644,
  2534			.proc_handler	= proc_dointvec_minmax,
  2535			.extra1		= SYSCTL_ZERO,
  2536			.extra2		= SYSCTL_ONE,
  2537		},
  2538		{
  2539			.procname	= "hung_task_check_count",
  2540			.data		= &sysctl_hung_task_check_count,
  2541			.maxlen		= sizeof(int),
  2542			.mode		= 0644,
  2543			.proc_handler	= proc_dointvec_minmax,
  2544			.extra1		= SYSCTL_ZERO,
  2545		},
  2546		{
  2547			.procname	= "hung_task_timeout_secs",
  2548			.data		= &sysctl_hung_task_timeout_secs,
  2549			.maxlen		= sizeof(unsigned long),
  2550			.mode		= 0644,
  2551			.proc_handler	= proc_dohung_task_timeout_secs,
  2552			.extra2		= &hung_task_timeout_max,
  2553		},
  2554		{
  2555			.procname	= "hung_task_check_interval_secs",
  2556			.data		= &sysctl_hung_task_check_interval_secs,
  2557			.maxlen		= sizeof(unsigned long),
  2558			.mode		= 0644,
  2559			.proc_handler	= proc_dohung_task_timeout_secs,
  2560			.extra2		= &hung_task_timeout_max,
  2561		},
  2562		{
  2563			.procname	= "hung_task_warnings",
  2564			.data		= &sysctl_hung_task_warnings,
  2565			.maxlen		= sizeof(int),
  2566			.mode		= 0644,
  2567			.proc_handler	= proc_dointvec_minmax,
  2568			.extra1		= &neg_one,
  2569		},
  2570	#endif
  2571	#ifdef CONFIG_RT_MUTEXES
  2572		{
  2573			.procname	= "max_lock_depth",
  2574			.data		= &max_lock_depth,
  2575			.maxlen		= sizeof(int),
  2576			.mode		= 0644,
  2577			.proc_handler	= proc_dointvec,
  2578		},
  2579	#endif
  2580		{
  2581			.procname	= "poweroff_cmd",
  2582			.data		= &poweroff_cmd,
  2583			.maxlen		= POWEROFF_CMD_PATH_LEN,
  2584			.mode		= 0644,
  2585			.proc_handler	= proc_dostring,
  2586		},
  2587	#ifdef CONFIG_KEYS
  2588		{
  2589			.procname	= "keys",
  2590			.mode		= 0555,
  2591			.child		= key_sysctls,
  2592		},
  2593	#endif
  2594	#ifdef CONFIG_PERF_EVENTS
  2595		/*
  2596		 * User-space scripts rely on the existence of this file
  2597		 * as a feature check for perf_events being enabled.
  2598		 *
  2599		 * So it's an ABI, do not remove!
  2600		 */
  2601		{
  2602			.procname	= "perf_event_paranoid",
  2603			.data		= &sysctl_perf_event_paranoid,
  2604			.maxlen		= sizeof(sysctl_perf_event_paranoid),
  2605			.mode		= 0644,
  2606			.proc_handler	= proc_dointvec,
  2607		},
  2608		{
  2609			.procname	= "perf_event_mlock_kb",
  2610			.data		= &sysctl_perf_event_mlock,
  2611			.maxlen		= sizeof(sysctl_perf_event_mlock),
  2612			.mode		= 0644,
  2613			.proc_handler	= proc_dointvec,
  2614		},
  2615		{
  2616			.procname	= "perf_event_max_sample_rate",
  2617			.data		= &sysctl_perf_event_sample_rate,
  2618			.maxlen		= sizeof(sysctl_perf_event_sample_rate),
  2619			.mode		= 0644,
  2620			.proc_handler	= perf_proc_update_handler,
  2621			.extra1		= SYSCTL_ONE,
  2622		},
  2623		{
  2624			.procname	= "perf_cpu_time_max_percent",
  2625			.data		= &sysctl_perf_cpu_time_max_percent,
  2626			.maxlen		= sizeof(sysctl_perf_cpu_time_max_percent),
  2627			.mode		= 0644,
  2628			.proc_handler	= perf_cpu_time_max_percent_handler,
  2629			.extra1		= SYSCTL_ZERO,
  2630			.extra2		= &one_hundred,
  2631		},
  2632		{
  2633			.procname	= "perf_event_max_stack",
  2634			.data		= &sysctl_perf_event_max_stack,
  2635			.maxlen		= sizeof(sysctl_perf_event_max_stack),
  2636			.mode		= 0644,
  2637			.proc_handler	= perf_event_max_stack_handler,
  2638			.extra1		= SYSCTL_ZERO,
  2639			.extra2		= &six_hundred_forty_kb,
  2640		},
  2641		{
  2642			.procname	= "perf_event_max_contexts_per_stack",
  2643			.data		= &sysctl_perf_event_max_contexts_per_stack,
  2644			.maxlen		= sizeof(sysctl_perf_event_max_contexts_per_stack),
  2645			.mode		= 0644,
  2646			.proc_handler	= perf_event_max_stack_handler,
  2647			.extra1		= SYSCTL_ZERO,
  2648			.extra2		= &one_thousand,
  2649		},
  2650	#endif
  2651		{
  2652			.procname	= "panic_on_warn",
  2653			.data		= &panic_on_warn,
  2654			.maxlen		= sizeof(int),
  2655			.mode		= 0644,
  2656			.proc_handler	= proc_dointvec_minmax,
  2657			.extra1		= SYSCTL_ZERO,
  2658			.extra2		= SYSCTL_ONE,
  2659		},
  2660	#if defined(CONFIG_SMP) && defined(CONFIG_NO_HZ_COMMON)
  2661		{
  2662			.procname	= "timer_migration",
  2663			.data		= &sysctl_timer_migration,
  2664			.maxlen		= sizeof(unsigned int),
  2665			.mode		= 0644,
  2666			.proc_handler	= timer_migration_handler,
  2667			.extra1		= SYSCTL_ZERO,
  2668			.extra2		= SYSCTL_ONE,
  2669		},
  2670	#endif
  2671	#ifdef CONFIG_BPF_SYSCALL
  2672		{
  2673			.procname	= "unprivileged_bpf_disabled",
  2674			.data		= &sysctl_unprivileged_bpf_disabled,
  2675			.maxlen		= sizeof(sysctl_unprivileged_bpf_disabled),
  2676			.mode		= 0644,
  2677			.proc_handler	= bpf_unpriv_handler,
  2678			.extra1		= SYSCTL_ZERO,
  2679			.extra2		= &two,
  2680		},
  2681		{
  2682			.procname	= "bpf_stats_enabled",
  2683			.data		= &bpf_stats_enabled_key.key,
  2684			.maxlen		= sizeof(bpf_stats_enabled_key),
  2685			.mode		= 0644,
  2686			.proc_handler	= bpf_stats_handler,
  2687		},
  2688	#endif
  2689	#if defined(CONFIG_TREE_RCU)
  2690		{
  2691			.procname	= "panic_on_rcu_stall",
  2692			.data		= &sysctl_panic_on_rcu_stall,
  2693			.maxlen		= sizeof(sysctl_panic_on_rcu_stall),
  2694			.mode		= 0644,
  2695			.proc_handler	= proc_dointvec_minmax,
  2696			.extra1		= SYSCTL_ZERO,
  2697			.extra2		= SYSCTL_ONE,
  2698		},
  2699	#endif
  2700	#if defined(CONFIG_TREE_RCU)
  2701		{
  2702			.procname	= "max_rcu_stall_to_panic",
  2703			.data		= &sysctl_max_rcu_stall_to_panic,
  2704			.maxlen		= sizeof(sysctl_max_rcu_stall_to_panic),
  2705			.mode		= 0644,
  2706			.proc_handler	= proc_dointvec_minmax,
  2707			.extra1		= SYSCTL_ONE,
  2708			.extra2		= SYSCTL_INT_MAX,
  2709		},
  2710	#endif
  2711	#ifdef CONFIG_STACKLEAK_RUNTIME_DISABLE
  2712		{
  2713			.procname	= "stack_erasing",
  2714			.data		= NULL,
  2715			.maxlen		= sizeof(int),
  2716			.mode		= 0600,
  2717			.proc_handler	= stack_erasing_sysctl,
  2718			.extra1		= SYSCTL_ZERO,
  2719			.extra2		= SYSCTL_ONE,
  2720		},
  2721	#endif
  2722		{
  2723			.procname       = "cgroup_supply_delay_time",
> 2724			.data           = &cgroup_supply_delay_time,
  2725			.maxlen         = sizeof(unsigned int),
  2726			.mode           = 0644,
  2727			.proc_handler   = proc_dointvec,
  2728		},
  2729		{ }
  2730	};
  2731	
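
The undeclared-symbol error above says the new sysctl table entry
references a variable that is not visible to kernel/sysctl.c in this
configuration. A minimal sketch of the missing piece (the header
location and the config guard are assumptions; the variable itself
would be defined by the cgroup pool patch):

        /* include/linux/cgroup.h (sketch, not the actual patch) */
        #ifdef CONFIG_CGROUPS
        extern unsigned int cgroup_supply_delay_time;
        #endif

The table entry at line 2724 would then need the same guard so that
configurations without the feature still build.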

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all(a)lists.01.org

[-- Attachment #2: config.gz --]
[-- Type: application/gzip, Size: 35180 bytes --]

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH 1/2] add pinned flags for kernfs node
  2021-09-08 12:15 ` [RFC PATCH 1/2] add pinned flags for kernfs node Yi Tao
  2021-09-08 12:15     ` Yi Tao
  2021-09-08 12:35     ` Greg KH
@ 2021-09-08 16:26   ` kernel test robot
  2 siblings, 0 replies; 35+ messages in thread
From: kernel test robot @ 2021-09-08 16:26 UTC (permalink / raw)
  To: kbuild-all

[-- Attachment #1: Type: text/plain, Size: 11509 bytes --]

Hi Yi,

[FYI, it's a private test report for your RFC patch.]
[auto build test ERROR on driver-core/driver-core-testing]
[also build test ERROR on linus/master next-20210908]
[cannot apply to cgroup/for-next kees/for-next/pstore v5.14]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/Yi-Tao/support-cgroup-pool-in-v1/20210908-201642
base:   https://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core.git 4b93c544e90e2b28326182d31ee008eb80e02074
config: i386-tinyconfig (attached as .config)
compiler: gcc-9 (Debian 9.3.0-22) 9.3.0
reproduce (this is a W=1 build):
        # https://github.com/0day-ci/linux/commit/fa7929f29afb7923a37f7ca5d08e4a4c13c7ca08
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review Yi-Tao/support-cgroup-pool-in-v1/20210908-201642
        git checkout fa7929f29afb7923a37f7ca5d08e4a4c13c7ca08
        # save the attached .config to linux build tree
        mkdir build_dir
        make W=1 O=build_dir ARCH=i386 SHELL=/bin/bash

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All errors (new ones prefixed by >>):

   ld: init/do_mounts.o: in function `kernfs_set_pinned':
>> do_mounts.c:(.text+0x10): multiple definition of `kernfs_set_pinned'; init/main.o:main.c:(.text+0x40): first defined here
   ld: init/do_mounts.o: in function `kernfs_clear_pinned':
>> do_mounts.c:(.text+0x20): multiple definition of `kernfs_clear_pinned'; init/main.o:main.c:(.text+0x50): first defined here
   ld: init/noinitramfs.o: in function `kernfs_set_pinned':
   noinitramfs.c:(.text+0x0): multiple definition of `kernfs_set_pinned'; init/main.o:main.c:(.text+0x40): first defined here
   ld: init/noinitramfs.o: in function `kernfs_clear_pinned':
   noinitramfs.c:(.text+0x10): multiple definition of `kernfs_clear_pinned'; init/main.o:main.c:(.text+0x50): first defined here
   ld: init/init_task.o: in function `kernfs_set_pinned':
   init_task.c:(.text+0x0): multiple definition of `kernfs_set_pinned'; init/main.o:main.c:(.text+0x40): first defined here
   ld: init/init_task.o: in function `kernfs_clear_pinned':
   init_task.c:(.text+0x10): multiple definition of `kernfs_clear_pinned'; init/main.o:main.c:(.text+0x50): first defined here
   ld: arch/x86/entry/syscall_32.o: in function `kernfs_set_pinned':
   syscall_32.c:(.text+0x0): multiple definition of `kernfs_set_pinned'; init/main.o:main.c:(.text+0x40): first defined here
   ld: arch/x86/entry/syscall_32.o: in function `kernfs_clear_pinned':
   syscall_32.c:(.text+0x10): multiple definition of `kernfs_clear_pinned'; init/main.o:main.c:(.text+0x50): first defined here
   ld: arch/x86/entry/common.o: in function `kernfs_set_pinned':
   common.c:(.text+0x0): multiple definition of `kernfs_set_pinned'; init/main.o:main.c:(.text+0x40): first defined here
   ld: arch/x86/entry/common.o: in function `kernfs_clear_pinned':
   common.c:(.text+0x10): multiple definition of `kernfs_clear_pinned'; init/main.o:main.c:(.text+0x50): first defined here
   ld: arch/x86/entry/vdso/vma.o: in function `kernfs_set_pinned':
   vma.c:(.text+0x260): multiple definition of `kernfs_set_pinned'; init/main.o:main.c:(.text+0x40): first defined here
   ld: arch/x86/entry/vdso/vma.o: in function `kernfs_clear_pinned':
   vma.c:(.text+0x270): multiple definition of `kernfs_clear_pinned'; init/main.o:main.c:(.text+0x50): first defined here
   ld: arch/x86/entry/vdso/extable.o: in function `kernfs_set_pinned':
   extable.c:(.text+0x0): multiple definition of `kernfs_set_pinned'; init/main.o:main.c:(.text+0x40): first defined here
   ld: arch/x86/entry/vdso/extable.o: in function `kernfs_clear_pinned':
   extable.c:(.text+0x10): multiple definition of `kernfs_clear_pinned'; init/main.o:main.c:(.text+0x50): first defined here
   ld: arch/x86/events/core.o: in function `kernfs_set_pinned':
   core.c:(.text+0xbc0): multiple definition of `kernfs_set_pinned'; init/main.o:main.c:(.text+0x40): first defined here
   ld: arch/x86/events/core.o: in function `kernfs_clear_pinned':
   core.c:(.text+0xbd0): multiple definition of `kernfs_clear_pinned'; init/main.o:main.c:(.text+0x50): first defined here
   ld: arch/x86/events/probe.o: in function `kernfs_set_pinned':
   probe.c:(.text+0x10): multiple definition of `kernfs_set_pinned'; init/main.o:main.c:(.text+0x40): first defined here
   ld: arch/x86/events/probe.o: in function `kernfs_clear_pinned':
   probe.c:(.text+0x20): multiple definition of `kernfs_clear_pinned'; init/main.o:main.c:(.text+0x50): first defined here
   ld: arch/x86/events/amd/core.o: in function `kernfs_set_pinned':
   core.c:(.text+0x900): multiple definition of `kernfs_set_pinned'; init/main.o:main.c:(.text+0x40): first defined here
   ld: arch/x86/events/amd/core.o: in function `kernfs_clear_pinned':
   core.c:(.text+0x910): multiple definition of `kernfs_clear_pinned'; init/main.o:main.c:(.text+0x50): first defined here
   ld: arch/x86/events/intel/core.o: in function `kernfs_set_pinned':
   core.c:(.text+0x27d0): multiple definition of `kernfs_set_pinned'; init/main.o:main.c:(.text+0x40): first defined here
   ld: arch/x86/events/intel/core.o: in function `kernfs_clear_pinned':
   core.c:(.text+0x27e0): multiple definition of `kernfs_clear_pinned'; init/main.o:main.c:(.text+0x50): first defined here
   ld: arch/x86/events/intel/bts.o: in function `kernfs_set_pinned':
   bts.c:(.text+0x9f0): multiple definition of `kernfs_set_pinned'; init/main.o:main.c:(.text+0x40): first defined here
   ld: arch/x86/events/intel/bts.o: in function `kernfs_clear_pinned':
   bts.c:(.text+0xa00): multiple definition of `kernfs_clear_pinned'; init/main.o:main.c:(.text+0x50): first defined here
   ld: arch/x86/events/intel/ds.o: in function `kernfs_set_pinned':
   ds.c:(.text+0x1ea0): multiple definition of `kernfs_set_pinned'; init/main.o:main.c:(.text+0x40): first defined here
   ld: arch/x86/events/intel/ds.o: in function `kernfs_clear_pinned':
   ds.c:(.text+0x1eb0): multiple definition of `kernfs_clear_pinned'; init/main.o:main.c:(.text+0x50): first defined here
   ld: arch/x86/events/intel/knc.o: in function `kernfs_set_pinned':
   knc.c:(.text+0x320): multiple definition of `kernfs_set_pinned'; init/main.o:main.c:(.text+0x40): first defined here
   ld: arch/x86/events/intel/knc.o: in function `kernfs_clear_pinned':
   knc.c:(.text+0x330): multiple definition of `kernfs_clear_pinned'; init/main.o:main.c:(.text+0x50): first defined here
   ld: arch/x86/events/intel/lbr.o: in function `kernfs_set_pinned':
   lbr.c:(.text+0xc80): multiple definition of `kernfs_set_pinned'; init/main.o:main.c:(.text+0x40): first defined here
   ld: arch/x86/events/intel/lbr.o: in function `kernfs_clear_pinned':
   lbr.c:(.text+0xc90): multiple definition of `kernfs_clear_pinned'; init/main.o:main.c:(.text+0x50): first defined here
   ld: arch/x86/events/intel/p4.o: in function `kernfs_set_pinned':
   p4.c:(.text+0x740): multiple definition of `kernfs_set_pinned'; init/main.o:main.c:(.text+0x40): first defined here
   ld: arch/x86/events/intel/p4.o: in function `kernfs_clear_pinned':
   p4.c:(.text+0x750): multiple definition of `kernfs_clear_pinned'; init/main.o:main.c:(.text+0x50): first defined here
   ld: arch/x86/events/intel/p6.o: in function `kernfs_set_pinned':
   p6.c:(.text+0x170): multiple definition of `kernfs_set_pinned'; init/main.o:main.c:(.text+0x40): first defined here
   ld: arch/x86/events/intel/p6.o: in function `kernfs_clear_pinned':
   p6.c:(.text+0x180): multiple definition of `kernfs_clear_pinned'; init/main.o:main.c:(.text+0x50): first defined here
   ld: arch/x86/events/intel/pt.o: in function `kernfs_set_pinned':
   pt.c:(.text+0x1930): multiple definition of `kernfs_set_pinned'; init/main.o:main.c:(.text+0x40): first defined here
   ld: arch/x86/events/intel/pt.o: in function `kernfs_clear_pinned':
   pt.c:(.text+0x1940): multiple definition of `kernfs_clear_pinned'; init/main.o:main.c:(.text+0x50): first defined here
   ld: arch/x86/events/zhaoxin/core.o: in function `kernfs_set_pinned':
   core.c:(.text+0x580): multiple definition of `kernfs_set_pinned'; init/main.o:main.c:(.text+0x40): first defined here
   ld: arch/x86/events/zhaoxin/core.o: in function `kernfs_clear_pinned':
   core.c:(.text+0x590): multiple definition of `kernfs_clear_pinned'; init/main.o:main.c:(.text+0x50): first defined here
   ld: arch/x86/kernel/process_32.o: in function `kernfs_set_pinned':
   process_32.c:(.text+0x0): multiple definition of `kernfs_set_pinned'; init/main.o:main.c:(.text+0x40): first defined here
   ld: arch/x86/kernel/process_32.o: in function `kernfs_clear_pinned':
   process_32.c:(.text+0x10): multiple definition of `kernfs_clear_pinned'; init/main.o:main.c:(.text+0x50): first defined here
   ld: arch/x86/kernel/signal.o: in function `kernfs_set_pinned':
   signal.c:(.text+0x700): multiple definition of `kernfs_set_pinned'; init/main.o:main.c:(.text+0x40): first defined here
   ld: arch/x86/kernel/signal.o: in function `kernfs_clear_pinned':
   signal.c:(.text+0x710): multiple definition of `kernfs_clear_pinned'; init/main.o:main.c:(.text+0x50): first defined here
   ld: arch/x86/kernel/traps.o: in function `kernfs_set_pinned':
   traps.c:(.text+0x340): multiple definition of `kernfs_set_pinned'; init/main.o:main.c:(.text+0x40): first defined here
   ld: arch/x86/kernel/traps.o: in function `kernfs_clear_pinned':
   traps.c:(.text+0x350): multiple definition of `kernfs_clear_pinned'; init/main.o:main.c:(.text+0x50): first defined here
   ld: arch/x86/kernel/idt.o: in function `kernfs_set_pinned':
   idt.c:(.text+0x0): multiple definition of `kernfs_set_pinned'; init/main.o:main.c:(.text+0x40): first defined here
   ld: arch/x86/kernel/idt.o: in function `kernfs_clear_pinned':
   idt.c:(.text+0x10): multiple definition of `kernfs_clear_pinned'; init/main.o:main.c:(.text+0x50): first defined here
   ld: arch/x86/kernel/irq.o: in function `kernfs_set_pinned':
   irq.c:(.text+0x60): multiple definition of `kernfs_set_pinned'; init/main.o:main.c:(.text+0x40): first defined here
   ld: arch/x86/kernel/irq.o: in function `kernfs_clear_pinned':
   irq.c:(.text+0x70): multiple definition of `kernfs_clear_pinned'; init/main.o:main.c:(.text+0x50): first defined here
   ld: arch/x86/kernel/irq_32.o: in function `kernfs_set_pinned':
   irq_32.c:(.text+0x10): multiple definition of `kernfs_set_pinned'; init/main.o:main.c:(.text+0x40): first defined here
   ld: arch/x86/kernel/irq_32.o: in function `kernfs_clear_pinned':
   irq_32.c:(.text+0x20): multiple definition of `kernfs_clear_pinned'; init/main.o:main.c:(.text+0x50): first defined here
   ld: arch/x86/kernel/dumpstack_32.o: in function `kernfs_set_pinned':
   dumpstack_32.c:(.text+0x0): multiple definition of `kernfs_set_pinned'; init/main.o:main.c:(.text+0x40): first defined here
   ld: arch/x86/kernel/dumpstack_32.o: in function `kernfs_clear_pinned':
   dumpstack_32.c:(.text+0x10): multiple definition of `kernfs_clear_pinned'; init/main.o:main.c:(.text+0x50): first defined here
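
These link failures are the classic signature of defining a function,
rather than just declaring it, in a widely included header: every
translation unit that includes the header emits its own copy of the
symbol. A minimal sketch of the usual fix, assuming the RFC's no-op
stubs live in include/linux/kernfs.h (the exact location is an
assumption):

        /* broken: a plain definition in a header is emitted once per
         * object file, so the final link sees it many times */
        void kernfs_set_pinned(struct kernfs_node *kn) { }

        /* fixed: header stubs are conventionally static inline */
        static inline void kernfs_set_pinned(struct kernfs_node *kn) { }
        static inline void kernfs_clear_pinned(struct kernfs_node *kn) { }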

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all(a)lists.01.org

[-- Attachment #2: config.gz --]
[-- Type: application/gzip, Size: 7533 bytes --]

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH 0/2] support cgroup pool in v1
  2021-09-08 12:15 ` Yi Tao
                   ` (2 preceding siblings ...)
  (?)
@ 2021-09-08 16:35 ` Tejun Heo
  2021-09-10  2:12   ` taoyi.ty
  -1 siblings, 1 reply; 35+ messages in thread
From: Tejun Heo @ 2021-09-08 16:35 UTC (permalink / raw)
  To: Yi Tao
  Cc: gregkh, lizefan.x, hannes, mcgrof, keescook, yzaikin,
	linux-kernel, cgroups, linux-fsdevel, shanpeic

Hello,

On Wed, Sep 08, 2021 at 08:15:11PM +0800, Yi Tao wrote:
> In order to solve this long-tail delay problem, we designed a cgroup
> pool. The cgroup pool will create a certain number of cgroups in advance.
> When a user creates a cgroup through the mkdir system call, a clean cgroup
> can be quickly obtained from the pool. Cgroup pool draws on the idea of
> cgroup rename. By creating pool and rename in advance, it reduces the
> critical area of cgroup creation, and uses a spinlock different from
> cgroup_mutex, which reduces scheduling overhead on the one hand, and eases
> competition with attaching processes on the other hand.

I'm not sure this is the right way to go about it. There are more
conventional ways to improve scalability - making locking more granular and
hunting down specific operations which take a long time. I don't think cgroup
management operations need the level of scalability which requires front
caching.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH 2/2] support cgroup pool in v1
  2021-09-08 12:15     ` Yi Tao
                       ` (2 preceding siblings ...)
  (?)
@ 2021-09-08 16:52     ` kernel test robot
  -1 siblings, 0 replies; 35+ messages in thread
From: kernel test robot @ 2021-09-08 16:52 UTC (permalink / raw)
  To: kbuild-all

[-- Attachment #1: Type: text/plain, Size: 3999 bytes --]

Hi Yi,

[FYI, it's a private test report for your RFC patch.]
[auto build test ERROR on driver-core/driver-core-testing]
[also build test ERROR on linus/master next-20210908]
[cannot apply to cgroup/for-next kees/for-next/pstore v5.14]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/Yi-Tao/support-cgroup-pool-in-v1/20210908-201642
base:   https://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core.git 4b93c544e90e2b28326182d31ee008eb80e02074
config: m68k-randconfig-r014-20210908 (attached as .config)
compiler: m68k-linux-gcc (GCC) 11.2.0
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # https://github.com/0day-ci/linux/commit/6fadbb12de926d6d4fd6d903304d5cbe52bbbc52
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review Yi-Tao/support-cgroup-pool-in-v1/20210908-201642
        git checkout 6fadbb12de926d6d4fd6d903304d5cbe52bbbc52
        # save the attached .config to linux build tree
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-11.2.0 make.cross ARCH=m68k 

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All errors (new ones prefixed by >>):

   kernel/cgroup/cgroup.c: In function 'cgroup_mkdir':
>> kernel/cgroup/cgroup.c:5513:29: error: unused variable 'kn' [-Werror=unused-variable]
    5513 |         struct kernfs_node *kn;
         |                             ^~
   cc1: all warnings being treated as errors


vim +/kn +5513 kernel/cgroup/cgroup.c

  5509	
  5510	int cgroup_mkdir(struct kernfs_node *parent_kn, const char *name, umode_t mode)
  5511	{
  5512		struct cgroup *parent, *cgrp;
> 5513		struct kernfs_node *kn;
  5514		int ret = 1;
  5515		bool hide = false;
  5516	
  5517		/* do not accept '\n' to prevent making /proc/<pid>/cgroup unparsable */
  5518		if (strchr(name, '\n'))
  5519			return -EINVAL;
  5520	
  5521		/* 0xffff means cgroup is created for pool, set to default mode 0x1ed */
  5522		if (mode == 0xffff) {
  5523			hide = true;
  5524			mode = 0x1ed;
  5525		}
  5526	
  5527		if (!hide)
  5528			ret = cgroup_mkdir_fast_path(parent_kn, name);
  5529		if (ret <= 0)
  5530			return ret;
  5531	
  5532		parent = cgroup_kn_lock_live(parent_kn, false);
  5533		if (!parent)
  5534			return -ENODEV;
  5535	
  5536		if (!cgroup_check_hierarchy_limits(parent)) {
  5537			ret = -EAGAIN;
  5538			goto out_unlock;
  5539		}
  5540	
  5541		cgrp = cgroup_create(parent, name, mode);
  5542		if (IS_ERR(cgrp)) {
  5543			ret = PTR_ERR(cgrp);
  5544			goto out_unlock;
  5545		}
  5546	
  5547		/*
  5548		 * This extra ref will be put in cgroup_free_fn() and guarantees
  5549		 * that @cgrp->kn is always accessible.
  5550		 */
  5551		kernfs_get(cgrp->kn);
  5552	
  5553		ret = cgroup_kn_set_ugid(cgrp->kn);
  5554		if (ret)
  5555			goto out_destroy;
  5556	
  5557		ret = css_populate_dir(&cgrp->self);
  5558		if (ret)
  5559			goto out_destroy;
  5560	
  5561		ret = cgroup_apply_control_enable(cgrp);
  5562		if (ret)
  5563			goto out_destroy;
  5564	
  5565		TRACE_CGROUP_PATH(mkdir, cgrp);
  5566	
  5567		/* let's create and online css's */
  5568		kernfs_activate(cgrp->kn);
  5569	
  5570		if (hide)
  5571			cgroup_hide(parent, cgrp, name);
  5572	
  5573		ret = 0;
  5574		goto out_unlock;
  5575	
  5576	out_destroy:
  5577		cgroup_destroy_locked(cgrp);
  5578	out_unlock:
  5579		cgroup_kn_unlock(parent_kn);
  5580		return ret;
  5581	}
  5582	
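
A minimal fix sketch for the W=1 error above, in the same patch style
used elsewhere in this thread (the hunk header follows the robot's
context numbering): the fast path no longer uses the local kn, so the
leftover declaration can simply be dropped.

--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -5511,5 +5511,4 @@ int cgroup_mkdir(struct kernfs_node *parent_kn, const char *name, umode_t mode)
 {
 	struct cgroup *parent, *cgrp;
-	struct kernfs_node *kn;
 	int ret = 1;
 	bool hide = false;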

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all(a)lists.01.org

[-- Attachment #2: config.gz --]
[-- Type: application/gzip, Size: 26366 bytes --]

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH 2/2] support cgroup pool in v1
  2021-09-08 12:15     ` Yi Tao
                       ` (3 preceding siblings ...)
  (?)
@ 2021-09-08 17:39     ` kernel test robot
  -1 siblings, 0 replies; 35+ messages in thread
From: kernel test robot @ 2021-09-08 17:39 UTC (permalink / raw)
  To: kbuild-all

[-- Attachment #1: Type: text/plain, Size: 2046 bytes --]

Hi Yi,

[FYI, it's a private test report for your RFC patch.]
[auto build test WARNING on driver-core/driver-core-testing]
[also build test WARNING on linus/master next-20210908]
[cannot apply to cgroup/for-next kees/for-next/pstore v5.14]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/Yi-Tao/support-cgroup-pool-in-v1/20210908-201642
base:   https://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core.git 4b93c544e90e2b28326182d31ee008eb80e02074
config: i386-randconfig-s001-20210908 (attached as .config)
compiler: gcc-9 (Debian 9.3.0-22) 9.3.0
reproduce:
        # apt-get install sparse
        # sparse version: v0.6.4-dirty
        # https://github.com/0day-ci/linux/commit/6fadbb12de926d6d4fd6d903304d5cbe52bbbc52
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review Yi-Tao/support-cgroup-pool-in-v1/20210908-201642
        git checkout 6fadbb12de926d6d4fd6d903304d5cbe52bbbc52
        # save the attached .config to linux build tree
        make W=1 C=1 CF='-fdiagnostic-prefix -D__CHECK_ENDIAN__' O=build_dir ARCH=i386 SHELL=/bin/bash kernel/cgroup/

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>


sparse warnings: (new ones prefixed by >>)
>> kernel/cgroup/cgroup-v1.c:617:1: sparse: sparse: symbol 'cgroup_pool_mutex' was not declared. Should it be static?
   kernel/cgroup/cgroup-v1.c:490:16: sparse: sparse: context imbalance in '__cgroup1_procs_write' - wrong count at exit
   kernel/cgroup/cgroup-v1.c:991:9: sparse: sparse: context imbalance in 'cgroup1_rename' - different lock contexts for basic block

Please review and possibly fold the followup patch.

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all(a)lists.01.org

[-- Attachment #2: config.gz --]
[-- Type: application/gzip, Size: 27563 bytes --]

^ permalink raw reply	[flat|nested] 35+ messages in thread

* [RFC PATCH] cgroup_pool_mutex can be static
  2021-09-08 12:15     ` Yi Tao
                       ` (4 preceding siblings ...)
  (?)
@ 2021-09-08 17:39     ` kernel test robot
  -1 siblings, 0 replies; 35+ messages in thread
From: kernel test robot @ 2021-09-08 17:39 UTC (permalink / raw)
  To: kbuild-all

[-- Attachment #1: Type: text/plain, Size: 868 bytes --]

kernel/cgroup/cgroup-v1.c:617:1: warning: symbol 'cgroup_pool_mutex' was not declared. Should it be static?

Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: kernel test robot <lkp@intel.com>
---
 cgroup-v1.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/cgroup/cgroup-v1.c b/kernel/cgroup/cgroup-v1.c
index 8964c4d0741b5..54dfc3831d3b0 100644
--- a/kernel/cgroup/cgroup-v1.c
+++ b/kernel/cgroup/cgroup-v1.c
@@ -614,7 +614,7 @@ static int cgroup_clone_children_write(struct cgroup_subsys_state *css,
  * of parent cgroup and child cgroup at the same time. use cgroup_pool_mutex
  * to serialize any write operation to pool_size.
  */
-DEFINE_MUTEX(cgroup_pool_mutex);
+static DEFINE_MUTEX(cgroup_pool_mutex);
 
 static u64 cgroup_pool_size_read(struct cgroup_subsys_state *css,
 				 struct cftype *cft)

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH 2/2] support cgroup pool in v1
       [not found]       ` <084930d2-057a-04a7-76d1-b2a7bd37deb0@linux.alibaba.com>
@ 2021-09-09 13:27         ` Greg KH
  2021-09-10  2:20           ` taoyi.ty
  0 siblings, 1 reply; 35+ messages in thread
From: Greg KH @ 2021-09-09 13:27 UTC (permalink / raw)
  To: taoyi.ty
  Cc: tj, lizefan.x, hannes, mcgrof, keescook, yzaikin, linux-kernel,
	cgroups, linux-fsdevel, shanpeic

On Thu, Sep 09, 2021 at 08:36:10PM +0800, taoyi.ty wrote:

<snip>

Please do not send HTML email, all of the mailing lists reject it.

Please fix up your email client and resend your replies and I will be
glad to continue the discussion.

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH 0/2] support cgroup pool in v1
  2021-09-08 12:37 ` [RFC PATCH 0/2] support cgroup pool in v1 Greg KH
@ 2021-09-10  2:11   ` taoyi.ty
  2021-09-10  6:01       ` Greg KH
  2021-09-10 16:49       ` Tejun Heo
  0 siblings, 2 replies; 35+ messages in thread
From: taoyi.ty @ 2021-09-10  2:11 UTC (permalink / raw)
  To: Greg KH
  Cc: tj, lizefan.x, hannes, mcgrof, keescook, yzaikin, linux-kernel,
	cgroups, linux-fsdevel, shanpeic

On 2021/9/8 8:37 PM, Greg KH wrote:

> Perhaps you shouldn't be creating that many containers all at once?
> What normal workload requires this?

Thank you for your reply.

The scenario is function computing on the public cloud. Each instance
of function computing will be allocated about 0.1 CPU core and 100M of
memory. On a high-end server, for example 104 cores and 384G, it is
normal to create hundreds of containers at the same time if a burst of
requests comes.
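
For reference, a minimal userspace sketch of how one such instance maps
onto cgroup v1 knobs (the paths, hierarchy layout and helper are
illustrative, and assume cpu and memory hierarchies mounted under
/sys/fs/cgroup):

        #include <fcntl.h>
        #include <string.h>
        #include <sys/stat.h>
        #include <unistd.h>

        static int write_knob(const char *path, const char *val)
        {
        	int fd = open(path, O_WRONLY);
        	ssize_t n;

        	if (fd < 0)
        		return -1;
        	n = write(fd, val, strlen(val));
        	close(fd);
        	return n < 0 ? -1 : 0;
        }

        int main(void)
        {
        	mkdir("/sys/fs/cgroup/cpu/inst0", 0755);
        	mkdir("/sys/fs/cgroup/memory/inst0", 0755);
        	/* 0.1 core: quota is 10% of the period */
        	write_knob("/sys/fs/cgroup/cpu/inst0/cpu.cfs_period_us", "100000");
        	write_knob("/sys/fs/cgroup/cpu/inst0/cpu.cfs_quota_us", "10000");
        	/* 100M memory limit */
        	write_knob("/sys/fs/cgroup/memory/inst0/memory.limit_in_bytes",
        		   "104857600");
        	return 0;
        }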

thanks,

Yi Tao


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH 0/2] support cgroup pool in v1
  2021-09-08 16:35 ` Tejun Heo
@ 2021-09-10  2:12   ` taoyi.ty
  0 siblings, 0 replies; 35+ messages in thread
From: taoyi.ty @ 2021-09-10  2:12 UTC (permalink / raw)
  To: Tejun Heo
  Cc: gregkh, lizefan.x, hannes, mcgrof, keescook, yzaikin,
	linux-kernel, cgroups, linux-fsdevel, shanpeic

I am glad to receive your reply.

The cgroup pool is a relatively simple solution that I think can solve
the problem.

I have tried making the locking more granular, but in the end found it
too difficult. cgroup_mutex protects almost all operations related to
cgroups. Without cgroup_mutex, I have no idea how to design a locking
scheme that takes both concurrent performance and the existing
interfaces into account. Do you have any good advice?

thanks,

Yi Tao


On 2021/9/9 12:35 AM, Tejun Heo wrote:
> Hello,
>
> On Wed, Sep 08, 2021 at 08:15:11PM +0800, Yi Tao wrote:
>> In order to solve this long-tail delay problem, we designed a cgroup
>> pool. The cgroup pool will create a certain number of cgroups in advance.
>> When a user creates a cgroup through the mkdir system call, a clean cgroup
>> can be quickly obtained from the pool. Cgroup pool draws on the idea of
>> cgroup rename. By creating pool and rename in advance, it reduces the
>> critical area of cgroup creation, and uses a spinlock different from
>> cgroup_mutex, which reduces scheduling overhead on the one hand, and eases
>> competition with attaching processes on the other hand.
> I'm not sure this is the right way to go about it. There are more
> conventional ways to improve scalability - making locking more granular and
> hunting down specific operations which take a long time. I don't think cgroup
> management operations need the level of scalability which requires front
> caching.
>
> Thanks.
>

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH 1/2] add pinned flags for kernfs node
  2021-09-08 12:35     ` Greg KH
  (?)
@ 2021-09-10  2:14     ` taoyi.ty
  2021-09-10  6:00         ` Greg KH
  -1 siblings, 1 reply; 35+ messages in thread
From: taoyi.ty @ 2021-09-10  2:14 UTC (permalink / raw)
  To: Greg KH
  Cc: tj, lizefan.x, hannes, mcgrof, keescook, yzaikin, linux-kernel,
	cgroups, linux-fsdevel, shanpeic


On 2021/9/8 8:35 PM, Greg KH wrote:
> Why are kernfs changes needed for this?  kernfs creation is not
> necessarily supposed to be "fast", what benchmark needs this type of
> change to require the addition of this complexity?

The implementation of the cgroup pool should have nothing to do with
kernfs, but during development I found that, when there is background
CPU load, it takes a very significant time for a process to go from
being woken up on a mutex to actually starting execution.

Creating 400 cgroups concurrently takes about 80ms with no background
CPU load, but about 700ms when CPU usage is around 40%. Reducing
sched_wakeup_granularity_ns also reduces the time consumption, and
changing the mutex to a spinlock improves the situation very much.

So to solve this problem, a mutex should not be used. The cgroup pool
relies on kernfs_rename, which takes kernfs_mutex, so I need to bypass
kernfs_mutex and add a pinned flag for this. Because the locking of
kernfs_rename has changed, the creation and deletion of kernfs nodes
have also been changed accordingly to maintain data consistency.

I admit that this is really not a very elegant design, but I don't know
how to make it better, so I am throwing the problem out and trying to
seek help from the community.
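
For reference, a minimal userspace sketch of the measurement described
above (paths are illustrative and assume a writable cpu v1 hierarchy;
the background load is generated separately, e.g. with a simple CPU
hog):

        #include <stdio.h>
        #include <sys/stat.h>
        #include <sys/wait.h>
        #include <time.h>
        #include <unistd.h>

        #define NCG 400

        int main(void)
        {
        	struct timespec t0, t1;
        	int i;

        	mkdir("/sys/fs/cgroup/cpu/bench", 0755);
        	clock_gettime(CLOCK_MONOTONIC, &t0);
        	for (i = 0; i < NCG; i++) {
        		if (fork() == 0) {
        			char path[64];

        			snprintf(path, sizeof(path),
        				 "/sys/fs/cgroup/cpu/bench/cg%d", i);
        			_exit(mkdir(path, 0755) ? 1 : 0);
        		}
        	}
        	while (wait(NULL) > 0)	/* reap all creators */
        		;
        	clock_gettime(CLOCK_MONOTONIC, &t1);
        	printf("created %d cgroups in %ld ms\n", NCG,
        	       (t1.tv_sec - t0.tv_sec) * 1000 +
        	       (t1.tv_nsec - t0.tv_nsec) / 1000000);
        	return 0;
        }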


thanks,

Yi Tao


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH 2/2] support cgroup pool in v1
@ 2021-09-10  2:15         ` taoyi.ty
  0 siblings, 0 replies; 35+ messages in thread
From: taoyi.ty @ 2021-09-10  2:15 UTC (permalink / raw)
  To: Greg KH
  Cc: tj, lizefan.x, hannes, mcgrof, keescook, yzaikin, linux-kernel,
	cgroups, linux-fsdevel, shanpeic


On 2021/9/8 8:35 PM, Greg KH wrote:
> I thought cgroup v1 was "obsolete" and not getting new features added to
> it.  What is wrong with just using cgroups 2 instead if you have a
> problem with the v1 interface?
>
>    

There are two reasons for developing this based on cgroup v1:

1. In the Internet scenario, a large number of services are still using
cgroup v1; cgroup v2 has not yet been widely adopted.

2. The mechanism of the cgroup pool is modeled on cgroup1_rename, but
for some reason a similar rename mechanism is not implemented in cgroup
v2, and I don't know the thinking behind that.
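
For context, a small userspace sketch of the rename mechanism referred
to above, as it could be used to recycle a pooled v1 cgroup (paths are
illustrative; note that cgroup v1 only permits renaming within the same
parent directory):

        #include <errno.h>
        #include <stdio.h>
        #include <sys/stat.h>

        int main(void)
        {
        	/* a pre-created, idle cgroup sitting in a userspace pool */
        	if (mkdir("/sys/fs/cgroup/cpu/pool/slot0", 0755) &&
        	    errno != EEXIST) {
        		perror("mkdir");
        		return 1;
        	}
        	/* hand it out under its real name; same-parent rename is
        	 * the only rename cgroup v1 allows */
        	if (rename("/sys/fs/cgroup/cpu/pool/slot0",
        		   "/sys/fs/cgroup/cpu/pool/container-42")) {
        		perror("rename");
        		return 1;
        	}
        	return 0;
        }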


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH 2/2] support cgroup pool in v1
  2021-09-09 13:27         ` Greg KH
@ 2021-09-10  2:20           ` taoyi.ty
  0 siblings, 0 replies; 35+ messages in thread
From: taoyi.ty @ 2021-09-10  2:20 UTC (permalink / raw)
  To: Greg KH
  Cc: tj, lizefan.x, hannes, mcgrof, keescook, yzaikin, linux-kernel,
	cgroups, linux-fsdevel, shanpeic

Thank you for your reminder. This is my first time communicating with
the community and I'm sorry for making such basic mistakes. The HTML
emails have been corrected to plain text and resent.

Thanks,

Yi Tao

On 9/9/21 9:27 PM, Greg KH wrote:
> On Thu, Sep 09, 2021 at 08:36:10PM +0800, taoyi.ty wrote:
> 
> <snip>
> 
> Please do not send HTML email, all of the mailing lists reject it.
> 
> Please fix up your email client and resend your replies and I will be
> glad to continue the discussion.
> 
> thanks,
> 
> greg k-h
> 

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH 1/2] add pinned flags for kernfs node
@ 2021-09-10  6:00         ` Greg KH
  0 siblings, 0 replies; 35+ messages in thread
From: Greg KH @ 2021-09-10  6:00 UTC (permalink / raw)
  To: taoyi.ty
  Cc: tj, lizefan.x, hannes, mcgrof, keescook, yzaikin, linux-kernel,
	cgroups, linux-fsdevel, shanpeic

On Fri, Sep 10, 2021 at 10:14:28AM +0800, taoyi.ty wrote:
> 
> On 2021/9/8 8:35 PM, Greg KH wrote:
> > Why are kernfs changes needed for this?  kernfs creation is not
> > necessarily supposed to be "fast", what benchmark needs this type of
> > change to require the addition of this complexity?
> 
> The implementation of the cgroup pool should have nothing to do with
> kernfs, but during development I found that, when there is background
> CPU load, it takes a very significant time for a process to go from
> being woken up on a mutex to actually starting execution.
> 
> Creating 400 cgroups concurrently takes about 80ms with no background
> CPU load, but about 700ms when CPU usage is around 40%. Reducing
> sched_wakeup_granularity_ns also reduces the time consumption, and
> changing the mutex to a spinlock improves the situation very much.
> 
> So to solve this problem, a mutex should not be used. The cgroup pool
> relies on kernfs_rename, which takes kernfs_mutex, so I need to bypass
> kernfs_mutex and add a pinned flag for this. Because the locking of
> kernfs_rename has changed, the creation and deletion of kernfs nodes
> have also been changed accordingly to maintain data consistency.
> 
> I admit that this is really not a very elegant design, but I don't know
> how to make it better, so I am throwing the problem out and trying to
> seek help from the community.

Look at the changes to kernfs for 5.15-rc1 where a lot of the lock
contention was removed based on benchmarks where kernfs (through sysfs)
was accessed by lots of processes all at once.

That should help a bit in your case, but remember, the creation of
kernfs files is not the "normal" case, so it is not optimized at all.
We have optimized the access case, which is by far the most common.

good luck!

greg k-h

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH 2/2] support cgroup pool in v1
@ 2021-09-10  6:01           ` Greg KH
  0 siblings, 0 replies; 35+ messages in thread
From: Greg KH @ 2021-09-10  6:01 UTC (permalink / raw)
  To: taoyi.ty
  Cc: tj, lizefan.x, hannes, mcgrof, keescook, yzaikin, linux-kernel,
	cgroups, linux-fsdevel, shanpeic

On Fri, Sep 10, 2021 at 10:15:02AM +0800, taoyi.ty wrote:
> 
> On 2021/9/8 8:35 PM, Greg KH wrote:
> > I thought cgroup v1 was "obsolete" and not getting new features added to
> > it.  What is wrong with just using cgroups 2 instead if you have a
> > problem with the v1 interface?
> 
> There are two reasons for developing this based on cgroup v1:
> 
> 1. In the Internet scenario, a large number of services are still using
> cgroup v1; cgroup v2 has not yet been widely adopted.

That does not mean we have to add additional kernel complexity for an
obsolete feature that we are not adding support for anymore.  If
anything, this would be a good reason to move those userspace services
to the new api to solve this issue, right?

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH 0/2] support cgroup pool in v1
@ 2021-09-10  6:01       ` Greg KH
  0 siblings, 0 replies; 35+ messages in thread
From: Greg KH @ 2021-09-10  6:01 UTC (permalink / raw)
  To: taoyi.ty
  Cc: tj, lizefan.x, hannes, mcgrof, keescook, yzaikin, linux-kernel,
	cgroups, linux-fsdevel, shanpeic

On Fri, Sep 10, 2021 at 10:11:53AM +0800, taoyi.ty wrote:
> On 2021/9/8 8:37 PM, Greg KH wrote:
> 
> > Perhaps you shouldn't be creating that many containers all at once?
> > What normal workload requires this?
> 
> Thank you for your reply.
> 
> The scenario is function computing on the public cloud. Each instance
> of function computing will be allocated about 0.1 CPU core and 100M of
> memory. On a high-end server, for example 104 cores and 384G, it is
> normal to create hundreds of containers at the same time if a burst of
> requests comes.

So it is a resource management issue on your side, right?  Perhaps
stagger the creation of new containers to allow the overall creation
time to be less?

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH 0/2] support cgroup pool in v1
@ 2021-09-10 16:49       ` Tejun Heo
  0 siblings, 0 replies; 35+ messages in thread
From: Tejun Heo @ 2021-09-10 16:49 UTC (permalink / raw)
  To: taoyi.ty
  Cc: Greg KH, lizefan.x, hannes, mcgrof, keescook, yzaikin,
	linux-kernel, cgroups, linux-fsdevel, shanpeic

Hello,

On Fri, Sep 10, 2021 at 10:11:53AM +0800, taoyi.ty wrote:
> The scenario is function computing on the public cloud. Each instance
> of function computing will be allocated about 0.1 CPU core and 100M of
> memory. On a high-end server, for example 104 cores and 384G, it is
> normal to create hundreds of containers at the same time if a burst of
> requests comes.

This type of use case isn't something cgroup is good at, at least not
currently. The problem is that trying to scale management operations like
creating and destroying cgroups has implications on how each controller is
implemented - we want the hot paths which get used while cgroups are running
actively to be as efficient and scalable as possible even if that requires a
lot of extra preparation and lazy cleanup operations. We don't really want
to push for cgroup creation / destruction efficiency at the cost of hot path
overhead.

This has implications for use cases like you describe. Even if the kernel
pre-prepares cgroups to lower the latency of cgroup creation, it means that
the system would be doing a *lot* of managerial extra work creating and
destroying cgroups constantly for not much actual work.

Usually, the right solution for this sort of situation is pooling cgroups
from userspace, which usually has a lot better insight into which cgroups
can be recycled and can also adjust the cgroup hierarchy to better fit the
use case (e.g. some rapid-cycling cgroups can benefit from higher-level
resource configurations).

So, it'd be great to make the managerial operations more efficient from
cgroup core side but there are inherent architectural reasons why
rapid-cycling use cases aren't and won't be prioritized.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH 0/2] support cgroup pool in v1
@ 2021-09-13 14:20         ` Christian Brauner
  0 siblings, 0 replies; 35+ messages in thread
From: Christian Brauner @ 2021-09-13 14:20 UTC (permalink / raw)
  To: Tejun Heo
  Cc: taoyi.ty, Greg KH, lizefan.x, hannes, mcgrof, keescook, yzaikin,
	linux-kernel, cgroups, linux-fsdevel, shanpeic

On Fri, Sep 10, 2021 at 06:49:27AM -1000, Tejun Heo wrote:
> Hello,
> 
> On Fri, Sep 10, 2021 at 10:11:53AM +0800, taoyi.ty wrote:
> > The scenario is the function computing of the public
> > cloud. Each instance of function computing will be
> > allocated about 0.1 core cpu and 100M memory. On
> > a high-end server, for example, 104 cores and 384G,
> > it is normal to create hundreds of containers at the
> > same time if burst of requests comes.
> 
> This type of use case isn't something cgroup is good at, at least not
> currently. The problem is that trying to scale management operations like
> creating and destroying cgroups has implications on how each controller is
> implemented - we want the hot paths which get used while cgroups are running
> actively to be as efficient and scalable as possible even if that requires a
> lot of extra preparation and lazy cleanup operations. We don't really want
> to push for cgroup creation / destruction efficiency at the cost of hot path
> overhead.
> 
> This has implications for use cases like you describe. Even if the kernel
> pre-prepares cgroups to lower the latency of cgroup creation, it means that
> the system would be doing a *lot* of managerial extra work creating and
> destroying cgroups constantly for not much actual work.
> 
> Usually, the right solution for this sort of situation is pooling cgroups
> from userspace, which usually has a lot better insight into which cgroups
> can be recycled and can also adjust the cgroup hierarchy to better fit the
> use case (e.g. some rapid-cycling cgroups can benefit from higher-level
> resource configurations).

I had the same reaction and I wanted to do something like this before,
i.e. maintain a pool of pre-allocated cgroups in userspace. But there
were some problems.

Afaict, there is currently no way to prevent the deletion of empty
cgroups, especially newly created ones. So for example, if I have a
cgroup manager that prunes the cgroup tree whenever it detects empty
cgroups, it can delete cgroups that were pre-allocated. This is
something we have run into before.

A related problem is a crashed or killed container manager
(segfault, sigkill, etc.). It might not have had the chance to clean up
the cgroups it allocated for the container. If the container manager is
restarted it can't reuse the existing cgroup it found because it has no
way of guaranteeing that another program didn't create a cgroup with the
same name between the time it crashed and the time it was restarted.
We usually solve this by creating another cgroup with an index appended
until we find an unallocated one, with an arbitrary cut-off point
(e.g. 1000) after which we require manual intervention by the user.
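
A rough sketch of that probing loop (purely illustrative, not lifted
from any actual manager):

#include <errno.h>
#include <stdio.h>
#include <sys/stat.h>

/* Probe for an unused cgroup name by appending an index; give up at
 * an arbitrary cut-off (1000, matching the example above) and require
 * manual intervention. */
static int create_unique_cgroup(const char *base, char *out, size_t len)
{
	int i;

	for (i = 0; i < 1000; i++) {
		snprintf(out, len, "%s-%d", base, i);
		if (mkdir(out, 0755) == 0)
			return 0;	/* claimed a fresh cgroup */
		if (errno != EEXIST)
			return -1;	/* unexpected error, bail out */
	}
	return -1;	/* cut-off reached */
}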

Right now, iirc, one can rmdir() an empty cgroup while someone still
holds a file descriptor open for it. This can lead to a situation where
a cgroup got created but, before anything is moved into the cgroup (via
clone3() or write()), someone else has deleted it. What would already be
helpful is a way to prevent the deletion of a cgroup while someone still
holds an open reference to it. This would allow a pool of cgroups to be
created that can't simply be deleted.
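
For reference, a minimal sketch of the fd-based attach path that can
lose this race (clone3() with CLONE_INTO_CGROUP is cgroup2-only, and
the exact failure mode noted in the comment is my assumption):

#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/sched.h>	/* struct clone_args, CLONE_INTO_CGROUP */
#include <signal.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

static pid_t spawn_into_cgroup(const char *cgrp_path)
{
	struct clone_args args;
	pid_t pid;
	int cgfd;

	cgfd = open(cgrp_path, O_RDONLY | O_DIRECTORY | O_CLOEXEC);
	if (cgfd < 0)
		return -1;

	memset(&args, 0, sizeof(args));
	args.flags       = CLONE_INTO_CGROUP;
	args.exit_signal = SIGCHLD;
	args.cgroup      = cgfd;

	/* The open fd does not pin the cgroup: if it was rmdir()ed in
	 * the meantime, clone3() fails instead of attaching. */
	pid = syscall(SYS_clone3, &args, sizeof(args));
	close(cgfd);
	return pid;
}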

Christian

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH 0/2] support cgroup pool in v1
  2021-09-13 14:20         ` Christian Brauner
@ 2021-09-13 16:24           ` Tejun Heo
  -1 siblings, 0 replies; 35+ messages in thread
From: Tejun Heo @ 2021-09-13 16:24 UTC (permalink / raw)
  To: Christian Brauner
  Cc: taoyi.ty, Greg KH, lizefan.x, hannes, mcgrof, keescook, yzaikin,
	linux-kernel, cgroups, linux-fsdevel, shanpeic

Hello,

On Mon, Sep 13, 2021 at 04:20:59PM +0200, Christian Brauner wrote:
> Afaict, there is currently no way to prevent the deletion of empty
> cgroups, especially newly created ones. So, for example, if I have a
> cgroup manager that prunes the cgroup tree whenever it detects empty
> cgroups, it can delete cgroups that were pre-allocated. This is
> something we have run into before.

systemd doesn't mess with cgroups behind a delegation point.

> A related problem is a crashed or killed container manager
> (segfault, sigkill, etc.). It might not have had the chance to clean up
> the cgroups it allocated for the container. If the container manager is
> restarted it can't reuse the existing cgroup it found because it has no
> way of guaranteeing that another program didn't create a cgroup with the
> same name between the time it crashed and the time it was restarted.
> We usually solve this by creating another cgroup with an index appended
> until we find an unallocated one, with an arbitrary cut-off point
> (e.g. 1000) after which we require manual intervention by the user.
> 
> Right now, iirc, one can rmdir() an empty cgroup while someone still
> holds a file descriptor open for it. This can lead to a situation where
> a cgroup got created but, before anything is moved into the cgroup (via
> clone3() or write()), someone else has deleted it. What would already be
> helpful is a way to prevent the deletion of a cgroup while someone still
> holds an open reference to it. This would allow a pool of cgroups to be
> created that can't simply be deleted.

The above are problems common to any entity managing a cgroup hierarchy.
Beyond permission- and delegation-based access control, cgroup doesn't
have a mechanism to grant exclusive managerial operations to a specific
application. It's userspace's responsibility to coordinate these
operations, as with most other kernel interfaces.
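
As one illustrative (and purely hypothetical) coordination scheme,
cooperating managers could take an advisory lock on the delegation
point before pruning or allocating cgroups underneath it:

#include <fcntl.h>
#include <sys/file.h>
#include <unistd.h>

/* Serialize managerial operations on a subtree with an advisory
 * flock() on the delegation-point directory. */
static int lock_subtree(const char *delegation_point)
{
	int fd = open(delegation_point, O_RDONLY | O_DIRECTORY | O_CLOEXEC);

	if (fd < 0)
		return -1;
	if (flock(fd, LOCK_EX) < 0) {	/* blocks until the subtree is ours */
		close(fd);
		return -1;
	}
	return fd;	/* closing the fd releases the lock */
}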

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 35+ messages in thread

end of thread, other threads:[~2021-09-13 16:24 UTC | newest]

Thread overview: 35+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-09-08 12:15 [RFC PATCH 0/2] support cgroup pool in v1 Yi Tao
2021-09-08 12:15 ` Yi Tao
2021-09-08 12:15 ` [RFC PATCH 1/2] add pinned flags for kernfs node Yi Tao
2021-09-08 12:15   ` [RFC PATCH 2/2] support cgroup pool in v1 Yi Tao
2021-09-08 12:15     ` Yi Tao
2021-09-08 12:35     ` Greg KH
2021-09-08 12:35       ` Greg KH
     [not found]       ` <084930d2-057a-04a7-76d1-b2a7bd37deb0@linux.alibaba.com>
2021-09-09 13:27         ` Greg KH
2021-09-10  2:20           ` taoyi.ty
2021-09-10  2:15       ` taoyi.ty
2021-09-10  2:15         ` taoyi.ty
2021-09-10  6:01         ` Greg KH
2021-09-10  6:01           ` Greg KH
2021-09-08 15:30     ` kernel test robot
2021-09-08 16:52     ` kernel test robot
2021-09-08 17:39     ` kernel test robot
2021-09-08 17:39     ` [RFC PATCH] cgroup_pool_mutex can be static kernel test robot
2021-09-08 12:35   ` [RFC PATCH 1/2] add pinned flags for kernfs node Greg KH
2021-09-08 12:35     ` Greg KH
2021-09-10  2:14     ` taoyi.ty
2021-09-10  6:00       ` Greg KH
2021-09-10  6:00         ` Greg KH
2021-09-08 16:26   ` kernel test robot
2021-09-08 12:37 ` [RFC PATCH 0/2] support cgroup pool in v1 Greg KH
2021-09-10  2:11   ` taoyi.ty
2021-09-10  6:01     ` Greg KH
2021-09-10  6:01       ` Greg KH
2021-09-10 16:49     ` Tejun Heo
2021-09-10 16:49       ` Tejun Heo
2021-09-13 14:20       ` Christian Brauner
2021-09-13 14:20         ` Christian Brauner
2021-09-13 16:24         ` Tejun Heo
2021-09-13 16:24           ` Tejun Heo
2021-09-08 16:35 ` Tejun Heo
2021-09-10  2:12   ` taoyi.ty
