From: Yi Tao <escape@linux.alibaba.com>
To: gregkh@linuxfoundation.org, tj@kernel.org,
	lizefan.x@bytedance.com, hannes@cmpxchg.org, mcgrof@kernel.org,
	keescook@chromium.org, yzaikin@google.com
Cc: linux-kernel@vger.kernel.org, cgroups@vger.kernel.org,
	linux-fsdevel@vger.kernel.org, shanpeic@linux.alibaba.com
Subject: [RFC PATCH 0/2] support cgroup pool in v1
Date: Wed,  8 Sep 2021 20:15:11 +0800
Message-ID: <cover.1631102579.git.escape@linux.alibaba.com>

In a scenario where containers are started with high concurrency, a
cgroup must be created for each container and its processes attached
to it in order to control the container's use of system resources.
The kernel uses the global cgroup_mutex lock to protect the
consistency of cgroup data, which leads to high long-tail latency for
cgroup operations during such concurrent startup. For example, when
starting 400 containers, the long-tail latency of creating a cgroup
under each subsystem reaches 900ms, which becomes a performance
bottleneck. The delay consists of two parts: the time spent in the
critical section protected by cgroup_mutex, and the scheduling delay
of sleeping on the lock. The scheduling delay grows as CPU load
increases.

To solve this long-tail latency problem, we designed a cgroup pool.
The pool creates a certain number of cgroups in advance, so that when
a user creates a cgroup through the mkdir system call, a clean cgroup
can be obtained from the pool quickly. The cgroup pool borrows the
idea of cgroup rename: by creating and renaming cgroups ahead of
time, it shrinks the critical section of cgroup creation and replaces
cgroup_mutex with a spinlock, which both reduces scheduling overhead
and eases contention with processes being attached.
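
For illustration only, the fast path could conceptually look like the
sketch below; the helper name and the pool_lock/pool_list/pool_node
fields are hypothetical and not taken from these patches:

/*
 * Hypothetical sketch: hand out a pre-created cgroup from the
 * parent's pool under a per-parent spinlock instead of building one
 * under cgroup_mutex. Returns NULL when the pool is empty, in which
 * case the caller falls back to the slow path.
 */
static struct cgroup *cgroup_pool_get(struct cgroup *parent)
{
	struct cgroup *cgrp;

	spin_lock(&parent->pool_lock);
	cgrp = list_first_entry_or_null(&parent->pool_list,
					struct cgroup, pool_node);
	if (cgrp)
		list_del(&cgrp->pool_node);
	spin_unlock(&parent->pool_lock);

	return cgrp;
}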

The core idea of the cgroup pool is a hidden kernfs tree. Cgroups
are implemented on top of the kernfs file system, and users
manipulate a cgroup through its kernfs files. We can therefore create
a cgroup in advance and park it in a hidden kernfs tree where users
cannot operate on it. When a user asks to create a cgroup, we simply
move a pre-created one to the requested location. Since this only
removes a node from one kernfs tree and inserts it into another, it
does not touch any other data of the cgroup or the related
subsystems, so the operation is very fast and does not need to hold
cgroup_mutex. This removes the cgroup_mutex bottleneck and shortens
the critical section, but kernfs_rwsem still protects the kernfs data
structures, so the scheduling delay of sleeping on a lock remains.
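
As a rough illustration of the move step (kernfs_rename_ns() is the
existing kernfs helper that cgroup rename is built on; the wrapper
name below is made up):

/*
 * Illustrative only: reparent a pooled cgroup's kernfs node from the
 * hidden tree to the directory the user asked for, under its new
 * name. Only the kernfs linkage changes; no cgroup or subsystem
 * state is touched.
 */
static int cgroup_pool_publish(struct cgroup *cgrp,
			       struct kernfs_node *new_parent,
			       const char *name)
{
	return kernfs_rename_ns(cgrp->kn, new_parent, name, NULL);
}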

To avoid kernfs_rwsem as well, we introduce a pinned state for
kernfs nodes. When a node is pinned, the lock protecting its data
changes from kernfs_rwsem to a lock chosen by the caller. For the
cgroup pool, each parent cgroup gets its own spinlock. When the pool
is enabled, the kernfs nodes of all cgroups under that parent are set
to the pinned state, and creating, deleting, and moving these nodes
is protected by the parent cgroup's spinlock, so data consistency is
not a problem.
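
A minimal sketch of the pinned idea, assuming the node carries a flag
and a pointer to the lock its owner chose (the flag and field names
are illustrative; the real plumbing is in patch 1):

/*
 * Illustrative lock selection: a pinned node is serialized by the
 * lock supplied by its owner (here the parent cgroup's spinlock)
 * instead of the global kernfs_rwsem.
 */
static void kernfs_node_write_lock(struct kernfs_node *kn)
{
	if (kn->flags & KERNFS_PINNED)		/* illustrative flag name */
		spin_lock(kn->pinned_lock);	/* lock set by the pool owner */
	else
		down_write(&kernfs_rwsem);
}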

With the pool enabled, creating a cgroup takes the fast path and
obtains one from the pool; deleting a cgroup still takes the slow
path. When the pool runs low, a delayed task is triggered to
replenish it after a period of time. The delay avoids competing with
cgroup creations that are currently in flight and hurting their
performance. If the pool is exhausted and not replenished in time,
cgroup creation falls back to the slow path, so users need to choose
an appropriate pool size and replenishment delay.
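
For example, the replenish trigger could be built on the kernel's
delayed work API along these lines (the struct and field names below
are hypothetical):

/*
 * Illustrative refill trigger: when the pool drops below a
 * watermark, queue delayed work that refills it later via the slow
 * path, so the refill does not compete with mkdirs that are in
 * flight right now.
 */
static void cgroup_pool_maybe_refill(struct cgroup_pool *pool)
{
	if (pool->nr_available < pool->low_watermark)
		schedule_delayed_work(&pool->refill_work,
				      msecs_to_jiffies(pool->refill_delay_ms));
}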

What these patches do:
	1. Add a pinned flag for kernfs nodes, so that they can get rid
	   of kernfs_rwsem and be protected by another lock instead.
	2. Add a pool_size interface used to enable and disable the
	   cgroup pool.
	3. Add an extra kernfs tree used to hide the cgroups in the
	   pool.
	4. Add a spinlock to protect the kernfs nodes of the cgroups
	   in the pool.


Yi Tao (2):
  add pinned flags for kernfs node
  support cgroup pool in v1

 fs/kernfs/dir.c             |  74 ++++++++++++++++-------
 include/linux/cgroup-defs.h |  16 +++++
 include/linux/cgroup.h      |   2 +
 include/linux/kernfs.h      |  14 +++++
 kernel/cgroup/cgroup-v1.c   | 139 ++++++++++++++++++++++++++++++++++++++++++++
 kernel/cgroup/cgroup.c      | 113 ++++++++++++++++++++++++++++++++++-
 kernel/sysctl.c             |   8 +++
 7 files changed, 345 insertions(+), 21 deletions(-)

-- 
1.8.3.1

