From: Yi Tao <escape@linux.alibaba.com>
To: gregkh@linuxfoundation.org, tj@kernel.org, lizefan.x@bytedance.com,
    hannes@cmpxchg.org, mcgrof@kernel.org, keescook@chromium.org,
    yzaikin@google.com
Cc: linux-kernel@vger.kernel.org, cgroups@vger.kernel.org,
    linux-fsdevel@vger.kernel.org, shanpeic@linux.alibaba.com
Subject: [RFC PATCH 0/2] support cgroup pool in v1
Date: Wed, 8 Sep 2021 20:15:11 +0800

When containers are started with high concurrency, a corresponding cgroup
must be created for each container and its processes attached to it in
order to control the container's use of system resources. The kernel uses
the global cgroup_mutex to protect the consistency of this data, which
results in high long-tail latency for cgroup-related operations during
concurrent startup. For example, when starting 400 containers, the
long-tail latency of creating a cgroup under each subsystem reaches
900ms, which becomes a performance bottleneck.

The delay consists of two parts: the time spent in the critical section
protected by cgroup_mutex, and the scheduling delay incurred while
sleeping on the mutex. The scheduling delay grows as CPU load increases.

To solve this long-tail latency problem, we designed a cgroup pool. The
pool creates a certain number of cgroups in advance, so that when a user
creates a cgroup through the mkdir system call, a clean cgroup can be
obtained from the pool quickly. The cgroup pool draws on the idea of
cgroup rename: by creating cgroups ahead of time and renaming them on
demand, it shrinks the critical section of cgroup creation and uses a
spinlock instead of cgroup_mutex, which both reduces scheduling overhead
and eases contention with processes being attached.

The core idea of the implementation is a hidden kernfs tree. Cgroups are
implemented on top of the kernfs file system, and users manipulate
cgroups through kernfs files.
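To make the intent concrete, here is a rough, hypothetical sketch of the
fast path. This is not the code in these patches; struct and helper names
below (pooled_cgroup, cgroup_pool, cgroup_pool_take) are made up for
illustration. A parent keeps a list of pre-created children and mkdir
pops one under a per-parent spinlock, relinking its kernfs node instead
of creating a new directory under cgroup_mutex:

	/*
	 * Hypothetical sketch of the pool fast path (not the actual
	 * patch code). A parent cgroup keeps a list of pre-created
	 * children hidden from userspace; mkdir pops one under a
	 * per-parent spinlock instead of taking cgroup_mutex.
	 */
	#include <linux/list.h>
	#include <linux/spinlock.h>
	#include <linux/kernfs.h>

	struct pooled_cgroup {			/* hypothetical pool entry */
		struct kernfs_node *kn;		/* pre-created, hidden kernfs dir */
		struct list_head    node;
	};

	struct cgroup_pool {			/* hypothetical per-parent pool */
		spinlock_t	 lock;		/* replaces cgroup_mutex on this path */
		struct list_head free;		/* pre-created, unused cgroups */
		unsigned int	 nr_free;
	};

	/*
	 * Fast path: take a clean cgroup out of the pool and move its
	 * kernfs node to the requested name under the visible parent.
	 * Returns NULL when the pool is empty, in which case the caller
	 * falls back to the ordinary (slow) creation path.
	 */
	static struct pooled_cgroup *
	cgroup_pool_take(struct cgroup_pool *pool, struct kernfs_node *parent,
			 const char *name)
	{
		struct pooled_cgroup *pc;

		spin_lock(&pool->lock);
		pc = list_first_entry_or_null(&pool->free, struct pooled_cgroup, node);
		if (pc) {
			list_del(&pc->node);
			pool->nr_free--;
		}
		spin_unlock(&pool->lock);

		if (!pc)
			return NULL;	/* fall back to the slow path */

		/*
		 * kernfs_rename_ns() relinks the node under the visible
		 * tree; in the proposed patches a pinned node would be
		 * protected by the parent's spinlock rather than
		 * kernfs_rwsem while this happens.
		 */
		if (kernfs_rename_ns(pc->kn, parent, name, NULL)) {
			/* hypothetical error handling: put the entry back */
			spin_lock(&pool->lock);
			list_add(&pc->node, &pool->free);
			pool->nr_free++;
			spin_unlock(&pool->lock);
			return NULL;
		}
		return pc;
	}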
Concretely, we create cgroups in advance and place them in a hidden
kernfs tree where users cannot operate on them. When a user needs to
create a cgroup, one is moved from the hidden tree to its intended
location. Since this only removes a node from one kernfs tree and
inserts it into another, it does not touch any other data of the cgroup
or the related subsystems, so the operation is very fast and does not
need to hold cgroup_mutex.

This removes the cgroup_mutex limitation and shortens the critical
section, but kernfs_rwsem still protects the kernfs data structures, so
the scheduling delay of sleeping on a lock remains. To avoid taking
kernfs_rwsem, we introduce a pinned state for kernfs nodes. When a node
is pinned, the lock protecting its data is switched from kernfs_rwsem to
a lock that can be set by the pinning caller. In the cgroup pool
scenario, the parent cgroup owns a spinlock: when the pool is enabled,
the kernfs nodes of all cgroups under that parent are set to the pinned
state, and creating, deleting, and moving these nodes is protected by
the parent's spinlock, so data consistency is not a problem. A rough
sketch of this lock selection is given below.
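The following is purely illustrative; the wrapper struct, its fields and
the helpers are hypothetical and are not the interface added by these
patches. It only shows the idea of choosing between kernfs_rwsem and a
caller-supplied lock based on a pinned flag:

	/*
	 * Hypothetical illustration of the pinned idea (not the patch
	 * code): a pinned node is protected by a lock supplied when it
	 * was pinned, instead of by the global kernfs_rwsem.
	 */
	#include <linux/types.h>
	#include <linux/kernfs.h>
	#include <linux/rwsem.h>
	#include <linux/spinlock.h>

	/* assumed global, declared here only for the sketch */
	extern struct rw_semaphore kernfs_rwsem;

	struct pinned_kernfs_node {		/* hypothetical wrapper */
		struct kernfs_node *kn;
		bool		    pinned;	/* true: bypass kernfs_rwsem */
		spinlock_t	   *pin_lock;	/* lock chosen by whoever pinned it */
	};

	/* Take the lock that protects this node before modifying it. */
	static void pinned_node_lock(struct pinned_kernfs_node *pkn)
	{
		if (pkn->pinned)
			spin_lock(pkn->pin_lock);	/* e.g. the parent cgroup's spinlock */
		else
			down_write(&kernfs_rwsem);
	}

	static void pinned_node_unlock(struct pinned_kernfs_node *pkn)
	{
		if (pkn->pinned)
			spin_unlock(pkn->pin_lock);
		else
			up_write(&kernfs_rwsem);
	}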
After the pool is opened, cgroup creation takes the fast path and obtains
a cgroup from the pool; deleting a cgroup still takes the slow path. When
the resources in the pool run low, a delayed task is triggered and the
pool is replenished after a period of time. The delay avoids competing
with the cgroup creations currently in flight and hurting their
performance. If the pool is exhausted and has not been replenished in
time, cgroup creation falls back to the slow path, so users need to set
an appropriate pool size and replenishment delay. A sketch of such
deferred replenishment follows.
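For illustration only, again with hypothetical names and a made-up
refill delay, deferred replenishment could be modelled with a delayed
work item; the helpers cgroup_pool_count() and cgroup_pool_add_one() are
assumed from the earlier sketch and do not exist in the kernel:

	/*
	 * Hypothetical sketch of deferred pool replenishment (not the
	 * patch code). When the pool runs low, the fast path schedules
	 * a delayed work item instead of refilling inline, so refilling
	 * does not compete with the creations currently in flight.
	 */
	#include <linux/workqueue.h>
	#include <linux/jiffies.h>

	#define POOL_REFILL_DELAY_MS	200	/* hypothetical tunable */

	struct cgroup_pool;			/* as sketched above */

	/* assumed helpers from the earlier sketch (hypothetical) */
	extern unsigned int cgroup_pool_count(struct cgroup_pool *pool);
	extern void cgroup_pool_add_one(struct cgroup_pool *pool);

	static struct cgroup_pool *the_pool;	/* illustration only */
	static unsigned int pool_size = 64;	/* target size, cf. the pool_size knob */

	static void pool_refill_fn(struct work_struct *work)
	{
		/* Slow-path work: create cgroups until the pool is full again. */
		while (cgroup_pool_count(the_pool) < pool_size)
			cgroup_pool_add_one(the_pool);
	}

	static DECLARE_DELAYED_WORK(pool_refill_work, pool_refill_fn);

	/* Called from the fast path when it notices the pool running low. */
	static void pool_request_refill(void)
	{
		schedule_delayed_work(&pool_refill_work,
				      msecs_to_jiffies(POOL_REFILL_DELAY_MS));
	}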
What we did in these patches:

1. Add a pinned flag for kernfs nodes, so that they can be taken out
   from under kernfs_rwsem and protected by another lock instead.
2. Add a pool_size interface used to open and close the cgroup pool.
3. Add an extra kernfs tree used to hide the cgroups in the pool.
4. Add a spinlock to protect the kernfs nodes of the cgroups in the pool.

Yi Tao (2):
  add pinned flags for kernfs node
  support cgroup pool in v1

 fs/kernfs/dir.c             |  74 ++++++++++++++++-------
 include/linux/cgroup-defs.h |  16 +++++
 include/linux/cgroup.h      |   2 +
 include/linux/kernfs.h      |  14 +++++
 kernel/cgroup/cgroup-v1.c   | 139 ++++++++++++++++++++++++++++++++++++++++++++
 kernel/cgroup/cgroup.c      | 113 ++++++++++++++++++++++++++++++++++-
 kernel/sysctl.c             |   8 +++
 7 files changed, 345 insertions(+), 21 deletions(-)

--
1.8.3.1