From: Christian Brauner <christian.brauner@ubuntu.com>
To: Tejun Heo <tj@kernel.org>
Cc: "taoyi.ty" <escape@linux.alibaba.com>,
	Greg KH <gregkh@linuxfoundation.org>,
	lizefan.x@bytedance.com, hannes@cmpxchg.org, mcgrof@kernel.org,
	keescook@chromium.org, yzaikin@google.com,
	linux-kernel@vger.kernel.org, cgroups@vger.kernel.org,
	linux-fsdevel@vger.kernel.org, shanpeic@linux.alibaba.com
Subject: Re: [RFC PATCH 0/2] support cgroup pool in v1
Date: Mon, 13 Sep 2021 16:20:59 +0200	[thread overview]
Message-ID: <20210913142059.qbypd4vfq6wdzqfw@wittgenstein> (raw)
In-Reply-To: <YTuMl+cC6FyA/Hsv@slm.duckdns.org>

On Fri, Sep 10, 2021 at 06:49:27AM -1000, Tejun Heo wrote:
> Hello,
> 
> On Fri, Sep 10, 2021 at 10:11:53AM +0800, taoyi.ty wrote:
> > The scenario is function computing in the public
> > cloud. Each function computing instance is allocated
> > about 0.1 CPU cores and 100M of memory. On a high-end
> > server, for example one with 104 cores and 384G, it is
> > normal to create hundreds of containers at the same
> > time when a burst of requests comes in.
> 
> This type of use case isn't something cgroup is good at, at least not
> currently. The problem is that trying to scale management operations like
> creating and destroying cgroups has implications on how each controller is
> implemented - we want the hot paths which get used while cgroups are running
> actively to be as efficient and scalable as possible even if that requires a
> lot of extra preparation and lazy cleanup operations. We don't really want
> to push for cgroup creation / destruction efficiency at the cost of hot path
> overhead.
> 
> This has implications for use cases like the one you describe. Even if
> the kernel pre-prepares cgroups to lower the latency of cgroup
> creation, the system would be doing a *lot* of extra managerial work,
> constantly creating and destroying cgroups for not much actual work.
> 
> Usually, the right solution for this sort of situation is pooling
> cgroups from userspace, which usually has much better insight into
> which cgroups can be recycled and can also adjust the cgroup hierarchy
> to better fit the use case (e.g. some rapid-cycling cgroups can benefit
> from higher-level resource configurations).

I had the same reaction and I wanted to do something like this before,
i.e. maintain a pool of pre-allocated cgroups in userspace. But there
were some problems.
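
For illustration, a minimal sketch of what pre-allocating from
userspace looks like (the /sys/fs/cgroup/cpu mount point and the
pool-%d naming are made up for the example):

#include <errno.h>
#include <stdio.h>
#include <sys/stat.h>

#define POOL_SIZE 64

/* Pre-create POOL_SIZE empty cgroups so that container startup only
 * has to hand out an existing directory instead of paying the
 * mkdir() latency at request time. */
static int prealloc_cgroup_pool(void)
{
        char path[256];
        int i;

        for (i = 0; i < POOL_SIZE; i++) {
                snprintf(path, sizeof(path),
                         "/sys/fs/cgroup/cpu/pool-%d", i);
                if (mkdir(path, 0755) && errno != EEXIST)
                        return -errno;
        }
        return 0;
}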

Afaict, there is currently no way to prevent the deletion of empty
cgroups, especially newly created ones. So, for example, a cgroup
manager that prunes the cgroup tree whenever it detects empty cgroups
can end up deleting cgroups that were pre-allocated. This is something
we have run into before.
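
The pruning check in such a manager often boils down to "is
cgroup.procs empty?"; roughly (sketch only):

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Returns 1 if the cgroup currently has no member processes. A
 * manager that rmdir()s purely based on this check will also delete
 * pre-allocated cgroups that simply haven't been used yet. */
static int cgroup_is_empty(const char *cgrp)
{
        char path[512];
        char buf[32];
        ssize_t n;
        int fd;

        snprintf(path, sizeof(path), "%s/cgroup.procs", cgrp);
        fd = open(path, O_RDONLY);
        if (fd < 0)
                return -errno;

        /* Read errors are treated as "not empty" for brevity. */
        n = read(fd, buf, sizeof(buf));
        close(fd);
        return n == 0;
}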

A related problem is a crashed or killed container manager (segfault,
SIGKILL, etc.). It might not have had the chance to clean up the
cgroups it allocated for the container. When the container manager is
restarted it can't reuse an existing cgroup it finds because it has no
way of knowing whether, between the time it crashed and got restarted,
another program created a cgroup with the same name. We usually solve
this by creating another cgroup with an index appended until we find an
unallocated one, with an arbitrary cutoff point (e.g. 1000) after which
we require manual intervention by the user.
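
In code that dance looks roughly like this, with the 1000 being the
arbitrary cutoff mentioned above:

#include <errno.h>
#include <stdio.h>
#include <sys/stat.h>

#define MAX_CGROUP_TRIES 1000

/* Try <base>, then <base>-1, <base>-2, ... until mkdir() succeeds.
 * Returns the index that was used, or -1 once we hit the cutoff and
 * give up, requiring manual intervention. */
static int create_unique_cgroup(const char *base, char *path, size_t len)
{
        int i;

        for (i = 0; i < MAX_CGROUP_TRIES; i++) {
                if (i == 0)
                        snprintf(path, len, "%s", base);
                else
                        snprintf(path, len, "%s-%d", base, i);
                if (mkdir(path, 0755) == 0)
                        return i;
                if (errno != EEXIST)
                        return -1;
        }
        return -1;
}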

Right now, iirc, one can rmdir() an empty cgroup while someone still
holds an open file descriptor for it. This can lead to a situation
where a cgroup got created but, before anything is moved into it (via
clone3() or write()), someone else has deleted it. What would already
be helpful is a way to prevent the deletion of a cgroup while someone
still holds an open reference to it. That would allow creating a pool
of cgroups that can't simply be deleted.
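
For reference, the window I mean sits between the open() and the
clone3() in a sketch like the following (this assumes the unified
hierarchy and a kernel with CLONE_INTO_CGROUP, i.e. 5.7+; with cgroup1
the move would be a write() to cgroup.procs instead):

#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/sched.h>
#include <signal.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Returns 0 in the child, the child's pid in the parent, -1 on error. */
static pid_t spawn_into_cgroup(const char *cgrp)
{
        struct clone_args args = {0};
        int cgroup_fd;

        cgroup_fd = open(cgrp, O_RDONLY | O_DIRECTORY | O_CLOEXEC);
        if (cgroup_fd < 0)
                return -1;

        /* Nothing prevents another process from rmdir()ing the cgroup
         * at this point, even though we hold an open fd to it; the
         * clone3() below then fails. */
        args.flags       = CLONE_INTO_CGROUP;
        args.exit_signal = SIGCHLD;
        args.cgroup      = cgroup_fd;

        return syscall(__NR_clone3, &args, sizeof(args));
}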

Christian
