From: Vladimir Davydov <vdavydov@virtuozzo.com>
To: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tejun Heo <tj@kernel.org>,
Andrew Morton <akpm@linux-foundation.org>,
Michal Hocko <mhocko@suse.cz>, Li Zefan <lizefan@huawei.com>,
<linux-mm@kvack.org>, <cgroups@vger.kernel.org>,
<linux-kernel@vger.kernel.org>, <kernel-team@fb.com>
Subject: Re: [PATCH 3/3] mm: memcontrol: fix cgroup creation failure after many small jobs
Date: Tue, 21 Jun 2016 13:16:51 +0300
Message-ID: <20160621101650.GD15970@esperanza>
In-Reply-To: <20160617162516.GD19084@cmpxchg.org>
On Fri, Jun 17, 2016 at 12:25:16PM -0400, Johannes Weiner wrote:
> The memory controller has quite a bit of state that usually outlives
> the cgroup and pins its CSS until said state disappears. At the same
> time it imposes a 16-bit limit on the CSS ID space to economically
> store IDs in the wild. Consequently, when we use cgroups to contain
> frequent but small and short-lived jobs that leave behind some page
> cache, we quickly run into the 64k limitations of outstanding CSSs.
> Creating a new cgroup fails with -ENOSPC while there are only a few,
> or even no user-visible cgroups in existence.
>
> Although pinning CSSs past cgroup removal is common, there are only
> two instances that actually need an ID after a cgroup is deleted:
> cache shadow entries and swapout records.
>
> Cache shadow entries reference the ID weakly and can deal with the CSS
> having disappeared when it's looked up later. They pose no hurdle.
>
> Swap-out records do need to pin the css to hierarchically attribute
> swapins after the cgroup has been deleted; though the only pages that
> remain swapped out after offlining are tmpfs/shmem pages. And those
> references are under the user's control, so they are manageable.
>
> This patch introduces a private 16-bit memcg ID and switches swap and
> cache shadow entries over to using that. This ID can then be recycled
> after offlining when the CSS remains pinned only by objects that don't
> specifically need it.
>
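To make the recycling idea concrete, here is a minimal userspace sketch,
not the patch itself. All names in it (id_alloc, id_free, struct group,
group_create, group_offline) are hypothetical, the linear scan stands in
for the IDR, treating 0 as "no ID" is an assumption of the sketch, and
the extra reference that swap-out records would hold on the ID is left
out. It only shows that the ID can go back into the pool at offline time
while other references keep the object itself alive.

```c
#include <assert.h>
#include <stdint.h>

#define ID_MAX 65535u	/* 16-bit ID space; 0 means "no ID" in this sketch */

/* Hypothetical userspace stand-in for the private ID space. */
static uint8_t id_used[ID_MAX + 1];

/* Return a free ID in [1, ID_MAX], or 0 if the space is exhausted. */
static uint16_t id_alloc(void)
{
	for (uint32_t id = 1; id <= ID_MAX; id++) {
		if (!id_used[id]) {
			id_used[id] = 1;
			return (uint16_t)id;
		}
	}
	return 0;
}

/* Put an ID back into the pool so a later id_alloc() can reuse it. */
static void id_free(uint16_t id)
{
	assert(id != 0 && id_used[id]);
	id_used[id] = 0;
}

/* Hypothetical group object: kept alive by refs, identified by id. */
struct group {
	unsigned int refs;	/* e.g. leftover page cache pins the object */
	uint16_t id;		/* 0 once the ID has been released */
};

/* Returns 0 on success, -1 when no ID is left (-ENOSPC in the changelog). */
static int group_create(struct group *g)
{
	g->id = id_alloc();
	if (!g->id)
		return -1;
	g->refs = 1;
	return 0;
}

/* Offline: release the ID right away; refs may still pin the object. */
static void group_offline(struct group *g)
{
	id_free(g->id);
	g->id = 0;
}
```

With this shape, a group that is offlined while still referenced no
longer occupies a slot in the ID space, so the next group_create() can
pick up the same ID.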
> This script demonstrates the problem by faulting one cache page in a
> new cgroup and deleting it again:
>
> set -e
> mkdir -p pages
> for x in `seq 128000`; do
>   [ $((x % 1000)) -eq 0 ] && echo $x
>   mkdir /cgroup/foo
>   echo $$ >/cgroup/foo/cgroup.procs
>   echo trex >pages/$x
>   echo $$ >/cgroup/cgroup.procs
>   rmdir /cgroup/foo
> done
>
> When run on an unpatched kernel, we eventually run out of possible IDs
> even though there are no visible cgroups:
>
> [root@ham ~]# ./cssidstress.sh
> [...]
> 65000
> mkdir: cannot create directory '/cgroup/foo': No space left on device
>
> After this patch, the IDs get released upon cgroup destruction and the
> cache and css objects get released once memory reclaim kicks in.

With 65K dead cgroups still pinned, the reclaimer will need a
substantial amount of time to iterate over all of them, which might
result in latency spikes. To avoid that, we could probably move the
pages from a dead cgroup's LRU to its parent's LRU on offline while
still leaving the dead cgroup pinned, as we already do for list_lru
entries.

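Roughly what I have in mind, as a userspace sketch rather than kernel
code. The types and helpers here (struct node, struct group, group_add,
group_offline) are made up for illustration, and locking and per-node
lists are ignored entirely. On offline the child's list is spliced onto
the parent's list in one step, so a reclaimer walking the hierarchy
finds nothing left to scan in the dead group.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins; none of these names exist in the kernel. */
struct node {
	struct node *prev, *next;
};

struct group {
	struct group *parent;
	struct node lru;	/* circular list head */
	unsigned long nr;	/* number of entries on lru */
	int dead;
};

static void list_init(struct node *head)
{
	head->prev = head->next = head;
}

static void list_add_tail(struct node *n, struct node *head)
{
	n->prev = head->prev;
	n->next = head;
	head->prev->next = n;
	head->prev = n;
}

/* Move every entry of 'from' to the tail of 'to', leaving 'from' empty. */
static void list_splice_tail_init(struct node *from, struct node *to)
{
	if (from->next == from)
		return;		/* nothing to move */

	struct node *first = from->next;
	struct node *last = from->prev;

	first->prev = to->prev;
	to->prev->next = first;
	last->next = to;
	to->prev = last;
	list_init(from);
}

static void group_add(struct group *g, struct node *n)
{
	list_add_tail(n, &g->lru);
	g->nr++;
}

/* Offline: hand all entries to the parent; the group itself stays around. */
static void group_offline(struct group *g)
{
	assert(!g->dead);
	if (g->parent) {
		list_splice_tail_init(&g->lru, &g->parent->lru);
		g->parent->nr += g->nr;
		g->nr = 0;
	}
	g->dead = 1;
}
```

The splice is O(1) regardless of how many entries the dead group held,
which is the property that would make doing this at offline time cheap.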
>
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Vladimir Davydov <vdavydov@virtuozzo.com>
One nit below.
...
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 75e74408cc8f..dc92b2df2585 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -4057,6 +4057,60 @@ static struct cftype mem_cgroup_legacy_files[] = {
> { }, /* terminate */
> };
>
> +/*
> + * Private memory cgroup IDR
> + *
> + * Swap-out records and page cache shadow entries need to store memcg
> + * references in constrained space, so we maintain an ID space that is
> + * limited to 16 bit (MEM_CGROUP_ID_MAX), limiting the total number of
> + * memory-controlled cgroups to 64k.
> + *
> + * However, there usually are many references to the offline CSS after
> + * the cgroup has been destroyed, such as page cache or reclaimable
> + * slab objects, that don't need to hang on to the ID. We want to keep
> + * those dead CSS from occupying IDs, or we might quickly exhaust the
> + * relatively small ID space and prevent the creation of new cgroups
> + * even when there are much fewer than 64k cgroups - possibly none.
> + *
> + * Maintain a private 16-bit ID space for memcg, and allow the ID to
> + * be freed and recycled when it's no longer needed, which is usually
> + * when the CSS is offlined.
> + *
> + * The only exception to that are records of swapped out tmpfs/shmem
> + * pages that need to be attributed to live ancestors on swapin. But
> + * those references are manageable from userspace.
> + */
> +
> +static struct idr mem_cgroup_idr;
static DEFINE_IDR(mem_cgroup_idr);
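To spell the nit out: DEFINE_IDR() defines the object and initializes
it at compile time, whereas a bare static definition would presumably
need a separate idr_init() call before first use. The userspace sketch
below only illustrates that pattern. struct id_pool, ID_POOL_INIT,
DEFINE_ID_POOL and the helper functions are hypothetical stand-ins and
say nothing about how struct idr is laid out.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for struct idr; only the init pattern matters. */
struct id_pool {
	uint32_t next;	/* next ID to hand out */
	uint32_t max;	/* highest valid ID */
};

/* Compile-time initializer, in the spirit of DEFINE_IDR(). */
#define ID_POOL_INIT		{ .next = 1, .max = 65535 }
#define DEFINE_ID_POOL(name)	struct id_pool name = ID_POOL_INIT

/* Runtime initializer, in the spirit of idr_init(); needs an explicit call. */
static void id_pool_init(struct id_pool *p)
{
	p->next = 1;
	p->max = 65535;
}

/* A pool is usable once 'next' points into the valid range [1, max]. */
static int id_pool_usable(const struct id_pool *p)
{
	return p->next >= 1 && p->next <= p->max;
}

/* Hand out the next ID; the caller must have a usable pool. */
static uint32_t id_pool_get(struct id_pool *p)
{
	assert(id_pool_usable(p));
	return p->next++;
}

static struct id_pool runtime_pool;	/* zeroed until id_pool_init() runs */
static DEFINE_ID_POOL(static_pool);	/* ready without any init call */
```

The statically defined pool cannot be used before it is set up, which
removes one ordering dependency from the init path.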