From: Fam Zheng <zhengfeiran@bytedance.com>
Subject: Re: memory cgroup pagecache and inode problem
Date: Fri, 4 Jan 2019 13:12:29 +0800
To: Yang Shi <shy828301@gmail.com>
Cc: cgroups@vger.kernel.org, Linux MM, tj@kernel.org, Johannes Weiner, lizefan@huawei.com, Michal Hocko, Vladimir Davydov, duanxiongchun@bytedance.com, 张永肃, liuxiaozhou@bytedance.com

> On Jan 4, 2019, at 13:00, Yang Shi wrote:
>
> On Thu, Jan 3, 2019 at 8:45 PM Fam Zheng <zhengfeiran@bytedance.com> wrote:
>>
>> Fixing the mm list address. Sorry for the noise.
>>
>> Fam
>>
>>> On Jan 4, 2019, at 12:43, Fam Zheng <zhengfeiran@bytedance.com> wrote:
>>>
>>> Hi,
>>>
>>> On our servers, which frequently spawn containers, we find that if a process uses page cache inside a memory cgroup, then after the process exits and the cgroup is offlined, the page cache is still charged to that cgroup, so the cgroup cannot be destroyed until the page cache is dropped. This builds up huge memory pressure over time: we have seen over one hundred thousand such offlined memory cgroups holding too much memory (~100G) on one system. The memory cannot be released immediately even after all the associated page cache is dropped, because offlined memory cgroups are destroyed asynchronously by a kworker. In some cases this leads to OOM, because a synchronous memory allocation fails in the meantime.
>
> Does force_empty help out your usecase? You can write to
> memory.force_empty to reclaim as much memory as possible before
> rmdir'ing the memcg. This would prevent the page cache from accumulating.

Hmm, this might be an option. FWIW, we have been using drop_caches as a workaround; a rough sketch of the force_empty approach is below.
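A minimal userspace sketch of what I mean, assuming a v1 memory controller and a hypothetical per-container cgroup path (memory.force_empty is cgroup v1 only), instead of the global /proc/sys/vm/drop_caches hammer:

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>

static int empty_and_remove(const char *memcg)
{
	char path[4096];
	int fd;

	snprintf(path, sizeof(path), "%s/memory.force_empty", memcg);
	fd = open(path, O_WRONLY);
	if (fd < 0)
		return -1;
	/* Any write triggers reclaim of the group's remaining charges. */
	if (write(fd, "0", 1) < 0) {
		close(fd);
		return -1;
	}
	close(fd);

	/* With the charges reclaimed, rmdir can release the group promptly. */
	return rmdir(memcg);
}

int main(void)
{
	/* Hypothetical per-container group. */
	return empty_and_remove("/sys/fs/cgroup/memory/containers/c1") ? 1 : 0;
}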
> BTW, this is cgroup v1 only. I'm working on a patch to bring this back
> into v2, as discussed in https://lkml.org/lkml/2019/1/3/484.
>
>>> We think a fix is to create a kworker that scans all page caches, dentry caches, etc. in the background; if a referenced memory cgroup is offline, it tries to drop the cache or move it to the parent cgroup. This kworker could wake up periodically, or upon a memory cgroup offline event (or both).
>
> Reparenting has been deprecated for a long time. I don't think we want
> to bring it back. Actually, css offline is handled by a kworker now. I
> proposed a patch to do force_empty in the kworker, please see
> https://lkml.org/lkml/2019/1/2/377.

Could you elaborate a bit on why reparenting is not a good idea?

>
>>> There is a similar problem with inodes. After digging into the ext4 code, we find that the inode cache is created with SLAB_ACCOUNT, so inodes are allocated from a slab charged to the current memory cgroup. After this memory cgroup goes offline, the inode may still be held via a dentry cache: if another process uses the same file, the inode is held by that process, preventing the previous memory cgroup from being destroyed until the other process closes the file and the dentry cache is dropped.
>
> I'm not sure if you really need kmem charge. If not, you may try
> cgroup.memory=nokmem.

A very good hint, we'll investigate, thanks! (For reference, a toy illustration of what SLAB_ACCOUNT implies is appended at the end of this mail.)

Fam

> Regards,
> Yang
>
>>> We still don't have a reasonable way to fix this.
>>>
>>> Ideas?
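P.S. The promised toy sketch of the SLAB_ACCOUNT behaviour, with made-up names (this is not the actual ext4 code): a cache created with SLAB_ACCOUNT charges each object to the allocating task's memory cgroup, so an inode allocated inside a container can keep the container's offlined memcg pinned for as long as something else, e.g. another process's dentry for the same file, still holds it.

#include <linux/errno.h>
#include <linux/init.h>
#include <linux/slab.h>

/* Stand-in for a filesystem's per-inode structure. */
struct demo_inode_info {
	unsigned long flags;
};

static struct kmem_cache *demo_inode_cachep;

static int __init demo_cache_init(void)
{
	/*
	 * SLAB_ACCOUNT charges every allocation from this cache to the
	 * allocating task's memcg, so objects that outlive the cgroup
	 * keep the offlined memcg pinned until they are freed.
	 */
	demo_inode_cachep = kmem_cache_create("demo_inode_cache",
			sizeof(struct demo_inode_info), 0,
			SLAB_RECLAIM_ACCOUNT | SLAB_ACCOUNT, NULL);
	return demo_inode_cachep ? 0 : -ENOMEM;
}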
