* Re: memory cgroup pagecache and inode problem
       [not found] <15614FDC-198E-449B-BFAF-B00D6EF61155@bytedance.com>
@ 2019-01-04  4:44 ` Fam Zheng
  2019-01-04  5:00   ` Yang Shi
  2019-01-04  9:04 ` Michal Hocko
  1 sibling, 1 reply; 28+ messages in thread
From: Fam Zheng @ 2019-01-04  4:44 UTC (permalink / raw)
  To: cgroups, linux-mm
  Cc: tj, hannes, lizefan, mhocko, vdavydov.dev, duanxiongchun,
	张永肃

Fixing the mm list address. Sorry for the noise.

Fam


> On Jan 4, 2019, at 12:43, Fam Zheng <zhengfeiran@bytedance.com> wrote:
> 
> Hi,
> 
> In our server, which frequently spawns containers, we find that if a process used pagecache in a memory cgroup, then after the process exits and the memory cgroup is offlined, the memory cgroup will not be destroyed until the pagecache is dropped, because the pagecache is still charged to it. This brings huge memory stress over time. We find that over one hundred thousand such offlined memory cgroups in the system hold too much memory (~100G). This memory cannot be released immediately even after all associated pagecaches are released, because those memory cgroups are destroyed asynchronously by a kworker. In some cases this can cause OOM, since synchronous memory allocations fail.
> 
> We think a fix is to create a kworker that scans all pagecaches, dentry caches, etc. in the background; if a cache references an offline memory cgroup, it tries to drop the cache or move it to the parent cgroup. This kworker can wake up periodically, or upon a memory cgroup offline event (or both).
> 
> There is a similar problem with inodes. After digging into the ext4 code, we find that the inode cache is created with SLAB_ACCOUNT, so inodes are allocated from a slab that is charged to the current memory cgroup. After this memory cgroup goes offline, the inode may be held by a dentry cache. If another process uses the same file, the inode will be held by that process, preventing the previous memory cgroup from being destroyed until that other process closes the file and the dentry cache is dropped.
> 
> We still don't have a reasonable way to fix this.
> 
> Ideas?
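
For reference, the SLAB_ACCOUNT usage described above comes from ext4's
inode cache setup; the snippet below is a paraphrase of the relevant part
of fs/ext4/super.c (the exact flags and helper used vary across kernel
versions), showing how each ext4 inode allocation gets charged to the
allocating task's memcg.

static int __init init_inodecache(void)
{
	/* SLAB_ACCOUNT charges every ext4_inode_info object to the
	 * memcg of the task that allocates it, which is what keeps an
	 * offlined memcg pinned while its inodes stay cached. */
	ext4_inode_cachep = kmem_cache_create("ext4_inode_cache",
					sizeof(struct ext4_inode_info), 0,
					(SLAB_RECLAIM_ACCOUNT |
					 SLAB_MEM_SPREAD | SLAB_ACCOUNT),
					init_once);
	if (ext4_inode_cachep == NULL)
		return -ENOMEM;
	return 0;
}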

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: memory cgroup pagecache and inode problem
  2019-01-04  4:44 ` memory cgroup pagecache and inode problem Fam Zheng
@ 2019-01-04  5:00   ` Yang Shi
  2019-01-04  5:12     ` Fam Zheng
  0 siblings, 1 reply; 28+ messages in thread
From: Yang Shi @ 2019-01-04  5:00 UTC (permalink / raw)
  To: Fam Zheng
  Cc: cgroups, Linux MM, tj, Johannes Weiner, lizefan, Michal Hocko,
	Vladimir Davydov, duanxiongchun, 张永肃

On Thu, Jan 3, 2019 at 8:45 PM Fam Zheng <zhengfeiran@bytedance.com> wrote:
>
> Fixing the mm list address. Sorry for the noise.
>
> Fam
>
>
> > On Jan 4, 2019, at 12:43, Fam Zheng <zhengfeiran@bytedance.com> wrote:
> >
> > Hi,
> >
> > In our server which frequently spawns containers, we find that if a process used pagecache in memory cgroup, after the process exits and memory cgroup is offlined, because the pagecache is still charged in this memory cgroup, this memory cgroup will not be destroyed until the pagecaches are dropped. This brings huge memory stress over time. We find that over one hundred thounsand such offlined memory cgroup in system hold too much memory (~100G). This memory can not be released immediately even after all associated pagecahes are released, because those memory cgroups are destroy asynchronously by a kworker. In some cases this can cause oom, since the synchronous memory allocation failed.
> >

Does force_empty help out your use case? You can write to
memory.force_empty to reclaim as much memory as possible before
rmdir'ing the memcg. This would prevent page cache from accumulating.

BTW, this is cgroup v1 only; I'm working on a patch to bring this back
into v2, as discussed in https://lkml.org/lkml/2019/1/3/484.
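
A minimal userspace sketch of that sequence, assuming cgroup v1 is
mounted at /sys/fs/cgroup/memory; the path layout, helper name and error
handling here are illustrative only:

#include <fcntl.h>
#include <limits.h>
#include <stdio.h>
#include <unistd.h>

/* Illustrative: ask the kernel to reclaim everything charged to the
 * cgroup, then remove the (now hopefully empty) cgroup directory. */
static int force_empty_and_remove(const char *cg)
{
	char path[PATH_MAX];
	int fd;

	snprintf(path, sizeof(path),
		 "/sys/fs/cgroup/memory/%s/memory.force_empty", cg);
	fd = open(path, O_WRONLY);
	if (fd < 0)
		return -1;
	if (write(fd, "1", 1) < 0) {
		close(fd);
		return -1;
	}
	close(fd);

	snprintf(path, sizeof(path), "/sys/fs/cgroup/memory/%s", cg);
	return rmdir(path);
}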

> > We think a fix is to create a kworker that scans all pagecaches and dentry caches etc. in the background, if a referenced memory cgroup is offline, try to drop the cache or move it to the parent cgroup. This kworker can wake up periodically, or upon memory cgroup offline event (or both).

Reparenting has been deprecated for a long time. I don't think we want
to bring it back. Actually, css offline is handled by a kworker now. I
proposed a patch to do force_empty in that kworker; please see
https://lkml.org/lkml/2019/1/2/377.

> >
> > There is a similar problem in inode. After digging in ext4 code, we find that when creating inode cache, SLAB_ACCOUNT is used. In this case, inode will alloc in slab which belongs to the current memory cgroup. After this memory cgroup goes offline, this inode may be held by a dentry cache. If another process uses the same file. this inode will be held by that process, preventing the previous memory cgroup from being destroyed until this other process closes the file and drops the dentry cache.

I'm not sure if you really need kmem charging. If not, you may try
cgroup.memory=nokmem.

Regards,
Yang

> >
> > We still don't have a reasonable way to fix this.
> >
> > Ideas?
>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: memory cgroup pagecache and inode problem
  2019-01-04  5:00   ` Yang Shi
@ 2019-01-04  5:12     ` Fam Zheng
  2019-01-04 19:36       ` Yang Shi
  0 siblings, 1 reply; 28+ messages in thread
From: Fam Zheng @ 2019-01-04  5:12 UTC (permalink / raw)
  To: Yang Shi
  Cc: cgroups, Linux MM, tj, Johannes Weiner, lizefan, Michal Hocko,
	Vladimir Davydov, duanxiongchun, 张永肃,
	liuxiaozhou




> On Jan 4, 2019, at 13:00, Yang Shi <shy828301@gmail.com> wrote:
> 
> On Thu, Jan 3, 2019 at 8:45 PM Fam Zheng <zhengfeiran@bytedance.com> wrote:
>> 
>> Fixing the mm list address. Sorry for the noise.
>> 
>> Fam
>> 
>> 
>>> On Jan 4, 2019, at 12:43, Fam Zheng <zhengfeiran@bytedance.com> wrote:
>>> 
>>> Hi,
>>> 
>>> In our server which frequently spawns containers, we find that if a process used pagecache in memory cgroup, after the process exits and memory cgroup is offlined, because the pagecache is still charged in this memory cgroup, this memory cgroup will not be destroyed until the pagecaches are dropped. This brings huge memory stress over time. We find that over one hundred thounsand such offlined memory cgroup in system hold too much memory (~100G). This memory can not be released immediately even after all associated pagecahes are released, because those memory cgroups are destroy asynchronously by a kworker. In some cases this can cause oom, since the synchronous memory allocation failed.
>>> 
> 
> Does force_empty help out your usecase? You can write to
> memory.force_empty to reclaim as much as possible memory before
> rmdir'ing memcg. This would prevent from page cache accumulating.

Hmm, this might be an option. FWIW we have been using drop_caches as a workaround.

> 
> BTW, this is cgroup v1 only, I'm working on a patch to bring this back
> into v2 as discussed in https://lkml.org/lkml/2019/1/3/484.
> 
>>> We think a fix is to create a kworker that scans all pagecaches and dentry caches etc. in the background, if a referenced memory cgroup is offline, try to drop the cache or move it to the parent cgroup. This kworker can wake up periodically, or upon memory cgroup offline event (or both).
> 
> Reparenting has been deprecated for a long time. I don't think we want
> to bring it back. Actually, css offline is handled by kworker now. I
> proposed a patch to do force_empty in kworker, please see
> https://lkml.org/lkml/2019/1/2/377.

Could you elaborate a bit on why reparenting is not a good idea?

> 
>>> 
>>> There is a similar problem in inode. After digging in ext4 code, we find that when creating inode cache, SLAB_ACCOUNT is used. In this case, inode will alloc in slab which belongs to the current memory cgroup. After this memory cgroup goes offline, this inode may be held by a dentry cache. If another process uses the same file. this inode will be held by that process, preventing the previous memory cgroup from being destroyed until this other process closes the file and drops the dentry cache.
> 
> I'm not sure if you really need kmem charge. If not, you may try
> cgroup.memory=nokmem.

A very good hint, we’ll investigate, thanks!

Fam

> 
> Regards,
> Yang
> 
>>> 
>>> We still don't have a reasonable way to fix this.
>>> 
>>> Ideas?



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: memory cgroup pagecache and inode problem
       [not found] <15614FDC-198E-449B-BFAF-B00D6EF61155@bytedance.com>
  2019-01-04  4:44 ` memory cgroup pagecache and inode problem Fam Zheng
@ 2019-01-04  9:04 ` Michal Hocko
  2019-01-04 10:02   ` Fam Zheng
  1 sibling, 1 reply; 28+ messages in thread
From: Michal Hocko @ 2019-01-04  9:04 UTC (permalink / raw)
  To: Fam Zheng
  Cc: cgroups, linux-mm, tj, hannes, lizefan, vdavydov.dev,
	duanxiongchun, 张永肃

On Fri 04-01-19 12:43:40, Fam Zheng wrote:
> Hi,
> 
> In our server which frequently spawns containers, we find that if a
> process used pagecache in memory cgroup, after the process exits and
> memory cgroup is offlined, because the pagecache is still charged in
> this memory cgroup, this memory cgroup will not be destroyed until the
> pagecaches are dropped. This brings huge memory stress over time. We
> find that over one hundred thounsand such offlined memory cgroup in
> system hold too much memory (~100G). This memory can not be released
> immediately even after all associated pagecahes are released, because
> those memory cgroups are destroy asynchronously by a kworker. In some
> cases this can cause oom, since the synchronous memory allocation
> failed.

You are right that an offline memcg keeps memory behind and expects
kswapd or direct reclaim to prune that memory on demand. Do you have
any examples of when this would cause extreme memory stress, though? For
example, high direct reclaim activity that would be a result of these
offline memcgs? You are mentioning OOM, which is even more unexpected.
I haven't seen such disruptive behavior.

> We think a fix is to create a kworker that scans all pagecaches and
> dentry caches etc. in the background, if a referenced memory cgroup is
> offline, try to drop the cache or move it to the parent cgroup. This
> kworker can wake up periodically, or upon memory cgroup offline event
> (or both).

We do that from the kswapd context already. I do not think we need
another kworker.

Another option might be to enforce the reclaim on the offline path.
We are discussing a similar issue with Yang Shi
http://lkml.kernel.org/r/1546459533-36247-1-git-send-email-yang.shi@linux.alibaba.com

> There is a similar problem in inode. After digging in ext4 code, we
> find that when creating inode cache, SLAB_ACCOUNT is used. In this
> case, inode will alloc in slab which belongs to the current memory
> cgroup. After this memory cgroup goes offline, this inode may be held
> by a dentry cache. If another process uses the same file. this inode
> will be held by that process, preventing the previous memory cgroup
> from being destroyed until this other process closes the file and
> drops the dentry cache.

This is a natural side effect of shared memory, I am afraid. Isolated
memory cgroups should limit any shared resources to a bare minimum. You
will get "who touches first gets charged" behavior otherwise, and that is
not really deterministic.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: memory cgroup pagecache and inode problem
  2019-01-04  9:04 ` Michal Hocko
@ 2019-01-04 10:02   ` Fam Zheng
  2019-01-04 10:12     ` Michal Hocko
  0 siblings, 1 reply; 28+ messages in thread
From: Fam Zheng @ 2019-01-04 10:02 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Fam Zheng, cgroups, Linux MM, tj, Johannes Weiner, lizefan,
	Vladimir Davydov, duanxiongchun, 张永肃



> On Jan 4, 2019, at 17:04, Michal Hocko <mhocko@kernel.org> wrote:
> 
> This is a natural side effect of shared memory, I am afraid. Isolated
> memory cgroups should limit any shared resources to bare minimum. You
> will get "who touches first gets charged" behavior otherwise and that is
> not really deterministic.

I don’t quite understand your comment. I think the current behavior for the ext4_inode_cachep slab family is just “who touches first gets charged”, and later users of the same file from a different mem cgroup can benefit from the cache and keep it from being released, but don’t get charged.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: memory cgroup pagecache and inode problem
  2019-01-04 10:02   ` Fam Zheng
@ 2019-01-04 10:12     ` Michal Hocko
  2019-01-04 10:35       ` Fam Zheng
  0 siblings, 1 reply; 28+ messages in thread
From: Michal Hocko @ 2019-01-04 10:12 UTC (permalink / raw)
  To: Fam Zheng
  Cc: cgroups, Linux MM, tj, Johannes Weiner, lizefan,
	Vladimir Davydov, duanxiongchun, 张永肃

On Fri 04-01-19 18:02:19, Fam Zheng wrote:
> 
> 
> > On Jan 4, 2019, at 17:04, Michal Hocko <mhocko@kernel.org> wrote:
> > 
> > This is a natural side effect of shared memory, I am afraid. Isolated
> > memory cgroups should limit any shared resources to bare minimum. You
> > will get "who touches first gets charged" behavior otherwise and that is
> > not really deterministic.
> 
> I don’t quite understand your comment. I think the current behavior
> for the ext4_inode_cachep slab family is just “who touches first
> gets charged”, and later users of the same file from a different mem
> cgroup can benefit from the cache, keep it from being released, but
> doesn’t get charged.

Yes, this is exactly what I've said. And that leads to non-deterministic
behavior because users from other memcgs are keeping charges alive and
the isolation really doesn't work properly. Think of it as using memory
on behalf of another party that is supposed to be isolated from you.

Sure, this can work reasonably well if the sharing is not really
predominant.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: memory cgroup pagecache and inode problem
  2019-01-04 10:12     ` Michal Hocko
@ 2019-01-04 10:35       ` Fam Zheng
  0 siblings, 0 replies; 28+ messages in thread
From: Fam Zheng @ 2019-01-04 10:35 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Fam Zheng, cgroups, Linux MM, tj, Johannes Weiner, lizefan,
	Vladimir Davydov, duanxiongchun, 张永肃



> On Jan 4, 2019, at 18:12, Michal Hocko <mhocko@kernel.org> wrote:
> 
> On Fri 04-01-19 18:02:19, Fam Zheng wrote:
>> 
>> 
>>> On Jan 4, 2019, at 17:04, Michal Hocko <mhocko@kernel.org> wrote:
>>> 
>>> This is a natural side effect of shared memory, I am afraid. Isolated
>>> memory cgroups should limit any shared resources to bare minimum. You
>>> will get "who touches first gets charged" behavior otherwise and that is
>>> not really deterministic.
>> 
>> I don’t quite understand your comment. I think the current behavior
>> for the ext4_inode_cachep slab family is just “who touches first
>> gets charged”, and later users of the same file from a different mem
>> cgroup can benefit from the cache, keep it from being released, but
>> doesn’t get charged.
> 
> Yes, this is exactly what I've said. And that leads to non-deterministic
> behavior because users from other memcgs are keeping charges alive and
> the isolation really doesn't work properly. Think of it as using memory
> on behalf of other party that is supposed to be isolated from you.
> 
> Sure this can work reasonably well if the sharing is not really
> predominated.

OK, I see what you mean. The reality is that applications want to share files (e.g. docker run -v ...), and IMO charging accuracy is not the real trouble here. The problem is that there is memory usage which is not strictly necessary once a mem cgroup is deleted, such as the biggish struct mem_cgroup and the shadow slabs from which we no longer allocate new objects.

Fam

> -- 
> Michal Hocko
> SUSE Labs

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: memory cgroup pagecache and inode problem
  2019-01-04  5:12     ` Fam Zheng
@ 2019-01-04 19:36       ` Yang Shi
  2019-01-07  5:10         ` Fam Zheng
  0 siblings, 1 reply; 28+ messages in thread
From: Yang Shi @ 2019-01-04 19:36 UTC (permalink / raw)
  To: Fam Zheng
  Cc: cgroups, Linux MM, tj, Johannes Weiner, lizefan, Michal Hocko,
	Vladimir Davydov, duanxiongchun, 张永肃,
	liuxiaozhou

On Thu, Jan 3, 2019 at 9:12 PM Fam Zheng <zhengfeiran@bytedance.com> wrote:
>
>
>
> On Jan 4, 2019, at 13:00, Yang Shi <shy828301@gmail.com> wrote:
>
> On Thu, Jan 3, 2019 at 8:45 PM Fam Zheng <zhengfeiran@bytedance.com> wrote:
>
>
> Fixing the mm list address. Sorry for the noise.
>
> Fam
>
>
> On Jan 4, 2019, at 12:43, Fam Zheng <zhengfeiran@bytedance.com> wrote:
>
> Hi,
>
> In our server which frequently spawns containers, we find that if a process used pagecache in memory cgroup, after the process exits and memory cgroup is offlined, because the pagecache is still charged in this memory cgroup, this memory cgroup will not be destroyed until the pagecaches are dropped. This brings huge memory stress over time. We find that over one hundred thounsand such offlined memory cgroup in system hold too much memory (~100G). This memory can not be released immediately even after all associated pagecahes are released, because those memory cgroups are destroy asynchronously by a kworker. In some cases this can cause oom, since the synchronous memory allocation failed.
>
>
> Does force_empty help out your usecase? You can write to
> memory.force_empty to reclaim as much as possible memory before
> rmdir'ing memcg. This would prevent from page cache accumulating.
>
>
> Hmm, this might be an option. FWIW we have been using drop_caches to workaround.

drop_caches would drop all page caches globally. You may not want to
drop the page caches used by other memcgs.

>
>
> BTW, this is cgroup v1 only, I'm working on a patch to bring this back
> into v2 as discussed in https://lkml.org/lkml/2019/1/3/484.
>
> We think a fix is to create a kworker that scans all pagecaches and dentry caches etc. in the background, if a referenced memory cgroup is offline, try to drop the cache or move it to the parent cgroup. This kworker can wake up periodically, or upon memory cgroup offline event (or both).
>
>
> Reparenting has been deprecated for a long time. I don't think we want
> to bring it back. Actually, css offline is handled by kworker now. I
> proposed a patch to do force_empty in kworker, please see
> https://lkml.org/lkml/2019/1/2/377.
>
>
> Could you elaborate a bit about why reparenting is not a good idea?

AFAIK, reparenting may cause some tricky race conditions. Since we can
iterate offline memcgs now, the memory charged to an offline memcg
can get reclaimed when memory pressure happens.

Johannes and Michal would know more about the background than me.

Yang

>
>
>
> There is a similar problem in inode. After digging in ext4 code, we find that when creating inode cache, SLAB_ACCOUNT is used. In this case, inode will alloc in slab which belongs to the current memory cgroup. After this memory cgroup goes offline, this inode may be held by a dentry cache. If another process uses the same file. this inode will be held by that process, preventing the previous memory cgroup from being destroyed until this other process closes the file and drops the dentry cache.
>
>
> I'm not sure if you really need kmem charge. If not, you may try
> cgroup.memory=nokmem.
>
>
> A very good hint, we’ll investigate, thanks!
>
> Fam
>
>
> Regards,
> Yang
>
>
> We still don't have a reasonable way to fix this.
>
> Ideas?
>
>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: memory cgroup pagecache and inode problem
  2019-01-04 19:36       ` Yang Shi
@ 2019-01-07  5:10         ` Fam Zheng
  2019-01-07  8:53           ` Michal Hocko
  2019-01-10  5:36           ` Yang Shi
  0 siblings, 2 replies; 28+ messages in thread
From: Fam Zheng @ 2019-01-07  5:10 UTC (permalink / raw)
  To: Yang Shi
  Cc: Fam Zheng, cgroups, Linux MM, tj, Johannes Weiner, lizefan,
	Michal Hocko, Vladimir Davydov, duanxiongchun,
	张永肃,
	liuxiaozhou



> On Jan 5, 2019, at 03:36, Yang Shi <shy828301@gmail.com> wrote:
> 
> 
> drop_caches would drop all page caches globally. You may not want to
> drop the page caches used by other memcgs.

We’ve tried your async force_empty patch (with a modification to default it to true, to make it transparently enabled for the sake of testing), and over the past few days the stale mem cgroups have still accumulated, up to 40k.

We’ve double checked that the force_empty routines are invoked when a mem cgroup is offlined, but this doesn’t look very effective so far, because once we do `echo 1 > /proc/sys/vm/drop_caches`, all the groups immediately go away.

This is a bit unexpected.

Yang, could you hint at what is missing in the force_empty operation, compared to a blanket drop_caches?

Fam

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: memory cgroup pagecache and inode problem
  2019-01-07  5:10         ` Fam Zheng
@ 2019-01-07  8:53           ` Michal Hocko
  2019-01-07  9:01             ` Fam Zheng
  2019-01-10  5:36           ` Yang Shi
  1 sibling, 1 reply; 28+ messages in thread
From: Michal Hocko @ 2019-01-07  8:53 UTC (permalink / raw)
  To: Fam Zheng
  Cc: Yang Shi, cgroups, Linux MM, tj, Johannes Weiner, lizefan,
	Vladimir Davydov, duanxiongchun, 张永肃,
	liuxiaozhou

On Mon 07-01-19 13:10:17, Fam Zheng wrote:
> 
> 
> > On Jan 5, 2019, at 03:36, Yang Shi <shy828301@gmail.com> wrote:
> > 
> > 
> > drop_caches would drop all page caches globally. You may not want to
> > drop the page caches used by other memcgs.
> 
> We’ve tried your async force_empty patch (with a modification to default it to true to make it transparently enabled for the sake of testing), and for the past few days the stale mem cgroups still accumulate, up to 40k.
> 
> We’ve double checked that the force_empty routines are invoked when a mem cgroup is offlined. But this doesn’t look very effective so far. Because, once we do `echo 1 > /proc/sys/vm/drop_caches`, all the groups immediately go away.
> 
> This is a bit unexpected.
> 
> Yang, could you hint what are missing in the force_empty operation, compared to a blanket drop cache?

I would suspect that not all slab pages holding dentries and inodes got
reclaimed during the slab shrinking invoked by the direct reclaim
triggered by force emptying.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: memory cgroup pagecache and inode problem
  2019-01-07  8:53           ` Michal Hocko
@ 2019-01-07  9:01             ` Fam Zheng
  2019-01-07  9:13               ` Michal Hocko
  2019-01-09  4:33               ` Fam Zheng
  0 siblings, 2 replies; 28+ messages in thread
From: Fam Zheng @ 2019-01-07  9:01 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Fam Zheng, Yang Shi, cgroups, Linux MM, tj, Johannes Weiner,
	lizefan, Vladimir Davydov, duanxiongchun,
	张永肃,
	liuxiaozhou



> On Jan 7, 2019, at 16:53, Michal Hocko <mhocko@kernel.org> wrote:
> 
> On Mon 07-01-19 13:10:17, Fam Zheng wrote:
>> 
>> 
>>> On Jan 5, 2019, at 03:36, Yang Shi <shy828301@gmail.com> wrote:
>>> 
>>> 
>>> drop_caches would drop all page caches globally. You may not want to
>>> drop the page caches used by other memcgs.
>> 
>> We’ve tried your async force_empty patch (with a modification to default it to true to make it transparently enabled for the sake of testing), and for the past few days the stale mem cgroups still accumulate, up to 40k.
>> 
>> We’ve double checked that the force_empty routines are invoked when a mem cgroup is offlined. But this doesn’t look very effective so far. Because, once we do `echo 1 > /proc/sys/vm/drop_caches`, all the groups immediately go away.
>> 
>> This is a bit unexpected.
>> 
>> Yang, could you hint what are missing in the force_empty operation, compared to a blanket drop cache?
> 
> I would suspect that not all slab pages holding dentries and inodes got
> reclaimed during the slab shrinking inoked by the direct reclaimed
> triggered by force emptying.

I don’t think so, we’ve ensured cgroup.memory=nokmem,nosocket first, as confirmed by the result of the ‘echo 1’ command. It’s not slabs but page caches that are holding the mem cgroups.

It might well be that we’re missing 68600f623d6, though. We’ll check it.

Thanks,

Fam

> -- 
> Michal Hocko
> SUSE Labs

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: memory cgroup pagecache and inode problem
  2019-01-07  9:01             ` Fam Zheng
@ 2019-01-07  9:13               ` Michal Hocko
  2019-01-09  4:33               ` Fam Zheng
  1 sibling, 0 replies; 28+ messages in thread
From: Michal Hocko @ 2019-01-07  9:13 UTC (permalink / raw)
  To: Fam Zheng
  Cc: Yang Shi, cgroups, Linux MM, tj, Johannes Weiner, lizefan,
	Vladimir Davydov, duanxiongchun, 张永肃,
	liuxiaozhou

On Mon 07-01-19 17:01:27, Fam Zheng wrote:
> 
> 
> > On Jan 7, 2019, at 16:53, Michal Hocko <mhocko@kernel.org> wrote:
> > 
> > On Mon 07-01-19 13:10:17, Fam Zheng wrote:
> >> 
> >> 
> >>> On Jan 5, 2019, at 03:36, Yang Shi <shy828301@gmail.com> wrote:
> >>> 
> >>> 
> >>> drop_caches would drop all page caches globally. You may not want to
> >>> drop the page caches used by other memcgs.
> >> 
> >> We’ve tried your async force_empty patch (with a modification to default it to true to make it transparently enabled for the sake of testing), and for the past few days the stale mem cgroups still accumulate, up to 40k.
> >> 
> >> We’ve double checked that the force_empty routines are invoked when a mem cgroup is offlined. But this doesn’t look very effective so far. Because, once we do `echo 1 > /proc/sys/vm/drop_caches`, all the groups immediately go away.
> >> 
> >> This is a bit unexpected.
> >> 
> >> Yang, could you hint what are missing in the force_empty operation, compared to a blanket drop cache?
> > 
> > I would suspect that not all slab pages holding dentries and inodes got
> > reclaimed during the slab shrinking inoked by the direct reclaimed
> > triggered by force emptying.
> 
> I don’t think so, we’ve ensured cgroup.memory=nokmem,nosocket
> first, as observed with the result of ‘echo 1’ command. It’s not
> slabs but the page caches holding mem cgroups.

I see

> It might well be that we’ve missing 68600f623d6, though. We’ll check it.

This might be possible. drop_caches doesn't use this path so the
rounding error will not be possible.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: memory cgroup pagecache and inode problem
  2019-01-07  9:01             ` Fam Zheng
  2019-01-07  9:13               ` Michal Hocko
@ 2019-01-09  4:33               ` Fam Zheng
  1 sibling, 0 replies; 28+ messages in thread
From: Fam Zheng @ 2019-01-09  4:33 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Fam Zheng, Yang Shi, cgroups, Linux MM, tj, Johannes Weiner,
	lizefan, Vladimir Davydov, duanxiongchun,
	张永肃,
	liuxiaozhou



> On Jan 7, 2019, at 17:01, Fam Zheng <zhengfeiran@bytedance.com> wrote:
> 
> 
> 
>> On Jan 7, 2019, at 16:53, Michal Hocko <mhocko@kernel.org> wrote:
>> 
>> On Mon 07-01-19 13:10:17, Fam Zheng wrote:
>>> 
>>> 
>>>> On Jan 5, 2019, at 03:36, Yang Shi <shy828301@gmail.com> wrote:
>>>> 
>>>> 
>>>> drop_caches would drop all page caches globally. You may not want to
>>>> drop the page caches used by other memcgs.
>>> 
>>> We’ve tried your async force_empty patch (with a modification to default it to true to make it transparently enabled for the sake of testing), and for the past few days the stale mem cgroups still accumulate, up to 40k.
>>> 
>>> We’ve double checked that the force_empty routines are invoked when a mem cgroup is offlined. But this doesn’t look very effective so far. Because, once we do `echo 1 > /proc/sys/vm/drop_caches`, all the groups immediately go away.
>>> 
>>> This is a bit unexpected.
>>> 
>>> Yang, could you hint what are missing in the force_empty operation, compared to a blanket drop cache?
>> 
>> I would suspect that not all slab pages holding dentries and inodes got
>> reclaimed during the slab shrinking inoked by the direct reclaimed
>> triggered by force emptying.
> 
> I don’t think so, we’ve ensured cgroup.memory=nokmem,nosocket first, as observed with the result of ‘echo 1’ command. It’s not slabs but the page caches holding mem cgroups.
> 
> It might well be that we’ve missing 68600f623d6, though. We’ll check it.

Just a follow-up: We’ve applied 68600f623d6 to 4.14, but it didn’t make a difference.

Fam

> 
> Thanks,
> 
> Fam
> 
>> -- 
>> Michal Hocko
>> SUSE Labs

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: memory cgroup pagecache and inode problem
  2019-01-07  5:10         ` Fam Zheng
  2019-01-07  8:53           ` Michal Hocko
@ 2019-01-10  5:36           ` Yang Shi
  2019-01-10  8:30             ` Fam Zheng
  1 sibling, 1 reply; 28+ messages in thread
From: Yang Shi @ 2019-01-10  5:36 UTC (permalink / raw)
  To: Fam Zheng
  Cc: cgroups, Linux MM, tj, Johannes Weiner, lizefan, Michal Hocko,
	Vladimir Davydov, duanxiongchun, 张永肃,
	liuxiaozhou

On Sun, Jan 6, 2019 at 9:10 PM Fam Zheng <zhengfeiran@bytedance.com> wrote:
>
>
>
> > On Jan 5, 2019, at 03:36, Yang Shi <shy828301@gmail.com> wrote:
> >
> >
> > drop_caches would drop all page caches globally. You may not want to
> > drop the page caches used by other memcgs.
>
> We’ve tried your async force_empty patch (with a modification to default it to true to make it transparently enabled for the sake of testing), and for the past few days the stale mem cgroups still accumulate, up to 40k.
>
> We’ve double checked that the force_empty routines are invoked when a mem cgroup is offlined. But this doesn’t look very effective so far. Because, once we do `echo 1 > /proc/sys/vm/drop_caches`, all the groups immediately go away.
>
> This is a bit unexpected.
>
> Yang, could you hint what are missing in the force_empty operation, compared to a blanket drop cache?

Drop caches invalidates pages inode by inode, but memcg
force_empty calls memcg direct reclaim.
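
A simplified sketch of what the drop_caches side does, paraphrasing
drop_pagecache_sb() in fs/drop_caches.c with inode refcounting and
locking omitted: it walks every inode of a superblock and invalidates
its clean page cache directly, regardless of which memcg the pages are
charged to.

static void drop_pagecache_sb(struct super_block *sb, void *unused)
{
	struct inode *inode;

	/* Walk all inodes of this superblock (the real code takes
	 * s_inode_list_lock and pins each inode while working on it). */
	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
		if (inode->i_state & (I_FREEING | I_WILL_FREE | I_NEW))
			continue;
		if (inode->i_mapping->nrpages == 0)
			continue;
		/* Drop clean page cache no matter which memcg the
		 * pages are charged to. */
		invalidate_mapping_pages(inode->i_mapping, 0, -1);
	}
}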

Offlined memcgs will not go away if there are still pages charged. Maybe
it is related to the per-cpu memcg stock. I recall there are some commits
which solve the per-cpu page counter cache problem.

591edfb10a94 mm: drain memcg stocks on css offlining
d12c60f64cf8 mm: memcontrol: drain memcg stock on force_empty
bb4a7ea2b144 mm: memcontrol: drain stocks on resize limit

Not sure if they would help out.

Yang

>
> Fam

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: memory cgroup pagecache and inode problem
  2019-01-10  5:36           ` Yang Shi
@ 2019-01-10  8:30             ` Fam Zheng
  2019-01-10  8:41               ` Michal Hocko
  2019-01-16  0:50               ` Yang Shi
  0 siblings, 2 replies; 28+ messages in thread
From: Fam Zheng @ 2019-01-10  8:30 UTC (permalink / raw)
  To: Yang Shi
  Cc: Fam Zheng, cgroups, Linux MM, tj, Johannes Weiner, lizefan,
	Michal Hocko, Vladimir Davydov, duanxiongchun,
	张永肃,
	liuxiaozhou



> On Jan 10, 2019, at 13:36, Yang Shi <shy828301@gmail.com> wrote:
> 
> On Sun, Jan 6, 2019 at 9:10 PM Fam Zheng <zhengfeiran@bytedance.com> wrote:
>> 
>> 
>> 
>>> On Jan 5, 2019, at 03:36, Yang Shi <shy828301@gmail.com> wrote:
>>> 
>>> 
>>> drop_caches would drop all page caches globally. You may not want to
>>> drop the page caches used by other memcgs.
>> 
>> We’ve tried your async force_empty patch (with a modification to default it to true to make it transparently enabled for the sake of testing), and for the past few days the stale mem cgroups still accumulate, up to 40k.
>> 
>> We’ve double checked that the force_empty routines are invoked when a mem cgroup is offlined. But this doesn’t look very effective so far. Because, once we do `echo 1 > /proc/sys/vm/drop_caches`, all the groups immediately go away.
>> 
>> This is a bit unexpected.
>> 
>> Yang, could you hint what are missing in the force_empty operation, compared to a blanket drop cache?
> 
> Drop caches does invalidate pages inode by inode. But, memcg
> force_empty does call memcg direct reclaim.

But force_empty touches things that drop_caches doesn’t? If so, then maybe combining both approaches is more reliable. Since, like you said, dropping _all_ pages is usually too much and thus not desired, we may want to somehow limit the dropped caches to those that are in the memory cgroup in question. What do you think?


> 
> Offlined memcgs will not go away if there is still page charged. Maybe
> relate to per cpu memcg stock. I recall there are some commits which
> do solve the per cpu page counter cache problem.
> 
> 591edfb10a94 mm: drain memcg stocks on css offlining
> d12c60f64cf8 mm: memcontrol: drain memcg stock on force_empty
> bb4a7ea2b144 mm: memcontrol: drain stocks on resize limit
> 
> Not sure if they would help out.

These are all in 4.20, which we tested, but they were not helpful.

Fam

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: memory cgroup pagecache and inode problem
  2019-01-10  8:30             ` Fam Zheng
@ 2019-01-10  8:41               ` Michal Hocko
  2019-01-16  0:50               ` Yang Shi
  1 sibling, 0 replies; 28+ messages in thread
From: Michal Hocko @ 2019-01-10  8:41 UTC (permalink / raw)
  To: Fam Zheng
  Cc: Yang Shi, cgroups, Linux MM, tj, Johannes Weiner, lizefan,
	Vladimir Davydov, duanxiongchun, 张永肃,
	liuxiaozhou

On Thu 10-01-19 16:30:42, Fam Zheng wrote:
[...]
> > 591edfb10a94 mm: drain memcg stocks on css offlining
> > d12c60f64cf8 mm: memcontrol: drain memcg stock on force_empty
> > bb4a7ea2b144 mm: memcontrol: drain stocks on resize limit
> > 
> > Not sure if they would help out.
> 
> These are all in 4.20, which is tested but not helpful.

I would recommend enabling the vmscan tracepoints to see what is going on
in your case.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: memory cgroup pagecache and inode problem
  2019-01-10  8:30             ` Fam Zheng
  2019-01-10  8:41               ` Michal Hocko
@ 2019-01-16  0:50               ` Yang Shi
  2019-01-16  3:52                 ` Fam Zheng
  1 sibling, 1 reply; 28+ messages in thread
From: Yang Shi @ 2019-01-16  0:50 UTC (permalink / raw)
  To: Fam Zheng
  Cc: cgroups, Linux MM, tj, Johannes Weiner, lizefan, Michal Hocko,
	Vladimir Davydov, duanxiongchun, 张永肃,
	liuxiaozhou

On Thu, Jan 10, 2019 at 12:30 AM Fam Zheng <zhengfeiran@bytedance.com> wrote:
>
>
>
> > On Jan 10, 2019, at 13:36, Yang Shi <shy828301@gmail.com> wrote:
> >
> > On Sun, Jan 6, 2019 at 9:10 PM Fam Zheng <zhengfeiran@bytedance.com> wrote:
> >>
> >>
> >>
> >>> On Jan 5, 2019, at 03:36, Yang Shi <shy828301@gmail.com> wrote:
> >>>
> >>>
> >>> drop_caches would drop all page caches globally. You may not want to
> >>> drop the page caches used by other memcgs.
> >>
> >> We’ve tried your async force_empty patch (with a modification to default it to true to make it transparently enabled for the sake of testing), and for the past few days the stale mem cgroups still accumulate, up to 40k.
> >>
> >> We’ve double checked that the force_empty routines are invoked when a mem cgroup is offlined. But this doesn’t look very effective so far. Because, once we do `echo 1 > /proc/sys/vm/drop_caches`, all the groups immediately go away.
> >>
> >> This is a bit unexpected.
> >>
> >> Yang, could you hint what are missing in the force_empty operation, compared to a blanket drop cache?
> >
> > Drop caches does invalidate pages inode by inode. But, memcg
> > force_empty does call memcg direct reclaim.
>
> But force_empty touches things that drop_caches doesn’t? If so then maybe combining both approaches is more reliable. Since like you said,

AFAICS, force_empty may unmap pages, but drop_caches doesn't.

> dropping _all_ pages is usually too much thus not desired, we may want to somehow limit the dropped caches to those that are in the memory cgroup in question. What do you think?

This is what force_empty is supposed to do.  But, as your test shows,
some page cache may still remain after force_empty, which then causes
offline memcgs to accumulate.  I haven't figured out what happened.  You
may try what Michal suggested.

Yang

>
>
> >
> > Offlined memcgs will not go away if there is still page charged. Maybe
> > relate to per cpu memcg stock. I recall there are some commits which
> > do solve the per cpu page counter cache problem.
> >
> > 591edfb10a94 mm: drain memcg stocks on css offlining
> > d12c60f64cf8 mm: memcontrol: drain memcg stock on force_empty
> > bb4a7ea2b144 mm: memcontrol: drain stocks on resize limit
> >
> > Not sure if they would help out.
>
> These are all in 4.20, which is tested but not helpful.
>
> Fam
>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: memory cgroup pagecache and inode problem
  2019-01-16  0:50               ` Yang Shi
@ 2019-01-16  3:52                 ` Fam Zheng
  2019-01-16  7:06                   ` Michal Hocko
  2019-01-16 21:06                   ` Yang Shi
  0 siblings, 2 replies; 28+ messages in thread
From: Fam Zheng @ 2019-01-16  3:52 UTC (permalink / raw)
  To: Yang Shi
  Cc: Fam Zheng, cgroups, Linux MM, tj, Johannes Weiner, lizefan,
	Michal Hocko, Vladimir Davydov, duanxiongchun,
	张永肃,
	liuxiaozhou



> On Jan 16, 2019, at 08:50, Yang Shi <shy828301@gmail.com> wrote:
> 
> On Thu, Jan 10, 2019 at 12:30 AM Fam Zheng <zhengfeiran@bytedance.com> wrote:
>> 
>> 
>> 
>>> On Jan 10, 2019, at 13:36, Yang Shi <shy828301@gmail.com> wrote:
>>> 
>>> On Sun, Jan 6, 2019 at 9:10 PM Fam Zheng <zhengfeiran@bytedance.com> wrote:
>>>> 
>>>> 
>>>> 
>>>>> On Jan 5, 2019, at 03:36, Yang Shi <shy828301@gmail.com> wrote:
>>>>> 
>>>>> 
>>>>> drop_caches would drop all page caches globally. You may not want to
>>>>> drop the page caches used by other memcgs.
>>>> 
>>>> We’ve tried your async force_empty patch (with a modification to default it to true to make it transparently enabled for the sake of testing), and for the past few days the stale mem cgroups still accumulate, up to 40k.
>>>> 
>>>> We’ve double checked that the force_empty routines are invoked when a mem cgroup is offlined. But this doesn’t look very effective so far. Because, once we do `echo 1 > /proc/sys/vm/drop_caches`, all the groups immediately go away.
>>>> 
>>>> This is a bit unexpected.
>>>> 
>>>> Yang, could you hint what are missing in the force_empty operation, compared to a blanket drop cache?
>>> 
>>> Drop caches does invalidate pages inode by inode. But, memcg
>>> force_empty does call memcg direct reclaim.
>> 
>> But force_empty touches things that drop_caches doesn’t? If so then maybe combining both approaches is more reliable. Since like you said,
> 
> AFAICS, force_empty may unmap pages, but drop_caches doesn't.
> 
>> dropping _all_ pages is usually too much thus not desired, we may want to somehow limit the dropped caches to those that are in the memory cgroup in question. What do you think?
> 
> This is what force_empty is supposed to do.  But, as your test shows
> some page cache may still remain after force_empty, then cause offline
> memcgs accumulated.  I haven't figured out what happened.  You may try
> what Michal suggested.

None of the existing patches helped so far, but we suspect that the pages cannot be locked at the moment force_empty runs. We have been working on a “retry” patch which does solve the problem. We’ll do more tracing (to get a better understanding of the issue) and post the findings and/or the patch later. Thanks.

Fam

> 
> Yang
> 
>> 
>> 
>>> 
>>> Offlined memcgs will not go away if there is still page charged. Maybe
>>> relate to per cpu memcg stock. I recall there are some commits which
>>> do solve the per cpu page counter cache problem.
>>> 
>>> 591edfb10a94 mm: drain memcg stocks on css offlining
>>> d12c60f64cf8 mm: memcontrol: drain memcg stock on force_empty
>>> bb4a7ea2b144 mm: memcontrol: drain stocks on resize limit
>>> 
>>> Not sure if they would help out.
>> 
>> These are all in 4.20, which is tested but not helpful.
>> 
>> Fam

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: memory cgroup pagecache and inode problem
  2019-01-16  3:52                 ` Fam Zheng
@ 2019-01-16  7:06                   ` Michal Hocko
  2019-01-16 21:08                     ` Yang Shi
  2019-01-16 21:06                   ` Yang Shi
  1 sibling, 1 reply; 28+ messages in thread
From: Michal Hocko @ 2019-01-16  7:06 UTC (permalink / raw)
  To: Fam Zheng
  Cc: Yang Shi, cgroups, Linux MM, tj, Johannes Weiner, lizefan,
	Vladimir Davydov, duanxiongchun, 张永肃,
	liuxiaozhou

On Wed 16-01-19 11:52:08, Fam Zheng wrote:
[...]
> > This is what force_empty is supposed to do.  But, as your test shows
> > some page cache may still remain after force_empty, then cause offline
> > memcgs accumulated.  I haven't figured out what happened.  You may try
> > what Michal suggested.
> 
> None of the existing patches helped so far, but we suspect that the
> pages cannot be locked at the force_empty moment. We have being
> working on a “retry” patch which does solve the problem. We’ll
> do more tracing (to have a better understanding of the issue) and post
> the findings and/or the patch later. Thanks.

Just for the record, there was a patch to remove the
MEM_CGROUP_RECLAIM_RETRIES restriction in this path. I cannot find the
link right now, but that is something we certainly can do. The context is
interruptible by a signal and, from my experience, any retry count can
lead to unexpected failures. But I guess you really want to check the
vmscan tracepoints first to see why you cannot reclaim pages on the
memcg LRUs.
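
For context, the retry count being discussed is the one in the v1
force_empty path; a simplified paraphrase of mem_cgroup_force_empty()
in mm/memcontrol.c of that era (congestion waits and other details are
omitted, and the exact code differs between kernel versions):

static int mem_cgroup_force_empty(struct mem_cgroup *memcg)
{
	int nr_retries = MEM_CGROUP_RECLAIM_RETRIES;	/* 5 */

	/* Flush LRU pagevecs and per-cpu charge caches first. */
	lru_add_drain_all();
	drain_all_stock(memcg);

	/* Try to reclaim every charged page; give up after a few
	 * rounds that make no progress. */
	while (nr_retries && page_counter_read(&memcg->memory)) {
		if (signal_pending(current))
			return -EINTR;

		if (!try_to_free_mem_cgroup_pages(memcg, 1, GFP_KERNEL,
						  true))
			nr_retries--;
	}

	return 0;
}
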
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: memory cgroup pagecache and inode problem
  2019-01-16  3:52                 ` Fam Zheng
  2019-01-16  7:06                   ` Michal Hocko
@ 2019-01-16 21:06                   ` Yang Shi
  2019-01-17  2:41                     ` Fam Zheng
  1 sibling, 1 reply; 28+ messages in thread
From: Yang Shi @ 2019-01-16 21:06 UTC (permalink / raw)
  To: Fam Zheng
  Cc: cgroups, Linux MM, tj, Johannes Weiner, lizefan, Michal Hocko,
	Vladimir Davydov, duanxiongchun, 张永肃,
	liuxiaozhou

On Tue, Jan 15, 2019 at 7:52 PM Fam Zheng <zhengfeiran@bytedance.com> wrote:
>
>
>
> > On Jan 16, 2019, at 08:50, Yang Shi <shy828301@gmail.com> wrote:
> >
> > On Thu, Jan 10, 2019 at 12:30 AM Fam Zheng <zhengfeiran@bytedance.com> wrote:
> >>
> >>
> >>
> >>> On Jan 10, 2019, at 13:36, Yang Shi <shy828301@gmail.com> wrote:
> >>>
> >>> On Sun, Jan 6, 2019 at 9:10 PM Fam Zheng <zhengfeiran@bytedance.com> wrote:
> >>>>
> >>>>
> >>>>
> >>>>> On Jan 5, 2019, at 03:36, Yang Shi <shy828301@gmail.com> wrote:
> >>>>>
> >>>>>
> >>>>> drop_caches would drop all page caches globally. You may not want to
> >>>>> drop the page caches used by other memcgs.
> >>>>
> >>>> We’ve tried your async force_empty patch (with a modification to default it to true to make it transparently enabled for the sake of testing), and for the past few days the stale mem cgroups still accumulate, up to 40k.
> >>>>
> >>>> We’ve double checked that the force_empty routines are invoked when a mem cgroup is offlined. But this doesn’t look very effective so far. Because, once we do `echo 1 > /proc/sys/vm/drop_caches`, all the groups immediately go away.
> >>>>
> >>>> This is a bit unexpected.
> >>>>
> >>>> Yang, could you hint what are missing in the force_empty operation, compared to a blanket drop cache?
> >>>
> >>> Drop caches does invalidate pages inode by inode. But, memcg
> >>> force_empty does call memcg direct reclaim.
> >>
> >> But force_empty touches things that drop_caches doesn’t? If so then maybe combining both approaches is more reliable. Since like you said,
> >
> > AFAICS, force_empty may unmap pages, but drop_caches doesn't.
> >
> >> dropping _all_ pages is usually too much thus not desired, we may want to somehow limit the dropped caches to those that are in the memory cgroup in question. What do you think?
> >
> > This is what force_empty is supposed to do.  But, as your test shows
> > some page cache may still remain after force_empty, then cause offline
> > memcgs accumulated.  I haven't figured out what happened.  You may try
> > what Michal suggested.
>
> None of the existing patches helped so far, but we suspect that the pages cannot be locked at the force_empty moment. We have being working on a “retry” patch which does solve the problem. We’ll do more tracing (to have a better understanding of the issue) and post the findings and/or the patch later. Thanks.

You mean it solves the problem by retrying more times?  Actually, I'm
not sure if you have swap set up in your test, but force_empty does do
swap if swap is on. This may mean it can't reclaim all the page cache
in 5 retries.  I have a patch within that series to skip swap.

Yang

>
> Fam
>
> >
> > Yang
> >
> >>
> >>
> >>>
> >>> Offlined memcgs will not go away if there is still page charged. Maybe
> >>> relate to per cpu memcg stock. I recall there are some commits which
> >>> do solve the per cpu page counter cache problem.
> >>>
> >>> 591edfb10a94 mm: drain memcg stocks on css offlining
> >>> d12c60f64cf8 mm: memcontrol: drain memcg stock on force_empty
> >>> bb4a7ea2b144 mm: memcontrol: drain stocks on resize limit
> >>>
> >>> Not sure if they would help out.
> >>
> >> These are all in 4.20, which is tested but not helpful.
> >>
> >> Fam
>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: memory cgroup pagecache and inode problem
  2019-01-16  7:06                   ` Michal Hocko
@ 2019-01-16 21:08                     ` Yang Shi
  0 siblings, 0 replies; 28+ messages in thread
From: Yang Shi @ 2019-01-16 21:08 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Fam Zheng, cgroups, Linux MM, tj, Johannes Weiner, lizefan,
	Vladimir Davydov, duanxiongchun, 张永肃,
	liuxiaozhou

On Tue, Jan 15, 2019 at 11:06 PM Michal Hocko <mhocko@kernel.org> wrote:
>
> On Wed 16-01-19 11:52:08, Fam Zheng wrote:
> [...]
> > > This is what force_empty is supposed to do.  But, as your test shows
> > > some page cache may still remain after force_empty, then cause offline
> > > memcgs accumulated.  I haven't figured out what happened.  You may try
> > > what Michal suggested.
> >
> > None of the existing patches helped so far, but we suspect that the
> > pages cannot be locked at the force_empty moment. We have being
> > working on a “retry” patch which does solve the problem. We’ll
> > do more tracing (to have a better understanding of the issue) and post
> > the findings and/or the patch later. Thanks.
>
> Just for the record. There was a patch to remove
> MEM_CGROUP_RECLAIM_RETRIES restriction in the path. I cannot find the
> link right now but that is something we certainly can do. The context is
> interruptible by signal and it from my experience any retry count can

Do you mean this one https://lore.kernel.org/patchwork/patch/865835/ ?

I think removing retries is feasible as long as exit is handled correctly.

Yang

> lead to unexpected failures. But I guess you really want to check
> vmscan tracepoints to see why you cannot reclaim pages on memcg LRUs
> first.
> --
> Michal Hocko
> SUSE Labs

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: memory cgroup pagecache and inode problem
  2019-01-16 21:06                   ` Yang Shi
@ 2019-01-17  2:41                     ` Fam Zheng
  2019-01-17  5:06                       ` Yang Shi
  0 siblings, 1 reply; 28+ messages in thread
From: Fam Zheng @ 2019-01-17  2:41 UTC (permalink / raw)
  To: Yang Shi
  Cc: Fam Zheng, cgroups, Linux MM, tj, Johannes Weiner, lizefan,
	Michal Hocko, Vladimir Davydov, duanxiongchun,
	张永肃,
	liuxiaozhou



> On Jan 17, 2019, at 05:06, Yang Shi <shy828301@gmail.com> wrote:
> 
> On Tue, Jan 15, 2019 at 7:52 PM Fam Zheng <zhengfeiran@bytedance.com> wrote:
>> 
>> 
>> 
>>> On Jan 16, 2019, at 08:50, Yang Shi <shy828301@gmail.com> wrote:
>>> 
>>> On Thu, Jan 10, 2019 at 12:30 AM Fam Zheng <zhengfeiran@bytedance.com> wrote:
>>>> 
>>>> 
>>>> 
>>>>> On Jan 10, 2019, at 13:36, Yang Shi <shy828301@gmail.com> wrote:
>>>>> 
>>>>> On Sun, Jan 6, 2019 at 9:10 PM Fam Zheng <zhengfeiran@bytedance.com> wrote:
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>>> On Jan 5, 2019, at 03:36, Yang Shi <shy828301@gmail.com> wrote:
>>>>>>> 
>>>>>>> 
>>>>>>> drop_caches would drop all page caches globally. You may not want to
>>>>>>> drop the page caches used by other memcgs.
>>>>>> 
>>>>>> We’ve tried your async force_empty patch (with a modification to default it to true to make it transparently enabled for the sake of testing), and for the past few days the stale mem cgroups still accumulate, up to 40k.
>>>>>> 
>>>>>> We’ve double checked that the force_empty routines are invoked when a mem cgroup is offlined. But this doesn’t look very effective so far. Because, once we do `echo 1 > /proc/sys/vm/drop_caches`, all the groups immediately go away.
>>>>>> 
>>>>>> This is a bit unexpected.
>>>>>> 
>>>>>> Yang, could you hint what are missing in the force_empty operation, compared to a blanket drop cache?
>>>>> 
>>>>> Drop caches does invalidate pages inode by inode. But, memcg
>>>>> force_empty does call memcg direct reclaim.
>>>> 
>>>> But force_empty touches things that drop_caches doesn’t? If so then maybe combining both approaches is more reliable. Since like you said,
>>> 
>>> AFAICS, force_empty may unmap pages, but drop_caches doesn't.
>>> 
>>>> dropping _all_ pages is usually too much thus not desired, we may want to somehow limit the dropped caches to those that are in the memory cgroup in question. What do you think?
>>> 
>>> This is what force_empty is supposed to do.  But, as your test shows
>>> some page cache may still remain after force_empty, then cause offline
>>> memcgs accumulated.  I haven't figured out what happened.  You may try
>>> what Michal suggested.
>> 
>> None of the existing patches helped so far, but we suspect that the pages cannot be locked at the force_empty moment. We have being working on a “retry” patch which does solve the problem. We’ll do more tracing (to have a better understanding of the issue) and post the findings and/or the patch later. Thanks.
> 
> You mean it solves the problem by retrying more times?  Actually, I'm
> not sure if you have swap setup in your test, but force_empty does do
> swap if swap is on. This may cause it can't reclaim all the page cache
> in 5 retries.  I have a patch within that series to skip swap.

Basically yes, retrying solves the problem. But compared to immediate retries, a scheduled retry in a few seconds is much more effective.

We don’t have swap on.

What do you mean by 5 retries? I’m still a bit lost in the LRU code and patches.

> 
> Yang
> 
>> 
>> Fam
>> 
>>> 
>>> Yang
>>> 
>>>> 
>>>> 
>>>>> 
>>>>> Offlined memcgs will not go away if there is still page charged. Maybe
>>>>> relate to per cpu memcg stock. I recall there are some commits which
>>>>> do solve the per cpu page counter cache problem.
>>>>> 
>>>>> 591edfb10a94 mm: drain memcg stocks on css offlining
>>>>> d12c60f64cf8 mm: memcontrol: drain memcg stock on force_empty
>>>>> bb4a7ea2b144 mm: memcontrol: drain stocks on resize limit
>>>>> 
>>>>> Not sure if they would help out.
>>>> 
>>>> These are all in 4.20, which is tested but not helpful.
>>>> 
>>>> Fam

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: memory cgroup pagecache and inode problem
  2019-01-17  2:41                     ` Fam Zheng
@ 2019-01-17  5:06                       ` Yang Shi
  2019-01-19  3:17                         ` 段熊春
  2019-01-20 23:15                           ` Shakeel Butt
  0 siblings, 2 replies; 28+ messages in thread
From: Yang Shi @ 2019-01-17  5:06 UTC (permalink / raw)
  To: Fam Zheng
  Cc: cgroups, Linux MM, tj, Johannes Weiner, lizefan, Michal Hocko,
	Vladimir Davydov, duanxiongchun, 张永肃,
	liuxiaozhou

On Wed, Jan 16, 2019 at 6:41 PM Fam Zheng <zhengfeiran@bytedance.com> wrote:
>
>
>
> > On Jan 17, 2019, at 05:06, Yang Shi <shy828301@gmail.com> wrote:
> >
> > On Tue, Jan 15, 2019 at 7:52 PM Fam Zheng <zhengfeiran@bytedance.com> wrote:
> >>
> >>
> >>
> >>> On Jan 16, 2019, at 08:50, Yang Shi <shy828301@gmail.com> wrote:
> >>>
> >>> On Thu, Jan 10, 2019 at 12:30 AM Fam Zheng <zhengfeiran@bytedance.com> wrote:
> >>>>
> >>>>
> >>>>
> >>>>> On Jan 10, 2019, at 13:36, Yang Shi <shy828301@gmail.com> wrote:
> >>>>>
> >>>>> On Sun, Jan 6, 2019 at 9:10 PM Fam Zheng <zhengfeiran@bytedance.com> wrote:
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>> On Jan 5, 2019, at 03:36, Yang Shi <shy828301@gmail.com> wrote:
> >>>>>>>
> >>>>>>>
> >>>>>>> drop_caches would drop all page caches globally. You may not want to
> >>>>>>> drop the page caches used by other memcgs.
> >>>>>>
> >>>>>> We’ve tried your async force_empty patch (with a modification to default it to true to make it transparently enabled for the sake of testing), and for the past few days the stale mem cgroups still accumulate, up to 40k.
> >>>>>>
> >>>>>> We’ve double checked that the force_empty routines are invoked when a mem cgroup is offlined. But this doesn’t look very effective so far. Because, once we do `echo 1 > /proc/sys/vm/drop_caches`, all the groups immediately go away.
> >>>>>>
> >>>>>> This is a bit unexpected.
> >>>>>>
> >>>>>> Yang, could you hint what are missing in the force_empty operation, compared to a blanket drop cache?
> >>>>>
> >>>>> Drop caches does invalidate pages inode by inode. But, memcg
> >>>>> force_empty does call memcg direct reclaim.
> >>>>
> >>>> But force_empty touches things that drop_caches doesn’t? If so then maybe combining both approaches is more reliable. Since like you said,
> >>>
> >>> AFAICS, force_empty may unmap pages, but drop_caches doesn't.
> >>>
> >>>> dropping _all_ pages is usually too much thus not desired, we may want to somehow limit the dropped caches to those that are in the memory cgroup in question. What do you think?
> >>>
> >>> This is what force_empty is supposed to do.  But, as your test shows
> >>> some page cache may still remain after force_empty, then cause offline
> >>> memcgs accumulated.  I haven't figured out what happened.  You may try
> >>> what Michal suggested.
> >>
> >> None of the existing patches helped so far, but we suspect that the pages cannot be locked at the force_empty moment. We have being working on a “retry” patch which does solve the problem. We’ll do more tracing (to have a better understanding of the issue) and post the findings and/or the patch later. Thanks.
> >
> > You mean it solves the problem by retrying more times?  Actually, I'm
> > not sure if you have swap setup in your test, but force_empty does do
> > swap if swap is on. This may cause it can't reclaim all the page cache
> > in 5 retries.  I have a patch within that series to skip swap.
>
> Basically yes, retrying solves the problem. But compared to immediate retries, a scheduled retry in a few seconds is much more effective.

This may suggest that doing force_empty in a worker is in fact more
effective. Not sure if this is good enough to convince Johannes or not.

>
> We don’t have swap on.
>
> What do you mean by 5 retries? I’m still a bit lost in the LRU code and patches.

MEM_CGROUP_RECLAIM_RETRIES is 5.

Yang

>
> >
> > Yang
> >
> >>
> >> Fam
> >>
> >>>
> >>> Yang
> >>>
> >>>>
> >>>>
> >>>>>
> >>>>> Offlined memcgs will not go away if there is still page charged. Maybe
> >>>>> relate to per cpu memcg stock. I recall there are some commits which
> >>>>> do solve the per cpu page counter cache problem.
> >>>>>
> >>>>> 591edfb10a94 mm: drain memcg stocks on css offlining
> >>>>> d12c60f64cf8 mm: memcontrol: drain memcg stock on force_empty
> >>>>> bb4a7ea2b144 mm: memcontrol: drain stocks on resize limit
> >>>>>
> >>>>> Not sure if they would help out.
> >>>>
> >>>> These are all in 4.20, which is tested but not helpful.
> >>>>
> >>>> Fam
>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: memory cgroup pagecache and inode problem
  2019-01-17  5:06                       ` Yang Shi
@ 2019-01-19  3:17                         ` 段熊春
  2019-01-20 23:15                           ` Shakeel Butt
  1 sibling, 0 replies; 28+ messages in thread
From: 段熊春 @ 2019-01-19  3:17 UTC (permalink / raw)
  To: Yang Shi
  Cc: Fam Zheng, cgroups, Linux MM, tj, Johannes Weiner, lizefan,
	Michal Hocko, Vladimir Davydov, 张永肃,
	liuxiaozhou


Hi Yang Shi,
We have tried your patch, but there are still lots of memory cgroups that can’t be released.
We guess a failed trylock on the page cache may be the cause of that.
So we think we should retry force_empty again at a later time.
Inspired by your work, I made a patch series that works in our production system. The number of leftover memory cgroups has dropped to around 100 instead of 100,000.

We think that if a memory cgroup is not released, we should trigger force_empty again after 1, 2, 4, 8, 16, ... seconds. That gives a better chance of releasing the memory cgroup.
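
Roughly, the idea looks like the sketch below. This is only an illustration, not the attached patches: the force_empty_work and force_empty_delay fields in struct mem_cgroup are made up here, and would be initialized and first queued from css_offline() with a css reference held.

/* Hypothetical sketch of the backoff retry.  Assumes struct mem_cgroup
 * gained a delayed_work (force_empty_work) and a delay in jiffies
 * (force_empty_delay), first queued from css_offline() at 1 second
 * with a css reference held. */
static void memcg_force_empty_workfn(struct work_struct *work)
{
        struct mem_cgroup *memcg = container_of(to_delayed_work(work),
                                                struct mem_cgroup,
                                                force_empty_work);

        if (page_counter_read(&memcg->memory))
                mem_cgroup_force_empty(memcg);

        if (page_counter_read(&memcg->memory) &&
            memcg->force_empty_delay < 16 * HZ) {
                /* still charged: back off 1s, 2s, 4s, 8s, 16s between tries */
                memcg->force_empty_delay *= 2;
                queue_delayed_work(system_unbound_wq,
                                   &memcg->force_empty_work,
                                   memcg->force_empty_delay);
        } else {
                css_put(&memcg->css);   /* empty, or we give up retrying */
        }
}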



bytedance.net
段熊春
duanxiongchun@bytedance.com




> On Jan 17, 2019, at 1:06 PM, Yang Shi <shy828301@gmail.com> wrote:
> 
> > On Wed, Jan 16, 2019 at 6:41 PM Fam Zheng <zhengfeiran@bytedance.com> wrote:
>> 
>> 
>> 
>>> On Jan 17, 2019, at 05:06, Yang Shi <shy828301@gmail.com> wrote:
>>> 
>>> On Tue, Jan 15, 2019 at 7:52 PM Fam Zheng <zhengfeiran@bytedance.com> wrote:
>>>> 
>>>> 
>>>> 
>>>>> On Jan 16, 2019, at 08:50, Yang Shi <shy828301@gmail.com> wrote:
>>>>> 
>>>>> On Thu, Jan 10, 2019 at 12:30 AM Fam Zheng <zhengfeiran@bytedance.com> wrote:
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>>> On Jan 10, 2019, at 13:36, Yang Shi <shy828301@gmail.com> wrote:
>>>>>>> 
>>>>>>> On Sun, Jan 6, 2019 at 9:10 PM Fam Zheng <zhengfeiran@bytedance.com> wrote:
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>>> On Jan 5, 2019, at 03:36, Yang Shi <shy828301@gmail.com> wrote:
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> drop_caches would drop all page caches globally. You may not want to
>>>>>>>>> drop the page caches used by other memcgs.
>>>>>>>> 
>>>>>>>> We’ve tried your async force_empty patch (with a modification to default it to true to make it transparently enabled for the sake of testing), and for the past few days the stale mem cgroups still accumulate, up to 40k.
>>>>>>>> 
>>>>>>>> We’ve double checked that the force_empty routines are invoked when a mem cgroup is offlined. But this doesn’t look very effective so far. Because, once we do `echo 1 > /proc/sys/vm/drop_caches`, all the groups immediately go away.
>>>>>>>> 
>>>>>>>> This is a bit unexpected.
>>>>>>>> 
>>>>>>>> Yang, could you hint what are missing in the force_empty operation, compared to a blanket drop cache?
>>>>>>> 
>>>>>>> Drop caches does invalidate pages inode by inode. But, memcg
>>>>>>> force_empty does call memcg direct reclaim.
>>>>>> 
>>>>>> But force_empty touches things that drop_caches doesn’t? If so then maybe combining both approaches is more reliable. Since like you said,
>>>>> 
>>>>> AFAICS, force_empty may unmap pages, but drop_caches doesn't.
>>>>> 
>>>>>> dropping _all_ pages is usually too much thus not desired, we may want to somehow limit the dropped caches to those that are in the memory cgroup in question. What do you think?
>>>>> 
>>>>> This is what force_empty is supposed to do.  But, as your test shows
>>>>> some page cache may still remain after force_empty, then cause offline
>>>>> memcgs accumulated.  I haven't figured out what happened.  You may try
>>>>> what Michal suggested.
>>>> 
>>>> None of the existing patches helped so far, but we suspect that the pages cannot be locked at the force_empty moment. We have being working on a “retry” patch which does solve the problem. We’ll do more tracing (to have a better understanding of the issue) and post the findings and/or the patch later. Thanks.
>>> 
>>> You mean it solves the problem by retrying more times?  Actually, I'm
>>> not sure if you have swap setup in your test, but force_empty does do
>>> swap if swap is on. This may cause it can't reclaim all the page cache
>>> in 5 retries.  I have a patch within that series to skip swap.
>> 
>> Basically yes, retrying solves the problem. But compared to immediate retries, a scheduled retry in a few seconds is much more effective.
> 
> This may suggest doing force_empty in a worker is more effective in
> fact. Not sure if this is good enough to convince Johannes or not.
> 
>> 
>> We don’t have swap on.
>> 
>> What do you mean by 5 retries? I’m still a bit lost in the LRU code and patches.
> 
> MEM_CGROUP_RECLAIM_RETRIES is 5.
> 
> Yang
> 
>> 
>>> 
>>> Yang
>>> 
>>>> 
>>>> Fam
>>>> 
>>>>> 
>>>>> Yang
>>>>> 
>>>>>> 
>>>>>> 
>>>>>>> 
>>>>>>> Offlined memcgs will not go away if there is still page charged. Maybe
>>>>>>> relate to per cpu memcg stock. I recall there are some commits which
>>>>>>> do solve the per cpu page counter cache problem.
>>>>>>> 
>>>>>>> 591edfb10a94 mm: drain memcg stocks on css offlining
>>>>>>> d12c60f64cf8 mm: memcontrol: drain memcg stock on force_empty
>>>>>>> bb4a7ea2b144 mm: memcontrol: drain stocks on resize limit
>>>>>>> 
>>>>>>> Not sure if they would help out.
>>>>>> 
>>>>>> These are all in 4.20, which is tested but not helpful.
>>>>>> 
>>>>>> Fam


[-- Attachment #2.1: Type: text/html, Size: 913 bytes --]

[-- Attachment #2.2: memcgroup_release.patch.tar.gz --]
[-- Type: application/x-gzip, Size: 5537 bytes --]

[-- Attachment #2.3: Type: text/html, Size: 14131 bytes --]

^ permalink raw reply	[flat|nested] 28+ messages in thread

* memory cgroup pagecache and inode problem
@ 2019-01-20 23:15                           ` Shakeel Butt
  0 siblings, 0 replies; 28+ messages in thread
From: Shakeel Butt @ 2019-01-20 23:15 UTC (permalink / raw)
  To: Yang Shi, Fam Zheng
  Cc: cgroups, Linux MM, tj, Johannes Weiner, lizefan, Michal Hocko,
	Vladimir Davydov, duanxiongchun, 张永肃,
	liuxiaozhou, Shakeel Butt

On Wed, Jan 16, 2019 at 9:07 PM Yang Shi <shy828301@gmail.com> wrote:
...
> > > You mean it solves the problem by retrying more times?  Actually, I'm
> > > not sure if you have swap setup in your test, but force_empty does do
> > > swap if swap is on. This may cause it can't reclaim all the page cache
> > > in 5 retries.  I have a patch within that series to skip swap.
> >
> > Basically yes, retrying solves the problem. But compared to immediate retries, a scheduled retry in a few seconds is much more effective.
>
> This may suggest doing force_empty in a worker is more effective in
> fact. Not sure if this is good enough to convince Johannes or not.
>

From what I understand, what we actually want is to force_empty an
offlined memcg. How about we change the semantics of force_empty on
root_mem_cgroup? Currently force_empty on root_mem_cgroup returns
-EINVAL. Rather than that, let's do force_empty on all offlined memcgs
if the user does force_empty on root_mem_cgroup. Something like the
following.
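
(For scale: with such a change, an administrator or the container runtime
could in principle sweep all offlined memcgs in one go with e.g.
`echo 1 > /sys/fs/cgroup/memory/memory.force_empty`, assuming the usual
cgroup v1 mount point, rather than the kernel carrying its own workers or
timers for it.)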

---
 mm/memcontrol.c | 22 +++++++++++++++-------
 1 file changed, 15 insertions(+), 7 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index a4ac554be7e8..51daa2935c41 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2898,14 +2898,16 @@ static inline bool memcg_has_children(struct mem_cgroup *memcg)
  *
  * Caller is responsible for holding css reference for memcg.
  */
-static int mem_cgroup_force_empty(struct mem_cgroup *memcg)
+static int mem_cgroup_force_empty(struct mem_cgroup *memcg, bool online)
 {
 	int nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
 
 	/* we call try-to-free pages for make this cgroup empty */
-	lru_add_drain_all();
 
-	drain_all_stock(memcg);
+	if (online) {
+		lru_add_drain_all();
+		drain_all_stock(memcg);
+	}
 
 	/* try to free all pages in this cgroup */
 	while (nr_retries && page_counter_read(&memcg->memory)) {
@@ -2915,7 +2917,7 @@ static int mem_cgroup_force_empty(struct mem_cgroup *memcg)
 			return -EINTR;
 
 		progress = try_to_free_mem_cgroup_pages(memcg, 1,
-							GFP_KERNEL, true);
+							GFP_KERNEL, online);
 		if (!progress) {
 			nr_retries--;
 			/* maybe some writeback is necessary */
@@ -2932,10 +2934,16 @@ static ssize_t mem_cgroup_force_empty_write(struct kernfs_open_file *of,
 					    loff_t off)
 {
 	struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
+	struct mem_cgroup *mi;
 
-	if (mem_cgroup_is_root(memcg))
-		return -EINVAL;
-	return mem_cgroup_force_empty(memcg) ?: nbytes;
+	if (mem_cgroup_is_root(memcg)) {
+		for_each_mem_cgroup_tree(mi, memcg) {
+			if (!mem_cgroup_online(mi))
+				mem_cgroup_force_empty(mi, false);
+		}
+		return 0;
+	}
+	return mem_cgroup_force_empty(memcg, true) ?: nbytes;
 }
 
 static u64 mem_cgroup_hierarchy_read(struct cgroup_subsys_state *css,
-- 
2.20.1.321.g9e740568ce-goog

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: memory cgroup pagecache and inode problem
  2019-01-20 23:15                           ` Shakeel Butt
  (?)
@ 2019-01-20 23:20                           ` Shakeel Butt
  -1 siblings, 0 replies; 28+ messages in thread
From: Shakeel Butt @ 2019-01-20 23:20 UTC (permalink / raw)
  To: Yang Shi, Fam Zheng
  Cc: Cgroups, Linux MM, Tejun Heo, Johannes Weiner, Li Zefan,
	Michal Hocko, Vladimir Davydov, duanxiongchun,
	张永肃,
	liuxiaozhou

On Sun, Jan 20, 2019 at 3:16 PM Shakeel Butt <shakeelb@google.com> wrote:
>
> On Wed, Jan 16, 2019 at 9:07 PM Yang Shi <shy828301@gmail.com> wrote:
> ...
> > > > You mean it solves the problem by retrying more times?  Actually, I'm
> > > > not sure if you have swap setup in your test, but force_empty does do
> > > > swap if swap is on. This may cause it can't reclaim all the page cache
> > > > in 5 retries.  I have a patch within that series to skip swap.
> > >
> > > Basically yes, retrying solves the problem. But compared to immediate retries, a scheduled retry in a few seconds is much more effective.
> >
> > This may suggest doing force_empty in a worker is more effective in
> > fact. Not sure if this is good enough to convince Johannes or not.
> >
>
> From what I understand what we actually want is to force_empty an
> offlined memcg. How about we change the semantics of force_empty on
> root_mem_cgroup? Currently force_empty on root_mem_cgroup returns
> -EINVAL. Rather than that, let's do force_empty on all offlined memcgs
> if user does force_empty on root_mem_cgroup. Something like following.
>

Basically we don't need to add more complexity in the kernel (async
workers, timeouts, workqueues) to run force_empty if we expose a way
to force_empty offlined memcgs.

> ---
>  mm/memcontrol.c | 22 +++++++++++++++-------
>  1 file changed, 15 insertions(+), 7 deletions(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index a4ac554be7e8..51daa2935c41 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2898,14 +2898,16 @@ static inline bool memcg_has_children(struct mem_cgroup *memcg)
>   *
>   * Caller is responsible for holding css reference for memcg.
>   */
> -static int mem_cgroup_force_empty(struct mem_cgroup *memcg)
> +static int mem_cgroup_force_empty(struct mem_cgroup *memcg, bool online)
>  {
>         int nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
>
>         /* we call try-to-free pages for make this cgroup empty */
> -       lru_add_drain_all();
>
> -       drain_all_stock(memcg);
> +       if (online) {
> +               lru_add_drain_all();
> +               drain_all_stock(memcg);
> +       }
>
>         /* try to free all pages in this cgroup */
>         while (nr_retries && page_counter_read(&memcg->memory)) {
> @@ -2915,7 +2917,7 @@ static int mem_cgroup_force_empty(struct mem_cgroup *memcg)
>                         return -EINTR;
>
>                 progress = try_to_free_mem_cgroup_pages(memcg, 1,
> -                                                       GFP_KERNEL, true);
> +                                                       GFP_KERNEL, online);
>                 if (!progress) {
>                         nr_retries--;
>                         /* maybe some writeback is necessary */
> @@ -2932,10 +2934,16 @@ static ssize_t mem_cgroup_force_empty_write(struct kernfs_open_file *of,
>                                             loff_t off)
>  {
>         struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
> +       struct mem_cgroup *mi;
>
> -       if (mem_cgroup_is_root(memcg))
> -               return -EINVAL;
> -       return mem_cgroup_force_empty(memcg) ?: nbytes;
> +       if (mem_cgroup_is_root(memcg)) {
> +               for_each_mem_cgroup_tree(mi, memcg) {
> +                       if (!mem_cgroup_online(mi))
> +                               mem_cgroup_force_empty(mi, false);
> +               }
> +               return 0;
> +       }
> +       return mem_cgroup_force_empty(memcg, true) ?: nbytes;
>  }
>
>  static u64 mem_cgroup_hierarchy_read(struct cgroup_subsys_state *css,
> --
> 2.20.1.321.g9e740568ce-goog
>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: memory cgroup pagecache and inode problem
  2019-01-20 23:15                           ` Shakeel Butt
  (?)
  (?)
@ 2019-01-21 10:27                           ` Michal Hocko
  -1 siblings, 0 replies; 28+ messages in thread
From: Michal Hocko @ 2019-01-21 10:27 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Yang Shi, Fam Zheng, cgroups, Linux MM, tj, Johannes Weiner,
	lizefan, Vladimir Davydov, duanxiongchun,
	张永肃,
	liuxiaozhou

On Sun 20-01-19 15:15:51, Shakeel Butt wrote:
> On Wed, Jan 16, 2019 at 9:07 PM Yang Shi <shy828301@gmail.com> wrote:
> ...
> > > > You mean it solves the problem by retrying more times?  Actually, I'm
> > > > not sure if you have swap setup in your test, but force_empty does do
> > > > swap if swap is on. This may cause it can't reclaim all the page cache
> > > > in 5 retries.  I have a patch within that series to skip swap.
> > >
> > > Basically yes, retrying solves the problem. But compared to immediate retries, a scheduled retry in a few seconds is much more effective.
> >
> > This may suggest doing force_empty in a worker is more effective in
> > fact. Not sure if this is good enough to convince Johannes or not.
> >
> 
> From what I understand what we actually want is to force_empty an
> offlined memcg. How about we change the semantics of force_empty on
> root_mem_cgroup? Currently force_empty on root_mem_cgroup returns
> -EINVAL. Rather than that, let's do force_empty on all offlined memcgs
> if user does force_empty on root_mem_cgroup. Something like following.

No, I do not think we want to make the root memcg somehow special here. I do
recognize two things here:
1) people seem to want to have control over when a specific cgroup
gets reclaimed (basically force_empty)
2) people would like the above to happen when a memcg is offlined

The first part is not present in v2 and we should discuss whether we
want to expose it, because it hasn't been added due to a lack of use cases.
The latter is already discussed in [1], so let's continue there.

[1] http://lkml.kernel.org/r/1547061285-100329-1-git-send-email-yang.shi@linux.alibaba.com
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 28+ messages in thread

end of thread, other threads:[~2019-01-21 10:27 UTC | newest]

Thread overview: 28+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <15614FDC-198E-449B-BFAF-B00D6EF61155@bytedance.com>
2019-01-04  4:44 ` memory cgroup pagecache and inode problem Fam Zheng
2019-01-04  5:00   ` Yang Shi
2019-01-04  5:12     ` Fam Zheng
2019-01-04 19:36       ` Yang Shi
2019-01-07  5:10         ` Fam Zheng
2019-01-07  8:53           ` Michal Hocko
2019-01-07  9:01             ` Fam Zheng
2019-01-07  9:13               ` Michal Hocko
2019-01-09  4:33               ` Fam Zheng
2019-01-10  5:36           ` Yang Shi
2019-01-10  8:30             ` Fam Zheng
2019-01-10  8:41               ` Michal Hocko
2019-01-16  0:50               ` Yang Shi
2019-01-16  3:52                 ` Fam Zheng
2019-01-16  7:06                   ` Michal Hocko
2019-01-16 21:08                     ` Yang Shi
2019-01-16 21:06                   ` Yang Shi
2019-01-17  2:41                     ` Fam Zheng
2019-01-17  5:06                       ` Yang Shi
2019-01-19  3:17                         ` 段熊春
2019-01-20 23:15                         ` Shakeel Butt
2019-01-20 23:20                           ` Shakeel Butt
2019-01-21 10:27                           ` Michal Hocko
2019-01-04  9:04 ` Michal Hocko
2019-01-04 10:02   ` Fam Zheng
2019-01-04 10:12     ` Michal Hocko
2019-01-04 10:35       ` Fam Zheng
