linux-kernel.vger.kernel.org archive mirror
* [PATCH] mm/memcontrol.c: speed up to force empty a memory cgroup
@ 2018-03-19  8:29 Li RongQing
  2018-03-19  8:53 ` Michal Hocko
  2018-03-23 10:34 ` Michal Hocko
  0 siblings, 2 replies; 17+ messages in thread
From: Li RongQing @ 2018-03-19  8:29 UTC (permalink / raw)
  To: linux-kernel, linux-mm; +Cc: cgroups, hannes, mhocko

mem_cgroup_force_empty() tries to free only 32 (SWAP_CLUSTER_MAX) pages
on each iteration. If a memory cgroup has lots of page cache, it will
take many iterations to empty all of it, so increase the number of pages
reclaimed per iteration to speed it up, the same as in
mem_cgroup_resize_limit().

a simple test shows:

  $dd if=aaa  of=bbb  bs=1k count=3886080
  $rm -f bbb
  $time echo 100000000 >/cgroup/memory/test/memory.limit_in_bytes

Before: 0m0.252s ===> after: 0m0.178s

Signed-off-by: Li RongQing <lirongqing@baidu.com>
---
 mm/memcontrol.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 670e99b68aa6..8910d9e8e908 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2480,7 +2480,7 @@ static int mem_cgroup_resize_limit(struct mem_cgroup *memcg,
 		if (!ret)
 			break;
 
-		if (!try_to_free_mem_cgroup_pages(memcg, 1,
+		if (!try_to_free_mem_cgroup_pages(memcg, 1024,
 					GFP_KERNEL, !memsw)) {
 			ret = -EBUSY;
 			break;
@@ -2610,7 +2610,7 @@ static int mem_cgroup_force_empty(struct mem_cgroup *memcg)
 		if (signal_pending(current))
 			return -EINTR;
 
-		progress = try_to_free_mem_cgroup_pages(memcg, 1,
+		progress = try_to_free_mem_cgroup_pages(memcg, 1024,
 							GFP_KERNEL, true);
 		if (!progress) {
 			nr_retries--;
-- 
2.11.0
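
For scale, some rough arithmetic on the test above (approximate figures,
not taken from the posting): 3886080 1-KiB blocks is about 971,520 4-KiB
pages of page cache, and a 100000000-byte limit keeps about 24,414 of
them, so roughly 947,000 pages have to be reclaimed. At 32 pages per
try_to_free_mem_cgroup_pages() call that is about 29,600 trips through
the full reclaim stack; at 1024 it is about 925. The surrounding loop,
abridged from the v4.16-era mem_cgroup_force_empty(), shows why the
per-call count sets the trip count:

        /* Abridged sketch of v4.16-era mem_cgroup_force_empty()
         * (mm/memcontrol.c) with the patch applied: each iteration
         * re-enters the full reclaim stack, so the per-call page
         * count directly determines the number of iterations. */
        while (nr_retries && page_counter_read(&memcg->memory)) {
                int progress;

                if (signal_pending(current))
                        return -EINTR;

                progress = try_to_free_mem_cgroup_pages(memcg, 1024,
                                                        GFP_KERNEL, true);
                if (!progress) {
                        nr_retries--;
                        /* maybe some writeback is necessary */
                        congestion_wait(BLK_RW_ASYNC, HZ/10);
                }
        }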


* Re: [PATCH] mm/memcontrol.c: speed up to force empty a memory cgroup
  2018-03-19  8:29 [PATCH] mm/memcontrol.c: speed up to force empty a memory cgroup Li RongQing
@ 2018-03-19  8:53 ` Michal Hocko
       [not found]   ` <2AD939572F25A448A3AE3CAEA61328C23745764B@BC-MAIL-M28.internal.baidu.com>
  2018-03-23 10:34 ` Michal Hocko
  1 sibling, 1 reply; 17+ messages in thread
From: Michal Hocko @ 2018-03-19  8:53 UTC (permalink / raw)
  To: Li RongQing; +Cc: linux-kernel, linux-mm, cgroups, hannes, Andrey Ryabinin

On Mon 19-03-18 16:29:30, Li RongQing wrote:
> mem_cgroup_force_empty() tries to free only 32 (SWAP_CLUSTER_MAX) pages
> on each iteration. If a memory cgroup has lots of page cache, it will
> take many iterations to empty all of it, so increase the number of pages
> reclaimed per iteration to speed it up, the same as in
> mem_cgroup_resize_limit().
> 
> a simple test shows:
> 
>   $dd if=aaa  of=bbb  bs=1k count=3886080
>   $rm -f bbb
>   $time echo 100000000 >/cgroup/memory/test/memory.limit_in_bytes
> 
> Before: 0m0.252s ===> after: 0m0.178s

Andrey was proposing something similar [1]. My main objection was that
his approach might lead to over-reclaim. Your approach is more
conservative because it just increases the batch size. The size is still
rather arbitrary. Same as SWAP_CLUSTER_MAX but that one is a commonly
used unit of reclaim in the MM code.

I would be really curious about a more detailed explanation of why having a
larger batch yields better performance, because we are doing
SWAP_CLUSTER_MAX batches at the lower reclaim level anyway.

[1] http://lkml.kernel.org/r/20180119132544.19569-2-aryabinin@virtuozzo.com
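
The lower-level batching referred to above is visible in
shrink_node_memcg(); an abridged sketch of the v4.16-era loop
(mm/vmscan.c), where a 1024-page target simply repeats the inner scan
within a single call instead of re-entering the whole stack once per 32
pages:

        /* Abridged from v4.16-era shrink_node_memcg() (mm/vmscan.c):
         * regardless of nr_to_reclaim, the LRUs are scanned in
         * SWAP_CLUSTER_MAX-sized chunks. */
        while (nr[LRU_INACTIVE_ANON] || nr[LRU_ACTIVE_FILE] ||
               nr[LRU_INACTIVE_FILE]) {
                unsigned long nr_to_scan;
                enum lru_list lru;

                for_each_evictable_lru(lru) {
                        if (!nr[lru])
                                continue;
                        nr_to_scan = min(nr[lru], SWAP_CLUSTER_MAX);
                        nr[lru] -= nr_to_scan;
                        nr_reclaimed += shrink_list(lru, nr_to_scan,
                                                    lruvec, memcg, sc);
                }
                /* ... proportional rebalancing of the remaining
                 * scan targets elided ... */
        }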

> 
> Signed-off-by: Li RongQing <lirongqing@baidu.com>
> ---
>  mm/memcontrol.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 670e99b68aa6..8910d9e8e908 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2480,7 +2480,7 @@ static int mem_cgroup_resize_limit(struct mem_cgroup *memcg,
>  		if (!ret)
>  			break;
>  
> -		if (!try_to_free_mem_cgroup_pages(memcg, 1,
> +		if (!try_to_free_mem_cgroup_pages(memcg, 1024,
>  					GFP_KERNEL, !memsw)) {
>  			ret = -EBUSY;
>  			break;
> @@ -2610,7 +2610,7 @@ static int mem_cgroup_force_empty(struct mem_cgroup *memcg)
>  		if (signal_pending(current))
>  			return -EINTR;
>  
> -		progress = try_to_free_mem_cgroup_pages(memcg, 1,
> +		progress = try_to_free_mem_cgroup_pages(memcg, 1024,
>  							GFP_KERNEL, true);
>  		if (!progress) {
>  			nr_retries--;
> -- 
> 2.11.0

-- 
Michal Hocko
SUSE Labs


* Re: Re: [PATCH] mm/memcontrol.c: speed up to force empty a memory cgroup
       [not found]   ` <2AD939572F25A448A3AE3CAEA61328C23745764B@BC-MAIL-M28.internal.baidu.com>
@ 2018-03-19 10:37     ` Michal Hocko
  2018-03-19 10:51       ` Re: " Li,Rongqing
  0 siblings, 1 reply; 17+ messages in thread
From: Michal Hocko @ 2018-03-19 10:37 UTC (permalink / raw)
  To: Li,Rongqing; +Cc: linux-kernel, linux-mm, cgroups, hannes, Andrey Ryabinin

On Mon 19-03-18 10:00:41, Li,Rongqing wrote:
> 
> 
> > -----Original Message-----
> > From: Michal Hocko [mailto:mhocko@kernel.org]
> > Sent: 2018-03-19 16:54
> > To: Li,Rongqing <lirongqing@baidu.com>
> > Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> > cgroups@vger.kernel.org; hannes@cmpxchg.org; Andrey Ryabinin
> > <aryabinin@virtuozzo.com>
> > Subject: Re: [PATCH] mm/memcontrol.c: speed up to force empty a memory
> > cgroup
> > 
> > On Mon 19-03-18 16:29:30, Li RongQing wrote:
> > > mem_cgroup_force_empty() tries to free only 32 (SWAP_CLUSTER_MAX)
> > > pages on each iteration, if a memory cgroup has lots of page cache, it
> > > will take many iterations to empty all page cache, so increase the
> > > reclaimed number per iteration to speed it up. same as in
> > > mem_cgroup_resize_limit()
> > >
> > > a simple test shows:
> > >
> > >   $dd if=aaa  of=bbb  bs=1k count=3886080
> > >   $rm -f bbb
> > >   $time echo 100000000 >/cgroup/memory/test/memory.limit_in_bytes
> > >
> > > Before: 0m0.252s ===> after: 0m0.178s
> > 
> > Andrey was proposing something similar [1]. My main objection was that his
> > approach might lead to over-reclaim. Your approach is more conservative
> > because it just increases the batch size. The size is still rather arbitrary. Same
> > as SWAP_CLUSTER_MAX but that one is a commonly used unit of reclaim in
> > the MM code.
> > 
> > I would be really curious about a more detailed explanation of why having
> > a larger batch yields better performance, because we are doing
> > SWAP_CLUSTER_MAX batches at the lower reclaim level anyway.
> > 
> 
> Although SWAP_CLUSTER_MAX is used at the lower level, the call stack of
> try_to_free_mem_cgroup_pages is long; increasing nr_to_reclaim reduces the
> number of calls to the functions on it [do_try_to_free_pages, shrink_zones,
> shrink_node]
> 
> mem_cgroup_resize_limit
> --->try_to_free_mem_cgroup_pages:  .nr_to_reclaim = max(1024,  SWAP_CLUSTER_MAX),
>    ---> do_try_to_free_pages 
>      ---> shrink_zones
>       --->shrink_node
>        ---> shrink_node_memcg
>          ---> shrink_list          <-------loop will happen in this place [times=1024/32]
>            ---> shrink_page_list

Can you actually measure this to be the culprit? Because we should
rethink our call path if it is too complicated/deep to perform well.
Adding arbitrary batch sizes doesn't sound like a good way to go to me.
-- 
Michal Hocko
SUSE Labs


* Re: Re: [PATCH] mm/memcontrol.c: speed up to force empty a memory cgroup
  2018-03-19 10:37     ` Re: " Michal Hocko
@ 2018-03-19 10:51       ` Li,Rongqing
  2018-03-19 17:51         ` David Rientjes
  2018-03-23  2:58         ` Li,Rongqing
  0 siblings, 2 replies; 17+ messages in thread
From: Li,Rongqing @ 2018-03-19 10:51 UTC (permalink / raw)
  To: Michal Hocko; +Cc: linux-kernel, linux-mm, cgroups, hannes, Andrey Ryabinin



> -----Original Message-----
> From: Michal Hocko [mailto:mhocko@kernel.org]
> Sent: 2018-03-19 18:38
> To: Li,Rongqing <lirongqing@baidu.com>
> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> cgroups@vger.kernel.org; hannes@cmpxchg.org; Andrey Ryabinin
> <aryabinin@virtuozzo.com>
> Subject: Re: Re: [PATCH] mm/memcontrol.c: speed up to force empty a
> memory cgroup
> 
> On Mon 19-03-18 10:00:41, Li,Rongqing wrote:
> >
> >
> > > -----Original Message-----
> > > From: Michal Hocko [mailto:mhocko@kernel.org]
> > > Sent: 2018-03-19 16:54
> > > To: Li,Rongqing <lirongqing@baidu.com>
> > > Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> > > cgroups@vger.kernel.org; hannes@cmpxchg.org; Andrey Ryabinin
> > > <aryabinin@virtuozzo.com>
> > > Subject: Re: [PATCH] mm/memcontrol.c: speed up to force empty a
> > > memory cgroup
> > >
> > > On Mon 19-03-18 16:29:30, Li RongQing wrote:
> > > > mem_cgroup_force_empty() tries to free only 32
> (SWAP_CLUSTER_MAX)
> > > > pages on each iteration, if a memory cgroup has lots of page
> > > > cache, it will take many iterations to empty all page cache, so
> > > > increase the reclaimed number per iteration to speed it up. same
> > > > as in
> > > > mem_cgroup_resize_limit()
> > > >
> > > > a simple test shows:
> > > >
> > > >   $dd if=aaa  of=bbb  bs=1k count=3886080
> > > >   $rm -f bbb
> > > >   $time echo 100000000 >/cgroup/memory/test/memory.limit_in_bytes
> > > >
> > > > Before: 0m0.252s ===> after: 0m0.178s
> > >
> > > Andrey was proposing something similar [1]. My main objection was
> > > that his approach might lead to over-reclaim. Your approach is more
> > > conservative because it just increases the batch size. The size is
> > > still rather arbitrary. Same as SWAP_CLUSTER_MAX but that one is a
> > > commonly used unit of reclaim in the MM code.
> > >
> > > I would be really curious about a more detailed explanation of why
> > > having a larger batch yields better performance because we are doing
> > > SWAP_CLUSTER_MAX batches at the lower reclaim level anyway.
> > >
> >
> > Although SWAP_CLUSTER_MAX is used at the lower level, but the call
> > stack of try_to_free_mem_cgroup_pages is too long, increase the
> > nr_to_reclaim can reduce times of calling
> > function[do_try_to_free_pages, shrink_zones, shrink_node ]
> >
> > mem_cgroup_resize_limit
> > --->try_to_free_mem_cgroup_pages:  .nr_to_reclaim = max(1024,
> > --->SWAP_CLUSTER_MAX),
> >    ---> do_try_to_free_pages
> >      ---> shrink_zones
> >       --->shrink_node
> >        ---> shrink_node_memcg
> >          ---> shrink_list          <-------loop will happen in this place
> [times=1024/32]
> >            ---> shrink_page_list
> 
> Can you actually measure this to be the culprit. Because we should rethink
> our call path if it is too complicated/deep to perform well.
> Adding arbitrary batch sizes doesn't sound like a good way to go to me.

Ok, I will try

-RongQing
> --
> Michal Hocko
> SUSE Labs


* Re: Re: Re: [PATCH] mm/memcontrol.c: speed up to force empty a memory cgroup
  2018-03-19 10:51       ` Re: " Li,Rongqing
@ 2018-03-19 17:51         ` David Rientjes
  2018-03-20  8:39           ` Michal Hocko
  2018-03-23  2:58         ` Li,Rongqing
  1 sibling, 1 reply; 17+ messages in thread
From: David Rientjes @ 2018-03-19 17:51 UTC (permalink / raw)
  To: Li,Rongqing
  Cc: Michal Hocko, linux-kernel, linux-mm, cgroups, hannes, Andrey Ryabinin

On Mon, 19 Mar 2018, Li,Rongqing wrote:

> > > Although SWAP_CLUSTER_MAX is used at the lower level, but the call
> > > stack of try_to_free_mem_cgroup_pages is too long, increase the
> > > nr_to_reclaim can reduce times of calling
> > > function[do_try_to_free_pages, shrink_zones, shrink_node ]
> > >
> > > mem_cgroup_resize_limit
> > > --->try_to_free_mem_cgroup_pages:  .nr_to_reclaim = max(1024,
> > > --->SWAP_CLUSTER_MAX),
> > >    ---> do_try_to_free_pages
> > >      ---> shrink_zones
> > >       --->shrink_node
> > >        ---> shrink_node_memcg
> > >          ---> shrink_list          <-------loop will happen in this place
> > [times=1024/32]
> > >            ---> shrink_page_list
> > 
> > Can you actually measure this to be the culprit. Because we should rethink
> > our call path if it is too complicated/deep to perform well.
> > Adding arbitrary batch sizes doesn't sound like a good way to go to me.
> 
> Ok, I will try
> 

Looping in mem_cgroup_resize_limit(), which takes memcg_limit_mutex on
every iteration and so contends with lowering limits in other cgroups (on
our systems, thousands), while calling try_to_free_mem_cgroup_pages() with
less than SWAP_CLUSTER_MAX is lame.  It would probably be best to limit
the nr_pages to the amount that needs to be reclaimed, though, rather than
over-reclaiming.

If you wanted to be invasive, you could change page_counter_limit() to 
return the count - limit, fix up the callers that look for -EBUSY, and 
then use max(val, SWAP_CLUSTER_MAX) as your nr_pages.
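
A hypothetical sketch of that suggestion (nothing below is actual kernel
code: in v4.16 page_counter_limit() returns only 0 or -EBUSY, and the
caller is a paraphrase of mem_cgroup_resize_limit()):

        /* Hypothetical: report how far the counter is over the new
         * limit instead of a plain -EBUSY, so the caller can size its
         * reclaim request. Locking and ordering are elided. */
        long page_counter_limit(struct page_counter *counter,
                                unsigned long limit)
        {
                long count = atomic_long_read(&counter->count);

                counter->limit = limit;
                return count <= limit ? 0 : count - limit;
        }

        /* Hypothetical caller side: reclaim the actual excess,
         * floored at SWAP_CLUSTER_MAX. */
        long excess = page_counter_limit(counter, limit);

        if (excess > 0 &&
            !try_to_free_mem_cgroup_pages(memcg,
                        max_t(unsigned long, excess, SWAP_CLUSTER_MAX),
                        GFP_KERNEL, !memsw)) {
                ret = -EBUSY;
                break;
        }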


* Re: Re: Re: [PATCH] mm/memcontrol.c: speed up to force empty a memory cgroup
  2018-03-19 17:51         ` David Rientjes
@ 2018-03-20  8:39           ` Michal Hocko
  2018-03-20 20:29             ` David Rientjes
  0 siblings, 1 reply; 17+ messages in thread
From: Michal Hocko @ 2018-03-20  8:39 UTC (permalink / raw)
  To: David Rientjes
  Cc: Li,Rongqing, linux-kernel, linux-mm, cgroups, hannes, Andrey Ryabinin

On Mon 19-03-18 10:51:57, David Rientjes wrote:
> On Mon, 19 Mar 2018, Li,Rongqing wrote:
> 
> > > > Although SWAP_CLUSTER_MAX is used at the lower level, but the call
> > > > stack of try_to_free_mem_cgroup_pages is too long, increase the
> > > > nr_to_reclaim can reduce times of calling
> > > > function[do_try_to_free_pages, shrink_zones, shrink_node ]
> > > >
> > > > mem_cgroup_resize_limit
> > > > --->try_to_free_mem_cgroup_pages:  .nr_to_reclaim = max(1024,
> > > > --->SWAP_CLUSTER_MAX),
> > > >    ---> do_try_to_free_pages
> > > >      ---> shrink_zones
> > > >       --->shrink_node
> > > >        ---> shrink_node_memcg
> > > >          ---> shrink_list          <-------loop will happen in this place
> > > [times=1024/32]
> > > >            ---> shrink_page_list
> > > 
> > > Can you actually measure this to be the culprit. Because we should rethink
> > > our call path if it is too complicated/deep to perform well.
> > > Adding arbitrary batch sizes doesn't sound like a good way to go to me.
> > 
> > Ok, I will try
> > 
> 
> Looping in mem_cgroup_resize_limit(), which takes memcg_limit_mutex on 
> every iteration which contends with lowering limits in other cgroups (on 
> our systems, thousands), calling try_to_free_mem_cgroup_pages() with less 
> than SWAP_CLUSTER_MAX is lame.

Well, if the global lock is a bottleneck in your deployments then we
can come up with something more clever. E.g. per hierarchy locking
or even drop the lock for the reclaim altogether. If we reclaim in
SWAP_CLUSTER_MAX batches then the potential over-reclaim risk is quite low
when multiple users are shrinking the same (sub)hierarchy.

> It would probably be best to limit the 
> nr_pages to the amount that needs to be reclaimed, though, rather than 
> over reclaiming.

How do you achieve that? The charging path is not synchronized with the
shrinking one at all.

> If you wanted to be invasive, you could change page_counter_limit() to 
> return the count - limit, fix up the callers that look for -EBUSY, and 
> then use max(val, SWAP_CLUSTER_MAX) as your nr_pages.

I am not sure I understand

-- 
Michal Hocko
SUSE Labs


* Re: Re: Re: [PATCH] mm/memcontrol.c: speed up to force empty a memory cgroup
  2018-03-20  8:39           ` Michal Hocko
@ 2018-03-20 20:29             ` David Rientjes
  2018-03-20 22:08               ` Andrey Ryabinin
  0 siblings, 1 reply; 17+ messages in thread
From: David Rientjes @ 2018-03-20 20:29 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Li,Rongqing, linux-kernel, linux-mm, cgroups, hannes, Andrey Ryabinin

On Tue, 20 Mar 2018, Michal Hocko wrote:

> > > > > Although SWAP_CLUSTER_MAX is used at the lower level, but the call
> > > > > stack of try_to_free_mem_cgroup_pages is too long, increase the
> > > > > nr_to_reclaim can reduce times of calling
> > > > > function[do_try_to_free_pages, shrink_zones, shrink_node ]
> > > > >
> > > > > mem_cgroup_resize_limit
> > > > > --->try_to_free_mem_cgroup_pages:  .nr_to_reclaim = max(1024,
> > > > > --->SWAP_CLUSTER_MAX),
> > > > >    ---> do_try_to_free_pages
> > > > >      ---> shrink_zones
> > > > >       --->shrink_node
> > > > >        ---> shrink_node_memcg
> > > > >          ---> shrink_list          <-------loop will happen in this place
> > > > [times=1024/32]
> > > > >            ---> shrink_page_list
> > > > 
> > > > Can you actually measure this to be the culprit. Because we should rethink
> > > > our call path if it is too complicated/deep to perform well.
> > > > Adding arbitrary batch sizes doesn't sound like a good way to go to me.
> > > 
> > > Ok, I will try
> > > 
> > 
> > Looping in mem_cgroup_resize_limit(), which takes memcg_limit_mutex on
> > every iteration and so contends with lowering limits in other cgroups (on
> > our systems, thousands), while calling try_to_free_mem_cgroup_pages() with
> > less than SWAP_CLUSTER_MAX is lame.
> 
> Well, if the global lock is a bottleneck in your deployments then we
> can come up with something more clever. E.g. per hierarchy locking
> or even drop the lock for the reclaim altogether. If we reclaim in
> SWAP_CLUSTER_MAX batches then the potential over-reclaim risk is quite low
> when multiple users are shrinking the same (sub)hierarchy.
> 

I don't believe this to be a bottleneck if nr_pages is increased in 
mem_cgroup_resize_limit().

> > It would probably be best to limit the 
> > nr_pages to the amount that needs to be reclaimed, though, rather than 
> > over reclaiming.
> 
> How do you achieve that? The charging path is not synchronized with the
> shrinking one at all.
> 

The point is to get a better guess at how many pages, up to
SWAP_CLUSTER_MAX, need to be reclaimed instead of 1.

> > If you wanted to be invasive, you could change page_counter_limit() to 
> > return the count - limit, fix up the callers that look for -EBUSY, and 
> > then use max(val, SWAP_CLUSTER_MAX) as your nr_pages.
> 
> I am not sure I understand
> 

Have page_counter_limit() return the number of pages over limit, i.e. 
count - limit, since it compares the two anyway.  Fix up existing callers 
and then clamp that value to SWAP_CLUSTER_MAX in 
mem_cgroup_resize_limit().  It's a more accurate guess than either 1 or 
1024.


* Re: Re: Re: [PATCH] mm/memcontrol.c: speed up to force empty a memory cgroup
  2018-03-20 20:29             ` David Rientjes
@ 2018-03-20 22:08               ` Andrey Ryabinin
  2018-03-20 22:15                 ` David Rientjes
  0 siblings, 1 reply; 17+ messages in thread
From: Andrey Ryabinin @ 2018-03-20 22:08 UTC (permalink / raw)
  To: David Rientjes, Michal Hocko
  Cc: Li,Rongqing, linux-kernel, linux-mm, cgroups, hannes



On 03/20/2018 11:29 PM, David Rientjes wrote:
> On Tue, 20 Mar 2018, Michal Hocko wrote:
> 
>>>>>> Although SWAP_CLUSTER_MAX is used at the lower level, but the call
>>>>>> stack of try_to_free_mem_cgroup_pages is too long, increase the
>>>>>> nr_to_reclaim can reduce times of calling
>>>>>> function[do_try_to_free_pages, shrink_zones, shrink_node ]
>>>>>>
>>>>>> mem_cgroup_resize_limit
>>>>>> --->try_to_free_mem_cgroup_pages:  .nr_to_reclaim = max(1024,
>>>>>> --->SWAP_CLUSTER_MAX),
>>>>>>    ---> do_try_to_free_pages
>>>>>>      ---> shrink_zones
>>>>>>       --->shrink_node
>>>>>>        ---> shrink_node_memcg
>>>>>>          ---> shrink_list          <-------loop will happen in this place
>>>>> [times=1024/32]
>>>>>>            ---> shrink_page_list
>>>>>
>>>>> Can you actually measure this to be the culprit. Because we should rethink
>>>>> our call path if it is too complicated/deep to perform well.
>>>>> Adding arbitrary batch sizes doesn't sound like a good way to go to me.
>>>>
>>>> Ok, I will try
>>>>
>>>
>>> Looping in mem_cgroup_resize_limit(), which takes memcg_limit_mutex on
>>> every iteration and so contends with lowering limits in other cgroups (on
>>> our systems, thousands), while calling try_to_free_mem_cgroup_pages() with
>>> less than SWAP_CLUSTER_MAX is lame.
>>
>> Well, if the global lock is a bottleneck in your deployments then we
>> can come up with something more clever. E.g. per hierarchy locking
>> or even drop the lock for the reclaim altogether. If we reclaim in
>> SWAP_CLUSTER_MAX batches then the potential over-reclaim risk is quite low
>> when multiple users are shrinking the same (sub)hierarchy.
>>
> 
> I don't believe this to be a bottleneck if nr_pages is increased in 
> mem_cgroup_resize_limit().
> 
>>> It would probably be best to limit the 
>>> nr_pages to the amount that needs to be reclaimed, though, rather than 
>>> over reclaiming.
>>
>> How do you achieve that? The charging path is not synchronized with the
>> shrinking one at all.
>>
> 
> The point is to get a better guess at how many pages, up to 
> SWAP_CLUSTER_MAX, that need to be reclaimed instead of 1.
> 
>>> If you wanted to be invasive, you could change page_counter_limit() to 
>>> return the count - limit, fix up the callers that look for -EBUSY, and 
>>> then use max(val, SWAP_CLUSTER_MAX) as your nr_pages.
>>
>> I am not sure I understand
>>
> 
> Have page_counter_limit() return the number of pages over limit, i.e. 
> count - limit, since it compares the two anyway.  Fix up existing callers 
> and then clamp that value to SWAP_CLUSTER_MAX in 
> mem_cgroup_resize_limit().  It's a more accurate guess than either 1 or 
> 1024.
> 

JFYI, it's never 1, it's always SWAP_CLUSTER_MAX.
See try_to_free_mem_cgroup_pages():
....	
	struct scan_control sc = {
		.nr_to_reclaim = max(nr_pages, SWAP_CLUSTER_MAX),


* Re: Re: Re: [PATCH] mm/memcontrol.c: speed up to force empty a memory cgroup
  2018-03-20 22:08               ` Andrey Ryabinin
@ 2018-03-20 22:15                 ` David Rientjes
  2018-03-20 22:35                   ` Andrey Ryabinin
  0 siblings, 1 reply; 17+ messages in thread
From: David Rientjes @ 2018-03-20 22:15 UTC (permalink / raw)
  To: Andrey Ryabinin
  Cc: Michal Hocko, Li,Rongqing, linux-kernel, linux-mm, cgroups, hannes

On Wed, 21 Mar 2018, Andrey Ryabinin wrote:

> >>> It would probably be best to limit the 
> >>> nr_pages to the amount that needs to be reclaimed, though, rather than 
> >>> over reclaiming.
> >>
> >> How do you achieve that? The charging path is not synchronized with the
> >> shrinking one at all.
> >>
> > 
> > The point is to get a better guess at how many pages, up to 
> > SWAP_CLUSTER_MAX, that need to be reclaimed instead of 1.
> > 
> >>> If you wanted to be invasive, you could change page_counter_limit() to 
> >>> return the count - limit, fix up the callers that look for -EBUSY, and 
> >>> then use max(val, SWAP_CLUSTER_MAX) as your nr_pages.
> >>
> >> I am not sure I understand
> >>
> > 
> > Have page_counter_limit() return the number of pages over limit, i.e. 
> > count - limit, since it compares the two anyway.  Fix up existing callers 
> > and then clamp that value to SWAP_CLUSTER_MAX in 
> > mem_cgroup_resize_limit().  It's a more accurate guess than either 1 or 
> > 1024.
> > 
> 
> JFYI, it's never 1, it's always SWAP_CLUSTER_MAX.
> See try_to_free_mem_cgroup_pages():
> ....	
> 	struct scan_control sc = {
> 		.nr_to_reclaim = max(nr_pages, SWAP_CLUSTER_MAX),
> 

Is SWAP_CLUSTER_MAX the best answer if I'm lowering the limit by 1GB?
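
(For scale, simple arithmetic rather than a figure from the thread:
lowering a limit by 1GB means roughly 262,144 4-KiB pages to reclaim,
i.e. 8,192 full passes through the reclaim stack at SWAP_CLUSTER_MAX
pages per call.)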


* Re: Re: Re: [PATCH] mm/memcontrol.c: speed up to force empty a memory cgroup
  2018-03-20 22:15                 ` David Rientjes
@ 2018-03-20 22:35                   ` Andrey Ryabinin
  2018-03-20 22:45                     ` David Rientjes
  2018-03-21  9:59                     ` Michal Hocko
  0 siblings, 2 replies; 17+ messages in thread
From: Andrey Ryabinin @ 2018-03-20 22:35 UTC (permalink / raw)
  To: David Rientjes
  Cc: Michal Hocko, Li,Rongqing, linux-kernel, linux-mm, cgroups, hannes

On 03/21/2018 01:15 AM, David Rientjes wrote:
> On Wed, 21 Mar 2018, Andrey Ryabinin wrote:
> 
>>>>> It would probably be best to limit the 
>>>>> nr_pages to the amount that needs to be reclaimed, though, rather than 
>>>>> over reclaiming.
>>>>
>>>> How do you achieve that? The charging path is not synchronized with the
>>>> shrinking one at all.
>>>>
>>>
>>> The point is to get a better guess at how many pages, up to 
>>> SWAP_CLUSTER_MAX, that need to be reclaimed instead of 1.
>>>
>>>>> If you wanted to be invasive, you could change page_counter_limit() to 
>>>>> return the count - limit, fix up the callers that look for -EBUSY, and 
>>>>> then use max(val, SWAP_CLUSTER_MAX) as your nr_pages.
>>>>
>>>> I am not sure I understand
>>>>
>>>
>>> Have page_counter_limit() return the number of pages over limit, i.e. 
>>> count - limit, since it compares the two anyway.  Fix up existing callers 
>>> and then clamp that value to SWAP_CLUSTER_MAX in 
>>> mem_cgroup_resize_limit().  It's a more accurate guess than either 1 or 
>>> 1024.
>>>
>>
>> JFYI, it's never 1, it's always SWAP_CLUSTER_MAX.
>> See try_to_free_mem_cgroup_pages():
>> ....	
>> 	struct scan_control sc = {
>> 		.nr_to_reclaim = max(nr_pages, SWAP_CLUSTER_MAX),
>>
> 
> Is SWAP_CLUSTER_MAX the best answer if I'm lowering the limit by 1GB?
> 

Absolutely not. I'm completely on your side here.
I've tried to fix this recently - http://lkml.kernel.org/r/20180119132544.19569-2-aryabinin@virtuozzo.com
I guess that Andrew decided not to take my patch because Michal wasn't
happy about it (see the mail archives if you want more details).


* Re: Re: Re: [PATCH] mm/memcontrol.c: speed up to force empty a memory cgroup
  2018-03-20 22:35                   ` Andrey Ryabinin
@ 2018-03-20 22:45                     ` David Rientjes
  2018-03-21  9:59                     ` Michal Hocko
  1 sibling, 0 replies; 17+ messages in thread
From: David Rientjes @ 2018-03-20 22:45 UTC (permalink / raw)
  To: Andrey Ryabinin
  Cc: Michal Hocko, Li,Rongqing, linux-kernel, linux-mm, cgroups, hannes

On Wed, 21 Mar 2018, Andrey Ryabinin wrote:

> > Is SWAP_CLUSTER_MAX the best answer if I'm lowering the limit by 1GB?
> > 
> 
> Absolutely not. I'm completely on your side here.
> I've tried to fix this recently - http://lkml.kernel.org/r/20180119132544.19569-2-aryabinin@virtuozzo.com
> I guess that Andrew decided not to take my patch because Michal wasn't
> happy about it (see the mail archives if you want more details).
> 

I unfortunately didn't see this patch in January; it seems very similar to
what I was suggesting in this thread.  You do a page_counter_read()
directly in mem_cgroup_resize_limit() where my suggestion was to have 
page_counter_limit() return the difference, but there's nothing 
significantly different about what you proposed and what I suggested.

Perhaps the patch would be better off as a compromise between what you, I, 
and Li RongQing have proposed/suggested: have page_counter_limit() return 
the difference, and clamp it to some value proportional to the new limit.
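
One way to read that compromise, as a hypothetical fragment (the 1% cap
is purely illustrative, not something proposed in the thread; excess is
the count-minus-limit value discussed above):

        /* Hypothetical: target the real excess, floored at
         * SWAP_CLUSTER_MAX and capped proportionally to the new limit
         * (1% chosen only for illustration). */
        unsigned long cap = max_t(unsigned long, limit / 100,
                                  SWAP_CLUSTER_MAX);
        unsigned long nr_pages = clamp_t(unsigned long, excess,
                                         SWAP_CLUSTER_MAX, cap);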


* Re: Re: Re: [PATCH] mm/memcontrol.c: speed up to force empty a memory cgroup
  2018-03-20 22:35                   ` Andrey Ryabinin
  2018-03-20 22:45                     ` David Rientjes
@ 2018-03-21  9:59                     ` Michal Hocko
  1 sibling, 0 replies; 17+ messages in thread
From: Michal Hocko @ 2018-03-21  9:59 UTC (permalink / raw)
  To: Andrey Ryabinin
  Cc: David Rientjes, Li,Rongqing, linux-kernel, linux-mm, cgroups, hannes

On Wed 21-03-18 01:35:05, Andrey Ryabinin wrote:
> On 03/21/2018 01:15 AM, David Rientjes wrote:
> > On Wed, 21 Mar 2018, Andrey Ryabinin wrote:
> > 
> >>>>> It would probably be best to limit the 
> >>>>> nr_pages to the amount that needs to be reclaimed, though, rather than 
> >>>>> over reclaiming.
> >>>>
> >>>> How do you achieve that? The charging path is not synchronized with the
> >>>> shrinking one at all.
> >>>>
> >>>
> >>> The point is to get a better guess at how many pages, up to 
> >>> SWAP_CLUSTER_MAX, that need to be reclaimed instead of 1.
> >>>
> >>>>> If you wanted to be invasive, you could change page_counter_limit() to 
> >>>>> return the count - limit, fix up the callers that look for -EBUSY, and 
> >>>>> then use max(val, SWAP_CLUSTER_MAX) as your nr_pages.
> >>>>
> >>>> I am not sure I understand
> >>>>
> >>>
> >>> Have page_counter_limit() return the number of pages over limit, i.e. 
> >>> count - limit, since it compares the two anyway.  Fix up existing callers 
> >>> and then clamp that value to SWAP_CLUSTER_MAX in 
> >>> mem_cgroup_resize_limit().  It's a more accurate guess than either 1 or 
> >>> 1024.
> >>>
> >>
> >> JFYI, it's never 1, it's always SWAP_CLUSTER_MAX.
> >> See try_to_free_mem_cgroup_pages():
> >> ....	
> >> 	struct scan_control sc = {
> >> 		.nr_to_reclaim = max(nr_pages, SWAP_CLUSTER_MAX),
> >>
> > 
> > Is SWAP_CLUSTER_MAX the best answer if I'm lowering the limit by 1GB?
> > 
> 
> Absolutely not. I'm completely on your side here.
> I've tried to fix this recently - http://lkml.kernel.org/r/20180119132544.19569-2-aryabinin@virtuozzo.com
> I guess that Andrew decided not to take my patch because Michal wasn't
> happy about it (see the mail archives if you want more details).

I was unhappy about the explanation and justification of the patch. It
is still not clear to me why a single try_to_free_mem_cgroup_pages call
with a large target should be faster than multiple calls of this function
with smaller batches when the real reclaim is still SWAP_CLUSTER_MAX
batched.

There is also a theoretical risk of over-reclaim, especially with large
targets.

-- 
Michal Hocko
SUSE Labs


* Re: Re: [PATCH] mm/memcontrol.c: speed up to force empty a memory cgroup
  2018-03-19 10:51       ` Re: " Li,Rongqing
  2018-03-19 17:51         ` David Rientjes
@ 2018-03-23  2:58         ` Li,Rongqing
  2018-03-23 10:08           ` Michal Hocko
  1 sibling, 1 reply; 17+ messages in thread
From: Li,Rongqing @ 2018-03-23  2:58 UTC (permalink / raw)
  To: Li,Rongqing, Michal Hocko
  Cc: linux-kernel, linux-mm, cgroups, hannes, Andrey Ryabinin



> -----Original Message-----
> From: linux-kernel-owner@vger.kernel.org
> [mailto:linux-kernel-owner@vger.kernel.org] On Behalf Of Li,Rongqing
> Sent: 2018-03-19 18:52
> To: Michal Hocko <mhocko@kernel.org>
> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> cgroups@vger.kernel.org; hannes@cmpxchg.org; Andrey Ryabinin
> <aryabinin@virtuozzo.com>
> Subject: Re: Re: [PATCH] mm/memcontrol.c: speed up to force empty a
> memory cgroup
> 
> 
> 
> > -----Original Message-----
> > From: Michal Hocko [mailto:mhocko@kernel.org]
> > Sent: 2018-03-19 18:38
> > To: Li,Rongqing <lirongqing@baidu.com>
> > Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> > cgroups@vger.kernel.org; hannes@cmpxchg.org; Andrey Ryabinin
> > <aryabinin@virtuozzo.com>
> > Subject: Re: Re: [PATCH] mm/memcontrol.c: speed up to force empty a
> > memory cgroup
> >
> > On Mon 19-03-18 10:00:41, Li,Rongqing wrote:
> > >
> > >
> > > > -----Original Message-----
> > > > From: Michal Hocko [mailto:mhocko@kernel.org]
> > > > Sent: 2018-03-19 16:54
> > > > To: Li,Rongqing <lirongqing@baidu.com>
> > > > Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> > > > cgroups@vger.kernel.org; hannes@cmpxchg.org; Andrey Ryabinin
> > > > <aryabinin@virtuozzo.com>
> > > > Subject: Re: [PATCH] mm/memcontrol.c: speed up to force empty a
> > > > memory cgroup
> > > >
> > > > On Mon 19-03-18 16:29:30, Li RongQing wrote:
> > > > > mem_cgroup_force_empty() tries to free only 32
> > (SWAP_CLUSTER_MAX)
> > > > > pages on each iteration, if a memory cgroup has lots of page
> > > > > cache, it will take many iterations to empty all page cache, so
> > > > > increase the reclaimed number per iteration to speed it up. same
> > > > > as in
> > > > > mem_cgroup_resize_limit()
> > > > >
> > > > > a simple test shows:
> > > > >
> > > > >   $dd if=aaa  of=bbb  bs=1k count=3886080
> > > > >   $rm -f bbb
> > > > >   $time echo 100000000 >/cgroup/memory/test/memory.limit_in_bytes
> > > > >
> > > > > Before: 0m0.252s ===> after: 0m0.178s
> > > >
> > > > Andrey was proposing something similar [1]. My main objection was
> > > > that his approach might lead to over-reclaim. Your approach is
> > > > more conservative because it just increases the batch size. The
> > > > size is still rather arbitrary. Same as SWAP_CLUSTER_MAX but that
> > > > one is a commonly used unit of reclaim in the MM code.
> > > >
> > > > I would be really curious about a more detailed explanation of why
> > > > having a larger batch yields better performance because we
> > > > are doing SWAP_CLUSTER_MAX batches at the lower reclaim level
> > > > anyway.
> > > >
> > >
> > > Although SWAP_CLUSTER_MAX is used at the lower level, but the call
> > > stack of try_to_free_mem_cgroup_pages is too long, increase the
> > > nr_to_reclaim can reduce times of calling
> > > function[do_try_to_free_pages, shrink_zones, shrink_node ]
> > >
> > > mem_cgroup_resize_limit
> > > --->try_to_free_mem_cgroup_pages:  .nr_to_reclaim = max(1024,
> > > --->SWAP_CLUSTER_MAX),
> > >    ---> do_try_to_free_pages
> > >      ---> shrink_zones
> > >       --->shrink_node
> > >        ---> shrink_node_memcg
> > >          ---> shrink_list          <-------loop will happen in this place
> > [times=1024/32]
> > >            ---> shrink_page_list
> >
> > Can you actually measure this to be the culprit. Because we should
> > rethink our call path if it is too complicated/deep to perform well.
> > Adding arbitrary batch sizes doesn't sound like a good way to go to me.
> 
> Ok, I will try
> 
http://pasted.co/4edbcfff

This is the result of an ftrace function graph; it may show that the deep
call path leads to low performance.

And when we increase the number of pages reclaimed per
try_to_free_mem_cgroup_pages call, it reduces the number of shrink_slab
calls, which saves time. In my case page cache occupies most of the memory
and slab is small, but shrink_slab is called every time; see the sketch
after the timing summary below.

mutex_lock 1 us

try_to_free_mem_cgroup_pages
  do_try_to_free_pages ! 185.020 us
    shrink_node  ! 116.529 us
      shrink_node_memcg   39.203 us
        shrink_inactive_list  33.960 us
      shrink_slab   72.955 us

    shrink_node  61.502 us
      shrink_node_memcg   3.955 us
      shrink_slab   54.296 us
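
The per-call shrink_slab() cost in these numbers comes from walking every
registered shrinker and asking each to count its objects; an abridged
sketch of the v4.16-era loop (mm/vmscan.c), matching the
down_read_trylock()/super_cache_count() pattern in the trace:

        /* Abridged from v4.16-era shrink_slab() (mm/vmscan.c): every
         * reclaim pass walks all registered shrinkers, and each
         * superblock shrinker counts its dentry/inode LRUs even when
         * they are empty. */
        static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
                                         struct mem_cgroup *memcg,
                                         unsigned long nr_scanned,
                                         unsigned long nr_eligible)
        {
                struct shrinker *shrinker;
                unsigned long freed = 0;

                if (!down_read_trylock(&shrinker_rwsem))
                        goto out;

                list_for_each_entry(shrinker, &shrinker_list, list) {
                        struct shrink_control sc = {
                                .gfp_mask = gfp_mask,
                                .nid = nid,
                                .memcg = memcg,
                        };

                        /* do_shrink_slab() starts with ->count_objects(),
                         * i.e. super_cache_count() for every mounted
                         * filesystem */
                        freed += do_shrink_slab(&sc, shrinker,
                                                nr_scanned, nr_eligible);
                }

                up_read(&shrinker_rwsem);
        out:
                cond_resched();
                return freed;
        }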

-RongQing

> -RongQing
> > --
> > Michal Hocko
> > SUSE Labs


* Re: Re: Re: [PATCH] mm/memcontrol.c: speed up to force empty a memory cgroup
  2018-03-23  2:58         ` Li,Rongqing
@ 2018-03-23 10:08           ` Michal Hocko
  2018-03-23 12:04             ` Re: " Li,Rongqing
  0 siblings, 1 reply; 17+ messages in thread
From: Michal Hocko @ 2018-03-23 10:08 UTC (permalink / raw)
  To: Li,Rongqing; +Cc: linux-kernel, linux-mm, cgroups, hannes, Andrey Ryabinin

On Fri 23-03-18 02:58:36, Li,Rongqing wrote:
> 
> 
> > -----Original Message-----
> > From: linux-kernel-owner@vger.kernel.org
> > [mailto:linux-kernel-owner@vger.kernel.org] On Behalf Of Li,Rongqing
> > Sent: 2018-03-19 18:52
> > To: Michal Hocko <mhocko@kernel.org>
> > Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> > cgroups@vger.kernel.org; hannes@cmpxchg.org; Andrey Ryabinin
> > <aryabinin@virtuozzo.com>
> > Subject: Re: Re: [PATCH] mm/memcontrol.c: speed up to force empty a
> > memory cgroup
> > 
> > 
> > 
> > > -----Original Message-----
> > > From: Michal Hocko [mailto:mhocko@kernel.org]
> > > Sent: 2018-03-19 18:38
> > > To: Li,Rongqing <lirongqing@baidu.com>
> > > Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> > > cgroups@vger.kernel.org; hannes@cmpxchg.org; Andrey Ryabinin
> > > <aryabinin@virtuozzo.com>
> > > Subject: Re: Re: [PATCH] mm/memcontrol.c: speed up to force empty a
> > > memory cgroup
> > >
> > > On Mon 19-03-18 10:00:41, Li,Rongqing wrote:
> > > >
> > > >
> > > > > -----Original Message-----
> > > > > From: Michal Hocko [mailto:mhocko@kernel.org]
> > > > > Sent: 2018-03-19 16:54
> > > > > To: Li,Rongqing <lirongqing@baidu.com>
> > > > > Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> > > > > cgroups@vger.kernel.org; hannes@cmpxchg.org; Andrey Ryabinin
> > > > > <aryabinin@virtuozzo.com>
> > > > > Subject: Re: [PATCH] mm/memcontrol.c: speed up to force empty a
> > > > > memory cgroup
> > > > >
> > > > > On Mon 19-03-18 16:29:30, Li RongQing wrote:
> > > > > > mem_cgroup_force_empty() tries to free only 32
> > > (SWAP_CLUSTER_MAX)
> > > > > > pages on each iteration, if a memory cgroup has lots of page
> > > > > > cache, it will take many iterations to empty all page cache, so
> > > > > > increase the reclaimed number per iteration to speed it up. same
> > > > > > as in
> > > > > > mem_cgroup_resize_limit()
> > > > > >
> > > > > > a simple test shows:
> > > > > >
> > > > > >   $dd if=aaa  of=bbb  bs=1k count=3886080
> > > > > >   $rm -f bbb
> > > > > >   $time echo 100000000 >/cgroup/memory/test/memory.limit_in_bytes
> > > > > >
> > > > > > Before: 0m0.252s ===> after: 0m0.178s
> > > > >
> > > > > Andrey was proposing something similar [1]. My main objection was
> > > > > that his approach might lead to over-reclaim. Your approach is
> > > > > more conservative because it just increases the batch size. The
> > > > > size is still rather arbitrary. Same as SWAP_CLUSTER_MAX but that
> > > > > one is a commonly used unit of reclaim in the MM code.
> > > > >
> > > > > I would be really curious about a more detailed explanation of why
> > > > > having a larger batch yields better performance because we
> > > > > are doing SWAP_CLUSTER_MAX batches at the lower reclaim level
> > > > > anyway.
> > > > >
> > > >
> > > > Although SWAP_CLUSTER_MAX is used at the lower level, but the call
> > > > stack of try_to_free_mem_cgroup_pages is too long, increase the
> > > > nr_to_reclaim can reduce times of calling
> > > > function[do_try_to_free_pages, shrink_zones, shrink_node ]
> > > >
> > > > mem_cgroup_resize_limit
> > > > --->try_to_free_mem_cgroup_pages:  .nr_to_reclaim = max(1024,
> > > > --->SWAP_CLUSTER_MAX),
> > > >    ---> do_try_to_free_pages
> > > >      ---> shrink_zones
> > > >       --->shrink_node
> > > >        ---> shrink_node_memcg
> > > >          ---> shrink_list          <-------loop will happen in this place
> > > [times=1024/32]
> > > >            ---> shrink_page_list
> > >
> > > Can you actually measure this to be the culprit. Because we should
> > > rethink our call path if it is too complicated/deep to perform well.
> > > Adding arbitrary batch sizes doesn't sound like a good way to go to me.
> > 
> > Ok, I will try
> > 
> http://pasted.co/4edbcfff
> 
> This is the result of an ftrace function graph; it may show that the
> deep call path leads to low performance.

Does it? Let's have a look at the condensed output:
  6)               |    try_to_free_mem_cgroup_pages() {
  6)               |      mem_cgroup_select_victim_node() {
  6)   0.320 us    |        mem_cgroup_node_nr_lru_pages();
  6)   0.151 us    |        mem_cgroup_node_nr_lru_pages();
  6)   2.190 us    |      }
  6)               |      do_try_to_free_pages() {
  6)               |        shrink_node() {
  6)               |          shrink_node_memcg() {
  6)               |            shrink_inactive_list() {
  6) + 23.131 us   |              shrink_page_list();
  6) + 33.960 us   |            }
  6) + 39.203 us   |          }
  6)               |          shrink_slab() {
  6) + 72.955 us   |          }
  6) ! 116.529 us  |        }
  6)               |        shrink_node() {
  6)   0.050 us    |          mem_cgroup_iter();
  6)   0.035 us    |          mem_cgroup_low();
  6)               |          shrink_node_memcg() {
  6)   3.955 us    |          }
  6)               |          shrink_slab() {
  6) + 54.296 us   |          }
  6) + 61.502 us   |        }
  6) ! 185.020 us  |      }
  6) ! 188.165 us  |    }

try_to_free_mem_cgroup_pages is the full memcg reclaim path, taking
188.165 us. The pure reclaim path is shrink_node, and that took 116+61 = 177 us.
So we have ~11 us spent on the way. Is this really making such a difference?
How does the profile look when we do larger batches?

> And when we increase the number of pages reclaimed per
> try_to_free_mem_cgroup_pages call, it reduces the number of shrink_slab
> calls, which saves time. In my case page cache occupies most of the
> memory and slab is small, but shrink_slab is called every time

OK, that makes more sense! shrink_slab is clearly visible here. It is
more expensive than the page reclaim. This is something to look into.

Thanks!
-- 
Michal Hocko
SUSE Labs


* Re: [PATCH] mm/memcontrol.c: speed up to force empty a memory cgroup
  2018-03-19  8:29 [PATCH] mm/memcontrol.c: speed up to force empty a memory cgroup Li RongQing
  2018-03-19  8:53 ` Michal Hocko
@ 2018-03-23 10:34 ` Michal Hocko
  1 sibling, 0 replies; 17+ messages in thread
From: Michal Hocko @ 2018-03-23 10:34 UTC (permalink / raw)
  To: Li RongQing; +Cc: linux-kernel, linux-mm, cgroups, hannes, Andrey Ryabinin

On Mon 19-03-18 16:29:30, Li RongQing wrote:
> mem_cgroup_force_empty() tries to free only 32 (SWAP_CLUSTER_MAX) pages
> on each iteration. If a memory cgroup has lots of page cache, it will
> take many iterations to empty all of it, so increase the number of pages
> reclaimed per iteration to speed it up, the same as in
> mem_cgroup_resize_limit().
>
> a simple test shows:
> 
>   $dd if=aaa  of=bbb  bs=1k count=3886080
>   $rm -f bbb
>   $time echo 100000000 >/cgroup/memory/test/memory.limit_in_bytes
>
> Before: 0m0.252s ===> after: 0m0.178s

One more note. I have only now realized that increasing the batch size
might have another negative side effect. Memcg reclaim bails out early
when the required target has been reclaimed, so we might skip memcgs
in the hierarchy and could end up hammering one child in the hierarchy
much more than the others. Our current code is not ideal and we work
around this with a smaller target and by caching the last reclaimed memcg
so that the imbalance is at least not so visible.

This is not something that couldn't be fixed, and maybe a 1M chunk would be
acceptable as well. I dunno. Let's focus on the main bottleneck first
before we start doing these changes, though.
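
The bail-out in question, abridged from the v4.16-era memcg walk in
shrink_node() (mm/vmscan.c); once the target is met the hierarchy walk
stops, so with a large target the children visited first absorb most of
the reclaim:

        memcg = mem_cgroup_iter(root, NULL, &reclaim);
        do {
                /* ... skip memcgs under their low limit, etc. ... */
                shrink_node_memcg(pgdat, memcg, sc, &lru_pages);

                /* Targeted reclaim stops iterating the hierarchy as
                 * soon as nr_to_reclaim is met; later siblings may
                 * never be visited. */
                if (!global_reclaim(sc) &&
                    sc->nr_reclaimed >= sc->nr_to_reclaim) {
                        mem_cgroup_iter_break(root, memcg);
                        break;
                }
        } while ((memcg = mem_cgroup_iter(root, memcg, &reclaim)));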
-- 
Michal Hocko
SUSE Labs


* Re: Re: Re: [PATCH] mm/memcontrol.c: speed up to force empty a memory cgroup
  2018-03-23 10:08           ` Michal Hocko
@ 2018-03-23 12:04             ` Li,Rongqing
  2018-03-23 12:29               ` Michal Hocko
  0 siblings, 1 reply; 17+ messages in thread
From: Li,Rongqing @ 2018-03-23 12:04 UTC (permalink / raw)
  To: Michal Hocko; +Cc: linux-kernel, linux-mm, cgroups, hannes, Andrey Ryabinin



> -----Original Message-----
> From: Michal Hocko [mailto:mhocko@kernel.org]
> Sent: 2018-03-23 18:09
> To: Li,Rongqing <lirongqing@baidu.com>
> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> cgroups@vger.kernel.org; hannes@cmpxchg.org; Andrey Ryabinin
> <aryabinin@virtuozzo.com>
> Subject: Re: Re: Re: [PATCH] mm/memcontrol.c: speed up to force empty
> a memory cgroup
> 
> On Fri 23-03-18 02:58:36, Li,Rongqing wrote:
> >
> >
> > > -----Original Message-----
> > > From: linux-kernel-owner@vger.kernel.org
> > > [mailto:linux-kernel-owner@vger.kernel.org] On Behalf Of Li,Rongqing
> > > Sent: 2018-03-19 18:52
> > > To: Michal Hocko <mhocko@kernel.org>
> > > Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> > > cgroups@vger.kernel.org; hannes@cmpxchg.org; Andrey Ryabinin
> > > <aryabinin@virtuozzo.com>
> > > Subject: Re: Re: [PATCH] mm/memcontrol.c: speed up to force empty a
> > > memory cgroup
> > >
> > >
> > >
> > > > -----Original Message-----
> > > > From: Michal Hocko [mailto:mhocko@kernel.org]
> > > > Sent: 2018-03-19 18:38
> > > > To: Li,Rongqing <lirongqing@baidu.com>
> > > > Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> > > > cgroups@vger.kernel.org; hannes@cmpxchg.org; Andrey Ryabinin
> > > > <aryabinin@virtuozzo.com>
> > > > Subject: Re: Re: [PATCH] mm/memcontrol.c: speed up to force empty a
> > > > memory cgroup
> > > >
> > > > On Mon 19-03-18 10:00:41, Li,Rongqing wrote:
> > > > >
> > > > >
> > > > > > -----Original Message-----
> > > > > > From: Michal Hocko [mailto:mhocko@kernel.org]
> > > > > > Sent: 2018-03-19 16:54
> > > > > > To: Li,Rongqing <lirongqing@baidu.com>
> > > > > > Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> > > > > > cgroups@vger.kernel.org; hannes@cmpxchg.org; Andrey Ryabinin
> > > > > > <aryabinin@virtuozzo.com>
> > > > > > Subject: Re: [PATCH] mm/memcontrol.c: speed up to force empty a
> > > > > > memory cgroup
> > > > > >
> > > > > > On Mon 19-03-18 16:29:30, Li RongQing wrote:
> > > > > > > mem_cgroup_force_empty() tries to free only 32
> > > > (SWAP_CLUSTER_MAX)
> > > > > > > pages on each iteration, if a memory cgroup has lots of page
> > > > > > > cache, it will take many iterations to empty all page cache,
> > > > > > > so increase the reclaimed number per iteration to speed it
> > > > > > > up. same as in
> > > > > > > mem_cgroup_resize_limit()
> > > > > > >
> > > > > > > a simple test shows:
> > > > > > >
> > > > > > >   $dd if=aaa  of=bbb  bs=1k count=3886080
> > > > > > >   $rm -f bbb
> > > > > > >   $time echo 100000000 >/cgroup/memory/test/memory.limit_in_bytes
> > > > > > >
> > > > > > > Before: 0m0.252s ===> after: 0m0.178s
> > > > > >
> > > > > > Andrey was proposing something similar [1]. My main objection
> > > > > > was that his approach might lead to over-reclaim. Your
> > > > > > approach is more conservative because it just increases the
> > > > > > batch size. The size is still rather arbitrary. Same as
> > > > > > SWAP_CLUSTER_MAX but that one is a commonly used unit of
> reclaim in the MM code.
> > > > > >
> > > > > > I would be really curious about a more detailed explanation of
> > > > > > why having a larger batch yields better performance because
> > > > > > we are doing SWAP_CLUSTER_MAX batches at the lower reclaim
> > > > > > level anyway.
> > > > > >
> > > > >
> > > > > Although SWAP_CLUSTER_MAX is used at the lower level, but the
> > > > > call stack of try_to_free_mem_cgroup_pages is too long, increase
> > > > > the nr_to_reclaim can reduce times of calling
> > > > > function[do_try_to_free_pages, shrink_zones, shrink_node ]
> > > > >
> > > > > mem_cgroup_resize_limit
> > > > > --->try_to_free_mem_cgroup_pages:  .nr_to_reclaim = max(1024,
> > > > > --->SWAP_CLUSTER_MAX),
> > > > >    ---> do_try_to_free_pages
> > > > >      ---> shrink_zones
> > > > >       --->shrink_node
> > > > >        ---> shrink_node_memcg
> > > > >          ---> shrink_list          <-------loop will happen in this
> place
> > > > [times=1024/32]
> > > > >            ---> shrink_page_list
> > > >
> > > > Can you actually measure this to be the culprit. Because we should
> > > > rethink our call path if it is too complicated/deep to perform well.
> > > > Adding arbitrary batch sizes doesn't sound like a good way to go to me.
> > >
> > > Ok, I will try
> > >
> > http://pasted.co/4edbcfff
> >
> > This is the result of an ftrace function graph; it may show that the
> > deep call path leads to low performance.
> 
> Does it? Let's have a look at the condensed output:
>   6)               |    try_to_free_mem_cgroup_pages() {
>   6)               |      mem_cgroup_select_victim_node() {
>   6)   0.320 us    |        mem_cgroup_node_nr_lru_pages();
>   6)   0.151 us    |        mem_cgroup_node_nr_lru_pages();
>   6)   2.190 us    |      }
>   6)               |      do_try_to_free_pages() {
>   6)               |        shrink_node() {
>   6)               |          shrink_node_memcg() {
>   6)               |            shrink_inactive_list() {
>   6) + 23.131 us   |              shrink_page_list();
>   6) + 33.960 us   |            }
>   6) + 39.203 us   |          }
>   6)               |          shrink_slab() {
>   6) + 72.955 us   |          }
>   6) ! 116.529 us  |        }
>   6)               |        shrink_node() {
>   6)   0.050 us    |          mem_cgroup_iter();
>   6)   0.035 us    |          mem_cgroup_low();
>   6)               |          shrink_node_memcg() {
>   6)   3.955 us    |          }
>   6)               |          shrink_slab() {
>   6) + 54.296 us   |          }
>   6) + 61.502 us   |        }
>   6) ! 185.020 us  |      }
>   6) ! 188.165 us  |    }
> 
> try_to_free_mem_cgroup_pages is the full memcg reclaim path, taking
> 188.165 us. The pure reclaim path is shrink_node, and that took 116+61 =
> 177 us.
> So we have ~11 us spent on the way. Is this really making such a difference?
> How does the profile look when we do larger batches?
> 
> > And when we increase the number of pages reclaimed per
> > try_to_free_mem_cgroup_pages call, it reduces the number of shrink_slab
> > calls, which saves time. In my case page cache occupies most of the
> > memory and slab is small, but shrink_slab is called every time
> 
> OK, that makes more sense! shrink_slab is clearly visible here. It is more
> expensive than the page reclaim. This is something to look into.
> 

  shrink_slab() {
  6)   0.175 us    |            down_read_trylock();
  6)               |            super_cache_count() {
  6)   0.642 us    |              list_lru_count_one();
  6)   0.587 us    |              list_lru_count_one();
  6)   3.740 us    |            }
  6)               |            super_cache_count() {
  6)   0.625 us    |              list_lru_count_one();
  6)   0.485 us    |              list_lru_count_one();
  6)   2.145 us    |            }
  6)               |            super_cache_count() {
  6)   0.333 us    |              list_lru_count_one();
  6)   0.334 us    |              list_lru_count_one();
  6)   2.109 us    |            }
  6)               |            super_cache_count() {
  6)   0.062 us    |              list_lru_count_one();
  6)   0.188 us    |              list_lru_count_one();
  6)   1.216 us    |            }
  6)               |            super_cache_count() {
  6)   0.217 us    |              list_lru_count_one();
  6)   0.056 us    |              list_lru_count_one();
  6)   1.282 us    |            }
  6)               |            super_cache_count() {
  6)   0.204 us    |              list_lru_count_one();
  6)   0.205 us    |              list_lru_count_one();
  6)   1.237 us    |            }
  6)               |            super_cache_count() {
  6)   0.596 us    |              list_lru_count_one();
  6)   0.493 us    |              list_lru_count_one();
  6)   2.140 us    |            }
  6)               |            super_cache_count() {
  6)   0.130 us    |              list_lru_count_one();
  6)   0.056 us    |              list_lru_count_one();
  6)   1.260 us    |            }
  6)               |            super_cache_count() {
  6)   0.385 us    |              list_lru_count_one();
  6)   0.054 us    |              list_lru_count_one();
  6)   1.186 us    |            }
  6)               |            super_cache_count() {
  6)   0.304 us    |              list_lru_count_one();
  6)   0.286 us    |              list_lru_count_one();
  6)   1.550 us    |            }
  6)               |            super_cache_count() {
  6)   0.230 us    |              list_lru_count_one();
  6)   0.128 us    |              list_lru_count_one();
  6)   1.408 us    |            }
  6)               |            super_cache_count() {
  6)   0.392 us    |              list_lru_count_one();
  6)   0.132 us    |              list_lru_count_one();
  6)   1.694 us    |            }
  6)               |            super_cache_count() {
  6)   0.257 us    |              list_lru_count_one();
  6)   0.258 us    |              list_lru_count_one();
  6)   1.510 us    |            }
  6)               |            super_cache_count() {
  6)   0.132 us    |              list_lru_count_one();
  6)   0.132 us    |              list_lru_count_one();
  6)   1.361 us    |            }
  6)               |            super_cache_count() {
  6)   0.130 us    |              list_lru_count_one();
  6)   0.130 us    |              list_lru_count_one();
  6)   1.246 us    |            }
  6)               |            count_shadow_nodes() {
  6)   0.203 us    |              list_lru_count_one();
  6)   0.042 us    |              mem_cgroup_node_nr_lru_pages();
  6)   1.131 us    |            }
  6)               |            super_cache_count() {
  6)   0.202 us    |              list_lru_count_one();
  6)   0.056 us    |              list_lru_count_one();
  6)   1.115 us    |            }
  6)               |            super_cache_count() {
  6)   0.055 us    |              list_lru_count_one();
  6)   0.107 us    |              list_lru_count_one();
  6)   0.958 us    |            }
  6)               |            super_cache_count() {
  6)   0.147 us    |              list_lru_count_one();
  6)   0.150 us    |              list_lru_count_one();
  6)   1.474 us    |            }
  6)               |            super_cache_count() {
  6)   0.491 us    |              list_lru_count_one();
  6)   0.485 us    |              list_lru_count_one();
  6)   2.569 us    |            }
  6)               |            super_cache_count() {
  6)   0.605 us    |              list_lru_count_one();
  6)   0.590 us    |              list_lru_count_one();
  6)   2.136 us    |            }
  6)               |            super_cache_count() {
  6)   0.572 us    |              list_lru_count_one();
  6)   0.418 us    |              list_lru_count_one();
  6)   1.914 us    |            }
  6)               |            super_cache_count() {
  6)   0.428 us    |              list_lru_count_one();
  6)   0.358 us    |              list_lru_count_one();
  6)   2.073 us    |            } /* super_cache_count */
  6)               |            super_cache_count() {
  6)   0.422 us    |              list_lru_count_one();
  6)   0.433 us    |              list_lru_count_one();
  6)   1.604 us    |            }
  6)               |            super_cache_count() {
  6)   0.532 us    |              list_lru_count_one();
  6)   0.280 us    |              list_lru_count_one();
  6)   1.523 us    |            }
  6)               |            super_cache_count() {
  6)   0.422 us    |              list_lru_count_one();
  6)   0.574 us    |              list_lru_count_one();
  6)   1.554 us    |            }
  6)               |            super_cache_count() {
  6)   0.565 us    |              list_lru_count_one();
  6)   0.587 us    |              list_lru_count_one();
  6)   1.878 us    |            }
  6)               |            super_cache_count() {
  6)   0.563 us    |              list_lru_count_one();
  6)   0.558 us    |              list_lru_count_one();
  6)   1.949 us    |            }
  6)               |            super_cache_count() {
  6)   0.468 us    |              list_lru_count_one();
  6)   0.476 us    |              list_lru_count_one();
  6)   2.149 us    |            }
  6)               |            super_cache_count() {
  6)   0.443 us    |              list_lru_count_one();
  6)   0.483 us    |              list_lru_count_one();
  6)   2.283 us    |            }
  6)               |            super_cache_count() {
  6)   0.332 us    |              list_lru_count_one();
  6)   0.228 us    |              list_lru_count_one();
  6)   1.307 us    |            }
  6)               |            super_cache_count() {
  6)   0.532 us    |              list_lru_count_one();
  6)   0.367 us    |              list_lru_count_one();
  6)   1.956 us    |            }
  6)   0.036 us    |            up_read();
  6)   0.038 us    |            _cond_resched();
  6) + 72.955 us   |          }

shrink_slab does not reclaim any memory here, but takes lots of time to
count the LRUs.

Maybe we can use the return value of shrink_slab to control whether the
next shrink_slab should be called?


Or define a lightweight list_lru_empty() to check whether sb->s_dentry_lru
is empty before calling list_lru_shrink_count(), like below:

diff --git a/fs/super.c b/fs/super.c
index 672538ca9831..954c22338833 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -130,8 +130,10 @@ static unsigned long super_cache_count(struct shrinker *shrink,
        if (sb->s_op && sb->s_op->nr_cached_objects)
                total_objects = sb->s_op->nr_cached_objects(sb, sc);
 
-       total_objects += list_lru_shrink_count(&sb->s_dentry_lru, sc);
-       total_objects += list_lru_shrink_count(&sb->s_inode_lru, sc);
+       if (!list_lru_empty(&sb->s_dentry_lru))
+               total_objects += list_lru_shrink_count(&sb->s_dentry_lru, sc);
+       if (!list_lru_empty(&sb->s_inode_lru))
+               total_objects += list_lru_shrink_count(&sb->s_inode_lru, sc);
 
        total_objects = vfs_pressure_ratio(total_objects);
        return total_objects;
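
Note that list_lru_empty() does not exist in v4.16, so the diff assumes a
new helper; a minimal lockless sketch of what it might look like
(hypothetical, and ignoring the memcg-aware per-cgroup lists for
brevity):

        /* Hypothetical helper assumed by the diff above; would live in
         * include/linux/list_lru.h. Checks only the per-node counts. */
        static inline bool list_lru_empty(struct list_lru *lru)
        {
                int nid;

                for_each_node(nid)
                        if (READ_ONCE(lru->node[nid].lru.nr_items))
                                return false;
                return true;
        }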

-RongQing


> Thanks!
> --
> Michal Hocko
> SUSE Labs


* Re: Re: Re: Re: [PATCH] mm/memcontrol.c: speed up to force empty a memory cgroup
  2018-03-23 12:04             ` Re: " Li,Rongqing
@ 2018-03-23 12:29               ` Michal Hocko
  0 siblings, 0 replies; 17+ messages in thread
From: Michal Hocko @ 2018-03-23 12:29 UTC (permalink / raw)
  To: Li,Rongqing; +Cc: linux-kernel, linux-mm, cgroups, hannes, Andrey Ryabinin

On Fri 23-03-18 12:04:16, Li,Rongqing wrote:
[...]
> shrink_slab does not reclaim any memory here, but takes lots of time to
> count the LRUs.
>
> Maybe we can use the return value of shrink_slab to control whether the
> next shrink_slab should be called?

How? Different memcgs might have different amounts of shrinkable memory.

> Or define a lightweight list_lru_empty() to check whether sb->s_dentry_lru
> is empty before calling list_lru_shrink_count(), like below:

Does it really help to improve the numbers?
 
> diff --git a/fs/super.c b/fs/super.c
> index 672538ca9831..954c22338833 100644
> --- a/fs/super.c
> +++ b/fs/super.c
> @@ -130,8 +130,10 @@ static unsigned long super_cache_count(struct shrinker *shrink,
>         if (sb->s_op && sb->s_op->nr_cached_objects)
>                 total_objects = sb->s_op->nr_cached_objects(sb, sc);
>  
> -       total_objects += list_lru_shrink_count(&sb->s_dentry_lru, sc);
> -       total_objects += list_lru_shrink_count(&sb->s_inode_lru, sc);
> +       if (!list_lru_empty(&sb->s_dentry_lru))
> +               total_objects += list_lru_shrink_count(&sb->s_dentry_lru, sc);
> +       if (!list_lru_empty(&sb->s_inode_lru))
> +               total_objects += list_lru_shrink_count(&sb->s_inode_lru, sc);
>  
>         total_objects = vfs_pressure_ratio(total_objects);
>         return total_objects;

-- 
Michal Hocko
SUSE Labs


Thread overview: 17+ messages
2018-03-19  8:29 [PATCH] mm/memcontrol.c: speed up to force empty a memory cgroup Li RongQing
2018-03-19  8:53 ` Michal Hocko
     [not found]   ` <2AD939572F25A448A3AE3CAEA61328C23745764B@BC-MAIL-M28.internal.baidu.com>
2018-03-19 10:37     ` Re: " Michal Hocko
2018-03-19 10:51       ` Re: " Li,Rongqing
2018-03-19 17:51         ` David Rientjes
2018-03-20  8:39           ` Michal Hocko
2018-03-20 20:29             ` David Rientjes
2018-03-20 22:08               ` Andrey Ryabinin
2018-03-20 22:15                 ` David Rientjes
2018-03-20 22:35                   ` Andrey Ryabinin
2018-03-20 22:45                     ` David Rientjes
2018-03-21  9:59                     ` Michal Hocko
2018-03-23  2:58         ` Li,Rongqing
2018-03-23 10:08           ` Michal Hocko
2018-03-23 12:04             ` Re: " Li,Rongqing
2018-03-23 12:29               ` Michal Hocko
2018-03-23 10:34 ` Michal Hocko
