linux-kernel.vger.kernel.org archive mirror
* [RFC PATCH] mm: vmscan: do not iterate all mem cgroups for global direct reclaim
@ 2019-01-22 20:09 Yang Shi
  2019-01-23  9:59 ` Michal Hocko
  2019-01-23 10:28 ` Kirill Tkhai
  0 siblings, 2 replies; 9+ messages in thread
From: Yang Shi @ 2019-01-22 20:09 UTC (permalink / raw)
  To: mhocko, hannes, akpm; +Cc: yang.shi, linux-mm, linux-kernel

In the current implementation, both kswapd and direct reclaim have to
iterate over all mem cgroups.  This was not a problem before offline mem
cgroups could be iterated, but now, with offline mem cgroups included in
the iteration, it can be very time consuming.  In our workloads, we saw
over 400K mem cgroups accumulated in some cases, while only a few
hundred of them were online.  Although kswapd can help reduce the number
of memcgs, direct reclaim still gets hit with iterating over a number of
offline memcgs in some cases.  We have occasionally experienced
responsiveness problems due to this.

Here we just break the iteration once it reclaims enough pages, as
memcg direct reclaim already does.  This may hurt fairness among memcgs
since direct reclaim may always reclaim from the same memcgs.  But it
sounds acceptable since direct reclaim just tries to reclaim
SWAP_CLUSTER_MAX pages and memcgs can be protected by min/low.

Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
---
 mm/vmscan.c | 7 +++----
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index a714c4f..ced5a16 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2764,16 +2764,15 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
 				   sc->nr_reclaimed - reclaimed);
 
 			/*
-			 * Direct reclaim and kswapd have to scan all memory
-			 * cgroups to fulfill the overall scan target for the
-			 * node.
+			 * Kswapd has to scan all memory cgroups to fulfill
+			 * the overall scan target for the node.
 			 *
 			 * Limit reclaim, on the other hand, only cares about
 			 * nr_to_reclaim pages to be reclaimed and it will
 			 * retry with decreasing priority if one round over the
 			 * whole hierarchy is not sufficient.
 			 */
-			if (!global_reclaim(sc) &&
+			if ((!global_reclaim(sc) || !current_is_kswapd()) &&
 					sc->nr_reclaimed >= sc->nr_to_reclaim) {
 				mem_cgroup_iter_break(root, memcg);
 				break;
-- 
1.8.3.1



* Re: [RFC PATCH] mm: vmscan: do not iterate all mem cgroups for global direct reclaim
  2019-01-22 20:09 [RFC PATCH] mm: vmscan: do not iterate all mem cgroups for global direct reclaim Yang Shi
@ 2019-01-23  9:59 ` Michal Hocko
  2019-01-23 20:24   ` Yang Shi
  2019-01-23 10:28 ` Kirill Tkhai
  1 sibling, 1 reply; 9+ messages in thread
From: Michal Hocko @ 2019-01-23  9:59 UTC (permalink / raw)
  To: Yang Shi; +Cc: hannes, akpm, linux-mm, linux-kernel

On Wed 23-01-19 04:09:42, Yang Shi wrote:
> In the current implementation, both kswapd and direct reclaim have to
> iterate over all mem cgroups.  This was not a problem before offline mem
> cgroups could be iterated, but now, with offline mem cgroups included in
> the iteration, it can be very time consuming.  In our workloads, we saw
> over 400K mem cgroups accumulated in some cases, while only a few
> hundred of them were online.  Although kswapd can help reduce the number
> of memcgs, direct reclaim still gets hit with iterating over a number of
> offline memcgs in some cases.  We have occasionally experienced
> responsiveness problems due to this.

Can you provide some numbers?

> Here we just break the iteration once it reclaims enough pages, as
> memcg direct reclaim already does.  This may hurt fairness among memcgs
> since direct reclaim may always reclaim from the same memcgs.  But it
> sounds acceptable since direct reclaim just tries to reclaim
> SWAP_CLUSTER_MAX pages and memcgs can be protected by min/low.

OK, this makes some sense to me. The purpose of the direct reclaim is
to reclaim some memory and throttle the allocation pace. The iterator is
cached so the next reclaimer on the same hierarchy will simply continue
so the fairness should be more or less achieved.

Btw. is there any reason to keep !global_reclaim() check in place? Why
is it not sufficient to exclude kswapd?

> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Michal Hocko <mhocko@suse.com>
> Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
> ---
>  mm/vmscan.c | 7 +++----
>  1 file changed, 3 insertions(+), 4 deletions(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index a714c4f..ced5a16 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2764,16 +2764,15 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
>  				   sc->nr_reclaimed - reclaimed);
>  
>  			/*
> -			 * Direct reclaim and kswapd have to scan all memory
> -			 * cgroups to fulfill the overall scan target for the
> -			 * node.
> +			 * Kswapd has to scan all memory cgroups to fulfill
> +			 * the overall scan target for the node.
>  			 *
>  			 * Limit reclaim, on the other hand, only cares about
>  			 * nr_to_reclaim pages to be reclaimed and it will
>  			 * retry with decreasing priority if one round over the
>  			 * whole hierarchy is not sufficient.
>  			 */
> -			if (!global_reclaim(sc) &&
> +			if ((!global_reclaim(sc) || !current_is_kswapd()) &&
>  					sc->nr_reclaimed >= sc->nr_to_reclaim) {
>  				mem_cgroup_iter_break(root, memcg);
>  				break;
> -- 
> 1.8.3.1
> 

-- 
Michal Hocko
SUSE Labs


* Re: [RFC PATCH] mm: vmscan: do not iterate all mem cgroups for global direct reclaim
  2019-01-22 20:09 [RFC PATCH] mm: vmscan: do not iterate all mem cgroups for global direct reclaim Yang Shi
  2019-01-23  9:59 ` Michal Hocko
@ 2019-01-23 10:28 ` Kirill Tkhai
  2019-01-23 11:02   ` Michal Hocko
  1 sibling, 1 reply; 9+ messages in thread
From: Kirill Tkhai @ 2019-01-23 10:28 UTC (permalink / raw)
  To: Yang Shi, mhocko, hannes, akpm; +Cc: linux-mm, linux-kernel

On 22.01.2019 23:09, Yang Shi wrote:
> In the current implementation, both kswapd and direct reclaim have to
> iterate over all mem cgroups.  This was not a problem before offline mem
> cgroups could be iterated, but now, with offline mem cgroups included in
> the iteration, it can be very time consuming.  In our workloads, we saw
> over 400K mem cgroups accumulated in some cases, while only a few
> hundred of them were online.  Although kswapd can help reduce the number
> of memcgs, direct reclaim still gets hit with iterating over a number of
> offline memcgs in some cases.  We have occasionally experienced
> responsiveness problems due to this.
> 
> Here we just break the iteration once it reclaims enough pages, as
> memcg direct reclaim already does.  This may hurt fairness among memcgs
> since direct reclaim may always reclaim from the same memcgs.  But it
> sounds acceptable since direct reclaim just tries to reclaim
> SWAP_CLUSTER_MAX pages and memcgs can be protected by min/low.

If we stop after SWAP_CLUSTER_MAX pages are reclaimed, the following
situation is possible: the memcgs closest to root_mem_cgroup will become
empty, and you will have to iterate over an empty memcg hierarchy for a
long time just to find a non-empty memcg.

I'd suggest we should not lose fairness.  We could introduce a
mem_cgroup::last_reclaim_child parameter to record the child (or its id)
at which the last reclaim was interrupted.  Then the next reclaim would
start from this child:

   memcg = mem_cgroup_iter(root, find_child(root->last_reclaim_child), &reclaim);
   do {  

      if ((!global_reclaim(sc) || !current_is_kswapd()) && 
           sc->nr_reclaimed >= sc->nr_to_reclaim) { 
               root->last_reclaim_child = memcg->id;
               mem_cgroup_iter_break(root, memcg);
               break; 
      }

Kirill
 
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Michal Hocko <mhocko@suse.com>
> Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
> ---
>  mm/vmscan.c | 7 +++----
>  1 file changed, 3 insertions(+), 4 deletions(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index a714c4f..ced5a16 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2764,16 +2764,15 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
>  				   sc->nr_reclaimed - reclaimed);
>  
>  			/*
> -			 * Direct reclaim and kswapd have to scan all memory
> -			 * cgroups to fulfill the overall scan target for the
> -			 * node.
> +			 * Kswapd has to scan all memory cgroups to fulfill
> +			 * the overall scan target for the node.
>  			 *
>  			 * Limit reclaim, on the other hand, only cares about
>  			 * nr_to_reclaim pages to be reclaimed and it will
>  			 * retry with decreasing priority if one round over the
>  			 * whole hierarchy is not sufficient.
>  			 */
> -			if (!global_reclaim(sc) &&
> +			if ((!global_reclaim(sc) || !current_is_kswapd()) &&
>  					sc->nr_reclaimed >= sc->nr_to_reclaim) {
>  				mem_cgroup_iter_break(root, memcg);
>  				break;
> 


* Re: [RFC PATCH] mm: vmscan: do not iterate all mem cgroups for global direct reclaim
  2019-01-23 10:28 ` Kirill Tkhai
@ 2019-01-23 11:02   ` Michal Hocko
  2019-01-23 11:05     ` Kirill Tkhai
  0 siblings, 1 reply; 9+ messages in thread
From: Michal Hocko @ 2019-01-23 11:02 UTC (permalink / raw)
  To: Kirill Tkhai; +Cc: Yang Shi, hannes, akpm, linux-mm, linux-kernel

On Wed 23-01-19 13:28:03, Kirill Tkhai wrote:
> On 22.01.2019 23:09, Yang Shi wrote:
> > In the current implementation, both kswapd and direct reclaim have to
> > iterate over all mem cgroups.  This was not a problem before offline mem
> > cgroups could be iterated, but now, with offline mem cgroups included in
> > the iteration, it can be very time consuming.  In our workloads, we saw
> > over 400K mem cgroups accumulated in some cases, while only a few
> > hundred of them were online.  Although kswapd can help reduce the number
> > of memcgs, direct reclaim still gets hit with iterating over a number of
> > offline memcgs in some cases.  We have occasionally experienced
> > responsiveness problems due to this.
> > 
> > Here we just break the iteration once it reclaims enough pages, as
> > memcg direct reclaim already does.  This may hurt fairness among memcgs
> > since direct reclaim may always reclaim from the same memcgs.  But it
> > sounds acceptable since direct reclaim just tries to reclaim
> > SWAP_CLUSTER_MAX pages and memcgs can be protected by min/low.
> 
> If we stop after SWAP_CLUSTER_MAX pages are reclaimed, the following
> situation is possible: the memcgs closest to root_mem_cgroup will become
> empty, and you will have to iterate over an empty memcg hierarchy for a
> long time just to find a non-empty memcg.
> 
> I'd suggest we should not lose fairness.  We could introduce a
> mem_cgroup::last_reclaim_child parameter to record the child (or its id)
> at which the last reclaim was interrupted.  Then the next reclaim would
> start from this child:

Why is not our reclaim_cookie based caching sufficient?

-- 
Michal Hocko
SUSE Labs


* Re: [RFC PATCH] mm: vmscan: do not iterate all mem cgroups for global direct reclaim
  2019-01-23 11:02   ` Michal Hocko
@ 2019-01-23 11:05     ` Kirill Tkhai
  2019-01-23 12:10       ` Michal Hocko
  0 siblings, 1 reply; 9+ messages in thread
From: Kirill Tkhai @ 2019-01-23 11:05 UTC (permalink / raw)
  To: Michal Hocko; +Cc: Yang Shi, hannes, akpm, linux-mm, linux-kernel

On 23.01.2019 14:02, Michal Hocko wrote:
> On Wed 23-01-19 13:28:03, Kirill Tkhai wrote:
>> On 22.01.2019 23:09, Yang Shi wrote:
>>> In the current implementation, both kswapd and direct reclaim have to
>>> iterate over all mem cgroups.  This was not a problem before offline mem
>>> cgroups could be iterated, but now, with offline mem cgroups included in
>>> the iteration, it can be very time consuming.  In our workloads, we saw
>>> over 400K mem cgroups accumulated in some cases, while only a few
>>> hundred of them were online.  Although kswapd can help reduce the number
>>> of memcgs, direct reclaim still gets hit with iterating over a number of
>>> offline memcgs in some cases.  We have occasionally experienced
>>> responsiveness problems due to this.
>>>
>>> Here we just break the iteration once it reclaims enough pages, as
>>> memcg direct reclaim already does.  This may hurt fairness among memcgs
>>> since direct reclaim may always reclaim from the same memcgs.  But it
>>> sounds acceptable since direct reclaim just tries to reclaim
>>> SWAP_CLUSTER_MAX pages and memcgs can be protected by min/low.
>>
>> If we stop after SWAP_CLUSTER_MAX pages are reclaimed, the following
>> situation is possible: the memcgs closest to root_mem_cgroup will become
>> empty, and you will have to iterate over an empty memcg hierarchy for a
>> long time just to find a non-empty memcg.
>>
>> I'd suggest we should not lose fairness.  We could introduce a
>> mem_cgroup::last_reclaim_child parameter to record the child (or its id)
>> at which the last reclaim was interrupted.  Then the next reclaim would
>> start from this child:
> 
> Why is not our reclaim_cookie based caching sufficient?

Hm, maybe I missed them. Do cookies already implement this functionality?

Kirill


* Re: [RFC PATCH] mm: vmscan: do not iterate all mem cgroups for global direct reclaim
  2019-01-23 11:05     ` Kirill Tkhai
@ 2019-01-23 12:10       ` Michal Hocko
  0 siblings, 0 replies; 9+ messages in thread
From: Michal Hocko @ 2019-01-23 12:10 UTC (permalink / raw)
  To: Kirill Tkhai; +Cc: Yang Shi, hannes, akpm, linux-mm, linux-kernel

On Wed 23-01-19 14:05:28, Kirill Tkhai wrote:
> On 23.01.2019 14:02, Michal Hocko wrote:
> > On Wed 23-01-19 13:28:03, Kirill Tkhai wrote:
> >> On 22.01.2019 23:09, Yang Shi wrote:
> >>> In the current implementation, both kswapd and direct reclaim have to
> >>> iterate over all mem cgroups.  This was not a problem before offline mem
> >>> cgroups could be iterated, but now, with offline mem cgroups included in
> >>> the iteration, it can be very time consuming.  In our workloads, we saw
> >>> over 400K mem cgroups accumulated in some cases, while only a few
> >>> hundred of them were online.  Although kswapd can help reduce the number
> >>> of memcgs, direct reclaim still gets hit with iterating over a number of
> >>> offline memcgs in some cases.  We have occasionally experienced
> >>> responsiveness problems due to this.
> >>>
>>> Here we just break the iteration once it reclaims enough pages, as
>>> memcg direct reclaim already does.  This may hurt fairness among memcgs
>>> since direct reclaim may always reclaim from the same memcgs.  But it
>>> sounds acceptable since direct reclaim just tries to reclaim
>>> SWAP_CLUSTER_MAX pages and memcgs can be protected by min/low.
> >>
> >> If we stop after SWAP_CLUSTER_MAX pages are reclaimed, the following
> >> situation is possible: the memcgs closest to root_mem_cgroup will become
> >> empty, and you will have to iterate over an empty memcg hierarchy for a
> >> long time just to find a non-empty memcg.
> >>
> >> I'd suggest we should not lose fairness.  We could introduce a
> >> mem_cgroup::last_reclaim_child parameter to record the child (or its id)
> >> at which the last reclaim was interrupted.  Then the next reclaim would
> >> start from this child:
> > 
> > Why is not our reclaim_cookie based caching sufficient?
> 
> Hm, maybe I missed them. Do cookies already implement this functionality?

Have a look at mem_cgroup_iter

-- 
Michal Hocko
SUSE Labs


* Re: [RFC PATCH] mm: vmscan: do not iterate all mem cgroups for global direct reclaim
  2019-01-23  9:59 ` Michal Hocko
@ 2019-01-23 20:24   ` Yang Shi
  2019-01-24  8:43     ` Michal Hocko
  0 siblings, 1 reply; 9+ messages in thread
From: Yang Shi @ 2019-01-23 20:24 UTC (permalink / raw)
  To: Michal Hocko; +Cc: hannes, akpm, linux-mm, linux-kernel



On 1/23/19 1:59 AM, Michal Hocko wrote:
> On Wed 23-01-19 04:09:42, Yang Shi wrote:
>> In the current implementation, both kswapd and direct reclaim have to
>> iterate over all mem cgroups.  This was not a problem before offline mem
>> cgroups could be iterated, but now, with offline mem cgroups included in
>> the iteration, it can be very time consuming.  In our workloads, we saw
>> over 400K mem cgroups accumulated in some cases, while only a few
>> hundred of them were online.  Although kswapd can help reduce the number
>> of memcgs, direct reclaim still gets hit with iterating over a number of
>> offline memcgs in some cases.  We have occasionally experienced
>> responsiveness problems due to this.
> Can you provide some numbers?

What numbers do you mean? How long did it take to iterate all the
memcgs? For now I don't have the exact numbers for the production
environment, but the unresponsiveness is visible.

I have some test numbers from artificially triggering direct reclaim
with 8k memcgs, each of which has just one clean page charged, so the
reclaim is cheaper than in a real production environment.

perf shows it took around 220ms to iterate the 8k memcgs:

              dd 13873 [011]   578.542919: vmscan:mm_vmscan_direct_reclaim_begin
              dd 13873 [011]   578.758689: vmscan:mm_vmscan_direct_reclaim_end

So iterating 400K memcgs would take at least 11s even in this
artificial case.  The production environment is much more complicated,
so it would take much longer in practice.

>
>> Here we just break the iteration once it reclaims enough pages, as
>> memcg direct reclaim already does.  This may hurt fairness among memcgs
>> since direct reclaim may always reclaim from the same memcgs.  But it
>> sounds acceptable since direct reclaim just tries to reclaim
>> SWAP_CLUSTER_MAX pages and memcgs can be protected by min/low.
> OK, this makes some sense to me. The purpose of the direct reclaim is
> to reclaim some memory and throttle the allocation pace. The iterator is
> cached so the next reclaimer on the same hierarchy will simply continue
> so the fairness should be more or less achieved.

Yes, you are right. I missed this point.

>
> Btw. is there any reason to keep !global_reclaim() check in place? Why
> is it not sufficient to exclude kswapd?

Iterating all memcgs in kswapd is still useful to help reduce those
zombie memcgs.

Thanks,
Yang

>
>> Cc: Johannes Weiner <hannes@cmpxchg.org>
>> Cc: Michal Hocko <mhocko@suse.com>
>> Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
>> ---
>>   mm/vmscan.c | 7 +++----
>>   1 file changed, 3 insertions(+), 4 deletions(-)
>>
>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>> index a714c4f..ced5a16 100644
>> --- a/mm/vmscan.c
>> +++ b/mm/vmscan.c
>> @@ -2764,16 +2764,15 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
>>   				   sc->nr_reclaimed - reclaimed);
>>   
>>   			/*
>> -			 * Direct reclaim and kswapd have to scan all memory
>> -			 * cgroups to fulfill the overall scan target for the
>> -			 * node.
>> +			 * Kswapd has to scan all memory cgroups to fulfill
>> +			 * the overall scan target for the node.
>>   			 *
>>   			 * Limit reclaim, on the other hand, only cares about
>>   			 * nr_to_reclaim pages to be reclaimed and it will
>>   			 * retry with decreasing priority if one round over the
>>   			 * whole hierarchy is not sufficient.
>>   			 */
>> -			if (!global_reclaim(sc) &&
>> +			if ((!global_reclaim(sc) || !current_is_kswapd()) &&
>>   					sc->nr_reclaimed >= sc->nr_to_reclaim) {
>>   				mem_cgroup_iter_break(root, memcg);
>>   				break;
>> -- 
>> 1.8.3.1
>>



* Re: [RFC PATCH] mm: vmscan: do not iterate all mem cgroups for global direct reclaim
  2019-01-23 20:24   ` Yang Shi
@ 2019-01-24  8:43     ` Michal Hocko
  2019-01-26  1:42       ` Yang Shi
  0 siblings, 1 reply; 9+ messages in thread
From: Michal Hocko @ 2019-01-24  8:43 UTC (permalink / raw)
  To: Yang Shi; +Cc: hannes, akpm, linux-mm, linux-kernel

On Wed 23-01-19 12:24:38, Yang Shi wrote:
> 
> 
> On 1/23/19 1:59 AM, Michal Hocko wrote:
> > On Wed 23-01-19 04:09:42, Yang Shi wrote:
> > > In the current implementation, both kswapd and direct reclaim have to
> > > iterate over all mem cgroups.  This was not a problem before offline mem
> > > cgroups could be iterated, but now, with offline mem cgroups included in
> > > the iteration, it can be very time consuming.  In our workloads, we saw
> > > over 400K mem cgroups accumulated in some cases, while only a few
> > > hundred of them were online.  Although kswapd can help reduce the number
> > > of memcgs, direct reclaim still gets hit with iterating over a number of
> > > offline memcgs in some cases.  We have occasionally experienced
> > > responsiveness problems due to this.
> > Can you provide some numbers?
> 
> What numbers do you mean? How long did it take to iterate all the
> memcgs? For now I don't have the exact numbers for the production
> environment, but the unresponsiveness is visible.

Yeah, I would be interested in the worst case direct reclaim latencies.
You can get that from our vmscan tracepoints quite easily.

> I have some test numbers from artificially triggering direct reclaim
> with 8k memcgs, each of which has just one clean page charged, so the
> reclaim is cheaper than in a real production environment.
> 
> perf shows it took around 220ms to iterate the 8k memcgs:
> 
>               dd 13873 [011]   578.542919: vmscan:mm_vmscan_direct_reclaim_begin
>               dd 13873 [011]   578.758689: vmscan:mm_vmscan_direct_reclaim_end
> 
> So iterating 400K memcgs would take at least 11s even in this
> artificial case.  The production environment is much more complicated,
> so it would take much longer in practice.

Having real world numbers would definitely help with the justification.

> > > Here we just break the iteration once it reclaims enough pages, as
> > > memcg direct reclaim already does.  This may hurt fairness among memcgs
> > > since direct reclaim may always reclaim from the same memcgs.  But it
> > > sounds acceptable since direct reclaim just tries to reclaim
> > > SWAP_CLUSTER_MAX pages and memcgs can be protected by min/low.
> > OK, this makes some sense to me. The purpose of the direct reclaim is
> > to reclaim some memory and throttle the allocation pace. The iterator is
> > cached so the next reclaimer on the same hierarchy will simply continue
> > so the fairness should be more or less achieved.
> 
> Yes, you are right. I missed this point.
> 
> > 
> > Btw. is there any reason to keep !global_reclaim() check in place? Why
> > is it not sufficient to exclude kswapd?
> 
> Iterating all memcgs in kswapd is still useful to help reduce those
> zombie memcgs.

Yes, but for that you do not need to check for global_reclaim, right?
-- 
Michal Hocko
SUSE Labs


* Re: [RFC PATCH] mm: vmscan: do not iterate all mem cgroups for global direct reclaim
  2019-01-24  8:43     ` Michal Hocko
@ 2019-01-26  1:42       ` Yang Shi
  0 siblings, 0 replies; 9+ messages in thread
From: Yang Shi @ 2019-01-26  1:42 UTC (permalink / raw)
  To: Michal Hocko; +Cc: hannes, akpm, linux-mm, linux-kernel



On 1/24/19 12:43 AM, Michal Hocko wrote:
> On Wed 23-01-19 12:24:38, Yang Shi wrote:
>>
>> On 1/23/19 1:59 AM, Michal Hocko wrote:
>>> On Wed 23-01-19 04:09:42, Yang Shi wrote:
>>>> In the current implementation, both kswapd and direct reclaim have to
>>>> iterate over all mem cgroups.  This was not a problem before offline mem
>>>> cgroups could be iterated, but now, with offline mem cgroups included in
>>>> the iteration, it can be very time consuming.  In our workloads, we saw
>>>> over 400K mem cgroups accumulated in some cases, while only a few
>>>> hundred of them were online.  Although kswapd can help reduce the number
>>>> of memcgs, direct reclaim still gets hit with iterating over a number of
>>>> offline memcgs in some cases.  We have occasionally experienced
>>>> responsiveness problems due to this.
>>> Can you provide some numbers?
>> What numbers do you mean? How long did it take to iterate all the
>> memcgs? For now I don't have the exact numbers for the production
>> environment, but the unresponsiveness is visible.
> Yeah, I would be interested in the worst case direct reclaim latencies.
> You can get that from our vmscan tracepoints quite easily.

I wish I could, but I just can't predict when the problem will happen
or on which machine, and I can't simply run perf on all machines in the
production environment.

I tried to dig into our cluster monitoring history, which records some
system behaviors.  Looking at the data, it seems excessive direct
reclaim latency may reach tens of seconds due to excessive memcgs in
some cases (how much depends on the number of memcgs and the workload
too).

And the excessive direct reclaim latency problem has been reduced 
significantly since the patch was deployed.

>
>> I have some test numbers from artificially triggering direct reclaim
>> with 8k memcgs, each of which has just one clean page charged, so the
>> reclaim is cheaper than in a real production environment.
>>
>> perf shows it took around 220ms to iterate the 8k memcgs:
>>
>>               dd 13873 [011]   578.542919: vmscan:mm_vmscan_direct_reclaim_begin
>>               dd 13873 [011]   578.758689: vmscan:mm_vmscan_direct_reclaim_end
>>
>> So iterating 400K memcgs would take at least 11s even in this
>> artificial case.  The production environment is much more complicated,
>> so it would take much longer in practice.
> Having real world numbers would definitely help with the justification.
>
>>>> Here we just break the iteration once it reclaims enough pages, as
>>>> memcg direct reclaim already does.  This may hurt fairness among memcgs
>>>> since direct reclaim may always reclaim from the same memcgs.  But it
>>>> sounds acceptable since direct reclaim just tries to reclaim
>>>> SWAP_CLUSTER_MAX pages and memcgs can be protected by min/low.
>>> OK, this makes some sense to me. The purpose of the direct reclaim is
>>> to reclaim some memory and throttle the allocation pace. The iterator is
>>> cached so the next reclaimer on the same hierarchy will simply continue
>>> so the fairness should be more or less achieved.
>> Yes, you are right. I missed this point.
>>
>>> Btw. is there any reason to keep !global_reclaim() check in place? Why
>>> is it not sufficient to exclude kswapd?
>> Iterating all memcgs in kswapd is still useful to help reduce those
>> zombie memcgs.
> Yes, but for that you do not need to check for global_reclaim, right?

Aha, yes. You are right. !current_is_kswapd() is good enough. Will fix 
this in v2.

Thanks,
Yang





Thread overview: 9+ messages
2019-01-22 20:09 [RFC PATCH] mm: vmscan: do not iterate all mem cgroups for global direct reclaim Yang Shi
2019-01-23  9:59 ` Michal Hocko
2019-01-23 20:24   ` Yang Shi
2019-01-24  8:43     ` Michal Hocko
2019-01-26  1:42       ` Yang Shi
2019-01-23 10:28 ` Kirill Tkhai
2019-01-23 11:02   ` Michal Hocko
2019-01-23 11:05     ` Kirill Tkhai
2019-01-23 12:10       ` Michal Hocko
