* [PATCH] mm: vmscan: do not share cgroup iteration between reclaimers
@ 2019-08-12 19:23 Johannes Weiner
  2019-08-12 21:07 ` Roman Gushchin
  2019-08-13 13:29 ` Michal Hocko
  0 siblings, 2 replies; 8+ messages in thread
From: Johannes Weiner @ 2019-08-12 19:23 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Michal Hocko, linux-mm, cgroups, linux-kernel, kernel-team

One of our services observed a high rate of cgroup OOM kills in the
presence of large amounts of clean cache. Debugging showed that the
culprit is the shared cgroup iteration in page reclaim.

Under high allocation concurrency, multiple threads enter reclaim at
the same time. Fearing overreclaim when we first switched from the
single global LRU to cgrouped LRU lists, we introduced a shared
iteration state for reclaim invocations - whether 1 or 20 reclaimers
are active concurrently, we only walk the cgroup tree once: the 1st
reclaimer reclaims the first cgroup, the second the second one etc.
With more reclaimers than cgroups, we start another walk from the top.
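
For reference, the shared state looks roughly like this - a simplified
sketch, not verbatim kernel code: each node keeps one cursor per reclaim
priority, and mem_cgroup_iter() resumes from that cursor whenever it is
passed a reclaim cookie.

	/* Simplified sketch of the shared cursor, not the exact code */
	struct mem_cgroup_reclaim_iter {
		struct mem_cgroup *position;	/* cursor shared by all reclaimers */
		unsigned int generation;	/* bumped on every full round trip */
	};

	/* roughly: one cursor per node and reclaim priority */
	struct mem_cgroup_per_node {
		/* ... other per-node state ... */
		struct mem_cgroup_reclaim_iter iter[DEF_PRIORITY + 1];
	};

	/*
	 * With a reclaim cookie, mem_cgroup_iter() resumes from
	 * iter->position and publishes the cgroup it returns there, so
	 * concurrent reclaimers hand each other consecutive cgroups
	 * instead of each walking the whole tree.
	 */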

This sounded reasonable at the time, but the problem is that reclaim
concurrency doesn't scale with allocation concurrency. As reclaim
concurrency increases, the amount of memory individual reclaimers get
to scan gets smaller and smaller. Individual reclaimers may only see
one cgroup per cycle, and that may not have much reclaimable
memory. We see individual reclaimers declare OOM when there is plenty
of reclaimable memory available in cgroups they didn't visit.

This patch does away with the shared iterator, and every reclaimer is
allowed to scan the full cgroup tree and see all of reclaimable
memory, just like it would on a non-cgrouped system. This way, when
OOM is declared, we know that the reclaimer actually had a chance.

To still maintain fairness in reclaim pressure, disallow cgroup
reclaim from bailing out of the tree walk early. Kswapd and regular
direct reclaim already don't bail, so it's not clear why limit reclaim
would have to, especially since it only walks subtrees to begin with.

This change completely eliminates the OOM kills on our service, while
showing no signs of overreclaim - no increased scan rates, %sys time,
or abrupt free memory spikes. I tested across 100 machines that have
64G of RAM and host about 300 cgroups each.

[ It's possible overreclaim never was a *practical* issue to begin
  with - it was simply a concern we had on the mailing lists at the
  time, with no real data to back it up. But we have also added more
  bail-out conditions deeper inside reclaim (e.g. the proportional
  exit in shrink_node_memcg) since. Regardless, now we have data that
  suggests full walks are more reliable and scale just fine. ]
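
[ For context, the proportional exit mentioned above looks roughly like
  this inside shrink_node_memcg() - a simplified sketch with the
  percentage math omitted, not a verbatim copy:

	while (nr[LRU_INACTIVE_ANON] || nr[LRU_ACTIVE_FILE] ||
	       nr[LRU_INACTIVE_FILE]) {
		for_each_evictable_lru(lru) {
			if (nr[lru]) {
				nr_to_scan = min(nr[lru], SWAP_CLUSTER_MAX);
				nr[lru] -= nr_to_scan;
				nr_reclaimed += shrink_list(lru, nr_to_scan,
							    lruvec, sc);
			}
		}

		if (nr_reclaimed < nr_to_reclaim || scan_adjusted)
			continue;

		/*
		 * Reclaim target met: stop scanning the LRU that has
		 * already done its share and finish only a proportional
		 * remainder on the other one, instead of completing the
		 * full original scan targets.
		 */
		scan_adjusted = true;
	}

  so an individual reclaimer stops well short of draining every cgroup
  it visits once it has met its target. ]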

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/vmscan.c | 22 ++--------------------
 1 file changed, 2 insertions(+), 20 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index dbdc46a84f63..b2f10fa49c88 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2667,10 +2667,6 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
 
 	do {
 		struct mem_cgroup *root = sc->target_mem_cgroup;
-		struct mem_cgroup_reclaim_cookie reclaim = {
-			.pgdat = pgdat,
-			.priority = sc->priority,
-		};
 		unsigned long node_lru_pages = 0;
 		struct mem_cgroup *memcg;
 
@@ -2679,7 +2675,7 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
 		nr_reclaimed = sc->nr_reclaimed;
 		nr_scanned = sc->nr_scanned;
 
-		memcg = mem_cgroup_iter(root, NULL, &reclaim);
+		memcg = mem_cgroup_iter(root, NULL, NULL);
 		do {
 			unsigned long lru_pages;
 			unsigned long reclaimed;
@@ -2724,21 +2720,7 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
 				   sc->nr_scanned - scanned,
 				   sc->nr_reclaimed - reclaimed);
 
-			/*
-			 * Kswapd have to scan all memory cgroups to fulfill
-			 * the overall scan target for the node.
-			 *
-			 * Limit reclaim, on the other hand, only cares about
-			 * nr_to_reclaim pages to be reclaimed and it will
-			 * retry with decreasing priority if one round over the
-			 * whole hierarchy is not sufficient.
-			 */
-			if (!current_is_kswapd() &&
-					sc->nr_reclaimed >= sc->nr_to_reclaim) {
-				mem_cgroup_iter_break(root, memcg);
-				break;
-			}
-		} while ((memcg = mem_cgroup_iter(root, memcg, &reclaim)));
+		} while ((memcg = mem_cgroup_iter(root, memcg, NULL)));
 
 		if (reclaim_state) {
 			sc->nr_reclaimed += reclaim_state->reclaimed_slab;
-- 
2.22.0



* Re: [PATCH] mm: vmscan: do not share cgroup iteration between reclaimers
  2019-08-12 19:23 [PATCH] mm: vmscan: do not share cgroup iteration between reclaimers Johannes Weiner
@ 2019-08-12 21:07 ` Roman Gushchin
  2019-08-12 23:00   ` Johannes Weiner
  2019-08-13 13:29 ` Michal Hocko
  1 sibling, 1 reply; 8+ messages in thread
From: Roman Gushchin @ 2019-08-12 21:07 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Michal Hocko, linux-mm, cgroups, linux-kernel,
	Kernel Team

On Mon, Aug 12, 2019 at 03:23:16PM -0400, Johannes Weiner wrote:
> One of our services observed a high rate of cgroup OOM kills in the
> presence of large amounts of clean cache. Debugging showed that the
> culprit is the shared cgroup iteration in page reclaim.
> 
> Under high allocation concurrency, multiple threads enter reclaim at
> the same time. Fearing overreclaim when we first switched from the
> single global LRU to cgrouped LRU lists, we introduced a shared
> iteration state for reclaim invocations - whether 1 or 20 reclaimers
> are active concurrently, we only walk the cgroup tree once: the 1st
> reclaimer reclaims the first cgroup, the second the second one etc.
> With more reclaimers than cgroups, we start another walk from the top.
> 
> This sounded reasonable at the time, but the problem is that reclaim
> concurrency doesn't scale with allocation concurrency. As reclaim
> concurrency increases, the amount of memory individual reclaimers get
> to scan gets smaller and smaller. Individual reclaimers may only see
> one cgroup per cycle, and that may not have much reclaimable
> memory. We see individual reclaimers declare OOM when there is plenty
> of reclaimable memory available in cgroups they didn't visit.

Nice catch!


> 
> This patch does away with the shared iterator, and every reclaimer is
> allowed to scan the full cgroup tree and see all of reclaimable
> memory, just like it would on a non-cgrouped system. This way, when
> OOM is declared, we know that the reclaimer actually had a chance.
> 
> To still maintain fairness in reclaim pressure, disallow cgroup
> reclaim from bailing out of the tree walk early. Kswapd and regular
> direct reclaim already don't bail, so it's not clear why limit reclaim
> would have to, especially since it only walks subtrees to begin with.
> 
> This change completely eliminates the OOM kills on our service, while
> showing no signs of overreclaim - no increased scan rates, %sys time,
> or abrupt free memory spikes. I tested across 100 machines that have
> 64G of RAM and host about 300 cgroups each.
> 
> [ It's possible overreclaim never was a *practical* issue to begin
>   with - it was simply a concern we had on the mailing lists at the
>   time, with no real data to back it up. But we have also added more
>   bail-out conditions deeper inside reclaim (e.g. the proportional
>   exit in shrink_node_memcg) since. Regardless, now we have data that
>   suggests full walks are more reliable and scale just fine. ]
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> ---
>  mm/vmscan.c | 22 ++--------------------
>  1 file changed, 2 insertions(+), 20 deletions(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index dbdc46a84f63..b2f10fa49c88 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2667,10 +2667,6 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
>  
>  	do {
>  		struct mem_cgroup *root = sc->target_mem_cgroup;
> -		struct mem_cgroup_reclaim_cookie reclaim = {
> -			.pgdat = pgdat,
> -			.priority = sc->priority,
> -		};
>  		unsigned long node_lru_pages = 0;
>  		struct mem_cgroup *memcg;
>  
> @@ -2679,7 +2675,7 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
>  		nr_reclaimed = sc->nr_reclaimed;
>  		nr_scanned = sc->nr_scanned;
>  
> -		memcg = mem_cgroup_iter(root, NULL, &reclaim);
> +		memcg = mem_cgroup_iter(root, NULL, NULL);

I wonder if we can remove the shared memcg tree walking at all? It seems
that the only use case left is the soft limit, and the same logic can be
applied to it. Then we could potentially remove a lot of code in
mem_cgroup_iter(). Just an idea...

>  		do {
>  			unsigned long lru_pages;
>  			unsigned long reclaimed;
> @@ -2724,21 +2720,7 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
>  				   sc->nr_scanned - scanned,
>  				   sc->nr_reclaimed - reclaimed);
>  
> -			/*
> -			 * Kswapd have to scan all memory cgroups to fulfill
> -			 * the overall scan target for the node.
> -			 *
> -			 * Limit reclaim, on the other hand, only cares about
> -			 * nr_to_reclaim pages to be reclaimed and it will
> -			 * retry with decreasing priority if one round over the
> -			 * whole hierarchy is not sufficient.
> -			 */
> -			if (!current_is_kswapd() &&
> -					sc->nr_reclaimed >= sc->nr_to_reclaim) {
> -				mem_cgroup_iter_break(root, memcg);
> -				break;
> -			}
> -		} while ((memcg = mem_cgroup_iter(root, memcg, &reclaim)));
> +		} while ((memcg = mem_cgroup_iter(root, memcg, NULL)));
>  
>  		if (reclaim_state) {
>  			sc->nr_reclaimed += reclaim_state->reclaimed_slab;
> -- 
> 2.22.0
>

Otherwise looks good to me!

Reviewed-by: Roman Gushchin <guro@fb.com>


* Re: [PATCH] mm: vmscan: do not share cgroup iteration between reclaimers
  2019-08-12 21:07 ` Roman Gushchin
@ 2019-08-12 23:00   ` Johannes Weiner
  0 siblings, 0 replies; 8+ messages in thread
From: Johannes Weiner @ 2019-08-12 23:00 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Andrew Morton, Michal Hocko, linux-mm, cgroups, linux-kernel,
	Kernel Team

On Mon, Aug 12, 2019 at 09:07:27PM +0000, Roman Gushchin wrote:
> On Mon, Aug 12, 2019 at 03:23:16PM -0400, Johannes Weiner wrote:
> > @@ -2679,7 +2675,7 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
> >  		nr_reclaimed = sc->nr_reclaimed;
> >  		nr_scanned = sc->nr_scanned;
> >  
> > -		memcg = mem_cgroup_iter(root, NULL, &reclaim);
> > +		memcg = mem_cgroup_iter(root, NULL, NULL);
> 
> I wonder if we can remove the shared memcg tree walking at all? It seems
> that the only use case left is the soft limit, and the same logic can be
> applied to it. Then we could potentially remove a lot of code in
> mem_cgroup_iter(). Just an idea...

It's so tempting! But soft limit reclaim starts at priority 0 right
out of the gate, so overreclaim is an actual concern there. We could
try to rework it, but it'll be hard to avoid regressions given how
awkward the semantics and behavior around the soft limit already are.
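
For reference (simplified, from memory - not a verbatim copy): the soft
limit path keeps the shared-cursor walk and goes straight to maximum
scan pressure, e.g. mem_cgroup_soft_reclaim() does roughly

	struct mem_cgroup_reclaim_cookie reclaim = {
		.pgdat = pgdat,
		.priority = 0,	/* no gradual priority ramp-up */
	};

	while (1) {
		victim = mem_cgroup_iter(root_memcg, victim, &reclaim);
		/* ... loop-control details omitted ... */

		/* mem_cgroup_shrink_node() also sets up sc.priority == 0 */
		total += mem_cgroup_shrink_node(victim, gfp_mask, false,
						pgdat, &nr_scanned);
		*total_scanned += nr_scanned;
		if (!soft_limit_excess(root_memcg))
			break;
	}

so letting every reclaimer do a full, unthrottled walk there really
could hit the whole subtree with the hammer at once.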

> >  		do {
> >  			unsigned long lru_pages;
> >  			unsigned long reclaimed;
> > @@ -2724,21 +2720,7 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
> >  				   sc->nr_scanned - scanned,
> >  				   sc->nr_reclaimed - reclaimed);
> >  
> > -			/*
> > -			 * Kswapd have to scan all memory cgroups to fulfill
> > -			 * the overall scan target for the node.
> > -			 *
> > -			 * Limit reclaim, on the other hand, only cares about
> > -			 * nr_to_reclaim pages to be reclaimed and it will
> > -			 * retry with decreasing priority if one round over the
> > -			 * whole hierarchy is not sufficient.
> > -			 */
> > -			if (!current_is_kswapd() &&
> > -					sc->nr_reclaimed >= sc->nr_to_reclaim) {
> > -				mem_cgroup_iter_break(root, memcg);
> > -				break;
> > -			}
> > -		} while ((memcg = mem_cgroup_iter(root, memcg, &reclaim)));
> > +		} while ((memcg = mem_cgroup_iter(root, memcg, NULL)));
> >  
> >  		if (reclaim_state) {
> >  			sc->nr_reclaimed += reclaim_state->reclaimed_slab;
> > -- 
> > 2.22.0
> >
> 
> Otherwise looks good to me!
> 
> Reviewed-by: Roman Gushchin <guro@fb.com>

Thanks!


* Re: [PATCH] mm: vmscan: do not share cgroup iteration between reclaimers
  2019-08-12 19:23 [PATCH] mm: vmscan: do not share cgroup iteration between reclaimers Johannes Weiner
  2019-08-12 21:07 ` Roman Gushchin
@ 2019-08-13 13:29 ` Michal Hocko
  2019-08-13 15:59   ` Yang Shi
  2019-08-13 17:12   ` Johannes Weiner
  1 sibling, 2 replies; 8+ messages in thread
From: Michal Hocko @ 2019-08-13 13:29 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, linux-mm, cgroups, linux-kernel, kernel-team

On Mon 12-08-19 15:23:16, Johannes Weiner wrote:
> One of our services observed a high rate of cgroup OOM kills in the
> presence of large amounts of clean cache. Debugging showed that the
> culprit is the shared cgroup iteration in page reclaim.
> 
> Under high allocation concurrency, multiple threads enter reclaim at
> the same time. Fearing overreclaim when we first switched from the
> single global LRU to cgrouped LRU lists, we introduced a shared
> iteration state for reclaim invocations - whether 1 or 20 reclaimers
> are active concurrently, we only walk the cgroup tree once: the 1st
> reclaimer reclaims the first cgroup, the second the second one etc.
> With more reclaimers than cgroups, we start another walk from the top.
> 
> This sounded reasonable at the time, but the problem is that reclaim
> concurrency doesn't scale with allocation concurrency. As reclaim
> concurrency increases, the amount of memory individual reclaimers get
> to scan gets smaller and smaller. Individual reclaimers may only see
> one cgroup per cycle, and that may not have much reclaimable
> memory. We see individual reclaimers declare OOM when there is plenty
> of reclaimable memory available in cgroups they didn't visit.
> 
> This patch does away with the shared iterator, and every reclaimer is
> allowed to scan the full cgroup tree and see all of reclaimable
> memory, just like it would on a non-cgrouped system. This way, when
> OOM is declared, we know that the reclaimer actually had a chance.
> 
> To still maintain fairness in reclaim pressure, disallow cgroup
> reclaim from bailing out of the tree walk early. Kswapd and regular
> direct reclaim already don't bail, so it's not clear why limit reclaim
> would have to, especially since it only walks subtrees to begin with.

The code does bail out on any direct reclaim - be it limit or page
allocator triggered. Check the !current_is_kswapd part of the condition.

> This change completely eliminates the OOM kills on our service, while
> showing no signs of overreclaim - no increased scan rates, %sys time,
> or abrupt free memory spikes. I tested across 100 machines that have
> 64G of RAM and host about 300 cgroups each.

What is the usual direct reclaim involvement on those machines?

> [ It's possible overreclaim never was a *practical* issue to begin
>   with - it was simply a concern we had on the mailing lists at the
>   time, with no real data to back it up. But we have also added more
>   bail-out conditions deeper inside reclaim (e.g. the proportional
>   exit in shrink_node_memcg) since. Regardless, now we have data that
>   suggests full walks are more reliable and scale just fine. ]

I do not see how the shrink_node_memcg bail-out helps here. We still
scan up to SWAP_CLUSTER_MAX pages for each LRU at least once, so with
the patch applied we end up with a nr_memcgs_with_pages multiplier in
the worst case.

How much that matters is another question and it depends on the
number of cgroups and the rate the direct reclaim happens. I do not
remember exact numbers but even walking a very large memcg tree was
noticeable.

As for the overreclaim part, SWAP_CLUSTER_MAX is a relatively small
number, so even hundreds of memcgs on a "reasonably" sized system
shouldn't really be observable (we are talking about 7MB per reclaim per
reclaimer on 1k memcgs with pages). This would get worse with many
reclaimers, though. Maybe we will need something like the regular direct
reclaim throttling for memcg limit reclaim as well in the future.

That being said, I do agree that the OOM side of the coin is causing
real trouble and is the problem to address first. This is especially
true with cgroup v2, where we likely have more memcgs without any pages
because inner nodes have no tasks or direct charges, which makes it more
likely that reclaimers hit memcgs without pages.

Let's see whether we see regression due to over-reclaim. 

> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

With the direct reclaim bail out reference fixed - unless I am wrong
there of course

Acked-by: Michal Hocko <mhocko@suse.com>

It is sad to see this piece of fun not being used after that many years
of bugs here and there and all the lockless fun but this is the life
;)

> ---
>  mm/vmscan.c | 22 ++--------------------
>  1 file changed, 2 insertions(+), 20 deletions(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index dbdc46a84f63..b2f10fa49c88 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2667,10 +2667,6 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
>  
>  	do {
>  		struct mem_cgroup *root = sc->target_mem_cgroup;
> -		struct mem_cgroup_reclaim_cookie reclaim = {
> -			.pgdat = pgdat,
> -			.priority = sc->priority,
> -		};
>  		unsigned long node_lru_pages = 0;
>  		struct mem_cgroup *memcg;
>  
> @@ -2679,7 +2675,7 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
>  		nr_reclaimed = sc->nr_reclaimed;
>  		nr_scanned = sc->nr_scanned;
>  
> -		memcg = mem_cgroup_iter(root, NULL, &reclaim);
> +		memcg = mem_cgroup_iter(root, NULL, NULL);
>  		do {
>  			unsigned long lru_pages;
>  			unsigned long reclaimed;
> @@ -2724,21 +2720,7 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
>  				   sc->nr_scanned - scanned,
>  				   sc->nr_reclaimed - reclaimed);
>  
> -			/*
> -			 * Kswapd have to scan all memory cgroups to fulfill
> -			 * the overall scan target for the node.
> -			 *
> -			 * Limit reclaim, on the other hand, only cares about
> -			 * nr_to_reclaim pages to be reclaimed and it will
> -			 * retry with decreasing priority if one round over the
> -			 * whole hierarchy is not sufficient.
> -			 */
> -			if (!current_is_kswapd() &&
> -					sc->nr_reclaimed >= sc->nr_to_reclaim) {
> -				mem_cgroup_iter_break(root, memcg);
> -				break;
> -			}
> -		} while ((memcg = mem_cgroup_iter(root, memcg, &reclaim)));
> +		} while ((memcg = mem_cgroup_iter(root, memcg, NULL)));
>  
>  		if (reclaim_state) {
>  			sc->nr_reclaimed += reclaim_state->reclaimed_slab;
> -- 
> 2.22.0

-- 
Michal Hocko
SUSE Labs


* Re: [PATCH] mm: vmscan: do not share cgroup iteration between reclaimers
  2019-08-13 13:29 ` Michal Hocko
@ 2019-08-13 15:59   ` Yang Shi
  2019-08-13 18:14     ` Johannes Weiner
  2019-08-13 17:12   ` Johannes Weiner
  1 sibling, 1 reply; 8+ messages in thread
From: Yang Shi @ 2019-08-13 15:59 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Johannes Weiner, Andrew Morton, Linux MM, cgroups,
	Linux Kernel Mailing List, kernel-team

On Tue, Aug 13, 2019 at 6:29 AM Michal Hocko <mhocko@kernel.org> wrote:
>
> On Mon 12-08-19 15:23:16, Johannes Weiner wrote:
> > One of our services observed a high rate of cgroup OOM kills in the
> > presence of large amounts of clean cache. Debugging showed that the
> > culprit is the shared cgroup iteration in page reclaim.
> >
> > Under high allocation concurrency, multiple threads enter reclaim at
> > the same time. Fearing overreclaim when we first switched from the
> > single global LRU to cgrouped LRU lists, we introduced a shared
> > iteration state for reclaim invocations - whether 1 or 20 reclaimers
> > are active concurrently, we only walk the cgroup tree once: the 1st
> > reclaimer reclaims the first cgroup, the second the second one etc.
> > With more reclaimers than cgroups, we start another walk from the top.
> >
> > This sounded reasonable at the time, but the problem is that reclaim
> > concurrency doesn't scale with allocation concurrency. As reclaim
> > concurrency increases, the amount of memory individual reclaimers get
> > to scan gets smaller and smaller. Individual reclaimers may only see
> > one cgroup per cycle, and that may not have much reclaimable
> > memory. We see individual reclaimers declare OOM when there is plenty
> > of reclaimable memory available in cgroups they didn't visit.
> >
> > This patch does away with the shared iterator, and every reclaimer is
> > allowed to scan the full cgroup tree and see all of reclaimable
> > memory, just like it would on a non-cgrouped system. This way, when
> > OOM is declared, we know that the reclaimer actually had a chance.
> >
> > To still maintain fairness in reclaim pressure, disallow cgroup
> > reclaim from bailing out of the tree walk early. Kswapd and regular
> > direct reclaim already don't bail, so it's not clear why limit reclaim
> > would have to, especially since it only walks subtrees to begin with.
>
> The code does bail out on any direct reclaim - be it limit or page
> allocator triggered. Check the !current_is_kswapd part of the condition.

Yes, please see commit 2bb0f34fe3c1 ("mm: vmscan: do not iterate all
mem cgroups for global direct reclaim")

>
> > This change completely eliminates the OOM kills on our service, while
> > showing no signs of overreclaim - no increased scan rates, %sys time,
> > or abrupt free memory spikes. I tested across 100 machines that have
> > 64G of RAM and host about 300 cgroups each.
>
> What is the usual direct reclaim involvement on those machines?
>
> > [ It's possible overreclaim never was a *practical* issue to begin
> >   with - it was simply a concern we had on the mailing lists at the
> >   time, with no real data to back it up. But we have also added more
> >   bail-out conditions deeper inside reclaim (e.g. the proportional
> >   exit in shrink_node_memcg) since. Regardless, now we have data that
> >   suggests full walks are more reliable and scale just fine. ]
>
> I do not see how the shrink_node_memcg bail-out helps here. We still
> scan up to SWAP_CLUSTER_MAX pages for each LRU at least once, so with
> the patch applied we end up with a nr_memcgs_with_pages multiplier in
> the worst case.
>
> How much that matters is another question and it depends on the
> number of cgroups and the rate the direct reclaim happens. I do not
> remember exact numbers but even walking a very large memcg tree was
> noticeable.

I'm concerned by this too. It might be OK for cgroup v2, but v1 still
dominates. And, considering offline memcgs, it might not be unusual to
have a quite large memcg tree.

>
> As for the overreclaim part, SWAP_CLUSTER_MAX is a relatively small
> number, so even hundreds of memcgs on a "reasonably" sized system
> shouldn't really be observable (we are talking about 7MB per reclaim per
> reclaimer on 1k memcgs with pages). This would get worse with many
> reclaimers, though. Maybe we will need something like the regular direct
> reclaim throttling for memcg limit reclaim as well in the future.
>
> That being said, I do agree that the OOM side of the coin is causing
> real trouble and is the problem to address first. This is especially
> true with cgroup v2, where we likely have more memcgs without any pages
> because inner nodes have no tasks or direct charges, which makes it more
> likely that reclaimers hit memcgs without pages.
>
> Let's see whether we see regression due to over-reclaim.
>
> > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
>
> With the direct reclaim bail out reference fixed - unless I am wrong
> there of course
>
> Acked-by: Michal Hocko <mhocko@suse.com>
>
> It is sad to see this piece of fun not being used after that many years
> of bugs here and there and all the lockless fun but this is the life
> ;)
>
> > ---
> >  mm/vmscan.c | 22 ++--------------------
> >  1 file changed, 2 insertions(+), 20 deletions(-)
> >
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index dbdc46a84f63..b2f10fa49c88 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -2667,10 +2667,6 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
> >
> >       do {
> >               struct mem_cgroup *root = sc->target_mem_cgroup;
> > -             struct mem_cgroup_reclaim_cookie reclaim = {
> > -                     .pgdat = pgdat,
> > -                     .priority = sc->priority,
> > -             };
> >               unsigned long node_lru_pages = 0;
> >               struct mem_cgroup *memcg;
> >
> > @@ -2679,7 +2675,7 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
> >               nr_reclaimed = sc->nr_reclaimed;
> >               nr_scanned = sc->nr_scanned;
> >
> > -             memcg = mem_cgroup_iter(root, NULL, &reclaim);
> > +             memcg = mem_cgroup_iter(root, NULL, NULL);
> >               do {
> >                       unsigned long lru_pages;
> >                       unsigned long reclaimed;
> > @@ -2724,21 +2720,7 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
> >                                  sc->nr_scanned - scanned,
> >                                  sc->nr_reclaimed - reclaimed);
> >
> > -                     /*
> > -                      * Kswapd have to scan all memory cgroups to fulfill
> > -                      * the overall scan target for the node.
> > -                      *
> > -                      * Limit reclaim, on the other hand, only cares about
> > -                      * nr_to_reclaim pages to be reclaimed and it will
> > -                      * retry with decreasing priority if one round over the
> > -                      * whole hierarchy is not sufficient.
> > -                      */
> > -                     if (!current_is_kswapd() &&
> > -                                     sc->nr_reclaimed >= sc->nr_to_reclaim) {
> > -                             mem_cgroup_iter_break(root, memcg);
> > -                             break;
> > -                     }
> > -             } while ((memcg = mem_cgroup_iter(root, memcg, &reclaim)));
> > +             } while ((memcg = mem_cgroup_iter(root, memcg, NULL)));
> >
> >               if (reclaim_state) {
> >                       sc->nr_reclaimed += reclaim_state->reclaimed_slab;
> > --
> > 2.22.0
>
> --
> Michal Hocko
> SUSE Labs
>


* Re: [PATCH] mm: vmscan: do not share cgroup iteration between reclaimers
  2019-08-13 13:29 ` Michal Hocko
  2019-08-13 15:59   ` Yang Shi
@ 2019-08-13 17:12   ` Johannes Weiner
  2019-08-14  8:11     ` Michal Hocko
  1 sibling, 1 reply; 8+ messages in thread
From: Johannes Weiner @ 2019-08-13 17:12 UTC (permalink / raw)
  To: Michal Hocko; +Cc: Andrew Morton, linux-mm, cgroups, linux-kernel, kernel-team

On Tue, Aug 13, 2019 at 03:29:38PM +0200, Michal Hocko wrote:
> On Mon 12-08-19 15:23:16, Johannes Weiner wrote:
> > One of our services observed a high rate of cgroup OOM kills in the
> > presence of large amounts of clean cache. Debugging showed that the
> > culprit is the shared cgroup iteration in page reclaim.
> > 
> > Under high allocation concurrency, multiple threads enter reclaim at
> > the same time. Fearing overreclaim when we first switched from the
> > single global LRU to cgrouped LRU lists, we introduced a shared
> > iteration state for reclaim invocations - whether 1 or 20 reclaimers
> > are active concurrently, we only walk the cgroup tree once: the 1st
> > reclaimer reclaims the first cgroup, the second the second one etc.
> > With more reclaimers than cgroups, we start another walk from the top.
> > 
> > This sounded reasonable at the time, but the problem is that reclaim
> > concurrency doesn't scale with allocation concurrency. As reclaim
> > concurrency increases, the amount of memory individual reclaimers get
> > to scan gets smaller and smaller. Individual reclaimers may only see
> > one cgroup per cycle, and that may not have much reclaimable
> > memory. We see individual reclaimers declare OOM when there is plenty
> > of reclaimable memory available in cgroups they didn't visit.
> > 
> > This patch does away with the shared iterator, and every reclaimer is
> > allowed to scan the full cgroup tree and see all of reclaimable
> > memory, just like it would on a non-cgrouped system. This way, when
> > OOM is declared, we know that the reclaimer actually had a chance.
> > 
> > To still maintain fairness in reclaim pressure, disallow cgroup
> > reclaim from bailing out of the tree walk early. Kswapd and regular
> > direct reclaim already don't bail, so it's not clear why limit reclaim
> > would have to, especially since it only walks subtrees to begin with.
> 
> The code does bail out on any direct reclaim - be it limit or page
> allocator triggered. Check the !current_is_kswapd part of the condition.

Ah you're right. In practice I doubt it makes much of a difference,
though, because...

> > This change completely eliminates the OOM kills on our service, while
> > showing no signs of overreclaim - no increased scan rates, %sys time,
> > or abrupt free memory spikes. I tested across 100 machines that have
> > 64G of RAM and host about 300 cgroups each.
> 
> What is the usual direct reclaim involvement on those machines?

80-200 kb/s. In general we try to keep this low to non-existent on our
hosts due to the latency implications. So it's fair to say that kswapd
does page reclaim, and direct reclaim is a sign of overload.

> That being said, I do agree that the OOM side of the coin is causing
> real trouble and is the problem to address first. This is especially
> true with cgroup v2, where we likely have more memcgs without any pages
> because inner nodes have no tasks or direct charges, which makes it more
> likely that reclaimers hit memcgs without pages.
> 
> Let's see whether we see regression due to over-reclaim. 
> 
> > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> 
> With the direct reclaim bail out reference fixed - unless I am wrong
> there of course
> 
> Acked-by: Michal Hocko <mhocko@suse.com>

Thanks! I'll send an updated changelog.

> It is sad to see this piece of fun not being used after that many years
> of bugs here and there and all the lockless fun but this is the life

Haha, agreed.


* Re: [PATCH] mm: vmscan: do not share cgroup iteration between reclaimers
  2019-08-13 15:59   ` Yang Shi
@ 2019-08-13 18:14     ` Johannes Weiner
  0 siblings, 0 replies; 8+ messages in thread
From: Johannes Weiner @ 2019-08-13 18:14 UTC (permalink / raw)
  To: Yang Shi
  Cc: Michal Hocko, Andrew Morton, Linux MM, cgroups,
	Linux Kernel Mailing List, kernel-team

On Tue, Aug 13, 2019 at 08:59:38AM -0700, Yang Shi wrote:
> On Tue, Aug 13, 2019 at 6:29 AM Michal Hocko <mhocko@kernel.org> wrote:
> >
> > On Mon 12-08-19 15:23:16, Johannes Weiner wrote:
> > > One of our services observed a high rate of cgroup OOM kills in the
> > > presence of large amounts of clean cache. Debugging showed that the
> > > culprit is the shared cgroup iteration in page reclaim.
> > >
> > > Under high allocation concurrency, multiple threads enter reclaim at
> > > the same time. Fearing overreclaim when we first switched from the
> > > single global LRU to cgrouped LRU lists, we introduced a shared
> > > iteration state for reclaim invocations - whether 1 or 20 reclaimers
> > > are active concurrently, we only walk the cgroup tree once: the 1st
> > > reclaimer reclaims the first cgroup, the second the second one etc.
> > > With more reclaimers than cgroups, we start another walk from the top.
> > >
> > > This sounded reasonable at the time, but the problem is that reclaim
> > > concurrency doesn't scale with allocation concurrency. As reclaim
> > > concurrency increases, the amount of memory individual reclaimers get
> > > to scan gets smaller and smaller. Individual reclaimers may only see
> > > one cgroup per cycle, and that may not have much reclaimable
> > > memory. We see individual reclaimers declare OOM when there is plenty
> > > of reclaimable memory available in cgroups they didn't visit.
> > >
> > > This patch does away with the shared iterator, and every reclaimer is
> > > allowed to scan the full cgroup tree and see all of reclaimable
> > > memory, just like it would on a non-cgrouped system. This way, when
> > > OOM is declared, we know that the reclaimer actually had a chance.
> > >
> > > To still maintain fairness in reclaim pressure, disallow cgroup
> > > reclaim from bailing out of the tree walk early. Kswapd and regular
> > > direct reclaim already don't bail, so it's not clear why limit reclaim
> > > would have to, especially since it only walks subtrees to begin with.
> >
> > The code does bail out on any direct reclaim - be it limit or page
> > allocator triggered. Check the !current_is_kswapd part of the condition.
> 
> Yes, please see commit 2bb0f34fe3c1 ("mm: vmscan: do not iterate all
> mem cgroups for global direct reclaim")

That commit is a workaround for the cgroup tree blowing up with zombie
cgroups. Roman's slab reparenting patches fix the zombies, so we
shouldn't need it anymore.

With or without the direct reclaim rule, we don't want offline cgroups
to accumulate like that in the first place: they also slow down kswapd,
and they eat a ton of RAM.

> > > This change completely eliminates the OOM kills on our service, while
> > > showing no signs of overreclaim - no increased scan rates, %sys time,
> > > or abrupt free memory spikes. I tested across 100 machines that have
> > > 64G of RAM and host about 300 cgroups each.
> >
> > What is the usual direct reclaim involvement on those machines?
> >
> > > [ It's possible overreclaim never was a *practical* issue to begin
> > >   with - it was simply a concern we had on the mailing lists at the
> > >   time, with no real data to back it up. But we have also added more
> > >   bail-out conditions deeper inside reclaim (e.g. the proportional
> > >   exit in shrink_node_memcg) since. Regardless, now we have data that
> > >   suggests full walks are more reliable and scale just fine. ]
> >
> > I do not see how shrink_node_memcg bail out helps here. We do scan up-to
> > SWAP_CLUSTER_MAX pages for each LRU at least once. So we are getting to
> > nr_memcgs_with_pages multiplier with the patch applied in the worst case.
> >
> > How much that matters is another question and it depends on the
> > number of cgroups and the rate the direct reclaim happens. I do not
> > remember exact numbers but even walking a very large memcg tree was
> > noticeable.
> 
> I'm concerned by this too. It might be OK for cgroup v2, but v1 still
> dominates. And, considering offline memcgs, it might not be unusual to
> have a quite large memcg tree.

cgroup2 was affected by the offline memcgs just as much as cgroup1 -
probably even more so, because it tracks more types of memory by
default. That's why Roman worked tirelessly on a solution.

But we shouldn't keep those band-aid patches around forever.


* Re: [PATCH] mm: vmscan: do not share cgroup iteration between reclaimers
  2019-08-13 17:12   ` Johannes Weiner
@ 2019-08-14  8:11     ` Michal Hocko
  0 siblings, 0 replies; 8+ messages in thread
From: Michal Hocko @ 2019-08-14  8:11 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, linux-mm, cgroups, linux-kernel, kernel-team

On Tue 13-08-19 13:12:37, Johannes Weiner wrote:
> On Tue, Aug 13, 2019 at 03:29:38PM +0200, Michal Hocko wrote:
> > On Mon 12-08-19 15:23:16, Johannes Weiner wrote:
[...]
> > > This change completely eliminates the OOM kills on our service, while
> > > showing no signs of overreclaim - no increased scan rates, %sys time,
> > > or abrupt free memory spikes. I tested across 100 machines that have
> > > 64G of RAM and host about 300 cgroups each.
> > 
> > What is the usual direct reclaim involvement on those machines?
> 
> 80-200 kb/s. In general we try to keep this low to non-existent on our
> hosts due to the latency implications. So it's fair to say that kswapd
> does page reclaim, and direct reclaim is a sign of overload.

Well, there are workloads which lean much more heavily on direct
reclaim. How much they rely on large memcg trees remains to be seen.
Your changelog should state that the above workload is very light on
direct reclaim, though, because the paragraph above suggests that the
risk of longer stalls is a non-issue, while I think that is not really
all that clear.
-- 
Michal Hocko
SUSE Labs

