linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
From: Roman Gushchin <guro@fb.com>
To: Johannes Weiner <hannes@cmpxchg.org>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	Michal Hocko <mhocko@suse.com>,
	"linux-mm@kvack.org" <linux-mm@kvack.org>,
	"cgroups@vger.kernel.org" <cgroups@vger.kernel.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	Kernel Team <Kernel-team@fb.com>
Subject: Re: [PATCH] mm: vmscan: do not share cgroup iteration between reclaimers
Date: Mon, 12 Aug 2019 21:07:27 +0000	[thread overview]
Message-ID: <20190812210723.GA9423@tower.dhcp.thefacebook.com> (raw)
In-Reply-To: <20190812192316.13615-1-hannes@cmpxchg.org>

On Mon, Aug 12, 2019 at 03:23:16PM -0400, Johannes Weiner wrote:
> One of our services observed a high rate of cgroup OOM kills in the
> presence of large amounts of clean cache. Debugging showed that the
> culprit is the shared cgroup iteration in page reclaim.
> 
> Under high allocation concurrency, multiple threads enter reclaim at
> the same time. Fearing overreclaim when we first switched from the
> single global LRU to cgrouped LRU lists, we introduced a shared
> iteration state for reclaim invocations - whether 1 or 20 reclaimers
> are active concurrently, we only walk the cgroup tree once: the 1st
> reclaimer reclaims the first cgroup, the second the second one etc.
> With more reclaimers than cgroups, we start another walk from the top.
> 
> This sounded reasonable at the time, but the problem is that reclaim
> concurrency doesn't scale with allocation concurrency. As reclaim
> concurrency increases, the amount of memory individual reclaimers get
> to scan gets smaller and smaller. Individual reclaimers may only see
> one cgroup per cycle, and that may not have much reclaimable
> memory. We see individual reclaimers declare OOM when there is plenty
> of reclaimable memory available in cgroups they didn't visit.

Nice catch!


> 
> This patch does away with the shared iterator, and every reclaimer is
> allowed to scan the full cgroup tree and see all of reclaimable
> memory, just like it would on a non-cgrouped system. This way, when
> OOM is declared, we know that the reclaimer actually had a chance.
> 
> To still maintain fairness in reclaim pressure, disallow cgroup
> reclaim from bailing out of the tree walk early. Kswapd and regular
> direct reclaim already don't bail, so it's not clear why limit reclaim
> would have to, especially since it only walks subtrees to begin with.
> 
> This change completely eliminates the OOM kills on our service, while
> showing no signs of overreclaim - no increased scan rates, %sys time,
> or abrupt free memory spikes. I tested across 100 machines that have
> 64G of RAM and host about 300 cgroups each.
> 
> [ It's possible overreclaim never was a *practical* issue to begin
>   with - it was simply a concern we had on the mailing lists at the
>   time, with no real data to back it up. But we have also added more
>   bail-out conditions deeper inside reclaim (e.g. the proportional
>   exit in shrink_node_memcg) since. Regardless, now we have data that
>   suggests full walks are more reliable and scale just fine. ]
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> ---
>  mm/vmscan.c | 22 ++--------------------
>  1 file changed, 2 insertions(+), 20 deletions(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index dbdc46a84f63..b2f10fa49c88 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2667,10 +2667,6 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
>  
>  	do {
>  		struct mem_cgroup *root = sc->target_mem_cgroup;
> -		struct mem_cgroup_reclaim_cookie reclaim = {
> -			.pgdat = pgdat,
> -			.priority = sc->priority,
> -		};
>  		unsigned long node_lru_pages = 0;
>  		struct mem_cgroup *memcg;
>  
> @@ -2679,7 +2675,7 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
>  		nr_reclaimed = sc->nr_reclaimed;
>  		nr_scanned = sc->nr_scanned;
>  
> -		memcg = mem_cgroup_iter(root, NULL, &reclaim);
> +		memcg = mem_cgroup_iter(root, NULL, NULL);

I wonder if we can remove the shared memcg tree walking at all? It seems that
the only use case left is the soft limit, and the same logic can be applied
to it. The we potentially can remove a lot of code in mem_cgroup_iter().
Just an idea...

>  		do {
>  			unsigned long lru_pages;
>  			unsigned long reclaimed;
> @@ -2724,21 +2720,7 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
>  				   sc->nr_scanned - scanned,
>  				   sc->nr_reclaimed - reclaimed);
>  
> -			/*
> -			 * Kswapd have to scan all memory cgroups to fulfill
> -			 * the overall scan target for the node.
> -			 *
> -			 * Limit reclaim, on the other hand, only cares about
> -			 * nr_to_reclaim pages to be reclaimed and it will
> -			 * retry with decreasing priority if one round over the
> -			 * whole hierarchy is not sufficient.
> -			 */
> -			if (!current_is_kswapd() &&
> -					sc->nr_reclaimed >= sc->nr_to_reclaim) {
> -				mem_cgroup_iter_break(root, memcg);
> -				break;
> -			}
> -		} while ((memcg = mem_cgroup_iter(root, memcg, &reclaim)));
> +		} while ((memcg = mem_cgroup_iter(root, memcg, NULL)));
>  
>  		if (reclaim_state) {
>  			sc->nr_reclaimed += reclaim_state->reclaimed_slab;
> -- 
> 2.22.0
>

Otherwise looks good to me!

Reviewed-by: Roman Gushchin <guro@fb.com>


  reply	other threads:[~2019-08-12 21:07 UTC|newest]

Thread overview: 8+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-08-12 19:23 [PATCH] mm: vmscan: do not share cgroup iteration between reclaimers Johannes Weiner
2019-08-12 21:07 ` Roman Gushchin [this message]
2019-08-12 23:00   ` Johannes Weiner
2019-08-13 13:29 ` Michal Hocko
2019-08-13 15:59   ` Yang Shi
2019-08-13 18:14     ` Johannes Weiner
2019-08-13 17:12   ` Johannes Weiner
2019-08-14  8:11     ` Michal Hocko

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20190812210723.GA9423@tower.dhcp.thefacebook.com \
    --to=guro@fb.com \
    --cc=Kernel-team@fb.com \
    --cc=akpm@linux-foundation.org \
    --cc=cgroups@vger.kernel.org \
    --cc=hannes@cmpxchg.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mhocko@suse.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).