* [PATCH] mm: vmscan: restore incremental cgroup iteration
@ 2024-05-14 20:26 Johannes Weiner
  2024-05-14 20:48 ` Shakeel Butt
                   ` (2 more replies)
  0 siblings, 3 replies; 4+ messages in thread
From: Johannes Weiner @ 2024-05-14 20:26 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Michal Hocko, Roman Gushchin, Shakeel Butt, Rik van Riel,
	linux-mm, cgroups, linux-kernel, kernel-team

Currently, reclaim always walks the entire cgroup tree in order to
ensure fairness between groups. While overreclaim is limited in
shrink_lruvec(), many of our systems have a sizable number of active
groups, and an even bigger number of idle cgroups with cache left
behind by previous jobs; the mere act of walking all these cgroups can
impose significant latency on direct reclaimers.

In the past, we've used a save-and-restore iterator that enabled
incremental tree walks over multiple reclaim invocations. This ensured
fairness, while keeping the work of individual reclaimers small.
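
For illustration, the shared walk in question is the usual
mem_cgroup_iter() loop, just driven with a reclaim cookie so that
successive callers resume where the previous one stopped instead of
restarting at the root. A minimal sketch of that pattern (not taken
verbatim from any one version of the code):

	struct mem_cgroup_reclaim_cookie cookie = { .pgdat = pgdat };
	struct mem_cgroup *memcg;

	memcg = mem_cgroup_iter(root, NULL, &cookie);
	do {
		/* shrink this memcg's lruvec on pgdat ... */
	} while ((memcg = mem_cgroup_iter(root, memcg, &cookie)));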

However, in edge cases with a lot of reclaim concurrency, individual
reclaimers would sometimes not see enough of the cgroup tree to make
forward progress and (prematurely) declare OOM. Consequently we
switched to comprehensive walks in 1ba6fc9af35b ("mm: vmscan: do not
share cgroup iteration between reclaimers").

To address the latency problem without bringing back the premature OOM
issue, reinstate the shared iteration, but with a restart condition to
do the full walk in the OOM case - similar to what we do for
memory.low enforcement and active page protection.
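
For reference, the memory.low retry in do_try_to_free_pages() that
this mirrors looks roughly like the following (paraphrased from
mainline, so details may differ between versions):

	/* Untapped cgroup reserves?  Don't OOM, retry. */
	if (sc->memcg_low_skipped) {
		sc->priority = initial_priority;
		sc->memcg_low_reclaim = 1;
		sc->memcg_low_skipped = 0;
		goto retry;
	}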

In the worst case, we do one more full tree walk before declaring
OOM. But the vast majority of direct reclaim scans can then finish
much quicker, while fairness across the tree is maintained:

- Before this patch, we observed that direct reclaim always takes more
  than 100us and most direct reclaim time is spent in reclaim cycles
  lasting between 1ms and 1 second. Almost 40% of direct reclaim time
  was spent on reclaim cycles exceeding 100ms.

- With this patch, almost all page reclaim cycles last less than 10ms,
  and a good amount of direct page reclaim finishes in under 100us. No
  page reclaim cycles lasting over 100ms were observed anymore.

The shared iterator state is maintained inside the target cgroup, so
fair and incremental walks are performed during both global reclaim
and cgroup limit reclaim of complex subtrees.
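
(As a rough sketch of what "iterator state inside the target cgroup"
means - this is pre-existing infrastructure, not something this patch
adds, and the exact layout may differ - each memcg carries a per-node
reclaim cursor along these lines:

	struct mem_cgroup_reclaim_iter {
		struct mem_cgroup *position;	/* where the last walk stopped */
		unsigned int generation;	/* bumped on each full round trip */
	};

mem_cgroup_iter() looks up and advances the cursor belonging to the
root of the walk, so limit reclaim of a subtree keeps a resume point
independent of global reclaim's.)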

Reported-by: Rik van Riel <riel@surriel.com>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Rik van Riel <riel@surriel.com>
---
 mm/vmscan.c | 42 ++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 40 insertions(+), 2 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 6981a71c8ef0..fc22704fbbe1 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -133,6 +133,9 @@ struct scan_control {
 	unsigned int memcg_low_reclaim:1;
 	unsigned int memcg_low_skipped:1;
 
+	/* Shared cgroup tree walk failed, rescan the whole tree */
+	unsigned int memcg_full_walk:1;
+
 	unsigned int hibernation_mode:1;
 
 	/* One of the zones is ready for compaction */
@@ -5862,9 +5865,25 @@ static inline bool should_continue_reclaim(struct pglist_data *pgdat,
 static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
 {
 	struct mem_cgroup *target_memcg = sc->target_mem_cgroup;
+	struct mem_cgroup_reclaim_cookie reclaim = {
+		.pgdat = pgdat,
+	};
+	struct mem_cgroup_reclaim_cookie *partial = &reclaim;
 	struct mem_cgroup *memcg;
 
-	memcg = mem_cgroup_iter(target_memcg, NULL, NULL);
+	/*
+	 * In most cases, direct reclaimers can do partial walks
+	 * through the cgroup tree, using an iterator state that
+	 * persists across invocations. This strikes a balance between
+	 * fairness and allocation latency.
+	 *
+	 * For kswapd, reliable forward progress is more important
+	 * than a quick return to idle. Always do full walks.
+	 */
+	if (current_is_kswapd() || sc->memcg_full_walk)
+		partial = NULL;
+
+	memcg = mem_cgroup_iter(target_memcg, NULL, partial);
 	do {
 		struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat);
 		unsigned long reclaimed;
@@ -5914,7 +5933,12 @@ static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
 				   sc->nr_scanned - scanned,
 				   sc->nr_reclaimed - reclaimed);
 
-	} while ((memcg = mem_cgroup_iter(target_memcg, memcg, NULL)));
+		/* If partial walks are allowed, bail once goal is reached */
+		if (partial && sc->nr_reclaimed >= sc->nr_to_reclaim) {
+			mem_cgroup_iter_break(target_memcg, memcg);
+			break;
+		}
+	} while ((memcg = mem_cgroup_iter(target_memcg, memcg, partial)));
 }
 
 static void shrink_node(pg_data_t *pgdat, struct scan_control *sc)
@@ -6287,6 +6311,20 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
 	if (sc->compaction_ready)
 		return 1;
 
+	/*
+	 * In most cases, direct reclaimers can do partial walks
+	 * through the cgroup tree to meet the reclaim goal while
+	 * keeping latency low. Since the iterator state is shared
+	 * among all direct reclaim invocations (to retain fairness
+	 * among cgroups), though, high concurrency can result in
+	 * individual threads not seeing enough cgroups to make
+	 * meaningful forward progress. Avoid false OOMs in this case.
+	 */
+	if (!sc->memcg_full_walk) {
+		sc->memcg_full_walk = 1;
+		goto retry;
+	}
+
 	/*
 	 * We make inactive:active ratio decisions based on the node's
 	 * composition of memory, but a restrictive reclaim_idx or a
-- 
2.45.0



* Re: [PATCH] mm: vmscan: restore incremental cgroup iteration
  2024-05-14 20:26 [PATCH] mm: vmscan: restore incremental cgroup iteration Johannes Weiner
@ 2024-05-14 20:48 ` Shakeel Butt
  2024-05-14 23:36 ` Roman Gushchin
  2024-05-24 16:35 ` Michal Koutný
  2 siblings, 0 replies; 4+ messages in thread
From: Shakeel Butt @ 2024-05-14 20:48 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Michal Hocko, Roman Gushchin, Rik van Riel,
	linux-mm, cgroups, linux-kernel, kernel-team

On Tue, May 14, 2024 at 04:26:41PM -0400, Johannes Weiner wrote:
> Currently, reclaim always walks the entire cgroup tree in order to
> ensure fairness between groups. While overreclaim is limited in
> shrink_lruvec(), many of our systems have a sizable number of active
> groups, and an even bigger number of idle cgroups with cache left
> behind by previous jobs; the mere act of walking all these cgroups can
> impose significant latency on direct reclaimers.
> 
> In the past, we've used a save-and-restore iterator that enabled
> incremental tree walks over multiple reclaim invocations. This ensured
> fairness, while keeping the work of individual reclaimers small.
> 
> However, in edge cases with a lot of reclaim concurrency, individual
> reclaimers would sometimes not see enough of the cgroup tree to make
> forward progress and (prematurely) declare OOM. Consequently we
> switched to comprehensive walks in 1ba6fc9af35b ("mm: vmscan: do not
> share cgroup iteration between reclaimers").
> 
> To address the latency problem without bringing back the premature OOM
> issue, reinstate the shared iteration, but with a restart condition to
> do the full walk in the OOM case - similar to what we do for
> memory.low enforcement and active page protection.
> 
> In the worst case, we do one more full tree walk before declaring
> OOM. But the vast majority of direct reclaim scans can then finish
> much quicker, while fairness across the tree is maintained:
> 
> - Before this patch, we observed that direct reclaim always takes more
>   than 100us and most direct reclaim time is spent in reclaim cycles
>   lasting between 1ms and 1 second. Almost 40% of direct reclaim time
>   was spent on reclaim cycles exceeding 100ms.
> 
> - With this patch, almost all page reclaim cycles last less than 10ms,
>   and a good amount of direct page reclaim finishes in under 100us. No
>   page reclaim cycles lasting over 100ms were observed anymore.
> 
> The shared iterator state is maintained inside the target cgroup, so
> fair and incremental walks are performed during both global reclaim
> and cgroup limit reclaim of complex subtrees.
> 
> Reported-by: Rik van Riel <riel@surriel.com>
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> Signed-off-by: Rik van Riel <riel@surriel.com>

Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>


* Re: [PATCH] mm: vmscan: restore incremental cgroup iteration
  2024-05-14 20:26 [PATCH] mm: vmscan: restore incremental cgroup iteration Johannes Weiner
  2024-05-14 20:48 ` Shakeel Butt
@ 2024-05-14 23:36 ` Roman Gushchin
  2024-05-24 16:35 ` Michal Koutný
  2 siblings, 0 replies; 4+ messages in thread
From: Roman Gushchin @ 2024-05-14 23:36 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Michal Hocko, Shakeel Butt, Rik van Riel,
	linux-mm, cgroups, linux-kernel, kernel-team

On Tue, May 14, 2024 at 04:26:41PM -0400, Johannes Weiner wrote:
> Currently, reclaim always walks the entire cgroup tree in order to
> ensure fairness between groups. While overreclaim is limited in
> shrink_lruvec(), many of our systems have a sizable number of active
> groups, and an even bigger number of idle cgroups with cache left
> behind by previous jobs; the mere act of walking all these cgroups can
> impose significant latency on direct reclaimers.
> 
> In the past, we've used a save-and-restore iterator that enabled
> incremental tree walks over multiple reclaim invocations. This ensured
> fairness, while keeping the work of individual reclaimers small.
> 
> However, in edge cases with a lot of reclaim concurrency, individual
> reclaimers would sometimes not see enough of the cgroup tree to make
> forward progress and (prematurely) declare OOM. Consequently we
> switched to comprehensive walks in 1ba6fc9af35b ("mm: vmscan: do not
> share cgroup iteration between reclaimers").
> 
> To address the latency problem without bringing back the premature OOM
> issue, reinstate the shared iteration, but with a restart condition to
> do the full walk in the OOM case - similar to what we do for
> memory.low enforcement and active page protection.
> 
> In the worst case, we do one more full tree walk before declaring
> OOM. But the vast majority of direct reclaim scans can then finish
> much quicker, while fairness across the tree is maintained:
> 
> - Before this patch, we observed that direct reclaim always takes more
>   than 100us and most direct reclaim time is spent in reclaim cycles
>   lasting between 1ms and 1 second. Almost 40% of direct reclaim time
>   was spent on reclaim cycles exceeding 100ms.
> 
> - With this patch, almost all page reclaim cycles last less than 10ms,
>   and a good amount of direct page reclaim finishes in under 100us. No
>   page reclaim cycles lasting over 100ms were observed anymore.
> 
> The shared iterator state is maintained inside the target cgroup, so
> fair and incremental walks are performed during both global reclaim
> and cgroup limit reclaim of complex subtrees.
> 
> Reported-by: Rik van Riel <riel@surriel.com>
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> Signed-off-by: Rik van Riel <riel@surriel.com>

Looks really solid.

Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev>

Thanks!


* Re: [PATCH] mm: vmscan: restore incremental cgroup iteration
  2024-05-14 20:26 [PATCH] mm: vmscan: restore incremental cgroup iteration Johannes Weiner
  2024-05-14 20:48 ` Shakeel Butt
  2024-05-14 23:36 ` Roman Gushchin
@ 2024-05-24 16:35 ` Michal Koutný
  2 siblings, 0 replies; 4+ messages in thread
From: Michal Koutný @ 2024-05-24 16:35 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Michal Hocko, Roman Gushchin, Shakeel Butt,
	Rik van Riel, linux-mm, cgroups, linux-kernel, kernel-team


Hello.

On Tue, May 14, 2024 at 04:26:41PM GMT, Johannes Weiner <hannes@cmpxchg.org> wrote:
> The shared iterator state is maintained inside the target cgroup, so
> fair and incremental walks are performed during both global reclaim
> and cgroup limit reclaim of complex subtrees.

Here it sounds like the same fairness is maintained...

>  static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
...
> +	 * persists across invocations. This strikes a balance between
> +	 * fairness and allocation latency.

...but here you write about a balance between fairness and allocation
latency.

IIUC, this spreads reclaim of the whole subtree over a longer time,
during which more events (e.g. further allocations) can change the
state of memory, so the resulting fairness would be "different". So
the statement in the code comment is the correct one, right?

(I was also wondering how this affects the determinism of reclaim and
whether chaotic or oscillatory patterns could emerge, but I guess that
needn't be considered given that it used to work this way before
1ba6fc9af35b.)

Thanks,
Michal



