linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [RFC PATCH] mm: memcg: Do not count memory.low reclaim if it does not happen
@ 2022-03-22 18:22 Michal Koutný
  2022-03-23 21:44 ` Roman Gushchin
  0 siblings, 1 reply; 5+ messages in thread
From: Michal Koutný @ 2022-03-22 18:22 UTC (permalink / raw)
  To: cgroups, linux-mm, linux-kernel
  Cc: Richard Palethorpe, Andrew Morton, Shakeel Butt, Roman Gushchin,
	Michal Hocko, Vlastimil Babka, Matthew Wilcox (Oracle),
	Muchun Song, Johannes Weiner, Yang Shi, Suren Baghdasaryan,
	Tejun Heo, Chris Down

This was observed with memcontrol selftest/new LTP test but can be also
reproduced in simplified setup of two siblings:

	`parent .low=50M
	  ` s1	.low=50M  .current=50M+ε
	  ` s2  .low=0M   .current=50M

The expectation is that s2/memory.events:low will be zero under outer
reclaimer since no protection should be given to cgroup s2 (even with
memory_recursiveprot).

However, this does not happen. The apparent reason is that when s1 is
considered for (proportional) reclaim the scanned proportion is rounded
up to SWAP_CLUSTER_MAX and slightly over-proportional amount is
reclaimed. Consequently, when the effective low value of s2 is
calculated, it observes unclaimed parent's protection from s1
(ε-SWAP_CLUSTER_MAX in theory) and effectively appropriates it.
The effect is slightly regularized protection (workload dependent)
between siblings and misreported MEMCG_LOW event when reclaiming s2 with
this protection.

Fix the behavior by not reporting breached memory.low in such
situations. (This affects also setups where all siblings have
memory.low=0, parent's memory.events:low will still be non-zero when
parent's memory.low is breached but it will be reduced by the events
originated in children.)

Fixes: 8a931f801340 ("mm: memcontrol: recursive memory.low protection")
Reported-by: Richard Palethorpe <rpalethorpe@suse.com>
Link: https://lore.kernel.org/all/20220321101429.3703-1-rpalethorpe@suse.com/
Signed-off-by: Michal Koutný <mkoutny@suse.com>
---
 include/linux/memcontrol.h | 8 ++++----
 mm/vmscan.c                | 5 +++--
 2 files changed, 7 insertions(+), 6 deletions(-)

Why is this RFC?

1) It changes number of events observed on parent/memory.events:low (especially
   for truly recursive configs where all children specify memory.low=0).
   IIUC past discussions about equality of all-zeros and all-infinities, those
   eagerly reported MEMCG_LOW events (in latter case) were deemed skewing the
   stats [1].
2) The observed behavior slightly impacts distribution of parent's memory.low. 
   Constructed example is a passive protected workload in s1 and active in s2
   (active ~ counteracts the reclaim with allocations). It could strip
   protection from s1 one by one (one:=SWAP_CLUSTER_MAX/2^sc.priority).
   That may be considered both wrong (s1 should have been more protected) or
   correct s2 deserves protection due to its activity.
   I don't have (didn't collect) data for this, so I think just masking the
   false events is sufficient (or independent).

[1] https://lore.kernel.org/r/20200221185839.GB70967@cmpxchg.org

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 0abbd685703b..99ac72e00bff 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -626,13 +626,13 @@ static inline bool mem_cgroup_supports_protection(struct mem_cgroup *memcg)
 
 }
 
-static inline bool mem_cgroup_below_low(struct mem_cgroup *memcg)
+static inline bool mem_cgroup_below_low(struct mem_cgroup *memcg, bool effective)
 {
 	if (!mem_cgroup_supports_protection(memcg))
 		return false;
 
-	return READ_ONCE(memcg->memory.elow) >=
-		page_counter_read(&memcg->memory);
+	return page_counter_read(&memcg->memory) <= (effective ?
+		READ_ONCE(memcg->memory.elow) :	READ_ONCE(memcg->memory.low));
 }
 
 static inline bool mem_cgroup_below_min(struct mem_cgroup *memcg)
@@ -1177,7 +1177,7 @@ static inline void mem_cgroup_calculate_protection(struct mem_cgroup *root,
 {
 }
 
-static inline bool mem_cgroup_below_low(struct mem_cgroup *memcg)
+static inline bool mem_cgroup_below_low(struct mem_cgroup *memcg, bool effective)
 {
 	return false;
 }
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 59b14e0d696c..3bdb35d6bee6 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3152,7 +3152,7 @@ static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
 			 * If there is no reclaimable memory, OOM.
 			 */
 			continue;
-		} else if (mem_cgroup_below_low(memcg)) {
+		} else if (mem_cgroup_below_low(memcg, true)) {
 			/*
 			 * Soft protection.
 			 * Respect the protection only as long as
@@ -3163,7 +3163,8 @@ static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
 				sc->memcg_low_skipped = 1;
 				continue;
 			}
-			memcg_memory_event(memcg, MEMCG_LOW);
+			if (mem_cgroup_below_low(memcg, false))
+				memcg_memory_event(memcg, MEMCG_LOW);
 		}
 
 		reclaimed = sc->nr_reclaimed;
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2022-03-25 10:31 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-03-22 18:22 [RFC PATCH] mm: memcg: Do not count memory.low reclaim if it does not happen Michal Koutný
2022-03-23 21:44 ` Roman Gushchin
2022-03-24  9:51   ` Michal Koutný
2022-03-24 18:17     ` Roman Gushchin
2022-03-25 10:31       ` Michal Koutný

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).