linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [RFC PATCH] mm: memcg: Do not count memory.low reclaim if it does not happen
@ 2022-03-22 18:22 Michal Koutný
  2022-03-23 21:44 ` Roman Gushchin
  0 siblings, 1 reply; 5+ messages in thread
From: Michal Koutný @ 2022-03-22 18:22 UTC (permalink / raw)
  To: cgroups, linux-mm, linux-kernel
  Cc: Richard Palethorpe, Andrew Morton, Shakeel Butt, Roman Gushchin,
	Michal Hocko, Vlastimil Babka, Matthew Wilcox (Oracle),
	Muchun Song, Johannes Weiner, Yang Shi, Suren Baghdasaryan,
	Tejun Heo, Chris Down

This was observed with memcontrol selftest/new LTP test but can be also
reproduced in simplified setup of two siblings:

	`parent .low=50M
	  ` s1	.low=50M  .current=50M+ε
	  ` s2  .low=0M   .current=50M

The expectation is that s2/memory.events:low will be zero under outer
reclaimer since no protection should be given to cgroup s2 (even with
memory_recursiveprot).

However, this does not happen. The apparent reason is that when s1 is
considered for (proportional) reclaim the scanned proportion is rounded
up to SWAP_CLUSTER_MAX and slightly over-proportional amount is
reclaimed. Consequently, when the effective low value of s2 is
calculated, it observes unclaimed parent's protection from s1
(ε-SWAP_CLUSTER_MAX in theory) and effectively appropriates it.
The effect is slightly regularized protection (workload dependent)
between siblings and misreported MEMCG_LOW event when reclaiming s2 with
this protection.

Fix the behavior by not reporting breached memory.low in such
situations. (This affects also setups where all siblings have
memory.low=0, parent's memory.events:low will still be non-zero when
parent's memory.low is breached but it will be reduced by the events
originated in children.)

Fixes: 8a931f801340 ("mm: memcontrol: recursive memory.low protection")
Reported-by: Richard Palethorpe <rpalethorpe@suse.com>
Link: https://lore.kernel.org/all/20220321101429.3703-1-rpalethorpe@suse.com/
Signed-off-by: Michal Koutný <mkoutny@suse.com>
---
 include/linux/memcontrol.h | 8 ++++----
 mm/vmscan.c                | 5 +++--
 2 files changed, 7 insertions(+), 6 deletions(-)

Why is this RFC?

1) It changes number of events observed on parent/memory.events:low (especially
   for truly recursive configs where all children specify memory.low=0).
   IIUC past discussions about equality of all-zeros and all-infinities, those
   eagerly reported MEMCG_LOW events (in latter case) were deemed skewing the
   stats [1].
2) The observed behavior slightly impacts distribution of parent's memory.low. 
   Constructed example is a passive protected workload in s1 and active in s2
   (active ~ counteracts the reclaim with allocations). It could strip
   protection from s1 one by one (one:=SWAP_CLUSTER_MAX/2^sc.priority).
   That may be considered both wrong (s1 should have been more protected) or
   correct s2 deserves protection due to its activity.
   I don't have (didn't collect) data for this, so I think just masking the
   false events is sufficient (or independent).

[1] https://lore.kernel.org/r/20200221185839.GB70967@cmpxchg.org

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 0abbd685703b..99ac72e00bff 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -626,13 +626,13 @@ static inline bool mem_cgroup_supports_protection(struct mem_cgroup *memcg)
 
 }
 
-static inline bool mem_cgroup_below_low(struct mem_cgroup *memcg)
+static inline bool mem_cgroup_below_low(struct mem_cgroup *memcg, bool effective)
 {
 	if (!mem_cgroup_supports_protection(memcg))
 		return false;
 
-	return READ_ONCE(memcg->memory.elow) >=
-		page_counter_read(&memcg->memory);
+	return page_counter_read(&memcg->memory) <= (effective ?
+		READ_ONCE(memcg->memory.elow) :	READ_ONCE(memcg->memory.low));
 }
 
 static inline bool mem_cgroup_below_min(struct mem_cgroup *memcg)
@@ -1177,7 +1177,7 @@ static inline void mem_cgroup_calculate_protection(struct mem_cgroup *root,
 {
 }
 
-static inline bool mem_cgroup_below_low(struct mem_cgroup *memcg)
+static inline bool mem_cgroup_below_low(struct mem_cgroup *memcg, bool effective)
 {
 	return false;
 }
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 59b14e0d696c..3bdb35d6bee6 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3152,7 +3152,7 @@ static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
 			 * If there is no reclaimable memory, OOM.
 			 */
 			continue;
-		} else if (mem_cgroup_below_low(memcg)) {
+		} else if (mem_cgroup_below_low(memcg, true)) {
 			/*
 			 * Soft protection.
 			 * Respect the protection only as long as
@@ -3163,7 +3163,8 @@ static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
 				sc->memcg_low_skipped = 1;
 				continue;
 			}
-			memcg_memory_event(memcg, MEMCG_LOW);
+			if (mem_cgroup_below_low(memcg, false))
+				memcg_memory_event(memcg, MEMCG_LOW);
 		}
 
 		reclaimed = sc->nr_reclaimed;
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 5+ messages in thread

* Re: [RFC PATCH] mm: memcg: Do not count memory.low reclaim if it does not happen
  2022-03-22 18:22 [RFC PATCH] mm: memcg: Do not count memory.low reclaim if it does not happen Michal Koutný
@ 2022-03-23 21:44 ` Roman Gushchin
  2022-03-24  9:51   ` Michal Koutný
  0 siblings, 1 reply; 5+ messages in thread
From: Roman Gushchin @ 2022-03-23 21:44 UTC (permalink / raw)
  To: Michal Koutný
  Cc: cgroups, linux-mm, linux-kernel, Richard Palethorpe,
	Andrew Morton, Shakeel Butt, Michal Hocko, Vlastimil Babka,
	Matthew Wilcox (Oracle),
	Muchun Song, Johannes Weiner, Yang Shi, Suren Baghdasaryan,
	Tejun Heo, Chris Down

On Tue, Mar 22, 2022 at 07:22:48PM +0100, Michal Koutny wrote:
> This was observed with memcontrol selftest/new LTP test but can be also
> reproduced in simplified setup of two siblings:
> 
> 	`parent .low=50M
> 	  ` s1	.low=50M  .current=50M+ε
> 	  ` s2  .low=0M   .current=50M
> 
> The expectation is that s2/memory.events:low will be zero under outer
> reclaimer since no protection should be given to cgroup s2 (even with
> memory_recursiveprot).
> 
> However, this does not happen. The apparent reason is that when s1 is
> considered for (proportional) reclaim the scanned proportion is rounded
> up to SWAP_CLUSTER_MAX and slightly over-proportional amount is
> reclaimed. Consequently, when the effective low value of s2 is
> calculated, it observes unclaimed parent's protection from s1
> (ε-SWAP_CLUSTER_MAX in theory) and effectively appropriates it.
> The effect is slightly regularized protection (workload dependent)
> between siblings and misreported MEMCG_LOW event when reclaiming s2 with
> this protection.
> 
> Fix the behavior by not reporting breached memory.low in such
> situations. (This affects also setups where all siblings have
> memory.low=0, parent's memory.events:low will still be non-zero when
> parent's memory.low is breached but it will be reduced by the events
> originated in children.)
> 
> Fixes: 8a931f801340 ("mm: memcontrol: recursive memory.low protection")
> Reported-by: Richard Palethorpe <rpalethorpe@suse.com>
> Link: https://lore.kernel.org/all/20220321101429.3703-1-rpalethorpe@suse.com/
> Signed-off-by: Michal Koutný <mkoutny@suse.com>

Hi Michal!

Does it mean that in the following configuration:
	`parent .low=50M
	  ` s1	.low=0M   .current=50M
	  ` s2  .low=0M   .current=50M
there will be no memory.events::low at all? (assuming the recursive thing is on)

Thanks!

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [RFC PATCH] mm: memcg: Do not count memory.low reclaim if it does not happen
  2022-03-23 21:44 ` Roman Gushchin
@ 2022-03-24  9:51   ` Michal Koutný
  2022-03-24 18:17     ` Roman Gushchin
  0 siblings, 1 reply; 5+ messages in thread
From: Michal Koutný @ 2022-03-24  9:51 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: cgroups, linux-mm, linux-kernel, Richard Palethorpe,
	Andrew Morton, Shakeel Butt, Michal Hocko, Vlastimil Babka,
	Matthew Wilcox (Oracle),
	Muchun Song, Johannes Weiner, Yang Shi, Suren Baghdasaryan,
	Tejun Heo, Chris Down

On Wed, Mar 23, 2022 at 02:44:24PM -0700, Roman Gushchin <roman.gushchin@linux.dev> wrote:
> Does it mean that in the following configuration:
> 	`parent .low=50M
> 	  ` s1	.low=0M   .current=50M
> 	  ` s2  .low=0M   .current=50M
> there will be no memory.events::low at all? (assuming the recursive thing is on)

True, no memory.events:low among siblings.
Number of memory.events:low in the parent depends on how much has to be
reclaimed (>50M means carving into parent's protection, hence it'll be
counted).

This is a quantitative change in the events reporting (point 1 of
RFCness), my understanding is that the potential events due to recursive
surplus protection carry no new information regarding configured
memory.low.


Michal

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [RFC PATCH] mm: memcg: Do not count memory.low reclaim if it does not happen
  2022-03-24  9:51   ` Michal Koutný
@ 2022-03-24 18:17     ` Roman Gushchin
  2022-03-25 10:31       ` Michal Koutný
  0 siblings, 1 reply; 5+ messages in thread
From: Roman Gushchin @ 2022-03-24 18:17 UTC (permalink / raw)
  To: Michal Koutný
  Cc: cgroups, linux-mm, linux-kernel, Richard Palethorpe,
	Andrew Morton, Shakeel Butt, Michal Hocko, Vlastimil Babka,
	Matthew Wilcox (Oracle),
	Muchun Song, Johannes Weiner, Yang Shi, Suren Baghdasaryan,
	Tejun Heo, Chris Down


> On Mar 24, 2022, at 2:52 AM, Michal Koutný <mkoutny@suse.com> wrote:
> 
> On Wed, Mar 23, 2022 at 02:44:24PM -0700, Roman Gushchin <roman.gushchin@linux.dev> wrote:
>> Does it mean that in the following configuration:
>>    `parent .low=50M
>>      ` s1    .low=0M   .current=50M
>>      ` s2  .low=0M   .current=50M
>> there will be no memory.events::low at all? (assuming the recursive thing is on)
> 
> True, no memory.events:low among siblings.
> Number of memory.events:low in the parent depends on how much has to be
> reclaimed (>50M means carving into parent's protection, hence it'll be
> counted).

Ok, so it’s not really about the implementation details of the reclaim mechanism (I mean rounding up to the batch size etc), it’s a more generic change: do not generate low events for cgroups not explicitly protected by a non-zero memory.low value.

Idk, I don’t have a strong argument against this change (except that it changes the existing behavior), but I also don’t see why such events are harmful. Do you mind elaborating a bit more?

Thank you!

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [RFC PATCH] mm: memcg: Do not count memory.low reclaim if it does not happen
  2022-03-24 18:17     ` Roman Gushchin
@ 2022-03-25 10:31       ` Michal Koutný
  0 siblings, 0 replies; 5+ messages in thread
From: Michal Koutný @ 2022-03-25 10:31 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: cgroups, linux-mm, linux-kernel, Richard Palethorpe,
	Andrew Morton, Shakeel Butt, Michal Hocko, Vlastimil Babka,
	Matthew Wilcox (Oracle),
	Muchun Song, Johannes Weiner, Yang Shi, Suren Baghdasaryan,
	Tejun Heo, Chris Down

On Thu, Mar 24, 2022 at 11:17:14AM -0700, Roman Gushchin <roman.gushchin@linux.dev> wrote:
> Ok, so it’s not really about the implementation details of the reclaim
> mechanism (I mean rounding up to the batch size etc),

Actually, that was what I deemed more serious first.
It's the point 2 of RFCness:

| 2) The observed behavior slightly impacts distribution of parent's memory.low.
|    Constructed example is a passive protected workload in s1 and active in s2
|    (active ~ counteracts the reclaim with allocations). It could strip
|    protection from s1 one by one (one:=SWAP_CLUSTER_MAX/2^sc.priority).
|    That may be considered both wrong (s1 should have been more protected) or
|    correct s2 deserves protection due to its activity.
|    I don't have (didn't collect) data for this, so I think just masking the
|    false events is sufficient (or independent).

> Idk, I don’t have a strong argument against this change (except that
> it changes the existing behavior), but I also don’t see why such
> events are harmful. Do you mind elaborating a bit more?

So I've collected some demo data now.

	systemd-run \
	        -u precious.service --slice=test-protected.slice \
	        -p MemoryLow=50M \
	        /root/memeater 50 # allocates 50M anon, doesn't use it
	
	systemd-run \
	        -u victim.service --slice=test-protected.slice \
	        -p MemoryLow=0M \
	        /root/memeater -m 50 50 # allocates 50M anon, uses it
	
	echo "Started workloads"
	
	systemctl set-property --runtime test.slice MemoryMax=200M
	systemctl set-property --runtime test-protected.slice MemoryLow=50M
	
	sleep 5
	
	systemd-run \
	        -u pressure.service --slice=test.slice \
	        -p MemorySwapMax=0M \ # to push test-protected.slice to swap
	        /root/memeater -m 170 170
	
	sleep 5
	systemd-cgtop -b -1 -m test.slice

Result with memory_recursiveprot

> Control Group                                                        Tasks   %CPU   Memory  Input/s Output/s
> test.slice                                                               3      -   199.9M        -        -
> test.slice/pressure.service                                              1      -   170.5M        -        -
> test.slice/test-protected.slice                                          2      -    29.4M        -        -
> test.slice/test-protected.slice/victim.service                           1      -    29.1M        -        -
> test.slice/test-protected.slice/precious.service                         1      -   292.0K        -        -

Result without memory_recursiveprot

> Control Group                                                        Tasks   %CPU   Memory  Input/s Output/s
> test.slice                                                               3      -   199.8M        -        -
> test.slice/pressure.service                                              1      -   170.5M        -        -
> test.slice/test-protected.slice                                          2      -    29.3M        -        -
> test.slice/test-protected.slice/precious.service                         1      -    28.7M        -        -
> test.slice/test-protected.slice/victim.service                           1      -   560.0K        -        -

(kernel 5.17.0, systemd 249.10)

So with this result, I'd say the event reporting is an independent change
(admiteddly, thanks to the current implementation (not the proposal of
mine) I noticed this issue).
/me scratches head, let me review my other approaches...


Michal

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2022-03-25 10:31 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-03-22 18:22 [RFC PATCH] mm: memcg: Do not count memory.low reclaim if it does not happen Michal Koutný
2022-03-23 21:44 ` Roman Gushchin
2022-03-24  9:51   ` Michal Koutný
2022-03-24 18:17     ` Roman Gushchin
2022-03-25 10:31       ` Michal Koutný

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).