From: "Michal Koutný" <mkoutny@suse.com>
To: Andrew Morton <akpm@linux-foundation.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>,
	David Vernet <void@manifault.com>,
	tj@kernel.org, roman.gushchin@linux.dev,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	cgroups@vger.kernel.org, mhocko@kernel.org, shakeelb@google.com,
	kernel-team@fb.com, Richard Palethorpe <rpalethorpe@suse.com>,
	Chris Down <chris@chrisdown.name>
Subject: Re: [PATCH v2 2/5] cgroup: Account for memory_recursiveprot in test_memcg_low()
Date: Tue, 10 May 2022 19:43:41 +0200	[thread overview]
Message-ID: <20220510174341.GC24172@blackbody.suse.cz> (raw)
In-Reply-To: <20220509174424.e43e695ffe0f7333c187fba8@linux-foundation.org>

Hello all.

On Mon, May 09, 2022 at 05:44:24PM -0700, Andrew Morton <akpm@linux-foundation.org> wrote:
> So I think we're OK with [2/5] now.  Unless there be objections, I'll
> be looking to get this series into mm-stable later this week.

I'm sorry, but I think the current form of the test reveals unexpected
behavior of reclaim, and silencing the test is not the way to go.
That said, I may be convinced that my understanding is wrong.


On Mon, May 09, 2022 at 11:09:15AM -0400, Johannes Weiner <hannes@cmpxchg.org> wrote:
> My understanding of the issue you're raising, Michal, is that
> protected siblings start with current > low, then get reclaimed
> slightly too much and end up with current < low. This results in a
> tiny bit of float that then gets assigned to the low=0 sibling; 

Up until here, we're on the same page.

> when that sibling gets reclaimed regardless, it sees a low event.
> Correct me if I missed a detail or nuance here.

Here, I'd like to stress that the event itself is just a messenger
(which my original RFC patch attempted to get rid of). The problem is
that if the sibling with recursive protection is active enough to claim
the protection, it is effectively stolen from the passive sibling. See
the comparison of 'precious' vs 'victim' in [1].

> But unused float going to siblings is intentional. This is documented
> in point 3 in the comment above effective_protection(): if you use
> less than you're legitimately claiming, the float goes to your
> siblings.

The problem is how the unused protection came to be unused (voluntarily
not consumed vs. reclaimed away).
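For concreteness, here is a simplified Python model of how that float is
handed out, loosely following the shape of effective_protection() in
mm/memcontrol.c. The function name and parameters are my paraphrase,
not the kernel code verbatim, and rounding/corner cases differ:

```python
def effective_low(usage, low, parent_usage, parent_eff, siblings_protected):
    """Sketch of effective_protection() (mm/memcontrol.c), simplified.

    siblings_protected = sum over all siblings of min(usage, low).
    """
    protected = min(usage, low)          # this child's own claim
    if siblings_protected > parent_eff:
        # Claims overcommit the parent: scale them down proportionally.
        ep = parent_eff * protected // siblings_protected
    else:
        ep = protected
        # Point 3: protection the siblings did not claim ("float") goes
        # to children that use more than they claim (recursive prot).
        if parent_eff > siblings_protected and usage > protected:
            unclaimed = parent_eff - siblings_protected
            unclaimed = unclaimed * (usage - protected) \
                // (parent_usage - siblings_protected)
            ep += unclaimed
    return ep

MiB = 1 << 20
# s1 (low=50M) was overreclaimed slightly below its low; s2 has low=0:
s2_eff = effective_low(usage=50 * MiB, low=0,
                       parent_usage=100 * MiB, parent_eff=50 * MiB,
                       siblings_protected=50 * MiB - 128 * 1024)
print(s2_eff)   # nonzero: s2 picks up the protection s1 lost to reclaim
```

The point of the sketch: once s1 is reclaimed below its low, its claim
shrinks, and the difference shows up as a nonzero effective low for s2.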

> So the problem doesn't seem to be with low accounting and
> event generation, but rather it's simply overreclaim.

Exactly.

> It's conceivable to make reclaim more precise and then tighten up the
> test. But right now, David's patch looks correct to me.

The obvious fix is at the end of this message; it resolves the case I
posted earlier (with memory_recursiveprot). However, it "breaks"
memory.events:low accounting inside recursive children, hence I don't
consider it finished. (I can elaborate on the breaking case if there is
interest; I also need to look into it more myself.)
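Numerically, the overreclaim the fix removes looks like this: a toy
Python rendition of the proportional formula in get_scan_count(). The
sizes are made up, and lruvec_size vs. cgroup_size are conflated for
simplicity, so treat it as a sketch, not the kernel's exact arithmetic:

```python
SWAP_CLUSTER_MAX = 32   # pages, as defined in include/linux/swap.h

def scan_target(lruvec_size, protection, cgroup_size, rounded=True):
    # Proportional scan target from get_scan_count() in mm/vmscan.c:
    # scan less the closer usage is to the protected amount.
    scan = lruvec_size - lruvec_size * protection // (cgroup_size + 1)
    return max(scan, SWAP_CLUSTER_MAX) if rounded else scan

pages = 50 * 256 + 4     # 50 MiB of 4 KiB pages, plus a few pages of epsilon
low = 50 * 256           # memory.low in pages

print(scan_target(pages, low, pages))         # with rounding: 32
print(scan_target(pages, low, pages, False))  # without rounding: 5
```

So even when s1 exceeds its protection by only a few pages, at least 32
pages are scanned from it, and the difference later resurfaces as float
for s2.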


On Fri, May 06, 2022 at 09:40:15AM -0700, David Vernet <void@manifault.com> wrote:
> If you look at how much memory A/B/E gets at the end of the reclaim,
> it's still far less than 1MB (though should it be 0?).

This selftest has two roughly equal workloads in the siblings; however,
if their activity varies, the outcome can even be the opposite (see the
example in [1]).

> This definitely sounds to me like a useful testcase to add, and I'm
> happy to do so in a follow-on patch. If we added this, do you think
> we need to keep the check for memory.low events for the memory.low ==
> 0 child in the overcommit testcase?

I think it's still useful, to check the behavior when inherited and
explicit siblings coexist under a protected parent.
Actually, the second case, with all siblings having the inherited
(implicit) protection, is also interesting (it seems that's what I'm
seeing in my tests with the attached patch).

+Cc: Chris, who reasoned about the SWAP_CLUSTER_MAX rounding vs. too
high reclaim priority (numerically too low, IIUC) [2].
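The priority connection can be made concrete with a bit of arithmetic
(hypothetical lruvec size; the shift is the raw target from
get_scan_count() before protection is applied):

```python
import math

SWAP_CLUSTER_MAX = 32   # pages

# The raw scan target is lruvec_size >> sc->priority, so each priority
# decrement doubles it; lifting a 1-page target up to SWAP_CLUSTER_MAX
# is therefore worth about log2(32) decrements:
levels_saved = math.ceil(math.log2(SWAP_CLUSTER_MAX))
print(levels_saved)   # 5

# For a hypothetical 50 MiB lruvec, these are the priorities at which
# the raw target would fall below 32 pages, i.e. where rounding kicks in:
lruvec_size = 12800   # pages
below = [p for p in range(13) if (lruvec_size >> p) < SWAP_CLUSTER_MAX]
print(below)          # [9, 10, 11, 12]
```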

Michal

[1] https://lore.kernel.org/r/20220325103118.GC2828@blackbody.suse.cz/
[2] https://lore.kernel.org/all/20190128214213.GB15349@chrisdown.name/

--- 8< ---
From e18caf7a5a1b0f39185fbdc11e4034def42cde88 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Michal=20Koutn=C3=BD?= <mkoutny@suse.com>
Date: Tue, 10 May 2022 18:48:31 +0200
Subject: [RFC PATCH] mm: memcg: Do not overreclaim SWAP_CLUSTER_MAX from
 protected memcg
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

This was observed with the memcontrol selftest / the new LTP test but
can also be reproduced in a simplified setup of two siblings:

	`parent .low=50M
	  ` s1	.low=50M  .current=50M+ε
	  ` s2  .low=0M   .current=50M

The expectation is that s2/memory.events:low stays zero under the outer
reclaimer, since no protection should be given to cgroup s2 (even with
memory_recursiveprot).

However, this does not happen. The apparent reason is that when s1 is
considered for (proportional) reclaim, the scan target is rounded up to
SWAP_CLUSTER_MAX and a slightly over-proportional amount is reclaimed.
Consequently, when the effective low value of s2 is calculated, it
observes the parent's protection left unclaimed by s1
(SWAP_CLUSTER_MAX - ε in theory) and effectively appropriates it.

What is worse, when the sibling s2 runs a more memory-greedy workload,
it can repeatedly "steal" the protection from s1, and the distribution
ends up with s1 mostly reclaimed despite its explicit prioritization
over s2.

Simply fix it by _not_ rounding the scan target up to SWAP_CLUSTER_MAX.
The rounding would have saved us only ~5 levels of reclaim priority,
i.e. we may have been reclaiming from protected memcgs at relatively low
priority _without_ counting any memory.events:low (due to the
overreclaim). Now, if the moderated scan is not enough, we must bring
the priority down to zero to open the protected reserves. And that is
correct: we want to be explicit when reclaiming those.


Fixes: 8a931f801340 ("mm: memcontrol: recursive memory.low protection")
Fixes: 9783aa9917f8 ("mm, memcg: proportional memory.{low,min} reclaim")
Reported-by: Richard Palethorpe <rpalethorpe@suse.com>
Link: https://lore.kernel.org/all/20220321101429.3703-1-rpalethorpe@suse.com/
Signed-off-by: Michal Koutný <mkoutny@suse.com>
---
 mm/vmscan.c | 7 -------
 1 file changed, 7 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 1678802e03e7..cd760842b9ad 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2798,13 +2798,6 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
 
 			scan = lruvec_size - lruvec_size * protection /
 				(cgroup_size + 1);
-
-			/*
-			 * Minimally target SWAP_CLUSTER_MAX pages to keep
-			 * reclaim moving forwards, avoiding decrementing
-			 * sc->priority further than desirable.
-			 */
-			scan = max(scan, SWAP_CLUSTER_MAX);
 		} else {
 			scan = lruvec_size;
 		}
-- 
2.35.3



  reply	other threads:[~2022-05-10 17:43 UTC|newest]

Thread overview: 70+ messages
2022-04-23 15:56 [PATCH v2 0/5] Fix bugs in memcontroller cgroup tests David Vernet
2022-04-23 15:56 ` [PATCH v2 1/5] cgroups: Refactor children cgroups in memcg tests David Vernet
2022-04-26  1:56   ` Roman Gushchin
2022-04-23 15:56 ` [PATCH v2 2/5] cgroup: Account for memory_recursiveprot in test_memcg_low() David Vernet
2022-04-27 14:09   ` Michal Koutný
2022-04-29  1:03     ` David Vernet
2022-04-29  9:26       ` Michal Koutný
2022-05-06 16:40         ` David Vernet
2022-05-09 15:09           ` Johannes Weiner
2022-05-10  0:44             ` Andrew Morton
2022-05-10 17:43               ` Michal Koutný [this message]
2022-05-11 17:53                 ` Johannes Weiner
2022-05-12 17:27                   ` Michal Koutný
2022-04-23 15:56 ` [PATCH v2 3/5] cgroup: Account for memory_localevents in test_memcg_oom_group_leaf_events() David Vernet
2022-04-23 15:56 ` [PATCH v2 4/5] cgroup: Removing racy check in test_memcg_sock() David Vernet
2022-04-23 15:56 ` [PATCH v2 5/5] cgroup: Fix racy check in alloc_pagecache_max_30M() helper function David Vernet
2022-05-12 17:04 ` [PATCH v2 0/5] Fix bugs in memcontroller cgroup tests Michal Koutný
2022-05-12 17:30   ` David Vernet
2022-05-12 17:44     ` David Vernet
2022-05-13 17:18       ` [PATCH 0/4] memcontrol selftests fixups Michal Koutný
2022-05-13 17:18         ` [PATCH 1/4] selftests: memcg: Fix compilation Michal Koutný
2022-05-13 17:40           ` David Vernet
2022-05-13 18:53           ` Roman Gushchin
2022-05-13 19:09             ` Roman Gushchin
2022-05-13 17:18         ` [PATCH 2/4] selftests: memcg: Expect no low events in unprotected sibling Michal Koutný
2022-05-13 17:42           ` David Vernet
2022-05-13 18:54           ` Roman Gushchin
2022-05-18 15:54             ` Michal Koutný
2022-05-13 17:18         ` [PATCH 3/4] selftests: memcg: Adjust expected reclaim values of protected cgroups Michal Koutný
2022-05-13 18:52           ` Roman Gushchin
2022-05-13 17:18         ` [PATCH 4/4] selftests: memcg: Remove protection from top level memcg Michal Koutný
2022-05-13 18:59           ` Roman Gushchin
2022-05-18  0:24             ` Andrew Morton
2022-05-18  0:52               ` Roman Gushchin
2022-05-18 15:44                 ` Michal Koutný
2022-05-13 19:14           ` David Vernet
