From: "Michal Koutný" <mkoutny@suse.com>
To: Andrew Morton <akpm@linux-foundation.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>,
	David Vernet <void@manifault.com>, tj@kernel.org,
	roman.gushchin@linux.dev, linux-kernel@vger.kernel.org,
	linux-mm@kvack.org, cgroups@vger.kernel.org, mhocko@kernel.org,
	shakeelb@google.com, kernel-team@fb.com,
	Richard Palethorpe <rpalethorpe@suse.com>,
	Chris Down <chris@chrisdown.name>
Subject: Re: [PATCH v2 2/5] cgroup: Account for memory_recursiveprot in test_memcg_low()
Date: Tue, 10 May 2022 19:43:41 +0200
Message-ID: <20220510174341.GC24172@blackbody.suse.cz>
In-Reply-To: <20220509174424.e43e695ffe0f7333c187fba8@linux-foundation.org>

Hello all.

On Mon, May 09, 2022 at 05:44:24PM -0700, Andrew Morton <akpm@linux-foundation.org> wrote:
> So I think we're OK with [2/5] now. Unless there be objections, I'll
> be looking to get this series into mm-stable later this week.

I'm sorry, I think the current form of the test reveals an unexpected
behavior of reclaim and silencing the test is not the way to go.
Although, I may be convinced that my understanding is wrong.

On Mon, May 09, 2022 at 11:09:15AM -0400, Johannes Weiner <hannes@cmpxchg.org> wrote:
> My understanding of the issue you're raising, Michal, is that
> protected siblings start with current > low, then get reclaimed
> slightly too much and end up with current < low. This results in a
> tiny bit of float that then gets assigned to the low=0 sibling;

Up until here, we're on the same page.

> when that sibling gets reclaimed regardless, it sees a low event.
> Correct me if I missed a detail or nuance here.

Here, I'd like to stress that the event itself is just a messenger (whom
my original RFC patch attempted to get rid of). The problem is that if
the sibling with recursive protection is active enough to claim it, it's
effectively stolen from the passive sibling. See the comparison of
'precious' vs 'victim' in [1].

> But unused float going to siblings is intentional. This is documented
> in point 3 in the comment above effective_protection(): if you use
> less than you're legitimately claiming, the float goes to your
> siblings.

The problem is how the unused protection came to be (voluntarily not
consumed vs reclaimed).

> So the problem doesn't seem to be with low accounting and
> event generation, but rather it's simply overreclaim.

Exactly.

> It's conceivable to make reclaim more precise and then tighten up the
> test. But right now, David's patch looks correct to me.

The obvious fix is at the end of this message. It resolves the case I
posted earlier (with memory_recursiveprot); however, it "breaks"
memory.events:low accounting inside recursive children, hence I'm not
considering it finished. (I may elaborate on the breaking case if
interested, I also need to look more into that myself.)

On Fri, May 06, 2022 at 09:40:15AM -0700, David Vernet <void@manifault.com> wrote:
> If you look at how much memory A/B/E gets at the end of the reclaim,
> it's still far less than 1MB (though should it be 0?).

This selftest has two ±equal workloads in siblings; however, if their
activity varies, it can even end up the opposite way (the example [1]).

> This definitely sounds to me like a useful testcase to add, and I'm
> happy to do so in a follow-on patch. If we added this, do you think
> we need to keep the check for memory.low events for the memory.low ==
> 0 child in the overcommit testcase?

I think it's still useful, to check the behavior when inherited vs
explicit siblings coexist under a protected parent. Actually, the second
case of all siblings having the inherited (implicit) protection is also
interesting (it seems that's what I'm seeing in my tests with the
attached patch).

+Cc: Chris, who reasoned about the SWAP_CLUSTER_MAX rounding vs too high
priority (too low numerically IIUC) [2].

Michal

[1] https://lore.kernel.org/r/20220325103118.GC2828@blackbody.suse.cz/
[2] https://lore.kernel.org/all/20190128214213.GB15349@chrisdown.name/

--- 8< ---
From e18caf7a5a1b0f39185fbdc11e4034def42cde88 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Michal=20Koutn=C3=BD?= <mkoutny@suse.com>
Date: Tue, 10 May 2022 18:48:31 +0200
Subject: [RFC PATCH] mm: memcg: Do not overreclaim SWAP_CLUSTER_MAX from protected memcg
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

This was observed with the memcontrol selftest/new LTP test but can also
be reproduced in a simplified setup of two siblings:

	`parent  .low=50M
	  ` s1   .low=50M  .current=50M+ε
	  ` s2   .low=0M   .current=50M

The expectation is that s2/memory.events:low will be zero under an outer
reclaimer since no protection should be given to cgroup s2 (even with
memory_recursiveprot).

However, this does not happen. The apparent reason is that when s1 is
considered for (proportional) reclaim, the scanned proportion is rounded
up to SWAP_CLUSTER_MAX and a slightly over-proportional amount is
reclaimed. Consequently, when the effective low value of s2 is
calculated, it observes unclaimed parent's protection from s1
(ε-SWAP_CLUSTER_MAX in theory) and effectively appropriates it.

What is worse, when the sibling s2 has a more (memory) greedy workload,
it can repeatedly "steal" the protection from s1 and the distribution
ends up with s1 mostly reclaimed despite explicit prioritization over
s2.

Simply fix it by _not_ rounding up to SWAP_CLUSTER_MAX. The rounding
would have saved us ~5 levels of reclaim priority. I.e. we may be
reclaiming from protected memcgs at relatively low priority _without_
counting any memory.events:low (due to overreclaim). Now, if the
moderated scan is not enough, we must bring priority to zero to open
protected reserves. And that's correct, we want to be explicit when
reclaiming those.

Fixes: 8a931f801340 ("mm: memcontrol: recursive memory.low protection")
Fixes: 9783aa9917f8 ("mm, memcg: proportional memory.{low,min} reclaim")
Reported-by: Richard Palethorpe <rpalethorpe@suse.com>
Link: https://lore.kernel.org/all/20220321101429.3703-1-rpalethorpe@suse.com/
Signed-off-by: Michal Koutný <mkoutny@suse.com>
---
 mm/vmscan.c | 7 -------
 1 file changed, 7 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 1678802e03e7..cd760842b9ad 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2798,13 +2798,6 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,

 			scan = lruvec_size - lruvec_size * protection /
 				(cgroup_size + 1);
-
-			/*
-			 * Minimally target SWAP_CLUSTER_MAX pages to keep
-			 * reclaim moving forwards, avoiding decrementing
-			 * sc->priority further than desirable.
-			 */
-			scan = max(scan, SWAP_CLUSTER_MAX);
 		} else {
 			scan = lruvec_size;
 		}
-- 
2.35.3
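
For illustration only, the arithmetic touched by the hunk above can be modelled in user space. This is a rough sketch and not the kernel code: protected_scan() and first_scanning_priority() are hypothetical helper names, all sizes are in pages, and the sketch assumes 4 KiB pages, SWAP_CLUSTER_MAX == 32, DEF_PRIORITY == 12, a single LRU list holding all of s1's memory, and that the scan target is subsequently shifted right by sc->priority as in mainline get_scan_count().

```c
#include <assert.h>
#include <stdbool.h>

#define SWAP_CLUSTER_MAX 32UL	/* assumed value, as in mainline */
#define DEF_PRIORITY     12	/* assumed value, as in mainline */

/*
 * Simplified model of the protected branch of get_scan_count().
 * round_up == true models the code before the patch, false after it.
 */
unsigned long protected_scan(unsigned long lruvec_size,
			     unsigned long cgroup_size,
			     unsigned long protection,
			     bool round_up)
{
	unsigned long scan;

	/* Model restriction: one LRU list, so it cannot exceed usage. */
	assert(lruvec_size <= cgroup_size);

	/* Usage is never treated as smaller than the protection. */
	if (cgroup_size < protection)
		cgroup_size = protection;

	scan = lruvec_size - lruvec_size * protection / (cgroup_size + 1);

	if (round_up && scan < SWAP_CLUSTER_MAX)
		scan = SWAP_CLUSTER_MAX;

	return scan;
}

/*
 * Numerically highest priority (i.e. the earliest reclaim pass) at
 * which "target >> priority" is non-zero, or -1 if there is none.
 */
int first_scanning_priority(unsigned long target)
{
	for (int prio = DEF_PRIORITY; prio >= 0; prio--)
		if (target >> prio)
			return prio;
	return -1;
}
```

With s1 at low=50M (12800 pages) and ε = 8 pages, protected_scan(12808, 12808, 12800, false) yields 9 pages while the rounded variant yields 32, so under this model s1 can be pushed further below its low and leave float for s2. With ε = 0 the unrounded target is 1 page, which only survives the shift at priority 0, whereas a target of 32 pages already survives at priority 5; that matches the "~5 levels" mentioned in the commit message.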