linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
From: Andrew Morton <akpm@linux-foundation.org>
To: akpm@linux-foundation.org, chris@chrisdown.name, guro@fb.com,
	hannes@cmpxchg.org, linux-mm@kvack.org, mhocko@suse.com,
	mkoutny@suse.com, mm-commits@vger.kernel.org,
	shakeelb@google.com, stable@vger.kernel.org, tj@kernel.org,
	torvalds@linux-foundation.org
Subject: [patch 12/14] Revert "mm: memcontrol: avoid workload stalls when lowering memory.high"
Date: Tue, 09 Feb 2021 13:42:28 -0800	[thread overview]
Message-ID: <20210209214228.djnZUZeMX%akpm@linux-foundation.org> (raw)
In-Reply-To: <20210209134115.4d933d446165cd0ed8977b03@linux-foundation.org>

From: Johannes Weiner <hannes@cmpxchg.org>
Subject: Revert "mm: memcontrol: avoid workload stalls when lowering memory.high"

This reverts commit 536d3bf261a2fc3b05b3e91e7eef7383443015cf, as it can
cause writers to memory.high to get stuck in the kernel forever,
performing page reclaim and consuming excessive amounts of CPU cycles.

Before the patch, a write to memory.high would first put the new limit in
place for the workload, and then reclaim the requested delta.  After the
patch, the kernel tries to reclaim the delta before putting the new limit
into place, in order to not overwhelm the workload with a sudden, large
excess over the limit.  However, if reclaim is actively racing with new
allocations from the uncurbed workload, it can keep the write() working
inside the kernel indefinitely.

This is causing problems in Facebook production.  A privileged
system-level daemon that adjusts memory.high for various workloads running
on a host can get unexpectedly stuck in the kernel and essentially turn
into a sort of involuntary kswapd for one of the workloads.  We've
observed that daemon busy-spin in a write() for minutes at a time,
neglecting its other duties on the system, and expending privileged system
resources on behalf of a workload.

To remedy this, we have first considered changing the reclaim logic to
break out after a couple of loops - whether the workload has converged to
the new limit or not - and bound the write() call this way.  However, the
root cause that inspired the sequence change in the first place has been
fixed through other means, and so a revert back to the proven
limit-setting sequence, also used by memory.max, is preferable.

The sequence was changed to avoid extreme latencies in the workload when
the limit was lowered: the sudden, large excess created by the limit
lowering would erroneously trigger the penalty sleeping code that is meant
to throttle excessive growth from below.  Allocating threads could end up
sleeping long after the write() had already reclaimed the delta for which
they were being punished.

However, erroneous throttling also caused problems in other scenarios at
around the same time.  This resulted in commit b3ff92916af3 ("mm, memcg:
reclaim more aggressively before high allocator throttling"), included in
the same release as the offending commit.  When allocating threads now
encounter large excess caused by a racing write() to memory.high, instead
of entering punitive sleeps, they will simply be tasked with helping
reclaim down the excess, and will be held no longer than it takes to
accomplish that.  This is in line with regular limit enforcement - i.e. 
if the workload allocates up against or over an otherwise unchanged limit
from below.

With the patch breaking userspace, and the root cause addressed by other
means already, revert it again.

Link: https://lkml.kernel.org/r/20210122184341.292461-1-hannes@cmpxchg.org
Fixes: 536d3bf261a2 ("mm: memcontrol: avoid workload stalls when lowering memory.high")
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reported-by: Tejun Heo <tj@kernel.org>
Acked-by: Chris Down <chris@chrisdown.name>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Michal Koutný <mkoutny@suse.com>
Cc: <stable@vger.kernel.org>	[5.8+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/memcontrol.c |    5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

--- a/mm/memcontrol.c~revert-mm-memcontrol-avoid-workload-stalls-when-lowering-memoryhigh
+++ a/mm/memcontrol.c
@@ -6271,6 +6271,8 @@ static ssize_t memory_high_write(struct
 	if (err)
 		return err;
 
+	page_counter_set_high(&memcg->memory, high);
+
 	for (;;) {
 		unsigned long nr_pages = page_counter_read(&memcg->memory);
 		unsigned long reclaimed;
@@ -6294,10 +6296,7 @@ static ssize_t memory_high_write(struct
 			break;
 	}
 
-	page_counter_set_high(&memcg->memory, high);
-
 	memcg_wb_domain_size_changed(memcg);
-
 	return nbytes;
 }
 
_


  parent reply	other threads:[~2021-02-09 21:42 UTC|newest]

Thread overview: 24+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-02-09 21:41 incoming Andrew Morton
2021-02-09 21:41 ` [patch 01/14] squashfs: avoid out of bounds writes in decompressors Andrew Morton
2021-02-09 21:41 ` [patch 02/14] squashfs: add more sanity checks in id lookup Andrew Morton
2021-02-09 21:41 ` [patch 03/14] squashfs: add more sanity checks in inode lookup Andrew Morton
2021-02-09 21:42 ` [patch 04/14] squashfs: add more sanity checks in xattr id lookup Andrew Morton
2021-02-09 21:42 ` [patch 05/14] kasan: fix stack traces dependency for HW_TAGS Andrew Morton
2021-02-09 21:42 ` [patch 06/14] firmware_loader: align .builtin_fw to 8 Andrew Morton
2021-02-09 21:42 ` [patch 07/14] mm/mremap: fix BUILD_BUG_ON() error in get_extent Andrew Morton
2021-02-09 21:42 ` [patch 08/14] tmpfs: disallow CONFIG_TMPFS_INODE64 on s390 Andrew Morton
2021-02-09 21:42 ` [patch 09/14] tmpfs: disallow CONFIG_TMPFS_INODE64 on alpha Andrew Morton
2021-02-09 22:03   ` Linus Torvalds
2021-02-10 13:34     ` Heiko Carstens
2021-02-10 17:27       ` Heiko Carstens
2021-02-10 19:17       ` Linus Torvalds
2021-02-10 19:55         ` Arnd Bergmann
2021-02-11 18:45         ` Heiko Carstens
2021-02-09 21:42 ` [patch 10/14] selftests/vm: rename file run_vmtests to run_vmtests.sh Andrew Morton
2021-02-09 21:42 ` [patch 11/14] MAINTAINERS: update Andrey Ryabinin's email address Andrew Morton
2021-02-09 21:42 ` Andrew Morton [this message]
2021-02-09 21:42 ` [patch 13/14] mm, slub: better heuristic for number of cpus when calculating slab order Andrew Morton
2021-02-10 14:34   ` Vlastimil Babka
2021-02-10 19:22     ` Linus Torvalds
2021-02-09 21:42 ` [patch 14/14] nilfs2: make splice write available again Andrew Morton
2021-02-10 19:30 ` incoming Linus Torvalds

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20210209214228.djnZUZeMX%akpm@linux-foundation.org \
    --to=akpm@linux-foundation.org \
    --cc=chris@chrisdown.name \
    --cc=guro@fb.com \
    --cc=hannes@cmpxchg.org \
    --cc=linux-mm@kvack.org \
    --cc=mhocko@suse.com \
    --cc=mkoutny@suse.com \
    --cc=mm-commits@vger.kernel.org \
    --cc=shakeelb@google.com \
    --cc=stable@vger.kernel.org \
    --cc=tj@kernel.org \
    --cc=torvalds@linux-foundation.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).