From: Michal Hocko <mhocko@kernel.org>
To: Paul Furtado <paulfurtado91@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	bugzilla-daemon@bugzilla.kernel.org, linux-mm@kvack.org
Subject: Re: [Bug 207273] New: cgroup with 1.5GB limit and 100MB rss usage OOM-kills processes due to page cache usage after upgrading to kernel 5.4
Date: Wed, 15 Apr 2020 11:44:58 +0200	[thread overview]
Message-ID: <20200415094458.GB4629@dhcp22.suse.cz> (raw)
In-Reply-To: <CAKkCftoa3e3cj2jArO5Ekk68_p6igSu+GzqWDrkWVo0WGcuZ4g@mail.gmail.com>

On Wed 15-04-20 04:34:56, Paul Furtado wrote:
> > You can either try to use cgroup v2, which has a much better memcg-aware
> > dirty throttling implementation, so such a large amount of dirty pages
> > doesn't accumulate in the first place
> 
> I'd love to use cgroup v2, however this is docker + kubernetes so that
> would require a lot of changes on our end to make happen, given how
> recently container runtimes gained cgroup v2 support.
> 
> > I presume you are using defaults for
> > /proc/sys/vm/dirty_{background_}ratio which is a percentage of the
> > available memory. I would recommend using their resp. *_bytes
> > alternatives and use something like 500M for background and 800M for
> > dirty_bytes.
> 
> We're using the defaults right now, however, given that this is a
> containerized environment, it's problematic to set these values too
> low system-wide since the containers all have dedicated volumes with
> varying performance (from as low as 100MB/sec to gigabytes). Looking
> around, I see that there were patches in the past to set per-cgroup
> vm.dirty settings, however it doesn't look like those ever made it
> into the kernel unless I'm missing something.

I am not aware of that work for memcg v1.
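
So for now the tuning would have to stay global. Just to illustrate it
(this is only a sketch, not something from the thread, and untested), the
byte-based thresholds can be set by writing to the standard procfs files,
here with the 500M/800M values suggested above:

#include <stdio.h>
#include <stdlib.h>

/* write a single vm.* sysctl value via procfs */
static int write_sysctl(const char *path, const char *val)
{
	FILE *f = fopen(path, "w");

	if (!f) {
		perror(path);
		return -1;
	}
	fputs(val, f);
	return fclose(f);
}

int main(void)
{
	/* ~500MB background writeback threshold */
	if (write_sysctl("/proc/sys/vm/dirty_background_bytes", "524288000"))
		return EXIT_FAILURE;
	/* ~800MB hard dirty limit */
	if (write_sysctl("/proc/sys/vm/dirty_bytes", "838860800"))
		return EXIT_FAILURE;
	return EXIT_SUCCESS;
}

Writing the *_bytes variants clears the corresponding *_ratio settings, so
the percentage-based defaults no longer apply.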

> In practice, maybe 500M
> and 800M wouldn't be so bad though and may improve latency in other
> ways. The other problem is that this effectively dictates the minimum
> container size for anything that does IO.

Well, this would be a conservative approach, but most allocations will
simply be throttled during reclaim. It is the restricted memory reclaim
context that is the real problem here. I have already brought up why this
is the case in the generic write(2) system call path [1]. Maybe we can
reduce the number of NOFS requests.
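
To make that concrete, here is only a sketch (not an actual patch;
fs_do_transaction() and fs_write_path_example() are made-up names) of what
reducing NOFS requests usually looks like: rather than passing GFP_NOFS to
every allocation in the write path, only the section that really cannot
recurse into the filesystem is marked with the scoped API, so all other
allocations keep the full, dirty-throttling reclaim:

#include <linux/sched/mm.h>
#include <linux/gfp.h>
#include <linux/slab.h>

static int fs_do_transaction(void *buf);	/* hypothetical fs helper */

static int fs_write_path_example(size_t len)
{
	unsigned int nofs_flags;
	void *buf;
	int ret;

	/* outside the critical section: normal reclaim is fine */
	buf = kmalloc(len, GFP_KERNEL);
	if (!buf)
		return -ENOMEM;

	/* allocations inside this scope implicitly behave like GFP_NOFS */
	nofs_flags = memalloc_nofs_save();
	ret = fs_do_transaction(buf);
	memalloc_nofs_restore(nofs_flags);

	kfree(buf);
	return ret;
}

Only the allocations under memalloc_nofs_save() lose the ability to reclaim
(and thus write back) dirty page cache; everything else can throttle
properly.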

> That said, I'll
> still tune these settings in our infrastructure and see how
> things go, but it sounds like something should be done inside the
> kernel to help this situation, since it's so easy to trigger, but
> looking at the threads that led to the commits you referenced, I can
> see that this is complicated.

Yeah, there are certainly things that we should be doing, and reducing
the NOFS allocations is the first step. From my past experience, a
non-trivial share of NOFS usage has turned out to be incorrect (i.e. not
actually needed). I am not sure how much we can do for cgroup v1 though.
If tuning the global dirty thresholds doesn't lead to a better behavior
we can think of a band-aid of some form. Something like this (only
compile tested):

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 05b4ec2c6499..4e1e8d121785 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2532,6 +2536,20 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	if (mem_cgroup_wait_acct_move(mem_over_limit))
 		goto retry;
 
+	/*
+	 * Legacy memcg relies on dirty data throttling during the reclaim
+	 * but this cannot be done for GFP_NOFS requests so we might trigger
+	 * the oom way too early. Throttle here if we have way too many
+	 * dirty/writeback pages.
+	 */
+	if (!cgroup_subsys_on_dfl(memory_cgrp_subsys) && !(gfp_mask & __GFP_FS)) {
+		unsigned long dirty = memcg_page_state(memcg, NR_FILE_DIRTY),
+			      writeback = memcg_page_state(memcg, NR_WRITEBACK);
+
+		if (4 * (dirty + writeback) > 3 * page_counter_read(&memcg->memory))
+			schedule_timeout_interruptible(1);
+	}
+
 	if (nr_retries--)
 		goto retry;
 

[1] http://lkml.kernel.org/r/20200415070228.GW4629@dhcp22.suse.cz
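
And to check whether a stalled or OOMed memcg is really sitting on dirty or
under-writeback pages (i.e. whether the heuristic above would even trigger),
the v1 per-cgroup counters can be read directly. A small sketch, assuming
the usual /sys/fs/cgroup/memory mount point and a hypothetical cgroup path:

#include <stdio.h>
#include <string.h>

/* print the dirty and writeback counters from a v1 memory.stat file */
static void dump_dirty_writeback(const char *memcg_dir)
{
	char path[256], key[64];
	unsigned long long val;
	FILE *f;

	snprintf(path, sizeof(path), "%s/memory.stat", memcg_dir);
	f = fopen(path, "r");
	if (!f) {
		perror(path);
		return;
	}
	while (fscanf(f, "%63s %llu", key, &val) == 2) {
		if (!strcmp(key, "dirty") || !strcmp(key, "writeback"))
			printf("%s: %llu bytes\n", key, val);
	}
	fclose(f);
}

int main(void)
{
	/* hypothetical container cgroup */
	dump_dirty_writeback("/sys/fs/cgroup/memory/mycontainer");
	return 0;
}

If dirty plus writeback dominate the memcg's usage right before the OOM,
that would confirm the missing dirty throttling is the culprit.
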
-- 
Michal Hocko
SUSE Labs


