Date: Wed, 15 Apr 2020 11:44:58 +0200
From: Michal Hocko
To: Paul Furtado
Cc: Andrew Morton, bugzilla-daemon@bugzilla.kernel.org, linux-mm@kvack.org
Subject: Re: [Bug 207273] New: cgroup with 1.5GB limit and 100MB rss usage OOM-kills processes due to page cache usage after upgrading to kernel 5.4
Message-ID: <20200415094458.GB4629@dhcp22.suse.cz>
References: <20200414212558.58eaab4de2ecf864eaa87e5d@linux-foundation.org> <20200415065059.GV4629@dhcp22.suse.cz>

On Wed 15-04-20 04:34:56, Paul Furtado wrote:
> > You can either try to use cgroup v2 which has much better memcg aware dirty
> > throttling implementation so such a large amount of dirty pages doesn't
> > accumulate in the first place
>
> I'd love to use cgroup v2, however this is docker + kubernetes so that
> would require a lot of changes on our end to make happen, given how
> recently container runtimes gained cgroup v2 support.
>
> > I presume you are using the defaults for
> > /proc/sys/vm/dirty_{background_}ratio, which are a percentage of the
> > available memory. I would recommend using their resp. *_bytes
> > alternatives and use something like 500M for background and 800M for
> > dirty_bytes.
>
> We're using the defaults right now. However, given that this is a
> containerized environment, it's problematic to set these values too
> low system-wide since the containers all have dedicated volumes with
> varying performance (from as low as 100MB/sec to gigabytes). Looking
> around, I see that there were patches in the past to set per-cgroup
> vm.dirty settings, however it doesn't look like those ever made it
> into the kernel unless I'm missing something.

I am not aware of that work for memcg v1.

> In practice, maybe 500M
> and 800M wouldn't be so bad though and may improve latency in other
> ways. The other problem is that this also sets an upper bound on the
> minimum container size for anything that does do IO.

Well, this would be a conservative approach, but most allocations will
simply be throttled during reclaim. It is the restricted memory reclaim
context that is the bummer here. I have already brought up why this is
the case in the generic write(2) system call path [1]. Maybe we can
reduce the amount of NOFS requests.

> That said, I'll still tune these settings in our infrastructure and
> see how things go, but it sounds like something should be done inside
> the kernel to help this situation, since it's so easy to trigger, but
> looking at the threads that led to the commits you referenced, I can
> see that this is complicated.

Yeah, there are certainly things that we should be doing, and reducing
the NOFS allocations is the first step. From my past experience, a
non-trivial part of that usage has turned out to be incorrect. I am not
sure how much we can do for cgroup v1 though. If tuning the global dirty
thresholds (see the example below) doesn't lead to a better behavior, we
can think of a band aid of some form.
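For reference, applying the tuning suggested above would boil down to
something like this (a sketch only - the 500M/800M numbers are just the
values mentioned earlier, not a validated recommendation; note that
writing the *_bytes files makes the kernel ignore the corresponding
*_ratio settings):

  # 500M background writeback threshold, 800M hard dirty limit
  sysctl -w vm.dirty_background_bytes=524288000
  sysctl -w vm.dirty_bytes=838860800

The same values can go into /etc/sysctl.conf (or a sysctl.d snippet)
to make them persistent across reboots.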
Something like this (only compile tested):

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 05b4ec2c6499..4e1e8d121785 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2532,6 +2536,20 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	if (mem_cgroup_wait_acct_move(mem_over_limit))
 		goto retry;
 
+	/*
+	 * Legacy memcg relies on dirty data throttling during the reclaim
+	 * but this cannot be done for GFP_NOFS requests so we might trigger
+	 * the oom way too early. Throttle here if we have way too many
+	 * dirty/writeback pages.
+	 */
+	if (!cgroup_subsys_on_dfl(memory_cgrp_subsys) && !(gfp_mask & __GFP_FS)) {
+		unsigned long dirty = memcg_page_state(memcg, NR_FILE_DIRTY),
+			      writeback = memcg_page_state(memcg, NR_WRITEBACK);
+
+		if (4*(dirty + writeback) > 3* page_counter_read(&memcg->memory))
+			schedule_timeout_interruptible(1);
+	}
+
 	if (nr_retries--)
 		goto retry;

[1] http://lkml.kernel.org/r/20200415070228.GW4629@dhcp22.suse.cz

-- 
Michal Hocko
SUSE Labs