Date: Wed, 11 Aug 2021 19:41:23 +0000
From: Anchal Agarwal
To: Michal Hocko
Cc: Paul Furtado, Andrew Morton, ...
Subject: Re: [Bug 207273] New: cgroup with 1.5GB limit and 100MB rss usage OOM-kills processes due to page cache usage after upgrading to kernel 5.4
Message-ID: <20210811194123.GA8198@dev-dsk-anchalag-2a-9c2d1d96.us-west-2.amazon.com>
References: <20200414212558.58eaab4de2ecf864eaa87e5d@linux-foundation.org>
 <20200415065059.GV4629@dhcp22.suse.cz>
 <20200415094458.GB4629@dhcp22.suse.cz>
 <20210806204246.GA21051@dev-dsk-anchalag-2a-9c2d1d96.us-west-2.amazon.com>
In-Reply-To: <20210806204246.GA21051@dev-dsk-anchalag-2a-9c2d1d96.us-west-2.amazon.com>

On Fri, Aug 06, 2021 at 08:42:46PM +0000, Anchal Agarwal wrote:
> On Wed, Apr 15, 2020 at 11:44:58AM +0200, Michal Hocko wrote:
> > On Wed 15-04-20 04:34:56, Paul Furtado wrote:
> > > > You can either try to use cgroup v2, which has a much better memcg-aware dirty
> > > > throttling implementation, so such a large amount of dirty pages doesn't
> > > > accumulate in the first place.
> > >
> > > I'd love to use cgroup v2, however this is docker + kubernetes, so that
> > > would require a lot of changes on our end to make happen, given how
> > > recently container runtimes gained cgroup v2 support.
> > >
> > > > I presume you are using the defaults for
> > > > /proc/sys/vm/dirty_{background_}ratio, which are a percentage of the
> > > > available memory. I would recommend using their respective *_bytes
> > > > alternatives and using something like 500M for background and 800M for
> > > > dirty_bytes.
> > >
> > > We're using the defaults right now; however, given that this is a
> > > containerized environment, it's problematic to set these values too
> > > low system-wide since the containers all have dedicated volumes with
> > > varying performance (from as low as 100MB/sec to gigabytes). Looking
> > > around, I see that there were patches in the past to set per-cgroup
> > > vm.dirty settings, but it doesn't look like those ever made it
> > > into the kernel, unless I'm missing something.
> >
> > I am not aware of that work for memcg v1.
> >
> > > In practice, maybe 500M
> > > and 800M wouldn't be so bad though, and may improve latency in other
> > > ways. The other problem is that this also sets an upper bound on the
> > > minimum container size for anything that does do IO.
> >
> > Well, this would be a conservative approach, but most allocations will
> > simply be throttled during reclaim. It is the restricted memory reclaim
> > context that is the bummer here. I have already brought up why this is
> > the case in the generic write(2) system call path [1]. Maybe we can
> > reduce the amount of NOFS requests.
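
As an aside for anyone reproducing this: whether dirty and writeback pages are
what is pinning a memcg at its limit (which is also the condition the band-aid
below keys off) can be watched from userspace. A minimal sketch, assuming a
standard cgroup v1 memory controller mount; the cgroup path is only
illustrative:

  # Path is illustrative; substitute the actual pod/container memcg directory.
  CG=/sys/fs/cgroup/memory/kubepods/burstable/<pod-uid>/<container-id>
  # Poll the dirty/writeback counters alongside current usage and the limit.
  watch -n1 "grep -E '^(dirty|writeback) ' $CG/memory.stat; cat $CG/memory.usage_in_bytes $CG/memory.limit_in_bytes"
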
> >
> > > That said, I'll
> > > still tune these settings in our infrastructure and see how
> > > things go, but it sounds like something should be done inside the
> > > kernel to help this situation, since it's so easy to trigger; but
> > > looking at the threads that led to the commits you referenced, I can
> > > see that this is complicated.
> >
> > Yeah, there are certainly things that we should be doing, and reducing
> > the NOFS allocations is the first step. From my past experience, a
> > non-trivial amount of that usage has turned out to be incorrect. I am
> > not sure how much we can do for cgroup v1, though. If tuning the global
> > dirty thresholds doesn't lead to better behavior, we can think of a
> > band-aid of some form. Something like this (only compile tested):
> >
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index 05b4ec2c6499..4e1e8d121785 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -2532,6 +2536,20 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
> >  	if (mem_cgroup_wait_acct_move(mem_over_limit))
> >  		goto retry;
> >  
> > +	/*
> > +	 * Legacy memcg relies on dirty data throttling during the reclaim
> > +	 * but this cannot be done for GFP_NOFS requests so we might trigger
> > +	 * the oom way too early. Throttle here if we have way too many
> > +	 * dirty/writeback pages.
> > +	 */
> > +	if (!cgroup_subsys_on_dfl(memory_cgrp_subsys) && !(gfp_mask & __GFP_FS)) {
> > +		unsigned long dirty = memcg_page_state(memcg, NR_FILE_DIRTY),
> > +			      writeback = memcg_page_state(memcg, NR_WRITEBACK);
> > +
> > +		if (4 * (dirty + writeback) > 3 * page_counter_read(&memcg->memory))
> > +			schedule_timeout_interruptible(1);
> > +	}
> > +
> >  	if (nr_retries--)
> >  		goto retry;
> >
> > [1] http://lkml.kernel.org/r/20200415070228.GW4629@dhcp22.suse.cz
> > --
> > Michal Hocko
> > SUSE Labs
>
> Hi Michal,
> Following up on my conversation from bugzilla here:
> I am currently seeing the same issue when migrating a container from 4.14 to
> 5.4+ kernels. I tested this patch with a configuration where the application
> reaches the cgroup memory limit while doing IO. The issue is similar to the
> one described at https://bugzilla.kernel.org/show_bug.cgi?id=207273, where we
> see OOMs in the write syscall due to restricted memory reclaim.
> I tested your patch; however, I had to increase the timeout from 1 to 10
> jiffies (i.e. schedule_timeout_interruptible(10) in the hunk above) to make it
> work, and with that it works for both 5.4 and 5.10 on my workload.
> I also tried adjusting the dirty_bytes* knobs (sketched below) and that worked
> after some tuning; however, there is no single set of values that suits all
> use cases. Hence changing those defaults does not look like a viable option
> for me, since I cannot expect one setting to work for every kind of workload.
> I think working out a fix in the kernel may be a better option, because this
> issue will show up in so many use cases where applications that are used to
> the old kernel behavior suddenly start failing on newer ones.
> I see the same stack trace on the 4.19 kernel too.
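
For reference, the dirty_bytes tuning mentioned above amounts to switching from
the percentage-based defaults to absolute byte thresholds. A minimal sketch
using the 500M/800M starting points suggested earlier in the thread, not values
tuned for any particular workload:

  # 500M/800M are the starting points suggested above, not tuned numbers;
  # writing the *_bytes knobs implicitly zeroes their *_ratio counterparts.
  sysctl -w vm.dirty_background_bytes=$((500 * 1024 * 1024))
  sysctl -w vm.dirty_bytes=$((800 * 1024 * 1024))
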
>
> Here is the stack trace:
>
> dd invoked oom-killer: gfp_mask=0x101c4a(GFP_NOFS|__GFP_HIGHMEM|__GFP_HARDWALL|__GFP_MOVABLE|__GFP_WRITE), order=0, oom_score_adj=997
> CPU: 0 PID: 28766 Comm: dd Not tainted 5.4.129-62.227.amzn2.x86_64 #1
> Hardware name: Amazon EC2 m5.large/, BIOS 1.0 10/16/2017
> Call Trace:
>  dump_stack+0x50/0x6b
>  dump_header+0x4a/0x200
>  oom_kill_process+0xd7/0x110
>  out_of_memory+0x105/0x510
>  mem_cgroup_out_of_memory+0xb5/0xd0
>  try_charge+0x766/0x7c0
>  mem_cgroup_try_charge+0x70/0x190
>  __add_to_page_cache_locked+0x355/0x390
>  ? scan_shadow_nodes+0x30/0x30
>  add_to_page_cache_lru+0x4a/0xc0
>  pagecache_get_page+0xf5/0x210
>  grab_cache_page_write_begin+0x1f/0x40
>  iomap_write_begin.constprop.34+0x1ee/0x340
>  ? iomap_write_end+0x91/0x240
>  iomap_write_actor+0x92/0x170
>  ? iomap_dirty_actor+0x1b0/0x1b0
>  iomap_apply+0xba/0x130
>  ? iomap_dirty_actor+0x1b0/0x1b0
>  iomap_file_buffered_write+0x62/0x90
>  ? iomap_dirty_actor+0x1b0/0x1b0
>  xfs_file_buffered_aio_write+0xca/0x310 [xfs]
>  new_sync_write+0x11b/0x1b0
>  vfs_write+0xad/0x1a0
>  ksys_write+0xa1/0xe0
>  do_syscall_64+0x48/0xf0
>  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> RIP: 0033:0x7fc956e853ad
> Code: c3 8b 07 85 c0 75 24 49 89 fb 48 89 f0 48 89 d7 48 89 ce 4c 89 c2 4d 89 ca 4c 8b 44 24 08 4c 8b 4c 24 10 4c 89 5c 24 08 0f 05 e9 8a d2 ff ff 41 54 b8 02 00 00 00 49 89 f4 be 00 88 08 00 55
> RSP: 002b:00007ffdf7960058 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
> RAX: ffffffffffffffda RBX: 00007fc956ec6b48 RCX: 00007fc956e853ad
> RDX: 0000000000100000 RSI: 00007fc956cd9000 RDI: 0000000000000001
> RBP: 00007fc956cd9000 R08: 0000000000000000 R09: 0000000000000000
> R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000001
> R13: 0000000000000000 R14: 00005558753057a0 R15: 0000000000100000
> memory: usage 30720kB, limit 30720kB, failcnt 424
> memory+swap: usage 30720kB, limit 9007199254740988kB, failcnt 0
> kmem: usage 2416kB, limit 9007199254740988kB, failcnt 0
> Memory cgroup stats for /kubepods/burstable/pod2d356ec7-5c92-4692-a184-380253ac6fbd:
> anon 1089536
> file 27475968
> kernel_stack 73728
> slab 1941504
> sock 0
> shmem 0
> file_mapped 0
> file_dirty 0
> file_writeback 0
> anon_thp 0
> inactive_anon 0
> active_anon 1351680
> inactive_file 27705344
> active_file 40960
> unevictable 0
> slab_reclaimable 819200
> slab_unreclaimable 1122304
> pgfault 23397
> pgmajfault 0
> workingset_refault 33
> workingset_activate 33
> workingset_nodereclaim 0
> pgrefill 119108
> pgscan 124436
> pgsteal 928
> pgactivate 123222
> pgdeactivate 119083
> pglazyfree 99
> pglazyfreed 0
> thp_fault_alloc 0
> thp_collapse_alloc 0
> Tasks state (memory values in pages):
> [  pid  ]   uid  tgid  total_vm      rss  pgtables_bytes  swapents  oom_score_adj  name
> [  28589]     0  28589      242        1           28672         0           -998  pause
> [  28703]     0  28703      399        1           40960         0            997  sh
> [  28766]     0  28766      821      341           45056         0            997  dd
> oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=224eacdaa07c1a67f0cf2a5c85ffc6fe29d95f971743c7c7938de26e85351075,mems_allowed=0,oom_memcg=/kubepods/burstable/pod2d356ec7-5c92-4692-a184-380253ac6fbd,task_memcg=/kubepods/burstable/pod2d356ec7-5c92-4692-a184-380253ac6fbd/224eacdaa07c1a67f0cf2a5c85ffc6fe29d95f971743c7c7938de26e85351075,task=dd,pid=28766,uid=0
> Memory cgroup out of memory: Killed process 28766 (dd) total-vm:3284kB, anon-rss:1036kB, file-rss:328kB, shmem-rss:0kB, UID:0 pgtables:44kB oom_score_adj:997
> oom_reaper: reaped process 28766 (dd), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
>
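
For completeness, roughly the same pattern should be reproducible without
Kubernetes by running dd from a small raw cgroup v1 memcg. A sketch, assuming a
standard v1 memory controller mount and that /data sits on xfs as in the trace
above; the cgroup name is arbitrary and the limit mirrors the 30Mi limit in the
container spec below:

  # "ddtest" is an arbitrary example cgroup; the limit matches the 30Mi pod limit.
  mkdir /sys/fs/cgroup/memory/ddtest
  echo $((30 * 1024 * 1024)) > /sys/fs/cgroup/memory/ddtest/memory.limit_in_bytes
  # Move the current shell into the memcg, then generate buffered writes.
  echo $$ > /sys/fs/cgroup/memory/ddtest/cgroup.procs
  dd if=/dev/zero of=/data/file bs=1M count=1000
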
> Here is a snippet of the container spec:
>
> containers:
>   - image: docker.io/library/alpine:latest
>     name: dd
>     command:
>       - sh
>     args:
>       - -c
>       - cat /proc/meminfo && apk add coreutils && dd if=/dev/zero of=/data/file bs=1M count=1000 && cat /proc/meminfo && echo "OK" && sleep 300
>     resources:
>       requests:
>         memory: 30Mi
>         cpu: 20m
>       limits:
>         memory: 30Mi
>
> Thanks,
> Anchal Agarwal

A gentle ping on this issue!

Thanks,
Anchal Agarwal