Date: Wed, 11 Aug 2021 19:41:23 +0000
From: Anchal Agarwal
To: Michal Hocko
Cc: Paul Furtado, Andrew Morton, ...
Subject: Re: [Bug 207273] New: cgroup with 1.5GB limit and 100MB rss usage OOM-kills processes due to page cache usage after upgrading to kernel 5.4
Message-ID: <20210811194123.GA8198@dev-dsk-anchalag-2a-9c2d1d96.us-west-2.amazon.com>
References: <20200414212558.58eaab4de2ecf864eaa87e5d@linux-foundation.org>
 <20200415065059.GV4629@dhcp22.suse.cz>
 <20200415094458.GB4629@dhcp22.suse.cz>
 <20210806204246.GA21051@dev-dsk-anchalag-2a-9c2d1d96.us-west-2.amazon.com>
In-Reply-To: <20210806204246.GA21051@dev-dsk-anchalag-2a-9c2d1d96.us-west-2.amazon.com>

On Fri, Aug 06, 2021 at 08:42:46PM +0000, Anchal Agarwal wrote:
> On Wed, Apr 15, 2020 at 11:44:58AM +0200, Michal Hocko wrote:
> > On Wed 15-04-20 04:34:56, Paul Furtado wrote:
> > > > You can either try to use cgroup v2, which has a much better memcg-aware dirty
> > > > throttling implementation, so such a large amount of dirty pages doesn't
> > > > accumulate in the first place.
> > >
> > > I'd love to use cgroup v2, however this is docker + kubernetes, so that
> > > would require a lot of changes on our end to make happen, given how
> > > recently container runtimes gained cgroup v2 support.
> > >
> > > > I presume you are using the defaults for
> > > > /proc/sys/vm/dirty_{background_}ratio, which are a percentage of the
> > > > available memory. I would recommend using their respective *_bytes
> > > > alternatives and using something like 500M for background and 800M for
> > > > dirty_bytes.
> > >
> > > We're using the defaults right now; however, given that this is a
> > > containerized environment, it's problematic to set these values too
> > > low system-wide since the containers all have dedicated volumes with
> > > varying performance (from as low as 100MB/sec to gigabytes). Looking
> > > around, I see that there were patches in the past to set per-cgroup
> > > vm.dirty settings, but it doesn't look like those ever made it
> > > into the kernel, unless I'm missing something.
> >
> > I am not aware of that work for memcg v1.
> >
> > > In practice, maybe 500M
> > > and 800M wouldn't be so bad though, and may improve latency in other
> > > ways. The other problem is that this also sets an upper bound on the
> > > minimum container size for anything that does do IO.
> >
> > Well, this would be a conservative approach, but most allocations will
> > simply be throttled during reclaim. It is the restricted memory reclaim
> > context that is the bummer here. I have already brought up why this is
> > the case in the generic write(2) system call path [1]. Maybe we can
> > reduce the amount of NOFS requests.
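
As an aside for anyone reproducing this: whether dirty and writeback pages are
what is pinning a memcg at its limit (which is also the condition the band-aid
below keys off) can be watched from userspace. A minimal sketch, assuming a
standard cgroup v1 memory controller mount; the cgroup path is only
illustrative:

  # Path is illustrative; substitute the actual pod/container memcg directory.
  CG=/sys/fs/cgroup/memory/kubepods/burstable/<pod-uid>/<container-id>
  # Poll the dirty/writeback counters alongside current usage and the limit.
  watch -n1 "grep -E '^(dirty|writeback) ' $CG/memory.stat; cat $CG/memory.usage_in_bytes $CG/memory.limit_in_bytes"
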
> >
> > > That said, I'll
> > > still tune these settings in our infrastructure and see how
> > > things go, but it sounds like something should be done inside the
> > > kernel to help this situation, since it's so easy to trigger; but
> > > looking at the threads that led to the commits you referenced, I can
> > > see that this is complicated.
> >
> > Yeah, there are certainly things that we should be doing, and reducing
> > the NOFS allocations is the first step. From my past experience, a
> > non-trivial amount of that usage has turned out to be incorrect. I am
> > not sure how much we can do for cgroup v1, though. If tuning the global
> > dirty thresholds doesn't lead to better behavior, we can think of a
> > band-aid of some form. Something like this (only compile tested):
> >
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index 05b4ec2c6499..4e1e8d121785 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -2532,6 +2536,20 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
> >  	if (mem_cgroup_wait_acct_move(mem_over_limit))
> >  		goto retry;
> >  
> > +	/*
> > +	 * Legacy memcg relies on dirty data throttling during the reclaim
> > +	 * but this cannot be done for GFP_NOFS requests so we might trigger
> > +	 * the oom way too early. Throttle here if we have way too many
> > +	 * dirty/writeback pages.
> > +	 */
> > +	if (!cgroup_subsys_on_dfl(memory_cgrp_subsys) && !(gfp_mask & __GFP_FS)) {
> > +		unsigned long dirty = memcg_page_state(memcg, NR_FILE_DIRTY),
> > +			      writeback = memcg_page_state(memcg, NR_WRITEBACK);
> > +
> > +		if (4 * (dirty + writeback) > 3 * page_counter_read(&memcg->memory))
> > +			schedule_timeout_interruptible(1);
> > +	}
> > +
> >  	if (nr_retries--)
> >  		goto retry;
> >
> > [1] http://lkml.kernel.org/r/20200415070228.GW4629@dhcp22.suse.cz
> > --
> > Michal Hocko
> > SUSE Labs
>
> Hi Michal,
> Following up on my conversation from bugzilla here:
> I am currently seeing the same issue when migrating a container from 4.14 to
> 5.4+ kernels. I tested this patch with a configuration where the application
> reaches the cgroup memory limit while doing IO. The issue is similar to the
> one described at https://bugzilla.kernel.org/show_bug.cgi?id=207273, where we
> see OOMs in the write syscall due to restricted memory reclaim.
> I tested your patch; however, I had to increase the timeout from 1 to 10
> jiffies (i.e. schedule_timeout_interruptible(10) in the hunk above) to make it
> work, and with that it works for both 5.4 and 5.10 on my workload.
> I also tried adjusting the dirty_bytes* knobs (sketched below) and that worked
> after some tuning; however, there is no single set of values that suits all
> use cases. Hence changing those defaults does not look like a viable option
> for me, since I cannot expect one setting to work for every kind of workload.
> I think working out a fix in the kernel may be a better option, because this
> issue will show up in so many use cases where applications that are used to
> the old kernel behavior suddenly start failing on newer ones.
> I see the same stack trace on the 4.19 kernel too.
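
For reference, the dirty_bytes tuning mentioned above amounts to switching from
the percentage-based defaults to absolute byte thresholds. A minimal sketch
using the 500M/800M starting points suggested earlier in the thread, not values
tuned for any particular workload:

  # 500M/800M are the starting points suggested above, not tuned numbers;
  # writing the *_bytes knobs implicitly zeroes their *_ratio counterparts.
  sysctl -w vm.dirty_background_bytes=$((500 * 1024 * 1024))
  sysctl -w vm.dirty_bytes=$((800 * 1024 * 1024))
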
>
> Here is the stack trace:
>
> dd invoked oom-killer: gfp_mask=0x101c4a(GFP_NOFS|__GFP_HIGHMEM|__GFP_HARDWALL|__GFP_MOVABLE|__GFP_WRITE), order=0, oom_score_adj=997
> CPU: 0 PID: 28766 Comm: dd Not tainted 5.4.129-62.227.amzn2.x86_64 #1
> Hardware name: Amazon EC2 m5.large/, BIOS 1.0 10/16/2017
> Call Trace:
>  dump_stack+0x50/0x6b
>  dump_header+0x4a/0x200
>  oom_kill_process+0xd7/0x110
>  out_of_memory+0x105/0x510
>  mem_cgroup_out_of_memory+0xb5/0xd0
>  try_charge+0x766/0x7c0
>  mem_cgroup_try_charge+0x70/0x190
>  __add_to_page_cache_locked+0x355/0x390
>  ? scan_shadow_nodes+0x30/0x30
>  add_to_page_cache_lru+0x4a/0xc0
>  pagecache_get_page+0xf5/0x210
>  grab_cache_page_write_begin+0x1f/0x40
>  iomap_write_begin.constprop.34+0x1ee/0x340
>  ? iomap_write_end+0x91/0x240
>  iomap_write_actor+0x92/0x170
>  ? iomap_dirty_actor+0x1b0/0x1b0
>  iomap_apply+0xba/0x130
>  ? iomap_dirty_actor+0x1b0/0x1b0
>  iomap_file_buffered_write+0x62/0x90
>  ? iomap_dirty_actor+0x1b0/0x1b0
>  xfs_file_buffered_aio_write+0xca/0x310 [xfs]
>  new_sync_write+0x11b/0x1b0
>  vfs_write+0xad/0x1a0
>  ksys_write+0xa1/0xe0
>  do_syscall_64+0x48/0xf0
>  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> RIP: 0033:0x7fc956e853ad
> Code: c3 8b 07 85 c0 75 24 49 89 fb 48 89 f0 48 89 d7 48 89 ce 4c 89 c2 4d 89 ca 4c 8b 44 24 08 4c 8b 4c 24 10 4c 89 5c 24 08 0f 05 e9 8a d2 ff ff 41 54 b8 02 00 00 00 49 89 f4 be 00 88 08 00 55
> RSP: 002b:00007ffdf7960058 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
> RAX: ffffffffffffffda RBX: 00007fc956ec6b48 RCX: 00007fc956e853ad
> RDX: 0000000000100000 RSI: 00007fc956cd9000 RDI: 0000000000000001
> RBP: 00007fc956cd9000 R08: 0000000000000000 R09: 0000000000000000
> R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000001
> R13: 0000000000000000 R14: 00005558753057a0 R15: 0000000000100000
> memory: usage 30720kB, limit 30720kB, failcnt 424
> memory+swap: usage 30720kB, limit 9007199254740988kB, failcnt 0
> kmem: usage 2416kB, limit 9007199254740988kB, failcnt 0
> Memory cgroup stats for /kubepods/burstable/pod2d356ec7-5c92-4692-a184-380253ac6fbd:
> anon 1089536
> file 27475968
> kernel_stack 73728
> slab 1941504
> sock 0
> shmem 0
> file_mapped 0
> file_dirty 0
> file_writeback 0
> anon_thp 0
> inactive_anon 0
> active_anon 1351680
> inactive_file 27705344
> active_file 40960
> unevictable 0
> slab_reclaimable 819200
> slab_unreclaimable 1122304
> pgfault 23397
> pgmajfault 0
> workingset_refault 33
> workingset_activate 33
> workingset_nodereclaim 0
> pgrefill 119108
> pgscan 124436
> pgsteal 928
> pgactivate 123222
> pgdeactivate 119083
> pglazyfree 99
> pglazyfreed 0
> thp_fault_alloc 0
> thp_collapse_alloc 0
> Tasks state (memory values in pages):
> [  pid  ]   uid  tgid  total_vm      rss  pgtables_bytes  swapents  oom_score_adj  name
> [  28589]     0  28589      242        1           28672         0           -998  pause
> [  28703]     0  28703      399        1           40960         0            997  sh
> [  28766]     0  28766      821      341           45056         0            997  dd
> oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=224eacdaa07c1a67f0cf2a5c85ffc6fe29d95f971743c7c7938de26e85351075,mems_allowed=0,oom_memcg=/kubepods/burstable/pod2d356ec7-5c92-4692-a184-380253ac6fbd,task_memcg=/kubepods/burstable/pod2d356ec7-5c92-4692-a184-380253ac6fbd/224eacdaa07c1a67f0cf2a5c85ffc6fe29d95f971743c7c7938de26e85351075,task=dd,pid=28766,uid=0
> Memory cgroup out of memory: Killed process 28766 (dd) total-vm:3284kB, anon-rss:1036kB, file-rss:328kB, shmem-rss:0kB, UID:0 pgtables:44kB oom_score_adj:997
> oom_reaper: reaped process 28766 (dd), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
>
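
For completeness, roughly the same pattern should be reproducible without
Kubernetes by running dd from a small raw cgroup v1 memcg. A sketch, assuming a
standard v1 memory controller mount and that /data sits on xfs as in the trace
above; the cgroup name is arbitrary and the limit mirrors the 30Mi limit in the
container spec below:

  # "ddtest" is an arbitrary example cgroup; the limit matches the 30Mi pod limit.
  mkdir /sys/fs/cgroup/memory/ddtest
  echo $((30 * 1024 * 1024)) > /sys/fs/cgroup/memory/ddtest/memory.limit_in_bytes
  # Move the current shell into the memcg, then generate buffered writes.
  echo $$ > /sys/fs/cgroup/memory/ddtest/cgroup.procs
  dd if=/dev/zero of=/data/file bs=1M count=1000
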
> Here is a snippet of the container spec:
>
> containers:
>   - image: docker.io/library/alpine:latest
>     name: dd
>     command:
>       - sh
>     args:
>       - -c
>       - cat /proc/meminfo && apk add coreutils && dd if=/dev/zero of=/data/file bs=1M count=1000 && cat /proc/meminfo && echo "OK" && sleep 300
>     resources:
>       requests:
>         memory: 30Mi
>         cpu: 20m
>       limits:
>         memory: 30Mi
>
> Thanks,
> Anchal Agarwal

A gentle ping on this issue!

Thanks,
Anchal Agarwal