From: Anchal Agarwal <anchalag@amazon.com>
To: Michal Hocko <mhocko@kernel.org>
Cc: Paul Furtado <paulfurtado91@gmail.com>,
Andrew Morton <akpm@linux-foundation.org>,
<bugzilla-daemon@bugzilla.kernel.org>, <linux-mm@kvack.org>,
<benh@amazon.com>
Subject: Re: [Bug 207273] New: cgroup with 1.5GB limit and 100MB rss usage OOM-kills processes due to page cache usage after upgrading to kernel 5.4
Date: Wed, 11 Aug 2021 19:41:23 +0000
Message-ID: <20210811194123.GA8198@dev-dsk-anchalag-2a-9c2d1d96.us-west-2.amazon.com>
In-Reply-To: <20210806204246.GA21051@dev-dsk-anchalag-2a-9c2d1d96.us-west-2.amazon.com>
On Fri, Aug 06, 2021 at 08:42:46PM +0000, Anchal Agarwal wrote:
> On Wed, Apr 15, 2020 at 11:44:58AM +0200, Michal Hocko wrote:
> > On Wed 15-04-20 04:34:56, Paul Furtado wrote:
> > > > You can either try to use cgroup v2 which has much better memcg aware dirty
> > > > throttling implementation so such a large amount of dirty pages doesn't
> > > > accumulate in the first place
> > >
> > > I'd love to use cgroup v2, however this is docker + kubernetes so that
> > > would require a lot of changes on our end to make happen, given how
> > > recently container runtimes gained cgroup v2 support.
> > >
> > > > I presume you are using the defaults for
> > > > /proc/sys/vm/dirty_{background_}ratio which is a percentage of the
> > > > available memory. I would recommend using their resp. *_bytes
> > > > alternatives and use something like 500M for background and 800M for
> > > > dirty_bytes.
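> > > >
> > > > As a purely illustrative example, those settings could be applied like
> > > > this (sysctl -w vm.dirty_background_bytes=524288000 and
> > > > sysctl -w vm.dirty_bytes=838860800 have the same effect):
> > > >
> > > > #include <stdio.h>
> > > > #include <stdlib.h>
> > > >
> > > > /* Example only: write the suggested thresholds (500M background,
> > > >  * 800M dirty) into procfs; must be run as root. */
> > > > static void write_vm_knob(const char *path, const char *bytes)
> > > > {
> > > >         FILE *f = fopen(path, "w");
> > > >
> > > >         if (!f || fputs(bytes, f) == EOF) {
> > > >                 perror(path);
> > > >                 exit(1);
> > > >         }
> > > >         fclose(f);
> > > > }
> > > >
> > > > int main(void)
> > > > {
> > > >         write_vm_knob("/proc/sys/vm/dirty_background_bytes", "524288000");
> > > >         write_vm_knob("/proc/sys/vm/dirty_bytes", "838860800");
> > > >         return 0;
> > > > }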
> > >
> > > We're using the defaults right now, however, given that this is a
> > > containerized environment, it's problematic to set these values too
> > > low system-wide since the containers all have dedicated volumes with
> > > varying performance (from as low as 100MB/sec to gigabytes). Looking
> > > around, I see that there were patches in the past to set per-cgroup
> > > vm.dirty settings, however it doesn't look like those ever made it
> > > into the kernel unless I'm missing something.
> >
> > I am not aware of that work for memcg v1.
> >
> > > In practice, maybe 500M
> > > and 800M wouldn't be so bad though and may improve latency in other
> > > ways. The other problem is that this effectively puts a floor on the
> > > container size for anything that does IO.
> >
> > Well this would be a conservative approach but most allocations will
> > simply be throttled during reclaim. It is the restricted memory reclaim
> > context that is the bummer here. I have already brought up why this is
> > the case in the generic write(2) system call path [1]. Maybe we can
> > reduce the amount of NOFS requests.
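> >
> > Just to illustrate the direction (a rough sketch of the scoped NOFS API,
> > not the actual write(2) path): a filesystem can mark only the section that
> > may recurse into it, and allocations elsewhere can then stay GFP_KERNEL so
> > reclaim there is still allowed to write back dirty pagecache.
> >
> > #include <linux/sched/mm.h>
> > #include <linux/slab.h>
> >
> > /*
> >  * Sketch only: inside the marked scope, any allocation is implicitly
> >  * treated as GFP_NOFS; outside it, normal GFP_KERNEL reclaim applies.
> >  */
> > static void *alloc_in_fs_recursion_scope(size_t size)
> > {
> >         unsigned int nofs_flags;
> >         void *p;
> >
> >         nofs_flags = memalloc_nofs_save();
> >         p = kmalloc(size, GFP_KERNEL);  /* effectively GFP_NOFS here */
> >         memalloc_nofs_restore(nofs_flags);
> >
> >         return p;
> > }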
> >
> > > That said, I'll still tune these settings in our infrastructure and see how
> > > things go, but it sounds like something should be done inside the
> > > kernel to help this situation, since it's so easy to trigger, but
> > > looking at the threads that led to the commits you referenced, I can
> > > see that this is complicated.
> >
> > Yeah, there are certainly things that we should be doing and reducing
> > the NOFS allocations is the first step. From my past experience, a
> > non-trivial share of NOFS usage has turned out to be incorrect. I am not sure
> > how much we can do for cgroup v1 though. If tuning for global dirty
> > thresholds doesn't lead to a better behavior we can think of a band aid
> > of some form. Something like this (only compile tested)
> >
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index 05b4ec2c6499..4e1e8d121785 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -2532,6 +2536,20 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
> >  	if (mem_cgroup_wait_acct_move(mem_over_limit))
> >  		goto retry;
> >
> > +	/*
> > +	 * Legacy memcg relies on dirty data throttling during the reclaim
> > +	 * but this cannot be done for GFP_NOFS requests so we might trigger
> > +	 * the oom way too early. Throttle here if we have way too many
> > +	 * dirty/writeback pages.
> > +	 */
> > +	if (!cgroup_subsys_on_dfl(memory_cgrp_subsys) && !(gfp_mask & __GFP_FS)) {
> > +		unsigned long dirty = memcg_page_state(memcg, NR_FILE_DIRTY),
> > +			      writeback = memcg_page_state(memcg, NR_WRITEBACK);
> > +
> > +		if (4 * (dirty + writeback) > 3 * page_counter_read(&memcg->memory))
> > +			schedule_timeout_interruptible(1);
> > +	}
> > +
> >  	if (nr_retries--)
> >  		goto retry;
> >
> > [1] http://lkml.kernel.org/r/20200415070228.GW4629@dhcp22.suse.cz
> > --
> > Michal Hocko
> > SUSE Labs
> Hi Michal,
> Following up on my conversation from bugzilla here:
> I am currently seeing the same issue when migrating a container from 4.14 to
> 5.4+ kernels. I tested this patch with a configuration where the application
> reaches the cgroup memory limit while doing IO. The issue is similar to the one
> described in https://bugzilla.kernel.org/show_bug.cgi?id=207273, where we see
> OOMs in the write syscall due to the restricted memory reclaim context.
> I tested your patch; however, I had to increase the sleep from 1 to 10 jiffies
> to make it work. With that change it works for both 5.4 and 5.10 on my workload.
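>
> For reference, the only delta from your hunk is the longer sleep; the check
> itself (3/4 of the charged pages being dirty or under writeback) is unchanged:
>
>         if (!cgroup_subsys_on_dfl(memory_cgrp_subsys) && !(gfp_mask & __GFP_FS)) {
>                 unsigned long dirty = memcg_page_state(memcg, NR_FILE_DIRTY),
>                               writeback = memcg_page_state(memcg, NR_WRITEBACK);
>
>                 /* same throttling condition as in your patch ... */
>                 if (4 * (dirty + writeback) > 3 * page_counter_read(&memcg->memory))
>                         /* ... but back off for 10 jiffies instead of 1 */
>                         schedule_timeout_interruptible(10);
>         }
>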
> I also tried adjusting dirty_bytes and dirty_background_bytes, which worked
> after some tuning, but there is no single set of values that suits all use
> cases. Changing those defaults and expecting them to work for every kind of
> workload therefore does not look viable to me. I think working out a fix in the
> kernel is the better option, since this issue will show up in many use cases
> where applications rely on the old kernel behavior and suddenly start failing
> on newer kernels.
> I see the same stack trace on 4.19 kernel too.
>
> Here is the stack trace:
>
> dd invoked oom-killer: gfp_mask=0x101c4a(GFP_NOFS|__GFP_HIGHMEM|__GFP_HARDWALL|__GFP_MOVABLE|__GFP_WRITE), order=0, oom_score_adj=997
> CPU: 0 PID: 28766 Comm: dd Not tainted 5.4.129-62.227.amzn2.x86_64 #1
> Hardware name: Amazon EC2 m5.large/, BIOS 1.0 10/16/2017
> Call Trace:
> dump_stack+0x50/0x6b
> dump_header+0x4a/0x200
> oom_kill_process+0xd7/0x110
> out_of_memory+0x105/0x510
> mem_cgroup_out_of_memory+0xb5/0xd0
> try_charge+0x766/0x7c0
> mem_cgroup_try_charge+0x70/0x190
> __add_to_page_cache_locked+0x355/0x390
> ? scan_shadow_nodes+0x30/0x30
> add_to_page_cache_lru+0x4a/0xc0
> pagecache_get_page+0xf5/0x210
> grab_cache_page_write_begin+0x1f/0x40
> iomap_write_begin.constprop.34+0x1ee/0x340
> ? iomap_write_end+0x91/0x240
> iomap_write_actor+0x92/0x170
> ? iomap_dirty_actor+0x1b0/0x1b0
> iomap_apply+0xba/0x130
> ? iomap_dirty_actor+0x1b0/0x1b0
> iomap_file_buffered_write+0x62/0x90
> ? iomap_dirty_actor+0x1b0/0x1b0
> xfs_file_buffered_aio_write+0xca/0x310 [xfs]
> new_sync_write+0x11b/0x1b0
> vfs_write+0xad/0x1a0
> ksys_write+0xa1/0xe0
> do_syscall_64+0x48/0xf0
> entry_SYSCALL_64_after_hwframe+0x44/0xa9
> RIP: 0033:0x7fc956e853ad
> Code: c3 8b 07 85 c0 75 24 49 89 fb 48 89 f0 48 89 d7 48 89 ce 4c 89 c2 4d 89 ca
> 4c 8b 44 24 08 4c 8b 4c 24 10 4c 89 5c 24 08 0f 05 <c3> e9 8a d2 ff ff 41 54 b8
> 02 00 00 00 49 89 f4 be 00 88 08 00 55
> RSP: 002b:00007ffdf7960058 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
> RAX: ffffffffffffffda RBX: 00007fc956ec6b48 RCX: 00007fc956e853ad
> RDX: 0000000000100000 RSI: 00007fc956cd9000 RDI: 0000000000000001
> RBP: 00007fc956cd9000 R08: 0000000000000000 R09: 0000000000000000
> R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000001
> R13: 0000000000000000 R14: 00005558753057a0 R15: 0000000000100000
> memory: usage 30720kB, limit 30720kB, failcnt 424
> memory+swap: usage 30720kB, limit 9007199254740988kB, failcnt 0
> kmem: usage 2416kB, limit 9007199254740988kB, failcnt 0
> Memory cgroup stats for /kubepods/burstable/pod2d356ec7-5c92-4692-a184-380253ac6fbd:
> anon 1089536
> file 27475968
> kernel_stack 73728
> slab 1941504
> sock 0
> shmem 0
> file_mapped 0
> file_dirty 0
> file_writeback 0
> anon_thp 0
> inactive_anon 0
> active_anon 1351680
> inactive_file 27705344
> active_file 40960
> unevictable 0
> slab_reclaimable 819200
> slab_unreclaimable 1122304
> pgfault 23397
> pgmajfault 0
> workingset_refault 33
> workingset_activate 33
> workingset_nodereclaim 0
> pgrefill 119108
> pgscan 124436
> pgsteal 928
> pgactivate 123222
> pgdeactivate 119083
> pglazyfree 99
> pglazyfreed 0
> thp_fault_alloc 0
> thp_collapse_alloc 0
> Tasks state (memory values in pages):
> [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
> [ 28589] 0 28589 242 1 28672 0 -998 pause
> [ 28703] 0 28703 399 1 40960 0 997 sh
> [ 28766] 0 28766 821 341 45056 0 997 dd
> oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=224eacdaa07c1a67f0cf2a5c85ffc6fe29d95f971743c7c7938de26e85351075,mems_allowed=0,oom_memcg=/kubepods/burstable/pod2d356ec7-5c92-4692-a184-380253ac6fbd,task_memcg=/kubepods/burstable/pod2d356ec7-5c92-4692-a184-380253ac6fbd/224eacdaa07c1a67f0cf2a5c85ffc6fe29d95f971743c7c7938de26e85351075,task=dd,pid=28766,uid=0
> Memory cgroup out of memory: Killed process 28766 (dd) total-vm:3284kB,
> anon-rss:1036kB, file-rss:328kB, shmem-rss:0kB, UID:0 pgtables:44kB
> oom_score_adj:997
> oom_reaper: reaped process 28766 (dd), now anon-rss:0kB, file-rss:0kB,
> shmem-rss:0kB
>
>
> Here is a snippet of the container spec:
>
> containers:
> - image: docker.io/library/alpine:latest
>   name: dd
>   command:
>   - sh
>   args:
>   - -c
>   - cat /proc/meminfo && apk add coreutils && dd if=/dev/zero of=/data/file bs=1M count=1000 && cat /proc/meminfo && echo "OK" && sleep 300
>   resources:
>     requests:
>       memory: 30Mi
>       cpu: 20m
>     limits:
>       memory: 30Mi
>
> Thanks,
> Anchal Agarwal
A gentle ping on this issue!
Thanks,
Anchal Agarwal