From: Paul Furtado
Date: Wed, 15 Apr 2020 04:34:56 -0400
Subject: Re: [Bug 207273] New: cgroup with 1.5GB limit and 100MB rss usage OOM-kills processes due to page cache usage after upgrading to kernel 5.4
To: Michal Hocko
Cc: Andrew Morton, bugzilla-daemon@bugzilla.kernel.org, linux-mm@kvack.org

> You can either try to use cgroup v2 which has a much better memcg-aware
> dirty throttling implementation so such a large amount of dirty pages
> doesn't accumulate in the first place

I'd love to use cgroup v2, but this is Docker + Kubernetes, so that
would require a lot of changes on our end, given how recently container
runtimes gained cgroup v2 support.

> I presume you are using the defaults for
> /proc/sys/vm/dirty_{background_}ratio, which are percentages of the
> available memory. I would recommend using their respective *_bytes
> alternatives, with something like 500M for background and 800M for
> dirty_bytes.

We're using the defaults right now. However, since this is a
containerized environment, it's problematic to set these values too low
system-wide: the containers all have dedicated volumes with widely
varying performance (from as low as 100MB/sec up to gigabytes/sec).
Looking around, I see there were patches in the past to set per-cgroup
vm.dirty settings, but it doesn't look like those ever made it into the
kernel, unless I'm missing something. In practice, 500M and 800M might
not be so bad, and may even improve latency in other ways. The other
problem is that this effectively sets a minimum size for any container
that does I/O. That said, I'll still tune these settings in our
infrastructure and see how things go (a minimal sketch of applying them
is appended at the end of this mail). It does sound like something
should be done inside the kernel to help this situation, since it's so
easy to trigger, but looking at the threads that led to the commits you
referenced, I can see that this is complicated.

Thanks,
Paul

On Wed, Apr 15, 2020 at 2:51 AM Michal Hocko wrote:
>
> On Tue 14-04-20 21:25:58, Andrew Morton wrote:
> [...]
> > > Upon upgrading to kernel 5.4, we see constant OOM kills in database
> > > containers that are restoring from backups, with nearly no RSS memory
> > > usage. It appears all the memory is consumed by file_dirty, with
> > > applications using minimal memory. On kernel 4.14.146 and 4.19.75, we
> > > do not see this problem, so it appears to be a new regression.
>
> OK, this is interesting, because the memcg OOM handling has changed in
> 4.19. Older kernels triggered memcg OOM only from the page fault path,
> while your stack trace is pointing to the write(2) syscall. But if you
> do not see any problem with 4.19, then this is not it.
>
> [...]
>
> > > gfp_mask=0x101c4a(GFP_NOFS|__GFP_HIGHMEM|__GFP_HARDWALL|__GFP_MOVABLE|__GFP_WRITE)
> [...]
> > > memory: usage 1536000kB, limit 1536000kB, failcnt 0
> > > memory+swap: usage 1536000kB, limit 1536000kB, failcnt 490221
> > > kmem: usage 23164kB, limit 9007199254740988kB, failcnt 0
>
> Based on the output I assume you are using cgroup v1.
>
> > > Memory cgroup stats for
> > > /kubepods/burstable/pod6900693c-8b2c-4efe-ab52-26e4a6bd9e4c/83216944bb43baf32f0d43ef12c85ebaa2767b3f51846f5fa438bba00b4636d8:
> > > anon 72507392
> > > file 1474740224
> > > kernel_stack 774144
> > > slab 18673664
> > > sock 0
> > > shmem 0
> > > file_mapped 0
> > > file_dirty 1413857280
> > > file_writeback 60555264
>
> This seems to be the crux of the problem. You cannot swap out any memory
> due to the memory+swap limit, and 95% of the file LRU is dirty. That is
> quite a lot of dirty memory to flush. This alone shouldn't be a disaster,
> because cgroup v1 does have a hack to throttle memory reclaim in the
> presence of dirty/writeback pages. But note the gfp_mask for the
> allocation. It says GFP_NOFS, which means we cannot apply the throttling
> and have to give up. We used to retry the reclaim even though not much
> could be done with a restricted allocation context, and that led to
> lockups. Then we merged f9c645621a28 ("memcg, oom: don't require
> __GFP_FS when invoking memcg OOM killer") in 5.4, and it has been marked
> for stable trees (4.19+). This is likely the primary culprit of the
> issue you are seeing.
>
> Now, what to do about that. Reverting f9c645621a28 doesn't sound like a
> feasible solution. We could try to put a sleep for restricted
> allocations after memory reclaim fails, but we know from past experience
> that this is a bit fishy, because a sleep without any feedback from the
> flushing is just not going to work reliably.
>
> Another possibility is to work around the problem by configuration. You
> can either try to use cgroup v2, which has a much better memcg-aware
> dirty throttling implementation, so such a large amount of dirty pages
> doesn't accumulate in the first place, or you can reconfigure the global
> dirty limits. I presume you are using the defaults for
> /proc/sys/vm/dirty_{background_}ratio, which are percentages of the
> available memory. I would recommend using their respective *_bytes
> alternatives, with something like 500M for background and 800M for
> dirty_bytes. That should help in your current situation. The overall IO
> throughput might be smaller, so you might need to tune those values a
> bit.
>
> HTH
> --
> Michal Hocko
> SUSE Labs
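
---

Appendix (mentioned above): a minimal sketch of applying the *_bytes
limits suggested in the thread. This is not code from the thread; it is
equivalent to running, as root,

    sysctl -w vm.dirty_background_bytes=524288000
    sysctl -w vm.dirty_bytes=838860800

and the 500M/800M values are just the starting points Michal suggested,
not tuned numbers.

/*
 * Minimal sketch: apply the suggested dirty limits by writing the
 * vm.dirty_*_bytes sysctls directly. Must run as root. The 500M/800M
 * values are assumptions taken from this thread; tune per workload.
 * Note: writing a *_bytes file makes the kernel zero the matching
 * *_ratio setting, so the byte values take effect on their own.
 */
#include <stdio.h>
#include <stdlib.h>

static int write_sysctl(const char *path, long long bytes)
{
	FILE *f = fopen(path, "w");

	if (!f) {
		perror(path);
		return -1;
	}
	fprintf(f, "%lld\n", bytes);
	return fclose(f);	/* 0 on success, EOF on write error */
}

int main(void)
{
	const long long mb = 1024LL * 1024LL;

	/* Background writeback kicks in at 500M of dirty pages. */
	if (write_sysctl("/proc/sys/vm/dirty_background_bytes", 500 * mb))
		return EXIT_FAILURE;

	/* Writers are throttled synchronously at 800M of dirty pages. */
	if (write_sysctl("/proc/sys/vm/dirty_bytes", 800 * mb))
		return EXIT_FAILURE;

	return EXIT_SUCCESS;
}

Since the old per-cgroup vm.dirty patches never landed, these remain
global knobs on cgroup v1, which is exactly the container-sizing
trade-off described above.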