Date: Wed, 15 Apr 2020 08:50:59 +0200
From: Michal Hocko
To: Andrew Morton
Cc: paulfurtado91@gmail.com, bugzilla-daemon@bugzilla.kernel.org, linux-mm@kvack.org
Subject: Re: [Bug 207273] New: cgroup with 1.5GB limit and 100MB rss usage OOM-kills processes due to page cache usage after upgrading to kernel 5.4
Message-ID: <20200415065059.GV4629@dhcp22.suse.cz>
In-Reply-To: <20200414212558.58eaab4de2ecf864eaa87e5d@linux-foundation.org>

On Tue 14-04-20 21:25:58, Andrew Morton wrote:
[...]
> > Upon upgrading to kernel 5.4, we see constant OOM kills in database
> > containers that are restoring from backups, with nearly no RSS memory
> > usage. It appears all the memory is consumed by file_dirty, with
> > applications using minimal memory. On kernel 4.14.146 and 4.19.75, we do
> > not see this problem, so it appears to be a new regression.

OK, this is interesting, because the memcg OOM handling changed in 4.19.
Older kernels triggered the memcg OOM killer only from the page fault path,
while your stack trace points to the write(2) syscall. But since you do not
see any problem with 4.19, this is not it.

[...]
> > gfp_mask=0x101c4a(GFP_NOFS|__GFP_HIGHMEM|__GFP_HARDWALL|__GFP_MOVABLE|__GFP_WRITE)
[...]
> > memory: usage 1536000kB, limit 1536000kB, failcnt 0
> > memory+swap: usage 1536000kB, limit 1536000kB, failcnt 490221
> > kmem: usage 23164kB, limit 9007199254740988kB, failcnt 0

Based on this output I assume you are using cgroup v1.

> > Memory cgroup stats for
> > /kubepods/burstable/pod6900693c-8b2c-4efe-ab52-26e4a6bd9e4c/83216944bb43baf32f0d43ef12c85ebaa2767b3f51846f5fa438bba00b4636d8:
> > anon 72507392
> > file 1474740224
> > kernel_stack 774144
> > slab 18673664
> > sock 0
> > shmem 0
> > file_mapped 0
> > file_dirty 1413857280
> > file_writeback 60555264

This seems to be the crux of the problem. You cannot swap anything out
because of the memory+swap limit, and about 95% of the file LRU is dirty --
quite a lot of dirty memory to flush. This alone shouldn't be a disaster,
because cgroup v1 has a hack to throttle memory reclaim in the presence of
dirty/writeback pages. But note the gfp_mask of the failing allocation: it
is GFP_NOFS, which means that the throttling cannot be applied and reclaim
has to give up. We used to retry the reclaim even though not much could be
done in such a restricted allocation context, and that led to lockups. We
then merged f9c645621a28 ("memcg, oom: don't require __GFP_FS when invoking
memcg OOM killer") in 5.4, which has also been marked for stable trees
(4.19+). This is likely the primary culprit of the issue you are seeing
(see the sketch below).

Now, what to do about it. Reverting f9c645621a28 doesn't sound like a
feasible solution. We could try to add a sleep for restricted allocations
after memory reclaim fails, but we know from past experience that this is a
bit fishy, because a sleep without any feedback from the flushers is just
not going to work reliably.
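To make the above more concrete, here is a rough userspace sketch of the
decision flow as described above. This is NOT the actual mm/ code; the flag
values, helper names and structure are made up purely to illustrate the
logic:

#include <stdbool.h>
#include <stdio.h>

#define MY_GFP_IO 0x1u               /* illustrative bits, not the kernel's */
#define MY_GFP_FS 0x2u
#define MY_GFP_NOFS   (MY_GFP_IO)            /* __GFP_FS not set */
#define MY_GFP_KERNEL (MY_GFP_IO | MY_GFP_FS)

/* Did memcg reclaim make progress for this charge attempt? */
static bool reclaim_progress(unsigned int gfp, bool lru_mostly_dirty)
{
	if (!lru_mostly_dirty)
		return true;     /* clean page cache can simply be reclaimed */

	/*
	 * cgroup v1 hack: throttle the task until the flushers catch up --
	 * but only if the allocation context may enter the filesystem,
	 * otherwise we could deadlock on locks the caller already holds.
	 */
	if (gfp & MY_GFP_FS)
		return true;     /* waited for writeback, retry succeeds */

	return false;            /* GFP_NOFS: cannot throttle, give up */
}

static void try_charge(const char *what, unsigned int gfp)
{
	if (reclaim_progress(gfp, /* lru_mostly_dirty = */ true)) {
		printf("%s: charge succeeds after reclaim/throttling\n", what);
		return;
	}
	/*
	 * Before f9c645621a28 the !__GFP_FS case would just be retried over
	 * and over (risking lockups); since 5.4 / stable 4.19+ it falls
	 * through to the memcg OOM killer instead.
	 */
	printf("%s: reclaim failed -> memcg OOM killer\n", what);
}

int main(void)
{
	try_charge("GFP_KERNEL charge", MY_GFP_KERNEL);
	try_charge("GFP_NOFS charge (your write(2) trace)", MY_GFP_NOFS);
	return 0;
}

The point is only that the !__GFP_FS path has nowhere to wait, so with the
LRU full of dirty pages it now OOMs rather than retrying forever.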
Another possibility is to work around the problem via configuration. You can
either try cgroup v2, which has a much better memcg-aware dirty throttling
implementation, so such a large amount of dirty pages does not accumulate in
the first place, or you can reconfigure the global dirty limits. I presume
you are using the defaults for /proc/sys/vm/dirty_{background_}ratio, which
are a percentage of the available memory. I would recommend switching to
their respective *_bytes alternatives and using something like 500M for
dirty_background_bytes and 800M for dirty_bytes (see the example below).
That should help in your current situation. The overall IO throughput might
be lower, so you might need to tune those values a bit.
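If it helps, this is roughly what that could look like. The values are only
the starting point suggested above and should be tuned to your workload, and
note that writing the *_bytes knobs clears the corresponding *_ratio ones:

	# run as root; 500M / 800M as suggested above
	sysctl -w vm.dirty_background_bytes=$((500*1024*1024))
	sysctl -w vm.dirty_bytes=$((800*1024*1024))

	# to persist across reboots, e.g. in /etc/sysctl.d/99-dirty.conf:
	#   vm.dirty_background_bytes = 524288000
	#   vm.dirty_bytes = 838860800

HTH
--
Michal Hocko
SUSE Labs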