Date: Wed, 15 Apr 2020 11:44:58 +0200
From: Michal Hocko
To: Paul Furtado
Cc: Andrew Morton, bugzilla-daemon@bugzilla.kernel.org, linux-mm@kvack.org
Subject: Re: [Bug 207273] New: cgroup with 1.5GB limit and 100MB rss usage OOM-kills processes due to page cache usage after upgrading to kernel 5.4
Message-ID: <20200415094458.GB4629@dhcp22.suse.cz>
References: <20200414212558.58eaab4de2ecf864eaa87e5d@linux-foundation.org> <20200415065059.GV4629@dhcp22.suse.cz>

On Wed 15-04-20 04:34:56, Paul Furtado wrote:
> > You can either try to use cgroup v2 which has much better memcg aware dirty
> > throttling implementation so such a large amount of dirty pages doesn't
> > accumulate in the first place
>
> I'd love to use cgroup v2, however this is docker + kubernetes so that
> would require a lot of changes on our end to make happen, given how
> recently container runtimes gained cgroup v2 support.
>
> > I presume you are using the defaults for
> > /proc/sys/vm/dirty_{background_}ratio, which are a percentage of the
> > available memory. I would recommend using their resp. *_bytes
> > alternatives and use something like 500M for background and 800M for
> > dirty_bytes.
>
> We're using the defaults right now. However, given that this is a
> containerized environment, it's problematic to set these values too
> low system-wide since the containers all have dedicated volumes with
> varying performance (from as low as 100MB/sec to gigabytes). Looking
> around, I see that there were patches in the past to set per-cgroup
> vm.dirty settings, however it doesn't look like those ever made it
> into the kernel unless I'm missing something.

I am not aware of that work for memcg v1.

> In practice, maybe 500M
> and 800M wouldn't be so bad though and may improve latency in other
> ways. The other problem is that this also sets an upper bound on the
> minimum container size for anything that does do IO.

Well, this would be a conservative approach, but most allocations will
simply be throttled during reclaim. It is the restricted memory reclaim
context that is the bummer here. I have already brought up why this is
the case in the generic write(2) system call path [1]. Maybe we can
reduce the amount of NOFS requests.

> That said, I'll still tune these settings in our infrastructure and
> see how things go, but it sounds like something should be done inside
> the kernel to help this situation, since it's so easy to trigger, but
> looking at the threads that led to the commits you referenced, I can
> see that this is complicated.

Yeah, there are certainly things that we should be doing, and reducing
the NOFS allocations is the first step. From my past experience, a
non-trivial part of that usage has turned out to be incorrect. I am not
sure how much we can do for cgroup v1 though. If tuning the global dirty
thresholds (see the example below) doesn't lead to a better behavior, we
can think of a band aid of some form.
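For reference, applying the tuning suggested above would boil down to
something like this (a sketch only - the 500M/800M numbers are just the
values mentioned earlier, not a validated recommendation; note that
writing the *_bytes files makes the kernel ignore the corresponding
*_ratio settings):

  # 500M background writeback threshold, 800M hard dirty limit
  sysctl -w vm.dirty_background_bytes=524288000
  sysctl -w vm.dirty_bytes=838860800

The same values can go into /etc/sysctl.conf (or a sysctl.d snippet)
to make them persistent across reboots.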
Something like this (only compile tested):

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 05b4ec2c6499..4e1e8d121785 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2532,6 +2536,20 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	if (mem_cgroup_wait_acct_move(mem_over_limit))
 		goto retry;
 
+	/*
+	 * Legacy memcg relies on dirty data throttling during the reclaim
+	 * but this cannot be done for GFP_NOFS requests so we might trigger
+	 * the oom way too early. Throttle here if we have way too many
+	 * dirty/writeback pages.
+	 */
+	if (!cgroup_subsys_on_dfl(memory_cgrp_subsys) && !(gfp_mask & __GFP_FS)) {
+		unsigned long dirty = memcg_page_state(memcg, NR_FILE_DIRTY),
+			      writeback = memcg_page_state(memcg, NR_WRITEBACK);
+
+		if (4*(dirty + writeback) > 3* page_counter_read(&memcg->memory))
+			schedule_timeout_interruptible(1);
+	}
+
 	if (nr_retries--)
 		goto retry;

[1] http://lkml.kernel.org/r/20200415070228.GW4629@dhcp22.suse.cz

-- 
Michal Hocko
SUSE Labs