Date: Wed, 22 Apr 2020 13:13:28 -0400
From: Johannes Weiner
To: Michal Hocko
Cc: Tejun Heo, Shakeel Butt, Jakub Kicinski, Andrew Morton, Linux MM,
 Kernel Team, Chris Down, Cgroups
Subject: Re: [PATCH 0/3] memcg: Slow down swap allocation as the available space gets depleted
Message-ID: <20200422171328.GC362484@cmpxchg.org>
References: <20200420164740.GF43469@mtj.thefacebook.com>
 <20200420170318.GV27314@dhcp22.suse.cz>
 <20200420170650.GA169746@mtj.thefacebook.com>
 <20200421110612.GD27314@dhcp22.suse.cz>
 <20200421142746.GA341682@cmpxchg.org>
 <20200421161138.GL27314@dhcp22.suse.cz>
 <20200421165601.GA345998@cmpxchg.org>
 <20200422132632.GG30312@dhcp22.suse.cz>
 <20200422141514.GA362484@cmpxchg.org>
 <20200422154318.GK30312@dhcp22.suse.cz>
In-Reply-To: <20200422154318.GK30312@dhcp22.suse.cz>

On Wed, Apr 22, 2020 at 05:43:18PM +0200, Michal Hocko wrote:
> On Wed 22-04-20 10:15:14, Johannes Weiner wrote:
> > On Wed, Apr 22, 2020 at 03:26:32PM +0200, Michal Hocko wrote:
> > > That being said I believe our discussion is missing an important part.
> > > There is no description of the swap.high semantic. What can user expect
> > > when using it?
> >
> > Good point, we should include that in cgroup-v2.rst. How about this?
> >
> > diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
> > index bcc80269bb6a..49e8733a9d8a 100644
> > --- a/Documentation/admin-guide/cgroup-v2.rst
> > +++ b/Documentation/admin-guide/cgroup-v2.rst
> > @@ -1370,6 +1370,17 @@ PAGE_SIZE multiple when read back.
> >  	The total amount of swap currently being used by the cgroup
> >  	and its descendants.
> >
> > +  memory.swap.high
> > +	A read-write single value file which exists on non-root
> > +	cgroups.  The default is "max".
> > +
> > +	Swap usage throttle limit.  If a cgroup's swap usage exceeds
> > +	this limit, allocations inside the cgroup will be throttled.
>
> Hm, so this doesn't talk about which allocations are affected. This is
> good for potential future changes but I am not sure this is useful to
> make any educated guess about the actual effects. One could expect that
> only those allocations which could contribute to future memory.swap
> usage are affected. I fully realize that we do not want to be very
> specific but we want to provide something useful I believe. I am sorry
> but I do not have a good suggestion on how to make this better. Mostly
> because I still struggle on how this should behave to be sane.

I honestly don't really follow you here. Why is it not helpful to say
all allocations will slow down when condition X is met? We do the same
for memory.high.

> I am also missing some information about what the user can actually do
> about this situation and call out explicitly that the throttling is
> not going away until the swap usage is shrunk and the kernel is not
> capable of doing that on its own without help from userspace. This
> is really different from memory.high which has means to deal with the
> excess and shrink it down in most cases. The following would clarify it

I think we may be talking past each other.

The user can do the same thing as in any OOM situation: wait for the
kill.

Swap being full is an OOM situation.

Yes, that does not match the kernel's internal definition of an OOM
situation. But we've already established that kernel OOM killing has a
different objective (memory deadlock avoidance) than userspace OOM
killing (quality of life)[1]

[1] https://lkml.org/lkml/2019/8/4/15

As Tejun said, things like earlyoom and oomd already kill based on
swap exhaustion, no further questions asked.

Reclaim has been running for a while, it went after all the
low-hanging fruit: it doesn't swap as long as there is easy cache; it
also didn't just swap a little, it filled up all of swap; and the
pages in swap are all cold too, because refaults would free that space
again.

The workingset is hugely oversized for the available capacity, and
nobody has any interest in sticking around to see what tricks reclaim
still has up its sleeves (hint: nothing good). From here on out, it's
all thrashing and pain. The kernel might not OOM kill yet, but the
quality of life expectancy for a workload with full swap is trending
toward zero.

We've been killing based on swap exhaustion as a stand-alone trigger
for several years now and it's never been the wrong call.

All swap.high does is acknowledge that swap-full is a common OOM
situation from a userspace view, and helps it handle that situation.

Just like memory.high acknowledges that if reclaim fails per kernel
definition, it's an OOM situation from a kernel view, and it helps
userspace handle that.

> for me
> "Once the limit is exceeded it is expected that the userspace
> is going to act and either free up the swapped out space
> or tune the limit based on needs. The kernel itself is not
> able to do that on its own.
> "

I mean, in rare cases, maybe userspace can do some loadshedding and be
smart about it. But we certainly don't expect it to. Just like we
don't expect it to when memory.high starts injecting sleeps. We expect
the workload to die, usually.
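
For illustration only, the "kill on swap exhaustion" trigger mentioned
above could be sketched roughly as follows. This is not how earlyoom or
oomd are implemented. It only reads the existing cgroup v2 files
memory.swap.current and memory.swap.max; the function names and the
0.9 threshold are hypothetical, and a limit of "max" (where the real
bound would be system-wide swap) is simply treated as "never
exhausted".

```python
from pathlib import Path
from typing import Optional


def parse_cgroup_value(text: str) -> Optional[int]:
    """Parse a cgroup v2 single-value file; the string "max" means no limit."""
    text = text.strip()
    return None if text == "max" else int(text)


def swap_exhausted(current: int, limit: Optional[int],
                   threshold: float = 0.9) -> bool:
    """Return True when swap usage has reached `threshold` of the limit.

    The threshold is an assumed policy knob, not something the kernel
    or any existing tool defines.
    """
    if limit is None or limit == 0:
        # Assumption: with no cgroup-level limit (or swap disabled for
        # the cgroup) this sketch has nothing to compare against.
        return False
    return current >= threshold * limit


def cgroup_swap_exhausted(cgroup_dir: Path, threshold: float = 0.9) -> bool:
    """Check one cgroup directory, e.g. a path under /sys/fs/cgroup."""
    current = parse_cgroup_value((cgroup_dir / "memory.swap.current").read_text())
    limit = parse_cgroup_value((cgroup_dir / "memory.swap.max").read_text())
    return swap_exhausted(current or 0, limit, threshold)
```

A userspace agent would poll cgroup_swap_exhausted() for the cgroups it
manages and kill the workload once it returns True, rather than wait
for the kernel's own OOM killer to decide.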