Date: Thu, 23 Apr 2020 11:00:15 -0400
From: Johannes Weiner
To: Michal Hocko
Cc: Tejun Heo, Shakeel Butt, Jakub Kicinski, Andrew Morton, Linux MM,
    Kernel Team, Chris Down, Cgroups
Subject: Re: [PATCH 0/3] memcg: Slow down swap allocation as the available space gets depleted
Message-ID: <20200423150015.GE362484@cmpxchg.org>
In-Reply-To: <20200422184921.GB4206@dhcp22.suse.cz>

On Wed, Apr 22, 2020 at 08:49:21PM +0200, Michal Hocko wrote:
> On Wed 22-04-20 13:13:28, Johannes Weiner wrote:
> > On Wed, Apr 22, 2020 at 05:43:18PM +0200, Michal Hocko wrote:
> > > On Wed 22-04-20 10:15:14, Johannes Weiner wrote:
> > > I am also missing some information about what the user can actually do
> > > about this situation and call out explicitly that the throttling is
> > > not going away until the swap usage is shrunk and the kernel is not
> > > capable of doing that on its own without a help from the userspace. This
> > > is really different from memory.high which has means to deal with the
> > > excess and shrink it down in most cases.
> > > The following would clarify it
> >
> > I think we may be talking past each other. The user can do the same
> > thing as in any OOM situation: wait for the kill.
>
> That assumes that reaching swap.high is going to converge to the OOM
> eventually. And that is far from the general case. There might be a
> lot of other reclaimable memory to reclaim and stay in the current
> state.

No, that's really the general case. And that's based on what users
widely experience, including us at FB. When swap is full, it's over.
Multiple parties have independently reached this conclusion.

This will be the default assumption in major distributions soon:
https://fedoraproject.org/wiki/Changes/EnableEarlyoom

> > > for me
> > > "Once the limit is exceeded it is expected that the userspace
> > > is going to act and either free up the swapped out space
> > > or tune the limit based on needs. The kernel itself is not
> > > able to do that on its own.
> > > "
> >
> > I mean, in rare cases, maybe userspace can do some loadshedding and be
> > smart about it. But we certainly don't expect it to.
>
> I really didn't mean to suggest any clever swap management. All I
> wanted so say and have documented is that users of swap.high should
> be aware of the fact that kernel is not able to do much to reduce the
> throttling. This is really different from memory.high where the kernel
> pro-actively tries to keep the memory usage below the watermark. So a
> certain level of userspace cooperation is really needed unless you can
> tolerate a workload to be throttled to the end of times.

That's exactly what happens with memory.high. We've seen this. The
workload can go into a crawl and just stay there.

It's not unlike disabling the oom killer in cgroup1 without anybody
handling it. With memory.high, workloads *might* recover, but you have
to handle the ones that don't. Again, we inject sleeps into
memory.high when reclaim *is not* pushing back the workload anymore,
when reclaim is *failing*.
The state isn't as stable as with oom_control=0, but these indefinite
hangs really happen in practice. Realistically, you cannot use
memory.high without an OOM manager.

The asymmetry you see between memory.high and swap.high comes from the
page cache. memory.high can put a stop to the mindless expansion of
the file cache and remove *unused* cache pages from the application's
workingset. It cannot permanently remove used cache pages; they'll
just refault. So unused cache is where reclaim is useful.

Once the workload expands its set of *used* pages past memory.high, we
are talking about indefinite slowdowns / OOM situations. At that
point, reclaim can no longer push the workload back to a state where
everything is okay: the pages it takes off mean refaults and continued
reclaim, i.e. throttling. You get slowed down either way, and whether
you reclaim or sleep() is - to the workload - an accounting
difference.

Reclaim does NOT have the power to help the workload get better. It
can only do amputations to protect the rest of the system, but it
cannot reduce the number of pages the workload is trying to access.

The only sustainable way out of such a throttling situation is either
an OOM kill or the workload voluntarily shrinking and reducing the
total number of pages it uses. And doesn't that sound familiar? :-)

The actual, observable effects of memory.high and swap.high semantics
are much more similar than you think they are: When the workload's
true workingset (not throwaway cache) grows past capacity (memory or
swap), we slow down further expansion until it either changes its mind
and shrinks, or userspace OOM handling takes care of it.
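
To make the refault argument concrete, here is a toy model. It is only
a sketch under loud assumptions: a strict LRU stands in for reclaim,
`capacity` stands in for memory.high, and the workload just cycles over
its used pages. The function name and parameters are made up for
illustration; none of this is how the kernel's reclaim actually works.

```python
from collections import OrderedDict


def count_refaults(used_pages, capacity, rounds=5):
    """Cycle over `used_pages` distinct pages `rounds` times through a
    strict LRU holding at most `capacity` pages (hypothetical stand-in
    for memory.high). Return the number of accesses to a page that had
    been evicted earlier, i.e. refaults."""
    lru = OrderedDict()   # resident pages, least recently used first
    evicted = set()       # pages that "reclaim" has taken away at least once
    refaults = 0
    for _ in range(rounds):
        for page in range(used_pages):
            if page in lru:
                lru.move_to_end(page)      # hit: mark as most recently used
                continue
            if page in evicted:
                refaults += 1              # the workload wanted this page back
            lru[page] = True
            if len(lru) > capacity:
                victim = lru.popitem(last=False)[0]  # "reclaim" the coldest page
                evicted.add(victim)
    return refaults


if __name__ == "__main__":
    print("used set fits:      ", count_refaults(used_pages=8, capacity=10))
    print("used set over limit:", count_refaults(used_pages=12, capacity=10))
```

In this model a used set that fits under the limit never refaults,
while one that is only two pages over the limit refaults on every
access after the first pass: each eviction is paid back by a later
fault, so nothing is freed for good.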