From mboxrd@z Thu Jan  1 00:00:00 1970
Date: Fri, 24 Apr 2020 17:05:10 +0200
From: Michal Hocko <mhocko@kernel.org>
To: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tejun Heo <tj@kernel.org>, Shakeel Butt <shakeelb@google.com>,
	Jakub Kicinski <kuba@kernel.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	Linux MM <linux-mm@kvack.org>, Kernel Team <kernel-team@fb.com>,
	Chris Down <chris@chrisdown.name>,
	Cgroups <cgroups@vger.kernel.org>
Subject: Re: [PATCH 0/3] memcg: Slow down swap allocation as the available
 space gets depleted
Message-ID: <20200424150510.GH11591@dhcp22.suse.cz>
References: <20200421110612.GD27314@dhcp22.suse.cz>
 <20200421142746.GA341682@cmpxchg.org>
 <20200421161138.GL27314@dhcp22.suse.cz>
 <20200421165601.GA345998@cmpxchg.org>
 <20200422132632.GG30312@dhcp22.suse.cz>
 <20200422141514.GA362484@cmpxchg.org>
 <20200422154318.GK30312@dhcp22.suse.cz>
 <20200422171328.GC362484@cmpxchg.org>
 <20200422184921.GB4206@dhcp22.suse.cz>
 <20200423150015.GE362484@cmpxchg.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20200423150015.GE362484@cmpxchg.org>
Sender: owner-linux-mm@kvack.org
Precedence: bulk
X-Loop: owner-majordomo@kvack.org
List-ID: <linux-mm.kvack.org>

On Thu 23-04-20 11:00:15, Johannes Weiner wrote:
> On Wed, Apr 22, 2020 at 08:49:21PM +0200, Michal Hocko wrote:
> > On Wed 22-04-20 13:13:28, Johannes Weiner wrote:
> > > On Wed, Apr 22, 2020 at 05:43:18PM +0200, Michal Hocko wrote:
> > > > On Wed 22-04-20 10:15:14, Johannes Weiner wrote:
> > > > I am also missing some information about what the user can actually do
> > > > about this situation and call out explicitly that the throttling is
> > > > not going away until the swap usage is shrunk and the kernel is not
> > > > capable of doing that on its own without a help from the userspace. This
> > > > is really different from memory.high which has means to deal with the
> > > > excess and shrink it down in most cases. The following would clarify it
> > > 
> > > I think we may be talking past each other. The user can do the same
> > > thing as in any OOM situation: wait for the kill.
> > 
> > That assumes that reaching swap.high is going to converge to the OOM
> > eventually. And that is far from the general case. There might be a
> > lot of other reclaimable memory to reclaim and stay in the current
> > state.
> 
> No, that's really the general case. And that's based on what users
> widely experience, including us at FB. When swap is full, it's over.
> Multiple parties have independently reached this conclusion.

But we are talking about two different things. You seem to be focusing on
the full swap (quota) while I am talking about swap.high, which doesn't imply
that the quota/full swap is going to be reached soon.

[...]

> The asymmetry you see between memory.high and swap.high comes from the
> page cache. memory.high can set a stop to the mindless expansion of
> the file cache and remove *unused* cache pages from the application's
> workingset. It cannot permanently remove used cache pages, they'll
> just refault. So unused cache is where reclaim is useful.

Exactly! And I have seen memory.high being used to throttle huge page
cache producers to not disrupt other workloads.
 
> Once the workload expands its set of *used* pages past memory.high, we
> are talking about indefinite slowdowns / OOM situations. Because at
> that point, reclaim cannot push the workload back and everything will
> be okay: the pages it takes off mean refaults and continued reclaim,
> i.e. throttling. You get slowed down either way, and whether you
> reclaim or sleep() is - to the workload - an accounting difference.
>
> Reclaim does NOT have the power to help the workload get better. It
> can only do amputations to protect the rest of the system, but it
> cannot reduce the number of pages the workload is trying to access.

Yes, I do agree with you here, and I believe this scenario wasn't really
what the dispute is about. As soon as the real working set doesn't
fit into the high limit and is still growing, you are effectively
OOM, and either you handle that from userspace or you have to
waaaaaaaaait for the kernel oom killer to trigger.

But I believe this scenario is much easier to understand because the
memory consumption is growing. What I find largely unintuitive from the
user POV is that the throttling will remain in place without userspace
intervention even when there is no runaway.

Let me give you an example. Say you have a peak load which pushes
a large part of idle memory out to swap, so much that swap usage reaches
swap.high. The peak eventually finishes, freeing up its resources. The
swap situation remains the same because that memory is not refaulted and
we do not proactively swap memory back in (i.e. reclaim the swap space). You
are left with throttling even though the overall memcg consumption is
really low. The kernel is currently not able to do anything about that,
and userspace would need to be aware of the situation and fault the
swapped-out memory back in to get normal behavior. Do you think this
is something so obvious that people would keep it in mind when using
swap.high?
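
To make the scenario concrete, here is a toy model in Python. It is only
a sketch under stated assumptions: the Memcg class, its fields and the
page counts are made up for illustration, throttling is assumed to apply
whenever swap usage exceeds swap.high, and faulting a page back in is
assumed to release its swap slot. It does not reflect how the kernel
implements any of this.

```python
from dataclasses import dataclass


@dataclass
class Memcg:
    """Hypothetical toy model of one memcg; all counts are in pages."""
    swap_high: int
    resident: int = 0   # pages currently in RAM
    swapped: int = 0    # pages currently in swap

    def swap_out(self, pages: int) -> None:
        # Reclaim under memory pressure moves pages from RAM to swap.
        pages = min(pages, self.resident)
        self.resident -= pages
        self.swapped += pages

    def free(self, pages: int) -> None:
        # The workload releases resident memory (the peak load finishes).
        self.resident -= min(pages, self.resident)

    def fault_in(self, pages: int) -> None:
        # Userspace touches swapped-out memory. Simplifying assumption:
        # the swap slot is released as soon as the page is back in RAM.
        pages = min(pages, self.swapped)
        self.swapped -= pages
        self.resident += pages

    def throttled(self) -> bool:
        # Assumption: throttling applies while swap usage exceeds swap.high.
        return self.swapped > self.swap_high


def run_scenario():
    cg = Memcg(swap_high=100, resident=150)  # 150 mostly idle pages
    cg.resident += 200                       # the peak load allocates
    cg.swap_out(120)                         # idle memory is pushed to swap
    during_peak = cg.throttled()
    cg.free(200)                             # the peak finishes
    after_peak = cg.throttled()              # nothing shrinks swap usage
    cg.fault_in(50)                          # userspace intervention
    after_fault_in = cg.throttled()
    return during_peak, after_peak, after_fault_in


if __name__ == "__main__":
    print(run_scenario())
```

In this model the memcg is still throttled after the peak has gone away
and its resident usage has dropped to 30 pages. Only the explicit
fault_in() step, i.e. userspace action, brings swap usage back under
swap.high.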

Anyway, it seems that we are not making progress here. As I've said, I
believe that swap.high might lead to surprising behavior, and therefore
I would appreciate more clarity in the documentation. If you see a
problem with that for some reason, then I can live with that. This is not
a reason to nack.
-- 
Michal Hocko
SUSE Labs