From mboxrd@z Thu Jan  1 00:00:00 1970
Date: Tue, 25 Feb 2020 10:03:04 -0500
From: Johannes Weiner <hannes@cmpxchg.org>
To: Michal Koutný <mkoutny@suse.com>
Cc: Andrew Morton <akpm@linux-foundation.org>, Roman Gushchin <guro@fb.com>,
	Michal Hocko <mhocko@suse.com>, Tejun Heo <tj@kernel.org>,
	linux-mm@kvack.org, cgroups@vger.kernel.org,
	linux-kernel@vger.kernel.org, kernel-team@fb.com
Subject: Re: [PATCH v2 3/3] mm: memcontrol: recursive memory.low protection
Message-ID: <20200225150304.GA10257@cmpxchg.org>
References: <20191219200718.15696-1-hannes@cmpxchg.org>
 <20191219200718.15696-4-hannes@cmpxchg.org>
 <20200221171256.GB23476@blackbody.suse.cz>
 <20200221185839.GB70967@cmpxchg.org>
 <20200225133720.GA6709@blackbody.suse.cz>
MIME-Version: 1.0
Content-Type: text/plain; charset=iso-8859-1
Content-Disposition: inline
In-Reply-To: <20200225133720.GA6709@blackbody.suse.cz>
Content-Transfer-Encoding: quoted-printable
List-ID: <linux-mm.kvack.org>

Hello Michal,

On Tue, Feb 25, 2020 at 02:37:20PM +0100, Michal Koutný wrote:
> On Fri, Feb 21, 2020 at 01:58:39PM -0500, Johannes Weiner <hannes@cmpxchg.org> wrote:
> > When you set task's and logger's memory.low to "max" or 10G or any
> > bogus number like this, a limit reclaim in job treats this as origin
> > protection and tries hard to avoid reclaiming anything in either of
> > the two cgroups.
> What do you mean by origin protection? (I'm starting to see some
> misunderstanding here, c.f. my remark regarding the parent==root
> condition in the other patch [1]).

By origin protection I mean protection values at the first level of
children in a reclaim scope. Those are taken as absolute numbers
during a reclaim cycle and propagated down the tree.

Say you have the following configuration:

                root_mem_cgroup
               /
              A (max=12G, low=10G)
             /
            B (low=max)

If global reclaim occurs, the protection for subtree A is 10G, and B
then gets a proportional share of that.

However, if limit reclaim in A occurs due to the 12G max limit, the
protection for subtree B is max.
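
To make the two cases concrete, here is a toy Python model of how an
effective protection could be derived. It is only a sketch: the function
name `effective_low`, the 11G usage figure for B, and the simplified
formula are assumptions for illustration, not the kernel's actual code.

```python
MAX = float("inf")   # stands in for memory.low = "max"
G = 1 << 30          # one gibibyte, in bytes


def effective_low(low, usage, parent_elow=None, siblings_low_usage=0):
    """Toy effective-protection calculation for a single cgroup.

    parent_elow=None means the cgroup sits directly below the reclaim
    root, so its configured value is used as an absolute number.
    """
    if parent_elow is None:
        return low
    # Below the first level, a cgroup cannot claim more than its parent has ...
    elow = min(low, parent_elow)
    # ... and it shares the parent's protection in proportion to its
    # protected usage among its siblings.
    low_usage = min(usage, low)
    if low_usage and siblings_low_usage:
        elow = min(elow, parent_elow * low_usage / siblings_low_usage)
    return elow


usage_b = 11 * G  # hypothetical usage of B, the only child of A

# Global reclaim: the scope is root_mem_cgroup, so A is at the first level.
# (A's usage equals B's here; it is not consulted at the first level.)
a_elow = effective_low(low=10 * G, usage=usage_b)
b_global = effective_low(MAX, usage_b, parent_elow=a_elow,
                         siblings_low_usage=usage_b)

# Limit reclaim in A: the scope is A, so B itself is at the first level.
b_limit = effective_low(MAX, usage_b)

print(b_global / G, b_limit)
```

In this model B is bounded by A's 10G under global reclaim, whereas
under limit reclaim in A its own "max" is taken at face value.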

> > memory.events::low skyrockets even though no intended
> > protection was violated, we'll have reclaim latencies (especially when
> > there are a few dying cgroups accumulated in subtree).
> Hopefully, I see where you are coming from. There would be no (false)
> low notifications if the elow was calculated all the way top-down from
> the real root. Would such calculation be the way to go?

That hinges on whether an opt-out mechanism makes sense, and we
disagree on that part.

> > that job can't possibly *know* about the top-level host
> > protection that lies beyond the delegation point and outside its own
> > namespace,
> Yes, I agree.
>
> > and that it needs to propagate protection against rpm upgrades into
> > its own leaf groups for each tasklet and component.
> If a job wants to use concrete protection then it sets it, if it wants
> to use protection from above, then it can express it with the infinity
> (after changing the effective calculation I described above).
>
> Now, you may argue that the infinity would be nonsensical if it's not a
> subordinate job. Simplest approach would be likely to introduce the
> special "inherit" value (such a literal name may be misleading as it
> would be also "dont-care").

Again, a complication of the interface for *everybody* on the premise
that retaining an opt-out mechanism makes sense. We disagree on that.

> > Again, in practice we have found this to be totally unmanageable and
> > routinely first forgot and then had trouble hacking the propagation
> > into random jobs that create their own groups.
> I've been bitten by this as well. However, the protection defaults to
> off and I find this matches the general rule that kernel provides the
> mechanism and user(space) the policy.
>
> > And when you add new hardware configurations, you cannot just make a
> > top-level change in the host config, you have to update all the job
> > specs of workloads running in the fleet.
> (I acknowledge the current mechanism lacks an explicit way to express
> the inherit/dont-care value.)
>
>
> > My patch brings memory configuration in line with other cgroup2
> > controllers.
> Other controllers mostly provide the limit or weight controls, I'd say
> protection semantics is specific only to the memory controller so
> far [2]. I don't think (at least by now) it can be aligned as the weight
> or limit semantics.

Can you explain why you think protection is different from a weight?

Both specify a minimum amount of a resource that the cgroup can use
under contention, while allowing the cgroup to use more than that
share if there is no contention with siblings.

You configure memory in bytes instead of a relative proportion, but
that's only because bytes are a natural unit of memory whereas a
relative proportion of time is a natural unit of CPU and IO.
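
As a rough illustration of that equivalence, the sketch below turns
weights into absolute guarantees and then runs the same work-conserving
split for both kinds of resource. The function names, the numbers, and
the first-come handling of spare capacity are all assumptions made for
the example; no real controller distributes leftovers this simply.

```python
def guarantees_from_weights(weights, capacity):
    """Translate relative weights into the absolute minimum each sibling
    is entitled to when all siblings contend for the resource."""
    total = sum(weights.values())
    return {name: capacity * w / total for name, w in weights.items()}


def distribute(capacity, demand, guarantee):
    """Work-conserving split; assumes the guarantees fit into capacity."""
    # Every sibling first receives its demand, up to its guarantee.
    grant = {name: min(demand[name], guarantee[name]) for name in demand}
    spare = capacity - sum(grant.values())
    # Anything left unclaimed is lent to siblings that still want more.
    for name in demand:
        extra = min(demand[name] - grant[name], spare)
        grant[name] += extra
        spare -= extra
    return grant


# IO expressed as weights (hypothetical 300 units of bandwidth).
io_guarantee = guarantees_from_weights({"B": 200, "C": 100}, capacity=300)
io_contended = distribute(300, {"B": 300, "C": 300}, io_guarantee)
io_idle_b = distribute(300, {"B": 50, "C": 300}, io_guarantee)

# Memory expressed directly in bytes (hypothetical 16G machine, in G).
mem_guarantee = {"B": 10, "C": 2}
mem_contended = distribute(16, {"B": 16, "C": 16}, mem_guarantee)
```

Under full contention each sibling keeps at least its guarantee; when B
is mostly idle, C may use well beyond its own share. The only difference
between the two resources in this model is the unit in which the
guarantee is written down.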

I'm having trouble concluding from this that the inheritance rules
must be fundamentally different.

For example, if you assign a share of CPU or IO to a subtree, that
applies to the entire subtree. Nobody has proposed being able to
opt-out of shares in a subtree, let alone forcing individual cgroups
to *opt-in* to receive these shares.

I can't fathom why you think assigning pieces of memory to a subtree
must be fundamentally different.

> > I've made the case why it's not a supported usecase, and why it is a
> > meaningless configuration in practice due to the way other controllers
> > already behave.
> I see how your reasoning works for limits (you set memory limit and you
> need to control io/cpu too to maintain intended isolation). I'm confused
> why having a scapegoat (or donor) sibling for protection should not be
> supported or how it breaks protection for others if not combined with
> io/cpu controllers. Feel free to point me to the message if I overlooked
> it.

Because a lack of memory translates to paging, which consumes IO and
CPU. If you relinquish a cgroup's share of memory (whether with a
limit or with a lack of protection under pressure), you increase its
share of IO. To express a priority order between workloads, you cannot
opt out of memory protection without also opting out of the IO shares.

Say you have the following configuration:

                   A
                  / \
                 B   C
                /\
               D  E

D houses your main workload, C a secondary workload, E is not
important. You give B protection and C less protection. You opt E out
of B's memory share to give it all to D. You established a memory
order of D > C > E.

Now to the IO side. You assign B a higher weight than C, and D a
higher weight than E.

Now you apply memory pressure, what happens? D isn't reclaimed, C is
somewhat reclaimed, E is reclaimed hard. D will not page, C will page
a little bit, E will page hard *with the higher IO priority of B*.

Now C is stuck behind E. This is a priority inversion.
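
The inversion can be reproduced with a small model of hierarchical,
work-conserving weights. The sketch below assumes hypothetical IO
weights (B=200, C=100, D=200, E=100) and a simplified rule in which each
level splits its share among the children that currently have IO in
flight; real controllers are more involved.

```python
from fractions import Fraction

# Hypothetical tree and weights matching the diagram above.
CHILDREN = {"A": ["B", "C"], "B": ["D", "E"]}
IO_WEIGHT = {"B": 200, "C": 100, "D": 200, "E": 100}


def io_shares(root, active, children=CHILDREN, weight=IO_WEIGHT):
    """Split the root's IO among the active leaves, level by level."""

    def has_active(node):
        kids = children.get(node, [])
        if not kids:
            return node in active
        return any(has_active(kid) for kid in kids)

    shares = {}

    def walk(node, share):
        # Only children with IO in flight compete for the parent's share.
        kids = [kid for kid in children.get(node, []) if has_active(kid)]
        if not kids:
            shares[node] = share
            return
        total = sum(weight[kid] for kid in kids)
        for kid in kids:
            walk(kid, share * Fraction(weight[kid], total))

    if has_active(root):
        walk(root, Fraction(1))
    return shares


# Under memory pressure D does not page, so only C and E issue IO.
under_pressure = io_shares("A", {"C", "E"})

# For comparison: what the shares would be if D were paging as well.
all_paging = io_shares("A", {"C", "D", "E"})
```

With only C and E paging, E inherits all of B's two thirds and C is left
with one third. Had D been competing too, E would have sat below C.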

Yes, from a pure accounting perspective, you've managed to enforce
that E will never have more physical pages allocated than C at any
given time. But what did that accomplish? What was the practical
benefit of having made E a scapegoat?

Since I'm repeating myself on this topic, I would really like to turn
your questions around:

1. Can you please make a practical use case for having scapegoats or
   donor groups to justify retaining what I consider to be an
   unimportant artifact in the memory.low semantics?

2. If you think opting out of hierarchically assigned resources is a
   fundamentally important usecase, can you please either make an
   argument why it should also apply to CPU and IO, or alternatively
   explain in detail why they are meaningfully different?