Date: Tue, 25 Feb 2020 10:03:04 -0500
From: Johannes Weiner
To: Michal Koutný
Cc: Andrew Morton, Roman Gushchin, Michal Hocko, Tejun Heo,
    linux-mm@kvack.org, cgroups@vger.kernel.org,
    linux-kernel@vger.kernel.org, kernel-team@fb.com
Subject: Re: [PATCH v2 3/3] mm: memcontrol: recursive memory.low protection
Message-ID: <20200225150304.GA10257@cmpxchg.org>
References: <20191219200718.15696-1-hannes@cmpxchg.org>
    <20191219200718.15696-4-hannes@cmpxchg.org>
    <20200221171256.GB23476@blackbody.suse.cz>
    <20200221185839.GB70967@cmpxchg.org>
    <20200225133720.GA6709@blackbody.suse.cz>
In-Reply-To: <20200225133720.GA6709@blackbody.suse.cz>

Hello Michal,

On Tue, Feb 25, 2020 at 02:37:20PM +0100, Michal Koutný wrote:
> On Fri, Feb 21, 2020 at 01:58:39PM -0500, Johannes Weiner wrote:
> > When you set task's and logger's memory.low to "max" or 10G or any
> > bogus number like this, a limit reclaim in job treats this as origin
> > protection and tries hard to avoid reclaiming anything in either of
> > the two cgroups.
> What do you mean by origin protection? (I'm starting to see some
> misunderstanding here, c.f. my remark regarding the parent==root
> condition in the other patch [1]).

By origin protection I mean protection values at the first level of
children in a reclaim scope.
Those are taken as absolute numbers during a reclaim cycle and
propagated down the tree.

Say you have the following configuration:

                root_mem_cgroup
               /
              A (max=12G, low=10G)
             /
            B (low=max)

If global reclaim occurs, the protection for subtree A is 10G, and B
then gets a proportional share of that.

However, if limit reclaim in A occurs due to the 12G max limit, the
protection for subtree B is max.

> > memory.events::low skyrockets even though no intended
> > protection was violated, we'll have reclaim latencies (especially when
> > there are a few dying cgroups accumulated in subtree).
> Hopefully, I see where you are coming from. There would be no (false)
> low notifications if the elow was calculated all the way top-down from
> the real root. Would such calculation be the way to go?

That hinges on whether an opt-out mechanism makes sense, and we
disagree on that part.

> > that job can't possibly *know* about the top-level host
> > protection that lies beyond the delegation point and outside its own
> > namespace,
> Yes, I agree.
>
> > and that it needs to propagate protection against rpm upgrades into
> > its own leaf groups for each tasklet and component.
> If a job wants to use concrete protection then it sets it, if it wants
> to use protection from above, then it can express it with the infinity
> (after changing the effective calculation I described above).
>
> Now, you may argue that the infinity would be nonsensical if it's not a
> subordinate job. Simplest approach would be likely to introduce the
> special "inherit" value (such a literal name may be misleading as it
> would be also "dont-care").

Again, a complication of the interface for *everybody* on the premise
that retaining an opt-out mechanism makes sense. We disagree on that.

> > Again, in practice we have found this to be totally unmanageable and
> > routinely first forgot and then had trouble hacking the propagation
> > into random jobs that create their own groups.
> I've been bitten by this as well. However, the protection defaults to
> off and I find this matches the general rule that kernel provides the
> mechanism and user(space) the policy.
>
> > And when you add new hardware configurations, you cannot just make a
> > top-level change in the host config, you have to update all the job
> > specs of workloads running in the fleet.
> (I acknowledge the current mechanism lacks an explicit way to express
> the inherit/dont-care value.)
>
>
> > My patch brings memory configuration in line with other cgroup2
> > controllers.
> Other controllers mostly provide the limit or weight controls, I'd say
> protection semantics is specific only to the memory controller so
> far [2]. I don't think (at least by now) it can be aligned as the weight
> or limit semantics.

Can you explain why you think protection is different from a weight?

Both specify a minimum amount of a resource that the cgroup can use
under contention, while allowing the cgroup to use more than that
share if there is no contention with siblings.

You configure memory in bytes instead of a relative proportion, but
that's only because bytes are a natural unit of memory whereas a
relative proportion of time is a natural unit of CPU and IO.

I'm having trouble concluding from this that the inheritance rules
must be fundamentally different.

For example, if you assign a share of CPU or IO to a subtree, that
applies to the entire subtree. Nobody has proposed being able to
opt out of shares in a subtree, let alone forcing individual cgroups
to *opt in* to receive these shares.

I can't fathom why you think assigning pieces of memory to a subtree
must be fundamentally different.

> > I've made the case why it's not a supported usecase, and why it is a
> > meaningless configuration in practice due to the way other controllers
> > already behave.
> I see how your reasoning works for limits (you set memory limit and you
> need to control io/cpu too to maintain intended isolation). I'm confused
> why having a scapegoat (or donor) sibling for protection should not be
> supported or how it breaks protection for others if not combined with
> io/cpu controllers. Feel free to point me to the message if I overlooked
> it.

Because a lack of memory translates to paging, which consumes IO and
CPU. If you relinquish a cgroup's share of memory (whether with a
limit or with a lack of protection under pressure), you increase its
share of IO. To express a priority order between workloads, you cannot
opt out of memory protection without also opting out of the IO shares.

Say you have the following configuration:

                A
               / \
              B   C
             / \
            D   E

D houses your main workload, C a secondary workload, E is not
important. You give B protection and C less protection. You opt E out
of B's memory share to give it all to D. You established a memory
order of D > C > E.

Now to the IO side. You assign B a higher weight than C, and D a
higher weight than E.

Now you apply memory pressure, what happens? D isn't reclaimed, C is
somewhat reclaimed, E is reclaimed hard. D will not page, C will page
a little bit, E will page hard *with the higher IO priority of B*.

Now C is stuck behind E. This is a priority inversion.

Yes, from a pure accounting perspective, you've managed to enforce
that E will never have more physical pages allocated than C at any
given time. But what did that accomplish? What was the practical
benefit of having made E a scapegoat?

Since I'm repeating myself on this topic, I would really like to turn
your questions around:

1. Can you please make a practical use case for having scapegoats or
   donor groups to justify retaining what I consider to be an
   unimportant artifact in the memory.low semantics?

2. If you think opting out of hierarchically assigned resources is a
   fundamentally important usecase, can you please either make an
   argument why it should also apply to CPU and IO, or alternatively
   explain in detail why they are meaningfully different?
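
For reference, the origin-protection example earlier in this message
(root_mem_cgroup / A / B) can be sketched as a toy model in Python.
This is only an illustration, not the kernel's actual calculation: the
usage figures, the effective_low() helper and the single-child
simplification (where B's "proportional share" is simply everything A
has) are all assumptions made for the sketch.

```python
GIB = 1 << 30
MAX = float("inf")  # stands in for the literal "max" written to memory.low

# Configured memory.low values from the example; USAGE is assumed.
LOW = {"A": 10 * GIB, "B": MAX}
USAGE = {"A": 12 * GIB, "B": 12 * GIB}


def effective_low(chain, low=LOW, usage=USAGE):
    """Toy effective protection for the last cgroup in `chain`.

    `chain` lists cgroups from the first-level child of the reclaim
    root down to the cgroup of interest (hypothetical helper).
    """
    first = chain[0]
    # First level below the reclaim root: the configured value is taken
    # as an absolute number, bounded only by what is actually in use.
    elow = min(low[first], usage[first])
    for name in chain[1:]:
        # Deeper levels cannot be protected beyond what the parent got.
        # With a single child, its share is the parent's whole amount.
        elow = min(elow, low[name], usage[name])
    return elow


global_reclaim = effective_low(["A", "B"])  # reclaim scope: root_mem_cgroup
limit_reclaim = effective_low(["B"])        # reclaim scope: A, at max=12G

print(global_reclaim // GIB)  # B's protection under global reclaim, in GiB
print(limit_reclaim // GIB)   # B's protection under limit reclaim in A
```

In this model B ends up with 10G of protection under global reclaim,
but with its entire assumed 12G usage protected when the reclaim scope
is A. Limit reclaim in A then has nothing unprotected left to take,
which is the situation in which low events get counted even though no
intended protection was violated.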