Date: Tue, 11 Feb 2020 17:47:53 +0100
From: Michal Hocko
To: Johannes Weiner
Cc: Andrew Morton, Roman Gushchin, Tejun Heo, linux-mm@kvack.org,
    cgroups@vger.kernel.org, linux-kernel@vger.kernel.org,
    kernel-team@fb.com
Subject: Re: [PATCH v2 3/3] mm: memcontrol: recursive memory.low protection
Message-ID: <20200211164753.GQ10636@dhcp22.suse.cz>
References: <20191219200718.15696-1-hannes@cmpxchg.org>
    <20191219200718.15696-4-hannes@cmpxchg.org>
    <20200130170020.GZ24244@dhcp22.suse.cz>
    <20200203215201.GD6380@cmpxchg.org>
In-Reply-To: <20200203215201.GD6380@cmpxchg.org>

On Mon 03-02-20 16:52:01, Johannes Weiner wrote:
> On Thu, Jan 30, 2020 at 06:00:20PM +0100, Michal Hocko wrote:
> > On Thu 19-12-19 15:07:18, Johannes Weiner wrote:
> > > Right now, the effective protection of any given cgroup is capped by
> > > its own explicit memory.low setting, regardless of what the parent
> > > says. The reasons for this are mostly historical and ease of
> > > implementation: to make delegation of memory.low safe, effective
> > > protection is the min() of all memory.low up the tree.
> > >
> > > Unfortunately, this limitation makes it impossible to protect an
> > > entire subtree from another without forcing the user to make explicit
> > > protection allocations all the way to the leaf cgroups - something
> > > that is highly undesirable in real life scenarios.
> > >
> > > Consider memory in a data center host.
> > > At the cgroup top level, we
> > > have a distinction between system management software and the actual
> > > workload the system is executing. Both branches are further subdivided
> > > into individual services, job components etc.
> > >
> > > We want to protect the workload as a whole from the system management
> > > software, but that doesn't mean we want to protect and prioritize
> > > individual workload wrt each other. Their memory demand can vary over
> > > time, and we'd want the VM to simply cache the hottest data within the
> > > workload subtree. Yet, the current memory.low limitations force us to
> > > allocate a fixed amount of protection to each workload component in
> > > order to get protection from system management software in
> > > general. This results in very inefficient resource distribution.
> >
> > I do agree that configuring the reclaim protection is not an easy task.
> > Especially in a deeper reclaim hierarchy. systemd tends to create a deep
> > and commonly shared subtrees. So having a protected workload really
> > requires to be put directly into a new first level cgroup in practice
> > AFAICT. That is a simpler example though. Just imagine you want to
> > protect a certain user slice.
>
> Can you elaborate a bit on this? I don't quite understand the two
> usecases you are contrasting here.

Essentially this is about two different usecases. The first one is about
protecting a hierarchy and spreading the protection among different
workloads; the second is about protecting an inner memcg without
configuring protection all the way up the hierarchy.

> > You seem to be facing a different problem though IIUC. You know how much
> > memory you want to protect and you do not have to care about the cgroup
> > hierarchy up but you do not know/care how to distribute that protection
> > among workloads running under that protection. I agree that this is a
> > reasonable usecase.
>
> I'm not sure I'm parsing this right, but the use case is this:
>
> When I'm running a multi-component workload on a host without any
> cgrouping, the individual components compete over the host's memory
> based on rate of allocation, how often they reference their memory and
> so forth. It's a need-based distribution of pages, and the weight can
> change as demand changes throughout the life of the workload.
>
> If I now stick several of such workloads into a containerized
> environment, I want to use memory.low to assign each workload as a
> whole a chunk of memory it can use - I don't want to assign fixed-size
> subchunks to each individual component of each workload! I want the
> same free competition *within* the workload while setting clear rules
> for competition *between* the different workloads.

Yeah, that matches my understanding of the problem you are trying to
solve here.

>
> [ What I can do today to achieve this is disable the memory controller
> for the subgroups. When I do this, all pages of the workload are on
> one single LRU that is protected by one single memory.low.
>
> But obviously I lose any detailed accounting as well.
>
> This patch allows me to have the same recursive protection semantics
> while retaining accounting. ]
>
> > Those both problems however show that we have a more general
> > configurability problem for both leaf and intermediate nodes. They are
> > both a result of strong requirements imposed by delegation as you have
> > noted above. I am thinking didn't we just go too rigid here?
>
> The requirement for delegation is that child groups cannot claim more
> than the parent affords. Is that the restriction you are referring to?

Yes.

> > Delegation points are certainly a security boundary and they should
> > be treated like that but do we really need a strong containment when
> > the reclaim protection is under admin full control?
> > Does the admin
> > really have to reconfigure a large part of the hierarchy to protect a
> > particular subtree?
> >
> > I do not have a great answer on how to implement this unfortunately. The
> > best I could come up with was to add a "$inherited_protection" magic
> > value to distinguish from an explicit >=0 protection. What's the
> > difference? $inherited_protection would be a default and it would always
> > refer to the closest explicit protection up the hierarchy (with 0 as a
> > default if there is none defined).
> >         A
> >        / \
> >       B   C (low=10G)
> >          / \
> >         D   E (low = 5G)
> >
> > A, B don't get any protection (low=0). C gets protection (10G) and
> > distributes the pressure to D, E when in excess. D inherits (low=10G)
> > and E overrides the protection to 5G.
> >
> > That would help both usecases AFAICS while the delegation should be
> > still possible (configure the delegation point with an explicit
> > value). I have very likely not thought that through completely. Does
> > that sound like a completely insane idea?
> >
> > Or do you think that the two usecases are simply impossible to handle
> > at the same time?
>
> Doesn't my patch accomplish this?

Unless I am missing something, I am afraid it doesn't. Say you have a
default systemd cgroup deployment (i.e. a deeper cgroup hierarchy with
slices and scopes) and now you want to grant reclaim protection to a
leaf cgroup (or even a whole slice that is not really important). All
the hierarchy up the tree has the protection set to 0 by default, right?
You simply cannot get that protection. You would need to configure the
protection up the hierarchy and that is really cumbersome.

> Any cgroup or group of cgroups still cannot claim more than the
> ancestral protection for the subtree. If a cgroup says 10G, the sum of
> all children's protection will never exceed that. This ensures
> delegation is safe.

Right. And the delegation usecase really requires that. No question
about that.
I am merely arguing that if you do not delegate then this is way too
strict.

-- 
Michal Hocko
SUSE Labs
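
The two semantics contrasted in this thread can be illustrated with a toy model. The sketch below is not kernel code: `Cgroup`, `effective_min_chain`, `effective_inherited` and `G` are hypothetical names, A is treated as an ordinary cgroup whose unset memory.low defaults to 0, the usage-proportional distribution of protection among siblings is ignored, and an explicit child value is assumed not to be capped by its parent under the inherited rule. It only compares the min()-up-the-tree rule quoted from the patch description with the "$inherited_protection" idea on the A/B/C/D/E example.

```python
from dataclasses import dataclass
from typing import Optional

G = 1 << 30  # one gibibyte, used only to keep the numbers readable


@dataclass
class Cgroup:
    """Toy cgroup: `low` is an explicit memory.low in bytes, or None if unset."""
    name: str
    low: Optional[int] = None
    parent: Optional["Cgroup"] = None


def effective_min_chain(cg: Cgroup) -> int:
    """Strict rule: min() of memory.low over the cgroup and all its ancestors.

    An unset value counts as the default of 0 (an assumption of this model).
    """
    eff = cg.low or 0
    node = cg.parent
    while node is not None:
        eff = min(eff, node.low or 0)
        node = node.parent
    return eff


def effective_inherited(cg: Cgroup) -> int:
    """Proposed rule: the closest explicit value up the hierarchy, else 0."""
    node: Optional[Cgroup] = cg
    while node is not None:
        if node.low is not None:
            return node.low
        node = node.parent
    return 0


# The example tree from the mail: only C and E carry an explicit value.
a = Cgroup("A")
b = Cgroup("B", parent=a)
c = Cgroup("C", low=10 * G, parent=a)
d = Cgroup("D", parent=c)
e = Cgroup("E", low=5 * G, parent=c)

for cg in (a, b, c, d, e):
    print(cg.name, effective_min_chain(cg) // G, effective_inherited(cg) // G)
```

In this model the strict rule leaves C, D and E without any protection because A stays at the default of 0, whereas the inherited rule gives C and D 10G and E 5G, matching the description of the diagram.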