From: Michal Hocko <mhocko@kernel.org>
To: Johannes Weiner <hannes@cmpxchg.org>
Cc: Andrew Morton <akpm@linux-foundation.org>,
Roman Gushchin <guro@fb.com>, Tejun Heo <tj@kernel.org>,
linux-mm@kvack.org, cgroups@vger.kernel.org,
linux-kernel@vger.kernel.org, kernel-team@fb.com
Subject: Re: [PATCH v2 3/3] mm: memcontrol: recursive memory.low protection
Date: Tue, 11 Feb 2020 17:47:53 +0100
Message-ID: <20200211164753.GQ10636@dhcp22.suse.cz>
In-Reply-To: <20200203215201.GD6380@cmpxchg.org>

On Mon 03-02-20 16:52:01, Johannes Weiner wrote:
> On Thu, Jan 30, 2020 at 06:00:20PM +0100, Michal Hocko wrote:
> > On Thu 19-12-19 15:07:18, Johannes Weiner wrote:
> > > Right now, the effective protection of any given cgroup is capped by
> > > its own explicit memory.low setting, regardless of what the parent
> > > says. The reasons for this are mostly historical and ease of
> > > implementation: to make delegation of memory.low safe, effective
> > > protection is the min() of all memory.low up the tree.
> > >
> > > Unfortunately, this limitation makes it impossible to protect an
> > > entire subtree from another without forcing the user to make explicit
> > > protection allocations all the way to the leaf cgroups - something
> > > that is highly undesirable in real life scenarios.
> > >
> > > Consider memory in a data center host. At the cgroup top level, we
> > > have a distinction between system management software and the actual
> > > workload the system is executing. Both branches are further subdivided
> > > into individual services, job components etc.
> > >
> > > We want to protect the workload as a whole from the system management
> > > software, but that doesn't mean we want to protect and prioritize
> > > individual workload wrt each other. Their memory demand can vary over
> > > time, and we'd want the VM to simply cache the hottest data within the
> > > workload subtree. Yet, the current memory.low limitations force us to
> > > allocate a fixed amount of protection to each workload component in
> > > order to get protection from system management software in
> > > general. This results in very inefficient resource distribution.
> >
> > I do agree that configuring the reclaim protection is not an easy task,
> > especially in a deeper reclaim hierarchy. systemd tends to create deep
> > and commonly shared subtrees. So a protected workload really has to be
> > put directly into a new first level cgroup in practice AFAICT. That is
> > the simpler example though. Just imagine you want to protect a certain
> > user slice.
>
> Can you elaborate a bit on this? I don't quite understand the two
> usecases you are contrasting here.

Essentially this is about two different usecases. The first one is about
protecting a hierarchy and spreading the protection among different
workloads, and the second is about protecting an inner memcg without
configuring protection all the way up the hierarchy.

> > You seem to be facing a different problem though IIUC. You know how much
> > memory you want to protect and you do not have to care about the cgroup
> > hierarchy up but you do not know/care how to distribute that protection
> > among workloads running under that protection. I agree that this is a
> > reasonable usecase.
>
> I'm not sure I'm parsing this right, but the use case is this:
>
> When I'm running a multi-component workload on a host without any
> cgrouping, the individual components compete over the host's memory
> based on rate of allocation, how often they reference their memory and
> so forth. It's a need-based distribution of pages, and the weight can
> change as demand changes throughout the life of the workload.
>
> If I now stick several of such workloads into a containerized
> environment, I want to use memory.low to assign each workload as a
> whole a chunk of memory it can use - I don't want to assign fixed-size
> subchunks to each individual component of each workload! I want the
> same free competition *within* the workload while setting clear rules
> for competition *between* the different workloads.

Yeah, that matches my understanding of the problem you are trying to
solve here.

>
> [ What I can do today to achieve this is disable the memory controller
> for the subgroups. When I do this, all pages of the workload are on
> one single LRU that is protected by one single memory.low.
>
> But obviously I lose any detailed accounting as well.
>
> This patch allows me to have the same recursive protection semantics
> while retaining accounting. ]
>
> > Both of those problems, however, show that we have a more general
> > configurability problem for both leaf and intermediate nodes. They are
> > both a result of strong requirements imposed by delegation as you have
> > noted above. I am wondering whether we just went too rigid here.
>
> The requirement for delegation is that child groups cannot claim more
> than the parent affords. Is that the restriction you are referring to?

Yes.

> > Delegation points are certainly a security boundary and they should
> > be treated like that, but do we really need strong containment when
> > the reclaim protection is under the admin's full control? Does the
> > admin really have to reconfigure a large part of the hierarchy to
> > protect a particular subtree?
> >
> > I do not have a great answer on how to implement this unfortunately. The
> > best I could come up with was to add a "$inherited_protection" magic
> > value to distinguish from an explicit >=0 protection. What's the
> > difference? $inherited_protection would be a default and it would always
> > refer to the closest explicit protection up the hierarchy (with 0 as a
> > default if there is none defined).
> >       A
> >      / \
> >     B   C (low=10G)
> >        / \
> >       D   E (low=5G)
> >
> > A, B don't get any protection (low=0). C gets protection (10G) and
> > distributes the pressure to D, E when in excess. D inherits (low=10G)
> > and E overrides the protection to 5G.
> >
> > That would help both usecases AFAICS while the delegation should be
> > still possible (configure the delegation point with an explicit
> > value). I have very likely not thought that through completely. Does
> > that sound like a completely insane idea?
> >
> > Or do you think that the two usecases are simply impossible to handle
> > at the same time?
>
> Doesn't my patch accomplish this?

Unless I am missing something, I am afraid it doesn't. Say you have a
default systemd cgroup deployment (aka a deeper cgroup hierarchy with
slices and scopes) and now you want to grant reclaim protection to a
leaf cgroup (or even a whole slice, that is not really important). The
whole hierarchy up the tree has the protection set to 0 by default,
right? So you simply cannot get that protection. You would need to
configure the protection all the way up the hierarchy, and that is
really cumbersome.
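
To illustrate the point, here is a minimal sketch of the "min() of all
memory.low up the tree" rule quoted from the changelog above. It is a
toy model and not the kernel implementation: struct cg and
effective_low_min() are made-up names, and usage-based proportional
distribution is ignored entirely.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified cgroup node; not the kernel's struct mem_cgroup. */
struct cg {
	const struct cg *parent;	/* NULL for the topmost modelled cgroup */
	uint64_t low;			/* explicit memory.low in bytes, 0 by default */
};

/*
 * Effective protection modelled as the min() of memory.low over the
 * cgroup itself and all of its ancestors.
 */
static uint64_t effective_low_min(const struct cg *c)
{
	uint64_t elow = c->low;

	for (c = c->parent; c; c = c->parent)
		if (c->low < elow)
			elow = c->low;
	return elow;
}
```

With a slice and a scope left at the default of 0 and a leaf configured
with 5G, this model gives the leaf an effective protection of 0. The
leaf only sees its 5G once every ancestor is configured with at least
that much.
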
> Any cgroup or group of cgroups still cannot claim more than the
> ancestral protection for the subtree. If a cgroup says 10G, the sum of
> all children's protection will never exceed that. This ensures
> delegation is safe.

Right. And the delegation usecase really requires that. No question
about that. I am merely arguing that if you do not delegate then this is
way too strict.
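
For reference, the "$inherited_protection" idea quoted above could be
sketched roughly as follows. This is only an illustration of the lookup
I have in mind, not a proposal for actual code: LOW_INHERITED, struct cg
and resolve_low() are hypothetical names, and how the resolved value
would then be distributed among siblings is left out.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sentinel standing in for "$inherited_protection". */
#define LOW_INHERITED UINT64_MAX

/* Hypothetical, simplified cgroup node; not the kernel's struct mem_cgroup. */
struct cg {
	const struct cg *parent;	/* NULL for the topmost modelled cgroup */
	uint64_t low;			/* explicit value in bytes, or LOW_INHERITED */
};

/*
 * Resolve the protection of a cgroup: use its own explicit value if it
 * has one, otherwise the closest explicit value up the hierarchy, and
 * 0 if nothing is configured anywhere above.
 */
static uint64_t resolve_low(const struct cg *c)
{
	for (; c; c = c->parent)
		if (c->low != LOW_INHERITED)
			return c->low;
	return 0;
}
```

Applied to the example tree, A and B resolve to 0, C to its explicit
10G, D to the 10G it inherits from C, and E to its own 5G.
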
--
Michal Hocko
SUSE Labs