linux-mm.kvack.org archive mirror
From: Johannes Weiner <hannes@cmpxchg.org>
To: "Michal Koutný" <mkoutny@suse.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	Roman Gushchin <guro@fb.com>, Michal Hocko <mhocko@suse.com>,
	Tejun Heo <tj@kernel.org>,
	linux-mm@kvack.org, cgroups@vger.kernel.org,
	linux-kernel@vger.kernel.org, kernel-team@fb.com
Subject: Re: [PATCH v2 3/3] mm: memcontrol: recursive memory.low protection
Date: Fri, 21 Feb 2020 13:58:39 -0500	[thread overview]
Message-ID: <20200221185839.GB70967@cmpxchg.org> (raw)
In-Reply-To: <20200221171256.GB23476@blackbody.suse.cz>

On Fri, Feb 21, 2020 at 06:12:56PM +0100, Michal Koutný wrote:
> On Thu, Dec 19, 2019 at 03:07:18PM -0500, Johannes Weiner <hannes@cmpxchg.org> wrote:
> > Unfortunately, this limitation makes it impossible to protect an
> > entire subtree from another without forcing the user to make explicit
> > protection allocations all the way to the leaf cgroups - something
> > that is highly undesirable in real life scenarios.
> I see that the jobs in descendant cgroups don't know (or care) what
> protection is above them, and hence the implicit distribution is sensible
> here.
> 
> However, the protection your case requires can already be reached thanks
> to the hierarchical capping and overcommit normalization -- you can
> set memory.low to "max" at all the non-caring descendants.
> IIUC, that is the same as setting zeroes (after your patch) and relying
> on the recursive distribution of unused protection -- or is there a
> mistake in my reasoning?

That is correct, but it comes with major problems. We did in fact try
exactly this as a workaround in our fleet, but had to revert and
develop the patch we are discussing now instead.

The reason is this: max isn't a "don't care" value. It's just a high
number with actual meaning in the configuration, and that interferes
when you try to compose it with other settings, such as limits.

Here is a configuration we actually use in practice:

                workload.slice (memory.low=20G)
                /                      \
              job (max=12G, low=10G)    job2 (max=12G, low=10G)
             /   \
           task logger

The idea is that we want to mostly protect the workload from other
stuff running in the system (low=20G), but we also want to catch a job
when it goes wild, to ensure reproducibility in testing regardless of
how loaded the host otherwise is (max=12G).
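For concreteness, the tree above maps to plain file writes under the
cgroup2 filesystem. This is only a sketch: the layout under
/sys/fs/cgroup and the slice/job names are illustrative, and the
dry-run default prints the writes instead of touching a host:

```python
from pathlib import Path

# Illustrative only: the settings from the diagram above, expressed
# as cgroupfs writes. Paths assume a hypothetical workload.slice/job
# hierarchy; dry_run prints the writes instead of performing them.
CONFIG = {
    "workload.slice/memory.low": "20G",
    "workload.slice/job/memory.low": "10G",
    "workload.slice/job/memory.max": "12G",
    "workload.slice/job2/memory.low": "10G",
    "workload.slice/job2/memory.max": "12G",
}

def apply_config(root="/sys/fs/cgroup", dry_run=True):
    """Write each setting to its cgroupfs file (print only in dry-run)."""
    for rel, value in CONFIG.items():
        path = Path(root) / rel
        if dry_run:
            print(f"echo {value} > {path}")
        else:
            path.write_text(value)

apply_config()
```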

When you set task's and logger's memory.low to "max", 10G, or any
other bogus number like this, limit reclaim in job treats it as
genuine protection and tries hard to avoid reclaiming anything in
either of the two cgroups. memory.events::low skyrockets even though
no intended protection was violated, and we see reclaim latencies
(especially when a few dying cgroups have accumulated in the subtree).
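The arithmetic behind this is easy to see with a toy model (my
simplification, not kernel code): when siblings over-commit their
parent's effective protection, each explicit claim is scaled down by
its share of the total, so "max" never degrades into "don't care":

```python
# Toy model of a child's effective memory.low inside its parent
# when siblings over-commit the parent's protection. Numbers in GiB.
def child_elow(parent_elow, child_low, siblings_low_sum):
    """Effective protection a child retains under its parent."""
    if siblings_low_sum <= parent_elow:
        return child_low  # claims fit: honored as-is
    # over-committed: scale each claim by its share of the total
    return parent_elow * child_low / siblings_low_sum

BIG = 2 ** 63  # stand-in for the "max" setting

# Workaround: task and logger both set memory.low to "max" under job
# (effective low 10G). Each retains real protection, so job's limit
# reclaim at max=12G runs into protected memory and fires low events.
for child in ("task", "logger"):
    print(child, child_elow(10, BIG, 2 * BIG))  # 5.0 GiB each

# Whereas a true "don't care" (low=0) leaves nothing protected:
print("task", child_elow(10, 0, 0))  # 0
```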

So we had to undo this setting because of workload performance and
problems with monitoring workload health (the bogus low events).

The secondary problem with requiring explicit downward propagation is
that you may want to protect all jobs on the host from system
management software, as a very high-level host configuration. But a
random job that gets scheduled on a host, that lives in a delegated
cgroup and namespace, and creates its own nested tree of cgroups to
manage stuff - that job can't possibly *know* about the top-level host
protection that lies beyond the delegation point and outside its own
namespace, and that it needs to propagate protection against rpm
upgrades into its own leaf groups for each tasklet and component.

Again, in practice we have found this to be totally unmanageable: we
routinely forgot it at first, and then had trouble hacking the
propagation into random jobs that create their own groups.

[ And these job subgroups don't even use their *own* memory.low
  prioritization between siblings yet - god knows how you would
  integrate that with the values that you may inherit from higher
  level ancestors. ]

And when you add new hardware configurations, you cannot just make a
top-level change in the host config, you have to update all the job
specs of workloads running in the fleet.

My patch brings memory configuration in line with other cgroup2
controllers. You can make a high-level decision to prioritize one
subtree over another, just like a top-level weight assignment in CPU
or IO, and then you can delegate the subtree to a different entity
that doesn't need to be aware of and reflect that decision all the way
down the tree in its own settings.

And of course you can compose it properly with limits.
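Extending the toy model above (again my simplification, not kernel
code), the recursive semantics amount to one extra step: protection
the children do not claim explicitly is passed down in proportion to
each child's memory usage, so a delegated subtree that sets nothing
still inherits its share:

```python
# Sketch of the recursive memory.low distribution. Numbers in GiB.
def effective_low(parent_elow, child_low, child_usage,
                  siblings_low_sum, siblings_usage_sum):
    """Effective protection of one child under its parent."""
    if siblings_low_sum > parent_elow:
        # over-committed: scale explicit claims proportionally
        return parent_elow * child_low / siblings_low_sum
    claimed = child_low
    unclaimed = parent_elow - siblings_low_sum
    if unclaimed > 0 and siblings_usage_sum > 0:
        # recursive pass-down of the unclaimed remainder, by usage
        claimed += unclaimed * child_usage / siblings_usage_sum
    return claimed

# task and logger set nothing (low=0) under job (effective low 10G);
# with usages of 6G and 4G they inherit 6G and 4G of protection,
# without either of them knowing the parent's configuration:
print(effective_low(10, 0, 6, 0, 10))  # 6.0
print(effective_low(10, 0, 4, 0, 10))  # 4.0
```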

> So in my view, the recursive distribution doesn't bring anything new;
> however, the new semantics of memory.low don't allow turning the
> protection off in a protected subtree (delegating the decision to
> distribute protection within parent bounds is IMO a valid use case).

I've made the case why it's not a supported use case, and why it is a
meaningless configuration in practice, given the way other controllers
already behave.

I think at this point in the discussion, the only thing I can do is
remind you that the behavior I'm introducing is gated behind a mount
option that nobody is forced to enable if they insist on disagreeing,
against all evidence to the contrary.


