linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
From: Johannes Weiner <hannes@cmpxchg.org>
To: Shakeel Butt <shakeelb@google.com>
Cc: "Michal Hocko" <mhocko@suse.com>, "Roman Gushchin" <guro@fb.com>,
	"Yang Shi" <yang.shi@linux.alibaba.com>,
	"Greg Thelen" <gthelen@google.com>,
	"David Rientjes" <rientjes@google.com>,
	"Michal Koutný" <mkoutny@suse.com>,
	"Andrew Morton" <akpm@linux-foundation.org>,
	"Linux MM" <linux-mm@kvack.org>,
	Cgroups <cgroups@vger.kernel.org>,
	LKML <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH] memcg: introduce per-memcg reclaim interface
Date: Thu, 1 Oct 2020 10:31:49 -0400	[thread overview]
Message-ID: <20201001143149.GA493631@cmpxchg.org> (raw)
In-Reply-To: <CALvZod5eN0PDtKo8SEp1n-xGvgCX9k6-OBGYLT3RmzhA+Q-2hw@mail.gmail.com>

On Wed, Sep 30, 2020 at 08:45:17AM -0700, Shakeel Butt wrote:
> On Tue, Sep 29, 2020 at 2:55 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
> >
> > On Tue, Sep 29, 2020 at 05:04:44PM +0200, Michal Hocko wrote:
> > > On Mon 28-09-20 17:02:16, Johannes Weiner wrote:
> > > [...]
> > > > My take is that a proactive reclaim feature, whose goal is never to
> > > > thrash or punish but to keep the LRUs warm and the workingset trimmed,
> > > > would ideally have:
> > > >
> > > > - a pressure or size target specified by userspace but with
> > > >   enforcement driven inside the kernel from the allocation path
> > > >
> > > > - the enforcement work NOT be done synchronously by the workload
> > > >   (something I'd argue we want for *all* memory limits)
> > > >
> > > > - the enforcement work ACCOUNTED to the cgroup, though, since it's the
> > > >   cgroup's memory allocations causing the work (again something I'd
> > > >   argue we want in general)
> > > >
> > > > - a delegatable knob that is independent of setting the maximum size
> > > >   of a container, as that expresses a different type of policy
> > > >
> > > > - if size target, self-limiting (ha) enforcement on a pressure
> > > >   threshold or stop enforcement when the userspace component dies
> > > >
> > > > Thoughts?
> > >
> > > Agreed with above points. What do you think about
> > > http://lkml.kernel.org/r/20200922190859.GH12990@dhcp22.suse.cz.
> >
> > I definitely agree with what you wrote in this email for background
> > reclaim. Indeed, your description sounds like what I proposed in
> > https://lore.kernel.org/linux-mm/20200219181219.54356-1-hannes@cmpxchg.org/
> > - what's missing from that patch is proper work attribution.
> >
> > > I assume that you do not want to override memory.high to implement
> > > this because that tends to be tricky from the configuration POV as
> > > you mentioned above. But a new limit (memory.middle for a lack of a
> > > better name) to define the background reclaim sounds like a good fit
> > > with above points.
> >
> > I can see that with a new memory.middle you could kind of sort of do
> > both - background reclaim and proactive reclaim.
> >
> > That said, I do see advantages in keeping them separate:
> >
> > 1. Background reclaim is essentially an allocation optimization that
> >    we may want to provide per default, just like kswapd.
> >
> >    Kswapd is tweakable of course, but I think actually few users do,
> >    and it works pretty well out of the box. It would be nice to
> >    provide the same thing on a per-cgroup basis per default and not
> >    ask users to make decisions that we are generally better at making.
> >
> > 2. Proactive reclaim may actually be better configured through a
> >    pressure threshold rather than a size target.
> >
> >    As per above, the goal is not to be punitive or containing. The
> >    goal is to keep the LRUs warm and move the colder pages to disk.
> >
> >    But how aggressively do you run reclaim for this purpose? What
> >    target value should a user write to such a memory.middle file?
> >
> >    For one, it depends on the job. A batch job, or a less important
> >    background job, may tolerate higher paging overhead than an
> >    interactive job. That means more of its pages could be trimmed from
> >    RAM and reloaded on-demand from disk.
> >
> >    But also, it depends on the storage device. If you move a workload
> >    from a machine with a slow disk to a machine with a fast disk, you
> >    can page more data in the same amount of time. That means while
> >    your workload tolerances stays the same, the faster the disk, the
> >    more aggressively you can do reclaim and offload memory.
> >
> >    So again, what should a user write to such a control file?
> >
> >    Of course, you can approximate an optimal target size for the
> >    workload. You can run a manual workingset analysis with page_idle,
> >    damon, or similar, determine a hot/cold cutoff based on what you
> >    know about the storage characteristics, then echo a number of pages
> >    or a size target into a cgroup file and let kernel do the reclaim
> >    accordingly. The drawbacks are that the kernel LRU may do a
> >    different hot/cold classification than you did and evict the wrong
> >    pages, the storage device latencies may vary based on overall IO
> >    pattern, and two equally warm pages may have very different paging
> >    overhead depending on whether readahead can avert a major fault or
> >    not. So it's easy to overshoot the tolerance target and disrupt the
> >    workload, or undershoot and have stale LRU data, waste memory etc.
> >
> >    You can also do a feedback loop, where you guess an optimal size,
> >    then adjust based on how much paging overhead the workload is
> >    experiencing, i.e. memory pressure. The drawbacks are that you have
> >    to monitor pressure closely and react quickly when the workload is
> >    expanding, as it can be potentially sensitive to latencies in the
> >    usec range. This can be tricky to do from userspace.
> >
> 
> This is actually what we do in our production i.e. feedback loop to
> adjust the next iteration of proactive reclaim.

That's what we do also right now. It works reasonably well, the only
two pain points are/have been the reaction time under quick workload
expansion and inadvertently forcing the workload into direct reclaim.

> We eliminated the IO or slow disk issues you mentioned by only
> focusing on anon memory and doing zswap.

Interesting, may I ask how the file cache is managed in this setup?

> >    So instead of asking users for a target size whose suitability
> >    heavily depends on the kernel's LRU implementation, the readahead
> >    code, the IO device's capability and general load, why not directly
> >    ask the user for a pressure level that the workload is comfortable
> >    with and which captures all of the above factors implicitly? Then
> >    let the kernel do this feedback loop from a per-cgroup worker.
> 
> I am assuming here by pressure level you are referring to the PSI like
> interface e.g. allowing the users to tell about their jobs that X
> amount of stalls in a fixed time window is tolerable.

Right, essentially the same parameters that psi poll() would take.


  reply	other threads:[~2020-10-01 14:33 UTC|newest]

Thread overview: 28+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-09-09 21:57 [PATCH] memcg: introduce per-memcg reclaim interface Shakeel Butt
2020-09-10  6:36 ` SeongJae Park
2020-09-10 16:10   ` Shakeel Butt
2020-09-10 16:34     ` SeongJae Park
2020-09-21 16:30 ` Michal Hocko
2020-09-21 17:50   ` Shakeel Butt
2020-09-22 11:49     ` Michal Hocko
2020-09-22 15:54       ` Shakeel Butt
2020-09-22 16:55         ` Michal Hocko
2020-09-22 18:10           ` Shakeel Butt
2020-09-22 18:31             ` Michal Hocko
2020-09-22 18:56               ` Shakeel Butt
2020-09-22 19:08             ` Michal Hocko
2020-09-22 20:02               ` Yang Shi
2020-09-22 22:38               ` Shakeel Butt
2020-09-28 21:02 ` Johannes Weiner
2020-09-29 15:04   ` Michal Hocko
2020-09-29 21:53     ` Johannes Weiner
2020-09-30 15:45       ` Shakeel Butt
2020-10-01 14:31         ` Johannes Weiner [this message]
2020-10-06 16:55           ` Shakeel Butt
2020-10-08 14:53             ` Johannes Weiner
2020-10-08 15:55               ` Shakeel Butt
2020-10-08 21:09                 ` Johannes Weiner
2020-09-30 15:26   ` Shakeel Butt
2020-10-01 15:10     ` Johannes Weiner
2020-10-05 21:59       ` Shakeel Butt
2020-10-08 15:14         ` Johannes Weiner

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20201001143149.GA493631@cmpxchg.org \
    --to=hannes@cmpxchg.org \
    --cc=akpm@linux-foundation.org \
    --cc=cgroups@vger.kernel.org \
    --cc=gthelen@google.com \
    --cc=guro@fb.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mhocko@suse.com \
    --cc=mkoutny@suse.com \
    --cc=rientjes@google.com \
    --cc=shakeelb@google.com \
    --cc=yang.shi@linux.alibaba.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).