From: Johannes Weiner <hannes@cmpxchg.org>
To: Shakeel Butt <shakeelb@google.com>
Cc: Michal Hocko <mhocko@suse.com>,
	David Rientjes <rientjes@google.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Yu Zhao <yuzhao@google.com>,
	Dave Hansen <dave.hansen@linux.intel.com>,
	linux-mm@kvack.org, Yosry Ahmed <yosryahmed@google.com>,
	Wei Xu <weixugc@google.com>, Greg Thelen <gthelen@google.com>
Subject: Re: [RFC] Mechanism to induce memory reclaim
Date: Mon, 7 Mar 2022 15:26:18 -0500
Message-ID: <YiZqau8LQyNoLSd7@cmpxchg.org>
In-Reply-To: <20220307183141.npa4627fpbsbgwvv@google.com>

On Mon, Mar 07, 2022 at 06:31:41PM +0000, Shakeel Butt wrote:
> On Mon, Mar 07, 2022 at 03:41:45PM +0100, Michal Hocko wrote:
> > On Sun 06-03-22 15:11:23, David Rientjes wrote:
> > [...]
> > > Some questions to get discussion going:
> > >
> > >  - Overall feedback or suggestions for the proposal in general?
> 
> > Do we really need this interface? What are the use cases that cannot
> > use the existing interfaces we have for that? Most notably memcg and
> > its high limit?
> 
> 
> Let me take a stab at this. The specific reasons why high limit is not a
> good interface to implement proactive reclaim:
> 
> 1) It can cause allocations from the target application to get
> throttled.
> 
> 2) It leaves state (the high limit) in the kernel which needs to be
> reset by the userspace part of the proactive reclaimer.
> 
> If I remember correctly, Facebook actually tried to use the high limit
> to implement proactive reclaim, but due to exactly these limitations [1]
> they went the route [2] aligned with this proposal.
> 
> To further explain why the above limitations are pretty bad: The
> proactive reclaimers usually use a feedback loop to decide how much to
> squeeze from the target applications without impacting their
> performance, or while keeping the impact within a tolerable range. The
> metrics used for the feedback loop are either refaults or PSI, and
> these metrics become messy when the application gets throttled by the
> high limit.
> 
> For (2), the high limit is a very awkward interface to use for
> proactive reclaim. If the userspace proactive reclaimer fails or
> crashes for whatever reason while triggering reclaim in an application,
> it can leave the application in a bad state (under memory pressure and
> throttled) for a long time.

Yes.

In addition to the proactive reclaimer crashing, we also had problems
with it simply not responding quickly enough.

Because there is a delay between reclaim (action) and refaults
(feedback), there is a very real upper limit on the number of pages you
can reasonably reclaim per second without risking pressure spikes that
far exceed tolerances. A fixed memory.high limit can easily force
reclaim beyond that safe rate when the workload expands abruptly. Even
if the proactive reclaimer process is alive, it's almost impossible for
it to step between a rapidly allocating process and its cgroup limit in
time.
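
To make the rate argument concrete, here is a minimal sketch of how a
userspace reclaimer might bound the amount it asks for in one interval.
It is an illustration only, not the controller we run: the function
name, the 1% step ratio, and the linear back-off are all assumptions.

```python
def next_reclaim_bytes(pressure, target_pressure, cgroup_bytes,
                       max_step_ratio=0.01):
    """Return how many bytes to request for reclaim in this interval.

    pressure and target_pressure are in the same unit (for example a
    PSI avg10 percentage). All names and defaults are hypothetical.
    """
    if target_pressure <= 0:
        raise ValueError("target_pressure must be positive")
    # At or above the tolerated pressure: stop reclaiming entirely and
    # wait for the delayed feedback to settle.
    if pressure >= target_pressure:
        return 0
    # Below target: scale the step by the remaining headroom, so the
    # request never exceeds max_step_ratio of the cgroup per interval.
    headroom = 1.0 - pressure / target_pressure
    return int(cgroup_bytes * max_step_ratio * headroom)
```

The point is that the step is capped per interval regardless of how
fast the workload grows, which a fixed limit cannot guarantee.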

The semantics of writing to memory.high also require that the new
limit is met before returning to userspace. This can take a long time,
during which the reclaimer cannot re-evaluate the optimal target size
based on observed pressure. We routinely saw the reclaimer get stuck
in the kernel hammering a suffering workload down to a stale target.

We tried for quite a while to make this work, but the limit semantics
turned out to not be a good fit for proactive reclaim.

A mechanism to request a fixed number of pages to reclaim turned out
to work much, much better in practice. We've been using a simple
per-cgroup knob (like here: https://lkml.org/lkml/2020/9/9/1094).
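
As a rough sketch of what the userspace side of such a knob can look
like: read the cgroup's PSI memory pressure, then write a byte count to
the reclaim file. This assumes the knob is a file named memory.reclaim
that accepts a byte count, as in the linked proposal; the helper names
are hypothetical and error handling is omitted.

```python
import os


def parse_psi_some_avg10(text):
    """Extract the 'some' avg10 value from PSI-formatted text, e.g.
    'some avg10=0.00 avg60=0.00 avg300=0.00 total=0'."""
    for line in text.splitlines():
        fields = line.split()
        if fields and fields[0] == "some":
            for field in fields[1:]:
                key, _, value = field.partition("=")
                if key == "avg10":
                    return float(value)
    raise ValueError("no 'some avg10' field found")


def request_reclaim(cgroup_dir, nbytes, knob="memory.reclaim"):
    """Ask the kernel to reclaim nbytes from the cgroup at cgroup_dir.

    The knob file name is an assumption taken from the proposal.
    """
    with open(os.path.join(cgroup_dir, knob), "w") as f:
        f.write(str(nbytes))
```

Each write is a one-shot request. If the reclaimer dies between
requests, no limit is left behind in the kernel to keep squeezing the
workload.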

With tiered memory systems coming up, I can see the need for
restricting reclaim to specific NUMA nodes. Demoting from DRAM to CXL
has a different cost function than evicting RAM/CXL to storage, and
those two things probably need to happen at different rates.

