* [RFC] Mechanism to induce memory reclaim
@ 2022-03-06 23:11 David Rientjes
  2022-03-07  0:49 ` Yu Zhao
                   ` (2 more replies)
  0 siblings, 3 replies; 24+ messages in thread
From: David Rientjes @ 2022-03-06 23:11 UTC (permalink / raw)
  To: Andrew Morton, Johannes Weiner, Michal Hocko, Yu Zhao, Dave Hansen
  Cc: linux-mm, Yosry Ahmed, Wei Xu, Shakeel Butt, Greg Thelen

Hi everybody,

We'd like to discuss formalizing a mechanism to induce memory reclaim by
the kernel.

The current multigenerational LRU proposal introduces a debugfs
mechanism[1] for this.  The "TMO: Transparent Memory Offloading in
Datacenters" paper also discusses a per-memcg mechanism[2].  While the
former can be used for debugging of MGLRU, both can quite powerfully be
used for proactive reclaim.

Google's datacenters use a similar per-memcg mechanism for the same
purpose.  Thus, formalizing the mechanism would allow our userspace to use
an upstream supported interface that will be stable and consistent.

This could be an incremental addition to MGLRU's lru_gen debugfs mechanism
but, since the concept has no direct dependency on the work, we believe it
is useful independent of the reclaim mechanism in use (both with and
without CONFIG_LRU_GEN).

Idea: introduce a per-node sysfs mechanism for inducing memory reclaim
that can be useful for global (non-memcg constrained) reclaim, even if
memcg is not enabled in the kernel or mounted.  This could optionally
take a memcg id to induce reclaim for a memcg hierarchy.

IOW, this would be a /sys/devices/system/node/nodeN/reclaim mechanism for
each NUMA node N on the system.  (It would be similar to the existing
per-node sysfs "compact" mechanism used to trigger compaction from
userspace.)

Userspace would write the following to this file:
 - nr_to_reclaim pages
 - swappiness factor
 - memcg_id of the hierarchy to reclaim from, if any[*]
 - flags to specify context, if any[**]
 
 [*] if global reclaim or memcg is not enabled/mounted, this is 0 since
     this is the return value of mem_cgroup_id()
 [**] this is offered for extensibility to specify the context in which
      reclaim is being done (clean file pages only, demotion for memory
      tiering vs eviction, etc), otherwise 0
 
An alternative may be to introduce a /sys/kernel/mm/reclaim mechanism that
also takes a nodemask to reclaim from.  The kernel would reclaim memory
over the set of nodes passed to it.
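
For concreteness, a userspace invocation could look something like the
minimal sketch below.  Note that the key=value write format, the field
names, and the values are purely illustrative assumptions for this
example, not a settled ABI:

  /* Illustrative only: ask node 0 to reclaim 65536 pages (256MB with 4K
   * pages) node-wide (memcg_id=0), with a default swappiness and no flags.
   */
  #include <fcntl.h>
  #include <stdio.h>
  #include <string.h>
  #include <unistd.h>

  int main(void)
  {
      const char *req = "nr_to_reclaim=65536 swappiness=60 memcg_id=0 flags=0";
      int fd = open("/sys/devices/system/node/node0/reclaim", O_WRONLY);

      if (fd < 0) {
          perror("open");
          return 1;
      }
      if (write(fd, req, strlen(req)) < 0)
          perror("write");
      close(fd);
      return 0;
  }

The /sys/kernel/mm/reclaim alternative would take the same kind of write,
with an additional nodemask field.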

Some questions to get discussion going:

 - Overall feedback or suggestions for the proposal in general?
 
 - This proposal uses a value specified in pages to reclaim; this could be
   a number of bytes instead.  I have no strong opinion, does anybody
   else?

 - Should this be a per-node mechanism under sysfs like the existing
   "compact" mechanism or should it be implemented as a single file that
   can optionally specify a nodemask to reclaim from?

Thanks!

[1] https://lore.kernel.org/linux-mm/20220208081902.3550911-12-yuzhao@google.com
[2] https://dl.acm.org/doi/10.1145/3503222.3507731 (Section 3.3)



* Re: [RFC] Mechanism to induce memory reclaim
  2022-03-06 23:11 [RFC] Mechanism to induce memory reclaim David Rientjes
@ 2022-03-07  0:49 ` Yu Zhao
  2022-03-07 14:41 ` Michal Hocko
  2022-03-07 20:50 ` Johannes Weiner
  2 siblings, 0 replies; 24+ messages in thread
From: Yu Zhao @ 2022-03-07  0:49 UTC (permalink / raw)
  To: David Rientjes, Andrea Righi
  Cc: Andrew Morton, Johannes Weiner, Michal Hocko, Dave Hansen,
	Linux-MM, Yosry Ahmed, Wei Xu, Shakeel Butt, Greg Thelen

On Sun, Mar 6, 2022 at 4:11 PM David Rientjes <rientjes@google.com> wrote:
>
> Hi everybody,
>
> We'd like to discuss formalizing a mechanism to induce memory reclaim by
> the kernel.
>
> The current multigenerational LRU proposal introduces a debugfs
> mechanism[1] for this.  The "TMO: Transparent Memory Offloading in
> Datacenters" paper also discusses a per-memcg mechanism[2].  While the
> former can be used for debugging of MGLRU, both can quite powerfully be
> used for proactive reclaim.
>
> Google's datacenters use a similar per-memcg mechanism for the same
> purpose.  Thus, formalizing the mechanism would allow our userspace to use
> an upstream supported interface that will be stable and consistent.
>
> This could be an incremental addition to MGLRU's lru_gen debugfs mechanism
> but, since the concept has no direct dependency on the work, we believe it
> is useful independent of the reclaim mechanism in use (both with and
> without CONFIG_LRU_GEN).
>
> Idea: introduce a per-node sysfs mechanism for inducing memory reclaim
> that can be useful for global (non-memcg constrained) reclaim and possible
> even if memcg is not enabled in the kernel or mounted.  This could
> optionally take a memcg id to induce reclaim for a memcg hierarchy.
>
> IOW, this would be a /sys/devices/system/node/nodeN/reclaim mechanim for
> each NUMA node N on the system.  (It would be similar to the existing
> per-node sysfs "compact" mechanism used to trigger compaction from
> userspace.)
>
> Userspace would write the following to this file:
>  - nr_to_reclaim pages
>  - swappiness factor
>  - memcg_id of the hierarchy to reclaim from, if any[*]
>  - flags to specify context, if any[**]
>
>  [*] if global reclaim or memcg is not enabled/mounted, this is 0 since
>      this is the return value of mem_cgroup_id()
>  [**] this is offered for extensibility to specify the context in which
>       reclaim is being done (clean file pages only, demotion for memory
>       tiering vs eviction, etc), otherwise 0
>
> An alternative may be to introduce a /sys/kernel/mm/reclaim mechanism that
> also takes a nodemask to reclaim from.  The kernel would reclaim memory
> over the set of nodes passed to it.
>
> Some questions to get discussion going:
>
>  - Overall feedback or suggestions for the proposal in general?
>
>  - This proposal uses a value specified in pages to reclaim; this could be
>    a number of bytes instead.  I have no strong opinion, does anybody
>    else?
>
>  - Should this be a per-node mechanism under sysfs like the existing
>    "compact" mechanism or should it be implemented as a single file that
>    can optionally specify a nodemask to reclaim from?
>
> Thanks!
>
> [1] https://lore.kernel.org/linux-mm/20220208081902.3550911-12-yuzhao@google.com
> [2] https://dl.acm.org/doi/10.1145/3503222.3507731 (Section 3.3)

Adding Canonical who also provided additional use cases [3] for this
potential ABI.

[3] https://lore.kernel.org/lkml/20201005081313.732745-1-andrea.righi@canonical.com/



* Re: [RFC] Mechanism to induce memory reclaim
  2022-03-06 23:11 [RFC] Mechanism to induce memory reclaim David Rientjes
  2022-03-07  0:49 ` Yu Zhao
@ 2022-03-07 14:41 ` Michal Hocko
  2022-03-07 18:31   ` Shakeel Butt
  2022-03-07 20:50 ` Johannes Weiner
  2 siblings, 1 reply; 24+ messages in thread
From: Michal Hocko @ 2022-03-07 14:41 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Johannes Weiner, Yu Zhao, Dave Hansen, linux-mm,
	Yosry Ahmed, Wei Xu, Shakeel Butt, Greg Thelen

On Sun 06-03-22 15:11:23, David Rientjes wrote:
[...]
> Some questions to get discussion going:
> 
>  - Overall feedback or suggestions for the proposal in general?

Do we really need this interface? What would be the use cases which
cannot use the existing interfaces we have for that, most notably memcg
and its high limit?

I do agree that the global means of triggering reclaim
(min_free_kbytes) is far from a precise tool, but it would be interesting
to hear more about why a number of pages to reclaim would be a more
useful interface. Could you elaborate on that please?
-- 
Michal Hocko
SUSE Labs



* Re: [RFC] Mechanism to induce memory reclaim
  2022-03-07 14:41 ` Michal Hocko
@ 2022-03-07 18:31   ` Shakeel Butt
  2022-03-07 20:26     ` Johannes Weiner
  2022-03-08 12:52     ` Michal Hocko
  0 siblings, 2 replies; 24+ messages in thread
From: Shakeel Butt @ 2022-03-07 18:31 UTC (permalink / raw)
  To: Michal Hocko
  Cc: David Rientjes, Andrew Morton, Johannes Weiner, Yu Zhao,
	Dave Hansen, linux-mm, Yosry Ahmed, Wei Xu, Greg Thelen

On Mon, Mar 07, 2022 at 03:41:45PM +0100, Michal Hocko wrote:
> On Sun 06-03-22 15:11:23, David Rientjes wrote:
> [...]
> > Some questions to get discussion going:
> >
> >  - Overall feedback or suggestions for the proposal in general?

> Do we really need this interface? What would be usecases which cannot
> use an existing interfaces we have for that? Most notably memcg and
> their high limit?


Let me take a stab at this. The specific reasons why high limit is not a
good interface to implement proactive reclaim:

1) It can cause allocations from the target application to get
throttled.

2) It leaves a state (high limit) in the kernel which needs to be reset
by the userspace part of the proactive reclaimer.

If I remember correctly, Facebook actually tried to use the high limit to
implement proactive reclaim but, due to exactly these limitations [1],
they went the route [2] aligned with this proposal.

To further explain why the above limitations are pretty bad: proactive
reclaimers usually use a feedback loop to decide how much to squeeze
from the target applications without impacting their performance, or
keeping the impact within a tolerable range. The metrics used for the
feedback loop are either refaults or PSI, and these metrics become messy
when the application gets throttled by the high limit.

For (2), the high limit is a very awkward interface to use for
proactive reclaim. If the userspace proactive reclaimer fails/crashes
for whatever reason while triggering reclaim in an application, it can
leave the application in a bad state (under memory pressure and
throttled) for a long time.
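
To make the awkwardness concrete, a high-limit-based proactive reclaimer
ends up doing something like the sketch below.  The cgroup path and the
reset-to-"max" step are illustrative assumptions rather than any
particular implementation; a crash between the two memory.high writes
leaves the workload stuck at the lowered limit:

  #include <stdio.h>

  /* Read the cgroup's current memory usage (memory.current). */
  static long read_current(const char *memcg)
  {
      char path[256];
      long val = 0;
      FILE *f;

      snprintf(path, sizeof(path), "%s/memory.current", memcg);
      f = fopen(path, "r");
      if (f) {
          fscanf(f, "%ld", &val);
          fclose(f);
      }
      return val;
  }

  /* Write a new memory.high value; the kernel reclaims down to it. */
  static void write_high(const char *memcg, const char *val)
  {
      char path[256];
      FILE *f;

      snprintf(path, sizeof(path), "%s/memory.high", memcg);
      f = fopen(path, "w");
      if (f) {
          fputs(val, f);
          fclose(f);
      }
  }

  int main(void)
  {
      const char *memcg = "/sys/fs/cgroup/workload"; /* hypothetical path */
      char buf[32];
      long target = read_current(memcg) - (64L << 20); /* reclaim ~64MB */

      /* racy: the workload may have grown since memory.current was read */
      snprintf(buf, sizeof(buf), "%ld", target);
      write_high(memcg, buf);

      /* this reset must not be skipped; if the reclaimer dies before it,
       * the workload stays throttled at the lowered limit */
      write_high(memcg, "max");
      return 0;
  }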

[1] https://lore.kernel.org/all/20200928210216.GA378894@cmpxchg.org/
[2] https://dl.acm.org/doi/10.1145/3503222.3507731 (Section 3.3)




* Re: [RFC] Mechanism to induce memory reclaim
  2022-03-07 18:31   ` Shakeel Butt
@ 2022-03-07 20:26     ` Johannes Weiner
  2022-03-08 12:53       ` Michal Hocko
  2022-03-08 12:52     ` Michal Hocko
  1 sibling, 1 reply; 24+ messages in thread
From: Johannes Weiner @ 2022-03-07 20:26 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Michal Hocko, David Rientjes, Andrew Morton, Yu Zhao,
	Dave Hansen, linux-mm, Yosry Ahmed, Wei Xu, Greg Thelen

On Mon, Mar 07, 2022 at 06:31:41PM +0000, Shakeel Butt wrote:
> On Mon, Mar 07, 2022 at 03:41:45PM +0100, Michal Hocko wrote:
> > On Sun 06-03-22 15:11:23, David Rientjes wrote:
> > [...]
> > > Some questions to get discussion going:
> > >
> > >  - Overall feedback or suggestions for the proposal in general?
> 
> > Do we really need this interface? What would be usecases which cannot
> > use an existing interfaces we have for that? Most notably memcg and
> > their high limit?
> 
> 
> Let me take a stab at this. The specific reasons why high limit is not a
> good interface to implement proactive reclaim:
> 
> 1) It can cause allocations from the target application to get
> throttled.
> 
> 2) It leaves a state (high limit) in the kernel which needs to be reset
> by the userspace part of proactive reclaimer.
> 
> If I remember correctly, Facebook actually tried to use high limit to
> implement the proactive reclaim but due to exactly these limitations [1]
> they went the route [2] aligned with this proposal.
> 
> To further explain why the above limitations are pretty bad: The
> proactive reclaimers usually use feedback loop to decide how much to
> squeeze from the target applications without impacting their performance
> or impacting within a tolerable range. The metrics used for the feedback
> loop are either refaults or PSI and these metrics becomes messy due to
> application getting throttled due to high limit.
> 
> For (2), the high limit interface is a very awkward interface to use to
> do proactive reclaim. If the userspace proactive reclaimer fails/crashed
> due to whatever reason during triggering the reclaim in an application,
> it can leave the application in a bad state (memory pressure state and
> throttled) for a long time.

Yes.

In addition to the proactive reclaimer crashing, we also had problems
of it simply not responding quickly enough.

Because there is a delay between reclaim (action) and refaults
(feedback), there is a very real upper limit of pages you can
reasonably reclaim per second, without risking pressure spikes that
far exceed tolerances. A fixed memory.high limit can easily exceed
that safe reclaim rate when the workload expands abruptly. Even if the
proactive reclaimer process is alive, it's almost impossible to step
between a rapidly allocating process and its cgroup limit in time.

The semantics of writing to memory.high also require that the new
limit is met before returning to userspace. This can take a long time,
during which the reclaimer cannot re-evaluate the optimal target size
based on observed pressure. We routinely saw the reclaimer get stuck
in the kernel hammering a suffering workload down to a stale target.

We tried for quite a while to make this work, but the limit semantics
turned out to not be a good fit for proactive reclaim.

A mechanism to request a fixed number of pages to reclaim turned out
to work much, much better in practice. We've been using a simple
per-cgroup knob (like here: https://lkml.org/lkml/2020/9/9/1094).

With tiered memory systems coming up, I can see the need for
restricting to specific numa nodes. Demoting from DRAM to CXL has a
different cost function than evicting RAM/CXL to storage, and those
two things probably need to happen at different rates.



* Re: [RFC] Mechanism to induce memory reclaim
  2022-03-06 23:11 [RFC] Mechanism to induce memory reclaim David Rientjes
  2022-03-07  0:49 ` Yu Zhao
  2022-03-07 14:41 ` Michal Hocko
@ 2022-03-07 20:50 ` Johannes Weiner
  2022-03-07 22:53   ` Wei Xu
                     ` (2 more replies)
  2 siblings, 3 replies; 24+ messages in thread
From: Johannes Weiner @ 2022-03-07 20:50 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Michal Hocko, Yu Zhao, Dave Hansen, linux-mm,
	Yosry Ahmed, Wei Xu, Shakeel Butt, Greg Thelen

On Sun, Mar 06, 2022 at 03:11:23PM -0800, David Rientjes wrote:
> Hi everybody,
> 
> We'd like to discuss formalizing a mechanism to induce memory reclaim by
> the kernel.
> 
> The current multigenerational LRU proposal introduces a debugfs
> mechanism[1] for this.  The "TMO: Transparent Memory Offloading in
> Datacenters" paper also discusses a per-memcg mechanism[2].  While the
> former can be used for debugging of MGLRU, both can quite powerfully be
> used for proactive reclaim.
> 
> Google's datacenters use a similar per-memcg mechanism for the same
> purpose.  Thus, formalizing the mechanism would allow our userspace to use
> an upstream supported interface that will be stable and consistent.
> 
> This could be an incremental addition to MGLRU's lru_gen debugfs mechanism
> but, since the concept has no direct dependency on the work, we believe it
> is useful independent of the reclaim mechanism in use (both with and
> without CONFIG_LRU_GEN).
> 
> Idea: introduce a per-node sysfs mechanism for inducing memory reclaim
> that can be useful for global (non-memcg constrained) reclaim and possible
> even if memcg is not enabled in the kernel or mounted.  This could
> optionally take a memcg id to induce reclaim for a memcg hierarchy.
> 
> IOW, this would be a /sys/devices/system/node/nodeN/reclaim mechanim for
> each NUMA node N on the system.  (It would be similar to the existing
> per-node sysfs "compact" mechanism used to trigger compaction from
> userspace.)

I generally think a proactive reclaim interface is a good idea.

A per-cgroup control knob would make more sense to me, as cgroupfs
takes care of delegation, namespacing etc. and so would permit
self-directed proactive reclaim inside containers.

> Userspace would write the following to this file:
>  - nr_to_reclaim pages

This makes sense, although (and you hinted at this below), I'm
thinking it should be in bytes, especially if part of cgroupfs.

>  - swappiness factor

This I'm not sure about.

Mostly because I'm not sure about swappiness in general. It balances
between anon and file, but both of them are aged according to the same
LRU rules. The only reason to prefer one over the other seems to be
when the cost of reloading one (refault vs swapin) isn't the same as
the other. That's usually a hardware property, which in a perfect
world we'd auto-tune inside the kernel based on observed IO
performance. Not sure why you'd want this per reclaim request.

>  - flags to specify context, if any[**]
>  
>  [**] this is offered for extensibility to specify the context in which
>       reclaim is being done (clean file pages only, demotion for memory
>       tiering vs eviction, etc), otherwise 0

This one is curious. I don't understand the use cases for either of
these examples, and I can't think of other flags a user may pass on a
per-invocation basis. Would you care to elaborate some?



* Re: [RFC] Mechanism to induce memory reclaim
  2022-03-07 20:50 ` Johannes Weiner
@ 2022-03-07 22:53   ` Wei Xu
  2022-03-08 12:53     ` Michal Hocko
  2022-03-08 14:49   ` Dan Schatzberg
  2022-03-09 22:30   ` David Rientjes
  2 siblings, 1 reply; 24+ messages in thread
From: Wei Xu @ 2022-03-07 22:53 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: David Rientjes, Andrew Morton, Michal Hocko, Yu Zhao,
	Dave Hansen, Linux MM, Yosry Ahmed, Shakeel Butt, Greg Thelen

On Mon, Mar 7, 2022 at 12:50 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Sun, Mar 06, 2022 at 03:11:23PM -0800, David Rientjes wrote:
> > Hi everybody,
> >
> > We'd like to discuss formalizing a mechanism to induce memory reclaim by
> > the kernel.
> >
> > The current multigenerational LRU proposal introduces a debugfs
> > mechanism[1] for this.  The "TMO: Transparent Memory Offloading in
> > Datacenters" paper also discusses a per-memcg mechanism[2].  While the
> > former can be used for debugging of MGLRU, both can quite powerfully be
> > used for proactive reclaim.
> >
> > Google's datacenters use a similar per-memcg mechanism for the same
> > purpose.  Thus, formalizing the mechanism would allow our userspace to use
> > an upstream supported interface that will be stable and consistent.
> >
> > This could be an incremental addition to MGLRU's lru_gen debugfs mechanism
> > but, since the concept has no direct dependency on the work, we believe it
> > is useful independent of the reclaim mechanism in use (both with and
> > without CONFIG_LRU_GEN).
> >
> > Idea: introduce a per-node sysfs mechanism for inducing memory reclaim
> > that can be useful for global (non-memcg constrained) reclaim and possible
> > even if memcg is not enabled in the kernel or mounted.  This could
> > optionally take a memcg id to induce reclaim for a memcg hierarchy.
> >
> > IOW, this would be a /sys/devices/system/node/nodeN/reclaim mechanim for
> > each NUMA node N on the system.  (It would be similar to the existing
> > per-node sysfs "compact" mechanism used to trigger compaction from
> > userspace.)
>
> I generally think a proactive reclaim interface is a good idea.

It is great to hear this.

> A per-cgroup control knob would make more sense to me, as cgroupfs
> takes care of delegation, namespacing etc. and so would permit
> self-directed proactive reclaim inside containers.

A per-cgroup control works perfectly for Google's data center use case
as well.  But a sysfs interface, such as /sys/kernel/mm/reclaim, that
takes a node mask and a memcg id as the arguments can be used by
proactive reclaimers on systems that don't use memcg (e.g. some
desktop Linux distros) as well, which is more general.  A special
value for memcg id indicating global reclaim can be passed to support
non-memcg use cases.

> > Userspace would write the following to this file:
> >  - nr_to_reclaim pages
>
> This makes sense, although (and you hinted at this below), I'm
> thinking it should be in bytes, especially if part of cgroupfs.
>
> >  - swappiness factor
>
> This I'm not sure about.
>
> Mostly because I'm not sure about swappiness in general. It balances
> between anon and file, but both of them are aged according to the same
> LRU rules. The only reason to prefer one over the other seems to be
> when the cost of reloading one (refault vs swapin) isn't the same as
> the other. That's usually a hardware property, which in a perfect
> world we'd auto-tune inside the kernel based on observed IO
> performance. Not sure why you'd want this per reclaim request.

The choice between anon and file pages is not only a hardware
property, but also a matter of policy decisions. It is useful to allow
the userspace policy daemon the flexibility to choose anon pages or
file pages or both to reclaim from, for the exact reasons that you
have described.  This is important for the use cases in Google (where
anon pages are the primary focus of proactive reclaim).

Maybe we can replace the swappiness factor with a page type mask to
more explicitly select which types of pages to reclaim.
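
As a strawman for what such a mask could look like (the names below are
made up purely for illustration):

  /* Strawman only: possible page type selection bits for a reclaim
   * request, in place of a swappiness value.  Names are illustrative.
   */
  enum reclaim_page_type {
      RECLAIM_ANON = 1 << 0,  /* reclaim anonymous pages */
      RECLAIM_FILE = 1 << 1,  /* reclaim file-backed pages */
      RECLAIM_ALL  = RECLAIM_ANON | RECLAIM_FILE,
  };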

> >  - flags to specify context, if any[**]
> >
> >  [**] this is offered for extensibility to specify the context in which
> >       reclaim is being done (clean file pages only, demotion for memory
> >       tiering vs eviction, etc), otherwise 0
>
> This one is curious. I don't understand the use cases for either of
> these examples, and I can't think of other flags a user may pass on a
> per-invocation basis. Would you care to elaborate some?

One example of a flag is to control whether the requested proactive
reclaim can induce I/O. This can be especially useful for memory
tiering to lower-cost memory devices, where issuing I/O would likely
not be preferred for proactively requested, reclaim-based demotion.

Wei



* Re: [RFC] Mechanism to induce memory reclaim
  2022-03-07 18:31   ` Shakeel Butt
  2022-03-07 20:26     ` Johannes Weiner
@ 2022-03-08 12:52     ` Michal Hocko
  2022-03-09 22:03       ` David Rientjes
  1 sibling, 1 reply; 24+ messages in thread
From: Michal Hocko @ 2022-03-08 12:52 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: David Rientjes, Andrew Morton, Johannes Weiner, Yu Zhao,
	Dave Hansen, linux-mm, Yosry Ahmed, Wei Xu, Greg Thelen

On Mon 07-03-22 18:31:41, Shakeel Butt wrote:
> On Mon, Mar 07, 2022 at 03:41:45PM +0100, Michal Hocko wrote:
> > On Sun 06-03-22 15:11:23, David Rientjes wrote:
> > [...]
> > > Some questions to get discussion going:
> > >
> > >  - Overall feedback or suggestions for the proposal in general?
> 
> > Do we really need this interface? What would be usecases which cannot
> > use an existing interfaces we have for that? Most notably memcg and
> > their high limit?
> 
> 
> Let me take a stab at this. The specific reasons why high limit is not a
> good interface to implement proactive reclaim:
> 
> 1) It can cause allocations from the target application to get
> throttled.
> 
> 2) It leaves a state (high limit) in the kernel which needs to be reset
> by the userspace part of proactive reclaimer.
> 
> If I remember correctly, Facebook actually tried to use high limit to
> implement the proactive reclaim but due to exactly these limitations [1]
> they went the route [2] aligned with this proposal.

I do remember we discussed this in the past. There were proposals
for an additional limit to trigger background reclaim [3] or to add a
pressure-based memcg knob [4]. For the nr_to_reclaim-based interface,
some challenges were outlined in that email thread. I do understand
that practical experience could have confirmed or diminished those
concerns.

I am definitely happy to restart those discussions, but it would be really
great to summarize the existing options and why they do not work in
practice. It would also be great to mention why the concerns about a
nr_to_reclaim-based interface expressed in the past no longer stand out
wrt. other proposals.

> To further explain why the above limitations are pretty bad: The
> proactive reclaimers usually use feedback loop to decide how much to
> squeeze from the target applications without impacting their performance
> or impacting within a tolerable range. The metrics used for the feedback
> loop are either refaults or PSI and these metrics becomes messy due to
> application getting throttled due to high limit.

One thing is not really clear to me here. You are saying that the
PSI/refaults are influenced by the throttling, IIUC. Does that mean that
your reclaimer is living outside of the controlled memcg? Or why does it
make any difference who is reclaiming the memory from the metrics
POV?  I do understand that you want to avoid throttling the regular
workload in that memcg, and this is where the high limit falls short, but
the work has to be done by somebody, right?
 
> For (2), the high limit interface is a very awkward interface to use to
> do proactive reclaim. If the userspace proactive reclaimer fails/crashed
> due to whatever reason during triggering the reclaim in an application,
> it can leave the application in a bad state (memory pressure state and
> throttled) for a long time.

Fair enough.

> [1] https://lore.kernel.org/all/20200928210216.GA378894@cmpxchg.org/
> [2] https://dl.acm.org/doi/10.1145/3503222.3507731 (Section 3.3)

[3] http://lkml.kernel.org/r/20200922190859.GH12990@dhcp22.suse.cz
    resp. http://lkml.kernel.org/r/20200219181219.54356-1-hannes@cmpxchg.org/
[4] http://lkml.kernel.org/r/20200928210216.GA378894@cmpxchg.org
-- 
Michal Hocko
SUSE Labs



* Re: [RFC] Mechanism to induce memory reclaim
  2022-03-07 20:26     ` Johannes Weiner
@ 2022-03-08 12:53       ` Michal Hocko
  2022-03-08 14:44         ` Dan Schatzberg
  0 siblings, 1 reply; 24+ messages in thread
From: Michal Hocko @ 2022-03-08 12:53 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Shakeel Butt, David Rientjes, Andrew Morton, Yu Zhao,
	Dave Hansen, linux-mm, Yosry Ahmed, Wei Xu, Greg Thelen

On Mon 07-03-22 15:26:18, Johannes Weiner wrote:
> On Mon, Mar 07, 2022 at 06:31:41PM +0000, Shakeel Butt wrote:
> > On Mon, Mar 07, 2022 at 03:41:45PM +0100, Michal Hocko wrote:
> > > On Sun 06-03-22 15:11:23, David Rientjes wrote:
> > > [...]
> > > > Some questions to get discussion going:
> > > >
> > > >  - Overall feedback or suggestions for the proposal in general?
> > 
> > > Do we really need this interface? What would be usecases which cannot
> > > use an existing interfaces we have for that? Most notably memcg and
> > > their high limit?
> > 
> > 
> > Let me take a stab at this. The specific reasons why high limit is not a
> > good interface to implement proactive reclaim:
> > 
> > 1) It can cause allocations from the target application to get
> > throttled.
> > 
> > 2) It leaves a state (high limit) in the kernel which needs to be reset
> > by the userspace part of proactive reclaimer.
> > 
> > If I remember correctly, Facebook actually tried to use high limit to
> > implement the proactive reclaim but due to exactly these limitations [1]
> > they went the route [2] aligned with this proposal.
> > 
> > To further explain why the above limitations are pretty bad: The
> > proactive reclaimers usually use feedback loop to decide how much to
> > squeeze from the target applications without impacting their performance
> > or impacting within a tolerable range. The metrics used for the feedback
> > loop are either refaults or PSI and these metrics becomes messy due to
> > application getting throttled due to high limit.
> > 
> > For (2), the high limit interface is a very awkward interface to use to
> > do proactive reclaim. If the userspace proactive reclaimer fails/crashed
> > due to whatever reason during triggering the reclaim in an application,
> > it can leave the application in a bad state (memory pressure state and
> > throttled) for a long time.
> 
> Yes.
> 
> In addition to the proactive reclaimer crashing, we also had problems
> of it simply not responding quickly enough.
> 
> Because there is a delay between reclaim (action) and refaults
> (feedback), there is a very real upper limit of pages you can
> reasonably reclaim per second, without risking pressure spikes that
> far exceed tolerances. A fixed memory.high limit can easily exceed
> that safe reclaim rate when the workload expands abruptly. Even if the
> proactive reclaimer process is alive, it's almost impossible to step
> between a rapidly allocating process and its cgroup limit in time.
> 
> The semantics of writing to memory.high also require that the new
> limit is met before returning to userspace. This can take a long time,
> during which the reclaimer cannot re-evaluate the optimal target size
> based on observed pressure. We routinely saw the reclaimer get stuck
> in the kernel hammering a suffering workload down to a stale target.
> 
> We tried for quite a while to make this work, but the limit semantics
> turned out to not be a good fit for proactive reclaim.

Thanks for sharing your experience, Johannes. This is a useful insight.

> A mechanism to request a fixed number of pages to reclaim turned out
> to work much, much better in practice. We've been using a simple
> per-cgroup knob (like here: https://lkml.org/lkml/2020/9/9/1094).

Could you share more details here please? How have you managed to find
the reclaim target, and how have you overcome the challenge of reacting
in time to leave some headroom for the actual reclaim?
 
> With tiered memory systems coming up, I can see the need for
> restricting to specific numa nodes. Demoting from DRAM to CXL has a
> different cost function than evicting RAM/CXL to storage, and those
> two things probably need to happen at different rates.

Yes, in the absence of per-node watermarks I can see how a per-node
reclaim trigger could be useful. The question is whether a per-node
wmark interface wouldn't be a better fit.

-- 
Michal Hocko
SUSE Labs



* Re: [RFC] Mechanism to induce memory reclaim
  2022-03-07 22:53   ` Wei Xu
@ 2022-03-08 12:53     ` Michal Hocko
  0 siblings, 0 replies; 24+ messages in thread
From: Michal Hocko @ 2022-03-08 12:53 UTC (permalink / raw)
  To: Wei Xu
  Cc: Johannes Weiner, David Rientjes, Andrew Morton, Yu Zhao,
	Dave Hansen, Linux MM, Yosry Ahmed, Shakeel Butt, Greg Thelen

On Mon 07-03-22 14:53:40, Wei Xu wrote:
[...]
> The choice between anon and file pages is not only a hardware
> property, but also a matter of policy decisions. It is useful to allow
> the userspace policy daemon the flexibility to choose anon pages or
> file pages or both to reclaim from, for the exact reasons that you
> have described.  This is important for the use cases in Google (where
> anon pages are the primary focus of proactive reclaim).
> 
> Maybe instead of the swappiness factor, we can replace this parameter
> with a page type mask to more explicitly select which types of pages
> to reclaim.

I am concerned this could lead to even more problems. Where do you draw
the line? Do you want to control slab reclaim or even shrinker-based
reclaim?
-- 
Michal Hocko
SUSE Labs



* Re: [RFC] Mechanism to induce memory reclaim
  2022-03-08 12:53       ` Michal Hocko
@ 2022-03-08 14:44         ` Dan Schatzberg
  2022-03-08 16:05           ` Michal Hocko
  0 siblings, 1 reply; 24+ messages in thread
From: Dan Schatzberg @ 2022-03-08 14:44 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Johannes Weiner, Shakeel Butt, David Rientjes, Andrew Morton,
	Yu Zhao, Dave Hansen, linux-mm, Yosry Ahmed, Wei Xu, Greg Thelen

On Tue, Mar 08, 2022 at 01:53:19PM +0100, Michal Hocko wrote:
> On Mon 07-03-22 15:26:18, Johannes Weiner wrote:
> > On Mon, Mar 07, 2022 at 06:31:41PM +0000, Shakeel Butt wrote:
> > > On Mon, Mar 07, 2022 at 03:41:45PM +0100, Michal Hocko wrote:
> > > > On Sun 06-03-22 15:11:23, David Rientjes wrote:
> > > > [...]
> > > > > Some questions to get discussion going:
> > > > >
> > > > >  - Overall feedback or suggestions for the proposal in general?
> > > 
> > > > Do we really need this interface? What would be usecases which cannot
> > > > use an existing interfaces we have for that? Most notably memcg and
> > > > their high limit?
> > > 
> > > 
> > > Let me take a stab at this. The specific reasons why high limit is not a
> > > good interface to implement proactive reclaim:
> > > 
> > > 1) It can cause allocations from the target application to get
> > > throttled.
> > > 
> > > 2) It leaves a state (high limit) in the kernel which needs to be reset
> > > by the userspace part of proactive reclaimer.
> > > 
> > > If I remember correctly, Facebook actually tried to use high limit to
> > > implement the proactive reclaim but due to exactly these limitations [1]
> > > they went the route [2] aligned with this proposal.
> > > 
> > > To further explain why the above limitations are pretty bad: The
> > > proactive reclaimers usually use feedback loop to decide how much to
> > > squeeze from the target applications without impacting their performance
> > > or impacting within a tolerable range. The metrics used for the feedback
> > > loop are either refaults or PSI and these metrics becomes messy due to
> > > application getting throttled due to high limit.
> > > 
> > > For (2), the high limit interface is a very awkward interface to use to
> > > do proactive reclaim. If the userspace proactive reclaimer fails/crashed
> > > due to whatever reason during triggering the reclaim in an application,
> > > it can leave the application in a bad state (memory pressure state and
> > > throttled) for a long time.
> > 
> > Yes.
> > 
> > In addition to the proactive reclaimer crashing, we also had problems
> > of it simply not responding quickly enough.
> > 
> > Because there is a delay between reclaim (action) and refaults
> > (feedback), there is a very real upper limit of pages you can
> > reasonably reclaim per second, without risking pressure spikes that
> > far exceed tolerances. A fixed memory.high limit can easily exceed
> > that safe reclaim rate when the workload expands abruptly. Even if the
> > proactive reclaimer process is alive, it's almost impossible to step
> > between a rapidly allocating process and its cgroup limit in time.
> > 
> > The semantics of writing to memory.high also require that the new
> > limit is met before returning to userspace. This can take a long time,
> > during which the reclaimer cannot re-evaluate the optimal target size
> > based on observed pressure. We routinely saw the reclaimer get stuck
> > in the kernel hammering a suffering workload down to a stale target.
> > 
> > We tried for quite a while to make this work, but the limit semantics
> > turned out to not be a good fit for proactive reclaim.
> 
> Thanks for sharing your experience, Johannes. This is a useful insight.

Just to add another issue with memory.high - there's a race window
between reading memory.current and setting memory.high if you want to
reclaim just a little bit of memory. On a fast expanding workload this
could result in reclaiming much more than intended.

> 
> > A mechanism to request a fixed number of pages to reclaim turned out
> > to work much, much better in practice. We've been using a simple
> > per-cgroup knob (like here: https://lkml.org/lkml/2020/9/9/1094).
> 
> Could you share more details here please? How have you managed to find
> the reclaim target and how have you overcome challenges to react in time
> to have some head room for the actual reclaim?

We have a userspace agent that just repeatedly triggers proactive
reclaim and monitors PSI metrics to maintain some constant but low
pressure. In the complete absence of pressure we will reclaim some
configurable percentage of the workload's memory. This reclaim amount
tapers down to zero as PSI approaches the target threshold.
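
(For illustration only, the control loop is conceptually something like
the sketch below.  The PSI threshold, taper math, cadence, and the
per-cgroup reclaim file it writes to are placeholder assumptions, not
our actual agent.)

  #include <stdio.h>
  #include <unistd.h>

  /* Read the "some" avg10 value from the memory PSI file. */
  static double read_psi_some_avg10(void)
  {
      double avg10 = 0.0;
      FILE *f = fopen("/proc/pressure/memory", "r");

      if (f) {
          /* first line: "some avg10=0.12 avg60=0.05 avg300=0.01 total=..." */
          fscanf(f, "some avg10=%lf", &avg10);
          fclose(f);
      }
      return avg10;
  }

  int main(void)
  {
      /* hypothetical per-cgroup reclaim knob, as discussed in this thread */
      const char *reclaim = "/sys/fs/cgroup/workload/memory.reclaim";
      const double psi_target = 1.0;        /* placeholder pressure target */
      const double max_fraction = 0.01;     /* placeholder: up to 1% per cycle */
      const long workload_bytes = 8L << 30; /* placeholder workload size */

      for (;;) {
          double psi = read_psi_some_avg10();
          /* full fraction at zero pressure, tapering to zero at the target */
          double scale = psi >= psi_target ? 0.0 : 1.0 - psi / psi_target;
          long bytes = (long)(workload_bytes * max_fraction * scale);

          if (bytes > 0) {
              FILE *f = fopen(reclaim, "w");

              if (f) {
                  fprintf(f, "%ld", bytes);
                  fclose(f);
              }
          }
          sleep(10); /* placeholder cadence */
      }
      return 0;
  }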

I don't follow your question regarding head-room. Could you elaborate?




* Re: [RFC] Mechanism to induce memory reclaim
  2022-03-07 20:50 ` Johannes Weiner
  2022-03-07 22:53   ` Wei Xu
@ 2022-03-08 14:49   ` Dan Schatzberg
  2022-03-08 19:27     ` Johannes Weiner
  2022-03-09 22:30   ` David Rientjes
  2 siblings, 1 reply; 24+ messages in thread
From: Dan Schatzberg @ 2022-03-08 14:49 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: David Rientjes, Andrew Morton, Michal Hocko, Yu Zhao,
	Dave Hansen, linux-mm, Yosry Ahmed, Wei Xu, Shakeel Butt,
	Greg Thelen

On Mon, Mar 07, 2022 at 03:50:36PM -0500, Johannes Weiner wrote:
> On Sun, Mar 06, 2022 at 03:11:23PM -0800, David Rientjes wrote:
> >  - swappiness factor
> 
> This I'm not sure about.
> 
> Mostly because I'm not sure about swappiness in general. It balances
> between anon and file, but both of them are aged according to the same
> LRU rules. The only reason to prefer one over the other seems to be
> when the cost of reloading one (refault vs swapin) isn't the same as
> the other. That's usually a hardware property, which in a perfect
> world we'd auto-tune inside the kernel based on observed IO
> performance. Not sure why you'd want this per reclaim request.

I think this could be useful for budgeting write endurance. You may
want to tune down a workload's swappiness on a per-reclaim basis in
order to control how much swap-out (and therefore how many disk writes)
it is doing. Right now the only way to control this is by writing to
vm.swappiness before doing the explicit reclaim, which can momentarily
affect other reclaim behavior on the machine.
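
The current workaround looks roughly like the sketch below (the sysctl
path is real, but the structure is only illustrative); any other reclaim
that runs in the window between the two writes sees the modified global
value:

  #include <stdio.h>

  /* Read the first line of a file into buf. */
  static void read_file(const char *path, char *buf, int len)
  {
      FILE *f = fopen(path, "r");

      if (f) {
          if (!fgets(buf, len, f))
              buf[0] = '\0';
          fclose(f);
      }
  }

  /* Write a string to a file. */
  static void write_file(const char *path, const char *val)
  {
      FILE *f = fopen(path, "w");

      if (f) {
          fputs(val, f);
          fclose(f);
      }
  }

  int main(void)
  {
      char saved[16] = "60\n";

      /* save the global value, then disable swap-out for our request */
      read_file("/proc/sys/vm/swappiness", saved, sizeof(saved));
      write_file("/proc/sys/vm/swappiness", "0");

      /* ... trigger the explicit reclaim here (hypothetical knob) ... */

      /* restore; any reclaim that ran in between saw swappiness == 0 */
      write_file("/proc/sys/vm/swappiness", saved);
      return 0;
  }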



* Re: [RFC] Mechanism to induce memory reclaim
  2022-03-08 14:44         ` Dan Schatzberg
@ 2022-03-08 16:05           ` Michal Hocko
  2022-03-08 17:21             ` Wei Xu
  2022-03-08 17:23             ` Johannes Weiner
  0 siblings, 2 replies; 24+ messages in thread
From: Michal Hocko @ 2022-03-08 16:05 UTC (permalink / raw)
  To: Dan Schatzberg
  Cc: Johannes Weiner, Shakeel Butt, David Rientjes, Andrew Morton,
	Yu Zhao, Dave Hansen, linux-mm, Yosry Ahmed, Wei Xu, Greg Thelen

On Tue 08-03-22 09:44:35, Dan Schatzberg wrote:
> On Tue, Mar 08, 2022 at 01:53:19PM +0100, Michal Hocko wrote:
> > On Mon 07-03-22 15:26:18, Johannes Weiner wrote:
[...]
> > > A mechanism to request a fixed number of pages to reclaim turned out
> > > to work much, much better in practice. We've been using a simple
> > > per-cgroup knob (like here: https://lkml.org/lkml/2020/9/9/1094).
> > 
> > Could you share more details here please? How have you managed to find
> > the reclaim target and how have you overcome challenges to react in time
> > to have some head room for the actual reclaim?
> 
> We have a userspace agent that just repeatedly triggers proactive
> reclaim and monitors PSI metrics to maintain some constant but low
> pressure. In the complete absense of pressure we will reclaim some
> configurable percentage of the workload's memory. This reclaim amount
> tapers down to zero as PSI approaches the target threshold.
> 
> I don't follow your question regarding head-room. Could you elaborate?

One of the concerns that was expressed in the past is how effectively
a proactive userspace reclaimer can act on memory demand transitions. It
takes some time to get refaults/PSI changes and then you should
be acting rather swiftly, at least if you aim at a somewhat smooth
transition. Tuning this up to work reliably seems to be far
from trivial. Not to mention that changes in the memory reclaim
implementation could make the whole tuning rather fragile.
-- 
Michal Hocko
SUSE Labs



* Re: [RFC] Mechanism to induce memory reclaim
  2022-03-08 16:05           ` Michal Hocko
@ 2022-03-08 17:21             ` Wei Xu
  2022-03-08 17:23             ` Johannes Weiner
  1 sibling, 0 replies; 24+ messages in thread
From: Wei Xu @ 2022-03-08 17:21 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Dan Schatzberg, Johannes Weiner, Shakeel Butt, David Rientjes,
	Andrew Morton, Yu Zhao, Dave Hansen, Linux MM, Yosry Ahmed,
	Greg Thelen

On Tue, Mar 8, 2022 at 8:05 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Tue 08-03-22 09:44:35, Dan Schatzberg wrote:
> > On Tue, Mar 08, 2022 at 01:53:19PM +0100, Michal Hocko wrote:
> > > On Mon 07-03-22 15:26:18, Johannes Weiner wrote:
> [...]
> > > > A mechanism to request a fixed number of pages to reclaim turned out
> > > > to work much, much better in practice. We've been using a simple
> > > > per-cgroup knob (like here: https://lkml.org/lkml/2020/9/9/1094).
> > >
> > > Could you share more details here please? How have you managed to find
> > > the reclaim target and how have you overcome challenges to react in time
> > > to have some head room for the actual reclaim?
> >
> > We have a userspace agent that just repeatedly triggers proactive
> > reclaim and monitors PSI metrics to maintain some constant but low
> > pressure. In the complete absense of pressure we will reclaim some
> > configurable percentage of the workload's memory. This reclaim amount
> > tapers down to zero as PSI approaches the target threshold.
> >
> > I don't follow your question regarding head-room. Could you elaborate?
>
> One of the concern that was expressed in the past is how effectively
> can pro-active userspace reclaimer act on memory demand transitions. It
> takes some time to get refaults/PSI changes and then you should
> be acting rather swiftly. At least if you aim at somehow smooth
> transition. Tuning this up to work reliably seems to be far
> from trivial. Not to mention that changes in the memory reclaim
> implementation could make the whole tuning rather fragile.

The userspace reclaimer is not a complete replacement for kernel
memory reclaim (kswapd or direct reclaim). At least in Google's use
cases, its purpose is to proactively identify memory savings
opportunities and reclaim some amount of cold pages, set by the policy,
to free up memory for more demanding jobs or for scheduling new jobs.
If a job (container) has a rapid increase in memory demand, it would
just mean less proactive savings from this job.  The userspace reclaimer
doesn't have to act much more swiftly for such jobs with the proposed
nr_bytes_to_reclaim interface.  If the userspace reclaim interface were
memory.high-based, then such jobs would indeed be a serious problem.



* Re: [RFC] Mechanism to induce memory reclaim
  2022-03-08 16:05           ` Michal Hocko
  2022-03-08 17:21             ` Wei Xu
@ 2022-03-08 17:23             ` Johannes Weiner
  1 sibling, 0 replies; 24+ messages in thread
From: Johannes Weiner @ 2022-03-08 17:23 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Dan Schatzberg, Shakeel Butt, David Rientjes, Andrew Morton,
	Yu Zhao, Dave Hansen, linux-mm, Yosry Ahmed, Wei Xu, Greg Thelen

On Tue, Mar 08, 2022 at 05:05:11PM +0100, Michal Hocko wrote:
> On Tue 08-03-22 09:44:35, Dan Schatzberg wrote:
> > On Tue, Mar 08, 2022 at 01:53:19PM +0100, Michal Hocko wrote:
> > > On Mon 07-03-22 15:26:18, Johannes Weiner wrote:
> [...]
> > > > A mechanism to request a fixed number of pages to reclaim turned out
> > > > to work much, much better in practice. We've been using a simple
> > > > per-cgroup knob (like here: https://lkml.org/lkml/2020/9/9/1094).
> > > 
> > > Could you share more details here please? How have you managed to find
> > > the reclaim target and how have you overcome challenges to react in time
> > > to have some head room for the actual reclaim?
> > 
> > We have a userspace agent that just repeatedly triggers proactive
> > reclaim and monitors PSI metrics to maintain some constant but low
> > pressure. In the complete absense of pressure we will reclaim some
> > configurable percentage of the workload's memory. This reclaim amount
> > tapers down to zero as PSI approaches the target threshold.
> > 
> > I don't follow your question regarding head-room. Could you elaborate?
> 
> One of the concern that was expressed in the past is how effectively
> can pro-active userspace reclaimer act on memory demand transitions. It
> takes some time to get refaults/PSI changes and then you should
> be acting rather swiftly.

This was a concern with the fixed limit, but not so much with the
one-off requests for reclaim. There is nothing in the way that would
prevent the workload from quickly allocating all the memory it
needs. The goal of proactive reclaim isn't to punish or restrict the
workload, but rather to continuously probe it for cold pages, to
measure the minimum amount of memory it requires to run healthily.

> At least if you aim at somehow smooth transition. Tuning this up to
> work reliably seems to be far from trivial. Not to mention that
> changes in the memory reclaim implementation could make the whole
> tuning rather fragile.

When reclaim becomes worse at finding the coldest memory, pressure
rises with fewer pages evicted and we back off earlier. So a reclaim
regression doesn't necessarily translate to less smooth operations or
increased workload impact, but rather to an increased memory
footprint. This may be measurable, but isn't really an operational
emergency - unless reclaim gets 50% worse, which isn't very likely, and
in which case we'd stop the kernel upgrade until the bug is fixed ;)

It's pretty robust. The tuning was done empirically, but now the same
configuration has held up to many different services; some with swap,
some with zswap, some with just cache, different types of SSDs,
different kernel versions, even drastic reclaim changes such as
Joonsoo's workingset for anon pages change.



* Re: [RFC] Mechanism to induce memory reclaim
  2022-03-08 14:49   ` Dan Schatzberg
@ 2022-03-08 19:27     ` Johannes Weiner
  2022-03-08 22:37       ` Dan Schatzberg
  0 siblings, 1 reply; 24+ messages in thread
From: Johannes Weiner @ 2022-03-08 19:27 UTC (permalink / raw)
  To: Dan Schatzberg
  Cc: David Rientjes, Andrew Morton, Michal Hocko, Yu Zhao,
	Dave Hansen, linux-mm, Yosry Ahmed, Wei Xu, Shakeel Butt,
	Greg Thelen

On Tue, Mar 08, 2022 at 09:49:20AM -0500, Dan Schatzberg wrote:
> On Mon, Mar 07, 2022 at 03:50:36PM -0500, Johannes Weiner wrote:
> > On Sun, Mar 06, 2022 at 03:11:23PM -0800, David Rientjes wrote:
> > >  - swappiness factor
> > 
> > This I'm not sure about.
> > 
> > Mostly because I'm not sure about swappiness in general. It balances
> > between anon and file, but both of them are aged according to the same
> > LRU rules. The only reason to prefer one over the other seems to be
> > when the cost of reloading one (refault vs swapin) isn't the same as
> > the other. That's usually a hardware property, which in a perfect
> > world we'd auto-tune inside the kernel based on observed IO
> > performance. Not sure why you'd want this per reclaim request.
> 
> I think this could be useful for budgeting write-endurance. You may
> want to tune down a workload's swappiness on a per-reclaim basis in
> order to control how much swap-out (and therefore disk writes) its
> doing. Right now the only way to control this is by writing to
> vm.swappiness before doing the explicit reclaim which can momentarily
> effect other reclaim behavior on the machine.

Yeah the global swappiness setting is not ideal for tuning behavior of
individual workloads. On the other hand, flash life and write budget
are global resources shared by all workloads on the system. Does it
make sense longer term to take a workload-centric approach to that?

There are also filesystem writes to think about. If the swappable set
has already been swapped and cached, reclaiming it again doesn't
require IO. Reclaiming dirty cache OTOH requires IO, and upping
reclaim pressure on files will increase the writeback flush rates
(which reduces cache effectiveness and increases aggregate writes).

I wonder if it would make more sense to recognize the concept of write
endurance more broadly in MM code than just swap. Where you specify a
rate limit (globally? with per-cgroup shares?), and then, yes, the VM
will back away from swap iff it writes too much. But also throttle
writeback and push back on file reclaim and dirtying processes in
accordance with that policy.



* Re: [RFC] Mechanism to induce memory reclaim
  2022-03-08 19:27     ` Johannes Weiner
@ 2022-03-08 22:37       ` Dan Schatzberg
  0 siblings, 0 replies; 24+ messages in thread
From: Dan Schatzberg @ 2022-03-08 22:37 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: David Rientjes, Andrew Morton, Michal Hocko, Yu Zhao,
	Dave Hansen, linux-mm, Yosry Ahmed, Wei Xu, Shakeel Butt,
	Greg Thelen

On Tue, Mar 08, 2022 at 02:27:49PM -0500, Johannes Weiner wrote:
> On Tue, Mar 08, 2022 at 09:49:20AM -0500, Dan Schatzberg wrote:
> > On Mon, Mar 07, 2022 at 03:50:36PM -0500, Johannes Weiner wrote:
> > > On Sun, Mar 06, 2022 at 03:11:23PM -0800, David Rientjes wrote:
> > > >  - swappiness factor
> > > 
> > > This I'm not sure about.
> > > 
> > > Mostly because I'm not sure about swappiness in general. It balances
> > > between anon and file, but both of them are aged according to the same
> > > LRU rules. The only reason to prefer one over the other seems to be
> > > when the cost of reloading one (refault vs swapin) isn't the same as
> > > the other. That's usually a hardware property, which in a perfect
> > > world we'd auto-tune inside the kernel based on observed IO
> > > performance. Not sure why you'd want this per reclaim request.
> > 
> > I think this could be useful for budgeting write-endurance. You may
> > want to tune down a workload's swappiness on a per-reclaim basis in
> > order to control how much swap-out (and therefore disk writes) its
> > doing. Right now the only way to control this is by writing to
> > vm.swappiness before doing the explicit reclaim which can momentarily
> > effect other reclaim behavior on the machine.
> 
> Yeah the global swappiness setting is not ideal for tuning behavior of
> individual workloads. On the other hand, flash life and write budget
> are global resources shared by all workloads on the system. Does it
> make sense longer term to take a workload-centric approach to that?

Indeed flash life is a global resource, but it may be desirable to
budget it on a per-workload basis. Consider a workload with a lot of
warm anonymous memory - proactive reclaim of this workload may be able
to consume the entire write budget of the machine. This could result
in a co-located workload getting reduced reclaim due to insufficient
write budget. We'd like some form of isolation here so that the
co-located workload receives its fair share of the write budget, which
is hard to do without some additional control.

> There are also filesystem writes to think about. If the swappable set
> has already been swapped and cached, reclaiming it again doesn't
> require IO. Reclaiming dirty cache OTOH requires IO, and upping
> reclaim pressure on files will increase the writeback flush rates
> (which reduces cache effectiveness and increases aggregate writes).
> 
> I wonder if it would make more sense to recognize the concept of write
> endurance more broadly in MM code than just swap. Where you specify a
> rate limit (globally? with per-cgroup shares?), and then, yes, the VM
> will back away from swap iff it writes too much. But also throttle
> writeback and push back on file reclaim and dirtying processes in
> accordance with that policy.

Absolutely. We should discuss details, but broadly I agree with the idea
that there is more than just per-cgroup swappiness control as a way to
gain control over mm-induced write-endurance consumption.



* Re: [RFC] Mechanism to induce memory reclaim
  2022-03-08 12:52     ` Michal Hocko
@ 2022-03-09 22:03       ` David Rientjes
  2022-03-10 16:58         ` Johannes Weiner
  0 siblings, 1 reply; 24+ messages in thread
From: David Rientjes @ 2022-03-09 22:03 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Shakeel Butt, Andrew Morton, Johannes Weiner, Yu Zhao,
	Dave Hansen, linux-mm, Yosry Ahmed, Wei Xu, Greg Thelen

On Tue, 8 Mar 2022, Michal Hocko wrote:

> > Let me take a stab at this. The specific reasons why high limit is not a
> > good interface to implement proactive reclaim:
> > 
> > 1) It can cause allocations from the target application to get
> > throttled.
> > 
> > 2) It leaves a state (high limit) in the kernel which needs to be reset
> > by the userspace part of proactive reclaimer.
> > 
> > If I remember correctly, Facebook actually tried to use high limit to
> > implement the proactive reclaim but due to exactly these limitations [1]
> > they went the route [2] aligned with this proposal.
> 
> I do remember we have discussed this in the past. There were proposals
> for an additional limit to trigger a background reclaim [3] or to add a
> pressure based memcg knob [4]. For the nr_to_reclaim based interface
> there were some challenges outlined in that email thread. I do
> understand that practical experience could have confirmed or diminished
> those concerns.
> 
> I am definitely happy to restart those discussion but it would be really
> great to summarize existing options and why they do not work in
> practice. It would be also great to mention why concerns about nr_to_reclaim
> based interface expressed in the past are not standing out anymore wrt.
> other proposals.
> 

Johannes, since you had pointed out that the current approach used at Meta 
and described in the TMO paper works well in practice and is based on 
prior discussions of memory.reclaim[1], do you have any lingering concerns 
from that 2020 thread?

My first email in this thread proposes something that can still do memcg 
based reclaim but is also possible even without CONFIG_MEMCG enabled.  
That's particularly helpful for configs used by customers that don't use 
memcg, namely Chrome OS.  I assume we're not losing any functionality that 
your use case depends on if we are to introduce a per-node sysfs mechanism 
for this as an alternative since you can still specify a memcg id?

[1] https://lkml.org/lkml/2020/9/9/1094



* Re: [RFC] Mechanism to induce memory reclaim
  2022-03-07 20:50 ` Johannes Weiner
  2022-03-07 22:53   ` Wei Xu
  2022-03-08 14:49   ` Dan Schatzberg
@ 2022-03-09 22:30   ` David Rientjes
  2022-03-10 16:10     ` Johannes Weiner
  2 siblings, 1 reply; 24+ messages in thread
From: David Rientjes @ 2022-03-09 22:30 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Michal Hocko, Yu Zhao, Dave Hansen, linux-mm,
	Yosry Ahmed, Wei Xu, Shakeel Butt, Greg Thelen

On Mon, 7 Mar 2022, Johannes Weiner wrote:

> > IOW, this would be a /sys/devices/system/node/nodeN/reclaim mechanim for
> > each NUMA node N on the system.  (It would be similar to the existing
> > per-node sysfs "compact" mechanism used to trigger compaction from
> > userspace.)
> 
> I generally think a proactive reclaim interface is a good idea.
> 
> A per-cgroup control knob would make more sense to me, as cgroupfs
> takes care of delegation, namespacing etc. and so would permit
> self-directed proactive reclaim inside containers.
> 

This is an interesting point and something that would need to be decided.
There are pros and cons to both approaches, a per-cgroup mechanism vs purely
a per-node sysfs mechanism that can take a cgroup id.

The reason we'd like this in sysfs is because of users who do not enable
CONFIG_MEMCG but would still benefit from proactive reclaim.  Such users
do exist and do not rely on memcg, Chrome OS for example, where from my
understanding proactive reclaim is normally done to speed up hibernation.

But I note your use of "per-cgroup" control knob and not specifically 
"per-memcg".  Were you considering a proactive reclaim mechanism for a 
cgroup other than memcg?  A new one?

I'm wondering if it would make sense for such a cgroup interface, if 
eventually needed, to be added incrementally on top of a per-node sysfs 
interface.  (We know today that there is a need for proactive reclaim for 
users who do not use memcg at all.)

> > Userspace would write the following to this file:
> >  - nr_to_reclaim pages
> 
> This makes sense, although (and you hinted at this below), I'm
> thinking it should be in bytes, especially if part of cgroupfs.
> 

If we agree upon a sysfs interface I assume there would be no objection to 
this in nr_to_reclaim pages?  I agree if this is to be a memcg knob that 
it should be expressed in bytes for consistency with other knobs.

> >  - swappiness factor
> 
> This I'm not sure about.
> 
> Mostly because I'm not sure about swappiness in general. It balances
> between anon and file, but both of them are aged according to the same
> LRU rules. The only reason to prefer one over the other seems to be
> when the cost of reloading one (refault vs swapin) isn't the same as
> the other. That's usually a hardware property, which in a perfect
> world we'd auto-tune inside the kernel based on observed IO
> performance. Not sure why you'd want this per reclaim request.
> 
> >  - flags to specify context, if any[**]
> >  
> >  [**] this is offered for extensibility to specify the context in which
> >       reclaim is being done (clean file pages only, demotion for memory
> >       tiering vs eviction, etc), otherwise 0
> 
> This one is curious. I don't understand the use cases for either of
> these examples, and I can't think of other flags a user may pass on a
> per-invocation basis. Would you care to elaborate some?
> 

If we combine the above two concerns, maybe only a flags argument is 
sufficient where you can specify only anon or only file (and neither means 
both)?  What is controllable by swappiness could be controlled by two 
different writes to the interface, one for (possibly) anon and one for 
(possibly) file.

There was discussion about treating the two different types of memory 
differently as a function of reload cost, cost of doing I/O for discard, 
and how much swap space we want proactive reclaim to take; the only 
current alternative is to play with the global vm.swappiness.

Michal asked if this would include slab reclaim or shrinkers; I think the 
answer is "possibly yes," but there is no initial use case for this (flags 
would be extensible enough to permit adding it incrementally).  In fact, if 
you were to pass a cgroup id of 0 to induce global proactive reclaim, you 
could mimic the same control we have with vm.drop_caches today, but without 
reclaiming all of a memory type.
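
To make the "two writes" idea concrete, here is a purely hypothetical
sketch; neither the per-node file nor the token format exists anywhere
upstream, and the FILE_ONLY/ANON_ONLY flag values are made up for
illustration only:

  #include <fcntl.h>
  #include <stdio.h>
  #include <unistd.h>

  /* Made-up flag values, only to illustrate the idea. */
  #define RECLAIM_FILE_ONLY 0x1
  #define RECLAIM_ANON_ONLY 0x2

  int main(void)
  {
      /* Hypothetical file: /sys/devices/system/node/nodeN/reclaim. */
      int fd = open("/sys/devices/system/node/node0/reclaim", O_WRONLY);

      if (fd < 0) {
          perror("open");
          return 1;
      }
      /* Assumed "nr_to_reclaim memcg_id flags" format; memcg_id 0 == global. */
      dprintf(fd, "%d %d %d\n", 131072, 0, RECLAIM_FILE_ONLY);
      dprintf(fd, "%d %d %d\n", 131072, 0, RECLAIM_ANON_ONLY);
      close(fd);
      return 0;
  }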


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC] Mechanism to induce memory reclaim
  2022-03-09 22:30   ` David Rientjes
@ 2022-03-10 16:10     ` Johannes Weiner
  0 siblings, 0 replies; 24+ messages in thread
From: Johannes Weiner @ 2022-03-10 16:10 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Michal Hocko, Yu Zhao, Dave Hansen, linux-mm,
	Yosry Ahmed, Wei Xu, Shakeel Butt, Greg Thelen

On Wed, Mar 09, 2022 at 02:30:24PM -0800, David Rientjes wrote:
> On Mon, 7 Mar 2022, Johannes Weiner wrote:
> 
> > > IOW, this would be a /sys/devices/system/node/nodeN/reclaim mechanism for
> > > each NUMA node N on the system.  (It would be similar to the existing
> > > per-node sysfs "compact" mechanism used to trigger compaction from
> > > userspace.)
> > 
> > I generally think a proactive reclaim interface is a good idea.
> > 
> > A per-cgroup control knob would make more sense to me, as cgroupfs
> > takes care of delegation, namespacing etc. and so would permit
> > self-directed proactive reclaim inside containers.
> > 
> 
> This is an interesting point and something that would need to be decided.  
> There are pros and cons to both approaches: a per-cgroup mechanism vs purely a 
> per-node sysfs mechanism that can take a cgroup id.

I think we can just add both and avoid the cgroupid quirk.

We've done this many times: psi has global and cgroupfs interfaces, so
does vmstat, so does (did) swappiness etc. I don't see a problem with
adding a system and a cgroup interface for this.

> The reason we'd like this in sysfs is because of users who do not enable 
> CONFIG_MEMCG but would still benefit from proactive reclaim.  Such users 
> do exist and do not rely on memcg, such as Chrome OS, and from my 
> understanding this is normally done to speed up hibernation.

Yes, that makes sense.

> But I note your use of "per-cgroup" control knob and not specifically 
> "per-memcg".  Were you considering a proactive reclaim mechanism for a 
> cgroup other than memcg?  A new one?

No subtle nuance intended, I'm just using them interchangeably with
cgroup2. I meant: a cgroup that has the memory controller enabled :)

> I'm wondering if it would make sense for such a cgroup interface, if 
> eventually needed, to be added incrementally on top of a per-node sysfs 
> interface.  (We know today that there is a need for proactive reclaim for 
> users who do not use memcg at all.)

We've already had delegated deployments as well. Both uses are real.

But again, I don't think we have to choose at all. Let's add both!

> > > Userspace would write the following to this file:
> > >  - nr_to_reclaim pages
> > 
> > This makes sense, although (and you hinted at this below), I'm
> > thinking it should be in bytes, especially if part of cgroupfs.
> > 
> 
> If we agree upon a sysfs interface I assume there would be no objection to 
> this in nr_to_reclaim pages?  I agree if this is to be a memcg knob that 
> it should be expressed in bytes for consistency with other knobs.

Pages in general are somewhat fraught as a unit for facing
userspace. It requires people to use _SC_PAGESIZE, but they don't:

https://twitter.com/marcan42/status/1498710903675842563

Is there an argument *for* using pages?
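
For illustration, this is the conversion a careful user of a page-based
interface would have to do; the failure mode being pointed at above is
hardcoding 4096 instead of asking the kernel:

  #include <stdio.h>
  #include <unistd.h>

  int main(void)
  {
      long page_size = sysconf(_SC_PAGESIZE);  /* 4k, 16k, 64k, ... */
      long long want_bytes = 1LL << 30;        /* reclaim target: 1G */

      /* Correct: derive the page count from the runtime page size. */
      printf("%lld\n", want_bytes / page_size);
      return 0;
  }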

> > >  - swappiness factor
> > 
> > This I'm not sure about.
> > 
> > Mostly because I'm not sure about swappiness in general. It balances
> > between anon and file, but both of them are aged according to the same
> > LRU rules. The only reason to prefer one over the other seems to be
> > when the cost of reloading one (refault vs swapin) isn't the same as
> > the other. That's usually a hardware property, which in a perfect
> > world we'd auto-tune inside the kernel based on observed IO
> > performance. Not sure why you'd want this per reclaim request.
> > 
> > >  - flags to specify context, if any[**]
> > >  
> > >  [**] this is offered for extensibility to specify the context in which
> > >       reclaim is being done (clean file pages only, demotion for memory
> > >       tiering vs eviction, etc), otherwise 0
> > 
> > This one is curious. I don't understand the use cases for either of
> > these examples, and I can't think of other flags a user may pass on a
> > per-invocation basis. Would you care to elaborate some?
> > 
> 
> If we combine the above two concerns, maybe only a flags argument is 
> sufficient where you can specify only anon or only file (and neither means 
> both)?  What is controllable by swappiness could be controlled by two 
> different writes to the interface, one for (possibly) anon and one for 
> (possibly) file.
> 
> There was discussion about treating the two different types of memory 
> differently as a function of reload cost, cost of doing I/O for discard, 
> and how much swap space we want proactive reclaim to take; the only 
> current alternative is to play with the global vm.swappiness.
> 
> Michal asked if this would include slab reclaim or shrinkers; I think the 
> answer is "possibly yes," but there is no initial use case for this (flags 
> would be extensible enough to permit adding it incrementally).  In fact, if 
> you were to pass a cgroup id of 0 to induce global proactive reclaim, you 
> could mimic the same control we have with vm.drop_caches today, but without 
> reclaiming all of a memory type.

Ok, I think I see.

My impression is that this is mechanism that optimally the kernel's
reclaim algorithm should provide, rather than (just) application/setup
dependent policy preferences.

The cost of reload for example. Yes, it needs to be balanced between
anon and file. But is there a target to aim for besides lowest
aggregate paging overhead for the application?

How much swap space to use is a good point too, but we already have an
expression of intended per-cgroup share from the user:
memory.swap.high and memory.swap.max. Shouldn't reclaim in general
back off gradually from swap as utilization approaches 100%? Is
proactive reclaim different from conventional reclaim in this regard?

The write endurance question is similar. Policy would be to express a
global budget and per-cgroup shares of that budget; mechanism would be
to have this inform reclaim and writeback behavior.

My question would be why the mechanism *shouldn't* live in the
kernel. And then allow userspace to configure it in a way in which
most people actually understand: flash write budgets, swap space
allowances etc.

The interface proposed here strikes me as rather low-level. It's not so
much a conventional user interface as building blocks for implementing
parts of the reclaim algorithm in userspace.

I'm not necessarily against that. It's just unusual and IMO deserves
some more discussion. I want to make sure that if there are
shortcomings in the kernel we address them rather than work around them.


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC] Mechanism to induce memory reclaim
  2022-03-09 22:03       ` David Rientjes
@ 2022-03-10 16:58         ` Johannes Weiner
  2022-03-10 17:25           ` Shakeel Butt
  2022-03-10 17:33           ` Wei Xu
  0 siblings, 2 replies; 24+ messages in thread
From: Johannes Weiner @ 2022-03-10 16:58 UTC (permalink / raw)
  To: David Rientjes
  Cc: Michal Hocko, Shakeel Butt, Andrew Morton, Yu Zhao, Dave Hansen,
	linux-mm, Yosry Ahmed, Wei Xu, Greg Thelen

On Wed, Mar 09, 2022 at 02:03:21PM -0800, David Rientjes wrote:
> On Tue, 8 Mar 2022, Michal Hocko wrote:
> 
> > > Let me take a stab at this. The specific reasons why high limit is not a
> > > good interface to implement proactive reclaim:
> > > 
> > > 1) It can cause allocations from the target application to get
> > > throttled.
> > > 
> > > 2) It leaves a state (high limit) in the kernel which needs to be reset
> > > by the userspace part of proactive reclaimer.
> > > 
> > > If I remember correctly, Facebook actually tried to use high limit to
> > > implement the proactive reclaim but due to exactly these limitations [1]
> > > they went the route [2] aligned with this proposal.
> > 
> > I do remember we have discussed this in the past. There were proposals
> > for an additional limit to trigger a background reclaim [3] or to add a
> > pressure based memcg knob [4]. For the nr_to_reclaim based interface
> > there were some challenges outlined in that email thread. I do
> > understand that practical experience could have confirmed or diminished
> > those concerns.
> > 
> > I am definitely happy to restart those discussion but it would be really
> > great to summarize existing options and why they do not work in
> > practice. It would be also great to mention why concerns about nr_to_reclaim
> > based interface expressed in the past are not standing out anymore wrt.
> > other proposals.
> > 
> 
> Johannes, since you had pointed out that the current approach used at Meta 
> and described in the TMO paper works well in practice and is based on 
> prior discussions of memory.reclaim[1], do you have any lingering concerns 
> from that 2020 thread?

I'd be okay with merging the interface proposed in that thread as-is.
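
For readers without that thread handy: the proposal there is a write-only
memory.reclaim file in cgroupfs that takes the number of bytes to reclaim
from the hierarchy.  A minimal usage sketch, with the cgroup path being an
example and the exact accepted format subject to whatever finally lands:

  #include <fcntl.h>
  #include <stdio.h>
  #include <unistd.h>

  int main(void)
  {
      /* Example path; assumes a delegated "workload" cgroup. */
      int fd = open("/sys/fs/cgroup/workload/memory.reclaim", O_WRONLY);

      if (fd < 0) {
          perror("open");
          return 1;
      }
      /* Ask the kernel to proactively reclaim ~100M from this hierarchy. */
      if (dprintf(fd, "%d\n", 100 << 20) < 0)
          perror("write");
      close(fd);
      return 0;
  }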

> My first email in this thread proposes something that can still do memcg 
> based reclaim but is also possible even without CONFIG_MEMCG enabled.  
> That's particularly helpful for configs used by customers that don't use 
> memcg, namely Chrome OS.  I assume we're not losing any functionality that 
> your use case depends on if we are to introduce a per-node sysfs mechanism 
> for this as an alternative since you can still specify a memcg id?

We'd lose the delegation functionality with this proposal.

But per the other thread, I wouldn't be opposed to adding a global
per-node interface in addition to the cgroupfs one.


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC] Mechanism to induce memory reclaim
  2022-03-10 16:58         ` Johannes Weiner
@ 2022-03-10 17:25           ` Shakeel Butt
  2022-03-10 17:33           ` Wei Xu
  1 sibling, 0 replies; 24+ messages in thread
From: Shakeel Butt @ 2022-03-10 17:25 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: David Rientjes, Michal Hocko, Andrew Morton, Yu Zhao,
	Dave Hansen, Linux MM, Yosry Ahmed, Wei Xu, Greg Thelen

On Thu, Mar 10, 2022 at 8:58 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Wed, Mar 09, 2022 at 02:03:21PM -0800, David Rientjes wrote:
> > On Tue, 8 Mar 2022, Michal Hocko wrote:
> >
> > > > Let me take a stab at this. The specific reasons why high limit is not a
> > > > good interface to implement proactive reclaim:
> > > >
> > > > 1) It can cause allocations from the target application to get
> > > > throttled.
> > > >
> > > > 2) It leaves a state (high limit) in the kernel which needs to be reset
> > > > by the userspace part of proactive reclaimer.
> > > >
> > > > If I remember correctly, Facebook actually tried to use high limit to
> > > > implement the proactive reclaim but due to exactly these limitations [1]
> > > > they went the route [2] aligned with this proposal.
> > >
> > > I do remember we have discussed this in the past. There were proposals
> > > for an additional limit to trigger a background reclaim [3] or to add a
> > > pressure based memcg knob [4]. For the nr_to_reclaim based interface
> > > there were some challenges outlined in that email thread. I do
> > > understand that practical experience could have confirmed or diminished
> > > those concerns.
> > >
> > > I am definitely happy to restart those discussion but it would be really
> > > great to summarize existing options and why they do not work in
> > > practice. It would be also great to mention why concerns about nr_to_reclaim
> > > based interface expressed in the past are not standing out anymore wrt.
> > > other proposals.
> > >
> >
> > Johannes, since you had pointed out that the current approach used at Meta
> > and described in the TMO paper works well in practice and is based on
> > prior discussions of memory.reclaim[1], do you have any lingering concerns
> > from that 2020 thread?
>
> I'd be okay with merging the interface proposed in that thread as-is.
>

Thanks, I will revise the commit message of that patch and send it out
again. I will also try to address Michal's concerns.

Shakeel


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC] Mechanism to induce memory reclaim
  2022-03-10 16:58         ` Johannes Weiner
  2022-03-10 17:25           ` Shakeel Butt
@ 2022-03-10 17:33           ` Wei Xu
  2022-03-10 17:42             ` Johannes Weiner
  1 sibling, 1 reply; 24+ messages in thread
From: Wei Xu @ 2022-03-10 17:33 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: David Rientjes, Michal Hocko, Shakeel Butt, Andrew Morton,
	Yu Zhao, Dave Hansen, Linux MM, Yosry Ahmed, Greg Thelen

On Thu, Mar 10, 2022 at 8:58 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Wed, Mar 09, 2022 at 02:03:21PM -0800, David Rientjes wrote:
> > On Tue, 8 Mar 2022, Michal Hocko wrote:
> >
> > > > Let me take a stab at this. The specific reasons why high limit is not a
> > > > good interface to implement proactive reclaim:
> > > >
> > > > 1) It can cause allocations from the target application to get
> > > > throttled.
> > > >
> > > > 2) It leaves a state (high limit) in the kernel which needs to be reset
> > > > by the userspace part of proactive reclaimer.
> > > >
> > > > If I remember correctly, Facebook actually tried to use high limit to
> > > > implement the proactive reclaim but due to exactly these limitations [1]
> > > > they went the route [2] aligned with this proposal.
> > >
> > > I do remember we have discussed this in the past. There were proposals
> > > for an additional limit to trigger a background reclaim [3] or to add a
> > > pressure based memcg knob [4]. For the nr_to_reclaim based interface
> > > there were some challenges outlined in that email thread. I do
> > > understand that practical experience could have confirmed or diminished
> > > those concerns.
> > >
> > > I am definitely happy to restart those discussion but it would be really
> > > great to summarize existing options and why they do not work in
> > > practice. It would be also great to mention why concerns about nr_to_reclaim
> > > based interface expressed in the past are not standing out anymore wrt.
> > > other proposals.
> > >
> >
> > Johannes, since you had pointed out that the current approach used at Meta
> > and described in the TMO paper works well in practice and is based on
> > prior discussions of memory.reclaim[1], do you have any lingering concerns
> > from that 2020 thread?
>
> I'd be okay with merging the interface proposed in that thread as-is.

We will need a nodemask argument for the memory tiering use case. We
can add it as an optional argument to memory.reclaim later.  Or do you
think we should add a different interface (e.g. memory.demote) for
memory tiering instead?

> > My first email in this thread proposes something that can still do memcg
> > based reclaim but is also possible even without CONFIG_MEMCG enabled.
> > That's particularly helpful for configs used by customers that don't use
> > memcg, namely Chrome OS.  I assume we're not losing any functionality that
> > your use case depends on if we are to introduce a per-node sysfs mechanism
> > for this as an alternative since you can still specify a memcg id?
>
> We'd lose the delegation functionality with this proposal.
>
> But per the other thread, I wouldn't be opposed to adding a global
> per-node interface in addition to the cgroupfs one.


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC] Mechanism to induce memory reclaim
  2022-03-10 17:33           ` Wei Xu
@ 2022-03-10 17:42             ` Johannes Weiner
  0 siblings, 0 replies; 24+ messages in thread
From: Johannes Weiner @ 2022-03-10 17:42 UTC (permalink / raw)
  To: Wei Xu
  Cc: David Rientjes, Michal Hocko, Shakeel Butt, Andrew Morton,
	Yu Zhao, Dave Hansen, Linux MM, Yosry Ahmed, Greg Thelen

On Thu, Mar 10, 2022 at 09:33:48AM -0800, Wei Xu wrote:
> On Thu, Mar 10, 2022 at 8:58 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
> >
> > On Wed, Mar 09, 2022 at 02:03:21PM -0800, David Rientjes wrote:
> > > On Tue, 8 Mar 2022, Michal Hocko wrote:
> > >
> > > > > Let me take a stab at this. The specific reasons why high limit is not a
> > > > > good interface to implement proactive reclaim:
> > > > >
> > > > > 1) It can cause allocations from the target application to get
> > > > > throttled.
> > > > >
> > > > > 2) It leaves a state (high limit) in the kernel which needs to be reset
> > > > > by the userspace part of proactive reclaimer.
> > > > >
> > > > > If I remember correctly, Facebook actually tried to use high limit to
> > > > > implement the proactive reclaim but due to exactly these limitations [1]
> > > > > they went the route [2] aligned with this proposal.
> > > >
> > > > I do remember we have discussed this in the past. There were proposals
> > > > for an additional limit to trigger a background reclaim [3] or to add a
> > > > pressure based memcg knob [4]. For the nr_to_reclaim based interface
> > > > there were some challenges outlined in that email thread. I do
> > > > understand that practical experience could have confirmed or diminished
> > > > those concerns.
> > > >
> > > > I am definitely happy to restart those discussion but it would be really
> > > > great to summarize existing options and why they do not work in
> > > > practice. It would be also great to mention why concerns about nr_to_reclaim
> > > > based interface expressed in the past are not standing out anymore wrt.
> > > > other proposals.
> > > >
> > >
> > > Johannes, since you had pointed out that the current approach used at Meta
> > > and described in the TMO paper works well in practice and is based on
> > > prior discussions of memory.reclaim[1], do you have any lingering concerns
> > > from that 2020 thread?
> >
> > I'd be okay with merging the interface proposed in that thread as-is.
> 
> We will need a nodemask argument for the memory tiering use case. We
> can add it as an optional argument to memory.reclaim later.  Or do you
> think we should add a different interface (e.g. memory.demote) for
> memory tiering instead?

Yes, good point. We can add an optional parameter later on, methinks,
as the behavior for when it's omitted shouldn't change.
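
Illustrative only: a sketch of why an optional argument stays
backwards-compatible.  The "nodes=" token is hypothetical syntax, not
something any posted patch defines:

  #include <fcntl.h>
  #include <stdio.h>
  #include <unistd.h>

  int main(void)
  {
      int fd = open("/sys/fs/cgroup/workload/memory.reclaim", O_WRONLY);

      if (fd < 0)
          return 1;
      /* Existing writers keep doing this and see no behavior change ... */
      dprintf(fd, "%lld\n", 1LL << 30);
      /* ... while tiering users could later scope the request to nodes 2-3. */
      dprintf(fd, "%lld nodes=2-3\n", 1LL << 30);
      close(fd);
      return 0;
  }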


^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2022-03-10 17:42 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed)
2022-03-06 23:11 [RFC] Mechanism to induce memory reclaim David Rientjes
2022-03-07  0:49 ` Yu Zhao
2022-03-07 14:41 ` Michal Hocko
2022-03-07 18:31   ` Shakeel Butt
2022-03-07 20:26     ` Johannes Weiner
2022-03-08 12:53       ` Michal Hocko
2022-03-08 14:44         ` Dan Schatzberg
2022-03-08 16:05           ` Michal Hocko
2022-03-08 17:21             ` Wei Xu
2022-03-08 17:23             ` Johannes Weiner
2022-03-08 12:52     ` Michal Hocko
2022-03-09 22:03       ` David Rientjes
2022-03-10 16:58         ` Johannes Weiner
2022-03-10 17:25           ` Shakeel Butt
2022-03-10 17:33           ` Wei Xu
2022-03-10 17:42             ` Johannes Weiner
2022-03-07 20:50 ` Johannes Weiner
2022-03-07 22:53   ` Wei Xu
2022-03-08 12:53     ` Michal Hocko
2022-03-08 14:49   ` Dan Schatzberg
2022-03-08 19:27     ` Johannes Weiner
2022-03-08 22:37       ` Dan Schatzberg
2022-03-09 22:30   ` David Rientjes
2022-03-10 16:10     ` Johannes Weiner
