* [RFC] Mechanism to induce memory reclaim
@ 2022-03-06 23:11 David Rientjes (3 replies; 24+ messages in thread)
From: David Rientjes @ 2022-03-06 23:11 UTC (permalink / raw)
To: Andrew Morton, Johannes Weiner, Michal Hocko, Yu Zhao, Dave Hansen
Cc: linux-mm, Yosry Ahmed, Wei Xu, Shakeel Butt, Greg Thelen

Hi everybody,

We'd like to discuss formalizing a mechanism to induce memory reclaim by the kernel.

The current multigenerational LRU (MGLRU) proposal introduces a debugfs mechanism[1] for this. The "TMO: Transparent Memory Offloading in Datacenters" paper also discusses a per-memcg mechanism[2]. While the former can be used for debugging MGLRU, both can quite powerfully be used for proactive reclaim.

Google's datacenters use a similar per-memcg mechanism for the same purpose. Formalizing the mechanism would thus allow our userspace to use an upstream-supported interface that will be stable and consistent.

This could be an incremental addition to MGLRU's lru_gen debugfs mechanism but, since the concept has no direct dependency on that work, we believe it is useful independent of the reclaim mechanism in use (both with and without CONFIG_LRU_GEN).

Idea: introduce a per-node sysfs mechanism for inducing memory reclaim that is useful for global (non-memcg-constrained) reclaim and remains available even if memcg is not enabled in the kernel or not mounted. It could optionally take a memcg id to induce reclaim for a memcg hierarchy.

IOW, this would be a /sys/devices/system/node/nodeN/reclaim mechanism for each NUMA node N on the system. (It would be similar to the existing per-node sysfs "compact" mechanism used to trigger compaction from userspace.)
Userspace would write the following to this file:
- nr_to_reclaim pages
- swappiness factor
- memcg_id of the hierarchy to reclaim from, if any[*]
- flags to specify context, if any[**]

[*] if global reclaim or memcg is not enabled/mounted, this is 0 since this is the return value of mem_cgroup_id()
[**] this is offered for extensibility to specify the context in which reclaim is being done (clean file pages only, demotion for memory tiering vs eviction, etc), otherwise 0

An alternative may be to introduce a /sys/kernel/mm/reclaim mechanism that also takes a nodemask to reclaim from. The kernel would reclaim memory over the set of nodes passed to it.

Some questions to get discussion going:

- Overall feedback or suggestions for the proposal in general?

- This proposal uses a value specified in pages to reclaim; this could be a number of bytes instead. I have no strong opinion; does anybody else?

- Should this be a per-node mechanism under sysfs like the existing "compact" mechanism, or should it be implemented as a single file that can optionally specify a nodemask to reclaim from?

Thanks!

[1] https://lore.kernel.org/linux-mm/20220208081902.3550911-12-yuzhao@google.com
[2] https://dl.acm.org/doi/10.1145/3503222.3507731 (Section 3.3)

^ permalink raw reply [flat|nested] 24+ messages in thread
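[Editor's sketch of how a userspace reclaimer might drive the proposed file. This is illustrative only: the interface does not exist, and the exact write syntax — a single space-separated "nr_to_reclaim swappiness memcg_id flags" line — is an assumption; the RFC lists the fields but does not fix a format.]

```python
def reclaim_request(nr_to_reclaim, swappiness=60, memcg_id=0, flags=0):
    """Compose the write string for the proposed per-node reclaim file.
    The space-separated single-line format is assumed for illustration."""
    return f"{nr_to_reclaim} {swappiness} {memcg_id} {flags}"

def induce_reclaim(node, nr_to_reclaim, swappiness=60, memcg_id=0, flags=0):
    """Ask the kernel to reclaim from NUMA node `node` (hypothetical knob).
    memcg_id=0 means global reclaim, per the RFC's [*] note."""
    path = f"/sys/devices/system/node/node{node}/reclaim"
    with open(path, "w") as f:
        f.write(reclaim_request(nr_to_reclaim, swappiness, memcg_id, flags))
```

Under these assumptions, induce_reclaim(0, 1 << 18) would request 256Ki pages of global reclaim from node 0.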
* Re: [RFC] Mechanism to induce memory reclaim
From: Yu Zhao @ 2022-03-07 0:49 UTC (permalink / raw)
To: David Rientjes, Andrea Righi
Cc: Andrew Morton, Johannes Weiner, Michal Hocko, Dave Hansen, linux-mm, Yosry Ahmed, Wei Xu, Shakeel Butt, Greg Thelen

On Sun, Mar 6, 2022 at 4:11 PM David Rientjes <rientjes@google.com> wrote:
>
> Hi everybody,
>
> We'd like to discuss formalizing a mechanism to induce memory reclaim by
> the kernel.
[...]
> [1] https://lore.kernel.org/linux-mm/20220208081902.3550911-12-yuzhao@google.com
> [2] https://dl.acm.org/doi/10.1145/3503222.3507731 (Section 3.3)

Adding Canonical, who also provided additional use cases [3] for this potential ABI.

[3] https://lore.kernel.org/lkml/20201005081313.732745-1-andrea.righi@canonical.com/
* Re: [RFC] Mechanism to induce memory reclaim
From: Michal Hocko @ 2022-03-07 14:41 UTC (permalink / raw)
To: David Rientjes
Cc: Andrew Morton, Johannes Weiner, Yu Zhao, Dave Hansen, linux-mm, Yosry Ahmed, Wei Xu, Shakeel Butt, Greg Thelen

On Sun 06-03-22 15:11:23, David Rientjes wrote:
[...]
> Some questions to get discussion going:
>
> - Overall feedback or suggestions for the proposal in general?

Do we really need this interface? What would be the use cases that cannot use the existing interfaces we have for that, most notably memcg and its high limit?

I do agree that the global means to trigger reclaim (min_free_kbytes) is far from a precise tool, but it would be interesting to hear more about why a number of reclaimed pages would be a more useful interface. Could you elaborate on that, please?
--
Michal Hocko
SUSE Labs
* Re: [RFC] Mechanism to induce memory reclaim
From: Shakeel Butt @ 2022-03-07 18:31 UTC (permalink / raw)
To: Michal Hocko
Cc: David Rientjes, Andrew Morton, Johannes Weiner, Yu Zhao, Dave Hansen, linux-mm, Yosry Ahmed, Wei Xu, Greg Thelen

On Mon, Mar 07, 2022 at 03:41:45PM +0100, Michal Hocko wrote:
[...]
> Do we really need this interface? What would be the use cases that cannot
> use the existing interfaces we have for that, most notably memcg and its
> high limit?

Let me take a stab at this. The specific reasons why the high limit is not a good interface for implementing proactive reclaim:

1) It can cause allocations from the target application to get throttled.

2) It leaves state (the high limit) in the kernel which needs to be reset by the userspace part of the proactive reclaimer.

If I remember correctly, Facebook actually tried to use the high limit to implement proactive reclaim but, due to exactly these limitations [1], went the route [2] aligned with this proposal.

To further explain why the above limitations are pretty bad: proactive reclaimers usually use a feedback loop to decide how much to squeeze from the target applications without impacting their performance, or impacting it only within a tolerable range. The metrics used for the feedback loop are either refaults or PSI, and these metrics become messy when the application gets throttled by the high limit.

For (2), the high limit is a very awkward interface through which to do proactive reclaim. If the userspace proactive reclaimer fails or crashes for whatever reason while triggering reclaim in an application, it can leave the application in a bad state (under memory pressure and throttled) for a long time.

[1] https://lore.kernel.org/all/20200928210216.GA378894@cmpxchg.org/
[2] https://dl.acm.org/doi/10.1145/3503222.3507731 (Section 3.3)
* Re: [RFC] Mechanism to induce memory reclaim
From: Johannes Weiner @ 2022-03-07 20:26 UTC (permalink / raw)
To: Shakeel Butt
Cc: Michal Hocko, David Rientjes, Andrew Morton, Yu Zhao, Dave Hansen, linux-mm, Yosry Ahmed, Wei Xu, Greg Thelen

On Mon, Mar 07, 2022 at 06:31:41PM +0000, Shakeel Butt wrote:
[...]
> For (2), the high limit is a very awkward interface through which to do
> proactive reclaim. If the userspace proactive reclaimer fails or crashes
> for whatever reason while triggering reclaim in an application, it can
> leave the application in a bad state (under memory pressure and
> throttled) for a long time.

Yes.

In addition to the proactive reclaimer crashing, we also had problems with it simply not responding quickly enough.

Because there is a delay between reclaim (action) and refaults (feedback), there is a very real upper limit on the number of pages you can reasonably reclaim per second without risking pressure spikes that far exceed tolerances. A fixed memory.high limit can easily exceed that safe reclaim rate when the workload expands abruptly. Even if the proactive reclaimer process is alive, it's almost impossible to step between a rapidly allocating process and its cgroup limit in time.

The semantics of writing to memory.high also require that the new limit is met before returning to userspace. This can take a long time, during which the reclaimer cannot re-evaluate the optimal target size based on observed pressure. We routinely saw the reclaimer get stuck in the kernel, hammering a suffering workload down to a stale target.

We tried for quite a while to make this work, but the limit semantics turned out not to be a good fit for proactive reclaim.

A mechanism to request a fixed number of pages to reclaim turned out to work much, much better in practice. We've been using a simple per-cgroup knob (like here: https://lkml.org/lkml/2020/9/9/1094).

With tiered memory systems coming up, I can see the need for restricting to specific NUMA nodes. Demoting from DRAM to CXL has a different cost function than evicting RAM/CXL to storage, and those two things probably need to happen at different rates.
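[Editor's sketch of the request-based model described above. The per-cgroup knob name `memory.reclaim` and its write format (a plain byte count) are assumptions based on the linked out-of-tree patch, not a stable upstream API at the time of this thread. The rate cap reflects the point that reclaim feedback arrives with a delay, so each iteration should be bounded regardless of the long-term target.]

```python
def capped_request(requested_bytes, max_rate_bytes_per_sec, interval_sec):
    """Bound one iteration's reclaim request by a safe rate, since the
    feedback (refaults/PSI) from reclaim only arrives after a delay."""
    return min(requested_bytes, int(max_rate_bytes_per_sec * interval_sec))

def request_reclaim(cgroup_path, nr_bytes):
    """One-shot reclaim request against a per-cgroup knob (name assumed).
    Unlike memory.high, no limit state is left behind, so a crashed
    reclaimer cannot strand the workload throttled under a stale limit."""
    with open(f"{cgroup_path}/memory.reclaim", "w") as f:
        f.write(str(nr_bytes))
```

The contrast with the limit-based approach is that each call is self-contained: if the agent dies between iterations, the workload simply stops being probed.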
* Re: [RFC] Mechanism to induce memory reclaim
From: Michal Hocko @ 2022-03-08 12:53 UTC (permalink / raw)
To: Johannes Weiner
Cc: Shakeel Butt, David Rientjes, Andrew Morton, Yu Zhao, Dave Hansen, linux-mm, Yosry Ahmed, Wei Xu, Greg Thelen

On Mon 07-03-22 15:26:18, Johannes Weiner wrote:
[...]
> We tried for quite a while to make this work, but the limit semantics
> turned out not to be a good fit for proactive reclaim.

Thanks for sharing your experience, Johannes. This is a useful insight.

> A mechanism to request a fixed number of pages to reclaim turned out
> to work much, much better in practice. We've been using a simple
> per-cgroup knob (like here: https://lkml.org/lkml/2020/9/9/1094).

Could you share more details here, please? How have you managed to find the reclaim target, and how have you overcome the challenge of reacting in time to leave some headroom for the actual reclaim?

> With tiered memory systems coming up, I can see the need for
> restricting to specific NUMA nodes. Demoting from DRAM to CXL has a
> different cost function than evicting RAM/CXL to storage, and those
> two things probably need to happen at different rates.

Yes, in the absence of per-node watermarks I can see how a per-node reclaim trigger could be useful. The question is whether a per-node wmark interface wouldn't be a better fit.
--
Michal Hocko
SUSE Labs
* Re: [RFC] Mechanism to induce memory reclaim
From: Dan Schatzberg @ 2022-03-08 14:44 UTC (permalink / raw)
To: Michal Hocko
Cc: Johannes Weiner, Shakeel Butt, David Rientjes, Andrew Morton, Yu Zhao, Dave Hansen, linux-mm, Yosry Ahmed, Wei Xu, Greg Thelen

On Tue, Mar 08, 2022 at 01:53:19PM +0100, Michal Hocko wrote:
> On Mon 07-03-22 15:26:18, Johannes Weiner wrote:
> [...]
> > We tried for quite a while to make this work, but the limit semantics
> > turned out not to be a good fit for proactive reclaim.
>
> Thanks for sharing your experience, Johannes. This is a useful insight.

Just to add another issue with memory.high: there's a race window between reading memory.current and setting memory.high if you want to reclaim just a little bit of memory. On a fast-expanding workload this could result in reclaiming much more than intended.
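[Editor's illustration of the race just described, as a pure arithmetic model: the limit is derived from a stale memory.current reading, so anything the workload allocates between the read and the limit taking effect gets reclaimed on top of the intended amount. Names and numbers are hypothetical.]

```python
def limit_based_overreclaim(current, target_reclaim, allocated_in_window):
    """Model the memory.current -> memory.high race.

    `current` is the usage read by the agent, `target_reclaim` how much it
    intended to reclaim, and `allocated_in_window` how much the workload
    allocated before the new limit was enforced. Returns the amount that
    actually ends up reclaimed to satisfy the stale limit."""
    high = current - target_reclaim            # limit computed from stale read
    usage_at_enforcement = current + allocated_in_window
    return usage_at_enforcement - high         # overshoots by the allocation
```

With a static workload the model reclaims exactly the target; with a fast-expanding one, the whole expansion is reclaimed as well. A request-based interface ("reclaim N bytes") has no such window by construction.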
>
> > A mechanism to request a fixed number of pages to reclaim turned out
> > to work much, much better in practice. We've been using a simple
> > per-cgroup knob (like here: https://lkml.org/lkml/2020/9/9/1094).
>
> Could you share more details here, please? How have you managed to find
> the reclaim target, and how have you overcome the challenge of reacting
> in time to leave some headroom for the actual reclaim?

We have a userspace agent that just repeatedly triggers proactive reclaim and monitors PSI metrics to maintain some constant but low pressure. In the complete absence of pressure we will reclaim some configurable percentage of the workload's memory. This reclaim amount tapers down to zero as PSI approaches the target threshold.

I don't follow your question regarding headroom. Could you elaborate?
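[Editor's sketch of the agent loop's sizing step described above. The linear taper shape, the default fraction, and the use of PSI avg10 are assumptions for illustration; the message only says the amount "tapers down to zero as PSI approaches the target threshold".]

```python
def reclaim_step(workload_bytes, psi_memory_avg10, psi_target, max_fraction=0.02):
    """Bytes to reclaim in one iteration: max_fraction of the workload at
    zero pressure, tapering linearly to zero as PSI reaches the target."""
    if psi_memory_avg10 >= psi_target:
        return 0                                   # at/over target: back off
    scale = 1.0 - psi_memory_avg10 / psi_target    # 1.0 at no pressure -> 0.0
    return int(workload_bytes * max_fraction * scale)
```

An agent would call this each interval with the cgroup's current usage and the memory-pressure average, then issue a one-shot reclaim request for the returned amount.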
* Re: [RFC] Mechanism to induce memory reclaim
From: Michal Hocko @ 2022-03-08 16:05 UTC (permalink / raw)
To: Dan Schatzberg
Cc: Johannes Weiner, Shakeel Butt, David Rientjes, Andrew Morton, Yu Zhao, Dave Hansen, linux-mm, Yosry Ahmed, Wei Xu, Greg Thelen

On Tue 08-03-22 09:44:35, Dan Schatzberg wrote:
[...]
> We have a userspace agent that just repeatedly triggers proactive
> reclaim and monitors PSI metrics to maintain some constant but low
> pressure. In the complete absence of pressure we will reclaim some
> configurable percentage of the workload's memory. This reclaim amount
> tapers down to zero as PSI approaches the target threshold.
>
> I don't follow your question regarding headroom. Could you elaborate?

One of the concerns expressed in the past is how effectively a proactive userspace reclaimer can act on memory demand transitions. It takes some time to see refault/PSI changes, and then you have to act rather swiftly, at least if you aim at a reasonably smooth transition. Tuning this up to work reliably seems far from trivial. Not to mention that changes in the memory reclaim implementation could make the whole tuning rather fragile.
--
Michal Hocko
SUSE Labs
* Re: [RFC] Mechanism to induce memory reclaim
From: Wei Xu @ 2022-03-08 17:21 UTC (permalink / raw)
To: Michal Hocko
Cc: Dan Schatzberg, Johannes Weiner, Shakeel Butt, David Rientjes, Andrew Morton, Yu Zhao, Dave Hansen, Linux MM, Yosry Ahmed, Greg Thelen

On Tue, Mar 8, 2022 at 8:05 AM Michal Hocko <mhocko@suse.com> wrote:
[...]
> One of the concerns expressed in the past is how effectively a proactive
> userspace reclaimer can act on memory demand transitions. It takes some
> time to see refault/PSI changes, and then you have to act rather swiftly,
> at least if you aim at a reasonably smooth transition. Tuning this up to
> work reliably seems far from trivial. Not to mention that changes in the
> memory reclaim implementation could make the whole tuning rather fragile.

The userspace reclaimer is not a complete replacement for kernel memory reclaim (kswapd or direct reclaim). At least in Google's use cases, it proactively identifies memory-savings opportunities and reclaims some amount of cold pages, set by policy, to free up memory for more demanding jobs or for scheduling new jobs.

If a job (container) has a rapid memory demand increase, that would just mean less proactive savings from that job. The userspace reclaimer doesn't have to act much more swiftly for such jobs with the proposed nr_bytes_to_reclaim interface. If the userspace reclaim interface were memory.high-based, then such jobs would indeed be a serious problem.
* Re: [RFC] Mechanism to induce memory reclaim
From: Johannes Weiner @ 2022-03-08 17:23 UTC (permalink / raw)
To: Michal Hocko
Cc: Dan Schatzberg, Shakeel Butt, David Rientjes, Andrew Morton, Yu Zhao, Dave Hansen, linux-mm, Yosry Ahmed, Wei Xu, Greg Thelen

On Tue, Mar 08, 2022 at 05:05:11PM +0100, Michal Hocko wrote:
[...]
> One of the concerns expressed in the past is how effectively a proactive
> userspace reclaimer can act on memory demand transitions. It takes some
> time to see refault/PSI changes, and then you have to act rather swiftly.

This was a concern with the fixed limit, but not so much with one-off requests for reclaim. There is nothing in the way that would prevent the workload from quickly allocating all the memory it needs.

The goal of proactive reclaim isn't to punish or restrict the workload, but rather to continuously probe it for cold pages, to measure the minimum amount of memory it requires to run healthily.

> At least if you aim at a reasonably smooth transition. Tuning this up to
> work reliably seems far from trivial. Not to mention that changes in the
> memory reclaim implementation could make the whole tuning rather fragile.

When reclaim becomes worse at finding the coldest memory, pressure rises with fewer pages evicted and we back off earlier. So a reclaim regression doesn't necessarily translate into less smooth operation or increased workload impact, but rather into an increased memory footprint. This may be measurable, but it isn't really an operational emergency - unless reclaim gets 50% worse, which isn't very likely, and in which case we'd stop the kernel upgrade until the bug is fixed ;)

It's pretty robust. The tuning was done empirically, but the same configuration has now held up across many different services: some with swap, some with zswap, some with just cache; different types of SSDs; different kernel versions; even drastic reclaim changes such as Joonsoo's workingset-for-anon-pages change.
* Re: [RFC] Mechanism to induce memory reclaim 2022-03-07 18:31 ` Shakeel Butt 2022-03-07 20:26 ` Johannes Weiner @ 2022-03-08 12:52 ` Michal Hocko 2022-03-09 22:03 ` David Rientjes 1 sibling, 1 reply; 24+ messages in thread From: Michal Hocko @ 2022-03-08 12:52 UTC (permalink / raw) To: Shakeel Butt Cc: David Rientjes, Andrew Morton, Johannes Weiner, Yu Zhao, Dave Hansen, linux-mm, Yosry Ahmed, Wei Xu, Greg Thelen On Mon 07-03-22 18:31:41, Shakeel Butt wrote: > On Mon, Mar 07, 2022 at 03:41:45PM +0100, Michal Hocko wrote: > > On Sun 06-03-22 15:11:23, David Rientjes wrote: > > [...] > > > Some questions to get discussion going: > > > > > > - Overall feedback or suggestions for the proposal in general? > > > Do we really need this interface? What would be usecases which cannot > > use an existing interfaces we have for that? Most notably memcg and > > their high limit? > > > Let me take a stab at this. The specific reasons why high limit is not a > good interface to implement proactive reclaim: > > 1) It can cause allocations from the target application to get > throttled. > > 2) It leaves a state (high limit) in the kernel which needs to be reset > by the userspace part of proactive reclaimer. > > If I remember correctly, Facebook actually tried to use high limit to > implement the proactive reclaim but due to exactly these limitations [1] > they went the route [2] aligned with this proposal. I do remember we have discussed this in the past. There were proposals for an additional limit to trigger a background reclaim [3] or to add a pressure based memcg knob [4]. For the nr_to_reclaim based interface there were some challenges outlined in that email thread. I do understand that practical experience could have confirmed or diminished those concerns. I am definitely happy to restart those discussion but it would be really great to summarize existing options and why they do not work in practice. 
It would be also great to mention why concerns about nr_to_reclaim based interface expressed in the past are not standing out anymore wrt. other proposals. > To further explain why the above limitations are pretty bad: The > proactive reclaimers usually use feedback loop to decide how much to > squeeze from the target applications without impacting their performance > or impacting within a tolerable range. The metrics used for the feedback > loop are either refaults or PSI and these metrics becomes messy due to > application getting throttled due to high limit. One thing is not really clear to me here. You are saying that the PSI/refaults are influenced by the throttling IIUC. Does that mean that your reclaimer is living outside of the controlled memcg? Or why does it make any difference who is reclaiming the memory from the the metrics POV? I do understand that you want to avoid throttling on the regular workload in that memcg and this is where the high limit comes short but the work has to be done by somebody, right? > For (2), the high limit interface is a very awkward interface to use to > do proactive reclaim. If the userspace proactive reclaimer fails/crashed > due to whatever reason during triggering the reclaim in an application, > it can leave the application in a bad state (memory pressure state and > throttled) for a long time. Fair enough. > [1] https://lore.kernel.org/all/20200928210216.GA378894@cmpxchg.org/ > [2] https://dl.acm.org/doi/10.1145/3503222.3507731 (Section 3.3) [3] http://lkml.kernel.org/r/20200922190859.GH12990@dhcp22.suse.cz resp. http://lkml.kernel.org/r/20200219181219.54356-1-hannes@cmpxchg.org/ [4] http://lkml.kernel.org/r/20200928210216.GA378894@cmpxchg.org -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 24+ messages in thread
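The statefulness problem with the high-limit approach, and the one-shot nature of the proposed memory.reclaim knob, can be sketched against stand-in files (the directory below is a tempdir, not a real cgroup, and memory.reclaim was only a proposal at the time of this thread):

```python
import pathlib
import tempfile

cg = pathlib.Path(tempfile.mkdtemp())  # stand-in for a cgroup directory
high = cg / "memory.high"
high.write_text("max\n")

# High-limit approach: lower the limit, let reclaim happen as a side
# effect, then restore it. If the reclaimer daemon dies between these
# two writes, the workload stays throttled at the reduced limit.
high.write_text(str(512 * 2**20) + "\n")
# ... kernel reclaims down to the limit; allocations may throttle ...
high.write_text("max\n")  # state must be reset by userspace

# Proposed one-shot interface: a single write, no state left behind.
(cg / "memory.reclaim").write_text(str(64 * 2**20) + "\n")
```

The two extra writes bracketing the high-limit squeeze are exactly the state Shakeel's point (2) is about.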
* Re: [RFC] Mechanism to induce memory reclaim 2022-03-08 12:52 ` Michal Hocko @ 2022-03-09 22:03 ` David Rientjes 2022-03-10 16:58 ` Johannes Weiner 0 siblings, 1 reply; 24+ messages in thread From: David Rientjes @ 2022-03-09 22:03 UTC (permalink / raw) To: Michal Hocko Cc: Shakeel Butt, Andrew Morton, Johannes Weiner, Yu Zhao, Dave Hansen, linux-mm, Yosry Ahmed, Wei Xu, Greg Thelen On Tue, 8 Mar 2022, Michal Hocko wrote: > > Let me take a stab at this. The specific reasons why high limit is not a > > good interface to implement proactive reclaim: > > > > 1) It can cause allocations from the target application to get > > throttled. > > > > 2) It leaves a state (high limit) in the kernel which needs to be reset > > by the userspace part of proactive reclaimer. > > > > If I remember correctly, Facebook actually tried to use high limit to > > implement the proactive reclaim but due to exactly these limitations [1] > > they went the route [2] aligned with this proposal. > > I do remember we have discussed this in the past. There were proposals > for an additional limit to trigger a background reclaim [3] or to add a > pressure based memcg knob [4]. For the nr_to_reclaim based interface > there were some challenges outlined in that email thread. I do > understand that practical experience could have confirmed or diminished > those concerns. > > I am definitely happy to restart those discussion but it would be really > great to summarize existing options and why they do not work in > practice. It would be also great to mention why concerns about nr_to_reclaim > based interface expressed in the past are not standing out anymore wrt. > other proposals. > Johannes, since you had pointed out that the current approach used at Meta and described in the TMO paper works well in practice and is based on prior discussions of memory.reclaim[1], do you have any lingering concerns from that 2020 thread? 
My first email in this thread proposes something that can still do memcg based reclaim but is also possible even without CONFIG_MEMCG enabled. That's particularly helpful for configs used by customers that don't use memcg, namely Chrome OS. I assume we're not losing any functionality that your use case depends on if we are to introduce a per-node sysfs mechanism for this as an alternative since you can still specify a memcg id? [1] https://lkml.org/lkml/2020/9/9/1094 ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [RFC] Mechanism to induce memory reclaim 2022-03-09 22:03 ` David Rientjes @ 2022-03-10 16:58 ` Johannes Weiner 2022-03-10 17:25 ` Shakeel Butt 2022-03-10 17:33 ` Wei Xu 0 siblings, 2 replies; 24+ messages in thread From: Johannes Weiner @ 2022-03-10 16:58 UTC (permalink / raw) To: David Rientjes Cc: Michal Hocko, Shakeel Butt, Andrew Morton, Yu Zhao, Dave Hansen, linux-mm, Yosry Ahmed, Wei Xu, Greg Thelen On Wed, Mar 09, 2022 at 02:03:21PM -0800, David Rientjes wrote: > On Tue, 8 Mar 2022, Michal Hocko wrote: > > > > Let me take a stab at this. The specific reasons why high limit is not a > > > good interface to implement proactive reclaim: > > > > > > 1) It can cause allocations from the target application to get > > > throttled. > > > > > > 2) It leaves a state (high limit) in the kernel which needs to be reset > > > by the userspace part of proactive reclaimer. > > > > > > If I remember correctly, Facebook actually tried to use high limit to > > > implement the proactive reclaim but due to exactly these limitations [1] > > > they went the route [2] aligned with this proposal. > > > > I do remember we have discussed this in the past. There were proposals > > for an additional limit to trigger a background reclaim [3] or to add a > > pressure based memcg knob [4]. For the nr_to_reclaim based interface > > there were some challenges outlined in that email thread. I do > > understand that practical experience could have confirmed or diminished > > those concerns. > > > > I am definitely happy to restart those discussion but it would be really > > great to summarize existing options and why they do not work in > > practice. It would be also great to mention why concerns about nr_to_reclaim > > based interface expressed in the past are not standing out anymore wrt. > > other proposals. 
> > > > Johannes, since you had pointed out that the current approach used at Meta > and described in the TMO paper works well in practice and is based on > prior discussions of memory.reclaim[1], do you have any lingering concerns > from that 2020 thread? I'd be okay with merging the interface proposed in that thread as-is. > My first email in this thread proposes something that can still do memcg > based reclaim but is also possible even without CONFIG_MEMCG enabled. > That's particularly helpful for configs used by customers that don't use > memcg, namely Chrome OS. I assume we're not losing any functionality that > your use case depends on if we are to introduce a per-node sysfs mechanism > for this as an alternative since you can still specify a memcg id? We'd lose the delegation functionality with this proposal. But per the other thread, I wouldn't be opposed to adding a global per-node interface in addition to the cgroupfs one. ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [RFC] Mechanism to induce memory reclaim 2022-03-10 16:58 ` Johannes Weiner @ 2022-03-10 17:25 ` Shakeel Butt 2022-03-10 17:33 ` Wei Xu 1 sibling, 0 replies; 24+ messages in thread From: Shakeel Butt @ 2022-03-10 17:25 UTC (permalink / raw) To: Johannes Weiner Cc: David Rientjes, Michal Hocko, Andrew Morton, Yu Zhao, Dave Hansen, Linux MM, Yosry Ahmed, Wei Xu, Greg Thelen On Thu, Mar 10, 2022 at 8:58 AM Johannes Weiner <hannes@cmpxchg.org> wrote: > > On Wed, Mar 09, 2022 at 02:03:21PM -0800, David Rientjes wrote: > > On Tue, 8 Mar 2022, Michal Hocko wrote: > > > > > > Let me take a stab at this. The specific reasons why high limit is not a > > > > good interface to implement proactive reclaim: > > > > > > > > 1) It can cause allocations from the target application to get > > > > throttled. > > > > > > > > 2) It leaves a state (high limit) in the kernel which needs to be reset > > > > by the userspace part of proactive reclaimer. > > > > > > > > If I remember correctly, Facebook actually tried to use high limit to > > > > implement the proactive reclaim but due to exactly these limitations [1] > > > > they went the route [2] aligned with this proposal. > > > > > > I do remember we have discussed this in the past. There were proposals > > > for an additional limit to trigger a background reclaim [3] or to add a > > > pressure based memcg knob [4]. For the nr_to_reclaim based interface > > > there were some challenges outlined in that email thread. I do > > > understand that practical experience could have confirmed or diminished > > > those concerns. > > > > > > I am definitely happy to restart those discussion but it would be really > > > great to summarize existing options and why they do not work in > > > practice. It would be also great to mention why concerns about nr_to_reclaim > > > based interface expressed in the past are not standing out anymore wrt. > > > other proposals. 
> > > > > > > Johannes, since you had pointed out that the current approach used at Meta > > and described in the TMO paper works well in practice and is based on > > prior discussions of memory.reclaim[1], do you have any lingering concerns > > from that 2020 thread? > > I'd be okay with merging the interface proposed in that thread as-is. > Thanks, I will revise the commit message of that patch and send it out again. Also I will try to address Michal's concerns as well. Shakeel ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [RFC] Mechanism to induce memory reclaim 2022-03-10 16:58 ` Johannes Weiner 2022-03-10 17:25 ` Shakeel Butt @ 2022-03-10 17:33 ` Wei Xu 2022-03-10 17:42 ` Johannes Weiner 1 sibling, 1 reply; 24+ messages in thread From: Wei Xu @ 2022-03-10 17:33 UTC (permalink / raw) To: Johannes Weiner Cc: David Rientjes, Michal Hocko, Shakeel Butt, Andrew Morton, Yu Zhao, Dave Hansen, Linux MM, Yosry Ahmed, Greg Thelen On Thu, Mar 10, 2022 at 8:58 AM Johannes Weiner <hannes@cmpxchg.org> wrote: > > On Wed, Mar 09, 2022 at 02:03:21PM -0800, David Rientjes wrote: > > On Tue, 8 Mar 2022, Michal Hocko wrote: > > > > > > Let me take a stab at this. The specific reasons why high limit is not a > > > > good interface to implement proactive reclaim: > > > > > > > > 1) It can cause allocations from the target application to get > > > > throttled. > > > > > > > > 2) It leaves a state (high limit) in the kernel which needs to be reset > > > > by the userspace part of proactive reclaimer. > > > > > > > > If I remember correctly, Facebook actually tried to use high limit to > > > > implement the proactive reclaim but due to exactly these limitations [1] > > > > they went the route [2] aligned with this proposal. > > > > > > I do remember we have discussed this in the past. There were proposals > > > for an additional limit to trigger a background reclaim [3] or to add a > > > pressure based memcg knob [4]. For the nr_to_reclaim based interface > > > there were some challenges outlined in that email thread. I do > > > understand that practical experience could have confirmed or diminished > > > those concerns. > > > > > > I am definitely happy to restart those discussion but it would be really > > > great to summarize existing options and why they do not work in > > > practice. It would be also great to mention why concerns about nr_to_reclaim > > > based interface expressed in the past are not standing out anymore wrt. > > > other proposals. 
> > > > > > > Johannes, since you had pointed out that the current approach used at Meta > > and described in the TMO paper works well in practice and is based on > > prior discussions of memory.reclaim[1], do you have any lingering concerns > > from that 2020 thread? > > I'd be okay with merging the interface proposed in that thread as-is. We will need a nodemask argument for the memory tiering use case. We can add it as an optional argument to memory.reclaim later. Or do you think we should add a different interface (e.g. memory.demote) for memory tiering instead? > > My first email in this thread proposes something that can still do memcg > > based reclaim but is also possible even without CONFIG_MEMCG enabled. > > That's particularly helpful for configs used by customers that don't use > > memcg, namely Chrome OS. I assume we're not losing any functionality that > > your use case depends on if we are to introduce a per-node sysfs mechanism > > for this as an alternative since you can still specify a memcg id? > > We'd lose the delegation functionality with this proposal. > > But per the other thread, I wouldn't be opposed to adding a global > per-node interface in addition to the cgroupfs one. ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [RFC] Mechanism to induce memory reclaim 2022-03-10 17:33 ` Wei Xu @ 2022-03-10 17:42 ` Johannes Weiner 0 siblings, 0 replies; 24+ messages in thread From: Johannes Weiner @ 2022-03-10 17:42 UTC (permalink / raw) To: Wei Xu Cc: David Rientjes, Michal Hocko, Shakeel Butt, Andrew Morton, Yu Zhao, Dave Hansen, Linux MM, Yosry Ahmed, Greg Thelen On Thu, Mar 10, 2022 at 09:33:48AM -0800, Wei Xu wrote: > On Thu, Mar 10, 2022 at 8:58 AM Johannes Weiner <hannes@cmpxchg.org> wrote: > > > > On Wed, Mar 09, 2022 at 02:03:21PM -0800, David Rientjes wrote: > > > On Tue, 8 Mar 2022, Michal Hocko wrote: > > > > > > > > Let me take a stab at this. The specific reasons why high limit is not a > > > > > good interface to implement proactive reclaim: > > > > > > > > > > 1) It can cause allocations from the target application to get > > > > > throttled. > > > > > > > > > > 2) It leaves a state (high limit) in the kernel which needs to be reset > > > > > by the userspace part of proactive reclaimer. > > > > > > > > > > If I remember correctly, Facebook actually tried to use high limit to > > > > > implement the proactive reclaim but due to exactly these limitations [1] > > > > > they went the route [2] aligned with this proposal. > > > > > > > > I do remember we have discussed this in the past. There were proposals > > > > for an additional limit to trigger a background reclaim [3] or to add a > > > > pressure based memcg knob [4]. For the nr_to_reclaim based interface > > > > there were some challenges outlined in that email thread. I do > > > > understand that practical experience could have confirmed or diminished > > > > those concerns. > > > > > > > > I am definitely happy to restart those discussion but it would be really > > > > great to summarize existing options and why they do not work in > > > > practice. It would be also great to mention why concerns about nr_to_reclaim > > > > based interface expressed in the past are not standing out anymore wrt. 
> > > > other proposals. > > > > > > > > > > Johannes, since you had pointed out that the current approach used at Meta > > > and described in the TMO paper works well in practice and is based on > > > prior discussions of memory.reclaim[1], do you have any lingering concerns > > > from that 2020 thread? > > > > I'd be okay with merging the interface proposed in that thread as-is. > > We will need a nodemask argument for the memory tiering use case. We > can add it as an optional argument to memory.reclaim later. Or do you > think we should add a different interface (e.g. memory.demote) for > memory tiering instead? Yes, good point. We can add an optional parameter later on, methinks, as the behavior for when it's omitted shouldn't change. ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [RFC] Mechanism to induce memory reclaim 2022-03-06 23:11 [RFC] Mechanism to induce memory reclaim David Rientjes 2022-03-07 0:49 ` Yu Zhao 2022-03-07 14:41 ` Michal Hocko @ 2022-03-07 20:50 ` Johannes Weiner 2022-03-07 22:53 ` Wei Xu ` (2 more replies) 2 siblings, 3 replies; 24+ messages in thread From: Johannes Weiner @ 2022-03-07 20:50 UTC (permalink / raw) To: David Rientjes Cc: Andrew Morton, Michal Hocko, Yu Zhao, Dave Hansen, linux-mm, Yosry Ahmed, Wei Xu, Shakeel Butt, Greg Thelen On Sun, Mar 06, 2022 at 03:11:23PM -0800, David Rientjes wrote: > Hi everybody, > > We'd like to discuss formalizing a mechanism to induce memory reclaim by > the kernel. > > The current multigenerational LRU proposal introduces a debugfs > mechanism[1] for this. The "TMO: Transparent Memory Offloading in > Datacenters" paper also discusses a per-memcg mechanism[2]. While the > former can be used for debugging of MGLRU, both can quite powerfully be > used for proactive reclaim. > > Google's datacenters use a similar per-memcg mechanism for the same > purpose. Thus, formalizing the mechanism would allow our userspace to use > an upstream supported interface that will be stable and consistent. > > This could be an incremental addition to MGLRU's lru_gen debugfs mechanism > but, since the concept has no direct dependency on the work, we believe it > is useful independent of the reclaim mechanism in use (both with and > without CONFIG_LRU_GEN). > > Idea: introduce a per-node sysfs mechanism for inducing memory reclaim > that can be useful for global (non-memcg constrained) reclaim and possible > even if memcg is not enabled in the kernel or mounted. This could > optionally take a memcg id to induce reclaim for a memcg hierarchy. > > IOW, this would be a /sys/devices/system/node/nodeN/reclaim mechanim for > each NUMA node N on the system. (It would be similar to the existing > per-node sysfs "compact" mechanism used to trigger compaction from > userspace.) 
I generally think a proactive reclaim interface is a good idea. A per-cgroup control knob would make more sense to me, as cgroupfs takes care of delegation, namespacing etc. and so would permit self-directed proactive reclaim inside containers. > Userspace would write the following to this file: > - nr_to_reclaim pages This makes sense, although (and you hinted at this below), I'm thinking it should be in bytes, especially if part of cgroupfs. > - swappiness factor This I'm not sure about. Mostly because I'm not sure about swappiness in general. It balances between anon and file, but both of them are aged according to the same LRU rules. The only reason to prefer one over the other seems to be when the cost of reloading one (refault vs swapin) isn't the same as the other. That's usually a hardware property, which in a perfect world we'd auto-tune inside the kernel based on observed IO performance. Not sure why you'd want this per reclaim request. > - flags to specify context, if any[**] > > [**] this is offered for extensibility to specify the context in which > reclaim is being done (clean file pages only, demotion for memory > tiering vs eviction, etc), otherwise 0 This one is curious. I don't understand the use cases for either of these examples, and I can't think of other flags a user may pass on a per-invocation basis. Would you care to elaborate some? ^ permalink raw reply [flat|nested] 24+ messages in thread
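The cost-based balancing Johannes alludes to (reclaim each type in inverse proportion to its reload cost) can be illustrated with a toy weight calculation; the kernel's actual heuristics are more involved than this sketch:

```python
def scan_weights(anon_reload_cost, file_reload_cost):
    """Toy cost-based anon/file balancing: reclaim each list in inverse
    proportion to the measured cost of reloading its pages (swapin vs
    refault). A simplified illustration, not the kernel's algorithm."""
    total = anon_reload_cost + file_reload_cost
    return {"anon": file_reload_cost / total,   # cheap file reloads ->
            "file": anon_reload_cost / total}   # lean harder on file

# If swapins cost roughly 3x what refaults cost on this hardware,
# reclaim leans 3:1 toward file pages.
w = scan_weights(anon_reload_cost=3.0, file_reload_cost=1.0)
print(w)  # {'anon': 0.25, 'file': 0.75}
```

Auto-tuning the two cost inputs from observed IO performance is the "perfect world" alternative to a user-supplied swappiness.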
* Re: [RFC] Mechanism to induce memory reclaim 2022-03-07 20:50 ` Johannes Weiner @ 2022-03-07 22:53 ` Wei Xu 2022-03-08 12:53 ` Michal Hocko 2022-03-08 14:49 ` Dan Schatzberg 2022-03-09 22:30 ` David Rientjes 2 siblings, 1 reply; 24+ messages in thread From: Wei Xu @ 2022-03-07 22:53 UTC (permalink / raw) To: Johannes Weiner Cc: David Rientjes, Andrew Morton, Michal Hocko, Yu Zhao, Dave Hansen, Linux MM, Yosry Ahmed, Shakeel Butt, Greg Thelen On Mon, Mar 7, 2022 at 12:50 PM Johannes Weiner <hannes@cmpxchg.org> wrote: > > On Sun, Mar 06, 2022 at 03:11:23PM -0800, David Rientjes wrote: > > Hi everybody, > > > > We'd like to discuss formalizing a mechanism to induce memory reclaim by > > the kernel. > > > > The current multigenerational LRU proposal introduces a debugfs > > mechanism[1] for this. The "TMO: Transparent Memory Offloading in > > Datacenters" paper also discusses a per-memcg mechanism[2]. While the > > former can be used for debugging of MGLRU, both can quite powerfully be > > used for proactive reclaim. > > > > Google's datacenters use a similar per-memcg mechanism for the same > > purpose. Thus, formalizing the mechanism would allow our userspace to use > > an upstream supported interface that will be stable and consistent. > > > > This could be an incremental addition to MGLRU's lru_gen debugfs mechanism > > but, since the concept has no direct dependency on the work, we believe it > > is useful independent of the reclaim mechanism in use (both with and > > without CONFIG_LRU_GEN). > > > > Idea: introduce a per-node sysfs mechanism for inducing memory reclaim > > that can be useful for global (non-memcg constrained) reclaim and possible > > even if memcg is not enabled in the kernel or mounted. This could > > optionally take a memcg id to induce reclaim for a memcg hierarchy. > > > > IOW, this would be a /sys/devices/system/node/nodeN/reclaim mechanim for > > each NUMA node N on the system. 
(It would be similar to the existing > > per-node sysfs "compact" mechanism used to trigger compaction from > > userspace.) > > I generally think a proactive reclaim interface is a good idea. It is great to hear this. > A per-cgroup control knob would make more sense to me, as cgroupfs > takes care of delegation, namespacing etc. and so would permit > self-directed proactive reclaim inside containers. A per-cgroup control works perfectly for Google's data center use case as well. But a sysfs interface, such as /sys/kernel/mm/reclaim, that takes a node mask and a memcg id as the arguments can be used by proactive reclaimers on systems that don't use memcg (e.g. some desktop Linux distros) as well, which is more general. A special value for memcg id indicating global reclaim can be passed to support non-memcg use cases. > > Userspace would write the following to this file: > > - nr_to_reclaim pages > > This makes sense, although (and you hinted at this below), I'm > thinking it should be in bytes, especially if part of cgroupfs. > > > - swappiness factor > > This I'm not sure about. > > Mostly because I'm not sure about swappiness in general. It balances > between anon and file, but both of them are aged according to the same > LRU rules. The only reason to prefer one over the other seems to be > when the cost of reloading one (refault vs swapin) isn't the same as > the other. That's usually a hardware property, which in a perfect > world we'd auto-tune inside the kernel based on observed IO > performance. Not sure why you'd want this per reclaim request. The choice between anon and file pages is not only a hardware property, but also a matter of policy decisions. It is useful to allow the userspace policy daemon the flexibility to choose anon pages or file pages or both to reclaim from, for the exact reasons that you have described. This is important for the use cases in Google (where anon pages are the primary focus of proactive reclaim). 
Maybe instead of the swappiness factor, we can replace this parameter with a page type mask to more explicitly select which types of pages to reclaim. > > - flags to specify context, if any[**] > > > > [**] this is offered for extensibility to specify the context in which > > reclaim is being done (clean file pages only, demotion for memory > > tiering vs eviction, etc), otherwise 0 > > This one is curious. I don't understand the use cases for either of > these examples, and I can't think of other flags a user may pass on a > per-invocation basis. Would you care to elaborate some? One of the flag examples is to control whether the requested proactive reclaim can induce I/Os. This can be especially useful for memory tiering to lower cost memory devices, where I/Os would likely not be preferred for reclaim-based demotion requested proactively. Wei ^ permalink raw reply [flat|nested] 24+ messages in thread
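Wei's page-type mask plus a no-I/O behavior flag could be encoded along the lines of the sketch below. None of these flag names exist in the kernel; they are hypothetical bits illustrating the shape of such an interface:

```python
from enum import IntFlag

class ReclaimFlags(IntFlag):
    """Hypothetical per-request flags; none of these names exist in the
    kernel. They only sketch how a page-type mask plus behavioral
    flags could be encoded in the proposed interface."""
    ANON   = 1 << 0  # consider anonymous pages
    FILE   = 1 << 1  # consider file-backed pages
    NO_IO  = 1 << 2  # skip pages whose reclaim would require I/O
    DEMOTE = 1 << 3  # demote to a slower tier instead of evicting

# Tiering use case from this thread: demote cold anon pages without
# generating writes to a swap device.
req = ReclaimFlags.ANON | ReclaimFlags.NO_IO | ReclaimFlags.DEMOTE
assert ReclaimFlags.FILE not in req
print(f"{int(req):#x}")  # 0xd
```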
* Re: [RFC] Mechanism to induce memory reclaim 2022-03-07 22:53 ` Wei Xu @ 2022-03-08 12:53 ` Michal Hocko 0 siblings, 0 replies; 24+ messages in thread From: Michal Hocko @ 2022-03-08 12:53 UTC (permalink / raw) To: Wei Xu Cc: Johannes Weiner, David Rientjes, Andrew Morton, Yu Zhao, Dave Hansen, Linux MM, Yosry Ahmed, Shakeel Butt, Greg Thelen On Mon 07-03-22 14:53:40, Wei Xu wrote: [...] > The choice between anon and file pages is not only a hardware > property, but also a matter of policy decisions. It is useful to allow > the userspace policy daemon the flexibility to choose anon pages or > file pages or both to reclaim from, for the exact reasons that you > have described. This is important for the use cases in Google (where > anon pages are the primary focus of proactive reclaim). > > Maybe instead of the swappiness factor, we can replace this parameter > with a page type mask to more explicitly select which types of pages > to reclaim. I am concerned this could lead to even more problems. Where do you draw the line? Do you want to control slab reclaim or even shrinkers based reclaim? -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [RFC] Mechanism to induce memory reclaim 2022-03-07 20:50 ` Johannes Weiner 2022-03-07 22:53 ` Wei Xu @ 2022-03-08 14:49 ` Dan Schatzberg 2022-03-08 19:27 ` Johannes Weiner 2022-03-09 22:30 ` David Rientjes 2 siblings, 1 reply; 24+ messages in thread From: Dan Schatzberg @ 2022-03-08 14:49 UTC (permalink / raw) To: Johannes Weiner Cc: David Rientjes, Andrew Morton, Michal Hocko, Yu Zhao, Dave Hansen, linux-mm, Yosry Ahmed, Wei Xu, Shakeel Butt, Greg Thelen On Mon, Mar 07, 2022 at 03:50:36PM -0500, Johannes Weiner wrote: > On Sun, Mar 06, 2022 at 03:11:23PM -0800, David Rientjes wrote: > > - swappiness factor > > This I'm not sure about. > > Mostly because I'm not sure about swappiness in general. It balances > between anon and file, but both of them are aged according to the same > LRU rules. The only reason to prefer one over the other seems to be > when the cost of reloading one (refault vs swapin) isn't the same as > the other. That's usually a hardware property, which in a perfect > world we'd auto-tune inside the kernel based on observed IO > performance. Not sure why you'd want this per reclaim request. I think this could be useful for budgeting write-endurance. You may want to tune down a workload's swappiness on a per-reclaim basis in order to control how much swap-out (and therefore disk writes) it's doing. Right now the only way to control this is by writing to vm.swappiness before doing the explicit reclaim which can momentarily affect other reclaim behavior on the machine. ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [RFC] Mechanism to induce memory reclaim 2022-03-08 14:49 ` Dan Schatzberg @ 2022-03-08 19:27 ` Johannes Weiner 2022-03-08 22:37 ` Dan Schatzberg 0 siblings, 1 reply; 24+ messages in thread From: Johannes Weiner @ 2022-03-08 19:27 UTC (permalink / raw) To: Dan Schatzberg Cc: David Rientjes, Andrew Morton, Michal Hocko, Yu Zhao, Dave Hansen, linux-mm, Yosry Ahmed, Wei Xu, Shakeel Butt, Greg Thelen On Tue, Mar 08, 2022 at 09:49:20AM -0500, Dan Schatzberg wrote: > On Mon, Mar 07, 2022 at 03:50:36PM -0500, Johannes Weiner wrote: > > On Sun, Mar 06, 2022 at 03:11:23PM -0800, David Rientjes wrote: > > > - swappiness factor > > > > This I'm not sure about. > > > > Mostly because I'm not sure about swappiness in general. It balances > > between anon and file, but both of them are aged according to the same > > LRU rules. The only reason to prefer one over the other seems to be > > when the cost of reloading one (refault vs swapin) isn't the same as > > the other. That's usually a hardware property, which in a perfect > > world we'd auto-tune inside the kernel based on observed IO > > performance. Not sure why you'd want this per reclaim request. > > I think this could be useful for budgeting write-endurance. You may > want to tune down a workload's swappiness on a per-reclaim basis in > order to control how much swap-out (and therefore disk writes) its > doing. Right now the only way to control this is by writing to > vm.swappiness before doing the explicit reclaim which can momentarily > effect other reclaim behavior on the machine. Yeah the global swappiness setting is not ideal for tuning behavior of individual workloads. On the other hand, flash life and write budget are global resources shared by all workloads on the system. Does it make sense longer term to take a workload-centric approach to that? There are also filesystem writes to think about. If the swappable set has already been swapped and cached, reclaiming it again doesn't require IO. 
Reclaiming dirty cache OTOH requires IO, and upping reclaim pressure on files will increase the writeback flush rates (which reduces cache effectiveness and increases aggregate writes). I wonder if it would make more sense to recognize the concept of write endurance more broadly in MM code than just swap. Where you specify a rate limit (globally? with per-cgroup shares?), and then, yes, the VM will back away from swap iff it writes too much. But also throttle writeback and push back on file reclaim and dirtying processes in accordance with that policy. ^ permalink raw reply [flat|nested] 24+ messages in thread
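A userspace approximation of such a write-endurance budget is a token bucket: the proactive reclaimer issues swap-out or writeback-heavy reclaim only while tokens remain. This is a hypothetical policy sketch, not an existing kernel mechanism:

```python
import time

class WriteBudget:
    """Toy token bucket for a write-endurance budget: a hypothetical
    userspace policy, not an existing kernel knob. The proactive
    reclaimer only issues write-heavy reclaim while tokens remain."""

    def __init__(self, bytes_per_sec, burst):
        self.rate, self.burst = bytes_per_sec, burst
        self.tokens, self.last = burst, time.monotonic()

    def try_consume(self, nbytes, now=None):
        now = time.monotonic() if now is None else now
        self.tokens = min(self.burst,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if nbytes <= self.tokens:
            self.tokens -= nbytes
            return True   # within budget: allow swap-out/writeback
        return False      # over budget: back off to clean-only reclaim

budget = WriteBudget(bytes_per_sec=10 << 20, burst=32 << 20)
assert budget.try_consume(32 << 20, now=budget.last)        # full burst ok
assert not budget.try_consume(1, now=budget.last)           # bucket empty
assert budget.try_consume(10 << 20, now=budget.last + 1.0)  # 1s of refill
```

Per-cgroup shares would then amount to giving each workload its own bucket carved out of the device's global rate, which is the isolation question raised in the follow-up.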
* Re: [RFC] Mechanism to induce memory reclaim 2022-03-08 19:27 ` Johannes Weiner @ 2022-03-08 22:37 ` Dan Schatzberg 0 siblings, 0 replies; 24+ messages in thread From: Dan Schatzberg @ 2022-03-08 22:37 UTC (permalink / raw) To: Johannes Weiner Cc: David Rientjes, Andrew Morton, Michal Hocko, Yu Zhao, Dave Hansen, linux-mm, Yosry Ahmed, Wei Xu, Shakeel Butt, Greg Thelen On Tue, Mar 08, 2022 at 02:27:49PM -0500, Johannes Weiner wrote: > On Tue, Mar 08, 2022 at 09:49:20AM -0500, Dan Schatzberg wrote: > > On Mon, Mar 07, 2022 at 03:50:36PM -0500, Johannes Weiner wrote: > > > On Sun, Mar 06, 2022 at 03:11:23PM -0800, David Rientjes wrote: > > > > - swappiness factor > > > > > > This I'm not sure about. > > > > > > Mostly because I'm not sure about swappiness in general. It balances > > > between anon and file, but both of them are aged according to the same > > > LRU rules. The only reason to prefer one over the other seems to be > > > when the cost of reloading one (refault vs swapin) isn't the same as > > > the other. That's usually a hardware property, which in a perfect > > > world we'd auto-tune inside the kernel based on observed IO > > > performance. Not sure why you'd want this per reclaim request. > > > > I think this could be useful for budgeting write-endurance. You may > > want to tune down a workload's swappiness on a per-reclaim basis in > > order to control how much swap-out (and therefore disk writes) its > > doing. Right now the only way to control this is by writing to > > vm.swappiness before doing the explicit reclaim which can momentarily > > effect other reclaim behavior on the machine. > > Yeah the global swappiness setting is not ideal for tuning behavior of > individual workloads. On the other hand, flash life and write budget > are global resources shared by all workloads on the system. Does it > make sense longer term to take a workload-centric approach to that? 
Indeed flash life is a global resource, but it may be desirable to budget it on a per-workload basis. Consider a workload with a lot of warm anonymous memory - proactive reclaim of this workload may be able to consume the entire write budget of the machine. This could result in a co-located workload getting reduced reclaim due to insufficient write budget. We'd like some form of isolation here so that the co-located workload receives some fair share of the write budget, which is hard to do without some additional control. > There are also filesystem writes to think about. If the swappable set > has already been swapped and cached, reclaiming it again doesn't > require IO. Reclaiming dirty cache OTOH requires IO, and upping > reclaim pressure on files will increase the writeback flush rates > (which reduces cache effectiveness and increases aggregate writes). > > I wonder if it would make more sense to recognize the concept of write > endurance more broadly in MM code than just swap. Where you specify a > rate limit (globally? with per-cgroup shares?), and then, yes, the VM > will back away from swap iff it writes too much. But also throttle > writeback and push back on file reclaim and dirtying processes in > accordance with that policy. Absolutely, we should discuss details but broadly I agree with the idea that there's more than just per-cgroup swappiness control as a way to gain control over mm-induced write endurance consumption. ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [RFC] Mechanism to induce memory reclaim 2022-03-07 20:50 ` Johannes Weiner 2022-03-07 22:53 ` Wei Xu 2022-03-08 14:49 ` Dan Schatzberg @ 2022-03-09 22:30 ` David Rientjes 2022-03-10 16:10 ` Johannes Weiner 2 siblings, 1 reply; 24+ messages in thread From: David Rientjes @ 2022-03-09 22:30 UTC (permalink / raw) To: Johannes Weiner Cc: Andrew Morton, Michal Hocko, Yu Zhao, Dave Hansen, linux-mm, Yosry Ahmed, Wei Xu, Shakeel Butt, Greg Thelen

On Mon, 7 Mar 2022, Johannes Weiner wrote:
> > IOW, this would be a /sys/devices/system/node/nodeN/reclaim mechanism for
> > each NUMA node N on the system. (It would be similar to the existing
> > per-node sysfs "compact" mechanism used to trigger compaction from
> > userspace.)
>
> I generally think a proactive reclaim interface is a good idea.
>
> A per-cgroup control knob would make more sense to me, as cgroupfs
> takes care of delegation, namespacing etc. and so would permit
> self-directed proactive reclaim inside containers.

This is an interesting point and something that would need to be decided. There are pros and cons to both approaches, a per-cgroup mechanism vs purely a per-node sysfs mechanism that can take a cgroup id.

The reason we'd like this in sysfs is because of users who do not enable CONFIG_MEMCG but would still benefit from proactive reclaim. Such users do exist and do not rely on memcg, such as Chrome OS, where from my understanding this is normally done to speed up hibernation.

But I note your use of "per-cgroup" control knob and not specifically "per-memcg". Were you considering a proactive reclaim mechanism for a cgroup other than memcg? A new one?

I'm wondering if it would make sense for such a cgroup interface, if eventually needed, to be added incrementally on top of a per-node sysfs interface. (We know today that there is a need for proactive reclaim for users who do not use memcg at all.)
> > Userspace would write the following to this file:
> > - nr_to_reclaim pages
>
> This makes sense, although (and you hinted at this below), I'm
> thinking it should be in bytes, especially if part of cgroupfs.

If we agree upon a sysfs interface, I assume there would be no objection to expressing this as nr_to_reclaim pages? I agree that if this is to be a memcg knob, it should be expressed in bytes for consistency with other knobs.

> > - swappiness factor
>
> This I'm not sure about.
>
> Mostly because I'm not sure about swappiness in general. It balances
> between anon and file, but both of them are aged according to the same
> LRU rules. The only reason to prefer one over the other seems to be
> when the cost of reloading one (refault vs swapin) isn't the same as
> the other. That's usually a hardware property, which in a perfect
> world we'd auto-tune inside the kernel based on observed IO
> performance. Not sure why you'd want this per reclaim request.
>
> > - flags to specify context, if any[**]
> >
> > [**] this is offered for extensibility to specify the context in which
> > reclaim is being done (clean file pages only, demotion for memory
> > tiering vs eviction, etc), otherwise 0
>
> This one is curious. I don't understand the use cases for either of
> these examples, and I can't think of other flags a user may pass on a
> per-invocation basis. Would you care to elaborate some?

If we combine the above two concerns, maybe a flags argument alone is sufficient, where you can specify only anon or only file (and neither means both)? What is controllable by swappiness could instead be controlled by two different writes to the interface, one for (possibly) anon and one for (possibly) file.

There was discussion about treating the two types of memory differently as a function of reload cost, the cost of doing I/O for discard, and how much swap space we want proactive reclaim to take; the only current alternative is playing with the global vm.swappiness.
Michal asked if this would include slab reclaim or shrinkers; I think the answer is "possibly yes," but there is no initial use case for this (flags would be extensible to permit adding it incrementally). In fact, if you were to pass a cgroup id of 0 to induce global proactive reclaim, you could mimic the same control we have with vm.drop_caches today, but without reclaiming all of a memory type.

^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [RFC] Mechanism to induce memory reclaim 2022-03-09 22:30 ` David Rientjes @ 2022-03-10 16:10 ` Johannes Weiner 0 siblings, 0 replies; 24+ messages in thread From: Johannes Weiner @ 2022-03-10 16:10 UTC (permalink / raw) To: David Rientjes Cc: Andrew Morton, Michal Hocko, Yu Zhao, Dave Hansen, linux-mm, Yosry Ahmed, Wei Xu, Shakeel Butt, Greg Thelen

On Wed, Mar 09, 2022 at 02:30:24PM -0800, David Rientjes wrote:
> On Mon, 7 Mar 2022, Johannes Weiner wrote:
>
> > > IOW, this would be a /sys/devices/system/node/nodeN/reclaim mechanism for
> > > each NUMA node N on the system. (It would be similar to the existing
> > > per-node sysfs "compact" mechanism used to trigger compaction from
> > > userspace.)
> >
> > I generally think a proactive reclaim interface is a good idea.
> >
> > A per-cgroup control knob would make more sense to me, as cgroupfs
> > takes care of delegation, namespacing etc. and so would permit
> > self-directed proactive reclaim inside containers.
>
> This is an interesting point and something that would need to be decided.
> There's pros and cons to both approaches, per-cgroup mechanism vs purely a
> per-node sysfs mechanism that can take a cgroup id.

I think we can just add both and avoid the cgroupid quirk. We've done this many times: psi has global and cgroupfs interfaces, so does vmstat, so does (did) swappiness etc. I don't see a problem with adding a system and a cgroup interface for this.

> The reason we'd like this in sysfs is because of users who do not enable
> CONFIG_MEMCG but would still benefit from proactive reclaim. Such users
> do exist and do not rely on memcg, such as Chrome OS, and from my
> understanding this is normally done to speed up hibernation.

Yes, that makes sense.

> But I note your use of "per-cgroup" control knob and not specifically
> "per-memcg". Were you considering a proactive reclaim mechanism for a
> cgroup other than memcg? A new one?
No subtle nuance intended, I'm just using them interchangeably with cgroup2. I meant: a cgroup that has the memory controller enabled :)

> I'm wondering if it would make sense for such a cgroup interface, if
> eventually needed, to be added incrementally on top of a per-node sysfs
> interface. (We know today that there is a need for proactive reclaim for
> users who do not use memcg at all.)

We've already had delegated deployments as well. Both uses are real. But again, I don't think we have to choose at all. Let's add both!

> > > Userspace would write the following to this file:
> > > - nr_to_reclaim pages
> >
> > This makes sense, although (and you hinted at this below), I'm
> > thinking it should be in bytes, especially if part of cgroupfs.
>
> If we agree upon a sysfs interface I assume there would be no objection to
> this in nr_to_reclaim pages? I agree if this is to be a memcg knob that
> it should be expressed in bytes for consistency with other knobs.

Pages in general are somewhat fraught as a unit for facing userspace. It requires people to use _SC_PAGESIZE, but they don't: https://twitter.com/marcan42/status/1498710903675842563

Is there an argument *for* using pages?

> > > - swappiness factor
> >
> > This I'm not sure about.
> >
> > Mostly because I'm not sure about swappiness in general. It balances
> > between anon and file, but both of them are aged according to the same
> > LRU rules. The only reason to prefer one over the other seems to be
> > when the cost of reloading one (refault vs swapin) isn't the same as
> > the other. That's usually a hardware property, which in a perfect
> > world we'd auto-tune inside the kernel based on observed IO
> > performance. Not sure why you'd want this per reclaim request.
> >
> > > - flags to specify context, if any[**]
> > >
> > > [**] this is offered for extensibility to specify the context in which
> > > reclaim is being done (clean file pages only, demotion for memory
> > > tiering vs eviction, etc), otherwise 0
> >
> > This one is curious. I don't understand the use cases for either of
> > these examples, and I can't think of other flags a user may pass on a
> > per-invocation basis. Would you care to elaborate some?
> >
> If we combine the above two concerns, maybe only a flags argument is
> sufficient where you can specify only anon or only file (and neither means
> both)? What is controllable by swappiness could be controlled by two
> different writes to the interface, one for (possibly) anon and one for
> (possibly) file.
>
> There was discussion about treating the two different types of memory
> differently as a function of reload cost, cost of doing I/O for discard,
> and how much swap space we want proactive reclaim to take, as well as the
> only current alternative is to be playing with the global vm.swappiness.
>
> Michal asked if this would include slab reclaim or shrinkers, I think the
> answer is "possibly yes," but no initial use case for this (flags would be
> extensible to permit the addition of it incrementally). In fact, if you
> were to pass a cgroup id of 0 to induce global proactive reclaim you could
> mimic the same control we have with vm.drop_caches today but does not
> include reclaiming all of a memory type.

Ok, I think I see.

My impression is that this is a mechanism that optimally the kernel's reclaim algorithm should provide, rather than (just) application/setup dependent policy preferences.

The cost of reload for example. Yes, it needs to be balanced between anon and file. But is there a target to aim for besides lowest aggregate paging overhead for the application?
How much swap space to use is a good point too, but we already have an expression of intended per-cgroup share from the user: memory.swap.high and memory.swap.max. Shouldn't reclaim in general back off gradually from swap as utilization approaches 100%? Is proactive reclaim different from conventional reclaim in this regard?

The write endurance question is similar. Policy would be to express a global budget and per-cgroup shares of that budget; mechanism would be to have this inform reclaim and writeback behavior. My question would be why the mechanism *shouldn't* live in the kernel, which would then allow userspace to configure it in terms most people actually understand: flash write budgets, swap space allowances, etc.

The interface proposed here strikes me as rather low-level. It's less a conventional user interface than a set of building blocks for implementing parts of the reclaim algorithm in userspace. I'm not necessarily against that; it's just unusual and IMO deserves some more discussion. I want to make sure that if there are shortcomings in the kernel, we address them rather than work around them.

^ permalink raw reply [flat|nested] 24+ messages in thread
end of thread, other threads:[~2022-03-10 17:42 UTC | newest] Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2022-03-06 23:11 [RFC] Mechanism to induce memory reclaim David Rientjes 2022-03-07 0:49 ` Yu Zhao 2022-03-07 14:41 ` Michal Hocko 2022-03-07 18:31 ` Shakeel Butt 2022-03-07 20:26 ` Johannes Weiner 2022-03-08 12:53 ` Michal Hocko 2022-03-08 14:44 ` Dan Schatzberg 2022-03-08 16:05 ` Michal Hocko 2022-03-08 17:21 ` Wei Xu 2022-03-08 17:23 ` Johannes Weiner 2022-03-08 12:52 ` Michal Hocko 2022-03-09 22:03 ` David Rientjes 2022-03-10 16:58 ` Johannes Weiner 2022-03-10 17:25 ` Shakeel Butt 2022-03-10 17:33 ` Wei Xu 2022-03-10 17:42 ` Johannes Weiner 2022-03-07 20:50 ` Johannes Weiner 2022-03-07 22:53 ` Wei Xu 2022-03-08 12:53 ` Michal Hocko 2022-03-08 14:49 ` Dan Schatzberg 2022-03-08 19:27 ` Johannes Weiner 2022-03-08 22:37 ` Dan Schatzberg 2022-03-09 22:30 ` David Rientjes 2022-03-10 16:10 ` Johannes Weiner