* Memcg stat for available memory

From: David Rientjes @ 2020-06-28 22:15 UTC
To: Johannes Weiner, Michal Hocko, Vladimir Davydov
Cc: Andrew Morton, Shakeel Butt, cgroups, linux-mm

Hi everybody,

I'd like to discuss the feasibility of a stat similar to
si_mem_available() but at memcg scope which would specify how much memory
can be charged without I/O.

The si_mem_available() stat is based on heuristics, so it does not
provide an exact quantity that is actually available at any given time,
but it can otherwise provide userspace with some guidance on the amount
of reclaimable memory.  See the description in
Documentation/filesystems/proc.rst and its implementation.

[ Naturally, userspace would need to understand both the amount of memory
  that is available for allocation and for charging, separately, on an
  overcommitted system.  I assume this is trivial.  (Why don't we provide
  MemAvailable in per-node meminfo?) ]

For such a stat at memcg scope, we can ignore totalreserves and
watermarks.  We already have ~precise (modulo MEMCG_CHARGE_BATCH) data
for both file pages and slab_reclaimable.

We can infer lazily freed memory by doing

	file - (active_file + inactive_file)

(This is necessary because lazy free memory is anon but on the inactive
file lru, and we can't infer lazy freeable memory through pglazyfree -
pglazyfreed; those are event counters.)

We can also infer the number of underlying compound pages that are on
deferred split queues but have yet to be split with active_anon - anon
(or is this a bug? :)

So it *seems* like userspace can make a si_mem_available()-like
calculation ("avail") by doing

	free = memory.high - memory.current
	lazyfree = file - (active_file + inactive_file)
	deferred = active_anon - anon

	avail = free + lazyfree + deferred +
		(active_file + inactive_file + slab_reclaimable) / 2

For userspace interested in knowing how much memory it can charge without
incurring I/O (and assuming it has knowledge of available memory on an
overcommitted system), it seems like:

 (a) it can derive the above avail amount that is at least similar to
     MemAvailable,

 (b) it can assume that all reclaim is considered equal, so anything more
     than memory.high - memory.current is disruptive enough that it's a
     better heuristic than the above, or

 (c) the kernel provides an "avail" stat in memory.stat based on the
     above that can evolve as the kernel implementation changes (how lazy
     free memory impacts anon vs file lru stats, how deferred split
     memory is handled, any future extensions for "easily reclaimable
     memory") and that userspace can count on to the same degree it can
     count on MemAvailable.

Any thoughts?
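Expressed as a userspace helper, the calculation proposed above might look like the sketch below.  This is purely illustrative: the dict keys are cgroup v2 memory.stat field names, the limit and usage come from memory.high and memory.current, and the formula is exactly the heuristic proposed in this mail, not an existing kernel interface.

```python
def memcg_avail(high, current, stat):
    """Estimate how much can be charged to one memcg without I/O,
    following the heuristic proposed in this thread (bytes throughout).

    `stat` is a dict of cgroup v2 memory.stat fields."""
    free = high - current
    # Lazy-free (MADV_FREE) pages are anon pages parked on the file LRUs,
    # so the `file` counter and the file LRU sums diverge by that amount.
    lazyfree = stat["file"] - (stat["active_file"] + stat["inactive_file"])
    # Compound pages on the deferred split queue are still charged and
    # counted in active_anon, but no longer in mapped anon.
    deferred = stat["active_anon"] - stat["anon"]
    # Like MemAvailable, assume roughly half of page cache and
    # reclaimable slab can be reclaimed without undue disruption.
    easy = (stat["active_file"] + stat["inactive_file"]
            + stat["slab_reclaimable"]) // 2
    return free + lazyfree + deferred + easy
```

A caller would fill `stat` by parsing `/sys/fs/cgroup/<group>/memory.stat`; the batching of memcg stat updates means the result is an approximation at best.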
* Re: Memcg stat for available memory

From: Shakeel Butt @ 2020-07-02 15:22 UTC
To: David Rientjes, Yang Shi, Roman Gushchin, Greg Thelen
Cc: Johannes Weiner, Michal Hocko, Vladimir Davydov, Andrew Morton,
    Cgroups, Linux MM

(Adding more people who might be interested in this)

On Sun, Jun 28, 2020 at 3:15 PM David Rientjes <rientjes@google.com> wrote:
>
> Hi everybody,
>
> I'd like to discuss the feasibility of a stat similar to
> si_mem_available() but at memcg scope which would specify how much
> memory can be charged without I/O.
>
> The si_mem_available() stat is based on heuristics, so it does not
> provide an exact quantity that is actually available at any given time,
> but it can otherwise provide userspace with some guidance on the amount
> of reclaimable memory.  See the description in
> Documentation/filesystems/proc.rst and its implementation.
>
> [ Naturally, userspace would need to understand both the amount of
>   memory that is available for allocation and for charging, separately,
>   on an overcommitted system.  I assume this is trivial.  (Why don't we
>   provide MemAvailable in per-node meminfo?) ]
>
> For such a stat at memcg scope, we can ignore totalreserves and
> watermarks.  We already have ~precise (modulo MEMCG_CHARGE_BATCH) data
> for both file pages and slab_reclaimable.
>
> We can infer lazily freed memory by doing
>
> 	file - (active_file + inactive_file)
>
> (This is necessary because lazy free memory is anon but on the inactive
> file lru, and we can't infer lazy freeable memory through pglazyfree -
> pglazyfreed; those are event counters.)
>
> We can also infer the number of underlying compound pages that are on
> deferred split queues but have yet to be split with active_anon - anon
> (or is this a bug? :)
>
> So it *seems* like userspace can make a si_mem_available()-like
> calculation ("avail") by doing
>
> 	free = memory.high - memory.current
> 	lazyfree = file - (active_file + inactive_file)
> 	deferred = active_anon - anon
>
> 	avail = free + lazyfree + deferred +
> 		(active_file + inactive_file + slab_reclaimable) / 2
>
> For userspace interested in knowing how much memory it can charge
> without incurring I/O (and assuming it has knowledge of available
> memory on an overcommitted system), it seems like:
>
>  (a) it can derive the above avail amount that is at least similar to
>      MemAvailable,
>
>  (b) it can assume that all reclaim is considered equal, so anything
>      more than memory.high - memory.current is disruptive enough that
>      it's a better heuristic than the above, or
>
>  (c) the kernel provides an "avail" stat in memory.stat based on the
>      above that can evolve as the kernel implementation changes and
>      that userspace can count on to the same degree it can count on
>      MemAvailable.
>
> Any thoughts?

I think we need to answer two questions:

1) What's the use-case?
2) Why is user space calculating its own MemAvailable not good enough?

The use case I have in mind is a latency-sensitive distributed caching
service which would prefer to reduce the amount of its caching over the
stalls incurred by hitting the limit.  Such applications can monitor
their MemAvailable and adjust their caching footprint.

For the second, I think it is to hide the internal implementation
details of the kernel from user space.  The deferred split queue is an
internal detail and we don't want it exposed to the user.  Similarly,
how lazyfree is implemented (i.e. anon pages on the file LRU) should not
be exposed to the users.

Shakeel
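The caching service Shakeel describes could be structured as a small control loop over the memcg interface.  A minimal sketch, assuming a hypothetical application callback `shrink` that releases cache and a `read_stat` function that returns (memory.high, memory.current); the kernel itself offers no such callback:

```python
import time

def needed_release(high, current, reserve):
    """Bytes of cache to give back so that at least `reserve` bytes of
    headroom remain below memory.high (0 if there is already enough)."""
    return max(0, reserve - (high - current))

def monitor(read_stat, shrink, reserve=64 << 20, period=1.0, iterations=None):
    """Hypothetical control loop: poll the memcg and shrink the cache
    before the limit is hit, trading cache size for the stalls that
    hitting memory.high would otherwise incur."""
    while iterations is None or iterations > 0:
        high, current = read_stat()
        release = needed_release(high, current, reserve)
        if release:
            shrink(release)
        time.sleep(period)
        if iterations is not None:
            iterations -= 1
```

Because the limit can change under the application's feet, the loop has to poll; that volatility is one of the concerns raised later in the thread.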
* Re: Memcg stat for available memory

From: Michal Hocko @ 2020-07-03  8:15 UTC
To: Shakeel Butt
Cc: David Rientjes, Yang Shi, Roman Gushchin, Greg Thelen,
    Johannes Weiner, Vladimir Davydov, Andrew Morton, Cgroups, Linux MM

[I am sorry, I was busy and didn't get to it sooner.]

On Thu 02-07-20 08:22:10, Shakeel Butt wrote:
> (Adding more people who might be interested in this)
>
> On Sun, Jun 28, 2020 at 3:15 PM David Rientjes <rientjes@google.com> wrote:
> >
> > Hi everybody,
> >
> > I'd like to discuss the feasibility of a stat similar to
> > si_mem_available() but at memcg scope which would specify how much
> > memory can be charged without I/O.
> >
> > The si_mem_available() stat is based on heuristics, so it does not
> > provide an exact quantity that is actually available at any given
> > time, but it can otherwise provide userspace with some guidance on
> > the amount of reclaimable memory.  See the description in
> > Documentation/filesystems/proc.rst and its implementation.

I have to say I was a fan of this metric when it was introduced, mostly
because it removed that nasty subtle detail that the Cached value
includes swap-backed memory (e.g. shmem), which has caused a lot of
confusion.  But I became very skeptical over time because it is really
hard to set expectations right when relying on the value, for two main
reasons:

- it is a global snapshot value and as such it becomes largely unusable
  for any decisions which are not implemented right away or if there are
  multiple uncoordinated consumers.

- it is not really hard to trigger "corner" cases where a careful use of
  MemAvailable still leads to a lot of memory reclaim, even for a single
  large consumer.  What we consider reclaimable might be pinned for
  different reasons, or the situation simply changes.  Our documentation
  claims that following this guidance will help prevent swapping/reclaim,
  yet this is not true and I have seen bug reports in the past.

> > [ Naturally, userspace would need to understand both the amount of
> >   memory that is available for allocation and for charging,
> >   separately, on an overcommitted system.  I assume this is trivial.
> >   (Why don't we provide MemAvailable in per-node meminfo?) ]

I presume you mean the consumer would simply do min(global, memcg),
right?  Well, a proper implementation of the value would have to be
hierarchical, so it would be the minimum over the whole memcg tree up to
the root.  We cannot expect userspace to do that.

While technically possible and not that hard to express, I am worried
the existing problems with the value would just be amplified because
there is even more volatility here.  The global value mostly depends on
consumers; now you have a second source of volatility, and that is the
memcg limit (hard or high), which can be changed quite dynamically.
Sure, the global case can have a similar issue with memory hotplug, but
realistically that is far less common.  Another complication is that the
amount of reclaim for each memcg depends on the reclaimability of other
memcgs under global memory pressure (just consider low/min protection as
the simplest example).  So I expect the imprecision will be even harder
to predict for a per-memcg value.

> > For such a stat at memcg scope, we can ignore totalreserves and
> > watermarks.  We already have ~precise (modulo MEMCG_CHARGE_BATCH)
> > data for both file pages and slab_reclaimable.
> >
> > We can infer lazily freed memory by doing
> >
> > 	file - (active_file + inactive_file)
> >
> > (This is necessary because lazy free memory is anon but on the
> > inactive file lru, and we can't infer lazy freeable memory through
> > pglazyfree - pglazyfreed; those are event counters.)
> >
> > We can also infer the number of underlying compound pages that are on
> > deferred split queues but have yet to be split with active_anon -
> > anon (or is this a bug? :)
> >
> > So it *seems* like userspace can make a si_mem_available()-like
> > calculation ("avail") by doing
> >
> > 	free = memory.high - memory.current

min(memory.high, memory.max)

> > 	lazyfree = file - (active_file + inactive_file)
> > 	deferred = active_anon - anon
> >
> > 	avail = free + lazyfree + deferred +
> > 		(active_file + inactive_file + slab_reclaimable) / 2

I am not sure why you want to treat lazy free differently from the
global value.  But this is really a minor technical thing which is not
all that interesting until we can actually define what the real usecase
would be.

> > For userspace interested in knowing how much memory it can charge
> > without incurring I/O (and assuming it has knowledge of available
> > memory on an overcommitted system), it seems like:
> >
> >  (a) it can derive the above avail amount that is at least similar to
> >      MemAvailable,
> >
> >  (b) it can assume that all reclaim is considered equal, so anything
> >      more than memory.high - memory.current is disruptive enough that
> >      it's a better heuristic than the above, or
> >
> >  (c) the kernel provides an "avail" stat in memory.stat based on the
> >      above that can evolve as the kernel implementation changes and
> >      that userspace can count on to the same degree it can count on
> >      MemAvailable.
> >
> > Any thoughts?
>
> I think we need to answer two questions:
>
> 1) What's the use-case?
> 2) Why is user space calculating its own MemAvailable not good enough?

These are questions the discussion should have started with.  Thanks!

> The use case I have in mind is a latency-sensitive distributed caching
> service which would prefer to reduce the amount of its caching over the
> stalls incurred by hitting the limit.  Such applications can monitor
> their MemAvailable and adjust their caching footprint.

Is the value really reliable enough to implement such a logic, though?
I have mentioned some problems above.  The situation might change at any
time, and the source of that change might be external to the memcg, so
the value would have to be pro-actively polled all the time.  This
doesn't sound very viable to me, especially for a latency-sensitive
service.  Wouldn't it make more sense to protect the service and
dynamically change the low memory protection based on the external
memory pressure?  There are different ways to achieve that, e.g. watch
for LOW event notifications and/or PSI metrics.  I believe FB is relying
on such dynamic scaling a lot.

> For the second, I think it is to hide the internal implementation
> details of the kernel from user space.  The deferred split queue is an
> internal detail and we don't want it exposed to the user.  Similarly,
> how lazyfree is implemented (i.e. anon pages on the file LRU) should
> not be exposed to the users.

I would tend to agree that there is a lot of internal logic that can
skew existing statistics and that might be confusing.  But I am not sure
that providing something that aims to hide them yet is hard to use is a
proper way forward.  But maybe I am just too pessimistic.  I would be
happy to be convinced otherwise.
--
Michal Hocko
SUSE Labs
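The hierarchical view Michal describes — the effective headroom is the minimum of min(memory.high, memory.max) - memory.current over the cgroup and all of its ancestors — can be sketched as a walk up the cgroup v2 filesystem.  Illustrative only: a real consumer would also have to worry about limits and usage changing while it walks the tree.

```python
import os

def read_limit(path, name):
    """Read one cgroup v2 control file; the string "max" means unlimited."""
    with open(os.path.join(path, name)) as f:
        v = f.read().strip()
    return float("inf") if v == "max" else int(v)

def hierarchical_headroom(cgroup, root="/sys/fs/cgroup"):
    """Minimum charge headroom over the cgroup and every ancestor up to
    (but excluding) the cgroup root."""
    headroom = float("inf")
    path = os.path.abspath(cgroup)
    root = os.path.abspath(root)
    while path != root and path.startswith(root):
        limit = min(read_limit(path, "memory.high"),
                    read_limit(path, "memory.max"))
        headroom = min(headroom, limit - read_limit(path, "memory.current"))
        path = os.path.dirname(path)
    return headroom
```

The result is only meaningful at the instant it is computed, which is exactly the snapshot problem raised above.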
* Re: Memcg stat for available memory

From: David Rientjes @ 2020-07-07 19:58 UTC
To: Michal Hocko
Cc: Shakeel Butt, Yang Shi, Roman Gushchin, Greg Thelen,
    Johannes Weiner, Vladimir Davydov, Andrew Morton, Cgroups, Linux MM

On Fri, 3 Jul 2020, Michal Hocko wrote:

> > > I'd like to discuss the feasibility of a stat similar to
> > > si_mem_available() but at memcg scope which would specify how much
> > > memory can be charged without I/O.
> > >
> > > The si_mem_available() stat is based on heuristics, so it does not
> > > provide an exact quantity that is actually available at any given
> > > time, but it can otherwise provide userspace with some guidance on
> > > the amount of reclaimable memory.  See the description in
> > > Documentation/filesystems/proc.rst and its implementation.
>
> I have to say I was a fan of this metric when it was introduced, mostly
> because it removed that nasty subtle detail that the Cached value
> includes swap-backed memory (e.g. shmem), which has caused a lot of
> confusion.  But I became very skeptical over time because it is really
> hard to set expectations right when relying on the value, for two main
> reasons:
> - it is a global snapshot value and as such it becomes largely
>   unusable for any decisions which are not implemented right away or
>   if there are multiple uncoordinated consumers.
> - it is not really hard to trigger "corner" cases where a careful use
>   of MemAvailable still leads to a lot of memory reclaim, even for a
>   single large consumer.  What we consider reclaimable might be pinned
>   for different reasons, or the situation simply changes.  Our
>   documentation claims that following this guidance will help prevent
>   swapping/reclaim, yet this is not true and I have seen bug reports
>   in the past.

Hi everybody,

I agree that mileage may vary with MemAvailable and that it is only
representative of the state of memory at the time it is grabbed
(although with a memcg equivalent we could have more of a guarantee here
on a system that is not overcommitted).  I think it's best viewed as our
best guess of the amount of free + inexpensively reclaimable memory that
exists at any point in time.

An alternative would be to describe simply the amount of memory that we
anticipate is reclaimable.  This doesn't get around the pinning issue,
but it does provide, like MemAvailable, a queryable field that will be
stable over kernel versions for what the kernel perceives as reclaimable
and, importantly, makes its reclaim decisions based on.

> > > [ Naturally, userspace would need to understand both the amount of
> > >   memory that is available for allocation and for charging,
> > >   separately, on an overcommitted system.  I assume this is
> > >   trivial.  (Why don't we provide MemAvailable in per-node
> > >   meminfo?) ]
>
> I presume you mean the consumer would simply do min(global, memcg),
> right?  Well, a proper implementation of the value would have to be
> hierarchical, so it would be the minimum over the whole memcg tree up
> to the root.  We cannot expect userspace to do that.
>
> While technically possible and not that hard to express, I am worried
> the existing problems with the value would just be amplified because
> there is even more volatility here.  The global value mostly depends
> on consumers; now you have a second source of volatility, and that is
> the memcg limit (hard or high), which can be changed quite
> dynamically.  Sure, the global case can have a similar issue with
> memory hotplug, but realistically that is far less common.  Another
> complication is that the amount of reclaim for each memcg depends on
> the reclaimability of other memcgs under global memory pressure (just
> consider low/min protection as the simplest example).  So I expect the
> imprecision will be even harder to predict for a per-memcg value.

Yeah, I think it's best approached by considering the global
MemAvailable separately from any per-memcg metric: they have different
scope depending on whether you're the application manager or whether
you're the process/library attached to a memcg that is trying to
orchestrate its own memory usage.

The memcg view of the amount of reclaimable memory (or available
[free + reclaimable]) should be specific to that hierarchy without
considering the global MemAvailable, just as we can incur reclaim and/or
oom today both from the page allocator and from the memcg charge path
separately.  They are two different contexts.

The simplest example would be a malloc implementation that derives
benefit from keeping as much heap backed by hugepages as possible, so it
attempts to avoid splitting huge pmds whenever possible absent memory
pressure.  As the amount of available memory decreases for whatever
reason (more user or kernel memory charged to the hierarchy), it could
start releasing more memory back to the system and splitting these pmds.
This may not only be memory.{high,max} - memory.current, since it may be
preferable for some reclaim activity to be coupled with "userspace
reclaim" at this mark, such as doing MADV_DONTNEED for memory on the
malloc freelist.

> > > For such a stat at memcg scope, we can ignore totalreserves and
> > > watermarks.  We already have ~precise (modulo MEMCG_CHARGE_BATCH)
> > > data for both file pages and slab_reclaimable.
> > >
> > > We can infer lazily freed memory by doing
> > >
> > > 	file - (active_file + inactive_file)
> > >
> > > (This is necessary because lazy free memory is anon but on the
> > > inactive file lru, and we can't infer lazy freeable memory through
> > > pglazyfree - pglazyfreed; those are event counters.)
> > >
> > > We can also infer the number of underlying compound pages that are
> > > on deferred split queues but have yet to be split with active_anon
> > > - anon (or is this a bug? :)
> > >
> > > So it *seems* like userspace can make a si_mem_available()-like
> > > calculation ("avail") by doing
> > >
> > > 	free = memory.high - memory.current
>
> min(memory.high, memory.max)
>
> > > 	lazyfree = file - (active_file + inactive_file)
> > > 	deferred = active_anon - anon
> > >
> > > 	avail = free + lazyfree + deferred +
> > > 		(active_file + inactive_file + slab_reclaimable) / 2
>
> I am not sure why you want to treat lazy free differently from the
> global value.  But this is really a minor technical thing which is not
> all that interesting until we can actually define what the real
> usecase would be.
>
> > > For userspace interested in knowing how much memory it can charge
> > > without incurring I/O (and assuming it has knowledge of available
> > > memory on an overcommitted system), it seems like:
> > >
> > >  (a) it can derive the above avail amount that is at least similar
> > >      to MemAvailable,
> > >
> > >  (b) it can assume that all reclaim is considered equal, so
> > >      anything more than memory.high - memory.current is disruptive
> > >      enough that it's a better heuristic than the above, or
> > >
> > >  (c) the kernel provides an "avail" stat in memory.stat based on
> > >      the above that can evolve as the kernel implementation changes
> > >      and that userspace can count on to the same degree it can
> > >      count on MemAvailable.
> > >
> > > Any thoughts?
> >
> > I think we need to answer two questions:
> >
> > 1) What's the use-case?
> > 2) Why is user space calculating its own MemAvailable not good
> >    enough?
>
> These are questions the discussion should have started with.  Thanks!
>
> > The use case I have in mind is a latency-sensitive distributed
> > caching service which would prefer to reduce the amount of its
> > caching over the stalls incurred by hitting the limit.  Such
> > applications can monitor their MemAvailable and adjust their caching
> > footprint.
>
> Is the value really reliable enough to implement such a logic, though?
> I have mentioned some problems above.  The situation might change at
> any time, and the source of that change might be external to the
> memcg, so the value would have to be pro-actively polled all the time.
> This doesn't sound very viable to me, especially for a
> latency-sensitive service.  Wouldn't it make more sense to protect the
> service and dynamically change the low memory protection based on the
> external memory pressure?  There are different ways to achieve that,
> e.g. watch for LOW event notifications and/or PSI metrics.  I believe
> FB is relying on such dynamic scaling a lot.

Right, and given the variance imposed by MEMCG_CHARGE_BATCH in the stats
themselves, for example, this value will not be 100% accurate, since
none of the stats are 100% accurate :)

> > For the second, I think it is to hide the internal implementation
> > details of the kernel from user space.  The deferred split queue is
> > an internal detail and we don't want it exposed to the user.
> > Similarly, how lazyfree is implemented (i.e. anon pages on the file
> > LRU) should not be exposed to the users.
>
> I would tend to agree that there is a lot of internal logic that can
> skew existing statistics and that might be confusing.  But I am not
> sure that providing something that aims to hide them yet is hard to
> use is a proper way forward.  But maybe I am just too pessimistic.  I
> would be happy to be convinced otherwise.

The idea would be to expose what the kernel deems to be available, much
like MemAvailable, at a given time, without requiring userspace to
derive the value for itself.  Certainly it *can*, and I gave an example
of how it would do that, but that requires an understanding of the
metrics that the kernel exposes and of how reclaim behaves, and it
attempts to be stable over multiple kernel versions, which would be the
same motivation as for the global metric.

Assume a memcg hierarchy that is serving that latency-sensitive service,
protected from the effects of global pressure, but where the amount of
memory consumed by the service varies over releases.  How the kernel
handles lazy free memory, how it handles deferred split queues, etc.,
are specific details that userspace may not have visibility into: the
metric answers the question "what can I actually get back if I call into
reclaim?"  The amount of memory on the deferred split queue can be
substantial, for example, but userspace would be unaware of it unless it
does something like active_anon - anon.

Another use case would be motivated by exactly the MemAvailable use
case: when bound to a memcg hierarchy, how much memory is available,
without substantial swap or risk of oom, for starting a new process or
service?  This would not trigger any memory.low or PSI notification, but
it is a heuristic that can be used to determine what can and cannot be
started without incurring substantial memory reclaim.

I'm indifferent to whether this would be a "reclaimable" or "available"
metric, with a slight preference toward making it as similar in
calculation to MemAvailable as possible.  So I think the question is
whether this is something users should derive themselves based on the
memcg stats that are exported, or whether we should solidify it, based
on how the kernel handles reclaim, as a metric that will carry over
across kernel versions.
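The "can I start this service?" use case reduces to a simple admission check against the estimated avail value.  A hypothetical sketch; `avail` would come from a calculation like the one earlier in the thread, and the `slack` reserve is a made-up knob, not anything the kernel defines:

```python
def can_start(avail, expected_footprint, slack=0):
    """MemAvailable-style admission check: start a new process/service
    only if the memcg's estimated available memory covers its expected
    footprint plus some slack, so the group is not immediately pushed
    into heavy reclaim (all values in bytes)."""
    return avail >= expected_footprint + slack
```

Note that passing this check is only a heuristic: avail is a snapshot, and as Michal points out, what looks reclaimable now may be pinned or gone by the time the new workload faults its memory in.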
* Re: Memcg stat for available memory

From: David Rientjes @ 2020-07-10 19:47 UTC
To: Michal Hocko
Cc: Shakeel Butt, Yang Shi, Roman Gushchin, Greg Thelen,
    Johannes Weiner, Vladimir Davydov, Andrew Morton, Cgroups, Linux MM

On Tue, 7 Jul 2020, David Rientjes wrote:

> Another use case would be motivated by exactly the MemAvailable use
> case: when bound to a memcg hierarchy, how much memory is available,
> without substantial swap or risk of oom, for starting a new process or
> service?  This would not trigger any memory.low or PSI notification,
> but it is a heuristic that can be used to determine what can and
> cannot be started without incurring substantial memory reclaim.
>
> I'm indifferent to whether this would be a "reclaimable" or
> "available" metric, with a slight preference toward making it as
> similar in calculation to MemAvailable as possible.  So I think the
> question is whether this is something users should derive themselves
> based on the memcg stats that are exported, or whether we should
> solidify it, based on how the kernel handles reclaim, as a metric that
> will carry over across kernel versions.

To try to get more discussion on the subject, consider a malloc
implementation, like tcmalloc, that does MADV_DONTNEED to free memory
back to the system, and how this freed memory is then described to
userspace depending on the kernel implementation.

[ For the sake of this discussion, consider we have precise memcg stats
  available to us, although the actual implementation allows for some
  variance (MEMCG_CHARGE_BATCH). ]

With a 64MB heap backed by thp on x86, for example, the vma starts with
an rss of 64MB, all of which is anon and backed by hugepages.  Imagine
some aggressive MADV_DONTNEED freeing that ends up with only a single
4KB page mapped in each 2MB-aligned range.  The rss is now
32 * 4KB = 128KB.

Before freeing, anon, anon_thp, and active_anon in memory.stat would all
be the same for this vma (64MB).  64MB would also be charged to
memory.current.  That's all working as intended and to the expectation
of userspace.

After freeing, however, we have the kernel implementation specific
detail of how huge pmd splitting is handled (rss) in comparison to the
underlying split of the compound page (deferred split queue).  The huge
pmd is always split synchronously after MADV_DONTNEED so, as mentioned,
the rss is 128KB for this vma and none of it is backed by thp.

What is charged to the memcg (memory.current) and what is on active_anon
is unchanged, however, because the underlying compound pages are still
charged to the memcg.  The amounts of anon and anon_thp are decreased in
compliance with the splitting of the page tables, however.

So after freeing, for this vma: anon = 128KB, anon_thp = 0,
active_anon = 64MB, memory.current = 64MB.

In this case, because of the deferred split queue, which is a kernel
implementation detail, userspace may be unclear on what is actually
reclaimable -- and this memory is reclaimable under memory pressure.
For the motivation of MemAvailable (what amount of memory is available
for starting new work), userspace *could* determine this through the
aforementioned active_anon - anon (or some combination of
memory.current - anon - file - slab), but I think it's a fair point that
userspace's view of reclaimable memory should remain consistent between
versions as the kernel implementation changes.

Otherwise, an earlier implementation before deferred split queues could
have safely assumed that active_anon was unreclaimable unless swap were
enabled.  It doesn't have the foresight, based on future kernel
implementation detail, to reconcile what the amount of reclaimable
memory actually is.

The same discussion could happen for lazy free memory, which is anon but
now appears in the file lru stats and not the anon lru stats: it's
easily reclaimable under memory pressure, but you need to reconcile the
difference between the anon metric and what is revealed in the anon lru
stats.

That gave way to my original thought of a si_mem_available()-like
calculation ("avail") by doing

	free = memory.high - memory.current
	lazyfree = file - (active_file + inactive_file)
	deferred = active_anon - anon

	avail = free + lazyfree + deferred +
		(active_file + inactive_file + slab_reclaimable) / 2

And we have the ability to change this formula based on kernel
implementation details as they evolve.  The idea is to provide a
consistent field that userspace can use to determine the rough amount of
reclaimable memory in a MemAvailable-like way.
* Re: Memcg stat for available memory 2020-07-10 19:47 ` David Rientjes @ 2020-07-10 21:04 ` Yang Shi 2020-07-12 22:02 ` David Rientjes 0 siblings, 1 reply; 7+ messages in thread From: Yang Shi @ 2020-07-10 21:04 UTC (permalink / raw) To: David Rientjes Cc: Michal Hocko, Shakeel Butt, Yang Shi, Roman Gushchin, Greg Thelen, Johannes Weiner, Vladimir Davydov, Andrew Morton, Cgroups, Linux MM On Fri, Jul 10, 2020 at 12:49 PM David Rientjes <rientjes@google.com> wrote: > > On Tue, 7 Jul 2020, David Rientjes wrote: > > > Another use case would be motivated by exactly the MemAvailable use case: > > when bound to a memcg hierarchy, how much memory is available without > > substantial swap or risk of oom for starting a new process or service? > > This would not trigger any memory.low or PSI notification but is a > > heuristic that can be used to determine what can and cannot be started > > without incurring substantial memory reclaim. > > > > I'm indifferent to whether this would be a "reclaimable" or "available" > > metric, with a slight preference toward making it as similar in > > calculation to MemAvailable as possible, so I think the question is > > whether this is something the user should be deriving themselves based on > > memcg stats that are exported or whether we should solidify this based on > > how the kernel handles reclaim as a metric that will carry over across > > kernel vesions? > > > > To try to get more discussion on the subject, consider a malloc > implementation, like tcmalloc, that does MADV_DONTNEED to free memory back > to the system and how this freed memory is then described to userspace > depending on the kernel implementation. > > [ For the sake of this discussion, consider we have precise memcg stats > available to us although the actual implementation allows for some > variance (MEMCG_CHARGE_BATCH). ] > > With a 64MB heap backed by thp on x86, for example, the vma starts with an > rss of 64MB, all of which is anon and backed by hugepages. 
Imagine some > aggressive MADV_DONTNEED freeing that ends up with only a single 4KB page > mapped in each 2MB aligned range. The rss is now 32 * 4KB = 128KB. > > Before freeing, anon, anon_thp, and active_anon in memory.stat would all > be the same for this vma (64MB). 64MB would also be charged to > memory.current. That's all working as intended and to the expectation of > userspace. > > After freeing, however, we have the kernel implementation-specific detail > of how huge pmd splitting is handled (rss) in comparison to the underlying > split of the compound page (deferred split queue). The huge pmd is always > split synchronously after MADV_DONTNEED so, as mentioned, the rss is 128KB > for this vma and none of it is backed by thp. > > What is charged to the memcg (memory.current) and what is on active_anon > is unchanged, however, because the underlying compound pages are still > charged to the memcg. The amounts of anon and anon_thp, however, are > decreased in accordance with the splitting of the page tables. > > So after freeing, for this vma: anon = 128KB, anon_thp = 0, > active_anon = 64MB, memory.current = 64MB. > > In this case, because of the deferred split queue, which is a kernel > implementation detail, userspace may be unclear on what is actually > reclaimable -- and this memory is reclaimable under memory pressure. For > the motivation of MemAvailable (what amount of memory is available for > starting new work), userspace *could* determine this through the > aforementioned active_anon - anon (or some combination of > memory.current - anon - file - slab), but I think it's a fair point that > userspace's view of reclaimable memory as the kernel implementation > changes is something that can and should remain consistent between > versions. > > Otherwise, an earlier implementation before deferred split queues could > have safely assumed that active_anon was unreclaimable unless swap were > enabled.
It doesn't have the foresight based on future kernel > implementation detail to reconcile what the amount of reclaimable memory > actually is. > > Same discussion could happen for lazy free memory which is anon but now > appears on the file lru stats and not the anon lru stats: it's easily > reclaimable under memory pressure but you need to reconcile the difference > between the anon metric and what is revealed in the anon lru stats. > > That gave way to my original thought of a si_mem_available()-like > calculation ("avail") by doing

> free = memory.high - memory.current

I'm wondering what if high or max is set to the max limit. Don't you end up seeing a super large memavail?

> lazyfree = file - (active_file + inactive_file)

Isn't it (active_file + inactive_file) - file? It looks like MADV_FREE just updates the inactive lru size.

> deferred = active_anon - anon
>
> avail = free + lazyfree + deferred +
>         (active_file + inactive_file + slab_reclaimable) / 2
>
> And we have the ability to change this formula based on kernel > implementation details as they evolve. Idea is to provide a consistent > field that userspace can use to determine the rough amount of reclaimable > memory in a MemAvailable-like way.

^ permalink raw reply [flat|nested] 7+ messages in thread
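The accounting in the 64MB thp example can be sketched numerically as follows. This is an illustration of the figures quoted in the thread, not kernel code; the variable names are hypothetical.

```python
# Illustrative arithmetic for the 64MB thp heap example above: after
# MADV_DONTNEED leaves one 4KB page mapped per 2MB-aligned range, the
# huge pmds are split synchronously but the compound pages sit on the
# deferred split queue, still charged to the memcg.

KB = 1024
MB = 1024 * KB

heap = 64 * MB
hugepage = 2 * MB
nr_hugepages = heap // hugepage          # 32 compound pages

# rss after freeing: one 4KB pte left per former 2MB mapping
rss = nr_hugepages * 4 * KB              # 32 * 4KB = 128KB

# memory.stat view after freeing, per the example in this thread
anon = rss                               # 128KB: pmds split synchronously
anon_thp = 0                             # nothing mapped as thp anymore
active_anon = heap                       # 64MB: compound pages not yet split
current = heap                           # 64MB: still charged to the memcg

# what userspace could infer is reclaimable via the deferred split queue
deferred = active_anon - anon            # 64MB - 128KB
```

The gap between active_anon and anon is exactly the memory that looks charged and unreclaimable to a naive reader of memory.stat but is in fact freeable under pressure.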
* Re: Memcg stat for available memory 2020-07-10 21:04 ` Yang Shi @ 2020-07-12 22:02 ` David Rientjes 0 siblings, 0 replies; 7+ messages in thread From: David Rientjes @ 2020-07-12 22:02 UTC (permalink / raw) To: Yang Shi Cc: Michal Hocko, Shakeel Butt, Yang Shi, Roman Gushchin, Greg Thelen, Johannes Weiner, Vladimir Davydov, Andrew Morton, Cgroups, Linux MM On Fri, 10 Jul 2020, Yang Shi wrote: > > To try to get more discussion on the subject, consider a malloc > > implementation, like tcmalloc, that does MADV_DONTNEED to free memory back > > to the system and how this freed memory is then described to userspace > > depending on the kernel implementation. > > > > [ For the sake of this discussion, consider we have precise memcg stats > > available to us although the actual implementation allows for some > > variance (MEMCG_CHARGE_BATCH). ] > > > > With a 64MB heap backed by thp on x86, for example, the vma starts with an > > rss of 64MB, all of which is anon and backed by hugepages. Imagine some > > aggressive MADV_DONTNEED freeing that ends up with only a single 4KB page > > mapped in each 2MB aligned range. The rss is now 32 * 4KB = 128KB. > > > > Before freeing, anon, anon_thp, and active_anon in memory.stat would all > > be the same for this vma (64MB). 64MB would also be charged to > > memory.current. That's all working as intended and to the expectation of > > userspace. > > > > After freeing, however, we have the kernel implementation specific detail > > of how huge pmd splitting is handled (rss) in comparison to the underlying > > split of the compound page (deferred split queue). The huge pmd is always > > split synchronously after MADV_DONTNEED so, as mentioned, the rss is 128KB > > for this vma and none of it is backed by thp. > > > > What is charged to the memcg (memory.current) and what is on active_anon > > is unchanged, however, because the underlying compound pages are still > > charged to the memcg. 
The amounts of anon and anon_thp, however, are decreased > > in accordance with the splitting of the page tables. > > > > So after freeing, for this vma: anon = 128KB, anon_thp = 0, > > active_anon = 64MB, memory.current = 64MB. > > > > In this case, because of the deferred split queue, which is a kernel > > implementation detail, userspace may be unclear on what is actually > > reclaimable -- and this memory is reclaimable under memory pressure. For > > the motivation of MemAvailable (what amount of memory is available for > > starting new work), userspace *could* determine this through the > > aforementioned active_anon - anon (or some combination of > > memory.current - anon - file - slab), but I think it's a fair point that > > userspace's view of reclaimable memory as the kernel implementation > > changes is something that can and should remain consistent between > > versions. > > > > Otherwise, an earlier implementation before deferred split queues could > > have safely assumed that active_anon was unreclaimable unless swap were > > enabled. It doesn't have the foresight based on future kernel > > implementation detail to reconcile what the amount of reclaimable memory > > actually is. > > > > Same discussion could happen for lazy free memory which is anon but now > > appears on the file lru stats and not the anon lru stats: it's easily > > reclaimable under memory pressure but you need to reconcile the difference > > between the anon metric and what is revealed in the anon lru stats. > > > > That gave way to my original thought of a si_mem_available()-like > > calculation ("avail") by doing

> > free = memory.high - memory.current
>
> I'm wondering what if high or max is set to the max limit. Don't you end
> up seeing a super large memavail?
Hi Yang,

Yes, this would be the same as seeing a super large limit :)

I'm indifferent to whether this is described as an available amount of memory (almost identical to MemAvailable) or a best guess of the reclaimable amount of memory from the memory that is currently charged. Concept is to provide userspace with this best guess like we do for system memory through MemAvailable because it (a) depends on implementation details in the kernel and (b) is the only way to maintain consistency from version to version.

> > lazyfree = file - (active_file + inactive_file)
>
> Isn't it (active_file + inactive_file) - file? It looks like MADV_FREE
> just updates the inactive lru size.

Yes, you're right, this would be

lazyfree = (active_file + inactive_file) - file

from memory.stat. Lazy free memory is clean anon pages on the inactive file lru, but we must consider active_file + inactive_file in comparison to "file" for the total amount of lazy free. Another side effect of this is that we'd need anon - lazyfree swap space available for this workload to be swapped.

The overall point I'm trying to highlight is that the amount of memory that can be freed under memory pressure, either lazy free or on the deferred split queues, can be substantial. I'd like to discuss the feasibility of adding this as a kernel-maintained stat to memory.stat rather than userspace attempting to derive this on its own.

> > deferred = active_anon - anon
> >
> > avail = free + lazyfree + deferred +
> >         (active_file + inactive_file + slab_reclaimable) / 2
> >
> > And we have the ability to change this formula based on kernel
> > implementation details as they evolve. Idea is to provide a consistent
> > field that userspace can use to determine the rough amount of
> > reclaimable memory in a MemAvailable-like way.

^ permalink raw reply [flat|nested] 7+ messages in thread
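Putting the corrections from this reply together, a userspace sketch might look like the following. The corrected lazyfree sign comes from this exchange; the handling of an unset memory.high ("max") is an assumption of this sketch, since the thread only notes the problem without prescribing a fix.

```python
# Sketch of the "avail" heuristic with the corrections from this reply:
# lazyfree is (active_file + inactive_file) - file, and a memcg whose
# memory.high is "max" should not report an enormous free figure.
# Everything here is illustrative, not a kernel interface.

def memcg_avail(high, current, stat, system_available=None):
    # When memory.high reads "max" (unlimited), the memcg imposes no
    # bound of its own; fall back to a system-wide figure if supplied.
    if high == "max":
        if system_available is None:
            return None          # no meaningful memcg-local bound
        free = system_available
    else:
        free = max(high - current, 0)

    # Corrected: MADV_FREE pages sit on the inactive file lru but are
    # still counted as anon, so the file lrus exceed "file" by lazyfree.
    lazyfree = max(stat["active_file"] + stat["inactive_file"]
                   - stat["file"], 0)
    # Compound pages still charged but partially unmapped (deferred split).
    deferred = max(stat["active_anon"] - stat["anon"], 0)
    reclaimable = (stat["active_file"] + stat["inactive_file"]
                   + stat["slab_reclaimable"]) // 2
    return free + lazyfree + deferred + reclaimable
```

The clamping to zero guards against the batching slack (MEMCG_CHARGE_BATCH) mentioned earlier, which can make the raw differences briefly negative.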
end of thread, other threads:[~2020-07-12 22:02 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-06-28 22:15 Memcg stat for available memory David Rientjes
2020-07-02 15:22 ` Shakeel Butt
2020-07-03  8:15 ` Michal Hocko
2020-07-07 19:58 ` David Rientjes
2020-07-10 19:47 ` David Rientjes
2020-07-10 21:04 ` Yang Shi
2020-07-12 22:02 ` David Rientjes