linux-mm.kvack.org archive mirror
* Memcg stat for available memory
@ 2020-06-28 22:15 David Rientjes
  2020-07-02 15:22 ` Shakeel Butt
  0 siblings, 1 reply; 7+ messages in thread
From: David Rientjes @ 2020-06-28 22:15 UTC (permalink / raw)
  To: Johannes Weiner, Michal Hocko, Vladimir Davydov
  Cc: Andrew Morton, Shakeel Butt, cgroups, linux-mm

Hi everybody,

I'd like to discuss the feasibility of a stat similar to 
si_mem_available() but at memcg scope which would specify how much memory 
can be charged without I/O.

The si_mem_available() stat is based on heuristics so this does not 
provide an exact quantity that is actually available at any given time, 
but can otherwise provide userspace with some guidance on the amount of 
reclaimable memory.  See the description in 
Documentation/filesystems/proc.rst and its implementation.

 [ Naturally, userspace would need to understand both the amount of memory 
   that is available for allocation and for charging, separately, on an 
   overcommitted system.  I assume this is trivial.  (Why don't we provide 
   MemAvailable in per-node meminfo?) ]

For such a stat at memcg scope, we can ignore totalreserves and 
watermarks.  We already have ~precise (modulo MEMCG_CHARGE_BATCH) data for 
both file pages and slab_reclaimable.

We can infer lazily free memory by doing

	file - (active_file + inactive_file)

(This is necessary because lazy free memory is anon but on the inactive 
 file lru and we can't infer lazy freeable memory through pglazyfree -
 pglazyfreed, they are event counters.)

We can also infer the number of underlying compound pages that are on 
deferred split queues but have yet to be split with active_anon - anon (or
is this a bug? :)

So it *seems* like userspace can make a si_mem_available()-like 
calculation ("avail") by doing

	free = memory.high - memory.current
	lazyfree = file - (active_file + inactive_file)
	deferred = active_anon - anon

	avail = free + lazyfree + deferred +
		(active_file + inactive_file + slab_reclaimable) / 2
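
To make that concrete, a rough userspace sketch of the calculation against
the cgroup v2 files (only a sketch, following the formula exactly as written
above; the cgroup path and helper names are made up):

import os

def read_stat(cg):
    # parse memory.stat into a dict of byte counts
    stat = {}
    with open(os.path.join(cg, "memory.stat")) as f:
        for line in f:
            key, value = line.split()
            stat[key] = int(value)
    return stat

def read_bytes(cg, name):
    # memory.high (and memory.max) may contain "max"
    with open(os.path.join(cg, name)) as f:
        v = f.read().strip()
    return float("inf") if v == "max" else int(v)

def avail_estimate(cg):
    s = read_stat(cg)
    free = read_bytes(cg, "memory.high") - read_bytes(cg, "memory.current")
    lazyfree = s["file"] - (s["active_file"] + s["inactive_file"])
    deferred = s["active_anon"] - s["anon"]
    return free + lazyfree + deferred + \
        (s["active_file"] + s["inactive_file"] + s["slab_reclaimable"]) // 2

print(avail_estimate("/sys/fs/cgroup/workload"))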

For userspace interested in knowing how much memory it can charge without 
incurring I/O (and assuming it has knowledge of available memory on an 
overcommitted system), it seems like:

 (a) it can derive the above avail amount that is at least similar to
     MemAvailable,

 (b) it can assume that all reclaim is considered equal so anything more
     than memory.high - memory.current is disruptive enough that it's a
     better heuristic than the above, or

 (c) the kernel provide an "avail" stat in memory.stat based on the above 
     and can evolve as the kernel implementation changes (how lazy free 
     memory impacts anon vs file lru stats, how deferred split memory is 
     handled, any future extensions for "easily reclaimable memory") that 
     userspace can count on to the same degree it can count on 
     MemAvailable.

Any thoughts?


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Memcg stat for available memory
  2020-06-28 22:15 Memcg stat for available memory David Rientjes
@ 2020-07-02 15:22 ` Shakeel Butt
  2020-07-03  8:15   ` Michal Hocko
  0 siblings, 1 reply; 7+ messages in thread
From: Shakeel Butt @ 2020-07-02 15:22 UTC (permalink / raw)
  To: David Rientjes, Yang Shi, Roman Gushchin, Greg Thelen
  Cc: Johannes Weiner, Michal Hocko, Vladimir Davydov, Andrew Morton,
	Cgroups, Linux MM

(Adding more people who might be interested in this)


On Sun, Jun 28, 2020 at 3:15 PM David Rientjes <rientjes@google.com> wrote:
>
> Hi everybody,
>
> I'd like to discuss the feasibility of a stat similar to
> si_mem_available() but at memcg scope which would specify how much memory
> can be charged without I/O.
>
> The si_mem_available() stat is based on heuristics so this does not
> provide an exact quantity that is actually available at any given time,
> but can otherwise provide userspace with some guidance on the amount of
> reclaimable memory.  See the description in
> Documentation/filesystems/proc.rst and its implementation.
>
>  [ Naturally, userspace would need to understand both the amount of memory
>    that is available for allocation and for charging, separately, on an
>    overcommitted system.  I assume this is trivial.  (Why don't we provide
>    MemAvailable in per-node meminfo?) ]
>
> For such a stat at memcg scope, we can ignore totalreserves and
> watermarks.  We already have ~precise (modulo MEMCG_CHARGE_BATCH) data for
> both file pages and slab_reclaimable.
>
> We can infer lazily free memory by doing
>
>         file - (active_file + inactive_file)
>
> (This is necessary because lazy free memory is anon but on the inactive
>  file lru and we can't infer lazy freeable memory through pglazyfree -
>  pglazyfreed, they are event counters.)
>
> We can also infer the number of underlying compound pages that are on
> deferred split queues but have yet to be split with active_anon - anon (or
> is this a bug? :)
>
> So it *seems* like userspace can make a si_mem_available()-like
> calculation ("avail") by doing
>
>         free = memory.high - memory.current
>         lazyfree = file - (active_file + inactive_file)
>         deferred = active_anon - anon
>
>         avail = free + lazyfree + deferred +
>                 (active_file + inactive_file + slab_reclaimable) / 2
>
> For userspace interested in knowing how much memory it can charge without
> incurring I/O (and assuming it has knowledge of available memory on an
> overcommitted system), it seems like:
>
>  (a) it can derive the above avail amount that is at least similar to
>      MemAvailable,
>
>  (b) it can assume that all reclaim is considered equal so anything more
>      than memory.high - memory.current is disruptive enough that it's a
>      better heuristic than the above, or
>
>  (c) the kernel provide an "avail" stat in memory.stat based on the above
>      and can evolve as the kernel implementation changes (how lazy free
>      memory impacts anon vs file lru stats, how deferred split memory is
>      handled, any future extensions for "easily reclaimable memory") that
>      userspace can count on to the same degree it can count on
>      MemAvailable.
>
> Any thoughts?


I think we need to answer two questions:

1) What's the use-case?
2) Why is user space calculating their MemAvailable themselves not good?

The use case I have in mind is the latency sensitive distributed
caching service which would prefer to reduce the amount of its caching
over the stalls incurred by hitting the limit. Such applications can
monitor their MemAvailable and adjust their caching footprint.
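
As a rough illustration (only a sketch; get_avail_bytes and shrink_cache are
placeholders for however the service estimates availability and releases
cache memory, and the watermark values are arbitrary):

import time

def cache_sizing_loop(get_avail_bytes, shrink_cache,
                      low_watermark=256 << 20, step=64 << 20, period=1.0):
    # poll the availability estimate and give cache memory back
    # (e.g. free entries / MADV_DONTNEED) before the limit is hit
    while True:
        if get_avail_bytes() < low_watermark:
            shrink_cache(step)
        time.sleep(period)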

For the second, I think it is to hide the internal implementation
details of the kernel from the user space. The deferred split queues
is an internal detail and we don't want that exposed to the user.
Similarly how lazyfree is implemented (i.e. anon pages on file LRU)
should not be exposed to the users.

Shakeel


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Memcg stat for available memory
  2020-07-02 15:22 ` Shakeel Butt
@ 2020-07-03  8:15   ` Michal Hocko
  2020-07-07 19:58     ` David Rientjes
  0 siblings, 1 reply; 7+ messages in thread
From: Michal Hocko @ 2020-07-03  8:15 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: David Rientjes, Yang Shi, Roman Gushchin, Greg Thelen,
	Johannes Weiner, Vladimir Davydov, Andrew Morton, Cgroups,
	Linux MM

I am sorry, I was busy and didn't get to it sooner.

On Thu 02-07-20 08:22:10, Shakeel Butt wrote:
> (Adding more people who might be interested in this)
> 
> 
> On Sun, Jun 28, 2020 at 3:15 PM David Rientjes <rientjes@google.com> wrote:
> >
> > Hi everybody,
> >
> > I'd like to discuss the feasibility of a stat similar to
> > si_mem_available() but at memcg scope which would specify how much memory
> > can be charged without I/O.
> >
> > The si_mem_available() stat is based on heuristics so this does not
> > provide an exact quantity that is actually available at any given time,
> > but can otherwise provide userspace with some guidance on the amount of
> > reclaimable memory.  See the description in
> > Documentation/filesystems/proc.rst and its implementation.

I have to say I was a fan of this metric when it was introduced, mostly
because it removed the nasty subtle detail that the Cached value
includes swap-backed memory (e.g. shmem), which has caused a lot
of confusion. But I became very skeptical over time because it is really
hard to set expectations right when relying on the value, for two main
reasons:
	- it is a global snapshot value and as such it becomes largely
	  unusable for any decisions which are not implemented right
	  away or if there are multiple uncoordinated consumers.
	- it is not really hard to trigger "corner" cases where a careful
	  use of MemAvailable still leads to a lot of memory reclaim,
	  even for a single large consumer. What we consider reclaimable
	  might be pinned for different reasons, or the situation simply
	  changes. Our documentation claims that following this guidance
	  will help prevent swapping/reclaim, yet this is not true,
	  and I have seen bug reports in the past.

> >  [ Naturally, userspace would need to understand both the amount of memory
> >    that is available for allocation and for charging, separately, on an
> >    overcommitted system.  I assume this is trivial.  (Why don't we provide
> >    MemAvailable in per-node meminfo?) ]

I presume you mean the consumer would simply do min(global, memcg), right?
Well, a proper implementation of the value would have to be hierarchical,
so it would be the minimum over the whole memcg tree up to the root. We
cannot expect userspace to do that.
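
Just to illustrate what that would mean, userspace would have to do
something like the following walk for every decision (a rough sketch with
made-up names, and it only covers the charge headroom, not the reclaimable
part):

import os

def read_bytes(path):
    with open(path) as f:
        v = f.read().strip()
    return float("inf") if v == "max" else int(v)

def hierarchical_headroom(cg, root="/sys/fs/cgroup"):
    # effective headroom is the minimum of (limit - usage) over every
    # ancestor memcg up to the cgroup root
    headroom = float("inf")
    cg = os.path.abspath(cg)
    while cg.startswith(root) and cg != root:
        usage = read_bytes(os.path.join(cg, "memory.current"))
        limit = min(read_bytes(os.path.join(cg, "memory.high")),
                    read_bytes(os.path.join(cg, "memory.max")))
        headroom = min(headroom, limit - usage)
        cg = os.path.dirname(cg)
    return headroom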

While technically possible and not that hard to express, I am worried the
existing problems with the value would just be amplified because there
is even more volatility here. The global value mostly depends on consumers;
now you have a second source of volatility, and that is the memcg limit
(hard or high), which can be changed quite dynamically. Sure, the global
case can have a similar issue with memory hotplug, but realistically that
is far from common. Another complication is that the amount of reclaim
for each memcg depends on the reclaimability of other memcgs under
global memory pressure (just consider low/min protection as the simplest
example). So I expect the imprecision will be even harder to predict for a
per-memcg value.

> > For such a stat at memcg scope, we can ignore totalreserves and
> > watermarks.  We already have ~precise (modulo MEMCG_CHARGE_BATCH) data for
> > both file pages and slab_reclaimable.
> >
> > We can infer lazily free memory by doing
> >
> >         file - (active_file + inactive_file)
> >
> > (This is necessary because lazy free memory is anon but on the inactive
> >  file lru and we can't infer lazy freeable memory through pglazyfree -
> >  pglazyfreed, they are event counters.)
> >
> > We can also infer the number of underlying compound pages that are on
> > deferred split queues but have yet to be split with active_anon - anon (or
> > is this a bug? :)
> >
> > So it *seems* like userspace can make a si_mem_available()-like
> > calculation ("avail") by doing
> >
> >         free = memory.high - memory.current

min(memory.high, memory.max)

> >         lazyfree = file - (active_file + inactive_file)
> >         deferred = active_anon - anon
> >
> >         avail = free + lazyfree + deferred +
> >                 (active_file + inactive_file + slab_reclaimable) / 2

I am not sure why you want to trigger lazy free differently from the
global value. But this is really a minor technical thing which is not
really all that interesting until we actually can define what would be
the real usecase.

> > For userspace interested in knowing how much memory it can charge without
> > incurring I/O (and assuming it has knowledge of available memory on an
> > overcommitted system), it seems like:
> >
> >  (a) it can derive the above avail amount that is at least similar to
> >      MemAvailable,
> >
> >  (b) it can assume that all reclaim is considered equal so anything more
> >      than memory.high - memory.current is disruptive enough that it's a
> >      better heuristic than the above, or
> >
> >  (c) the kernel provide an "avail" stat in memory.stat based on the above
> >      and can evolve as the kernel implementation changes (how lazy free
> >      memory impacts anon vs file lru stats, how deferred split memory is
> >      handled, any future extensions for "easily reclaimable memory") that
> >      userspace can count on to the same degree it can count on
> >      MemAvailable.
> >
> > Any thoughts?
> 
> 
> I think we need to answer two questions:
> 
> 1) What's the use-case?
> 2) Why is user space calculating their MemAvailable themselves not good?

These are questions the discussion should have started with. Thanks!

> The use case I have in mind is the latency sensitive distributed
> caching service which would prefer to reduce the amount of its caching
> over the stalls incurred by hitting the limit. Such applications can
> monitor their MemAvailable and adjust their caching footprint.

Is the value really reliable enough to implement such logic, though? I
have mentioned some problems above. The situation might change at any
time, and the source of that change might be external to the memcg, so the
value would have to be pro-actively polled all the time. This doesn't
sound very viable to me, especially for a latency sensitive service.
Wouldn't it make more sense to protect the service and dynamically
change the low memory protection based on the external memory pressure?
There are different ways to achieve that, e.g. watch for LOW event
notifications and/or PSI metrics. I believe FB relies on such
dynamic scaling a lot.
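
A minimal sketch of the PSI trigger side (it assumes CONFIG_PSI; the
reaction itself is left as a stub):

import os, select

def watch_memory_pressure(cg, on_pressure):
    # register a PSI trigger on memory.pressure: fire when "some" tasks
    # are stalled on memory for more than 150ms within a 1s window
    fd = os.open(os.path.join(cg, "memory.pressure"),
                 os.O_RDWR | os.O_NONBLOCK)
    os.write(fd, b"some 150000 1000000")
    poller = select.poll()
    poller.register(fd, select.POLLPRI)
    while True:
        if poller.poll(-1):
            on_pressure()   # e.g. shrink caches or adjust memory.low

(memory.events can similarly be watched for "low" notifications, e.g. with
inotify.)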
 
> For the second, I think it is to hide the internal implementation
> details of the kernel from the user space. The deferred split queues
> is an internal detail and we don't want that exposed to the user.
> Similarly how lazyfree is implemented (i.e. anon pages on file LRU)
> should not be exposed to the users.

I would tend to agree that there is a lot of internal logic that can
skew existing statistics and that might be confusing. But I am not sure
that providing something that aims to hide them yet is hard to use is a
proper way forward. But maybe I am just too pessimistic. I would be
happy to be convinced otherwise.

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Memcg stat for available memory
  2020-07-03  8:15   ` Michal Hocko
@ 2020-07-07 19:58     ` David Rientjes
  2020-07-10 19:47       ` David Rientjes
  0 siblings, 1 reply; 7+ messages in thread
From: David Rientjes @ 2020-07-07 19:58 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Shakeel Butt, Yang Shi, Roman Gushchin, Greg Thelen,
	Johannes Weiner, Vladimir Davydov, Andrew Morton, Cgroups,
	Linux MM

On Fri, 3 Jul 2020, Michal Hocko wrote:

> > > I'd like to discuss the feasibility of a stat similar to
> > > si_mem_available() but at memcg scope which would specify how much memory
> > > can be charged without I/O.
> > >
> > > The si_mem_available() stat is based on heuristics so this does not
> > > provide an exact quantity that is actually available at any given time,
> > > but can otherwise provide userspace with some guidance on the amount of
> > > reclaimable memory.  See the description in
> > > Documentation/filesystems/proc.rst and its implementation.
> 
> I have to say I was a fan of this metric when it was introduced, mostly
> because it removed the nasty subtle detail that the Cached value
> includes swap-backed memory (e.g. shmem), which has caused a lot
> of confusion. But I became very skeptical over time because it is really
> hard to set expectations right when relying on the value, for two main
> reasons:
> 	- it is a global snapshot value and as such it becomes largely
> 	  unusable for any decisions which are not implemented right
> 	  away or if there are multiple uncoordinated consumers.
> 	- it is not really hard to trigger "corner" cases where a careful
> 	  use of MemAvailable still leads to a lot of memory reclaim,
> 	  even for a single large consumer. What we consider reclaimable
> 	  might be pinned for different reasons, or the situation simply
> 	  changes. Our documentation claims that following this guidance
> 	  will help prevent swapping/reclaim, yet this is not true,
> 	  and I have seen bug reports in the past.
> 

Hi everybody,

I agree that mileage may vary with MemAvailable and that it is only 
representative of the current state of memory at the time it is grabbed 
(although with a memcg equivalent we could have more of a guarantee here 
on a system that is not overcommitted).  I think it's best viewed as our 
best guess of the current amount of free + inexpensively reclaimable 
memory that exists at any point in time.

An alternative would be to describe simply the amount of memory that we 
anticipate is reclaimable.  This doesn't get around the pinning issue but 
does provide, like MemAvailable, a field that can be queried that will be 
stable over kernel versions for what the kernel perceives as reclaimable 
and, importantly, makes its reclaim decisions based on.

> > >  [ Naturally, userspace would need to understand both the amount of memory
> > >    that is available for allocation and for charging, separately, on an
> > >    overcommitted system.  I assume this is trivial.  (Why don't we provide
> > >    MemAvailable in per-node meminfo?) ]
> 
> I presume you mean the consumer would simply do min(global, memcg), right?
> Well, a proper implementation of the value would have to be hierarchical,
> so it would be the minimum over the whole memcg tree up to the root. We
> cannot expect userspace to do that.
> 
> While technically possible and not that hard to express, I am worried the
> existing problems with the value would just be amplified because there
> is even more volatility here. The global value mostly depends on consumers;
> now you have a second source of volatility, and that is the memcg limit
> (hard or high), which can be changed quite dynamically. Sure, the global
> case can have a similar issue with memory hotplug, but realistically that
> is far from common. Another complication is that the amount of reclaim
> for each memcg depends on the reclaimability of other memcgs under
> global memory pressure (just consider low/min protection as the simplest
> example). So I expect the imprecision will be even harder to predict for a
> per-memcg value.
> 

Yeah, I think it's best approached by considering the global MemAvailable 
separately from any per-memcg metric and they have different scope 
depending on whether you're the application manager or whether you're the 
process/library attached to a memcg that is trying to orchestrate its own 
memory usage.  The memcg view of the amount of reclaimable memory (or 
available [free + reclaimable]) should be specific to that hierarchy 
without considering the global MemAvailable, just as we can incur reclaim 
and/or oom today both from the page allocator and memcg charge path 
separately.  It's two different contexts.

The simplest example would be a malloc implementation that derives benefit 
from keeping as much heap backed by hugepages as possible and so attempts 
to avoid splitting huge pmds absent memory pressure.  As 
the amount of available memory decreases for whatever reason (more user 
or kernel memory charged to the hierarchy), it could start releasing more 
memory back to the system and splitting these pmds.  This may not only be 
memory.{high,max} - memory.current since it may be preferable for there to 
be some reclaim activity coupled with "userspace reclaim" at this mark, 
such as doing MADV_DONTNEED for memory on the malloc freelist.
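
A rough sketch of that kind of policy (the names and the watermark are made
up; the release path is just madvise(MADV_DONTNEED) on page-aligned
freelist spans, which also splits the huge pmds):

import ctypes

libc = ctypes.CDLL(None, use_errno=True)
MADV_DONTNEED = 4   # from <asm-generic/mman-common.h>

def maybe_release(avail_bytes, freelist_spans, watermark=128 << 20):
    # keep the heap hugepage-backed while there is headroom; once the
    # memcg's availability estimate drops below the watermark, give
    # freelist spans (addr, length) back to the kernel
    if avail_bytes >= watermark:
        return
    for addr, length in freelist_spans:
        libc.madvise(ctypes.c_void_p(addr), ctypes.c_size_t(length),
                     MADV_DONTNEED)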

> > > For such a stat at memcg scope, we can ignore totalreserves and
> > > watermarks.  We already have ~precise (modulo MEMCG_CHARGE_BATCH) data for
> > > both file pages and slab_reclaimable.
> > >
> > > We can infer lazily free memory by doing
> > >
> > >         file - (active_file + inactive_file)
> > >
> > > (This is necessary because lazy free memory is anon but on the inactive
> > >  file lru and we can't infer lazy freeable memory through pglazyfree -
> > >  pglazyfreed, they are event counters.)
> > >
> > > We can also infer the number of underlying compound pages that are on
> > > deferred split queues but have yet to be split with active_anon - anon (or
> > > is this a bug? :)
> > >
> > > So it *seems* like userspace can make a si_mem_available()-like
> > > calculation ("avail") by doing
> > >
> > >         free = memory.high - memory.current
> 
> min(memory.high, memory.max)
> 
> > >         lazyfree = file - (active_file + inactive_file)
> > >         deferred = active_anon - anon
> > >
> > >         avail = free + lazyfree + deferred +
> > >                 (active_file + inactive_file + slab_reclaimable) / 2
> 
> I am not sure why you want to trigger lazy free differently from the
> global value. But this is really a minor technical thing which is not
> really all that interesting until we actually can define what would be
> the real usecase.
> 
> > > For userspace interested in knowing how much memory it can charge without
> > > incurring I/O (and assuming it has knowledge of available memory on an
> > > overcommitted system), it seems like:
> > >
> > >  (a) it can derive the above avail amount that is at least similar to
> > >      MemAvailable,
> > >
> > >  (b) it can assume that all reclaim is considered equal so anything more
> > >      than memory.high - memory.current is disruptive enough that it's a
> > >      better heuristic than the above, or
> > >
> > >  (c) the kernel provide an "avail" stat in memory.stat based on the above
> > >      and can evolve as the kernel implementation changes (how lazy free
> > >      memory impacts anon vs file lru stats, how deferred split memory is
> > >      handled, any future extensions for "easily reclaimable memory") that
> > >      userspace can count on to the same degree it can count on
> > >      MemAvailable.
> > >
> > > Any thoughts?
> > 
> > 
> > I think we need to answer two questions:
> > 
> > 1) What's the use-case?
> > 2) Why is user space calculating their MemAvailable themselves not good?
> 
> These are questions the discussion should have started with. Thanks!
> 
> > The use case I have in mind is the latency sensitive distributed
> > caching service which would prefer to reduce the amount of its caching
> > over the stalls incurred by hitting the limit. Such applications can
> > monitor their MemAvailable and adjust their caching footprint.
> 
> Is the value really reliable enough to implement such logic, though? I
> have mentioned some problems above. The situation might change at any
> time, and the source of that change might be external to the memcg, so the
> value would have to be pro-actively polled all the time. This doesn't
> sound very viable to me, especially for a latency sensitive service.
> Wouldn't it make more sense to protect the service and dynamically
> change the low memory protection based on the external memory pressure?
> There are different ways to achieve that, e.g. watch for LOW event
> notifications and/or PSI metrics. I believe FB relies on such
> dynamic scaling a lot.
>  

Right, and given the limitations imposed by MEMCG_CHARGE_BATCH variance in 
the stats themselves, for example, this value will not be 100% accurate 
since none of the stats are 100% accurate :)

> > For the second, I think it is to hide the internal implementation
> > details of the kernel from the user space. The deferred split queues
> > is an internal detail and we don't want that exposed to the user.
> > Similarly how lazyfree is implemented (i.e. anon pages on file LRU)
> > should not be exposed to the users.
> 
> I would tend to agree that there is a lot of internal logic that can
> skew existing statistics and that might be confusing. But I am not sure
> that providing something that aims to hide them yet is hard to use is a
> proper way forward. But maybe I am just too pessimistic. I would be
> happy to be convinced otherwise.
> 

The idea would be to expose what the kernel deems to be available, much like 
MemAvailable, at a given time without requiring userspace to derive the 
value for itself.  Certainly it *can*, and I gave an example of how 
it would do that, but that requires an understanding of the metrics that 
the kernel exposes, how reclaim behaves, and an attempt to remain stable 
over multiple kernel versions, which is the same motivation as for the 
global metric.

Assume a memcg hierarchy that is serving that latency sensitive service 
that has been protected from the effects of global pressure but the amount 
of memory consumed by that service varies over releases.  How the kernel 
handles lazy free memory, how it handles deferred split queues, etc, are 
specific details that userspace may not have visibility into: the metric 
answers the question of "what can I actually get back if I call into 
reclaim?".  How much memory is on the deferred split queue can be 
substantial, for example, but userspace would be unaware of this unless 
they do something like active_anon - anon.

Another use case would be motivated by exactly the MemAvailable use case: 
when bound to a memcg hierarchy, how much memory is available without 
substantial swap or risk of oom for starting a new process or service?  
This would not trigger any memory.low or PSI notification but is a 
heuristic that can be used to determine what can and cannot be started 
without incurring substantial memory reclaim.
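
E.g. a job manager could do a rough admission check before starting new
work (a sketch only, reusing the "avail" estimate sketched earlier in the
thread; the slack value is arbitrary):

def can_admit(cg, expected_peak_bytes, slack=64 << 20):
    # only start the new process/service if the memcg's availability
    # estimate covers its expected peak usage plus some slack
    return avail_estimate(cg) >= expected_peak_bytes + slack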

I'm indifferent to whether this would be a "reclaimable" or "available" 
metric, with a slight preference toward making it as similar in 
calculation to MemAvailable as possible, so I think the question is 
whether this is something the user should be deriving themselves based on 
memcg stats that are exported or whether we should solidify this based on 
how the kernel handles reclaim as a metric that will carry over across 
kernel versions?


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Memcg stat for available memory
  2020-07-07 19:58     ` David Rientjes
@ 2020-07-10 19:47       ` David Rientjes
  2020-07-10 21:04         ` Yang Shi
  0 siblings, 1 reply; 7+ messages in thread
From: David Rientjes @ 2020-07-10 19:47 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Shakeel Butt, Yang Shi, Roman Gushchin, Greg Thelen,
	Johannes Weiner, Vladimir Davydov, Andrew Morton, Cgroups,
	Linux MM

On Tue, 7 Jul 2020, David Rientjes wrote:

> Another use case would be motivated by exactly the MemAvailable use case: 
> when bound to a memcg hierarchy, how much memory is available without 
> substantial swap or risk of oom for starting a new process or service?  
> This would not trigger any memory.low or PSI notification but is a 
> heuristic that can be used to determine what can and cannot be started 
> without incurring substantial memory reclaim.
> 
> I'm indifferent to whether this would be a "reclaimable" or "available" 
> metric, with a slight preference toward making it as similar in 
> calculation to MemAvailable as possible, so I think the question is 
> whether this is something the user should be deriving themselves based on 
> memcg stats that are exported or whether we should solidify this based on 
> how the kernel handles reclaim as a metric that will carry over across 
> kernel versions?
> 

To try to get more discussion on the subject, consider a malloc 
implementation, like tcmalloc, that does MADV_DONTNEED to free memory back 
to the system and how this freed memory is then described to userspace 
depending on the kernel implementation.

 [ For the sake of this discussion, consider we have precise memcg stats 
   available to us although the actual implementation allows for some
   variance (MEMCG_CHARGE_BATCH). ]

With a 64MB heap backed by thp on x86, for example, the vma starts with an 
rss of 64MB, all of which is anon and backed by hugepages.  Imagine some 
aggressive MADV_DONTNEED freeing that ends up with only a single 4KB page 
mapped in each 2MB aligned range.  The rss is now 32 * 4KB = 128KB.

Before freeing, anon, anon_thp, and active_anon in memory.stat would all 
be the same for this vma (64MB).  64MB would also be charged to 
memory.current.  That's all working as intended and to the expectation of 
userspace.

After freeing, however, we have the kernel implementation specific detail 
of how huge pmd splitting is handled (rss) in comparison to the underlying 
split of the compound page (deferred split queue).  The huge pmd is always 
split synchronously after MADV_DONTNEED so, as mentioned, the rss is 128KB 
for this vma and none of it is backed by thp.

What is charged to the memcg (memory.current) and what is on active_anon 
is unchanged, however, because the underlying compound pages are still 
charged to the memcg.  The amount of anon and anon_thp are decreased 
in compliance with the splitting of the page tables, however.

So after freeing, for this vma: anon = 128KB, anon_thp = 0, 
active_anon = 64MB, memory.current = 64MB.
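
(This is easy to reproduce and observe; a rough sketch, assuming a
THP-enabled kernel, Python 3.8+ for mmap.madvise(), and a process already
running in the memcg of interest:)

import mmap

SZ = 64 << 20                              # 64MB anon heap
buf = mmap.mmap(-1, SZ)
buf.madvise(mmap.MADV_HUGEPAGE)
buf[:] = b"x" * SZ                         # fault in, ideally as 2MB pages

# free everything except one 4KB page in each 2MB-aligned range
for off in range(0, SZ, 2 << 20):
    buf.madvise(mmap.MADV_DONTNEED, off + 4096, (2 << 20) - 4096)

# now compare rss in /proc/self/smaps_rollup against anon, active_anon,
# and memory.current / memory.stat for the memcg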

In this case, because of the deferred split queue, which is a kernel 
implementation detail, userspace may be unclear on what is actually 
reclaimable -- and this memory is reclaimable under memory pressure.  For 
the motivation of MemAvailable (what amount of memory is available for 
starting new work), userspace *could* determine this through the 
aforementioned active_anon - anon (or some combination of
memory.current - anon - file - slab), but I think it's a fair point that 
userspace's view of reclaimable memory as the kernel implementation 
changes is something that can and should remain consistent between 
versions.

Otherwise, an earlier implementation before deferred split queues could 
have safely assumed that active_anon was unreclaimable unless swap were 
enabled.  It doesn't have the foresight based on future kernel 
implementation detail to reconcile what the amount of reclaimable memory 
actually is.

Same discussion could happen for lazy free memory which is anon but now 
appears on the file lru stats and not the anon lru stats: it's easily 
reclaimable under memory pressure but you need to reconcile the difference 
between the anon metric and what is revealed in the anon lru stats.

That gave way to my original thought of a si_mem_available()-like 
calculation ("avail") by doing

	free = memory.high - memory.current
	lazyfree = file - (active_file + inactive_file)
	deferred = active_anon - anon

	avail = free + lazyfree + deferred +
		(active_file + inactive_file + slab_reclaimable) / 2

And we have the ability to change this formula based on kernel 
implementation details as they evolve.  Idea is to provide a consistent 
field that userspace can use to determine the rough amount of reclaimable 
memory in a MemAvailable-like way.


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Memcg stat for available memory
  2020-07-10 19:47       ` David Rientjes
@ 2020-07-10 21:04         ` Yang Shi
  2020-07-12 22:02           ` David Rientjes
  0 siblings, 1 reply; 7+ messages in thread
From: Yang Shi @ 2020-07-10 21:04 UTC (permalink / raw)
  To: David Rientjes
  Cc: Michal Hocko, Shakeel Butt, Yang Shi, Roman Gushchin,
	Greg Thelen, Johannes Weiner, Vladimir Davydov, Andrew Morton,
	Cgroups, Linux MM

On Fri, Jul 10, 2020 at 12:49 PM David Rientjes <rientjes@google.com> wrote:
>
> On Tue, 7 Jul 2020, David Rientjes wrote:
>
> > Another use case would be motivated by exactly the MemAvailable use case:
> > when bound to a memcg hierarchy, how much memory is available without
> > substantial swap or risk of oom for starting a new process or service?
> > This would not trigger any memory.low or PSI notification but is a
> > heuristic that can be used to determine what can and cannot be started
> > without incurring substantial memory reclaim.
> >
> > I'm indifferent to whether this would be a "reclaimable" or "available"
> > metric, with a slight preference toward making it as similar in
> > calculation to MemAvailable as possible, so I think the question is
> > whether this is something the user should be deriving themselves based on
> > memcg stats that are exported or whether we should solidify this based on
> > how the kernel handles reclaim as a metric that will carry over across
> > kernel versions?
> >
>
> To try to get more discussion on the subject, consider a malloc
> implementation, like tcmalloc, that does MADV_DONTNEED to free memory back
> to the system and how this freed memory is then described to userspace
> depending on the kernel implementation.
>
>  [ For the sake of this discussion, consider we have precise memcg stats
>    available to us although the actual implementation allows for some
>    variance (MEMCG_CHARGE_BATCH). ]
>
> With a 64MB heap backed by thp on x86, for example, the vma starts with an
> rss of 64MB, all of which is anon and backed by hugepages.  Imagine some
> aggressive MADV_DONTNEED freeing that ends up with only a single 4KB page
> mapped in each 2MB aligned range.  The rss is now 32 * 4KB = 128KB.
>
> Before freeing, anon, anon_thp, and active_anon in memory.stat would all
> be the same for this vma (64MB).  64MB would also be charged to
> memory.current.  That's all working as intended and to the expectation of
> userspace.
>
> After freeing, however, we have the kernel implementation specific detail
> of how huge pmd splitting is handled (rss) in comparison to the underlying
> split of the compound page (deferred split queue).  The huge pmd is always
> split synchronously after MADV_DONTNEED so, as mentioned, the rss is 128KB
> for this vma and none of it is backed by thp.
>
> What is charged to the memcg (memory.current) and what is on active_anon
> is unchanged, however, because the underlying compound pages are still
> charged to the memcg.  The amount of anon and anon_thp are decreased
> in compliance with the splitting of the page tables, however.
>
> So after freeing, for this vma: anon = 128KB, anon_thp = 0,
> active_anon = 64MB, memory.current = 64MB.
>
> In this case, because of the deferred split queue, which is a kernel
> implementation detail, userspace may be unclear on what is actually
> reclaimable -- and this memory is reclaimable under memory pressure.  For
> the motivation of MemAvailable (what amount of memory is available for
> starting new work), userspace *could* determine this through the
> aforementioned active_anon - anon (or some combination of
> memory.current - anon - file - slab), but I think it's a fair point that
> userspace's view of reclaimable memory as the kernel implementation
> changes is something that can and should remain consistent between
> versions.
>
> Otherwise, an earlier implementation before deferred split queues could
> have safely assumed that active_anon was unreclaimable unless swap were
> enabled.  It doesn't have the foresight based on future kernel
> implementation detail to reconcile what the amount of reclaimable memory
> actually is.
>
> Same discussion could happen for lazy free memory which is anon but now
> appears on the file lru stats and not the anon lru stats: it's easily
> reclaimable under memory pressure but you need to reconcile the difference
> between the anon metric and what is revealed in the anon lru stats.
>
> That gave way to my original thought of a si_mem_available()-like
> calculation ("avail") by doing
>
>         free = memory.high - memory.current

I'm wondering what if high or max is set to max limit. Don't you end
up seeing a super large memavail?

>         lazyfree = file - (active_file + inactive_file)

Isn't it (active_file + inactive_file) - file?  It looks like MADV_FREE
just updates the inactive lru size.

>         deferred = active_anon - anon
>
>         avail = free + lazyfree + deferred +
>                 (active_file + inactive_file + slab_reclaimable) / 2
>
> And we have the ability to change this formula based on kernel
> implementation details as they evolve.  Idea is to provide a consistent
> field that userspace can use to determine the rough amount of reclaimable
> memory in a MemAvailable-like way.
>


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Memcg stat for available memory
  2020-07-10 21:04         ` Yang Shi
@ 2020-07-12 22:02           ` David Rientjes
  0 siblings, 0 replies; 7+ messages in thread
From: David Rientjes @ 2020-07-12 22:02 UTC (permalink / raw)
  To: Yang Shi
  Cc: Michal Hocko, Shakeel Butt, Yang Shi, Roman Gushchin,
	Greg Thelen, Johannes Weiner, Vladimir Davydov, Andrew Morton,
	Cgroups, Linux MM

On Fri, 10 Jul 2020, Yang Shi wrote:

> > To try to get more discussion on the subject, consider a malloc
> > implementation, like tcmalloc, that does MADV_DONTNEED to free memory back
> > to the system and how this freed memory is then described to userspace
> > depending on the kernel implementation.
> >
> >  [ For the sake of this discussion, consider we have precise memcg stats
> >    available to us although the actual implementation allows for some
> >    variance (MEMCG_CHARGE_BATCH). ]
> >
> > With a 64MB heap backed by thp on x86, for example, the vma starts with an
> > rss of 64MB, all of which is anon and backed by hugepages.  Imagine some
> > aggressive MADV_DONTNEED freeing that ends up with only a single 4KB page
> > mapped in each 2MB aligned range.  The rss is now 32 * 4KB = 128KB.
> >
> > Before freeing, anon, anon_thp, and active_anon in memory.stat would all
> > be the same for this vma (64MB).  64MB would also be charged to
> > memory.current.  That's all working as intended and to the expectation of
> > userspace.
> >
> > After freeing, however, we have the kernel implementation specific detail
> > of how huge pmd splitting is handled (rss) in comparison to the underlying
> > split of the compound page (deferred split queue).  The huge pmd is always
> > split synchronously after MADV_DONTNEED so, as mentioned, the rss is 128KB
> > for this vma and none of it is backed by thp.
> >
> > What is charged to the memcg (memory.current) and what is on active_anon
> > is unchanged, however, because the underlying compound pages are still
> > charged to the memcg.  The amount of anon and anon_thp are decreased
> > in compliance with the splitting of the page tables, however.
> >
> > So after freeing, for this vma: anon = 128KB, anon_thp = 0,
> > active_anon = 64MB, memory.current = 64MB.
> >
> > In this case, because of the deferred split queue, which is a kernel
> > implementation detail, userspace may be unclear on what is actually
> > reclaimable -- and this memory is reclaimable under memory pressure.  For
> > the motivation of MemAvailable (what amount of memory is available for
> > starting new work), userspace *could* determine this through the
> > aforementioned active_anon - anon (or some combination of
> > memory.current - anon - file - slab), but I think it's a fair point that
> > userspace's view of reclaimable memory as the kernel implementation
> > changes is something that can and should remain consistent between
> > versions.
> >
> > Otherwise, an earlier implementation before deferred split queues could
> > have safely assumed that active_anon was unreclaimable unless swap were
> > enabled.  It doesn't have the foresight based on future kernel
> > implementation detail to reconcile what the amount of reclaimable memory
> > actually is.
> >
> > Same discussion could happen for lazy free memory which is anon but now
> > appears on the file lru stats and not the anon lru stats: it's easily
> > reclaimable under memory pressure but you need to reconcile the difference
> > between the anon metric and what is revealed in the anon lru stats.
> >
> > That gave way to my original thought of a si_mem_available()-like
> > calculation ("avail") by doing
> >
> >         free = memory.high - memory.current
> 
> I'm wondering what if high or max is set to max limit. Don't you end
> up seeing a super large memavail?
> 

Hi Yang,

Yes, this would be the same as seeing a super large limit :)

I'm indifferent to whether this is described as an available amount of 
memory (almost identical to MemAvailable) or a best guess of the 
reclaimable amount of memory from the memory that is currently charged.  
The concept is to provide userspace with this best guess, like we do for system 
memory through MemAvailable because it (a) depends on implementation 
details in the kernel and (b) is the only way to maintain consistency from 
version to version.

> >         lazyfree = file - (active_file + inactive_file)
> 
> Isn't it (active_file + inactive_file) - file?  It looks like MADV_FREE
> just updates the inactive lru size.
> 

Yes, you're right, this would be

	lazyfree = (active_file + inactive_file) - file

from memory.stat.  Lazy free memory consists of clean anon pages on the 
inactive file lru, but we must consider active_file + inactive_file in 
comparison to "file" for the total amount of lazy free memory.

Another side effect of this is that we'd need anon - lazyfree swap space 
available for this workload to be swapped.
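
So the earlier back-of-the-envelope sketch would become (again only a
sketch against memory.stat, and using min(memory.high, memory.max) as the
limit per Michal's earlier point):

def avail_estimate_v2(s, current, limit):
    # corrected: lazy free pages sit on the file LRUs but not in "file"
    free = limit - current
    lazyfree = (s["active_file"] + s["inactive_file"]) - s["file"]
    deferred = s["active_anon"] - s["anon"]
    return free + lazyfree + deferred + \
        (s["active_file"] + s["inactive_file"] + s["slab_reclaimable"]) // 2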

The overall point I'm trying to highlight is that the amount of memory 
that can be freed under memory pressure, either lazy free or on the 
deferred split queues, can be substantial.  I'd like to discuss the 
feasibility of adding this as a kernel maintained stat to memory.stat 
rather than userspace attempting to derive this on its own.

> >         deferred = active_anon - anon
> >
> >         avail = free + lazyfree + deferred +
> >                 (active_file + inactive_file + slab_reclaimable) / 2
> >
> > And we have the ability to change this formula based on kernel
> > implementation details as they evolve.  Idea is to provide a consistent
> > field that userspace can use to determine the rough amount of reclaimable
> > memory in a MemAvailable-like way.


^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2020-07-12 22:02 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-06-28 22:15 Memcg stat for available memory David Rientjes
2020-07-02 15:22 ` Shakeel Butt
2020-07-03  8:15   ` Michal Hocko
2020-07-07 19:58     ` David Rientjes
2020-07-10 19:47       ` David Rientjes
2020-07-10 21:04         ` Yang Shi
2020-07-12 22:02           ` David Rientjes
