* [patch] mm, memcg: provide a stat to describe reclaimable memory
From: David Rientjes @ 2020-07-15  3:18 UTC
  To: Andrew Morton, Yang Shi
  Cc: Michal Hocko, Shakeel Butt, Yang Shi, Roman Gushchin,
	Greg Thelen, Johannes Weiner, Vladimir Davydov, cgroups,
	linux-mm

MemAvailable in /proc/meminfo provides some guidance on the amount of
memory that can be made available for starting new applications (see
Documentation/filesystems/proc.rst).

Userspace can lack insight into the amount of memory that can be reclaimed
from a memcg based on values from memory.stat, however.  Two specific
examples:

 - Lazy freeable memory (MADV_FREE): clean anonymous pages on the inactive
   file LRU that can be quickly reclaimed under memory pressure but
   otherwise show up as mapped anon in memory.stat, and

 - Memory on deferred split queues (thp): compound pages that can be split
   and uncharged from the memcg under memory pressure, but otherwise show
   up as charged anon LRU memory in memory.stat.

Userspace can currently derive this information and use the same heuristic
as MemAvailable by doing this:

	deferred = (active_anon + inactive_anon) - anon
	lazyfree = (active_file + inactive_file) - file

	avail = deferred + lazyfree + (file + slab_reclaimable) / 2

But this depends on implementation details for how this memory is handled
in the kernel for the purposes of reclaim (anon on inactive file LRU or
unmapped anon on the LRU).
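
As an illustration only, the derivation above could be done in userspace
roughly as follows.  This is a minimal sketch under stated assumptions: the
helper names stat_value(), sub_clamped() and memcg_avail_estimate() are
hypothetical, the input is assumed to be the text of a cgroup v2 memory.stat
file (values in bytes), and the differences are clamped at zero because the
individual counters are not read consistently with one another.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: find a "name value" line in memory.stat-style text.
 * Returns 0 if the key is absent.
 */
static unsigned long long stat_value(const char *stats, const char *name)
{
	size_t len = strlen(name);
	const char *line = stats;

	while (line && *line) {
		unsigned long long val;

		/* Match the whole key: it must be followed by a space. */
		if (!strncmp(line, name, len) && line[len] == ' ' &&
		    sscanf(line + len, "%llu", &val) == 1)
			return val;

		/* Advance to the start of the next line, if any. */
		line = strchr(line, '\n');
		if (line)
			line++;
	}
	return 0;
}

/* Subtract without wrapping below zero. */
static unsigned long long sub_clamped(unsigned long long a,
				      unsigned long long b)
{
	return a > b ? a - b : 0;
}

/* MemAvailable-style estimate for one memcg, in bytes. */
unsigned long long memcg_avail_estimate(const char *stats)
{
	unsigned long long anon = stat_value(stats, "anon");
	unsigned long long file = stat_value(stats, "file");
	unsigned long long deferred, lazyfree;

	/* Anon on the anon LRUs that is not accounted as mapped anon. */
	deferred = sub_clamped(stat_value(stats, "active_anon") +
			       stat_value(stats, "inactive_anon"), anon);
	/* Pages on the file LRUs that are not accounted as page cache. */
	lazyfree = sub_clamped(stat_value(stats, "active_file") +
			       stat_value(stats, "inactive_file"), file);

	return deferred + lazyfree +
	       (file + stat_value(stats, "slab_reclaimable")) / 2;
}
```

Passing the contents of a memcg's memory.stat yields the estimate in bytes; a
key that is missing from the input is treated as 0.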

For the purposes of writing portable userspace code that does not need to
have insight into the kernel implementation for reclaimable memory, this
exports a metric that can provide an estimate of the amount of memory that
can be reclaimed and uncharged from the memcg to start new applications.

As the kernel implementation evolves for memory that can be reclaimed
under memory pressure, this metric can be kept consistent.

Signed-off-by: David Rientjes <rientjes@google.com>
---
 Documentation/admin-guide/cgroup-v2.rst | 12 +++++++++
 mm/memcontrol.c                         | 35 +++++++++++++++++++++++++
 2 files changed, 47 insertions(+)

diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -1314,6 +1314,18 @@ PAGE_SIZE multiple when read back.
 		Part of "slab" that cannot be reclaimed on memory
 		pressure.
 
+	  avail
+		An estimate of how much memory can be made available for
+		starting new applications, similar to MemAvailable from
+		/proc/meminfo (Documentation/filesystems/proc.rst).
+
+		This is derived by assuming that half of page cache and
+		reclaimable slab can be uncharged without significantly
+		impacting the workload, similar to MemAvailable.  It also
+		factors in the amount of lazy freeable memory (MADV_FREE) and
+		compound pages that can be split and uncharged under memory
+		pressure.
+
 	  pgfault
 		Total number of page faults incurred
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1350,6 +1350,35 @@ static bool mem_cgroup_wait_acct_move(struct mem_cgroup *memcg)
 	return false;
 }
 
+/*
+ * Returns an estimate of the amount of available memory that can be reclaimed
+ * for a memcg, in pages.
+ */
+static unsigned long mem_cgroup_avail(struct mem_cgroup *memcg)
+{
+	long deferred, lazyfree;
+
+	/*
+	 * Deferred pages are charged anonymous pages that are on the LRU but
+	 * are unmapped.  These compound pages are split under memory pressure.
+	 */
+	deferred = max_t(long, memcg_page_state(memcg, NR_ACTIVE_ANON) +
+			       memcg_page_state(memcg, NR_INACTIVE_ANON) -
+			       memcg_page_state(memcg, NR_ANON_MAPPED), 0);
+	/*
+	 * Lazyfree pages are charged clean anonymous pages that are on the file
+	 * LRU and can be reclaimed under memory pressure.
+	 */
+	lazyfree = max_t(long, memcg_page_state(memcg, NR_ACTIVE_FILE) +
+			       memcg_page_state(memcg, NR_INACTIVE_FILE) -
+			       memcg_page_state(memcg, NR_FILE_PAGES), 0);
+
+	/* Using same heuristic as si_mem_available() */
+	return (unsigned long)deferred + (unsigned long)lazyfree +
+	       (memcg_page_state(memcg, NR_FILE_PAGES) +
+		memcg_page_state(memcg, NR_SLAB_RECLAIMABLE)) / 2;
+}
+
 static char *memory_stat_format(struct mem_cgroup *memcg)
 {
 	struct seq_buf s;
@@ -1417,6 +1446,12 @@ static char *memory_stat_format(struct mem_cgroup *memcg)
 	seq_buf_printf(&s, "slab_unreclaimable %llu\n",
 		       (u64)memcg_page_state(memcg, NR_SLAB_UNRECLAIMABLE) *
 		       PAGE_SIZE);
+	/*
+	 * All values in this buffer are read individually, no implied
+	 * consistency amongst them.
+	 */
+	seq_buf_printf(&s, "avail %llu\n",
+		       (u64)mem_cgroup_avail(memcg) * PAGE_SIZE);
 
 	/* Accumulated memory events */
 



* Re: [patch] mm, memcg: provide a stat to describe reclaimable memory
From: David Rientjes @ 2020-07-15  7:00 UTC
  To: Andrew Morton, Yang Shi
  Cc: Michal Hocko, Shakeel Butt, Yang Shi, Roman Gushchin,
	Greg Thelen, Johannes Weiner, Vladimir Davydov, cgroups,
	linux-mm

On Tue, 14 Jul 2020, David Rientjes wrote:

> diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
> --- a/Documentation/admin-guide/cgroup-v2.rst
> +++ b/Documentation/admin-guide/cgroup-v2.rst
> @@ -1314,6 +1314,18 @@ PAGE_SIZE multiple when read back.
>  		Part of "slab" that cannot be reclaimed on memory
>  		pressure.
>  
> +	  avail
> +		An estimate of how much memory can be made available for
> +		starting new applications, similar to MemAvailable from
> +		/proc/meminfo (Documentation/filesystems/proc.rst).
> +
> +		This is derived by assuming that half of page cache and
> +		reclaimable slab can be uncharged without significantly
> +		impacting the workload, similar to MemAvailable.  It also
> +		factors in the amount of lazy freeable memory (MADV_FREE) and
> +		compound pages that can be split and uncharged under memory
> +		pressure.
> +
>  	  pgfault
>  		Total number of page faults incurred
>  
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1350,6 +1350,35 @@ static bool mem_cgroup_wait_acct_move(struct mem_cgroup *memcg)
>  	return false;
>  }
>  
> +/*
> + * Returns an estimate of the amount of available memory that can be reclaimed
> + * for a memcg, in pages.
> + */
> +static unsigned long mem_cgroup_avail(struct mem_cgroup *memcg)
> +{
> +	long deferred, lazyfree;
> +
> +	/*
> +	 * Deferred pages are charged anonymous pages that are on the LRU but
> +	 * are unmapped.  These compound pages are split under memory pressure.
> +	 */
> +	deferred = max_t(long, memcg_page_state(memcg, NR_ACTIVE_ANON) +
> +			       memcg_page_state(memcg, NR_INACTIVE_ANON) -
> +			       memcg_page_state(memcg, NR_ANON_MAPPED), 0);
> +	/*
> +	 * Lazyfree pages are charged clean anonymous pages that are on the file
> +	 * LRU and can be reclaimed under memory pressure.
> +	 */
> +	lazyfree = max_t(long, memcg_page_state(memcg, NR_ACTIVE_FILE) +
> +			       memcg_page_state(memcg, NR_INACTIVE_FILE) -
> +			       memcg_page_state(memcg, NR_FILE_PAGES), 0);
> +
> +	/* Using same heuristic as si_mem_available() */
> +	return (unsigned long)deferred + (unsigned long)lazyfree +
> +	       (memcg_page_state(memcg, NR_FILE_PAGES) +
> +		memcg_page_state(memcg, NR_SLAB_RECLAIMABLE)) / 2;
> +}
> +
>  static char *memory_stat_format(struct mem_cgroup *memcg)
>  {
>  	struct seq_buf s;
> @@ -1417,6 +1446,12 @@ static char *memory_stat_format(struct mem_cgroup *memcg)
>  	seq_buf_printf(&s, "slab_unreclaimable %llu\n",
>  		       (u64)memcg_page_state(memcg, NR_SLAB_UNRECLAIMABLE) *
>  		       PAGE_SIZE);
> +	/*
> +	 * All values in this buffer are read individually, no implied
> +	 * consistency amongst them.
> +	 */
> +	seq_buf_printf(&s, "avail %llu\n",
> +		       (u64)mem_cgroup_avail(memcg) * PAGE_SIZE);
>  
>  	/* Accumulated memory events */
>  
> 

An alternative to this would also be to change from an "available" metric 
to an "anon_reclaimable" metric since both the deferred split queues and 
lazy freeable memory would pertain to anon.  This would no longer attempt 
to mimic MemAvailable and leave any such calculation to userspace
(anon_reclaimable + (file + slab_reclaimable) / 2).
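
For illustration, if anon_reclaimable were exported as proposed here, the
userspace side of a MemAvailable-style calculation would reduce to a single
expression.  A minimal sketch, assuming a hypothetical helper name and byte
values taken from memory.stat:

```c
/*
 * Hypothetical userspace helper: MemAvailable-style estimate built from a
 * kernel-provided anon_reclaimable value.  All inputs are in bytes.
 */
unsigned long long memcg_avail_from_anon_reclaimable(
		unsigned long long anon_reclaimable,
		unsigned long long file,
		unsigned long long slab_reclaimable)
{
	/* Same "half of page cache and reclaimable slab" heuristic. */
	return anon_reclaimable + (file + slab_reclaimable) / 2;
}
```

Userspace would no longer need to know which LRU lazy freeable or deferred
split memory sits on; only the halving heuristic stays on its side.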

With this route, care would need to be taken to clearly indicate that 
anon_reclaimable is not necessarily a subset of the "anon" metric since 
reclaimable memory from compound pages on deferred split queues is not 
mapped, so it doesn't show up in NR_ANON_MAPPED.

I'm indifferent to either approach and would be happy to switch to
anon_reclaimable if others agree and don't foresee any extensibility
issues.



* Re: Re: [patch] mm, memcg: provide a stat to describe reclaimable memory
From: SeongJae Park @ 2020-07-15  7:15 UTC
  To: David Rientjes
  Cc: Andrew Morton, Yang Shi, Michal Hocko, Shakeel Butt, Yang Shi,
	Roman Gushchin, Greg Thelen, Johannes Weiner, Vladimir Davydov,
	cgroups, linux-mm

Hello David,

On Wed, 15 Jul 2020 00:00:03 -0700 (PDT) David Rientjes <rientjes@google.com> wrote:

> On Tue, 14 Jul 2020, David Rientjes wrote:
> 
[...]
> 
> An alternative to this would also be to change from an "available" metric 
> to an "anon_reclaimable" metric since both the deferred split queues and 
> lazy freeable memory would pertain to anon.  This would no longer attempt 
> to mimic MemAvailable and leave any such calculation to userspace
> (anon_reclaimable + (file + slab_reclaimable) / 2).
> 
> With this route, care would need to be taken to clearly indicate that 
> anon_reclaimable is not necessarily a subset of the "anon" metric since 
> reclaimable memory from compound pages on deferred split queues is not 
> mapped, so it doesn't show up in NR_ANON_MAPPED.
> 
> I'm indifferent to either approach and would be happy to switch to
> anon_reclaimable if others agree and don't foresee any extensibility
> issues.

Agreed, I was also once confused by 'MemAvailable'.  'reclaimable' might be
easier to understand.


Thanks,
SeongJae Park



* Re: [patch] mm, memcg: provide a stat to describe reclaimable memory
From: Chris Down @ 2020-07-15 13:10 UTC
  To: David Rientjes
  Cc: Andrew Morton, Yang Shi, Michal Hocko, Shakeel Butt, Yang Shi,
	Roman Gushchin, Greg Thelen, Johannes Weiner, Vladimir Davydov,
	cgroups, linux-mm

Hi David,

I'm somewhat against adding more metrics which try to approximate availability 
of memory when we already know it not to generally manifest very well in 
practice, especially since this *is* calculable by userspace (albeit with some 
knowledge of mm internals). Users and applications often vastly overestimate 
the reliability of these metrics, especially since they heavily depend on 
transient page states and whatever reclaim efficacy happens to be achieved at 
the time there is demand.

What do you intend to do with these metrics and how do you envisage other users 
should use them? Is it not possible to rework the strategy to use pressure 
information and/or workingset pressurisation instead?

Thanks,

Chris



* Re: Re: [patch] mm, memcg: provide a stat to describe reclaimable memory
From: David Rientjes @ 2020-07-15 17:33 UTC
  To: SeongJae Park
  Cc: Andrew Morton, Yang Shi, Michal Hocko, Shakeel Butt, Yang Shi,
	Roman Gushchin, Greg Thelen, Johannes Weiner, Vladimir Davydov,
	cgroups, linux-mm

On Wed, 15 Jul 2020, SeongJae Park wrote:

> > An alternative to this would also be to change from an "available" metric 
> > to an "anon_reclaimable" metric since both the deferred split queues and 
> > lazy freeable memory would pertain to anon.  This would no longer attempt 
> > to mimic MemAvailable and leave any such calculation to userspace
> > (anon_reclaimable + (file + slab_reclaimable) / 2).
> > 
> > With this route, care would need to be taken to clearly indicate that 
> > anon_reclaimable is not necessarily a subset of the "anon" metric since 
> > reclaimable memory from compound pages on deferred split queues is not 
> > mapped, so it doesn't show up in NR_ANON_MAPPED.
> > 
> > I'm indifferent to either approach and would be happy to switch to
> > anon_reclaimable if others agree and don't foresee any extensibility
> > issues.
> 
> Agreed, I was also once confused by 'MemAvailable'.  'reclaimable' might be
> easier to understand.
> 

Hi SeongJae,

I'm leaning in that direction now too, actually, because I reasoned that 
determining the precise amount of anon that can be reclaimed would require 
subtracting (file + slab_reclaimable) / 2, which is awkward :)
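
To illustrate the awkwardness: with a combined avail value, the anon-only
portion has to be recovered by removing the page cache and slab term again.
A minimal sketch, with a hypothetical helper name and byte values assumed to
come from memory.stat:

```c
/*
 * Hypothetical userspace helper: recover the anon-only portion from a
 * combined "avail" value.  All inputs are in bytes.
 */
unsigned long long anon_reclaimable_from_avail(
		unsigned long long avail,
		unsigned long long file,
		unsigned long long slab_reclaimable)
{
	unsigned long long cache_term = (file + slab_reclaimable) / 2;

	/* The stats are sampled separately, so clamp rather than wrap. */
	return avail > cache_term ? avail - cache_term : 0;
}
```

The caller has to know the exact heuristic baked into avail in order to undo
it, which a dedicated anon_reclaimable field avoids.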

So I'll send a follow-up patch to add only an anon_reclaimable field which 
is good enough for our purposes unless others would like to have more 
discussion.

Thanks!



* Re: [patch] mm, memcg: provide a stat to describe reclaimable memory
From: David Rientjes @ 2020-07-15 18:02 UTC
  To: Chris Down
  Cc: Andrew Morton, Yang Shi, Michal Hocko, Shakeel Butt, Yang Shi,
	Roman Gushchin, Greg Thelen, Johannes Weiner, Vladimir Davydov,
	cgroups, linux-mm

On Wed, 15 Jul 2020, Chris Down wrote:

> Hi David,
> 
> I'm somewhat against adding more metrics which try to approximate availability
> of memory when we already know it not to generally manifest very well in
> practice, especially since this *is* calculable by userspace (albeit with some
> knowledge of mm internals). Users and applications often vastly overestimate
> the reliability of these metrics, especially since they heavily depend on
> transient page states and whatever reclaim efficacy happens to be achieved at
> the time there is demand.
> 

Hi Chris,

With the proposed anon_reclaimable, do you have any reliability concerns?  
This would be the amount of lazy freeable memory and memory that can be 
uncharged if compound pages from the deferred split queue are split under 
memory pressure.  It seems to be a very precise value (as slab_reclaimable 
already in memory.stat is), so I'm not sure why there is a reliability 
concern.  Maybe you can elaborate?

Today, this information is indeed possible to calculate from userspace.
The idea, however, is to present this information in a way that remains
backwards compatible as the kernel implementation changes.  When lazy
freeable memory was added, for instance, userspace likely would not have
preemptively been doing an "active_file + inactive_file - file" calculation
to factor that in as reclaimable anon :)

> What do you intend to do with these metrics and how do you envisage other
> users should use them? Is it not possible to rework the strategy to use
> pressure information and/or workingset pressurisation instead?
> 

Previously, users would interpret their anon usage as non-reclaimable if
swap is disabled, but that value can now include a *lot* of easily
reclaimable memory.  Our users would also carefully monitor their current
memcg usage and/or anon usage to detect abnormalities without concern for
what is reclaimable, especially for things like deferred split queues, which
were purely a kernel implementation change.  Memcg usage and anon usage then
become wildly different between kernel versions and our users alert on
that abnormality.

The example I gave earlier in the thread showed how dramatically different 
memory.current is before and after the introduction of deferred split 
queues.  Userspace sees ballooning memcg usage and alerts on it (suspects 
a memory leak, for example) when in reality this is purely reclaimable 
memory under pressure and is the result of a kernel implementation detail.

We plan on factoring this information in when determining the actual
amount of memory that can and cannot be reclaimed from a memcg hierarchy
at any given time.
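
One way such a stat could be factored in, sketched purely under assumptions:
a monitor discounts reclaimable anon from memory.current before comparing
against an alert threshold.  The function names and the threshold parameter
are hypothetical, anon_reclaimable refers to the stat proposed in this
thread, and all values are in bytes.

```c
/*
 * Hypothetical monitor helper: memcg usage with reclaimable anon
 * discounted.  Clamps at zero because the two values are sampled
 * separately.
 */
unsigned long long memcg_usage_discounted(unsigned long long current_usage,
					  unsigned long long anon_reclaimable)
{
	return current_usage > anon_reclaimable ?
	       current_usage - anon_reclaimable : 0;
}

/* Returns 1 if the discounted usage exceeds the threshold, else 0. */
int memcg_usage_alert(unsigned long long current_usage,
		      unsigned long long anon_reclaimable,
		      unsigned long long threshold)
{
	return memcg_usage_discounted(current_usage, anon_reclaimable) >
	       threshold;
}
```

With this kind of check, growth in memory.current that comes only from lazy
freeable memory or deferred split queues would not by itself trip the alert.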


* [patch] mm, memcg: provide an anon_reclaimable stat
From: David Rientjes @ 2020-07-16 20:58 UTC
  To: SeongJae Park, Andrew Morton
  Cc: Yang Shi, Michal Hocko, Shakeel Butt, Yang Shi, Roman Gushchin,
	Greg Thelen, Johannes Weiner, Vladimir Davydov, cgroups,
	linux-mm

Userspace can lack insight into the amount of memory that can be reclaimed
from a memcg based on values from memory.stat.  Two specific examples:

 - Lazy freeable memory (MADV_FREE): clean anonymous pages on the inactive
   file LRU that can be quickly reclaimed under memory pressure but
   otherwise show up as mapped anon in memory.stat, and

 - Memory on deferred split queues (thp): compound pages that can be split
   and uncharged from the memcg under memory pressure, but otherwise show
   up as charged anon LRU memory in memory.stat.

Both kinds of anonymous usage are also charged to memory.current.

Userspace can currently derive this information but it depends on kernel
implementation details for how this memory is handled for the purposes of
reclaim (anon on inactive file LRU or unmapped anon on the LRU).

For the purposes of writing portable userspace code that does not need to
have insight into the kernel implementation for reclaimable memory, this
exports a stat that reveals the amount of anonymous memory that can be
reclaimed and uncharged from the memcg to start new applications.

As the kernel implementation evolves for memory that can be reclaimed
under memory pressure, this stat can be kept consistent.

Signed-off-by: David Rientjes <rientjes@google.com>
---
 Documentation/admin-guide/cgroup-v2.rst |  6 +++++
 mm/memcontrol.c                         | 31 +++++++++++++++++++++++++
 2 files changed, 37 insertions(+)

diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -1296,6 +1296,12 @@ PAGE_SIZE multiple when read back.
 		Amount of memory used in anonymous mappings backed by
 		transparent hugepages
 
+	  anon_reclaimable
+		The amount of charged anonymous memory that can be reclaimed
+		under memory pressure without swap.  This currently includes
+		lazy freeable memory (MADV_FREE) and compound pages that can be
+		split and uncharged.
+
 	  inactive_anon, active_anon, inactive_file, active_file, unevictable
 		Amount of memory, swap-backed and filesystem-backed,
 		on the internal memory management lists used by the
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1350,6 +1350,32 @@ static bool mem_cgroup_wait_acct_move(struct mem_cgroup *memcg)
 	return false;
 }
 
+/*
+ * Returns the amount of anon memory that is charged to the memcg that is
+ * reclaimable under memory pressure without swap, in pages.
+ */
+static unsigned long memcg_anon_reclaimable(struct mem_cgroup *memcg)
+{
+	long deferred, lazyfree;
+
+	/*
+	 * Deferred pages are charged anonymous pages that are on the LRU but
+	 * are unmapped.  These compound pages are split under memory pressure.
+	 */
+	deferred = max_t(long, memcg_page_state(memcg, NR_ACTIVE_ANON) +
+			       memcg_page_state(memcg, NR_INACTIVE_ANON) -
+			       memcg_page_state(memcg, NR_ANON_MAPPED), 0);
+	/*
+	 * Lazyfree pages are charged clean anonymous pages that are on the file
+	 * LRU and can be reclaimed under memory pressure.
+	 */
+	lazyfree = max_t(long, memcg_page_state(memcg, NR_ACTIVE_FILE) +
+			       memcg_page_state(memcg, NR_INACTIVE_FILE) -
+			       memcg_page_state(memcg, NR_FILE_PAGES), 0);
+
+	return deferred + lazyfree;
+}
+
 static char *memory_stat_format(struct mem_cgroup *memcg)
 {
 	struct seq_buf s;
@@ -1363,6 +1389,9 @@ static char *memory_stat_format(struct mem_cgroup *memcg)
 	 * Provide statistics on the state of the memory subsystem as
 	 * well as cumulative event counters that show past behavior.
 	 *
+	 * All values in this buffer are read individually, so no implied
+	 * consistency amongst them.
+	 *
 	 * This list is ordered following a combination of these gradients:
 	 * 1) generic big picture -> specifics and details
 	 * 2) reflecting userspace activity -> reflecting kernel heuristics
@@ -1405,6 +1434,8 @@ static char *memory_stat_format(struct mem_cgroup *memcg)
 		       (u64)memcg_page_state(memcg, NR_ANON_THPS) *
 		       HPAGE_PMD_SIZE);
 #endif
+	seq_buf_printf(&s, "anon_reclaimable %llu\n",
+		       (u64)memcg_anon_reclaimable(memcg) * PAGE_SIZE);
 
 	for (i = 0; i < NR_LRU_LISTS; i++)
 		seq_buf_printf(&s, "%s %llu\n", lru_list_name(i),


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch] mm, memcg: provide an anon_reclaimable stat
@ 2020-07-16 21:07           ` Shakeel Butt
  0 siblings, 0 replies; 29+ messages in thread
From: Shakeel Butt @ 2020-07-16 21:07 UTC (permalink / raw)
  To: David Rientjes
  Cc: SeongJae Park, Andrew Morton, Yang Shi, Michal Hocko, Yang Shi,
	Roman Gushchin, Greg Thelen, Johannes Weiner, Vladimir Davydov,
	Cgroups, Linux MM

On Thu, Jul 16, 2020 at 1:58 PM David Rientjes <rientjes@google.com> wrote:
>
> Userspace can lack insight into the amount of memory that can be reclaimed
> from a memcg based on values from memory.stat.  Two specific examples:
>
>  - Lazy freeable memory (MADV_FREE) that are clean anonymous pages on the
>    inactive file LRU that can be quickly reclaimed under memory pressure
>    but otherwise shows up as mapped anon in memory.stat, and
>
>  - Memory on deferred split queues (thp) that are compound pages that can
>    be split and uncharged from the memcg under memory pressure, but
>    otherwise shows up as charged anon LRU memory in memory.stat.
>
> Both of this anonymous usage is also charged to memory.current.
>
> Userspace can currently derive this information but it depends on kernel
> implementation details for how this memory is handled for the purposes of
> reclaim (anon on inactive file LRU or unmapped anon on the LRU).
>
> For the purposes of writing portable userspace code that does not need to
> have insight into the kernel implementation for reclaimable memory, this
> exports a stat that reveals the amount of anonymous memory that can be
> reclaimed and uncharged from the memcg to start new applications.
>
> As the kernel implementation evolves for memory that can be reclaimed
> under memory pressure, this stat can be kept consistent.
>
> Signed-off-by: David Rientjes <rientjes@google.com>
> ---
>  Documentation/admin-guide/cgroup-v2.rst |  6 +++++
>  mm/memcontrol.c                         | 31 +++++++++++++++++++++++++
>  2 files changed, 37 insertions(+)
>
> diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
> --- a/Documentation/admin-guide/cgroup-v2.rst
> +++ b/Documentation/admin-guide/cgroup-v2.rst
> @@ -1296,6 +1296,12 @@ PAGE_SIZE multiple when read back.
>                 Amount of memory used in anonymous mappings backed by
>                 transparent hugepages
>
> +         anon_reclaimable
> +               The amount of charged anonymous memory that can be reclaimed
> +               under memory pressure without swap.  This currently includes
> +               lazy freeable memory (MADV_FREE) and compound pages that can be
> +               split and uncharged.
> +
>           inactive_anon, active_anon, inactive_file, active_file, unevictable
>                 Amount of memory, swap-backed and filesystem-backed,
>                 on the internal memory management lists used by the
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1350,6 +1350,32 @@ static bool mem_cgroup_wait_acct_move(struct mem_cgroup *memcg)
>         return false;
>  }
>
> +/*
> + * Returns the amount of anon memory that is charged to the memcg that is
> + * reclaimable under memory pressure without swap, in pages.
> + */
> +static unsigned long memcg_anon_reclaimable(struct mem_cgroup *memcg)
> +{
> +       long deferred, lazyfree;
> +
> +       /*
> +        * Deferred pages are charged anonymous pages that are on the LRU but
> +        * are unmapped.  These compound pages are split under memory pressure.
> +        */
> +       deferred = max_t(long, memcg_page_state(memcg, NR_ACTIVE_ANON) +
> +                              memcg_page_state(memcg, NR_INACTIVE_ANON) -
> +                              memcg_page_state(memcg, NR_ANON_MAPPED), 0);

Please note that the NR_ANON_MAPPED does not include tmpfs memory but
NR_[IN]ACTIVE_ANON does include the tmpfs.

> +       /*
> +        * Lazyfree pages are charged clean anonymous pages that are on the file
> +        * LRU and can be reclaimed under memory pressure.
> +        */
> +       lazyfree = max_t(long, memcg_page_state(memcg, NR_ACTIVE_FILE) +
> +                              memcg_page_state(memcg, NR_INACTIVE_FILE) -
> +                              memcg_page_state(memcg, NR_FILE_PAGES), 0);

Similarly NR_FILE_PAGES includes tmpfs memory but NR_[IN]ACTIVE_FILE does not.
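
To make the two tmpfs remarks above concrete, here is a minimal sketch
that restates the patch's arithmetic as plain C helpers.  The names
clamp_zero(), patch_deferred() and patch_lazyfree() are hypothetical and
exist only for this illustration; the page counts below are invented,
and the sketch assumes every page sits on an evictable LRU.

```c
/* Clamp a signed page count at zero, as max_t(long, x, 0) does in the patch. */
static long clamp_zero(long x)
{
	return x > 0 ? x : 0;
}

/*
 * "deferred" as computed in the patch:
 * NR_ACTIVE_ANON + NR_INACTIVE_ANON - NR_ANON_MAPPED, floored at zero.
 * The anon LRU counters include tmpfs pages; NR_ANON_MAPPED does not.
 */
static long patch_deferred(long active_anon, long inactive_anon,
			   long anon_mapped)
{
	return clamp_zero(active_anon + inactive_anon - anon_mapped);
}

/*
 * "lazyfree" as computed in the patch:
 * NR_ACTIVE_FILE + NR_INACTIVE_FILE - NR_FILE_PAGES, floored at zero.
 * NR_FILE_PAGES includes tmpfs pages; the file LRU counters do not.
 */
static long patch_lazyfree(long active_file, long inactive_file,
			   long file_pages)
{
	return clamp_zero(active_file + inactive_file - file_pages);
}
```

Take a hypothetical memcg with 1000 mapped anonymous pages, 200 tmpfs
pages and nothing on a deferred split queue.  The anon LRUs then hold
1200 pages, so patch_deferred(600, 600, 1000) reports 200 pages even
though none are deferred.  If the same memcg has 500 page cache pages
and 50 lazyfree pages, the file LRUs hold 550 pages while NR_FILE_PAGES
is 700, so patch_lazyfree(300, 250, 700) is clamped to 0 and the 50
lazyfree pages are hidden.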


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch] mm, memcg: provide an anon_reclaimable stat
@ 2020-07-16 21:28             ` David Rientjes
  0 siblings, 0 replies; 29+ messages in thread
From: David Rientjes @ 2020-07-16 21:28 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: SeongJae Park, Andrew Morton, Yang Shi, Michal Hocko, Yang Shi,
	Roman Gushchin, Greg Thelen, Johannes Weiner, Vladimir Davydov,
	Cgroups, Linux MM

On Thu, 16 Jul 2020, Shakeel Butt wrote:

> > Userspace can lack insight into the amount of memory that can be reclaimed
> > from a memcg based on values from memory.stat.  Two specific examples:
> >
> >  - Lazy freeable memory (MADV_FREE) that are clean anonymous pages on the
> >    inactive file LRU that can be quickly reclaimed under memory pressure
> >    but otherwise shows up as mapped anon in memory.stat, and
> >
> >  - Memory on deferred split queues (thp) that are compound pages that can
> >    be split and uncharged from the memcg under memory pressure, but
> >    otherwise shows up as charged anon LRU memory in memory.stat.
> >
> > Both of this anonymous usage is also charged to memory.current.
> >
> > Userspace can currently derive this information but it depends on kernel
> > implementation details for how this memory is handled for the purposes of
> > reclaim (anon on inactive file LRU or unmapped anon on the LRU).
> >
> > For the purposes of writing portable userspace code that does not need to
> > have insight into the kernel implementation for reclaimable memory, this
> > exports a stat that reveals the amount of anonymous memory that can be
> > reclaimed and uncharged from the memcg to start new applications.
> >
> > As the kernel implementation evolves for memory that can be reclaimed
> > under memory pressure, this stat can be kept consistent.
> >
> > Signed-off-by: David Rientjes <rientjes@google.com>
> > ---
> >  Documentation/admin-guide/cgroup-v2.rst |  6 +++++
> >  mm/memcontrol.c                         | 31 +++++++++++++++++++++++++
> >  2 files changed, 37 insertions(+)
> >
> > diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
> > --- a/Documentation/admin-guide/cgroup-v2.rst
> > +++ b/Documentation/admin-guide/cgroup-v2.rst
> > @@ -1296,6 +1296,12 @@ PAGE_SIZE multiple when read back.
> >                 Amount of memory used in anonymous mappings backed by
> >                 transparent hugepages
> >
> > +         anon_reclaimable
> > +               The amount of charged anonymous memory that can be reclaimed
> > +               under memory pressure without swap.  This currently includes
> > +               lazy freeable memory (MADV_FREE) and compound pages that can be
> > +               split and uncharged.
> > +
> >           inactive_anon, active_anon, inactive_file, active_file, unevictable
> >                 Amount of memory, swap-backed and filesystem-backed,
> >                 on the internal memory management lists used by the
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -1350,6 +1350,32 @@ static bool mem_cgroup_wait_acct_move(struct mem_cgroup *memcg)
> >         return false;
> >  }
> >
> > +/*
> > + * Returns the amount of anon memory that is charged to the memcg that is
> > + * reclaimable under memory pressure without swap, in pages.
> > + */
> > +static unsigned long memcg_anon_reclaimable(struct mem_cgroup *memcg)
> > +{
> > +       long deferred, lazyfree;
> > +
> > +       /*
> > +        * Deferred pages are charged anonymous pages that are on the LRU but
> > +        * are unmapped.  These compound pages are split under memory pressure.
> > +        */
> > +       deferred = max_t(long, memcg_page_state(memcg, NR_ACTIVE_ANON) +
> > +                              memcg_page_state(memcg, NR_INACTIVE_ANON) -
> > +                              memcg_page_state(memcg, NR_ANON_MAPPED), 0);
> 
> Please note that the NR_ANON_MAPPED does not include tmpfs memory but
> NR_[IN]ACTIVE_ANON does include the tmpfs.
> 
> > +       /*
> > +        * Lazyfree pages are charged clean anonymous pages that are on the file
> > +        * LRU and can be reclaimed under memory pressure.
> > +        */
> > +       lazyfree = max_t(long, memcg_page_state(memcg, NR_ACTIVE_FILE) +
> > +                              memcg_page_state(memcg, NR_INACTIVE_FILE) -
> > +                              memcg_page_state(memcg, NR_FILE_PAGES), 0);
> 
> Similarly NR_FILE_PAGES includes tmpfs memory but NR_[IN]ACTIVE_FILE does not.
> 

Ah, so this adds to the motivation of providing the anon_reclaimable stat 
because the calculation becomes even more convoluted and completely based 
on the kernel implementation details for both lazyfree memory and deferred 
split queues.  Did you have a calculation in mind for 
memcg_anon_reclaimable()?


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch] mm, memcg: provide an anon_reclaimable stat
@ 2020-07-17  1:37               ` Shakeel Butt
  0 siblings, 0 replies; 29+ messages in thread
From: Shakeel Butt @ 2020-07-17  1:37 UTC (permalink / raw)
  To: David Rientjes
  Cc: SeongJae Park, Andrew Morton, Yang Shi, Michal Hocko, Yang Shi,
	Roman Gushchin, Greg Thelen, Johannes Weiner, Vladimir Davydov,
	Cgroups, Linux MM

On Thu, Jul 16, 2020 at 2:28 PM David Rientjes <rientjes@google.com> wrote:
>
> On Thu, 16 Jul 2020, Shakeel Butt wrote:
>
> > > Userspace can lack insight into the amount of memory that can be reclaimed
> > > from a memcg based on values from memory.stat.  Two specific examples:
> > >
> > >  - Lazy freeable memory (MADV_FREE) that are clean anonymous pages on the
> > >    inactive file LRU that can be quickly reclaimed under memory pressure
> > >    but otherwise shows up as mapped anon in memory.stat, and
> > >
> > >  - Memory on deferred split queues (thp) that are compound pages that can
> > >    be split and uncharged from the memcg under memory pressure, but
> > >    otherwise shows up as charged anon LRU memory in memory.stat.
> > >
> > > Both of this anonymous usage is also charged to memory.current.
> > >
> > > Userspace can currently derive this information but it depends on kernel
> > > implementation details for how this memory is handled for the purposes of
> > > reclaim (anon on inactive file LRU or unmapped anon on the LRU).
> > >
> > > For the purposes of writing portable userspace code that does not need to
> > > have insight into the kernel implementation for reclaimable memory, this
> > > exports a stat that reveals the amount of anonymous memory that can be
> > > reclaimed and uncharged from the memcg to start new applications.
> > >
> > > As the kernel implementation evolves for memory that can be reclaimed
> > > under memory pressure, this stat can be kept consistent.
> > >
> > > Signed-off-by: David Rientjes <rientjes@google.com>
> > > ---
> > >  Documentation/admin-guide/cgroup-v2.rst |  6 +++++
> > >  mm/memcontrol.c                         | 31 +++++++++++++++++++++++++
> > >  2 files changed, 37 insertions(+)
> > >
> > > diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
> > > --- a/Documentation/admin-guide/cgroup-v2.rst
> > > +++ b/Documentation/admin-guide/cgroup-v2.rst
> > > @@ -1296,6 +1296,12 @@ PAGE_SIZE multiple when read back.
> > >                 Amount of memory used in anonymous mappings backed by
> > >                 transparent hugepages
> > >
> > > +         anon_reclaimable
> > > +               The amount of charged anonymous memory that can be reclaimed
> > > +               under memory pressure without swap.  This currently includes
> > > +               lazy freeable memory (MADV_FREE) and compound pages that can be
> > > +               split and uncharged.
> > > +
> > >           inactive_anon, active_anon, inactive_file, active_file, unevictable
> > >                 Amount of memory, swap-backed and filesystem-backed,
> > >                 on the internal memory management lists used by the
> > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > > --- a/mm/memcontrol.c
> > > +++ b/mm/memcontrol.c
> > > @@ -1350,6 +1350,32 @@ static bool mem_cgroup_wait_acct_move(struct mem_cgroup *memcg)
> > >         return false;
> > >  }
> > >
> > > +/*
> > > + * Returns the amount of anon memory that is charged to the memcg that is
> > > + * reclaimable under memory pressure without swap, in pages.
> > > + */
> > > +static unsigned long memcg_anon_reclaimable(struct mem_cgroup *memcg)
> > > +{
> > > +       long deferred, lazyfree;
> > > +
> > > +       /*
> > > +        * Deferred pages are charged anonymous pages that are on the LRU but
> > > +        * are unmapped.  These compound pages are split under memory pressure.
> > > +        */
> > > +       deferred = max_t(long, memcg_page_state(memcg, NR_ACTIVE_ANON) +
> > > +                              memcg_page_state(memcg, NR_INACTIVE_ANON) -
> > > +                              memcg_page_state(memcg, NR_ANON_MAPPED), 0);
> >
> > Please note that the NR_ANON_MAPPED does not include tmpfs memory but
> > NR_[IN]ACTIVE_ANON does include the tmpfs.
> >
> > > +       /*
> > > +        * Lazyfree pages are charged clean anonymous pages that are on the file
> > > +        * LRU and can be reclaimed under memory pressure.
> > > +        */
> > > +       lazyfree = max_t(long, memcg_page_state(memcg, NR_ACTIVE_FILE) +
> > > +                              memcg_page_state(memcg, NR_INACTIVE_FILE) -
> > > +                              memcg_page_state(memcg, NR_FILE_PAGES), 0);
> >
> > Similarly NR_FILE_PAGES includes tmpfs memory but NR_[IN]ACTIVE_FILE does not.
> >
>
> Ah, so this adds to the motivation of providing the anon_reclaimable stat
> because the calculation becomes even more convoluted and completely based
> on the kernel implementation details for both lazyfree memory and deferred
> split queues.

Yes, I agree.

> Did you have a calculation in mind for
> memcg_anon_reclaimable()?

For deferred, "memcg->deferred_split_queue.split_queue_len" should be usable.

For lazyfree, NR_ACTIVE_FILE + NR_INACTIVE_FILE + NR_SHMEM -
NR_FILE_PAGES seems like the right formula.
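
As a rough userspace illustration of the lazyfree formula above (not part
of the patch): cgroup v2 memory.stat reports active_file, inactive_file,
shmem and file in bytes, so the same arithmetic can be applied to its text.
The helper names stat_value() and lazyfree_estimate() are hypothetical, and
mapping NR_SHMEM and NR_FILE_PAGES onto the "shmem" and "file" keys is an
assumption.

```c
#include <stdio.h>
#include <string.h>

/*
 * Look up one "key value" line in memory.stat-formatted text.
 * Returns the value in bytes, or 0 if the key is absent.
 */
static unsigned long long stat_value(const char *stats, const char *key)
{
	size_t keylen = strlen(key);
	const char *line = stats;

	while (line && *line) {
		unsigned long long val;

		/* Match the whole key, followed by a single space. */
		if (!strncmp(line, key, keylen) && line[keylen] == ' ' &&
		    sscanf(line + keylen + 1, "%llu", &val) == 1)
			return val;
		line = strchr(line, '\n');
		if (line)
			line++;
	}
	return 0;
}

/*
 * Hypothetical helper: estimate lazyfree bytes as
 * active_file + inactive_file + shmem - file.  Clamped at zero because
 * the counters are read individually and need not be consistent.
 */
static unsigned long long lazyfree_estimate(const char *stats)
{
	unsigned long long lru = stat_value(stats, "active_file") +
				 stat_value(stats, "inactive_file") +
				 stat_value(stats, "shmem");
	unsigned long long file = stat_value(stats, "file");

	return lru > file ? lru - file : 0;
}
```

The caller would read the memcg's memory.stat into a NUL-terminated buffer
and pass it to lazyfree_estimate().  This still bakes kernel implementation
details into userspace, which is the concern raised in the thread.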


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch] mm, memcg: provide an anon_reclaimable stat
@ 2020-07-17  8:34           ` Michal Hocko
  0 siblings, 0 replies; 29+ messages in thread
From: Michal Hocko @ 2020-07-17  8:34 UTC (permalink / raw)
  To: David Rientjes
  Cc: SeongJae Park, Andrew Morton, Yang Shi, Shakeel Butt, Yang Shi,
	Roman Gushchin, Greg Thelen, Johannes Weiner, Vladimir Davydov,
	cgroups, linux-mm

On Thu 16-07-20 13:58:19, David Rientjes wrote:
> Userspace can lack insight into the amount of memory that can be reclaimed
> from a memcg based on values from memory.stat.  Two specific examples:
> 
>  - Lazy freeable memory (MADV_FREE) that are clean anonymous pages on the
>    inactive file LRU that can be quickly reclaimed under memory pressure
>    but otherwise shows up as mapped anon in memory.stat, and
> 
>  - Memory on deferred split queues (thp) that are compound pages that can
>    be split and uncharged from the memcg under memory pressure, but
>    otherwise shows up as charged anon LRU memory in memory.stat.
> 
> Both of these anonymous usages are also charged to memory.current.
> 
> Userspace can currently derive this information but it depends on kernel
> implementation details for how this memory is handled for the purposes of
> reclaim (anon on inactive file LRU or unmapped anon on the LRU).
> 
> For the purposes of writing portable userspace code that does not need to
> have insight into the kernel implementation for reclaimable memory, this
> exports a stat that reveals the amount of anonymous memory that can be
> reclaimed and uncharged from the memcg to start new applications.
> 
> As the kernel implementation evolves for memory that can be reclaimed
> under memory pressure, this stat can be kept consistent.

Please be much more specific about the expected usage. You have
mentioned something in the email thread but this really belongs in the
changelog.

Why is anonymous memory that is reclaimable without swap special compared
to, say, any other clean and easily reclaimable cache? What if swap is
available?

> Signed-off-by: David Rientjes <rientjes@google.com>
> ---
>  Documentation/admin-guide/cgroup-v2.rst |  6 +++++
>  mm/memcontrol.c                         | 31 +++++++++++++++++++++++++
>  2 files changed, 37 insertions(+)
> 
> diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
> --- a/Documentation/admin-guide/cgroup-v2.rst
> +++ b/Documentation/admin-guide/cgroup-v2.rst
> @@ -1296,6 +1296,12 @@ PAGE_SIZE multiple when read back.
>  		Amount of memory used in anonymous mappings backed by
>  		transparent hugepages
>  
> +	  anon_reclaimable
> +		The amount of charged anonymous memory that can be reclaimed
> +		under memory pressure without swap.  This currently includes
> +		lazy freeable memory (MADV_FREE) and compound pages that can be
> +		split and uncharged.
> +
>  	  inactive_anon, active_anon, inactive_file, active_file, unevictable
>  		Amount of memory, swap-backed and filesystem-backed,
>  		on the internal memory management lists used by the
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1350,6 +1350,32 @@ static bool mem_cgroup_wait_acct_move(struct mem_cgroup *memcg)
>  	return false;
>  }
>  
> +/*
> + * Returns the amount of anon memory that is charged to the memcg that is
> + * reclaimable under memory pressure without swap, in pages.
> + */
> +static unsigned long memcg_anon_reclaimable(struct mem_cgroup *memcg)
> +{
> +	long deferred, lazyfree;
> +
> +	/*
> +	 * Deferred pages are charged anonymous pages that are on the LRU but
> +	 * are unmapped.  These compound pages are split under memory pressure.
> +	 */
> +	deferred = max_t(long, memcg_page_state(memcg, NR_ACTIVE_ANON) +
> +			       memcg_page_state(memcg, NR_INACTIVE_ANON) -
> +			       memcg_page_state(memcg, NR_ANON_MAPPED), 0);
> +	/*
> +	 * Lazyfree pages are charged clean anonymous pages that are on the file
> +	 * LRU and can be reclaimed under memory pressure.
> +	 */
> +	lazyfree = max_t(long, memcg_page_state(memcg, NR_ACTIVE_FILE) +
> +			       memcg_page_state(memcg, NR_INACTIVE_FILE) -
> +			       memcg_page_state(memcg, NR_FILE_PAGES), 0);
> +
> +	return deferred + lazyfree;
> +}
> +
>  static char *memory_stat_format(struct mem_cgroup *memcg)
>  {
>  	struct seq_buf s;
> @@ -1363,6 +1389,9 @@ static char *memory_stat_format(struct mem_cgroup *memcg)
>  	 * Provide statistics on the state of the memory subsystem as
>  	 * well as cumulative event counters that show past behavior.
>  	 *
> +	 * All values in this buffer are read individually, so no implied
> +	 * consistency amongst them.
> +	 *
>  	 * This list is ordered following a combination of these gradients:
>  	 * 1) generic big picture -> specifics and details
>  	 * 2) reflecting userspace activity -> reflecting kernel heuristics
> @@ -1405,6 +1434,8 @@ static char *memory_stat_format(struct mem_cgroup *memcg)
>  		       (u64)memcg_page_state(memcg, NR_ANON_THPS) *
>  		       HPAGE_PMD_SIZE);
>  #endif
> +	seq_buf_printf(&s, "anon_reclaimable %llu\n",
> +		       (u64)memcg_anon_reclaimable(memcg) * PAGE_SIZE);
>  
>  	for (i = 0; i < NR_LRU_LISTS; i++)
>  		seq_buf_printf(&s, "%s %llu\n", lru_list_name(i),

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch] mm, memcg: provide a stat to describe reclaimable memory
@ 2020-07-17 12:17         ` Chris Down
  0 siblings, 0 replies; 29+ messages in thread
From: Chris Down @ 2020-07-17 12:17 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Yang Shi, Michal Hocko, Shakeel Butt, Yang Shi,
	Roman Gushchin, Greg Thelen, Johannes Weiner, Vladimir Davydov,
	cgroups, linux-mm

Hi David,

David Rientjes writes:
>With the proposed anon_reclaimable, do you have any reliability concerns?
>This would be the amount of lazy freeable memory and memory that can be
>uncharged if compound pages from the deferred split queue are split under
>memory pressure.  It seems to be a very precise value (as slab_reclaimable
>already in memory.stat is), so I'm not sure why there is a reliability
>concern.  Maybe you can elaborate?

Ability to reclaim a page is largely about context at the time of reclaim. For
example, if you are running at the edge of swap, a metric that truly
describes "reclaimable memory" will contain vastly different numbers from one
second to the next as cluster and page availability increases and decreases. We
may also have to do things like look for youngness at reclaim time, so I'm not
convinced metrics like this make sense in the general case.

>Today, this information is indeed possible to calculate from userspace.
>The idea is to present this information that will be backwards compatible,
>however, as the kernel implementation changes.  When lazy freeable memory
>was added, for instance, userspace likely would not have preemptively been
>doing an "active_file + inactive_file - file" calculation to factor that
>in as reclaimable anon :)

I agree it's hard to calculate from userspace without assistance, but I also
think that exposing a highly nuanced and situational value to userspace is
generally a recipe for confusion. The user either knows mm internals and can
understand it, or doesn't and will probably only misunderstand it. There is a
non-zero cognitive cost to adding more metrics like this, which is why I'm
interested in knowing more about the userspace usage semantics intended :-)

>The example I gave earlier in the thread showed how dramatically different
>memory.current is before and after the introduction of deferred split
>queues.  Userspace sees ballooning memcg usage and alerts on it (suspects
>a memory leak, for example) when in reality this is purely reclaimable
>memory under pressure and is the result of a kernel implementation detail.

Again, I'm curious why this can't be solved by artificial workingset 
pressurisation and monitoring. Generally, the most reliable reclaim metrics 
come from operating reclaim itself.


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch] mm, memcg: provide an anon_reclaimable stat
@ 2020-07-17 14:39           ` Johannes Weiner
  0 siblings, 0 replies; 29+ messages in thread
From: Johannes Weiner @ 2020-07-17 14:39 UTC (permalink / raw)
  To: David Rientjes
  Cc: SeongJae Park, Andrew Morton, Yang Shi, Michal Hocko,
	Shakeel Butt, Yang Shi, Roman Gushchin, Greg Thelen,
	Vladimir Davydov, cgroups, linux-mm

On Thu, Jul 16, 2020 at 01:58:19PM -0700, David Rientjes wrote:
> @@ -1350,6 +1350,32 @@ static bool mem_cgroup_wait_acct_move(struct mem_cgroup *memcg)
>  	return false;
>  }
>  
> +/*
> + * Returns the amount of anon memory that is charged to the memcg that is
> + * reclaimable under memory pressure without swap, in pages.
> + */
> +static unsigned long memcg_anon_reclaimable(struct mem_cgroup *memcg)
> +{
> +	long deferred, lazyfree;
> +
> +	/*
> +	 * Deferred pages are charged anonymous pages that are on the LRU but
> +	 * are unmapped.  These compound pages are split under memory pressure.
> +	 */
> +	deferred = max_t(long, memcg_page_state(memcg, NR_ACTIVE_ANON) +
> +			       memcg_page_state(memcg, NR_INACTIVE_ANON) -
> +			       memcg_page_state(memcg, NR_ANON_MAPPED), 0);
> +	/*
> +	 * Lazyfree pages are charged clean anonymous pages that are on the file
> +	 * LRU and can be reclaimed under memory pressure.
> +	 */
> +	lazyfree = max_t(long, memcg_page_state(memcg, NR_ACTIVE_FILE) +
> +			       memcg_page_state(memcg, NR_INACTIVE_FILE) -
> +			       memcg_page_state(memcg, NR_FILE_PAGES), 0);

Unfortunately, we don't know if these have been reused after the
madvise until we actually do the rmap walk in page reclaim. All of
these could have dirty ptes and require swapout after all.

The MADV_FREE tradeoff was that the freed pages can get reused by
userspace without another context switch and tlb flush in the common
case, by exploiting the fact that the MMU sets the dirty bit for
us. The downside is that the kernel doesn't know what state these
pages are in until it takes a close-up look at them one by one.
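
A minimal userspace sketch of the reuse case described above, assuming a
Linux system whose libc exposes MADV_FREE.  The function name
madv_free_then_reuse() is hypothetical.  The advice is treated as a hint:
if madvise() fails (for example EINVAL on a kernel without MADV_FREE), the
demo continues and only records that fact in *advised.

```c
#define _DEFAULT_SOURCE
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Mark one anonymous page lazily freeable, then reuse it.
 * Returns the byte read back after the reuse, or -1 if mmap() fails.
 * *advised is set to 1 if the MADV_FREE advice was accepted, else 0.
 */
static int madv_free_then_reuse(int *advised)
{
	size_t len = (size_t)sysconf(_SC_PAGESIZE);
	unsigned char *p;
	int byte;

	*advised = 0;
	p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return -1;

	memset(p, 'A', len);		/* fault the page in and dirty it */

#ifdef MADV_FREE
	/* The kernel may now drop the page instead of swapping it out. */
	if (madvise(p, len, MADV_FREE) == 0)
		*advised = 1;
#endif

	/*
	 * Reuse: this store dirties the pte again without any system call.
	 * The kernel only finds out when reclaim walks the mappings, at
	 * which point the page can no longer simply be discarded.
	 */
	p[0] = 'B';
	byte = p[0];

	munmap(p, len);
	return byte;
}
```

Nothing in this sequence tells the kernel that the page was reused, which
is why a counter of pages on the file LRU cannot say how many of them are
still discardable.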


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch] mm, memcg: provide a stat to describe reclaimable memory
@ 2020-07-17 19:37           ` David Rientjes
  0 siblings, 0 replies; 29+ messages in thread
From: David Rientjes @ 2020-07-17 19:37 UTC (permalink / raw)
  To: Chris Down
  Cc: Andrew Morton, Yang Shi, Michal Hocko, Shakeel Butt, Yang Shi,
	Roman Gushchin, Greg Thelen, Johannes Weiner, Vladimir Davydov,
	cgroups, linux-mm

On Fri, 17 Jul 2020, Chris Down wrote:

> > With the proposed anon_reclaimable, do you have any reliability concerns?
> > This would be the amount of lazy freeable memory and memory that can be
> > uncharged if compound pages from the deferred split queue are split under
> > memory pressure.  It seems to be a very precise value (as slab_reclaimable
> > already in memory.stat is), so I'm not sure why there is a reliability
> > concern.  Maybe you can elaborate?
> 
> Ability to reclaim a page is largely about context at the time of reclaim. For
> example, if you are running at the edge of swap, a metric that truly
> describes "reclaimable memory" will contain vastly different numbers from one
> second to the next as cluster and page availability increases and decreases.
> We may also have to do things like look for youngness at reclaim time, so I'm
> not convinced metrics like this make sense in the general case.
...
> Again, I'm curious why this can't be solved by artificial workingset
> pressurisation and monitoring. Generally, the most reliable reclaim metrics
> come from operating reclaim itself.
> 

Perhaps this is best discussed in the context I gave in the earlier 
thread: imagine a thp-backed heap of 64MB and then a malloc implementation 
doing MADV_DONTNEED over all but one page in every one of these 
pageblocks.

On a 4.3 kernel, for example, memory.current for the heap segment is now 
(64MB / 2MB) * 4KB = 128KB because we have synchronous splitting and 
uncharging of the underlying hugepage.  On a 4.15 kernel, for example, 
memory.current is still 64MB because the underlying hugepages are still 
charged to the memcg due to deferred split queues.
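
The arithmetic in this example can be written out as a small C sketch.
The helper names are hypothetical; the 2MB huge page and 4KB base page
sizes are the ones used in the example, not values queried from a system.

```c
#define KB 1024ULL
#define MB (1024ULL * KB)

/*
 * Bytes still charged when one base page per huge page stays in use
 * and each huge page is split and uncharged synchronously.
 */
static unsigned long long charged_after_split(unsigned long long heap,
					      unsigned long long hpage,
					      unsigned long long page)
{
	return (heap / hpage) * page;
}

/*
 * Extra bytes that stay charged while the huge pages sit on the
 * deferred split queue and have not been split yet.
 */
static unsigned long long deferred_split_excess(unsigned long long heap,
						unsigned long long hpage,
						unsigned long long page)
{
	return heap - charged_after_split(heap, hpage, page);
}
```

For a 64MB heap, charged_after_split(64 * MB, 2 * MB, 4 * KB) is 128KB,
and deferred_split_excess() is the remaining 64MB - 128KB that stays
charged until the deferred split queue is processed.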

For any application that monitors this, pressurization is not going to 
help: the memory will be reclaimed under memcg pressure but we aren't 
facing that pressure yet.  Userspace could identify this as a memory leak 
unless we describe what anon memory is actually reclaimable in this 
context (including on systems without swap).  For any entity that uses 
this information to infer if new work can be scheduled in this memcg (the 
reason MemAvailable exists in /proc/meminfo at the system level), this is 
now dramatically skewed.

At worst, on a swapless system, this memory is seen from userspace as 
unreclaimable because it's charged anon.

Do you have other suggestions for how userspace can understand what anon 
is reclaimable in this context before encountering memory pressure?  If 
so, it may be a great alternative to this: I haven't been able to think of 
such a way other than an anon_reclaimable stat.


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch] mm, memcg: provide a stat to describe reclaimable memory
@ 2020-07-17 19:37           ` David Rientjes
  0 siblings, 0 replies; 29+ messages in thread
From: David Rientjes @ 2020-07-17 19:37 UTC (permalink / raw)
  To: Chris Down
  Cc: Andrew Morton, Yang Shi, Michal Hocko, Shakeel Butt, Yang Shi,
	Roman Gushchin, Greg Thelen, Johannes Weiner, Vladimir Davydov,
	cgroups-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg

On Fri, 17 Jul 2020, Chris Down wrote:

> > With the proposed anon_reclaimable, do you have any reliability concerns?
> > This would be the amount of lazy freeable memory and memory that can be
> > uncharged if compound pages from the deferred split queue are split under
> > memory pressure.  It seems to be a very precise value (as slab_reclaimable
> > already in memory.stat is), so I'm not sure why there is a reliability
> > concern.  Maybe you can elaborate?
> 
> Ability to reclaim a page is largely about context at the time of reclaim. For
> example, if you are running at the edge of swap, at a metric that truly
> describes "reclaimable memory" will contain vastly different numbers from one
> second to the next as cluster and page availability increases and decreases.
> We may also have to do things like look for youngness at reclaim time, so I'm
> not convinced metrics like this makes sense in the general case.
...
> Again, I'm curious why this can't be solved by artificial workingset
> pressurisation and monitoring. Generally, the most reliable reclaim metrics
> come from operating reclaim itself.
> 

Perhaps this is best discussed in the context I gave in the earlier 
thread: imagine a thp-backed heap of 64MB and then a malloc implementation 
doing MADV_DONTNEED over all but one page in every one of these 
pageblocks.

On a 4.3 kernel, for example, memory.current for the heap segment is now 
(64MB / 2MB) * 4KB = 128KB because we have synchronous splitting and 
uncharging of the underlying hugepage.  On a 4.15 kernel, for example, 
memory.current is still 64MB because the underlying hugepages are still 
charged to the memcg due to deferred split queues.

For any application that monitors this, pressurization is not going to 
help: the memory will be reclaimed under memcg pressure but we aren't 
facing that pressure yet.  Userspace could identify this as a memory leak 
unless we describe what anon memory is actually reclaimable in this 
context (including on systems without swap).  For any entity that uses 
this information to infer if new work can be scheduled in this memcg (the 
reason MemAvailable exists in /proc/meminfo at the system level), this is 
now dramatically skewed.

At worst, on a swapless system, this memory is seen from userspace as 
unreclaimable because it's charged anon.

Do you have other suggestions for how userspace can understand what anon 
is reclaimable in this context before encountering memory pressure?  If 
so, it may be a great alternative to this: I haven't been able to think of 
such a way other than an anon_reclaimable stat.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch] mm, memcg: provide a stat to describe reclaimable memory
@ 2020-07-20  7:37             ` Michal Hocko
  0 siblings, 0 replies; 29+ messages in thread
From: Michal Hocko @ 2020-07-20  7:37 UTC (permalink / raw)
  To: David Rientjes
  Cc: Chris Down, Andrew Morton, Yang Shi, Shakeel Butt, Yang Shi,
	Roman Gushchin, Greg Thelen, Johannes Weiner, Vladimir Davydov,
	cgroups, linux-mm

On Fri 17-07-20 12:37:57, David Rientjes wrote:
[...]
> On a 4.3 kernel, for example, memory.current for the heap segment is now 
> (64MB / 2MB) * 4KB = 128KB because we have synchronous splitting and 
> uncharging of the underlying hugepage.  On a 4.15 kernel, for example, 
> memory.current is still 64MB because the underlying hugepages are still 
> charged to the memcg due to deferred split queues.

Deferred THP split should be a kernel internal implementation
optimization and a detail that userspace shouldn't really be worrying
about. If there are user visible effects that are standing in the way
then we should reconsider how much the optimization is worth. I do not
really remember any actual numbers that would strongly justify its
existence while I do remember several problems that this has introduced.

So I am really wondering whether exporting subtle metrics to the
userspace which can lead to confusion is the right approach to the
problem you have at hand.

Also could you be more specific about the numbers we are talking about here?
E.g. what is the overall percentage of the "mis-accounted" split THPs
wrt. the high/max limit? Is the userspace relying on very precise
numbers?
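
A rough sketch of how such a percentage could be estimated from userspace
follows. It assumes the heuristic given in the patch description
(deferred = active_anon + inactive_anon - anon), which depends on kernel
implementation details and may not hold on every kernel. The function
names are hypothetical, and the limit is passed in by the caller because
memory.max may contain the string "max" rather than a number.

```python
def parse_memory_stat(text):
    """Parse 'key value' lines, as found in a memcg memory.stat file."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            stats[parts[0]] = int(parts[1])
    return stats


def deferred_split_percent(stats, limit_bytes):
    """Estimate charged-but-unmapped anon as a percentage of the limit.

    Uses the heuristic from the patch description; this is an
    approximation, not a value the kernel exports.
    """
    deferred = stats["active_anon"] + stats["inactive_anon"] - stats["anon"]
    deferred = max(deferred, 0)  # guard against racy, inconsistent reads
    return 100.0 * deferred / limit_bytes


sample = "anon 1048576\nactive_anon 2097152\ninactive_anon 1048576\n"
stats = parse_memory_stat(sample)
percent = deferred_split_percent(stats, 64 * 1024 * 1024)
```

With the made-up sample values above, 2MB of a 64MB limit is attributed
to deferred split memory.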
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 29+ messages in thread

end of thread, other threads:[~2020-07-20  7:37 UTC | newest]

Thread overview: 29+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-07-15  3:18 [patch] mm, memcg: provide a stat to describe reclaimable memory David Rientjes
2020-07-15  3:18 ` David Rientjes
2020-07-15  7:00 ` David Rientjes
2020-07-15  7:00   ` David Rientjes
2020-07-15  7:15   ` SeongJae Park
2020-07-15  7:15     ` SeongJae Park
2020-07-15 17:33     ` David Rientjes
2020-07-15 17:33       ` David Rientjes
2020-07-16 20:58       ` [patch] mm, memcg: provide an anon_reclaimable stat David Rientjes
2020-07-16 20:58         ` David Rientjes
2020-07-16 21:07         ` Shakeel Butt
2020-07-16 21:07           ` Shakeel Butt
2020-07-16 21:28           ` David Rientjes
2020-07-16 21:28             ` David Rientjes
2020-07-17  1:37             ` Shakeel Butt
2020-07-17  1:37               ` Shakeel Butt
2020-07-17  8:34         ` Michal Hocko
2020-07-17  8:34           ` Michal Hocko
2020-07-17 14:39         ` Johannes Weiner
2020-07-17 14:39           ` Johannes Weiner
2020-07-15 13:10 ` [patch] mm, memcg: provide a stat to describe reclaimable memory Chris Down
2020-07-15 13:10   ` Chris Down
     [not found]   ` <20200715131048.GA176092-6Bi1550iOqEnzZ6mRAm98g@public.gmane.org>
2020-07-15 18:02     ` David Rientjes
2020-07-17 12:17       ` Chris Down
2020-07-17 12:17         ` Chris Down
2020-07-17 19:37         ` David Rientjes
2020-07-17 19:37           ` David Rientjes
2020-07-20  7:37           ` Michal Hocko
2020-07-20  7:37             ` Michal Hocko
