* [RFC PATCH v2] mm: Add nodes= arg to memory.reclaim
@ 2022-11-30  2:03 Mina Almasry
  2022-11-30  8:44 ` Bagas Sanjaya
                   ` (3 more replies)
  0 siblings, 4 replies; 16+ messages in thread
From: Mina Almasry @ 2022-11-30  2:03 UTC (permalink / raw)
  To: Huang Ying, Yang Shi, Yosry Ahmed, Tim Chen, weixugc, shakeelb,
	gthelen, fvdl, Tejun Heo, Zefan Li, Johannes Weiner,
	Jonathan Corbet, Michal Hocko, Roman Gushchin, Muchun Song,
	Andrew Morton
  Cc: Mina Almasry, cgroups, linux-doc, linux-kernel, linux-mm

The nodes= arg instructs the kernel to only scan the given nodes for
proactive reclaim. As an example use case, consider a 2-tier memory system:

nodes 0,1 -> top tier
nodes 2,3 -> second tier

$ echo "1m nodes=0" > memory.reclaim

This instructs the kernel to attempt to reclaim 1m memory from node 0.
Since node 0 is a top tier node, demotion will be attempted first. This
is useful to direct proactive reclaim to specific nodes that are under
pressure.

$ echo "1m nodes=2,3" > memory.reclaim

This instructs the kernel to attempt to reclaim 1m of memory in the second
tier. Since this tier of memory has no demotion targets, the memory will be
reclaimed.

$ echo "1m nodes=0,1" > memory.reclaim

This instructs the kernel to reclaim memory from the top tier nodes, which
can be desirable according to the userspace policy if there is pressure on
the top tiers. Since these nodes have demotion targets, the kernel will
attempt demotion first.

Since commit 3f1509c57b1b ("Revert "mm/vmscan: never demote for memcg
reclaim""), the proactive reclaim interface memory.reclaim does both
reclaim and demotion. Reclaim and demotion incur different latency costs
to the jobs in the cgroup. Demoted memory is still addressable by
userspace, at a higher latency, but reclaimed memory incurs a page fault
when it is next accessed.

The 'nodes' arg allows userspace to control demotion and reclaim
independently according to its policy: if memory.reclaim is called on a
node with demotion targets, it will attempt demotion first; if it is
called on a node without demotion targets, it will only attempt reclaim.
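
For illustration, a userspace agent could drive the interface described
above roughly as follows. This is only a sketch and not part of the patch:
the helper names format_reclaim_request() and request_reclaim() are
hypothetical, and the caller is assumed to pass the path of the target
cgroup's memory.reclaim file.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper: format a request such as "1m nodes=0,1" into out.
 * nodes may be NULL or "" to omit the nodes= key.
 * Returns the string length, or -1 if the request does not fit.
 */
static int format_reclaim_request(char *out, size_t len,
                                  const char *amount, const char *nodes)
{
    int n;

    assert(out != NULL && amount != NULL);
    if (nodes != NULL && nodes[0] != '\0')
        n = snprintf(out, len, "%s nodes=%s", amount, nodes);
    else
        n = snprintf(out, len, "%s", amount);
    return (n < 0 || (size_t)n >= len) ? -1 : n;
}

/*
 * Hypothetical helper: write one request to a memory.reclaim file.
 * Returns 0 on success or a negative errno value. Per the documentation
 * in this patch, -EAGAIN means less was reclaimed than requested.
 */
static int request_reclaim(const char *path, const char *amount,
                           const char *nodes)
{
    char req[128];
    int len, fd, ret = 0;

    len = format_reclaim_request(req, sizeof(req), amount, nodes);
    if (len < 0)
        return -EINVAL;

    fd = open(path, O_WRONLY);
    if (fd < 0)
        return -errno;
    if (write(fd, req, (size_t)len) < 0)
        ret = -errno;           /* saved before close() can change errno */
    close(fd);
    return ret;
}
```

A policy that wants demotion first would call request_reclaim() with the
top tier nodes ("0,1" in the example topology), and one that wants plain
reclaim would pass the second tier nodes ("2,3").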

Signed-off-by: Mina Almasry <almasrymina@google.com>
---
 Documentation/admin-guide/cgroup-v2.rst | 15 +++---
 include/linux/swap.h                    |  3 +-
 mm/memcontrol.c                         | 67 ++++++++++++++++++++-----
 mm/vmscan.c                             |  4 +-
 4 files changed, 68 insertions(+), 21 deletions(-)

diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 74cec76be9f2..ac5fcbcd5ae6 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -1245,17 +1245,13 @@ PAGE_SIZE multiple when read back.
 	This is a simple interface to trigger memory reclaim in the
 	target cgroup.

-	This file accepts a single key, the number of bytes to reclaim.
-	No nested keys are currently supported.
+	This file accepts a string which contains the number of bytes to
+	reclaim.

 	Example::

 	  echo "1G" > memory.reclaim

-	The interface can be later extended with nested keys to
-	configure the reclaim behavior. For example, specify the
-	type of memory to reclaim from (anon, file, ..).
-
 	Please note that the kernel can over or under reclaim from
 	the target cgroup. If less bytes are reclaimed than the
 	specified amount, -EAGAIN is returned.
@@ -1267,6 +1263,13 @@ PAGE_SIZE multiple when read back.
 	This means that the networking layer will not adapt based on
 	reclaim induced by memory.reclaim.

+	This file also allows the user to specify the nodes to reclaim from,
+	via the 'nodes=' key, example::
+
+	  echo "1G nodes=0,1" > memory.reclaim
+
+	The above instructs the kernel to reclaim memory from nodes 0,1.
+
   memory.peak
 	A read-only single value file which exists on non-root
 	cgroups.
diff --git a/include/linux/swap.h b/include/linux/swap.h
index b61e2007d156..f542c114dffd 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -419,7 +419,8 @@ extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
 extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
 						  unsigned long nr_pages,
 						  gfp_t gfp_mask,
-						  unsigned int reclaim_options);
+						  unsigned int reclaim_options,
+						  nodemask_t nodemask);
 extern unsigned long mem_cgroup_shrink_node(struct mem_cgroup *mem,
 						gfp_t gfp_mask, bool noswap,
 						pg_data_t *pgdat,
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 23750cec0036..a0d7850173a9 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -63,6 +63,7 @@
 #include <linux/resume_user_mode.h>
 #include <linux/psi.h>
 #include <linux/seq_buf.h>
+#include <linux/parser.h>
 #include "internal.h"
 #include <net/sock.h>
 #include <net/ip.h>
@@ -2392,7 +2393,8 @@ static unsigned long reclaim_high(struct mem_cgroup *memcg,
 		psi_memstall_enter(&pflags);
 		nr_reclaimed += try_to_free_mem_cgroup_pages(memcg, nr_pages,
 							gfp_mask,
-							MEMCG_RECLAIM_MAY_SWAP);
+							MEMCG_RECLAIM_MAY_SWAP,
+							NODE_MASK_ALL);
 		psi_memstall_leave(&pflags);
 	} while ((memcg = parent_mem_cgroup(memcg)) &&
 		 !mem_cgroup_is_root(memcg));
@@ -2683,7 +2685,8 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,

 	psi_memstall_enter(&pflags);
 	nr_reclaimed = try_to_free_mem_cgroup_pages(mem_over_limit, nr_pages,
-						    gfp_mask, reclaim_options);
+						    gfp_mask, reclaim_options,
+						    NODE_MASK_ALL);
 	psi_memstall_leave(&pflags);

 	if (mem_cgroup_margin(mem_over_limit) >= nr_pages)
@@ -3503,7 +3506,8 @@ static int mem_cgroup_resize_max(struct mem_cgroup *memcg,
 		}

 		if (!try_to_free_mem_cgroup_pages(memcg, 1, GFP_KERNEL,
-					memsw ? 0 : MEMCG_RECLAIM_MAY_SWAP)) {
+					memsw ? 0 : MEMCG_RECLAIM_MAY_SWAP,
+					NODE_MASK_ALL)) {
 			ret = -EBUSY;
 			break;
 		}
@@ -3614,7 +3618,8 @@ static int mem_cgroup_force_empty(struct mem_cgroup *memcg)
 			return -EINTR;

 		if (!try_to_free_mem_cgroup_pages(memcg, 1, GFP_KERNEL,
-						  MEMCG_RECLAIM_MAY_SWAP))
+						  MEMCG_RECLAIM_MAY_SWAP,
+						  NODE_MASK_ALL))
 			nr_retries--;
 	}

@@ -6407,7 +6412,8 @@ static ssize_t memory_high_write(struct kernfs_open_file *of,
 		}

 		reclaimed = try_to_free_mem_cgroup_pages(memcg, nr_pages - high,
-					GFP_KERNEL, MEMCG_RECLAIM_MAY_SWAP);
+					GFP_KERNEL, MEMCG_RECLAIM_MAY_SWAP,
+					NODE_MASK_ALL);

 		if (!reclaimed && !nr_retries--)
 			break;
@@ -6456,7 +6462,8 @@ static ssize_t memory_max_write(struct kernfs_open_file *of,

 		if (nr_reclaims) {
 			if (!try_to_free_mem_cgroup_pages(memcg, nr_pages - max,
-					GFP_KERNEL, MEMCG_RECLAIM_MAY_SWAP))
+					GFP_KERNEL, MEMCG_RECLAIM_MAY_SWAP,
+					NODE_MASK_ALL))
 				nr_reclaims--;
 			continue;
 		}
@@ -6579,21 +6586,54 @@ static ssize_t memory_oom_group_write(struct kernfs_open_file *of,
 	return nbytes;
 }

+enum {
+	MEMORY_RECLAIM_NODES = 0,
+	MEMORY_RECLAIM_NULL,
+};
+
+static const match_table_t if_tokens = {
+	{ MEMORY_RECLAIM_NODES, "nodes=%s" },
+	{ MEMORY_RECLAIM_NULL, NULL },
+};
+
 static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf,
 			      size_t nbytes, loff_t off)
 {
 	struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
 	unsigned int nr_retries = MAX_RECLAIM_RETRIES;
 	unsigned long nr_to_reclaim, nr_reclaimed = 0;
-	unsigned int reclaim_options;
-	int err;
+	unsigned int reclaim_options = MEMCG_RECLAIM_MAY_SWAP |
+				       MEMCG_RECLAIM_PROACTIVE;
+	char *old_buf, *start;
+	substring_t args[MAX_OPT_ARGS];
+	int token;
+	char value[256];
+	nodemask_t nodemask = NODE_MASK_ALL;

 	buf = strstrip(buf);
-	err = page_counter_memparse(buf, "", &nr_to_reclaim);
-	if (err)
-		return err;

-	reclaim_options	= MEMCG_RECLAIM_MAY_SWAP | MEMCG_RECLAIM_PROACTIVE;
+	old_buf = buf;
+	nr_to_reclaim = memparse(buf, &buf) / PAGE_SIZE;
+	if (buf == old_buf)
+		return -EINVAL;
+
+	buf = strstrip(buf);
+
+	while ((start = strsep(&buf, " ")) != NULL) {
+		if (!strlen(start))
+			continue;
+		token = match_token(start, if_tokens, args);
+		match_strlcpy(value, args, sizeof(value));
+		switch (token) {
+		case MEMORY_RECLAIM_NODES:
+			if (nodelist_parse(value, nodemask) < 0)
+				return -EINVAL;
+			break;
+		default:
+			return -EINVAL;
+		}
+	}
+
 	while (nr_reclaimed < nr_to_reclaim) {
 		unsigned long reclaimed;

@@ -6610,7 +6650,8 @@ static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf,

 		reclaimed = try_to_free_mem_cgroup_pages(memcg,
 						nr_to_reclaim - nr_reclaimed,
-						GFP_KERNEL, reclaim_options);
+						GFP_KERNEL, reclaim_options,
+						nodemask);

 		if (!reclaimed && !nr_retries--)
 			return -EAGAIN;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 7b8e8e43806b..23fc5b523764 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -6735,7 +6735,8 @@ unsigned long mem_cgroup_shrink_node(struct mem_cgroup *memcg,
 unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
 					   unsigned long nr_pages,
 					   gfp_t gfp_mask,
-					   unsigned int reclaim_options)
+					   unsigned int reclaim_options,
+					   nodemask_t nodemask)
 {
 	unsigned long nr_reclaimed;
 	unsigned int noreclaim_flag;
@@ -6750,6 +6751,7 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
 		.may_unmap = 1,
 		.may_swap = !!(reclaim_options & MEMCG_RECLAIM_MAY_SWAP),
 		.proactive = !!(reclaim_options & MEMCG_RECLAIM_PROACTIVE),
+		.nodemask = &nodemask,
 	};
 	/*
 	 * Traverse the ZONELIST_FALLBACK zonelist of the current node to put
--
2.38.1.584.g0f3c55d4c2-goog
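
The parsing flow added to memory_reclaim() above (a size first, then
space-separated keys, with nodes= as the only key) can be approximated in
userspace as below. This is a sketch under assumptions, not the kernel
code: parse_size(), parse_nodelist() and parse_reclaim_request() are
hypothetical stand-ins for memparse(), nodelist_parse() and the
match_token() loop, the node mask is a plain 64-bit word, and the result
is kept in bytes rather than pages.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define SKETCH_MAX_NODES 64     /* assumption: mask fits in one uint64_t */

/* Parse "<number>[k|m|g]" into bytes and point *end past what was used. */
static unsigned long long parse_size(const char *s, char **end)
{
    unsigned long long v = strtoull(s, end, 0);

    switch (**end) {
    case 'g': case 'G': v <<= 10; /* fall through */
    case 'm': case 'M': v <<= 10; /* fall through */
    case 'k': case 'K': v <<= 10; (*end)++; break;
    default: break;
    }
    return v;
}

/* Parse a list such as "0,2-3" into a bitmask; 0 on success, -1 on error. */
static int parse_nodelist(const char *s, uint64_t *mask)
{
    *mask = 0;                  /* the parsed list replaces the old mask */
    for (;;) {
        char *end;
        unsigned long lo, hi;

        if (*s < '0' || *s > '9')
            return -1;
        lo = hi = strtoul(s, &end, 10);
        if (*end == '-') {
            s = end + 1;
            if (*s < '0' || *s > '9')
                return -1;
            hi = strtoul(s, &end, 10);
        }
        if (lo > hi || hi >= SKETCH_MAX_NODES)
            return -1;
        for (unsigned long n = lo; n <= hi; n++)
            *mask |= UINT64_C(1) << n;
        if (*end == '\0')
            return 0;
        if (*end != ',')
            return -1;
        s = end + 1;
    }
}

/*
 * Parse "<size> [nodes=<list>]" from a writable buffer. On success
 * returns 0 and fills *bytes and *mask (all nodes when nodes= is absent).
 */
static int parse_reclaim_request(char *buf, unsigned long long *bytes,
                                 uint64_t *mask)
{
    char *p;

    assert(buf != NULL && bytes != NULL && mask != NULL);
    *mask = UINT64_MAX;
    *bytes = parse_size(buf, &p);
    if (p == buf)
        return -1;              /* no size was given */

    while (*p != '\0') {
        char *tok;

        if (*p == ' ') {        /* skip empty tokens between spaces */
            p++;
            continue;
        }
        tok = p;
        p += strcspn(p, " ");
        if (*p != '\0')
            *p++ = '\0';        /* terminate this token in place */
        if (strncmp(tok, "nodes=", 6) != 0)
            return -1;          /* unknown key */
        if (parse_nodelist(tok + 6, mask) < 0)
            return -1;
    }
    return 0;
}
```

As in the patch, a request without a leading size or with an unrecognised
key is rejected, and omitting nodes= leaves every node eligible.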


* Re: [RFC PATCH v2] mm: Add nodes= arg to memory.reclaim
  2022-11-30  2:03 [RFC PATCH v2] mm: Add nodes= arg to memory.reclaim Mina Almasry
@ 2022-11-30  8:44 ` Bagas Sanjaya
  2022-11-30 19:45     ` Mina Almasry
  2022-12-01 14:49   ` Michal Hocko
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 16+ messages in thread
From: Bagas Sanjaya @ 2022-11-30  8:44 UTC (permalink / raw)
  To: Mina Almasry, Huang Ying, Yang Shi, Yosry Ahmed, Tim Chen,
	weixugc, shakeelb, gthelen, fvdl, Tejun Heo, Zefan Li,
	Johannes Weiner, Jonathan Corbet, Michal Hocko, Roman Gushchin,
	Muchun Song, Andrew Morton
  Cc: cgroups, linux-doc, linux-kernel, linux-mm

On 11/30/22 09:03, Mina Almasry wrote:
> -	This file accepts a single key, the number of bytes to reclaim.
> -	No nested keys are currently supported.
> +	This file accepts a string which contains the number of bytes to
> +	reclaim.
> 
Amount of memory to reclaim?

> +	This file also allows the user to specify the nodes to reclaim from,
> +	via the 'nodes=' key, example::
> +

"..., for example"

> +	  echo "1G nodes=0,1" > memory.reclaim
> +
> +	The above instructs the kernel to reclaim memory from nodes 0,1.
> +

Thanks.

-- 
An old man doll... just what I always wanted! - Clara



* Re: [RFC PATCH v2] mm: Add nodes= arg to memory.reclaim
@ 2022-11-30 19:45     ` Mina Almasry
  0 siblings, 0 replies; 16+ messages in thread
From: Mina Almasry @ 2022-11-30 19:45 UTC (permalink / raw)
  To: Bagas Sanjaya
  Cc: Huang Ying, Yang Shi, Yosry Ahmed, Tim Chen, weixugc, shakeelb,
	gthelen, fvdl, Tejun Heo, Zefan Li, Johannes Weiner,
	Jonathan Corbet, Michal Hocko, Roman Gushchin, Muchun Song,
	Andrew Morton, cgroups, linux-doc, linux-kernel, linux-mm

On Wed, Nov 30, 2022 at 12:44 AM Bagas Sanjaya <bagasdotme@gmail.com> wrote:
>
> On 11/30/22 09:03, Mina Almasry wrote:
> > -     This file accepts a single key, the number of bytes to reclaim.
> > -     No nested keys are currently supported.
> > +     This file accepts a string which contains the number of bytes to
> > +     reclaim.
> >
> Amount of memory to reclaim?
>

I want to have the word 'byte' in there somewhere to make that clear.
I guess maybe 'the amount of memory to reclaim in bytes'. Although as
written it seems more concise.

> > +     This file also allows the user to specify the nodes to reclaim from,
> > +     via the 'nodes=' key, example::
> > +
>
> "..., for example"
>

Will do in the next version. Thanks for taking a look, Bagas.

> > +       echo "1G nodes=0,1" > memory.reclaim
> > +
> > +     The above instructs the kernel to reclaim memory from nodes 0,1.
> > +
>
> Thanks.
>
> --
> An old man doll... just what I always wanted! - Clara
>


* Re: [RFC PATCH v2] mm: Add nodes= arg to memory.reclaim
@ 2022-12-01 14:49   ` Michal Hocko
  0 siblings, 0 replies; 16+ messages in thread
From: Michal Hocko @ 2022-12-01 14:49 UTC (permalink / raw)
  To: Mina Almasry
  Cc: Huang Ying, Yang Shi, Yosry Ahmed, Tim Chen, weixugc, shakeelb,
	gthelen, fvdl, Tejun Heo, Zefan Li, Johannes Weiner,
	Jonathan Corbet, Roman Gushchin, Muchun Song, Andrew Morton,
	cgroups, linux-doc, linux-kernel, linux-mm

On Tue 29-11-22 18:03:27, Mina Almasry wrote:
> The nodes= arg instructs the kernel to only scan the given nodes for
> proactive reclaim. For example use cases, consider a 2 tier memory system:
> 
> nodes 0,1 -> top tier
> nodes 2,3 -> second tier
> 
> $ echo "1m nodes=0" > memory.reclaim
> 
> This instructs the kernel to attempt to reclaim 1m memory from node 0.
> Since node 0 is a top tier node, demotion will be attempted first. This
> is useful to direct proactive reclaim to specific nodes that are under
> pressure.
> 
> $ echo "1m nodes=2,3" > memory.reclaim
> 
> This instructs the kernel to attempt to reclaim 1m memory in the second tier,
> since this tier of memory has no demotion targets the memory will be
> reclaimed.
> 
> $ echo "1m nodes=0,1" > memory.reclaim
> 
> Instructs the kernel to reclaim memory from the top tier nodes, which can
> be desirable according to the userspace policy if there is pressure on
> the top tiers. Since these nodes have demotion targets, the kernel will
> attempt demotion first.
> 
> Since commit 3f1509c57b1b ("Revert "mm/vmscan: never demote for memcg
> reclaim""), the proactive reclaim interface memory.reclaim does both
> reclaim and demotion. Reclaim and demotion incur different latency costs
> to the jobs in the cgroup. Demoted memory would still be addressable
> by the userspace at a higher latency, but reclaimed memory would need to
> incur a pagefault.
> 
> The 'nodes' arg is useful to allow the userspace to control demotion
> and reclaim independently according to its policy: if the memory.reclaim
> is called on a node with demotion targets, it will attempt demotion first;
> if it is called on a node without demotion targets, it will only attempt
> reclaim.
> 
> Signed-off-by: Mina Almasry <almasrymina@google.com>

Thanks for making this per node rather than tier based. This is a more
generic interface.

Acked-by: Michal Hocko <mhocko@suse.com>

Thanks!
> ---
>  Documentation/admin-guide/cgroup-v2.rst | 15 +++---
>  include/linux/swap.h                    |  3 +-
>  mm/memcontrol.c                         | 67 ++++++++++++++++++++-----
>  mm/vmscan.c                             |  4 +-
>  4 files changed, 68 insertions(+), 21 deletions(-)
> 
> diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
> index 74cec76be9f2..ac5fcbcd5ae6 100644
> --- a/Documentation/admin-guide/cgroup-v2.rst
> +++ b/Documentation/admin-guide/cgroup-v2.rst
> @@ -1245,17 +1245,13 @@ PAGE_SIZE multiple when read back.
>  	This is a simple interface to trigger memory reclaim in the
>  	target cgroup.
> 
> -	This file accepts a single key, the number of bytes to reclaim.
> -	No nested keys are currently supported.
> +	This file accepts a string which contains the number of bytes to
> +	reclaim.
> 
>  	Example::
> 
>  	  echo "1G" > memory.reclaim
> 
> -	The interface can be later extended with nested keys to
> -	configure the reclaim behavior. For example, specify the
> -	type of memory to reclaim from (anon, file, ..).
> -
>  	Please note that the kernel can over or under reclaim from
>  	the target cgroup. If less bytes are reclaimed than the
>  	specified amount, -EAGAIN is returned.
> @@ -1267,6 +1263,13 @@ PAGE_SIZE multiple when read back.
>  	This means that the networking layer will not adapt based on
>  	reclaim induced by memory.reclaim.
> 
> +	This file also allows the user to specify the nodes to reclaim from,
> +	via the 'nodes=' key, example::
> +
> +	  echo "1G nodes=0,1" > memory.reclaim
> +
> +	The above instructs the kernel to reclaim memory from nodes 0,1.
> +
>    memory.peak
>  	A read-only single value file which exists on non-root
>  	cgroups.
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index b61e2007d156..f542c114dffd 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -419,7 +419,8 @@ extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
>  extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
>  						  unsigned long nr_pages,
>  						  gfp_t gfp_mask,
> -						  unsigned int reclaim_options);
> +						  unsigned int reclaim_options,
> +						  nodemask_t nodemask);
>  extern unsigned long mem_cgroup_shrink_node(struct mem_cgroup *mem,
>  						gfp_t gfp_mask, bool noswap,
>  						pg_data_t *pgdat,
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 23750cec0036..a0d7850173a9 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -63,6 +63,7 @@
>  #include <linux/resume_user_mode.h>
>  #include <linux/psi.h>
>  #include <linux/seq_buf.h>
> +#include <linux/parser.h>
>  #include "internal.h"
>  #include <net/sock.h>
>  #include <net/ip.h>
> @@ -2392,7 +2393,8 @@ static unsigned long reclaim_high(struct mem_cgroup *memcg,
>  		psi_memstall_enter(&pflags);
>  		nr_reclaimed += try_to_free_mem_cgroup_pages(memcg, nr_pages,
>  							gfp_mask,
> -							MEMCG_RECLAIM_MAY_SWAP);
> +							MEMCG_RECLAIM_MAY_SWAP,
> +							NODE_MASK_ALL);
>  		psi_memstall_leave(&pflags);
>  	} while ((memcg = parent_mem_cgroup(memcg)) &&
>  		 !mem_cgroup_is_root(memcg));
> @@ -2683,7 +2685,8 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
> 
>  	psi_memstall_enter(&pflags);
>  	nr_reclaimed = try_to_free_mem_cgroup_pages(mem_over_limit, nr_pages,
> -						    gfp_mask, reclaim_options);
> +						    gfp_mask, reclaim_options,
> +						    NODE_MASK_ALL);
>  	psi_memstall_leave(&pflags);
> 
>  	if (mem_cgroup_margin(mem_over_limit) >= nr_pages)
> @@ -3503,7 +3506,8 @@ static int mem_cgroup_resize_max(struct mem_cgroup *memcg,
>  		}
> 
>  		if (!try_to_free_mem_cgroup_pages(memcg, 1, GFP_KERNEL,
> -					memsw ? 0 : MEMCG_RECLAIM_MAY_SWAP)) {
> +					memsw ? 0 : MEMCG_RECLAIM_MAY_SWAP,
> +					NODE_MASK_ALL)) {
>  			ret = -EBUSY;
>  			break;
>  		}
> @@ -3614,7 +3618,8 @@ static int mem_cgroup_force_empty(struct mem_cgroup *memcg)
>  			return -EINTR;
> 
>  		if (!try_to_free_mem_cgroup_pages(memcg, 1, GFP_KERNEL,
> -						  MEMCG_RECLAIM_MAY_SWAP))
> +						  MEMCG_RECLAIM_MAY_SWAP,
> +						  NODE_MASK_ALL))
>  			nr_retries--;
>  	}
> 
> @@ -6407,7 +6412,8 @@ static ssize_t memory_high_write(struct kernfs_open_file *of,
>  		}
> 
>  		reclaimed = try_to_free_mem_cgroup_pages(memcg, nr_pages - high,
> -					GFP_KERNEL, MEMCG_RECLAIM_MAY_SWAP);
> +					GFP_KERNEL, MEMCG_RECLAIM_MAY_SWAP,
> +					NODE_MASK_ALL);
> 
>  		if (!reclaimed && !nr_retries--)
>  			break;
> @@ -6456,7 +6462,8 @@ static ssize_t memory_max_write(struct kernfs_open_file *of,
> 
>  		if (nr_reclaims) {
>  			if (!try_to_free_mem_cgroup_pages(memcg, nr_pages - max,
> -					GFP_KERNEL, MEMCG_RECLAIM_MAY_SWAP))
> +					GFP_KERNEL, MEMCG_RECLAIM_MAY_SWAP,
> +					NODE_MASK_ALL))
>  				nr_reclaims--;
>  			continue;
>  		}
> @@ -6579,21 +6586,54 @@ static ssize_t memory_oom_group_write(struct kernfs_open_file *of,
>  	return nbytes;
>  }
> 
> +enum {
> +	MEMORY_RECLAIM_NODES = 0,
> +	MEMORY_RECLAIM_NULL,
> +};
> +
> +static const match_table_t if_tokens = {
> +	{ MEMORY_RECLAIM_NODES, "nodes=%s" },
> +	{ MEMORY_RECLAIM_NULL, NULL },
> +};
> +
>  static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf,
>  			      size_t nbytes, loff_t off)
>  {
>  	struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
>  	unsigned int nr_retries = MAX_RECLAIM_RETRIES;
>  	unsigned long nr_to_reclaim, nr_reclaimed = 0;
> -	unsigned int reclaim_options;
> -	int err;
> +	unsigned int reclaim_options = MEMCG_RECLAIM_MAY_SWAP |
> +				       MEMCG_RECLAIM_PROACTIVE;
> +	char *old_buf, *start;
> +	substring_t args[MAX_OPT_ARGS];
> +	int token;
> +	char value[256];
> +	nodemask_t nodemask = NODE_MASK_ALL;
> 
>  	buf = strstrip(buf);
> -	err = page_counter_memparse(buf, "", &nr_to_reclaim);
> -	if (err)
> -		return err;
> 
> -	reclaim_options	= MEMCG_RECLAIM_MAY_SWAP | MEMCG_RECLAIM_PROACTIVE;
> +	old_buf = buf;
> +	nr_to_reclaim = memparse(buf, &buf) / PAGE_SIZE;
> +	if (buf == old_buf)
> +		return -EINVAL;
> +
> +	buf = strstrip(buf);
> +
> +	while ((start = strsep(&buf, " ")) != NULL) {
> +		if (!strlen(start))
> +			continue;
> +		token = match_token(start, if_tokens, args);
> +		match_strlcpy(value, args, sizeof(value));
> +		switch (token) {
> +		case MEMORY_RECLAIM_NODES:
> +			if (nodelist_parse(value, nodemask) < 0)
> +				return -EINVAL;
> +			break;
> +		default:
> +			return -EINVAL;
> +		}
> +	}
> +
>  	while (nr_reclaimed < nr_to_reclaim) {
>  		unsigned long reclaimed;
> 
> @@ -6610,7 +6650,8 @@ static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf,
> 
>  		reclaimed = try_to_free_mem_cgroup_pages(memcg,
>  						nr_to_reclaim - nr_reclaimed,
> -						GFP_KERNEL, reclaim_options);
> +						GFP_KERNEL, reclaim_options,
> +						nodemask);
> 
>  		if (!reclaimed && !nr_retries--)
>  			return -EAGAIN;
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 7b8e8e43806b..23fc5b523764 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -6735,7 +6735,8 @@ unsigned long mem_cgroup_shrink_node(struct mem_cgroup *memcg,
>  unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
>  					   unsigned long nr_pages,
>  					   gfp_t gfp_mask,
> -					   unsigned int reclaim_options)
> +					   unsigned int reclaim_options,
> +					   nodemask_t nodemask)
>  {
>  	unsigned long nr_reclaimed;
>  	unsigned int noreclaim_flag;
> @@ -6750,6 +6751,7 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
>  		.may_unmap = 1,
>  		.may_swap = !!(reclaim_options & MEMCG_RECLAIM_MAY_SWAP),
>  		.proactive = !!(reclaim_options & MEMCG_RECLAIM_PROACTIVE),
> +		.nodemask = &nodemask,
>  	};
>  	/*
>  	 * Traverse the ZONELIST_FALLBACK zonelist of the current node to put
> --
> 2.38.1.584.g0f3c55d4c2-goog

-- 
Michal Hocko
SUSE Labs


> 
>  		if (!reclaimed && !nr_retries--)
>  			return -EAGAIN;
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 7b8e8e43806b..23fc5b523764 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -6735,7 +6735,8 @@ unsigned long mem_cgroup_shrink_node(struct mem_cgroup *memcg,
>  unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
>  					   unsigned long nr_pages,
>  					   gfp_t gfp_mask,
> -					   unsigned int reclaim_options)
> +					   unsigned int reclaim_options,
> +					   nodemask_t nodemask)
>  {
>  	unsigned long nr_reclaimed;
>  	unsigned int noreclaim_flag;
> @@ -6750,6 +6751,7 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
>  		.may_unmap = 1,
>  		.may_swap = !!(reclaim_options & MEMCG_RECLAIM_MAY_SWAP),
>  		.proactive = !!(reclaim_options & MEMCG_RECLAIM_PROACTIVE),
> +		.nodemask = &nodemask,
>  	};
>  	/*
>  	 * Traverse the ZONELIST_FALLBACK zonelist of the current node to put
> --
> 2.38.1.584.g0f3c55d4c2-goog

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC PATCH v2] mm: Add nodes= arg to memory.reclaim
  2022-11-30  2:03 [RFC PATCH v2] mm: Add nodes= arg to memory.reclaim Mina Almasry
  2022-11-30  8:44 ` Bagas Sanjaya
  2022-12-01 14:49   ` Michal Hocko
@ 2022-12-01 21:32 ` Shakeel Butt
  2022-12-01 22:10   ` Mina Almasry
  2022-12-02  3:25   ` Huang, Ying
  3 siblings, 1 reply; 16+ messages in thread
From: Shakeel Butt @ 2022-12-01 21:32 UTC (permalink / raw)
  To: Mina Almasry
  Cc: Huang Ying, Yang Shi, Yosry Ahmed, Tim Chen, weixugc, gthelen,
	fvdl, Tejun Heo, Zefan Li, Johannes Weiner, Jonathan Corbet,
	Michal Hocko, Roman Gushchin, Muchun Song, Andrew Morton,
	cgroups, linux-doc, linux-kernel, linux-mm

On Tue, Nov 29, 2022 at 06:03:27PM -0800, Mina Almasry wrote:
[...]
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 7b8e8e43806b..23fc5b523764 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -6735,7 +6735,8 @@ unsigned long mem_cgroup_shrink_node(struct mem_cgroup *memcg,
>  unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
>  					   unsigned long nr_pages,
>  					   gfp_t gfp_mask,
> -					   unsigned int reclaim_options)
> +					   unsigned int reclaim_options,
> +					   nodemask_t nodemask)

Can you please make this parameter a nodemask_t* and pass NULL instead
of NODE_MASK_ALL?
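
A minimal userspace sketch of the calling convention requested here,
assuming a NULL mask is treated as "no restriction". The type
sketch_nodemask_t and both helpers are hypothetical stand-ins, not the
kernel's nodemask_t API; they only illustrate how a pointer parameter
lets unrestricted callers pass NULL instead of building a full mask.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the kernel's nodemask_t: one bit per node. */
#define SKETCH_MAX_NODES 64

typedef struct {
	unsigned long long bits;
} sketch_nodemask_t;

/* True if reclaim may scan @node; a NULL mask means "all nodes". */
static bool sketch_node_allowed(const sketch_nodemask_t *mask,
				unsigned int node)
{
	if (node >= SKETCH_MAX_NODES)
		return false;
	if (!mask)
		return true;
	return (mask->bits >> node) & 1ULL;
}

/* Counts how many of the first @nr_nodes nodes a reclaim pass would scan. */
static unsigned int sketch_nodes_scanned(const sketch_nodemask_t *mask,
					 unsigned int nr_nodes)
{
	unsigned int node, scanned = 0;

	for (node = 0; node < nr_nodes; node++)
		if (sketch_node_allowed(mask, node))
			scanned++;

	return scanned;
}
```

With this shape, callers that do not restrict reclaim pass NULL, and only
the memory.reclaim path would need to construct a mask and pass its
address.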

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC PATCH v2] mm: Add nodes= arg to memory.reclaim
  2022-12-01 21:32 ` Shakeel Butt
@ 2022-12-01 22:10   ` Mina Almasry
  2022-12-02  6:04     ` Muchun Song
  0 siblings, 1 reply; 16+ messages in thread
From: Mina Almasry @ 2022-12-01 22:10 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Huang Ying, Yang Shi, Yosry Ahmed, Tim Chen, weixugc, gthelen,
	fvdl, Tejun Heo, Zefan Li, Johannes Weiner, Jonathan Corbet,
	Michal Hocko, Roman Gushchin, Muchun Song, Andrew Morton,
	cgroups, linux-doc, linux-kernel, linux-mm

On Thu, Dec 1, 2022 at 1:32 PM Shakeel Butt <shakeelb@google.com> wrote:
>
> On Tue, Nov 29, 2022 at 06:03:27PM -0800, Mina Almasry wrote:
> [...]
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 7b8e8e43806b..23fc5b523764 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -6735,7 +6735,8 @@ unsigned long mem_cgroup_shrink_node(struct mem_cgroup *memcg,
> >  unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
> >                                          unsigned long nr_pages,
> >                                          gfp_t gfp_mask,
> > -                                        unsigned int reclaim_options)
> > +                                        unsigned int reclaim_options,
> > +                                        nodemask_t nodemask)
>
> Can you please make this parameter a nodemask_t* and pass NULL instead
> of NODE_MASK_ALL?

Thank you very much for the review. Sure, I can do that in the next
version. To be honest, I considered it and made the parameter a plain
nodemask_t because I thought the call sites would be more readable,
i.e. this:

    try_to_free_mem_cgroup_pages(memcg, 1, GFP_KERNEL,
                                 MEMCG_RECLAIM_MAY_SWAP, NODE_MASK_ALL);

would be more readable than this:

    try_to_free_mem_cgroup_pages(memcg, 1, GFP_KERNEL,
                                 MEMCG_RECLAIM_MAY_SWAP, NULL);

The tradeoff is that the callers then need include/linux/nodemask.h.
But yes, I can fix this in the next version.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC PATCH v2] mm: Add nodes= arg to memory.reclaim
@ 2022-12-02  3:25   ` Huang, Ying
  0 siblings, 0 replies; 16+ messages in thread
From: Huang, Ying @ 2022-12-02  3:25 UTC (permalink / raw)
  To: Mina Almasry
  Cc: Yang Shi, Yosry Ahmed, Tim Chen, weixugc, shakeelb, gthelen,
	fvdl, Tejun Heo, Zefan Li, Johannes Weiner, Jonathan Corbet,
	Michal Hocko, Roman Gushchin, Muchun Song, Andrew Morton,
	cgroups, linux-doc, linux-kernel, linux-mm

Mina Almasry <almasrymina@google.com> writes:

> The nodes= arg instructs the kernel to only scan the given nodes for
> proactive reclaim. For example use cases, consider a 2 tier memory system:
>
> nodes 0,1 -> top tier
> nodes 2,3 -> second tier
>
> $ echo "1m nodes=0" > memory.reclaim
>
> This instructs the kernel to attempt to reclaim 1m memory from node 0.
> Since node 0 is a top tier node, demotion will be attempted first. This
> is useful to direct proactive reclaim to specific nodes that are under
> pressure.
>
> $ echo "1m nodes=2,3" > memory.reclaim
>
> This instructs the kernel to attempt to reclaim 1m memory in the second tier,
> since this tier of memory has no demotion targets the memory will be
> reclaimed.
>
> $ echo "1m nodes=0,1" > memory.reclaim
>
> Instructs the kernel to reclaim memory from the top tier nodes, which can
> be desirable according to the userspace policy if there is pressure on
> the top tiers. Since these nodes have demotion targets, the kernel will
> attempt demotion first.
>
> Since commit 3f1509c57b1b ("Revert "mm/vmscan: never demote for memcg
> reclaim""), the proactive reclaim interface memory.reclaim does both
> reclaim and demotion. Reclaim and demotion incur different latency costs
> to the jobs in the cgroup. Demoted memory would still be addressable
> by the userspace at a higher latency, but reclaimed memory would need to
> incur a pagefault.
>
> The 'nodes' arg is useful to allow the userspace to control demotion
> and reclaim independently according to its policy: if the memory.reclaim
> is called on a node with demotion targets, it will attempt demotion first;
> if it is called on a node without demotion targets, it will only attempt
> reclaim.
>
> Signed-off-by: Mina Almasry <almasrymina@google.com>
> ---
>  Documentation/admin-guide/cgroup-v2.rst | 15 +++---
>  include/linux/swap.h                    |  3 +-
>  mm/memcontrol.c                         | 67 ++++++++++++++++++++-----
>  mm/vmscan.c                             |  4 +-
>  4 files changed, 68 insertions(+), 21 deletions(-)
>
> diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
> index 74cec76be9f2..ac5fcbcd5ae6 100644
> --- a/Documentation/admin-guide/cgroup-v2.rst
> +++ b/Documentation/admin-guide/cgroup-v2.rst
> @@ -1245,17 +1245,13 @@ PAGE_SIZE multiple when read back.
>  	This is a simple interface to trigger memory reclaim in the
>  	target cgroup.
>
> -	This file accepts a single key, the number of bytes to reclaim.
> -	No nested keys are currently supported.
> +	This file accepts a string which contains the number of bytes to
> +	reclaim.
>
>  	Example::
>
>  	  echo "1G" > memory.reclaim
>
> -	The interface can be later extended with nested keys to
> -	configure the reclaim behavior. For example, specify the
> -	type of memory to reclaim from (anon, file, ..).
> -
>  	Please note that the kernel can over or under reclaim from
>  	the target cgroup. If less bytes are reclaimed than the
>  	specified amount, -EAGAIN is returned.
> @@ -1267,6 +1263,13 @@ PAGE_SIZE multiple when read back.
>  	This means that the networking layer will not adapt based on
>  	reclaim induced by memory.reclaim.
>
> +	This file also allows the user to specify the nodes to reclaim from,
> +	via the 'nodes=' key, example::
> +
> +	  echo "1G nodes=0,1" > memory.reclaim
> +
> +	The above instructs the kernel to reclaim memory from nodes 0,1.
> +
>    memory.peak
>  	A read-only single value file which exists on non-root
>  	cgroups.
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index b61e2007d156..f542c114dffd 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -419,7 +419,8 @@ extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
>  extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
>  						  unsigned long nr_pages,
>  						  gfp_t gfp_mask,
> -						  unsigned int reclaim_options);
> +						  unsigned int reclaim_options,
> +						  nodemask_t nodemask);
>  extern unsigned long mem_cgroup_shrink_node(struct mem_cgroup *mem,
>  						gfp_t gfp_mask, bool noswap,
>  						pg_data_t *pgdat,
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 23750cec0036..a0d7850173a9 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -63,6 +63,7 @@
>  #include <linux/resume_user_mode.h>
>  #include <linux/psi.h>
>  #include <linux/seq_buf.h>
> +#include <linux/parser.h>
>  #include "internal.h"
>  #include <net/sock.h>
>  #include <net/ip.h>
> @@ -2392,7 +2393,8 @@ static unsigned long reclaim_high(struct mem_cgroup *memcg,
>  		psi_memstall_enter(&pflags);
>  		nr_reclaimed += try_to_free_mem_cgroup_pages(memcg, nr_pages,
>  							gfp_mask,
> -							MEMCG_RECLAIM_MAY_SWAP);
> +							MEMCG_RECLAIM_MAY_SWAP,
> +							NODE_MASK_ALL);
>  		psi_memstall_leave(&pflags);
>  	} while ((memcg = parent_mem_cgroup(memcg)) &&
>  		 !mem_cgroup_is_root(memcg));
> @@ -2683,7 +2685,8 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
>
>  	psi_memstall_enter(&pflags);
>  	nr_reclaimed = try_to_free_mem_cgroup_pages(mem_over_limit, nr_pages,
> -						    gfp_mask, reclaim_options);
> +						    gfp_mask, reclaim_options,
> +						    NODE_MASK_ALL);
>  	psi_memstall_leave(&pflags);
>
>  	if (mem_cgroup_margin(mem_over_limit) >= nr_pages)
> @@ -3503,7 +3506,8 @@ static int mem_cgroup_resize_max(struct mem_cgroup *memcg,
>  		}
>
>  		if (!try_to_free_mem_cgroup_pages(memcg, 1, GFP_KERNEL,
> -					memsw ? 0 : MEMCG_RECLAIM_MAY_SWAP)) {
> +					memsw ? 0 : MEMCG_RECLAIM_MAY_SWAP,
> +					NODE_MASK_ALL)) {
>  			ret = -EBUSY;
>  			break;
>  		}
> @@ -3614,7 +3618,8 @@ static int mem_cgroup_force_empty(struct mem_cgroup *memcg)
>  			return -EINTR;
>
>  		if (!try_to_free_mem_cgroup_pages(memcg, 1, GFP_KERNEL,
> -						  MEMCG_RECLAIM_MAY_SWAP))
> +						  MEMCG_RECLAIM_MAY_SWAP,
> +						  NODE_MASK_ALL))
>  			nr_retries--;
>  	}
>
> @@ -6407,7 +6412,8 @@ static ssize_t memory_high_write(struct kernfs_open_file *of,
>  		}
>
>  		reclaimed = try_to_free_mem_cgroup_pages(memcg, nr_pages - high,
> -					GFP_KERNEL, MEMCG_RECLAIM_MAY_SWAP);
> +					GFP_KERNEL, MEMCG_RECLAIM_MAY_SWAP,
> +					NODE_MASK_ALL);
>
>  		if (!reclaimed && !nr_retries--)
>  			break;
> @@ -6456,7 +6462,8 @@ static ssize_t memory_max_write(struct kernfs_open_file *of,
>
>  		if (nr_reclaims) {
>  			if (!try_to_free_mem_cgroup_pages(memcg, nr_pages - max,
> -					GFP_KERNEL, MEMCG_RECLAIM_MAY_SWAP))
> +					GFP_KERNEL, MEMCG_RECLAIM_MAY_SWAP,
> +					NODE_MASK_ALL))
>  				nr_reclaims--;
>  			continue;
>  		}
> @@ -6579,21 +6586,54 @@ static ssize_t memory_oom_group_write(struct kernfs_open_file *of,
>  	return nbytes;
>  }
>
> +enum {
> +	MEMORY_RECLAIM_NODES = 0,
> +	MEMORY_RECLAIM_NULL,
> +};
> +
> +static const match_table_t if_tokens = {
> +	{ MEMORY_RECLAIM_NODES, "nodes=%s" },
> +	{ MEMORY_RECLAIM_NULL, NULL },
> +};
> +
>  static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf,
>  			      size_t nbytes, loff_t off)
>  {
>  	struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
>  	unsigned int nr_retries = MAX_RECLAIM_RETRIES;
>  	unsigned long nr_to_reclaim, nr_reclaimed = 0;
> -	unsigned int reclaim_options;
> -	int err;
> +	unsigned int reclaim_options = MEMCG_RECLAIM_MAY_SWAP |
> +				       MEMCG_RECLAIM_PROACTIVE;
> +	char *old_buf, *start;
> +	substring_t args[MAX_OPT_ARGS];
> +	int token;
> +	char value[256];
> +	nodemask_t nodemask = NODE_MASK_ALL;
>
>  	buf = strstrip(buf);
> -	err = page_counter_memparse(buf, "", &nr_to_reclaim);
> -	if (err)
> -		return err;
>
> -	reclaim_options	= MEMCG_RECLAIM_MAY_SWAP | MEMCG_RECLAIM_PROACTIVE;
> +	old_buf = buf;
> +	nr_to_reclaim = memparse(buf, &buf) / PAGE_SIZE;
> +	if (buf == old_buf)
> +		return -EINVAL;
> +
> +	buf = strstrip(buf);
> +
> +	while ((start = strsep(&buf, " ")) != NULL) {
> +		if (!strlen(start))
> +			continue;
> +		token = match_token(start, if_tokens, args);
> +		match_strlcpy(value, args, sizeof(value));

As I understand it, we don't need to copy the string, because strsep()
has already replaced the " " delimiter with '\0'.  Right?
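
A small userspace sketch of that point, under stated assumptions:
next_token() is a hypothetical helper that mimics strsep(&s, " ") using
only standard C, and find_nodes_value() is likewise made up for
illustration. Because the delimiter is overwritten in place, the value
after "nodes=" is already a NUL-terminated string inside the original
buffer and can be used without a copy.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical stand-in for strsep(&s, " "): returns the next token and
 * overwrites the delimiter that ends it with '\0', so the token is a
 * NUL-terminated string living inside the original buffer.
 */
static char *next_token(char **stringp)
{
	char *start = *stringp;
	char *end;

	if (!start)
		return NULL;

	end = start + strcspn(start, " ");
	if (*end) {
		*end = '\0';		/* delimiter replaced in place */
		*stringp = end + 1;
	} else {
		*stringp = NULL;	/* this was the last token */
	}
	return start;
}

/*
 * Returns a pointer to the value of the first "nodes=" key in @buf, or
 * NULL if there is none. The pointer aliases @buf; nothing is copied.
 */
static const char *find_nodes_value(char *buf)
{
	static const char key[] = "nodes=";
	char *tok;

	while ((tok = next_token(&buf)) != NULL) {
		if (strncmp(tok, key, sizeof(key) - 1) == 0)
			return tok + sizeof(key) - 1;
	}
	return NULL;
}
```

The returned pointer stays valid only as long as the caller's buffer
does, which is the price of skipping the copy.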

Best Regards,
Huang, Ying

> +		switch (token) {
> +		case MEMORY_RECLAIM_NODES:
> +			if (nodelist_parse(value, nodemask) < 0)
> +				return -EINVAL;
> +			break;
> +		default:
> +			return -EINVAL;
> +		}
> +	}
> +
>  	while (nr_reclaimed < nr_to_reclaim) {
>  		unsigned long reclaimed;
>
> @@ -6610,7 +6650,8 @@ static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf,
>
>  		reclaimed = try_to_free_mem_cgroup_pages(memcg,
>  						nr_to_reclaim - nr_reclaimed,
> -						GFP_KERNEL, reclaim_options);
> +						GFP_KERNEL, reclaim_options,
> +						nodemask);
>
>  		if (!reclaimed && !nr_retries--)
>  			return -EAGAIN;
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 7b8e8e43806b..23fc5b523764 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -6735,7 +6735,8 @@ unsigned long mem_cgroup_shrink_node(struct mem_cgroup *memcg,
>  unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
>  					   unsigned long nr_pages,
>  					   gfp_t gfp_mask,
> -					   unsigned int reclaim_options)
> +					   unsigned int reclaim_options,
> +					   nodemask_t nodemask)
>  {
>  	unsigned long nr_reclaimed;
>  	unsigned int noreclaim_flag;
> @@ -6750,6 +6751,7 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
>  		.may_unmap = 1,
>  		.may_swap = !!(reclaim_options & MEMCG_RECLAIM_MAY_SWAP),
>  		.proactive = !!(reclaim_options & MEMCG_RECLAIM_PROACTIVE),
> +		.nodemask = &nodemask,
>  	};
>  	/*
>  	 * Traverse the ZONELIST_FALLBACK zonelist of the current node to put
> --
> 2.38.1.584.g0f3c55d4c2-goog

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC PATCH v2] mm: Add nodes= arg to memory.reclaim
@ 2022-12-02  4:32     ` Mina Almasry
  0 siblings, 0 replies; 16+ messages in thread
From: Mina Almasry @ 2022-12-02  4:32 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Yang Shi, Yosry Ahmed, Tim Chen, weixugc, shakeelb, gthelen,
	fvdl, Tejun Heo, Zefan Li, Johannes Weiner, Jonathan Corbet,
	Michal Hocko, Roman Gushchin, Muchun Song, Andrew Morton,
	cgroups, linux-doc, linux-kernel, linux-mm

On Thu, Dec 1, 2022 at 7:26 PM Huang, Ying <ying.huang@intel.com> wrote:
>
> Mina Almasry <almasrymina@google.com> writes:
>
> > The nodes= arg instructs the kernel to only scan the given nodes for
> > proactive reclaim. For example use cases, consider a 2 tier memory system:
> >
> > nodes 0,1 -> top tier
> > nodes 2,3 -> second tier
> >
> > $ echo "1m nodes=0" > memory.reclaim
> >
> > This instructs the kernel to attempt to reclaim 1m memory from node 0.
> > Since node 0 is a top tier node, demotion will be attempted first. This
> > is useful to direct proactive reclaim to specific nodes that are under
> > pressure.
> >
> > $ echo "1m nodes=2,3" > memory.reclaim
> >
> > This instructs the kernel to attempt to reclaim 1m memory in the second tier,
> > since this tier of memory has no demotion targets the memory will be
> > reclaimed.
> >
> > $ echo "1m nodes=0,1" > memory.reclaim
> >
> > Instructs the kernel to reclaim memory from the top tier nodes, which can
> > be desirable according to the userspace policy if there is pressure on
> > the top tiers. Since these nodes have demotion targets, the kernel will
> > attempt demotion first.
> >
> > Since commit 3f1509c57b1b ("Revert "mm/vmscan: never demote for memcg
> > reclaim""), the proactive reclaim interface memory.reclaim does both
> > reclaim and demotion. Reclaim and demotion incur different latency costs
> > to the jobs in the cgroup. Demoted memory would still be addressable
> > by the userspace at a higher latency, but reclaimed memory would need to
> > incur a pagefault.
> >
> > The 'nodes' arg is useful to allow the userspace to control demotion
> > and reclaim independently according to its policy: if the memory.reclaim
> > is called on a node with demotion targets, it will attempt demotion first;
> > if it is called on a node without demotion targets, it will only attempt
> > reclaim.
> >
> > Signed-off-by: Mina Almasry <almasrymina@google.com>
> > ---
> >  Documentation/admin-guide/cgroup-v2.rst | 15 +++---
> >  include/linux/swap.h                    |  3 +-
> >  mm/memcontrol.c                         | 67 ++++++++++++++++++++-----
> >  mm/vmscan.c                             |  4 +-
> >  4 files changed, 68 insertions(+), 21 deletions(-)
> >
> > diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
> > index 74cec76be9f2..ac5fcbcd5ae6 100644
> > --- a/Documentation/admin-guide/cgroup-v2.rst
> > +++ b/Documentation/admin-guide/cgroup-v2.rst
> > @@ -1245,17 +1245,13 @@ PAGE_SIZE multiple when read back.
> >       This is a simple interface to trigger memory reclaim in the
> >       target cgroup.
> >
> > -     This file accepts a single key, the number of bytes to reclaim.
> > -     No nested keys are currently supported.
> > +     This file accepts a string which contains the number of bytes to
> > +     reclaim.
> >
> >       Example::
> >
> >         echo "1G" > memory.reclaim
> >
> > -     The interface can be later extended with nested keys to
> > -     configure the reclaim behavior. For example, specify the
> > -     type of memory to reclaim from (anon, file, ..).
> > -
> >       Please note that the kernel can over or under reclaim from
> >       the target cgroup. If less bytes are reclaimed than the
> >       specified amount, -EAGAIN is returned.
> > @@ -1267,6 +1263,13 @@ PAGE_SIZE multiple when read back.
> >       This means that the networking layer will not adapt based on
> >       reclaim induced by memory.reclaim.
> >
> > +     This file also allows the user to specify the nodes to reclaim from,
> > +     via the 'nodes=' key, example::
> > +
> > +       echo "1G nodes=0,1" > memory.reclaim
> > +
> > +     The above instructs the kernel to reclaim memory from nodes 0,1.
> > +
> >    memory.peak
> >       A read-only single value file which exists on non-root
> >       cgroups.
> > diff --git a/include/linux/swap.h b/include/linux/swap.h
> > index b61e2007d156..f542c114dffd 100644
> > --- a/include/linux/swap.h
> > +++ b/include/linux/swap.h
> > @@ -419,7 +419,8 @@ extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
> >  extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
> >                                                 unsigned long nr_pages,
> >                                                 gfp_t gfp_mask,
> > -                                               unsigned int reclaim_options);
> > +                                               unsigned int reclaim_options,
> > +                                               nodemask_t nodemask);
> >  extern unsigned long mem_cgroup_shrink_node(struct mem_cgroup *mem,
> >                                               gfp_t gfp_mask, bool noswap,
> >                                               pg_data_t *pgdat,
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index 23750cec0036..a0d7850173a9 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -63,6 +63,7 @@
> >  #include <linux/resume_user_mode.h>
> >  #include <linux/psi.h>
> >  #include <linux/seq_buf.h>
> > +#include <linux/parser.h>
> >  #include "internal.h"
> >  #include <net/sock.h>
> >  #include <net/ip.h>
> > @@ -2392,7 +2393,8 @@ static unsigned long reclaim_high(struct mem_cgroup *memcg,
> >               psi_memstall_enter(&pflags);
> >               nr_reclaimed += try_to_free_mem_cgroup_pages(memcg, nr_pages,
> >                                                       gfp_mask,
> > -                                                     MEMCG_RECLAIM_MAY_SWAP);
> > +                                                     MEMCG_RECLAIM_MAY_SWAP,
> > +                                                     NODE_MASK_ALL);
> >               psi_memstall_leave(&pflags);
> >       } while ((memcg = parent_mem_cgroup(memcg)) &&
> >                !mem_cgroup_is_root(memcg));
> > @@ -2683,7 +2685,8 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
> >
> >       psi_memstall_enter(&pflags);
> >       nr_reclaimed = try_to_free_mem_cgroup_pages(mem_over_limit, nr_pages,
> > -                                                 gfp_mask, reclaim_options);
> > +                                                 gfp_mask, reclaim_options,
> > +                                                 NODE_MASK_ALL);
> >       psi_memstall_leave(&pflags);
> >
> >       if (mem_cgroup_margin(mem_over_limit) >= nr_pages)
> > @@ -3503,7 +3506,8 @@ static int mem_cgroup_resize_max(struct mem_cgroup *memcg,
> >               }
> >
> >               if (!try_to_free_mem_cgroup_pages(memcg, 1, GFP_KERNEL,
> > -                                     memsw ? 0 : MEMCG_RECLAIM_MAY_SWAP)) {
> > +                                     memsw ? 0 : MEMCG_RECLAIM_MAY_SWAP,
> > +                                     NODE_MASK_ALL)) {
> >                       ret = -EBUSY;
> >                       break;
> >               }
> > @@ -3614,7 +3618,8 @@ static int mem_cgroup_force_empty(struct mem_cgroup *memcg)
> >                       return -EINTR;
> >
> >               if (!try_to_free_mem_cgroup_pages(memcg, 1, GFP_KERNEL,
> > -                                               MEMCG_RECLAIM_MAY_SWAP))
> > +                                               MEMCG_RECLAIM_MAY_SWAP,
> > +                                               NODE_MASK_ALL))
> >                       nr_retries--;
> >       }
> >
> > @@ -6407,7 +6412,8 @@ static ssize_t memory_high_write(struct kernfs_open_file *of,
> >               }
> >
> >               reclaimed = try_to_free_mem_cgroup_pages(memcg, nr_pages - high,
> > -                                     GFP_KERNEL, MEMCG_RECLAIM_MAY_SWAP);
> > +                                     GFP_KERNEL, MEMCG_RECLAIM_MAY_SWAP,
> > +                                     NODE_MASK_ALL);
> >
> >               if (!reclaimed && !nr_retries--)
> >                       break;
> > @@ -6456,7 +6462,8 @@ static ssize_t memory_max_write(struct kernfs_open_file *of,
> >
> >               if (nr_reclaims) {
> >                       if (!try_to_free_mem_cgroup_pages(memcg, nr_pages - max,
> > -                                     GFP_KERNEL, MEMCG_RECLAIM_MAY_SWAP))
> > +                                     GFP_KERNEL, MEMCG_RECLAIM_MAY_SWAP,
> > +                                     NODE_MASK_ALL))
> >                               nr_reclaims--;
> >                       continue;
> >               }
> > @@ -6579,21 +6586,54 @@ static ssize_t memory_oom_group_write(struct kernfs_open_file *of,
> >       return nbytes;
> >  }
> >
> > +enum {
> > +     MEMORY_RECLAIM_NODES = 0,
> > +     MEMORY_RECLAIM_NULL,
> > +};
> > +
> > +static const match_table_t if_tokens = {
> > +     { MEMORY_RECLAIM_NODES, "nodes=%s" },
> > +     { MEMORY_RECLAIM_NULL, NULL },
> > +};
> > +
> >  static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf,
> >                             size_t nbytes, loff_t off)
> >  {
> >       struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
> >       unsigned int nr_retries = MAX_RECLAIM_RETRIES;
> >       unsigned long nr_to_reclaim, nr_reclaimed = 0;
> > -     unsigned int reclaim_options;
> > -     int err;
> > +     unsigned int reclaim_options = MEMCG_RECLAIM_MAY_SWAP |
> > +                                    MEMCG_RECLAIM_PROACTIVE;
> > +     char *old_buf, *start;
> > +     substring_t args[MAX_OPT_ARGS];
> > +     int token;
> > +     char value[256];
> > +     nodemask_t nodemask = NODE_MASK_ALL;
> >
> >       buf = strstrip(buf);
> > -     err = page_counter_memparse(buf, "", &nr_to_reclaim);
> > -     if (err)
> > -             return err;
> >
> > -     reclaim_options = MEMCG_RECLAIM_MAY_SWAP | MEMCG_RECLAIM_PROACTIVE;
> > +     old_buf = buf;
> > +     nr_to_reclaim = memparse(buf, &buf) / PAGE_SIZE;
> > +     if (buf == old_buf)
> > +             return -EINVAL;
> > +
> > +     buf = strstrip(buf);
> > +
> > +     while ((start = strsep(&buf, " ")) != NULL) {
> > +             if (!strlen(start))
> > +                     continue;
> > +             token = match_token(start, if_tokens, args);
> > +             match_strlcpy(value, args, sizeof(value));
>
> Per my understanding, we don't need to copy the string, because strsep()
> has replaced " " with "\0".  Right?
>

Unless I'm missing something, I don't think this has anything to do
with strsep(). `args` is not a null-terminated string that can be
passed to nodelist_parse(). Instead it is a substring_t that
has args->from and args->to. To convert the substring_t to a
null-terminated string, I call match_strlcpy(). I think this is a
common pattern used in a few places.

I think args->to may point to '\0' because of how strsep() and
match_token() work internally, but I'm somewhat uncomfortable making
assumptions about the implementation of these functions here (it may
change in the future and break the assumption).

> Best Regards,
> Huang, Ying
>
> > +             switch (token) {
> > +             case MEMORY_RECLAIM_NODES:
> > +                     if (nodelist_parse(value, nodemask) < 0)
> > +                             return -EINVAL;
> > +                     break;
> > +             default:
> > +                     return -EINVAL;
> > +             }
> > +     }
> > +
> >       while (nr_reclaimed < nr_to_reclaim) {
> >               unsigned long reclaimed;
> >
> > @@ -6610,7 +6650,8 @@ static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf,
> >
> >               reclaimed = try_to_free_mem_cgroup_pages(memcg,
> >                                               nr_to_reclaim - nr_reclaimed,
> > -                                             GFP_KERNEL, reclaim_options);
> > +                                             GFP_KERNEL, reclaim_options,
> > +                                             nodemask);
> >
> >               if (!reclaimed && !nr_retries--)
> >                       return -EAGAIN;
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 7b8e8e43806b..23fc5b523764 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -6735,7 +6735,8 @@ unsigned long mem_cgroup_shrink_node(struct mem_cgroup *memcg,
> >  unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
> >                                          unsigned long nr_pages,
> >                                          gfp_t gfp_mask,
> > -                                        unsigned int reclaim_options)
> > +                                        unsigned int reclaim_options,
> > +                                        nodemask_t nodemask)
> >  {
> >       unsigned long nr_reclaimed;
> >       unsigned int noreclaim_flag;
> > @@ -6750,6 +6751,7 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
> >               .may_unmap = 1,
> >               .may_swap = !!(reclaim_options & MEMCG_RECLAIM_MAY_SWAP),
> >               .proactive = !!(reclaim_options & MEMCG_RECLAIM_PROACTIVE),
> > +             .nodemask = &nodemask,
> >       };
> >       /*
> >        * Traverse the ZONELIST_FALLBACK zonelist of the current node to put
> > --
> > 2.38.1.584.g0f3c55d4c2-goog
>

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC PATCH v2] mm: Add nodes= arg to memory.reclaim
  2022-12-01 22:10   ` Mina Almasry
@ 2022-12-02  6:04     ` Muchun Song
  2022-12-02  6:24         ` Mina Almasry
  0 siblings, 1 reply; 16+ messages in thread
From: Muchun Song @ 2022-12-02  6:04 UTC (permalink / raw)
  To: Mina Almasry
  Cc: Shakeel Butt, Huang Ying, Yang Shi, Yosry Ahmed, Tim Chen,
	weixugc, gthelen, fvdl, Tejun Heo, Zefan Li, Johannes Weiner,
	Jonathan Corbet, Michal Hocko, Roman Gushchin, Muchun Song,
	Andrew Morton, cgroups, Linux Doc Mailing List, linux-kernel,
	Linux Memory Management List



> On Dec 2, 2022, at 06:10, Mina Almasry <almasrymina@google.com> wrote:
> 
> On Thu, Dec 1, 2022 at 1:32 PM Shakeel Butt <shakeelb@google.com> wrote:
>> 
>> On Tue, Nov 29, 2022 at 06:03:27PM -0800, Mina Almasry wrote:
>> [...]
>>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>>> index 7b8e8e43806b..23fc5b523764 100644
>>> --- a/mm/vmscan.c
>>> +++ b/mm/vmscan.c
>>> @@ -6735,7 +6735,8 @@ unsigned long mem_cgroup_shrink_node(struct mem_cgroup *memcg,
>>> unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
>>>                                         unsigned long nr_pages,
>>>                                         gfp_t gfp_mask,
>>> -                                        unsigned int reclaim_options)
>>> +                                        unsigned int reclaim_options,
>>> +                                        nodemask_t nodemask)
>> 
>> Can you please make this parameter a nodemask_t* and pass NULL instead
>> of NODE_MASK_ALL?
> 
> Thank you very much for the review. I sure can in the next version. To
> be honest I thought about that and made the parameter nodemask_t
> because I thought the call sites would be more readable. I.e. this:
> 
>    try_to_free_mem_cgroup_pages(memcg, 1, GFP_KERNEL,
> MEMCG_RECLAIM_MAY_SWAP,  NODE_MASK_ALL);

nodemask_t is an array, which can be large depending on CONFIG_NODES_SHIFT.
I don't think passing a big array is an efficient way. So I agree with Shakeel.

Thanks.

> 
> Would be more readable than this:
> 
>    try_to_free_mem_cgroup_pages(memcg, 1, GFP_KERNEL,
> MEMCG_RECLAIM_MAY_SWAP,  NULL);
> 
> But the tradeoff is that the callers need include/linux/nodemask.h.
> But yes I can fix in the next version.
> 


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC PATCH v2] mm: Add nodes= arg to memory.reclaim
@ 2022-12-02  6:24         ` Mina Almasry
  0 siblings, 0 replies; 16+ messages in thread
From: Mina Almasry @ 2022-12-02  6:24 UTC (permalink / raw)
  To: Muchun Song
  Cc: Shakeel Butt, Huang Ying, Yang Shi, Yosry Ahmed, Tim Chen,
	weixugc, gthelen, fvdl, Tejun Heo, Zefan Li, Johannes Weiner,
	Jonathan Corbet, Michal Hocko, Roman Gushchin, Muchun Song,
	Andrew Morton, cgroups, Linux Doc Mailing List, linux-kernel,
	Linux Memory Management List

On Thu, Dec 1, 2022 at 10:05 PM Muchun Song <muchun.song@linux.dev> wrote:
>
>
>
> > On Dec 2, 2022, at 06:10, Mina Almasry <almasrymina@google.com> wrote:
> >
> > On Thu, Dec 1, 2022 at 1:32 PM Shakeel Butt <shakeelb@google.com> wrote:
> >>
> >> On Tue, Nov 29, 2022 at 06:03:27PM -0800, Mina Almasry wrote:
> >> [...]
> >>> diff --git a/mm/vmscan.c b/mm/vmscan.c
> >>> index 7b8e8e43806b..23fc5b523764 100644
> >>> --- a/mm/vmscan.c
> >>> +++ b/mm/vmscan.c
> >>> @@ -6735,7 +6735,8 @@ unsigned long mem_cgroup_shrink_node(struct mem_cgroup *memcg,
> >>> unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
> >>>                                         unsigned long nr_pages,
> >>>                                         gfp_t gfp_mask,
> >>> -                                        unsigned int reclaim_options)
> >>> +                                        unsigned int reclaim_options,
> >>> +                                        nodemask_t nodemask)
> >>
> >> Can you please make this parameter a nodemask_t* and pass NULL instead
> >> of NODE_MASK_ALL?
> >
> > Thank you very much for the review. I sure can in the next version. To
> > be honest I thought about that and made the parameter nodemask_t
> > because I thought the call sites would be more readable. I.e. this:
> >
> >    try_to_free_mem_cgroup_pages(memcg, 1, GFP_KERNEL,
> > MEMCG_RECLAIM_MAY_SWAP,  NODE_MASK_ALL);
>
> nodemask_t is an array, which can be large depending on CONFIG_NODES_SHIFT.
> I don't think passing a big array is an efficient way. So I agree with Shakeel.
>

Ah, yes, I think the nodemask_t ends up compiling to something like:

typedef struct {
    unsigned long name[BITS_TO_LONGS(MAX_NUMNODES)];
} nodemask_t;

If it were a bare array it would be passed by reference anyway, but since
it is a struct containing an array, the array does indeed get copied.
Sure, I will fix this in the next version.

> Thanks.
>
> >
> > Would be more readable than this:
> >
> >    try_to_free_mem_cgroup_pages(memcg, 1, GFP_KERNEL,
> > MEMCG_RECLAIM_MAY_SWAP,  NULL);
> >
> > But the tradeoff is that the callers need include/linux/nodemask.h.
> > But yes I can fix in the next version.
> >
>

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC PATCH v2] mm: Add nodes= arg to memory.reclaim
  2022-12-02  4:32     ` Mina Almasry
  (?)
@ 2022-12-05  1:45     ` Huang, Ying
  -1 siblings, 0 replies; 16+ messages in thread
From: Huang, Ying @ 2022-12-05  1:45 UTC (permalink / raw)
  To: Mina Almasry
  Cc: Yang Shi, Yosry Ahmed, Tim Chen, weixugc, shakeelb, gthelen,
	fvdl, Tejun Heo, Zefan Li, Johannes Weiner, Jonathan Corbet,
	Michal Hocko, Roman Gushchin, Muchun Song, Andrew Morton,
	cgroups, linux-doc, linux-kernel, linux-mm

Mina Almasry <almasrymina@google.com> writes:

> On Thu, Dec 1, 2022 at 7:26 PM Huang, Ying <ying.huang@intel.com> wrote:
>>
>> Mina Almasry <almasrymina@google.com> writes:
>>
>> > The nodes= arg instructs the kernel to only scan the given nodes for
>> > proactive reclaim. For example use cases, consider a 2 tier memory system:
>> >
>> > nodes 0,1 -> top tier
>> > nodes 2,3 -> second tier
>> >
>> > $ echo "1m nodes=0" > memory.reclaim
>> >
>> > This instructs the kernel to attempt to reclaim 1m memory from node 0.
>> > Since node 0 is a top tier node, demotion will be attempted first. This
>> > is useful to direct proactive reclaim to specific nodes that are under
>> > pressure.
>> >
>> > $ echo "1m nodes=2,3" > memory.reclaim
>> >
>> > This instructs the kernel to attempt to reclaim 1m memory in the second tier,
>> > since this tier of memory has no demotion targets the memory will be
>> > reclaimed.
>> >
>> > $ echo "1m nodes=0,1" > memory.reclaim
>> >
>> > Instructs the kernel to reclaim memory from the top tier nodes, which can
>> > be desirable according to the userspace policy if there is pressure on
>> > the top tiers. Since these nodes have demotion targets, the kernel will
>> > attempt demotion first.
>> >
>> > Since commit 3f1509c57b1b ("Revert "mm/vmscan: never demote for memcg
>> > reclaim""), the proactive reclaim interface memory.reclaim does both
>> > reclaim and demotion. Reclaim and demotion incur different latency costs
>> > to the jobs in the cgroup. Demoted memory would still be addressable
>> > by the userspace at a higher latency, but reclaimed memory would need to
>> > incur a pagefault.
>> >
>> > The 'nodes' arg is useful to allow the userspace to control demotion
>> > and reclaim independently according to its policy: if the memory.reclaim
>> > is called on a node with demotion targets, it will attempt demotion first;
>> > if it is called on a node without demotion targets, it will only attempt
>> > reclaim.
>> >
>> > Signed-off-by: Mina Almasry <almasrymina@google.com>
>> > ---
>> >  Documentation/admin-guide/cgroup-v2.rst | 15 +++---
>> >  include/linux/swap.h                    |  3 +-
>> >  mm/memcontrol.c                         | 67 ++++++++++++++++++++-----
>> >  mm/vmscan.c                             |  4 +-
>> >  4 files changed, 68 insertions(+), 21 deletions(-)
>> >
>> > diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
>> > index 74cec76be9f2..ac5fcbcd5ae6 100644
>> > --- a/Documentation/admin-guide/cgroup-v2.rst
>> > +++ b/Documentation/admin-guide/cgroup-v2.rst
>> > @@ -1245,17 +1245,13 @@ PAGE_SIZE multiple when read back.
>> >       This is a simple interface to trigger memory reclaim in the
>> >       target cgroup.
>> >
>> > -     This file accepts a single key, the number of bytes to reclaim.
>> > -     No nested keys are currently supported.
>> > +     This file accepts a string which contains the number of bytes to
>> > +     reclaim.
>> >
>> >       Example::
>> >
>> >         echo "1G" > memory.reclaim
>> >
>> > -     The interface can be later extended with nested keys to
>> > -     configure the reclaim behavior. For example, specify the
>> > -     type of memory to reclaim from (anon, file, ..).
>> > -
>> >       Please note that the kernel can over or under reclaim from
>> >       the target cgroup. If less bytes are reclaimed than the
>> >       specified amount, -EAGAIN is returned.
>> > @@ -1267,6 +1263,13 @@ PAGE_SIZE multiple when read back.
>> >       This means that the networking layer will not adapt based on
>> >       reclaim induced by memory.reclaim.
>> >
>> > +     This file also allows the user to specify the nodes to reclaim from,
>> > +     via the 'nodes=' key, example::
>> > +
>> > +       echo "1G nodes=0,1" > memory.reclaim
>> > +
>> > +     The above instructs the kernel to reclaim memory from nodes 0,1.
>> > +
>> >    memory.peak
>> >       A read-only single value file which exists on non-root
>> >       cgroups.
>> > diff --git a/include/linux/swap.h b/include/linux/swap.h
>> > index b61e2007d156..f542c114dffd 100644
>> > --- a/include/linux/swap.h
>> > +++ b/include/linux/swap.h
>> > @@ -419,7 +419,8 @@ extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
>> >  extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
>> >                                                 unsigned long nr_pages,
>> >                                                 gfp_t gfp_mask,
>> > -                                               unsigned int reclaim_options);
>> > +                                               unsigned int reclaim_options,
>> > +                                               nodemask_t nodemask);
>> >  extern unsigned long mem_cgroup_shrink_node(struct mem_cgroup *mem,
>> >                                               gfp_t gfp_mask, bool noswap,
>> >                                               pg_data_t *pgdat,
>> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> > index 23750cec0036..a0d7850173a9 100644
>> > --- a/mm/memcontrol.c
>> > +++ b/mm/memcontrol.c
>> > @@ -63,6 +63,7 @@
>> >  #include <linux/resume_user_mode.h>
>> >  #include <linux/psi.h>
>> >  #include <linux/seq_buf.h>
>> > +#include <linux/parser.h>
>> >  #include "internal.h"
>> >  #include <net/sock.h>
>> >  #include <net/ip.h>
>> > @@ -2392,7 +2393,8 @@ static unsigned long reclaim_high(struct mem_cgroup *memcg,
>> >               psi_memstall_enter(&pflags);
>> >               nr_reclaimed += try_to_free_mem_cgroup_pages(memcg, nr_pages,
>> >                                                       gfp_mask,
>> > -                                                     MEMCG_RECLAIM_MAY_SWAP);
>> > +                                                     MEMCG_RECLAIM_MAY_SWAP,
>> > +                                                     NODE_MASK_ALL);
>> >               psi_memstall_leave(&pflags);
>> >       } while ((memcg = parent_mem_cgroup(memcg)) &&
>> >                !mem_cgroup_is_root(memcg));
>> > @@ -2683,7 +2685,8 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
>> >
>> >       psi_memstall_enter(&pflags);
>> >       nr_reclaimed = try_to_free_mem_cgroup_pages(mem_over_limit, nr_pages,
>> > -                                                 gfp_mask, reclaim_options);
>> > +                                                 gfp_mask, reclaim_options,
>> > +                                                 NODE_MASK_ALL);
>> >       psi_memstall_leave(&pflags);
>> >
>> >       if (mem_cgroup_margin(mem_over_limit) >= nr_pages)
>> > @@ -3503,7 +3506,8 @@ static int mem_cgroup_resize_max(struct mem_cgroup *memcg,
>> >               }
>> >
>> >               if (!try_to_free_mem_cgroup_pages(memcg, 1, GFP_KERNEL,
>> > -                                     memsw ? 0 : MEMCG_RECLAIM_MAY_SWAP)) {
>> > +                                     memsw ? 0 : MEMCG_RECLAIM_MAY_SWAP,
>> > +                                     NODE_MASK_ALL)) {
>> >                       ret = -EBUSY;
>> >                       break;
>> >               }
>> > @@ -3614,7 +3618,8 @@ static int mem_cgroup_force_empty(struct mem_cgroup *memcg)
>> >                       return -EINTR;
>> >
>> >               if (!try_to_free_mem_cgroup_pages(memcg, 1, GFP_KERNEL,
>> > -                                               MEMCG_RECLAIM_MAY_SWAP))
>> > +                                               MEMCG_RECLAIM_MAY_SWAP,
>> > +                                               NODE_MASK_ALL))
>> >                       nr_retries--;
>> >       }
>> >
>> > @@ -6407,7 +6412,8 @@ static ssize_t memory_high_write(struct kernfs_open_file *of,
>> >               }
>> >
>> >               reclaimed = try_to_free_mem_cgroup_pages(memcg, nr_pages - high,
>> > -                                     GFP_KERNEL, MEMCG_RECLAIM_MAY_SWAP);
>> > +                                     GFP_KERNEL, MEMCG_RECLAIM_MAY_SWAP,
>> > +                                     NODE_MASK_ALL);
>> >
>> >               if (!reclaimed && !nr_retries--)
>> >                       break;
>> > @@ -6456,7 +6462,8 @@ static ssize_t memory_max_write(struct kernfs_open_file *of,
>> >
>> >               if (nr_reclaims) {
>> >                       if (!try_to_free_mem_cgroup_pages(memcg, nr_pages - max,
>> > -                                     GFP_KERNEL, MEMCG_RECLAIM_MAY_SWAP))
>> > +                                     GFP_KERNEL, MEMCG_RECLAIM_MAY_SWAP,
>> > +                                     NODE_MASK_ALL))
>> >                               nr_reclaims--;
>> >                       continue;
>> >               }
>> > @@ -6579,21 +6586,54 @@ static ssize_t memory_oom_group_write(struct kernfs_open_file *of,
>> >       return nbytes;
>> >  }
>> >
>> > +enum {
>> > +     MEMORY_RECLAIM_NODES = 0,
>> > +     MEMORY_RECLAIM_NULL,
>> > +};
>> > +
>> > +static const match_table_t if_tokens = {
>> > +     { MEMORY_RECLAIM_NODES, "nodes=%s" },
>> > +     { MEMORY_RECLAIM_NULL, NULL },
>> > +};
>> > +
>> >  static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf,
>> >                             size_t nbytes, loff_t off)
>> >  {
>> >       struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
>> >       unsigned int nr_retries = MAX_RECLAIM_RETRIES;
>> >       unsigned long nr_to_reclaim, nr_reclaimed = 0;
>> > -     unsigned int reclaim_options;
>> > -     int err;
>> > +     unsigned int reclaim_options = MEMCG_RECLAIM_MAY_SWAP |
>> > +                                    MEMCG_RECLAIM_PROACTIVE;
>> > +     char *old_buf, *start;
>> > +     substring_t args[MAX_OPT_ARGS];
>> > +     int token;
>> > +     char value[256];
>> > +     nodemask_t nodemask = NODE_MASK_ALL;
>> >
>> >       buf = strstrip(buf);
>> > -     err = page_counter_memparse(buf, "", &nr_to_reclaim);
>> > -     if (err)
>> > -             return err;
>> >
>> > -     reclaim_options = MEMCG_RECLAIM_MAY_SWAP | MEMCG_RECLAIM_PROACTIVE;
>> > +     old_buf = buf;
>> > +     nr_to_reclaim = memparse(buf, &buf) / PAGE_SIZE;
>> > +     if (buf == old_buf)
>> > +             return -EINVAL;
>> > +
>> > +     buf = strstrip(buf);
>> > +
>> > +     while ((start = strsep(&buf, " ")) != NULL) {
>> > +             if (!strlen(start))
>> > +                     continue;
>> > +             token = match_token(start, if_tokens, args);
>> > +             match_strlcpy(value, args, sizeof(value));
>>
>> Per my understanding, we don't need to copy the string, because strsep()
>> has replaced " " with "\0".  Right?
>>
>
> Unless I'm missing something I don't think this has anything to do
> with strsep(). `args` is not a null terminated string that can be
> passed to nodelist_parse(). Instead it is a struct substring_t that
> has args->to and args->from. To convert substring_t args to a null
> terminated string, I call match_strlcpy(). I think this is a common
> pattern done in a few places.
>
> I think args->to may point to '\0' because of how strsep() and
> match_token() work internally, but I'm somewhat uncomfortable making
> assumptions about the implementation of these functions here (it may
> change in the future and break the assumption).

At least for strsep(), the "\0" termination isn't just an implementation
detail.  As the comments for strsep() say, it has the same semantics as
the glibc2 version, and `man strsep` says that the delimiter in the
argument string is overwritten with "\0".

It's not so clear for match_token().  But I don't think we can get
anything other than the string after "nodes=".
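
A small user-space sketch of the strsep() point, assuming the libc
strsep() behaves as its man page describes. find_nodes_value() is a
hypothetical helper, not something from the patch, and it uses a plain
strncmp() prefix check as a stand-in for the kernel's match_token(). It
shows that each token returned by strsep() is already NUL-terminated, so
the text after "nodes=" can be used as a C string without a copy:

```c
#define _DEFAULT_SOURCE /* expose strsep() on glibc under strict -std= */
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: walk the space-separated options in buf and
 * return a pointer to the value following "nodes=", or NULL if the key
 * is absent. buf is modified in place: strsep() overwrites each
 * delimiter it consumes with '\0'.
 */
char *find_nodes_value(char *buf)
{
	static const char key[] = "nodes=";
	char *tok;

	while ((tok = strsep(&buf, " ")) != NULL) {
		if (*tok == '\0')
			continue; /* empty token from repeated spaces */
		if (strncmp(tok, key, sizeof(key) - 1) == 0)
			return tok + sizeof(key) - 1;
	}
	return NULL;
}
```

Whether match_token() in the kernel gives the same guarantee for
args->to is a separate question that this sketch does not answer.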

Best Regards,
Huang, Ying

>>
>> > +             switch (token) {
>> > +             case MEMORY_RECLAIM_NODES:
>> > +                     if (nodelist_parse(value, nodemask) < 0)
>> > +                             return -EINVAL;
>> > +                     break;
>> > +             default:
>> > +                     return -EINVAL;
>> > +             }
>> > +     }
>> > +
>> >       while (nr_reclaimed < nr_to_reclaim) {
>> >               unsigned long reclaimed;
>> >
>> > @@ -6610,7 +6650,8 @@ static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf,
>> >
>> >               reclaimed = try_to_free_mem_cgroup_pages(memcg,
>> >                                               nr_to_reclaim - nr_reclaimed,
>> > -                                             GFP_KERNEL, reclaim_options);
>> > +                                             GFP_KERNEL, reclaim_options,
>> > +                                             nodemask);
>> >
>> >               if (!reclaimed && !nr_retries--)
>> >                       return -EAGAIN;
>> > diff --git a/mm/vmscan.c b/mm/vmscan.c
>> > index 7b8e8e43806b..23fc5b523764 100644
>> > --- a/mm/vmscan.c
>> > +++ b/mm/vmscan.c
>> > @@ -6735,7 +6735,8 @@ unsigned long mem_cgroup_shrink_node(struct mem_cgroup *memcg,
>> >  unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
>> >                                          unsigned long nr_pages,
>> >                                          gfp_t gfp_mask,
>> > -                                        unsigned int reclaim_options)
>> > +                                        unsigned int reclaim_options,
>> > +                                        nodemask_t nodemask)
>> >  {
>> >       unsigned long nr_reclaimed;
>> >       unsigned int noreclaim_flag;
>> > @@ -6750,6 +6751,7 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
>> >               .may_unmap = 1,
>> >               .may_swap = !!(reclaim_options & MEMCG_RECLAIM_MAY_SWAP),
>> >               .proactive = !!(reclaim_options & MEMCG_RECLAIM_PROACTIVE),
>> > +             .nodemask = &nodemask,
>> >       };
>> >       /*
>> >        * Traverse the ZONELIST_FALLBACK zonelist of the current node to put
>> > --
>> > 2.38.1.584.g0f3c55d4c2-goog
>>


Thread overview:
2022-11-30  2:03 [RFC PATCH v2] mm: Add nodes= arg to memory.reclaim Mina Almasry
2022-11-30  8:44 ` Bagas Sanjaya
2022-11-30 19:45   ` Mina Almasry
2022-12-01 14:49 ` Michal Hocko
2022-12-01 21:32 ` Shakeel Butt
2022-12-01 22:10   ` Mina Almasry
2022-12-02  6:04     ` Muchun Song
2022-12-02  6:24       ` Mina Almasry
2022-12-02  3:25 ` Huang, Ying
2022-12-02  4:32   ` Mina Almasry
2022-12-05  1:45     ` Huang, Ying
