* [PATCH V4 0/2] Add swappiness argument to memory.reclaim
@ 2023-12-13  1:38 Dan Schatzberg
  2023-12-13  1:38 ` [PATCH V4 1/2] mm: add defines for min/max swappiness Dan Schatzberg
  2023-12-13  1:38 ` [PATCH V4 2/2] mm: add swappiness= arg to memory.reclaim Dan Schatzberg
  0 siblings, 2 replies; 13+ messages in thread
From: Dan Schatzberg @ 2023-12-13  1:38 UTC (permalink / raw)
  To: Johannes Weiner, Roman Gushchin, Yosry Ahmed, Huan Yang
  Cc: linux-kernel, cgroups, linux-mm, Tejun Heo, Zefan Li,
	Jonathan Corbet, Michal Hocko, Shakeel Butt, Muchun Song,
	Andrew Morton, David Hildenbrand, Matthew Wilcox, Chris Li,
	Kefeng Wang, Dan Schatzberg, Hugh Dickins, Yue Zhao

Changes since V3:
  * Added #define for MIN_SWAPPINESS and MAX_SWAPPINESS
  * Added explicit calls to mem_cgroup_swappiness

Changes since V2:
  * No functional change
  * Used int consistently rather than a pointer

Changes since V1:
  * Added documentation

This patch proposes augmenting the memory.reclaim interface with a
swappiness=<val> argument that overrides the swappiness value for that instance
of proactive reclaim.

Userspace proactive reclaimers use the memory.reclaim interface to trigger
reclaim. The memory.reclaim interface does not provide any way to affect
the balance of file vs anon reclaim during proactive reclaim. The only
approach is to adjust the vm.swappiness setting. However, there are a few
reasons to control the balance of file vs anon during proactive reclaim
separately from reactive reclaim:

* Swapout should be limited to manage SSD write endurance. In near-OOM
  situations we are fine with lots of swap-out to avoid OOMs. As these are
  typically rare events, they have relatively little impact on write endurance.
  However, proactive reclaim runs continuously and so its impact on SSD write
  endurance is more significant. Therefore it is desirable to control swap-out
  for proactive reclaim separately from reactive reclaim.

* Some userspace OOM killers like systemd-oomd[1] support OOM killing on swap
  exhaustion. This makes sense if the swap exhaustion is triggered due to
  reactive reclaim but less so if it is triggered due to proactive reclaim (e.g.
  one could see OOMs when free memory is ample but anon is just particularly
  cold). Therefore, it's desirable to have proactive reclaim reduce or stop
  swap-out before the threshold at which OOM killing occurs.

In the case of Meta's Senpai proactive reclaimer, we adjust vm.swappiness
before writes to memory.reclaim[2]. This has been in production for nearly two
years and has addressed our need to control proactive vs reactive reclaim
behavior, but it is still not ideal for a number of reasons:

* vm.swappiness is a global setting, adjusting it can race/interfere with other
  system administration that wishes to control vm.swappiness. In our case, we
  need to disable Senpai before adjusting vm.swappiness.

* vm.swappiness is stateful - so a crash or restart of Senpai can leave a
  misconfigured setting. This requires some additional management to record the
  "desired" setting and ensure Senpai always adjusts to it.

With this patch, we avoid these downsides of adjusting vm.swappiness globally.

Previously, this exact interface addition was proposed by Yosry[3]. In
response, Roman instead proposed an interface to specify precise
file/anon/slab reclaim amounts[4]. More recently, Huan proposed this as
well[5], and others similarly questioned whether this was the proper
interface.

Previous proposals sought to use this to allow proactive reclaimers to
effectively perform a custom reclaim algorithm by issuing proactive reclaim with
different settings to control file vs anon reclaim (e.g. to only reclaim anon
from some applications). Responses argued that adjusting swappiness is a poor
interface for custom reclaim.

In contrast, I argue in favor of a swappiness setting not as a way to implement
custom reclaim algorithms but rather to bias the balance of anon vs file due to
differences of proactive vs reactive reclaim. In this context, swappiness is the
existing interface for controlling this balance and this patch simply allows for
it to be configured differently for proactive vs reactive reclaim.

Specifying explicit amounts of anon vs file pages to reclaim feels
inappropriate for this purpose. Proactive reclaimers are unaware of the
relative age of file vs anon memory in a cgroup, which makes it difficult to
manage proactive reclaim of different memory pools. A proactive reclaimer
would need to issue some amount of anon reclaim attempts separately from file
reclaim attempts, which seems brittle given that it's difficult to observe
the impact.

[1]https://www.freedesktop.org/software/systemd/man/latest/systemd-oomd.service.html
[2]https://github.com/facebookincubator/oomd/blob/main/src/oomd/plugins/Senpai.cpp#L585-L598
[3]https://lore.kernel.org/linux-mm/CAJD7tkbDpyoODveCsnaqBBMZEkDvshXJmNdbk51yKSNgD7aGdg@mail.gmail.com/
[4]https://lore.kernel.org/linux-mm/YoPHtHXzpK51F%2F1Z@carbon/
[5]https://lore.kernel.org/lkml/20231108065818.19932-1-link@vivo.com/

Dan Schatzberg (2):
  mm: add defines for min/max swappiness
  mm: add swappiness= arg to memory.reclaim

 Documentation/admin-guide/cgroup-v2.rst | 19 +++++---
 include/linux/swap.h                    |  5 +-
 mm/memcontrol.c                         | 63 ++++++++++++++++++++-----
 mm/vmscan.c                             | 21 +++++----
 4 files changed, 80 insertions(+), 28 deletions(-)

-- 
2.34.1


^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH V4 1/2] mm: add defines for min/max swappiness
  2023-12-13  1:38 [PATCH V4 0/2] Add swappiness argument to memory.reclaim Dan Schatzberg
@ 2023-12-13  1:38 ` Dan Schatzberg
  2023-12-13  1:58   ` Huan Yang
  2023-12-13  1:38 ` [PATCH V4 2/2] mm: add swappiness= arg to memory.reclaim Dan Schatzberg
  1 sibling, 1 reply; 13+ messages in thread
From: Dan Schatzberg @ 2023-12-13  1:38 UTC (permalink / raw)
  To: Johannes Weiner, Roman Gushchin, Yosry Ahmed, Huan Yang
  Cc: linux-kernel, cgroups, linux-mm, Tejun Heo, Zefan Li,
	Jonathan Corbet, Michal Hocko, Shakeel Butt, Muchun Song,
	Andrew Morton, David Hildenbrand, Matthew Wilcox, Chris Li,
	Kefeng Wang, Dan Schatzberg, Yue Zhao, Hugh Dickins

We use the constants 0 and 200 in a few places in the mm code when
referring to the min and max swappiness. This patch adds MIN_SWAPPINESS
and MAX_SWAPPINESS #defines to improve clarity. There are no functional
changes.

Signed-off-by: Dan Schatzberg <schatzberg.dan@gmail.com>
---
 include/linux/swap.h |  2 ++
 mm/memcontrol.c      |  2 +-
 mm/vmscan.c          | 10 +++++-----
 3 files changed, 8 insertions(+), 6 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index f6dd6575b905..e2ab76c25b4a 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -407,6 +407,8 @@ extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
 
 #define MEMCG_RECLAIM_MAY_SWAP (1 << 1)
 #define MEMCG_RECLAIM_PROACTIVE (1 << 2)
+#define MIN_SWAPPINESS 0
+#define MAX_SWAPPINESS 200
 extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
 						  unsigned long nr_pages,
 						  gfp_t gfp_mask,
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 1c1061df9cd1..9748a6b88bb8 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4337,7 +4337,7 @@ static int mem_cgroup_swappiness_write(struct cgroup_subsys_state *css,
 {
 	struct mem_cgroup *memcg = mem_cgroup_from_css(css);
 
-	if (val > 200)
+	if (val > MAX_SWAPPINESS)
 		return -EINVAL;
 
 	if (!mem_cgroup_is_root(memcg))
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 506f8220c5fe..2aa671fe938b 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2403,7 +2403,7 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
 	ap = swappiness * (total_cost + 1);
 	ap /= anon_cost + 1;
 
-	fp = (200 - swappiness) * (total_cost + 1);
+	fp = (MAX_SWAPPINESS - swappiness) * (total_cost + 1);
 	fp /= file_cost + 1;
 
 	fraction[0] = ap;
@@ -4400,7 +4400,7 @@ static int get_type_to_scan(struct lruvec *lruvec, int swappiness, int *tier_idx
 {
 	int type, tier;
 	struct ctrl_pos sp, pv;
-	int gain[ANON_AND_FILE] = { swappiness, 200 - swappiness };
+	int gain[ANON_AND_FILE] = { swappiness, MAX_SWAPPINESS - swappiness };
 
 	/*
 	 * Compare the first tier of anon with that of file to determine which
@@ -4444,7 +4444,7 @@ static int isolate_folios(struct lruvec *lruvec, struct scan_control *sc, int sw
 		type = LRU_GEN_ANON;
 	else if (swappiness == 1)
 		type = LRU_GEN_FILE;
-	else if (swappiness == 200)
+	else if (swappiness == MAX_SWAPPINESS)
 		type = LRU_GEN_ANON;
 	else
 		type = get_type_to_scan(lruvec, swappiness, &tier);
@@ -5368,9 +5368,9 @@ static int run_cmd(char cmd, int memcg_id, int nid, unsigned long seq,
 
 	lruvec = get_lruvec(memcg, nid);
 
-	if (swappiness < 0)
+	if (swappiness < MIN_SWAPPINESS)
 		swappiness = get_swappiness(lruvec, sc);
-	else if (swappiness > 200)
+	else if (swappiness > MAX_SWAPPINESS)
 		goto done;
 
 	switch (cmd) {
-- 
2.34.1



* [PATCH V4 2/2] mm: add swappiness= arg to memory.reclaim
  2023-12-13  1:38 [PATCH V4 0/2] Add swappiness argument to memory.reclaim Dan Schatzberg
  2023-12-13  1:38 ` [PATCH V4 1/2] mm: add defines for min/max swappiness Dan Schatzberg
@ 2023-12-13  1:38 ` Dan Schatzberg
  2023-12-13  2:05   ` Yu Zhao
  2023-12-14  8:38   ` Michal Hocko
  1 sibling, 2 replies; 13+ messages in thread
From: Dan Schatzberg @ 2023-12-13  1:38 UTC (permalink / raw)
  To: Johannes Weiner, Roman Gushchin, Yosry Ahmed, Huan Yang
  Cc: linux-kernel, cgroups, linux-mm, Tejun Heo, Zefan Li,
	Jonathan Corbet, Michal Hocko, Shakeel Butt, Muchun Song,
	Andrew Morton, David Hildenbrand, Matthew Wilcox, Kefeng Wang,
	Dan Schatzberg, Yue Zhao, Hugh Dickins

Allow proactive reclaimers to submit an additional swappiness=<val>
argument to memory.reclaim. This overrides the global or per-memcg
swappiness setting for that reclaim attempt.

For example:

echo "2M swappiness=0" > /sys/fs/cgroup/memory.reclaim

will perform reclaim on the rootcg with a swappiness setting of 0 (no
swap) regardless of the vm.swappiness sysctl setting.

Signed-off-by: Dan Schatzberg <schatzberg.dan@gmail.com>
---
 Documentation/admin-guide/cgroup-v2.rst | 19 +++++---
 include/linux/swap.h                    |  3 +-
 mm/memcontrol.c                         | 61 ++++++++++++++++++++-----
 mm/vmscan.c                             | 11 +++--
 4 files changed, 72 insertions(+), 22 deletions(-)

diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 3f85254f3cef..06319349c072 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -1282,17 +1282,10 @@ PAGE_SIZE multiple when read back.
 	This is a simple interface to trigger memory reclaim in the
 	target cgroup.
 
-	This file accepts a single key, the number of bytes to reclaim.
-	No nested keys are currently supported.
-
 	Example::
 
 	  echo "1G" > memory.reclaim
 
-	The interface can be later extended with nested keys to
-	configure the reclaim behavior. For example, specify the
-	type of memory to reclaim from (anon, file, ..).
-
 	Please note that the kernel can over or under reclaim from
 	the target cgroup. If less bytes are reclaimed than the
 	specified amount, -EAGAIN is returned.
@@ -1304,6 +1297,18 @@ PAGE_SIZE multiple when read back.
 	This means that the networking layer will not adapt based on
 	reclaim induced by memory.reclaim.
 
+The following nested keys are defined.
+
+	  ==========		================================
+	  swappiness		Swappiness value to reclaim with
+	  ==========		================================
+
+	Specifying a swappiness value instructs the kernel to perform
+	the reclaim with that swappiness value. Note that this has the
+	same semantics as the vm.swappiness sysctl - it sets the
+	relative IO cost of reclaiming anon vs file memory but does
+	not allow for reclaiming specific amounts of anon or file memory.
+
   memory.peak
 	A read-only single value file which exists on non-root
 	cgroups.
diff --git a/include/linux/swap.h b/include/linux/swap.h
index e2ab76c25b4a..690898c347de 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -412,7 +412,8 @@ extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
 extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
 						  unsigned long nr_pages,
 						  gfp_t gfp_mask,
-						  unsigned int reclaim_options);
+						  unsigned int reclaim_options,
+						  int swappiness);
 extern unsigned long mem_cgroup_shrink_node(struct mem_cgroup *mem,
 						gfp_t gfp_mask, bool noswap,
 						pg_data_t *pgdat,
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 9748a6b88bb8..be3d5797d9df 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -63,6 +63,7 @@
 #include <linux/resume_user_mode.h>
 #include <linux/psi.h>
 #include <linux/seq_buf.h>
+#include <linux/parser.h>
 #include <linux/sched/isolation.h>
 #include "internal.h"
 #include <net/sock.h>
@@ -2449,7 +2450,8 @@ static unsigned long reclaim_high(struct mem_cgroup *memcg,
 		psi_memstall_enter(&pflags);
 		nr_reclaimed += try_to_free_mem_cgroup_pages(memcg, nr_pages,
 							gfp_mask,
-							MEMCG_RECLAIM_MAY_SWAP);
+							MEMCG_RECLAIM_MAY_SWAP,
+							mem_cgroup_swappiness(memcg));
 		psi_memstall_leave(&pflags);
 	} while ((memcg = parent_mem_cgroup(memcg)) &&
 		 !mem_cgroup_is_root(memcg));
@@ -2740,7 +2742,8 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
 
 	psi_memstall_enter(&pflags);
 	nr_reclaimed = try_to_free_mem_cgroup_pages(mem_over_limit, nr_pages,
-						    gfp_mask, reclaim_options);
+						    gfp_mask, reclaim_options,
+						    mem_cgroup_swappiness(memcg));
 	psi_memstall_leave(&pflags);
 
 	if (mem_cgroup_margin(mem_over_limit) >= nr_pages)
@@ -3660,7 +3663,8 @@ static int mem_cgroup_resize_max(struct mem_cgroup *memcg,
 		}
 
 		if (!try_to_free_mem_cgroup_pages(memcg, 1, GFP_KERNEL,
-					memsw ? 0 : MEMCG_RECLAIM_MAY_SWAP)) {
+					memsw ? 0 : MEMCG_RECLAIM_MAY_SWAP,
+					mem_cgroup_swappiness(memcg))) {
 			ret = -EBUSY;
 			break;
 		}
@@ -3774,7 +3778,8 @@ static int mem_cgroup_force_empty(struct mem_cgroup *memcg)
 			return -EINTR;
 
 		if (!try_to_free_mem_cgroup_pages(memcg, 1, GFP_KERNEL,
-						  MEMCG_RECLAIM_MAY_SWAP))
+						  MEMCG_RECLAIM_MAY_SWAP,
+						  mem_cgroup_swappiness(memcg)))
 			nr_retries--;
 	}
 
@@ -6720,7 +6725,8 @@ static ssize_t memory_high_write(struct kernfs_open_file *of,
 		}
 
 		reclaimed = try_to_free_mem_cgroup_pages(memcg, nr_pages - high,
-					GFP_KERNEL, MEMCG_RECLAIM_MAY_SWAP);
+					GFP_KERNEL, MEMCG_RECLAIM_MAY_SWAP,
+					mem_cgroup_swappiness(memcg));
 
 		if (!reclaimed && !nr_retries--)
 			break;
@@ -6769,7 +6775,8 @@ static ssize_t memory_max_write(struct kernfs_open_file *of,
 
 		if (nr_reclaims) {
 			if (!try_to_free_mem_cgroup_pages(memcg, nr_pages - max,
-					GFP_KERNEL, MEMCG_RECLAIM_MAY_SWAP))
+					GFP_KERNEL, MEMCG_RECLAIM_MAY_SWAP,
+					mem_cgroup_swappiness(memcg)))
 				nr_reclaims--;
 			continue;
 		}
@@ -6895,6 +6902,16 @@ static ssize_t memory_oom_group_write(struct kernfs_open_file *of,
 	return nbytes;
 }
 
+enum {
+	MEMORY_RECLAIM_SWAPPINESS = 0,
+	MEMORY_RECLAIM_NULL,
+};
+
+static const match_table_t tokens = {
+	{ MEMORY_RECLAIM_SWAPPINESS, "swappiness=%d"},
+	{ MEMORY_RECLAIM_NULL, NULL },
+};
+
 static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf,
 			      size_t nbytes, loff_t off)
 {
@@ -6902,12 +6919,33 @@ static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf,
 	unsigned int nr_retries = MAX_RECLAIM_RETRIES;
 	unsigned long nr_to_reclaim, nr_reclaimed = 0;
 	unsigned int reclaim_options;
-	int err;
+	char *old_buf, *start;
+	substring_t args[MAX_OPT_ARGS];
+	int swappiness = mem_cgroup_swappiness(memcg);
 
 	buf = strstrip(buf);
-	err = page_counter_memparse(buf, "", &nr_to_reclaim);
-	if (err)
-		return err;
+
+	old_buf = buf;
+	nr_to_reclaim = memparse(buf, &buf) / PAGE_SIZE;
+	if (buf == old_buf)
+		return -EINVAL;
+
+	buf = strstrip(buf);
+
+	while ((start = strsep(&buf, " ")) != NULL) {
+		if (!strlen(start))
+			continue;
+		switch (match_token(start, tokens, args)) {
+		case MEMORY_RECLAIM_SWAPPINESS:
+			if (match_int(&args[0], &swappiness))
+				return -EINVAL;
+			if (swappiness < MIN_SWAPPINESS || swappiness > MAX_SWAPPINESS)
+				return -EINVAL;
+			break;
+		default:
+			return -EINVAL;
+		}
+	}
 
 	reclaim_options	= MEMCG_RECLAIM_MAY_SWAP | MEMCG_RECLAIM_PROACTIVE;
 	while (nr_reclaimed < nr_to_reclaim) {
@@ -6926,7 +6964,8 @@ static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf,
 
 		reclaimed = try_to_free_mem_cgroup_pages(memcg,
 					min(nr_to_reclaim - nr_reclaimed, SWAP_CLUSTER_MAX),
-					GFP_KERNEL, reclaim_options);
+					GFP_KERNEL, reclaim_options,
+					swappiness);
 
 		if (!reclaimed && !nr_retries--)
 			return -EAGAIN;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 2aa671fe938b..cfef06c7c23f 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -136,6 +136,9 @@ struct scan_control {
 	/* Always discard instead of demoting to lower tier memory */
 	unsigned int no_demotion:1;
 
+	/* Swappiness value for reclaim */
+	int swappiness;
+
 	/* Allocation order */
 	s8 order;
 
@@ -2327,7 +2330,7 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
 	struct pglist_data *pgdat = lruvec_pgdat(lruvec);
 	struct mem_cgroup *memcg = lruvec_memcg(lruvec);
 	unsigned long anon_cost, file_cost, total_cost;
-	int swappiness = mem_cgroup_swappiness(memcg);
+	int swappiness = sc->swappiness;
 	u64 fraction[ANON_AND_FILE];
 	u64 denominator = 0;	/* gcc */
 	enum scan_balance scan_balance;
@@ -2608,7 +2611,7 @@ static int get_swappiness(struct lruvec *lruvec, struct scan_control *sc)
 	    mem_cgroup_get_nr_swap_pages(memcg) < MIN_LRU_BATCH)
 		return 0;
 
-	return mem_cgroup_swappiness(memcg);
+	return sc->swappiness;
 }
 
 static int get_nr_gens(struct lruvec *lruvec, int type)
@@ -6433,7 +6436,8 @@ unsigned long mem_cgroup_shrink_node(struct mem_cgroup *memcg,
 unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
 					   unsigned long nr_pages,
 					   gfp_t gfp_mask,
-					   unsigned int reclaim_options)
+					   unsigned int reclaim_options,
+					   int swappiness)
 {
 	unsigned long nr_reclaimed;
 	unsigned int noreclaim_flag;
@@ -6448,6 +6452,7 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
 		.may_unmap = 1,
 		.may_swap = !!(reclaim_options & MEMCG_RECLAIM_MAY_SWAP),
 		.proactive = !!(reclaim_options & MEMCG_RECLAIM_PROACTIVE),
+		.swappiness = swappiness,
 	};
 	/*
 	 * Traverse the ZONELIST_FALLBACK zonelist of the current node to put
-- 
2.34.1



* Re: [PATCH V4 1/2] mm: add defines for min/max swappiness
  2023-12-13  1:38 ` [PATCH V4 1/2] mm: add defines for min/max swappiness Dan Schatzberg
@ 2023-12-13  1:58   ` Huan Yang
  2023-12-13 16:41     ` Dan Schatzberg
  0 siblings, 1 reply; 13+ messages in thread
From: Huan Yang @ 2023-12-13  1:58 UTC (permalink / raw)
  To: Dan Schatzberg, Johannes Weiner, Roman Gushchin, Yosry Ahmed, Huan Yang
  Cc: linux-kernel, cgroups, linux-mm, Tejun Heo, Zefan Li,
	Jonathan Corbet, Michal Hocko, Shakeel Butt, Muchun Song,
	Andrew Morton, David Hildenbrand, Matthew Wilcox, Chris Li,
	Kefeng Wang, Yue Zhao, Hugh Dickins


On 2023/12/13 9:38, Dan Schatzberg wrote:
>
> We use the constants 0 and 200 in a few places in the mm code when
> referring to the min and max swappiness. This patch adds MIN_SWAPPINESS
> and MAX_SWAPPINESS #defines to improve clarity. There are no functional
> changes.
>
> Signed-off-by: Dan Schatzberg <schatzberg.dan@gmail.com>
> ---
>   include/linux/swap.h |  2 ++
>   mm/memcontrol.c      |  2 +-
>   mm/vmscan.c          | 10 +++++-----
>   3 files changed, 8 insertions(+), 6 deletions(-)
>
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index f6dd6575b905..e2ab76c25b4a 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -407,6 +407,8 @@ extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
>
>   #define MEMCG_RECLAIM_MAY_SWAP (1 << 1)
>   #define MEMCG_RECLAIM_PROACTIVE (1 << 2)
> +#define MIN_SWAPPINESS 0
> +#define MAX_SWAPPINESS 200

Does MAX_SWAPPINESS apply to all swappiness values? If so, it may be
better to also change the swappiness sysctl definition:
```
sysctl.c:

{
         .procname    = "swappiness",
         .data        = &vm_swappiness,
         .maxlen        = sizeof(vm_swappiness),
         .mode        = 0644,
         .proc_handler    = proc_dointvec_minmax,
         .extra1        = SYSCTL_ZERO,
         .extra2        = SYSCTL_TWO_HUNDRED,
     },

```

The sysctl here hard-codes swappiness to [0, 200], and now we add a new define.

And other code hard-codes 200 as the max swappiness value, like:

```
memcontrol.c:
static int mem_cgroup_swappiness_write(struct cgroup_subsys_state *css,
                        struct cftype *cft, u64 val)
{
     struct mem_cgroup *memcg = mem_cgroup_from_css(css);

     if (val > 200)
         return -EINVAL;

     if (!mem_cgroup_is_root(memcg))
         memcg->swappiness = val;
     else
         vm_swappiness = val;

     return 0;
}
vmscan.c:

/*
  * From 0 .. 200.  Higher means more swappy.
  */
int vm_swappiness = 60;

```


>   extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
>                                                    unsigned long nr_pages,
>                                                    gfp_t gfp_mask,
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 1c1061df9cd1..9748a6b88bb8 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -4337,7 +4337,7 @@ static int mem_cgroup_swappiness_write(struct cgroup_subsys_state *css,
>   {
>          struct mem_cgroup *memcg = mem_cgroup_from_css(css);
>
> -       if (val > 200)
> +       if (val > MAX_SWAPPINESS)
>                  return -EINVAL;
>
>          if (!mem_cgroup_is_root(memcg))
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 506f8220c5fe..2aa671fe938b 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2403,7 +2403,7 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
>          ap = swappiness * (total_cost + 1);
>          ap /= anon_cost + 1;
>
> -       fp = (200 - swappiness) * (total_cost + 1);
> +       fp = (MAX_SWAPPINESS - swappiness) * (total_cost + 1);
>          fp /= file_cost + 1;
>
>          fraction[0] = ap;
> @@ -4400,7 +4400,7 @@ static int get_type_to_scan(struct lruvec *lruvec, int swappiness, int *tier_idx
>   {
>          int type, tier;
>          struct ctrl_pos sp, pv;
> -       int gain[ANON_AND_FILE] = { swappiness, 200 - swappiness };
> +       int gain[ANON_AND_FILE] = { swappiness, MAX_SWAPPINESS - swappiness };
>
>          /*
>           * Compare the first tier of anon with that of file to determine which
> @@ -4444,7 +4444,7 @@ static int isolate_folios(struct lruvec *lruvec, struct scan_control *sc, int sw
>                  type = LRU_GEN_ANON;
>          else if (swappiness == 1)
>                  type = LRU_GEN_FILE;
> -       else if (swappiness == 200)
> +       else if (swappiness == MAX_SWAPPINESS)
>                  type = LRU_GEN_ANON;
>          else
>                  type = get_type_to_scan(lruvec, swappiness, &tier);
> @@ -5368,9 +5368,9 @@ static int run_cmd(char cmd, int memcg_id, int nid, unsigned long seq,
>
>          lruvec = get_lruvec(memcg, nid);
>
> -       if (swappiness < 0)
> +       if (swappiness < MIN_SWAPPINESS)
>                  swappiness = get_swappiness(lruvec, sc);
> -       else if (swappiness > 200)
> +       else if (swappiness > MAX_SWAPPINESS)
>                  goto done;
>
>          switch (cmd) {
> --
> 2.34.1
>


* Re: [PATCH V4 2/2] mm: add swappiness= arg to memory.reclaim
  2023-12-13  1:38 ` [PATCH V4 2/2] mm: add swappiness= arg to memory.reclaim Dan Schatzberg
@ 2023-12-13  2:05   ` Yu Zhao
  2023-12-13 16:38     ` Dan Schatzberg
  2023-12-14  8:38   ` Michal Hocko
  1 sibling, 1 reply; 13+ messages in thread
From: Yu Zhao @ 2023-12-13  2:05 UTC (permalink / raw)
  To: Dan Schatzberg
  Cc: Johannes Weiner, Roman Gushchin, Yosry Ahmed, Huan Yang,
	linux-kernel, cgroups, linux-mm, Tejun Heo, Zefan Li,
	Jonathan Corbet, Michal Hocko, Shakeel Butt, Muchun Song,
	Andrew Morton, David Hildenbrand, Matthew Wilcox, Kefeng Wang,
	Yue Zhao, Hugh Dickins

On Tue, Dec 12, 2023 at 6:39 PM Dan Schatzberg <schatzberg.dan@gmail.com> wrote:
>
> Allow proactive reclaimers to submit an additional swappiness=<val>
> argument to memory.reclaim. This overrides the global or per-memcg
> swappiness setting for that reclaim attempt.
>
> For example:
>
> echo "2M swappiness=0" > /sys/fs/cgroup/memory.reclaim
>
> will perform reclaim on the rootcg with a swappiness setting of 0 (no
> swap) regardless of the vm.swappiness sysctl setting.
>
> Signed-off-by: Dan Schatzberg <schatzberg.dan@gmail.com>

NAK.

Please initialize new variables properly and test code changes
thoroughly before submission.


* Re: [PATCH V4 2/2] mm: add swappiness= arg to memory.reclaim
  2023-12-13  2:05   ` Yu Zhao
@ 2023-12-13 16:38     ` Dan Schatzberg
  2023-12-14  4:28       ` Yosry Ahmed
  0 siblings, 1 reply; 13+ messages in thread
From: Dan Schatzberg @ 2023-12-13 16:38 UTC (permalink / raw)
  To: Yu Zhao
  Cc: Johannes Weiner, Roman Gushchin, Yosry Ahmed, Huan Yang,
	linux-kernel, cgroups, linux-mm, Tejun Heo, Zefan Li,
	Jonathan Corbet, Michal Hocko, Shakeel Butt, Muchun Song,
	Andrew Morton, David Hildenbrand, Matthew Wilcox, Kefeng Wang,
	Yue Zhao, Hugh Dickins

On Tue, Dec 12, 2023 at 07:05:36PM -0700, Yu Zhao wrote:
> On Tue, Dec 12, 2023 at 6:39 PM Dan Schatzberg <schatzberg.dan@gmail.com> wrote:
> >
> > Allow proactive reclaimers to submit an additional swappiness=<val>
> > argument to memory.reclaim. This overrides the global or per-memcg
> > swappiness setting for that reclaim attempt.
> >
> > For example:
> >
> > echo "2M swappiness=0" > /sys/fs/cgroup/memory.reclaim
> >
> > will perform reclaim on the rootcg with a swappiness setting of 0 (no
> > swap) regardless of the vm.swappiness sysctl setting.
> >
> > Signed-off-by: Dan Schatzberg <schatzberg.dan@gmail.com>
> 
> NAK.
> 
> Please initialize new variables properly and test code changes
> thoroughly before submission.

Could you be a bit more specific? The patch is compiling and working
locally but perhaps there's some configuration or behavior that I
haven't been testing.


* Re: [PATCH V4 1/2] mm: add defines for min/max swappiness
  2023-12-13  1:58   ` Huan Yang
@ 2023-12-13 16:41     ` Dan Schatzberg
  2023-12-14  2:00       ` Huan Yang
  0 siblings, 1 reply; 13+ messages in thread
From: Dan Schatzberg @ 2023-12-13 16:41 UTC (permalink / raw)
  To: Huan Yang
  Cc: Johannes Weiner, Roman Gushchin, Yosry Ahmed, Huan Yang,
	linux-kernel, cgroups, linux-mm, Tejun Heo, Zefan Li,
	Jonathan Corbet, Michal Hocko, Shakeel Butt, Muchun Song,
	Andrew Morton, David Hildenbrand, Matthew Wilcox, Chris Li,
	Kefeng Wang, Yue Zhao, Hugh Dickins

On Wed, Dec 13, 2023 at 09:58:26AM +0800, Huan Yang wrote:
> 
> On 2023/12/13 9:38, Dan Schatzberg wrote:
> > 
> > We use the constants 0 and 200 in a few places in the mm code when
> > referring to the min and max swappiness. This patch adds MIN_SWAPPINESS
> > and MAX_SWAPPINESS #defines to improve clarity. There are no functional
> > changes.
> > 
> > Signed-off-by: Dan Schatzberg <schatzberg.dan@gmail.com>
> > ---
> >   include/linux/swap.h |  2 ++
> >   mm/memcontrol.c      |  2 +-
> >   mm/vmscan.c          | 10 +++++-----
> >   3 files changed, 8 insertions(+), 6 deletions(-)
> > 
> > diff --git a/include/linux/swap.h b/include/linux/swap.h
> > index f6dd6575b905..e2ab76c25b4a 100644
> > --- a/include/linux/swap.h
> > +++ b/include/linux/swap.h
> > @@ -407,6 +407,8 @@ extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
> > 
> >   #define MEMCG_RECLAIM_MAY_SWAP (1 << 1)
> >   #define MEMCG_RECLAIM_PROACTIVE (1 << 2)
> > +#define MIN_SWAPPINESS 0
> > +#define MAX_SWAPPINESS 200
> 
> Does MAX_SWAPPINESS apply to all swappiness values? If so, it may be
> better to also change the swappiness sysctl definition:
> ```
> sysctl.c:
> 
> {
>         .procname    = "swappiness",
>         .data        = &vm_swappiness,
>         .maxlen        = sizeof(vm_swappiness),
>         .mode        = 0644,
>         .proc_handler    = proc_dointvec_minmax,
>         .extra1        = SYSCTL_ZERO,
>         .extra2        = SYSCTL_TWO_HUNDRED,
>     },
> 
> ```
> 
> The sysctl here hard-codes swappiness to [0, 200], and now we add a new define.

Yes, MAX_SWAPPINESS is a hard limit. I'm not sure what you're
proposing here - the SYSCTL_ZERO and SYSCTL_TWO_HUNDRED values are a
little different than the defines I added. I think most of the value
is just consistently using the defines in the core mm code.

> 
> And other code hard-codes 200 as the max swappiness value, like:
> 
> ```
> memcontrol.c:
> static int mem_cgroup_swappiness_write(struct cgroup_subsys_state *css,
>                        struct cftype *cft, u64 val)
> {
>     struct mem_cgroup *memcg = mem_cgroup_from_css(css);
> 
>     if (val > 200)
>         return -EINVAL;
> 
>     if (!mem_cgroup_is_root(memcg))
>         memcg->swappiness = val;
>     else
>         vm_swappiness = val;
> 
>     return 0;
> }

This one is already fixed in my patch.


* Re: [PATCH V4 1/2] mm: add defines for min/max swappiness
  2023-12-13 16:41     ` Dan Schatzberg
@ 2023-12-14  2:00       ` Huan Yang
  0 siblings, 0 replies; 13+ messages in thread
From: Huan Yang @ 2023-12-14  2:00 UTC (permalink / raw)
  To: Dan Schatzberg
  Cc: Johannes Weiner, Roman Gushchin, Yosry Ahmed, Huan Yang,
	linux-kernel, cgroups, linux-mm, Tejun Heo, Zefan Li,
	Jonathan Corbet, Michal Hocko, Shakeel Butt, Muchun Song,
	Andrew Morton, David Hildenbrand, Matthew Wilcox, Chris Li,
	Kefeng Wang, Yue Zhao, Hugh Dickins


On 2023/12/14 0:41, Dan Schatzberg wrote:
>
> On Wed, Dec 13, 2023 at 09:58:26AM +0800, Huan Yang wrote:
>> On 2023/12/13 9:38, Dan Schatzberg wrote:
>>>
>>> We use the constants 0 and 200 in a few places in the mm code when
>>> referring to the min and max swappiness. This patch adds MIN_SWAPPINESS
>>> and MAX_SWAPPINESS #defines to improve clarity. There are no functional
>>> changes.
>>>
>>> Signed-off-by: Dan Schatzberg <schatzberg.dan@gmail.com>
>>> ---
>>>    include/linux/swap.h |  2 ++
>>>    mm/memcontrol.c      |  2 +-
>>>    mm/vmscan.c          | 10 +++++-----
>>>    3 files changed, 8 insertions(+), 6 deletions(-)
>>>
>>> diff --git a/include/linux/swap.h b/include/linux/swap.h
>>> index f6dd6575b905..e2ab76c25b4a 100644
>>> --- a/include/linux/swap.h
>>> +++ b/include/linux/swap.h
>>> @@ -407,6 +407,8 @@ extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
>>>
>>>    #define MEMCG_RECLAIM_MAY_SWAP (1 << 1)
>>>    #define MEMCG_RECLAIM_PROACTIVE (1 << 2)
>>> +#define MIN_SWAPPINESS 0
>>> +#define MAX_SWAPPINESS 200
>> Does MAX_SWAPPINESS apply to all swappiness values? If so, maybe it's better
>> to change the swappiness sysctl definition:
>> ```
>> sysctl.c:
>>
>> {
>>          .procname    = "swappiness",
>>          .data        = &vm_swappiness,
>>          .maxlen        = sizeof(vm_swappiness),
>>          .mode        = 0644,
>>          .proc_handler    = proc_dointvec_minmax,
>>          .extra1        = SYSCTL_ZERO,
>>          .extra2        = SYSCTL_TWO_HUNDRED,
>>      },
>>
>> ```
>>
>> Here swappiness is hard-coded to [0, 200], and now a new define is added.
> Yes, MAX_SWAPPINESS is a hard limit. I'm not sure what you're
> proposing here - the SYSCTL_ZERO and SYSCTL_TWO_HUNDRED values are a
I mean, MAX_SWAPPINESS and SYSCTL_TWO_HUNDRED both limit swappiness,
but they are different definitions.

If the swappiness maximum ever changes from 200 to 400, you need to change
both extra2 here and your MAX_SWAPPINESS. It's weird.
> little different than the defines I added. I think most of the value
> is just consistently using the defines in the core mm code.
>
>> And a lot of other code hard-codes 200 as the max value of swappiness, like:
>>
>> ```
>> memcontrol.c:
>> static int mem_cgroup_swappiness_write(struct cgroup_subsys_state *css,
>>                         struct cftype *cft, u64 val)
>> {
>>      struct mem_cgroup *memcg = mem_cgroup_from_css(css);
>>
>>      if (val > 200)
>>          return -EINVAL;
>>
>>      if (!mem_cgroup_is_root(memcg))
>>          memcg->swappiness = val;
>>      else
>>          vm_swappiness = val;
>>
>>      return 0;
>> }
> This one is already fixed in my patch.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH V4 2/2] mm: add swapiness= arg to memory.reclaim
  2023-12-13 16:38     ` Dan Schatzberg
@ 2023-12-14  4:28       ` Yosry Ahmed
  0 siblings, 0 replies; 13+ messages in thread
From: Yosry Ahmed @ 2023-12-14  4:28 UTC (permalink / raw)
  To: Dan Schatzberg
  Cc: Yu Zhao, Johannes Weiner, Roman Gushchin, Huan Yang,
	linux-kernel, cgroups, linux-mm, Tejun Heo, Zefan Li,
	Jonathan Corbet, Michal Hocko, Shakeel Butt, Muchun Song,
	Andrew Morton, David Hildenbrand, Matthew Wilcox, Kefeng Wang,
	Yue Zhao, Hugh Dickins

On Wed, Dec 13, 2023 at 8:38 AM Dan Schatzberg <schatzberg.dan@gmail.com> wrote:
>
> On Tue, Dec 12, 2023 at 07:05:36PM -0700, Yu Zhao wrote:
> > On Tue, Dec 12, 2023 at 6:39 PM Dan Schatzberg <schatzberg.dan@gmail.com> wrote:
> > >
> > > Allow proactive reclaimers to submit an additional swappiness=<val>
> > > argument to memory.reclaim. This overrides the global or per-memcg
> > > swappiness setting for that reclaim attempt.
> > >
> > > For example:
> > >
> > > echo "2M swappiness=0" > /sys/fs/cgroup/memory.reclaim
> > >
> > > will perform reclaim on the rootcg with a swappiness setting of 0 (no
> > > swap) regardless of the vm.swappiness sysctl setting.
> > >
> > > Signed-off-by: Dan Schatzberg <schatzberg.dan@gmail.com>
> >
> > NAK.
> >
> > Please initialize new variables properly and test code changes
> > thoroughly before submission.
>
> Could you be a bit more specific? The patch is compiling and working
> locally but perhaps there's some configuration or behavior that I
> haven't been testing.

scan_control.swappiness is only initialized from
try_to_free_mem_cgroup_pages(), which means that swappiness is now 0
for all other types of reclaim, which can be a huge problem.

It might be easier to restore a special value (-1, 201, whatever) that
means "use mem_cgroup_swappiness()".

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH V4 2/2] mm: add swapiness= arg to memory.reclaim
  2023-12-13  1:38 ` [PATCH V4 2/2] mm: add swapiness= arg to memory.reclaim Dan Schatzberg
  2023-12-13  2:05   ` Yu Zhao
@ 2023-12-14  8:38   ` Michal Hocko
  2023-12-14 18:22     ` Dan Schatzberg
  1 sibling, 1 reply; 13+ messages in thread
From: Michal Hocko @ 2023-12-14  8:38 UTC (permalink / raw)
  To: Dan Schatzberg
  Cc: Johannes Weiner, Roman Gushchin, Yosry Ahmed, Huan Yang,
	linux-kernel, cgroups, linux-mm, Tejun Heo, Zefan Li,
	Jonathan Corbet, Shakeel Butt, Muchun Song, Andrew Morton,
	David Hildenbrand, Matthew Wilcox, Kefeng Wang, Yue Zhao,
	Hugh Dickins

On Tue 12-12-23 17:38:03, Dan Schatzberg wrote:
> Allow proactive reclaimers to submit an additional swappiness=<val>
> argument to memory.reclaim. This overrides the global or per-memcg
> swappiness setting for that reclaim attempt.

You are providing the use case in the cover letter, and Andrew usually
appends that to the first patch in the series. I think it would be
better to have the use case described here.

[...]
> @@ -1304,6 +1297,18 @@ PAGE_SIZE multiple when read back.
>  	This means that the networking layer will not adapt based on
>  	reclaim induced by memory.reclaim.
>  
> +The following nested keys are defined.
> +
> +	  ==========		================================
> +	  swappiness		Swappiness value to reclaim with
> +	  ==========		================================
> +
> +	Specifying a swappiness value instructs the kernel to perform
> +	the reclaim with that swappiness value. Note that this has the
> +	same semantics as the vm.swappiness sysctl - it sets the

same semantics as vm.swappiness applied to memcg reclaim with all the
existing limitations and potential future extensions.

> +	relative IO cost of reclaiming anon vs file memory but does
> +	not allow for reclaiming specific amounts of anon or file memory.
> +
>    memory.peak
>  	A read-only single value file which exists on non-root
>  	cgroups.

The biggest problem with the implementation I can see, and others have
pointed out the same, is how fragile this is. You really have to check
the code and _every_ place which defines scan_control to learn whether
mem_cgroup_shrink_node, reclaim_clean_pages_from_list,
reclaim_folio_list, lru_gen_seq_write, try_to_free_pages, balance_pgdat,
shrink_all_memory and __node_reclaim set up swappiness properly. I have
only checked a couple of them, like direct reclaim and kswapd, and none
of them really sets up swappiness to the default memcg or global value.
So you effectively end up with swappiness == 0.

While the review can point those out, it is quite easy to break and you
will only learn about that very indirectly. I think it would be easier
to review and maintain if you go with a pointer that falls back to
mem_cgroup_swappiness() if NULL, which will be the case for every
existing reclaimer except memory.reclaim with a swappiness value.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH V4 2/2] mm: add swapiness= arg to memory.reclaim
  2023-12-14  8:38   ` Michal Hocko
@ 2023-12-14 18:22     ` Dan Schatzberg
  2023-12-14 18:28       ` Yosry Ahmed
  2023-12-15  8:50       ` Michal Hocko
  0 siblings, 2 replies; 13+ messages in thread
From: Dan Schatzberg @ 2023-12-14 18:22 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Johannes Weiner, Roman Gushchin, Yosry Ahmed, Huan Yang,
	linux-kernel, cgroups, linux-mm, Tejun Heo, Zefan Li,
	Jonathan Corbet, Shakeel Butt, Muchun Song, Andrew Morton,
	David Hildenbrand, Matthew Wilcox, Kefeng Wang, Yue Zhao,
	Hugh Dickins

On Thu, Dec 14, 2023 at 09:38:55AM +0100, Michal Hocko wrote:
> On Tue 12-12-23 17:38:03, Dan Schatzberg wrote:
> > Allow proactive reclaimers to submit an additional swappiness=<val>
> > argument to memory.reclaim. This overrides the global or per-memcg
> > swappiness setting for that reclaim attempt.
> 
> You are providing the usecase in the cover letter and Andrew usually
> appends that to the first patch in the series. I think it would be
> better to have the usecase described here.
> 
> [...]
> > @@ -1304,6 +1297,18 @@ PAGE_SIZE multiple when read back.
> >  	This means that the networking layer will not adapt based on
> >  	reclaim induced by memory.reclaim.
> >  
> > +The following nested keys are defined.
> > +
> > +	  ==========		================================
> > +	  swappiness		Swappiness value to reclaim with
> > +	  ==========		================================
> > +
> > +	Specifying a swappiness value instructs the kernel to perform
> > +	the reclaim with that swappiness value. Note that this has the
> > +	same semantics as the vm.swappiness sysctl - it sets the
> 
> same semantics as vm.swappiness applied to memcg reclaim with all the
> existing limitations and potential future extensions.

Thanks, will make this change.

> 
> > +	relative IO cost of reclaiming anon vs file memory but does
> > +	not allow for reclaiming specific amounts of anon or file memory.
> > +
> >    memory.peak
> >  	A read-only single value file which exists on non-root
> >  	cgroups.
> 
> The biggest problem with the implementation I can see, and others have
> pointed out the same, is how fragile this is. You really have to check
> the code and _every_ place which defines scan_control to learn whether
> mem_cgroup_shrink_node, reclaim_clean_pages_from_list,
> reclaim_folio_list, lru_gen_seq_write, try_to_free_pages, balance_pgdat,
> shrink_all_memory and __node_reclaim set up swappiness properly. I have
> only checked a couple of them, like direct reclaim and kswapd, and none
> of them really sets up swappiness to the default memcg or global value.
> So you effectively end up with swappiness == 0.
> 
> While the review can point those out it is quite easy to break and you
> will only learn about that very indirectly. I think it would be easier
> to review and maintain if you go with a pointer that would fallback to
> mem_cgroup_swappiness() if NULL which will be the case for every
> existing reclaimer except memory.reclaim with swappiness value.

I agree. My initial implementation used a pointer for this
reason. I'll switch this back. Just to be clear - I still need to
initialize scan_control.swappiness in all these other places right? It
appears to mostly be stack-initialized which means it has
indeterminate value, not necessarily zero.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH V4 2/2] mm: add swapiness= arg to memory.reclaim
  2023-12-14 18:22     ` Dan Schatzberg
@ 2023-12-14 18:28       ` Yosry Ahmed
  2023-12-15  8:50       ` Michal Hocko
  1 sibling, 0 replies; 13+ messages in thread
From: Yosry Ahmed @ 2023-12-14 18:28 UTC (permalink / raw)
  To: Dan Schatzberg
  Cc: Michal Hocko, Johannes Weiner, Roman Gushchin, Huan Yang,
	linux-kernel, cgroups, linux-mm, Tejun Heo, Zefan Li,
	Jonathan Corbet, Shakeel Butt, Muchun Song, Andrew Morton,
	David Hildenbrand, Matthew Wilcox, Kefeng Wang, Yue Zhao,
	Hugh Dickins

On Thu, Dec 14, 2023 at 10:22 AM Dan Schatzberg
<schatzberg.dan@gmail.com> wrote:
>
> On Thu, Dec 14, 2023 at 09:38:55AM +0100, Michal Hocko wrote:
> > On Tue 12-12-23 17:38:03, Dan Schatzberg wrote:
> > > Allow proactive reclaimers to submit an additional swappiness=<val>
> > > argument to memory.reclaim. This overrides the global or per-memcg
> > > swappiness setting for that reclaim attempt.
> >
> > You are providing the usecase in the cover letter and Andrew usually
> > appends that to the first patch in the series. I think it would be
> > better to have the usecase described here.
> >
> > [...]
> > > @@ -1304,6 +1297,18 @@ PAGE_SIZE multiple when read back.
> > >     This means that the networking layer will not adapt based on
> > >     reclaim induced by memory.reclaim.
> > >
> > > +The following nested keys are defined.
> > > +
> > > +     ==========            ================================
> > > +     swappiness            Swappiness value to reclaim with
> > > +     ==========            ================================
> > > +
> > > +   Specifying a swappiness value instructs the kernel to perform
> > > +   the reclaim with that swappiness value. Note that this has the
> > > +   same semantics as the vm.swappiness sysctl - it sets the
> >
> > same semantics as vm.swappiness applied to memcg reclaim with all the
> > existing limitations and potential future extensions.
>
> Thanks, will make this change.
>
> >
> > > +   relative IO cost of reclaiming anon vs file memory but does
> > > +   not allow for reclaiming specific amounts of anon or file memory.
> > > +
> > >    memory.peak
> > >     A read-only single value file which exists on non-root
> > >     cgroups.
> >
> > The biggest problem with the implementation I can see, and others have
> > pointed out the same, is how fragile this is. You really have to check
> > the code and _every_ place which defines scan_control to learn whether
> > mem_cgroup_shrink_node, reclaim_clean_pages_from_list,
> > reclaim_folio_list, lru_gen_seq_write, try_to_free_pages, balance_pgdat,
> > shrink_all_memory and __node_reclaim set up swappiness properly. I have
> > only checked a couple of them, like direct reclaim and kswapd, and none
> > of them really sets up swappiness to the default memcg or global value.
> > So you effectively end up with swappiness == 0.
> >
> > While the review can point those out it is quite easy to break and you
> > will only learn about that very indirectly. I think it would be easier
> > to review and maintain if you go with a pointer that would fallback to
> > mem_cgroup_swappiness() if NULL which will be the case for every
> > existing reclaimer except memory.reclaim with swappiness value.
>
> I agree. My initial implementation used a pointer for this
> reason. I'll switch this back. Just to be clear - I still need to
> initialize scan_control.swappiness in all these other places right? It
> appears to mostly be stack-initialized which means it has
> indeterminate value, not necessarily zero.

My understanding is that in a partially initialized struct,
uninitialized members default to 0, but I am not a C expert :)

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH V4 2/2] mm: add swapiness= arg to memory.reclaim
  2023-12-14 18:22     ` Dan Schatzberg
  2023-12-14 18:28       ` Yosry Ahmed
@ 2023-12-15  8:50       ` Michal Hocko
  1 sibling, 0 replies; 13+ messages in thread
From: Michal Hocko @ 2023-12-15  8:50 UTC (permalink / raw)
  To: Dan Schatzberg
  Cc: Johannes Weiner, Roman Gushchin, Yosry Ahmed, Huan Yang,
	linux-kernel, cgroups, linux-mm, Tejun Heo, Zefan Li,
	Jonathan Corbet, Shakeel Butt, Muchun Song, Andrew Morton,
	David Hildenbrand, Matthew Wilcox, Kefeng Wang, Yue Zhao,
	Hugh Dickins

On Thu 14-12-23 13:22:29, Dan Schatzberg wrote:
> On Thu, Dec 14, 2023 at 09:38:55AM +0100, Michal Hocko wrote:
[...]
> > While the review can point those out it is quite easy to break and you
> > will only learn about that very indirectly. I think it would be easier
> > to review and maintain if you go with a pointer that would fallback to
> > mem_cgroup_swappiness() if NULL which will be the case for every
> > existing reclaimer except memory.reclaim with swappiness value.
> 
> I agree. My initial implementation used a pointer for this
> reason. I'll switch this back. Just to be clear - I still need to
> initialize scan_control.swappiness in all these other places right?

No. They will just get initialized to 0. All you need to make sure is
that the swappiness is used consistently. I would propose something like
scan_control_swappiness() (or sc_swappiness) which would actually do

	if (sc->swappiness)
		return sc->swappiness;
	return mem_cgroup_swappiness(sc->target_mem_cgroup);

and then make sure that mem_cgroup_swappiness is never used directly.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2023-12-15  8:50 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-12-13  1:38 [PATCH V4 0/1] Add swappiness argument to memory.reclaim Dan Schatzberg
2023-12-13  1:38 ` [PATCH V4 1/2] mm: add defines for min/max swappiness Dan Schatzberg
2023-12-13  1:58   ` Huan Yang
2023-12-13 16:41     ` Dan Schatzberg
2023-12-14  2:00       ` Huan Yang
2023-12-13  1:38 ` [PATCH V4 2/2] mm: add swapiness= arg to memory.reclaim Dan Schatzberg
2023-12-13  2:05   ` Yu Zhao
2023-12-13 16:38     ` Dan Schatzberg
2023-12-14  4:28       ` Yosry Ahmed
2023-12-14  8:38   ` Michal Hocko
2023-12-14 18:22     ` Dan Schatzberg
2023-12-14 18:28       ` Yosry Ahmed
2023-12-15  8:50       ` Michal Hocko
