* [PATCH mm v5 RESEND 0/4] memcg: Slow down swap allocation as the available space gets depleted
@ 2020-05-21  0:24 ` Jakub Kicinski
  0 siblings, 0 replies; 22+ messages in thread
From: Jakub Kicinski @ 2020-05-21  0:24 UTC (permalink / raw)
  To: akpm
  Cc: linux-mm, kernel-team, tj, hannes, chris, cgroups, shakeelb,
	mhocko, Jakub Kicinski

Tejun describes the problem as follows:

When swap runs out, there's an abrupt change in system behavior -
the anonymous memory suddenly becomes unmanageable, which readily
breaks any sort of memory isolation and can bring down the whole
system. To avoid that, oomd [1] monitors free swap space and triggers
kills when it drops below a specific threshold (e.g. 15%).

While this works, it's far from ideal:
 - Depending on IO performance and total swap size, a given
   headroom might not be enough, or might be too much.
 - oomd has to monitor swap depletion in addition to the usual
   pressure metrics, and it currently doesn't consider memory.swap.max.

Solve this by adapting parts of the approach that memory.high uses -
slow down allocations as the resource gets depleted, turning the
depletion behavior from an abrupt cliff into gradual degradation
observable through the memory pressure metric.

[1] https://github.com/facebookincubator/oomd
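
To make the slowdown concrete, here is a minimal, stand-alone sketch
(ordinary user-space C, not part of the series; the threshold and usage
numbers are made up) that mirrors the calculate_overage() helper
introduced in patch 1 and shows how the fixed-point overage ratio grows
smoothly once usage exceeds the high threshold, instead of hitting a
hard cutoff:

#include <stdio.h>
#include <stdint.h>

/* Same fixed-point precision as mm/memcontrol.c */
#define MEMCG_DELAY_PRECISION_SHIFT 20

/*
 * Mirrors calculate_overage() from patch 1: 0 while usage is at or
 * below the threshold, otherwise (usage - high) / high in 20-bit
 * fixed point.
 */
static uint64_t overage(unsigned long usage, unsigned long high)
{
	if (usage <= high)
		return 0;
	if (!high)
		high = 1;	/* avoid division by zero, like the kernel does */
	return ((uint64_t)(usage - high) << MEMCG_DELAY_PRECISION_SHIFT) / high;
}

int main(void)
{
	unsigned long high = 1000;	/* hypothetical threshold, in pages */
	unsigned long usage;

	/* 1000 pages -> 0% over, 1500 -> 50%, 2000 -> 100%, ... */
	for (usage = 1000; usage <= 3000; usage += 500)
		printf("usage=%lu overage=%.2f\n", usage,
		       (double)overage(usage, high) /
		       (1 << MEMCG_DELAY_PRECISION_SHIFT));
	return 0;
}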

v4: https://lore.kernel.org/linux-mm/20200519171938.3569605-1-kuba@kernel.org/
v3: https://lore.kernel.org/linux-mm/20200515202027.3217470-1-kuba@kernel.org/
v2: https://lore.kernel.org/linux-mm/20200511225516.2431921-1-kuba@kernel.org/
v1: https://lore.kernel.org/linux-mm/20200417010617.927266-1-kuba@kernel.org/

Jakub Kicinski (4):
  mm: prepare for swap over-high accounting and penalty calculation
  mm: move penalty delay clamping out of calculate_high_delay()
  mm: move cgroup high memory limit setting into struct page_counter
  mm: automatically penalize tasks with high swap use

 Documentation/admin-guide/cgroup-v2.rst |  20 +++
 include/linux/memcontrol.h              |   4 +-
 include/linux/page_counter.h            |  13 ++
 mm/memcontrol.c                         | 173 +++++++++++++++++-------
 4 files changed, 161 insertions(+), 49 deletions(-)

-- 
2.25.4



^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH mm v5 RESEND 1/4] mm: prepare for swap over-high accounting and penalty calculation
@ 2020-05-21  0:24   ` Jakub Kicinski
  0 siblings, 0 replies; 22+ messages in thread
From: Jakub Kicinski @ 2020-05-21  0:24 UTC (permalink / raw)
  To: akpm
  Cc: linux-mm, kernel-team, tj, hannes, chris, cgroups, shakeelb,
	mhocko, Jakub Kicinski

Slice the memory overage calculation logic a little bit so we can
reuse it to apply a similar penalty to swap. The logic which
accesses the memory-specific fields (usage and high values) has to
be taken out of calculate_high_delay().

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
---
 mm/memcontrol.c | 62 ++++++++++++++++++++++++++++---------------------
 1 file changed, 35 insertions(+), 27 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 2df9510b7d64..0d05e6a593f5 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2302,41 +2302,48 @@ static void high_work_func(struct work_struct *work)
  #define MEMCG_DELAY_PRECISION_SHIFT 20
  #define MEMCG_DELAY_SCALING_SHIFT 14
 
-/*
- * Get the number of jiffies that we should penalise a mischievous cgroup which
- * is exceeding its memory.high by checking both it and its ancestors.
- */
-static unsigned long calculate_high_delay(struct mem_cgroup *memcg,
-					  unsigned int nr_pages)
+static u64 calculate_overage(unsigned long usage, unsigned long high)
 {
-	unsigned long penalty_jiffies;
-	u64 max_overage = 0;
-
-	do {
-		unsigned long usage, high;
-		u64 overage;
+	u64 overage;
 
-		usage = page_counter_read(&memcg->memory);
-		high = READ_ONCE(memcg->high);
+	if (usage <= high)
+		return 0;
 
-		if (usage <= high)
-			continue;
+	/*
+	 * Prevent division by 0 in overage calculation by acting as if
+	 * it was a threshold of 1 page
+	 */
+	high = max(high, 1UL);
 
-		/*
-		 * Prevent division by 0 in overage calculation by acting as if
-		 * it was a threshold of 1 page
-		 */
-		high = max(high, 1UL);
+	overage = usage - high;
+	overage <<= MEMCG_DELAY_PRECISION_SHIFT;
+	return div64_u64(overage, high);
+}
 
-		overage = usage - high;
-		overage <<= MEMCG_DELAY_PRECISION_SHIFT;
-		overage = div64_u64(overage, high);
+static u64 mem_find_max_overage(struct mem_cgroup *memcg)
+{
+	u64 overage, max_overage = 0;
 
-		if (overage > max_overage)
-			max_overage = overage;
+	do {
+		overage = calculate_overage(page_counter_read(&memcg->memory),
+					    READ_ONCE(memcg->high));
+		max_overage = max(overage, max_overage);
 	} while ((memcg = parent_mem_cgroup(memcg)) &&
 		 !mem_cgroup_is_root(memcg));
 
+	return max_overage;
+}
+
+/*
+ * Get the number of jiffies that we should penalise a mischievous cgroup which
+ * is exceeding its memory.high by checking both it and its ancestors.
+ */
+static unsigned long calculate_high_delay(struct mem_cgroup *memcg,
+					  unsigned int nr_pages,
+					  u64 max_overage)
+{
+	unsigned long penalty_jiffies;
+
 	if (!max_overage)
 		return 0;
 
@@ -2392,7 +2399,8 @@ void mem_cgroup_handle_over_high(void)
 	 * memory.high is breached and reclaim is unable to keep up. Throttle
 	 * allocators proactively to slow down excessive growth.
 	 */
-	penalty_jiffies = calculate_high_delay(memcg, nr_pages);
+	penalty_jiffies = calculate_high_delay(memcg, nr_pages,
+					       mem_find_max_overage(memcg));
 
 	/*
 	 * Don't sleep if the amount of jiffies this memcg owes us is so low
-- 
2.25.4



^ permalink raw reply related	[flat|nested] 22+ messages in thread
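
A short worked example of the ancestor walk above, with hypothetical
numbers: if a child cgroup is 5% over its memory.high while its parent
is 25% over, mem_find_max_overage() walks from the child towards the
root and returns the parent's overage - 0.25 in the 20-bit fixed point
used here, i.e. 262144 - so the child ends up penalised according to
its most-overcommitted ancestor rather than just its own counter.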

* [PATCH mm v5 RESEND 2/4] mm: move penalty delay clamping out of calculate_high_delay()
@ 2020-05-21  0:24   ` Jakub Kicinski
  0 siblings, 0 replies; 22+ messages in thread
From: Jakub Kicinski @ 2020-05-21  0:24 UTC (permalink / raw)
  To: akpm
  Cc: linux-mm, kernel-team, tj, hannes, chris, cgroups, shakeelb,
	mhocko, Jakub Kicinski

We will want to call calculate_high_delay() twice - once for
memory and once for swap - and we should apply the clamp to the
sum of the two penalties. Clamping therefore has to be applied
outside of calculate_high_delay().

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
---
 mm/memcontrol.c | 16 ++++++++--------
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 0d05e6a593f5..dd8605a9137a 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2367,14 +2367,7 @@ static unsigned long calculate_high_delay(struct mem_cgroup *memcg,
 	 * MEMCG_CHARGE_BATCH pages is nominal, so work out how much smaller or
 	 * larger the current charge patch is than that.
 	 */
-	penalty_jiffies = penalty_jiffies * nr_pages / MEMCG_CHARGE_BATCH;
-
-	/*
-	 * Clamp the max delay per usermode return so as to still keep the
-	 * application moving forwards and also permit diagnostics, albeit
-	 * extremely slowly.
-	 */
-	return min(penalty_jiffies, MEMCG_MAX_HIGH_DELAY_JIFFIES);
+	return penalty_jiffies * nr_pages / MEMCG_CHARGE_BATCH;
 }
 
 /*
@@ -2402,6 +2395,13 @@ void mem_cgroup_handle_over_high(void)
 	penalty_jiffies = calculate_high_delay(memcg, nr_pages,
 					       mem_find_max_overage(memcg));
 
+	/*
+	 * Clamp the max delay per usermode return so as to still keep the
+	 * application moving forwards and also permit diagnostics, albeit
+	 * extremely slowly.
+	 */
+	penalty_jiffies = min(penalty_jiffies, MEMCG_MAX_HIGH_DELAY_JIFFIES);
+
 	/*
 	 * Don't sleep if the amount of jiffies this memcg owes us is so low
 	 * that it's not even worth doing, in an attempt to be nice to those who
-- 
2.25.4



^ permalink raw reply related	[flat|nested] 22+ messages in thread
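
To see why the clamp moves, take a hypothetical return to user space
where the memory overage works out to a 1.5 second penalty and the swap
overage (added later in this series) to another 1 second: the two
delays are summed first and only then clamped to
MEMCG_MAX_HIGH_DELAY_JIFFIES, so the task gets one bounded sleep per
return to user space instead of two independently clamped ones.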

* [PATCH mm v5 RESEND 3/4] mm: move cgroup high memory limit setting into struct page_counter
@ 2020-05-21  0:24   ` Jakub Kicinski
  0 siblings, 0 replies; 22+ messages in thread
From: Jakub Kicinski @ 2020-05-21  0:24 UTC (permalink / raw)
  To: akpm
  Cc: linux-mm, kernel-team, tj, hannes, chris, cgroups, shakeelb,
	mhocko, Jakub Kicinski

The high memory limit is currently recorded directly in
struct mem_cgroup. Since we are about to add a high limit
for swap, move the field to struct page_counter and
add some helpers.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
--
v5: make page_counter_set_high() a static inline in the header
v4: new patch
---
 include/linux/memcontrol.h   |  3 ---
 include/linux/page_counter.h | 13 +++++++++++++
 mm/memcontrol.c              | 17 +++++++++--------
 3 files changed, 22 insertions(+), 11 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index e0bcef180672..d726867d8af9 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -206,9 +206,6 @@ struct mem_cgroup {
 	struct page_counter kmem;
 	struct page_counter tcpmem;
 
-	/* Upper bound of normal memory consumption range */
-	unsigned long high;
-
 	/* Range enforcement for interrupt charges */
 	struct work_struct high_work;
 
diff --git a/include/linux/page_counter.h b/include/linux/page_counter.h
index bab7e57f659b..6a89ff948412 100644
--- a/include/linux/page_counter.h
+++ b/include/linux/page_counter.h
@@ -10,6 +10,7 @@ struct page_counter {
 	atomic_long_t usage;
 	unsigned long min;
 	unsigned long low;
+	unsigned long high;
 	unsigned long max;
 	struct page_counter *parent;
 
@@ -55,6 +56,13 @@ bool page_counter_try_charge(struct page_counter *counter,
 void page_counter_uncharge(struct page_counter *counter, unsigned long nr_pages);
 void page_counter_set_min(struct page_counter *counter, unsigned long nr_pages);
 void page_counter_set_low(struct page_counter *counter, unsigned long nr_pages);
+
+static inline void page_counter_set_high(struct page_counter *counter,
+					 unsigned long nr_pages)
+{
+	WRITE_ONCE(counter->high, nr_pages);
+}
+
 int page_counter_set_max(struct page_counter *counter, unsigned long nr_pages);
 int page_counter_memparse(const char *buf, const char *max,
 			  unsigned long *nr_pages);
@@ -64,4 +72,9 @@ static inline void page_counter_reset_watermark(struct page_counter *counter)
 	counter->watermark = page_counter_read(counter);
 }
 
+static inline bool page_counter_is_above_high(struct page_counter *counter)
+{
+	return page_counter_read(counter) > READ_ONCE(counter->high);
+}
+
 #endif /* _LINUX_PAGE_COUNTER_H */
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index dd8605a9137a..d4b7bc80aa38 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2233,7 +2233,7 @@ static void reclaim_high(struct mem_cgroup *memcg,
 			 gfp_t gfp_mask)
 {
 	do {
-		if (page_counter_read(&memcg->memory) <= READ_ONCE(memcg->high))
+		if (!page_counter_is_above_high(&memcg->memory))
 			continue;
 		memcg_memory_event(memcg, MEMCG_HIGH);
 		try_to_free_mem_cgroup_pages(memcg, nr_pages, gfp_mask, true);
@@ -2326,7 +2326,7 @@ static u64 mem_find_max_overage(struct mem_cgroup *memcg)
 
 	do {
 		overage = calculate_overage(page_counter_read(&memcg->memory),
-					    READ_ONCE(memcg->high));
+					    READ_ONCE(memcg->memory.high));
 		max_overage = max(overage, max_overage);
 	} while ((memcg = parent_mem_cgroup(memcg)) &&
 		 !mem_cgroup_is_root(memcg));
@@ -2585,7 +2585,7 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	 * reclaim, the cost of mismatch is negligible.
 	 */
 	do {
-		if (page_counter_read(&memcg->memory) > READ_ONCE(memcg->high)) {
+		if (page_counter_is_above_high(&memcg->memory)) {
 			/* Don't bother a random interrupted task */
 			if (in_interrupt()) {
 				schedule_work(&memcg->high_work);
@@ -4286,7 +4286,7 @@ void mem_cgroup_wb_stats(struct bdi_writeback *wb, unsigned long *pfilepages,
 
 	while ((parent = parent_mem_cgroup(memcg))) {
 		unsigned long ceiling = min(READ_ONCE(memcg->memory.max),
-					    READ_ONCE(memcg->high));
+					    READ_ONCE(memcg->memory.high));
 		unsigned long used = page_counter_read(&memcg->memory);
 
 		*pheadroom = min(*pheadroom, ceiling - min(ceiling, used));
@@ -5011,7 +5011,7 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
 	if (IS_ERR(memcg))
 		return ERR_CAST(memcg);
 
-	WRITE_ONCE(memcg->high, PAGE_COUNTER_MAX);
+	page_counter_set_high(&memcg->memory, PAGE_COUNTER_MAX);
 	memcg->soft_limit = PAGE_COUNTER_MAX;
 	if (parent) {
 		memcg->swappiness = mem_cgroup_swappiness(parent);
@@ -5164,7 +5164,7 @@ static void mem_cgroup_css_reset(struct cgroup_subsys_state *css)
 	page_counter_set_max(&memcg->tcpmem, PAGE_COUNTER_MAX);
 	page_counter_set_min(&memcg->memory, 0);
 	page_counter_set_low(&memcg->memory, 0);
-	WRITE_ONCE(memcg->high, PAGE_COUNTER_MAX);
+	page_counter_set_high(&memcg->memory, PAGE_COUNTER_MAX);
 	memcg->soft_limit = PAGE_COUNTER_MAX;
 	memcg_wb_domain_size_changed(memcg);
 }
@@ -5984,7 +5984,8 @@ static ssize_t memory_low_write(struct kernfs_open_file *of,
 
 static int memory_high_show(struct seq_file *m, void *v)
 {
-	return seq_puts_memcg_tunable(m, READ_ONCE(mem_cgroup_from_seq(m)->high));
+	return seq_puts_memcg_tunable(m,
+		READ_ONCE(mem_cgroup_from_seq(m)->memory.high));
 }
 
 static ssize_t memory_high_write(struct kernfs_open_file *of,
@@ -6001,7 +6002,7 @@ static ssize_t memory_high_write(struct kernfs_open_file *of,
 	if (err)
 		return err;
 
-	WRITE_ONCE(memcg->high, high);
+	page_counter_set_high(&memcg->memory, high);
 
 	for (;;) {
 		unsigned long nr_pages = page_counter_read(&memcg->memory);
-- 
2.25.4



^ permalink raw reply related	[flat|nested] 22+ messages in thread
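
A brief note on how the helpers above are used: the threshold is now
written only through page_counter_set_high() (memory_high_write() and
the css alloc/reset paths), the hot-path "are we over the limit?"
checks in reclaim_high() and try_charge() go through
page_counter_is_above_high(), and places that need the raw value
(e.g. mem_find_max_overage()) keep reading it with READ_ONCE(). The
WRITE_ONCE()/READ_ONCE() pairing inside the helpers preserves the
lockless access pattern of the open-coded accesses being replaced, and
the swap counter added later in the series gets the same interface for
free.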

* [PATCH mm v5 RESEND 4/4] mm: automatically penalize tasks with high swap use
@ 2020-05-21  0:24   ` Jakub Kicinski
  0 siblings, 0 replies; 22+ messages in thread
From: Jakub Kicinski @ 2020-05-21  0:24 UTC (permalink / raw)
  To: akpm
  Cc: linux-mm, kernel-team, tj, hannes, chris, cgroups, shakeelb,
	mhocko, Jakub Kicinski

Add a memory.swap.high knob, which can be used to protect the system
from SWAP exhaustion. The mechanism used for penalizing is similar
to memory.high penalty (sleep on return to user space), but with
a less steep slope.

That is not to say that the knob itself is equivalent to memory.high.
The objective is more to protect the system from potentially buggy
tasks consuming a lot of swap and impacting other tasks, or even
bringing the whole system to a standstill with complete SWAP
exhaustion, hopefully without the need to find per-task hard
limits.

Slowing misbehaving tasks down gradually allows user space oom
killers or other protection mechanisms to react. oomd and earlyoom
already do killing based on swap exhaustion, and memory.swap.high
protection will help implement such userspace oom policies more
reliably.

We can use one counter for number of pages allocated under
pressure to save struct task space and avoid two separate
hierarchy walks on the hot path. The exact overage is
calculated on return to user space, anyway.

Take the new high limit into account when determining if swap
is "full". Borrowing the explanation from Johannes:

  The idea behind "swap full" is that as long as the workload has plenty
  of swap space available and it's not changing its memory contents, it
  makes sense to generously hold on to copies of data in the swap
  device, even after the swapin. A later reclaim cycle can drop the page
  without any IO. Trading disk space for IO.

  But the only two ways to reclaim a swap slot are when they're faulted
  in and the references go away, or by scanning the virtual address space
  like swapoff does - which is very expensive (one could argue it's too
  expensive even for swapoff, it's often more practical to just reboot).

  So at some point in the fill level, we have to start freeing up swap
  slots on fault/swapin. Otherwise we could eventually run out of swap
  slots while they're filled with copies of data that is also in RAM.

  We don't want to OOM a workload because its available swap space is
  filled with redundant cache.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
--
v4:
 - add a comment on using a single counter for both mem and swap pages
v3:
 - count events for all groups over limit
 - add doc for high events
 - remove the magic scaling factor
 - improve commit message
v2:
 - add docs
 - improve commit message
---
 Documentation/admin-guide/cgroup-v2.rst | 20 ++++++
 include/linux/memcontrol.h              |  1 +
 mm/memcontrol.c                         | 84 +++++++++++++++++++++++--
 3 files changed, 99 insertions(+), 6 deletions(-)

diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index fed4e1d2a343..1536deb2f28e 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -1373,6 +1373,22 @@ PAGE_SIZE multiple when read back.
 	The total amount of swap currently being used by the cgroup
 	and its descendants.
 
+  memory.swap.high
+	A read-write single value file which exists on non-root
+	cgroups.  The default is "max".
+
+	Swap usage throttle limit.  If a cgroup's swap usage exceeds
+	this limit, all its further allocations will be throttled to
+	allow userspace to implement custom out-of-memory procedures.
+
+	This limit marks a point of no return for the cgroup. It is NOT
+	designed to manage the amount of swapping a workload does
+	during regular operation. Compare to memory.swap.max, which
+	prohibits swapping past a set amount, but lets the cgroup
+	continue unimpeded as long as other memory can be reclaimed.
+
+	Healthy workloads are not expected to reach this limit.
+
   memory.swap.max
 	A read-write single value file which exists on non-root
 	cgroups.  The default is "max".
@@ -1386,6 +1402,10 @@ PAGE_SIZE multiple when read back.
 	otherwise, a value change in this file generates a file
 	modified event.
 
+	  high
+		The number of times the cgroup's swap usage was over
+		the high threshold.
+
 	  max
 		The number of times the cgroup's swap usage was about
 		to go over the max boundary and swap allocation
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index d726867d8af9..865afda5b6f0 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -42,6 +42,7 @@ enum memcg_memory_event {
 	MEMCG_MAX,
 	MEMCG_OOM,
 	MEMCG_OOM_KILL,
+	MEMCG_SWAP_HIGH,
 	MEMCG_SWAP_MAX,
 	MEMCG_SWAP_FAIL,
 	MEMCG_NR_MEMORY_EVENTS,
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index d4b7bc80aa38..a92ddaecd28e 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2334,6 +2334,22 @@ static u64 mem_find_max_overage(struct mem_cgroup *memcg)
 	return max_overage;
 }
 
+static u64 swap_find_max_overage(struct mem_cgroup *memcg)
+{
+	u64 overage, max_overage = 0;
+
+	do {
+		overage = calculate_overage(page_counter_read(&memcg->swap),
+					    READ_ONCE(memcg->swap.high));
+		if (overage)
+			memcg_memory_event(memcg, MEMCG_SWAP_HIGH);
+		max_overage = max(overage, max_overage);
+	} while ((memcg = parent_mem_cgroup(memcg)) &&
+		 !mem_cgroup_is_root(memcg));
+
+	return max_overage;
+}
+
 /*
  * Get the number of jiffies that we should penalise a mischievous cgroup which
  * is exceeding its memory.high by checking both it and its ancestors.
@@ -2395,6 +2411,13 @@ void mem_cgroup_handle_over_high(void)
 	penalty_jiffies = calculate_high_delay(memcg, nr_pages,
 					       mem_find_max_overage(memcg));
 
+	/*
+	 * Make the swap curve more gradual, swap can be considered "cheaper",
+	 * and is allocated in larger chunks. We want the delays to be gradual.
+	 */
+	penalty_jiffies += calculate_high_delay(memcg, nr_pages,
+						swap_find_max_overage(memcg));
+
 	/*
 	 * Clamp the max delay per usermode return so as to still keep the
 	 * application moving forwards and also permit diagnostics, albeit
@@ -2585,12 +2608,25 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	 * reclaim, the cost of mismatch is negligible.
 	 */
 	do {
-		if (page_counter_is_above_high(&memcg->memory)) {
-			/* Don't bother a random interrupted task */
-			if (in_interrupt()) {
+		bool mem_high, swap_high;
+
+		mem_high = page_counter_is_above_high(&memcg->memory);
+		swap_high = page_counter_is_above_high(&memcg->swap);
+
+		/* Don't bother a random interrupted task */
+		if (in_interrupt()) {
+			if (mem_high) {
 				schedule_work(&memcg->high_work);
 				break;
 			}
+			continue;
+		}
+
+		if (mem_high || swap_high) {
+			/* Use one counter for number of pages allocated
+			 * under pressure to save struct task space and
+			 * avoid two separate hierarchy walks.
+			 */
 			current->memcg_nr_pages_over_high += batch;
 			set_notify_resume(current);
 			break;
@@ -5013,6 +5049,7 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
 
 	page_counter_set_high(&memcg->memory, PAGE_COUNTER_MAX);
 	memcg->soft_limit = PAGE_COUNTER_MAX;
+	page_counter_set_high(&memcg->swap, PAGE_COUNTER_MAX);
 	if (parent) {
 		memcg->swappiness = mem_cgroup_swappiness(parent);
 		memcg->oom_kill_disable = parent->oom_kill_disable;
@@ -5166,6 +5203,7 @@ static void mem_cgroup_css_reset(struct cgroup_subsys_state *css)
 	page_counter_set_low(&memcg->memory, 0);
 	page_counter_set_high(&memcg->memory, PAGE_COUNTER_MAX);
 	memcg->soft_limit = PAGE_COUNTER_MAX;
+	page_counter_set_high(&memcg->swap, PAGE_COUNTER_MAX);
 	memcg_wb_domain_size_changed(memcg);
 }
 
@@ -6987,10 +7025,13 @@ bool mem_cgroup_swap_full(struct page *page)
 	if (!memcg)
 		return false;
 
-	for (; memcg != root_mem_cgroup; memcg = parent_mem_cgroup(memcg))
-		if (page_counter_read(&memcg->swap) * 2 >=
-		    READ_ONCE(memcg->swap.max))
+	for (; memcg != root_mem_cgroup; memcg = parent_mem_cgroup(memcg)) {
+		unsigned long usage = page_counter_read(&memcg->swap);
+
+		if (usage * 2 >= READ_ONCE(memcg->swap.high) ||
+		    usage * 2 >= READ_ONCE(memcg->swap.max))
 			return true;
+	}
 
 	return false;
 }
@@ -7013,6 +7054,29 @@ static u64 swap_current_read(struct cgroup_subsys_state *css,
 	return (u64)page_counter_read(&memcg->swap) * PAGE_SIZE;
 }
 
+static int swap_high_show(struct seq_file *m, void *v)
+{
+	return seq_puts_memcg_tunable(m,
+		READ_ONCE(mem_cgroup_from_seq(m)->swap.high));
+}
+
+static ssize_t swap_high_write(struct kernfs_open_file *of,
+			       char *buf, size_t nbytes, loff_t off)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
+	unsigned long high;
+	int err;
+
+	buf = strstrip(buf);
+	err = page_counter_memparse(buf, "max", &high);
+	if (err)
+		return err;
+
+	page_counter_set_high(&memcg->swap, high);
+
+	return nbytes;
+}
+
 static int swap_max_show(struct seq_file *m, void *v)
 {
 	return seq_puts_memcg_tunable(m,
@@ -7040,6 +7104,8 @@ static int swap_events_show(struct seq_file *m, void *v)
 {
 	struct mem_cgroup *memcg = mem_cgroup_from_seq(m);
 
+	seq_printf(m, "high %lu\n",
+		   atomic_long_read(&memcg->memory_events[MEMCG_SWAP_HIGH]));
 	seq_printf(m, "max %lu\n",
 		   atomic_long_read(&memcg->memory_events[MEMCG_SWAP_MAX]));
 	seq_printf(m, "fail %lu\n",
@@ -7054,6 +7120,12 @@ static struct cftype swap_files[] = {
 		.flags = CFTYPE_NOT_ON_ROOT,
 		.read_u64 = swap_current_read,
 	},
+	{
+		.name = "swap.high",
+		.flags = CFTYPE_NOT_ON_ROOT,
+		.seq_show = swap_high_show,
+		.write = swap_high_write,
+	},
 	{
 		.name = "swap.max",
 		.flags = CFTYPE_NOT_ON_ROOT,
-- 
2.25.4



^ permalink raw reply related	[flat|nested] 22+ messages in thread
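
Two concrete consequences of the patch above, using hypothetical
numbers: with memory.swap.high set to 1G, mem_cgroup_swap_full() starts
reporting "full" once the group's swap usage reaches 512M
(usage * 2 >= high), so redundant swap-cache copies begin to be freed
on swapin well before the throttling point; and once usage actually
crosses the limit, every charging task in the group accumulates
memcg_nr_pages_over_high and is throttled on its next return to user
space, on top of any memory.high penalty.

For completeness, a minimal sketch of driving the new knob from user
space, assuming cgroup2 is mounted at /sys/fs/cgroup and a group named
"workload" already exists (both assumptions, not part of the patch):

#include <stdio.h>

int main(void)
{
	/* Hypothetical cgroup path - adjust for the actual hierarchy. */
	const char *path = "/sys/fs/cgroup/workload/memory.swap.high";
	FILE *f = fopen(path, "w");

	if (!f) {
		perror("memory.swap.high");
		return 1;
	}
	/* page_counter_memparse() accepts "max" or a byte count with an
	 * optional K/M/G suffix; the value is rounded down to pages. */
	fputs("2G", f);
	return fclose(f) ? 1 : 0;
}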

* Re: [PATCH mm v5 RESEND 1/4] mm: prepare for swap over-high accounting and penalty calculation
@ 2020-05-26 14:35     ` Johannes Weiner
  0 siblings, 0 replies; 22+ messages in thread
From: Johannes Weiner @ 2020-05-26 14:35 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: akpm, linux-mm, kernel-team, tj, chris, cgroups, shakeelb, mhocko

On Wed, May 20, 2020 at 05:24:08PM -0700, Jakub Kicinski wrote:
> Slice the memory overage calculation logic a little bit so we can
> reuse it to apply a similar penalty to swap. The logic which
> accesses the memory-specific fields (usage and high values) has to
> be taken out of calculate_high_delay().
> 
> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
> Reviewed-by: Shakeel Butt <shakeelb@google.com>

Acked-by: Johannes Weiner <hannes@cmpxchg.org>


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH mm v5 RESEND 2/4] mm: move penalty delay clamping out of calculate_high_delay()
@ 2020-05-26 14:36     ` Johannes Weiner
  0 siblings, 0 replies; 22+ messages in thread
From: Johannes Weiner @ 2020-05-26 14:36 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: akpm, linux-mm, kernel-team, tj, chris, cgroups, shakeelb, mhocko

On Wed, May 20, 2020 at 05:24:09PM -0700, Jakub Kicinski wrote:
> We will want to call calculate_high_delay() twice - once for
> memory and once for swap - and we should apply the clamp to the
> sum of the two penalties. Clamping therefore has to be applied
> outside of calculate_high_delay().
> 
> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
> Reviewed-by: Shakeel Butt <shakeelb@google.com>

Acked-by: Johannes Weiner <hannes@cmpxchg.org>


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH mm v5 RESEND 3/4] mm: move cgroup high memory limit setting into struct page_counter
@ 2020-05-26 14:42     ` Johannes Weiner
  0 siblings, 0 replies; 22+ messages in thread
From: Johannes Weiner @ 2020-05-26 14:42 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: akpm, linux-mm, kernel-team, tj, chris, cgroups, shakeelb, mhocko

On Wed, May 20, 2020 at 05:24:10PM -0700, Jakub Kicinski wrote:
> The high memory limit is currently recorded directly in
> struct mem_cgroup. Since we are about to add a high limit
> for swap, move the field to struct page_counter and
> add some helpers.
> 
> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
> Reviewed-by: Shakeel Butt <shakeelb@google.com>

Acked-by: Johannes Weiner <hannes@cmpxchg.org>

This move is overdue and should make it easier to integrate high
reclaim better into the existing max reclaim flow as well. Thanks!


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH mm v5 RESEND 4/4] mm: automatically penalize tasks with high swap use
@ 2020-05-26 15:33     ` Johannes Weiner
  0 siblings, 0 replies; 22+ messages in thread
From: Johannes Weiner @ 2020-05-26 15:33 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: akpm, linux-mm, kernel-team, tj, chris, cgroups, shakeelb, mhocko

Hi Jakub,

the patch looks mostly good to me, but there are a couple of things
that should be cleaned up before merging:

On Wed, May 20, 2020 at 05:24:11PM -0700, Jakub Kicinski wrote:
> Add a memory.swap.high knob, which can be used to protect the system
> from SWAP exhaustion. The mechanism used for penalizing is similar
> to memory.high penalty (sleep on return to user space), but with
> a less steep slope.

The last part is no longer true after incorporating Michal's feedback.

> That is not to say that the knob itself is equivalent to memory.high.
> The objective is more to protect the system from potentially buggy
> tasks consuming a lot of swap and impacting other tasks, or even
> bringing the whole system to a standstill with complete SWAP
> exhaustion, hopefully without the need to find per-task hard
> limits.
> 
> Slowing misbehaving tasks down gradually allows user space oom
> killers or other protection mechanisms to react. oomd and earlyoom
> already do killing based on swap exhaustion, and memory.swap.high
> protection will help implement such userspace oom policies more
> reliably.
> 
> We can use one counter for number of pages allocated under
> pressure to save struct task space and avoid two separate
> hierarchy walks on the hot path. The exact overage is
> calculated on return to user space, anyway.
> 
> Take the new high limit into account when determining if swap
> is "full". Borrowing the explanation from Johannes:
> 
>   The idea behind "swap full" is that as long as the workload has plenty
>   of swap space available and it's not changing its memory contents, it
>   makes sense to generously hold on to copies of data in the swap
>   device, even after the swapin. A later reclaim cycle can drop the page
>   without any IO. Trading disk space for IO.
> 
>   But the only two ways to reclaim a swap slot are when they're faulted
>   in and the references go away, or by scanning the virtual address space
>   like swapoff does - which is very expensive (one could argue it's too
>   expensive even for swapoff, it's often more practical to just reboot).
> 
>   So at some point in the fill level, we have to start freeing up swap
>   slots on fault/swapin. Otherwise we could eventually run out of swap
>   slots while they're filled with copies of data that is also in RAM.
> 
>   We don't want to OOM a workload because its available swap space is
>   filled with redundant cache.
> 
> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
> --
> v4:
>  - add a comment on using a single counter for both mem and swap pages
> v3:
>  - count events for all groups over limit
>  - add doc for high events
>  - remove the magic scaling factor
>  - improve commit message
> v2:
>  - add docs
>  - improve commit message
> ---
>  Documentation/admin-guide/cgroup-v2.rst | 20 ++++++
>  include/linux/memcontrol.h              |  1 +
>  mm/memcontrol.c                         | 84 +++++++++++++++++++++++--
>  3 files changed, 99 insertions(+), 6 deletions(-)
> 
> diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
> index fed4e1d2a343..1536deb2f28e 100644
> --- a/Documentation/admin-guide/cgroup-v2.rst
> +++ b/Documentation/admin-guide/cgroup-v2.rst
> @@ -1373,6 +1373,22 @@ PAGE_SIZE multiple when read back.
>  	The total amount of swap currently being used by the cgroup
>  	and its descendants.
>  
> +  memory.swap.high
> +	A read-write single value file which exists on non-root
> +	cgroups.  The default is "max".
> +
> +	Swap usage throttle limit.  If a cgroup's swap usage exceeds
> +	this limit, all its further allocations will be throttled to
> +	allow userspace to implement custom out-of-memory procedures.
> +
> +	This limit marks a point of no return for the cgroup. It is NOT
> +	designed to manage the amount of swapping a workload does
> +	during regular operation. Compare to memory.swap.max, which
> +	prohibits swapping past a set amount, but lets the cgroup
> +	continue unimpeded as long as other memory can be reclaimed.
> +
> +	Healthy workloads are not expected to reach this limit.
> +
>    memory.swap.max
>  	A read-write single value file which exists on non-root
>  	cgroups.  The default is "max".
> @@ -1386,6 +1402,10 @@ PAGE_SIZE multiple when read back.
>  	otherwise, a value change in this file generates a file
>  	modified event.
>  
> +	  high
> +		The number of times the cgroup's swap usage was over
> +		the high threshold.
> +
>  	  max
>  		The number of times the cgroup's swap usage was about
>  		to go over the max boundary and swap allocation
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index d726867d8af9..865afda5b6f0 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -42,6 +42,7 @@ enum memcg_memory_event {
>  	MEMCG_MAX,
>  	MEMCG_OOM,
>  	MEMCG_OOM_KILL,
> +	MEMCG_SWAP_HIGH,
>  	MEMCG_SWAP_MAX,
>  	MEMCG_SWAP_FAIL,
>  	MEMCG_NR_MEMORY_EVENTS,
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index d4b7bc80aa38..a92ddaecd28e 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2334,6 +2334,22 @@ static u64 mem_find_max_overage(struct mem_cgroup *memcg)
>  	return max_overage;
>  }
>  
> +static u64 swap_find_max_overage(struct mem_cgroup *memcg)
> +{
> +	u64 overage, max_overage = 0;
> +
> +	do {
> +		overage = calculate_overage(page_counter_read(&memcg->swap),
> +					    READ_ONCE(memcg->swap.high));
> +		if (overage)
> +			memcg_memory_event(memcg, MEMCG_SWAP_HIGH);
> +		max_overage = max(overage, max_overage);
> +	} while ((memcg = parent_mem_cgroup(memcg)) &&
> +		 !mem_cgroup_is_root(memcg));
> +
> +	return max_overage;
> +}
> +
>  /*
>   * Get the number of jiffies that we should penalise a mischievous cgroup which
>   * is exceeding its memory.high by checking both it and its ancestors.
> @@ -2395,6 +2411,13 @@ void mem_cgroup_handle_over_high(void)
>  	penalty_jiffies = calculate_high_delay(memcg, nr_pages,
>  					       mem_find_max_overage(memcg));
>  
> +	/*
> +	 * Make the swap curve more gradual, swap can be considered "cheaper",
> +	 * and is allocated in larger chunks. We want the delays to be gradual.
> +	 */

This comment is also out-of-date, as the same curve is being applied.

> +	penalty_jiffies += calculate_high_delay(memcg, nr_pages,
> +						swap_find_max_overage(memcg));
> +
>  	/*
>  	 * Clamp the max delay per usermode return so as to still keep the
>  	 * application moving forwards and also permit diagnostics, albeit
> @@ -2585,12 +2608,25 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
>  	 * reclaim, the cost of mismatch is negligible.
>  	 */
>  	do {
> -		if (page_counter_is_above_high(&memcg->memory)) {
> -			/* Don't bother a random interrupted task */
> -			if (in_interrupt()) {
> +		bool mem_high, swap_high;
> +
> +		mem_high = page_counter_is_above_high(&memcg->memory);
> +		swap_high = page_counter_is_above_high(&memcg->swap);

Please open-code these checks instead - we don't really do getters and
predicates for these, and only have the setters because they are more
complicated operations.

> +		if (mem_high || swap_high) {
> +			/* Use one counter for number of pages allocated
> +			 * under pressure to save struct task space and
> +			 * avoid two separate hierarchy walks.
> +			 /*
>  			current->memcg_nr_pages_over_high += batch;

That comment style is leaking out of the networking code ;-) Please
use the customary style in this code base, /*\n *...
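
For reference, the two styles differ only in where the text starts -
a quick sketch (not from the patch):

	/* net/ style: text begins on the opening line
	 * and continues on subsequent lines.
	 */

	/*
	 * Customary mm/ style: the opening line is bare and the
	 * text starts on the second line.
	 */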

As for one counter instead of two: I'm not sure that question arises
for the reader. There have also been some questions recently about
what the counter actually means. How about the following:

			/*
			 * The allocating tasks in this cgroup will need to do
			 * reclaim or be throttled to prevent further growth
			 * of the memory or swap footprints.
			 *
			 * Target some best-effort fairness between the tasks,
			 * and distribute reclaim work and delay penalties
			 * based on how much each task is actually allocating.
			 */

Otherwise, the patch looks good to me.



^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH mm v5 RESEND 4/4] mm: automatically penalize tasks with high swap use
@ 2020-05-26 20:11       ` Jakub Kicinski
  0 siblings, 0 replies; 22+ messages in thread
From: Jakub Kicinski @ 2020-05-26 20:11 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: akpm, linux-mm, kernel-team, tj, chris, cgroups, shakeelb, mhocko

On Tue, 26 May 2020 11:33:09 -0400 Johannes Weiner wrote:
> On Wed, May 20, 2020 at 05:24:11PM -0700, Jakub Kicinski wrote:
> > Add a memory.swap.high knob, which can be used to protect the system
> > from SWAP exhaustion. The mechanism used for penalizing is similar
> > to memory.high penalty (sleep on return to user space), but with
> > a less steep slope.  
> 
> The last part is no longer true after incorporating Michal's feedback.
>
> > +	/*
> > +	 * Make the swap curve more gradual, swap can be considered "cheaper",
> > +	 * and is allocated in larger chunks. We want the delays to be gradual.
> > +	 */  
> 
> This comment is also out-of-date, as the same curve is being applied.

Indeed :S
 
> > +	penalty_jiffies += calculate_high_delay(memcg, nr_pages,
> > +						swap_find_max_overage(memcg));
> > +
> >  	/*
> >  	 * Clamp the max delay per usermode return so as to still keep the
> >  	 * application moving forwards and also permit diagnostics, albeit
> > @@ -2585,12 +2608,25 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
> >  	 * reclaim, the cost of mismatch is negligible.
> >  	 */
> >  	do {
> > -		if (page_counter_is_above_high(&memcg->memory)) {
> > -			/* Don't bother a random interrupted task */
> > -			if (in_interrupt()) {
> > +		bool mem_high, swap_high;
> > +
> > +		mem_high = page_counter_is_above_high(&memcg->memory);
> > +		swap_high = page_counter_is_above_high(&memcg->swap);  
> 
> Please open-code these checks instead - we don't really do getters and
> predicates for these, and only have the setters because they are more
> complicated operations.

I added this helper because the calculation doesn't fit into 80 chars. 

In particular reclaim_high will need a temporary variable or IMHO
questionable line split.

static void reclaim_high(struct mem_cgroup *memcg,
			 unsigned int nr_pages,
			 gfp_t gfp_mask)
{
	do {
		if (!page_counter_is_above_high(&memcg->memory))
			continue;
		memcg_memory_event(memcg, MEMCG_HIGH);
		try_to_free_mem_cgroup_pages(memcg, nr_pages, gfp_mask, true);
	} while ((memcg = parent_mem_cgroup(memcg)) &&
		 !mem_cgroup_is_root(memcg));
}
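
For comparison, the open-coded variant with a temporary would be
something along these lines (sketch only, just negating the helper):

	do {
		unsigned long high = READ_ONCE(memcg->memory.high);

		if (page_counter_read(&memcg->memory) <= high)
			continue;
		memcg_memory_event(memcg, MEMCG_HIGH);
		try_to_free_mem_cgroup_pages(memcg, nr_pages, gfp_mask, true);
	} while ((memcg = parent_mem_cgroup(memcg)) &&
		 !mem_cgroup_is_root(memcg));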

What's your preference? Mine is a helper, but I'm probably not
sensitive enough to the ontology here :)

> > +		if (mem_high || swap_high) {
> > +			/* Use one counter for number of pages allocated
> > +			 * under pressure to save struct task space and
> > +			 * avoid two separate hierarchy walks.
> > +			 /*
> >  			current->memcg_nr_pages_over_high += batch;  
> 
> That comment style is leaking out of the networking code ;-) Please
> use the customary style in this code base, /*\n *...
> 
> As for one counter instead of two: I'm not sure that question arises
> for the reader. There have also been some questions recently about
> what the counter actually means. How about the following:
> 
> 			/*
> 			 * The allocating tasks in this cgroup will need to do
> 			 * reclaim or be throttled to prevent further growth
> 			 * of the memory or swap footprints.
> 			 *
> 			 * Target some best-effort fairness between the tasks,
> 			 * and distribute reclaim work and delay penalties
> 			 * based on how much each task is actually allocating.
> 			 */

sounds good!

> Otherwise, the patch looks good to me.

Thanks!


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH mm v5 RESEND 4/4] mm: automatically penalize tasks with high swap use
@ 2020-05-27 15:51         ` Johannes Weiner
  0 siblings, 0 replies; 22+ messages in thread
From: Johannes Weiner @ 2020-05-27 15:51 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: akpm, linux-mm, kernel-team, tj, chris, cgroups, shakeelb, mhocko

On Tue, May 26, 2020 at 01:11:57PM -0700, Jakub Kicinski wrote:
> On Tue, 26 May 2020 11:33:09 -0400 Johannes Weiner wrote:
> > On Wed, May 20, 2020 at 05:24:11PM -0700, Jakub Kicinski wrote:
> > > +	penalty_jiffies += calculate_high_delay(memcg, nr_pages,
> > > +						swap_find_max_overage(memcg));
> > > +
> > >  	/*
> > >  	 * Clamp the max delay per usermode return so as to still keep the
> > >  	 * application moving forwards and also permit diagnostics, albeit
> > > @@ -2585,12 +2608,25 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
> > >  	 * reclaim, the cost of mismatch is negligible.
> > >  	 */
> > >  	do {
> > > -		if (page_counter_is_above_high(&memcg->memory)) {
> > > -			/* Don't bother a random interrupted task */
> > > -			if (in_interrupt()) {
> > > +		bool mem_high, swap_high;
> > > +
> > > +		mem_high = page_counter_is_above_high(&memcg->memory);
> > > +		swap_high = page_counter_is_above_high(&memcg->swap);  
> > 
> > Please open-code these checks instead - we don't really do getters and
> > predicates for these, and only have the setters because they are more
> > complicated operations.
> 
> I added this helper because the calculation doesn't fit into 80 chars. 
> 
> In particular reclaim_high will need a temporary variable or IMHO
> questionable line split.
> 
> static void reclaim_high(struct mem_cgroup *memcg,
> 			 unsigned int nr_pages,
> 			 gfp_t gfp_mask)
> {
> 	do {
> 		if (!page_counter_is_above_high(&memcg->memory))
> 			continue;
> 		memcg_memory_event(memcg, MEMCG_HIGH);
> 		try_to_free_mem_cgroup_pages(memcg, nr_pages, gfp_mask, true);
> 	} while ((memcg = parent_mem_cgroup(memcg)) &&
> 		 !mem_cgroup_is_root(memcg));
> }
> 
> What's your preference? Mine is a helper, but I'm probably not
> sensitive enough to the ontology here :)

		if (page_counter_read(&memcg->memory) <
		    READ_ONCE(memcg->memory.high))
			continue;

should work fine. It's the same formatting in mem_cgroup_swap_full():

		if (page_counter_read(&memcg->swap) * 2 >=
		    READ_ONCE(memcg->swap.max))
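
Applied to the try_charge() checks, the open-coded form would then be
roughly (a sketch, untested):

		mem_high = page_counter_read(&memcg->memory) >
			   READ_ONCE(memcg->memory.high);
		swap_high = page_counter_read(&memcg->swap) >
			   READ_ONCE(memcg->swap.high);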


^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2020-05-27 15:52 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-05-21  0:24 [PATCH mm v5 RESEND 0/4] memcg: Slow down swap allocation as the available space gets depleted Jakub Kicinski
2020-05-21  0:24 ` Jakub Kicinski
2020-05-21  0:24 ` [PATCH mm v5 RESEND 1/4] mm: prepare for swap over-high accounting and penalty calculation Jakub Kicinski
2020-05-21  0:24   ` Jakub Kicinski
2020-05-26 14:35   ` Johannes Weiner
2020-05-26 14:35     ` Johannes Weiner
2020-05-21  0:24 ` [PATCH mm v5 RESEND 2/4] mm: move penalty delay clamping out of calculate_high_delay() Jakub Kicinski
2020-05-21  0:24   ` Jakub Kicinski
2020-05-26 14:36   ` Johannes Weiner
2020-05-26 14:36     ` Johannes Weiner
2020-05-21  0:24 ` [PATCH mm v5 RESEND 3/4] mm: move cgroup high memory limit setting into struct page_counter Jakub Kicinski
2020-05-21  0:24   ` Jakub Kicinski
2020-05-26 14:42   ` Johannes Weiner
2020-05-26 14:42     ` Johannes Weiner
2020-05-21  0:24 ` [PATCH mm v5 RESEND 4/4] mm: automatically penalize tasks with high swap use Jakub Kicinski
2020-05-21  0:24   ` Jakub Kicinski
2020-05-26 15:33   ` Johannes Weiner
2020-05-26 15:33     ` Johannes Weiner
2020-05-26 20:11     ` Jakub Kicinski
2020-05-26 20:11       ` Jakub Kicinski
2020-05-27 15:51       ` Johannes Weiner
2020-05-27 15:51         ` Johannes Weiner
