linux-mm.kvack.org archive mirror
* [PATCH mm v2 0/3] memcg: Slow down swap allocation as the available space gets depleted
@ 2020-05-11 22:55 Jakub Kicinski
  2020-05-11 22:55 ` [PATCH mm v2 1/3] mm: prepare for swap over-high accounting and penalty calculation Jakub Kicinski
                   ` (2 more replies)
  0 siblings, 3 replies; 15+ messages in thread
From: Jakub Kicinski @ 2020-05-11 22:55 UTC (permalink / raw)
  To: akpm
  Cc: linux-mm, kernel-team, tj, hannes, chris, cgroups, shakeelb,
	mhocko, Jakub Kicinski

Tejun describes the problem as follows:

When swap runs out, there's an abrupt change in system behavior -
the anonymous memory suddenly becomes unmanageable, which readily
breaks any sort of memory isolation and can bring down the whole
system. To avoid that, oomd [1] monitors free swap space and triggers
kills when it drops below a specific threshold (e.g. 15%).

While this works, it's far from ideal:
 - Depending on IO performance and total swap size, a given
   headroom might be too little or too much.
 - oomd has to monitor swap depletion in addition to the usual
   pressure metrics, and it currently doesn't consider memory.swap.max.

Solve this by adapting parts of the approach that memory.high uses -
slow down allocation as the resource gets depleted, turning the
depletion behavior from an abrupt cliff into gradual degradation
observable through the memory pressure metric.

[1] https://github.com/facebookincubator/oomd

v1: https://lore.kernel.org/linux-mm/20200417010617.927266-1-kuba@kernel.org/

Jakub Kicinski (3):
  mm: prepare for swap over-high accounting and penalty calculation
  mm: move penalty delay clamping out of calculate_high_delay()
  mm: automatically penalize tasks with high swap use

 Documentation/admin-guide/cgroup-v2.rst |  16 +++
 include/linux/memcontrol.h              |   4 +
 mm/memcontrol.c                         | 166 ++++++++++++++++++------
 3 files changed, 147 insertions(+), 39 deletions(-)

-- 
2.25.4



^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH mm v2 1/3] mm: prepare for swap over-high accounting and penalty calculation
  2020-05-11 22:55 [PATCH mm v2 0/3] memcg: Slow down swap allocation as the available space gets depleted Jakub Kicinski
@ 2020-05-11 22:55 ` Jakub Kicinski
  2020-05-12  7:08   ` Michal Hocko
  2020-05-11 22:55 ` [PATCH mm v2 2/3] mm: move penalty delay clamping out of calculate_high_delay() Jakub Kicinski
  2020-05-11 22:55 ` [PATCH mm v2 3/3] mm: automatically penalize tasks with high swap use Jakub Kicinski
  2 siblings, 1 reply; 15+ messages in thread
From: Jakub Kicinski @ 2020-05-11 22:55 UTC (permalink / raw)
  To: akpm
  Cc: linux-mm, kernel-team, tj, hannes, chris, cgroups, shakeelb,
	mhocko, Jakub Kicinski

Slice the memory overage calculation logic a little bit so we can
reuse it to apply a similar penalty to swap. The logic which
accesses the memory-specific fields (usage and high values) has to
be taken out of calculate_high_delay().

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
---
 mm/memcontrol.c | 62 ++++++++++++++++++++++++++++---------------------
 1 file changed, 35 insertions(+), 27 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 05dcb72314b5..8a9b671c3249 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2321,41 +2321,48 @@ static void high_work_func(struct work_struct *work)
  #define MEMCG_DELAY_PRECISION_SHIFT 20
  #define MEMCG_DELAY_SCALING_SHIFT 14
 
-/*
- * Get the number of jiffies that we should penalise a mischievous cgroup which
- * is exceeding its memory.high by checking both it and its ancestors.
- */
-static unsigned long calculate_high_delay(struct mem_cgroup *memcg,
-					  unsigned int nr_pages)
+static u64 calculate_overage(unsigned long usage, unsigned long high)
 {
-	unsigned long penalty_jiffies;
-	u64 max_overage = 0;
-
-	do {
-		unsigned long usage, high;
-		u64 overage;
+	u64 overage;
 
-		usage = page_counter_read(&memcg->memory);
-		high = READ_ONCE(memcg->high);
+	if (usage <= high)
+		return 0;
 
-		if (usage <= high)
-			continue;
+	/*
+	 * Prevent division by 0 in overage calculation by acting as if
+	 * it was a threshold of 1 page
+	 */
+	high = max(high, 1UL);
 
-		/*
-		 * Prevent division by 0 in overage calculation by acting as if
-		 * it was a threshold of 1 page
-		 */
-		high = max(high, 1UL);
+	overage = usage - high;
+	overage <<= MEMCG_DELAY_PRECISION_SHIFT;
+	return div64_u64(overage, high);
+}
 
-		overage = usage - high;
-		overage <<= MEMCG_DELAY_PRECISION_SHIFT;
-		overage = div64_u64(overage, high);
+static u64 mem_find_max_overage(struct mem_cgroup *memcg)
+{
+	u64 overage, max_overage = 0;
 
-		if (overage > max_overage)
-			max_overage = overage;
+	do {
+		overage = calculate_overage(page_counter_read(&memcg->memory),
+					    READ_ONCE(memcg->high));
+		max_overage = max(overage, max_overage);
 	} while ((memcg = parent_mem_cgroup(memcg)) &&
 		 !mem_cgroup_is_root(memcg));
 
+	return max_overage;
+}
+
+/*
+ * Get the number of jiffies that we should penalise a mischievous cgroup which
+ * is exceeding its memory.high by checking both it and its ancestors.
+ */
+static unsigned long calculate_high_delay(struct mem_cgroup *memcg,
+					  unsigned int nr_pages,
+					  u64 max_overage)
+{
+	unsigned long penalty_jiffies;
+
 	if (!max_overage)
 		return 0;
 
@@ -2411,7 +2418,8 @@ void mem_cgroup_handle_over_high(void)
 	 * memory.high is breached and reclaim is unable to keep up. Throttle
 	 * allocators proactively to slow down excessive growth.
 	 */
-	penalty_jiffies = calculate_high_delay(memcg, nr_pages);
+	penalty_jiffies = calculate_high_delay(memcg, nr_pages,
+					       mem_find_max_overage(memcg));
 
 	/*
 	 * Don't sleep if the amount of jiffies this memcg owes us is so low
-- 
2.25.4




* [PATCH mm v2 2/3] mm: move penalty delay clamping out of calculate_high_delay()
  2020-05-11 22:55 [PATCH mm v2 0/3] memcg: Slow down swap allocation as the available space gets depleted Jakub Kicinski
  2020-05-11 22:55 ` [PATCH mm v2 1/3] mm: prepare for swap over-high accounting and penalty calculation Jakub Kicinski
@ 2020-05-11 22:55 ` Jakub Kicinski
  2020-05-11 22:55 ` [PATCH mm v2 3/3] mm: automatically penalize tasks with high swap use Jakub Kicinski
  2 siblings, 0 replies; 15+ messages in thread
From: Jakub Kicinski @ 2020-05-11 22:55 UTC (permalink / raw)
  To: akpm
  Cc: linux-mm, kernel-team, tj, hannes, chris, cgroups, shakeelb,
	mhocko, Jakub Kicinski

We will want to call calculate_high_delay() twice - once for
memory and once for swap - and the clamp should be applied to the
sum of the two penalties. Clamping therefore has to be moved out
of calculate_high_delay().

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
---
 mm/memcontrol.c | 16 ++++++++--------
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 8a9b671c3249..66dd87bb9e0f 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2386,14 +2386,7 @@ static unsigned long calculate_high_delay(struct mem_cgroup *memcg,
 	 * MEMCG_CHARGE_BATCH pages is nominal, so work out how much smaller or
 	 * larger the current charge patch is than that.
 	 */
-	penalty_jiffies = penalty_jiffies * nr_pages / MEMCG_CHARGE_BATCH;
-
-	/*
-	 * Clamp the max delay per usermode return so as to still keep the
-	 * application moving forwards and also permit diagnostics, albeit
-	 * extremely slowly.
-	 */
-	return min(penalty_jiffies, MEMCG_MAX_HIGH_DELAY_JIFFIES);
+	return penalty_jiffies * nr_pages / MEMCG_CHARGE_BATCH;
 }
 
 /*
@@ -2421,6 +2414,13 @@ void mem_cgroup_handle_over_high(void)
 	penalty_jiffies = calculate_high_delay(memcg, nr_pages,
 					       mem_find_max_overage(memcg));
 
+	/*
+	 * Clamp the max delay per usermode return so as to still keep the
+	 * application moving forwards and also permit diagnostics, albeit
+	 * extremely slowly.
+	 */
+	penalty_jiffies = min(penalty_jiffies, MEMCG_MAX_HIGH_DELAY_JIFFIES);
+
 	/*
 	 * Don't sleep if the amount of jiffies this memcg owes us is so low
 	 * that it's not even worth doing, in an attempt to be nice to those who
-- 
2.25.4




* [PATCH mm v2 3/3] mm: automatically penalize tasks with high swap use
  2020-05-11 22:55 [PATCH mm v2 0/3] memcg: Slow down swap allocation as the available space gets depleted Jakub Kicinski
  2020-05-11 22:55 ` [PATCH mm v2 1/3] mm: prepare for swap over-high accounting and penalty calculation Jakub Kicinski
  2020-05-11 22:55 ` [PATCH mm v2 2/3] mm: move penalty delay clamping out of calculate_high_delay() Jakub Kicinski
@ 2020-05-11 22:55 ` Jakub Kicinski
  2020-05-12  7:26   ` Michal Hocko
  2020-05-13  8:38   ` Michal Hocko
  2 siblings, 2 replies; 15+ messages in thread
From: Jakub Kicinski @ 2020-05-11 22:55 UTC (permalink / raw)
  To: akpm
  Cc: linux-mm, kernel-team, tj, hannes, chris, cgroups, shakeelb,
	mhocko, Jakub Kicinski

Add a memory.swap.high knob, which can be used to protect the system
from swap exhaustion. The mechanism used for penalizing is similar
to the memory.high penalty (sleep on return to user space), but with
a less steep slope.

That is not to say that the knob itself is equivalent to memory.high.
The objective is more to protect the system from potentially buggy
tasks consuming a lot of swap and impacting other tasks, or even
bringing the whole system to a standstill with complete swap
exhaustion - hopefully without the need to find per-task hard
limits.

Slowing misbehaving tasks down gradually allows user space oom
killers or other protection mechanisms to react. oomd and earlyoom
already do killing based on swap exhaustion, and memory.swap.high
protection will help implement such userspace oom policies more
reliably.

Use one counter for the number of pages allocated under pressure
to save space in the task struct and avoid two separate hierarchy
walks on the hot path.

Use swap.high when deciding if swap is full.
Perform reclaim and count memory over high events.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
--
v2:
 - add docs,
 - improve commit message.
---
 Documentation/admin-guide/cgroup-v2.rst | 16 +++++
 include/linux/memcontrol.h              |  4 ++
 mm/memcontrol.c                         | 94 +++++++++++++++++++++++--
 3 files changed, 107 insertions(+), 7 deletions(-)

diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 5f12f203822e..c60226daa193 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -1374,6 +1374,22 @@ PAGE_SIZE multiple when read back.
 	The total amount of swap currently being used by the cgroup
 	and its descendants.
 
+  memory.swap.high
+	A read-write single value file which exists on non-root
+	cgroups.  The default is "max".
+
+	Swap usage throttle limit.  If a cgroup's swap usage exceeds
+	this limit, all its further allocations will be throttled to
+	allow userspace to implement custom out-of-memory procedures.
+
+	This limit marks a point of no return for the cgroup. It is NOT
+	designed to manage the amount of swapping a workload does
+	during regular operation. Compare to memory.swap.max, which
+	prohibits swapping past a set amount, but lets the cgroup
+	continue unimpeded as long as other memory can be reclaimed.
+
+	Healthy workloads are not expected to reach this limit.
+
   memory.swap.max
 	A read-write single value file which exists on non-root
 	cgroups.  The default is "max".
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index b478a4e83297..882bda952a5c 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -45,6 +45,7 @@ enum memcg_memory_event {
 	MEMCG_MAX,
 	MEMCG_OOM,
 	MEMCG_OOM_KILL,
+	MEMCG_SWAP_HIGH,
 	MEMCG_SWAP_MAX,
 	MEMCG_SWAP_FAIL,
 	MEMCG_NR_MEMORY_EVENTS,
@@ -212,6 +213,9 @@ struct mem_cgroup {
 	/* Upper bound of normal memory consumption range */
 	unsigned long high;
 
+	/* Upper bound of swap consumption range */
+	unsigned long swap_high;
+
 	/* Range enforcement for interrupt charges */
 	struct work_struct high_work;
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 66dd87bb9e0f..a3d13b30e3d6 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2353,12 +2353,34 @@ static u64 mem_find_max_overage(struct mem_cgroup *memcg)
 	return max_overage;
 }
 
+static u64 swap_find_max_overage(struct mem_cgroup *memcg)
+{
+	u64 overage, max_overage = 0;
+	struct mem_cgroup *max_cg;
+
+	do {
+		overage = calculate_overage(page_counter_read(&memcg->swap),
+					    READ_ONCE(memcg->swap_high));
+		if (overage > max_overage) {
+			max_overage = overage;
+			max_cg = memcg;
+		}
+	} while ((memcg = parent_mem_cgroup(memcg)) &&
+		 !mem_cgroup_is_root(memcg));
+
+	if (max_overage)
+		memcg_memory_event(max_cg, MEMCG_SWAP_HIGH);
+
+	return max_overage;
+}
+
 /*
  * Get the number of jiffies that we should penalise a mischievous cgroup which
  * is exceeding its memory.high by checking both it and its ancestors.
  */
 static unsigned long calculate_high_delay(struct mem_cgroup *memcg,
 					  unsigned int nr_pages,
+					  unsigned char cost_shift,
 					  u64 max_overage)
 {
 	unsigned long penalty_jiffies;
@@ -2366,6 +2388,9 @@ static unsigned long calculate_high_delay(struct mem_cgroup *memcg,
 	if (!max_overage)
 		return 0;
 
+	if (cost_shift)
+		max_overage >>= cost_shift;
+
 	/*
 	 * We use overage compared to memory.high to calculate the number of
 	 * jiffies to sleep (penalty_jiffies). Ideally this value should be
@@ -2411,9 +2436,16 @@ void mem_cgroup_handle_over_high(void)
 	 * memory.high is breached and reclaim is unable to keep up. Throttle
 	 * allocators proactively to slow down excessive growth.
 	 */
-	penalty_jiffies = calculate_high_delay(memcg, nr_pages,
+	penalty_jiffies = calculate_high_delay(memcg, nr_pages, 0,
 					       mem_find_max_overage(memcg));
 
+	/*
+	 * Make the swap curve more gradual, swap can be considered "cheaper",
+	 * and is allocated in larger chunks. We want the delays to be gradual.
+	 */
+	penalty_jiffies += calculate_high_delay(memcg, nr_pages, 2,
+						swap_find_max_overage(memcg));
+
 	/*
 	 * Clamp the max delay per usermode return so as to still keep the
 	 * application moving forwards and also permit diagnostics, albeit
@@ -2604,12 +2636,23 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	 * reclaim, the cost of mismatch is negligible.
 	 */
 	do {
-		if (page_counter_read(&memcg->memory) > READ_ONCE(memcg->high)) {
-			/* Don't bother a random interrupted task */
-			if (in_interrupt()) {
+		bool mem_high, swap_high;
+
+		mem_high = page_counter_read(&memcg->memory) >
+			READ_ONCE(memcg->high);
+		swap_high = page_counter_read(&memcg->swap) >
+			READ_ONCE(memcg->swap_high);
+
+		/* Don't bother a random interrupted task */
+		if (in_interrupt()) {
+			if (mem_high) {
 				schedule_work(&memcg->high_work);
 				break;
 			}
+			continue;
+		}
+
+		if (mem_high || swap_high) {
 			current->memcg_nr_pages_over_high += batch;
 			set_notify_resume(current);
 			break;
@@ -5076,6 +5119,7 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
 
 	WRITE_ONCE(memcg->high, PAGE_COUNTER_MAX);
 	memcg->soft_limit = PAGE_COUNTER_MAX;
+	WRITE_ONCE(memcg->swap_high, PAGE_COUNTER_MAX);
 	if (parent) {
 		memcg->swappiness = mem_cgroup_swappiness(parent);
 		memcg->oom_kill_disable = parent->oom_kill_disable;
@@ -5229,6 +5273,7 @@ static void mem_cgroup_css_reset(struct cgroup_subsys_state *css)
 	page_counter_set_low(&memcg->memory, 0);
 	WRITE_ONCE(memcg->high, PAGE_COUNTER_MAX);
 	memcg->soft_limit = PAGE_COUNTER_MAX;
+	WRITE_ONCE(memcg->swap_high, PAGE_COUNTER_MAX);
 	memcg_wb_domain_size_changed(memcg);
 }
 
@@ -7136,10 +7181,13 @@ bool mem_cgroup_swap_full(struct page *page)
 	if (!memcg)
 		return false;
 
-	for (; memcg != root_mem_cgroup; memcg = parent_mem_cgroup(memcg))
-		if (page_counter_read(&memcg->swap) * 2 >=
-		    READ_ONCE(memcg->swap.max))
+	for (; memcg != root_mem_cgroup; memcg = parent_mem_cgroup(memcg)) {
+		unsigned long usage = page_counter_read(&memcg->swap);
+
+		if (usage * 2 >= READ_ONCE(memcg->swap_high) ||
+		    usage * 2 >= READ_ONCE(memcg->swap.max))
 			return true;
+	}
 
 	return false;
 }
@@ -7169,6 +7217,30 @@ static u64 swap_current_read(struct cgroup_subsys_state *css,
 	return (u64)page_counter_read(&memcg->swap) * PAGE_SIZE;
 }
 
+static int swap_high_show(struct seq_file *m, void *v)
+{
+	unsigned long high = READ_ONCE(mem_cgroup_from_seq(m)->swap_high);
+
+	return seq_puts_memcg_tunable(m, high);
+}
+
+static ssize_t swap_high_write(struct kernfs_open_file *of,
+			       char *buf, size_t nbytes, loff_t off)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
+	unsigned long high;
+	int err;
+
+	buf = strstrip(buf);
+	err = page_counter_memparse(buf, "max", &high);
+	if (err)
+		return err;
+
+	WRITE_ONCE(memcg->swap_high, high);
+
+	return nbytes;
+}
+
 static int swap_max_show(struct seq_file *m, void *v)
 {
 	return seq_puts_memcg_tunable(m,
@@ -7196,6 +7268,8 @@ static int swap_events_show(struct seq_file *m, void *v)
 {
 	struct mem_cgroup *memcg = mem_cgroup_from_seq(m);
 
+	seq_printf(m, "high %lu\n",
+		   atomic_long_read(&memcg->memory_events[MEMCG_SWAP_HIGH]));
 	seq_printf(m, "max %lu\n",
 		   atomic_long_read(&memcg->memory_events[MEMCG_SWAP_MAX]));
 	seq_printf(m, "fail %lu\n",
@@ -7210,6 +7284,12 @@ static struct cftype swap_files[] = {
 		.flags = CFTYPE_NOT_ON_ROOT,
 		.read_u64 = swap_current_read,
 	},
+	{
+		.name = "swap.high",
+		.flags = CFTYPE_NOT_ON_ROOT,
+		.seq_show = swap_high_show,
+		.write = swap_high_write,
+	},
 	{
 		.name = "swap.max",
 		.flags = CFTYPE_NOT_ON_ROOT,
-- 
2.25.4




* Re: [PATCH mm v2 1/3] mm: prepare for swap over-high accounting and penalty calculation
  2020-05-11 22:55 ` [PATCH mm v2 1/3] mm: prepare for swap over-high accounting and penalty calculation Jakub Kicinski
@ 2020-05-12  7:08   ` Michal Hocko
  2020-05-12 17:28     ` Jakub Kicinski
  0 siblings, 1 reply; 15+ messages in thread
From: Michal Hocko @ 2020-05-12  7:08 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: akpm, linux-mm, kernel-team, tj, hannes, chris, cgroups, shakeelb

On Mon 11-05-20 15:55:14, Jakub Kicinski wrote:
> Slice the memory overage calculation logic a little bit so we can
> reuse it to apply a similar penalty to swap. The logic which
> accesses the memory-specific fields (usage and high values) has to
> be taken out of calculate_high_delay().
> 
> Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Acked-by: Michal Hocko <mhocko@suse.com>

some recommendations below.

> ---
>  mm/memcontrol.c | 62 ++++++++++++++++++++++++++++---------------------
>  1 file changed, 35 insertions(+), 27 deletions(-)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 05dcb72314b5..8a9b671c3249 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2321,41 +2321,48 @@ static void high_work_func(struct work_struct *work)
>   #define MEMCG_DELAY_PRECISION_SHIFT 20
>   #define MEMCG_DELAY_SCALING_SHIFT 14
>  
> -/*
> - * Get the number of jiffies that we should penalise a mischievous cgroup which
> - * is exceeding its memory.high by checking both it and its ancestors.
> - */
> -static unsigned long calculate_high_delay(struct mem_cgroup *memcg,
> -					  unsigned int nr_pages)
> +static u64 calculate_overage(unsigned long usage, unsigned long high)

the naming is slightly confusing. I would consider the return value
to be in memory units rather than time, because I read it as the
overage of high. calculate_throttle_penalty would be clearer to me.

>  {
> -	unsigned long penalty_jiffies;
> -	u64 max_overage = 0;
> -
> -	do {
> -		unsigned long usage, high;
> -		u64 overage;
> +	u64 overage;
>  
> -		usage = page_counter_read(&memcg->memory);
> -		high = READ_ONCE(memcg->high);
> +	if (usage <= high)
> +		return 0;
>  
> -		if (usage <= high)
> -			continue;
> +	/*
> +	 * Prevent division by 0 in overage calculation by acting as if
> +	 * it was a threshold of 1 page
> +	 */
> +	high = max(high, 1UL);
>  
> -		/*
> -		 * Prevent division by 0 in overage calculation by acting as if
> -		 * it was a threshold of 1 page
> -		 */
> -		high = max(high, 1UL);
> +	overage = usage - high;
> +	overage <<= MEMCG_DELAY_PRECISION_SHIFT;
> +	return div64_u64(overage, high);
> +}
>  
> -		overage = usage - high;
> -		overage <<= MEMCG_DELAY_PRECISION_SHIFT;
> -		overage = div64_u64(overage, high);
> +static u64 mem_find_max_overage(struct mem_cgroup *memcg)

This would then become find_high_throttle_penalty

> +{
> +	u64 overage, max_overage = 0;
>  
> -		if (overage > max_overage)
> -			max_overage = overage;
> +	do {
> +		overage = calculate_overage(page_counter_read(&memcg->memory),
> +					    READ_ONCE(memcg->high));
> +		max_overage = max(overage, max_overage);
>  	} while ((memcg = parent_mem_cgroup(memcg)) &&
>  		 !mem_cgroup_is_root(memcg));
>  
> +	return max_overage;
> +}
> +
> +/*
> + * Get the number of jiffies that we should penalise a mischievous cgroup which
> + * is exceeding its memory.high by checking both it and its ancestors.
> + */
> +static unsigned long calculate_high_delay(struct mem_cgroup *memcg,
> +					  unsigned int nr_pages,
> +					  u64 max_overage)
> +{
> +	unsigned long penalty_jiffies;
> +
>  	if (!max_overage)
>  		return 0;
>  
> @@ -2411,7 +2418,8 @@ void mem_cgroup_handle_over_high(void)
>  	 * memory.high is breached and reclaim is unable to keep up. Throttle
>  	 * allocators proactively to slow down excessive growth.
>  	 */
> -	penalty_jiffies = calculate_high_delay(memcg, nr_pages);
> +	penalty_jiffies = calculate_high_delay(memcg, nr_pages,
> +					       mem_find_max_overage(memcg));
>  
>  	/*
>  	 * Don't sleep if the amount of jiffies this memcg owes us is so low
> -- 
> 2.25.4

-- 
Michal Hocko
SUSE Labs



* Re: [PATCH mm v2 3/3] mm: automatically penalize tasks with high swap use
  2020-05-11 22:55 ` [PATCH mm v2 3/3] mm: automatically penalize tasks with high swap use Jakub Kicinski
@ 2020-05-12  7:26   ` Michal Hocko
  2020-05-12 17:55     ` Jakub Kicinski
  2020-05-13  8:38   ` Michal Hocko
  1 sibling, 1 reply; 15+ messages in thread
From: Michal Hocko @ 2020-05-12  7:26 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: akpm, linux-mm, kernel-team, tj, hannes, chris, cgroups, shakeelb

On Mon 11-05-20 15:55:16, Jakub Kicinski wrote:
> Add a memory.swap.high knob, which can be used to protect the system
> from swap exhaustion. The mechanism used for penalizing is similar
> to the memory.high penalty (sleep on return to user space), but with
> a less steep slope.
> 
> That is not to say that the knob itself is equivalent to memory.high.
> The objective is more to protect the system from potentially buggy
> tasks consuming a lot of swap and impacting other tasks, or even
> bringing the whole system to a standstill with complete swap
> exhaustion - hopefully without the need to find per-task hard
> limits.
> 
> Slowing misbehaving tasks down gradually allows user space oom
> killers or other protection mechanisms to react. oomd and earlyoom
> already do killing based on swap exhaustion, and memory.swap.high
> protection will help implement such userspace oom policies more
> reliably.

Thanks for adding more information about the usecase and motivation.
> 
> Use one counter for the number of pages allocated under pressure
> to save space in the task struct and avoid two separate hierarchy
> walks on the hot path.
> 
> Use swap.high when deciding if swap is full.

Please be more specific about why.

> Perform reclaim and count memory over high events.

Please expand on this and explain how this works and why the
semantics are subtly different from MEMCG_HIGH. I suspect the reason
is that there is no reclaim for swap, so you are only emitting an
event on the memcg which is actually throttled. This is in line with
memory.high, but the difference is that we reclaim each memcg subtree
in excess of its high limit. That means that the counter tells us how
many times the specific memcg was in excess, which would be impossible
with your implementation.

I would also suggest explaining, or ideally even separating, the swap
penalty scaling logic into a separate patch. What kind of data is it
based on?
 
> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
> --
> v2:
>  - add docs,
>  - improve commit message.
> ---
>  Documentation/admin-guide/cgroup-v2.rst | 16 +++++
>  include/linux/memcontrol.h              |  4 ++
>  mm/memcontrol.c                         | 94 +++++++++++++++++++++++--
>  3 files changed, 107 insertions(+), 7 deletions(-)
> 
> diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
> index 5f12f203822e..c60226daa193 100644
> --- a/Documentation/admin-guide/cgroup-v2.rst
> +++ b/Documentation/admin-guide/cgroup-v2.rst
> @@ -1374,6 +1374,22 @@ PAGE_SIZE multiple when read back.
>  	The total amount of swap currently being used by the cgroup
>  	and its descendants.
>  
> +  memory.swap.high
> +	A read-write single value file which exists on non-root
> +	cgroups.  The default is "max".
> +
> +	Swap usage throttle limit.  If a cgroup's swap usage exceeds
> +	this limit, all its further allocations will be throttled to
> +	allow userspace to implement custom out-of-memory procedures.
> +
> +	This limit marks a point of no return for the cgroup. It is NOT
> +	designed to manage the amount of swapping a workload does
> +	during regular operation. Compare to memory.swap.max, which
> +	prohibits swapping past a set amount, but lets the cgroup
> +	continue unimpeded as long as other memory can be reclaimed.
> +
> +	Healthy workloads are not expected to reach this limit.
> +
>    memory.swap.max
>  	A read-write single value file which exists on non-root
>  	cgroups.  The default is "max".
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index b478a4e83297..882bda952a5c 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -45,6 +45,7 @@ enum memcg_memory_event {
>  	MEMCG_MAX,
>  	MEMCG_OOM,
>  	MEMCG_OOM_KILL,
> +	MEMCG_SWAP_HIGH,
>  	MEMCG_SWAP_MAX,
>  	MEMCG_SWAP_FAIL,
>  	MEMCG_NR_MEMORY_EVENTS,
> @@ -212,6 +213,9 @@ struct mem_cgroup {
>  	/* Upper bound of normal memory consumption range */
>  	unsigned long high;
>  
> +	/* Upper bound of swap consumption range */
> +	unsigned long swap_high;
> +
>  	/* Range enforcement for interrupt charges */
>  	struct work_struct high_work;
>  
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 66dd87bb9e0f..a3d13b30e3d6 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2353,12 +2353,34 @@ static u64 mem_find_max_overage(struct mem_cgroup *memcg)
>  	return max_overage;
>  }
>  
> +static u64 swap_find_max_overage(struct mem_cgroup *memcg)
> +{
> +	u64 overage, max_overage = 0;
> +	struct mem_cgroup *max_cg;
> +
> +	do {
> +		overage = calculate_overage(page_counter_read(&memcg->swap),
> +					    READ_ONCE(memcg->swap_high));
> +		if (overage > max_overage) {
> +			max_overage = overage;
> +			max_cg = memcg;
> +		}
> +	} while ((memcg = parent_mem_cgroup(memcg)) &&
> +		 !mem_cgroup_is_root(memcg));
> +
> +	if (max_overage)
> +		memcg_memory_event(max_cg, MEMCG_SWAP_HIGH);
> +
> +	return max_overage;
> +}
> +
>  /*
>   * Get the number of jiffies that we should penalise a mischievous cgroup which
>   * is exceeding its memory.high by checking both it and its ancestors.
>   */
>  static unsigned long calculate_high_delay(struct mem_cgroup *memcg,
>  					  unsigned int nr_pages,
> +					  unsigned char cost_shift,
>  					  u64 max_overage)
>  {
>  	unsigned long penalty_jiffies;
> @@ -2366,6 +2388,9 @@ static unsigned long calculate_high_delay(struct mem_cgroup *memcg,
>  	if (!max_overage)
>  		return 0;
>  
> +	if (cost_shift)
> +		max_overage >>= cost_shift;
> +
>  	/*
>  	 * We use overage compared to memory.high to calculate the number of
>  	 * jiffies to sleep (penalty_jiffies). Ideally this value should be
> @@ -2411,9 +2436,16 @@ void mem_cgroup_handle_over_high(void)
>  	 * memory.high is breached and reclaim is unable to keep up. Throttle
>  	 * allocators proactively to slow down excessive growth.
>  	 */
> -	penalty_jiffies = calculate_high_delay(memcg, nr_pages,
> +	penalty_jiffies = calculate_high_delay(memcg, nr_pages, 0,
>  					       mem_find_max_overage(memcg));
>  
> +	/*
> +	 * Make the swap curve more gradual, swap can be considered "cheaper",
> +	 * and is allocated in larger chunks. We want the delays to be gradual.
> +	 */
> +	penalty_jiffies += calculate_high_delay(memcg, nr_pages, 2,
> +						swap_find_max_overage(memcg));
> +
>  	/*
>  	 * Clamp the max delay per usermode return so as to still keep the
>  	 * application moving forwards and also permit diagnostics, albeit
> @@ -2604,12 +2636,23 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
>  	 * reclaim, the cost of mismatch is negligible.
>  	 */
>  	do {
> -		if (page_counter_read(&memcg->memory) > READ_ONCE(memcg->high)) {
> -			/* Don't bother a random interrupted task */
> -			if (in_interrupt()) {
> +		bool mem_high, swap_high;
> +
> +		mem_high = page_counter_read(&memcg->memory) >
> +			READ_ONCE(memcg->high);
> +		swap_high = page_counter_read(&memcg->swap) >
> +			READ_ONCE(memcg->swap_high);
> +
> +		/* Don't bother a random interrupted task */
> +		if (in_interrupt()) {
> +			if (mem_high) {
>  				schedule_work(&memcg->high_work);
>  				break;
>  			}
> +			continue;
> +		}
> +
> +		if (mem_high || swap_high) {
>  			current->memcg_nr_pages_over_high += batch;
>  			set_notify_resume(current);
>  			break;
> @@ -5076,6 +5119,7 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
>  
>  	WRITE_ONCE(memcg->high, PAGE_COUNTER_MAX);
>  	memcg->soft_limit = PAGE_COUNTER_MAX;
> +	WRITE_ONCE(memcg->swap_high, PAGE_COUNTER_MAX);
>  	if (parent) {
>  		memcg->swappiness = mem_cgroup_swappiness(parent);
>  		memcg->oom_kill_disable = parent->oom_kill_disable;
> @@ -5229,6 +5273,7 @@ static void mem_cgroup_css_reset(struct cgroup_subsys_state *css)
>  	page_counter_set_low(&memcg->memory, 0);
>  	WRITE_ONCE(memcg->high, PAGE_COUNTER_MAX);
>  	memcg->soft_limit = PAGE_COUNTER_MAX;
> +	WRITE_ONCE(memcg->swap_high, PAGE_COUNTER_MAX);
>  	memcg_wb_domain_size_changed(memcg);
>  }
>  
> @@ -7136,10 +7181,13 @@ bool mem_cgroup_swap_full(struct page *page)
>  	if (!memcg)
>  		return false;
>  
> -	for (; memcg != root_mem_cgroup; memcg = parent_mem_cgroup(memcg))
> -		if (page_counter_read(&memcg->swap) * 2 >=
> -		    READ_ONCE(memcg->swap.max))
> +	for (; memcg != root_mem_cgroup; memcg = parent_mem_cgroup(memcg)) {
> +		unsigned long usage = page_counter_read(&memcg->swap);
> +
> +		if (usage * 2 >= READ_ONCE(memcg->swap_high) ||
> +		    usage * 2 >= READ_ONCE(memcg->swap.max))
>  			return true;
> +	}
>  
>  	return false;
>  }
> @@ -7169,6 +7217,30 @@ static u64 swap_current_read(struct cgroup_subsys_state *css,
>  	return (u64)page_counter_read(&memcg->swap) * PAGE_SIZE;
>  }
>  
> +static int swap_high_show(struct seq_file *m, void *v)
> +{
> +	unsigned long high = READ_ONCE(mem_cgroup_from_seq(m)->swap_high);
> +
> +	return seq_puts_memcg_tunable(m, high);
> +}
> +
> +static ssize_t swap_high_write(struct kernfs_open_file *of,
> +			       char *buf, size_t nbytes, loff_t off)
> +{
> +	struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
> +	unsigned long high;
> +	int err;
> +
> +	buf = strstrip(buf);
> +	err = page_counter_memparse(buf, "max", &high);
> +	if (err)
> +		return err;
> +
> +	WRITE_ONCE(memcg->swap_high, high);
> +
> +	return nbytes;
> +}
> +
>  static int swap_max_show(struct seq_file *m, void *v)
>  {
>  	return seq_puts_memcg_tunable(m,
> @@ -7196,6 +7268,8 @@ static int swap_events_show(struct seq_file *m, void *v)
>  {
>  	struct mem_cgroup *memcg = mem_cgroup_from_seq(m);
>  
> +	seq_printf(m, "high %lu\n",
> +		   atomic_long_read(&memcg->memory_events[MEMCG_SWAP_HIGH]));
>  	seq_printf(m, "max %lu\n",
>  		   atomic_long_read(&memcg->memory_events[MEMCG_SWAP_MAX]));
>  	seq_printf(m, "fail %lu\n",
> @@ -7210,6 +7284,12 @@ static struct cftype swap_files[] = {
>  		.flags = CFTYPE_NOT_ON_ROOT,
>  		.read_u64 = swap_current_read,
>  	},
> +	{
> +		.name = "swap.high",
> +		.flags = CFTYPE_NOT_ON_ROOT,
> +		.seq_show = swap_high_show,
> +		.write = swap_high_write,
> +	},
>  	{
>  		.name = "swap.max",
>  		.flags = CFTYPE_NOT_ON_ROOT,
> -- 
> 2.25.4

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH mm v2 1/3] mm: prepare for swap over-high accounting and penalty calculation
  2020-05-12  7:08   ` Michal Hocko
@ 2020-05-12 17:28     ` Jakub Kicinski
  2020-05-13  8:06       ` Michal Hocko
  0 siblings, 1 reply; 15+ messages in thread
From: Jakub Kicinski @ 2020-05-12 17:28 UTC (permalink / raw)
  To: Michal Hocko
  Cc: akpm, linux-mm, kernel-team, tj, hannes, chris, cgroups, shakeelb

On Tue, 12 May 2020 09:08:58 +0200 Michal Hocko wrote:
> On Mon 11-05-20 15:55:14, Jakub Kicinski wrote:
> > Slice the memory overage calculation logic a little bit so we can
> > reuse it to apply a similar penalty to the swap. The logic which
> > accesses the memory-specific fields (use and high values) has to
> > be taken out of calculate_high_delay().
> > 
> > Signed-off-by: Jakub Kicinski <kuba@kernel.org>  
> 
> Acked-by: Michal Hocko <mhocko@suse.com>
> 
> some recommendations below.

Thank you!

> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index 05dcb72314b5..8a9b671c3249 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -2321,41 +2321,48 @@ static void high_work_func(struct work_struct *work)
> >   #define MEMCG_DELAY_PRECISION_SHIFT 20
> >   #define MEMCG_DELAY_SCALING_SHIFT 14
> >  
> > -/*
> > - * Get the number of jiffies that we should penalise a mischievous cgroup which
> > - * is exceeding its memory.high by checking both it and its ancestors.
> > - */
> > -static unsigned long calculate_high_delay(struct mem_cgroup *memcg,
> > -					  unsigned int nr_pages)
> > +static u64 calculate_overage(unsigned long usage, unsigned long high)  
> 
> the naming is slightly confusing. I would consider the return value
> to be in memory units rather than time because I would read it as
> overage of high. calculate_throttle_penalty would be more clear to me.

Hm. The unit is a fraction of high. Here is the code; it's quite hard
to read in diff form (I should have used --histogram, sorry):

static u64 calculate_overage(unsigned long usage, unsigned long high)
{
	u64 overage;

	if (usage <= high)
		return 0;

	/*
	 * Prevent division by 0 in overage calculation by acting as if
	 * it was a threshold of 1 page
	 */
	high = max(high, 1UL);

	overage = usage - high;
	overage <<= MEMCG_DELAY_PRECISION_SHIFT;
	return div64_u64(overage, high);
}

calculate_throttle_penalty() sounds like it returns time. How about
something like calc_overage_frac()? Or calc_overage_perc()?
(abbreviating to "calc" so the caller fits on a line)



* Re: [PATCH mm v2 3/3] mm: automatically penalize tasks with high swap use
  2020-05-12  7:26   ` Michal Hocko
@ 2020-05-12 17:55     ` Jakub Kicinski
  2020-05-13  8:32       ` Michal Hocko
  0 siblings, 1 reply; 15+ messages in thread
From: Jakub Kicinski @ 2020-05-12 17:55 UTC (permalink / raw)
  To: Michal Hocko
  Cc: akpm, linux-mm, kernel-team, tj, hannes, chris, cgroups, shakeelb

On Tue, 12 May 2020 09:26:34 +0200 Michal Hocko wrote:
> On Mon 11-05-20 15:55:16, Jakub Kicinski wrote:
> > Use swap.high when deciding if swap is full.  
> 
> Please be more specific why.

How about:

    Use swap.high when deciding if swap is full to influence ongoing
    swap reclaim in a best effort manner.

> > Perform reclaim and count memory over high events.  
> 
> Please expand on this and explain how this is working and why the
> semantic is subtly different from MEMCG_HIGH. I suspect the reason
> is that there is no reclaim for the swap so you are only emitting an
> event on the memcg which is actually throttled. This is in line with
> memory.high but the difference is that we do reclaim each memcg subtree
> in the high limit excess. That means that the counter tells us how many
> times the specific memcg was in excess which would be impossible with
> your implementation.

Right, with memory all cgroups over high get penalized with the extra
reclaim work. For swap we just have the delay, so the event is
associated with the worst offender, anything lower didn't really matter.

But it's easy enough to change if you prefer. Otherwise I'll just add
this to the commit message:

  Count swap over high events. Note that unlike memory over high events
  we only count them for the worst offender. This is because the
  delay penalties for both swap and memory over high are not cumulative,
  i.e. we use the max delay.

> I would also suggest to explain or ideally even separate the swap
> penalty scaling logic to a separate patch. What kind of data is it based
> on?

It's a hard thing to get production data for since, as we mentioned we
don't expect the limit to be hit. It was more of a process of
experimentation and finding a gradual slope that "felt right"...

Is there a more scientific process we can follow here? We want the
delay to be small at first for a first few pages and then grow to make
sure we stop the task from going too much over high. The square
function works pretty well IMHO.



* Re: [PATCH mm v2 1/3] mm: prepare for swap over-high accounting and penalty calculation
  2020-05-12 17:28     ` Jakub Kicinski
@ 2020-05-13  8:06       ` Michal Hocko
  0 siblings, 0 replies; 15+ messages in thread
From: Michal Hocko @ 2020-05-13  8:06 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: akpm, linux-mm, kernel-team, tj, hannes, chris, cgroups, shakeelb

On Tue 12-05-20 10:28:19, Jakub Kicinski wrote:
> On Tue, 12 May 2020 09:08:58 +0200 Michal Hocko wrote:
> > On Mon 11-05-20 15:55:14, Jakub Kicinski wrote:
> > > Slice the memory overage calculation logic a little bit so we can
> > > reuse it to apply a similar penalty to the swap. The logic which
> > > accesses the memory-specific fields (use and high values) has to
> > > be taken out of calculate_high_delay().
> > > 
> > > Signed-off-by: Jakub Kicinski <kuba@kernel.org>  
> > 
> > Acked-by: Michal Hocko <mhocko@suse.com>
> > 
> > some recommendations below.
> 
> Thank you!
> 
> > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > > index 05dcb72314b5..8a9b671c3249 100644
> > > --- a/mm/memcontrol.c
> > > +++ b/mm/memcontrol.c
> > > @@ -2321,41 +2321,48 @@ static void high_work_func(struct work_struct *work)
> > >   #define MEMCG_DELAY_PRECISION_SHIFT 20
> > >   #define MEMCG_DELAY_SCALING_SHIFT 14
> > >  
> > > -/*
> > > - * Get the number of jiffies that we should penalise a mischievous cgroup which
> > > - * is exceeding its memory.high by checking both it and its ancestors.
> > > - */
> > > -static unsigned long calculate_high_delay(struct mem_cgroup *memcg,
> > > -					  unsigned int nr_pages)
> > > +static u64 calculate_overage(unsigned long usage, unsigned long high)  
> > 
> > the naming is slightly confusing. I would consider the return value
> > to be in memory units rather than time because I would read it as
> > overage of high. calculate_throttle_penalty would be more clear to me.
> 
> Hm. The unit is the fraction of high. Here is the code, it's quite hard
> to read in diff form (I should have used --histogram, sorry):

Yeah, I have checked the resulting code.

> static u64 calculate_overage(unsigned long usage, unsigned long high)
> {
> 	u64 overage;
> 
> 	if (usage <= high)
> 		return 0;
> 
> 	/*
> 	 * Prevent division by 0 in overage calculation by acting as if
> 	 * it was a threshold of 1 page
> 	 */
> 	high = max(high, 1UL);
> 
> 	overage = usage - high;
> 	overage <<= MEMCG_DELAY_PRECISION_SHIFT;
> 	return div64_u64(overage, high);
> }
> 
> calculate_throttle_penalty() sounds like it returns time. How about
> something like calc_overage_frac()? Or calc_overage_perc()?
> (abbreviating to "calc" so the caller fits on a line)

heh, naming is hard and not the most important thing in the world. So if
_penalty doesn't really sound good to you then let's just stick with
what you've had. I do not really like the _perc/_frac much more because
this is more about the implementation of the function than the
intention. We shouldn't really care whether the throttling is based on
overage scaled linearly (aka fraction) or by other means. The
implementation might change in the future.

-- 
Michal Hocko
SUSE Labs



* Re: [PATCH mm v2 3/3] mm: automatically penalize tasks with high swap use
  2020-05-12 17:55     ` Jakub Kicinski
@ 2020-05-13  8:32       ` Michal Hocko
  2020-05-13 18:36         ` Jakub Kicinski
  0 siblings, 1 reply; 15+ messages in thread
From: Michal Hocko @ 2020-05-13  8:32 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: akpm, linux-mm, kernel-team, tj, hannes, chris, cgroups, shakeelb

On Tue 12-05-20 10:55:36, Jakub Kicinski wrote:
> On Tue, 12 May 2020 09:26:34 +0200 Michal Hocko wrote:
> > On Mon 11-05-20 15:55:16, Jakub Kicinski wrote:
> > > Use swap.high when deciding if swap is full.  
> > 
> > Please be more specific why.
> 
> How about:
> 
>     Use swap.high when deciding if swap is full to influence ongoing
>     swap reclaim in a best effort manner.

This is still way too vague. The crux is why should we treat hard and
high swap limit the same for mem_cgroup_swap_full purpose. Please note
that I am not saying this is wrong. I am asking for a more detailed
explanation mostly because I would bet that somebody stumbles over this
sooner or later.

mem_cgroup_swap_full is an odd predicate. It doesn't really want to tell
that the swap is actually full. I haven't studied the original intention
but it is more in line with a mem_cgroup_swap_under_pressure based on the
current usage to (attempt to) scale the swap cache size.

> > > Perform reclaim and count memory over high events.  
> > 
> > Please expand on this and explain how this is working and why the
> > semantic is subtly different from MEMCG_HIGH. I suspect the reason
> > is that there is no reclaim for the swap so you are only emitting an
> > event on the memcg which is actually throttled. This is in line with
> > memory.high but the difference is that we do reclaim each memcg subtree
> > in the high limit excess. That means that the counter tells us how many
> > times the specific memcg was in excess which would be impossible with
> > your implementation.
> 
> Right, with memory all cgroups over high get penalized with the extra
> reclaim work. For swap we just have the delay, so the event is
> associated with the worst offender, anything lower didn't really matter.
> 
> But it's easy enough to change if you prefer. Otherwise I'll just add
> this to the commit message:
> 
>   Count swap over high events. Note that unlike memory over high events
>   we only count them for the worst offender. This is because the
>   delay penalties for both swap and memory over high are not cumulative,
>   i.e. we use the max delay.

Well, memory high penalty is in fact cumulative, because the reclaim
would happen for each memcg subtree up the hierarchy. Sure the
additional throttling is not cumulative but that is not really that
important because the exact amount of throttling is an implementation detail.
The swap high is an odd one here because we do not reclaim swap so the
cumulative effect of that is 0 and there is only the additional
throttling happening. I suspect that your current implementation is
exposing an internal implementation detail to userspace but considering how
the current memory high event is documented
          high
                The number of times processes of the cgroup are
                throttled and routed to perform direct memory reclaim
                because the high memory boundary was exceeded.  For a
                cgroup whose memory usage is capped by the high limit
                rather than global memory pressure, this event's
                occurrences are expected.

it talks about throttling rather than excess (like max) so I am not
really sure. I believe that it would be much better if both events were
more explicit about counting an excess and a throttling is just a side
effect of that situation.

I do not expect that we will have any form of the swap reclaim anytime
soon (if ever) but I fail to see why to create a small little trap like
this now.

> > I would also suggest to explain or ideally even separate the swap
> > penalty scaling logic to a separate patch. What kind of data is it based
> > on?
> 
> It's a hard thing to get production data for since, as we mentioned we
> don't expect the limit to be hit. It was more of a process of
> experimentation and finding a gradual slope that "felt right"...
> 
> Is there a more scientific process we can follow here? We want the
> delay to be small at first for a first few pages and then grow to make
> sure we stop the task from going too much over high. The square
> function works pretty well IMHO.

If there is no data to showing this to be an improvement then I would
just not add an additional scaling factor. Why? Mostly because once we
have it there it would be extremely hard to change. MM is full of these
little heuristics that are copied over because nobody dares to touch
them. If a different scaling is really needed it can always be added
later with some data to back that.

-- 
Michal Hocko
SUSE Labs



* Re: [PATCH mm v2 3/3] mm: automatically penalize tasks with high swap use
  2020-05-11 22:55 ` [PATCH mm v2 3/3] mm: automatically penalize tasks with high swap use Jakub Kicinski
  2020-05-12  7:26   ` Michal Hocko
@ 2020-05-13  8:38   ` Michal Hocko
  1 sibling, 0 replies; 15+ messages in thread
From: Michal Hocko @ 2020-05-13  8:38 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: akpm, linux-mm, kernel-team, tj, hannes, chris, cgroups, shakeelb

On Mon 11-05-20 15:55:16, Jakub Kicinski wrote:
> diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
> index 5f12f203822e..c60226daa193 100644
> --- a/Documentation/admin-guide/cgroup-v2.rst
> +++ b/Documentation/admin-guide/cgroup-v2.rst
> @@ -1374,6 +1374,22 @@ PAGE_SIZE multiple when read back.
>  	The total amount of swap currently being used by the cgroup
>  	and its descendants.
>  
> +  memory.swap.high
> +	A read-write single value file which exists on non-root
> +	cgroups.  The default is "max".
> +
> +	Swap usage throttle limit.  If a cgroup's swap usage exceeds
> +	this limit, all its further allocations will be throttled to
> +	allow userspace to implement custom out-of-memory procedures.
> +
> +	This limit marks a point of no return for the cgroup. It is NOT
> +	designed to manage the amount of swapping a workload does
> +	during regular operation. Compare to memory.swap.max, which
> +	prohibits swapping past a set amount, but lets the cgroup
> +	continue unimpeded as long as other memory can be reclaimed.
> +
> +	Healthy workloads are not expected to reach this limit.
> +

Btw. I forgot to mention that before but you should also add a
documentation for the swap high event to this file.
-- 
Michal Hocko
SUSE Labs



* Re: [PATCH mm v2 3/3] mm: automatically penalize tasks with high swap use
  2020-05-13  8:32       ` Michal Hocko
@ 2020-05-13 18:36         ` Jakub Kicinski
  2020-05-14  7:42           ` Michal Hocko
  0 siblings, 1 reply; 15+ messages in thread
From: Jakub Kicinski @ 2020-05-13 18:36 UTC (permalink / raw)
  To: Michal Hocko
  Cc: akpm, linux-mm, kernel-team, tj, hannes, chris, cgroups, shakeelb

On Wed, 13 May 2020 10:32:49 +0200 Michal Hocko wrote:
> On Tue 12-05-20 10:55:36, Jakub Kicinski wrote:
> > On Tue, 12 May 2020 09:26:34 +0200 Michal Hocko wrote:  
> > > On Mon 11-05-20 15:55:16, Jakub Kicinski wrote:  
> > > > Use swap.high when deciding if swap is full.    
> > > 
> > > Please be more specific why.  
> > 
> > How about:
> > 
> >     Use swap.high when deciding if swap is full to influence ongoing
> >     swap reclaim in a best effort manner.  
> 
> This is still way too vague. The crux is why should we treat hard and
> high swap limit the same for mem_cgroup_swap_full purpose. Please
> note that I am not saying this is wrong. I am asking for a more
> detailed explanation mostly because I would bet that somebody
> stumbles over this sooner or later.

Stumbles in what way? Isn't it expected for the kernel to take
reasonable precautions to avoid hitting limits?  We expect the
application which breaches swap.high to get terminated by user 
space OOM, kernel best be careful about approaching that limit, 
no?

> mem_cgroup_swap_full is an odd predicate. It doesn't really want to
> tell that the swap is really full. I haven't studied the original
> intention but it is more in line of mem_cgroup_swap_under_pressure
> based on the current usage to (attempt) scale swap cache size.

Perhaps Johannes has some experience here? The "50% means full"
heuristic predates git :(

> > > > Perform reclaim and count memory over high events.    
> > > 
> > > Please expand on this and explain how this is working and why the
> > > semantic is subtly different from MEMCG_HIGH. I suspect the reason
> > > is that there is no reclaim for the swap so you are only emitting
> > > an event on the memcg which is actually throttled. This is in
> > > line with memory.high but the difference is that we do reclaim
> > > each memcg subtree in the high limit excess. That means that the
> > > counter tells us how many times the specific memcg was in excess
> > > which would be impossible with your implementation.  
> > 
> > Right, with memory all cgroups over high get penalized with the
> > extra reclaim work. For swap we just have the delay, so the event is
> > associated with the worst offender, anything lower didn't really
> > matter.
> > 
> > But it's easy enough to change if you prefer. Otherwise I'll just
> > add this to the commit message:
> > 
> >   Count swap over high events. Note that unlike memory over high
> > events we only count them for the worst offender. This is because
> > the delay penalties for both swap and memory over high are not
> > cumulative, i.e. we use the max delay.  
> 
> Well, memory high penalty is in fact cumulative, because the reclaim
> would happen for each memcg subtree up the hierarchy. Sure the
> additional throttling is not cumulative but that is not really that
> important because the exact amount of throttling is an implementation
> detail. The swap high is an odd one here because we do not reclaim
> swap so the cumulative effect of that is 0 and there is only the
> additional throttling happening. I suspect that your current
> implementation is exposing an internal implementation to the
> userspace but considering how the current memory high event is
> documented high
>                 The number of times processes of the cgroup are
>                 throttled and routed to perform direct memory reclaim
>                 because the high memory boundary was exceeded.  For a
>                 cgroup whose memory usage is capped by the high limit
>                 rather than global memory pressure, this event's
>                 occurrences are expected.
> 
> it talks about throttling rather than excess (like max) so I am not
> really sure. I believe that it would be much better if both events
> were more explicit about counting an excess and a throttling is just
> a side effect of that situation.
> 
> I do not expect that we will have any form of the swap reclaim anytime
> soon (if ever) but I fail to see why to create a small little trap like
> this now.

Right, let me adjust then.

> > > I would also suggest to explain or ideally even separate the swap
> > > penalty scaling logic to a separate patch. What kind of data is it
> > > based on?
> > 
> > It's a hard thing to get production data for since, as we mentioned
> > we don't expect the limit to be hit. It was more of a process of
> > experimentation and finding a gradual slope that "felt right"...
> > 
> > Is there a more scientific process we can follow here? We want the
> > delay to be small at first for a first few pages and then grow to
> > make sure we stop the task from going too much over high. The square
> > function works pretty well IMHO.  
> 
> If there is no data to showing this to be an improvement then I would
> just not add an additional scaling factor. Why? Mostly because once we
> have it there it would be extremely hard to change. MM is full of
> these little heuristics that are copied over because nobody dares to
> touch them. If a different scaling is really needed it can always be
> added later with some data to back that.

Oh, I misunderstood the question, you were asking about the scaling
factor. The allocation of swap is in larger batches, according to
my tests, example below (AR = after reclaim, swap overage changes
after memory reclaim).
                                    mem overage AR
     swap pages over_high AR        |    swap overage AR
 swap pages over at call.   \       |    |      . mem sleep
   mem pages over_high.  \   \      |    |     /  . swap sleep
                       v  v   v     v    v    v  v
 [   73.360533] sleep (32/10->67) [-35|13379] 0+253
 [   73.631291] sleep (32/ 3->54) [-18|13430] 0+205
 [   73.851629] sleep (32/22->35) [-20|13443] 0+133
 [   74.021396] sleep (32/ 3->60) [-29|13500] 0+230
 [   74.263355] sleep (32/28->79) [-44|13551] 0+306
 [   74.585689] sleep (32/29->91) [-17|13627] 0+355
 [   74.958675] sleep (32/27->79) [-31|13679] 0+311
 [   75.293021] sleep (32/29->86) [ -9|13750] 0+344
 [   75.654218] sleep (32/22->72) [-24|13800] 0+290
 [   75.962467] sleep (32/22->73) [-39|13865] 0+296

That's for a process slowly leaking memory. Swap gets over the high by
about 2.5x MEMCG_CHARGE_BATCH on average. Hence to keep the same slope
I was trying to scale it back.

But you make a fair point, someone more knowledgeable can add the
heuristic later if it's really needed.



* Re: [PATCH mm v2 3/3] mm: automatically penalize tasks with high swap use
  2020-05-13 18:36         ` Jakub Kicinski
@ 2020-05-14  7:42           ` Michal Hocko
  2020-05-14 20:21             ` Johannes Weiner
  0 siblings, 1 reply; 15+ messages in thread
From: Michal Hocko @ 2020-05-14  7:42 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: akpm, linux-mm, kernel-team, tj, hannes, chris, cgroups, shakeelb

On Wed 13-05-20 11:36:23, Jakub Kicinski wrote:
> On Wed, 13 May 2020 10:32:49 +0200 Michal Hocko wrote:
> > On Tue 12-05-20 10:55:36, Jakub Kicinski wrote:
> > > On Tue, 12 May 2020 09:26:34 +0200 Michal Hocko wrote:  
> > > > On Mon 11-05-20 15:55:16, Jakub Kicinski wrote:  
> > > > > Use swap.high when deciding if swap is full.    
> > > > 
> > > > Please be more specific why.  
> > > 
> > > How about:
> > > 
> > >     Use swap.high when deciding if swap is full to influence ongoing
> > >     swap reclaim in a best effort manner.  
> > 
> > This is still way too vague. The crux is why should we treat hard and
> > high swap limit the same for mem_cgroup_swap_full purpose. Please
> > note that I am not saying this is wrong. I am asking for a more
> > detailed explanation mostly because I would bet that somebody
> > stumbles over this sooner or later.
> 
> Stumbles in what way?

Reading the code and trying to understand why this particular decision
has been made. Because it might be surprising that the hard and high
limits are treated the same here.

> Isn't it expected for the kernel to take reasonable precautions to
> avoid hitting limits?

Isn't the throttling itself the precaution? How does the swap cache
and its control via mem_cgroup_swap_full interact here. See? This is
what I am asking to have explained in the changelog.

[...]
> > > > I would also suggest to explain or ideally even separate the swap
> > > > penalty scaling logic to a separate patch. What kind of data is it
> > > > based on?
> > > 
> > > It's a hard thing to get production data for since, as we mentioned
> > > we don't expect the limit to be hit. It was more of a process of
> > > experimentation and finding a gradual slope that "felt right"...
> > > 
> > > Is there a more scientific process we can follow here? We want the
> > > delay to be small at first for a first few pages and then grow to
> > > make sure we stop the task from going too much over high. The square
> > > function works pretty well IMHO.  
> > 
> > If there is no data to showing this to be an improvement then I would
> > just not add an additional scaling factor. Why? Mostly because once we
> > have it there it would be extremely hard to change. MM is full of
> > these little heuristics that are copied over because nobody dares to
> > touch them. If a different scaling is really needed it can always be
> > added later with some data to back that.
> 
> Oh, I misunderstood the question, you were asking about the scaling
> factor. The allocation of swap is in larger batches, according to
> my tests, example below (AR - after reclaim, swap overage changes 
> after memory reclaim). 
>                                     mem overage AR
>      swap pages over_high AR        |    swap overage AR
>  swap pages over at call.   \       |    |      . mem sleep
>    mem pages over_high.  \   \      |    |     /  . swap sleep
>                        v  v   v     v    v    v  v
>  [   73.360533] sleep (32/10->67) [-35|13379] 0+253
>  [   73.631291] sleep (32/ 3->54) [-18|13430] 0+205
>  [   73.851629] sleep (32/22->35) [-20|13443] 0+133
>  [   74.021396] sleep (32/ 3->60) [-29|13500] 0+230
>  [   74.263355] sleep (32/28->79) [-44|13551] 0+306
>  [   74.585689] sleep (32/29->91) [-17|13627] 0+355
>  [   74.958675] sleep (32/27->79) [-31|13679] 0+311
>  [   75.293021] sleep (32/29->86) [ -9|13750] 0+344
>  [   75.654218] sleep (32/22->72) [-24|13800] 0+290
>  [   75.962467] sleep (32/22->73) [-39|13865] 0+296
> 
> That's for a process slowly leaking memory. Swap gets over the high by
> about 2.5x MEMCG_CHARGE_BATCH on average. Hence to keep the same slope
> I was trying to scale it back.
> 
> But you make a fair point, someone more knowledgeable can add the
> heuristic later if it's really needed.

Or just make it a separate patch with all that information. This would
allow anybody touching that code in the future to understand the initial
motivation.

I am still not sure this scaling is a good fit in general (e.g. how does
it work with THP swapping?) though but this can be discussed separately
at least.

-- 
Michal Hocko
SUSE Labs



* Re: [PATCH mm v2 3/3] mm: automatically penalize tasks with high swap use
  2020-05-14  7:42           ` Michal Hocko
@ 2020-05-14 20:21             ` Johannes Weiner
  2020-05-15  7:14               ` Michal Hocko
  0 siblings, 1 reply; 15+ messages in thread
From: Johannes Weiner @ 2020-05-14 20:21 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Jakub Kicinski, akpm, linux-mm, kernel-team, tj, chris, cgroups,
	shakeelb

On Thu, May 14, 2020 at 09:42:46AM +0200, Michal Hocko wrote:
> On Wed 13-05-20 11:36:23, Jakub Kicinski wrote:
> > On Wed, 13 May 2020 10:32:49 +0200 Michal Hocko wrote:
> > > On Tue 12-05-20 10:55:36, Jakub Kicinski wrote:
> > > > On Tue, 12 May 2020 09:26:34 +0200 Michal Hocko wrote:  
> > > > > On Mon 11-05-20 15:55:16, Jakub Kicinski wrote:  
> > > > > > Use swap.high when deciding if swap is full.    
> > > > > 
> > > > > Please be more specific why.  
> > > > 
> > > > How about:
> > > > 
> > > >     Use swap.high when deciding if swap is full to influence ongoing
> > > >     swap reclaim in a best effort manner.  
> > > 
> > > This is still way too vague. The crux is why should we treat hard and
> > > high swap limit the same for mem_cgroup_swap_full purpose. Please
> > > note that I am not saying this is wrong. I am asking for a more
> > > detailed explanation mostly because I would bet that somebody
> > > stumbles over this sooner or later.
> > 
> > Stumbles in what way?
> 
> Reading the code and trying to understand why this particular decision
> has been made. Because it might be surprising that the hard and high
> limits are treated the same here.

I don't quite understand the controversy.

The idea behind "swap full" is that as long as the workload has plenty
of swap space available and it's not changing its memory contents, it
makes sense to generously hold on to copies of data in the swap
device, even after the swapin. A later reclaim cycle can drop the page
without any IO. Trading disk space for IO.

But the only two ways to reclaim swap slots are when they're faulted
in and the references go away, or by scanning the virtual address space
like swapoff does - which is very expensive (one could argue it's too
expensive even for swapoff, it's often more practical to just reboot).

So at some point in the fill level, we have to start freeing up swap
slots on fault/swapin. Otherwise we could eventually run out of swap
slots while they're filled with copies of data that is also in RAM.

We don't want to OOM a workload because its available swap space is
filled with redundant cache.

That applies to physical swap limits, swap.max, and naturally also to
swap.high which is a limit to implement userspace OOM for swap space
exhaustion.

> > Isn't it expected for the kernel to take reasonable precautions to
> > avoid hitting limits?
> 
> Isn't the throttling itself the precaution? How does the swap cache
> and its control via mem_cgroup_swap_full interact here. See? This is
> what I am asking to have explained in the changelog.

It sounds like we need better documentation of what vm_swap_full() and
friends are there for. It should have been obvious why swap.high - a
limit on available swap space - hooks into it.



* Re: [PATCH mm v2 3/3] mm: automatically penalize tasks with high swap use
  2020-05-14 20:21             ` Johannes Weiner
@ 2020-05-15  7:14               ` Michal Hocko
  0 siblings, 0 replies; 15+ messages in thread
From: Michal Hocko @ 2020-05-15  7:14 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Jakub Kicinski, akpm, linux-mm, kernel-team, tj, chris, cgroups,
	shakeelb

On Thu 14-05-20 16:21:30, Johannes Weiner wrote:
> On Thu, May 14, 2020 at 09:42:46AM +0200, Michal Hocko wrote:
> > On Wed 13-05-20 11:36:23, Jakub Kicinski wrote:
> > > On Wed, 13 May 2020 10:32:49 +0200 Michal Hocko wrote:
> > > > On Tue 12-05-20 10:55:36, Jakub Kicinski wrote:
> > > > > On Tue, 12 May 2020 09:26:34 +0200 Michal Hocko wrote:  
> > > > > > On Mon 11-05-20 15:55:16, Jakub Kicinski wrote:  
> > > > > > > Use swap.high when deciding if swap is full.    
> > > > > > 
> > > > > > Please be more specific why.  
> > > > > 
> > > > > How about:
> > > > > 
> > > > >     Use swap.high when deciding if swap is full to influence ongoing
> > > > >     swap reclaim in a best effort manner.  
> > > > 
> > > > This is still way too vague. The crux is why should we treat hard and
> > > > high swap limit the same for mem_cgroup_swap_full purpose. Please
> > > > note that I am not saying this is wrong. I am asking for a more
> > > > detailed explanation mostly because I would bet that somebody
> > > > stumbles over this sooner or later.
> > > 
> > > Stumbles in what way?
> > 
> > Reading the code and trying to understand why this particular decision
> > has been made. Because it might be surprising that the hard and high
> > limits are treated the same here.
> 
> I don't quite understand the controversy.

I do not think there is any controversy. All I am asking for is a
clarification because this is non-intuitive.
 
> The idea behind "swap full" is that as long as the workload has plenty
> of swap space available and it's not changing its memory contents, it
> makes sense to generously hold on to copies of data in the swap
> device, even after the swapin. A later reclaim cycle can drop the page
> without any IO. Trading disk space for IO.
> 
> But the only two ways to reclaim a swap slot are when the page is
> faulted in and the references go away, or by scanning the virtual
> address space as swapoff does - which is very expensive (one could
> argue it's too expensive even for swapoff; it's often more practical
> to just reboot).
> 
> So at some point in the fill level, we have to start freeing up swap
> slots on fault/swapin. Otherwise we could eventually run out of swap
> slots while they're filled with copies of data that is also in RAM.
> 
> We don't want to OOM a workload because its available swap space is
> filled with redundant cache.

Thanks, this is a useful summary.
 
> That applies to physical swap limits, swap.max, and naturally also to
> swap.high which is a limit to implement userspace OOM for swap space
> exhaustion.
> 
> > > Isn't it expected for the kernel to take reasonable precautions to
> > > avoid hitting limits?
> > 
> > Isn't the throttling itself the precaution? How do the swap cache
> > and its control via mem_cgroup_swap_full interact here? See? This is
> > what I am asking to have explained in the changelog.
> 
> It sounds like we need better documentation of what vm_swap_full() and
> friends are there for. It should have been obvious why swap.high - a
> limit on available swap space - hooks into it.

Agreed. The primary source of confusion is the naming here: vm_swap_full
doesn't really try to tell that swap is full. It merely signals that
swap is getting full, and so duplicated data should be dropped.

-- 
Michal Hocko
SUSE Labs



end of thread, other threads:[~2020-05-15  7:15 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-05-11 22:55 [PATCH mm v2 0/3] memcg: Slow down swap allocation as the available space gets depleted Jakub Kicinski
2020-05-11 22:55 ` [PATCH mm v2 1/3] mm: prepare for swap over-high accounting and penalty calculation Jakub Kicinski
2020-05-12  7:08   ` Michal Hocko
2020-05-12 17:28     ` Jakub Kicinski
2020-05-13  8:06       ` Michal Hocko
2020-05-11 22:55 ` [PATCH mm v2 2/3] mm: move penalty delay clamping out of calculate_high_delay() Jakub Kicinski
2020-05-11 22:55 ` [PATCH mm v2 3/3] mm: automatically penalize tasks with high swap use Jakub Kicinski
2020-05-12  7:26   ` Michal Hocko
2020-05-12 17:55     ` Jakub Kicinski
2020-05-13  8:32       ` Michal Hocko
2020-05-13 18:36         ` Jakub Kicinski
2020-05-14  7:42           ` Michal Hocko
2020-05-14 20:21             ` Johannes Weiner
2020-05-15  7:14               ` Michal Hocko
2020-05-13  8:38   ` Michal Hocko
