* [PATCH 0/4] memcg: robust enforcement of memory.high
From: Shakeel Butt @ 2022-02-10  8:14 UTC
  To: Johannes Weiner, Michal Hocko, Roman Gushchin
  Cc: Chris Down, Andrew Morton, cgroups, linux-mm, linux-kernel, Shakeel Butt

Due to the semantics of memory.high enforcement, i.e. throttling the
workload without oom-killing it, we are trying to use it for
right-sizing the workloads in our production environment. However, we
observed that the mechanism fails for some applications which do big
chunks of allocations in a single syscall. This failure is due to a
limitation of the current memory.high enforcement implementation. This
patch series solves the issue by enforcing memory.high synchronously,
making it more robust.
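
For illustration, a minimal userspace sketch of the kind of
single-syscall allocation that defeats the current enforcement (the
cgroup configuration, sizes and rlimit handling are illustrative and
not part of this series):

/*
 * Assumes the process has already been moved into a cgroup with e.g.
 * memory.high=30M and memory.max=140M, and that RLIMIT_MEMLOCK (or
 * CAP_IPC_LOCK) permits the mlock().  The whole 200M is charged inside
 * the single mlock() syscall, so the return-to-userspace throttling
 * never gets a chance to run before the charge path hits memory.max.
 */
#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
	size_t size = 200UL << 20;	/* 200M, well above memory.high */
	void *buf;

	buf = mmap(NULL, size, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	if (mlock(buf, size))	/* faults in and charges 200M at once */
		perror("mlock");

	munmap(buf, size);
	return 0;
}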

Shakeel Butt (4):
  memcg: refactor mem_cgroup_oom
  memcg: unify force charging conditions
  selftests: memcg: test high limit for single entry allocation
  memcg: synchronously enforce memory.high

 include/linux/page_counter.h                  |  10 +
 mm/memcontrol.c                               | 175 ++++++++++--------
 mm/page_counter.c                             |  59 ++++--
 tools/testing/selftests/cgroup/cgroup_util.c  |  15 +-
 tools/testing/selftests/cgroup/cgroup_util.h  |   1 +
 .../selftests/cgroup/test_memcontrol.c        |  78 ++++++++
 6 files changed, 240 insertions(+), 98 deletions(-)

-- 
2.35.1.265.g69c8d7142f-goog



* [PATCH 1/4] memcg: refactor mem_cgroup_oom
From: Shakeel Butt @ 2022-02-10  8:14 UTC
  To: Johannes Weiner, Michal Hocko, Roman Gushchin
  Cc: Chris Down, Andrew Morton, cgroups, linux-mm, linux-kernel, Shakeel Butt

The function mem_cgroup_oom returns an enum which has four possible
values, but the caller does not care about those values and only cares
whether the return value is OOM_SUCCESS or not. So, remove the enum
altogether and make mem_cgroup_oom return a simple bool.

Signed-off-by: Shakeel Butt <shakeelb@google.com>
---
 mm/memcontrol.c | 40 +++++++++++++---------------------------
 1 file changed, 13 insertions(+), 27 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index a0e9d9f12cf5..c40c27822802 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1795,20 +1795,12 @@ static void memcg_oom_recover(struct mem_cgroup *memcg)
 		__wake_up(&memcg_oom_waitq, TASK_NORMAL, 0, memcg);
 }
 
-enum oom_status {
-	OOM_SUCCESS,
-	OOM_FAILED,
-	OOM_ASYNC,
-	OOM_SKIPPED
-};
-
-static enum oom_status mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int order)
+static bool mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int order)
 {
-	enum oom_status ret;
-	bool locked;
+	bool locked, ret = false;
 
 	if (order > PAGE_ALLOC_COSTLY_ORDER)
-		return OOM_SKIPPED;
+		return ret;
 
 	memcg_memory_event(memcg, MEMCG_OOM);
 
@@ -1831,14 +1823,13 @@ static enum oom_status mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int
 	 * victim and then we have to bail out from the charge path.
 	 */
 	if (memcg->oom_kill_disable) {
-		if (!current->in_user_fault)
-			return OOM_SKIPPED;
-		css_get(&memcg->css);
-		current->memcg_in_oom = memcg;
-		current->memcg_oom_gfp_mask = mask;
-		current->memcg_oom_order = order;
-
-		return OOM_ASYNC;
+		if (current->in_user_fault) {
+			css_get(&memcg->css);
+			current->memcg_in_oom = memcg;
+			current->memcg_oom_gfp_mask = mask;
+			current->memcg_oom_order = order;
+		}
+		return ret;
 	}
 
 	mem_cgroup_mark_under_oom(memcg);
@@ -1849,10 +1840,7 @@ static enum oom_status mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int
 		mem_cgroup_oom_notify(memcg);
 
 	mem_cgroup_unmark_under_oom(memcg);
-	if (mem_cgroup_out_of_memory(memcg, mask, order))
-		ret = OOM_SUCCESS;
-	else
-		ret = OOM_FAILED;
+	ret = mem_cgroup_out_of_memory(memcg, mask, order);
 
 	if (locked)
 		mem_cgroup_oom_unlock(memcg);
@@ -2545,7 +2533,6 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	int nr_retries = MAX_RECLAIM_RETRIES;
 	struct mem_cgroup *mem_over_limit;
 	struct page_counter *counter;
-	enum oom_status oom_status;
 	unsigned long nr_reclaimed;
 	bool passed_oom = false;
 	bool may_swap = true;
@@ -2648,9 +2635,8 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	 * a forward progress or bypass the charge if the oom killer
 	 * couldn't make any progress.
 	 */
-	oom_status = mem_cgroup_oom(mem_over_limit, gfp_mask,
-		       get_order(nr_pages * PAGE_SIZE));
-	if (oom_status == OOM_SUCCESS) {
+	if (mem_cgroup_oom(mem_over_limit, gfp_mask,
+			   get_order(nr_pages * PAGE_SIZE))) {
 		passed_oom = true;
 		nr_retries = MAX_RECLAIM_RETRIES;
 		goto retry;
-- 
2.35.1.265.g69c8d7142f-goog



* [PATCH 2/4] memcg: unify force charging conditions
From: Shakeel Butt @ 2022-02-10  8:14 UTC
  To: Johannes Weiner, Michal Hocko, Roman Gushchin
  Cc: Chris Down, Andrew Morton, cgroups, linux-mm, linux-kernel, Shakeel Butt

Currently the kernel force charges allocations which have the
__GFP_HIGH flag set, without triggering memory reclaim. __GFP_HIGH
indicates that the caller is high priority, and since commit
869712fd3de5 ("mm: memcontrol: fix network errors from failing
__GFP_ATOMIC charges") the kernel lets such allocations do force
charging. Please note that __GFP_ATOMIC has been replaced by
__GFP_HIGH.

__GFP_HIGH does not tell whether the caller can block or trigger
reclaim; there are separate checks for that. So there is no need to
skip reclaim for __GFP_HIGH allocations. Instead, handle __GFP_HIGH
together with __GFP_NOFAIL, which also does force charging.

Please note that this is a noop change, as there are currently no
__GFP_HIGH allocators in the kernel which also use __GFP_ACCOUNT (or
SLAB_ACCOUNT) and do not allow reclaim. The reason for this patch is to
simplify the reasoning in the following patches.

Signed-off-by: Shakeel Butt <shakeelb@google.com>
---
 mm/memcontrol.c | 17 +++++++----------
 1 file changed, 7 insertions(+), 10 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index c40c27822802..ae73a40818b0 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2560,15 +2560,6 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
 		goto retry;
 	}
 
-	/*
-	 * Memcg doesn't have a dedicated reserve for atomic
-	 * allocations. But like the global atomic pool, we need to
-	 * put the burden of reclaim on regular allocation requests
-	 * and let these go through as privileged allocations.
-	 */
-	if (gfp_mask & __GFP_HIGH)
-		goto force;
-
 	/*
 	 * Prevent unbounded recursion when reclaim operations need to
 	 * allocate memory. This might exceed the limits temporarily,
@@ -2642,7 +2633,13 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
 		goto retry;
 	}
 nomem:
-	if (!(gfp_mask & __GFP_NOFAIL))
+	/*
+	 * Memcg doesn't have a dedicated reserve for atomic
+	 * allocations. But like the global atomic pool, we need to
+	 * put the burden of reclaim on regular allocation requests
+	 * and let these go through as privileged allocations.
+	 */
+	if (!(gfp_mask & (__GFP_NOFAIL | __GFP_HIGH)))
 		return -ENOMEM;
 force:
 	/*
-- 
2.35.1.265.g69c8d7142f-goog



* [PATCH 3/4] selftests: memcg: test high limit for single entry allocation
From: Shakeel Butt @ 2022-02-10  8:14 UTC
  To: Johannes Weiner, Michal Hocko, Roman Gushchin
  Cc: Chris Down, Andrew Morton, cgroups, linux-mm, linux-kernel, Shakeel Butt

Test the enforcement of the memory.high limit for a large amount of
memory allocated within a single kernel entry. There are valid
use-cases where an application can trigger a large amount of memory
allocation within a single syscall, e.g. mlock() or mmap(MAP_POPULATE).
Make sure memory.high limit enforcement works for such use-cases.

Signed-off-by: Shakeel Butt <shakeelb@google.com>
---
 tools/testing/selftests/cgroup/cgroup_util.c  | 15 +++-
 tools/testing/selftests/cgroup/cgroup_util.h  |  1 +
 .../selftests/cgroup/test_memcontrol.c        | 78 +++++++++++++++++++
 3 files changed, 91 insertions(+), 3 deletions(-)

diff --git a/tools/testing/selftests/cgroup/cgroup_util.c b/tools/testing/selftests/cgroup/cgroup_util.c
index 0cf7e90c0052..dbaa7aabbb4a 100644
--- a/tools/testing/selftests/cgroup/cgroup_util.c
+++ b/tools/testing/selftests/cgroup/cgroup_util.c
@@ -583,7 +583,7 @@ int clone_into_cgroup_run_wait(const char *cgroup)
 	return 0;
 }
 
-int cg_prepare_for_wait(const char *cgroup)
+static int __prepare_for_wait(const char *cgroup, const char *filename)
 {
 	int fd, ret = -1;
 
@@ -591,8 +591,7 @@ int cg_prepare_for_wait(const char *cgroup)
 	if (fd == -1)
 		return fd;
 
-	ret = inotify_add_watch(fd, cg_control(cgroup, "cgroup.events"),
-				IN_MODIFY);
+	ret = inotify_add_watch(fd, cg_control(cgroup, filename), IN_MODIFY);
 	if (ret == -1) {
 		close(fd);
 		fd = -1;
@@ -601,6 +600,16 @@ int cg_prepare_for_wait(const char *cgroup)
 	return fd;
 }
 
+int cg_prepare_for_wait(const char *cgroup)
+{
+	return __prepare_for_wait(cgroup, "cgroup.events");
+}
+
+int memcg_prepare_for_wait(const char *cgroup)
+{
+	return __prepare_for_wait(cgroup, "memory.events");
+}
+
 int cg_wait_for(int fd)
 {
 	int ret = -1;
diff --git a/tools/testing/selftests/cgroup/cgroup_util.h b/tools/testing/selftests/cgroup/cgroup_util.h
index 4f66d10626d2..628738532ac9 100644
--- a/tools/testing/selftests/cgroup/cgroup_util.h
+++ b/tools/testing/selftests/cgroup/cgroup_util.h
@@ -55,4 +55,5 @@ extern int clone_reap(pid_t pid, int options);
 extern int clone_into_cgroup_run_wait(const char *cgroup);
 extern int dirfd_open_opath(const char *dir);
 extern int cg_prepare_for_wait(const char *cgroup);
+extern int memcg_prepare_for_wait(const char *cgroup);
 extern int cg_wait_for(int fd);
diff --git a/tools/testing/selftests/cgroup/test_memcontrol.c b/tools/testing/selftests/cgroup/test_memcontrol.c
index c19a97dd02d4..36ccf2322e21 100644
--- a/tools/testing/selftests/cgroup/test_memcontrol.c
+++ b/tools/testing/selftests/cgroup/test_memcontrol.c
@@ -16,6 +16,7 @@
 #include <netinet/in.h>
 #include <netdb.h>
 #include <errno.h>
+#include <sys/mman.h>
 
 #include "../kselftest.h"
 #include "cgroup_util.h"
@@ -628,6 +629,82 @@ static int test_memcg_high(const char *root)
 	return ret;
 }
 
+static int alloc_anon_mlock(const char *cgroup, void *arg)
+{
+	size_t size = (size_t)arg;
+	void *buf;
+
+	buf = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANON,
+		   0, 0);
+	if (buf == MAP_FAILED)
+		return -1;
+
+	mlock(buf, size);
+	munmap(buf, size);
+	return 0;
+}
+
+/*
+ * This test checks that memory.high is able to throttle a big single-shot
+ * allocation, i.e. a large allocation within one kernel entry.
+ */
+static int test_memcg_high_sync(const char *root)
+{
+	int ret = KSFT_FAIL, pid, fd = -1;
+	char *memcg;
+	long pre_high, pre_max;
+	long post_high, post_max;
+
+	memcg = cg_name(root, "memcg_test");
+	if (!memcg)
+		goto cleanup;
+
+	if (cg_create(memcg))
+		goto cleanup;
+
+	pre_high = cg_read_key_long(memcg, "memory.events", "high ");
+	pre_max = cg_read_key_long(memcg, "memory.events", "max ");
+	if (pre_high < 0 || pre_max < 0)
+		goto cleanup;
+
+	if (cg_write(memcg, "memory.swap.max", "0"))
+		goto cleanup;
+
+	if (cg_write(memcg, "memory.high", "30M"))
+		goto cleanup;
+
+	if (cg_write(memcg, "memory.max", "140M"))
+		goto cleanup;
+
+	fd = memcg_prepare_for_wait(memcg);
+	if (fd < 0)
+		goto cleanup;
+
+	pid = cg_run_nowait(memcg, alloc_anon_mlock, (void *)MB(200));
+	if (pid < 0)
+		goto cleanup;
+
+	cg_wait_for(fd);
+
+	post_high = cg_read_key_long(memcg, "memory.events", "high ");
+	post_max = cg_read_key_long(memcg, "memory.events", "max ");
+	if (post_high < 0 || post_max < 0)
+		goto cleanup;
+
+	if (pre_high == post_high || pre_max != post_max)
+		goto cleanup;
+
+	ret = KSFT_PASS;
+
+cleanup:
+	if (fd >= 0)
+		close(fd);
+	cg_destroy(memcg);
+	free(memcg);
+
+	return ret;
+}
+
 /*
  * This test checks that memory.max limits the amount of
  * memory which can be consumed by either anonymous memory
@@ -1180,6 +1257,7 @@ struct memcg_test {
 	T(test_memcg_min),
 	T(test_memcg_low),
 	T(test_memcg_high),
+	T(test_memcg_high_sync),
 	T(test_memcg_max),
 	T(test_memcg_oom_events),
 	T(test_memcg_swap_max),
-- 
2.35.1.265.g69c8d7142f-goog



* [PATCH 4/4] memcg: synchronously enforce memory.high
From: Shakeel Butt @ 2022-02-10  8:14 UTC
  To: Johannes Weiner, Michal Hocko, Roman Gushchin
  Cc: Chris Down, Andrew Morton, cgroups, linux-mm, linux-kernel, Shakeel Butt

The high limit is used to throttle a workload without invoking the
oom-killer. Recently we tried to use the high limit to right-size our
internal workloads, more specifically to dynamically adjust the limits
of a workload without letting it get oom-killed. However, due to a
limitation in the implementation of high limit enforcement, we observed
that the mechanism fails for some real workloads.

The high limit is enforced on return-to-userspace, i.e. the kernel lets
the usage go over the limit and, when execution returns to userspace,
high reclaim is triggered and the process can get throttled as well.
However, this mechanism fails for workloads which do large allocations
in a single kernel entry, e.g. applications that mlock() a large chunk
of memory in a single syscall. Such applications bypass the high limit
and can trigger the oom-killer.

To make high limit enforcement more robust, this patch makes the limit
enforcement synchronous. However, there are a couple of open questions
about enforcing the high limit synchronously. What should be the
behavior of a __GFP_NORETRY allocation on hitting the high limit? A
similar question arises for allocations which do not allow blocking.
This patch takes the approach of keeping the previous behavior, i.e.
such allocations are not throttled synchronously but rely on the
return-to-userspace mechanism to throttle the processes triggering
them.

This patch does not remove the return-to-userspace high limit
enforcement, for the reason mentioned in the previous paragraph. In
addition, allocations where the memory usage is below the high limit
but the swap usage is above swap's high limit are still throttled on
return-to-userspace.
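
In a nutshell, the charge path changes below boil down to the following
(a condensed sketch, not the literal control flow; the retry loop,
charge batching, memsw accounting, psi annotations and the various GFP
special cases are omitted):

	status = page_counter_try_charge_high(&memcg->memory, batch, &counter);
	if (status == CHG_SUCCESS)
		goto done_restock;	/* below memory.high, same as before */

	/* reclaim is attempted for both a high and a max breach */
	memcg_memory_event(mem_over_limit,
			   status == CHG_FAILED_MAX ? MEMCG_MAX : MEMCG_HIGH);
	try_to_free_mem_cgroup_pages(mem_over_limit, nr_pages, gfp_mask, may_swap);

	if (status == CHG_FAILED_MAX) {
		/* memory.max breached: existing behavior, which may end up
		 * invoking the memcg oom-killer.
		 */
		mem_cgroup_oom(mem_over_limit, gfp_mask,
			       get_order(nr_pages * PAGE_SIZE));
	} else {
		/* memory.high breached: throttle the allocator synchronously
		 * and then let the charge go through, since memory.high is
		 * not a hard cap.
		 */
		penalty_jiffies = calculate_penalty_jiffies(memcg, nr_pages);
		if (penalty_jiffies)
			schedule_timeout_killable(penalty_jiffies);
	}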

Signed-off-by: Shakeel Butt <shakeelb@google.com>
---
 include/linux/page_counter.h |  10 +++
 mm/memcontrol.c              | 124 ++++++++++++++++++++++-------------
 mm/page_counter.c            |  59 +++++++++++++----
 3 files changed, 132 insertions(+), 61 deletions(-)

diff --git a/include/linux/page_counter.h b/include/linux/page_counter.h
index 679591301994..08413a5c73f9 100644
--- a/include/linux/page_counter.h
+++ b/include/linux/page_counter.h
@@ -60,6 +60,16 @@ void page_counter_charge(struct page_counter *counter, unsigned long nr_pages);
 bool page_counter_try_charge(struct page_counter *counter,
 			     unsigned long nr_pages,
 			     struct page_counter **fail);
+
+enum charge_status {
+	CHG_SUCCESS,
+	CHG_FAILED_HIGH,
+	CHG_FAILED_MAX
+};
+enum charge_status page_counter_try_charge_high(struct page_counter *counter,
+						unsigned long nr_pages,
+						struct page_counter **fail);
+
 void page_counter_uncharge(struct page_counter *counter, unsigned long nr_pages);
 void page_counter_set_min(struct page_counter *counter, unsigned long nr_pages);
 void page_counter_set_low(struct page_counter *counter, unsigned long nr_pages);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index ae73a40818b0..97833cade59e 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1290,18 +1290,20 @@ void mem_cgroup_update_lru_size(struct lruvec *lruvec, enum lru_list lru,
 /**
  * mem_cgroup_margin - calculate chargeable space of a memory cgroup
  * @memcg: the memory cgroup
+ * @high: check high limit instead of max
  *
  * Returns the maximum amount of memory @mem can be charged with, in
  * pages.
  */
-static unsigned long mem_cgroup_margin(struct mem_cgroup *memcg)
+static unsigned long mem_cgroup_margin(struct mem_cgroup *memcg, bool high)
 {
 	unsigned long margin = 0;
 	unsigned long count;
 	unsigned long limit;
 
 	count = page_counter_read(&memcg->memory);
-	limit = READ_ONCE(memcg->memory.max);
+	limit = high ? READ_ONCE(memcg->memory.high) :
+			READ_ONCE(memcg->memory.max);
 	if (count < limit)
 		margin = limit - count;
 
@@ -1607,7 +1609,7 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	if (mutex_lock_killable(&oom_lock))
 		return true;
 
-	if (mem_cgroup_margin(memcg) >= (1 << order))
+	if (mem_cgroup_margin(memcg, false) >= (1 << order))
 		goto unlock;
 
 	/*
@@ -2443,6 +2445,39 @@ static unsigned long calculate_high_delay(struct mem_cgroup *memcg,
 	return penalty_jiffies * nr_pages / MEMCG_CHARGE_BATCH;
 }
 
+static unsigned long calculate_penalty_jiffies(struct mem_cgroup *memcg,
+					       unsigned long nr_pages)
+{
+	unsigned long penalty_jiffies;
+
+	/*
+	 * memory.high is breached and reclaim is unable to keep up. Throttle
+	 * allocators proactively to slow down excessive growth.
+	 */
+	penalty_jiffies = calculate_high_delay(memcg, nr_pages,
+					       mem_find_max_overage(memcg));
+
+	penalty_jiffies += calculate_high_delay(memcg, nr_pages,
+						swap_find_max_overage(memcg));
+
+	/*
+	 * Clamp the max delay per usermode return so as to still keep the
+	 * application moving forwards and also permit diagnostics, albeit
+	 * extremely slowly.
+	 */
+	penalty_jiffies = min(penalty_jiffies, MEMCG_MAX_HIGH_DELAY_JIFFIES);
+
+	/*
+	 * Don't sleep if the amount of jiffies this memcg owes us is so low
+	 * that it's not even worth doing, in an attempt to be nice to those who
+	 * go only a small amount over their memory.high value and maybe haven't
+	 * been aggressively reclaimed enough yet.
+	 */
+	if (penalty_jiffies <= HZ / 100)
+		return 0;
+	return penalty_jiffies;
+}
+
 /*
  * Scheduled by try_charge() to be executed from the userland return path
  * and reclaims memory over the high limit.
@@ -2477,30 +2512,8 @@ void mem_cgroup_handle_over_high(void)
 				    in_retry ? SWAP_CLUSTER_MAX : nr_pages,
 				    GFP_KERNEL);
 
-	/*
-	 * memory.high is breached and reclaim is unable to keep up. Throttle
-	 * allocators proactively to slow down excessive growth.
-	 */
-	penalty_jiffies = calculate_high_delay(memcg, nr_pages,
-					       mem_find_max_overage(memcg));
-
-	penalty_jiffies += calculate_high_delay(memcg, nr_pages,
-						swap_find_max_overage(memcg));
-
-	/*
-	 * Clamp the max delay per usermode return so as to still keep the
-	 * application moving forwards and also permit diagnostics, albeit
-	 * extremely slowly.
-	 */
-	penalty_jiffies = min(penalty_jiffies, MEMCG_MAX_HIGH_DELAY_JIFFIES);
-
-	/*
-	 * Don't sleep if the amount of jiffies this memcg owes us is so low
-	 * that it's not even worth doing, in an attempt to be nice to those who
-	 * go only a small amount over their memory.high value and maybe haven't
-	 * been aggressively reclaimed enough yet.
-	 */
-	if (penalty_jiffies <= HZ / 100)
+	penalty_jiffies = calculate_penalty_jiffies(memcg, nr_pages);
+	if (!penalty_jiffies)
 		goto out;
 
 	/*
@@ -2534,6 +2547,7 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	struct mem_cgroup *mem_over_limit;
 	struct page_counter *counter;
 	unsigned long nr_reclaimed;
+	enum charge_status status;
 	bool passed_oom = false;
 	bool may_swap = true;
 	bool drained = false;
@@ -2545,7 +2559,9 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
 
 	if (!do_memsw_account() ||
 	    page_counter_try_charge(&memcg->memsw, batch, &counter)) {
-		if (page_counter_try_charge(&memcg->memory, batch, &counter))
+		status = page_counter_try_charge_high(&memcg->memory, batch,
+						      &counter);
+		if (status == CHG_SUCCESS)
 			goto done_restock;
 		if (do_memsw_account())
 			page_counter_uncharge(&memcg->memsw, batch);
@@ -2553,6 +2569,7 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	} else {
 		mem_over_limit = mem_cgroup_from_counter(counter, memsw);
 		may_swap = false;
+		status = CHG_FAILED_MAX;
 	}
 
 	if (batch > nr_pages) {
@@ -2575,14 +2592,15 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	if (!gfpflags_allow_blocking(gfp_mask))
 		goto nomem;
 
-	memcg_memory_event(mem_over_limit, MEMCG_MAX);
+	memcg_memory_event(mem_over_limit,
+			   status == CHG_FAILED_MAX ? MEMCG_MAX : MEMCG_HIGH);
 
 	psi_memstall_enter(&pflags);
 	nr_reclaimed = try_to_free_mem_cgroup_pages(mem_over_limit, nr_pages,
 						    gfp_mask, may_swap);
 	psi_memstall_leave(&pflags);
 
-	if (mem_cgroup_margin(mem_over_limit) >= nr_pages)
+	if (mem_cgroup_margin(mem_over_limit, status == CHG_FAILED_HIGH) >= nr_pages)
 		goto retry;
 
 	if (!drained) {
@@ -2614,23 +2632,34 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	if (nr_retries--)
 		goto retry;
 
-	if (gfp_mask & __GFP_RETRY_MAYFAIL)
-		goto nomem;
+	if (status == CHG_FAILED_MAX) {
+		if (gfp_mask & __GFP_RETRY_MAYFAIL)
+			goto nomem;
 
-	/* Avoid endless loop for tasks bypassed by the oom killer */
-	if (passed_oom && task_is_dying())
-		goto nomem;
+		/* Avoid endless loop for tasks bypassed by the oom killer */
+		if (passed_oom && task_is_dying())
+			goto nomem;
 
-	/*
-	 * keep retrying as long as the memcg oom killer is able to make
-	 * a forward progress or bypass the charge if the oom killer
-	 * couldn't make any progress.
-	 */
-	if (mem_cgroup_oom(mem_over_limit, gfp_mask,
-			   get_order(nr_pages * PAGE_SIZE))) {
-		passed_oom = true;
-		nr_retries = MAX_RECLAIM_RETRIES;
-		goto retry;
+		/*
+		 * keep retrying as long as the memcg oom killer is able to make
+		 * a forward progress or bypass the charge if the oom killer
+		 * couldn't make any progress.
+		 */
+		if (mem_cgroup_oom(mem_over_limit, gfp_mask,
+				   get_order(nr_pages * PAGE_SIZE))) {
+			passed_oom = true;
+			nr_retries = MAX_RECLAIM_RETRIES;
+			goto retry;
+		}
+	} else {
+		unsigned long penalty_jiffies = calculate_penalty_jiffies(memcg,
+								nr_pages);
+
+		if (penalty_jiffies) {
+			psi_memstall_enter(&pflags);
+			schedule_timeout_killable(penalty_jiffies);
+			psi_memstall_leave(&pflags);
+		}
 	}
 nomem:
 	/*
@@ -2639,7 +2668,7 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	 * put the burden of reclaim on regular allocation requests
 	 * and let these go through as privileged allocations.
 	 */
-	if (!(gfp_mask & (__GFP_NOFAIL | __GFP_HIGH)))
+	if (!(gfp_mask & (__GFP_NOFAIL | __GFP_HIGH)) && status != CHG_FAILED_HIGH)
 		return -ENOMEM;
 force:
 	/*
@@ -2651,7 +2680,8 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	if (do_memsw_account())
 		page_counter_charge(&memcg->memsw, nr_pages);
 
-	return 0;
+	if (status != CHG_FAILED_HIGH)
+		return 0;
 
 done_restock:
 	if (batch > nr_pages)
diff --git a/mm/page_counter.c b/mm/page_counter.c
index eb156ff5d603..35c9360af334 100644
--- a/mm/page_counter.c
+++ b/mm/page_counter.c
@@ -86,19 +86,11 @@ void page_counter_charge(struct page_counter *counter, unsigned long nr_pages)
 	}
 }
 
-/**
- * page_counter_try_charge - try to hierarchically charge pages
- * @counter: counter
- * @nr_pages: number of pages to charge
- * @fail: points first counter to hit its limit, if any
- *
- * Returns %true on success, or %false and @fail if the counter or one
- * of its ancestors has hit its configured limit.
- */
-bool page_counter_try_charge(struct page_counter *counter,
-			     unsigned long nr_pages,
-			     struct page_counter **fail)
+static enum charge_status __page_counter_try_charge(
+			struct page_counter *counter, unsigned long nr_pages,
+			struct page_counter **fail, bool check_high)
 {
+	enum charge_status status = CHG_SUCCESS;
 	struct page_counter *c;
 
 	for (c = counter; c; c = c->parent) {
@@ -127,6 +119,12 @@ bool page_counter_try_charge(struct page_counter *counter,
 			 */
 			data_race(c->failcnt++);
 			*fail = c;
+			status = CHG_FAILED_MAX;
+			goto failed;
+		} else if (check_high && new > c->high) {
+			atomic_long_sub(nr_pages, &c->usage);
+			*fail = c;
+			status = CHG_FAILED_HIGH;
 			goto failed;
 		}
 		propagate_protected_usage(c, new);
@@ -137,13 +135,46 @@ bool page_counter_try_charge(struct page_counter *counter,
 		if (new > READ_ONCE(c->watermark))
 			WRITE_ONCE(c->watermark, new);
 	}
-	return true;
+	return status;
 
 failed:
 	for (c = counter; c != *fail; c = c->parent)
 		page_counter_cancel(c, nr_pages);
 
-	return false;
+	return status;
+}
+
+/**
+ * page_counter_try_charge - try to hierarchically charge pages
+ * @counter: counter
+ * @nr_pages: number of pages to charge
+ * @fail: points first counter to hit its limit, if any
+ *
+ * Returns %true on success, or %false and @fail if the counter or one
+ * of its ancestors has hit its configured limit.
+ */
+bool page_counter_try_charge(struct page_counter *counter,
+			     unsigned long nr_pages,
+			     struct page_counter **fail)
+{
+	return __page_counter_try_charge(counter, nr_pages, fail, false) ==
+		CHG_SUCCESS;
+}
+
+/**
+ * page_counter_try_charge_high - try to hierarchically charge pages
+ * @counter: counter
+ * @nr_pages: number of pages to charge
+ * @fail: points first counter to hit its limit, if any
+ *
+ * Returns CHG_SUCCESS on success, and if the counter or one of its ancestors
+ * has hit its configured max or high limit, returns the corresponding failure.
+ */
+enum charge_status page_counter_try_charge_high(struct page_counter *counter,
+			     unsigned long nr_pages,
+			     struct page_counter **fail)
+{
+	return __page_counter_try_charge(counter, nr_pages, fail, true);
 }
 
 /**
-- 
2.35.1.265.g69c8d7142f-goog



* Re: [PATCH 1/4] memcg: refactor mem_cgroup_oom
From: Roman Gushchin @ 2022-02-10 19:52 UTC
  To: Shakeel Butt
  Cc: Johannes Weiner, Michal Hocko, Chris Down, Andrew Morton,
	cgroups, linux-mm, linux-kernel

On Thu, Feb 10, 2022 at 12:14:34AM -0800, Shakeel Butt wrote:
> The function mem_cgroup_oom returns enum which has four possible values
> but the caller does not care about such values and only care if the
> return value is OOM_SUCCESS or not. So, remove the enum altogether and
> make mem_cgroup_oom returns a simple bool.
> 
> Signed-off-by: Shakeel Butt <shakeelb@google.com>

Nice!

Reviewed-by: Roman Gushchin <guro@fb.com>

> ---
>  mm/memcontrol.c | 40 +++++++++++++---------------------------
>  1 file changed, 13 insertions(+), 27 deletions(-)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index a0e9d9f12cf5..c40c27822802 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1795,20 +1795,12 @@ static void memcg_oom_recover(struct mem_cgroup *memcg)
>  		__wake_up(&memcg_oom_waitq, TASK_NORMAL, 0, memcg);
>  }
>  
> -enum oom_status {
> -	OOM_SUCCESS,
> -	OOM_FAILED,
> -	OOM_ASYNC,
> -	OOM_SKIPPED
> -};
> -

The only thing I'd add is a small comment on the return value here. E.g.
"returns true if one or more tasks have been successfully killed" or something
like this.
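
One possible form of such a comment (the wording is only a sketch):

	/*
	 * Returns true if the memcg OOM killer was invoked and managed to
	 * kill one or more tasks, false if the OOM path was skipped,
	 * deferred to the page fault path, or could not make any progress.
	 */
	static bool mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int order)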

Thanks!

> -static enum oom_status mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int order)
> +static bool mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int order)
>  {
> -	enum oom_status ret;
> -	bool locked;
> +	bool locked, ret = false;
>  
>  	if (order > PAGE_ALLOC_COSTLY_ORDER)
> -		return OOM_SKIPPED;
> +		return ret;
>  
>  	memcg_memory_event(memcg, MEMCG_OOM);
>  
> @@ -1831,14 +1823,13 @@ static enum oom_status mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int
>  	 * victim and then we have to bail out from the charge path.
>  	 */
>  	if (memcg->oom_kill_disable) {
> -		if (!current->in_user_fault)
> -			return OOM_SKIPPED;
> -		css_get(&memcg->css);
> -		current->memcg_in_oom = memcg;
> -		current->memcg_oom_gfp_mask = mask;
> -		current->memcg_oom_order = order;
> -
> -		return OOM_ASYNC;
> +		if (current->in_user_fault) {
> +			css_get(&memcg->css);
> +			current->memcg_in_oom = memcg;
> +			current->memcg_oom_gfp_mask = mask;
> +			current->memcg_oom_order = order;
> +		}
> +		return ret;
>  	}
>  
>  	mem_cgroup_mark_under_oom(memcg);
> @@ -1849,10 +1840,7 @@ static enum oom_status mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int
>  		mem_cgroup_oom_notify(memcg);
>  
>  	mem_cgroup_unmark_under_oom(memcg);
> -	if (mem_cgroup_out_of_memory(memcg, mask, order))
> -		ret = OOM_SUCCESS;
> -	else
> -		ret = OOM_FAILED;
> +	ret = mem_cgroup_out_of_memory(memcg, mask, order);
>  
>  	if (locked)
>  		mem_cgroup_oom_unlock(memcg);
> @@ -2545,7 +2533,6 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
>  	int nr_retries = MAX_RECLAIM_RETRIES;
>  	struct mem_cgroup *mem_over_limit;
>  	struct page_counter *counter;
> -	enum oom_status oom_status;
>  	unsigned long nr_reclaimed;
>  	bool passed_oom = false;
>  	bool may_swap = true;
> @@ -2648,9 +2635,8 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
>  	 * a forward progress or bypass the charge if the oom killer
>  	 * couldn't make any progress.
>  	 */
> -	oom_status = mem_cgroup_oom(mem_over_limit, gfp_mask,
> -		       get_order(nr_pages * PAGE_SIZE));
> -	if (oom_status == OOM_SUCCESS) {
> +	if (mem_cgroup_oom(mem_over_limit, gfp_mask,
> +			   get_order(nr_pages * PAGE_SIZE))) {
>  		passed_oom = true;
>  		nr_retries = MAX_RECLAIM_RETRIES;
>  		goto retry;
> -- 
> 2.35.1.265.g69c8d7142f-goog
> 

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 2/4] memcg: unify force charging conditions
@ 2022-02-10 20:03     ` Roman Gushchin
  0 siblings, 0 replies; 26+ messages in thread
From: Roman Gushchin @ 2022-02-10 20:03 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Johannes Weiner, Michal Hocko, Chris Down, Andrew Morton,
	cgroups, linux-mm, linux-kernel

On Thu, Feb 10, 2022 at 12:14:35AM -0800, Shakeel Butt wrote:
> Currently the kernel force charges the allocations which have __GFP_HIGH
> flag without triggering the memory reclaim. __GFP_HIGH indicates that
> the caller is high priority and since commit 869712fd3de5 ("mm:
> memcontrol: fix network errors from failing __GFP_ATOMIC charges") the
> kernel let such allocations do force charging. Please note that
> __GFP_ATOMIC has been replaced by __GFP_HIGH.
> 
> __GFP_HIGH does not tell if the caller can block or can trigger reclaim.
> There are separate checks to determine that. So, there is no need to
> skip reclaim for __GFP_HIGH allocations. So, handle __GFP_HIGH together
> with __GFP_NOFAIL which also does force charging.

This sounds very reasonable. But shouldn't we check if __GFP_DIRECT_RECLAIM
is set and bail out otherwise?
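
I.e. something along these lines in try_charge_memcg(), before attempting
reclaim (purely illustrative, the exact placement is debatable):

	if (!(gfp_mask & __GFP_DIRECT_RECLAIM))
		goto nomem;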

Thanks!

> 
> Please note that this is a noop change as there are no __GFP_HIGH
> allocators in kernel which also have __GFP_ACCOUNT (or SLAB_ACCOUNT) and
> does not allow reclaim for now. The reason for this patch is to simplify
> the reasoning of the following patches.
> 
> Signed-off-by: Shakeel Butt <shakeelb@google.com>
> ---
>  mm/memcontrol.c | 17 +++++++----------
>  1 file changed, 7 insertions(+), 10 deletions(-)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index c40c27822802..ae73a40818b0 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2560,15 +2560,6 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
>  		goto retry;
>  	}
>  
> -	/*
> -	 * Memcg doesn't have a dedicated reserve for atomic
> -	 * allocations. But like the global atomic pool, we need to
> -	 * put the burden of reclaim on regular allocation requests
> -	 * and let these go through as privileged allocations.
> -	 */
> -	if (gfp_mask & __GFP_HIGH)
> -		goto force;
> -
>  	/*
>  	 * Prevent unbounded recursion when reclaim operations need to
>  	 * allocate memory. This might exceed the limits temporarily,
> @@ -2642,7 +2633,13 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
>  		goto retry;
>  	}
>  nomem:
> -	if (!(gfp_mask & __GFP_NOFAIL))
> +	/*
> +	 * Memcg doesn't have a dedicated reserve for atomic
> +	 * allocations. But like the global atomic pool, we need to
> +	 * put the burden of reclaim on regular allocation requests
> +	 * and let these go through as privileged allocations.
> +	 */
> +	if (!(gfp_mask & (__GFP_NOFAIL | __GFP_HIGH)))
>  		return -ENOMEM;
>  force:
>  	/*
> -- 
> 2.35.1.265.g69c8d7142f-goog
> 

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 4/4] memcg: synchronously enforce memory.high
@ 2022-02-10 20:15     ` Roman Gushchin
  0 siblings, 0 replies; 26+ messages in thread
From: Roman Gushchin @ 2022-02-10 20:15 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Johannes Weiner, Michal Hocko, Chris Down, Andrew Morton,
	cgroups, linux-mm, linux-kernel

On Thu, Feb 10, 2022 at 12:14:37AM -0800, Shakeel Butt wrote:
> The high limit is used to throttle the workload without invoking the
> oom-killer. Recently we tried to use the high limit to right size our
> internal workloads. More specifically dynamically adjusting the limits
> of the workload without letting the workload get oom-killed. However due
> to the limitation of the implementation of high limit enforcement, we
> observed the mechanism fails for some real workloads.
> 
> The high limit is enforced on return-to-userspace i.e. the kernel let
> the usage goes over the limit and when the execution returns to
> userspace, the high reclaim is triggered and the process can get
> throttled as well. However this mechanism fails for workloads which do
> large allocations in a single kernel entry e.g. applications that
> mlock() a large chunk of memory in a single syscall. Such applications
> bypass the high limit and can trigger the oom-killer.
> 
> To make high limit enforcement more robust, this patch make the limit
> enforcement synchronous. However there are couple of open questions to
> enforce high limit synchronously. What should be the behavior of
> __GFP_NORETRY allocaion on hitting high limit? Similar question arise
> for allocations which do not allow blocking. This patch took the
> approach to keep the previous behavior i.e. let such allocations not
> throttle synchronously but rely on the return-to-userspace mechanism to
> throttle processes triggering such allocations.
> 
> This patch does not remove the return-to-userspace high limit
> enforcement due to the reason mentioned in the previous para. Also the
> allocations where the memory usage is below high limit but the swap
> usage is above swap's high limit, such allocators are throttled in the
> return-to-userspace.

Has this approach been extensively tested in production?

Injecting sleeps at the return-to-userspace moment is safe in terms of priority
inversions: a slowed-down task is unlikely to affect the rest of the system.

It is way less predictable for a random allocation in kernel mode: what if
the task is already holding a system-wide resource?

Someone might argue that it's no better than a system-wide memory shortage,
where the same allocation might go into direct reclaim anyway, but with
the way memory.high is used it will happen far more often.

Thanks!

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 4/4] memcg: synchronously enforce memory.high
@ 2022-02-10 22:22       ` Shakeel Butt
  0 siblings, 0 replies; 26+ messages in thread
From: Shakeel Butt @ 2022-02-10 22:22 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Johannes Weiner, Michal Hocko, Chris Down, Andrew Morton,
	Cgroups, Linux MM, LKML

On Thu, Feb 10, 2022 at 12:15 PM Roman Gushchin <guro@fb.com> wrote:
>
[...]
>
> Has this approach been extensively tested in the production?
>
> Injecting sleeps at return-to-userspace moment is safe in terms of priority
> inversions: a slowed down task will unlikely affect the rest of the system.
>
> It way less predictable for a random allocation in the kernel mode, what if
> the task is already holding a system-wide resource?
>
> Someone might argue that it's not better than a system-wide memory shortage
> and the same allocation might go into a direct reclaim anyway, but with
> the way how memory.high is used it will happen way more often.
>

Thanks for the review.

This patchset has been tested in a test environment for now. I do plan
to test it in production, but that is a slow process and will take
some time.

Let me answer the main concern you have raised, i.e. the safety of
throttling a task synchronously in the charge code path. Please note
that synchronous memory reclaim and oom-killing can already cause the
priority inversion issues you have mentioned. The way we usually
tackle such issues is through userspace controllers. For example, oomd
is the userspace solution for handling such issues related to
oom-killing. Here we have a similar userspace daemon monitoring the
workload and deciding whether it should let the workload grow or kill it.

Now, should we keep the current high limit enforcement implementation
and let it be ineffective for some real workloads, or should we make
the enforcement more robust and let userspace tackle some corner-case
priority inversion issues? I think we should follow the second
option, as we already have precedent for doing the same for reclaim and
oom-killing.

thanks,
Shakeel

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 1/4] memcg: refactor mem_cgroup_oom
  2022-02-10 19:52     ` Roman Gushchin
  (?)
@ 2022-02-10 22:23     ` Shakeel Butt
  -1 siblings, 0 replies; 26+ messages in thread
From: Shakeel Butt @ 2022-02-10 22:23 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Johannes Weiner, Michal Hocko, Chris Down, Andrew Morton,
	Cgroups, Linux MM, LKML

On Thu, Feb 10, 2022 at 11:53 AM Roman Gushchin <guro@fb.com> wrote:
>
> On Thu, Feb 10, 2022 at 12:14:34AM -0800, Shakeel Butt wrote:
> > The function mem_cgroup_oom returns enum which has four possible values
> > but the caller does not care about such values and only care if the
> > return value is OOM_SUCCESS or not. So, remove the enum altogether and
> > make mem_cgroup_oom returns a simple bool.
> >
> > Signed-off-by: Shakeel Butt <shakeelb@google.com>
>
> Nice!
>
> Reviewed-by: Roman Gushchin <guro@fb.com>

Thanks.
>
[...]
>
> The only thing, I'd add a small comment on the return value here. E.g.
> "returns true if one or more tasks have been successfully killed" or something
> like this.
>

Will do in the next version.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 2/4] memcg: unify force charging conditions
@ 2022-02-10 22:25       ` Shakeel Butt
  0 siblings, 0 replies; 26+ messages in thread
From: Shakeel Butt @ 2022-02-10 22:25 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Johannes Weiner, Michal Hocko, Chris Down, Andrew Morton,
	Cgroups, Linux MM, LKML

On Thu, Feb 10, 2022 at 12:03 PM Roman Gushchin <guro@fb.com> wrote:
>
> On Thu, Feb 10, 2022 at 12:14:35AM -0800, Shakeel Butt wrote:
> > Currently the kernel force charges the allocations which have __GFP_HIGH
> > flag without triggering the memory reclaim. __GFP_HIGH indicates that
> > the caller is high priority and since commit 869712fd3de5 ("mm:
> > memcontrol: fix network errors from failing __GFP_ATOMIC charges") the
> > kernel let such allocations do force charging. Please note that
> > __GFP_ATOMIC has been replaced by __GFP_HIGH.
> >
> > __GFP_HIGH does not tell if the caller can block or can trigger reclaim.
> > There are separate checks to determine that. So, there is no need to
> > skip reclaim for __GFP_HIGH allocations. So, handle __GFP_HIGH together
> > with __GFP_NOFAIL which also does force charging.
>
> This sounds very reasonable. But shouldn't we check if __GFP_DIRECT_RECLAIM
> is set and bail out otherwise?
>

We already have a gfpflags_allow_blocking() check which checks for
__GFP_DIRECT_RECLAIM.
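
For reference, that helper boils down to the following (as defined in
include/linux/gfp.h at the time of writing; shown here only for illustration):

static inline bool gfpflags_allow_blocking(const gfp_t gfp_flags)
{
	return !!(gfp_flags & __GFP_DIRECT_RECLAIM);
}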

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 2/4] memcg: unify force charging conditions
  2022-02-10 22:25       ` Shakeel Butt
  (?)
@ 2022-02-10 23:15       ` Roman Gushchin
  -1 siblings, 0 replies; 26+ messages in thread
From: Roman Gushchin @ 2022-02-10 23:15 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Johannes Weiner, Michal Hocko, Chris Down, Andrew Morton,
	Cgroups, Linux MM, LKML

On Thu, Feb 10, 2022 at 02:25:01PM -0800, Shakeel Butt wrote:
> On Thu, Feb 10, 2022 at 12:03 PM Roman Gushchin <guro@fb.com> wrote:
> >
> > On Thu, Feb 10, 2022 at 12:14:35AM -0800, Shakeel Butt wrote:
> > > Currently the kernel force charges the allocations which have __GFP_HIGH
> > > flag without triggering the memory reclaim. __GFP_HIGH indicates that
> > > the caller is high priority and since commit 869712fd3de5 ("mm:
> > > memcontrol: fix network errors from failing __GFP_ATOMIC charges") the
> > > kernel let such allocations do force charging. Please note that
> > > __GFP_ATOMIC has been replaced by __GFP_HIGH.
> > >
> > > __GFP_HIGH does not tell if the caller can block or can trigger reclaim.
> > > There are separate checks to determine that. So, there is no need to
> > > skip reclaim for __GFP_HIGH allocations. So, handle __GFP_HIGH together
> > > with __GFP_NOFAIL which also does force charging.
> >
> > This sounds very reasonable. But shouldn't we check if __GFP_DIRECT_RECLAIM
> > is set and bail out otherwise?
> >
> 
> We already have a gfpflags_allow_blocking() check which checks for
> __GFP_DIRECT_RECLAIM.

Indeed. Please feel free to add
Reviewed-by: Roman Gushchin <guro@fb.com> to the patch.

Thank you!

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 4/4] memcg: synchronously enforce memory.high
@ 2022-02-10 23:29         ` Roman Gushchin
  0 siblings, 0 replies; 26+ messages in thread
From: Roman Gushchin @ 2022-02-10 23:29 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Johannes Weiner, Michal Hocko, Chris Down, Andrew Morton,
	Cgroups, Linux MM, LKML

On Thu, Feb 10, 2022 at 02:22:36PM -0800, Shakeel Butt wrote:
> On Thu, Feb 10, 2022 at 12:15 PM Roman Gushchin <guro@fb.com> wrote:
> >
> [...]
> >
> > Has this approach been extensively tested in the production?
> >
> > Injecting sleeps at return-to-userspace moment is safe in terms of priority
> > inversions: a slowed down task will unlikely affect the rest of the system.
> >
> > It way less predictable for a random allocation in the kernel mode, what if
> > the task is already holding a system-wide resource?
> >
> > Someone might argue that it's not better than a system-wide memory shortage
> > and the same allocation might go into a direct reclaim anyway, but with
> > the way how memory.high is used it will happen way more often.
> >
> 
> Thanks for the review.
> 
> This patchset is tested in the test environment for now and I do plan
> to test this in production but that is a slow process and will take
> some time.
> 
> Let me answer the main concern you have raised i.e. the safety of
> throttling a task synchronously in the charge code path. Please note
> that synchronous memory reclaim and oom-killing can already cause the
> priority inversion issues you have mentioned. The way we usually
> tackle such issues are through userspace controllers. For example oomd
> is the userspace solution for catering such issues related to
> oom-killing. Here we have a similar userspace daemon monitoring the
> workload and deciding if it should let the workload grow or kill it.
> 
> Now should we keep the current high limit enforcement implementation
> and let it be ineffective for some real workloads or should we make
> the enforcement more robust and let the userspace tackle some corner
> case priority inversion issues. I think we should follow the second
> option as we already have precedence of doing the same for reclaim and
> oom-killing.

Well, in theory it sounds good and I have no intention of opposing the
idea. However, in practice we might easily run into quite serious problems,
so I think we should be extra careful here. In the end we don't want to
pull and then revert this patch.

The difference between system-wide direct reclaim and this case is that
kswapd usually does a good job of refilling the free buffer, so we rarely
operate under a global memory shortage. And when we do, it often does not
work out well, which is why oomd and other similar solutions are required.

Another option is to use your approach only for special cases (e.g. huge
allocations) and keep the existing approach for most other allocations.

Thanks!

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 4/4] memcg: synchronously enforce memory.high
  2022-02-10 23:29         ` Roman Gushchin
  (?)
@ 2022-02-10 23:53         ` Shakeel Butt
  2022-02-11  2:44           ` Roman Gushchin
  -1 siblings, 1 reply; 26+ messages in thread
From: Shakeel Butt @ 2022-02-10 23:53 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Johannes Weiner, Michal Hocko, Chris Down, Andrew Morton,
	Cgroups, Linux MM, LKML

On Thu, Feb 10, 2022 at 3:29 PM Roman Gushchin <guro@fb.com> wrote:
>
> On Thu, Feb 10, 2022 at 02:22:36PM -0800, Shakeel Butt wrote:
> > On Thu, Feb 10, 2022 at 12:15 PM Roman Gushchin <guro@fb.com> wrote:
> > >
> > [...]
> > >
> > > Has this approach been extensively tested in the production?
> > >
> > > Injecting sleeps at return-to-userspace moment is safe in terms of priority
> > > inversions: a slowed down task will unlikely affect the rest of the system.
> > >
> > > It way less predictable for a random allocation in the kernel mode, what if
> > > the task is already holding a system-wide resource?
> > >
> > > Someone might argue that it's not better than a system-wide memory shortage
> > > and the same allocation might go into a direct reclaim anyway, but with
> > > the way how memory.high is used it will happen way more often.
> > >
> >
> > Thanks for the review.
> >
> > This patchset is tested in the test environment for now and I do plan
> > to test this in production but that is a slow process and will take
> > some time.
> >
> > Let me answer the main concern you have raised i.e. the safety of
> > throttling a task synchronously in the charge code path. Please note
> > that synchronous memory reclaim and oom-killing can already cause the
> > priority inversion issues you have mentioned. The way we usually
> > tackle such issues are through userspace controllers. For example oomd
> > is the userspace solution for catering such issues related to
> > oom-killing. Here we have a similar userspace daemon monitoring the
> > workload and deciding if it should let the workload grow or kill it.
> >
> > Now should we keep the current high limit enforcement implementation
> > and let it be ineffective for some real workloads or should we make
> > the enforcement more robust and let the userspace tackle some corner
> > case priority inversion issues. I think we should follow the second
> > option as we already have precedence of doing the same for reclaim and
> > oom-killing.
>
> Well, in a theory it sounds good and I have no intention to oppose the
> idea. However in practice we might easily get quite serious problems.
> So I think we should be extra careful here. In the end we don't want to
> pull and then revert this patch.
>
> The difference between the system-wide direct reclaim and this case is that
> usually kswapd is doing a good job of refilling the empty buffer, so we don't
> usually work in the circumstances of the global memory shortage. And when we do,
> often it's not working out quite well, this is why oomd and other similar
> solutions are required.
>.
> Another option is to use your approach only for special cases (e.g. huge
> allocations) and keep the existing approach for most other allocations.
>

These are not necessarily huge allocations; they can also be a large number
of small allocations. However, I think we can make this idea work by
checking current->memcg_nr_pages_over_high: if the order of
current->memcg_nr_pages_over_high is, let's say, larger than
PAGE_ALLOC_COSTLY_ORDER, then throttle synchronously. WDYT?
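
A minimal sketch of the idea, assuming a check near the end of
try_charge_memcg() once memcg_nr_pages_over_high has been accounted (the
threshold, the helper used and the exact placement are all open; this is
just to illustrate the shape of it):

	/*
	 * Sketch only: throttle synchronously for large overages and keep
	 * relying on the return-to-userspace path for everything else.
	 */
	if (gfpflags_allow_blocking(gfp_mask) &&
	    current->memcg_nr_pages_over_high > (1UL << PAGE_ALLOC_COSTLY_ORDER))
		mem_cgroup_handle_over_high();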

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 4/4] memcg: synchronously enforce memory.high
  2022-02-10 23:53         ` Shakeel Butt
@ 2022-02-11  2:44           ` Roman Gushchin
  0 siblings, 0 replies; 26+ messages in thread
From: Roman Gushchin @ 2022-02-11  2:44 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Johannes Weiner, Michal Hocko, Chris Down, Andrew Morton,
	Cgroups, Linux MM, LKML

On Thu, Feb 10, 2022 at 03:53:00PM -0800, Shakeel Butt wrote:
> On Thu, Feb 10, 2022 at 3:29 PM Roman Gushchin <guro@fb.com> wrote:
> >
> > On Thu, Feb 10, 2022 at 02:22:36PM -0800, Shakeel Butt wrote:
> > > On Thu, Feb 10, 2022 at 12:15 PM Roman Gushchin <guro@fb.com> wrote:
> > > >
> > > [...]
> > > >
> > > > Has this approach been extensively tested in the production?
> > > >
> > > > Injecting sleeps at return-to-userspace moment is safe in terms of priority
> > > > inversions: a slowed down task will unlikely affect the rest of the system.
> > > >
> > > > It way less predictable for a random allocation in the kernel mode, what if
> > > > the task is already holding a system-wide resource?
> > > >
> > > > Someone might argue that it's not better than a system-wide memory shortage
> > > > and the same allocation might go into a direct reclaim anyway, but with
> > > > the way how memory.high is used it will happen way more often.
> > > >
> > >
> > > Thanks for the review.
> > >
> > > This patchset is tested in the test environment for now and I do plan
> > > to test this in production but that is a slow process and will take
> > > some time.
> > >
> > > Let me answer the main concern you have raised i.e. the safety of
> > > throttling a task synchronously in the charge code path. Please note
> > > that synchronous memory reclaim and oom-killing can already cause the
> > > priority inversion issues you have mentioned. The way we usually
> > > tackle such issues are through userspace controllers. For example oomd
> > > is the userspace solution for catering such issues related to
> > > oom-killing. Here we have a similar userspace daemon monitoring the
> > > workload and deciding if it should let the workload grow or kill it.
> > >
> > > Now should we keep the current high limit enforcement implementation
> > > and let it be ineffective for some real workloads or should we make
> > > the enforcement more robust and let the userspace tackle some corner
> > > case priority inversion issues. I think we should follow the second
> > > option as we already have precedence of doing the same for reclaim and
> > > oom-killing.
> >
> > Well, in a theory it sounds good and I have no intention to oppose the
> > idea. However in practice we might easily get quite serious problems.
> > So I think we should be extra careful here. In the end we don't want to
> > pull and then revert this patch.
> >
> > The difference between the system-wide direct reclaim and this case is that
> > usually kswapd is doing a good job of refilling the empty buffer, so we don't
> > usually work in the circumstances of the global memory shortage. And when we do,
> > often it's not working out quite well, this is why oomd and other similar
> > solutions are required.
> >.
> > Another option is to use your approach only for special cases (e.g. huge
> > allocations) and keep the existing approach for most other allocations.
> >
> 
> These are not necessarily huge allocations and can be a large number
> of small allocations. However I think we can make this idea work by
> checking current->memcg_nr_pages_over_high. If
> order(current->memcg_nr_pages_over_high) is, let's say, larger than
> PAGE_ALLOC_COSTLY_ORDER, then throttle synchronously. WDYT?

Yes, I really like this idea: the majority of allocations will be handled in
a proven way, while we'll still have coverage for large outliers as well.

Not sure about PAGE_ALLOC_COSTLY_ORDER though; I'd probably choose a larger
constant, but we can discuss it separately.

Thanks!

^ permalink raw reply	[flat|nested] 26+ messages in thread

end of thread, other threads:[~2022-02-11  2:44 UTC | newest]

Thread overview: 26+ messages
2022-02-10  8:14 [PATCH 0/4] memcg: robust enforcement of memory.high Shakeel Butt
2022-02-10  8:14 ` Shakeel Butt
2022-02-10  8:14 ` [PATCH 1/4] memcg: refactor mem_cgroup_oom Shakeel Butt
2022-02-10  8:14   ` Shakeel Butt
2022-02-10 19:52   ` Roman Gushchin
2022-02-10 19:52     ` Roman Gushchin
2022-02-10 22:23     ` Shakeel Butt
2022-02-10  8:14 ` [PATCH 2/4] memcg: unify force charging conditions Shakeel Butt
2022-02-10  8:14   ` Shakeel Butt
2022-02-10 20:03   ` Roman Gushchin
2022-02-10 20:03     ` Roman Gushchin
2022-02-10 22:25     ` Shakeel Butt
2022-02-10 22:25       ` Shakeel Butt
2022-02-10 23:15       ` Roman Gushchin
2022-02-10  8:14 ` [PATCH 3/4] selftests: memcg: test high limit for single entry allocation Shakeel Butt
2022-02-10  8:14   ` Shakeel Butt
2022-02-10  8:14 ` [PATCH 4/4] memcg: synchronously enforce memory.high Shakeel Butt
2022-02-10  8:14   ` Shakeel Butt
2022-02-10 20:15   ` Roman Gushchin
2022-02-10 20:15     ` Roman Gushchin
2022-02-10 22:22     ` Shakeel Butt
2022-02-10 22:22       ` Shakeel Butt
2022-02-10 23:29       ` Roman Gushchin
2022-02-10 23:29         ` Roman Gushchin
2022-02-10 23:53         ` Shakeel Butt
2022-02-11  2:44           ` Roman Gushchin
