linux-mm.kvack.org archive mirror
* [PATCH 0/3] memcg oom: bail out from the charge path if no victim found
@ 2020-04-18 15:13 Yafang Shao
  2020-04-18 15:13 ` [PATCH 1/3] mm: change the return type of out_of_memory() Yafang Shao
                   ` (2 more replies)
  0 siblings, 3 replies; 11+ messages in thread
From: Yafang Shao @ 2020-04-18 15:13 UTC (permalink / raw)
  To: hannes, mhocko, vdavydov.dev, akpm; +Cc: linux-mm, Yafang Shao

Leaving aside manually triggered OOM, if no victim is found in a
system OOM, the system will deadlock on memory; however, if no victim
is found in a memcg OOM, the charge succeeds and the task keeps
running. This behavior in memcg OOM is not proper because it allows
the memcg to exceed its limit.

Take a simple example.
        $ cd /sys/fs/cgroup/foo/
        $ echo $$ > cgroup.procs
        $ echo 200M > memory.max
        $ cat memory.max
        209715200
        $ echo -1000 > /proc/$$/oom_score_adj
Then, let's run a memhog task in memcg foo, which allocates 1G of
memory and keeps running.
        $ /home/yafang/test/memhog &
Then memory.current will be greater than memory.max. Run the command
below in another shell.
        $ cat /sys/fs/cgroup/foo/memory.current
        1097228288
The tasks which have already allocated memory and won't allocate new
memory still run well. This behavior makes no sense.
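The memhog binary referenced above is not included in the series; a minimal sketch of what such an allocator might look like in C follows (the `memhog` helper and its chunk size are illustrative assumptions, not the author's actual program):

```c
#include <stdlib.h>
#include <string.h>

/*
 * Allocate `total` bytes in `chunk`-sized pieces and write to every
 * byte so the pages are actually faulted in and charged to the memcg.
 * The buffers are intentionally leaked so the memory stays resident.
 * Returns the number of bytes committed.
 */
size_t memhog(size_t total, size_t chunk)
{
	size_t committed = 0;

	while (committed < total) {
		char *buf = malloc(chunk);

		if (!buf)
			break;
		memset(buf, 0xa5, chunk);	/* fault in every page */
		committed += chunk;		/* leaked on purpose */
	}
	return committed;
}
```

A real run would call something like `memhog(1UL << 30, 1UL << 20)` from `main()` and then sleep forever, so the charged memory stays resident above memory.max.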

This patchset improves that.
If no victim is found in a memcg OOM, we should force the current task
to wait until pages become available. That is similar to the behavior
in memcg v1 when oom_kill_disable is set.
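For reference, the memcg v1 behavior mentioned above is selected per cgroup through the oom_control knob; a rough sketch of the knobs involved (assuming cgroup v1 is mounted at /sys/fs/cgroup/memory and the commands are run as root):

```shell
# Disable the OOM killer for memcg "foo": tasks that exceed the limit
# then sleep at the end of the page fault instead of being killed.
echo 1 > /sys/fs/cgroup/memory/foo/memory.oom_control

# The sleeping tasks are woken up when memory is freed, or when the
# limit is enlarged:
echo 300M > /sys/fs/cgroup/memory/foo/memory.limit_in_bytes
```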

Patches #1 and #2 are preparation for patch #3.

Yafang Shao (3):
  mm: change the return type of out_of_memory()
  mm, memcg: introduce a new helper task_in_memcg_oom_set()
  memcg oom: bail out from the charge path if no victim found

 include/linux/memcontrol.h | 30 ++++++++++++++++++
 include/linux/oom.h        |  9 +++++-
 include/linux/sched.h      |  1 +
 mm/memcontrol.c            | 63 ++++++++++++++++++++------------------
 mm/oom_kill.c              | 36 +++++++++++++++++-----
 mm/page_alloc.c            |  3 +-
 6 files changed, 104 insertions(+), 38 deletions(-)

-- 
2.18.2



^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH 1/3] mm: change the return type of out_of_memory()
  2020-04-18 15:13 [PATCH 0/3] memcg oom: bail out from the charge path if no victim found Yafang Shao
@ 2020-04-18 15:13 ` Yafang Shao
  2020-04-18 15:13 ` [PATCH 2/3] mm, memcg: introduce a new helper task_in_memcg_oom_set() Yafang Shao
  2020-04-18 15:13 ` [PATCH 3/3] memcg oom: bail out from the charge path if no victim found Yafang Shao
  2 siblings, 0 replies; 11+ messages in thread
From: Yafang Shao @ 2020-04-18 15:13 UTC (permalink / raw)
  To: hannes, mhocko, vdavydov.dev, akpm; +Cc: linux-mm, Yafang Shao

Change the return type of out_of_memory() from bool to enum oom_status
for later use. The definition of enum oom_status is moved from
mm/memcontrol.c to include/linux/oom.h.
No functional change.

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
 include/linux/memcontrol.h |  1 +
 include/linux/oom.h        |  9 ++++++++-
 mm/memcontrol.c            | 30 +++++++++++++-----------------
 mm/oom_kill.c              | 14 +++++++-------
 mm/page_alloc.c            |  3 ++-
 5 files changed, 31 insertions(+), 26 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 1b4150ff64be..98bd8fb2f5c7 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -21,6 +21,7 @@
 #include <linux/vmstat.h>
 #include <linux/writeback.h>
 #include <linux/page-flags.h>
+#include <linux/oom.h>
 
 struct mem_cgroup;
 struct page;
diff --git a/include/linux/oom.h b/include/linux/oom.h
index c696c265f019..3dca5ce189e6 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -22,6 +22,13 @@ enum oom_constraint {
 	CONSTRAINT_MEMCG,
 };
 
+enum oom_status {
+	OOM_SUCCESS,
+	OOM_FAILED,
+	OOM_ASYNC,
+	OOM_SKIPPED
+};
+
 /*
  * Details of the page allocation that triggered the oom killer that are used to
  * determine what should be killed.
@@ -110,7 +117,7 @@ bool __oom_reap_task_mm(struct mm_struct *mm);
 extern unsigned long oom_badness(struct task_struct *p,
 		unsigned long totalpages);
 
-extern bool out_of_memory(struct oom_control *oc);
+enum oom_status out_of_memory(struct oom_control *oc);
 
 extern void exit_oom_victim(void);
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 5beea03dd58a..22418b55804f 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1570,8 +1570,9 @@ unsigned long mem_cgroup_size(struct mem_cgroup *memcg)
 	return page_counter_read(&memcg->memory);
 }
 
-static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
-				     int order)
+static enum oom_status mem_cgroup_out_of_memory(struct mem_cgroup *memcg,
+						gfp_t gfp_mask,
+						int order)
 {
 	struct oom_control oc = {
 		.zonelist = NULL,
@@ -1580,16 +1581,22 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
 		.gfp_mask = gfp_mask,
 		.order = order,
 	};
-	bool ret;
+	enum oom_status ret;
 
 	if (mutex_lock_killable(&oom_lock))
-		return true;
+		return OOM_SUCCESS;
 	/*
 	 * A few threads which were not waiting at mutex_lock_killable() can
 	 * fail to bail out. Therefore, check again after holding oom_lock.
 	 */
-	ret = should_force_charge() || out_of_memory(&oc);
+	if (should_force_charge()) {
+		mutex_unlock(&oom_lock);
+		return OOM_SUCCESS;
+	}
+
+	ret = out_of_memory(&oc);
 	mutex_unlock(&oom_lock);
+
 	return ret;
 }
 
@@ -1767,13 +1772,6 @@ static void memcg_oom_recover(struct mem_cgroup *memcg)
 		__wake_up(&memcg_oom_waitq, TASK_NORMAL, 0, memcg);
 }
 
-enum oom_status {
-	OOM_SUCCESS,
-	OOM_FAILED,
-	OOM_ASYNC,
-	OOM_SKIPPED
-};
-
 static enum oom_status mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int order)
 {
 	enum oom_status ret;
@@ -1821,10 +1819,7 @@ static enum oom_status mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int
 		mem_cgroup_oom_notify(memcg);
 
 	mem_cgroup_unmark_under_oom(memcg);
-	if (mem_cgroup_out_of_memory(memcg, mask, order))
-		ret = OOM_SUCCESS;
-	else
-		ret = OOM_FAILED;
+	ret = mem_cgroup_out_of_memory(memcg, mask, order);
 
 	if (locked)
 		mem_cgroup_oom_unlock(memcg);
@@ -6102,7 +6097,8 @@ static ssize_t memory_max_write(struct kernfs_open_file *of,
 		}
 
 		memcg_memory_event(memcg, MEMCG_OOM);
-		if (!mem_cgroup_out_of_memory(memcg, GFP_KERNEL, 0))
+		if (mem_cgroup_out_of_memory(memcg, GFP_KERNEL, 0) !=
+		    OOM_SUCCESS)
 			break;
 	}
 
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index dfc357614e56..d5a941bea2d7 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -1042,18 +1042,18 @@ EXPORT_SYMBOL_GPL(unregister_oom_notifier);
  * OR try to be smart about which process to kill. Note that we
  * don't have to be perfect here, we just have to be good.
  */
-bool out_of_memory(struct oom_control *oc)
+enum oom_status out_of_memory(struct oom_control *oc)
 {
 	unsigned long freed = 0;
 
 	if (oom_killer_disabled)
-		return false;
+		return OOM_FAILED;
 
 	if (!is_memcg_oom(oc)) {
 		blocking_notifier_call_chain(&oom_notify_list, 0, &freed);
 		if (freed > 0)
 			/* Got some memory back in the last second. */
-			return true;
+			return OOM_SUCCESS;
 	}
 
 	/*
@@ -1064,7 +1064,7 @@ bool out_of_memory(struct oom_control *oc)
 	if (task_will_free_mem(current)) {
 		mark_oom_victim(current);
 		wake_oom_reaper(current);
-		return true;
+		return OOM_SUCCESS;
 	}
 
 	/*
@@ -1075,7 +1075,7 @@ bool out_of_memory(struct oom_control *oc)
 	 * invoke the OOM killer even if it is a GFP_NOFS allocation.
 	 */
 	if (oc->gfp_mask && !(oc->gfp_mask & __GFP_FS) && !is_memcg_oom(oc))
-		return true;
+		return OOM_SUCCESS;
 
 	/*
 	 * Check if there were limitations on the allocation (only relevant for
@@ -1093,7 +1093,7 @@ bool out_of_memory(struct oom_control *oc)
 		get_task_struct(current);
 		oc->chosen = current;
 		oom_kill_process(oc, "Out of memory (oom_kill_allocating_task)");
-		return true;
+		return OOM_SUCCESS;
 	}
 
 	select_bad_process(oc);
@@ -1112,7 +1112,7 @@ bool out_of_memory(struct oom_control *oc)
 	if (oc->chosen && oc->chosen != (void *)-1UL)
 		oom_kill_process(oc, !is_memcg_oom(oc) ? "Out of memory" :
 				 "Memory cgroup out of memory");
-	return !!oc->chosen;
+	return oc->chosen ? OOM_SUCCESS : OOM_FAILED;
 }
 
 /*
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 69827d4fa052..0926117eb921 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3917,7 +3917,8 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
 		goto out;
 
 	/* Exhausted what can be done so it's blame time */
-	if (out_of_memory(&oc) || WARN_ON_ONCE(gfp_mask & __GFP_NOFAIL)) {
+	if (out_of_memory(&oc) == OOM_SUCCESS ||
+	    WARN_ON_ONCE(gfp_mask & __GFP_NOFAIL)) {
 		*did_some_progress = 1;
 
 		/*
-- 
2.18.2



^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [PATCH 2/3] mm, memcg: introduce a new helper task_in_memcg_oom_set()
  2020-04-18 15:13 [PATCH 0/3] memcg oom: bail out from the charge path if no victim found Yafang Shao
  2020-04-18 15:13 ` [PATCH 1/3] mm: change the return type of out_of_memory() Yafang Shao
@ 2020-04-18 15:13 ` Yafang Shao
  2020-04-18 15:13 ` [PATCH 3/3] memcg oom: bail out from the charge path if no victim found Yafang Shao
  2 siblings, 0 replies; 11+ messages in thread
From: Yafang Shao @ 2020-04-18 15:13 UTC (permalink / raw)
  To: hannes, mhocko, vdavydov.dev, akpm; +Cc: linux-mm, Yafang Shao

A new helper task_in_memcg_oom_set() is introduced for later use. It
will be used to bail out from the charge path if no victim can be found
in memcg oom.

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
 include/linux/memcontrol.h | 25 +++++++++++++++++++++++++
 mm/memcontrol.c            | 12 ++----------
 2 files changed, 27 insertions(+), 10 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 98bd8fb2f5c7..767bac135787 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -565,6 +565,23 @@ static inline bool task_in_memcg_oom(struct task_struct *p)
 	return p->memcg_in_oom;
 }
 
+/* Bail out from the charge path if no victim found in memcg oom */
+static inline enum oom_status task_in_memcg_oom_set(struct task_struct *p,
+						    struct mem_cgroup *memcg,
+						    gfp_t mask,
+						    int order)
+{
+	if (!current->in_user_fault)
+		return OOM_SKIPPED;
+
+	css_get(&memcg->css);
+	p->memcg_in_oom = memcg;
+	p->memcg_oom_gfp_mask = mask;
+	p->memcg_oom_order = order;
+
+	return OOM_ASYNC;
+}
+
 bool mem_cgroup_oom_synchronize(bool wait);
 struct mem_cgroup *mem_cgroup_get_oom_group(struct task_struct *victim,
 					    struct mem_cgroup *oom_domain);
@@ -1031,6 +1048,14 @@ static inline bool task_in_memcg_oom(struct task_struct *p)
 	return false;
 }
 
+static inline enum oom_status task_in_memcg_oom_set(struct task_struct *p,
+						    struct mem_cgroup *memcg,
+						    gfp_t mask,
+						    int order)
+{
+	return OOM_SUCCESS;
+}
+
 static inline bool mem_cgroup_oom_synchronize(bool wait)
 {
 	return false;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 22418b55804f..d6cb1b786045 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1800,16 +1800,8 @@ static enum oom_status mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int
 	 * Please note that mem_cgroup_out_of_memory might fail to find a
 	 * victim and then we have to bail out from the charge path.
 	 */
-	if (memcg->oom_kill_disable) {
-		if (!current->in_user_fault)
-			return OOM_SKIPPED;
-		css_get(&memcg->css);
-		current->memcg_in_oom = memcg;
-		current->memcg_oom_gfp_mask = mask;
-		current->memcg_oom_order = order;
-
-		return OOM_ASYNC;
-	}
+	if (memcg->oom_kill_disable)
+		return task_in_memcg_oom_set(current, memcg, mask, order);
 
 	mem_cgroup_mark_under_oom(memcg);
 
-- 
2.18.2



^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [PATCH 3/3] memcg oom: bail out from the charge path if no victim found
  2020-04-18 15:13 [PATCH 0/3] memcg oom: bail out from the charge path if no victim found Yafang Shao
  2020-04-18 15:13 ` [PATCH 1/3] mm: change the return type of out_of_memory() Yafang Shao
  2020-04-18 15:13 ` [PATCH 2/3] mm, memcg: introduce a new helper task_in_memcg_oom_set() Yafang Shao
@ 2020-04-18 15:13 ` Yafang Shao
  2020-04-20  8:13   ` Michal Hocko
  2 siblings, 1 reply; 11+ messages in thread
From: Yafang Shao @ 2020-04-18 15:13 UTC (permalink / raw)
  To: hannes, mhocko, vdavydov.dev, akpm; +Cc: linux-mm, Yafang Shao

Leaving aside manually triggered OOM, if no victim is found in a
system OOM, the system will deadlock on memory; however, if no victim
is found in a memcg OOM, the charge succeeds and the task keeps
running. This behavior in memcg OOM is not proper because it allows
the memcg to exceed its limit.

Take a simple example.
        $ cd /sys/fs/cgroup/foo/
        $ echo $$ > cgroup.procs
        $ echo 200M > memory.max
        $ cat memory.max
        209715200
        $ echo -1000 > /proc/$$/oom_score_adj
Then, let's run a memhog task in memcg foo, which allocates 1G of
memory and keeps running.
        $ /home/yafang/test/memhog &
Then memory.current will be greater than memory.max. Run the command
below in another shell.
        $ cat /sys/fs/cgroup/foo/memory.current
        1097228288
The tasks which have already allocated memory and won't allocate new
memory still run well. This behavior makes no sense.

This patch improves that.
If no victim is found in a memcg OOM, we should force the current task
to wait until pages become available. That is similar to the behavior
in memcg v1 when oom_kill_disable is set.

In memcg v2, a memcg OOM can also be triggered manually, by reducing
memory.max. We should distinguish this manually triggered memcg OOM
from regular memcg OOM, so a magic value of oom_control.order == -2 is
set in this situation. The tasks waiting in memcg OOM will be woken up
when memory.max is enlarged. As these tasks are killable, we can also
kill them directly.

In memcg v1, this cooperates well with memory.oom_control
(oom_kill_disable). The tasks waiting in a memcg v1 OOM can be woken
up by enlarging memory.limit_in_bytes or by disabling
oom_kill_disable. As oom_kill_disable can be set manually, we should
distinguish it from regular memcg OOM as well. A new member,
memcg_oom_wait, is introduced in struct task_struct to handle this.

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
 include/linux/memcontrol.h |  8 ++++++--
 include/linux/sched.h      |  1 +
 mm/memcontrol.c            | 25 +++++++++++++++++++++----
 mm/oom_kill.c              | 22 ++++++++++++++++++++++
 4 files changed, 50 insertions(+), 6 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 767bac135787..5df2c08e2720 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -569,7 +569,8 @@ static inline bool task_in_memcg_oom(struct task_struct *p)
 static inline enum oom_status task_in_memcg_oom_set(struct task_struct *p,
 						    struct mem_cgroup *memcg,
 						    gfp_t mask,
-						    int order)
+						    int order,
+						    bool force)
 {
 	if (!current->in_user_fault)
 		return OOM_SKIPPED;
@@ -578,6 +579,8 @@ static inline enum oom_status task_in_memcg_oom_set(struct task_struct *p,
 	p->memcg_in_oom = memcg;
 	p->memcg_oom_gfp_mask = mask;
 	p->memcg_oom_order = order;
+	if (force)
+		p->memcg_oom_wait = true;
 
 	return OOM_ASYNC;
 }
@@ -1051,7 +1054,8 @@ static inline bool task_in_memcg_oom(struct task_struct *p)
 static inline enum oom_status task_in_memcg_oom_set(struct task_struct *p,
 						    struct mem_cgroup *memcg,
 						    gfp_t mask,
-						    int order)
+						    int order,
+						    bool force)
 {
 	return OOM_SUCCESS;
 }
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 4418f5cb8324..cc1c7de7c248 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -774,6 +774,7 @@ struct task_struct {
 #endif
 #ifdef CONFIG_MEMCG
 	unsigned			in_user_fault:1;
+	unsigned			memcg_oom_wait:1;
 #endif
 #ifdef CONFIG_COMPAT_BRK
 	unsigned			brk_randomized:1;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index d6cb1b786045..a637e13a4964 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1801,7 +1801,8 @@ static enum oom_status mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int
 	 * victim and then we have to bail out from the charge path.
 	 */
 	if (memcg->oom_kill_disable)
-		return task_in_memcg_oom_set(current, memcg, mask, order);
+		return task_in_memcg_oom_set(current, memcg, mask,
+					     order, false);
 
 	mem_cgroup_mark_under_oom(memcg);
 
@@ -1863,7 +1864,8 @@ bool mem_cgroup_oom_synchronize(bool handle)
 	if (locked)
 		mem_cgroup_oom_notify(memcg);
 
-	if (locked && !memcg->oom_kill_disable) {
+	if (locked && !memcg->oom_kill_disable &&
+	    !current->memcg_oom_wait) {
 		mem_cgroup_unmark_under_oom(memcg);
 		finish_wait(&memcg_oom_waitq, &owait.wait);
 		mem_cgroup_out_of_memory(memcg, current->memcg_oom_gfp_mask,
@@ -1883,7 +1885,10 @@ bool mem_cgroup_oom_synchronize(bool handle)
 		 */
 		memcg_oom_recover(memcg);
 	}
+
 cleanup:
+	if (current->memcg_oom_wait)
+		current->memcg_oom_wait = false;
 	current->memcg_in_oom = NULL;
 	css_put(&memcg->css);
 	return true;
@@ -6056,6 +6061,7 @@ static ssize_t memory_max_write(struct kernfs_open_file *of,
 	struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
 	unsigned int nr_reclaims = MEM_CGROUP_RECLAIM_RETRIES;
 	bool drained = false;
+	bool enlarge = false;
 	unsigned long max;
 	int err;
 
@@ -6069,7 +6075,12 @@ static ssize_t memory_max_write(struct kernfs_open_file *of,
 	for (;;) {
 		unsigned long nr_pages = page_counter_read(&memcg->memory);
 
-		if (nr_pages <= max)
+		if (nr_pages < max) {
+			enlarge = true;
+			break;
+		}
+
+		if (nr_pages == max)
 			break;
 
 		if (signal_pending(current))
@@ -6081,6 +6092,9 @@ static ssize_t memory_max_write(struct kernfs_open_file *of,
 			continue;
 		}
 
+		if (memcg->under_oom)
+			break;
+
 		if (nr_reclaims) {
 			if (!try_to_free_mem_cgroup_pages(memcg, nr_pages - max,
 							  GFP_KERNEL, true))
@@ -6089,11 +6103,14 @@ static ssize_t memory_max_write(struct kernfs_open_file *of,
 		}
 
 		memcg_memory_event(memcg, MEMCG_OOM);
-		if (mem_cgroup_out_of_memory(memcg, GFP_KERNEL, 0) !=
+		if (mem_cgroup_out_of_memory(memcg, GFP_KERNEL, -2) !=
 		    OOM_SUCCESS)
 			break;
 	}
 
+	if (enlarge)
+		memcg_oom_recover(memcg);
+
 	memcg_wb_domain_size_changed(memcg);
 	return nbytes;
 }
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index d5a941bea2d7..df564495c8b2 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -158,6 +158,15 @@ static inline bool is_sysrq_oom(struct oom_control *oc)
 	return oc->order == -1;
 }
 
+/*
+ * order == -2 means the oom kill is triggered by reducing a memcg's max;
+ * otherwise the order is only for display purposes.
+ */
+static bool is_max_write_oom(struct oom_control *oc)
+{
+	return oc->order == -2;
+}
+
 /* return true if the task is not adequate as candidate victim task. */
 static bool oom_unkillable_task(struct task_struct *p)
 {
@@ -1108,6 +1117,19 @@ enum oom_status out_of_memory(struct oom_control *oc)
 		 */
 		if (!is_sysrq_oom(oc) && !is_memcg_oom(oc))
 			panic("System is deadlocked on memory\n");
+
+		/* Bail out from the charge path if we can't find a victim. */
+		if (is_memcg_oom(oc)) {
+			if (is_max_write_oom(oc))
+				return OOM_SKIPPED;
+
+			return task_in_memcg_oom_set(current,
+						     oc->memcg,
+						     oc->gfp_mask,
+						     oc->order,
+						     true);
+		}
+
 	}
 	if (oc->chosen && oc->chosen != (void *)-1UL)
 		oom_kill_process(oc, !is_memcg_oom(oc) ? "Out of memory" :
-- 
2.18.2



^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [PATCH 3/3] memcg oom: bail out from the charge path if no victim found
  2020-04-18 15:13 ` [PATCH 3/3] memcg oom: bail out from the charge path if no victim found Yafang Shao
@ 2020-04-20  8:13   ` Michal Hocko
  2020-04-20  8:52     ` Yafang Shao
  0 siblings, 1 reply; 11+ messages in thread
From: Michal Hocko @ 2020-04-20  8:13 UTC (permalink / raw)
  To: Yafang Shao; +Cc: hannes, vdavydov.dev, akpm, linux-mm

On Sat 18-04-20 11:13:11, Yafang Shao wrote:
> Without considering the manually triggered OOM, if no victim found in
> system OOM, the system will be deadlocked on memory, however if no
> victim found in memcg OOM, it can charge successfully and runs well.
> This behavior in memcg oom is not proper because that can prevent the
> memcg from being limited.
> 
> Take an easy example.
>         $ cd /sys/fs/cgroup/foo/
> 	$ echo $$ > cgroup.procs
> 	$ echo 200M > memory.max
> 	$ cat memory.max
> 	209715200
> 	$ echo -1000 > /proc/$$/oom_score_adj
> Then, let's run a memhog task in memcg foo, which will allocate 1G
> memory and keeps running.
> 	$ /home/yafang/test/memhog &

Well, echo -1000 is a privileged operation, and it has to be used with
extreme care because you know that you are creating an unkillable
task. So the above test is a clear example of misconfiguration.

> Then memory.current will be greater than memory.max. Run bellow command
> in another shell.
> 	$ cat /sys/fs/cgroup/foo/memory.current
> 	1097228288
> The tasks which have already allocated memory and won't allocate new
> memory still runs well. This behavior makes nonsense.
> 
> This patch is to improve it.
> If no victim found in memcg oom, we should force the current task to
> wait until there's available pages. That is similar with the behavior in
> memcg1 when oom_kill_disable is set.

The primary reason why we force the charge is because we _cannot_ wait
indefinitely in the charge path because the current call chain might
hold locks or other resources which could block a large part of the
system. You are essentially reintroducing that behavior.

Is the above example a real use case, or have you just tried a test
case that would trigger the problem?
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 3/3] memcg oom: bail out from the charge path if no victim found
  2020-04-20  8:13   ` Michal Hocko
@ 2020-04-20  8:52     ` Yafang Shao
  2020-04-20  9:14       ` Michal Hocko
  0 siblings, 1 reply; 11+ messages in thread
From: Yafang Shao @ 2020-04-20  8:52 UTC (permalink / raw)
  To: Michal Hocko; +Cc: Johannes Weiner, Vladimir Davydov, Andrew Morton, Linux MM

On Mon, Apr 20, 2020 at 4:13 PM Michal Hocko <mhocko@kernel.org> wrote:
>
> On Sat 18-04-20 11:13:11, Yafang Shao wrote:
> > Without considering the manually triggered OOM, if no victim found in
> > system OOM, the system will be deadlocked on memory, however if no
> > victim found in memcg OOM, it can charge successfully and runs well.
> > This behavior in memcg oom is not proper because that can prevent the
> > memcg from being limited.
> >
> > Take an easy example.
> >         $ cd /sys/fs/cgroup/foo/
> >       $ echo $$ > cgroup.procs
> >       $ echo 200M > memory.max
> >       $ cat memory.max
> >       209715200
> >       $ echo -1000 > /proc/$$/oom_score_adj
> > Then, let's run a memhog task in memcg foo, which will allocate 1G
> > memory and keeps running.
> >       $ /home/yafang/test/memhog &
>
> Well, echo -1000 is a privileged operation. And it has to be used with
> an extreme care because you know that you are creating an unkillable
> task. So the above test is a clear example of the misconfiguration.
>

Right. This issue is really triggered by the misconfiguration.

> > Then memory.current will be greater than memory.max. Run bellow command
> > in another shell.
> >       $ cat /sys/fs/cgroup/foo/memory.current
> >       1097228288
> > The tasks which have already allocated memory and won't allocate new
> > memory still runs well. This behavior makes nonsense.
> >
> > This patch is to improve it.
> > If no victim found in memcg oom, we should force the current task to
> > wait until there's available pages. That is similar with the behavior in
> > memcg1 when oom_kill_disable is set.
>
> The primary reason why we force the charge is because we _cannot_ wait
> indefinitely in the charge path because the current call chain might
> hold locks or other resources which could block a large part of the
> system. You are essentially reintroducing that behavior.
>

Perhaps my poor English misled you.
The task is NOT waiting in the charge path; it is actually waiting
at the end of the page fault, so it doesn't hold any locks.
See the comment above mem_cgroup_oom_synchronize():

/*
 *  ...
 * Memcg supports userspace OOM handling where failed allocations must
 * sleep on a waitqueue until the userspace task resolves the
 * situation.  Sleeping directly in the charge context with all kinds
 * of locks held is not a good idea, instead we remember an OOM state
 * in the task and mem_cgroup_oom_synchronize() has to be called at
 * the end of the page fault to complete the OOM handling.
 * ...
 */
bool mem_cgroup_oom_synchronize(bool handle)


> Is the above example a real usecase or you have just tried a test case
> that would trigger the problem?

On my server I found that the memory usage of a container was greater
than its limit.
From dmesg I knew there were no killable tasks because oom_score_adj
was set to -1000.
Then I tried this test case to reproduce the issue.
This issue can be triggered by a misconfiguration of oom_score_adj,
and can also be triggered by a memory leak in a task with
oom_score_adj -1000.


Thanks
Yafang


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 3/3] memcg oom: bail out from the charge path if no victim found
  2020-04-20  8:52     ` Yafang Shao
@ 2020-04-20  9:14       ` Michal Hocko
  2020-04-20  9:58         ` Yafang Shao
  0 siblings, 1 reply; 11+ messages in thread
From: Michal Hocko @ 2020-04-20  9:14 UTC (permalink / raw)
  To: Yafang Shao; +Cc: Johannes Weiner, Vladimir Davydov, Andrew Morton, Linux MM

On Mon 20-04-20 16:52:05, Yafang Shao wrote:
> On Mon, Apr 20, 2020 at 4:13 PM Michal Hocko <mhocko@kernel.org> wrote:
> >
> > On Sat 18-04-20 11:13:11, Yafang Shao wrote:
[...]
> > > This patch is to improve it.
> > > If no victim found in memcg oom, we should force the current task to
> > > wait until there's available pages. That is similar with the behavior in
> > > memcg1 when oom_kill_disable is set.
> >
> > The primary reason why we force the charge is because we _cannot_ wait
> > indefinitely in the charge path because the current call chain might
> > hold locks or other resources which could block a large part of the
> > system. You are essentially reintroducing that behavior.
> >
> 
> Seems my poor English misleads you ?
> The task is NOT waiting in the charge path, while it is really waiting
> at the the end of the page fault, so it doesn't hold any locks.

How is that supposed to work? Sorry, I didn't really study your patch
very closely because it doesn't apply to the current Linus tree, and
your previous 2 patches have reshuffled the code, so it is not really
trivial to get a good picture of the overall logic change.

> See the comment above mem_cgroup_oom_synchronize()

Anyway mem_cgroup_oom_synchronize shouldn't really trigger unless the
oom handling is disabled (aka handed over to the userspace). All other
paths should handle the oom in the charge path. Please have a look at 
29ef680ae7c2 ("memcg, oom: move out_of_memory back to the charge path")
for more background and motivation.

mem_cgroup_oom_synchronize was a workaround for deadlocks and the side
effect was that all other charge paths outside of #PF were failing
allocations prematurely and that had an effect on user space.

> > Is the above example a real usecase or you have just tried a test case
> > that would trigger the problem?
> 
> On my server I found the memory usage of a container was greater than
> the limit of it.
> From the dmesg I know there's no killable tasks becasue the
> oom_score_adj is set with -1000.

I would really recommend addressing this problem in the userspace
configuration, either by increasing the memory limit or by fixing the
oom-disabled userspace to not consume that much memory.

> Then I tried this test case to produce this issue.
> This issue can be triggerer by the misconfiguration of oom_score_adj,
> and can also be tiggered by a memoy leak in the task  with
> oom_score_adj -1000.

Please note that there is not much the system can do about oom disabled
tasks that leak memory. Even the global case would slowly kill all other
userspace until it panics due to no eligible tasks. The oom_score_adj
knob has very strong consequences. Do not use it without very careful
consideration.
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 3/3] memcg oom: bail out from the charge path if no victim found
  2020-04-20  9:14       ` Michal Hocko
@ 2020-04-20  9:58         ` Yafang Shao
  2020-04-20 10:31           ` Michal Hocko
  0 siblings, 1 reply; 11+ messages in thread
From: Yafang Shao @ 2020-04-20  9:58 UTC (permalink / raw)
  To: Michal Hocko; +Cc: Johannes Weiner, Vladimir Davydov, Andrew Morton, Linux MM

On Mon, Apr 20, 2020 at 5:14 PM Michal Hocko <mhocko@kernel.org> wrote:
>
> On Mon 20-04-20 16:52:05, Yafang Shao wrote:
> > On Mon, Apr 20, 2020 at 4:13 PM Michal Hocko <mhocko@kernel.org> wrote:
> > >
> > > On Sat 18-04-20 11:13:11, Yafang Shao wrote:
> [...]
> > > > This patch is to improve it.
> > > > If no victim found in memcg oom, we should force the current task to
> > > > wait until there's available pages. That is similar with the behavior in
> > > > memcg1 when oom_kill_disable is set.
> > >
> > > The primary reason why we force the charge is because we _cannot_ wait
> > > indefinitely in the charge path because the current call chain might
> > > hold locks or other resources which could block a large part of the
> > > system. You are essentially reintroducing that behavior.
> > >
> >
> > Seems my poor English misleads you ?
> > The task is NOT waiting in the charge path, while it is really waiting
> > at the the end of the page fault, so it doesn't hold any locks.
>
> How is that supposed to work? Sorry I didn't really study your patch
> very closely because it doesn't apply on the current Linus' tree and
> your previous 2 patches have reshuffled the code so it is not really
> trivial to have a good picture of the overall logic change.
>

My patch is based on commit 8632e9b5645b; I can rebase it to make
reviewing easier.
Here is the overall logic of the patch.
do_page_fault
    mem_cgroup_try_charge
        mem_cgroup_out_of_memory  <<< over the limit of this memcg
            out_of_memory
                  if (!oc->chosen)  <<<< no killable tasks found
                        Set_an_OOM_state_in_the_task <<<< set the oom state
    mm_fault_error
        pagefault_out_of_memory  <<<< VM_FAULT_OOM returned by the previous error
            mem_cgroup_oom_synchronize(true)
                 Check_the_OOM_state_and_then_wait_here <<<< check the oom state

> > See the comment above mem_cgroup_oom_synchronize()
>
> Anyway mem_cgroup_oom_synchronize shouldn't really trigger unless the
> oom handling is disabled (aka handed over to the userspace). All other
> paths should handle the oom in the charge path.

Right. Now this patch introduces another path to enter
mem_cgroup_oom_synchronize().

>  Please have a look at
> 29ef680ae7c2 ("memcg, oom: move out_of_memory back to the charge path")
> for more background and motivation.
>

Before I send this patch, I have read it carefully.

> mem_cgroup_oom_synchronize was a workaround for deadlocks and the side
> effect was that all other charge paths outside of #PF were failing
> allocations prematurely and that had an effect to user space.
>

I guess this side effect is caused by the precision of the page
counter; for example, the page counter isn't modified immediately
after uncharging the pages. That's the issue we should improve, IMHO.

> > > Is the above example a real usecase, or have you just tried a test case
> > > that would trigger the problem?
> >
> > On my server I found that the memory usage of a container was greater
> > than its limit.
> > From the dmesg I know there are no killable tasks because the
> > oom_score_adj is set to -1000.
>
> I would really recommend to address this problem in the userspace
> configuration. Either by increasing the memory limit or fixing the
> oom disabled userspace to not consume that much memory.
>

This issue can be addressed in the userspace configuration.
But note that there are many containers running on a single host;
what we should do is try to keep the isolation as strong as possible.
If we don't take any action in the kernel, users will complain to
us that their service is easily affected by the weak isolation of the
container.

> > Then I tried this test case to reproduce this issue.
> > This issue can be triggered by a misconfiguration of oom_score_adj,
> > and can also be triggered by a memory leak in a task with
> > oom_score_adj -1000.
>
> Please note that there is not much the system can do about oom disabled
> tasks that leak memory. Even the global case would slowly kill all other
> userspace until it panics due to no eligible tasks. The oom_score_adj
> has very strong consequences. Do not use it without very careful
> consideration.

global case -> kill others until the system panics.
container case -> kill others until no tasks can run in the container

I think this is consistent behavior.

Thanks
Yafang


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 3/3] memcg oom: bail out from the charge path if no victim found
  2020-04-20  9:58         ` Yafang Shao
@ 2020-04-20 10:31           ` Michal Hocko
  2020-04-20 10:51             ` Yafang Shao
  0 siblings, 1 reply; 11+ messages in thread
From: Michal Hocko @ 2020-04-20 10:31 UTC (permalink / raw)
  To: Yafang Shao; +Cc: Johannes Weiner, Vladimir Davydov, Andrew Morton, Linux MM

On Mon 20-04-20 17:58:03, Yafang Shao wrote:
> On Mon, Apr 20, 2020 at 5:14 PM Michal Hocko <mhocko@kernel.org> wrote:
> >
> > On Mon 20-04-20 16:52:05, Yafang Shao wrote:
> > > On Mon, Apr 20, 2020 at 4:13 PM Michal Hocko <mhocko@kernel.org> wrote:
> > > >
> > > > On Sat 18-04-20 11:13:11, Yafang Shao wrote:
> > [...]
> > > > > This patch is to improve it.
> > > > > If no victim found in memcg oom, we should force the current task to
> > > > > wait until there's available pages. That is similar with the behavior in
> > > > > memcg1 when oom_kill_disable is set.
> > > >
> > > > The primary reason why we force the charge is because we _cannot_ wait
> > > > indefinitely in the charge path because the current call chain might
> > > > hold locks or other resources which could block a large part of the
> > > > system. You are essentially reintroducing that behavior.
> > > >
> > >
> > > Seems my poor English misled you?
> > > The task is NOT waiting in the charge path, while it is really waiting
> > > at the end of the page fault, so it doesn't hold any locks.
> >
> > How is that supposed to work? Sorry I didn't really study your patch
> > very closely because it doesn't apply on the current Linus' tree and
> > your previous 2 patches have reshuffled the code so it is not really
> > trivial to have a good picture of the overall logic change.
> >
> 
> My patch is based on commit 8632e9b5645b, and I can rebase it for
> easier review.
> Here is the overall logic of the patch.
> do_page_fault
>     mem_cgroup_try_charge
>         mem_cgroup_out_of_memory  <<< over the limit of this memcg
>             out_of_memory
>                   if (!oc->chosen)  <<<< no killable tasks found
>                         Set_an_OOM_state_in_the_task <<<< set the oom state
>     mm_fault_error
>         pagefault_out_of_memory  <<<< VM_FAULT_OOM was returned by the
> previous error
>             mem_cgroup_oom_synchronize(true)
>                  Check_the_OOM_state_and_then_wait_here <<<< check the oom state

OK, I see. So this is a hybrid model. My primary concern would be that
the issues seen previously with the #PF based approach will happen
again, albeit in a reduced form. It is hard to judge whether this is
good. A rare error case might be even worse because it is harder to
predict.
 
> > > See the comment above mem_cgroup_oom_synchronize()
> >
> > Anyway mem_cgroup_oom_synchronize shouldn't really trigger unless the
> > oom handling is disabled (aka handed over to the userspace). All other
> > paths should handle the oom in the charge path.
> 
> Right. Now this patch introduces another path to enter
> mem_cgroup_oom_synchronize().
> 
> >  Please have a look at
> > 29ef680ae7c2 ("memcg, oom: move out_of_memory back to the charge path")
> > for more background and motivation.
> >
> 
> Before sending this patch, I read it carefully.
> 
> > mem_cgroup_oom_synchronize was a workaround for deadlocks and the side
> > effect was that all other charge paths outside of #PF were failing
> > allocations prematurely and that had an effect to user space.
> >
> 
> I guess this side effect is caused by the precision of the page
> counter - for example, the counter isn't updated immediately after
> uncharging pages. That's the issue we should improve, IMHO.

No, this is not really the case. If you have a look at gup users or
any kernel memory charges, you can find many places that return
ENOMEM even when it is not really expected. The most notable example
was MAP_POPULATE failures.
 
> > > > Is the above example a real usecase, or have you just tried a test case
> > > > that would trigger the problem?
> > >
> > > On my server I found that the memory usage of a container was greater
> > > than its limit.
> > > From the dmesg I know there are no killable tasks because the
> > > oom_score_adj is set to -1000.
> >
> > I would really recommend to address this problem in the userspace
> > configuration. Either by increasing the memory limit or fixing the
> > oom disabled userspace to not consume that much memory.
> >
> 
> This issue can be addressed in the userspace configuration.
> But note that there are many containers running on a single host;
> what we should do is try to keep the isolation as strong as possible.

I do agree with this, but there is a line between isolation and the
proper configuration it requires. You will never get 100% isolation
in the first place. You have to be careful not to share resources
which are inherently hard to isolate - e.g. page cache, file system
metadata and many others.

While your example is not the same, it is similar. Disabling the oom
killer for a task implies that all the resources held by that task
cannot be reclaimed/rebalanced, so isolation is much harder to
achieve, if possible at all. There is only one reliable way to handle
OOM in this situation, and that is breaking the contract and killing
those tasks. I am not really convinced this is what we want to do.

> If we don't take any action in the kernel, users will complain to
> us that their service is easily affected by the weak isolation of the
> container.

Yes, I can imagine that, and we will carefully explain that disabling
oom for some tasks is a very dangerous thing to do. Maybe we are not
explicit enough about that in our documentation now, and not all the
consequences are really clear.

If this turns out to be infeasible, then we should address that
problem for all possible cases, and that means allowing the
oom_score_adj contract to be broken and killing even hidden tasks.
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 3/3] memcg oom: bail out from the charge path if no victim found
  2020-04-20 10:31           ` Michal Hocko
@ 2020-04-20 10:51             ` Yafang Shao
  2020-04-20 11:10               ` Michal Hocko
  0 siblings, 1 reply; 11+ messages in thread
From: Yafang Shao @ 2020-04-20 10:51 UTC (permalink / raw)
  To: Michal Hocko; +Cc: Johannes Weiner, Vladimir Davydov, Andrew Morton, Linux MM

On Mon, Apr 20, 2020 at 6:31 PM Michal Hocko <mhocko@kernel.org> wrote:
>
> On Mon 20-04-20 17:58:03, Yafang Shao wrote:
> > On Mon, Apr 20, 2020 at 5:14 PM Michal Hocko <mhocko@kernel.org> wrote:
> > >
> > > On Mon 20-04-20 16:52:05, Yafang Shao wrote:
> > > > On Mon, Apr 20, 2020 at 4:13 PM Michal Hocko <mhocko@kernel.org> wrote:
> > > > >
> > > > > On Sat 18-04-20 11:13:11, Yafang Shao wrote:
> > > [...]
> > > > > > This patch is to improve it.
> > > > > > If no victim found in memcg oom, we should force the current task to
> > > > > > wait until there's available pages. That is similar with the behavior in
> > > > > > memcg1 when oom_kill_disable is set.
> > > > >
> > > > > The primary reason why we force the charge is because we _cannot_ wait
> > > > > indefinitely in the charge path because the current call chain might
> > > > > hold locks or other resources which could block a large part of the
> > > > > system. You are essentially reintroducing that behavior.
> > > > >
> > > >
> > > > Seems my poor English misled you?
> > > > The task is NOT waiting in the charge path, while it is really waiting
> > > > at the end of the page fault, so it doesn't hold any locks.
> > >
> > > How is that supposed to work? Sorry I didn't really study your patch
> > > very closely because it doesn't apply on the current Linus' tree and
> > > your previous 2 patches have reshuffled the code so it is not really
> > > trivial to have a good picture of the overall logic change.
> > >
> >
> > My patch is based on commit 8632e9b5645b, and I can rebase it for
> > easier review.
> > Here is the overall logic of the patch.
> > do_page_fault
> >     mem_cgroup_try_charge
> >         mem_cgroup_out_of_memory  <<< over the limit of this memcg
> >             out_of_memory
> >                   if (!oc->chosen)  <<<< no killable tasks found
> >                         Set_an_OOM_state_in_the_task <<<< set the oom state
> >     mm_fault_error
> >         pagefault_out_of_memory  <<<< VM_FAULT_OOM was returned by the
> > previous error
> >             mem_cgroup_oom_synchronize(true)
> >                  Check_the_OOM_state_and_then_wait_here <<<< check the oom state
>
> OK, I see. So this is a hybrid model. My primary concern would be that
> the issues seen previously with the #PF based approach will happen
> again, albeit in a reduced form. It is hard to judge whether this is
> good. A rare error case might be even worse because it is harder to
> predict.
>
> > > > See the comment above mem_cgroup_oom_synchronize()
> > >
> > > Anyway mem_cgroup_oom_synchronize shouldn't really trigger unless the
> > > oom handling is disabled (aka handed over to the userspace). All other
> > > paths should handle the oom in the charge path.
> >
> > Right. Now this patch introduces another path to enter
> > mem_cgroup_oom_synchronize().
> >
> > >  Please have a look at
> > > 29ef680ae7c2 ("memcg, oom: move out_of_memory back to the charge path")
> > > for more background and motivation.
> > >
> >
> > Before sending this patch, I read it carefully.
> >
> > > mem_cgroup_oom_synchronize was a workaround for deadlocks and the side
> > > effect was that all other charge paths outside of #PF were failing
> > > allocations prematurely and that had an effect to user space.
> > >
> >
> > I guess this side effect is caused by the precision of the page
> > counter - for example, the counter isn't updated immediately after
> > uncharging pages. That's the issue we should improve, IMHO.
>
> No, this is not really the case. If you have a look at gup users or
> any kernel memory charges, you can find many places that return
> ENOMEM even when it is not really expected. The most notable example
> was MAP_POPULATE failures.
>

Thanks for your explanation. I will take a look at that code.

> > > > > Is the above example a real usecase, or have you just tried a test case
> > > > > that would trigger the problem?
> > > >
> > > > On my server I found that the memory usage of a container was greater
> > > > than its limit.
> > > > From the dmesg I know there are no killable tasks because the
> > > > oom_score_adj is set to -1000.
> > >
> > > I would really recommend to address this problem in the userspace
> > > configuration. Either by increasing the memory limit or fixing the
> > > oom disabled userspace to not consume that much memory.
> > >
> >
> > This issue can be addressed in the userspace configuration.
> > But note that there are many containers running on a single host;
> > what we should do is try to keep the isolation as strong as possible.
>
> I do agree with this, but there is a line between isolation and the
> proper configuration it requires. You will never get 100% isolation
> in the first place. You have to be careful not to share resources
> which are inherently hard to isolate - e.g. page cache, file system
> metadata and many others.
>
> While your example is not the same, it is similar. Disabling the oom
> killer for a task implies that all the resources held by that task
> cannot be reclaimed/rebalanced, so isolation is much harder to
> achieve, if possible at all. There is only one reliable way to handle
> OOM in this situation, and that is breaking the contract and killing
> those tasks. I am not really convinced this is what we want to do.
>
> > If we don't take any action in the kernel, users will complain to
> > us that their service is easily affected by the weak isolation of the
> > container.
>
> Yes, I can imagine that, and we will carefully explain that disabling
> oom for some tasks is a very dangerous thing to do. Maybe we are not
> explicit enough about that in our documentation now, and not all the
> consequences are really clear.
>
> If this turns out to be infeasible, then we should address that
> problem for all possible cases, and that means allowing the
> oom_score_adj contract to be broken and killing even hidden tasks.

Breaking the oom_score_adj contract seems like another possible way
that would be accepted by users - misconfiguration or bugs in user
code should be punished. IOW, this is the user's fault, and the
kernel should make the consequences of that fault visible to the user.

Thanks
Yafang


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 3/3] memcg oom: bail out from the charge path if no victim found
  2020-04-20 10:51             ` Yafang Shao
@ 2020-04-20 11:10               ` Michal Hocko
  0 siblings, 0 replies; 11+ messages in thread
From: Michal Hocko @ 2020-04-20 11:10 UTC (permalink / raw)
  To: Yafang Shao; +Cc: Johannes Weiner, Vladimir Davydov, Andrew Morton, Linux MM

On Mon 20-04-20 18:51:50, Yafang Shao wrote:
> On Mon, Apr 20, 2020 at 6:31 PM Michal Hocko <mhocko@kernel.org> wrote:
[...]
> > If this turns out to be infeasible then we should be addressing that
> > problem for all possible cases and that means to allow breaking the
> > oom_score_adj contract and kill also hidden tasks.
> 
> Breaking the oom_score_adj contract seems like another possible way
> that would be accepted by users - misconfiguration or bugs in user
> code should be punished. IOW, this is the user's fault, and the
> kernel should make the consequences of that fault visible to the user.

This is not how the kernel behaves in the vast majority of cases. We
allow users to shoot themselves in the foot, especially with
root-only interfaces. We simply trust that admins know what they are
doing.

So let me repeat: there has to be a very strong justification - e.g.
that it is impossible for some reason to remove the oom_score_adj OOM
killer disabling, and that this really happens out there in real
deployments. We are surely not going to break the contract based on
artificially misconfigured test cases.

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2020-04-20 11:10 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-04-18 15:13 [PATCH 0/3] memcg oom: bail out from the charge path if no victim found Yafang Shao
2020-04-18 15:13 ` [PATCH 1/3] mm: change the return type of out_of_memory() Yafang Shao
2020-04-18 15:13 ` [PATCH 2/3] mm, memcg: introduce a new helper task_in_memcg_oom_set() Yafang Shao
2020-04-18 15:13 ` [PATCH 3/3] memcg oom: bail out from the charge path if no victim found Yafang Shao
2020-04-20  8:13   ` Michal Hocko
2020-04-20  8:52     ` Yafang Shao
2020-04-20  9:14       ` Michal Hocko
2020-04-20  9:58         ` Yafang Shao
2020-04-20 10:31           ` Michal Hocko
2020-04-20 10:51             ` Yafang Shao
2020-04-20 11:10               ` Michal Hocko
