linux-mm.kvack.org archive mirror
* [RFC PATCH v2 0/7] cgroup-aware OOM killer
@ 2017-06-01 18:35 Roman Gushchin
  2017-06-01 18:35 ` [RFC PATCH v2 1/7] mm, oom: refactor select_bad_process() to take memcg as an argument Roman Gushchin
                   ` (7 more replies)
  0 siblings, 8 replies; 22+ messages in thread
From: Roman Gushchin @ 2017-06-01 18:35 UTC (permalink / raw)
  To: linux-mm; +Cc: Roman Gushchin

This patchset makes the OOM killer cgroup-aware.

Patches 1-3 are simple refactorings of the OOM killer code,
required to reuse the code in the memory controller.
Patches 4 & 5 introduce new memcg settings:
oom_kill_all_tasks and oom_score_adj.
Patch 6 introduces the cgroup-aware OOM killer.
Patch 7 is a docs update.

v1:
  https://lkml.org/lkml/2017/5/18/969

v2:
  - Reworked victim selection based on feedback
    from Michal Hocko, Vladimir Davydov and Johannes Weiner
  - "Kill all tasks" is now an opt-in option, by default
    only one process will be killed
  - Added per-cgroup oom_score_adj
  - Refined oom score calculations, suggested by Vladimir Davydov
  - Converted to a patchset


Roman Gushchin (7):
  mm, oom: refactor select_bad_process() to take memcg as an argument
  mm, oom: split oom_kill_process() and export __oom_kill_process()
  mm, oom: export oom_evaluate_task() and select_bad_process()
  mm, oom: introduce oom_kill_all_tasks option for memory cgroups
  mm, oom: introduce oom_score_adj for memory cgroups
  mm, oom: cgroup-aware OOM killer
  mm,oom,docs: describe the cgroup-aware OOM killer

 Documentation/cgroup-v2.txt |  47 ++++++++-
 include/linux/memcontrol.h  |  19 ++++
 include/linux/oom.h         |   6 ++
 mm/memcontrol.c             | 247 ++++++++++++++++++++++++++++++++++++++++++++
 mm/oom_kill.c               | 137 +++++++++++++-----------
 5 files changed, 392 insertions(+), 64 deletions(-)

-- 
2.7.4

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <dont@kvack.org>

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [RFC PATCH v2 1/7] mm, oom: refactor select_bad_process() to take memcg as an argument
  2017-06-01 18:35 [RFC PATCH v2 0/7] cgroup-aware OOM killer Roman Gushchin
@ 2017-06-01 18:35 ` Roman Gushchin
  2017-06-04 19:25   ` Vladimir Davydov
  2017-06-04 22:50   ` David Rientjes
  2017-06-01 18:35 ` [RFC PATCH v2 2/7] mm, oom: split oom_kill_process() and export __oom_kill_process() Roman Gushchin
                   ` (6 subsequent siblings)
  7 siblings, 2 replies; 22+ messages in thread
From: Roman Gushchin @ 2017-06-01 18:35 UTC (permalink / raw)
  To: linux-mm
  Cc: Roman Gushchin, Tejun Heo, Johannes Weiner, Li Zefan,
	Michal Hocko, Vladimir Davydov, Tetsuo Handa, kernel-team,
	cgroups, linux-doc, linux-kernel

The select_bad_process() function will be used later
to select a process to kill in the victim cgroup.
This cgroup doesn't necessarily match oc->memcg,
which is the cgroup whose limits caused the cgroup-wide OOM
(or NULL in the case of a global OOM).

So, refactor select_bad_process() to take a pointer to
the cgroup to iterate over as an argument.

Signed-off-by: Roman Gushchin <guro@fb.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: kernel-team@fb.com
Cc: cgroups@vger.kernel.org
Cc: linux-doc@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
---
 mm/oom_kill.c | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 04c9143..f8b0fb1 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -343,10 +343,11 @@ static int oom_evaluate_task(struct task_struct *task, void *arg)
  * Simple selection loop. We choose the process with the highest number of
  * 'points'. In case scan was aborted, oc->chosen is set to -1.
  */
-static void select_bad_process(struct oom_control *oc)
+static void select_bad_process(struct oom_control *oc,
+			       struct mem_cgroup *memcg)
 {
-	if (is_memcg_oom(oc))
-		mem_cgroup_scan_tasks(oc->memcg, oom_evaluate_task, oc);
+	if (memcg)
+		mem_cgroup_scan_tasks(memcg, oom_evaluate_task, oc);
 	else {
 		struct task_struct *p;
 
@@ -1032,7 +1033,7 @@ bool out_of_memory(struct oom_control *oc)
 		return true;
 	}
 
-	select_bad_process(oc);
+	select_bad_process(oc, oc->memcg);
 	/* Found nothing?!?! Either we hang forever, or we panic. */
 	if (!oc->chosen && !is_sysrq_oom(oc) && !is_memcg_oom(oc)) {
 		dump_header(oc, NULL);
-- 
2.7.4



* [RFC PATCH v2 2/7] mm, oom: split oom_kill_process() and export __oom_kill_process()
  2017-06-01 18:35 [RFC PATCH v2 0/7] cgroup-aware OOM killer Roman Gushchin
  2017-06-01 18:35 ` [RFC PATCH v2 1/7] mm, oom: refactor select_bad_process() to take memcg as an argument Roman Gushchin
@ 2017-06-01 18:35 ` Roman Gushchin
  2017-06-01 18:35 ` [RFC PATCH v2 3/7] mm, oom: export oom_evaluate_task() and select_bad_process() Roman Gushchin
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 22+ messages in thread
From: Roman Gushchin @ 2017-06-01 18:35 UTC (permalink / raw)
  To: linux-mm
  Cc: Roman Gushchin, Tejun Heo, Johannes Weiner, Li Zefan,
	Michal Hocko, Vladimir Davydov, Tetsuo Handa, kernel-team,
	cgroups, linux-doc, linux-kernel

The oom_kill_process() function consists of two logical parts:
the first one is responsible for considering the task's children
as potential victims and for some debug output.
The second half is responsible for sending SIGKILL to all
tasks sharing the mm with the given victim.

This commit splits the oom_kill_process() function
and exports __oom_kill_process() (the second half), so that
it can be reused from the memory cgroup code.

Signed-off-by: Roman Gushchin <guro@fb.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: kernel-team@fb.com
Cc: cgroups@vger.kernel.org
Cc: linux-doc@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
---
 include/linux/oom.h |   2 +
 mm/oom_kill.c       | 121 +++++++++++++++++++++++++++-------------------------
 2 files changed, 66 insertions(+), 57 deletions(-)

diff --git a/include/linux/oom.h b/include/linux/oom.h
index 8a266e2..47d9495 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -79,6 +79,8 @@ extern void oom_killer_enable(void);
 
 extern struct task_struct *find_lock_task_mm(struct task_struct *p);
 
+extern void __oom_kill_process(struct task_struct *victim);
+
 /* sysctls */
 extern int sysctl_oom_dump_tasks;
 extern int sysctl_oom_kill_allocating_task;
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index f8b0fb1..11b60a5 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -803,67 +803,12 @@ static bool task_will_free_mem(struct task_struct *task)
 	return ret;
 }
 
-static void oom_kill_process(struct oom_control *oc, const char *message)
+void __oom_kill_process(struct task_struct *victim)
 {
-	struct task_struct *p = oc->chosen;
-	unsigned int points = oc->chosen_points;
-	struct task_struct *victim = p;
-	struct task_struct *child;
-	struct task_struct *t;
+	struct task_struct *p;
 	struct mm_struct *mm;
-	unsigned int victim_points = 0;
-	static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
-					      DEFAULT_RATELIMIT_BURST);
 	bool can_oom_reap = true;
 
-	/*
-	 * If the task is already exiting, don't alarm the sysadmin or kill
-	 * its children or threads, just set TIF_MEMDIE so it can die quickly
-	 */
-	task_lock(p);
-	if (task_will_free_mem(p)) {
-		mark_oom_victim(p);
-		wake_oom_reaper(p);
-		task_unlock(p);
-		put_task_struct(p);
-		return;
-	}
-	task_unlock(p);
-
-	if (__ratelimit(&oom_rs))
-		dump_header(oc, p);
-
-	pr_err("%s: Kill process %d (%s) score %u or sacrifice child\n",
-		message, task_pid_nr(p), p->comm, points);
-
-	/*
-	 * If any of p's children has a different mm and is eligible for kill,
-	 * the one with the highest oom_badness() score is sacrificed for its
-	 * parent.  This attempts to lose the minimal amount of work done while
-	 * still freeing memory.
-	 */
-	read_lock(&tasklist_lock);
-	for_each_thread(p, t) {
-		list_for_each_entry(child, &t->children, sibling) {
-			unsigned int child_points;
-
-			if (process_shares_mm(child, p->mm))
-				continue;
-			/*
-			 * oom_badness() returns 0 if the thread is unkillable
-			 */
-			child_points = oom_badness(child,
-				oc->memcg, oc->nodemask, oc->totalpages);
-			if (child_points > victim_points) {
-				put_task_struct(victim);
-				victim = child;
-				victim_points = child_points;
-				get_task_struct(victim);
-			}
-		}
-	}
-	read_unlock(&tasklist_lock);
-
 	p = find_lock_task_mm(victim);
 	if (!p) {
 		put_task_struct(victim);
@@ -930,6 +875,68 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
 	mmdrop(mm);
 	put_task_struct(victim);
 }
+
+static void oom_kill_process(struct oom_control *oc, const char *message)
+{
+	struct task_struct *p = oc->chosen;
+	unsigned int points = oc->chosen_points;
+	struct task_struct *victim = p;
+	struct task_struct *child;
+	struct task_struct *t;
+	unsigned int victim_points = 0;
+	static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
+					      DEFAULT_RATELIMIT_BURST);
+
+	/*
+	 * If the task is already exiting, don't alarm the sysadmin or kill
+	 * its children or threads, just set TIF_MEMDIE so it can die quickly
+	 */
+	task_lock(p);
+	if (task_will_free_mem(p)) {
+		mark_oom_victim(p);
+		wake_oom_reaper(p);
+		task_unlock(p);
+		put_task_struct(p);
+		return;
+	}
+	task_unlock(p);
+
+	if (__ratelimit(&oom_rs))
+		dump_header(oc, p);
+
+	pr_err("%s: Kill process %d (%s) score %u or sacrifice child\n",
+		message, task_pid_nr(p), p->comm, points);
+
+	/*
+	 * If any of p's children has a different mm and is eligible for kill,
+	 * the one with the highest oom_badness() score is sacrificed for its
+	 * parent.  This attempts to lose the minimal amount of work done while
+	 * still freeing memory.
+	 */
+	read_lock(&tasklist_lock);
+	for_each_thread(p, t) {
+		list_for_each_entry(child, &t->children, sibling) {
+			unsigned int child_points;
+
+			if (process_shares_mm(child, p->mm))
+				continue;
+			/*
+			 * oom_badness() returns 0 if the thread is unkillable
+			 */
+			child_points = oom_badness(child,
+				oc->memcg, oc->nodemask, oc->totalpages);
+			if (child_points > victim_points) {
+				put_task_struct(victim);
+				victim = child;
+				victim_points = child_points;
+				get_task_struct(victim);
+			}
+		}
+	}
+	read_unlock(&tasklist_lock);
+
+	__oom_kill_process(victim);
+}
 #undef K
 
 /*
-- 
2.7.4



* [RFC PATCH v2 3/7] mm, oom: export oom_evaluate_task() and select_bad_process()
  2017-06-01 18:35 [RFC PATCH v2 0/7] cgroup-aware OOM killer Roman Gushchin
  2017-06-01 18:35 ` [RFC PATCH v2 1/7] mm, oom: refactor select_bad_process() to take memcg as an argument Roman Gushchin
  2017-06-01 18:35 ` [RFC PATCH v2 2/7] mm, oom: split oom_kill_process() and export __oom_kill_process() Roman Gushchin
@ 2017-06-01 18:35 ` Roman Gushchin
  2017-06-01 18:35 ` [RFC PATCH v2 4/7] mm, oom: introduce oom_kill_all_tasks option for memory cgroups Roman Gushchin
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 22+ messages in thread
From: Roman Gushchin @ 2017-06-01 18:35 UTC (permalink / raw)
  To: linux-mm
  Cc: Roman Gushchin, Tejun Heo, Johannes Weiner, Li Zefan,
	Michal Hocko, Vladimir Davydov, Tetsuo Handa, kernel-team,
	cgroups, linux-doc, linux-kernel

Export the oom_evaluate_task() and select_bad_process()
functions so they can be reused by the victim selection
logic in the cgroup-aware OOM killer.

Signed-off-by: Roman Gushchin <guro@fb.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: kernel-team@fb.com
Cc: cgroups@vger.kernel.org
Cc: linux-doc@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
---
 include/linux/oom.h | 3 +++
 mm/oom_kill.c       | 5 ++---
 2 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/include/linux/oom.h b/include/linux/oom.h
index 47d9495..edf7a77 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -80,6 +80,9 @@ extern void oom_killer_enable(void);
 extern struct task_struct *find_lock_task_mm(struct task_struct *p);
 
 extern void __oom_kill_process(struct task_struct *victim);
+extern int oom_evaluate_task(struct task_struct *task, void *arg);
+extern void select_bad_process(struct oom_control *oc,
+			       struct mem_cgroup *memcg);
 
 /* sysctls */
 extern int sysctl_oom_dump_tasks;
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 11b60a5..8cf77fb 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -288,7 +288,7 @@ static enum oom_constraint constrained_alloc(struct oom_control *oc)
 	return CONSTRAINT_NONE;
 }
 
-static int oom_evaluate_task(struct task_struct *task, void *arg)
+int oom_evaluate_task(struct task_struct *task, void *arg)
 {
 	struct oom_control *oc = arg;
 	unsigned long points;
@@ -343,8 +343,7 @@ static int oom_evaluate_task(struct task_struct *task, void *arg)
  * Simple selection loop. We choose the process with the highest number of
  * 'points'. In case scan was aborted, oc->chosen is set to -1.
  */
-static void select_bad_process(struct oom_control *oc,
-			       struct mem_cgroup *memcg)
+void select_bad_process(struct oom_control *oc, struct mem_cgroup *memcg)
 {
 	if (memcg)
 		mem_cgroup_scan_tasks(memcg, oom_evaluate_task, oc);
-- 
2.7.4



* [RFC PATCH v2 4/7] mm, oom: introduce oom_kill_all_tasks option for memory cgroups
  2017-06-01 18:35 [RFC PATCH v2 0/7] cgroup-aware OOM killer Roman Gushchin
                   ` (2 preceding siblings ...)
  2017-06-01 18:35 ` [RFC PATCH v2 3/7] mm, oom: export oom_evaluate_task() and select_bad_process() Roman Gushchin
@ 2017-06-01 18:35 ` Roman Gushchin
  2017-06-04 19:30   ` Vladimir Davydov
  2017-06-01 18:35 ` [RFC PATCH v2 5/7] mm, oom: introduce oom_score_adj " Roman Gushchin
                   ` (3 subsequent siblings)
  7 siblings, 1 reply; 22+ messages in thread
From: Roman Gushchin @ 2017-06-01 18:35 UTC (permalink / raw)
  To: linux-mm
  Cc: Roman Gushchin, Tejun Heo, Johannes Weiner, Li Zefan,
	Michal Hocko, Vladimir Davydov, Tetsuo Handa, kernel-team,
	cgroups, linux-doc, linux-kernel

This option defines whether a cgroup should be treated
as a single entity by the OOM killer.

If set, the OOM killer will compare the whole cgroup with other
memory consumers (other cgroups and tasks in the root cgroup),
and in case of an OOM it will kill all tasks belonging to the cgroup.

Disabled by default.
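
As an illustration, assuming the unified hierarchy is mounted at
/sys/fs/cgroup and a cgroup named "job" exists (both the mount
point and the cgroup name are assumptions, not part of the patch),
the knob could be toggled like this:

```shell
# Opt the whole cgroup in: on OOM, kill all of its tasks at once.
echo 1 > /sys/fs/cgroup/job/memory.oom_kill_all_tasks

# Read back the current setting (0 = disabled, the default).
cat /sys/fs/cgroup/job/memory.oom_kill_all_tasks
```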

Signed-off-by: Roman Gushchin <guro@fb.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: kernel-team@fb.com
Cc: cgroups@vger.kernel.org
Cc: linux-doc@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
---
 include/linux/memcontrol.h |  3 +++
 mm/memcontrol.c            | 33 +++++++++++++++++++++++++++++++++
 2 files changed, 36 insertions(+)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 899949b..8a308c9 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -195,6 +195,9 @@ struct mem_cgroup {
 	/* OOM-Killer disable */
 	int		oom_kill_disable;
 
+	/* kill all tasks below in case of OOM */
+	bool oom_kill_all_tasks;
+
 	/* handle for "memory.events" */
 	struct cgroup_file events_file;
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index c131f7e..d4ffa79 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5156,6 +5156,33 @@ static ssize_t memory_max_write(struct kernfs_open_file *of,
 	return nbytes;
 }
 
+static int memory_oom_kill_all_tasks_show(struct seq_file *m, void *v)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m));
+	bool oom_kill_all_tasks = memcg->oom_kill_all_tasks;
+
+	seq_printf(m, "%d\n", oom_kill_all_tasks);
+
+	return 0;
+}
+
+static ssize_t memory_oom_kill_all_tasks_write(struct kernfs_open_file *of,
+					       char *buf, size_t nbytes,
+					       loff_t off)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
+	int oom_kill_all_tasks;
+	int err;
+
+	err = kstrtoint(strstrip(buf), 0, &oom_kill_all_tasks);
+	if (err)
+		return err;
+
+	memcg->oom_kill_all_tasks = !!oom_kill_all_tasks;
+
+	return nbytes;
+}
+
 static int memory_events_show(struct seq_file *m, void *v)
 {
 	struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m));
@@ -5265,6 +5292,12 @@ static struct cftype memory_files[] = {
 		.write = memory_max_write,
 	},
 	{
+		.name = "oom_kill_all_tasks",
+		.flags = CFTYPE_NOT_ON_ROOT,
+		.seq_show = memory_oom_kill_all_tasks_show,
+		.write = memory_oom_kill_all_tasks_write,
+	},
+	{
 		.name = "events",
 		.flags = CFTYPE_NOT_ON_ROOT,
 		.file_offset = offsetof(struct mem_cgroup, events_file),
-- 
2.7.4



* [RFC PATCH v2 5/7] mm, oom: introduce oom_score_adj for memory cgroups
  2017-06-01 18:35 [RFC PATCH v2 0/7] cgroup-aware OOM killer Roman Gushchin
                   ` (3 preceding siblings ...)
  2017-06-01 18:35 ` [RFC PATCH v2 4/7] mm, oom: introduce oom_kill_all_tasks option for memory cgroups Roman Gushchin
@ 2017-06-01 18:35 ` Roman Gushchin
  2017-06-04 19:39   ` Vladimir Davydov
  2017-06-01 18:35 ` [RFC PATCH v2 6/7] mm, oom: cgroup-aware OOM killer Roman Gushchin
                   ` (2 subsequent siblings)
  7 siblings, 1 reply; 22+ messages in thread
From: Roman Gushchin @ 2017-06-01 18:35 UTC (permalink / raw)
  To: linux-mm
  Cc: Roman Gushchin, Tejun Heo, Johannes Weiner, Li Zefan,
	Michal Hocko, Vladimir Davydov, Tetsuo Handa, kernel-team,
	cgroups, linux-doc, linux-kernel

Introduce a per-memory-cgroup oom_score_adj setting.
It is a read-write single value file which exists on non-root
cgroups. The default is "0".

It has a similar meaning to the per-process value
available via /proc/<pid>/oom_score_adj.
The value should be in the range [-1000, 1000].
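
For illustration, assuming cgroup v2 is mounted at /sys/fs/cgroup
and a cgroup named "job" exists (illustrative paths, not taken
from the patch):

```shell
# Make this cgroup a more likely OOM victim; values outside
# [-1000, 1000] are rejected with -EINVAL.
echo 500 > /sys/fs/cgroup/job/memory.oom_score_adj

# Read the current adjustment (default is 0).
cat /sys/fs/cgroup/job/memory.oom_score_adj
```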

Signed-off-by: Roman Gushchin <guro@fb.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: kernel-team@fb.com
Cc: cgroups@vger.kernel.org
Cc: linux-doc@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
---
 include/linux/memcontrol.h |  3 +++
 mm/memcontrol.c            | 36 ++++++++++++++++++++++++++++++++++++
 2 files changed, 39 insertions(+)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 8a308c9..818a42e 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -198,6 +198,9 @@ struct mem_cgroup {
 	/* kill all tasks below in case of OOM */
 	bool oom_kill_all_tasks;
 
+	/* OOM kill score adjustment */
+	short oom_score_adj;
+
 	/* handle for "memory.events" */
 	struct cgroup_file events_file;
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index d4ffa79..f979ac7 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5183,6 +5183,36 @@ static ssize_t memory_oom_kill_all_tasks_write(struct kernfs_open_file *of,
 	return nbytes;
 }
 
+static int memory_oom_score_adj_show(struct seq_file *m, void *v)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m));
+	short oom_score_adj = memcg->oom_score_adj;
+
+	seq_printf(m, "%d\n", oom_score_adj);
+
+	return 0;
+}
+
+static ssize_t memory_oom_score_adj_write(struct kernfs_open_file *of,
+				char *buf, size_t nbytes, loff_t off)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
+	int oom_score_adj;
+	int err;
+
+	err = kstrtoint(strstrip(buf), 0, &oom_score_adj);
+	if (err)
+		return err;
+
+	if (oom_score_adj < OOM_SCORE_ADJ_MIN ||
+			oom_score_adj > OOM_SCORE_ADJ_MAX)
+		return -EINVAL;
+
+	memcg->oom_score_adj = (short)oom_score_adj;
+
+	return nbytes;
+}
+
 static int memory_events_show(struct seq_file *m, void *v)
 {
 	struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m));
@@ -5298,6 +5328,12 @@ static struct cftype memory_files[] = {
 		.write = memory_oom_kill_all_tasks_write,
 	},
 	{
+		.name = "oom_score_adj",
+		.flags = CFTYPE_NOT_ON_ROOT,
+		.seq_show = memory_oom_score_adj_show,
+		.write = memory_oom_score_adj_write,
+	},
+	{
 		.name = "events",
 		.flags = CFTYPE_NOT_ON_ROOT,
 		.file_offset = offsetof(struct mem_cgroup, events_file),
-- 
2.7.4



* [RFC PATCH v2 6/7] mm, oom: cgroup-aware OOM killer
  2017-06-01 18:35 [RFC PATCH v2 0/7] cgroup-aware OOM killer Roman Gushchin
                   ` (4 preceding siblings ...)
  2017-06-01 18:35 ` [RFC PATCH v2 5/7] mm, oom: introduce oom_score_adj " Roman Gushchin
@ 2017-06-01 18:35 ` Roman Gushchin
  2017-06-04 20:43   ` Vladimir Davydov
  2017-06-01 18:35 ` [RFC PATCH v2 7/7] mm,oom,docs: describe the " Roman Gushchin
  2017-06-09 16:30 ` [RFC PATCH v2 0/7] " Michal Hocko
  7 siblings, 1 reply; 22+ messages in thread
From: Roman Gushchin @ 2017-06-01 18:35 UTC (permalink / raw)
  To: linux-mm
  Cc: Roman Gushchin, Tejun Heo, Johannes Weiner, Li Zefan,
	Michal Hocko, Vladimir Davydov, Tetsuo Handa, kernel-team,
	cgroups, linux-doc, linux-kernel

Traditionally, the OOM killer operates at the process level.
Under OOM conditions, it finds the process with the highest
oom score and kills it.

This behavior doesn't suit systems with many running
containers well. There are three main issues:

1) There is no fairness between containers. A small container
with a few large processes will be chosen over a large one with
a huge number of small processes.

2) Containers often do not expect that some random process
inside them will be killed. In many cases a much safer behavior
is to kill all tasks in the container. Traditionally, this was
implemented in userspace, but doing it in the kernel has some
advantages, especially in the case of a system-wide OOM.

3) The per-process oom_score_adj affects global OOM decisions,
so it's a breach of isolation.

To address these issues, a cgroup-aware OOM killer is introduced.

Under OOM conditions, it tries to find the biggest memory
consumer and free memory by killing the corresponding task(s).
The difference from the "traditional" OOM killer is that it can
treat memory cgroups as memory consumers, as well as single
processes.

By default, it will look for the biggest leaf cgroup, and kill
the largest task inside.

But a user can change this behavior by enabling the per-cgroup
oom_kill_all_tasks option. If set, it causes the OOM killer to
treat the whole cgroup as an indivisible memory consumer. If it
is selected as an OOM victim, all tasks belonging to it will be
killed.

Tasks in the root cgroup are treated as independent memory consumers,
and are compared with other memory consumers (e.g. leaf cgroups).
The root cgroup doesn't support the oom_kill_all_tasks feature.
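
The selection algorithm described above can be sketched as a
simplified userspace model (illustrative Python, not kernel code;
the badness values and cgroup names are made up, and the
oom_score_adj bias and root-cgroup task comparison are omitted):

```python
from dataclasses import dataclass, field

@dataclass
class Memcg:
    name: str
    badness: int = 0                 # stands in for mem_cgroup_oom_badness()
    oom_kill_all_tasks: bool = False
    children: list = field(default_factory=list)

def select_victim(root):
    """At each level, pick the child with the highest badness and
    descend into it, unless it has oom_kill_all_tasks set, in which
    case the whole cgroup becomes the victim."""
    node = root
    while node.children:
        node = max(node.children, key=lambda c: c.badness)
        if node.oom_kill_all_tasks:
            return node              # kill every task in this cgroup
    return node                      # leaf: kill only its biggest task

# Example hierarchy: /a is the biggest aggregate consumer; /a/job is
# marked oom_kill_all_tasks, so the whole job cgroup is chosen.
root = Memcg("/", children=[
    Memcg("/a", badness=1000, children=[
        Memcg("/a/job", badness=800, oom_kill_all_tasks=True),
        Memcg("/a/misc", badness=200),
    ]),
    Memcg("/b", badness=300),
])
print(select_victim(root).name)   # -> /a/job
```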

Signed-off-by: Roman Gushchin <guro@fb.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: kernel-team@fb.com
Cc: cgroups@vger.kernel.org
Cc: linux-doc@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
---
 include/linux/memcontrol.h |  13 ++++
 include/linux/oom.h        |   1 +
 mm/memcontrol.c            | 178 +++++++++++++++++++++++++++++++++++++++++++++
 mm/oom_kill.c              |   6 ++
 4 files changed, 198 insertions(+)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 818a42e..67709a4 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -34,6 +34,7 @@ struct mem_cgroup;
 struct page;
 struct mm_struct;
 struct kmem_cache;
+struct oom_control;
 
 /* Cgroup-specific page state, on top of universal node page state */
 enum memcg_stat_item {
@@ -471,6 +472,9 @@ static inline bool task_in_memcg_oom(struct task_struct *p)
 
 bool mem_cgroup_oom_synchronize(bool wait);
 
+bool mem_cgroup_select_oom_victim(struct oom_control *oc);
+bool mem_cgroup_kill_oom_victim(struct oom_control *oc);
+
 #ifdef CONFIG_MEMCG_SWAP
 extern int do_swap_account;
 #endif
@@ -931,6 +935,15 @@ static inline void memcg_kmem_update_page_stat(struct page *page,
 				enum memcg_stat_item idx, int val)
 {
 }
+
+static inline bool mem_cgroup_select_oom_victim(struct oom_control *oc)
+{
+	return false;
+}
+static inline bool mem_cgroup_kill_oom_victim(struct oom_control *oc)
+{
+	return false;
+}
 #endif /* CONFIG_MEMCG && !CONFIG_SLOB */
 
 #endif /* _LINUX_MEMCONTROL_H */
diff --git a/include/linux/oom.h b/include/linux/oom.h
index edf7a77..a6086a2 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -39,6 +39,7 @@ struct oom_control {
 	unsigned long totalpages;
 	struct task_struct *chosen;
 	unsigned long chosen_points;
+	struct mem_cgroup *chosen_memcg;
 };
 
 extern struct mutex oom_lock;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index f979ac7..855d335 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2625,6 +2625,184 @@ static inline bool memcg_has_children(struct mem_cgroup *memcg)
 	return ret;
 }
 
+static long mem_cgroup_oom_badness(struct mem_cgroup *memcg,
+				   const nodemask_t *nodemask)
+{
+	long points = 0;
+	int nid;
+	struct mem_cgroup *iter;
+
+	for_each_mem_cgroup_tree(iter, memcg) {
+		for_each_node_state(nid, N_MEMORY) {
+			if (nodemask && !node_isset(nid, *nodemask))
+				continue;
+
+			points += mem_cgroup_node_nr_lru_pages(iter, nid,
+					LRU_ALL_ANON | BIT(LRU_UNEVICTABLE));
+		}
+
+		points += mem_cgroup_get_nr_swap_pages(iter);
+		points += memcg_page_state(iter, MEMCG_KERNEL_STACK_KB) /
+			(PAGE_SIZE / 1024);
+		points += memcg_page_state(iter, MEMCG_SLAB_UNRECLAIMABLE);
+		points += memcg_page_state(iter, MEMCG_SOCK);
+	}
+
+	return points;
+}
+
+bool mem_cgroup_select_oom_victim(struct oom_control *oc)
+{
+	struct cgroup_subsys_state *css = NULL;
+	struct mem_cgroup *iter = NULL;
+	struct mem_cgroup *chosen_memcg = NULL;
+	struct mem_cgroup *parent = root_mem_cgroup;
+	unsigned long totalpages = oc->totalpages;
+	long chosen_memcg_points = 0;
+	long points = 0;
+
+	oc->chosen = NULL;
+	oc->chosen_memcg = NULL;
+
+	if (mem_cgroup_disabled())
+		return false;
+
+	if (!cgroup_subsys_on_dfl(memory_cgrp_subsys))
+		return false;
+
+	pr_info("Choosing a victim memcg because of the %s",
+		oc->memcg ?
+		"memory limit reached of cgroup " :
+		"system-wide OOM\n");
+	if (oc->memcg) {
+		pr_cont_cgroup_path(oc->memcg->css.cgroup);
+		pr_cont("\n");
+
+		chosen_memcg = oc->memcg;
+		parent = oc->memcg;
+	}
+
+	rcu_read_lock();
+
+	for (;;) {
+		css = css_next_child(css, &parent->css);
+		if (css) {
+			iter = mem_cgroup_from_css(css);
+
+			points = mem_cgroup_oom_badness(iter, oc->nodemask);
+			points += iter->oom_score_adj * (totalpages / 1000);
+
+			pr_info("Cgroup ");
+			pr_cont_cgroup_path(iter->css.cgroup);
+			pr_cont(": %ld\n", points);
+
+			if (points > chosen_memcg_points) {
+				chosen_memcg = iter;
+				chosen_memcg_points = points;
+				oc->chosen_points = points;
+			}
+
+			continue;
+		}
+
+		if (chosen_memcg && !chosen_memcg->oom_kill_all_tasks) {
+			/* Go deeper in the cgroup hierarchy */
+			totalpages = chosen_memcg_points;
+			chosen_memcg_points = 0;
+
+			parent = chosen_memcg;
+			chosen_memcg = NULL;
+
+			continue;
+		}
+
+		if (!chosen_memcg && parent != root_mem_cgroup)
+			chosen_memcg = parent;
+
+		break;
+	}
+
+	if (!oc->memcg) {
+		/*
+		 * We should also consider tasks in the root cgroup
+		 * with badness larger than oc->chosen_points
+		 */
+
+		struct css_task_iter it;
+		struct task_struct *task;
+		int ret = 0;
+
+		css_task_iter_start(&root_mem_cgroup->css, &it);
+		while (!ret && (task = css_task_iter_next(&it)))
+			ret = oom_evaluate_task(task, oc);
+		css_task_iter_end(&it);
+	}
+
+	if (!oc->chosen && chosen_memcg) {
+		pr_info("Chosen cgroup ");
+		pr_cont_cgroup_path(chosen_memcg->css.cgroup);
+		pr_cont(": %ld\n", oc->chosen_points);
+
+		if (chosen_memcg->oom_kill_all_tasks) {
+			css_get(&chosen_memcg->css);
+			oc->chosen_memcg = chosen_memcg;
+		} else {
+			/*
+			 * If we don't need to kill all tasks in the cgroup,
+			 * let's select the biggest task.
+			 */
+			oc->chosen_points = 0;
+			select_bad_process(oc, chosen_memcg);
+		}
+	} else if (oc->chosen)
+		pr_info("Chosen task %s (%d) in root cgroup: %ld\n",
+			oc->chosen->comm, oc->chosen->pid, oc->chosen_points);
+
+	rcu_read_unlock();
+
+	oc->chosen_points = 0;
+	return !!oc->chosen || !!oc->chosen_memcg;
+}
+
+static int __oom_kill_task(struct task_struct *tsk, void *arg)
+{
+	if (!is_global_init(tsk) && !(tsk->flags & PF_KTHREAD)) {
+		get_task_struct(tsk);
+		__oom_kill_process(tsk);
+	}
+	return 0;
+}
+
+bool mem_cgroup_kill_oom_victim(struct oom_control *oc)
+{
+	if (oc->chosen_memcg) {
+		/*
+		 * Kill all tasks in the cgroup hierarchy
+		 */
+		mem_cgroup_scan_tasks(oc->chosen_memcg,
+				      __oom_kill_task, NULL);
+
+		/*
+		 * Release oc->chosen_memcg
+		 */
+		css_put(&oc->chosen_memcg->css);
+		oc->chosen_memcg = NULL;
+	}
+
+	if (oc->chosen && oc->chosen != (void *)-1UL) {
+		__oom_kill_process(oc->chosen);
+		return true;
+	}
+
+	/*
+	 * Reset points before falling back to an old
+	 * per-process OOM victim selection logic
+	 */
+	oc->chosen_points = 0;
+
+	return !!oc->chosen;
+}
+
 /*
  * Reclaims as many pages from the given memcg as possible.
  *
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 8cf77fb..1346565 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -1039,6 +1039,12 @@ bool out_of_memory(struct oom_control *oc)
 		return true;
 	}
 
+	if (mem_cgroup_select_oom_victim(oc) &&
+	    mem_cgroup_kill_oom_victim(oc)) {
+		schedule_timeout_killable(1);
+		return true;
+	}
+
 	select_bad_process(oc, oc->memcg);
 	/* Found nothing?!?! Either we hang forever, or we panic. */
 	if (!oc->chosen && !is_sysrq_oom(oc) && !is_memcg_oom(oc)) {
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [RFC PATCH v2 7/7] mm,oom,docs: describe the cgroup-aware OOM killer
  2017-06-01 18:35 [RFC PATCH v2 0/7] cgroup-aware OOM killer Roman Gushchin
                   ` (5 preceding siblings ...)
  2017-06-01 18:35 ` [RFC PATCH v2 6/7] mm, oom: cgroup-aware OOM killer Roman Gushchin
@ 2017-06-01 18:35 ` Roman Gushchin
  2017-06-09 16:30 ` [RFC PATCH v2 0/7] " Michal Hocko
  7 siblings, 0 replies; 22+ messages in thread
From: Roman Gushchin @ 2017-06-01 18:35 UTC (permalink / raw)
  To: linux-mm
  Cc: Roman Gushchin, Tejun Heo, Johannes Weiner, Li Zefan,
	Michal Hocko, Vladimir Davydov, Tetsuo Handa, kernel-team,
	cgroups, linux-doc, linux-kernel

Update cgroups v2 docs.

Signed-off-by: Roman Gushchin <guro@fb.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: kernel-team@fb.com
Cc: cgroups@vger.kernel.org
Cc: linux-doc@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
---
 Documentation/cgroup-v2.txt | 47 +++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 45 insertions(+), 2 deletions(-)

diff --git a/Documentation/cgroup-v2.txt b/Documentation/cgroup-v2.txt
index dc5e2dc..bb1e45c 100644
--- a/Documentation/cgroup-v2.txt
+++ b/Documentation/cgroup-v2.txt
@@ -44,6 +44,7 @@ CONTENTS
     5-2-1. Memory Interface Files
     5-2-2. Usage Guidelines
     5-2-3. Memory Ownership
+    5-2-4. Cgroup-aware OOM Killer
   5-3. IO
     5-3-1. IO Interface Files
     5-3-2. Writeback
@@ -799,6 +800,26 @@ PAGE_SIZE multiple when read back.
 	high limit is used and monitored properly, this limit's
 	utility is limited to providing the final safety net.
 
+  memory.oom_kill_all_tasks
+
+	A read-write single value file which exists on non-root
+	cgroups.  The default is "0".
+
+	Defines whether the OOM killer should treat the cgroup
+	as a single entity during the victim selection.
+
+	If set, it will cause the OOM killer to kill all tasks in
+	the cgroup, in case of both system-wide and cgroup-wide OOMs.
+
+  memory.oom_score_adj
+
+	A read-write single value file which exists on non-root
+	cgroups.  The default is "0".
+
+	OOM killer score adjustment, which has a similar meaning to
+	the per-process value available via /proc/<pid>/oom_score_adj.
+	Should be in the range [-1000, 1000].
+
   memory.events
 
 	A read-only flat-keyed file which exists on non-root cgroups.
@@ -831,8 +852,7 @@ PAGE_SIZE multiple when read back.
 	  oom
 
 		The number of times the OOM killer has been invoked in
-		the cgroup.  This may not exactly match the number of
-		processes killed but should generally be close.
+		the cgroup.
 
   memory.stat
 
@@ -988,6 +1008,29 @@ POSIX_FADV_DONTNEED to relinquish the ownership of memory areas
 belonging to the affected files to ensure correct memory ownership.
 
 
+5-2-4. Cgroup-aware OOM Killer
+
+The cgroup v2 memory controller implements a cgroup-aware OOM killer,
+which treats memory cgroups as first-class OOM entities.
+
+Under OOM conditions, the memory controller tries to make the best
+choice of a victim, hierarchically looking for the largest memory
+consumer. By default, it will look for the biggest task in the
+biggest leaf cgroup.
+
+But a user can change this behavior by enabling the per-cgroup
+oom_kill_all_tasks option. If set, it causes the OOM killer to treat
+the whole cgroup as an indivisible memory consumer. If the cgroup is
+selected as an OOM victim, all tasks belonging to it will be killed.
+
+Tasks in the root cgroup are treated as independent memory consumers,
+and are compared with other memory consumers (e.g. leaf cgroups).
+The root cgroup doesn't support the oom_kill_all_tasks feature.
+
+This affects both system- and cgroup-wide OOMs. For a cgroup-wide OOM
+the memory controller considers only cgroups belonging to the sub-tree
+of the OOM'ing cgroup.
+
 5-3. IO
 
 The "io" controller regulates the distribution of IO resources.  This
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH v2 1/7] mm, oom: refactor select_bad_process() to take memcg as an argument
  2017-06-01 18:35 ` [RFC PATCH v2 1/7] mm, oom: refactor select_bad_process() to take memcg as an argument Roman Gushchin
@ 2017-06-04 19:25   ` Vladimir Davydov
  2017-06-04 22:50   ` David Rientjes
  1 sibling, 0 replies; 22+ messages in thread
From: Vladimir Davydov @ 2017-06-04 19:25 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: linux-mm, Tejun Heo, Johannes Weiner, Li Zefan, Michal Hocko,
	Tetsuo Handa, kernel-team, cgroups, linux-doc, linux-kernel

On Thu, Jun 01, 2017 at 07:35:09PM +0100, Roman Gushchin wrote:
> The select_bad_process() function will be used further
> to select a process to kill in the victim cgroup.
> This cgroup doesn't necessarily match oc->memcg,
> which is the cgroup whose limits caused a cgroup-wide OOM
> (or NULL in case of global OOM).
> 
> So, refactor select_bad_process() to take a pointer to
> a cgroup to iterate over as an argument.

IMHO this patch, as well as patches 2-5, doesn't deserve to be submitted
separately: none of them make sense as a separate change; worse, patches
4 and 5 introduce user API that doesn't do anything without patch 6. All
of the changes are relatively small and singling them out doesn't really
facilitate review, so I'd merge them all in patch 6.


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH v2 4/7] mm, oom: introduce oom_kill_all_tasks option for memory cgroups
  2017-06-01 18:35 ` [RFC PATCH v2 4/7] mm, oom: introduce oom_kill_all_tasks option for memory cgroups Roman Gushchin
@ 2017-06-04 19:30   ` Vladimir Davydov
  0 siblings, 0 replies; 22+ messages in thread
From: Vladimir Davydov @ 2017-06-04 19:30 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: linux-mm, Tejun Heo, Johannes Weiner, Li Zefan, Michal Hocko,
	Tetsuo Handa, kernel-team, cgroups, linux-doc, linux-kernel

On Thu, Jun 01, 2017 at 07:35:12PM +0100, Roman Gushchin wrote:
> This option defines whether a cgroup should be treated
> as a single entity by the OOM killer.
> 
> If set, the OOM killer will compare the whole cgroup with other
> memory consumers (other cgroups and tasks in the root cgroup),
> and in case of an OOM will kill all tasks belonging to it.
> 
> Disabled by default.
> 
...
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> @@ -5265,6 +5292,12 @@ static struct cftype memory_files[] = {
>  		.write = memory_max_write,
>  	},
>  	{
> +		.name = "oom_kill_all_tasks",
> +		.flags = CFTYPE_NOT_ON_ROOT,
> +		.seq_show = memory_oom_kill_all_tasks_show,
> +		.write = memory_oom_kill_all_tasks_write,
> +	},
> +	{
>  		.name = "events",
>  		.flags = CFTYPE_NOT_ON_ROOT,
>  		.file_offset = offsetof(struct mem_cgroup, events_file),

I don't really like the name of the new knob, but can't come up with
anything better :-( May be, drop '_tasks' suffix and call it just
'oom_kill_all'? Or perhaps we should emphasize the fact that this
cgroup is treated as a single entity by the OOM killer by calling it
'oom_entity' or 'oom_unit'? Dunno...


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH v2 5/7] mm, oom: introduce oom_score_adj for memory cgroups
  2017-06-01 18:35 ` [RFC PATCH v2 5/7] mm, oom: introduce oom_score_adj " Roman Gushchin
@ 2017-06-04 19:39   ` Vladimir Davydov
  0 siblings, 0 replies; 22+ messages in thread
From: Vladimir Davydov @ 2017-06-04 19:39 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: linux-mm, Tejun Heo, Johannes Weiner, Li Zefan, Michal Hocko,
	Tetsuo Handa, kernel-team, cgroups, linux-doc, linux-kernel

On Thu, Jun 01, 2017 at 07:35:13PM +0100, Roman Gushchin wrote:
> Introduce a per-memory-cgroup oom_score_adj setting.
> A read-write single value file which exists on non-root
> cgroups. The default is "0".
> 
> It will have a similar meaning to a per-process value,
> available via /proc/<pid>/oom_score_adj.
> Should be in the range [-1000, 1000].

IMHO OOM scoring (not only the user API, but the logic as well) should
be introduced by a separate patch following the main one (#6) in the
series. Rationale: we might want to commit the main patch right away,
while postponing OOM scoring for later, because some people might find
the API controversial and needing a further, deeper discussion.


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH v2 6/7] mm, oom: cgroup-aware OOM killer
  2017-06-01 18:35 ` [RFC PATCH v2 6/7] mm, oom: cgroup-aware OOM killer Roman Gushchin
@ 2017-06-04 20:43   ` Vladimir Davydov
  2017-06-06 15:59     ` Roman Gushchin
  0 siblings, 1 reply; 22+ messages in thread
From: Vladimir Davydov @ 2017-06-04 20:43 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: linux-mm, Tejun Heo, Johannes Weiner, Li Zefan, Michal Hocko,
	Tetsuo Handa, kernel-team, cgroups, linux-doc, linux-kernel

On Thu, Jun 01, 2017 at 07:35:14PM +0100, Roman Gushchin wrote:
> Traditionally, the OOM killer is operating on a process level.
> Under oom conditions, it finds a process with the highest oom score
> and kills it.
> 
> This behavior doesn't work well for systems with many running
> containers. There are three main issues:
> 
> 1) There is no fairness between containers. A small container with
> few large processes will be chosen over a large one with a huge
> number of small processes.
> 
> 2) Containers often do not expect that some random process inside
> will be killed. In many cases a much safer behavior is to kill
> all tasks in the container. Traditionally, this was implemented
> in userspace, but doing it in the kernel has some advantages,
> especially in the case of a system-wide OOM.
> 
> 3) Per-process oom_score_adj affects global OOM, so it's a breach
> of isolation.
> 
> To address these issues, a cgroup-aware OOM killer is introduced.
> 
> Under OOM conditions, it tries to find the biggest memory consumer,
> and free memory by killing the corresponding task(s). The difference
> from the "traditional" OOM killer is that it can treat memory cgroups
> as memory consumers as well as single processes.
> 
> By default, it will look for the biggest leaf cgroup, and kill
> the largest task inside.
> 
> But a user can change this behavior by enabling the per-cgroup
> oom_kill_all_tasks option. If set, it causes the OOM killer to treat
> the whole cgroup as an indivisible memory consumer. If the cgroup is
> selected as an OOM victim, all tasks belonging to it will be killed.
> 
> Tasks in the root cgroup are treated as independent memory consumers,
> and are compared with other memory consumers (e.g. leaf cgroups).
> The root cgroup doesn't support the oom_kill_all_tasks feature.
> 
...
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index f979ac7..855d335 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2625,6 +2625,184 @@ static inline bool memcg_has_children(struct mem_cgroup *memcg)
>  	return ret;
>  }
>  
> +static long mem_cgroup_oom_badness(struct mem_cgroup *memcg,
> +				   const nodemask_t *nodemask)
> +{
> +	long points = 0;
> +	int nid;
> +	struct mem_cgroup *iter;
> +
> +	for_each_mem_cgroup_tree(iter, memcg) {

AFAIU this function is called on every iteration over the cgroup tree,
which might be costly in case of a deep hierarchy, as it has quadratic
complexity at worst. We could eliminate the nested loop by computing
badness of all eligible cgroups before starting looking for a victim and
saving the values in struct mem_cgroup. Not sure if it's worth it, as
OOM is a pretty cold path.

> +		for_each_node_state(nid, N_MEMORY) {
> +			if (nodemask && !node_isset(nid, *nodemask))
> +				continue;
> +
> +			points += mem_cgroup_node_nr_lru_pages(iter, nid,
> +					LRU_ALL_ANON | BIT(LRU_UNEVICTABLE));

Hmm, is there a reason why we shouldn't take into account file pages?

> +		}
> +
> +		points += mem_cgroup_get_nr_swap_pages(iter);

AFAICS mem_cgroup_get_nr_swap_pages() returns the number of pages that
can still be charged to the cgroup. IIUC we want to account pages that
have already been charged to the cgroup, i.e. the value of the 'swap'
page counter or MEMCG_SWAP stat counter.

> +		points += memcg_page_state(iter, MEMCG_KERNEL_STACK_KB) /
> +			(PAGE_SIZE / 1024);
> +		points += memcg_page_state(iter, MEMCG_SLAB_UNRECLAIMABLE);
> +		points += memcg_page_state(iter, MEMCG_SOCK);
> +	}
> +
> +	return points;
> +}
> +
> +bool mem_cgroup_select_oom_victim(struct oom_control *oc)
> +{
> +	struct cgroup_subsys_state *css = NULL;
> +	struct mem_cgroup *iter = NULL;
> +	struct mem_cgroup *chosen_memcg = NULL;
> +	struct mem_cgroup *parent = root_mem_cgroup;
> +	unsigned long totalpages = oc->totalpages;
> +	long chosen_memcg_points = 0;
> +	long points = 0;
> +
> +	oc->chosen = NULL;
> +	oc->chosen_memcg = NULL;
> +
> +	if (mem_cgroup_disabled())
> +		return false;
> +
> +	if (!cgroup_subsys_on_dfl(memory_cgrp_subsys))
> +		return false;
> +
> +	pr_info("Choosing a victim memcg because of the %s",
> +		oc->memcg ?
> +		"memory limit reached of cgroup " :
> +		"system-wide OOM\n");
> +	if (oc->memcg) {
> +		pr_cont_cgroup_path(oc->memcg->css.cgroup);
> +		pr_cont("\n");
> +
> +		chosen_memcg = oc->memcg;
> +		parent = oc->memcg;
> +	}
> +
> +	rcu_read_lock();
> +
> +	for (;;) {
> +		css = css_next_child(css, &parent->css);
> +		if (css) {
> +			iter = mem_cgroup_from_css(css);
> +
> +			points = mem_cgroup_oom_badness(iter, oc->nodemask);
> +			points += iter->oom_score_adj * (totalpages / 1000);
> +
> +			pr_info("Cgroup ");
> +			pr_cont_cgroup_path(iter->css.cgroup);
> +			pr_cont(": %ld\n", points);

Not sure if everyone wants to see these messages in the log.

> +
> +			if (points > chosen_memcg_points) {
> +				chosen_memcg = iter;
> +				chosen_memcg_points = points;
> +				oc->chosen_points = points;
> +			}
> +
> +			continue;
> +		}
> +
> +		if (chosen_memcg && !chosen_memcg->oom_kill_all_tasks) {
> +			/* Go deeper in the cgroup hierarchy */
> +			totalpages = chosen_memcg_points;

We set 'totalpages' to the target cgroup limit (or the total RAM
size) when computing a victim score. Why do you prefer to use
chosen_memcg_points here instead? Why not the limit of the chosen
cgroup?

> +			chosen_memcg_points = 0;
> +
> +			parent = chosen_memcg;
> +			chosen_memcg = NULL;
> +
> +			continue;
> +		}
> +
> +		if (!chosen_memcg && parent != root_mem_cgroup)
> +			chosen_memcg = parent;
> +
> +		break;
> +	}
> +

> +	if (!oc->memcg) {
> +		/*
> +		 * We should also consider tasks in the root cgroup
> +		 * with badness larger than oc->chosen_points
> +		 */
> +
> +		struct css_task_iter it;
> +		struct task_struct *task;
> +		int ret = 0;
> +
> +		css_task_iter_start(&root_mem_cgroup->css, &it);
> +		while (!ret && (task = css_task_iter_next(&it)))
> +			ret = oom_evaluate_task(task, oc);
> +		css_task_iter_end(&it);
> +	}

IMHO it isn't quite correct to compare tasks from the root cgroup with
leaf cgroups, because they are at different levels. Shouldn't we compare
their scores only with the top level cgroups?

As an alternative approach, may be, we could remove this branch
altogether and ignore root tasks here (i.e. have any root task a higher
priority a priori)? Perhaps, it could be acceptable, because normally
the root cgroup only hosts kernel processes and init (at least this is
the default systemd setup IIRC).

> +
> +	if (!oc->chosen && chosen_memcg) {
> +		pr_info("Chosen cgroup ");
> +		pr_cont_cgroup_path(chosen_memcg->css.cgroup);
> +		pr_cont(": %ld\n", oc->chosen_points);
> +
> +		if (chosen_memcg->oom_kill_all_tasks) {
> +			css_get(&chosen_memcg->css);
> +			oc->chosen_memcg = chosen_memcg;
> +		} else {
> +			/*
> +			 * If we don't need to kill all tasks in the cgroup,
> +			 * let's select the biggest task.
> +			 */
> +			oc->chosen_points = 0;

> +			select_bad_process(oc, chosen_memcg);

I think we'd better use mem_cgroup_scan_task() here directly, without
exporting select_bad_process() from oom_kill.c. IMHO it would be more
straightforward, because select_bad_process() has a branch handling the
global OOM, which isn't used in this case. Come to think of it, wouldn't
it be better to return the chosen cgroup in @oc and let out_of_memory()
select a process within it or kill it as a whole depending on the value
of the oom_kill_all_tasks flag?

Also, if the chosen cgroup has no tasks (which is perfectly possible if
all memory within the cgroup is consumed by shmem e.g.), shouldn't we
retry the cgroup selection?

> +		}
> +	} else if (oc->chosen)
> +		pr_info("Chosen task %s (%d) in root cgroup: %ld\n",
> +			oc->chosen->comm, oc->chosen->pid, oc->chosen_points);
> +
> +	rcu_read_unlock();
> +
> +	oc->chosen_points = 0;
> +	return !!oc->chosen || !!oc->chosen_memcg;
> +}
> +
> +static int __oom_kill_task(struct task_struct *tsk, void *arg)
> +{
> +	if (!is_global_init(tsk) && !(tsk->flags & PF_KTHREAD)) {
> +		get_task_struct(tsk);
> +		__oom_kill_process(tsk);
> +	}
> +	return 0;
> +}
> +
> +bool mem_cgroup_kill_oom_victim(struct oom_control *oc)

I think it'd be OK to define this function in oom_kill.c - we
have everything we need for that. We wouldn't have to export
__oom_kill_process without oom_kill_process then, which is kinda
ugly IMHO.

> +{
> +	if (oc->chosen_memcg) {
> +		/*
> +		 * Kill all tasks in the cgroup hierarchy
> +		 */
> +		mem_cgroup_scan_tasks(oc->chosen_memcg,
> +				      __oom_kill_task, NULL);
> +
> +		/*
> +		 * Release oc->chosen_memcg
> +		 */
> +		css_put(&oc->chosen_memcg->css);
> +		oc->chosen_memcg = NULL;
> +	}
> +
> +	if (oc->chosen && oc->chosen != (void *)-1UL) {

> +		__oom_kill_process(oc->chosen);

Why don't you use oom_kill_process (without leading underscores) here?

> +		return true;
> +	}
> +
> +	/*
> +	 * Reset points before falling back to an old
> +	 * per-process OOM victim selection logic
> +	 */
> +	oc->chosen_points = 0;
> +
> +	return !!oc->chosen;
> +}


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH v2 1/7] mm, oom: refactor select_bad_process() to take memcg as an argument
  2017-06-01 18:35 ` [RFC PATCH v2 1/7] mm, oom: refactor select_bad_process() to take memcg as an argument Roman Gushchin
  2017-06-04 19:25   ` Vladimir Davydov
@ 2017-06-04 22:50   ` David Rientjes
  2017-06-06 16:20     ` Roman Gushchin
  1 sibling, 1 reply; 22+ messages in thread
From: David Rientjes @ 2017-06-04 22:50 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: linux-mm, Tejun Heo, Johannes Weiner, Li Zefan, Michal Hocko,
	Vladimir Davydov, Tetsuo Handa, kernel-team, cgroups, linux-doc,
	linux-kernel

We use a heavily modified system and memcg oom killer and I'm wondering
if there is some opportunity for collaboration because we may have some
shared goals.

I can summarize how we currently use the oom killer at a high level so
that it is not overwhelming with implementation details and give some
rationale for why we have converged onto this strategy over the period of
a few years.

For victim selection, we have strict priority based oom killing both at
the memcg level and the process level.

Each process has its own "badness" value that is independent of
oom_score_adj, although some conversion is done if a third-party thread
chooses to disable itself from oom killing for backwards compatibility.
Lower values are more preferable victims, but that detail should not
matter significantly.  If two processes share the same badness value,
tiebreaks are done by selecting the largest rss.

Each memcg in a hierarchy also has its own badness value which
semantically means the same as the per-process value, although it
considers the entire memcg as a unit, similar to your approach, when
iterating the hierarchy to choose a process.  The benefit of the
per-memcg and per-process approach is that you can kill the lowest
priority process from the lowest priority memcg.

The above scoring is enabled with a VM sysctl for the system and is used
for both system (global) and memcg oom kills.  For system overcommit,
this means we can kill the lowest priority job on the system; for memcg,
we can allow users to define their oom kill priorities at each level of
their own hierarchy.

When the system or root of an oom memcg hierarchy encounters its limit,
we iterate each level of the memcg hierarchy to find the lowest priority
job.  This is done by comparing the badness of the sibling memcgs at
each level, finding the lowest, and iterating that subtree.  If there are
lower priority processes per the per-process badness value compared to
all sibling memcgs, that process is killed.

We also have complete userspace oom handling support.  This complements
the existing memory.oom_control notification when a memcg is oom with a
separate notifier that notifies when the kernel has oom killed a process.
It is possible to delay the oom killer from killing a process for memcg
oom kills with a configurable, per-memcg, oom delay.  If set, the kernel
will wait for userspace to respond to its oom notification and effect its
own policy decisions until memory is uncharged to that memcg hierarchy,
or the oom delay expires.  If the oom delay expires, the kernel oom
killer kills a process based on badness.

Our oom kill notification file used to get an fd to register with
cgroup.event_control also provides oom kill statistics based on system,
memcg, local, hierarchical, and user-induced oom kills when read().

We also have a convenient "kill all" knob that userspace can write when
handling oom conditions to iterate all threads attached to a particular
memcg and kill them.  This is merely to prevent races where userspace
does the oom killing itself, which is not problematic in itself, but
additional tasks continue to be attached to an oom memcg.

A caveat here is that we also support fully inclusive kmem accounting to
memcg hierarchies, so we call the oom killer as part of the memcg charge
path rather than only upon returning from fault with VM_FAULT_OOM.  We
have our own oom killer livelock detection that isn't necessarily
important in this thread, but we haven't encountered a situation where we
livelock by calling the oom killer during charge, and this is a
requirement for memcg charging as part of slab allocation.

I could post many patches to implement all of this functionality that we
have used for a few years, but I first wanted to send this email to see
if there is any common ground or to better understand your methodology
for using the kernel oom killer for both system and memcg oom kills.

Otherwise, very interesting stuff!


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH v2 6/7] mm, oom: cgroup-aware OOM killer
  2017-06-04 20:43   ` Vladimir Davydov
@ 2017-06-06 15:59     ` Roman Gushchin
  0 siblings, 0 replies; 22+ messages in thread
From: Roman Gushchin @ 2017-06-06 15:59 UTC (permalink / raw)
  To: Vladimir Davydov
  Cc: linux-mm, Tejun Heo, Johannes Weiner, Li Zefan, Michal Hocko,
	Tetsuo Handa, kernel-team, cgroups, linux-doc, linux-kernel

On Sun, Jun 04, 2017 at 11:43:33PM +0300, Vladimir Davydov wrote:
> On Thu, Jun 01, 2017 at 07:35:14PM +0100, Roman Gushchin wrote:
> ...
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index f979ac7..855d335 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -2625,6 +2625,184 @@ static inline bool memcg_has_children(struct mem_cgroup *memcg)
> >  	return ret;
> >  }
> >  
> > +static long mem_cgroup_oom_badness(struct mem_cgroup *memcg,
> > +				   const nodemask_t *nodemask)
> > +{
> > +	long points = 0;
> > +	int nid;
> > +	struct mem_cgroup *iter;
> > +
> > +	for_each_mem_cgroup_tree(iter, memcg) {
> 
> AFAIU this function is called on every iteration over the cgroup tree,
> which might be costly in case of a deep hierarchy, as it has quadratic
> complexity at worst. We could eliminate the nested loop by computing
> badness of all eligible cgroups before starting looking for a victim and
> saving the values in struct mem_cgroup. Not sure if it's worth it, as
> OOM is a pretty cold path.

I've thought about it, but it's really not obvious that we want to pay
with some additional memory usage (and code complexity) to optimize
this path. So, I decided to keep it simple for now, and postpone
all optimizations until we agree on everything else.

> 
> > +		for_each_node_state(nid, N_MEMORY) {
> > +			if (nodemask && !node_isset(nid, *nodemask))
> > +				continue;
> > +
> > +			points += mem_cgroup_node_nr_lru_pages(iter, nid,
> > +					LRU_ALL_ANON | BIT(LRU_UNEVICTABLE));
> 
> Hmm, is there a reason why we shouldn't take into account file pages?

Because under OOM conditions we should not have too much pagecache,
and killing a process is unlikely to help us release any additional memory.
But maybe I'm missing something... Lazy free?

> 
> > +		}
> > +
> > +		points += mem_cgroup_get_nr_swap_pages(iter);
> 
> AFAICS mem_cgroup_get_nr_swap_pages() returns the number of pages that
> can still be charged to the cgroup. IIUC we want to account pages that
> have already been charged to the cgroup, i.e. the value of the 'swap'
> page counter or MEMCG_SWAP stat counter.

Ok, I'll check it. Thank you!

> 
> > +		points += memcg_page_state(iter, MEMCG_KERNEL_STACK_KB) /
> > +			(PAGE_SIZE / 1024);
> > +		points += memcg_page_state(iter, MEMCG_SLAB_UNRECLAIMABLE);
> > +		points += memcg_page_state(iter, MEMCG_SOCK);
> > +	}
> > +
> > +	return points;
> > +}
> > +
> > +bool mem_cgroup_select_oom_victim(struct oom_control *oc)
> > +{
> > +	struct cgroup_subsys_state *css = NULL;
> > +	struct mem_cgroup *iter = NULL;
> > +	struct mem_cgroup *chosen_memcg = NULL;
> > +	struct mem_cgroup *parent = root_mem_cgroup;
> > +	unsigned long totalpages = oc->totalpages;
> > +	long chosen_memcg_points = 0;
> > +	long points = 0;
> > +
> > +	oc->chosen = NULL;
> > +	oc->chosen_memcg = NULL;
> > +
> > +	if (mem_cgroup_disabled())
> > +		return false;
> > +
> > +	if (!cgroup_subsys_on_dfl(memory_cgrp_subsys))
> > +		return false;
> > +
> > +	pr_info("Choosing a victim memcg because of the %s",
> > +		oc->memcg ?
> > +		"memory limit reached of cgroup " :
> > +		"system-wide OOM\n");
> > +	if (oc->memcg) {
> > +		pr_cont_cgroup_path(oc->memcg->css.cgroup);
> > +		pr_cont("\n");
> > +
> > +		chosen_memcg = oc->memcg;
> > +		parent = oc->memcg;
> > +	}
> > +
> > +	rcu_read_lock();
> > +
> > +	for (;;) {
> > +		css = css_next_child(css, &parent->css);
> > +		if (css) {
> > +			iter = mem_cgroup_from_css(css);
> > +
> > +			points = mem_cgroup_oom_badness(iter, oc->nodemask);
> > +			points += iter->oom_score_adj * (totalpages / 1000);
> > +
> > +			pr_info("Cgroup ");
> > +			pr_cont_cgroup_path(iter->css.cgroup);
> > +			pr_cont(": %ld\n", points);
> 
> Not sure if everyone wants to see these messages in the log.

What do you suggest? Remove the debug output entirely (though we probably
still want some), ratelimit it, or make it optional?

> 
> > +
> > +			if (points > chosen_memcg_points) {
> > +				chosen_memcg = iter;
> > +				chosen_memcg_points = points;
> > +				oc->chosen_points = points;
> > +			}
> > +
> > +			continue;
> > +		}
> > +
> > +		if (chosen_memcg && !chosen_memcg->oom_kill_all_tasks) {
> > +			/* Go deeper in the cgroup hierarchy */
> > +			totalpages = chosen_memcg_points;
> 
> We set 'totalpages' to the target cgroup limit (or the total RAM
> size) when computing a victim score. Why do you prefer to use
> chosen_memcg_points here instead? Why not the limit of the chosen
> cgroup?

Because I'm trying to implement a hierarchical oom_score_adj, so that if
a parent cgroup has oom_score_adj set to -1000, its descendants will
(almost) never be selected.

> > +			chosen_memcg_points = 0;
> > +
> > +			parent = chosen_memcg;
> > +			chosen_memcg = NULL;
> > +
> > +			continue;
> > +		}
> > +
> > +		if (!chosen_memcg && parent != root_mem_cgroup)
> > +			chosen_memcg = parent;
> > +
> > +		break;
> > +	}
> > +
> 
> > +	if (!oc->memcg) {
> > +		/*
> > +		 * We should also consider tasks in the root cgroup
> > +		 * with badness larger than oc->chosen_points
> > +		 */
> > +
> > +		struct css_task_iter it;
> > +		struct task_struct *task;
> > +		int ret = 0;
> > +
> > +		css_task_iter_start(&root_mem_cgroup->css, &it);
> > +		while (!ret && (task = css_task_iter_next(&it)))
> > +			ret = oom_evaluate_task(task, oc);
> > +		css_task_iter_end(&it);
> > +	}
> 
> IMHO it isn't quite correct to compare tasks from the root cgroup with
> leaf cgroups, because they are at different levels. Shouldn't we compare
> their scores only with the top level cgroups?

Not sure I follow your idea...
Of course, comparing tasks and cgroups is not really precise,
but it should hopefully be good enough for this purpose.
 
> As an alternative approach, may be, we could remove this branch
> altogether and ignore root tasks here (i.e. have any root task a higher
> priority a priori)? Perhaps, it could be acceptable, because normally
> the root cgroup only hosts kernel processes and init (at least this is
> the default systemd setup IIRC).
> 
> > +
> > +	if (!oc->chosen && chosen_memcg) {
> > +		pr_info("Chosen cgroup ");
> > +		pr_cont_cgroup_path(chosen_memcg->css.cgroup);
> > +		pr_cont(": %ld\n", oc->chosen_points);
> > +
> > +		if (chosen_memcg->oom_kill_all_tasks) {
> > +			css_get(&chosen_memcg->css);
> > +			oc->chosen_memcg = chosen_memcg;
> > +		} else {
> > +			/*
> > +			 * If we don't need to kill all tasks in the cgroup,
> > +			 * let's select the biggest task.
> > +			 */
> > +			oc->chosen_points = 0;
> 
> > +			select_bad_process(oc, chosen_memcg);
> 
> I think we'd better use mem_cgroup_scan_task() here directly, without
> exporting select_bad_process() from oom_kill.c. IMHO it would be more
> straightforward, because select_bad_process() has a branch handling the
> global OOM, which isn't used in this case. Come to think of it, wouldn't
> it be better to return the chosen cgroup in @oc and let out_of_memory()
> select a process within it or kill it as a whole depending on the value
> of the oom_kill_all_tasks flag?
> 
> Also, if the chosen cgroup has no tasks (which is perfectly possible if
> all memory within the cgroup is consumed by shmem e.g.), shouldn't we
> retry the cgroup selection?

Good point. Should we retry the cgroup selection or just ignore
non-populated cgroups during selection?

> 
> > +		}
> > +	} else if (oc->chosen)
> > +		pr_info("Chosen task %s (%d) in root cgroup: %ld\n",
> > +			oc->chosen->comm, oc->chosen->pid, oc->chosen_points);
> > +
> > +	rcu_read_unlock();
> > +
> > +	oc->chosen_points = 0;
> > +	return !!oc->chosen || !!oc->chosen_memcg;
> > +}
> > +
> > +static int __oom_kill_task(struct task_struct *tsk, void *arg)
> > +{
> > +	if (!is_global_init(tsk) && !(tsk->flags & PF_KTHREAD)) {
> > +		get_task_struct(tsk);
> > +		__oom_kill_process(tsk);
> > +	}
> > +	return 0;
> > +}
> > +
> > +bool mem_cgroup_kill_oom_victim(struct oom_control *oc)
> 
> I think it'd be OK to define this function in oom_kill.c - we
> have everything we need for that. We wouldn't have to export
> __oom_kill_process without oom_kill_process then, which is kinda
> ugly IMHO.
> 
> > +{
> > +	if (oc->chosen_memcg) {
> > +		/*
> > +		 * Kill all tasks in the cgroup hierarchy
> > +		 */
> > +		mem_cgroup_scan_tasks(oc->chosen_memcg,
> > +				      __oom_kill_task, NULL);
> > +
> > +		/*
> > +		 * Release oc->chosen_memcg
> > +		 */
> > +		css_put(&oc->chosen_memcg->css);
> > +		oc->chosen_memcg = NULL;
> > +	}
> > +
> > +	if (oc->chosen && oc->chosen != (void *)-1UL) {
> 
> > +		__oom_kill_process(oc->chosen);
> 
> Why don't you use oom_kill_process (without leading underscores) here?

Because oom_kill_process() has some unwanted side effects:
1) it can kill a process other than the specified one, and we don't need that optimization here;
2) it produces bulky debug output.

Thank you for review!

Roman

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH v2 1/7] mm, oom: refactor select_bad_process() to take memcg as an argument
  2017-06-04 22:50   ` David Rientjes
@ 2017-06-06 16:20     ` Roman Gushchin
  2017-06-06 20:42       ` David Rientjes
  0 siblings, 1 reply; 22+ messages in thread
From: Roman Gushchin @ 2017-06-06 16:20 UTC (permalink / raw)
  To: David Rientjes
  Cc: linux-mm, Tejun Heo, Johannes Weiner, Li Zefan, Michal Hocko,
	Vladimir Davydov, Tetsuo Handa, kernel-team, cgroups, linux-doc,
	linux-kernel

On Sun, Jun 04, 2017 at 03:50:37PM -0700, David Rientjes wrote:
> We use a heavily modified system and memcg oom killer and I'm wondering
> if there is some opportunity for collaboration because we may have some
> shared goals.
> 
> I can summarize how we currently use the oom killer at a high level so
> that it is not overwhelming with implementation details and give some
> rationale for why we have converged onto this strategy over the period of
> a few years.
> 
> For victim selection, we have strict priority based oom killing both at
> the memcg level and the process level.
> 
> Each process has its own "badness" value that is independent of
> oom_score_adj, although some conversion is done if a third-party thread
> chooses to disable itself from oom killing for backwards compatibility.
> Lower values are more preferable victims, but that detail should not
> matter significantly.  If two processes share the same badness value,
> tiebreaks are done by selecting the largest rss.
> 
> Each memcg in a hierarchy also has its own badness value which
> semantically means the same as the per-process value, although it
> considers the entire memcg as a unit, similar to your approach, when
> iterating the hierarchy to choose a process.  The benefit of the
> per-memcg and per-process approach is that you can kill the lowest
> priority process from the lowest priority memcg.
> 
> The above scoring is enabled with a VM sysctl for the system and is used
> for both system (global) and memcg oom kills.  For system overcommit,
> this means we can kill the lowest priority job on the system; for memcg,
> we can allow users to define their oom kill priorities at each level of
> their own hierarchy.
> 
> When the system or root of an oom memcg hierarchy encounters its limit,
> we iterate each level of the memcg hierarchy to find the lowest priority
> job.  This is done by comparing the badness of the sibling memcgs at
> each level, finding the lowest, and iterating that subtree.  If there are
> lower priority processes per the per-process badness value compared to
> all sibling memcgs, that process is killed.
> 
> We also have complete userspace oom handling support.  This complements
> the existing memory.oom_control notification when a memcg is oom with a
> separate notifier that notifies when the kernel has oom killed a process.
> It is possible to delay the oom killer from killing a process for memcg
> oom kills with a configurable, per-memcg, oom delay.  If set, the kernel
> will wait for userspace to respond to its oom notification and effect its
> own policy decisions until memory is uncharged to that memcg hierarchy,
> or the oom delay expires.  If the oom delay expires, the kernel oom
> killer kills a process based on badness.
> 
> Our oom kill notification file used to get an fd to register with
> cgroup.event_control also provides oom kill statistics based on system,
> memcg, local, hierarchical, and user-induced oom kills when read().
> 
> We also have a convenient "kill all" knob that userspace can write when
> handling oom conditions to iterate all threads attached to a particular
> memcg and kill them.  This is merely to prevent races where userspace
> does the oom killing itself, which is not problematic in itself, but
> additional tasks continue to be attached to an oom memcg.
> 
> A caveat here is that we also support fully inclusive kmem accounting to
> memcg hierarchies, so we call the oom killer as part of the memcg charge
> path rather than only upon returning from fault with VM_FAULT_OOM.  We
> have our own oom killer livelock detection that isn't necessarily
> important in this thread, but we haven't encountered a situation where we
> livelock by calling the oom killer during charge, and this is a
> requirement for memcg charging as part of slab allocation.
> 
> I could post many patches to implement all of this functionality that we
> have used for a few years, but I first wanted to send this email to see
> if there is any common ground or to better understand your methodology
> for using the kernel oom killer for both system and memcg oom kills.
> 
> Otherwise, very interesting stuff!

Hi David!

Thank you for sharing this!

It's very interesting, and it looks like
it's not that far from what I've suggested.

So we definitely need to come up with some common solution.

Thanks!

Roman


* Re: [RFC PATCH v2 1/7] mm, oom: refactor select_bad_process() to take memcg as an argument
  2017-06-06 16:20     ` Roman Gushchin
@ 2017-06-06 20:42       ` David Rientjes
  2017-06-08 15:59         ` Roman Gushchin
  0 siblings, 1 reply; 22+ messages in thread
From: David Rientjes @ 2017-06-06 20:42 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: linux-mm, Tejun Heo, Johannes Weiner, Li Zefan, Michal Hocko,
	Vladimir Davydov, Tetsuo Handa, kernel-team, cgroups, linux-doc,
	linux-kernel

On Tue, 6 Jun 2017, Roman Gushchin wrote:

> Hi David!
> 
> Thank you for sharing this!
> 
> It's very interesting, and it looks like,
> it's not that far from what I've suggested.
> 
> So we definitily need to come up with some common solution.
> 

Hi Roman,

Yes, definitely.  I could post a series of patches to do everything that 
was listed in my email sans the fully inclusive kmem accounting, which may 
be pursued at a later date, if it would be helpful to see where there is 
common ground?

Another question: what do you think about userspace oom handling?  We 
implement our own oom kill policies in userspace for both the system and 
for user-controlled memcg hierarchies because it often does not match the 
kernel implementation and there is some action that can be taken other 
than killing a process.  Have you tried to implement functionality to do 
userspace oom handling, or are you considering it?  This is the main 
motivation behind allowing an oom delay to be configured.


* Re: [RFC PATCH v2 1/7] mm, oom: refactor select_bad_process() to take memcg as an argument
  2017-06-06 20:42       ` David Rientjes
@ 2017-06-08 15:59         ` Roman Gushchin
  0 siblings, 0 replies; 22+ messages in thread
From: Roman Gushchin @ 2017-06-08 15:59 UTC (permalink / raw)
  To: David Rientjes
  Cc: linux-mm, Tejun Heo, Johannes Weiner, Li Zefan, Michal Hocko,
	Vladimir Davydov, Tetsuo Handa, kernel-team, cgroups, linux-doc,
	linux-kernel

On Tue, Jun 06, 2017 at 01:42:29PM -0700, David Rientjes wrote:
> On Tue, 6 Jun 2017, Roman Gushchin wrote:
> 
> > Hi David!
> > 
> > Thank you for sharing this!
> > 
> > It's very interesting, and it looks like,
> > it's not that far from what I've suggested.
> > 
> > So we definitily need to come up with some common solution.
> > 
> 

Hi David,

> Yes, definitely.  I could post a series of patches to do everything that 
> was listed in my email sans the fully inclusive kmem accounting, which may 
> be pursued at a later date, if it would be helpful to see where there is 
> common ground?
> 
> Another question is what you think about userspace oom handling?  We 
> implement our own oom kill policies in userspace for both the system and 
> for user-controlled memcg hierarchies because it often does not match the 
> kernel implementation and there is some action that can be taken other 
> than killing a process.  Have you tried to implement functionality to do 
> userspace oom handling, or are you considering it?  This is the main 
> motivation behind allowing an oom delay to be configured.

The cgroup v2 memory controller is built on the idea of preventing OOMs
by using the memory.high limit. This allows a userspace app to get notified
before an OOM happens (by watching the memory.events control file), so there is
(hopefully) no need for things like an oom delay.
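For readers unfamiliar with the mechanism being referred to: userspace sets memory.high below memory.max and watches the memory.events file, reacting to new "high" events before the OOM path is ever taken. A minimal, illustrative userspace parser for that file might look like the sketch below. Only the field names come from the cgroup v2 documentation; the helper names and the snapshot-diff approach are assumptions made for this sketch, not kernel code.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Counters as exposed by a cgroup v2 memory.events file. */
struct memory_events {
	unsigned long low;
	unsigned long high;
	unsigned long max;
	unsigned long oom;
	unsigned long oom_kill;
};

/* Parse "key value" lines from a memory.events buffer into ev. */
static int parse_memory_events(const char *buf, struct memory_events *ev)
{
	char key[32];
	unsigned long val;
	const char *p = buf;
	int n;

	memset(ev, 0, sizeof(*ev));
	while (sscanf(p, "%31s %lu%n", key, &val, &n) == 2) {
		if (!strcmp(key, "low"))
			ev->low = val;
		else if (!strcmp(key, "high"))
			ev->high = val;
		else if (!strcmp(key, "max"))
			ev->max = val;
		else if (!strcmp(key, "oom"))
			ev->oom = val;
		else if (!strcmp(key, "oom_kill"))
			ev->oom_kill = val;
		p += n;
	}
	return 0;
}

/* How many new "high" boundary hits since the previous snapshot? */
static unsigned long new_high_events(const struct memory_events *prev,
				     const struct memory_events *cur)
{
	return cur->high - prev->high;
}
```

A real watcher would re-read the file on a poll() notification and shrink its workload (or raise limits) when new_high_events() is nonzero.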

Actually, I'm trying to implement some minimal functionality in the kernel,
which will simplify the userspace part of the job and make it more consistent.
But, of course, the main goal of the patchset is to fix the unfairness
of the current victim selection.

Thanks!

Roman


* Re: [RFC PATCH v2 0/7] cgroup-aware OOM killer
  2017-06-01 18:35 [RFC PATCH v2 0/7] cgroup-aware OOM killer Roman Gushchin
                   ` (6 preceding siblings ...)
  2017-06-01 18:35 ` [RFC PATCH v2 7/7] mm,oom,docs: describe the " Roman Gushchin
@ 2017-06-09 16:30 ` Michal Hocko
  2017-06-22 17:10   ` Roman Gushchin
  7 siblings, 1 reply; 22+ messages in thread
From: Michal Hocko @ 2017-06-09 16:30 UTC (permalink / raw)
  To: Roman Gushchin; +Cc: linux-mm

On Thu 01-06-17 19:35:08, Roman Gushchin wrote:
> This patchset makes the OOM killer cgroup-aware.
> 
> Patches 1-3 are simple refactorings of the OOM killer code,
> required to reuse the code in the memory controller.
> Patches 4 & 5 are introducing new memcg settings:
> oom_kill_all_tasks and oom_score_adj.
> Patch 6 introduces the cgroup-aware OOM killer.
> Patch 7 is docs update.

I have only had a look at the cumulative diff (sorry I've been quite
busy throughout the week) and here are my high level comments. I can see
few rather serious issues which will need to be resolved before this
can move on.
- the first problem is a pre-existing one but it will get more urgent
  with the fact that more tasks will be killed with your approach. OOM
  victims are allowed to consume memory reserves without any bound. The
  current throttling is quite arguable and it relies on the fact that we
  try to limit the number of tasks to have this access to reserves.
  Theoretically, though, a heavily multithreaded application can deplete the
  reserves completely even now. With more processes being killed this
  will get much more likely. Johannes and I have already posted patches
  to address that. The last patch was
  http://lkml.kernel.org/r/1472723464-22866-2-git-send-email-mhocko@kernel.org
- I do not see any explicit lockout mechanism to prevent from too eager oom
  invocation while the previous oom killed memcg is still not torn down
  completely. We use tsk_is_oom_victim check in oom_evaluate_task for
  that purpose. You seem to rely on the fact that such a memcg would be
  still the largest one, right? I am not really sure this is sufficient.
- You seem to completely ignore per task oom_score_adj and override it
  by the memcg value. This makes some sense but it can lead to an
  unexpected behavior when somebody relies on the original behavior.
  E.g. a workload that would corrupt data when killed unexpectedly and
  so it is protected by OOM_SCORE_ADJ_MIN. Now this assumption will
  break when running inside a container. I do not have a good answer
  what is the desirable behavior and maybe there is no universal answer.
  Maybe you just do not kill those tasks? But then you have to be
  careful when selecting a memcg victim. Hairy...
- While we are at it oom_score_adj has turned out to be quite unusable
  for a sensible oom prioritization from my experience. Practically
  it reduced to disable/enforce the task for selection. The scale is
  quite small as well. There certainly is a need for prioritization
  and maybe a completely different api would be better. Maybe a simple
  priority in (0, infinity) range will be better. Priority could be used
  either as the only criterion or as a tie breaker when consumption of
  more memcgs is too close (this could be implemented for each strategy
  in a different way if we go modules way)
- oom_kill_all_tasks should be hierarchical and consistent within a
  hierarchy. Or maybe it should be applicable to memcgs with tasks (leaf
  cgroups). Although selecting a memcg higher in the hierarchy and killing all
  tasks in that hierarchy makes some sense as well IMHO. Say you
  delegate a hierarchy to an unprivileged user and still want to contain
  that user.

I have likely forgot some points but the above ones should be the most
important ones I guess.
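The tie-breaker idea from the prioritization bullet above can be sketched as a small userspace model. Everything in it is an assumption made for illustration: the 10% closeness threshold, the convention that a larger priority value means a more preferred victim, and the struct layout; it is not a proposal for the actual kernel interface.

```c
#include <assert.h>
#include <stddef.h>

struct memcg_candidate {
	unsigned long usage;	/* pages charged to the memcg */
	unsigned long priority;	/* larger => preferred OOM victim */
};

/* "Close" = within 10% of the bigger consumer (arbitrary threshold). */
static int usage_close(unsigned long a, unsigned long b)
{
	unsigned long big = a > b ? a : b;
	unsigned long small = a > b ? b : a;

	return big - small <= big / 10;
}

/*
 * Pick a victim among n sibling memcgs: the biggest consumer wins,
 * except that when two candidates' consumption is too close to call,
 * the configured priority decides between them.
 */
static size_t select_victim(const struct memcg_candidate *c, size_t n)
{
	size_t victim = 0;
	size_t i;

	for (i = 1; i < n; i++) {
		if (usage_close(c[i].usage, c[victim].usage)) {
			/* consumption is comparable: priority decides */
			if (c[i].priority > c[victim].priority)
				victim = i;
		} else if (c[i].usage > c[victim].usage) {
			victim = i;
		}
	}
	return victim;
}
```

With this shape, priority only matters when memory consumption alone cannot distinguish candidates, which keeps the primary criterion (actual usage) intact.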
-- 
Michal Hocko
SUSE Labs


* Re: [RFC PATCH v2 0/7] cgroup-aware OOM killer
  2017-06-09 16:30 ` [RFC PATCH v2 0/7] " Michal Hocko
@ 2017-06-22 17:10   ` Roman Gushchin
  2017-06-23 13:43     ` Michal Hocko
  0 siblings, 1 reply; 22+ messages in thread
From: Roman Gushchin @ 2017-06-22 17:10 UTC (permalink / raw)
  To: Michal Hocko; +Cc: linux-mm

Hi, Michal!

Thank you very much for the review. I've tried to address your
comments in v3 (sent yesterday), which is why it took some time to reply.

Please, find my comments below.

On Fri, Jun 09, 2017 at 06:30:24PM +0200, Michal Hocko wrote:
> On Thu 01-06-17 19:35:08, Roman Gushchin wrote:
> > This patchset makes the OOM killer cgroup-aware.
> > 
> > Patches 1-3 are simple refactorings of the OOM killer code,
> > required to reuse the code in the memory controller.
> > Patches 4 & 5 are introducing new memcg settings:
> > oom_kill_all_tasks and oom_score_adj.
> > Patch 6 introduces the cgroup-aware OOM killer.
> > Patch 7 is docs update.
> 
> I have only had a look at the cumulative diff (sorry I've been quite
> busy throughout the week) and here are my high level comments. I can see
> few rather serious issues which will need to be resolved before this
> can move on.
> - the first problem is a pre-existing one but it will get more urgent
>   with the fact that more tasks will be killed with your approach. OOM
>   victims are allowed to consume memory reserves without any bound. The
>   current throttling is quite arguable and it relies on the fact that we
>   try to limit the number of tasks to have this access to reserves.
>   Theoretically, though, a heavily multithread application can deplete the
>   reserves completely even now. With more processes being killed this
>   will get much more likely. Johannes and me have already posted patches
>   to address that. The last patch was
>   http://lkml.kernel.org/r/1472723464-22866-2-git-send-email-mhocko@kernel.org
> - I do not see any explicit lockout mechanism to prevent from too eager oom
>   invocation while the previous oom killed memcg is still not torn down
>   completely. We use tsk_is_oom_victim check in oom_evaluate_task for
>   that purpose. You seem to rely on the fact that such a memcg would be
>   still the largest one, right? I am not really sure this is sufficient.

I've explicitly added a synchronization mechanism based on oom_victims counter.

> - You seem to completely ignore per task oom_score_adj and override it
>   by the memcg value. This makes some sense but it can lead to an
>   unexpected behavior when somebody relies on the original behavior.
>   E.g. a workload that would corrupt data when killed unexpectedly and
>   so it is protected by OOM_SCORE_ADJ_MIN. Now this assumption will
>   break when running inside a container. I do not have a good answer
>   what is the desirable behavior and maybe there is no universal answer.
>   Maybe you just do not to kill those tasks? But then you have to be
>   careful when selecting a memcg victim. Hairy...

I do not ignore it completely, but it matters only for root cgroup tasks
and inside a cgroup when oom_kill_all_tasks is off.

I believe that the cgroup v2 requirement is good enough. I mean, you can't
move from v1 to v2 without changing cgroup settings, and if we provide
per-cgroup oom_score_adj, it will be enough to reproduce the old behavior.

Also, if you think it's necessary, I can add a sysctl to turn the cgroup-aware
oom killer off completely and provide a compatibility mode.
We can't really preserve the old system-wide behavior of per-process oom_score_adj;
it makes no sense in a containerized environment.

> - While we are at it oom_score_adj has turned out to be quite unusable
>   for a sensible oom prioritization from my experience. Practically
>   it reduced to disable/enforce the task for selection. The scale is
>   quite small as well. There certainly is a need for prioritization
>   and maybe a completely different api would be better. Maybe a simple
>   priority in (0, infinity) range will be better. Priority could be used
>   either as the only criterion or as a tie breaker when consumption of
>   more memcgs is too close (this could be implemented for each strategy
>   in a different way if we go modules way)

I had such a version, but the question of how to compare root cgroup tasks
and cgroups becomes even harder. But if you have any ideas here, that would be
great. I do not like this [-1000, 1000] range at all.

> - oom_kill_all_tasks should be hierarchical and consistent within a
>   hierarchy. Or maybe it should be applicable to memcgs with tasks (leaf
>   cgroups). Although selecting a memcg higher in the hierarchy kill all
>   tasks in that hierarchy makes some sense as well IMHO. Say you
>   delegate a hierarchy to an unprivileged user and still want to contain
>   that user.

It's already hierarchical, or did I miss something? Please explain
what you mean. If turned on for any cgroup (except root), it forces the oom killer
to treat the whole cgroup as a single entity, and to kill all tasks belonging to it
if it is selected as the oom victim.
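The semantics described here can be modeled with a small userspace sketch. This is a toy tree, not the kernel's mem_cgroup machinery; the struct layout and fixed array sizes are assumptions made purely for illustration. Usage is accounted hierarchically, and when the chosen memcg has oom_kill_all_tasks set, every task in its subtree is collected for killing.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_KILL 16

struct toy_memcg {
	unsigned long own_usage;	/* memory charged directly here */
	int kill_all;			/* the oom_kill_all_tasks knob */
	int task_ids[4];		/* tasks attached to this memcg */
	int nr_tasks;
	struct toy_memcg *children[4];
	int nr_children;
};

/* Hierarchical usage: own charges plus everything below. */
static unsigned long subtree_usage(const struct toy_memcg *cg)
{
	unsigned long sum = cg->own_usage;
	int i;

	for (i = 0; i < cg->nr_children; i++)
		sum += subtree_usage(cg->children[i]);
	return sum;
}

/* Collect every task in the subtree; returns the number collected. */
static int collect_victims(const struct toy_memcg *cg, int *out, int n)
{
	int i;

	for (i = 0; i < cg->nr_tasks && n < MAX_KILL; i++)
		out[n++] = cg->task_ids[i];
	for (i = 0; i < cg->nr_children; i++)
		n = collect_victims(cg->children[i], out, n);
	return n;
}
```

The point of the model: the knob on one cgroup changes how the whole subtree below it is treated at kill time, which is why it reads as hierarchical even though it is set on a single node.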

Roman


* Re: [RFC PATCH v2 0/7] cgroup-aware OOM killer
  2017-06-22 17:10   ` Roman Gushchin
@ 2017-06-23 13:43     ` Michal Hocko
  2017-06-23 18:39       ` Roman Gushchin
  0 siblings, 1 reply; 22+ messages in thread
From: Michal Hocko @ 2017-06-23 13:43 UTC (permalink / raw)
  To: Roman Gushchin; +Cc: linux-mm

On Thu 22-06-17 18:10:03, Roman Gushchin wrote:
> Hi, Michal!
> 
> Thank you very much for the review. I've tried to address your
> comments in v3 (sent yesterday), so that is why it took some time to reply.

I will try to look at it sometime next week, hopefully.

[...]
> > - You seem to completely ignore per task oom_score_adj and override it
> >   by the memcg value. This makes some sense but it can lead to an
> >   unexpected behavior when somebody relies on the original behavior.
> >   E.g. a workload that would corrupt data when killed unexpectedly and
> >   so it is protected by OOM_SCORE_ADJ_MIN. Now this assumption will
> >   break when running inside a container. I do not have a good answer
> >   what is the desirable behavior and maybe there is no universal answer.
> >   Maybe you just do not to kill those tasks? But then you have to be
> >   careful when selecting a memcg victim. Hairy...
> 
> I do not ignore it completely, but it matters only for root cgroup tasks
> and inside a cgroup when oom_kill_all_tasks is off.
> 
> I believe, that cgroup v2 requirement is a good enough. I mean you can't
> move from v1 to v2 without changing cgroup settings, and if we will provide
> per-cgroup oom_score_adj, it will be enough to reproduce the old behavior.
> 
> Also, if you think it's necessary, I can add a sysctl to turn the cgroup-aware
> oom killer off completely and provide compatibility mode.
> We can't really save the old system-wide behavior of per-process oom_score_adj,
> it makes no sense in the containerized environment.

So what are you going to do with those applications that simply cannot
be killed and which set OOM_SCORE_ADJ_MIN explicitly? Are they
unsupported? How does a user find out? One way around this could be
simply not to kill tasks with OOM_SCORE_ADJ_MIN.

> > - While we are at it oom_score_adj has turned out to be quite unusable
> >   for a sensible oom prioritization from my experience. Practically
> >   it reduced to disable/enforce the task for selection. The scale is
> >   quite small as well. There certainly is a need for prioritization
> >   and maybe a completely different api would be better. Maybe a simple
> >   priority in (0, infinity) range will be better. Priority could be used
> >   either as the only criterion or as a tie breaker when consumption of
> >   more memcgs is too close (this could be implemented for each strategy
> >   in a different way if we go modules way)
> 
> I had such a version, but the the question how to compare root cgroup tasks
> and cgroups becomes even harder. But if you have any ideas here, it's
> just great. I do not like this [-1000, 1000] range at all.

Dunno, would have to think about that some more.

> > - oom_kill_all_tasks should be hierarchical and consistent within a
> >   hierarchy. Or maybe it should be applicable to memcgs with tasks (leaf
> >   cgroups). Although selecting a memcg higher in the hierarchy kill all
> >   tasks in that hierarchy makes some sense as well IMHO. Say you
> >   delegate a hierarchy to an unprivileged user and still want to contain
> >   that user.
> 
> It's already hierarchical, or did I missed something? Please, explain
> what you mean. If turned on for any cgroup (except root), it forces oom killer
> to treat the whole cgroup as a single enitity, and kill all belonging tasks
> if it is selected to be the oom victim.

My fault! I can see that mem_cgroup_scan_tasks will iterate over the whole
hierarchy (with oc->chosen_memcg as the root). Setting oom_kill_all_tasks
differently from the level above in the hierarchy can be confusing, but it
makes sense and is consistent with other knobs.

-- 
Michal Hocko
SUSE Labs


* Re: [RFC PATCH v2 0/7] cgroup-aware OOM killer
  2017-06-23 13:43     ` Michal Hocko
@ 2017-06-23 18:39       ` Roman Gushchin
  2017-06-26 11:55         ` Michal Hocko
  0 siblings, 1 reply; 22+ messages in thread
From: Roman Gushchin @ 2017-06-23 18:39 UTC (permalink / raw)
  To: Michal Hocko; +Cc: linux-mm

On Fri, Jun 23, 2017 at 03:43:24PM +0200, Michal Hocko wrote:
> On Thu 22-06-17 18:10:03, Roman Gushchin wrote:
> > Hi, Michal!
> > 
> > Thank you very much for the review. I've tried to address your
> > comments in v3 (sent yesterday), so that is why it took some time to reply.
> 
> I will try to look at it sometimes next week hopefully

Thanks!

> > > - You seem to completely ignore per task oom_score_adj and override it
> > >   by the memcg value. This makes some sense but it can lead to an
> > >   unexpected behavior when somebody relies on the original behavior.
> > >   E.g. a workload that would corrupt data when killed unexpectedly and
> > >   so it is protected by OOM_SCORE_ADJ_MIN. Now this assumption will
> > >   break when running inside a container. I do not have a good answer
> > >   what is the desirable behavior and maybe there is no universal answer.
> > >   Maybe you just do not to kill those tasks? But then you have to be
> > >   careful when selecting a memcg victim. Hairy...
> > 
> > I do not ignore it completely, but it matters only for root cgroup tasks
> > and inside a cgroup when oom_kill_all_tasks is off.
> > 
> > I believe, that cgroup v2 requirement is a good enough. I mean you can't
> > move from v1 to v2 without changing cgroup settings, and if we will provide
> > per-cgroup oom_score_adj, it will be enough to reproduce the old behavior.
> > 
> > Also, if you think it's necessary, I can add a sysctl to turn the cgroup-aware
> > oom killer off completely and provide compatibility mode.
> > We can't really save the old system-wide behavior of per-process oom_score_adj,
> > it makes no sense in the containerized environment.
> 
> So what you are going to do with those applications that simply cannot
> be killed and which set OOM_SCORE_ADJ_MIN explicitly. Are they
> unsupported? How does a user find out? One way around this could be to
> simply to not kill tasks with OOM_SCORE_ADJ_MIN.

They won't be killed by a cgroup OOM, but under some circumstances they can be
killed by the global OOM (e.g. there are no other tasks in the selected cgroup,
cgroup v2 is used, and the per-cgroup oom score adjustment is not set).

I believe that per-process oom_score_adj should not play any role outside
of the containing cgroup; that would be a violation of isolation.

Right now, if tasks with oom_score_adj=-1000 are eating all memory in a cgroup,
they will loop forever; the OOM killer can't fix this.
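The protection being argued about comes from the badness calculation: a task with oom_score_adj set to OOM_SCORE_ADJ_MIN (-1000) is never selected, no matter how much memory it uses. A deliberately simplified userspace model of the kernel's oom_badness() shows the shape of it; the real function also accounts for page tables and swap and has more corner cases, so treat this as a sketch only.

```c
#include <assert.h>

#define OOM_SCORE_ADJ_MIN (-1000)
#define OOM_SCORE_ADJ_MAX 1000

/*
 * Toy badness: roughly the task's footprint in pages, shifted by
 * oom_score_adj scaled against total available pages.  A task at
 * OOM_SCORE_ADJ_MIN is unkillable regardless of its size, which is
 * exactly the property that makes the looping-forever scenario
 * above possible.
 */
static long toy_badness(unsigned long rss_pages, int oom_score_adj,
			unsigned long totalpages)
{
	long points;

	if (oom_score_adj == OOM_SCORE_ADJ_MIN)
		return 0;	/* never select this task */

	points = (long)rss_pages;
	points += (long)oom_score_adj * (long)(totalpages / 1000);

	/* negative scores still leave the task (barely) selectable */
	return points > 0 ? points : 1;
}
```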


* Re: [RFC PATCH v2 0/7] cgroup-aware OOM killer
  2017-06-23 18:39       ` Roman Gushchin
@ 2017-06-26 11:55         ` Michal Hocko
  0 siblings, 0 replies; 22+ messages in thread
From: Michal Hocko @ 2017-06-26 11:55 UTC (permalink / raw)
  To: Roman Gushchin; +Cc: linux-mm

On Fri 23-06-17 19:39:46, Roman Gushchin wrote:
> On Fri, Jun 23, 2017 at 03:43:24PM +0200, Michal Hocko wrote:
> > On Thu 22-06-17 18:10:03, Roman Gushchin wrote:
> > > Hi, Michal!
> > > 
> > > Thank you very much for the review. I've tried to address your
> > > comments in v3 (sent yesterday), so that is why it took some time to reply.
> > 
> > I will try to look at it sometimes next week hopefully
> 
> Thanks!
> 
> > > > - You seem to completely ignore per task oom_score_adj and override it
> > > >   by the memcg value. This makes some sense but it can lead to an
> > > >   unexpected behavior when somebody relies on the original behavior.
> > > >   E.g. a workload that would corrupt data when killed unexpectedly and
> > > >   so it is protected by OOM_SCORE_ADJ_MIN. Now this assumption will
> > > >   break when running inside a container. I do not have a good answer
> > > >   what is the desirable behavior and maybe there is no universal answer.
> > > >   Maybe you just do not to kill those tasks? But then you have to be
> > > >   careful when selecting a memcg victim. Hairy...
> > > 
> > > I do not ignore it completely, but it matters only for root cgroup tasks
> > > and inside a cgroup when oom_kill_all_tasks is off.
> > > 
> > > I believe, that cgroup v2 requirement is a good enough. I mean you can't
> > > move from v1 to v2 without changing cgroup settings, and if we will provide
> > > per-cgroup oom_score_adj, it will be enough to reproduce the old behavior.
> > > 
> > > Also, if you think it's necessary, I can add a sysctl to turn the cgroup-aware
> > > oom killer off completely and provide compatibility mode.
> > > We can't really save the old system-wide behavior of per-process oom_score_adj,
> > > it makes no sense in the containerized environment.
> > 
> > So what you are going to do with those applications that simply cannot
> > be killed and which set OOM_SCORE_ADJ_MIN explicitly. Are they
> > unsupported? How does a user find out? One way around this could be to
> > simply to not kill tasks with OOM_SCORE_ADJ_MIN.
> 
> They won't be killed by cgroup OOM, but under some circumstances can be killed
> by the global OOM (e.g. there are no other tasks in the selected cgroup,
> cgroup v2 is used, and per-cgroup oom score adjustment is not set).

Hmm, mem_cgroup_select_oom_victim will happily select a memcg which
contains OOM_SCORE_ADJ_MIN tasks because it ignores per-task score adj.
So the memcg OOM killer can kill those tasks AFAICS. But that is not all
that important. Because...

> I believe, that per-process oom_score_adj should not play any role outside
> of the containing cgroup, it's violation of isolation.
> 
> Right now if tasks with oom_score_adj=-1000 eating all memory in a cgroup,
> they will be looping forever, OOM killer can't fix this.

... Yes, and that is the price we have to pay for the hard requirement
that the oom killer never kills an OOM_SCORE_ADJ_MIN task. It is hard to
change that without breaking existing userspace which relies on that
configuration to protect it from an unexpected SIGKILL.

-- 
Michal Hocko
SUSE Labs


end of thread, other threads:[~2017-06-26 11:55 UTC | newest]

Thread overview: 22+ messages
2017-06-01 18:35 [RFC PATCH v2 0/7] cgroup-aware OOM killer Roman Gushchin
2017-06-01 18:35 ` [RFC PATCH v2 1/7] mm, oom: refactor select_bad_process() to take memcg as an argument Roman Gushchin
2017-06-04 19:25   ` Vladimir Davydov
2017-06-04 22:50   ` David Rientjes
2017-06-06 16:20     ` Roman Gushchin
2017-06-06 20:42       ` David Rientjes
2017-06-08 15:59         ` Roman Gushchin
2017-06-01 18:35 ` [RFC PATCH v2 2/7] mm, oom: split oom_kill_process() and export __oom_kill_process() Roman Gushchin
2017-06-01 18:35 ` [RFC PATCH v2 3/7] mm, oom: export oom_evaluate_task() and select_bad_process() Roman Gushchin
2017-06-01 18:35 ` [RFC PATCH v2 4/7] mm, oom: introduce oom_kill_all_tasks option for memory cgroups Roman Gushchin
2017-06-04 19:30   ` Vladimir Davydov
2017-06-01 18:35 ` [RFC PATCH v2 5/7] mm, oom: introduce oom_score_adj " Roman Gushchin
2017-06-04 19:39   ` Vladimir Davydov
2017-06-01 18:35 ` [RFC PATCH v2 6/7] mm, oom: cgroup-aware OOM killer Roman Gushchin
2017-06-04 20:43   ` Vladimir Davydov
2017-06-06 15:59     ` Roman Gushchin
2017-06-01 18:35 ` [RFC PATCH v2 7/7] mm,oom,docs: describe the " Roman Gushchin
2017-06-09 16:30 ` [RFC PATCH v2 0/7] " Michal Hocko
2017-06-22 17:10   ` Roman Gushchin
2017-06-23 13:43     ` Michal Hocko
2017-06-23 18:39       ` Roman Gushchin
2017-06-26 11:55         ` Michal Hocko
