linux-mm.kvack.org archive mirror
* [v3 0/6] cgroup-aware OOM killer
@ 2017-06-21 21:19 Roman Gushchin
  2017-06-21 21:19 ` [v3 1/6] mm, oom: use oom_victims counter to synchronize oom victim selection Roman Gushchin
                   ` (5 more replies)
  0 siblings, 6 replies; 21+ messages in thread
From: Roman Gushchin @ 2017-06-21 21:19 UTC (permalink / raw)
  To: linux-mm; +Cc: Roman Gushchin

This patchset makes the OOM killer cgroup-aware.

Patch 1 makes out_of_memory() look at the oom_victims counter
      	to decide if a new victim is required.

Patch 2 is the main patch, which implements the cgroup-aware OOM killer.

Patch 3 adds some debug output, which can be refined later.

Patch 4 introduces per-cgroup oom_score_adj knob.

Patch 5 fixes a problem with too many processes being granted
      	access to the memory reserves.

Patch 6 is a docs update.

v1:
  https://lkml.org/lkml/2017/5/18/969

v2:
  - Reworked victim selection based on feedback
    from Michal Hocko, Vladimir Davydov and Johannes Weiner
  - "Kill all tasks" is now an opt-in option, by default
    only one process will be killed
  - Added per-cgroup oom_score_adj
  - Refined oom score calculations, suggested by Vladimir Davydov
  - Converted to a patchset

v3:
  - Fixed swap accounting
  - Switched to use oom_victims counter to prevent unnecessary kills
  - TIF_MEMDIE is set only when necessary
  - Moved all oom victim killing code into oom_kill.c
  - Merged commits 1-4 into 6
  - Separated oom_score_adj logic into a separate commit 4
  - Separated debug output into a separate commit 3

Roman Gushchin (6):
  mm, oom: use oom_victims counter to synchronize oom victim selection
  mm, oom: cgroup-aware OOM killer
  mm, oom: cgroup-aware OOM killer debug info
  mm, oom: introduce oom_score_adj for memory cgroups
  mm, oom: don't mark all oom victims tasks with TIF_MEMDIE
  mm,oom,docs: describe the cgroup-aware OOM killer

 Documentation/cgroup-v2.txt |  44 ++++++++++
 include/linux/memcontrol.h  |  23 +++++
 include/linux/oom.h         |   3 +
 kernel/exit.c               |   2 +-
 mm/memcontrol.c             | 209 ++++++++++++++++++++++++++++++++++++++++++++
 mm/oom_kill.c               | 202 ++++++++++++++++++++++++++++--------------
 6 files changed, 416 insertions(+), 67 deletions(-)

-- 
2.7.4

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <dont@kvack.org>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [v3 1/6] mm, oom: use oom_victims counter to synchronize oom victim selection
  2017-06-21 21:19 [v3 0/6] cgroup-aware OOM killer Roman Gushchin
@ 2017-06-21 21:19 ` Roman Gushchin
       [not found]   ` <201706220040.v5M0eSnK074332@www262.sakura.ne.jp>
  2017-06-29  9:04   ` Michal Hocko
  2017-06-21 21:19 ` [v3 2/6] mm, oom: cgroup-aware OOM killer Roman Gushchin
                   ` (4 subsequent siblings)
  5 siblings, 2 replies; 21+ messages in thread
From: Roman Gushchin @ 2017-06-21 21:19 UTC (permalink / raw)
  To: linux-mm
  Cc: Roman Gushchin, Michal Hocko, Vladimir Davydov, Johannes Weiner,
	Tejun Heo, Tetsuo Handa, kernel-team, cgroups, linux-doc,
	linux-kernel

The OOM killer should avoid unnecessary kills. To prevent them, we
currently check during the task list traversal for tasks which were
previously selected as oom victims. If there is such a task, no new
victim is selected.

This approach is sub-optimal (we're doing a costly iteration over the
task list every time) and will not work for the cgroup-aware oom killer.

We already have the oom_victims counter, which can be used effectively
for this task.

If there are victims in flight, don't do anything; once the counter
falls to 0, there are no more oom victims left, so it's a good time
to start looking for a new victim.

Signed-off-by: Roman Gushchin <guro@fb.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: kernel-team@fb.com
Cc: cgroups@vger.kernel.org
Cc: linux-doc@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
---
 mm/oom_kill.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 0e2c925..e3aaf5c8 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -992,6 +992,13 @@ bool out_of_memory(struct oom_control *oc)
 	if (oom_killer_disabled)
 		return false;
 
+	/*
+	 * If there are oom victims in flight, we don't need to select
+	 * a new victim.
+	 */
+	if (atomic_read(&oom_victims) > 0)
+		return true;
+
 	if (!is_memcg_oom(oc)) {
 		blocking_notifier_call_chain(&oom_notify_list, 0, &freed);
 		if (freed > 0)
-- 
2.7.4



* [v3 2/6] mm, oom: cgroup-aware OOM killer
  2017-06-21 21:19 [v3 0/6] cgroup-aware OOM killer Roman Gushchin
  2017-06-21 21:19 ` [v3 1/6] mm, oom: use oom_victims counter to synchronize oom victim selection Roman Gushchin
@ 2017-06-21 21:19 ` Roman Gushchin
  2017-07-10 23:05   ` David Rientjes
  2017-06-21 21:19 ` [v3 3/6] mm, oom: cgroup-aware OOM killer debug info Roman Gushchin
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 21+ messages in thread
From: Roman Gushchin @ 2017-06-21 21:19 UTC (permalink / raw)
  To: linux-mm
  Cc: Roman Gushchin, Michal Hocko, Vladimir Davydov, Johannes Weiner,
	Tetsuo Handa, David Rientjes, Tejun Heo, kernel-team, cgroups,
	linux-doc, linux-kernel

Traditionally, the OOM killer operates at the process level.
Under oom conditions, it finds the process with the highest oom score
and kills it.

This behavior doesn't suit systems with many running containers
well. There are three main issues:

1) There is no fairness between containers. A small container with
a few large processes will be chosen over a large one with a huge
number of small processes.

2) Containers often do not expect that some random process inside
will be killed. In many cases a much safer behavior is to kill
all tasks in the container. Traditionally, this was implemented
in userspace, but doing it in the kernel has some advantages,
especially in the case of a system-wide OOM.

3) Per-process oom_score_adj affects global OOM, so it's a breach
in the isolation.

To address these issues, a cgroup-aware OOM killer is introduced.

Under OOM conditions, it tries to find the biggest memory consumer
and free memory by killing the corresponding task(s). The difference
from the "traditional" OOM killer is that it can treat memory cgroups
as memory consumers as well as single processes.

By default, it will look for the biggest leaf cgroup and kill
the largest task inside.

But a user can change this behavior by enabling the per-cgroup
oom_kill_all_tasks option. If set, it causes the OOM killer to treat
the whole cgroup as an indivisible memory consumer. If it is selected
as an OOM victim, all belonging tasks will be killed.

Tasks in the root cgroup are treated as independent memory consumers,
and are compared with other memory consumers (e.g. leaf cgroups).
The root cgroup doesn't support the oom_kill_all_tasks feature.

Signed-off-by: Roman Gushchin <guro@fb.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: David Rientjes <rientjes@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: kernel-team@fb.com
Cc: cgroups@vger.kernel.org
Cc: linux-doc@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
---
 include/linux/memcontrol.h |  20 ++++++
 include/linux/oom.h        |   3 +
 mm/memcontrol.c            | 155 ++++++++++++++++++++++++++++++++++++++++++
 mm/oom_kill.c              | 164 +++++++++++++++++++++++++++++----------------
 4 files changed, 285 insertions(+), 57 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 3914e3d..c59926c 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -35,6 +35,7 @@ struct mem_cgroup;
 struct page;
 struct mm_struct;
 struct kmem_cache;
+struct oom_control;
 
 /* Cgroup-specific page state, on top of universal node page state */
 enum memcg_stat_item {
@@ -199,6 +200,9 @@ struct mem_cgroup {
 	/* OOM-Killer disable */
 	int		oom_kill_disable;
 
+	/* kill all tasks in the subtree in case of OOM */
+	bool oom_kill_all_tasks;
+
 	/* handle for "memory.events" */
 	struct cgroup_file events_file;
 
@@ -342,6 +346,11 @@ struct mem_cgroup *mem_cgroup_from_css(struct cgroup_subsys_state *css){
 	return css ? container_of(css, struct mem_cgroup, css) : NULL;
 }
 
+static inline void mem_cgroup_put(struct mem_cgroup *memcg)
+{
+	css_put(&memcg->css);
+}
+
 #define mem_cgroup_from_counter(counter, member)	\
 	container_of(counter, struct mem_cgroup, member)
 
@@ -480,6 +489,8 @@ static inline bool task_in_memcg_oom(struct task_struct *p)
 
 bool mem_cgroup_oom_synchronize(bool wait);
 
+bool mem_cgroup_select_oom_victim(struct oom_control *oc);
+
 #ifdef CONFIG_MEMCG_SWAP
 extern int do_swap_account;
 #endif
@@ -739,6 +750,10 @@ static inline bool task_in_mem_cgroup(struct task_struct *task,
 	return true;
 }
 
+static inline void mem_cgroup_put(struct mem_cgroup *memcg)
+{
+}
+
 static inline struct mem_cgroup *
 mem_cgroup_iter(struct mem_cgroup *root,
 		struct mem_cgroup *prev,
@@ -926,6 +941,11 @@ static inline
 void count_memcg_event_mm(struct mm_struct *mm, enum vm_event_item idx)
 {
 }
+
+static inline bool mem_cgroup_select_oom_victim(struct oom_control *oc)
+{
+	return false;
+}
 #endif /* CONFIG_MEMCG */
 
 static inline void __inc_memcg_state(struct mem_cgroup *memcg,
diff --git a/include/linux/oom.h b/include/linux/oom.h
index 8a266e2..b7ec3bd 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -39,6 +39,7 @@ struct oom_control {
 	unsigned long totalpages;
 	struct task_struct *chosen;
 	unsigned long chosen_points;
+	struct mem_cgroup *chosen_memcg;
 };
 
 extern struct mutex oom_lock;
@@ -79,6 +80,8 @@ extern void oom_killer_enable(void);
 
 extern struct task_struct *find_lock_task_mm(struct task_struct *p);
 
+extern int oom_evaluate_task(struct task_struct *task, void *arg);
+
 /* sysctls */
 extern int sysctl_oom_dump_tasks;
 extern int sysctl_oom_kill_allocating_task;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 544d47e..bdb5103 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2625,6 +2625,128 @@ static inline bool memcg_has_children(struct mem_cgroup *memcg)
 	return ret;
 }
 
+static long mem_cgroup_oom_badness(struct mem_cgroup *memcg,
+				   const nodemask_t *nodemask)
+{
+	long points = 0;
+	int nid;
+	struct mem_cgroup *iter;
+
+	for_each_mem_cgroup_tree(iter, memcg) {
+		for_each_node_state(nid, N_MEMORY) {
+			if (nodemask && !node_isset(nid, *nodemask))
+				continue;
+
+			points += mem_cgroup_node_nr_lru_pages(iter, nid,
+					LRU_ALL_ANON | BIT(LRU_UNEVICTABLE));
+		}
+
+		points += memcg_page_state(iter, MEMCG_KERNEL_STACK_KB) /
+			(PAGE_SIZE / 1024);
+		points += memcg_page_state(iter, NR_SLAB_UNRECLAIMABLE);
+		points += memcg_page_state(iter, MEMCG_SOCK);
+		points += memcg_page_state(iter, MEMCG_SWAP);
+	}
+
+	return points;
+}
+
+bool mem_cgroup_select_oom_victim(struct oom_control *oc)
+{
+	struct cgroup_subsys_state *css = NULL;
+	struct mem_cgroup *iter = NULL;
+	struct mem_cgroup *chosen_memcg = NULL;
+	struct mem_cgroup *parent = root_mem_cgroup;
+	unsigned long totalpages = oc->totalpages;
+	long chosen_memcg_points = 0;
+	long points = 0;
+
+	oc->chosen = NULL;
+	oc->chosen_memcg = NULL;
+
+	if (mem_cgroup_disabled())
+		return false;
+
+	if (!cgroup_subsys_on_dfl(memory_cgrp_subsys))
+		return false;
+	if (oc->memcg) {
+		chosen_memcg = oc->memcg;
+		parent = oc->memcg;
+	}
+
+	rcu_read_lock();
+
+	for (;;) {
+		css = css_next_child(css, &parent->css);
+		if (css) {
+			iter = mem_cgroup_from_css(css);
+
+			points = mem_cgroup_oom_badness(iter, oc->nodemask);
+
+			if (points > chosen_memcg_points) {
+				chosen_memcg = iter;
+				chosen_memcg_points = points;
+				oc->chosen_points = points;
+			}
+
+			continue;
+		}
+
+		if (chosen_memcg && !chosen_memcg->oom_kill_all_tasks) {
+			/* Go deeper in the cgroup hierarchy */
+			totalpages = chosen_memcg_points;
+			chosen_memcg_points = 0;
+
+			parent = chosen_memcg;
+			chosen_memcg = NULL;
+
+			continue;
+		}
+
+		if (!chosen_memcg && parent != root_mem_cgroup)
+			chosen_memcg = parent;
+
+		break;
+	}
+
+	if (!oc->memcg) {
+		/*
+		 * We should also consider tasks in the root cgroup
+		 * with badness larger than oc->chosen_points
+		 */
+
+		struct css_task_iter it;
+		struct task_struct *task;
+		int ret = 0;
+
+		css_task_iter_start(&root_mem_cgroup->css, &it);
+		while (!ret && (task = css_task_iter_next(&it)))
+			ret = oom_evaluate_task(task, oc);
+		css_task_iter_end(&it);
+	}
+
+	if (!oc->chosen && chosen_memcg) {
+		if (chosen_memcg->oom_kill_all_tasks) {
+			css_get(&chosen_memcg->css);
+			oc->chosen_memcg = chosen_memcg;
+		}
+
+		/*
+		 * Even if we have to kill all tasks in the cgroup,
+		 * we need to select the biggest task to start with.
+		 * This task will receive an access to the memory
+		 * reserves.
+		 */
+		oc->chosen_points = 0;
+		mem_cgroup_scan_tasks(chosen_memcg, oom_evaluate_task, oc);
+	}
+
+	rcu_read_unlock();
+
+	oc->chosen_points = 0;
+	return !!oc->chosen || !!oc->chosen_memcg;
+}
+
 /*
  * Reclaims as many pages from the given memcg as possible.
  *
@@ -5166,6 +5288,33 @@ static ssize_t memory_max_write(struct kernfs_open_file *of,
 	return nbytes;
 }
 
+static int memory_oom_kill_all_tasks_show(struct seq_file *m, void *v)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m));
+	bool oom_kill_all_tasks = memcg->oom_kill_all_tasks;
+
+	seq_printf(m, "%d\n", oom_kill_all_tasks);
+
+	return 0;
+}
+
+static ssize_t memory_oom_kill_all_tasks_write(struct kernfs_open_file *of,
+					       char *buf, size_t nbytes,
+					       loff_t off)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
+	int oom_kill_all_tasks;
+	int err;
+
+	err = kstrtoint(strstrip(buf), 0, &oom_kill_all_tasks);
+	if (err)
+		return err;
+
+	memcg->oom_kill_all_tasks = !!oom_kill_all_tasks;
+
+	return nbytes;
+}
+
 static int memory_events_show(struct seq_file *m, void *v)
 {
 	struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m));
@@ -5286,6 +5435,12 @@ static struct cftype memory_files[] = {
 		.write = memory_max_write,
 	},
 	{
+		.name = "oom_kill_all_tasks",
+		.flags = CFTYPE_NOT_ON_ROOT,
+		.seq_show = memory_oom_kill_all_tasks_show,
+		.write = memory_oom_kill_all_tasks_write,
+	},
+	{
 		.name = "events",
 		.flags = CFTYPE_NOT_ON_ROOT,
 		.file_offset = offsetof(struct mem_cgroup, events_file),
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index e3aaf5c8..489ab69 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -288,7 +288,7 @@ static enum oom_constraint constrained_alloc(struct oom_control *oc)
 	return CONSTRAINT_NONE;
 }
 
-static int oom_evaluate_task(struct task_struct *task, void *arg)
+int oom_evaluate_task(struct task_struct *task, void *arg)
 {
 	struct oom_control *oc = arg;
 	unsigned long points;
@@ -802,66 +802,14 @@ static bool task_will_free_mem(struct task_struct *task)
 	return ret;
 }
 
-static void oom_kill_process(struct oom_control *oc, const char *message)
+static void __oom_kill_process(struct task_struct *victim)
 {
-	struct task_struct *p = oc->chosen;
-	unsigned int points = oc->chosen_points;
-	struct task_struct *victim = p;
-	struct task_struct *child;
-	struct task_struct *t;
+	struct task_struct *p;
 	struct mm_struct *mm;
-	unsigned int victim_points = 0;
-	static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
-					      DEFAULT_RATELIMIT_BURST);
 	bool can_oom_reap = true;
 
-	/*
-	 * If the task is already exiting, don't alarm the sysadmin or kill
-	 * its children or threads, just set TIF_MEMDIE so it can die quickly
-	 */
-	task_lock(p);
-	if (task_will_free_mem(p)) {
-		mark_oom_victim(p);
-		wake_oom_reaper(p);
-		task_unlock(p);
-		put_task_struct(p);
+	if (is_global_init(victim) || (victim->flags & PF_KTHREAD))
 		return;
-	}
-	task_unlock(p);
-
-	if (__ratelimit(&oom_rs))
-		dump_header(oc, p);
-
-	pr_err("%s: Kill process %d (%s) score %u or sacrifice child\n",
-		message, task_pid_nr(p), p->comm, points);
-
-	/*
-	 * If any of p's children has a different mm and is eligible for kill,
-	 * the one with the highest oom_badness() score is sacrificed for its
-	 * parent.  This attempts to lose the minimal amount of work done while
-	 * still freeing memory.
-	 */
-	read_lock(&tasklist_lock);
-	for_each_thread(p, t) {
-		list_for_each_entry(child, &t->children, sibling) {
-			unsigned int child_points;
-
-			if (process_shares_mm(child, p->mm))
-				continue;
-			/*
-			 * oom_badness() returns 0 if the thread is unkillable
-			 */
-			child_points = oom_badness(child,
-				oc->memcg, oc->nodemask, oc->totalpages);
-			if (child_points > victim_points) {
-				put_task_struct(victim);
-				victim = child;
-				victim_points = child_points;
-				get_task_struct(victim);
-			}
-		}
-	}
-	read_unlock(&tasklist_lock);
 
 	p = find_lock_task_mm(victim);
 	if (!p) {
@@ -932,10 +880,107 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
 		wake_oom_reaper(victim);
 
 	mmdrop(mm);
-	put_task_struct(victim);
 }
 #undef K
 
+static void oom_kill_process(struct oom_control *oc, const char *message)
+{
+	struct task_struct *p = oc->chosen;
+	unsigned int points = oc->chosen_points;
+	struct task_struct *victim = p;
+	struct task_struct *child;
+	struct task_struct *t;
+	unsigned int victim_points = 0;
+	static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
+					      DEFAULT_RATELIMIT_BURST);
+
+	/*
+	 * If the task is already exiting, don't alarm the sysadmin or kill
+	 * its children or threads, just set TIF_MEMDIE so it can die quickly
+	 */
+	task_lock(p);
+	if (task_will_free_mem(p)) {
+		mark_oom_victim(p);
+		wake_oom_reaper(p);
+		task_unlock(p);
+		put_task_struct(p);
+		return;
+	}
+	task_unlock(p);
+
+	if (__ratelimit(&oom_rs))
+		dump_header(oc, p);
+
+	pr_err("%s: Kill process %d (%s) score %u or sacrifice child\n",
+		message, task_pid_nr(p), p->comm, points);
+
+	/*
+	 * If any of p's children has a different mm and is eligible for kill,
+	 * the one with the highest oom_badness() score is sacrificed for its
+	 * parent.  This attempts to lose the minimal amount of work done while
+	 * still freeing memory.
+	 */
+	read_lock(&tasklist_lock);
+	for_each_thread(p, t) {
+		list_for_each_entry(child, &t->children, sibling) {
+			unsigned int child_points;
+
+			if (process_shares_mm(child, p->mm))
+				continue;
+			/*
+			 * oom_badness() returns 0 if the thread is unkillable
+			 */
+			child_points = oom_badness(child,
+				oc->memcg, oc->nodemask, oc->totalpages);
+			if (child_points > victim_points) {
+				put_task_struct(victim);
+				victim = child;
+				victim_points = child_points;
+				get_task_struct(victim);
+			}
+		}
+	}
+	read_unlock(&tasklist_lock);
+
+	__oom_kill_process(victim);
+	put_task_struct(victim);
+}
+
+static int oom_kill_memcg_member(struct task_struct *task, void *unused)
+{
+	if (!tsk_is_oom_victim(task))
+		__oom_kill_process(task);
+	return 0;
+}
+
+static bool oom_kill_memcg_victim(struct oom_control *oc)
+{
+	bool ret = !!oc->chosen;
+
+	if (oc->chosen && oc->chosen != (void *)-1UL) {
+		__oom_kill_process(oc->chosen);
+		put_task_struct(oc->chosen);
+	}
+
+	if (oc->chosen_memcg) {
+		/*
+		 * Kill all tasks in the cgroup hierarchy
+		 */
+		mem_cgroup_scan_tasks(oc->chosen_memcg, oom_kill_memcg_member,
+				      NULL);
+		mem_cgroup_put(oc->chosen_memcg);
+		oc->chosen_memcg = NULL;
+	}
+
+	/*
+	 * Reset points before falling back to the
+	 * old per-process OOM victim selection logic
+	 */
+	oc->chosen_points = 0;
+	oc->chosen = NULL;
+	return ret;
+}
+
 /*
  * Determines whether the kernel must panic because of the panic_on_oom sysctl.
  */
@@ -1044,6 +1089,11 @@ bool out_of_memory(struct oom_control *oc)
 		return true;
 	}
 
+	if (mem_cgroup_select_oom_victim(oc) && oom_kill_memcg_victim(oc)) {
+		schedule_timeout_killable(1);
+		return true;
+	}
+
 	select_bad_process(oc);
 	/* Found nothing?!?! Either we hang forever, or we panic. */
 	if (!oc->chosen && !is_sysrq_oom(oc) && !is_memcg_oom(oc)) {
-- 
2.7.4



* [v3 3/6] mm, oom: cgroup-aware OOM killer debug info
  2017-06-21 21:19 [v3 0/6] cgroup-aware OOM killer Roman Gushchin
  2017-06-21 21:19 ` [v3 1/6] mm, oom: use oom_victims counter to synchronize oom victim selection Roman Gushchin
  2017-06-21 21:19 ` [v3 2/6] mm, oom: cgroup-aware OOM killer Roman Gushchin
@ 2017-06-21 21:19 ` Roman Gushchin
  2017-06-21 21:19 ` [v3 4/6] mm, oom: introduce oom_score_adj for memory cgroups Roman Gushchin
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 21+ messages in thread
From: Roman Gushchin @ 2017-06-21 21:19 UTC (permalink / raw)
  To: linux-mm
  Cc: Roman Gushchin, Tejun Heo, Johannes Weiner, Li Zefan,
	Michal Hocko, Vladimir Davydov, Tetsuo Handa, kernel-team,
	cgroups, linux-doc, linux-kernel

Dump the cgroup oom badness scores, as well as the name
of the chosen victim cgroup.

Here is how it looks in dmesg:
[   18.824495] Choosing a victim memcg because of the system-wide OOM
[   18.826911] Cgroup /A1: 200805
[   18.827996] Cgroup /A2: 273072
[   18.828937] Cgroup /A2/B3: 51
[   18.829795] Cgroup /A2/B4: 272969
[   18.830800] Cgroup /A2/B5: 52
[   18.831890] Chosen cgroup /A2/B4: 272969

Signed-off-by: Roman Gushchin <guro@fb.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: kernel-team@fb.com
Cc: cgroups@vger.kernel.org
Cc: linux-doc@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
---
 mm/memcontrol.c | 20 +++++++++++++++++++-
 1 file changed, 19 insertions(+), 1 deletion(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index bdb5103..4face20 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2669,7 +2669,15 @@ bool mem_cgroup_select_oom_victim(struct oom_control *oc)
 
 	if (!cgroup_subsys_on_dfl(memory_cgrp_subsys))
 		return false;
+
+	pr_info("Choosing a victim memcg because of the %s",
+		oc->memcg ?
+		"memory limit reached of cgroup " :
+		"system-wide OOM\n");
 	if (oc->memcg) {
+		pr_cont_cgroup_path(oc->memcg->css.cgroup);
+		pr_cont("\n");
+
 		chosen_memcg = oc->memcg;
 		parent = oc->memcg;
 	}
@@ -2683,6 +2691,10 @@ bool mem_cgroup_select_oom_victim(struct oom_control *oc)
 
 			points = mem_cgroup_oom_badness(iter, oc->nodemask);
 
+			pr_info("Cgroup ");
+			pr_cont_cgroup_path(iter->css.cgroup);
+			pr_cont(": %ld\n", points);
+
 			if (points > chosen_memcg_points) {
 				chosen_memcg = iter;
 				chosen_memcg_points = points;
@@ -2731,6 +2743,10 @@ bool mem_cgroup_select_oom_victim(struct oom_control *oc)
 			oc->chosen_memcg = chosen_memcg;
 		}
 
+		pr_info("Chosen cgroup ");
+		pr_cont_cgroup_path(chosen_memcg->css.cgroup);
+		pr_cont(": %ld\n", oc->chosen_points);
+
 		/*
 		 * Even if we have to kill all tasks in the cgroup,
 		 * we need to select the biggest task to start with.
@@ -2739,7 +2755,9 @@ bool mem_cgroup_select_oom_victim(struct oom_control *oc)
 		 */
 		oc->chosen_points = 0;
 		mem_cgroup_scan_tasks(chosen_memcg, oom_evaluate_task, oc);
-	}
+	} else if (oc->chosen)
+		pr_info("Chosen task %s (%d) in root cgroup: %ld\n",
+			oc->chosen->comm, oc->chosen->pid, oc->chosen_points);
 
 	rcu_read_unlock();
 
-- 
2.7.4



* [v3 4/6] mm, oom: introduce oom_score_adj for memory cgroups
  2017-06-21 21:19 [v3 0/6] cgroup-aware OOM killer Roman Gushchin
                   ` (2 preceding siblings ...)
  2017-06-21 21:19 ` [v3 3/6] mm, oom: cgroup-aware OOM killer debug info Roman Gushchin
@ 2017-06-21 21:19 ` Roman Gushchin
  2017-06-21 21:19 ` [v3 5/6] mm, oom: don't mark all oom victims tasks with TIF_MEMDIE Roman Gushchin
  2017-06-21 21:19 ` [v3 6/6] mm,oom,docs: describe the cgroup-aware OOM killer Roman Gushchin
  5 siblings, 0 replies; 21+ messages in thread
From: Roman Gushchin @ 2017-06-21 21:19 UTC (permalink / raw)
  To: linux-mm
  Cc: Roman Gushchin, Michal Hocko, Vladimir Davydov, Johannes Weiner,
	Tejun Heo, Tetsuo Handa, kernel-team, cgroups, linux-doc,
	linux-kernel

Introduce a per-memory-cgroup oom_score_adj setting:
a read-write single value file which exists on non-root
cgroups. The default is "0".

It has a similar meaning to the per-process value,
available via /proc/<pid>/oom_score_adj, and
should be in the range [-1000, 1000].

Signed-off-by: Roman Gushchin <guro@fb.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: kernel-team@fb.com
Cc: cgroups@vger.kernel.org
Cc: linux-doc@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
---
 include/linux/memcontrol.h |  3 +++
 mm/memcontrol.c            | 36 ++++++++++++++++++++++++++++++++++++
 2 files changed, 39 insertions(+)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index c59926c..b84a050 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -203,6 +203,9 @@ struct mem_cgroup {
 	/* kill all tasks in the subtree in case of OOM */
 	bool oom_kill_all_tasks;
 
+	/* OOM kill score adjustment */
+	short oom_score_adj;
+
 	/* handle for "memory.events" */
 	struct cgroup_file events_file;
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 4face20..e474eba 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5333,6 +5333,36 @@ static ssize_t memory_oom_kill_all_tasks_write(struct kernfs_open_file *of,
 	return nbytes;
 }
 
+static int memory_oom_score_adj_show(struct seq_file *m, void *v)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m));
+	short oom_score_adj = memcg->oom_score_adj;
+
+	seq_printf(m, "%d\n", oom_score_adj);
+
+	return 0;
+}
+
+static ssize_t memory_oom_score_adj_write(struct kernfs_open_file *of,
+				char *buf, size_t nbytes, loff_t off)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
+	int oom_score_adj;
+	int err;
+
+	err = kstrtoint(strstrip(buf), 0, &oom_score_adj);
+	if (err)
+		return err;
+
+	if (oom_score_adj < OOM_SCORE_ADJ_MIN ||
+			oom_score_adj > OOM_SCORE_ADJ_MAX)
+		return -EINVAL;
+
+	memcg->oom_score_adj = (short)oom_score_adj;
+
+	return nbytes;
+}
+
 static int memory_events_show(struct seq_file *m, void *v)
 {
 	struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m));
@@ -5459,6 +5489,12 @@ static struct cftype memory_files[] = {
 		.write = memory_oom_kill_all_tasks_write,
 	},
 	{
+		.name = "oom_score_adj",
+		.flags = CFTYPE_NOT_ON_ROOT,
+		.seq_show = memory_oom_score_adj_show,
+		.write = memory_oom_score_adj_write,
+	},
+	{
 		.name = "events",
 		.flags = CFTYPE_NOT_ON_ROOT,
 		.file_offset = offsetof(struct mem_cgroup, events_file),
-- 
2.7.4



* [v3 5/6] mm, oom: don't mark all oom victims tasks with TIF_MEMDIE
  2017-06-21 21:19 [v3 0/6] cgroup-aware OOM killer Roman Gushchin
                   ` (3 preceding siblings ...)
  2017-06-21 21:19 ` [v3 4/6] mm, oom: introduce oom_score_adj for memory cgroups Roman Gushchin
@ 2017-06-21 21:19 ` Roman Gushchin
  2017-06-29  8:53   ` Michal Hocko
  2017-06-21 21:19 ` [v3 6/6] mm,oom,docs: describe the cgroup-aware OOM killer Roman Gushchin
  5 siblings, 1 reply; 21+ messages in thread
From: Roman Gushchin @ 2017-06-21 21:19 UTC (permalink / raw)
  To: linux-mm
  Cc: Roman Gushchin, Michal Hocko, Vladimir Davydov, Johannes Weiner,
	Tejun Heo, Tetsuo Handa, kernel-team, cgroups, linux-doc,
	linux-kernel

We want to limit the number of tasks which have access to the memory
reserves. To ensure progress, it's enough to have one such process
at a time.

If we need to kill the whole cgroup, let's grant access to the
memory reserves only to the first process in the list, which is
(usually) the biggest process.
This gives us good chances that all other processes will be able
to quit without access to the memory reserves.

Otherwise, to keep making forward progress, let's grant access to the
memory reserves to tasks which can't be reaped by the oom_reaper.
As this is done from the oom reaper thread, which handles the
oom reaper queue sequentially, there is no high risk of having too
many such processes at the same time.

To implement this solution, we need to stop using the TIF_MEMDIE flag
as a universal marker for oom victim tasks. That's not a big issue,
as we have the oom_mm pointer/tsk_is_oom_victim(), which are just better.

Signed-off-by: Roman Gushchin <guro@fb.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: kernel-team@fb.com
Cc: cgroups@vger.kernel.org
Cc: linux-doc@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
---
 kernel/exit.c |  2 +-
 mm/oom_kill.c | 31 ++++++++++++++++++++++---------
 2 files changed, 23 insertions(+), 10 deletions(-)

diff --git a/kernel/exit.c b/kernel/exit.c
index d211425..5b95d74 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -554,7 +554,7 @@ static void exit_mm(void)
 	task_unlock(current);
 	mm_update_next_owner(mm);
 	mmput(mm);
-	if (test_thread_flag(TIF_MEMDIE))
+	if (tsk_is_oom_victim(current))
 		exit_oom_victim();
 }
 
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 489ab69..b55bd18 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -556,8 +556,18 @@ static void oom_reap_task(struct task_struct *tsk)
 	struct mm_struct *mm = tsk->signal->oom_mm;
 
 	/* Retry the down_read_trylock(mmap_sem) a few times */
-	while (attempts++ < MAX_OOM_REAP_RETRIES && !__oom_reap_task_mm(tsk, mm))
+	while (attempts++ < MAX_OOM_REAP_RETRIES &&
+	       !__oom_reap_task_mm(tsk, mm)) {
+
+		/*
+		 * If the task has no access to the memory reserves,
+		 * grant it to help the task to exit.
+		 */
+		if (!test_tsk_thread_flag(tsk, TIF_MEMDIE))
+			set_tsk_thread_flag(tsk, TIF_MEMDIE);
+
 		schedule_timeout_idle(HZ/10);
+	}
 
 	if (attempts <= MAX_OOM_REAP_RETRIES)
 		goto done;
@@ -647,16 +657,13 @@ static inline void wake_oom_reaper(struct task_struct *tsk)
  */
 static void mark_oom_victim(struct task_struct *tsk)
 {
-	struct mm_struct *mm = tsk->mm;
-
 	WARN_ON(oom_killer_disabled);
-	/* OOM killer might race with memcg OOM */
-	if (test_and_set_tsk_thread_flag(tsk, TIF_MEMDIE))
-		return;
 
 	/* oom_mm is bound to the signal struct life time. */
-	if (!cmpxchg(&tsk->signal->oom_mm, NULL, mm))
-		mmgrab(tsk->signal->oom_mm);
+	if (cmpxchg(&tsk->signal->oom_mm, NULL, tsk->mm) != NULL)
+		return;
+
+	mmgrab(tsk->signal->oom_mm);
 
 	/*
 	 * Make sure that the task is woken up from uninterruptible sleep
@@ -665,7 +672,13 @@ static void mark_oom_victim(struct task_struct *tsk)
 	 * that TIF_MEMDIE tasks should be ignored.
 	 */
 	__thaw_task(tsk);
-	atomic_inc(&oom_victims);
+
+	/*
+	 * If there are no oom victims in flight,
+	 * give the task access to the memory reserves.
+	 */
+	if (atomic_inc_return(&oom_victims) == 1)
+		set_tsk_thread_flag(tsk, TIF_MEMDIE);
 }
 
 /**
-- 
2.7.4

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: dont@kvack.org

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [v3 6/6] mm,oom,docs: describe the cgroup-aware OOM killer
  2017-06-21 21:19 [v3 0/6] cgroup-aware OOM killer Roman Gushchin
                   ` (4 preceding siblings ...)
  2017-06-21 21:19 ` [v3 5/6] mm, oom: don't mark all oom victims tasks with TIF_MEMDIE Roman Gushchin
@ 2017-06-21 21:19 ` Roman Gushchin
  5 siblings, 0 replies; 21+ messages in thread
From: Roman Gushchin @ 2017-06-21 21:19 UTC (permalink / raw)
  To: linux-mm
  Cc: Roman Gushchin, Michal Hocko, Vladimir Davydov, Johannes Weiner,
	Tetsuo Handa, David Rientjes, Tejun Heo, kernel-team, cgroups,
	linux-doc, linux-kernel

Update cgroups v2 docs.

Signed-off-by: Roman Gushchin <guro@fb.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: David Rientjes <rientjes@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: kernel-team@fb.com
Cc: cgroups@vger.kernel.org
Cc: linux-doc@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
---
 Documentation/cgroup-v2.txt | 44 ++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 44 insertions(+)

diff --git a/Documentation/cgroup-v2.txt b/Documentation/cgroup-v2.txt
index a86f3cb..7a1a1ac 100644
--- a/Documentation/cgroup-v2.txt
+++ b/Documentation/cgroup-v2.txt
@@ -44,6 +44,7 @@ CONTENTS
     5-2-1. Memory Interface Files
     5-2-2. Usage Guidelines
     5-2-3. Memory Ownership
+    5-2-4. Cgroup-aware OOM Killer
   5-3. IO
     5-3-1. IO Interface Files
     5-3-2. Writeback
@@ -799,6 +800,26 @@ PAGE_SIZE multiple when read back.
 	high limit is used and monitored properly, this limit's
 	utility is limited to providing the final safety net.
 
+  memory.oom_kill_all_tasks
+
+	A read-write single value file which exists on non-root
+	cgroups.  The default is "0".
+
+	Defines whether the OOM killer should treat the cgroup
+	as a single entity during victim selection.
+
+	If set, the OOM killer will kill all tasks belonging to
+	the cgroup, in case of either a system-wide or a cgroup-wide OOM.
+
+  memory.oom_score_adj
+
+	A read-write single value file which exists on non-root
+	cgroups.  The default is "0".
+
+	OOM killer score adjustment, which has a similar meaning
+	to the per-process value available via /proc/<pid>/oom_score_adj.
+	Should be in the range [-1000, 1000].
+
   memory.events
 
 	A read-only flat-keyed file which exists on non-root cgroups.
@@ -1028,6 +1049,29 @@ POSIX_FADV_DONTNEED to relinquish the ownership of memory areas
 belonging to the affected files to ensure correct memory ownership.
 
 
+5-2-4. Cgroup-aware OOM Killer
+
+The cgroup v2 memory controller implements a cgroup-aware OOM killer,
+which treats memory cgroups as first-class OOM entities.
+
+Under OOM conditions the memory controller tries to make the best
+choice of a victim, hierarchically looking for the largest memory
+consumer. By default, it will look for the biggest task in the
+biggest leaf cgroup.
+
+A user can change this behavior by enabling the per-cgroup
+oom_kill_all_tasks option. If set, it causes the OOM killer to treat
+the whole cgroup as an indivisible memory consumer; if the cgroup is
+selected as an OOM victim, all of its tasks will be killed.
+
+Tasks in the root cgroup are treated as independent memory consumers,
+and are compared with other memory consumers (e.g. leaf cgroups).
+The root cgroup doesn't support the oom_kill_all_tasks feature.
+
+This affects both system- and cgroup-wide OOMs. For a cgroup-wide OOM
+the memory controller considers only cgroups belonging to the sub-tree
+of the OOM'ing cgroup.
+
 5-3. IO
 
 The "io" controller regulates the distribution of IO resources.  This
-- 
2.7.4
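As a usage sketch of the two knobs documented above (cgroup paths are hypothetical, and this assumes the interface lands as described in the patch):

```shell
# Assumes cgroup v2 is mounted at /sys/fs/cgroup; the cgroup names
# below are made up for the example.

# Opt a container's cgroup into whole-cgroup kills under OOM:
echo 1 > /sys/fs/cgroup/mycontainer/memory.oom_kill_all_tasks

# Make a best-effort cgroup a more likely OOM victim:
echo 500 > /sys/fs/cgroup/batch/memory.oom_score_adj

# Protect an important cgroup from OOM victim selection:
echo -500 > /sys/fs/cgroup/critical/memory.oom_score_adj
```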


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [v3 1/6] mm, oom: use oom_victims counter to synchronize oom victim selection
       [not found]   ` <201706220040.v5M0eSnK074332@www262.sakura.ne.jp>
@ 2017-06-22 16:58     ` Roman Gushchin
  2017-06-22 20:37       ` Tetsuo Handa
  0 siblings, 1 reply; 21+ messages in thread
From: Roman Gushchin @ 2017-06-22 16:58 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: linux-mm, Michal Hocko, Vladimir Davydov, Johannes Weiner,
	Tejun Heo, kernel-team, cgroups, linux-doc, linux-kernel

On Thu, Jun 22, 2017 at 09:40:28AM +0900, Tetsuo Handa wrote:
> Roman Gushchin wrote:
> > --- a/mm/oom_kill.c
> > +++ b/mm/oom_kill.c
> > @@ -992,6 +992,13 @@ bool out_of_memory(struct oom_control *oc)
> >  	if (oom_killer_disabled)
> >  		return false;
> >  
> > +	/*
> > +	 * If there are oom victims in flight, we don't need to select
> > +	 * a new victim.
> > +	 */
> > +	if (atomic_read(&oom_victims) > 0)
> > +		return true;
> > +
> >  	if (!is_memcg_oom(oc)) {
> >  		blocking_notifier_call_chain(&oom_notify_list, 0, &freed);
> >  		if (freed > 0)
> 
> Above in this patch and below in patch 5 are wrong.
> 
> > --- a/mm/oom_kill.c
> > +++ b/mm/oom_kill.c
> > @@ -665,7 +672,13 @@ static void mark_oom_victim(struct task_struct *tsk)
> >  	 * that TIF_MEMDIE tasks should be ignored.
> >  	 */
> >  	__thaw_task(tsk);
> > -	atomic_inc(&oom_victims);
> > +
> > +	/*
> > +	 * If there are no oom victims in flight,
> > +	 * give the task an access to the memory reserves.
> > +	 */
> > +	if (atomic_inc_return(&oom_victims) == 1)
> > +		set_tsk_thread_flag(tsk, TIF_MEMDIE);
> >  }
> >  
> >  /**
> 
> The OOM reaper is not available for CONFIG_MMU=n kernels, and timeout based
> giveup is not permitted, but a multithreaded process might be selected as
> an OOM victim. Not setting TIF_MEMDIE to all threads sharing an OOM victim's
> mm increases possibility of preventing some OOM victim thread from terminating
> (e.g. one of them cannot leave __alloc_pages_slowpath() with mmap_sem held for
> write due to waiting for the TIF_MEMDIE thread to call exit_oom_victim() when
> the TIF_MEMDIE thread is waiting for the thread with mmap_sem held for write).

I agree that CONFIG_MMU=n is a special case, and the proposed approach can't
be used directly. But can you please explain why you find the first chunk wrong?
Basically, it's exactly what we do now: we increment oom_victims for every oom
victim, and every process decrements this counter during its exit path.
If there is at least one existing victim, we will select it again, so it's just
an optimization. Am I missing something? Why should we start a new victim selection
if there are processes that will likely quit and release memory soon?
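To make the intended counter semantics concrete, here is a userspace model (illustrative only: the function names are made up for the example and do not match mm/oom_kill.c):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Userspace model of the oom_victims gating discussed in this thread. */

static atomic_int oom_victims;

/* Patch 1: out_of_memory() selects a new victim only if none is in flight. */
static bool need_new_victim(void)
{
	return atomic_load(&oom_victims) == 0;
}

/* Patch 5: only the first victim in flight is granted TIF_MEMDIE. */
static bool mark_victim_gets_reserves(void)
{
	return atomic_fetch_add(&oom_victims, 1) == 0;
}

/* Exit path: the victim drops the counter in exit_oom_victim(). */
static void victim_exited(void)
{
	atomic_fetch_sub(&oom_victims, 1);
}
```

As long as a victim has not finished its exit path, the counter stays positive and selection of additional victims is skipped.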

Thanks!


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [v3 1/6] mm, oom: use oom_victims counter to synchronize oom victim selection
  2017-06-22 16:58     ` Roman Gushchin
@ 2017-06-22 20:37       ` Tetsuo Handa
       [not found]         ` <201706230537.IDB21366.SQHJVFOOFOMFLt-JPay3/Yim36HaxMnTkn67Xf5DAMn2ifp@public.gmane.org>
  0 siblings, 1 reply; 21+ messages in thread
From: Tetsuo Handa @ 2017-06-22 20:37 UTC (permalink / raw)
  To: guro
  Cc: linux-mm, mhocko, vdavydov.dev, hannes, tj, kernel-team, cgroups,
	linux-doc, linux-kernel

Roman Gushchin wrote:
> On Thu, Jun 22, 2017 at 09:40:28AM +0900, Tetsuo Handa wrote:
> > Roman Gushchin wrote:
> > > --- a/mm/oom_kill.c
> > > +++ b/mm/oom_kill.c
> > > @@ -992,6 +992,13 @@ bool out_of_memory(struct oom_control *oc)
> > >  	if (oom_killer_disabled)
> > >  		return false;
> > >  
> > > +	/*
> > > +	 * If there are oom victims in flight, we don't need to select
> > > +	 * a new victim.
> > > +	 */
> > > +	if (atomic_read(&oom_victims) > 0)
> > > +		return true;
> > > +
> > >  	if (!is_memcg_oom(oc)) {
> > >  		blocking_notifier_call_chain(&oom_notify_list, 0, &freed);
> > >  		if (freed > 0)
> > 
> > Above in this patch and below in patch 5 are wrong.
> > 
> > > --- a/mm/oom_kill.c
> > > +++ b/mm/oom_kill.c
> > > @@ -665,7 +672,13 @@ static void mark_oom_victim(struct task_struct *tsk)
> > >  	 * that TIF_MEMDIE tasks should be ignored.
> > >  	 */
> > >  	__thaw_task(tsk);
> > > -	atomic_inc(&oom_victims);
> > > +
> > > +	/*
> > > +	 * If there are no oom victims in flight,
> > > +	 * give the task an access to the memory reserves.
> > > +	 */
> > > +	if (atomic_inc_return(&oom_victims) == 1)
> > > +		set_tsk_thread_flag(tsk, TIF_MEMDIE);
> > >  }
> > >  
> > >  /**
> > 
> > The OOM reaper is not available for CONFIG_MMU=n kernels, and timeout based
> > giveup is not permitted, but a multithreaded process might be selected as
> > an OOM victim. Not setting TIF_MEMDIE to all threads sharing an OOM victim's
> > mm increases possibility of preventing some OOM victim thread from terminating
> > (e.g. one of them cannot leave __alloc_pages_slowpath() with mmap_sem held for
> > write due to waiting for the TIF_MEMDIE thread to call exit_oom_victim() when
> > the TIF_MEMDIE thread is waiting for the thread with mmap_sem held for write).
> 
> I agree, that CONFIG_MMU=n is a special case, and the proposed approach can't
> be used directly. But can you, please, why do you find the first  chunk wrong?

Since you are checking oom_victims before checking task_will_free_mem(current),
only one thread can get TIF_MEMDIE. This is where a multithreaded OOM victim without
the OOM reaper can get stuck forever.

> Basically, it's exactly what we do now: we increment oom_victims for every oom
> victim, and every process decrements this counter during it's exit path.
> If there is at least one existing victim, we will select it again, so it's just
> an optimization. Am I missing something? Why should we start new victim selection
> if there processes that will likely quit and release memory soon?

If you check oom_victims like
http://lkml.kernel.org/r/201706220053.v5M0rmOU078764@www262.sakura.ne.jp does,
all threads in a multithreaded OOM victim will have a chance to get TIF_MEMDIE.



By the way, below in patch 5 is also wrong.
You are not permitted to set TIF_MEMDIE if oom_killer_disabled == true.

--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -556,8 +556,18 @@ static void oom_reap_task(struct task_struct *tsk)
 	struct mm_struct *mm = tsk->signal->oom_mm;
 
 	/* Retry the down_read_trylock(mmap_sem) a few times */
-	while (attempts++ < MAX_OOM_REAP_RETRIES && !__oom_reap_task_mm(tsk, mm))
+	while (attempts++ < MAX_OOM_REAP_RETRIES &&
+	       !__oom_reap_task_mm(tsk, mm)) {
+
+		/*
+		 * If the task has no access to the memory reserves,
+		 * grant it to help the task to exit.
+		 */
+		if (!test_tsk_thread_flag(tsk, TIF_MEMDIE))
+			set_tsk_thread_flag(tsk, TIF_MEMDIE);
+
 		schedule_timeout_idle(HZ/10);
+	}
 
 	if (attempts <= MAX_OOM_REAP_RETRIES)
 		goto done;

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [v3 1/6] mm, oom: use oom_victims counter to synchronize oom victim selection
       [not found]         ` <201706230537.IDB21366.SQHJVFOOFOMFLt-JPay3/Yim36HaxMnTkn67Xf5DAMn2ifp@public.gmane.org>
@ 2017-06-22 21:52           ` Tetsuo Handa
  2017-06-29 18:47             ` Roman Gushchin
  0 siblings, 1 reply; 21+ messages in thread
From: Tetsuo Handa @ 2017-06-22 21:52 UTC (permalink / raw)
  To: guro-b10kYP2dOMg
  Cc: linux-mm-Bw31MaZKKs3YtjvyW6yDsg, mhocko-DgEjT+Ai2ygdnm+yROfE0A,
	vdavydov.dev-Re5JQEeQqe8AvxtiuMwx3w,
	hannes-druUgvl0LCNAfugRpC6u6w, tj-DgEjT+Ai2ygdnm+yROfE0A,
	kernel-team-b10kYP2dOMg, cgroups-u79uwXL29TY76Z2rM5mHXA,
	linux-doc-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	penguin-kernel-JPay3/Yim36HaxMnTkn67Xf5DAMn2ifp

Tetsuo Handa wrote:
> Roman Gushchin wrote:
> > On Thu, Jun 22, 2017 at 09:40:28AM +0900, Tetsuo Handa wrote:
> > > Roman Gushchin wrote:
> > > > --- a/mm/oom_kill.c
> > > > +++ b/mm/oom_kill.c
> > > > @@ -992,6 +992,13 @@ bool out_of_memory(struct oom_control *oc)
> > > >  	if (oom_killer_disabled)
> > > >  		return false;
> > > >  
> > > > +	/*
> > > > +	 * If there are oom victims in flight, we don't need to select
> > > > +	 * a new victim.
> > > > +	 */
> > > > +	if (atomic_read(&oom_victims) > 0)
> > > > +		return true;
> > > > +
> > > >  	if (!is_memcg_oom(oc)) {
> > > >  		blocking_notifier_call_chain(&oom_notify_list, 0, &freed);
> > > >  		if (freed > 0)
> > > 
> > > The OOM reaper is not available for CONFIG_MMU=n kernels, and timeout based
> > > giveup is not permitted, but a multithreaded process might be selected as
> > > an OOM victim. Not setting TIF_MEMDIE to all threads sharing an OOM victim's
> > > mm increases possibility of preventing some OOM victim thread from terminating
> > > (e.g. one of them cannot leave __alloc_pages_slowpath() with mmap_sem held for
> > > write due to waiting for the TIF_MEMDIE thread to call exit_oom_victim() when
> > > the TIF_MEMDIE thread is waiting for the thread with mmap_sem held for write).
> > 
> > I agree, that CONFIG_MMU=n is a special case, and the proposed approach can't
> > be used directly. But can you, please, why do you find the first  chunk wrong?
> 
> Since you are checking oom_victims before checking task_will_free_mem(current),
> only one thread can get TIF_MEMDIE. This is where a multithreaded OOM victim without
> the OOM reaper can get stuck forever.

Oops, I misinterpreted. This is where a multithreaded OOM victim with or without
the OOM reaper can get stuck forever. Think about a process with two threads being
selected by the OOM killer, where only one of these two threads can get TIF_MEMDIE.

  Thread-1                 Thread-2                 The OOM killer           The OOM reaper

                           Calls down_write(&current->mm->mmap_sem).
  Enters __alloc_pages_slowpath().
                           Enters __alloc_pages_slowpath().
  Takes oom_lock.
  Calls out_of_memory().
                                                    Selects Thread-1 as an OOM victim.
  Gets SIGKILL.            Gets SIGKILL.
  Gets TIF_MEMDIE.
  Releases oom_lock.
  Leaves __alloc_pages_slowpath() because Thread-1 has TIF_MEMDIE.
                                                                             Takes oom_lock.
                                                                             Will do nothing because down_read_trylock() fails.
                                                                             Releases oom_lock.
                                                                             Gives up and sets MMF_OOM_SKIP after one second.
                           Takes oom_lock.
                           Calls out_of_memory().
                           Will not check MMF_OOM_SKIP because Thread-1 still has TIF_MEMDIE. // <= get stuck waiting for Thread-1.
                           Releases oom_lock.
                           Will not leave __alloc_pages_slowpath() because Thread-2 does not have TIF_MEMDIE.
                           Will not call up_write(&current->mm->mmap_sem).
  Reaches do_exit().
  Calls down_read(&current->mm->mmap_sem) in exit_mm() in do_exit(). // <= get stuck waiting for Thread-2.
  Will not call up_read(&current->mm->mmap_sem) in exit_mm() in do_exit().
  Will not clear TIF_MEMDIE in exit_oom_victim() in exit_mm() in do_exit().

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [v3 5/6] mm, oom: don't mark all oom victims tasks with TIF_MEMDIE
  2017-06-21 21:19 ` [v3 5/6] mm, oom: don't mark all oom victims tasks with TIF_MEMDIE Roman Gushchin
@ 2017-06-29  8:53   ` Michal Hocko
  2017-06-29 18:45     ` Roman Gushchin
  0 siblings, 1 reply; 21+ messages in thread
From: Michal Hocko @ 2017-06-29  8:53 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: linux-mm, Vladimir Davydov, Johannes Weiner, Tejun Heo,
	Tetsuo Handa, kernel-team, cgroups, linux-doc, linux-kernel

On Wed 21-06-17 22:19:15, Roman Gushchin wrote:
> We want to limit the number of tasks which are having an access
> to the memory reserves. To ensure the progress it's enough
> to have one such process at the time.
> 
> If we need to kill the whole cgroup, let's give an access to the
> memory reserves only to the first process in the list, which is
> (usually) the biggest process.
> This will give us good chances that all other processes will be able
> to quit without an access to the memory reserves.

I don't like this, to be honest. Is there any reason not to go with the
reduced memory reserves access for oom victims I was suggesting earlier [1]?

[1] http://lkml.kernel.org/r/1472723464-22866-2-git-send-email-mhocko@kernel.org
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [v3 1/6] mm, oom: use oom_victims counter to synchronize oom victim selection
  2017-06-21 21:19 ` [v3 1/6] mm, oom: use oom_victims counter to synchronize oom victim selection Roman Gushchin
       [not found]   ` <201706220040.v5M0eSnK074332@www262.sakura.ne.jp>
@ 2017-06-29  9:04   ` Michal Hocko
  1 sibling, 0 replies; 21+ messages in thread
From: Michal Hocko @ 2017-06-29  9:04 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: linux-mm, Vladimir Davydov, Johannes Weiner, Tejun Heo,
	Tetsuo Handa, kernel-team, cgroups, linux-doc, linux-kernel

On Wed 21-06-17 22:19:11, Roman Gushchin wrote:
> Oom killer should avoid unnecessary kills. To prevent them, during
> the tasks list traverse we check for task which was previously
> selected as oom victims. If there is such a task, new victim
> is not selected.
> 
> This approach is sub-optimal (we're doing costly iteration over the task
> list every time) and will not work for the cgroup-aware oom killer.
> 
> We already have oom_victims counter, which can be effectively used
> for the task.

A global counter will not work properly, I am afraid. a) you should
consider the oom domain and not block oom on unrelated domains, and b)
you have no guarantee that the oom victim will terminate reasonably.
That is why we have the MMF_OOM_SKIP check in oom_evaluate_task.

I think you should have something similar for your memcg victim selection.
If you see a memcg in the oom hierarchy with oom victims which are alive
and not MMF_OOM_SKIP, you should abort the scanning.
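The suggested abort rule can be sketched as a small userspace model (struct and function names are invented for the example; `oom_skip` stands in for MMF_OOM_SKIP):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative model of the scan-abort rule: while walking the oom
 * hierarchy, stop if some already-selected victim is still alive and
 * has not yet been marked as unreclaimable (MMF_OOM_SKIP). */

struct victim_state {
	bool alive;
	bool oom_skip;	/* stands in for MMF_OOM_SKIP */
};

static bool should_abort_scan(const struct victim_state *victims, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		/* A live, not-yet-skipped victim may still free memory,
		 * so do not pick a new victim in this oom domain. */
		if (victims[i].alive && !victims[i].oom_skip)
			return true;
	}
	/* No victims, or only stale ones: proceed with selection. */
	return false;
}
```

The key difference from a global counter is that the decision is made per oom domain and ignores victims that were already given up on.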
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [v3 5/6] mm, oom: don't mark all oom victims tasks with TIF_MEMDIE
  2017-06-29  8:53   ` Michal Hocko
@ 2017-06-29 18:45     ` Roman Gushchin
  2017-06-30  8:25       ` Michal Hocko
  0 siblings, 1 reply; 21+ messages in thread
From: Roman Gushchin @ 2017-06-29 18:45 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Vladimir Davydov, Johannes Weiner, Tejun Heo,
	Tetsuo Handa, kernel-team, cgroups, linux-doc, linux-kernel

On Thu, Jun 29, 2017 at 10:53:57AM +0200, Michal Hocko wrote:
> On Wed 21-06-17 22:19:15, Roman Gushchin wrote:
> > We want to limit the number of tasks which are having an access
> > to the memory reserves. To ensure the progress it's enough
> > to have one such process at the time.
> > 
> > If we need to kill the whole cgroup, let's give an access to the
> > memory reserves only to the first process in the list, which is
> > (usually) the biggest process.
> > This will give us good chances that all other processes will be able
> > to quit without an access to the memory reserves.
> 
> I don't like this to be honest. Is there any reason to go the reduced
> memory reserves access to oom victims I was suggesting earlier [1]?
> 
> [1] http://lkml.kernel.org/r/1472723464-22866-2-git-send-email-mhocko@kernel.org

I've nothing against your approach. What's the state of this patchset?
Do you plan to bring it upstream?

Roman


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [v3 1/6] mm, oom: use oom_victims counter to synchronize oom victim selection
  2017-06-22 21:52           ` Tetsuo Handa
@ 2017-06-29 18:47             ` Roman Gushchin
  2017-06-29 20:13               ` Tetsuo Handa
  0 siblings, 1 reply; 21+ messages in thread
From: Roman Gushchin @ 2017-06-29 18:47 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: linux-mm, mhocko, vdavydov.dev, hannes, tj, kernel-team, cgroups,
	linux-doc, linux-kernel

On Fri, Jun 23, 2017 at 06:52:20AM +0900, Tetsuo Handa wrote:
> Tetsuo Handa wrote:
> > Roman Gushchin wrote:
> > > On Thu, Jun 22, 2017 at 09:40:28AM +0900, Tetsuo Handa wrote:
> > > > Roman Gushchin wrote:
> > > > > --- a/mm/oom_kill.c
> > > > > +++ b/mm/oom_kill.c
> > > > > @@ -992,6 +992,13 @@ bool out_of_memory(struct oom_control *oc)
> > > > >  	if (oom_killer_disabled)
> > > > >  		return false;
> > > > >  
> > > > > +	/*
> > > > > +	 * If there are oom victims in flight, we don't need to select
> > > > > +	 * a new victim.
> > > > > +	 */
> > > > > +	if (atomic_read(&oom_victims) > 0)
> > > > > +		return true;
> > > > > +
> > > > >  	if (!is_memcg_oom(oc)) {
> > > > >  		blocking_notifier_call_chain(&oom_notify_list, 0, &freed);
> > > > >  		if (freed > 0)
> > > > 
> > > > The OOM reaper is not available for CONFIG_MMU=n kernels, and timeout based
> > > > giveup is not permitted, but a multithreaded process might be selected as
> > > > an OOM victim. Not setting TIF_MEMDIE to all threads sharing an OOM victim's
> > > > mm increases possibility of preventing some OOM victim thread from terminating
> > > > (e.g. one of them cannot leave __alloc_pages_slowpath() with mmap_sem held for
> > > > write due to waiting for the TIF_MEMDIE thread to call exit_oom_victim() when
> > > > the TIF_MEMDIE thread is waiting for the thread with mmap_sem held for write).
> > > 
> > > I agree, that CONFIG_MMU=n is a special case, and the proposed approach can't
> > > be used directly. But can you, please, why do you find the first  chunk wrong?
> > 
> > Since you are checking oom_victims before checking task_will_free_mem(current),
> > only one thread can get TIF_MEMDIE. This is where a multithreaded OOM victim without
> > the OOM reaper can get stuck forever.
> 
> Oops, I misinterpreted. This is where a multithreaded OOM victim with or without
> the OOM reaper can get stuck forever. Think about a process with two threads is
> selected by the OOM killer and only one of these two threads can get TIF_MEMDIE.
> 
>   Thread-1                 Thread-2                 The OOM killer           The OOM reaper
> 
>                            Calls down_write(&current->mm->mmap_sem).
>   Enters __alloc_pages_slowpath().
>                            Enters __alloc_pages_slowpath().
>   Takes oom_lock.
>   Calls out_of_memory().
>                                                     Selects Thread-1 as an OOM victim.
>   Gets SIGKILL.            Gets SIGKILL.
>   Gets TIF_MEMDIE.
>   Releases oom_lock.
>   Leaves __alloc_pages_slowpath() because Thread-1 has TIF_MEMDIE.
>                                                                              Takes oom_lock.
>                                                                              Will do nothing because down_read_trylock() fails.
>                                                                              Releases oom_lock.
>                                                                              Gives up and sets MMF_OOM_SKIP after one second.
>                            Takes oom_lock.
>                            Calls out_of_memory().
>                            Will not check MMF_OOM_SKIP because Thread-1 still has TIF_MEMDIE. // <= get stuck waiting for Thread-1.
>                            Releases oom_lock.
>                            Will not leave __alloc_pages_slowpath() because Thread-2 does not have TIF_MEMDIE.
>                            Will not call up_write(&current->mm->mmap_sem).
>   Reaches do_exit().
>   Calls down_read(&current->mm->mmap_sem) in exit_mm() in do_exit(). // <= get stuck waiting for Thread-2.
>   Will not call up_read(&current->mm->mmap_sem) in exit_mm() in do_exit().
>   Will not clear TIF_MEMDIE in exit_oom_victim() in exit_mm() in do_exit().

That's interesting... Does it mean that we have to give access to the reserves
to all threads to guarantee forward progress?

What do you think about Michal's approach? He posted a link in the thread.

Thank you!

Roman


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [v3 1/6] mm, oom: use oom_victims counter to synchronize oom victim selection
  2017-06-29 18:47             ` Roman Gushchin
@ 2017-06-29 20:13               ` Tetsuo Handa
  0 siblings, 0 replies; 21+ messages in thread
From: Tetsuo Handa @ 2017-06-29 20:13 UTC (permalink / raw)
  To: guro
  Cc: linux-mm, mhocko, vdavydov.dev, hannes, tj, kernel-team, cgroups,
	linux-doc, linux-kernel

Roman Gushchin wrote:
> On Fri, Jun 23, 2017 at 06:52:20AM +0900, Tetsuo Handa wrote:
> > Tetsuo Handa wrote:
> > Oops, I misinterpreted. This is where a multithreaded OOM victim with or without
> > the OOM reaper can get stuck forever. Think about a process with two threads is
> > selected by the OOM killer and only one of these two threads can get TIF_MEMDIE.
> > 
> >   Thread-1                 Thread-2                 The OOM killer           The OOM reaper
> > 
> >                            Calls down_write(&current->mm->mmap_sem).
> >   Enters __alloc_pages_slowpath().
> >                            Enters __alloc_pages_slowpath().
> >   Takes oom_lock.
> >   Calls out_of_memory().
> >                                                     Selects Thread-1 as an OOM victim.
> >   Gets SIGKILL.            Gets SIGKILL.
> >   Gets TIF_MEMDIE.
> >   Releases oom_lock.
> >   Leaves __alloc_pages_slowpath() because Thread-1 has TIF_MEMDIE.
> >                                                                              Takes oom_lock.
> >                                                                              Will do nothing because down_read_trylock() fails.
> >                                                                              Releases oom_lock.
> >                                                                              Gives up and sets MMF_OOM_SKIP after one second.
> >                            Takes oom_lock.
> >                            Calls out_of_memory().
> >                            Will not check MMF_OOM_SKIP because Thread-1 still has TIF_MEMDIE. // <= get stuck waiting for Thread-1.
> >                            Releases oom_lock.
> >                            Will not leave __alloc_pages_slowpath() because Thread-2 does not have TIF_MEMDIE.
> >                            Will not call up_write(&current->mm->mmap_sem).
> >   Reaches do_exit().
> >   Calls down_read(&current->mm->mmap_sem) in exit_mm() in do_exit(). // <= get stuck waiting for Thread-2.
> >   Will not call up_read(&current->mm->mmap_sem) in exit_mm() in do_exit().
> >   Will not clear TIF_MEMDIE in exit_oom_victim() in exit_mm() in do_exit().
> 
> That's interesting... Does it mean, that we have to give an access to the reserves
> to all threads to guarantee the forward progress?

Yes, for we don't have a __GFP_KILLABLE flag.

> 
> What do you think about Michal's approach? He posted a link in the thread.

Please read that thread.


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [v3 5/6] mm, oom: don't mark all oom victims tasks with TIF_MEMDIE
  2017-06-29 18:45     ` Roman Gushchin
@ 2017-06-30  8:25       ` Michal Hocko
  0 siblings, 0 replies; 21+ messages in thread
From: Michal Hocko @ 2017-06-30  8:25 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: linux-mm, Vladimir Davydov, Johannes Weiner, Tejun Heo,
	Tetsuo Handa, kernel-team, cgroups, linux-doc, linux-kernel

On Thu 29-06-17 14:45:13, Roman Gushchin wrote:
> On Thu, Jun 29, 2017 at 10:53:57AM +0200, Michal Hocko wrote:
> > On Wed 21-06-17 22:19:15, Roman Gushchin wrote:
> > > We want to limit the number of tasks which are having an access
> > > to the memory reserves. To ensure the progress it's enough
> > > to have one such process at the time.
> > > 
> > > If we need to kill the whole cgroup, let's give an access to the
> > > memory reserves only to the first process in the list, which is
> > > (usually) the biggest process.
> > > This will give us good chances that all other processes will be able
> > > to quit without an access to the memory reserves.
> > 
> > I don't like this to be honest. Is there any reason to go the reduced
> > memory reserves access to oom victims I was suggesting earlier [1]?
> > 
> > [1] http://lkml.kernel.org/r/1472723464-22866-2-git-send-email-mhocko@kernel.org
> 
> I've nothing against your approach. What's the state of this patchset?
> Do you plan to bring it upstream?

Just the specific patch I have linked should be sufficient for what you
need here. The patchset had some issues which I didn't have time to fix,
and so the above patch was not a high priority either.
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [v3 2/6] mm, oom: cgroup-aware OOM killer
  2017-06-21 21:19 ` [v3 2/6] mm, oom: cgroup-aware OOM killer Roman Gushchin
@ 2017-07-10 23:05   ` David Rientjes
  2017-07-11 12:51     ` Roman Gushchin
  0 siblings, 1 reply; 21+ messages in thread
From: David Rientjes @ 2017-07-10 23:05 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: linux-mm, Michal Hocko, Vladimir Davydov, Johannes Weiner,
	Tetsuo Handa, Tejun Heo, kernel-team, cgroups, linux-doc,
	linux-kernel

On Wed, 21 Jun 2017, Roman Gushchin wrote:

> Traditionally, the OOM killer operates at the process level.
> Under oom conditions, it finds the process with the highest oom score
> and kills it.
> 
> This behavior doesn't suit systems with many running
> containers well. There are two main issues:
> 
> 1) There is no fairness between containers. A small container with
> a few large processes will be chosen over a large one with a huge
> number of small processes.
> 

Yes, the original motivation was to limit killing to a single process, if 
possible.  To do that, we kill the process with the largest rss to free 
the most memory and rely on the user to configure /proc/pid/oom_score_adj 
if something else should be prioritized.

With containerization and overcommit of system memory, we concur that 
killing the single largest process isn't always preferable and neglects 
the priority of its memcg.  Your motivation seems to be to provide 
fairness between one memcg with a large process and one memcg with a large 
number of small processes; I'm curious if you are concerned about the 
priority of a memcg hierarchy (how important that "job" is) or whether you 
are strictly concerned with "largeness" of memcgs relative to each other.

> 2) Containers often do not expect that some random process inside
> will be killed. In many cases a much safer behavior is to kill
> all tasks in the container. Traditionally, this was implemented
> in userspace, but doing it in the kernel has some advantages,
> especially in the case of a system-wide OOM.
> 

We agree.

> 3) Per-process oom_score_adj affects global OOM, so it's a breach
> of isolation.
> 

This should only be a consequence of overcommitting memcgs at the top level 
so the system oom killer is actually ever invoked, otherwise per-process 
oom_score_adj works well for memcg oom killing.

> To address these issues, cgroup-aware OOM killer is introduced.
> 
> Under OOM conditions, it tries to find the biggest memory consumer,
> and free memory by killing the corresponding task(s). The difference
> from the "traditional" OOM killer is that it can treat memory cgroups
> as memory consumers as well as single processes.
> 
> By default, it will look for the biggest leaf cgroup, and kill
> the largest task inside.
> 
> But a user can change this behavior by enabling the per-cgroup
> oom_kill_all_tasks option. If set, it causes the OOM killer to treat
> the whole cgroup as an indivisible memory consumer. If it's
> selected as an OOM victim, all belonging tasks will be killed.
> 

These are two different things, right?  We can adjust how the system oom 
killer chooses victims when memcg hierarchies overcommit the system to not 
strictly prefer the single process with the largest rss without killing 
everything attached to the memcg.

Separately: do you not intend to support memcg priorities at all, but 
rather strictly consider the "largeness" of a memcg versus other memcgs?

In our methodology, each memcg is assigned a priority value and the 
iteration of the hierarchy simply compares and visits the memcg with the 
lowest priority at each level and then selects the largest process to 
kill.  This could also support a "kill-all" knob.

	struct mem_cgroup *memcg = root_mem_cgroup;
	struct mem_cgroup *child, *low_memcg;
	unsigned long low_priority;

next:
	low_memcg = NULL;
	low_priority = ULONG_MAX;
	for_each_child_of_memcg(child, memcg) {
		unsigned long prio = memcg_oom_priority(child);

		if (prio < low_priority) {
			low_memcg = child;
			low_priority = prio;
		}
	}
	if (low_memcg) {
		memcg = low_memcg;
		goto next;
	}
	oom_kill_process_from_memcg(memcg);

So this is a priority based model that is different than your aggregate 
usage model but I think it allows userspace to define a more powerful 
policy.  We certainly may want to kill from a memcg with a single large 
process, or we may want to kill from a memcg with several small processes, 
it depends on the importance of that job.


* Re: [v3 2/6] mm, oom: cgroup-aware OOM killer
  2017-07-10 23:05   ` David Rientjes
@ 2017-07-11 12:51     ` Roman Gushchin
  2017-07-11 20:56       ` David Rientjes
  0 siblings, 1 reply; 21+ messages in thread
From: Roman Gushchin @ 2017-07-11 12:51 UTC (permalink / raw)
  To: David Rientjes
  Cc: linux-mm, Michal Hocko, Vladimir Davydov, Johannes Weiner,
	Tetsuo Handa, Tejun Heo, kernel-team, cgroups, linux-doc,
	linux-kernel

On Mon, Jul 10, 2017 at 04:05:49PM -0700, David Rientjes wrote:
> On Wed, 21 Jun 2017, Roman Gushchin wrote:
> 
> > Traditionally, the OOM killer operates at the process level.
> > Under oom conditions, it finds the process with the highest oom score
> > and kills it.
> > 
> > This behavior doesn't suit systems with many running
> > containers well. There are two main issues:
> > 
> > 1) There is no fairness between containers. A small container with
> > a few large processes will be chosen over a large one with a huge
> > number of small processes.
> > 
> 
> Yes, the original motivation was to limit killing to a single process, if 
> possible.  To do that, we kill the process with the largest rss to free 
> the most memory and rely on the user to configure /proc/pid/oom_score_adj 
> if something else should be prioritized.
> 
> With containerization and overcommit of system memory, we concur that 
> killing the single largest process isn't always preferable and neglects 
> the priority of its memcg.  Your motivation seems to be to provide 
> fairness between one memcg with a large process and one memcg with a large 
> number of small processes; I'm curious if you are concerned about the 
> priority of a memcg hierarchy (how important that "job" is) or whether you 
> are strictly concerned with "largeness" of memcgs relative to each other.

I'm pretty sure we should provide some way to prioritize some cgroups
over others (in terms of oom killer preferences), but I'm not 100% sure yet
what's the best way to do it. I've suggested something similar to the existing
oom_score_adj for tasks, mostly to follow the existing design.

One of the questions to answer in a priority-based model is
how to compare tasks in the root cgroup with cgroups?

> > ...
> > By default, it will look for the biggest leaf cgroup, and kill
> > the largest task inside.
> > 
> > But a user can change this behavior by enabling the per-cgroup
> > oom_kill_all_tasks option. If set, it causes the OOM killer to treat
> > the whole cgroup as an indivisible memory consumer. If it's
> > selected as an OOM victim, all belonging tasks will be killed.
> > 
> 
> These are two different things, right?  We can adjust how the system oom 
> killer chooses victims when memcg hierarchies overcommit the system to not 
> strictly prefer the single process with the largest rss without killing 
> everything attached to the memcg.

They are different, and I thought about providing two independent knobs.
But in the end I haven't found enough real-life examples where it can be useful.
Can you provide something here?

Also, they are different only for non-leaf cgroups; leaf cgroups
are always treated as indivisible memory consumers during victim selection.

I assume that containerized systems will always set oom_kill_all_tasks for
top-level container memory cgroups. By default it's turned off
to provide backward compatibility with the current behavior, avoid
excessive kills, and support oom_score_adj==-1000 (I've added this to v4,
will post soon).

> Separately: do you not intend to support memcg priorities at all, but 
> rather strictly consider the "largeness" of a memcg versus other memcgs?

Answered above.

> In our methodology, each memcg is assigned a priority value and the 
> iteration of the hierarchy simply compares and visits the memcg with the 
> lowest priority at each level and then selects the largest process to 
> kill.  This could also support a "kill-all" knob.
> 
> 	struct mem_cgroup *memcg = root_mem_cgroup;
> 	struct mem_cgroup *child, *low_memcg;
> 	unsigned long low_priority;
> 
> next:
> 	low_memcg = NULL;
> 	low_priority = ULONG_MAX;
> 	for_each_child_of_memcg(child, memcg) {
> 		unsigned long prio = memcg_oom_priority(child);
> 
> 		if (prio < low_priority) {
> 			low_memcg = child;
> 			low_priority = prio;
> 		}
> 	}
> 	if (low_memcg) {
> 		memcg = low_memcg;
> 		goto next;
> 	}
> 	oom_kill_process_from_memcg(memcg);
> 
> So this is a priority based model that is different than your aggregate 
> usage model but I think it allows userspace to define a more powerful 
> policy.  We certainly may want to kill from a memcg with a single large 
> process, or we may want to kill from a memcg with several small processes, 
> it depends on the importance of that job.

I believe, that both models have some advantages.
Priority-based model is more powerful, but requires support from the userspace
to set up these priorities (and, probably, adjust them dynamically).
Size-based model is limited, but provides reasonable behavior
without any additional configuration.

I will agree here with Michal Hocko, that bpf like mechanism can be used
here to provide an option for writing some custom oom policies. After we will
have necessary infrastructure to iterate over cgroup tree, select a cgroup
with largest oom score, and kill a biggest task/all the tasks there,
we can add an ability to customize the cgroup (and maybe tasks) evaluation.

Thanks!

Roman


* Re: [v3 2/6] mm, oom: cgroup-aware OOM killer
  2017-07-11 12:51     ` Roman Gushchin
@ 2017-07-11 20:56       ` David Rientjes
  2017-07-12 12:11         ` Roman Gushchin
  0 siblings, 1 reply; 21+ messages in thread
From: David Rientjes @ 2017-07-11 20:56 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: linux-mm, Michal Hocko, Vladimir Davydov, Johannes Weiner,
	Tetsuo Handa, Tejun Heo, kernel-team, cgroups, linux-doc,
	linux-kernel

On Tue, 11 Jul 2017, Roman Gushchin wrote:

> > Yes, the original motivation was to limit killing to a single process, if 
> > possible.  To do that, we kill the process with the largest rss to free 
> > the most memory and rely on the user to configure /proc/pid/oom_score_adj 
> > if something else should be prioritized.
> > 
> > With containerization and overcommit of system memory, we concur that 
> > killing the single largest process isn't always preferable and neglects 
> > the priority of its memcg.  Your motivation seems to be to provide 
> > fairness between one memcg with a large process and one memcg with a large 
> > number of small processes; I'm curious if you are concerned about the 
> > priority of a memcg hierarchy (how important that "job" is) or whether you 
> > are strictly concerned with "largeness" of memcgs relative to each other.
> 
> I'm pretty sure we should provide some way to prioritize some cgroups
> over others (in terms of oom killer preferences), but I'm not 100% sure yet
> what's the best way to do it. I've suggested something similar to the existing
> oom_score_adj for tasks, mostly to follow the existing design.
> 
> One of the questions to answer in a priority-based model is
> how to compare tasks in the root cgroup with cgroups?
> 

We do this with an alternate scoring mechanism, that is purely priority 
based and tiebreaks based on largest rss.  An additional tunable is added 
for each process, under /proc/pid, and also to the memcg hierarchy, and is 
enabled via a system-wide sysctl.  A way to mesh the two scoring 
mechanisms together would be helpful, but for our purposes we don't use 
oom_score_adj at all, other than converting OOM_SCORE_ADJ_MIN to still be 
oom disabled when written by third party apps.

For memcg oom conditions, iteration of the hierarchy begins at the oom 
memcg.  For system oom conditions, this is the root memcg.

All processes attached to the oom memcg have their priority based value 
and this is compared to all child memcg's priority value at that level.  
If a process has the lowest priority, it is killed and we're done; we 
could implement a "kill all" mechanism for this memcg that is checked 
before the process is killed.

If a memcg has the lowest priority compared to attached processes, it is 
iterated as well, and so on throughout the memcg hierarchy until we find 
the lowest priority process in the lowest priority leaf memcg.  This way, 
we can fully control which process is killed for both system and memcg oom 
conditions.  I can easily post patches for this, we have used it for 
years.
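The walk described above can be sketched as a small userspace model (the types, fields, and function names here are hypothetical; a real implementation would use the kernel's memcg iterator APIs):

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

struct oom_task {
	const char *name;
	unsigned long prio;       /* lower value = preferred victim */
	struct oom_task *next;
};

struct memcg {
	unsigned long prio;
	struct oom_task *tasks;   /* processes attached at this level */
	struct memcg *children;   /* first child */
	struct memcg *sibling;    /* next sibling */
};

/*
 * At each level, compare the priorities of attached processes against
 * the priorities of child memcgs.  If a process is lowest, it is the
 * victim; if a memcg is lowest, descend into it and repeat.
 * Tie-breaking (e.g. by largest rss) is left out of this sketch.
 */
static struct oom_task *select_victim(struct memcg *memcg)
{
	for (;;) {
		struct memcg *low_memcg = NULL;
		struct oom_task *low_task = NULL;
		unsigned long low_prio = ULONG_MAX;

		for (struct oom_task *t = memcg->tasks; t; t = t->next) {
			if (t->prio <= low_prio) {
				low_prio = t->prio;
				low_task = t;
			}
		}
		for (struct memcg *c = memcg->children; c; c = c->sibling) {
			if (c->prio < low_prio) {
				low_prio = c->prio;
				low_memcg = c;
				low_task = NULL;
			}
		}
		if (!low_memcg)
			return low_task; /* NULL if the subtree has no tasks */
		memcg = low_memcg;
	}
}
```

Note that the descent is strictly local: a task with a very low priority deep in a high-priority subtree is never reached, which is what makes the policy fully controllable from the top of the hierarchy.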

> > These are two different things, right?  We can adjust how the system oom 
> > killer chooses victims when memcg hierarchies overcommit the system to not 
> > strictly prefer the single process with the largest rss without killing 
> > everything attached to the memcg.
> 
> They are different, and I thought about providing two independent knobs.
> But in the end I haven't found enough real-life examples where it can be useful.
> Can you provide something here?
> 

Yes, we have users to whom we chown their memcg hierarchy and who have 
full control over setting up that hierarchy however they want.  Our 
"Activity Manager", using Documentation/cgroup-v1/memory.txt terminology, 
is only aware of the top-level memcg that was chown'd to the user.  That 
user runs a series of batch jobs that are submitted to it, and each job is 
represented as a subcontainer to enforce strict limits on the amount of 
memory that job can use.  When it becomes oom, we have found that it is 
preferable to oom kill the entire batch job rather than leave it in an 
inconsistent state, so enabling such a knob here would be helpful.

Other top-level jobs are fine with individual processes being oom killed.  
It can be a low priority process for which they have full control over 
defining the priority through the new per-process and per-memcg value 
described above.  Easy example is scraping logs periodically or other 
best-effort tasks like cleanup.  They can happily be oom killed and 
rescheduled without taking down the entire first-class job.

> Also, they are different only for non-leaf cgroups; leaf cgroups
> are always treated as indivisible memory consumers during victim selection.
> 
> I assume that containerized systems will always set oom_kill_all_tasks for
> top-level container memory cgroups. By default it's turned off
> to provide backward compatibility with the current behavior, avoid
> excessive kills, and support oom_score_adj==-1000 (I've added this to v4,
> will post soon).
> 

We certainly would not be enabling it for top-level memcgs, there would be 
no way that we could because we have best-effort processes, but we would 
like to enable it for small batch jobs that are run on behalf of a user in 
their own subcontainer.  We have had this usecase for ~3 years and solely 
because of the problem that you pointed out earlier: it is often much more 
reliable for the kernel to do oom killing of multiple processes rather 
than userspace.

> > In our methodology, each memcg is assigned a priority value and the 
> > iteration of the hierarchy simply compares and visits the memcg with the 
> > lowest priority at each level and then selects the largest process to 
> > kill.  This could also support a "kill-all" knob.
> > 
> > 	struct mem_cgroup *memcg = root_mem_cgroup;
> > 	struct mem_cgroup *child, *low_memcg;
> > 	unsigned long low_priority;
> > 
> > next:
> > 	low_memcg = NULL;
> > 	low_priority = ULONG_MAX;
> > 	for_each_child_of_memcg(child, memcg) {
> > 		unsigned long prio = memcg_oom_priority(child);
> > 
> > 		if (prio < low_priority) {
> > 			low_memcg = child;
> > 			low_priority = prio;
> > 		}
> > 	}
> > 	if (low_memcg) {
> > 		memcg = low_memcg;
> > 		goto next;
> > 	}
> > 	oom_kill_process_from_memcg(memcg);
> > 
> > So this is a priority based model that is different than your aggregate 
> > usage model but I think it allows userspace to define a more powerful 
> > policy.  We certainly may want to kill from a memcg with a single large 
> > process, or we may want to kill from a memcg with several small processes, 
> > it depends on the importance of that job.
> 
> I believe, that both models have some advantages.
> Priority-based model is more powerful, but requires support from the userspace
> to set up these priorities (and, probably, adjust them dynamically).

It's a no-op if nobody sets up priorities or the system-wide sysctl is 
disabled.  Presumably, as in our model, the Activity Manager sets the 
sysctl and is responsible for configuring the priorities if present.  All 
memcgs at the sibling level or subcontainer level remain the default if 
not defined by the chown'd user, so this falls back to an rss model for 
backwards compatibility.


* Re: [v3 2/6] mm, oom: cgroup-aware OOM killer
  2017-07-11 20:56       ` David Rientjes
@ 2017-07-12 12:11         ` Roman Gushchin
  2017-07-12 20:26           ` David Rientjes
  0 siblings, 1 reply; 21+ messages in thread
From: Roman Gushchin @ 2017-07-12 12:11 UTC (permalink / raw)
  To: David Rientjes
  Cc: linux-mm, Michal Hocko, Vladimir Davydov, Johannes Weiner,
	Tetsuo Handa, Tejun Heo, kernel-team, cgroups, linux-doc,
	linux-kernel

On Tue, Jul 11, 2017 at 01:56:30PM -0700, David Rientjes wrote:
> On Tue, 11 Jul 2017, Roman Gushchin wrote:
> 
> > > Yes, the original motivation was to limit killing to a single process, if 
> > > possible.  To do that, we kill the process with the largest rss to free 
> > > the most memory and rely on the user to configure /proc/pid/oom_score_adj 
> > > if something else should be prioritized.
> > > 
> > > With containerization and overcommit of system memory, we concur that 
> > > killing the single largest process isn't always preferable and neglects 
> > > the priority of its memcg.  Your motivation seems to be to provide 
> > > fairness between one memcg with a large process and one memcg with a large 
> > > number of small processes; I'm curious if you are concerned about the 
> > > priority of a memcg hierarchy (how important that "job" is) or whether you 
> > > are strictly concerned with "largeness" of memcgs relative to each other.
> > 
> > I'm pretty sure we should provide some way to prioritize some cgroups
> > over others (in terms of oom killer preferences), but I'm not 100% sure yet
> > what's the best way to do it. I've suggested something similar to the existing
> > oom_score_adj for tasks, mostly to follow the existing design.
> > 
> > One of the questions to answer in a priority-based model is
> > how to compare tasks in the root cgroup with cgroups?
> > 
> 
> We do this with an alternate scoring mechanism, that is purely priority 
> based and tiebreaks based on largest rss.  An additional tunable is added 
> for each process, under /proc/pid, and also to the memcg hierarchy, and is 
> enabled via a system-wide sysctl.  A way to mesh the two scoring 
> mechanisms together would be helpful, but for our purposes we don't use 
> oom_score_adj at all, other than converting OOM_SCORE_ADJ_MIN to still be 
> oom disabled when written by third party apps.
> 
> For memcg oom conditions, iteration of the hierarchy begins at the oom 
> memcg.  For system oom conditions, this is the root memcg.
> 
> All processes attached to the oom memcg have their priority based value 
> and this is compared to all child memcg's priority value at that level.  
> If a process has the lowest priority, it is killed and we're done; we 
> could implement a "kill all" mechanism for this memcg that is checked 
> before the process is killed.
> 
> If a memcg has the lowest priority compared to attached processes, it is 
> iterated as well, and so on throughout the memcg hierarchy until we find 
> the lowest priority process in the lowest priority leaf memcg.  This way, 
> we can fully control which process is killed for both system and memcg oom 
> conditions.  I can easily post patches for this, we have used it for 
> years.
> 
> > > These are two different things, right?  We can adjust how the system oom 
> > > killer chooses victims when memcg hierarchies overcommit the system to not 
> > > strictly prefer the single process with the largest rss without killing 
> > > everything attached to the memcg.
> > 
> > They are different, and I thought about providing two independent knobs.
> > But in the end I haven't found enough real-life examples where it can be useful.
> > Can you provide something here?
> > 
> 
> Yes, we have users to whom we chown their memcg hierarchy and who have 
> full control over setting up that hierarchy however they want.  Our 
> "Activity Manager", using Documentation/cgroup-v1/memory.txt terminology, 
> is only aware of the top-level memcg that was chown'd to the user.  That 
> user runs a series of batch jobs that are submitted to it, and each job is 
> represented as a subcontainer to enforce strict limits on the amount of 
> memory that job can use.  When it becomes oom, we have found that it is 
> preferable to oom kill the entire batch job rather than leave it in an 
> inconsistent state, so enabling such a knob here would be helpful.
> 
> Other top-level jobs are fine with individual processes being oom killed.  
> It can be a low priority process for which they have full control over 
> defining the priority through the new per-process and per-memcg value 
> described above.  Easy example is scraping logs periodically or other 
> best-effort tasks like cleanup.  They can happily be oom killed and 
> rescheduled without taking down the entire first-class job.
> 
> > Also, they are different only for non-leaf cgroups; leaf cgroups
> > are always treated as indivisible memory consumers during victim selection.
> > 
> > I assume that containerized systems will always set oom_kill_all_tasks for
> > top-level container memory cgroups. By default it's turned off
> > to provide backward compatibility with the current behavior, avoid
> > excessive kills, and support oom_score_adj==-1000 (I've added this to v4,
> > will post soon).
> > 
> 
> We certainly would not be enabling it for top-level memcgs, there would be 
> no way that we could because we have best-effort processes, but we would 
> like to enable it for small batch jobs that are run on behalf of a user in 
> their own subcontainer.  We have had this usecase for ~3 years and solely 
> because of the problem that you pointed out earlier: it is often much more 
> reliable for the kernel to do oom killing of multiple processes rather 
> than userspace.
> 
> > > In our methodology, each memcg is assigned a priority value and the 
> > > iteration of the hierarchy simply compares and visits the memcg with the 
> > > lowest priority at each level and then selects the largest process to 
> > > kill.  This could also support a "kill-all" knob.
> > > 
> > > 	struct mem_cgroup *memcg = root_mem_cgroup;
> > > 	struct mem_cgroup *child, *low_memcg;
> > > 	unsigned long low_priority;
> > > 
> > > next:
> > > 	low_memcg = NULL;
> > > 	low_priority = ULONG_MAX;
> > > 	for_each_child_of_memcg(child, memcg) {
> > > 		unsigned long prio = memcg_oom_priority(child);
> > > 
> > > 		if (prio < low_priority) {
> > > 			low_memcg = child;
> > > 			low_priority = prio;
> > > 		}
> > > 	}
> > > 	if (low_memcg) {
> > > 		memcg = low_memcg;
> > > 		goto next;
> > > 	}
> > > 	oom_kill_process_from_memcg(memcg);
> > > 
> > > So this is a priority based model that is different than your aggregate 
> > > usage model but I think it allows userspace to define a more powerful 
> > > policy.  We certainly may want to kill from a memcg with a single large 
> > > process, or we may want to kill from a memcg with several small processes, 
> > > it depends on the importance of that job.
> > 
> > I believe, that both models have some advantages.
> > Priority-based model is more powerful, but requires support from the userspace
> > to set up these priorities (and, probably, adjust them dynamically).
> 
> It's a no-op if nobody sets up priorities or the system-wide sysctl is 
> disabled.  Presumably, as in our model, the Activity Manager sets the 
> sysctl and is responsible for configuring the priorities if present.  All 
> memcgs at the sibling level or subcontainer level remain the default if 
> not defined by the chown'd user, so this falls back to an rss model for 
> backwards compatibility.

Hm, this is interesting...

What I'm thinking about is that we can introduce the following model:
each memory cgroup has an integer oom priority value, 0 by default.
Root cgroup priority is always 0; other cgroups can have either positive
or negative priorities.

During OOM victim selection we compare cgroups on each hierarchy level
based on priority, and on size if there are several cgroups with equal priority.
Per-task oom_score_adj will affect task selection inside a cgroup if
oom_kill_all_tasks is not set. The special value -1000 will also completely
protect a task from being killed, but only if oom_kill_all_tasks is not set.
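The per-level comparison in this model could be sketched as follows (this is an illustrative userspace sketch; the polarity is an assumption of the sketch, where a higher priority value means "prefer as victim" and size only breaks ties):

```c
#include <assert.h>

struct memcg_stat {
	int oom_prio;        /* per-cgroup oom priority, 0 by default */
	unsigned long size;  /* aggregate memory footprint, in pages */
};

/*
 * Pick the preferred victim among two sibling cgroups: compare by
 * priority first, and by size only when priorities are equal.
 */
static const struct memcg_stat *pick_victim(const struct memcg_stat *a,
					    const struct memcg_stat *b)
{
	if (a->oom_prio != b->oom_prio)
		return a->oom_prio > b->oom_prio ? a : b;
	return a->size >= b->size ? a : b;  /* tie: bigger consumer loses */
}
```

With all priorities left at 0, this degrades to the pure size-based comparison, so unconfigured systems keep the default behavior.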

I wonder, if it will work for you?

Thanks!

Roman


* Re: [v3 2/6] mm, oom: cgroup-aware OOM killer
  2017-07-12 12:11         ` Roman Gushchin
@ 2017-07-12 20:26           ` David Rientjes
  0 siblings, 0 replies; 21+ messages in thread
From: David Rientjes @ 2017-07-12 20:26 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: linux-mm, Michal Hocko, Vladimir Davydov, Johannes Weiner,
	Tetsuo Handa, Tejun Heo, kernel-team, cgroups, linux-doc,
	linux-kernel

On Wed, 12 Jul 2017, Roman Gushchin wrote:

> > It's a no-op if nobody sets up priorities or the system-wide sysctl is 
> > disabled.  Presumably, as in our model, the Activity Manager sets the 
> > sysctl and is responsible for configuring the priorities if present.  All 
> > memcgs at the sibling level or subcontainer level remain the default if 
> > not defined by the chown'd user, so this falls back to an rss model for 
> > backwards compatibility.
> 
> Hm, this is interesting...
> 
> What I'm thinking about is that we can introduce the following model:
> each memory cgroup has an integer oom priority value, 0 by default.
> Root cgroup priority is always 0; other cgroups can have either positive
> or negative priorities.
> 

For our purposes we use a range of [0, 10000] for the per-process oom 
priority; 10000 implies the process is not oom killable, 5000 is the 
default.  We use a range of [0, 9999] for the per-memcg oom priority since 
memcgs cannot disable themselves from oom killing (although they could oom 
disable all attached processes).  We can obviously remap our priorities to 
whatever we decide here, but I think we should give ourselves more room 
and provide 10000 priorities at the minimum (we have 5000 true priorities 
plus overlimit bias).  I'm not sure that negative priorities make sense in 
this model, is there a strong reason to prefer [-5000, 5000] over 
[0, 10000]?

And, yes, the root memcg remains a constant oom priority and is never 
actually checked.

> During OOM victim selection we compare cgroups on each hierarchy level
> based on priority, and on size if there are several cgroups with equal priority.
> Per-task oom_score_adj will affect task selection inside a cgroup if
> oom_kill_all_tasks is not set. The special value -1000 will also completely
> protect a task from being killed, but only if oom_kill_all_tasks is not set.
> 

If there are several cgroups of equal priority, we prefer the one that was 
created the most recently just to avoid losing work that has been done for 
a long period of time.  But the key in this proposal is that we _always_ 
continue to iterate the memcg hierarchy until we find a process attached 
to a memcg with the lowest priority relative to sibling cgroups, if any.

To adapt your model to this proposal, memory.oom_kill_all_tasks would only 
be effective if there are no descendant memcgs.  In that case, iteration 
stops anyway and in my model we kill the process with the lowest 
per-process priority.  This could trivially check 
memory.oom_kill_all_tasks and kill everything, and I'm happy to support 
that feature since we have had a need for it in the past as well.
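Putting the two rules together, the sibling comparison and the leaf-only kill-all check could be sketched like this (field names are hypothetical; lower priority value means "kill first", matching the earlier pseudocode):

```c
#include <assert.h>

struct memcg_info {
	unsigned long prio;     /* lower value = preferred victim */
	unsigned long created;  /* creation timestamp; larger = more recent */
	int kill_all;           /* oom_kill_all_tasks setting */
	int is_leaf;            /* no descendant memcgs */
};

/* Return nonzero if a should be preferred over b as the oom victim. */
static int prefer_as_victim(const struct memcg_info *a,
			    const struct memcg_info *b)
{
	if (a->prio != b->prio)
		return a->prio < b->prio;  /* lowest priority loses */
	return a->created > b->created;    /* tie: most recently created */
}

/* kill-all only takes effect where iteration stops: a leaf memcg. */
static int should_kill_all(const struct memcg_info *m)
{
	return m->is_leaf && m->kill_all;
}
```

The creation-time tie-break encodes the "avoid losing long-running work" preference directly in the comparator, so the hierarchy walk itself stays unchanged.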

We should talk about when this priority-based scoring becomes effective.  
We enable it by default in our kernel, but it could be guarded with a VM 
sysctl if necessary to enact a system-wide policy.


end of thread, other threads:[~2017-07-12 20:26 UTC | newest]

Thread overview: 21+ messages
2017-06-21 21:19 [v3 0/6] cgroup-aware OOM killer Roman Gushchin
2017-06-21 21:19 ` [v3 1/6] mm, oom: use oom_victims counter to synchronize oom victim selection Roman Gushchin
     [not found]   ` <201706220040.v5M0eSnK074332@www262.sakura.ne.jp>
2017-06-22 16:58     ` Roman Gushchin
2017-06-22 20:37       ` Tetsuo Handa
     [not found]         ` <201706230537.IDB21366.SQHJVFOOFOMFLt-JPay3/Yim36HaxMnTkn67Xf5DAMn2ifp@public.gmane.org>
2017-06-22 21:52           ` Tetsuo Handa
2017-06-29 18:47             ` Roman Gushchin
2017-06-29 20:13               ` Tetsuo Handa
2017-06-29  9:04   ` Michal Hocko
2017-06-21 21:19 ` [v3 2/6] mm, oom: cgroup-aware OOM killer Roman Gushchin
2017-07-10 23:05   ` David Rientjes
2017-07-11 12:51     ` Roman Gushchin
2017-07-11 20:56       ` David Rientjes
2017-07-12 12:11         ` Roman Gushchin
2017-07-12 20:26           ` David Rientjes
2017-06-21 21:19 ` [v3 3/6] mm, oom: cgroup-aware OOM killer debug info Roman Gushchin
2017-06-21 21:19 ` [v3 4/6] mm, oom: introduce oom_score_adj for memory cgroups Roman Gushchin
2017-06-21 21:19 ` [v3 5/6] mm, oom: don't mark all oom victims tasks with TIF_MEMDIE Roman Gushchin
2017-06-29  8:53   ` Michal Hocko
2017-06-29 18:45     ` Roman Gushchin
2017-06-30  8:25       ` Michal Hocko
2017-06-21 21:19 ` [v3 6/6] mm,oom,docs: describe the cgroup-aware OOM killer Roman Gushchin
