linux-kernel.vger.kernel.org archive mirror
* [v4 1/4] mm, oom: refactor the TIF_MEMDIE usage
       [not found] <20170726132718.14806-1-guro@fb.com>
@ 2017-07-26 13:27 ` Roman Gushchin
  2017-07-26 13:56   ` Michal Hocko
  2017-07-26 13:27 ` [v4 2/4] mm, oom: cgroup-aware OOM killer Roman Gushchin
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 23+ messages in thread
From: Roman Gushchin @ 2017-07-26 13:27 UTC (permalink / raw)
  To: linux-mm
  Cc: Roman Gushchin, Michal Hocko, Vladimir Davydov, Johannes Weiner,
	Tetsuo Handa, David Rientjes, Tejun Heo, kernel-team, cgroups,
	linux-doc, linux-kernel

First, separate the tsk_is_oom_victim() and TIF_MEMDIE flag checks:
let the first indicate that a task has been killed by the OOM killer,
and the second that a task has access to the memory reserves
(with the hope of eliminating the flag later).

Second, set TIF_MEMDIE on all threads of an OOM victim process.

Third, to limit the number of processes which have access to the
memory reserves, keep an atomic pointer to the task that grabbed them.

Signed-off-by: Roman Gushchin <guro@fb.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: David Rientjes <rientjes@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: kernel-team@fb.com
Cc: cgroups@vger.kernel.org
Cc: linux-doc@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
---
 kernel/exit.c   |  2 +-
 mm/memcontrol.c |  2 +-
 mm/oom_kill.c   | 30 +++++++++++++++++++++++++-----
 3 files changed, 27 insertions(+), 7 deletions(-)

diff --git a/kernel/exit.c b/kernel/exit.c
index 8f40bee5ba9d..d5f372a2a363 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -542,7 +542,7 @@ static void exit_mm(void)
 	task_unlock(current);
 	mm_update_next_owner(mm);
 	mmput(mm);
-	if (test_thread_flag(TIF_MEMDIE))
+	if (tsk_is_oom_victim(current))
 		exit_oom_victim();
 }
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index d61133e6af99..9085e55eb69f 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1896,7 +1896,7 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	 * bypass the last charges so that they can exit quickly and
 	 * free their memory.
 	 */
-	if (unlikely(test_thread_flag(TIF_MEMDIE) ||
+	if (unlikely(tsk_is_oom_victim(current) ||
 		     fatal_signal_pending(current) ||
 		     current->flags & PF_EXITING))
 		goto force;
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 9e8b4f030c1c..72de01be4d33 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -435,6 +435,8 @@ static DECLARE_WAIT_QUEUE_HEAD(oom_victims_wait);
 
 static bool oom_killer_disabled __read_mostly;
 
+static struct task_struct *tif_memdie_owner;
+
 #define K(x) ((x) << (PAGE_SHIFT-10))
 
 /*
@@ -656,13 +658,24 @@ static void mark_oom_victim(struct task_struct *tsk)
 	struct mm_struct *mm = tsk->mm;
 
 	WARN_ON(oom_killer_disabled);
-	/* OOM killer might race with memcg OOM */
-	if (test_and_set_tsk_thread_flag(tsk, TIF_MEMDIE))
+
+	if (!cmpxchg(&tif_memdie_owner, NULL, current)) {
+		struct task_struct *t;
+
+		rcu_read_lock();
+		for_each_thread(current, t)
+			set_tsk_thread_flag(t, TIF_MEMDIE);
+		rcu_read_unlock();
+	}
+
+	/*
+	 * OOM killer might race with memcg OOM.
+	 * oom_mm is bound to the signal struct life time.
+	 */
+	if (cmpxchg(&tsk->signal->oom_mm, NULL, mm))
 		return;
 
-	/* oom_mm is bound to the signal struct life time. */
-	if (!cmpxchg(&tsk->signal->oom_mm, NULL, mm))
-		mmgrab(tsk->signal->oom_mm);
+	mmgrab(tsk->signal->oom_mm);
 
 	/*
 	 * Make sure that the task is woken up from uninterruptible sleep
@@ -682,6 +695,13 @@ void exit_oom_victim(void)
 {
 	clear_thread_flag(TIF_MEMDIE);
 
+	/*
+	 * If the current task is the thread which initially
+	 * received TIF_MEMDIE, clear tif_memdie_owner to
+	 * give the next process a chance to capture it.
+	 */
+	cmpxchg(&tif_memdie_owner, current, NULL);
+
 	if (!atomic_dec_return(&oom_victims))
 		wake_up_all(&oom_victims_wait);
 }
-- 
2.13.3

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [v4 2/4] mm, oom: cgroup-aware OOM killer
       [not found] <20170726132718.14806-1-guro@fb.com>
  2017-07-26 13:27 ` [v4 1/4] mm, oom: refactor the TIF_MEMDIE usage Roman Gushchin
@ 2017-07-26 13:27 ` Roman Gushchin
  2017-07-27 21:41   ` kbuild test robot
  2017-08-01 14:54   ` Michal Hocko
  2017-07-26 13:27 ` [v4 3/4] mm, oom: introduce oom_priority for memory cgroups Roman Gushchin
  2017-07-26 13:27 ` [v4 4/4] mm, oom, docs: describe the cgroup-aware OOM killer Roman Gushchin
  3 siblings, 2 replies; 23+ messages in thread
From: Roman Gushchin @ 2017-07-26 13:27 UTC (permalink / raw)
  To: linux-mm
  Cc: Roman Gushchin, Michal Hocko, Vladimir Davydov, Johannes Weiner,
	Tetsuo Handa, David Rientjes, Tejun Heo, kernel-team, cgroups,
	linux-doc, linux-kernel

Traditionally, the OOM killer operates at the process level:
under OOM conditions, it finds the process with the highest oom score
and kills it.

This behavior doesn't suit systems with many running containers
well:

1) There is no fairness between containers. A small container with
few large processes will be chosen over a large one with a huge
number of small processes.

2) Containers often do not expect that some random process inside
will be killed. In many cases a much safer behavior is to kill
all tasks in the container. Traditionally, this was implemented
in userspace, but doing it in the kernel has some advantages,
especially in the case of a system-wide OOM.

3) Per-process oom_score_adj affects the global OOM decision, so it's
a breach of isolation between containers.

To address these issues, a cgroup-aware OOM killer is introduced.

Under OOM conditions, it tries to find the biggest memory consumer
and free memory by killing the corresponding task(s). The difference
from the "traditional" OOM killer is that it can treat memory cgroups,
as well as single processes, as memory consumers.

By default, it will look for the biggest leaf cgroup, and kill
the largest task inside.

A user can change this behavior by enabling the per-cgroup
oom_kill_all_tasks option. If set, it causes the OOM killer to treat
the whole cgroup as an indivisible memory consumer. If the cgroup is
selected as an OOM victim, all of its tasks are killed.

Tasks in the root cgroup are treated as independent memory consumers,
and are compared with other memory consumers (e.g. leaf cgroups).
The root cgroup doesn't support the oom_kill_all_tasks feature.

Signed-off-by: Roman Gushchin <guro@fb.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: David Rientjes <rientjes@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: kernel-team@fb.com
Cc: cgroups@vger.kernel.org
Cc: linux-doc@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
---
 include/linux/memcontrol.h |  23 +++++
 include/linux/oom.h        |   3 +
 mm/memcontrol.c            | 208 +++++++++++++++++++++++++++++++++++++++++++++
 mm/oom_kill.c              | 172 ++++++++++++++++++++++++-------------
 4 files changed, 349 insertions(+), 57 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 3914e3dd6168..b21bbb0edc72 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -35,6 +35,7 @@ struct mem_cgroup;
 struct page;
 struct mm_struct;
 struct kmem_cache;
+struct oom_control;
 
 /* Cgroup-specific page state, on top of universal node page state */
 enum memcg_stat_item {
@@ -199,6 +200,12 @@ struct mem_cgroup {
 	/* OOM-Killer disable */
 	int		oom_kill_disable;
 
+	/* kill all tasks in the subtree in case of OOM */
+	bool oom_kill_all_tasks;
+
+	/* cached OOM score */
+	long oom_score;
+
 	/* handle for "memory.events" */
 	struct cgroup_file events_file;
 
@@ -342,6 +349,11 @@ struct mem_cgroup *mem_cgroup_from_css(struct cgroup_subsys_state *css){
 	return css ? container_of(css, struct mem_cgroup, css) : NULL;
 }
 
+static inline void mem_cgroup_put(struct mem_cgroup *memcg)
+{
+	css_put(&memcg->css);
+}
+
 #define mem_cgroup_from_counter(counter, member)	\
 	container_of(counter, struct mem_cgroup, member)
 
@@ -480,6 +492,8 @@ static inline bool task_in_memcg_oom(struct task_struct *p)
 
 bool mem_cgroup_oom_synchronize(bool wait);
 
+bool mem_cgroup_select_oom_victim(struct oom_control *oc);
+
 #ifdef CONFIG_MEMCG_SWAP
 extern int do_swap_account;
 #endif
@@ -739,6 +753,10 @@ static inline bool task_in_mem_cgroup(struct task_struct *task,
 	return true;
 }
 
+static inline void mem_cgroup_put(struct mem_cgroup *memcg)
+{
+}
+
 static inline struct mem_cgroup *
 mem_cgroup_iter(struct mem_cgroup *root,
 		struct mem_cgroup *prev,
@@ -926,6 +944,11 @@ static inline
 void count_memcg_event_mm(struct mm_struct *mm, enum vm_event_item idx)
 {
 }
+
+static inline bool mem_cgroup_select_oom_victim(struct oom_control *oc)
+{
+	return false;
+}
 #endif /* CONFIG_MEMCG */
 
 static inline void __inc_memcg_state(struct mem_cgroup *memcg,
diff --git a/include/linux/oom.h b/include/linux/oom.h
index 8a266e2be5a6..b7ec3bd441be 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -39,6 +39,7 @@ struct oom_control {
 	unsigned long totalpages;
 	struct task_struct *chosen;
 	unsigned long chosen_points;
+	struct mem_cgroup *chosen_memcg;
 };
 
 extern struct mutex oom_lock;
@@ -79,6 +80,8 @@ extern void oom_killer_enable(void);
 
 extern struct task_struct *find_lock_task_mm(struct task_struct *p);
 
+extern int oom_evaluate_task(struct task_struct *task, void *arg);
+
 /* sysctls */
 extern int sysctl_oom_dump_tasks;
 extern int sysctl_oom_kill_allocating_task;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 9085e55eb69f..ba72d1cf73d0 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2625,6 +2625,181 @@ static inline bool memcg_has_children(struct mem_cgroup *memcg)
 	return ret;
 }
 
+static long memcg_oom_badness(struct mem_cgroup *memcg,
+			      const nodemask_t *nodemask)
+{
+	long points = 0;
+	int nid;
+
+	for_each_node_state(nid, N_MEMORY) {
+		if (nodemask && !node_isset(nid, *nodemask))
+			continue;
+
+		points += mem_cgroup_node_nr_lru_pages(memcg, nid,
+				LRU_ALL_ANON | BIT(LRU_UNEVICTABLE));
+	}
+
+	points += memcg_page_state(memcg, MEMCG_KERNEL_STACK_KB) /
+		(PAGE_SIZE / 1024);
+	points += memcg_page_state(memcg, NR_SLAB_UNRECLAIMABLE);
+	points += memcg_page_state(memcg, MEMCG_SOCK);
+	points += memcg_page_state(memcg, MEMCG_SWAP);
+
+	return points;
+}
+
+static long oom_evaluate_memcg(struct mem_cgroup *memcg,
+			       const nodemask_t *nodemask)
+{
+	struct css_task_iter it;
+	struct task_struct *task;
+	int eligible = 0;
+
+	css_task_iter_start(&memcg->css, 0, &it);
+	while ((task = css_task_iter_next(&it))) {
+		/*
+		 * If there are no tasks, or all tasks have oom_score_adj set
+		 * to OOM_SCORE_ADJ_MIN and oom_kill_all_tasks is not set,
+		 * don't select this memory cgroup.
+		 */
+		if (!eligible &&
+		    (memcg->oom_kill_all_tasks ||
+		     task->signal->oom_score_adj != OOM_SCORE_ADJ_MIN))
+			eligible = 1;
+
+		/*
+		 * If there are previously selected OOM victims,
+		 * abort memcg selection.
+		 */
+		if (tsk_is_oom_victim(task) &&
+		    !test_bit(MMF_OOM_SKIP, &task->signal->oom_mm->flags)) {
+			eligible = -1;
+			break;
+		}
+	}
+	css_task_iter_end(&it);
+
+	return eligible > 0 ? memcg_oom_badness(memcg, nodemask) : eligible;
+}
+
+static void select_victim_memcg(struct mem_cgroup *root, struct oom_control *oc)
+{
+	struct mem_cgroup *iter, *parent;
+
+	for_each_mem_cgroup_tree(iter, root) {
+		if (memcg_has_children(iter)) {
+			iter->oom_score = 0;
+			continue;
+		}
+
+		iter->oom_score = oom_evaluate_memcg(iter, oc->nodemask);
+		if (iter->oom_score == -1) {
+			oc->chosen_memcg = (void *)-1UL;
+			mem_cgroup_iter_break(root, iter);
+			return;
+		}
+
+		if (!iter->oom_score)
+			continue;
+
+		for (parent = parent_mem_cgroup(iter); parent && parent != root;
+		     parent = parent_mem_cgroup(parent))
+			parent->oom_score += iter->oom_score;
+	}
+
+	for (;;) {
+		struct cgroup_subsys_state *css;
+		struct mem_cgroup *memcg = NULL;
+		long score = LONG_MIN;
+
+		css_for_each_child(css, &root->css) {
+			struct mem_cgroup *iter = mem_cgroup_from_css(css);
+
+			if (iter->oom_score > score) {
+				memcg = iter;
+				score = iter->oom_score;
+			}
+		}
+
+		if (!memcg) {
+			if (oc->memcg && root == oc->memcg) {
+				oc->chosen_memcg = oc->memcg;
+				css_get(&oc->chosen_memcg->css);
+				oc->chosen_points = oc->memcg->oom_score;
+			}
+			break;
+		}
+
+		if (memcg->oom_kill_all_tasks || !memcg_has_children(memcg)) {
+			oc->chosen_memcg = memcg;
+			css_get(&oc->chosen_memcg->css);
+			oc->chosen_points = score;
+			break;
+		}
+
+		root = memcg;
+	}
+}
+
+static void select_victim_root_cgroup_task(struct oom_control *oc)
+{
+	struct css_task_iter it;
+	struct task_struct *task;
+	int ret = 0;
+
+	css_task_iter_start(&root_mem_cgroup->css, 0, &it);
+	while (!ret && (task = css_task_iter_next(&it)))
+		ret = oom_evaluate_task(task, oc);
+	css_task_iter_end(&it);
+}
+
+bool mem_cgroup_select_oom_victim(struct oom_control *oc)
+{
+	struct mem_cgroup *root = root_mem_cgroup;
+
+	oc->chosen = NULL;
+	oc->chosen_memcg = NULL;
+
+	if (mem_cgroup_disabled())
+		return false;
+
+	if (!cgroup_subsys_on_dfl(memory_cgrp_subsys))
+		return false;
+
+	if (oc->memcg)
+		root = oc->memcg;
+
+	rcu_read_lock();
+
+	select_victim_memcg(root, oc);
+	if (oc->chosen_memcg == (void *)-1UL) {
+		/* Existing OOM victims are found. */
+		rcu_read_unlock();
+		return true;
+	}
+
+	/*
+	 * For system-wide OOMs we should consider tasks in the root cgroup
+	 * with oom_score larger than oc->chosen_points.
+	 */
+	if (!oc->memcg) {
+		select_victim_root_cgroup_task(oc);
+
+		if (oc->chosen && oc->chosen_memcg) {
+			/*
+			 * If we've decided to kill a task in the root memcg,
+			 * release chosen_memcg.
+			 */
+			css_put(&oc->chosen_memcg->css);
+			oc->chosen_memcg = NULL;
+		}
+	}
+
+	rcu_read_unlock();
+
+	return !!oc->chosen || !!oc->chosen_memcg;
+}
+
 /*
  * Reclaims as many pages from the given memcg as possible.
  *
@@ -5171,6 +5346,33 @@ static ssize_t memory_max_write(struct kernfs_open_file *of,
 	return nbytes;
 }
 
+static int memory_oom_kill_all_tasks_show(struct seq_file *m, void *v)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m));
+	bool oom_kill_all_tasks = memcg->oom_kill_all_tasks;
+
+	seq_printf(m, "%d\n", oom_kill_all_tasks);
+
+	return 0;
+}
+
+static ssize_t memory_oom_kill_all_tasks_write(struct kernfs_open_file *of,
+					       char *buf, size_t nbytes,
+					       loff_t off)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
+	int oom_kill_all_tasks;
+	int err;
+
+	err = kstrtoint(strstrip(buf), 0, &oom_kill_all_tasks);
+	if (err)
+		return err;
+
+	memcg->oom_kill_all_tasks = !!oom_kill_all_tasks;
+
+	return nbytes;
+}
+
 static int memory_events_show(struct seq_file *m, void *v)
 {
 	struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m));
@@ -5291,6 +5493,12 @@ static struct cftype memory_files[] = {
 		.write = memory_max_write,
 	},
 	{
+		.name = "oom_kill_all_tasks",
+		.flags = CFTYPE_NOT_ON_ROOT,
+		.seq_show = memory_oom_kill_all_tasks_show,
+		.write = memory_oom_kill_all_tasks_write,
+	},
+	{
 		.name = "events",
 		.flags = CFTYPE_NOT_ON_ROOT,
 		.file_offset = offsetof(struct mem_cgroup, events_file),
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 72de01be4d33..a9d75becd1e4 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -288,7 +288,7 @@ static enum oom_constraint constrained_alloc(struct oom_control *oc)
 	return CONSTRAINT_NONE;
 }
 
-static int oom_evaluate_task(struct task_struct *task, void *arg)
+int oom_evaluate_task(struct task_struct *task, void *arg)
 {
 	struct oom_control *oc = arg;
 	unsigned long points;
@@ -829,66 +829,14 @@ static bool task_will_free_mem(struct task_struct *task)
 	return ret;
 }
 
-static void oom_kill_process(struct oom_control *oc, const char *message)
+static void __oom_kill_process(struct task_struct *victim)
 {
-	struct task_struct *p = oc->chosen;
-	unsigned int points = oc->chosen_points;
-	struct task_struct *victim = p;
-	struct task_struct *child;
-	struct task_struct *t;
+	struct task_struct *p;
 	struct mm_struct *mm;
-	unsigned int victim_points = 0;
-	static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
-					      DEFAULT_RATELIMIT_BURST);
 	bool can_oom_reap = true;
 
-	/*
-	 * If the task is already exiting, don't alarm the sysadmin or kill
-	 * its children or threads, just set TIF_MEMDIE so it can die quickly
-	 */
-	task_lock(p);
-	if (task_will_free_mem(p)) {
-		mark_oom_victim(p);
-		wake_oom_reaper(p);
-		task_unlock(p);
-		put_task_struct(p);
+	if (is_global_init(victim) || (victim->flags & PF_KTHREAD))
 		return;
-	}
-	task_unlock(p);
-
-	if (__ratelimit(&oom_rs))
-		dump_header(oc, p);
-
-	pr_err("%s: Kill process %d (%s) score %u or sacrifice child\n",
-		message, task_pid_nr(p), p->comm, points);
-
-	/*
-	 * If any of p's children has a different mm and is eligible for kill,
-	 * the one with the highest oom_badness() score is sacrificed for its
-	 * parent.  This attempts to lose the minimal amount of work done while
-	 * still freeing memory.
-	 */
-	read_lock(&tasklist_lock);
-	for_each_thread(p, t) {
-		list_for_each_entry(child, &t->children, sibling) {
-			unsigned int child_points;
-
-			if (process_shares_mm(child, p->mm))
-				continue;
-			/*
-			 * oom_badness() returns 0 if the thread is unkillable
-			 */
-			child_points = oom_badness(child,
-				oc->memcg, oc->nodemask, oc->totalpages);
-			if (child_points > victim_points) {
-				put_task_struct(victim);
-				victim = child;
-				victim_points = child_points;
-				get_task_struct(victim);
-			}
-		}
-	}
-	read_unlock(&tasklist_lock);
 
 	p = find_lock_task_mm(victim);
 	if (!p) {
@@ -959,10 +907,117 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
 		wake_oom_reaper(victim);
 
 	mmdrop(mm);
-	put_task_struct(victim);
 }
 #undef K
 
+static void oom_kill_process(struct oom_control *oc, const char *message)
+{
+	struct task_struct *p = oc->chosen;
+	unsigned int points = oc->chosen_points;
+	struct task_struct *victim = p;
+	struct task_struct *child;
+	struct task_struct *t;
+	unsigned int victim_points = 0;
+	static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
+					      DEFAULT_RATELIMIT_BURST);
+
+	/*
+	 * If the task is already exiting, don't alarm the sysadmin or kill
+	 * its children or threads, just set TIF_MEMDIE so it can die quickly
+	 */
+	task_lock(p);
+	if (task_will_free_mem(p)) {
+		mark_oom_victim(p);
+		wake_oom_reaper(p);
+		task_unlock(p);
+		put_task_struct(p);
+		return;
+	}
+	task_unlock(p);
+
+	if (__ratelimit(&oom_rs))
+		dump_header(oc, p);
+
+	pr_err("%s: Kill process %d (%s) score %u or sacrifice child\n",
+		message, task_pid_nr(p), p->comm, points);
+
+	/*
+	 * If any of p's children has a different mm and is eligible for kill,
+	 * the one with the highest oom_badness() score is sacrificed for its
+	 * parent.  This attempts to lose the minimal amount of work done while
+	 * still freeing memory.
+	 */
+	read_lock(&tasklist_lock);
+	for_each_thread(p, t) {
+		list_for_each_entry(child, &t->children, sibling) {
+			unsigned int child_points;
+
+			if (process_shares_mm(child, p->mm))
+				continue;
+			/*
+			 * oom_badness() returns 0 if the thread is unkillable
+			 */
+			child_points = oom_badness(child,
+				oc->memcg, oc->nodemask, oc->totalpages);
+			if (child_points > victim_points) {
+				put_task_struct(victim);
+				victim = child;
+				victim_points = child_points;
+				get_task_struct(victim);
+			}
+		}
+	}
+	read_unlock(&tasklist_lock);
+
+	__oom_kill_process(victim);
+	put_task_struct(victim);
+}
+
+static int oom_kill_memcg_member(struct task_struct *task, void *unused)
+{
+	if (!tsk_is_oom_victim(task))
+		__oom_kill_process(task);
+	return 0;
+}
+
+static bool oom_kill_memcg_victim(struct oom_control *oc)
+{
+	if (oc->chosen) {
+		if (oc->chosen != (void *)-1UL) {
+			__oom_kill_process(oc->chosen);
+			put_task_struct(oc->chosen);
+			schedule_timeout_killable(1);
+		}
+		return true;
+
+	} else if (oc->chosen_memcg) {
+		if (oc->chosen_memcg == (void *)-1UL)
+			return true;
+
+		/* Always begin with the biggest task */
+		oc->chosen_points = 0;
+		oc->chosen = NULL;
+		mem_cgroup_scan_tasks(oc->chosen_memcg, oom_evaluate_task, oc);
+		if (oc->chosen && oc->chosen != (void *)-1UL) {
+			__oom_kill_process(oc->chosen);
+			put_task_struct(oc->chosen);
+
+			if (oc->chosen_memcg->oom_kill_all_tasks)
+				mem_cgroup_scan_tasks(oc->chosen_memcg,
+						      oom_kill_memcg_member,
+						      NULL);
+		}
+
+		mem_cgroup_put(oc->chosen_memcg);
+		oc->chosen_memcg = NULL;
+		return true;
+
+	} else {
+		oc->chosen_points = 0;
+		return false;
+	}
+}
+
 /*
  * Determines whether the kernel must panic because of the panic_on_oom sysctl.
  */
@@ -1064,6 +1119,9 @@ bool out_of_memory(struct oom_control *oc)
 		return true;
 	}
 
+	if (mem_cgroup_select_oom_victim(oc) && oom_kill_memcg_victim(oc))
+		return true;
+
 	select_bad_process(oc);
 	/* Found nothing?!?! Either we hang forever, or we panic. */
 	if (!oc->chosen && !is_sysrq_oom(oc) && !is_memcg_oom(oc)) {
-- 
2.13.3


* [v4 3/4] mm, oom: introduce oom_priority for memory cgroups
       [not found] <20170726132718.14806-1-guro@fb.com>
  2017-07-26 13:27 ` [v4 1/4] mm, oom: refactor the TIF_MEMDIE usage Roman Gushchin
  2017-07-26 13:27 ` [v4 2/4] mm, oom: cgroup-aware OOM killer Roman Gushchin
@ 2017-07-26 13:27 ` Roman Gushchin
  2017-08-08 23:14   ` David Rientjes
  2017-07-26 13:27 ` [v4 4/4] mm, oom, docs: describe the cgroup-aware OOM killer Roman Gushchin
  3 siblings, 1 reply; 23+ messages in thread
From: Roman Gushchin @ 2017-07-26 13:27 UTC (permalink / raw)
  To: linux-mm
  Cc: Roman Gushchin, Michal Hocko, Vladimir Davydov, Johannes Weiner,
	David Rientjes, Tejun Heo, Tetsuo Handa, kernel-team, cgroups,
	linux-doc, linux-kernel

Introduce a per-memory-cgroup oom_priority setting: an integer number
within the [-10000, 10000] range, which defines the order in which
the OOM killer selects victim memory cgroups.

The OOM killer prefers memory cgroups with larger priority if they
are populated with eligible tasks.

The oom_priority value is compared within sibling cgroups.

The root cgroup has oom_priority 0, which cannot be changed.

Signed-off-by: Roman Gushchin <guro@fb.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: kernel-team@fb.com
Cc: cgroups@vger.kernel.org
Cc: linux-doc@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
---
 include/linux/memcontrol.h |  3 +++
 mm/memcontrol.c            | 55 ++++++++++++++++++++++++++++++++++++++++++++--
 2 files changed, 56 insertions(+), 2 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index b21bbb0edc72..d31ac58e08ad 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -206,6 +206,9 @@ struct mem_cgroup {
 	/* cached OOM score */
 	long oom_score;
 
+	/* OOM killer priority */
+	short oom_priority;
+
 	/* handle for "memory.events" */
 	struct cgroup_file events_file;
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index ba72d1cf73d0..2c1566995077 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2710,12 +2710,21 @@ static void select_victim_memcg(struct mem_cgroup *root, struct oom_control *oc)
 	for (;;) {
 		struct cgroup_subsys_state *css;
 		struct mem_cgroup *memcg = NULL;
+		short prio = SHRT_MIN;
 		long score = LONG_MIN;
 
 		css_for_each_child(css, &root->css) {
 			struct mem_cgroup *iter = mem_cgroup_from_css(css);
 
-			if (iter->oom_score > score) {
+			if (iter->oom_score == 0)
+				continue;
+
+			if (iter->oom_priority > prio) {
+				memcg = iter;
+				prio = iter->oom_priority;
+				score = iter->oom_score;
+			} else if (iter->oom_priority == prio &&
+				   iter->oom_score > score) {
 				memcg = iter;
 				score = iter->oom_score;
 			}
@@ -2782,7 +2791,15 @@ bool mem_cgroup_select_oom_victim(struct oom_control *oc)
 	 * For system-wide OOMs we should consider tasks in the root cgroup
 	 * with oom_score larger than oc->chosen_points.
 	 */
-	if (!oc->memcg) {
+	if (!oc->memcg && !(oc->chosen_memcg &&
+			    oc->chosen_memcg->oom_priority > 0)) {
+		/*
+		 * Root memcg has priority 0, so if chosen memcg has lower
+		 * priority, any task in root cgroup is preferable.
+		 */
+		if (oc->chosen_memcg && oc->chosen_memcg->oom_priority < 0)
+			oc->chosen_points = 0;
+
 		select_victim_root_cgroup_task(oc);
 
 		if (oc->chosen && oc->chosen_memcg) {
@@ -5373,6 +5390,34 @@ static ssize_t memory_oom_kill_all_tasks_write(struct kernfs_open_file *of,
 	return nbytes;
 }
 
+static int memory_oom_priority_show(struct seq_file *m, void *v)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m));
+
+	seq_printf(m, "%d\n", memcg->oom_priority);
+
+	return 0;
+}
+
+static ssize_t memory_oom_priority_write(struct kernfs_open_file *of,
+				char *buf, size_t nbytes, loff_t off)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
+	int oom_priority;
+	int err;
+
+	err = kstrtoint(strstrip(buf), 0, &oom_priority);
+	if (err)
+		return err;
+
+	if (oom_priority < -10000 || oom_priority > 10000)
+		return -EINVAL;
+
+	memcg->oom_priority = (short)oom_priority;
+
+	return nbytes;
+}
+
 static int memory_events_show(struct seq_file *m, void *v)
 {
 	struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m));
@@ -5499,6 +5544,12 @@ static struct cftype memory_files[] = {
 		.write = memory_oom_kill_all_tasks_write,
 	},
 	{
+		.name = "oom_priority",
+		.flags = CFTYPE_NOT_ON_ROOT,
+		.seq_show = memory_oom_priority_show,
+		.write = memory_oom_priority_write,
+	},
+	{
 		.name = "events",
 		.flags = CFTYPE_NOT_ON_ROOT,
 		.file_offset = offsetof(struct mem_cgroup, events_file),
-- 
2.13.3


* [v4 4/4] mm, oom, docs: describe the cgroup-aware OOM killer
       [not found] <20170726132718.14806-1-guro@fb.com>
                   ` (2 preceding siblings ...)
  2017-07-26 13:27 ` [v4 3/4] mm, oom: introduce oom_priority for memory cgroups Roman Gushchin
@ 2017-07-26 13:27 ` Roman Gushchin
  2017-08-08 23:24   ` David Rientjes
  3 siblings, 1 reply; 23+ messages in thread
From: Roman Gushchin @ 2017-07-26 13:27 UTC (permalink / raw)
  To: linux-mm
  Cc: Roman Gushchin, Michal Hocko, Vladimir Davydov, Johannes Weiner,
	Tetsuo Handa, David Rientjes, Tejun Heo, kernel-team, cgroups,
	linux-doc, linux-kernel

Update cgroups v2 docs.

Signed-off-by: Roman Gushchin <guro@fb.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: David Rientjes <rientjes@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: kernel-team@fb.com
Cc: cgroups@vger.kernel.org
Cc: linux-doc@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
---
 Documentation/cgroup-v2.txt | 62 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 62 insertions(+)

diff --git a/Documentation/cgroup-v2.txt b/Documentation/cgroup-v2.txt
index cb9ea281ab72..bf106b6b6b52 100644
--- a/Documentation/cgroup-v2.txt
+++ b/Documentation/cgroup-v2.txt
@@ -48,6 +48,7 @@ v1 is available under Documentation/cgroup-v1/.
        5-2-1. Memory Interface Files
        5-2-2. Usage Guidelines
        5-2-3. Memory Ownership
+       5-2-4. Cgroup-aware OOM Killer
      5-3. IO
        5-3-1. IO Interface Files
        5-3-2. Writeback
@@ -1001,6 +1002,37 @@ PAGE_SIZE multiple when read back.
 	high limit is used and monitored properly, this limit's
 	utility is limited to providing the final safety net.
 
+  memory.oom_kill_all_tasks
+
+	A read-write single value file which exists on non-root
+	cgroups.  The default is "0".
+
+	Defines whether the OOM killer should treat the cgroup
+	as a single entity during the victim selection.
+
+	If set, the OOM killer will kill all tasks belonging to the
+	cgroup if it is selected as an OOM victim.
+
+	By default, the OOM killer respects a /proc/pid/oom_score_adj
+	value of -1000 and will never kill such a task, unless
+	oom_kill_all_tasks is set.
+
+  memory.oom_priority
+
+	A read-write single value file which exists on non-root
+	cgroups.  The default is "0".
+
+	An integer number within the [-10000, 10000] range,
+	which defines the order in which the OOM killer selects victim
+	memory cgroups.
+
+	The OOM killer prefers memory cgroups with larger priority if
+	they are populated with eligible tasks.
+
+	The oom_priority value is compared within sibling cgroups.
+
+	The root cgroup has the oom_priority 0, which cannot be changed.
+
   memory.events
 	A read-only flat-keyed file which exists on non-root cgroups.
 	The following entries are defined.  Unless specified
@@ -1205,6 +1237,36 @@ POSIX_FADV_DONTNEED to relinquish the ownership of memory areas
 belonging to the affected files to ensure correct memory ownership.
 
 
+Cgroup-aware OOM Killer
+~~~~~~~~~~~~~~~~~~~~~~~
+
+The cgroup v2 memory controller implements a cgroup-aware OOM killer,
+which means it treats memory cgroups as first-class OOM entities.
+
+Under OOM conditions the memory controller tries to make the best
+choice of a victim, hierarchically looking for the largest memory
+consumer. By default, it will look for the biggest task in the
+biggest leaf cgroup.
+
+By default, all cgroups have oom_priority 0, and the OOM killer will
+choose the largest cgroup recursively on each level. For non-root
+cgroups it's possible to change the oom_priority, which causes
+the OOM killer to look at the priority value first and compare
+sizes only of cgroups with equal priority.
+
+A user can change this behavior by enabling the per-cgroup
+oom_kill_all_tasks option. If set, it causes the OOM killer to treat
+the whole cgroup as an indivisible memory consumer. If the cgroup is
+selected as an OOM victim, all of its tasks are killed.
+
+Tasks in the root cgroup are treated as independent memory consumers,
+and are compared with other memory consumers (e.g. leaf cgroups).
+The root cgroup doesn't support the oom_kill_all_tasks feature.
+
+This affects both system- and cgroup-wide OOMs. For a cgroup-wide OOM
+the memory controller considers only cgroups belonging to the sub-tree
+of the OOM'ing cgroup.
+
 IO
 --
 
-- 
2.13.3


* Re: [v4 1/4] mm, oom: refactor the TIF_MEMDIE usage
  2017-07-26 13:27 ` [v4 1/4] mm, oom: refactor the TIF_MEMDIE usage Roman Gushchin
@ 2017-07-26 13:56   ` Michal Hocko
  2017-07-26 14:06     ` Roman Gushchin
  0 siblings, 1 reply; 23+ messages in thread
From: Michal Hocko @ 2017-07-26 13:56 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: linux-mm, Vladimir Davydov, Johannes Weiner, Tetsuo Handa,
	David Rientjes, Tejun Heo, kernel-team, cgroups, linux-doc,
	linux-kernel

On Wed 26-07-17 14:27:15, Roman Gushchin wrote:
[...]
> @@ -656,13 +658,24 @@ static void mark_oom_victim(struct task_struct *tsk)
>  	struct mm_struct *mm = tsk->mm;
>  
>  	WARN_ON(oom_killer_disabled);
> -	/* OOM killer might race with memcg OOM */
> -	if (test_and_set_tsk_thread_flag(tsk, TIF_MEMDIE))
> +
> +	if (!cmpxchg(&tif_memdie_owner, NULL, current)) {
> +		struct task_struct *t;
> +
> +		rcu_read_lock();
> +		for_each_thread(current, t)
> +			set_tsk_thread_flag(t, TIF_MEMDIE);
> +		rcu_read_unlock();
> +	}

I would really much rather see us limit the amount of memory reserves oom
victims can consume rather than build on top of the current hackish
approach of limiting the number of tasks because the fundamental problem
is still there (a heavy multithreaded process can still deplete the
reserves completely).

Is there really any reason to not go with the existing patch I've
pointed to the last time around? You didn't seem to have any objections
back then.
-- 
Michal Hocko
SUSE Labs


* Re: [v4 1/4] mm, oom: refactor the TIF_MEMDIE usage
  2017-07-26 13:56   ` Michal Hocko
@ 2017-07-26 14:06     ` Roman Gushchin
  2017-07-26 14:24       ` Michal Hocko
  0 siblings, 1 reply; 23+ messages in thread
From: Roman Gushchin @ 2017-07-26 14:06 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Vladimir Davydov, Johannes Weiner, Tetsuo Handa,
	David Rientjes, Tejun Heo, kernel-team, cgroups, linux-doc,
	linux-kernel

On Wed, Jul 26, 2017 at 03:56:22PM +0200, Michal Hocko wrote:
> On Wed 26-07-17 14:27:15, Roman Gushchin wrote:
> [...]
> > @@ -656,13 +658,24 @@ static void mark_oom_victim(struct task_struct *tsk)
> >  	struct mm_struct *mm = tsk->mm;
> >  
> >  	WARN_ON(oom_killer_disabled);
> > -	/* OOM killer might race with memcg OOM */
> > -	if (test_and_set_tsk_thread_flag(tsk, TIF_MEMDIE))
> > +
> > +	if (!cmpxchg(&tif_memdie_owner, NULL, current)) {
> > +		struct task_struct *t;
> > +
> > +		rcu_read_lock();
> > +		for_each_thread(current, t)
> > +			set_tsk_thread_flag(t, TIF_MEMDIE);
> > +		rcu_read_unlock();
> > +	}
> 
> I would really much rather see us limit the amount of memory reserves oom
> victims can consume rather than build on top of the current hackish
> approach of limiting the number of tasks because the fundamental problem
> is still there (a heavy multithreaded process can still deplete the
> reserves completely).
> 
> Is there really any reason to not go with the existing patch I've
> pointed to the last time around? You didn't seem to have any objections
> back then.

Hi Michal!

I had this patch in mind and mentioned in the commit log that TIF_MEMDIE
as a memory reserves access indicator will probably be eliminated later.

But that patch is not upstream yet, and it's directly related to this topic.
The proposed refactoring of TIF_MEMDIE usage is not against your approach,
and will not make it harder to go further in that direction.

I'm slightly concerned about the idea of giving TIF_MEMDIE to all tasks
in case we're killing a really large cgroup. But it's only a theoretical
concern, maybe it's fine.

So, I'd keep the existing approach for this patchset, and then we can follow
your approach and we will have a better test case for it.

Thanks!


* Re: [v4 1/4] mm, oom: refactor the TIF_MEMDIE usage
  2017-07-26 14:06     ` Roman Gushchin
@ 2017-07-26 14:24       ` Michal Hocko
  2017-07-26 14:44         ` Michal Hocko
  0 siblings, 1 reply; 23+ messages in thread
From: Michal Hocko @ 2017-07-26 14:24 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: linux-mm, Vladimir Davydov, Johannes Weiner, Tetsuo Handa,
	David Rientjes, Tejun Heo, kernel-team, cgroups, linux-doc,
	linux-kernel

On Wed 26-07-17 15:06:07, Roman Gushchin wrote:
> On Wed, Jul 26, 2017 at 03:56:22PM +0200, Michal Hocko wrote:
> > On Wed 26-07-17 14:27:15, Roman Gushchin wrote:
> > [...]
> > > @@ -656,13 +658,24 @@ static void mark_oom_victim(struct task_struct *tsk)
> > >  	struct mm_struct *mm = tsk->mm;
> > >  
> > >  	WARN_ON(oom_killer_disabled);
> > > -	/* OOM killer might race with memcg OOM */
> > > -	if (test_and_set_tsk_thread_flag(tsk, TIF_MEMDIE))
> > > +
> > > +	if (!cmpxchg(&tif_memdie_owner, NULL, current)) {
> > > +		struct task_struct *t;
> > > +
> > > +		rcu_read_lock();
> > > +		for_each_thread(current, t)
> > > +			set_tsk_thread_flag(t, TIF_MEMDIE);
> > > +		rcu_read_unlock();
> > > +	}
> > 
> > I would really much rather see us limit the amount of memory reserves oom
> > victims can consume rather than build on top of the current hackish
> > approach of limiting the number of tasks because the fundamental problem
> > is still there (a heavy multithreaded process can still deplete the
> > reserves completely).
> > 
> > Is there really any reason to not go with the existing patch I've
> > pointed to the last time around? You didn't seem to have any objections
> > back then.
> 
> Hi Michal!
> 
> I had this patch in mind and mentioned in the commit log that TIF_MEMDIE
> as a memory reserves access indicator will probably be eliminated later.
> 
> But that patch is not upstream yet, and it's directly related to this topic.
> The proposed refactoring of TIF_MEMDIE usage is not against your approach,
> and will not make it harder to go further in that direction.

So what is the reason to invent another tricky way to limit access to
memory reserves? The patch is not upstream but nothing prevents you from
picking it up and posting it along with your series. Or if you prefer I can
post it separately?

> I'm slightly concerned about the idea of giving TIF_MEMDIE to all tasks
> in case we're killing a really large cgroup.

Why? There shouldn't be any issue if the access is limited.
-- 
Michal Hocko
SUSE Labs


* Re: [v4 1/4] mm, oom: refactor the TIF_MEMDIE usage
  2017-07-26 14:24       ` Michal Hocko
@ 2017-07-26 14:44         ` Michal Hocko
  2017-07-26 14:50           ` Roman Gushchin
  0 siblings, 1 reply; 23+ messages in thread
From: Michal Hocko @ 2017-07-26 14:44 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: linux-mm, Vladimir Davydov, Johannes Weiner, Tetsuo Handa,
	David Rientjes, Tejun Heo, kernel-team, cgroups, linux-doc,
	linux-kernel

On Wed 26-07-17 16:24:34, Michal Hocko wrote:
[...]
> Or if you prefer I can post it separately?

I've just tried to rebase relevant parts on top of the current mmotm
tree and it needs some non-trivial updates. Would you mind if I post
those patches with you on CC? I really think that we shouldn't invent a
new throttling just to replace it later.

I will comment on the rest of the series later. I have glanced through
it and have to digest it some more before replying.
-- 
Michal Hocko
SUSE Labs


* Re: [v4 1/4] mm, oom: refactor the TIF_MEMDIE usage
  2017-07-26 14:44         ` Michal Hocko
@ 2017-07-26 14:50           ` Roman Gushchin
  0 siblings, 0 replies; 23+ messages in thread
From: Roman Gushchin @ 2017-07-26 14:50 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Vladimir Davydov, Johannes Weiner, Tetsuo Handa,
	David Rientjes, Tejun Heo, kernel-team, cgroups, linux-doc,
	linux-kernel

On Wed, Jul 26, 2017 at 04:44:08PM +0200, Michal Hocko wrote:
> On Wed 26-07-17 16:24:34, Michal Hocko wrote:
> [...]
> > Or if you prefer I can post it separately?
> 
> I've just tried to rebase relevant parts on top of the current mmotm
> tree and it needs some non-trivial updates. Would you mind if I post
> those patches with you on CC?

Sure.
Again, I'm not against your approach (and I've tried to rebase your patches,
and it worked well for me, although I didn't run any proper tests),
I just don't want to create an unnecessary dependency here.

If your patchset is accepted, it will be quite trivial to rebase
mine and vice versa; they are not that dependent.

> I will comment on the rest of the series later. I have glanced through
> it and have to digest it some more before replying.

Thanks!


* Re: [v4 2/4] mm, oom: cgroup-aware OOM killer
  2017-07-26 13:27 ` [v4 2/4] mm, oom: cgroup-aware OOM killer Roman Gushchin
@ 2017-07-27 21:41   ` kbuild test robot
  2017-08-01 14:54   ` Michal Hocko
  1 sibling, 0 replies; 23+ messages in thread
From: kbuild test robot @ 2017-07-27 21:41 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: kbuild-all, linux-mm, Roman Gushchin, Michal Hocko,
	Vladimir Davydov, Johannes Weiner, Tetsuo Handa, David Rientjes,
	Tejun Heo, kernel-team, cgroups, linux-doc, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 2101 bytes --]

Hi Roman,

[auto build test ERROR on linus/master]
[also build test ERROR on v4.13-rc2 next-20170727]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Roman-Gushchin/cgroup-aware-OOM-killer/20170728-051627
config: i386-tinyconfig (attached as .config)
compiler: gcc-6 (Debian 6.2.0-3) 6.2.0 20160901
reproduce:
        # save the attached .config to linux build tree
        make ARCH=i386 

All errors (new ones prefixed by >>):

   mm/oom_kill.c: In function 'oom_kill_memcg_victim':
>> mm/oom_kill.c:1005:24: error: dereferencing pointer to incomplete type 'struct mem_cgroup'
       if (oc->chosen_memcg->oom_kill_all_tasks)
                           ^~

vim +1005 mm/oom_kill.c

   982	
   983	static bool oom_kill_memcg_victim(struct oom_control *oc)
   984	{
   985		if (oc->chosen) {
   986			if (oc->chosen != (void *)-1UL) {
   987				__oom_kill_process(oc->chosen);
   988				put_task_struct(oc->chosen);
   989				schedule_timeout_killable(1);
   990			}
   991			return true;
   992	
   993		} else if (oc->chosen_memcg) {
   994			if (oc->chosen_memcg == (void *)-1UL)
   995				return true;
   996	
   997			/* Always begin with the biggest task */
   998			oc->chosen_points = 0;
   999			oc->chosen = NULL;
  1000			mem_cgroup_scan_tasks(oc->chosen_memcg, oom_evaluate_task, oc);
  1001			if (oc->chosen && oc->chosen != (void *)-1UL) {
  1002				__oom_kill_process(oc->chosen);
  1003				put_task_struct(oc->chosen);
  1004	
> 1005				if (oc->chosen_memcg->oom_kill_all_tasks)
  1006					mem_cgroup_scan_tasks(oc->chosen_memcg,
  1007							      oom_kill_memcg_member,
  1008							      NULL);
  1009			}
  1010	
  1011			mem_cgroup_put(oc->chosen_memcg);
  1012			oc->chosen_memcg = NULL;
  1013			return true;
  1014	
  1015		} else {
  1016			oc->chosen_points = 0;
  1017			return false;
  1018		}
  1019	}
  1020	

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 6670 bytes --]


* Re: [v4 2/4] mm, oom: cgroup-aware OOM killer
  2017-07-26 13:27 ` [v4 2/4] mm, oom: cgroup-aware OOM killer Roman Gushchin
  2017-07-27 21:41   ` kbuild test robot
@ 2017-08-01 14:54   ` Michal Hocko
  2017-08-01 15:25     ` Roman Gushchin
  1 sibling, 1 reply; 23+ messages in thread
From: Michal Hocko @ 2017-08-01 14:54 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: linux-mm, Vladimir Davydov, Johannes Weiner, Tetsuo Handa,
	David Rientjes, Tejun Heo, kernel-team, cgroups, linux-doc,
	linux-kernel

On Wed 26-07-17 14:27:16, Roman Gushchin wrote:
[...]
> +static long memcg_oom_badness(struct mem_cgroup *memcg,
> +			      const nodemask_t *nodemask)
> +{
> +	long points = 0;
> +	int nid;
> +
> +	for_each_node_state(nid, N_MEMORY) {
> +		if (nodemask && !node_isset(nid, *nodemask))
> +			continue;
> +
> +		points += mem_cgroup_node_nr_lru_pages(memcg, nid,
> +				LRU_ALL_ANON | BIT(LRU_UNEVICTABLE));
> +	}
> +
> +	points += memcg_page_state(memcg, MEMCG_KERNEL_STACK_KB) /
> +		(PAGE_SIZE / 1024);
> +	points += memcg_page_state(memcg, NR_SLAB_UNRECLAIMABLE);
> +	points += memcg_page_state(memcg, MEMCG_SOCK);
> +	points += memcg_page_state(memcg, MEMCG_SWAP);
> +
> +	return points;

I am wondering why are you diverging from the global oom_badness
behavior here. Although doing per NUMA accounting sounds like a better
idea but then you just end up mixing this with non NUMA numbers and the
whole thing is harder to understand without great advantages.

> +static void select_victim_memcg(struct mem_cgroup *root, struct oom_control *oc)
> +{
> +	struct mem_cgroup *iter, *parent;
> +
> +	for_each_mem_cgroup_tree(iter, root) {
> +		if (memcg_has_children(iter)) {
> +			iter->oom_score = 0;
> +			continue;
> +		}
> +
> +		iter->oom_score = oom_evaluate_memcg(iter, oc->nodemask);
> +		if (iter->oom_score == -1) {
> +			oc->chosen_memcg = (void *)-1UL;
> +			mem_cgroup_iter_break(root, iter);
> +			return;
> +		}
> +
> +		if (!iter->oom_score)
> +			continue;
> +
> +		for (parent = parent_mem_cgroup(iter); parent && parent != root;
> +		     parent = parent_mem_cgroup(parent))
> +			parent->oom_score += iter->oom_score;
> +	}
> +
> +	for (;;) {
> +		struct cgroup_subsys_state *css;
> +		struct mem_cgroup *memcg = NULL;
> +		long score = LONG_MIN;
> +
> +		css_for_each_child(css, &root->css) {
> +			struct mem_cgroup *iter = mem_cgroup_from_css(css);
> +
> +			if (iter->oom_score > score) {
> +				memcg = iter;
> +				score = iter->oom_score;
> +			}
> +		}
> +
> +		if (!memcg) {
> +			if (oc->memcg && root == oc->memcg) {
> +				oc->chosen_memcg = oc->memcg;
> +				css_get(&oc->chosen_memcg->css);
> +				oc->chosen_points = oc->memcg->oom_score;
> +			}
> +			break;
> +		}
> +
> +		if (memcg->oom_kill_all_tasks || !memcg_has_children(memcg)) {
> +			oc->chosen_memcg = memcg;
> +			css_get(&oc->chosen_memcg->css);
> +			oc->chosen_points = score;
> +			break;
> +		}
> +
> +		root = memcg;
> +	}
> +}

This and the rest of the victim selection code is really hairy and hard
to follow.

I would reap out the oom_kill_process into a separate patch.

> -static void oom_kill_process(struct oom_control *oc, const char *message)
> +static void __oom_kill_process(struct task_struct *victim)

To the rest of the patch. I have to say I do not quite like how it is
implemented. I was hoping for something much simpler which would hook
into oom_evaluate_task. If a task belongs to a memcg with kill-all flag
then we would update the cumulative memcg badness (more specifically the
badness of the topmost parent with kill-all flag). Memcg will then
compete with existing self contained tasks (oom_badness will have to
tell whether points belong to a task or a memcg to allow the caller to
deal with it). But it shouldn't be much more complex than that.

Or is there something that I am missing and that would prevent such a
simple approach?
-- 
Michal Hocko
SUSE Labs


* Re: [v4 2/4] mm, oom: cgroup-aware OOM killer
  2017-08-01 14:54   ` Michal Hocko
@ 2017-08-01 15:25     ` Roman Gushchin
  2017-08-01 17:03       ` Michal Hocko
  2017-08-08 23:06       ` David Rientjes
  0 siblings, 2 replies; 23+ messages in thread
From: Roman Gushchin @ 2017-08-01 15:25 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Vladimir Davydov, Johannes Weiner, Tetsuo Handa,
	David Rientjes, Tejun Heo, kernel-team, cgroups, linux-doc,
	linux-kernel

On Tue, Aug 01, 2017 at 04:54:35PM +0200, Michal Hocko wrote:
> On Wed 26-07-17 14:27:16, Roman Gushchin wrote:
> [...]
> > +static long memcg_oom_badness(struct mem_cgroup *memcg,
> > +			      const nodemask_t *nodemask)
> > +{
> > +	long points = 0;
> > +	int nid;
> > +
> > +	for_each_node_state(nid, N_MEMORY) {
> > +		if (nodemask && !node_isset(nid, *nodemask))
> > +			continue;
> > +
> > +		points += mem_cgroup_node_nr_lru_pages(memcg, nid,
> > +				LRU_ALL_ANON | BIT(LRU_UNEVICTABLE));
> > +	}
> > +
> > +	points += memcg_page_state(memcg, MEMCG_KERNEL_STACK_KB) /
> > +		(PAGE_SIZE / 1024);
> > +	points += memcg_page_state(memcg, NR_SLAB_UNRECLAIMABLE);
> > +	points += memcg_page_state(memcg, MEMCG_SOCK);
> > +	points += memcg_page_state(memcg, MEMCG_SWAP);
> > +
> > +	return points;
> 
> I am wondering why are you diverging from the global oom_badness
> behavior here. Although doing per NUMA accounting sounds like a better
> idea but then you just end up mixing this with non NUMA numbers and the
> whole thing is harder to understand without great advantages.

Ok, makes sense. I can revert to the existing OOM behaviour here.

> > +static void select_victim_memcg(struct mem_cgroup *root, struct oom_control *oc)
> > +{
> > +	struct mem_cgroup *iter, *parent;
> > +
> > +	for_each_mem_cgroup_tree(iter, root) {
> > +		if (memcg_has_children(iter)) {
> > +			iter->oom_score = 0;
> > +			continue;
> > +		}
> > +
> > +		iter->oom_score = oom_evaluate_memcg(iter, oc->nodemask);
> > +		if (iter->oom_score == -1) {
> > +			oc->chosen_memcg = (void *)-1UL;
> > +			mem_cgroup_iter_break(root, iter);
> > +			return;
> > +		}
> > +
> > +		if (!iter->oom_score)
> > +			continue;
> > +
> > +		for (parent = parent_mem_cgroup(iter); parent && parent != root;
> > +		     parent = parent_mem_cgroup(parent))
> > +			parent->oom_score += iter->oom_score;
> > +	}
> > +
> > +	for (;;) {
> > +		struct cgroup_subsys_state *css;
> > +		struct mem_cgroup *memcg = NULL;
> > +		long score = LONG_MIN;
> > +
> > +		css_for_each_child(css, &root->css) {
> > +			struct mem_cgroup *iter = mem_cgroup_from_css(css);
> > +
> > +			if (iter->oom_score > score) {
> > +				memcg = iter;
> > +				score = iter->oom_score;
> > +			}
> > +		}
> > +
> > +		if (!memcg) {
> > +			if (oc->memcg && root == oc->memcg) {
> > +				oc->chosen_memcg = oc->memcg;
> > +				css_get(&oc->chosen_memcg->css);
> > +				oc->chosen_points = oc->memcg->oom_score;
> > +			}
> > +			break;
> > +		}
> > +
> > +		if (memcg->oom_kill_all_tasks || !memcg_has_children(memcg)) {
> > +			oc->chosen_memcg = memcg;
> > +			css_get(&oc->chosen_memcg->css);
> > +			oc->chosen_points = score;
> > +			break;
> > +		}
> > +
> > +		root = memcg;
> > +	}
> > +}
> 
> This and the rest of the victim selection code is really hairy and hard
> to follow.

Will adding more comments help here?

> 
> I would reap out the oom_kill_process into a separate patch.

It was a separate patch, I've merged it based on Vladimir's feedback.
No problems, I can divide it back.

> > -static void oom_kill_process(struct oom_control *oc, const char *message)
> > +static void __oom_kill_process(struct task_struct *victim)
> 
> To the rest of the patch. I have to say I do not quite like how it is
> implemented. I was hoping for something much simpler which would hook
> into oom_evaluate_task. If a task belongs to a memcg with kill-all flag
> then we would update the cumulative memcg badness (more specifically the
> badness of the topmost parent with kill-all flag). Memcg will then
> compete with existing self contained tasks (oom_badness will have to
> tell whether points belong to a task or a memcg to allow the caller to
> deal with it). But it shouldn't be much more complex than that.

I'm not sure, it will be any simpler. Basically I'm doing the same:
the difference is that you want to iterate over tasks and for each
task traverse the memcg tree, update per-cgroup oom score and find
the corresponding memcg(s) with the kill-all flag. I'm doing the opposite:
traverse the cgroup tree, and for each leaf cgroup iterate over processes.

Also, please note, that even without the kill-all flag the decision is made
on per-cgroup level (except tasks in the root cgroup).

Thank you!

Roman


* Re: [v4 2/4] mm, oom: cgroup-aware OOM killer
  2017-08-01 15:25     ` Roman Gushchin
@ 2017-08-01 17:03       ` Michal Hocko
  2017-08-01 18:13         ` Roman Gushchin
  2017-08-08 23:06       ` David Rientjes
  1 sibling, 1 reply; 23+ messages in thread
From: Michal Hocko @ 2017-08-01 17:03 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: linux-mm, Vladimir Davydov, Johannes Weiner, Tetsuo Handa,
	David Rientjes, Tejun Heo, kernel-team, cgroups, linux-doc,
	linux-kernel

On Tue 01-08-17 16:25:48, Roman Gushchin wrote:
> On Tue, Aug 01, 2017 at 04:54:35PM +0200, Michal Hocko wrote:
[...]
> > I would reap out the oom_kill_process into a separate patch.
> 
> It was a separate patch, I've merged it based on Vladimir's feedback.
> No problems, I can divide it back.

It would make the review slightly easier.
> 
> > > -static void oom_kill_process(struct oom_control *oc, const char *message)
> > > +static void __oom_kill_process(struct task_struct *victim)
> > 
> > To the rest of the patch. I have to say I do not quite like how it is
> > implemented. I was hoping for something much simpler which would hook
> > into oom_evaluate_task. If a task belongs to a memcg with kill-all flag
> > then we would update the cumulative memcg badness (more specifically the
> > badness of the topmost parent with kill-all flag). Memcg will then
> > compete with existing self contained tasks (oom_badness will have to
> > tell whether points belong to a task or a memcg to allow the caller to
> > deal with it). But it shouldn't be much more complex than that.
> 
> I'm not sure, it will be any simpler. Basically I'm doing the same:
> the difference is that you want to iterate over tasks and for each
> task traverse the memcg tree, update per-cgroup oom score and find
> the corresponding memcg(s) with the kill-all flag. I'm doing the opposite:
> traverse the cgroup tree, and for each leaf cgroup iterate over processes.

Yeah but this doesn't fit very well to the existing scheme so we would
need two different schemes which is not ideal from maint. point of view.
We also do not have to duplicate all the tricky checks we already do in
oom_evaluate_task. So I would prefer if we could try to hook there and
do the special handling there.

> Also, please note, that even without the kill-all flag the decision is made
> on per-cgroup level (except tasks in the root cgroup).

Yeah and I am not sure this is a reasonable behavior. Why should we
consider memcgs which are not kill-all as a single entity?

-- 
Michal Hocko
SUSE Labs


* Re: [v4 2/4] mm, oom: cgroup-aware OOM killer
  2017-08-01 17:03       ` Michal Hocko
@ 2017-08-01 18:13         ` Roman Gushchin
  2017-08-02  7:29           ` Michal Hocko
  0 siblings, 1 reply; 23+ messages in thread
From: Roman Gushchin @ 2017-08-01 18:13 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Vladimir Davydov, Johannes Weiner, Tetsuo Handa,
	David Rientjes, Tejun Heo, kernel-team, cgroups, linux-doc,
	linux-kernel

On Tue, Aug 01, 2017 at 07:03:03PM +0200, Michal Hocko wrote:
> On Tue 01-08-17 16:25:48, Roman Gushchin wrote:
> > On Tue, Aug 01, 2017 at 04:54:35PM +0200, Michal Hocko wrote:
> [...]
> > > I would reap out the oom_kill_process into a separate patch.
> > 
> > It was a separate patch, I've merged it based on Vladimir's feedback.
> > No problems, I can divide it back.
> 
> It would make the review slightly easier.
> > 
> > > > -static void oom_kill_process(struct oom_control *oc, const char *message)
> > > > +static void __oom_kill_process(struct task_struct *victim)
> > > 
> > > To the rest of the patch. I have to say I do not quite like how it is
> > > implemented. I was hoping for something much simpler which would hook
> > > into oom_evaluate_task. If a task belongs to a memcg with kill-all flag
> > > then we would update the cumulative memcg badness (more specifically the
> > > badness of the topmost parent with kill-all flag). Memcg will then
> > > compete with existing self contained tasks (oom_badness will have to
> > > tell whether points belong to a task or a memcg to allow the caller to
> > > deal with it). But it shouldn't be much more complex than that.
> > 
> > I'm not sure, it will be any simpler. Basically I'm doing the same:
> > the difference is that you want to iterate over tasks and for each
> > task traverse the memcg tree, update per-cgroup oom score and find
> > the corresponding memcg(s) with the kill-all flag. I'm doing the opposite:
> > traverse the cgroup tree, and for each leaf cgroup iterate over processes.
> 
> Yeah but this doesn't fit very well to the existing scheme so we would
> need two different schemes which is not ideal from maint. point of view.
> We also do not have to duplicate all the tricky checks we already do in
> oom_evaluate_task. So I would prefer if we could try to hook there and
> do the special handling there.

I hope that iterating over all tasks just to check whether there are
in-flight OOM victims can be optimized at some point.
That would let us choose a victim much more cheaply.
It's not easy, but it feels like the right direction to go.

Also, adding new tricks to the oom_evaluate_task() will make the code
even more hairy. Some of the existing tricks are useless for memcg selection.

> 
> > Also, please note, that even without the kill-all flag the decision is made
> > on per-cgroup level (except tasks in the root cgroup).
> 
> Yeah and I am not sure this is a reasonable behavior. Why should we
> consider memcgs which are not kill-all as a single entity?

I think, it's reasonable to choose a cgroup/container to blow off based on
the cgroup oom_priority/size (including hierarchical settings), and then
kill one biggest or all tasks depending on cgroup settings.

Thanks!


* Re: [v4 2/4] mm, oom: cgroup-aware OOM killer
  2017-08-01 18:13         ` Roman Gushchin
@ 2017-08-02  7:29           ` Michal Hocko
  2017-08-03 12:47             ` Roman Gushchin
  0 siblings, 1 reply; 23+ messages in thread
From: Michal Hocko @ 2017-08-02  7:29 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: linux-mm, Vladimir Davydov, Johannes Weiner, Tetsuo Handa,
	David Rientjes, Tejun Heo, kernel-team, cgroups, linux-doc,
	linux-kernel

On Tue 01-08-17 19:13:52, Roman Gushchin wrote:
> On Tue, Aug 01, 2017 at 07:03:03PM +0200, Michal Hocko wrote:
> > On Tue 01-08-17 16:25:48, Roman Gushchin wrote:
> > > On Tue, Aug 01, 2017 at 04:54:35PM +0200, Michal Hocko wrote:
> > [...]
> > > > I would reap out the oom_kill_process into a separate patch.
> > > 
> > > It was a separate patch, I've merged it based on Vladimir's feedback.
> > > No problems, I can divide it back.
> > 
> > It would make the review slightly easier.
> > > 
> > > > > -static void oom_kill_process(struct oom_control *oc, const char *message)
> > > > > +static void __oom_kill_process(struct task_struct *victim)
> > > > 
> > > > To the rest of the patch. I have to say I do not quite like how it is
> > > > implemented. I was hoping for something much simpler which would hook
> > > > into oom_evaluate_task. If a task belongs to a memcg with kill-all flag
> > > > then we would update the cumulative memcg badness (more specifically the
> > > > badness of the topmost parent with kill-all flag). Memcg will then
> > > > compete with existing self contained tasks (oom_badness will have to
> > > > tell whether points belong to a task or a memcg to allow the caller to
> > > > deal with it). But it shouldn't be much more complex than that.
> > > 
> > > I'm not sure, it will be any simpler. Basically I'm doing the same:
> > > the difference is that you want to iterate over tasks and for each
> > > task traverse the memcg tree, update per-cgroup oom score and find
> > > the corresponding memcg(s) with the kill-all flag. I'm doing the opposite:
> > > traverse the cgroup tree, and for each leaf cgroup iterate over processes.
> > 
> > Yeah but this doesn't fit very well to the existing scheme so we would
> > need two different schemes which is not ideal from maint. point of view.
> > We also do not have to duplicate all the tricky checks we already do in
> > oom_evaluate_task. So I would prefer if we could try to hook there and
> > do the special handling there.
> 
> I hope that iterating over all tasks just to check whether there are
> in-flight OOM victims can be optimized at some point.
> That would let us choose a victim much more cheaply.
> It's not easy, but it feels like the right direction to go.

You would have to count per each oom domain and that sounds quite
unfeasible to me.
 
> Also, adding new tricks to the oom_evaluate_task() will make the code
> even more hairy. Some of the existing tricks are useless for memcg selection.

Not sure what you mean but oom_evaluate_task has been usable for both
global and memcg oom paths so far. I do not see any reason why this
shouldn't hold for a different oom killing strategy.

> > > Also, please note, that even without the kill-all flag the decision is made
> > > on per-cgroup level (except tasks in the root cgroup).
> > 
> > Yeah and I am not sure this is a reasonable behavior. Why should we
> > consider memcgs which are not kill-all as a single entity?
> 
> I think, it's reasonable to choose a cgroup/container to blow off based on
> the cgroup oom_priority/size (including hierarchical settings), and then
> kill one biggest or all tasks depending on cgroup settings.

But that doesn't mean you have to treat even !kill-all memcgs like a
single entity. In fact we should compare killable entities which is
either a task or the whole memcg if configured that way.
-- 
Michal Hocko
SUSE Labs


* Re: [v4 2/4] mm, oom: cgroup-aware OOM killer
  2017-08-02  7:29           ` Michal Hocko
@ 2017-08-03 12:47             ` Roman Gushchin
  2017-08-03 13:01               ` Michal Hocko
  0 siblings, 1 reply; 23+ messages in thread
From: Roman Gushchin @ 2017-08-03 12:47 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Vladimir Davydov, Johannes Weiner, Tetsuo Handa,
	David Rientjes, Tejun Heo, kernel-team, cgroups, linux-doc,
	linux-kernel

On Wed, Aug 02, 2017 at 09:29:01AM +0200, Michal Hocko wrote:
> On Tue 01-08-17 19:13:52, Roman Gushchin wrote:
> > On Tue, Aug 01, 2017 at 07:03:03PM +0200, Michal Hocko wrote:
> > > On Tue 01-08-17 16:25:48, Roman Gushchin wrote:
> > > > On Tue, Aug 01, 2017 at 04:54:35PM +0200, Michal Hocko wrote:
> > > [...]
> > > > > I would reap out the oom_kill_process into a separate patch.
> > > > 
> > > > It was a separate patch, I've merged it based on Vladimir's feedback.
> > > > No problems, I can divide it back.
> > > 
> > > It would make the review slightly easier.
> > > > 
> > > > > > -static void oom_kill_process(struct oom_control *oc, const char *message)
> > > > > > +static void __oom_kill_process(struct task_struct *victim)
> > > > > 
> > > > > To the rest of the patch. I have to say I do not quite like how it is
> > > > > implemented. I was hoping for something much simpler which would hook
> > > > > into oom_evaluate_task. If a task belongs to a memcg with kill-all flag
> > > > > then we would update the cumulative memcg badness (more specifically the
> > > > > badness of the topmost parent with kill-all flag). Memcg will then
> > > > > compete with existing self contained tasks (oom_badness will have to
> > > > > tell whether points belong to a task or a memcg to allow the caller to
> > > > > deal with it). But it shouldn't be much more complex than that.
> > > > 
> > > > I'm not sure, it will be any simpler. Basically I'm doing the same:
> > > > the difference is that you want to iterate over tasks and for each
> > > > task traverse the memcg tree, update per-cgroup oom score and find
> > > > the corresponding memcg(s) with the kill-all flag. I'm doing the opposite:
> > > > traverse the cgroup tree, and for each leaf cgroup iterate over processes.
> > > 
> > > Yeah but this doesn't fit very well to the existing scheme so we would
> > > need two different schemes which is not ideal from maint. point of view.
> > > We also do not have to duplicate all the tricky checks we already do in
> > > oom_evaluate_task. So I would prefer if we could try to hook there and
> > > do the special handling there.
> > 
> > I hope that iterating over all tasks just to check whether there are
> > in-flight OOM victims can be optimized at some point.
> > That would let us choose a victim much more cheaply.
> > It's not easy, but it feels like the right direction to go.
> 
> You would have to count per each oom domain and that sounds quite
> unfeasible to me.

It's hard, but traversing the whole cgroup tree from bottom to top
for each task is just not scalable.

This is exactly why I've chosen a compromise right now: let's
iterate over all tasks, but do it by iterating over the cgroup tree.

>  
> > Also, adding new tricks to the oom_evaluate_task() will make the code
> > even more hairy. Some of the existing tricks are useless for memcg selection.
> 
> Not sure what you mean but oom_evaluate_task has been usable for both
> global and memcg oom paths so far. I do not see any reason why this
> shouldn't hold for a different oom killing strategy.

Yes, but in both cases we've evaluated tasks, not cgroups.

> 
> > > > Also, please note, that even without the kill-all flag the decision is made
> > > > on per-cgroup level (except tasks in the root cgroup).
> > > 
> > > Yeah and I am not sure this is a reasonable behavior. Why should we
> > > consider memcgs which are not kill-all as a single entity?
> > 
> > I think, it's reasonable to choose a cgroup/container to blow off based on
> > the cgroup oom_priority/size (including hierarchical settings), and then
> > kill one biggest or all tasks depending on cgroup settings.
> 
> But that doesn't mean you have to treat even !kill-all memcgs like a
> single entity. In fact we should compare killable entities which is
> either a task or the whole memcg if configured that way.

I believe it's an absolutely valid user intention to prioritize some
cgroups over others, even if only one task should be killed in case of OOM.

Thanks!

Roman

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [v4 2/4] mm, oom: cgroup-aware OOM killer
  2017-08-03 12:47             ` Roman Gushchin
@ 2017-08-03 13:01               ` Michal Hocko
  0 siblings, 0 replies; 23+ messages in thread
From: Michal Hocko @ 2017-08-03 13:01 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: linux-mm, Vladimir Davydov, Johannes Weiner, Tetsuo Handa,
	David Rientjes, Tejun Heo, kernel-team, cgroups, linux-doc,
	linux-kernel

On Thu 03-08-17 13:47:51, Roman Gushchin wrote:
> On Wed, Aug 02, 2017 at 09:29:01AM +0200, Michal Hocko wrote:
> > On Tue 01-08-17 19:13:52, Roman Gushchin wrote:
> > > On Tue, Aug 01, 2017 at 07:03:03PM +0200, Michal Hocko wrote:
> > > > On Tue 01-08-17 16:25:48, Roman Gushchin wrote:
> > > > > On Tue, Aug 01, 2017 at 04:54:35PM +0200, Michal Hocko wrote:
> > > > [...]
> > > > > > I would reap out the oom_kill_process into a separate patch.
> > > > > 
> > > > > It was a separate patch, I've merged it based on Vladimir's feedback.
> > > > > No problems, I can divide it back.
> > > > 
> > > > It would make the review slightly more easier
> > > > > 
> > > > > > > -static void oom_kill_process(struct oom_control *oc, const char *message)
> > > > > > > +static void __oom_kill_process(struct task_struct *victim)
> > > > > > 
> > > > > > To the rest of the patch. I have to say I do not quite like how it is
> > > > > > implemented. I was hoping for something much simpler which would hook
> > > > > > into oom_evaluate_task. If a task belongs to a memcg with kill-all flag
> > > > > > then we would update the cumulative memcg badness (more specifically the
> > > > > > badness of the topmost parent with kill-all flag). Memcg will then
> > > > > > compete with existing self contained tasks (oom_badness will have to
> > > > > > tell whether points belong to a task or a memcg to allow the caller to
> > > > > > deal with it). But it shouldn't be much more complex than that.
> > > > > 
> > > > > I'm not sure, it will be any simpler. Basically I'm doing the same:
> > > > > the difference is that you want to iterate over tasks and for each
> > > > > task traverse the memcg tree, update per-cgroup oom score and find
> > > > > the corresponding memcg(s) with the kill-all flag. I'm doing the opposite:
> > > > > traverse the cgroup tree, and for each leaf cgroup iterate over processes.
> > > > 
> > > > Yeah but this doesn't fit very well to the existing scheme so we would
> > > > need two different schemes which is not ideal from maint. point of view.
> > > > We also do not have to duplicate all the tricky checks we already do in
> > > > oom_evaluate_task. So I would prefer if we could try to hook there and
> > > > do the special handling there.
> > > 
> > > I hope, that iterating over all tasks just to check if there are
> > > in-flight OOM victims might be optimized at some point.
> > > That means, we would be able to choose a victim much cheaper.
> > > It's not easy, but it feels like the right direction to go.
> > 
> > You would have to count per each oom domain and that sounds quite
> > unfeasible to me.
> 
> It's hard, but traversing the whole cgroup tree from bottom to top
> for each task is just not scalable.

We are talking about the oom path, which is a slow path. Besides that,
memcg hierarchies will usually not be very deep (we are not talking
about hundreds).

> This is exactly why I've chosen a compromise right now: let's
> iterate over all tasks, but do it by iterating over the cgroup tree.
> 
> >  
> > > Also, adding new tricks to the oom_evaluate_task() will make the code
> > > even more hairy. Some of the existing tricks are useless for memcg selection.
> > 
> > Not sure what you mean but oom_evaluate_task has been usable for both
> > global and memcg oom paths so far. I do not see any reason why this
> > shouldn't hold for a different oom killing strategy.
> 
> Yes, but in both cases we've evaluated tasks, not cgroups.
> 
> > 
> > > > > Also, please note, that even without the kill-all flag the decision is made
> > > > > on per-cgroup level (except tasks in the root cgroup).
> > > > 
> > > > Yeah and I am not sure this is a reasonable behavior. Why should we
> > > > consider memcgs which are not kill-all as a single entity?
> > > 
> > > I think, it's reasonable to choose a cgroup/container to blow off based on
> > > the cgroup oom_priority/size (including hierarchical settings), and then
> > > kill one biggest or all tasks depending on cgroup settings.
> > 
> > But that doesn't mean you have to treat even !kill-all memcgs like a
> > single entity. In fact we should compare killable entities which is
> > either a task or the whole memcg if configured that way.
> 
> I believe it's absolutely valid user's intention to prioritize some
> cgroups over other, even if only one task should be killed in case of OOM.

This is mixing different concepts, which I really dislike. The semantics
are getting really fuzzy. How are you going to apply memcg priority when
you are killing a single task? What portion of the memcg's priority does
the task get? Then there is the fact that the root is special...

No, I really dislike this. We should start simple and compare killable
entities. If you want to apply memcg priority, then apply it only to
whole memcgs.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [v4 2/4] mm, oom: cgroup-aware OOM killer
  2017-08-01 15:25     ` Roman Gushchin
  2017-08-01 17:03       ` Michal Hocko
@ 2017-08-08 23:06       ` David Rientjes
  2017-08-14 12:03         ` Roman Gushchin
  1 sibling, 1 reply; 23+ messages in thread
From: David Rientjes @ 2017-08-08 23:06 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Michal Hocko, linux-mm, Vladimir Davydov, Johannes Weiner,
	Tetsuo Handa, Tejun Heo, kernel-team, cgroups, linux-doc,
	linux-kernel

On Tue, 1 Aug 2017, Roman Gushchin wrote:

> > To the rest of the patch. I have to say I do not quite like how it is
> > implemented. I was hoping for something much simpler which would hook
> > into oom_evaluate_task. If a task belongs to a memcg with kill-all flag
> > then we would update the cumulative memcg badness (more specifically the
> > badness of the topmost parent with kill-all flag). Memcg will then
> > compete with existing self contained tasks (oom_badness will have to
> > tell whether points belong to a task or a memcg to allow the caller to
> > deal with it). But it shouldn't be much more complex than that.
> 
> I'm not sure, it will be any simpler. Basically I'm doing the same:
> the difference is that you want to iterate over tasks and for each
> task traverse the memcg tree, update per-cgroup oom score and find
> the corresponding memcg(s) with the kill-all flag. I'm doing the opposite:
> traverse the cgroup tree, and for each leaf cgroup iterate over processes.
> 
> Also, please note, that even without the kill-all flag the decision is made
> on per-cgroup level (except tasks in the root cgroup).
> 

I think your implementation is preferred and is actually quite simple to 
follow, and I would encourage you to follow through with it.  It has a 
similar implementation to what we have done for years to kill a process 
from a leaf memcg.

I did notice that oom_kill_memcg_victim() calls directly into 
__oom_kill_process(), however, so we lack the traditional oom killer 
output that shows memcg usage and potential tasklist.  I think we should 
still be dumping this information to the kernel log so that we can see a 
breakdown of charged memory.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [v4 3/4] mm, oom: introduce oom_priority for memory cgroups
  2017-07-26 13:27 ` [v4 3/4] mm, oom: introduce oom_priority for memory cgroups Roman Gushchin
@ 2017-08-08 23:14   ` David Rientjes
  2017-08-14 12:39     ` Roman Gushchin
  0 siblings, 1 reply; 23+ messages in thread
From: David Rientjes @ 2017-08-08 23:14 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: linux-mm, Michal Hocko, Vladimir Davydov, Johannes Weiner,
	Tejun Heo, Tetsuo Handa, kernel-team, cgroups, linux-doc,
	linux-kernel

On Wed, 26 Jul 2017, Roman Gushchin wrote:

> Introduce a per-memory-cgroup oom_priority setting: an integer number
> within the [-10000, 10000] range, which defines the order in which
> the OOM killer selects victim memory cgroups.
> 
> OOM killer prefers memory cgroups with larger priority if they are
> populated with eligible tasks.
> 
> The oom_priority value is compared within sibling cgroups.
> 
> The root cgroup has the oom_priority 0, which cannot be changed.
> 

Awesome!  Very excited to see that you implemented this suggestion and it 
is similar to priority based oom killing that we have done.  I think this 
kind of support is long overdue in the oom killer.

Comment inline.

> Signed-off-by: Roman Gushchin <guro@fb.com>
> Cc: Michal Hocko <mhocko@kernel.org>
> Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: David Rientjes <rientjes@google.com>
> Cc: Tejun Heo <tj@kernel.org>
> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> Cc: kernel-team@fb.com
> Cc: cgroups@vger.kernel.org
> Cc: linux-doc@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org
> Cc: linux-mm@kvack.org
> ---
>  include/linux/memcontrol.h |  3 +++
>  mm/memcontrol.c            | 55 ++++++++++++++++++++++++++++++++++++++++++++--
>  2 files changed, 56 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index b21bbb0edc72..d31ac58e08ad 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -206,6 +206,9 @@ struct mem_cgroup {
>  	/* cached OOM score */
>  	long oom_score;
>  
> +	/* OOM killer priority */
> +	short oom_priority;
> +
>  	/* handle for "memory.events" */
>  	struct cgroup_file events_file;
>  
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index ba72d1cf73d0..2c1566995077 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2710,12 +2710,21 @@ static void select_victim_memcg(struct mem_cgroup *root, struct oom_control *oc)
>  	for (;;) {
>  		struct cgroup_subsys_state *css;
>  		struct mem_cgroup *memcg = NULL;
> +		short prio = SHRT_MIN;
>  		long score = LONG_MIN;
>  
>  		css_for_each_child(css, &root->css) {
>  			struct mem_cgroup *iter = mem_cgroup_from_css(css);
>  
> -			if (iter->oom_score > score) {
> +			if (iter->oom_score == 0)
> +				continue;
> +
> +			if (iter->oom_priority > prio) {
> +				memcg = iter;
> +				prio = iter->oom_priority;
> +				score = iter->oom_score;
> +			} else if (iter->oom_priority == prio &&
> +				   iter->oom_score > score) {
>  				memcg = iter;
>  				score = iter->oom_score;
>  			}

Your tiebreaking is done based on iter->oom_score, which I suppose makes 
sense given that the oom killer traditionally tries to kill from the 
largest memory hogging process.

We actually tiebreak on the timestamp of memcg creation and prefer to kill 
from the newer memcg when iter->oom_priority is the same.  The reasoning 
is that we schedule jobs on a machine that have an inherent priority but 
are unaware of other jobs running at the same priority, and so the kill 
decision, if based on iter->oom_score, may differ based on current memory 
usage.

I'm not necessarily arguing against using iter->oom_score, but was 
wondering if you would also find that tiebreaking based on a timestamp 
when priorities are the same is a more clear semantic to describe?  It's 
similar to how the system oom killer tiebreaked based on which task_struct 
appeared later in the tasklist when memory usage was the same.

Your approach makes oom killing less likely in the near term since it 
kills a more memory-hogging memcg, but has the potential to lose more 
work.  A timestamp-based approach loses the least amount of work by 
preferring to kill newer memcgs, but oom killing may be more frequent if 
smaller child memcgs are killed.  I would argue the former is the 
responsibility of the user for using the same priority.

> @@ -2782,7 +2791,15 @@ bool mem_cgroup_select_oom_victim(struct oom_control *oc)
>  	 * For system-wide OOMs we should consider tasks in the root cgroup
>  	 * with oom_score larger than oc->chosen_points.
>  	 */
> -	if (!oc->memcg) {
> +	if (!oc->memcg && !(oc->chosen_memcg &&
> +			    oc->chosen_memcg->oom_priority > 0)) {
> +		/*
> +		 * Root memcg has priority 0, so if chosen memcg has lower
> +		 * priority, any task in root cgroup is preferable.
> +		 */
> +		if (oc->chosen_memcg && oc->chosen_memcg->oom_priority < 0)
> +			oc->chosen_points = 0;
> +
>  		select_victim_root_cgroup_task(oc);
>  
>  		if (oc->chosen && oc->chosen_memcg) {
> @@ -5373,6 +5390,34 @@ static ssize_t memory_oom_kill_all_tasks_write(struct kernfs_open_file *of,
>  	return nbytes;
>  }
>  
> +static int memory_oom_priority_show(struct seq_file *m, void *v)
> +{
> +	struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m));
> +
> +	seq_printf(m, "%d\n", memcg->oom_priority);
> +
> +	return 0;
> +}
> +
> +static ssize_t memory_oom_priority_write(struct kernfs_open_file *of,
> +				char *buf, size_t nbytes, loff_t off)
> +{
> +	struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
> +	int oom_priority;
> +	int err;
> +
> +	err = kstrtoint(strstrip(buf), 0, &oom_priority);
> +	if (err)
> +		return err;
> +
> +	if (oom_priority < -10000 || oom_priority > 10000)
> +		return -EINVAL;
> +
> +	memcg->oom_priority = (short)oom_priority;
> +
> +	return nbytes;
> +}
> +
>  static int memory_events_show(struct seq_file *m, void *v)
>  {
>  	struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m));
> @@ -5499,6 +5544,12 @@ static struct cftype memory_files[] = {
>  		.write = memory_oom_kill_all_tasks_write,
>  	},
>  	{
> +		.name = "oom_priority",
> +		.flags = CFTYPE_NOT_ON_ROOT,
> +		.seq_show = memory_oom_priority_show,
> +		.write = memory_oom_priority_write,
> +	},
> +	{
>  		.name = "events",
>  		.flags = CFTYPE_NOT_ON_ROOT,
>  		.file_offset = offsetof(struct mem_cgroup, events_file),

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [v4 4/4] mm, oom, docs: describe the cgroup-aware OOM killer
  2017-07-26 13:27 ` [v4 4/4] mm, oom, docs: describe the cgroup-aware OOM killer Roman Gushchin
@ 2017-08-08 23:24   ` David Rientjes
  2017-08-14 12:28     ` Roman Gushchin
  0 siblings, 1 reply; 23+ messages in thread
From: David Rientjes @ 2017-08-08 23:24 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: linux-mm, Michal Hocko, Vladimir Davydov, Johannes Weiner,
	Tetsuo Handa, Tejun Heo, kernel-team, cgroups, linux-doc,
	linux-kernel

On Wed, 26 Jul 2017, Roman Gushchin wrote:

> +Cgroup-aware OOM Killer
> +~~~~~~~~~~~~~~~~~~~~~~~
> +
> +Cgroup v2 memory controller implements a cgroup-aware OOM killer.
> +It means that it treats memory cgroups as first class OOM entities.
> +
> +Under OOM conditions the memory controller tries to make the best
> +choice of a victim, hierarchically looking for the largest memory
> +consumer. By default, it will look for the biggest task in the
> +biggest leaf cgroup.
> +
> +By default, all cgroups have oom_priority 0, and the OOM killer will
> +choose the largest cgroup recursively on each level. For non-root
> +cgroups it's possible to change the oom_priority, and it will cause
> +the OOM killer to look at the priority value first, and compare
> +sizes only of cgroups with equal priority.
> +
> +But a user can change this behavior by enabling the per-cgroup
> +oom_kill_all_tasks option. If set, it causes the OOM killer to treat
> +the whole cgroup as an indivisible memory consumer. If it is
> +selected as an OOM victim, all belonging tasks will be killed.
> +
> +Tasks in the root cgroup are treated as independent memory consumers,
> +and are compared with other memory consumers (e.g. leaf cgroups).
> +The root cgroup doesn't support the oom_kill_all_tasks feature.
> +
> +This affects both system- and cgroup-wide OOMs. For a cgroup-wide OOM
> +the memory controller considers only cgroups belonging to the sub-tree
> +of the OOM'ing cgroup.
> +
>  IO
>  --

Thanks very much for following through with this.

As described in http://marc.info/?l=linux-kernel&m=149980660611610 this is 
very similar to what we do for priority based oom killing.

I'm wondering your comments on extending it one step further, however: 
include process priority as part of the selection rather than simply memcg 
priority.

memory.oom_priority will dictate which memcg the kill will originate from, 
but processes have no ability to specify that they should actually be 
killed as opposed to a leaf memcg.  I'm not sure how important this is for 
your usecase, but we have found it useful to be able to specify process 
priority as part of the decisionmaking.

At each level of consideration, we simply kill a process with lower 
/proc/pid/oom_priority if there are no memcgs with a lower 
memory.oom_priority.  This allows us to define the exact process that will 
be oom killed, absent oom_kill_all_tasks, and not require that the process 
be attached to leaf memcg.  Most notably these are processes that are best 
effort: stats collection, logging, etc.

Do you think it would be helpful to introduce per-process oom priority as 
well?

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [v4 2/4] mm, oom: cgroup-aware OOM killer
  2017-08-08 23:06       ` David Rientjes
@ 2017-08-14 12:03         ` Roman Gushchin
  0 siblings, 0 replies; 23+ messages in thread
From: Roman Gushchin @ 2017-08-14 12:03 UTC (permalink / raw)
  To: David Rientjes
  Cc: Michal Hocko, linux-mm, Vladimir Davydov, Johannes Weiner,
	Tetsuo Handa, Tejun Heo, kernel-team, cgroups, linux-doc,
	linux-kernel

On Tue, Aug 08, 2017 at 04:06:38PM -0700, David Rientjes wrote:
> On Tue, 1 Aug 2017, Roman Gushchin wrote:
> 
> > > To the rest of the patch. I have to say I do not quite like how it is
> > > implemented. I was hoping for something much simpler which would hook
> > > into oom_evaluate_task. If a task belongs to a memcg with kill-all flag
> > > then we would update the cumulative memcg badness (more specifically the
> > > badness of the topmost parent with kill-all flag). Memcg will then
> > > compete with existing self contained tasks (oom_badness will have to
> > > tell whether points belong to a task or a memcg to allow the caller to
> > > deal with it). But it shouldn't be much more complex than that.
> > 
> > I'm not sure, it will be any simpler. Basically I'm doing the same:
> > the difference is that you want to iterate over tasks and for each
> > task traverse the memcg tree, update per-cgroup oom score and find
> > the corresponding memcg(s) with the kill-all flag. I'm doing the opposite:
> > traverse the cgroup tree, and for each leaf cgroup iterate over processes.
> > 
> > Also, please note, that even without the kill-all flag the decision is made
> > on per-cgroup level (except tasks in the root cgroup).
> > 
> 
> I think your implementation is preferred and is actually quite simple to 
> follow, and I would encourage you to follow through with it.  It has a 
> similar implementation to what we have done for years to kill a process 
> from a leaf memcg.

Hi David!

Thank you for the support.

> 
> I did notice that oom_kill_memcg_victim() calls directly into 
> __oom_kill_process(), however, so we lack the traditional oom killer 
> output that shows memcg usage and potential tasklist.  I think we should 
> still be dumping this information to the kernel log so that we can see a 
> breakdown of charged memory.

I think the existing output is too verbose for the case when we kill
a cgroup with many processes inside. But I absolutely agree that we need
some debug output; I'll add it in v5.

Thanks!

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [v4 4/4] mm, oom, docs: describe the cgroup-aware OOM killer
  2017-08-08 23:24   ` David Rientjes
@ 2017-08-14 12:28     ` Roman Gushchin
  0 siblings, 0 replies; 23+ messages in thread
From: Roman Gushchin @ 2017-08-14 12:28 UTC (permalink / raw)
  To: David Rientjes
  Cc: linux-mm, Michal Hocko, Vladimir Davydov, Johannes Weiner,
	Tetsuo Handa, Tejun Heo, kernel-team, cgroups, linux-doc,
	linux-kernel

On Tue, Aug 08, 2017 at 04:24:32PM -0700, David Rientjes wrote:
> On Wed, 26 Jul 2017, Roman Gushchin wrote:
> 
> > +Cgroup-aware OOM Killer
> > +~~~~~~~~~~~~~~~~~~~~~~~
> > +
> > +Cgroup v2 memory controller implements a cgroup-aware OOM killer.
> > +It means that it treats memory cgroups as first class OOM entities.
> > +
> > +Under OOM conditions the memory controller tries to make the best
> > +choice of a victim, hierarchically looking for the largest memory
> > +consumer. By default, it will look for the biggest task in the
> > +biggest leaf cgroup.
> > +
> > +By default, all cgroups have oom_priority 0, and the OOM killer will
> > +choose the largest cgroup recursively on each level. For non-root
> > +cgroups it's possible to change the oom_priority, and it will cause
> > +the OOM killer to look at the priority value first, and compare
> > +sizes only of cgroups with equal priority.
> > +
> > +But a user can change this behavior by enabling the per-cgroup
> > +oom_kill_all_tasks option. If set, it causes the OOM killer to treat
> > +the whole cgroup as an indivisible memory consumer. If it is
> > +selected as an OOM victim, all belonging tasks will be killed.
> > +
> > +Tasks in the root cgroup are treated as independent memory consumers,
> > +and are compared with other memory consumers (e.g. leaf cgroups).
> > +The root cgroup doesn't support the oom_kill_all_tasks feature.
> > +
> > +This affects both system- and cgroup-wide OOMs. For a cgroup-wide OOM
> > +the memory controller considers only cgroups belonging to the sub-tree
> > +of the OOM'ing cgroup.
> > +
> >  IO
> >  --
> 
> Thanks very much for following through with this.
> 
> As described in http://marc.info/?l=linux-kernel&m=149980660611610 this is 
> very similar to what we do for priority based oom killing.
> 
> I'm wondering your comments on extending it one step further, however: 
> include process priority as part of the selection rather than simply memcg 
> priority.
> 
> memory.oom_priority will dictate which memcg the kill will originate from, 
> but processes have no ability to specify that they should actually be 
> killed as opposed to a leaf memcg.  I'm not sure how important this is for 
> your usecase, but we have found it useful to be able to specify process 
> priority as part of the decisionmaking.
> 
> At each level of consideration, we simply kill a process with lower 
> /proc/pid/oom_priority if there are no memcgs with a lower 
> memory.oom_priority.  This allows us to define the exact process that will 
> be oom killed, absent oom_kill_all_tasks, and not require that the process 
> be attached to leaf memcg.  Most notably these are processes that are best 
> effort: stats collection, logging, etc.

I'm focused on the cgroup v2 interface, which means there are no processes
belonging to non-leaf cgroups. So cgroups are compared only with root-cgroup
processes, and I'm not sure we really need a way to prioritize them.

> 
> Do you think it would be helpful to introduce per-process oom priority as 
> well?

I'm not against per-process oom_priority, and it might be a good idea
to replace the existing oom_score_adj with it at some point. I might be wrong,
but I think users mostly use the extreme oom_score_adj values;
no one really needs tiebreaking based on some percentage
of the total memory. And oom_priority would be just a simpler and clearer
way to express the same intention.

But it's not directly related to this patchset, and it's more arguable,
so I think it can be done later.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [v4 3/4] mm, oom: introduce oom_priority for memory cgroups
  2017-08-08 23:14   ` David Rientjes
@ 2017-08-14 12:39     ` Roman Gushchin
  0 siblings, 0 replies; 23+ messages in thread
From: Roman Gushchin @ 2017-08-14 12:39 UTC (permalink / raw)
  To: David Rientjes
  Cc: linux-mm, Michal Hocko, Vladimir Davydov, Johannes Weiner,
	Tejun Heo, Tetsuo Handa, kernel-team, cgroups, linux-doc,
	linux-kernel

On Tue, Aug 08, 2017 at 04:14:50PM -0700, David Rientjes wrote:
> On Wed, 26 Jul 2017, Roman Gushchin wrote:
> 
> > Introduce a per-memory-cgroup oom_priority setting: an integer number
> > within the [-10000, 10000] range, which defines the order in which
> > the OOM killer selects victim memory cgroups.
> > 
> > OOM killer prefers memory cgroups with larger priority if they are
> > populated with eligible tasks.
> > 
> > The oom_priority value is compared within sibling cgroups.
> > 
> > The root cgroup has the oom_priority 0, which cannot be changed.
> > 
> 
> Awesome!  Very excited to see that you implemented this suggestion and it 
> is similar to priority based oom killing that we have done.  I think this 
> kind of support is long overdue in the oom killer.
> 
> Comment inline.
> 
> > Signed-off-by: Roman Gushchin <guro@fb.com>
> > Cc: Michal Hocko <mhocko@kernel.org>
> > Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
> > Cc: Johannes Weiner <hannes@cmpxchg.org>
> > Cc: David Rientjes <rientjes@google.com>
> > Cc: Tejun Heo <tj@kernel.org>
> > Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> > Cc: kernel-team@fb.com
> > Cc: cgroups@vger.kernel.org
> > Cc: linux-doc@vger.kernel.org
> > Cc: linux-kernel@vger.kernel.org
> > Cc: linux-mm@kvack.org
> > ---
> >  include/linux/memcontrol.h |  3 +++
> >  mm/memcontrol.c            | 55 ++++++++++++++++++++++++++++++++++++++++++++--
> >  2 files changed, 56 insertions(+), 2 deletions(-)
> > 
> > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> > index b21bbb0edc72..d31ac58e08ad 100644
> > --- a/include/linux/memcontrol.h
> > +++ b/include/linux/memcontrol.h
> > @@ -206,6 +206,9 @@ struct mem_cgroup {
> >  	/* cached OOM score */
> >  	long oom_score;
> >  
> > +	/* OOM killer priority */
> > +	short oom_priority;
> > +
> >  	/* handle for "memory.events" */
> >  	struct cgroup_file events_file;
> >  
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index ba72d1cf73d0..2c1566995077 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -2710,12 +2710,21 @@ static void select_victim_memcg(struct mem_cgroup *root, struct oom_control *oc)
> >  	for (;;) {
> >  		struct cgroup_subsys_state *css;
> >  		struct mem_cgroup *memcg = NULL;
> > +		short prio = SHRT_MIN;
> >  		long score = LONG_MIN;
> >  
> >  		css_for_each_child(css, &root->css) {
> >  			struct mem_cgroup *iter = mem_cgroup_from_css(css);
> >  
> > -			if (iter->oom_score > score) {
> > +			if (iter->oom_score == 0)
> > +				continue;
> > +
> > +			if (iter->oom_priority > prio) {
> > +				memcg = iter;
> > +				prio = iter->oom_priority;
> > +				score = iter->oom_score;
> > +			} else if (iter->oom_priority == prio &&
> > +				   iter->oom_score > score) {
> >  				memcg = iter;
> >  				score = iter->oom_score;
> >  			}
> 
> Your tiebreaking is done based on iter->oom_score, which I suppose makes 
> sense given that the oom killer traditionally tries to kill from the 
> largest memory hogging process.
> 
> We actually tiebreak on a timestamp of memcg creation and prefer to kill 
> from the newer memcg when iter->oom_priority is the same.  The reasoning 
> is that we schedule jobs on a machine that have an inherent priority but 
> is unaware of other jobs running at the same priority and so the kill 
> decision, if based on iter->oom_score, may differ based on current memory 
> usage.
> 
> I'm not necessarily arguing against using iter->oom_score, but was 
> wondering if you would also find that tiebreaking based on a timestamp 
> when priorities are the same is a more clear semantic to describe?  It's 
> similar to how the system oom killer tiebreaked based on which task_struct 
> appeared later in the tasklist when memory usage was the same.
> 
> Your approach makes oom killing less likely in the near term since it 
> kills a more memory hogging memcg, but has the potential to lose less 
> work.  A timestamp based approach loses the least amount of work by 
> preferring to kill newer memcgs but oom killing may be more frequent if 
> smaller child memcgs are killed.  I would argue the former is the 
> responsibility of the user for using the same priority.

I think we should have the same approach for cgroups and processes.

We use the size-based approach for processes, and it would be really confusing
to have something different for memory cgroups. So I'd prefer to match
the existing behavior right now, and later, if required, extend both the
per-process and per-cgroup algorithms to support time-based evaluation.

Thanks!

Roman

^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2017-08-14 12:40 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <20170726132718.14806-1-guro@fb.com>
2017-07-26 13:27 ` [v4 1/4] mm, oom: refactor the TIF_MEMDIE usage Roman Gushchin
2017-07-26 13:56   ` Michal Hocko
2017-07-26 14:06     ` Roman Gushchin
2017-07-26 14:24       ` Michal Hocko
2017-07-26 14:44         ` Michal Hocko
2017-07-26 14:50           ` Roman Gushchin
2017-07-26 13:27 ` [v4 2/4] mm, oom: cgroup-aware OOM killer Roman Gushchin
2017-07-27 21:41   ` kbuild test robot
2017-08-01 14:54   ` Michal Hocko
2017-08-01 15:25     ` Roman Gushchin
2017-08-01 17:03       ` Michal Hocko
2017-08-01 18:13         ` Roman Gushchin
2017-08-02  7:29           ` Michal Hocko
2017-08-03 12:47             ` Roman Gushchin
2017-08-03 13:01               ` Michal Hocko
2017-08-08 23:06       ` David Rientjes
2017-08-14 12:03         ` Roman Gushchin
2017-07-26 13:27 ` [v4 3/4] mm, oom: introduce oom_priority for memory cgroups Roman Gushchin
2017-08-08 23:14   ` David Rientjes
2017-08-14 12:39     ` Roman Gushchin
2017-07-26 13:27 ` [v4 4/4] mm, oom, docs: describe the cgroup-aware OOM killer Roman Gushchin
2017-08-08 23:24   ` David Rientjes
2017-08-14 12:28     ` Roman Gushchin
