linux-kernel.vger.kernel.org archive mirror
* [RESEND v12 0/6] cgroup-aware OOM killer
@ 2017-10-19 18:52 Roman Gushchin
  2017-10-19 18:52 ` [RESEND v12 1/6] mm, oom: refactor the oom_kill_process() function Roman Gushchin
                   ` (6 more replies)
  0 siblings, 7 replies; 32+ messages in thread
From: Roman Gushchin @ 2017-10-19 18:52 UTC (permalink / raw)
  To: linux-mm
  Cc: Roman Gushchin, Michal Hocko, Vladimir Davydov, Johannes Weiner,
	Tetsuo Handa, David Rientjes, Andrew Morton, Tejun Heo,
	kernel-team, cgroups, linux-doc, linux-kernel

This patchset makes the OOM killer cgroup-aware.

v12:
  - Root memory cgroup is evaluated based on the sum of the oom scores
    of its member tasks
  - Do not fall back to the per-process behavior if it wasn't
    possible to kill a memcg victim
  - Rebase on top of mm tree

v11:
  - Fixed an issue with skipping the root mem cgroup
    (discovered by Shakeel Butt)
  - Moved a check in __oom_kill_process() to the memory.oom_group
    patch, added corresponding comments
  - Added a note about ignoring tasks with oom_score_adj -1000
    (proposed by Michal Hocko)
  - Rebase on top of mm tree

v10:
  - Separate oom_group introduction into a standalone patch
  - Stop propagating oom_group
  - Make oom_group delegatable
  - Do not try to kill the biggest task first if the whole
    cgroup is going to be killed
  - Stop caching oom_score on struct memcg, optimize victim
    memcg selection
  - Drop dmesg printing (for further refining)
  - Small refactorings and comments added here and there
  - Rebase on top of mm tree

v9:
  - Change siblings-to-siblings comparison to the tree-wide search,
    make related refactorings
  - Make oom_group implicitly propagated down the tree
  - Fix an issue with task selection in root cgroup

v8:
  - Do not kill tasks with OOM_SCORE_ADJ -1000
  - Make the whole thing opt-in with cgroup mount option control
  - Drop oom_priority for further discussions
  - Kill the whole cgroup if oom_group is set and its
    memory.max is reached
  - Update docs and commit messages

v7:
  - __oom_kill_process() drops reference to the victim task
  - oom_score_adj -1000 is always respected
  - Renamed oom_kill_all to oom_group
  - Dropped oom_prio range, converted from short to int
  - Added a cgroup v2 mount option to disable cgroup-aware OOM killer
  - Docs updated
  - Rebased on top of mmotm

v6:
  - Renamed oom_control.chosen to oom_control.chosen_task
  - Renamed oom_kill_all_tasks to oom_kill_all
  - Per-node NR_SLAB_UNRECLAIMABLE accounting
  - Several minor fixes and cleanups
  - Docs updated

v5:
  - Rebased on top of Michal Hocko's patches, which have changed the
    way OOM victims get access to the memory reserves.
    Dropped the corresponding part of this patchset
  - Separated the oom_kill_process() splitting into a standalone commit
  - Added debug output (suggested by David Rientjes)
  - Some minor fixes

v4:
  - Reworked per-cgroup oom_score_adj into oom_priority
    (based on ideas by David Rientjes)
  - Tasks with oom_score_adj -1000 are never selected if
    oom_kill_all_tasks is not set
  - Memcg victim selection code is reworked, and
    synchronization is based on finding tasks with the OOM victim marker,
    rather than on a global counter
  - Debug output is dropped
  - Refactored TIF_MEMDIE usage

v3:
  - Merged commits 1-4 into 6
  - Separated oom_score_adj logic and debug output into separate commits
  - Fixed swap accounting

v2:
  - Reworked victim selection based on feedback
    from Michal Hocko, Vladimir Davydov and Johannes Weiner
  - "Kill all tasks" is now an opt-in option, by default
    only one process will be killed
  - Added per-cgroup oom_score_adj
  - Refined oom score calculations, suggested by Vladimir Davydov
  - Converted to a patchset

v1:
  https://lkml.org/lkml/2017/5/18/969


Cc: Michal Hocko <mhocko@suse.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: David Rientjes <rientjes@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: kernel-team@fb.com
Cc: cgroups@vger.kernel.org
Cc: linux-doc@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org


Roman Gushchin (6):
  mm, oom: refactor the oom_kill_process() function
  mm: implement mem_cgroup_scan_tasks() for the root memory cgroup
  mm, oom: cgroup-aware OOM killer
  mm, oom: introduce memory.oom_group
  mm, oom: add cgroup v2 mount option for cgroup-aware OOM killer
  mm, oom, docs: describe the cgroup-aware OOM killer

 Documentation/cgroup-v2.txt |  51 +++++++++
 include/linux/cgroup-defs.h |   5 +
 include/linux/memcontrol.h  |  34 ++++++
 include/linux/oom.h         |  12 ++-
 kernel/cgroup/cgroup.c      |  10 ++
 mm/memcontrol.c             | 258 +++++++++++++++++++++++++++++++++++++++++++-
 mm/oom_kill.c               | 212 ++++++++++++++++++++++++------------
 7 files changed, 506 insertions(+), 76 deletions(-)

-- 
2.13.6


* [RESEND v12 1/6] mm, oom: refactor the oom_kill_process() function
  2017-10-19 18:52 [RESEND v12 0/6] cgroup-aware OOM killer Roman Gushchin
@ 2017-10-19 18:52 ` Roman Gushchin
  2017-10-19 18:52 ` [RESEND v12 2/6] mm: implement mem_cgroup_scan_tasks() for the root memory cgroup Roman Gushchin
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 32+ messages in thread
From: Roman Gushchin @ 2017-10-19 18:52 UTC (permalink / raw)
  To: linux-mm
  Cc: Roman Gushchin, Vladimir Davydov, Tetsuo Handa, David Rientjes,
	Andrew Morton, Tejun Heo, kernel-team, cgroups, linux-doc,
	linux-kernel

The oom_kill_process() function consists of two logical parts:
the first one is responsible for considering the task's children as
potential victims and printing the debug information.
The second half is responsible for sending SIGKILL to all
tasks sharing the mm struct with the given victim.

This commit splits the oom_kill_process() function with
the intention to re-use the second half: __oom_kill_process().

The cgroup-aware OOM killer will kill multiple tasks
belonging to the victim cgroup. We don't need to print
the debug information for each task, nor to play
with task selection (considering the task's children),
so we can't use the existing oom_kill_process().
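
For orientation, a condensed sketch of the resulting structure (this only
mirrors the diff below and is not standalone-compilable code; the numbered
comments stand in for the elided details):

/* Second half: deliver SIGKILL to the victim and to every process
 * sharing its mm; no header dumping, no child selection. */
static void __oom_kill_process(struct task_struct *victim);

static void oom_kill_process(struct oom_control *oc, const char *message)
{
	struct task_struct *victim = oc->chosen;

	/* 1. If the chosen task is already exiting, just mark it as an
	 *    OOM victim and return. */
	/* 2. Rate-limited dump_header()/pr_err() reporting. */
	/* 3. Possibly sacrifice the child with the highest oom_badness()
	 *    instead of the parent. */

	__oom_kill_process(victim);
}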

Signed-off-by: Roman Gushchin <guro@fb.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: David Rientjes <rientjes@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: kernel-team@fb.com
Cc: cgroups@vger.kernel.org
Cc: linux-doc@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
---
 mm/oom_kill.c | 123 +++++++++++++++++++++++++++++++---------------------------
 1 file changed, 65 insertions(+), 58 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 26add8a0d1f7..0b9f36117989 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -842,68 +842,12 @@ static bool task_will_free_mem(struct task_struct *task)
 	return ret;
 }
 
-static void oom_kill_process(struct oom_control *oc, const char *message)
+static void __oom_kill_process(struct task_struct *victim)
 {
-	struct task_struct *p = oc->chosen;
-	unsigned int points = oc->chosen_points;
-	struct task_struct *victim = p;
-	struct task_struct *child;
-	struct task_struct *t;
+	struct task_struct *p;
 	struct mm_struct *mm;
-	unsigned int victim_points = 0;
-	static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
-					      DEFAULT_RATELIMIT_BURST);
 	bool can_oom_reap = true;
 
-	/*
-	 * If the task is already exiting, don't alarm the sysadmin or kill
-	 * its children or threads, just give it access to memory reserves
-	 * so it can die quickly
-	 */
-	task_lock(p);
-	if (task_will_free_mem(p)) {
-		mark_oom_victim(p);
-		wake_oom_reaper(p);
-		task_unlock(p);
-		put_task_struct(p);
-		return;
-	}
-	task_unlock(p);
-
-	if (__ratelimit(&oom_rs))
-		dump_header(oc, p);
-
-	pr_err("%s: Kill process %d (%s) score %u or sacrifice child\n",
-		message, task_pid_nr(p), p->comm, points);
-
-	/*
-	 * If any of p's children has a different mm and is eligible for kill,
-	 * the one with the highest oom_badness() score is sacrificed for its
-	 * parent.  This attempts to lose the minimal amount of work done while
-	 * still freeing memory.
-	 */
-	read_lock(&tasklist_lock);
-	for_each_thread(p, t) {
-		list_for_each_entry(child, &t->children, sibling) {
-			unsigned int child_points;
-
-			if (process_shares_mm(child, p->mm))
-				continue;
-			/*
-			 * oom_badness() returns 0 if the thread is unkillable
-			 */
-			child_points = oom_badness(child,
-				oc->memcg, oc->nodemask, oc->totalpages);
-			if (child_points > victim_points) {
-				put_task_struct(victim);
-				victim = child;
-				victim_points = child_points;
-				get_task_struct(victim);
-			}
-		}
-	}
-	read_unlock(&tasklist_lock);
-
 	p = find_lock_task_mm(victim);
 	if (!p) {
 		put_task_struct(victim);
@@ -977,6 +921,69 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
 }
 #undef K
 
+static void oom_kill_process(struct oom_control *oc, const char *message)
+{
+	struct task_struct *p = oc->chosen;
+	unsigned int points = oc->chosen_points;
+	struct task_struct *victim = p;
+	struct task_struct *child;
+	struct task_struct *t;
+	unsigned int victim_points = 0;
+	static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
+					      DEFAULT_RATELIMIT_BURST);
+
+	/*
+	 * If the task is already exiting, don't alarm the sysadmin or kill
+	 * its children or threads, just give it access to memory reserves
+	 * so it can die quickly
+	 */
+	task_lock(p);
+	if (task_will_free_mem(p)) {
+		mark_oom_victim(p);
+		wake_oom_reaper(p);
+		task_unlock(p);
+		put_task_struct(p);
+		return;
+	}
+	task_unlock(p);
+
+	if (__ratelimit(&oom_rs))
+		dump_header(oc, p);
+
+	pr_err("%s: Kill process %d (%s) score %u or sacrifice child\n",
+		message, task_pid_nr(p), p->comm, points);
+
+	/*
+	 * If any of p's children has a different mm and is eligible for kill,
+	 * the one with the highest oom_badness() score is sacrificed for its
+	 * parent.  This attempts to lose the minimal amount of work done while
+	 * still freeing memory.
+	 */
+	read_lock(&tasklist_lock);
+	for_each_thread(p, t) {
+		list_for_each_entry(child, &t->children, sibling) {
+			unsigned int child_points;
+
+			if (process_shares_mm(child, p->mm))
+				continue;
+			/*
+			 * oom_badness() returns 0 if the thread is unkillable
+			 */
+			child_points = oom_badness(child,
+				oc->memcg, oc->nodemask, oc->totalpages);
+			if (child_points > victim_points) {
+				put_task_struct(victim);
+				victim = child;
+				victim_points = child_points;
+				get_task_struct(victim);
+			}
+		}
+	}
+	read_unlock(&tasklist_lock);
+
+	__oom_kill_process(victim);
+}
+
 /*
  * Determines whether the kernel must panic because of the panic_on_oom sysctl.
  */
-- 
2.13.6


* [RESEND v12 2/6] mm: implement mem_cgroup_scan_tasks() for the root memory cgroup
  2017-10-19 18:52 [RESEND v12 0/6] cgroup-aware OOM killer Roman Gushchin
  2017-10-19 18:52 ` [RESEND v12 1/6] mm, oom: refactor the oom_kill_process() function Roman Gushchin
@ 2017-10-19 18:52 ` Roman Gushchin
  2017-10-19 18:52 ` [RESEND v12 3/6] mm, oom: cgroup-aware OOM killer Roman Gushchin
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 32+ messages in thread
From: Roman Gushchin @ 2017-10-19 18:52 UTC (permalink / raw)
  To: linux-mm
  Cc: Roman Gushchin, Vladimir Davydov, Tetsuo Handa, Andrew Morton,
	Tejun Heo, kernel-team, cgroups, linux-doc, linux-kernel

Implement mem_cgroup_scan_tasks() functionality for the root
memory cgroup, so that the cgroup-aware OOM killer can use this
function to look for an OOM victim task in the root memory cgroup.

The root memory cgroup is treated as a leaf cgroup, so only tasks
which belong directly to the root cgroup are iterated over.

This patch doesn't introduce any functional change as
mem_cgroup_scan_tasks() is never called for the root memcg.
This is preparatory work for the cgroup-aware OOM killer,
which will use this function to iterate over tasks belonging
to the root memcg.
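
As a rough usage sketch (based on the shape of the cgroup-aware OOM
killer introduced in the next patch, not on code in this one), the
caller can now pass the root memory cgroup like any other leaf:

	/* Pick the biggest task directly attached to @memcg, which may
	 * now also be root_mem_cgroup (treated as a leaf). */
	oc->chosen_points = 0;
	oc->chosen_task = NULL;
	mem_cgroup_scan_tasks(memcg, oom_evaluate_task, oc);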

Signed-off-by: Roman Gushchin <guro@fb.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: kernel-team@fb.com
Cc: cgroups@vger.kernel.org
Cc: linux-doc@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
---
 mm/memcontrol.c | 7 +++----
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 50e6906314f8..1d30a45a4bbe 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -917,7 +917,8 @@ static void invalidate_reclaim_iterators(struct mem_cgroup *dead_memcg)
  * value, the function breaks the iteration loop and returns the value.
  * Otherwise, it will iterate over all tasks and return 0.
  *
- * This function must not be called for the root memory cgroup.
+ * If memcg is the root memory cgroup, this function will iterate only
+ * over tasks belonging directly to the root memory cgroup.
  */
 int mem_cgroup_scan_tasks(struct mem_cgroup *memcg,
 			  int (*fn)(struct task_struct *, void *), void *arg)
@@ -925,8 +926,6 @@ int mem_cgroup_scan_tasks(struct mem_cgroup *memcg,
 	struct mem_cgroup *iter;
 	int ret = 0;
 
-	BUG_ON(memcg == root_mem_cgroup);
-
 	for_each_mem_cgroup_tree(iter, memcg) {
 		struct css_task_iter it;
 		struct task_struct *task;
@@ -935,7 +934,7 @@ int mem_cgroup_scan_tasks(struct mem_cgroup *memcg,
 		while (!ret && (task = css_task_iter_next(&it)))
 			ret = fn(task, arg);
 		css_task_iter_end(&it);
-		if (ret) {
+		if (ret || memcg == root_mem_cgroup) {
 			mem_cgroup_iter_break(memcg, iter);
 			break;
 		}
-- 
2.13.6


* [RESEND v12 3/6] mm, oom: cgroup-aware OOM killer
  2017-10-19 18:52 [RESEND v12 0/6] cgroup-aware OOM killer Roman Gushchin
  2017-10-19 18:52 ` [RESEND v12 1/6] mm, oom: refactor the oom_kill_process() function Roman Gushchin
  2017-10-19 18:52 ` [RESEND v12 2/6] mm: implement mem_cgroup_scan_tasks() for the root memory cgroup Roman Gushchin
@ 2017-10-19 18:52 ` Roman Gushchin
  2017-10-19 19:30   ` Michal Hocko
  2017-10-31 15:04   ` Shakeel Butt
  2017-10-19 18:52 ` [RESEND v12 4/6] mm, oom: introduce memory.oom_group Roman Gushchin
                   ` (3 subsequent siblings)
  6 siblings, 2 replies; 32+ messages in thread
From: Roman Gushchin @ 2017-10-19 18:52 UTC (permalink / raw)
  To: linux-mm
  Cc: Roman Gushchin, Vladimir Davydov, Tetsuo Handa, David Rientjes,
	Andrew Morton, Tejun Heo, kernel-team, cgroups, linux-doc,
	linux-kernel

Traditionally, the OOM killer operates at the process level.
Under OOM conditions, it finds the process with the highest oom score
and kills it.

This behavior doesn't suit systems with many running
containers well:

1) There is no fairness between containers. A small container with
a few large processes will be chosen over a large one with a huge
number of small processes.

2) Containers often do not expect that some random process inside
them will be killed. In many cases a much safer behavior is to kill
all tasks in the container. Traditionally, this was implemented
in userspace, but doing it in the kernel has some advantages,
especially in the case of a system-wide OOM.

To address these issues, the cgroup-aware OOM killer is introduced.

This patch introduces the core functionality: an ability to select
a memory cgroup as an OOM victim. Under OOM conditions the OOM killer
looks for the biggest leaf memory cgroup and kills the biggest
task belonging to it.

The following patches will extend this functionality to consider
non-leaf memory cgroups as OOM victims, and also provide an ability
to kill all tasks belonging to the victim cgroup.

The root cgroup is treated as a leaf memory cgroup, so its score
is compared with those of the other leaf memory cgroups.
Due to the memcg statistics implementation, a special approximation
is used for estimating the oom_score of the root memory cgroup: we sum
the oom_score of its member processes (or, to be more precise, of the
tasks owning their mm structures).
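
To illustrate the selection rule, here is a toy userspace model (plain
C, not kernel code; the cgroup names, task lists and numbers below are
made up): leaf memcgs are scored by their memory footprint, while the
root memcg's score is approximated by summing the badness of its
mm-owning tasks.

#include <stdio.h>

struct task { long badness; int owns_mm; };

struct memcg {
	const char *name;
	int is_root;
	long footprint;            /* precomputed footprint (leaf memcgs) */
	const struct task *tasks;  /* directly attached tasks (root only) */
	int nr_tasks;
};

static long memcg_score(const struct memcg *mc)
{
	long score = 0;
	int i;

	if (!mc->is_root)
		return mc->footprint;

	/* Root memcg: sum per-task badness, counting each mm only once. */
	for (i = 0; i < mc->nr_tasks; i++)
		if (mc->tasks[i].owns_mm)
			score += mc->tasks[i].badness;

	return score;
}

int main(void)
{
	const struct task root_tasks[] = { { 100, 1 }, { 40, 0 }, { 25, 1 } };
	const struct memcg memcgs[] = {
		{ "root", 1,   0, root_tasks, 3 },
		{ "A",    0, 300, NULL, 0 },
		{ "B",    0,  50, NULL, 0 },
	};
	const struct memcg *chosen = NULL;
	long best = 0;
	int i;

	for (i = 0; i < 3; i++) {
		long score = memcg_score(&memcgs[i]);

		if (score > best) {
			best = score;
			chosen = &memcgs[i];
		}
	}

	printf("victim memcg: %s (score %ld)\n",
	       chosen ? chosen->name : "none", best);
	return 0;
}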

Signed-off-by: Roman Gushchin <guro@fb.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: David Rientjes <rientjes@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: kernel-team@fb.com
Cc: cgroups@vger.kernel.org
Cc: linux-doc@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
---
 include/linux/memcontrol.h |  17 +++++
 include/linux/oom.h        |  12 ++-
 mm/memcontrol.c            | 181 +++++++++++++++++++++++++++++++++++++++++++++
 mm/oom_kill.c              |  72 +++++++++++++-----
 4 files changed, 262 insertions(+), 20 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 69966c461d1c..75b63b68846e 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -35,6 +35,7 @@ struct mem_cgroup;
 struct page;
 struct mm_struct;
 struct kmem_cache;
+struct oom_control;
 
 /* Cgroup-specific page state, on top of universal node page state */
 enum memcg_stat_item {
@@ -342,6 +343,11 @@ struct mem_cgroup *mem_cgroup_from_css(struct cgroup_subsys_state *css){
 	return css ? container_of(css, struct mem_cgroup, css) : NULL;
 }
 
+static inline void mem_cgroup_put(struct mem_cgroup *memcg)
+{
+	css_put(&memcg->css);
+}
+
 #define mem_cgroup_from_counter(counter, member)	\
 	container_of(counter, struct mem_cgroup, member)
 
@@ -480,6 +486,8 @@ static inline bool task_in_memcg_oom(struct task_struct *p)
 
 bool mem_cgroup_oom_synchronize(bool wait);
 
+bool mem_cgroup_select_oom_victim(struct oom_control *oc);
+
 #ifdef CONFIG_MEMCG_SWAP
 extern int do_swap_account;
 #endif
@@ -744,6 +752,10 @@ static inline bool task_in_mem_cgroup(struct task_struct *task,
 	return true;
 }
 
+static inline void mem_cgroup_put(struct mem_cgroup *memcg)
+{
+}
+
 static inline struct mem_cgroup *
 mem_cgroup_iter(struct mem_cgroup *root,
 		struct mem_cgroup *prev,
@@ -936,6 +948,11 @@ static inline
 void count_memcg_event_mm(struct mm_struct *mm, enum vm_event_item idx)
 {
 }
+
+static inline bool mem_cgroup_select_oom_victim(struct oom_control *oc)
+{
+	return false;
+}
 #endif /* CONFIG_MEMCG */
 
 /* idx can be of type enum memcg_stat_item or node_stat_item */
diff --git a/include/linux/oom.h b/include/linux/oom.h
index 76aac4ce39bc..ca78e2d5956e 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -9,6 +9,13 @@
 #include <linux/sched/coredump.h> /* MMF_* */
 #include <linux/mm.h> /* VM_FAULT* */
 
+
+/*
+ * Special value returned by victim selection functions to indicate
+ * that there are inflight OOM victims.
+ */
+#define INFLIGHT_VICTIM ((void *)-1UL)
+
 struct zonelist;
 struct notifier_block;
 struct mem_cgroup;
@@ -39,7 +46,8 @@ struct oom_control {
 
 	/* Used by oom implementation, do not set */
 	unsigned long totalpages;
-	struct task_struct *chosen;
+	struct task_struct *chosen_task;
+	struct mem_cgroup *chosen_memcg;
 	unsigned long chosen_points;
 };
 
@@ -101,6 +109,8 @@ extern void oom_killer_enable(void);
 
 extern struct task_struct *find_lock_task_mm(struct task_struct *p);
 
+extern int oom_evaluate_task(struct task_struct *task, void *arg);
+
 /* sysctls */
 extern int sysctl_oom_dump_tasks;
 extern int sysctl_oom_kill_allocating_task;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 1d30a45a4bbe..f364bfed745f 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2670,6 +2670,187 @@ static inline bool memcg_has_children(struct mem_cgroup *memcg)
 	return ret;
 }
 
+static long memcg_oom_badness(struct mem_cgroup *memcg,
+			      const nodemask_t *nodemask,
+			      unsigned long totalpages)
+{
+	long points = 0;
+	int nid;
+	pg_data_t *pgdat;
+
+	for_each_node_state(nid, N_MEMORY) {
+		if (nodemask && !node_isset(nid, *nodemask))
+			continue;
+
+		points += mem_cgroup_node_nr_lru_pages(memcg, nid,
+				LRU_ALL_ANON | BIT(LRU_UNEVICTABLE));
+
+		pgdat = NODE_DATA(nid);
+		points += lruvec_page_state(mem_cgroup_lruvec(pgdat, memcg),
+					    NR_SLAB_UNRECLAIMABLE);
+	}
+
+	points += memcg_page_state(memcg, MEMCG_KERNEL_STACK_KB) /
+		(PAGE_SIZE / 1024);
+	points += memcg_page_state(memcg, MEMCG_SOCK);
+	points += memcg_page_state(memcg, MEMCG_SWAP);
+
+	return points;
+}
+
+/*
+ * Checks if the given memcg is a valid OOM victim and returns a number,
+ * which means the following:
+ *   -1: there are inflight OOM victim tasks, belonging to the memcg
+ *    0: memcg is not eligible, e.g. all belonging tasks are protected
+ *       by oom_score_adj set to OOM_SCORE_ADJ_MIN
+ *   >0: memcg is eligible, and the returned value is an estimation
+ *       of the memory footprint
+ */
+static long oom_evaluate_memcg(struct mem_cgroup *memcg,
+			       const nodemask_t *nodemask,
+			       unsigned long totalpages)
+{
+	struct css_task_iter it;
+	struct task_struct *task;
+	int eligible = 0;
+
+	/*
+	 * Root memory cgroup is a special case:
+	 * we don't have necessary stats to evaluate it exactly as
+	 * leaf memory cgroups, so we approximate its oom_score
+	 * by summing oom_score of all belonging tasks, which are
+	 * owners of their mm structs.
+	 *
+	 * If there are inflight OOM victim tasks inside
+	 * the root memcg, we return -1.
+	 */
+	if (memcg == root_mem_cgroup) {
+		struct css_task_iter it;
+		struct task_struct *task;
+		long score = 0;
+
+		css_task_iter_start(&memcg->css, 0, &it);
+		while ((task = css_task_iter_next(&it))) {
+			if (tsk_is_oom_victim(task) &&
+			    !test_bit(MMF_OOM_SKIP,
+				      &task->signal->oom_mm->flags)) {
+				score = -1;
+				break;
+			}
+
+			task_lock(task);
+			if (!task->mm || task->mm->owner != task) {
+				task_unlock(task);
+				continue;
+			}
+			task_unlock(task);
+
+			score += oom_badness(task, memcg, nodemask,
+					     totalpages);
+		}
+		css_task_iter_end(&it);
+
+		return score;
+	}
+
+	/*
+	 * Memcg is OOM eligible if there are OOM killable tasks inside.
+	 *
+	 * We treat tasks with oom_score_adj set to OOM_SCORE_ADJ_MIN
+	 * as unkillable.
+	 *
+	 * If there are inflight OOM victim tasks inside the memcg,
+	 * we return -1.
+	 */
+	css_task_iter_start(&memcg->css, 0, &it);
+	while ((task = css_task_iter_next(&it))) {
+		if (!eligible &&
+		    task->signal->oom_score_adj != OOM_SCORE_ADJ_MIN)
+			eligible = 1;
+
+		if (tsk_is_oom_victim(task) &&
+		    !test_bit(MMF_OOM_SKIP, &task->signal->oom_mm->flags)) {
+			eligible = -1;
+			break;
+		}
+	}
+	css_task_iter_end(&it);
+
+	if (eligible <= 0)
+		return eligible;
+
+	return memcg_oom_badness(memcg, nodemask, totalpages);
+}
+
+static void select_victim_memcg(struct mem_cgroup *root, struct oom_control *oc)
+{
+	struct mem_cgroup *iter;
+
+	oc->chosen_memcg = NULL;
+	oc->chosen_points = 0;
+
+	/*
+	 * The oom_score is calculated for leaf memory cgroups (including
+	 * the root memcg).
+	 */
+	rcu_read_lock();
+	for_each_mem_cgroup_tree(iter, root) {
+		long score;
+
+		if (memcg_has_children(iter) && iter != root_mem_cgroup)
+			continue;
+
+		score = oom_evaluate_memcg(iter, oc->nodemask, oc->totalpages);
+
+		/*
+		 * Ignore empty and non-eligible memory cgroups.
+		 */
+		if (score == 0)
+			continue;
+
+		/*
+		 * If there are inflight OOM victims, we don't need
+		 * to look further for new victims.
+		 */
+		if (score == -1) {
+			oc->chosen_memcg = INFLIGHT_VICTIM;
+			mem_cgroup_iter_break(root, iter);
+			break;
+		}
+
+		if (score > oc->chosen_points) {
+			oc->chosen_points = score;
+			oc->chosen_memcg = iter;
+		}
+	}
+
+	if (oc->chosen_memcg && oc->chosen_memcg != INFLIGHT_VICTIM)
+		css_get(&oc->chosen_memcg->css);
+
+	rcu_read_unlock();
+}
+
+bool mem_cgroup_select_oom_victim(struct oom_control *oc)
+{
+	struct mem_cgroup *root;
+
+	if (mem_cgroup_disabled())
+		return false;
+
+	if (!cgroup_subsys_on_dfl(memory_cgrp_subsys))
+		return false;
+
+	if (oc->memcg)
+		root = oc->memcg;
+	else
+		root = root_mem_cgroup;
+
+	select_victim_memcg(root, oc);
+
+	return oc->chosen_memcg;
+}
+
 /*
  * Reclaims as many pages from the given memcg as possible.
  *
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 0b9f36117989..5b670adb850c 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -309,7 +309,7 @@ static enum oom_constraint constrained_alloc(struct oom_control *oc)
 	return CONSTRAINT_NONE;
 }
 
-static int oom_evaluate_task(struct task_struct *task, void *arg)
+int oom_evaluate_task(struct task_struct *task, void *arg)
 {
 	struct oom_control *oc = arg;
 	unsigned long points;
@@ -343,26 +343,26 @@ static int oom_evaluate_task(struct task_struct *task, void *arg)
 		goto next;
 
 	/* Prefer thread group leaders for display purposes */
-	if (points == oc->chosen_points && thread_group_leader(oc->chosen))
+	if (points == oc->chosen_points && thread_group_leader(oc->chosen_task))
 		goto next;
 select:
-	if (oc->chosen)
-		put_task_struct(oc->chosen);
+	if (oc->chosen_task)
+		put_task_struct(oc->chosen_task);
 	get_task_struct(task);
-	oc->chosen = task;
+	oc->chosen_task = task;
 	oc->chosen_points = points;
 next:
 	return 0;
 abort:
-	if (oc->chosen)
-		put_task_struct(oc->chosen);
-	oc->chosen = (void *)-1UL;
+	if (oc->chosen_task)
+		put_task_struct(oc->chosen_task);
+	oc->chosen_task = INFLIGHT_VICTIM;
 	return 1;
 }
 
 /*
  * Simple selection loop. We choose the process with the highest number of
- * 'points'. In case scan was aborted, oc->chosen is set to -1.
+ * 'points'. In case scan was aborted, oc->chosen_task is set to -1.
  */
 static void select_bad_process(struct oom_control *oc)
 {
@@ -923,7 +923,7 @@ static void __oom_kill_process(struct task_struct *victim)
 
 static void oom_kill_process(struct oom_control *oc, const char *message)
 {
-	struct task_struct *p = oc->chosen;
+	struct task_struct *p = oc->chosen_task;
 	unsigned int points = oc->chosen_points;
 	struct task_struct *victim = p;
 	struct task_struct *child;
@@ -984,6 +984,27 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
 	__oom_kill_process(victim);
 }
 
+static bool oom_kill_memcg_victim(struct oom_control *oc)
+{
+
+	if (oc->chosen_memcg == NULL || oc->chosen_memcg == INFLIGHT_VICTIM)
+		return oc->chosen_memcg;
+
+	/* Kill a task in the chosen memcg with the biggest memory footprint */
+	oc->chosen_points = 0;
+	oc->chosen_task = NULL;
+	mem_cgroup_scan_tasks(oc->chosen_memcg, oom_evaluate_task, oc);
+
+	if (oc->chosen_task == NULL || oc->chosen_task == INFLIGHT_VICTIM)
+		goto out;
+
+	__oom_kill_process(oc->chosen_task);
+
+out:
+	mem_cgroup_put(oc->chosen_memcg);
+	return oc->chosen_task;
+}
+
 /*
  * Determines whether the kernel must panic because of the panic_on_oom sysctl.
  */
@@ -1036,6 +1057,7 @@ bool out_of_memory(struct oom_control *oc)
 {
 	unsigned long freed = 0;
 	enum oom_constraint constraint = CONSTRAINT_NONE;
+	bool delay = false; /* if set, delay next allocation attempt */
 
 	if (oom_killer_disabled)
 		return false;
@@ -1080,27 +1102,39 @@ bool out_of_memory(struct oom_control *oc)
 	    current->mm && !oom_unkillable_task(current, NULL, oc->nodemask) &&
 	    current->signal->oom_score_adj != OOM_SCORE_ADJ_MIN) {
 		get_task_struct(current);
-		oc->chosen = current;
+		oc->chosen_task = current;
 		oom_kill_process(oc, "Out of memory (oom_kill_allocating_task)");
 		return true;
 	}
 
+	if (mem_cgroup_select_oom_victim(oc)) {
+		if (oom_kill_memcg_victim(oc))
+		    delay = true;
+
+		goto out;
+	}
+
 	select_bad_process(oc);
 	/* Found nothing?!?! Either we hang forever, or we panic. */
-	if (!oc->chosen && !is_sysrq_oom(oc) && !is_memcg_oom(oc)) {
+	if (!oc->chosen_task && !is_sysrq_oom(oc) && !is_memcg_oom(oc)) {
 		dump_header(oc, NULL);
 		panic("Out of memory and no killable processes...\n");
 	}
-	if (oc->chosen && oc->chosen != (void *)-1UL) {
+	if (oc->chosen_task && oc->chosen_task != INFLIGHT_VICTIM) {
 		oom_kill_process(oc, !is_memcg_oom(oc) ? "Out of memory" :
 				 "Memory cgroup out of memory");
-		/*
-		 * Give the killed process a good chance to exit before trying
-		 * to allocate memory again.
-		 */
-		schedule_timeout_killable(1);
+		delay = true;
 	}
-	return !!oc->chosen;
+
+out:
+	/*
+	 * Give the killed process a good chance to exit before trying
+	 * to allocate memory again.
+	 */
+	if (delay)
+		schedule_timeout_killable(1);
+
+	return !!oc->chosen_task;
 }
 
 /*
-- 
2.13.6


* [RESEND v12 4/6] mm, oom: introduce memory.oom_group
  2017-10-19 18:52 [RESEND v12 0/6] cgroup-aware OOM killer Roman Gushchin
                   ` (2 preceding siblings ...)
  2017-10-19 18:52 ` [RESEND v12 3/6] mm, oom: cgroup-aware OOM killer Roman Gushchin
@ 2017-10-19 18:52 ` Roman Gushchin
  2017-10-19 18:52 ` [RESEND v12 5/6] mm, oom: add cgroup v2 mount option for cgroup-aware OOM killer Roman Gushchin
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 32+ messages in thread
From: Roman Gushchin @ 2017-10-19 18:52 UTC (permalink / raw)
  To: linux-mm
  Cc: Roman Gushchin, Vladimir Davydov, Tetsuo Handa, David Rientjes,
	Andrew Morton, Tejun Heo, kernel-team, cgroups, linux-doc,
	linux-kernel

The cgroup-aware OOM killer treats leaf memory cgroups as memory
consumption entities and performs the victim selection by comparing
them based on their memory footprint. Then it kills the biggest task
inside the selected memory cgroup.

But there are workloads which are not tolerant of such behavior.
Killing a random task may leave the workload in a broken state.

To solve this problem, the memory.oom_group knob is introduced.
It defines whether a memory cgroup should be treated as an
indivisible memory consumer, compared by total memory consumption
with other memory consumers (leaf memory cgroups and other memory
cgroups with memory.oom_group set), and whether all of its tasks
should be killed if the cgroup is selected.

If set on memcg A, it means that in case of a system-wide OOM or
a memcg-wide OOM scoped to A or to any ancestor cgroup, all tasks
belonging to the sub-tree of A will be killed. If the OOM event is
scoped to a descendant cgroup (A/B, for example), only tasks in
that cgroup can be affected. The OOM killer will never touch any tasks
outside of the scope of the OOM event.

Also, tasks with oom_score_adj set to -1000 will not be killed because
this has been a long established way to protect a particular process
from seeing an unexpected SIGKILL from the OOM killer. Ignoring this
user defined configuration might lead to data corruptions or other
misbehavior.

The default value is 0.
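
A toy userspace model of the resulting kill policy (plain C, not kernel
code; the task names and numbers are invented): with oom_group set,
every killable task in the selected cgroup is targeted, otherwise only
the biggest one, and tasks with oom_score_adj of -1000 are skipped in
both modes.

#include <stdio.h>

#define OOM_SCORE_ADJ_MIN (-1000)

struct task { const char *comm; long badness; int oom_score_adj; };

static void kill_task(const struct task *t)
{
	printf("  kill %s\n", t->comm);
}

/* If oom_group is set, kill every killable task in the cgroup;
 * otherwise kill only the biggest one. Tasks with oom_score_adj of
 * -1000 are never killed in either mode. */
static void kill_memcg_victim(const struct task *tasks, int n, int oom_group)
{
	const struct task *biggest = NULL;
	int i;

	for (i = 0; i < n; i++) {
		if (tasks[i].oom_score_adj == OOM_SCORE_ADJ_MIN)
			continue;
		if (oom_group)
			kill_task(&tasks[i]);
		else if (!biggest || tasks[i].badness > biggest->badness)
			biggest = &tasks[i];
	}

	if (!oom_group && biggest)
		kill_task(biggest);
}

int main(void)
{
	const struct task tasks[] = {
		{ "worker-1", 120, 0 },
		{ "worker-2",  80, 0 },
		{ "agent",     10, OOM_SCORE_ADJ_MIN },	/* protected */
	};

	puts("memory.oom_group = 0:");
	kill_memcg_victim(tasks, 3, 0);
	puts("memory.oom_group = 1:");
	kill_memcg_victim(tasks, 3, 1);
	return 0;
}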

Signed-off-by: Roman Gushchin <guro@fb.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: David Rientjes <rientjes@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: kernel-team@fb.com
Cc: cgroups@vger.kernel.org
Cc: linux-doc@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
---
 include/linux/memcontrol.h | 17 +++++++++++
 mm/memcontrol.c            | 75 +++++++++++++++++++++++++++++++++++++++++++---
 mm/oom_kill.c              | 49 +++++++++++++++++++++++-------
 3 files changed, 127 insertions(+), 14 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 75b63b68846e..84ac10d7e67d 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -200,6 +200,13 @@ struct mem_cgroup {
 	/* OOM-Killer disable */
 	int		oom_kill_disable;
 
+	/*
+	 * Treat the sub-tree as an indivisible memory consumer,
+	 * kill all belonging tasks if the memory cgroup selected
+	 * as OOM victim.
+	 */
+	bool oom_group;
+
 	/* handle for "memory.events" */
 	struct cgroup_file events_file;
 
@@ -488,6 +495,11 @@ bool mem_cgroup_oom_synchronize(bool wait);
 
 bool mem_cgroup_select_oom_victim(struct oom_control *oc);
 
+static inline bool mem_cgroup_oom_group(struct mem_cgroup *memcg)
+{
+	return memcg->oom_group;
+}
+
 #ifdef CONFIG_MEMCG_SWAP
 extern int do_swap_account;
 #endif
@@ -953,6 +965,11 @@ static inline bool mem_cgroup_select_oom_victim(struct oom_control *oc)
 {
 	return false;
 }
+
+static inline bool mem_cgroup_oom_group(struct mem_cgroup *memcg)
+{
+	return false;
+}
 #endif /* CONFIG_MEMCG */
 
 /* idx can be of type enum memcg_stat_item or node_stat_item */
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index f364bfed745f..ad10dbdf723b 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2785,19 +2785,51 @@ static long oom_evaluate_memcg(struct mem_cgroup *memcg,
 
 static void select_victim_memcg(struct mem_cgroup *root, struct oom_control *oc)
 {
-	struct mem_cgroup *iter;
+	struct mem_cgroup *iter, *group = NULL;
+	long group_score = 0;
 
 	oc->chosen_memcg = NULL;
 	oc->chosen_points = 0;
 
 	/*
+	 * If OOM is memcg-wide, and the memcg has the oom_group flag set,
+	 * all tasks belonging to the memcg should be killed.
+	 * So, we mark the memcg as a victim.
+	 */
+	if (oc->memcg && mem_cgroup_oom_group(oc->memcg)) {
+		oc->chosen_memcg = oc->memcg;
+		css_get(&oc->chosen_memcg->css);
+		return;
+	}
+
+	/*
 	 * The oom_score is calculated for leaf memory cgroups (including
 	 * the root memcg).
+	 * Non-leaf oom_group cgroups accumulate the scores of their
+	 * descendant leaf memory cgroups.
 	 */
 	rcu_read_lock();
 	for_each_mem_cgroup_tree(iter, root) {
 		long score;
 
+		/*
+		 * We don't consider non-leaf non-oom_group memory cgroups
+		 * as OOM victims.
+		 */
+		if (memcg_has_children(iter) && iter != root_mem_cgroup &&
+		    !mem_cgroup_oom_group(iter))
+			continue;
+
+		/*
+	 * If group is not set or we've run out of the group's sub-tree,
+		 * we should set group and reset group_score.
+		 */
+		if (!group || group == root_mem_cgroup ||
+		    !mem_cgroup_is_descendant(iter, group)) {
+			group = iter;
+			group_score = 0;
+		}
+
 		if (memcg_has_children(iter) && iter != root_mem_cgroup)
 			continue;
 
@@ -2819,9 +2851,11 @@ static void select_victim_memcg(struct mem_cgroup *root, struct oom_control *oc)
 			break;
 		}
 
-		if (score > oc->chosen_points) {
-			oc->chosen_points = score;
-			oc->chosen_memcg = iter;
+		group_score += score;
+
+		if (group_score > oc->chosen_points) {
+			oc->chosen_points = group_score;
+			oc->chosen_memcg = group;
 		}
 	}
 
@@ -5448,6 +5482,33 @@ static ssize_t memory_max_write(struct kernfs_open_file *of,
 	return nbytes;
 }
 
+static int memory_oom_group_show(struct seq_file *m, void *v)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m));
+	bool oom_group = memcg->oom_group;
+
+	seq_printf(m, "%d\n", oom_group);
+
+	return 0;
+}
+
+static ssize_t memory_oom_group_write(struct kernfs_open_file *of,
+				      char *buf, size_t nbytes,
+				      loff_t off)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
+	int oom_group;
+	int err;
+
+	err = kstrtoint(strstrip(buf), 0, &oom_group);
+	if (err)
+		return err;
+
+	memcg->oom_group = oom_group;
+
+	return nbytes;
+}
+
 static int memory_events_show(struct seq_file *m, void *v)
 {
 	struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m));
@@ -5568,6 +5629,12 @@ static struct cftype memory_files[] = {
 		.write = memory_max_write,
 	},
 	{
+		.name = "oom_group",
+		.flags = CFTYPE_NOT_ON_ROOT | CFTYPE_NS_DELEGATABLE,
+		.seq_show = memory_oom_group_show,
+		.write = memory_oom_group_write,
+	},
+	{
 		.name = "events",
 		.flags = CFTYPE_NOT_ON_ROOT,
 		.file_offset = offsetof(struct mem_cgroup, events_file),
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 5b670adb850c..da7461033ee4 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -848,6 +848,17 @@ static void __oom_kill_process(struct task_struct *victim)
 	struct mm_struct *mm;
 	bool can_oom_reap = true;
 
+	/*
+	 * __oom_kill_process() is used to kill all tasks belonging to
+	 * the selected memory cgroup, so we should check that we're not
+	 * trying to kill an unkillable task.
+	 */
+	if (is_global_init(victim) || (victim->flags & PF_KTHREAD) ||
+	    victim->signal->oom_score_adj == OOM_SCORE_ADJ_MIN) {
+		put_task_struct(victim);
+		return;
+	}
+
 	p = find_lock_task_mm(victim);
 	if (!p) {
 		put_task_struct(victim);
@@ -984,21 +995,39 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
 	__oom_kill_process(victim);
 }
 
-static bool oom_kill_memcg_victim(struct oom_control *oc)
+static int oom_kill_memcg_member(struct task_struct *task, void *unused)
 {
+	get_task_struct(task);
+	__oom_kill_process(task);
+	return 0;
+}
 
+static bool oom_kill_memcg_victim(struct oom_control *oc)
+{
 	if (oc->chosen_memcg == NULL || oc->chosen_memcg == INFLIGHT_VICTIM)
 		return oc->chosen_memcg;
 
-	/* Kill a task in the chosen memcg with the biggest memory footprint */
-	oc->chosen_points = 0;
-	oc->chosen_task = NULL;
-	mem_cgroup_scan_tasks(oc->chosen_memcg, oom_evaluate_task, oc);
-
-	if (oc->chosen_task == NULL || oc->chosen_task == INFLIGHT_VICTIM)
-		goto out;
-
-	__oom_kill_process(oc->chosen_task);
+	/*
+	 * If memory.oom_group is set, kill all tasks belonging to the sub-tree
+	 * of the chosen memory cgroup, otherwise kill the task with the biggest
+	 * memory footprint.
+	 */
+	if (mem_cgroup_oom_group(oc->chosen_memcg)) {
+		mem_cgroup_scan_tasks(oc->chosen_memcg, oom_kill_memcg_member,
+				      NULL);
+		/* We have one or more terminating processes at this point. */
+		oc->chosen_task = INFLIGHT_VICTIM;
+	} else {
+		oc->chosen_points = 0;
+		oc->chosen_task = NULL;
+		mem_cgroup_scan_tasks(oc->chosen_memcg, oom_evaluate_task, oc);
+
+		if (oc->chosen_task == NULL ||
+		    oc->chosen_task == INFLIGHT_VICTIM)
+			goto out;
+
+		__oom_kill_process(oc->chosen_task);
+	}
 
 out:
 	mem_cgroup_put(oc->chosen_memcg);
-- 
2.13.6


* [RESEND v12 5/6] mm, oom: add cgroup v2 mount option for cgroup-aware OOM killer
  2017-10-19 18:52 [RESEND v12 0/6] cgroup-aware OOM killer Roman Gushchin
                   ` (3 preceding siblings ...)
  2017-10-19 18:52 ` [RESEND v12 4/6] mm, oom: introduce memory.oom_group Roman Gushchin
@ 2017-10-19 18:52 ` Roman Gushchin
  2017-10-19 18:52 ` [RESEND v12 6/6] mm, oom, docs: describe the " Roman Gushchin
  2017-10-19 19:45 ` [RESEND v12 0/6] " Johannes Weiner
  6 siblings, 0 replies; 32+ messages in thread
From: Roman Gushchin @ 2017-10-19 18:52 UTC (permalink / raw)
  To: linux-mm
  Cc: Roman Gushchin, Michal Hocko, Vladimir Davydov, Johannes Weiner,
	Tetsuo Handa, David Rientjes, Andrew Morton, Tejun Heo,
	kernel-team, cgroups, linux-doc, linux-kernel

Add a "groupoom" cgroup v2 mount option to enable the cgroup-aware
OOM killer. If not set, OOM victim selection is performed in
a "traditional" per-process way.

The behavior can be changed dynamically by remounting the cgroupfs.

Signed-off-by: Roman Gushchin <guro@fb.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: David Rientjes <rientjes@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: kernel-team@fb.com
Cc: cgroups@vger.kernel.org
Cc: linux-doc@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
---
 include/linux/cgroup-defs.h |  5 +++++
 kernel/cgroup/cgroup.c      | 10 ++++++++++
 mm/memcontrol.c             |  3 +++
 3 files changed, 18 insertions(+)

diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index 3e55bbd31ad1..cae5343a8b21 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -80,6 +80,11 @@ enum {
 	 * Enable cpuset controller in v1 cgroup to use v2 behavior.
 	 */
 	CGRP_ROOT_CPUSET_V2_MODE = (1 << 4),
+
+	/*
+	 * Enable cgroup-aware OOM killer.
+	 */
+	CGRP_GROUP_OOM = (1 << 5),
 };
 
 /* cftype->flags */
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index c7086c8835da..0e1685ca1d7b 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -1709,6 +1709,9 @@ static int parse_cgroup_root_flags(char *data, unsigned int *root_flags)
 		if (!strcmp(token, "nsdelegate")) {
 			*root_flags |= CGRP_ROOT_NS_DELEGATE;
 			continue;
+		} else if (!strcmp(token, "groupoom")) {
+			*root_flags |= CGRP_GROUP_OOM;
+			continue;
 		}
 
 		pr_err("cgroup2: unknown option \"%s\"\n", token);
@@ -1725,6 +1728,11 @@ static void apply_cgroup_root_flags(unsigned int root_flags)
 			cgrp_dfl_root.flags |= CGRP_ROOT_NS_DELEGATE;
 		else
 			cgrp_dfl_root.flags &= ~CGRP_ROOT_NS_DELEGATE;
+
+		if (root_flags & CGRP_GROUP_OOM)
+			cgrp_dfl_root.flags |= CGRP_GROUP_OOM;
+		else
+			cgrp_dfl_root.flags &= ~CGRP_GROUP_OOM;
 	}
 }
 
@@ -1732,6 +1740,8 @@ static int cgroup_show_options(struct seq_file *seq, struct kernfs_root *kf_root
 {
 	if (cgrp_dfl_root.flags & CGRP_ROOT_NS_DELEGATE)
 		seq_puts(seq, ",nsdelegate");
+	if (cgrp_dfl_root.flags & CGRP_GROUP_OOM)
+		seq_puts(seq, ",groupoom");
 	return 0;
 }
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index ad10dbdf723b..eb1e15385782 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2875,6 +2875,9 @@ bool mem_cgroup_select_oom_victim(struct oom_control *oc)
 	if (!cgroup_subsys_on_dfl(memory_cgrp_subsys))
 		return false;
 
+	if (!(cgrp_dfl_root.flags & CGRP_GROUP_OOM))
+		return false;
+
 	if (oc->memcg)
 		root = oc->memcg;
 	else
-- 
2.13.6


* [RESEND v12 6/6] mm, oom, docs: describe the cgroup-aware OOM killer
  2017-10-19 18:52 [RESEND v12 0/6] cgroup-aware OOM killer Roman Gushchin
                   ` (4 preceding siblings ...)
  2017-10-19 18:52 ` [RESEND v12 5/6] mm, oom: add cgroup v2 mount option for cgroup-aware OOM killer Roman Gushchin
@ 2017-10-19 18:52 ` Roman Gushchin
  2017-10-19 19:45 ` [RESEND v12 0/6] " Johannes Weiner
  6 siblings, 0 replies; 32+ messages in thread
From: Roman Gushchin @ 2017-10-19 18:52 UTC (permalink / raw)
  To: linux-mm
  Cc: Roman Gushchin, Michal Hocko, Vladimir Davydov, Tetsuo Handa,
	Andrew Morton, David Rientjes, Tejun Heo, kernel-team, cgroups,
	linux-doc, linux-kernel

Document the cgroup-aware OOM killer.

Signed-off-by: Roman Gushchin <guro@fb.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: kernel-team@fb.com
Cc: cgroups@vger.kernel.org
Cc: linux-doc@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
---
 Documentation/cgroup-v2.txt | 51 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 51 insertions(+)

diff --git a/Documentation/cgroup-v2.txt b/Documentation/cgroup-v2.txt
index 0bbdc720dd7c..69db5bf9c580 100644
--- a/Documentation/cgroup-v2.txt
+++ b/Documentation/cgroup-v2.txt
@@ -48,6 +48,7 @@ v1 is available under Documentation/cgroup-v1/.
        5-2-1. Memory Interface Files
        5-2-2. Usage Guidelines
        5-2-3. Memory Ownership
+       5-2-4. OOM Killer
      5-3. IO
        5-3-1. IO Interface Files
        5-3-2. Writeback
@@ -1031,6 +1032,28 @@ PAGE_SIZE multiple when read back.
 	high limit is used and monitored properly, this limit's
 	utility is limited to providing the final safety net.
 
+  memory.oom_group
+
+	A read-write single value file which exists on non-root
+	cgroups.  The default is "0".
+
+	If set, the OOM killer will consider the memory cgroup as an
+	indivisible memory consumer and compare it with other memory
+	consumers by its memory footprint.
+	If such a memory cgroup is selected as an OOM victim, all
+	processes belonging to it or to its descendants will be killed.
+
+	This applies to system-wide OOM conditions and to reaching
+	the hard memory limit of the cgroup or any of its ancestors.
+	If the OOM condition happens in a descendant cgroup with its own
+	memory limit, that memory cgroup can't be considered
+	as an OOM victim, and the OOM killer will not kill all of its
+	tasks.
+
+	Also, the OOM killer respects the /proc/pid/oom_score_adj value of
+	-1000, and will never kill such an unkillable task, even if
+	memory.oom_group is set.
+
   memory.events
 	A read-only flat-keyed file which exists on non-root cgroups.
 	The following entries are defined.  Unless specified
@@ -1234,6 +1257,34 @@ to be accessed repeatedly by other cgroups, it may make sense to use
 POSIX_FADV_DONTNEED to relinquish the ownership of memory areas
 belonging to the affected files to ensure correct memory ownership.
 
+OOM Killer
+~~~~~~~~~~
+
+The cgroup v2 memory controller implements a cgroup-aware OOM killer,
+which means that it treats cgroups as first-class OOM entities.
+
+Under OOM conditions the memory controller tries to make the best
+choice of a victim, looking for the memory cgroup with the largest
+memory footprint, considering leaf cgroups and cgroups with the
+memory.oom_group option set, which are treated as indivisible
+memory consumers.
+
+By default, the OOM killer will kill the biggest task in the selected
+memory cgroup. A user can change this behavior by enabling
+the per-cgroup memory.oom_group option. If set, it causes
+the OOM killer to kill all processes attached to the cgroup,
+except processes with oom_score_adj set to -1000.
+
+This affects both system- and cgroup-wide OOMs. For a cgroup-wide OOM
+the memory controller considers only cgroups belonging to the sub-tree
+of the OOM'ing cgroup.
+
+The root cgroup is treated as a leaf memory cgroup, so it's compared
+with other leaf memory cgroups and cgroups with the oom_group option set.
+
+If there are no cgroups with the memory controller enabled,
+the OOM killer uses the "traditional" process-based approach.
+
 
 IO
 --
-- 
2.13.6


* Re: [RESEND v12 3/6] mm, oom: cgroup-aware OOM killer
  2017-10-19 18:52 ` [RESEND v12 3/6] mm, oom: cgroup-aware OOM killer Roman Gushchin
@ 2017-10-19 19:30   ` Michal Hocko
  2017-10-31 15:04   ` Shakeel Butt
  1 sibling, 0 replies; 32+ messages in thread
From: Michal Hocko @ 2017-10-19 19:30 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: linux-mm, Vladimir Davydov, Tetsuo Handa, David Rientjes,
	Andrew Morton, Tejun Heo, kernel-team, cgroups, linux-doc,
	linux-kernel

On Thu 19-10-17 19:52:15, Roman Gushchin wrote:
> Traditionally, the OOM killer is operating on a process level.
> Under oom conditions, it finds a process with the highest oom score
> and kills it.
> 
> This behavior doesn't suit well the system with many running
> containers:
> 
> 1) There is no fairness between containers. A small container with
> few large processes will be chosen over a large one with huge
> number of small processes.
> 
> 2) Containers often do not expect that some random process inside
> will be killed. In many cases much safer behavior is to kill
> all tasks in the container. Traditionally, this was implemented
> in userspace, but doing it in the kernel has some advantages,
> especially in a case of a system-wide OOM.
> 
> To address these issues, the cgroup-aware OOM killer is introduced.
> 
> This patch introduces the core functionality: an ability to select
> a memory cgroup as an OOM victim. Under OOM conditions the OOM killer
> looks for the biggest leaf memory cgroup and kills the biggest
> task belonging to it.
> 
> The following patches will extend this functionality to consider
> non-leaf memory cgroups as OOM victims, and also provide an ability
> to kill all tasks belonging to the victim cgroup.
> 
> The root cgroup is treated as a leaf memory cgroup, so it's score
> is compared with other leaf memory cgroups.
> Due to memcg statistics implementation a special approximation
> is used for estimating oom_score of root memory cgroup: we sum
> oom_score of the belonging processes (or, to be more precise,
> tasks owning their mm structures).
> 
> Signed-off-by: Roman Gushchin <guro@fb.com>
> Acked-by: Michal Hocko <mhocko@suse.com>

Just to make it clear: my ack is conditional on the opt-in which is
implemented later in the series. Strictly speaking, the system would
behave differently during a bisection and that might lead to
confusion. I guess it would be better to simply disable this feature
until we have the means to enable it. But I do not care strongly
about this.

There is another thing that I am more concerned about. Usually you
should drop an ack when making further changes, or at least call them
out, so that the reviewer is aware of them.  In this particular case I am
worried about the fallback code we have discussed previously

[...]
> @@ -1080,27 +1102,39 @@ bool out_of_memory(struct oom_control *oc)
>  	    current->mm && !oom_unkillable_task(current, NULL, oc->nodemask) &&
>  	    current->signal->oom_score_adj != OOM_SCORE_ADJ_MIN) {
>  		get_task_struct(current);
> -		oc->chosen = current;
> +		oc->chosen_task = current;
>  		oom_kill_process(oc, "Out of memory (oom_kill_allocating_task)");
>  		return true;
>  	}
>  
> +	if (mem_cgroup_select_oom_victim(oc)) {
> +		if (oom_kill_memcg_victim(oc))
> +		    delay = true;
> +
> +		goto out;
> +	}
> +
[...]
> +out:
> +	/*
> +	 * Give the killed process a good chance to exit before trying
> +	 * to allocate memory again.
> +	 */
> +	if (delay)
> +		schedule_timeout_killable(1);
> +
> +	return !!oc->chosen_task;
>  }

this basically means that if you manage to select a memcg victim but
then are not able to select any task in that memcg, you would
return false from out_of_memory and that has other consequences. Namely,
__alloc_pages_may_oom will not set did_some_progress and so the
allocation path will fail. While this scenario is not very likely, we
should behave better. Your previous implementation (which I've acked)
did fall back to the standard oom killer path, which is the safest
option. Maybe we can do better, but let's be robust now and clever later.
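
For illustration only, one shape such a fallback could take in
out_of_memory(), based on the helpers introduced in this series (this
is a sketch of the idea, not code proposed anywhere in the thread):

	if (mem_cgroup_select_oom_victim(oc) && oom_kill_memcg_victim(oc)) {
		delay = true;
		goto out;
	}

	/*
	 * Either no memcg victim was selected or no task in it could be
	 * killed: fall back to the traditional per-process selection.
	 */
	select_bad_process(oc);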
-- 
Michal Hocko
SUSE Labs


* Re: [RESEND v12 0/6] cgroup-aware OOM killer
  2017-10-19 18:52 [RESEND v12 0/6] cgroup-aware OOM killer Roman Gushchin
                   ` (5 preceding siblings ...)
  2017-10-19 18:52 ` [RESEND v12 6/6] mm, oom, docs: describe the " Roman Gushchin
@ 2017-10-19 19:45 ` Johannes Weiner
  2017-10-19 21:09   ` Michal Hocko
  2017-10-23  0:24   ` David Rientjes
  6 siblings, 2 replies; 32+ messages in thread
From: Johannes Weiner @ 2017-10-19 19:45 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, Michal Hocko, Vladimir Davydov, Tetsuo Handa,
	David Rientjes, Tejun Heo, kernel-team, cgroups, linux-doc,
	linux-kernel, Roman Gushchin

On Thu, Oct 19, 2017 at 07:52:12PM +0100, Roman Gushchin wrote:
> This patchset makes the OOM killer cgroup-aware.

Hi Andrew,

I believe this code is ready for merging upstream, and it seems Michal
is in agreement. There are two main things to consider, however.

David would have really liked for this patchset to include knobs to
influence how the algorithm picks cgroup victims. The rest of us
agreed that this is beyond the scope of these patches, that the
patches don't need it to be useful, and that there is nothing
preventing anyone from adding configurability later on. David
subsequently nacked the series as he considers it incomplete. Neither
Michal nor I see technical merit in David's nack.

Michal acked the implementation, but on the condition that the new
behavior be opt-in, to not surprise existing users. I *think* we agree
that respecting the cgroup topography during global OOM is what we
should have been doing when cgroups were initially introduced; where
we disagree is that I think users shouldn't have to opt in to
improvements. We have made much more invasive changes to the victim
selection without actual regressions in the past. Further, this change
only applies to mounts of the new cgroup2. Tejun also wasn't convinced
of the risk for regression, and too would prefer cgroup-awareness to
be the default in cgroup2. I would ask for patch 5/6 to be dropped.

Thanks


* Re: [RESEND v12 0/6] cgroup-aware OOM killer
  2017-10-19 19:45 ` [RESEND v12 0/6] " Johannes Weiner
@ 2017-10-19 21:09   ` Michal Hocko
  2017-10-23  0:24   ` David Rientjes
  1 sibling, 0 replies; 32+ messages in thread
From: Michal Hocko @ 2017-10-19 21:09 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, linux-mm, Vladimir Davydov, Tetsuo Handa,
	David Rientjes, Tejun Heo, kernel-team, cgroups, linux-doc,
	linux-kernel, Roman Gushchin

On Thu 19-10-17 15:45:34, Johannes Weiner wrote:
> On Thu, Oct 19, 2017 at 07:52:12PM +0100, Roman Gushchin wrote:
> > This patchset makes the OOM killer cgroup-aware.
> 
> Hi Andrew,
> 
> I believe this code is ready for merging upstream, and it seems Michal
> is in agreement. There are two main things to consider, however.
> 
> David would have really liked for this patchset to include knobs to
> influence how the algorithm picks cgroup victims. The rest of us
> agreed that this is beyond the scope of these patches, that the
> patches don't need it to be useful, and that there is nothing
> preventing anyone from adding configurability later on. David
> subsequently nacked the series as he considers it incomplete. Neither
> Michal nor I see technical merit in David's nack.

agreed

> Michal acked the implementation, but on the condition that the new
> behavior be opt-in, to not surprise existing users.

and just to make it clear I have also said I will _not_ nack if that is
not the case.

> I *think* we agree
> that respecting the cgroup topography during global OOM is what we
> should have been doing when cgroups were initially introduced;

We do not agree here though. I am not convinced that respecting the
cgroup topography is a universal win. It is true that there is no best
OOM victim selection strategy, but what we have currently is the simplest
option and, as such, the most robust one. I can tell from the past years'
experience that many of those clever heuristics actually contributed to
lockups and non-deterministic behavior.

> where
> we disagree is that I think users shouldn't have to opt in to
> improvements. We have done much more invasive changes to the victim
> selection without actual regressions in the past. Further, this change
> only applies to mounts of the new cgroup2.

which basically means that the behavior will change under many users'
feet, because the respective cgroup configuration is chosen by somebody
else (e.g. systemd), so I do not really buy "only v2 behavior"

> Tejun also wasn't convinced
> of the risk for regression, and too would prefer cgroup-awareness to
> be the default in cgroup2. I would ask for patch 5/6 to be dropped.

-- 
Michal Hocko
SUSE Labs


* Re: [RESEND v12 0/6] cgroup-aware OOM killer
  2017-10-19 19:45 ` [RESEND v12 0/6] " Johannes Weiner
  2017-10-19 21:09   ` Michal Hocko
@ 2017-10-23  0:24   ` David Rientjes
  2017-10-23 11:49     ` Michal Hocko
  2017-10-26 14:24     ` Johannes Weiner
  1 sibling, 2 replies; 32+ messages in thread
From: David Rientjes @ 2017-10-23  0:24 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, linux-mm, Michal Hocko, Vladimir Davydov,
	Tetsuo Handa, Tejun Heo, kernel-team, cgroups, linux-doc,
	linux-kernel, Roman Gushchin

On Thu, 19 Oct 2017, Johannes Weiner wrote:

> David would have really liked for this patchset to include knobs to
> influence how the algorithm picks cgroup victims. The rest of us
> agreed that this is beyond the scope of these patches, that the
> patches don't need it to be useful, and that there is nothing
> preventing anyone from adding configurability later on. David
> subsequently nacked the series as he considers it incomplete. Neither
> Michal nor I see technical merit in David's nack.
> 

The nack is for three reasons:

 (1) unfair comparison of root mem cgroup usage to bias against that mem 
     cgroup from oom kill in system oom conditions,

 (2) the ability of users to completely evade the oom killer by attaching
     all processes to child cgroups either purposefully or unpurposefully,
     and

 (3) the inability of userspace to effectively control oom victim  
     selection.

For (1), the difference in v12 is adding the rss of all processes attached 
to the root mem cgroup as its evaluation.  This is not the same criterion 
that child cgroups are evaluated on, and they are compared using that 
bias.  It is very trivial to provide a fair comparison, as was suggested in 
v11.
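
To make the asymmetry concrete with made-up numbers, taking page cache as
one example of such an additional memory class: a process with, say, 200MB
of anonymous rss that has also pulled 800MB of page cache into memory
contributes roughly 200MB when the root is scored by summing per-task rss,
but the leaf memcg it is moved into is charged for the page cache as well
and scores roughly 1GB - so the move alone makes it several times more
attractive as a victim.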

For (2), users who do

	for i in $(cat cgroup.procs); do mkdir $i; echo $i > $i/cgroup.procs; done

can completely evade the oom killer, and this may be part of their standard 
operating procedure for restricting resources with cgroups for controllers 
other than the mem cgroup.  Again, it's an unfair comparison to all other 
cgroups on the system.
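
Spelled out, the loop above does roughly the following (run from whichever
cgroup currently holds the tasks; the path is made up):

        cd /sys/fs/cgroup/workload
        for i in $(cat cgroup.procs); do
                mkdir $i                     # one child cgroup per task
                echo $i > $i/cgroup.procs    # move the task into it
        done
        # the job's footprint is now split across many small leaves
        # instead of showing up as one large cgroup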

For (3), users need the ability to protect important cgroups, such as 
protecting a cgroup that is limited to 50% of system memory.  They need 
the ability to kill processes from other cgroups to protect these 
important processes.  This is nothing new: the oom killer has always 
provided the ability to bias the selection away from important processes.

There were follow-up emails on all of these points where very trivial 
changes were suggested to address all three of these issues, changes which 
Roman has implemented in one form or another in previous iterations, with 
the bonus that no accounting to the root mem cgroup needs to be done.

> Michal acked the implementation, but on the condition that the new
> behavior be opt-in, to not surprise existing users. I *think* we agree
> that respecting the cgroup topography during global OOM is what we
> should have been doing when cgroups were initially introduced; where
> we disagree is that I think users shouldn't have to opt in to
> improvements. We have done much more invasive changes to the victim
> selection without actual regressions in the past. Further, this change
> only applies to mounts of the new cgroup2. Tejun also wasn't convinced
> of the risk for regression, and too would prefer cgroup-awareness to
> be the default in cgroup2. I would ask for patch 5/6 to be dropped.
> 

I agree with Michal that the new victim selection should be opt-in with 
CGRP_GROUP_OOM.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RESEND v12 0/6] cgroup-aware OOM killer
  2017-10-23  0:24   ` David Rientjes
@ 2017-10-23 11:49     ` Michal Hocko
  2017-10-25 20:12       ` David Rientjes
  2017-10-26 14:24     ` Johannes Weiner
  1 sibling, 1 reply; 32+ messages in thread
From: Michal Hocko @ 2017-10-23 11:49 UTC (permalink / raw)
  To: David Rientjes
  Cc: Johannes Weiner, Andrew Morton, linux-mm, Vladimir Davydov,
	Tetsuo Handa, Tejun Heo, kernel-team, cgroups, linux-doc,
	linux-kernel, Roman Gushchin

On Sun 22-10-17 17:24:51, David Rientjes wrote:
> On Thu, 19 Oct 2017, Johannes Weiner wrote:
> 
> > David would have really liked for this patchset to include knobs to
> > influence how the algorithm picks cgroup victims. The rest of us
> > agreed that this is beyond the scope of these patches, that the
> > patches don't need it to be useful, and that there is nothing
> > preventing anyone from adding configurability later on. David
> > subsequently nacked the series as he considers it incomplete. Neither
> > Michal nor I see technical merit in David's nack.
> > 
> 
> The nack is for three reasons:
> 
>  (1) unfair comparison of root mem cgroup usage to bias against that mem 
>      cgroup from oom kill in system oom conditions,

Most users who are going to use this feature right now will have
most of the userspace in their containers rather than in the root
memcg. The root memcg will always be special and as such there will
never be a universal best way to handle it. We should try to satisfy most
of the usecases. I would consider this something that is open for further
discussion but nothing that should stand in the way.
 
>  (2) the ability of users to completely evade the oom killer by attaching
>      all processes to child cgroups either purposefully or unpurposefully,
>      and

This doesn't differ from the current state, where a task can purposefully
or unpurposefully hide itself from the global OOM killer by spawning
new processes.
 
>  (3) the inability of userspace to effectively control oom victim  
>      selection.

This is not requested by the current usecase, and it has been pointed out
that this will be possible to implement on top of the foundation of this
patchset.

So again, nothing to nack the work as is.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RESEND v12 0/6] cgroup-aware OOM killer
  2017-10-23 11:49     ` Michal Hocko
@ 2017-10-25 20:12       ` David Rientjes
  0 siblings, 0 replies; 32+ messages in thread
From: David Rientjes @ 2017-10-25 20:12 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Johannes Weiner, Andrew Morton, linux-mm, Vladimir Davydov,
	Tetsuo Handa, Tejun Heo, kernel-team, cgroups, linux-doc,
	linux-kernel, Roman Gushchin

On Mon, 23 Oct 2017, Michal Hocko wrote:

> On Sun 22-10-17 17:24:51, David Rientjes wrote:
> > On Thu, 19 Oct 2017, Johannes Weiner wrote:
> > 
> > > David would have really liked for this patchset to include knobs to
> > > influence how the algorithm picks cgroup victims. The rest of us
> > > agreed that this is beyond the scope of these patches, that the
> > > patches don't need it to be useful, and that there is nothing
> > > preventing anyone from adding configurability later on. David
> > > subsequently nacked the series as he considers it incomplete. Neither
> > > Michal nor I see technical merit in David's nack.
> > > 
> > 
> > The nack is for three reasons:
> > 
> >  (1) unfair comparison of root mem cgroup usage to bias against that mem 
> >      cgroup from oom kill in system oom conditions,
> 
> Most users who are going to use this feature right now will have
> most of the userspace in their containers rather than in the root
> memcg. The root memcg will always be special and as such there will
> never be a universal best way to handle it. We should to satisfy most of
> usecases. I would consider this something that is an open for a further
> discussion but nothing that should stand in the way.
>  
> >  (2) the ability of users to completely evade the oom killer by attaching
> >      all processes to child cgroups either purposefully or unpurposefully,
> >      and
> 
> This doesn't differ from the current state where a task can purposefully
> or unpurposefully hide itself from the global memory killer by spawning
> new processes.
>  

It cannot hide from the global oom killer if this patchset is used because 
it cannot hide its memory usage beneath cgroup levels.  This comment is in 
support of accounting memory usage up the hierarchy.

> >  (3) the inability of userspace to effectively control oom victim  
> >      selection.
> 
> this is not requested by the current usecase and it has been pointed out
> that this will be possible to implement on top of the foundation of this
> patchset.
> 

There's no reason to not present a complete patchset.  Userspace needs the 
ability to bias against or prefer processes (or cgroups, in this case).  
That's been the case with oom_adj in the past and oom_score_adj with the 
rewritten heuristic.  It's trivial to implement, and the only pending 
suggestion for providing this influence involves a slightly different 
scoring mechanism than this patchset: it goes back to accounting memory up 
the hierarchy, as Roman initially implemented, and then biasing between 
cgroups based on an oom_score_adj.  So the proposed influence mechanism 
cannot be implemented on top of this patchset as is, and that gives more 
reason why we cannot merge incomplete patches that can't be extended in the 
future.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RESEND v12 0/6] cgroup-aware OOM killer
  2017-10-23  0:24   ` David Rientjes
  2017-10-23 11:49     ` Michal Hocko
@ 2017-10-26 14:24     ` Johannes Weiner
  2017-10-26 21:03       ` David Rientjes
  1 sibling, 1 reply; 32+ messages in thread
From: Johannes Weiner @ 2017-10-26 14:24 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, linux-mm, Michal Hocko, Vladimir Davydov,
	Tetsuo Handa, Tejun Heo, kernel-team, cgroups, linux-doc,
	linux-kernel, Roman Gushchin

On Sun, Oct 22, 2017 at 05:24:51PM -0700, David Rientjes wrote:
> On Thu, 19 Oct 2017, Johannes Weiner wrote:
> 
> > David would have really liked for this patchset to include knobs to
> > influence how the algorithm picks cgroup victims. The rest of us
> > agreed that this is beyond the scope of these patches, that the
> > patches don't need it to be useful, and that there is nothing
> > preventing anyone from adding configurability later on. David
> > subsequently nacked the series as he considers it incomplete. Neither
> > Michal nor I see technical merit in David's nack.
> > 
> 
> The nack is for three reasons:
> 
>  (1) unfair comparison of root mem cgroup usage to bias against that mem 
>      cgroup from oom kill in system oom conditions,
> 
>  (2) the ability of users to completely evade the oom killer by attaching
>      all processes to child cgroups either purposefully or unpurposefully,
>      and
> 
>  (3) the inability of userspace to effectively control oom victim  
>      selection.

My apologies if my summary was too reductionist.

That being said, the arguments you repeat here have come up in
previous threads and been responded to. This doesn't change my
conclusion that your NAK is bogus.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RESEND v12 0/6] cgroup-aware OOM killer
  2017-10-26 14:24     ` Johannes Weiner
@ 2017-10-26 21:03       ` David Rientjes
  2017-10-27  9:31         ` Roman Gushchin
  2017-10-27 20:05         ` Johannes Weiner
  0 siblings, 2 replies; 32+ messages in thread
From: David Rientjes @ 2017-10-26 21:03 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, linux-mm, Michal Hocko, Vladimir Davydov,
	Tetsuo Handa, Tejun Heo, kernel-team, cgroups, linux-doc,
	linux-kernel, Roman Gushchin

On Thu, 26 Oct 2017, Johannes Weiner wrote:

> > The nack is for three reasons:
> > 
> >  (1) unfair comparison of root mem cgroup usage to bias against that mem 
> >      cgroup from oom kill in system oom conditions,
> > 
> >  (2) the ability of users to completely evade the oom killer by attaching
> >      all processes to child cgroups either purposefully or unpurposefully,
> >      and
> > 
> >  (3) the inability of userspace to effectively control oom victim  
> >      selection.
> 
> My apologies if my summary was too reductionist.
> 
> That being said, the arguments you repeat here have come up in
> previous threads and been responded to. This doesn't change my
> conclusion that your NAK is bogus.
> 

They actually haven't been responded to.  Roman was working through v11 and 
made a change to how the root mem cgroup usage was calculated that was 
better than previous iterations, but still not an apples to apples 
comparison with other cgroups.  The problem is that the calculation for 
leaf cgroups includes additional memory classes, so it biases against 
processes that are moved to non-root mem cgroups.  Simply creating mem 
cgroups and attaching processes should not independently cause them to 
become more preferred: it should be a fair comparison between the root mem 
cgroup and the set of leaf mem cgroups as implemented.  That is very 
trivial to do with hierarchical oom cgroup scoring.

Since the ability of userspace to control oom victim selection is not 
addressed whatsoever by this patchset, and the suggested method cannot be 
implemented on top of this patchset as you have argued because it requires 
a change to the heuristic itself, the patchset needs to become complete 
before being mergeable.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RESEND v12 0/6] cgroup-aware OOM killer
  2017-10-26 21:03       ` David Rientjes
@ 2017-10-27  9:31         ` Roman Gushchin
  2017-10-30 21:36           ` David Rientjes
  2017-10-27 20:05         ` Johannes Weiner
  1 sibling, 1 reply; 32+ messages in thread
From: Roman Gushchin @ 2017-10-27  9:31 UTC (permalink / raw)
  To: David Rientjes
  Cc: Johannes Weiner, Andrew Morton, linux-mm, Michal Hocko,
	Vladimir Davydov, Tetsuo Handa, Tejun Heo, kernel-team, cgroups,
	linux-doc, linux-kernel

On Thu, Oct 26, 2017 at 02:03:41PM -0700, David Rientjes wrote:
> On Thu, 26 Oct 2017, Johannes Weiner wrote:
> 
> > > The nack is for three reasons:
> > > 
> > >  (1) unfair comparison of root mem cgroup usage to bias against that mem 
> > >      cgroup from oom kill in system oom conditions,
> > > 
> > >  (2) the ability of users to completely evade the oom killer by attaching
> > >      all processes to child cgroups either purposefully or unpurposefully,
> > >      and
> > > 
> > >  (3) the inability of userspace to effectively control oom victim  
> > >      selection.
> > 
> > My apologies if my summary was too reductionist.
> > 
> > That being said, the arguments you repeat here have come up in
> > previous threads and been responded to. This doesn't change my
> > conclusion that your NAK is bogus.
> > 
> 
> They actually haven't been responded to, Roman was working through v11 and 
> made a change on how the root mem cgroup usage was calculated that was 
> better than previous iterations but still not an apples to apples 
> comparison with other cgroups.  The problem is that it the calculation for 
> leaf cgroups includes additional memory classes, so it biases against 
> processes that are moved to non-root mem cgroups.  Simply creating mem 
> cgroups and attaching processes should not independently cause them to 
> become more preferred: it should be a fair comparison between the root mem 
> cgroup and the set of leaf mem cgroups as implemented.  That is very 
> trivial to do with hierarchical oom cgroup scoring.
> 
> Since the ability of userspace to control oom victim selection is not 
> addressed whatsoever by this patchset, and the suggested method cannot be 
> implemented on top of this patchset as you have argued because it requires 
> a change to the heuristic itself, the patchset needs to become complete 
> before being mergeable.

Hi David!

The thing is that the hierarchical approach (as in v8), which you are pushing,
has its own limitations, which we've discussed in detail earlier. There are
reasons why v12 is different, and we can't really simply go back. I mean, if
there are better ideas on how to resolve the concerns raised in the
discussions around v8, let me know, but ignoring them is not an option.

From my point of view, the idea of selecting the biggest memcg tree-wide is
perfectly fine, as long as it is possible to group memcgs into OOM domains.
As in v12, it can be done by setting the memory.oom_group knob.
It's perfectly possible to extend this by adding an ability to continue OOM
victim selection inside the selected memcg, instead of killing all belonging
tasks, as soon as a practical need arises.
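
For example (a sketch; the path is made up and the memory controller is
assumed to be already enabled for the subtree):

        mkdir /sys/fs/cgroup/job
        echo 1 > /sys/fs/cgroup/job/memory.oom_group
        # if this cgroup is selected as the OOM victim, all of its tasks
        # are killed together instead of a single one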

The way we evaluate the root memory cgroup isn't as important as the
question of how we compare cgroups in the hierarchy. So even if the
hierarchical approach allows a fairer comparison to be implemented, that is
not a reason to choose it, because there are more serious concerns,
discussed earlier.

Thanks!

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RESEND v12 0/6] cgroup-aware OOM killer
  2017-10-26 21:03       ` David Rientjes
  2017-10-27  9:31         ` Roman Gushchin
@ 2017-10-27 20:05         ` Johannes Weiner
  2017-10-31 14:17           ` peter enderborg
  1 sibling, 1 reply; 32+ messages in thread
From: Johannes Weiner @ 2017-10-27 20:05 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, linux-mm, Michal Hocko, Vladimir Davydov,
	Tetsuo Handa, Tejun Heo, kernel-team, cgroups, linux-doc,
	linux-kernel, Roman Gushchin

On Thu, Oct 26, 2017 at 02:03:41PM -0700, David Rientjes wrote:
> On Thu, 26 Oct 2017, Johannes Weiner wrote:
> 
> > > The nack is for three reasons:
> > > 
> > >  (1) unfair comparison of root mem cgroup usage to bias against that mem 
> > >      cgroup from oom kill in system oom conditions,
> > > 
> > >  (2) the ability of users to completely evade the oom killer by attaching
> > >      all processes to child cgroups either purposefully or unpurposefully,
> > >      and
> > > 
> > >  (3) the inability of userspace to effectively control oom victim  
> > >      selection.
> > 
> > My apologies if my summary was too reductionist.
> > 
> > That being said, the arguments you repeat here have come up in
> > previous threads and been responded to. This doesn't change my
> > conclusion that your NAK is bogus.
> 
> They actually haven't been responded to, Roman was working through v11 and 
> made a change on how the root mem cgroup usage was calculated that was 
> better than previous iterations but still not an apples to apples 
> comparison with other cgroups.  The problem is that it the calculation for 
> leaf cgroups includes additional memory classes, so it biases against 
> processes that are moved to non-root mem cgroups.  Simply creating mem 
> cgroups and attaching processes should not independently cause them to 
> become more preferred: it should be a fair comparison between the root mem 
> cgroup and the set of leaf mem cgroups as implemented.  That is very 
> trivial to do with hierarchical oom cgroup scoring.

There is absolutely no value in your repeating the same stuff over and
over again without considering what other people are telling you.

Hierarchical oom scoring has other downsides, and most of us agree
that they aren't preferable to the differences in scoring the root
vs scoring other cgroups - in particular because the root cannot be
controlled, doesn't even have local statistics, and so is unlikely to
contain important work on a containerized system. Getting the ballpark
right for the vast majority of usecases is more than good enough here.

> Since the ability of userspace to control oom victim selection is not 
> addressed whatsoever by this patchset, and the suggested method cannot be 
> implemented on top of this patchset as you have argued because it requires 
> a change to the heuristic itself, the patchset needs to become complete 
> before being mergeable.

It is complete. It just isn't a drop-in replacement for what you've
been doing out-of-tree for years. Stop making your problem everybody
else's problem.

You can change the heuristics later, as you have done before. Or
you can add another configuration flag and we can phase out the old
mode, like we do all the time.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RESEND v12 0/6] cgroup-aware OOM killer
  2017-10-27  9:31         ` Roman Gushchin
@ 2017-10-30 21:36           ` David Rientjes
  2017-10-31  7:54             ` Michal Hocko
  0 siblings, 1 reply; 32+ messages in thread
From: David Rientjes @ 2017-10-30 21:36 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Johannes Weiner, Andrew Morton, linux-mm, Michal Hocko,
	Vladimir Davydov, Tetsuo Handa, Tejun Heo, kernel-team, cgroups,
	linux-doc, linux-kernel

On Fri, 27 Oct 2017, Roman Gushchin wrote:

> The thing is that the hierarchical approach (as in v8), which are you pushing,
> has it's own limitations, which we've discussed in details earlier. There are
> reasons why v12 is different, and we can't really simple go back. I mean if
> there are better ideas how to resolve concerns raised in discussions around v8,
> let me know, but ignoring them is not an option.
> 

I'm not ignoring them, I have stated that we need the ability to protect 
important cgroups on the system without oom disabling all attached 
processes.  If that is implemented as a memory.oom_score_adj with the same 
semantics as /proc/pid/oom_score_adj, i.e. a proportion of available 
memory (the limit), it can also address the issues pointed out with the 
hierarchical approach in v8.  If this is not the case, could you elaborate 
on what your exact concern is and why we do not care that users can 
completely circumvent victim selection by creating child cgroups for other 
controllers?
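
As an illustration of the proposed (hypothetical) knob: /proc/pid/oom_score_adj
effectively adds adj/1000 of the allowable memory to a task's badness score,
so a memory.oom_score_adj of -500 on a cgroup whose OOM domain spans 32GB
would, by the same semantics, discount that cgroup's score by
500/1000 * 32GB = 16GB, protecting it unless it uses far more than that.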

Since the ability to protect important cgroups on the system may require a 
heuristic change, I think it should be solved now rather than constantly 
insisting that we can make this patchset complete later and in the 
meantime force the user to set all attached processes to be oom disabled.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RESEND v12 0/6] cgroup-aware OOM killer
  2017-10-30 21:36           ` David Rientjes
@ 2017-10-31  7:54             ` Michal Hocko
  2017-10-31 22:21               ` David Rientjes
  0 siblings, 1 reply; 32+ messages in thread
From: Michal Hocko @ 2017-10-31  7:54 UTC (permalink / raw)
  To: David Rientjes
  Cc: Roman Gushchin, Johannes Weiner, Andrew Morton, linux-mm,
	Vladimir Davydov, Tetsuo Handa, Tejun Heo, kernel-team, cgroups,
	linux-doc, linux-kernel

On Mon 30-10-17 14:36:39, David Rientjes wrote:
> On Fri, 27 Oct 2017, Roman Gushchin wrote:
> 
> > The thing is that the hierarchical approach (as in v8), which are you pushing,
> > has it's own limitations, which we've discussed in details earlier. There are
> > reasons why v12 is different, and we can't really simple go back. I mean if
> > there are better ideas how to resolve concerns raised in discussions around v8,
> > let me know, but ignoring them is not an option.
> > 
> 
> I'm not ignoring them, I have stated that we need the ability to protect 
> important cgroups on the system without oom disabling all attached 
> processes.  If that is implemented as a memory.oom_score_adj with the same 
> semantics as /proc/pid/oom_score_adj, i.e. a proportion of available 
> memory (the limit), it can also address the issues pointed out with the 
> hierarchical approach in v8.

No, it cannot, and it would be a terrible interface to have as well. You
do not want to permanently tune oom_score_adj to compensate for
structural restrictions on the hierarchy.

> If this is not the case, could you elaborate 
> on what your exact concern is and why we do not care that users can 
> completely circumvent victim selection by creating child cgroups for other 
> controllers?
> 
> Since the ability to protect important cgroups on the system may require a 
> heuristic change, I think it should be solved now rather than constantly 
> insisting that we can make this patchset complete later and in the 
> meantime force the user to set all attached processes to be oom disabled.

I believe, and Roman has pointed that out already as well, that further
improvements can be implemented as an add-on, without changing user-visible
behavior. If you disagree, then you had better come with solid proof
that all of us are wrong and that a reasonable semantic cannot be achieved
that way.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RESEND v12 0/6] cgroup-aware OOM killer
  2017-10-27 20:05         ` Johannes Weiner
@ 2017-10-31 14:17           ` peter enderborg
  2017-10-31 14:34             ` Michal Hocko
  0 siblings, 1 reply; 32+ messages in thread
From: peter enderborg @ 2017-10-31 14:17 UTC (permalink / raw)
  To: Johannes Weiner, David Rientjes
  Cc: Andrew Morton, linux-mm, Michal Hocko, Vladimir Davydov,
	Tetsuo Handa, Tejun Heo, kernel-team, cgroups, linux-doc,
	linux-kernel, Roman Gushchin

On 10/27/2017 10:05 PM, Johannes Weiner wrote:
> On Thu, Oct 26, 2017 at 02:03:41PM -0700, David Rientjes wrote:
>> On Thu, 26 Oct 2017, Johannes Weiner wrote:
>>
>>>> The nack is for three reasons:
>>>>
>>>>  (1) unfair comparison of root mem cgroup usage to bias against that mem 
>>>>      cgroup from oom kill in system oom conditions,
>>>>
>>>>  (2) the ability of users to completely evade the oom killer by attaching
>>>>      all processes to child cgroups either purposefully or unpurposefully,
>>>>      and
>>>>
>>>>  (3) the inability of userspace to effectively control oom victim  
>>>>      selection.
>>> My apologies if my summary was too reductionist.
>>>
>>> That being said, the arguments you repeat here have come up in
>>> previous threads and been responded to. This doesn't change my
>>> conclusion that your NAK is bogus.
>> They actually haven't been responded to, Roman was working through v11 and 
>> made a change on how the root mem cgroup usage was calculated that was 
>> better than previous iterations but still not an apples to apples 
>> comparison with other cgroups.  The problem is that it the calculation for 
>> leaf cgroups includes additional memory classes, so it biases against 
>> processes that are moved to non-root mem cgroups.  Simply creating mem 
>> cgroups and attaching processes should not independently cause them to 
>> become more preferred: it should be a fair comparison between the root mem 
>> cgroup and the set of leaf mem cgroups as implemented.  That is very 
>> trivial to do with hierarchical oom cgroup scoring.
> There is absolutely no value in your repeating the same stuff over and
> over again without considering what other people are telling you.
>
> Hierarchical oom scoring has other downsides, and most of us agree
> that they aren't preferable over the differences in scoring the root
> vs scoring other cgroups - in particular because the root cannot be
> controlled, doesn't even have local statistics, and so is unlikely to
> contain important work on a containerized system. Getting the ballpark
> right for the vast majority of usecases is more than good enough here.
>
>> Since the ability of userspace to control oom victim selection is not 
>> addressed whatsoever by this patchset, and the suggested method cannot be 
>> implemented on top of this patchset as you have argued because it requires 
>> a change to the heuristic itself, the patchset needs to become complete 
>> before being mergeable.
> It is complete. It just isn't a drop-in replacement for what you've
> been doing out-of-tree for years. Stop making your problem everybody
> else's problem.
>
> You can change the the heuristics later, as you have done before. Or
> you can add another configuration flag and we can phase out the old
> mode, like we do all the time.
>
I think this problem is related to the removal of the lowmemorykiller,
where this is the life-line when user-space fails for some reason.

So I guess quite a few will have this problem.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RESEND v12 0/6] cgroup-aware OOM killer
  2017-10-31 14:17           ` peter enderborg
@ 2017-10-31 14:34             ` Michal Hocko
  2017-10-31 15:07               ` peter enderborg
  0 siblings, 1 reply; 32+ messages in thread
From: Michal Hocko @ 2017-10-31 14:34 UTC (permalink / raw)
  To: peter enderborg
  Cc: Johannes Weiner, David Rientjes, Andrew Morton, linux-mm,
	Vladimir Davydov, Tetsuo Handa, Tejun Heo, kernel-team, cgroups,
	linux-doc, linux-kernel, Roman Gushchin

On Tue 31-10-17 15:17:11, peter enderborg wrote:
> On 10/27/2017 10:05 PM, Johannes Weiner wrote:
> > On Thu, Oct 26, 2017 at 02:03:41PM -0700, David Rientjes wrote:
> >> On Thu, 26 Oct 2017, Johannes Weiner wrote:
> >>
> >>>> The nack is for three reasons:
> >>>>
> >>>>  (1) unfair comparison of root mem cgroup usage to bias against that mem 
> >>>>      cgroup from oom kill in system oom conditions,
> >>>>
> >>>>  (2) the ability of users to completely evade the oom killer by attaching
> >>>>      all processes to child cgroups either purposefully or unpurposefully,
> >>>>      and
> >>>>
> >>>>  (3) the inability of userspace to effectively control oom victim  
> >>>>      selection.
> >>> My apologies if my summary was too reductionist.
> >>>
> >>> That being said, the arguments you repeat here have come up in
> >>> previous threads and been responded to. This doesn't change my
> >>> conclusion that your NAK is bogus.
> >> They actually haven't been responded to, Roman was working through v11 and 
> >> made a change on how the root mem cgroup usage was calculated that was 
> >> better than previous iterations but still not an apples to apples 
> >> comparison with other cgroups.  The problem is that it the calculation for 
> >> leaf cgroups includes additional memory classes, so it biases against 
> >> processes that are moved to non-root mem cgroups.  Simply creating mem 
> >> cgroups and attaching processes should not independently cause them to 
> >> become more preferred: it should be a fair comparison between the root mem 
> >> cgroup and the set of leaf mem cgroups as implemented.  That is very 
> >> trivial to do with hierarchical oom cgroup scoring.
> > There is absolutely no value in your repeating the same stuff over and
> > over again without considering what other people are telling you.
> >
> > Hierarchical oom scoring has other downsides, and most of us agree
> > that they aren't preferable over the differences in scoring the root
> > vs scoring other cgroups - in particular because the root cannot be
> > controlled, doesn't even have local statistics, and so is unlikely to
> > contain important work on a containerized system. Getting the ballpark
> > right for the vast majority of usecases is more than good enough here.
> >
> >> Since the ability of userspace to control oom victim selection is not 
> >> addressed whatsoever by this patchset, and the suggested method cannot be 
> >> implemented on top of this patchset as you have argued because it requires 
> >> a change to the heuristic itself, the patchset needs to become complete 
> >> before being mergeable.
> > It is complete. It just isn't a drop-in replacement for what you've
> > been doing out-of-tree for years. Stop making your problem everybody
> > else's problem.
> >
> > You can change the the heuristics later, as you have done before. Or
> > you can add another configuration flag and we can phase out the old
> > mode, like we do all the time.
> >
> I think this problem is related to the removal of the lowmemorykiller,
> where this is the life-line when the user-space for some reason fails.
> 
> So I guess quite a few will have this problem.

Could you be more specific please? We are _not_ removing the possibility
of user-space-influenced oom victim selection. You can still use the
_current_ oom selection heuristic. The patch adds a new selection method
which is opt-in, so only those who choose to opt in give up the ability
to influence the victim selection. And as it has been pointed out, this
can be implemented later, so it is not like "this won't be possible
anymore in the future".
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RESEND v12 3/6] mm, oom: cgroup-aware OOM killer
  2017-10-19 18:52 ` [RESEND v12 3/6] mm, oom: cgroup-aware OOM killer Roman Gushchin
  2017-10-19 19:30   ` Michal Hocko
@ 2017-10-31 15:04   ` Shakeel Butt
  2017-10-31 15:29     ` Michal Hocko
  2017-10-31 16:40     ` Johannes Weiner
  1 sibling, 2 replies; 32+ messages in thread
From: Shakeel Butt @ 2017-10-31 15:04 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Linux MM, Vladimir Davydov, Tetsuo Handa, David Rientjes,
	Andrew Morton, Tejun Heo, kernel-team, Cgroups, linux-doc, LKML

> +
> +static void select_victim_memcg(struct mem_cgroup *root, struct oom_control *oc)
> +{
> +       struct mem_cgroup *iter;
> +
> +       oc->chosen_memcg = NULL;
> +       oc->chosen_points = 0;
> +
> +       /*
> +        * The oom_score is calculated for leaf memory cgroups (including
> +        * the root memcg).
> +        */
> +       rcu_read_lock();
> +       for_each_mem_cgroup_tree(iter, root) {
> +               long score;
> +
> +               if (memcg_has_children(iter) && iter != root_mem_cgroup)
> +                       continue;
> +

Cgroup v2 does not support charge migration between memcgs. So there
can be intermediate nodes which may contain the major charge of the
processes in their leaf descendants. Skipping such intermediate nodes
will kind of protect such processes from the oom-killer (they end up
lower on the list to be killed). Is it ok to not handle such a scenario?
If yes, shouldn't we document it?
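
One way this situation can arise in practice (a sketch; paths are made up,
and a cgroup v2 mount at /sys/fs/cgroup with the memory controller already
enabled for A is assumed):

        # tasks were started in A and have already faulted in their working set
        mkdir /sys/fs/cgroup/A/B
        for p in $(cat /sys/fs/cgroup/A/cgroup.procs); do
                echo $p > /sys/fs/cgroup/A/B/cgroup.procs
        done
        echo +memory > /sys/fs/cgroup/A/cgroup.subtree_control
        cat /sys/fs/cgroup/A/memory.current    # the old charges stay in A
        cat /sys/fs/cgroup/A/B/memory.current  # only memory charged after the move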

> +               score = oom_evaluate_memcg(iter, oc->nodemask, oc->totalpages);
> +
> +               /*
> +                * Ignore empty and non-eligible memory cgroups.
> +                */
> +               if (score == 0)
> +                       continue;
> +
> +               /*
> +                * If there are inflight OOM victims, we don't need
> +                * to look further for new victims.
> +                */
> +               if (score == -1) {
> +                       oc->chosen_memcg = INFLIGHT_VICTIM;
> +                       mem_cgroup_iter_break(root, iter);
> +                       break;
> +               }
> +
> +               if (score > oc->chosen_points) {
> +                       oc->chosen_points = score;
> +                       oc->chosen_memcg = iter;
> +               }
> +       }
> +
> +       if (oc->chosen_memcg && oc->chosen_memcg != INFLIGHT_VICTIM)
> +               css_get(&oc->chosen_memcg->css);
> +
> +       rcu_read_unlock();
> +}

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RESEND v12 0/6] cgroup-aware OOM killer
  2017-10-31 14:34             ` Michal Hocko
@ 2017-10-31 15:07               ` peter enderborg
  0 siblings, 0 replies; 32+ messages in thread
From: peter enderborg @ 2017-10-31 15:07 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Johannes Weiner, David Rientjes, Andrew Morton, linux-mm,
	Vladimir Davydov, Tetsuo Handa, Tejun Heo, kernel-team, cgroups,
	linux-doc, linux-kernel, Roman Gushchin

On 10/31/2017 03:34 PM, Michal Hocko wrote:
> On Tue 31-10-17 15:17:11, peter enderborg wrote:
>> On 10/27/2017 10:05 PM, Johannes Weiner wrote:
>>> On Thu, Oct 26, 2017 at 02:03:41PM -0700, David Rientjes wrote:
>>>> On Thu, 26 Oct 2017, Johannes Weiner wrote:
>>>>
>>>>>> The nack is for three reasons:
>>>>>>
>>>>>>  (1) unfair comparison of root mem cgroup usage to bias against that mem 
>>>>>>      cgroup from oom kill in system oom conditions,
>>>>>>
>>>>>>  (2) the ability of users to completely evade the oom killer by attaching
>>>>>>      all processes to child cgroups either purposefully or unpurposefully,
>>>>>>      and
>>>>>>
>>>>>>  (3) the inability of userspace to effectively control oom victim  
>>>>>>      selection.
>>>>> My apologies if my summary was too reductionist.
>>>>>
>>>>> That being said, the arguments you repeat here have come up in
>>>>> previous threads and been responded to. This doesn't change my
>>>>> conclusion that your NAK is bogus.
>>>> They actually haven't been responded to, Roman was working through v11 and 
>>>> made a change on how the root mem cgroup usage was calculated that was 
>>>> better than previous iterations but still not an apples to apples 
>>>> comparison with other cgroups.  The problem is that it the calculation for 
>>>> leaf cgroups includes additional memory classes, so it biases against 
>>>> processes that are moved to non-root mem cgroups.  Simply creating mem 
>>>> cgroups and attaching processes should not independently cause them to 
>>>> become more preferred: it should be a fair comparison between the root mem 
>>>> cgroup and the set of leaf mem cgroups as implemented.  That is very 
>>>> trivial to do with hierarchical oom cgroup scoring.
>>> There is absolutely no value in your repeating the same stuff over and
>>> over again without considering what other people are telling you.
>>>
>>> Hierarchical oom scoring has other downsides, and most of us agree
>>> that they aren't preferable over the differences in scoring the root
>>> vs scoring other cgroups - in particular because the root cannot be
>>> controlled, doesn't even have local statistics, and so is unlikely to
>>> contain important work on a containerized system. Getting the ballpark
>>> right for the vast majority of usecases is more than good enough here.
>>>
>>>> Since the ability of userspace to control oom victim selection is not 
>>>> addressed whatsoever by this patchset, and the suggested method cannot be 
>>>> implemented on top of this patchset as you have argued because it requires 
>>>> a change to the heuristic itself, the patchset needs to become complete 
>>>> before being mergeable.
>>> It is complete. It just isn't a drop-in replacement for what you've
>>> been doing out-of-tree for years. Stop making your problem everybody
>>> else's problem.
>>>
>>> You can change the the heuristics later, as you have done before. Or
>>> you can add another configuration flag and we can phase out the old
>>> mode, like we do all the time.
>>>
>> I think this problem is related to the removal of the lowmemorykiller,
>> where this is the life-line when the user-space for some reason fails.
>>
>> So I guess quite a few will have this problem.
> Could you be more specific please? We are _not_ removing possibility of
> the user space influenced oom victim selection. You can still use the
> _current_ oom selection heuristic. The patch adds a new selection method
> which is opt-in so only those who want to opt-in will not be allowed to
> have any influence on the victim selection. And as it has been pointed
> out this can be implemented later so it is not like "this won't be
> possible anymore in future"

I think the idea is to have an implementation that matches the lowmemorykiller selection heuristic.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RESEND v12 3/6] mm, oom: cgroup-aware OOM killer
  2017-10-31 15:04   ` Shakeel Butt
@ 2017-10-31 15:29     ` Michal Hocko
  2017-10-31 19:06       ` Michal Hocko
  2017-10-31 16:40     ` Johannes Weiner
  1 sibling, 1 reply; 32+ messages in thread
From: Michal Hocko @ 2017-10-31 15:29 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Roman Gushchin, Linux MM, Vladimir Davydov, Tetsuo Handa,
	David Rientjes, Andrew Morton, Tejun Heo, kernel-team, Cgroups,
	linux-doc, LKML

On Tue 31-10-17 08:04:19, Shakeel Butt wrote:
> > +
> > +static void select_victim_memcg(struct mem_cgroup *root, struct oom_control *oc)
> > +{
> > +       struct mem_cgroup *iter;
> > +
> > +       oc->chosen_memcg = NULL;
> > +       oc->chosen_points = 0;
> > +
> > +       /*
> > +        * The oom_score is calculated for leaf memory cgroups (including
> > +        * the root memcg).
> > +        */
> > +       rcu_read_lock();
> > +       for_each_mem_cgroup_tree(iter, root) {
> > +               long score;
> > +
> > +               if (memcg_has_children(iter) && iter != root_mem_cgroup)
> > +                       continue;
> > +
> 
> Cgroup v2 does not support charge migration between memcgs. So, there
> can be intermediate nodes which may contain the major charge of the
> processes in their leave descendents. Skipping such intermediate nodes
> will kind of protect such processes from oom-killer (lower on the list
> to be killed). Is it ok to not handle such scenario? If yes, shouldn't
> we document it?

Yes, this is a real problem, and one which is not really solvable
without charge migration. You simply have no clue _who_ owns the
memory, so I assume that admins will need to set up the hierarchy in a way
which allows subgroups to migrate tasks to be oom_group.

Or we might want to allow opt-in for charge migration in v2. To be
honest I wasn't completely happy about removing this functionality
altogether in v2 but there was a strong pushback back then that relying
on the charge migration doesn't have any sound usecase.

Anyway, I agree that documentation should be explicit about that.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RESEND v12 3/6] mm, oom: cgroup-aware OOM killer
  2017-10-31 15:04   ` Shakeel Butt
  2017-10-31 15:29     ` Michal Hocko
@ 2017-10-31 16:40     ` Johannes Weiner
  2017-10-31 17:50       ` Shakeel Butt
  1 sibling, 1 reply; 32+ messages in thread
From: Johannes Weiner @ 2017-10-31 16:40 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Roman Gushchin, Linux MM, Vladimir Davydov, Tetsuo Handa,
	David Rientjes, Andrew Morton, Tejun Heo, kernel-team, Cgroups,
	linux-doc, LKML

On Tue, Oct 31, 2017 at 08:04:19AM -0700, Shakeel Butt wrote:
> > +
> > +static void select_victim_memcg(struct mem_cgroup *root, struct oom_control *oc)
> > +{
> > +       struct mem_cgroup *iter;
> > +
> > +       oc->chosen_memcg = NULL;
> > +       oc->chosen_points = 0;
> > +
> > +       /*
> > +        * The oom_score is calculated for leaf memory cgroups (including
> > +        * the root memcg).
> > +        */
> > +       rcu_read_lock();
> > +       for_each_mem_cgroup_tree(iter, root) {
> > +               long score;
> > +
> > +               if (memcg_has_children(iter) && iter != root_mem_cgroup)
> > +                       continue;
> > +
> 
> Cgroup v2 does not support charge migration between memcgs. So, there
> can be intermediate nodes which may contain the major charge of the
> processes in their leave descendents. Skipping such intermediate nodes
> will kind of protect such processes from oom-killer (lower on the list
> to be killed). Is it ok to not handle such scenario? If yes, shouldn't
> we document it?

Tasks cannot be in intermediate nodes, so the only way you can end up
in a situation like this is to start the tasks first, let them fault in
their full workingset, and then create child groups and move them there.

That has attribution problems much wider than the OOM killer: any
local limits you would set on a leaf cgroup like this ALSO won't
control the memory of its tasks - as it's all sitting in the parent.

We created the "no internal competition" rule exactly to prevent this
situation. To be consistent with that rule, we might want to disallow
the creation of child groups once a cgroup has local memory charges.

It's trivial to change the setup sequence to create the leaf cgroup
first, then launch the workload from within.

Either way, this is nothing specific about the OOM killer.
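
A minimal sketch of that ordering (names are made up; the memory controller
is assumed to be enabled for the parent already):

        mkdir -p /sys/fs/cgroup/job/worker
        echo +memory > /sys/fs/cgroup/job/cgroup.subtree_control
        echo $$ > /sys/fs/cgroup/job/worker/cgroup.procs  # enter the leaf first
        exec ./workload   # everything it faults in is charged to the leaf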

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RESEND v12 3/6] mm, oom: cgroup-aware OOM killer
  2017-10-31 16:40     ` Johannes Weiner
@ 2017-10-31 17:50       ` Shakeel Butt
  2017-10-31 18:44         ` Johannes Weiner
  0 siblings, 1 reply; 32+ messages in thread
From: Shakeel Butt @ 2017-10-31 17:50 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Roman Gushchin, Linux MM, Vladimir Davydov, Tetsuo Handa,
	David Rientjes, Andrew Morton, Tejun Heo, kernel-team, Cgroups,
	linux-doc, LKML

On Tue, Oct 31, 2017 at 9:40 AM, Johannes Weiner <hannes@cmpxchg.org> wrote:
> On Tue, Oct 31, 2017 at 08:04:19AM -0700, Shakeel Butt wrote:
>> > +
>> > +static void select_victim_memcg(struct mem_cgroup *root, struct oom_control *oc)
>> > +{
>> > +       struct mem_cgroup *iter;
>> > +
>> > +       oc->chosen_memcg = NULL;
>> > +       oc->chosen_points = 0;
>> > +
>> > +       /*
>> > +        * The oom_score is calculated for leaf memory cgroups (including
>> > +        * the root memcg).
>> > +        */
>> > +       rcu_read_lock();
>> > +       for_each_mem_cgroup_tree(iter, root) {
>> > +               long score;
>> > +
>> > +               if (memcg_has_children(iter) && iter != root_mem_cgroup)
>> > +                       continue;
>> > +
>>
>> Cgroup v2 does not support charge migration between memcgs. So, there
>> can be intermediate nodes which may contain the major charge of the
>> processes in their leave descendents. Skipping such intermediate nodes
>> will kind of protect such processes from oom-killer (lower on the list
>> to be killed). Is it ok to not handle such scenario? If yes, shouldn't
>> we document it?
>
> Tasks cannot be in intermediate nodes, so the only way you can end up
> in a situation like this is to start tasks fully, let them fault in
> their full workingset, then create child groups and move them there.
>
> That has attribution problems much wider than the OOM killer: any
> local limits you would set on a leaf cgroup like this ALSO won't
> control the memory of its tasks - as it's all sitting in the parent.
>
> We created the "no internal competition" rule exactly to prevent this
> situation.

Rather than the "no internal competition" restriction I think "charge
migration" would have resolved that situation? Also "no internal
competition" restriction (I am assuming 'no internal competition' is
no tasks in internal nodes, please correct me if I am wrong) has made
"charge migration" hard to implement and thus not added in cgroup v2.

I know this is parallel discussion and excuse my ignorance, what are
other reasons behind "no internal competition" specifically for memory
controller?

> To be consistent with that rule, we might want to disallow
> the creation of child groups once a cgroup has local memory charges.
>
> It's trivial to change the setup sequence to create the leaf cgroup
> first, then launch the workload from within.
>

Only if the cgroup hierarchy is centrally controlled and each task's whole
hierarchy is known in advance.

> Either way, this is nothing specific about the OOM killer.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RESEND v12 3/6] mm, oom: cgroup-aware OOM killer
  2017-10-31 17:50       ` Shakeel Butt
@ 2017-10-31 18:44         ` Johannes Weiner
  0 siblings, 0 replies; 32+ messages in thread
From: Johannes Weiner @ 2017-10-31 18:44 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Roman Gushchin, Linux MM, Vladimir Davydov, Tetsuo Handa,
	David Rientjes, Andrew Morton, Tejun Heo, kernel-team, Cgroups,
	linux-doc, LKML

On Tue, Oct 31, 2017 at 10:50:43AM -0700, Shakeel Butt wrote:
> On Tue, Oct 31, 2017 at 9:40 AM, Johannes Weiner <hannes@cmpxchg.org> wrote:
> > On Tue, Oct 31, 2017 at 08:04:19AM -0700, Shakeel Butt wrote:
> >> > +
> >> > +static void select_victim_memcg(struct mem_cgroup *root, struct oom_control *oc)
> >> > +{
> >> > +       struct mem_cgroup *iter;
> >> > +
> >> > +       oc->chosen_memcg = NULL;
> >> > +       oc->chosen_points = 0;
> >> > +
> >> > +       /*
> >> > +        * The oom_score is calculated for leaf memory cgroups (including
> >> > +        * the root memcg).
> >> > +        */
> >> > +       rcu_read_lock();
> >> > +       for_each_mem_cgroup_tree(iter, root) {
> >> > +               long score;
> >> > +
> >> > +               if (memcg_has_children(iter) && iter != root_mem_cgroup)
> >> > +                       continue;
> >> > +
> >>
> >> Cgroup v2 does not support charge migration between memcgs. So, there
> >> can be intermediate nodes which may contain the major charge of the
> >> processes in their leave descendents. Skipping such intermediate nodes
> >> will kind of protect such processes from oom-killer (lower on the list
> >> to be killed). Is it ok to not handle such scenario? If yes, shouldn't
> >> we document it?
> >
> > Tasks cannot be in intermediate nodes, so the only way you can end up
> > in a situation like this is to start tasks fully, let them fault in
> > their full workingset, then create child groups and move them there.
> >
> > That has attribution problems much wider than the OOM killer: any
> > local limits you would set on a leaf cgroup like this ALSO won't
> > control the memory of its tasks - as it's all sitting in the parent.
> >
> > We created the "no internal competition" rule exactly to prevent this
> > situation.
> 
> Rather than the "no internal competition" restriction I think "charge
> migration" would have resolved that situation? Also "no internal
> competition" restriction (I am assuming 'no internal competition' is
> no tasks in internal nodes, please correct me if I am wrong) has made
> "charge migration" hard to implement and thus not added in cgroup v2.
> 
> I know this is parallel discussion and excuse my ignorance, what are
> other reasons behind "no internal competition" specifically for memory
> controller?

Sorry, but this is completely off-topic.

The rationale for these decisions is in Documentation/cgroup-v2.txt.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RESEND v12 3/6] mm, oom: cgroup-aware OOM killer
  2017-10-31 15:29     ` Michal Hocko
@ 2017-10-31 19:06       ` Michal Hocko
  2017-10-31 19:13         ` Michal Hocko
  0 siblings, 1 reply; 32+ messages in thread
From: Michal Hocko @ 2017-10-31 19:06 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Roman Gushchin, Linux MM, Vladimir Davydov, Tetsuo Handa,
	David Rientjes, Andrew Morton, Tejun Heo, kernel-team, Cgroups,
	linux-doc, LKML

On Tue 31-10-17 16:29:23, Michal Hocko wrote:
> On Tue 31-10-17 08:04:19, Shakeel Butt wrote:
> > > +
> > > +static void select_victim_memcg(struct mem_cgroup *root, struct oom_control *oc)
> > > +{
> > > +       struct mem_cgroup *iter;
> > > +
> > > +       oc->chosen_memcg = NULL;
> > > +       oc->chosen_points = 0;
> > > +
> > > +       /*
> > > +        * The oom_score is calculated for leaf memory cgroups (including
> > > +        * the root memcg).
> > > +        */
> > > +       rcu_read_lock();
> > > +       for_each_mem_cgroup_tree(iter, root) {
> > > +               long score;
> > > +
> > > +               if (memcg_has_children(iter) && iter != root_mem_cgroup)
> > > +                       continue;
> > > +
> > 
> > Cgroup v2 does not support charge migration between memcgs. So, there
> > can be intermediate nodes which may contain the major charge of the
> > processes in their leave descendents. Skipping such intermediate nodes
> > will kind of protect such processes from oom-killer (lower on the list
> > to be killed). Is it ok to not handle such scenario? If yes, shouldn't
> > we document it?
> 
> Yes, this is a real problem and the one which is not really solvable
> without the charge migration. You simply have no clue _who_ owns the
> memory so I assume that admins will need to setup the hierarchy which
> allows subgroups to migrate tasks to be oom_group.

Hmm, scratch that. I have completely missed that the memory controller
disables task migration completely in v2. I thought the standard
restriction about write access to the target cgroup and a common
ancestor holds for all controllers, but now I've noticed that we
simply disallow the migration altogether. This wasn't the case before
1f7dd3e5a6e4 ("cgroup: fix handling of multi-destination migration from
subtree_control enabling"), which I wasn't aware of.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RESEND v12 3/6] mm, oom: cgroup-aware OOM killer
  2017-10-31 19:06       ` Michal Hocko
@ 2017-10-31 19:13         ` Michal Hocko
  0 siblings, 0 replies; 32+ messages in thread
From: Michal Hocko @ 2017-10-31 19:13 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Roman Gushchin, Linux MM, Vladimir Davydov, Tetsuo Handa,
	David Rientjes, Andrew Morton, Tejun Heo, kernel-team, Cgroups,
	linux-doc, LKML

On Tue 31-10-17 20:06:44, Michal Hocko wrote:
> On Tue 31-10-17 16:29:23, Michal Hocko wrote:
> > On Tue 31-10-17 08:04:19, Shakeel Butt wrote:
> > > > +
> > > > +static void select_victim_memcg(struct mem_cgroup *root, struct oom_control *oc)
> > > > +{
> > > > +       struct mem_cgroup *iter;
> > > > +
> > > > +       oc->chosen_memcg = NULL;
> > > > +       oc->chosen_points = 0;
> > > > +
> > > > +       /*
> > > > +        * The oom_score is calculated for leaf memory cgroups (including
> > > > +        * the root memcg).
> > > > +        */
> > > > +       rcu_read_lock();
> > > > +       for_each_mem_cgroup_tree(iter, root) {
> > > > +               long score;
> > > > +
> > > > +               if (memcg_has_children(iter) && iter != root_mem_cgroup)
> > > > +                       continue;
> > > > +
> > > 
> > > Cgroup v2 does not support charge migration between memcgs. So, there
> > > can be intermediate nodes which may contain the major charge of the
> > > processes in their leave descendents. Skipping such intermediate nodes
> > > will kind of protect such processes from oom-killer (lower on the list
> > > to be killed). Is it ok to not handle such scenario? If yes, shouldn't
> > > we document it?
> > 
> > Yes, this is a real problem and the one which is not really solvable
> > without the charge migration. You simply have no clue _who_ owns the
> > memory so I assume that admins will need to setup the hierarchy which
> > allows subgroups to migrate tasks to be oom_group.
> 
> Hmm, scratch that. I have completely missed that the memory controller
> disables tasks migration completely in v2. I thought the standard
> restriction about the write access to the target cgroup and a common
> ancestor holds for all controllers but now I've noticed that we
> simply disallow the migration altogether. This wasn't the case before
> 1f7dd3e5a6e4 ("cgroup: fix handling of multi-destination migration from
> subtree_control enabling") which I wasn't aware of.

Blee, brain fart, I have misread the code. We return 0, which is a success,
so can_attach doesn't fail, and so task migration should be allowed
under the standard cgroup restrictions; we just do not migrate charges.

Anyway, time to stop writing emails for me today. Sorry about the
confusion.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RESEND v12 0/6] cgroup-aware OOM killer
  2017-10-31  7:54             ` Michal Hocko
@ 2017-10-31 22:21               ` David Rientjes
  2017-11-01  7:37                 ` Michal Hocko
  0 siblings, 1 reply; 32+ messages in thread
From: David Rientjes @ 2017-10-31 22:21 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Roman Gushchin, Johannes Weiner, Andrew Morton, linux-mm,
	Vladimir Davydov, Tetsuo Handa, Tejun Heo, kernel-team, cgroups,
	linux-doc, linux-kernel

On Tue, 31 Oct 2017, Michal Hocko wrote:

> > I'm not ignoring them, I have stated that we need the ability to protect 
> > important cgroups on the system without oom disabling all attached 
> > processes.  If that is implemented as a memory.oom_score_adj with the same 
> > semantics as /proc/pid/oom_score_adj, i.e. a proportion of available 
> > memory (the limit), it can also address the issues pointed out with the 
> > hierarchical approach in v8.
> 
> No it cannot and it would be a terrible interface to have as well. You
> do not want to permanently tune oom_score_adj to compensate for
> structural restrictions on the hierarchy.
> 

memory.oom_score_adj would never need to be permanently tuned, just as 
/proc/pid/oom_score_adj need never be permanently tuned.  My response was 
an answer to Roman's concern that "v8 has its own limitations," but I 
haven't seen a concrete example where the oom killer is forced to kill 
from the non-preferred cgroup while the user has the power to bias against 
certain cgroups with memory.oom_score_adj.  Do you have such a concrete 
example that we can work with?

> I believe, and Roman has pointed that out as well already, that further
> improvements can be implemented without changing user visible behavior
> as and add-on. If you disagree then you better come with a solid proof
> that all of us wrong and reasonable semantic cannot be achieved that
> way.

We simply cannot determine if improvements can be implemented in the 
future without user-visible changes if those improvements are unknown or 
undecided at this time.  It may require hierarchical accounting when 
making a choice between siblings, as suggested with oom_score_adj.  The 
only thing that we need to agree on is that userspace needs to have some 
kind of influence over victim selection: the oom killer killing an 
important user process is an extremely sensitive thing.  If the patchset 
lacks the ability to provide that influence, and such an ability would 
impact the heuristic overall, it's better to introduce it all together as a 
complete patchset rather than merging an incomplete feature when it's 
known the user needs some control, asking the user to work around it by 
setting all processes to oom disabled in a preferred mem cgroup, and then 
changing the heuristic again.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RESEND v12 0/6] cgroup-aware OOM killer
  2017-10-31 22:21               ` David Rientjes
@ 2017-11-01  7:37                 ` Michal Hocko
  2017-11-01 20:42                   ` David Rientjes
  0 siblings, 1 reply; 32+ messages in thread
From: Michal Hocko @ 2017-11-01  7:37 UTC (permalink / raw)
  To: David Rientjes
  Cc: Roman Gushchin, Johannes Weiner, Andrew Morton, linux-mm,
	Vladimir Davydov, Tetsuo Handa, Tejun Heo, kernel-team, cgroups,
	linux-doc, linux-kernel

On Tue 31-10-17 15:21:23, David Rientjes wrote:
> On Tue, 31 Oct 2017, Michal Hocko wrote:
> 
> > > I'm not ignoring them, I have stated that we need the ability to protect 
> > > important cgroups on the system without oom disabling all attached 
> > > processes.  If that is implemented as a memory.oom_score_adj with the same 
> > > semantics as /proc/pid/oom_score_adj, i.e. a proportion of available 
> > > memory (the limit), it can also address the issues pointed out with the 
> > > hierarchical approach in v8.
> > 
> > No it cannot and it would be a terrible interface to have as well. You
> > do not want to permanently tune oom_score_adj to compensate for
> > structural restrictions on the hierarchy.
> > 
> 
> memory.oom_score_adj would never need to be permanently tuned, just as 
> /proc/pid/oom_score_adj need never be permanently tuned.  My response was 
> an answer to Roman's concern that "v8 has it's own limitations," but I 
> haven't seen a concrete example where the oom killer is forced to kill 
> from the non-preferred cgroup while the user has power of biasing against 
> certain cgroups with memory.oom_score_adj.  Do you have such a concrete 
> example that we can work with?

Yes, the one with structural requirements due to other controllers or
for general organizational purposes, where hierarchical (sibling-oriented)
comparison just doesn't work. Take the students, teachers, admins
example. You definitely do not want to kill from the students'
subgroups by default just because this is the largest entity type.
Tuning memory.oom_score_adj doesn't work for that usecase as soon as
new subgroups come and go.
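
A sketch of the kind of layout meant here (all names are illustrative):
per-student subgroups appear and disappear with logins, which is what makes
hand-tuned per-group biases unmanageable:

	cd /sys/fs/cgroup
	mkdir -p admins teachers students
	# per-student subgroups come and go with logins
	mkdir students/alice students/bob
	rmdir students/bob		# bob logs out
	mkdir students/carol		# carol logs in
	# any oom bias written into students/alice, students/bob, ... would
	# have to be re-applied by hand for every new subgroup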

> > I believe, and Roman has pointed that out as well already, that further
> > improvements can be implemented without changing user-visible behavior
> > as an add-on. If you disagree, then you had better come with solid proof
> > that all of us are wrong and a reasonable semantic cannot be achieved that
> > way.
> 
> We simply cannot determine if improvements can be implemented in the 
> future without user-visible changes if those improvements are unknown or 
> undecided at this time.

Come on. There have been at least two examples of how this could be
achieved. One is priority based and would use cumulative memory
consumption if set on intermediate nodes, which would allow you to
compare siblings. The other was to add a new knob which would make
an intermediate node an aggregate for accounting purposes.
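
Neither knob exists in the posted series; purely to illustrate the two
directions, with hypothetical file names:

	# 1) Hypothetical priority knob: an intermediate node carrying a
	#    priority is compared against its siblings on cumulative usage.
	echo 10 > /sys/fs/cgroup/students/memory.oom_priority
	# 2) Hypothetical aggregation knob: the whole subtree is treated as a
	#    single entity for OOM accounting, however many subgroups it has.
	echo 1 > /sys/fs/cgroup/students/memory.oom_aggregate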

> It may require hierarchical accounting when 
> making a choice between siblings, as suggested with oom_score_adj.  The 
> only thing that we need to agree on is that userspace needs to have some 
> kind of influence over victim selection: the oom killer killing an 
> important user process is an extremely sensitive thing.

And I am pretty sure we have already agreed that something like this is
useful for some usecases, and nobody objected to it being merged in the
future. All we are saying now is that this is not in the scope of _this_
patch series, because the vast majority of usecases simply do not care
about influencing the oom selection. They only care about having
per-cgroup behavior and/or the kill-all semantic. I really do not
understand what is hard about that.
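
(The kill-all semantic in question is the memory.oom_group knob from patch
4/6 of this series; a minimal usage sketch, with an illustrative cgroup
path:)

	# opt a workload's cgroup into all-or-nothing OOM handling: if the
	# group is selected as the OOM victim, all of its tasks are killed
	echo 1 > /sys/fs/cgroup/job/memory.oom_group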

> If the patchset 
> lacks the ability to have that influence, and such an ability would impact 
> the heuristic overall, it's better to introduce that together as a 
> complete patchset rather than merging an incomplete feature when it's 
> known the user needs some control, asking the user to work around it by
> setting all processes to oom disabled in a preferred mem cgroup, and then 
> changing the heuristic again.

I believe we can introduce new knobs without influencing those who do
not set them and I haven't heard any argument which would say otherwise.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RESEND v12 0/6] cgroup-aware OOM killer
  2017-11-01  7:37                 ` Michal Hocko
@ 2017-11-01 20:42                   ` David Rientjes
  0 siblings, 0 replies; 32+ messages in thread
From: David Rientjes @ 2017-11-01 20:42 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Roman Gushchin, Johannes Weiner, Andrew Morton, linux-mm,
	Vladimir Davydov, Tetsuo Handa, Tejun Heo, kernel-team, cgroups,
	linux-doc, linux-kernel

On Wed, 1 Nov 2017, Michal Hocko wrote:

> > memory.oom_score_adj would never need to be permanently tuned, just as 
> > /proc/pid/oom_score_adj need never be permanently tuned.  My response was 
> > an answer to Roman's concern that "v8 has it's own limitations," but I 
> > haven't seen a concrete example where the oom killer is forced to kill 
> > from the non-preferred cgroup while the user has the power to bias against
> > certain cgroups with memory.oom_score_adj.  Do you have such a concrete 
> > example that we can work with?
> 
> Yes, the one with structural requirements due to other controllers or
> for general organizational purposes, where hierarchical (sibling-oriented)
> comparison just doesn't work.

You mean where an admin or user does

	for i in $(cat cgroup.procs); do mkdir $i; echo $i > $i/cgroup.procs; done

to place constraints on processes with other controllers but unknowingly
circumvents oom kill selection completely?  That's one of my concerns
that hasn't been addressed, and I believe it can only be done with
hierarchical accounting.
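
A sketch of the effect (all numbers are made up): splitting one cgroup's
tasks into per-process children divides its charged memory across many
small groups, each of which then compares favorably against a neighbor
during leaf-to-leaf victim selection:

	# before the loop above: the job cgroup would clearly be selected
	#   /sys/fs/cgroup/job	~12G across three tasks
	#   /sys/fs/cgroup/other	 ~4G
	# after the loop: three ~4G children, none obviously bigger than /other
	for d in /sys/fs/cgroup/job/*/; do
		cat "$d/memory.current"		# ~4G each (illustrative values)
	done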

> Take the students, teachers,
> admins example. You definitely do not want to kill from the students'
> subgroups by default just because this is the largest entity type.
> Tuning memory.oom_score_adj doesn't work for that usecase as soon as
> new subgroups come and go.
> 

With hierarchical accounting, that solves all three concerns that I have 
enumerated: the top-level organizer knows the limits imposed.  It is 
therefore *trivial* to prefer /students by biasing against it with 
memory.oom_score_adj over other top-level mem cgroups and still kill 
from /students if it goes over a certain threshold of memory.  With 
hierarchical accounting and memory.oom_score_adj, userspace could even 
use it to determine which student to kill from.  If /admins is 
using more memory than expected, it gets biased against with the same 
memory.oom_score_adj.  The point is that the top-level organizer that is 
structuring the mem cgroup tree knows the limits imposed and the preference 
of /admins over /students, or vice versa.  It also doesn't allow /students 
to circumvent victim selection by creating child mem cgroups.
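
As a sketch of that configuration (memory.oom_score_adj does not exist in
the posted series; names and values below are illustrative only):

	# /students and /admins are each accounted as the sum of their whole
	# subtree, so new per-student subgroups change nothing
	echo -500 > /sys/fs/cgroup/students/memory.oom_score_adj	# protect
	echo 200 > /sys/fs/cgroup/admins/memory.oom_score_adj	# prefer to kill
	# because the bias is a proportion of the limit, a /students subtree
	# that grows far past expectations still becomes the selected victim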

If this is missing your point, please draw the hierarchy you are 
suggesting and show which mem cgroup is preferred by the admin and does 
not allow the user to circumvent that priority.

> > We simply cannot determine if improvements can be implemented in the 
> > future without user-visible changes if those improvements are unknown or 
> > undecided at this time.
> 
> Come on. There have been at least two examples of how this could be
> achieved. One is priority based and would use cumulative memory
> consumption if set on intermediate nodes, which would allow you to
> compare siblings. The other was to add a new knob which would make
> an intermediate node an aggregate for accounting purposes.
> 

We don't need a memory.oom_group, memory.oom_hierarchical_accounting, 
memory.oom_priority, and memory.oom_score_adj.  I believe 
memory.oom_score_adj suffices.  I don't think it is in our best interest 
or the users' best interest to maintain many different combinations of how 
an oom victim is selected.  I believe all the power needed is provided by 
a memory.oom_score_adj since cgroup memory usage is being considered, just 
as /proc/pid/oom_score_adj exists when process memory usage is being 
considered.  It seems very intuitive.

In the interest of not polluting the namespace, not allowing users to 
circumvent the decisions of the oom killer, and allowing userspace to 
influence oom victim selection, we need this to be implemented now 
rather than later, since it may affect the heuristic under consideration. 
We should have a complete patchset rather than be backed into a corner 
by decisions made earlier with the rationale of "it can be figured 
out later, let's merge this now."

> And I am pretty sure we have already agreed that something like this is
> useful for some usecases, and nobody objected to it being merged in the
> future. All we are saying now is that this is not in the scope of _this_
> patch series, because the vast majority of usecases simply do not care
> about influencing the oom selection. They only care about having
> per-cgroup behavior and/or the kill-all semantic. I really do not
> understand what is hard about that.
> 

I honestly do not see what hurry we're in or what is so hard to 
understand about implementing the ability of userspace to influence the 
decision making that works well together with the heuristic being 
introduced.

We can stop wasting time arguing about whether it's appropriate to merge an 
incomplete patchset or not and actually fix the three fundamental flaws 
that have been outlined with this approach, or I can fork the patchset and 
introduce it myself as proposed, if that is preferred.

^ permalink raw reply	[flat|nested] 32+ messages in thread

end of thread, other threads:[~2017-11-01 20:42 UTC | newest]

Thread overview: 32+ messages
2017-10-19 18:52 [RESEND v12 0/6] cgroup-aware OOM killer Roman Gushchin
2017-10-19 18:52 ` [RESEND v12 1/6] mm, oom: refactor the oom_kill_process() function Roman Gushchin
2017-10-19 18:52 ` [RESEND v12 2/6] mm: implement mem_cgroup_scan_tasks() for the root memory cgroup Roman Gushchin
2017-10-19 18:52 ` [RESEND v12 3/6] mm, oom: cgroup-aware OOM killer Roman Gushchin
2017-10-19 19:30   ` Michal Hocko
2017-10-31 15:04   ` Shakeel Butt
2017-10-31 15:29     ` Michal Hocko
2017-10-31 19:06       ` Michal Hocko
2017-10-31 19:13         ` Michal Hocko
2017-10-31 16:40     ` Johannes Weiner
2017-10-31 17:50       ` Shakeel Butt
2017-10-31 18:44         ` Johannes Weiner
2017-10-19 18:52 ` [RESEND v12 4/6] mm, oom: introduce memory.oom_group Roman Gushchin
2017-10-19 18:52 ` [RESEND v12 5/6] mm, oom: add cgroup v2 mount option for cgroup-aware OOM killer Roman Gushchin
2017-10-19 18:52 ` [RESEND v12 6/6] mm, oom, docs: describe the " Roman Gushchin
2017-10-19 19:45 ` [RESEND v12 0/6] " Johannes Weiner
2017-10-19 21:09   ` Michal Hocko
2017-10-23  0:24   ` David Rientjes
2017-10-23 11:49     ` Michal Hocko
2017-10-25 20:12       ` David Rientjes
2017-10-26 14:24     ` Johannes Weiner
2017-10-26 21:03       ` David Rientjes
2017-10-27  9:31         ` Roman Gushchin
2017-10-30 21:36           ` David Rientjes
2017-10-31  7:54             ` Michal Hocko
2017-10-31 22:21               ` David Rientjes
2017-11-01  7:37                 ` Michal Hocko
2017-11-01 20:42                   ` David Rientjes
2017-10-27 20:05         ` Johannes Weiner
2017-10-31 14:17           ` peter enderborg
2017-10-31 14:34             ` Michal Hocko
2017-10-31 15:07               ` peter enderborg
