linux-kernel.vger.kernel.org archive mirror
* [PATCH v13 0/7] cgroup-aware OOM killer
@ 2017-11-30 15:28 Roman Gushchin
  2017-11-30 15:28 ` [PATCH v13 1/7] mm, oom: refactor the oom_kill_process() function Roman Gushchin
                   ` (9 more replies)
  0 siblings, 10 replies; 51+ messages in thread
From: Roman Gushchin @ 2017-11-30 15:28 UTC (permalink / raw)
  To: linux-mm
  Cc: Roman Gushchin, Michal Hocko, Vladimir Davydov, Johannes Weiner,
	Tetsuo Handa, David Rientjes, Andrew Morton, Tejun Heo,
	kernel-team, cgroups, linux-doc, linux-kernel, linux-mm

This patchset makes the OOM killer cgroup-aware.

v13:
  - Reverted fallback to per-process OOM as in v11 (asked by Michal)
  - Added entry in cgroup features list
  - Added a note about charge migration
  - Rebase

v12:
  - Root memory cgroup is evaluated based on sum of the oom scores
    of belonging tasks
  - Do not fall back to the per-process behavior if it wasn't
    possible to kill a memcg victim
  - Rebase on top of mm tree

v11:
  - Fixed an issue with skipping the root mem cgroup
    (discovered by Shakeel Butt)
  - Moved a check in __oom_kill_process() to the memory.oom_group
    patch, added corresponding comments
  - Added a note about ignoring tasks with oom_score_adj -1000
    (proposed by Michal Hocko)
  - Rebase on top of mm tree

v10:
  - Separate oom_group introduction into a standalone patch
  - Stop propagating oom_group
  - Make oom_group delegatable
  - Do not try to kill the biggest task first if the whole
    cgroup is going to be killed
  - Stop caching oom_score on struct memcg, optimize victim
    memcg selection
  - Drop dmesg printing (for further refining)
  - Small refactorings and comments added here and there
  - Rebase on top of mm tree

v9:
  - Changed the siblings-to-siblings comparison to a tree-wide search,
    made related refactorings
  - Make oom_group implicitly propagated down the tree
  - Fix an issue with task selection in root cgroup

v8:
  - Do not kill tasks with OOM_SCORE_ADJ -1000
  - Make the whole thing opt-in with cgroup mount option control
  - Drop oom_priority for further discussions
  - Kill the whole cgroup if oom_group is set and its
    memory.max is reached
  - Update docs and commit messages

v7:
  - __oom_kill_process() drops reference to the victim task
  - oom_score_adj -1000 is always respected
  - Renamed oom_kill_all to oom_group
  - Dropped oom_prio range, converted from short to int
  - Added a cgroup v2 mount option to disable cgroup-aware OOM killer
  - Docs updated
  - Rebased on top of mmotm

v6:
  - Renamed oom_control.chosen to oom_control.chosen_task
  - Renamed oom_kill_all_tasks to oom_kill_all
  - Per-node NR_SLAB_UNRECLAIMABLE accounting
  - Several minor fixes and cleanups
  - Docs updated

v5:
  - Rebased on top of Michal Hocko's patches, which changed the way
    OOM victims get access to the memory reserves. Dropped the
    corresponding part of this patchset
  - Separated the oom_kill_process() splitting into a standalone commit
  - Added debug output (suggested by David Rientjes)
  - Some minor fixes

v4:
  - Reworked per-cgroup oom_score_adj into oom_priority
    (based on ideas by David Rientjes)
  - Tasks with oom_score_adj -1000 are never selected if
    oom_kill_all_tasks is not set
  - Memcg victim selection code is reworked, and
    synchronization is based on finding tasks with the OOM victim
    marker, rather than on a global counter
  - Debug output is dropped
  - Refactored TIF_MEMDIE usage

v3:
  - Merged commits 1-4 into 6
  - Separated oom_score_adj logic and debug output into separate commits
  - Fixed swap accounting

v2:
  - Reworked victim selection based on feedback
    from Michal Hocko, Vladimir Davydov and Johannes Weiner
  - "Kill all tasks" is now an opt-in option, by default
    only one process will be killed
  - Added per-cgroup oom_score_adj
  - Refined oom score calculations, suggested by Vladimir Davydov
  - Converted to a patchset

v1:
  https://lkml.org/lkml/2017/5/18/969


Cc: Michal Hocko <mhocko@suse.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: David Rientjes <rientjes@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: kernel-team@fb.com
Cc: cgroups@vger.kernel.org
Cc: linux-doc@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org

Roman Gushchin (7):
  mm, oom: refactor the oom_kill_process() function
  mm: implement mem_cgroup_scan_tasks() for the root memory cgroup
  mm, oom: cgroup-aware OOM killer
  mm, oom: introduce memory.oom_group
  mm, oom: add cgroup v2 mount option for cgroup-aware OOM killer
  mm, oom, docs: describe the cgroup-aware OOM killer
  cgroup: list groupoom in cgroup features

 Documentation/cgroup-v2.txt |  58 ++++++++++
 include/linux/cgroup-defs.h |   5 +
 include/linux/memcontrol.h  |  34 ++++++
 include/linux/oom.h         |  12 ++-
 kernel/cgroup/cgroup.c      |  13 ++-
 mm/memcontrol.c             | 258 +++++++++++++++++++++++++++++++++++++++++++-
 mm/oom_kill.c               | 224 +++++++++++++++++++++++++-------------
 7 files changed, 525 insertions(+), 79 deletions(-)

-- 
2.14.3

^ permalink raw reply	[flat|nested] 51+ messages in thread

* [PATCH v13 1/7] mm, oom: refactor the oom_kill_process() function
  2017-11-30 15:28 [PATCH v13 0/7] cgroup-aware OOM killer Roman Gushchin
@ 2017-11-30 15:28 ` Roman Gushchin
  2017-11-30 15:28 ` [PATCH v13 2/7] mm: implement mem_cgroup_scan_tasks() for the root memory cgroup Roman Gushchin
                   ` (8 subsequent siblings)
  9 siblings, 0 replies; 51+ messages in thread
From: Roman Gushchin @ 2017-11-30 15:28 UTC (permalink / raw)
  To: linux-mm
  Cc: Roman Gushchin, Vladimir Davydov, Tetsuo Handa, David Rientjes,
	Andrew Morton, Tejun Heo, kernel-team, cgroups, linux-doc,
	linux-kernel, linux-mm

The oom_kill_process() function consists of two logical parts:
the first one is responsible for considering the task's children
as potential victims and printing the debug information.
The second half is responsible for sending SIGKILL to all
tasks sharing the mm struct with the given victim.

This commit splits the oom_kill_process() function with
the intention of reusing the second half: __oom_kill_process().

The cgroup-aware OOM killer will kill multiple tasks
belonging to the victim cgroup. We don't need to print
the debug information for each task, nor to play
with task selection (considering the task's children),
so we can't use the existing oom_kill_process().

Signed-off-by: Roman Gushchin <guro@fb.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: David Rientjes <rientjes@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: kernel-team@fb.com
Cc: cgroups@vger.kernel.org
Cc: linux-doc@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
---
 mm/oom_kill.c | 123 +++++++++++++++++++++++++++++++---------------------------
 1 file changed, 65 insertions(+), 58 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 3b0d0fed8480..f041534d77d3 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -814,68 +814,12 @@ static bool task_will_free_mem(struct task_struct *task)
 	return ret;
 }
 
-static void oom_kill_process(struct oom_control *oc, const char *message)
+static void __oom_kill_process(struct task_struct *victim)
 {
-	struct task_struct *p = oc->chosen;
-	unsigned int points = oc->chosen_points;
-	struct task_struct *victim = p;
-	struct task_struct *child;
-	struct task_struct *t;
+	struct task_struct *p;
 	struct mm_struct *mm;
-	unsigned int victim_points = 0;
-	static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
-					      DEFAULT_RATELIMIT_BURST);
 	bool can_oom_reap = true;
 
-	/*
-	 * If the task is already exiting, don't alarm the sysadmin or kill
-	 * its children or threads, just give it access to memory reserves
-	 * so it can die quickly
-	 */
-	task_lock(p);
-	if (task_will_free_mem(p)) {
-		mark_oom_victim(p);
-		wake_oom_reaper(p);
-		task_unlock(p);
-		put_task_struct(p);
-		return;
-	}
-	task_unlock(p);
-
-	if (__ratelimit(&oom_rs))
-		dump_header(oc, p);
-
-	pr_err("%s: Kill process %d (%s) score %u or sacrifice child\n",
-		message, task_pid_nr(p), p->comm, points);
-
-	/*
-	 * If any of p's children has a different mm and is eligible for kill,
-	 * the one with the highest oom_badness() score is sacrificed for its
-	 * parent.  This attempts to lose the minimal amount of work done while
-	 * still freeing memory.
-	 */
-	read_lock(&tasklist_lock);
-	for_each_thread(p, t) {
-		list_for_each_entry(child, &t->children, sibling) {
-			unsigned int child_points;
-
-			if (process_shares_mm(child, p->mm))
-				continue;
-			/*
-			 * oom_badness() returns 0 if the thread is unkillable
-			 */
-			child_points = oom_badness(child,
-				oc->memcg, oc->nodemask, oc->totalpages);
-			if (child_points > victim_points) {
-				put_task_struct(victim);
-				victim = child;
-				victim_points = child_points;
-				get_task_struct(victim);
-			}
-		}
-	}
-	read_unlock(&tasklist_lock);
-
 	p = find_lock_task_mm(victim);
 	if (!p) {
 		put_task_struct(victim);
@@ -949,6 +893,69 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
 }
 #undef K
 
+static void oom_kill_process(struct oom_control *oc, const char *message)
+{
+	struct task_struct *p = oc->chosen;
+	unsigned int points = oc->chosen_points;
+	struct task_struct *victim = p;
+	struct task_struct *child;
+	struct task_struct *t;
+	unsigned int victim_points = 0;
+	static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
+					      DEFAULT_RATELIMIT_BURST);
+
+	/*
+	 * If the task is already exiting, don't alarm the sysadmin or kill
+	 * its children or threads, just give it access to memory reserves
+	 * so it can die quickly
+	 */
+	task_lock(p);
+	if (task_will_free_mem(p)) {
+		mark_oom_victim(p);
+		wake_oom_reaper(p);
+		task_unlock(p);
+		put_task_struct(p);
+		return;
+	}
+	task_unlock(p);
+
+	if (__ratelimit(&oom_rs))
+		dump_header(oc, p);
+
+	pr_err("%s: Kill process %d (%s) score %u or sacrifice child\n",
+		message, task_pid_nr(p), p->comm, points);
+
+	/*
+	 * If any of p's children has a different mm and is eligible for kill,
+	 * the one with the highest oom_badness() score is sacrificed for its
+	 * parent.  This attempts to lose the minimal amount of work done while
+	 * still freeing memory.
+	 */
+	read_lock(&tasklist_lock);
+	for_each_thread(p, t) {
+		list_for_each_entry(child, &t->children, sibling) {
+			unsigned int child_points;
+
+			if (process_shares_mm(child, p->mm))
+				continue;
+			/*
+			 * oom_badness() returns 0 if the thread is unkillable
+			 */
+			child_points = oom_badness(child,
+				oc->memcg, oc->nodemask, oc->totalpages);
+			if (child_points > victim_points) {
+				put_task_struct(victim);
+				victim = child;
+				victim_points = child_points;
+				get_task_struct(victim);
+			}
+		}
+	}
+	read_unlock(&tasklist_lock);
+
+	__oom_kill_process(victim);
+}
+
 /*
  * Determines whether the kernel must panic because of the panic_on_oom sysctl.
  */
-- 
2.14.3


* [PATCH v13 2/7] mm: implement mem_cgroup_scan_tasks() for the root memory cgroup
  2017-11-30 15:28 [PATCH v13 0/7] cgroup-aware OOM killer Roman Gushchin
  2017-11-30 15:28 ` [PATCH v13 1/7] mm, oom: refactor the oom_kill_process() function Roman Gushchin
@ 2017-11-30 15:28 ` Roman Gushchin
  2017-11-30 15:28 ` [PATCH v13 3/7] mm, oom: cgroup-aware OOM killer Roman Gushchin
                   ` (7 subsequent siblings)
  9 siblings, 0 replies; 51+ messages in thread
From: Roman Gushchin @ 2017-11-30 15:28 UTC (permalink / raw)
  To: linux-mm
  Cc: Roman Gushchin, Vladimir Davydov, Tetsuo Handa, Andrew Morton,
	Tejun Heo, kernel-team, cgroups, linux-doc, linux-kernel,
	linux-mm

Implement mem_cgroup_scan_tasks() for the root memory cgroup,
so that the cgroup-aware OOM killer can use it to look for
an OOM victim task in the root memory cgroup.

The root memory cgroup is treated as a leaf cgroup, so only
tasks belonging directly to the root cgroup are iterated over.

This patch doesn't introduce any functional change as
mem_cgroup_scan_tasks() is never called for the root memcg.
This is preparatory work for the cgroup-aware OOM killer,
which will use this function to iterate over tasks belonging
to the root memcg.

Signed-off-by: Roman Gushchin <guro@fb.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: kernel-team@fb.com
Cc: cgroups@vger.kernel.org
Cc: linux-doc@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
---
 mm/memcontrol.c | 7 +++----
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index a4aac306ebe3..55fbda60cef6 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -888,7 +888,8 @@ static void invalidate_reclaim_iterators(struct mem_cgroup *dead_memcg)
  * value, the function breaks the iteration loop and returns the value.
  * Otherwise, it will iterate over all tasks and return 0.
  *
- * This function must not be called for the root memory cgroup.
+ * If memcg is the root memory cgroup, this function will iterate only
+ * over tasks belonging directly to the root memory cgroup.
  */
 int mem_cgroup_scan_tasks(struct mem_cgroup *memcg,
 			  int (*fn)(struct task_struct *, void *), void *arg)
@@ -896,8 +897,6 @@ int mem_cgroup_scan_tasks(struct mem_cgroup *memcg,
 	struct mem_cgroup *iter;
 	int ret = 0;
 
-	BUG_ON(memcg == root_mem_cgroup);
-
 	for_each_mem_cgroup_tree(iter, memcg) {
 		struct css_task_iter it;
 		struct task_struct *task;
@@ -906,7 +905,7 @@ int mem_cgroup_scan_tasks(struct mem_cgroup *memcg,
 		while (!ret && (task = css_task_iter_next(&it)))
 			ret = fn(task, arg);
 		css_task_iter_end(&it);
-		if (ret) {
+		if (ret || memcg == root_mem_cgroup) {
 			mem_cgroup_iter_break(memcg, iter);
 			break;
 		}
-- 
2.14.3


* [PATCH v13 3/7] mm, oom: cgroup-aware OOM killer
  2017-11-30 15:28 [PATCH v13 0/7] cgroup-aware OOM killer Roman Gushchin
  2017-11-30 15:28 ` [PATCH v13 1/7] mm, oom: refactor the oom_kill_process() function Roman Gushchin
  2017-11-30 15:28 ` [PATCH v13 2/7] mm: implement mem_cgroup_scan_tasks() for the root memory cgroup Roman Gushchin
@ 2017-11-30 15:28 ` Roman Gushchin
  2017-12-01  8:35   ` Michal Hocko
  2017-12-07  1:24   ` Andrew Morton
  2017-11-30 15:28 ` [PATCH v13 4/7] mm, oom: introduce memory.oom_group Roman Gushchin
                   ` (6 subsequent siblings)
  9 siblings, 2 replies; 51+ messages in thread
From: Roman Gushchin @ 2017-11-30 15:28 UTC (permalink / raw)
  To: linux-mm
  Cc: Roman Gushchin, Michal Hocko, Johannes Weiner, Vladimir Davydov,
	Tetsuo Handa, David Rientjes, Andrew Morton, Tejun Heo,
	kernel-team, cgroups, linux-doc, linux-kernel, linux-mm

Traditionally, the OOM killer operates at the process level.
Under OOM conditions, it finds the process with the highest
oom score and kills it.

This behavior doesn't suit systems with many running
containers well:

1) There is no fairness between containers. A small container
with a few large processes will be chosen over a large one with
a huge number of small processes.

2) Containers often do not expect that some random process
inside them will be killed. In many cases a much safer behavior
is to kill all tasks in the container. Traditionally, this was
implemented in userspace, but doing it in the kernel has some
advantages, especially in the case of a system-wide OOM.

To address these issues, the cgroup-aware OOM killer is introduced.

This patch introduces the core functionality: the ability to
select a memory cgroup as an OOM victim. Under OOM conditions
the OOM killer looks for the biggest leaf memory cgroup and
kills the biggest task belonging to it.

The following patches will extend this functionality to consider
non-leaf memory cgroups as OOM victims, and also provide an ability
to kill all tasks belonging to the victim cgroup.

The root cgroup is treated as a leaf memory cgroup, so its
score is compared with those of the other leaf memory cgroups.
Due to the memcg statistics implementation, a special
approximation is used for estimating the oom_score of the root
memory cgroup: we sum the oom_score of the belonging processes
(or, to be more precise, of the tasks owning their mm structs).

Signed-off-by: Roman Gushchin <guro@fb.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: David Rientjes <rientjes@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: kernel-team@fb.com
Cc: cgroups@vger.kernel.org
Cc: linux-doc@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
---
 include/linux/memcontrol.h |  17 +++++
 include/linux/oom.h        |  12 ++-
 mm/memcontrol.c            | 181 +++++++++++++++++++++++++++++++++++++++++++++
 mm/oom_kill.c              |  84 +++++++++++++++------
 4 files changed, 272 insertions(+), 22 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 882046863581..cb4db659a8b5 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -35,6 +35,7 @@ struct mem_cgroup;
 struct page;
 struct mm_struct;
 struct kmem_cache;
+struct oom_control;
 
 /* Cgroup-specific page state, on top of universal node page state */
 enum memcg_stat_item {
@@ -344,6 +345,11 @@ struct mem_cgroup *mem_cgroup_from_css(struct cgroup_subsys_state *css){
 	return css ? container_of(css, struct mem_cgroup, css) : NULL;
 }
 
+static inline void mem_cgroup_put(struct mem_cgroup *memcg)
+{
+	css_put(&memcg->css);
+}
+
 #define mem_cgroup_from_counter(counter, member)	\
 	container_of(counter, struct mem_cgroup, member)
 
@@ -482,6 +488,8 @@ static inline bool task_in_memcg_oom(struct task_struct *p)
 
 bool mem_cgroup_oom_synchronize(bool wait);
 
+bool mem_cgroup_select_oom_victim(struct oom_control *oc);
+
 #ifdef CONFIG_MEMCG_SWAP
 extern int do_swap_account;
 #endif
@@ -781,6 +789,10 @@ static inline bool task_in_mem_cgroup(struct task_struct *task,
 	return true;
 }
 
+static inline void mem_cgroup_put(struct mem_cgroup *memcg)
+{
+}
+
 static inline struct mem_cgroup *
 mem_cgroup_iter(struct mem_cgroup *root,
 		struct mem_cgroup *prev,
@@ -973,6 +985,11 @@ static inline
 void count_memcg_event_mm(struct mm_struct *mm, enum vm_event_item idx)
 {
 }
+
+static inline bool mem_cgroup_select_oom_victim(struct oom_control *oc)
+{
+	return false;
+}
 #endif /* CONFIG_MEMCG */
 
 /* idx can be of type enum memcg_stat_item or node_stat_item */
diff --git a/include/linux/oom.h b/include/linux/oom.h
index 27cd36b762b5..10f495c8454d 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -10,6 +10,13 @@
 #include <linux/sched/coredump.h> /* MMF_* */
 #include <linux/mm.h> /* VM_FAULT* */
 
+
+/*
+ * Special value returned by victim selection functions to indicate
+ * that there are inflight OOM victims.
+ */
+#define INFLIGHT_VICTIM ((void *)-1UL)
+
 struct zonelist;
 struct notifier_block;
 struct mem_cgroup;
@@ -51,7 +58,8 @@ struct oom_control {
 
 	/* Used by oom implementation, do not set */
 	unsigned long totalpages;
-	struct task_struct *chosen;
+	struct task_struct *chosen_task;
+	struct mem_cgroup *chosen_memcg;
 	unsigned long chosen_points;
 };
 
@@ -115,6 +123,8 @@ extern struct task_struct *find_lock_task_mm(struct task_struct *p);
 
 extern struct page *alloc_pages_before_oomkill(const struct oom_control *oc);
 
+extern int oom_evaluate_task(struct task_struct *task, void *arg);
+
 /* sysctls */
 extern int sysctl_oom_dump_tasks;
 extern int sysctl_oom_kill_allocating_task;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 55fbda60cef6..592ffb1c98a7 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2664,6 +2664,187 @@ static inline bool memcg_has_children(struct mem_cgroup *memcg)
 	return ret;
 }
 
+static long memcg_oom_badness(struct mem_cgroup *memcg,
+			      const nodemask_t *nodemask,
+			      unsigned long totalpages)
+{
+	long points = 0;
+	int nid;
+	pg_data_t *pgdat;
+
+	for_each_node_state(nid, N_MEMORY) {
+		if (nodemask && !node_isset(nid, *nodemask))
+			continue;
+
+		points += mem_cgroup_node_nr_lru_pages(memcg, nid,
+				LRU_ALL_ANON | BIT(LRU_UNEVICTABLE));
+
+		pgdat = NODE_DATA(nid);
+		points += lruvec_page_state(mem_cgroup_lruvec(pgdat, memcg),
+					    NR_SLAB_UNRECLAIMABLE);
+	}
+
+	points += memcg_page_state(memcg, MEMCG_KERNEL_STACK_KB) /
+		(PAGE_SIZE / 1024);
+	points += memcg_page_state(memcg, MEMCG_SOCK);
+	points += memcg_page_state(memcg, MEMCG_SWAP);
+
+	return points;
+}
+
+/*
+ * Checks if the given memcg is a valid OOM victim and returns a number,
+ * which means the following:
+ *   -1: there are inflight OOM victim tasks, belonging to the memcg
+ *    0: memcg is not eligible, e.g. all belonging tasks are protected
+ *       by oom_score_adj set to OOM_SCORE_ADJ_MIN
+ *   >0: memcg is eligible, and the returned value is an estimation
+ *       of the memory footprint
+ */
+static long oom_evaluate_memcg(struct mem_cgroup *memcg,
+			       const nodemask_t *nodemask,
+			       unsigned long totalpages)
+{
+	struct css_task_iter it;
+	struct task_struct *task;
+	int eligible = 0;
+
+	/*
+	 * Root memory cgroup is a special case:
+	 * we don't have necessary stats to evaluate it exactly as
+ * leaf memory cgroups, so we approximate its oom_score
+	 * by summing oom_score of all belonging tasks, which are
+	 * owners of their mm structs.
+	 *
+	 * If there are inflight OOM victim tasks inside
+	 * the root memcg, we return -1.
+	 */
+	if (memcg == root_mem_cgroup) {
+		struct css_task_iter it;
+		struct task_struct *task;
+		long score = 0;
+
+		css_task_iter_start(&memcg->css, 0, &it);
+		while ((task = css_task_iter_next(&it))) {
+			if (tsk_is_oom_victim(task) &&
+			    !test_bit(MMF_OOM_SKIP,
+				      &task->signal->oom_mm->flags)) {
+				score = -1;
+				break;
+			}
+
+			task_lock(task);
+			if (!task->mm || task->mm->owner != task) {
+				task_unlock(task);
+				continue;
+			}
+			task_unlock(task);
+
+			score += oom_badness(task, memcg, nodemask,
+					     totalpages);
+		}
+		css_task_iter_end(&it);
+
+		return score;
+	}
+
+	/*
+	 * Memcg is OOM eligible if there are OOM killable tasks inside.
+	 *
+	 * We treat tasks with oom_score_adj set to OOM_SCORE_ADJ_MIN
+	 * as unkillable.
+	 *
+	 * If there are inflight OOM victim tasks inside the memcg,
+	 * we return -1.
+	 */
+	css_task_iter_start(&memcg->css, 0, &it);
+	while ((task = css_task_iter_next(&it))) {
+		if (!eligible &&
+		    task->signal->oom_score_adj != OOM_SCORE_ADJ_MIN)
+			eligible = 1;
+
+		if (tsk_is_oom_victim(task) &&
+		    !test_bit(MMF_OOM_SKIP, &task->signal->oom_mm->flags)) {
+			eligible = -1;
+			break;
+		}
+	}
+	css_task_iter_end(&it);
+
+	if (eligible <= 0)
+		return eligible;
+
+	return memcg_oom_badness(memcg, nodemask, totalpages);
+}
+
+static void select_victim_memcg(struct mem_cgroup *root, struct oom_control *oc)
+{
+	struct mem_cgroup *iter;
+
+	oc->chosen_memcg = NULL;
+	oc->chosen_points = 0;
+
+	/*
+	 * The oom_score is calculated for leaf memory cgroups (including
+	 * the root memcg).
+	 */
+	rcu_read_lock();
+	for_each_mem_cgroup_tree(iter, root) {
+		long score;
+
+		if (memcg_has_children(iter) && iter != root_mem_cgroup)
+			continue;
+
+		score = oom_evaluate_memcg(iter, oc->nodemask, oc->totalpages);
+
+		/*
+		 * Ignore empty and non-eligible memory cgroups.
+		 */
+		if (score == 0)
+			continue;
+
+		/*
+		 * If there are inflight OOM victims, we don't need
+		 * to look further for new victims.
+		 */
+		if (score == -1) {
+			oc->chosen_memcg = INFLIGHT_VICTIM;
+			mem_cgroup_iter_break(root, iter);
+			break;
+		}
+
+		if (score > oc->chosen_points) {
+			oc->chosen_points = score;
+			oc->chosen_memcg = iter;
+		}
+	}
+
+	if (oc->chosen_memcg && oc->chosen_memcg != INFLIGHT_VICTIM)
+		css_get(&oc->chosen_memcg->css);
+
+	rcu_read_unlock();
+}
+
+bool mem_cgroup_select_oom_victim(struct oom_control *oc)
+{
+	struct mem_cgroup *root;
+
+	if (mem_cgroup_disabled())
+		return false;
+
+	if (!cgroup_subsys_on_dfl(memory_cgrp_subsys))
+		return false;
+
+	if (oc->memcg)
+		root = oc->memcg;
+	else
+		root = root_mem_cgroup;
+
+	select_victim_memcg(root, oc);
+
+	return oc->chosen_memcg;
+}
+
 /*
  * Reclaims as many pages from the given memcg as possible.
  *
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index f041534d77d3..bcfa92f29407 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -309,7 +309,7 @@ static enum oom_constraint constrained_alloc(struct oom_control *oc)
 	return CONSTRAINT_NONE;
 }
 
-static int oom_evaluate_task(struct task_struct *task, void *arg)
+int oom_evaluate_task(struct task_struct *task, void *arg)
 {
 	struct oom_control *oc = arg;
 	unsigned long points;
@@ -343,26 +343,26 @@ static int oom_evaluate_task(struct task_struct *task, void *arg)
 		goto next;
 
 	/* Prefer thread group leaders for display purposes */
-	if (points == oc->chosen_points && thread_group_leader(oc->chosen))
+	if (points == oc->chosen_points && thread_group_leader(oc->chosen_task))
 		goto next;
 select:
-	if (oc->chosen)
-		put_task_struct(oc->chosen);
+	if (oc->chosen_task)
+		put_task_struct(oc->chosen_task);
 	get_task_struct(task);
-	oc->chosen = task;
+	oc->chosen_task = task;
 	oc->chosen_points = points;
 next:
 	return 0;
 abort:
-	if (oc->chosen)
-		put_task_struct(oc->chosen);
-	oc->chosen = (void *)-1UL;
+	if (oc->chosen_task)
+		put_task_struct(oc->chosen_task);
+	oc->chosen_task = INFLIGHT_VICTIM;
 	return 1;
 }
 
 /*
  * Simple selection loop. We choose the process with the highest number of
- * 'points'. In case scan was aborted, oc->chosen is set to -1.
+ * 'points'. In case scan was aborted, oc->chosen_task is set to -1.
  */
 static void select_bad_process(struct oom_control *oc)
 {
@@ -895,7 +895,7 @@ static void __oom_kill_process(struct task_struct *victim)
 
 static void oom_kill_process(struct oom_control *oc, const char *message)
 {
-	struct task_struct *p = oc->chosen;
+	struct task_struct *p = oc->chosen_task;
 	unsigned int points = oc->chosen_points;
 	struct task_struct *victim = p;
 	struct task_struct *child;
@@ -956,6 +956,27 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
 	__oom_kill_process(victim);
 }
 
+static bool oom_kill_memcg_victim(struct oom_control *oc)
+{
+
+	if (oc->chosen_memcg == NULL || oc->chosen_memcg == INFLIGHT_VICTIM)
+		return oc->chosen_memcg;
+
+	/* Kill a task in the chosen memcg with the biggest memory footprint */
+	oc->chosen_points = 0;
+	oc->chosen_task = NULL;
+	mem_cgroup_scan_tasks(oc->chosen_memcg, oom_evaluate_task, oc);
+
+	if (oc->chosen_task == NULL || oc->chosen_task == INFLIGHT_VICTIM)
+		goto out;
+
+	__oom_kill_process(oc->chosen_task);
+
+out:
+	mem_cgroup_put(oc->chosen_memcg);
+	return oc->chosen_task;
+}
+
 /*
  * Determines whether the kernel must panic because of the panic_on_oom sysctl.
  */
@@ -1008,6 +1029,7 @@ bool out_of_memory(struct oom_control *oc)
 {
 	unsigned long freed = 0;
 	enum oom_constraint constraint = CONSTRAINT_NONE;
+	bool delay = false; /* if set, delay next allocation attempt */
 
 	if (oom_killer_disabled)
 		return false;
@@ -1055,11 +1077,26 @@ bool out_of_memory(struct oom_control *oc)
 		if (oc->page)
 			return true;
 		get_task_struct(current);
-		oc->chosen = current;
+		oc->chosen_task = current;
 		oom_kill_process(oc, "Out of memory (oom_kill_allocating_task)");
 		return true;
 	}
 
+	if (mem_cgroup_select_oom_victim(oc)) {
+		oc->page = alloc_pages_before_oomkill(oc);
+		if (oc->page) {
+			if (oc->chosen_memcg &&
+			    oc->chosen_memcg != INFLIGHT_VICTIM)
+				mem_cgroup_put(oc->chosen_memcg);
+			return true;
+		}
+
+		if (oom_kill_memcg_victim(oc)) {
+			delay = true;
+			goto out;
+		}
+	}
+
 	select_bad_process(oc);
 	/*
 	 * Try really last second allocation attempt after we selected an OOM
@@ -1068,25 +1105,30 @@ bool out_of_memory(struct oom_control *oc)
 	 */
 	oc->page = alloc_pages_before_oomkill(oc);
 	if (oc->page) {
-		if (oc->chosen && oc->chosen != (void *)-1UL)
-			put_task_struct(oc->chosen);
+		if (oc->chosen_task && oc->chosen_task != INFLIGHT_VICTIM)
+			put_task_struct(oc->chosen_task);
 		return true;
 	}
 	/* Found nothing?!?! Either we hang forever, or we panic. */
-	if (!oc->chosen && !is_sysrq_oom(oc) && !is_memcg_oom(oc)) {
+	if (!oc->chosen_task && !is_sysrq_oom(oc) && !is_memcg_oom(oc)) {
 		dump_header(oc, NULL);
 		panic("Out of memory and no killable processes...\n");
 	}
-	if (oc->chosen && oc->chosen != (void *)-1UL) {
+	if (oc->chosen_task && oc->chosen_task != INFLIGHT_VICTIM) {
 		oom_kill_process(oc, !is_memcg_oom(oc) ? "Out of memory" :
 				 "Memory cgroup out of memory");
-		/*
-		 * Give the killed process a good chance to exit before trying
-		 * to allocate memory again.
-		 */
-		schedule_timeout_killable(1);
+		delay = true;
 	}
-	return !!oc->chosen;
+
+out:
+	/*
+	 * Give the killed process a good chance to exit before trying
+	 * to allocate memory again.
+	 */
+	if (delay)
+		schedule_timeout_killable(1);
+
+	return !!oc->chosen_task;
 }
 
 /*
-- 
2.14.3


* [PATCH v13 4/7] mm, oom: introduce memory.oom_group
  2017-11-30 15:28 [PATCH v13 0/7] cgroup-aware OOM killer Roman Gushchin
                   ` (2 preceding siblings ...)
  2017-11-30 15:28 ` [PATCH v13 3/7] mm, oom: cgroup-aware OOM killer Roman Gushchin
@ 2017-11-30 15:28 ` Roman Gushchin
  2017-11-30 15:28 ` [PATCH v13 5/7] mm, oom: add cgroup v2 mount option for cgroup-aware OOM killer Roman Gushchin
                   ` (5 subsequent siblings)
  9 siblings, 0 replies; 51+ messages in thread
From: Roman Gushchin @ 2017-11-30 15:28 UTC (permalink / raw)
  To: linux-mm
  Cc: Roman Gushchin, Vladimir Davydov, Tetsuo Handa, David Rientjes,
	Andrew Morton, Tejun Heo, kernel-team, cgroups, linux-doc,
	linux-kernel, linux-mm

The cgroup-aware OOM killer treats leaf memory cgroups as memory
consumption entities and performs the victim selection by comparing
them based on their memory footprint. Then it kills the biggest task
inside the selected memory cgroup.

But there are workloads which are not tolerant of such behavior.
Killing a random task may leave the workload in a broken state.

To solve this problem, the memory.oom_group knob is introduced.
It defines whether a memory cgroup should be treated as an
indivisible memory consumer, compared by total memory consumption
with other memory consumers (leaf memory cgroups and other memory
cgroups with memory.oom_group set), and whether all belonging
tasks should be killed if the cgroup is selected.

If set on memcg A, it means that in the case of a system-wide
OOM or a memcg-wide OOM scoped to A or any ancestor cgroup, all
tasks belonging to the sub-tree of A will be killed. If the OOM
event is scoped to a descendant cgroup (A/B, for example), only
tasks in that cgroup can be affected. The OOM killer will never
touch any tasks outside of the scope of the OOM event.

Also, tasks with oom_score_adj set to -1000 will not be killed,
because this has been a long-established way to protect a
particular process from an unexpected SIGKILL from the OOM
killer. Ignoring this user-defined configuration might lead to
data corruption or other misbehavior.

The default value is 0.
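The scoping rules above can be sketched as a tiny shell model (illustrative
only: `is_descendant` and `kill_scope` are made-up helpers, not kernel API;
cgroups are modeled as slash-separated paths with "" standing for the root):

```shell
# is_descendant CG ANCESTOR: true if CG is ANCESTOR or lies in its sub-tree
is_descendant() {
    case "$1" in
        "$2" | "$2"/*) return 0 ;;
    esac
    [ -z "$2" ]   # the root ("") is an ancestor of everything
}

# kill_scope OOM_SCOPE GROUP_CG: print the sub-tree whose tasks are killed
# when memory.oom_group is set on GROUP_CG, or "none" if the OOM event is
# outside the group's scope
kill_scope() {
    if is_descendant "$2" "$1"; then
        echo "$2"    # OOM at GROUP_CG or an ancestor: whole sub-tree dies
    elif is_descendant "$1" "$2"; then
        echo "$1"    # OOM in a descendant: only that cgroup is affected
    else
        echo none    # tasks outside the event's scope are never touched
    fi
}

kill_scope ""  A     # system-wide OOM: prints A
kill_scope A/B A     # OOM scoped to A/B: prints A/B
kill_scope C   A     # unrelated OOM domain: prints none
```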

Signed-off-by: Roman Gushchin <guro@fb.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: David Rientjes <rientjes@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: kernel-team@fb.com
Cc: cgroups@vger.kernel.org
Cc: linux-doc@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
---
 include/linux/memcontrol.h | 17 +++++++++++
 mm/memcontrol.c            | 75 +++++++++++++++++++++++++++++++++++++++++++---
 mm/oom_kill.c              | 47 +++++++++++++++++++++++------
 3 files changed, 126 insertions(+), 13 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index cb4db659a8b5..7b8bcdf6571d 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -203,6 +203,13 @@ struct mem_cgroup {
 	/* OOM-Killer disable */
 	int		oom_kill_disable;
 
+	/*
+	 * Treat the sub-tree as an indivisible memory consumer,
+	 * kill all belonging tasks if the memory cgroup is
+	 * selected as an OOM victim.
+	 */
+	bool oom_group;
+
 	/* handle for "memory.events" */
 	struct cgroup_file events_file;
 
@@ -490,6 +497,11 @@ bool mem_cgroup_oom_synchronize(bool wait);
 
 bool mem_cgroup_select_oom_victim(struct oom_control *oc);
 
+static inline bool mem_cgroup_oom_group(struct mem_cgroup *memcg)
+{
+	return memcg->oom_group;
+}
+
 #ifdef CONFIG_MEMCG_SWAP
 extern int do_swap_account;
 #endif
@@ -990,6 +1002,11 @@ static inline bool mem_cgroup_select_oom_victim(struct oom_control *oc)
 {
 	return false;
 }
+
+static inline bool mem_cgroup_oom_group(struct mem_cgroup *memcg)
+{
+	return false;
+}
 #endif /* CONFIG_MEMCG */
 
 /* idx can be of type enum memcg_stat_item or node_stat_item */
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 592ffb1c98a7..5d27a4bbd478 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2779,19 +2779,51 @@ static long oom_evaluate_memcg(struct mem_cgroup *memcg,
 
 static void select_victim_memcg(struct mem_cgroup *root, struct oom_control *oc)
 {
-	struct mem_cgroup *iter;
+	struct mem_cgroup *iter, *group = NULL;
+	long group_score = 0;
 
 	oc->chosen_memcg = NULL;
 	oc->chosen_points = 0;
 
+	/*
+	 * If OOM is memcg-wide, and the memcg has the oom_group flag set,
+	 * all tasks belonging to the memcg should be killed.
+	 * So, we mark the memcg as a victim.
+	 */
+	if (oc->memcg && mem_cgroup_oom_group(oc->memcg)) {
+		oc->chosen_memcg = oc->memcg;
+		css_get(&oc->chosen_memcg->css);
+		return;
+	}
+
 	/*
 	 * The oom_score is calculated for leaf memory cgroups (including
 	 * the root memcg).
+	 * Non-leaf oom_group cgroups accumulate the scores of
+	 * their descendant leaf memory cgroups.
 	 */
 	rcu_read_lock();
 	for_each_mem_cgroup_tree(iter, root) {
 		long score;
 
+		/*
+		 * We don't consider non-leaf non-oom_group memory cgroups
+		 * as OOM victims.
+		 */
+		if (memcg_has_children(iter) && iter != root_mem_cgroup &&
+		    !mem_cgroup_oom_group(iter))
+			continue;
+
+		/*
+		 * If group is not set or we've run out of the group's sub-tree,
+		 * we should set group and reset group_score.
+		 */
+		if (!group || group == root_mem_cgroup ||
+		    !mem_cgroup_is_descendant(iter, group)) {
+			group = iter;
+			group_score = 0;
+		}
+
 		if (memcg_has_children(iter) && iter != root_mem_cgroup)
 			continue;
 
@@ -2813,9 +2845,11 @@ static void select_victim_memcg(struct mem_cgroup *root, struct oom_control *oc)
 			break;
 		}
 
-		if (score > oc->chosen_points) {
-			oc->chosen_points = score;
-			oc->chosen_memcg = iter;
+		group_score += score;
+
+		if (group_score > oc->chosen_points) {
+			oc->chosen_points = group_score;
+			oc->chosen_memcg = group;
 		}
 	}
 
@@ -5440,6 +5474,33 @@ static ssize_t memory_max_write(struct kernfs_open_file *of,
 	return nbytes;
 }
 
+static int memory_oom_group_show(struct seq_file *m, void *v)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m));
+	bool oom_group = memcg->oom_group;
+
+	seq_printf(m, "%d\n", oom_group);
+
+	return 0;
+}
+
+static ssize_t memory_oom_group_write(struct kernfs_open_file *of,
+				      char *buf, size_t nbytes,
+				      loff_t off)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
+	int oom_group;
+	int err;
+
+	err = kstrtoint(strstrip(buf), 0, &oom_group);
+	if (err)
+		return err;
+
+	memcg->oom_group = oom_group;
+
+	return nbytes;
+}
+
 static int memory_events_show(struct seq_file *m, void *v)
 {
 	struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m));
@@ -5559,6 +5620,12 @@ static struct cftype memory_files[] = {
 		.seq_show = memory_max_show,
 		.write = memory_max_write,
 	},
+	{
+		.name = "oom_group",
+		.flags = CFTYPE_NOT_ON_ROOT | CFTYPE_NS_DELEGATABLE,
+		.seq_show = memory_oom_group_show,
+		.write = memory_oom_group_write,
+	},
 	{
 		.name = "events",
 		.flags = CFTYPE_NOT_ON_ROOT,
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index bcfa92f29407..4678468bae17 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -820,6 +820,17 @@ static void __oom_kill_process(struct task_struct *victim)
 	struct mm_struct *mm;
 	bool can_oom_reap = true;
 
+	/*
+	 * __oom_kill_process() is used to kill all tasks belonging to
+	 * the selected memory cgroup, so we should check that we're not
+	 * trying to kill an unkillable task.
+	 */
+	if (is_global_init(victim) || (victim->flags & PF_KTHREAD) ||
+	    victim->signal->oom_score_adj == OOM_SCORE_ADJ_MIN) {
+		put_task_struct(victim);
+		return;
+	}
+
 	p = find_lock_task_mm(victim);
 	if (!p) {
 		put_task_struct(victim);
@@ -956,21 +967,39 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
 	__oom_kill_process(victim);
 }
 
-static bool oom_kill_memcg_victim(struct oom_control *oc)
+static int oom_kill_memcg_member(struct task_struct *task, void *unused)
 {
+	get_task_struct(task);
+	__oom_kill_process(task);
+	return 0;
+}
 
+static bool oom_kill_memcg_victim(struct oom_control *oc)
+{
 	if (oc->chosen_memcg == NULL || oc->chosen_memcg == INFLIGHT_VICTIM)
 		return oc->chosen_memcg;
 
-	/* Kill a task in the chosen memcg with the biggest memory footprint */
-	oc->chosen_points = 0;
-	oc->chosen_task = NULL;
-	mem_cgroup_scan_tasks(oc->chosen_memcg, oom_evaluate_task, oc);
-
-	if (oc->chosen_task == NULL || oc->chosen_task == INFLIGHT_VICTIM)
-		goto out;
+	/*
+	 * If memory.oom_group is set, kill all tasks belonging to the sub-tree
+	 * of the chosen memory cgroup, otherwise kill the task with the biggest
+	 * memory footprint.
+	 */
+	if (mem_cgroup_oom_group(oc->chosen_memcg)) {
+		mem_cgroup_scan_tasks(oc->chosen_memcg, oom_kill_memcg_member,
+				      NULL);
+		/* We have one or more terminating processes at this point. */
+		oc->chosen_task = INFLIGHT_VICTIM;
+	} else {
+		oc->chosen_points = 0;
+		oc->chosen_task = NULL;
+		mem_cgroup_scan_tasks(oc->chosen_memcg, oom_evaluate_task, oc);
+
+		if (oc->chosen_task == NULL ||
+		    oc->chosen_task == INFLIGHT_VICTIM)
+			goto out;
 
-	__oom_kill_process(oc->chosen_task);
+		__oom_kill_process(oc->chosen_task);
+	}
 
 out:
 	mem_cgroup_put(oc->chosen_memcg);
-- 
2.14.3

* [PATCH v13 5/7] mm, oom: add cgroup v2 mount option for cgroup-aware OOM killer
  2017-11-30 15:28 [PATCH v13 0/7] cgroup-aware OOM killer Roman Gushchin
                   ` (3 preceding siblings ...)
  2017-11-30 15:28 ` [PATCH v13 4/7] mm, oom: introduce memory.oom_group Roman Gushchin
@ 2017-11-30 15:28 ` Roman Gushchin
  2017-12-01  8:41   ` Michal Hocko
  2017-11-30 15:28 ` [PATCH v13 6/7] mm, oom, docs: describe the " Roman Gushchin
                   ` (4 subsequent siblings)
  9 siblings, 1 reply; 51+ messages in thread
From: Roman Gushchin @ 2017-11-30 15:28 UTC (permalink / raw)
  To: linux-mm
  Cc: Roman Gushchin, Michal Hocko, Vladimir Davydov, Johannes Weiner,
	Tetsuo Handa, David Rientjes, Andrew Morton, Tejun Heo,
	kernel-team, cgroups, linux-doc, linux-kernel, linux-mm

Add a "groupoom" cgroup v2 mount option to enable the cgroup-aware
OOM killer. If not set, the OOM selection is performed in
a "traditional" per-process way.

The behavior can be changed dynamically by remounting the cgroupfs.
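For illustration, enabling the option would look roughly like this (a
sketch, assuming the option lands as described; requires root, and the
mount point may differ on your system):

```shell
# Mount cgroup v2 with the cgroup-aware OOM killer enabled
mount -t cgroup2 -o groupoom none /sys/fs/cgroup

# ...or toggle it on a live system by remounting the hierarchy
mount -o remount,groupoom none /sys/fs/cgroup

# The active options are visible in the mount table
grep cgroup2 /proc/mounts
```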

Signed-off-by: Roman Gushchin <guro@fb.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: David Rientjes <rientjes@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: kernel-team@fb.com
Cc: cgroups@vger.kernel.org
Cc: linux-doc@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
---
 include/linux/cgroup-defs.h |  5 +++++
 kernel/cgroup/cgroup.c      | 10 ++++++++++
 mm/memcontrol.c             |  3 +++
 3 files changed, 18 insertions(+)

diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index 8b7fd8eeccee..9fb99e25d654 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -81,6 +81,11 @@ enum {
 	 * Enable cpuset controller in v1 cgroup to use v2 behavior.
 	 */
 	CGRP_ROOT_CPUSET_V2_MODE = (1 << 4),
+
+	/*
+	 * Enable cgroup-aware OOM killer.
+	 */
+	CGRP_GROUP_OOM = (1 << 5),
 };
 
 /* cftype->flags */
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 0b1ffe147f24..7338e12979e1 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -1731,6 +1731,9 @@ static int parse_cgroup_root_flags(char *data, unsigned int *root_flags)
 		if (!strcmp(token, "nsdelegate")) {
 			*root_flags |= CGRP_ROOT_NS_DELEGATE;
 			continue;
+		} else if (!strcmp(token, "groupoom")) {
+			*root_flags |= CGRP_GROUP_OOM;
+			continue;
 		}
 
 		pr_err("cgroup2: unknown option \"%s\"\n", token);
@@ -1747,6 +1750,11 @@ static void apply_cgroup_root_flags(unsigned int root_flags)
 			cgrp_dfl_root.flags |= CGRP_ROOT_NS_DELEGATE;
 		else
 			cgrp_dfl_root.flags &= ~CGRP_ROOT_NS_DELEGATE;
+
+		if (root_flags & CGRP_GROUP_OOM)
+			cgrp_dfl_root.flags |= CGRP_GROUP_OOM;
+		else
+			cgrp_dfl_root.flags &= ~CGRP_GROUP_OOM;
 	}
 }
 
@@ -1754,6 +1762,8 @@ static int cgroup_show_options(struct seq_file *seq, struct kernfs_root *kf_root
 {
 	if (cgrp_dfl_root.flags & CGRP_ROOT_NS_DELEGATE)
 		seq_puts(seq, ",nsdelegate");
+	if (cgrp_dfl_root.flags & CGRP_GROUP_OOM)
+		seq_puts(seq, ",groupoom");
 	return 0;
 }
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 5d27a4bbd478..c76d5fb55c5c 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2869,6 +2869,9 @@ bool mem_cgroup_select_oom_victim(struct oom_control *oc)
 	if (!cgroup_subsys_on_dfl(memory_cgrp_subsys))
 		return false;
 
+	if (!(cgrp_dfl_root.flags & CGRP_GROUP_OOM))
+		return false;
+
 	if (oc->memcg)
 		root = oc->memcg;
 	else
-- 
2.14.3

* [PATCH v13 6/7] mm, oom, docs: describe the cgroup-aware OOM killer
  2017-11-30 15:28 [PATCH v13 0/7] cgroup-aware OOM killer Roman Gushchin
                   ` (4 preceding siblings ...)
  2017-11-30 15:28 ` [PATCH v13 5/7] mm, oom: add cgroup v2 mount option for cgroup-aware OOM killer Roman Gushchin
@ 2017-11-30 15:28 ` Roman Gushchin
  2017-12-01  8:41   ` Michal Hocko
  2017-11-30 15:28 ` [PATCH v13 7/7] cgroup: list groupoom in cgroup features Roman Gushchin
                   ` (3 subsequent siblings)
  9 siblings, 1 reply; 51+ messages in thread
From: Roman Gushchin @ 2017-11-30 15:28 UTC (permalink / raw)
  To: linux-mm
  Cc: Roman Gushchin, Johannes Weiner, Michal Hocko, Vladimir Davydov,
	Tetsuo Handa, Andrew Morton, David Rientjes, Tejun Heo,
	kernel-team, cgroups, linux-doc, linux-kernel, linux-mm

Document the cgroup-aware OOM killer.

Signed-off-by: Roman Gushchin <guro@fb.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: kernel-team@fb.com
Cc: cgroups@vger.kernel.org
Cc: linux-doc@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
---
 Documentation/cgroup-v2.txt | 58 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 58 insertions(+)

diff --git a/Documentation/cgroup-v2.txt b/Documentation/cgroup-v2.txt
index 779211fbb69f..c80a147f94b7 100644
--- a/Documentation/cgroup-v2.txt
+++ b/Documentation/cgroup-v2.txt
@@ -48,6 +48,7 @@ v1 is available under Documentation/cgroup-v1/.
        5-2-1. Memory Interface Files
        5-2-2. Usage Guidelines
        5-2-3. Memory Ownership
+       5-2-4. OOM Killer
      5-3. IO
        5-3-1. IO Interface Files
        5-3-2. Writeback
@@ -1026,6 +1027,28 @@ PAGE_SIZE multiple when read back.
 	high limit is used and monitored properly, this limit's
 	utility is limited to providing the final safety net.
 
+  memory.oom_group
+
+	A read-write single value file which exists on non-root
+	cgroups.  The default is "0".
+
+	If set, the OOM killer will consider the memory cgroup as an
+	indivisible memory consumer and compare it with other memory
+	consumers by its memory footprint.
+	If such a memory cgroup is selected as an OOM victim, all
+	processes belonging to it or its descendants will be killed.
+
+	This applies to system-wide OOM conditions and to reaching
+	the hard memory limit of the cgroup or any of its ancestors.
+	If an OOM condition happens in a descendant cgroup with its own
+	memory limit, the memory cgroup can't be considered
+	an OOM victim, and the OOM killer will not kill all of its
+	tasks.
+
+	Also, the OOM killer respects the /proc/pid/oom_score_adj value
+	of -1000 and will never kill such unkillable tasks, even if
+	memory.oom_group is set.
+
   memory.events
 	A read-only flat-keyed file which exists on non-root cgroups.
 	The following entries are defined.  Unless specified
@@ -1229,6 +1252,41 @@ to be accessed repeatedly by other cgroups, it may make sense to use
 POSIX_FADV_DONTNEED to relinquish the ownership of memory areas
 belonging to the affected files to ensure correct memory ownership.
 
+OOM Killer
+~~~~~~~~~~
+
+The cgroup v2 memory controller implements a cgroup-aware OOM killer,
+which means that it treats cgroups as first-class OOM entities.
+
+Under OOM conditions the memory controller tries to make the best
+choice of a victim by looking for the memory cgroup with the largest
+memory footprint, considering leaf cgroups and cgroups with the
+memory.oom_group option set, which are treated as indivisible
+memory consumers.
+
+By default, the OOM killer kills the biggest task in the selected
+memory cgroup. A user can change this behavior by enabling
+the per-cgroup memory.oom_group option. If set, it causes
+the OOM killer to kill all processes attached to the cgroup,
+except processes with oom_score_adj set to -1000.
+
+This affects both system- and cgroup-wide OOMs. For a cgroup-wide OOM
+the memory controller considers only cgroups belonging to the sub-tree
+of the OOM'ing cgroup.
+
+The root cgroup is treated as a leaf memory cgroup, so it's compared
+with other leaf memory cgroups and cgroups with oom_group option set.
+
+If there are no cgroups with the memory controller enabled,
+the OOM killer uses the "traditional" process-based approach.
+
+Please note that memory charges are not migrated when tasks
+are moved between memory cgroups. Moving tasks with a
+significant memory footprint may affect the OOM victim selection.
+If that is the case, please consider creating a common ancestor for
+the source and destination memory cgroups and enabling oom_group
+at the ancestor level.
+
 
 IO
 --
-- 
2.14.3

* [PATCH v13 7/7] cgroup: list groupoom in cgroup features
  2017-11-30 15:28 [PATCH v13 0/7] cgroup-aware OOM killer Roman Gushchin
                   ` (5 preceding siblings ...)
  2017-11-30 15:28 ` [PATCH v13 6/7] mm, oom, docs: describe the " Roman Gushchin
@ 2017-11-30 15:28 ` Roman Gushchin
  2017-11-30 20:39 ` [PATCH v13 0/7] cgroup-aware OOM killer Andrew Morton
                   ` (2 subsequent siblings)
  9 siblings, 0 replies; 51+ messages in thread
From: Roman Gushchin @ 2017-11-30 15:28 UTC (permalink / raw)
  To: linux-mm
  Cc: Roman Gushchin, Tejun Heo, Michal Hocko, Vladimir Davydov,
	Johannes Weiner, Tetsuo Handa, David Rientjes, Andrew Morton,
	kernel-team, cgroups, linux-doc, linux-kernel, linux-mm

List groupoom in the cgroup features list (exported via
/sys/kernel/cgroup/features), which can be used by userspace
applications (most likely systemd) to determine which cgroup
features are supported by the kernel.
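A userspace consumer could detect the feature roughly like this (a sketch;
`has_cgroup_feature` is a made-up helper name, but the features path is the
one exported by this patch):

```shell
# Return success if the kernel advertises the given cgroup feature.
has_cgroup_feature() {
    # usage: has_cgroup_feature <name> [features-file]
    # -x matches the whole line, so "groupoom" won't match "groupoom2"
    grep -qx -- "$1" "${2:-/sys/kernel/cgroup/features}"
}

# Demonstrate against a synthetic features file, since the real
# contents depend on the running kernel:
printf 'nsdelegate\ngroupoom\n' > /tmp/features.demo
has_cgroup_feature groupoom /tmp/features.demo && echo "groupoom supported"
```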

Signed-off-by: Roman Gushchin <guro@fb.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: David Rientjes <rientjes@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: kernel-team@fb.com
Cc: cgroups@vger.kernel.org
Cc: linux-doc@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
---
 kernel/cgroup/cgroup.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 7338e12979e1..693443282fc1 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -5922,7 +5922,8 @@ static struct kobj_attribute cgroup_delegate_attr = __ATTR_RO(delegate);
 static ssize_t features_show(struct kobject *kobj, struct kobj_attribute *attr,
 			     char *buf)
 {
-	return snprintf(buf, PAGE_SIZE, "nsdelegate\n");
+	return snprintf(buf, PAGE_SIZE, "nsdelegate\n"
+					"groupoom\n");
 }
 static struct kobj_attribute cgroup_features_attr = __ATTR_RO(features);
 
-- 
2.14.3

* Re: [PATCH v13 0/7] cgroup-aware OOM killer
  2017-11-30 15:28 [PATCH v13 0/7] cgroup-aware OOM killer Roman Gushchin
                   ` (6 preceding siblings ...)
  2017-11-30 15:28 ` [PATCH v13 7/7] cgroup: list groupoom in cgroup features Roman Gushchin
@ 2017-11-30 20:39 ` Andrew Morton
  2018-01-10  0:57   ` David Rientjes
  2017-12-01  9:14 ` [PATCH] mm, oom: simplify alloc_pages_before_oomkill handling Michal Hocko
  2018-06-05 11:47 ` [PATCH v13 0/7] cgroup-aware OOM killer Michal Hocko
  9 siblings, 1 reply; 51+ messages in thread
From: Andrew Morton @ 2017-11-30 20:39 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: linux-mm, Michal Hocko, Vladimir Davydov, Johannes Weiner,
	Tetsuo Handa, David Rientjes, Tejun Heo, kernel-team, cgroups,
	linux-doc, linux-kernel, linux-mm

On Thu, 30 Nov 2017 15:28:17 +0000 Roman Gushchin <guro@fb.com> wrote:

> This patchset makes the OOM killer cgroup-aware.

Thanks, I'll grab these.

There has been controversy over this patchset, to say the least.  I
can't say that I followed it closely!  Could those who still have
reservations please summarise their concerns and hopefully suggest a
way forward?

* Re: [PATCH v13 3/7] mm, oom: cgroup-aware OOM killer
  2017-11-30 15:28 ` [PATCH v13 3/7] mm, oom: cgroup-aware OOM killer Roman Gushchin
@ 2017-12-01  8:35   ` Michal Hocko
  2017-12-07  1:24   ` Andrew Morton
  1 sibling, 0 replies; 51+ messages in thread
From: Michal Hocko @ 2017-12-01  8:35 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: linux-mm, Johannes Weiner, Vladimir Davydov, Tetsuo Handa,
	David Rientjes, Andrew Morton, Tejun Heo, kernel-team, cgroups,
	linux-doc, linux-kernel, linux-mm

On Thu 30-11-17 15:28:20, Roman Gushchin wrote:
> Traditionally, the OOM killer operates at the process level.
> Under oom conditions, it finds a process with the highest oom score
> and kills it.
> 
> This behavior doesn't suit systems with many running
> containers well:
> 
> 1) There is no fairness between containers. A small container with
> a few large processes will be chosen over a large one with a huge
> number of small processes.
> 
> 2) Containers often do not expect that some random process inside
> will be killed. In many cases a much safer behavior is to kill
> all tasks in the container. Traditionally, this was implemented
> in userspace, but doing it in the kernel has some advantages,
> especially in the case of a system-wide OOM.
> 
> To address these issues, the cgroup-aware OOM killer is introduced.
> 
> This patch introduces the core functionality: an ability to select
> a memory cgroup as an OOM victim. Under OOM conditions the OOM killer
> looks for the biggest leaf memory cgroup and kills the biggest
> task belonging to it.
> 
> The following patches will extend this functionality to consider
> non-leaf memory cgroups as OOM victims, and also provide an ability
> to kill all tasks belonging to the victim cgroup.
> 
> The root cgroup is treated as a leaf memory cgroup, so its score
> is compared with those of other leaf memory cgroups.
> Due to the memcg statistics implementation, a special approximation
> is used for estimating the oom_score of the root memory cgroup: we sum
> the oom_score of the belonging processes (or, to be more precise,
> of tasks owning their mm structures).
> 
> Signed-off-by: Roman Gushchin <guro@fb.com>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> Cc: David Rientjes <rientjes@google.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Tejun Heo <tj@kernel.org>
> Cc: kernel-team@fb.com
> Cc: cgroups@vger.kernel.org
> Cc: linux-doc@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org
> Cc: linux-mm@kvack.org

I am not entirely happy that this patch enables the cgroup behavior
unconditionally for cgroup v2, but a later patch fixes that up. I do not
expect people are going to bisect OOM workloads over these few commits,
so this shouldn't be a big deal.

Anyway, I still _strongly_ believe that the new heuristic is not
suitable for the default behavior and that an opt-in is required. So my
ack is given under this condition.
Acked-by: Michal Hocko <mhocko@suse.com>

> ---
>  include/linux/memcontrol.h |  17 +++++
>  include/linux/oom.h        |  12 ++-
>  mm/memcontrol.c            | 181 +++++++++++++++++++++++++++++++++++++++++++++
>  mm/oom_kill.c              |  84 +++++++++++++++------
>  4 files changed, 272 insertions(+), 22 deletions(-)
> 
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 882046863581..cb4db659a8b5 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -35,6 +35,7 @@ struct mem_cgroup;
>  struct page;
>  struct mm_struct;
>  struct kmem_cache;
> +struct oom_control;
>  
>  /* Cgroup-specific page state, on top of universal node page state */
>  enum memcg_stat_item {
> @@ -344,6 +345,11 @@ struct mem_cgroup *mem_cgroup_from_css(struct cgroup_subsys_state *css){
>  	return css ? container_of(css, struct mem_cgroup, css) : NULL;
>  }
>  
> +static inline void mem_cgroup_put(struct mem_cgroup *memcg)
> +{
> +	css_put(&memcg->css);
> +}
> +
>  #define mem_cgroup_from_counter(counter, member)	\
>  	container_of(counter, struct mem_cgroup, member)
>  
> @@ -482,6 +488,8 @@ static inline bool task_in_memcg_oom(struct task_struct *p)
>  
>  bool mem_cgroup_oom_synchronize(bool wait);
>  
> +bool mem_cgroup_select_oom_victim(struct oom_control *oc);
> +
>  #ifdef CONFIG_MEMCG_SWAP
>  extern int do_swap_account;
>  #endif
> @@ -781,6 +789,10 @@ static inline bool task_in_mem_cgroup(struct task_struct *task,
>  	return true;
>  }
>  
> +static inline void mem_cgroup_put(struct mem_cgroup *memcg)
> +{
> +}
> +
>  static inline struct mem_cgroup *
>  mem_cgroup_iter(struct mem_cgroup *root,
>  		struct mem_cgroup *prev,
> @@ -973,6 +985,11 @@ static inline
>  void count_memcg_event_mm(struct mm_struct *mm, enum vm_event_item idx)
>  {
>  }
> +
> +static inline bool mem_cgroup_select_oom_victim(struct oom_control *oc)
> +{
> +	return false;
> +}
>  #endif /* CONFIG_MEMCG */
>  
>  /* idx can be of type enum memcg_stat_item or node_stat_item */
> diff --git a/include/linux/oom.h b/include/linux/oom.h
> index 27cd36b762b5..10f495c8454d 100644
> --- a/include/linux/oom.h
> +++ b/include/linux/oom.h
> @@ -10,6 +10,13 @@
>  #include <linux/sched/coredump.h> /* MMF_* */
>  #include <linux/mm.h> /* VM_FAULT* */
>  
> +
> +/*
> + * Special value returned by victim selection functions to indicate
> + * that there are inflight OOM victims.
> + */
> +#define INFLIGHT_VICTIM ((void *)-1UL)
> +
>  struct zonelist;
>  struct notifier_block;
>  struct mem_cgroup;
> @@ -51,7 +58,8 @@ struct oom_control {
>  
>  	/* Used by oom implementation, do not set */
>  	unsigned long totalpages;
> -	struct task_struct *chosen;
> +	struct task_struct *chosen_task;
> +	struct mem_cgroup *chosen_memcg;
>  	unsigned long chosen_points;
>  };
>  
> @@ -115,6 +123,8 @@ extern struct task_struct *find_lock_task_mm(struct task_struct *p);
>  
>  extern struct page *alloc_pages_before_oomkill(const struct oom_control *oc);
>  
> +extern int oom_evaluate_task(struct task_struct *task, void *arg);
> +
>  /* sysctls */
>  extern int sysctl_oom_dump_tasks;
>  extern int sysctl_oom_kill_allocating_task;
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 55fbda60cef6..592ffb1c98a7 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2664,6 +2664,187 @@ static inline bool memcg_has_children(struct mem_cgroup *memcg)
>  	return ret;
>  }
>  
> +static long memcg_oom_badness(struct mem_cgroup *memcg,
> +			      const nodemask_t *nodemask,
> +			      unsigned long totalpages)
> +{
> +	long points = 0;
> +	int nid;
> +	pg_data_t *pgdat;
> +
> +	for_each_node_state(nid, N_MEMORY) {
> +		if (nodemask && !node_isset(nid, *nodemask))
> +			continue;
> +
> +		points += mem_cgroup_node_nr_lru_pages(memcg, nid,
> +				LRU_ALL_ANON | BIT(LRU_UNEVICTABLE));
> +
> +		pgdat = NODE_DATA(nid);
> +		points += lruvec_page_state(mem_cgroup_lruvec(pgdat, memcg),
> +					    NR_SLAB_UNRECLAIMABLE);
> +	}
> +
> +	points += memcg_page_state(memcg, MEMCG_KERNEL_STACK_KB) /
> +		(PAGE_SIZE / 1024);
> +	points += memcg_page_state(memcg, MEMCG_SOCK);
> +	points += memcg_page_state(memcg, MEMCG_SWAP);
> +
> +	return points;
> +}
> +
> +/*
> + * Checks if the given memcg is a valid OOM victim and returns a number,
> + * which means the following:
> + *   -1: there are inflight OOM victim tasks, belonging to the memcg
> + *    0: memcg is not eligible, e.g. all belonging tasks are protected
> + *       by oom_score_adj set to OOM_SCORE_ADJ_MIN
> + *   >0: memcg is eligible, and the returned value is an estimation
> + *       of the memory footprint
> + */
> +static long oom_evaluate_memcg(struct mem_cgroup *memcg,
> +			       const nodemask_t *nodemask,
> +			       unsigned long totalpages)
> +{
> +	struct css_task_iter it;
> +	struct task_struct *task;
> +	int eligible = 0;
> +
> +	/*
> +	 * Root memory cgroup is a special case:
> +	 * we don't have necessary stats to evaluate it exactly as
> +	 * leaf memory cgroups, so we approximate its oom_score
> +	 * by summing oom_score of all belonging tasks, which are
> +	 * owners of their mm structs.
> +	 *
> +	 * If there are inflight OOM victim tasks inside
> +	 * the root memcg, we return -1.
> +	 */
> +	if (memcg == root_mem_cgroup) {
> +		struct css_task_iter it;
> +		struct task_struct *task;
> +		long score = 0;
> +
> +		css_task_iter_start(&memcg->css, 0, &it);
> +		while ((task = css_task_iter_next(&it))) {
> +			if (tsk_is_oom_victim(task) &&
> +			    !test_bit(MMF_OOM_SKIP,
> +				      &task->signal->oom_mm->flags)) {
> +				score = -1;
> +				break;
> +			}
> +
> +			task_lock(task);
> +			if (!task->mm || task->mm->owner != task) {
> +				task_unlock(task);
> +				continue;
> +			}
> +			task_unlock(task);
> +
> +			score += oom_badness(task, memcg, nodemask,
> +					     totalpages);
> +		}
> +		css_task_iter_end(&it);
> +
> +		return score;
> +	}
> +
> +	/*
> +	 * Memcg is OOM eligible if there are OOM killable tasks inside.
> +	 *
> +	 * We treat tasks with oom_score_adj set to OOM_SCORE_ADJ_MIN
> +	 * as unkillable.
> +	 *
> +	 * If there are inflight OOM victim tasks inside the memcg,
> +	 * we return -1.
> +	 */
> +	css_task_iter_start(&memcg->css, 0, &it);
> +	while ((task = css_task_iter_next(&it))) {
> +		if (!eligible &&
> +		    task->signal->oom_score_adj != OOM_SCORE_ADJ_MIN)
> +			eligible = 1;
> +
> +		if (tsk_is_oom_victim(task) &&
> +		    !test_bit(MMF_OOM_SKIP, &task->signal->oom_mm->flags)) {
> +			eligible = -1;
> +			break;
> +		}
> +	}
> +	css_task_iter_end(&it);
> +
> +	if (eligible <= 0)
> +		return eligible;
> +
> +	return memcg_oom_badness(memcg, nodemask, totalpages);
> +}
> +
> +static void select_victim_memcg(struct mem_cgroup *root, struct oom_control *oc)
> +{
> +	struct mem_cgroup *iter;
> +
> +	oc->chosen_memcg = NULL;
> +	oc->chosen_points = 0;
> +
> +	/*
> +	 * The oom_score is calculated for leaf memory cgroups (including
> +	 * the root memcg).
> +	 */
> +	rcu_read_lock();
> +	for_each_mem_cgroup_tree(iter, root) {
> +		long score;
> +
> +		if (memcg_has_children(iter) && iter != root_mem_cgroup)
> +			continue;
> +
> +		score = oom_evaluate_memcg(iter, oc->nodemask, oc->totalpages);
> +
> +		/*
> +		 * Ignore empty and non-eligible memory cgroups.
> +		 */
> +		if (score == 0)
> +			continue;
> +
> +		/*
> +		 * If there are inflight OOM victims, we don't need
> +		 * to look further for new victims.
> +		 */
> +		if (score == -1) {
> +			oc->chosen_memcg = INFLIGHT_VICTIM;
> +			mem_cgroup_iter_break(root, iter);
> +			break;
> +		}
> +
> +		if (score > oc->chosen_points) {
> +			oc->chosen_points = score;
> +			oc->chosen_memcg = iter;
> +		}
> +	}
> +
> +	if (oc->chosen_memcg && oc->chosen_memcg != INFLIGHT_VICTIM)
> +		css_get(&oc->chosen_memcg->css);
> +
> +	rcu_read_unlock();
> +}
> +
> +bool mem_cgroup_select_oom_victim(struct oom_control *oc)
> +{
> +	struct mem_cgroup *root;
> +
> +	if (mem_cgroup_disabled())
> +		return false;
> +
> +	if (!cgroup_subsys_on_dfl(memory_cgrp_subsys))
> +		return false;
> +
> +	if (oc->memcg)
> +		root = oc->memcg;
> +	else
> +		root = root_mem_cgroup;
> +
> +	select_victim_memcg(root, oc);
> +
> +	return oc->chosen_memcg;
> +}
> +
>  /*
>   * Reclaims as many pages from the given memcg as possible.
>   *
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index f041534d77d3..bcfa92f29407 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -309,7 +309,7 @@ static enum oom_constraint constrained_alloc(struct oom_control *oc)
>  	return CONSTRAINT_NONE;
>  }
>  
> -static int oom_evaluate_task(struct task_struct *task, void *arg)
> +int oom_evaluate_task(struct task_struct *task, void *arg)
>  {
>  	struct oom_control *oc = arg;
>  	unsigned long points;
> @@ -343,26 +343,26 @@ static int oom_evaluate_task(struct task_struct *task, void *arg)
>  		goto next;
>  
>  	/* Prefer thread group leaders for display purposes */
> -	if (points == oc->chosen_points && thread_group_leader(oc->chosen))
> +	if (points == oc->chosen_points && thread_group_leader(oc->chosen_task))
>  		goto next;
>  select:
> -	if (oc->chosen)
> -		put_task_struct(oc->chosen);
> +	if (oc->chosen_task)
> +		put_task_struct(oc->chosen_task);
>  	get_task_struct(task);
> -	oc->chosen = task;
> +	oc->chosen_task = task;
>  	oc->chosen_points = points;
>  next:
>  	return 0;
>  abort:
> -	if (oc->chosen)
> -		put_task_struct(oc->chosen);
> -	oc->chosen = (void *)-1UL;
> +	if (oc->chosen_task)
> +		put_task_struct(oc->chosen_task);
> +	oc->chosen_task = INFLIGHT_VICTIM;
>  	return 1;
>  }
>  
>  /*
>   * Simple selection loop. We choose the process with the highest number of
> - * 'points'. In case scan was aborted, oc->chosen is set to -1.
> + * 'points'. In case scan was aborted, oc->chosen_task is set to -1.
>   */
>  static void select_bad_process(struct oom_control *oc)
>  {
> @@ -895,7 +895,7 @@ static void __oom_kill_process(struct task_struct *victim)
>  
>  static void oom_kill_process(struct oom_control *oc, const char *message)
>  {
> -	struct task_struct *p = oc->chosen;
> +	struct task_struct *p = oc->chosen_task;
>  	unsigned int points = oc->chosen_points;
>  	struct task_struct *victim = p;
>  	struct task_struct *child;
> @@ -956,6 +956,27 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
>  	__oom_kill_process(victim);
>  }
>  
> +static bool oom_kill_memcg_victim(struct oom_control *oc)
> +{
> +
> +	if (oc->chosen_memcg == NULL || oc->chosen_memcg == INFLIGHT_VICTIM)
> +		return oc->chosen_memcg;
> +
> +	/* Kill a task in the chosen memcg with the biggest memory footprint */
> +	oc->chosen_points = 0;
> +	oc->chosen_task = NULL;
> +	mem_cgroup_scan_tasks(oc->chosen_memcg, oom_evaluate_task, oc);
> +
> +	if (oc->chosen_task == NULL || oc->chosen_task == INFLIGHT_VICTIM)
> +		goto out;
> +
> +	__oom_kill_process(oc->chosen_task);
> +
> +out:
> +	mem_cgroup_put(oc->chosen_memcg);
> +	return oc->chosen_task;
> +}
> +
>  /*
>   * Determines whether the kernel must panic because of the panic_on_oom sysctl.
>   */
> @@ -1008,6 +1029,7 @@ bool out_of_memory(struct oom_control *oc)
>  {
>  	unsigned long freed = 0;
>  	enum oom_constraint constraint = CONSTRAINT_NONE;
> +	bool delay = false; /* if set, delay next allocation attempt */
>  
>  	if (oom_killer_disabled)
>  		return false;
> @@ -1055,11 +1077,26 @@ bool out_of_memory(struct oom_control *oc)
>  		if (oc->page)
>  			return true;
>  		get_task_struct(current);
> -		oc->chosen = current;
> +		oc->chosen_task = current;
>  		oom_kill_process(oc, "Out of memory (oom_kill_allocating_task)");
>  		return true;
>  	}
>  
> +	if (mem_cgroup_select_oom_victim(oc)) {
> +		oc->page = alloc_pages_before_oomkill(oc);
> +		if (oc->page) {
> +			if (oc->chosen_memcg &&
> +			    oc->chosen_memcg != INFLIGHT_VICTIM)
> +				mem_cgroup_put(oc->chosen_memcg);
> +			return true;
> +		}
> +
> +		if (oom_kill_memcg_victim(oc)) {
> +			delay = true;
> +			goto out;
> +		}
> +	}
> +
>  	select_bad_process(oc);
>  	/*
>  	 * Try really last second allocation attempt after we selected an OOM
> @@ -1068,25 +1105,30 @@ bool out_of_memory(struct oom_control *oc)
>  	 */
>  	oc->page = alloc_pages_before_oomkill(oc);
>  	if (oc->page) {
> -		if (oc->chosen && oc->chosen != (void *)-1UL)
> -			put_task_struct(oc->chosen);
> +		if (oc->chosen_task && oc->chosen_task != INFLIGHT_VICTIM)
> +			put_task_struct(oc->chosen_task);
>  		return true;
>  	}
>  	/* Found nothing?!?! Either we hang forever, or we panic. */
> -	if (!oc->chosen && !is_sysrq_oom(oc) && !is_memcg_oom(oc)) {
> +	if (!oc->chosen_task && !is_sysrq_oom(oc) && !is_memcg_oom(oc)) {
>  		dump_header(oc, NULL);
>  		panic("Out of memory and no killable processes...\n");
>  	}
> -	if (oc->chosen && oc->chosen != (void *)-1UL) {
> +	if (oc->chosen_task && oc->chosen_task != INFLIGHT_VICTIM) {
>  		oom_kill_process(oc, !is_memcg_oom(oc) ? "Out of memory" :
>  				 "Memory cgroup out of memory");
> -		/*
> -		 * Give the killed process a good chance to exit before trying
> -		 * to allocate memory again.
> -		 */
> -		schedule_timeout_killable(1);
> +		delay = true;
>  	}
> -	return !!oc->chosen;
> +
> +out:
> +	/*
> +	 * Give the killed process a good chance to exit before trying
> +	 * to allocate memory again.
> +	 */
> +	if (delay)
> +		schedule_timeout_killable(1);
> +
> +	return !!oc->chosen_task;
>  }
>  
>  /*
> -- 
> 2.14.3
> 

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v13 5/7] mm, oom: add cgroup v2 mount option for cgroup-aware OOM killer
  2017-11-30 15:28 ` [PATCH v13 5/7] mm, oom: add cgroup v2 mount option for cgroup-aware OOM killer Roman Gushchin
@ 2017-12-01  8:41   ` Michal Hocko
  2017-12-01 13:15     ` Roman Gushchin
  0 siblings, 1 reply; 51+ messages in thread
From: Michal Hocko @ 2017-12-01  8:41 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: linux-mm, Vladimir Davydov, Johannes Weiner, Tetsuo Handa,
	David Rientjes, Andrew Morton, Tejun Heo, kernel-team, cgroups,
	linux-doc, linux-kernel, linux-mm

On Thu 30-11-17 15:28:22, Roman Gushchin wrote:
> Add a "groupoom" cgroup v2 mount option to enable the cgroup-aware
> OOM killer. If not set, the OOM selection is performed in
> a "traditional" per-process way.
> 
> The behavior can be changed dynamically by remounting the cgroupfs.

Is it ok to create oom_group if the option is not enabled? This looks
confusing. I forgot all the details about how the cgroup core creates
files, so I do not have a good idea how to fix this.

> Signed-off-by: Roman Gushchin <guro@fb.com>
> Cc: Michal Hocko <mhocko@kernel.org>
> Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> Cc: David Rientjes <rientjes@google.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Tejun Heo <tj@kernel.org>
> Cc: kernel-team@fb.com
> Cc: cgroups@vger.kernel.org
> Cc: linux-doc@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org
> Cc: linux-mm@kvack.org
> ---
>  include/linux/cgroup-defs.h |  5 +++++
>  kernel/cgroup/cgroup.c      | 10 ++++++++++
>  mm/memcontrol.c             |  3 +++
>  3 files changed, 18 insertions(+)
> 
> diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
> index 8b7fd8eeccee..9fb99e25d654 100644
> --- a/include/linux/cgroup-defs.h
> +++ b/include/linux/cgroup-defs.h
> @@ -81,6 +81,11 @@ enum {
>  	 * Enable cpuset controller in v1 cgroup to use v2 behavior.
>  	 */
>  	CGRP_ROOT_CPUSET_V2_MODE = (1 << 4),
> +
> +	/*
> +	 * Enable cgroup-aware OOM killer.
> +	 */
> +	CGRP_GROUP_OOM = (1 << 5),
>  };
>  
>  /* cftype->flags */
> diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
> index 0b1ffe147f24..7338e12979e1 100644
> --- a/kernel/cgroup/cgroup.c
> +++ b/kernel/cgroup/cgroup.c
> @@ -1731,6 +1731,9 @@ static int parse_cgroup_root_flags(char *data, unsigned int *root_flags)
>  		if (!strcmp(token, "nsdelegate")) {
>  			*root_flags |= CGRP_ROOT_NS_DELEGATE;
>  			continue;
> +		} else if (!strcmp(token, "groupoom")) {
> +			*root_flags |= CGRP_GROUP_OOM;
> +			continue;
>  		}
>  
>  		pr_err("cgroup2: unknown option \"%s\"\n", token);
> @@ -1747,6 +1750,11 @@ static void apply_cgroup_root_flags(unsigned int root_flags)
>  			cgrp_dfl_root.flags |= CGRP_ROOT_NS_DELEGATE;
>  		else
>  			cgrp_dfl_root.flags &= ~CGRP_ROOT_NS_DELEGATE;
> +
> +		if (root_flags & CGRP_GROUP_OOM)
> +			cgrp_dfl_root.flags |= CGRP_GROUP_OOM;
> +		else
> +			cgrp_dfl_root.flags &= ~CGRP_GROUP_OOM;
>  	}
>  }
>  
> @@ -1754,6 +1762,8 @@ static int cgroup_show_options(struct seq_file *seq, struct kernfs_root *kf_root
>  {
>  	if (cgrp_dfl_root.flags & CGRP_ROOT_NS_DELEGATE)
>  		seq_puts(seq, ",nsdelegate");
> +	if (cgrp_dfl_root.flags & CGRP_GROUP_OOM)
> +		seq_puts(seq, ",groupoom");
>  	return 0;
>  }
>  
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 5d27a4bbd478..c76d5fb55c5c 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2869,6 +2869,9 @@ bool mem_cgroup_select_oom_victim(struct oom_control *oc)
>  	if (!cgroup_subsys_on_dfl(memory_cgrp_subsys))
>  		return false;
>  
> +	if (!(cgrp_dfl_root.flags & CGRP_GROUP_OOM))
> +		return false;
> +
>  	if (oc->memcg)
>  		root = oc->memcg;
>  	else
> -- 
> 2.14.3

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v13 6/7] mm, oom, docs: describe the cgroup-aware OOM killer
  2017-11-30 15:28 ` [PATCH v13 6/7] mm, oom, docs: describe the " Roman Gushchin
@ 2017-12-01  8:41   ` Michal Hocko
  2017-12-01 17:01     ` Roman Gushchin
  0 siblings, 1 reply; 51+ messages in thread
From: Michal Hocko @ 2017-12-01  8:41 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: linux-mm, Johannes Weiner, Vladimir Davydov, Tetsuo Handa,
	Andrew Morton, David Rientjes, Tejun Heo, kernel-team, cgroups,
	linux-doc, linux-kernel, linux-mm

On Thu 30-11-17 15:28:23, Roman Gushchin wrote:
> @@ -1229,6 +1252,41 @@ to be accessed repeatedly by other cgroups, it may make sense to use
>  POSIX_FADV_DONTNEED to relinquish the ownership of memory areas
>  belonging to the affected files to ensure correct memory ownership.
>  
> +OOM Killer
> +~~~~~~~~~~
> +
> +Cgroup v2 memory controller implements a cgroup-aware OOM killer.
> +It means that it treats cgroups as first class OOM entities.

This should mention groupoom mount option to enable this functionality.

Other than that looks ok to me
Acked-by: Michal Hocko <mhocko@suse.com>
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 51+ messages in thread

* [PATCH] mm, oom: simplify alloc_pages_before_oomkill handling
  2017-11-30 15:28 [PATCH v13 0/7] cgroup-aware OOM killer Roman Gushchin
                   ` (7 preceding siblings ...)
  2017-11-30 20:39 ` [PATCH v13 0/7] cgroup-aware OOM killer Andrew Morton
@ 2017-12-01  9:14 ` Michal Hocko
  2017-12-01 13:26   ` Tetsuo Handa
  2017-12-01 13:32   ` Roman Gushchin
  2018-06-05 11:47 ` [PATCH v13 0/7] cgroup-aware OOM killer Michal Hocko
  9 siblings, 2 replies; 51+ messages in thread
From: Michal Hocko @ 2017-12-01  9:14 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: linux-mm, Vladimir Davydov, Johannes Weiner, Tetsuo Handa,
	David Rientjes, Andrew Morton, Tejun Heo, kernel-team, cgroups,
	linux-doc, linux-kernel, linux-mm

Recently added alloc_pages_before_oomkill gained a new caller with this
patchset, and I think it has grown enough to deserve a simpler code flow.
What do you think about this on top of the series?

---
>From f1f6035ea0df65e7619860b013f2fabdda65233e Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.com>
Date: Fri, 1 Dec 2017 10:05:25 +0100
Subject: [PATCH] mm, oom: simplify alloc_pages_before_oomkill handling

alloc_pages_before_oomkill is the last attempt to allocate memory before
we go and try to kill a process or a memcg. Its success path always has
to properly clean up the oc state (namely the victim reference count).
Let's pull this into alloc_pages_before_oomkill directly rather than risk
somebody forgetting to do it in the future. Also document that we _know_
alloc_pages_before_oomkill violates proper layering and that this is a
pragmatic decision.

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 include/linux/oom.h |  2 +-
 mm/oom_kill.c       | 21 +++------------------
 mm/page_alloc.c     | 24 ++++++++++++++++++++++--
 3 files changed, 26 insertions(+), 21 deletions(-)

diff --git a/include/linux/oom.h b/include/linux/oom.h
index 10f495c8454d..7052e0a20e13 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -121,7 +121,7 @@ extern void oom_killer_enable(void);
 
 extern struct task_struct *find_lock_task_mm(struct task_struct *p);
 
-extern struct page *alloc_pages_before_oomkill(const struct oom_control *oc);
+extern bool alloc_pages_before_oomkill(struct oom_control *oc);
 
 extern int oom_evaluate_task(struct task_struct *task, void *arg);
 
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 4678468bae17..5c2cd299757b 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -1102,8 +1102,7 @@ bool out_of_memory(struct oom_control *oc)
 	if (!is_memcg_oom(oc) && sysctl_oom_kill_allocating_task &&
 	    current->mm && !oom_unkillable_task(current, NULL, oc->nodemask) &&
 	    current->signal->oom_score_adj != OOM_SCORE_ADJ_MIN) {
-		oc->page = alloc_pages_before_oomkill(oc);
-		if (oc->page)
+		if (alloc_pages_before_oomkill(oc))
 			return true;
 		get_task_struct(current);
 		oc->chosen_task = current;
@@ -1112,13 +1111,8 @@ bool out_of_memory(struct oom_control *oc)
 	}
 
 	if (mem_cgroup_select_oom_victim(oc)) {
-		oc->page = alloc_pages_before_oomkill(oc);
-		if (oc->page) {
-			if (oc->chosen_memcg &&
-			    oc->chosen_memcg != INFLIGHT_VICTIM)
-				mem_cgroup_put(oc->chosen_memcg);
+		if (alloc_pages_before_oomkill(oc))
 			return true;
-		}
 
 		if (oom_kill_memcg_victim(oc)) {
 			delay = true;
@@ -1127,17 +1121,8 @@ bool out_of_memory(struct oom_control *oc)
 	}
 
 	select_bad_process(oc);
-	/*
-	 * Try really last second allocation attempt after we selected an OOM
-	 * victim, for somebody might have managed to free memory while we were
-	 * selecting an OOM victim which can take quite some time.
-	 */
-	oc->page = alloc_pages_before_oomkill(oc);
-	if (oc->page) {
-		if (oc->chosen_task && oc->chosen_task != INFLIGHT_VICTIM)
-			put_task_struct(oc->chosen_task);
+	if (alloc_pages_before_oomkill(oc))
 		return true;
-	}
 	/* Found nothing?!?! Either we hang forever, or we panic. */
 	if (!oc->chosen_task && !is_sysrq_oom(oc) && !is_memcg_oom(oc)) {
 		dump_header(oc, NULL);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 0d518e9b2ee8..9e65fa06ee10 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4146,7 +4146,17 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	return page;
 }
 
-struct page *alloc_pages_before_oomkill(const struct oom_control *oc)
+/*
+ * Try really last second allocation attempt after we selected an OOM victim,
+ * for somebody might have managed to free memory while we were selecting an
+ * OOM victim which can take quite some time.
+ *
+ * This function is a blatant layer violation example because we cross the page
+ * allocator and reclaim (OOM killer) layers. Doing this right from the design
+ * POV would be much more complex though so let's close our eyes and pretend
+ * everything is allright.
+ */
+bool alloc_pages_before_oomkill(struct oom_control *oc)
 {
 	/*
 	 * Go through the zonelist yet one more time, keep very high watermark
@@ -4167,7 +4177,17 @@ struct page *alloc_pages_before_oomkill(const struct oom_control *oc)
 	reserve_flags = __gfp_pfmemalloc_flags(gfp_mask);
 	if (reserve_flags)
 		alloc_flags = reserve_flags;
-	return get_page_from_freelist(gfp_mask, oc->order, alloc_flags, oc->ac);
+	oc->page = get_page_from_freelist(gfp_mask, oc->order, alloc_flags, oc->ac);
+	if (!oc->page)
+		return false;
+
+	/*
+	 * we are skipping the remaining oom steps so clean up the pending oc
+	 * state
+	 */
+	if (oc->chosen_task && oc->chosen_task != INFLIGHT_VICTIM)
+		put_task_struct(oc->chosen_task);
+	return true;
 }
 
 static inline bool prepare_alloc_pages(gfp_t gfp_mask, unsigned int order,
-- 
2.15.0

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 51+ messages in thread

* Re: [PATCH v13 5/7] mm, oom: add cgroup v2 mount option for cgroup-aware OOM killer
  2017-12-01  8:41   ` Michal Hocko
@ 2017-12-01 13:15     ` Roman Gushchin
  2017-12-01 13:31       ` Michal Hocko
  0 siblings, 1 reply; 51+ messages in thread
From: Roman Gushchin @ 2017-12-01 13:15 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Vladimir Davydov, Johannes Weiner, Tetsuo Handa,
	David Rientjes, Andrew Morton, Tejun Heo, kernel-team, cgroups,
	linux-doc, linux-kernel, linux-mm

On Fri, Dec 01, 2017 at 09:41:13AM +0100, Michal Hocko wrote:
> On Thu 30-11-17 15:28:22, Roman Gushchin wrote:
> > Add a "groupoom" cgroup v2 mount option to enable the cgroup-aware
> > OOM killer. If not set, the OOM selection is performed in
> > a "traditional" per-process way.
> > 
> > The behavior can be changed dynamically by remounting the cgroupfs.
> 
> Is it ok to create oom_group if the option is not enabled? This looks
> confusing. I forgot all the details about how the cgroup core creates
> files, so I do not have a good idea how to fix this.

I don't think we show/hide interface files dynamically.
Even for things like socket memory, which can be disabled by a boot option,
we don't hide the corresponding stats entry.

So, maybe we just need to return -EAGAIN (or maybe -ENOTSUP) on any
read/write attempt if the option is not enabled?

Thanks!

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH] mm, oom: simplify alloc_pages_before_oomkill handling
  2017-12-01  9:14 ` [PATCH] mm, oom: simplify alloc_pages_before_oomkill handling Michal Hocko
@ 2017-12-01 13:26   ` Tetsuo Handa
  2017-12-01 13:32   ` Roman Gushchin
  1 sibling, 0 replies; 51+ messages in thread
From: Tetsuo Handa @ 2017-12-01 13:26 UTC (permalink / raw)
  To: mhocko, guro
  Cc: linux-mm, vdavydov.dev, hannes, rientjes, akpm, tj, kernel-team,
	cgroups, linux-doc, linux-kernel, linux-mm

Michal Hocko wrote:
> Recently added alloc_pages_before_oomkill gained a new caller with this
> patchset, and I think it has grown enough to deserve a simpler code flow.
> What do you think about this on top of the series?

I'm planning to post the patch below in order to mitigate an OOM lockup
problem caused by scheduling priority. But I'm OK with your patch, because
it will not conflict with the patch below.

----------
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index b2746a7..ef6e951 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3332,13 +3332,14 @@ void warn_alloc(gfp_t gfp_mask, nodemask_t *nodemask, const char *fmt, ...)
 	*did_some_progress = 0;
 
 	/*
-	 * Acquire the oom lock.  If that fails, somebody else is
-	 * making progress for us.
+	 * Acquire the oom lock. If that fails, give enough CPU time to the
+	 * owner of the oom lock in order to help reclaiming memory.
 	 */
-	if (!mutex_trylock(&oom_lock)) {
-		*did_some_progress = 1;
+	while (!mutex_trylock(&oom_lock)) {
+		page = alloc_pages_before_oomkill(oc);
+		if (page)
+			return page;
 		schedule_timeout_uninterruptible(1);
-		return NULL;
 	}
 
 	/* Coredumps can quickly deplete all memory reserves */
----------

^ permalink raw reply related	[flat|nested] 51+ messages in thread

* Re: [PATCH v13 5/7] mm, oom: add cgroup v2 mount option for cgroup-aware OOM killer
  2017-12-01 13:15     ` Roman Gushchin
@ 2017-12-01 13:31       ` Michal Hocko
  2017-12-01 17:00         ` Roman Gushchin
  0 siblings, 1 reply; 51+ messages in thread
From: Michal Hocko @ 2017-12-01 13:31 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: linux-mm, Vladimir Davydov, Johannes Weiner, Tetsuo Handa,
	David Rientjes, Andrew Morton, Tejun Heo, kernel-team, cgroups,
	linux-doc, linux-kernel, linux-mm

On Fri 01-12-17 13:15:38, Roman Gushchin wrote:
[...]
> So, maybe we just need to return -EAGAIN (or maybe -ENOTSUP) on any
> read/write attempt if the option is not enabled?

Yes, that would work as well. ENOTSUP sounds better to me.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH] mm, oom: simplify alloc_pages_before_oomkill handling
  2017-12-01  9:14 ` [PATCH] mm, oom: simplify alloc_pages_before_oomkill handling Michal Hocko
  2017-12-01 13:26   ` Tetsuo Handa
@ 2017-12-01 13:32   ` Roman Gushchin
  2017-12-01 13:54     ` Michal Hocko
  1 sibling, 1 reply; 51+ messages in thread
From: Roman Gushchin @ 2017-12-01 13:32 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Vladimir Davydov, Johannes Weiner, Tetsuo Handa,
	David Rientjes, Andrew Morton, Tejun Heo, kernel-team, cgroups,
	linux-doc, linux-kernel, linux-mm

Hi, Michal!

I totally agree that the out_of_memory() function deserves some refactoring.

But I think there is an issue with your patch (see below):

On Fri, Dec 01, 2017 at 10:14:25AM +0100, Michal Hocko wrote:
> Recently added alloc_pages_before_oomkill gained a new caller with this
> patchset, and I think it has grown enough to deserve a simpler code flow.
> What do you think about this on top of the series?
> 
> ---
> From f1f6035ea0df65e7619860b013f2fabdda65233e Mon Sep 17 00:00:00 2001
> From: Michal Hocko <mhocko@suse.com>
> Date: Fri, 1 Dec 2017 10:05:25 +0100
> Subject: [PATCH] mm, oom: simplify alloc_pages_before_oomkill handling
> 
> alloc_pages_before_oomkill is the last attempt to allocate memory before
> we go and try to kill a process or a memcg. Its success path always has
> to properly clean up the oc state (namely the victim reference count).
> Let's pull this into alloc_pages_before_oomkill directly rather than risk
> somebody forgetting to do it in the future. Also document that we _know_
> alloc_pages_before_oomkill violates proper layering and that this is a
> pragmatic decision.
> 
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> ---
>  include/linux/oom.h |  2 +-
>  mm/oom_kill.c       | 21 +++------------------
>  mm/page_alloc.c     | 24 ++++++++++++++++++++++--
>  3 files changed, 26 insertions(+), 21 deletions(-)
> 
> diff --git a/include/linux/oom.h b/include/linux/oom.h
> index 10f495c8454d..7052e0a20e13 100644
> --- a/include/linux/oom.h
> +++ b/include/linux/oom.h
> @@ -121,7 +121,7 @@ extern void oom_killer_enable(void);
>  
>  extern struct task_struct *find_lock_task_mm(struct task_struct *p);
>  
> -extern struct page *alloc_pages_before_oomkill(const struct oom_control *oc);
> +extern bool alloc_pages_before_oomkill(struct oom_control *oc);
>  
>  extern int oom_evaluate_task(struct task_struct *task, void *arg);
>  
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 4678468bae17..5c2cd299757b 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -1102,8 +1102,7 @@ bool out_of_memory(struct oom_control *oc)
>  	if (!is_memcg_oom(oc) && sysctl_oom_kill_allocating_task &&
>  	    current->mm && !oom_unkillable_task(current, NULL, oc->nodemask) &&
>  	    current->signal->oom_score_adj != OOM_SCORE_ADJ_MIN) {
> -		oc->page = alloc_pages_before_oomkill(oc);
> -		if (oc->page)
> +		if (alloc_pages_before_oomkill(oc))
>  			return true;
>  		get_task_struct(current);
>  		oc->chosen_task = current;
> @@ -1112,13 +1111,8 @@ bool out_of_memory(struct oom_control *oc)
>  	}
>  
>  	if (mem_cgroup_select_oom_victim(oc)) {
> -		oc->page = alloc_pages_before_oomkill(oc);
> -		if (oc->page) {
> -			if (oc->chosen_memcg &&
> -			    oc->chosen_memcg != INFLIGHT_VICTIM)
> -				mem_cgroup_put(oc->chosen_memcg);

You're removing the chosen_memcg release here, but I don't see where you
do it instead. And I'm not sure that putting mem_cgroup_put() into
alloc_pages_before_oomkill() is a way towards simpler code.

I was thinking about a somewhat larger refactoring: splitting out_of_memory()
into the following parts (defined as separate functions): victim selection
(per-process, memcg-aware, or just the allocating task), the last allocation
attempt, and the OOM action (kill process, kill memcg, panic). Hopefully this
can simplify things, but I don't have the code yet.

Thanks!

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH] mm, oom: simplify alloc_pages_before_oomkill handling
  2017-12-01 13:32   ` Roman Gushchin
@ 2017-12-01 13:54     ` Michal Hocko
  0 siblings, 0 replies; 51+ messages in thread
From: Michal Hocko @ 2017-12-01 13:54 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: linux-mm, Vladimir Davydov, Johannes Weiner, Tetsuo Handa,
	David Rientjes, Andrew Morton, Tejun Heo, kernel-team, cgroups,
	linux-doc, linux-kernel, linux-mm

On Fri 01-12-17 13:32:15, Roman Gushchin wrote:
> Hi, Michal!
> 
> I totally agree that out_of_memory() function deserves some refactoring.
> 
> But I think there is an issue with your patch (see below):
> 
> On Fri, Dec 01, 2017 at 10:14:25AM +0100, Michal Hocko wrote:
> > Recently added alloc_pages_before_oomkill gained a new caller with this
> > patchset, and I think it has grown enough to deserve a simpler code flow.
> > What do you think about this on top of the series?
> > 
> > ---
[...]
> > @@ -1112,13 +1111,8 @@ bool out_of_memory(struct oom_control *oc)
> >  	}
> >  
> >  	if (mem_cgroup_select_oom_victim(oc)) {
> > -		oc->page = alloc_pages_before_oomkill(oc);
> > -		if (oc->page) {
> > -			if (oc->chosen_memcg &&
> > -			    oc->chosen_memcg != INFLIGHT_VICTIM)
> > -				mem_cgroup_put(oc->chosen_memcg);
> 
> You're removing the chosen_memcg release here, but I don't see where you
> do it instead. And I'm not sure that putting mem_cgroup_put() into
> alloc_pages_before_oomkill() is a way towards simpler code.

Dohh, I thought I did. But obviously it is not there.

> I was thinking about a somewhat larger refactoring: splitting out_of_memory()
> into the following parts (defined as separate functions): victim selection
> (per-process, memcg-aware, or just the allocating task), the last allocation
> attempt, and the OOM action (kill process, kill memcg, panic). Hopefully this
> can simplify things, but I don't have the code yet.

OK, I will not push it if you have further plans, of course. This just
caught my eye...

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v13 5/7] mm, oom: add cgroup v2 mount option for cgroup-aware OOM killer
  2017-12-01 13:31       ` Michal Hocko
@ 2017-12-01 17:00         ` Roman Gushchin
  0 siblings, 0 replies; 51+ messages in thread
From: Roman Gushchin @ 2017-12-01 17:00 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Vladimir Davydov, Johannes Weiner, Tetsuo Handa, David Rientjes,
	Andrew Morton, Tejun Heo, kernel-team, cgroups, linux-doc,
	linux-kernel, linux-mm

On Fri, Dec 01, 2017 at 02:31:45PM +0100, Michal Hocko wrote:
> On Fri 01-12-17 13:15:38, Roman Gushchin wrote:
> [...]
> > So, maybe we just need to return -EAGAIN (or maybe -ENOTSUP) on any
> > read/write attempt if the option is not enabled?
> 
> Yes, that would work as well. ENOTSUP sounds better to me.
> -- 
> Michal Hocko
> SUSE Labs

>From 78bf2c00abf450bcd993d02a7dc1783144005fbd Mon Sep 17 00:00:00 2001
From: Roman Gushchin <guro@fb.com>
Date: Fri, 1 Dec 2017 14:30:14 +0000
Subject: [PATCH] mm, oom: return error on access to memory.oom_group if
 groupoom is disabled

The cgroup-aware OOM killer depends on a cgroup mount option and is
turned off by default, even though the user interface (the
memory.oom_group file) is always present. As this might be confusing to
a user, let's return -ENOTSUPP on an attempt to access memory.oom_group
if groupoom is not enabled globally.

Example:
  $ cd /sys/fs/cgroup/user.slice/
  $ cat memory.oom_group
    cat: memory.oom_group: Unknown error 524
  $ echo 1 > memory.oom_group
    -bash: echo: write error: Unknown error 524
  $ mount -o remount,groupoom /sys/fs/cgroup
  $ echo 1 > memory.oom_group
  $ cat memory.oom_group
    1

Signed-off-by: Roman Gushchin <guro@fb.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: kernel-team@fb.com
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
---
 mm/memcontrol.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index c76d5fb55c5c..b709ee4f914b 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5482,6 +5482,9 @@ static int memory_oom_group_show(struct seq_file *m, void *v)
 	struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m));
 	bool oom_group = memcg->oom_group;
 
+	if (!(cgrp_dfl_root.flags & CGRP_GROUP_OOM))
+		return -ENOTSUPP;
+
 	seq_printf(m, "%d\n", oom_group);
 
 	return 0;
@@ -5495,6 +5498,9 @@ static ssize_t memory_oom_group_write(struct kernfs_open_file *of,
 	int oom_group;
 	int err;
 
+	if (!(cgrp_dfl_root.flags & CGRP_GROUP_OOM))
+		return -ENOTSUPP;
+
 	err = kstrtoint(strstrip(buf), 0, &oom_group);
 	if (err)
 		return err;
-- 
2.14.3

^ permalink raw reply related	[flat|nested] 51+ messages in thread

* Re: [PATCH v13 6/7] mm, oom, docs: describe the cgroup-aware OOM killer
  2017-12-01  8:41   ` Michal Hocko
@ 2017-12-01 17:01     ` Roman Gushchin
  2017-12-01 17:13       ` Michal Hocko
  0 siblings, 1 reply; 51+ messages in thread
From: Roman Gushchin @ 2017-12-01 17:01 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Johannes Weiner, Vladimir Davydov, Tetsuo Handa,
	Andrew Morton, David Rientjes, Tejun Heo, kernel-team, cgroups,
	linux-doc, linux-kernel, linux-mm

On Fri, Dec 01, 2017 at 09:41:54AM +0100, Michal Hocko wrote:
> On Thu 30-11-17 15:28:23, Roman Gushchin wrote:
> > @@ -1229,6 +1252,41 @@ to be accessed repeatedly by other cgroups, it may make sense to use
> >  POSIX_FADV_DONTNEED to relinquish the ownership of memory areas
> >  belonging to the affected files to ensure correct memory ownership.
> >  
> > +OOM Killer
> > +~~~~~~~~~~
> > +
> > +Cgroup v2 memory controller implements a cgroup-aware OOM killer.
> > +It means that it treats cgroups as first class OOM entities.
> 
> This should mention groupoom mount option to enable this functionality.
> 
> Other than that looks ok to me
> Acked-by: Michal Hocko <mhocko@suse.com>
> -- 
> Michal Hocko
> SUSE Labs


>From 1d9c87128897ee7f27f9651d75b80f73985373e8 Mon Sep 17 00:00:00 2001
From: Roman Gushchin <guro@fb.com>
Date: Fri, 1 Dec 2017 15:34:59 +0000
Subject: [PATCH] mm, oom, docs: document groupoom mount option

Add a note that the cgroup-aware OOM logic is disabled by default
and describe how to enable it.

Signed-off-by: Roman Gushchin <guro@fb.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: kernel-team@fb.com
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
---
 Documentation/cgroup-v2.txt | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/Documentation/cgroup-v2.txt b/Documentation/cgroup-v2.txt
index c80a147f94b7..ff8e92db636d 100644
--- a/Documentation/cgroup-v2.txt
+++ b/Documentation/cgroup-v2.txt
@@ -1049,6 +1049,9 @@ PAGE_SIZE multiple when read back.
 	and will never kill the unkillable task, even if memory.oom_group
 	is set.
 
> +	If the cgroup-aware OOM killer is not enabled, the ENOTSUPP error
> +	is returned on any attempt to access the file.
+
   memory.events
 	A read-only flat-keyed file which exists on non-root cgroups.
 	The following entries are defined.  Unless specified
@@ -1258,6 +1261,12 @@ OOM Killer
 Cgroup v2 memory controller implements a cgroup-aware OOM killer.
 It means that it treats cgroups as first class OOM entities.
 
> +Cgroup-aware OOM logic is turned off by default and requires
> +passing the "groupoom" option when mounting cgroupfs. It can also
> +be enabled by remounting cgroupfs with the following command::
+
+  # mount -o remount,groupoom $MOUNT_POINT
+
 Under OOM conditions the memory controller tries to make the best
 choice of a victim, looking for a memory cgroup with the largest
 memory footprint, considering leaf cgroups and cgroups with the
-- 
2.14.3

^ permalink raw reply related	[flat|nested] 51+ messages in thread

* Re: [PATCH v13 6/7] mm, oom, docs: describe the cgroup-aware OOM killer
  2017-12-01 17:01     ` Roman Gushchin
@ 2017-12-01 17:13       ` Michal Hocko
  0 siblings, 0 replies; 51+ messages in thread
From: Michal Hocko @ 2017-12-01 17:13 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: linux-mm, Johannes Weiner, Vladimir Davydov, Tetsuo Handa,
	Andrew Morton, David Rientjes, Tejun Heo, kernel-team, cgroups,
	linux-doc, linux-kernel, linux-mm

On Fri 01-12-17 17:01:49, Roman Gushchin wrote:
[...]
> diff --git a/Documentation/cgroup-v2.txt b/Documentation/cgroup-v2.txt
> index c80a147f94b7..ff8e92db636d 100644
> --- a/Documentation/cgroup-v2.txt
> +++ b/Documentation/cgroup-v2.txt
> @@ -1049,6 +1049,9 @@ PAGE_SIZE multiple when read back.
>  	and will never kill the unkillable task, even if memory.oom_group
>  	is set.
>  
> +	If the cgroup-aware OOM killer is not enabled, the ENOTSUPP error
> +	is returned on any attempt to access the file.
> +
>    memory.events
>  	A read-only flat-keyed file which exists on non-root cgroups.
>  	The following entries are defined.  Unless specified
> @@ -1258,6 +1261,12 @@ OOM Killer
>  Cgroup v2 memory controller implements a cgroup-aware OOM killer.
>  It means that it treats cgroups as first class OOM entities.
>  
> +Cgroup-aware OOM logic is turned off by default and requires
> +passing the "groupoom" option when mounting cgroupfs. It can also
> +be enabled by remounting cgroupfs with the following command::
> +
> +  # mount -o remount,groupoom $MOUNT_POINT
> +
>  Under OOM conditions the memory controller tries to make the best
>  choice of a victim, looking for a memory cgroup with the largest
>  memory footprint, considering leaf cgroups and cgroups with the

Looks good to me

Thanks!
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v13 3/7] mm, oom: cgroup-aware OOM killer
  2017-11-30 15:28 ` [PATCH v13 3/7] mm, oom: cgroup-aware OOM killer Roman Gushchin
  2017-12-01  8:35   ` Michal Hocko
@ 2017-12-07  1:24   ` Andrew Morton
  2017-12-07 13:39     ` Roman Gushchin
  1 sibling, 1 reply; 51+ messages in thread
From: Andrew Morton @ 2017-12-07  1:24 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: linux-mm, Michal Hocko, Johannes Weiner, Vladimir Davydov,
	Tetsuo Handa, David Rientjes, Tejun Heo, kernel-team, cgroups,
	linux-doc, linux-kernel, linux-mm


As a result of the "stalled MM patches" discussion I've dropped these
three patches:

mm,oom: move last second allocation to inside the OOM killer
mm,oom: use ALLOC_OOM for OOM victim's last second allocation
mm,oom: remove oom_lock serialization from the OOM reaper

and I had to rework this patch as a result.  Please carefully check (and
preferably test) my handiwork in out_of_memory()?



From: Roman Gushchin <guro@fb.com>
Subject: mm, oom: cgroup-aware OOM killer

Traditionally, the OOM killer operates at the process level.  Under OOM
conditions, it finds the process with the highest oom score and kills it.

This behavior doesn't suit systems with many running containers well:

1) There is no fairness between containers.  A small container with a few
   large processes will be chosen over a large one with a huge number of
   small processes.

2) Containers often do not expect that some random process inside will
   be killed.  In many cases a much safer behavior is to kill all tasks
   in the container.  Traditionally, this was implemented in userspace,
   but doing it in the kernel has some advantages, especially in the
   case of a system-wide OOM.

To address these issues, the cgroup-aware OOM killer is introduced.

This patch introduces the core functionality: an ability to select a
memory cgroup as an OOM victim.  Under OOM conditions the OOM killer looks
for the biggest leaf memory cgroup and kills the biggest task belonging to
it.

The following patches will extend this functionality to consider non-leaf
memory cgroups as OOM victims, and also provide an ability to kill all
tasks belonging to the victim cgroup.

The root cgroup is treated as a leaf memory cgroup, so its score is
compared with those of other leaf memory cgroups.  Due to the memcg
statistics implementation, a special approximation is used for
estimating the oom_score of the root memory cgroup: we sum the
oom_score of the belonging processes (or, to be more precise, tasks
owning their mm structures).

Link: http://lkml.kernel.org/r/20171130152824.1591-4-guro@fb.com
Signed-off-by: Roman Gushchin <guro@fb.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: David Rientjes <rientjes@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/memcontrol.h |   17 +++
 include/linux/oom.h        |   12 ++
 mm/memcontrol.c            |  181 +++++++++++++++++++++++++++++++++++
 mm/oom_kill.c              |   72 ++++++++++---
 4 files changed, 262 insertions(+), 20 deletions(-)

diff -puN include/linux/memcontrol.h~mm-oom-cgroup-aware-oom-killer include/linux/memcontrol.h
--- a/include/linux/memcontrol.h~mm-oom-cgroup-aware-oom-killer
+++ a/include/linux/memcontrol.h
@@ -35,6 +35,7 @@ struct mem_cgroup;
 struct page;
 struct mm_struct;
 struct kmem_cache;
+struct oom_control;
 
 /* Cgroup-specific page state, on top of universal node page state */
 enum memcg_stat_item {
@@ -344,6 +345,11 @@ struct mem_cgroup *mem_cgroup_from_css(s
 	return css ? container_of(css, struct mem_cgroup, css) : NULL;
 }
 
+static inline void mem_cgroup_put(struct mem_cgroup *memcg)
+{
+	css_put(&memcg->css);
+}
+
 #define mem_cgroup_from_counter(counter, member)	\
 	container_of(counter, struct mem_cgroup, member)
 
@@ -482,6 +488,8 @@ static inline bool task_in_memcg_oom(str
 
 bool mem_cgroup_oom_synchronize(bool wait);
 
+bool mem_cgroup_select_oom_victim(struct oom_control *oc);
+
 #ifdef CONFIG_MEMCG_SWAP
 extern int do_swap_account;
 #endif
@@ -781,6 +789,10 @@ static inline bool task_in_mem_cgroup(st
 	return true;
 }
 
+static inline void mem_cgroup_put(struct mem_cgroup *memcg)
+{
+}
+
 static inline struct mem_cgroup *
 mem_cgroup_iter(struct mem_cgroup *root,
 		struct mem_cgroup *prev,
@@ -973,6 +985,11 @@ static inline
 void count_memcg_event_mm(struct mm_struct *mm, enum vm_event_item idx)
 {
 }
+
+static inline bool mem_cgroup_select_oom_victim(struct oom_control *oc)
+{
+	return false;
+}
 #endif /* CONFIG_MEMCG */
 
 /* idx can be of type enum memcg_stat_item or node_stat_item */
diff -puN include/linux/oom.h~mm-oom-cgroup-aware-oom-killer include/linux/oom.h
--- a/include/linux/oom.h~mm-oom-cgroup-aware-oom-killer
+++ a/include/linux/oom.h
@@ -10,6 +10,13 @@
 #include <linux/sched/coredump.h> /* MMF_* */
 #include <linux/mm.h> /* VM_FAULT* */
 
+
+/*
+ * Special value returned by victim selection functions to indicate
+ * that there are inflight OOM victims.
+ */
+
 struct zonelist;
 struct notifier_block;
 struct mem_cgroup;
@@ -40,7 +47,8 @@ struct oom_control {
 
 	/* Used by oom implementation, do not set */
 	unsigned long totalpages;
-	struct task_struct *chosen;
+	struct task_struct *chosen_task;
+	struct mem_cgroup *chosen_memcg;
 	unsigned long chosen_points;
 };
 
@@ -102,6 +110,8 @@ extern void oom_killer_enable(void);
 
 extern struct task_struct *find_lock_task_mm(struct task_struct *p);
 
+extern int oom_evaluate_task(struct task_struct *task, void *arg);
+
 /* sysctls */
 extern int sysctl_oom_dump_tasks;
 extern int sysctl_oom_kill_allocating_task;
diff -puN mm/memcontrol.c~mm-oom-cgroup-aware-oom-killer mm/memcontrol.c
--- a/mm/memcontrol.c~mm-oom-cgroup-aware-oom-killer
+++ a/mm/memcontrol.c
@@ -2664,6 +2664,187 @@ static inline bool memcg_has_children(st
 	return ret;
 }
 
+static long memcg_oom_badness(struct mem_cgroup *memcg,
+			      const nodemask_t *nodemask,
+			      unsigned long totalpages)
+{
+	long points = 0;
+	int nid;
+	pg_data_t *pgdat;
+
+	for_each_node_state(nid, N_MEMORY) {
+		if (nodemask && !node_isset(nid, *nodemask))
+			continue;
+
+		points += mem_cgroup_node_nr_lru_pages(memcg, nid,
+				LRU_ALL_ANON | BIT(LRU_UNEVICTABLE));
+
+		pgdat = NODE_DATA(nid);
+		points += lruvec_page_state(mem_cgroup_lruvec(pgdat, memcg),
+					    NR_SLAB_UNRECLAIMABLE);
+	}
+
+	points += memcg_page_state(memcg, MEMCG_KERNEL_STACK_KB) /
+		(PAGE_SIZE / 1024);
+	points += memcg_page_state(memcg, MEMCG_SOCK);
+	points += memcg_page_state(memcg, MEMCG_SWAP);
+
+	return points;
+}
+
+/*
+ * Checks if the given memcg is a valid OOM victim and returns a number,
+ * which means the following:
+ *   -1: there are inflight OOM victim tasks belonging to the memcg
+ *    0: memcg is not eligible, e.g. all belonging tasks are protected
+ *       by oom_score_adj set to OOM_SCORE_ADJ_MIN
+ *   >0: memcg is eligible, and the returned value is an estimation
+ *       of the memory footprint
+ */
+static long oom_evaluate_memcg(struct mem_cgroup *memcg,
+			       const nodemask_t *nodemask,
+			       unsigned long totalpages)
+{
+	struct css_task_iter it;
+	struct task_struct *task;
+	int eligible = 0;
+
+	/*
+	 * Root memory cgroup is a special case:
+	 * we don't have the necessary stats to evaluate it exactly
+	 * as leaf memory cgroups, so we approximate its oom_score
+	 * by summing the oom_score of all belonging tasks that are
+	 * owners of their mm structs.
+	 *
+	 * If there are inflight OOM victim tasks inside
+	 * the root memcg, we return -1.
+	 */
+	if (memcg == root_mem_cgroup) {
+		struct css_task_iter it;
+		struct task_struct *task;
+		long score = 0;
+
+		css_task_iter_start(&memcg->css, 0, &it);
+		while ((task = css_task_iter_next(&it))) {
+			if (tsk_is_oom_victim(task) &&
+			    !test_bit(MMF_OOM_SKIP,
+				      &task->signal->oom_mm->flags)) {
+				score = -1;
+				break;
+			}
+
+			task_lock(task);
+			if (!task->mm || task->mm->owner != task) {
+				task_unlock(task);
+				continue;
+			}
+			task_unlock(task);
+
+			score += oom_badness(task, memcg, nodemask,
+					     totalpages);
+		}
+		css_task_iter_end(&it);
+
+		return score;
+	}
+
+	/*
+	 * Memcg is OOM eligible if there are OOM killable tasks inside.
+	 *
+	 * We treat tasks with oom_score_adj set to OOM_SCORE_ADJ_MIN
+	 * as unkillable.
+	 *
+	 * If there are inflight OOM victim tasks inside the memcg,
+	 * we return -1.
+	 */
+	css_task_iter_start(&memcg->css, 0, &it);
+	while ((task = css_task_iter_next(&it))) {
+		if (!eligible &&
+		    task->signal->oom_score_adj != OOM_SCORE_ADJ_MIN)
+			eligible = 1;
+
+		if (tsk_is_oom_victim(task) &&
+		    !test_bit(MMF_OOM_SKIP, &task->signal->oom_mm->flags)) {
+			eligible = -1;
+			break;
+		}
+	}
+	css_task_iter_end(&it);
+
+	if (eligible <= 0)
+		return eligible;
+
+	return memcg_oom_badness(memcg, nodemask, totalpages);
+}
+
+static void select_victim_memcg(struct mem_cgroup *root, struct oom_control *oc)
+{
+	struct mem_cgroup *iter;
+
+	oc->chosen_memcg = NULL;
+	oc->chosen_points = 0;
+
+	/*
+	 * The oom_score is calculated for leaf memory cgroups (including
+	 * the root memcg).
+	 */
+	rcu_read_lock();
+	for_each_mem_cgroup_tree(iter, root) {
+		long score;
+
+		if (memcg_has_children(iter) && iter != root_mem_cgroup)
+			continue;
+
+		score = oom_evaluate_memcg(iter, oc->nodemask, oc->totalpages);
+
+		/*
+		 * Ignore empty and non-eligible memory cgroups.
+		 */
+		if (score == 0)
+			continue;
+
+		/*
+		 * If there are inflight OOM victims, we don't need
+		 * to look further for new victims.
+		 */
+		if (score == -1) {
+			oc->chosen_memcg = INFLIGHT_VICTIM;
+			mem_cgroup_iter_break(root, iter);
+			break;
+		}
+
+		if (score > oc->chosen_points) {
+			oc->chosen_points = score;
+			oc->chosen_memcg = iter;
+		}
+	}
+
+	if (oc->chosen_memcg && oc->chosen_memcg != INFLIGHT_VICTIM)
+		css_get(&oc->chosen_memcg->css);
+
+	rcu_read_unlock();
+}
+
+bool mem_cgroup_select_oom_victim(struct oom_control *oc)
+{
+	struct mem_cgroup *root;
+
+	if (mem_cgroup_disabled())
+		return false;
+
+	if (!cgroup_subsys_on_dfl(memory_cgrp_subsys))
+		return false;
+
+	if (oc->memcg)
+		root = oc->memcg;
+	else
+		root = root_mem_cgroup;
+
+	select_victim_memcg(root, oc);
+
+	return oc->chosen_memcg;
+}
+
 /*
  * Reclaims as many pages from the given memcg as possible.
  *
diff -puN mm/oom_kill.c~mm-oom-cgroup-aware-oom-killer mm/oom_kill.c
--- a/mm/oom_kill.c~mm-oom-cgroup-aware-oom-killer
+++ a/mm/oom_kill.c
@@ -309,7 +309,7 @@ static enum oom_constraint constrained_a
 	return CONSTRAINT_NONE;
 }
 
-static int oom_evaluate_task(struct task_struct *task, void *arg)
+int oom_evaluate_task(struct task_struct *task, void *arg)
 {
 	struct oom_control *oc = arg;
 	unsigned long points;
@@ -343,26 +343,26 @@ static int oom_evaluate_task(struct task
 		goto next;
 
 	/* Prefer thread group leaders for display purposes */
-	if (points == oc->chosen_points && thread_group_leader(oc->chosen))
+	if (points == oc->chosen_points && thread_group_leader(oc->chosen_task))
 		goto next;
 select:
-	if (oc->chosen)
-		put_task_struct(oc->chosen);
+	if (oc->chosen_task)
+		put_task_struct(oc->chosen_task);
 	get_task_struct(task);
-	oc->chosen = task;
+	oc->chosen_task = task;
 	oc->chosen_points = points;
 next:
 	return 0;
 abort:
-	if (oc->chosen)
-		put_task_struct(oc->chosen);
-	oc->chosen = (void *)-1UL;
+	if (oc->chosen_task)
+		put_task_struct(oc->chosen_task);
+	oc->chosen_task = INFLIGHT_VICTIM;
 	return 1;
 }
 
 /*
  * Simple selection loop. We choose the process with the highest number of
- * 'points'. In case scan was aborted, oc->chosen is set to -1.
+ * 'points'. In case scan was aborted, oc->chosen_task is set to -1.
  */
 static void select_bad_process(struct oom_control *oc)
 {
@@ -912,7 +912,7 @@ static void __oom_kill_process(struct ta
 
 static void oom_kill_process(struct oom_control *oc, const char *message)
 {
-	struct task_struct *p = oc->chosen;
+	struct task_struct *p = oc->chosen_task;
 	unsigned int points = oc->chosen_points;
 	struct task_struct *victim = p;
 	struct task_struct *child;
@@ -973,6 +973,27 @@ static void oom_kill_process(struct oom_
 	__oom_kill_process(victim);
 }
 
+static bool oom_kill_memcg_victim(struct oom_control *oc)
+{
+
+	if (oc->chosen_memcg == NULL || oc->chosen_memcg == INFLIGHT_VICTIM)
+		return oc->chosen_memcg;
+
+	/* Kill a task in the chosen memcg with the biggest memory footprint */
+	oc->chosen_points = 0;
+	oc->chosen_task = NULL;
+	mem_cgroup_scan_tasks(oc->chosen_memcg, oom_evaluate_task, oc);
+
+	if (oc->chosen_task == NULL || oc->chosen_task == INFLIGHT_VICTIM)
+		goto out;
+
+	__oom_kill_process(oc->chosen_task);
+
+out:
+	mem_cgroup_put(oc->chosen_memcg);
+	return oc->chosen_task;
+}
+
 /*
  * Determines whether the kernel must panic because of the panic_on_oom sysctl.
  */
@@ -1025,6 +1046,7 @@ bool out_of_memory(struct oom_control *o
 {
 	unsigned long freed = 0;
 	enum oom_constraint constraint = CONSTRAINT_NONE;
+	bool delay = false; /* if set, delay next allocation attempt */
 
 	if (oom_killer_disabled)
 		return false;
@@ -1069,27 +1091,39 @@ bool out_of_memory(struct oom_control *o
 	    current->mm && !oom_unkillable_task(current, NULL, oc->nodemask) &&
 	    current->signal->oom_score_adj != OOM_SCORE_ADJ_MIN) {
 		get_task_struct(current);
-		oc->chosen = current;
+		oc->chosen_task = current;
 		oom_kill_process(oc, "Out of memory (oom_kill_allocating_task)");
 		return true;
 	}
 
+	if (mem_cgroup_select_oom_victim(oc)) {
+		if (oom_kill_memcg_victim(oc)) {
+			delay = true;
+			goto out;
+		}
+	}
+
 	select_bad_process(oc);
 	/* Found nothing?!?! Either we hang forever, or we panic. */
-	if (!oc->chosen && !is_sysrq_oom(oc) && !is_memcg_oom(oc)) {
+	if (!oc->chosen_task && !is_sysrq_oom(oc) && !is_memcg_oom(oc)) {
 		dump_header(oc, NULL);
 		panic("Out of memory and no killable processes...\n");
 	}
-	if (oc->chosen && oc->chosen != (void *)-1UL) {
+	if (oc->chosen_task && oc->chosen_task != (void *)-1UL) {
 		oom_kill_process(oc, !is_memcg_oom(oc) ? "Out of memory" :
 				 "Memory cgroup out of memory");
-		/*
-		 * Give the killed process a good chance to exit before trying
-		 * to allocate memory again.
-		 */
-		schedule_timeout_killable(1);
+		delay = true;
 	}
-	return !!oc->chosen;
+
+out:
+	/*
+	 * Give the killed process a good chance to exit before trying
+	 * to allocate memory again.
+	 */
+	if (delay)
+		schedule_timeout_killable(1);
+
+	return !!oc->chosen_task;
 }
 
 /*
_

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v13 3/7] mm, oom: cgroup-aware OOM killer
  2017-12-07  1:24   ` Andrew Morton
@ 2017-12-07 13:39     ` Roman Gushchin
  0 siblings, 0 replies; 51+ messages in thread
From: Roman Gushchin @ 2017-12-07 13:39 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, Michal Hocko, Johannes Weiner, Vladimir Davydov,
	Tetsuo Handa, David Rientjes, Tejun Heo, kernel-team, cgroups,
	linux-doc, linux-kernel, linux-mm

On Wed, Dec 06, 2017 at 05:24:13PM -0800, Andrew Morton wrote:
> 
> As a result of the "stalled MM patches" discussion I've dropped these
> three patches:
> 
> mm,oom: move last second allocation to inside the OOM killer
> mm,oom: use ALLOC_OOM for OOM victim's last second allocation
> mm,oom: remove oom_lock serialization from the OOM reaper
> 
> and I had to rework this patch as a result.  Please carefully check (and
> preferably test) my handiwork in out_of_memory()?

Hi, Andrew!

Reviewed and tested, looks good to me. Thank you!

A couple of small nits below.

> 
> 
> 
> From: Roman Gushchin <guro@fb.com>
> Subject: mm, oom: cgroup-aware OOM killer
> 
> Traditionally, the OOM killer operates at the process level.  Under OOM
> conditions, it finds the process with the highest oom score and kills it.
> 
> This behavior doesn't suit systems with many running containers well:
> 
> 1) There is no fairness between containers.  A small container with a few
>    large processes will be chosen over a large one with a huge number of
>    small processes.
> 
> 2) Containers often do not expect that some random process inside will
>    be killed.  In many cases a much safer behavior is to kill all tasks
>    in the container.  Traditionally, this was implemented in userspace,
>    but doing it in the kernel has some advantages, especially in the
>    case of a system-wide OOM.
> 
> To address these issues, the cgroup-aware OOM killer is introduced.
> 
> This patch introduces the core functionality: an ability to select a
> memory cgroup as an OOM victim.  Under OOM conditions the OOM killer looks
> for the biggest leaf memory cgroup and kills the biggest task belonging to
> it.
> 
> The following patches will extend this functionality to consider non-leaf
> memory cgroups as OOM victims, and also provide an ability to kill all
> tasks belonging to the victim cgroup.
> 
> The root cgroup is treated as a leaf memory cgroup, so its score is
> compared with those of other leaf memory cgroups.  Due to the memcg
> statistics implementation, a special approximation is used for
> estimating the oom_score of the root memory cgroup: we sum the
> oom_score of the belonging processes (or, to be more precise, tasks
> owning their mm structures).
> 
> Link: http://lkml.kernel.org/r/20171130152824.1591-4-guro@fb.com
> Signed-off-by: Roman Gushchin <guro@fb.com>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> Cc: David Rientjes <rientjes@google.com>
> Cc: Tejun Heo <tj@kernel.org>
> Cc: Michal Hocko <mhocko@kernel.org>
> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
> ---
> 
>  include/linux/memcontrol.h |   17 +++
>  include/linux/oom.h        |   12 ++
>  mm/memcontrol.c            |  181 +++++++++++++++++++++++++++++++++++
>  mm/oom_kill.c              |   72 ++++++++++---
>  4 files changed, 262 insertions(+), 20 deletions(-)
> 
> diff -puN include/linux/memcontrol.h~mm-oom-cgroup-aware-oom-killer include/linux/memcontrol.h
> --- a/include/linux/memcontrol.h~mm-oom-cgroup-aware-oom-killer
> +++ a/include/linux/memcontrol.h
> @@ -35,6 +35,7 @@ struct mem_cgroup;
>  struct page;
>  struct mm_struct;
>  struct kmem_cache;
> +struct oom_control;
>  
>  /* Cgroup-specific page state, on top of universal node page state */
>  enum memcg_stat_item {
> @@ -344,6 +345,11 @@ struct mem_cgroup *mem_cgroup_from_css(s
>  	return css ? container_of(css, struct mem_cgroup, css) : NULL;
>  }
>  
> +static inline void mem_cgroup_put(struct mem_cgroup *memcg)
> +{
> +	css_put(&memcg->css);
> +}
> +
>  #define mem_cgroup_from_counter(counter, member)	\
>  	container_of(counter, struct mem_cgroup, member)
>  
> @@ -482,6 +488,8 @@ static inline bool task_in_memcg_oom(str
>  
>  bool mem_cgroup_oom_synchronize(bool wait);
>  
> +bool mem_cgroup_select_oom_victim(struct oom_control *oc);
> +
>  #ifdef CONFIG_MEMCG_SWAP
>  extern int do_swap_account;
>  #endif
> @@ -781,6 +789,10 @@ static inline bool task_in_mem_cgroup(st
>  	return true;
>  }
>  
> +static inline void mem_cgroup_put(struct mem_cgroup *memcg)
> +{
> +}
> +
>  static inline struct mem_cgroup *
>  mem_cgroup_iter(struct mem_cgroup *root,
>  		struct mem_cgroup *prev,
> @@ -973,6 +985,11 @@ static inline
>  void count_memcg_event_mm(struct mm_struct *mm, enum vm_event_item idx)
>  {
>  }
> +
> +static inline bool mem_cgroup_select_oom_victim(struct oom_control *oc)
> +{
> +	return false;
> +}
>  #endif /* CONFIG_MEMCG */
>  
>  /* idx can be of type enum memcg_stat_item or node_stat_item */
> diff -puN include/linux/oom.h~mm-oom-cgroup-aware-oom-killer include/linux/oom.h
> --- a/include/linux/oom.h~mm-oom-cgroup-aware-oom-killer
> +++ a/include/linux/oom.h
> @@ -10,6 +10,13 @@
>  #include <linux/sched/coredump.h> /* MMF_* */
>  #include <linux/mm.h> /* VM_FAULT* */
>  
> +
> +/*
> + * Special value returned by victim selection functions to indicate
> + * that there are inflight OOM victims.
> + */
> +#define INFLIGHT_VICTIM ((void *)-1UL)
> +
>  struct zonelist;
>  struct notifier_block;
>  struct mem_cgroup;
> @@ -40,7 +47,8 @@ struct oom_control {
>  
>  	/* Used by oom implementation, do not set */
>  	unsigned long totalpages;
> -	struct task_struct *chosen;
> +	struct task_struct *chosen_task;
> +	struct mem_cgroup *chosen_memcg;
>  	unsigned long chosen_points;
>  };
>  
> @@ -102,6 +110,8 @@ extern void oom_killer_enable(void);
>  
>  extern struct task_struct *find_lock_task_mm(struct task_struct *p);
>  
> +extern int oom_evaluate_task(struct task_struct *task, void *arg);
> +
>  /* sysctls */
>  extern int sysctl_oom_dump_tasks;
>  extern int sysctl_oom_kill_allocating_task;
> diff -puN mm/memcontrol.c~mm-oom-cgroup-aware-oom-killer mm/memcontrol.c
> --- a/mm/memcontrol.c~mm-oom-cgroup-aware-oom-killer
> +++ a/mm/memcontrol.c
> @@ -2664,6 +2664,187 @@ static inline bool memcg_has_children(st
>  	return ret;
>  }
>  
> +static long memcg_oom_badness(struct mem_cgroup *memcg,
> +			      const nodemask_t *nodemask,
> +			      unsigned long totalpages)
> +{
> +	long points = 0;
> +	int nid;
> +	pg_data_t *pgdat;
> +
> +	for_each_node_state(nid, N_MEMORY) {
> +		if (nodemask && !node_isset(nid, *nodemask))
> +			continue;
> +
> +		points += mem_cgroup_node_nr_lru_pages(memcg, nid,
> +				LRU_ALL_ANON | BIT(LRU_UNEVICTABLE));
> +
> +		pgdat = NODE_DATA(nid);
> +		points += lruvec_page_state(mem_cgroup_lruvec(pgdat, memcg),
> +					    NR_SLAB_UNRECLAIMABLE);
> +	}
> +
> +	points += memcg_page_state(memcg, MEMCG_KERNEL_STACK_KB) /
> +		(PAGE_SIZE / 1024);
> +	points += memcg_page_state(memcg, MEMCG_SOCK);
> +	points += memcg_page_state(memcg, MEMCG_SWAP);
> +
> +	return points;
> +}
> +
> +/*
> + * Checks if the given memcg is a valid OOM victim and returns a number,
> + * which means the following:
> + *   -1: there are inflight OOM victim tasks belonging to the memcg
> + *    0: memcg is not eligible, e.g. all belonging tasks are protected
> + *       by oom_score_adj set to OOM_SCORE_ADJ_MIN
> + *   >0: memcg is eligible, and the returned value is an estimation
> + *       of the memory footprint
> + */
> +static long oom_evaluate_memcg(struct mem_cgroup *memcg,
> +			       const nodemask_t *nodemask,
> +			       unsigned long totalpages)
> +{
> +	struct css_task_iter it;
> +	struct task_struct *task;
> +	int eligible = 0;
> +
> +	/*
> +	 * Root memory cgroup is a special case:
> +	 * we don't have the necessary stats to evaluate it exactly
> +	 * as leaf memory cgroups, so we approximate its oom_score
> +	 * by summing the oom_score of all belonging tasks that are
> +	 * owners of their mm structs.
> +	 *
> +	 * If there are inflight OOM victim tasks inside
> +	 * the root memcg, we return -1.
> +	 */
> +	if (memcg == root_mem_cgroup) {
> +		struct css_task_iter it;
> +		struct task_struct *task;
> +		long score = 0;
> +
> +		css_task_iter_start(&memcg->css, 0, &it);
> +		while ((task = css_task_iter_next(&it))) {
> +			if (tsk_is_oom_victim(task) &&
> +			    !test_bit(MMF_OOM_SKIP,
> +				      &task->signal->oom_mm->flags)) {
> +				score = -1;
> +				break;
> +			}
> +
> +			task_lock(task);
> +			if (!task->mm || task->mm->owner != task) {
> +				task_unlock(task);
> +				continue;
> +			}
> +			task_unlock(task);
> +
> +			score += oom_badness(task, memcg, nodemask,
> +					     totalpages);
> +		}
> +		css_task_iter_end(&it);
> +
> +		return score;
> +	}
> +
> +	/*
> +	 * Memcg is OOM eligible if there are OOM killable tasks inside.
> +	 *
> +	 * We treat tasks with oom_score_adj set to OOM_SCORE_ADJ_MIN
> +	 * as unkillable.
> +	 *
> +	 * If there are inflight OOM victim tasks inside the memcg,
> +	 * we return -1.
> +	 */
> +	css_task_iter_start(&memcg->css, 0, &it);
> +	while ((task = css_task_iter_next(&it))) {
> +		if (!eligible &&
> +		    task->signal->oom_score_adj != OOM_SCORE_ADJ_MIN)
> +			eligible = 1;
> +
> +		if (tsk_is_oom_victim(task) &&
> +		    !test_bit(MMF_OOM_SKIP, &task->signal->oom_mm->flags)) {
> +			eligible = -1;
> +			break;
> +		}
> +	}
> +	css_task_iter_end(&it);
> +
> +	if (eligible <= 0)
> +		return eligible;
> +
> +	return memcg_oom_badness(memcg, nodemask, totalpages);
> +}
> +
> +static void select_victim_memcg(struct mem_cgroup *root, struct oom_control *oc)
> +{
> +	struct mem_cgroup *iter;
> +
> +	oc->chosen_memcg = NULL;
> +	oc->chosen_points = 0;
> +
> +	/*
> +	 * The oom_score is calculated for leaf memory cgroups (including
> +	 * the root memcg).
> +	 */
> +	rcu_read_lock();
> +	for_each_mem_cgroup_tree(iter, root) {
> +		long score;
> +
> +		if (memcg_has_children(iter) && iter != root_mem_cgroup)
> +			continue;
> +
> +		score = oom_evaluate_memcg(iter, oc->nodemask, oc->totalpages);
> +
> +		/*
> +		 * Ignore empty and non-eligible memory cgroups.
> +		 */
> +		if (score == 0)
> +			continue;
> +
> +		/*
> +		 * If there are inflight OOM victims, we don't need
> +		 * to look further for new victims.
> +		 */
> +		if (score == -1) {
> +			oc->chosen_memcg = INFLIGHT_VICTIM;
> +			mem_cgroup_iter_break(root, iter);
> +			break;
> +		}
> +
> +		if (score > oc->chosen_points) {
> +			oc->chosen_points = score;
> +			oc->chosen_memcg = iter;
> +		}
> +	}
> +
> +	if (oc->chosen_memcg && oc->chosen_memcg != INFLIGHT_VICTIM)
> +		css_get(&oc->chosen_memcg->css);
> +
> +	rcu_read_unlock();
> +}
> +
> +bool mem_cgroup_select_oom_victim(struct oom_control *oc)
> +{
> +	struct mem_cgroup *root;
> +
> +	if (mem_cgroup_disabled())
> +		return false;
> +
> +	if (!cgroup_subsys_on_dfl(memory_cgrp_subsys))
> +		return false;
> +
> +	if (oc->memcg)
> +		root = oc->memcg;
> +	else
> +		root = root_mem_cgroup;
> +
> +	select_victim_memcg(root, oc);
> +
> +	return oc->chosen_memcg;
> +}
> +
>  /*
>   * Reclaims as many pages from the given memcg as possible.
>   *
> diff -puN mm/oom_kill.c~mm-oom-cgroup-aware-oom-killer mm/oom_kill.c
> --- a/mm/oom_kill.c~mm-oom-cgroup-aware-oom-killer
> +++ a/mm/oom_kill.c
> @@ -309,7 +309,7 @@ static enum oom_constraint constrained_a
>  	return CONSTRAINT_NONE;
>  }
>  
> -static int oom_evaluate_task(struct task_struct *task, void *arg)
> +int oom_evaluate_task(struct task_struct *task, void *arg)
>  {
>  	struct oom_control *oc = arg;
>  	unsigned long points;
> @@ -343,26 +343,26 @@ static int oom_evaluate_task(struct task
>  		goto next;
>  
>  	/* Prefer thread group leaders for display purposes */
> -	if (points == oc->chosen_points && thread_group_leader(oc->chosen))
> +	if (points == oc->chosen_points && thread_group_leader(oc->chosen_task))
>  		goto next;
>  select:
> -	if (oc->chosen)
> -		put_task_struct(oc->chosen);
> +	if (oc->chosen_task)
> +		put_task_struct(oc->chosen_task);
>  	get_task_struct(task);
> -	oc->chosen = task;
> +	oc->chosen_task = task;
>  	oc->chosen_points = points;
>  next:
>  	return 0;
>  abort:
> -	if (oc->chosen)
> -		put_task_struct(oc->chosen);
> -	oc->chosen = (void *)-1UL;
> +	if (oc->chosen_task)
> +		put_task_struct(oc->chosen_task);
> +	oc->chosen_task = INFLIGHT_VICTIM;
>  	return 1;
>  }
>  
>  /*
>   * Simple selection loop. We choose the process with the highest number of
> - * 'points'. In case scan was aborted, oc->chosen is set to -1.
> + * 'points'. In case scan was aborted, oc->chosen_task is set to -1.
>   */
>  static void select_bad_process(struct oom_control *oc)
>  {
> @@ -912,7 +912,7 @@ static void __oom_kill_process(struct ta
>  
>  static void oom_kill_process(struct oom_control *oc, const char *message)
>  {
> -	struct task_struct *p = oc->chosen;
> +	struct task_struct *p = oc->chosen_task;
>  	unsigned int points = oc->chosen_points;
>  	struct task_struct *victim = p;
>  	struct task_struct *child;
> @@ -973,6 +973,27 @@ static void oom_kill_process(struct oom_
>  	__oom_kill_process(victim);
>  }
>  
> +static bool oom_kill_memcg_victim(struct oom_control *oc)
> +{
> +
> +	if (oc->chosen_memcg == NULL || oc->chosen_memcg == INFLIGHT_VICTIM)
> +		return oc->chosen_memcg;
> +
> +	/* Kill a task in the chosen memcg with the biggest memory footprint */
> +	oc->chosen_points = 0;
> +	oc->chosen_task = NULL;
> +	mem_cgroup_scan_tasks(oc->chosen_memcg, oom_evaluate_task, oc);
> +
> +	if (oc->chosen_task == NULL || oc->chosen_task == INFLIGHT_VICTIM)
> +		goto out;
> +
> +	__oom_kill_process(oc->chosen_task);
> +
> +out:
> +	mem_cgroup_put(oc->chosen_memcg);
> +	return oc->chosen_task;
> +}
> +
>  /*
>   * Determines whether the kernel must panic because of the panic_on_oom sysctl.
>   */
> @@ -1025,6 +1046,7 @@ bool out_of_memory(struct oom_control *o
>  {
>  	unsigned long freed = 0;
>  	enum oom_constraint constraint = CONSTRAINT_NONE;
> +	bool delay = false; /* if set, delay next allocation attempt */
>  
>  	if (oom_killer_disabled)
>  		return false;
> @@ -1069,27 +1091,39 @@ bool out_of_memory(struct oom_control *o
>  	    current->mm && !oom_unkillable_task(current, NULL, oc->nodemask) &&
>  	    current->signal->oom_score_adj != OOM_SCORE_ADJ_MIN) {
>  		get_task_struct(current);
> -		oc->chosen = current;
> +		oc->chosen_task = current;
>  		oom_kill_process(oc, "Out of memory (oom_kill_allocating_task)");
>  		return true;
>  	}
>  
> +	if (mem_cgroup_select_oom_victim(oc)) {
> +		if (oom_kill_memcg_victim(oc)) {
> +			delay = true;
> +			goto out;
> +		}
> +	}

It's probably better to join the two conditions with &&:
	if (mem_cgroup_select_oom_victim(oc) && oom_kill_memcg_victim(oc)) {
		delay = true;
		goto out;
	}


> +
>  	select_bad_process(oc);
>  	/* Found nothing?!?! Either we hang forever, or we panic. */
> -	if (!oc->chosen && !is_sysrq_oom(oc) && !is_memcg_oom(oc)) {
> +	if (!oc->chosen_task && !is_sysrq_oom(oc) && !is_memcg_oom(oc)) {
>  		dump_header(oc, NULL);
>  		panic("Out of memory and no killable processes...\n");
>  	}
> -	if (oc->chosen && oc->chosen != (void *)-1UL) {
> +	if (oc->chosen_task && oc->chosen_task != (void *)-1UL) {

It's better to replace "(void *)-1UL" here with "INFLIGHT_VICTIM"
to be consistent:
	if (oc->chosen_task && oc->chosen_task != INFLIGHT_VICTIM) ...

>  		oom_kill_process(oc, !is_memcg_oom(oc) ? "Out of memory" :
>  				 "Memory cgroup out of memory");

--

Also, I've sent a couple of small fixes, requested by Michal:

1) https://lkml.org/lkml/2017/12/1/569
   (mm, oom, docs: document groupoom mount option)

2) https://lkml.org/lkml/2017/12/1/568
   (mm, oom: return error on access to memory.oom_group if groupoom is disabled)

Can you please pull them?


Thank you!


Roman

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v13 0/7] cgroup-aware OOM killer
  2017-11-30 20:39 ` [PATCH v13 0/7] cgroup-aware OOM killer Andrew Morton
@ 2018-01-10  0:57   ` David Rientjes
  2018-01-10 13:11     ` Roman Gushchin
  0 siblings, 1 reply; 51+ messages in thread
From: David Rientjes @ 2018-01-10  0:57 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Roman Gushchin, linux-mm, Michal Hocko, Vladimir Davydov,
	Johannes Weiner, Tetsuo Handa, Tejun Heo, kernel-team, cgroups,
	linux-doc, linux-kernel, linux-mm

On Thu, 30 Nov 2017, Andrew Morton wrote:

> > This patchset makes the OOM killer cgroup-aware.
> 
> Thanks, I'll grab these.
> 
> There has been controversy over this patchset, to say the least.  I
> can't say that I followed it closely!  Could those who still have
> reservations please summarise their concerns and hopefully suggest a
> way forward?
> 

Yes, I'll summarize what my concerns have been in the past and what they 
are wrt the patchset as it stands in -mm.  None of them originate from my 
current usecase or anticipated future usecase of the oom killer for 
system-wide or memcg-constrained oom conditions.  They are based purely on 
the patchset's use of an incomplete and unfair heuristic for deciding 
which cgroup to target.

I'll also suggest simple changes to the patchset, which I have in the 
past, that can be made to address all of these concerns.

1. The unfair comparison of the root mem cgroup vs leaf mem cgroups

The patchset uses two different heuristics to compare root and leaf mem 
cgroups and scores them based on number of pages.  For the root mem 
cgroup, it totals the /proc/pid/oom_score of all processes attached: 
that's based on rss, swap, pgtables, and, most importantly, oom_score_adj.  
For leaf mem cgroups, it's based on that memcg's anonymous, unevictable, 
unreclaimable slab, kernel stack, and swap counters.  These can be wildly 
different independent of /proc/pid/oom_score_adj, but the most obvious 
unfairness comes from users who tune oom_score_adj.

An example: start a process that faults 1GB of anonymous memory and leave 
it attached to the root mem cgroup.  Start six more processes that each 
fault 1GB of anonymous memory and attach them to a leaf mem cgroup.  Set 
all processes to have /proc/pid/oom_score_adj of 1000.  System oom kill 
will always kill the 1GB process attached to the root mem cgroup.  It's 
because oom_badness() relies on /proc/pid/oom_score_adj, which is used to 
evaluate the root mem cgroup, and leaf mem cgroups completely disregard 
it.

In this example, the leaf mem cgroup's score is 1,573,044, the number of 
pages for the 6GB of faulted memory.  The root mem cgroup's score is 
12,652,907, eight times larger even though its usage is six times smaller.

This is caused by the patchset disregarding oom_score_adj entirely for 
leaf mem cgroups and relying on it heavily for the root mem cgroup.  It's 
the complete opposite result of what the cgroup aware oom killer 
advertises.

It also works the other way, if a large memory hog is attached to the root 
mem cgroup but has a negative oom_score_adj it is never killed and random 
processes are nuked solely because they happened to be attached to a leaf 
mem cgroup.  This behavior wrt oom_score_adj is completely undocumented, 
so I can't presume that it is either known or tested.

Solution: compare the root mem cgroup and leaf mem cgroups equally with 
the same criteria by doing hierarchical accounting of usage and 
subtracting from total system usage to find root usage.

2. Evading the oom killer by attaching processes to child cgroups

Any cgroup on the system can attach all their processes to individual 
child cgroups.  This is functionally the same as doing

	for i in $(cat cgroup.procs); do mkdir $i; echo $i > $i/cgroup.procs; done

without the no internal process constraint introduced with cgroup v2.  All 
child cgroups are evaluated based on their own usage: all anon, 
unevictable, and unreclaimable slab as described previously.  It requires 
an individual cgroup to be the single largest consumer to be targeted by 
the oom killer.

An example: allow users to manage two different mem cgroup hierarchies 
limited to 100GB each.  User A uses 10GB of memory and user B uses 90GB of 
memory in their respective hierarchies.  On a system oom condition, we'd 
expect at least one process from user B's hierarchy would always be oom 
killed with the cgroup aware oom killer.  In fact, the changelog 
explicitly states it solves an issue where "1) There is no fairness 
between containers. A small container with few large processes will be 
chosen over a large one with huge number of small processes."

The opposite becomes true, however, if user B creates child cgroups and 
distributes its processes such that each child cgroup's usage never 
exceeds 10GB of memory.  This can either be done intentionally to 
purposefully have a low cgroup memory footprint to evade the oom killer or 
unintentionally with cgroup v2 to allow those individual processes to be 
constrained by other cgroups in a single hierarchy model.  User A, using 
10% of his memory limit, is always oom killed instead of user B, using 90% 
of his memory limit.

Others have commented it's still possible to do this with a per-process 
model if users split their processes into many subprocesses with small 
memory footprints.

Solution: comparing cgroups must be done hierarchically.  Neither user A 
nor user B can evade the oom killer because targeting is done based on the 
total hierarchical usage rather than individual cgroups in their 
hierarchies.

3. Userspace has zero control over oom kill selection in leaf mem cgroups

Unlike using /proc/pid/oom_score_adj to bias or prefer certain processes 
from the oom killer, the cgroup aware oom killer does not provide any 
solution for the user to protect leaf mem cgroups.  This is a result of 
leaf mem cgroups being evaluated based on their anon, unevictable, and 
unreclaimable slab usage and disregarding any user tunable.

Absent the cgroup aware oom killer, users have the ability to strongly 
prefer a process is oom killed (/proc/pid/oom_score_adj = 1000) or 
strongly bias against a process (/proc/pid/oom_score_adj = -999).

An example: a process knows it's going to use a lot of memory, so it sets 
/proc/self/oom_score_adj to 1000.  It wants to be killed first to avoid 
disrupting any other process.  If it's attached to the root mem cgroup, 
it will be oom killed.  If it's attached to a leaf mem cgroup by an admin 
outside its control, it will never be oom killed unless that cgroup's 
usage is the largest single cgroup usage on the system.  The reverse also 
is true for processes that the admin does not want to be oom killed: set 
/proc/pid/oom_score_adj to -999, but it will *always* be oom killed if its 
cgroup has the highest usage on the system.

The result is that both admins and users have lost all control over which 
processes are oom killed.  They are left with only one alternative: set 
/proc/pid/oom_score_adj to -1000 to completely disable a process from oom 
kill.  It doesn't address the issue at all for memcg-constrained oom 
conditions since no processes are killable anymore, and risks panicking 
the system if it is the only process left on the system.  A process 
preferring that it is first in line for oom kill simply cannot volunteer 
anymore.

Solution: allow users and admins to control oom kill selection by 
introducing a memory.oom_score_adj to affect the oom score of that mem 
cgroup, exactly the same as /proc/pid/oom_score_adj affects the oom score 
of a process.


I proposed a solution in 
https://marc.info/?l=linux-kernel&m=150956897302725, which was never 
responded to, for all of these issues.  The idea is to do hierarchical 
accounting of mem cgroup hierarchies so that the hierarchy is traversed 
comparing total usage at each level to select target cgroups.  Admins and 
users can use memory.oom_score_adj to influence that decisionmaking at 
each level.

This solves #1 because mem cgroups can be compared based on the same 
classes of memory and the root mem cgroup's usage can be fairly compared 
by subtracting top-level mem cgroup usage from system usage.  All of the 
criteria used to evaluate a leaf mem cgroup have reasonable system-wide 
counterparts that can be used to do the simple subtraction.

This solves #2 because evaluation is done hierarchically so that 
distributing processes over a set of child cgroups either intentionally 
or unintentionally no longer evades the oom killer.  Total usage is always 
accounted to the parent and there is no escaping this criteria for users.

This solves #3 because it allows admins to protect important processes in 
cgroups that are supposed to use, for example, 75% of system memory 
without it unconditionally being selected for oom kill but still oom kill 
if it exceeds a certain threshold.  In this sense, the cgroup aware oom 
killer, as currently implemented, is selling mem cgroups short by 
requiring the user to accept that the important process will be oom killed 
iff it uses mem cgroups and isn't attached to root.  It also allows users 
to actually volunteer to be oom killed first without majority usage.

It has come up time and time again that this support can be introduced on 
top of the cgroup oom killer as implemented.  It simply cannot.  For 
admins and users to have control over decisionmaking, it needs an 
oom_score_adj-type tunable that does not change semantics from kernel 
version to kernel version and does not pollute the mem cgroup filesystem.  
That, in my suggestion, is an adjustment on the amount of total 
hierarchical usage of each mem cgroup at each level of the hierarchy.  
That requires that the heuristic uses hierarchical usage rather than 
considering each cgroup as independent consumers as it does today.  We 
need to implement that heuristic and introduce userspace influence over 
oom kill selection now rather than later because its implementation 
changes how this patchset is implemented.

I can implement these changes, if preferred, on top of the current 
patchset, but I do not believe we want inconsistencies between kernel 
versions that introduce user visible changes for the sole reason that this 
current implementation is incomplete and unfair.  We can implement and 
introduce it once without behavior changing later because the core 
heuristic has necessarily changed.


* Re: [PATCH v13 0/7] cgroup-aware OOM killer
  2018-01-10  0:57   ` David Rientjes
@ 2018-01-10 13:11     ` Roman Gushchin
  2018-01-10 19:33       ` Andrew Morton
  2018-01-10 20:50       ` David Rientjes
  0 siblings, 2 replies; 51+ messages in thread
From: Roman Gushchin @ 2018-01-10 13:11 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, linux-mm, Michal Hocko, Vladimir Davydov,
	Johannes Weiner, Tetsuo Handa, Tejun Heo, kernel-team, cgroups,
	linux-doc, linux-kernel, linux-mm

Hello, David!

On Tue, Jan 09, 2018 at 04:57:53PM -0800, David Rientjes wrote:
> On Thu, 30 Nov 2017, Andrew Morton wrote:
> 
> > > This patchset makes the OOM killer cgroup-aware.
> > 
> > Thanks, I'll grab these.
> > 
> > There has been controversy over this patchset, to say the least.  I
> > can't say that I followed it closely!  Could those who still have
> > reservations please summarise their concerns and hopefully suggest a
> > way forward?
> > 
> 
> Yes, I'll summarize what my concerns have been in the past and what they 
> are wrt the patchset as it stands in -mm.  None of them originate from my 
> current usecase or anticipated future usecase of the oom killer for 
> system-wide or memcg-constrained oom conditions.  They are based purely on 
> the patchset's use of an incomplete and unfair heuristic for deciding 
> which cgroup to target.
> 
> I'll also suggest simple changes to the patchset, which I have in the 
> past, that can be made to address all of these concerns.
> 
> 1. The unfair comparison of the root mem cgroup vs leaf mem cgroups
> 
> The patchset uses two different heuristics to compare root and leaf mem 
> cgroups and scores them based on number of pages.  For the root mem 
> cgroup, it totals the /proc/pid/oom_score of all processes attached: 
> that's based on rss, swap, pgtables, and, most importantly, oom_score_adj.  
> For leaf mem cgroups, it's based on that memcg's anonymous, unevictable, 
> unreclaimable slab, kernel stack, and swap counters.  These can be wildly 
> different independent of /proc/pid/oom_score_adj, but the most obvious 
> unfairness comes from users who tune oom_score_adj.
> 
> An example: start a process that faults 1GB of anonymous memory and leave 
> it attached to the root mem cgroup.  Start six more processes that each 
> fault 1GB of anonymous memory and attach them to a leaf mem cgroup.  Set 
> all processes to have /proc/pid/oom_score_adj of 1000.  System oom kill 
> will always kill the 1GB process attached to the root mem cgroup.  It's 
> because oom_badness() relies on /proc/pid/oom_score_adj, which is used to 
> evaluate the root mem cgroup, and leaf mem cgroups completely disregard 
> it.
> 
> In this example, the leaf mem cgroup's score is 1,573,044, the number of 
> pages for the 6GB of faulted memory.  The root mem cgroup's score is 
> 12,652,907, eight times larger even though its usage is six times smaller.
> 
> This is caused by the patchset disregarding oom_score_adj entirely for 
> leaf mem cgroups and relying on it heavily for the root mem cgroup.  It's 
> the complete opposite result of what the cgroup aware oom killer 
> advertises.
> 
> It also works the other way, if a large memory hog is attached to the root 
> mem cgroup but has a negative oom_score_adj it is never killed and random 
> processes are nuked solely because they happened to be attached to a leaf 
> mem cgroup.  This behavior wrt oom_score_adj is completely undocumented, 
> so I can't presume that it is either known or tested.
> 
> Solution: compare the root mem cgroup and leaf mem cgroups equally with 
> the same criteria by doing hierarchical accounting of usage and 
> subtracting from total system usage to find root usage.

I find this problem quite minor, because I haven't seen any practical problems
caused by accounting of the root cgroup memory.
If it's a serious problem for you, it can be solved without switching to
hierarchical accounting: it's possible to sum up all leaf cgroup stats and
subtract them from the global values. So, it can be a relatively small
enhancement on top of the current mm tree. This has nothing to do with the
global victim selection approach.

> 
> 2. Evading the oom killer by attaching processes to child cgroups
> 
> Any cgroup on the system can attach all their processes to individual 
> child cgroups.  This is functionally the same as doing
> 
> 	for i in $(cat cgroup.procs); do mkdir $i; echo $i > $i/cgroup.procs; done
> 
> without the no internal process constraint introduced with cgroup v2.  All 
> child cgroups are evaluated based on their own usage: all anon, 
> unevictable, and unreclaimable slab as described previously.  It requires 
> an individual cgroup to be the single largest consumer to be targeted by 
> the oom killer.
> 
> An example: allow users to manage two different mem cgroup hierarchies 
> limited to 100GB each.  User A uses 10GB of memory and user B uses 90GB of 
> memory in their respective hierarchies.  On a system oom condition, we'd 
> expect at least one process from user B's hierarchy would always be oom 
> killed with the cgroup aware oom killer.  In fact, the changelog 
> explicitly states it solves an issue where "1) There is no fairness 
> between containers. A small container with few large processes will be 
> chosen over a large one with huge number of small processes."
> 
> The opposite becomes true, however, if user B creates child cgroups and 
> distributes its processes such that each child cgroup's usage never 
> exceeds 10GB of memory.  This can either be done intentionally to 
> purposefully have a low cgroup memory footprint to evade the oom killer or 
> unintentionally with cgroup v2 to allow those individual processes to be 
> constrained by other cgroups in a single hierarchy model.  User A, using 
> 10% of his memory limit, is always oom killed instead of user B, using 90% 
> of his memory limit.
> 
> Others have commented it's still possible to do this with a per-process 
> model if users split their processes into many subprocesses with small 
> memory footprints.
> 
> Solution: comparing cgroups must be done hierarchically.  Neither user A 
> nor user B can evade the oom killer because targeting is done based on the 
> total hierarchical usage rather than individual cgroups in their 
> hierarchies.

We've discussed this a lot.
The hierarchical approach has its own issues, which we've discussed during
previous iterations of the patchset. If you know how to address them
(I've no idea), please go on and suggest your version.

> 
> 3. Userspace has zero control over oom kill selection in leaf mem cgroups
> 
> Unlike using /proc/pid/oom_score_adj to bias or prefer certain processes 
> from the oom killer, the cgroup aware oom killer does not provide any 
> solution for the user to protect leaf mem cgroups.  This is a result of 
> leaf mem cgroups being evaluated based on their anon, unevictable, and 
> unreclaimable slab usage and disregarding any user tunable.
> 
> Absent the cgroup aware oom killer, users have the ability to strongly 
> prefer a process is oom killed (/proc/pid/oom_score_adj = 1000) or 
> strongly bias against a process (/proc/pid/oom_score_adj = -999).
> 
> An example: a process knows it's going to use a lot of memory, so it sets 
> /proc/self/oom_score_adj to 1000.  It wants to be killed first to avoid 
> disrupting any other process.  If it's attached to the root mem cgroup, 
> it will be oom killed.  If it's attached to a leaf mem cgroup by an admin 
> outside its control, it will never be oom killed unless that cgroup's 
> usage is the largest single cgroup usage on the system.  The reverse also 
> is true for processes that the admin does not want to be oom killed: set 
> /proc/pid/oom_score_adj to -999, but it will *always* be oom killed if its 
> cgroup has the highest usage on the system.
> 
> The result is that both admins and users have lost all control over which 
> processes are oom killed.  They are left with only one alternative: set 
> /proc/pid/oom_score_adj to -1000 to completely disable a process from oom 
> kill.  It doesn't address the issue at all for memcg-constrained oom 
> conditions since no processes are killable anymore, and risks panicking 
> the system if it is the only process left on the system.  A process 
> preferring that it is first in line for oom kill simply cannot volunteer 
> anymore.
> 
> Solution: allow users and admins to control oom kill selection by 
> introducing a memory.oom_score_adj to affect the oom score of that mem 
> cgroup, exactly the same as /proc/pid/oom_score_adj affects the oom score 
> of a process.

The per-process oom_score_adj interface is not the nicest one, and I'm not
sure we want to replicate it on the cgroup level as is. If you have an idea of
how it should look, please propose a patch; otherwise it's hard to discuss
it without the code.

> 
> 
> I proposed a solution in 
> https://marc.info/?l=linux-kernel&m=150956897302725, which was never 
> responded to, for all of these issues.  The idea is to do hierarchical 
> accounting of mem cgroup hierarchies so that the hierarchy is traversed 
> comparing total usage at each level to select target cgroups.  Admins and 
> users can use memory.oom_score_adj to influence that decisionmaking at 
> each level.
> 
> This solves #1 because mem cgroups can be compared based on the same 
> classes of memory and the root mem cgroup's usage can be fairly compared 
> by subtracting top-level mem cgroup usage from system usage.  All of the 
> criteria used to evaluate a leaf mem cgroup have reasonable system-wide 
> counterparts that can be used to do the simple subtraction.
> 
> This solves #2 because evaluation is done hierarchically so that 
> distributing processes over a set of child cgroups either intentionally 
> or unintentionally no longer evades the oom killer.  Total usage is always 
> accounted to the parent and there is no escaping this criteria for users.
> 
> This solves #3 because it allows admins to protect important processes in 
> cgroups that are supposed to use, for example, 75% of system memory 
> without it unconditionally being selected for oom kill but still oom kill 
> if it exceeds a certain threshold.  In this sense, the cgroup aware oom 
> killer, as currently implemented, is selling mem cgroups short by 
> requiring the user to accept that the important process will be oom killed 
> iff it uses mem cgroups and isn't attached to root.  It also allows users 
> to actually volunteer to be oom killed first without majority usage.
> 
> It has come up time and time again that this support can be introduced on 
> top of the cgroup oom killer as implemented.  It simply cannot.  For 
> admins and users to have control over decisionmaking, it needs an 
> oom_score_adj-type tunable that does not change semantics from kernel 
> version to kernel version and does not pollute the mem cgroup filesystem.  
> That, in my suggestion, is an adjustment on the amount of total 
> hierarchical usage of each mem cgroup at each level of the hierarchy.  
> That requires that the heuristic uses hierarchical usage rather than 
> considering each cgroup as independent consumers as it does today.  We 
> need to implement that heuristic and introduce userspace influence over 
> oom kill selection now rather than later because its implementation 
> changes how this patchset is implemented.
> 
> I can implement these changes, if preferred, on top of the current 
> patchset, but I do not believe we want inconsistencies between kernel 
> versions that introduce user visible changes for the sole reason that this 
> current implementation is incomplete and unfair.  We can implement and 
> introduce it once without behavior changing later because the core 
> heuristic has necessarily changed.

David, I _had_ hierarchical accounting implemented in one of the previous
versions of this patchset. And there were _reasons_, why we went away from it.
You can't just ignore them and say that "there is a simple solution, to which
Roman is not responding". If you know how to address these issues and
convince everybody that hierarchical approach is a way to go, please,
go on and send your version of the patchset.

Thanks!

Roman


* Re: [PATCH v13 0/7] cgroup-aware OOM killer
  2018-01-10 13:11     ` Roman Gushchin
@ 2018-01-10 19:33       ` Andrew Morton
  2018-01-11  9:08         ` Michal Hocko
  2018-01-13 17:14         ` Johannes Weiner
  2018-01-10 20:50       ` David Rientjes
  1 sibling, 2 replies; 51+ messages in thread
From: Andrew Morton @ 2018-01-10 19:33 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: David Rientjes, linux-mm, Michal Hocko, Vladimir Davydov,
	Johannes Weiner, Tetsuo Handa, Tejun Heo, kernel-team, cgroups,
	linux-doc, linux-kernel, linux-mm

On Wed, 10 Jan 2018 05:11:44 -0800 Roman Gushchin <guro@fb.com> wrote:

> Hello, David!
> 
> On Tue, Jan 09, 2018 at 04:57:53PM -0800, David Rientjes wrote:
> > On Thu, 30 Nov 2017, Andrew Morton wrote:
> > 
> > > > This patchset makes the OOM killer cgroup-aware.
> > > 
> > > Thanks, I'll grab these.
> > > 
> > > There has been controversy over this patchset, to say the least.  I
> > > can't say that I followed it closely!  Could those who still have
> > > reservations please summarise their concerns and hopefully suggest a
> > > way forward?
> > > 
> > 
> > Yes, I'll summarize what my concerns have been in the past and what they 
> > are wrt the patchset as it stands in -mm.  None of them originate from my 
> > current usecase or anticipated future usecase of the oom killer for 
> > system-wide or memcg-constrained oom conditions.  They are based purely on 
> > the patchset's use of an incomplete and unfair heuristic for deciding 
> > which cgroup to target.
> > 
> > I'll also suggest simple changes to the patchset, which I have in the 
> > past, that can be made to address all of these concerns.
> > 
> > 1. The unfair comparison of the root mem cgroup vs leaf mem cgroups
> > 
> > The patchset uses two different heuristics to compare root and leaf mem 
> > cgroups and scores them based on number of pages.  For the root mem 
> > cgroup, it totals the /proc/pid/oom_score of all processes attached: 
> > that's based on rss, swap, pgtables, and, most importantly, oom_score_adj.  
> > For leaf mem cgroups, it's based on that memcg's anonymous, unevictable, 
> > unreclaimable slab, kernel stack, and swap counters.  These can be wildly 
> > different independent of /proc/pid/oom_score_adj, but the most obvious 
> > unfairness comes from users who tune oom_score_adj.
> > 
> > An example: start a process that faults 1GB of anonymous memory and leave 
> > it attached to the root mem cgroup.  Start six more processes that each 
> > fault 1GB of anonymous memory and attach them to a leaf mem cgroup.  Set 
> > all processes to have /proc/pid/oom_score_adj of 1000.  System oom kill 
> > will always kill the 1GB process attached to the root mem cgroup.  It's 
> > because oom_badness() relies on /proc/pid/oom_score_adj, which is used to 
> > evaluate the root mem cgroup, and leaf mem cgroups completely disregard 
> > it.
> > 
> > In this example, the leaf mem cgroup's score is 1,573,044, the number of 
> > pages for the 6GB of faulted memory.  The root mem cgroup's score is 
> > 12,652,907, eight times larger even though its usage is six times smaller.
> > 
> > This is caused by the patchset disregarding oom_score_adj entirely for 
> > leaf mem cgroups and relying on it heavily for the root mem cgroup.  It's 
> > the complete opposite result of what the cgroup aware oom killer 
> > advertises.
> > 
> > It also works the other way, if a large memory hog is attached to the root 
> > mem cgroup but has a negative oom_score_adj it is never killed and random 
> > processes are nuked solely because they happened to be attached to a leaf 
> > mem cgroup.  This behavior wrt oom_score_adj is completely undocumented, 
> > so I can't presume that it is either known or tested.
> > 
> > Solution: compare the root mem cgroup and leaf mem cgroups equally with 
> > the same criteria by doing hierarchical accounting of usage and 
> > subtracting from total system usage to find root usage.
> 
> I find this problem quite minor, because I haven't seen any practical problems
> caused by accounting of the root cgroup memory.
> If it's a serious problem for you, it can be solved without switching to
> hierarchical accounting: it's possible to sum up all leaf cgroup stats and
> subtract them from the global values. So, it can be a relatively small
> enhancement on top of the current mm tree. This has nothing to do with the
> global victim selection approach.

It sounds like a significant shortcoming to me - the oom-killing
decisions which David describes are clearly incorrect?

If this can be fixed against the -mm patchset with a "relatively small
enhancement" then please let's get that done so it can be reviewed and
tested.

> > 
> > 2. Evading the oom killer by attaching processes to child cgroups
> > 
> > Any cgroup on the system can attach all their processes to individual 
> > child cgroups.  This is functionally the same as doing
> > 
> > 	for i in $(cat cgroup.procs); do mkdir $i; echo $i > $i/cgroup.procs; done
> > 
> > without the no internal process constraint introduced with cgroup v2.  All 
> > child cgroups are evaluated based on their own usage: all anon, 
> > unevictable, and unreclaimable slab as described previously.  It requires 
> > an individual cgroup to be the single largest consumer to be targeted by 
> > the oom killer.
> > 
> > An example: allow users to manage two different mem cgroup hierarchies 
> > limited to 100GB each.  User A uses 10GB of memory and user B uses 90GB of 
> > memory in their respective hierarchies.  On a system oom condition, we'd 
> > expect at least one process from user B's hierarchy would always be oom 
> > killed with the cgroup aware oom killer.  In fact, the changelog 
> > explicitly states it solves an issue where "1) There is no fairness 
> > between containers. A small container with few large processes will be 
> > chosen over a large one with huge number of small processes."
> > 
> > The opposite becomes true, however, if user B creates child cgroups and 
> > distributes its processes such that each child cgroup's usage never 
> > exceeds 10GB of memory.  This can either be done intentionally to 
> > purposefully have a low cgroup memory footprint to evade the oom killer or 
> > unintentionally with cgroup v2 to allow those individual processes to be 
> > constrained by other cgroups in a single hierarchy model.  User A, using 
> > 10% of his memory limit, is always oom killed instead of user B, using 90% 
> > of his memory limit.
> > 
> > Others have commented its still possible to do this with a per-process 
> > model if users split their processes into many subprocesses with small 
> > memory footprints.
> > 
> > Solution: comparing cgroups must be done hierarchically.  Neither user A 
> > nor user B can evade the oom killer because targeting is done based on the 
> > total hierarchical usage rather than individual cgroups in their 
> > hierarchies.
> 
> We've discussed this a lot.
> The hierarchical approach has its own issues, which we've discussed during
> previous iterations of the patchset. If you know how to address them
> (I've no idea), please go on and suggest your version.

Well, if a hierarchical approach isn't a workable fix for the problem
which David has identified then what *is* the fix?

> > 
> > 3. Userspace has zero control over oom kill selection in leaf mem cgroups
> > 
> > Unlike using /proc/pid/oom_score_adj to bias or prefer certain processes 
> > from the oom killer, the cgroup aware oom killer does not provide any 
> > solution for the user to protect leaf mem cgroups.  This is a result of 
> > leaf mem cgroups being evaluated based on their anon, unevictable, and 
> > unreclaimable slab usage and disregarding any user tunable.
> > 
> > Absent the cgroup aware oom killer, users have the ability to strongly 
> > prefer a process is oom killed (/proc/pid/oom_score_adj = 1000) or 
> > strongly bias against a process (/proc/pid/oom_score_adj = -999).
> > 
> > An example: a process knows it's going to use a lot of memory, so it sets 
> > /proc/self/oom_score_adj to 1000.  It wants to be killed first to avoid 
> > disrupting any other process.  If it's attached to the root mem cgroup, 
> > it will be oom killed.  If it's attached to a leaf mem cgroup by an admin 
> > outside its control, it will never be oom killed unless that cgroup's 
> > usage is the largest single cgroup usage on the system.  The reverse also 
> > is true for processes that the admin does not want to be oom killed: set 
> > /proc/pid/oom_score_adj to -999, but it will *always* be oom killed if its 
> > cgroup has the highest usage on the system.
> > 
> > The result is that both admins and users have lost all control over which 
> > processes are oom killed.  They are left with only one alternative: set 
> > /proc/pid/oom_score_adj to -1000 to completely disable a process from oom 
> > kill.  It doesn't address the issue at all for memcg-constrained oom 
> > conditions since no processes are killable anymore, and risks panicking 
> > the system if it is the only process left on the system.  A process 
> > preferring that it is first in line for oom kill simply cannot volunteer 
> > anymore.
> > 
> > Solution: allow users and admins to control oom kill selection by 
> > introducing a memory.oom_score_adj to affect the oom score of that mem 
> > cgroup, exactly the same as /proc/pid/oom_score_adj affects the oom score 
> > of a process.
> 
> The per-process oom_score_adj interface is not the nicest one, and I'm not
> sure we want to replicate it on the cgroup level as is. If you have an idea of
> how it should look, please propose a patch; otherwise it's hard to discuss
> it without the code.

It does make sense to have some form of per-cgroup tunability.  Why is
the oom_score_adj approach inappropriate and what would be better?  How
hard is it to graft such a thing onto the -mm patchset?

> > 
> > 
> > I proposed a solution in 
> > https://marc.info/?l=linux-kernel&m=150956897302725, which was never 
> > responded to, for all of these issues.  The idea is to do hierarchical 
> > accounting of mem cgroup hierarchies so that the hierarchy is traversed 
> > comparing total usage at each level to select target cgroups.  Admins and 
> > users can use memory.oom_score_adj to influence that decisionmaking at 
> > each level.
> > 
> > This solves #1 because mem cgroups can be compared based on the same 
> > classes of memory and the root mem cgroup's usage can be fairly compared 
> > by subtracting top-level mem cgroup usage from system usage.  All of the 
> > criteria used to evaluate a leaf mem cgroup have reasonable system-wide 
> > counterparts that can be used to do the simple subtraction.
> > 
> > This solves #2 because evaluation is done hierarchically so that 
> > distributing processes over a set of child cgroups either intentionally 
> > or unintentionally no longer evades the oom killer.  Total usage is always 
> > accounted to the parent and there is no escaping this criterion for users.
> > 
> > This solves #3 because it allows admins to protect important processes in 
> > cgroups that are supposed to use, for example, 75% of system memory 
> > without them unconditionally being selected for oom kill, while still oom 
> > killing if usage exceeds a certain threshold.  In this sense, the cgroup aware oom 
> > killer, as currently implemented, is selling mem cgroups short by 
> > requiring the user to accept that the important process will be oom killed 
> > iff it uses mem cgroups and isn't attached to root.  It also allows users 
> > to actually volunteer to be oom killed first without majority usage.
> > 
> > It has come up time and time again that this support can be introduced on 
> > top of the cgroup oom killer as implemented.  It simply cannot.  For 
> > admins and users to have control over decisionmaking, it needs an 
> > oom_score_adj type tunable that cannot change semantics from kernel 
> > version to kernel version and that does not pollute the mem cgroup filesystem.  
> > That, in my suggestion, is an adjustment on the amount of total 
> > hierarchical usage of each mem cgroup at each level of the hierarchy.  
> > That requires that the heuristic uses hierarchical usage rather than 
> > considering each cgroup as an independent consumer as it does today.  We 
> > need to implement that heuristic and introduce userspace influence over 
> > oom kill selection now rather than later because its implementation 
> > changes how this patchset is implemented.
> > 
> > I can implement these changes, if preferred, on top of the current 
> > patchset, but I do not believe we want inconsistencies between kernel 
> > versions that introduce user visible changes for the sole reason that this 
> > current implementation is incomplete and unfair.  We can implement and 
> > introduce it once without behavior changing later because the core 
> > heuristic has necessarily changed.
> 
> David, I _had_ hierarchical accounting implemented in one of the previous
> versions of this patchset. And there were _reasons_ why we moved away from it.

Can you please summarize those issues for my understanding?

> You can't just ignore them and say "there is a simple solution, to which
> Roman is not responding". If you know how to address these issues and
> convince everybody that hierarchical approach is a way to go, please,
> go on and send your version of the patchset.
> 
> Thanks!
> 
> Roman

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v13 0/7] cgroup-aware OOM killer
  2018-01-10 13:11     ` Roman Gushchin
  2018-01-10 19:33       ` Andrew Morton
@ 2018-01-10 20:50       ` David Rientjes
  1 sibling, 0 replies; 51+ messages in thread
From: David Rientjes @ 2018-01-10 20:50 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Andrew Morton, linux-mm, Michal Hocko, Vladimir Davydov,
	Johannes Weiner, Tetsuo Handa, Tejun Heo, kernel-team, cgroups,
	linux-doc, linux-kernel, linux-mm

On Wed, 10 Jan 2018, Roman Gushchin wrote:

> > 1. The unfair comparison of the root mem cgroup vs leaf mem cgroups
> > 
> > The patchset uses two different heuristics to compare root and leaf mem 
> > cgroups and scores them based on number of pages.  For the root mem 
> > cgroup, it totals the /proc/pid/oom_score of all processes attached: 
> > that's based on rss, swap, pgtables, and, most importantly, oom_score_adj.  
> > For leaf mem cgroups, it's based on that memcg's anonymous, unevictable, 
> > unreclaimable slab, kernel stack, and swap counters.  These can be wildly 
> > different independent of /proc/pid/oom_score_adj, but the most obvious 
> > unfairness comes from users who tune oom_score_adj.
> > 
> > An example: start a process that faults 1GB of anonymous memory and leave 
> > it attached to the root mem cgroup.  Start six more processes that each 
> > fault 1GB of anonymous memory and attach them to a leaf mem cgroup.  Set 
> > all processes to have /proc/pid/oom_score_adj of 1000.  System oom kill 
> > will always kill the 1GB process attached to the root mem cgroup.  It's 
> > because oom_badness() relies on /proc/pid/oom_score_adj, which is used to 
> > evaluate the root mem cgroup, and leaf mem cgroups completely disregard 
> > it.
> > 
> > In this example, the leaf mem cgroup's score is 1,573,044, the number of 
> > pages for the 6GB of faulted memory.  The root mem cgroup's score is 
> > 12,652,907, eight times larger even though its usage is six times smaller.
> > 
> > This is caused by the patchset disregarding oom_score_adj entirely for 
> > leaf mem cgroups and relying on it heavily for the root mem cgroup.  It's 
> > the complete opposite result of what the cgroup aware oom killer 
> > advertises.
> > 
> > It also works the other way, if a large memory hog is attached to the root 
> > mem cgroup but has a negative oom_score_adj it is never killed and random 
> > processes are nuked solely because they happened to be attached to a leaf 
> > mem cgroup.  This behavior wrt oom_score_adj is completely undocumented, 
> > so I can't presume that it is either known or tested.
> > 
> > Solution: compare the root mem cgroup and leaf mem cgroups equally with 
> > the same criteria by doing hierarchical accounting of usage and 
> > subtracting from total system usage to find root usage.
> 
> I find this problem quite minor, because I haven't seen any practical problems
> caused by accounting of the root cgroup memory.
> If it's a serious problem for you, it can be solved without switching to
> hierarchical accounting: it's possible to sum up all leaf cgroup stats and
> subtract them from the global values. So it can be a relatively small enhancement
> on top of the current mm tree. This has nothing to do with the global victim
> selection approach.
> 

It's not surprising that this problem is considered minor, since it was 
quite obviously never tested when developing the series.  
/proc/pid/oom_score_adj being effective when the process is attached to 
the root mem cgroup and not when attached to a leaf mem cgroup is a major 
problem, and one that was never documented or explained.  I'm somewhat 
stunned this is considered minor.

Allow me to put a stake in the ground: admins and users need to be able to 
influence oom kill decisions.  Killing a process matters.  The ability to 
protect important processes matters, such as processes that are really 
supposed to use 50% of a system's memory.  As implemented, there is no 
control over this.

If a process's oom_score_adj = 1000 (always kill it), this always happens 
when attached to the root mem cgroup.  If attached to a leaf mem cgroup, 
it never happens unless that individual mem cgroup is the next largest 
single consumer of memory.  It makes it impossible for a process to be the 
preferred oom victim if it is ever attached to any non-root mem cgroup.  
Where it is attached can be outside the control of the application itself.

Perhaps worse, an important process with oom_score_adj = -999 (try to kill 
everything before it) is always oom killed first if attached to a leaf mem 
cgroup with majority usage.
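To make the asymmetry concrete, the two scoring paths can be modelled with a toy Python sketch (an illustration of the heuristic as described above, not kernel code; the 48 GB machine size is an assumption chosen to roughly reproduce the magnitudes in the 1GB-vs-6GB example):

```python
PAGE = 4096                     # bytes per page
GB = 1 << 30
totalpages = 48 * GB // PAGE    # assumed machine size

def root_memcg_score(procs):
    """Root memcg: sum of per-process badness, which folds in
    oom_score_adj (roughly rss + adj * totalpages / 1000)."""
    return sum(max(rss + adj * totalpages // 1000, 0)
               for rss, adj in procs)

def leaf_memcg_score(charged_pages):
    """Leaf memcg: raw charged pages; the oom_score_adj of the
    tasks inside is disregarded entirely."""
    return charged_pages

# one 1 GB task in the root cgroup, oom_score_adj = 1000
root = root_memcg_score([(1 * GB // PAGE, 1000)])
# six 1 GB tasks in one leaf cgroup, also oom_score_adj = 1000
leaf = leaf_memcg_score(6 * GB // PAGE)

print(root, leaf)   # the root cgroup scores ~8x higher with 6x less usage
assert root > leaf
```

With these assumed numbers the root cgroup's score is dominated by the oom_score_adj term while the leaf's score ignores it entirely, which is exactly the inversion described above.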

> > 2. Evading the oom killer by attaching processes to child cgroups
> > 
> > Any cgroup on the system can attach all their processes to individual 
> > child cgroups.  This is functionally the same as doing
> > 
> > 	for i in $(cat cgroup.procs); do mkdir $i; echo $i > $i/cgroup.procs; done
> > 
> > without the no internal process constraint introduced with cgroup v2.  All 
> > child cgroups are evaluated based on their own usage: all anon, 
> > unevictable, and unreclaimable slab as described previously.  It requires 
> > an individual cgroup to be the single largest consumer to be targeted by 
> > the oom killer.
> > 
> > An example: allow users to manage two different mem cgroup hierarchies 
> > limited to 100GB each.  User A uses 10GB of memory and user B uses 90GB of 
> > memory in their respective hierarchies.  On a system oom condition, we'd 
> > expect at least one process from user B's hierarchy would always be oom 
> > killed with the cgroup aware oom killer.  In fact, the changelog 
> > explicitly states it solves an issue where "1) There is no fairness 
> > between containers. A small container with few large processes will be 
> > chosen over a large one with huge number of small processes."
> > 
> > The opposite becomes true, however, if user B creates child cgroups and 
> > distributes its processes such that each child cgroup's usage never 
> > exceeds 10GB of memory.  This can either be done intentionally to 
> > purposefully have a low cgroup memory footprint to evade the oom killer or 
> > unintentionally with cgroup v2 to allow those individual processes to be 
> > constrained by other cgroups in a single hierarchy model.  User A, using 
> > 10% of his memory limit, is always oom killed instead of user B, using 90% 
> > of his memory limit.
> > 
> > Others have commented it's still possible to do this with a per-process 
> > model if users split their processes into many subprocesses with small 
> > memory footprints.
> > 
> > Solution: comparing cgroups must be done hierarchically.  Neither user A 
> > nor user B can evade the oom killer because targeting is done based on the 
> > total hierarchical usage rather than individual cgroups in their 
> > hierarchies.
> 
> We've discussed this a lot.
> The hierarchical approach has its own issues, which we've discussed during
> previous iterations of the patchset. If you know how to address them
> (I've no idea), please, go on and suggest your version.
> 

I asked in https://marc.info/?l=linux-kernel&m=150956897302725 for a clear 
example of these issues when there is userspace control over target 
selection available.  It has received no response, but I can ask again: 
please enumerate the concerns you have with hierarchical accounting in the 
presence of userspace control over target selection.

Also note that even though userspace control over target selection is 
available (through memory.oom_score_adj in my suggestion), it does not 
need to be tuned.  It can be left to the default value of 0 if there 
should be no preference or bias of mem cgroups, just as 
/proc/pid/oom_score_adj does not need to be tuned.
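A minimal sketch of how such a knob could compose with a usage-based score (hypothetical semantics mirroring the per-process knob; the function name and the formula are assumptions for illustration, not part of the patchset):

```python
def memcg_oom_score(hier_usage_pages, oom_score_adj, totalpages):
    """Hypothetical memory.oom_score_adj: bias a cgroup's
    hierarchical usage by adj/1000 of total memory, the way
    /proc/pid/oom_score_adj biases a process's badness."""
    return max(hier_usage_pages + oom_score_adj * totalpages // 1000, 0)

totalpages = 12_582_912   # assumed 48 GB machine with 4 KB pages

# the default of 0 leaves the usage-based score untouched
assert memcg_oom_score(1000, 0, totalpages) == 1000
# -1000 makes a cgroup effectively unkillable, 1000 makes it preferred
assert memcg_oom_score(1000, -1000, totalpages) == 0
assert memcg_oom_score(1000, 1000, totalpages) == 1000 + totalpages
```

The point of the default-0 case is that untuned hierarchies behave exactly as pure usage comparison would.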

> > 3. Userspace has zero control over oom kill selection in leaf mem cgroups
> > 
> > Unlike using /proc/pid/oom_score_adj to bias or prefer certain processes 
> > from the oom killer, the cgroup aware oom killer does not provide any 
> > solution for the user to protect leaf mem cgroups.  This is a result of 
> > leaf mem cgroups being evaluated based on their anon, unevictable, and 
> > unreclaimable slab usage and disregarding any user tunable.
> > 
> > Absent the cgroup aware oom killer, users have the ability to strongly 
> > prefer a process is oom killed (/proc/pid/oom_score_adj = 1000) or 
> > strongly bias against a process (/proc/pid/oom_score_adj = -999).
> > 
> > An example: a process knows it's going to use a lot of memory, so it sets 
> > /proc/self/oom_score_adj to 1000.  It wants to be killed first to avoid 
> > disrupting any other process.  If it's attached to the root mem cgroup, 
> > it will be oom killed.  If it's attached to a leaf mem cgroup by an admin 
> > outside its control, it will never be oom killed unless that cgroup's 
> > usage is the largest single cgroup usage on the system.  The reverse also 
> > is true for processes that the admin does not want to be oom killed: set 
> > /proc/pid/oom_score_adj to -999, but it will *always* be oom killed if its 
> > cgroup has the highest usage on the system.
> > 
> > The result is that both admins and users have lost all control over which 
> > processes are oom killed.  They are left with only one alternative: set 
> > /proc/pid/oom_score_adj to -1000 to completely disable a process from oom 
> > kill.  It doesn't address the issue at all for memcg-constrained oom 
> > conditions since no processes are killable anymore, and risks panicking 
> > the system if it is the only process left on the system.  A process 
> > preferring that it is first in line for oom kill simply cannot volunteer 
> > anymore.
> > 
> > Solution: allow users and admins to control oom kill selection by 
> > introducing a memory.oom_score_adj to affect the oom score of that mem 
> > cgroup, exactly the same as /proc/pid/oom_score_adj affects the oom score 
> > of a process.
> 
> The per-process oom_score_adj interface is not the nicest one, and I'm not
> sure we want to replicate it on the cgroup level as is. If you have an idea of
> what it should look like, please propose a patch; otherwise it's hard to discuss
> it without the code.
> 

Please read the rest of the email; we cannot introduce 
memory.oom_score_adj without hierarchical accounting, which requires a 
change to the core heuristic used by this patchset.  Because hierarchical 
accounting is needed for #1 and #2, the values userspace uses for 
memory.oom_score_adj would be radically different for hierarchical usage 
vs the current heuristic of local usage.  We simply cannot push this 
inconsistency onto the end user that depends on kernel version.  The core 
heuristic needs to change for this to be possible while still addressing 
#1 and #2.  We do not want to be in a situation where the heuristic 
changes after users have constructed their mem cgroup hierarchies to use 
the current heuristic; doing so would be a maintenance nightmare and, 
frankly, irresponsible.
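The gap between the local-usage heuristic and hierarchical accounting can be sketched with the user A / user B example quoted above (plain Python with illustrative GB figures; the kernel compares page counters, but the selection logic is the point):

```python
# usage in GB; user B hides 90 GB behind ten 9 GB child cgroups
usage = {
    "A/leaf": 10,
    **{f"B/leaf{i}": 9 for i in range(10)},
}

def pick_flat(usage):
    """Current heuristic: every leaf cgroup compared on its own usage."""
    return max(usage, key=usage.get)

def pick_hierarchical(usage):
    """Hierarchical accounting: compare cumulative usage of the
    top-level cgroups first, then descend into the winner."""
    totals = {}
    for path, gb in usage.items():
        top = path.split("/")[0]
        totals[top] = totals.get(top, 0) + gb
    return max(totals, key=totals.get)

assert pick_flat(usage) == "A/leaf"     # user A (10 GB total) is targeted
assert pick_hierarchical(usage) == "B"  # user B (90 GB total) is targeted
```

Any memory.oom_score_adj value tuned against one of these two selection rules would be meaningless under the other, which is why the heuristic has to be settled first.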

> > It has come up time and time again that this support can be introduced on 
> > top of the cgroup oom killer as implemented.  It simply cannot.  For 
> > admins and users to have control over decisionmaking, it needs an 
> > oom_score_adj type tunable that cannot change semantics from kernel 
> > version to kernel version and that does not pollute the mem cgroup filesystem.  
> > That, in my suggestion, is an adjustment on the amount of total 
> > hierarchical usage of each mem cgroup at each level of the hierarchy.  
> > That requires that the heuristic uses hierarchical usage rather than 
> > considering each cgroup as an independent consumer as it does today.  We 
> > need to implement that heuristic and introduce userspace influence over 
> > oom kill selection now rather than later because its implementation 
> > changes how this patchset is implemented.
> > 
> > I can implement these changes, if preferred, on top of the current 
> > patchset, but I do not believe we want inconsistencies between kernel 
> > versions that introduce user visible changes for the sole reason that this 
> > current implementation is incomplete and unfair.  We can implement and 
> > introduce it once without behavior changing later because the core 
> > heuristic has necessarily changed.
> 
> David, I _had_ hierarchical accounting implemented in one of the previous
> versions of this patchset. And there were _reasons_ why we moved away from it.
> You can't just ignore them and say "there is a simple solution, to which
> Roman is not responding". If you know how to address these issues and
> convince everybody that hierarchical approach is a way to go, please,
> go on and send your version of the patchset.
> 

I am very much aware that you had hierarchical accounting implemented and 
you'll remember that I strongly advocated for it, because it is the 
correct way to address cgroup target selection when mem cgroups are 
hierarchical by nature.  The reasons you're referring to did not account 
for the proposed memory.oom_score_adj, but I cannot continue to ask for 
those reasons to be enumerated if there is no response.  If there are 
real-world concerns to what is proposed in my email, I'd love those 
concerns to be described as I've described the obvious shortcomings with 
this patchset.  I'm eager to debate that.

Absent any response that shows #1, #2, and #3 above are incorrect or that 
shows what is proposed in my email makes real-world usecases impossible or 
somehow worse than what is proposed here, I'll ask that the cgroup aware 
oom killer is removed from -mm and, if you'd like me to do it, I'd 
repropose a modified version of your earlier hierarchical accounting 
patchset with the added memory.oom_score_adj.


* Re: [PATCH v13 0/7] cgroup-aware OOM killer
  2018-01-10 19:33       ` Andrew Morton
@ 2018-01-11  9:08         ` Michal Hocko
  2018-01-11 13:18           ` Roman Gushchin
  2018-01-11 21:57           ` David Rientjes
  2018-01-13 17:14         ` Johannes Weiner
  1 sibling, 2 replies; 51+ messages in thread
From: Michal Hocko @ 2018-01-11  9:08 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Roman Gushchin, David Rientjes, linux-mm, Vladimir Davydov,
	Johannes Weiner, Tetsuo Handa, Tejun Heo, kernel-team, cgroups,
	linux-doc, linux-kernel, linux-mm

On Wed 10-01-18 11:33:45, Andrew Morton wrote:
> On Wed, 10 Jan 2018 05:11:44 -0800 Roman Gushchin <guro@fb.com> wrote:
> 
> > Hello, David!
> > 
> > On Tue, Jan 09, 2018 at 04:57:53PM -0800, David Rientjes wrote:
> > > On Thu, 30 Nov 2017, Andrew Morton wrote:
> > > 
> > > > > This patchset makes the OOM killer cgroup-aware.
> > > > 
> > > > Thanks, I'll grab these.
> > > > 
> > > > There has been controversy over this patchset, to say the least.  I
> > > > can't say that I followed it closely!  Could those who still have
> > > > reservations please summarise their concerns and hopefully suggest a
> > > > way forward?
> > > > 
> > > 
> > > Yes, I'll summarize what my concerns have been in the past and what they 
> > > are wrt the patchset as it stands in -mm.  None of them originate from my 
> > > current usecase or anticipated future usecase of the oom killer for 
> > > system-wide or memcg-constrained oom conditions.  They are based purely on 
> > > the patchset's use of an incomplete and unfair heuristic for deciding 
> > > which cgroup to target.
> > > 
> > > I'll also suggest simple changes to the patchset, which I have in the 
> > > past, that can be made to address all of these concerns.
> > > 
> > > 1. The unfair comparison of the root mem cgroup vs leaf mem cgroups
> > > 
> > > The patchset uses two different heuristics to compare root and leaf mem 
> > > cgroups and scores them based on number of pages.  For the root mem 
> > > cgroup, it totals the /proc/pid/oom_score of all processes attached: 
> > > that's based on rss, swap, pgtables, and, most importantly, oom_score_adj.  
> > > For leaf mem cgroups, it's based on that memcg's anonymous, unevictable, 
> > > unreclaimable slab, kernel stack, and swap counters.  These can be wildly 
> > > different independent of /proc/pid/oom_score_adj, but the most obvious 
> > > unfairness comes from users who tune oom_score_adj.
> > > 
> > > An example: start a process that faults 1GB of anonymous memory and leave 
> > > it attached to the root mem cgroup.  Start six more processes that each 
> > > fault 1GB of anonymous memory and attach them to a leaf mem cgroup.  Set 
> > > all processes to have /proc/pid/oom_score_adj of 1000.  System oom kill 
> > > will always kill the 1GB process attached to the root mem cgroup.  It's 
> > > because oom_badness() relies on /proc/pid/oom_score_adj, which is used to 
> > > evaluate the root mem cgroup, and leaf mem cgroups completely disregard 
> > > it.
> > > 
> > > In this example, the leaf mem cgroup's score is 1,573,044, the number of 
> > > pages for the 6GB of faulted memory.  The root mem cgroup's score is 
> > > 12,652,907, eight times larger even though its usage is six times smaller.
> > > 
> > > This is caused by the patchset disregarding oom_score_adj entirely for 
> > > leaf mem cgroups and relying on it heavily for the root mem cgroup.  It's 
> > > the complete opposite result of what the cgroup aware oom killer 
> > > advertises.
> > > 
> > > It also works the other way, if a large memory hog is attached to the root 
> > > mem cgroup but has a negative oom_score_adj it is never killed and random 
> > > processes are nuked solely because they happened to be attached to a leaf 
> > > mem cgroup.  This behavior wrt oom_score_adj is completely undocumented, 
> > > so I can't presume that it is either known or tested.
> > > 
> > > Solution: compare the root mem cgroup and leaf mem cgroups equally with 
> > > the same criteria by doing hierarchical accounting of usage and 
> > > subtracting from total system usage to find root usage.
> > 
> > I find this problem quite minor, because I haven't seen any practical problems
> > caused by accounting of the root cgroup memory.
> > If it's a serious problem for you, it can be solved without switching to
> > hierarchical accounting: it's possible to sum up all leaf cgroup stats and
> > subtract them from the global values. So it can be a relatively small enhancement
> > on top of the current mm tree. This has nothing to do with the global victim
> > selection approach.
> 
> It sounds like a significant shortcoming to me - the oom-killing
> decisions which David describes are clearly incorrect?

Well, I would rather look at that from the use case POV. The primary
user of the new OOM killer functionality is containers. I might be
wrong, but I _assume_ that the root cgroup will only contain the basic system
infrastructure and all the workload will run in containers (aka
cgroups). The only oom tuning inside the root cgroup would be to disable
oom for some of those processes. The current implementation should work
reasonably well for that configuration.
 
> If this can be fixed against the -mm patchset with a "relatively small
> enhancement" then please let's get that done so it can be reviewed and
> tested.

The root memcg will always be special and I would rather base the
semantics on use cases than implement something based on
theoretical examples.

> > > 2. Evading the oom killer by attaching processes to child cgroups
> > > 
> > > Any cgroup on the system can attach all their processes to individual 
> > > child cgroups.  This is functionally the same as doing
> > > 
> > > 	for i in $(cat cgroup.procs); do mkdir $i; echo $i > $i/cgroup.procs; done
> > > 
> > > without the no internal process constraint introduced with cgroup v2.  All 
> > > child cgroups are evaluated based on their own usage: all anon, 
> > > unevictable, and unreclaimable slab as described previously.  It requires 
> > > an individual cgroup to be the single largest consumer to be targeted by 
> > > the oom killer.
> > > 
> > > An example: allow users to manage two different mem cgroup hierarchies 
> > > limited to 100GB each.  User A uses 10GB of memory and user B uses 90GB of 
> > > memory in their respective hierarchies.  On a system oom condition, we'd 
> > > expect at least one process from user B's hierarchy would always be oom 
> > > killed with the cgroup aware oom killer.  In fact, the changelog 
> > > explicitly states it solves an issue where "1) There is no fairness 
> > > between containers. A small container with few large processes will be 
> > > chosen over a large one with huge number of small processes."
> > > 
> > > The opposite becomes true, however, if user B creates child cgroups and 
> > > distributes its processes such that each child cgroup's usage never 
> > > exceeds 10GB of memory.  This can either be done intentionally to 
> > > purposefully have a low cgroup memory footprint to evade the oom killer or 
> > > unintentionally with cgroup v2 to allow those individual processes to be 
> > > constrained by other cgroups in a single hierarchy model.  User A, using 
> > > 10% of his memory limit, is always oom killed instead of user B, using 90% 
> > > of his memory limit.
> > > 
> > > Others have commented it's still possible to do this with a per-process 
> > > model if users split their processes into many subprocesses with small 
> > > memory footprints.
> > > 
> > > Solution: comparing cgroups must be done hierarchically.  Neither user A 
> > > nor user B can evade the oom killer because targeting is done based on the 
> > > total hierarchical usage rather than individual cgroups in their 
> > > hierarchies.
> > 
> > We've discussed this a lot.
> > The hierarchical approach has its own issues, which we've discussed during
> > previous iterations of the patchset. If you know how to address them
> > (I've no idea), please, go on and suggest your version.
> 
> Well, if a hierarchical approach isn't a workable fix for the problem
> which David has identified then what *is* the fix?

The hierarchical approach basically hardcodes the oom decision into the
hierarchy structure, and that is simply a no-go because it would turn
into a massive configuration PITA (more on that below). I consider the above
example rather artificial, to be honest. Anyway, if we _really_ have to
address it in the future, we can do that by providing a mechanism to
prioritize cgroups. It seems that this is required for some use cases
anyway.
 
> > > 3. Userspace has zero control over oom kill selection in leaf mem cgroups
> > > 
> > > Unlike using /proc/pid/oom_score_adj to bias or prefer certain processes 
> > > from the oom killer, the cgroup aware oom killer does not provide any 
> > > solution for the user to protect leaf mem cgroups.  This is a result of 
> > > leaf mem cgroups being evaluated based on their anon, unevictable, and 
> > > unreclaimable slab usage and disregarding any user tunable.
> > > 
> > > Absent the cgroup aware oom killer, users have the ability to strongly 
> > > prefer a process is oom killed (/proc/pid/oom_score_adj = 1000) or 
> > > strongly bias against a process (/proc/pid/oom_score_adj = -999).
> > > 
> > > An example: a process knows it's going to use a lot of memory, so it sets 
> > > /proc/self/oom_score_adj to 1000.  It wants to be killed first to avoid 
> > > disrupting any other process.  If it's attached to the root mem cgroup, 
> > > it will be oom killed.  If it's attached to a leaf mem cgroup by an admin 
> > > outside its control, it will never be oom killed unless that cgroup's 
> > > usage is the largest single cgroup usage on the system.  The reverse also 
> > > is true for processes that the admin does not want to be oom killed: set 
> > > /proc/pid/oom_score_adj to -999, but it will *always* be oom killed if its 
> > > cgroup has the highest usage on the system.
> > > 
> > > The result is that both admins and users have lost all control over which 
> > > processes are oom killed.  They are left with only one alternative: set 
> > > /proc/pid/oom_score_adj to -1000 to completely disable a process from oom 
> > > kill.  It doesn't address the issue at all for memcg-constrained oom 
> > > conditions since no processes are killable anymore, and risks panicking 
> > > the system if it is the only process left on the system.  A process 
> > > preferring that it is first in line for oom kill simply cannot volunteer 
> > > anymore.
> > > 
> > > Solution: allow users and admins to control oom kill selection by 
> > > introducing a memory.oom_score_adj to affect the oom score of that mem 
> > > cgroup, exactly the same as /proc/pid/oom_score_adj affects the oom score 
> > > of a process.
> > 
> > The per-process oom_score_adj interface is not the nicest one, and I'm not
> > sure we want to replicate it on the cgroup level as is. If you have an idea of
> > what it should look like, please propose a patch; otherwise it's hard to discuss
> > it without the code.
> 
> It does make sense to have some form of per-cgroup tunability.  Why is
> the oom_score_adj approach inappropriate and what would be better?

oom_score_adj is basically unusable for any fine tuning on the process
level for most setups except for very specialized ones. The only
reasonable usage I've seen so far was to disable OOM killer for a
process or make it a prime candidate. Using the same limited concept for
cgroups sounds like repeating the same error to me.

> How hard is it to graft such a thing onto the -mm patchset?

I think this should be thought through very well before we add another
tuning. Moreover the current usecase doesn't seem to require it so I am
not really sure why we should implement something right away and later
suffer from API mistakes.
 
> > > I proposed a solution in 
> > > https://marc.info/?l=linux-kernel&m=150956897302725, which was never 
> > > responded to, for all of these issues.  The idea is to do hierarchical 
> > > accounting of mem cgroup hierarchies so that the hierarchy is traversed 
> > > comparing total usage at each level to select target cgroups.  Admins and 
> > > users can use memory.oom_score_adj to influence that decisionmaking at 
> > > each level.
> > > 
> > > This solves #1 because mem cgroups can be compared based on the same 
> > > classes of memory and the root mem cgroup's usage can be fairly compared 
> > > by subtracting top-level mem cgroup usage from system usage.  All of the 
> > > criteria used to evaluate a leaf mem cgroup have reasonable system-wide 
> > > counterparts that can be used to do the simple subtraction.
> > > 
> > > This solves #2 because evaluation is done hierarchically so that 
> > > distributing processes over a set of child cgroups either intentionally 
> > > or unintentionally no longer evades the oom killer.  Total usage is always 
> > > accounted to the parent and there is no escaping this criterion for users.
> > > 
> > > This solves #3 because it allows admins to protect important processes in 
> > > cgroups that are supposed to use, for example, 75% of system memory 
> > > without them unconditionally being selected for oom kill, while still oom 
> > > killing if usage exceeds a certain threshold.  In this sense, the cgroup aware oom 
> > > killer, as currently implemented, is selling mem cgroups short by 
> > > requiring the user to accept that the important process will be oom killed 
> > > iff it uses mem cgroups and isn't attached to root.  It also allows users 
> > > to actually volunteer to be oom killed first without majority usage.
> > > 
> > > It has come up time and time again that this support can be introduced on 
> > > top of the cgroup oom killer as implemented.  It simply cannot.  For 
> > > admins and users to have control over decisionmaking, it needs an 
> > > oom_score_adj type tunable that cannot change semantics from kernel 
> > > version to kernel version and that does not pollute the mem cgroup filesystem.  
> > > That, in my suggestion, is an adjustment on the amount of total 
> > > hierarchical usage of each mem cgroup at each level of the hierarchy.  
> > > That requires that the heuristic uses hierarchical usage rather than 
> > > considering each cgroup as an independent consumer as it does today.  We 
> > > need to implement that heuristic and introduce userspace influence over 
> > > oom kill selection now rather than later because its implementation 
> > > changes how this patchset is implemented.
> > > 
> > > I can implement these changes, if preferred, on top of the current 
> > > patchset, but I do not believe we want inconsistencies between kernel 
> > > versions that introduce user visible changes for the sole reason that this 
> > > current implementation is incomplete and unfair.  We can implement and 
> > > introduce it once without behavior changing later because the core 
> > > heuristic has necessarily changed.
> > 
> > David, I _had_ hierarchical accounting implemented in one of the previous
> > versions of this patchset. And there were _reasons_, why we went away from it.
> 
> Can you please summarize those issues for my understanding?

Because it makes the oom decision directly hardwired to the hierarchy
structure. Just take a simple example of the cgroup structure which
reflects a higher level organization
         root
	/  |  \ 
  admins   |   teachers
        students

Now your students group will most likely be the largest one. Why should we
kill tasks/cgroups from that cgroup just because it is cumulatively the
largest one? It might have been one of the teachers blowing up the
memory usage.
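The first example can be put into a toy model (the cgroup names and usage
numbers below are invented for illustration; this is not kernel code):

```python
# Toy model contrasting cumulative (hierarchical) accounting with a
# flat per-leaf comparison for OOM victim selection.

usage_mb = {
    "/admins":      200,
    "/teachers/t1": 3000,   # one teacher blowing up memory usage
    "/teachers/t2": 100,
    "/students/s1": 800,    # many modest students...
    "/students/s2": 900,
    "/students/s3": 700,
    "/students/s4": 900,
}

def cumulative(prefix):
    """Sum the usage of every cgroup under a top-level subtree."""
    return sum(v for k, v in usage_mb.items() if k.startswith(prefix))

# Hierarchical accounting compares top-level subtrees by total usage:
subtrees = ["/admins", "/teachers", "/students"]
hier_victim = max(subtrees, key=cumulative)

# A flat comparison looks at individual leaf cgroups instead:
flat_victim = max(usage_mb, key=usage_mb.get)

print(hier_victim)  # /students -- cumulatively the largest subtree
print(flat_victim)  # /teachers/t1 -- the actual memory hog
```

With these numbers the /students subtree (3300 MB total) edges out
/teachers (3100 MB), so hierarchical accounting would target a student even
though a single teacher task holds 3000 MB.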

Another example is when you need a mid layer cgroups for other
controllers to better control resources.
	root
	/  \
   cpuset1  cpuset2
   /    \   /  |  \
  A     B  C   D   E

You really do not want to select cpuset2 just because it has more
subgroups and potentially larger cumulative usage. The hierarchical
accounting works _only_ if higher-level cgroups are semantically
_comparable_, which might be true for some workloads, but this is by
no means true in general.

That all being said, I can see further improvements to happen on top of
the current work but I also think that the current implementation works
for the usecase which many users can use without those improvements.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v13 0/7] cgroup-aware OOM killer
  2018-01-11  9:08         ` Michal Hocko
@ 2018-01-11 13:18           ` Roman Gushchin
  2018-01-12 22:03             ` David Rientjes
  2018-01-11 21:57           ` David Rientjes
  1 sibling, 1 reply; 51+ messages in thread
From: Roman Gushchin @ 2018-01-11 13:18 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, David Rientjes, linux-mm, Vladimir Davydov,
	Johannes Weiner, Tetsuo Handa, Tejun Heo, kernel-team, cgroups,
	linux-doc, linux-kernel, linux-mm

On Thu, Jan 11, 2018 at 10:08:09AM +0100, Michal Hocko wrote:
> On Wed 10-01-18 11:33:45, Andrew Morton wrote:
> > On Wed, 10 Jan 2018 05:11:44 -0800 Roman Gushchin <guro@fb.com> wrote:
> >
> > > The per-process oom_score_adj interface is not the nicest one, and I'm not
> > > sure we want to replicate it on the cgroup level as is. If you have an idea of what
> > > it should look like, please propose a patch; otherwise it's hard to discuss
> > > it without the code.
> > 
> > It does make sense to have some form of per-cgroup tunability.  Why is
> > the oom_score_adj approach inappropriate and what would be better?
> 
> oom_score_adj is basically unusable for any fine tuning on the process
> level for most setups except for very specialized ones. The only
> reasonable usage I've seen so far was to disable OOM killer for a
> process or make it a prime candidate. Using the same limited concept for
> cgroups sounds like repeating the same error to me.

My 2c here: the current oom_score_adj semantics are really non-trivial for
cgroups. It defines an addition/subtraction in 1/1000ths of total memory or
of the OOMing cgroup's memory limit, depending on the scope of the OOM event.
This is totally out of the user's control, because they may have no
idea about the limit of an upper cgroup in the hierarchy. I've provided
an example earlier in which either of two processes in the
same cgroup can be killed, depending on the scope of the OOM event.
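That scope dependence can be sketched using the general shape of the
kernel's oom_badness() adjustment (roughly, memory footprint plus
oom_score_adj scaled by the scope's total pages); the helper, process
sizes, and limits below are all invented for illustration:

```python
# Sketch: the same oom_score_adj means very different absolute
# adjustments depending on whether totalpages is the whole system
# (global OOM) or a small memcg limit (cgroup-local OOM).

def badness(memory_pages, oom_score_adj, totalpages):
    # Shape of the kernel heuristic: footprint plus adj in 1/1000ths
    # of the OOM scope's total pages.
    return memory_pages + oom_score_adj * totalpages // 1000

GLOBAL_PAGES = 4 * 1024 * 1024   # 16 GB system (4 KB pages)
MEMCG_LIMIT  = 128 * 1024        # 512 MB cgroup limit

# Two processes in the same cgroup: a big one that was given a
# negative adj, and a smaller unadjusted one.
big   = dict(memory_pages=100_000, oom_score_adj=-100)
small = dict(memory_pages=60_000,  oom_score_adj=0)

victims = {}
for scope, total in (("global", GLOBAL_PAGES), ("memcg", MEMCG_LIMIT)):
    b = badness(**big, totalpages=total)
    s = badness(**small, totalpages=total)
    victims[scope] = "big" if b > s else "small"

print(victims)   # {'global': 'small', 'memcg': 'big'}
```

The -100 adjustment shields the big process completely under a global OOM
(it is worth ~420k pages there) but only by ~13k pages under the memcg
OOM, so the chosen victim flips with the scope.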

> 
> > How hard is it to graft such a thing onto the -mm patchset?
> 
> I think this should be thought through very well before we add another
> tuning. Moreover the current usecase doesn't seem to require it so I am
> not really sure why we should implement something right away and later
> suffer from API mistakes.
>  
> > > > I proposed a solution in 
> > > > https://marc.info/?l=linux-kernel&m=150956897302725, which was never 
> > > > responded to, for all of these issues.  The idea is to do hierarchical 
> > > > accounting of mem cgroup hierarchies so that the hierarchy is traversed 
> > > > comparing total usage at each level to select target cgroups.  Admins and 
> > > > users can use memory.oom_score_adj to influence that decisionmaking at 
> > > > each level.
> > > > 
> > > > This solves #1 because mem cgroups can be compared based on the same 
> > > > classes of memory and the root mem cgroup's usage can be fairly compared 
> > > > by subtracting top-level mem cgroup usage from system usage.  All of the 
> > > > criteria used to evaluate a leaf mem cgroup has a reasonable system-wide 
> > > > counterpart that can be used to do the simple subtraction.
> > > > 
> > > > This solves #2 because evaluation is done hierarchically so that 
> > > > distributing processes over a set of child cgroups either intentionally 
> > > > or unintentionally no longer evades the oom killer.  Total usage is always 
> > > > accounted to the parent and there is no escaping this criteria for users.
> > > > 
> > > > This solves #3 because it allows admins to protect important processes in 
> > > > cgroups that are supposed to use, for example, 75% of system memory 
> > > > without it unconditionally being selected for oom kill but still oom kill 
> > > > if it exceeds a certain threshold.  In this sense, the cgroup aware oom 
> > > > killer, as currently implemented, is selling mem cgroups short by 
> > > > requiring the user to accept that the important process will be oom killed 
> > > > iff it uses mem cgroups and isn't attached to root.  It also allows users 
> > > > to actually volunteer to be oom killed first without majority usage.
> > > > 
> > > > It has come up time and time again that this support can be introduced on 
> > > > top of the cgroup oom killer as implemented.  It simply cannot.  For 
> > > > admins and users to have control over decisionmaking, it needs an 
> > > > oom_score_adj type tunable that cannot change semantics from kernel 
> > > > version to kernel version and that does not pollute the mem cgroup filesystem.  
> > > > That, in my suggestion, is an adjustment on the amount of total 
> > > > hierarchical usage of each mem cgroup at each level of the hierarchy.  
> > > > That requires that the heuristic uses hierarchical usage rather than 
> > > > considering each cgroup as independent consumers as it does today.  We 
> > > > need to implement that heuristic and introduce userspace influence over 
> > > > oom kill selection now rather than later because its implementation 
> > > > changes how this patchset is implemented.
> > > > 
> > > > I can implement these changes, if preferred, on top of the current 
> > > > patchset, but I do not believe we want inconsistencies between kernel 
> > > > versions that introduce user visible changes for the sole reason that this 
> > > > current implementation is incomplete and unfair.  We can implement and 
> > > > introduce it once without behavior changing later because the core 
> > > > heuristic has necessarily changed.
> > > 
> > > David, I _had_ hierarchical accounting implemented in one of the previous
> > > versions of this patchset. And there were _reasons_, why we went away from it.
> > 
> > Can you please summarize those issues for my understanding?
> 
> Because it makes the oom decision directly hardwired to the hierarchy
> structure. Just take a simple example of the cgroup structure which
> reflects a higher level organization
>          root
> 	/  |  \ 
>   admins   |   teachers
>         students
> 
> Now your students group will most likely be the largest one. Why should we
> kill tasks/cgroups from that cgroup just because it is cumulatively the
> largest one? It might have been one of the teachers blowing up the
> memory usage.
> 
> Another example is when you need a mid layer cgroups for other
> controllers to better control resources.
> 	root
> 	/  \
>    cpuset1  cpuset2
>    /    \   /  |  \
>   A     B  C   D   E
> 
> You really do not want to select cpuset2 just because it has more
> subgroups and potentially larger cumulative usage. The hierarchical
> accounting works _only_ if higher-level cgroups are semantically
> _comparable_, which might be true for some workloads, but this is by
> no means true in general.
> 
> That all being said, I can see further improvements to happen on top of
> the current work but I also think that the current implementation works
> for the usecase which many users can use without those improvements.

Thank you, Michal, you wrote exactly what I wanted to write here!

Summarizing all this: following the hierarchy is good when it reflects
the "importance" of a cgroup's memory to the user, and bad otherwise.
In the generic case with the unified hierarchy this doesn't hold, so
following the hierarchy unconditionally is bad.

The current version of the patchset allows common evaluation of cgroups
by setting memory.groupoom to true. The only limitation is that it
also changes the OOM action: all belonging processes will be killed.
If we really want to preserve an option to evaluate cgroups
together without forcing the "kill all processes" action, we can
convert memory.groupoom from a boolean into a multi-value knob,
for instance "disabled" and "killall". This will allow adding
a third state ("evaluate", for example) later without breaking the API.

Re root cgroup evaluation:
I believe that it's discussed here mostly as an argument towards the
hierarchical approach. The current heuristic can definitely be improved,
but that doesn't require changing the whole semantics. For example,
we can ignore oom_score_adj (except -1000) in this particular case,
which will fix David's example.

Thank you!

Roman


* Re: [PATCH v13 0/7] cgroup-aware OOM killer
  2018-01-11  9:08         ` Michal Hocko
  2018-01-11 13:18           ` Roman Gushchin
@ 2018-01-11 21:57           ` David Rientjes
  1 sibling, 0 replies; 51+ messages in thread
From: David Rientjes @ 2018-01-11 21:57 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Roman Gushchin, Vladimir Davydov, Johannes Weiner,
	Tetsuo Handa, Tejun Heo, kernel-team, cgroups, linux-doc,
	linux-kernel, linux-mm

On Thu, 11 Jan 2018, Michal Hocko wrote:

> > > I find this problem quite minor, because I haven't seen any practical problems
> > > caused by accounting of the root cgroup memory.
> > > If it's a serious problem for you, it can be solved without switching to the
> > > hierarchical accounting: it's possible to sum up all leaf cgroup stats and
> > > subtract them from global values. So, it can be a relatively small enhancement
> > > on top of the current mm tree. This has nothing to do with global victim selection
> > > approach.
> > 
> > It sounds like a significant shortcoming to me - the oom-killing
> > decisions which David describes are clearly incorrect?
> 
> Well, I would rather look at that from the use case POV. The primary
> user of the new OOM killer functionality are containers. I might be
> wrong but I _assume_ that root cgroup will only contain the basic system
> infrastructure there and all the workload will run containers (aka
> cgroups). The only oom tuning inside the root cgroup would be to disable
> oom for some of those processes. The current implementation should work
> reasonably well for that configuration.
>  

It's a decision made at the system level when cgroup2 is mounted, and it 
affects all processes that get attached to a leaf cgroup, regardless 
of whether or not the memory controller is enabled for that subtree.  
In other words, mounting cgroup2 with the cgroup aware oom killer mount 
option immediately:

 - makes /proc/pid/oom_score_adj effective for all processes attached
   to the root cgroup only, and

 - makes /proc/pid/oom_score_adj a no-op for all processes attached to a
   non-root cgroup.

Note that there may be no active usage of any memory controller at the 
time of oom, yet this tunable inconsistency still exists for any process.

The initial response is correct: it clearly produces incorrect oom killing 
decisions.  This implementation detail either requires the entire system 
is not containerized at all (no processes attached to any leaf cgroup), or 
fully containerized (all processes attached to leaf cgroups).  It's a 
highly specialized usecase with very limited scope and is fraught with 
pitfalls if any oom_score_adj is tuned, because it strictly depends on 
the cgroup to which those processes are attached at any given time to 
determine whether it is effective or not.

> > > We've discussed this a lot.
> > > Hierarchical approach has their own issues, which we've discussed during
> > > previous iterations of the patchset. If you know how to address them
> > > (I've no idea), please, go on and suggest your version.
> > 
> > Well, if a hierarchical approach isn't a workable fix for the problem
> > which David has identified then what *is* the fix?
> 
> Hierarchical approach basically hardcodes the oom decision into the
> hierarchy structure and that is simply a no go because that would turn
> into a massive configuration PITA (more on that below). I consider the above
> example rather artificial to be honest. Anyway, if we _really_ have to
> address it in the future we can do that by providing a mechanism to
> prioritize cgroups. It seems that this is required for some usecases
> anyway.
>  

I'll address the hierarchical accounting suggestion below.

The example is not artificial, it's not theoretical, it's a real-world 
example.  Users can attach processes to subcontainers purely for memory 
stats from a mem cgroup point of view without limiting usage, or to 
subcontainers for use with other controllers other than the memory 
controller.  We do this all the time: it's helpful to assign groups of 
processes to subcontainers simply to track statistics while the actual 
limitation is enforced by an ancestor.

So without hierarchical accounting, we can extend the above restriction, 
that a system is either fully containerized or not containerized at all, 
by saying that a fully containerized system must not use subcontainers to 
avoid the cgroup aware oom killer heuristic.  In other words, using this 
feature requires:

 - the entire system is not containerized at all, or

 - the entire system is fully containerized and no cgroup (any controller,
   not just memory) uses subcontainers to intentionally/unintentionally
   distribute usage to evade this heuristic.

Of course the second restriction severely limits the flexibility that 
cgroup v2 introduces as a whole, merely as a caveat of an implementation 
detail of the memory cgroup aware oom killer.  Why not simply avoid using 
the cgroup aware oom killer?  It's not so easy, since it's a property of 
the machine itself: users probably have no control over it themselves 
and, in the worst case, can trivially evade ever being oom killed if it 
is used.

> > > The per-process oom_score_adj interface is not the nicest one, and I'm not
> > > sure we want to replicate it on the cgroup level as is. If you have an idea of what
> > > it should look like, please propose a patch; otherwise it's hard to discuss
> > > it without the code.
> > 
> > It does make sense to have some form of per-cgroup tunability.  Why is
> > the oom_score_adj approach inappropriate and what would be better?
> 
> oom_score_adj is basically unusable for any fine tuning on the process
> level for most setups except for very specialized ones. The only
> reasonable usage I've seen so far was to disable OOM killer for a
> process or make it a prime candidate. Using the same limited concept for
> cgroups sounds like repeating the same error to me.
> 

oom_score_adj is based on the heuristic that the oom killer uses to decide 
which process to kill, it kills the largest memory hogging process.  Being 
able to tune that based on a proportion of available memory, whether for 
the system as a whole or for a memory cgroup hierarchy makes sense for 
varying RAM capacities and memory cgroup limits.  It works quite well in 
practice; I'm not sure what your experience with it has been, given the 
needs you have with overcommit.

The ability to protect important cgroups and bias against non-important 
cgroups is vital to any selection implementation.  Strictly requiring that 
important cgroups do not have majority usage is not a solution, which is 
what this patchset implements.  It's also insufficient to say that this 
can be added on later, even though the need is acknowledged, because any 
prioritization or userspace influence will require a change to the cgroup 
aware oom killer's core heuristic itself.  That needs to be decided now 
rather than later to avoid differing behavior between kernel versions for 
those who adopt this feature and carefully arrange their cgroup 
hierarchies to fit with the highly specialized usecase before it's 
completely changed out from under them.

> > How hard is it to graft such a thing onto the -mm patchset?
> 
> I think this should be thought through very well before we add another
> tuning. Moreover the current usecase doesn't seem to require it so I am
> not really sure why we should implement something right away and later
> suffer from API mistakes.
>  

The current usecase is so highly specialized that the feature is not 
usable for anybody else, and making it usable for others in the future 
would break that specialized usecase.

It's so highly specialized that you can't even protect an important 
process from oom kill; there is no remedy provided to userspace that still 
allows local mem cgroup oom to actually work.

# echo "+memory" > cgroup.subtree_control
# mkdir cg1
# echo $$ > cg1/cgroup.procs
<fork important process>
# echo -999 > /proc/$!/oom_score_adj
# echo f > /proc/sysrq-trigger

This kills the important process if cg1 has the highest usage, the user 
has no control other than purposefully evading the oom killer, if 
possible, by distributing the process's threads over subcontainers.

Because a highly specialized user is proposing this and it works well for 
them right now does not mean that it is in the best interest of Linux to 
merge, especially if it cannot become generally useful to others without 
core changes that affect the original highly specialized user.  This is 
why the patchset, as is, is currently incomplete.

> > > David, I _had_ hierarchical accounting implemented in one of the previous
> > > versions of this patchset. And there were _reasons_, why we went away from it.
> > 
> > Can you please summarize those issues for my understanding?
> 
> Because it makes the oom decision directly hardwired to the hierarchy
> structure. Just take a simple example of the cgroup structure which
> reflects a higher level organization
>          root
> 	/  |  \ 
>   admins   |   teachers
>         students
> 
> Now your students group will most likely be the largest one. Why should we
> kill tasks/cgroups from that cgroup just because it is cumulatively the
> largest one? It might have been one of the teachers blowing up the
> memory usage.
> 

There's one thing missing here: the use of the proposed 
memory.oom_score_adj.  You can use it to either polarize the decision so 
that oom kills always originate from any of /admins, /students, or 
/teachers, or bias against them.  You can also proportionally prefer 
/admins or /teachers, if desired, so they are selected if their usage 
exceeds a certain threshold based on the limit of the ancestor even though 
that hierarchical usage does not exceed, or even approach, the 
hierarchical usage of /students.

This requires that the entity setting up this hierarchy know only one 
thing: the allowed memory usage of /admins or /teachers compared to the 
very large usage of /students.  It can actually be tuned to a great 
detail.  I only suggest memory.oom_score_adj because it can work for all 
/root capacities, regardless of RAM capacity, as a proportion of the 
system that should be set aside for /admins, /students, and/or /teachers 
rather than hardcoded bytes which would change depending on the amount of 
RAM you have.
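A toy sketch of that proportional tuning (memory.oom_score_adj here is the
proposed, not an existing, interface; the adjusted() helper, capacities,
and adj values are all invented for illustration):

```python
# Sketch of a proportional cgroup bias: each subtree's hierarchical
# usage is adjusted by a fraction (in 1/1000ths) of the ancestor's
# capacity before subtrees are compared, mirroring the per-process
# oom_score_adj semantics.

def adjusted(usage, adj, capacity):
    """Hypothetical score: usage plus adj in 1/1000ths of capacity."""
    return usage + adj * capacity // 1000

CAP = 2_097_152                       # ancestor capacity in pages (8 GB)
students = adjusted(200_000, 0, CAP)  # modest, unbiased sibling subtree

# /admins is effectively allowed ~80% of capacity before it becomes
# the preferred victim over /students:
print(adjusted(1_500_000, -800, CAP) > students)  # False: /students picked
print(adjusted(1_900_000, -800, CAP) > students)  # True: /admins now picked
```

Because the bias is a proportion of the ancestor's capacity rather than a
byte count, the same adj value keeps working as RAM capacities or limits
change: /admins stays protected below its threshold but is still oom
killed once it exceeds it.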

FWIW, we do this exact thing and it works quite well.  This doesn't mean 
that I'm advocating for my own usecase, or anticipated usecase for the 
cgroup aware oom killer, but that you've picked an example that I have 
personally worked with and the bias can work quite well for oom kill 
selection.


* Re: [PATCH v13 0/7] cgroup-aware OOM killer
  2018-01-11 13:18           ` Roman Gushchin
@ 2018-01-12 22:03             ` David Rientjes
  2018-01-15 11:54               ` Michal Hocko
  0 siblings, 1 reply; 51+ messages in thread
From: David Rientjes @ 2018-01-12 22:03 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Michal Hocko, Andrew Morton, linux-mm, Vladimir Davydov,
	Johannes Weiner, Tetsuo Handa, Tejun Heo, kernel-team, cgroups,
	linux-doc, linux-kernel, linux-mm

On Thu, 11 Jan 2018, Roman Gushchin wrote:

> Summarizing all this: following the hierarchy is good when it reflects
> the "importance" of a cgroup's memory to the user, and bad otherwise.
> In the generic case with the unified hierarchy this doesn't hold, so
> following the hierarchy unconditionally is bad.
> 
> The current version of the patchset allows common evaluation of cgroups
> by setting memory.groupoom to true. The only limitation is that it
> also changes the OOM action: all belonging processes will be killed.
> If we really want to preserve an option to evaluate cgroups
> together without forcing the "kill all processes" action, we can
> convert memory.groupoom from a boolean into a multi-value knob,
> for instance "disabled" and "killall". This will allow adding
> a third state ("evaluate", for example) later without breaking the API.
> 

No, this isn't how kernel features get introduced.  We don't design a new 
kernel feature with its own API for a highly specialized usecase and then 
claim we'll fix the problems later.  Users will work around the 
constraints of the new feature, if possible, and then we can end up 
breaking them later.  Or, we can pollute the mem cgroup v2 filesystem with 
even more tunables to cover up for mistakes in earlier designs.

The key point to all three of my objections: extensibility.

Both you and Michal have acknowledged blatantly obvious shortcomings in 
the design.  We do not need to merge an incomplete patchset that forces us 
into a corner later so that we need to add more tunables and behaviors.  
It's also an incompletely documented patchset: the user has *no* insight, 
other than reading the kernel code, that with this functionality their 
/proc/pid/oom_score_adj values are effective when attached to the root 
cgroup and ineffective when attached to a non-root cgroup, regardless of 
whether the memory controller is enabled or not.

Extensibility when designing a new kernel feature is important.  If you 
see shortcomings, which I've enumerated and have been acknowledged, it 
needs to be fixed.

The main consideration here is two choices:

 (1) any logic to determine process(es) to be oom killed belongs only in
     userspace, and has no place in the kernel, or

 (2) such logic is complete, documented, extensible, and is generally
     useful.

This patchset obviously is neither.  I believe we can move toward (2), but 
it needs to be sanely designed and implemented such that it addresses the 
concerns I've enumerated that we are all familiar with.

The solution is to define the oom policy as a trait of the mem cgroup 
itself.  That needs to be complemented with a per mem cgroup value member 
to specify the policy for child mem cgroups.  We don't need any new 
memory.groupoom, which depends on only this policy decision and is 
otherwise a complete no-op, and we don't need any new mount option.

It's quite obvious the system cannot have a single cgroup aware oom 
policy.  That's when you get crazy inconsistencies and functionally broken 
behavior such as influencing oom kill selection based on whether a cgroup 
distributed processes over a set of child cgroups, even if they do not 
have memory in their memory.subtree_control.  It's also where you can give 
the ability to user to still prefer to protect important cgroups or bias 
against non-important cgroups.

The point is that the functionality needs to be complete and it needs to 
be extensible.

I'd propose to introduce a new memory.oom_policy and a new 
memory.oom_value.  All cgroups have these files, but memory.oom_value is a 
no-op for the root mem cgroup, just as you can't limit the memory usage on 
the root mem cgroup.

memory.oom_policy defines the policy for selection when that mem cgroup is 
oom; system oom conditions refer to the root mem cgroup's 
memory.oom_policy.

memory.oom_value defines any parameter that child mem cgroups should be 
considered with when effecting that policy.

There's three memory.oom_policy values I can envision at this time, but it 
can be extended in the future:

 - "adj": cgroup aware just as you have implemented based on how large a
   cgroup is.  oom_value can be used to define any adjustment made to that
   usage so that userspace can protect important cgroups or bias against
   non-important cgroups (I'd suggest using that same semantics as
   oom_score_adj for consistency, but wouldn't object to using a byte
   value).

 - "hierarchy": same as "adj" but based on hierarchical usage.  oom_value
   has the same semantics as "adj".  This prevents users from 
   intentionally or unintentionally affecting oom kill selection by
   limiting groups of their processes by the memory controller, or any
   other cgroup controller.

 - "priority": cgroup aware just as you implemented memory.oom_priority in
   the past.  oom_value is a strict priority value where usage is not
   considered and only the priority of the subtree is compared.

With the cgroup v2 semantics of no internal process constraint, this is 
extremely straightforward.  Your whole oom evaluation function can be 
reused with a simple comparison based on the policy to score individual 
cgroups.  In the simplest policy, "priority", this is like a 10 line 
function, but extremely useful to userspace.
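A sketch of roughly how small that "priority" comparison could be (the
tree layout, names, and the assumption that a higher value marks a
preferred victim are all illustrative; this is not an existing kernel
API):

```python
# Sketch of the proposed "priority" policy: usage is not considered,
# only the per-subtree priority (oom_value) is compared at each level,
# with usage merely breaking ties. Higher value = preferred victim
# (an assumption for this sketch).

def pick_victim(children):
    """Descend the tree, choosing the highest-priority child at each
    level until a leaf cgroup is reached."""
    node = max(children, key=lambda c: (c["priority"], c["usage"]))
    if node["children"]:
        return pick_victim(node["children"])
    return node["name"]

tree = [
    {"name": "db",    "priority": 1, "usage": 8_000, "children": []},
    {"name": "batch", "priority": 5, "usage": 1_000, "children": [
        {"name": "batch/job1", "priority": 3, "usage": 600, "children": []},
        {"name": "batch/job2", "priority": 7, "usage": 400, "children": []},
    ]},
]

print(pick_victim(tree))   # batch/job2: on the highest-priority path,
                           # even though the database uses far more memory
```

The point of the policy is visible here: userspace fully controls the
outcome, and the large "db" subtree is never considered because its
priority says so.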

This allows users to have full power over the decisionmaking in every 
subtree wrt oom kill selection and doesn't regress or allow for any 
pitfalls of the current patchset.  The goal is not to have one single oom 
policy for the entire system, but define the policy that makes useful 
sense.  This is how an extensible feature is designed and does not require 
any further pollution of the mem cgroup filesystem.

If you have additional features such as groupoom, you can make 
memory.oom_policy comma delimited, just as vmpressure modes are comma 
delimited.  You would want to use "adj,groupoom".  We don't need another 
file that is pointless in other policy decisions.  We don't need a mount 
option to lock the entire system into a single methodology.

Cgroup v2 is a very clean interface and I think it's the responsibility of 
every controller to maintain that.  We should not fall into a cgroup v1 
mentality which became very difficult to make extensible.  Let's make a 
feature that is generally useful, complete, and empowers the user rather 
than push them into a corner with a system wide policy with obvious 
downsides.

For these reasons, and the three concerns that I enumerated earlier which 
have been acknowledged of obvious shortcoming with this approach:

Nacked-by: David Rientjes <rientjes@google.com>

I'll be happy to implement the core functionality that allows oom policies 
to be written by the user and introduce memory.oom_value, and then rework 
the logic defined in your patchset as "adj" by giving the user an optional 
way of preferring or biasing that usage in a way that is clearly 
documented and extended.  Root mem cgroup usage is obviously wrong in this 
current patchset since it uses oom_score_adj whereas leaf cgroups do not, 
so that will be fixed.  But I'll ask that the current patchset is removed 
from -mm since it has obvious downsides, pollutes the mem cgroup v2 
filesystem, is not extensible, is not documented wrt to oom_score_adj, 
allows evasion of the heuristic, and doesn't allow the user to have any 
power in the important decision of which of their important processes is 
oom killed such that this feature is not useful outside very specialized 
usecases.


* Re: [PATCH v13 0/7] cgroup-aware OOM killer
  2018-01-10 19:33       ` Andrew Morton
  2018-01-11  9:08         ` Michal Hocko
@ 2018-01-13 17:14         ` Johannes Weiner
  2018-01-14 23:44           ` David Rientjes
  1 sibling, 1 reply; 51+ messages in thread
From: Johannes Weiner @ 2018-01-13 17:14 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Roman Gushchin, David Rientjes, linux-mm, Michal Hocko,
	Vladimir Davydov, Tetsuo Handa, Tejun Heo, kernel-team, cgroups,
	linux-doc, linux-kernel, linux-mm

On Wed, Jan 10, 2018 at 11:33:45AM -0800, Andrew Morton wrote:
> On Wed, 10 Jan 2018 05:11:44 -0800 Roman Gushchin <guro@fb.com> wrote:
> > On Tue, Jan 09, 2018 at 04:57:53PM -0800, David Rientjes wrote:
> > > On Thu, 30 Nov 2017, Andrew Morton wrote:
> > > > > This patchset makes the OOM killer cgroup-aware.
> > > > 
> > > > Thanks, I'll grab these.
> > > > 
> > > > There has been controversy over this patchset, to say the least.  I
> > > > can't say that I followed it closely!  Could those who still have
> > > > reservations please summarise their concerns and hopefully suggest a
> > > > way forward?
> > > > 
> > > 
> > > Yes, I'll summarize what my concerns have been in the past and what they 
> > > are wrt the patchset as it stands in -mm.  None of them originate from my 
> > > current usecase or anticipated future usecase of the oom killer for 
> > > system-wide or memcg-constrained oom conditions.  They are based purely on 
> > > the patchset's use of an incomplete and unfair heuristic for deciding 
> > > which cgroup to target.
> > > 
> > > I'll also suggest simple changes to the patchset, which I have in the 
> > > past, that can be made to address all of these concerns.
> > > 
> > > 1. The unfair comparison of the root mem cgroup vs leaf mem cgroups
> > > 
> > > The patchset uses two different heuristics to compare root and leaf mem 
> > > cgroups and scores them based on number of pages.  For the root mem 
> > > cgroup, it totals the /proc/pid/oom_score of all processes attached: 
> > > that's based on rss, swap, pgtables, and, most importantly, oom_score_adj.  
> > > For leaf mem cgroups, it's based on that memcg's anonymous, unevictable, 
> > > unreclaimable slab, kernel stack, and swap counters.  These can be wildly 
> > > different independent of /proc/pid/oom_score_adj, but the most obvious 
> > > unfairness comes from users who tune oom_score_adj.
> > > 
> > > An example: start a process that faults 1GB of anonymous memory and leave 
> > > it attached to the root mem cgroup.  Start six more processes that each 
> > > fault 1GB of anonymous memory and attached them to a leaf mem cgroup.  Set 
> > > all processes to have /proc/pid/oom_score_adj of 1000.  System oom kill 
> > > will always kill the 1GB process attached to the root mem cgroup.  It's 
> > > because oom_badness() relies on /proc/pid/oom_score_adj, which is used to 
> > > evaluate the root mem cgroup, and leaf mem cgroups completely disregard 
> > > it.
> > > 
> > > In this example, the leaf mem cgroup's score is 1,573,044, the number of 
> > > pages for the 6GB of faulted memory.  The root mem cgroup's score is 
> > > 12,652,907, eight times larger even though its usage is six times smaller.
> > > 
> > > This is caused by the patchset disregarding oom_score_adj entirely for 
> > > leaf mem cgroups and relying on it heavily for the root mem cgroup.  It's 
> > > the complete opposite result of what the cgroup aware oom killer 
> > > advertises.
> > > 
> > > It also works the other way: if a large memory hog is attached to the root 
> > > mem cgroup but has a negative oom_score_adj, it is never killed, and random 
> > > processes are nuked solely because they happened to be attached to a leaf 
> > > mem cgroup.  This behavior wrt oom_score_adj is completely undocumented, 
> > > so I can't presume that it is either known or tested.
> > > 
> > > Solution: compare the root mem cgroup and leaf mem cgroups equally with 
> > > the same criteria by doing hierarchical accounting of usage and 
> > > subtracting from total system usage to find root usage.
> > 
> > I find this problem quite minor, because I haven't seen any practical problems
> > caused by accounting of the root cgroup memory.
> > If it's a serious problem for you, it can be solved without switching to
> > hierarchical accounting: it's possible to sum up all leaf cgroup stats and
> > subtract them from the global values. So, it can be a relatively small
> > enhancement on top of the current mm tree. This has nothing to do with the
> > global victim selection approach.
> 
> It sounds like a significant shortcoming to me - the oom-killing
> decisions which David describes are clearly incorrect?

As others have pointed out, it's an academic problem.

You don't have any control and no accounting of the stuff situated
inside the root cgroup, so it doesn't make sense to leave anything in
there while also using sophisticated containerization mechanisms like
this group oom setting.

In fact, the laptop I'm writing this email on runs an unmodified
mainstream Linux distribution. The only thing in the root cgroup are
kernel threads.

The decisions are good enough for the rare cases you forget something
in there and it explodes.

> If this can be fixed against the -mm patchset with a "relatively small
> enhancement" then please let's get that done so it can be reviewed and
> tested.

You'd have to sum up all the memory consumed by first-level cgroups
and then subtract from the respective system-wide counters. They're
implemented differently than the cgroup tracking, so it's kind of
messy. But it wouldn't be impossible.
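The subtraction itself is trivial; a minimal sketch of the idea (illustrative Python with made-up numbers, not kernel code — the hard part is the differing counter implementations mentioned above):

```python
# Sketch only: estimate memory consumed directly in the root cgroup by
# subtracting first-level cgroup usage from the system-wide total.

def root_usage(system_total, first_level_usages):
    """Approximate root-cgroup usage: total minus accounted children."""
    accounted = sum(first_level_usages)
    # Clamp at zero: system-wide and per-cgroup counters are maintained
    # separately and can transiently disagree.
    return max(0, system_total - accounted)

# 16GB in use system-wide; 10GB and 4GB in two first-level cgroups,
# leaving ~2GB attributable to the root cgroup.
print(root_usage(16 << 30, [10 << 30, 4 << 30]))
```

The clamp is the point: because the global and per-memcg counters are not updated in lockstep, a naive subtraction can briefly go negative and has to be treated as an estimate.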

> > > 2. Evading the oom killer by attaching processes to child cgroups
> > > 
> > > Any cgroup on the system can attach all their processes to individual 
> > > child cgroups.  This is functionally the same as doing
> > > 
> > > 	for i in $(cat cgroup.procs); do mkdir $i; echo $i > $i/cgroup.procs; done
> > > 
> > > without the no internal process constraint introduced with cgroup v2.  All 
> > > child cgroups are evaluated based on their own usage: all anon, 
> > > unevictable, and unreclaimable slab as described previously.  It requires 
> > > an individual cgroup to be the single largest consumer to be targeted by 
> > > the oom killer.
> > > 
> > > An example: allow users to manage two different mem cgroup hierarchies 
> > > limited to 100GB each.  User A uses 10GB of memory and user B uses 90GB of 
> > > memory in their respective hierarchies.  On a system oom condition, we'd 
> > > expect at least one process from user B's hierarchy would always be oom 
> > > killed with the cgroup aware oom killer.  In fact, the changelog 
> > > explicitly states it solves an issue where "1) There is no fairness 
> > > between containers. A small container with few large processes will be 
> > > chosen over a large one with huge number of small processes."
> > > 
> > > The opposite becomes true, however, if user B creates child cgroups and 
> > > distributes its processes such that each child cgroup's usage never 
> > > exceeds 10GB of memory.  This can either be done intentionally to 
> > > purposefully have a low cgroup memory footprint to evade the oom killer or 
> > > unintentionally with cgroup v2 to allow those individual processes to be 
> > > constrained by other cgroups in a single hierarchy model.  User A, using 
> > > 10% of his memory limit, is always oom killed instead of user B, using 90% 
> > > of his memory limit.
> > > 
> > > Others have commented it's still possible to do this with a per-process 
> > > model if users split their processes into many subprocesses with small 
> > > memory footprints.
> > > 
> > > Solution: comparing cgroups must be done hierarchically.  Neither user A 
> > > nor user B can evade the oom killer because targeting is done based on the 
> > > total hierarchical usage rather than individual cgroups in their 
> > > hierarchies.
> > 
> > We've discussed this a lot.
> > The hierarchical approach has its own issues, which we've discussed during
> > previous iterations of the patchset. If you know how to address them
> > (I've no idea), please, go on and suggest your version.
> 
> Well, if a hierarchical approach isn't a workable fix for the problem
> which David has identified then what *is* the fix?

This assumes you even need one. Right now, the OOM killer picks the
biggest MM, so you can evade selection by forking your MM. This patch
allows picking the biggest cgroup, so you can evade by forking groups.

It's not a new vector, and clearly nobody cares. This has never been
brought up against the current design that I know of.

Note, however, that there actually *is* a way to guard against it: in
cgroup2 there is a hierarchical limit you can configure for the number
of cgroups that are allowed to be created in the subtree. See
1a926e0bbab8 ("cgroup: implement hierarchy limits").
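Concretely, the knobs that commit added are cgroup.max.descendants and cgroup.max.depth; a sketch of capping a delegated subtree (paths, names, and limits are illustrative, assuming a cgroup2 mount at /sys/fs/cgroup):

```shell
# Illustrative only: cap how many cgroups a delegated subtree may create.
# Assumes a cgroup2 mount at /sys/fs/cgroup; "user-B" is a made-up name.

# At most 16 descendant cgroups anywhere under user-B's subtree.
echo 16 > /sys/fs/cgroup/user-B/cgroup.max.descendants

# At most 2 levels of nesting below user-B.
echo 2 > /sys/fs/cgroup/user-B/cgroup.max.depth

# Once either limit is hit, further cgroup creation in the subtree
# fails with EAGAIN, so the group can't split itself into many small
# cgroups to dodge selection.
```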

> > > 3. Userspace has zero control over oom kill selection in leaf mem cgroups
> > > 
> > > Unlike using /proc/pid/oom_score_adj to bias or prefer certain processes 
> > > from the oom killer, the cgroup aware oom killer does not provide any 
> > > solution for the user to protect leaf mem cgroups.  This is a result of 
> > > leaf mem cgroups being evaluated based on their anon, unevictable, and 
> > > unreclaimable slab usage and disregarding any user tunable.
> > > 
> > > Absent the cgroup aware oom killer, users have the ability to strongly 
> > > prefer a process is oom killed (/proc/pid/oom_score_adj = 1000) or 
> > > strongly bias against a process (/proc/pid/oom_score_adj = -999).
> > > 
> > > An example: a process knows it's going to use a lot of memory, so it sets 
> > > /proc/self/oom_score_adj to 1000.  It wants to be killed first to avoid 
> > > disrupting any other process.  If it's attached to the root mem cgroup, 
> > > it will be oom killed.  If it's attached to a leaf mem cgroup by an admin 
> > > outside its control, it will never be oom killed unless that cgroup's 
> > > usage is the largest single cgroup usage on the system.  The reverse also 
> > > is true for processes that the admin does not want to be oom killed: set 
> > > /proc/pid/oom_score_adj to -999, but it will *always* be oom killed if its 
> > > cgroup has the highest usage on the system.
> > > 
> > > The result is that both admins and users have lost all control over which 
> > > processes are oom killed.  They are left with only one alternative: set 
> > > /proc/pid/oom_score_adj to -1000 to completely disable a process from oom 
> > > kill.  It doesn't address the issue at all for memcg-constrained oom 
> > > conditions since no processes are killable anymore, and risks panicking 
> > > the system if it is the only process left on the system.  A process 
> > > preferring that it is first in line for oom kill simply cannot volunteer 
> > > anymore.
> > > 
> > > Solution: allow users and admins to control oom kill selection by 
> > > introducing a memory.oom_score_adj to affect the oom score of that mem 
> > > cgroup, exactly the same as /proc/pid/oom_score_adj affects the oom score 
> > > of a process.
> > 
> > The per-process oom_score_adj interface is not the nicest one, and I'm not
> > sure we want to replicate it on cgroup level as is. If you have an idea of how
> > it should look like, please, propose a patch; otherwise it's hard to discuss
> > it without the code.
> 
> It does make sense to have some form of per-cgroup tunability.  Why is
> the oom_score_adj approach inappropriate and what would be better?  How
> hard is it to graft such a thing onto the -mm patchset?

It could be useful, but we have no consensus on the desired
semantics. And it's not clear why we couldn't add it later as long as
the default settings of a new knob maintain the default behavior
(which would have to be preserved anyway, since we rely on it).

> > > I proposed a solution in 
> > > https://marc.info/?l=linux-kernel&m=150956897302725, which was never 
> > > responded to, for all of these issues.  The idea is to do hierarchical 
> > > accounting of mem cgroup hierarchies so that the hierarchy is traversed 
> > > comparing total usage at each level to select target cgroups.  Admins and 
> > > users can use memory.oom_score_adj to influence that decisionmaking at 
> > > each level.

We did respond repeatedly: this doesn't work for a lot of setups.

Take for example our systems. They have two big top level cgroups:
system stuff - logging, monitoring, package management, etc. - and
then the workload subgroup.

Your scheme would always descend the workload group by default. Why
would that be obviously correct? Something in the system group could
have blown up.

What you propose could be done, but it would have to be opt-in - just
like this oom group feature is opt-in.

> > > This solves #1 because mem cgroups can be compared based on the same 
> > > classes of memory and the root mem cgroup's usage can be fairly compared 
> > > by subtracting top-level mem cgroup usage from system usage.  All of the 
> > > criteria used to evaluate a leaf mem cgroup has a reasonable system-wide 
> > > counterpart that can be used to do the simple subtraction.
> > > 
> > > This solves #2 because evaluation is done hierarchically so that 
> > > distributing processes over a set of child cgroups either intentionally 
> > > or unintentionally no longer evades the oom killer.  Total usage is always 
> > > accounted to the parent and there is no escaping this criteria for users.
> > > 
> > > This solves #3 because it allows admins to protect important processes in 
> > > cgroups that are supposed to use, for example, 75% of system memory 
> > > without those cgroups being unconditionally selected for oom kill, while 
> > > still oom killing them if they exceed a certain threshold.  In this sense, the cgroup aware oom 
> > > killer, as currently implemented, is selling mem cgroups short by 
> > > requiring the user to accept that the important process will be oom killed 
> > > iff it uses mem cgroups and isn't attached to root.  It also allows users 
> > > to actually volunteer to be oom killed first without majority usage.
> > > 
> > > It has come up time and time again that this support can be introduced on 
> > > top of the cgroup oom killer as implemented.  It simply cannot.  For 
> > > admins and users to have control over decision-making, it needs an 
> > > oom_score_adj-type tunable that cannot change semantics from kernel 
> > > version to kernel version and without polluting the mem cgroup filesystem.  
> > > That, in my suggestion, is an adjustment on the amount of total 
> > > hierarchical usage of each mem cgroup at each level of the hierarchy.  
> > > That requires that the heuristic uses hierarchical usage rather than 
> > > considering each cgroup as independent consumers as it does today.  We 
> > > need to implement that heuristic and introduce userspace influence over 
> > > oom kill selection now rather than later because its implementation 
> > > changes how this patchset is implemented.
> > > 
> > > I can implement these changes, if preferred, on top of the current 
> > > patchset, but I do not believe we want inconsistencies between kernel 
> > > versions that introduce user visible changes for the sole reason that this 
> > > current implementation is incomplete and unfair.  We can implement and 
> > > introduce it once without behavior changing later because the core 
> > > heuristic has necessarily changed.
> > 
> > David, I _had_ hierarchical accounting implemented in one of the previous
> > versions of this patchset. And there were _reasons_, why we went away from it.
> 
> Can you please summarize those issues for my understanding?

Comparing siblings at each level to find OOM victims is not
universally meaningful. As per above, our workload subgroup will
always be bigger than the system subgroup. Does that mean nothing
should *ever* be killed in the system group per default? That's silly.

In such a scheme, configuring priorities on all cgroups down the tree
is not just something you *can* do; it's something you *must* do to
get reasonable behavior.

That's fine for David, who does that already anyway. But we actually
*like* the way the current OOM killer heuristics work: find the
biggest indivisible memory consumer in the system and kill it (we
merely want to extend the concept of "indivisible memory consumer"
from individual MMs to cgroups of interdependent MMs).

Such a heuristic is mutually exclusive with a hierarchy walk. You
can't know, looking at the consumption of system group vs. workload
group, which subtree has the biggest indivisible consumer. The huge
workload group can consist of hundreds of independent jobs.

And a scoring system can't fix this. Once you're in manual mode, you
cannot configure your way back to letting the OOM killer decide.

A hierarchy mode would simply make the existing OOM heuristics
impossible. And we want to use them. So it's not an alternative
proposal, it's in direct opposition of what we're trying to do here.
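The conflict is easy to see with a toy model (hypothetical sizes; `flat_pick` mimics the patchset's tree-wide leaf comparison, `hier_pick` a level-by-level walk of the kind being proposed):

```python
# Toy model: 'system' holds one 5GB logger; 'workload' holds ten
# independent 1GB jobs.  Sizes are made up for illustration.
tree = {
    "system":   {"logger": 5},
    "workload": {f"job{i}": 1 for i in range(10)},
}

def flat_pick(tree):
    """Tree-wide leaf comparison: the biggest single consumer wins."""
    leaves = {leaf: size
              for sub in tree.values() for leaf, size in sub.items()}
    return max(leaves, key=leaves.get)

def hier_pick(tree):
    """Level-by-level walk: descend into the biggest subtree first."""
    totals = {name: sum(sub.values()) for name, sub in tree.items()}
    biggest_subtree = tree[max(totals, key=totals.get)]
    return max(biggest_subtree, key=biggest_subtree.get)

print(flat_pick(tree))   # the 5GB logger
print(hier_pick(tree))   # some 1GB job: workload's 10GB total wins the walk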

A way to bias oom_group configured cgroups in their selection could be
useful. But 1) it's not necessary for these patches to be useful (we
certainly don't care). 2) It's absolutely possible to add such knobs
later, as long as you leave the *default* behavior alone. 3) David's
proposal is something entirely different and conflicts with existing
OOM heuristics and our usecase, so it couldn't possibly be the answer.

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v13 0/7] cgroup-aware OOM killer
  2018-01-13 17:14         ` Johannes Weiner
@ 2018-01-14 23:44           ` David Rientjes
  2018-01-15 16:25             ` Johannes Weiner
  0 siblings, 1 reply; 51+ messages in thread
From: David Rientjes @ 2018-01-14 23:44 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Roman Gushchin, linux-mm, Michal Hocko,
	Vladimir Davydov, Tetsuo Handa, Tejun Heo, kernel-team, cgroups,
	linux-doc, linux-kernel, linux-mm

On Sat, 13 Jan 2018, Johannes Weiner wrote:

> You don't have any control and no accounting of the stuff situated
> inside the root cgroup, so it doesn't make sense to leave anything in
> there while also using sophisticated containerization mechanisms like
> this group oom setting.
> 
> In fact, the laptop I'm writing this email on runs an unmodified
> mainstream Linux distribution. The only thing in the root cgroup are
> kernel threads.
> 
> The decisions are good enough for the rare cases you forget something
> in there and it explodes.
> 

It's quite trivial to allow the root mem cgroup to be compared exactly the 
same as another cgroup.  Please see 
https://marc.info/?l=linux-kernel&m=151579459920305.

> This assumes you even need one. Right now, the OOM killer picks the
> biggest MM, so you can evade selection by forking your MM. This patch
> allows picking the biggest cgroup, so you can evade by forking groups.
> 

It's quite trivial to prevent any cgroup from evading the oom killer by 
either forking their mm or attaching all their processes to subcontainers.  
Please see https://marc.info/?l=linux-kernel&m=151579459920305.

> It's not a new vector, and clearly nobody cares. This has never been
> brought up against the current design that I know of.
> 

As cgroup v2 becomes more popular, people will organize their cgroup 
hierarchies for all controllers they need to use.  We do this today, for 
example, by attaching some individual consumers to child mem cgroups 
purely for the rich statistics and vmscan stats that mem cgroup provides 
without any limitation on those cgroups.

> Note, however, that there actually *is* a way to guard against it: in
> cgroup2 there is a hierarchical limit you can configure for the number
> of cgroups that are allowed to be created in the subtree. See
> 1a926e0bbab8 ("cgroup: implement hierarchy limits").
> 

Not allowing the user to create subcontainers to track statistics to paper 
over an obvious and acknowledged shortcoming in the design of the cgroup 
aware oom killer seems like a pretty nasty shortcoming itself.

> It could be useful, but we have no concensus on the desired
> semantics. And it's not clear why we couldn't add it later as long as
> the default settings of a new knob maintain the default behavior
> (which would have to be preserved anyway, since we rely on it).
>

The active proposal is 
https://marc.info/?l=linux-kernel&m=151579459920305, which describes an 
extendable interface and one that covers all the shortcomings of this 
patchset without polluting the mem cgroup filesystem.  The default oom 
policy in that proposal would be "none", i.e. we do what we do today, 
based on process usage.  You can configure that, without the mount option 
this patchset introduces for local or hierarchical cgroup targeting.
 
> > > > I proposed a solution in 
> > > > https://marc.info/?l=linux-kernel&m=150956897302725, which was never 
> > > > responded to, for all of these issues.  The idea is to do hierarchical 
> > > > accounting of mem cgroup hierarchies so that the hierarchy is traversed 
> > > > comparing total usage at each level to select target cgroups.  Admins and 
> > > > users can use memory.oom_score_adj to influence that decisionmaking at 
> > > > each level.
> 
> We did respond repeatedly: this doesn't work for a lot of setups.
> 

We need to move this discussion to the active proposal at 
https://marc.info/?l=linux-kernel&m=151579459920305, because it does 
address your setup, so it's not a good use of anyone's time to further 
discuss simply memory.oom_score_adj.

Thanks.

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v13 0/7] cgroup-aware OOM killer
  2018-01-12 22:03             ` David Rientjes
@ 2018-01-15 11:54               ` Michal Hocko
  2018-01-16 21:36                 ` David Rientjes
  0 siblings, 1 reply; 51+ messages in thread
From: Michal Hocko @ 2018-01-15 11:54 UTC (permalink / raw)
  To: David Rientjes
  Cc: Roman Gushchin, Andrew Morton, linux-mm, Vladimir Davydov,
	Johannes Weiner, Tetsuo Handa, Tejun Heo, kernel-team, cgroups,
	linux-doc, linux-kernel, linux-mm

On Fri 12-01-18 14:03:03, David Rientjes wrote:
> On Thu, 11 Jan 2018, Roman Gushchin wrote:
> 
> > Summarizing all this, following the hierarchy is good when it reflects
> > the "importance" of cgroup's memory for a user, and bad otherwise.
> > In generic case with unified hierarchy it's not true, so following
> > the hierarchy unconditionally is bad.
> > 
> > Current version of the patchset allows common evaluation of cgroups
> > by setting memory.groupoom to true. The only limitation is that it
> > also changes the OOM action: all belonging processes will be killed.
> > If we really want to preserve an option to evaluate cgroups
> > together without forcing "kill all processes" action, we can
> > convert memory.groupoom from a boolean to a multi-value knob.
> > For instance, "disabled" and "killall". This will allow adding
> > a third state ("evaluate", for example) later without breaking the API.
> > 
> 
> No, this isn't how kernel features get introduced.  We don't design a new 
> kernel feature with its own API for a highly specialized usecase and then 
> claim we'll fix the problems later.  Users will work around the 
> constraints of the new feature, if possible, and then we can end up 
> breaking them later.  Or, we can pollute the mem cgroup v2 filesystem with 
> even more tunables to cover up for mistakes in earlier designs.

This is a blatant misinterpretation of the proposed changes. I haven't
heard _any_ single argument against the proposed user interface except
for complaints for missing tunables. This is not how the kernel
development works and should work. The usecase was clearly described and
far from limited to a single workload or company.
 
> The key point to all three of my objections: extensibility.

And it has been argued that further _features_ can be added on top. I am
absolutely fed up discussing those things again and again without any
progress. You simply keep _ignoring_ counter arguments and that is a
major PITA to be honest with you. You are basically blocking a useful
feature because it doesn't solve your particular workload. This is
simply not acceptable in the kernel development.

> Both you and Michal have acknowledged blatantly obvious shortcomings in 
> the design.

What you call blatant shortcomings we do not see affecting any
_existing_ workloads. If they turn out to be real issues then we can fix
them without deprecating any user APIs added by this patchset.

Look, I am not the one who is a heavy container user. The primary reason
I am supporting this is because a) it has a _real_ user and I can see
more to emerge and b) because it _makes_ sense in its current form and
it doesn't bring heavy maintenance burden in the OOM code base which I
do care about.

I am deliberately skipping the rest of your email because it is _yet_
_again_ a form of obstructing the current patchset which is what you
have been doing for quite some time. I will leave the final decision
for merging to Andrew. If you want to build a more fine grained control
on top, you are free to do so. I will be reviewing those like any other
upstream oom changes.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v13 0/7] cgroup-aware OOM killer
  2018-01-14 23:44           ` David Rientjes
@ 2018-01-15 16:25             ` Johannes Weiner
  2018-01-16 21:21               ` David Rientjes
  0 siblings, 1 reply; 51+ messages in thread
From: Johannes Weiner @ 2018-01-15 16:25 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Roman Gushchin, linux-mm, Michal Hocko,
	Vladimir Davydov, Tetsuo Handa, Tejun Heo, kernel-team, cgroups,
	linux-doc, linux-kernel, linux-mm

On Sun, Jan 14, 2018 at 03:44:09PM -0800, David Rientjes wrote:
> On Sat, 13 Jan 2018, Johannes Weiner wrote:
> 
> > You don't have any control and no accounting of the stuff situated
> > inside the root cgroup, so it doesn't make sense to leave anything in
> > there while also using sophisticated containerization mechanisms like
> > this group oom setting.
> > 
> > In fact, the laptop I'm writing this email on runs an unmodified
> > mainstream Linux distribution. The only thing in the root cgroup are
> > kernel threads.
> > 
> > The decisions are good enough for the rare cases you forget something
> > in there and it explodes.
> 
> It's quite trivial to allow the root mem cgroup to be compared exactly the 
> same as another cgroup.  Please see 
> https://marc.info/?l=linux-kernel&m=151579459920305.

This only says "that will be fixed" and doesn't address why I care.

> > This assumes you even need one. Right now, the OOM killer picks the
> > biggest MM, so you can evade selection by forking your MM. This patch
> > allows picking the biggest cgroup, so you can evade by forking groups.
> 
> It's quite trivial to prevent any cgroup from evading the oom killer by 
> either forking their mm or attaching all their processes to subcontainers.  
> Please see https://marc.info/?l=linux-kernel&m=151579459920305.

This doesn't address anything I wrote.

> > It's not a new vector, and clearly nobody cares. This has never been
> > brought up against the current design that I know of.
> 
> As cgroup v2 becomes more popular, people will organize their cgroup 
> hierarchies for all controllers they need to use.  We do this today, for 
> example, by attaching some individual consumers to child mem cgroups 
> purely for the rich statistics and vmscan stats that mem cgroup provides 
> without any limitation on those cgroups.

There is no connecting tissue between what I wrote and what you wrote.

> > Note, however, that there actually *is* a way to guard against it: in
> > cgroup2 there is a hierarchical limit you can configure for the number
> > of cgroups that are allowed to be created in the subtree. See
> > 1a926e0bbab8 ("cgroup: implement hierarchy limits").
> 
> Not allowing the user to create subcontainers to track statistics to paper 
> over an obvious and acknowledged shortcoming in the design of the cgroup 
> aware oom killer seems like a pretty nasty shortcoming itself.

It's not what I proposed. There is a big difference between cgroup
fork bombs and having a couple of groups for statistics.

> > > > > I proposed a solution in 
> > > > > https://marc.info/?l=linux-kernel&m=150956897302725, which was never 
> > > > > responded to, for all of these issues.  The idea is to do hierarchical 
> > > > > accounting of mem cgroup hierarchies so that the hierarchy is traversed 
> > > > > comparing total usage at each level to select target cgroups.  Admins and 
> > > > > users can use memory.oom_score_adj to influence that decisionmaking at 
> > > > > each level.
> > 
> > We did respond repeatedly: this doesn't work for a lot of setups.
> 
> We need to move this discussion to the active proposal at 
> https://marc.info/?l=linux-kernel&m=151579459920305, because it does 
> address your setup, so it's not a good use of anyone's time to further 
> discuss simply memory.oom_score_adj.

No, we don't.

We have a patch set that was developed, iterated and improved over 10+
revisions, based on evaluating and weighing trade-offs. It's reached a
state where the memcg maintainers are happy with it and don't have any
concern about future extendabilty to cover more specialized setups.

You've had nine months to contribute and shape this patch series
productively, and instead resorted to a cavalcade of polemics,
evasion, and repetition of truisms and refuted points. A ten paragraph
proposal of vague ideas at this point is simply not a valid argument.

You can send patches to replace or improve on Roman's code and make
the case for them.

Thanks

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v13 0/7] cgroup-aware OOM killer
  2018-01-15 16:25             ` Johannes Weiner
@ 2018-01-16 21:21               ` David Rientjes
  0 siblings, 0 replies; 51+ messages in thread
From: David Rientjes @ 2018-01-16 21:21 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Roman Gushchin, linux-mm, Michal Hocko,
	Vladimir Davydov, Tetsuo Handa, Tejun Heo, kernel-team, cgroups,
	linux-doc, linux-kernel, linux-mm

On Mon, 15 Jan 2018, Johannes Weiner wrote:

> > It's quite trivial to allow the root mem cgroup to be compared exactly the 
> > same as another cgroup.  Please see 
> > https://marc.info/?l=linux-kernel&m=151579459920305.
> 
> This only says "that will be fixed" and doesn't address why I care.
> 

Because the design of the cgroup aware oom killer requires the entire 
system to be fully containerized to be useful and select the 
correct/anticipated victim.  If anything is left behind in the root mem 
cgroup, or on systems that use mem cgroups in ways that you do not, it 
returns wildly unpredictable and downright wrong results; it factors 
oom_score_adj into the selection criteria only for processes attached to 
the root mem cgroup and ignores it otherwise.  If we treated root and leaf 
cgroups as comparable, this problem wouldn't arise.

> > > This assumes you even need one. Right now, the OOM killer picks the
> > > biggest MM, so you can evade selection by forking your MM. This patch
> > > allows picking the biggest cgroup, so you can evade by forking groups.
> > 
> > It's quite trivial to prevent any cgroup from evading the oom killer by 
> > either forking their mm or attaching all their processes to subcontainers.  
> > Please see https://marc.info/?l=linux-kernel&m=151579459920305.
> 
> This doesn't address anything I wrote.
> 

It prevents both problems if you are attached to a mem cgroup subtree.  
You can't fork the mm and you can't fork groups to evade the selection 
logic.  If the root mem cgroup and leaf cgroups were truly comparable, it 
also prevents both problems regardless of which cgroup the processes 
attached to.

> > > It's not a new vector, and clearly nobody cares. This has never been
> > > brought up against the current design that I know of.
> > 
> > As cgroup v2 becomes more popular, people will organize their cgroup 
> > hierarchies for all controllers they need to use.  We do this today, for 
> > example, by attaching some individual consumers to child mem cgroups 
> > purely for the rich statistics and vmscan stats that mem cgroup provides 
> > without any limitation on those cgroups.
> 
> There is no connecting tissue between what I wrote and what you wrote.
> 

You're completely ignoring other usecases other than your own highly 
specialized usecase.  You may attach every user process on the system to 
leaf cgroups and you may not have any users who have control over a 
subtree.  Other people do.  And this patchset, as implemented, returns 
very inaccurate results or allows users to intentionally or 
unintentionally evade the oom killer just because they want to use 
subcontainers.

Creating and attaching a subset of processes to either top-level 
containers or subcontainers for either limitation by mem cgroup or for 
statistics is a valid usecase, and one that is used in practice.  Your 
suggestion of avoiding that problem to work around the shortcomings of 
this patchset by limiting how many subcontainers can be created by the 
user is utterly ridiculous.

> We have a patch set that was developed, iterated and improved over 10+
> revisions, based on evaluating and weighing trade-offs. It's reached a
> state where the memcg maintainers are happy with it and don't have any
> concern about future extendabilty to cover more specialized setups.
> 

It's also obviously untested in those 10+ revisions since it uses 
oom_badness() for the root mem cgroup and not leaf mem cgroups, which is 
why it breaks any system where user processes are attached to the root mem 
cgroup.  See my example where a 1GB worker attached to the root mem cgroup 
is preferred over a cgroup with 6GB workers.  It may have been tested with 
your own highly specialized usecase, but not with anything else, and the 
author obviously admits its shortcomings.
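The asymmetry in that example can be reproduced with back-of-the-envelope numbers (a rough sketch: the machine size is invented and oom_badness() is heavily simplified, but the direction of the result matches):

```python
# Rough model of the two scoring paths in the patchset; the ~48GB
# machine size is made up, and oom_badness() is heavily simplified.
TOTAL_PAGES = 12 << 20   # ~48GB worth of 4K pages
GB_PAGES = 1 << 18       # 4K pages per GB

def proc_badness(rss_pages, oom_score_adj):
    """Approximation of per-process oom_badness(): usage plus the
    oom_score_adj bias scaled by total system memory."""
    return max(0, rss_pages + oom_score_adj * TOTAL_PAGES // 1000)

# Root memcg: scored as the sum of its tasks' badness, adj included.
root_score = proc_badness(1 * GB_PAGES, oom_score_adj=1000)

# Leaf memcg: scored purely on its usage counters, adj ignored.
leaf_score = 6 * GB_PAGES

# The 1GB root task outscores the 6GB cgroup.
print(root_score, leaf_score, root_score > leaf_score)
```

Under this simplified model, a single oom_score_adj=1000 task biases the root score by the whole machine's page count, while no tunable moves the leaf score at all.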

> You've had nine months to contribute and shape this patch series
> productively, and instead resorted to a cavalcade of polemics,
> evasion, and repetition of truisms and refuted points. A ten paragraph
> proposal of vague ideas at this point is simply not a valid argument.
> 

The patch series has gone through massive changes over the past nine 
months and I've brought up three very specific concerns with its current 
form that makes it non-extensible.  I know the patchset has very 
significant changes or rewrites in its history, but I'm only concerned 
with its present form because it's not something that can easily be 
extended later.  We don't need useless files polluting the cgroup v2 
filesystem for things that don't matter with other oom policies, and we 
don't need the mount option, actually.

> You can send patches to replace or improve on Roman's code and make
> the case for them.
> 

I volunteered in the email thread where I proposed an alternative to 
replace it, and I'm actively seeking any feedback on that proposal to 
address anybody else's concerns early on.  Your participation in that 
would be very useful.

Thanks.

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v13 0/7] cgroup-aware OOM killer
  2018-01-15 11:54               ` Michal Hocko
@ 2018-01-16 21:36                 ` David Rientjes
  2018-01-16 22:09                   ` Michal Hocko
  0 siblings, 1 reply; 51+ messages in thread
From: David Rientjes @ 2018-01-16 21:36 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Roman Gushchin, Andrew Morton, Vladimir Davydov, Johannes Weiner,
	Tetsuo Handa, Tejun Heo, kernel-team, cgroups, linux-doc,
	linux-kernel, linux-mm

On Mon, 15 Jan 2018, Michal Hocko wrote:

> > No, this isn't how kernel features get introduced.  We don't design a new 
> > kernel feature with its own API for a highly specialized usecase and then 
> > claim we'll fix the problems later.  Users will work around the 
> > constraints of the new feature, if possible, and then we can end up 
> > breaking them later.  Or, we can pollute the mem cgroup v2 filesystem with 
> > even more tunables to cover up for mistakes in earlier designs.
> 
> This is a blatant misinterpretation of the proposed changes. I haven't
> heard _any_ single argument against the proposed user interface except
> for complaints for missing tunables. This is not how the kernel
> development works and should work. The usecase was clearly described and
> far from limited to a single workload or company.
>  

The complaint about the user interface is that it is not extensible, as my 
next line states.  This doesn't need to be opted into with a mount option 
locking the entire system into a single oom policy.  That, itself, is the 
result of a poor design.  What is needed is a way for users to define an 
oom policy that is generally useful, not something that is locked in for 
the whole system.  We don't need several different cgroup mount options 
only for mem cgroup oom policies.  We also don't need random 
memory.groupoom files being added to the mem cgroup v2 filesystem only for 
one or two particular policies and being no-ops otherwise.  It can easily 
be specified as part of the policy itself.  My suggestion adds two new 
files to the mem cgroup v2 filesystem and no mount option, and allows any 
policy to be added later that only uses these two files.  I see you've 
ignored all of that in this email, so perhaps reading it would be 
worthwhile so that you can provide constructive feedback.

> > The key point to all three of my objections: extensibility.
> 
> And it has been argued that further _features_ can be added on top. I am
> absolutely fed up discussing those things again and again without any
> progress. You simply keep _ignoring_ counter arguments and that is a
> major PITA to be honest with you. You are basically blocking a useful
> feature because it doesn't solve your particular workload. This is
> simply not acceptable in the kernel development.
> 

As the thread says, this has nothing to do with my own particular 
workload; it has to do with three obvious shortcomings in the design that 
the user has no control over.  We can't add features on top if the 
heuristic itself changes as a result of the proposal; it needs to be 
introduced in an extensible way so that additional changes can be made 
later, if necessary, while still working around the very obvious problems 
with this current implementation.  My suggestion is that we introduce a 
way to define the oom policy once so that we don't have to change it later 
and are left with needless mount options or mem cgroup v2 files that 
become no-ops with the suggested design.  I hope that you will read the 
proposal for that extensible interface and comment on it about any 
concerns that you have, because that feedback would generally be useful.

> > Both you and Michal have acknowledged blatantly obvious shortcomings in 
> > the design.
> 
> What you call blatant shortcomings we do not see affecting any
> _existing_ workloads. If they turn out to be real issues then we can fix
> them without deprecating any user APIs added by this patchset.
> 

There are existing workloads that use mem cgroup subcontainers purely for 
tracking charging and vmscan stats, which results in this logic being 
evaded.  It's a real issue, and a perfectly acceptable usecase for mem 
cgroup.  It's a result of the entire oom policy either being opted into or 
opted out of for the entire system and impossible for the user to 
configure or avoid.  That can be done better by enabling the oom policy 
only for a subtree, as I've suggested and you've ignored.  It would also 
render both the mount option and the additional file in the mem cgroup v2 
filesystem added by this patchset to be no-ops.


* Re: [PATCH v13 0/7] cgroup-aware OOM killer
  2018-01-16 21:36                 ` David Rientjes
@ 2018-01-16 22:09                   ` Michal Hocko
  0 siblings, 0 replies; 51+ messages in thread
From: Michal Hocko @ 2018-01-16 22:09 UTC (permalink / raw)
  To: David Rientjes
  Cc: Roman Gushchin, Andrew Morton, Vladimir Davydov, Johannes Weiner,
	Tetsuo Handa, Tejun Heo, kernel-team, cgroups, linux-doc,
	linux-kernel, linux-mm

On Tue 16-01-18 13:36:21, David Rientjes wrote:
> On Mon, 15 Jan 2018, Michal Hocko wrote:
> 
> > > No, this isn't how kernel features get introduced.  We don't design a new 
> > > kernel feature with its own API for a highly specialized usecase and then 
> > > claim we'll fix the problems later.  Users will work around the 
> > > constraints of the new feature, if possible, and then we can end up 
> > > breaking them later.  Or, we can pollute the mem cgroup v2 filesystem with 
> > > even more tunables to cover up for mistakes in earlier designs.
> > 
> > This is a blatant misinterpretation of the proposed changes. I haven't
> > heard _any_ single argument against the proposed user interface except
> > for complaints for missing tunables. This is not how the kernel
> > development works and should work. The usecase was clearly described and
> > far from limited to a single workload or company.
> >  
> 
> The complaint about the user interface is that it is not extensible, as my 
> next line states.

I disagree and will not repeat the arguments why.

> This doesn't need to be opted into with a mount option 
> locking the entire system into a single oom policy.  That, itself, is the 
> result of a poor design.  What is needed is a way for users to define an 
> oom policy that is generally useful, not something that is locked in for 
> the whole system. 

We have been discussing general oom policies for years now and there was
_no_ _single_ useful/acceptable approach suggested. Nor is your sketch,
I am afraid, because we could argue that it doesn't address other
usecases out there which need more specific control. All that with
_no code_ merged. The current one is a policy that addresses a
reasonably large class of usecases out there based on containers without
forcing everybody else to use the same policy.

> We don't need several different cgroup mount options 
> only for mem cgroup oom policies.

A cgroup mount option sounds like a reasonable approach, one already used
for the unified hierarchy in its early stages.

> We also don't need random 
> memory.groupoom files being added to the mem cgroup v2 filesystem only for 
> one or two particular policies and being no-ops otherwise.

This groupoom is a fundamental API allowing the whole cgroup to be
killed, which is a reasonable thing to do and also a sane user API
regardless of implementation details. Any oom selection policy can be
built on top.
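For context on the knob itself: it is a per-memcg boolean, enabled by writing to the memcg's control file. A sketch follows; the cgroup path is illustrative, and note the file name varied across revisions of this series (memory.oom_group, memory.groupoom) before the interface eventually landed in Linux 4.19 as memory.oom.group:

```
# Enable group kill for one workload (path and mount point illustrative).
echo 1 > /sys/fs/cgroup/workload/memory.oom.group

# Any OOM kill that selects a task in /workload now takes down the whole
# cgroup, so the container restarts from a clean state instead of limping
# on with a random process missing.
```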

> It can easily 
> be specified as part of the policy itself.

No it cannot, because it would conflate oom selection _and_ oom action
together. And that would be wrong _semantically_, I believe. And I am quite
sure we can discuss what kind of policies we need to death and won't
move on. Exactly like, ehm, until now.

So let me repeat. There are users for the functionality. Users have to
explicitly opt in, so existing users are not at risk of regressions.
Further, more fine-grained oom selection policies can be implemented on top
without breaking new users.
In short: There is no single reason to block this from being merged.

If your usecase is not covered yet then feel free to extend the existing
code/APIs to do so. I will happily review and discuss them like I've
been doing here even though I am myself not a user of this new
functionality.
-- 
Michal Hocko
SUSE Labs


* Re: [PATCH v13 0/7] cgroup-aware OOM killer
  2017-11-30 15:28 [PATCH v13 0/7] cgroup-aware OOM killer Roman Gushchin
                   ` (8 preceding siblings ...)
  2017-12-01  9:14 ` [PATCH] mm, oom: simplify alloc_pages_before_oomkill handling Michal Hocko
@ 2018-06-05 11:47 ` Michal Hocko
  2018-06-05 12:13   ` Michal Hocko
  2018-07-13 21:59   ` David Rientjes
  9 siblings, 2 replies; 51+ messages in thread
From: Michal Hocko @ 2018-06-05 11:47 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: linux-mm, Vladimir Davydov, Johannes Weiner, Tetsuo Handa,
	David Rientjes, Andrew Morton, Tejun Heo, kernel-team, cgroups,
	linux-doc, linux-kernel, linux-mm

It seems that this is still in limbo mostly because of David's concerns.
So let me reiterate them and provide my POV once more (and the last
time) just to help Andrew make a decision:

1) comparison of root with tail memcgs during the OOM killer is not fair
because we are comparing tasks with memcgs.

This is true, but I do not think this matters much for workloads which
are going to use the feature. Why? Because the main consumers of the new
feature seem to be containers which really need some fairness when
comparing _workloads_ rather than processes. Those are unlikely to
contain any significant memory consumers in the root memcg. That would
be mostly common infrastructure.

Is this fixable? Yes, we would need to account in the root memcgs.
Why are we not doing that now? Because it has some non-negligible
performance overhead. Are there other ways? Yes we can approximate root
memcg memory consumption but I would rather wait for somebody seeing
that as a real problem rather than add hacks now without a strong
reason.


2) Evading the oom killer by attaching processes to child cgroups which
basically means that a task can split up the workload into smaller
memcgs to hide their real memory consumption.

Again true but not really anything new. Processes can already fork and
split up the memory consumption. Moreover it doesn't even require any
special privileges to do so unlike creating a sub memcg. Is this
fixable? Yes, untrusted workloads can set up group oom evaluation at the
delegation layer so all subgroups would be considered together.
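The splitting problem and the delegation-layer fix can be illustrated numerically. A toy model follows — plain Python with illustrative numbers, explicitly not kernel code: an untrusted job that splits 5 units of memory across five sub-memcgs evades a per-leaf comparison, but not when its delegated subtree is scored as one unit.

```python
# Toy model of OOM victim scoring under memcg splitting.
# Not kernel code; all names and numbers are illustrative.

def leaf_scores(tree):
    """Score every leaf memcg independently (the evadable behaviour)."""
    out = {}
    def walk(name, node):
        children = node.get("children", {})
        if children:
            for cname, child in children.items():
                walk(cname, child)
        else:
            out[name] = node.get("usage", 0)
    for name, node in tree.items():
        walk(name, node)
    return out

def subtree_score(node):
    """Score a delegated subtree as one unit (the proposed fix)."""
    total = node.get("usage", 0)
    for child in node.get("children", {}).values():
        total += subtree_score(child)
    return total

# An untrusted job hides 5 units across five 1-unit sub-memcgs; an honest
# job holds 4 units in one memcg.
tree = {
    "untrusted": {"children": {"shard%d" % i: {"usage": 1} for i in range(5)}},
    "honest": {"usage": 4},
}

scores = leaf_scores(tree)
# Per-leaf view: every shard scores 1, so the honest 4-unit job is killed.
print(max(scores, key=scores.get))          # honest
# Subtree view at the delegation layer: the split no longer hides anything.
print(subtree_score(tree["untrusted"]))     # 5
```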

3) Userspace has zero control over oom kill selection in leaf mem
cgroups.

Again true but this is something that needs a good evaluation to not end
up in the fiasco we have seen with oom_score*. Current users demanding
this feature can live without any prioritization so blocking the whole
feature seems unreasonable.

4) Future extensibility to be backward compatible.

David is wrong here IMHO. Any prioritization or oom selection policy
controls added in future are orthogonal to the oom_group concept added
by this patchset. Allowing memcg to be an oom entity is something that
we really want longterm. Global CGRP_GROUP_OOM is the most restrictive
semantic and softening it will be possible by adding a new knob to
tell whether a memcg/hierarchy is a workload or a set of tasks.
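The semantics being argued over can be modeled in a few lines. Below is a rough sketch in Python — explicitly not kernel code, all names hypothetical — of the selection logic as described in this thread: leaf memcgs are compared by aggregate footprint, and oom_group decides whether the whole group or only its biggest task is killed. (The real patchset additionally special-cases the root memcg and oom_score_adj == -1000 tasks, which this sketch ignores.)

```python
# Illustrative model of cgroup-aware OOM victim selection.
# NOT kernel code; names, structure, and units are hypothetical.

class Task:
    def __init__(self, name, usage):
        self.name = name
        self.usage = usage  # stand-in for oom_badness()

class Memcg:
    def __init__(self, name, tasks=None, children=None, oom_group=False):
        self.name = name
        self.tasks = tasks or []
        self.children = children or []
        self.oom_group = oom_group

    def score(self):
        # A memcg is scored by the aggregate footprint of its subtree.
        return (sum(t.usage for t in self.tasks)
                + sum(c.score() for c in self.children))

def select_victims(root):
    """Pick the leaf memcg with the highest score; kill all of its tasks
    if oom_group is set, otherwise only the biggest one."""
    leaves = []
    def walk(mc):
        if mc.children:
            for c in mc.children:
                walk(c)
        else:
            leaves.append(mc)
    walk(root)
    victim = max(leaves, key=Memcg.score)
    if victim.oom_group:
        return [t.name for t in victim.tasks]
    return [max(victim.tasks, key=lambda t: t.usage).name]
```

With oom_group set on the biggest memcg, the whole workload dies together; without it, only the largest task inside that memcg is killed.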
-- 
Michal Hocko
SUSE Labs


* Re: [PATCH v13 0/7] cgroup-aware OOM killer
  2018-06-05 11:47 ` [PATCH v13 0/7] cgroup-aware OOM killer Michal Hocko
@ 2018-06-05 12:13   ` Michal Hocko
  2018-07-13 21:59   ` David Rientjes
  1 sibling, 0 replies; 51+ messages in thread
From: Michal Hocko @ 2018-06-05 12:13 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: linux-mm, Vladimir Davydov, Johannes Weiner, Tetsuo Handa,
	David Rientjes, Andrew Morton, Tejun Heo, kernel-team, cgroups,
	linux-doc, linux-kernel, linux-mm

On Tue 05-06-18 13:47:29, Michal Hocko wrote:
> It seems that this is still in limbo mostly because of David's concerns.
> So let me reiterate them and provide my POV once more (and the last
> time) just to help Andrew make a decision:

Sorry, I forgot to add reference to the email with the full David's
reasoning. Here it is http://lkml.kernel.org/r/alpine.DEB.2.10.1801091556490.173445@chino.kir.corp.google.com
 
> 1) comparison of root with tail memcgs during the OOM killer is not fair
> because we are comparing tasks with memcgs.
> 
> This is true, but I do not think this matters much for workloads which
> are going to use the feature. Why? Because the main consumers of the new
> feature seem to be containers which really need some fairness when
> comparing _workloads_ rather than processes. Those are unlikely to
> contain any significant memory consumers in the root memcg. That would
> be mostly common infrastructure.
> 
> Is this fixable? Yes, we would need to account in the root memcgs.
> Why are we not doing that now? Because it has some non-negligible
> performance overhead. Are there other ways? Yes we can approximate root
> memcg memory consumption but I would rather wait for somebody seeing
> that as a real problem rather than add hacks now without a strong
> reason.
> 
> 
> 2) Evading the oom killer by attaching processes to child cgroups which
> basically means that a task can split up the workload into smaller
> memcgs to hide their real memory consumption.
> 
> Again true but not really anything new. Processes can already fork and
> split up the memory consumption. Moreover it doesn't even require any
> special privileges to do so unlike creating a sub memcg. Is this
> fixable? Yes, untrusted workloads can set up group oom evaluation at the
> delegation layer so all subgroups would be considered together.
> 
> 3) Userspace has zero control over oom kill selection in leaf mem
> cgroups.
> 
> Again true but this is something that needs a good evaluation to not end
> up in the fiasco we have seen with oom_score*. Current users demanding
> this feature can live without any prioritization so blocking the whole
> feature seems unreasonable.
> 
> 4) Future extensibility to be backward compatible.
> 
> David is wrong here IMHO. Any prioritization or oom selection policy
> controls added in future are orthogonal to the oom_group concept added
> by this patchset. Allowing memcg to be an oom entity is something that
> we really want longterm. Global CGRP_GROUP_OOM is the most restrictive
> semantic and softening it will be possible by adding a new knob to
> tell whether a memcg/hierarchy is a workload or a set of tasks.
> -- 
> Michal Hocko
> SUSE Labs

-- 
Michal Hocko
SUSE Labs


* Re: [PATCH v13 0/7] cgroup-aware OOM killer
  2018-06-05 11:47 ` [PATCH v13 0/7] cgroup-aware OOM killer Michal Hocko
  2018-06-05 12:13   ` Michal Hocko
@ 2018-07-13 21:59   ` David Rientjes
  2018-07-14  1:55     ` Tetsuo Handa
  2018-07-16  9:36     ` Michal Hocko
  1 sibling, 2 replies; 51+ messages in thread
From: David Rientjes @ 2018-07-13 21:59 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Roman Gushchin, linux-mm, Vladimir Davydov, Johannes Weiner,
	Tetsuo Handa, Andrew Morton, Tejun Heo, kernel-team, cgroups,
	linux-doc, linux-kernel, linux-mm

On Tue, 5 Jun 2018, Michal Hocko wrote:

> 1) comparison of root with tail memcgs during the OOM killer is not fair
> because we are comparing tasks with memcgs.
> 
> This is true, but I do not think this matters much for workloads which
> are going to use the feature. Why? Because the main consumers of the new
> feature seem to be containers which really need some fairness when
> comparing _workloads_ rather than processes. Those are unlikely to
> contain any significant memory consumers in the root memcg. That would
> be mostly common infrastructure.
> 

There are users (us) who want to use the feature and not all processes are 
attached to leaf mem cgroups.  The functionality can be provided in a 
generally useful way that doesn't require any specific hierarchy, and I 
implemented this in my patch series at 
https://marc.info/?l=linux-mm&m=152175563004458&w=2.  That proposal to fix 
*all* of my concerns with the cgroup-aware oom killer as it sits in -mm, 
in its entirety, only extends it so it is generally useful and does not 
remove any functionality.  I'm not sure why we are discussing ways of 
moving forward when that patchset has been waiting for review for almost 
four months and, to date, I haven't seen an objection to it.

I don't know why we cannot agree on making solutions generally useful nor 
why that patchset has not been merged into -mm so that the whole feature 
can be merged.  It's baffling.  This is the first time I've encountered a 
perceived stalemate when there is a patchset sitting, unreviewed, that 
fixes all of the concerns that there are about the implementation sitting 
in -mm.

This isn't a criticism just of comparing processes attached to root 
differently from leaf mem cgroups; it's also how oom_score_adj influences that 
decision.  It's trivial for a very small consumer (not "significant" 
memory consumer, as you put it) to require an oom kill from root instead 
of a leaf mem cgroup.  I show in 
https://marc.info/?l=linux-mm&m=152175564104468&w=2 that changing the 
oom_score_adj of my bash shell attached to the root mem cgroup is 
considered equal to a 95GB leaf mem cgroup with the current 
implementation.

> Is this fixable? Yes, we would need to account in the root memcgs.
> Why are we not doing that now? Because it has some non-negligible
> performance overhead. Are there other ways? Yes we can approximate root
> memcg memory consumption but I would rather wait for somebody seeing
> that as a real problem rather than add hacks now without a strong
> reason.
> 

I fixed this in https://marc.info/?t=152175564500007&r=1&w=2, and from 
what I remember, Roman actually liked it.

> 2) Evading the oom killer by attaching processes to child cgroups which
> basically means that a task can split up the workload into smaller
> memcgs to hide their real memory consumption.
> 
> Again true but not really anything new. Processes can already fork and
> split up the memory consumption. Moreover it doesn't even require any
> special privileges to do so unlike creating a sub memcg. Is this
> fixable? Yes, untrusted workloads can set up group oom evaluation at the
> delegation layer so all subgroups would be considered together.
> 

Processes being able to fork to split up memory consumption is also fixed 
by https://marc.info/?l=linux-mm&m=152175564104467 just as creating 
subcontainers to intentionally or unintentionally subvert the oom 
policy is fixed.  It solves both problems.

> 3) Userspace has zero control over oom kill selection in leaf mem
> cgroups.
> 
> Again true but this is something that needs a good evaluation to not end
> up in the fiasco we have seen with oom_score*. Current users demanding
> this feature can live without any prioritization so blocking the whole
> feature seems unreasonable.
> 

One objection here is how the oom_score_adj of a process means something 
or doesn't mean something depending on what cgroup it is attached to.  The 
cgroup-aware oom killer is cgroup aware.  oom_score_adj should play no 
part.  I fixed this with https://marc.info/?t=152175564500007&r=1&w=2.  
The other objection is that users do have cgroups that shouldn't be oom 
killed because they are important, either because they are required to 
provide a service for a smaller cgroup or because of business goals.  We 
have cgroups that use more than half of system memory and they are allowed 
to do so because they are important.  We would love to be able to bias 
against that cgroup to prefer others, or prefer cgroups for oom kill 
because they are less important.  This was done for processes with 
oom_score_adj, we need it for a cgroup aware oom killer for the same 
reason.

But notice even in https://marc.info/?l=linux-mm&m=152175563004458&w=2 
that I said priority or adjustment can be added on top of the feature 
after it's merged.  This itself is not precluding anything from being 
merged.

> 4) Future extensibility to be backward compatible.
> 
> David is wrong here IMHO. Any prioritization or oom selection policy
> controls added in future are orthogonal to the oom_group concept added
> by this patchset. Allowing memcg to be an oom entity is something that
> we really want longterm. Global CGRP_GROUP_OOM is the most restrictive
> semantic and softening it will be possible by adding a new knob to
> tell whether a memcg/hierarchy is a workload or a set of tasks.

I've always said that the mechanism and policy in this patchset should be 
separated.  I do that exact thing in 
https://marc.info/?l=linux-mm&m=152175564304469&w=2.  I suggest that 
different subtrees will want (or the admin will require) different 
behaviors with regard to the mechanism.


I've stated the problems (and there are others wrt mempolicy selection) 
that the current implementation has and given a full solution at 
https://marc.info/?l=linux-mm&m=152175563004458&w=2 that has not been 
reviewed.  I would love feedback from anybody on this thread on that.  I'm 
not trying to preclude the cgroup-aware oom killer from being merged, I'm 
the only person actively trying to get it merged.

Thanks.


* Re: [PATCH v13 0/7] cgroup-aware OOM killer
  2018-07-13 21:59   ` David Rientjes
@ 2018-07-14  1:55     ` Tetsuo Handa
  2018-07-16 21:13       ` Tetsuo Handa
  2018-07-16  9:36     ` Michal Hocko
  1 sibling, 1 reply; 51+ messages in thread
From: Tetsuo Handa @ 2018-07-14  1:55 UTC (permalink / raw)
  To: David Rientjes, Michal Hocko
  Cc: Roman Gushchin, linux-mm, Vladimir Davydov, Johannes Weiner,
	Andrew Morton, Tejun Heo, kernel-team, cgroups, linux-doc,
	linux-kernel, linux-mm

On 2018/07/14 6:59, David Rientjes wrote:
> I'm not trying to preclude the cgroup-aware oom killer from being merged,
> I'm the only person actively trying to get it merged.

Before merging the cgroup-aware oom killer, can we merge OOM lockup fixes
and my cleanup? The gap between linux.git and linux-next.git keeps us unable
to use an agreed-upon baseline.



* Re: [PATCH v13 0/7] cgroup-aware OOM killer
  2018-07-13 21:59   ` David Rientjes
  2018-07-14  1:55     ` Tetsuo Handa
@ 2018-07-16  9:36     ` Michal Hocko
  2018-07-17  3:59       ` David Rientjes
  1 sibling, 1 reply; 51+ messages in thread
From: Michal Hocko @ 2018-07-16  9:36 UTC (permalink / raw)
  To: David Rientjes
  Cc: Roman Gushchin, linux-mm, Vladimir Davydov, Johannes Weiner,
	Tetsuo Handa, Andrew Morton, Tejun Heo, kernel-team, cgroups,
	linux-doc, linux-kernel, linux-mm

On Fri 13-07-18 14:59:59, David Rientjes wrote:
> On Tue, 5 Jun 2018, Michal Hocko wrote:
> 
> > 1) comparison of root with tail memcgs during the OOM killer is not fair
> > because we are comparing tasks with memcgs.
> > 
> > This is true, but I do not think this matters much for workloads which
> > are going to use the feature. Why? Because the main consumers of the new
> > feature seem to be containers which really need some fairness when
> > comparing _workloads_ rather than processes. Those are unlikely to
> > contain any significant memory consumers in the root memcg. That would
> > be mostly common infrastructure.
> > 
> 
> There are users (us) who want to use the feature and not all processes are 
> attached to leaf mem cgroups.  The functionality can be provided in a 
> generally useful way that doesn't require any specific hierarchy, and I 
> implemented this in my patch series at 
> https://marc.info/?l=linux-mm&m=152175563004458&w=2.  That proposal to fix 
> *all* of my concerns with the cgroup-aware oom killer as it sits in -mm, 
> in its entirety, only extends it so it is generally useful and does not 
> remove any functionality.  I'm not sure why we are discussing ways of 
> moving forward when that patchset has been waiting for review for almost 
> four months and, to date, I haven't seen an objection to it.

Well, I didn't really get to your patches yet. The last time I checked
I had some pretty serious concerns about the consistency of your
proposal. Those might have been fixed in the latest version of your
patchset, which I haven't seen. But I still strongly suspect that you are
largely underestimating the complexity of the more generic oom policies
you are heading toward.

Considering user API failures from the past (oom_*adj fiasco for
example) suggests that we should start with smaller steps and only
provide a clear and simple API. oom_group is such a simple and
semantically consistent thing which is the reason I am OK with it much
more than your "we can be more generic" approach. I simply do not trust
we can agree on sane and consistent api in a reasonable time.

And it is quite mind-boggling that a simpler approach has been basically
blocked for months because there are some concerns for workloads which
are not really asking for the feature. Sure your usecase might need to
handle root memcg differently. That is a fair point but that shouldn't
really block container users who can use the proposed solution without
any further changes. If we ever decide to handle root memcg differently
we are free to do so because the oom selection policy is not carved in
stone by any api.
 
[...]
-- 
Michal Hocko
SUSE Labs


* Re: [PATCH v13 0/7] cgroup-aware OOM killer
  2018-07-14  1:55     ` Tetsuo Handa
@ 2018-07-16 21:13       ` Tetsuo Handa
  2018-07-16 22:09         ` Roman Gushchin
  0 siblings, 1 reply; 51+ messages in thread
From: Tetsuo Handa @ 2018-07-16 21:13 UTC (permalink / raw)
  To: Andrew Morton
  Cc: David Rientjes, Michal Hocko, Roman Gushchin, linux-mm,
	Vladimir Davydov, Johannes Weiner, Tejun Heo, kernel-team,
	cgroups, linux-doc, linux-kernel, linux-mm

No response from Roman and David...

Andrew, will you drop Roman's cgroup-aware OOM killer and David's patches for now?
Roman's series has a bug which I mentioned and which can be avoided by my patch.
David's patch is using MMF_UNSTABLE incorrectly, such that it might start selecting
the next OOM victim without trying to reclaim any memory.

Since they are not responding to my mail, I suggest dropping them from linux-next for now.

https://www.spinics.net/lists/linux-mm/msg153212.html
https://lore.kernel.org/lkml/201807130620.w6D6KiAJ093010@www262.sakura.ne.jp/T/#u

On 2018/07/14 10:55, Tetsuo Handa wrote:
> On 2018/07/14 6:59, David Rientjes wrote:
>> I'm not trying to preclude the cgroup-aware oom killer from being merged,
>> I'm the only person actively trying to get it merged.
> 
> Before merging the cgroup-aware oom killer, can we merge OOM lockup fixes
> and my cleanup? The gap between linux.git and linux-next.git keeps us unable
> to use an agreed-upon baseline.
> 


* Re: [PATCH v13 0/7] cgroup-aware OOM killer
  2018-07-16 21:13       ` Tetsuo Handa
@ 2018-07-16 22:09         ` Roman Gushchin
  2018-07-17  0:55           ` Tetsuo Handa
  0 siblings, 1 reply; 51+ messages in thread
From: Roman Gushchin @ 2018-07-16 22:09 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: Andrew Morton, David Rientjes, Michal Hocko, linux-mm,
	Vladimir Davydov, Johannes Weiner, Tejun Heo, kernel-team,
	cgroups, linux-doc, linux-kernel, linux-mm

On Tue, Jul 17, 2018 at 06:13:47AM +0900, Tetsuo Handa wrote:
> No response from Roman and David...
> 
> Andrew, will you drop Roman's cgroup-aware OOM killer and David's patches for now?
> Roman's series has a bug which I mentioned and which can be avoided by my patch.
> David's patch is using MMF_UNSTABLE incorrectly, such that it might start selecting
> the next OOM victim without trying to reclaim any memory.
> 
> Since they are not responding to my mail, I suggest dropping them from linux-next for now.

I was in cc, and didn't think that you were expecting something from me.

I don't get why it's necessary to drop the cgroup oom killer to merge your fix.
I'm happy to help with rebasing and everything else.

Thanks,
Roman


* Re: [PATCH v13 0/7] cgroup-aware OOM killer
  2018-07-16 22:09         ` Roman Gushchin
@ 2018-07-17  0:55           ` Tetsuo Handa
  2018-07-31 14:14             ` Tetsuo Handa
  0 siblings, 1 reply; 51+ messages in thread
From: Tetsuo Handa @ 2018-07-17  0:55 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Andrew Morton, David Rientjes, Michal Hocko, linux-mm,
	Vladimir Davydov, Johannes Weiner, Tejun Heo, kernel-team,
	cgroups, linux-doc, linux-kernel, linux-mm

Roman Gushchin wrote:
> On Tue, Jul 17, 2018 at 06:13:47AM +0900, Tetsuo Handa wrote:
> > No response from Roman and David...
> > 
> > Andrew, will you drop Roman's cgroup-aware OOM killer and David's patches for now?
> > Roman's series has a bug which I mentioned and which can be avoided by my patch.
> > David's patch is using MMF_UNSTABLE incorrectly, such that it might start selecting
> > the next OOM victim without trying to reclaim any memory.
> > 
> > Since they are not responding to my mail, I suggest dropping them from linux-next for now.
> 
> I was in cc, and didn't think that you were expecting something from me.

Oops. I was waiting for your response. ;-)

  But Roman, my patch conflicts with your "mm, oom: cgroup-aware OOM killer" patch
  in linux-next. And it seems to me that your patch contains a bug which leads to
  premature memory allocation failure explained below.

  Can we apply my patch prior to your "mm, oom: cgroup-aware OOM killer" patch
  (which eliminates "delay" and "out:" from your patch) so that people can easily
  backport my patch? Or, do you want to apply a fix (which eliminates "delay" and
  "out:" from linux-next) prior to my patch?

> 
> I don't get why it's necessary to drop the cgroup oom killer to merge your fix.
> I'm happy to help with rebasing and everything else.

Yes, I wish you would rebase your series on top of the OOM lockup (CVE-2016-10723)
mitigation patch (https://marc.info/?l=linux-mm&m=153112243424285&w=4). It is a
trivial change and easy to backport cleanly (if applied before your series).

Also, I expect you to check whether my cleanup patch which removes the "abort"
path ([PATCH 1/2] at https://marc.info/?l=linux-mm&m=153119509215026&w=4) helps
simplify your series. I don't know the detailed behavior of your series, but I
assume that your series does not kill threads whose MMF_OOM_SKIP the current
thread should not wait for.


* Re: [PATCH v13 0/7] cgroup-aware OOM killer
  2018-07-16  9:36     ` Michal Hocko
@ 2018-07-17  3:59       ` David Rientjes
  0 siblings, 0 replies; 51+ messages in thread
From: David Rientjes @ 2018-07-17  3:59 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Roman Gushchin, linux-mm, Vladimir Davydov, Johannes Weiner,
	Tetsuo Handa, Andrew Morton, Tejun Heo, kernel-team, cgroups,
	linux-doc, linux-kernel, linux-mm

On Mon, 16 Jul 2018, Michal Hocko wrote:

> Well, I didn't really get to your patches yet. The last time I've
> checked I had some pretty serious concerns about the consistency of your
> proposal. Those might have been fixed in the latest version of your
> patchset, which I haven't seen. But I still strongly suspect that you are
> largely underestimating the complexity of the more generic oom policies
> you are heading toward.
> 

I don't believe it's underestimated since it's used.  It's perfectly valid 
to lock an entire hierarchy or individual subtrees into a single policy 
if that's what is preferred.  Any use of a different policy at a subtree 
root is a conscious decision made by the owner of that subtree.  If they 
prefer to kill the largest process, the largest descendant cgroup, or the 
largest subtree, it is up to them.  All three have valid usecases; the 
goal is not to lock the entire hierarchy into a single policy: this 
introduces the ability for users to subvert the selection policy either 
intentionally or unintentionally because they are using a unified single 
hierarchy with cgroup v2 and they are using controllers other than mem 
cgroup.

> Considering user API failures from the past (oom_*adj fiasco for
> example) suggests that we should start with smaller steps and only
> provide a clear and simple API. oom_group is such a simple and
> semantically consistent thing which is the reason I am OK with it much
> more than your "we can be more generic" approach. I simply do not trust
> we can agree on sane and consistent api in a reasonable time.
> 
> And it is quite mind-boggling that a simpler approach has been basically
> blocked for months because there are some concerns about workloads which
> are not really asking for the feature. Sure, your usecase might need to
> handle root memcg differently. That is a fair point but that shouldn't
> really block containers users who can use the proposed solution without
> any further changes. If we ever decide to handle root memcg differently
> we are free to do so because the oom selection policy is not carved in
> stone by any api.
>  

Please respond directly to the patchset which clearly enumerates the 
problems with the current implementation in -mm.

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v13 0/7] cgroup-aware OOM killer
  2018-07-17  0:55           ` Tetsuo Handa
@ 2018-07-31 14:14             ` Tetsuo Handa
  2018-08-01 16:37               ` Roman Gushchin
  0 siblings, 1 reply; 51+ messages in thread
From: Tetsuo Handa @ 2018-07-31 14:14 UTC (permalink / raw)
  To: Roman Gushchin, Andrew Morton
  Cc: David Rientjes, Michal Hocko, linux-mm, Vladimir Davydov,
	Johannes Weiner, Tejun Heo, kernel-team, cgroups, linux-doc,
	linux-kernel, linux-mm

On 2018/07/17 9:55, Tetsuo Handa wrote:
>> I don't get it. Why is it necessary to drop the cgroup oom killer to merge your fix?
>> I'm happy to help with rebasing and everything else.
> 
> Yes, I wish you would rebase your series on top of the OOM lockup (CVE-2016-10723) mitigation
> patch ( https://marc.info/?l=linux-mm&m=153112243424285&w=4 ). It is a trivial change
> and easy to cleanly backport (if applied before your series).
> 
> Also, I expect you to check whether my cleanup patch which removes the "abort" path
> ( [PATCH 1/2] at https://marc.info/?l=linux-mm&m=153119509215026&w=4 ) helps
> simplify your series. I don't know the detailed behavior of your series, but I
> assume that your series does not kill threads which the current thread should not
> wait for MMF_OOM_SKIP on.

syzbot is hitting WARN(1) due to mem_cgroup_out_of_memory() == false.
https://syzkaller.appspot.com/bug?id=ea8c7912757d253537375e981b61749b2da69258

I can't tell which change is triggering this race. Maybe the removal of oom_lock from
the oom reaper made it more likely to hit. But anyway I suspect that

static bool oom_kill_memcg_victim(struct oom_control *oc)
{
        if (oc->chosen_memcg == NULL || oc->chosen_memcg == INFLIGHT_VICTIM)
                return oc->chosen_memcg; // <= This line is still broken

because

                /* We have one or more terminating processes at this point. */
                oc->chosen_task = INFLIGHT_VICTIM;

is not called.

Also, that patch is causing confusion by reviving schedule_timeout_killable(1)
with oom_lock held.

Can we temporarily drop the cgroup-aware OOM killer from linux-next.git and
apply my cleanup patch? Since the merge window is approaching, I really want to
see what the next -rc1 would look like...

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v13 0/7] cgroup-aware OOM killer
  2018-07-31 14:14             ` Tetsuo Handa
@ 2018-08-01 16:37               ` Roman Gushchin
  2018-08-01 22:01                 ` Tetsuo Handa
  0 siblings, 1 reply; 51+ messages in thread
From: Roman Gushchin @ 2018-08-01 16:37 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: Andrew Morton, David Rientjes, Michal Hocko, linux-mm,
	Vladimir Davydov, Johannes Weiner, Tejun Heo, kernel-team,
	cgroups, linux-doc, linux-kernel, linux-mm

On Tue, Jul 31, 2018 at 11:14:01PM +0900, Tetsuo Handa wrote:
> On 2018/07/17 9:55, Tetsuo Handa wrote:
> >> I don't get it. Why is it necessary to drop the cgroup oom killer to merge your fix?
> >> I'm happy to help with rebasing and everything else.
> > 
> > Yes, I wish you would rebase your series on top of the OOM lockup (CVE-2016-10723) mitigation
> > patch ( https://marc.info/?l=linux-mm&m=153112243424285&w=4 ). It is a trivial change
> > and easy to cleanly backport (if applied before your series).
> > 
> > Also, I expect you to check whether my cleanup patch which removes the "abort" path
> > ( [PATCH 1/2] at https://marc.info/?l=linux-mm&m=153119509215026&w=4 ) helps
> > simplify your series. I don't know the detailed behavior of your series, but I
> > assume that your series does not kill threads which the current thread should not
> > wait for MMF_OOM_SKIP on.
> 
> syzbot is hitting WARN(1) due to mem_cgroup_out_of_memory() == false.
> https://syzkaller.appspot.com/bug?id=ea8c7912757d253537375e981b61749b2da69258
> 
> I can't tell which change is triggering this race. Maybe the removal of oom_lock from
> the oom reaper made it more likely to hit. But anyway I suspect that
> 
> static bool oom_kill_memcg_victim(struct oom_control *oc)
> {
>         if (oc->chosen_memcg == NULL || oc->chosen_memcg == INFLIGHT_VICTIM)
>                 return oc->chosen_memcg; // <= This line is still broken
> 
> because
> 
>                 /* We have one or more terminating processes at this point. */
>                 oc->chosen_task = INFLIGHT_VICTIM;
> 
> is not called.
> 
> Also, that patch is causing confusion by reviving schedule_timeout_killable(1)
> with oom_lock held.
> 
> Can we temporarily drop the cgroup-aware OOM killer from linux-next.git and
> apply my cleanup patch? Since the merge window is approaching, I really want to
> see what the next -rc1 would look like...

Hi Tetsuo!

Has this cleanup patch been acked by somebody?
Which problem does it solve?
Dropping patches in order to make a cleanup (if it's a cleanup) sounds a bit strange.

Anyway, there is a good chance that the current cgroup-aware OOM killer
implementation will be replaced by a lightweight version (memory.oom.group).
Please take a look at it; your cleanup will probably not conflict with it
at all.

Thanks!

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v13 0/7] cgroup-aware OOM killer
  2018-08-01 16:37               ` Roman Gushchin
@ 2018-08-01 22:01                 ` Tetsuo Handa
  2018-08-01 22:55                   ` Roman Gushchin
  0 siblings, 1 reply; 51+ messages in thread
From: Tetsuo Handa @ 2018-08-01 22:01 UTC (permalink / raw)
  To: Roman Gushchin, Andrew Morton
  Cc: David Rientjes, Michal Hocko, Vladimir Davydov, Johannes Weiner,
	Tejun Heo, kernel-team, cgroups, linux-doc, linux-kernel,
	linux-mm

On 2018/08/02 1:37, Roman Gushchin wrote:
> On Tue, Jul 31, 2018 at 11:14:01PM +0900, Tetsuo Handa wrote:
>> Can we temporarily drop the cgroup-aware OOM killer from linux-next.git and
>> apply my cleanup patch? Since the merge window is approaching, I really want to
>> see what the next -rc1 would look like...
> 
> Hi Tetsuo!
> 
> Has this cleanup patch been acked by somebody?

Not yet. But since Michal considers this cleanup as "a nice shortcut"
( https://marc.info/?i=20180607112836.GN32433@dhcp22.suse.cz ), I assume that
I will get an ACK regarding this cleanup.

> Which problem does it solve?

It simplifies the tricky out_of_memory() return value decision, and
it also fixes a bug in your series which syzbot is pointing out.

> Dropping patches in order to make a cleanup (if it's a cleanup) sounds a bit strange.

What I need is a git tree which I can use as a baseline for making this cleanup.
linux.git is not suitable because it does not include Michal's fix, and
linux-next.git is not suitable because Michal's fix is overwritten by your series.
I want a git tree which includes Michal's fix and does not include your series.

> 
> Anyway, there is a good chance that the current cgroup-aware OOM killer
> implementation will be replaced by a lightweight version (memory.oom.group).
> Please take a look at it; your cleanup will probably not conflict with it
> at all.

Then, please drop the current cgroup-aware OOM killer implementation from linux-next.git .
I want to see what the next -rc1 would look like (for testing purposes) and want to use
linux-next.git as a baseline (for making this cleanup).

> 
> Thanks!
> 
> 


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v13 0/7] cgroup-aware OOM killer
  2018-08-01 22:01                 ` Tetsuo Handa
@ 2018-08-01 22:55                   ` Roman Gushchin
  0 siblings, 0 replies; 51+ messages in thread
From: Roman Gushchin @ 2018-08-01 22:55 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: Andrew Morton, David Rientjes, Michal Hocko, Vladimir Davydov,
	Johannes Weiner, Tejun Heo, kernel-team, cgroups, linux-doc,
	linux-kernel, linux-mm

On Thu, Aug 02, 2018 at 07:01:28AM +0900, Tetsuo Handa wrote:
> On 2018/08/02 1:37, Roman Gushchin wrote:
> > On Tue, Jul 31, 2018 at 11:14:01PM +0900, Tetsuo Handa wrote:
> >> Can we temporarily drop cgroup-aware OOM killer from linux-next.git and
> >> apply my cleanup patch? Since the merge window is approaching, I really want to
> >> see how next -rc1 would look like...
> > 
> > Hi Tetsuo!
> > 
> > Has this cleanup patch been acked by somebody?
> 
> Not yet. But since Michal considers this cleanup as "a nice shortcut"
> ( https://marc.info/?i=20180607112836.GN32433@dhcp22.suse.cz ), I assume that
> I will get an ACK regarding this cleanup.
> 
> > Which problem does it solve?
> 
> It simplifies the tricky out_of_memory() return value decision, and
> it also fixes a bug in your series which syzbot is pointing out.
> 
> > Dropping patches in order to make a cleanup (if it's a cleanup) sounds a bit strange.
> 
> What I need is a git tree which I can use as a baseline for making this cleanup.
> linux.git is not suitable because it does not include Michal's fix, and
> linux-next.git is not suitable because Michal's fix is overwritten by your series.
> I want a git tree which includes Michal's fix and does not include your series.
> 
> > 
> > Anyway, there is a good chance that the current cgroup-aware OOM killer
> > implementation will be replaced by a lightweight version (memory.oom.group).
> > Please take a look at it; your cleanup will probably not conflict with it
> > at all.
> 
> Then, please drop the current cgroup-aware OOM killer implementation from linux-next.git .
> I want to see what the next -rc1 would look like (for testing purposes) and want to use
> linux-next.git as a baseline (for making this cleanup).

I'll post memory.oom.group v2 later today, and if there are no objections,
I'll ask Andrew to drop the current memcg-aware OOM killer and replace it
with the lightweight memory.oom.group.

These changes will be picked up by linux-next in a few days.

Thanks!

^ permalink raw reply	[flat|nested] 51+ messages in thread

end of thread, other threads:[~2018-08-01 22:56 UTC | newest]

Thread overview: 51+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-11-30 15:28 [PATCH v13 0/7] cgroup-aware OOM killer Roman Gushchin
2017-11-30 15:28 ` [PATCH v13 1/7] mm, oom: refactor the oom_kill_process() function Roman Gushchin
2017-11-30 15:28 ` [PATCH v13 2/7] mm: implement mem_cgroup_scan_tasks() for the root memory cgroup Roman Gushchin
2017-11-30 15:28 ` [PATCH v13 3/7] mm, oom: cgroup-aware OOM killer Roman Gushchin
2017-12-01  8:35   ` Michal Hocko
2017-12-07  1:24   ` Andrew Morton
2017-12-07 13:39     ` Roman Gushchin
2017-11-30 15:28 ` [PATCH v13 4/7] mm, oom: introduce memory.oom_group Roman Gushchin
2017-11-30 15:28 ` [PATCH v13 5/7] mm, oom: add cgroup v2 mount option for cgroup-aware OOM killer Roman Gushchin
2017-12-01  8:41   ` Michal Hocko
2017-12-01 13:15     ` Roman Gushchin
2017-12-01 13:31       ` Michal Hocko
2017-12-01 17:00         ` Roman Gushchin
2017-11-30 15:28 ` [PATCH v13 6/7] mm, oom, docs: describe the " Roman Gushchin
2017-12-01  8:41   ` Michal Hocko
2017-12-01 17:01     ` Roman Gushchin
2017-12-01 17:13       ` Michal Hocko
2017-11-30 15:28 ` [PATCH v13 7/7] cgroup: list groupoom in cgroup features Roman Gushchin
2017-11-30 20:39 ` [PATCH v13 0/7] cgroup-aware OOM killer Andrew Morton
2018-01-10  0:57   ` David Rientjes
2018-01-10 13:11     ` Roman Gushchin
2018-01-10 19:33       ` Andrew Morton
2018-01-11  9:08         ` Michal Hocko
2018-01-11 13:18           ` Roman Gushchin
2018-01-12 22:03             ` David Rientjes
2018-01-15 11:54               ` Michal Hocko
2018-01-16 21:36                 ` David Rientjes
2018-01-16 22:09                   ` Michal Hocko
2018-01-11 21:57           ` David Rientjes
2018-01-13 17:14         ` Johannes Weiner
2018-01-14 23:44           ` David Rientjes
2018-01-15 16:25             ` Johannes Weiner
2018-01-16 21:21               ` David Rientjes
2018-01-10 20:50       ` David Rientjes
2017-12-01  9:14 ` [PATCH] mm, oom: simplify alloc_pages_before_oomkill handling Michal Hocko
2017-12-01 13:26   ` Tetsuo Handa
2017-12-01 13:32   ` Roman Gushchin
2017-12-01 13:54     ` Michal Hocko
2018-06-05 11:47 ` [PATCH v13 0/7] cgroup-aware OOM killer Michal Hocko
2018-06-05 12:13   ` Michal Hocko
2018-07-13 21:59   ` David Rientjes
2018-07-14  1:55     ` Tetsuo Handa
2018-07-16 21:13       ` Tetsuo Handa
2018-07-16 22:09         ` Roman Gushchin
2018-07-17  0:55           ` Tetsuo Handa
2018-07-31 14:14             ` Tetsuo Handa
2018-08-01 16:37               ` Roman Gushchin
2018-08-01 22:01                 ` Tetsuo Handa
2018-08-01 22:55                   ` Roman Gushchin
2018-07-16  9:36     ` Michal Hocko
2018-07-17  3:59       ` David Rientjes

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).