linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
* [v10 0/6] cgroup-aware OOM killer
@ 2017-10-04 15:46 Roman Gushchin
  2017-10-04 15:46 ` [v10 1/6] mm, oom: refactor the oom_kill_process() function Roman Gushchin
                   ` (5 more replies)
  0 siblings, 6 replies; 43+ messages in thread
From: Roman Gushchin @ 2017-10-04 15:46 UTC (permalink / raw)
  To: linux-mm
  Cc: Roman Gushchin, Michal Hocko, Vladimir Davydov, Johannes Weiner,
	Tetsuo Handa, David Rientjes, Andrew Morton, Tejun Heo,
	kernel-team, cgroups, linux-doc, linux-kernel

This patchset makes the OOM killer cgroup-aware.

v10:
  - Separate oom_group introduction into a standalone patch
  - Stop propagating oom_group
  - Make oom_group delegatable
  - Do not try to kill the biggest task in the first order,
    if the whole cgroup is going to be killed
  - Stop caching oom_score on struct memcg, optimize victim
    memcg selection
  - Drop dmesg printing (for further refining)
  - Small refactorings and comments added here and there
  - Rebase on top of mm tree

v9:
  - Change siblings-to-siblings comparison to the tree-wide search,
    make related refactorings
  - Make oom_group implicitly propagated down by the tree
  - Fix an issue with task selection in root cgroup

v8:
  - Do not kill tasks with OOM_SCORE_ADJ -1000
  - Make the whole thing opt-in with cgroup mount option control
  - Drop oom_priority for further discussions
  - Kill the whole cgroup if oom_group is set and it's
    memory.max is reached
  - Update docs and commit messages

v7:
  - __oom_kill_process() drops reference to the victim task
  - oom_score_adj -1000 is always respected
  - Renamed oom_kill_all to oom_group
  - Dropped oom_prio range, converted from short to int
  - Added a cgroup v2 mount option to disable cgroup-aware OOM killer
  - Docs updated
  - Rebased on top of mmotm

v6:
  - Renamed oom_control.chosen to oom_control.chosen_task
  - Renamed oom_kill_all_tasks to oom_kill_all
  - Per-node NR_SLAB_UNRECLAIMABLE accounting
  - Several minor fixes and cleanups
  - Docs updated

v5:
  - Rebased on top of Michal Hocko's patches, which have changed the
    way how OOM victims becoming an access to the memory
    reserves. Dropped corresponding part of this patchset
  - Separated the oom_kill_process() splitting into a standalone commit
  - Added debug output (suggested by David Rientjes)
  - Some minor fixes

v4:
  - Reworked per-cgroup oom_score_adj into oom_priority
    (based on ideas by David Rientjes)
  - Tasks with oom_score_adj -1000 are never selected if
    oom_kill_all_tasks is not set
  - Memcg victim selection code is reworked, and
    synchronization is based on finding tasks with OOM victim marker,
    rather then on global counter
  - Debug output is dropped
  - Refactored TIF_MEMDIE usage

v3:
  - Merged commits 1-4 into 6
  - Separated oom_score_adj logic and debug output into separate commits
  - Fixed swap accounting

v2:
  - Reworked victim selection based on feedback
    from Michal Hocko, Vladimir Davydov and Johannes Weiner
  - "Kill all tasks" is now an opt-in option, by default
    only one process will be killed
  - Added per-cgroup oom_score_adj
  - Refined oom score calculations, suggested by Vladimir Davydov
  - Converted to a patchset

v1:
  https://lkml.org/lkml/2017/5/18/969


Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: David Rientjes <rientjes@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: kernel-team@fb.com
Cc: cgroups@vger.kernel.org
Cc: linux-doc@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org


Roman Gushchin (6):
  mm, oom: refactor the oom_kill_process() function
  mm: implement mem_cgroup_scan_tasks() for the root memory cgroup
  mm, oom: cgroup-aware OOM killer
  mm, oom: introduce memory.oom_group
  mm, oom: add cgroup v2 mount option for cgroup-aware OOM killer
  mm, oom, docs: describe the cgroup-aware OOM killer

 Documentation/cgroup-v2.txt |  51 +++++++++
 include/linux/cgroup-defs.h |   5 +
 include/linux/memcontrol.h  |  34 ++++++
 include/linux/oom.h         |  12 ++-
 kernel/cgroup/cgroup.c      |  10 ++
 mm/memcontrol.c             | 248 +++++++++++++++++++++++++++++++++++++++++++-
 mm/oom_kill.c               | 209 ++++++++++++++++++++++++-------------
 7 files changed, 491 insertions(+), 78 deletions(-)

-- 
2.13.6

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 43+ messages in thread

* [v10 1/6] mm, oom: refactor the oom_kill_process() function
  2017-10-04 15:46 [v10 0/6] cgroup-aware OOM killer Roman Gushchin
@ 2017-10-04 15:46 ` Roman Gushchin
  2017-10-04 19:14   ` Johannes Weiner
  2017-10-04 15:46 ` [v10 2/6] mm: implement mem_cgroup_scan_tasks() for the root memory cgroup Roman Gushchin
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 43+ messages in thread
From: Roman Gushchin @ 2017-10-04 15:46 UTC (permalink / raw)
  To: linux-mm
  Cc: Roman Gushchin, Vladimir Davydov, Johannes Weiner, Tetsuo Handa,
	David Rientjes, Andrew Morton, Tejun Heo, kernel-team, cgroups,
	linux-doc, linux-kernel

The oom_kill_process() function consists of two logical parts:
the first one is responsible for considering task's children as
a potential victim and printing the debug information.
The second half is responsible for sending SIGKILL to all
tasks sharing the mm struct with the given victim.

This commit splits the oom_kill_process() function with
an intention to re-use the the second half: __oom_kill_process().

The cgroup-aware OOM killer will kill multiple tasks
belonging to the victim cgroup. We don't need to print
the debug information for the each task, as well as play
with task selection (considering task's children),
so we can't use the existing oom_kill_process().

Signed-off-by: Roman Gushchin <guro@fb.com>
Acked-by: Michal Hocko <mhocko@kernel.org>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: David Rientjes <rientjes@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: kernel-team@fb.com
Cc: cgroups@vger.kernel.org
Cc: linux-doc@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
---
 mm/oom_kill.c | 123 +++++++++++++++++++++++++++++++---------------------------
 1 file changed, 65 insertions(+), 58 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index e284810b9851..1e7b8a27e6cc 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -822,68 +822,12 @@ static bool task_will_free_mem(struct task_struct *task)
 	return ret;
 }
 
-static void oom_kill_process(struct oom_control *oc, const char *message)
+static void __oom_kill_process(struct task_struct *victim)
 {
-	struct task_struct *p = oc->chosen;
-	unsigned int points = oc->chosen_points;
-	struct task_struct *victim = p;
-	struct task_struct *child;
-	struct task_struct *t;
+	struct task_struct *p;
 	struct mm_struct *mm;
-	unsigned int victim_points = 0;
-	static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
-					      DEFAULT_RATELIMIT_BURST);
 	bool can_oom_reap = true;
 
-	/*
-	 * If the task is already exiting, don't alarm the sysadmin or kill
-	 * its children or threads, just give it access to memory reserves
-	 * so it can die quickly
-	 */
-	task_lock(p);
-	if (task_will_free_mem(p)) {
-		mark_oom_victim(p);
-		wake_oom_reaper(p);
-		task_unlock(p);
-		put_task_struct(p);
-		return;
-	}
-	task_unlock(p);
-
-	if (__ratelimit(&oom_rs))
-		dump_header(oc, p);
-
-	pr_err("%s: Kill process %d (%s) score %u or sacrifice child\n",
-		message, task_pid_nr(p), p->comm, points);
-
-	/*
-	 * If any of p's children has a different mm and is eligible for kill,
-	 * the one with the highest oom_badness() score is sacrificed for its
-	 * parent.  This attempts to lose the minimal amount of work done while
-	 * still freeing memory.
-	 */
-	read_lock(&tasklist_lock);
-	for_each_thread(p, t) {
-		list_for_each_entry(child, &t->children, sibling) {
-			unsigned int child_points;
-
-			if (process_shares_mm(child, p->mm))
-				continue;
-			/*
-			 * oom_badness() returns 0 if the thread is unkillable
-			 */
-			child_points = oom_badness(child,
-				oc->memcg, oc->nodemask, oc->totalpages);
-			if (child_points > victim_points) {
-				put_task_struct(victim);
-				victim = child;
-				victim_points = child_points;
-				get_task_struct(victim);
-			}
-		}
-	}
-	read_unlock(&tasklist_lock);
-
 	p = find_lock_task_mm(victim);
 	if (!p) {
 		put_task_struct(victim);
@@ -957,6 +901,69 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
 }
 #undef K
 
+static void oom_kill_process(struct oom_control *oc, const char *message)
+{
+	struct task_struct *p = oc->chosen;
+	unsigned int points = oc->chosen_points;
+	struct task_struct *victim = p;
+	struct task_struct *child;
+	struct task_struct *t;
+	unsigned int victim_points = 0;
+	static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
+					      DEFAULT_RATELIMIT_BURST);
+
+	/*
+	 * If the task is already exiting, don't alarm the sysadmin or kill
+	 * its children or threads, just give it access to memory reserves
+	 * so it can die quickly
+	 */
+	task_lock(p);
+	if (task_will_free_mem(p)) {
+		mark_oom_victim(p);
+		wake_oom_reaper(p);
+		task_unlock(p);
+		put_task_struct(p);
+		return;
+	}
+	task_unlock(p);
+
+	if (__ratelimit(&oom_rs))
+		dump_header(oc, p);
+
+	pr_err("%s: Kill process %d (%s) score %u or sacrifice child\n",
+		message, task_pid_nr(p), p->comm, points);
+
+	/*
+	 * If any of p's children has a different mm and is eligible for kill,
+	 * the one with the highest oom_badness() score is sacrificed for its
+	 * parent.  This attempts to lose the minimal amount of work done while
+	 * still freeing memory.
+	 */
+	read_lock(&tasklist_lock);
+	for_each_thread(p, t) {
+		list_for_each_entry(child, &t->children, sibling) {
+			unsigned int child_points;
+
+			if (process_shares_mm(child, p->mm))
+				continue;
+			/*
+			 * oom_badness() returns 0 if the thread is unkillable
+			 */
+			child_points = oom_badness(child,
+				oc->memcg, oc->nodemask, oc->totalpages);
+			if (child_points > victim_points) {
+				put_task_struct(victim);
+				victim = child;
+				victim_points = child_points;
+				get_task_struct(victim);
+			}
+		}
+	}
+	read_unlock(&tasklist_lock);
+
+	__oom_kill_process(victim);
+}
+
 /*
  * Determines whether the kernel must panic because of the panic_on_oom sysctl.
  */
-- 
2.13.6

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [v10 2/6] mm: implement mem_cgroup_scan_tasks() for the root memory cgroup
  2017-10-04 15:46 [v10 0/6] cgroup-aware OOM killer Roman Gushchin
  2017-10-04 15:46 ` [v10 1/6] mm, oom: refactor the oom_kill_process() function Roman Gushchin
@ 2017-10-04 15:46 ` Roman Gushchin
  2017-10-04 19:15   ` Johannes Weiner
  2017-10-04 20:10   ` David Rientjes
  2017-10-04 15:46 ` [v10 3/6] mm, oom: cgroup-aware OOM killer Roman Gushchin
                   ` (3 subsequent siblings)
  5 siblings, 2 replies; 43+ messages in thread
From: Roman Gushchin @ 2017-10-04 15:46 UTC (permalink / raw)
  To: linux-mm
  Cc: Roman Gushchin, Vladimir Davydov, Johannes Weiner, Tetsuo Handa,
	David Rientjes, Andrew Morton, Tejun Heo, kernel-team, cgroups,
	linux-doc, linux-kernel

Implement mem_cgroup_scan_tasks() functionality for the root
memory cgroup to use this function for looking for a OOM victim
task in the root memory cgroup by the cgroup-ware OOM killer.

The root memory cgroup is treated as a leaf cgroup, so only tasks
which are directly belonging to the root cgroup are iterated over.

This patch doesn't introduce any functional change as
mem_cgroup_scan_tasks() is never called for the root memcg.
This is preparatory work for the cgroup-aware OOM killer,
which will use this function to iterate over tasks belonging
to the root memcg.

Signed-off-by: Roman Gushchin <guro@fb.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: David Rientjes <rientjes@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: kernel-team@fb.com
Cc: cgroups@vger.kernel.org
Cc: linux-doc@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
---
 mm/memcontrol.c | 7 +++----
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index d5f3a62887cf..b4de17a78dc1 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -917,7 +917,8 @@ static void invalidate_reclaim_iterators(struct mem_cgroup *dead_memcg)
  * value, the function breaks the iteration loop and returns the value.
  * Otherwise, it will iterate over all tasks and return 0.
  *
- * This function must not be called for the root memory cgroup.
+ * If memcg is the root memory cgroup, this function will iterate only
+ * over tasks belonging directly to the root memory cgroup.
  */
 int mem_cgroup_scan_tasks(struct mem_cgroup *memcg,
 			  int (*fn)(struct task_struct *, void *), void *arg)
@@ -925,8 +926,6 @@ int mem_cgroup_scan_tasks(struct mem_cgroup *memcg,
 	struct mem_cgroup *iter;
 	int ret = 0;
 
-	BUG_ON(memcg == root_mem_cgroup);
-
 	for_each_mem_cgroup_tree(iter, memcg) {
 		struct css_task_iter it;
 		struct task_struct *task;
@@ -935,7 +934,7 @@ int mem_cgroup_scan_tasks(struct mem_cgroup *memcg,
 		while (!ret && (task = css_task_iter_next(&it)))
 			ret = fn(task, arg);
 		css_task_iter_end(&it);
-		if (ret) {
+		if (ret || memcg == root_mem_cgroup) {
 			mem_cgroup_iter_break(memcg, iter);
 			break;
 		}
-- 
2.13.6

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [v10 3/6] mm, oom: cgroup-aware OOM killer
  2017-10-04 15:46 [v10 0/6] cgroup-aware OOM killer Roman Gushchin
  2017-10-04 15:46 ` [v10 1/6] mm, oom: refactor the oom_kill_process() function Roman Gushchin
  2017-10-04 15:46 ` [v10 2/6] mm: implement mem_cgroup_scan_tasks() for the root memory cgroup Roman Gushchin
@ 2017-10-04 15:46 ` Roman Gushchin
  2017-10-04 19:27   ` Johannes Weiner
                     ` (3 more replies)
  2017-10-04 15:46 ` [v10 4/6] mm, oom: introduce memory.oom_group Roman Gushchin
                   ` (2 subsequent siblings)
  5 siblings, 4 replies; 43+ messages in thread
From: Roman Gushchin @ 2017-10-04 15:46 UTC (permalink / raw)
  To: linux-mm
  Cc: Roman Gushchin, Michal Hocko, Vladimir Davydov, Johannes Weiner,
	Tetsuo Handa, David Rientjes, Andrew Morton, Tejun Heo,
	kernel-team, cgroups, linux-doc, linux-kernel

Traditionally, the OOM killer is operating on a process level.
Under oom conditions, it finds a process with the highest oom score
and kills it.

This behavior doesn't suit well the system with many running
containers:

1) There is no fairness between containers. A small container with
few large processes will be chosen over a large one with huge
number of small processes.

2) Containers often do not expect that some random process inside
will be killed. In many cases much safer behavior is to kill
all tasks in the container. Traditionally, this was implemented
in userspace, but doing it in the kernel has some advantages,
especially in a case of a system-wide OOM.

To address these issues, the cgroup-aware OOM killer is introduced.

Under OOM conditions, it looks for the biggest leaf memory cgroup
and kills the biggest task belonging to it. The following patches
will extend this functionality to consider non-leaf memory cgroups
as well, and also provide an ability to kill all tasks belonging
to the victim cgroup.

The root cgroup is treated as a leaf memory cgroup, so it's score
is compared with leaf memory cgroups.
Due to memcg statistics implementation a special algorithm
is used for estimating it's oom_score: we define it as maximum
oom_score of the belonging tasks.

Signed-off-by: Roman Gushchin <guro@fb.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: David Rientjes <rientjes@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: kernel-team@fb.com
Cc: cgroups@vger.kernel.org
Cc: linux-doc@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
---
 include/linux/memcontrol.h |  17 +++++
 include/linux/oom.h        |  12 +++-
 mm/memcontrol.c            | 172 +++++++++++++++++++++++++++++++++++++++++++++
 mm/oom_kill.c              |  76 +++++++++++++++-----
 4 files changed, 257 insertions(+), 20 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 69966c461d1c..75b63b68846e 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -35,6 +35,7 @@ struct mem_cgroup;
 struct page;
 struct mm_struct;
 struct kmem_cache;
+struct oom_control;
 
 /* Cgroup-specific page state, on top of universal node page state */
 enum memcg_stat_item {
@@ -342,6 +343,11 @@ struct mem_cgroup *mem_cgroup_from_css(struct cgroup_subsys_state *css){
 	return css ? container_of(css, struct mem_cgroup, css) : NULL;
 }
 
+static inline void mem_cgroup_put(struct mem_cgroup *memcg)
+{
+	css_put(&memcg->css);
+}
+
 #define mem_cgroup_from_counter(counter, member)	\
 	container_of(counter, struct mem_cgroup, member)
 
@@ -480,6 +486,8 @@ static inline bool task_in_memcg_oom(struct task_struct *p)
 
 bool mem_cgroup_oom_synchronize(bool wait);
 
+bool mem_cgroup_select_oom_victim(struct oom_control *oc);
+
 #ifdef CONFIG_MEMCG_SWAP
 extern int do_swap_account;
 #endif
@@ -744,6 +752,10 @@ static inline bool task_in_mem_cgroup(struct task_struct *task,
 	return true;
 }
 
+static inline void mem_cgroup_put(struct mem_cgroup *memcg)
+{
+}
+
 static inline struct mem_cgroup *
 mem_cgroup_iter(struct mem_cgroup *root,
 		struct mem_cgroup *prev,
@@ -936,6 +948,11 @@ static inline
 void count_memcg_event_mm(struct mm_struct *mm, enum vm_event_item idx)
 {
 }
+
+static inline bool mem_cgroup_select_oom_victim(struct oom_control *oc)
+{
+	return false;
+}
 #endif /* CONFIG_MEMCG */
 
 /* idx can be of type enum memcg_stat_item or node_stat_item */
diff --git a/include/linux/oom.h b/include/linux/oom.h
index 76aac4ce39bc..ca78e2d5956e 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -9,6 +9,13 @@
 #include <linux/sched/coredump.h> /* MMF_* */
 #include <linux/mm.h> /* VM_FAULT* */
 
+
+/*
+ * Special value returned by victim selection functions to indicate
+ * that are inflight OOM victims.
+ */
+#define INFLIGHT_VICTIM ((void *)-1UL)
+
 struct zonelist;
 struct notifier_block;
 struct mem_cgroup;
@@ -39,7 +46,8 @@ struct oom_control {
 
 	/* Used by oom implementation, do not set */
 	unsigned long totalpages;
-	struct task_struct *chosen;
+	struct task_struct *chosen_task;
+	struct mem_cgroup *chosen_memcg;
 	unsigned long chosen_points;
 };
 
@@ -101,6 +109,8 @@ extern void oom_killer_enable(void);
 
 extern struct task_struct *find_lock_task_mm(struct task_struct *p);
 
+extern int oom_evaluate_task(struct task_struct *task, void *arg);
+
 /* sysctls */
 extern int sysctl_oom_dump_tasks;
 extern int sysctl_oom_kill_allocating_task;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index b4de17a78dc1..79f30c281185 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2670,6 +2670,178 @@ static inline bool memcg_has_children(struct mem_cgroup *memcg)
 	return ret;
 }
 
+static long memcg_oom_badness(struct mem_cgroup *memcg,
+			      const nodemask_t *nodemask,
+			      unsigned long totalpages)
+{
+	long points = 0;
+	int nid;
+	pg_data_t *pgdat;
+
+	/*
+	 * We don't have necessary stats for the root memcg,
+	 * so we define it's oom_score as the maximum oom_score
+	 * of the belonging tasks.
+	 *
+	 * As tasks in the root memcg unlikely are parts of a
+	 * single workload, and we don't have to implement
+	 * group killing, this approximation is reasonable.
+	 *
+	 * But if we will have necessary stats for the root memcg,
+	 * we might switch to the approach which is used for all
+	 * other memcgs.
+	 */
+	if (memcg == root_mem_cgroup) {
+		struct css_task_iter it;
+		struct task_struct *task;
+		long score, max_score = 0;
+
+		css_task_iter_start(&memcg->css, 0, &it);
+		while ((task = css_task_iter_next(&it))) {
+			score = oom_badness(task, memcg, nodemask,
+					    totalpages);
+			if (score > max_score)
+				max_score = score;
+		}
+		css_task_iter_end(&it);
+
+		return max_score;
+	}
+
+	for_each_node_state(nid, N_MEMORY) {
+		if (nodemask && !node_isset(nid, *nodemask))
+			continue;
+
+		points += mem_cgroup_node_nr_lru_pages(memcg, nid,
+				LRU_ALL_ANON | BIT(LRU_UNEVICTABLE));
+
+		pgdat = NODE_DATA(nid);
+		points += lruvec_page_state(mem_cgroup_lruvec(pgdat, memcg),
+					    NR_SLAB_UNRECLAIMABLE);
+	}
+
+	points += memcg_page_state(memcg, MEMCG_KERNEL_STACK_KB) /
+		(PAGE_SIZE / 1024);
+	points += memcg_page_state(memcg, MEMCG_SOCK);
+	points += memcg_page_state(memcg, MEMCG_SWAP);
+
+	return points;
+}
+
+/*
+ * Checks if the given memcg is a valid OOM victim and returns a number,
+ * which means the folowing:
+ *   -1: there are inflight OOM victim tasks, belonging to the memcg
+ *    0: memcg is not eligible, e.g. all belonging tasks are protected
+ *       by oom_score_adj set to OOM_SCORE_ADJ_MIN
+ *   >0: memcg is eligible, and the returned value is an estimation
+ *       of the memory footprint
+ */
+static long oom_evaluate_memcg(struct mem_cgroup *memcg,
+			       const nodemask_t *nodemask,
+			       unsigned long totalpages)
+{
+	struct css_task_iter it;
+	struct task_struct *task;
+	int eligible = 0;
+
+	/*
+	 * Memcg is OOM eligible if there are OOM killable tasks inside.
+	 *
+	 * We treat tasks with oom_score_adj set to OOM_SCORE_ADJ_MIN
+	 * as unkillable.
+	 *
+	 * If there are inflight OOM victim tasks inside the memcg,
+	 * we return -1.
+	 */
+	css_task_iter_start(&memcg->css, 0, &it);
+	while ((task = css_task_iter_next(&it))) {
+		if (!eligible &&
+		    task->signal->oom_score_adj != OOM_SCORE_ADJ_MIN)
+			eligible = 1;
+
+		if (tsk_is_oom_victim(task) &&
+		    !test_bit(MMF_OOM_SKIP, &task->signal->oom_mm->flags)) {
+			eligible = -1;
+			break;
+		}
+	}
+	css_task_iter_end(&it);
+
+	if (eligible <= 0)
+		return eligible;
+
+	return memcg_oom_badness(memcg, nodemask, totalpages);
+}
+
+static void select_victim_memcg(struct mem_cgroup *root, struct oom_control *oc)
+{
+	struct mem_cgroup *iter;
+
+	oc->chosen_memcg = NULL;
+	oc->chosen_points = 0;
+
+	/*
+	 * The oom_score is calculated for leaf memory cgroups (including
+	 * the root memcg).
+	 */
+	rcu_read_lock();
+	for_each_mem_cgroup_tree(iter, root) {
+		long score;
+
+		if (memcg_has_children(iter))
+			continue;
+
+		score = oom_evaluate_memcg(iter, oc->nodemask, oc->totalpages);
+
+		/*
+		 * Ignore empty and non-eligible memory cgroups.
+		 */
+		if (score == 0)
+			continue;
+
+		/*
+		 * If there are inflight OOM victims, we don't need
+		 * to look further for new victims.
+		 */
+		if (score == -1) {
+			oc->chosen_memcg = INFLIGHT_VICTIM;
+			mem_cgroup_iter_break(root, iter);
+			break;
+		}
+
+		if (score > oc->chosen_points) {
+			oc->chosen_points = score;
+			oc->chosen_memcg = iter;
+		}
+	}
+
+	if (oc->chosen_memcg && oc->chosen_memcg != INFLIGHT_VICTIM)
+		css_get(&oc->chosen_memcg->css);
+
+	rcu_read_unlock();
+}
+
+bool mem_cgroup_select_oom_victim(struct oom_control *oc)
+{
+	struct mem_cgroup *root;
+
+	if (mem_cgroup_disabled())
+		return false;
+
+	if (!cgroup_subsys_on_dfl(memory_cgrp_subsys))
+		return false;
+
+	if (oc->memcg)
+		root = oc->memcg;
+	else
+		root = root_mem_cgroup;
+
+	select_victim_memcg(root, oc);
+
+	return oc->chosen_memcg;
+}
+
 /*
  * Reclaims as many pages from the given memcg as possible.
  *
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 1e7b8a27e6cc..63e38cf971e0 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -290,7 +290,7 @@ static enum oom_constraint constrained_alloc(struct oom_control *oc)
 	return CONSTRAINT_NONE;
 }
 
-static int oom_evaluate_task(struct task_struct *task, void *arg)
+int oom_evaluate_task(struct task_struct *task, void *arg)
 {
 	struct oom_control *oc = arg;
 	unsigned long points;
@@ -324,26 +324,26 @@ static int oom_evaluate_task(struct task_struct *task, void *arg)
 		goto next;
 
 	/* Prefer thread group leaders for display purposes */
-	if (points == oc->chosen_points && thread_group_leader(oc->chosen))
+	if (points == oc->chosen_points && thread_group_leader(oc->chosen_task))
 		goto next;
 select:
-	if (oc->chosen)
-		put_task_struct(oc->chosen);
+	if (oc->chosen_task)
+		put_task_struct(oc->chosen_task);
 	get_task_struct(task);
-	oc->chosen = task;
+	oc->chosen_task = task;
 	oc->chosen_points = points;
 next:
 	return 0;
 abort:
-	if (oc->chosen)
-		put_task_struct(oc->chosen);
-	oc->chosen = (void *)-1UL;
+	if (oc->chosen_task)
+		put_task_struct(oc->chosen_task);
+	oc->chosen_task = INFLIGHT_VICTIM;
 	return 1;
 }
 
 /*
  * Simple selection loop. We choose the process with the highest number of
- * 'points'. In case scan was aborted, oc->chosen is set to -1.
+ * 'points'. In case scan was aborted, oc->chosen_task is set to -1.
  */
 static void select_bad_process(struct oom_control *oc)
 {
@@ -828,6 +828,12 @@ static void __oom_kill_process(struct task_struct *victim)
 	struct mm_struct *mm;
 	bool can_oom_reap = true;
 
+	if (is_global_init(victim) || (victim->flags & PF_KTHREAD) ||
+	    victim->signal->oom_score_adj == OOM_SCORE_ADJ_MIN) {
+		put_task_struct(victim);
+		return;
+	}
+
 	p = find_lock_task_mm(victim);
 	if (!p) {
 		put_task_struct(victim);
@@ -903,7 +909,7 @@ static void __oom_kill_process(struct task_struct *victim)
 
 static void oom_kill_process(struct oom_control *oc, const char *message)
 {
-	struct task_struct *p = oc->chosen;
+	struct task_struct *p = oc->chosen_task;
 	unsigned int points = oc->chosen_points;
 	struct task_struct *victim = p;
 	struct task_struct *child;
@@ -964,6 +970,27 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
 	__oom_kill_process(victim);
 }
 
+static bool oom_kill_memcg_victim(struct oom_control *oc)
+{
+
+	if (oc->chosen_memcg == NULL || oc->chosen_memcg == INFLIGHT_VICTIM)
+		return oc->chosen_memcg;
+
+	/* Kill a task in the chosen memcg with the biggest memory footprint */
+	oc->chosen_points = 0;
+	oc->chosen_task = NULL;
+	mem_cgroup_scan_tasks(oc->chosen_memcg, oom_evaluate_task, oc);
+
+	if (oc->chosen_task == NULL || oc->chosen_task == INFLIGHT_VICTIM)
+		goto out;
+
+	__oom_kill_process(oc->chosen_task);
+
+out:
+	mem_cgroup_put(oc->chosen_memcg);
+	return oc->chosen_task;
+}
+
 /*
  * Determines whether the kernel must panic because of the panic_on_oom sysctl.
  */
@@ -1016,6 +1043,7 @@ bool out_of_memory(struct oom_control *oc)
 {
 	unsigned long freed = 0;
 	enum oom_constraint constraint = CONSTRAINT_NONE;
+	bool delay = false; /* if set, delay next allocation attempt */
 
 	if (oom_killer_disabled)
 		return false;
@@ -1060,27 +1088,37 @@ bool out_of_memory(struct oom_control *oc)
 	    current->mm && !oom_unkillable_task(current, NULL, oc->nodemask) &&
 	    current->signal->oom_score_adj != OOM_SCORE_ADJ_MIN) {
 		get_task_struct(current);
-		oc->chosen = current;
+		oc->chosen_task = current;
 		oom_kill_process(oc, "Out of memory (oom_kill_allocating_task)");
 		return true;
 	}
 
+	if (mem_cgroup_select_oom_victim(oc) && oom_kill_memcg_victim(oc)) {
+		delay = true;
+		goto out;
+	}
+
 	select_bad_process(oc);
 	/* Found nothing?!?! Either we hang forever, or we panic. */
-	if (!oc->chosen && !is_sysrq_oom(oc) && !is_memcg_oom(oc)) {
+	if (!oc->chosen_task && !is_sysrq_oom(oc) && !is_memcg_oom(oc)) {
 		dump_header(oc, NULL);
 		panic("Out of memory and no killable processes...\n");
 	}
-	if (oc->chosen && oc->chosen != (void *)-1UL) {
+	if (oc->chosen_task && oc->chosen_task != INFLIGHT_VICTIM) {
 		oom_kill_process(oc, !is_memcg_oom(oc) ? "Out of memory" :
 				 "Memory cgroup out of memory");
-		/*
-		 * Give the killed process a good chance to exit before trying
-		 * to allocate memory again.
-		 */
-		schedule_timeout_killable(1);
+		delay = true;
 	}
-	return !!oc->chosen;
+
+out:
+	/*
+	 * Give the killed process a good chance to exit before trying
+	 * to allocate memory again.
+	 */
+	if (delay)
+		schedule_timeout_killable(1);
+
+	return !!oc->chosen_task;
 }
 
 /*
-- 
2.13.6

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [v10 4/6] mm, oom: introduce memory.oom_group
  2017-10-04 15:46 [v10 0/6] cgroup-aware OOM killer Roman Gushchin
                   ` (2 preceding siblings ...)
  2017-10-04 15:46 ` [v10 3/6] mm, oom: cgroup-aware OOM killer Roman Gushchin
@ 2017-10-04 15:46 ` Roman Gushchin
  2017-10-04 19:37   ` Johannes Weiner
  2017-10-05 12:06   ` Michal Hocko
  2017-10-04 15:46 ` [v10 5/6] mm, oom: add cgroup v2 mount option for cgroup-aware OOM killer Roman Gushchin
  2017-10-04 15:46 ` [v10 6/6] mm, oom, docs: describe the " Roman Gushchin
  5 siblings, 2 replies; 43+ messages in thread
From: Roman Gushchin @ 2017-10-04 15:46 UTC (permalink / raw)
  To: linux-mm
  Cc: Roman Gushchin, Michal Hocko, Vladimir Davydov, Johannes Weiner,
	Tetsuo Handa, David Rientjes, Andrew Morton, Tejun Heo,
	kernel-team, cgroups, linux-doc, linux-kernel

The cgroup-aware OOM killer treats leaf memory cgroups as memory
consumption entities and performs the victim selection by comparing
them based on their memory footprint. Then it kills the biggest task
inside the selected memory cgroup.

But there are workloads, which are not tolerant to a such behavior.
Killing a random task may leave the workload in a broken state.

To solve this problem, memory.oom_group knob is introduced.
It will define, whether a memory group should be treated as an
indivisible memory consumer, compared by total memory consumption
with other memory consumers (leaf memory cgroups and other memory
cgroups with memory.oom_group set), and whether all belonging tasks
should be killed if the cgroup is selected.

If set on memcg A, it means that in case of system-wide OOM or
memcg-wide OOM scoped to A or any ancestor cgroup, all tasks,
belonging to the sub-tree of A will be killed. If OOM event is
scoped to a descendant cgroup (A/B, for example), only tasks in
that cgroup can be affected. OOM killer will never touch any tasks
outside of the scope of the OOM event.

Also, tasks with oom_score_adj set to -1000 will not be killed.

The default value is 0.

Signed-off-by: Roman Gushchin <guro@fb.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: David Rientjes <rientjes@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: kernel-team@fb.com
Cc: cgroups@vger.kernel.org
Cc: linux-doc@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
---
 include/linux/memcontrol.h | 17 +++++++++++
 mm/memcontrol.c            | 74 +++++++++++++++++++++++++++++++++++++++++++---
 mm/oom_kill.c              | 38 +++++++++++++++++-------
 3 files changed, 115 insertions(+), 14 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 75b63b68846e..84ac10d7e67d 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -200,6 +200,13 @@ struct mem_cgroup {
 	/* OOM-Killer disable */
 	int		oom_kill_disable;
 
+	/*
+	 * Treat the sub-tree as an indivisible memory consumer,
+	 * kill all belonging tasks if the memory cgroup selected
+	 * as OOM victim.
+	 */
+	bool oom_group;
+
 	/* handle for "memory.events" */
 	struct cgroup_file events_file;
 
@@ -488,6 +495,11 @@ bool mem_cgroup_oom_synchronize(bool wait);
 
 bool mem_cgroup_select_oom_victim(struct oom_control *oc);
 
+static inline bool mem_cgroup_oom_group(struct mem_cgroup *memcg)
+{
+	return memcg->oom_group;
+}
+
 #ifdef CONFIG_MEMCG_SWAP
 extern int do_swap_account;
 #endif
@@ -953,6 +965,11 @@ static inline bool mem_cgroup_select_oom_victim(struct oom_control *oc)
 {
 	return false;
 }
+
+static inline bool mem_cgroup_oom_group(struct mem_cgroup *memcg)
+{
+	return false;
+}
 #endif /* CONFIG_MEMCG */
 
 /* idx can be of type enum memcg_stat_item or node_stat_item */
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 79f30c281185..1fcd6cc353d5 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2776,19 +2776,50 @@ static long oom_evaluate_memcg(struct mem_cgroup *memcg,
 
 static void select_victim_memcg(struct mem_cgroup *root, struct oom_control *oc)
 {
-	struct mem_cgroup *iter;
+	struct mem_cgroup *iter, *group = NULL;
+	long group_score = 0;
 
 	oc->chosen_memcg = NULL;
 	oc->chosen_points = 0;
 
 	/*
+	 * If OOM is memcg-wide, and the memcg has the oom_group flag set,
+	 * all tasks belonging to the memcg should be killed.
+	 * So, we mark the memcg as a victim.
+	 */
+	if (oc->memcg && mem_cgroup_oom_group(oc->memcg)) {
+		oc->chosen_memcg = oc->memcg;
+		css_get(&oc->chosen_memcg->css);
+		return;
+	}
+
+	/*
 	 * The oom_score is calculated for leaf memory cgroups (including
 	 * the root memcg).
+	 * Non-leaf oom_group cgroups accumulating score of descendant
+	 * leaf memory cgroups.
 	 */
 	rcu_read_lock();
 	for_each_mem_cgroup_tree(iter, root) {
 		long score;
 
+		/*
+		 * We don't consider non-leaf non-oom_group memory cgroups
+		 * as OOM victims.
+		 */
+		if (memcg_has_children(iter) && !mem_cgroup_oom_group(iter))
+			continue;
+
+		/*
+		 * If group is not set or we've ran out of the group's sub-tree,
+		 * we should set group and reset group_score.
+		 */
+		if (!group || group == root_mem_cgroup ||
+		    !mem_cgroup_is_descendant(iter, group)) {
+			group = iter;
+			group_score = 0;
+		}
+
 		if (memcg_has_children(iter))
 			continue;
 
@@ -2810,9 +2841,11 @@ static void select_victim_memcg(struct mem_cgroup *root, struct oom_control *oc)
 			break;
 		}
 
-		if (score > oc->chosen_points) {
-			oc->chosen_points = score;
-			oc->chosen_memcg = iter;
+		group_score += score;
+
+		if (group_score > oc->chosen_points) {
+			oc->chosen_points = group_score;
+			oc->chosen_memcg = group;
 		}
 	}
 
@@ -5439,6 +5472,33 @@ static ssize_t memory_max_write(struct kernfs_open_file *of,
 	return nbytes;
 }
 
+static int memory_oom_group_show(struct seq_file *m, void *v)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m));
+	bool oom_group = memcg->oom_group;
+
+	seq_printf(m, "%d\n", oom_group);
+
+	return 0;
+}
+
+static ssize_t memory_oom_group_write(struct kernfs_open_file *of,
+				      char *buf, size_t nbytes,
+				      loff_t off)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
+	int oom_group;
+	int err;
+
+	err = kstrtoint(strstrip(buf), 0, &oom_group);
+	if (err)
+		return err;
+
+	memcg->oom_group = oom_group;
+
+	return nbytes;
+}
+
 static int memory_events_show(struct seq_file *m, void *v)
 {
 	struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m));
@@ -5559,6 +5619,12 @@ static struct cftype memory_files[] = {
 		.write = memory_max_write,
 	},
 	{
+		.name = "oom_group",
+		.flags = CFTYPE_NOT_ON_ROOT | CFTYPE_NS_DELEGATABLE,
+		.seq_show = memory_oom_group_show,
+		.write = memory_oom_group_write,
+	},
+	{
 		.name = "events",
 		.flags = CFTYPE_NOT_ON_ROOT,
 		.file_offset = offsetof(struct mem_cgroup, events_file),
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 63e38cf971e0..5bdb8cae3acc 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -970,21 +970,39 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
 	__oom_kill_process(victim);
 }
 
-static bool oom_kill_memcg_victim(struct oom_control *oc)
+static int oom_kill_memcg_member(struct task_struct *task, void *unused)
 {
+	get_task_struct(task);
+	__oom_kill_process(task);
+	return 0;
+}
 
+static bool oom_kill_memcg_victim(struct oom_control *oc)
+{
 	if (oc->chosen_memcg == NULL || oc->chosen_memcg == INFLIGHT_VICTIM)
 		return oc->chosen_memcg;
 
-	/* Kill a task in the chosen memcg with the biggest memory footprint */
-	oc->chosen_points = 0;
-	oc->chosen_task = NULL;
-	mem_cgroup_scan_tasks(oc->chosen_memcg, oom_evaluate_task, oc);
-
-	if (oc->chosen_task == NULL || oc->chosen_task == INFLIGHT_VICTIM)
-		goto out;
-
-	__oom_kill_process(oc->chosen_task);
+	/*
+	 * If memory.oom_group is set, kill all tasks belonging to the sub-tree
+	 * of the chosen memory cgroup, otherwise kill the task with the biggest
+	 * memory footprint.
+	 */
+	if (mem_cgroup_oom_group(oc->chosen_memcg)) {
+		mem_cgroup_scan_tasks(oc->chosen_memcg, oom_kill_memcg_member,
+				      NULL);
+		/* We have one or more terminating processes at this point. */
+		oc->chosen_task = INFLIGHT_VICTIM;
+	} else {
+		oc->chosen_points = 0;
+		oc->chosen_task = NULL;
+		mem_cgroup_scan_tasks(oc->chosen_memcg, oom_evaluate_task, oc);
+
+		if (oc->chosen_task == NULL ||
+		    oc->chosen_task == INFLIGHT_VICTIM)
+			goto out;
+
+		__oom_kill_process(oc->chosen_task);
+	}
 
 out:
 	mem_cgroup_put(oc->chosen_memcg);
-- 
2.13.6

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [v10 5/6] mm, oom: add cgroup v2 mount option for cgroup-aware OOM killer
  2017-10-04 15:46 [v10 0/6] cgroup-aware OOM killer Roman Gushchin
                   ` (3 preceding siblings ...)
  2017-10-04 15:46 ` [v10 4/6] mm, oom: introduce memory.oom_group Roman Gushchin
@ 2017-10-04 15:46 ` Roman Gushchin
  2017-10-04 20:04   ` Johannes Weiner
  2017-10-04 15:46 ` [v10 6/6] mm, oom, docs: describe the " Roman Gushchin
  5 siblings, 1 reply; 43+ messages in thread
From: Roman Gushchin @ 2017-10-04 15:46 UTC (permalink / raw)
  To: linux-mm
  Cc: Roman Gushchin, Michal Hocko, Vladimir Davydov, Johannes Weiner,
	Tetsuo Handa, David Rientjes, Andrew Morton, Tejun Heo,
	kernel-team, cgroups, linux-doc, linux-kernel

Add a "groupoom" cgroup v2 mount option to enable the cgroup-aware
OOM killer. If not set, the OOM selection is performed in
a "traditional" per-process way.

The behavior can be changed dynamically by remounting the cgroupfs.

Signed-off-by: Roman Gushchin <guro@fb.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: David Rientjes <rientjes@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: kernel-team@fb.com
Cc: cgroups@vger.kernel.org
Cc: linux-doc@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
---
 include/linux/cgroup-defs.h |  5 +++++
 kernel/cgroup/cgroup.c      | 10 ++++++++++
 mm/memcontrol.c             |  3 +++
 3 files changed, 18 insertions(+)

diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index 3e55bbd31ad1..cae5343a8b21 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -80,6 +80,11 @@ enum {
 	 * Enable cpuset controller in v1 cgroup to use v2 behavior.
 	 */
 	CGRP_ROOT_CPUSET_V2_MODE = (1 << 4),
+
+	/*
+	 * Enable cgroup-aware OOM killer.
+	 */
+	CGRP_GROUP_OOM = (1 << 5),
 };
 
 /* cftype->flags */
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index c3421ee0d230..8d8aa46ff930 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -1709,6 +1709,9 @@ static int parse_cgroup_root_flags(char *data, unsigned int *root_flags)
 		if (!strcmp(token, "nsdelegate")) {
 			*root_flags |= CGRP_ROOT_NS_DELEGATE;
 			continue;
+		} else if (!strcmp(token, "groupoom")) {
+			*root_flags |= CGRP_GROUP_OOM;
+			continue;
 		}
 
 		pr_err("cgroup2: unknown option \"%s\"\n", token);
@@ -1725,6 +1728,11 @@ static void apply_cgroup_root_flags(unsigned int root_flags)
 			cgrp_dfl_root.flags |= CGRP_ROOT_NS_DELEGATE;
 		else
 			cgrp_dfl_root.flags &= ~CGRP_ROOT_NS_DELEGATE;
+
+		if (root_flags & CGRP_GROUP_OOM)
+			cgrp_dfl_root.flags |= CGRP_GROUP_OOM;
+		else
+			cgrp_dfl_root.flags &= ~CGRP_GROUP_OOM;
 	}
 }
 
@@ -1732,6 +1740,8 @@ static int cgroup_show_options(struct seq_file *seq, struct kernfs_root *kf_root
 {
 	if (cgrp_dfl_root.flags & CGRP_ROOT_NS_DELEGATE)
 		seq_puts(seq, ",nsdelegate");
+	if (cgrp_dfl_root.flags & CGRP_GROUP_OOM)
+		seq_puts(seq, ",groupoom");
 	return 0;
 }
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 1fcd6cc353d5..2e82625bd354 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2865,6 +2865,9 @@ bool mem_cgroup_select_oom_victim(struct oom_control *oc)
 	if (!cgroup_subsys_on_dfl(memory_cgrp_subsys))
 		return false;
 
+	if (!(cgrp_dfl_root.flags & CGRP_GROUP_OOM))
+		return false;
+
 	if (oc->memcg)
 		root = oc->memcg;
 	else
-- 
2.13.6

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [v10 6/6] mm, oom, docs: describe the cgroup-aware OOM killer
  2017-10-04 15:46 [v10 0/6] cgroup-aware OOM killer Roman Gushchin
                   ` (4 preceding siblings ...)
  2017-10-04 15:46 ` [v10 5/6] mm, oom: add cgroup v2 mount option for cgroup-aware OOM killer Roman Gushchin
@ 2017-10-04 15:46 ` Roman Gushchin
  2017-10-04 20:08   ` Johannes Weiner
  5 siblings, 1 reply; 43+ messages in thread
From: Roman Gushchin @ 2017-10-04 15:46 UTC (permalink / raw)
  To: linux-mm
  Cc: Roman Gushchin, Michal Hocko, Vladimir Davydov, Johannes Weiner,
	Tetsuo Handa, Andrew Morton, David Rientjes, Tejun Heo,
	kernel-team, cgroups, linux-doc, linux-kernel

Document the cgroup-aware OOM killer.

Signed-off-by: Roman Gushchin <guro@fb.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: kernel-team@fb.com
Cc: cgroups@vger.kernel.org
Cc: linux-doc@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
---
 Documentation/cgroup-v2.txt | 51 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 51 insertions(+)

diff --git a/Documentation/cgroup-v2.txt b/Documentation/cgroup-v2.txt
index 3f8216912df0..28429e62b0ea 100644
--- a/Documentation/cgroup-v2.txt
+++ b/Documentation/cgroup-v2.txt
@@ -48,6 +48,7 @@ v1 is available under Documentation/cgroup-v1/.
        5-2-1. Memory Interface Files
        5-2-2. Usage Guidelines
        5-2-3. Memory Ownership
+       5-2-4. OOM Killer
      5-3. IO
        5-3-1. IO Interface Files
        5-3-2. Writeback
@@ -1043,6 +1044,28 @@ PAGE_SIZE multiple when read back.
 	high limit is used and monitored properly, this limit's
 	utility is limited to providing the final safety net.
 
+  memory.oom_group
+
+	A read-write single value file which exists on non-root
+	cgroups.  The default is "0".
+
+	If set, OOM killer will consider the memory cgroup as an
+	indivisible memory consumers and compare it with other memory
+	consumers by it's memory footprint.
+	If such memory cgroup is selected as an OOM victim, all
+	processes belonging to it or it's descendants will be killed.
+
+	This applies to system-wide OOM conditions and reaching
+	the hard memory limit of the cgroup and their ancestor.
+	If OOM condition happens in a descendant cgroup with it's own
+	memory limit, the memory cgroup can't be considered
+	as an OOM victim, and OOM killer will not kill all belonging
+	tasks.
+
+	Also, OOM killer respects the /proc/pid/oom_score_adj value -1000,
+	and will never kill the unkillable task, even if memory.oom_group
+	is set.
+
   memory.events
 	A read-only flat-keyed file which exists on non-root cgroups.
 	The following entries are defined.  Unless specified
@@ -1246,6 +1269,34 @@ to be accessed repeatedly by other cgroups, it may make sense to use
 POSIX_FADV_DONTNEED to relinquish the ownership of memory areas
 belonging to the affected files to ensure correct memory ownership.
 
+OOM Killer
+~~~~~~~~~~
+
+Cgroup v2 memory controller implements a cgroup-aware OOM killer.
+It means that it treats cgroups as first class OOM entities.
+
+Under OOM conditions the memory controller tries to make the best
+choice of a victim, looking for a memory cgroup with the largest
+memory footprint, considering leaf cgroups and cgroups with the
+memory.oom_group option set, which are considered to be an indivisible
+memory consumers.
+
+By default, OOM killer will kill the biggest task in the selected
+memory cgroup. A user can change this behavior by enabling
+the per-cgroup memory.oom_group option. If set, it causes
+the OOM killer to kill all processes attached to the cgroup,
+except processes with oom_score_adj set to -1000.
+
+This affects both system- and cgroup-wide OOMs. For a cgroup-wide OOM
+the memory controller considers only cgroups belonging to the sub-tree
+of the OOM'ing cgroup.
+
+The root cgroup is treated as a leaf memory cgroup, so it's compared
+with other leaf memory cgroups and cgroups with oom_group option set.
+
+If there are no cgroups with the enabled memory controller,
+the OOM killer is using the "traditional" process-based approach.
+
 
 IO
 --
-- 
2.13.6

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 43+ messages in thread

* Re: [v10 1/6] mm, oom: refactor the oom_kill_process() function
  2017-10-04 15:46 ` [v10 1/6] mm, oom: refactor the oom_kill_process() function Roman Gushchin
@ 2017-10-04 19:14   ` Johannes Weiner
  0 siblings, 0 replies; 43+ messages in thread
From: Johannes Weiner @ 2017-10-04 19:14 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: linux-mm, Vladimir Davydov, Tetsuo Handa, David Rientjes,
	Andrew Morton, Tejun Heo, kernel-team, cgroups, linux-doc,
	linux-kernel

On Wed, Oct 04, 2017 at 04:46:33PM +0100, Roman Gushchin wrote:
> The oom_kill_process() function consists of two logical parts:
> the first one is responsible for considering task's children as
> a potential victim and printing the debug information.
> The second half is responsible for sending SIGKILL to all
> tasks sharing the mm struct with the given victim.
> 
> This commit splits the oom_kill_process() function with
> an intention to re-use the the second half: __oom_kill_process().
> 
> The cgroup-aware OOM killer will kill multiple tasks
> belonging to the victim cgroup. We don't need to print
> the debug information for the each task, as well as play
> with task selection (considering task's children),
> so we can't use the existing oom_kill_process().
> 
> Signed-off-by: Roman Gushchin <guro@fb.com>
> Acked-by: Michal Hocko <mhocko@kernel.org>
> Acked-by: David Rientjes <rientjes@google.com>
> Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> Cc: David Rientjes <rientjes@google.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Tejun Heo <tj@kernel.org>
> Cc: kernel-team@fb.com
> Cc: cgroups@vger.kernel.org
> Cc: linux-doc@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org
> Cc: linux-mm@kvack.org

Acked-by: Johannes Weiner <hannes@cmpxchg.org>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [v10 2/6] mm: implement mem_cgroup_scan_tasks() for the root memory cgroup
  2017-10-04 15:46 ` [v10 2/6] mm: implement mem_cgroup_scan_tasks() for the root memory cgroup Roman Gushchin
@ 2017-10-04 19:15   ` Johannes Weiner
  2017-10-04 20:10   ` David Rientjes
  1 sibling, 0 replies; 43+ messages in thread
From: Johannes Weiner @ 2017-10-04 19:15 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: linux-mm, Vladimir Davydov, Tetsuo Handa, David Rientjes,
	Andrew Morton, Tejun Heo, kernel-team, cgroups, linux-doc,
	linux-kernel

On Wed, Oct 04, 2017 at 04:46:34PM +0100, Roman Gushchin wrote:
> Implement mem_cgroup_scan_tasks() functionality for the root
> memory cgroup to use this function for looking for a OOM victim
> task in the root memory cgroup by the cgroup-ware OOM killer.
> 
> The root memory cgroup is treated as a leaf cgroup, so only tasks
> which are directly belonging to the root cgroup are iterated over.
> 
> This patch doesn't introduce any functional change as
> mem_cgroup_scan_tasks() is never called for the root memcg.
> This is preparatory work for the cgroup-aware OOM killer,
> which will use this function to iterate over tasks belonging
> to the root memcg.
> 
> Signed-off-by: Roman Gushchin <guro@fb.com>
> Acked-by: Michal Hocko <mhocko@suse.com>
> Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> Cc: David Rientjes <rientjes@google.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Tejun Heo <tj@kernel.org>
> Cc: kernel-team@fb.com
> Cc: cgroups@vger.kernel.org
> Cc: linux-doc@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org
> Cc: linux-mm@kvack.org

Acked-by: Johannes Weiner <hannes@cmpxchg.org>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [v10 3/6] mm, oom: cgroup-aware OOM killer
  2017-10-04 15:46 ` [v10 3/6] mm, oom: cgroup-aware OOM killer Roman Gushchin
@ 2017-10-04 19:27   ` Johannes Weiner
  2017-10-04 19:51     ` Roman Gushchin
  2017-10-04 19:48   ` Shakeel Butt
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 43+ messages in thread
From: Johannes Weiner @ 2017-10-04 19:27 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: linux-mm, Michal Hocko, Vladimir Davydov, Tetsuo Handa,
	David Rientjes, Andrew Morton, Tejun Heo, kernel-team, cgroups,
	linux-doc, linux-kernel

On Wed, Oct 04, 2017 at 04:46:35PM +0100, Roman Gushchin wrote:
> Traditionally, the OOM killer is operating on a process level.
> Under oom conditions, it finds a process with the highest oom score
> and kills it.
> 
> This behavior doesn't suit well the system with many running
> containers:
> 
> 1) There is no fairness between containers. A small container with
> few large processes will be chosen over a large one with huge
> number of small processes.
> 
> 2) Containers often do not expect that some random process inside
> will be killed. In many cases much safer behavior is to kill
> all tasks in the container. Traditionally, this was implemented
> in userspace, but doing it in the kernel has some advantages,
> especially in a case of a system-wide OOM.
> 
> To address these issues, the cgroup-aware OOM killer is introduced.
> 
> Under OOM conditions, it looks for the biggest leaf memory cgroup
> and kills the biggest task belonging to it. The following patches
> will extend this functionality to consider non-leaf memory cgroups
> as well, and also provide an ability to kill all tasks belonging
> to the victim cgroup.
> 
> The root cgroup is treated as a leaf memory cgroup, so it's score
> is compared with leaf memory cgroups.
> Due to memcg statistics implementation a special algorithm
> is used for estimating it's oom_score: we define it as maximum
> oom_score of the belonging tasks.
> 
> Signed-off-by: Roman Gushchin <guro@fb.com>
> Cc: Michal Hocko <mhocko@kernel.org>
> Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> Cc: David Rientjes <rientjes@google.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Tejun Heo <tj@kernel.org>
> Cc: kernel-team@fb.com
> Cc: cgroups@vger.kernel.org
> Cc: linux-doc@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org
> Cc: linux-mm@kvack.org

This looks good to me.

Acked-by: Johannes Weiner <hannes@cmpxchg.org>

I just have one question:

> @@ -828,6 +828,12 @@ static void __oom_kill_process(struct task_struct *victim)
>  	struct mm_struct *mm;
>  	bool can_oom_reap = true;
>  
> +	if (is_global_init(victim) || (victim->flags & PF_KTHREAD) ||
> +	    victim->signal->oom_score_adj == OOM_SCORE_ADJ_MIN) {
> +		put_task_struct(victim);
> +		return;
> +	}
> +
>  	p = find_lock_task_mm(victim);
>  	if (!p) {
>  		put_task_struct(victim);

Is this necessary? The callers of this function use oom_badness() to
find a victim, and that filters init, kthread, OOM_SCORE_ADJ_MIN.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [v10 4/6] mm, oom: introduce memory.oom_group
  2017-10-04 15:46 ` [v10 4/6] mm, oom: introduce memory.oom_group Roman Gushchin
@ 2017-10-04 19:37   ` Johannes Weiner
  2017-10-05 12:06   ` Michal Hocko
  1 sibling, 0 replies; 43+ messages in thread
From: Johannes Weiner @ 2017-10-04 19:37 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: linux-mm, Michal Hocko, Vladimir Davydov, Tetsuo Handa,
	David Rientjes, Andrew Morton, Tejun Heo, kernel-team, cgroups,
	linux-doc, linux-kernel

On Wed, Oct 04, 2017 at 04:46:36PM +0100, Roman Gushchin wrote:
> The cgroup-aware OOM killer treats leaf memory cgroups as memory
> consumption entities and performs the victim selection by comparing
> them based on their memory footprint. Then it kills the biggest task
> inside the selected memory cgroup.
> 
> But there are workloads, which are not tolerant to a such behavior.
> Killing a random task may leave the workload in a broken state.
> 
> To solve this problem, memory.oom_group knob is introduced.
> It will define, whether a memory group should be treated as an
> indivisible memory consumer, compared by total memory consumption
> with other memory consumers (leaf memory cgroups and other memory
> cgroups with memory.oom_group set), and whether all belonging tasks
> should be killed if the cgroup is selected.
> 
> If set on memcg A, it means that in case of system-wide OOM or
> memcg-wide OOM scoped to A or any ancestor cgroup, all tasks,
> belonging to the sub-tree of A will be killed. If OOM event is
> scoped to a descendant cgroup (A/B, for example), only tasks in
> that cgroup can be affected. OOM killer will never touch any tasks
> outside of the scope of the OOM event.
> 
> Also, tasks with oom_score_adj set to -1000 will not be killed.
> 
> The default value is 0.
> 
> Signed-off-by: Roman Gushchin <guro@fb.com>
> Cc: Michal Hocko <mhocko@kernel.org>
> Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> Cc: David Rientjes <rientjes@google.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Tejun Heo <tj@kernel.org>
> Cc: kernel-team@fb.com
> Cc: cgroups@vger.kernel.org
> Cc: linux-doc@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org
> Cc: linux-mm@kvack.org

Those semantics make sense to me and the code looks good.

Acked-by: Johannes Weiner <hannes@cmpxchg.org>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [v10 3/6] mm, oom: cgroup-aware OOM killer
  2017-10-04 15:46 ` [v10 3/6] mm, oom: cgroup-aware OOM killer Roman Gushchin
  2017-10-04 19:27   ` Johannes Weiner
@ 2017-10-04 19:48   ` Shakeel Butt
  2017-10-04 20:15     ` Roman Gushchin
  2017-10-04 20:27   ` David Rientjes
  2017-10-05 11:40   ` Michal Hocko
  3 siblings, 1 reply; 43+ messages in thread
From: Shakeel Butt @ 2017-10-04 19:48 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Linux MM, Michal Hocko, Vladimir Davydov, Johannes Weiner,
	Tetsuo Handa, David Rientjes, Andrew Morton, Tejun Heo,
	kernel-team, Cgroups, linux-doc, LKML

> +
> +static void select_victim_memcg(struct mem_cgroup *root, struct oom_control *oc)
> +{
> +       struct mem_cgroup *iter;
> +
> +       oc->chosen_memcg = NULL;
> +       oc->chosen_points = 0;
> +
> +       /*
> +        * The oom_score is calculated for leaf memory cgroups (including
> +        * the root memcg).
> +        */
> +       rcu_read_lock();
> +       for_each_mem_cgroup_tree(iter, root) {
> +               long score;
> +
> +               if (memcg_has_children(iter))
> +                       continue;

&& iter != root_mem_cgroup ?

> +
> +               score = oom_evaluate_memcg(iter, oc->nodemask, oc->totalpages);
> +
> +               /*
> +                * Ignore empty and non-eligible memory cgroups.
> +                */
> +               if (score == 0)
> +                       continue;
> +
> +               /*
> +                * If there are inflight OOM victims, we don't need
> +                * to look further for new victims.
> +                */
> +               if (score == -1) {
> +                       oc->chosen_memcg = INFLIGHT_VICTIM;
> +                       mem_cgroup_iter_break(root, iter);
> +                       break;
> +               }
> +

Shouldn't there be a CSS_ONLINE check? Also instead of css_get at the
end why not css_tryget_online() here and css_put for the previous
selected one.

> +               if (score > oc->chosen_points) {
> +                       oc->chosen_points = score;
> +                       oc->chosen_memcg = iter;
> +               }
> +       }
> +
> +       if (oc->chosen_memcg && oc->chosen_memcg != INFLIGHT_VICTIM)
> +               css_get(&oc->chosen_memcg->css);
> +
> +       rcu_read_unlock();
> +}
> +

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [v10 3/6] mm, oom: cgroup-aware OOM killer
  2017-10-04 19:27   ` Johannes Weiner
@ 2017-10-04 19:51     ` Roman Gushchin
  2017-10-04 20:17       ` David Rientjes
  0 siblings, 1 reply; 43+ messages in thread
From: Roman Gushchin @ 2017-10-04 19:51 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: linux-mm, Michal Hocko, Vladimir Davydov, Tetsuo Handa,
	David Rientjes, Andrew Morton, Tejun Heo, kernel-team, cgroups,
	linux-doc, linux-kernel

On Wed, Oct 04, 2017 at 03:27:20PM -0400, Johannes Weiner wrote:
> On Wed, Oct 04, 2017 at 04:46:35PM +0100, Roman Gushchin wrote:
> > Traditionally, the OOM killer is operating on a process level.
> > Under oom conditions, it finds a process with the highest oom score
> > and kills it.
> > 
> > This behavior doesn't suit well the system with many running
> > containers:
> > 
> > 1) There is no fairness between containers. A small container with
> > few large processes will be chosen over a large one with huge
> > number of small processes.
> > 
> > 2) Containers often do not expect that some random process inside
> > will be killed. In many cases much safer behavior is to kill
> > all tasks in the container. Traditionally, this was implemented
> > in userspace, but doing it in the kernel has some advantages,
> > especially in a case of a system-wide OOM.
> > 
> > To address these issues, the cgroup-aware OOM killer is introduced.
> > 
> > Under OOM conditions, it looks for the biggest leaf memory cgroup
> > and kills the biggest task belonging to it. The following patches
> > will extend this functionality to consider non-leaf memory cgroups
> > as well, and also provide an ability to kill all tasks belonging
> > to the victim cgroup.
> > 
> > The root cgroup is treated as a leaf memory cgroup, so it's score
> > is compared with leaf memory cgroups.
> > Due to memcg statistics implementation a special algorithm
> > is used for estimating it's oom_score: we define it as maximum
> > oom_score of the belonging tasks.
> > 
> > Signed-off-by: Roman Gushchin <guro@fb.com>
> > Cc: Michal Hocko <mhocko@kernel.org>
> > Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
> > Cc: Johannes Weiner <hannes@cmpxchg.org>
> > Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> > Cc: David Rientjes <rientjes@google.com>
> > Cc: Andrew Morton <akpm@linux-foundation.org>
> > Cc: Tejun Heo <tj@kernel.org>
> > Cc: kernel-team@fb.com
> > Cc: cgroups@vger.kernel.org
> > Cc: linux-doc@vger.kernel.org
> > Cc: linux-kernel@vger.kernel.org
> > Cc: linux-mm@kvack.org
> 
> This looks good to me.
> 
> Acked-by: Johannes Weiner <hannes@cmpxchg.org>
> 
> I just have one question:
> 
> > @@ -828,6 +828,12 @@ static void __oom_kill_process(struct task_struct *victim)
> >  	struct mm_struct *mm;
> >  	bool can_oom_reap = true;
> >  
> > +	if (is_global_init(victim) || (victim->flags & PF_KTHREAD) ||
> > +	    victim->signal->oom_score_adj == OOM_SCORE_ADJ_MIN) {
> > +		put_task_struct(victim);
> > +		return;
> > +	}
> > +
> >  	p = find_lock_task_mm(victim);
> >  	if (!p) {
> >  		put_task_struct(victim);
> 
> Is this necessary? The callers of this function use oom_badness() to
> find a victim, and that filters init, kthread, OOM_SCORE_ADJ_MIN.

It is. __oom_kill_process() is used to kill all processes belonging
to the selected memory cgroup, so we should perform these checks
to avoid killing unkillable processes.

Thanks!

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [v10 5/6] mm, oom: add cgroup v2 mount option for cgroup-aware OOM killer
  2017-10-04 15:46 ` [v10 5/6] mm, oom: add cgroup v2 mount option for cgroup-aware OOM killer Roman Gushchin
@ 2017-10-04 20:04   ` Johannes Weiner
  2017-10-05 13:14     ` Michal Hocko
  0 siblings, 1 reply; 43+ messages in thread
From: Johannes Weiner @ 2017-10-04 20:04 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: linux-mm, Michal Hocko, Vladimir Davydov, Tetsuo Handa,
	David Rientjes, Andrew Morton, Tejun Heo, kernel-team, cgroups,
	linux-doc, linux-kernel

On Wed, Oct 04, 2017 at 04:46:37PM +0100, Roman Gushchin wrote:
> Add a "groupoom" cgroup v2 mount option to enable the cgroup-aware
> OOM killer. If not set, the OOM selection is performed in
> a "traditional" per-process way.
> 
> The behavior can be changed dynamically by remounting the cgroupfs.
>
> Signed-off-by: Roman Gushchin <guro@fb.com>
> Cc: Michal Hocko <mhocko@kernel.org>
> Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> Cc: David Rientjes <rientjes@google.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Tejun Heo <tj@kernel.org>
> Cc: kernel-team@fb.com
> Cc: cgroups@vger.kernel.org
> Cc: linux-doc@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org
> Cc: linux-mm@kvack.org
> ---
>  include/linux/cgroup-defs.h |  5 +++++
>  kernel/cgroup/cgroup.c      | 10 ++++++++++
>  mm/memcontrol.c             |  3 +++
>  3 files changed, 18 insertions(+)
> 
> diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
> index 3e55bbd31ad1..cae5343a8b21 100644
> --- a/include/linux/cgroup-defs.h
> +++ b/include/linux/cgroup-defs.h
> @@ -80,6 +80,11 @@ enum {
>  	 * Enable cpuset controller in v1 cgroup to use v2 behavior.
>  	 */
>  	CGRP_ROOT_CPUSET_V2_MODE = (1 << 4),
> +
> +	/*
> +	 * Enable cgroup-aware OOM killer.
> +	 */
> +	CGRP_GROUP_OOM = (1 << 5),
>  };
>  
>  /* cftype->flags */
> diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
> index c3421ee0d230..8d8aa46ff930 100644
> --- a/kernel/cgroup/cgroup.c
> +++ b/kernel/cgroup/cgroup.c
> @@ -1709,6 +1709,9 @@ static int parse_cgroup_root_flags(char *data, unsigned int *root_flags)
>  		if (!strcmp(token, "nsdelegate")) {
>  			*root_flags |= CGRP_ROOT_NS_DELEGATE;
>  			continue;
> +		} else if (!strcmp(token, "groupoom")) {
> +			*root_flags |= CGRP_GROUP_OOM;
> +			continue;
>  		}
>  
>  		pr_err("cgroup2: unknown option \"%s\"\n", token);
> @@ -1725,6 +1728,11 @@ static void apply_cgroup_root_flags(unsigned int root_flags)
>  			cgrp_dfl_root.flags |= CGRP_ROOT_NS_DELEGATE;
>  		else
>  			cgrp_dfl_root.flags &= ~CGRP_ROOT_NS_DELEGATE;
> +
> +		if (root_flags & CGRP_GROUP_OOM)
> +			cgrp_dfl_root.flags |= CGRP_GROUP_OOM;
> +		else
> +			cgrp_dfl_root.flags &= ~CGRP_GROUP_OOM;
>  	}
>  }
>  
> @@ -1732,6 +1740,8 @@ static int cgroup_show_options(struct seq_file *seq, struct kernfs_root *kf_root
>  {
>  	if (cgrp_dfl_root.flags & CGRP_ROOT_NS_DELEGATE)
>  		seq_puts(seq, ",nsdelegate");
> +	if (cgrp_dfl_root.flags & CGRP_GROUP_OOM)
> +		seq_puts(seq, ",groupoom");
>  	return 0;
>  }
>  
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 1fcd6cc353d5..2e82625bd354 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2865,6 +2865,9 @@ bool mem_cgroup_select_oom_victim(struct oom_control *oc)
>  	if (!cgroup_subsys_on_dfl(memory_cgrp_subsys))
>  		return false;
>  
> +	if (!(cgrp_dfl_root.flags & CGRP_GROUP_OOM))
> +		return false;

That will silently ignore what the user writes to the memory.oom_group
control files across the system's cgroup tree.

We'll have a knob that lets the workload declare itself an indivisible
memory consumer, that it would like to get killed in one piece, and
it's silently ignored because of a mount option they forgot to pass.

That's not good from an interface perspective.

On the other hand, the only benefit of this patch is to shield users
from changes to the OOM killing heuristics. Yet, it's really hard to
imagine that modifying the victim selection process slightly could be
called a regression in any way. We have done that many times over,
without a second thought on backwards compatibility:

5e9d834a0e0c oom: sacrifice child with highest badness score for parent
a63d83f427fb oom: badness heuristic rewrite
778c14affaf9 mm, oom: base root bonus on current usage

Let's not make the userspace interface crap because of some misguided
idea that the OOM heuristic is a hard promise to userspace. It's never
been, and nobody has complained about changes in the past.

This case is doubly silly, as the behavior change only applies to
cgroup2, which doesn't exactly have a large base of legacy users yet.

Let's just drop this 5/6 patch.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [v10 6/6] mm, oom, docs: describe the cgroup-aware OOM killer
  2017-10-04 15:46 ` [v10 6/6] mm, oom, docs: describe the " Roman Gushchin
@ 2017-10-04 20:08   ` Johannes Weiner
  0 siblings, 0 replies; 43+ messages in thread
From: Johannes Weiner @ 2017-10-04 20:08 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: linux-mm, Michal Hocko, Vladimir Davydov, Tetsuo Handa,
	Andrew Morton, David Rientjes, Tejun Heo, kernel-team, cgroups,
	linux-doc, linux-kernel

On Wed, Oct 04, 2017 at 04:46:38PM +0100, Roman Gushchin wrote:
> Document the cgroup-aware OOM killer.
> 
> Signed-off-by: Roman Gushchin <guro@fb.com>
> Cc: Michal Hocko <mhocko@kernel.org>
> Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: David Rientjes <rientjes@google.com>
> Cc: Tejun Heo <tj@kernel.org>
> Cc: kernel-team@fb.com
> Cc: cgroups@vger.kernel.org
> Cc: linux-doc@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org
> Cc: linux-mm@kvack.org

Acked-by: Johannes Weiner <hannes@cmpxchg.org>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [v10 2/6] mm: implement mem_cgroup_scan_tasks() for the root memory cgroup
  2017-10-04 15:46 ` [v10 2/6] mm: implement mem_cgroup_scan_tasks() for the root memory cgroup Roman Gushchin
  2017-10-04 19:15   ` Johannes Weiner
@ 2017-10-04 20:10   ` David Rientjes
  1 sibling, 0 replies; 43+ messages in thread
From: David Rientjes @ 2017-10-04 20:10 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: linux-mm, Vladimir Davydov, Johannes Weiner, Tetsuo Handa,
	Andrew Morton, Tejun Heo, kernel-team, cgroups, linux-doc,
	linux-kernel

On Wed, 4 Oct 2017, Roman Gushchin wrote:

> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index d5f3a62887cf..b4de17a78dc1 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -917,7 +917,8 @@ static void invalidate_reclaim_iterators(struct mem_cgroup *dead_memcg)
>   * value, the function breaks the iteration loop and returns the value.
>   * Otherwise, it will iterate over all tasks and return 0.
>   *
> - * This function must not be called for the root memory cgroup.
> + * If memcg is the root memory cgroup, this function will iterate only
> + * over tasks belonging directly to the root memory cgroup.
>   */
>  int mem_cgroup_scan_tasks(struct mem_cgroup *memcg,
>  			  int (*fn)(struct task_struct *, void *), void *arg)
> @@ -925,8 +926,6 @@ int mem_cgroup_scan_tasks(struct mem_cgroup *memcg,
>  	struct mem_cgroup *iter;
>  	int ret = 0;
>  
> -	BUG_ON(memcg == root_mem_cgroup);
> -
>  	for_each_mem_cgroup_tree(iter, memcg) {
>  		struct css_task_iter it;
>  		struct task_struct *task;
> @@ -935,7 +934,7 @@ int mem_cgroup_scan_tasks(struct mem_cgroup *memcg,
>  		while (!ret && (task = css_task_iter_next(&it)))
>  			ret = fn(task, arg);
>  		css_task_iter_end(&it);
> -		if (ret) {
> +		if (ret || memcg == root_mem_cgroup) {
>  			mem_cgroup_iter_break(memcg, iter);
>  			break;
>  		}

I think this is fine, but a little strange to start an iteration that 
never loops :)  No objection to the patch but it could also be extracted 
into a new mem_cgroup_scan_tasks() which actually scans the tasks in that 
mem cgroup and then a wrapper that does the iteration that calls into it, 
say, mem_cgroup_scan_tasks_tree().

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [v10 3/6] mm, oom: cgroup-aware OOM killer
  2017-10-04 19:48   ` Shakeel Butt
@ 2017-10-04 20:15     ` Roman Gushchin
  2017-10-04 21:24       ` Shakeel Butt
  0 siblings, 1 reply; 43+ messages in thread
From: Roman Gushchin @ 2017-10-04 20:15 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Linux MM, Michal Hocko, Vladimir Davydov, Johannes Weiner,
	Tetsuo Handa, David Rientjes, Andrew Morton, Tejun Heo,
	kernel-team, Cgroups, linux-doc, LKML

On Wed, Oct 04, 2017 at 12:48:03PM -0700, Shakeel Butt wrote:
> > +
> > +static void select_victim_memcg(struct mem_cgroup *root, struct oom_control *oc)
> > +{
> > +       struct mem_cgroup *iter;
> > +
> > +       oc->chosen_memcg = NULL;
> > +       oc->chosen_points = 0;
> > +
> > +       /*
> > +        * The oom_score is calculated for leaf memory cgroups (including
> > +        * the root memcg).
> > +        */
> > +       rcu_read_lock();
> > +       for_each_mem_cgroup_tree(iter, root) {
> > +               long score;
> > +
> > +               if (memcg_has_children(iter))
> > +                       continue;
> 
> && iter != root_mem_cgroup ?

Oh, sure. I had a stupid bug in my test script, which prevented me from
catching this. Thanks!

This should fix the problem.
--
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 2e82625bd354..b3848bce4c86 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2807,7 +2807,8 @@ static void select_victim_memcg(struct mem_cgroup *root, struct oom_control *oc)
 		 * We don't consider non-leaf non-oom_group memory cgroups
 		 * as OOM victims.
 		 */
-		if (memcg_has_children(iter) && !mem_cgroup_oom_group(iter))
+		if (memcg_has_children(iter) && iter != root_mem_cgroup &&
+		    !mem_cgroup_oom_group(iter))
 			continue;
 
 		/*
@@ -2820,7 +2821,7 @@ static void select_victim_memcg(struct mem_cgroup *root, struct oom_control *oc)
 			group_score = 0;
 		}
 
-		if (memcg_has_children(iter))
+		if (memcg_has_children(iter) && iter != root_mem_cgroup)
 			continue;
 
 		score = oom_evaluate_memcg(iter, oc->nodemask, oc->totalpages);

--

> 
> > +
> > +               score = oom_evaluate_memcg(iter, oc->nodemask, oc->totalpages);
> > +
> > +               /*
> > +                * Ignore empty and non-eligible memory cgroups.
> > +                */
> > +               if (score == 0)
> > +                       continue;
> > +
> > +               /*
> > +                * If there are inflight OOM victims, we don't need
> > +                * to look further for new victims.
> > +                */
> > +               if (score == -1) {
> > +                       oc->chosen_memcg = INFLIGHT_VICTIM;
> > +                       mem_cgroup_iter_break(root, iter);
> > +                       break;
> > +               }
> > +
> 
> Shouldn't there be a CSS_ONLINE check? Also instead of css_get at the
> end why not css_tryget_online() here and css_put for the previous
> selected one.

Hm, why do we need to check this? I do not see, how we can choose
an OFFLINE memcg as a victim, tbh. Please, explain the problem.

Thank you!

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 43+ messages in thread

* Re: [v10 3/6] mm, oom: cgroup-aware OOM killer
  2017-10-04 19:51     ` Roman Gushchin
@ 2017-10-04 20:17       ` David Rientjes
  2017-10-04 20:22         ` Roman Gushchin
  2017-10-04 20:31         ` Johannes Weiner
  0 siblings, 2 replies; 43+ messages in thread
From: David Rientjes @ 2017-10-04 20:17 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Johannes Weiner, linux-mm, Michal Hocko, Vladimir Davydov,
	Tetsuo Handa, Andrew Morton, Tejun Heo, kernel-team, cgroups,
	linux-doc, linux-kernel

On Wed, 4 Oct 2017, Roman Gushchin wrote:

> > > @@ -828,6 +828,12 @@ static void __oom_kill_process(struct task_struct *victim)
> > >  	struct mm_struct *mm;
> > >  	bool can_oom_reap = true;
> > >  
> > > +	if (is_global_init(victim) || (victim->flags & PF_KTHREAD) ||
> > > +	    victim->signal->oom_score_adj == OOM_SCORE_ADJ_MIN) {
> > > +		put_task_struct(victim);
> > > +		return;
> > > +	}
> > > +
> > >  	p = find_lock_task_mm(victim);
> > >  	if (!p) {
> > >  		put_task_struct(victim);
> > 
> > Is this necessary? The callers of this function use oom_badness() to
> > find a victim, and that filters init, kthread, OOM_SCORE_ADJ_MIN.
> 
> It is. __oom_kill_process() is used to kill all processes belonging
> to the selected memory cgroup, so we should perform these checks
> to avoid killing unkillable processes.
> 

That's only true after the next patch in the series which uses the 
oom_kill_memcg_member() callback to kill processes for oom_group, correct?  
Would it be possible to move this check to that patch so it's more 
obvious?

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [v10 3/6] mm, oom: cgroup-aware OOM killer
  2017-10-04 20:17       ` David Rientjes
@ 2017-10-04 20:22         ` Roman Gushchin
  2017-10-04 20:31         ` Johannes Weiner
  1 sibling, 0 replies; 43+ messages in thread
From: Roman Gushchin @ 2017-10-04 20:22 UTC (permalink / raw)
  To: David Rientjes
  Cc: Johannes Weiner, linux-mm, Michal Hocko, Vladimir Davydov,
	Tetsuo Handa, Andrew Morton, Tejun Heo, kernel-team, cgroups,
	linux-doc, linux-kernel

On Wed, Oct 04, 2017 at 01:17:14PM -0700, David Rientjes wrote:
> On Wed, 4 Oct 2017, Roman Gushchin wrote:
> 
> > > > @@ -828,6 +828,12 @@ static void __oom_kill_process(struct task_struct *victim)
> > > >  	struct mm_struct *mm;
> > > >  	bool can_oom_reap = true;
> > > >  
> > > > +	if (is_global_init(victim) || (victim->flags & PF_KTHREAD) ||
> > > > +	    victim->signal->oom_score_adj == OOM_SCORE_ADJ_MIN) {
> > > > +		put_task_struct(victim);
> > > > +		return;
> > > > +	}
> > > > +
> > > >  	p = find_lock_task_mm(victim);
> > > >  	if (!p) {
> > > >  		put_task_struct(victim);
> > > 
> > > Is this necessary? The callers of this function use oom_badness() to
> > > find a victim, and that filters init, kthread, OOM_SCORE_ADJ_MIN.
> > 
> > It is. __oom_kill_process() is used to kill all processes belonging
> > to the selected memory cgroup, so we should perform these checks
> > to avoid killing unkillable processes.
> > 
> 
> That's only true after the next patch in the series which uses the 
> oom_kill_memcg_member() callback to kill processes for oom_group, correct?  
> Would it be possible to move this check to that patch so it's more 
> obvious?

Sure, no problems.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [v10 3/6] mm, oom: cgroup-aware OOM killer
  2017-10-04 15:46 ` [v10 3/6] mm, oom: cgroup-aware OOM killer Roman Gushchin
  2017-10-04 19:27   ` Johannes Weiner
  2017-10-04 19:48   ` Shakeel Butt
@ 2017-10-04 20:27   ` David Rientjes
  2017-10-04 20:41     ` Johannes Weiner
  2017-10-05 11:40   ` Michal Hocko
  3 siblings, 1 reply; 43+ messages in thread
From: David Rientjes @ 2017-10-04 20:27 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: linux-mm, Michal Hocko, Vladimir Davydov, Johannes Weiner,
	Tetsuo Handa, Andrew Morton, Tejun Heo, kernel-team, cgroups,
	linux-doc, linux-kernel

On Wed, 4 Oct 2017, Roman Gushchin wrote:

> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index b4de17a78dc1..79f30c281185 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2670,6 +2670,178 @@ static inline bool memcg_has_children(struct mem_cgroup *memcg)
>  	return ret;
>  }
>  
> +static long memcg_oom_badness(struct mem_cgroup *memcg,
> +			      const nodemask_t *nodemask,
> +			      unsigned long totalpages)
> +{
> +	long points = 0;
> +	int nid;
> +	pg_data_t *pgdat;
> +
> +	/*
> +	 * We don't have necessary stats for the root memcg,
> +	 * so we define it's oom_score as the maximum oom_score
> +	 * of the belonging tasks.
> +	 *
> +	 * As tasks in the root memcg unlikely are parts of a
> +	 * single workload, and we don't have to implement
> +	 * group killing, this approximation is reasonable.
> +	 *
> +	 * But if we will have necessary stats for the root memcg,
> +	 * we might switch to the approach which is used for all
> +	 * other memcgs.
> +	 */
> +	if (memcg == root_mem_cgroup) {
> +		struct css_task_iter it;
> +		struct task_struct *task;
> +		long score, max_score = 0;
> +
> +		css_task_iter_start(&memcg->css, 0, &it);
> +		while ((task = css_task_iter_next(&it))) {
> +			score = oom_badness(task, memcg, nodemask,
> +					    totalpages);
> +			if (score > max_score)
> +				max_score = score;
> +		}
> +		css_task_iter_end(&it);
> +
> +		return max_score;
> +	}
> +
> +	for_each_node_state(nid, N_MEMORY) {
> +		if (nodemask && !node_isset(nid, *nodemask))
> +			continue;
> +
> +		points += mem_cgroup_node_nr_lru_pages(memcg, nid,
> +				LRU_ALL_ANON | BIT(LRU_UNEVICTABLE));
> +
> +		pgdat = NODE_DATA(nid);
> +		points += lruvec_page_state(mem_cgroup_lruvec(pgdat, memcg),
> +					    NR_SLAB_UNRECLAIMABLE);
> +	}
> +
> +	points += memcg_page_state(memcg, MEMCG_KERNEL_STACK_KB) /
> +		(PAGE_SIZE / 1024);
> +	points += memcg_page_state(memcg, MEMCG_SOCK);
> +	points += memcg_page_state(memcg, MEMCG_SWAP);
> +
> +	return points;
> +}
> +
> +/*
> + * Checks if the given memcg is a valid OOM victim and returns a number,
> + * which means the folowing:
> + *   -1: there are inflight OOM victim tasks, belonging to the memcg
> + *    0: memcg is not eligible, e.g. all belonging tasks are protected
> + *       by oom_score_adj set to OOM_SCORE_ADJ_MIN
> + *   >0: memcg is eligible, and the returned value is an estimation
> + *       of the memory footprint
> + */
> +static long oom_evaluate_memcg(struct mem_cgroup *memcg,
> +			       const nodemask_t *nodemask,
> +			       unsigned long totalpages)
> +{
> +	struct css_task_iter it;
> +	struct task_struct *task;
> +	int eligible = 0;
> +
> +	/*
> +	 * Memcg is OOM eligible if there are OOM killable tasks inside.
> +	 *
> +	 * We treat tasks with oom_score_adj set to OOM_SCORE_ADJ_MIN
> +	 * as unkillable.
> +	 *
> +	 * If there are inflight OOM victim tasks inside the memcg,
> +	 * we return -1.
> +	 */
> +	css_task_iter_start(&memcg->css, 0, &it);
> +	while ((task = css_task_iter_next(&it))) {
> +		if (!eligible &&
> +		    task->signal->oom_score_adj != OOM_SCORE_ADJ_MIN)
> +			eligible = 1;
> +
> +		if (tsk_is_oom_victim(task) &&
> +		    !test_bit(MMF_OOM_SKIP, &task->signal->oom_mm->flags)) {
> +			eligible = -1;
> +			break;
> +		}
> +	}
> +	css_task_iter_end(&it);
> +
> +	if (eligible <= 0)
> +		return eligible;
> +
> +	return memcg_oom_badness(memcg, nodemask, totalpages);
> +}
> +
> +static void select_victim_memcg(struct mem_cgroup *root, struct oom_control *oc)
> +{
> +	struct mem_cgroup *iter;
> +
> +	oc->chosen_memcg = NULL;
> +	oc->chosen_points = 0;
> +
> +	/*
> +	 * The oom_score is calculated for leaf memory cgroups (including
> +	 * the root memcg).
> +	 */
> +	rcu_read_lock();
> +	for_each_mem_cgroup_tree(iter, root) {
> +		long score;
> +
> +		if (memcg_has_children(iter))
> +			continue;
> +
> +		score = oom_evaluate_memcg(iter, oc->nodemask, oc->totalpages);
> +
> +		/*
> +		 * Ignore empty and non-eligible memory cgroups.
> +		 */
> +		if (score == 0)
> +			continue;
> +
> +		/*
> +		 * If there are inflight OOM victims, we don't need
> +		 * to look further for new victims.
> +		 */
> +		if (score == -1) {
> +			oc->chosen_memcg = INFLIGHT_VICTIM;
> +			mem_cgroup_iter_break(root, iter);
> +			break;
> +		}
> +
> +		if (score > oc->chosen_points) {
> +			oc->chosen_points = score;
> +			oc->chosen_memcg = iter;
> +		}
> +	}
> +
> +	if (oc->chosen_memcg && oc->chosen_memcg != INFLIGHT_VICTIM)
> +		css_get(&oc->chosen_memcg->css);
> +
> +	rcu_read_unlock();
> +}
> +

By only considering leaf memcgs, does this penalize users if their memcg 
becomes oc->chosen_memcg purely because it has aggregated all of its 
processes to be members of that memcg, which would otherwise be the 
standard behavior?

What prevents me from spreading my memcg with N processes attached over N 
child memcgs instead so that memcg_oom_badness() becomes very small for 
each child memcg specifically to avoid being oom killed?

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [v10 3/6] mm, oom: cgroup-aware OOM killer
  2017-10-04 20:17       ` David Rientjes
  2017-10-04 20:22         ` Roman Gushchin
@ 2017-10-04 20:31         ` Johannes Weiner
  2017-10-05 11:14           ` Michal Hocko
  1 sibling, 1 reply; 43+ messages in thread
From: Johannes Weiner @ 2017-10-04 20:31 UTC (permalink / raw)
  To: David Rientjes
  Cc: Roman Gushchin, linux-mm, Michal Hocko, Vladimir Davydov,
	Tetsuo Handa, Andrew Morton, Tejun Heo, kernel-team, cgroups,
	linux-doc, linux-kernel

On Wed, Oct 04, 2017 at 01:17:14PM -0700, David Rientjes wrote:
> On Wed, 4 Oct 2017, Roman Gushchin wrote:
> 
> > > > @@ -828,6 +828,12 @@ static void __oom_kill_process(struct task_struct *victim)
> > > >  	struct mm_struct *mm;
> > > >  	bool can_oom_reap = true;
> > > >  
> > > > +	if (is_global_init(victim) || (victim->flags & PF_KTHREAD) ||
> > > > +	    victim->signal->oom_score_adj == OOM_SCORE_ADJ_MIN) {
> > > > +		put_task_struct(victim);
> > > > +		return;
> > > > +	}
> > > > +
> > > >  	p = find_lock_task_mm(victim);
> > > >  	if (!p) {
> > > >  		put_task_struct(victim);
> > > 
> > > Is this necessary? The callers of this function use oom_badness() to
> > > find a victim, and that filters init, kthread, OOM_SCORE_ADJ_MIN.
> > 
> > It is. __oom_kill_process() is used to kill all processes belonging
> > to the selected memory cgroup, so we should perform these checks
> > to avoid killing unkillable processes.
> > 
> 
> That's only true after the next patch in the series which uses the 
> oom_kill_memcg_member() callback to kill processes for oom_group, correct?  
> Would it be possible to move this check to that patch so it's more 
> obvious?

Yup, I realized it when reviewing the next patch. Moving this hunk to
the next patch would probably make sense. Although, us reviewers have
been made aware of this now, so I don't feel strongly about it. Won't
make much of a difference once the patches are merged.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [v10 3/6] mm, oom: cgroup-aware OOM killer
  2017-10-04 20:27   ` David Rientjes
@ 2017-10-04 20:41     ` Johannes Weiner
  2017-10-05  8:40       ` David Rientjes
  0 siblings, 1 reply; 43+ messages in thread
From: Johannes Weiner @ 2017-10-04 20:41 UTC (permalink / raw)
  To: David Rientjes
  Cc: Roman Gushchin, linux-mm, Michal Hocko, Vladimir Davydov,
	Tetsuo Handa, Andrew Morton, Tejun Heo, kernel-team, cgroups,
	linux-doc, linux-kernel

On Wed, Oct 04, 2017 at 01:27:14PM -0700, David Rientjes wrote:
> By only considering leaf memcgs, does this penalize users if their memcg 
> becomes oc->chosen_memcg purely because it has aggregated all of its 
> processes to be members of that memcg, which would otherwise be the 
> standard behavior?
> 
> What prevents me from spreading my memcg with N processes attached over N 
> child memcgs instead so that memcg_oom_badness() becomes very small for 
> each child memcg specifically to avoid being oom killed?

It's no different from forking out multiple mm to avoid being the
biggest process.

It's up to the parent to enforce limits on that group and prevent you
from being able to cause global OOM in the first place, in particular
if you delegate to untrusted and potentially malicious users.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [v10 3/6] mm, oom: cgroup-aware OOM killer
  2017-10-04 20:15     ` Roman Gushchin
@ 2017-10-04 21:24       ` Shakeel Butt
  2017-10-05 10:27         ` Roman Gushchin
  0 siblings, 1 reply; 43+ messages in thread
From: Shakeel Butt @ 2017-10-04 21:24 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Linux MM, Michal Hocko, Vladimir Davydov, Johannes Weiner,
	Tetsuo Handa, David Rientjes, Andrew Morton, Tejun Heo,
	kernel-team, Cgroups, linux-doc, LKML

>> > +               if (memcg_has_children(iter))
>> > +                       continue;
>>
>> && iter != root_mem_cgroup ?
>
> Oh, sure. I had a stupid bug in my test script, which prevented me from
> catching this. Thanks!
>
> This should fix the problem.
> --
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 2e82625bd354..b3848bce4c86 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2807,7 +2807,8 @@ static void select_victim_memcg(struct mem_cgroup *root, struct oom_control *oc)
>                  * We don't consider non-leaf non-oom_group memory cgroups
>                  * as OOM victims.
>                  */
> -               if (memcg_has_children(iter) && !mem_cgroup_oom_group(iter))
> +               if (memcg_has_children(iter) && iter != root_mem_cgroup &&
> +                   !mem_cgroup_oom_group(iter))
>                         continue;

I think you are mixing the 3rd and 4th patch. The root_mem_cgroup
check should be in 3rd while oom_group stuff should be in 4th.


>>
>> Shouldn't there be a CSS_ONLINE check? Also instead of css_get at the
>> end why not css_tryget_online() here and css_put for the previous
>> selected one.
>
> Hm, why do we need to check this? I do not see, how we can choose
> an OFFLINE memcg as a victim, tbh. Please, explain the problem.
>

Sorry about the confusion. There are two things. First, should we do a
css_get on the newly selected memcg within the for loop when we still
have a reference to it?

Second, for the OFFLINE memcg, you are right oom_evaluate_memcg() will
return 0 for offlined memcgs. Maybe no need to call
oom_evaluate_memcg() for offlined memcgs.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [v10 3/6] mm, oom: cgroup-aware OOM killer
  2017-10-04 20:41     ` Johannes Weiner
@ 2017-10-05  8:40       ` David Rientjes
  2017-10-05 10:27         ` Johannes Weiner
  2017-10-05 10:44         ` Roman Gushchin
  0 siblings, 2 replies; 43+ messages in thread
From: David Rientjes @ 2017-10-05  8:40 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Roman Gushchin, linux-mm, Michal Hocko, Vladimir Davydov,
	Tetsuo Handa, Andrew Morton, Tejun Heo, kernel-team, cgroups,
	linux-doc, linux-kernel

On Wed, 4 Oct 2017, Johannes Weiner wrote:

> > By only considering leaf memcgs, does this penalize users if their memcg 
> > becomes oc->chosen_memcg purely because it has aggregated all of its 
> > processes to be members of that memcg, which would otherwise be the 
> > standard behavior?
> > 
> > What prevents me from spreading my memcg with N processes attached over N 
> > child memcgs instead so that memcg_oom_badness() becomes very small for 
> > each child memcg specifically to avoid being oom killed?
> 
> It's no different from forking out multiple mm to avoid being the
> biggest process.
> 

It is, because it can quite clearly be a DoS, and was prevented with 
Roman's earlier design of iterating usage up the hierarchy and comparing 
siblings based on that criteria.  I know exactly why he chose that 
implementation detail early on, and it was to prevent cases such as this 
and to not let userspace hide from the oom killer.

> It's up to the parent to enforce limits on that group and prevent you
> from being able to cause global OOM in the first place, in particular
> if you delegate to untrusted and potentially malicious users.
> 

Let's resolve that global oom is a real condition and getting into that 
situation is not a userspace problem.  It's the result of overcommiting 
the system, and is used in the enterprise to address business goals.  If 
the above is true, and its up to memcg to prevent global oom in the first 
place, then this entire patchset is absolutely pointless.  Limit userspace 
to 95% of memory and when usage is approaching that limit, let userspace 
attached to the root memcg iterate the hierarchy itself and kill from the 
largest consumer.

This patchset exists because overcommit is real, exactly the same as 
overcommit within memcg hierarchies is real.  99% of the time we don't run 
into global oom because people aren't using their limits so it just works 
out.  1% of the time we run into global oom and we need a decision to made 
based for forward progress.  Using Michal's earlier example of admins and 
students, a student can easily use all of his limit and also, with v10 of 
this patchset, 99% of the time avoid being oom killed just by forking N 
processes over N cgroups.  It's going to oom kill an admin every single 
time.

I know exactly why earlier versions of this patchset iterated that usage 
up the tree so you would pick from students, pick from this troublemaking 
student, and then oom kill from his hierarchy.  Roman has made that point 
himself.  My suggestion was to add userspace influence to it so that 
enterprise users and users with business goals can actually define that we 
really do want 80% of memory to be used by this process or this hierarchy, 
it's in our best interest.

Earlier iterations of this patchset did this, and did it correctly.  
Userspace influence over the decisionmaking makes it a very powerful 
combination because you _can_ specify what your goals are or choose to 
leave the priorities as default so you can compare based solely on usage.  
It was a beautiful solution to the problem.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [v10 3/6] mm, oom: cgroup-aware OOM killer
  2017-10-04 21:24       ` Shakeel Butt
@ 2017-10-05 10:27         ` Roman Gushchin
  2017-10-05 11:12           ` Michal Hocko
  0 siblings, 1 reply; 43+ messages in thread
From: Roman Gushchin @ 2017-10-05 10:27 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Linux MM, Michal Hocko, Vladimir Davydov, Johannes Weiner,
	Tetsuo Handa, David Rientjes, Andrew Morton, Tejun Heo,
	kernel-team, Cgroups, linux-doc, LKML

On Wed, Oct 04, 2017 at 02:24:26PM -0700, Shakeel Butt wrote:
> >> > +               if (memcg_has_children(iter))
> >> > +                       continue;
> >>
> >> && iter != root_mem_cgroup ?
> >
> > Oh, sure. I had a stupid bug in my test script, which prevented me from
> > catching this. Thanks!
> >
> > This should fix the problem.
> > --
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index 2e82625bd354..b3848bce4c86 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -2807,7 +2807,8 @@ static void select_victim_memcg(struct mem_cgroup *root, struct oom_control *oc)
> >                  * We don't consider non-leaf non-oom_group memory cgroups
> >                  * as OOM victims.
> >                  */
> > -               if (memcg_has_children(iter) && !mem_cgroup_oom_group(iter))
> > +               if (memcg_has_children(iter) && iter != root_mem_cgroup &&
> > +                   !mem_cgroup_oom_group(iter))
> >                         continue;
> 
> I think you are mixing the 3rd and 4th patch. The root_mem_cgroup
> check should be in 3rd while oom_group stuff should be in 4th.
>

Right. This "patch" should fix them both, it was just confusing to
send two patches. I'll split it before final landing.

> 
> >>
> >> Shouldn't there be a CSS_ONLINE check? Also instead of css_get at the
> >> end why not css_tryget_online() here and css_put for the previous
> >> selected one.
> >
> > Hm, why do we need to check this? I do not see, how we can choose
> > an OFFLINE memcg as a victim, tbh. Please, explain the problem.
> >
> 
> Sorry about the confusion. There are two things. First, should we do a
> css_get on the newly selected memcg within the for loop when we still
> have a reference to it?

We're holding rcu_read_lock, it should be enough. We're bumping css counter
just before releasing rcu lock.

> 
> Second, for the OFFLINE memcg, you are right oom_evaluate_memcg() will
> return 0 for offlined memcgs. Maybe no need to call
> oom_evaluate_memcg() for offlined memcgs.

Sounds like a good optimization, which can be done on top of the current
patchset.

Thank you!

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [v10 3/6] mm, oom: cgroup-aware OOM killer
  2017-10-05  8:40       ` David Rientjes
@ 2017-10-05 10:27         ` Johannes Weiner
  2017-10-05 21:53           ` David Rientjes
  2017-10-05 10:44         ` Roman Gushchin
  1 sibling, 1 reply; 43+ messages in thread
From: Johannes Weiner @ 2017-10-05 10:27 UTC (permalink / raw)
  To: David Rientjes
  Cc: Roman Gushchin, linux-mm, Michal Hocko, Vladimir Davydov,
	Tetsuo Handa, Andrew Morton, Tejun Heo, kernel-team, cgroups,
	linux-doc, linux-kernel

On Thu, Oct 05, 2017 at 01:40:09AM -0700, David Rientjes wrote:
> On Wed, 4 Oct 2017, Johannes Weiner wrote:
> 
> > > By only considering leaf memcgs, does this penalize users if their memcg 
> > > becomes oc->chosen_memcg purely because it has aggregated all of its 
> > > processes to be members of that memcg, which would otherwise be the 
> > > standard behavior?
> > > 
> > > What prevents me from spreading my memcg with N processes attached over N 
> > > child memcgs instead so that memcg_oom_badness() becomes very small for 
> > > each child memcg specifically to avoid being oom killed?
> > 
> > It's no different from forking out multiple mm to avoid being the
> > biggest process.
> > 
> 
> It is, because it can quite clearly be a DoSand was prevented with 
> Roman's earlier design of iterating usage up the hierarchy and comparing 
> siblings based on that criteria.  I know exactly why he chose that 
> implementation detail early on, and it was to prevent cases such as this 
> and to not let userspace hide from the oom killer.

This doesn't address how it's different from a single process
following the same pattern right now.

> > It's up to the parent to enforce limits on that group and prevent you
> > from being able to cause global OOM in the first place, in particular
> > if you delegate to untrusted and potentially malicious users.
> > 
> 
> Let's resolve that global oom is a real condition and getting into that 
> situation is not a userspace problem.  It's the result of overcommiting 
> the system, and is used in the enterprise to address business goals.  If 
> the above is true, and its up to memcg to prevent global oom in the first 
> place, then this entire patchset is absolutely pointless.  Limit userspace 
> to 95% of memory and when usage is approaching that limit, let userspace 
> attached to the root memcg iterate the hierarchy itself and kill from the 
> largest consumer.
> 
> This patchset exists because overcommit is real, exactly the same as 
> overcommit within memcg hierarchies is real.  99% of the time we don't run 
> into global oom because people aren't using their limits so it just works 
> out.  1% of the time we run into global oom and we need a decision to made 
> based for forward progress.  Using Michal's earlier example of admins and 
> students, a student can easily use all of his limit and also, with v10 of 
> this patchset, 99% of the time avoid being oom killed just by forking N 
> processes over N cgroups.  It's going to oom kill an admin every single 
> time.

We overcommit too, but our workloads organize themselves based on
managing their resources, not based on evading the OOM killer. I'd
wager that's true for many if not most users.

Untrusted workloads can evade the OOM killer now, and they can after
these patches are in. Nothing changed. It's not what this work tries
to address at all.

The changelogs are pretty clear on what the goal and the scope of this
is. Just because it doesn't address your highly specialized usecase
doesn't make it pointless. I think we've established that in the past.

Thanks

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [v10 3/6] mm, oom: cgroup-aware OOM killer
  2017-10-05  8:40       ` David Rientjes
  2017-10-05 10:27         ` Johannes Weiner
@ 2017-10-05 10:44         ` Roman Gushchin
  2017-10-05 22:02           ` David Rientjes
  1 sibling, 1 reply; 43+ messages in thread
From: Roman Gushchin @ 2017-10-05 10:44 UTC (permalink / raw)
  To: David Rientjes
  Cc: Johannes Weiner, linux-mm, Michal Hocko, Vladimir Davydov,
	Tetsuo Handa, Andrew Morton, Tejun Heo, kernel-team, cgroups,
	linux-doc, linux-kernel

On Thu, Oct 05, 2017 at 01:40:09AM -0700, David Rientjes wrote:
> On Wed, 4 Oct 2017, Johannes Weiner wrote:
> 
> > > By only considering leaf memcgs, does this penalize users if their memcg 
> > > becomes oc->chosen_memcg purely because it has aggregated all of its 
> > > processes to be members of that memcg, which would otherwise be the 
> > > standard behavior?
> > > 
> > > What prevents me from spreading my memcg with N processes attached over N 
> > > child memcgs instead so that memcg_oom_badness() becomes very small for 
> > > each child memcg specifically to avoid being oom killed?
> > 
> > It's no different from forking out multiple mm to avoid being the
> > biggest process.
> >

Hi, David!

> 
> It is, because it can quite clearly be a DoS, and was prevented with 
> Roman's earlier design of iterating usage up the hierarchy and comparing 
> siblings based on that criteria.  I know exactly why he chose that 
> implementation detail early on, and it was to prevent cases such as this 
> and to not let userspace hide from the oom killer.
> 
> > It's up to the parent to enforce limits on that group and prevent you
> > from being able to cause global OOM in the first place, in particular
> > if you delegate to untrusted and potentially malicious users.
> > 
> 
> Let's resolve that global oom is a real condition and getting into that 
> situation is not a userspace problem.  It's the result of overcommiting 
> the system, and is used in the enterprise to address business goals.  If 
> the above is true, and its up to memcg to prevent global oom in the first 
> place, then this entire patchset is absolutely pointless.  Limit userspace 
> to 95% of memory and when usage is approaching that limit, let userspace 
> attached to the root memcg iterate the hierarchy itself and kill from the 
> largest consumer.
> 
> This patchset exists because overcommit is real, exactly the same as 
> overcommit within memcg hierarchies is real.  99% of the time we don't run 
> into global oom because people aren't using their limits so it just works 
> out.  1% of the time we run into global oom and we need a decision to made 
> based for forward progress.  Using Michal's earlier example of admins and 
> students, a student can easily use all of his limit and also, with v10 of 
> this patchset, 99% of the time avoid being oom killed just by forking N 
> processes over N cgroups.  It's going to oom kill an admin every single 
> time.

Overcommit is real, but configuring the system in a way that system-wide OOM
happens often is a strange idea. As we all know, the system can barely work
adequate under global memory shortage: network packets are dropped, latency
is bad, weird kernel issues are revealed periodically, etc.
I do not see, why you can't overcommit on deeper layers of cgroup hierarchy,
avoiding system-wide OOM to happen.

> 
> I know exactly why earlier versions of this patchset iterated that usage 
> up the tree so you would pick from students, pick from this troublemaking 
> student, and then oom kill from his hierarchy.  Roman has made that point 
> himself.  My suggestion was to add userspace influence to it so that 
> enterprise users and users with business goals can actually define that we 
> really do want 80% of memory to be used by this process or this hierarchy, 
> it's in our best interest.

I'll repeat myself: I believe that there is a range of possible policies:
from a complete flat (what Johannes did suggested few weeks ago), to a very
hierarchical (as in v8). Each with their pros and cons.
(Michal did provide a clear example of bad behavior of the hierarchical approach).

I assume, that v10 is a good middle point, and it's good because it doesn't
prevent further development. Just for example, you can introduce a third state
of oom_group knob, which will mean "evaluate as a whole, but do not kill all".
And this is what will solve your particular case, right?

> 
> Earlier iterations of this patchset did this, and did it correctly.  
> Userspace influence over the decisionmaking makes it a very powerful 
> combination because you _can_ specify what your goals are or choose to 
> leave the priorities as default so you can compare based solely on usage.  
> It was a beautiful solution to the problem.

I did, but then I did agree with Tejun's point, that proposed semantics will
limit us further. Really, oom_priorities do not guarantee the killing order
(remember numa issues, as well as oom_score_adj), so in practice it can be even
reverted (e.g. low prio cgroup killed before high prio). We shouldn't cause
users rely on these priorities more than some hints to the kernel.
But the way how they are defined doesn't allow to change anything, it's too
rigid.

Thanks!

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [v10 3/6] mm, oom: cgroup-aware OOM killer
  2017-10-05 10:27         ` Roman Gushchin
@ 2017-10-05 11:12           ` Michal Hocko
  2017-10-05 11:45             ` Roman Gushchin
  0 siblings, 1 reply; 43+ messages in thread
From: Michal Hocko @ 2017-10-05 11:12 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Shakeel Butt, Linux MM, Vladimir Davydov, Johannes Weiner,
	Tetsuo Handa, David Rientjes, Andrew Morton, Tejun Heo,
	kernel-team, Cgroups, linux-doc, LKML

On Thu 05-10-17 11:27:07, Roman Gushchin wrote:
> On Wed, Oct 04, 2017 at 02:24:26PM -0700, Shakeel Butt wrote:
[...]
> > Sorry about the confusion. There are two things. First, should we do a
> > css_get on the newly selected memcg within the for loop when we still
> > have a reference to it?
> 
> We're holding rcu_read_lock, it should be enough. We're bumping css counter
> just before releasing rcu lock.

yes

> > 
> > Second, for the OFFLINE memcg, you are right oom_evaluate_memcg() will
> > return 0 for offlined memcgs. Maybe no need to call
> > oom_evaluate_memcg() for offlined memcgs.
> 
> Sounds like a good optimization, which can be done on top of the current
> patchset.

You could achive this by checking whether a memcg has tasks rather than
explicitly checking for children memcgs as I've suggested already.
-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [v10 3/6] mm, oom: cgroup-aware OOM killer
  2017-10-04 20:31         ` Johannes Weiner
@ 2017-10-05 11:14           ` Michal Hocko
  0 siblings, 0 replies; 43+ messages in thread
From: Michal Hocko @ 2017-10-05 11:14 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: David Rientjes, Roman Gushchin, linux-mm, Vladimir Davydov,
	Tetsuo Handa, Andrew Morton, Tejun Heo, kernel-team, cgroups,
	linux-doc, linux-kernel

On Wed 04-10-17 16:31:38, Johannes Weiner wrote:
> On Wed, Oct 04, 2017 at 01:17:14PM -0700, David Rientjes wrote:
> > On Wed, 4 Oct 2017, Roman Gushchin wrote:
> > 
> > > > > @@ -828,6 +828,12 @@ static void __oom_kill_process(struct task_struct *victim)
> > > > >  	struct mm_struct *mm;
> > > > >  	bool can_oom_reap = true;
> > > > >  
> > > > > +	if (is_global_init(victim) || (victim->flags & PF_KTHREAD) ||
> > > > > +	    victim->signal->oom_score_adj == OOM_SCORE_ADJ_MIN) {
> > > > > +		put_task_struct(victim);
> > > > > +		return;
> > > > > +	}
> > > > > +
> > > > >  	p = find_lock_task_mm(victim);
> > > > >  	if (!p) {
> > > > >  		put_task_struct(victim);
> > > > 
> > > > Is this necessary? The callers of this function use oom_badness() to
> > > > find a victim, and that filters init, kthread, OOM_SCORE_ADJ_MIN.
> > > 
> > > It is. __oom_kill_process() is used to kill all processes belonging
> > > to the selected memory cgroup, so we should perform these checks
> > > to avoid killing unkillable processes.
> > > 
> > 
> > That's only true after the next patch in the series which uses the 
> > oom_kill_memcg_member() callback to kill processes for oom_group, correct?  
> > Would it be possible to move this check to that patch so it's more 
> > obvious?
> 
> Yup, I realized it when reviewing the next patch. Moving this hunk to
> the next patch would probably make sense. Although, us reviewers have
> been made aware of this now, so I don't feel strongly about it. Won't
> make much of a difference once the patches are merged.

I think it would be better to move it because it will be less confusing
that way. Especially for those who are going to read git history in
order to understand why this is needed.
-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [v10 3/6] mm, oom: cgroup-aware OOM killer
  2017-10-04 15:46 ` [v10 3/6] mm, oom: cgroup-aware OOM killer Roman Gushchin
                     ` (2 preceding siblings ...)
  2017-10-04 20:27   ` David Rientjes
@ 2017-10-05 11:40   ` Michal Hocko
  3 siblings, 0 replies; 43+ messages in thread
From: Michal Hocko @ 2017-10-05 11:40 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: linux-mm, Vladimir Davydov, Johannes Weiner, Tetsuo Handa,
	David Rientjes, Andrew Morton, Tejun Heo, kernel-team, cgroups,
	linux-doc, linux-kernel

On Wed 04-10-17 16:46:35, Roman Gushchin wrote:
> Traditionally, the OOM killer is operating on a process level.
> Under oom conditions, it finds a process with the highest oom score
> and kills it.
> 
> This behavior doesn't suit well the system with many running
> containers:
> 
> 1) There is no fairness between containers. A small container with
> few large processes will be chosen over a large one with huge
> number of small processes.
> 
> 2) Containers often do not expect that some random process inside
> will be killed. In many cases much safer behavior is to kill
> all tasks in the container. Traditionally, this was implemented
> in userspace, but doing it in the kernel has some advantages,
> especially in a case of a system-wide OOM.
> 
> To address these issues, the cgroup-aware OOM killer is introduced.
> 
> Under OOM conditions, it looks for the biggest leaf memory cgroup
> and kills the biggest task belonging to it. The following patches
> will extend this functionality to consider non-leaf memory cgroups
> as well, and also provide an ability to kill all tasks belonging
> to the victim cgroup.
> 
> The root cgroup is treated as a leaf memory cgroup, so it's score
> is compared with leaf memory cgroups.
> Due to memcg statistics implementation a special algorithm
> is used for estimating it's oom_score: we define it as maximum
> oom_score of the belonging tasks.

Thanks for separating the group_oom part. This is getting in the
mergeable state. I will ack it once the suggested fixes are folded in.
There is some clean up potential on top (I find the oc->chosen_memcg
quite ugly and will post a patch on top of yours) but that can be done
later.
 
> Signed-off-by: Roman Gushchin <guro@fb.com>
> Cc: Michal Hocko <mhocko@kernel.org>
> Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> Cc: David Rientjes <rientjes@google.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Tejun Heo <tj@kernel.org>
> Cc: kernel-team@fb.com
> Cc: cgroups@vger.kernel.org
> Cc: linux-doc@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org
> Cc: linux-mm@kvack.org
-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [v10 3/6] mm, oom: cgroup-aware OOM killer
  2017-10-05 11:12           ` Michal Hocko
@ 2017-10-05 11:45             ` Roman Gushchin
  0 siblings, 0 replies; 43+ messages in thread
From: Roman Gushchin @ 2017-10-05 11:45 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Shakeel Butt, Linux MM, Vladimir Davydov, Johannes Weiner,
	Tetsuo Handa, David Rientjes, Andrew Morton, Tejun Heo,
	kernel-team, Cgroups, linux-doc, LKML

On Thu, Oct 05, 2017 at 01:12:30PM +0200, Michal Hocko wrote:
> On Thu 05-10-17 11:27:07, Roman Gushchin wrote:
> > On Wed, Oct 04, 2017 at 02:24:26PM -0700, Shakeel Butt wrote:
> [...]
> > > Sorry about the confusion. There are two things. First, should we do a
> > > css_get on the newly selected memcg within the for loop when we still
> > > have a reference to it?
> > 
> > We're holding rcu_read_lock, it should be enough. We're bumping css counter
> > just before releasing rcu lock.
> 
> yes
> 
> > > 
> > > Second, for the OFFLINE memcg, you are right oom_evaluate_memcg() will
> > > return 0 for offlined memcgs. Maybe no need to call
> > > oom_evaluate_memcg() for offlined memcgs.
> > 
> > Sounds like a good optimization, which can be done on top of the current
> > patchset.
> 
> You could achive this by checking whether a memcg has tasks rather than
> explicitly checking for children memcgs as I've suggested already.

Using cgroup_has_tasks() will require additional locking, so I'm not sure
it worth it.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [v10 4/6] mm, oom: introduce memory.oom_group
  2017-10-04 15:46 ` [v10 4/6] mm, oom: introduce memory.oom_group Roman Gushchin
  2017-10-04 19:37   ` Johannes Weiner
@ 2017-10-05 12:06   ` Michal Hocko
  2017-10-05 12:32     ` Roman Gushchin
  1 sibling, 1 reply; 43+ messages in thread
From: Michal Hocko @ 2017-10-05 12:06 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: linux-mm, Vladimir Davydov, Johannes Weiner, Tetsuo Handa,
	David Rientjes, Andrew Morton, Tejun Heo, kernel-team, cgroups,
	linux-doc, linux-kernel

On Wed 04-10-17 16:46:36, Roman Gushchin wrote:
> The cgroup-aware OOM killer treats leaf memory cgroups as memory
> consumption entities and performs the victim selection by comparing
> them based on their memory footprint. Then it kills the biggest task
> inside the selected memory cgroup.
> 
> But there are workloads, which are not tolerant to a such behavior.
> Killing a random task may leave the workload in a broken state.
> 
> To solve this problem, memory.oom_group knob is introduced.
> It will define, whether a memory group should be treated as an
> indivisible memory consumer, compared by total memory consumption
> with other memory consumers (leaf memory cgroups and other memory
> cgroups with memory.oom_group set), and whether all belonging tasks
> should be killed if the cgroup is selected.
> 
> If set on memcg A, it means that in case of system-wide OOM or
> memcg-wide OOM scoped to A or any ancestor cgroup, all tasks,
> belonging to the sub-tree of A will be killed. If OOM event is
> scoped to a descendant cgroup (A/B, for example), only tasks in
> that cgroup can be affected. OOM killer will never touch any tasks
> outside of the scope of the OOM event.
> 
> Also, tasks with oom_score_adj set to -1000 will not be killed.

I would extend the last sentence with an explanation. What about the
following:
"
Also, tasks with oom_score_adj set to -1000 will not be killed because
this has been a long established way to protect a particular process
from seeing an unexpected SIGKILL from the oom killer. Ignoring this
user defined configuration might lead to data corruptions or other
misbehavior.
"

few mostly nit picks below but this looks good other than that. Once the
fix mentioned in patch 3 is folded I will ack this.

[...]

>  static void select_victim_memcg(struct mem_cgroup *root, struct oom_control *oc)
>  {
> -	struct mem_cgroup *iter;
> +	struct mem_cgroup *iter, *group = NULL;
> +	long group_score = 0;
>  
>  	oc->chosen_memcg = NULL;
>  	oc->chosen_points = 0;
>  
>  	/*
> +	 * If OOM is memcg-wide, and the memcg has the oom_group flag set,
> +	 * all tasks belonging to the memcg should be killed.
> +	 * So, we mark the memcg as a victim.
> +	 */
> +	if (oc->memcg && mem_cgroup_oom_group(oc->memcg)) {

we have is_memcg_oom() helper which is esier to read and understand than
the explicit oc->memcg check

> +		oc->chosen_memcg = oc->memcg;
> +		css_get(&oc->chosen_memcg->css);
> +		return;
> +	}
> +
> +	/*
>  	 * The oom_score is calculated for leaf memory cgroups (including
>  	 * the root memcg).
> +	 * Non-leaf oom_group cgroups accumulating score of descendant
> +	 * leaf memory cgroups.
>  	 */
>  	rcu_read_lock();
>  	for_each_mem_cgroup_tree(iter, root) {
>  		long score;
>  
> +		/*
> +		 * We don't consider non-leaf non-oom_group memory cgroups
> +		 * as OOM victims.
> +		 */
> +		if (memcg_has_children(iter) && !mem_cgroup_oom_group(iter))
> +			continue;
> +
> +		/*
> +		 * If group is not set or we've ran out of the group's sub-tree,
> +		 * we should set group and reset group_score.
> +		 */
> +		if (!group || group == root_mem_cgroup ||
> +		    !mem_cgroup_is_descendant(iter, group)) {
> +			group = iter;
> +			group_score = 0;
> +		}
> +

hmm, I thought you would go with a recursive oom_evaluate_memcg
implementation that would result in a more readable code IMHO. It is
true that we would traverse oom_group more times. But I do not expect
we would have very deep memcg hierarchies in the majority of workloads
and even if we did then this is a cold path which should focus on
readability more than a performance. Also implementing
mem_cgroup_iter_skip_subtree shouldn't be all that hard if this ever
turns out a real problem.

Anyway this is nothing really fundamental so I will leave the decision
on you.

> +static bool oom_kill_memcg_victim(struct oom_control *oc)
> +{
>  	if (oc->chosen_memcg == NULL || oc->chosen_memcg == INFLIGHT_VICTIM)
>  		return oc->chosen_memcg;
>  
> -	/* Kill a task in the chosen memcg with the biggest memory footprint */
> -	oc->chosen_points = 0;
> -	oc->chosen_task = NULL;
> -	mem_cgroup_scan_tasks(oc->chosen_memcg, oom_evaluate_task, oc);
> -
> -	if (oc->chosen_task == NULL || oc->chosen_task == INFLIGHT_VICTIM)
> -		goto out;
> -
> -	__oom_kill_process(oc->chosen_task);
> +	/*
> +	 * If memory.oom_group is set, kill all tasks belonging to the sub-tree
> +	 * of the chosen memory cgroup, otherwise kill the task with the biggest
> +	 * memory footprint.
> +	 */
> +	if (mem_cgroup_oom_group(oc->chosen_memcg)) {
> +		mem_cgroup_scan_tasks(oc->chosen_memcg, oom_kill_memcg_member,
> +				      NULL);
> +		/* We have one or more terminating processes at this point. */
> +		oc->chosen_task = INFLIGHT_VICTIM;

it took me a while to realize we need this because of return
!!oc->chosen_task in out_of_memory. Subtle... Also a reason to hate
oc->chosen_* thingy. As I've said in other reply, don't worry about this
I will probably turn my hate into a patch ;)

> +	} else {
> +		oc->chosen_points = 0;
> +		oc->chosen_task = NULL;
> +		mem_cgroup_scan_tasks(oc->chosen_memcg, oom_evaluate_task, oc);
> +
> +		if (oc->chosen_task == NULL ||
> +		    oc->chosen_task == INFLIGHT_VICTIM)
> +			goto out;

How can this happen? There shouldn't be any INFLIGHT_VICTIM in our memcg
because we have checked for that already. I can see how we do not find
any task because those can terminate by the time we get here but no new
oom victim should appear we are under the oom_lock.

> +
> +		__oom_kill_process(oc->chosen_task);
> +	}
>  
>  out:
>  	mem_cgroup_put(oc->chosen_memcg);
> -- 
> 2.13.6

-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [v10 4/6] mm, oom: introduce memory.oom_group
  2017-10-05 12:06   ` Michal Hocko
@ 2017-10-05 12:32     ` Roman Gushchin
  2017-10-05 12:58       ` Michal Hocko
  0 siblings, 1 reply; 43+ messages in thread
From: Roman Gushchin @ 2017-10-05 12:32 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Vladimir Davydov, Johannes Weiner, Tetsuo Handa,
	David Rientjes, Andrew Morton, Tejun Heo, kernel-team, cgroups,
	linux-doc, linux-kernel

On Thu, Oct 05, 2017 at 02:06:49PM +0200, Michal Hocko wrote:
> On Wed 04-10-17 16:46:36, Roman Gushchin wrote:
> > The cgroup-aware OOM killer treats leaf memory cgroups as memory
> > consumption entities and performs the victim selection by comparing
> > them based on their memory footprint. Then it kills the biggest task
> > inside the selected memory cgroup.
> > 
> > But there are workloads, which are not tolerant to a such behavior.
> > Killing a random task may leave the workload in a broken state.
> > 
> > To solve this problem, memory.oom_group knob is introduced.
> > It will define, whether a memory group should be treated as an
> > indivisible memory consumer, compared by total memory consumption
> > with other memory consumers (leaf memory cgroups and other memory
> > cgroups with memory.oom_group set), and whether all belonging tasks
> > should be killed if the cgroup is selected.
> > 
> > If set on memcg A, it means that in case of system-wide OOM or
> > memcg-wide OOM scoped to A or any ancestor cgroup, all tasks,
> > belonging to the sub-tree of A will be killed. If OOM event is
> > scoped to a descendant cgroup (A/B, for example), only tasks in
> > that cgroup can be affected. OOM killer will never touch any tasks
> > outside of the scope of the OOM event.
> > 
> > Also, tasks with oom_score_adj set to -1000 will not be killed.
> 
> I would extend the last sentence with an explanation. What about the
> following:
> "
> Also, tasks with oom_score_adj set to -1000 will not be killed because
> this has been a long established way to protect a particular process
> from seeing an unexpected SIGKILL from the oom killer. Ignoring this
> user defined configuration might lead to data corruptions or other
> misbehavior.
> "

Added, thanks!

> 
> few mostly nit picks below but this looks good other than that. Once the
> fix mentioned in patch 3 is folded I will ack this.
> 
> [...]
> 
> >  static void select_victim_memcg(struct mem_cgroup *root, struct oom_control *oc)
> >  {
> > -	struct mem_cgroup *iter;
> > +	struct mem_cgroup *iter, *group = NULL;
> > +	long group_score = 0;
> >  
> >  	oc->chosen_memcg = NULL;
> >  	oc->chosen_points = 0;
> >  
> >  	/*
> > +	 * If OOM is memcg-wide, and the memcg has the oom_group flag set,
> > +	 * all tasks belonging to the memcg should be killed.
> > +	 * So, we mark the memcg as a victim.
> > +	 */
> > +	if (oc->memcg && mem_cgroup_oom_group(oc->memcg)) {
> 
> we have is_memcg_oom() helper which is esier to read and understand than
> the explicit oc->memcg check

It's defined in oom_kill.c and not exported, so I'm not sure.

> 
> > +		oc->chosen_memcg = oc->memcg;
> > +		css_get(&oc->chosen_memcg->css);
> > +		return;
> > +	}
> > +
> > +	/*
> >  	 * The oom_score is calculated for leaf memory cgroups (including
> >  	 * the root memcg).
> > +	 * Non-leaf oom_group cgroups accumulating score of descendant
> > +	 * leaf memory cgroups.
> >  	 */
> >  	rcu_read_lock();
> >  	for_each_mem_cgroup_tree(iter, root) {
> >  		long score;
> >  
> > +		/*
> > +		 * We don't consider non-leaf non-oom_group memory cgroups
> > +		 * as OOM victims.
> > +		 */
> > +		if (memcg_has_children(iter) && !mem_cgroup_oom_group(iter))
> > +			continue;
> > +
> > +		/*
> > +		 * If group is not set or we've ran out of the group's sub-tree,
> > +		 * we should set group and reset group_score.
> > +		 */
> > +		if (!group || group == root_mem_cgroup ||
> > +		    !mem_cgroup_is_descendant(iter, group)) {
> > +			group = iter;
> > +			group_score = 0;
> > +		}
> > +
> 
> hmm, I thought you would go with a recursive oom_evaluate_memcg
> implementation that would result in a more readable code IMHO. It is
> true that we would traverse oom_group more times. But I do not expect
> we would have very deep memcg hierarchies in the majority of workloads
> and even if we did then this is a cold path which should focus on
> readability more than a performance. Also implementing
> mem_cgroup_iter_skip_subtree shouldn't be all that hard if this ever
> turns out a real problem.

I've tried to go this way, but I didn't like the result. These both
loops will share a lot of code (e.g. breaking on finding a previous victim,
skipping non-leaf non-oom-group memcgs, etc), so the result is more messy.
And actually it's strange to start a new loop to iterate exactly over
the same sub-tree, which you want to skip in the first loop.

> 
> Anyway this is nothing really fundamental so I will leave the decision
> on you.
> 
> > +static bool oom_kill_memcg_victim(struct oom_control *oc)
> > +{
> >  	if (oc->chosen_memcg == NULL || oc->chosen_memcg == INFLIGHT_VICTIM)
> >  		return oc->chosen_memcg;
> >  
> > -	/* Kill a task in the chosen memcg with the biggest memory footprint */
> > -	oc->chosen_points = 0;
> > -	oc->chosen_task = NULL;
> > -	mem_cgroup_scan_tasks(oc->chosen_memcg, oom_evaluate_task, oc);
> > -
> > -	if (oc->chosen_task == NULL || oc->chosen_task == INFLIGHT_VICTIM)
> > -		goto out;
> > -
> > -	__oom_kill_process(oc->chosen_task);
> > +	/*
> > +	 * If memory.oom_group is set, kill all tasks belonging to the sub-tree
> > +	 * of the chosen memory cgroup, otherwise kill the task with the biggest
> > +	 * memory footprint.
> > +	 */
> > +	if (mem_cgroup_oom_group(oc->chosen_memcg)) {
> > +		mem_cgroup_scan_tasks(oc->chosen_memcg, oom_kill_memcg_member,
> > +				      NULL);
> > +		/* We have one or more terminating processes at this point. */
> > +		oc->chosen_task = INFLIGHT_VICTIM;
> 
> it took me a while to realize we need this because of return
> !!oc->chosen_task in out_of_memory. Subtle... Also a reason to hate
> oc->chosen_* thingy. As I've said in other reply, don't worry about this
> I will probably turn my hate into a patch ;)
> 
> > +	} else {
> > +		oc->chosen_points = 0;
> > +		oc->chosen_task = NULL;
> > +		mem_cgroup_scan_tasks(oc->chosen_memcg, oom_evaluate_task, oc);
> > +
> > +		if (oc->chosen_task == NULL ||
> > +		    oc->chosen_task == INFLIGHT_VICTIM)
> > +			goto out;
> 
> How can this happen? There shouldn't be any INFLIGHT_VICTIM in our memcg
> because we have checked for that already. I can see how we do not find
> any task because those can terminate by the time we get here but no new
> oom victim should appear we are under the oom_lock.

You're probably right, but I would prefer to have this check in place,
rather then get a panic on attempt to kill an INFLIGHT_VICTIM task one day.
In general, I do not like this trick with using this special value
to signal the existence of in-flight victims. It adds a lot of complexity,
and non-obvious code.
I assume, it's a good target for the following refactoring.

Thanks!

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [v10 4/6] mm, oom: introduce memory.oom_group
  2017-10-05 12:32     ` Roman Gushchin
@ 2017-10-05 12:58       ` Michal Hocko
  0 siblings, 0 replies; 43+ messages in thread
From: Michal Hocko @ 2017-10-05 12:58 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: linux-mm, Vladimir Davydov, Johannes Weiner, Tetsuo Handa,
	David Rientjes, Andrew Morton, Tejun Heo, kernel-team, cgroups,
	linux-doc, linux-kernel

On Thu 05-10-17 13:32:14, Roman Gushchin wrote:
> On Thu, Oct 05, 2017 at 02:06:49PM +0200, Michal Hocko wrote:
> > On Wed 04-10-17 16:46:36, Roman Gushchin wrote:
> > > The cgroup-aware OOM killer treats leaf memory cgroups as memory
> > > consumption entities and performs the victim selection by comparing
> > > them based on their memory footprint. Then it kills the biggest task
> > > inside the selected memory cgroup.
> > > 
> > > But there are workloads, which are not tolerant to a such behavior.
> > > Killing a random task may leave the workload in a broken state.
> > > 
> > > To solve this problem, memory.oom_group knob is introduced.
> > > It will define, whether a memory group should be treated as an
> > > indivisible memory consumer, compared by total memory consumption
> > > with other memory consumers (leaf memory cgroups and other memory
> > > cgroups with memory.oom_group set), and whether all belonging tasks
> > > should be killed if the cgroup is selected.
> > > 
> > > If set on memcg A, it means that in case of system-wide OOM or
> > > memcg-wide OOM scoped to A or any ancestor cgroup, all tasks,
> > > belonging to the sub-tree of A will be killed. If OOM event is
> > > scoped to a descendant cgroup (A/B, for example), only tasks in
> > > that cgroup can be affected. OOM killer will never touch any tasks
> > > outside of the scope of the OOM event.
> > > 
> > > Also, tasks with oom_score_adj set to -1000 will not be killed.
> > 
> > I would extend the last sentence with an explanation. What about the
> > following:
> > "
> > Also, tasks with oom_score_adj set to -1000 will not be killed because
> > this has been a long established way to protect a particular process
> > from seeing an unexpected SIGKILL from the oom killer. Ignoring this
> > user defined configuration might lead to data corruptions or other
> > misbehavior.
> > "
> 
> Added, thanks!
> 
> > 
> > few mostly nit picks below but this looks good other than that. Once the
> > fix mentioned in patch 3 is folded I will ack this.
> > 
> > [...]
> > 
> > >  static void select_victim_memcg(struct mem_cgroup *root, struct oom_control *oc)
> > >  {
> > > -	struct mem_cgroup *iter;
> > > +	struct mem_cgroup *iter, *group = NULL;
> > > +	long group_score = 0;
> > >  
> > >  	oc->chosen_memcg = NULL;
> > >  	oc->chosen_points = 0;
> > >  
> > >  	/*
> > > +	 * If OOM is memcg-wide, and the memcg has the oom_group flag set,
> > > +	 * all tasks belonging to the memcg should be killed.
> > > +	 * So, we mark the memcg as a victim.
> > > +	 */
> > > +	if (oc->memcg && mem_cgroup_oom_group(oc->memcg)) {
> > 
> > we have is_memcg_oom() helper which is esier to read and understand than
> > the explicit oc->memcg check
> 
> It's defined in oom_kill.c and not exported, so I'm not sure.

putting it to oom.h shouldn't be a big deal.
 
> > > +		oc->chosen_memcg = oc->memcg;
> > > +		css_get(&oc->chosen_memcg->css);
> > > +		return;
> > > +	}
> > > +
> > > +	/*
> > >  	 * The oom_score is calculated for leaf memory cgroups (including
> > >  	 * the root memcg).
> > > +	 * Non-leaf oom_group cgroups accumulating score of descendant
> > > +	 * leaf memory cgroups.
> > >  	 */
> > >  	rcu_read_lock();
> > >  	for_each_mem_cgroup_tree(iter, root) {
> > >  		long score;
> > >  
> > > +		/*
> > > +		 * We don't consider non-leaf non-oom_group memory cgroups
> > > +		 * as OOM victims.
> > > +		 */
> > > +		if (memcg_has_children(iter) && !mem_cgroup_oom_group(iter))
> > > +			continue;
> > > +
> > > +		/*
> > > +		 * If group is not set or we've ran out of the group's sub-tree,
> > > +		 * we should set group and reset group_score.
> > > +		 */
> > > +		if (!group || group == root_mem_cgroup ||
> > > +		    !mem_cgroup_is_descendant(iter, group)) {
> > > +			group = iter;
> > > +			group_score = 0;
> > > +		}
> > > +
> > 
> > hmm, I thought you would go with a recursive oom_evaluate_memcg
> > implementation that would result in a more readable code IMHO. It is
> > true that we would traverse oom_group more times. But I do not expect
> > we would have very deep memcg hierarchies in the majority of workloads
> > and even if we did then this is a cold path which should focus on
> > readability more than a performance. Also implementing
> > mem_cgroup_iter_skip_subtree shouldn't be all that hard if this ever
> > turns out a real problem.
> 
> I've tried to go this way, but I didn't like the result. These both
> loops will share a lot of code (e.g. breaking on finding a previous victim,
> skipping non-leaf non-oom-group memcgs, etc), so the result is more messy.
> And actually it's strange to start a new loop to iterate exactly over
> the same sub-tree, which you want to skip in the first loop.

As I've said, I will not insist. It just makes more sense to me to do
the hierarchical behavior in a single place rather than open code it in
the main loop.
 
> > Anyway this is nothing really fundamental so I will leave the decision
> > on you.
> > 
> > > +static bool oom_kill_memcg_victim(struct oom_control *oc)
> > > +{
> > >  	if (oc->chosen_memcg == NULL || oc->chosen_memcg == INFLIGHT_VICTIM)
> > >  		return oc->chosen_memcg;
> > >  
> > > -	/* Kill a task in the chosen memcg with the biggest memory footprint */
> > > -	oc->chosen_points = 0;
> > > -	oc->chosen_task = NULL;
> > > -	mem_cgroup_scan_tasks(oc->chosen_memcg, oom_evaluate_task, oc);
> > > -
> > > -	if (oc->chosen_task == NULL || oc->chosen_task == INFLIGHT_VICTIM)
> > > -		goto out;
> > > -
> > > -	__oom_kill_process(oc->chosen_task);
> > > +	/*
> > > +	 * If memory.oom_group is set, kill all tasks belonging to the sub-tree
> > > +	 * of the chosen memory cgroup, otherwise kill the task with the biggest
> > > +	 * memory footprint.
> > > +	 */
> > > +	if (mem_cgroup_oom_group(oc->chosen_memcg)) {
> > > +		mem_cgroup_scan_tasks(oc->chosen_memcg, oom_kill_memcg_member,
> > > +				      NULL);
> > > +		/* We have one or more terminating processes at this point. */
> > > +		oc->chosen_task = INFLIGHT_VICTIM;
> > 
> > it took me a while to realize we need this because of return
> > !!oc->chosen_task in out_of_memory. Subtle... Also a reason to hate
> > oc->chosen_* thingy. As I've said in other reply, don't worry about this
> > I will probably turn my hate into a patch ;)
> > 
> > > +	} else {
> > > +		oc->chosen_points = 0;
> > > +		oc->chosen_task = NULL;
> > > +		mem_cgroup_scan_tasks(oc->chosen_memcg, oom_evaluate_task, oc);
> > > +
> > > +		if (oc->chosen_task == NULL ||
> > > +		    oc->chosen_task == INFLIGHT_VICTIM)
> > > +			goto out;
> > 
> > How can this happen? There shouldn't be any INFLIGHT_VICTIM in our memcg
> > because we have checked for that already. I can see how we do not find
> > any task because those can terminate by the time we get here but no new
> > oom victim should appear we are under the oom_lock.
> 
> You're probably right, but I would prefer to have this check in place,
> rather then get a panic on attempt to kill an INFLIGHT_VICTIM task one day.

This would be a bug which you just paper over.
-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [v10 5/6] mm, oom: add cgroup v2 mount option for cgroup-aware OOM killer
  2017-10-04 20:04   ` Johannes Weiner
@ 2017-10-05 13:14     ` Michal Hocko
  2017-10-05 13:41       ` Roman Gushchin
                         ` (2 more replies)
  0 siblings, 3 replies; 43+ messages in thread
From: Michal Hocko @ 2017-10-05 13:14 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Roman Gushchin, linux-mm, Vladimir Davydov, Tetsuo Handa,
	David Rientjes, Andrew Morton, Tejun Heo, kernel-team, cgroups,
	linux-doc, linux-kernel

On Wed 04-10-17 16:04:53, Johannes Weiner wrote:
[...]
> That will silently ignore what the user writes to the memory.oom_group
> control files across the system's cgroup tree.
> 
> We'll have a knob that lets the workload declare itself an indivisible
> memory consumer, that it would like to get killed in one piece, and
> it's silently ignored because of a mount option they forgot to pass.
> 
> That's not good from an interface perspective.

Yes and that is why I think a boot time knob would be the most simple
way. It will also open doors for more oom policies in future which I
believe come sooner or later.

> On the other hand, the only benefit of this patch is to shield users
> from changes to the OOM killing heuristics. Yet, it's really hard to
> imagine that modifying the victim selection process slightly could be
> called a regression in any way. We have done that many times over,
> without a second thought on backwards compatibility:
> 
> 5e9d834a0e0c oom: sacrifice child with highest badness score for parent
> a63d83f427fb oom: badness heuristic rewrite
> 778c14affaf9 mm, oom: base root bonus on current usage

yes we have changed that without a deeper considerations. Some of those
changes are arguable (e.g. child scarification). The oom badness
heuristic rewrite has triggered quite some complains AFAIR (I remember
Kosaki has made several attempts to revert it). I think that we are
trying to be more careful about user visible changes than we used to be.

More importantly I do not think that the current (non-memcg aware) OOM
policy is somehow obsolete and many people expect it to behave
consistently. As I've said already, I have seen many complains that the
OOM killer doesn't kill the right task. Most of them were just NUMA
related issues where the oom report was not clear enough. I do not want
to repeat that again now. Memcg awareness is certainly a useful
heuristic but I do not see it universally applicable to all workloads.

> Let's not make the userspace interface crap because of some misguided
> idea that the OOM heuristic is a hard promise to userspace. It's never
> been, and nobody has complained about changes in the past.
> 
> This case is doubly silly, as the behavior change only applies to
> cgroup2, which doesn't exactly have a large base of legacy users yet.

I agree on the interface part but I disagree with making it default just
because v2 is not largerly adopted yet.
-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [v10 5/6] mm, oom: add cgroup v2 mount option for cgroup-aware OOM killer
  2017-10-05 13:14     ` Michal Hocko
@ 2017-10-05 13:41       ` Roman Gushchin
  2017-10-05 14:10         ` Michal Hocko
  2017-10-05 14:54       ` Johannes Weiner
  2017-10-05 15:51       ` Tejun Heo
  2 siblings, 1 reply; 43+ messages in thread
From: Roman Gushchin @ 2017-10-05 13:41 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Johannes Weiner, linux-mm, Vladimir Davydov, Tetsuo Handa,
	David Rientjes, Andrew Morton, Tejun Heo, kernel-team, cgroups,
	linux-doc, linux-kernel

On Thu, Oct 05, 2017 at 03:14:19PM +0200, Michal Hocko wrote:
> On Wed 04-10-17 16:04:53, Johannes Weiner wrote:
> [...]
> > That will silently ignore what the user writes to the memory.oom_group
> > control files across the system's cgroup tree.
> > 
> > We'll have a knob that lets the workload declare itself an indivisible
> > memory consumer, that it would like to get killed in one piece, and
> > it's silently ignored because of a mount option they forgot to pass.
> > 
> > That's not good from an interface perspective.
> 
> Yes and that is why I think a boot time knob would be the most simple
> way. It will also open doors for more oom policies in future which I
> believe come sooner or later.

So, we would rely on grub config to set up OOM policy? Sounds weird.

We use boot options, when it's hard to implement on the fly switching
(like turning on/off socket memory accounting), but here is not this case.

> 
> > On the other hand, the only benefit of this patch is to shield users
> > from changes to the OOM killing heuristics. Yet, it's really hard to
> > imagine that modifying the victim selection process slightly could be
> > called a regression in any way. We have done that many times over,
> > without a second thought on backwards compatibility:
> > 
> > 5e9d834a0e0c oom: sacrifice child with highest badness score for parent
> > a63d83f427fb oom: badness heuristic rewrite
> > 778c14affaf9 mm, oom: base root bonus on current usage
> 
> yes we have changed that without a deeper considerations. Some of those
> changes are arguable (e.g. child scarification). The oom badness
> heuristic rewrite has triggered quite some complains AFAIR (I remember
> Kosaki has made several attempts to revert it). I think that we are
> trying to be more careful about user visible changes than we used to be.
> 
> More importantly I do not think that the current (non-memcg aware) OOM
> policy is somehow obsolete and many people expect it to behave
> consistently. As I've said already, I have seen many complains that the
> OOM killer doesn't kill the right task. Most of them were just NUMA
> related issues where the oom report was not clear enough. I do not want
> to repeat that again now. Memcg awareness is certainly a useful
> heuristic but I do not see it universally applicable to all workloads.
> 
> > Let's not make the userspace interface crap because of some misguided
> > idea that the OOM heuristic is a hard promise to userspace. It's never
> > been, and nobody has complained about changes in the past.
> > 
> > This case is doubly silly, as the behavior change only applies to
> > cgroup2, which doesn't exactly have a large base of legacy users yet.
> 
> I agree on the interface part but I disagree with making it default just
> because v2 is not largerly adopted yet.

I believe that the only real regression can be caused by active using of
oom_score_adj. I really don't know how many cgroup v2 users are relying
on it (hopefully, 0).

So, personally I would prefer to have an opt-out cgroup v2 mount option
(sane new behavior for most users, 100% backward compatibility for rare
strange setups), but I don't have a very strong opinion here.

Thanks!

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [v10 5/6] mm, oom: add cgroup v2 mount option for cgroup-aware OOM killer
  2017-10-05 13:41       ` Roman Gushchin
@ 2017-10-05 14:10         ` Michal Hocko
  0 siblings, 0 replies; 43+ messages in thread
From: Michal Hocko @ 2017-10-05 14:10 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Johannes Weiner, linux-mm, Vladimir Davydov, Tetsuo Handa,
	David Rientjes, Andrew Morton, Tejun Heo, kernel-team, cgroups,
	linux-doc, linux-kernel

On Thu 05-10-17 14:41:13, Roman Gushchin wrote:
> On Thu, Oct 05, 2017 at 03:14:19PM +0200, Michal Hocko wrote:
> > On Wed 04-10-17 16:04:53, Johannes Weiner wrote:
> > [...]
> > > That will silently ignore what the user writes to the memory.oom_group
> > > control files across the system's cgroup tree.
> > > 
> > > We'll have a knob that lets the workload declare itself an indivisible
> > > memory consumer, that it would like to get killed in one piece, and
> > > it's silently ignored because of a mount option they forgot to pass.
> > > 
> > > That's not good from an interface perspective.
> > 
> > Yes and that is why I think a boot time knob would be the most simple
> > way. It will also open doors for more oom policies in future which I
> > believe come sooner or later.
> 
> So, we would rely on grub config to set up OOM policy? Sounds weird.
> 
> We use boot options, when it's hard to implement on the fly switching
> (like turning on/off socket memory accounting), but here is not this case.

Well we define global policies with kernel command line so I do not
think it would be something unusual. An advantage is that you do not
have deal with semantic of the policy change during the runtime which is
something I am not sure we need or even want.
 
> > > On the other hand, the only benefit of this patch is to shield users
> > > from changes to the OOM killing heuristics. Yet, it's really hard to
> > > imagine that modifying the victim selection process slightly could be
> > > called a regression in any way. We have done that many times over,
> > > without a second thought on backwards compatibility:
> > > 
> > > 5e9d834a0e0c oom: sacrifice child with highest badness score for parent
> > > a63d83f427fb oom: badness heuristic rewrite
> > > 778c14affaf9 mm, oom: base root bonus on current usage
> > 
> > yes we have changed that without a deeper considerations. Some of those
> > changes are arguable (e.g. child scarification). The oom badness
> > heuristic rewrite has triggered quite some complains AFAIR (I remember
> > Kosaki has made several attempts to revert it). I think that we are
> > trying to be more careful about user visible changes than we used to be.
> > 
> > More importantly I do not think that the current (non-memcg aware) OOM
> > policy is somehow obsolete and many people expect it to behave
> > consistently. As I've said already, I have seen many complains that the
> > OOM killer doesn't kill the right task. Most of them were just NUMA
> > related issues where the oom report was not clear enough. I do not want
> > to repeat that again now. Memcg awareness is certainly a useful
> > heuristic but I do not see it universally applicable to all workloads.
> > 
> > > Let's not make the userspace interface crap because of some misguided
> > > idea that the OOM heuristic is a hard promise to userspace. It's never
> > > been, and nobody has complained about changes in the past.
> > > 
> > > This case is doubly silly, as the behavior change only applies to
> > > cgroup2, which doesn't exactly have a large base of legacy users yet.
> > 
> > I agree on the interface part but I disagree with making it default just
> > because v2 is not largerly adopted yet.
> 
> I believe that the only real regression can be caused by active using of
> oom_score_adj. I really don't know how many cgroup v2 users are relying
> on it (hopefully, 0).

Not only. A memcg with many small tasks could regress as well.

> So, personally I would prefer to have an opt-out cgroup v2 mount option
> (sane new behavior for most users, 100% backward compatibility for rare
> strange setups), but I don't have a very strong opinion here.

I fail to see why should people disable the feature after they see an
unexpected behavior rather than other way around when the feature is
enabled when it is really wanted. The opt-in is more correct just from
the "least surprise POV".
-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [v10 5/6] mm, oom: add cgroup v2 mount option for cgroup-aware OOM killer
  2017-10-05 13:14     ` Michal Hocko
  2017-10-05 13:41       ` Roman Gushchin
@ 2017-10-05 14:54       ` Johannes Weiner
  2017-10-05 16:40         ` Michal Hocko
  2017-10-05 15:51       ` Tejun Heo
  2 siblings, 1 reply; 43+ messages in thread
From: Johannes Weiner @ 2017-10-05 14:54 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Vladimir Davydov, Tetsuo Handa, David Rientjes,
	Andrew Morton, Tejun Heo, kernel-team, cgroups, linux-doc,
	linux-kernel

On Thu, Oct 05, 2017 at 03:14:19PM +0200, Michal Hocko wrote:
> On Wed 04-10-17 16:04:53, Johannes Weiner wrote:
> [...]
> > That will silently ignore what the user writes to the memory.oom_group
> > control files across the system's cgroup tree.
> > 
> > We'll have a knob that lets the workload declare itself an indivisible
> > memory consumer, that it would like to get killed in one piece, and
> > it's silently ignored because of a mount option they forgot to pass.
> > 
> > That's not good from an interface perspective.
> 
> Yes and that is why I think a boot time knob would be the most simple
> way. It will also open doors for more oom policies in future which I
> believe come sooner or later.

A boot time knob makes less sense to me than the mount option. It
doesn't require a reboot to change this behavior, we shouldn't force
the user to reboot when a runtime configuration is possible.

But I don't see how dropping this patch as part of this series would
prevent adding modular oom policies in the future?

That said, selectable OOM policies sound like a total deadend to
me. The kernel OOM happens way too late to be useful for any kind of
resource policy already. Even now it won't prevent you from thrashing
indefinitely, with only 5% of your workload's time spent productively.

What kind of service quality do you have at this point?

The *minority* of our OOM situations (in terms of "this isn't making
real progress anymore due to a lack of memory") is even *seeing* OOM
kills at this point. And it'll get worse as storage gets faster and
memory bigger.

How is that useful as a resource arbitration point?

Then there is the question of reliability. I mean, we still don't have
a global OOM killer that is actually free from deadlocks. We don't
have reserves measured to the exact requirements of reclaim that would
guarantee recovery, the OOM reaper requires a lock that we hope isn't
taken, etc. I wouldn't want any of my fleet to rely on this for
regular operation - I'm just glad that, when we do mess up and hit
this event, we don't have to reboot.

It makes much more sense to monitor memory pressure from userspace and
smartly intervene when things turn unproductive, which is a long way
from the point where the kernel is about to *deadlock* due to memory.

Global OOM kills can still happen, but their goal should really be 1)
to save the kernel, 2) respect the integrity of a memory consumer and
3) be comprehensible to userspace. (These patches are about 2 and 3.)

But abstracting such a rudimentary and fragile deadlock avoidance
mechanism into higher-level resource management, or co-opting it as a
policy enforcement tool, is crazy to me.

And it seems reckless to present it as those things to our users by
encoding any such elaborate policy interfaces.

> > On the other hand, the only benefit of this patch is to shield users
> > from changes to the OOM killing heuristics. Yet, it's really hard to
> > imagine that modifying the victim selection process slightly could be
> > called a regression in any way. We have done that many times over,
> > without a second thought on backwards compatibility:
> > 
> > 5e9d834a0e0c oom: sacrifice child with highest badness score for parent
> > a63d83f427fb oom: badness heuristic rewrite
> > 778c14affaf9 mm, oom: base root bonus on current usage
> 
> yes we have changed that without a deeper considerations. Some of those
> changes are arguable (e.g. child scarification). The oom badness
> heuristic rewrite has triggered quite some complains AFAIR (I remember
> Kosaki has made several attempts to revert it). I think that we are
> trying to be more careful about user visible changes than we used to be.

Whatever grumbling might have come up, it has not resulted in a revert
or a way to switch back to the old behavior. So I don't think this can
be considered an actual regression.

We change heuristics in the MM all the time. If you track for example
allocator behavior over different kernel versions, you can see how
much our caching policy, our huge page policy etc. fluctuates. The
impact of that is way bigger to regular workloads than how we go about
choosing an OOM victim.

We don't want to regress anybody, but let's also keep perspective here
and especially consider the userspace interfaces we are willing to put
in for at least the next few years, the promises we want to make, the
further fragmentation of the config space, for such a negligible risk.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [v10 5/6] mm, oom: add cgroup v2 mount option for cgroup-aware OOM killer
  2017-10-05 13:14     ` Michal Hocko
  2017-10-05 13:41       ` Roman Gushchin
  2017-10-05 14:54       ` Johannes Weiner
@ 2017-10-05 15:51       ` Tejun Heo
  2 siblings, 0 replies; 43+ messages in thread
From: Tejun Heo @ 2017-10-05 15:51 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Johannes Weiner, Roman Gushchin, linux-mm, Vladimir Davydov,
	Tetsuo Handa, David Rientjes, Andrew Morton, kernel-team,
	cgroups, linux-doc, linux-kernel

Hello, Michal.

On Thu, Oct 05, 2017 at 03:14:19PM +0200, Michal Hocko wrote:
> Yes and that is why I think a boot time knob would be the most simple
> way. It will also open doors for more oom policies in future which I
> believe come sooner or later.

While boot params are fine for development and debugging, as a
user-interface, they aren't great.

* The user can't easily confirm whether the config they input is
  correct and when they get it wrong what's wrong can be pretty
  mysterious.

* While kernel params can be made r/w through /proc, people usually
  don't expect that and using that can become really confusing because
  a lot of people use "dmesg|grep" to confirm the boot params and that
  won't agree with the setting written later.

* It can't be scoped.  What if we want to choose different policies
  per delegated subtree?

* Boot params aren't the easiest (again, if you're a developer,
  they're but most aren't developers) to play with and prone to cause
  deployment issues.

* In this case, even worse because it ends up silently ignoring a
  clearly explicit configuration in an interface file.

If the behavior differences we get from group oom code isn't critical
(and it doesn't seem to be), I'd greatly prefer just enabling it when
cgroup2 is in use.  If it absolutely must be opt-in even on cgroup2,
we can discuss other ways but I'd really like to see stronger
rationales before going that route.

Thanks.

-- 
tejun

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [v10 5/6] mm, oom: add cgroup v2 mount option for cgroup-aware OOM killer
  2017-10-05 14:54       ` Johannes Weiner
@ 2017-10-05 16:40         ` Michal Hocko
  0 siblings, 0 replies; 43+ messages in thread
From: Michal Hocko @ 2017-10-05 16:40 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: linux-mm, Vladimir Davydov, Tetsuo Handa, David Rientjes,
	Andrew Morton, Tejun Heo, kernel-team, cgroups, linux-doc,
	linux-kernel

On Thu 05-10-17 10:54:01, Johannes Weiner wrote:
> On Thu, Oct 05, 2017 at 03:14:19PM +0200, Michal Hocko wrote:
> > On Wed 04-10-17 16:04:53, Johannes Weiner wrote:
> > [...]
> > > That will silently ignore what the user writes to the memory.oom_group
> > > control files across the system's cgroup tree.
> > > 
> > > We'll have a knob that lets the workload declare itself an indivisible
> > > memory consumer, that it would like to get killed in one piece, and
> > > it's silently ignored because of a mount option they forgot to pass.
> > > 
> > > That's not good from an interface perspective.
> > 
> > Yes and that is why I think a boot time knob would be the most simple
> > way. It will also open doors for more oom policies in future which I
> > believe come sooner or later.
> 
> A boot time knob makes less sense to me than the mount option. It
> doesn't require a reboot to change this behavior, we shouldn't force
> the user to reboot when a runtime configuration is possible.

Do we need such a runtime configurability, though? If yes, what is the
usecase?

> But I don't see how dropping this patch as part of this series would
> prevent adding modular oom policies in the future?

I didn't say that dropping this patch would prevent further oom policies.
My point was that a command line option could be more generic to allow
more policies in future.

> That said, selectable OOM policies sound like a total deadend to
> me. The kernel OOM happens way too late to be useful for any kind of
> resource policy already. Even now it won't prevent you from thrashing
> indefinitely, with only 5% of your workload's time spent productively.
> 
> What kind of service quality do you have at this point?

The OOM killer is a disruptive operation which can be really costly
from the workload perspective (you are losing work) and as such the
victim selection really depends on the workload. Most of them are just
fine with the most rudimentary kill-the-largest approach but think of
workloads where the amount or type of work really matters much more
(think of a long running computational jobs taking weeks). We cannot
really handle all of those so I really expect that we will eventually
have to provide a way to allow different policies _somehow_.

> The *minority* of our OOM situations (in terms of "this isn't making
> real progress anymore due to a lack of memory") is even *seeing* OOM
> kills at this point. And it'll get worse as storage gets faster and
> memory bigger.

This is imho a separate problem which is independent on the oom victim
selection.

> How is that useful as a resource arbitration point?
> 
> Then there is the question of reliability. I mean, we still don't have
> a global OOM killer that is actually free from deadlocks.

Well, I believe that we should be deadlock free now.

> We don't
> have reserves measured to the exact requirements of reclaim that would
> guarantee recovery, the OOM reaper requires a lock that we hope isn't
> taken, etc. I wouldn't want any of my fleet to rely on this for
> regular operation - I'm just glad that, when we do mess up and hit
> this event, we don't have to reboot.
> 
> It makes much more sense to monitor memory pressure from userspace and
> smartly intervene when things turn unproductive, which is a long way
> from the point where the kernel is about to *deadlock* due to memory.

again this is independent on the oom selection policy.
 
> Global OOM kills can still happen, but their goal should really be 1)
> to save the kernel, 2) respect the integrity of a memory consumer and
> 3) be comprehensible to userspace. (These patches are about 2 and 3.)

I agree on these but I would add 4) make sure that the impact on the
system is acceptable/least disruptive possible.

> But abstracting such a rudimentary and fragile deadlock avoidance
> mechanism into higher-level resource management, or co-opting it as a
> policy enforcement tool, is crazy to me.
> 
> And it seems reckless to present it as those things to our users by
> encoding any such elaborate policy interfaces.
> 
> > > On the other hand, the only benefit of this patch is to shield users
> > > from changes to the OOM killing heuristics. Yet, it's really hard to
> > > imagine that modifying the victim selection process slightly could be
> > > called a regression in any way. We have done that many times over,
> > > without a second thought on backwards compatibility:
> > > 
> > > 5e9d834a0e0c oom: sacrifice child with highest badness score for parent
> > > a63d83f427fb oom: badness heuristic rewrite
> > > 778c14affaf9 mm, oom: base root bonus on current usage
> > 
> > yes we have changed that without a deeper considerations. Some of those
> > changes are arguable (e.g. child scarification). The oom badness
> > heuristic rewrite has triggered quite some complains AFAIR (I remember
> > Kosaki has made several attempts to revert it). I think that we are
> > trying to be more careful about user visible changes than we used to be.
> 
> Whatever grumbling might have come up, it has not resulted in a revert
> or a way to switch back to the old behavior. So I don't think this can
> be considered an actual regression.
> 
> We change heuristics in the MM all the time. If you track for example
> allocator behavior over different kernel versions, you can see how
> much our caching policy, our huge page policy etc. fluctuates. The
> impact of that is way bigger to regular workloads than how we go about
> choosing an OOM victim.

And some of those examples have taught me to be much more careful when
imposing features users are not really asking for (remember the THP by
default story?). I have handled many OOM reports last years and I _know_
for fact that people are very sensitive to the oom victim selection.
 
> We don't want to regress anybody, but let's also keep perspective here
> and especially consider the userspace interfaces we are willing to put
> in for at least the next few years, the promises we want to make, the
> further fragmentation of the config space, for such a negligible risk.

Well, I thought the opt-in aspect has been already agreed on and my
ack is based on that assumption. I do not really care what will be
the mechanism to do so but I _strongly_ believe that enabling this feature
automatically just because cgroupv2 is enabled is a wrong approach. I
will not nack the patchset if there is a broader agreement that this is
acceptable but you have to live without my acks.
-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [v10 3/6] mm, oom: cgroup-aware OOM killer
  2017-10-05 10:27         ` Johannes Weiner
@ 2017-10-05 21:53           ` David Rientjes
  0 siblings, 0 replies; 43+ messages in thread
From: David Rientjes @ 2017-10-05 21:53 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Roman Gushchin, linux-mm, Michal Hocko, Vladimir Davydov,
	Tetsuo Handa, Andrew Morton, Tejun Heo, kernel-team, cgroups,
	linux-doc, linux-kernel

On Thu, 5 Oct 2017, Johannes Weiner wrote:

> > It is, because it can quite clearly be a DoSand was prevented with 
> > Roman's earlier design of iterating usage up the hierarchy and comparing 
> > siblings based on that criteria.  I know exactly why he chose that 
> > implementation detail early on, and it was to prevent cases such as this 
> > and to not let userspace hide from the oom killer.
> 
> This doesn't address how it's different from a single process
> following the same pattern right now.
> 

Are you referring to a single process being rewritten into N different 
subprocesses that do the same work as the single process but is separated 
in this manner to avoid having a large rss for any single process to avoid  
being oom killed?

This is solved by a cgroup-aware oom killer because these subprocesses 
should not be able to escape their own chargable entity.  It's exactly the 
usecase that Roman is addressing, correct?  My suggestion is to continue 
to iterate the usage up the hierarchy so that users can't easily defeat 
this by creating N subcontainers instead.

> > Let's resolve that global oom is a real condition and getting into that 
> > situation is not a userspace problem.  It's the result of overcommiting 
> > the system, and is used in the enterprise to address business goals.  If 
> > the above is true, and its up to memcg to prevent global oom in the first 
> > place, then this entire patchset is absolutely pointless.  Limit userspace 
> > to 95% of memory and when usage is approaching that limit, let userspace 
> > attached to the root memcg iterate the hierarchy itself and kill from the 
> > largest consumer.
> > 
> > This patchset exists because overcommit is real, exactly the same as 
> > overcommit within memcg hierarchies is real.  99% of the time we don't run 
> > into global oom because people aren't using their limits so it just works 
> > out.  1% of the time we run into global oom and we need a decision to made 
> > based for forward progress.  Using Michal's earlier example of admins and 
> > students, a student can easily use all of his limit and also, with v10 of 
> > this patchset, 99% of the time avoid being oom killed just by forking N 
> > processes over N cgroups.  It's going to oom kill an admin every single 
> > time.
> 
> We overcommit too, but our workloads organize themselves based on
> managing their resources, not based on evading the OOM killer. I'd
> wager that's true for many if not most users.
> 

No workloads are based on evading the oom killer, we are specifically 
trying to avoid that with oom priorities.  They have the power over 
increasing their own priority to be preferred to kill, but not decreasing 
their oom priority that was set by an activity manager.  This is exactly 
the same as how /proc/pid/oom_score_adj works.  With a cgroup-aware oom 
killer, which we'd love, nothing can possibly evade the oom killer if 
there are oom priorities.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [v10 3/6] mm, oom: cgroup-aware OOM killer
  2017-10-05 10:44         ` Roman Gushchin
@ 2017-10-05 22:02           ` David Rientjes
  2017-10-06  5:43             ` Michal Hocko
  0 siblings, 1 reply; 43+ messages in thread
From: David Rientjes @ 2017-10-05 22:02 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Johannes Weiner, linux-mm, Michal Hocko, Vladimir Davydov,
	Tetsuo Handa, Andrew Morton, Tejun Heo, kernel-team, cgroups,
	linux-doc, linux-kernel

On Thu, 5 Oct 2017, Roman Gushchin wrote:

> > This patchset exists because overcommit is real, exactly the same as 
> > overcommit within memcg hierarchies is real.  99% of the time we don't run 
> > into global oom because people aren't using their limits so it just works 
> > out.  1% of the time we run into global oom and we need a decision to made 
> > based for forward progress.  Using Michal's earlier example of admins and 
> > students, a student can easily use all of his limit and also, with v10 of 
> > this patchset, 99% of the time avoid being oom killed just by forking N 
> > processes over N cgroups.  It's going to oom kill an admin every single 
> > time.
> 
> Overcommit is real, but configuring the system in a way that system-wide OOM
> happens often is a strange idea.

I wouldn't consider 1% of the time to be often, but the incident rate 
depends on many variables and who is sharing the same machine.  We can be 
smart about it and limit the potential for it in many ways, but the end 
result is that we still do overcommit and the system oom killer can be 
used to free memory from a low priority process.

> As we all know, the system can barely work
> adequate under global memory shortage: network packets are dropped, latency
> is bad, weird kernel issues are revealed periodically, etc.
> I do not see, why you can't overcommit on deeper layers of cgroup hierarchy,
> avoiding system-wide OOM to happen.
> 

Whether it's a system oom or whether its part of the cgroup hierarchy 
doesn't really matter, what matters is that overcommit occurs and we'd 
like to kill based on cgroup usage for each cgroup and its subtree, much 
like your earlier iterations, and also have the ability for userspace to 
influence that.

Without a cgroup-aware oom killer, I can prefer against killing an 
important job that uses 80% of memory and I want it to continue using 80% 
of memory.  We don't have that control over the cgroup-aware oom killer 
although we want to consider cgroup and subtree usage when choosing 
amongst cgroups with the same priority.  If you are not interested in 
defining the oom priority, all can remain at the default and there is no 
compatibility issue.

> > I know exactly why earlier versions of this patchset iterated that usage 
> > up the tree so you would pick from students, pick from this troublemaking 
> > student, and then oom kill from his hierarchy.  Roman has made that point 
> > himself.  My suggestion was to add userspace influence to it so that 
> > enterprise users and users with business goals can actually define that we 
> > really do want 80% of memory to be used by this process or this hierarchy, 
> > it's in our best interest.
> 
> I'll repeat myself: I believe that there is a range of possible policies:
> from a complete flat (what Johannes did suggested few weeks ago), to a very
> hierarchical (as in v8). Each with their pros and cons.
> (Michal did provide a clear example of bad behavior of the hierarchical approach).
> 
> I assume, that v10 is a good middle point, and it's good because it doesn't
> prevent further development. Just for example, you can introduce a third state
> of oom_group knob, which will mean "evaluate as a whole, but do not kill all".
> And this is what will solve your particular case, right?
> 

I would need to add patches to add the "evaluate as a whole but do not 
kill all" knob and a knob for "oom priority" so that userspace has the 
same influence over a cgroup based comparison that it does with a process 
based comparison to meet business goals.  I'm not sure I'd be happy to 
pollute the mem cgroup v2 filesystem with such knobs when you can easily 
just not set the priority if you don't want to, and increase the priority 
if you have a leaf cgroup that should be preferred to be killed because of 
excess usage.  It has worked quite well in practice.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [v10 3/6] mm, oom: cgroup-aware OOM killer
  2017-10-05 22:02           ` David Rientjes
@ 2017-10-06  5:43             ` Michal Hocko
  0 siblings, 0 replies; 43+ messages in thread
From: Michal Hocko @ 2017-10-06  5:43 UTC (permalink / raw)
  To: David Rientjes
  Cc: Roman Gushchin, Johannes Weiner, linux-mm, Vladimir Davydov,
	Tetsuo Handa, Andrew Morton, Tejun Heo, kernel-team, cgroups,
	linux-doc, linux-kernel

On Thu 05-10-17 15:02:18, David Rientjes wrote:
[...]
> I would need to add patches to add the "evaluate as a whole but do not 
> kill all" knob and a knob for "oom priority" so that userspace has the 
> same influence over a cgroup based comparison that it does with a process 
> based comparison to meet business goals.

I do not think 2 knobs would be really necessary for your usecase. If we
allow priorities on non-leaf memcgs then a non 0 priority on such a
memcg would mean that we have to check the cumulative consumption. You
can safely use kill all knob on top of that if you need.
-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 43+ messages in thread

end of thread, other threads:[~2017-10-06  5:43 UTC | newest]

Thread overview: 43+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-10-04 15:46 [v10 0/6] cgroup-aware OOM killer Roman Gushchin
2017-10-04 15:46 ` [v10 1/6] mm, oom: refactor the oom_kill_process() function Roman Gushchin
2017-10-04 19:14   ` Johannes Weiner
2017-10-04 15:46 ` [v10 2/6] mm: implement mem_cgroup_scan_tasks() for the root memory cgroup Roman Gushchin
2017-10-04 19:15   ` Johannes Weiner
2017-10-04 20:10   ` David Rientjes
2017-10-04 15:46 ` [v10 3/6] mm, oom: cgroup-aware OOM killer Roman Gushchin
2017-10-04 19:27   ` Johannes Weiner
2017-10-04 19:51     ` Roman Gushchin
2017-10-04 20:17       ` David Rientjes
2017-10-04 20:22         ` Roman Gushchin
2017-10-04 20:31         ` Johannes Weiner
2017-10-05 11:14           ` Michal Hocko
2017-10-04 19:48   ` Shakeel Butt
2017-10-04 20:15     ` Roman Gushchin
2017-10-04 21:24       ` Shakeel Butt
2017-10-05 10:27         ` Roman Gushchin
2017-10-05 11:12           ` Michal Hocko
2017-10-05 11:45             ` Roman Gushchin
2017-10-04 20:27   ` David Rientjes
2017-10-04 20:41     ` Johannes Weiner
2017-10-05  8:40       ` David Rientjes
2017-10-05 10:27         ` Johannes Weiner
2017-10-05 21:53           ` David Rientjes
2017-10-05 10:44         ` Roman Gushchin
2017-10-05 22:02           ` David Rientjes
2017-10-06  5:43             ` Michal Hocko
2017-10-05 11:40   ` Michal Hocko
2017-10-04 15:46 ` [v10 4/6] mm, oom: introduce memory.oom_group Roman Gushchin
2017-10-04 19:37   ` Johannes Weiner
2017-10-05 12:06   ` Michal Hocko
2017-10-05 12:32     ` Roman Gushchin
2017-10-05 12:58       ` Michal Hocko
2017-10-04 15:46 ` [v10 5/6] mm, oom: add cgroup v2 mount option for cgroup-aware OOM killer Roman Gushchin
2017-10-04 20:04   ` Johannes Weiner
2017-10-05 13:14     ` Michal Hocko
2017-10-05 13:41       ` Roman Gushchin
2017-10-05 14:10         ` Michal Hocko
2017-10-05 14:54       ` Johannes Weiner
2017-10-05 16:40         ` Michal Hocko
2017-10-05 15:51       ` Tejun Heo
2017-10-04 15:46 ` [v10 6/6] mm, oom, docs: describe the " Roman Gushchin
2017-10-04 20:08   ` Johannes Weiner

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).