[v9 0/5] cgroup-aware OOM killer

All of lore.kernel.org
 help / color / mirror / Atom feed

* [v9 0/5] cgroup-aware OOM killer
@ 2017-09-27 13:09 ` Roman Gushchin
  0 siblings, 0 replies; 46+ messages in thread
From: Roman Gushchin @ 2017-09-27 13:09 UTC (permalink / raw)
  To: linux-mm
  Cc: Roman Gushchin, Michal Hocko, Vladimir Davydov, Johannes Weiner,
	Tetsuo Handa, David Rientjes, Andrew Morton, Tejun Heo,
	kernel-team, cgroups, linux-doc, linux-kernel

This patchset makes the OOM killer cgroup-aware.

v9:
  - Change siblings-to-siblings comparison to the tree-wide search,
    make related refactorings
  - Make oom_group implicitly propagated down by the tree
  - Fix an issue with task selection in root cgroup

v8:
  - Do not kill tasks with OOM_SCORE_ADJ -1000
  - Make the whole thing opt-in with cgroup mount option control
  - Drop oom_priority for further discussions
  - Kill the whole cgroup if oom_group is set and it's
    memory.max is reached
  - Update docs and commit messages

v7:
  - __oom_kill_process() drops reference to the victim task
  - oom_score_adj -1000 is always respected
  - Renamed oom_kill_all to oom_group
  - Dropped oom_prio range, converted from short to int
  - Added a cgroup v2 mount option to disable cgroup-aware OOM killer
  - Docs updated
  - Rebased on top of mmotm

v6:
  - Renamed oom_control.chosen to oom_control.chosen_task
  - Renamed oom_kill_all_tasks to oom_kill_all
  - Per-node NR_SLAB_UNRECLAIMABLE accounting
  - Several minor fixes and cleanups
  - Docs updated

v5:
  - Rebased on top of Michal Hocko's patches, which have changed the
    way how OOM victims becoming an access to the memory
    reserves. Dropped corresponding part of this patchset
  - Separated the oom_kill_process() splitting into a standalone commit
  - Added debug output (suggested by David Rientjes)
  - Some minor fixes

v4:
  - Reworked per-cgroup oom_score_adj into oom_priority
    (based on ideas by David Rientjes)
  - Tasks with oom_score_adj -1000 are never selected if
    oom_kill_all_tasks is not set
  - Memcg victim selection code is reworked, and
    synchronization is based on finding tasks with OOM victim marker,
    rather then on global counter
  - Debug output is dropped
  - Refactored TIF_MEMDIE usage

v3:
  - Merged commits 1-4 into 6
  - Separated oom_score_adj logic and debug output into separate commits
  - Fixed swap accounting

v2:
  - Reworked victim selection based on feedback
    from Michal Hocko, Vladimir Davydov and Johannes Weiner
  - "Kill all tasks" is now an opt-in option, by default
    only one process will be killed
  - Added per-cgroup oom_score_adj
  - Refined oom score calculations, suggested by Vladimir Davydov
  - Converted to a patchset

v1:
  https://lkml.org/lkml/2017/5/18/969


Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: David Rientjes <rientjes@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: kernel-team@fb.com
Cc: cgroups@vger.kernel.org
Cc: linux-doc@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org

Roman Gushchin (5):
  mm, oom: refactor the oom_kill_process() function
  mm: implement mem_cgroup_scan_tasks() for the root memory cgroup
  mm, oom: cgroup-aware OOM killer
  mm, oom: add cgroup v2 mount option for cgroup-aware OOM killer
  mm, oom, docs: describe the cgroup-aware OOM killer

 Documentation/cgroup-v2.txt |  44 +++++++++
 include/linux/cgroup-defs.h |   5 +
 include/linux/memcontrol.h  |  38 +++++++
 include/linux/oom.h         |  12 ++-
 kernel/cgroup/cgroup.c      |  10 ++
 mm/memcontrol.c             | 235 +++++++++++++++++++++++++++++++++++++++++++-
 mm/oom_kill.c               | 194 +++++++++++++++++++++++-------------
 7 files changed, 465 insertions(+), 73 deletions(-)

-- 
2.13.5

^ permalink raw reply	[flat|nested] 46+ messages in thread

* [v9 0/5] cgroup-aware OOM killer
@ 2017-09-27 13:09 ` Roman Gushchin
  0 siblings, 0 replies; 46+ messages in thread
From: Roman Gushchin @ 2017-09-27 13:09 UTC (permalink / raw)
  To: linux-mm
  Cc: Roman Gushchin, Michal Hocko, Vladimir Davydov, Johannes Weiner,
	Tetsuo Handa, David Rientjes, Andrew Morton, Tejun Heo,
	kernel-team, cgroups, linux-doc, linux-kernel

This patchset makes the OOM killer cgroup-aware.

v9:
  - Change siblings-to-siblings comparison to the tree-wide search,
    make related refactorings
  - Make oom_group implicitly propagated down by the tree
  - Fix an issue with task selection in root cgroup

v8:
  - Do not kill tasks with OOM_SCORE_ADJ -1000
  - Make the whole thing opt-in with cgroup mount option control
  - Drop oom_priority for further discussions
  - Kill the whole cgroup if oom_group is set and it's
    memory.max is reached
  - Update docs and commit messages

v7:
  - __oom_kill_process() drops reference to the victim task
  - oom_score_adj -1000 is always respected
  - Renamed oom_kill_all to oom_group
  - Dropped oom_prio range, converted from short to int
  - Added a cgroup v2 mount option to disable cgroup-aware OOM killer
  - Docs updated
  - Rebased on top of mmotm

v6:
  - Renamed oom_control.chosen to oom_control.chosen_task
  - Renamed oom_kill_all_tasks to oom_kill_all
  - Per-node NR_SLAB_UNRECLAIMABLE accounting
  - Several minor fixes and cleanups
  - Docs updated

v5:
  - Rebased on top of Michal Hocko's patches, which have changed the
    way how OOM victims becoming an access to the memory
    reserves. Dropped corresponding part of this patchset
  - Separated the oom_kill_process() splitting into a standalone commit
  - Added debug output (suggested by David Rientjes)
  - Some minor fixes

v4:
  - Reworked per-cgroup oom_score_adj into oom_priority
    (based on ideas by David Rientjes)
  - Tasks with oom_score_adj -1000 are never selected if
    oom_kill_all_tasks is not set
  - Memcg victim selection code is reworked, and
    synchronization is based on finding tasks with OOM victim marker,
    rather then on global counter
  - Debug output is dropped
  - Refactored TIF_MEMDIE usage

v3:
  - Merged commits 1-4 into 6
  - Separated oom_score_adj logic and debug output into separate commits
  - Fixed swap accounting

v2:
  - Reworked victim selection based on feedback
    from Michal Hocko, Vladimir Davydov and Johannes Weiner
  - "Kill all tasks" is now an opt-in option, by default
    only one process will be killed
  - Added per-cgroup oom_score_adj
  - Refined oom score calculations, suggested by Vladimir Davydov
  - Converted to a patchset

v1:
  https://lkml.org/lkml/2017/5/18/969


Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: David Rientjes <rientjes@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: kernel-team@fb.com
Cc: cgroups@vger.kernel.org
Cc: linux-doc@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org

Roman Gushchin (5):
  mm, oom: refactor the oom_kill_process() function
  mm: implement mem_cgroup_scan_tasks() for the root memory cgroup
  mm, oom: cgroup-aware OOM killer
  mm, oom: add cgroup v2 mount option for cgroup-aware OOM killer
  mm, oom, docs: describe the cgroup-aware OOM killer

 Documentation/cgroup-v2.txt |  44 +++++++++
 include/linux/cgroup-defs.h |   5 +
 include/linux/memcontrol.h  |  38 +++++++
 include/linux/oom.h         |  12 ++-
 kernel/cgroup/cgroup.c      |  10 ++
 mm/memcontrol.c             | 235 +++++++++++++++++++++++++++++++++++++++++++-
 mm/oom_kill.c               | 194 +++++++++++++++++++++++-------------
 7 files changed, 465 insertions(+), 73 deletions(-)

-- 
2.13.5

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 46+ messages in thread

* [v9 1/5] mm, oom: refactor the oom_kill_process() function
  2017-09-27 13:09 ` Roman Gushchin
@ 2017-09-27 13:09   ` Roman Gushchin
  -1 siblings, 0 replies; 46+ messages in thread
From: Roman Gushchin @ 2017-09-27 13:09 UTC (permalink / raw)
  To: linux-mm
  Cc: Roman Gushchin, Vladimir Davydov, Johannes Weiner, Tetsuo Handa,
	David Rientjes, Andrew Morton, Tejun Heo, kernel-team, cgroups,
	linux-doc, linux-kernel

The oom_kill_process() function consists of two logical parts:
the first one is responsible for considering task's children as
a potential victim and printing the debug information.
The second half is responsible for sending SIGKILL to all
tasks sharing the mm struct with the given victim.

This commit splits the oom_kill_process() function with
an intention to re-use the the second half: __oom_kill_process().

The cgroup-aware OOM killer will kill multiple tasks
belonging to the victim cgroup. We don't need to print
the debug information for the each task, as well as play
with task selection (considering task's children),
so we can't use the existing oom_kill_process().

Signed-off-by: Roman Gushchin <guro@fb.com>
Acked-by: Michal Hocko <mhocko@kernel.org>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: David Rientjes <rientjes@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: kernel-team@fb.com
Cc: cgroups@vger.kernel.org
Cc: linux-doc@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
---
 mm/oom_kill.c | 123 +++++++++++++++++++++++++++++++---------------------------
 1 file changed, 65 insertions(+), 58 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index dee0f75c3013..9d65a173e0f6 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -820,68 +820,12 @@ static bool task_will_free_mem(struct task_struct *task)
 	return ret;
 }
 
-static void oom_kill_process(struct oom_control *oc, const char *message)
+static void __oom_kill_process(struct task_struct *victim)
 {
-	struct task_struct *p = oc->chosen;
-	unsigned int points = oc->chosen_points;
-	struct task_struct *victim = p;
-	struct task_struct *child;
-	struct task_struct *t;
+	struct task_struct *p;
 	struct mm_struct *mm;
-	unsigned int victim_points = 0;
-	static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
-					      DEFAULT_RATELIMIT_BURST);
 	bool can_oom_reap = true;
 
-	/*
-	 * If the task is already exiting, don't alarm the sysadmin or kill
-	 * its children or threads, just give it access to memory reserves
-	 * so it can die quickly
-	 */
-	task_lock(p);
-	if (task_will_free_mem(p)) {
-		mark_oom_victim(p);
-		wake_oom_reaper(p);
-		task_unlock(p);
-		put_task_struct(p);
-		return;
-	}
-	task_unlock(p);
-
-	if (__ratelimit(&oom_rs))
-		dump_header(oc, p);
-
-	pr_err("%s: Kill process %d (%s) score %u or sacrifice child\n",
-		message, task_pid_nr(p), p->comm, points);
-
-	/*
-	 * If any of p's children has a different mm and is eligible for kill,
-	 * the one with the highest oom_badness() score is sacrificed for its
-	 * parent.  This attempts to lose the minimal amount of work done while
-	 * still freeing memory.
-	 */
-	read_lock(&tasklist_lock);
-	for_each_thread(p, t) {
-		list_for_each_entry(child, &t->children, sibling) {
-			unsigned int child_points;
-
-			if (process_shares_mm(child, p->mm))
-				continue;
-			/*
-			 * oom_badness() returns 0 if the thread is unkillable
-			 */
-			child_points = oom_badness(child,
-				oc->memcg, oc->nodemask, oc->totalpages);
-			if (child_points > victim_points) {
-				put_task_struct(victim);
-				victim = child;
-				victim_points = child_points;
-				get_task_struct(victim);
-			}
-		}
-	}
-	read_unlock(&tasklist_lock);
-
 	p = find_lock_task_mm(victim);
 	if (!p) {
 		put_task_struct(victim);
@@ -955,6 +899,69 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
 }
 #undef K
 
+static void oom_kill_process(struct oom_control *oc, const char *message)
+{
+	struct task_struct *p = oc->chosen;
+	unsigned int points = oc->chosen_points;
+	struct task_struct *victim = p;
+	struct task_struct *child;
+	struct task_struct *t;
+	unsigned int victim_points = 0;
+	static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
+					      DEFAULT_RATELIMIT_BURST);
+
+	/*
+	 * If the task is already exiting, don't alarm the sysadmin or kill
+	 * its children or threads, just give it access to memory reserves
+	 * so it can die quickly
+	 */
+	task_lock(p);
+	if (task_will_free_mem(p)) {
+		mark_oom_victim(p);
+		wake_oom_reaper(p);
+		task_unlock(p);
+		put_task_struct(p);
+		return;
+	}
+	task_unlock(p);
+
+	if (__ratelimit(&oom_rs))
+		dump_header(oc, p);
+
+	pr_err("%s: Kill process %d (%s) score %u or sacrifice child\n",
+		message, task_pid_nr(p), p->comm, points);
+
+	/*
+	 * If any of p's children has a different mm and is eligible for kill,
+	 * the one with the highest oom_badness() score is sacrificed for its
+	 * parent.  This attempts to lose the minimal amount of work done while
+	 * still freeing memory.
+	 */
+	read_lock(&tasklist_lock);
+	for_each_thread(p, t) {
+		list_for_each_entry(child, &t->children, sibling) {
+			unsigned int child_points;
+
+			if (process_shares_mm(child, p->mm))
+				continue;
+			/*
+			 * oom_badness() returns 0 if the thread is unkillable
+			 */
+			child_points = oom_badness(child,
+				oc->memcg, oc->nodemask, oc->totalpages);
+			if (child_points > victim_points) {
+				put_task_struct(victim);
+				victim = child;
+				victim_points = child_points;
+				get_task_struct(victim);
+			}
+		}
+	}
+	read_unlock(&tasklist_lock);
+
+	__oom_kill_process(victim);
+}
+
 /*
  * Determines whether the kernel must panic because of the panic_on_oom sysctl.
  */
-- 
2.13.5

^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [v9 1/5] mm, oom: refactor the oom_kill_process() function
@ 2017-09-27 13:09   ` Roman Gushchin
  0 siblings, 0 replies; 46+ messages in thread
From: Roman Gushchin @ 2017-09-27 13:09 UTC (permalink / raw)
  To: linux-mm
  Cc: Roman Gushchin, Vladimir Davydov, Johannes Weiner, Tetsuo Handa,
	David Rientjes, Andrew Morton, Tejun Heo, kernel-team, cgroups,
	linux-doc, linux-kernel

The oom_kill_process() function consists of two logical parts:
the first one is responsible for considering task's children as
a potential victim and printing the debug information.
The second half is responsible for sending SIGKILL to all
tasks sharing the mm struct with the given victim.

This commit splits the oom_kill_process() function with
an intention to re-use the the second half: __oom_kill_process().

The cgroup-aware OOM killer will kill multiple tasks
belonging to the victim cgroup. We don't need to print
the debug information for the each task, as well as play
with task selection (considering task's children),
so we can't use the existing oom_kill_process().

Signed-off-by: Roman Gushchin <guro@fb.com>
Acked-by: Michal Hocko <mhocko@kernel.org>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: David Rientjes <rientjes@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: kernel-team@fb.com
Cc: cgroups@vger.kernel.org
Cc: linux-doc@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
---
 mm/oom_kill.c | 123 +++++++++++++++++++++++++++++++---------------------------
 1 file changed, 65 insertions(+), 58 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index dee0f75c3013..9d65a173e0f6 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -820,68 +820,12 @@ static bool task_will_free_mem(struct task_struct *task)
 	return ret;
 }
 
-static void oom_kill_process(struct oom_control *oc, const char *message)
+static void __oom_kill_process(struct task_struct *victim)
 {
-	struct task_struct *p = oc->chosen;
-	unsigned int points = oc->chosen_points;
-	struct task_struct *victim = p;
-	struct task_struct *child;
-	struct task_struct *t;
+	struct task_struct *p;
 	struct mm_struct *mm;
-	unsigned int victim_points = 0;
-	static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
-					      DEFAULT_RATELIMIT_BURST);
 	bool can_oom_reap = true;
 
-	/*
-	 * If the task is already exiting, don't alarm the sysadmin or kill
-	 * its children or threads, just give it access to memory reserves
-	 * so it can die quickly
-	 */
-	task_lock(p);
-	if (task_will_free_mem(p)) {
-		mark_oom_victim(p);
-		wake_oom_reaper(p);
-		task_unlock(p);
-		put_task_struct(p);
-		return;
-	}
-	task_unlock(p);
-
-	if (__ratelimit(&oom_rs))
-		dump_header(oc, p);
-
-	pr_err("%s: Kill process %d (%s) score %u or sacrifice child\n",
-		message, task_pid_nr(p), p->comm, points);
-
-	/*
-	 * If any of p's children has a different mm and is eligible for kill,
-	 * the one with the highest oom_badness() score is sacrificed for its
-	 * parent.  This attempts to lose the minimal amount of work done while
-	 * still freeing memory.
-	 */
-	read_lock(&tasklist_lock);
-	for_each_thread(p, t) {
-		list_for_each_entry(child, &t->children, sibling) {
-			unsigned int child_points;
-
-			if (process_shares_mm(child, p->mm))
-				continue;
-			/*
-			 * oom_badness() returns 0 if the thread is unkillable
-			 */
-			child_points = oom_badness(child,
-				oc->memcg, oc->nodemask, oc->totalpages);
-			if (child_points > victim_points) {
-				put_task_struct(victim);
-				victim = child;
-				victim_points = child_points;
-				get_task_struct(victim);
-			}
-		}
-	}
-	read_unlock(&tasklist_lock);
-
 	p = find_lock_task_mm(victim);
 	if (!p) {
 		put_task_struct(victim);
@@ -955,6 +899,69 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
 }
 #undef K
 
+static void oom_kill_process(struct oom_control *oc, const char *message)
+{
+	struct task_struct *p = oc->chosen;
+	unsigned int points = oc->chosen_points;
+	struct task_struct *victim = p;
+	struct task_struct *child;
+	struct task_struct *t;
+	unsigned int victim_points = 0;
+	static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
+					      DEFAULT_RATELIMIT_BURST);
+
+	/*
+	 * If the task is already exiting, don't alarm the sysadmin or kill
+	 * its children or threads, just give it access to memory reserves
+	 * so it can die quickly
+	 */
+	task_lock(p);
+	if (task_will_free_mem(p)) {
+		mark_oom_victim(p);
+		wake_oom_reaper(p);
+		task_unlock(p);
+		put_task_struct(p);
+		return;
+	}
+	task_unlock(p);
+
+	if (__ratelimit(&oom_rs))
+		dump_header(oc, p);
+
+	pr_err("%s: Kill process %d (%s) score %u or sacrifice child\n",
+		message, task_pid_nr(p), p->comm, points);
+
+	/*
+	 * If any of p's children has a different mm and is eligible for kill,
+	 * the one with the highest oom_badness() score is sacrificed for its
+	 * parent.  This attempts to lose the minimal amount of work done while
+	 * still freeing memory.
+	 */
+	read_lock(&tasklist_lock);
+	for_each_thread(p, t) {
+		list_for_each_entry(child, &t->children, sibling) {
+			unsigned int child_points;
+
+			if (process_shares_mm(child, p->mm))
+				continue;
+			/*
+			 * oom_badness() returns 0 if the thread is unkillable
+			 */
+			child_points = oom_badness(child,
+				oc->memcg, oc->nodemask, oc->totalpages);
+			if (child_points > victim_points) {
+				put_task_struct(victim);
+				victim = child;
+				victim_points = child_points;
+				get_task_struct(victim);
+			}
+		}
+	}
+	read_unlock(&tasklist_lock);
+
+	__oom_kill_process(victim);
+}
+
 /*
  * Determines whether the kernel must panic because of the panic_on_oom sysctl.
  */
-- 
2.13.5

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [v9 2/5] mm: implement mem_cgroup_scan_tasks() for the root memory cgroup
  2017-09-27 13:09 ` Roman Gushchin
@ 2017-09-27 13:09   ` Roman Gushchin
  -1 siblings, 0 replies; 46+ messages in thread
From: Roman Gushchin @ 2017-09-27 13:09 UTC (permalink / raw)
  To: linux-mm
  Cc: Roman Gushchin, Michal Hocko, Vladimir Davydov, Johannes Weiner,
	Tetsuo Handa, David Rientjes, Andrew Morton, Tejun Heo,
	kernel-team, cgroups, linux-doc, linux-kernel

Implement mem_cgroup_scan_tasks() functionality for the root
memory cgroup to use this function for looking for a OOM victim
task in the root memory cgroup by the cgroup-ware OOM killer.

The root memory cgroup should be treated as a leaf cgroup,
so only tasks which are directly belonging to the root cgroup
should be iterated over.

Signed-off-by: Roman Gushchin <guro@fb.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: David Rientjes <rientjes@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: kernel-team@fb.com
Cc: cgroups@vger.kernel.org
Cc: linux-doc@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
---
 mm/memcontrol.c | 7 +++----
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index ae37b5624eb2..fa1a5120ce3f 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -917,7 +917,8 @@ static void invalidate_reclaim_iterators(struct mem_cgroup *dead_memcg)
  * value, the function breaks the iteration loop and returns the value.
  * Otherwise, it will iterate over all tasks and return 0.
  *
- * This function must not be called for the root memory cgroup.
+ * If memcg is the root memory cgroup, this function will iterate only
+ * over tasks belonging directly to the root memory cgroup.
  */
 int mem_cgroup_scan_tasks(struct mem_cgroup *memcg,
 			  int (*fn)(struct task_struct *, void *), void *arg)
@@ -925,8 +926,6 @@ int mem_cgroup_scan_tasks(struct mem_cgroup *memcg,
 	struct mem_cgroup *iter;
 	int ret = 0;
 
-	BUG_ON(memcg == root_mem_cgroup);
-
 	for_each_mem_cgroup_tree(iter, memcg) {
 		struct css_task_iter it;
 		struct task_struct *task;
@@ -935,7 +934,7 @@ int mem_cgroup_scan_tasks(struct mem_cgroup *memcg,
 		while (!ret && (task = css_task_iter_next(&it)))
 			ret = fn(task, arg);
 		css_task_iter_end(&it);
-		if (ret) {
+		if (ret || memcg == root_mem_cgroup) {
 			mem_cgroup_iter_break(memcg, iter);
 			break;
 		}
-- 
2.13.5

^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [v9 2/5] mm: implement mem_cgroup_scan_tasks() for the root memory cgroup
@ 2017-09-27 13:09   ` Roman Gushchin
  0 siblings, 0 replies; 46+ messages in thread
From: Roman Gushchin @ 2017-09-27 13:09 UTC (permalink / raw)
  To: linux-mm
  Cc: Roman Gushchin, Michal Hocko, Vladimir Davydov, Johannes Weiner,
	Tetsuo Handa, David Rientjes, Andrew Morton, Tejun Heo,
	kernel-team, cgroups, linux-doc, linux-kernel

Implement mem_cgroup_scan_tasks() functionality for the root
memory cgroup to use this function for looking for a OOM victim
task in the root memory cgroup by the cgroup-ware OOM killer.

The root memory cgroup should be treated as a leaf cgroup,
so only tasks which are directly belonging to the root cgroup
should be iterated over.

Signed-off-by: Roman Gushchin <guro@fb.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: David Rientjes <rientjes@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: kernel-team@fb.com
Cc: cgroups@vger.kernel.org
Cc: linux-doc@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
---
 mm/memcontrol.c | 7 +++----
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index ae37b5624eb2..fa1a5120ce3f 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -917,7 +917,8 @@ static void invalidate_reclaim_iterators(struct mem_cgroup *dead_memcg)
  * value, the function breaks the iteration loop and returns the value.
  * Otherwise, it will iterate over all tasks and return 0.
  *
- * This function must not be called for the root memory cgroup.
+ * If memcg is the root memory cgroup, this function will iterate only
+ * over tasks belonging directly to the root memory cgroup.
  */
 int mem_cgroup_scan_tasks(struct mem_cgroup *memcg,
 			  int (*fn)(struct task_struct *, void *), void *arg)
@@ -925,8 +926,6 @@ int mem_cgroup_scan_tasks(struct mem_cgroup *memcg,
 	struct mem_cgroup *iter;
 	int ret = 0;
 
-	BUG_ON(memcg == root_mem_cgroup);
-
 	for_each_mem_cgroup_tree(iter, memcg) {
 		struct css_task_iter it;
 		struct task_struct *task;
@@ -935,7 +934,7 @@ int mem_cgroup_scan_tasks(struct mem_cgroup *memcg,
 		while (!ret && (task = css_task_iter_next(&it)))
 			ret = fn(task, arg);
 		css_task_iter_end(&it);
-		if (ret) {
+		if (ret || memcg == root_mem_cgroup) {
 			mem_cgroup_iter_break(memcg, iter);
 			break;
 		}
-- 
2.13.5

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [v9 3/5] mm, oom: cgroup-aware OOM killer
  2017-09-27 13:09 ` Roman Gushchin
@ 2017-09-27 13:09   ` Roman Gushchin
  -1 siblings, 0 replies; 46+ messages in thread
From: Roman Gushchin @ 2017-09-27 13:09 UTC (permalink / raw)
  To: linux-mm
  Cc: Roman Gushchin, Michal Hocko, Vladimir Davydov, Johannes Weiner,
	Tetsuo Handa, David Rientjes, Andrew Morton, Tejun Heo,
	kernel-team, cgroups, linux-doc, linux-kernel

Traditionally, the OOM killer is operating on a process level.
Under oom conditions, it finds a process with the highest oom score
and kills it.

This behavior doesn't suit well the system with many running
containers:

1) There is no fairness between containers. A small container with
few large processes will be chosen over a large one with huge
number of small processes.

2) Containers often do not expect that some random process inside
will be killed. In many cases much safer behavior is to kill
all tasks in the container. Traditionally, this was implemented
in userspace, but doing it in the kernel has some advantages,
especially in a case of a system-wide OOM.

To address these issues, the cgroup-aware OOM killer is introduced.

Under OOM conditions, it looks for the biggest memory consumer:
a leaf memory cgroup or a memory cgroup with the memory.oom_group
option set. Then it kills either a task with the biggest memory
footprint, either all belonging tasks, if memory.oom_group is set.
If a cgroup has memory.oom_group set, all descendant cgroups
implicitly inherit the memory.oom_group setting.

Tasks with oom_score_adj set to -1000 are considered as unkillable.

The root cgroup is treated as a leaf memory cgroup, so it's score
is compared with other leaf and oom_group memory cgroups.
The oom_group option is not supported for the root cgroup.
Due to memcg statistics implementation a special algorithm
is used for estimating root cgroup oom_score: we define it
as maximum oom_score of the belonging tasks.

Signed-off-by: Roman Gushchin <guro@fb.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: David Rientjes <rientjes@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: kernel-team@fb.com
Cc: cgroups@vger.kernel.org
Cc: linux-doc@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
---
 include/linux/memcontrol.h |  38 ++++++++
 include/linux/oom.h        |  12 ++-
 mm/memcontrol.c            | 225 +++++++++++++++++++++++++++++++++++++++++++++
 mm/oom_kill.c              |  79 +++++++++++++---
 4 files changed, 339 insertions(+), 15 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 69966c461d1c..0289dc3d7434 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -35,6 +35,7 @@ struct mem_cgroup;
 struct page;
 struct mm_struct;
 struct kmem_cache;
+struct oom_control;
 
 /* Cgroup-specific page state, on top of universal node page state */
 enum memcg_stat_item {
@@ -199,6 +200,12 @@ struct mem_cgroup {
 	/* OOM-Killer disable */
 	int		oom_kill_disable;
 
+	/* kill all tasks in the subtree in case of OOM */
+	bool oom_group;
+
+	/* cached OOM score */
+	long oom_score;
+
 	/* handle for "memory.events" */
 	struct cgroup_file events_file;
 
@@ -342,6 +349,11 @@ struct mem_cgroup *mem_cgroup_from_css(struct cgroup_subsys_state *css){
 	return css ? container_of(css, struct mem_cgroup, css) : NULL;
 }
 
+static inline void mem_cgroup_put(struct mem_cgroup *memcg)
+{
+	css_put(&memcg->css);
+}
+
 #define mem_cgroup_from_counter(counter, member)	\
 	container_of(counter, struct mem_cgroup, member)
 
@@ -480,6 +492,18 @@ static inline bool task_in_memcg_oom(struct task_struct *p)
 
 bool mem_cgroup_oom_synchronize(bool wait);
 
+bool mem_cgroup_select_oom_victim(struct oom_control *oc);
+
+static inline bool mem_cgroup_oom_group(struct mem_cgroup *memcg)
+{
+	do {
+		if (memcg->oom_group)
+			return true;
+	} while ((memcg = parent_mem_cgroup(memcg)));
+
+	return false;
+}
+
 #ifdef CONFIG_MEMCG_SWAP
 extern int do_swap_account;
 #endif
@@ -744,6 +768,10 @@ static inline bool task_in_mem_cgroup(struct task_struct *task,
 	return true;
 }
 
+static inline void mem_cgroup_put(struct mem_cgroup *memcg)
+{
+}
+
 static inline struct mem_cgroup *
 mem_cgroup_iter(struct mem_cgroup *root,
 		struct mem_cgroup *prev,
@@ -936,6 +964,16 @@ static inline
 void count_memcg_event_mm(struct mm_struct *mm, enum vm_event_item idx)
 {
 }
+
+static inline bool mem_cgroup_select_oom_victim(struct oom_control *oc)
+{
+	return false;
+}
+
+static inline bool mem_cgroup_oom_group(struct mem_cgroup *memcg)
+{
+	return false;
+}
 #endif /* CONFIG_MEMCG */
 
 /* idx can be of type enum memcg_stat_item or node_stat_item */
diff --git a/include/linux/oom.h b/include/linux/oom.h
index 76aac4ce39bc..ca78e2d5956e 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -9,6 +9,13 @@
 #include <linux/sched/coredump.h> /* MMF_* */
 #include <linux/mm.h> /* VM_FAULT* */
 
+
+/*
+ * Special value returned by victim selection functions to indicate
+ * that are inflight OOM victims.
+ */
+#define INFLIGHT_VICTIM ((void *)-1UL)
+
 struct zonelist;
 struct notifier_block;
 struct mem_cgroup;
@@ -39,7 +46,8 @@ struct oom_control {
 
 	/* Used by oom implementation, do not set */
 	unsigned long totalpages;
-	struct task_struct *chosen;
+	struct task_struct *chosen_task;
+	struct mem_cgroup *chosen_memcg;
 	unsigned long chosen_points;
 };
 
@@ -101,6 +109,8 @@ extern void oom_killer_enable(void);
 
 extern struct task_struct *find_lock_task_mm(struct task_struct *p);
 
+extern int oom_evaluate_task(struct task_struct *task, void *arg);
+
 /* sysctls */
 extern int sysctl_oom_dump_tasks;
 extern int sysctl_oom_kill_allocating_task;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index fa1a5120ce3f..353bb590713e 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2670,6 +2670,198 @@ static inline bool memcg_has_children(struct mem_cgroup *memcg)
 	return ret;
 }
 
+static long memcg_oom_badness(struct mem_cgroup *memcg,
+			      const nodemask_t *nodemask,
+			      unsigned long totalpages)
+{
+	long points = 0;
+	int nid;
+	pg_data_t *pgdat;
+
+	/*
+	 * We don't have necessary stats for the root memcg,
+	 * so we define it's oom_score as the maximum oom_score
+	 * of the belonging tasks.
+	 */
+	if (memcg == root_mem_cgroup) {
+		struct css_task_iter it;
+		struct task_struct *task;
+		long score, max_score = 0;
+
+		css_task_iter_start(&memcg->css, 0, &it);
+		while ((task = css_task_iter_next(&it))) {
+			score = oom_badness(task, memcg, nodemask,
+					    totalpages);
+			if (score > max_score)
+				max_score = score;
+		}
+		css_task_iter_end(&it);
+
+		return max_score;
+	}
+
+	for_each_node_state(nid, N_MEMORY) {
+		if (nodemask && !node_isset(nid, *nodemask))
+			continue;
+
+		points += mem_cgroup_node_nr_lru_pages(memcg, nid,
+				LRU_ALL_ANON | BIT(LRU_UNEVICTABLE));
+
+		pgdat = NODE_DATA(nid);
+		points += lruvec_page_state(mem_cgroup_lruvec(pgdat, memcg),
+					    NR_SLAB_UNRECLAIMABLE);
+	}
+
+	points += memcg_page_state(memcg, MEMCG_KERNEL_STACK_KB) /
+		(PAGE_SIZE / 1024);
+	points += memcg_page_state(memcg, MEMCG_SOCK);
+	points += memcg_page_state(memcg, MEMCG_SWAP);
+
+	return points;
+}
+
+/*
+ * Checks if the given memcg is a valid OOM victim and returns a number,
+ * which means the folowing:
+ *   -1: there are inflight OOM victim tasks, belonging to the memcg
+ *    0: memcg is not eligible, e.g. all belonging tasks are protected
+ *       by oom_score_adj set to OOM_SCORE_ADJ_MIN
+ *   >0: memcg is eligible, and the returned value is an estimation
+ *       of the memory footprint
+ */
+static long oom_evaluate_memcg(struct mem_cgroup *memcg,
+			       const nodemask_t *nodemask,
+			       unsigned long totalpages)
+{
+	struct css_task_iter it;
+	struct task_struct *task;
+	int eligible = 0;
+
+	/*
+	 * Memcg is OOM eligible if there are OOM killable tasks inside.
+	 *
+	 * We treat tasks with oom_score_adj set to OOM_SCORE_ADJ_MIN
+	 * as unkillable.
+	 *
+	 * If there are inflight OOM victim tasks inside the memcg,
+	 * we return -1.
+	 */
+	css_task_iter_start(&memcg->css, 0, &it);
+	while ((task = css_task_iter_next(&it))) {
+		if (!eligible &&
+		    task->signal->oom_score_adj != OOM_SCORE_ADJ_MIN)
+			eligible = 1;
+
+		if (tsk_is_oom_victim(task) &&
+		    !test_bit(MMF_OOM_SKIP, &task->signal->oom_mm->flags)) {
+			eligible = -1;
+			break;
+		}
+	}
+	css_task_iter_end(&it);
+
+	if (eligible <= 0)
+		return eligible;
+
+	return memcg_oom_badness(memcg, nodemask, totalpages);
+}
+
+static void select_victim_memcg(struct mem_cgroup *root, struct oom_control *oc)
+{
+	struct mem_cgroup *iter, *parent;
+
+	/*
+	 * If OOM is memcg-wide, and the memcg or it's ancestor has
+	 * the oom_group flag, simple select the memcg as a victim.
+	 */
+	if (oc->memcg && mem_cgroup_oom_group(oc->memcg)) {
+		oc->chosen_memcg = oc->memcg;
+		css_get(&oc->chosen_memcg->css);
+		oc->chosen_points = oc->memcg->oom_score;
+		return;
+	}
+
+	oc->chosen_memcg = NULL;
+
+	/*
+	 * The oom_score is calculated for leaf memcgs and propagated upwards
+	 * by the tree.
+	 *
+	 * for_each_mem_cgroup_tree() walks the tree in pre-order,
+	 * so we simple reset oom_score for non-lead cgroups before
+	 * starting accumulating an actual value from underlying sub-tree.
+	 *
+	 * Root memcg is treated as a leaf memcg.
+	 */
+	rcu_read_lock();
+	for_each_mem_cgroup_tree(iter, root) {
+		if (memcg_has_children(iter) && iter != root_mem_cgroup) {
+			iter->oom_score = 0;
+			continue;
+		}
+
+		iter->oom_score = oom_evaluate_memcg(iter, oc->nodemask,
+						     oc->totalpages);
+
+		/*
+		 * Ignore empty and non-eligible memory cgroups.
+		 */
+		if (iter->oom_score == 0)
+			continue;
+
+		/*
+		 * If there are inflight OOM victims, we don't need to look
+		 * further for new victims.
+		 */
+		if (iter->oom_score == -1) {
+			oc->chosen_memcg = INFLIGHT_VICTIM;
+			mem_cgroup_iter_break(root, iter);
+			break;
+		}
+
+		if (iter->oom_score > oc->chosen_points) {
+			oc->chosen_memcg = iter;
+			oc->chosen_points = iter->oom_score;
+		}
+
+		for (parent = parent_mem_cgroup(iter); parent && parent != root;
+		     parent = parent_mem_cgroup(parent)) {
+			parent->oom_score += iter->oom_score;
+
+			if (mem_cgroup_oom_group(parent) &&
+			    parent->oom_score > oc->chosen_points) {
+				oc->chosen_memcg = parent;
+				oc->chosen_points = parent->oom_score;
+			}
+		}
+	}
+
+	if (oc->chosen_memcg && oc->chosen_memcg != INFLIGHT_VICTIM)
+		css_get(&oc->chosen_memcg->css);
+
+	rcu_read_unlock();
+}
+
+bool mem_cgroup_select_oom_victim(struct oom_control *oc)
+{
+	struct mem_cgroup *root;
+
+	if (mem_cgroup_disabled())
+		return false;
+
+	if (!cgroup_subsys_on_dfl(memory_cgrp_subsys))
+		return false;
+
+	if (oc->memcg)
+		root = oc->memcg;
+	else
+		root = root_mem_cgroup;
+
+	select_victim_memcg(root, oc);
+
+	return oc->chosen_memcg;
+}
+
 /*
  * Reclaims as many pages from the given memcg as possible.
  *
@@ -5267,6 +5459,33 @@ static ssize_t memory_max_write(struct kernfs_open_file *of,
 	return nbytes;
 }
 
+static int memory_oom_group_show(struct seq_file *m, void *v)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m));
+	bool oom_group = memcg->oom_group;
+
+	seq_printf(m, "%d\n", oom_group);
+
+	return 0;
+}
+
+static ssize_t memory_oom_group_write(struct kernfs_open_file *of,
+					       char *buf, size_t nbytes,
+					       loff_t off)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
+	int oom_group;
+	int err;
+
+	err = kstrtoint(strstrip(buf), 0, &oom_group);
+	if (err)
+		return err;
+
+	memcg->oom_group = oom_group;
+
+	return nbytes;
+}
+
 static int memory_events_show(struct seq_file *m, void *v)
 {
 	struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m));
@@ -5387,6 +5606,12 @@ static struct cftype memory_files[] = {
 		.write = memory_max_write,
 	},
 	{
+		.name = "oom_group",
+		.flags = CFTYPE_NOT_ON_ROOT,
+		.seq_show = memory_oom_group_show,
+		.write = memory_oom_group_write,
+	},
+	{
 		.name = "events",
 		.flags = CFTYPE_NOT_ON_ROOT,
 		.file_offset = offsetof(struct mem_cgroup, events_file),
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 9d65a173e0f6..c69e5e397407 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -289,7 +289,7 @@ static enum oom_constraint constrained_alloc(struct oom_control *oc)
 	return CONSTRAINT_NONE;
 }
 
-static int oom_evaluate_task(struct task_struct *task, void *arg)
+int oom_evaluate_task(struct task_struct *task, void *arg)
 {
 	struct oom_control *oc = arg;
 	unsigned long points;
@@ -323,26 +323,26 @@ static int oom_evaluate_task(struct task_struct *task, void *arg)
 		goto next;
 
 	/* Prefer thread group leaders for display purposes */
-	if (points == oc->chosen_points && thread_group_leader(oc->chosen))
+	if (points == oc->chosen_points && thread_group_leader(oc->chosen_task))
 		goto next;
 select:
-	if (oc->chosen)
-		put_task_struct(oc->chosen);
+	if (oc->chosen_task)
+		put_task_struct(oc->chosen_task);
 	get_task_struct(task);
-	oc->chosen = task;
+	oc->chosen_task = task;
 	oc->chosen_points = points;
 next:
 	return 0;
 abort:
-	if (oc->chosen)
-		put_task_struct(oc->chosen);
-	oc->chosen = (void *)-1UL;
+	if (oc->chosen_task)
+		put_task_struct(oc->chosen_task);
+	oc->chosen_task = INFLIGHT_VICTIM;
 	return 1;
 }
 
 /*
  * Simple selection loop. We choose the process with the highest number of
- * 'points'. In case scan was aborted, oc->chosen is set to -1.
+ * 'points'. In case scan was aborted, oc->chosen_task is set to -1.
  */
 static void select_bad_process(struct oom_control *oc)
 {
@@ -826,6 +826,12 @@ static void __oom_kill_process(struct task_struct *victim)
 	struct mm_struct *mm;
 	bool can_oom_reap = true;
 
+	if (is_global_init(victim) || (victim->flags & PF_KTHREAD) ||
+	    victim->signal->oom_score_adj == OOM_SCORE_ADJ_MIN) {
+		put_task_struct(victim);
+		return;
+	}
+
 	p = find_lock_task_mm(victim);
 	if (!p) {
 		put_task_struct(victim);
@@ -901,7 +907,7 @@ static void __oom_kill_process(struct task_struct *victim)
 
 static void oom_kill_process(struct oom_control *oc, const char *message)
 {
-	struct task_struct *p = oc->chosen;
+	struct task_struct *p = oc->chosen_task;
 	unsigned int points = oc->chosen_points;
 	struct task_struct *victim = p;
 	struct task_struct *child;
@@ -962,6 +968,48 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
 	__oom_kill_process(victim);
 }
 
+static int oom_kill_memcg_member(struct task_struct *task, void *unused)
+{
+	if (!tsk_is_oom_victim(task)) {
+		get_task_struct(task);
+		__oom_kill_process(task);
+	}
+	return 0;
+}
+
+static bool oom_kill_memcg_victim(struct oom_control *oc)
+{
+	static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
+				      DEFAULT_RATELIMIT_BURST);
+
+	if (oc->chosen_memcg == NULL || oc->chosen_memcg == INFLIGHT_VICTIM)
+		return oc->chosen_memcg;
+
+	/* Always begin with the task with the biggest memory footprint */
+	oc->chosen_points = 0;
+	oc->chosen_task = NULL;
+	mem_cgroup_scan_tasks(oc->chosen_memcg, oom_evaluate_task, oc);
+
+	if (oc->chosen_task == NULL || oc->chosen_task == INFLIGHT_VICTIM)
+		goto out;
+
+	if (__ratelimit(&oom_rs))
+		dump_header(oc, oc->chosen_task);
+
+	__oom_kill_process(oc->chosen_task);
+
+	/* If oom_group flag is set, kill all belonging tasks */
+	if (mem_cgroup_oom_group(oc->chosen_memcg))
+		mem_cgroup_scan_tasks(oc->chosen_memcg, oom_kill_memcg_member,
+				      NULL);
+
+	schedule_timeout_killable(1);
+
+out:
+	mem_cgroup_put(oc->chosen_memcg);
+	return oc->chosen_task;
+}
+
 /*
  * Determines whether the kernel must panic because of the panic_on_oom sysctl.
  */
@@ -1058,18 +1106,21 @@ bool out_of_memory(struct oom_control *oc)
 	    current->mm && !oom_unkillable_task(current, NULL, oc->nodemask) &&
 	    current->signal->oom_score_adj != OOM_SCORE_ADJ_MIN) {
 		get_task_struct(current);
-		oc->chosen = current;
+		oc->chosen_task = current;
 		oom_kill_process(oc, "Out of memory (oom_kill_allocating_task)");
 		return true;
 	}
 
+	if (mem_cgroup_select_oom_victim(oc) && oom_kill_memcg_victim(oc))
+		return true;
+
 	select_bad_process(oc);
 	/* Found nothing?!?! Either we hang forever, or we panic. */
-	if (!oc->chosen && !is_sysrq_oom(oc) && !is_memcg_oom(oc)) {
+	if (!oc->chosen_task && !is_sysrq_oom(oc) && !is_memcg_oom(oc)) {
 		dump_header(oc, NULL);
 		panic("Out of memory and no killable processes...\n");
 	}
-	if (oc->chosen && oc->chosen != (void *)-1UL) {
+	if (oc->chosen_task && oc->chosen_task != INFLIGHT_VICTIM) {
 		oom_kill_process(oc, !is_memcg_oom(oc) ? "Out of memory" :
 				 "Memory cgroup out of memory");
 		/*
@@ -1078,7 +1129,7 @@ bool out_of_memory(struct oom_control *oc)
 		 */
 		schedule_timeout_killable(1);
 	}
-	return !!oc->chosen;
+	return !!oc->chosen_task;
 }
 
 /*
-- 
2.13.5

^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [v9 3/5] mm, oom: cgroup-aware OOM killer
@ 2017-09-27 13:09   ` Roman Gushchin
  0 siblings, 0 replies; 46+ messages in thread
From: Roman Gushchin @ 2017-09-27 13:09 UTC (permalink / raw)
  To: linux-mm
  Cc: Roman Gushchin, Michal Hocko, Vladimir Davydov, Johannes Weiner,
	Tetsuo Handa, David Rientjes, Andrew Morton, Tejun Heo,
	kernel-team, cgroups, linux-doc, linux-kernel

Traditionally, the OOM killer is operating on a process level.
Under oom conditions, it finds a process with the highest oom score
and kills it.

This behavior doesn't suit well the system with many running
containers:

1) There is no fairness between containers. A small container with
few large processes will be chosen over a large one with huge
number of small processes.

2) Containers often do not expect that some random process inside
will be killed. In many cases much safer behavior is to kill
all tasks in the container. Traditionally, this was implemented
in userspace, but doing it in the kernel has some advantages,
especially in a case of a system-wide OOM.

To address these issues, the cgroup-aware OOM killer is introduced.

Under OOM conditions, it looks for the biggest memory consumer:
a leaf memory cgroup or a memory cgroup with the memory.oom_group
option set. Then it kills either a task with the biggest memory
footprint, either all belonging tasks, if memory.oom_group is set.
If a cgroup has memory.oom_group set, all descendant cgroups
implicitly inherit the memory.oom_group setting.

Tasks with oom_score_adj set to -1000 are considered as unkillable.

The root cgroup is treated as a leaf memory cgroup, so it's score
is compared with other leaf and oom_group memory cgroups.
The oom_group option is not supported for the root cgroup.
Due to memcg statistics implementation a special algorithm
is used for estimating root cgroup oom_score: we define it
as maximum oom_score of the belonging tasks.

Signed-off-by: Roman Gushchin <guro@fb.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: David Rientjes <rientjes@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: kernel-team@fb.com
Cc: cgroups@vger.kernel.org
Cc: linux-doc@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
---
 include/linux/memcontrol.h |  38 ++++++++
 include/linux/oom.h        |  12 ++-
 mm/memcontrol.c            | 225 +++++++++++++++++++++++++++++++++++++++++++++
 mm/oom_kill.c              |  79 +++++++++++++---
 4 files changed, 339 insertions(+), 15 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 69966c461d1c..0289dc3d7434 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -35,6 +35,7 @@ struct mem_cgroup;
 struct page;
 struct mm_struct;
 struct kmem_cache;
+struct oom_control;
 
 /* Cgroup-specific page state, on top of universal node page state */
 enum memcg_stat_item {
@@ -199,6 +200,12 @@ struct mem_cgroup {
 	/* OOM-Killer disable */
 	int		oom_kill_disable;
 
+	/* kill all tasks in the subtree in case of OOM */
+	bool oom_group;
+
+	/* cached OOM score */
+	long oom_score;
+
 	/* handle for "memory.events" */
 	struct cgroup_file events_file;
 
@@ -342,6 +349,11 @@ struct mem_cgroup *mem_cgroup_from_css(struct cgroup_subsys_state *css){
 	return css ? container_of(css, struct mem_cgroup, css) : NULL;
 }
 
+static inline void mem_cgroup_put(struct mem_cgroup *memcg)
+{
+	css_put(&memcg->css);
+}
+
 #define mem_cgroup_from_counter(counter, member)	\
 	container_of(counter, struct mem_cgroup, member)
 
@@ -480,6 +492,18 @@ static inline bool task_in_memcg_oom(struct task_struct *p)
 
 bool mem_cgroup_oom_synchronize(bool wait);
 
+bool mem_cgroup_select_oom_victim(struct oom_control *oc);
+
+static inline bool mem_cgroup_oom_group(struct mem_cgroup *memcg)
+{
+	do {
+		if (memcg->oom_group)
+			return true;
+	} while ((memcg = parent_mem_cgroup(memcg)));
+
+	return false;
+}
+
 #ifdef CONFIG_MEMCG_SWAP
 extern int do_swap_account;
 #endif
@@ -744,6 +768,10 @@ static inline bool task_in_mem_cgroup(struct task_struct *task,
 	return true;
 }
 
+static inline void mem_cgroup_put(struct mem_cgroup *memcg)
+{
+}
+
 static inline struct mem_cgroup *
 mem_cgroup_iter(struct mem_cgroup *root,
 		struct mem_cgroup *prev,
@@ -936,6 +964,16 @@ static inline
 void count_memcg_event_mm(struct mm_struct *mm, enum vm_event_item idx)
 {
 }
+
+static inline bool mem_cgroup_select_oom_victim(struct oom_control *oc)
+{
+	return false;
+}
+
+static inline bool mem_cgroup_oom_group(struct mem_cgroup *memcg)
+{
+	return false;
+}
 #endif /* CONFIG_MEMCG */
 
 /* idx can be of type enum memcg_stat_item or node_stat_item */
diff --git a/include/linux/oom.h b/include/linux/oom.h
index 76aac4ce39bc..ca78e2d5956e 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -9,6 +9,13 @@
 #include <linux/sched/coredump.h> /* MMF_* */
 #include <linux/mm.h> /* VM_FAULT* */
 
+
+/*
+ * Special value returned by victim selection functions to indicate
+ * that are inflight OOM victims.
+ */
+#define INFLIGHT_VICTIM ((void *)-1UL)
+
 struct zonelist;
 struct notifier_block;
 struct mem_cgroup;
@@ -39,7 +46,8 @@ struct oom_control {
 
 	/* Used by oom implementation, do not set */
 	unsigned long totalpages;
-	struct task_struct *chosen;
+	struct task_struct *chosen_task;
+	struct mem_cgroup *chosen_memcg;
 	unsigned long chosen_points;
 };
 
@@ -101,6 +109,8 @@ extern void oom_killer_enable(void);
 
 extern struct task_struct *find_lock_task_mm(struct task_struct *p);
 
+extern int oom_evaluate_task(struct task_struct *task, void *arg);
+
 /* sysctls */
 extern int sysctl_oom_dump_tasks;
 extern int sysctl_oom_kill_allocating_task;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index fa1a5120ce3f..353bb590713e 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2670,6 +2670,198 @@ static inline bool memcg_has_children(struct mem_cgroup *memcg)
 	return ret;
 }
 
+static long memcg_oom_badness(struct mem_cgroup *memcg,
+			      const nodemask_t *nodemask,
+			      unsigned long totalpages)
+{
+	long points = 0;
+	int nid;
+	pg_data_t *pgdat;
+
+	/*
+	 * We don't have necessary stats for the root memcg,
+	 * so we define it's oom_score as the maximum oom_score
+	 * of the belonging tasks.
+	 */
+	if (memcg == root_mem_cgroup) {
+		struct css_task_iter it;
+		struct task_struct *task;
+		long score, max_score = 0;
+
+		css_task_iter_start(&memcg->css, 0, &it);
+		while ((task = css_task_iter_next(&it))) {
+			score = oom_badness(task, memcg, nodemask,
+					    totalpages);
+			if (score > max_score)
+				max_score = score;
+		}
+		css_task_iter_end(&it);
+
+		return max_score;
+	}
+
+	for_each_node_state(nid, N_MEMORY) {
+		if (nodemask && !node_isset(nid, *nodemask))
+			continue;
+
+		points += mem_cgroup_node_nr_lru_pages(memcg, nid,
+				LRU_ALL_ANON | BIT(LRU_UNEVICTABLE));
+
+		pgdat = NODE_DATA(nid);
+		points += lruvec_page_state(mem_cgroup_lruvec(pgdat, memcg),
+					    NR_SLAB_UNRECLAIMABLE);
+	}
+
+	points += memcg_page_state(memcg, MEMCG_KERNEL_STACK_KB) /
+		(PAGE_SIZE / 1024);
+	points += memcg_page_state(memcg, MEMCG_SOCK);
+	points += memcg_page_state(memcg, MEMCG_SWAP);
+
+	return points;
+}
+
+/*
+ * Checks if the given memcg is a valid OOM victim and returns a number,
+ * which means the folowing:
+ *   -1: there are inflight OOM victim tasks, belonging to the memcg
+ *    0: memcg is not eligible, e.g. all belonging tasks are protected
+ *       by oom_score_adj set to OOM_SCORE_ADJ_MIN
+ *   >0: memcg is eligible, and the returned value is an estimation
+ *       of the memory footprint
+ */
+static long oom_evaluate_memcg(struct mem_cgroup *memcg,
+			       const nodemask_t *nodemask,
+			       unsigned long totalpages)
+{
+	struct css_task_iter it;
+	struct task_struct *task;
+	int eligible = 0;
+
+	/*
+	 * Memcg is OOM eligible if there are OOM killable tasks inside.
+	 *
+	 * We treat tasks with oom_score_adj set to OOM_SCORE_ADJ_MIN
+	 * as unkillable.
+	 *
+	 * If there are inflight OOM victim tasks inside the memcg,
+	 * we return -1.
+	 */
+	css_task_iter_start(&memcg->css, 0, &it);
+	while ((task = css_task_iter_next(&it))) {
+		if (!eligible &&
+		    task->signal->oom_score_adj != OOM_SCORE_ADJ_MIN)
+			eligible = 1;
+
+		if (tsk_is_oom_victim(task) &&
+		    !test_bit(MMF_OOM_SKIP, &task->signal->oom_mm->flags)) {
+			eligible = -1;
+			break;
+		}
+	}
+	css_task_iter_end(&it);
+
+	if (eligible <= 0)
+		return eligible;
+
+	return memcg_oom_badness(memcg, nodemask, totalpages);
+}
+
+static void select_victim_memcg(struct mem_cgroup *root, struct oom_control *oc)
+{
+	struct mem_cgroup *iter, *parent;
+
+	/*
+	 * If OOM is memcg-wide, and the memcg or it's ancestor has
+	 * the oom_group flag, simple select the memcg as a victim.
+	 */
+	if (oc->memcg && mem_cgroup_oom_group(oc->memcg)) {
+		oc->chosen_memcg = oc->memcg;
+		css_get(&oc->chosen_memcg->css);
+		oc->chosen_points = oc->memcg->oom_score;
+		return;
+	}
+
+	oc->chosen_memcg = NULL;
+
+	/*
+	 * The oom_score is calculated for leaf memcgs and propagated upwards
+	 * by the tree.
+	 *
+	 * for_each_mem_cgroup_tree() walks the tree in pre-order,
+	 * so we simple reset oom_score for non-lead cgroups before
+	 * starting accumulating an actual value from underlying sub-tree.
+	 *
+	 * Root memcg is treated as a leaf memcg.
+	 */
+	rcu_read_lock();
+	for_each_mem_cgroup_tree(iter, root) {
+		if (memcg_has_children(iter) && iter != root_mem_cgroup) {
+			iter->oom_score = 0;
+			continue;
+		}
+
+		iter->oom_score = oom_evaluate_memcg(iter, oc->nodemask,
+						     oc->totalpages);
+
+		/*
+		 * Ignore empty and non-eligible memory cgroups.
+		 */
+		if (iter->oom_score == 0)
+			continue;
+
+		/*
+		 * If there are inflight OOM victims, we don't need to look
+		 * further for new victims.
+		 */
+		if (iter->oom_score == -1) {
+			oc->chosen_memcg = INFLIGHT_VICTIM;
+			mem_cgroup_iter_break(root, iter);
+			break;
+		}
+
+		if (iter->oom_score > oc->chosen_points) {
+			oc->chosen_memcg = iter;
+			oc->chosen_points = iter->oom_score;
+		}
+
+		for (parent = parent_mem_cgroup(iter); parent && parent != root;
+		     parent = parent_mem_cgroup(parent)) {
+			parent->oom_score += iter->oom_score;
+
+			if (mem_cgroup_oom_group(parent) &&
+			    parent->oom_score > oc->chosen_points) {
+				oc->chosen_memcg = parent;
+				oc->chosen_points = parent->oom_score;
+			}
+		}
+	}
+
+	if (oc->chosen_memcg && oc->chosen_memcg != INFLIGHT_VICTIM)
+		css_get(&oc->chosen_memcg->css);
+
+	rcu_read_unlock();
+}
+
+bool mem_cgroup_select_oom_victim(struct oom_control *oc)
+{
+	struct mem_cgroup *root;
+
+	if (mem_cgroup_disabled())
+		return false;
+
+	if (!cgroup_subsys_on_dfl(memory_cgrp_subsys))
+		return false;
+
+	if (oc->memcg)
+		root = oc->memcg;
+	else
+		root = root_mem_cgroup;
+
+	select_victim_memcg(root, oc);
+
+	return oc->chosen_memcg;
+}
+
 /*
  * Reclaims as many pages from the given memcg as possible.
  *
@@ -5267,6 +5459,33 @@ static ssize_t memory_max_write(struct kernfs_open_file *of,
 	return nbytes;
 }
 
+static int memory_oom_group_show(struct seq_file *m, void *v)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m));
+	bool oom_group = memcg->oom_group;
+
+	seq_printf(m, "%d\n", oom_group);
+
+	return 0;
+}
+
+static ssize_t memory_oom_group_write(struct kernfs_open_file *of,
+					       char *buf, size_t nbytes,
+					       loff_t off)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
+	int oom_group;
+	int err;
+
+	err = kstrtoint(strstrip(buf), 0, &oom_group);
+	if (err)
+		return err;
+
+	memcg->oom_group = oom_group;
+
+	return nbytes;
+}
+
 static int memory_events_show(struct seq_file *m, void *v)
 {
 	struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m));
@@ -5387,6 +5606,12 @@ static struct cftype memory_files[] = {
 		.write = memory_max_write,
 	},
 	{
+		.name = "oom_group",
+		.flags = CFTYPE_NOT_ON_ROOT,
+		.seq_show = memory_oom_group_show,
+		.write = memory_oom_group_write,
+	},
+	{
 		.name = "events",
 		.flags = CFTYPE_NOT_ON_ROOT,
 		.file_offset = offsetof(struct mem_cgroup, events_file),
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 9d65a173e0f6..c69e5e397407 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -289,7 +289,7 @@ static enum oom_constraint constrained_alloc(struct oom_control *oc)
 	return CONSTRAINT_NONE;
 }
 
-static int oom_evaluate_task(struct task_struct *task, void *arg)
+int oom_evaluate_task(struct task_struct *task, void *arg)
 {
 	struct oom_control *oc = arg;
 	unsigned long points;
@@ -323,26 +323,26 @@ static int oom_evaluate_task(struct task_struct *task, void *arg)
 		goto next;
 
 	/* Prefer thread group leaders for display purposes */
-	if (points == oc->chosen_points && thread_group_leader(oc->chosen))
+	if (points == oc->chosen_points && thread_group_leader(oc->chosen_task))
 		goto next;
 select:
-	if (oc->chosen)
-		put_task_struct(oc->chosen);
+	if (oc->chosen_task)
+		put_task_struct(oc->chosen_task);
 	get_task_struct(task);
-	oc->chosen = task;
+	oc->chosen_task = task;
 	oc->chosen_points = points;
 next:
 	return 0;
 abort:
-	if (oc->chosen)
-		put_task_struct(oc->chosen);
-	oc->chosen = (void *)-1UL;
+	if (oc->chosen_task)
+		put_task_struct(oc->chosen_task);
+	oc->chosen_task = INFLIGHT_VICTIM;
 	return 1;
 }
 
 /*
  * Simple selection loop. We choose the process with the highest number of
- * 'points'. In case scan was aborted, oc->chosen is set to -1.
+ * 'points'. In case scan was aborted, oc->chosen_task is set to -1.
  */
 static void select_bad_process(struct oom_control *oc)
 {
@@ -826,6 +826,12 @@ static void __oom_kill_process(struct task_struct *victim)
 	struct mm_struct *mm;
 	bool can_oom_reap = true;
 
+	if (is_global_init(victim) || (victim->flags & PF_KTHREAD) ||
+	    victim->signal->oom_score_adj == OOM_SCORE_ADJ_MIN) {
+		put_task_struct(victim);
+		return;
+	}
+
 	p = find_lock_task_mm(victim);
 	if (!p) {
 		put_task_struct(victim);
@@ -901,7 +907,7 @@ static void __oom_kill_process(struct task_struct *victim)
 
 static void oom_kill_process(struct oom_control *oc, const char *message)
 {
-	struct task_struct *p = oc->chosen;
+	struct task_struct *p = oc->chosen_task;
 	unsigned int points = oc->chosen_points;
 	struct task_struct *victim = p;
 	struct task_struct *child;
@@ -962,6 +968,48 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
 	__oom_kill_process(victim);
 }
 
+static int oom_kill_memcg_member(struct task_struct *task, void *unused)
+{
+	if (!tsk_is_oom_victim(task)) {
+		get_task_struct(task);
+		__oom_kill_process(task);
+	}
+	return 0;
+}
+
+static bool oom_kill_memcg_victim(struct oom_control *oc)
+{
+	static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
+				      DEFAULT_RATELIMIT_BURST);
+
+	if (oc->chosen_memcg == NULL || oc->chosen_memcg == INFLIGHT_VICTIM)
+		return oc->chosen_memcg;
+
+	/* Always begin with the task with the biggest memory footprint */
+	oc->chosen_points = 0;
+	oc->chosen_task = NULL;
+	mem_cgroup_scan_tasks(oc->chosen_memcg, oom_evaluate_task, oc);
+
+	if (oc->chosen_task == NULL || oc->chosen_task == INFLIGHT_VICTIM)
+		goto out;
+
+	if (__ratelimit(&oom_rs))
+		dump_header(oc, oc->chosen_task);
+
+	__oom_kill_process(oc->chosen_task);
+
+	/* If oom_group flag is set, kill all belonging tasks */
+	if (mem_cgroup_oom_group(oc->chosen_memcg))
+		mem_cgroup_scan_tasks(oc->chosen_memcg, oom_kill_memcg_member,
+				      NULL);
+
+	schedule_timeout_killable(1);
+
+out:
+	mem_cgroup_put(oc->chosen_memcg);
+	return oc->chosen_task;
+}
+
 /*
  * Determines whether the kernel must panic because of the panic_on_oom sysctl.
  */
@@ -1058,18 +1106,21 @@ bool out_of_memory(struct oom_control *oc)
 	    current->mm && !oom_unkillable_task(current, NULL, oc->nodemask) &&
 	    current->signal->oom_score_adj != OOM_SCORE_ADJ_MIN) {
 		get_task_struct(current);
-		oc->chosen = current;
+		oc->chosen_task = current;
 		oom_kill_process(oc, "Out of memory (oom_kill_allocating_task)");
 		return true;
 	}
 
+	if (mem_cgroup_select_oom_victim(oc) && oom_kill_memcg_victim(oc))
+		return true;
+
 	select_bad_process(oc);
 	/* Found nothing?!?! Either we hang forever, or we panic. */
-	if (!oc->chosen && !is_sysrq_oom(oc) && !is_memcg_oom(oc)) {
+	if (!oc->chosen_task && !is_sysrq_oom(oc) && !is_memcg_oom(oc)) {
 		dump_header(oc, NULL);
 		panic("Out of memory and no killable processes...\n");
 	}
-	if (oc->chosen && oc->chosen != (void *)-1UL) {
+	if (oc->chosen_task && oc->chosen_task != INFLIGHT_VICTIM) {
 		oom_kill_process(oc, !is_memcg_oom(oc) ? "Out of memory" :
 				 "Memory cgroup out of memory");
 		/*
@@ -1078,7 +1129,7 @@ bool out_of_memory(struct oom_control *oc)
 		 */
 		schedule_timeout_killable(1);
 	}
-	return !!oc->chosen;
+	return !!oc->chosen_task;
 }
 
 /*
-- 
2.13.5

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [v9 4/5] mm, oom: add cgroup v2 mount option for cgroup-aware OOM killer
  2017-09-27 13:09 ` Roman Gushchin
@ 2017-09-27 13:09   ` Roman Gushchin
  -1 siblings, 0 replies; 46+ messages in thread
From: Roman Gushchin @ 2017-09-27 13:09 UTC (permalink / raw)
  To: linux-mm
  Cc: Roman Gushchin, Michal Hocko, Vladimir Davydov, Johannes Weiner,
	Tetsuo Handa, David Rientjes, Andrew Morton, Tejun Heo,
	kernel-team, cgroups, linux-doc, linux-kernel

Add a "groupoom" cgroup v2 mount option to enable the cgroup-aware
OOM killer. If not set, the OOM selection is performed in
a "traditional" per-process way.

The behavior can be changed dynamically by remounting the cgroupfs.

Signed-off-by: Roman Gushchin <guro@fb.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: David Rientjes <rientjes@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: kernel-team@fb.com
Cc: cgroups@vger.kernel.org
Cc: linux-doc@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
---
 include/linux/cgroup-defs.h |  5 +++++
 kernel/cgroup/cgroup.c      | 10 ++++++++++
 mm/memcontrol.c             |  3 +++
 3 files changed, 18 insertions(+)

diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index 3e55bbd31ad1..cae5343a8b21 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -80,6 +80,11 @@ enum {
 	 * Enable cpuset controller in v1 cgroup to use v2 behavior.
 	 */
 	CGRP_ROOT_CPUSET_V2_MODE = (1 << 4),
+
+	/*
+	 * Enable cgroup-aware OOM killer.
+	 */
+	CGRP_GROUP_OOM = (1 << 5),
 };
 
 /* cftype->flags */
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index c3421ee0d230..8d8aa46ff930 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -1709,6 +1709,9 @@ static int parse_cgroup_root_flags(char *data, unsigned int *root_flags)
 		if (!strcmp(token, "nsdelegate")) {
 			*root_flags |= CGRP_ROOT_NS_DELEGATE;
 			continue;
+		} else if (!strcmp(token, "groupoom")) {
+			*root_flags |= CGRP_GROUP_OOM;
+			continue;
 		}
 
 		pr_err("cgroup2: unknown option \"%s\"\n", token);
@@ -1725,6 +1728,11 @@ static void apply_cgroup_root_flags(unsigned int root_flags)
 			cgrp_dfl_root.flags |= CGRP_ROOT_NS_DELEGATE;
 		else
 			cgrp_dfl_root.flags &= ~CGRP_ROOT_NS_DELEGATE;
+
+		if (root_flags & CGRP_GROUP_OOM)
+			cgrp_dfl_root.flags |= CGRP_GROUP_OOM;
+		else
+			cgrp_dfl_root.flags &= ~CGRP_GROUP_OOM;
 	}
 }
 
@@ -1732,6 +1740,8 @@ static int cgroup_show_options(struct seq_file *seq, struct kernfs_root *kf_root
 {
 	if (cgrp_dfl_root.flags & CGRP_ROOT_NS_DELEGATE)
 		seq_puts(seq, ",nsdelegate");
+	if (cgrp_dfl_root.flags & CGRP_GROUP_OOM)
+		seq_puts(seq, ",groupoom");
 	return 0;
 }
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 353bb590713e..3f82b6f22d63 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2852,6 +2852,9 @@ bool mem_cgroup_select_oom_victim(struct oom_control *oc)
 	if (!cgroup_subsys_on_dfl(memory_cgrp_subsys))
 		return false;
 
+	if (!(cgrp_dfl_root.flags & CGRP_GROUP_OOM))
+		return false;
+
 	if (oc->memcg)
 		root = oc->memcg;
 	else
-- 
2.13.5

^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [v9 4/5] mm, oom: add cgroup v2 mount option for cgroup-aware OOM killer
@ 2017-09-27 13:09   ` Roman Gushchin
  0 siblings, 0 replies; 46+ messages in thread
From: Roman Gushchin @ 2017-09-27 13:09 UTC (permalink / raw)
  To: linux-mm
  Cc: Roman Gushchin, Michal Hocko, Vladimir Davydov, Johannes Weiner,
	Tetsuo Handa, David Rientjes, Andrew Morton, Tejun Heo,
	kernel-team, cgroups, linux-doc, linux-kernel

Add a "groupoom" cgroup v2 mount option to enable the cgroup-aware
OOM killer. If not set, the OOM selection is performed in
a "traditional" per-process way.

The behavior can be changed dynamically by remounting the cgroupfs.

Signed-off-by: Roman Gushchin <guro@fb.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: David Rientjes <rientjes@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: kernel-team@fb.com
Cc: cgroups@vger.kernel.org
Cc: linux-doc@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
---
 include/linux/cgroup-defs.h |  5 +++++
 kernel/cgroup/cgroup.c      | 10 ++++++++++
 mm/memcontrol.c             |  3 +++
 3 files changed, 18 insertions(+)

diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index 3e55bbd31ad1..cae5343a8b21 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -80,6 +80,11 @@ enum {
 	 * Enable cpuset controller in v1 cgroup to use v2 behavior.
 	 */
 	CGRP_ROOT_CPUSET_V2_MODE = (1 << 4),
+
+	/*
+	 * Enable cgroup-aware OOM killer.
+	 */
+	CGRP_GROUP_OOM = (1 << 5),
 };
 
 /* cftype->flags */
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index c3421ee0d230..8d8aa46ff930 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -1709,6 +1709,9 @@ static int parse_cgroup_root_flags(char *data, unsigned int *root_flags)
 		if (!strcmp(token, "nsdelegate")) {
 			*root_flags |= CGRP_ROOT_NS_DELEGATE;
 			continue;
+		} else if (!strcmp(token, "groupoom")) {
+			*root_flags |= CGRP_GROUP_OOM;
+			continue;
 		}
 
 		pr_err("cgroup2: unknown option \"%s\"\n", token);
@@ -1725,6 +1728,11 @@ static void apply_cgroup_root_flags(unsigned int root_flags)
 			cgrp_dfl_root.flags |= CGRP_ROOT_NS_DELEGATE;
 		else
 			cgrp_dfl_root.flags &= ~CGRP_ROOT_NS_DELEGATE;
+
+		if (root_flags & CGRP_GROUP_OOM)
+			cgrp_dfl_root.flags |= CGRP_GROUP_OOM;
+		else
+			cgrp_dfl_root.flags &= ~CGRP_GROUP_OOM;
 	}
 }
 
@@ -1732,6 +1740,8 @@ static int cgroup_show_options(struct seq_file *seq, struct kernfs_root *kf_root
 {
 	if (cgrp_dfl_root.flags & CGRP_ROOT_NS_DELEGATE)
 		seq_puts(seq, ",nsdelegate");
+	if (cgrp_dfl_root.flags & CGRP_GROUP_OOM)
+		seq_puts(seq, ",groupoom");
 	return 0;
 }
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 353bb590713e..3f82b6f22d63 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2852,6 +2852,9 @@ bool mem_cgroup_select_oom_victim(struct oom_control *oc)
 	if (!cgroup_subsys_on_dfl(memory_cgrp_subsys))
 		return false;
 
+	if (!(cgrp_dfl_root.flags & CGRP_GROUP_OOM))
+		return false;
+
 	if (oc->memcg)
 		root = oc->memcg;
 	else
-- 
2.13.5

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [v9 5/5] mm, oom, docs: describe the cgroup-aware OOM killer
  2017-09-27 13:09 ` Roman Gushchin
@ 2017-09-27 13:09   ` Roman Gushchin
  -1 siblings, 0 replies; 46+ messages in thread
From: Roman Gushchin @ 2017-09-27 13:09 UTC (permalink / raw)
  To: linux-mm
  Cc: Roman Gushchin, Michal Hocko, Vladimir Davydov, Johannes Weiner,
	Tetsuo Handa, Andrew Morton, David Rientjes, Tejun Heo,
	kernel-team, cgroups, linux-doc, linux-kernel

Document the cgroup-aware OOM killer.

Signed-off-by: Roman Gushchin <guro@fb.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: kernel-team@fb.com
Cc: cgroups@vger.kernel.org
Cc: linux-doc@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
---
 Documentation/cgroup-v2.txt | 44 ++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 44 insertions(+)

diff --git a/Documentation/cgroup-v2.txt b/Documentation/cgroup-v2.txt
index 3f8216912df0..936dd60b8d6a 100644
--- a/Documentation/cgroup-v2.txt
+++ b/Documentation/cgroup-v2.txt
@@ -48,6 +48,7 @@ v1 is available under Documentation/cgroup-v1/.
        5-2-1. Memory Interface Files
        5-2-2. Usage Guidelines
        5-2-3. Memory Ownership
+       5-2-4. OOM Killer
      5-3. IO
        5-3-1. IO Interface Files
        5-3-2. Writeback
@@ -1043,6 +1044,21 @@ PAGE_SIZE multiple when read back.
 	high limit is used and monitored properly, this limit's
 	utility is limited to providing the final safety net.
 
+  memory.oom_group
+
+	A read-write single value file which exists on non-root
+	cgroups.  The default is "0".
+
+	If set, OOM killer will consider the memory cgroup and all
+	descendant cgroups as indivisible memory consumers and compare
+	them with other memory consumers by their memory footprint.
+	If such memory cgroup is selected as an OOM victim, all
+	processes belonging to it or it's descendants will be killed.
+
+	OOM killer respects the /proc/pid/oom_score_adj value -1000,
+	and will never kill the unkillable task, even if memory.oom_group
+	is set.
+
   memory.events
 	A read-only flat-keyed file which exists on non-root cgroups.
 	The following entries are defined.  Unless specified
@@ -1246,6 +1262,34 @@ to be accessed repeatedly by other cgroups, it may make sense to use
 POSIX_FADV_DONTNEED to relinquish the ownership of memory areas
 belonging to the affected files to ensure correct memory ownership.
 
+OOM Killer
+~~~~~~~~~~
+
+Cgroup v2 memory controller implements a cgroup-aware OOM killer.
+It means that it treats cgroups as first class OOM entities.
+
+Under OOM conditions the memory controller tries to make the best
+choice of a victim, looking for a memory cgroup with the largest
+memory footprint, considering leaf cgroups and cgroups with the
+memory.oom_group option set, which are considered to be an indivisible
+memory consumers.
+
+By default, OOM killer will kill the biggest task in the selected
+memory cgroup. A user can change this behavior by enabling
+the per-cgroup memory.oom_group option. If set, it causes
+the OOM killer to kill all processes attached to the cgroup,
+except processes with oom_score_adj set to -1000.
+
+This affects both system- and cgroup-wide OOMs. For a cgroup-wide OOM
+the memory controller considers only cgroups belonging to the sub-tree
+of the OOM'ing cgroup.
+
+The root cgroup is treated as a leaf memory cgroup, so it's compared
+with other leaf memory cgroups and cgroups with oom_group option set.
+
+If there are no cgroups with the enabled memory controller,
+the OOM killer is using the "traditional" process-based approach.
+
 
 IO
 --
-- 
2.13.5

^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [v9 5/5] mm, oom, docs: describe the cgroup-aware OOM killer
@ 2017-09-27 13:09   ` Roman Gushchin
  0 siblings, 0 replies; 46+ messages in thread
From: Roman Gushchin @ 2017-09-27 13:09 UTC (permalink / raw)
  To: linux-mm
  Cc: Roman Gushchin, Michal Hocko, Vladimir Davydov, Johannes Weiner,
	Tetsuo Handa, Andrew Morton, David Rientjes, Tejun Heo,
	kernel-team, cgroups, linux-doc, linux-kernel

Document the cgroup-aware OOM killer.

Signed-off-by: Roman Gushchin <guro@fb.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: kernel-team@fb.com
Cc: cgroups@vger.kernel.org
Cc: linux-doc@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
---
 Documentation/cgroup-v2.txt | 44 ++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 44 insertions(+)

diff --git a/Documentation/cgroup-v2.txt b/Documentation/cgroup-v2.txt
index 3f8216912df0..936dd60b8d6a 100644
--- a/Documentation/cgroup-v2.txt
+++ b/Documentation/cgroup-v2.txt
@@ -48,6 +48,7 @@ v1 is available under Documentation/cgroup-v1/.
        5-2-1. Memory Interface Files
        5-2-2. Usage Guidelines
        5-2-3. Memory Ownership
+       5-2-4. OOM Killer
      5-3. IO
        5-3-1. IO Interface Files
        5-3-2. Writeback
@@ -1043,6 +1044,21 @@ PAGE_SIZE multiple when read back.
 	high limit is used and monitored properly, this limit's
 	utility is limited to providing the final safety net.
 
+  memory.oom_group
+
+	A read-write single value file which exists on non-root
+	cgroups.  The default is "0".
+
+	If set, OOM killer will consider the memory cgroup and all
+	descendant cgroups as indivisible memory consumers and compare
+	them with other memory consumers by their memory footprint.
+	If such memory cgroup is selected as an OOM victim, all
+	processes belonging to it or it's descendants will be killed.
+
+	OOM killer respects the /proc/pid/oom_score_adj value -1000,
+	and will never kill the unkillable task, even if memory.oom_group
+	is set.
+
   memory.events
 	A read-only flat-keyed file which exists on non-root cgroups.
 	The following entries are defined.  Unless specified
@@ -1246,6 +1262,34 @@ to be accessed repeatedly by other cgroups, it may make sense to use
 POSIX_FADV_DONTNEED to relinquish the ownership of memory areas
 belonging to the affected files to ensure correct memory ownership.
 
+OOM Killer
+~~~~~~~~~~
+
+Cgroup v2 memory controller implements a cgroup-aware OOM killer.
+It means that it treats cgroups as first class OOM entities.
+
+Under OOM conditions the memory controller tries to make the best
+choice of a victim, looking for a memory cgroup with the largest
+memory footprint, considering leaf cgroups and cgroups with the
+memory.oom_group option set, which are considered to be an indivisible
+memory consumers.
+
+By default, OOM killer will kill the biggest task in the selected
+memory cgroup. A user can change this behavior by enabling
+the per-cgroup memory.oom_group option. If set, it causes
+the OOM killer to kill all processes attached to the cgroup,
+except processes with oom_score_adj set to -1000.
+
+This affects both system- and cgroup-wide OOMs. For a cgroup-wide OOM
+the memory controller considers only cgroups belonging to the sub-tree
+of the OOM'ing cgroup.
+
+The root cgroup is treated as a leaf memory cgroup, so it's compared
+with other leaf memory cgroups and cgroups with oom_group option set.
+
+If there are no cgroups with the enabled memory controller,
+the OOM killer is using the "traditional" process-based approach.
+
 
 IO
 --
-- 
2.13.5

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 46+ messages in thread

* Re: [v9 2/5] mm: implement mem_cgroup_scan_tasks() for the root memory cgroup
  2017-09-27 13:09   ` Roman Gushchin
@ 2017-10-03 10:49     ` Michal Hocko
  -1 siblings, 0 replies; 46+ messages in thread
From: Michal Hocko @ 2017-10-03 10:49 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: linux-mm, Vladimir Davydov, Johannes Weiner, Tetsuo Handa,
	David Rientjes, Andrew Morton, Tejun Heo, kernel-team, cgroups,
	linux-doc, linux-kernel

On Wed 27-09-17 14:09:33, Roman Gushchin wrote:
> Implement mem_cgroup_scan_tasks() functionality for the root
> memory cgroup to use this function for looking for a OOM victim
> task in the root memory cgroup by the cgroup-ware OOM killer.
> 
> The root memory cgroup should be treated as a leaf cgroup,
> so only tasks which are directly belonging to the root cgroup
> should be iterated over.

I would only add that this patch doesn't introduce any functionally
visible change because we never trigger oom killer with the root memcg
as the root of the hierarchy. So this is just a preparatory work for
later changes.

> Signed-off-by: Roman Gushchin <guro@fb.com>
> Cc: Michal Hocko <mhocko@kernel.org>
> Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> Cc: David Rientjes <rientjes@google.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Tejun Heo <tj@kernel.org>
> Cc: kernel-team@fb.com
> Cc: cgroups@vger.kernel.org
> Cc: linux-doc@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org
> Cc: linux-mm@kvack.org

Acked-by: Michal Hocko <mhocko@suse.com>

> ---
>  mm/memcontrol.c | 7 +++----
>  1 file changed, 3 insertions(+), 4 deletions(-)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index ae37b5624eb2..fa1a5120ce3f 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -917,7 +917,8 @@ static void invalidate_reclaim_iterators(struct mem_cgroup *dead_memcg)
>   * value, the function breaks the iteration loop and returns the value.
>   * Otherwise, it will iterate over all tasks and return 0.
>   *
> - * This function must not be called for the root memory cgroup.
> + * If memcg is the root memory cgroup, this function will iterate only
> + * over tasks belonging directly to the root memory cgroup.
>   */
>  int mem_cgroup_scan_tasks(struct mem_cgroup *memcg,
>  			  int (*fn)(struct task_struct *, void *), void *arg)
> @@ -925,8 +926,6 @@ int mem_cgroup_scan_tasks(struct mem_cgroup *memcg,
>  	struct mem_cgroup *iter;
>  	int ret = 0;
>  
> -	BUG_ON(memcg == root_mem_cgroup);
> -
>  	for_each_mem_cgroup_tree(iter, memcg) {
>  		struct css_task_iter it;
>  		struct task_struct *task;
> @@ -935,7 +934,7 @@ int mem_cgroup_scan_tasks(struct mem_cgroup *memcg,
>  		while (!ret && (task = css_task_iter_next(&it)))
>  			ret = fn(task, arg);
>  		css_task_iter_end(&it);
> -		if (ret) {
> +		if (ret || memcg == root_mem_cgroup) {
>  			mem_cgroup_iter_break(memcg, iter);
>  			break;
>  		}
> -- 
> 2.13.5

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [v9 2/5] mm: implement mem_cgroup_scan_tasks() for the root memory cgroup
@ 2017-10-03 10:49     ` Michal Hocko
  0 siblings, 0 replies; 46+ messages in thread
From: Michal Hocko @ 2017-10-03 10:49 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: linux-mm, Vladimir Davydov, Johannes Weiner, Tetsuo Handa,
	David Rientjes, Andrew Morton, Tejun Heo, kernel-team, cgroups,
	linux-doc, linux-kernel

On Wed 27-09-17 14:09:33, Roman Gushchin wrote:
> Implement mem_cgroup_scan_tasks() functionality for the root
> memory cgroup to use this function for looking for a OOM victim
> task in the root memory cgroup by the cgroup-ware OOM killer.
> 
> The root memory cgroup should be treated as a leaf cgroup,
> so only tasks which are directly belonging to the root cgroup
> should be iterated over.

I would only add that this patch doesn't introduce any functionally
visible change because we never trigger oom killer with the root memcg
as the root of the hierarchy. So this is just a preparatory work for
later changes.

> Signed-off-by: Roman Gushchin <guro@fb.com>
> Cc: Michal Hocko <mhocko@kernel.org>
> Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> Cc: David Rientjes <rientjes@google.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Tejun Heo <tj@kernel.org>
> Cc: kernel-team@fb.com
> Cc: cgroups@vger.kernel.org
> Cc: linux-doc@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org
> Cc: linux-mm@kvack.org

Acked-by: Michal Hocko <mhocko@suse.com>

> ---
>  mm/memcontrol.c | 7 +++----
>  1 file changed, 3 insertions(+), 4 deletions(-)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index ae37b5624eb2..fa1a5120ce3f 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -917,7 +917,8 @@ static void invalidate_reclaim_iterators(struct mem_cgroup *dead_memcg)
>   * value, the function breaks the iteration loop and returns the value.
>   * Otherwise, it will iterate over all tasks and return 0.
>   *
> - * This function must not be called for the root memory cgroup.
> + * If memcg is the root memory cgroup, this function will iterate only
> + * over tasks belonging directly to the root memory cgroup.
>   */
>  int mem_cgroup_scan_tasks(struct mem_cgroup *memcg,
>  			  int (*fn)(struct task_struct *, void *), void *arg)
> @@ -925,8 +926,6 @@ int mem_cgroup_scan_tasks(struct mem_cgroup *memcg,
>  	struct mem_cgroup *iter;
>  	int ret = 0;
>  
> -	BUG_ON(memcg == root_mem_cgroup);
> -
>  	for_each_mem_cgroup_tree(iter, memcg) {
>  		struct css_task_iter it;
>  		struct task_struct *task;
> @@ -935,7 +934,7 @@ int mem_cgroup_scan_tasks(struct mem_cgroup *memcg,
>  		while (!ret && (task = css_task_iter_next(&it)))
>  			ret = fn(task, arg);
>  		css_task_iter_end(&it);
> -		if (ret) {
> +		if (ret || memcg == root_mem_cgroup) {
>  			mem_cgroup_iter_break(memcg, iter);
>  			break;
>  		}
> -- 
> 2.13.5

-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [v9 3/5] mm, oom: cgroup-aware OOM killer
  2017-09-27 13:09   ` Roman Gushchin
@ 2017-10-03 11:48     ` Michal Hocko
  -1 siblings, 0 replies; 46+ messages in thread
From: Michal Hocko @ 2017-10-03 11:48 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: linux-mm, Vladimir Davydov, Johannes Weiner, Tetsuo Handa,
	David Rientjes, Andrew Morton, Tejun Heo, kernel-team, cgroups,
	linux-doc, linux-kernel

On Wed 27-09-17 14:09:34, Roman Gushchin wrote:
> Traditionally, the OOM killer is operating on a process level.
> Under oom conditions, it finds a process with the highest oom score
> and kills it.
> 
> This behavior doesn't suit well the system with many running
> containers:
> 
> 1) There is no fairness between containers. A small container with
> few large processes will be chosen over a large one with huge
> number of small processes.
> 
> 2) Containers often do not expect that some random process inside
> will be killed. In many cases much safer behavior is to kill
> all tasks in the container. Traditionally, this was implemented
> in userspace, but doing it in the kernel has some advantages,
> especially in a case of a system-wide OOM.
> 
> To address these issues, the cgroup-aware OOM killer is introduced.
> 
> Under OOM conditions, it looks for the biggest memory consumer:
> a leaf memory cgroup or a memory cgroup with the memory.oom_group
> option set. Then it kills either a task with the biggest memory
> footprint, either all belonging tasks, if memory.oom_group is set.
> If a cgroup has memory.oom_group set, all descendant cgroups
> implicitly inherit the memory.oom_group setting.

I think it would be better to separate oom_group into its own patch.
So this patch would just add the cgroup awareness and oom_group will
build on top of that.

Wrt. to the implicit inheritance you brought up in a separate email
thread [1]. Let me quote
: after some additional thinking I don't think anymore that implicit
: propagation of oom_group is a good idea.  Let me explain: assume we
: have memcg A with memory.max and memory.oom_group set, and nested
: memcg A/B with memory.max set. Let's imagine we have an OOM event if
: A/B. What is an expected system behavior?
: We have OOM scoped to A/B, and any action should be also scoped to A/B.
: We really shouldn't touch processes which are not belonging to A/B.
: That means we should either kill the biggest process in A/B, either all
: processes in A/B. It's natural to make A/B/memory.oom_group responsible
: for this decision. It's strange to make the depend on A/memory.oom_group, IMO.
: It really makes no sense, and makes oom_group knob really hard to describe.
: 
: Also, after some off-list discussion, we've realized that memory.oom_knob
: should be delegatable. The workload should have control over it to express
: dependency between processes.

OK, I have asked about this already but I am not sure the answer was
very explicit. So let me ask again. When exactly a subtree would
disagree with the parent on oom_group? In other words when do we want a
different cleanup based on the OOM root? I am not saying this is wrong
I am just curious about a practical example.

> Tasks with oom_score_adj set to -1000 are considered as unkillable.
> 
> The root cgroup is treated as a leaf memory cgroup, so it's score
> is compared with other leaf and oom_group memory cgroups.
> The oom_group option is not supported for the root cgroup.
> Due to memcg statistics implementation a special algorithm
> is used for estimating root cgroup oom_score: we define it
> as maximum oom_score of the belonging tasks.

[1] http://lkml.kernel.org/r/20171002124712.GA17638@castle.DHCP.thefacebook.com

[...]
> +static long memcg_oom_badness(struct mem_cgroup *memcg,
> +			      const nodemask_t *nodemask,
> +			      unsigned long totalpages)
> +{
> +	long points = 0;
> +	int nid;
> +	pg_data_t *pgdat;
> +
> +	/*
> +	 * We don't have necessary stats for the root memcg,
> +	 * so we define it's oom_score as the maximum oom_score
> +	 * of the belonging tasks.
> +	 */

Why not a sum of all tasks which would more resemble what we do for
other memcgs? Sure this would require ignoring oom_score_adj so
oom_badness would have to be tweaked a bit (basically split it into
__oom_badness which calculates the value without the bias and
oom_badness on top adding the bias on top of the scaled value).

> +	if (memcg == root_mem_cgroup) {
> +		struct css_task_iter it;
> +		struct task_struct *task;
> +		long score, max_score = 0;
> +
> +		css_task_iter_start(&memcg->css, 0, &it);
> +		while ((task = css_task_iter_next(&it))) {
> +			score = oom_badness(task, memcg, nodemask,
> +					    totalpages);
> +			if (score > max_score)
> +				max_score = score;
> +		}
> +		css_task_iter_end(&it);
> +
> +		return max_score;
> +	}
> +
> +	for_each_node_state(nid, N_MEMORY) {
> +		if (nodemask && !node_isset(nid, *nodemask))
> +			continue;
> +
> +		points += mem_cgroup_node_nr_lru_pages(memcg, nid,
> +				LRU_ALL_ANON | BIT(LRU_UNEVICTABLE));
> +
> +		pgdat = NODE_DATA(nid);
> +		points += lruvec_page_state(mem_cgroup_lruvec(pgdat, memcg),
> +					    NR_SLAB_UNRECLAIMABLE);
> +	}
> +
> +	points += memcg_page_state(memcg, MEMCG_KERNEL_STACK_KB) /
> +		(PAGE_SIZE / 1024);
> +	points += memcg_page_state(memcg, MEMCG_SOCK);
> +	points += memcg_page_state(memcg, MEMCG_SWAP);
> +
> +	return points;
> +}
> +
> +/*
> + * Checks if the given memcg is a valid OOM victim and returns a number,
> + * which means the folowing:
> + *   -1: there are inflight OOM victim tasks, belonging to the memcg
> + *    0: memcg is not eligible, e.g. all belonging tasks are protected
> + *       by oom_score_adj set to OOM_SCORE_ADJ_MIN
> + *   >0: memcg is eligible, and the returned value is an estimation
> + *       of the memory footprint
> + */
> +static long oom_evaluate_memcg(struct mem_cgroup *memcg,
> +			       const nodemask_t *nodemask,
> +			       unsigned long totalpages)
> +{
> +	struct css_task_iter it;
> +	struct task_struct *task;
> +	int eligible = 0;
> +
> +	/*
> +	 * Memcg is OOM eligible if there are OOM killable tasks inside.
> +	 *
> +	 * We treat tasks with oom_score_adj set to OOM_SCORE_ADJ_MIN
> +	 * as unkillable.
> +	 *
> +	 * If there are inflight OOM victim tasks inside the memcg,
> +	 * we return -1.
> +	 */
> +	css_task_iter_start(&memcg->css, 0, &it);
> +	while ((task = css_task_iter_next(&it))) {
> +		if (!eligible &&
> +		    task->signal->oom_score_adj != OOM_SCORE_ADJ_MIN)
> +			eligible = 1;
> +
> +		if (tsk_is_oom_victim(task) &&
> +		    !test_bit(MMF_OOM_SKIP, &task->signal->oom_mm->flags)) {
> +			eligible = -1;
> +			break;
> +		}
> +	}
> +	css_task_iter_end(&it);
> +
> +	if (eligible <= 0)
> +		return eligible;
> +
> +	return memcg_oom_badness(memcg, nodemask, totalpages);
> +}
> +
> +static void select_victim_memcg(struct mem_cgroup *root, struct oom_control *oc)
> +{
> +	struct mem_cgroup *iter, *parent;
> +
> +	/*
> +	 * If OOM is memcg-wide, and the memcg or it's ancestor has
> +	 * the oom_group flag, simple select the memcg as a victim.
> +	 */
> +	if (oc->memcg && mem_cgroup_oom_group(oc->memcg)) {
> +		oc->chosen_memcg = oc->memcg;
> +		css_get(&oc->chosen_memcg->css);
> +		oc->chosen_points = oc->memcg->oom_score;
> +		return;
> +	}
> +
> +	oc->chosen_memcg = NULL;
> +
> +	/*
> +	 * The oom_score is calculated for leaf memcgs and propagated upwards
> +	 * by the tree.
> +	 *
> +	 * for_each_mem_cgroup_tree() walks the tree in pre-order,
> +	 * so we simple reset oom_score for non-lead cgroups before
> +	 * starting accumulating an actual value from underlying sub-tree.
> +	 *
> +	 * Root memcg is treated as a leaf memcg.
> +	 */
> +	rcu_read_lock();
> +	for_each_mem_cgroup_tree(iter, root) {
> +		if (memcg_has_children(iter) && iter != root_mem_cgroup) {
> +			iter->oom_score = 0;
> +			continue;
> +		}
> +
> +		iter->oom_score = oom_evaluate_memcg(iter, oc->nodemask,
> +						     oc->totalpages);
> +
> +		/*
> +		 * Ignore empty and non-eligible memory cgroups.
> +		 */
> +		if (iter->oom_score == 0)
> +			continue;
> +
> +		/*
> +		 * If there are inflight OOM victims, we don't need to look
> +		 * further for new victims.
> +		 */
> +		if (iter->oom_score == -1) {
> +			oc->chosen_memcg = INFLIGHT_VICTIM;
> +			mem_cgroup_iter_break(root, iter);
> +			break;
> +		}
> +
> +		if (iter->oom_score > oc->chosen_points) {
> +			oc->chosen_memcg = iter;
> +			oc->chosen_points = iter->oom_score;
> +		}
> +
> +		for (parent = parent_mem_cgroup(iter); parent && parent != root;
> +		     parent = parent_mem_cgroup(parent)) {
> +			parent->oom_score += iter->oom_score;
> +
> +			if (mem_cgroup_oom_group(parent) &&
> +			    parent->oom_score > oc->chosen_points) {
> +				oc->chosen_memcg = parent;
> +				oc->chosen_points = parent->oom_score;
> +			}
> +		}
> +	}
> +
> +	if (oc->chosen_memcg && oc->chosen_memcg != INFLIGHT_VICTIM)
> +		css_get(&oc->chosen_memcg->css);
> +
> +	rcu_read_unlock();
> +}


As I've written in a private email, things will get much easier if you
get rid of memcg->oom_score and simply do the recursive oom_score
evaluation of eligible inter nodes. You would basically do
	for_each_mem_cgroup_tree(root, iter) {
		if (!memcg_oom_eligible(iter))
			continue;

		oom_score = oom_evaluate_memcg(iter, mask);
		if (oom_score == -1) {
			oc->chosen_memcg = INFLIGHT_VICTIM;
			mem_cgroup_iter_break(root, iter);
			break;
		}
		if (oom_score > oc->chosen_points) {
			mark_new_oom_memcg(iter);
		}

		/* potential optimization to skip the whole subtree if
		 * iter is not leaf */
	}

where
bool memcg_oom_eligible(struct mem_cgroup *memcg)
{
	if (cgroup_has_tasks(memcg->css.cgroup))
		return true;
	if (mem_cgroup_oom_group(memcg))
		return true;
	return false;
}

unsigned long __oom_evaluate_memcg(struct mem_cgroup *memcg, mask)
{
	/* check eligible tasks - oom victims OOM_SCORE_ADJ_MIN */
	/* calculate badness */
}

unsigned long oom_evaluate_memcg(struct mem_cgroup *memcg, mask)
{
	unsigned long score = 0;

	if (memcg == root_mem_cgroup) {
		for_each_task()
			score += __oom_badness(task, mask);
		return score
	}

	for_each_mem_cgroup_tree(memcg, iter) {
		unsigned long memcg_score = __oom_evaluate_memcg(iter, mask);
		if (memcg_score == -1) {
			mem_cgroup_iter_break(memcg, iter);
			return -1;
		}
	}

	return score;
}

This should be also simple to split for oom_group in a separate patch
while keeping the overall code structure.
Does this make any sense to you?

[...]
> @@ -962,6 +968,48 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
>  	__oom_kill_process(victim);
>  }
>  
> +static int oom_kill_memcg_member(struct task_struct *task, void *unused)
> +{
> +	if (!tsk_is_oom_victim(task)) {

How can this happen?

> +		get_task_struct(task);
> +		__oom_kill_process(task);
> +	}
> +	return 0;
> +}
> +
> +static bool oom_kill_memcg_victim(struct oom_control *oc)
> +{
> +	static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
> +				      DEFAULT_RATELIMIT_BURST);
> +
> +	if (oc->chosen_memcg == NULL || oc->chosen_memcg == INFLIGHT_VICTIM)
> +		return oc->chosen_memcg;
> +
> +	/* Always begin with the task with the biggest memory footprint */
> +	oc->chosen_points = 0;
> +	oc->chosen_task = NULL;
> +	mem_cgroup_scan_tasks(oc->chosen_memcg, oom_evaluate_task, oc);
> +
> +	if (oc->chosen_task == NULL || oc->chosen_task == INFLIGHT_VICTIM)
> +		goto out;
> +
> +	if (__ratelimit(&oom_rs))
> +		dump_header(oc, oc->chosen_task);

Hmm, does the full dump_header really apply for the new heuristic? E.g.
does it make sense to dump_tasks()? Would it make sense to print stats
of all eligible memcgs instead?

> +
> +	__oom_kill_process(oc->chosen_task);
> +
> +	/* If oom_group flag is set, kill all belonging tasks */
> +	if (mem_cgroup_oom_group(oc->chosen_memcg))
> +		mem_cgroup_scan_tasks(oc->chosen_memcg, oom_kill_memcg_member,
> +				      NULL);
> +
> +	schedule_timeout_killable(1);

I would prefer if we had this timeout at a single place in
out_of_memory()

Other than that the semantic (sans oom_group which needs more
clarification) makes sense to me.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [v9 3/5] mm, oom: cgroup-aware OOM killer
@ 2017-10-03 11:48     ` Michal Hocko
  0 siblings, 0 replies; 46+ messages in thread
From: Michal Hocko @ 2017-10-03 11:48 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: linux-mm, Vladimir Davydov, Johannes Weiner, Tetsuo Handa,
	David Rientjes, Andrew Morton, Tejun Heo, kernel-team, cgroups,
	linux-doc, linux-kernel

On Wed 27-09-17 14:09:34, Roman Gushchin wrote:
> Traditionally, the OOM killer is operating on a process level.
> Under oom conditions, it finds a process with the highest oom score
> and kills it.
> 
> This behavior doesn't suit well the system with many running
> containers:
> 
> 1) There is no fairness between containers. A small container with
> few large processes will be chosen over a large one with huge
> number of small processes.
> 
> 2) Containers often do not expect that some random process inside
> will be killed. In many cases much safer behavior is to kill
> all tasks in the container. Traditionally, this was implemented
> in userspace, but doing it in the kernel has some advantages,
> especially in a case of a system-wide OOM.
> 
> To address these issues, the cgroup-aware OOM killer is introduced.
> 
> Under OOM conditions, it looks for the biggest memory consumer:
> a leaf memory cgroup or a memory cgroup with the memory.oom_group
> option set. Then it kills either a task with the biggest memory
> footprint, either all belonging tasks, if memory.oom_group is set.
> If a cgroup has memory.oom_group set, all descendant cgroups
> implicitly inherit the memory.oom_group setting.

I think it would be better to separate oom_group into its own patch.
So this patch would just add the cgroup awareness and oom_group will
build on top of that.

Wrt. to the implicit inheritance you brought up in a separate email
thread [1]. Let me quote
: after some additional thinking I don't think anymore that implicit
: propagation of oom_group is a good idea.  Let me explain: assume we
: have memcg A with memory.max and memory.oom_group set, and nested
: memcg A/B with memory.max set. Let's imagine we have an OOM event if
: A/B. What is an expected system behavior?
: We have OOM scoped to A/B, and any action should be also scoped to A/B.
: We really shouldn't touch processes which are not belonging to A/B.
: That means we should either kill the biggest process in A/B, either all
: processes in A/B. It's natural to make A/B/memory.oom_group responsible
: for this decision. It's strange to make the depend on A/memory.oom_group, IMO.
: It really makes no sense, and makes oom_group knob really hard to describe.
: 
: Also, after some off-list discussion, we've realized that memory.oom_knob
: should be delegatable. The workload should have control over it to express
: dependency between processes.

OK, I have asked about this already but I am not sure the answer was
very explicit. So let me ask again. When exactly a subtree would
disagree with the parent on oom_group? In other words when do we want a
different cleanup based on the OOM root? I am not saying this is wrong
I am just curious about a practical example.

> Tasks with oom_score_adj set to -1000 are considered as unkillable.
> 
> The root cgroup is treated as a leaf memory cgroup, so it's score
> is compared with other leaf and oom_group memory cgroups.
> The oom_group option is not supported for the root cgroup.
> Due to memcg statistics implementation a special algorithm
> is used for estimating root cgroup oom_score: we define it
> as maximum oom_score of the belonging tasks.

[1] http://lkml.kernel.org/r/20171002124712.GA17638@castle.DHCP.thefacebook.com

[...]
> +static long memcg_oom_badness(struct mem_cgroup *memcg,
> +			      const nodemask_t *nodemask,
> +			      unsigned long totalpages)
> +{
> +	long points = 0;
> +	int nid;
> +	pg_data_t *pgdat;
> +
> +	/*
> +	 * We don't have necessary stats for the root memcg,
> +	 * so we define it's oom_score as the maximum oom_score
> +	 * of the belonging tasks.
> +	 */

Why not a sum of all tasks which would more resemble what we do for
other memcgs? Sure this would require ignoring oom_score_adj so
oom_badness would have to be tweaked a bit (basically split it into
__oom_badness which calculates the value without the bias and
oom_badness on top adding the bias on top of the scaled value).

> +	if (memcg == root_mem_cgroup) {
> +		struct css_task_iter it;
> +		struct task_struct *task;
> +		long score, max_score = 0;
> +
> +		css_task_iter_start(&memcg->css, 0, &it);
> +		while ((task = css_task_iter_next(&it))) {
> +			score = oom_badness(task, memcg, nodemask,
> +					    totalpages);
> +			if (score > max_score)
> +				max_score = score;
> +		}
> +		css_task_iter_end(&it);
> +
> +		return max_score;
> +	}
> +
> +	for_each_node_state(nid, N_MEMORY) {
> +		if (nodemask && !node_isset(nid, *nodemask))
> +			continue;
> +
> +		points += mem_cgroup_node_nr_lru_pages(memcg, nid,
> +				LRU_ALL_ANON | BIT(LRU_UNEVICTABLE));
> +
> +		pgdat = NODE_DATA(nid);
> +		points += lruvec_page_state(mem_cgroup_lruvec(pgdat, memcg),
> +					    NR_SLAB_UNRECLAIMABLE);
> +	}
> +
> +	points += memcg_page_state(memcg, MEMCG_KERNEL_STACK_KB) /
> +		(PAGE_SIZE / 1024);
> +	points += memcg_page_state(memcg, MEMCG_SOCK);
> +	points += memcg_page_state(memcg, MEMCG_SWAP);
> +
> +	return points;
> +}
> +
> +/*
> + * Checks if the given memcg is a valid OOM victim and returns a number,
> + * which means the folowing:
> + *   -1: there are inflight OOM victim tasks, belonging to the memcg
> + *    0: memcg is not eligible, e.g. all belonging tasks are protected
> + *       by oom_score_adj set to OOM_SCORE_ADJ_MIN
> + *   >0: memcg is eligible, and the returned value is an estimation
> + *       of the memory footprint
> + */
> +static long oom_evaluate_memcg(struct mem_cgroup *memcg,
> +			       const nodemask_t *nodemask,
> +			       unsigned long totalpages)
> +{
> +	struct css_task_iter it;
> +	struct task_struct *task;
> +	int eligible = 0;
> +
> +	/*
> +	 * Memcg is OOM eligible if there are OOM killable tasks inside.
> +	 *
> +	 * We treat tasks with oom_score_adj set to OOM_SCORE_ADJ_MIN
> +	 * as unkillable.
> +	 *
> +	 * If there are inflight OOM victim tasks inside the memcg,
> +	 * we return -1.
> +	 */
> +	css_task_iter_start(&memcg->css, 0, &it);
> +	while ((task = css_task_iter_next(&it))) {
> +		if (!eligible &&
> +		    task->signal->oom_score_adj != OOM_SCORE_ADJ_MIN)
> +			eligible = 1;
> +
> +		if (tsk_is_oom_victim(task) &&
> +		    !test_bit(MMF_OOM_SKIP, &task->signal->oom_mm->flags)) {
> +			eligible = -1;
> +			break;
> +		}
> +	}
> +	css_task_iter_end(&it);
> +
> +	if (eligible <= 0)
> +		return eligible;
> +
> +	return memcg_oom_badness(memcg, nodemask, totalpages);
> +}
> +
> +static void select_victim_memcg(struct mem_cgroup *root, struct oom_control *oc)
> +{
> +	struct mem_cgroup *iter, *parent;
> +
> +	/*
> +	 * If OOM is memcg-wide, and the memcg or it's ancestor has
> +	 * the oom_group flag, simple select the memcg as a victim.
> +	 */
> +	if (oc->memcg && mem_cgroup_oom_group(oc->memcg)) {
> +		oc->chosen_memcg = oc->memcg;
> +		css_get(&oc->chosen_memcg->css);
> +		oc->chosen_points = oc->memcg->oom_score;
> +		return;
> +	}
> +
> +	oc->chosen_memcg = NULL;
> +
> +	/*
> +	 * The oom_score is calculated for leaf memcgs and propagated upwards
> +	 * by the tree.
> +	 *
> +	 * for_each_mem_cgroup_tree() walks the tree in pre-order,
> +	 * so we simple reset oom_score for non-lead cgroups before
> +	 * starting accumulating an actual value from underlying sub-tree.
> +	 *
> +	 * Root memcg is treated as a leaf memcg.
> +	 */
> +	rcu_read_lock();
> +	for_each_mem_cgroup_tree(iter, root) {
> +		if (memcg_has_children(iter) && iter != root_mem_cgroup) {
> +			iter->oom_score = 0;
> +			continue;
> +		}
> +
> +		iter->oom_score = oom_evaluate_memcg(iter, oc->nodemask,
> +						     oc->totalpages);
> +
> +		/*
> +		 * Ignore empty and non-eligible memory cgroups.
> +		 */
> +		if (iter->oom_score == 0)
> +			continue;
> +
> +		/*
> +		 * If there are inflight OOM victims, we don't need to look
> +		 * further for new victims.
> +		 */
> +		if (iter->oom_score == -1) {
> +			oc->chosen_memcg = INFLIGHT_VICTIM;
> +			mem_cgroup_iter_break(root, iter);
> +			break;
> +		}
> +
> +		if (iter->oom_score > oc->chosen_points) {
> +			oc->chosen_memcg = iter;
> +			oc->chosen_points = iter->oom_score;
> +		}
> +
> +		for (parent = parent_mem_cgroup(iter); parent && parent != root;
> +		     parent = parent_mem_cgroup(parent)) {
> +			parent->oom_score += iter->oom_score;
> +
> +			if (mem_cgroup_oom_group(parent) &&
> +			    parent->oom_score > oc->chosen_points) {
> +				oc->chosen_memcg = parent;
> +				oc->chosen_points = parent->oom_score;
> +			}
> +		}
> +	}
> +
> +	if (oc->chosen_memcg && oc->chosen_memcg != INFLIGHT_VICTIM)
> +		css_get(&oc->chosen_memcg->css);
> +
> +	rcu_read_unlock();
> +}


As I've written in a private email, things will get much easier if you
get rid of memcg->oom_score and simply do the recursive oom_score
evaluation of eligible inter nodes. You would basically do
	for_each_mem_cgroup_tree(root, iter) {
		if (!memcg_oom_eligible(iter))
			continue;

		oom_score = oom_evaluate_memcg(iter, mask);
		if (oom_score == -1) {
			oc->chosen_memcg = INFLIGHT_VICTIM;
			mem_cgroup_iter_break(root, iter);
			break;
		}
		if (oom_score > oc->chosen_points) {
			mark_new_oom_memcg(iter);
		}

		/* potential optimization to skip the whole subtree if
		 * iter is not leaf */
	}

where
bool memcg_oom_eligible(struct mem_cgroup *memcg)
{
	if (cgroup_has_tasks(memcg->css.cgroup))
		return true;
	if (mem_cgroup_oom_group(memcg))
		return true;
	return false;
}

unsigned long __oom_evaluate_memcg(struct mem_cgroup *memcg, mask)
{
	/* check eligible tasks - oom victims OOM_SCORE_ADJ_MIN */
	/* calculate badness */
}

unsigned long oom_evaluate_memcg(struct mem_cgroup *memcg, mask)
{
	unsigned long score = 0;

	if (memcg == root_mem_cgroup) {
		for_each_task()
			score += __oom_badness(task, mask);
		return score
	}

	for_each_mem_cgroup_tree(memcg, iter) {
		unsigned long memcg_score = __oom_evaluate_memcg(iter, mask);
		if (memcg_score == -1) {
			mem_cgroup_iter_break(memcg, iter);
			return -1;
		}
	}

	return score;
}

This should be also simple to split for oom_group in a separate patch
while keeping the overall code structure.
Does this make any sense to you?

[...]
> @@ -962,6 +968,48 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
>  	__oom_kill_process(victim);
>  }
>  
> +static int oom_kill_memcg_member(struct task_struct *task, void *unused)
> +{
> +	if (!tsk_is_oom_victim(task)) {

How can this happen?

> +		get_task_struct(task);
> +		__oom_kill_process(task);
> +	}
> +	return 0;
> +}
> +
> +static bool oom_kill_memcg_victim(struct oom_control *oc)
> +{
> +	static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
> +				      DEFAULT_RATELIMIT_BURST);
> +
> +	if (oc->chosen_memcg == NULL || oc->chosen_memcg == INFLIGHT_VICTIM)
> +		return oc->chosen_memcg;
> +
> +	/* Always begin with the task with the biggest memory footprint */
> +	oc->chosen_points = 0;
> +	oc->chosen_task = NULL;
> +	mem_cgroup_scan_tasks(oc->chosen_memcg, oom_evaluate_task, oc);
> +
> +	if (oc->chosen_task == NULL || oc->chosen_task == INFLIGHT_VICTIM)
> +		goto out;
> +
> +	if (__ratelimit(&oom_rs))
> +		dump_header(oc, oc->chosen_task);

Hmm, does the full dump_header really apply for the new heuristic? E.g.
does it make sense to dump_tasks()? Would it make sense to print stats
of all eligible memcgs instead?

> +
> +	__oom_kill_process(oc->chosen_task);
> +
> +	/* If oom_group flag is set, kill all belonging tasks */
> +	if (mem_cgroup_oom_group(oc->chosen_memcg))
> +		mem_cgroup_scan_tasks(oc->chosen_memcg, oom_kill_memcg_member,
> +				      NULL);
> +
> +	schedule_timeout_killable(1);

I would prefer if we had this timeout at a single place in
out_of_memory()

Other than that the semantic (sans oom_group which needs more
clarification) makes sense to me.
-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [v9 4/5] mm, oom: add cgroup v2 mount option for cgroup-aware OOM killer
  2017-09-27 13:09   ` Roman Gushchin
@ 2017-10-03 11:50     ` Michal Hocko
  -1 siblings, 0 replies; 46+ messages in thread
From: Michal Hocko @ 2017-10-03 11:50 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: linux-mm, Vladimir Davydov, Johannes Weiner, Tetsuo Handa,
	David Rientjes, Andrew Morton, Tejun Heo, kernel-team, cgroups,
	linux-doc, linux-kernel

On Wed 27-09-17 14:09:35, Roman Gushchin wrote:
> Add a "groupoom" cgroup v2 mount option to enable the cgroup-aware
> OOM killer. If not set, the OOM selection is performed in
> a "traditional" per-process way.
> 
> The behavior can be changed dynamically by remounting the cgroupfs.

I do not have a strong preference about this. I would just be worried
that it is usually systemd which tries to own the whole hierarchy and it
has no idea what is the proper oom behavior because that highly depends
on the specific workload so a global sysctl/kernel command line argument
would fit better IMHO.

> Signed-off-by: Roman Gushchin <guro@fb.com>
> Cc: Michal Hocko <mhocko@kernel.org>
> Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> Cc: David Rientjes <rientjes@google.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Tejun Heo <tj@kernel.org>
> Cc: kernel-team@fb.com
> Cc: cgroups@vger.kernel.org
> Cc: linux-doc@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org
> Cc: linux-mm@kvack.org
> ---
>  include/linux/cgroup-defs.h |  5 +++++
>  kernel/cgroup/cgroup.c      | 10 ++++++++++
>  mm/memcontrol.c             |  3 +++
>  3 files changed, 18 insertions(+)
> 
> diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
> index 3e55bbd31ad1..cae5343a8b21 100644
> --- a/include/linux/cgroup-defs.h
> +++ b/include/linux/cgroup-defs.h
> @@ -80,6 +80,11 @@ enum {
>  	 * Enable cpuset controller in v1 cgroup to use v2 behavior.
>  	 */
>  	CGRP_ROOT_CPUSET_V2_MODE = (1 << 4),
> +
> +	/*
> +	 * Enable cgroup-aware OOM killer.
> +	 */
> +	CGRP_GROUP_OOM = (1 << 5),
>  };
>  
>  /* cftype->flags */
> diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
> index c3421ee0d230..8d8aa46ff930 100644
> --- a/kernel/cgroup/cgroup.c
> +++ b/kernel/cgroup/cgroup.c
> @@ -1709,6 +1709,9 @@ static int parse_cgroup_root_flags(char *data, unsigned int *root_flags)
>  		if (!strcmp(token, "nsdelegate")) {
>  			*root_flags |= CGRP_ROOT_NS_DELEGATE;
>  			continue;
> +		} else if (!strcmp(token, "groupoom")) {
> +			*root_flags |= CGRP_GROUP_OOM;
> +			continue;
>  		}
>  
>  		pr_err("cgroup2: unknown option \"%s\"\n", token);
> @@ -1725,6 +1728,11 @@ static void apply_cgroup_root_flags(unsigned int root_flags)
>  			cgrp_dfl_root.flags |= CGRP_ROOT_NS_DELEGATE;
>  		else
>  			cgrp_dfl_root.flags &= ~CGRP_ROOT_NS_DELEGATE;
> +
> +		if (root_flags & CGRP_GROUP_OOM)
> +			cgrp_dfl_root.flags |= CGRP_GROUP_OOM;
> +		else
> +			cgrp_dfl_root.flags &= ~CGRP_GROUP_OOM;
>  	}
>  }
>  
> @@ -1732,6 +1740,8 @@ static int cgroup_show_options(struct seq_file *seq, struct kernfs_root *kf_root
>  {
>  	if (cgrp_dfl_root.flags & CGRP_ROOT_NS_DELEGATE)
>  		seq_puts(seq, ",nsdelegate");
> +	if (cgrp_dfl_root.flags & CGRP_GROUP_OOM)
> +		seq_puts(seq, ",groupoom");
>  	return 0;
>  }
>  
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 353bb590713e..3f82b6f22d63 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2852,6 +2852,9 @@ bool mem_cgroup_select_oom_victim(struct oom_control *oc)
>  	if (!cgroup_subsys_on_dfl(memory_cgrp_subsys))
>  		return false;
>  
> +	if (!(cgrp_dfl_root.flags & CGRP_GROUP_OOM))
> +		return false;
> +
>  	if (oc->memcg)
>  		root = oc->memcg;
>  	else
> -- 
> 2.13.5

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [v9 4/5] mm, oom: add cgroup v2 mount option for cgroup-aware OOM killer
@ 2017-10-03 11:50     ` Michal Hocko
  0 siblings, 0 replies; 46+ messages in thread
From: Michal Hocko @ 2017-10-03 11:50 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: linux-mm, Vladimir Davydov, Johannes Weiner, Tetsuo Handa,
	David Rientjes, Andrew Morton, Tejun Heo, kernel-team, cgroups,
	linux-doc, linux-kernel

On Wed 27-09-17 14:09:35, Roman Gushchin wrote:
> Add a "groupoom" cgroup v2 mount option to enable the cgroup-aware
> OOM killer. If not set, the OOM selection is performed in
> a "traditional" per-process way.
> 
> The behavior can be changed dynamically by remounting the cgroupfs.

I do not have a strong preference about this. I would just be worried
that it is usually systemd which tries to own the whole hierarchy and it
has no idea what is the proper oom behavior because that highly depends
on the specific workload so a global sysctl/kernel command line argument
would fit better IMHO.

> Signed-off-by: Roman Gushchin <guro@fb.com>
> Cc: Michal Hocko <mhocko@kernel.org>
> Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> Cc: David Rientjes <rientjes@google.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Tejun Heo <tj@kernel.org>
> Cc: kernel-team@fb.com
> Cc: cgroups@vger.kernel.org
> Cc: linux-doc@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org
> Cc: linux-mm@kvack.org
> ---
>  include/linux/cgroup-defs.h |  5 +++++
>  kernel/cgroup/cgroup.c      | 10 ++++++++++
>  mm/memcontrol.c             |  3 +++
>  3 files changed, 18 insertions(+)
> 
> diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
> index 3e55bbd31ad1..cae5343a8b21 100644
> --- a/include/linux/cgroup-defs.h
> +++ b/include/linux/cgroup-defs.h
> @@ -80,6 +80,11 @@ enum {
>  	 * Enable cpuset controller in v1 cgroup to use v2 behavior.
>  	 */
>  	CGRP_ROOT_CPUSET_V2_MODE = (1 << 4),
> +
> +	/*
> +	 * Enable cgroup-aware OOM killer.
> +	 */
> +	CGRP_GROUP_OOM = (1 << 5),
>  };
>  
>  /* cftype->flags */
> diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
> index c3421ee0d230..8d8aa46ff930 100644
> --- a/kernel/cgroup/cgroup.c
> +++ b/kernel/cgroup/cgroup.c
> @@ -1709,6 +1709,9 @@ static int parse_cgroup_root_flags(char *data, unsigned int *root_flags)
>  		if (!strcmp(token, "nsdelegate")) {
>  			*root_flags |= CGRP_ROOT_NS_DELEGATE;
>  			continue;
> +		} else if (!strcmp(token, "groupoom")) {
> +			*root_flags |= CGRP_GROUP_OOM;
> +			continue;
>  		}
>  
>  		pr_err("cgroup2: unknown option \"%s\"\n", token);
> @@ -1725,6 +1728,11 @@ static void apply_cgroup_root_flags(unsigned int root_flags)
>  			cgrp_dfl_root.flags |= CGRP_ROOT_NS_DELEGATE;
>  		else
>  			cgrp_dfl_root.flags &= ~CGRP_ROOT_NS_DELEGATE;
> +
> +		if (root_flags & CGRP_GROUP_OOM)
> +			cgrp_dfl_root.flags |= CGRP_GROUP_OOM;
> +		else
> +			cgrp_dfl_root.flags &= ~CGRP_GROUP_OOM;
>  	}
>  }
>  
> @@ -1732,6 +1740,8 @@ static int cgroup_show_options(struct seq_file *seq, struct kernfs_root *kf_root
>  {
>  	if (cgrp_dfl_root.flags & CGRP_ROOT_NS_DELEGATE)
>  		seq_puts(seq, ",nsdelegate");
> +	if (cgrp_dfl_root.flags & CGRP_GROUP_OOM)
> +		seq_puts(seq, ",groupoom");
>  	return 0;
>  }
>  
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 353bb590713e..3f82b6f22d63 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2852,6 +2852,9 @@ bool mem_cgroup_select_oom_victim(struct oom_control *oc)
>  	if (!cgroup_subsys_on_dfl(memory_cgrp_subsys))
>  		return false;
>  
> +	if (!(cgrp_dfl_root.flags & CGRP_GROUP_OOM))
> +		return false;
> +
>  	if (oc->memcg)
>  		root = oc->memcg;
>  	else
> -- 
> 2.13.5

-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [v9 3/5] mm, oom: cgroup-aware OOM killer
  2017-10-03 11:48     ` Michal Hocko
  (?)
@ 2017-10-03 12:37       ` Roman Gushchin
  -1 siblings, 0 replies; 46+ messages in thread
From: Roman Gushchin @ 2017-10-03 12:37 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Vladimir Davydov, Johannes Weiner, Tetsuo Handa,
	David Rientjes, Andrew Morton, Tejun Heo, kernel-team, cgroups,
	linux-doc, linux-kernel

On Tue, Oct 03, 2017 at 01:48:48PM +0200, Michal Hocko wrote:
> On Wed 27-09-17 14:09:34, Roman Gushchin wrote:
> > Traditionally, the OOM killer is operating on a process level.
> > Under oom conditions, it finds a process with the highest oom score
> > and kills it.
> > 
> > This behavior doesn't suit well the system with many running
> > containers:
> > 
> > 1) There is no fairness between containers. A small container with
> > few large processes will be chosen over a large one with huge
> > number of small processes.
> > 
> > 2) Containers often do not expect that some random process inside
> > will be killed. In many cases much safer behavior is to kill
> > all tasks in the container. Traditionally, this was implemented
> > in userspace, but doing it in the kernel has some advantages,
> > especially in a case of a system-wide OOM.
> > 
> > To address these issues, the cgroup-aware OOM killer is introduced.
> > 
> > Under OOM conditions, it looks for the biggest memory consumer:
> > a leaf memory cgroup or a memory cgroup with the memory.oom_group
> > option set. Then it kills either a task with the biggest memory
> > footprint, either all belonging tasks, if memory.oom_group is set.
> > If a cgroup has memory.oom_group set, all descendant cgroups
> > implicitly inherit the memory.oom_group setting.
> 
> I think it would be better to separate oom_group into its own patch.
> So this patch would just add the cgroup awareness and oom_group will
> build on top of that.

Sure, will do.

> 
> Wrt. to the implicit inheritance you brought up in a separate email
> thread [1]. Let me quote
> : after some additional thinking I don't think anymore that implicit
> : propagation of oom_group is a good idea.  Let me explain: assume we
> : have memcg A with memory.max and memory.oom_group set, and nested
> : memcg A/B with memory.max set. Let's imagine we have an OOM event if
> : A/B. What is an expected system behavior?
> : We have OOM scoped to A/B, and any action should be also scoped to A/B.
> : We really shouldn't touch processes which are not belonging to A/B.
> : That means we should either kill the biggest process in A/B, either all
> : processes in A/B. It's natural to make A/B/memory.oom_group responsible
> : for this decision. It's strange to make the depend on A/memory.oom_group, IMO.
> : It really makes no sense, and makes oom_group knob really hard to describe.
> : 
> : Also, after some off-list discussion, we've realized that memory.oom_knob
> : should be delegatable. The workload should have control over it to express
> : dependency between processes.
> 
> OK, I have asked about this already but I am not sure the answer was
> very explicit. So let me ask again. When exactly a subtree would
> disagree with the parent on oom_group? In other words when do we want a
> different cleanup based on the OOM root? I am not saying this is wrong
> I am just curious about a practical example.

Well, I do not have a practical example right now, but it's against the logic.
Any OOM event has a scope, and group_oom knob is applied for OOM events
scoped to the cgroup or any ancestors (including system as a whole).
So, applying it implicitly to OOM scoped to descendant cgroups makes no sense.
It's a strange configuration limitation, and I do not see any benefits:
it doesn't provide any new functionality or guarantees.

Even if we don't have practical examples, we should build something less
surprising for a user, and I don't understand why oom_group should be inherited.

> 
> > Tasks with oom_score_adj set to -1000 are considered as unkillable.
> > 
> > The root cgroup is treated as a leaf memory cgroup, so it's score
> > is compared with other leaf and oom_group memory cgroups.
> > The oom_group option is not supported for the root cgroup.
> > Due to memcg statistics implementation a special algorithm
> > is used for estimating root cgroup oom_score: we define it
> > as maximum oom_score of the belonging tasks.
> 
> [1] http://lkml.kernel.org/r/20171002124712.GA17638@castle.DHCP.thefacebook.com
> 
> [...]
> > +static long memcg_oom_badness(struct mem_cgroup *memcg,
> > +			      const nodemask_t *nodemask,
> > +			      unsigned long totalpages)
> > +{
> > +	long points = 0;
> > +	int nid;
> > +	pg_data_t *pgdat;
> > +
> > +	/*
> > +	 * We don't have necessary stats for the root memcg,
> > +	 * so we define it's oom_score as the maximum oom_score
> > +	 * of the belonging tasks.
> > +	 */
> 
> Why not a sum of all tasks which would more resemble what we do for
> other memcgs? Sure this would require ignoring oom_score_adj so
> oom_badness would have to be tweaked a bit (basically split it into
> __oom_badness which calculates the value without the bias and
> oom_badness on top adding the bias on top of the scaled value).

We've discussed it already: calculating the sum is tricky, as tasks
are sharing memory (and the mm struct(. As I remember, you suggested
using maximum to solve exactly this problem, and I think it's a good
approximation. Assuming that tasks in the root cgroup likely have
nothing in common, and we don't support oom_group for it, looking
at the biggest task makes perfect sense: we're exactly comparing
killable entities.

> 
> > +	if (memcg == root_mem_cgroup) {
> > +		struct css_task_iter it;
> > +		struct task_struct *task;
> > +		long score, max_score = 0;
> > +
> > +		css_task_iter_start(&memcg->css, 0, &it);
> > +		while ((task = css_task_iter_next(&it))) {
> > +			score = oom_badness(task, memcg, nodemask,
> > +					    totalpages);
> > +			if (score > max_score)
> > +				max_score = score;
> > +		}
> > +		css_task_iter_end(&it);
> > +
> > +		return max_score;
> > +	}
> > +
> > +	for_each_node_state(nid, N_MEMORY) {
> > +		if (nodemask && !node_isset(nid, *nodemask))
> > +			continue;
> > +
> > +		points += mem_cgroup_node_nr_lru_pages(memcg, nid,
> > +				LRU_ALL_ANON | BIT(LRU_UNEVICTABLE));
> > +
> > +		pgdat = NODE_DATA(nid);
> > +		points += lruvec_page_state(mem_cgroup_lruvec(pgdat, memcg),
> > +					    NR_SLAB_UNRECLAIMABLE);
> > +	}
> > +
> > +	points += memcg_page_state(memcg, MEMCG_KERNEL_STACK_KB) /
> > +		(PAGE_SIZE / 1024);
> > +	points += memcg_page_state(memcg, MEMCG_SOCK);
> > +	points += memcg_page_state(memcg, MEMCG_SWAP);
> > +
> > +	return points;
> > +}
> > +
> > +/*
> > + * Checks if the given memcg is a valid OOM victim and returns a number,
> > + * which means the folowing:
> > + *   -1: there are inflight OOM victim tasks, belonging to the memcg
> > + *    0: memcg is not eligible, e.g. all belonging tasks are protected
> > + *       by oom_score_adj set to OOM_SCORE_ADJ_MIN
> > + *   >0: memcg is eligible, and the returned value is an estimation
> > + *       of the memory footprint
> > + */
> > +static long oom_evaluate_memcg(struct mem_cgroup *memcg,
> > +			       const nodemask_t *nodemask,
> > +			       unsigned long totalpages)
> > +{
> > +	struct css_task_iter it;
> > +	struct task_struct *task;
> > +	int eligible = 0;
> > +
> > +	/*
> > +	 * Memcg is OOM eligible if there are OOM killable tasks inside.
> > +	 *
> > +	 * We treat tasks with oom_score_adj set to OOM_SCORE_ADJ_MIN
> > +	 * as unkillable.
> > +	 *
> > +	 * If there are inflight OOM victim tasks inside the memcg,
> > +	 * we return -1.
> > +	 */
> > +	css_task_iter_start(&memcg->css, 0, &it);
> > +	while ((task = css_task_iter_next(&it))) {
> > +		if (!eligible &&
> > +		    task->signal->oom_score_adj != OOM_SCORE_ADJ_MIN)
> > +			eligible = 1;
> > +
> > +		if (tsk_is_oom_victim(task) &&
> > +		    !test_bit(MMF_OOM_SKIP, &task->signal->oom_mm->flags)) {
> > +			eligible = -1;
> > +			break;
> > +		}
> > +	}
> > +	css_task_iter_end(&it);
> > +
> > +	if (eligible <= 0)
> > +		return eligible;
> > +
> > +	return memcg_oom_badness(memcg, nodemask, totalpages);
> > +}
> > +
> > +static void select_victim_memcg(struct mem_cgroup *root, struct oom_control *oc)
> > +{
> > +	struct mem_cgroup *iter, *parent;
> > +
> > +	/*
> > +	 * If OOM is memcg-wide, and the memcg or it's ancestor has
> > +	 * the oom_group flag, simple select the memcg as a victim.
> > +	 */
> > +	if (oc->memcg && mem_cgroup_oom_group(oc->memcg)) {
> > +		oc->chosen_memcg = oc->memcg;
> > +		css_get(&oc->chosen_memcg->css);
> > +		oc->chosen_points = oc->memcg->oom_score;
> > +		return;
> > +	}
> > +
> > +	oc->chosen_memcg = NULL;
> > +
> > +	/*
> > +	 * The oom_score is calculated for leaf memcgs and propagated upwards
> > +	 * by the tree.
> > +	 *
> > +	 * for_each_mem_cgroup_tree() walks the tree in pre-order,
> > +	 * so we simple reset oom_score for non-lead cgroups before
> > +	 * starting accumulating an actual value from underlying sub-tree.
> > +	 *
> > +	 * Root memcg is treated as a leaf memcg.
> > +	 */
> > +	rcu_read_lock();
> > +	for_each_mem_cgroup_tree(iter, root) {
> > +		if (memcg_has_children(iter) && iter != root_mem_cgroup) {
> > +			iter->oom_score = 0;
> > +			continue;
> > +		}
> > +
> > +		iter->oom_score = oom_evaluate_memcg(iter, oc->nodemask,
> > +						     oc->totalpages);
> > +
> > +		/*
> > +		 * Ignore empty and non-eligible memory cgroups.
> > +		 */
> > +		if (iter->oom_score == 0)
> > +			continue;
> > +
> > +		/*
> > +		 * If there are inflight OOM victims, we don't need to look
> > +		 * further for new victims.
> > +		 */
> > +		if (iter->oom_score == -1) {
> > +			oc->chosen_memcg = INFLIGHT_VICTIM;
> > +			mem_cgroup_iter_break(root, iter);
> > +			break;
> > +		}
> > +
> > +		if (iter->oom_score > oc->chosen_points) {
> > +			oc->chosen_memcg = iter;
> > +			oc->chosen_points = iter->oom_score;
> > +		}
> > +
> > +		for (parent = parent_mem_cgroup(iter); parent && parent != root;
> > +		     parent = parent_mem_cgroup(parent)) {
> > +			parent->oom_score += iter->oom_score;
> > +
> > +			if (mem_cgroup_oom_group(parent) &&
> > +			    parent->oom_score > oc->chosen_points) {
> > +				oc->chosen_memcg = parent;
> > +				oc->chosen_points = parent->oom_score;
> > +			}
> > +		}
> > +	}
> > +
> > +	if (oc->chosen_memcg && oc->chosen_memcg != INFLIGHT_VICTIM)
> > +		css_get(&oc->chosen_memcg->css);
> > +
> > +	rcu_read_unlock();
> > +}
> 
> 
> As I've written in a private email, things will get much easier if you
> get rid of memcg->oom_score and simply do the recursive oom_score
> evaluation of eligible inter nodes. You would basically do
> 	for_each_mem_cgroup_tree(root, iter) {
> 		if (!memcg_oom_eligible(iter))
> 			continue;
> 
> 		oom_score = oom_evaluate_memcg(iter, mask);
> 		if (oom_score == -1) {
> 			oc->chosen_memcg = INFLIGHT_VICTIM;
> 			mem_cgroup_iter_break(root, iter);
> 			break;
> 		}
> 		if (oom_score > oc->chosen_points) {
> 			mark_new_oom_memcg(iter);
> 		}
> 
> 		/* potential optimization to skip the whole subtree if
> 		 * iter is not leaf */
> 	}
> 
> where
> bool memcg_oom_eligible(struct mem_cgroup *memcg)
> {
> 	if (cgroup_has_tasks(memcg->css.cgroup))
> 		return true;
> 	if (mem_cgroup_oom_group(memcg))
> 		return true;
> 	return false;
> }
> 
> unsigned long __oom_evaluate_memcg(struct mem_cgroup *memcg, mask)
> {
> 	/* check eligible tasks - oom victims OOM_SCORE_ADJ_MIN */
> 	/* calculate badness */
> }
> 
> unsigned long oom_evaluate_memcg(struct mem_cgroup *memcg, mask)
> {
> 	unsigned long score = 0;
> 
> 	if (memcg == root_mem_cgroup) {
> 		for_each_task()
> 			score += __oom_badness(task, mask);
> 		return score
> 	}
> 
> 	for_each_mem_cgroup_tree(memcg, iter) {
> 		unsigned long memcg_score = __oom_evaluate_memcg(iter, mask);
> 		if (memcg_score == -1) {
> 			mem_cgroup_iter_break(memcg, iter);
> 			return -1;
> 		}
> 	}
> 
> 	return score;
> }
> 
> This should be also simple to split for oom_group in a separate patch
> while keeping the overall code structure.
> Does this make any sense to you?

I totally agree that getting rid of memcg->oom_score is possible and is
a good idea, as well as separating oom_group into a separate patch.

What about the rest, let me check what I can do here.


> 
> [...]
> > @@ -962,6 +968,48 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
> >  	__oom_kill_process(victim);
> >  }
> >  
> > +static int oom_kill_memcg_member(struct task_struct *task, void *unused)
> > +{
> > +	if (!tsk_is_oom_victim(task)) {
> 
> How can this happen?

We do start with killing the largest process, and then iterate over all tasks
in the cgroup. So, this check is required to avoid killing tasks which are
already in the termination process.

> 
> > +		get_task_struct(task);
> > +		__oom_kill_process(task);
> > +	}
> > +	return 0;
> > +}
> > +
> > +static bool oom_kill_memcg_victim(struct oom_control *oc)
> > +{
> > +	static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
> > +				      DEFAULT_RATELIMIT_BURST);
> > +
> > +	if (oc->chosen_memcg == NULL || oc->chosen_memcg == INFLIGHT_VICTIM)
> > +		return oc->chosen_memcg;
> > +
> > +	/* Always begin with the task with the biggest memory footprint */
> > +	oc->chosen_points = 0;
> > +	oc->chosen_task = NULL;
> > +	mem_cgroup_scan_tasks(oc->chosen_memcg, oom_evaluate_task, oc);
> > +
> > +	if (oc->chosen_task == NULL || oc->chosen_task == INFLIGHT_VICTIM)
> > +		goto out;
> > +
> > +	if (__ratelimit(&oom_rs))
> > +		dump_header(oc, oc->chosen_task);
> 
> Hmm, does the full dump_header really apply for the new heuristic? E.g.
> does it make sense to dump_tasks()? Would it make sense to print stats
> of all eligible memcgs instead?

Hm, this is a tricky part: the dmesg output is at some point a part of ABI,
but is also closely connected with the implementation. So I would suggest
to postpone this until we'll get more usage examples and will better
understand what information we need.

> 
> > +
> > +	__oom_kill_process(oc->chosen_task);
> > +
> > +	/* If oom_group flag is set, kill all belonging tasks */
> > +	if (mem_cgroup_oom_group(oc->chosen_memcg))
> > +		mem_cgroup_scan_tasks(oc->chosen_memcg, oom_kill_memcg_member,
> > +				      NULL);
> > +
> > +	schedule_timeout_killable(1);
> 
> I would prefer if we had this timeout at a single place in
> out_of_memory()

Ok, will do.

> 
> Other than that the semantic (sans oom_group which needs more
> clarification) makes sense to me.

Cool!

Glad to hear this.

Thanks!

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [v9 3/5] mm, oom: cgroup-aware OOM killer
@ 2017-10-03 12:37       ` Roman Gushchin
  0 siblings, 0 replies; 46+ messages in thread
From: Roman Gushchin @ 2017-10-03 12:37 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Vladimir Davydov, Johannes Weiner, Tetsuo Handa,
	David Rientjes, Andrew Morton, Tejun Heo, kernel-team, cgroups,
	linux-doc, linux-kernel

On Tue, Oct 03, 2017 at 01:48:48PM +0200, Michal Hocko wrote:
> On Wed 27-09-17 14:09:34, Roman Gushchin wrote:
> > Traditionally, the OOM killer is operating on a process level.
> > Under oom conditions, it finds a process with the highest oom score
> > and kills it.
> > 
> > This behavior doesn't suit well the system with many running
> > containers:
> > 
> > 1) There is no fairness between containers. A small container with
> > few large processes will be chosen over a large one with huge
> > number of small processes.
> > 
> > 2) Containers often do not expect that some random process inside
> > will be killed. In many cases much safer behavior is to kill
> > all tasks in the container. Traditionally, this was implemented
> > in userspace, but doing it in the kernel has some advantages,
> > especially in a case of a system-wide OOM.
> > 
> > To address these issues, the cgroup-aware OOM killer is introduced.
> > 
> > Under OOM conditions, it looks for the biggest memory consumer:
> > a leaf memory cgroup or a memory cgroup with the memory.oom_group
> > option set. Then it kills either a task with the biggest memory
> > footprint, either all belonging tasks, if memory.oom_group is set.
> > If a cgroup has memory.oom_group set, all descendant cgroups
> > implicitly inherit the memory.oom_group setting.
> 
> I think it would be better to separate oom_group into its own patch.
> So this patch would just add the cgroup awareness and oom_group will
> build on top of that.

Sure, will do.

> 
> Wrt. to the implicit inheritance you brought up in a separate email
> thread [1]. Let me quote
> : after some additional thinking I don't think anymore that implicit
> : propagation of oom_group is a good idea.  Let me explain: assume we
> : have memcg A with memory.max and memory.oom_group set, and nested
> : memcg A/B with memory.max set. Let's imagine we have an OOM event if
> : A/B. What is an expected system behavior?
> : We have OOM scoped to A/B, and any action should be also scoped to A/B.
> : We really shouldn't touch processes which are not belonging to A/B.
> : That means we should either kill the biggest process in A/B, either all
> : processes in A/B. It's natural to make A/B/memory.oom_group responsible
> : for this decision. It's strange to make the depend on A/memory.oom_group, IMO.
> : It really makes no sense, and makes oom_group knob really hard to describe.
> : 
> : Also, after some off-list discussion, we've realized that memory.oom_knob
> : should be delegatable. The workload should have control over it to express
> : dependency between processes.
> 
> OK, I have asked about this already but I am not sure the answer was
> very explicit. So let me ask again. When exactly a subtree would
> disagree with the parent on oom_group? In other words when do we want a
> different cleanup based on the OOM root? I am not saying this is wrong
> I am just curious about a practical example.

Well, I do not have a practical example right now, but it's against the logic.
Any OOM event has a scope, and group_oom knob is applied for OOM events
scoped to the cgroup or any ancestors (including system as a whole).
So, applying it implicitly to OOM scoped to descendant cgroups makes no sense.
It's a strange configuration limitation, and I do not see any benefits:
it doesn't provide any new functionality or guarantees.

Even if we don't have practical examples, we should build something less
surprising for a user, and I don't understand why oom_group should be inherited.

> 
> > Tasks with oom_score_adj set to -1000 are considered as unkillable.
> > 
> > The root cgroup is treated as a leaf memory cgroup, so it's score
> > is compared with other leaf and oom_group memory cgroups.
> > The oom_group option is not supported for the root cgroup.
> > Due to memcg statistics implementation a special algorithm
> > is used for estimating root cgroup oom_score: we define it
> > as maximum oom_score of the belonging tasks.
> 
> [1] http://lkml.kernel.org/r/20171002124712.GA17638@castle.DHCP.thefacebook.com
> 
> [...]
> > +static long memcg_oom_badness(struct mem_cgroup *memcg,
> > +			      const nodemask_t *nodemask,
> > +			      unsigned long totalpages)
> > +{
> > +	long points = 0;
> > +	int nid;
> > +	pg_data_t *pgdat;
> > +
> > +	/*
> > +	 * We don't have necessary stats for the root memcg,
> > +	 * so we define it's oom_score as the maximum oom_score
> > +	 * of the belonging tasks.
> > +	 */
> 
> Why not a sum of all tasks which would more resemble what we do for
> other memcgs? Sure this would require ignoring oom_score_adj so
> oom_badness would have to be tweaked a bit (basically split it into
> __oom_badness which calculates the value without the bias and
> oom_badness on top adding the bias on top of the scaled value).

We've discussed it already: calculating the sum is tricky, as tasks
are sharing memory (and the mm struct(. As I remember, you suggested
using maximum to solve exactly this problem, and I think it's a good
approximation. Assuming that tasks in the root cgroup likely have
nothing in common, and we don't support oom_group for it, looking
at the biggest task makes perfect sense: we're exactly comparing
killable entities.

> 
> > +	if (memcg == root_mem_cgroup) {
> > +		struct css_task_iter it;
> > +		struct task_struct *task;
> > +		long score, max_score = 0;
> > +
> > +		css_task_iter_start(&memcg->css, 0, &it);
> > +		while ((task = css_task_iter_next(&it))) {
> > +			score = oom_badness(task, memcg, nodemask,
> > +					    totalpages);
> > +			if (score > max_score)
> > +				max_score = score;
> > +		}
> > +		css_task_iter_end(&it);
> > +
> > +		return max_score;
> > +	}
> > +
> > +	for_each_node_state(nid, N_MEMORY) {
> > +		if (nodemask && !node_isset(nid, *nodemask))
> > +			continue;
> > +
> > +		points += mem_cgroup_node_nr_lru_pages(memcg, nid,
> > +				LRU_ALL_ANON | BIT(LRU_UNEVICTABLE));
> > +
> > +		pgdat = NODE_DATA(nid);
> > +		points += lruvec_page_state(mem_cgroup_lruvec(pgdat, memcg),
> > +					    NR_SLAB_UNRECLAIMABLE);
> > +	}
> > +
> > +	points += memcg_page_state(memcg, MEMCG_KERNEL_STACK_KB) /
> > +		(PAGE_SIZE / 1024);
> > +	points += memcg_page_state(memcg, MEMCG_SOCK);
> > +	points += memcg_page_state(memcg, MEMCG_SWAP);
> > +
> > +	return points;
> > +}
> > +
> > +/*
> > + * Checks if the given memcg is a valid OOM victim and returns a number,
> > + * which means the folowing:
> > + *   -1: there are inflight OOM victim tasks, belonging to the memcg
> > + *    0: memcg is not eligible, e.g. all belonging tasks are protected
> > + *       by oom_score_adj set to OOM_SCORE_ADJ_MIN
> > + *   >0: memcg is eligible, and the returned value is an estimation
> > + *       of the memory footprint
> > + */
> > +static long oom_evaluate_memcg(struct mem_cgroup *memcg,
> > +			       const nodemask_t *nodemask,
> > +			       unsigned long totalpages)
> > +{
> > +	struct css_task_iter it;
> > +	struct task_struct *task;
> > +	int eligible = 0;
> > +
> > +	/*
> > +	 * Memcg is OOM eligible if there are OOM killable tasks inside.
> > +	 *
> > +	 * We treat tasks with oom_score_adj set to OOM_SCORE_ADJ_MIN
> > +	 * as unkillable.
> > +	 *
> > +	 * If there are inflight OOM victim tasks inside the memcg,
> > +	 * we return -1.
> > +	 */
> > +	css_task_iter_start(&memcg->css, 0, &it);
> > +	while ((task = css_task_iter_next(&it))) {
> > +		if (!eligible &&
> > +		    task->signal->oom_score_adj != OOM_SCORE_ADJ_MIN)
> > +			eligible = 1;
> > +
> > +		if (tsk_is_oom_victim(task) &&
> > +		    !test_bit(MMF_OOM_SKIP, &task->signal->oom_mm->flags)) {
> > +			eligible = -1;
> > +			break;
> > +		}
> > +	}
> > +	css_task_iter_end(&it);
> > +
> > +	if (eligible <= 0)
> > +		return eligible;
> > +
> > +	return memcg_oom_badness(memcg, nodemask, totalpages);
> > +}
> > +
> > +static void select_victim_memcg(struct mem_cgroup *root, struct oom_control *oc)
> > +{
> > +	struct mem_cgroup *iter, *parent;
> > +
> > +	/*
> > +	 * If OOM is memcg-wide, and the memcg or it's ancestor has
> > +	 * the oom_group flag, simple select the memcg as a victim.
> > +	 */
> > +	if (oc->memcg && mem_cgroup_oom_group(oc->memcg)) {
> > +		oc->chosen_memcg = oc->memcg;
> > +		css_get(&oc->chosen_memcg->css);
> > +		oc->chosen_points = oc->memcg->oom_score;
> > +		return;
> > +	}
> > +
> > +	oc->chosen_memcg = NULL;
> > +
> > +	/*
> > +	 * The oom_score is calculated for leaf memcgs and propagated upwards
> > +	 * by the tree.
> > +	 *
> > +	 * for_each_mem_cgroup_tree() walks the tree in pre-order,
> > +	 * so we simple reset oom_score for non-lead cgroups before
> > +	 * starting accumulating an actual value from underlying sub-tree.
> > +	 *
> > +	 * Root memcg is treated as a leaf memcg.
> > +	 */
> > +	rcu_read_lock();
> > +	for_each_mem_cgroup_tree(iter, root) {
> > +		if (memcg_has_children(iter) && iter != root_mem_cgroup) {
> > +			iter->oom_score = 0;
> > +			continue;
> > +		}
> > +
> > +		iter->oom_score = oom_evaluate_memcg(iter, oc->nodemask,
> > +						     oc->totalpages);
> > +
> > +		/*
> > +		 * Ignore empty and non-eligible memory cgroups.
> > +		 */
> > +		if (iter->oom_score == 0)
> > +			continue;
> > +
> > +		/*
> > +		 * If there are inflight OOM victims, we don't need to look
> > +		 * further for new victims.
> > +		 */
> > +		if (iter->oom_score == -1) {
> > +			oc->chosen_memcg = INFLIGHT_VICTIM;
> > +			mem_cgroup_iter_break(root, iter);
> > +			break;
> > +		}
> > +
> > +		if (iter->oom_score > oc->chosen_points) {
> > +			oc->chosen_memcg = iter;
> > +			oc->chosen_points = iter->oom_score;
> > +		}
> > +
> > +		for (parent = parent_mem_cgroup(iter); parent && parent != root;
> > +		     parent = parent_mem_cgroup(parent)) {
> > +			parent->oom_score += iter->oom_score;
> > +
> > +			if (mem_cgroup_oom_group(parent) &&
> > +			    parent->oom_score > oc->chosen_points) {
> > +				oc->chosen_memcg = parent;
> > +				oc->chosen_points = parent->oom_score;
> > +			}
> > +		}
> > +	}
> > +
> > +	if (oc->chosen_memcg && oc->chosen_memcg != INFLIGHT_VICTIM)
> > +		css_get(&oc->chosen_memcg->css);
> > +
> > +	rcu_read_unlock();
> > +}
> 
> 
> As I've written in a private email, things will get much easier if you
> get rid of memcg->oom_score and simply do the recursive oom_score
> evaluation of eligible inter nodes. You would basically do
> 	for_each_mem_cgroup_tree(root, iter) {
> 		if (!memcg_oom_eligible(iter))
> 			continue;
> 
> 		oom_score = oom_evaluate_memcg(iter, mask);
> 		if (oom_score == -1) {
> 			oc->chosen_memcg = INFLIGHT_VICTIM;
> 			mem_cgroup_iter_break(root, iter);
> 			break;
> 		}
> 		if (oom_score > oc->chosen_points) {
> 			mark_new_oom_memcg(iter);
> 		}
> 
> 		/* potential optimization to skip the whole subtree if
> 		 * iter is not leaf */
> 	}
> 
> where
> bool memcg_oom_eligible(struct mem_cgroup *memcg)
> {
> 	if (cgroup_has_tasks(memcg->css.cgroup))
> 		return true;
> 	if (mem_cgroup_oom_group(memcg))
> 		return true;
> 	return false;
> }
> 
> unsigned long __oom_evaluate_memcg(struct mem_cgroup *memcg, mask)
> {
> 	/* check eligible tasks - oom victims OOM_SCORE_ADJ_MIN */
> 	/* calculate badness */
> }
> 
> unsigned long oom_evaluate_memcg(struct mem_cgroup *memcg, mask)
> {
> 	unsigned long score = 0;
> 
> 	if (memcg == root_mem_cgroup) {
> 		for_each_task()
> 			score += __oom_badness(task, mask);
> 		return score
> 	}
> 
> 	for_each_mem_cgroup_tree(memcg, iter) {
> 		unsigned long memcg_score = __oom_evaluate_memcg(iter, mask);
> 		if (memcg_score == -1) {
> 			mem_cgroup_iter_break(memcg, iter);
> 			return -1;
> 		}
> 	}
> 
> 	return score;
> }
> 
> This should be also simple to split for oom_group in a separate patch
> while keeping the overall code structure.
> Does this make any sense to you?

I totally agree that getting rid of memcg->oom_score is possible and is
a good idea, as well as separating oom_group into a separate patch.

What about the rest, let me check what I can do here.


> 
> [...]
> > @@ -962,6 +968,48 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
> >  	__oom_kill_process(victim);
> >  }
> >  
> > +static int oom_kill_memcg_member(struct task_struct *task, void *unused)
> > +{
> > +	if (!tsk_is_oom_victim(task)) {
> 
> How can this happen?

We do start with killing the largest process, and then iterate over all tasks
in the cgroup. So, this check is required to avoid killing tasks which are
already in the termination process.

> 
> > +		get_task_struct(task);
> > +		__oom_kill_process(task);
> > +	}
> > +	return 0;
> > +}
> > +
> > +static bool oom_kill_memcg_victim(struct oom_control *oc)
> > +{
> > +	static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
> > +				      DEFAULT_RATELIMIT_BURST);
> > +
> > +	if (oc->chosen_memcg == NULL || oc->chosen_memcg == INFLIGHT_VICTIM)
> > +		return oc->chosen_memcg;
> > +
> > +	/* Always begin with the task with the biggest memory footprint */
> > +	oc->chosen_points = 0;
> > +	oc->chosen_task = NULL;
> > +	mem_cgroup_scan_tasks(oc->chosen_memcg, oom_evaluate_task, oc);
> > +
> > +	if (oc->chosen_task == NULL || oc->chosen_task == INFLIGHT_VICTIM)
> > +		goto out;
> > +
> > +	if (__ratelimit(&oom_rs))
> > +		dump_header(oc, oc->chosen_task);
> 
> Hmm, does the full dump_header really apply for the new heuristic? E.g.
> does it make sense to dump_tasks()? Would it make sense to print stats
> of all eligible memcgs instead?

Hm, this is a tricky part: the dmesg output is at some point a part of ABI,
but is also closely connected with the implementation. So I would suggest
to postpone this until we'll get more usage examples and will better
understand what information we need.

> 
> > +
> > +	__oom_kill_process(oc->chosen_task);
> > +
> > +	/* If oom_group flag is set, kill all belonging tasks */
> > +	if (mem_cgroup_oom_group(oc->chosen_memcg))
> > +		mem_cgroup_scan_tasks(oc->chosen_memcg, oom_kill_memcg_member,
> > +				      NULL);
> > +
> > +	schedule_timeout_killable(1);
> 
> I would prefer if we had this timeout at a single place in
> out_of_memory()

Ok, will do.

> 
> Other than that the semantic (sans oom_group which needs more
> clarification) makes sense to me.

Cool!

Glad to hear this.

Thanks!

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [v9 3/5] mm, oom: cgroup-aware OOM killer
@ 2017-10-03 12:37       ` Roman Gushchin
  0 siblings, 0 replies; 46+ messages in thread
From: Roman Gushchin @ 2017-10-03 12:37 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Vladimir Davydov, Johannes Weiner, Tetsuo Handa,
	David Rientjes, Andrew Morton, Tejun Heo, kernel-team, cgroups,
	linux-doc, linux-kernel

On Tue, Oct 03, 2017 at 01:48:48PM +0200, Michal Hocko wrote:
> On Wed 27-09-17 14:09:34, Roman Gushchin wrote:
> > Traditionally, the OOM killer is operating on a process level.
> > Under oom conditions, it finds a process with the highest oom score
> > and kills it.
> > 
> > This behavior doesn't suit well the system with many running
> > containers:
> > 
> > 1) There is no fairness between containers. A small container with
> > few large processes will be chosen over a large one with huge
> > number of small processes.
> > 
> > 2) Containers often do not expect that some random process inside
> > will be killed. In many cases much safer behavior is to kill
> > all tasks in the container. Traditionally, this was implemented
> > in userspace, but doing it in the kernel has some advantages,
> > especially in a case of a system-wide OOM.
> > 
> > To address these issues, the cgroup-aware OOM killer is introduced.
> > 
> > Under OOM conditions, it looks for the biggest memory consumer:
> > a leaf memory cgroup or a memory cgroup with the memory.oom_group
> > option set. Then it kills either a task with the biggest memory
> > footprint, either all belonging tasks, if memory.oom_group is set.
> > If a cgroup has memory.oom_group set, all descendant cgroups
> > implicitly inherit the memory.oom_group setting.
> 
> I think it would be better to separate oom_group into its own patch.
> So this patch would just add the cgroup awareness and oom_group will
> build on top of that.

Sure, will do.

> 
> Wrt. to the implicit inheritance you brought up in a separate email
> thread [1]. Let me quote
> : after some additional thinking I don't think anymore that implicit
> : propagation of oom_group is a good idea.  Let me explain: assume we
> : have memcg A with memory.max and memory.oom_group set, and nested
> : memcg A/B with memory.max set. Let's imagine we have an OOM event if
> : A/B. What is an expected system behavior?
> : We have OOM scoped to A/B, and any action should be also scoped to A/B.
> : We really shouldn't touch processes which are not belonging to A/B.
> : That means we should either kill the biggest process in A/B, either all
> : processes in A/B. It's natural to make A/B/memory.oom_group responsible
> : for this decision. It's strange to make the depend on A/memory.oom_group, IMO.
> : It really makes no sense, and makes oom_group knob really hard to describe.
> : 
> : Also, after some off-list discussion, we've realized that memory.oom_knob
> : should be delegatable. The workload should have control over it to express
> : dependency between processes.
> 
> OK, I have asked about this already but I am not sure the answer was
> very explicit. So let me ask again. When exactly a subtree would
> disagree with the parent on oom_group? In other words when do we want a
> different cleanup based on the OOM root? I am not saying this is wrong
> I am just curious about a practical example.

Well, I do not have a practical example right now, but it's against the logic.
Any OOM event has a scope, and group_oom knob is applied for OOM events
scoped to the cgroup or any ancestors (including system as a whole).
So, applying it implicitly to OOM scoped to descendant cgroups makes no sense.
It's a strange configuration limitation, and I do not see any benefits:
it doesn't provide any new functionality or guarantees.

Even if we don't have practical examples, we should build something less
surprising for a user, and I don't understand why oom_group should be inherited.

> 
> > Tasks with oom_score_adj set to -1000 are considered as unkillable.
> > 
> > The root cgroup is treated as a leaf memory cgroup, so it's score
> > is compared with other leaf and oom_group memory cgroups.
> > The oom_group option is not supported for the root cgroup.
> > Due to memcg statistics implementation a special algorithm
> > is used for estimating root cgroup oom_score: we define it
> > as maximum oom_score of the belonging tasks.
> 
> [1] http://lkml.kernel.org/r/20171002124712.GA17638@castle.DHCP.thefacebook.com
> 
> [...]
> > +static long memcg_oom_badness(struct mem_cgroup *memcg,
> > +			      const nodemask_t *nodemask,
> > +			      unsigned long totalpages)
> > +{
> > +	long points = 0;
> > +	int nid;
> > +	pg_data_t *pgdat;
> > +
> > +	/*
> > +	 * We don't have necessary stats for the root memcg,
> > +	 * so we define it's oom_score as the maximum oom_score
> > +	 * of the belonging tasks.
> > +	 */
> 
> Why not a sum of all tasks which would more resemble what we do for
> other memcgs? Sure this would require ignoring oom_score_adj so
> oom_badness would have to be tweaked a bit (basically split it into
> __oom_badness which calculates the value without the bias and
> oom_badness on top adding the bias on top of the scaled value).

We've discussed it already: calculating the sum is tricky, as tasks
are sharing memory (and the mm struct(. As I remember, you suggested
using maximum to solve exactly this problem, and I think it's a good
approximation. Assuming that tasks in the root cgroup likely have
nothing in common, and we don't support oom_group for it, looking
at the biggest task makes perfect sense: we're exactly comparing
killable entities.

> 
> > +	if (memcg == root_mem_cgroup) {
> > +		struct css_task_iter it;
> > +		struct task_struct *task;
> > +		long score, max_score = 0;
> > +
> > +		css_task_iter_start(&memcg->css, 0, &it);
> > +		while ((task = css_task_iter_next(&it))) {
> > +			score = oom_badness(task, memcg, nodemask,
> > +					    totalpages);
> > +			if (score > max_score)
> > +				max_score = score;
> > +		}
> > +		css_task_iter_end(&it);
> > +
> > +		return max_score;
> > +	}
> > +
> > +	for_each_node_state(nid, N_MEMORY) {
> > +		if (nodemask && !node_isset(nid, *nodemask))
> > +			continue;
> > +
> > +		points += mem_cgroup_node_nr_lru_pages(memcg, nid,
> > +				LRU_ALL_ANON | BIT(LRU_UNEVICTABLE));
> > +
> > +		pgdat = NODE_DATA(nid);
> > +		points += lruvec_page_state(mem_cgroup_lruvec(pgdat, memcg),
> > +					    NR_SLAB_UNRECLAIMABLE);
> > +	}
> > +
> > +	points += memcg_page_state(memcg, MEMCG_KERNEL_STACK_KB) /
> > +		(PAGE_SIZE / 1024);
> > +	points += memcg_page_state(memcg, MEMCG_SOCK);
> > +	points += memcg_page_state(memcg, MEMCG_SWAP);
> > +
> > +	return points;
> > +}
> > +
> > +/*
> > + * Checks if the given memcg is a valid OOM victim and returns a number,
> > + * which means the folowing:
> > + *   -1: there are inflight OOM victim tasks, belonging to the memcg
> > + *    0: memcg is not eligible, e.g. all belonging tasks are protected
> > + *       by oom_score_adj set to OOM_SCORE_ADJ_MIN
> > + *   >0: memcg is eligible, and the returned value is an estimation
> > + *       of the memory footprint
> > + */
> > +static long oom_evaluate_memcg(struct mem_cgroup *memcg,
> > +			       const nodemask_t *nodemask,
> > +			       unsigned long totalpages)
> > +{
> > +	struct css_task_iter it;
> > +	struct task_struct *task;
> > +	int eligible = 0;
> > +
> > +	/*
> > +	 * Memcg is OOM eligible if there are OOM killable tasks inside.
> > +	 *
> > +	 * We treat tasks with oom_score_adj set to OOM_SCORE_ADJ_MIN
> > +	 * as unkillable.
> > +	 *
> > +	 * If there are inflight OOM victim tasks inside the memcg,
> > +	 * we return -1.
> > +	 */
> > +	css_task_iter_start(&memcg->css, 0, &it);
> > +	while ((task = css_task_iter_next(&it))) {
> > +		if (!eligible &&
> > +		    task->signal->oom_score_adj != OOM_SCORE_ADJ_MIN)
> > +			eligible = 1;
> > +
> > +		if (tsk_is_oom_victim(task) &&
> > +		    !test_bit(MMF_OOM_SKIP, &task->signal->oom_mm->flags)) {
> > +			eligible = -1;
> > +			break;
> > +		}
> > +	}
> > +	css_task_iter_end(&it);
> > +
> > +	if (eligible <= 0)
> > +		return eligible;
> > +
> > +	return memcg_oom_badness(memcg, nodemask, totalpages);
> > +}
> > +
> > +static void select_victim_memcg(struct mem_cgroup *root, struct oom_control *oc)
> > +{
> > +	struct mem_cgroup *iter, *parent;
> > +
> > +	/*
> > +	 * If OOM is memcg-wide, and the memcg or it's ancestor has
> > +	 * the oom_group flag, simple select the memcg as a victim.
> > +	 */
> > +	if (oc->memcg && mem_cgroup_oom_group(oc->memcg)) {
> > +		oc->chosen_memcg = oc->memcg;
> > +		css_get(&oc->chosen_memcg->css);
> > +		oc->chosen_points = oc->memcg->oom_score;
> > +		return;
> > +	}
> > +
> > +	oc->chosen_memcg = NULL;
> > +
> > +	/*
> > +	 * The oom_score is calculated for leaf memcgs and propagated upwards
> > +	 * by the tree.
> > +	 *
> > +	 * for_each_mem_cgroup_tree() walks the tree in pre-order,
> > +	 * so we simple reset oom_score for non-lead cgroups before
> > +	 * starting accumulating an actual value from underlying sub-tree.
> > +	 *
> > +	 * Root memcg is treated as a leaf memcg.
> > +	 */
> > +	rcu_read_lock();
> > +	for_each_mem_cgroup_tree(iter, root) {
> > +		if (memcg_has_children(iter) && iter != root_mem_cgroup) {
> > +			iter->oom_score = 0;
> > +			continue;
> > +		}
> > +
> > +		iter->oom_score = oom_evaluate_memcg(iter, oc->nodemask,
> > +						     oc->totalpages);
> > +
> > +		/*
> > +		 * Ignore empty and non-eligible memory cgroups.
> > +		 */
> > +		if (iter->oom_score == 0)
> > +			continue;
> > +
> > +		/*
> > +		 * If there are inflight OOM victims, we don't need to look
> > +		 * further for new victims.
> > +		 */
> > +		if (iter->oom_score == -1) {
> > +			oc->chosen_memcg = INFLIGHT_VICTIM;
> > +			mem_cgroup_iter_break(root, iter);
> > +			break;
> > +		}
> > +
> > +		if (iter->oom_score > oc->chosen_points) {
> > +			oc->chosen_memcg = iter;
> > +			oc->chosen_points = iter->oom_score;
> > +		}
> > +
> > +		for (parent = parent_mem_cgroup(iter); parent && parent != root;
> > +		     parent = parent_mem_cgroup(parent)) {
> > +			parent->oom_score += iter->oom_score;
> > +
> > +			if (mem_cgroup_oom_group(parent) &&
> > +			    parent->oom_score > oc->chosen_points) {
> > +				oc->chosen_memcg = parent;
> > +				oc->chosen_points = parent->oom_score;
> > +			}
> > +		}
> > +	}
> > +
> > +	if (oc->chosen_memcg && oc->chosen_memcg != INFLIGHT_VICTIM)
> > +		css_get(&oc->chosen_memcg->css);
> > +
> > +	rcu_read_unlock();
> > +}
> 
> 
> As I've written in a private email, things will get much easier if you
> get rid of memcg->oom_score and simply do the recursive oom_score
> evaluation of eligible inter nodes. You would basically do
> 	for_each_mem_cgroup_tree(root, iter) {
> 		if (!memcg_oom_eligible(iter))
> 			continue;
> 
> 		oom_score = oom_evaluate_memcg(iter, mask);
> 		if (oom_score == -1) {
> 			oc->chosen_memcg = INFLIGHT_VICTIM;
> 			mem_cgroup_iter_break(root, iter);
> 			break;
> 		}
> 		if (oom_score > oc->chosen_points) {
> 			mark_new_oom_memcg(iter);
> 		}
> 
> 		/* potential optimization to skip the whole subtree if
> 		 * iter is not leaf */
> 	}
> 
> where
> bool memcg_oom_eligible(struct mem_cgroup *memcg)
> {
> 	if (cgroup_has_tasks(memcg->css.cgroup))
> 		return true;
> 	if (mem_cgroup_oom_group(memcg))
> 		return true;
> 	return false;
> }
> 
> unsigned long __oom_evaluate_memcg(struct mem_cgroup *memcg, mask)
> {
> 	/* check eligible tasks - oom victims OOM_SCORE_ADJ_MIN */
> 	/* calculate badness */
> }
> 
> unsigned long oom_evaluate_memcg(struct mem_cgroup *memcg, mask)
> {
> 	unsigned long score = 0;
> 
> 	if (memcg == root_mem_cgroup) {
> 		for_each_task()
> 			score += __oom_badness(task, mask);
> 		return score
> 	}
> 
> 	for_each_mem_cgroup_tree(memcg, iter) {
> 		unsigned long memcg_score = __oom_evaluate_memcg(iter, mask);
> 		if (memcg_score == -1) {
> 			mem_cgroup_iter_break(memcg, iter);
> 			return -1;
> 		}
> 	}
> 
> 	return score;
> }
> 
> This should be also simple to split for oom_group in a separate patch
> while keeping the overall code structure.
> Does this make any sense to you?

I totally agree that getting rid of memcg->oom_score is possible and is
a good idea, as well as separating oom_group into a separate patch.

What about the rest, let me check what I can do here.


> 
> [...]
> > @@ -962,6 +968,48 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
> >  	__oom_kill_process(victim);
> >  }
> >  
> > +static int oom_kill_memcg_member(struct task_struct *task, void *unused)
> > +{
> > +	if (!tsk_is_oom_victim(task)) {
> 
> How can this happen?

We do start with killing the largest process, and then iterate over all tasks
in the cgroup. So, this check is required to avoid killing tasks which are
already in the termination process.

> 
> > +		get_task_struct(task);
> > +		__oom_kill_process(task);
> > +	}
> > +	return 0;
> > +}
> > +
> > +static bool oom_kill_memcg_victim(struct oom_control *oc)
> > +{
> > +	static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
> > +				      DEFAULT_RATELIMIT_BURST);
> > +
> > +	if (oc->chosen_memcg == NULL || oc->chosen_memcg == INFLIGHT_VICTIM)
> > +		return oc->chosen_memcg;
> > +
> > +	/* Always begin with the task with the biggest memory footprint */
> > +	oc->chosen_points = 0;
> > +	oc->chosen_task = NULL;
> > +	mem_cgroup_scan_tasks(oc->chosen_memcg, oom_evaluate_task, oc);
> > +
> > +	if (oc->chosen_task == NULL || oc->chosen_task == INFLIGHT_VICTIM)
> > +		goto out;
> > +
> > +	if (__ratelimit(&oom_rs))
> > +		dump_header(oc, oc->chosen_task);
> 
> Hmm, does the full dump_header really apply for the new heuristic? E.g.
> does it make sense to dump_tasks()? Would it make sense to print stats
> of all eligible memcgs instead?

Hm, this is a tricky part: the dmesg output is at some point a part of ABI,
but is also closely connected with the implementation. So I would suggest
to postpone this until we'll get more usage examples and will better
understand what information we need.

> 
> > +
> > +	__oom_kill_process(oc->chosen_task);
> > +
> > +	/* If oom_group flag is set, kill all belonging tasks */
> > +	if (mem_cgroup_oom_group(oc->chosen_memcg))
> > +		mem_cgroup_scan_tasks(oc->chosen_memcg, oom_kill_memcg_member,
> > +				      NULL);
> > +
> > +	schedule_timeout_killable(1);
> 
> I would prefer if we had this timeout at a single place in
> out_of_memory()

Ok, will do.

> 
> Other than that the semantic (sans oom_group which needs more
> clarification) makes sense to me.

Cool!

Glad to hear this.

Thanks!

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [v9 4/5] mm, oom: add cgroup v2 mount option for cgroup-aware OOM killer
  2017-10-03 11:50     ` Michal Hocko
@ 2017-10-03 12:49       ` Roman Gushchin
  -1 siblings, 0 replies; 46+ messages in thread
From: Roman Gushchin @ 2017-10-03 12:49 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Vladimir Davydov, Johannes Weiner, Tetsuo Handa,
	David Rientjes, Andrew Morton, Tejun Heo, kernel-team, cgroups,
	linux-doc, linux-kernel

On Tue, Oct 03, 2017 at 01:50:36PM +0200, Michal Hocko wrote:
> On Wed 27-09-17 14:09:35, Roman Gushchin wrote:
> > Add a "groupoom" cgroup v2 mount option to enable the cgroup-aware
> > OOM killer. If not set, the OOM selection is performed in
> > a "traditional" per-process way.
> > 
> > The behavior can be changed dynamically by remounting the cgroupfs.
> 
> I do not have a strong preference about this. I would just be worried
> that it is usually systemd which tries to own the whole hierarchy

I actually like this fact.

It gives us the opportunity to change the default behavior for most users
at the point when we'll be sure that new behavior is better; but at the same
time we'll save full compatibility on the kernel level.
With growing popularity of memory cgroups, I don't think that hiding
this functionality with a boot option makes any sense. It's just not
this type of feature, that should be hidden.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [v9 4/5] mm, oom: add cgroup v2 mount option for cgroup-aware OOM killer
@ 2017-10-03 12:49       ` Roman Gushchin
  0 siblings, 0 replies; 46+ messages in thread
From: Roman Gushchin @ 2017-10-03 12:49 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Vladimir Davydov, Johannes Weiner, Tetsuo Handa,
	David Rientjes, Andrew Morton, Tejun Heo, kernel-team, cgroups,
	linux-doc, linux-kernel

On Tue, Oct 03, 2017 at 01:50:36PM +0200, Michal Hocko wrote:
> On Wed 27-09-17 14:09:35, Roman Gushchin wrote:
> > Add a "groupoom" cgroup v2 mount option to enable the cgroup-aware
> > OOM killer. If not set, the OOM selection is performed in
> > a "traditional" per-process way.
> > 
> > The behavior can be changed dynamically by remounting the cgroupfs.
> 
> I do not have a strong preference about this. I would just be worried
> that it is usually systemd which tries to own the whole hierarchy

I actually like this fact.

It gives us the opportunity to change the default behavior for most users
at the point when we'll be sure that new behavior is better; but at the same
time we'll save full compatibility on the kernel level.
With growing popularity of memory cgroups, I don't think that hiding
this functionality with a boot option makes any sense. It's just not
this type of feature, that should be hidden.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [v9 2/5] mm: implement mem_cgroup_scan_tasks() for the root memory cgroup
  2017-10-03 10:49     ` Michal Hocko
@ 2017-10-03 12:50       ` Roman Gushchin
  -1 siblings, 0 replies; 46+ messages in thread
From: Roman Gushchin @ 2017-10-03 12:50 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Vladimir Davydov, Johannes Weiner, Tetsuo Handa,
	David Rientjes, Andrew Morton, Tejun Heo, kernel-team, cgroups,
	linux-doc, linux-kernel

On Tue, Oct 03, 2017 at 12:49:39PM +0200, Michal Hocko wrote:
> On Wed 27-09-17 14:09:33, Roman Gushchin wrote:
> > Implement mem_cgroup_scan_tasks() functionality for the root
> > memory cgroup to use this function for looking for a OOM victim
> > task in the root memory cgroup by the cgroup-ware OOM killer.
> > 
> > The root memory cgroup should be treated as a leaf cgroup,
> > so only tasks which are directly belonging to the root cgroup
> > should be iterated over.
> 
> I would only add that this patch doesn't introduce any functionally
> visible change because we never trigger oom killer with the root memcg
> as the root of the hierarchy. So this is just a preparatory work for
> later changes.

Sure, thanks!

> 
> > Signed-off-by: Roman Gushchin <guro@fb.com>
> > Cc: Michal Hocko <mhocko@kernel.org>
> > Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
> > Cc: Johannes Weiner <hannes@cmpxchg.org>
> > Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> > Cc: David Rientjes <rientjes@google.com>
> > Cc: Andrew Morton <akpm@linux-foundation.org>
> > Cc: Tejun Heo <tj@kernel.org>
> > Cc: kernel-team@fb.com
> > Cc: cgroups@vger.kernel.org
> > Cc: linux-doc@vger.kernel.org
> > Cc: linux-kernel@vger.kernel.org
> > Cc: linux-mm@kvack.org
> 
> Acked-by: Michal Hocko <mhocko@suse.com>

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [v9 2/5] mm: implement mem_cgroup_scan_tasks() for the root memory cgroup
@ 2017-10-03 12:50       ` Roman Gushchin
  0 siblings, 0 replies; 46+ messages in thread
From: Roman Gushchin @ 2017-10-03 12:50 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Vladimir Davydov, Johannes Weiner, Tetsuo Handa,
	David Rientjes, Andrew Morton, Tejun Heo, kernel-team, cgroups,
	linux-doc, linux-kernel

On Tue, Oct 03, 2017 at 12:49:39PM +0200, Michal Hocko wrote:
> On Wed 27-09-17 14:09:33, Roman Gushchin wrote:
> > Implement mem_cgroup_scan_tasks() functionality for the root
> > memory cgroup to use this function for looking for a OOM victim
> > task in the root memory cgroup by the cgroup-ware OOM killer.
> > 
> > The root memory cgroup should be treated as a leaf cgroup,
> > so only tasks which are directly belonging to the root cgroup
> > should be iterated over.
> 
> I would only add that this patch doesn't introduce any functionally
> visible change because we never trigger oom killer with the root memcg
> as the root of the hierarchy. So this is just a preparatory work for
> later changes.

Sure, thanks!

> 
> > Signed-off-by: Roman Gushchin <guro@fb.com>
> > Cc: Michal Hocko <mhocko@kernel.org>
> > Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
> > Cc: Johannes Weiner <hannes@cmpxchg.org>
> > Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> > Cc: David Rientjes <rientjes@google.com>
> > Cc: Andrew Morton <akpm@linux-foundation.org>
> > Cc: Tejun Heo <tj@kernel.org>
> > Cc: kernel-team@fb.com
> > Cc: cgroups@vger.kernel.org
> > Cc: linux-doc@vger.kernel.org
> > Cc: linux-kernel@vger.kernel.org
> > Cc: linux-mm@kvack.org
> 
> Acked-by: Michal Hocko <mhocko@suse.com>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [v9 3/5] mm, oom: cgroup-aware OOM killer
  2017-10-03 12:37       ` Roman Gushchin
@ 2017-10-03 13:36         ` Michal Hocko
  -1 siblings, 0 replies; 46+ messages in thread
From: Michal Hocko @ 2017-10-03 13:36 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: linux-mm, Vladimir Davydov, Johannes Weiner, Tetsuo Handa,
	David Rientjes, Andrew Morton, Tejun Heo, kernel-team, cgroups,
	linux-doc, linux-kernel

On Tue 03-10-17 13:37:21, Roman Gushchin wrote:
> On Tue, Oct 03, 2017 at 01:48:48PM +0200, Michal Hocko wrote:
[...]
> > Wrt. to the implicit inheritance you brought up in a separate email
> > thread [1]. Let me quote
> > : after some additional thinking I don't think anymore that implicit
> > : propagation of oom_group is a good idea.  Let me explain: assume we
> > : have memcg A with memory.max and memory.oom_group set, and nested
> > : memcg A/B with memory.max set. Let's imagine we have an OOM event if
> > : A/B. What is an expected system behavior?
> > : We have OOM scoped to A/B, and any action should be also scoped to A/B.
> > : We really shouldn't touch processes which are not belonging to A/B.
> > : That means we should either kill the biggest process in A/B, either all
> > : processes in A/B. It's natural to make A/B/memory.oom_group responsible
> > : for this decision. It's strange to make the depend on A/memory.oom_group, IMO.
> > : It really makes no sense, and makes oom_group knob really hard to describe.
> > : 
> > : Also, after some off-list discussion, we've realized that memory.oom_knob
> > : should be delegatable. The workload should have control over it to express
> > : dependency between processes.
> > 
> > OK, I have asked about this already but I am not sure the answer was
> > very explicit. So let me ask again. When exactly a subtree would
> > disagree with the parent on oom_group? In other words when do we want a
> > different cleanup based on the OOM root? I am not saying this is wrong
> > I am just curious about a practical example.
> 
> Well, I do not have a practical example right now, but it's against the logic.
> Any OOM event has a scope, and group_oom knob is applied for OOM events
> scoped to the cgroup or any ancestors (including system as a whole).
> So, applying it implicitly to OOM scoped to descendant cgroups makes no sense.
> It's a strange configuration limitation, and I do not see any benefits:
> it doesn't provide any new functionality or guarantees.

Well, I guess I agree. I was merely interested about consequences when
the oom behavior is different depending on which layer it happens. Does
it make sense to cleanup the whole hierarchy while any subtree would
kill a single task if the oom happened there?
 
> Even if we don't have practical examples, we should build something less
> surprising for a user, and I don't understand why oom_group should be inherited.

I guess we want to inherit the value on the memcg creation but I agree
that enforcing parent setting is weird. I will think about it some more
but I agree that it is saner to only enforce per memcg value.
 
> > > Tasks with oom_score_adj set to -1000 are considered as unkillable.
> > > 
> > > The root cgroup is treated as a leaf memory cgroup, so it's score
> > > is compared with other leaf and oom_group memory cgroups.
> > > The oom_group option is not supported for the root cgroup.
> > > Due to memcg statistics implementation a special algorithm
> > > is used for estimating root cgroup oom_score: we define it
> > > as maximum oom_score of the belonging tasks.
> > 
> > [1] http://lkml.kernel.org/r/20171002124712.GA17638@castle.DHCP.thefacebook.com
> > 
> > [...]
> > > +static long memcg_oom_badness(struct mem_cgroup *memcg,
> > > +			      const nodemask_t *nodemask,
> > > +			      unsigned long totalpages)
> > > +{
> > > +	long points = 0;
> > > +	int nid;
> > > +	pg_data_t *pgdat;
> > > +
> > > +	/*
> > > +	 * We don't have necessary stats for the root memcg,
> > > +	 * so we define it's oom_score as the maximum oom_score
> > > +	 * of the belonging tasks.
> > > +	 */
> > 
> > Why not a sum of all tasks which would more resemble what we do for
> > other memcgs? Sure this would require ignoring oom_score_adj so
> > oom_badness would have to be tweaked a bit (basically split it into
> > __oom_badness which calculates the value without the bias and
> > oom_badness on top adding the bias on top of the scaled value).
> 
> We've discussed it already: calculating the sum is tricky, as tasks
> are sharing memory (and the mm struct(. As I remember, you suggested
> using maximum to solve exactly this problem, and I think it's a good
> approximation. Assuming that tasks in the root cgroup likely have
> nothing in common, and we don't support oom_group for it, looking
> at the biggest task makes perfect sense: we're exactly comparing
> killable entities.

Please add a comment explaining that. I hope we can make root memcg less
special eventually. It shouldn't be all that hard. We already have per
LRU numbers and we only use few counters which could be accounted to the
root memcg as well. Counters should be quite cheap.

[...]

> > > @@ -962,6 +968,48 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
> > >  	__oom_kill_process(victim);
> > >  }
> > >  
> > > +static int oom_kill_memcg_member(struct task_struct *task, void *unused)
> > > +{
> > > +	if (!tsk_is_oom_victim(task)) {
> > 
> > How can this happen?
> 
> We do start with killing the largest process, and then iterate over all tasks
> in the cgroup. So, this check is required to avoid killing tasks which are
> already in the termination process.

Do you mean we have tsk_is_oom_victim && MMF_OOM_SKIP == T?
 
> > 
> > > +		get_task_struct(task);
> > > +		__oom_kill_process(task);
> > > +	}
> > > +	return 0;
> > > +}
> > > +
> > > +static bool oom_kill_memcg_victim(struct oom_control *oc)
> > > +{
> > > +	static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
> > > +				      DEFAULT_RATELIMIT_BURST);
> > > +
> > > +	if (oc->chosen_memcg == NULL || oc->chosen_memcg == INFLIGHT_VICTIM)
> > > +		return oc->chosen_memcg;
> > > +
> > > +	/* Always begin with the task with the biggest memory footprint */
> > > +	oc->chosen_points = 0;
> > > +	oc->chosen_task = NULL;
> > > +	mem_cgroup_scan_tasks(oc->chosen_memcg, oom_evaluate_task, oc);
> > > +
> > > +	if (oc->chosen_task == NULL || oc->chosen_task == INFLIGHT_VICTIM)
> > > +		goto out;
> > > +
> > > +	if (__ratelimit(&oom_rs))
> > > +		dump_header(oc, oc->chosen_task);
> > 
> > Hmm, does the full dump_header really apply for the new heuristic? E.g.
> > does it make sense to dump_tasks()? Would it make sense to print stats
> > of all eligible memcgs instead?
> 
> Hm, this is a tricky part: the dmesg output is at some point a part of ABI,

People are parsing oom reports but I disagree this is an ABI of any
sort. The report is closely tight to the particular implementation and
as such it has changed several times over the time.

> but is also closely connected with the implementation. So I would suggest
> to postpone this until we'll get more usage examples and will better
> understand what information we need.

I would drop tasks list at least because that is clearly misleading in
this context because we are not selecting from all tasks. We are
selecting between memcgs. The memcg information can be added in a
separate patch of course.
 
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [v9 3/5] mm, oom: cgroup-aware OOM killer
@ 2017-10-03 13:36         ` Michal Hocko
  0 siblings, 0 replies; 46+ messages in thread
From: Michal Hocko @ 2017-10-03 13:36 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: linux-mm, Vladimir Davydov, Johannes Weiner, Tetsuo Handa,
	David Rientjes, Andrew Morton, Tejun Heo, kernel-team, cgroups,
	linux-doc, linux-kernel

On Tue 03-10-17 13:37:21, Roman Gushchin wrote:
> On Tue, Oct 03, 2017 at 01:48:48PM +0200, Michal Hocko wrote:
[...]
> > Wrt. to the implicit inheritance you brought up in a separate email
> > thread [1]. Let me quote
> > : after some additional thinking I don't think anymore that implicit
> > : propagation of oom_group is a good idea.  Let me explain: assume we
> > : have memcg A with memory.max and memory.oom_group set, and nested
> > : memcg A/B with memory.max set. Let's imagine we have an OOM event if
> > : A/B. What is an expected system behavior?
> > : We have OOM scoped to A/B, and any action should be also scoped to A/B.
> > : We really shouldn't touch processes which are not belonging to A/B.
> > : That means we should either kill the biggest process in A/B, either all
> > : processes in A/B. It's natural to make A/B/memory.oom_group responsible
> > : for this decision. It's strange to make the depend on A/memory.oom_group, IMO.
> > : It really makes no sense, and makes oom_group knob really hard to describe.
> > : 
> > : Also, after some off-list discussion, we've realized that memory.oom_knob
> > : should be delegatable. The workload should have control over it to express
> > : dependency between processes.
> > 
> > OK, I have asked about this already but I am not sure the answer was
> > very explicit. So let me ask again. When exactly a subtree would
> > disagree with the parent on oom_group? In other words when do we want a
> > different cleanup based on the OOM root? I am not saying this is wrong
> > I am just curious about a practical example.
> 
> Well, I do not have a practical example right now, but it's against the logic.
> Any OOM event has a scope, and group_oom knob is applied for OOM events
> scoped to the cgroup or any ancestors (including system as a whole).
> So, applying it implicitly to OOM scoped to descendant cgroups makes no sense.
> It's a strange configuration limitation, and I do not see any benefits:
> it doesn't provide any new functionality or guarantees.

Well, I guess I agree. I was merely interested about consequences when
the oom behavior is different depending on which layer it happens. Does
it make sense to cleanup the whole hierarchy while any subtree would
kill a single task if the oom happened there?
 
> Even if we don't have practical examples, we should build something less
> surprising for a user, and I don't understand why oom_group should be inherited.

I guess we want to inherit the value on the memcg creation but I agree
that enforcing parent setting is weird. I will think about it some more
but I agree that it is saner to only enforce per memcg value.
 
> > > Tasks with oom_score_adj set to -1000 are considered as unkillable.
> > > 
> > > The root cgroup is treated as a leaf memory cgroup, so it's score
> > > is compared with other leaf and oom_group memory cgroups.
> > > The oom_group option is not supported for the root cgroup.
> > > Due to memcg statistics implementation a special algorithm
> > > is used for estimating root cgroup oom_score: we define it
> > > as maximum oom_score of the belonging tasks.
> > 
> > [1] http://lkml.kernel.org/r/20171002124712.GA17638@castle.DHCP.thefacebook.com
> > 
> > [...]
> > > +static long memcg_oom_badness(struct mem_cgroup *memcg,
> > > +			      const nodemask_t *nodemask,
> > > +			      unsigned long totalpages)
> > > +{
> > > +	long points = 0;
> > > +	int nid;
> > > +	pg_data_t *pgdat;
> > > +
> > > +	/*
> > > +	 * We don't have necessary stats for the root memcg,
> > > +	 * so we define it's oom_score as the maximum oom_score
> > > +	 * of the belonging tasks.
> > > +	 */
> > 
> > Why not a sum of all tasks which would more resemble what we do for
> > other memcgs? Sure this would require ignoring oom_score_adj so
> > oom_badness would have to be tweaked a bit (basically split it into
> > __oom_badness which calculates the value without the bias and
> > oom_badness on top adding the bias on top of the scaled value).
> 
> We've discussed it already: calculating the sum is tricky, as tasks
> are sharing memory (and the mm struct(. As I remember, you suggested
> using maximum to solve exactly this problem, and I think it's a good
> approximation. Assuming that tasks in the root cgroup likely have
> nothing in common, and we don't support oom_group for it, looking
> at the biggest task makes perfect sense: we're exactly comparing
> killable entities.

Please add a comment explaining that. I hope we can make root memcg less
special eventually. It shouldn't be all that hard. We already have per
LRU numbers and we only use few counters which could be accounted to the
root memcg as well. Counters should be quite cheap.

[...]

> > > @@ -962,6 +968,48 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
> > >  	__oom_kill_process(victim);
> > >  }
> > >  
> > > +static int oom_kill_memcg_member(struct task_struct *task, void *unused)
> > > +{
> > > +	if (!tsk_is_oom_victim(task)) {
> > 
> > How can this happen?
> 
> We do start with killing the largest process, and then iterate over all tasks
> in the cgroup. So, this check is required to avoid killing tasks which are
> already in the termination process.

Do you mean we have tsk_is_oom_victim && MMF_OOM_SKIP == T?
 
> > 
> > > +		get_task_struct(task);
> > > +		__oom_kill_process(task);
> > > +	}
> > > +	return 0;
> > > +}
> > > +
> > > +static bool oom_kill_memcg_victim(struct oom_control *oc)
> > > +{
> > > +	static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
> > > +				      DEFAULT_RATELIMIT_BURST);
> > > +
> > > +	if (oc->chosen_memcg == NULL || oc->chosen_memcg == INFLIGHT_VICTIM)
> > > +		return oc->chosen_memcg;
> > > +
> > > +	/* Always begin with the task with the biggest memory footprint */
> > > +	oc->chosen_points = 0;
> > > +	oc->chosen_task = NULL;
> > > +	mem_cgroup_scan_tasks(oc->chosen_memcg, oom_evaluate_task, oc);
> > > +
> > > +	if (oc->chosen_task == NULL || oc->chosen_task == INFLIGHT_VICTIM)
> > > +		goto out;
> > > +
> > > +	if (__ratelimit(&oom_rs))
> > > +		dump_header(oc, oc->chosen_task);
> > 
> > Hmm, does the full dump_header really apply for the new heuristic? E.g.
> > does it make sense to dump_tasks()? Would it make sense to print stats
> > of all eligible memcgs instead?
> 
> Hm, this is a tricky part: the dmesg output is at some point a part of ABI,

People are parsing oom reports but I disagree this is an ABI of any
sort. The report is closely tight to the particular implementation and
as such it has changed several times over the time.

> but is also closely connected with the implementation. So I would suggest
> to postpone this until we'll get more usage examples and will better
> understand what information we need.

I would drop tasks list at least because that is clearly misleading in
this context because we are not selecting from all tasks. We are
selecting between memcgs. The memcg information can be added in a
separate patch of course.
 
-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [v9 4/5] mm, oom: add cgroup v2 mount option for cgroup-aware OOM killer
  2017-10-03 12:49       ` Roman Gushchin
@ 2017-10-03 13:39         ` Michal Hocko
  -1 siblings, 0 replies; 46+ messages in thread
From: Michal Hocko @ 2017-10-03 13:39 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: linux-mm, Vladimir Davydov, Johannes Weiner, Tetsuo Handa,
	David Rientjes, Andrew Morton, Tejun Heo, kernel-team, cgroups,
	linux-doc, linux-kernel

On Tue 03-10-17 13:49:36, Roman Gushchin wrote:
> On Tue, Oct 03, 2017 at 01:50:36PM +0200, Michal Hocko wrote:
> > On Wed 27-09-17 14:09:35, Roman Gushchin wrote:
> > > Add a "groupoom" cgroup v2 mount option to enable the cgroup-aware
> > > OOM killer. If not set, the OOM selection is performed in
> > > a "traditional" per-process way.
> > > 
> > > The behavior can be changed dynamically by remounting the cgroupfs.
> > 
> > I do not have a strong preference about this. I would just be worried
> > that it is usually systemd which tries to own the whole hierarchy
> 
> I actually like this fact.
> 
> It gives us the opportunity to change the default behavior for most users
> at the point when we'll be sure that new behavior is better; but at the same
> time we'll save full compatibility on the kernel level.

Well, I would be much more skeptical because I simply believe that
neither of the approach is better in general. It really depends on the
usecases. And systemd or whoever mounts the hierarchy has no slightest
idea what is the case. So we might end up with a global knob to control
the mount point after all. But as I've said no strong opinion on that.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [v9 4/5] mm, oom: add cgroup v2 mount option for cgroup-aware OOM killer
@ 2017-10-03 13:39         ` Michal Hocko
  0 siblings, 0 replies; 46+ messages in thread
From: Michal Hocko @ 2017-10-03 13:39 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: linux-mm, Vladimir Davydov, Johannes Weiner, Tetsuo Handa,
	David Rientjes, Andrew Morton, Tejun Heo, kernel-team, cgroups,
	linux-doc, linux-kernel

On Tue 03-10-17 13:49:36, Roman Gushchin wrote:
> On Tue, Oct 03, 2017 at 01:50:36PM +0200, Michal Hocko wrote:
> > On Wed 27-09-17 14:09:35, Roman Gushchin wrote:
> > > Add a "groupoom" cgroup v2 mount option to enable the cgroup-aware
> > > OOM killer. If not set, the OOM selection is performed in
> > > a "traditional" per-process way.
> > > 
> > > The behavior can be changed dynamically by remounting the cgroupfs.
> > 
> > I do not have a strong preference about this. I would just be worried
> > that it is usually systemd which tries to own the whole hierarchy
> 
> I actually like this fact.
> 
> It gives us the opportunity to change the default behavior for most users
> at the point when we'll be sure that new behavior is better; but at the same
> time we'll save full compatibility on the kernel level.

Well, I would be much more skeptical because I simply believe that
neither of the approach is better in general. It really depends on the
usecases. And systemd or whoever mounts the hierarchy has no slightest
idea what is the case. So we might end up with a global knob to control
the mount point after all. But as I've said no strong opinion on that.

-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [v9 3/5] mm, oom: cgroup-aware OOM killer
  2017-10-03 13:36         ` Michal Hocko
@ 2017-10-03 14:08           ` Roman Gushchin
  -1 siblings, 0 replies; 46+ messages in thread
From: Roman Gushchin @ 2017-10-03 14:08 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Vladimir Davydov, Johannes Weiner, Tetsuo Handa,
	David Rientjes, Andrew Morton, Tejun Heo, kernel-team, cgroups,
	linux-doc, linux-kernel

On Tue, Oct 03, 2017 at 03:36:23PM +0200, Michal Hocko wrote:
> On Tue 03-10-17 13:37:21, Roman Gushchin wrote:
> > On Tue, Oct 03, 2017 at 01:48:48PM +0200, Michal Hocko wrote:
> [...]
> > > Wrt. to the implicit inheritance you brought up in a separate email
> > > thread [1]. Let me quote
> > > : after some additional thinking I don't think anymore that implicit
> > > : propagation of oom_group is a good idea.  Let me explain: assume we
> > > : have memcg A with memory.max and memory.oom_group set, and nested
> > > : memcg A/B with memory.max set. Let's imagine we have an OOM event if
> > > : A/B. What is an expected system behavior?
> > > : We have OOM scoped to A/B, and any action should be also scoped to A/B.
> > > : We really shouldn't touch processes which are not belonging to A/B.
> > > : That means we should either kill the biggest process in A/B, either all
> > > : processes in A/B. It's natural to make A/B/memory.oom_group responsible
> > > : for this decision. It's strange to make the depend on A/memory.oom_group, IMO.
> > > : It really makes no sense, and makes oom_group knob really hard to describe.
> > > : 
> > > : Also, after some off-list discussion, we've realized that memory.oom_knob
> > > : should be delegatable. The workload should have control over it to express
> > > : dependency between processes.
> > > 
> > > OK, I have asked about this already but I am not sure the answer was
> > > very explicit. So let me ask again. When exactly a subtree would
> > > disagree with the parent on oom_group? In other words when do we want a
> > > different cleanup based on the OOM root? I am not saying this is wrong
> > > I am just curious about a practical example.
> > 
> > Well, I do not have a practical example right now, but it's against the logic.
> > Any OOM event has a scope, and group_oom knob is applied for OOM events
> > scoped to the cgroup or any ancestors (including system as a whole).
> > So, applying it implicitly to OOM scoped to descendant cgroups makes no sense.
> > It's a strange configuration limitation, and I do not see any benefits:
> > it doesn't provide any new functionality or guarantees.
> 
> Well, I guess I agree. I was merely interested about consequences when
> the oom behavior is different depending on which layer it happens. Does
> it make sense to cleanup the whole hierarchy while any subtree would
> kill a single task if the oom happened there?

By setting or not setting the oom_group knob a user is expressing
the readiness to handle the OOM by itself, e.g. looking at cgroup events,
restarting killed tasks, etc.

If workload is complex, and has some sub-parts with their own memory
constraints, it's quite possible, that it's ready to restart these parts,
but not a random process killed by the global OOM. This is actually a proper
replacement for setting oom_score_adj: let say there is memcg A,
which contains some control stuff in A/C, and several sub-workloads
A/W1, A/W2, etc. In case of global OOM, caused by system miss-configuration,
or, say, a memory leak in the control stuff, it makes perfect sense to
kill A as a whole, so we can set A/memory.oom_groups to 1.

But if there is a memory shortage in one of the workers (A/W1, for instance),
it's quite possible that killing everything is excessive. So, a user has
the freedom to decide what's the proper way to handle OOM.

>  
> > Even if we don't have practical examples, we should build something less
> > surprising for a user, and I don't understand why oom_group should be inherited.
> 
> I guess we want to inherit the value on the memcg creation but I agree
> that enforcing parent setting is weird. I will think about it some more
> but I agree that it is saner to only enforce per memcg value.

I'm not against, but we should come up with a good explanation, why we're
inheriting it; or not inherit.

>  
> > > > Tasks with oom_score_adj set to -1000 are considered as unkillable.
> > > > 
> > > > The root cgroup is treated as a leaf memory cgroup, so it's score
> > > > is compared with other leaf and oom_group memory cgroups.
> > > > The oom_group option is not supported for the root cgroup.
> > > > Due to memcg statistics implementation a special algorithm
> > > > is used for estimating root cgroup oom_score: we define it
> > > > as maximum oom_score of the belonging tasks.
> > > 
> > > [1] http://lkml.kernel.org/r/20171002124712.GA17638@castle.DHCP.thefacebook.com
> > > 
> > > [...]
> > > > +static long memcg_oom_badness(struct mem_cgroup *memcg,
> > > > +			      const nodemask_t *nodemask,
> > > > +			      unsigned long totalpages)
> > > > +{
> > > > +	long points = 0;
> > > > +	int nid;
> > > > +	pg_data_t *pgdat;
> > > > +
> > > > +	/*
> > > > +	 * We don't have necessary stats for the root memcg,
> > > > +	 * so we define it's oom_score as the maximum oom_score
> > > > +	 * of the belonging tasks.
> > > > +	 */
> > > 
> > > Why not a sum of all tasks which would more resemble what we do for
> > > other memcgs? Sure this would require ignoring oom_score_adj so
> > > oom_badness would have to be tweaked a bit (basically split it into
> > > __oom_badness which calculates the value without the bias and
> > > oom_badness on top adding the bias on top of the scaled value).
> > 
> > We've discussed it already: calculating the sum is tricky, as tasks
> > are sharing memory (and the mm struct(. As I remember, you suggested
> > using maximum to solve exactly this problem, and I think it's a good
> > approximation. Assuming that tasks in the root cgroup likely have
> > nothing in common, and we don't support oom_group for it, looking
> > at the biggest task makes perfect sense: we're exactly comparing
> > killable entities.
> 
> Please add a comment explaining that. I hope we can make root memcg less
> special eventually. It shouldn't be all that hard. We already have per
> LRU numbers and we only use few counters which could be accounted to the
> root memcg as well. Counters should be quite cheap.

Sure, this is my hope too.

> 
> [...]
> 
> > > > @@ -962,6 +968,48 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
> > > >  	__oom_kill_process(victim);
> > > >  }
> > > >  
> > > > +static int oom_kill_memcg_member(struct task_struct *task, void *unused)
> > > > +{
> > > > +	if (!tsk_is_oom_victim(task)) {
> > > 
> > > How can this happen?
> > 
> > We do start with killing the largest process, and then iterate over all tasks
> > in the cgroup. So, this check is required to avoid killing tasks which are
> > already in the termination process.
> 
> Do you mean we have tsk_is_oom_victim && MMF_OOM_SKIP == T?

No, just tsk_is_oom_victim. We're are killing the biggest task, and then _all_
tasks. This is a way to skip the biggest task, and do not kill it again.

>  
> > > 
> > > > +		get_task_struct(task);
> > > > +		__oom_kill_process(task);
> > > > +	}
> > > > +	return 0;
> > > > +}
> > > > +
> > > > +static bool oom_kill_memcg_victim(struct oom_control *oc)
> > > > +{
> > > > +	static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
> > > > +				      DEFAULT_RATELIMIT_BURST);
> > > > +
> > > > +	if (oc->chosen_memcg == NULL || oc->chosen_memcg == INFLIGHT_VICTIM)
> > > > +		return oc->chosen_memcg;
> > > > +
> > > > +	/* Always begin with the task with the biggest memory footprint */
> > > > +	oc->chosen_points = 0;
> > > > +	oc->chosen_task = NULL;
> > > > +	mem_cgroup_scan_tasks(oc->chosen_memcg, oom_evaluate_task, oc);
> > > > +
> > > > +	if (oc->chosen_task == NULL || oc->chosen_task == INFLIGHT_VICTIM)
> > > > +		goto out;
> > > > +
> > > > +	if (__ratelimit(&oom_rs))
> > > > +		dump_header(oc, oc->chosen_task);
> > > 
> > > Hmm, does the full dump_header really apply for the new heuristic? E.g.
> > > does it make sense to dump_tasks()? Would it make sense to print stats
> > > of all eligible memcgs instead?
> > 
> > Hm, this is a tricky part: the dmesg output is at some point a part of ABI,
> 
> People are parsing oom reports but I disagree this is an ABI of any
> sort. The report is closely tight to the particular implementation and
> as such it has changed several times over the time.
> 
> > but is also closely connected with the implementation. So I would suggest
> > to postpone this until we'll get more usage examples and will better
> > understand what information we need.
> 
> I would drop tasks list at least because that is clearly misleading in
> this context because we are not selecting from all tasks. We are
> selecting between memcgs. The memcg information can be added in a
> separate patch of course.

Let's postpone it until we'll land the rest of the patchset.

Thank you!

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [v9 3/5] mm, oom: cgroup-aware OOM killer
@ 2017-10-03 14:08           ` Roman Gushchin
  0 siblings, 0 replies; 46+ messages in thread
From: Roman Gushchin @ 2017-10-03 14:08 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Vladimir Davydov, Johannes Weiner, Tetsuo Handa,
	David Rientjes, Andrew Morton, Tejun Heo, kernel-team, cgroups,
	linux-doc, linux-kernel

On Tue, Oct 03, 2017 at 03:36:23PM +0200, Michal Hocko wrote:
> On Tue 03-10-17 13:37:21, Roman Gushchin wrote:
> > On Tue, Oct 03, 2017 at 01:48:48PM +0200, Michal Hocko wrote:
> [...]
> > > Wrt. to the implicit inheritance you brought up in a separate email
> > > thread [1]. Let me quote
> > > : after some additional thinking I don't think anymore that implicit
> > > : propagation of oom_group is a good idea.  Let me explain: assume we
> > > : have memcg A with memory.max and memory.oom_group set, and nested
> > > : memcg A/B with memory.max set. Let's imagine we have an OOM event if
> > > : A/B. What is an expected system behavior?
> > > : We have OOM scoped to A/B, and any action should be also scoped to A/B.
> > > : We really shouldn't touch processes which are not belonging to A/B.
> > > : That means we should either kill the biggest process in A/B, either all
> > > : processes in A/B. It's natural to make A/B/memory.oom_group responsible
> > > : for this decision. It's strange to make the depend on A/memory.oom_group, IMO.
> > > : It really makes no sense, and makes oom_group knob really hard to describe.
> > > : 
> > > : Also, after some off-list discussion, we've realized that memory.oom_knob
> > > : should be delegatable. The workload should have control over it to express
> > > : dependency between processes.
> > > 
> > > OK, I have asked about this already but I am not sure the answer was
> > > very explicit. So let me ask again. When exactly a subtree would
> > > disagree with the parent on oom_group? In other words when do we want a
> > > different cleanup based on the OOM root? I am not saying this is wrong
> > > I am just curious about a practical example.
> > 
> > Well, I do not have a practical example right now, but it's against the logic.
> > Any OOM event has a scope, and group_oom knob is applied for OOM events
> > scoped to the cgroup or any ancestors (including system as a whole).
> > So, applying it implicitly to OOM scoped to descendant cgroups makes no sense.
> > It's a strange configuration limitation, and I do not see any benefits:
> > it doesn't provide any new functionality or guarantees.
> 
> Well, I guess I agree. I was merely interested about consequences when
> the oom behavior is different depending on which layer it happens. Does
> it make sense to cleanup the whole hierarchy while any subtree would
> kill a single task if the oom happened there?

By setting or not setting the oom_group knob a user is expressing
the readiness to handle the OOM by itself, e.g. looking at cgroup events,
restarting killed tasks, etc.

If workload is complex, and has some sub-parts with their own memory
constraints, it's quite possible, that it's ready to restart these parts,
but not a random process killed by the global OOM. This is actually a proper
replacement for setting oom_score_adj: let say there is memcg A,
which contains some control stuff in A/C, and several sub-workloads
A/W1, A/W2, etc. In case of global OOM, caused by system miss-configuration,
or, say, a memory leak in the control stuff, it makes perfect sense to
kill A as a whole, so we can set A/memory.oom_groups to 1.

But if there is a memory shortage in one of the workers (A/W1, for instance),
it's quite possible that killing everything is excessive. So, a user has
the freedom to decide what's the proper way to handle OOM.

>  
> > Even if we don't have practical examples, we should build something less
> > surprising for a user, and I don't understand why oom_group should be inherited.
> 
> I guess we want to inherit the value on the memcg creation but I agree
> that enforcing parent setting is weird. I will think about it some more
> but I agree that it is saner to only enforce per memcg value.

I'm not against, but we should come up with a good explanation, why we're
inheriting it; or not inherit.

>  
> > > > Tasks with oom_score_adj set to -1000 are considered as unkillable.
> > > > 
> > > > The root cgroup is treated as a leaf memory cgroup, so it's score
> > > > is compared with other leaf and oom_group memory cgroups.
> > > > The oom_group option is not supported for the root cgroup.
> > > > Due to memcg statistics implementation a special algorithm
> > > > is used for estimating root cgroup oom_score: we define it
> > > > as maximum oom_score of the belonging tasks.
> > > 
> > > [1] http://lkml.kernel.org/r/20171002124712.GA17638@castle.DHCP.thefacebook.com
> > > 
> > > [...]
> > > > +static long memcg_oom_badness(struct mem_cgroup *memcg,
> > > > +			      const nodemask_t *nodemask,
> > > > +			      unsigned long totalpages)
> > > > +{
> > > > +	long points = 0;
> > > > +	int nid;
> > > > +	pg_data_t *pgdat;
> > > > +
> > > > +	/*
> > > > +	 * We don't have necessary stats for the root memcg,
> > > > +	 * so we define it's oom_score as the maximum oom_score
> > > > +	 * of the belonging tasks.
> > > > +	 */
> > > 
> > > Why not a sum of all tasks which would more resemble what we do for
> > > other memcgs? Sure this would require ignoring oom_score_adj so
> > > oom_badness would have to be tweaked a bit (basically split it into
> > > __oom_badness which calculates the value without the bias and
> > > oom_badness on top adding the bias on top of the scaled value).
> > 
> > We've discussed it already: calculating the sum is tricky, as tasks
> > are sharing memory (and the mm struct(. As I remember, you suggested
> > using maximum to solve exactly this problem, and I think it's a good
> > approximation. Assuming that tasks in the root cgroup likely have
> > nothing in common, and we don't support oom_group for it, looking
> > at the biggest task makes perfect sense: we're exactly comparing
> > killable entities.
> 
> Please add a comment explaining that. I hope we can make root memcg less
> special eventually. It shouldn't be all that hard. We already have per
> LRU numbers and we only use few counters which could be accounted to the
> root memcg as well. Counters should be quite cheap.

Sure, this is my hope too.

> 
> [...]
> 
> > > > @@ -962,6 +968,48 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
> > > >  	__oom_kill_process(victim);
> > > >  }
> > > >  
> > > > +static int oom_kill_memcg_member(struct task_struct *task, void *unused)
> > > > +{
> > > > +	if (!tsk_is_oom_victim(task)) {
> > > 
> > > How can this happen?
> > 
> > We do start with killing the largest process, and then iterate over all tasks
> > in the cgroup. So, this check is required to avoid killing tasks which are
> > already in the termination process.
> 
> Do you mean we have tsk_is_oom_victim && MMF_OOM_SKIP == T?

No, just tsk_is_oom_victim. We're are killing the biggest task, and then _all_
tasks. This is a way to skip the biggest task, and do not kill it again.

>  
> > > 
> > > > +		get_task_struct(task);
> > > > +		__oom_kill_process(task);
> > > > +	}
> > > > +	return 0;
> > > > +}
> > > > +
> > > > +static bool oom_kill_memcg_victim(struct oom_control *oc)
> > > > +{
> > > > +	static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
> > > > +				      DEFAULT_RATELIMIT_BURST);
> > > > +
> > > > +	if (oc->chosen_memcg == NULL || oc->chosen_memcg == INFLIGHT_VICTIM)
> > > > +		return oc->chosen_memcg;
> > > > +
> > > > +	/* Always begin with the task with the biggest memory footprint */
> > > > +	oc->chosen_points = 0;
> > > > +	oc->chosen_task = NULL;
> > > > +	mem_cgroup_scan_tasks(oc->chosen_memcg, oom_evaluate_task, oc);
> > > > +
> > > > +	if (oc->chosen_task == NULL || oc->chosen_task == INFLIGHT_VICTIM)
> > > > +		goto out;
> > > > +
> > > > +	if (__ratelimit(&oom_rs))
> > > > +		dump_header(oc, oc->chosen_task);
> > > 
> > > Hmm, does the full dump_header really apply for the new heuristic? E.g.
> > > does it make sense to dump_tasks()? Would it make sense to print stats
> > > of all eligible memcgs instead?
> > 
> > Hm, this is a tricky part: the dmesg output is at some point a part of ABI,
> 
> People are parsing oom reports but I disagree this is an ABI of any
> sort. The report is closely tight to the particular implementation and
> as such it has changed several times over the time.
> 
> > but is also closely connected with the implementation. So I would suggest
> > to postpone this until we'll get more usage examples and will better
> > understand what information we need.
> 
> I would drop tasks list at least because that is clearly misleading in
> this context because we are not selecting from all tasks. We are
> selecting between memcgs. The memcg information can be added in a
> separate patch of course.

Let's postpone it until we'll land the rest of the patchset.

Thank you!

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [v9 3/5] mm, oom: cgroup-aware OOM killer
  2017-10-03 14:08           ` Roman Gushchin
@ 2017-10-03 14:22             ` Michal Hocko
  -1 siblings, 0 replies; 46+ messages in thread
From: Michal Hocko @ 2017-10-03 14:22 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: linux-mm, Vladimir Davydov, Johannes Weiner, Tetsuo Handa,
	David Rientjes, Andrew Morton, Tejun Heo, kernel-team, cgroups,
	linux-doc, linux-kernel

On Tue 03-10-17 15:08:41, Roman Gushchin wrote:
> On Tue, Oct 03, 2017 at 03:36:23PM +0200, Michal Hocko wrote:
[...]
> > I guess we want to inherit the value on the memcg creation but I agree
> > that enforcing parent setting is weird. I will think about it some more
> > but I agree that it is saner to only enforce per memcg value.
> 
> I'm not against, but we should come up with a good explanation, why we're
> inheriting it; or not inherit.

Inheriting sounds like a less surprising behavior. Once you opt in for
oom_group you can expect that descendants are going to assume the same
unless they explicitly state otherwise.

[...]
> > > > > @@ -962,6 +968,48 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
> > > > >  	__oom_kill_process(victim);
> > > > >  }
> > > > >  
> > > > > +static int oom_kill_memcg_member(struct task_struct *task, void *unused)
> > > > > +{
> > > > > +	if (!tsk_is_oom_victim(task)) {
> > > > 
> > > > How can this happen?
> > > 
> > > We do start with killing the largest process, and then iterate over all tasks
> > > in the cgroup. So, this check is required to avoid killing tasks which are
> > > already in the termination process.
> > 
> > Do you mean we have tsk_is_oom_victim && MMF_OOM_SKIP == T?
> 
> No, just tsk_is_oom_victim. We're are killing the biggest task, and then _all_
> tasks. This is a way to skip the biggest task, and do not kill it again.

OK, I have missed that part. Why are we doing that actually? Why don't
we simply do 
	/* If oom_group flag is set, kill all belonging tasks */
	if (mem_cgroup_oom_group(oc->chosen_memcg))
		mem_cgroup_scan_tasks(oc->chosen_memcg, oom_kill_memcg_member,
				      NULL);

we are going to kill all the tasks anyway.

[...]
> > > > Hmm, does the full dump_header really apply for the new heuristic? E.g.
> > > > does it make sense to dump_tasks()? Would it make sense to print stats
> > > > of all eligible memcgs instead?
> > > 
> > > Hm, this is a tricky part: the dmesg output is at some point a part of ABI,
> > 
> > People are parsing oom reports but I disagree this is an ABI of any
> > sort. The report is closely tight to the particular implementation and
> > as such it has changed several times over the time.
> > 
> > > but is also closely connected with the implementation. So I would suggest
> > > to postpone this until we'll get more usage examples and will better
> > > understand what information we need.
> > 
> > I would drop tasks list at least because that is clearly misleading in
> > this context because we are not selecting from all tasks. We are
> > selecting between memcgs. The memcg information can be added in a
> > separate patch of course.
> 
> Let's postpone it until we'll land the rest of the patchset.

This is certainly not a show stopper but I would like to resolve it
sooner rather than later.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [v9 3/5] mm, oom: cgroup-aware OOM killer
@ 2017-10-03 14:22             ` Michal Hocko
  0 siblings, 0 replies; 46+ messages in thread
From: Michal Hocko @ 2017-10-03 14:22 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: linux-mm, Vladimir Davydov, Johannes Weiner, Tetsuo Handa,
	David Rientjes, Andrew Morton, Tejun Heo, kernel-team, cgroups,
	linux-doc, linux-kernel

On Tue 03-10-17 15:08:41, Roman Gushchin wrote:
> On Tue, Oct 03, 2017 at 03:36:23PM +0200, Michal Hocko wrote:
[...]
> > I guess we want to inherit the value on the memcg creation but I agree
> > that enforcing parent setting is weird. I will think about it some more
> > but I agree that it is saner to only enforce per memcg value.
> 
> I'm not against, but we should come up with a good explanation, why we're
> inheriting it; or not inherit.

Inheriting sounds like a less surprising behavior. Once you opt in for
oom_group you can expect that descendants are going to assume the same
unless they explicitly state otherwise.

[...]
> > > > > @@ -962,6 +968,48 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
> > > > >  	__oom_kill_process(victim);
> > > > >  }
> > > > >  
> > > > > +static int oom_kill_memcg_member(struct task_struct *task, void *unused)
> > > > > +{
> > > > > +	if (!tsk_is_oom_victim(task)) {
> > > > 
> > > > How can this happen?
> > > 
> > > We do start with killing the largest process, and then iterate over all tasks
> > > in the cgroup. So, this check is required to avoid killing tasks which are
> > > already in the termination process.
> > 
> > Do you mean we have tsk_is_oom_victim && MMF_OOM_SKIP == T?
> 
> No, just tsk_is_oom_victim. We're are killing the biggest task, and then _all_
> tasks. This is a way to skip the biggest task, and do not kill it again.

OK, I have missed that part. Why are we doing that actually? Why don't
we simply do 
	/* If oom_group flag is set, kill all belonging tasks */
	if (mem_cgroup_oom_group(oc->chosen_memcg))
		mem_cgroup_scan_tasks(oc->chosen_memcg, oom_kill_memcg_member,
				      NULL);

we are going to kill all the tasks anyway.

[...]
> > > > Hmm, does the full dump_header really apply for the new heuristic? E.g.
> > > > does it make sense to dump_tasks()? Would it make sense to print stats
> > > > of all eligible memcgs instead?
> > > 
> > > Hm, this is a tricky part: the dmesg output is at some point a part of ABI,
> > 
> > People are parsing oom reports but I disagree this is an ABI of any
> > sort. The report is closely tight to the particular implementation and
> > as such it has changed several times over the time.
> > 
> > > but is also closely connected with the implementation. So I would suggest
> > > to postpone this until we'll get more usage examples and will better
> > > understand what information we need.
> > 
> > I would drop tasks list at least because that is clearly misleading in
> > this context because we are not selecting from all tasks. We are
> > selecting between memcgs. The memcg information can be added in a
> > separate patch of course.
> 
> Let's postpone it until we'll land the rest of the patchset.

This is certainly not a show stopper but I would like to resolve it
sooner rather than later.
-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [v9 3/5] mm, oom: cgroup-aware OOM killer
  2017-10-03 14:22             ` Michal Hocko
  (?)
@ 2017-10-03 14:35               ` Tejun Heo
  -1 siblings, 0 replies; 46+ messages in thread
From: Tejun Heo @ 2017-10-03 14:35 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Roman Gushchin, linux-mm, Vladimir Davydov, Johannes Weiner,
	Tetsuo Handa, David Rientjes, Andrew Morton, kernel-team,
	cgroups, linux-doc, linux-kernel

Hello, Michal.

On Tue, Oct 03, 2017 at 04:22:46PM +0200, Michal Hocko wrote:
> On Tue 03-10-17 15:08:41, Roman Gushchin wrote:
> > On Tue, Oct 03, 2017 at 03:36:23PM +0200, Michal Hocko wrote:
> [...]
> > > I guess we want to inherit the value on the memcg creation but I agree
> > > that enforcing parent setting is weird. I will think about it some more
> > > but I agree that it is saner to only enforce per memcg value.
> > 
> > I'm not against, but we should come up with a good explanation, why we're
> > inheriting it; or not inherit.
> 
> Inheriting sounds like a less surprising behavior. Once you opt in for
> oom_group you can expect that descendants are going to assume the same
> unless they explicitly state otherwise.

Here's a counter example.

Let's say there's a container which hosts one main application, and
the container shares its host with other containers.

* Let's say the container is a regular containerized OS instance and
  can't really guarantee system integrity if one its processes gets
  randomly killed.

* However, the application that it's running inside an isolated cgroup
  is more intelligent and composed of multiple interchangeable
  processes and can treat killing of a random process as partial
  capacity loss.

When the host is setting up the outer container, it doesn't
necessarily know whether the containerized environment would be able
to handle partial OOM kills or not.  It's akin to panic_on_oom setting
at system level - it's the containerized instance itself which knows
whether it can handle partial OOM kills or not.  This is why this knob
should be delegatable.

Now, the container itself has group OOM set and the isolated main
application is starting up.  It obviously wants partial OOM kills
rather than group killing.  This is the same principle.  The
application which is being contained in the cgroup is the one which
knows how it can handle OOM conditions, not the outer environment, so
it obviously needs to be able to set the configuration it wants.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [v9 3/5] mm, oom: cgroup-aware OOM killer
@ 2017-10-03 14:35               ` Tejun Heo
  0 siblings, 0 replies; 46+ messages in thread
From: Tejun Heo @ 2017-10-03 14:35 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Roman Gushchin, linux-mm, Vladimir Davydov, Johannes Weiner,
	Tetsuo Handa, David Rientjes, Andrew Morton, kernel-team,
	cgroups, linux-doc, linux-kernel

Hello, Michal.

On Tue, Oct 03, 2017 at 04:22:46PM +0200, Michal Hocko wrote:
> On Tue 03-10-17 15:08:41, Roman Gushchin wrote:
> > On Tue, Oct 03, 2017 at 03:36:23PM +0200, Michal Hocko wrote:
> [...]
> > > I guess we want to inherit the value on the memcg creation but I agree
> > > that enforcing parent setting is weird. I will think about it some more
> > > but I agree that it is saner to only enforce per memcg value.
> > 
> > I'm not against, but we should come up with a good explanation, why we're
> > inheriting it; or not inherit.
> 
> Inheriting sounds like a less surprising behavior. Once you opt in for
> oom_group you can expect that descendants are going to assume the same
> unless they explicitly state otherwise.

Here's a counter example.

Let's say there's a container which hosts one main application, and
the container shares its host with other containers.

* Let's say the container is a regular containerized OS instance and
  can't really guarantee system integrity if one its processes gets
  randomly killed.

* However, the application that it's running inside an isolated cgroup
  is more intelligent and composed of multiple interchangeable
  processes and can treat killing of a random process as partial
  capacity loss.

When the host is setting up the outer container, it doesn't
necessarily know whether the containerized environment would be able
to handle partial OOM kills or not.  It's akin to panic_on_oom setting
at system level - it's the containerized instance itself which knows
whether it can handle partial OOM kills or not.  This is why this knob
should be delegatable.

Now, the container itself has group OOM set and the isolated main
application is starting up.  It obviously wants partial OOM kills
rather than group killing.  This is the same principle.  The
application which is being contained in the cgroup is the one which
knows how it can handle OOM conditions, not the outer environment, so
it obviously needs to be able to set the configuration it wants.

Thanks.

-- 
tejun

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [v9 3/5] mm, oom: cgroup-aware OOM killer
@ 2017-10-03 14:35               ` Tejun Heo
  0 siblings, 0 replies; 46+ messages in thread
From: Tejun Heo @ 2017-10-03 14:35 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Roman Gushchin, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	Vladimir Davydov, Johannes Weiner, Tetsuo Handa, David Rientjes,
	Andrew Morton, kernel-team-b10kYP2dOMg,
	cgroups-u79uwXL29TY76Z2rM5mHXA, linux-doc-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA

Hello, Michal.

On Tue, Oct 03, 2017 at 04:22:46PM +0200, Michal Hocko wrote:
> On Tue 03-10-17 15:08:41, Roman Gushchin wrote:
> > On Tue, Oct 03, 2017 at 03:36:23PM +0200, Michal Hocko wrote:
> [...]
> > > I guess we want to inherit the value on the memcg creation but I agree
> > > that enforcing parent setting is weird. I will think about it some more
> > > but I agree that it is saner to only enforce per memcg value.
> > 
> > I'm not against, but we should come up with a good explanation, why we're
> > inheriting it; or not inherit.
> 
> Inheriting sounds like a less surprising behavior. Once you opt in for
> oom_group you can expect that descendants are going to assume the same
> unless they explicitly state otherwise.

Here's a counter example.

Let's say there's a container which hosts one main application, and
the container shares its host with other containers.

* Let's say the container is a regular containerized OS instance and
  can't really guarantee system integrity if one its processes gets
  randomly killed.

* However, the application that it's running inside an isolated cgroup
  is more intelligent and composed of multiple interchangeable
  processes and can treat killing of a random process as partial
  capacity loss.

When the host is setting up the outer container, it doesn't
necessarily know whether the containerized environment would be able
to handle partial OOM kills or not.  It's akin to panic_on_oom setting
at system level - it's the containerized instance itself which knows
whether it can handle partial OOM kills or not.  This is why this knob
should be delegatable.

Now, the container itself has group OOM set and the isolated main
application is starting up.  It obviously wants partial OOM kills
rather than group killing.  This is the same principle.  The
application which is being contained in the cgroup is the one which
knows how it can handle OOM conditions, not the outer environment, so
it obviously needs to be able to set the configuration it wants.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [v9 3/5] mm, oom: cgroup-aware OOM killer
  2017-10-03 14:22             ` Michal Hocko
  (?)
@ 2017-10-03 14:38               ` Roman Gushchin
  -1 siblings, 0 replies; 46+ messages in thread
From: Roman Gushchin @ 2017-10-03 14:38 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Vladimir Davydov, Johannes Weiner, Tetsuo Handa,
	David Rientjes, Andrew Morton, Tejun Heo, kernel-team, cgroups,
	linux-doc, linux-kernel

On Tue, Oct 03, 2017 at 04:22:46PM +0200, Michal Hocko wrote:
> On Tue 03-10-17 15:08:41, Roman Gushchin wrote:
> > On Tue, Oct 03, 2017 at 03:36:23PM +0200, Michal Hocko wrote:
> [...]
> > > I guess we want to inherit the value on the memcg creation but I agree
> > > that enforcing parent setting is weird. I will think about it some more
> > > but I agree that it is saner to only enforce per memcg value.
> > 
> > I'm not against, but we should come up with a good explanation, why we're
> > inheriting it; or not inherit.
> 
> Inheriting sounds like a less surprising behavior. Once you opt in for
> oom_group you can expect that descendants are going to assume the same
> unless they explicitly state otherwise.
> 
> [...]
> > > > > > @@ -962,6 +968,48 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
> > > > > >  	__oom_kill_process(victim);
> > > > > >  }
> > > > > >  
> > > > > > +static int oom_kill_memcg_member(struct task_struct *task, void *unused)
> > > > > > +{
> > > > > > +	if (!tsk_is_oom_victim(task)) {
> > > > > 
> > > > > How can this happen?
> > > > 
> > > > We do start with killing the largest process, and then iterate over all tasks
> > > > in the cgroup. So, this check is required to avoid killing tasks which are
> > > > already in the termination process.
> > > 
> > > Do you mean we have tsk_is_oom_victim && MMF_OOM_SKIP == T?
> > 
> > No, just tsk_is_oom_victim. We're are killing the biggest task, and then _all_
> > tasks. This is a way to skip the biggest task, and do not kill it again.
> 
> OK, I have missed that part. Why are we doing that actually? Why don't
> we simply do 
> 	/* If oom_group flag is set, kill all belonging tasks */
> 	if (mem_cgroup_oom_group(oc->chosen_memcg))
> 		mem_cgroup_scan_tasks(oc->chosen_memcg, oom_kill_memcg_member,
> 				      NULL);
> 
> we are going to kill all the tasks anyway.

Well, the idea behind was that killing the biggest process give us better
chances to get out of global memory shortage and guarantee forward progress.
I can drop it, if it considered to be excessive.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [v9 3/5] mm, oom: cgroup-aware OOM killer
@ 2017-10-03 14:38               ` Roman Gushchin
  0 siblings, 0 replies; 46+ messages in thread
From: Roman Gushchin @ 2017-10-03 14:38 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Vladimir Davydov, Johannes Weiner, Tetsuo Handa,
	David Rientjes, Andrew Morton, Tejun Heo, kernel-team, cgroups,
	linux-doc, linux-kernel

On Tue, Oct 03, 2017 at 04:22:46PM +0200, Michal Hocko wrote:
> On Tue 03-10-17 15:08:41, Roman Gushchin wrote:
> > On Tue, Oct 03, 2017 at 03:36:23PM +0200, Michal Hocko wrote:
> [...]
> > > I guess we want to inherit the value on the memcg creation but I agree
> > > that enforcing parent setting is weird. I will think about it some more
> > > but I agree that it is saner to only enforce per memcg value.
> > 
> > I'm not against, but we should come up with a good explanation, why we're
> > inheriting it; or not inherit.
> 
> Inheriting sounds like a less surprising behavior. Once you opt in for
> oom_group you can expect that descendants are going to assume the same
> unless they explicitly state otherwise.
> 
> [...]
> > > > > > @@ -962,6 +968,48 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
> > > > > >  	__oom_kill_process(victim);
> > > > > >  }
> > > > > >  
> > > > > > +static int oom_kill_memcg_member(struct task_struct *task, void *unused)
> > > > > > +{
> > > > > > +	if (!tsk_is_oom_victim(task)) {
> > > > > 
> > > > > How can this happen?
> > > > 
> > > > We do start with killing the largest process, and then iterate over all tasks
> > > > in the cgroup. So, this check is required to avoid killing tasks which are
> > > > already in the termination process.
> > > 
> > > Do you mean we have tsk_is_oom_victim && MMF_OOM_SKIP == T?
> > 
> > No, just tsk_is_oom_victim. We're are killing the biggest task, and then _all_
> > tasks. This is a way to skip the biggest task, and do not kill it again.
> 
> OK, I have missed that part. Why are we doing that actually? Why don't
> we simply do 
> 	/* If oom_group flag is set, kill all belonging tasks */
> 	if (mem_cgroup_oom_group(oc->chosen_memcg))
> 		mem_cgroup_scan_tasks(oc->chosen_memcg, oom_kill_memcg_member,
> 				      NULL);
> 
> we are going to kill all the tasks anyway.

Well, the idea behind was that killing the biggest process give us better
chances to get out of global memory shortage and guarantee forward progress.
I can drop it, if it considered to be excessive.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [v9 3/5] mm, oom: cgroup-aware OOM killer
@ 2017-10-03 14:38               ` Roman Gushchin
  0 siblings, 0 replies; 46+ messages in thread
From: Roman Gushchin @ 2017-10-03 14:38 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Vladimir Davydov, Johannes Weiner, Tetsuo Handa,
	David Rientjes, Andrew Morton, Tejun Heo, kernel-team, cgroups,
	linux-doc, linux-kernel

On Tue, Oct 03, 2017 at 04:22:46PM +0200, Michal Hocko wrote:
> On Tue 03-10-17 15:08:41, Roman Gushchin wrote:
> > On Tue, Oct 03, 2017 at 03:36:23PM +0200, Michal Hocko wrote:
> [...]
> > > I guess we want to inherit the value on the memcg creation but I agree
> > > that enforcing parent setting is weird. I will think about it some more
> > > but I agree that it is saner to only enforce per memcg value.
> > 
> > I'm not against, but we should come up with a good explanation, why we're
> > inheriting it; or not inherit.
> 
> Inheriting sounds like a less surprising behavior. Once you opt in for
> oom_group you can expect that descendants are going to assume the same
> unless they explicitly state otherwise.
> 
> [...]
> > > > > > @@ -962,6 +968,48 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
> > > > > >  	__oom_kill_process(victim);
> > > > > >  }
> > > > > >  
> > > > > > +static int oom_kill_memcg_member(struct task_struct *task, void *unused)
> > > > > > +{
> > > > > > +	if (!tsk_is_oom_victim(task)) {
> > > > > 
> > > > > How can this happen?
> > > > 
> > > > We do start with killing the largest process, and then iterate over all tasks
> > > > in the cgroup. So, this check is required to avoid killing tasks which are
> > > > already in the termination process.
> > > 
> > > Do you mean we have tsk_is_oom_victim && MMF_OOM_SKIP == T?
> > 
> > No, just tsk_is_oom_victim. We're are killing the biggest task, and then _all_
> > tasks. This is a way to skip the biggest task, and do not kill it again.
> 
> OK, I have missed that part. Why are we doing that actually? Why don't
> we simply do 
> 	/* If oom_group flag is set, kill all belonging tasks */
> 	if (mem_cgroup_oom_group(oc->chosen_memcg))
> 		mem_cgroup_scan_tasks(oc->chosen_memcg, oom_kill_memcg_member,
> 				      NULL);
> 
> we are going to kill all the tasks anyway.

Well, the idea behind was that killing the biggest process give us better
chances to get out of global memory shortage and guarantee forward progress.
I can drop it, if it considered to be excessive.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [v9 3/5] mm, oom: cgroup-aware OOM killer
  2017-10-03 14:38               ` Roman Gushchin
@ 2017-10-03 14:43                 ` Michal Hocko
  -1 siblings, 0 replies; 46+ messages in thread
From: Michal Hocko @ 2017-10-03 14:43 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: linux-mm, Vladimir Davydov, Johannes Weiner, Tetsuo Handa,
	David Rientjes, Andrew Morton, Tejun Heo, kernel-team, cgroups,
	linux-doc, linux-kernel

On Tue 03-10-17 15:38:08, Roman Gushchin wrote:
> On Tue, Oct 03, 2017 at 04:22:46PM +0200, Michal Hocko wrote:
> > On Tue 03-10-17 15:08:41, Roman Gushchin wrote:
> > > On Tue, Oct 03, 2017 at 03:36:23PM +0200, Michal Hocko wrote:
> > [...]
> > > > I guess we want to inherit the value on the memcg creation but I agree
> > > > that enforcing parent setting is weird. I will think about it some more
> > > > but I agree that it is saner to only enforce per memcg value.
> > > 
> > > I'm not against, but we should come up with a good explanation, why we're
> > > inheriting it; or not inherit.
> > 
> > Inheriting sounds like a less surprising behavior. Once you opt in for
> > oom_group you can expect that descendants are going to assume the same
> > unless they explicitly state otherwise.
> > 
> > [...]
> > > > > > > @@ -962,6 +968,48 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
> > > > > > >  	__oom_kill_process(victim);
> > > > > > >  }
> > > > > > >  
> > > > > > > +static int oom_kill_memcg_member(struct task_struct *task, void *unused)
> > > > > > > +{
> > > > > > > +	if (!tsk_is_oom_victim(task)) {
> > > > > > 
> > > > > > How can this happen?
> > > > > 
> > > > > We do start with killing the largest process, and then iterate over all tasks
> > > > > in the cgroup. So, this check is required to avoid killing tasks which are
> > > > > already in the termination process.
> > > > 
> > > > Do you mean we have tsk_is_oom_victim && MMF_OOM_SKIP == T?
> > > 
> > > No, just tsk_is_oom_victim. We're are killing the biggest task, and then _all_
> > > tasks. This is a way to skip the biggest task, and do not kill it again.
> > 
> > OK, I have missed that part. Why are we doing that actually? Why don't
> > we simply do 
> > 	/* If oom_group flag is set, kill all belonging tasks */
> > 	if (mem_cgroup_oom_group(oc->chosen_memcg))
> > 		mem_cgroup_scan_tasks(oc->chosen_memcg, oom_kill_memcg_member,
> > 				      NULL);
> > 
> > we are going to kill all the tasks anyway.
> 
> Well, the idea behind was that killing the biggest process give us better
> chances to get out of global memory shortage and guarantee forward progress.
> I can drop it, if it considered to be excessive.

Yes, please do so. If we need it then we can do that in a separate patch
along with the explanation why it is needed.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [v9 3/5] mm, oom: cgroup-aware OOM killer
@ 2017-10-03 14:43                 ` Michal Hocko
  0 siblings, 0 replies; 46+ messages in thread
From: Michal Hocko @ 2017-10-03 14:43 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: linux-mm, Vladimir Davydov, Johannes Weiner, Tetsuo Handa,
	David Rientjes, Andrew Morton, Tejun Heo, kernel-team, cgroups,
	linux-doc, linux-kernel

On Tue 03-10-17 15:38:08, Roman Gushchin wrote:
> On Tue, Oct 03, 2017 at 04:22:46PM +0200, Michal Hocko wrote:
> > On Tue 03-10-17 15:08:41, Roman Gushchin wrote:
> > > On Tue, Oct 03, 2017 at 03:36:23PM +0200, Michal Hocko wrote:
> > [...]
> > > > I guess we want to inherit the value on the memcg creation but I agree
> > > > that enforcing parent setting is weird. I will think about it some more
> > > > but I agree that it is saner to only enforce per memcg value.
> > > 
> > > I'm not against, but we should come up with a good explanation, why we're
> > > inheriting it; or not inherit.
> > 
> > Inheriting sounds like a less surprising behavior. Once you opt in for
> > oom_group you can expect that descendants are going to assume the same
> > unless they explicitly state otherwise.
> > 
> > [...]
> > > > > > > @@ -962,6 +968,48 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
> > > > > > >  	__oom_kill_process(victim);
> > > > > > >  }
> > > > > > >  
> > > > > > > +static int oom_kill_memcg_member(struct task_struct *task, void *unused)
> > > > > > > +{
> > > > > > > +	if (!tsk_is_oom_victim(task)) {
> > > > > > 
> > > > > > How can this happen?
> > > > > 
> > > > > We do start with killing the largest process, and then iterate over all tasks
> > > > > in the cgroup. So, this check is required to avoid killing tasks which are
> > > > > already in the termination process.
> > > > 
> > > > Do you mean we have tsk_is_oom_victim && MMF_OOM_SKIP == T?
> > > 
> > > No, just tsk_is_oom_victim. We're are killing the biggest task, and then _all_
> > > tasks. This is a way to skip the biggest task, and do not kill it again.
> > 
> > OK, I have missed that part. Why are we doing that actually? Why don't
> > we simply do 
> > 	/* If oom_group flag is set, kill all belonging tasks */
> > 	if (mem_cgroup_oom_group(oc->chosen_memcg))
> > 		mem_cgroup_scan_tasks(oc->chosen_memcg, oom_kill_memcg_member,
> > 				      NULL);
> > 
> > we are going to kill all the tasks anyway.
> 
> Well, the idea behind was that killing the biggest process give us better
> chances to get out of global memory shortage and guarantee forward progress.
> I can drop it, if it considered to be excessive.

Yes, please do so. If we need it then we can do that in a separate patch
along with the explanation why it is needed.

-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [v9 3/5] mm, oom: cgroup-aware OOM killer
  2017-10-03 14:35               ` Tejun Heo
@ 2017-10-04  9:29                 ` Michal Hocko
  -1 siblings, 0 replies; 46+ messages in thread
From: Michal Hocko @ 2017-10-04  9:29 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Roman Gushchin, linux-mm, Vladimir Davydov, Johannes Weiner,
	Tetsuo Handa, David Rientjes, Andrew Morton, kernel-team,
	cgroups, linux-doc, linux-kernel

On Tue 03-10-17 07:35:59, Tejun Heo wrote:
> Hello, Michal.
> 
> On Tue, Oct 03, 2017 at 04:22:46PM +0200, Michal Hocko wrote:
> > On Tue 03-10-17 15:08:41, Roman Gushchin wrote:
> > > On Tue, Oct 03, 2017 at 03:36:23PM +0200, Michal Hocko wrote:
> > [...]
> > > > I guess we want to inherit the value on the memcg creation but I agree
> > > > that enforcing parent setting is weird. I will think about it some more
> > > > but I agree that it is saner to only enforce per memcg value.
> > > 
> > > I'm not against, but we should come up with a good explanation, why we're
> > > inheriting it; or not inherit.
> > 
> > Inheriting sounds like a less surprising behavior. Once you opt in for
> > oom_group you can expect that descendants are going to assume the same
> > unless they explicitly state otherwise.
> 
> Here's a counter example.
> 
> Let's say there's a container which hosts one main application, and
> the container shares its host with other containers.
> 
> * Let's say the container is a regular containerized OS instance and
>   can't really guarantee system integrity if one its processes gets
>   randomly killed.
> 
> * However, the application that it's running inside an isolated cgroup
>   is more intelligent and composed of multiple interchangeable
>   processes and can treat killing of a random process as partial
>   capacity loss.
> 
> When the host is setting up the outer container, it doesn't
> necessarily know whether the containerized environment would be able
> to handle partial OOM kills or not.  It's akin to panic_on_oom setting
> at system level - it's the containerized instance itself which knows
> whether it can handle partial OOM kills or not.  This is why this knob
> should be delegatable.
> 
> Now, the container itself has group OOM set and the isolated main
> application is starting up.  It obviously wants partial OOM kills
> rather than group killing.  This is the same principle.  The
> application which is being contained in the cgroup is the one which
> knows how it can handle OOM conditions, not the outer environment, so
> it obviously needs to be able to set the configuration it wants.

Yes this makes a lot of sense. On the other hand we used to copy other
reclaim specific atributes like swappiness and oom_kill_disable.

I guess we should be OK with "non-hierarchical" behavior when it is
documented properly so that there are surpasses.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [v9 3/5] mm, oom: cgroup-aware OOM killer
@ 2017-10-04  9:29                 ` Michal Hocko
  0 siblings, 0 replies; 46+ messages in thread
From: Michal Hocko @ 2017-10-04  9:29 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Roman Gushchin, linux-mm, Vladimir Davydov, Johannes Weiner,
	Tetsuo Handa, David Rientjes, Andrew Morton, kernel-team,
	cgroups, linux-doc, linux-kernel

On Tue 03-10-17 07:35:59, Tejun Heo wrote:
> Hello, Michal.
> 
> On Tue, Oct 03, 2017 at 04:22:46PM +0200, Michal Hocko wrote:
> > On Tue 03-10-17 15:08:41, Roman Gushchin wrote:
> > > On Tue, Oct 03, 2017 at 03:36:23PM +0200, Michal Hocko wrote:
> > [...]
> > > > I guess we want to inherit the value on the memcg creation but I agree
> > > > that enforcing parent setting is weird. I will think about it some more
> > > > but I agree that it is saner to only enforce per memcg value.
> > > 
> > > I'm not against, but we should come up with a good explanation, why we're
> > > inheriting it; or not inherit.
> > 
> > Inheriting sounds like a less surprising behavior. Once you opt in for
> > oom_group you can expect that descendants are going to assume the same
> > unless they explicitly state otherwise.
> 
> Here's a counter example.
> 
> Let's say there's a container which hosts one main application, and
> the container shares its host with other containers.
> 
> * Let's say the container is a regular containerized OS instance and
>   can't really guarantee system integrity if one its processes gets
>   randomly killed.
> 
> * However, the application that it's running inside an isolated cgroup
>   is more intelligent and composed of multiple interchangeable
>   processes and can treat killing of a random process as partial
>   capacity loss.
> 
> When the host is setting up the outer container, it doesn't
> necessarily know whether the containerized environment would be able
> to handle partial OOM kills or not.  It's akin to panic_on_oom setting
> at system level - it's the containerized instance itself which knows
> whether it can handle partial OOM kills or not.  This is why this knob
> should be delegatable.
> 
> Now, the container itself has group OOM set and the isolated main
> application is starting up.  It obviously wants partial OOM kills
> rather than group killing.  This is the same principle.  The
> application which is being contained in the cgroup is the one which
> knows how it can handle OOM conditions, not the outer environment, so
> it obviously needs to be able to set the configuration it wants.

Yes this makes a lot of sense. On the other hand we used to copy other
reclaim specific atributes like swappiness and oom_kill_disable.

I guess we should be OK with "non-hierarchical" behavior when it is
documented properly so that there are surpasses.

-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [v9 3/5] mm, oom: cgroup-aware OOM killer
  2017-10-03 14:22             ` Michal Hocko
  (?)
@ 2017-10-04 15:04               ` Roman Gushchin
  -1 siblings, 0 replies; 46+ messages in thread
From: Roman Gushchin @ 2017-10-04 15:04 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Vladimir Davydov, Johannes Weiner, Tetsuo Handa,
	David Rientjes, Andrew Morton, Tejun Heo, kernel-team, cgroups,
	linux-doc, linux-kernel

On Tue, Oct 03, 2017 at 04:22:46PM +0200, Michal Hocko wrote:
> On Tue 03-10-17 15:08:41, Roman Gushchin wrote:
> > On Tue, Oct 03, 2017 at 03:36:23PM +0200, Michal Hocko wrote:
> [...]
> > > I guess we want to inherit the value on the memcg creation but I agree
> > > that enforcing parent setting is weird. I will think about it some more
> > > but I agree that it is saner to only enforce per memcg value.
> > 
> > I'm not against, but we should come up with a good explanation, why we're
> > inheriting it; or not inherit.
> 
> Inheriting sounds like a less surprising behavior. Once you opt in for
> oom_group you can expect that descendants are going to assume the same
> unless they explicitly state otherwise.

Not sure I understand why. Setting memory.oom_group on a child memcg
has absolutely no meaning until memory.max is also set. In case of OOM
scoped to the parent memcg or above, parent's value defines the behavior.

If a user decides to create a separate OOM domain (be setting the hard
memory limit), he/she can also make a decision on how OOM event should
be handled.

Thanks!

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [v9 3/5] mm, oom: cgroup-aware OOM killer
@ 2017-10-04 15:04               ` Roman Gushchin
  0 siblings, 0 replies; 46+ messages in thread
From: Roman Gushchin @ 2017-10-04 15:04 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Vladimir Davydov, Johannes Weiner, Tetsuo Handa,
	David Rientjes, Andrew Morton, Tejun Heo, kernel-team, cgroups,
	linux-doc, linux-kernel

On Tue, Oct 03, 2017 at 04:22:46PM +0200, Michal Hocko wrote:
> On Tue 03-10-17 15:08:41, Roman Gushchin wrote:
> > On Tue, Oct 03, 2017 at 03:36:23PM +0200, Michal Hocko wrote:
> [...]
> > > I guess we want to inherit the value on the memcg creation but I agree
> > > that enforcing parent setting is weird. I will think about it some more
> > > but I agree that it is saner to only enforce per memcg value.
> > 
> > I'm not against, but we should come up with a good explanation, why we're
> > inheriting it; or not inherit.
> 
> Inheriting sounds like a less surprising behavior. Once you opt in for
> oom_group you can expect that descendants are going to assume the same
> unless they explicitly state otherwise.

Not sure I understand why. Setting memory.oom_group on a child memcg
has absolutely no meaning until memory.max is also set. In case of OOM
scoped to the parent memcg or above, parent's value defines the behavior.

If a user decides to create a separate OOM domain (be setting the hard
memory limit), he/she can also make a decision on how OOM event should
be handled.

Thanks!

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [v9 3/5] mm, oom: cgroup-aware OOM killer
@ 2017-10-04 15:04               ` Roman Gushchin
  0 siblings, 0 replies; 46+ messages in thread
From: Roman Gushchin @ 2017-10-04 15:04 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Vladimir Davydov, Johannes Weiner, Tetsuo Handa,
	David Rientjes, Andrew Morton, Tejun Heo, kernel-team, cgroups,
	linux-doc, linux-kernel

On Tue, Oct 03, 2017 at 04:22:46PM +0200, Michal Hocko wrote:
> On Tue 03-10-17 15:08:41, Roman Gushchin wrote:
> > On Tue, Oct 03, 2017 at 03:36:23PM +0200, Michal Hocko wrote:
> [...]
> > > I guess we want to inherit the value on the memcg creation but I agree
> > > that enforcing parent setting is weird. I will think about it some more
> > > but I agree that it is saner to only enforce per memcg value.
> > 
> > I'm not against, but we should come up with a good explanation, why we're
> > inheriting it; or not inherit.
> 
> Inheriting sounds like a less surprising behavior. Once you opt in for
> oom_group you can expect that descendants are going to assume the same
> unless they explicitly state otherwise.

Not sure I understand why. Setting memory.oom_group on a child memcg
has absolutely no meaning until memory.max is also set. In case of OOM
scoped to the parent memcg or above, parent's value defines the behavior.

If a user decides to create a separate OOM domain (be setting the hard
memory limit), he/she can also make a decision on how OOM event should
be handled.

Thanks!

^ permalink raw reply	[flat|nested] 46+ messages in thread

end of thread, other threads:[~2017-10-04 15:05 UTC | newest]

Thread overview: 46+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-09-27 13:09 [v9 0/5] cgroup-aware OOM killer Roman Gushchin
2017-09-27 13:09 ` Roman Gushchin
2017-09-27 13:09 ` [v9 1/5] mm, oom: refactor the oom_kill_process() function Roman Gushchin
2017-09-27 13:09   ` Roman Gushchin
2017-09-27 13:09 ` [v9 2/5] mm: implement mem_cgroup_scan_tasks() for the root memory cgroup Roman Gushchin
2017-09-27 13:09   ` Roman Gushchin
2017-10-03 10:49   ` Michal Hocko
2017-10-03 10:49     ` Michal Hocko
2017-10-03 12:50     ` Roman Gushchin
2017-10-03 12:50       ` Roman Gushchin
2017-09-27 13:09 ` [v9 3/5] mm, oom: cgroup-aware OOM killer Roman Gushchin
2017-09-27 13:09   ` Roman Gushchin
2017-10-03 11:48   ` Michal Hocko
2017-10-03 11:48     ` Michal Hocko
2017-10-03 12:37     ` Roman Gushchin
2017-10-03 12:37       ` Roman Gushchin
2017-10-03 12:37       ` Roman Gushchin
2017-10-03 13:36       ` Michal Hocko
2017-10-03 13:36         ` Michal Hocko
2017-10-03 14:08         ` Roman Gushchin
2017-10-03 14:08           ` Roman Gushchin
2017-10-03 14:22           ` Michal Hocko
2017-10-03 14:22             ` Michal Hocko
2017-10-03 14:35             ` Tejun Heo
2017-10-03 14:35               ` Tejun Heo
2017-10-03 14:35               ` Tejun Heo
2017-10-04  9:29               ` Michal Hocko
2017-10-04  9:29                 ` Michal Hocko
2017-10-03 14:38             ` Roman Gushchin
2017-10-03 14:38               ` Roman Gushchin
2017-10-03 14:38               ` Roman Gushchin
2017-10-03 14:43               ` Michal Hocko
2017-10-03 14:43                 ` Michal Hocko
2017-10-04 15:04             ` Roman Gushchin
2017-10-04 15:04               ` Roman Gushchin
2017-10-04 15:04               ` Roman Gushchin
2017-09-27 13:09 ` [v9 4/5] mm, oom: add cgroup v2 mount option for " Roman Gushchin
2017-09-27 13:09   ` Roman Gushchin
2017-10-03 11:50   ` Michal Hocko
2017-10-03 11:50     ` Michal Hocko
2017-10-03 12:49     ` Roman Gushchin
2017-10-03 12:49       ` Roman Gushchin
2017-10-03 13:39       ` Michal Hocko
2017-10-03 13:39         ` Michal Hocko
2017-09-27 13:09 ` [v9 5/5] mm, oom, docs: describe the " Roman Gushchin
2017-09-27 13:09   ` Roman Gushchin

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.