[v8 0/4] cgroup-aware OOM killer

* [v8 0/4] cgroup-aware OOM killer
@ 2017-09-11 13:17 Roman Gushchin
  2017-09-11 13:17 ` [v8 1/4] mm, oom: refactor the oom_kill_process() function Roman Gushchin
                   ` (4 more replies)
  0 siblings, 5 replies; 78+ messages in thread
From: Roman Gushchin @ 2017-09-11 13:17 UTC (permalink / raw)
  To: linux-mm
  Cc: Roman Gushchin, Michal Hocko, Vladimir Davydov, Johannes Weiner,
	Tetsuo Handa, David Rientjes, Andrew Morton, Tejun Heo,
	kernel-team, cgroups, linux-doc, linux-kernel

This patchset makes the OOM killer cgroup-aware.

v8:
  - Do not kill tasks with OOM_SCORE_ADJ -1000
  - Make the whole thing opt-in with cgroup mount option control
  - Drop oom_priority for further discussions
  - Kill the whole cgroup if oom_group is set and its
    memory.max is reached
  - Update docs and commit messages

v7:
  - __oom_kill_process() drops reference to the victim task
  - oom_score_adj -1000 is always respected
  - Renamed oom_kill_all to oom_group
  - Dropped oom_prio range, converted from short to int
  - Added a cgroup v2 mount option to disable cgroup-aware OOM killer
  - Docs updated
  - Rebased on top of mmotm

v6:
  - Renamed oom_control.chosen to oom_control.chosen_task
  - Renamed oom_kill_all_tasks to oom_kill_all
  - Per-node NR_SLAB_UNRECLAIMABLE accounting
  - Several minor fixes and cleanups
  - Docs updated

v5:
  - Rebased on top of Michal Hocko's patches, which have changed
    how OOM victims gain access to the memory
    reserves. Dropped the corresponding part of this patchset
  - Separated the oom_kill_process() splitting into a standalone commit
  - Added debug output (suggested by David Rientjes)
  - Some minor fixes

v4:
  - Reworked per-cgroup oom_score_adj into oom_priority
    (based on ideas by David Rientjes)
  - Tasks with oom_score_adj -1000 are never selected if
    oom_kill_all_tasks is not set
  - Memcg victim selection code is reworked, and
    synchronization is based on finding tasks with the OOM victim marker,
    rather than on a global counter
  - Debug output is dropped
  - Refactored TIF_MEMDIE usage

v3:
  - Merged commits 1-4 into 6
  - Separated oom_score_adj logic and debug output into separate commits
  - Fixed swap accounting

v2:
  - Reworked victim selection based on feedback
    from Michal Hocko, Vladimir Davydov and Johannes Weiner
  - "Kill all tasks" is now an opt-in option, by default
    only one process will be killed
  - Added per-cgroup oom_score_adj
  - Refined oom score calculations, suggested by Vladimir Davydov
  - Converted to a patchset

v1:
  https://lkml.org/lkml/2017/5/18/969


Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: David Rientjes <rientjes@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: kernel-team@fb.com
Cc: cgroups@vger.kernel.org
Cc: linux-doc@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org

Roman Gushchin (4):
  mm, oom: refactor the oom_kill_process() function
  mm, oom: cgroup-aware OOM killer
  mm, oom: add cgroup v2 mount option for cgroup-aware OOM killer
  mm, oom, docs: describe the cgroup-aware OOM killer

 Documentation/cgroup-v2.txt |  39 +++++++
 include/linux/cgroup-defs.h |   5 +
 include/linux/memcontrol.h  |  33 ++++++
 include/linux/oom.h         |  12 +-
 kernel/cgroup/cgroup.c      |  10 ++
 mm/memcontrol.c             | 261 ++++++++++++++++++++++++++++++++++++++++++++
 mm/oom_kill.c               | 210 +++++++++++++++++++++++------------
 7 files changed, 501 insertions(+), 69 deletions(-)

-- 
2.13.5


* [v8 1/4] mm, oom: refactor the oom_kill_process() function
  2017-09-11 13:17 [v8 0/4] cgroup-aware OOM killer Roman Gushchin
@ 2017-09-11 13:17 ` Roman Gushchin
  2017-09-11 20:51   ` David Rientjes
  2017-09-14 13:42   ` Michal Hocko
  2017-09-11 13:17 ` [v8 2/4] mm, oom: cgroup-aware OOM killer Roman Gushchin
                   ` (3 subsequent siblings)
  4 siblings, 2 replies; 78+ messages in thread
From: Roman Gushchin @ 2017-09-11 13:17 UTC (permalink / raw)
  To: linux-mm
  Cc: Roman Gushchin, Michal Hocko, Vladimir Davydov, Johannes Weiner,
	Tetsuo Handa, David Rientjes, Andrew Morton, Tejun Heo,
	kernel-team, cgroups, linux-doc, linux-kernel

The oom_kill_process() function consists of two logical parts:
the first one is responsible for considering the task's children as
potential victims and printing the debug information.
The second half is responsible for sending SIGKILL to all
tasks sharing the mm struct with the given victim.

This commit splits the oom_kill_process() function with
the intention of reusing the second half: __oom_kill_process().

The cgroup-aware OOM killer will kill multiple tasks
belonging to the victim cgroup. We don't need to print
the debug information for each task or redo the task
selection (considering the task's children),
so we can't use the existing oom_kill_process().
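
For readers who want the shape of the split without reading the diff,
here is a minimal userspace sketch. It is not kernel code: struct
toy_task and the toy_* functions are hypothetical stand-ins, and
"killing" only sets a flag. It shows a wrapper that reports and may
sacrifice a child, a helper that only kills, and a group path that
reuses the helper alone.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-in for struct task_struct; not the kernel type. */
struct toy_task {
	int pid;
	unsigned int points;		/* badness score, 0 means unkillable */
	int killed;
	struct toy_task *children;	/* array of direct children */
	size_t nr_children;
};

/* Second half: only delivers the "kill"; no selection, no debug output. */
static void toy_kill(struct toy_task *victim)
{
	assert(victim != NULL);
	victim->killed = 1;
}

/* First half: report, pick the worst child if one is eligible, then kill. */
static struct toy_task *toy_kill_process(struct toy_task *p)
{
	struct toy_task *victim = p;
	unsigned int victim_points = 0;
	size_t i;

	assert(p != NULL);
	printf("Kill process %d score %u or sacrifice child\n",
	       p->pid, p->points);

	for (i = 0; i < p->nr_children; i++) {
		struct toy_task *child = &p->children[i];

		if (child->points > victim_points) {
			victim = child;
			victim_points = child->points;
		}
	}

	toy_kill(victim);
	return victim;
}

/* Group path: every task is killed, so only the second half is reused. */
static size_t toy_kill_group(struct toy_task *tasks, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++)
		toy_kill(&tasks[i]);
	return n;
}
```

The group path never calls toy_kill_process(), which mirrors the
reason given above for splitting out __oom_kill_process().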

Signed-off-by: Roman Gushchin <guro@fb.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: David Rientjes <rientjes@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: kernel-team@fb.com
Cc: cgroups@vger.kernel.org
Cc: linux-doc@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
---
 mm/oom_kill.c | 123 +++++++++++++++++++++++++++++++---------------------------
 1 file changed, 65 insertions(+), 58 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 99736e026712..f061b627092c 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -804,68 +804,12 @@ static bool task_will_free_mem(struct task_struct *task)
 	return ret;
 }
 
-static void oom_kill_process(struct oom_control *oc, const char *message)
+static void __oom_kill_process(struct task_struct *victim)
 {
-	struct task_struct *p = oc->chosen;
-	unsigned int points = oc->chosen_points;
-	struct task_struct *victim = p;
-	struct task_struct *child;
-	struct task_struct *t;
+	struct task_struct *p;
 	struct mm_struct *mm;
-	unsigned int victim_points = 0;
-	static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
-					      DEFAULT_RATELIMIT_BURST);
 	bool can_oom_reap = true;
 
-	/*
-	 * If the task is already exiting, don't alarm the sysadmin or kill
-	 * its children or threads, just give it access to memory reserves
-	 * so it can die quickly
-	 */
-	task_lock(p);
-	if (task_will_free_mem(p)) {
-		mark_oom_victim(p);
-		wake_oom_reaper(p);
-		task_unlock(p);
-		put_task_struct(p);
-		return;
-	}
-	task_unlock(p);
-
-	if (__ratelimit(&oom_rs))
-		dump_header(oc, p);
-
-	pr_err("%s: Kill process %d (%s) score %u or sacrifice child\n",
-		message, task_pid_nr(p), p->comm, points);
-
-	/*
-	 * If any of p's children has a different mm and is eligible for kill,
-	 * the one with the highest oom_badness() score is sacrificed for its
-	 * parent.  This attempts to lose the minimal amount of work done while
-	 * still freeing memory.
-	 */
-	read_lock(&tasklist_lock);
-	for_each_thread(p, t) {
-		list_for_each_entry(child, &t->children, sibling) {
-			unsigned int child_points;
-
-			if (process_shares_mm(child, p->mm))
-				continue;
-			/*
-			 * oom_badness() returns 0 if the thread is unkillable
-			 */
-			child_points = oom_badness(child,
-				oc->memcg, oc->nodemask, oc->totalpages);
-			if (child_points > victim_points) {
-				put_task_struct(victim);
-				victim = child;
-				victim_points = child_points;
-				get_task_struct(victim);
-			}
-		}
-	}
-	read_unlock(&tasklist_lock);
-
 	p = find_lock_task_mm(victim);
 	if (!p) {
 		put_task_struct(victim);
@@ -939,6 +883,69 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
 }
 #undef K
 
+static void oom_kill_process(struct oom_control *oc, const char *message)
+{
+	struct task_struct *p = oc->chosen;
+	unsigned int points = oc->chosen_points;
+	struct task_struct *victim = p;
+	struct task_struct *child;
+	struct task_struct *t;
+	unsigned int victim_points = 0;
+	static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
+					      DEFAULT_RATELIMIT_BURST);
+
+	/*
+	 * If the task is already exiting, don't alarm the sysadmin or kill
+	 * its children or threads, just give it access to memory reserves
+	 * so it can die quickly
+	 */
+	task_lock(p);
+	if (task_will_free_mem(p)) {
+		mark_oom_victim(p);
+		wake_oom_reaper(p);
+		task_unlock(p);
+		put_task_struct(p);
+		return;
+	}
+	task_unlock(p);
+
+	if (__ratelimit(&oom_rs))
+		dump_header(oc, p);
+
+	pr_err("%s: Kill process %d (%s) score %u or sacrifice child\n",
+		message, task_pid_nr(p), p->comm, points);
+
+	/*
+	 * If any of p's children has a different mm and is eligible for kill,
+	 * the one with the highest oom_badness() score is sacrificed for its
+	 * parent.  This attempts to lose the minimal amount of work done while
+	 * still freeing memory.
+	 */
+	read_lock(&tasklist_lock);
+	for_each_thread(p, t) {
+		list_for_each_entry(child, &t->children, sibling) {
+			unsigned int child_points;
+
+			if (process_shares_mm(child, p->mm))
+				continue;
+			/*
+			 * oom_badness() returns 0 if the thread is unkillable
+			 */
+			child_points = oom_badness(child,
+				oc->memcg, oc->nodemask, oc->totalpages);
+			if (child_points > victim_points) {
+				put_task_struct(victim);
+				victim = child;
+				victim_points = child_points;
+				get_task_struct(victim);
+			}
+		}
+	}
+	read_unlock(&tasklist_lock);
+
+	__oom_kill_process(victim);
+}
+
 /*
  * Determines whether the kernel must panic because of the panic_on_oom sysctl.
  */
-- 
2.13.5


* Re: [v8 1/4] mm, oom: refactor the oom_kill_process() function
  2017-09-11 13:17 ` [v8 1/4] mm, oom: refactor the oom_kill_process() function Roman Gushchin
@ 2017-09-11 20:51   ` David Rientjes
  2017-09-14 13:42   ` Michal Hocko
  1 sibling, 0 replies; 78+ messages in thread
From: David Rientjes @ 2017-09-11 20:51 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: linux-mm, Michal Hocko, Vladimir Davydov, Johannes Weiner,
	Tetsuo Handa, Andrew Morton, Tejun Heo, kernel-team, cgroups,
	linux-doc, linux-kernel

On Mon, 11 Sep 2017, Roman Gushchin wrote:

> The oom_kill_process() function consists of two logical parts:
> the first one is responsible for considering the task's children as
> potential victims and printing the debug information.
> The second half is responsible for sending SIGKILL to all
> tasks sharing the mm struct with the given victim.
> 
> This commit splits the oom_kill_process() function with
> the intention of reusing the second half: __oom_kill_process().
> 
> The cgroup-aware OOM killer will kill multiple tasks
> belonging to the victim cgroup. We don't need to print
> the debug information for each task or redo the task
> selection (considering the task's children),
> so we can't use the existing oom_kill_process().
> 
> Signed-off-by: Roman Gushchin <guro@fb.com>
> Cc: Michal Hocko <mhocko@kernel.org>
> Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> Cc: David Rientjes <rientjes@google.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Tejun Heo <tj@kernel.org>
> Cc: kernel-team@fb.com
> Cc: cgroups@vger.kernel.org
> Cc: linux-doc@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org
> Cc: linux-mm@kvack.org

Acked-by: David Rientjes <rientjes@google.com>

https://marc.info/?l=linux-kernel&m=150274805412752


* Re: [v8 1/4] mm, oom: refactor the oom_kill_process() function
  2017-09-11 13:17 ` [v8 1/4] mm, oom: refactor the oom_kill_process() function Roman Gushchin
  2017-09-11 20:51   ` David Rientjes
@ 2017-09-14 13:42   ` Michal Hocko
  1 sibling, 0 replies; 78+ messages in thread
From: Michal Hocko @ 2017-09-14 13:42 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: linux-mm, Vladimir Davydov, Johannes Weiner, Tetsuo Handa,
	David Rientjes, Andrew Morton, Tejun Heo, kernel-team, cgroups,
	linux-doc, linux-kernel

On Mon 11-09-17 14:17:39, Roman Gushchin wrote:
> The oom_kill_process() function consists of two logical parts:
> the first one is responsible for considering the task's children as
> potential victims and printing the debug information.
> The second half is responsible for sending SIGKILL to all
> tasks sharing the mm struct with the given victim.
> 
> This commit splits the oom_kill_process() function with
> the intention of reusing the second half: __oom_kill_process().
> 
> The cgroup-aware OOM killer will kill multiple tasks
> belonging to the victim cgroup. We don't need to print
> the debug information for each task or redo the task
> selection (considering the task's children),
> so we can't use the existing oom_kill_process().
> 
> Signed-off-by: Roman Gushchin <guro@fb.com>
> Cc: Michal Hocko <mhocko@kernel.org>
> Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> Cc: David Rientjes <rientjes@google.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Tejun Heo <tj@kernel.org>
> Cc: kernel-team@fb.com
> Cc: cgroups@vger.kernel.org
> Cc: linux-doc@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org
> Cc: linux-mm@kvack.org

Acked-by: Michal Hocko <mhocko@suse.com>

> ---
>  mm/oom_kill.c | 123 +++++++++++++++++++++++++++++++---------------------------
>  1 file changed, 65 insertions(+), 58 deletions(-)
> 
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 99736e026712..f061b627092c 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -804,68 +804,12 @@ static bool task_will_free_mem(struct task_struct *task)
>  	return ret;
>  }
>  
> -static void oom_kill_process(struct oom_control *oc, const char *message)
> +static void __oom_kill_process(struct task_struct *victim)
>  {
> -	struct task_struct *p = oc->chosen;
> -	unsigned int points = oc->chosen_points;
> -	struct task_struct *victim = p;
> -	struct task_struct *child;
> -	struct task_struct *t;
> +	struct task_struct *p;
>  	struct mm_struct *mm;
> -	unsigned int victim_points = 0;
> -	static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
> -					      DEFAULT_RATELIMIT_BURST);
>  	bool can_oom_reap = true;
>  
> -	/*
> -	 * If the task is already exiting, don't alarm the sysadmin or kill
> -	 * its children or threads, just give it access to memory reserves
> -	 * so it can die quickly
> -	 */
> -	task_lock(p);
> -	if (task_will_free_mem(p)) {
> -		mark_oom_victim(p);
> -		wake_oom_reaper(p);
> -		task_unlock(p);
> -		put_task_struct(p);
> -		return;
> -	}
> -	task_unlock(p);
> -
> -	if (__ratelimit(&oom_rs))
> -		dump_header(oc, p);
> -
> -	pr_err("%s: Kill process %d (%s) score %u or sacrifice child\n",
> -		message, task_pid_nr(p), p->comm, points);
> -
> -	/*
> -	 * If any of p's children has a different mm and is eligible for kill,
> -	 * the one with the highest oom_badness() score is sacrificed for its
> -	 * parent.  This attempts to lose the minimal amount of work done while
> -	 * still freeing memory.
> -	 */
> -	read_lock(&tasklist_lock);
> -	for_each_thread(p, t) {
> -		list_for_each_entry(child, &t->children, sibling) {
> -			unsigned int child_points;
> -
> -			if (process_shares_mm(child, p->mm))
> -				continue;
> -			/*
> -			 * oom_badness() returns 0 if the thread is unkillable
> -			 */
> -			child_points = oom_badness(child,
> -				oc->memcg, oc->nodemask, oc->totalpages);
> -			if (child_points > victim_points) {
> -				put_task_struct(victim);
> -				victim = child;
> -				victim_points = child_points;
> -				get_task_struct(victim);
> -			}
> -		}
> -	}
> -	read_unlock(&tasklist_lock);
> -
>  	p = find_lock_task_mm(victim);
>  	if (!p) {
>  		put_task_struct(victim);
> @@ -939,6 +883,69 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
>  }
>  #undef K
>  
> +static void oom_kill_process(struct oom_control *oc, const char *message)
> +{
> +	struct task_struct *p = oc->chosen;
> +	unsigned int points = oc->chosen_points;
> +	struct task_struct *victim = p;
> +	struct task_struct *child;
> +	struct task_struct *t;
> +	unsigned int victim_points = 0;
> +	static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
> +					      DEFAULT_RATELIMIT_BURST);
> +
> +	/*
> +	 * If the task is already exiting, don't alarm the sysadmin or kill
> +	 * its children or threads, just give it access to memory reserves
> +	 * so it can die quickly
> +	 */
> +	task_lock(p);
> +	if (task_will_free_mem(p)) {
> +		mark_oom_victim(p);
> +		wake_oom_reaper(p);
> +		task_unlock(p);
> +		put_task_struct(p);
> +		return;
> +	}
> +	task_unlock(p);
> +
> +	if (__ratelimit(&oom_rs))
> +		dump_header(oc, p);
> +
> +	pr_err("%s: Kill process %d (%s) score %u or sacrifice child\n",
> +		message, task_pid_nr(p), p->comm, points);
> +
> +	/*
> +	 * If any of p's children has a different mm and is eligible for kill,
> +	 * the one with the highest oom_badness() score is sacrificed for its
> +	 * parent.  This attempts to lose the minimal amount of work done while
> +	 * still freeing memory.
> +	 */
> +	read_lock(&tasklist_lock);
> +	for_each_thread(p, t) {
> +		list_for_each_entry(child, &t->children, sibling) {
> +			unsigned int child_points;
> +
> +			if (process_shares_mm(child, p->mm))
> +				continue;
> +			/*
> +			 * oom_badness() returns 0 if the thread is unkillable
> +			 */
> +			child_points = oom_badness(child,
> +				oc->memcg, oc->nodemask, oc->totalpages);
> +			if (child_points > victim_points) {
> +				put_task_struct(victim);
> +				victim = child;
> +				victim_points = child_points;
> +				get_task_struct(victim);
> +			}
> +		}
> +	}
> +	read_unlock(&tasklist_lock);
> +
> +	__oom_kill_process(victim);
> +}
> +
>  /*
>   * Determines whether the kernel must panic because of the panic_on_oom sysctl.
>   */
> -- 
> 2.13.5

-- 
Michal Hocko
SUSE Labs


* [v8 2/4] mm, oom: cgroup-aware OOM killer
  2017-09-11 13:17 [v8 0/4] cgroup-aware OOM killer Roman Gushchin
  2017-09-11 13:17 ` [v8 1/4] mm, oom: refactor the oom_kill_process() function Roman Gushchin
@ 2017-09-11 13:17 ` Roman Gushchin
  2017-09-13 20:46   ` David Rientjes
  2017-09-11 13:17 ` [v8 3/4] mm, oom: add cgroup v2 mount option for " Roman Gushchin
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 78+ messages in thread
From: Roman Gushchin @ 2017-09-11 13:17 UTC (permalink / raw)
  To: linux-mm
  Cc: Roman Gushchin, Michal Hocko, Vladimir Davydov, Johannes Weiner,
	Tetsuo Handa, David Rientjes, Andrew Morton, Tejun Heo,
	kernel-team, cgroups, linux-doc, linux-kernel

Traditionally, the OOM killer operates at the process level.
Under OOM conditions, it finds the process with the highest OOM score
and kills it.

This behavior is a poor fit for systems running many
containers:

1) There is no fairness between containers. A small container with
a few large processes will be chosen over a large one with a huge
number of small processes.

2) Containers often do not expect that some random process inside
will be killed. In many cases a much safer behavior is to kill
all tasks in the container. Traditionally, this was implemented
in userspace, but doing it in the kernel has some advantages,
especially in the case of a system-wide OOM.

To address these issues, the cgroup-aware OOM killer is introduced.

Under OOM conditions, it looks for the biggest memory consumer,
walking down the memcg tree from the root cgroup, or from the OOMing
cgroup if the OOM is caused by reaching a memcg memory limit. On each
level the memcg with the biggest memory footprint is selected.
By default, a leaf memcg is chosen, and the biggest task
inside is killed.
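
The level-by-level descent can be sketched in plain userspace C. This
is only an illustration under simplifying assumptions: struct toy_memcg
and the toy_* helpers are hypothetical, footprints are given as plain
numbers rather than computed from memcg statistics, and the special
handling of the root cgroup is left out.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of a memory cgroup; not the kernel's struct mem_cgroup. */
struct toy_memcg {
	const char *name;
	long footprint;			/* own usage in pages, used for leaves */
	struct toy_memcg **children;
	size_t nr_children;
};

/* Score of a subtree: leaf footprints summed over all leaves below. */
static long toy_subtree_score(const struct toy_memcg *cg)
{
	long sum = 0;
	size_t i;

	assert(cg != NULL);
	if (cg->nr_children == 0)
		return cg->footprint;
	for (i = 0; i < cg->nr_children; i++)
		sum += toy_subtree_score(cg->children[i]);
	return sum;
}

/* Walk down from root, taking the child with the biggest score each time. */
static const struct toy_memcg *toy_select_leaf(const struct toy_memcg *root)
{
	const struct toy_memcg *cur = root;

	assert(root != NULL);
	while (cur->nr_children > 0) {
		const struct toy_memcg *best = NULL;
		long best_score = 0;
		size_t i;

		for (i = 0; i < cur->nr_children; i++) {
			long s = toy_subtree_score(cur->children[i]);

			if (s > best_score) {
				best = cur->children[i];
				best_score = s;
			}
		}
		if (!best)	/* nothing below has a positive score */
			break;
		cur = best;
	}
	return cur;
}
```

Note that the chosen leaf is the biggest one on the path taken, not
necessarily the biggest leaf in the whole tree: a sibling subtree with
a smaller total is never entered.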

But a user can change this behavior by enabling the per-cgroup
memory.oom_group option. If set, it causes the OOM killer to treat
the whole cgroup as an indivisible memory consumer. This means
that OOM victim selection will stop at such a memcg, the whole
memcg will be selected, and all tasks belonging to it will be killed.
The only exception is tasks with oom_score_adj set to -1000,
which are considered unkillable.
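
The effect of the option on the kill step can be sketched as follows.
This is a userspace model, not the patch's code: the toy_* types, the
rss field and TOY_OOM_SCORE_ADJ_MIN are hypothetical names chosen for
the example, and "killing" only sets a flag.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_OOM_SCORE_ADJ_MIN (-1000)	/* same value as the kernel's OOM_SCORE_ADJ_MIN */

/* Hypothetical stand-ins; not the kernel's task_struct or mem_cgroup. */
struct toy_task {
	int pid;
	int oom_score_adj;
	long rss;		/* memory usage used to rank tasks */
	int killed;
};

struct toy_memcg {
	int oom_group;		/* models memory.oom_group */
	struct toy_task *tasks;
	size_t nr_tasks;
};

/*
 * Kill inside the selected memcg and return the number of tasks killed.
 * With oom_group set, every killable task goes; otherwise only the
 * biggest killable task does. Tasks at -1000 are always skipped.
 */
static size_t toy_kill_memcg(struct toy_memcg *cg)
{
	struct toy_task *biggest = NULL;
	size_t i, killed = 0;

	assert(cg != NULL);
	for (i = 0; i < cg->nr_tasks; i++) {
		struct toy_task *t = &cg->tasks[i];

		if (t->oom_score_adj == TOY_OOM_SCORE_ADJ_MIN)
			continue;
		if (cg->oom_group) {
			t->killed = 1;
			killed++;
		} else if (!biggest || t->rss > biggest->rss) {
			biggest = t;
		}
	}

	if (!cg->oom_group && biggest) {
		biggest->killed = 1;
		killed = 1;
	}
	return killed;
}
```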

The root cgroup is treated as a leaf memcg, so its score
is compared with those of the top-level memory cgroups. The oom_group
option is not supported for the root cgroup. Because of how memcg
statistics are implemented, a special algorithm is used for estimating
the root cgroup's oom_score: we define it as the maximum oom_score
of its tasks.
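
A small sketch of the root cgroup estimate and the top-level
comparison, assuming per-task and per-memcg scores are already
available as plain numbers. The toy_* names and the TOY_PICK_*
constants are hypothetical and exist only for this example.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_PICK_ROOT (-1)	/* the root cgroup itself was selected */
#define TOY_PICK_NONE (-2)	/* nothing eligible was found */

/* Root cgroup score: the maximum score among its own tasks. */
static long toy_root_score(const long *task_scores, size_t n)
{
	long max_score = 0;
	size_t i;

	assert(task_scores != NULL || n == 0);
	for (i = 0; i < n; i++) {
		if (task_scores[i] > max_score)
			max_score = task_scores[i];
	}
	return max_score;
}

/*
 * Compare the root score with the top-level memcg scores.
 * Returns the index of the winning top-level memcg, TOY_PICK_ROOT
 * if the root wins, or TOY_PICK_NONE if no score is positive.
 */
static int toy_pick_top(long root_score, const long *top_scores, int n)
{
	int pick = TOY_PICK_NONE;
	long best = 0;
	int i;

	if (root_score > 0) {
		pick = TOY_PICK_ROOT;
		best = root_score;
	}
	for (i = 0; i < n; i++) {
		if (top_scores[i] > best) {
			pick = i;
			best = top_scores[i];
		}
	}
	return pick;
}
```

Because the root score is a maximum rather than a sum, the root cgroup
competes with the single biggest task it holds, while other cgroups
compete with their total footprint.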

Signed-off-by: Roman Gushchin <guro@fb.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: David Rientjes <rientjes@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: kernel-team@fb.com
Cc: cgroups@vger.kernel.org
Cc: linux-doc@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
---
 include/linux/memcontrol.h |  33 ++++++
 include/linux/oom.h        |  12 ++-
 mm/memcontrol.c            | 258 +++++++++++++++++++++++++++++++++++++++++++++
 mm/oom_kill.c              |  95 ++++++++++++++---
 4 files changed, 383 insertions(+), 15 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 69966c461d1c..5b5c2b89968e 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -35,6 +35,7 @@ struct mem_cgroup;
 struct page;
 struct mm_struct;
 struct kmem_cache;
+struct oom_control;
 
 /* Cgroup-specific page state, on top of universal node page state */
 enum memcg_stat_item {
@@ -199,6 +200,12 @@ struct mem_cgroup {
 	/* OOM-Killer disable */
 	int		oom_kill_disable;
 
+	/* kill all tasks in the subtree in case of OOM */
+	bool oom_group;
+
+	/* cached OOM score */
+	long oom_score;
+
 	/* handle for "memory.events" */
 	struct cgroup_file events_file;
 
@@ -342,6 +349,11 @@ struct mem_cgroup *mem_cgroup_from_css(struct cgroup_subsys_state *css){
 	return css ? container_of(css, struct mem_cgroup, css) : NULL;
 }
 
+static inline void mem_cgroup_put(struct mem_cgroup *memcg)
+{
+	css_put(&memcg->css);
+}
+
 #define mem_cgroup_from_counter(counter, member)	\
 	container_of(counter, struct mem_cgroup, member)
 
@@ -480,6 +492,13 @@ static inline bool task_in_memcg_oom(struct task_struct *p)
 
 bool mem_cgroup_oom_synchronize(bool wait);
 
+bool mem_cgroup_select_oom_victim(struct oom_control *oc);
+
+static inline bool mem_cgroup_oom_group(struct mem_cgroup *memcg)
+{
+	return memcg->oom_group;
+}
+
 #ifdef CONFIG_MEMCG_SWAP
 extern int do_swap_account;
 #endif
@@ -744,6 +763,10 @@ static inline bool task_in_mem_cgroup(struct task_struct *task,
 	return true;
 }
 
+static inline void mem_cgroup_put(struct mem_cgroup *memcg)
+{
+}
+
 static inline struct mem_cgroup *
 mem_cgroup_iter(struct mem_cgroup *root,
 		struct mem_cgroup *prev,
@@ -936,6 +959,16 @@ static inline
 void count_memcg_event_mm(struct mm_struct *mm, enum vm_event_item idx)
 {
 }
+
+static inline bool mem_cgroup_select_oom_victim(struct oom_control *oc)
+{
+	return false;
+}
+
+static inline bool mem_cgroup_oom_group(struct mem_cgroup *memcg)
+{
+	return false;
+}
 #endif /* CONFIG_MEMCG */
 
 /* idx can be of type enum memcg_stat_item or node_stat_item */
diff --git a/include/linux/oom.h b/include/linux/oom.h
index 76aac4ce39bc..ca78e2d5956e 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -9,6 +9,13 @@
 #include <linux/sched/coredump.h> /* MMF_* */
 #include <linux/mm.h> /* VM_FAULT* */
 
+
+/*
+ * Special value returned by victim selection functions to indicate
+ * that there are inflight OOM victims.
+ */
+#define INFLIGHT_VICTIM ((void *)-1UL)
+
 struct zonelist;
 struct notifier_block;
 struct mem_cgroup;
@@ -39,7 +46,8 @@ struct oom_control {
 
 	/* Used by oom implementation, do not set */
 	unsigned long totalpages;
-	struct task_struct *chosen;
+	struct task_struct *chosen_task;
+	struct mem_cgroup *chosen_memcg;
 	unsigned long chosen_points;
 };
 
@@ -101,6 +109,8 @@ extern void oom_killer_enable(void);
 
 extern struct task_struct *find_lock_task_mm(struct task_struct *p);
 
+extern int oom_evaluate_task(struct task_struct *task, void *arg);
+
 /* sysctls */
 extern int sysctl_oom_dump_tasks;
 extern int sysctl_oom_kill_allocating_task;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 15af3da5af02..da2b12ea4667 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2661,6 +2661,231 @@ static inline bool memcg_has_children(struct mem_cgroup *memcg)
 	return ret;
 }
 
+static long memcg_oom_badness(struct mem_cgroup *memcg,
+			      const nodemask_t *nodemask,
+			      unsigned long totalpages)
+{
+	long points = 0;
+	int nid;
+	pg_data_t *pgdat;
+
+	/*
+	 * We don't have the necessary stats for the root memcg,
+	 * so we define its oom_score as the maximum oom_score
+	 * of its tasks.
+	 */
+	if (memcg == root_mem_cgroup) {
+		struct css_task_iter it;
+		struct task_struct *task;
+		long score, max_score = 0;
+
+		css_task_iter_start(&memcg->css, 0, &it);
+		while ((task = css_task_iter_next(&it))) {
+			score = oom_badness(task, memcg, nodemask,
+					    totalpages);
+			if (score > max_score)
+				max_score = score;
+		}
+		css_task_iter_end(&it);
+
+		return max_score;
+	}
+
+	for_each_node_state(nid, N_MEMORY) {
+		if (nodemask && !node_isset(nid, *nodemask))
+			continue;
+
+		points += mem_cgroup_node_nr_lru_pages(memcg, nid,
+				LRU_ALL_ANON | BIT(LRU_UNEVICTABLE));
+
+		pgdat = NODE_DATA(nid);
+		points += lruvec_page_state(mem_cgroup_lruvec(pgdat, memcg),
+					    NR_SLAB_UNRECLAIMABLE);
+	}
+
+	points += memcg_page_state(memcg, MEMCG_KERNEL_STACK_KB) /
+		(PAGE_SIZE / 1024);
+	points += memcg_page_state(memcg, MEMCG_SOCK);
+	points += memcg_page_state(memcg, MEMCG_SWAP);
+
+	return points;
+}
+
+/*
+ * Checks if the given memcg is a valid OOM victim and returns a number,
+ * which means the following:
+ *   -1: there are inflight OOM victim tasks, belonging to the memcg
+ *    0: memcg is not eligible, e.g. all belonging tasks are protected
+ *       by oom_score_adj set to OOM_SCORE_ADJ_MIN
+ *   >0: memcg is eligible, and the returned value is an estimation
+ *       of the memory footprint
+ */
+static long oom_evaluate_memcg(struct mem_cgroup *memcg,
+			       const nodemask_t *nodemask,
+			       unsigned long totalpages)
+{
+	struct css_task_iter it;
+	struct task_struct *task;
+	int eligible = 0;
+
+	/*
+	 * Memcg is OOM eligible if there are OOM killable tasks inside.
+	 *
+	 * We treat tasks with oom_score_adj set to OOM_SCORE_ADJ_MIN
+	 * as unkillable.
+	 *
+	 * If there are inflight OOM victim tasks inside the memcg,
+	 * we return -1.
+	 */
+	css_task_iter_start(&memcg->css, 0, &it);
+	while ((task = css_task_iter_next(&it))) {
+		if (!eligible &&
+		    task->signal->oom_score_adj != OOM_SCORE_ADJ_MIN)
+			eligible = 1;
+
+		if (tsk_is_oom_victim(task) &&
+		    !test_bit(MMF_OOM_SKIP, &task->signal->oom_mm->flags)) {
+			eligible = -1;
+			break;
+		}
+	}
+	css_task_iter_end(&it);
+
+	if (eligible <= 0)
+		return eligible;
+
+	return memcg_oom_badness(memcg, nodemask, totalpages);
+}
+
+static void select_victim_memcg(struct mem_cgroup *root, struct oom_control *oc)
+{
+	struct mem_cgroup *iter, *parent;
+
+	/*
+	 * If OOM is memcg-wide, and the memcg has the oom_group flag,
+	 * simply select the memcg as a victim.
+	 */
+	if (oc->memcg && oc->memcg->oom_group) {
+		oc->chosen_memcg = oc->memcg;
+		css_get(&oc->chosen_memcg->css);
+		oc->chosen_points = oc->memcg->oom_score;
+		return;
+	}
+
+	/*
+	 * The oom_score is calculated for leaf memcgs and propagated upwards
+	 * by the tree.
+	 *
+	 * for_each_mem_cgroup_tree() walks the tree in pre-order,
+	 * so we simply reset oom_score for non-leaf cgroups before
+	 * starting to accumulate the actual value from the underlying sub-tree.
+	 *
+	 * Root memcg is treated as a leaf memcg.
+	 */
+	for_each_mem_cgroup_tree(iter, root) {
+		if (memcg_has_children(iter) && iter != root_mem_cgroup) {
+			iter->oom_score = 0;
+			continue;
+		}
+
+		iter->oom_score = oom_evaluate_memcg(iter, oc->nodemask,
+						     oc->totalpages);
+
+		/*
+		 * Ignore empty and non-eligible memory cgroups.
+		 */
+		if (iter->oom_score == 0)
+			continue;
+
+		/*
+		 * If there are inflight OOM victims, we don't need to look
+		 * further for new victims.
+		 */
+		if (iter->oom_score == -1) {
+			oc->chosen_memcg = INFLIGHT_VICTIM;
+			mem_cgroup_iter_break(root, iter);
+			return;
+		}
+
+		for (parent = parent_mem_cgroup(iter); parent && parent != root;
+		     parent = parent_mem_cgroup(parent))
+			parent->oom_score += iter->oom_score;
+	}
+
+	for (;;) {
+		struct cgroup_subsys_state *css;
+		struct mem_cgroup *memcg = NULL;
+		long score = LONG_MIN;
+
+		/*
+		 * Root memcg is compared with top-level memcgs.
+		 */
+		if (root == root_mem_cgroup && root->oom_score > 0) {
+			score = root->oom_score;
+			memcg = root_mem_cgroup;
+		}
+
+		css_for_each_child(css, &root->css) {
+			struct mem_cgroup *iter = mem_cgroup_from_css(css);
+
+			/*
+			 * Ignore empty and non-eligible memory cgroups.
+			 */
+			if (iter->oom_score == 0)
+				continue;
+
+			if (iter->oom_score > score) {
+				memcg = iter;
+				score = iter->oom_score;
+			}
+		}
+
+		if (!memcg) {
+			if (oc->memcg && root == oc->memcg) {
+				oc->chosen_memcg = oc->memcg;
+				css_get(&oc->chosen_memcg->css);
+				oc->chosen_points = oc->memcg->oom_score;
+			}
+			break;
+		}
+
+		if (memcg->oom_group || !memcg_has_children(memcg) ||
+		    memcg == root_mem_cgroup) {
+			oc->chosen_memcg = memcg;
+			css_get(&oc->chosen_memcg->css);
+			oc->chosen_points = score;
+			break;
+		}
+
+		root = memcg;
+	}
+}
+
+bool mem_cgroup_select_oom_victim(struct oom_control *oc)
+{
+	struct mem_cgroup *root;
+
+	if (mem_cgroup_disabled())
+		return false;
+
+	if (!cgroup_subsys_on_dfl(memory_cgrp_subsys))
+		return false;
+
+	if (oc->memcg)
+		root = oc->memcg;
+	else
+		root = root_mem_cgroup;
+
+	oc->chosen_task = NULL;
+	oc->chosen_memcg = NULL;
+
+	rcu_read_lock();
+	select_victim_memcg(root, oc);
+	rcu_read_unlock();
+
+	return oc->chosen_task || oc->chosen_memcg;
+}
+
 /*
  * Reclaims as many pages from the given memcg as possible.
  *
@@ -5258,6 +5483,33 @@ static ssize_t memory_max_write(struct kernfs_open_file *of,
 	return nbytes;
 }
 
+static int memory_oom_group_show(struct seq_file *m, void *v)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m));
+	bool oom_group = memcg->oom_group;
+
+	seq_printf(m, "%d\n", oom_group);
+
+	return 0;
+}
+
+static ssize_t memory_oom_group_write(struct kernfs_open_file *of,
+					       char *buf, size_t nbytes,
+					       loff_t off)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
+	int oom_group;
+	int err;
+
+	err = kstrtoint(strstrip(buf), 0, &oom_group);
+	if (err)
+		return err;
+
+	memcg->oom_group = oom_group;
+
+	return nbytes;
+}
+
 static int memory_events_show(struct seq_file *m, void *v)
 {
 	struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m));
@@ -5378,6 +5630,12 @@ static struct cftype memory_files[] = {
 		.write = memory_max_write,
 	},
 	{
+		.name = "oom_group",
+		.flags = CFTYPE_NOT_ON_ROOT,
+		.seq_show = memory_oom_group_show,
+		.write = memory_oom_group_write,
+	},
+	{
 		.name = "events",
 		.flags = CFTYPE_NOT_ON_ROOT,
 		.file_offset = offsetof(struct mem_cgroup, events_file),
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index f061b627092c..70359a535e62 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -288,7 +288,7 @@ static enum oom_constraint constrained_alloc(struct oom_control *oc)
 	return CONSTRAINT_NONE;
 }
 
-static int oom_evaluate_task(struct task_struct *task, void *arg)
+int oom_evaluate_task(struct task_struct *task, void *arg)
 {
 	struct oom_control *oc = arg;
 	unsigned long points;
@@ -322,26 +322,26 @@ static int oom_evaluate_task(struct task_struct *task, void *arg)
 		goto next;
 
 	/* Prefer thread group leaders for display purposes */
-	if (points == oc->chosen_points && thread_group_leader(oc->chosen))
+	if (points == oc->chosen_points && thread_group_leader(oc->chosen_task))
 		goto next;
 select:
-	if (oc->chosen)
-		put_task_struct(oc->chosen);
+	if (oc->chosen_task)
+		put_task_struct(oc->chosen_task);
 	get_task_struct(task);
-	oc->chosen = task;
+	oc->chosen_task = task;
 	oc->chosen_points = points;
 next:
 	return 0;
 abort:
-	if (oc->chosen)
-		put_task_struct(oc->chosen);
-	oc->chosen = (void *)-1UL;
+	if (oc->chosen_task)
+		put_task_struct(oc->chosen_task);
+	oc->chosen_task = INFLIGHT_VICTIM;
 	return 1;
 }
 
 /*
  * Simple selection loop. We choose the process with the highest number of
- * 'points'. In case scan was aborted, oc->chosen is set to -1.
+ * 'points'. In case scan was aborted, oc->chosen_task is set to -1.
  */
 static void select_bad_process(struct oom_control *oc)
 {
@@ -810,6 +810,12 @@ static void __oom_kill_process(struct task_struct *victim)
 	struct mm_struct *mm;
 	bool can_oom_reap = true;
 
+	if (is_global_init(victim) || (victim->flags & PF_KTHREAD) ||
+	    victim->signal->oom_score_adj == OOM_SCORE_ADJ_MIN) {
+		put_task_struct(victim);
+		return;
+	}
+
 	p = find_lock_task_mm(victim);
 	if (!p) {
 		put_task_struct(victim);
@@ -885,7 +891,7 @@ static void __oom_kill_process(struct task_struct *victim)
 
 static void oom_kill_process(struct oom_control *oc, const char *message)
 {
-	struct task_struct *p = oc->chosen;
+	struct task_struct *p = oc->chosen_task;
 	unsigned int points = oc->chosen_points;
 	struct task_struct *victim = p;
 	struct task_struct *child;
@@ -946,6 +952,64 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
 	__oom_kill_process(victim);
 }
 
+static int oom_kill_memcg_member(struct task_struct *task, void *unused)
+{
+	if (!tsk_is_oom_victim(task)) {
+		get_task_struct(task);
+		__oom_kill_process(task);
+	}
+	return 0;
+}
+
+static bool oom_kill_memcg_victim(struct oom_control *oc)
+{
+	static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
+				      DEFAULT_RATELIMIT_BURST);
+
+	if (oc->chosen_task) {
+		if (oc->chosen_task == INFLIGHT_VICTIM)
+			return true;
+
+		if (__ratelimit(&oom_rs))
+			dump_header(oc, oc->chosen_task);
+
+		__oom_kill_process(oc->chosen_task);
+
+		schedule_timeout_killable(1);
+		return true;
+
+	} else if (oc->chosen_memcg) {
+		if (oc->chosen_memcg == INFLIGHT_VICTIM)
+			return true;
+
+		/* Always begin with the biggest task */
+		oc->chosen_points = 0;
+		oc->chosen_task = NULL;
+		mem_cgroup_scan_tasks(oc->chosen_memcg, oom_evaluate_task, oc);
+
+		if (oc->chosen_task && oc->chosen_task != INFLIGHT_VICTIM) {
+			if (__ratelimit(&oom_rs))
+				dump_header(oc, oc->chosen_task);
+
+			__oom_kill_process(oc->chosen_task);
+
+			if (mem_cgroup_oom_group(oc->chosen_memcg))
+				mem_cgroup_scan_tasks(oc->chosen_memcg,
+						      oom_kill_memcg_member,
+						      NULL);
+			schedule_timeout_killable(1);
+		}
+
+		mem_cgroup_put(oc->chosen_memcg);
+		oc->chosen_memcg = NULL;
+		return oc->chosen_task;
+
+	} else {
+		oc->chosen_points = 0;
+		return false;
+	}
+}
+
 /*
  * Determines whether the kernel must panic because of the panic_on_oom sysctl.
  */
@@ -1042,18 +1106,21 @@ bool out_of_memory(struct oom_control *oc)
 	    current->mm && !oom_unkillable_task(current, NULL, oc->nodemask) &&
 	    current->signal->oom_score_adj != OOM_SCORE_ADJ_MIN) {
 		get_task_struct(current);
-		oc->chosen = current;
+		oc->chosen_task = current;
 		oom_kill_process(oc, "Out of memory (oom_kill_allocating_task)");
 		return true;
 	}
 
+	if (mem_cgroup_select_oom_victim(oc) && oom_kill_memcg_victim(oc))
+		return true;
+
 	select_bad_process(oc);
 	/* Found nothing?!?! Either we hang forever, or we panic. */
-	if (!oc->chosen && !is_sysrq_oom(oc) && !is_memcg_oom(oc)) {
+	if (!oc->chosen_task && !is_sysrq_oom(oc) && !is_memcg_oom(oc)) {
 		dump_header(oc, NULL);
 		panic("Out of memory and no killable processes...\n");
 	}
-	if (oc->chosen && oc->chosen != (void *)-1UL) {
+	if (oc->chosen_task && oc->chosen_task != INFLIGHT_VICTIM) {
 		oom_kill_process(oc, !is_memcg_oom(oc) ? "Out of memory" :
 				 "Memory cgroup out of memory");
 		/*
@@ -1062,7 +1129,7 @@ bool out_of_memory(struct oom_control *oc)
 		 */
 		schedule_timeout_killable(1);
 	}
-	return !!oc->chosen;
+	return !!oc->chosen_task;
 }
 
 /*
-- 
2.13.5

^ permalink raw reply related	[flat|nested] 78+ messages in thread
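
The kill path added to mm/oom_kill.c in the patch above can be modelled
in userspace roughly as follows. This is only a sketch under stated
assumptions, not the kernel code: all sketch_* names, the flat task array
and the precomputed points field are hypothetical stand-ins for
mem_cgroup_scan_tasks(), oom_evaluate_task() and __oom_kill_process().
It shows the order of operations: pick the biggest killable task in the
chosen memcg, kill it and, only if oom_group is set, kill the remaining
members while skipping init, kernel threads and tasks whose
oom_score_adj is -1000.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Same value as OOM_SCORE_ADJ_MIN; the SKETCH_ prefix marks it as illustrative. */
#define SKETCH_OOM_SCORE_ADJ_MIN (-1000)

/* Hypothetical, simplified view of a task inside the chosen memcg. */
struct sketch_task {
	long points;        /* precomputed badness, larger means more memory */
	int oom_score_adj;
	bool is_init;
	bool is_kthread;
	bool killed;        /* set by the sketch instead of sending SIGKILL */
};

/* Same three exemptions as the guard added to __oom_kill_process(). */
bool sketch_unkillable(const struct sketch_task *t)
{
	return t->is_init || t->is_kthread ||
	       t->oom_score_adj == SKETCH_OOM_SCORE_ADJ_MIN;
}

/* Returns the number of tasks marked as killed. */
size_t sketch_kill_memcg(struct sketch_task *tasks, size_t n, bool oom_group)
{
	struct sketch_task *biggest = NULL;
	size_t killed = 0;
	size_t i;

	assert(tasks != NULL || n == 0);

	/* Always begin with the biggest killable task. */
	for (i = 0; i < n; i++) {
		if (sketch_unkillable(&tasks[i]))
			continue;
		if (!biggest || tasks[i].points > biggest->points)
			biggest = &tasks[i];
	}
	if (!biggest)
		return 0;

	biggest->killed = true;
	killed++;

	/* With oom_group set, every other killable member goes too. */
	if (oom_group) {
		for (i = 0; i < n; i++) {
			if (tasks[i].killed || sketch_unkillable(&tasks[i]))
				continue;
			tasks[i].killed = true;
			killed++;
		}
	}
	return killed;
}
```

Given three tasks where the largest one has oom_score_adj set to -1000,
sketch_kill_memcg() marks only the second largest task when oom_group is
false, and both killable tasks when it is true.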

* Re: [v8 2/4] mm, oom: cgroup-aware OOM killer
  2017-09-11 13:17 ` [v8 2/4] mm, oom: cgroup-aware OOM killer Roman Gushchin
@ 2017-09-13 20:46   ` David Rientjes
  2017-09-13 21:59     ` Roman Gushchin
  0 siblings, 1 reply; 78+ messages in thread
From: David Rientjes @ 2017-09-13 20:46 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: linux-mm, Michal Hocko, Vladimir Davydov, Johannes Weiner,
	Tetsuo Handa, Andrew Morton, Tejun Heo, kernel-team, cgroups,
	linux-doc, linux-kernel

On Mon, 11 Sep 2017, Roman Gushchin wrote:

> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 15af3da5af02..da2b12ea4667 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2661,6 +2661,231 @@ static inline bool memcg_has_children(struct mem_cgroup *memcg)
>  	return ret;
>  }
>  
> +static long memcg_oom_badness(struct mem_cgroup *memcg,
> +			      const nodemask_t *nodemask,
> +			      unsigned long totalpages)
> +{
> +	long points = 0;
> +	int nid;
> +	pg_data_t *pgdat;
> +
> +	/*
> +	 * We don't have necessary stats for the root memcg,
> +	 * so we define it's oom_score as the maximum oom_score
> +	 * of the belonging tasks.
> +	 */
> +	if (memcg == root_mem_cgroup) {
> +		struct css_task_iter it;
> +		struct task_struct *task;
> +		long score, max_score = 0;
> +
> +		css_task_iter_start(&memcg->css, 0, &it);
> +		while ((task = css_task_iter_next(&it))) {
> +			score = oom_badness(task, memcg, nodemask,
> +					    totalpages);
> +			if (max_score > score)

score > max_score

> +				max_score = score;
> +		}
> +		css_task_iter_end(&it);
> +
> +		return max_score;
> +	}
> +
> +	for_each_node_state(nid, N_MEMORY) {
> +		if (nodemask && !node_isset(nid, *nodemask))
> +			continue;
> +
> +		points += mem_cgroup_node_nr_lru_pages(memcg, nid,
> +				LRU_ALL_ANON | BIT(LRU_UNEVICTABLE));
> +
> +		pgdat = NODE_DATA(nid);
> +		points += lruvec_page_state(mem_cgroup_lruvec(pgdat, memcg),
> +					    NR_SLAB_UNRECLAIMABLE);
> +	}
> +
> +	points += memcg_page_state(memcg, MEMCG_KERNEL_STACK_KB) /
> +		(PAGE_SIZE / 1024);
> +	points += memcg_page_state(memcg, MEMCG_SOCK);
> +	points += memcg_page_state(memcg, MEMCG_SWAP);
> +
> +	return points;
> +}

^ permalink raw reply	[flat|nested] 78+ messages in thread
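
For readers following the non-root branch of memcg_oom_badness() quoted
above, the arithmetic can be sketched in userspace as below. The
assumptions are loud ones: the struct and the sketch_* names are
hypothetical, the per-node loop and the nodemask filter are collapsed
into already-summed counters, and a 4 KiB page size is assumed. The only
non-obvious step is the kernel stack counter, which is kept in KiB and
has to be converted to pages before it is added.

```c
#include <assert.h>

#define SKETCH_PAGE_SIZE 4096L	/* assumed page size in bytes */

/* Hypothetical snapshot of the counters the quoted function reads. */
struct sketch_memcg_stats {
	long anon_unevictable_pages;	/* anon LRU pages plus unevictable pages */
	long slab_unreclaimable_pages;
	long kernel_stack_kb;		/* kept in KiB, not in pages */
	long sock_pages;
	long swap_pages;
};

/* Returns the memcg score in pages. */
long sketch_memcg_badness(const struct sketch_memcg_stats *s)
{
	long points = 0;

	assert(s != 0);

	points += s->anon_unevictable_pages;
	points += s->slab_unreclaimable_pages;
	/* KiB to pages: divide by the number of KiB per page. */
	points += s->kernel_stack_kb / (SKETCH_PAGE_SIZE / 1024);
	points += s->sock_pages;
	points += s->swap_pages;

	return points;
}
```

With 64 KiB of kernel stacks and 4 KiB pages, the stack term contributes
16 pages to the total.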

* Re: [v8 2/4] mm, oom: cgroup-aware OOM killer
  2017-09-13 20:46   ` David Rientjes
@ 2017-09-13 21:59     ` Roman Gushchin
  0 siblings, 0 replies; 78+ messages in thread
From: Roman Gushchin @ 2017-09-13 21:59 UTC (permalink / raw)
  To: David Rientjes
  Cc: linux-mm, Michal Hocko, Vladimir Davydov, Johannes Weiner,
	Tetsuo Handa, Andrew Morton, Tejun Heo, kernel-team, cgroups,
	linux-doc, linux-kernel

On Wed, Sep 13, 2017 at 01:46:51PM -0700, David Rientjes wrote:
> On Mon, 11 Sep 2017, Roman Gushchin wrote:
> 
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index 15af3da5af02..da2b12ea4667 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -2661,6 +2661,231 @@ static inline bool memcg_has_children(struct mem_cgroup *memcg)
> >  	return ret;
> >  }
> >  
> > +static long memcg_oom_badness(struct mem_cgroup *memcg,
> > +			      const nodemask_t *nodemask,
> > +			      unsigned long totalpages)
> > +{
> > +	long points = 0;
> > +	int nid;
> > +	pg_data_t *pgdat;
> > +
> > +	/*
> > +	 * We don't have necessary stats for the root memcg,
> > +	 * so we define it's oom_score as the maximum oom_score
> > +	 * of the belonging tasks.
> > +	 */
> > +	if (memcg == root_mem_cgroup) {
> > +		struct css_task_iter it;
> > +		struct task_struct *task;
> > +		long score, max_score = 0;
> > +
> > +		css_task_iter_start(&memcg->css, 0, &it);
> > +		while ((task = css_task_iter_next(&it))) {
> > +			score = oom_badness(task, memcg, nodemask,
> > +					    totalpages);
> > +			if (max_score > score)
> 
> score > max_score

Oops. Fixed. Thanks!

^ permalink raw reply	[flat|nested] 78+ messages in thread
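
The fix agreed on in this exchange can be illustrated with a small
userspace sketch. It is not the kernel code: sketch_max_score() and the
plain array of scores are hypothetical stand-ins for the css_task_iter
loop over the root memcg's tasks. With the original condition
(max_score > score), max_score starts at 0 and no non-negative score is
ever smaller than it, so the result would stay 0; the corrected
comparison tracks the maximum as intended.

```c
#include <assert.h>
#include <stddef.h>

/* Returns the largest score, or 0 if there are no scores. */
long sketch_max_score(const long *scores, size_t n)
{
	long max_score = 0;
	size_t i;

	assert(scores != NULL || n == 0);

	for (i = 0; i < n; i++) {
		/* Corrected comparison: update only when the new score is larger. */
		if (scores[i] > max_score)
			max_score = scores[i];
	}
	return max_score;
}
```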

* [v8 3/4] mm, oom: add cgroup v2 mount option for cgroup-aware OOM killer
  2017-09-11 13:17 [v8 0/4] cgroup-aware OOM killer Roman Gushchin
  2017-09-11 13:17 ` [v8 1/4] mm, oom: refactor the oom_kill_process() function Roman Gushchin
  2017-09-11 13:17 ` [v8 2/4] mm, oom: cgroup-aware OOM killer Roman Gushchin
@ 2017-09-11 13:17 ` Roman Gushchin
  2017-09-11 20:48   ` David Rientjes
  2017-09-11 13:17 ` [v8 4/4] mm, oom, docs: describe the " Roman Gushchin
  2017-09-11 20:44 ` [v8 0/4] " David Rientjes
  4 siblings, 1 reply; 78+ messages in thread
From: Roman Gushchin @ 2017-09-11 13:17 UTC (permalink / raw)
  To: linux-mm
  Cc: Roman Gushchin, Michal Hocko, Vladimir Davydov, Johannes Weiner,
	Tetsuo Handa, David Rientjes, Andrew Morton, Tejun Heo,
	kernel-team, cgroups, linux-doc, linux-kernel

Add a "groupoom" cgroup v2 mount option to enable the cgroup-aware
OOM killer. If not set, the OOM selection is performed in
a "traditional" per-process way.

The behavior can be changed dynamically by remounting the cgroupfs.

Signed-off-by: Roman Gushchin <guro@fb.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: David Rientjes <rientjes@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: kernel-team@fb.com
Cc: cgroups@vger.kernel.org
Cc: linux-doc@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
---
 include/linux/cgroup-defs.h |  5 +++++
 kernel/cgroup/cgroup.c      | 10 ++++++++++
 mm/memcontrol.c             |  3 +++
 3 files changed, 18 insertions(+)

diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index ade4a78a54c2..db4ff3e233a9 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -79,6 +79,11 @@ enum {
 	 * Enable cpuset controller in v1 cgroup to use v2 behavior.
 	 */
 	CGRP_ROOT_CPUSET_V2_MODE = (1 << 4),
+
+	/*
+	 * Enable cgroup-aware OOM killer.
+	 */
+	CGRP_GROUP_OOM = (1 << 5),
 };
 
 /* cftype->flags */
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index d6551cd45238..5f8a97a233bb 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -1699,6 +1699,9 @@ static int parse_cgroup_root_flags(char *data, unsigned int *root_flags)
 		if (!strcmp(token, "nsdelegate")) {
 			*root_flags |= CGRP_ROOT_NS_DELEGATE;
 			continue;
+		} else if (!strcmp(token, "groupoom")) {
+			*root_flags |= CGRP_GROUP_OOM;
+			continue;
 		}
 
 		pr_err("cgroup2: unknown option \"%s\"\n", token);
@@ -1715,6 +1718,11 @@ static void apply_cgroup_root_flags(unsigned int root_flags)
 			cgrp_dfl_root.flags |= CGRP_ROOT_NS_DELEGATE;
 		else
 			cgrp_dfl_root.flags &= ~CGRP_ROOT_NS_DELEGATE;
+
+		if (root_flags & CGRP_GROUP_OOM)
+			cgrp_dfl_root.flags |= CGRP_GROUP_OOM;
+		else
+			cgrp_dfl_root.flags &= ~CGRP_GROUP_OOM;
 	}
 }
 
@@ -1722,6 +1730,8 @@ static int cgroup_show_options(struct seq_file *seq, struct kernfs_root *kf_root
 {
 	if (cgrp_dfl_root.flags & CGRP_ROOT_NS_DELEGATE)
 		seq_puts(seq, ",nsdelegate");
+	if (cgrp_dfl_root.flags & CGRP_GROUP_OOM)
+		seq_puts(seq, ",groupoom");
 	return 0;
 }
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index da2b12ea4667..d645f70cb3a7 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2871,6 +2871,9 @@ bool mem_cgroup_select_oom_victim(struct oom_control *oc)
 	if (!cgroup_subsys_on_dfl(memory_cgrp_subsys))
 		return false;
 
+	if (!(cgrp_dfl_root.flags & CGRP_GROUP_OOM))
+		return false;
+
 	if (oc->memcg)
 		root = oc->memcg;
 	else
-- 
2.13.5

^ permalink raw reply related	[flat|nested] 78+ messages in thread
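
The option handling in the patch above follows a parse-then-apply
pattern, which the following userspace sketch tries to illustrate. All
sketch_* names and the SKETCH_* flag values are hypothetical, and the
tokenizer is a simplified replacement for the kernel's; in particular,
skipping empty tokens is a choice made for this sketch only. The point
it shows is that a remount replaces the option bits, so leaving
"groupoom" out of the new option string switches the cgroup-aware
selection off again.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

/* Illustrative flag bits; the names are hypothetical, not the kernel's. */
#define SKETCH_NS_DELEGATE (1u << 3)
#define SKETCH_GROUP_OOM   (1u << 5)

/* True if the len-byte token at tok equals the NUL-terminated name. */
int sketch_token_is(const char *tok, size_t len, const char *name)
{
	return strlen(name) == len && strncmp(tok, name, len) == 0;
}

/* Parses a comma-separated option string; returns 0 or -EINVAL. */
int sketch_parse_root_flags(const char *data, unsigned int *root_flags)
{
	assert(root_flags != NULL);

	*root_flags = 0;
	if (!data)
		return 0;

	while (*data) {
		size_t len = strcspn(data, ",");

		if (sketch_token_is(data, len, "nsdelegate"))
			*root_flags |= SKETCH_NS_DELEGATE;
		else if (sketch_token_is(data, len, "groupoom"))
			*root_flags |= SKETCH_GROUP_OOM;
		else if (len != 0)	/* sketch choice: empty tokens are skipped */
			return -EINVAL;

		data += len;
		if (*data == ',')
			data++;
	}
	return 0;
}

/* Applies the parsed options: each bit is set if requested, cleared otherwise. */
void sketch_apply_root_flags(unsigned int *active, unsigned int requested)
{
	const unsigned int mask = SKETCH_NS_DELEGATE | SKETCH_GROUP_OOM;

	assert(active != NULL);

	*active &= ~mask;
	*active |= requested & mask;
}
```

Parsing "nsdelegate,groupoom" and applying the result sets both bits; a
later remount with only "nsdelegate" clears the group OOM bit, and an
unknown token such as "bogus" is rejected with -EINVAL.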

* Re: [v8 3/4] mm, oom: add cgroup v2 mount option for cgroup-aware OOM killer
  2017-09-11 13:17 ` [v8 3/4] mm, oom: add cgroup v2 mount option for " Roman Gushchin
@ 2017-09-11 20:48   ` David Rientjes
  2017-09-12 20:01     ` Roman Gushchin
  0 siblings, 1 reply; 78+ messages in thread
From: David Rientjes @ 2017-09-11 20:48 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: linux-mm, Michal Hocko, Vladimir Davydov, Johannes Weiner,
	Tetsuo Handa, Andrew Morton, Tejun Heo, kernel-team, cgroups,
	linux-doc, linux-kernel

On Mon, 11 Sep 2017, Roman Gushchin wrote:

> Add a "groupoom" cgroup v2 mount option to enable the cgroup-aware
> OOM killer. If not set, the OOM selection is performed in
> a "traditional" per-process way.
> 
> The behavior can be changed dynamically by remounting the cgroupfs.

I can't imagine that Tejun would be happy with a new mount option, 
especially when it's not required.

OOM behavior does not need to be defined at mount time and for the entire 
hierarchy.  It's possible to very easily implement a tunable as part of 
mem cgroup that is propagated to descendants and controls the oom scoring 
behavior for that hierarchy.  It does not need to be system wide and 
affect scoring of all processes based on which mem cgroup they are 
attached to at any given time.

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [v8 3/4] mm, oom: add cgroup v2 mount option for cgroup-aware OOM killer
  2017-09-11 20:48   ` David Rientjes
@ 2017-09-12 20:01     ` Roman Gushchin
  2017-09-12 20:23       ` David Rientjes
  2017-09-13 12:23       ` Michal Hocko
  0 siblings, 2 replies; 78+ messages in thread
From: Roman Gushchin @ 2017-09-12 20:01 UTC (permalink / raw)
  To: David Rientjes
  Cc: linux-mm, Michal Hocko, Vladimir Davydov, Johannes Weiner,
	Tetsuo Handa, Andrew Morton, Tejun Heo, kernel-team, cgroups,
	linux-doc, linux-kernel

On Mon, Sep 11, 2017 at 01:48:39PM -0700, David Rientjes wrote:
> On Mon, 11 Sep 2017, Roman Gushchin wrote:
> 
> > Add a "groupoom" cgroup v2 mount option to enable the cgroup-aware
> > OOM killer. If not set, the OOM selection is performed in
> > a "traditional" per-process way.
> > 
> > The behavior can be changed dynamically by remounting the cgroupfs.
> 
> I can't imagine that Tejun would be happy with a new mount option, 
> especially when it's not required.
> 
> OOM behavior does not need to be defined at mount time and for the entire 
> hierarchy.  It's possible to very easily implement a tunable as part of 
> mem cgroup that is propagated to descendants and controls the oom scoring 
> behavior for that hierarchy.  It does not need to be system wide and 
> affect scoring of all processes based on which mem cgroup they are 
> attached to at any given time.

No, I don't think that mixing per-cgroup and per-process OOM selection
algorithms is a good idea.

So, there are 3 reasonable options:
1) boot option
2) sysctl
3) cgroup mount option

I believe 3) is better, because it allows changing the behavior dynamically
and explicitly depends on v2 (which a sysctl does not).

So the only question is whether it should be an opt-in or an opt-out option.
Personally, I would prefer opt-out, but Michal has a very strong opinion here.

Thanks!

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [v8 3/4] mm, oom: add cgroup v2 mount option for cgroup-aware OOM killer
  2017-09-12 20:01     ` Roman Gushchin
@ 2017-09-12 20:23       ` David Rientjes
  2017-09-13 12:23       ` Michal Hocko
  1 sibling, 0 replies; 78+ messages in thread
From: David Rientjes @ 2017-09-12 20:23 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: linux-mm, Michal Hocko, Vladimir Davydov, Johannes Weiner,
	Tetsuo Handa, Andrew Morton, Tejun Heo, kernel-team, cgroups,
	linux-doc, linux-kernel

On Tue, 12 Sep 2017, Roman Gushchin wrote:

> > I can't imagine that Tejun would be happy with a new mount option, 
> > especially when it's not required.
> > 
> > OOM behavior does not need to be defined at mount time and for the entire 
> > hierarchy.  It's possible to very easily implement a tunable as part of 
> > mem cgroup that is propagated to descendants and controls the oom scoring 
> > behavior for that hierarchy.  It does not need to be system wide and 
> > affect scoring of all processes based on which mem cgroup they are 
> > attached to at any given time.
> 
> No, I don't think that mixing per-cgroup and per-process OOM selection
> algorithms is a good idea.
> 
> So, there are 3 reasonable options:
> 1) boot option
> 2) sysctl
> 3) cgroup mount option
> 
> I believe 3) is better, because it allows changing the behavior dynamically
> and explicitly depends on v2 (which a sysctl does not).
> 
> So the only question is whether it should be an opt-in or an opt-out option.
> Personally, I would prefer opt-out, but Michal has a very strong opinion here.
> 

If it absolutely must be a mount option, then I would agree it should be 
opt-in so that it's known what is being changed rather than changing how 
selection was done in the past and requiring legacy users to now mount in 
a new way.

I'd be interested to hear Tejun's comments, however, about whether we want 
to add controller specific mount options like this instead of a tunable at 
the root level, for instance, that controls victim selection and would be 
isolated to the memory cgroup controller as opposed to polluting mount 
options.

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [v8 3/4] mm, oom: add cgroup v2 mount option for cgroup-aware OOM killer
  2017-09-12 20:01     ` Roman Gushchin
  2017-09-12 20:23       ` David Rientjes
@ 2017-09-13 12:23       ` Michal Hocko
  1 sibling, 0 replies; 78+ messages in thread
From: Michal Hocko @ 2017-09-13 12:23 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: David Rientjes, linux-mm, Vladimir Davydov, Johannes Weiner,
	Tetsuo Handa, Andrew Morton, Tejun Heo, kernel-team, cgroups,
	linux-doc, linux-kernel

On Tue 12-09-17 21:01:15, Roman Gushchin wrote:
> On Mon, Sep 11, 2017 at 01:48:39PM -0700, David Rientjes wrote:
> > On Mon, 11 Sep 2017, Roman Gushchin wrote:
> > 
> > > Add a "groupoom" cgroup v2 mount option to enable the cgroup-aware
> > > OOM killer. If not set, the OOM selection is performed in
> > > a "traditional" per-process way.
> > > 
> > > The behavior can be changed dynamically by remounting the cgroupfs.
> > 
> > I can't imagine that Tejun would be happy with a new mount option, 
> > especially when it's not required.
> > 
> > OOM behavior does not need to be defined at mount time and for the entire 
> > hierarchy.  It's possible to very easily implement a tunable as part of 
> > mem cgroup that is propagated to descendants and controls the oom scoring 
> > behavior for that hierarchy.  It does not need to be system wide and 
> > affect scoring of all processes based on which mem cgroup they are 
> > attached to at any given time.
> 
> No, I don't think that mixing per-cgroup and per-process OOM selection
> algorithms is a good idea.
> 
> So, there are 3 reasonable options:
> 1) boot option
> 2) sysctl
> 3) cgroup mount option
> 
> I believe 3) is better, because it allows changing the behavior dynamically
> and explicitly depends on v2 (which a sysctl does not).

I see your argument here. I would just be worried that we end up really
needing more oom strategies in the future and those wouldn't fit into the
memcg mount option scope. So 1/2 sounds more extensible to me long term. A
boot time option would be easier because we do not have to bother with
dynamic selection in that case.

> So the only question is whether it should be an opt-in or an opt-out option.
> Personally, I would prefer opt-out, but Michal has a very strong opinion here.

Yes I still strongly believe this has to be opt-in.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 78+ messages in thread

* [v8 4/4] mm, oom, docs: describe the cgroup-aware OOM killer
  2017-09-11 13:17 [v8 0/4] cgroup-aware OOM killer Roman Gushchin
                   ` (2 preceding siblings ...)
  2017-09-11 13:17 ` [v8 3/4] mm, oom: add cgroup v2 mount option for " Roman Gushchin
@ 2017-09-11 13:17 ` Roman Gushchin
  2017-09-11 20:44 ` [v8 0/4] " David Rientjes
  4 siblings, 0 replies; 78+ messages in thread
From: Roman Gushchin @ 2017-09-11 13:17 UTC (permalink / raw)
  To: linux-mm
  Cc: Roman Gushchin, Michal Hocko, Vladimir Davydov, Johannes Weiner,
	Tetsuo Handa, Andrew Morton, David Rientjes, Tejun Heo,
	kernel-team, cgroups, linux-doc, linux-kernel

Document the cgroup-aware OOM killer.

Signed-off-by: Roman Gushchin <guro@fb.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: kernel-team@fb.com
Cc: cgroups@vger.kernel.org
Cc: linux-doc@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
---
 Documentation/cgroup-v2.txt | 39 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 39 insertions(+)

diff --git a/Documentation/cgroup-v2.txt b/Documentation/cgroup-v2.txt
index dc44785dc0fa..61a2e959e07a 100644
--- a/Documentation/cgroup-v2.txt
+++ b/Documentation/cgroup-v2.txt
@@ -48,6 +48,7 @@ v1 is available under Documentation/cgroup-v1/.
        5-2-1. Memory Interface Files
        5-2-2. Usage Guidelines
        5-2-3. Memory Ownership
+       5-2-4. OOM Killer
      5-3. IO
        5-3-1. IO Interface Files
        5-3-2. Writeback
@@ -1034,6 +1035,18 @@ PAGE_SIZE multiple when read back.
 	high limit is used and monitored properly, this limit's
 	utility is limited to providing the final safety net.
 
+  memory.oom_group
+
+	A read-write single value file which exists on non-root
+	cgroups.  The default is "0".
+
+	If set, the OOM killer will kill all processes attached to the
+	cgroup if it is selected as an OOM victim.
+
+	The OOM killer respects the /proc/pid/oom_score_adj value -1000
+	and will never kill such an unkillable task, even if
+	memory.oom_group is set.
+
   memory.events
 	A read-only flat-keyed file which exists on non-root cgroups.
 	The following entries are defined.  Unless specified
@@ -1237,6 +1250,32 @@ to be accessed repeatedly by other cgroups, it may make sense to use
 POSIX_FADV_DONTNEED to relinquish the ownership of memory areas
 belonging to the affected files to ensure correct memory ownership.
 
+OOM Killer
+~~~~~~~~~~
+
+The cgroup v2 memory controller implements a cgroup-aware OOM killer,
+which means that it treats cgroups as first class OOM entities.
+
+Under OOM conditions the memory controller tries to make the best
+choice of a victim, hierarchically looking for a cgroup with the
+largest memory footprint.
+
+By default, the OOM killer will kill the biggest task in the selected
+memory cgroup. A user can change this behavior by enabling
+the per-cgroup oom_group option. If set, it causes the OOM killer
+to kill all processes attached to the cgroup, except processes
+with oom_score_adj set to -1000.
+
+This affects both system- and cgroup-wide OOMs. For a cgroup-wide OOM
+the memory controller considers only cgroups belonging to the sub-tree
+of the OOM'ing cgroup.
+
+The root cgroup is treated as a leaf memory cgroup, so it's compared
+with top-level memory cgroups.
+
+If there are no cgroups with the memory controller enabled,
+the OOM killer uses the "traditional" process-based approach.
+
 
 IO
 --
-- 
2.13.5

^ permalink raw reply related	[flat|nested] 78+ messages in thread
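
The hierarchical victim search described in the documentation patch
above can be sketched in userspace as a walk down the cgroup tree. The
assumptions here are significant: struct sketch_cgroup and
sketch_select_victim() are hypothetical, each node's footprint is taken
to be already aggregated over its subtree, and neither the special
treatment of the root cgroup as a leaf nor oom_group is modelled. At
each level the child with the largest footprint is chosen, and the walk
stops at a leaf.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified memory cgroup node. */
struct sketch_cgroup {
	long footprint;			/* subtree footprint in pages, assumed precomputed */
	struct sketch_cgroup **children;
	size_t nr_children;
};

/* Descends from root, always following the largest child, down to a leaf. */
struct sketch_cgroup *sketch_select_victim(struct sketch_cgroup *root)
{
	struct sketch_cgroup *cur = root;

	assert(root != NULL);

	while (cur->nr_children) {
		struct sketch_cgroup *best = cur->children[0];
		size_t i;

		for (i = 1; i < cur->nr_children; i++) {
			if (cur->children[i]->footprint > best->footprint)
				best = cur->children[i];
		}
		cur = best;
	}
	return cur;
}
```

For a tree with a top-level cgroup A of 600 pages (children of 100 and
500 pages) and a top-level leaf B of 550 pages, the walk picks A and then
its 500-page child, even though B is the largest single leaf. That
difference between hierarchical and leaf-only comparison is one of the
points debated in this thread.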

* Re: [v8 0/4] cgroup-aware OOM killer
  2017-09-11 13:17 [v8 0/4] cgroup-aware OOM killer Roman Gushchin
                   ` (3 preceding siblings ...)
  2017-09-11 13:17 ` [v8 4/4] mm, oom, docs: describe the " Roman Gushchin
@ 2017-09-11 20:44 ` David Rientjes
  2017-09-13 12:29   ` Michal Hocko
  2017-09-21 14:21   ` Johannes Weiner
  4 siblings, 2 replies; 78+ messages in thread
From: David Rientjes @ 2017-09-11 20:44 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: linux-mm, Michal Hocko, Vladimir Davydov, Johannes Weiner,
	Tetsuo Handa, Andrew Morton, Tejun Heo, kernel-team, cgroups,
	linux-doc, linux-kernel

On Mon, 11 Sep 2017, Roman Gushchin wrote:

> This patchset makes the OOM killer cgroup-aware.
> 
> v8:
>   - Do not kill tasks with OOM_SCORE_ADJ -1000
>   - Make the whole thing opt-in with cgroup mount option control
>   - Drop oom_priority for further discussions

Nack, we specifically require oom_priority for this to function correctly, 
otherwise we cannot prefer to kill from low priority leaf memcgs as 
required.  v8 appears to implement new functionality that we want, to 
compare two memcgs based on usage, but without the ability to influence 
that decision to protect important userspace, so now I'm in a position 
where (1) nothing has changed if I don't use the new mount option or (2) I 
get completely different oom kill selection with the new mount option but 
not the ability to influence it.  I was much happier with the direction 
that v7 was taking, but since v8 causes us to regress without the ability 
to change memcg priority, this has to be nacked.

>   - Kill the whole cgroup if oom_group is set and it's
>     memory.max is reached
>   - Update docs and commit messages

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [v8 0/4] cgroup-aware OOM killer
  2017-09-11 20:44 ` [v8 0/4] " David Rientjes
@ 2017-09-13 12:29   ` Michal Hocko
  2017-09-13 20:46     ` David Rientjes
  2017-09-13 21:56     ` Roman Gushchin
  2017-09-21 14:21   ` Johannes Weiner
  1 sibling, 2 replies; 78+ messages in thread
From: Michal Hocko @ 2017-09-13 12:29 UTC (permalink / raw)
  To: David Rientjes
  Cc: Roman Gushchin, linux-mm, Vladimir Davydov, Johannes Weiner,
	Tetsuo Handa, Andrew Morton, Tejun Heo, kernel-team, cgroups,
	linux-doc, linux-kernel

On Mon 11-09-17 13:44:39, David Rientjes wrote:
> On Mon, 11 Sep 2017, Roman Gushchin wrote:
> 
> > This patchset makes the OOM killer cgroup-aware.
> > 
> > v8:
> >   - Do not kill tasks with OOM_SCORE_ADJ -1000
> >   - Make the whole thing opt-in with cgroup mount option control
> >   - Drop oom_priority for further discussions
> 
> Nack, we specifically require oom_priority for this to function correctly, 
> otherwise we cannot prefer to kill from low priority leaf memcgs as 
> required.

While I understand that your usecase might require priorities, I do not
think that this missing part is a reason to nack the cgroup based selection
and kill-all parts. This can be done on top. The only important part
right now is the current selection semantic (only leaf memcgs vs. size
of the hierarchy). I strongly believe that comparing only leaf memcgs
is more straightforward and it doesn't lead to unexpected results as
mentioned before (killing a small memcg which is a part of the larger
sub-hierarchy).

I didn't get to read the new version of this series yet and hope to get
to it soon.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [v8 0/4] cgroup-aware OOM killer
  2017-09-13 12:29   ` Michal Hocko
@ 2017-09-13 20:46     ` David Rientjes
  2017-09-14 13:34       ` Michal Hocko
  2017-09-13 21:56     ` Roman Gushchin
  1 sibling, 1 reply; 78+ messages in thread
From: David Rientjes @ 2017-09-13 20:46 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Roman Gushchin, linux-mm, Vladimir Davydov, Johannes Weiner,
	Tetsuo Handa, Andrew Morton, Tejun Heo, kernel-team, cgroups,
	linux-doc, linux-kernel

On Wed, 13 Sep 2017, Michal Hocko wrote:

> > > This patchset makes the OOM killer cgroup-aware.
> > > 
> > > v8:
> > >   - Do not kill tasks with OOM_SCORE_ADJ -1000
> > >   - Make the whole thing opt-in with cgroup mount option control
> > >   - Drop oom_priority for further discussions
> > 
> > Nack, we specifically require oom_priority for this to function correctly, 
> > otherwise we cannot prefer to kill from low priority leaf memcgs as 
> > required.
> 
> While I understand that your usecase might require priorities, I do not
> think that this missing part is a reason to nack the cgroup based selection
> and kill-all parts. This can be done on top. The only important part
> right now is the current selection semantic (only leaf memcgs vs. size
> of the hierarchy). I strongly believe that comparing only leaf memcgs
> is more straightforward and it doesn't lead to unexpected results as
> mentioned before (killing a small memcg which is a part of the larger
> sub-hierarchy).
> 

The problem is that we cannot enable the cgroup-aware oom killer and 
oom_group behavior because, without oom priorities, we have no ability to 
influence the cgroup that it chooses.  It is doing two things: providing 
more fairness amongst cgroups by selecting based on cumulative usage 
rather than a single large process (good!), and effectively removing all 
userspace control of oom selection (bad).  We want the former, but it 
needs to be coupled with support so that we can protect vital cgroups, 
regardless of their usage.

It is certainly possible to add oom priorities on top before it is merged, 
but I don't see why it isn't part of the patchset.  We need it before it's 
merged to avoid users playing with /proc/pid/oom_score_adj to prevent any 
killing in the most preferable memcg when they could have simply changed 
the oom priority.

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [v8 0/4] cgroup-aware OOM killer
  2017-09-13 20:46     ` David Rientjes
@ 2017-09-14 13:34       ` Michal Hocko
  2017-09-14 20:07         ` David Rientjes
  0 siblings, 1 reply; 78+ messages in thread
From: Michal Hocko @ 2017-09-14 13:34 UTC (permalink / raw)
  To: David Rientjes
  Cc: Roman Gushchin, linux-mm, Vladimir Davydov, Johannes Weiner,
	Tetsuo Handa, Andrew Morton, Tejun Heo, kernel-team, cgroups,
	linux-doc, linux-kernel

On Wed 13-09-17 13:46:08, David Rientjes wrote:
> On Wed, 13 Sep 2017, Michal Hocko wrote:
> 
> > > > This patchset makes the OOM killer cgroup-aware.
> > > > 
> > > > v8:
> > > >   - Do not kill tasks with OOM_SCORE_ADJ -1000
> > > >   - Make the whole thing opt-in with cgroup mount option control
> > > >   - Drop oom_priority for further discussions
> > > 
> > > Nack, we specifically require oom_priority for this to function correctly, 
> > > otherwise we cannot prefer to kill from low priority leaf memcgs as 
> > > required.
> > 
> > While I understand that your usecase might require priorities, I do not
> > think that this missing part is a reason to nack the cgroup based selection
> > and kill-all parts. This can be done on top. The only important part
> > right now is the current selection semantic (only leaf memcgs vs. size
> > of the hierarchy). I strongly believe that comparing only leaf memcgs
> > is more straightforward and it doesn't lead to unexpected results as
> > mentioned before (killing a small memcg which is a part of the larger
> > sub-hierarchy).
> > 
> 
> The problem is that we cannot enable the cgroup-aware oom killer and 
> oom_group behavior because, without oom priorities, we have no ability to 
> influence the cgroup that it chooses.  It is doing two things: providing 
> more fairness amongst cgroups by selecting based on cumulative usage 
> rather than a single large process (good!), and effectively removing all 
> userspace control of oom selection (bad).  We want the former, but it 
> needs to be coupled with support so that we can protect vital cgroups, 
> regardless of their usage.

I understand that your usecase needs a more fine grained control over
the selection but that alone is not a reason to nack the implementation
which doesn't provide it (yet).

> It is certainly possible to add oom priorities on top before it is merged, 
> but I don't see why it isn't part of the patchset.

Because the semantic of the priority for non-leaf memcgs is not fully
clear and I would rather have the core of the functionality merged
before this is sorted out.

> We need it before its 
> merged to avoid users playing with /proc/pid/oom_score_adj to prevent any 
> killing in the most preferable memcg when they could have simply changed 
> the oom priority.

I am sorry but I do not really understand your concern. Are you
suggesting that users would start to oom disable all tasks in a memcg to
give it a higher priority? Even if that was the case why should such an
abuse be a blocker for a generic memcg-aware oom killer being merged?
-- 
Michal Hocko
SUSE Labs

* Re: [v8 0/4] cgroup-aware OOM killer
  2017-09-14 13:34       ` Michal Hocko
@ 2017-09-14 20:07         ` David Rientjes
  0 siblings, 0 replies; 78+ messages in thread
From: David Rientjes @ 2017-09-14 20:07 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Roman Gushchin, linux-mm, Vladimir Davydov, Johannes Weiner,
	Tetsuo Handa, Andrew Morton, Tejun Heo, kernel-team, cgroups,
	linux-doc, linux-kernel

On Thu, 14 Sep 2017, Michal Hocko wrote:

> > It is certainly possible to add oom priorities on top before it is merged, 
> > but I don't see why it isn't part of the patchset.
> 
> Because the semantic of the priority for non-leaf memcgs is not fully
> clear and I would rather have the core of the functionality merged
> before this is sorted out.
> 

We can't merge the core of the feature before this is sorted out because 
then users start to depend on behavior and we must be backwards 
compatible.  We need a full patchset that introduces the new selection 
heuristic and a way for userspace to control it to either bias or prefer 
one cgroup over another.  The kill-all mechanism is a more orthogonal 
feature for the cgroup-aware oom killer than oom priorities.

I have a usecase for both the cgroup-aware oom killer and its oom 
priorities from previous versions of this patchset, I assume that Roman 
does as well, and would like to see it merged because there are real-world 
usecases for it rather than hypothetical usecases that would want to do 
something different.

> > We need it before its 
> > merged to avoid users playing with /proc/pid/oom_score_adj to prevent any 
> > killing in the most preferable memcg when they could have simply changed 
> > the oom priority.
> 
> I am sorry but I do not really understand your concern. Are you
> suggesting that users would start oom disable all tasks in a memcg to
> give it a higher priority? Even if that was the case why should such an
> abuse be a blocker for generic memcg aware oom killer being merged?

If users do not have any way to control victim selection because of a 
shortcoming in the kernel implementation, they will be required to oom 
disable processes and let that be inherited by children they fork in the 
memcg hierarchy to protect cgroups that they do not want to be oom killed, 
regardless of their size.  They simply are left with no other alternative 
if they want to use the cgroup-aware oom killer and/or the kill-all 
mechanism.
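
For illustration, the workaround described above amounts to writing -1000
to /proc/<pid>/oom_score_adj for a parent process and relying on children
inheriting the value on fork. The following is a minimal sketch only: the
helper name oom_disable and its proc_root parameter are hypothetical and
exist so the function can be exercised against a scratch directory.
Lowering the value on a real system normally requires CAP_SYS_RESOURCE.

```python
from pathlib import Path

# -1000 is the oom_score_adj value that exempts a task from OOM killing.
OOM_SCORE_ADJ_MIN = -1000


def oom_disable(pid, proc_root="/proc"):
    """Write OOM_SCORE_ADJ_MIN to <proc_root>/<pid>/oom_score_adj.

    proc_root is a hypothetical parameter used for testing; on a live
    system it stays at "/proc" and the write may fail without privilege.
    Returns the path that was written.
    """
    path = Path(proc_root) / str(pid) / "oom_score_adj"
    path.write_text(f"{OOM_SCORE_ADJ_MIN}\n")
    return path
```

Every task forked afterwards from that pid starts with the same value, which
is why whole subtrees of processes end up exempt regardless of their size.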

* Re: [v8 0/4] cgroup-aware OOM killer
  2017-09-13 12:29   ` Michal Hocko
  2017-09-13 20:46     ` David Rientjes
@ 2017-09-13 21:56     ` Roman Gushchin
  2017-09-14 13:40       ` Michal Hocko
  1 sibling, 1 reply; 78+ messages in thread
From: Roman Gushchin @ 2017-09-13 21:56 UTC (permalink / raw)
  To: Michal Hocko
  Cc: David Rientjes, linux-mm, Vladimir Davydov, Johannes Weiner,
	Tetsuo Handa, Andrew Morton, Tejun Heo, kernel-team, cgroups,
	linux-doc, linux-kernel

On Wed, Sep 13, 2017 at 02:29:14PM +0200, Michal Hocko wrote:
> On Mon 11-09-17 13:44:39, David Rientjes wrote:
> > On Mon, 11 Sep 2017, Roman Gushchin wrote:
> > 
> > > This patchset makes the OOM killer cgroup-aware.
> > > 
> > > v8:
> > >   - Do not kill tasks with OOM_SCORE_ADJ -1000
> > >   - Make the whole thing opt-in with cgroup mount option control
> > >   - Drop oom_priority for further discussions
> > 
> > Nack, we specifically require oom_priority for this to function correctly, 
> > otherwise we cannot prefer to kill from low priority leaf memcgs as 
> > required.
> 
> While I understand that your usecase might require priorities I do not
> think this part missing is a reason to nack the cgroup based selection
> and kill-all parts. This can be done on top. The only important part
> right now is the current selection semantic - only leaf memcgs vs. size
> of the hierarchy).

I agree.

> I strongly believe that comparing only leaf memcgs
> is more straightforward and it doesn't lead to unexpected results as
> mentioned before (kill a small memcg which is a part of the larger
> sub-hierarchy).

One of two main goals of this patchset is to introduce cgroup-level
fairness: bigger cgroups should be affected more than smaller,
despite the size of tasks inside. I believe the same principle
should be used for cgroups.

Also, the opposite will make oom_semantics more weird: it will mean
kill all tasks, but also treat memcg as a leaf cgroup.

> 
> I didn't get to read the new version of this series yet and hope to get
> to it soon.

Thanks!

* Re: [v8 0/4] cgroup-aware OOM killer
  2017-09-13 21:56     ` Roman Gushchin
@ 2017-09-14 13:40       ` Michal Hocko
  2017-09-14 16:05         ` Roman Gushchin
  0 siblings, 1 reply; 78+ messages in thread
From: Michal Hocko @ 2017-09-14 13:40 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: David Rientjes, linux-mm, Vladimir Davydov, Johannes Weiner,
	Tetsuo Handa, Andrew Morton, Tejun Heo, kernel-team, cgroups,
	linux-doc, linux-kernel

On Wed 13-09-17 14:56:07, Roman Gushchin wrote:
> On Wed, Sep 13, 2017 at 02:29:14PM +0200, Michal Hocko wrote:
[...]
> > I strongly believe that comparing only leaf memcgs
> > is more straightforward and it doesn't lead to unexpected results as
> > mentioned before (kill a small memcg which is a part of the larger
> > sub-hierarchy).
> 
> One of two main goals of this patchset is to introduce cgroup-level
> fairness: bigger cgroups should be affected more than smaller,
> despite the size of tasks inside. I believe the same principle
> should be used for cgroups.

Yes bigger cgroups should be preferred but I fail to see why bigger
hierarchies should be considered as well if they are not kill-all. And
whether non-leaf memcgs should allow kill-all is not entirely clear to
me. What would be the usecase?
Consider that it might be not your choice (as a user) how deep is your
leaf memcg. I can already see how people complain that their memcg has
been killed just because it was one level deeper in the hierarchy...

I would really start simple and only allow kill-all on leaf memcgs and
only compare leaf memcgs & root. If we ever need to kill whole
hierarchies then allow kill-all on intermediate memcgs as well and then
consider cumulative consumptions only on those that have kill-all
enabled.

Or do I miss any reasonable usecase that would suffer from such a
semantic?
-- 
Michal Hocko
SUSE Labs

* Re: [v8 0/4] cgroup-aware OOM killer
  2017-09-14 13:40       ` Michal Hocko
@ 2017-09-14 16:05         ` Roman Gushchin
  2017-09-15 10:58           ` Michal Hocko
  0 siblings, 1 reply; 78+ messages in thread
From: Roman Gushchin @ 2017-09-14 16:05 UTC (permalink / raw)
  To: Michal Hocko
  Cc: David Rientjes, linux-mm, Vladimir Davydov, Johannes Weiner,
	Tetsuo Handa, Andrew Morton, Tejun Heo, kernel-team, cgroups,
	linux-doc, linux-kernel

On Thu, Sep 14, 2017 at 03:40:14PM +0200, Michal Hocko wrote:
> On Wed 13-09-17 14:56:07, Roman Gushchin wrote:
> > On Wed, Sep 13, 2017 at 02:29:14PM +0200, Michal Hocko wrote:
> [...]
> > > I strongly believe that comparing only leaf memcgs
> > > is more straightforward and it doesn't lead to unexpected results as
> > > mentioned before (kill a small memcg which is a part of the larger
> > > sub-hierarchy).
> > 
> > One of two main goals of this patchset is to introduce cgroup-level
> > fairness: bigger cgroups should be affected more than smaller,
> > despite the size of tasks inside. I believe the same principle
> > should be used for cgroups.
> 
> Yes bigger cgroups should be preferred but I fail to see why bigger
> hierarchies should be considered as well if they are not kill-all. And
> whether non-leaf memcgs should allow kill-all is not entirely clear to
> me. What would be the usecase?

We definitely want to support kill-all for non-leaf cgroups.
A workload can consist of several cgroups and we want to clean up
the whole thing on OOM. I don't see any reasons to limit
this functionality to leaf cgroups only.

Hierarchies are memory consumers, we do account their usage,
we do apply limits and guarantees for the hierarchies. The same is
with OOM victim selection: we are reclaiming memory from the
biggest consumer. Kill-all knob only defines the way _how_ we do that:
by killing one or all processes.
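
A minimal sketch of this model, for illustration only and not the patchset's
code: the Memcg class, its fields and select_victim() are hypothetical names.
The walk takes the biggest consumer at each level by cumulative usage and
stops at a cgroup with the kill-all knob set, or at a leaf.

```python
from dataclasses import dataclass, field


@dataclass
class Memcg:
    name: str
    usage: int = 0            # memory charged directly to this cgroup
    oom_group: bool = False   # kill-all knob (hypothetical field name)
    children: list = field(default_factory=list)

    def total(self):
        """Cumulative usage of this cgroup and its whole subtree."""
        return self.usage + sum(c.total() for c in self.children)


def select_victim(root):
    """Descend from root, taking the biggest consumer at each level.

    Stops at a cgroup with oom_group set (everything in it is killed)
    or at a leaf (a single task there is killed).
    Returns (memcg, kill_all).
    """
    node = root
    while node.children and not node.oom_group:
        node = max(node.children, key=Memcg.total)
    return node, node.oom_group
```

In this sketch the knob never changes which subtree is chosen at a given
level; it only decides whether the walk stops there and the whole subtree is
killed, or continues down to a single leaf.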

Just for example, we might want to take memory.low into account at
some point: prefer cgroups which are above their guarantees, avoid
killing those who fit. It would be hard if we're comparing cgroups
from different hierarchies. The same will be with introducing
oom_priorities, which is much more required functionality.

> Consider that it might be not your choice (as a user) how deep is your
> leaf memcg. I can already see how people complain that their memcg has
> been killed just because it was one level deeper in the hierarchy...

The kill-all functionality is enforced by parent, and it seems to be
following the overall memcg design. The parent cgroup enforces memory
limit, memory low limit, etc.

I don't know why OOM control should be different.

> 
> I would really start simple and only allow kill-all on leaf memcgs and
> only compare leaf memcgs & root. If we ever need to kill whole
> hierarchies then allow kill-all on intermediate memcgs as well and then
> consider cumulative consumptions only on those that have kill-all
> enabled.

This sounds hacky to me: the whole thing is depending on cgroup v2 and
is additionally explicitly opt-in.

Why do we need to introduce such incomplete functionality first,
and then suffer trying to extend it and provide backward compatibility?

Also, I think we should compare root cgroup with top-level cgroups,
rather than leaf cgroups. A process in the root cgroup is definitely
a system-level entity, and we should compare it with other top-level
entities (other containerized workloads), rather than some random
leaf cgroup deep inside the tree. If we decided that we're not comparing
random tasks from different cgroups, why should we do this for leaf
cgroups? It sounds like making only one step in the right direction,
while we can do more.

> 
> Or do I miss any reasonable usecase that would suffer from such a
> semantic?

Kill-all for sub-trees is definitely required.
Enforcing oom_priorities for sub-trees is something that I would expect
to be very useful too. Comparing leaf cgroups system-wide instead of
processes doesn't sound good to me: we would lack hierarchical fairness,
which was one of two goals of this patchset.

Thanks!

* Re: [v8 0/4] cgroup-aware OOM killer
  2017-09-14 16:05         ` Roman Gushchin
@ 2017-09-15 10:58           ` Michal Hocko
  2017-09-15 15:23             ` Roman Gushchin
  0 siblings, 1 reply; 78+ messages in thread
From: Michal Hocko @ 2017-09-15 10:58 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: David Rientjes, linux-mm, Vladimir Davydov, Johannes Weiner,
	Tetsuo Handa, Andrew Morton, Tejun Heo, kernel-team, cgroups,
	linux-doc, linux-kernel

On Thu 14-09-17 09:05:48, Roman Gushchin wrote:
> On Thu, Sep 14, 2017 at 03:40:14PM +0200, Michal Hocko wrote:
> > On Wed 13-09-17 14:56:07, Roman Gushchin wrote:
> > > On Wed, Sep 13, 2017 at 02:29:14PM +0200, Michal Hocko wrote:
> > [...]
> > > > I strongly believe that comparing only leaf memcgs
> > > > is more straightforward and it doesn't lead to unexpected results as
> > > > mentioned before (kill a small memcg which is a part of the larger
> > > > sub-hierarchy).
> > > 
> > > One of two main goals of this patchset is to introduce cgroup-level
> > > fairness: bigger cgroups should be affected more than smaller,
> > > despite the size of tasks inside. I believe the same principle
> > > should be used for cgroups.
> > 
> > Yes bigger cgroups should be preferred but I fail to see why bigger
> > hierarchies should be considered as well if they are not kill-all. And
> > whether non-leaf memcgs should allow kill-all is not entirely clear to
> > me. What would be the usecase?
> 
> We definitely want to support kill-all for non-leaf cgroups.
> A workload can consist of several cgroups and we want to clean up
> the whole thing on OOM.

Could you be more specific about such a workload? E.g. how can such a
hierarchy be handled consistently when its sub-tree gets killed due to
internal memory pressure? Or do you expect that none of the subtree will
have a hard limit configured?

> I don't see any reasons to limit this functionality to leaf cgroups
> only.

Well, I wanted to start simple first and extend on top. Memcg v1 is full
of seemingly interesting and very generic concepts which turned out
to be a headache in the long term.

> Hierarchies are memory consumers, we do account their usage,
> we do apply limits and guarantees for the hierarchies. The same is
> with OOM victim selection: we are reclaiming memory from the
> biggest consumer. Kill-all knob only defines the way _how_ we do that:
> by killing one or all processes.

But then you just enforce a structural restriction on your configuration
because
	root
        /  \
       A    D
      /\   
     B  C

is a different thing than
	root
        / | \
       B  C  D

And consider that the sole purpose of A might be a control over
a non-memory resource (e.g. a cpu share distribution). Why should we
discriminate B and C in such a case?
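
To make the difference concrete, here is a small sketch with made-up numbers
(B=3, C=3, D=5, and A holding no memory of its own). The function names and
the dict-based tree encoding are assumptions for illustration; neither
function is taken from the patchset.

```python
def total(tree):
    """Cumulative usage: a leaf is an int, an inner node a dict of children."""
    if isinstance(tree, int):
        return tree
    return sum(total(sub) for sub in tree.values())


def pick_by_level(tree):
    """Pick the biggest child at each level, descending until a leaf."""
    name = None
    while isinstance(tree, dict):
        name = max(tree, key=lambda k: total(tree[k]))
        tree = tree[name]
    return name


def pick_largest_leaf(tree):
    """Compare all leaves directly, ignoring how deep they sit."""
    leaves = {}

    def walk(node):
        for name, sub in node.items():
            if isinstance(sub, dict):
                walk(sub)
            else:
                leaves[name] = sub

    walk(tree)
    return max(leaves, key=leaves.get)


nested = {"A": {"B": 3, "C": 3}, "D": 5}   # first layout above
flat = {"B": 3, "C": 3, "D": 5}            # second layout above
```

With these numbers pick_by_level() returns B for the nested layout (A's
cumulative 6 beats D's 5) but D for the flat one, while pick_largest_leaf()
returns D for both. The outcome of the level-by-level walk depends on the
presence of A even though A itself consumes nothing.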

> Just for example, we might want to take memory.low into account at
> some point: prefer cgroups which are above their guarantees, avoid
> killing those who fit. It would be hard if we're comparing cgroups
> from different hierarchies.

This can be reflected in the memcg oom score calculation I believe. We
already do something similar during the reclaim.

> The same will be with introducing oom_priorities, which is much more
> required functionality.

More on that below.

> > Consider that it might be not your choice (as a user) how deep is your
> > leaf memcg. I can already see how people complain that their memcg has
> > been killed just because it was one level deeper in the hierarchy...
> 
> The kill-all functionality is enforced by parent, and it seems to be
> following the overall memcg design. The parent cgroup enforces memory
> limit, memory low limit, etc.

And the same is true for the memcg oom killer. It enforces the selection
to the out-of-memory subtree. We are trying to be proportional on the
size of the _reclaimable_ memory in that scope. Same way as with the
LRU reclaim. We do not prefer larger hierarchies over smaller. We just
iterate over those that have pages on LRUs (leaf memcgs with v2) and
scan/reclaim proportionally to their size. Why should the oom killer
decision be any different in that regard?

> I don't know why OOM control should be different.

I am not arguing that kill-all functionality on non-leaf is wrong. I
just haven't heard the usecase for it yet. I am also not opposed to
consider the cumulative size of non-leaf memcg if it is kill-all as the
cumulative size will be reclaimed then. But I fail to see why we should
prefer larger hierarchies when the resulting memcg victim is much
smaller in the end.

> > I would really start simple and only allow kill-all on leaf memcgs and
> > only compare leaf memcgs & root. If we ever need to kill whole
> > hierarchies then allow kill-all on intermediate memcgs as well and then
> > consider cumulative consumptions only on those that have kill-all
> > enabled.
> 
> This sounds hacky to me: the whole thing is depending on cgroup v2 and
> is additionally explicitly opt-in.
> 
> Why do we need to introduce such incomplete functionality first,
> and then suffer trying to extend it and provide backward compatibility?

Why would backward compatibility be a problem? kill-all on non-leaf
memcgs should be seamless. We would simply allow setting the knob. Adding
a priority shouldn't be a problem either. A new knob would be added. Any
memcg with a non-zero priority would be considered during selection for
example (cumulative size would be considered as a tie-breaker for
non-leaf memcgs and a victim selected from the largest hierarchy/leaf
memcg - but that really needs to be thought through, and we need to hear
about specific usecases).

> Also, I think we should compare root cgroup with top-level cgroups,
> rather than leaf cgroups. A process in the root cgroup is definitely
> system-level entity, and we should compare it with other top-level
> entities (other containerized workloads), rather then some random
> leaf cgroup deep inside the tree. If we decided, that we're not comparing
> random tasks from different cgroups, why should we do this for leaf
> cgroups? Is sounds like making only one step towards right direction,
> while we can do more.

The main problem I have with that is mentioned above. A single hierarchy
enforces some structural constraints when multiple controllers are in
place.
-- 
Michal Hocko
SUSE Labs

* Re: [v8 0/4] cgroup-aware OOM killer
  2017-09-15 10:58           ` Michal Hocko
@ 2017-09-15 15:23             ` Roman Gushchin
  2017-09-15 19:55               ` David Rientjes
  2017-09-18  6:14               ` Michal Hocko
  0 siblings, 2 replies; 78+ messages in thread
From: Roman Gushchin @ 2017-09-15 15:23 UTC (permalink / raw)
  To: Michal Hocko
  Cc: David Rientjes, linux-mm, Vladimir Davydov, Johannes Weiner,
	Tetsuo Handa, Andrew Morton, Tejun Heo, kernel-team, cgroups,
	linux-doc, linux-kernel

On Fri, Sep 15, 2017 at 12:58:26PM +0200, Michal Hocko wrote:
> On Thu 14-09-17 09:05:48, Roman Gushchin wrote:
> > On Thu, Sep 14, 2017 at 03:40:14PM +0200, Michal Hocko wrote:
> > > On Wed 13-09-17 14:56:07, Roman Gushchin wrote:
> > > > On Wed, Sep 13, 2017 at 02:29:14PM +0200, Michal Hocko wrote:
> > > [...]
> > > > > I strongly believe that comparing only leaf memcgs
> > > > > is more straightforward and it doesn't lead to unexpected results as
> > > > > mentioned before (kill a small memcg which is a part of the larger
> > > > > sub-hierarchy).
> > > > 
> > > > One of two main goals of this patchset is to introduce cgroup-level
> > > > fairness: bigger cgroups should be affected more than smaller,
> > > > despite the size of tasks inside. I believe the same principle
> > > > should be used for cgroups.
> > > 
> > > Yes bigger cgroups should be preferred but I fail to see why bigger
> > > hierarchies should be considered as well if they are not kill-all. And
> > > whether non-leaf memcgs should allow kill-all is not entirely clear to
> > > me. What would be the usecase?
> > 
> > We definitely want to support kill-all for non-leaf cgroups.
> > A workload can consist of several cgroups and we want to clean up
> > the whole thing on OOM.
> 
> Could you be more specific about such a workload? E.g. how can be such a
> hierarchy handled consistently when its sub-tree gets killed due to
> internal memory pressure?

Or just system-wide OOM.

> Or do you expect that none of the subtree will
> have hard limit configured?

And this can also be the case: the whole workload may have a hard limit
configured, while internal memcgs have only memory.low set for "soft"
prioritization.

> 
> But then you just enforce a structural restriction on your configuration
> because
> 	root
>         /  \
>        A    D
>       /\   
>      B  C
> 
> is a different thing than
> 	root
>         / | \
>        B  C  D
>

I actually don't have a strong argument against an approach to select
largest leaf or kill-all-set memcg. I think in practice there will not
be much difference.

The only real concern I have is that then we have to do the same with
oom_priorities (select largest priority tree-wide), and this will limit
the ability to enforce the priority by parent cgroup.

Thanks!

* Re: [v8 0/4] cgroup-aware OOM killer
  2017-09-15 15:23             ` Roman Gushchin
@ 2017-09-15 19:55               ` David Rientjes
  2017-09-15 21:08                 ` Roman Gushchin
  2017-09-18  6:16                 ` Michal Hocko
  2017-09-18  6:14               ` Michal Hocko
  1 sibling, 2 replies; 78+ messages in thread
From: David Rientjes @ 2017-09-15 19:55 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Michal Hocko, linux-mm, Vladimir Davydov, Johannes Weiner,
	Tetsuo Handa, Andrew Morton, Tejun Heo, kernel-team, cgroups,
	linux-doc, linux-kernel

On Fri, 15 Sep 2017, Roman Gushchin wrote:

> > But then you just enforce a structural restriction on your configuration
> > because
> > 	root
> >         /  \
> >        A    D
> >       /\   
> >      B  C
> > 
> > is a different thing than
> > 	root
> >         / | \
> >        B  C  D
> >
> 
> I actually don't have a strong argument against an approach to select
> largest leaf or kill-all-set memcg. I think, in practice there will be
> no much difference.
> 
> The only real concern I have is that then we have to do the same with
> oom_priorities (select largest priority tree-wide), and this will limit
> an ability to enforce the priority by parent cgroup.
> 

Yes, oom_priority cannot select the largest priority tree-wide for exactly 
that reason.  We need the ability to control from which subtree the kill 
occurs in ancestor cgroups.  If multiple jobs are allocated their own 
cgroups and they can own memory.oom_priority for their own subcontainers, 
this becomes quite powerful so they can define their own oom priorities.   
Otherwise, they can easily override the oom priorities of other cgroups.

* Re: [v8 0/4] cgroup-aware OOM killer
  2017-09-15 19:55               ` David Rientjes
@ 2017-09-15 21:08                 ` Roman Gushchin
  2017-09-18  6:20                   ` Michal Hocko
  2017-09-19 20:54                   ` David Rientjes
  2017-09-18  6:16                 ` Michal Hocko
  1 sibling, 2 replies; 78+ messages in thread
From: Roman Gushchin @ 2017-09-15 21:08 UTC (permalink / raw)
  To: David Rientjes
  Cc: Michal Hocko, linux-mm, Vladimir Davydov, Johannes Weiner,
	Tetsuo Handa, Andrew Morton, Tejun Heo, kernel-team, cgroups,
	linux-doc, linux-kernel

On Fri, Sep 15, 2017 at 12:55:55PM -0700, David Rientjes wrote:
> On Fri, 15 Sep 2017, Roman Gushchin wrote:
> 
> > > But then you just enforce a structural restriction on your configuration
> > > because
> > > 	root
> > >         /  \
> > >        A    D
> > >       /\   
> > >      B  C
> > > 
> > > is a different thing than
> > > 	root
> > >         / | \
> > >        B  C  D
> > >
> > 
> > I actually don't have a strong argument against an approach to select
> > largest leaf or kill-all-set memcg. I think, in practice there will be
> > no much difference.
> > 
> > The only real concern I have is that then we have to do the same with
> > oom_priorities (select largest priority tree-wide), and this will limit
> > an ability to enforce the priority by parent cgroup.
> > 
> 
> Yes, oom_priority cannot select the largest priority tree-wide for exactly 
> that reason.  We need the ability to control from which subtree the kill 
> occurs in ancestor cgroups.  If multiple jobs are allocated their own 
> cgroups and they can own memory.oom_priority for their own subcontainers, 
> this becomes quite powerful so they can define their own oom priorities.   
> Otherwise, they can easily override the oom priorities of other cgroups.

I believe it's a solvable problem: we can require CAP_SYS_RESOURCE to set
the oom_priority below parent's value, or something like this.

But it looks more complex, and I'm not sure there are real examples
where we have to compare memcgs which are on different levels
(or in different subtrees).

In any case, oom_priorities and size-based comparison should share the
same tree-walking policy. And I still would prefer comparing sizes and
priorities independently on each level.

Thanks!

* Re: [v8 0/4] cgroup-aware OOM killer
  2017-09-15 21:08                 ` Roman Gushchin
@ 2017-09-18  6:20                   ` Michal Hocko
  2017-09-18 15:02                     ` Roman Gushchin
  2017-09-19 20:54                   ` David Rientjes
  1 sibling, 1 reply; 78+ messages in thread
From: Michal Hocko @ 2017-09-18  6:20 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: David Rientjes, linux-mm, Vladimir Davydov, Johannes Weiner,
	Tetsuo Handa, Andrew Morton, Tejun Heo, kernel-team, cgroups,
	linux-doc, linux-kernel

On Fri 15-09-17 14:08:07, Roman Gushchin wrote:
> On Fri, Sep 15, 2017 at 12:55:55PM -0700, David Rientjes wrote:
> > On Fri, 15 Sep 2017, Roman Gushchin wrote:
> > 
> > > > But then you just enforce a structural restriction on your configuration
> > > > because
> > > > 	root
> > > >         /  \
> > > >        A    D
> > > >       /\   
> > > >      B  C
> > > > 
> > > > is a different thing than
> > > > 	root
> > > >         / | \
> > > >        B  C  D
> > > >
> > > 
> > > I actually don't have a strong argument against an approach to select
> > > largest leaf or kill-all-set memcg. I think, in practice there will be
> > > no much difference.
> > > 
> > > The only real concern I have is that then we have to do the same with
> > > oom_priorities (select largest priority tree-wide), and this will limit
> > > an ability to enforce the priority by parent cgroup.
> > > 
> > 
> > Yes, oom_priority cannot select the largest priority tree-wide for exactly 
> > that reason.  We need the ability to control from which subtree the kill 
> > occurs in ancestor cgroups.  If multiple jobs are allocated their own 
> > cgroups and they can own memory.oom_priority for their own subcontainers, 
> > this becomes quite powerful so they can define their own oom priorities.   
> > Otherwise, they can easily override the oom priorities of other cgroups.
> 
> I believe, it's a solvable problem: we can require CAP_SYS_RESOURCE to set
> the oom_priority below parent's value, or something like this.

As said in the other email, we can make priorities hierarchical (in the same
sense as hard limit or others) so that children cannot override their
parent.

> But it looks more complex, and I'm not sure there are real examples,
> when we have to compare memcgs, which are on different levels
> (or in different subtrees).

Well, I have given you one that doesn't sound completely insane to me
in the other email. You may need an intermediate level for a controller
other than memcg. The whole concept of significance of the hierarchy level
seems really odd to me. Or am I wrong here?
-- 
Michal Hocko
SUSE Labs

* Re: [v8 0/4] cgroup-aware OOM killer
  2017-09-18  6:20                   ` Michal Hocko
@ 2017-09-18 15:02                     ` Roman Gushchin
  2017-09-21  8:30                       ` David Rientjes
  0 siblings, 1 reply; 78+ messages in thread
From: Roman Gushchin @ 2017-09-18 15:02 UTC (permalink / raw)
  To: Michal Hocko
  Cc: David Rientjes, linux-mm, Vladimir Davydov, Johannes Weiner,
	Tetsuo Handa, Andrew Morton, Tejun Heo, kernel-team, cgroups,
	linux-doc, linux-kernel

On Mon, Sep 18, 2017 at 08:20:45AM +0200, Michal Hocko wrote:
> On Fri 15-09-17 14:08:07, Roman Gushchin wrote:
> > On Fri, Sep 15, 2017 at 12:55:55PM -0700, David Rientjes wrote:
> > > On Fri, 15 Sep 2017, Roman Gushchin wrote:
> > > 
> > > > > But then you just enforce a structural restriction on your configuration
> > > > > because
> > > > > 	root
> > > > >         /  \
> > > > >        A    D
> > > > >       /\   
> > > > >      B  C
> > > > > 
> > > > > is a different thing than
> > > > > 	root
> > > > >         / | \
> > > > >        B  C  D
> > > > >
> > > > 
> > > > I actually don't have a strong argument against an approach to select
> > > > largest leaf or kill-all-set memcg. I think, in practice there will be
> > > > no much difference.
> > > > 
> > > > The only real concern I have is that then we have to do the same with
> > > > oom_priorities (select largest priority tree-wide), and this will limit
> > > > an ability to enforce the priority by parent cgroup.
> > > > 
> > > 
> > > Yes, oom_priority cannot select the largest priority tree-wide for exactly 
> > > that reason.  We need the ability to control from which subtree the kill 
> > > occurs in ancestor cgroups.  If multiple jobs are allocated their own 
> > > cgroups and they can own memory.oom_priority for their own subcontainers, 
> > > this becomes quite powerful so they can define their own oom priorities.   
> > > Otherwise, they can easily override the oom priorities of other cgroups.
> > 
> > I believe, it's a solvable problem: we can require CAP_SYS_RESOURCE to set
> > the oom_priority below parent's value, or something like this.
> 
> As said in other email. We can make priorities hierarchical (in the same
> sense as hard limit or others) so that children cannot override their
> parent.

You mean they can set the knob to any value, but parent's value is enforced,
if it's greater than child's value?

If so, this sounds logical to me. Then we have size-based comparison and
priority-based comparison with similar rules, and all use cases are covered.
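
One possible reading of that rule, as a sketch only: the effective priority
of a memcg is the largest oom_priority found on the path from it up to the
root, so a child may write any value but cannot end up below its parent. The
Memcg class and effective_priority() are hypothetical names, and this
interpretation has not been confirmed in the thread.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Memcg:
    name: str
    oom_priority: int = 0               # value written to the knob
    parent: Optional["Memcg"] = None    # None for the root


def effective_priority(memcg):
    """Largest oom_priority on the path from memcg up to the root."""
    prio = memcg.oom_priority
    node = memcg.parent
    while node is not None:
        prio = max(prio, node.oom_priority)
        node = node.parent
    return prio
```

A child with oom_priority 3 under a parent with 10 would be treated as 10,
while a child with 50 under the same parent keeps 50.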

Ok, can we stick with this design?
Then I'll put oom_priorities back in place, and post a (hopefully) final version.

Thanks!

* Re: [v8 0/4] cgroup-aware OOM killer
  2017-09-18 15:02                     ` Roman Gushchin
@ 2017-09-21  8:30                       ` David Rientjes
  0 siblings, 0 replies; 78+ messages in thread
From: David Rientjes @ 2017-09-21  8:30 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Michal Hocko, linux-mm, Vladimir Davydov, Johannes Weiner,
	Tetsuo Handa, Andrew Morton, Tejun Heo, kernel-team, cgroups,
	linux-doc, linux-kernel

On Mon, 18 Sep 2017, Roman Gushchin wrote:

> > As said in other email. We can make priorities hierarchical (in the same
> > sense as hard limit or others) so that children cannot override their
> > parent.
> 
> You mean they can set the knob to any value, but parent's value is enforced,
> if it's greater than child's value?
> 
> If so, this sounds logical to me. Then we have size-based comparison and
> priority-based comparison with similar rules, and all use cases are covered.
> 
> Ok, can we stick with this design?
> Then I'll return oom_priorities in place, and post a (hopefully) final version.
> 

I just want to make sure that we are going with your original 
implementation here: that oom_priority is only effective for comparing 
sibling memory cgroups and nothing beyond that.  The value alone has no 
relationship to any ancestor.  We can't set oom_priority based on the 
priorities of any memory cgroups other than our own siblings because 
we have no control over how those change.
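
A sketch of the sibling-only comparison, under stated assumptions: a larger
oom_priority marks the preferred victim (as "select largest priority" earlier
in the thread suggests), and cumulative size breaks ties between siblings.
The class and function names are hypothetical; this is not the earlier
patchset's code.

```python
from dataclasses import dataclass, field


@dataclass
class Memcg:
    name: str
    oom_priority: int = 0
    usage: int = 0
    children: list = field(default_factory=list)

    def total(self):
        """Cumulative usage of this cgroup and its whole subtree."""
        return self.usage + sum(c.total() for c in self.children)


def select_by_sibling_priority(root):
    """At each level compare siblings only: priority first, then size."""
    node = root
    while node.children:
        node = max(node.children,
                   key=lambda c: (c.oom_priority, c.total()))
    return node
```

Here a priority of 100 set deep inside one job has no effect on whether that
job is chosen over its own siblings; it matters only once the walk has
already entered the job.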

* Re: [v8 0/4] cgroup-aware OOM killer
  2017-09-15 21:08                 ` Roman Gushchin
  2017-09-18  6:20                   ` Michal Hocko
@ 2017-09-19 20:54                   ` David Rientjes
  2017-09-20 22:24                     ` Roman Gushchin
  1 sibling, 1 reply; 78+ messages in thread
From: David Rientjes @ 2017-09-19 20:54 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Michal Hocko, linux-mm, Vladimir Davydov, Johannes Weiner,
	Tetsuo Handa, Andrew Morton, Tejun Heo, kernel-team, cgroups,
	linux-doc, linux-kernel

On Fri, 15 Sep 2017, Roman Gushchin wrote:

> > > > But then you just enforce a structural restriction on your configuration
> > > > because
> > > > 	root
> > > >         /  \
> > > >        A    D
> > > >       /\   
> > > >      B  C
> > > > 
> > > > is a different thing than
> > > > 	root
> > > >         / | \
> > > >        B  C  D
> > > >
> > > 
> > > I actually don't have a strong argument against an approach to select
> > > largest leaf or kill-all-set memcg. I think, in practice there will be
> > > no much difference.
> > > 
> > > The only real concern I have is that then we have to do the same with
> > > oom_priorities (select largest priority tree-wide), and this will limit
> > > an ability to enforce the priority by parent cgroup.
> > > 
> > 
> > Yes, oom_priority cannot select the largest priority tree-wide for exactly 
> > that reason.  We need the ability to control from which subtree the kill 
> > occurs in ancestor cgroups.  If multiple jobs are allocated their own 
> > cgroups and they can own memory.oom_priority for their own subcontainers, 
> > this becomes quite powerful so they can define their own oom priorities.   
> > Otherwise, they can easily override the oom priorities of other cgroups.
> 
> I believe, it's a solvable problem: we can require CAP_SYS_RESOURCE to set
> the oom_priority below parent's value, or something like this.
> 
> But it looks more complex, and I'm not sure there are real examples,
> when we have to compare memcgs, which are on different levels
> (or in different subtrees).
> 

It's actually much more complex because in our environment we'd need an 
"activity manager" with CAP_SYS_RESOURCE to control oom priorities of user 
subcontainers when today it need only be concerned with top-level memory 
cgroups.  Users can create their own hierarchies with their own oom
priorities at will; doing so doesn't alter the selection heuristic for any
other user running on the same system, and it gives them full control over
the selection in their own subtree.  We shouldn't need a system-wide
daemon with CAP_SYS_RESOURCE to manage subcontainers when
nothing else requires it.  I believe it's also much easier to document:
oom_priority is considered for all sibling cgroups at each level of the
hierarchy and the cgroup with the lowest priority value gets iterated.

* Re: [v8 0/4] cgroup-aware OOM killer
  2017-09-19 20:54                   ` David Rientjes
@ 2017-09-20 22:24                     ` Roman Gushchin
  2017-09-21  8:27                       ` David Rientjes
  0 siblings, 1 reply; 78+ messages in thread
From: Roman Gushchin @ 2017-09-20 22:24 UTC (permalink / raw)
  To: David Rientjes
  Cc: Michal Hocko, linux-mm, Vladimir Davydov, Johannes Weiner,
	Tetsuo Handa, Andrew Morton, Tejun Heo, kernel-team, cgroups,
	linux-doc, linux-kernel

On Tue, Sep 19, 2017 at 01:54:48PM -0700, David Rientjes wrote:
> On Fri, 15 Sep 2017, Roman Gushchin wrote:
> 
> > > > > But then you just enforce a structural restriction on your configuration
> > > > > because
> > > > > 	root
> > > > >         /  \
> > > > >        A    D
> > > > >       /\   
> > > > >      B  C
> > > > > 
> > > > > is a different thing than
> > > > > 	root
> > > > >         / | \
> > > > >        B  C  D
> > > > >
> > > > 
> > > > I actually don't have a strong argument against an approach to select
> > > > largest leaf or kill-all-set memcg. I think, in practice there will be
> > > > no much difference.
> > > > 
> > > > The only real concern I have is that then we have to do the same with
> > > > oom_priorities (select largest priority tree-wide), and this will limit
> > > > an ability to enforce the priority by parent cgroup.
> > > > 
> > > 
> > > Yes, oom_priority cannot select the largest priority tree-wide for exactly 
> > > that reason.  We need the ability to control from which subtree the kill 
> > > occurs in ancestor cgroups.  If multiple jobs are allocated their own 
> > > cgroups and they can own memory.oom_priority for their own subcontainers, 
> > > this becomes quite powerful so they can define their own oom priorities.   
> > > Otherwise, they can easily override the oom priorities of other cgroups.
> > 
> > I believe, it's a solvable problem: we can require CAP_SYS_RESOURCE to set
> > the oom_priority below parent's value, or something like this.
> > 
> > But it looks more complex, and I'm not sure there are real examples,
> > when we have to compare memcgs, which are on different levels
> > (or in different subtrees).
> > 
> 
> It's actually much more complex because in our environment we'd need an 
> "activity manager" with CAP_SYS_RESOURCE to control oom priorities of user 
> subcontainers when today it need only be concerned with top-level memory 
> cgroups.  Users can create their own hierarchies with their own oom 
> priorities at will, it doesn't alter the selection heuristic for another 
> other user running on the same system and gives them full control over the 
> selection in their own subtree.  We shouldn't need to have a system-wide 
> daemon with CAP_SYS_RESOURCE be required to manage subcontainers when 
> nothing else requires it.  I believe it's also much easier to document: 
> oom_priority is considered for all sibling cgroups at each level of the 
> hierarchy and the cgroup with the lowest priority value gets iterated.

I do agree actually. System-wide OOM priorities make no sense.

Always comparing sibling cgroups, either by priority or by size, seems to be
simple, clear and powerful enough for all reasonable use cases. Am I right
that this is exactly what you've used internally? That is a perfect confirmation,
I believe.

Thanks!

* Re: [v8 0/4] cgroup-aware OOM killer
  2017-09-20 22:24                     ` Roman Gushchin
@ 2017-09-21  8:27                       ` David Rientjes
  0 siblings, 0 replies; 78+ messages in thread
From: David Rientjes @ 2017-09-21  8:27 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Michal Hocko, linux-mm, Vladimir Davydov, Johannes Weiner,
	Tetsuo Handa, Andrew Morton, Tejun Heo, kernel-team, cgroups,
	linux-doc, linux-kernel

On Wed, 20 Sep 2017, Roman Gushchin wrote:

> > It's actually much more complex because in our environment we'd need an 
> > "activity manager" with CAP_SYS_RESOURCE to control oom priorities of user 
> > subcontainers when today it need only be concerned with top-level memory 
> > cgroups.  Users can create their own hierarchies with their own oom 
> > priorities at will, it doesn't alter the selection heuristic for another 
> > other user running on the same system and gives them full control over the 
> > selection in their own subtree.  We shouldn't need to have a system-wide 
> > daemon with CAP_SYS_RESOURCE be required to manage subcontainers when 
> > nothing else requires it.  I believe it's also much easier to document: 
> > oom_priority is considered for all sibling cgroups at each level of the 
> > hierarchy and the cgroup with the lowest priority value gets iterated.
> 
> I do agree actually. System-wide OOM priorities make no sense.
> 
> Always compare sibling cgroups, either by priority or size, seems to be
> simple, clear and powerful enough for all reasonable use cases. Am I right,
> that it's exactly what you've used internally? This is a perfect confirmation,
> I believe.
> 

We've used it for at least four years, I added my Tested-by to your patch, 
we would convert to your implementation if it is merged upstream, and I 
would enthusiastically support your patch if you would integrate it back 
into your series.

* Re: [v8 0/4] cgroup-aware OOM killer
  2017-09-15 19:55               ` David Rientjes
  2017-09-15 21:08                 ` Roman Gushchin
@ 2017-09-18  6:16                 ` Michal Hocko
  2017-09-19 20:51                   ` David Rientjes
  1 sibling, 1 reply; 78+ messages in thread
From: Michal Hocko @ 2017-09-18  6:16 UTC (permalink / raw)
  To: David Rientjes
  Cc: Roman Gushchin, linux-mm, Vladimir Davydov, Johannes Weiner,
	Tetsuo Handa, Andrew Morton, Tejun Heo, kernel-team, cgroups,
	linux-doc, linux-kernel

On Fri 15-09-17 12:55:55, David Rientjes wrote:
> On Fri, 15 Sep 2017, Roman Gushchin wrote:
> 
> > > But then you just enforce a structural restriction on your configuration
> > > because
> > > 	root
> > >         /  \
> > >        A    D
> > >       /\   
> > >      B  C
> > > 
> > > is a different thing than
> > > 	root
> > >         / | \
> > >        B  C  D
> > >
> > 
> > I actually don't have a strong argument against an approach to select
> > largest leaf or kill-all-set memcg. I think, in practice there will be
> > no much difference.
> > 
> > The only real concern I have is that then we have to do the same with
> > oom_priorities (select largest priority tree-wide), and this will limit
> > an ability to enforce the priority by parent cgroup.
> > 
> 
> Yes, oom_priority cannot select the largest priority tree-wide for exactly 
> that reason.  We need the ability to control from which subtree the kill 
> occurs in ancestor cgroups.  If multiple jobs are allocated their own 
> cgroups and they can own memory.oom_priority for their own subcontainers, 
> this becomes quite powerful so they can define their own oom priorities.   
> Otherwise, they can easily override the oom priorities of other cgroups.

Could you be more specific about your use case? What would be the
problem if we only allowed the priority to increase in children (like other
hierarchical controls)?

-- 
Michal Hocko
SUSE Labs

* Re: [v8 0/4] cgroup-aware OOM killer
  2017-09-18  6:16                 ` Michal Hocko
@ 2017-09-19 20:51                   ` David Rientjes
  0 siblings, 0 replies; 78+ messages in thread
From: David Rientjes @ 2017-09-19 20:51 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Roman Gushchin, linux-mm, Vladimir Davydov, Johannes Weiner,
	Tetsuo Handa, Andrew Morton, Tejun Heo, kernel-team, cgroups,
	linux-doc, linux-kernel

On Mon, 18 Sep 2017, Michal Hocko wrote:

> > > > But then you just enforce a structural restriction on your configuration
> > > > because
> > > > 	root
> > > >         /  \
> > > >        A    D
> > > >       /\   
> > > >      B  C
> > > > 
> > > > is a different thing than
> > > > 	root
> > > >         / | \
> > > >        B  C  D
> > > >
> > > 
> > > I actually don't have a strong argument against an approach to select
> > > largest leaf or kill-all-set memcg. I think, in practice there will be
> > > no much difference.
> > > 
> > > The only real concern I have is that then we have to do the same with
> > > oom_priorities (select largest priority tree-wide), and this will limit
> > > an ability to enforce the priority by parent cgroup.
> > > 
> > 
> > Yes, oom_priority cannot select the largest priority tree-wide for exactly 
> > that reason.  We need the ability to control from which subtree the kill 
> > occurs in ancestor cgroups.  If multiple jobs are allocated their own 
> > cgroups and they can own memory.oom_priority for their own subcontainers, 
> > this becomes quite powerful so they can define their own oom priorities.   
> > Otherwise, they can easily override the oom priorities of other cgroups.
> 
> Could you be more speicific about your usecase? What would be a
> problem If we allow to only increase priority in children (like other
> hierarchical controls).
> 

For memcg constrained oom conditions, there is only a theoretical issue if 
the subtree is not under the control of a single user and various users 
can alter their priorities without knowledge of the priorities of other 
children in the same subtree that is oom, or those values change without 
knowledge of a child.  I don't know of anybody that configures memory 
cgroup hierarchies that way, though.

The problem is more obvious in system oom conditions.  If we have two 
top-level memory cgroups with the same "job" priority, they get the same 
oom priority.  The user who configures subcontainers is now always 
targeted for oom kill in an "increase priority in children" policy.

The hierarchy becomes this:

	root
       /    \
      A      D
     / \   / | \
    B   C E  F  G

where A/memory.oom_priority == D/memory.oom_priority.

D wants to kill in order of E -> F -> G, but can't configure that if
B = A - 1 and C = B - 1.  It also shouldn't need to adjust its own oom 
priorities based on a hierarchy outside its control and which can change 
at any time at the discretion of the user (with namespaces you may not 
even be able to access it).

But also if A/memory.oom_priority = D/memory.oom_priority - 100, A is 
preferred unless its subcontainers configure themselves in a way where 
they have higher oom priority values than E, F, and G.  That may yield 
very different results when additional jobs get scheduled on the system 
(an H tree) where the user has full control over their own oom
priorities, even when the value must only increase.
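To make the sibling-only rule concrete, here is a minimal Python sketch (not
kernel code) using the hierarchy above. The Cgroup class, sibling_order(),
select_victim() and build() are hypothetical names, the usage numbers are
made up, and breaking priority ties by subtree size is an assumption that the
thread does not settle. It only illustrates that D's E -> F -> G ordering
stays the same whatever values B and C are given.

```python
from dataclasses import dataclass, field


@dataclass
class Cgroup:
    name: str
    oom_priority: int = 0   # hypothetical knob; the lowest value is visited first
    usage: int = 0          # memory charged directly to this group (arbitrary units)
    children: list = field(default_factory=list)

    def total_usage(self):
        # A group's size is its own charge plus that of its whole subtree.
        return self.usage + sum(c.total_usage() for c in self.children)


def sibling_order(parent):
    # Only the children of one parent are compared with each other.
    # Assumption: equal priorities fall back to the larger subtree.
    return sorted(parent.children,
                  key=lambda c: (c.oom_priority, -c.total_usage()))


def select_victim(root):
    # Walk down one level at a time, never looking outside the chosen subtree.
    node = root
    while node.children:
        node = sibling_order(node)[0]
    return node


def build(b_prio, c_prio):
    # A and D share the same priority; D orders its children E < F < G.
    a = Cgroup("A", 10, children=[Cgroup("B", b_prio, 10),
                                  Cgroup("C", c_prio, 10)])
    d = Cgroup("D", 10, children=[Cgroup("E", 1, 5),
                                  Cgroup("F", 2, 5),
                                  Cgroup("G", 3, 5)])
    return Cgroup("root", children=[a, d]), d


root, d = build(b_prio=9, c_prio=8)        # B = A - 1, C = B - 1
order_before = [c.name for c in sibling_order(d)]

root2, d2 = build(b_prio=-50, c_prio=70)   # A's owner changes its values
order_after = [c.name for c in sibling_order(d2)]

print(order_before, order_after, select_victim(root).name)
```

With these made-up sizes the A/D tie goes to A because its subtree is larger,
so the walk ends inside A; the point is that nothing A's owner does can
reorder D's children.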

* Re: [v8 0/4] cgroup-aware OOM killer
  2017-09-15 15:23             ` Roman Gushchin
  2017-09-15 19:55               ` David Rientjes
@ 2017-09-18  6:14               ` Michal Hocko
  2017-09-20 21:53                 ` Roman Gushchin
  1 sibling, 1 reply; 78+ messages in thread
From: Michal Hocko @ 2017-09-18  6:14 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: David Rientjes, linux-mm, Vladimir Davydov, Johannes Weiner,
	Tetsuo Handa, Andrew Morton, Tejun Heo, kernel-team, cgroups,
	linux-doc, linux-kernel

On Fri 15-09-17 08:23:01, Roman Gushchin wrote:
> On Fri, Sep 15, 2017 at 12:58:26PM +0200, Michal Hocko wrote:
> > On Thu 14-09-17 09:05:48, Roman Gushchin wrote:
> > > On Thu, Sep 14, 2017 at 03:40:14PM +0200, Michal Hocko wrote:
> > > > On Wed 13-09-17 14:56:07, Roman Gushchin wrote:
> > > > > On Wed, Sep 13, 2017 at 02:29:14PM +0200, Michal Hocko wrote:
> > > > [...]
> > > > > > I strongly believe that comparing only leaf memcgs
> > > > > > is more straightforward and it doesn't lead to unexpected results as
> > > > > > mentioned before (kill a small memcg which is a part of the larger
> > > > > > sub-hierarchy).
> > > > > 
> > > > > One of two main goals of this patchset is to introduce cgroup-level
> > > > > fairness: bigger cgroups should be affected more than smaller,
> > > > > despite the size of tasks inside. I believe the same principle
> > > > > should be used for cgroups.
> > > > 
> > > > Yes bigger cgroups should be preferred but I fail to see why bigger
> > > > hierarchies should be considered as well if they are not kill-all. And
> > > > whether non-leaf memcgs should allow kill-all is not entirely clear to
> > > > me. What would be the usecase?
> > > 
> > > We definitely want to support kill-all for non-leaf cgroups.
> > > A workload can consist of several cgroups and we want to clean up
> > > the whole thing on OOM.
> > 
> > Could you be more specific about such a workload? E.g. how can be such a
> > hierarchy handled consistently when its sub-tree gets killed due to
> > internal memory pressure?
> 
> Or just system-wide OOM.
> 
> > Or do you expect that none of the subtree will
> > have hard limit configured?
> 
> And this can also be a case: the whole workload may have hard limit
> configured, while internal memcgs have only memory.low set for "soft"
> prioritization.
> 
> > 
> > But then you just enforce a structural restriction on your configuration
> > because
> > 	root
> >         /  \
> >        A    D
> >       /\   
> >      B  C
> > 
> > is a different thing than
> > 	root
> >         / | \
> >        B  C  D
> >
> 
> I actually don't have a strong argument against an approach to select
> largest leaf or kill-all-set memcg. I think, in practice there will be
> no much difference.

Well, I am worried that the difference will come as a surprise when a
deeper hierarchy is needed for structural reasons.

-- 
Michal Hocko
SUSE Labs

* Re: [v8 0/4] cgroup-aware OOM killer
  2017-09-18  6:14               ` Michal Hocko
@ 2017-09-20 21:53                 ` Roman Gushchin
  2017-09-25 12:24                   ` Michal Hocko
  0 siblings, 1 reply; 78+ messages in thread
From: Roman Gushchin @ 2017-09-20 21:53 UTC (permalink / raw)
  To: Michal Hocko
  Cc: David Rientjes, linux-mm, Vladimir Davydov, Johannes Weiner,
	Tetsuo Handa, Andrew Morton, Tejun Heo, kernel-team, cgroups,
	linux-doc, linux-kernel

On Mon, Sep 18, 2017 at 08:14:05AM +0200, Michal Hocko wrote:
> On Fri 15-09-17 08:23:01, Roman Gushchin wrote:
> > On Fri, Sep 15, 2017 at 12:58:26PM +0200, Michal Hocko wrote:
> > > On Thu 14-09-17 09:05:48, Roman Gushchin wrote:
> > > > On Thu, Sep 14, 2017 at 03:40:14PM +0200, Michal Hocko wrote:
> > > > > On Wed 13-09-17 14:56:07, Roman Gushchin wrote:
> > > > > > On Wed, Sep 13, 2017 at 02:29:14PM +0200, Michal Hocko wrote:
> > > > > [...]
> > > > > > > I strongly believe that comparing only leaf memcgs
> > > > > > > is more straightforward and it doesn't lead to unexpected results as
> > > > > > > mentioned before (kill a small memcg which is a part of the larger
> > > > > > > sub-hierarchy).
> > > > > > 
> > > > > > One of two main goals of this patchset is to introduce cgroup-level
> > > > > > fairness: bigger cgroups should be affected more than smaller,
> > > > > > despite the size of tasks inside. I believe the same principle
> > > > > > should be used for cgroups.
> > > > > 
> > > > > Yes bigger cgroups should be preferred but I fail to see why bigger
> > > > > hierarchies should be considered as well if they are not kill-all. And
> > > > > whether non-leaf memcgs should allow kill-all is not entirely clear to
> > > > > me. What would be the usecase?
> > > > 
> > > > We definitely want to support kill-all for non-leaf cgroups.
> > > > A workload can consist of several cgroups and we want to clean up
> > > > the whole thing on OOM.
> > > 
> > > Could you be more specific about such a workload? E.g. how can be such a
> > > hierarchy handled consistently when its sub-tree gets killed due to
> > > internal memory pressure?
> > 
> > Or just system-wide OOM.
> > 
> > > Or do you expect that none of the subtree will
> > > have hard limit configured?
> > 
> > And this can also be a case: the whole workload may have hard limit
> > configured, while internal memcgs have only memory.low set for "soft"
> > prioritization.
> > 
> > > 
> > > But then you just enforce a structural restriction on your configuration
> > > because
> > > 	root
> > >         /  \
> > >        A    D
> > >       /\   
> > >      B  C
> > > 
> > > is a different thing than
> > > 	root
> > >         / | \
> > >        B  C  D
> > >
> > 
> > I actually don't have a strong argument against an approach to select
> > largest leaf or kill-all-set memcg. I think, in practice there will be
> > no much difference.

I've tried to implement this approach, and it's really arguable.
Although your example looks reasonable, the opposite example is also valid:
you might want to compare whole hierarchies, and that's quite a typical use case.

Assume you have several containerized workloads on a machine (probably
each will be contained in a memcg with memory.max set), with some hierarchy
of cgroups inside. Then in case of a global memory shortage we want to reclaim
some memory from the biggest workload, and the selection should not depend
on oom_group settings. It would be really strange if setting oom_group
raised the chances of being killed.

In other words, let's imagine processes as leaf nodes in the memcg tree. We decide
to select the biggest memcg and kill one or more processes inside (depending
on the oom_group setting), but the memcg selection doesn't depend on it.
We do not compare processes from different cgroups, nor cgroups with
processes. The same should apply to cgroups: why would we want to compare cgroups
from different sub-trees?

While size-based comparison can be implemented with this approach,
the priority-based one is really weird (as David mentioned).
If priorities have no hierarchical meaning at all, we lack the very important
ability to enforce oom_priority over a hierarchy. Otherwise we have to invent some
complex rules of oom_priority propagation (e.g. if someone raises
the oom_priority in a parent, should it be applied to children immediately, etc).

The meaning of the oom_group knob also becomes more complex. It affects both
the victim selection and the OOM action. _ANY_ mechanism which allows affecting
OOM victim selection (either priorities or a bpf-based approach) should
not have global system-wide meaning; that breaks everything.

I do understand your point, but the same is true for other stuff, right?
E.g. cpu time distribution (and io, etc) depends on hierarchy configuration.
It's a limitation, but it's ok, as the user should create a hierarchy which
reflects some logical relations between processes and groups of processes.
Otherwise we're heading into configuration hell.

In any case, OOM is a last resort mechanism. The goal is to reclaim some memory
without crashing the system or leaving it in a totally broken state.
Any really complex memory management in userspace should be applied _before_ OOM happens.
So I don't think we have to support all possible configurations here,
if we're able to achieve the main goal (kill some processes and do not leave
broken systems/containers).

* Re: [v8 0/4] cgroup-aware OOM killer
  2017-09-20 21:53                 ` Roman Gushchin
@ 2017-09-25 12:24                   ` Michal Hocko
  2017-09-25 17:00                     ` Johannes Weiner
  0 siblings, 1 reply; 78+ messages in thread
From: Michal Hocko @ 2017-09-25 12:24 UTC (permalink / raw)
  To: Roman Gushchin, Johannes Weiner, Tejun Heo, kernel-team
  Cc: David Rientjes, linux-mm, Vladimir Davydov, Tetsuo Handa,
	Andrew Morton, cgroups, linux-doc, linux-kernel

I would really appreciate some feedback from Tejun, Johannes here.

On Wed 20-09-17 14:53:41, Roman Gushchin wrote:
> On Mon, Sep 18, 2017 at 08:14:05AM +0200, Michal Hocko wrote:
> > On Fri 15-09-17 08:23:01, Roman Gushchin wrote:
> > > On Fri, Sep 15, 2017 at 12:58:26PM +0200, Michal Hocko wrote:
[...]
> > > > But then you just enforce a structural restriction on your configuration
> > > > because
> > > > 	root
> > > >         /  \
> > > >        A    D
> > > >       /\   
> > > >      B  C
> > > > 
> > > > is a different thing than
> > > > 	root
> > > >         / | \
> > > >        B  C  D
> > > >
> > > 
> > > I actually don't have a strong argument against an approach to select
> > > largest leaf or kill-all-set memcg. I think, in practice there will be
> > > no much difference.
> 
> I've tried to implement this approach, and it's really arguable.
> Although your example looks reasonable, the opposite example is also valid:
> you might want to compare whole hierarchies, and it's a quite typical usecase.
> 
> Assume, you have several containerized workloads on a machine (probably,
> each will be contained in a memcg with memory.max set), with some hierarchy
> of cgroups inside. Then in case of global memory shortage we want to reclaim
> some memory from the biggest workload, and the selection should not depend
> on group_oom settings. It would be really strange, if setting group_oom will
> higher the chances to be killed.
> 
> In other words, let's imagine processes as leaf nodes in memcg tree. We decided
> to select the biggest memcg and kill one or more processes inside (depending
> on group_oom setting), but the memcg selection doesn't depend on it.
> We do not compare processes from different cgroups, as well as cgroups with
> processes. The same should apply to cgroups: why do we want to compare cgroups
> from different sub-trees?
> 
> While size-based comparison can be implemented with this approach,
> the priority-based is really weird (as David mentioned).
> If priorities have no hierarchical meaning at all, we lack the very important
> ability to enforce hierarchy oom_priority. Otherwise we have to invent some
> complex rules of oom_priority propagation (e.g. is someone is raising
> the oom_priority in parent, should it be applied to children immediately, etc).

I would really forget about the priority at this stage. This needs
much more thought, and I consider David's use case too
specialized to use as a template for general-purpose oom
prioritization. I might be wrong here of course...

> The oom_group knob meaning also becoms more complex. It affects both
> the victim selection and OOM action. _ANY_ mechanism which allows to affect
> OOM victim selection (either priorities, either bpf-based approach) should
> not have global system-wide meaning, it breaks everything.
> 
> I do understand your point, but the same is true for other stuff, right?
> E.g. cpu time distribution (and io, etc) depends on hierarchy configuration.
> It's a limitation, but it's ok, as user should create a hierarchy which
> reflects some logical relations between processes and groups of processes.
> Otherwise we're going to the configuration hell.

And that is _exactly_ my concern. We surely do not want to tell people that
they have to consider their cgroup tree structure to control the global
oom behavior. You simply do not have that constraint with leaf-only
semantics, and if kill-all intermediate nodes are used then there is an
explicit opt-in for the hierarchy considerations.

> In any case, OOM is a last resort mechanism. The goal is to reclaim some memory
> and do not crash the system or do not leave it in totally broken state.
> Any really complex mm in userspace should be applied _before_ OOM happens.
> So, I don't think we have to support all possible configurations here,
> if we're able to achieve the main goal (kill some processes and do not leave
> broken systems/containers).

True, but we want the semantics to be reasonably understandable. And it
is quite hard to explain that the oom killer hasn't selected the largest
memcg just because it happened to be in a deeper hierarchy which has
been configured to cover a different resource.

I am sorry to repeat myself, and I will not argue if there is a
prevailing agreement that level-by-level comparison is considered
desirable and documented behavior, but please do not define these
semantics based on priority requirements and/or implementation details.
-- 
Michal Hocko
SUSE Labs

* Re: [v8 0/4] cgroup-aware OOM killer
  2017-09-25 12:24                   ` Michal Hocko
@ 2017-09-25 17:00                     ` Johannes Weiner
  2017-09-25 18:15                       ` Roman Gushchin
  2017-09-25 22:21                       ` David Rientjes
  0 siblings, 2 replies; 78+ messages in thread
From: Johannes Weiner @ 2017-09-25 17:00 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Roman Gushchin, Tejun Heo, kernel-team, David Rientjes, linux-mm,
	Vladimir Davydov, Tetsuo Handa, Andrew Morton, cgroups,
	linux-doc, linux-kernel

On Mon, Sep 25, 2017 at 02:24:00PM +0200, Michal Hocko wrote:
> I would really appreciate some feedback from Tejun, Johannes here.
> 
> On Wed 20-09-17 14:53:41, Roman Gushchin wrote:
> > On Mon, Sep 18, 2017 at 08:14:05AM +0200, Michal Hocko wrote:
> > > On Fri 15-09-17 08:23:01, Roman Gushchin wrote:
> > > > On Fri, Sep 15, 2017 at 12:58:26PM +0200, Michal Hocko wrote:
> [...]
> > > > > But then you just enforce a structural restriction on your configuration
> > > > > because
> > > > > 	root
> > > > >         /  \
> > > > >        A    D
> > > > >       /\   
> > > > >      B  C
> > > > > 
> > > > > is a different thing than
> > > > > 	root
> > > > >         / | \
> > > > >        B  C  D
> > > > >
> > > > 
> > > > I actually don't have a strong argument against an approach to select
> > > > largest leaf or kill-all-set memcg. I think, in practice there will be
> > > > no much difference.
> > 
> > I've tried to implement this approach, and it's really arguable.
> > Although your example looks reasonable, the opposite example is also valid:
> > you might want to compare whole hierarchies, and it's a quite typical usecase.
> > 
> > Assume, you have several containerized workloads on a machine (probably,
> > each will be contained in a memcg with memory.max set), with some hierarchy
> > of cgroups inside. Then in case of global memory shortage we want to reclaim
> > some memory from the biggest workload, and the selection should not depend
> > on group_oom settings. It would be really strange, if setting group_oom will
> > higher the chances to be killed.
> > 
> > In other words, let's imagine processes as leaf nodes in memcg tree. We decided
> > to select the biggest memcg and kill one or more processes inside (depending
> > on group_oom setting), but the memcg selection doesn't depend on it.
> > We do not compare processes from different cgroups, as well as cgroups with
> > processes. The same should apply to cgroups: why do we want to compare cgroups
> > from different sub-trees?
> > 
> > While size-based comparison can be implemented with this approach,
> > the priority-based is really weird (as David mentioned).
> > If priorities have no hierarchical meaning at all, we lack the very important
> > ability to enforce hierarchy oom_priority. Otherwise we have to invent some
> > complex rules of oom_priority propagation (e.g. is someone is raising
> > the oom_priority in parent, should it be applied to children immediately, etc).
> 
> I would really forget about the priority at this stage. This needs
> really much more thinking and I consider the David's usecase very
> specialized to use it as a template for a general purpose oom
> prioritization. I might be wrong here of course...

No, I agree.

> > In any case, OOM is a last resort mechanism. The goal is to reclaim some memory
> > and do not crash the system or do not leave it in totally broken state.
> > Any really complex mm in userspace should be applied _before_ OOM happens.
> > So, I don't think we have to support all possible configurations here,
> > if we're able to achieve the main goal (kill some processes and do not leave
> > broken systems/containers).
> 
> True but we want to have the semantic reasonably understandable. And it
> is quite hard to explain that the oom killer hasn't selected the largest
> memcg just because it happened to be in a deeper hierarchy which has
> been configured to cover a different resource.

Going back to Michal's example, say the user configured the following:

       root
      /    \
     A      D
    / \
   B   C

A global OOM event happens and we find this:
- A > D
- B, C, D are oomgroups

What the user is telling us is that B, C, and D are compound memory
consumers. They cannot be divided into their task parts from a memory
point of view.

However, the user doesn't say the same for A: the A subtree summarizes
and controls aggregate consumption of B and C, but without groupoom
set on A, the user says that A is in fact divisible into independent
memory consumers B and C.

If we don't have to kill all of A, but we'd have to kill all of D,
does it make sense to compare the two?

Let's consider an extreme case of this conundrum:

	root
      /     \
     A       B
    /|\      |
 A1-A1000    B1

Again we find:
- A > B
- A1 to A1000 and B1 are oomgroups
But:
- A1 to A1000 individually are tiny, B1 is huge

Going level by level, we'd pick A as the bigger hierarchy in the
system, and then kill off one of the tiny groups A1 to A1000.

Conversely, going for biggest consumer regardless of hierarchy, we'd
compare A1 to A1000 and B1, then pick B1 as the biggest single atomic
memory consumer in the system and kill all its tasks.
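
The difference between the two models can be sketched with a toy
simulation. This is only an illustration under stated assumptions, not
the patchset's implementation: the Memcg class, its own_pages field and
both pick_* helpers are hypothetical names, the page counts are made up,
and standalone tasks are left out for brevity.

```python
from dataclasses import dataclass, field

@dataclass
class Memcg:
    name: str
    own_pages: int = 0                  # pages charged directly to this group (made-up numbers)
    oom_group: bool = False
    children: list = field(default_factory=list)

    def total(self):
        """Cumulative footprint of this group and everything below it."""
        return self.own_pages + sum(c.total() for c in self.children)

def pick_level_by_level(root):
    """Model 1: at each level descend into the biggest child until an
    oomgroup or a leaf is reached."""
    node = root
    while node.children and not node.oom_group:
        node = max(node.children, key=Memcg.total)
    return node

def pick_biggest_consumer(root):
    """Model 2: compare all oomgroups and leaf groups with each other,
    ignoring where they sit in the hierarchy."""
    best = None
    stack = [root]
    while stack:
        node = stack.pop()
        if node.oom_group or not node.children:
            if best is None or node.total() > best.total():
                best = node
        else:
            stack.extend(node.children)
    return best

# The extreme case: A1..A1000 are tiny oomgroups, B1 is one huge oomgroup.
a = Memcg("A", children=[Memcg(f"A{i}", own_pages=10, oom_group=True)
                         for i in range(1, 1001)])
b = Memcg("B", children=[Memcg("B1", own_pages=5000, oom_group=True)])
root = Memcg("root", children=[a, b])

level_victim = pick_level_by_level(root)    # one of the tiny A groups
flat_victim = pick_biggest_consumer(root)   # B1
```

With these numbers A (10000 pages) is bigger than B (5000 pages), so the
level-by-level walk ends up in a 10-page group while the flat comparison
selects the 5000-page B1.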

Which one of these two fits both the purpose and our historic approach
to OOM killing better?

As was noted in this thread, OOM is the last resort to avoid a memory
deadlock. Killing the biggest consumer is most likely to resolve this
precarious situation. It is also most likely to catch buggy software
with memory leaks or runaway allocations, which is a nice bonus.

Killing a potentially tiny consumer inside the biggest top-level
hierarchy doesn't achieve this. I think we can all agree on this.

But also, global OOM in particular means that the hierarchical
approach to allocating the system's memory among cgroups has
failed. The user expressed control over memory in a way that wasn't
sufficient to isolate memory consumption between the different
hierarchies. IMO what follows from that is that the hierarchy itself
is a questionable guide to finding a culprit.

So I'm leaning toward the second model: compare all oomgroups and
standalone tasks in the system with each other, independent of the
failed hierarchical control structure. Then kill the biggest of them.

* Re: [v8 0/4] cgroup-aware OOM killer
  2017-09-25 17:00                     ` Johannes Weiner
@ 2017-09-25 18:15                       ` Roman Gushchin
  2017-09-25 20:25                         ` Michal Hocko
  2017-09-25 22:21                       ` David Rientjes
  1 sibling, 1 reply; 78+ messages in thread
From: Roman Gushchin @ 2017-09-25 18:15 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Michal Hocko, Tejun Heo, kernel-team, David Rientjes, linux-mm,
	Vladimir Davydov, Tetsuo Handa, Andrew Morton, cgroups,
	linux-doc, linux-kernel

On Mon, Sep 25, 2017 at 01:00:04PM -0400, Johannes Weiner wrote:
> On Mon, Sep 25, 2017 at 02:24:00PM +0200, Michal Hocko wrote:
> > I would really appreciate some feedback from Tejun, Johannes here.
> > 
> > On Wed 20-09-17 14:53:41, Roman Gushchin wrote:
> > > On Mon, Sep 18, 2017 at 08:14:05AM +0200, Michal Hocko wrote:
> > > > On Fri 15-09-17 08:23:01, Roman Gushchin wrote:
> > > > > On Fri, Sep 15, 2017 at 12:58:26PM +0200, Michal Hocko wrote:
> > [...]
> > > > > > But then you just enforce a structural restriction on your configuration
> > > > > > because
> > > > > > 	root
> > > > > >         /  \
> > > > > >        A    D
> > > > > >       /\   
> > > > > >      B  C
> > > > > > 
> > > > > > is a different thing than
> > > > > > 	root
> > > > > >         / | \
> > > > > >        B  C  D
> > > > > >
> > > > > 
> > > > > I actually don't have a strong argument against an approach to select
> > > > > largest leaf or kill-all-set memcg. I think, in practice there will be
> > > > > no much difference.
> > > 
> > > I've tried to implement this approach, and it's really arguable.
> > > Although your example looks reasonable, the opposite example is also valid:
> > > you might want to compare whole hierarchies, and it's a quite typical usecase.
> > > 
> > > Assume, you have several containerized workloads on a machine (probably,
> > > each will be contained in a memcg with memory.max set), with some hierarchy
> > > of cgroups inside. Then in case of global memory shortage we want to reclaim
> > > some memory from the biggest workload, and the selection should not depend
> > > on group_oom settings. It would be really strange, if setting group_oom will
> > > higher the chances to be killed.
> > > 
> > > In other words, let's imagine processes as leaf nodes in memcg tree. We decided
> > > to select the biggest memcg and kill one or more processes inside (depending
> > > on group_oom setting), but the memcg selection doesn't depend on it.
> > > We do not compare processes from different cgroups, as well as cgroups with
> > > processes. The same should apply to cgroups: why do we want to compare cgroups
> > > from different sub-trees?
> > > 
> > > While size-based comparison can be implemented with this approach,
> > > the priority-based is really weird (as David mentioned).
> > > If priorities have no hierarchical meaning at all, we lack the very important
> > > ability to enforce hierarchy oom_priority. Otherwise we have to invent some
> > > complex rules of oom_priority propagation (e.g. is someone is raising
> > > the oom_priority in parent, should it be applied to children immediately, etc).
> > 
> > I would really forget about the priority at this stage. This needs
> > really much more thinking and I consider the David's usecase very
> > specialized to use it as a template for a general purpose oom
> > prioritization. I might be wrong here of course...
> 
> No, I agree.
> 
> > > In any case, OOM is a last resort mechanism. The goal is to reclaim some memory
> > > and do not crash the system or do not leave it in totally broken state.
> > > Any really complex mm in userspace should be applied _before_ OOM happens.
> > > So, I don't think we have to support all possible configurations here,
> > > if we're able to achieve the main goal (kill some processes and do not leave
> > > broken systems/containers).
> > 
> > True but we want to have the semantic reasonably understandable. And it
> > is quite hard to explain that the oom killer hasn't selected the largest
> > memcg just because it happened to be in a deeper hierarchy which has
> > been configured to cover a different resource.
> 
> Going back to Michal's example, say the user configured the following:
> 
>        root
>       /    \
>      A      D
>     / \
>    B   C
> 
> A global OOM event happens and we find this:
> - A > D
> - B, C, D are oomgroups
> 
> What the user is telling us is that B, C, and D are compound memory
> consumers. They cannot be divided into their task parts from a memory
> point of view.
> 
> However, the user doesn't say the same for A: the A subtree summarizes
> and controls aggregate consumption of B and C, but without groupoom
> set on A, the user says that A is in fact divisible into independent
> memory consumers B and C.
> 
> If we don't have to kill all of A, but we'd have to kill all of D,
> does it make sense to compare the two?
> 
> Let's consider an extreme case of this conundrum:
> 
> 	root
>       /     \
>      A       B
>     /|\      |
>  A1-A1000    B1
> 
> Again we find:
> - A > B
> - A1 to A1000 and B1 are oomgroups
> But:
> - A1 to A1000 individually are tiny, B1 is huge
> 
> Going level by level, we'd pick A as the bigger hierarchy in the
> system, and then kill off one of the tiny groups A1 to A1000.
> 
> Conversely, going for biggest consumer regardless of hierarchy, we'd
> compare A1 to A1000 and B1, then pick B1 as the biggest single atomic
> memory consumer in the system and kill all its tasks.
> 
> Which one of these two fits both the purpose and our historic approach
> to OOM killing better?
> 
> As was noted in this thread, OOM is the last resort to avoid a memory
> deadlock. Killing the biggest consumer is most likely to resolve this
> precarious situation. It is also most likely to catch buggy software
> with memory leaks or runaway allocations, which is a nice bonus.
> 
> Killing a potentially tiny consumer inside the biggest top-level
> hierarchy doesn't achieve this. I think we can all agree on this.
> 
> But also, global OOM in particular means that the hierarchical
> approach to allocating the system's memory among cgroups has
> failed. The user expressed control over memory in a way that wasn't
> sufficient to isolate memory consumption between the different
> hierarchies. IMO what follows from that is that the hierarchy itself
> is a questionable guide to finding a culprit.
> 
> So I'm leaning toward the second model: compare all oomgroups and
> standalone tasks in the system with each other, independent of the
> failed hierarchical control structure. Then kill the biggest of them.

I'm not against this model, as I've said before. It feels logical,
and will work fine in most cases.

In this case we can drop any mount/boot options, because it preserves
the existing behavior in the default configuration. A big advantage.

The only thing I'm slightly concerned about is that, due to the way we
calculate the memory footprint for tasks and memory cgroups, we will have
a number of weird edge cases. For instance, putting a single process into
a group_oom memcg will alter the oom_score significantly and result in
significantly different chances of being killed. An obvious example is
a task with oom_score_adj set to any non-extreme (other than 0 and -1000)
value, but it can also happen in case of a constrained alloc, for instance.
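
A rough sketch of this edge case, with loud assumptions: task_score()
is loosely modelled on how the per-task badness adds oom_score_adj as a
share of total memory, group_score() assumes that a memcg footprint is
a plain sum of charged pages with no oom_score_adj contribution, and all
names and numbers are hypothetical. The -1000 special case is not
modelled.

```python
def task_score(footprint_pages, oom_score_adj, total_pages):
    """Per-task score: footprint plus oom_score_adj scaled to 1/1000ths of RAM."""
    adj = oom_score_adj * (total_pages // 1000)
    return max(footprint_pages + adj, 1)

def group_score(task_footprints):
    """Assumed memcg score: sum of charged pages, oom_score_adj ignored."""
    return sum(task_footprints)

TOTAL_PAGES = 1_000_000

# A small task that was tuned to be a preferred victim (oom_score_adj = 500)
tuned_standalone = task_score(50_000, 500, TOTAL_PAGES)
# A bigger task with default settings
plain_standalone = task_score(100_000, 0, TOTAL_PAGES)
# The same small task once it is the only member of a group_oom memcg
tuned_grouped = group_score([50_000])
```

Under these assumptions the tuned task scores 550000 as a standalone
task and is selected ahead of the 100000-point plain task, but only
50000 once it sits alone in a group_oom memcg, so the selection flips.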

If it is considered to be a minor issue, we can choose this approach.


Thanks!

* Re: [v8 0/4] cgroup-aware OOM killer
  2017-09-25 18:15                       ` Roman Gushchin
@ 2017-09-25 20:25                         ` Michal Hocko
  2017-09-26 10:59                           ` Roman Gushchin
  0 siblings, 1 reply; 78+ messages in thread
From: Michal Hocko @ 2017-09-25 20:25 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Johannes Weiner, Tejun Heo, kernel-team, David Rientjes,
	linux-mm, Vladimir Davydov, Tetsuo Handa, Andrew Morton, cgroups,
	linux-doc, linux-kernel

On Mon 25-09-17 19:15:33, Roman Gushchin wrote:
[...]
> I'm not against this model, as I've said before. It feels logical,
> and will work fine in most cases.
> 
> In this case we can drop any mount/boot options, because it preserves
> the existing behavior in the default configuration. A big advantage.

I am not sure about this. We still need an opt-in, regardless, because
selecting the largest process from the largest memcg != selecting the
largest task (just consider the example of memcgs with many processes).
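
A minimal sketch of the "memcgs with many processes" example, using
made-up numbers and hypothetical helper names: memcg X holds 100 small
tasks, memcg Y holds a single large one.

```python
# memcg name -> {pid: footprint in pages}; all values are invented
memcgs = {
    "X": {pid: 10 for pid in range(1, 101)},   # 100 tasks, 1000 pages in total
    "Y": {200: 500},                           # 1 task, 500 pages
}

def largest_task(memcgs):
    """Existing behavior: the biggest task in the system, cgroups ignored."""
    all_tasks = ((pid, pages)
                 for tasks in memcgs.values()
                 for pid, pages in tasks.items())
    return max(all_tasks, key=lambda t: t[1])

def largest_task_in_largest_memcg(memcgs):
    """cgroup-aware default: biggest memcg first, then its biggest task."""
    biggest = max(memcgs.values(), key=lambda tasks: sum(tasks.values()))
    pid = max(biggest, key=biggest.get)
    return pid, biggest[pid]

per_task_choice = largest_task(memcgs)
per_memcg_choice = largest_task_in_largest_memcg(memcgs)
```

Here the per-task rule picks the 500-page task in Y, while the
memcg-first rule picks a 10-page task in X, which is why the two cannot
be treated as the same default behavior.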

> The only thing, I'm slightly concerned, that due to the way how we calculate
> the memory footprint for tasks and memory cgroups, we will have a number
> of weird edge cases. For instance, when putting a single process into
> the group_oom memcg will alter the oom_score significantly and result
> in significantly different chances to be killed. An obvious example will
> be a task with oom_score_adj set to any non-extreme (other than 0 and -1000)
> value, but it can also happen in case of constrained alloc, for instance.

I am not sure I understand. Are you talking about root memcg comparing
to other memcgs?
-- 
Michal Hocko
SUSE Labs

* Re: [v8 0/4] cgroup-aware OOM killer
  2017-09-25 20:25                         ` Michal Hocko
@ 2017-09-26 10:59                           ` Roman Gushchin
  2017-09-26 11:21                             ` Michal Hocko
  0 siblings, 1 reply; 78+ messages in thread
From: Roman Gushchin @ 2017-09-26 10:59 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Johannes Weiner, Tejun Heo, kernel-team, David Rientjes,
	linux-mm, Vladimir Davydov, Tetsuo Handa, Andrew Morton, cgroups,
	linux-doc, linux-kernel

On Mon, Sep 25, 2017 at 10:25:21PM +0200, Michal Hocko wrote:
> On Mon 25-09-17 19:15:33, Roman Gushchin wrote:
> [...]
> > I'm not against this model, as I've said before. It feels logical,
> > and will work fine in most cases.
> > 
> > In this case we can drop any mount/boot options, because it preserves
> > the existing behavior in the default configuration. A big advantage.
> 
> I am not sure about this. We still need an opt-in, regardless, because
> selecting the largest process from the largest memcg != selecting the
> largest task (just consider memcgs with many processes example).

As I understand Johannes, he suggested comparing individual processes with
group_oom mem cgroups. In other words, always select the killable entity with
the biggest memory footprint.

This is slightly different from my v8 approach, where I treat leaf memcgs
as indivisible memory consumers independent of the group_oom setting, so
by default I'm selecting the biggest task in the biggest memcg.

While the approach suggested by Johannes looks clear and reasonable,
I'm slightly concerned about possible implementation issues,
which I've described below:

> 
> > The only thing, I'm slightly concerned, that due to the way how we calculate
> > the memory footprint for tasks and memory cgroups, we will have a number
> > of weird edge cases. For instance, when putting a single process into
> > the group_oom memcg will alter the oom_score significantly and result
> > in significantly different chances to be killed. An obvious example will
> > be a task with oom_score_adj set to any non-extreme (other than 0 and -1000)
> > value, but it can also happen in case of constrained alloc, for instance.
> 
> I am not sure I understand. Are you talking about root memcg comparing
> to other memcgs?

Not only, but the root memcg will be another complication in this case. We
can also use the same trick for all memcgs (define the memcg oom_score as the
maximum oom_score of the tasks belonging to it). That will turn group_oom into
a pure container cleanup solution, without changing the victim selection
algorithm.
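
A sketch of this "maximum oom_score" trick, under assumptions: the
select_victims() helper and the input layout are hypothetical, the
scores are made up, and every memcg is assumed to contain at least one
task.

```python
def select_victims(memcgs):
    """memcgs maps name -> (group_oom, {pid: oom_score}).

    The memcg score is the maximum score of its tasks, so the selected
    memcg is always the one holding the task that a plain per-task OOM
    killer would pick. group_oom only decides how much gets killed.
    """
    name = max(memcgs, key=lambda n: max(memcgs[n][1].values()))
    group_oom, tasks = memcgs[name]
    if group_oom:
        return name, sorted(tasks)              # clean up the whole container
    return name, [max(tasks, key=tasks.get)]    # kill just the biggest task

example = {
    "web":   (True,  {101: 300, 102: 50}),
    "batch": (False, {201: 200, 202: 190}),
}
with_group_oom = select_victims(example)

example["web"] = (False, example["web"][1])
without_group_oom = select_victims(example)
```

In both runs the "web" memcg is selected because it holds pid 101, the
highest-scoring task. Only the kill list changes: both tasks with
group_oom set, pid 101 alone without it.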

But, again, I'm not against the approach suggested by Johannes. I think that overall
it has the best possible semantics, if we're not taking some implementation details
into account.

* Re: [v8 0/4] cgroup-aware OOM killer
  2017-09-26 10:59                           ` Roman Gushchin
@ 2017-09-26 11:21                             ` Michal Hocko
  2017-09-26 12:13                               ` Roman Gushchin
  0 siblings, 1 reply; 78+ messages in thread
From: Michal Hocko @ 2017-09-26 11:21 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Johannes Weiner, Tejun Heo, kernel-team, David Rientjes,
	linux-mm, Vladimir Davydov, Tetsuo Handa, Andrew Morton, cgroups,
	linux-doc, linux-kernel

On Tue 26-09-17 11:59:25, Roman Gushchin wrote:
> On Mon, Sep 25, 2017 at 10:25:21PM +0200, Michal Hocko wrote:
> > On Mon 25-09-17 19:15:33, Roman Gushchin wrote:
> > [...]
> > > I'm not against this model, as I've said before. It feels logical,
> > > and will work fine in most cases.
> > > 
> > > In this case we can drop any mount/boot options, because it preserves
> > > the existing behavior in the default configuration. A big advantage.
> > 
> > I am not sure about this. We still need an opt-in, regardless, because
> > selecting the largest process from the largest memcg != selecting the
> > largest task (just consider memcgs with many processes example).
> 
> As I understand Johannes, he suggested to compare individual processes with
> group_oom mem cgroups. In other words, always select a killable entity with
> the biggest memory footprint.
> 
> This is slightly different from my v8 approach, where I treat leaf memcgs
> as indivisible memory consumers independent on group_oom setting, so
> by default I'm selecting the biggest task in the biggest memcg.

My reading is that he is actually proposing the same thing I've been
mentioning. Simply select the biggest killable entity (leaf memcg or
group_oom hierarchy) and either kill the largest task in that entity
(for !group_oom) or the whole memcg/hierarchy otherwise.
 
> While the approach suggested by Johannes looks clear and reasonable,
> I'm slightly concerned about possible implementation issues,
> which I've described below:
> 
> > 
> > > The only thing, I'm slightly concerned, that due to the way how we calculate
> > > the memory footprint for tasks and memory cgroups, we will have a number
> > > of weird edge cases. For instance, when putting a single process into
> > > the group_oom memcg will alter the oom_score significantly and result
> > > in significantly different chances to be killed. An obvious example will
> > > be a task with oom_score_adj set to any non-extreme (other than 0 and -1000)
> > > value, but it can also happen in case of constrained alloc, for instance.
> > 
> > I am not sure I understand. Are you talking about root memcg comparing
> > to other memcgs?
> 
> Not only, but root memcg in this case will be another complication. We can
> also use the same trick for all memcg (define memcg oom_score as maximum oom_score
> of the belonging tasks), it will turn group_oom into pure container cleanup
> solution, without changing victim selection algorithm

I fail to see the problem to be honest. Simply evaluate the memcg_score
you have so far with one minor detail. You only check memcgs which have
tasks (rather than checking for a leaf node) or which are group_oom. An
intermediate memcg will get a cumulative size of the whole subhierarchy,
and then you know you can skip the subtree whenever that size is not larger
than the best candidate so far, because nothing in it can be larger.
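
One possible reading of this walk, sketched under assumptions: the
Memcg class, select_entity() and victims() are hypothetical names, a
memcg footprint is simplified to the sum of its tasks' pages, and the
skip rule is interpreted as "prune a subtree whose cumulative size does
not exceed the best candidate found so far".

```python
from dataclasses import dataclass, field

@dataclass
class Memcg:
    name: str
    tasks: dict = field(default_factory=dict)      # pid -> pages (simplified)
    group_oom: bool = False
    children: list = field(default_factory=list)

    def total(self):
        """Cumulative size of this memcg and its whole subhierarchy."""
        return sum(self.tasks.values()) + sum(c.total() for c in self.children)

    def all_pids(self):
        pids = list(self.tasks)
        for child in self.children:
            pids += child.all_pids()
        return pids

def select_entity(root):
    """Pick the biggest killable entity: a memcg with tasks, or a group_oom hierarchy."""
    best, best_size = None, 0
    stack = [root]
    while stack:
        node = stack.pop()
        size = node.total()
        if size <= best_size:
            continue                    # nothing inside this subtree can be bigger
        if node.group_oom:
            best, best_size = node, size    # the whole hierarchy is one entity
            continue
        own = sum(node.tasks.values())
        if node.tasks and own > best_size:
            best, best_size = node, own
        stack.extend(node.children)
    return best

def victims(memcg):
    """Kill everything for group_oom, otherwise only the largest task."""
    if memcg.group_oom:
        return sorted(memcg.all_pids())
    return [max(memcg.tasks, key=memcg.tasks.get)]

# root -> A -> (B, C) and root -> D, with B, C, D marked group_oom
b = Memcg("B", {1: 300, 2: 100}, group_oom=True)
c = Memcg("C", {3: 350}, group_oom=True)
a = Memcg("A", children=[b, c])
d = Memcg("D", {4: 400, 5: 200}, group_oom=True)
root = Memcg("root", children=[a, d])

chosen = select_entity(root)
killed = victims(chosen)
```

In this example A (750 pages) is bigger than D (600 pages), but A is
not a killable entity by itself. D beats both B (400) and C (350), so D
is selected and, being group_oom, both of its tasks are killed.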

> But, again, I'm not against approach suggested by Johannes. I think that overall
> it's the best possible semantics, if we're not taking some implementation details
> into account.

I do not see those implementation detail issues and, let me repeat, do
not develop a semantic based on implementation details.
-- 
Michal Hocko
SUSE Labs

* Re: [v8 0/4] cgroup-aware OOM killer
  2017-09-26 11:21                             ` Michal Hocko
@ 2017-09-26 12:13                               ` Roman Gushchin
  2017-09-26 13:30                                 ` Michal Hocko
  0 siblings, 1 reply; 78+ messages in thread
From: Roman Gushchin @ 2017-09-26 12:13 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Johannes Weiner, Tejun Heo, kernel-team, David Rientjes,
	linux-mm, Vladimir Davydov, Tetsuo Handa, Andrew Morton, cgroups,
	linux-doc, linux-kernel

On Tue, Sep 26, 2017 at 01:21:34PM +0200, Michal Hocko wrote:
> On Tue 26-09-17 11:59:25, Roman Gushchin wrote:
> > On Mon, Sep 25, 2017 at 10:25:21PM +0200, Michal Hocko wrote:
> > > On Mon 25-09-17 19:15:33, Roman Gushchin wrote:
> > > [...]
> > > > I'm not against this model, as I've said before. It feels logical,
> > > > and will work fine in most cases.
> > > > 
> > > > In this case we can drop any mount/boot options, because it preserves
> > > > the existing behavior in the default configuration. A big advantage.
> > > 
> > > I am not sure about this. We still need an opt-in, regardless, because
> > > selecting the largest process from the largest memcg != selecting the
> > > largest task (just consider memcgs with many processes example).
> > 
> > As I understand Johannes, he suggested to compare individual processes with
> > group_oom mem cgroups. In other words, always select a killable entity with
> > the biggest memory footprint.
> > 
> > This is slightly different from my v8 approach, where I treat leaf memcgs
> > as indivisible memory consumers independent on group_oom setting, so
> > by default I'm selecting the biggest task in the biggest memcg.
> 
> My reading is that he is actually proposing the same thing I've been
> mentioning. Simply select the biggest killable entity (leaf memcg or
> group_oom hierarchy) and either kill the largest task in that entity
> (for !group_oom) or the whole memcg/hierarchy otherwise.

He wrote the following:
"So I'm leaning toward the second model: compare all oomgroups and
standalone tasks in the system with each other, independent of the
failed hierarchical control structure. Then kill the biggest of them."

>  
> > While the approach suggested by Johannes looks clear and reasonable,
> > I'm slightly concerned about possible implementation issues,
> > which I've described below:
> > 
> > > 
> > > > The only thing, I'm slightly concerned, that due to the way how we calculate
> > > > the memory footprint for tasks and memory cgroups, we will have a number
> > > > of weird edge cases. For instance, when putting a single process into
> > > > the group_oom memcg will alter the oom_score significantly and result
> > > > in significantly different chances to be killed. An obvious example will
> > > > be a task with oom_score_adj set to any non-extreme (other than 0 and -1000)
> > > > value, but it can also happen in case of constrained alloc, for instance.
> > > 
> > > I am not sure I understand. Are you talking about root memcg comparing
> > > to other memcgs?
> > 
> > Not only, but root memcg in this case will be another complication. We can
> > also use the same trick for all memcg (define memcg oom_score as maximum oom_score
> > of the belonging tasks), it will turn group_oom into pure container cleanup
> > solution, without changing victim selection algorithm
> 
> I fail to see the problem to be honest. Simply evaluate the memcg_score
> you have so far with one minor detail. You only check memcgs which have
> tasks (rather than check for leaf node check) or it is group_oom. An
> intermediate memcg will get a cumulative size of the whole subhierarchy
> and then you know you can skip the subtree because any subtree can be larger.
> 
> > But, again, I'm not against approach suggested by Johannes. I think that overall
> > it's the best possible semantics, if we're not taking some implementation details
> > into account.
> 
> I do not see those implementation details issues and let me repeat do
> not develop a semantic based on implementation details.

There are no problems with the "select the biggest leaf or group_oom memcg, then
kill the biggest task or all tasks depending on group_oom" approach
which you're describing. Comparing tasks and memcgs (what Johannes is suggesting)
may have some issues.

Thanks!

* Re: [v8 0/4] cgroup-aware OOM killer
  2017-09-26 12:13                               ` Roman Gushchin
@ 2017-09-26 13:30                                 ` Michal Hocko
  2017-09-26 17:26                                   ` Johannes Weiner
  0 siblings, 1 reply; 78+ messages in thread
From: Michal Hocko @ 2017-09-26 13:30 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Johannes Weiner, Tejun Heo, kernel-team, David Rientjes,
	linux-mm, Vladimir Davydov, Tetsuo Handa, Andrew Morton, cgroups,
	linux-doc, linux-kernel

On Tue 26-09-17 13:13:00, Roman Gushchin wrote:
> On Tue, Sep 26, 2017 at 01:21:34PM +0200, Michal Hocko wrote:
> > On Tue 26-09-17 11:59:25, Roman Gushchin wrote:
> > > On Mon, Sep 25, 2017 at 10:25:21PM +0200, Michal Hocko wrote:
> > > > On Mon 25-09-17 19:15:33, Roman Gushchin wrote:
> > > > [...]
> > > > > I'm not against this model, as I've said before. It feels logical,
> > > > > and will work fine in most cases.
> > > > > 
> > > > > In this case we can drop any mount/boot options, because it preserves
> > > > > the existing behavior in the default configuration. A big advantage.
> > > > 
> > > > I am not sure about this. We still need an opt-in, regardless, because
> > > > selecting the largest process from the largest memcg != selecting the
> > > > largest task (just consider memcgs with many processes example).
> > > 
> > > As I understand Johannes, he suggested to compare individual processes with
> > > group_oom mem cgroups. In other words, always select a killable entity with
> > > the biggest memory footprint.
> > > 
> > > This is slightly different from my v8 approach, where I treat leaf memcgs
> > > as indivisible memory consumers independent on group_oom setting, so
> > > by default I'm selecting the biggest task in the biggest memcg.
> > 
> > My reading is that he is actually proposing the same thing I've been
> > mentioning. Simply select the biggest killable entity (leaf memcg or
> > group_oom hierarchy) and either kill the largest task in that entity
> > (for !group_oom) or the whole memcg/hierarchy otherwise.
> 
> He wrote the following:
> "So I'm leaning toward the second model: compare all oomgroups and
> standalone tasks in the system with each other, independent of the
> failed hierarchical control structure. Then kill the biggest of them."

I will let Johannes comment, but I believe this is just a
misunderstanding. If we compared only the biggest task from each memcg
then we are basically losing our fairness objective, aren't we?
-- 
Michal Hocko
SUSE Labs

* Re: [v8 0/4] cgroup-aware OOM killer
  2017-09-26 13:30                                 ` Michal Hocko
@ 2017-09-26 17:26                                   ` Johannes Weiner
  2017-09-27  3:37                                     ` Tim Hockin
  0 siblings, 1 reply; 78+ messages in thread
From: Johannes Weiner @ 2017-09-26 17:26 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Roman Gushchin, Tejun Heo, kernel-team, David Rientjes, linux-mm,
	Vladimir Davydov, Tetsuo Handa, Andrew Morton, cgroups,
	linux-doc, linux-kernel

On Tue, Sep 26, 2017 at 03:30:40PM +0200, Michal Hocko wrote:
> On Tue 26-09-17 13:13:00, Roman Gushchin wrote:
> > On Tue, Sep 26, 2017 at 01:21:34PM +0200, Michal Hocko wrote:
> > > On Tue 26-09-17 11:59:25, Roman Gushchin wrote:
> > > > On Mon, Sep 25, 2017 at 10:25:21PM +0200, Michal Hocko wrote:
> > > > > On Mon 25-09-17 19:15:33, Roman Gushchin wrote:
> > > > > [...]
> > > > > > I'm not against this model, as I've said before. It feels logical,
> > > > > > and will work fine in most cases.
> > > > > > 
> > > > > > In this case we can drop any mount/boot options, because it preserves
> > > > > > the existing behavior in the default configuration. A big advantage.
> > > > > 
> > > > > I am not sure about this. We still need an opt-in, regardless, because
> > > > > selecting the largest process from the largest memcg != selecting the
> > > > > largest task (just consider memcgs with many processes example).
> > > > 
> > > > As I understand Johannes, he suggested to compare individual processes with
> > > > group_oom mem cgroups. In other words, always select a killable entity with
> > > > the biggest memory footprint.
> > > > 
> > > > This is slightly different from my v8 approach, where I treat leaf memcgs
> > > > as indivisible memory consumers independent on group_oom setting, so
> > > > by default I'm selecting the biggest task in the biggest memcg.
> > > 
> > > My reading is that he is actually proposing the same thing I've been
> > > mentioning. Simply select the biggest killable entity (leaf memcg or
> > > group_oom hierarchy) and either kill the largest task in that entity
> > > (for !group_oom) or the whole memcg/hierarchy otherwise.
> > 
> > He wrote the following:
> > "So I'm leaning toward the second model: compare all oomgroups and
> > standalone tasks in the system with each other, independent of the
> > failed hierarchical control structure. Then kill the biggest of them."
> 
> I will let Johannes to comment but I believe this is just a
> misunderstanding. If we compared only the biggest task from each memcg
> then we are basically losing our fairness objective, aren't we?

Sorry about the confusion.

Yeah I was making the case for what Michal proposed, to kill the
biggest terminal consumer, which is either a task or an oomgroup.

You'd basically iterate through all the tasks and cgroups in the
system and pick the biggest task that isn't in an oom group or the
biggest oom group and then kill that.

Yeah, you'd have to compare the memory footprints of tasks with the
memory footprints of cgroups. These aren't defined identically, and
tasks don't get attributed every type of allocation that a cgroup
would. But it should get us in the ballpark, and I cannot picture a
scenario where this would lead to a completely undesirable outcome.
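
A small sketch of that iteration, with assumptions spelled out: the
pick_victim() helper and the input layout are hypothetical, the numbers
are made up, and the oom group's charged pages are deliberately larger
than the sum of its tasks to stand in for allocations that are
attributed to a cgroup but not to any task.

```python
def pick_victim(tasks, oom_groups):
    """tasks: list of (pid, pages, cgroup name or None).
    oom_groups: cgroup name -> pages charged to that oom group.

    Tasks inside an oom group are not candidates on their own; the
    group competes as a single entity.
    """
    candidates = [("task", pid, pages)
                  for pid, pages, cgroup in tasks
                  if cgroup not in oom_groups]
    candidates += [("oomgroup", name, pages)
                   for name, pages in oom_groups.items()]
    return max(candidates, key=lambda c: c[2])

tasks = [
    (1, 400, None),     # standalone task
    (2, 150, "db"),     # members of the "db" oom group
    (3, 150, "db"),
]
oom_groups = {"db": 450}    # 300 pages from its tasks plus 150 pages charged only to the cgroup

victim = pick_victim(tasks, oom_groups)
```

Here the "db" group wins with 450 pages against the 400-page standalone
task, although its two tasks together account for only 300 pages. That
is the kind of mismatch between the two footprint definitions mentioned
above.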

* Re: [v8 0/4] cgroup-aware OOM killer
  2017-09-26 17:26                                   ` Johannes Weiner
@ 2017-09-27  3:37                                     ` Tim Hockin
  2017-09-27  7:43                                       ` Michal Hocko
  0 siblings, 1 reply; 78+ messages in thread
From: Tim Hockin @ 2017-09-27  3:37 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Michal Hocko, Roman Gushchin, Tejun Heo, kernel-team,
	David Rientjes, linux-mm, Vladimir Davydov, Tetsuo Handa,
	Andrew Morton, Cgroups, linux-doc, linux-kernel

I'm excited to see this being discussed again - it's been years since
the last attempt.  I've tried to stay out of the conversation, but I
feel obligated to say something and then go back to lurking.

On Tue, Sep 26, 2017 at 10:26 AM, Johannes Weiner <hannes@cmpxchg.org> wrote:
> On Tue, Sep 26, 2017 at 03:30:40PM +0200, Michal Hocko wrote:
>> On Tue 26-09-17 13:13:00, Roman Gushchin wrote:
>> > On Tue, Sep 26, 2017 at 01:21:34PM +0200, Michal Hocko wrote:
>> > > On Tue 26-09-17 11:59:25, Roman Gushchin wrote:
>> > > > On Mon, Sep 25, 2017 at 10:25:21PM +0200, Michal Hocko wrote:
>> > > > > On Mon 25-09-17 19:15:33, Roman Gushchin wrote:
>> > > > > [...]
>> > > > > > I'm not against this model, as I've said before. It feels logical,
>> > > > > > and will work fine in most cases.
>> > > > > >
>> > > > > > In this case we can drop any mount/boot options, because it preserves
>> > > > > > the existing behavior in the default configuration. A big advantage.
>> > > > >
>> > > > > I am not sure about this. We still need an opt-in, regardless, because
>> > > > > selecting the largest process from the largest memcg != selecting the
>> > > > > largest task (just consider memcgs with many processes example).
>> > > >
>> > > > As I understand Johannes, he suggested to compare individual processes with
>> > > > group_oom mem cgroups. In other words, always select a killable entity with
>> > > > the biggest memory footprint.
>> > > >
>> > > > This is slightly different from my v8 approach, where I treat leaf memcgs
>> > > > as indivisible memory consumers independent on group_oom setting, so
>> > > > by default I'm selecting the biggest task in the biggest memcg.
>> > >
>> > > My reading is that he is actually proposing the same thing I've been
>> > > mentioning. Simply select the biggest killable entity (leaf memcg or
>> > > group_oom hierarchy) and either kill the largest task in that entity
>> > > (for !group_oom) or the whole memcg/hierarchy otherwise.
>> >
>> > He wrote the following:
>> > "So I'm leaning toward the second model: compare all oomgroups and
>> > standalone tasks in the system with each other, independent of the
>> > failed hierarchical control structure. Then kill the biggest of them."
>>
>> I will let Johannes to comment but I believe this is just a
>> misunderstanding. If we compared only the biggest task from each memcg
>> then we are basically losing our fairness objective, aren't we?
>
> Sorry about the confusion.
>
> Yeah I was making the case for what Michal proposed, to kill the
> biggest terminal consumer, which is either a task or an oomgroup.
>
> You'd basically iterate through all the tasks and cgroups in the
> system and pick the biggest task that isn't in an oom group or the
> biggest oom group and then kill that.
>
> Yeah, you'd have to compare the memory footprints of tasks with the
> memory footprints of cgroups. These aren't defined identically, and
> tasks don't get attributed every type of allocation that a cgroup
> would. But it should get us in the ballpark, and I cannot picture a
> scenario where this would lead to a completely undesirable outcome.

That last sentence:

> I cannot picture a scenario where this would lead to a completely undesirable outcome.

I feel like David has offered examples here, and many of us at Google
have offered examples as long ago as 2013 (if I recall) of cases where
the proposed heuristic is EXACTLY WRONG.  We need OOM behavior to kill
in a deterministic order configured by policy.  Sometimes, I would
literally prefer to kill every other cgroup before killing "the big
one".  The policy is *all* that matters for shared clusters of varying
users and priorities.

We did this in Borg, and it works REALLY well.  Has for years.  Now
that the world is adopting Kubernetes we need it again, only it's much
harder to carry a kernel patch in this case.


* Re: [v8 0/4] cgroup-aware OOM killer
  2017-09-27  3:37                                     ` Tim Hockin
@ 2017-09-27  7:43                                       ` Michal Hocko
  2017-09-27 10:19                                         ` Roman Gushchin
  2017-09-27 15:35                                         ` Tim Hockin
  0 siblings, 2 replies; 78+ messages in thread
From: Michal Hocko @ 2017-09-27  7:43 UTC (permalink / raw)
  To: Tim Hockin
  Cc: Johannes Weiner, Roman Gushchin, Tejun Heo, kernel-team,
	David Rientjes, linux-mm, Vladimir Davydov, Tetsuo Handa,
	Andrew Morton, Cgroups, linux-doc, linux-kernel

On Tue 26-09-17 20:37:37, Tim Hockin wrote:
[...]
> I feel like David has offered examples here, and many of us at Google
> have offered examples as long ago as 2013 (if I recall) of cases where
> the proposed heuristic is EXACTLY WRONG.

I do not think we have discussed anything resembling the current
approach. And I would really appreciate some more examples where
decisions based on leaf nodes would be EXACTLY WRONG.

> We need OOM behavior to kill in a deterministic order configured by
> policy.

And nobody is objecting to this usecase. I think we can build a priority
policy on top of leaf-based decision as well. The main point we are
trying to sort out here is a reasonable semantic that would work for
most workloads. Sibling based selection will simply not work on those
that have to use deeper hierarchies for organizational purposes. I
haven't heard a counter argument for that example yet.
-- 
Michal Hocko
SUSE Labs


* Re: [v8 0/4] cgroup-aware OOM killer
  2017-09-27  7:43                                       ` Michal Hocko
@ 2017-09-27 10:19                                         ` Roman Gushchin
  2017-09-27 15:35                                         ` Tim Hockin
  1 sibling, 0 replies; 78+ messages in thread
From: Roman Gushchin @ 2017-09-27 10:19 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Tim Hockin, Johannes Weiner, Tejun Heo, kernel-team,
	David Rientjes, linux-mm, Vladimir Davydov, Tetsuo Handa,
	Andrew Morton, Cgroups, linux-doc, linux-kernel

On Wed, Sep 27, 2017 at 09:43:19AM +0200, Michal Hocko wrote:
> On Tue 26-09-17 20:37:37, Tim Hockin wrote:
> [...]
> > I feel like David has offered examples here, and many of us at Google
> > have offered examples as long ago as 2013 (if I recall) of cases where
> > the proposed heuristic is EXACTLY WRONG.
> 
> I do not think we have discussed anything resembling the current
> approach. And I would really appreciate some more examples where
> decisions based on leaf nodes would be EXACTLY WRONG.
>

I would agree here.

The two-step approach under discussion (select the biggest leaf or oom_group
memcg, then select the largest process inside) really does look like the way to go.

It should work well in practice and it allows further development.
It will catch workloads which are leaking child processes by default,
which is an advantage in comparison to the existing algorithm.
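
To make the two steps concrete, here is a minimal userspace sketch in C. It
is only an illustration under stated assumptions: struct memcg, struct task,
memcg_usage() and both select_*() helpers are hypothetical names invented for
this example, not the kernel's structures or the patchset's code, and "usage"
is simply the sum of task sizes.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_CHILDREN 4
#define MAX_TASKS 4

/* Hypothetical userspace model; these are not the kernel's structures. */
struct task {
	int pid;
	unsigned long rss;	/* memory footprint, arbitrary units */
};

struct memcg {
	const char *name;
	int oom_group;		/* models memory.oom_group */
	struct memcg *children[MAX_CHILDREN];
	size_t nr_children;
	struct task tasks[MAX_TASKS];
	size_t nr_tasks;
};

/* Footprint of a memcg: its own tasks plus those of all descendants. */
unsigned long memcg_usage(const struct memcg *cg)
{
	unsigned long sum = 0;
	size_t i;

	for (i = 0; i < cg->nr_tasks; i++)
		sum += cg->tasks[i].rss;
	for (i = 0; i < cg->nr_children; i++)
		sum += memcg_usage(cg->children[i]);
	return sum;
}

/*
 * Step 1: walk the tree and return the biggest memcg that is either a
 * leaf or has oom_group set. Pass best == NULL on the first call.
 */
const struct memcg *select_victim_memcg(const struct memcg *cg,
					const struct memcg *best)
{
	size_t i;

	assert(cg != NULL);
	if (cg->oom_group || cg->nr_children == 0) {
		if (!best || memcg_usage(cg) > memcg_usage(best))
			return cg;
		return best;
	}
	for (i = 0; i < cg->nr_children; i++)
		best = select_victim_memcg(cg->children[i], best);
	return best;
}

/* Step 2: the largest task directly attached to the chosen memcg, or NULL. */
const struct task *select_victim_task(const struct memcg *cg)
{
	const struct task *best = NULL;
	size_t i;

	for (i = 0; i < cg->nr_tasks; i++)
		if (!best || cg->tasks[i].rss > best->rss)
			best = &cg->tasks[i];
	return best;
}
```

In this model an intermediate memcg without oom_group is never a candidate
itself; only its leaves (or oom_group descendants) compete with the rest of
the system.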

Both the strongly hierarchical approach (as in v8) and the purely flat one
(by Johannes) are more limiting. In the first case, deep hierarchies are
affected (as Michal mentioned) and we are stuck with a tree traversal policy
(Tejun's point).

In the second case, further development is in question: any new idea
(say, oom_priorities, or a new useful memcg metric) would have to be applied
to processes and memcgs simultaneously.
Also, we drop any idea of memcg-level fairness and run into some implementation
issues (which I mentioned earlier). The idea of mixing tasks and memcgs
leads to much hairier code, and the OOM code is already quite hairy.
The idea of comparing killable entities is a leaky abstraction,
as we can't predict how much memory killing a single process will release
(say, for example, the process is the init in a pid namespace).

> > We need OOM behavior to kill in a deterministic order configured by
> > policy.
> 
> And nobody is objecting to this usecase. I think we can build a priority
> policy on top of leaf-based decision as well. The main point we are
> trying to sort out here is a reasonable semantic that would work for
> most workloads. Sibling based selection will simply not work on those
> that have to use deeper hierarchies for organizational purposes. I
> haven't heard a counter argument for that example yet.

Yes, implementing oom_priorities is a ~15 line patch on top of
the approach under discussion. David can use this small out-of-tree patch
for now; in any case it's a step forward in comparison to the existing state.

Overall, do we have any open questions left? Does anyone have any strong
arguments against the design under discussion?

Thanks!


* Re: [v8 0/4] cgroup-aware OOM killer
  2017-09-27  7:43                                       ` Michal Hocko
  2017-09-27 10:19                                         ` Roman Gushchin
@ 2017-09-27 15:35                                         ` Tim Hockin
  2017-09-27 16:23                                           ` Roman Gushchin
  1 sibling, 1 reply; 78+ messages in thread
From: Tim Hockin @ 2017-09-27 15:35 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Johannes Weiner, Roman Gushchin, Tejun Heo, kernel-team,
	David Rientjes, linux-mm, Vladimir Davydov, Tetsuo Handa,
	Andrew Morton, Cgroups, linux-doc, linux-kernel

On Wed, Sep 27, 2017 at 12:43 AM, Michal Hocko <mhocko@kernel.org> wrote:
> On Tue 26-09-17 20:37:37, Tim Hockin wrote:
> [...]
>> I feel like David has offered examples here, and many of us at Google
>> have offered examples as long ago as 2013 (if I recall) of cases where
>> the proposed heuristic is EXACTLY WRONG.
>
> I do not think we have discussed anything resembling the current
> approach. And I would really appreciate some more examples where
> decisions based on leaf nodes would be EXACTLY WRONG.
>
>> We need OOM behavior to kill in a deterministic order configured by
>> policy.
>
> And nobody is objecting to this usecase. I think we can build a priority
> policy on top of leaf-based decision as well. The main point we are
> trying to sort out here is a reasonable semantic that would work for
> most workloads. Sibling based selection will simply not work on those
> that have to use deeper hierarchies for organizational purposes. I
> haven't heard a counter argument for that example yet.

We have a priority-based, multi-user cluster.  That cluster runs a
variety of work, including critical things like search and gmail, as
well as non-critical things like batch work.  We try to offer our
users an SLA around how often they will be killed by factors outside
themselves, but we also want to get higher utilization.  We know for a
fact (data, lots of data) that most jobs have spare memory capacity,
set aside for spikes or simply because accurate sizing is hard.  We
can sell "guaranteed" resources to critical jobs, with a high SLA.  We
can sell "best effort" resources to non-critical jobs with a low SLA.
We achieve much better overall utilization this way.

I need to represent the priority of these tasks in a way that gives me
a very strong promise that, in case of system OOM, the non-critical
jobs will be chosen before the critical jobs.  Regardless of size.
Regardless of how many non-critical jobs have to die.  I'd rather kill
*all* of the non-critical jobs than a single critical job.  Size of
the process or cgroup is simply not a factor, and honestly given 2
options of equal priority I'd say age matters more than size.

So concretely I have 2 first-level cgroups, one for "guaranteed" and
one for "best effort" classes.  I always want to kill from "best
effort", even if that means killing 100 small cgroups, before touching
"guaranteed".
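
The ordering described above can be sketched as a comparator in which size
never participates. This is a hedged illustration, not Borg's or Kubernetes'
implementation: struct job, its fields, kill_before() and select_victim() are
hypothetical names, "lower priority value dies first" is an assumed
convention, and breaking ties by killing the younger job first is purely an
assumption (the text only says that age matters more than size).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of a schedulable job; not a kernel structure. */
struct job {
	const char *name;
	int priority;			/* assumption: lower value is killed first */
	unsigned long start_time;	/* assumption: larger value means younger */
	unsigned long usage;		/* recorded, but deliberately never compared */
};

/* Nonzero if job a should be killed before job b. */
int kill_before(const struct job *a, const struct job *b)
{
	if (a->priority != b->priority)
		return a->priority < b->priority;
	/* Equal priority: the tie-break direction is an assumption. */
	return a->start_time > b->start_time;
}

/* Return the first job to kill, or NULL if there are none. */
const struct job *select_victim(const struct job *jobs, size_t n)
{
	const struct job *victim = NULL;
	size_t i;

	assert(jobs != NULL || n == 0);
	for (i = 0; i < n; i++)
		if (!victim || kill_before(&jobs[i], victim))
			victim = &jobs[i];
	return victim;
}
```

With this ordering a large "guaranteed" job is only reached once every
"best effort" job has been removed from the candidate list, no matter how
small those jobs are.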

I apologize if this is not as thorough as the rest of the thread - I
am somewhat out of touch with the guts of it all these days.  I just
feel compelled to indicate that, as a historical user (via Google
systems) and current user (via Kubernetes), some of the assertions
being made here do not ring true for our very real use cases.  I
desperately want cgroup-aware OOM handling, but it has to be
policy-based or it is just not useful to us.

Thanks.

Tim


* Re: [v8 0/4] cgroup-aware OOM killer
  2017-09-27 15:35                                         ` Tim Hockin
@ 2017-09-27 16:23                                           ` Roman Gushchin
  2017-09-27 18:11                                             ` Tim Hockin
  0 siblings, 1 reply; 78+ messages in thread
From: Roman Gushchin @ 2017-09-27 16:23 UTC (permalink / raw)
  To: Tim Hockin
  Cc: Michal Hocko, Johannes Weiner, Tejun Heo, kernel-team,
	David Rientjes, linux-mm, Vladimir Davydov, Tetsuo Handa,
	Andrew Morton, Cgroups, linux-doc, linux-kernel

On Wed, Sep 27, 2017 at 08:35:50AM -0700, Tim Hockin wrote:
> On Wed, Sep 27, 2017 at 12:43 AM, Michal Hocko <mhocko@kernel.org> wrote:
> > On Tue 26-09-17 20:37:37, Tim Hockin wrote:
> > [...]
> >> I feel like David has offered examples here, and many of us at Google
> >> have offered examples as long ago as 2013 (if I recall) of cases where
> >> the proposed heuristic is EXACTLY WRONG.
> >
> > I do not think we have discussed anything resembling the current
> > approach. And I would really appreciate some more examples where
> > decisions based on leaf nodes would be EXACTLY WRONG.
> >
> >> We need OOM behavior to kill in a deterministic order configured by
> >> policy.
> >
> > And nobody is objecting to this usecase. I think we can build a priority
> > policy on top of leaf-based decision as well. The main point we are
> > trying to sort out here is a reasonable semantic that would work for
> > most workloads. Sibling based selection will simply not work on those
> > that have to use deeper hierarchies for organizational purposes. I
> > haven't heard a counter argument for that example yet.
>

Hi, Tim!

> We have a priority-based, multi-user cluster.  That cluster runs a
> variety of work, including critical things like search and gmail, as
> well as non-critical things like batch work.  We try to offer our
> users an SLA around how often they will be killed by factors outside
> themselves, but we also want to get higher utilization.  We know for a
> fact (data, lots of data) that most jobs have spare memory capacity,
> set aside for spikes or simply because accurate sizing is hard.  We
> can sell "guaranteed" resources to critical jobs, with a high SLA.  We
> can sell "best effort" resources to non-critical jobs with a low SLA.
> We achieve much better overall utilization this way.

This is well understood.

> 
> I need to represent the priority of these tasks in a way that gives me
> a very strong promise that, in case of system OOM, the non-critical
> jobs will be chosen before the critical jobs.  Regardless of size.
> Regardless of how many non-critical jobs have to die.  I'd rather kill
> *all* of the non-critical jobs than a single critical job.  Size of
> the process or cgroup is simply not a factor, and honestly given 2
> options of equal priority I'd say age matters more than size.
> 
> So concretely I have 2 first-level cgroups, one for "guaranteed" and
> one for "best effort" classes.  I always want to kill from "best
> effort", even if that means killing 100 small cgroups, before touching
> "guaranteed".
> 
> I apologize if this is not as thorough as the rest of the thread - I
> am somewhat out of touch with the guts of it all these days.  I just
> feel compelled to indicate that, as a historical user (via Google
> systems) and current user (via Kubernetes), some of the assertions
> being made here do not ring true for our very real use cases.  I
> desperately want cgroup-aware OOM handling, but it has to be
> policy-based or it is just not useful to us.

A policy-based approach was suggested by Michal at the very beginning of
this discussion. Although nobody had any strong objections against it,
we've agreed that this is out of scope of this patchset.

The idea of this patchset is to introduce the ability to select a memcg
as an OOM victim, optionally followed by killing all tasks belonging to it.
I believe it's absolutely mandatory for _any_ further development
of the OOM killer that wants to deal with memory cgroups as OOM entities.

If you think that it makes it impossible to support some use cases in the
future, let's discuss it. Otherwise, I'd prefer to finish this part of the work,
and proceed to the following improvements on top of it.

Thank you!


* Re: [v8 0/4] cgroup-aware OOM killer
  2017-09-27 16:23                                           ` Roman Gushchin
@ 2017-09-27 18:11                                             ` Tim Hockin
  2017-10-01 23:29                                               ` Shakeel Butt
  0 siblings, 1 reply; 78+ messages in thread
From: Tim Hockin @ 2017-09-27 18:11 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Michal Hocko, Johannes Weiner, Tejun Heo, kernel-team,
	David Rientjes, linux-mm, Vladimir Davydov, Tetsuo Handa,
	Andrew Morton, Cgroups, linux-doc, linux-kernel

On Wed, Sep 27, 2017 at 9:23 AM, Roman Gushchin <guro@fb.com> wrote:
> On Wed, Sep 27, 2017 at 08:35:50AM -0700, Tim Hockin wrote:
>> On Wed, Sep 27, 2017 at 12:43 AM, Michal Hocko <mhocko@kernel.org> wrote:
>> > On Tue 26-09-17 20:37:37, Tim Hockin wrote:
>> > [...]
>> >> I feel like David has offered examples here, and many of us at Google
>> >> have offered examples as long ago as 2013 (if I recall) of cases where
>> >> the proposed heuristic is EXACTLY WRONG.
>> >
>> > I do not think we have discussed anything resembling the current
>> > approach. And I would really appreciate some more examples where
>> > decisions based on leaf nodes would be EXACTLY WRONG.
>> >
>> >> We need OOM behavior to kill in a deterministic order configured by
>> >> policy.
>> >
>> > And nobody is objecting to this usecase. I think we can build a priority
>> > policy on top of leaf-based decision as well. The main point we are
>> > trying to sort out here is a reasonable semantic that would work for
>> > most workloads. Sibling based selection will simply not work on those
>> > that have to use deeper hierarchies for organizational purposes. I
>> > haven't heard a counter argument for that example yet.
>>
>
> Hi, Tim!
>
>> We have a priority-based, multi-user cluster.  That cluster runs a
>> variety of work, including critical things like search and gmail, as
>> well as non-critical things like batch work.  We try to offer our
>> users an SLA around how often they will be killed by factors outside
>> themselves, but we also want to get higher utilization.  We know for a
>> fact (data, lots of data) that most jobs have spare memory capacity,
>> set aside for spikes or simply because accurate sizing is hard.  We
>> can sell "guaranteed" resources to critical jobs, with a high SLA.  We
>> can sell "best effort" resources to non-critical jobs with a low SLA.
>> We achieve much better overall utilization this way.
>
> This is well understood.
>
>>
>> I need to represent the priority of these tasks in a way that gives me
>> a very strong promise that, in case of system OOM, the non-critical
>> jobs will be chosen before the critical jobs.  Regardless of size.
>> Regardless of how many non-critical jobs have to die.  I'd rather kill
>> *all* of the non-critical jobs than a single critical job.  Size of
>> the process or cgroup is simply not a factor, and honestly given 2
>> options of equal priority I'd say age matters more than size.
>>
>> So concretely I have 2 first-level cgroups, one for "guaranteed" and
>> one for "best effort" classes.  I always want to kill from "best
>> effort", even if that means killing 100 small cgroups, before touching
>> "guaranteed".
>>
>> I apologize if this is not as thorough as the rest of the thread - I
>> am somewhat out of touch with the guts of it all these days.  I just
>> feel compelled to indicate that, as a historical user (via Google
>> systems) and current user (via Kubernetes), some of the assertions
>> being made here do not ring true for our very real use cases.  I
>> desperately want cgroup-aware OOM handling, but it has to be
>> policy-based or it is just not useful to us.
>
> A policy-based approach was suggested by Michal at the very beginning of
> this discussion. Although nobody had any strong objections against it,
> we've agreed that this is out of scope of this patchset.
>
> The idea of this patchset is to introduce the ability to select a memcg
> as an OOM victim, optionally followed by killing all tasks belonging to it.
> I believe it's absolutely mandatory for _any_ further development
> of the OOM killer that wants to deal with memory cgroups as OOM entities.
>
> If you think that it makes it impossible to support some use cases in the
> future, let's discuss it. Otherwise, I'd prefer to finish this part of the work,
> and proceed to the following improvements on top of it.
>
> Thank you!

I am 100% in favor of killing whole groups.  We want that too.  I just
needed to express disagreement with statements that size-based
decisions could not produce bad results.  They can and do.


* Re: [v8 0/4] cgroup-aware OOM killer
  2017-09-27 18:11                                             ` Tim Hockin
@ 2017-10-01 23:29                                               ` Shakeel Butt
  2017-10-02 11:56                                                 ` Tetsuo Handa
  2017-10-02 12:24                                                 ` Michal Hocko
  0 siblings, 2 replies; 78+ messages in thread
From: Shakeel Butt @ 2017-10-01 23:29 UTC (permalink / raw)
  To: Tim Hockin
  Cc: Roman Gushchin, Michal Hocko, Johannes Weiner, Tejun Heo,
	kernel-team, David Rientjes, Linux MM, Vladimir Davydov,
	Tetsuo Handa, Andrew Morton, Cgroups, linux-doc, linux-kernel

>
> Going back to Michal's example, say the user configured the following:
>
>        root
>       /    \
>      A      D
>     / \
>    B   C
>
> A global OOM event happens and we find this:
> - A > D
> - B, C, D are oomgroups
>
> What the user is telling us is that B, C, and D are compound memory
> consumers. They cannot be divided into their task parts from a memory
> point of view.
>
> However, the user doesn't say the same for A: the A subtree summarizes
> and controls aggregate consumption of B and C, but without groupoom
> set on A, the user says that A is in fact divisible into independent
> memory consumers B and C.
>
> If we don't have to kill all of A, but we'd have to kill all of D,
> does it make sense to compare the two?
>

I think Tim has given very clear explanation why comparing A & D makes
perfect sense. However I think the above example, a single user system
where a user has designed and created the whole hierarchy and then
attaches different jobs/applications to different nodes in this
hierarchy, is also a valid scenario. One solution I can think of, to
cater to both scenarios, is to introduce a notion of 'bypass oom', i.e. to not
include a memcg in the oom comparison and instead include its children
in the comparison.

So, in the same above example:
        root
       /       \
      A(b)    D
     /  \
    B   C

A is marked as bypass and thus B and C are to be compared to D. So,
for the single user scenario, all the internal nodes are marked
'bypass oom comparison' and oom_priority of the leaves has to be set
to the same value.

Below is the pseudo code of select_victim_memcg() based on this idea
and David's previous pseudo code. The calculation of size of a memcg
is still not very well baked here yet. I am working on it and I plan
to have a patch based on Roman's v9 "mm, oom: cgroup-aware OOM killer"
patch.


        struct mem_cgroup *memcg = root_mem_cgroup;
        struct mem_cgroup *selected_memcg = root_mem_cgroup;
        struct mem_cgroup *low_memcg;
        unsigned long low_priority;
        unsigned long prev_badness = memcg_oom_badness(memcg); // Roman's code
        LIST_HEAD(queue);

next_level:
        low_memcg = NULL;
        low_priority = ULONG_MAX;

next:
        for_each_child_of_memcg(it, memcg) {
                unsigned long prio = it->oom_priority;
                unsigned long badness = 0;

                if (it->bypass_oom && !it->oom_group &&
                    memcg_has_children(it)) {
                        list_add(&it->oom_queue, &queue);
                        continue;
                }

                if (prio > low_priority)
                        continue;

                if (prio == low_priority) {
                        // for simplicity, need more thinking
                        badness = mem_cgroup_usage(it);
                        if (badness < prev_badness)
                                continue;
                }

                low_memcg = it;
                low_priority = prio;
                // for simplicity
                prev_badness = badness ?: mem_cgroup_usage(it);
        }
        if (!list_empty(&queue)) {
                memcg = list_last_entry(&queue, struct mem_cgroup, oom_queue);
                list_del(&memcg->oom_queue);
                goto next;
        }
        if (low_memcg) {
                selected_memcg = memcg = low_memcg;
                prev_badness = 0;
                if (!low_memcg->oom_group)
                        goto next_level;
        }
        if (selected_memcg->oom_group)
                oom_kill_memcg(selected_memcg);
        else
                oom_kill_process_from_memcg(selected_memcg);


* Re: [v8 0/4] cgroup-aware OOM killer
  2017-10-01 23:29                                               ` Shakeel Butt
@ 2017-10-02 11:56                                                 ` Tetsuo Handa
  2017-10-02 12:24                                                 ` Michal Hocko
  1 sibling, 0 replies; 78+ messages in thread
From: Tetsuo Handa @ 2017-10-02 11:56 UTC (permalink / raw)
  To: shakeelb, thockin
  Cc: guro, mhocko, hannes, tj, kernel-team, rientjes, linux-mm,
	vdavydov.dev, akpm, cgroups, linux-doc, linux-kernel

Shakeel Butt wrote:
> I think Tim has given very clear explanation why comparing A & D makes
> perfect sense. However I think the above example, a single user system
> where a user has designed and created the whole hierarchy and then
> attaches different jobs/applications to different nodes in this
> hierarchy, is also a valid scenario. One solution I can think of, to
> cater to both scenarios, is to introduce a notion of 'bypass oom', i.e. to not
> include a memcg in the oom comparison and instead include its children
> in the comparison.

I have not been following this thread closely because I don't use memcg.
But if there are multiple scenarios, what about offloading memcg OOM
handling to loadable kernel modules (like the many filesystems which
are called through the VFS interface)? We could do trial and error more casually.


* Re: [v8 0/4] cgroup-aware OOM killer
  2017-10-01 23:29                                               ` Shakeel Butt
  2017-10-02 11:56                                                 ` Tetsuo Handa
@ 2017-10-02 12:24                                                 ` Michal Hocko
  2017-10-02 12:47                                                   ` Roman Gushchin
  2017-10-02 19:00                                                   ` Shakeel Butt
  1 sibling, 2 replies; 78+ messages in thread
From: Michal Hocko @ 2017-10-02 12:24 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Tim Hockin, Roman Gushchin, Johannes Weiner, Tejun Heo,
	kernel-team, David Rientjes, Linux MM, Vladimir Davydov,
	Tetsuo Handa, Andrew Morton, Cgroups, linux-doc, linux-kernel

On Sun 01-10-17 16:29:48, Shakeel Butt wrote:
> >
> > Going back to Michal's example, say the user configured the following:
> >
> >        root
> >       /    \
> >      A      D
> >     / \
> >    B   C
> >
> > A global OOM event happens and we find this:
> > - A > D
> > - B, C, D are oomgroups
> >
> > What the user is telling us is that B, C, and D are compound memory
> > consumers. They cannot be divided into their task parts from a memory
> > point of view.
> >
> > However, the user doesn't say the same for A: the A subtree summarizes
> > and controls aggregate consumption of B and C, but without groupoom
> > set on A, the user says that A is in fact divisible into independent
> > memory consumers B and C.
> >
> > If we don't have to kill all of A, but we'd have to kill all of D,
> > does it make sense to compare the two?
> >
> 
> I think Tim has given very clear explanation why comparing A & D makes
> perfect sense. However I think the above example, a single user system
> where a user has designed and created the whole hierarchy and then
> attaches different jobs/applications to different nodes in this
> hierarchy, is also a valid scenario.

Yes and nobody is disputing that, really. I guess the main disconnect
here is that different people want to have more detailed control over
the victim selection while the patchset tries to handle the most
simplistic scenario, where no userspace control over the selection is
required. And I would claim that this will be the vast majority of setups
and we should address it first.

A more fine grained control needs some more thinking to come up with a
sensible and long term sustainable API. Just look back at the
oom_score_adj story and how it ended up unusable in the end (well apart
from never/always kill corner cases). Let's not repeat that again now.

I strongly believe that we can come up with something - be it priority
based, BPF based or module based selection. But let's start simple with
the most basic scenario first with a most sensible semantic implemented.

I believe the latest version (v9) looks sensible from the semantic point
of view and we should focus on making it into a mergeable shape.
-- 
Michal Hocko
SUSE Labs


* Re: [v8 0/4] cgroup-aware OOM killer
  2017-10-02 12:24                                                 ` Michal Hocko
@ 2017-10-02 12:47                                                   ` Roman Gushchin
  2017-10-02 14:29                                                     ` Michal Hocko
  2017-10-02 19:00                                                   ` Shakeel Butt
  1 sibling, 1 reply; 78+ messages in thread
From: Roman Gushchin @ 2017-10-02 12:47 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Shakeel Butt, Tim Hockin, Johannes Weiner, Tejun Heo,
	kernel-team, David Rientjes, Linux MM, Vladimir Davydov,
	Tetsuo Handa, Andrew Morton, Cgroups, linux-doc, linux-kernel

On Mon, Oct 02, 2017 at 02:24:34PM +0200, Michal Hocko wrote:
> On Sun 01-10-17 16:29:48, Shakeel Butt wrote:
> > >
> > > Going back to Michal's example, say the user configured the following:
> > >
> > >        root
> > >       /    \
> > >      A      D
> > >     / \
> > >    B   C
> > >
> > > A global OOM event happens and we find this:
> > > - A > D
> > > - B, C, D are oomgroups
> > >
> > > What the user is telling us is that B, C, and D are compound memory
> > > consumers. They cannot be divided into their task parts from a memory
> > > point of view.
> > >
> > > However, the user doesn't say the same for A: the A subtree summarizes
> > > and controls aggregate consumption of B and C, but without groupoom
> > > set on A, the user says that A is in fact divisible into independent
> > > memory consumers B and C.
> > >
> > > If we don't have to kill all of A, but we'd have to kill all of D,
> > > does it make sense to compare the two?
> > >
> > 
> > I think Tim has given very clear explanation why comparing A & D makes
> > perfect sense. However I think the above example, a single user system
> > where a user has designed and created the whole hierarchy and then
> > attaches different jobs/applications to different nodes in this
> > hierarchy, is also a valid scenario.
> 
> Yes and nobody is disputing that, really. I guess the main disconnect
> here is that different people want to have more detailed control over
> the victim selection while the patchset tries to handle the most
> simplistic scenario, where no userspace control over the selection is
> required. And I would claim that this will be the vast majority of setups
> and we should address it first.
> 
> A more fine grained control needs some more thinking to come up with a
> sensible and long term sustainable API. Just look back at the
> oom_score_adj story and how it ended up unusable in the end (well apart
> from never/always kill corner cases). Let's not repeat that again now.
> 
> I strongly believe that we can come up with something - be it priority
> based, BPF based or module based selection. But let's start simple with
> the most basic scenario first with a most sensible semantic implemented.

Totally agree.

> I believe the latest version (v9) looks sensible from the semantic point
> of view and we should focus on making it into a mergeable shape.

The only thing is that after some additional thinking I don't think anymore
that implicit propagation of oom_group is a good idea.

Let me explain: assume we have memcg A with memory.max and memory.oom_group
set, and nested memcg A/B with memory.max set. Let's imagine we have an OOM
event in A/B. What is the expected system behavior?
We have an OOM scoped to A/B, and any action should also be scoped to A/B.
We really shouldn't touch processes which do not belong to A/B.
That means we should either kill the biggest process in A/B, or all
processes in A/B. It's natural to make A/B/memory.oom_group responsible
for this decision. It's strange to make it depend on A/memory.oom_group, IMO.
It really makes no sense, and makes the oom_group knob really hard to describe.
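
The difference between the two semantics can be shown with a small sketch.
Everything here is hypothetical: struct memcg, enum oom_action and the two
functions are invented for illustration, and oom_action_propagated() is only
one possible reading of "implicit propagation", not code taken from v9.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a memcg; not the kernel's struct mem_cgroup. */
struct memcg {
	const struct memcg *parent;	/* NULL for the top of the modeled tree */
	bool oom_group;			/* models memory.oom_group */
};

enum oom_action {
	OOM_KILL_BIGGEST_TASK,	/* kill one task inside the OOMing memcg */
	OOM_KILL_ALL_TASKS,	/* kill every task inside the OOMing memcg */
};

/* Explicit semantics: only the OOMing memcg's own knob is consulted. */
enum oom_action oom_action_explicit(const struct memcg *oom_domain)
{
	assert(oom_domain != NULL);
	return oom_domain->oom_group ? OOM_KILL_ALL_TASKS
				     : OOM_KILL_BIGGEST_TASK;
}

/*
 * Implicit propagation (assumed reading): the knob of any ancestor also
 * applies, so A/memory.oom_group changes what happens on an OOM in A/B.
 */
enum oom_action oom_action_propagated(const struct memcg *oom_domain)
{
	const struct memcg *cg;

	assert(oom_domain != NULL);
	for (cg = oom_domain; cg; cg = cg->parent)
		if (cg->oom_group)
			return OOM_KILL_ALL_TASKS;
	return OOM_KILL_BIGGEST_TASK;
}
```

For A with oom_group set and A/B without it, the explicit variant kills only
the biggest task in A/B, while the propagated variant kills all of A/B. In
both cases the action stays scoped to A/B.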

Also, after some off-list discussion, we've realized that memory.oom_group
should be delegatable. The workload should have control over it to express
dependencies between processes.

Thanks!


* Re: [v8 0/4] cgroup-aware OOM killer
  2017-10-02 12:47                                                   ` Roman Gushchin
@ 2017-10-02 14:29                                                     ` Michal Hocko
  0 siblings, 0 replies; 78+ messages in thread
From: Michal Hocko @ 2017-10-02 14:29 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Shakeel Butt, Tim Hockin, Johannes Weiner, Tejun Heo,
	kernel-team, David Rientjes, Linux MM, Vladimir Davydov,
	Tetsuo Handa, Andrew Morton, Cgroups, linux-doc, linux-kernel

On Mon 02-10-17 13:47:12, Roman Gushchin wrote:
> On Mon, Oct 02, 2017 at 02:24:34PM +0200, Michal Hocko wrote:
[...]
> > I believe the latest version (v9) looks sensible from the semantic point
> > of view and we should focus on making it into a mergeable shape.
> 
> The only thing is that after some additional thinking I don't think anymore
> that implicit propagation of oom_group is a good idea.

It would be better to discuss this under the v9 thread. This one is
already quite convoluted IMHO.
-- 
Michal Hocko
SUSE Labs


* Re: [v8 0/4] cgroup-aware OOM killer
  2017-10-02 12:24                                                 ` Michal Hocko
  2017-10-02 12:47                                                   ` Roman Gushchin
@ 2017-10-02 19:00                                                   ` Shakeel Butt
  2017-10-02 19:28                                                     ` Michal Hocko
  1 sibling, 1 reply; 78+ messages in thread
From: Shakeel Butt @ 2017-10-02 19:00 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Tim Hockin, Roman Gushchin, Johannes Weiner, Tejun Heo,
	kernel-team, David Rientjes, Linux MM, Vladimir Davydov,
	Tetsuo Handa, Andrew Morton, Cgroups, linux-doc, linux-kernel

> Yes and nobody is disputing that, really. I guess the main disconnect
> here is that different people want to have more detailed control over
> the victim selection while the patchset tries to handle the most
> simplistic scenario when no userspace control over the selection is
> required. And I would claim that this will be the vast majority of setups
> and we should address it first.

IMHO the disconnect/disagreement is which memcgs should be compared
with each other for oom victim selection. Let's forget about oom
priority and just take size into account. Should the oom selection
algorithm compare the leaves of the hierarchy or should it compare
siblings? For the single user system, comparing leaves makes sense
while in a multi user system, siblings should be compared for victim
selection.

Coming back to the same example:

       root
       /    \
     A      D
     / \
   B   C

Let's view it as a multi user system and some central job scheduler
has asked a node controller on this system to start two jobs 'A' &
'D'. 'A' then went on to create sub-containers. Now, on system oom,
IMO the most simple sensible thing to do from the semantic point of
view is to compare 'A' and 'D' and, if 'A''s usage is higher, then
either kill all of 'A' if oom_group is set or recursively find a victim
memcg taking 'A' as root.

I have noted before that for single user systems, comparing 'B', 'C' &
'D' is the most sensible thing to do.

Now, in the multi user system, I can kind of force the comparison of
'A' & 'D' by setting oom_group on 'A'. IMO that is abuse of
'oom_group' as it will get double meanings/semantics which are
comparison leader and killall. I would humbly suggest to have two
separate notions instead. Let's say oom_gang (if you prefer just
'oom_group' is fine too) and killall.

For the single user system example, 'B', 'C' and 'D' will have
'oom_gang' set and if the user wants killall semantics too, he can set
it separately.

For the multi user system, 'A' and 'D' will have 'oom_gang' set. Now, let's
say 'A' was selected on system oom, if 'killall' was set on 'A' then
'A' will be selected as victim otherwise the oom selection algorithm
will recursively take 'A' as root and try to find victim memcg.

Another major semantic of 'oom_gang' is that the leaves will always be
treated as 'oom_gang'.
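
To make the two notions concrete, here is a minimal Python sketch of
the selection walk described above. It is only an illustration, not
code from the patchset: the Memcg class, the 'oom_gang' and 'killall'
fields, the comparable() and select() helpers and the usage numbers
are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Memcg:
    name: str
    usage: int = 0           # memory charged directly here (made-up units)
    oom_gang: bool = False   # hypothetical flag: "compare me as one unit"
    killall: bool = False    # hypothetical flag: "kill everything inside me"
    children: list = field(default_factory=list)

    def total(self):
        """Cumulative usage of this memcg and all of its descendants."""
        return self.usage + sum(c.total() for c in self.children)


def comparable(root):
    """Descend from root until an oom_gang memcg or a leaf is reached.

    Leaves are always treated as oom_gang, as proposed above.
    """
    units = []
    for child in root.children:
        if child.oom_gang or not child.children:
            units.append(child)
        else:
            units.extend(comparable(child))
    return units


def select(root):
    """Pick the victim memcg by comparing cumulative usage of the units."""
    victim = max(comparable(root), key=Memcg.total)
    if victim.killall or not victim.children:
        return victim
    # No killall: take the chosen memcg as the new root and recurse.
    return select(victim)


def example_tree(gang_on_a, killall_on_a=False):
    b, c, d = Memcg("B", 3), Memcg("C", 4), Memcg("D", 5)
    a = Memcg("A", oom_gang=gang_on_a, killall=killall_on_a, children=[b, c])
    return Memcg("root", children=[a, d])


# Single user system: B, C and D are compared, D is the largest.
print(select(example_tree(gang_on_a=False)).name)                     # D
# Multi user system: A (7) beats D (5), then C is picked inside A.
print(select(example_tree(gang_on_a=True)).name)                      # C
# Multi user system with killall on A: A itself is the victim.
print(select(example_tree(gang_on_a=True, killall_on_a=True)).name)   # A
```

The point of keeping two flags is visible in the second and third
calls: the comparison unit stays the same ('A' versus 'D') while the
kill granularity changes independently.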


* Re: [v8 0/4] cgroup-aware OOM killer
  2017-10-02 19:00                                                   ` Shakeel Butt
@ 2017-10-02 19:28                                                     ` Michal Hocko
  2017-10-02 19:45                                                       ` Shakeel Butt
  0 siblings, 1 reply; 78+ messages in thread
From: Michal Hocko @ 2017-10-02 19:28 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Tim Hockin, Roman Gushchin, Johannes Weiner, Tejun Heo,
	kernel-team, David Rientjes, Linux MM, Vladimir Davydov,
	Tetsuo Handa, Andrew Morton, Cgroups, linux-doc, linux-kernel

On Mon 02-10-17 12:00:43, Shakeel Butt wrote:
> > Yes and nobody is disputing that, really. I guess the main disconnect
> > here is that different people want to have more detailed control over
> > the victim selection while the patchset tries to handle the most
> > simplistic scenario when no userspace control over the selection is
> > required. And I would claim that this will be the vast majority of setups
> > and we should address it first.
> 
> IMHO the disconnect/disagreement is which memcgs should be compared
> with each other for oom victim selection. Let's forget about oom
> priority and just take size into account. Should the oom selection
> algorithm compare the leaves of the hierarchy or should it compare
> siblings? For the single user system, comparing leaves makes sense
> while in a multi user system, siblings should be compared for victim
> selection.

This is simply not true. This is not about single vs. multi user
systems. This is about how the memcg hierarchy is organized (please
have a look at the example I've provided previously). I would dare to
claim that comparing siblings is a weaker semantic just because it puts
stronger constraints on how the hierarchy is organized. Especially when
cgroup v2 is single hierarchy based (so we cannot create intermediate
cgroup nodes for other controllers because we would automatically get
a cumulative memory consumption).

I am sorry to cut the rest of your proposal because it simply goes over
the scope of the proposed solution while the usecase you are mentioning
is still possible. If we want to compare intermediate nodes (which seems
to be the case) then we can always provide a knob to opt-in - be it your
oom_gang or others.

I am sorry but I would really appreciate it if we focused on getting
step 1 done before diverging into details about potential improvements
and better control over the selection. This whole thing is opt-in so
there is no risk of a regression.
-- 
Michal Hocko
SUSE Labs


* Re: [v8 0/4] cgroup-aware OOM killer
  2017-10-02 19:28                                                     ` Michal Hocko
@ 2017-10-02 19:45                                                       ` Shakeel Butt
  2017-10-02 19:56                                                         ` Michal Hocko
  0 siblings, 1 reply; 78+ messages in thread
From: Shakeel Butt @ 2017-10-02 19:45 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Tim Hockin, Roman Gushchin, Johannes Weiner, Tejun Heo,
	kernel-team, David Rientjes, Linux MM, Vladimir Davydov,
	Tetsuo Handa, Andrew Morton, Cgroups, linux-doc, linux-kernel

> I am sorry to cut the rest of your proposal because it simply goes over
> the scope of the proposed solution while the usecase you are mentioning
> is still possible. If we want to compare intermediate nodes (which seems
> to be the case) then we can always provide a knob to opt-in - be it your
> oom_gang or others.

In Roman's proposed solution we can already force the comparison
of intermediate nodes using 'oom_group', I am just requesting to
separate the killall semantics from it.


* Re: [v8 0/4] cgroup-aware OOM killer
  2017-10-02 19:45                                                       ` Shakeel Butt
@ 2017-10-02 19:56                                                         ` Michal Hocko
  2017-10-02 20:00                                                           ` Tim Hockin
  2017-10-02 20:24                                                           ` Shakeel Butt
  0 siblings, 2 replies; 78+ messages in thread
From: Michal Hocko @ 2017-10-02 19:56 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Tim Hockin, Roman Gushchin, Johannes Weiner, Tejun Heo,
	kernel-team, David Rientjes, Linux MM, Vladimir Davydov,
	Tetsuo Handa, Andrew Morton, Cgroups, linux-doc, linux-kernel

On Mon 02-10-17 12:45:18, Shakeel Butt wrote:
> > I am sorry to cut the rest of your proposal because it simply goes over
> > the scope of the proposed solution while the usecase you are mentioning
> > is still possible. If we want to compare intermediate nodes (which seems
> > to be the case) then we can always provide a knob to opt-in - be it your
> > oom_gang or others.
> 
> In Roman's proposed solution we can already force the comparison
> of intermediate nodes using 'oom_group', I am just requesting to
> separate the killall semantics from it.

oom_group _is_ about killall semantics. And comparing killable entities
is just a natural thing to do. So I am not sure what you mean.

-- 
Michal Hocko
SUSE Labs


* Re: [v8 0/4] cgroup-aware OOM killer
  2017-10-02 19:56                                                         ` Michal Hocko
@ 2017-10-02 20:00                                                           ` Tim Hockin
  2017-10-02 20:08                                                             ` Michal Hocko
  2017-10-02 20:20                                                             ` Shakeel Butt
  2017-10-02 20:24                                                           ` Shakeel Butt
  1 sibling, 2 replies; 78+ messages in thread
From: Tim Hockin @ 2017-10-02 20:00 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Shakeel Butt, Roman Gushchin, Johannes Weiner, Tejun Heo,
	kernel-team, David Rientjes, Linux MM, Vladimir Davydov,
	Tetsuo Handa, Andrew Morton, Cgroups, linux-doc, linux-kernel

In the example above:

       root
       /    \
     A      D
     / \
   B   C

Does oom_group allow me to express "compare A and D; if A is chosen
compare B and C; kill the loser" ?  As I understand the proposal (from
reading thread, not patch) it does not.

On Mon, Oct 2, 2017 at 12:56 PM, Michal Hocko <mhocko@kernel.org> wrote:
> On Mon 02-10-17 12:45:18, Shakeel Butt wrote:
>> > I am sorry to cut the rest of your proposal because it simply goes over
>> > the scope of the proposed solution while the usecase you are mentioning
>> > is still possible. If we want to compare intermediate nodes (which seems
>> > to be the case) then we can always provide a knob to opt-in - be it your
>> > oom_gang or others.
>>
>> In Roman's proposed solution we can already force the comparison
>> of intermediate nodes using 'oom_group', I am just requesting to
>> separate the killall semantics from it.
>
> oom_group _is_ about killall semantic.  And comparing killable entities
> is just a natural thing to do. So I am not sure what you mean
>
> --
> Michal Hocko
> SUSE Labs


* Re: [v8 0/4] cgroup-aware OOM killer
  2017-10-02 20:00                                                           ` Tim Hockin
@ 2017-10-02 20:08                                                             ` Michal Hocko
  2017-10-02 20:20                                                             ` Shakeel Butt
  1 sibling, 0 replies; 78+ messages in thread
From: Michal Hocko @ 2017-10-02 20:08 UTC (permalink / raw)
  To: Tim Hockin
  Cc: Shakeel Butt, Roman Gushchin, Johannes Weiner, Tejun Heo,
	kernel-team, David Rientjes, Linux MM, Vladimir Davydov,
	Tetsuo Handa, Andrew Morton, Cgroups, linux-doc, linux-kernel

On Mon 02-10-17 13:00:54, Tim Hockin wrote:
> In the example above:
> 
>        root
>        /    \
>      A      D
>      / \
>    B   C
> 
> Does oom_group allow me to express "compare A and D; if A is chosen
> compare B and C; kill the loser" ?  As I understand the proposal (from
> reading thread, not patch) it does not.

No it doesn't. It allows you to kill A (recursively) as the largest
memory consumer. So, no, it cannot be used for prioritization, but again
this is not yet the scope of the proposed solution.
-- 
Michal Hocko
SUSE Labs


* Re: [v8 0/4] cgroup-aware OOM killer
  2017-10-02 20:00                                                           ` Tim Hockin
  2017-10-02 20:08                                                             ` Michal Hocko
@ 2017-10-02 20:20                                                             ` Shakeel Butt
  1 sibling, 0 replies; 78+ messages in thread
From: Shakeel Butt @ 2017-10-02 20:20 UTC (permalink / raw)
  To: Tim Hockin
  Cc: Michal Hocko, Roman Gushchin, Johannes Weiner, Tejun Heo,
	kernel-team, David Rientjes, Linux MM, Vladimir Davydov,
	Tetsuo Handa, Andrew Morton, Cgroups, linux-doc, linux-kernel

(Replying again as the format of the previous reply got messed up).

On Mon, Oct 2, 2017 at 1:00 PM, Tim Hockin <thockin@hockin.org> wrote:
> In the example above:
>
>        root
>        /    \
>      A      D
>      / \
>    B   C
>
> Does oom_group allow me to express "compare A and D; if A is chosen
> compare B and C; kill the loser" ?  As I understand the proposal (from
> reading thread, not patch) it does not.

It will let you compare A and D and if A is chosen then kill A, B and C.


* Re: [v8 0/4] cgroup-aware OOM killer
  2017-10-02 19:56                                                         ` Michal Hocko
  2017-10-02 20:00                                                           ` Tim Hockin
@ 2017-10-02 20:24                                                           ` Shakeel Butt
  2017-10-02 20:34                                                             ` Johannes Weiner
  2017-10-02 20:55                                                             ` Michal Hocko
  1 sibling, 2 replies; 78+ messages in thread
From: Shakeel Butt @ 2017-10-02 20:24 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Tim Hockin, Roman Gushchin, Johannes Weiner, Tejun Heo,
	kernel-team, David Rientjes, Linux MM, Vladimir Davydov,
	Tetsuo Handa, Andrew Morton, Cgroups, linux-doc, linux-kernel

On Mon, Oct 2, 2017 at 12:56 PM, Michal Hocko <mhocko@kernel.org> wrote:
> On Mon 02-10-17 12:45:18, Shakeel Butt wrote:
>> > I am sorry to cut the rest of your proposal because it simply goes over
>> > the scope of the proposed solution while the usecase you are mentioning
>> > is still possible. If we want to compare intermediate nodes (which seems
>> > to be the case) then we can always provide a knob to opt-in - be it your
>> > oom_gang or others.
>>
>> In Roman's proposed solution we can already force the comparison
>> of intermediate nodes using 'oom_group', I am just requesting to
>> separate the killall semantics from it.
>
> oom_group _is_ about killall semantic.  And comparing killable entities
> is just a natural thing to do. So I am not sure what you mean
>

I am saying decouple the notion of comparable entities and killable entities.


* Re: [v8 0/4] cgroup-aware OOM killer
  2017-10-02 20:24                                                           ` Shakeel Butt
@ 2017-10-02 20:34                                                             ` Johannes Weiner
  2017-10-02 20:55                                                             ` Michal Hocko
  1 sibling, 0 replies; 78+ messages in thread
From: Johannes Weiner @ 2017-10-02 20:34 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Michal Hocko, Tim Hockin, Roman Gushchin, Tejun Heo, kernel-team,
	David Rientjes, Linux MM, Vladimir Davydov, Tetsuo Handa,
	Andrew Morton, Cgroups, linux-doc, linux-kernel

On Mon, Oct 02, 2017 at 01:24:25PM -0700, Shakeel Butt wrote:
> On Mon, Oct 2, 2017 at 12:56 PM, Michal Hocko <mhocko@kernel.org> wrote:
> > On Mon 02-10-17 12:45:18, Shakeel Butt wrote:
> >> > I am sorry to cut the rest of your proposal because it simply goes over
> >> > the scope of the proposed solution while the usecase you are mentioning
> >> > is still possible. If we want to compare intermediate nodes (which seems
> >> > to be the case) then we can always provide a knob to opt-in - be it your
> >> > oom_gang or others.
> >>
> >> In Roman's proposed solution we can already force the comparison
> >> of intermediate nodes using 'oom_group', I am just requesting to
> >> separate the killall semantics from it.
> >
> > oom_group _is_ about killall semantic.  And comparing killable entities
> > is just a natural thing to do. So I am not sure what you mean
> >
> 
> I am saying decouple the notion of comparable entities and killable entities.

Feel free to send patches in a new thread.

We don't need this level of control for this series to be useful - to
us, and other users. It can easily be added on top of Roman's work.


* Re: [v8 0/4] cgroup-aware OOM killer
  2017-10-02 20:24                                                           ` Shakeel Butt
  2017-10-02 20:34                                                             ` Johannes Weiner
@ 2017-10-02 20:55                                                             ` Michal Hocko
  1 sibling, 0 replies; 78+ messages in thread
From: Michal Hocko @ 2017-10-02 20:55 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Tim Hockin, Roman Gushchin, Johannes Weiner, Tejun Heo,
	kernel-team, David Rientjes, Linux MM, Vladimir Davydov,
	Tetsuo Handa, Andrew Morton, Cgroups, linux-doc, linux-kernel

On Mon 02-10-17 13:24:25, Shakeel Butt wrote:
> On Mon, Oct 2, 2017 at 12:56 PM, Michal Hocko <mhocko@kernel.org> wrote:
> > On Mon 02-10-17 12:45:18, Shakeel Butt wrote:
> >> > I am sorry to cut the rest of your proposal because it simply goes over
> >> > the scope of the proposed solution while the usecase you are mentioning
> >> > is still possible. If we want to compare intermediate nodes (which seems
> >> > to be the case) then we can always provide a knob to opt-in - be it your
> >> > oom_gang or others.
> >>
> >> In Roman's proposed solution we can already force the comparison
> >> of intermediate nodes using 'oom_group', I am just requesting to
> >> separate the killall semantics from it.
> >
> > oom_group _is_ about killall semantic.  And comparing killable entities
> > is just a natural thing to do. So I am not sure what you mean
> >
> 
> I am saying decouple the notion of comparable entities and killable entities.

There is no strong (bijection) relation there. Right now killable
entities are comparable (which I hope we agree is the right thing to do)
but nothing really prevents even non-killable entities from being
compared in the future.

-- 
Michal Hocko
SUSE Labs


* Re: [v8 0/4] cgroup-aware OOM killer
  2017-09-25 17:00                     ` Johannes Weiner
  2017-09-25 18:15                       ` Roman Gushchin
@ 2017-09-25 22:21                       ` David Rientjes
  2017-09-26  8:46                         ` Michal Hocko
  1 sibling, 1 reply; 78+ messages in thread
From: David Rientjes @ 2017-09-25 22:21 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Michal Hocko, Roman Gushchin, Tejun Heo, kernel-team, linux-mm,
	Vladimir Davydov, Tetsuo Handa, Andrew Morton, cgroups,
	linux-doc, linux-kernel

On Mon, 25 Sep 2017, Johannes Weiner wrote:

> > True but we want to have the semantic reasonably understandable. And it
> > is quite hard to explain that the oom killer hasn't selected the largest
> > memcg just because it happened to be in a deeper hierarchy which has
> > been configured to cover a different resource.
> 
> Going back to Michal's example, say the user configured the following:
> 
>        root
>       /    \
>      A      D
>     / \
>    B   C
> 
> A global OOM event happens and we find this:
> - A > D
> - B, C, D are oomgroups
> 
> What the user is telling us is that B, C, and D are compound memory
> consumers. They cannot be divided into their task parts from a memory
> point of view.
> 
> However, the user doesn't say the same for A: the A subtree summarizes
> and controls aggregate consumption of B and C, but without groupoom
> set on A, the user says that A is in fact divisible into independent
> memory consumers B and C.
> 
> If we don't have to kill all of A, but we'd have to kill all of D,
> does it make sense to compare the two?
> 

No, I agree that we shouldn't compare sibling memory cgroups based on 
different criteria depending on whether group_oom is set or not.

I think it would be better to compare siblings based on the same criteria 
independent of group_oom if the user has mounted the hierarchy with the 
new mode (I think we all agree that the mount option is needed).  It's 
very easy to describe to the user and the selection is simple to 
understand.  Then, once a cgroup has been chosen as the victim cgroup, 
kill the process with the highest badness, allowing the user to influence 
that with /proc/pid/oom_score_adj just as today, if group_oom is disabled; 
otherwise, kill all eligible processes if enabled.

That, to me, is a very clear semantic and I believe it addresses Roman's 
usecase.  My desire to have oom priorities amongst siblings is so that 
userspace can influence which cgroup is chosen, just as it can influence 
which process is chosen.

I see group_oom as a mechanism to be used when victim selection has 
already been done instead of something that should be considered in the 
policy of victim selection.

> Let's consider an extreme case of this conundrum:
> 
> 	root
>       /     \
>      A       B
>     /|\      |
>  A1-A1000    B1
> 
> Again we find:
> - A > B
> - A1 to A1000 and B1 are oomgroups
> But:
> - A1 to A1000 individually are tiny, B1 is huge
> 
> Going level by level, we'd pick A as the bigger hierarchy in the
> system, and then kill off one of the tiny groups A1 to A1000.
> 
> Conversely, going for biggest consumer regardless of hierarchy, we'd
> compare A1 to A1000 and B1, then pick B1 as the biggest single atomic
> memory consumer in the system and kill all its tasks.
> 

If we compare sibling memcgs independent of group_oom, we don't 
necessarily pick A unless it really is larger than B.
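
The extreme case quoted above can be put into numbers with a small
Python sketch. This is an illustration only: the nested-dict tree, the
helper names and the sizes (1 unit for each of A1 to A1000, 500 units
for B1) are assumptions chosen so that A > B while B1 is the largest
single group.

```python
def build():
    """Hypothetical tree: A holds 1000 tiny groups, B holds one huge one."""
    return {
        "A": {f"A{i}": 1 for i in range(1, 1001)},
        "B": {"B1": 500},
    }


def size(node):
    """Cumulative size: a leaf is an int, an inner node is a dict."""
    if isinstance(node, int):
        return node
    return sum(size(child) for child in node.values())


def pick_level_by_level(tree):
    """At each level descend into the largest sibling until a leaf."""
    name, node = None, tree
    while isinstance(node, dict):
        name, node = max(node.items(), key=lambda kv: size(kv[1]))
    return name, node


def leaves(tree):
    """Yield (name, size) for every leaf group in the tree."""
    for name, node in tree.items():
        if isinstance(node, dict):
            yield from leaves(node)
        else:
            yield name, node


def pick_biggest_leaf(tree):
    """Ignore the hierarchy and compare all leaf groups directly."""
    return max(leaves(tree), key=lambda kv: kv[1])


tree = build()
print(pick_level_by_level(tree))  # ('A1', 1): one tiny group under A
print(pick_biggest_leaf(tree))    # ('B1', 500): the single huge group
```

With these numbers the level-by-level walk frees 1 unit while the flat
comparison frees 500, which is the gap the two positions in this
thread are arguing about.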


* Re: [v8 0/4] cgroup-aware OOM killer
  2017-09-25 22:21                       ` David Rientjes
@ 2017-09-26  8:46                         ` Michal Hocko
  2017-09-26 21:04                           ` David Rientjes
  0 siblings, 1 reply; 78+ messages in thread
From: Michal Hocko @ 2017-09-26  8:46 UTC (permalink / raw)
  To: David Rientjes
  Cc: Johannes Weiner, Roman Gushchin, Tejun Heo, kernel-team,
	linux-mm, Vladimir Davydov, Tetsuo Handa, Andrew Morton, cgroups,
	linux-doc, linux-kernel

On Mon 25-09-17 15:21:03, David Rientjes wrote:
> On Mon, 25 Sep 2017, Johannes Weiner wrote:
> 
> > > True but we want to have the semantic reasonably understandable. And it
> > > is quite hard to explain that the oom killer hasn't selected the largest
> > > memcg just because it happened to be in a deeper hierarchy which has
> > > been configured to cover a different resource.
> > 
> > Going back to Michal's example, say the user configured the following:
> > 
> >        root
> >       /    \
> >      A      D
> >     / \
> >    B   C
> > 
> > A global OOM event happens and we find this:
> > - A > D
> > - B, C, D are oomgroups
> > 
> > What the user is telling us is that B, C, and D are compound memory
> > consumers. They cannot be divided into their task parts from a memory
> > point of view.
> > 
> > However, the user doesn't say the same for A: the A subtree summarizes
> > and controls aggregate consumption of B and C, but without groupoom
> > set on A, the user says that A is in fact divisible into independent
> > memory consumers B and C.
> > 
> > If we don't have to kill all of A, but we'd have to kill all of D,
> > does it make sense to compare the two?
> > 
> 
> No, I agree that we shouldn't compare sibling memory cgroups based on 
> different criteria depending on whether group_oom is set or not.
> 
> I think it would be better to compare siblings based on the same criteria 
> independent of group_oom if the user has mounted the hierarchy with the 
> new mode (I think we all agree that the mount option is needed).  It's 
> very easy to describe to the user and the selection is simple to 
> understand. 

I disagree. Just take the most simplistic example when cgroups reflect
some other higher level organization - e.g. school with teachers,
students and admins as the top level cgroups to control the proper cpu
share load. Now you want to have a fair OOM selection between different
entities. Do you consider selecting students all the time as an expected
behavior just because they are the largest group? This just doesn't
make any sense to me.

> Then, once a cgroup has been chosen as the victim cgroup, 
> kill the process with the highest badness, allowing the user to influence 
> that with /proc/pid/oom_score_adj just as today, if group_oom is disabled; 
> otherwise, kill all eligible processes if enabled.

And now, what should be the semantic of group_oom on an intermediate
(non-leaf) memcg? Why should we compare it to other killable entities?
Roman was mentioning a setup where a _single_ workload consists of a
deeper hierarchy which has to be shut down at once. It absolutely makes
sense to consider the cumulative memory of that hierarchy when we are
going to kill it all.

> That, to me, is a very clear semantic and I believe it addresses Roman's 
> usecase.  My desire to have oom priorities amongst siblings is so that 
> userspace can influence which cgroup is chosen, just as it can influence 
> which process is chosen.

But what you are proposing is something different from oom_score_adj.
That only sets a bias on the killable entities while priorities on
intermediate non-killable memcgs control how the whole oom hierarchy
is traversed. So a non-killable intermediate memcg can hugely influence
what gets killed in the end. This is IMHO a tricky and, I would even
dare to claim, a wrong semantic. I can see priorities being very useful
on killable entities for sure. I am not entirely sure what would be the
best approach yet, and that is why I've suggested postponing that until
after we settle on a simple approach first. Bringing priorities back
into the discussion again will not help to move that forward, I am afraid.

-- 
Michal Hocko
SUSE Labs


* Re: [v8 0/4] cgroup-aware OOM killer
  2017-09-26  8:46                         ` Michal Hocko
@ 2017-09-26 21:04                           ` David Rientjes
  2017-09-27  7:37                             ` Michal Hocko
  0 siblings, 1 reply; 78+ messages in thread
From: David Rientjes @ 2017-09-26 21:04 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Johannes Weiner, Roman Gushchin, Tejun Heo, kernel-team,
	linux-mm, Vladimir Davydov, Tetsuo Handa, Andrew Morton, cgroups,
	linux-doc, linux-kernel

On Tue, 26 Sep 2017, Michal Hocko wrote:

> > No, I agree that we shouldn't compare sibling memory cgroups based on 
> > different criteria depending on whether group_oom is set or not.
> > 
> > I think it would be better to compare siblings based on the same criteria 
> > independent of group_oom if the user has mounted the hierarchy with the 
> > new mode (I think we all agree that the mount option is needed).  It's 
> > very easy to describe to the user and the selection is simple to 
> > understand. 
> 
> I disagree. Just take the most simplistic example when cgroups reflect
> some other higher level organization - e.g. school with teachers,
> students and admins as the top level cgroups to control the proper cpu
> share load. Now you want to have a fair OOM selection between different
> entities. Do you consider selecting students all the time as an expected
> behavior just because they are the largest group? This just doesn't
> make any sense to me.
> 

Are you referring to this?

	root
       /    \
students    admins
/      \    /    \
A      B    C    D

If the cumulative usage of all students exceeds the cumulative usage of 
all admins, yes, the choice is to kill from the /students tree.  This has 
been Roman's design from the very beginning.  If the preference is to kill 
the single largest process, which may be attached to either subtree, you 
would not have opted-in to the new heuristic.

> > Then, once a cgroup has been chosen as the victim cgroup, 
> > kill the process with the highest badness, allowing the user to influence 
> > that with /proc/pid/oom_score_adj just as today, if group_oom is disabled; 
> > otherwise, kill all eligible processes if enabled.
> 
> And now, what should be the semantic of group_oom on an intermediate
> (non-leaf) memcg? Why should we compare it to other killable entities?
> Roman was mentioning a setup where a _single_ workload consists of a
> deeper hierarchy which has to be shut down at once. It absolutely makes
> sense to consider the cumulative memory of that hierarchy when we are
> going to kill it all.
> 

If group_oom is enabled on an intermediate memcg, I think the intuitive 
way to handle it would be that all descendants are also implicitly or 
explicitly group_oom.  It is compared to sibling cgroups based on 
cumulative usage at the time of oom and the largest is chosen and 
iterated.  The point is to separate out the selection heuristic (policy) 
from group_oom (mechanism) so that we don't bias or prefer subtrees based 
on group_oom, which makes this much more complex.

> But what you are proposing is something different from oom_score_adj.
> That only sets bias to the killable entities while priorities on
> intermediate non-killable memcgs controls how the whole oom hierarchy
> is traversed. So a non-killable intermediate memcg can hugely influence
> what gets killed in the end.

Why is there an intermediate non-killable memcg allowed?  Cgroup oom 
priorities should not be allowed to disable oom killing, it should only 
set a priority.  The only reason an intermediate cgroup should be 
non-killable is if there are no processes attached, but I don't think 
anyone is arguing we should just do nothing in that scenario.  The point 
is that the user has influence over the decisionmaking with a per-process 
heuristic with oom_score_adj and should also have influence over the 
decisionmaking with a per-cgroup heuristic.

> This is IMHO a tricky and I would even dare
> to claim a wrong semantic. I can see priorities being very useful on
> killable entities for sure. I am not entirely sure what would be the
> best approach yet and that is why I've suggested that to postpone to
> after we settle with a simple approach first. Bringing priorities back
> to the discussion again will not help to move that forward I am afraid.
> 

I agree to keep it as simple as possible, especially since some users want 
specific victim selection, it should be clear to document, and it 
shouldn't be influenced by some excessive amount of usage in another 
subtree the user has no control over (/admins over /students) to prevent 
the user from defining that it really wants to be the first oom victim or 
the admin from defining it really prefers something else killed first.

My suggestion is that Roman's implementation is clear, well defined, and 
has real-world usecases and it should be the direction that this moves in.  
I think victim selection and group_oom are distinct and should not 
influence the decisionmaking.  I think that oom_priority should influence 
the decisionmaking.

When mounted with the new option, as the oom hierarchy is iterated, 
compare all sibling cgroups regarding cumulative size unless an oom 
priority overrides that (either user specifying it wants to be oom killed 
or admin specifying it prefers something else).  When a victim memcg is 
chosen, use group_oom to determine what should be killed, otherwise choose 
by oom_score_adj.  I can't imagine how this can be any simpler.
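
A rough Python sketch of the walk described in the paragraph above,
for illustration only. Everything in it is an assumption made for the
example and not taken from the patchset: the Task and Cgroup classes,
treating a higher oom_priority as "prefer to kill", stopping the
descent at a group_oom cgroup, and badness() as a crude stand-in for
the kernel's per-task score.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    pid: int
    rss: int
    oom_score_adj: int = 0


@dataclass
class Cgroup:
    name: str
    tasks: list = field(default_factory=list)
    children: list = field(default_factory=list)
    oom_priority: int = 0     # assumed: higher means "kill me first"
    group_oom: bool = False

    def usage(self):
        """Cumulative usage of this cgroup and its descendants."""
        own = sum(t.rss for t in self.tasks)
        return own + sum(c.usage() for c in self.children)


def all_tasks(cg):
    yield from cg.tasks
    for child in cg.children:
        yield from all_tasks(child)


def badness(task):
    # Crude stand-in: the real kernel heuristic is more involved.
    return task.rss + task.oom_score_adj


def select_victims(root):
    """Return (victim cgroup name, list of pids to kill)."""
    cg = root
    # Policy: compare siblings by priority first, cumulative size second.
    while cg.children and not cg.group_oom:
        cg = max(cg.children, key=lambda c: (c.oom_priority, c.usage()))
    # Tasks with oom_score_adj -1000 are never killed.
    eligible = [t for t in all_tasks(cg) if t.oom_score_adj != -1000]
    # Mechanism: group_oom only decides what is killed in the victim.
    if cg.group_oom:
        return cg.name, [t.pid for t in eligible]
    if not eligible:
        return cg.name, []
    return cg.name, [max(eligible, key=badness).pid]


def build():
    a = Cgroup("A", tasks=[Task(1, 30), Task(2, 20)])
    b = Cgroup("B", tasks=[Task(3, 10)])
    c = Cgroup("C", tasks=[Task(4, 40)])
    d = Cgroup("D", tasks=[Task(5, 5)])
    students = Cgroup("students", children=[a, b])
    admins = Cgroup("admins", children=[c, d])
    return Cgroup("root", children=[students, admins])


root = build()
print(select_victims(root))   # ('A', [1])
```

In this made-up tree /students (60) outweighs /admins (45), so the walk
ends in A and kills its largest task. Setting oom_priority on /admins
would redirect the walk there regardless of size, and setting
group_oom on A would return both of its pids without changing which
cgroup is selected.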


* Re: [v8 0/4] cgroup-aware OOM killer
  2017-09-26 21:04                           ` David Rientjes
@ 2017-09-27  7:37                             ` Michal Hocko
  2017-09-27  9:57                               ` Roman Gushchin
  0 siblings, 1 reply; 78+ messages in thread
From: Michal Hocko @ 2017-09-27  7:37 UTC (permalink / raw)
  To: David Rientjes
  Cc: Johannes Weiner, Roman Gushchin, Tejun Heo, kernel-team,
	linux-mm, Vladimir Davydov, Tetsuo Handa, Andrew Morton, cgroups,
	linux-doc, linux-kernel

On Tue 26-09-17 14:04:41, David Rientjes wrote:
> On Tue, 26 Sep 2017, Michal Hocko wrote:
> 
> > > No, I agree that we shouldn't compare sibling memory cgroups based on 
> > > different criteria depending on whether group_oom is set or not.
> > > 
> > > I think it would be better to compare siblings based on the same criteria 
> > > independent of group_oom if the user has mounted the hierarchy with the 
> > > new mode (I think we all agree that the mount option is needed).  It's 
> > > very easy to describe to the user and the selection is simple to 
> > > understand. 
> > 
> > I disagree. Just take the most simplistic example when cgroups reflect
> > some other higher level organization - e.g. school with teachers,
> > students and admins as the top level cgroups to control the proper cpu
> > share load. Now you want to have a fair OOM selection between different
> > entities. Do you consider selecting students all the time as an expected
> > > behavior just because they are the largest group? This just doesn't
> > make any sense to me.
> > 
> 
> Are you referring to this?
> 
> 	root
>        /    \
> students    admins
> /      \    /    \
> A      B    C    D
> 
> If the cumulative usage of all students exceeds the cumulative usage of 
> all admins, yes, the choice is to kill from the /students tree.

Which is wrong IMHO because the number of students is likely much larger
than the number of admins (or teachers), yet it might be one of the admins
that runs away. This example simply shows how comparing siblings depends
heavily on the way you organize the hierarchy rather than on the actual
runaway memory consumers, which it is the primary goal of the OOM killer
to handle.
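
To make the objection concrete, here is a small Python sketch with made-up
usage numbers for the tree quoted above (the numbers and helper names are
purely illustrative). It contrasts a level-by-level sibling comparison with
simply picking the largest leaf.

```python
# Memory usage per leaf cgroup, arbitrary units (made-up numbers).
tree = {
    "students": {"A": 35, "B": 25},
    "admins": {"C": 50, "D": 5},
}


def total(node):
    """Cumulative usage of a leaf (int) or of a whole subtree (dict)."""
    return node if isinstance(node, int) else sum(total(c) for c in node.values())


def pick_by_siblings(node, path=""):
    """At each level, descend into the sibling with the largest cumulative usage."""
    if isinstance(node, int):
        return path
    name = max(node, key=lambda n: total(node[n]))
    return pick_by_siblings(node[name], f"{path}/{name}")


def pick_largest_leaf(node, path=""):
    """Ignore the hierarchy and return (usage, path) of the largest leaf."""
    if isinstance(node, int):
        return node, path
    return max(pick_largest_leaf(child, f"{path}/{name}")
               for name, child in node.items())


print(pick_by_siblings(tree))      # /students/A
print(pick_largest_leaf(tree)[1])  # /admins/C
```

Here /students wins the top-level comparison (60 vs 55) even though the
single largest consumer, C, sits under /admins.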

> This has been Roman's design from the very beginning.

I suspect this was the case because deeper hierarchies for
organizational purposes haven't been considered.

> If the preference is to kill 
> the single largest process, which may be attached to either subtree, you 
> would not have opted-in to the new heuristic.

I believe you are making a wrong assumption here. The container cleanup
is a sound reason to opt in, and deeper hierarchies are simply required in
the cgroup v2 world where you do not have separate hierarchies.
 
> > > Then, once a cgroup has been chosen as the victim cgroup, 
> > > kill the process with the highest badness, allowing the user to influence 
> > > that with /proc/pid/oom_score_adj just as today, if group_oom is disabled; 
> > > otherwise, kill all eligible processes if enabled.
> > 
> > And now, what should be the semantic of group_oom on an intermediate
> > (non-leaf) memcg? Why should we compare it to other killable entities?
> > Roman was mentioning a setup where a _single_ workload consists of a
> > deeper hierarchy which has to be shut down at once. It absolutely makes
> > sense to consider the cumulative memory of that hierarchy when we are
> > going to kill it all.
> > 
> 
> If group_oom is enabled on an intermediate memcg, I think the intuitive 
> way to handle it would be that all descendants are also implicitly or 
> explicitly group_oom.

This is an interesting point. I would tend to agree here. If somebody
requires all-in cleanup higher up the hierarchy, it feels strange that a
subtree would disagree (e.g. during memcg oom on the subtree). I can
hardly see a usecase that would really need a different group_oom policy
depending on where in the hierarchy the oom happened, to be honest.
Roman?

> It is compared to sibling cgroups based on 
> cumulative usage at the time of oom and the largest is chosen and 
> iterated.  The point is to separate out the selection heuristic (policy) 
> from group_oom (mechanism) so that we don't bias or prefer subtrees based 
> on group_oom, which makes this much more complex.

I disagree. group_oom determines the killable entity, and making a decision
based on non-killable entities is weird, as already pointed out.
-- 
Michal Hocko
SUSE Labs

* Re: [v8 0/4] cgroup-aware OOM killer
  2017-09-27  7:37                             ` Michal Hocko
@ 2017-09-27  9:57                               ` Roman Gushchin
  0 siblings, 0 replies; 78+ messages in thread
From: Roman Gushchin @ 2017-09-27  9:57 UTC (permalink / raw)
  To: Michal Hocko
  Cc: David Rientjes, Johannes Weiner, Tejun Heo, kernel-team,
	linux-mm, Vladimir Davydov, Tetsuo Handa, Andrew Morton, cgroups,
	linux-doc, linux-kernel

On Wed, Sep 27, 2017 at 09:37:44AM +0200, Michal Hocko wrote:
> On Tue 26-09-17 14:04:41, David Rientjes wrote:
> > On Tue, 26 Sep 2017, Michal Hocko wrote:
> > 
> > > > No, I agree that we shouldn't compare sibling memory cgroups based on 
> > > > different criteria depending on whether group_oom is set or not.
> > > > 
> > > > I think it would be better to compare siblings based on the same criteria 
> > > > independent of group_oom if the user has mounted the hierarchy with the 
> > > > new mode (I think we all agree that the mount option is needed).  It's 
> > > > very easy to describe to the user and the selection is simple to 
> > > > understand. 
> > > 
> > > I disagree. Just take the most simplistic example when cgroups reflect
> > > some other higher level organization - e.g. school with teachers,
> > > students and admins as the top level cgroups to control the proper cpu
> > > share load. Now you want to have a fair OOM selection between different
> > > entities. Do you consider selecting students all the time as an expected
> > > > behavior just because they are the largest group? This just doesn't
> > > make any sense to me.
> > > 
> > 
> > Are you referring to this?
> > 
> > 	root
> >        /    \
> > students    admins
> > /      \    /    \
> > A      B    C    D
> > 
> > If the cumulative usage of all students exceeds the cumulative usage of 
> > all admins, yes, the choice is to kill from the /students tree.
> 
> Which is wrong IMHO because the number of students is likely much larger
> than the number of admins (or teachers), yet it might be one of the admins
> that runs away. This example simply shows how comparing siblings depends
> heavily on the way you organize the hierarchy rather than on the actual
> runaway memory consumers, which it is the primary goal of the OOM killer
> to handle.
> 
> > This has been Roman's design from the very beginning.
> 
> I suspect this was the case because deeper hierarchies for
> organizational purposes haven't been considered.
> 
> > If the preference is to kill 
> > the single largest process, which may be attached to either subtree, you 
> > would not have opted-in to the new heuristic.
> 
> I believe you are making a wrong assumption here. The container cleanup
> is a sound reason to opt in, and deeper hierarchies are simply required in
> the cgroup v2 world where you do not have separate hierarchies.
>  
> > > > Then, once a cgroup has been chosen as the victim cgroup, 
> > > > kill the process with the highest badness, allowing the user to influence 
> > > > that with /proc/pid/oom_score_adj just as today, if group_oom is disabled; 
> > > > otherwise, kill all eligible processes if enabled.
> > > 
> > > And now, what should be the semantic of group_oom on an intermediate
> > > (non-leaf) memcg? Why should we compare it to other killable entities?
> > > Roman was mentioning a setup where a _single_ workload consists of a
> > > deeper hierarchy which has to be shut down at once. It absolutely makes
> > > sense to consider the cumulative memory of that hierarchy when we are
> > > going to kill it all.
> > > 
> > 
> > If group_oom is enabled on an intermediate memcg, I think the intuitive 
> > way to handle it would be that all descendants are also implicitly or 
> > explicitly group_oom.
> 
> This is an interesting point. I would tend to agree here. If somebody
> requires all-in cleanup higher up the hierarchy, it feels strange that a
> subtree would disagree (e.g. during memcg oom on the subtree). I can
> hardly see a usecase that would really need a different group_oom policy
> depending on where in the hierarchy the oom happened, to be honest.
> Roman?

Yes, I'd say that it's strange to apply settings from outside the OOMing
cgroup to the subtree, but actually it's not. The oom_group setting should
basically mean that the OOM killer will not kill a random task in the subtree.
And it doesn't matter if it was global or memcg-wide OOM.

Applied to v9. Thanks!
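
A minimal Python sketch of the inheritance rule discussed above, assuming
the setting is looked up by walking from the victim cgroup towards the root
regardless of where the OOM was triggered. The class and the function name
effective_oom_group are hypothetical and only model the idea; this is not
the v9 code.

```python
class Memcg:
    def __init__(self, name, parent=None, oom_group=False):
        self.name = name
        self.parent = parent
        self.oom_group = oom_group


def effective_oom_group(cg):
    """True if the cgroup itself or any ancestor has oom_group set."""
    while cg is not None:
        if cg.oom_group:
            return True
        cg = cg.parent
    return False


root = Memcg("root")
workload = Memcg("workload", parent=root, oom_group=True)
worker = Memcg("worker", parent=workload)   # no explicit setting
other = Memcg("other", parent=root)

print(effective_oom_group(worker))  # True, inherited from /workload
print(effective_oom_group(other))   # False
```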

* Re: [v8 0/4] cgroup-aware OOM killer
  2017-09-11 20:44 ` [v8 0/4] " David Rientjes
  2017-09-13 12:29   ` Michal Hocko
@ 2017-09-21 14:21   ` Johannes Weiner
  2017-09-21 21:17     ` David Rientjes
  1 sibling, 1 reply; 78+ messages in thread
From: Johannes Weiner @ 2017-09-21 14:21 UTC (permalink / raw)
  To: David Rientjes
  Cc: Roman Gushchin, linux-mm, Michal Hocko, Vladimir Davydov,
	Tetsuo Handa, Andrew Morton, Tejun Heo, kernel-team, cgroups,
	linux-doc, linux-kernel

On Mon, Sep 11, 2017 at 01:44:39PM -0700, David Rientjes wrote:
> On Mon, 11 Sep 2017, Roman Gushchin wrote:
> 
> > This patchset makes the OOM killer cgroup-aware.
> > 
> > v8:
> >   - Do not kill tasks with OOM_SCORE_ADJ -1000
> >   - Make the whole thing opt-in with cgroup mount option control
> >   - Drop oom_priority for further discussions
> 
> Nack, we specifically require oom_priority for this to function correctly, 
> otherwise we cannot prefer to kill from low priority leaf memcgs as 
> required.  v8 appears to implement new functionality that we want, to 
> compare two memcgs based on usage, but without the ability to influence 
> that decision to protect important userspace, so now I'm in a position 
> where (1) nothing has changed if I don't use the new mount option or (2) I 
> get completely different oom kill selection with the new mount option but 
> not the ability to influence it.  I was much happier with the direction 
> that v7 was taking, but since v8 causes us to regress without the ability 
> to change memcg priority, this has to be nacked.

That's a ridiculous nak.

The fact that this patch series doesn't solve your particular problem
is not a technical argument to *reject* somebody else's work to solve
a different problem. It's not a regression when behavior is completely
unchanged unless you explicitly opt into a new functionality.

So let's stay reasonable here.

The patch series has merit as it currently stands. It makes OOM
killing in a cgrouped system fairer and less surprising. Whether you
have the ability to influence this in a new way is an entirely
separate discussion. It's one that involves ABI and user guarantees.

Right now Roman's patches make no guarantees on how the cgroup tree is
descended. But once we define an interface for prioritization, it
locks the victim algorithm into place to a certain extent.

It also involves a discussion about how much control userspace should
have over OOM killing in the first place. It's a last-minute effort to
save the kernel from deadlocking on memory. Whether that is the time
and place to have userspace make clever resource management decisions
is an entirely different thing than what Roman is doing.

But this patch series doesn't prevent any such future discussion and
implementations, and it's not useless without it. So let's not
conflate these two things, and hold the priority patch for now.

Thanks.

* Re: [v8 0/4] cgroup-aware OOM killer
  2017-09-21 14:21   ` Johannes Weiner
@ 2017-09-21 21:17     ` David Rientjes
  2017-09-21 21:51       ` Johannes Weiner
  2017-09-22 15:44       ` Tejun Heo
  0 siblings, 2 replies; 78+ messages in thread
From: David Rientjes @ 2017-09-21 21:17 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Roman Gushchin, linux-mm, Michal Hocko, Vladimir Davydov,
	Tetsuo Handa, Andrew Morton, Tejun Heo, kernel-team, cgroups,
	linux-doc, linux-kernel

On Thu, 21 Sep 2017, Johannes Weiner wrote:

> That's a ridiculous nak.
> 
> The fact that this patch series doesn't solve your particular problem
> is not a technical argument to *reject* somebody else's work to solve
> a different problem. It's not a regression when behavior is completely
> unchanged unless you explicitly opt into a new functionality.
> 
> So let's stay reasonable here.
> 

The issue is that if you opt-in to the new feature, then you are forced to 
change /proc/pid/oom_score_adj of all processes attached to a cgroup that 
you do not want oom killed based on size to be oom disabled.  The kernel 
provides no other remedy without oom priorities since the new feature 
would otherwise disregard oom_score_adj.  In that case, userspace is 
racing in two ways: (1) when a process is attached to a memcg you want to 
protect from oom kill (a first class, vital, large memory hog job), to set 
it to oom disabled, and (2) when other cgroups must be adjusted to make 
them eligible after the first oom kill.

It doesn't have anything to do with my particular usecase, but rather the 
ability of userspace to influence the decisions of the kernel.  Previous 
to this patchset, when selection is done based on process size, userspace 
has full control over selection.  After this patchset, userspace has no 
control other than setting all processes to be oom disabled if the largest 
memory consumer is to be protected.  Roman's memory.oom_priority provides 
a perfect solution for userspace to be able to influence this decision 
making and causes no change in behavior for users who choose not to tune 
memory.oom_priority.  The nack originates from the general need for 
userspace influence over oom victim selection and to avoid userspace 
needing to take the rather drastic measure of setting all processes to be 
oom disabled to prevent oom kill in kernels before oom priorities are 
introduced.

> The patch series has merit as it currently stands. It makes OOM
> killing in a cgrouped system fairer and less surprising. Whether you
> have the ability to influence this in a new way is an entirely
> separate discussion. It's one that involves ABI and user guarantees.
> 
> Right now Roman's patches make no guarantees on how the cgroup tree is
> descended. But once we define an interface for prioritization, it
> locks the victim algorithm into place to a certain extent.
> 

The patchset compares memory cgroup size relative to sibling cgroups only, 
the same comparison for memory.oom_priority.  There is a guarantee 
provided on how cgroup size is compared in select_victim_memcg(), it 
hierarchically accumulates the "size" from leaf nodes up to the root memcg 
and then iterates the tree comparing sizes between sibling cgroups to 
choose a victim memcg.  That algorithm could be more elaborately described 
in the documentation, but we simply cannot change the implementation of 
select_victim_memcg() later even without oom priorities since users cannot 
get inconsistent results after opting into a feature between kernel 
versions.  I believe the selection criteria should be implemented to be 
deterministic, as select_victim_memcg() does, and the documentation should 
fully describe what the selection criteria is, and then allow the user to 
decide.

> It also involves a discussion about how much control userspace should
> have over OOM killing in the first place. It's a last-minute effort to
> save the kernel from deadlocking on memory. Whether that is the time
> and place to have userspace make clever resource management decisions
> is an entirely different thing than what Roman is doing.
> 
> But this patch series doesn't prevent any such future discussion and
> implementations, and it's not useless without it. So let's not
> conflate these two things, and hold the priority patch for now.
> 

Roman is planning on introducing memory.oom_priority back into the 
patchset per https://marc.info/?l=linux-kernel&m=150574701126877 and I 
agree with the very clear semantic that it introduces: to have the 
size-based comparison use the same rules as the userspace priority 
comparison.  It's very powerful and I'm happy to ack the final version 
that he plans on posting.

* Re: [v8 0/4] cgroup-aware OOM killer
  2017-09-21 21:17     ` David Rientjes
@ 2017-09-21 21:51       ` Johannes Weiner
  2017-09-22 20:53         ` David Rientjes
  2017-09-22 15:44       ` Tejun Heo
  1 sibling, 1 reply; 78+ messages in thread
From: Johannes Weiner @ 2017-09-21 21:51 UTC (permalink / raw)
  To: David Rientjes
  Cc: Roman Gushchin, linux-mm, Michal Hocko, Vladimir Davydov,
	Tetsuo Handa, Andrew Morton, Tejun Heo, kernel-team, cgroups,
	linux-doc, linux-kernel

On Thu, Sep 21, 2017 at 02:17:25PM -0700, David Rientjes wrote:
> On Thu, 21 Sep 2017, Johannes Weiner wrote:
> 
> > That's a ridiculous nak.
> > 
> > The fact that this patch series doesn't solve your particular problem
> > is not a technical argument to *reject* somebody else's work to solve
> > a different problem. It's not a regression when behavior is completely
> > unchanged unless you explicitly opt into a new functionality.
> > 
> > So let's stay reasonable here.
> > 
> 
> The issue is that if you opt-in to the new feature, then you are forced to 
> change /proc/pid/oom_score_adj of all processes attached to a cgroup that 
> you do not want oom killed based on size to be oom disabled.

You're assuming that most people would want to influence the oom
behavior in the first place. I think the opposite is the case: most
people don't care as long as the OOM killer takes the intent the user
has expressed wrt runtime containerization/grouping into account.

> The kernel provides no other remedy without oom priorities since the
> new feature would otherwise disregard oom_score_adj.

As of v8, it respects this setting and doesn't kill min score tasks.

> The nack originates from the general need for userspace influence
> over oom victim selection and to avoid userspace needing to take the
> rather drastic measure of setting all processes to be oom disabled
> to prevent oom kill in kernels before oom priorities are introduced.

As I said, we can discuss this in a separate context. Because again, I
really don't see how the lack of configurability in an opt-in feature
would diminish its value for many people who don't even care to adjust
and influence this behavior.

> > The patch series has merit as it currently stands. It makes OOM
> > killing in a cgrouped system fairer and less surprising. Whether you
> > have the ability to influence this in a new way is an entirely
> > separate discussion. It's one that involves ABI and user guarantees.
> > 
> > Right now Roman's patches make no guarantees on how the cgroup tree is
> > descended. But once we define an interface for prioritization, it
> > locks the victim algorithm into place to a certain extent.
> > 
> 
> The patchset compares memory cgroup size relative to sibling cgroups only, 
> the same comparison for memory.oom_priority.  There is a guarantee 
> provided on how cgroup size is compared in select_victim_memcg(), it 
> hierarchically accumulates the "size" from leaf nodes up to the root memcg 
> and then iterates the tree comparing sizes between sibling cgroups to 
> choose a victim memcg.  That algorithm could be more elaborately described 
> in the documentation, but we simply cannot change the implementation of 
> select_victim_memcg() later even without oom priorities since users cannot 
> get inconsistent results after opting into a feature between kernel 
> versions.  I believe the selection criteria should be implemented to be 
> deterministic, as select_victim_memcg() does, and the documentation should 
> fully describe what the selection criteria is, and then allow the user to 
> decide.

I wholeheartedly disagree. We have changed the behavior multiple times
in the past. In fact, you have arguably done the most drastic changes
to the algorithm since the OOM killer was first introduced. E.g.

	a63d83f427fb oom: badness heuristic rewrite

And that's completely fine. Because this thing is not a resource
management tool for userspace, it's the kernel saving itself. At best
in a manner that's not too surprising to userspace.

To me, your argument behind the NAK still boils down to "this doesn't
support my highly specialized usecase." But since it doesn't prohibit
your usecase - which isn't even supported upstream, btw - this really
doesn't carry much weight.

I'd say if you want configurability on top of Roman's code, please
submit patches and push the case for these in a separate effort.

Thanks

* Re: [v8 0/4] cgroup-aware OOM killer
  2017-09-21 21:51       ` Johannes Weiner
@ 2017-09-22 20:53         ` David Rientjes
  0 siblings, 0 replies; 78+ messages in thread
From: David Rientjes @ 2017-09-22 20:53 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Roman Gushchin, linux-mm, Michal Hocko, Vladimir Davydov,
	Tetsuo Handa, Andrew Morton, Tejun Heo, kernel-team, cgroups,
	linux-doc, linux-kernel

On Thu, 21 Sep 2017, Johannes Weiner wrote:

> > The issue is that if you opt-in to the new feature, then you are forced to 
> > change /proc/pid/oom_score_adj of all processes attached to a cgroup that 
> > you do not want oom killed based on size to be oom disabled.
> 
> You're assuming that most people would want to influence the oom
> behavior in the first place. I think the opposite is the case: most
> people don't care as long as the OOM killer takes the intent the user
> has expressed wrt runtime containerization/grouping into account.
> 

If you do not want to influence the oom behavior, do not change 
memory.oom_priority from its default.  It's that simple.

> > The kernel provides no other remedy without oom priorities since the
> > new feature would otherwise disregard oom_score_adj.
> 
> As of v8, it respects this setting and doesn't kill min score tasks.
> 

That's the issue.  To protect a memory cgroup from being oom killed in a 
system oom condition, you need to change oom_score_adj of *all* processes 
attached to be oom disabled.  Then, you have a huge problem in memory 
cgroup oom conditions because nothing can be killed in that hierarchy 
itself.

> > The patchset compares memory cgroup size relative to sibling cgroups only, 
> > the same comparison for memory.oom_priority.  There is a guarantee 
> > provided on how cgroup size is compared in select_victim_memcg(), it 
> > hierarchically accumulates the "size" from leaf nodes up to the root memcg 
> > and then iterates the tree comparing sizes between sibling cgroups to 
> > choose a victim memcg.  That algorithm could be more elaborately described 
> > in the documentation, but we simply cannot change the implementation of 
> > select_victim_memcg() later even without oom priorities since users cannot 
> > get inconsistent results after opting into a feature between kernel 
> > versions.  I believe the selection criteria should be implemented to be 
> > deterministic, as select_victim_memcg() does, and the documentation should 
> > fully describe what the selection criteria is, and then allow the user to 
> > decide.
> 
> I wholeheartedly disagree. We have changed the behavior multiple times
> in the past. In fact, you have arguably done the most drastic changes
> to the algorithm since the OOM killer was first introduced. E.g.
> 
> 	a63d83f427fb oom: badness heuristic rewrite
> 
> And that's completely fine. Because this thing is not a resource
> management tool for userspace, it's the kernel saving itself. At best
> in a manner that's not too surprising to userspace.
> 

When I did that, I had to add /proc/pid/oom_score_adj to allow userspace 
to influence selection.  We came up with /proc/pid/oom_score_adj when 
working with kde, openssh, chromium, and udev because they cared about the 
ability to influence the decisionmaking.  I'm perfectly happy with the new 
heuristic presented in this patchset, I simply want userspace to be able 
to influence it, if it desires.  Requiring userspace to set all processes 
to be oom disabled to protect a hierarchy is totally and completely 
broken.  It livelocks the memory cgroup if it is oom itself.
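
The problem can be sketched in a few lines of Python. This is only a model:
the task dictionaries and pick_victim() are hypothetical, and what the kernel
actually does when no task is eligible is not modelled here. It just shows
that once every task in a cgroup is oom disabled, a memcg-local OOM has no
candidate left to kill.

```python
OOM_SCORE_ADJ_MIN = -1000  # "oom disabled"


def pick_victim(tasks):
    """Return the name of the largest eligible task, or None if nothing is killable."""
    eligible = [t for t in tasks if t["oom_score_adj"] != OOM_SCORE_ADJ_MIN]
    if not eligible:
        return None
    return max(eligible, key=lambda t: t["rss"])["name"]


# Every task was set to oom disabled to protect the cgroup from system OOM.
protected = [
    {"name": "db", "rss": 700, "oom_score_adj": OOM_SCORE_ADJ_MIN},
    {"name": "db-helper", "rss": 300, "oom_score_adj": OOM_SCORE_ADJ_MIN},
]

print(pick_victim(protected))  # None: the cgroup's own OOM cannot make progress
```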

> To me, your argument behind the NAK still boils down to "this doesn't
> support my highly specialized usecase." But since it doesn't prohibit
> your usecase - which isn't even supported upstream, btw - this really
> doesn't carry much weight.
> 
> I'd say if you want configurability on top of Roman's code, please
> submit patches and push the case for these in a separate effort.
> 

Roman implemented memory.oom_priority himself, it has my Tested-by, and it 
allows users to protect high priority memory cgroups while still using the 
size-based comparison, which we very much desire, for all other cgroups.

* Re: [v8 0/4] cgroup-aware OOM killer
  2017-09-21 21:17     ` David Rientjes
  2017-09-21 21:51       ` Johannes Weiner
@ 2017-09-22 15:44       ` Tejun Heo
  2017-09-22 20:39         ` David Rientjes
  1 sibling, 1 reply; 78+ messages in thread
From: Tejun Heo @ 2017-09-22 15:44 UTC (permalink / raw)
  To: David Rientjes
  Cc: Johannes Weiner, Roman Gushchin, linux-mm, Michal Hocko,
	Vladimir Davydov, Tetsuo Handa, Andrew Morton, kernel-team,
	cgroups, linux-doc, linux-kernel

Hello, David.

On Thu, Sep 21, 2017 at 02:17:25PM -0700, David Rientjes wrote:
> It doesn't have anything to do with my particular usecase, but rather the 
> ability of userspace to influence the decisions of the kernel.  Previous 
> to this patchset, when selection is done based on process size, userspace 
> has full control over selection.  After this patchset, userspace has no 
> control other than setting all processes to be oom disabled if the largest 
> memory consumer is to be protected.  Roman's memory.oom_priority provides 
> a perfect solution for userspace to be able to influence this decision 
> making and causes no change in behavior for users who choose not to tune 
> memory.oom_priority.  The nack originates from the general need for 
> userspace influence over oom victim selection and to avoid userspace 
> needing to take the rather drastic measure of setting all processes to be 
> oom disabled to prevent oom kill in kernels before oom priorities are 
> introduced.

Overall, I think that OOM killing is the wrong place to implement
sophisticated intelligence in.  It's too late to be smart - the
workload already has suffered significantly and there's only a very
limited amount of computing which can be performed.  That said, if
there's a useful and general enough mechanism to configure OOM killer
behavior from userland, that can definitely be useful.

> The patchset compares memory cgroup size relative to sibling cgroups only, 
> the same comparison for memory.oom_priority.  There is a guarantee 
> provided on how cgroup size is compared in select_victim_memcg(), it 
> hierarchically accumulates the "size" from leaf nodes up to the root memcg 
> and then iterates the tree comparing sizes between sibling cgroups to 
> choose a victim memcg.  That algorithm could be more elaborately described 
> in the documentation, but we simply cannot change the implementation of 
> select_victim_memcg() later even without oom priorities since users cannot 
> get inconsistent results after opting into a feature between kernel 
> versions.  I believe the selection criteria should be implemented to be 
> deterministic, as select_victim_memcg() does, and the documentation should 
> fully describe what the selection criteria is, and then allow the user to 
> decide.

We even change the whole scheduling behaviors and try really hard to
not get locked into specific implementation details which exclude
future improvements.  Guaranteeing OOM killing selection would be
crazy.  Why would we prevent ourselves from doing things better in the
future?  We aren't talking about the semantics of read(2) here.  This
is a kernel emergency mechanism to avoid deadlock at the last moment.

> Roman is planning on introducing memory.oom_priority back into the 
> patchset per https://marc.info/?l=linux-kernel&m=150574701126877 and I 
> agree with the very clear semantic that it introduces: to have the 
> size-based comparison use the same rules as the userspace priority 
> comparison.  It's very powerful and I'm happy to ack the final version 
> that he plans on posting.

To me, the proposed oom_priority mechanism seems too limited and makes
the error of tightly coupling the hierarchical behavior of resource
distribution with OOM victim selection.  They can be related but are
not the same and coupling them together in the kernel interface is
likely a mistake which will lead to long term pains that we can't
easily get out of.

Here's a really simple use case.  Imagine a system which hosts two
containers of services and one is somewhat favored over the other and
wants to set up cgroup hierarchy so that resources are split at the
top level between the two containers.  oom_priority is set accordingly
too.  Let's say a low priority maintenance job in the higher priority
container goes berserk, as they often do, and pushes the system into
OOM.

With the proposed static oom_priority mechanism, the only
configuration which can be expressed is "kill all of the lower top
level subtree before any of the higher one", which is a silly
restriction leading to silly behavior and a direct result of
conflating resource distribution network with level-by-level OOM
killing decision.
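
A small Python model of this scenario, with made-up numbers. The tuple
layout, the helper names and the assumption that a larger oom_priority means
"kill first" are illustrative only and are not taken from the patch.

```python
# Each node is (oom_priority, body); body is a usage int or a dict of children.
# Assumption for this sketch: a larger oom_priority means "preferred victim".
tree = {
    "favored": (0, {"service": (0, 20), "maintenance": (10, 90)}),
    "other": (10, {"web": (0, 30), "batch": (0, 15)}),
}


def size(node):
    """Cumulative usage of a node."""
    _prio, body = node
    return body if isinstance(body, int) else sum(size(c) for c in body.values())


def pick_level_by_level(children, path=""):
    """At each level pick by (priority, cumulative size), then descend."""
    name = max(children, key=lambda n: (children[n][0], size(children[n])))
    _prio, body = children[name]
    path = f"{path}/{name}"
    return path if isinstance(body, int) else pick_level_by_level(body, path)


print(pick_level_by_level(tree))  # /other/web
```

The runaway job /favored/maintenance is marked as a preferred victim inside
its own container, but the top-level priorities are compared first, so a
cgroup under /other is selected instead.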

If we want to allow users to steer OOM killing, I suspect that it
should be aligned at delegation boundaries rather than on cgroup
hierarchy itself.  We can discuss that but it is a separate
discussion.

The mechanism being proposed is fundamentally flawed.  You can't push
that in by nacking other improvements.

Thanks.

-- 
tejun

* Re: [v8 0/4] cgroup-aware OOM killer
  2017-09-22 15:44       ` Tejun Heo
@ 2017-09-22 20:39         ` David Rientjes
  2017-09-22 21:05           ` Tejun Heo
  0 siblings, 1 reply; 78+ messages in thread
From: David Rientjes @ 2017-09-22 20:39 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Johannes Weiner, Roman Gushchin, linux-mm, Michal Hocko,
	Vladimir Davydov, Tetsuo Handa, Andrew Morton, kernel-team,
	cgroups, linux-doc, linux-kernel

On Fri, 22 Sep 2017, Tejun Heo wrote:

> > It doesn't have anything to do with my particular usecase, but rather the 
> > ability of userspace to influence the decisions of the kernel.  Previous 
> > to this patchset, when selection is done based on process size, userspace 
> > has full control over selection.  After this patchset, userspace has no 
> > control other than setting all processes to be oom disabled if the largest 
> > memory consumer is to be protected.  Roman's memory.oom_priority provides 
> > a perfect solution for userspace to be able to influence this decision 
> > making and causes no change in behavior for users who choose not to tune 
> > memory.oom_priority.  The nack originates from the general need for 
> > userspace influence over oom victim selection and to avoid userspace 
> > needing to take the rather drastic measure of setting all processes to be 
> > oom disabled to prevent oom kill in kernels before oom priorities are 
> > introduced.
> 
> Overall, I think that OOM killing is the wrong place to implement
> sophisticated intelligence in.  It's too late to be smart - the
> workload already has suffered significantly and there's only a very
> limited amount of computing which can be performed.  That said, if
> there's a useful and general enough mechanism to configure OOM killer
> behavior from userland, that can definitely be useful.
> 

What is under discussion is a new way to compare sibling cgroups when 
selecting a victim for oom kill.  It's a new heuristic based on a 
characteristic of the memory cgroup rather than the individual process.  
We want this behavior that the patchset implements.  The only desire is a 
way for userspace to influence that decision making in the same way that 
/proc/pid/oom_score_adj allows userspace to influence the current 
heuristic.

The current heuristic based on processes is coupled with per-process 
/proc/pid/oom_score_adj.  The proposed heuristic has no ability to be 
influenced by userspace, and it needs one.  
The proposed heuristic based on memory cgroups coupled with Roman's 
per-memcg memory.oom_priority is appropriate and needed.  It is not 
"sophisticated intelligence," it merely allows userspace to protect vital 
memory cgroups when opting into the new features (cgroups compared based 
on size and memory.oom_group) that we very much want.

> We even change the whole scheduling behaviors and try really hard to
> not get locked into specific implementation details which exclude
> future improvements.  Guaranteeing OOM killing selection would be
> crazy.  Why would we prevent ourselves from doing things better in the
> future?  We aren't talking about the semantics of read(2) here.  This
> is a kernel emergency mechanism to avoid deadlock at the last moment.
> 

We merely want other memory cgroups to be oom killed on system oom 
conditions before important ones, regardless of whether the important one 
is using more memory than the others, which matters because of the new 
heuristic this patchset introduces.  This is exactly the same as 
/proc/pid/oom_score_adj for the current heuristic.

> Here's a really simple use case.  Imagine a system which hosts two
> containers of services and one is somewhat favored over the other and
> wants to set up cgroup hierarchy so that resources are split at the
> top level between the two containers.  oom_priority is set accordingly
> too.  Let's say a low priority maintenance job in the higher priority
> container goes berserk, as they often do, and pushes the system into
> OOM.
> 
> With the proposed static oom_priority mechanism, the only
> configuration which can be expressed is "kill all of the lower top
> level subtree before any of the higher one", which is a silly
> restriction leading to silly behavior and a direct result of
> conflating resource distribution network with level-by-level OOM
> killing decision.
> 

The problem you're describing is an issue with the top-level limits after 
this patchset is merged, not memory.oom_priority at all.

If they are truly split evenly, this patchset kills the largest process 
from the hierarchy with the most charged memory.  That's unchanged if the 
two priorities are equal.  By changing the priority to be more preferred 
for a hierarchy, you indeed prefer oom kills from the lower priority 
hierarchy.  You've opted in.  One hierarchy is more important than the 
other, regardless of any hypothetical low priority maintenance job going 
berserk.
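
The selection described above can be modelled in a few lines.  This is 
a minimal sketch, not the patchset's code: the Cgroup class, 
select_victim() and two_containers() are hypothetical names, sizes are 
arbitrary units, tasks are assumed to live only in leaf cgroups, and the 
sign convention (a lower oom_priority value is selected first, ties go to 
the larger sibling) is an assumption made to match the wording in this 
thread.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Cgroup:
    name: str
    oom_priority: int = 0  # assumption: lower value is killed first
    tasks: Dict[str, int] = field(default_factory=dict)  # task -> charged size
    children: List["Cgroup"] = field(default_factory=list)

    def charged(self) -> int:
        """Memory charged to this cgroup and all of its descendants."""
        return sum(self.tasks.values()) + sum(c.charged() for c in self.children)


def select_victim(root: Cgroup) -> Optional[Tuple[str, str]]:
    """Walk down from the root, comparing only siblings at each level."""
    node = root
    while node.children:
        populated = [c for c in node.children if c.charged() > 0]
        if not populated:
            break
        # Lowest priority wins; equal priorities fall back to largest usage.
        node = min(populated, key=lambda c: (c.oom_priority, -c.charged()))
    if not node.tasks:
        return None
    # Inside the chosen leaf, pick the largest task.
    return node.name, max(node.tasks, key=node.tasks.get)


def two_containers(high_prio: int, low_prio: int) -> Cgroup:
    """The two-container example: a runaway job inside the favored container."""
    high = Cgroup("high", high_prio, children=[
        Cgroup("high/service", tasks={"web": 4}),
        Cgroup("high/maint", tasks={"backup": 50}),
    ])
    low = Cgroup("low", low_prio, tasks={"worker": 6})
    return Cgroup("root", children=[high, low])


if __name__ == "__main__":
    print(select_victim(two_containers(0, 0)))   # ('high/maint', 'backup')
    print(select_victim(two_containers(10, 0)))  # ('low', 'worker')
```

With equal priorities the runaway job is selected because its hierarchy 
has the most charged memory.  Once the first container is given the 
higher priority, the comparison at the top level is decided before the 
runaway job is ever looked at, which is the behavior both sides of this 
thread are describing.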

If you have this low priority maintenance job charging memory to the high 
priority hierarchy, you're already misconfigured unless you adjust 
/proc/pid/oom_score_adj because it will oom kill any larger process than 
itself in today's kernels anyway.

A better configuration would be to attach this hypothetical low priority 
maintenance job to its own sibling cgroup with its own memory limit to 
avoid exactly that problem: it going berserk and charging too much memory 
to the high priority container, which results in one of its processes 
getting oom killed.
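
As a rough illustration of that configuration, the sketch below creates 
a sibling cgroup, sets its limit and moves the job into it.  memory.max 
and cgroup.procs are the standard cgroup v2 interface files; the 
isolate_job() helper, the paths and the numbers are hypothetical, and 
the sketch assumes the memory controller is already enabled in the 
parent's cgroup.subtree_control and that the caller may write there.

```python
from pathlib import Path


def isolate_job(parent: Path, name: str, pid: int, limit_bytes: int) -> Path:
    """Put one process into its own child cgroup with a hard memory limit."""
    cg = parent / name
    cg.mkdir(exist_ok=True)                           # creates the cgroup
    (cg / "memory.max").write_text(f"{limit_bytes}\n")  # hard limit in bytes
    (cg / "cgroup.procs").write_text(f"{pid}\n")        # migrate the process
    return cg


if __name__ == "__main__":
    # Hypothetical layout: /sys/fs/cgroup/high is the favored container.
    isolate_job(Path("/sys/fs/cgroup/high"), "maint", 1234, 512 * 1024 * 1024)
```

With its own memory.max, the job hits its limit inside its own cgroup 
before it can grow the parent's usage.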

* Re: [v8 0/4] cgroup-aware OOM killer
  2017-09-22 20:39         ` David Rientjes
@ 2017-09-22 21:05           ` Tejun Heo
  2017-09-23  8:16             ` David Rientjes
  0 siblings, 1 reply; 78+ messages in thread
From: Tejun Heo @ 2017-09-22 21:05 UTC (permalink / raw)
  To: David Rientjes
  Cc: Johannes Weiner, Roman Gushchin, linux-mm, Michal Hocko,
	Vladimir Davydov, Tetsuo Handa, Andrew Morton, kernel-team,
	cgroups, linux-doc, linux-kernel

Hello,

On Fri, Sep 22, 2017 at 01:39:55PM -0700, David Rientjes wrote:
> The current heuristic based on processes is coupled with per-process 
> /proc/pid/oom_score_adj.  The proposed heuristic has no ability to be 
> influenced by userspace, and it needs one.  
> The proposed heuristic based on memory cgroups coupled with Roman's 
> per-memcg memory.oom_priority is appropriate and needed.  It is not 

So, this is where we disagree.  I don't think it's a good design.

> "sophisticated intelligence," it merely allows userspace to protect vital 
> memory cgroups when opting into the new features (cgroups compared based 
> on size and memory.oom_group) that we very much want.

which can't achieve that goal very well for a wide variety of users.

> > We even change the whole scheduling behaviors and try really hard to
> > not get locked into specific implementation details which exclude
> > future improvements.  Guaranteeing OOM killing selection would be
> > crazy.  Why would we prevent ourselves from doing things better in the
> > future?  We aren't talking about the semantics of read(2) here.  This
> > is a kernel emergency mechanism to avoid deadlock at the last moment.
> 
> We merely want other memory cgroups to be oom killed on system oom 
> conditions before important ones, regardless of whether the important one 
> is using more memory than the others, which matters because of the new 
> heuristic this patchset introduces.  This is exactly the same as 
> /proc/pid/oom_score_adj for the current heuristic.

You were arguing that we should lock into a specific heuristic and
guarantee the same behavior.  We shouldn't.

When we introduce a user visible interface, we're making a lot of
promises.  My point is that we need to be really careful when making
those promises.

> If you have this low priority maintenance job charging memory to the high 
> priority hierarchy, you're already misconfigured unless you adjust 
> /proc/pid/oom_score_adj because it will oom kill any larger process than 
> itself in today's kernels anyway.
> 
> A better configuration would be to attach this hypothetical low priority 
> maintenance job to its own sibling cgroup with its own memory limit to 
> avoid exactly that problem: it going berserk and charging too much memory 
> to the high priority container, which results in one of its processes 
> getting oom killed.

And how do you guarantee that across delegation boundaries?  The
points you raise on why the priority should be applied level-by-level
are exactly the same points why this doesn't really work.  OOM killing
priority isn't something which can be distributed across cgroup
hierarchy level-by-level.  The resulting decision tree doesn't make
any sense.

I'm not against adding something which works but strict level-by-level
comparison isn't the solution.

Thanks.

-- 
tejun

* Re: [v8 0/4] cgroup-aware OOM killer
  2017-09-22 21:05           ` Tejun Heo
@ 2017-09-23  8:16             ` David Rientjes
  0 siblings, 0 replies; 78+ messages in thread
From: David Rientjes @ 2017-09-23  8:16 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Johannes Weiner, Roman Gushchin, linux-mm, Michal Hocko,
	Vladimir Davydov, Tetsuo Handa, Andrew Morton, kernel-team,
	cgroups, linux-doc, linux-kernel

On Fri, 22 Sep 2017, Tejun Heo wrote:

> > If you have this low priority maintenance job charging memory to the high 
> > priority hierarchy, you're already misconfigured unless you adjust 
> > /proc/pid/oom_score_adj because it will oom kill any larger process than 
> > itself in today's kernels anyway.
> > 
> > A better configuration would be to attach this hypothetical low priority 
> > maintenance job to its own sibling cgroup with its own memory limit to 
> > avoid exactly that problem: it going berserk and charging too much memory 
> > to the high priority container, which results in one of its processes 
> > getting oom killed.
> 
> And how do you guarantee that across delegation boundaries?  The
> points you raise on why the priority should be applied level-by-level
> are exactly the same points why this doesn't really work.  OOM killing
> priority isn't something which can be distributed across cgroup
> hierarchy level-by-level.  The resulting decision tree doesn't make
> any sense.
> 

It works very well in practice with real world usecases, and Roman has 
independently developed the same design that we have used for the past 
four years.  Saying it doesn't make any sense doesn't hold a lot of weight 
when we both independently designed and implemented the same solution to 
address our usecases.

> I'm not against adding something which works but strict level-by-level
> comparison isn't the solution.
> 

Each of the eight versions of Roman's cgroup aware oom killer has done 
comparisons between siblings at each level.  Userspace influence on that 
comparison would thus also need to be done at each level.  It's a very 
powerful combination in practice.

Thanks.

end of thread, other threads:[~2017-10-02 20:55 UTC | newest]

Thread overview: 78+ messages
2017-09-11 13:17 [v8 0/4] cgroup-aware OOM killer Roman Gushchin
2017-09-11 13:17 ` [v8 1/4] mm, oom: refactor the oom_kill_process() function Roman Gushchin
2017-09-11 20:51   ` David Rientjes
2017-09-14 13:42   ` Michal Hocko
2017-09-11 13:17 ` [v8 2/4] mm, oom: cgroup-aware OOM killer Roman Gushchin
2017-09-13 20:46   ` David Rientjes
2017-09-13 21:59     ` Roman Gushchin
2017-09-11 13:17 ` [v8 3/4] mm, oom: add cgroup v2 mount option for " Roman Gushchin
2017-09-11 20:48   ` David Rientjes
2017-09-12 20:01     ` Roman Gushchin
2017-09-12 20:23       ` David Rientjes
2017-09-13 12:23       ` Michal Hocko
2017-09-11 13:17 ` [v8 4/4] mm, oom, docs: describe the " Roman Gushchin
2017-09-11 20:44 ` [v8 0/4] " David Rientjes
2017-09-13 12:29   ` Michal Hocko
2017-09-13 20:46     ` David Rientjes
2017-09-14 13:34       ` Michal Hocko
2017-09-14 20:07         ` David Rientjes
2017-09-13 21:56     ` Roman Gushchin
2017-09-14 13:40       ` Michal Hocko
2017-09-14 16:05         ` Roman Gushchin
2017-09-15 10:58           ` Michal Hocko
2017-09-15 15:23             ` Roman Gushchin
2017-09-15 19:55               ` David Rientjes
2017-09-15 21:08                 ` Roman Gushchin
2017-09-18  6:20                   ` Michal Hocko
2017-09-18 15:02                     ` Roman Gushchin
2017-09-21  8:30                       ` David Rientjes
2017-09-19 20:54                   ` David Rientjes
2017-09-20 22:24                     ` Roman Gushchin
2017-09-21  8:27                       ` David Rientjes
2017-09-18  6:16                 ` Michal Hocko
2017-09-19 20:51                   ` David Rientjes
2017-09-18  6:14               ` Michal Hocko
2017-09-20 21:53                 ` Roman Gushchin
2017-09-25 12:24                   ` Michal Hocko
2017-09-25 17:00                     ` Johannes Weiner
2017-09-25 18:15                       ` Roman Gushchin
2017-09-25 20:25                         ` Michal Hocko
2017-09-26 10:59                           ` Roman Gushchin
2017-09-26 11:21                             ` Michal Hocko
2017-09-26 12:13                               ` Roman Gushchin
2017-09-26 13:30                                 ` Michal Hocko
2017-09-26 17:26                                   ` Johannes Weiner
2017-09-27  3:37                                     ` Tim Hockin
2017-09-27  7:43                                       ` Michal Hocko
2017-09-27 10:19                                         ` Roman Gushchin
2017-09-27 15:35                                         ` Tim Hockin
2017-09-27 16:23                                           ` Roman Gushchin
2017-09-27 18:11                                             ` Tim Hockin
2017-10-01 23:29                                               ` Shakeel Butt
2017-10-02 11:56                                                 ` Tetsuo Handa
2017-10-02 12:24                                                 ` Michal Hocko
2017-10-02 12:47                                                   ` Roman Gushchin
2017-10-02 14:29                                                     ` Michal Hocko
2017-10-02 19:00                                                   ` Shakeel Butt
2017-10-02 19:28                                                     ` Michal Hocko
2017-10-02 19:45                                                       ` Shakeel Butt
2017-10-02 19:56                                                         ` Michal Hocko
2017-10-02 20:00                                                           ` Tim Hockin
2017-10-02 20:08                                                             ` Michal Hocko
2017-10-02 20:20                                                             ` Shakeel Butt
2017-10-02 20:24                                                           ` Shakeel Butt
2017-10-02 20:34                                                             ` Johannes Weiner
2017-10-02 20:55                                                             ` Michal Hocko
2017-09-25 22:21                       ` David Rientjes
2017-09-26  8:46                         ` Michal Hocko
2017-09-26 21:04                           ` David Rientjes
2017-09-27  7:37                             ` Michal Hocko
2017-09-27  9:57                               ` Roman Gushchin
2017-09-21 14:21   ` Johannes Weiner
2017-09-21 21:17     ` David Rientjes
2017-09-21 21:51       ` Johannes Weiner
2017-09-22 20:53         ` David Rientjes
2017-09-22 15:44       ` Tejun Heo
2017-09-22 20:39         ` David Rientjes
2017-09-22 21:05           ` Tejun Heo
2017-09-23  8:16             ` David Rientjes
