* [RFC PATCH] mm, oom: cgroup-aware OOM-killer
@ 2017-05-18 16:28 Roman Gushchin
  2017-05-18 17:30   ` Michal Hocko
  2017-05-20 18:37   ` Vladimir Davydov
  0 siblings, 2 replies; 42+ messages in thread
From: Roman Gushchin @ 2017-05-18 16:28 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Roman Gushchin, Tejun Heo, Li Zefan, Michal Hocko,
	Vladimir Davydov, Tetsuo Handa, kernel-team, cgroups, linux-doc,
	linux-kernel, linux-mm

Traditionally, the OOM killer operates at the process level.
Under OOM conditions, it finds the process with the highest oom score
and kills it.

This behavior doesn't work well for systems with many running
containers. There are two main issues:

1) There is no fairness between containers. A small container with
a few large processes will be chosen over a large one with a huge
number of small processes.

2) Containers often do not expect that some random process inside
will be killed. So, in general, a much safer behavior is
to kill the whole cgroup. Traditionally, this was implemented
in userspace, but doing it in the kernel has some advantages,
especially in the case of a system-wide OOM.

To address these issues, a cgroup-aware OOM killer is introduced.
Under OOM conditions, it looks for the memcg with the highest oom
score and kills all processes inside.

The memcg oom score is calculated as the combined size of the active
and inactive anon LRU lists, the unevictable LRU list and swap.

For a cgroup-wide OOM, only cgroups belonging to the subtree of
the OOMing cgroup are considered.

If no eligible memcg is found, the OOM killer falls back to
the traditional per-process behavior.

This change affects only cgroup v2.
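
As a rough illustration of the score described above: using the helpers
the patch relies on further down (mem_cgroup_node_nr_lru_pages() and
mem_cgroup_get_nr_swap_pages()), the per-memcg score could be sketched
as follows. This is a simplified outline, not the exact code being added:

static unsigned long memcg_oom_score(struct mem_cgroup *memcg,
				     const nodemask_t *nodemask)
{
	unsigned long points = 0;
	int nid;

	/* anon LRU pages (active + inactive) plus unevictable pages */
	for_each_node_state(nid, N_MEMORY) {
		if (nodemask && !node_isset(nid, *nodemask))
			continue;
		points += mem_cgroup_node_nr_lru_pages(memcg, nid,
				LRU_ALL_ANON | BIT(LRU_UNEVICTABLE));
	}

	/* plus the cgroup's swap footprint */
	points += mem_cgroup_get_nr_swap_pages(memcg);

	return points;
}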

Signed-off-by: Roman Gushchin <guro@fb.com>
Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: kernel-team@fb.com
Cc: cgroups@vger.kernel.org
Cc: linux-doc@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
---
 Documentation/cgroup-v2.txt | 24 ++++++++++++++--
 include/linux/memcontrol.h  |  3 ++
 include/linux/oom.h         |  1 +
 mm/memcontrol.c             | 69 +++++++++++++++++++++++++++++++++++++++++++++
 mm/oom_kill.c               | 49 ++++++++++++++++++++++++++++----
 5 files changed, 139 insertions(+), 7 deletions(-)

diff --git a/Documentation/cgroup-v2.txt b/Documentation/cgroup-v2.txt
index dc5e2dc..6583041 100644
--- a/Documentation/cgroup-v2.txt
+++ b/Documentation/cgroup-v2.txt
@@ -44,6 +44,7 @@ CONTENTS
     5-2-1. Memory Interface Files
     5-2-2. Usage Guidelines
     5-2-3. Memory Ownership
+    5-2-4. Cgroup-aware OOM Killer
   5-3. IO
     5-3-1. IO Interface Files
     5-3-2. Writeback
@@ -831,8 +832,7 @@ PAGE_SIZE multiple when read back.
 	  oom
 
 		The number of times the OOM killer has been invoked in
-		the cgroup.  This may not exactly match the number of
-		processes killed but should generally be close.
+		the cgroup.
 
   memory.stat
 
@@ -988,6 +988,26 @@ POSIX_FADV_DONTNEED to relinquish the ownership of memory areas
 belonging to the affected files to ensure correct memory ownership.
 
 
+5-2-4. Cgroup-aware OOM Killer
+
+The cgroup v2 memory controller implements a cgroup-aware OOM killer.
+This means that it treats memory cgroups as memory consumers
+rather than individual processes. Under OOM conditions it tries
+to find an eligible leaf memory cgroup and kill all processes
+in this cgroup. If that is not possible (e.g. all processes belong
+to the root cgroup), it falls back to the traditional per-process
+behaviour.
+
+The memory controller tries to make the best choice of a victim cgroup.
+In general, it tries to select the largest cgroup matching the given
+node/zone requirements, but the exact algorithm is not defined
+and may be changed later.
+
+This affects both system- and cgroup-wide OOMs. For a cgroup-wide OOM
+the memory controller considers only cgroups belonging to the subtree
+of the OOMing cgroup, including itself.
+
+
 5-3. IO
 
 The "io" controller regulates the distribution of IO resources.  This
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 899949b..fb0ff64 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -34,6 +34,7 @@ struct mem_cgroup;
 struct page;
 struct mm_struct;
 struct kmem_cache;
+struct oom_control;
 
 /* Cgroup-specific page state, on top of universal node page state */
 enum memcg_stat_item {
@@ -465,6 +466,8 @@ static inline bool task_in_memcg_oom(struct task_struct *p)
 
 bool mem_cgroup_oom_synchronize(bool wait);
 
+bool mem_cgroup_select_oom_victim(struct oom_control *oc);
+
 #ifdef CONFIG_MEMCG_SWAP
 extern int do_swap_account;
 #endif
diff --git a/include/linux/oom.h b/include/linux/oom.h
index 8a266e2..51e71f2 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -39,6 +39,7 @@ struct oom_control {
 	unsigned long totalpages;
 	struct task_struct *chosen;
 	unsigned long chosen_points;
+	struct mem_cgroup *chosen_memcg;
 };
 
 extern struct mutex oom_lock;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index c131f7e..8d07481 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2625,6 +2625,75 @@ static inline bool memcg_has_children(struct mem_cgroup *memcg)
 	return ret;
 }
 
+bool mem_cgroup_select_oom_victim(struct oom_control *oc)
+{
+	struct mem_cgroup *iter;
+	unsigned long chosen_memcg_points;
+
+	oc->chosen_memcg = NULL;
+
+	if (mem_cgroup_disabled())
+		return false;
+
+	if (!cgroup_subsys_on_dfl(memory_cgrp_subsys))
+		return false;
+
+	pr_info("Choosing a victim memcg because of %s",
+		oc->memcg ?
+		"memory limit reached of cgroup " :
+		"out of memory\n");
+	if (oc->memcg) {
+		pr_cont_cgroup_path(oc->memcg->css.cgroup);
+		pr_cont("\n");
+	}
+
+	chosen_memcg_points = 0;
+
+	for_each_mem_cgroup_tree(iter, oc->memcg) {
+		unsigned long points;
+		int nid;
+
+		if (mem_cgroup_is_root(iter))
+			continue;
+
+		if (memcg_has_children(iter))
+			continue;
+
+		points = 0;
+		for_each_node_state(nid, N_MEMORY) {
+			if (oc->nodemask && !node_isset(nid, *oc->nodemask))
+				continue;
+			points += mem_cgroup_node_nr_lru_pages(iter, nid,
+					LRU_ALL_ANON | BIT(LRU_UNEVICTABLE));
+		}
+		points += mem_cgroup_get_nr_swap_pages(iter);
+
+		pr_info("Memcg ");
+		pr_cont_cgroup_path(iter->css.cgroup);
+		pr_cont(": %lu\n", points);
+
+		if (points > chosen_memcg_points) {
+			if (oc->chosen_memcg)
+				css_put(&oc->chosen_memcg->css);
+
+			oc->chosen_memcg = iter;
+			css_get(&iter->css);
+
+			chosen_memcg_points = points;
+		}
+	}
+
+	if (oc->chosen_memcg) {
+		pr_info("Kill memcg ");
+		pr_cont_cgroup_path(oc->chosen_memcg->css.cgroup);
+		pr_cont(" (%lu)\n", chosen_memcg_points);
+	} else {
+		pr_info("No eligible memory cgroup found\n");
+	}
+
+	return !!oc->chosen_memcg;
+}
+
 /*
  * Reclaims as many pages from the given memcg as possible.
  *
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 04c9143..c000495 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -802,6 +802,8 @@ static bool task_will_free_mem(struct task_struct *task)
 	return ret;
 }
 
+static void __oom_kill_process(struct task_struct *victim);
+
 static void oom_kill_process(struct oom_control *oc, const char *message)
 {
 	struct task_struct *p = oc->chosen;
@@ -809,11 +811,9 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
 	struct task_struct *victim = p;
 	struct task_struct *child;
 	struct task_struct *t;
-	struct mm_struct *mm;
 	unsigned int victim_points = 0;
 	static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
 					      DEFAULT_RATELIMIT_BURST);
-	bool can_oom_reap = true;
 
 	/*
 	 * If the task is already exiting, don't alarm the sysadmin or kill
@@ -863,6 +863,15 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
 	}
 	read_unlock(&tasklist_lock);
 
+	__oom_kill_process(victim);
+}
+
+static void __oom_kill_process(struct task_struct *victim)
+{
+	struct task_struct *p;
+	struct mm_struct *mm;
+	bool can_oom_reap = true;
+
 	p = find_lock_task_mm(victim);
 	if (!p) {
 		put_task_struct(victim);
@@ -970,6 +979,20 @@ int unregister_oom_notifier(struct notifier_block *nb)
 }
 EXPORT_SYMBOL_GPL(unregister_oom_notifier);
 
+static int oom_kill_task_fn(struct task_struct *p, void *arg)
+{
+	if (is_global_init(p))
+		return 0;
+
+	if (p->flags & PF_KTHREAD)
+		return 0;
+
+	get_task_struct(p);
+	__oom_kill_process(p);
+
+	return 0;
+}
+
 /**
  * out_of_memory - kill the "best" process when we run out of memory
  * @oc: pointer to struct oom_control
@@ -1032,13 +1055,29 @@ bool out_of_memory(struct oom_control *oc)
 		return true;
 	}
 
-	select_bad_process(oc);
+	/*
+	 * Try to find an eligible memory cgroup. If nothing is found,
+	 * fall back to a per-process OOM.
+	 */
+	if (!mem_cgroup_select_oom_victim(oc))
+		select_bad_process(oc);
+
 	/* Found nothing?!?! Either we hang forever, or we panic. */
-	if (!oc->chosen && !is_sysrq_oom(oc) && !is_memcg_oom(oc)) {
+	if (!oc->chosen_memcg && !oc->chosen && !is_sysrq_oom(oc) &&
+	    !is_memcg_oom(oc)) {
 		dump_header(oc, NULL);
 		panic("Out of memory and no killable processes...\n");
 	}
-	if (oc->chosen && oc->chosen != (void *)-1UL) {
+
+	if (oc->chosen_memcg) {
+		/* Try to kill the whole memory cgroup. */
+		if (!is_memcg_oom(oc))
+			mem_cgroup_event(oc->chosen_memcg, MEMCG_OOM);
+		mem_cgroup_scan_tasks(oc->chosen_memcg, oom_kill_task_fn, NULL);
+
+		css_put(&oc->chosen_memcg->css);
+		schedule_timeout_killable(1);
+	} else if (oc->chosen && oc->chosen != (void *)-1UL) {
 		oom_kill_process(oc, !is_memcg_oom(oc) ? "Out of memory" :
 				 "Memory cgroup out of memory");
 		/*
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH] mm, oom: cgroup-aware OOM-killer
  2017-05-18 16:28 [RFC PATCH] mm, oom: cgroup-aware OOM-killer Roman Gushchin
@ 2017-05-18 17:30   ` Michal Hocko
  2017-05-20 18:37   ` Vladimir Davydov
  1 sibling, 0 replies; 42+ messages in thread
From: Michal Hocko @ 2017-05-18 17:30 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Johannes Weiner, Tejun Heo, Li Zefan, Vladimir Davydov,
	Tetsuo Handa, kernel-team, cgroups, linux-doc, linux-kernel,
	linux-mm

On Thu 18-05-17 17:28:04, Roman Gushchin wrote:
> Traditionally, the OOM killer is operating on a process level.
> Under oom conditions, it finds a process with the highest oom score
> and kills it.
> 
> This behavior doesn't suit well the system with many running
> containers. There are two main issues:
> 
> 1) There is no fairness between containers. A small container with
> a few large processes will be chosen over a large one with huge
> number of small processes.
> 
> 2) Containers often do not expect that some random process inside
> will be killed. So, in general, a much safer behavior is
> to kill the whole cgroup. Traditionally, this was implemented
> in userspace, but doing it in the kernel has some advantages,
> especially in a case of a system-wide OOM.
> 
> To address these issues, cgroup-aware OOM killer is introduced.
> Under OOM conditions, it looks for a memcg with highest oom score,
> and kills all processes inside.
> 
> Memcg oom score is calculated as a size of active and inactive
> anon LRU lists, unevictable LRU list and swap size.
> 
> For a cgroup-wide OOM, only cgroups belonging to the subtree of
> the OOMing cgroup are considered.

While this might make sense for some workloads/setups, it is not a
generally acceptable policy IMHO. We discussed that different OOM
policies might be interesting a few years back at LSFMM, but there was
no real consensus on how to do that. One possibility was to allow
bpf-like mechanisms. Could you explore that path?
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH] mm, oom: cgroup-aware OOM-killer
  2017-05-18 17:30   ` Michal Hocko
@ 2017-05-18 18:11     ` Johannes Weiner
  -1 siblings, 0 replies; 42+ messages in thread
From: Johannes Weiner @ 2017-05-18 18:11 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Roman Gushchin, Tejun Heo, Li Zefan, Vladimir Davydov,
	Tetsuo Handa, kernel-team, cgroups, linux-doc, linux-kernel,
	linux-mm

On Thu, May 18, 2017 at 07:30:04PM +0200, Michal Hocko wrote:
> On Thu 18-05-17 17:28:04, Roman Gushchin wrote:
> > Traditionally, the OOM killer is operating on a process level.
> > Under oom conditions, it finds a process with the highest oom score
> > and kills it.
> > 
> > This behavior doesn't suit well the system with many running
> > containers. There are two main issues:
> > 
> > 1) There is no fairness between containers. A small container with
> > a few large processes will be chosen over a large one with huge
> > number of small processes.
> > 
> > 2) Containers often do not expect that some random process inside
> > will be killed. So, in general, a much safer behavior is
> > to kill the whole cgroup. Traditionally, this was implemented
> > in userspace, but doing it in the kernel has some advantages,
> > especially in a case of a system-wide OOM.
> > 
> > To address these issues, cgroup-aware OOM killer is introduced.
> > Under OOM conditions, it looks for a memcg with highest oom score,
> > and kills all processes inside.
> > 
> > Memcg oom score is calculated as a size of active and inactive
> > anon LRU lists, unevictable LRU list and swap size.
> > 
> > For a cgroup-wide OOM, only cgroups belonging to the subtree of
> > the OOMing cgroup are considered.
> 
> While this might make sense for some workloads/setups it is not a
> generally acceptable policy IMHO. We have discussed that different OOM
> policies might be interesting few years back at LSFMM but there was no
> real consensus on how to do that. One possibility was to allow bpf like
> mechanisms. Could you explore that path?

OOM policy is an orthogonal discussion, though.

The OOM killer's job is to pick a memory consumer to kill. By default
the unit of the memory consumer is a process, but cgroups allow
grouping processes into compound consumers. Extending the OOM killer
to respect the new definition of "consumer" is not a new policy.

I don't think it's reasonable to ask the person who's trying to make
the OOM killer support group-consumers to design a dynamic OOM policy
framework instead.

All we want is the OOM policy, whatever it is, applied to cgroups.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH] mm, oom: cgroup-aware OOM-killer
  2017-05-18 17:30   ` Michal Hocko
@ 2017-05-18 18:37     ` Balbir Singh
  -1 siblings, 0 replies; 42+ messages in thread
From: Balbir Singh @ 2017-05-18 18:37 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Roman Gushchin, Johannes Weiner, Tejun Heo, Li Zefan,
	Vladimir Davydov, Tetsuo Handa, kernel-team, cgroups,
	open list:DOCUMENTATION, linux-kernel, linux-mm

On Fri, May 19, 2017 at 3:30 AM, Michal Hocko <mhocko@kernel.org> wrote:
> On Thu 18-05-17 17:28:04, Roman Gushchin wrote:
>> Traditionally, the OOM killer is operating on a process level.
>> Under oom conditions, it finds a process with the highest oom score
>> and kills it.
>>
>> This behavior doesn't suit well the system with many running
>> containers. There are two main issues:
>>
>> 1) There is no fairness between containers. A small container with
>> a few large processes will be chosen over a large one with huge
>> number of small processes.
>>
>> 2) Containers often do not expect that some random process inside
>> will be killed. So, in general, a much safer behavior is
>> to kill the whole cgroup. Traditionally, this was implemented
>> in userspace, but doing it in the kernel has some advantages,
>> especially in a case of a system-wide OOM.
>>
>> To address these issues, cgroup-aware OOM killer is introduced.
>> Under OOM conditions, it looks for a memcg with highest oom score,
>> and kills all processes inside.
>>
>> Memcg oom score is calculated as a size of active and inactive
>> anon LRU lists, unevictable LRU list and swap size.
>>
>> For a cgroup-wide OOM, only cgroups belonging to the subtree of
>> the OOMing cgroup are considered.
>
> While this might make sense for some workloads/setups it is not a
> generally acceptable policy IMHO. We have discussed that different OOM
> policies might be interesting few years back at LSFMM but there was no
> real consensus on how to do that. One possibility was to allow bpf like
> mechanisms. Could you explore that path?

I agree, I think it needs more thought. I wonder if the real issue is
something else. For example:

1. Did we overcommit a particular container too much?
2. Do we need something like https://lwn.net/Articles/604212/ to solve
the problem?
3. We have oom notifiers now, could those be used (assuming you are interested
in non-memcg-related OOMs affecting a container)?
4. How do we determine limits for these containers, from a fairness
perspective?

Just trying to understand what leads to the issues you are seeing.

Balbir

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH] mm, oom: cgroup-aware OOM-killer
  2017-05-18 18:37     ` Balbir Singh
@ 2017-05-18 19:20       ` Roman Gushchin
  -1 siblings, 0 replies; 42+ messages in thread
From: Roman Gushchin @ 2017-05-18 19:20 UTC (permalink / raw)
  To: Balbir Singh
  Cc: Michal Hocko, Johannes Weiner, Tejun Heo, Li Zefan,
	Vladimir Davydov, Tetsuo Handa, kernel-team, cgroups,
	open list:DOCUMENTATION, linux-kernel, linux-mm

On Fri, May 19, 2017 at 04:37:27AM +1000, Balbir Singh wrote:
> On Fri, May 19, 2017 at 3:30 AM, Michal Hocko <mhocko@kernel.org> wrote:
> > On Thu 18-05-17 17:28:04, Roman Gushchin wrote:
> >> Traditionally, the OOM killer is operating on a process level.
> >> Under oom conditions, it finds a process with the highest oom score
> >> and kills it.
> >>
> >> This behavior doesn't suit well the system with many running
> >> containers. There are two main issues:
> >>
> >> 1) There is no fairness between containers. A small container with
> >> a few large processes will be chosen over a large one with huge
> >> number of small processes.
> >>
> >> 2) Containers often do not expect that some random process inside
> >> will be killed. So, in general, a much safer behavior is
> >> to kill the whole cgroup. Traditionally, this was implemented
> >> in userspace, but doing it in the kernel has some advantages,
> >> especially in a case of a system-wide OOM.
> >>
> >> To address these issues, cgroup-aware OOM killer is introduced.
> >> Under OOM conditions, it looks for a memcg with highest oom score,
> >> and kills all processes inside.
> >>
> >> Memcg oom score is calculated as a size of active and inactive
> >> anon LRU lists, unevictable LRU list and swap size.
> >>
> >> For a cgroup-wide OOM, only cgroups belonging to the subtree of
> >> the OOMing cgroup are considered.
> >
> > While this might make sense for some workloads/setups it is not a
> > generally acceptable policy IMHO. We have discussed that different OOM
> > policies might be interesting few years back at LSFMM but there was no
> > real consensus on how to do that. One possibility was to allow bpf like
> > mechanisms. Could you explore that path?
> 
> I agree, I think it needs more thought. I wonder if the real issue is something
> else. For example
> 
> 1. Did we overcommit a particular container too much?

Imagine you have a machine with multiple containers,
each with its own process tree, and the machine is overcommitted,
i.e. the sum of the containers' memory limits is larger than the
amount of available RAM.

In the case of a system-wide OOM some random container will be affected.

Historically, this problem was solved by some user-space daemon,
which monitored OOM events and cleaned up affected containers.
But this approach can't solve the main problem: non-optimal selection
of a victim.

> 2. Do we need something like https://urldefense.proofpoint.com/v2/url?u=https-3A__lwn.net_Articles_604212_&d=DwIBaQ&c=5VD0RTtNlTh3ycd41b3MUw&r=jJYgtDM7QT-W-Fz_d29HYQ&m=9jV4id5lmsjFJj1kQjJk0auyQ3bzL27-f6Ur6ZNw36c&s=ElsS25CoZSPba6ke7O-EIsR7lN0psP6tDVyLnGqCMfs&e=  to solve
> the problem?

I don't think it's related.

> 3. We have oom notifiers now, could those be used (assuming you are interested
> in non memcg related OOM's affecting a container

They can be used to inform a userspace daemon about an OOM that has
already happened, but they do not affect victim selection.

> 4. How do we determine limits for these containers? From a fariness
> perspective

Limits are usually set from some high-level understanding of the nature
of the tasks running inside, but overcommitting the machine is
commonplace, I assume.

Thank you!

Roman

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH] mm, oom: cgroup-aware OOM-killer
  2017-05-18 18:37     ` Balbir Singh
@ 2017-05-18 19:22       ` Johannes Weiner
  -1 siblings, 0 replies; 42+ messages in thread
From: Johannes Weiner @ 2017-05-18 19:22 UTC (permalink / raw)
  To: Balbir Singh
  Cc: Michal Hocko, Roman Gushchin, Tejun Heo, Li Zefan,
	Vladimir Davydov, Tetsuo Handa, kernel-team, cgroups,
	open list:DOCUMENTATION, linux-kernel, linux-mm

On Fri, May 19, 2017 at 04:37:27AM +1000, Balbir Singh wrote:
> On Fri, May 19, 2017 at 3:30 AM, Michal Hocko <mhocko@kernel.org> wrote:
> > On Thu 18-05-17 17:28:04, Roman Gushchin wrote:
> >> Traditionally, the OOM killer is operating on a process level.
> >> Under oom conditions, it finds a process with the highest oom score
> >> and kills it.
> >>
> >> This behavior doesn't suit well the system with many running
> >> containers. There are two main issues:
> >>
> >> 1) There is no fairness between containers. A small container with
> >> a few large processes will be chosen over a large one with huge
> >> number of small processes.
> >>
> >> 2) Containers often do not expect that some random process inside
> >> will be killed. So, in general, a much safer behavior is
> >> to kill the whole cgroup. Traditionally, this was implemented
> >> in userspace, but doing it in the kernel has some advantages,
> >> especially in a case of a system-wide OOM.
> >>
> >> To address these issues, cgroup-aware OOM killer is introduced.
> >> Under OOM conditions, it looks for a memcg with highest oom score,
> >> and kills all processes inside.
> >>
> >> Memcg oom score is calculated as a size of active and inactive
> >> anon LRU lists, unevictable LRU list and swap size.
> >>
> >> For a cgroup-wide OOM, only cgroups belonging to the subtree of
> >> the OOMing cgroup are considered.
> >
> > While this might make sense for some workloads/setups it is not a
> > generally acceptable policy IMHO. We have discussed that different OOM
> > policies might be interesting few years back at LSFMM but there was no
> > real consensus on how to do that. One possibility was to allow bpf like
> > mechanisms. Could you explore that path?
> 
> I agree, I think it needs more thought. I wonder if the real issue is something
> else. For example
> 
> 1. Did we overcommit a particular container too much?
> 2. Do we need something like https://lwn.net/Articles/604212/ to solve
> the problem?

The occasional OOM kill is an unavoidable reality on our systems (and
I bet on most deployments). If we tried not to overcommit, we'd waste
a *lot* of memory.

The problem is when OOM happens, we really want the biggest *job* to
get killed. Before cgroups, we assumed jobs were processes. But with
cgroups, the user is able to define a group of processes as a job, and
then an individual process is no longer a first-class memory consumer.

Without a patch like this, the OOM killer will compare the sizes of
the random subparticles that the jobs in the system are composed of
and kill the single biggest particle, leaving behind the incoherent
remains of one of the jobs. That doesn't make a whole lot of sense.

If you want to determine the most expensive car in a parking lot, you
can't go off and compare the price of one car's muffler with the door
handle of another, then point to a windshield and yell "This is it!"

You need to compare the cars as a whole with each other.

> 3. We have oom notifiers now, could those be used (assuming you are interested
> in non memcg related OOM's affecting a container

Right now, we watch for OOM notifications and then have userspace kill
the rest of a job. That works - somewhat. What remains is the problem
that I described above, that comparing individual process sizes is not
meaningful when the terminal memory consumer is a cgroup.

> 4. How do we determine limits for these containers? From a fariness
> perspective

How do you mean?

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH] mm, oom: cgroup-aware OOM-killer
  2017-05-18 19:20       ` Roman Gushchin
@ 2017-05-18 19:41         ` Balbir Singh
  -1 siblings, 0 replies; 42+ messages in thread
From: Balbir Singh @ 2017-05-18 19:41 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Michal Hocko, Johannes Weiner, Tejun Heo, Li Zefan,
	Vladimir Davydov, Tetsuo Handa, kernel-team, cgroups,
	open list:DOCUMENTATION, linux-kernel, linux-mm

On Thu, 2017-05-18 at 20:20 +0100, Roman Gushchin wrote:
> On Fri, May 19, 2017 at 04:37:27AM +1000, Balbir Singh wrote:
> > On Fri, May 19, 2017 at 3:30 AM, Michal Hocko <mhocko@kernel.org> wrote:
> > > On Thu 18-05-17 17:28:04, Roman Gushchin wrote:
> > > > Traditionally, the OOM killer is operating on a process level.
> > > > Under oom conditions, it finds a process with the highest oom score
> > > > and kills it.
> > > > 
> > > > This behavior doesn't suit well the system with many running
> > > > containers. There are two main issues:
> > > > 
> > > > 1) There is no fairness between containers. A small container with
> > > > a few large processes will be chosen over a large one with huge
> > > > number of small processes.
> > > > 
> > > > 2) Containers often do not expect that some random process inside
> > > > will be killed. So, in general, a much safer behavior is
> > > > to kill the whole cgroup. Traditionally, this was implemented
> > > > in userspace, but doing it in the kernel has some advantages,
> > > > especially in a case of a system-wide OOM.
> > > > 
> > > > To address these issues, cgroup-aware OOM killer is introduced.
> > > > Under OOM conditions, it looks for a memcg with highest oom score,
> > > > and kills all processes inside.
> > > > 
> > > > Memcg oom score is calculated as a size of active and inactive
> > > > anon LRU lists, unevictable LRU list and swap size.
> > > > 
> > > > For a cgroup-wide OOM, only cgroups belonging to the subtree of
> > > > the OOMing cgroup are considered.
> > > 
> > > While this might make sense for some workloads/setups it is not a
> > > generally acceptable policy IMHO. We have discussed that different OOM
> > > policies might be interesting few years back at LSFMM but there was no
> > > real consensus on how to do that. One possibility was to allow bpf like
> > > mechanisms. Could you explore that path?
> > 
> > I agree, I think it needs more thought. I wonder if the real issue is something
> > else. For example
> > 
> > 1. Did we overcommit a particular container too much?
> 
> Imagine, you have a machine with multiple containers,
> each with it's own process tree, and the machine is overcommited,
> i.e. sum of container's memory limits is larger the amount available RAM.
> 
> In a case of a system-wide OOM some random container will be affected.
> 

The random container containing the most expensive task, yes!

> Historically, this problem was solving by some user-space daemon,
> which was monitoring OOM events and cleaning up affected containers.
> But this approach can't solve the main problem: non-optimal selection
> of a victim. 

Why do you think the problem is non-optimal selection? Is it because
we believe that memory cgroup limits should play a role in the decision
making of a global OOM?


> 
> > 2. Do we need something like https://urldefense.proofpoint.com/v2/url?u=https-3A__lwn.net_Articles_604212_&d=DwIBaQ&c=5VD0RTtNlTh3ycd41b3MUw&r=jJYgtDM7QT-W-Fz_d29HYQ&m=9jV4id5lmsjFJj1kQjJk0auyQ3bzL27-f6Ur6ZNw36c&s=ElsS25CoZSPba6ke7O-EIsR7lN0psP6tDVyLnGqCMfs&e=  to solve
> > the problem?
>
 
The URL got changed to something non-parsable, probably for security, but
could your email client please not do that?

> I don't think it's related.

I was thinking that if we had virtual memory limits and could set
some sane ones, we could avoid OOM altogether. OOM is a big hammer, and
having allocations fail is far more acceptable than killing processes.
I believe that several applications may have a much larger VM size than
their actual memory usage, but with a good overcommit/virtual memory
limiter the problem can be better tackled.
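
The per-process analog that exists today is RLIMIT_AS; as a hypothetical
sketch (not related to this patch, the function name below is made up),
capping the address space makes allocations fail with ENOMEM instead of
triggering the OOM killer:

#include <sys/resource.h>

/* Cap a task's address space so that further mmap()/brk() growth
 * beyond 'bytes' fails with ENOMEM instead of pushing the system
 * towards the OOM killer. */
static int cap_address_space(rlim_t bytes)
{
	struct rlimit rl = {
		.rlim_cur = bytes,
		.rlim_max = bytes,
	};

	return setrlimit(RLIMIT_AS, &rl);
}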

> 
> > 3. We have oom notifiers now, could those be used (assuming you are interested
> > in non memcg related OOM's affecting a container
> 
> They can be used to inform an userspace daemon about an already happened OOM,
> but they do not affect victim selection.

Yes, the whole point is for the OS to select the victim; the notifiers
provide an opportunity for us to do reclaim and possibly prevent OOM.

In oom_kill, I see

                blocking_notifier_call_chain(&oom_notify_list, 0, &freed);
                if (freed > 0)
                        /* Got some memory back in the last second. */
                        return true;
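
For reference, the other side of that hook is a registered callback that
tries to free memory before a victim is picked. A rough sketch (the names
below are made up for illustration):

#include <linux/notifier.h>
#include <linux/oom.h>

static int example_oom_notify(struct notifier_block *nb,
			      unsigned long unused, void *parm)
{
	unsigned long *freed = parm;
	unsigned long released = 0;

	/* try to drop any reclaimable memory this code owns and
	 * count the pages released into 'released' */

	*freed += released;
	return NOTIFY_OK;
}

static struct notifier_block example_oom_nb = {
	.notifier_call = example_oom_notify,
};

/* registered at init time with register_oom_notifier(&example_oom_nb) */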

Could the notification to user space then decide what to clean up to free
memory? We also have event notification inside of memcg. I am trying to
understand why these are not sufficient.

We also have soft limits to push containers to a smaller size at the
time of global pressure.

> 
> > 4. How do we determine limits for these containers? From a fariness
> > perspective
> 
> Limits are usually set from some high-level understanding of the nature
> of tasks which are working inside, but overcommiting the machine is
> a common place, I assume.

Agreed, overcommit is a given and that is why we wrote the cgroup controllers.
I was wondering if the container limits not being set correctly could cause
these issues. I am also trying to understand: with the infrastructure we
have for notification and control, do we need more?

> 
> Thank you!
> 
> Roman

Cheers,
Balbir Singh.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH] mm, oom: cgroup-aware OOM-killer
  2017-05-18 19:22       ` Johannes Weiner
@ 2017-05-18 19:43         ` Balbir Singh
  -1 siblings, 0 replies; 42+ messages in thread
From: Balbir Singh @ 2017-05-18 19:43 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Michal Hocko, Roman Gushchin, Tejun Heo, Li Zefan,
	Vladimir Davydov, Tetsuo Handa, kernel-team, cgroups,
	open list:DOCUMENTATION, linux-kernel, linux-mm

On Thu, 2017-05-18 at 15:22 -0400, Johannes Weiner wrote:
> On Fri, May 19, 2017 at 04:37:27AM +1000, Balbir Singh wrote:
> > On Fri, May 19, 2017 at 3:30 AM, Michal Hocko <mhocko@kernel.org> wrote:
> > > On Thu 18-05-17 17:28:04, Roman Gushchin wrote:
> > > > Traditionally, the OOM killer is operating on a process level.
> > > > Under oom conditions, it finds a process with the highest oom score
> > > > and kills it.
> > > > 
> > > > This behavior doesn't suit well the system with many running
> > > > containers. There are two main issues:
> > > > 
> > > > 1) There is no fairness between containers. A small container with
> > > > a few large processes will be chosen over a large one with huge
> > > > number of small processes.
> > > > 
> > > > 2) Containers often do not expect that some random process inside
> > > > will be killed. So, in general, a much safer behavior is
> > > > to kill the whole cgroup. Traditionally, this was implemented
> > > > in userspace, but doing it in the kernel has some advantages,
> > > > especially in a case of a system-wide OOM.
> > > > 
> > > > To address these issues, cgroup-aware OOM killer is introduced.
> > > > Under OOM conditions, it looks for a memcg with highest oom score,
> > > > and kills all processes inside.
> > > > 
> > > > Memcg oom score is calculated as a size of active and inactive
> > > > anon LRU lists, unevictable LRU list and swap size.
> > > > 
> > > > For a cgroup-wide OOM, only cgroups belonging to the subtree of
> > > > the OOMing cgroup are considered.
> > > 
> > > While this might make sense for some workloads/setups it is not a
> > > generally acceptable policy IMHO. We have discussed that different OOM
> > > policies might be interesting few years back at LSFMM but there was no
> > > real consensus on how to do that. One possibility was to allow bpf like
> > > mechanisms. Could you explore that path?
> > 
> > I agree, I think it needs more thought. I wonder if the real issue is something
> > else. For example
> > 
> > 1. Did we overcommit a particular container too much?
> > 2. Do we need something like https://lwn.net/Articles/604212/ to solve
> > the problem?
> 
> The occasional OOM kill is an unavoidable reality on our systems (and
> I bet on most deployments). If we tried not to overcommit, we'd waste
> a *lot* of memory.
> 
> The problem is when OOM happens, we really want the biggest *job* to
> get killed. Before cgroups, we assumed jobs were processes. But with
> cgroups, the user is able to define a group of processes as a job, and
> then an individual process is no longer a first-class memory consumer.
> 
> Without a patch like this, the OOM killer will compare the sizes of
> the random subparticles that the jobs in the system are composed of
> and kill the single biggest particle, leaving behind the incoherent
> remains of one of the jobs. That doesn't make a whole lot of sense.

I agree, but see my response on oom_notifiers in parallel that I sent
to Roman.

> 
> If you want to determine the most expensive car in a parking lot, you
> can't go off and compare the price of one car's muffler with the door
> handle of another, then point to a wind shield and yell "This is it!"
> 
> You need to compare the cars as a whole with each other.
> 
> > 3. We have oom notifiers now, could those be used (assuming you are interested
> > in non memcg related OOM's affecting a container
> 
> Right now, we watch for OOM notifications and then have userspace kill
> the rest of a job. That works - somewhat. What remains is the problem
> that I described above, that comparing individual process sizes is not
> meaningful when the terminal memory consumer is a cgroup.

Could the cgroup limit be used as the comparison point? Or the stats inside
of the memory cgroup?

> 
> > 4. How do we determine limits for these containers? From a fariness
> > perspective
> 
> How do you mean?

How do we set them up so that the larger job gets more of the limits
as opposed to the smaller ones?

Balbir Singh.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH] mm, oom: cgroup-aware OOM-killer
  2017-05-18 19:43         ` Balbir Singh
@ 2017-05-18 20:15           ` Johannes Weiner
  -1 siblings, 0 replies; 42+ messages in thread
From: Johannes Weiner @ 2017-05-18 20:15 UTC (permalink / raw)
  To: Balbir Singh
  Cc: Michal Hocko, Roman Gushchin, Tejun Heo, Li Zefan,
	Vladimir Davydov, Tetsuo Handa, kernel-team, cgroups,
	open list:DOCUMENTATION, linux-kernel, linux-mm

On Fri, May 19, 2017 at 05:43:59AM +1000, Balbir Singh wrote:
> On Thu, 2017-05-18 at 15:22 -0400, Johannes Weiner wrote:
> > On Fri, May 19, 2017 at 04:37:27AM +1000, Balbir Singh wrote:
> > > On Fri, May 19, 2017 at 3:30 AM, Michal Hocko <mhocko@kernel.org> wrote:
> > > > On Thu 18-05-17 17:28:04, Roman Gushchin wrote:
> > > > > Traditionally, the OOM killer is operating on a process level.
> > > > > Under oom conditions, it finds a process with the highest oom score
> > > > > and kills it.
> > > > > 
> > > > > This behavior doesn't suit well the system with many running
> > > > > containers. There are two main issues:
> > > > > 
> > > > > 1) There is no fairness between containers. A small container with
> > > > > a few large processes will be chosen over a large one with huge
> > > > > number of small processes.
> > > > > 
> > > > > 2) Containers often do not expect that some random process inside
> > > > > will be killed. So, in general, a much safer behavior is
> > > > > to kill the whole cgroup. Traditionally, this was implemented
> > > > > in userspace, but doing it in the kernel has some advantages,
> > > > > especially in a case of a system-wide OOM.
> > > > > 
> > > > > To address these issues, cgroup-aware OOM killer is introduced.
> > > > > Under OOM conditions, it looks for a memcg with highest oom score,
> > > > > and kills all processes inside.
> > > > > 
> > > > > Memcg oom score is calculated as a size of active and inactive
> > > > > anon LRU lists, unevictable LRU list and swap size.
> > > > > 
> > > > > For a cgroup-wide OOM, only cgroups belonging to the subtree of
> > > > > the OOMing cgroup are considered.
> > > > 
> > > > While this might make sense for some workloads/setups it is not a
> > > > generally acceptable policy IMHO. We have discussed that different OOM
> > > > policies might be interesting few years back at LSFMM but there was no
> > > > real consensus on how to do that. One possibility was to allow bpf like
> > > > mechanisms. Could you explore that path?
> > > 
> > > I agree, I think it needs more thought. I wonder if the real issue is something
> > > else. For example
> > > 
> > > 1. Did we overcommit a particular container too much?
> > > 2. Do we need something like https://lwn.net/Articles/604212/ to solve
> > > the problem?
> > 
> > The occasional OOM kill is an unavoidable reality on our systems (and
> > I bet on most deployments). If we tried not to overcommit, we'd waste
> > a *lot* of memory.
> > 
> > The problem is when OOM happens, we really want the biggest *job* to
> > get killed. Before cgroups, we assumed jobs were processes. But with
> > cgroups, the user is able to define a group of processes as a job, and
> > then an individual process is no longer a first-class memory consumer.
> > 
> > Without a patch like this, the OOM killer will compare the sizes of
> > the random subparticles that the jobs in the system are composed of
> > and kill the single biggest particle, leaving behind the incoherent
> > remains of one of the jobs. That doesn't make a whole lot of sense.
> 
> I agree, but see my response on oom_notifiers in parallel that I sent
> to Roman.

I don't see how they're related to an abstraction problem in the
victim evaluation.

> > If you want to determine the most expensive car in a parking lot, you
> > can't go off and compare the price of one car's muffler with the door
> > handle of another, then point to a wind shield and yell "This is it!"
> > 
> > You need to compare the cars as a whole with each other.
> > 
> > > 3. We have oom notifiers now, could those be used (assuming you are interested
> > > in non memcg related OOM's affecting a container
> > 
> > Right now, we watch for OOM notifications and then have userspace kill
> > the rest of a job. That works - somewhat. What remains is the problem
> > that I described above, that comparing individual process sizes is not
> > meaningful when the terminal memory consumer is a cgroup.
> 
> Could the cgroup limit be used as the comparison point? Or the stats
> inside the memory cgroup?

The OOM is a result of physical memory shortage, but the limits don't
tell you how much physical memory you are consuming - only how much
you might consume if it weren't for a lack of physical memory.

We *do* use the stats inside of the cgroup, namely the amount of
memory they consumed overall, to compare them against each other.

As far as configurable priorities comparable to the oom score on the
system level go, that seems like a separate discussion. We could
add memory.oom_score, we could think about subtracting memory.low from
the badness of each cgroup (as that's the portion the group is
supposed to be able to consume in peace, and which we always expect to
be available in physical memory, so we want to kill the group with the
most overage above the memory.low limit) etc.
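
To make that concrete, here is a small stand-alone C model of the selection
rule sketched above (this is not code from the patch; the group names and
numbers are made up). With the values below, a raw usage comparison would
pick job-b, while subtracting memory.low picks job-a instead:

#include <stdio.h>

/* Sketch only: models "badness = usage - memory.low", with hypothetical
 * groups and values. Not the kernel implementation. */
struct cg {
	const char *name;
	unsigned long usage;	/* pages currently charged to the group */
	unsigned long low;	/* memory.low translated into pages */
};

static unsigned long overage(const struct cg *c)
{
	return c->usage > c->low ? c->usage - c->low : 0;
}

int main(void)
{
	/* 1G = 262144 4K pages */
	struct cg cgs[] = {
		{ "job-a",  9UL * 262144,  4UL * 262144 },
		{ "job-b", 10UL * 262144, 12UL * 262144 },
	};
	const struct cg *victim = &cgs[0];
	unsigned int i;

	for (i = 1; i < sizeof(cgs) / sizeof(cgs[0]); i++)
		if (overage(&cgs[i]) > overage(victim))
			victim = &cgs[i];

	printf("would kill %s (overage: %lu pages)\n",
	       victim->name, overage(victim));
	return 0;
}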

Either way, it's always possible to add configurability as patch 2/2.
Again, this patch is first and foremost about functionality, not about
interfacing and configurability.

> > > 4. How do we determine limits for these containers? From a fairness
> > > perspective
> > 
> > How do you mean?
> 
> How do we set them up so that the larger job gets more of the limits
> as opposed to the small ones?

I'm afraid I still don't entirely understand.

Is this about comparing groups not just by their physical size, but
also by their *intended* size and the difference between the two?
Meaning that a 10G-limit group with 9G allocated could be considered a
larger consumer than a 20G-limit group with 10G worth of memory?

If yes, I think that's where the fact that you overcommit comes
in. Because clearly you don't have 30G - the sum of the memory.max
limits - to hand out, seeing that you OOMed when these groups have
only 19G combined. So the memory.max settings cannot be considered the
intended distribution of memory in the system.

But that's exactly what memory.low is for.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH] mm, oom: cgroup-aware OOM-killer
  2017-05-18 18:11     ` Johannes Weiner
@ 2017-05-19  8:02       ` Michal Hocko
  -1 siblings, 0 replies; 42+ messages in thread
From: Michal Hocko @ 2017-05-19  8:02 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Roman Gushchin, Tejun Heo, Li Zefan, Vladimir Davydov,
	Tetsuo Handa, kernel-team, cgroups, linux-doc, linux-kernel,
	linux-mm

On Thu 18-05-17 14:11:17, Johannes Weiner wrote:
> On Thu, May 18, 2017 at 07:30:04PM +0200, Michal Hocko wrote:
> > On Thu 18-05-17 17:28:04, Roman Gushchin wrote:
> > > Traditionally, the OOM killer is operating on a process level.
> > > Under oom conditions, it finds a process with the highest oom score
> > > and kills it.
> > > 
> > > This behavior doesn't suit well the system with many running
> > > containers. There are two main issues:
> > > 
> > > 1) There is no fairness between containers. A small container with
> > > a few large processes will be chosen over a large one with huge
> > > number of small processes.
> > > 
> > > 2) Containers often do not expect that some random process inside
> > > will be killed. So, in general, a much safer behavior is
> > > to kill the whole cgroup. Traditionally, this was implemented
> > > in userspace, but doing it in the kernel has some advantages,
> > > especially in a case of a system-wide OOM.
> > > 
> > > To address these issues, cgroup-aware OOM killer is introduced.
> > > Under OOM conditions, it looks for a memcg with highest oom score,
> > > and kills all processes inside.
> > > 
> > > Memcg oom score is calculated as a size of active and inactive
> > > anon LRU lists, unevictable LRU list and swap size.
> > > 
> > > For a cgroup-wide OOM, only cgroups belonging to the subtree of
> > > the OOMing cgroup are considered.
> > 
> > While this might make sense for some workloads/setups it is not a
> > generally acceptable policy IMHO. We have discussed that different OOM
> > policies might be interesting few years back at LSFMM but there was no
> > real consensus on how to do that. One possibility was to allow bpf like
> > mechanisms. Could you explore that path?
> 
> OOM policy is an orthogonal discussion, though.
> 
> The OOM killer's job is to pick a memory consumer to kill. Per default
> the unit of the memory consumer is a process, but cgroups allow
> grouping processes into compound consumers. Extending the OOM killer
> to respect the new definition of "consumer" is not a new policy.

I do not want to play word games here, but picking one or more tasks
is a policy from my POV; that is not all that important, though. My primary
point is that this new "implementation" is most probably not what people
who use memory cgroups outside of containers want. Why? Mostly because
they do not mind that only a part of the memcg stays alive, pretty
much like the current global OOM behavior where a single task (or its
children) is gone all of a sudden. Why should I kill the whole user
slice just because one of its processes went wild?
 
> I don't think it's reasonable to ask the person who's trying to make
> the OOM killer support group-consumers to design a dynamic OOM policy
> framework instead.
> 
> All we want is the OOM policy, whatever it is, applied to cgroups.

And I am not dismissing this usecase. I believe it is valid but not
universally applicable when memory cgroups are deployed. That is why
I think that we need a way to define those policies in some sane way.
Our current oom policies are basically random -
/proc/sys/vm/oom_kill_allocating_task resp. /proc/sys/vm/panic_on_oom.

I am not really sure we want another hardcoded one, e.g.
/proc/sys/vm/oom_kill_container, because even that might turn out not to be
a great fit for different container usecases. Do we want to kill the
largest container or the one with the largest memory hog? Should some
containers have a higher priority than others? I am pretty sure more
criteria would pop up with more usecases.

That's why I think that the current OOM killer implementation should
stay as a last resort and be process oriented and we should think about
a way to override it for particular usecases. The exact mechanism is not
completely clear to me to be honest.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH] mm, oom: cgroup-aware OOM-killer
  2017-05-18 16:28 [RFC PATCH] mm, oom: cgroup-aware OOM-killer Roman Gushchin
@ 2017-05-20 18:37   ` Vladimir Davydov
  2017-05-20 18:37   ` Vladimir Davydov
  1 sibling, 0 replies; 42+ messages in thread
From: Vladimir Davydov @ 2017-05-20 18:37 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Johannes Weiner, Tejun Heo, Li Zefan, Michal Hocko, Tetsuo Handa,
	kernel-team, cgroups, linux-doc, linux-kernel, linux-mm

Hello Roman,

On Thu, May 18, 2017 at 05:28:04PM +0100, Roman Gushchin wrote:
...
> +5-2-4. Cgroup-aware OOM Killer
> +
> +Cgroup v2 memory controller implements a cgroup-aware OOM killer.
> +It means that it treats memory cgroups as memory consumers
> +rather then individual processes. Under the OOM conditions it tries
> +to find an elegible leaf memory cgroup, and kill all processes
> +in this cgroup. If it's not possible (e.g. all processes belong
> +to the root cgroup), it falls back to the traditional per-process
> +behaviour.

I agree that the current OOM victim selection algorithm is totally
unfair in a system using containers and it has been crying for rework
for the last few years now, so it's great to see this finally coming.

However, I don't reckon that killing a whole leaf cgroup is always the
best practice. It does make sense when cgroups are used for
containerizing services or applications, because a service is unlikely
to remain operational after one of its processes is gone, but one can
also use cgroups to containerize processes started by a user. Kicking a
user out because one of her processes has gone mad doesn't sound right to me.

Another example when the policy you're suggesting fails in my opinion is
in case a service (cgroup) consists of sub-services (sub-cgroups) that
run processes. The main service may stop working normally if one of its
sub-services is killed. So it might make sense to kill not just an
individual process or a leaf cgroup, but the whole main service with all
its sub-services.

And both kinds of workloads (services/applications and individual
processes run by users) can co-exist on the same host - consider the
default systemd setup, for instance.

IMHO it would be better to give users a choice regarding what they
really want for a particular cgroup in case of OOM - killing the whole
cgroup or one of its descendants. For example, we could introduce a
per-cgroup flag that would tell the kernel whether the cgroup can
tolerate killing a descendant or not. If it can, the kernel will pick
the fattest sub-cgroup or process and check it. If it cannot, it will
kill the whole cgroup and all its processes and sub-cgroups.
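
A rough stand-alone sketch of what I have in mind (the flag name, the
structures and the helpers are made up purely for illustration; they are
not part of the patch):

#include <stdio.h>

/* Hypothetical per-cgroup flag; nothing like this exists in the posted
 * patch, it just models the descent described above. */
struct cgroup_node {
	const char *name;
	int oom_kill_descendant_ok;	/* may only a part of it be killed? */
	unsigned long usage;		/* pages charged to this subtree */
	struct cgroup_node **children;
	int nr_children;
};

static struct cgroup_node *pick_victim(struct cgroup_node *cg)
{
	struct cgroup_node *fattest = NULL;
	int i;

	/* The group does not tolerate partial kills (or is a leaf):
	 * the whole subtree becomes the victim. */
	if (!cg->oom_kill_descendant_ok || cg->nr_children == 0)
		return cg;

	/* Otherwise pick the fattest child and check it recursively. */
	for (i = 0; i < cg->nr_children; i++)
		if (!fattest || cg->children[i]->usage > fattest->usage)
			fattest = cg->children[i];

	return pick_victim(fattest);
}

int main(void)
{
	struct cgroup_node app1 = { "user/app1", 1, 1000, NULL, 0 };
	struct cgroup_node app2 = { "user/app2", 1, 3000, NULL, 0 };
	struct cgroup_node *kids[] = { &app1, &app2 };
	/* "user" tolerates losing a descendant, so only app2 is chosen. */
	struct cgroup_node user = { "user", 1, 4000, kids, 2 };

	printf("kill everything under %s\n", pick_victim(&user)->name);
	return 0;
}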

> +
> +The memory controller tries to make the best choise of a victim cgroup.
> +In general, it tries to select the largest cgroup, matching given
> +node/zone requirements, but the concrete algorithm is not defined,
> +and may be changed later.
> +
> +This affects both system- and cgroup-wide OOMs. For a cgroup-wide OOM
> +the memory controller considers only cgroups belonging to a sub-tree
> +of the OOM-ing cgroup, including itself.
...
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index c131f7e..8d07481 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2625,6 +2625,75 @@ static inline bool memcg_has_children(struct mem_cgroup *memcg)
>  	return ret;
>  }
>  
> +bool mem_cgroup_select_oom_victim(struct oom_control *oc)
> +{
> +	struct mem_cgroup *iter;
> +	unsigned long chosen_memcg_points;
> +
> +	oc->chosen_memcg = NULL;
> +
> +	if (mem_cgroup_disabled())
> +		return false;
> +
> +	if (!cgroup_subsys_on_dfl(memory_cgrp_subsys))
> +		return false;
> +
> +	pr_info("Choosing a victim memcg because of %s",
> +		oc->memcg ?
> +		"memory limit reached of cgroup " :
> +		"out of memory\n");
> +	if (oc->memcg) {
> +		pr_cont_cgroup_path(oc->memcg->css.cgroup);
> +		pr_cont("\n");
> +	}
> +
> +	chosen_memcg_points = 0;
> +
> +	for_each_mem_cgroup_tree(iter, oc->memcg) {
> +		unsigned long points;
> +		int nid;
> +
> +		if (mem_cgroup_is_root(iter))
> +			continue;
> +
> +		if (memcg_has_children(iter))
> +			continue;
> +
> +		points = 0;
> +		for_each_node_state(nid, N_MEMORY) {
> +			if (oc->nodemask && !node_isset(nid, *oc->nodemask))
> +				continue;
> +			points += mem_cgroup_node_nr_lru_pages(iter, nid,
> +					LRU_ALL_ANON | BIT(LRU_UNEVICTABLE));
> +		}
> +		points += mem_cgroup_get_nr_swap_pages(iter);

I guess we should also take into account kmem as well (unreclaimable
slabs, kernel stacks, socket buffers).

> +
> +		pr_info("Memcg ");
> +		pr_cont_cgroup_path(iter->css.cgroup);
> +		pr_cont(": %lu\n", points);
> +
> +		if (points > chosen_memcg_points) {
> +			if (oc->chosen_memcg)
> +				css_put(&oc->chosen_memcg->css);
> +
> +			oc->chosen_memcg = iter;
> +			css_get(&iter->css);
> +
> +			chosen_memcg_points = points;
> +		}
> +	}
> +
> +	if (oc->chosen_memcg) {
> +		pr_info("Kill memcg ");
> +		pr_cont_cgroup_path(oc->chosen_memcg->css.cgroup);
> +		pr_cont(" (%lu)\n", chosen_memcg_points);
> +	} else {
> +		pr_info("No elegible memory cgroup found\n");
> +	}
> +
> +	return !!oc->chosen_memcg;
> +}

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH] mm, oom: cgroup-aware OOM-killer
  2017-05-20 18:37   ` Vladimir Davydov
@ 2017-05-22 17:01     ` Roman Gushchin
  -1 siblings, 0 replies; 42+ messages in thread
From: Roman Gushchin @ 2017-05-22 17:01 UTC (permalink / raw)
  To: Vladimir Davydov
  Cc: Johannes Weiner, Tejun Heo, Li Zefan, Michal Hocko, Tetsuo Handa,
	kernel-team, cgroups, linux-doc, linux-kernel, linux-mm

On Sat, May 20, 2017 at 09:37:29PM +0300, Vladimir Davydov wrote:
> Hello Roman,

Hi Vladimir!

> 
> On Thu, May 18, 2017 at 05:28:04PM +0100, Roman Gushchin wrote:
> ...
> > +5-2-4. Cgroup-aware OOM Killer
> > +
> > +Cgroup v2 memory controller implements a cgroup-aware OOM killer.
> > +It means that it treats memory cgroups as memory consumers
> > +rather then individual processes. Under the OOM conditions it tries
> > +to find an elegible leaf memory cgroup, and kill all processes
> > +in this cgroup. If it's not possible (e.g. all processes belong
> > +to the root cgroup), it falls back to the traditional per-process
> > +behaviour.
> 
> I agree that the current OOM victim selection algorithm is totally
> unfair in a system using containers and it has been crying for rework
> for the last few years now, so it's great to see this finally coming.
> 
> However, I don't reckon that killing a whole leaf cgroup is always the
> best practice. It does make sense when cgroups are used for
> containerizing services or applications, because a service is unlikely
> to remain operational after one of its processes is gone, but one can
> also use cgroups to containerize processes started by a user. Kicking a
> user out because one of her processes has gone mad doesn't sound right to me.

I agree that it's not always the best practice if you're not allowed
to change the cgroup configuration (e.g. create new cgroups).
IMHO, this case is mostly covered by using the v1 cgroup interface,
which remains unchanged.
If you do have control over cgroups, you can put processes into
separate cgroups, and obtain control over OOM victim selection and killing.

> Another example when the policy you're suggesting fails in my opinion is
> in case a service (cgroup) consists of sub-services (sub-cgroups) that
> run processes. The main service may stop working normally if one of its
> sub-services is killed. So it might make sense to kill not just an
> individual process or a leaf cgroup, but the whole main service with all
> its sub-services.

I agree, although I do not pretend to solve all possible
userspace problems caused by an OOM.

How to react to an OOM is definitely a policy, which depends
on the workload. Nothing changes here from how it works now,
except that the kernel will now choose a victim cgroup, and kill that cgroup
rather than a single process.

> And both kinds of workloads (services/applications and individual
> processes run by users) can co-exist on the same host - consider the
> default systemd setup, for instance.
> 
> IMHO it would be better to give users a choice regarding what they
> really want for a particular cgroup in case of OOM - killing the whole
> cgroup or one of its descendants. For example, we could introduce a
> per-cgroup flag that would tell the kernel whether the cgroup can
> tolerate killing a descendant or not. If it can, the kernel will pick
> the fattest sub-cgroup or process and check it. If it cannot, it will
> kill the whole cgroup and all its processes and sub-cgroups.

The last thing we want to do is to compare processes with cgroups.
I agree that we can have some option to disable the cgroup-aware OOM entirely,
mostly for backward compatibility. But I don't think it should be a
per-cgroup configuration option, which we will have to support forever.

> 
> > +
> > +The memory controller tries to make the best choise of a victim cgroup.
> > +In general, it tries to select the largest cgroup, matching given
> > +node/zone requirements, but the concrete algorithm is not defined,
> > +and may be changed later.
> > +
> > +This affects both system- and cgroup-wide OOMs. For a cgroup-wide OOM
> > +the memory controller considers only cgroups belonging to a sub-tree
> > +of the OOM-ing cgroup, including itself.
> ...
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index c131f7e..8d07481 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -2625,6 +2625,75 @@ static inline bool memcg_has_children(struct mem_cgroup *memcg)
> >  	return ret;
> >  }
> >  
> > +bool mem_cgroup_select_oom_victim(struct oom_control *oc)
> > +{
> > +	struct mem_cgroup *iter;
> > +	unsigned long chosen_memcg_points;
> > +
> > +	oc->chosen_memcg = NULL;
> > +
> > +	if (mem_cgroup_disabled())
> > +		return false;
> > +
> > +	if (!cgroup_subsys_on_dfl(memory_cgrp_subsys))
> > +		return false;
> > +
> > +	pr_info("Choosing a victim memcg because of %s",
> > +		oc->memcg ?
> > +		"memory limit reached of cgroup " :
> > +		"out of memory\n");
> > +	if (oc->memcg) {
> > +		pr_cont_cgroup_path(oc->memcg->css.cgroup);
> > +		pr_cont("\n");
> > +	}
> > +
> > +	chosen_memcg_points = 0;
> > +
> > +	for_each_mem_cgroup_tree(iter, oc->memcg) {
> > +		unsigned long points;
> > +		int nid;
> > +
> > +		if (mem_cgroup_is_root(iter))
> > +			continue;
> > +
> > +		if (memcg_has_children(iter))
> > +			continue;
> > +
> > +		points = 0;
> > +		for_each_node_state(nid, N_MEMORY) {
> > +			if (oc->nodemask && !node_isset(nid, *oc->nodemask))
> > +				continue;
> > +			points += mem_cgroup_node_nr_lru_pages(iter, nid,
> > +					LRU_ALL_ANON | BIT(LRU_UNEVICTABLE));
> > +		}
> > +		points += mem_cgroup_get_nr_swap_pages(iter);
> 
> I guess we should also take into account kmem as well (unreclaimable
> slabs, kernel stacks, socket buffers).

Added to v2 (I'll post it soon).
Good idea, thanks!
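
Just to illustrate the direction (a stand-alone model with made-up numbers,
not the actual v2 code; the fields simply mirror the counters you listed):

#include <stdio.h>

/* Illustrative only: the fields are assumptions based on the discussion
 * above, not the real memcg counters used by v2. */
struct memcg_sample {
	unsigned long anon;			/* active + inactive anon, pages */
	unsigned long unevictable;		/* unevictable LRU, pages */
	unsigned long swap;			/* swap entries */
	unsigned long slab_unreclaimable;	/* unreclaimable slab, pages */
	unsigned long kernel_stack;		/* kernel stacks, pages */
	unsigned long sock;			/* socket buffers, pages */
};

static unsigned long oom_score(const struct memcg_sample *m)
{
	return m->anon + m->unevictable + m->swap +
	       m->slab_unreclaimable + m->kernel_stack + m->sock;
}

int main(void)
{
	/* hypothetical numbers for one leaf memcg */
	struct memcg_sample job = {
		.anon = 500000, .unevictable = 1000, .swap = 20000,
		.slab_unreclaimable = 30000, .kernel_stack = 500, .sock = 2000,
	};

	printf("oom score: %lu pages\n", oom_score(&job));
	return 0;
}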

--
Roman

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH] mm, oom: cgroup-aware OOM-killer
  2017-05-22 17:01     ` Roman Gushchin
@ 2017-05-23  7:07       ` Michal Hocko
  -1 siblings, 0 replies; 42+ messages in thread
From: Michal Hocko @ 2017-05-23  7:07 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Vladimir Davydov, Johannes Weiner, Tejun Heo, Li Zefan,
	Tetsuo Handa, kernel-team, cgroups, linux-doc, linux-kernel,
	linux-mm

On Mon 22-05-17 18:01:16, Roman Gushchin wrote:
> On Sat, May 20, 2017 at 09:37:29PM +0300, Vladimir Davydov wrote:
> > Hello Roman,
> 
> Hi Vladimir!
> 
> > 
> > On Thu, May 18, 2017 at 05:28:04PM +0100, Roman Gushchin wrote:
> > ...
> > > +5-2-4. Cgroup-aware OOM Killer
> > > +
> > > +Cgroup v2 memory controller implements a cgroup-aware OOM killer.
> > > +It means that it treats memory cgroups as memory consumers
> > > +rather then individual processes. Under the OOM conditions it tries
> > > +to find an elegible leaf memory cgroup, and kill all processes
> > > +in this cgroup. If it's not possible (e.g. all processes belong
> > > +to the root cgroup), it falls back to the traditional per-process
> > > +behaviour.
> > 
> > I agree that the current OOM victim selection algorithm is totally
> > unfair in a system using containers and it has been crying for rework
> > for the last few years now, so it's great to see this finally coming.
> > 
> > However, I don't reckon that killing a whole leaf cgroup is always the
> > best practice. It does make sense when cgroups are used for
> > containerizing services or applications, because a service is unlikely
> > to remain operational after one of its processes is gone, but one can
> > also use cgroups to containerize processes started by a user. Kicking a
> > user out because one of her processes has gone mad doesn't sound right to me.
> 
> I agree that it's not always the best practice if you're not allowed
> to change the cgroup configuration (e.g. create new cgroups).
> IMHO, this case is mostly covered by using the v1 cgroup interface,
> which remains unchanged.

But there are features which are v2 only and users might really want to
use them. So I really do not buy this v2-only argument.

> If you do have control over cgroups, you can put processes into
> separate cgroups, and obtain control over OOM victim selection and killing.

Usually you do not have that control because there is a global daemon
doing the placement for you.

> > Another example when the policy you're suggesting fails in my opinion is
> > in case a service (cgroup) consists of sub-services (sub-cgroups) that
> > run processes. The main service may stop working normally if one of its
> > sub-services is killed. So it might make sense to kill not just an
> > individual process or a leaf cgroup, but the whole main service with all
> > its sub-services.
> 
> I agree, although I do not pretend to solve all possible
> userspace problems caused by an OOM.
> 
> How to react to an OOM is definitely a policy, which depends
> on the workload. Nothing changes here from how it works now,
> except that the kernel will now choose a victim cgroup, and kill that cgroup
> rather than a single process.

There is a _big_ difference. The current implementation just tries
to recover from the OOM situation without caring much about the
consequences for the workload. This is the last resort and a service for
the _system_ to get back to a sane state. You are trying to make it more
clever and workload aware, and that is inevitably going to depend on the
specific workload. I really do think we cannot simply hardcode any
policy into the kernel for this purpose, and that is why I would like to
see a discussion about how to do that in a more extensible way. This
might be harder to implement now, but I believe it will turn out
better in the long term.

> > And both kinds of workloads (services/applications and individual
> > processes run by users) can co-exist on the same host - consider the
> > default systemd setup, for instance.
> > 
> > IMHO it would be better to give users a choice regarding what they
> > really want for a particular cgroup in case of OOM - killing the whole
> > cgroup or one of its descendants. For example, we could introduce a
> > per-cgroup flag that would tell the kernel whether the cgroup can
> > tolerate killing a descendant or not. If it can, the kernel will pick
> > the fattest sub-cgroup or process and check it. If it cannot, it will
> > kill the whole cgroup and all its processes and sub-cgroups.
> 
> The last thing we want to do is to compare processes with cgroups.
> I agree that we can have some option to disable the cgroup-aware OOM entirely,
> mostly for backward compatibility. But I don't think it should be a
> per-cgroup configuration option, which we will have to support forever.

I can clearly see a demand for "this container is definitely more important
than the others, so do not kill it" usecases. I can also see demand
for "do not kill this container, it has been running for X days". And more are
likely to pop up.
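
For example (a made-up user-space model, not a proposal for the real
interface), a per-memcg priority compared before size would already express
the first of those usecases:

#include <stdio.h>

/* Hypothetical knob, not an existing interface: a higher priority means
 * "more important, kill last", so size only breaks ties within the same
 * priority. */
struct candidate {
	const char *name;
	int oom_priority;	/* higher = more important */
	unsigned long usage;	/* pages */
};

/* returns non-zero if 'a' is the better victim than 'b' */
static int worse(const struct candidate *a, const struct candidate *b)
{
	if (a->oom_priority != b->oom_priority)
		return a->oom_priority < b->oom_priority;
	return a->usage > b->usage;
}

int main(void)
{
	struct candidate db = { "db", 10, 800000 };
	struct candidate batch = { "batch", 0, 200000 };
	const struct candidate *victim = worse(&db, &batch) ? &db : &batch;

	/* the smaller but less important "batch" container is picked */
	printf("would kill %s\n", victim->name);
	return 0;
}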
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH] mm, oom: cgroup-aware OOM-killer
  2017-05-23  7:07       ` Michal Hocko
@ 2017-05-23 13:25         ` Johannes Weiner
  -1 siblings, 0 replies; 42+ messages in thread
From: Johannes Weiner @ 2017-05-23 13:25 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Roman Gushchin, Vladimir Davydov, Tejun Heo, Li Zefan,
	Tetsuo Handa, kernel-team, cgroups, linux-doc, linux-kernel,
	linux-mm

On Tue, May 23, 2017 at 09:07:47AM +0200, Michal Hocko wrote:
> On Mon 22-05-17 18:01:16, Roman Gushchin wrote:
> > On Sat, May 20, 2017 at 09:37:29PM +0300, Vladimir Davydov wrote:
> > > On Thu, May 18, 2017 at 05:28:04PM +0100, Roman Gushchin wrote:
> > > ...
> > > > +5-2-4. Cgroup-aware OOM Killer
> > > > +
> > > > +Cgroup v2 memory controller implements a cgroup-aware OOM killer.
> > > > +It means that it treats memory cgroups as memory consumers
> > > > +rather then individual processes. Under the OOM conditions it tries
> > > > +to find an elegible leaf memory cgroup, and kill all processes
> > > > +in this cgroup. If it's not possible (e.g. all processes belong
> > > > +to the root cgroup), it falls back to the traditional per-process
> > > > +behaviour.
> > > 
> > > I agree that the current OOM victim selection algorithm is totally
> > > unfair in a system using containers and it has been crying for rework
> > > for the last few years now, so it's great to see this finally coming.
> > > 
> > > However, I don't reckon that killing a whole leaf cgroup is always the
> > > best practice. It does make sense when cgroups are used for
> > > containerizing services or applications, because a service is unlikely
> > > to remain operational after one of its processes is gone, but one can
> > > also use cgroups to containerize processes started by a user. Kicking a
> > > user out for one of her process has gone mad doesn't sound right to me.
> > 
> > I agree, that it's not always a best practise, if you're not allowed
> > to change the cgroup configuration (e.g. create new cgroups).
> > IMHO, this case is mostly covered by using the v1 cgroup interface,
> > which remains unchanged.
> 
> But there are features which are v2 only and users might really want to
> use it. So I really do not buy this v2-only argument.

I have to agree here. We won't get around making the leaf killing
opt-in or opt-out in some fashion.

> > > Another example when the policy you're suggesting fails in my opinion is
> > > in case a service (cgroup) consists of sub-services (sub-cgroups) that
> > > run processes. The main service may stop working normally if one of its
> > > sub-services is killed. So it might make sense to kill not just an
> > > individual process or a leaf cgroup, but the whole main service with all
> > > its sub-services.
> > 
> > I agree, although I do not pretend for solving all possible
> > userspace problems caused by an OOM.
> > 
> > How to react on an OOM - is definitely a policy, which depends
> > on the workload. Nothing is changing here from how it's working now,
> > except now kernel will choose a victim cgroup, and kill the victim cgroup
> > rather than a process.
> 
> There is a _big_ difference. The current implementation just tries
> to recover from the OOM situation without carying much about the
> consequences on the workload. This is the last resort and a services for
> the _system_ to get back to sane state. You are trying to make it more
> clever and workload aware and that is inevitable going to depend on the
> specific workload. I really do think we cannot simply hardcode any
> policy into the kernel for this purpose and that is why I would like to
> see a discussion about how to do that in a more extensible way. This
> might be harder to implement now but it I believe it will turn out
> better longerm.

And that's where I still maintain that this isn't really a policy
change. Because what this code does ISN'T more clever, and the OOM
killer STILL IS a last-resort thing. We don't need any elaborate
just-in-time evaluation of what each entity is worth. We just want to
kill the biggest job, not the biggest MM. Just like you wouldn't want
just the biggest VMA unmapped and freed, since it leaves your process
incoherent, killing one process leaves a job incoherent.

I understand that making it fully configurable is a tempting thought,
because you'd offload all responsibility to userspace. But on the
other hand, this was brought up years ago and nothing has happened
since. And to me this is evidence that nobody really cares all that
much. Because it's still a rather rare event, and there isn't much you
cannot accomplish with periodic score adjustments.
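
To illustrate what I mean by periodic score adjustments, here is a
minimal userspace sketch that biases one task through the existing
/proc/<pid>/oom_score_adj interface (range -1000..1000). The pid, the
value and the policy that decides them are placeholders that a real
daemon would supply on a timer:

#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>

/* Sketch only: write an oom_score_adj value for one pid. */
static int set_oom_score_adj(pid_t pid, int adj)
{
        char path[64];
        FILE *f;

        snprintf(path, sizeof(path), "/proc/%d/oom_score_adj", (int)pid);
        f = fopen(path, "w");
        if (!f)
                return -1;
        fprintf(f, "%d\n", adj);
        return fclose(f);
}

int main(int argc, char **argv)
{
        if (argc != 3) {
                fprintf(stderr, "usage: %s <pid> <adj>\n", argv[0]);
                return 1;
        }
        return set_oom_score_adj((pid_t)atoi(argv[1]), atoi(argv[2])) ? 1 : 0;
}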

> > > And both kinds of workloads (services/applications and individual
> > > processes run by users) can co-exist on the same host - consider the
> > > default systemd setup, for instance.
> > > 
> > > IMHO it would be better to give users a choice regarding what they
> > > really want for a particular cgroup in case of OOM - killing the whole
> > > cgroup or one of its descendants. For example, we could introduce a
> > > per-cgroup flag that would tell the kernel whether the cgroup can
> > > tolerate killing a descendant or not. If it can, the kernel will pick
> > > the fattest sub-cgroup or process and check it. If it cannot, it will
> > > kill the whole cgroup and all its processes and sub-cgroups.
> > 
> > The last thing we want to do, is to compare processes with cgroups.
> > I agree, that we can have some option to disable the cgroup-aware OOM at all,
> > mostly for backward-compatibility. But I don't think it should be a
> > per-cgroup configuration option, which we will support forever.
> 
> I can clearly see a demand for "this is definitely more important
> container than others so do not kill" usecases. I can also see demand
> for "do not kill this container running for X days". And more are likely
> to pop out.

That can all be done with scoring.

In fact, we HAD the oom killer consider a target's cputime/runtime
before, and David replaced it all with simple scoring in a63d83f427fb
("oom: badness heuristic rewrite").

This was 10 years ago, and nobody has missed anything critical enough
to implement something beyond scoring. So I don't see why we'd need to
do it for cgroups all of a sudden.

They're nothing special, they just group together things we have been
OOM killing for ages. So why shouldn't we use the same config model?

It seems to me, what we need for this patch is 1) a way to toggle
whether the processes and subgroups of a group are interdependent or
independent and 2) configurable OOM scoring per cgroup analogous to
what we have per process already. If a group is marked interdependent
we stop descending into it and evaluate it as one entity. Otherwise,
we go look for victims in its subgroups and individual processes.
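
Something like this, as a toy userspace model of that descent (none of
these names exist in the kernel; each group's score stands in for
whatever badness metric we end up exposing, and individual processes
would be compared the same way as leaf entries):

#include <stddef.h>

struct group {
        int interdependent;        /* evaluate/kill as one unit? */
        unsigned long score;       /* badness of this entity */
        struct group **children;
        size_t nr_children;
};

/* Pick the worst entity: stop at interdependent groups, else descend. */
static struct group *pick_victim(struct group *g)
{
        struct group *best = NULL;
        size_t i;

        if (g->interdependent || g->nr_children == 0)
                return g;

        for (i = 0; i < g->nr_children; i++) {
                struct group *cand = pick_victim(g->children[i]);

                if (!best || cand->score > best->score)
                        best = cand;
        }
        return best;
}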

Are there real-life usecases that wouldn't be covered by this?

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH] mm, oom: cgroup-aware OOM-killer
  2017-05-23 13:25         ` Johannes Weiner
@ 2017-05-25 15:38           ` Michal Hocko
  -1 siblings, 0 replies; 42+ messages in thread
From: Michal Hocko @ 2017-05-25 15:38 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Roman Gushchin, Vladimir Davydov, Tejun Heo, Li Zefan,
	Tetsuo Handa, kernel-team, cgroups, linux-doc, linux-kernel,
	linux-mm

On Tue 23-05-17 09:25:44, Johannes Weiner wrote:
> On Tue, May 23, 2017 at 09:07:47AM +0200, Michal Hocko wrote:
> > On Mon 22-05-17 18:01:16, Roman Gushchin wrote:
[...]
> > > How to react on an OOM - is definitely a policy, which depends
> > > on the workload. Nothing is changing here from how it's working now,
> > > except now kernel will choose a victim cgroup, and kill the victim cgroup
> > > rather than a process.
> > 
> > There is a _big_ difference. The current implementation just tries
> > to recover from the OOM situation without carying much about the
> > consequences on the workload. This is the last resort and a services for
> > the _system_ to get back to sane state. You are trying to make it more
> > clever and workload aware and that is inevitable going to depend on the
> > specific workload. I really do think we cannot simply hardcode any
> > policy into the kernel for this purpose and that is why I would like to
> > see a discussion about how to do that in a more extensible way. This
> > might be harder to implement now but it I believe it will turn out
> > better longerm.
> 
> And that's where I still maintain that this isn't really a policy
> change. Because what this code does ISN'T more clever, and the OOM
> killer STILL IS a last-resort thing.

The thing I wanted to point out is that what and how much to kill
definitely depends on the usecase. We currently kill all tasks which
share the mm struct because that is the smallest unit that can unpin
user memory. And that makes a lot of sense to me as a general default.
I would call any attempt to guess which tasks belong to the same
workload/job "more clever".

> We don't need any elaborate
> just-in-time evaluation of what each entity is worth. We just want to
> kill the biggest job, not the biggest MM. Just like you wouldn't want
> just the biggest VMA unmapped and freed, since it leaves your process
> incoherent, killing one process leaves a job incoherent.
> 
> I understand that making it fully configurable is a tempting thought,
> because you'd offload all responsibility to userspace.

It is not only tempting it is also the only place which can define
a more advanced OOM semantic sanely IMHO.

> But on the
> other hand, this was brought up years ago and nothing has happened
> since. And to me this is evidence that nobody really cares all that
> much. Because it's still a rather rare event, and there isn't much you
> cannot accomplish with periodic score adjustments.

Yes and there were no attempts since then which suggests that people
didn't care all that much. Maybe things have changed now that containers
got much more popular.

> > > > And both kinds of workloads (services/applications and individual
> > > > processes run by users) can co-exist on the same host - consider the
> > > > default systemd setup, for instance.
> > > > 
> > > > IMHO it would be better to give users a choice regarding what they
> > > > really want for a particular cgroup in case of OOM - killing the whole
> > > > cgroup or one of its descendants. For example, we could introduce a
> > > > per-cgroup flag that would tell the kernel whether the cgroup can
> > > > tolerate killing a descendant or not. If it can, the kernel will pick
> > > > the fattest sub-cgroup or process and check it. If it cannot, it will
> > > > kill the whole cgroup and all its processes and sub-cgroups.
> > > 
> > > The last thing we want to do, is to compare processes with cgroups.
> > > I agree, that we can have some option to disable the cgroup-aware OOM at all,
> > > mostly for backward-compatibility. But I don't think it should be a
> > > per-cgroup configuration option, which we will support forever.
> > 
> > I can clearly see a demand for "this is definitely more important
> > container than others so do not kill" usecases. I can also see demand
> > for "do not kill this container running for X days". And more are likely
> > to pop out.
> 
> That can all be done with scoring.

Maybe. But that requires somebody to tweak the scoring, which can be far
from trivial.
 
> In fact, we HAD the oom killer consider a target's cputime/runtime
> before, and David replaced it all with simple scoring in a63d83f427fb
> ("oom: badness heuristic rewrite").

Yes, that is correct, and I agree that this was definitely a step in the
right direction, because time-based heuristics tend to behave very
unpredictably in general workloads.

> This was 10 years ago, and nobody has missed anything critical enough
> to implement something beyond scoring. So I don't see why we'd need to
> do it for cgroups all of a sudden.
> 
> They're nothing special, they just group together things we have been
> OOM killing for ages. So why shouldn't we use the same config model?
> 
> It seems to me, what we need for this patch is 1) a way to toggle
> whether the processes and subgroups of a group are interdependent or
> independent and 2) configurable OOM scoring per cgroup analogous to
> what we have per process already. If a group is marked interdependent
> we stop descending into it and evaluate it as one entity. Otherwise,
> we go look for victims in its subgroups and individual processes.

This would be an absolute minimum, yes.

But I am still not convinced we should make this somehow "hardcoded" in
the core oom killer handler.  Why cannot we allow a callback for modules
and implement all these non-default OOM strategies in modules? We have
oom_notify_list already but that doesn't get the full oom context which
could be fixable but I suspect this is not the greatest interface at
all. We do not really need multiple implementations of the OOM handling
at the same time and a simple callback should be sufficient

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 04c9143a8625..926a36625322 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -995,6 +995,13 @@ bool out_of_memory(struct oom_control *oc)
 	}
 
 	/*
+	 * Try a registered oom handler to run and fallback to the default
+	 * implementation if it cannot handle the current oom context
+	 */
+	if (oom_handler && oom_handler(oc))
+		return true;
+
+	/*
 	 * If current has a pending SIGKILL or is exiting, then automatically
 	 * select it.  The goal is to allow it to allocate so that it may
 	 * quickly exit and free its memory.

Please note that I haven't explored how much of the infrastructure
needed for the OOM decision making is available to modules. But we can
export a lot of what we currently have in oom_kill.c. I admit it might
turn out that this is simply not feasible but I would like this to be at
least explored before we go and implement yet another hardcoded way to
handle (see how I didn't use policy ;)) OOM situation.
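
Just to sketch the module side of this idea (it assumes
register_oom_handler()/unregister_oom_handler() helpers which the hunk
above does not add, so take it as hand waving rather than a real
interface):

#include <linux/module.h>
#include <linux/oom.h>

/*
 * Hand-waved example policy module. Returning false defers to the
 * default OOM handling; a real strategy would inspect the oom context
 * and select/kill its own victim before returning true.
 */
static bool example_oom_handler(struct oom_control *oc)
{
        return false;
}

static int __init example_oom_init(void)
{
        /* register_oom_handler() is assumed, not part of the hunk above */
        return register_oom_handler(example_oom_handler);
}

static void __exit example_oom_exit(void)
{
        unregister_oom_handler(example_oom_handler);
}

module_init(example_oom_init);
module_exit(example_oom_exit);
MODULE_LICENSE("GPL");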

> Are there real-life usecases that wouldn't be covered by this?

I really do not dare to envision that.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH] mm, oom: cgroup-aware OOM-killer
  2017-05-25 15:38           ` Michal Hocko
@ 2017-05-25 17:08             ` Johannes Weiner
  -1 siblings, 0 replies; 42+ messages in thread
From: Johannes Weiner @ 2017-05-25 17:08 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Roman Gushchin, Vladimir Davydov, Tejun Heo, Li Zefan,
	Tetsuo Handa, kernel-team, cgroups, linux-doc, linux-kernel,
	linux-mm

On Thu, May 25, 2017 at 05:38:19PM +0200, Michal Hocko wrote:
> On Tue 23-05-17 09:25:44, Johannes Weiner wrote:
> > On Tue, May 23, 2017 at 09:07:47AM +0200, Michal Hocko wrote:
> > > On Mon 22-05-17 18:01:16, Roman Gushchin wrote:
> [...]
> > > > How to react on an OOM - is definitely a policy, which depends
> > > > on the workload. Nothing is changing here from how it's working now,
> > > > except now kernel will choose a victim cgroup, and kill the victim cgroup
> > > > rather than a process.
> > > 
> > > There is a _big_ difference. The current implementation just tries
> > > to recover from the OOM situation without carying much about the
> > > consequences on the workload. This is the last resort and a services for
> > > the _system_ to get back to sane state. You are trying to make it more
> > > clever and workload aware and that is inevitable going to depend on the
> > > specific workload. I really do think we cannot simply hardcode any
> > > policy into the kernel for this purpose and that is why I would like to
> > > see a discussion about how to do that in a more extensible way. This
> > > might be harder to implement now but it I believe it will turn out
> > > better longerm.
> > 
> > And that's where I still maintain that this isn't really a policy
> > change. Because what this code does ISN'T more clever, and the OOM
> > killer STILL IS a last-resort thing.
> 
> The thing I wanted to point out is that what and how much to kill
> definitely depends on the usecase. We currently kill all tasks which
> share the mm struct because that is the smallest unit that can unpin
> user memory. And that makes a lot of sense to me as a general default.
> I would call any attempt to guess which tasks belong to the same
> workload/job "more clever".

Yeah, I agree it needs to be configurable. But a memory domain is not
a random guess. It's a core concept of the VM at this point. The fact
that the OOM killer cannot handle it is pretty weird and goes way
beyond "I wish we could have some smarter heuristics to choose from."

> > We don't need any elaborate
> > just-in-time evaluation of what each entity is worth. We just want to
> > kill the biggest job, not the biggest MM. Just like you wouldn't want
> > just the biggest VMA unmapped and freed, since it leaves your process
> > incoherent, killing one process leaves a job incoherent.
> > 
> > I understand that making it fully configurable is a tempting thought,
> > because you'd offload all responsibility to userspace.
> 
> It is not only tempting it is also the only place which can define
> a more advanced OOM semantic sanely IMHO.

Why do you think that?

Everything the user would want to dynamically program in the kernel,
say with bpf, they could do in userspace and then update the scores
for each group and task periodically.

The only limitation is that you have to recalculate and update the
scoring tree every once in a while, whereas a bpf program could
evaluate things just-in-time. But for that to matter in practice, OOM
kills would have to be a fairly hot path.

> > > > > And both kinds of workloads (services/applications and individual
> > > > > processes run by users) can co-exist on the same host - consider the
> > > > > default systemd setup, for instance.
> > > > > 
> > > > > IMHO it would be better to give users a choice regarding what they
> > > > > really want for a particular cgroup in case of OOM - killing the whole
> > > > > cgroup or one of its descendants. For example, we could introduce a
> > > > > per-cgroup flag that would tell the kernel whether the cgroup can
> > > > > tolerate killing a descendant or not. If it can, the kernel will pick
> > > > > the fattest sub-cgroup or process and check it. If it cannot, it will
> > > > > kill the whole cgroup and all its processes and sub-cgroups.
> > > > 
> > > > The last thing we want to do, is to compare processes with cgroups.
> > > > I agree, that we can have some option to disable the cgroup-aware OOM at all,
> > > > mostly for backward-compatibility. But I don't think it should be a
> > > > per-cgroup configuration option, which we will support forever.
> > > 
> > > I can clearly see a demand for "this is definitely more important
> > > container than others so do not kill" usecases. I can also see demand
> > > for "do not kill this container running for X days". And more are likely
> > > to pop out.
> > 
> > That can all be done with scoring.
> 
> Maybe. But that requires somebody to tweak the scoring, which can be far
> from trivial.

Why is sorting and picking in userspace harder than sorting and
picking in the kernel?

> > This was 10 years ago, and nobody has missed anything critical enough
> > to implement something beyond scoring. So I don't see why we'd need to
> > do it for cgroups all of a sudden.
> > 
> > They're nothing special, they just group together things we have been
> > OOM killing for ages. So why shouldn't we use the same config model?
> > 
> > It seems to me, what we need for this patch is 1) a way to toggle
> > whether the processes and subgroups of a group are interdependent or
> > independent and 2) configurable OOM scoring per cgroup analogous to
> > what we have per process already. If a group is marked interdependent
> > we stop descending into it and evaluate it as one entity. Otherwise,
> > we go look for victims in its subgroups and individual processes.
> 
> This would be an absolute minimum, yes.
> 
> But I am still not convinced we should make this somehow "hardcoded" in
> the core oom killer handler.  Why cannot we allow a callback for modules
> and implement all these non-default OOM strategies in modules? We have
> oom_notify_list already but that doesn't get the full oom context which
> could be fixable but I suspect this is not the greatest interface at
> all. We do not really need multiple implementations of the OOM handling
> at the same time and a simple callback should be sufficient
> 
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 04c9143a8625..926a36625322 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -995,6 +995,13 @@ bool out_of_memory(struct oom_control *oc)
>  	}
>  
>  	/*
> +	 * Try a registered oom handler to run and fallback to the default
> +	 * implementation if it cannot handle the current oom context
> +	 */
> +	if (oom_handler && oom_handler(oc))
> +		return true;

I think this would take us back to the dark days when memcg entry
points were big opaque branches in the generic VM code, which then
implemented their own thing, with redundant locking and redundant LRU
lists, all of which was very hard to maintain.

> +	/*
>  	 * If current has a pending SIGKILL or is exiting, then automatically
>  	 * select it.  The goal is to allow it to allocate so that it may
>  	 * quickly exit and free its memory.
> 
> Please note that I haven't explored how much of the infrastructure
> needed for the OOM decision making is available to modules. But we can
> export a lot of what we currently have in oom_kill.c. I admit it might
> turn out that this is simply not feasible but I would like this to be at
> least explored before we go and implement yet another hardcoded way to
> handle (see how I didn't use policy ;)) OOM situation.

;)

My doubt here is mainly that we'll see many (or any) real-life cases
materialize that cannot be handled with cgroups and scoring. These are
powerful building blocks on which userspace can implement all kinds of
policy and sorting algorithms.

So this seems like a lot of churn and complicated code to handle one
extension. An extension that implements basic functionality.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH] mm, oom: cgroup-aware OOM-killer
  2017-05-25 17:08             ` Johannes Weiner
@ 2017-05-31 16:25               ` Michal Hocko
  -1 siblings, 0 replies; 42+ messages in thread
From: Michal Hocko @ 2017-05-31 16:25 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Roman Gushchin, Vladimir Davydov, Tejun Heo, Li Zefan,
	Tetsuo Handa, kernel-team, cgroups, linux-doc, linux-kernel,
	linux-mm

[I am sorry I didn't get to reply earlier]

On Thu 25-05-17 13:08:05, Johannes Weiner wrote:
> On Thu, May 25, 2017 at 05:38:19PM +0200, Michal Hocko wrote:
> > On Tue 23-05-17 09:25:44, Johannes Weiner wrote:
[...]
> > > We don't need any elaborate
> > > just-in-time evaluation of what each entity is worth. We just want to
> > > kill the biggest job, not the biggest MM. Just like you wouldn't want
> > > just the biggest VMA unmapped and freed, since it leaves your process
> > > incoherent, killing one process leaves a job incoherent.
> > > 
> > > I understand that making it fully configurable is a tempting thought,
> > > because you'd offload all responsibility to userspace.
> > 
> > It is not only tempting it is also the only place which can define
> > a more advanced OOM semantic sanely IMHO.
> 
> Why do you think that?

Because I believe that once we make the oom killer somehow workload
aware, people will start demanding tweaks for their particular usecase.

> Everything the user would want to dynamically program in the kernel,
> say with bpf, they could do in userspace and then update the scores
> for each group and task periodically.

I am rather skeptical about dynamic scores. oom_{score_}adj has turned
into mere oom disable/enable knobs in my experience.

> The only limitation is that you have to recalculate and update the
> scoring tree every once in a while, whereas a bpf program could
> evaluate things just-in-time. But for that to matter in practice, OOM
> kills would have to be a fairly hot path.

I am not really sure how to reliably implement a "kill the memcg with
the largest process" strategy. And who knows how many other strategies
will pop up.

> > > > > > And both kinds of workloads (services/applications and individual
> > > > > > processes run by users) can co-exist on the same host - consider the
> > > > > > default systemd setup, for instance.
> > > > > > 
> > > > > > IMHO it would be better to give users a choice regarding what they
> > > > > > really want for a particular cgroup in case of OOM - killing the whole
> > > > > > cgroup or one of its descendants. For example, we could introduce a
> > > > > > per-cgroup flag that would tell the kernel whether the cgroup can
> > > > > > tolerate killing a descendant or not. If it can, the kernel will pick
> > > > > > the fattest sub-cgroup or process and check it. If it cannot, it will
> > > > > > kill the whole cgroup and all its processes and sub-cgroups.
> > > > > 
> > > > > The last thing we want to do, is to compare processes with cgroups.
> > > > > I agree, that we can have some option to disable the cgroup-aware OOM at all,
> > > > > mostly for backward-compatibility. But I don't think it should be a
> > > > > per-cgroup configuration option, which we will support forever.
> > > > 
> > > > I can clearly see a demand for "this is definitely more important
> > > > container than others so do not kill" usecases. I can also see demand
> > > > for "do not kill this container running for X days". And more are likely
> > > > to pop out.
> > > 
> > > That can all be done with scoring.
> > 
> > Maybe. But that requires somebody to tweak the scoring which can be hard
> > from trivial.
> 
> Why is sorting and picking in userspace harder than sorting and
> picking in the kernel?

Because the userspace score based approach would be much more racy,
especially on a busy system. This could lead to unexpected behavior
where the OOM killer kills a different memcg than the runaway one.

> > > This was 10 years ago, and nobody has missed anything critical enough
> > > to implement something beyond scoring. So I don't see why we'd need to
> > > do it for cgroups all of a sudden.
> > > 
> > > They're nothing special, they just group together things we have been
> > > OOM killing for ages. So why shouldn't we use the same config model?
> > > 
> > > It seems to me, what we need for this patch is 1) a way to toggle
> > > whether the processes and subgroups of a group are interdependent or
> > > independent and 2) configurable OOM scoring per cgroup analogous to
> > > what we have per process already. If a group is marked interdependent
> > > we stop descending into it and evaluate it as one entity. Otherwise,
> > > we go look for victims in its subgroups and individual processes.
> > 
> > This would be an absolute minimum, yes.
> > 
> > But I am still not convinced we should make this somehow "hardcoded" in
> > the core oom killer handler.  Why cannot we allow a callback for modules
> > and implement all these non-default OOM strategies in modules? We have
> > oom_notify_list already but that doesn't get the full oom context which
> > could be fixable but I suspect this is not the greatest interface at
> > all. We do not really need multiple implementations of the OOM handling
> > at the same time and a simple callback should be sufficient
> > 
> > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > index 04c9143a8625..926a36625322 100644
> > --- a/mm/oom_kill.c
> > +++ b/mm/oom_kill.c
> > @@ -995,6 +995,13 @@ bool out_of_memory(struct oom_control *oc)
> >  	}
> >  
> >  	/*
> > +	 * Try a registered oom handler to run and fallback to the default
> > +	 * implementation if it cannot handle the current oom context
> > +	 */
> > +	if (oom_handler && oom_handler(oc))
> > +		return true;
> 
> I think this would take us back to the dark days where memcg entry
> points where big opaque branches in the generic VM code, which then
> implemented their own thing, redundant locking, redundant LRU lists,
> which was all very hard to maintain.

Well, we can certainly help in that direction by exporting useful
library functions for those modules to use. E.g. the oom victim
selection is already halfway there.
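
To make the shape of this concrete, here is a minimal sketch of what
the module side of such a callback could look like. The
register_oom_handler()/unregister_oom_handler() pair is purely
hypothetical (only the oom_handler hook itself appears in the diff
above), and the handler body is a no-op that only demonstrates the
contract: return true if the OOM was handled, false to fall back to
the default victim selection.

#include <linux/module.h>
#include <linux/oom.h>

/* Hypothetical registration API -- not part of the quoted diff. */
extern int register_oom_handler(bool (*handler)(struct oom_control *oc));
extern void unregister_oom_handler(void);

static bool sample_oom_handler(struct oom_control *oc)
{
	/*
	 * A real policy module would select and kill a victim here,
	 * e.g. walk the hierarchy under oc->memcg (NULL for a global
	 * OOM) and pick the worst group.  Returning false makes
	 * out_of_memory() fall back to its default selection.
	 */
	return false;
}

static int __init sample_oom_init(void)
{
	return register_oom_handler(sample_oom_handler);
}

static void __exit sample_oom_exit(void)
{
	unregister_oom_handler();
}

module_init(sample_oom_init);
module_exit(sample_oom_exit);
MODULE_LICENSE("GPL");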
 
> > +	/*
> >  	 * If current has a pending SIGKILL or is exiting, then automatically
> >  	 * select it.  The goal is to allow it to allocate so that it may
> >  	 * quickly exit and free its memory.
> > 
> > Please note that I haven't explored how much of the infrastructure
> > needed for the OOM decision making is available to modules. But we can
> > export a lot of what we currently have in oom_kill.c. I admit it might
> > turn out that this is simply not feasible but I would like this to be at
> > least explored before we go and implement yet another hardcoded way to
> > handle (see how I didn't use policy ;)) OOM situation.
> 
> ;)
> 
> My doubt here is mainly that we'll see many (or any) real-life cases
> materialize that cannot be handled with cgroups and scoring. These are
> powerful building blocks on which userspace can implement all kinds of
> policy and sorting algorithms.
> 
> So this seems like a lot of churn and complicated code to handle one
> extension. An extension that implements basic functionality.

Well, as I've said I didn't get to explore this path, so I have only a
very vague idea of what we would have to export to implement e.g. the
oom killing strategy proposed in this thread. Unfortunately I do not
have much time for that. I do not want to block useful work which you
have a usecase for, but I would be really happy if we could consider
longer term plans before diving into a "hardcoded" implementation. We
didn't do that previously and we are left with oom_kill_allocating_task
and similar one-off things.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH] mm, oom: cgroup-aware OOM-killer
  2017-05-31 16:25               ` Michal Hocko
@ 2017-05-31 18:01                 ` Johannes Weiner
  -1 siblings, 0 replies; 42+ messages in thread
From: Johannes Weiner @ 2017-05-31 18:01 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Roman Gushchin, Vladimir Davydov, Tejun Heo, Li Zefan,
	Tetsuo Handa, kernel-team, cgroups, linux-doc, linux-kernel,
	linux-mm

On Wed, May 31, 2017 at 06:25:04PM +0200, Michal Hocko wrote:
> On Thu 25-05-17 13:08:05, Johannes Weiner wrote:
> > Everything the user would want to dynamically program in the kernel,
> > say with bpf, they could do in userspace and then update the scores
> > for each group and task periodically.
> 
> I am rather skeptical about dynamic scores. oom_{score_}adj has turned
> to mere oom disable/enable knobs from my experience.

That doesn't necessarily have to be a deficiency with the scoring
system. I suspect that most people simply don't care as long as the
picks for OOM victims aren't entirely stupid.

For example, we have a lot of machines that run one class of job. If
we run OOM there isn't much preference we'd need to express; just kill
one job - the biggest, whatever - and move on. (The biggest makes
sense because if all jobs are basically equal it's as good as any
other victim, but if one has a runaway bug it goes for that.)

Where we have more than one job class, it actually is mostly one hipri
and one lopri, in which case setting a hard limit on the lopri or the
-1000 OOM score trick is enough.
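
For reference, the -1000 trick is just the existing
/proc/<pid>/oom_score_adj interface (OOM_SCORE_ADJ_MIN). A minimal
userspace sketch of a process exempting itself, error handling
trimmed:

#include <stdio.h>

int main(void)
{
	/*
	 * -1000 makes this task ineligible for OOM selection; +1000
	 * would make it the preferred victim instead.  Lowering the
	 * value requires CAP_SYS_RESOURCE (or root).
	 */
	FILE *f = fopen("/proc/self/oom_score_adj", "w");

	if (!f)
		return 1;
	fputs("-1000", f);
	return fclose(f) ? 1 : 0;
}

The same value can just as well be written from outside the process or
set up by the service manager.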

How many systems run more than two clearly distinguishable classes of
workloads concurrently?

I'm sure they exist. I'm just saying it doesn't surprise me that
elaborate OOM scoring isn't all that wide-spread.

> > The only limitation is that you have to recalculate and update the
> > scoring tree every once in a while, whereas a bpf program could
> > evaluate things just-in-time. But for that to matter in practice, OOM
> > kills would have to be a fairly hot path.
> 
> I am not really sure how to reliably implement "kill the memcg with the
> largest process" strategy. And who knows how many others strategies will
> pop out.

That seems fairly contrived.

What does it mean to divide memory into subdomains, but when you run
out of physical memory you kill based on the biggest task?

Sure, it frees memory and gets the system going again, so it's as good
as any answer to overcommit gone wrong, I guess. But is that something
you'd intentionally want to express from a userspace perspective?

> > > > > > > And both kinds of workloads (services/applications and individual
> > > > > > > processes run by users) can co-exist on the same host - consider the
> > > > > > > default systemd setup, for instance.
> > > > > > > 
> > > > > > > IMHO it would be better to give users a choice regarding what they
> > > > > > > really want for a particular cgroup in case of OOM - killing the whole
> > > > > > > cgroup or one of its descendants. For example, we could introduce a
> > > > > > > per-cgroup flag that would tell the kernel whether the cgroup can
> > > > > > > tolerate killing a descendant or not. If it can, the kernel will pick
> > > > > > > the fattest sub-cgroup or process and check it. If it cannot, it will
> > > > > > > kill the whole cgroup and all its processes and sub-cgroups.
> > > > > > 
> > > > > > The last thing we want to do, is to compare processes with cgroups.
> > > > > > I agree, that we can have some option to disable the cgroup-aware OOM at all,
> > > > > > mostly for backward-compatibility. But I don't think it should be a
> > > > > > per-cgroup configuration option, which we will support forever.
> > > > > 
> > > > > I can clearly see a demand for "this is definitely more important
> > > > > container than others so do not kill" usecases. I can also see demand
> > > > > for "do not kill this container running for X days". And more are likely
> > > > > to pop out.
> > > > 
> > > > That can all be done with scoring.
> > > 
> > > Maybe. But that requires somebody to tweak the scoring which can be hard
> > > from trivial.
> > 
> > Why is sorting and picking in userspace harder than sorting and
> > picking in the kernel?
> 
> Because the userspace score based approach would be much more racy
> especially in the busy system. This could lead to unexpected behavior
> when OOM killer would kill a different than a run-away memcgs.

How would it be easier to weigh priority against runaway detection
inside the kernel?

> > > +	/*
> > >  	 * If current has a pending SIGKILL or is exiting, then automatically
> > >  	 * select it.  The goal is to allow it to allocate so that it may
> > >  	 * quickly exit and free its memory.
> > > 
> > > Please note that I haven't explored how much of the infrastructure
> > > needed for the OOM decision making is available to modules. But we can
> > > export a lot of what we currently have in oom_kill.c. I admit it might
> > > turn out that this is simply not feasible but I would like this to be at
> > > least explored before we go and implement yet another hardcoded way to
> > > handle (see how I didn't use policy ;)) OOM situation.
> > 
> > ;)
> > 
> > My doubt here is mainly that we'll see many (or any) real-life cases
> > materialize that cannot be handled with cgroups and scoring. These are
> > powerful building blocks on which userspace can implement all kinds of
> > policy and sorting algorithms.
> > 
> > So this seems like a lot of churn and complicated code to handle one
> > extension. An extension that implements basic functionality.
> 
> Well, as I've said I didn't get to explore this path so I have only a
> very vague idea what we would have to export to implement e.g. the
> proposed oom killing strategy suggested in this thread. Unfortunatelly I
> do not have much time for that. I do not want to block a useful work
> which you have a usecase for but I would be really happy if we could
> consider longer term plans before diving into a "hardcoded"
> implementation. We didn't do that previously and we are left with
> oom_kill_allocating_task and similar one off things.

As I understand it, killing the allocating task was simply the default
before the OOM killer existed, and it was later added as a compat knob.
I really doubt anybody is using it at this point, and we could probably
delete it.

I appreciate your concern about being too short-sighted here, but the
fact that I cannot point to more usecases isn't for lack of trying. I
simply don't see the endless possibilities of usecases that you do.

It's unlikely for more types of memory domains to pop up besides MMs
and cgroups. (I mentioned vmas, but that just seems esoteric. And we
have panic_on_oom for whole-system death. What else could there be?)

And as I pointed out, there is no real evidence that the current
system for configuring preferences isn't sufficient in practice.

Those are my thoughts on exploring. I'm not sure what else to do before
it feels like running off into fairly contrived hypotheticals.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH] mm, oom: cgroup-aware OOM-killer
  2017-05-31 18:01                 ` Johannes Weiner
@ 2017-06-02  8:43                   ` Michal Hocko
  -1 siblings, 0 replies; 42+ messages in thread
From: Michal Hocko @ 2017-06-02  8:43 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Roman Gushchin, Vladimir Davydov, Tejun Heo, Li Zefan,
	Tetsuo Handa, kernel-team, cgroups, linux-doc, linux-kernel,
	linux-mm

On Wed 31-05-17 14:01:45, Johannes Weiner wrote:
> On Wed, May 31, 2017 at 06:25:04PM +0200, Michal Hocko wrote:
> > On Thu 25-05-17 13:08:05, Johannes Weiner wrote:
> > > Everything the user would want to dynamically program in the kernel,
> > > say with bpf, they could do in userspace and then update the scores
> > > for each group and task periodically.
> > 
> > I am rather skeptical about dynamic scores. oom_{score_}adj has turned
> > to mere oom disable/enable knobs from my experience.
> 
> That doesn't necessarily have to be a deficiency with the scoring
> system. I suspect that most people simply don't care as long as the
> the picks for OOM victims aren't entirely stupid.
> 
> For example, we have a lot of machines that run one class of job. If
> we run OOM there isn't much preference we'd need to express; just kill
> one job - the biggest, whatever - and move on. (The biggest makes
> sense because if all jobs are basically equal it's as good as any
> other victim, but if one has a runaway bug it goes for that.)
> 
> Where we have more than one job class, it actually is mostly one hipri
> and one lopri, in which case setting a hard limit on the lopri or the
> -1000 OOM score trick is enough.
> 
> How many systems run more than two clearly distinguishable classes of
> workloads concurrently?

What about those which run different containers on a large physical
machine?

> I'm sure they exist. I'm just saying it doesn't surprise me that
> elaborate OOM scoring isn't all that wide-spread.
> 
> > > The only limitation is that you have to recalculate and update the
> > > scoring tree every once in a while, whereas a bpf program could
> > > evaluate things just-in-time. But for that to matter in practice, OOM
> > > kills would have to be a fairly hot path.
> > 
> > I am not really sure how to reliably implement "kill the memcg with the
> > largest process" strategy. And who knows how many others strategies will
> > pop out.
> 
> That seems fairly contrived.
> 
> What does it mean to divide memory into subdomains, but when you run
> out of physical memory you kill based on biggest task?

Well, the biggest task might be the runaway one and so killing it first
before you kill other innocent ones makes some sense to me.

> Sure, it frees memory and gets the system going again, so it's as good
> as any answer to overcommit gone wrong, I guess. But is that something
> you'd intentionally want to express from a userspace perspective?
> 
[...]
> > > > Maybe. But that requires somebody to tweak the scoring which can be hard
> > > > from trivial.
> > > 
> > > Why is sorting and picking in userspace harder than sorting and
> > > picking in the kernel?
> > 
> > Because the userspace score based approach would be much more racy
> > especially in the busy system. This could lead to unexpected behavior
> > when OOM killer would kill a different than a run-away memcgs.
> 
> How would it be easier to weigh priority against runaway detection
> inside the kernel?

You have a better chance of catching such a process at the time of the
OOM, because you do the check right when the OOM happens rather than at
some earlier point when your monitor was last able to run and check all
the existing processes (which alone can be rather time consuming, so
you do not want to do it very often).

> > > > +	/*
> > > >  	 * If current has a pending SIGKILL or is exiting, then automatically
> > > >  	 * select it.  The goal is to allow it to allocate so that it may
> > > >  	 * quickly exit and free its memory.
> > > > 
> > > > Please note that I haven't explored how much of the infrastructure
> > > > needed for the OOM decision making is available to modules. But we can
> > > > export a lot of what we currently have in oom_kill.c. I admit it might
> > > > turn out that this is simply not feasible but I would like this to be at
> > > > least explored before we go and implement yet another hardcoded way to
> > > > handle (see how I didn't use policy ;)) OOM situation.
> > > 
> > > ;)
> > > 
> > > My doubt here is mainly that we'll see many (or any) real-life cases
> > > materialize that cannot be handled with cgroups and scoring. These are
> > > powerful building blocks on which userspace can implement all kinds of
> > > policy and sorting algorithms.
> > > 
> > > So this seems like a lot of churn and complicated code to handle one
> > > extension. An extension that implements basic functionality.
> > 
> > Well, as I've said I didn't get to explore this path so I have only a
> > very vague idea what we would have to export to implement e.g. the
> > proposed oom killing strategy suggested in this thread. Unfortunatelly I
> > do not have much time for that. I do not want to block a useful work
> > which you have a usecase for but I would be really happy if we could
> > consider longer term plans before diving into a "hardcoded"
> > implementation. We didn't do that previously and we are left with
> > oom_kill_allocating_task and similar one off things.
> 
> As I understand it, killing the allocating task was simply the default
> before the OOM killer and was added as a compat knob. I really doubt
> anybody is using it at this point, and we could probably delete it.

I might misremember, but my recollection is that SGI simply had very
large machines with too many processes, and so the task selection was
very expensive.

> I appreciate your concern of being too short-sighted here, but the
> fact that I cannot point to more usecases isn't for lack of trying. I
> simply don't see the endless possibilities of usecases that you do.
> 
> It's unlikely for more types of memory domains to pop up besides MMs
> and cgroups. (I mentioned vmas, but that just seems esoteric. And we
> have panic_on_oom for whole-system death. What else could there be?)
> 
> And as I pointed out, there is no real evidence that the current
> system for configuring preferences isn't sufficient in practice.
> 
> That's my thoughts on exploring. I'm not sure what else to do before
> it feels like running off into fairly contrived hypotheticals.

Yes, I do not want hypotheticals to block an otherwise useful feature,
of course. But I haven't heard a strong argument for why a module based
approach would be a bigger maintenance burden in the long term. From a
very quick glance over the patches Roman posted yesterday, it seems
that a large part of the existing oom infrastructure can be reused
reasonably.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH] mm, oom: cgroup-aware OOM-killer
  2017-06-02  8:43                   ` Michal Hocko
@ 2017-06-02 15:18                     ` Roman Gushchin
  -1 siblings, 0 replies; 42+ messages in thread
From: Roman Gushchin @ 2017-06-02 15:18 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Johannes Weiner, Vladimir Davydov, Tejun Heo, Li Zefan,
	Tetsuo Handa, kernel-team, cgroups, linux-doc, linux-kernel,
	linux-mm

On Fri, Jun 02, 2017 at 10:43:33AM +0200, Michal Hocko wrote:
> On Wed 31-05-17 14:01:45, Johannes Weiner wrote:
> > On Wed, May 31, 2017 at 06:25:04PM +0200, Michal Hocko wrote:
> > > > > +	/*
> > > > >  	 * If current has a pending SIGKILL or is exiting, then automatically
> > > > >  	 * select it.  The goal is to allow it to allocate so that it may
> > > > >  	 * quickly exit and free its memory.
> > > > > 
> > > > > Please note that I haven't explored how much of the infrastructure
> > > > > needed for the OOM decision making is available to modules. But we can
> > > > > export a lot of what we currently have in oom_kill.c. I admit it might
> > > > > turn out that this is simply not feasible but I would like this to be at
> > > > > least explored before we go and implement yet another hardcoded way to
> > > > > handle (see how I didn't use policy ;)) OOM situation.
> > > > 
> > > > ;)
> > > > 
> > > > My doubt here is mainly that we'll see many (or any) real-life cases
> > > > materialize that cannot be handled with cgroups and scoring. These are
> > > > powerful building blocks on which userspace can implement all kinds of
> > > > policy and sorting algorithms.
> > > > 
> > > > So this seems like a lot of churn and complicated code to handle one
> > > > extension. An extension that implements basic functionality.
> > > 
> > > Well, as I've said I didn't get to explore this path so I have only a
> > > very vague idea what we would have to export to implement e.g. the
> > > proposed oom killing strategy suggested in this thread. Unfortunatelly I
> > > do not have much time for that. I do not want to block a useful work
> > > which you have a usecase for but I would be really happy if we could
> > > consider longer term plans before diving into a "hardcoded"
> > > implementation. We didn't do that previously and we are left with
> > > oom_kill_allocating_task and similar one off things.
> > 
> > As I understand it, killing the allocating task was simply the default
> > before the OOM killer and was added as a compat knob. I really doubt
> > anybody is using it at this point, and we could probably delete it.
> 
> I might misremember but my recollection is that SGI simply had too
> large machines with too many processes and so the task selection was
> very expensinve.

A cgroup-aware OOM killer can be much better in the case of a large
number of processes, as we don't have to iterate over all processes
locking each mm, and can select an appropriate cgroup based mostly on
lockless counters. Of course, it depends on the concrete setup, but it
can be much more efficient under the right circumstances.

> 
> > I appreciate your concern of being too short-sighted here, but the
> > fact that I cannot point to more usecases isn't for lack of trying. I
> > simply don't see the endless possibilities of usecases that you do.
> > 
> > It's unlikely for more types of memory domains to pop up besides MMs
> > and cgroups. (I mentioned vmas, but that just seems esoteric. And we
> > have panic_on_oom for whole-system death. What else could there be?)
> > 
> > And as I pointed out, there is no real evidence that the current
> > system for configuring preferences isn't sufficient in practice.
> > 
> > That's my thoughts on exploring. I'm not sure what else to do before
> > it feels like running off into fairly contrived hypotheticals.
> 
> Yes, I do not want hypotheticals to block an otherwise useful feature,
> of course. But I haven't heard a strong argument why a module based
> approach would be a more maintenance burden longterm. From a very quick
> glance over patches Roman has posted yesterday it seems that a large
> part of the existing oom infrastructure can be reused reasonably.

I have nothing against a module based approach, but I don't think that
a module should implement anything other than the oom score calculation
(for a process and a cgroup). Maybe also some custom method of killing,
but I can't really imagine anything reasonable except killing one
"worst" process or killing whole cgroup(s). In the case of a
system-wide OOM, we have to free some memory quickly, and this means we
can't do anything much more complex than killing some process(es).

So, in my understanding, what you're suggesting is not against the
proposed approach at all. We still need to iterate over cgroups,
somehow define their badness, find the worst one and destroy it. In my
v2 I've tried to separate these two potentially customizable areas into
two simple functions: mem_cgroup_oom_badness() and
mem_cgroup_kill_oom_victim(). So we can add an ability to customize
these functions (and similar stuff for processes) if we get some real
examples of where the proposed functionality is insufficient.
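
To illustrate the split, a rough sketch only (not the actual v2 code,
which differs in detail): the badness side can be computed from the
lockless page counters, and the kill side then takes down every task
in the selected group.

#include <linux/memcontrol.h>

/* Sketch of the scoring half; the field names assume a CONFIG_MEMCG
 * cgroup v2 setup. */
static unsigned long memcg_oom_badness_sketch(struct mem_cgroup *memcg)
{
	/*
	 * page_counter_read() is lockless, so scoring a cgroup does
	 * not require iterating its tasks or taking any mm locks.
	 */
	return page_counter_read(&memcg->memory) +
	       page_counter_read(&memcg->swap);
}

/*
 * The killing half would then walk the tasks of the selected memcg
 * (e.g. with css_task_iter) and send SIGKILL to each of them.
 */

Nothing here depends on per-process state, which is why the selection
stays cheap even with a huge number of tasks.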

Do you have any examples which can't be covered by this approach?

Thanks!

Roman

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH] mm, oom: cgroup-aware OOM-killer
  2017-06-02 15:18                     ` Roman Gushchin
@ 2017-06-05  8:27                       ` Michal Hocko
  -1 siblings, 0 replies; 42+ messages in thread
From: Michal Hocko @ 2017-06-05  8:27 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Johannes Weiner, Vladimir Davydov, Tejun Heo, Li Zefan,
	Tetsuo Handa, kernel-team, cgroups, linux-doc, linux-kernel,
	linux-mm

On Fri 02-06-17 16:18:52, Roman Gushchin wrote:
> On Fri, Jun 02, 2017 at 10:43:33AM +0200, Michal Hocko wrote:
> > On Wed 31-05-17 14:01:45, Johannes Weiner wrote:
> > > On Wed, May 31, 2017 at 06:25:04PM +0200, Michal Hocko wrote:
> > > > > > +	/*
> > > > > >  	 * If current has a pending SIGKILL or is exiting, then automatically
> > > > > >  	 * select it.  The goal is to allow it to allocate so that it may
> > > > > >  	 * quickly exit and free its memory.
> > > > > > 
> > > > > > Please note that I haven't explored how much of the infrastructure
> > > > > > needed for the OOM decision making is available to modules. But we can
> > > > > > export a lot of what we currently have in oom_kill.c. I admit it might
> > > > > > turn out that this is simply not feasible but I would like this to be at
> > > > > > least explored before we go and implement yet another hardcoded way to
> > > > > > handle (see how I didn't use policy ;)) OOM situation.
> > > > > 
> > > > > ;)
> > > > > 
> > > > > My doubt here is mainly that we'll see many (or any) real-life cases
> > > > > materialize that cannot be handled with cgroups and scoring. These are
> > > > > powerful building blocks on which userspace can implement all kinds of
> > > > > policy and sorting algorithms.
> > > > > 
> > > > > So this seems like a lot of churn and complicated code to handle one
> > > > > extension. An extension that implements basic functionality.
> > > > 
> > > > Well, as I've said I didn't get to explore this path so I have only a
> > > > very vague idea what we would have to export to implement e.g. the
> > > > proposed oom killing strategy suggested in this thread. Unfortunatelly I
> > > > do not have much time for that. I do not want to block a useful work
> > > > which you have a usecase for but I would be really happy if we could
> > > > consider longer term plans before diving into a "hardcoded"
> > > > implementation. We didn't do that previously and we are left with
> > > > oom_kill_allocating_task and similar one off things.
> > > 
> > > As I understand it, killing the allocating task was simply the default
> > > before the OOM killer and was added as a compat knob. I really doubt
> > > anybody is using it at this point, and we could probably delete it.
> > 
> > I might misremember but my recollection is that SGI simply had too
> > large machines with too many processes and so the task selection was
> > very expensinve.
> 
> Cgroup-aware OOM killer can be much better in case of large number of processes,
> as we don't have to iterate over all processes locking each mm, and
> can select an appropriate cgroup based mostly on lockless counters.
> Of course, it depends on concrete setup, but it can be much more efficient
> under right circumstances.

Yes, I agree with that.

> > > I appreciate your concern of being too short-sighted here, but the
> > > fact that I cannot point to more usecases isn't for lack of trying. I
> > > simply don't see the endless possibilities of usecases that you do.
> > > 
> > > It's unlikely for more types of memory domains to pop up besides MMs
> > > and cgroups. (I mentioned vmas, but that just seems esoteric. And we
> > > have panic_on_oom for whole-system death. What else could there be?)
> > > 
> > > And as I pointed out, there is no real evidence that the current
> > > system for configuring preferences isn't sufficient in practice.
> > > 
> > > Those are my thoughts on exploring. I'm not sure what else to do before
> > > it feels like running off into fairly contrived hypotheticals.
> > 
> > Yes, I do not want hypotheticals to block an otherwise useful feature,
> > of course. But I haven't heard a strong argument for why a module-based
> > approach would be more of a maintenance burden long term. From a very
> > quick glance over the patches Roman posted yesterday, it seems that a
> > large part of the existing oom infrastructure can be reused reasonably.
> 
> I have nothing against a module-based approach, but I don't think that a
> module should implement anything other than the oom score calculation
> (for a process and a cgroup).
> Maybe also some custom method of killing, but I can't really imagine anything
> reasonable except killing one "worst" process or killing a whole cgroup
> (or several cgroups).
> In the case of a system-wide OOM, we have to free some memory quickly,
> and this means we can't do anything much more complex than killing
> some process(es).
> 
> So, in my understanding, what you're suggesting is not against the proposed
> approach at all. We still need to iterate over cgroups, somehow define
> their badness, find the worst one, and destroy it. In my v2 I've tried
> to separate these two potentially customizable areas into two simple
> functions: mem_cgroup_oom_badness() and mem_cgroup_kill_oom_victim().

As I've said, I haven't gotten to look closer at your v2 yet. My point was
that we shouldn't hardcode the memcg-specific selection or the killing
strategy into the oom proper. Instead we could reuse the infrastructure we
already have. And yes, from a quick look, you are already doing something
along the lines of what I have had in mind. I will look more closely
sometime this week. The biggest concern I've had so far is having something
hardcoded in the oom proper now, if we could instead make this a module.

I will follow up in your v2 email thread.
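
To make that split concrete, here is a minimal stand-alone sketch that keeps
the generic victim walk separate from the two policy hooks discussed above.
All type and function names below are stand-ins invented for the sketch; they
are not the interfaces from the v2 patch or from the kernel:

/*
 * Stand-alone illustration of the split discussed above: only the
 * badness calculation and the kill method are policy; the iteration
 * over cgroups stays generic.  All names here are made up for the
 * sketch and do not exist in the kernel.
 */
#include <stdio.h>

struct memcg_stub {			/* stand-in for struct mem_cgroup */
	const char *name;
	unsigned long anon_pages;
	unsigned long unevictable_pages;
	unsigned long swap_pages;
	struct memcg_stub *next;	/* flattened hierarchy, sketch only */
};

struct oom_policy {
	/* higher score means "prefer this cgroup as the victim" */
	unsigned long (*memcg_badness)(const struct memcg_stub *memcg);
	/* free memory from the victim: one task, all tasks, ... */
	void (*kill_victim)(struct memcg_stub *victim);
};

/* default badness, mirroring the RFC: anon LRUs + unevictable + swap */
static unsigned long default_badness(const struct memcg_stub *memcg)
{
	return memcg->anon_pages + memcg->unevictable_pages + memcg->swap_pages;
}

static void kill_all_tasks(struct memcg_stub *victim)
{
	/* the kernel side would SIGKILL every task in the cgroup */
	printf("killing all tasks in %s\n", victim->name);
}

static const struct oom_policy default_policy = {
	.memcg_badness	= default_badness,
	.kill_victim	= kill_all_tasks,
};

/* generic part: find the worst cgroup and hand it to the policy */
static struct memcg_stub *select_victim(struct memcg_stub *root,
					const struct oom_policy *pol)
{
	struct memcg_stub *iter, *victim = NULL;
	unsigned long worst = 0;

	for (iter = root; iter; iter = iter->next) {
		unsigned long score = pol->memcg_badness(iter);

		if (score > worst) {
			worst = score;
			victim = iter;
		}
	}
	return victim;
}

int main(void)
{
	struct memcg_stub small = { "job-a", 1000, 0, 0, NULL };
	struct memcg_stub big = { "job-b", 50000, 0, 8000, &small };
	struct memcg_stub *victim = select_victim(&big, &default_policy);

	if (victim)
		default_policy.kill_victim(victim);
	return 0;
}

With such a split, a module (or an alternative built-in strategy) would only
have to provide a different oom_policy; the iteration and victim selection
would stay in the oom proper.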

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 42+ messages in thread


end of thread, other threads:[~2017-06-05  8:27 UTC | newest]

Thread overview: 42+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-05-18 16:28 [RFC PATCH] mm, oom: cgroup-aware OOM-killer Roman Gushchin
2017-05-18 17:30 ` Michal Hocko
2017-05-18 17:30   ` Michal Hocko
2017-05-18 18:11   ` Johannes Weiner
2017-05-18 18:11     ` Johannes Weiner
2017-05-19  8:02     ` Michal Hocko
2017-05-19  8:02       ` Michal Hocko
2017-05-18 18:37   ` Balbir Singh
2017-05-18 18:37     ` Balbir Singh
2017-05-18 19:20     ` Roman Gushchin
2017-05-18 19:20       ` Roman Gushchin
2017-05-18 19:41       ` Balbir Singh
2017-05-18 19:41         ` Balbir Singh
2017-05-18 19:22     ` Johannes Weiner
2017-05-18 19:22       ` Johannes Weiner
2017-05-18 19:43       ` Balbir Singh
2017-05-18 19:43         ` Balbir Singh
2017-05-18 20:15         ` Johannes Weiner
2017-05-18 20:15           ` Johannes Weiner
2017-05-20 18:37 ` Vladimir Davydov
2017-05-20 18:37   ` Vladimir Davydov
2017-05-22 17:01   ` Roman Gushchin
2017-05-22 17:01     ` Roman Gushchin
2017-05-23  7:07     ` Michal Hocko
2017-05-23  7:07       ` Michal Hocko
2017-05-23 13:25       ` Johannes Weiner
2017-05-23 13:25         ` Johannes Weiner
2017-05-25 15:38         ` Michal Hocko
2017-05-25 15:38           ` Michal Hocko
2017-05-25 17:08           ` Johannes Weiner
2017-05-25 17:08             ` Johannes Weiner
2017-05-25 17:08             ` Johannes Weiner
2017-05-31 16:25             ` Michal Hocko
2017-05-31 16:25               ` Michal Hocko
2017-05-31 18:01               ` Johannes Weiner
2017-05-31 18:01                 ` Johannes Weiner
2017-06-02  8:43                 ` Michal Hocko
2017-06-02  8:43                   ` Michal Hocko
2017-06-02 15:18                   ` Roman Gushchin
2017-06-02 15:18                     ` Roman Gushchin
2017-06-05  8:27                     ` Michal Hocko
2017-06-05  8:27                       ` Michal Hocko
