* [PATCH v13 1/7] mm, oom: refactor the oom_kill_process() function
From: Roman Gushchin @ 2017-11-30 15:28 UTC (permalink / raw)
To: linux-mm
Cc: Roman Gushchin, Vladimir Davydov, Tetsuo Handa, David Rientjes,
Andrew Morton, Tejun Heo, kernel-team, cgroups, linux-doc,
linux-kernel, linux-mm
The oom_kill_process() function consists of two logical parts:
the first one is responsible for considering the task's children
as potential victims and for printing the debug information.
The second half is responsible for sending SIGKILL to all
tasks sharing the mm struct with the given victim.

This commit splits the oom_kill_process() function with
the intention of reusing the second half: __oom_kill_process().

The cgroup-aware OOM killer will kill multiple tasks
belonging to the victim cgroup. We don't need to print
the debug information for each task, nor play with
task selection (considering the task's children),
so we can't use the existing oom_kill_process().
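To make the shape of the split concrete, here is a minimal userspace
sketch (all structures and names are illustrative stand-ins, not the
kernel code): the child-sacrifice policy stays in oom_kill_process(),
while __oom_kill_process() only acts on an already-chosen victim.

```c
#include <stdbool.h>
#include <stddef.h>
#include <assert.h>

/* Hypothetical stand-ins for the kernel structures; not the kernel code. */
struct task {
	int pid;
	long badness;		/* analog of oom_badness() */
	bool killed;
	struct task *child;	/* single child, for brevity */
};

/* Second half: unconditionally kill the already-chosen victim. */
static void __oom_kill_process(struct task *victim)
{
	victim->killed = true;	/* stands in for sending SIGKILL */
}

/* First half: victim selection (possibly preferring a child), then kill. */
static void oom_kill_process(struct task *chosen)
{
	struct task *victim = chosen;

	/* Sacrifice a child with a higher badness score, if there is one. */
	if (chosen->child && chosen->child->badness > victim->badness)
		victim = chosen->child;

	__oom_kill_process(victim);
}
```

The cgroup-aware killer can then call the second half directly for
each task it selects, without re-running the per-process selection
policy or printing per-task debug output.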
Signed-off-by: Roman Gushchin <guro@fb.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: David Rientjes <rientjes@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: kernel-team@fb.com
Cc: cgroups@vger.kernel.org
Cc: linux-doc@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
---
mm/oom_kill.c | 123 +++++++++++++++++++++++++++++++---------------------------
1 file changed, 65 insertions(+), 58 deletions(-)
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 3b0d0fed8480..f041534d77d3 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -814,68 +814,12 @@ static bool task_will_free_mem(struct task_struct *task)
return ret;
}
-static void oom_kill_process(struct oom_control *oc, const char *message)
+static void __oom_kill_process(struct task_struct *victim)
{
- struct task_struct *p = oc->chosen;
- unsigned int points = oc->chosen_points;
- struct task_struct *victim = p;
- struct task_struct *child;
- struct task_struct *t;
+ struct task_struct *p;
struct mm_struct *mm;
- unsigned int victim_points = 0;
- static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
- DEFAULT_RATELIMIT_BURST);
bool can_oom_reap = true;
- /*
- * If the task is already exiting, don't alarm the sysadmin or kill
- * its children or threads, just give it access to memory reserves
- * so it can die quickly
- */
- task_lock(p);
- if (task_will_free_mem(p)) {
- mark_oom_victim(p);
- wake_oom_reaper(p);
- task_unlock(p);
- put_task_struct(p);
- return;
- }
- task_unlock(p);
-
- if (__ratelimit(&oom_rs))
- dump_header(oc, p);
-
- pr_err("%s: Kill process %d (%s) score %u or sacrifice child\n",
- message, task_pid_nr(p), p->comm, points);
-
- /*
- * If any of p's children has a different mm and is eligible for kill,
- * the one with the highest oom_badness() score is sacrificed for its
- * parent. This attempts to lose the minimal amount of work done while
- * still freeing memory.
- */
- read_lock(&tasklist_lock);
- for_each_thread(p, t) {
- list_for_each_entry(child, &t->children, sibling) {
- unsigned int child_points;
-
- if (process_shares_mm(child, p->mm))
- continue;
- /*
- * oom_badness() returns 0 if the thread is unkillable
- */
- child_points = oom_badness(child,
- oc->memcg, oc->nodemask, oc->totalpages);
- if (child_points > victim_points) {
- put_task_struct(victim);
- victim = child;
- victim_points = child_points;
- get_task_struct(victim);
- }
- }
- }
- read_unlock(&tasklist_lock);
-
p = find_lock_task_mm(victim);
if (!p) {
put_task_struct(victim);
@@ -949,6 +893,69 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
}
#undef K
+static void oom_kill_process(struct oom_control *oc, const char *message)
+{
+ struct task_struct *p = oc->chosen;
+ unsigned int points = oc->chosen_points;
+ struct task_struct *victim = p;
+ struct task_struct *child;
+ struct task_struct *t;
+ unsigned int victim_points = 0;
+ static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
+ DEFAULT_RATELIMIT_BURST);
+
+ /*
+ * If the task is already exiting, don't alarm the sysadmin or kill
+ * its children or threads, just give it access to memory reserves
+ * so it can die quickly
+ */
+ task_lock(p);
+ if (task_will_free_mem(p)) {
+ mark_oom_victim(p);
+ wake_oom_reaper(p);
+ task_unlock(p);
+ put_task_struct(p);
+ return;
+ }
+ task_unlock(p);
+
+ if (__ratelimit(&oom_rs))
+ dump_header(oc, p);
+
+ pr_err("%s: Kill process %d (%s) score %u or sacrifice child\n",
+ message, task_pid_nr(p), p->comm, points);
+
+ /*
+ * If any of p's children has a different mm and is eligible for kill,
+ * the one with the highest oom_badness() score is sacrificed for its
+ * parent. This attempts to lose the minimal amount of work done while
+ * still freeing memory.
+ */
+ read_lock(&tasklist_lock);
+ for_each_thread(p, t) {
+ list_for_each_entry(child, &t->children, sibling) {
+ unsigned int child_points;
+
+ if (process_shares_mm(child, p->mm))
+ continue;
+ /*
+ * oom_badness() returns 0 if the thread is unkillable
+ */
+ child_points = oom_badness(child,
+ oc->memcg, oc->nodemask, oc->totalpages);
+ if (child_points > victim_points) {
+ put_task_struct(victim);
+ victim = child;
+ victim_points = child_points;
+ get_task_struct(victim);
+ }
+ }
+ }
+ read_unlock(&tasklist_lock);
+
+ __oom_kill_process(victim);
+}
+
/*
* Determines whether the kernel must panic because of the panic_on_oom sysctl.
*/
--
2.14.3
* [PATCH v13 2/7] mm: implement mem_cgroup_scan_tasks() for the root memory cgroup
From: Roman Gushchin @ 2017-11-30 15:28 UTC (permalink / raw)
To: linux-mm
Cc: Roman Gushchin, Vladimir Davydov, Tetsuo Handa, Andrew Morton,
Tejun Heo, kernel-team, cgroups, linux-doc, linux-kernel,
linux-mm
Implement mem_cgroup_scan_tasks() for the root memory cgroup,
so that the cgroup-aware OOM killer can use it to look for
an OOM victim task in the root memory cgroup.
The root memory cgroup is treated as a leaf cgroup, so only
tasks that belong directly to the root cgroup are iterated over.

This patch doesn't introduce any functional change, as
mem_cgroup_scan_tasks() is never called for the root memcg.
It is preparatory work for the cgroup-aware OOM killer,
which will use this function to iterate over tasks belonging
to the root memcg.
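A minimal userspace sketch of the resulting iteration rule
(hypothetical types and names, not the kernel code): for the root
group the scan covers only its own tasks, while for any other group
the subtree below it is walked as well.

```c
#include <stdbool.h>
#include <stddef.h>
#include <assert.h>

/* Illustrative stand-in for the memcg tree; not the kernel code. */
struct group {
	const char *name;
	int ntasks;		/* tasks attached directly to this group */
	struct group *child;	/* single child, for brevity */
	bool is_root;
};

/*
 * Count tasks reachable from @memcg, mimicking the patched
 * mem_cgroup_scan_tasks(): for the root group, stop after iterating
 * its own tasks instead of descending into the hierarchy.
 */
static int scan_tasks(struct group *memcg)
{
	int visited = 0;
	struct group *iter = memcg;

	while (iter) {
		visited += iter->ntasks;	/* fn() applied per task */
		if (memcg->is_root)		/* root treated as a leaf */
			break;
		iter = iter->child;		/* descend the subtree */
	}
	return visited;
}
```

With this rule, a scan rooted at the real root group never touches
tasks that live in child cgroups, matching the "root is a leaf"
behavior described above.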
Signed-off-by: Roman Gushchin <guro@fb.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: kernel-team@fb.com
Cc: cgroups@vger.kernel.org
Cc: linux-doc@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
---
mm/memcontrol.c | 7 +++----
1 file changed, 3 insertions(+), 4 deletions(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index a4aac306ebe3..55fbda60cef6 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -888,7 +888,8 @@ static void invalidate_reclaim_iterators(struct mem_cgroup *dead_memcg)
* value, the function breaks the iteration loop and returns the value.
* Otherwise, it will iterate over all tasks and return 0.
*
- * This function must not be called for the root memory cgroup.
+ * If memcg is the root memory cgroup, this function will iterate only
+ * over tasks belonging directly to the root memory cgroup.
*/
int mem_cgroup_scan_tasks(struct mem_cgroup *memcg,
int (*fn)(struct task_struct *, void *), void *arg)
@@ -896,8 +897,6 @@ int mem_cgroup_scan_tasks(struct mem_cgroup *memcg,
struct mem_cgroup *iter;
int ret = 0;
- BUG_ON(memcg == root_mem_cgroup);
-
for_each_mem_cgroup_tree(iter, memcg) {
struct css_task_iter it;
struct task_struct *task;
@@ -906,7 +905,7 @@ int mem_cgroup_scan_tasks(struct mem_cgroup *memcg,
while (!ret && (task = css_task_iter_next(&it)))
ret = fn(task, arg);
css_task_iter_end(&it);
- if (ret) {
+ if (ret || memcg == root_mem_cgroup) {
mem_cgroup_iter_break(memcg, iter);
break;
}
--
2.14.3
* [PATCH v13 3/7] mm, oom: cgroup-aware OOM killer
From: Roman Gushchin @ 2017-11-30 15:28 UTC (permalink / raw)
To: linux-mm
Cc: Roman Gushchin, Michal Hocko, Johannes Weiner, Vladimir Davydov,
Tetsuo Handa, David Rientjes, Andrew Morton, Tejun Heo,
kernel-team, cgroups, linux-doc, linux-kernel, linux-mm
Traditionally, the OOM killer operates on the process level.
Under OOM conditions, it finds the process with the highest
oom score and kills it.

This behavior doesn't suit systems with many running
containers well:

1) There is no fairness between containers. A small container
with a few large processes will be chosen over a large one with
a huge number of small processes.

2) Containers often do not expect that some random process
inside them will be killed. In many cases a much safer behavior
is to kill all tasks in the container. Traditionally, this was
implemented in userspace, but doing it in the kernel has some
advantages, especially in the case of a system-wide OOM.
To address these issues, the cgroup-aware OOM killer is introduced.
This patch introduces the core functionality: an ability to select
a memory cgroup as an OOM victim. Under OOM conditions the OOM killer
looks for the biggest leaf memory cgroup and kills the biggest
task belonging to it.
The following patches will extend this functionality to consider
non-leaf memory cgroups as OOM victims, and also provide an ability
to kill all tasks belonging to the victim cgroup.
The root cgroup is treated as a leaf memory cgroup, so its
score is compared with those of the other leaf memory cgroups.
Due to the implementation of memcg statistics, a special
approximation is used for estimating the oom_score of the root
memory cgroup: we sum the oom_score of the belonging processes
(or, to be more precise, of the tasks owning their mm structures).
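The selection step described above can be sketched in userspace C
(illustrative types and names, not the kernel implementation): each
leaf memory cgroup gets a score, where -1 means there is an inflight
OOM victim, 0 means the group is ineligible, and a positive value
estimates the memory footprint.

```c
#include <stddef.h>
#include <assert.h>

#define INFLIGHT ((void *)-1UL)	/* analog of INFLIGHT_VICTIM */

/* Illustrative leaf-memcg record; not the kernel code. */
struct memcg {
	const char *name;
	long score;	/* -1: inflight victim, 0: ineligible, >0: footprint */
};

/*
 * Pick the eligible leaf with the biggest score, mirroring the shape of
 * select_victim_memcg(): an inflight victim (-1) aborts the search, and
 * ineligible groups (0) are skipped.
 */
static struct memcg *select_victim(struct memcg *leaves, size_t n)
{
	struct memcg *chosen = NULL;
	long chosen_points = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		long score = leaves[i].score;

		if (score == 0)
			continue;	/* empty or non-eligible group */
		if (score == -1)
			return INFLIGHT;	/* someone is already dying */
		if (score > chosen_points) {
			chosen_points = score;
			chosen = &leaves[i];
		}
	}
	return chosen;
}
```

A NULL result means nothing was eligible, and the INFLIGHT sentinel
tells the caller to wait for the existing victim instead of choosing
a new one.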
Signed-off-by: Roman Gushchin <guro@fb.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: David Rientjes <rientjes@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: kernel-team@fb.com
Cc: cgroups@vger.kernel.org
Cc: linux-doc@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
---
include/linux/memcontrol.h | 17 +++++
include/linux/oom.h | 12 ++-
mm/memcontrol.c | 181 +++++++++++++++++++++++++++++++++++++++++++++
mm/oom_kill.c | 84 +++++++++++++++------
4 files changed, 272 insertions(+), 22 deletions(-)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 882046863581..cb4db659a8b5 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -35,6 +35,7 @@ struct mem_cgroup;
struct page;
struct mm_struct;
struct kmem_cache;
+struct oom_control;
/* Cgroup-specific page state, on top of universal node page state */
enum memcg_stat_item {
@@ -344,6 +345,11 @@ struct mem_cgroup *mem_cgroup_from_css(struct cgroup_subsys_state *css){
return css ? container_of(css, struct mem_cgroup, css) : NULL;
}
+static inline void mem_cgroup_put(struct mem_cgroup *memcg)
+{
+ css_put(&memcg->css);
+}
+
#define mem_cgroup_from_counter(counter, member) \
container_of(counter, struct mem_cgroup, member)
@@ -482,6 +488,8 @@ static inline bool task_in_memcg_oom(struct task_struct *p)
bool mem_cgroup_oom_synchronize(bool wait);
+bool mem_cgroup_select_oom_victim(struct oom_control *oc);
+
#ifdef CONFIG_MEMCG_SWAP
extern int do_swap_account;
#endif
@@ -781,6 +789,10 @@ static inline bool task_in_mem_cgroup(struct task_struct *task,
return true;
}
+static inline void mem_cgroup_put(struct mem_cgroup *memcg)
+{
+}
+
static inline struct mem_cgroup *
mem_cgroup_iter(struct mem_cgroup *root,
struct mem_cgroup *prev,
@@ -973,6 +985,11 @@ static inline
void count_memcg_event_mm(struct mm_struct *mm, enum vm_event_item idx)
{
}
+
+static inline bool mem_cgroup_select_oom_victim(struct oom_control *oc)
+{
+ return false;
+}
#endif /* CONFIG_MEMCG */
/* idx can be of type enum memcg_stat_item or node_stat_item */
diff --git a/include/linux/oom.h b/include/linux/oom.h
index 27cd36b762b5..10f495c8454d 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -10,6 +10,13 @@
#include <linux/sched/coredump.h> /* MMF_* */
#include <linux/mm.h> /* VM_FAULT* */
+
+/*
+ * Special value returned by victim selection functions to indicate
+ * that there are inflight OOM victims.
+ */
+#define INFLIGHT_VICTIM ((void *)-1UL)
+
struct zonelist;
struct notifier_block;
struct mem_cgroup;
@@ -51,7 +58,8 @@ struct oom_control {
/* Used by oom implementation, do not set */
unsigned long totalpages;
- struct task_struct *chosen;
+ struct task_struct *chosen_task;
+ struct mem_cgroup *chosen_memcg;
unsigned long chosen_points;
};
@@ -115,6 +123,8 @@ extern struct task_struct *find_lock_task_mm(struct task_struct *p);
extern struct page *alloc_pages_before_oomkill(const struct oom_control *oc);
+extern int oom_evaluate_task(struct task_struct *task, void *arg);
+
/* sysctls */
extern int sysctl_oom_dump_tasks;
extern int sysctl_oom_kill_allocating_task;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 55fbda60cef6..592ffb1c98a7 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2664,6 +2664,187 @@ static inline bool memcg_has_children(struct mem_cgroup *memcg)
return ret;
}
+static long memcg_oom_badness(struct mem_cgroup *memcg,
+ const nodemask_t *nodemask,
+ unsigned long totalpages)
+{
+ long points = 0;
+ int nid;
+ pg_data_t *pgdat;
+
+ for_each_node_state(nid, N_MEMORY) {
+ if (nodemask && !node_isset(nid, *nodemask))
+ continue;
+
+ points += mem_cgroup_node_nr_lru_pages(memcg, nid,
+ LRU_ALL_ANON | BIT(LRU_UNEVICTABLE));
+
+ pgdat = NODE_DATA(nid);
+ points += lruvec_page_state(mem_cgroup_lruvec(pgdat, memcg),
+ NR_SLAB_UNRECLAIMABLE);
+ }
+
+ points += memcg_page_state(memcg, MEMCG_KERNEL_STACK_KB) /
+ (PAGE_SIZE / 1024);
+ points += memcg_page_state(memcg, MEMCG_SOCK);
+ points += memcg_page_state(memcg, MEMCG_SWAP);
+
+ return points;
+}
+
+/*
+ * Checks if the given memcg is a valid OOM victim and returns a number,
+ * which means the following:
+ * -1: there are inflight OOM victim tasks belonging to the memcg
+ * 0: memcg is not eligible, e.g. all belonging tasks are protected
+ * by oom_score_adj set to OOM_SCORE_ADJ_MIN
+ * >0: memcg is eligible, and the returned value is an estimation
+ * of the memory footprint
+ */
+static long oom_evaluate_memcg(struct mem_cgroup *memcg,
+ const nodemask_t *nodemask,
+ unsigned long totalpages)
+{
+ struct css_task_iter it;
+ struct task_struct *task;
+ int eligible = 0;
+
+ /*
+ * Root memory cgroup is a special case:
+ * we don't have necessary stats to evaluate it exactly as
+ * leaf memory cgroups, so we approximate its oom_score
+ * by summing oom_score of all belonging tasks, which are
+ * owners of their mm structs.
+ *
+ * If there are inflight OOM victim tasks inside
+ * the root memcg, we return -1.
+ */
+ if (memcg == root_mem_cgroup) {
+ struct css_task_iter it;
+ struct task_struct *task;
+ long score = 0;
+
+ css_task_iter_start(&memcg->css, 0, &it);
+ while ((task = css_task_iter_next(&it))) {
+ if (tsk_is_oom_victim(task) &&
+ !test_bit(MMF_OOM_SKIP,
+ &task->signal->oom_mm->flags)) {
+ score = -1;
+ break;
+ }
+
+ task_lock(task);
+ if (!task->mm || task->mm->owner != task) {
+ task_unlock(task);
+ continue;
+ }
+ task_unlock(task);
+
+ score += oom_badness(task, memcg, nodemask,
+ totalpages);
+ }
+ css_task_iter_end(&it);
+
+ return score;
+ }
+
+ /*
+ * Memcg is OOM eligible if there are OOM killable tasks inside.
+ *
+ * We treat tasks with oom_score_adj set to OOM_SCORE_ADJ_MIN
+ * as unkillable.
+ *
+ * If there are inflight OOM victim tasks inside the memcg,
+ * we return -1.
+ */
+ css_task_iter_start(&memcg->css, 0, &it);
+ while ((task = css_task_iter_next(&it))) {
+ if (!eligible &&
+ task->signal->oom_score_adj != OOM_SCORE_ADJ_MIN)
+ eligible = 1;
+
+ if (tsk_is_oom_victim(task) &&
+ !test_bit(MMF_OOM_SKIP, &task->signal->oom_mm->flags)) {
+ eligible = -1;
+ break;
+ }
+ }
+ css_task_iter_end(&it);
+
+ if (eligible <= 0)
+ return eligible;
+
+ return memcg_oom_badness(memcg, nodemask, totalpages);
+}
+
+static void select_victim_memcg(struct mem_cgroup *root, struct oom_control *oc)
+{
+ struct mem_cgroup *iter;
+
+ oc->chosen_memcg = NULL;
+ oc->chosen_points = 0;
+
+ /*
+ * The oom_score is calculated for leaf memory cgroups (including
+ * the root memcg).
+ */
+ rcu_read_lock();
+ for_each_mem_cgroup_tree(iter, root) {
+ long score;
+
+ if (memcg_has_children(iter) && iter != root_mem_cgroup)
+ continue;
+
+ score = oom_evaluate_memcg(iter, oc->nodemask, oc->totalpages);
+
+ /*
+ * Ignore empty and non-eligible memory cgroups.
+ */
+ if (score == 0)
+ continue;
+
+ /*
+ * If there are inflight OOM victims, we don't need
+ * to look further for new victims.
+ */
+ if (score == -1) {
+ oc->chosen_memcg = INFLIGHT_VICTIM;
+ mem_cgroup_iter_break(root, iter);
+ break;
+ }
+
+ if (score > oc->chosen_points) {
+ oc->chosen_points = score;
+ oc->chosen_memcg = iter;
+ }
+ }
+
+ if (oc->chosen_memcg && oc->chosen_memcg != INFLIGHT_VICTIM)
+ css_get(&oc->chosen_memcg->css);
+
+ rcu_read_unlock();
+}
+
+bool mem_cgroup_select_oom_victim(struct oom_control *oc)
+{
+ struct mem_cgroup *root;
+
+ if (mem_cgroup_disabled())
+ return false;
+
+ if (!cgroup_subsys_on_dfl(memory_cgrp_subsys))
+ return false;
+
+ if (oc->memcg)
+ root = oc->memcg;
+ else
+ root = root_mem_cgroup;
+
+ select_victim_memcg(root, oc);
+
+ return oc->chosen_memcg;
+}
+
/*
* Reclaims as many pages from the given memcg as possible.
*
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index f041534d77d3..bcfa92f29407 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -309,7 +309,7 @@ static enum oom_constraint constrained_alloc(struct oom_control *oc)
return CONSTRAINT_NONE;
}
-static int oom_evaluate_task(struct task_struct *task, void *arg)
+int oom_evaluate_task(struct task_struct *task, void *arg)
{
struct oom_control *oc = arg;
unsigned long points;
@@ -343,26 +343,26 @@ static int oom_evaluate_task(struct task_struct *task, void *arg)
goto next;
/* Prefer thread group leaders for display purposes */
- if (points == oc->chosen_points && thread_group_leader(oc->chosen))
+ if (points == oc->chosen_points && thread_group_leader(oc->chosen_task))
goto next;
select:
- if (oc->chosen)
- put_task_struct(oc->chosen);
+ if (oc->chosen_task)
+ put_task_struct(oc->chosen_task);
get_task_struct(task);
- oc->chosen = task;
+ oc->chosen_task = task;
oc->chosen_points = points;
next:
return 0;
abort:
- if (oc->chosen)
- put_task_struct(oc->chosen);
- oc->chosen = (void *)-1UL;
+ if (oc->chosen_task)
+ put_task_struct(oc->chosen_task);
+ oc->chosen_task = INFLIGHT_VICTIM;
return 1;
}
/*
* Simple selection loop. We choose the process with the highest number of
- * 'points'. In case scan was aborted, oc->chosen is set to -1.
+ * 'points'. In case scan was aborted, oc->chosen_task is set to -1.
*/
static void select_bad_process(struct oom_control *oc)
{
@@ -895,7 +895,7 @@ static void __oom_kill_process(struct task_struct *victim)
static void oom_kill_process(struct oom_control *oc, const char *message)
{
- struct task_struct *p = oc->chosen;
+ struct task_struct *p = oc->chosen_task;
unsigned int points = oc->chosen_points;
struct task_struct *victim = p;
struct task_struct *child;
@@ -956,6 +956,27 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
__oom_kill_process(victim);
}
+static bool oom_kill_memcg_victim(struct oom_control *oc)
+{
+
+ if (oc->chosen_memcg == NULL || oc->chosen_memcg == INFLIGHT_VICTIM)
+ return oc->chosen_memcg;
+
+ /* Kill a task in the chosen memcg with the biggest memory footprint */
+ oc->chosen_points = 0;
+ oc->chosen_task = NULL;
+ mem_cgroup_scan_tasks(oc->chosen_memcg, oom_evaluate_task, oc);
+
+ if (oc->chosen_task == NULL || oc->chosen_task == INFLIGHT_VICTIM)
+ goto out;
+
+ __oom_kill_process(oc->chosen_task);
+
+out:
+ mem_cgroup_put(oc->chosen_memcg);
+ return oc->chosen_task;
+}
+
/*
* Determines whether the kernel must panic because of the panic_on_oom sysctl.
*/
@@ -1008,6 +1029,7 @@ bool out_of_memory(struct oom_control *oc)
{
unsigned long freed = 0;
enum oom_constraint constraint = CONSTRAINT_NONE;
+ bool delay = false; /* if set, delay next allocation attempt */
if (oom_killer_disabled)
return false;
@@ -1055,11 +1077,26 @@ bool out_of_memory(struct oom_control *oc)
if (oc->page)
return true;
get_task_struct(current);
- oc->chosen = current;
+ oc->chosen_task = current;
oom_kill_process(oc, "Out of memory (oom_kill_allocating_task)");
return true;
}
+ if (mem_cgroup_select_oom_victim(oc)) {
+ oc->page = alloc_pages_before_oomkill(oc);
+ if (oc->page) {
+ if (oc->chosen_memcg &&
+ oc->chosen_memcg != INFLIGHT_VICTIM)
+ mem_cgroup_put(oc->chosen_memcg);
+ return true;
+ }
+
+ if (oom_kill_memcg_victim(oc)) {
+ delay = true;
+ goto out;
+ }
+ }
+
select_bad_process(oc);
/*
* Try really last second allocation attempt after we selected an OOM
@@ -1068,25 +1105,30 @@ bool out_of_memory(struct oom_control *oc)
*/
oc->page = alloc_pages_before_oomkill(oc);
if (oc->page) {
- if (oc->chosen && oc->chosen != (void *)-1UL)
- put_task_struct(oc->chosen);
+ if (oc->chosen_task && oc->chosen_task != INFLIGHT_VICTIM)
+ put_task_struct(oc->chosen_task);
return true;
}
/* Found nothing?!?! Either we hang forever, or we panic. */
- if (!oc->chosen && !is_sysrq_oom(oc) && !is_memcg_oom(oc)) {
+ if (!oc->chosen_task && !is_sysrq_oom(oc) && !is_memcg_oom(oc)) {
dump_header(oc, NULL);
panic("Out of memory and no killable processes...\n");
}
- if (oc->chosen && oc->chosen != (void *)-1UL) {
+ if (oc->chosen_task && oc->chosen_task != INFLIGHT_VICTIM) {
oom_kill_process(oc, !is_memcg_oom(oc) ? "Out of memory" :
"Memory cgroup out of memory");
- /*
- * Give the killed process a good chance to exit before trying
- * to allocate memory again.
- */
- schedule_timeout_killable(1);
+ delay = true;
}
- return !!oc->chosen;
+
+out:
+ /*
+ * Give the killed process a good chance to exit before trying
+ * to allocate memory again.
+ */
+ if (delay)
+ schedule_timeout_killable(1);
+
+ return !!oc->chosen_task;
}
/*
--
2.14.3
* Re: [PATCH v13 3/7] mm, oom: cgroup-aware OOM killer
From: Michal Hocko @ 2017-12-01 8:35 UTC (permalink / raw)
To: Roman Gushchin
Cc: linux-mm, Johannes Weiner, Vladimir Davydov, Tetsuo Handa,
David Rientjes, Andrew Morton, Tejun Heo, kernel-team, cgroups,
linux-doc, linux-kernel, linux-mm
On Thu 30-11-17 15:28:20, Roman Gushchin wrote:
> Traditionally, the OOM killer is operating on a process level.
> Under oom conditions, it finds a process with the highest oom score
> and kills it.
>
> This behavior doesn't suit well the system with many running
> containers:
>
> 1) There is no fairness between containers. A small container with
> few large processes will be chosen over a large one with huge
> number of small processes.
>
> 2) Containers often do not expect that some random process inside
> will be killed. In many cases much safer behavior is to kill
> all tasks in the container. Traditionally, this was implemented
> in userspace, but doing it in the kernel has some advantages,
> especially in a case of a system-wide OOM.
>
> To address these issues, the cgroup-aware OOM killer is introduced.
>
> This patch introduces the core functionality: an ability to select
> a memory cgroup as an OOM victim. Under OOM conditions the OOM killer
> looks for the biggest leaf memory cgroup and kills the biggest
> task belonging to it.
>
> The following patches will extend this functionality to consider
> non-leaf memory cgroups as OOM victims, and also provide an ability
> to kill all tasks belonging to the victim cgroup.
>
> The root cgroup is treated as a leaf memory cgroup, so it's score
> is compared with other leaf memory cgroups.
> Due to memcg statistics implementation a special approximation
> is used for estimating oom_score of root memory cgroup: we sum
> oom_score of the belonging processes (or, to be more precise,
> tasks owning their mm structures).
>
> Signed-off-by: Roman Gushchin <guro@fb.com>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> Cc: David Rientjes <rientjes@google.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Tejun Heo <tj@kernel.org>
> Cc: kernel-team@fb.com
> Cc: cgroups@vger.kernel.org
> Cc: linux-doc@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org
> Cc: linux-mm@kvack.org
I am not entirely happy that this patch enables the cgroup behavior
unconditionally for cgroup v2, but a later patch fixes that up. I do not
expect people are going to bisect oom workloads over these few commits,
so this shouldn't be a big deal.

Anyway, I still _strongly_ believe that the new heuristic is not
suitable for the default behavior and that the opt-in is required. So my
ack is under this condition.
Acked-by: Michal Hocko <mhocko@suse.com>
> +
> + if (score > oc->chosen_points) {
> + oc->chosen_points = score;
> + oc->chosen_memcg = iter;
> + }
> + }
> +
> + if (oc->chosen_memcg && oc->chosen_memcg != INFLIGHT_VICTIM)
> + css_get(&oc->chosen_memcg->css);
> +
> + rcu_read_unlock();
> +}
> +
> +bool mem_cgroup_select_oom_victim(struct oom_control *oc)
> +{
> + struct mem_cgroup *root;
> +
> + if (mem_cgroup_disabled())
> + return false;
> +
> + if (!cgroup_subsys_on_dfl(memory_cgrp_subsys))
> + return false;
> +
> + if (oc->memcg)
> + root = oc->memcg;
> + else
> + root = root_mem_cgroup;
> +
> + select_victim_memcg(root, oc);
> +
> + return oc->chosen_memcg;
> +}
> +
> /*
> * Reclaims as many pages from the given memcg as possible.
> *
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index f041534d77d3..bcfa92f29407 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -309,7 +309,7 @@ static enum oom_constraint constrained_alloc(struct oom_control *oc)
> return CONSTRAINT_NONE;
> }
>
> -static int oom_evaluate_task(struct task_struct *task, void *arg)
> +int oom_evaluate_task(struct task_struct *task, void *arg)
> {
> struct oom_control *oc = arg;
> unsigned long points;
> @@ -343,26 +343,26 @@ static int oom_evaluate_task(struct task_struct *task, void *arg)
> goto next;
>
> /* Prefer thread group leaders for display purposes */
> - if (points == oc->chosen_points && thread_group_leader(oc->chosen))
> + if (points == oc->chosen_points && thread_group_leader(oc->chosen_task))
> goto next;
> select:
> - if (oc->chosen)
> - put_task_struct(oc->chosen);
> + if (oc->chosen_task)
> + put_task_struct(oc->chosen_task);
> get_task_struct(task);
> - oc->chosen = task;
> + oc->chosen_task = task;
> oc->chosen_points = points;
> next:
> return 0;
> abort:
> - if (oc->chosen)
> - put_task_struct(oc->chosen);
> - oc->chosen = (void *)-1UL;
> + if (oc->chosen_task)
> + put_task_struct(oc->chosen_task);
> + oc->chosen_task = INFLIGHT_VICTIM;
> return 1;
> }
>
> /*
> * Simple selection loop. We choose the process with the highest number of
> - * 'points'. In case scan was aborted, oc->chosen is set to -1.
> + * 'points'. In case scan was aborted, oc->chosen_task is set to -1.
> */
> static void select_bad_process(struct oom_control *oc)
> {
> @@ -895,7 +895,7 @@ static void __oom_kill_process(struct task_struct *victim)
>
> static void oom_kill_process(struct oom_control *oc, const char *message)
> {
> - struct task_struct *p = oc->chosen;
> + struct task_struct *p = oc->chosen_task;
> unsigned int points = oc->chosen_points;
> struct task_struct *victim = p;
> struct task_struct *child;
> @@ -956,6 +956,27 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
> __oom_kill_process(victim);
> }
>
> +static bool oom_kill_memcg_victim(struct oom_control *oc)
> +{
> +
> + if (oc->chosen_memcg == NULL || oc->chosen_memcg == INFLIGHT_VICTIM)
> + return oc->chosen_memcg;
> +
> + /* Kill a task in the chosen memcg with the biggest memory footprint */
> + oc->chosen_points = 0;
> + oc->chosen_task = NULL;
> + mem_cgroup_scan_tasks(oc->chosen_memcg, oom_evaluate_task, oc);
> +
> + if (oc->chosen_task == NULL || oc->chosen_task == INFLIGHT_VICTIM)
> + goto out;
> +
> + __oom_kill_process(oc->chosen_task);
> +
> +out:
> + mem_cgroup_put(oc->chosen_memcg);
> + return oc->chosen_task;
> +}
> +
> /*
> * Determines whether the kernel must panic because of the panic_on_oom sysctl.
> */
> @@ -1008,6 +1029,7 @@ bool out_of_memory(struct oom_control *oc)
> {
> unsigned long freed = 0;
> enum oom_constraint constraint = CONSTRAINT_NONE;
> + bool delay = false; /* if set, delay next allocation attempt */
>
> if (oom_killer_disabled)
> return false;
> @@ -1055,11 +1077,26 @@ bool out_of_memory(struct oom_control *oc)
> if (oc->page)
> return true;
> get_task_struct(current);
> - oc->chosen = current;
> + oc->chosen_task = current;
> oom_kill_process(oc, "Out of memory (oom_kill_allocating_task)");
> return true;
> }
>
> + if (mem_cgroup_select_oom_victim(oc)) {
> + oc->page = alloc_pages_before_oomkill(oc);
> + if (oc->page) {
> + if (oc->chosen_memcg &&
> + oc->chosen_memcg != INFLIGHT_VICTIM)
> + mem_cgroup_put(oc->chosen_memcg);
> + return true;
> + }
> +
> + if (oom_kill_memcg_victim(oc)) {
> + delay = true;
> + goto out;
> + }
> + }
> +
> select_bad_process(oc);
> /*
> * Try really last second allocation attempt after we selected an OOM
> @@ -1068,25 +1105,30 @@ bool out_of_memory(struct oom_control *oc)
> */
> oc->page = alloc_pages_before_oomkill(oc);
> if (oc->page) {
> - if (oc->chosen && oc->chosen != (void *)-1UL)
> - put_task_struct(oc->chosen);
> + if (oc->chosen_task && oc->chosen_task != INFLIGHT_VICTIM)
> + put_task_struct(oc->chosen_task);
> return true;
> }
> /* Found nothing?!?! Either we hang forever, or we panic. */
> - if (!oc->chosen && !is_sysrq_oom(oc) && !is_memcg_oom(oc)) {
> + if (!oc->chosen_task && !is_sysrq_oom(oc) && !is_memcg_oom(oc)) {
> dump_header(oc, NULL);
> panic("Out of memory and no killable processes...\n");
> }
> - if (oc->chosen && oc->chosen != (void *)-1UL) {
> + if (oc->chosen_task && oc->chosen_task != INFLIGHT_VICTIM) {
> oom_kill_process(oc, !is_memcg_oom(oc) ? "Out of memory" :
> "Memory cgroup out of memory");
> - /*
> - * Give the killed process a good chance to exit before trying
> - * to allocate memory again.
> - */
> - schedule_timeout_killable(1);
> + delay = true;
> }
> - return !!oc->chosen;
> +
> +out:
> + /*
> + * Give the killed process a good chance to exit before trying
> + * to allocate memory again.
> + */
> + if (delay)
> + schedule_timeout_killable(1);
> +
> + return !!oc->chosen_task;
> }
>
> /*
> --
> 2.14.3
>
--
Michal Hocko
SUSE Labs
* Re: [PATCH v13 3/7] mm, oom: cgroup-aware OOM killer
2017-11-30 15:28 ` [PATCH v13 3/7] mm, oom: cgroup-aware OOM killer Roman Gushchin
2017-12-01 8:35 ` Michal Hocko
@ 2017-12-07 1:24 ` Andrew Morton
2017-12-07 13:39 ` Roman Gushchin
1 sibling, 1 reply; 51+ messages in thread
From: Andrew Morton @ 2017-12-07 1:24 UTC (permalink / raw)
To: Roman Gushchin
Cc: linux-mm, Michal Hocko, Johannes Weiner, Vladimir Davydov,
Tetsuo Handa, David Rientjes, Tejun Heo, kernel-team, cgroups,
linux-doc, linux-kernel, linux-mm
As a result of the "stalled MM patches" discussion I've dropped these
three patches:
mm,oom: move last second allocation to inside the OOM killer
mm,oom: use ALLOC_OOM for OOM victim's last second allocation
mm,oom: remove oom_lock serialization from the OOM reaper
and I had to rework this patch as a result. Please carefully check (and
preferably test) my handiwork in out_of_memory()?
From: Roman Gushchin <guro@fb.com>
Subject: mm, oom: cgroup-aware OOM killer
Traditionally, the OOM killer operates on a process level. Under OOM
conditions, it finds the process with the highest oom score and kills it.
This behavior doesn't suit systems with many running containers well:
1) There is no fairness between containers. A small container with few
large processes will be chosen over a large one with a huge number of
small processes.
2) Containers often do not expect that some random process inside will
be killed. In many cases a much safer behavior is to kill all tasks in
the container. Traditionally, this was implemented in userspace, but
doing it in the kernel has some advantages, especially in the case of a
system-wide OOM.
To address these issues, the cgroup-aware OOM killer is introduced.
This patch introduces the core functionality: an ability to select a
memory cgroup as an OOM victim. Under OOM conditions the OOM killer looks
for the biggest leaf memory cgroup and kills the biggest task belonging to
it.
The following patches will extend this functionality to consider non-leaf
memory cgroups as OOM victims, and also provide an ability to kill all
tasks belonging to the victim cgroup.
The root cgroup is treated as a leaf memory cgroup, so its score is
compared with those of other leaf memory cgroups. Due to the memcg
statistics implementation, a special approximation is used for estimating
the oom_score of the root memory cgroup: we sum the oom_score of the
belonging processes (or, to be more precise, of the tasks owning their mm
structures).
Link: http://lkml.kernel.org/r/20171130152824.1591-4-guro@fb.com
Signed-off-by: Roman Gushchin <guro@fb.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: David Rientjes <rientjes@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
include/linux/memcontrol.h | 17 +++
include/linux/oom.h | 12 ++
mm/memcontrol.c | 181 +++++++++++++++++++++++++++++++++++
mm/oom_kill.c | 72 ++++++++++---
4 files changed, 262 insertions(+), 20 deletions(-)
diff -puN include/linux/memcontrol.h~mm-oom-cgroup-aware-oom-killer include/linux/memcontrol.h
--- a/include/linux/memcontrol.h~mm-oom-cgroup-aware-oom-killer
+++ a/include/linux/memcontrol.h
@@ -35,6 +35,7 @@ struct mem_cgroup;
struct page;
struct mm_struct;
struct kmem_cache;
+struct oom_control;
/* Cgroup-specific page state, on top of universal node page state */
enum memcg_stat_item {
@@ -344,6 +345,11 @@ struct mem_cgroup *mem_cgroup_from_css(s
return css ? container_of(css, struct mem_cgroup, css) : NULL;
}
+static inline void mem_cgroup_put(struct mem_cgroup *memcg)
+{
+ css_put(&memcg->css);
+}
+
#define mem_cgroup_from_counter(counter, member) \
container_of(counter, struct mem_cgroup, member)
@@ -482,6 +488,8 @@ static inline bool task_in_memcg_oom(str
bool mem_cgroup_oom_synchronize(bool wait);
+bool mem_cgroup_select_oom_victim(struct oom_control *oc);
+
#ifdef CONFIG_MEMCG_SWAP
extern int do_swap_account;
#endif
@@ -781,6 +789,10 @@ static inline bool task_in_mem_cgroup(st
return true;
}
+static inline void mem_cgroup_put(struct mem_cgroup *memcg)
+{
+}
+
static inline struct mem_cgroup *
mem_cgroup_iter(struct mem_cgroup *root,
struct mem_cgroup *prev,
@@ -973,6 +985,11 @@ static inline
void count_memcg_event_mm(struct mm_struct *mm, enum vm_event_item idx)
{
}
+
+static inline bool mem_cgroup_select_oom_victim(struct oom_control *oc)
+{
+ return false;
+}
#endif /* CONFIG_MEMCG */
/* idx can be of type enum memcg_stat_item or node_stat_item */
diff -puN include/linux/oom.h~mm-oom-cgroup-aware-oom-killer include/linux/oom.h
--- a/include/linux/oom.h~mm-oom-cgroup-aware-oom-killer
+++ a/include/linux/oom.h
@@ -10,6 +10,13 @@
#include <linux/sched/coredump.h> /* MMF_* */
#include <linux/mm.h> /* VM_FAULT* */
+
+/*
+ * Special value returned by victim selection functions to indicate
> + * that there are inflight OOM victims.
+ */
+#define INFLIGHT_VICTIM ((void *)-1UL)
+
struct zonelist;
struct notifier_block;
struct mem_cgroup;
@@ -40,7 +47,8 @@ struct oom_control {
/* Used by oom implementation, do not set */
unsigned long totalpages;
- struct task_struct *chosen;
+ struct task_struct *chosen_task;
+ struct mem_cgroup *chosen_memcg;
unsigned long chosen_points;
};
@@ -102,6 +110,8 @@ extern void oom_killer_enable(void);
extern struct task_struct *find_lock_task_mm(struct task_struct *p);
+extern int oom_evaluate_task(struct task_struct *task, void *arg);
+
/* sysctls */
extern int sysctl_oom_dump_tasks;
extern int sysctl_oom_kill_allocating_task;
diff -puN mm/memcontrol.c~mm-oom-cgroup-aware-oom-killer mm/memcontrol.c
--- a/mm/memcontrol.c~mm-oom-cgroup-aware-oom-killer
+++ a/mm/memcontrol.c
@@ -2664,6 +2664,187 @@ static inline bool memcg_has_children(st
return ret;
}
+static long memcg_oom_badness(struct mem_cgroup *memcg,
+ const nodemask_t *nodemask,
+ unsigned long totalpages)
+{
+ long points = 0;
+ int nid;
+ pg_data_t *pgdat;
+
+ for_each_node_state(nid, N_MEMORY) {
+ if (nodemask && !node_isset(nid, *nodemask))
+ continue;
+
+ points += mem_cgroup_node_nr_lru_pages(memcg, nid,
+ LRU_ALL_ANON | BIT(LRU_UNEVICTABLE));
+
+ pgdat = NODE_DATA(nid);
+ points += lruvec_page_state(mem_cgroup_lruvec(pgdat, memcg),
+ NR_SLAB_UNRECLAIMABLE);
+ }
+
+ points += memcg_page_state(memcg, MEMCG_KERNEL_STACK_KB) /
+ (PAGE_SIZE / 1024);
+ points += memcg_page_state(memcg, MEMCG_SOCK);
+ points += memcg_page_state(memcg, MEMCG_SWAP);
+
+ return points;
+}
+
+/*
+ * Checks if the given memcg is a valid OOM victim and returns a number,
> + * which means the following:
> + * -1: there are inflight OOM victim tasks belonging to the memcg
+ * 0: memcg is not eligible, e.g. all belonging tasks are protected
+ * by oom_score_adj set to OOM_SCORE_ADJ_MIN
+ * >0: memcg is eligible, and the returned value is an estimation
+ * of the memory footprint
+ */
+static long oom_evaluate_memcg(struct mem_cgroup *memcg,
+ const nodemask_t *nodemask,
+ unsigned long totalpages)
+{
+ struct css_task_iter it;
+ struct task_struct *task;
+ int eligible = 0;
+
+ /*
+ * Root memory cgroup is a special case:
+ * we don't have necessary stats to evaluate it exactly as
> + * leaf memory cgroups, so we approximate its oom_score
+ * by summing oom_score of all belonging tasks, which are
+ * owners of their mm structs.
+ *
+ * If there are inflight OOM victim tasks inside
+ * the root memcg, we return -1.
+ */
+ if (memcg == root_mem_cgroup) {
+ struct css_task_iter it;
+ struct task_struct *task;
+ long score = 0;
+
+ css_task_iter_start(&memcg->css, 0, &it);
+ while ((task = css_task_iter_next(&it))) {
+ if (tsk_is_oom_victim(task) &&
+ !test_bit(MMF_OOM_SKIP,
+ &task->signal->oom_mm->flags)) {
+ score = -1;
+ break;
+ }
+
+ task_lock(task);
+ if (!task->mm || task->mm->owner != task) {
+ task_unlock(task);
+ continue;
+ }
+ task_unlock(task);
+
+ score += oom_badness(task, memcg, nodemask,
+ totalpages);
+ }
+ css_task_iter_end(&it);
+
+ return score;
+ }
+
+ /*
+ * Memcg is OOM eligible if there are OOM killable tasks inside.
+ *
+ * We treat tasks with oom_score_adj set to OOM_SCORE_ADJ_MIN
+ * as unkillable.
+ *
+ * If there are inflight OOM victim tasks inside the memcg,
+ * we return -1.
+ */
+ css_task_iter_start(&memcg->css, 0, &it);
+ while ((task = css_task_iter_next(&it))) {
+ if (!eligible &&
+ task->signal->oom_score_adj != OOM_SCORE_ADJ_MIN)
+ eligible = 1;
+
+ if (tsk_is_oom_victim(task) &&
+ !test_bit(MMF_OOM_SKIP, &task->signal->oom_mm->flags)) {
+ eligible = -1;
+ break;
+ }
+ }
+ css_task_iter_end(&it);
+
+ if (eligible <= 0)
+ return eligible;
+
+ return memcg_oom_badness(memcg, nodemask, totalpages);
+}
+
+static void select_victim_memcg(struct mem_cgroup *root, struct oom_control *oc)
+{
+ struct mem_cgroup *iter;
+
+ oc->chosen_memcg = NULL;
+ oc->chosen_points = 0;
+
+ /*
+ * The oom_score is calculated for leaf memory cgroups (including
+ * the root memcg).
+ */
+ rcu_read_lock();
+ for_each_mem_cgroup_tree(iter, root) {
+ long score;
+
+ if (memcg_has_children(iter) && iter != root_mem_cgroup)
+ continue;
+
+ score = oom_evaluate_memcg(iter, oc->nodemask, oc->totalpages);
+
+ /*
+ * Ignore empty and non-eligible memory cgroups.
+ */
+ if (score == 0)
+ continue;
+
+ /*
+ * If there are inflight OOM victims, we don't need
+ * to look further for new victims.
+ */
+ if (score == -1) {
+ oc->chosen_memcg = INFLIGHT_VICTIM;
+ mem_cgroup_iter_break(root, iter);
+ break;
+ }
+
+ if (score > oc->chosen_points) {
+ oc->chosen_points = score;
+ oc->chosen_memcg = iter;
+ }
+ }
+
+ if (oc->chosen_memcg && oc->chosen_memcg != INFLIGHT_VICTIM)
+ css_get(&oc->chosen_memcg->css);
+
+ rcu_read_unlock();
+}
+
+bool mem_cgroup_select_oom_victim(struct oom_control *oc)
+{
+ struct mem_cgroup *root;
+
+ if (mem_cgroup_disabled())
+ return false;
+
+ if (!cgroup_subsys_on_dfl(memory_cgrp_subsys))
+ return false;
+
+ if (oc->memcg)
+ root = oc->memcg;
+ else
+ root = root_mem_cgroup;
+
+ select_victim_memcg(root, oc);
+
+ return oc->chosen_memcg;
+}
+
/*
* Reclaims as many pages from the given memcg as possible.
*
diff -puN mm/oom_kill.c~mm-oom-cgroup-aware-oom-killer mm/oom_kill.c
--- a/mm/oom_kill.c~mm-oom-cgroup-aware-oom-killer
+++ a/mm/oom_kill.c
@@ -309,7 +309,7 @@ static enum oom_constraint constrained_a
return CONSTRAINT_NONE;
}
-static int oom_evaluate_task(struct task_struct *task, void *arg)
+int oom_evaluate_task(struct task_struct *task, void *arg)
{
struct oom_control *oc = arg;
unsigned long points;
@@ -343,26 +343,26 @@ static int oom_evaluate_task(struct task
goto next;
/* Prefer thread group leaders for display purposes */
- if (points == oc->chosen_points && thread_group_leader(oc->chosen))
+ if (points == oc->chosen_points && thread_group_leader(oc->chosen_task))
goto next;
select:
- if (oc->chosen)
- put_task_struct(oc->chosen);
+ if (oc->chosen_task)
+ put_task_struct(oc->chosen_task);
get_task_struct(task);
- oc->chosen = task;
+ oc->chosen_task = task;
oc->chosen_points = points;
next:
return 0;
abort:
- if (oc->chosen)
- put_task_struct(oc->chosen);
- oc->chosen = (void *)-1UL;
+ if (oc->chosen_task)
+ put_task_struct(oc->chosen_task);
+ oc->chosen_task = INFLIGHT_VICTIM;
return 1;
}
/*
* Simple selection loop. We choose the process with the highest number of
- * 'points'. In case scan was aborted, oc->chosen is set to -1.
+ * 'points'. In case scan was aborted, oc->chosen_task is set to -1.
*/
static void select_bad_process(struct oom_control *oc)
{
@@ -912,7 +912,7 @@ static void __oom_kill_process(struct ta
static void oom_kill_process(struct oom_control *oc, const char *message)
{
- struct task_struct *p = oc->chosen;
+ struct task_struct *p = oc->chosen_task;
unsigned int points = oc->chosen_points;
struct task_struct *victim = p;
struct task_struct *child;
@@ -973,6 +973,27 @@ static void oom_kill_process(struct oom_
__oom_kill_process(victim);
}
+static bool oom_kill_memcg_victim(struct oom_control *oc)
+{
+
+ if (oc->chosen_memcg == NULL || oc->chosen_memcg == INFLIGHT_VICTIM)
+ return oc->chosen_memcg;
+
+ /* Kill a task in the chosen memcg with the biggest memory footprint */
+ oc->chosen_points = 0;
+ oc->chosen_task = NULL;
+ mem_cgroup_scan_tasks(oc->chosen_memcg, oom_evaluate_task, oc);
+
+ if (oc->chosen_task == NULL || oc->chosen_task == INFLIGHT_VICTIM)
+ goto out;
+
+ __oom_kill_process(oc->chosen_task);
+
+out:
+ mem_cgroup_put(oc->chosen_memcg);
+ return oc->chosen_task;
+}
+
/*
* Determines whether the kernel must panic because of the panic_on_oom sysctl.
*/
@@ -1025,6 +1046,7 @@ bool out_of_memory(struct oom_control *o
{
unsigned long freed = 0;
enum oom_constraint constraint = CONSTRAINT_NONE;
+ bool delay = false; /* if set, delay next allocation attempt */
if (oom_killer_disabled)
return false;
@@ -1069,27 +1091,39 @@ bool out_of_memory(struct oom_control *o
current->mm && !oom_unkillable_task(current, NULL, oc->nodemask) &&
current->signal->oom_score_adj != OOM_SCORE_ADJ_MIN) {
get_task_struct(current);
- oc->chosen = current;
+ oc->chosen_task = current;
oom_kill_process(oc, "Out of memory (oom_kill_allocating_task)");
return true;
}
+ if (mem_cgroup_select_oom_victim(oc)) {
+ if (oom_kill_memcg_victim(oc)) {
+ delay = true;
+ goto out;
+ }
+ }
+
select_bad_process(oc);
/* Found nothing?!?! Either we hang forever, or we panic. */
- if (!oc->chosen && !is_sysrq_oom(oc) && !is_memcg_oom(oc)) {
+ if (!oc->chosen_task && !is_sysrq_oom(oc) && !is_memcg_oom(oc)) {
dump_header(oc, NULL);
panic("Out of memory and no killable processes...\n");
}
- if (oc->chosen && oc->chosen != (void *)-1UL) {
+ if (oc->chosen_task && oc->chosen_task != (void *)-1UL) {
oom_kill_process(oc, !is_memcg_oom(oc) ? "Out of memory" :
"Memory cgroup out of memory");
- /*
- * Give the killed process a good chance to exit before trying
- * to allocate memory again.
- */
- schedule_timeout_killable(1);
+ delay = true;
}
- return !!oc->chosen;
+
+out:
+ /*
+ * Give the killed process a good chance to exit before trying
+ * to allocate memory again.
+ */
+ if (delay)
+ schedule_timeout_killable(1);
+
+ return !!oc->chosen_task;
}
/*
_
* Re: [PATCH v13 3/7] mm, oom: cgroup-aware OOM killer
2017-12-07 1:24 ` Andrew Morton
@ 2017-12-07 13:39 ` Roman Gushchin
0 siblings, 0 replies; 51+ messages in thread
From: Roman Gushchin @ 2017-12-07 13:39 UTC (permalink / raw)
To: Andrew Morton
Cc: linux-mm, Michal Hocko, Johannes Weiner, Vladimir Davydov,
Tetsuo Handa, David Rientjes, Tejun Heo, kernel-team, cgroups,
linux-doc, linux-kernel, linux-mm
On Wed, Dec 06, 2017 at 05:24:13PM -0800, Andrew Morton wrote:
>
> As a result of the "stalled MM patches" discussion I've dropped these
> three patches:
>
> mm,oom: move last second allocation to inside the OOM killer
> mm,oom: use ALLOC_OOM for OOM victim's last second allocation
> mm,oom: remove oom_lock serialization from the OOM reaper
>
> and I had to rework this patch as a result. Please carefully check (and
> preferably test) my handiwork in out_of_memory()?
Hi, Andrew!
Reviewed and tested, looks good to me. Thank you!
A couple of small nits below.
>
>
>
> From: Roman Gushchin <guro@fb.com>
> Subject: mm, oom: cgroup-aware OOM killer
>
> Traditionally, the OOM killer operates on a process level. Under OOM
> conditions, it finds the process with the highest oom score and kills it.
>
> This behavior doesn't suit systems with many running containers well:
>
> 1) There is no fairness between containers. A small container with few
> large processes will be chosen over a large one with a huge number of
> small processes.
>
> 2) Containers often do not expect that some random process inside will
> be killed. In many cases a much safer behavior is to kill all tasks in
> the container. Traditionally, this was implemented in userspace, but
> doing it in the kernel has some advantages, especially in the case of a
> system-wide OOM.
>
> To address these issues, the cgroup-aware OOM killer is introduced.
>
> This patch introduces the core functionality: an ability to select a
> memory cgroup as an OOM victim. Under OOM conditions the OOM killer looks
> for the biggest leaf memory cgroup and kills the biggest task belonging to
> it.
>
> The following patches will extend this functionality to consider non-leaf
> memory cgroups as OOM victims, and also provide an ability to kill all
> tasks belonging to the victim cgroup.
>
> The root cgroup is treated as a leaf memory cgroup, so its score is
> compared with those of other leaf memory cgroups. Due to the memcg
> statistics implementation, a special approximation is used for estimating
> the oom_score of the root memory cgroup: we sum the oom_score of the
> belonging processes (or, to be more precise, of the tasks owning their mm
> structures).
>
> Link: http://lkml.kernel.org/r/20171130152824.1591-4-guro@fb.com
> Signed-off-by: Roman Gushchin <guro@fb.com>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> Cc: David Rientjes <rientjes@google.com>
> Cc: Tejun Heo <tj@kernel.org>
> Cc: Michal Hocko <mhocko@kernel.org>
> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
> ---
>
> include/linux/memcontrol.h | 17 +++
> include/linux/oom.h | 12 ++
> mm/memcontrol.c | 181 +++++++++++++++++++++++++++++++++++
> mm/oom_kill.c | 72 ++++++++++---
> 4 files changed, 262 insertions(+), 20 deletions(-)
>
> diff -puN include/linux/memcontrol.h~mm-oom-cgroup-aware-oom-killer include/linux/memcontrol.h
> --- a/include/linux/memcontrol.h~mm-oom-cgroup-aware-oom-killer
> +++ a/include/linux/memcontrol.h
> @@ -35,6 +35,7 @@ struct mem_cgroup;
> struct page;
> struct mm_struct;
> struct kmem_cache;
> +struct oom_control;
>
> /* Cgroup-specific page state, on top of universal node page state */
> enum memcg_stat_item {
> @@ -344,6 +345,11 @@ struct mem_cgroup *mem_cgroup_from_css(s
> return css ? container_of(css, struct mem_cgroup, css) : NULL;
> }
>
> +static inline void mem_cgroup_put(struct mem_cgroup *memcg)
> +{
> + css_put(&memcg->css);
> +}
> +
> #define mem_cgroup_from_counter(counter, member) \
> container_of(counter, struct mem_cgroup, member)
>
> @@ -482,6 +488,8 @@ static inline bool task_in_memcg_oom(str
>
> bool mem_cgroup_oom_synchronize(bool wait);
>
> +bool mem_cgroup_select_oom_victim(struct oom_control *oc);
> +
> #ifdef CONFIG_MEMCG_SWAP
> extern int do_swap_account;
> #endif
> @@ -781,6 +789,10 @@ static inline bool task_in_mem_cgroup(st
> return true;
> }
>
> +static inline void mem_cgroup_put(struct mem_cgroup *memcg)
> +{
> +}
> +
> static inline struct mem_cgroup *
> mem_cgroup_iter(struct mem_cgroup *root,
> struct mem_cgroup *prev,
> @@ -973,6 +985,11 @@ static inline
> void count_memcg_event_mm(struct mm_struct *mm, enum vm_event_item idx)
> {
> }
> +
> +static inline bool mem_cgroup_select_oom_victim(struct oom_control *oc)
> +{
> + return false;
> +}
> #endif /* CONFIG_MEMCG */
>
> /* idx can be of type enum memcg_stat_item or node_stat_item */
> diff -puN include/linux/oom.h~mm-oom-cgroup-aware-oom-killer include/linux/oom.h
> --- a/include/linux/oom.h~mm-oom-cgroup-aware-oom-killer
> +++ a/include/linux/oom.h
> @@ -10,6 +10,13 @@
> #include <linux/sched/coredump.h> /* MMF_* */
> #include <linux/mm.h> /* VM_FAULT* */
>
> +
> +/*
> + * Special value returned by victim selection functions to indicate
> + * that there are inflight OOM victims.
> + */
> +#define INFLIGHT_VICTIM ((void *)-1UL)
> +
> struct zonelist;
> struct notifier_block;
> struct mem_cgroup;
> @@ -40,7 +47,8 @@ struct oom_control {
>
> /* Used by oom implementation, do not set */
> unsigned long totalpages;
> - struct task_struct *chosen;
> + struct task_struct *chosen_task;
> + struct mem_cgroup *chosen_memcg;
> unsigned long chosen_points;
> };
>
> @@ -102,6 +110,8 @@ extern void oom_killer_enable(void);
>
> extern struct task_struct *find_lock_task_mm(struct task_struct *p);
>
> +extern int oom_evaluate_task(struct task_struct *task, void *arg);
> +
> /* sysctls */
> extern int sysctl_oom_dump_tasks;
> extern int sysctl_oom_kill_allocating_task;
> diff -puN mm/memcontrol.c~mm-oom-cgroup-aware-oom-killer mm/memcontrol.c
> --- a/mm/memcontrol.c~mm-oom-cgroup-aware-oom-killer
> +++ a/mm/memcontrol.c
> @@ -2664,6 +2664,187 @@ static inline bool memcg_has_children(st
> return ret;
> }
>
> +static long memcg_oom_badness(struct mem_cgroup *memcg,
> + const nodemask_t *nodemask,
> + unsigned long totalpages)
> +{
> + long points = 0;
> + int nid;
> + pg_data_t *pgdat;
> +
> + for_each_node_state(nid, N_MEMORY) {
> + if (nodemask && !node_isset(nid, *nodemask))
> + continue;
> +
> + points += mem_cgroup_node_nr_lru_pages(memcg, nid,
> + LRU_ALL_ANON | BIT(LRU_UNEVICTABLE));
> +
> + pgdat = NODE_DATA(nid);
> + points += lruvec_page_state(mem_cgroup_lruvec(pgdat, memcg),
> + NR_SLAB_UNRECLAIMABLE);
> + }
> +
> + points += memcg_page_state(memcg, MEMCG_KERNEL_STACK_KB) /
> + (PAGE_SIZE / 1024);
> + points += memcg_page_state(memcg, MEMCG_SOCK);
> + points += memcg_page_state(memcg, MEMCG_SWAP);
> +
> + return points;
> +}
> +
> +/*
> + * Checks if the given memcg is a valid OOM victim and returns a number,
> + * which means the following:
> + * -1: there are inflight OOM victim tasks belonging to the memcg
> + * 0: memcg is not eligible, e.g. all belonging tasks are protected
> + * by oom_score_adj set to OOM_SCORE_ADJ_MIN
> + * >0: memcg is eligible, and the returned value is an estimation
> + * of the memory footprint
> + */
> +static long oom_evaluate_memcg(struct mem_cgroup *memcg,
> + const nodemask_t *nodemask,
> + unsigned long totalpages)
> +{
> + struct css_task_iter it;
> + struct task_struct *task;
> + int eligible = 0;
> +
> + /*
> + * Root memory cgroup is a special case:
> + * we don't have necessary stats to evaluate it exactly as
> + * leaf memory cgroups, so we approximate its oom_score
> + * by summing oom_score of all belonging tasks, which are
> + * owners of their mm structs.
> + *
> + * If there are inflight OOM victim tasks inside
> + * the root memcg, we return -1.
> + */
> + if (memcg == root_mem_cgroup) {
> + struct css_task_iter it;
> + struct task_struct *task;
> + long score = 0;
> +
> + css_task_iter_start(&memcg->css, 0, &it);
> + while ((task = css_task_iter_next(&it))) {
> + if (tsk_is_oom_victim(task) &&
> + !test_bit(MMF_OOM_SKIP,
> + &task->signal->oom_mm->flags)) {
> + score = -1;
> + break;
> + }
> +
> + task_lock(task);
> + if (!task->mm || task->mm->owner != task) {
> + task_unlock(task);
> + continue;
> + }
> + task_unlock(task);
> +
> + score += oom_badness(task, memcg, nodemask,
> + totalpages);
> + }
> + css_task_iter_end(&it);
> +
> + return score;
> + }
> +
> + /*
> + * Memcg is OOM eligible if there are OOM killable tasks inside.
> + *
> + * We treat tasks with oom_score_adj set to OOM_SCORE_ADJ_MIN
> + * as unkillable.
> + *
> + * If there are inflight OOM victim tasks inside the memcg,
> + * we return -1.
> + */
> + css_task_iter_start(&memcg->css, 0, &it);
> + while ((task = css_task_iter_next(&it))) {
> + if (!eligible &&
> + task->signal->oom_score_adj != OOM_SCORE_ADJ_MIN)
> + eligible = 1;
> +
> + if (tsk_is_oom_victim(task) &&
> + !test_bit(MMF_OOM_SKIP, &task->signal->oom_mm->flags)) {
> + eligible = -1;
> + break;
> + }
> + }
> + css_task_iter_end(&it);
> +
> + if (eligible <= 0)
> + return eligible;
> +
> + return memcg_oom_badness(memcg, nodemask, totalpages);
> +}
> +
> +static void select_victim_memcg(struct mem_cgroup *root, struct oom_control *oc)
> +{
> + struct mem_cgroup *iter;
> +
> + oc->chosen_memcg = NULL;
> + oc->chosen_points = 0;
> +
> + /*
> + * The oom_score is calculated for leaf memory cgroups (including
> + * the root memcg).
> + */
> + rcu_read_lock();
> + for_each_mem_cgroup_tree(iter, root) {
> + long score;
> +
> + if (memcg_has_children(iter) && iter != root_mem_cgroup)
> + continue;
> +
> + score = oom_evaluate_memcg(iter, oc->nodemask, oc->totalpages);
> +
> + /*
> + * Ignore empty and non-eligible memory cgroups.
> + */
> + if (score == 0)
> + continue;
> +
> + /*
> + * If there are inflight OOM victims, we don't need
> + * to look further for new victims.
> + */
> + if (score == -1) {
> + oc->chosen_memcg = INFLIGHT_VICTIM;
> + mem_cgroup_iter_break(root, iter);
> + break;
> + }
> +
> + if (score > oc->chosen_points) {
> + oc->chosen_points = score;
> + oc->chosen_memcg = iter;
> + }
> + }
> +
> + if (oc->chosen_memcg && oc->chosen_memcg != INFLIGHT_VICTIM)
> + css_get(&oc->chosen_memcg->css);
> +
> + rcu_read_unlock();
> +}
> +
> +bool mem_cgroup_select_oom_victim(struct oom_control *oc)
> +{
> + struct mem_cgroup *root;
> +
> + if (mem_cgroup_disabled())
> + return false;
> +
> + if (!cgroup_subsys_on_dfl(memory_cgrp_subsys))
> + return false;
> +
> + if (oc->memcg)
> + root = oc->memcg;
> + else
> + root = root_mem_cgroup;
> +
> + select_victim_memcg(root, oc);
> +
> + return oc->chosen_memcg;
> +}
> +
> /*
> * Reclaims as many pages from the given memcg as possible.
> *
> diff -puN mm/oom_kill.c~mm-oom-cgroup-aware-oom-killer mm/oom_kill.c
> --- a/mm/oom_kill.c~mm-oom-cgroup-aware-oom-killer
> +++ a/mm/oom_kill.c
> @@ -309,7 +309,7 @@ static enum oom_constraint constrained_a
> return CONSTRAINT_NONE;
> }
>
> -static int oom_evaluate_task(struct task_struct *task, void *arg)
> +int oom_evaluate_task(struct task_struct *task, void *arg)
> {
> struct oom_control *oc = arg;
> unsigned long points;
> @@ -343,26 +343,26 @@ static int oom_evaluate_task(struct task
> goto next;
>
> /* Prefer thread group leaders for display purposes */
> - if (points == oc->chosen_points && thread_group_leader(oc->chosen))
> + if (points == oc->chosen_points && thread_group_leader(oc->chosen_task))
> goto next;
> select:
> - if (oc->chosen)
> - put_task_struct(oc->chosen);
> + if (oc->chosen_task)
> + put_task_struct(oc->chosen_task);
> get_task_struct(task);
> - oc->chosen = task;
> + oc->chosen_task = task;
> oc->chosen_points = points;
> next:
> return 0;
> abort:
> - if (oc->chosen)
> - put_task_struct(oc->chosen);
> - oc->chosen = (void *)-1UL;
> + if (oc->chosen_task)
> + put_task_struct(oc->chosen_task);
> + oc->chosen_task = INFLIGHT_VICTIM;
> return 1;
> }
>
> /*
> * Simple selection loop. We choose the process with the highest number of
> - * 'points'. In case scan was aborted, oc->chosen is set to -1.
> + * 'points'. In case scan was aborted, oc->chosen_task is set to -1.
> */
> static void select_bad_process(struct oom_control *oc)
> {
> @@ -912,7 +912,7 @@ static void __oom_kill_process(struct ta
>
> static void oom_kill_process(struct oom_control *oc, const char *message)
> {
> - struct task_struct *p = oc->chosen;
> + struct task_struct *p = oc->chosen_task;
> unsigned int points = oc->chosen_points;
> struct task_struct *victim = p;
> struct task_struct *child;
> @@ -973,6 +973,27 @@ static void oom_kill_process(struct oom_
> __oom_kill_process(victim);
> }
>
> +static bool oom_kill_memcg_victim(struct oom_control *oc)
> +{
> +
> + if (oc->chosen_memcg == NULL || oc->chosen_memcg == INFLIGHT_VICTIM)
> + return oc->chosen_memcg;
> +
> + /* Kill a task in the chosen memcg with the biggest memory footprint */
> + oc->chosen_points = 0;
> + oc->chosen_task = NULL;
> + mem_cgroup_scan_tasks(oc->chosen_memcg, oom_evaluate_task, oc);
> +
> + if (oc->chosen_task == NULL || oc->chosen_task == INFLIGHT_VICTIM)
> + goto out;
> +
> + __oom_kill_process(oc->chosen_task);
> +
> +out:
> + mem_cgroup_put(oc->chosen_memcg);
> + return oc->chosen_task;
> +}
> +
> /*
> * Determines whether the kernel must panic because of the panic_on_oom sysctl.
> */
> @@ -1025,6 +1046,7 @@ bool out_of_memory(struct oom_control *o
> {
> unsigned long freed = 0;
> enum oom_constraint constraint = CONSTRAINT_NONE;
> + bool delay = false; /* if set, delay next allocation attempt */
>
> if (oom_killer_disabled)
> return false;
> @@ -1069,27 +1091,39 @@ bool out_of_memory(struct oom_control *o
> current->mm && !oom_unkillable_task(current, NULL, oc->nodemask) &&
> current->signal->oom_score_adj != OOM_SCORE_ADJ_MIN) {
> get_task_struct(current);
> - oc->chosen = current;
> + oc->chosen_task = current;
> oom_kill_process(oc, "Out of memory (oom_kill_allocating_task)");
> return true;
> }
>
> + if (mem_cgroup_select_oom_victim(oc)) {
> + if (oom_kill_memcg_victim(oc)) {
> + delay = true;
> + goto out;
> + }
> + }
It's probably better to join the two conditions with &&:
if (mem_cgroup_select_oom_victim(oc) && oom_kill_memcg_victim(oc)) {
delay = true;
goto out;
}
> +
> select_bad_process(oc);
> /* Found nothing?!?! Either we hang forever, or we panic. */
> - if (!oc->chosen && !is_sysrq_oom(oc) && !is_memcg_oom(oc)) {
> + if (!oc->chosen_task && !is_sysrq_oom(oc) && !is_memcg_oom(oc)) {
> dump_header(oc, NULL);
> panic("Out of memory and no killable processes...\n");
> }
> - if (oc->chosen && oc->chosen != (void *)-1UL) {
> + if (oc->chosen_task && oc->chosen_task != (void *)-1UL) {
It's better to replace "(void *)-1UL" here with "INFLIGHT_VICTIM"
for consistency:
if (oc->chosen_task && oc->chosen_task != INFLIGHT_VICTIM) ...
> oom_kill_process(oc, !is_memcg_oom(oc) ? "Out of memory" :
> "Memory cgroup out of memory");
--
Also, I've sent a couple of small fixes, requested by Michal:
1) https://lkml.org/lkml/2017/12/1/569
(mm, oom, docs: document groupoom mount option)
2) https://lkml.org/lkml/2017/12/1/568
(mm, oom: return error on access to memory.oom_group if groupoom is disabled)
Can you please pull them?
Thank you!
Roman
^ permalink raw reply [flat|nested] 51+ messages in thread
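[Editorial aside: the two-pass structure of oom_evaluate_memcg() quoted above (an eligibility/in-flight-victim scan, then a badness sum) can be sketched as a standalone userspace model. The struct and helper names below are illustrative only, and the mm-owner filtering and locking of the real code are omitted.]

```c
#include <assert.h>
#include <stddef.h>

#define OOM_SCORE_ADJ_MIN (-1000)

/* Illustrative stand-in for a memcg task; not the kernel's task_struct. */
struct task {
	int oom_score_adj;
	int inflight_victim;  /* OOM victim without MMF_OOM_SKIP set */
	long badness;         /* stand-in for oom_badness() */
};

/*
 * Two-pass evaluation mirroring oom_evaluate_memcg(): pass 1 bails out
 * with -1 on an in-flight victim and checks whether any task is killable
 * at all; pass 2 sums the badness scores.
 */
static long evaluate_memcg(const struct task *tasks, size_t n)
{
	int eligible = 0;
	long score = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		if (tasks[i].inflight_victim)
			return -1;
		if (tasks[i].oom_score_adj != OOM_SCORE_ADJ_MIN)
			eligible = 1;
	}
	if (!eligible)
		return 0;

	for (i = 0; i < n; i++)
		score += tasks[i].badness;
	return score;
}
```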
* [PATCH v13 4/7] mm, oom: introduce memory.oom_group
2017-11-30 15:28 [PATCH v13 0/7] cgroup-aware OOM killer Roman Gushchin
` (2 preceding siblings ...)
2017-11-30 15:28 ` [PATCH v13 3/7] mm, oom: cgroup-aware OOM killer Roman Gushchin
@ 2017-11-30 15:28 ` Roman Gushchin
2017-11-30 15:28 ` [PATCH v13 5/7] mm, oom: add cgroup v2 mount option for cgroup-aware OOM killer Roman Gushchin
` (5 subsequent siblings)
9 siblings, 0 replies; 51+ messages in thread
From: Roman Gushchin @ 2017-11-30 15:28 UTC (permalink / raw)
To: linux-mm
Cc: Roman Gushchin, Vladimir Davydov, Tetsuo Handa, David Rientjes,
Andrew Morton, Tejun Heo, kernel-team, cgroups, linux-doc,
linux-kernel, linux-mm
The cgroup-aware OOM killer treats leaf memory cgroups as memory
consumption entities and performs the victim selection by comparing
them based on their memory footprint. Then it kills the biggest task
inside the selected memory cgroup.
But there are workloads that are not tolerant of such behavior.
Killing a random task may leave the workload in a broken state.
To solve this problem, the memory.oom_group knob is introduced.
It defines whether a memory cgroup should be treated as an
indivisible memory consumer, compared by total memory consumption
with other memory consumers (leaf memory cgroups and other memory
cgroups with memory.oom_group set), and whether all its tasks
should be killed if the cgroup is selected.
If set on memcg A, it means that in case of a system-wide OOM or
a memcg-wide OOM scoped to A or any ancestor cgroup, all tasks
belonging to the sub-tree of A will be killed. If the OOM event is
scoped to a descendant cgroup (A/B, for example), only tasks in
that cgroup can be affected. The OOM killer will never touch any
tasks outside the scope of the OOM event.
Also, tasks with oom_score_adj set to -1000 will not be killed, because
this has been a long-established way to protect a particular process
from an unexpected SIGKILL from the OOM killer. Ignoring this
user-defined configuration might lead to data corruption or other
misbehavior.
The default value is 0.
Signed-off-by: Roman Gushchin <guro@fb.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: David Rientjes <rientjes@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: kernel-team@fb.com
Cc: cgroups@vger.kernel.org
Cc: linux-doc@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
---
include/linux/memcontrol.h | 17 +++++++++++
mm/memcontrol.c | 75 +++++++++++++++++++++++++++++++++++++++++++---
mm/oom_kill.c | 47 +++++++++++++++++++++++------
3 files changed, 126 insertions(+), 13 deletions(-)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index cb4db659a8b5..7b8bcdf6571d 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -203,6 +203,13 @@ struct mem_cgroup {
/* OOM-Killer disable */
int oom_kill_disable;
+ /*
+ * Treat the sub-tree as an indivisible memory consumer,
+ * kill all tasks belonging to it if the memory cgroup is
+ * selected as an OOM victim.
+ */
+ bool oom_group;
+
/* handle for "memory.events" */
struct cgroup_file events_file;
@@ -490,6 +497,11 @@ bool mem_cgroup_oom_synchronize(bool wait);
bool mem_cgroup_select_oom_victim(struct oom_control *oc);
+static inline bool mem_cgroup_oom_group(struct mem_cgroup *memcg)
+{
+ return memcg->oom_group;
+}
+
#ifdef CONFIG_MEMCG_SWAP
extern int do_swap_account;
#endif
@@ -990,6 +1002,11 @@ static inline bool mem_cgroup_select_oom_victim(struct oom_control *oc)
{
return false;
}
+
+static inline bool mem_cgroup_oom_group(struct mem_cgroup *memcg)
+{
+ return false;
+}
#endif /* CONFIG_MEMCG */
/* idx can be of type enum memcg_stat_item or node_stat_item */
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 592ffb1c98a7..5d27a4bbd478 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2779,19 +2779,51 @@ static long oom_evaluate_memcg(struct mem_cgroup *memcg,
static void select_victim_memcg(struct mem_cgroup *root, struct oom_control *oc)
{
- struct mem_cgroup *iter;
+ struct mem_cgroup *iter, *group = NULL;
+ long group_score = 0;
oc->chosen_memcg = NULL;
oc->chosen_points = 0;
+ /*
+ * If OOM is memcg-wide, and the memcg has the oom_group flag set,
+ * all tasks belonging to the memcg should be killed.
+ * So, we mark the memcg as a victim.
+ */
+ if (oc->memcg && mem_cgroup_oom_group(oc->memcg)) {
+ oc->chosen_memcg = oc->memcg;
+ css_get(&oc->chosen_memcg->css);
+ return;
+ }
+
/*
* The oom_score is calculated for leaf memory cgroups (including
* the root memcg).
+ * Non-leaf oom_group cgroups accumulate the scores of their
+ * descendant leaf memory cgroups.
*/
rcu_read_lock();
for_each_mem_cgroup_tree(iter, root) {
long score;
+ /*
+ * We don't consider non-leaf non-oom_group memory cgroups
+ * as OOM victims.
+ */
+ if (memcg_has_children(iter) && iter != root_mem_cgroup &&
+ !mem_cgroup_oom_group(iter))
+ continue;
+
+ /*
+ * If group is not set or we've run out of the group's sub-tree,
+ * we should set group and reset group_score.
+ */
+ if (!group || group == root_mem_cgroup ||
+ !mem_cgroup_is_descendant(iter, group)) {
+ group = iter;
+ group_score = 0;
+ }
+
if (memcg_has_children(iter) && iter != root_mem_cgroup)
continue;
@@ -2813,9 +2845,11 @@ static void select_victim_memcg(struct mem_cgroup *root, struct oom_control *oc)
break;
}
- if (score > oc->chosen_points) {
- oc->chosen_points = score;
- oc->chosen_memcg = iter;
+ group_score += score;
+
+ if (group_score > oc->chosen_points) {
+ oc->chosen_points = group_score;
+ oc->chosen_memcg = group;
}
}
@@ -5440,6 +5474,33 @@ static ssize_t memory_max_write(struct kernfs_open_file *of,
return nbytes;
}
+static int memory_oom_group_show(struct seq_file *m, void *v)
+{
+ struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m));
+ bool oom_group = memcg->oom_group;
+
+ seq_printf(m, "%d\n", oom_group);
+
+ return 0;
+}
+
+static ssize_t memory_oom_group_write(struct kernfs_open_file *of,
+ char *buf, size_t nbytes,
+ loff_t off)
+{
+ struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
+ int oom_group;
+ int err;
+
+ err = kstrtoint(strstrip(buf), 0, &oom_group);
+ if (err)
+ return err;
+
+ memcg->oom_group = oom_group;
+
+ return nbytes;
+}
+
static int memory_events_show(struct seq_file *m, void *v)
{
struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m));
@@ -5559,6 +5620,12 @@ static struct cftype memory_files[] = {
.seq_show = memory_max_show,
.write = memory_max_write,
},
+ {
+ .name = "oom_group",
+ .flags = CFTYPE_NOT_ON_ROOT | CFTYPE_NS_DELEGATABLE,
+ .seq_show = memory_oom_group_show,
+ .write = memory_oom_group_write,
+ },
{
.name = "events",
.flags = CFTYPE_NOT_ON_ROOT,
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index bcfa92f29407..4678468bae17 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -820,6 +820,17 @@ static void __oom_kill_process(struct task_struct *victim)
struct mm_struct *mm;
bool can_oom_reap = true;
+ /*
+ * __oom_kill_process() is used to kill all tasks belonging to
+ * the selected memory cgroup, so we should check that we're not
+ * trying to kill an unkillable task.
+ */
+ if (is_global_init(victim) || (victim->flags & PF_KTHREAD) ||
+ victim->signal->oom_score_adj == OOM_SCORE_ADJ_MIN) {
+ put_task_struct(victim);
+ return;
+ }
+
p = find_lock_task_mm(victim);
if (!p) {
put_task_struct(victim);
@@ -956,21 +967,39 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
__oom_kill_process(victim);
}
-static bool oom_kill_memcg_victim(struct oom_control *oc)
+static int oom_kill_memcg_member(struct task_struct *task, void *unused)
{
+ get_task_struct(task);
+ __oom_kill_process(task);
+ return 0;
+}
+static bool oom_kill_memcg_victim(struct oom_control *oc)
+{
if (oc->chosen_memcg == NULL || oc->chosen_memcg == INFLIGHT_VICTIM)
return oc->chosen_memcg;
- /* Kill a task in the chosen memcg with the biggest memory footprint */
- oc->chosen_points = 0;
- oc->chosen_task = NULL;
- mem_cgroup_scan_tasks(oc->chosen_memcg, oom_evaluate_task, oc);
-
- if (oc->chosen_task == NULL || oc->chosen_task == INFLIGHT_VICTIM)
- goto out;
+ /*
+ * If memory.oom_group is set, kill all tasks belonging to the sub-tree
+ * of the chosen memory cgroup, otherwise kill the task with the biggest
+ * memory footprint.
+ */
+ if (mem_cgroup_oom_group(oc->chosen_memcg)) {
+ mem_cgroup_scan_tasks(oc->chosen_memcg, oom_kill_memcg_member,
+ NULL);
+ /* We have one or more terminating processes at this point. */
+ oc->chosen_task = INFLIGHT_VICTIM;
+ } else {
+ oc->chosen_points = 0;
+ oc->chosen_task = NULL;
+ mem_cgroup_scan_tasks(oc->chosen_memcg, oom_evaluate_task, oc);
+
+ if (oc->chosen_task == NULL ||
+ oc->chosen_task == INFLIGHT_VICTIM)
+ goto out;
- __oom_kill_process(oc->chosen_task);
+ __oom_kill_process(oc->chosen_task);
+ }
out:
mem_cgroup_put(oc->chosen_memcg);
--
2.14.3
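[Editorial aside: the group_score accumulation this patch adds to select_victim_memcg() (leaf scores rolling up into the nearest oom_group ancestor, which then competes as one unit) can be sketched as a simplified userspace model. The flat array representation below is an assumption for illustration, not the kernel's actual tree walk.]

```c
#include <assert.h>
#include <stddef.h>

#define MAX_GROUPS 8

/*
 * Illustrative leaf memcg: its score and the index of an oom_group
 * ancestor (-1 if it competes on its own). Not the kernel's layout.
 */
struct leaf {
	long score;
	int group;
};

/*
 * Model of the group_score accumulation in select_victim_memcg():
 * leaves under the same oom_group ancestor are summed and compete
 * as one unit against standalone leaves. Returns the winning score.
 */
static long pick_victim_score(const struct leaf *leaves, size_t n)
{
	long group_score[MAX_GROUPS] = {0};
	long best = 0;
	size_t i;
	int g;

	for (i = 0; i < n; i++) {
		if (leaves[i].group >= 0)
			group_score[leaves[i].group] += leaves[i].score;
		else if (leaves[i].score > best)
			best = leaves[i].score;
	}
	for (g = 0; g < MAX_GROUPS; g++)
		if (group_score[g] > best)
			best = group_score[g];
	return best;
}
```

This illustrates why oom_group changes victim selection: two small leaves under one oom_group ancestor can out-score a single larger standalone leaf.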
* [PATCH v13 5/7] mm, oom: add cgroup v2 mount option for cgroup-aware OOM killer
2017-11-30 15:28 [PATCH v13 0/7] cgroup-aware OOM killer Roman Gushchin
` (3 preceding siblings ...)
2017-11-30 15:28 ` [PATCH v13 4/7] mm, oom: introduce memory.oom_group Roman Gushchin
@ 2017-11-30 15:28 ` Roman Gushchin
2017-12-01 8:41 ` Michal Hocko
2017-11-30 15:28 ` [PATCH v13 6/7] mm, oom, docs: describe the " Roman Gushchin
` (4 subsequent siblings)
9 siblings, 1 reply; 51+ messages in thread
From: Roman Gushchin @ 2017-11-30 15:28 UTC (permalink / raw)
To: linux-mm
Cc: Roman Gushchin, Michal Hocko, Vladimir Davydov, Johannes Weiner,
Tetsuo Handa, David Rientjes, Andrew Morton, Tejun Heo,
kernel-team, cgroups, linux-doc, linux-kernel, linux-mm
Add a "groupoom" cgroup v2 mount option to enable the cgroup-aware
OOM killer. If not set, the OOM selection is performed in
a "traditional" per-process way.
The behavior can be changed dynamically by remounting the cgroupfs.
Signed-off-by: Roman Gushchin <guro@fb.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: David Rientjes <rientjes@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: kernel-team@fb.com
Cc: cgroups@vger.kernel.org
Cc: linux-doc@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
---
include/linux/cgroup-defs.h | 5 +++++
kernel/cgroup/cgroup.c | 10 ++++++++++
mm/memcontrol.c | 3 +++
3 files changed, 18 insertions(+)
diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index 8b7fd8eeccee..9fb99e25d654 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -81,6 +81,11 @@ enum {
* Enable cpuset controller in v1 cgroup to use v2 behavior.
*/
CGRP_ROOT_CPUSET_V2_MODE = (1 << 4),
+
+ /*
+ * Enable cgroup-aware OOM killer.
+ */
+ CGRP_GROUP_OOM = (1 << 5),
};
/* cftype->flags */
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 0b1ffe147f24..7338e12979e1 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -1731,6 +1731,9 @@ static int parse_cgroup_root_flags(char *data, unsigned int *root_flags)
if (!strcmp(token, "nsdelegate")) {
*root_flags |= CGRP_ROOT_NS_DELEGATE;
continue;
+ } else if (!strcmp(token, "groupoom")) {
+ *root_flags |= CGRP_GROUP_OOM;
+ continue;
}
pr_err("cgroup2: unknown option \"%s\"\n", token);
@@ -1747,6 +1750,11 @@ static void apply_cgroup_root_flags(unsigned int root_flags)
cgrp_dfl_root.flags |= CGRP_ROOT_NS_DELEGATE;
else
cgrp_dfl_root.flags &= ~CGRP_ROOT_NS_DELEGATE;
+
+ if (root_flags & CGRP_GROUP_OOM)
+ cgrp_dfl_root.flags |= CGRP_GROUP_OOM;
+ else
+ cgrp_dfl_root.flags &= ~CGRP_GROUP_OOM;
}
}
@@ -1754,6 +1762,8 @@ static int cgroup_show_options(struct seq_file *seq, struct kernfs_root *kf_root
{
if (cgrp_dfl_root.flags & CGRP_ROOT_NS_DELEGATE)
seq_puts(seq, ",nsdelegate");
+ if (cgrp_dfl_root.flags & CGRP_GROUP_OOM)
+ seq_puts(seq, ",groupoom");
return 0;
}
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 5d27a4bbd478..c76d5fb55c5c 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2869,6 +2869,9 @@ bool mem_cgroup_select_oom_victim(struct oom_control *oc)
if (!cgroup_subsys_on_dfl(memory_cgrp_subsys))
return false;
+ if (!(cgrp_dfl_root.flags & CGRP_GROUP_OOM))
+ return false;
+
if (oc->memcg)
root = oc->memcg;
else
--
2.14.3
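[Editorial aside: the token-to-flag mapping in parse_cgroup_root_flags() boils down to a strcmp chain; a minimal standalone sketch follows. The CGRP_ROOT_NS_DELEGATE bit value here is illustrative; only CGRP_GROUP_OOM = (1 << 5) matches the patch.]

```c
#include <assert.h>
#include <string.h>

#define CGRP_ROOT_NS_DELEGATE (1 << 3)  /* illustrative value */
#define CGRP_GROUP_OOM        (1 << 5)  /* as defined by the patch */

/* Minimal model of the token matching in parse_cgroup_root_flags(). */
static int parse_root_flag(const char *token, unsigned int *root_flags)
{
	if (!strcmp(token, "nsdelegate")) {
		*root_flags |= CGRP_ROOT_NS_DELEGATE;
		return 0;
	}
	if (!strcmp(token, "groupoom")) {
		*root_flags |= CGRP_GROUP_OOM;
		return 0;
	}
	return -1;  /* unknown option */
}
```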
* Re: [PATCH v13 5/7] mm, oom: add cgroup v2 mount option for cgroup-aware OOM killer
2017-11-30 15:28 ` [PATCH v13 5/7] mm, oom: add cgroup v2 mount option for cgroup-aware OOM killer Roman Gushchin
@ 2017-12-01 8:41 ` Michal Hocko
2017-12-01 13:15 ` Roman Gushchin
0 siblings, 1 reply; 51+ messages in thread
From: Michal Hocko @ 2017-12-01 8:41 UTC (permalink / raw)
To: Roman Gushchin
Cc: linux-mm, Vladimir Davydov, Johannes Weiner, Tetsuo Handa,
David Rientjes, Andrew Morton, Tejun Heo, kernel-team, cgroups,
linux-doc, linux-kernel, linux-mm
On Thu 30-11-17 15:28:22, Roman Gushchin wrote:
> Add a "groupoom" cgroup v2 mount option to enable the cgroup-aware
> OOM killer. If not set, the OOM selection is performed in
> a "traditional" per-process way.
>
> The behavior can be changed dynamically by remounting the cgroupfs.
Is it ok to create oom_group if the option is not enabled? This looks
confusing. I forgot all the details about how cgroup core creates files,
so I do not have a good idea how to fix this.
> Signed-off-by: Roman Gushchin <guro@fb.com>
> Cc: Michal Hocko <mhocko@kernel.org>
> Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> Cc: David Rientjes <rientjes@google.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Tejun Heo <tj@kernel.org>
> Cc: kernel-team@fb.com
> Cc: cgroups@vger.kernel.org
> Cc: linux-doc@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org
> Cc: linux-mm@kvack.org
> ---
> include/linux/cgroup-defs.h | 5 +++++
> kernel/cgroup/cgroup.c | 10 ++++++++++
> mm/memcontrol.c | 3 +++
> 3 files changed, 18 insertions(+)
>
> diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
> index 8b7fd8eeccee..9fb99e25d654 100644
> --- a/include/linux/cgroup-defs.h
> +++ b/include/linux/cgroup-defs.h
> @@ -81,6 +81,11 @@ enum {
> * Enable cpuset controller in v1 cgroup to use v2 behavior.
> */
> CGRP_ROOT_CPUSET_V2_MODE = (1 << 4),
> +
> + /*
> + * Enable cgroup-aware OOM killer.
> + */
> + CGRP_GROUP_OOM = (1 << 5),
> };
>
> /* cftype->flags */
> diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
> index 0b1ffe147f24..7338e12979e1 100644
> --- a/kernel/cgroup/cgroup.c
> +++ b/kernel/cgroup/cgroup.c
> @@ -1731,6 +1731,9 @@ static int parse_cgroup_root_flags(char *data, unsigned int *root_flags)
> if (!strcmp(token, "nsdelegate")) {
> *root_flags |= CGRP_ROOT_NS_DELEGATE;
> continue;
> + } else if (!strcmp(token, "groupoom")) {
> + *root_flags |= CGRP_GROUP_OOM;
> + continue;
> }
>
> pr_err("cgroup2: unknown option \"%s\"\n", token);
> @@ -1747,6 +1750,11 @@ static void apply_cgroup_root_flags(unsigned int root_flags)
> cgrp_dfl_root.flags |= CGRP_ROOT_NS_DELEGATE;
> else
> cgrp_dfl_root.flags &= ~CGRP_ROOT_NS_DELEGATE;
> +
> + if (root_flags & CGRP_GROUP_OOM)
> + cgrp_dfl_root.flags |= CGRP_GROUP_OOM;
> + else
> + cgrp_dfl_root.flags &= ~CGRP_GROUP_OOM;
> }
> }
>
> @@ -1754,6 +1762,8 @@ static int cgroup_show_options(struct seq_file *seq, struct kernfs_root *kf_root
> {
> if (cgrp_dfl_root.flags & CGRP_ROOT_NS_DELEGATE)
> seq_puts(seq, ",nsdelegate");
> + if (cgrp_dfl_root.flags & CGRP_GROUP_OOM)
> + seq_puts(seq, ",groupoom");
> return 0;
> }
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 5d27a4bbd478..c76d5fb55c5c 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2869,6 +2869,9 @@ bool mem_cgroup_select_oom_victim(struct oom_control *oc)
> if (!cgroup_subsys_on_dfl(memory_cgrp_subsys))
> return false;
>
> + if (!(cgrp_dfl_root.flags & CGRP_GROUP_OOM))
> + return false;
> +
> if (oc->memcg)
> root = oc->memcg;
> else
> --
> 2.14.3
--
Michal Hocko
SUSE Labs
* Re: [PATCH v13 5/7] mm, oom: add cgroup v2 mount option for cgroup-aware OOM killer
2017-12-01 8:41 ` Michal Hocko
@ 2017-12-01 13:15 ` Roman Gushchin
2017-12-01 13:31 ` Michal Hocko
0 siblings, 1 reply; 51+ messages in thread
From: Roman Gushchin @ 2017-12-01 13:15 UTC (permalink / raw)
To: Michal Hocko
Cc: linux-mm, Vladimir Davydov, Johannes Weiner, Tetsuo Handa,
David Rientjes, Andrew Morton, Tejun Heo, kernel-team, cgroups,
linux-doc, linux-kernel, linux-mm
On Fri, Dec 01, 2017 at 09:41:13AM +0100, Michal Hocko wrote:
> On Thu 30-11-17 15:28:22, Roman Gushchin wrote:
> > Add a "groupoom" cgroup v2 mount option to enable the cgroup-aware
> > OOM killer. If not set, the OOM selection is performed in
> > a "traditional" per-process way.
> >
> > The behavior can be changed dynamically by remounting the cgroupfs.
>
> Is it ok to create oom_group if the option is not enabled? This looks
> confusing. I forgot all the details about how cgroup core creates file
> so I do not have a good idea how to fix this.
I don't think we do show/hide interface files dynamically.
Even for things like socket memory which can be disabled by the boot option,
we don't hide the corresponding stats entry.
So, maybe we just need to return -EAGAIN (or maybe -ENOTSUP) on any read/write
attempt if the option is not enabled?
Thanks!
* Re: [PATCH v13 5/7] mm, oom: add cgroup v2 mount option for cgroup-aware OOM killer
2017-12-01 13:15 ` Roman Gushchin
@ 2017-12-01 13:31 ` Michal Hocko
2017-12-01 17:00 ` Roman Gushchin
0 siblings, 1 reply; 51+ messages in thread
From: Michal Hocko @ 2017-12-01 13:31 UTC (permalink / raw)
To: Roman Gushchin
Cc: linux-mm, Vladimir Davydov, Johannes Weiner, Tetsuo Handa,
David Rientjes, Andrew Morton, Tejun Heo, kernel-team, cgroups,
linux-doc, linux-kernel, linux-mm
On Fri 01-12-17 13:15:38, Roman Gushchin wrote:
[...]
> So, maybe we just need to return -EAGAIN (or may be -ENOTSUP) on any read/write
> attempt if option is not enabled?
Yes, that would work as well. ENOTSUP sounds better to me.
--
Michal Hocko
SUSE Labs
* Re: [PATCH v13 5/7] mm, oom: add cgroup v2 mount option for cgroup-aware OOM killer
2017-12-01 13:31 ` Michal Hocko
@ 2017-12-01 17:00 ` Roman Gushchin
0 siblings, 0 replies; 51+ messages in thread
From: Roman Gushchin @ 2017-12-01 17:00 UTC (permalink / raw)
To: Michal Hocko
Cc: Vladimir Davydov, Johannes Weiner, Tetsuo Handa, David Rientjes,
Andrew Morton, Tejun Heo, kernel-team, cgroups, linux-doc,
linux-kernel, linux-mm
On Fri, Dec 01, 2017 at 02:31:45PM +0100, Michal Hocko wrote:
> On Fri 01-12-17 13:15:38, Roman Gushchin wrote:
> [...]
> > So, maybe we just need to return -EAGAIN (or may be -ENOTSUP) on any read/write
> > attempt if option is not enabled?
>
> Yes, that would work as well. ENOTSUP sounds better to me.
> --
> Michal Hocko
> SUSE Labs
From 78bf2c00abf450bcd993d02a7dc1783144005fbd Mon Sep 17 00:00:00 2001
From: Roman Gushchin <guro@fb.com>
Date: Fri, 1 Dec 2017 14:30:14 +0000
Subject: [PATCH] mm, oom: return error on access to memory.oom_group if
groupoom is disabled
The cgroup-aware OOM killer depends on a cgroup mount option and is turned
off by default, even though the user interface (the memory.oom_group file)
is always present. As this might be confusing to a user, let's return
-ENOTSUPP on any attempt to access memory.oom_group if groupoom is not
enabled globally.
Example:
$ cd /sys/fs/cgroup/user.slice/
$ cat memory.oom_group
cat: memory.oom_group: Unknown error 524
$ echo 1 > memory.oom_group
-bash: echo: write error: Unknown error 524
$ mount -o remount,groupoom /sys/fs/cgroup
$ echo 1 > memory.oom_group
$ cat memory.oom_group
1
Signed-off-by: Roman Gushchin <guro@fb.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: kernel-team@fb.com
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
---
mm/memcontrol.c | 6 ++++++
1 file changed, 6 insertions(+)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index c76d5fb55c5c..b709ee4f914b 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5482,6 +5482,9 @@ static int memory_oom_group_show(struct seq_file *m, void *v)
struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m));
bool oom_group = memcg->oom_group;
+ if (!(cgrp_dfl_root.flags & CGRP_GROUP_OOM))
+ return -ENOTSUPP;
+
seq_printf(m, "%d\n", oom_group);
return 0;
@@ -5495,6 +5498,9 @@ static ssize_t memory_oom_group_write(struct kernfs_open_file *of,
int oom_group;
int err;
+ if (!(cgrp_dfl_root.flags & CGRP_GROUP_OOM))
+ return -ENOTSUPP;
+
err = kstrtoint(strstrip(buf), 0, &oom_group);
if (err)
return err;
--
2.14.3
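[Editorial aside: the "Unknown error 524" in the example session is the kernel-internal ENOTSUPP (524), for which userspace strerror() has no string. The gate the patch adds to the show/write handlers can be modeled in a few lines; the names and single-global layout below are a sketch, not the memcg code.]

```c
#include <assert.h>

/* ENOTSUPP is a kernel-internal errno; userspace has no string for it,
 * hence "Unknown error 524" in the example session. */
#define ENOTSUPP 524

static int groupoom_enabled;  /* stand-in for cgrp_dfl_root.flags & CGRP_GROUP_OOM */

/* Minimal model of the gate added to memory_oom_group_show()/_write(). */
static int oom_group_access(int write, int *val, int new_val)
{
	static int oom_group;  /* the per-memcg knob, modeled as one global */

	if (!groupoom_enabled)
		return -ENOTSUPP;
	if (write)
		oom_group = new_val;
	else
		*val = oom_group;
	return 0;
}
```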
* [PATCH v13 6/7] mm, oom, docs: describe the cgroup-aware OOM killer
2017-11-30 15:28 [PATCH v13 0/7] cgroup-aware OOM killer Roman Gushchin
` (4 preceding siblings ...)
2017-11-30 15:28 ` [PATCH v13 5/7] mm, oom: add cgroup v2 mount option for cgroup-aware OOM killer Roman Gushchin
@ 2017-11-30 15:28 ` Roman Gushchin
2017-12-01 8:41 ` Michal Hocko
2017-11-30 15:28 ` [PATCH v13 7/7] cgroup: list groupoom in cgroup features Roman Gushchin
` (3 subsequent siblings)
9 siblings, 1 reply; 51+ messages in thread
From: Roman Gushchin @ 2017-11-30 15:28 UTC (permalink / raw)
To: linux-mm
Cc: Roman Gushchin, Johannes Weiner, Michal Hocko, Vladimir Davydov,
Tetsuo Handa, Andrew Morton, David Rientjes, Tejun Heo,
kernel-team, cgroups, linux-doc, linux-kernel, linux-mm
Document the cgroup-aware OOM killer.
Signed-off-by: Roman Gushchin <guro@fb.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: kernel-team@fb.com
Cc: cgroups@vger.kernel.org
Cc: linux-doc@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
---
Documentation/cgroup-v2.txt | 58 +++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 58 insertions(+)
diff --git a/Documentation/cgroup-v2.txt b/Documentation/cgroup-v2.txt
index 779211fbb69f..c80a147f94b7 100644
--- a/Documentation/cgroup-v2.txt
+++ b/Documentation/cgroup-v2.txt
@@ -48,6 +48,7 @@ v1 is available under Documentation/cgroup-v1/.
5-2-1. Memory Interface Files
5-2-2. Usage Guidelines
5-2-3. Memory Ownership
+ 5-2-4. OOM Killer
5-3. IO
5-3-1. IO Interface Files
5-3-2. Writeback
@@ -1026,6 +1027,28 @@ PAGE_SIZE multiple when read back.
high limit is used and monitored properly, this limit's
utility is limited to providing the final safety net.
+ memory.oom_group
+
+ A read-write single value file which exists on non-root
+ cgroups. The default is "0".
+
+ If set, the OOM killer will consider the memory cgroup as an
+ indivisible memory consumer and compare it with other memory
+ consumers by its memory footprint.
+ If such a memory cgroup is selected as an OOM victim, all
+ processes belonging to it or its descendants will be killed.
+
+ This applies to system-wide OOM conditions and to reaching
+ the hard memory limit of the cgroup or any of its ancestors.
+ If the OOM condition happens in a descendant cgroup with its own
+ memory limit, that memory cgroup can't be considered
+ as an OOM victim, and the OOM killer will not kill all of its
+ tasks.
+
+ Also, the OOM killer respects the /proc/pid/oom_score_adj value -1000,
+ and will never kill such an unkillable task, even if memory.oom_group
+ is set.
+
memory.events
A read-only flat-keyed file which exists on non-root cgroups.
The following entries are defined. Unless specified
@@ -1229,6 +1252,41 @@ to be accessed repeatedly by other cgroups, it may make sense to use
POSIX_FADV_DONTNEED to relinquish the ownership of memory areas
belonging to the affected files to ensure correct memory ownership.
+OOM Killer
+~~~~~~~~~~
+
+The cgroup v2 memory controller implements a cgroup-aware OOM killer,
+which treats cgroups as first-class OOM entities.
+
+Under OOM conditions the memory controller tries to make the best
+choice of a victim, looking for a memory cgroup with the largest
+memory footprint, considering leaf cgroups and cgroups with the
+memory.oom_group option set, which are treated as indivisible
+memory consumers.
+
+By default, the OOM killer will kill the biggest task in the selected
+memory cgroup. A user can change this behavior by enabling
+the per-cgroup memory.oom_group option. If set, it causes
+the OOM killer to kill all processes attached to the cgroup,
+except processes with oom_score_adj set to -1000.
+
+This affects both system- and cgroup-wide OOMs. For a cgroup-wide OOM
+the memory controller considers only cgroups belonging to the sub-tree
+of the OOM'ing cgroup.
+
+The root cgroup is treated as a leaf memory cgroup, so it's compared
+with other leaf memory cgroups and cgroups with oom_group option set.
+
+If there are no cgroups with the memory controller enabled,
+the OOM killer uses the "traditional" process-based approach.
+
+Please note that memory charges are not migrated when tasks
+are moved between different memory cgroups. Moving tasks with
+a significant memory footprint may affect the OOM victim selection
+logic. If that's the case, please consider creating a common
+ancestor for the source and destination memory cgroups and
+enabling oom_group at the ancestor level.
+
IO
--
--
2.14.3
* Re: [PATCH v13 6/7] mm, oom, docs: describe the cgroup-aware OOM killer
2017-11-30 15:28 ` [PATCH v13 6/7] mm, oom, docs: describe the " Roman Gushchin
@ 2017-12-01 8:41 ` Michal Hocko
2017-12-01 17:01 ` Roman Gushchin
0 siblings, 1 reply; 51+ messages in thread
From: Michal Hocko @ 2017-12-01 8:41 UTC (permalink / raw)
To: Roman Gushchin
Cc: linux-mm, Johannes Weiner, Vladimir Davydov, Tetsuo Handa,
Andrew Morton, David Rientjes, Tejun Heo, kernel-team, cgroups,
linux-doc, linux-kernel, linux-mm
On Thu 30-11-17 15:28:23, Roman Gushchin wrote:
> @@ -1229,6 +1252,41 @@ to be accessed repeatedly by other cgroups, it may make sense to use
> POSIX_FADV_DONTNEED to relinquish the ownership of memory areas
> belonging to the affected files to ensure correct memory ownership.
>
> +OOM Killer
> +~~~~~~~~~~
> +
> +Cgroup v2 memory controller implements a cgroup-aware OOM killer.
> +It means that it treats cgroups as first class OOM entities.
This should mention groupoom mount option to enable this functionality.
Other than that looks ok to me
Acked-by: Michal Hocko <mhocko@suse.com>
--
Michal Hocko
SUSE Labs
* Re: [PATCH v13 6/7] mm, oom, docs: describe the cgroup-aware OOM killer
2017-12-01 8:41 ` Michal Hocko
@ 2017-12-01 17:01 ` Roman Gushchin
2017-12-01 17:13 ` Michal Hocko
0 siblings, 1 reply; 51+ messages in thread
From: Roman Gushchin @ 2017-12-01 17:01 UTC (permalink / raw)
To: Michal Hocko
Cc: linux-mm, Johannes Weiner, Vladimir Davydov, Tetsuo Handa,
Andrew Morton, David Rientjes, Tejun Heo, kernel-team, cgroups,
linux-doc, linux-kernel, linux-mm
On Fri, Dec 01, 2017 at 09:41:54AM +0100, Michal Hocko wrote:
> On Thu 30-11-17 15:28:23, Roman Gushchin wrote:
> > @@ -1229,6 +1252,41 @@ to be accessed repeatedly by other cgroups, it may make sense to use
> > POSIX_FADV_DONTNEED to relinquish the ownership of memory areas
> > belonging to the affected files to ensure correct memory ownership.
> >
> > +OOM Killer
> > +~~~~~~~~~~
> > +
> > +Cgroup v2 memory controller implements a cgroup-aware OOM killer.
> > +It means that it treats cgroups as first class OOM entities.
>
> This should mention groupoom mount option to enable this functionality.
>
> Other than that looks ok to me
> Acked-by: Michal Hocko <mhocko@suse.com>
> --
> Michal Hocko
> SUSE Labs
From 1d9c87128897ee7f27f9651d75b80f73985373e8 Mon Sep 17 00:00:00 2001
From: Roman Gushchin <guro@fb.com>
Date: Fri, 1 Dec 2017 15:34:59 +0000
Subject: [PATCH] mm, oom, docs: document groupoom mount option
Add a note that cgroup-aware OOM logic is disabled by default
and describe how to enable it.
Signed-off-by: Roman Gushchin <guro@fb.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: kernel-team@fb.com
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
---
Documentation/cgroup-v2.txt | 9 +++++++++
1 file changed, 9 insertions(+)
diff --git a/Documentation/cgroup-v2.txt b/Documentation/cgroup-v2.txt
index c80a147f94b7..ff8e92db636d 100644
--- a/Documentation/cgroup-v2.txt
+++ b/Documentation/cgroup-v2.txt
@@ -1049,6 +1049,9 @@ PAGE_SIZE multiple when read back.
and will never kill the unkillable task, even if memory.oom_group
is set.
+ If the cgroup-aware OOM killer is not enabled, an ENOTSUPP error
+ is returned on any attempt to access the file.
+
memory.events
A read-only flat-keyed file which exists on non-root cgroups.
The following entries are defined. Unless specified
@@ -1258,6 +1261,12 @@ OOM Killer
Cgroup v2 memory controller implements a cgroup-aware OOM killer.
It means that it treats cgroups as first class OOM entities.
+Cgroup-aware OOM logic is turned off by default and requires
+passing the "groupoom" option on mounting cgroupfs. It can also
+be enabled by remounting cgroupfs with the following command::
+
+ # mount -o remount,groupoom $MOUNT_POINT
+
Under OOM conditions the memory controller tries to make the best
choice of a victim, looking for a memory cgroup with the largest
memory footprint, considering leaf cgroups and cgroups with the
--
2.14.3
* Re: [PATCH v13 6/7] mm, oom, docs: describe the cgroup-aware OOM killer
2017-12-01 17:01 ` Roman Gushchin
@ 2017-12-01 17:13 ` Michal Hocko
0 siblings, 0 replies; 51+ messages in thread
From: Michal Hocko @ 2017-12-01 17:13 UTC (permalink / raw)
To: Roman Gushchin
Cc: linux-mm, Johannes Weiner, Vladimir Davydov, Tetsuo Handa,
Andrew Morton, David Rientjes, Tejun Heo, kernel-team, cgroups,
linux-doc, linux-kernel, linux-mm
On Fri 01-12-17 17:01:49, Roman Gushchin wrote:
[...]
> diff --git a/Documentation/cgroup-v2.txt b/Documentation/cgroup-v2.txt
> index c80a147f94b7..ff8e92db636d 100644
> --- a/Documentation/cgroup-v2.txt
> +++ b/Documentation/cgroup-v2.txt
> @@ -1049,6 +1049,9 @@ PAGE_SIZE multiple when read back.
> and will never kill the unkillable task, even if memory.oom_group
> is set.
>
> + If cgroup-aware OOM killer is not enabled, ENOTSUPP error
> + is returned on attempt to access the file.
> +
> memory.events
> A read-only flat-keyed file which exists on non-root cgroups.
> The following entries are defined. Unless specified
> @@ -1258,6 +1261,12 @@ OOM Killer
> Cgroup v2 memory controller implements a cgroup-aware OOM killer.
> It means that it treats cgroups as first class OOM entities.
>
> +Cgroup-aware OOM logic is turned off by default and requires
> +passing the "groupoom" option on mounting cgroupfs. It can also
> +be enabled by remounting cgroupfs with the following command::
> +
> + # mount -o remount,groupoom $MOUNT_POINT
> +
> Under OOM conditions the memory controller tries to make the best
> choice of a victim, looking for a memory cgroup with the largest
> memory footprint, considering leaf cgroups and cgroups with the
Looks good to me
Thanks!
--
Michal Hocko
SUSE Labs
* [PATCH v13 7/7] cgroup: list groupoom in cgroup features
2017-11-30 15:28 [PATCH v13 0/7] cgroup-aware OOM killer Roman Gushchin
` (5 preceding siblings ...)
2017-11-30 15:28 ` [PATCH v13 6/7] mm, oom, docs: describe the " Roman Gushchin
@ 2017-11-30 15:28 ` Roman Gushchin
2017-11-30 20:39 ` [PATCH v13 0/7] cgroup-aware OOM killer Andrew Morton
` (2 subsequent siblings)
9 siblings, 0 replies; 51+ messages in thread
From: Roman Gushchin @ 2017-11-30 15:28 UTC (permalink / raw)
To: linux-mm
Cc: Roman Gushchin, Tejun Heo, Michal Hocko, Vladimir Davydov,
Johannes Weiner, Tetsuo Handa, David Rientjes, Andrew Morton,
kernel-team, cgroups, linux-doc, linux-kernel, linux-mm
List groupoom in the cgroup features list (exported via
/sys/kernel/cgroup/features), which can be used by userspace
apps (most likely, systemd) to get an idea of which cgroup features
are supported by the kernel.
Signed-off-by: Roman Gushchin <guro@fb.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: David Rientjes <rientjes@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: kernel-team@fb.com
Cc: cgroups@vger.kernel.org
Cc: linux-doc@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
---
kernel/cgroup/cgroup.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 7338e12979e1..693443282fc1 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -5922,7 +5922,8 @@ static struct kobj_attribute cgroup_delegate_attr = __ATTR_RO(delegate);
static ssize_t features_show(struct kobject *kobj, struct kobj_attribute *attr,
char *buf)
{
- return snprintf(buf, PAGE_SIZE, "nsdelegate\n");
+ return snprintf(buf, PAGE_SIZE, "nsdelegate\n"
+ "groupoom\n");
}
static struct kobj_attribute cgroup_features_attr = __ATTR_RO(features);
--
2.14.3
* Re: [PATCH v13 0/7] cgroup-aware OOM killer
2017-11-30 15:28 [PATCH v13 0/7] cgroup-aware OOM killer Roman Gushchin
` (6 preceding siblings ...)
2017-11-30 15:28 ` [PATCH v13 7/7] cgroup: list groupoom in cgroup features Roman Gushchin
@ 2017-11-30 20:39 ` Andrew Morton
2018-01-10 0:57 ` David Rientjes
2017-12-01 9:14 ` [PATCH] mm, oom: simplify alloc_pages_before_oomkill handling Michal Hocko
2018-06-05 11:47 ` [PATCH v13 0/7] cgroup-aware OOM killer Michal Hocko
9 siblings, 1 reply; 51+ messages in thread
From: Andrew Morton @ 2017-11-30 20:39 UTC (permalink / raw)
To: Roman Gushchin
Cc: linux-mm, Michal Hocko, Vladimir Davydov, Johannes Weiner,
Tetsuo Handa, David Rientjes, Tejun Heo, kernel-team, cgroups,
linux-doc, linux-kernel, linux-mm
On Thu, 30 Nov 2017 15:28:17 +0000 Roman Gushchin <guro@fb.com> wrote:
> This patchset makes the OOM killer cgroup-aware.
Thanks, I'll grab these.
There has been controversy over this patchset, to say the least. I
can't say that I followed it closely! Could those who still have
reservations please summarise their concerns and hopefully suggest a
way forward?
* Re: [PATCH v13 0/7] cgroup-aware OOM killer
2017-11-30 20:39 ` [PATCH v13 0/7] cgroup-aware OOM killer Andrew Morton
@ 2018-01-10 0:57 ` David Rientjes
2018-01-10 13:11 ` Roman Gushchin
0 siblings, 1 reply; 51+ messages in thread
From: David Rientjes @ 2018-01-10 0:57 UTC (permalink / raw)
To: Andrew Morton
Cc: Roman Gushchin, linux-mm, Michal Hocko, Vladimir Davydov,
Johannes Weiner, Tetsuo Handa, Tejun Heo, kernel-team, cgroups,
linux-doc, linux-kernel, linux-mm
On Thu, 30 Nov 2017, Andrew Morton wrote:
> > This patchset makes the OOM killer cgroup-aware.
>
> Thanks, I'll grab these.
>
> There has been controversy over this patchset, to say the least. I
> can't say that I followed it closely! Could those who still have
> reservations please summarise their concerns and hopefully suggest a
> way forward?
>
Yes, I'll summarize what my concerns have been in the past and what they
are wrt the patchset as it stands in -mm. None of them originate from my
current usecase or anticipated future usecase of the oom killer for
system-wide or memcg-constrained oom conditions. They are based purely on
the patchset's use of an incomplete and unfair heuristic for deciding
which cgroup to target.
I'll also suggest simple changes to the patchset, which I have in the
past, that can be made to address all of these concerns.
1. The unfair comparison of the root mem cgroup vs leaf mem cgroups
The patchset uses two different heuristics to compare root and leaf mem
cgroups and scores them based on number of pages. For the root mem
cgroup, it totals the /proc/pid/oom_score of all processes attached:
that's based on rss, swap, pgtables, and, most importantly, oom_score_adj.
For leaf mem cgroups, it's based on that memcg's anonymous, unevictable,
unreclaimable slab, kernel stack, and swap counters. These can be wildly
different independent of /proc/pid/oom_score_adj, but the most obvious
unfairness comes from users who tune oom_score_adj.
An example: start a process that faults 1GB of anonymous memory and leave
it attached to the root mem cgroup. Start six more processes that each
fault 1GB of anonymous memory and attach them to a leaf mem cgroup. Set
all processes to have /proc/pid/oom_score_adj of 1000. System oom kill
will always kill the 1GB process attached to the root mem cgroup. It's
because oom_badness() relies on /proc/pid/oom_score_adj, which is used to
evaluate the root mem cgroup, and leaf mem cgroups completely disregard
it.
In this example, the leaf mem cgroup's score is 1,573,044, the number of
pages for the 6GB of faulted memory. The root mem cgroup's score is
12,652,907, eight times larger even though its usage is six times smaller.
This is caused by the patchset disregarding oom_score_adj entirely for
leaf mem cgroups and relying on it heavily for the root mem cgroup. It's
the complete opposite result of what the cgroup aware oom killer
advertises.
It also works the other way, if a large memory hog is attached to the root
mem cgroup but has a negative oom_score_adj it is never killed and random
processes are nuked solely because they happened to be attached to a leaf
mem cgroup. This behavior wrt oom_score_adj is completely undocumented,
so I can't presume that it is either known or tested.
Solution: compare the root mem cgroup and leaf mem cgroups equally with
the same criteria by doing hierarchical accounting of usage and
subtracting from total system usage to find root usage.
2. Evading the oom killer by attaching processes to child cgroups
Any cgroup on the system can attach all their processes to individual
child cgroups. This is functionally the same as doing
for i in $(cat cgroup.procs); do mkdir $i; echo $i > $i/cgroup.procs; done
without the no internal process constraint introduced with cgroup v2. All
child cgroups are evaluated based on their own usage: all anon,
unevictable, and unreclaimable slab as described previously. It requires
an individual cgroup to be the single largest consumer to be targeted by
the oom killer.
An example: allow users to manage two different mem cgroup hierarchies
limited to 100GB each. User A uses 10GB of memory and user B uses 90GB of
memory in their respective hierarchies. On a system oom condition, we'd
expect at least one process from user B's hierarchy would always be oom
killed with the cgroup aware oom killer. In fact, the changelog
explicitly states it solves an issue where "1) There is no fairness
between containers. A small container with few large processes will be
chosen over a large one with huge number of small processes."
The opposite becomes true, however, if user B creates child cgroups and
distributes its processes such that each child cgroup's usage never
exceeds 10GB of memory. This can either be done intentionally to
purposefully have a low cgroup memory footprint to evade the oom killer or
unintentionally with cgroup v2 to allow those individual processes to be
constrained by other cgroups in a single hierarchy model. User A, using
10% of his memory limit, is always oom killed instead of user B, using 90%
of his memory limit.
Others have commented it's still possible to do this with a per-process
model if users split their processes into many subprocesses with small
memory footprints.
Solution: comparing cgroups must be done hierarchically. Neither user A
nor user B can evade the oom killer because targeting is done based on the
total hierarchical usage rather than individual cgroups in their
hierarchies.
3. Userspace has zero control over oom kill selection in leaf mem cgroups
Unlike using /proc/pid/oom_score_adj to bias or prefer certain processes
from the oom killer, the cgroup aware oom killer does not provide any
solution for the user to protect leaf mem cgroups. This is a result of
leaf mem cgroups being evaluated based on their anon, unevictable, and
unreclaimable slab usage and disregarding any user tunable.
Absent the cgroup aware oom killer, users have the ability to strongly
prefer a process is oom killed (/proc/pid/oom_score_adj = 1000) or
strongly bias against a process (/proc/pid/oom_score_adj = -999).
An example: a process knows it's going to use a lot of memory, so it sets
/proc/self/oom_score_adj to 1000. It wants to be killed first to avoid
disrupting any other process. If it's attached to the root mem cgroup,
it will be oom killed. If it's attached to a leaf mem cgroup by an admin
outside its control, it will never be oom killed unless that cgroup's
usage is the largest single cgroup usage on the system. The reverse also
is true for processes that the admin does not want to be oom killed: set
/proc/pid/oom_score_adj to -999, but it will *always* be oom killed if its
cgroup has the highest usage on the system.
The result is that both admins and users have lost all control over which
processes are oom killed. They are left with only one alternative: set
/proc/pid/oom_score_adj to -1000 to completely disable a process from oom
kill. It doesn't address the issue at all for memcg-constrained oom
conditions since no processes are killable anymore, and risks panicking
the system if it is the only process left on the system. A process
preferring that it is first in line for oom kill simply cannot volunteer
anymore.
Solution: allow users and admins to control oom kill selection by
introducing a memory.oom_score_adj to affect the oom score of that mem
cgroup, exactly the same as /proc/pid/oom_score_adj affects the oom score
of a process.
I proposed a solution in
https://marc.info/?l=linux-kernel&m=150956897302725, which was never
responded to, for all of these issues. The idea is to do hierarchical
accounting of mem cgroup hierarchies so that the hierarchy is traversed
comparing total usage at each level to select target cgroups. Admins and
users can use memory.oom_score_adj to influence that decisionmaking at
each level.
This solves #1 because mem cgroups can be compared based on the same
classes of memory and the root mem cgroup's usage can be fairly compared
by subtracting top-level mem cgroup usage from system usage. All of the
criteria used to evaluate a leaf mem cgroup have a reasonable system-wide
counterpart that can be used to do the simple subtraction.
This solves #2 because evaluation is done hierarchically so that
distributing processes over a set of child cgroups either intentionally
or unintentionally no longer evades the oom killer. Total usage is always
accounted to the parent and there is no escaping this criteria for users.
This solves #3 because it allows admins to protect important processes in
cgroups that are supposed to use, for example, 75% of system memory
without it unconditionally being selected for oom kill but still oom kill
if it exceeds a certain threshold. In this sense, the cgroup aware oom
killer, as currently implemented, is selling mem cgroups short by
requiring the user to accept that the important process will be oom killed
iff it uses mem cgroups and isn't attached to root. It also allows users
to actually volunteer to be oom killed first without majority usage.
It has come up time and time again that this support can be introduced on
top of the cgroup oom killer as implemented. It simply cannot. For
admins and users to have control over decisionmaking, it needs an
oom_score_adj type tunable that cannot change semantics from kernel
version to kernel version and does not pollute the mem cgroup filesystem.
That, in my suggestion, is an adjustment on the amount of total
hierarchical usage of each mem cgroup at each level of the hierarchy.
That requires that the heuristic uses hierarchical usage rather than
considering each cgroup as independent consumers as it does today. We
need to implement that heuristic and introduce userspace influence over
oom kill selection now rather than later because its implementation
changes how this patchset is implemented.
I can implement these changes, if preferred, on top of the current
patchset, but I do not believe we want inconsistencies between kernel
versions that introduce user visible changes for the sole reason that this
current implementation is incomplete and unfair. We can implement and
introduce it once without behavior changing later because the core
heuristic has necessarily changed.
* Re: [PATCH v13 0/7] cgroup-aware OOM killer
2018-01-10 0:57 ` David Rientjes
@ 2018-01-10 13:11 ` Roman Gushchin
2018-01-10 19:33 ` Andrew Morton
2018-01-10 20:50 ` David Rientjes
0 siblings, 2 replies; 51+ messages in thread
From: Roman Gushchin @ 2018-01-10 13:11 UTC (permalink / raw)
To: David Rientjes
Cc: Andrew Morton, linux-mm, Michal Hocko, Vladimir Davydov,
Johannes Weiner, Tetsuo Handa, Tejun Heo, kernel-team, cgroups,
linux-doc, linux-kernel, linux-mm
Hello, David!
On Tue, Jan 09, 2018 at 04:57:53PM -0800, David Rientjes wrote:
> On Thu, 30 Nov 2017, Andrew Morton wrote:
>
> > > This patchset makes the OOM killer cgroup-aware.
> >
> > Thanks, I'll grab these.
> >
> > There has been controversy over this patchset, to say the least. I
> > can't say that I followed it closely! Could those who still have
> > reservations please summarise their concerns and hopefully suggest a
> > way forward?
> >
>
> Yes, I'll summarize what my concerns have been in the past and what they
> are wrt the patchset as it stands in -mm. None of them originate from my
> current usecase or anticipated future usecase of the oom killer for
> system-wide or memcg-constrained oom conditions. They are based purely on
> the patchset's use of an incomplete and unfair heuristic for deciding
> which cgroup to target.
>
> I'll also suggest simple changes to the patchset, which I have in the
> past, that can be made to address all of these concerns.
>
> 1. The unfair comparison of the root mem cgroup vs leaf mem cgroups
>
> The patchset uses two different heuristics to compare root and leaf mem
> cgroups and scores them based on number of pages. For the root mem
> cgroup, it totals the /proc/pid/oom_score of all processes attached:
> that's based on rss, swap, pgtables, and, most importantly, oom_score_adj.
> For leaf mem cgroups, it's based on that memcg's anonymous, unevictable,
> unreclaimable slab, kernel stack, and swap counters. These can be wildly
> different independent of /proc/pid/oom_score_adj, but the most obvious
> unfairness comes from users who tune oom_score_adj.
>
> An example: start a process that faults 1GB of anonymous memory and leave
> it attached to the root mem cgroup. Start six more processes that each
> fault 1GB of anonymous memory and attached them to a leaf mem cgroup. Set
> all processes to have /proc/pid/oom_score_adj of 1000. System oom kill
> will always kill the 1GB process attached to the root mem cgroup. It's
> because oom_badness() relies on /proc/pid/oom_score_adj, which is used to
> evaluate the root mem cgroup, and leaf mem cgroups completely disregard
> it.
>
> In this example, the leaf mem cgroup's score is 1,573,044, the number of
> pages for the 6GB of faulted memory. The root mem cgroup's score is
> 12,652,907, eight times larger even though its usage is six times smaller.
>
> This is caused by the patchset disregarding oom_score_adj entirely for
> leaf mem cgroups and relying on it heavily for the root mem cgroup. It's
> the complete opposite result of what the cgroup aware oom killer
> advertises.
>
> It also works the other way, if a large memory hog is attached to the root
> mem cgroup but has a negative oom_score_adj it is never killed and random
> processes are nuked solely because they happened to be attached to a leaf
> mem cgroup. This behavior wrt oom_score_adj is completely undocumented,
> so I can't presume that it is either known nor tested.
>
> Solution: compare the root mem cgroup and leaf mem cgroups equally with
> the same criteria by doing hierarchical accounting of usage and
> subtracting from total system usage to find root usage.
I find this problem quite minor, because I haven't seen any practical problems
caused by accounting of the root cgroup memory.
If it's a serious problem for you, it can be solved without switching to
hierarchical accounting: it's possible to sum up all leaf cgroup stats and
subtract them from global values. So, it can be a relatively small enhancement
on top of the current mm tree. This has nothing to do with the global victim
selection approach.
>
> 2. Evading the oom killer by attaching processes to child cgroups
>
> Any cgroup on the system can attach all their processes to individual
> child cgroups. This is functionally the same as doing
>
> for i in $(cat cgroup.procs); do mkdir $i; echo $i > $i/cgroup.procs; done
>
> without the no internal process constraint introduced with cgroup v2. All
> child cgroups are evaluated based on their own usage: all anon,
> unevictable, and unreclaimable slab as described previously. It requires
> an individual cgroup to be the single largest consumer to be targeted by
> the oom killer.
>
> An example: allow users to manage two different mem cgroup hierarchies
> limited to 100GB each. User A uses 10GB of memory and user B uses 90GB of
> memory in their respective hierarchies. On a system oom condition, we'd
> expect at least one process from user B's hierarchy would always be oom
> killed with the cgroup aware oom killer. In fact, the changelog
> explicitly states it solves an issue where "1) There is no fairness
> between containers. A small container with few large processes will be
> chosen over a large one with huge number of small processes."
>
> The opposite becomes true, however, if user B creates child cgroups and
> distributes its processes such that each child cgroup's usage never
> exceeds 10GB of memory. This can either be done intentionally to
> purposefully have a low cgroup memory footprint to evade the oom killer or
> unintentionally with cgroup v2 to allow those individual processes to be
> constrained by other cgroups in a single hierarchy model. User A, using
> 10% of his memory limit, is always oom killed instead of user B, using 90%
> of his memory limit.
>
> Others have commented its still possible to do this with a per-process
> model if users split their processes into many subprocesses with small
> memory footprints.
>
> Solution: comparing cgroups must be done hierarchically. Neither user A
> nor user B can evade the oom killer because targeting is done based on the
> total hierarchical usage rather than individual cgroups in their
> hierarchies.
We've discussed this a lot.
The hierarchical approach has its own issues, which we've discussed during
previous iterations of the patchset. If you know how to address them
(I've no idea), please go on and suggest your version.
>
> 3. Userspace has zero control over oom kill selection in leaf mem cgroups
>
> Unlike using /proc/pid/oom_score_adj to bias or prefer certain processes
> from the oom killer, the cgroup aware oom killer does not provide any
> solution for the user to protect leaf mem cgroups. This is a result of
> leaf mem cgroups being evaluated based on their anon, unevictable, and
> unreclaimable slab usage and disregarding any user tunable.
>
> Absent the cgroup aware oom killer, users have the ability to strongly
> prefer a process is oom killed (/proc/pid/oom_score_adj = 1000) or
> strongly bias against a process (/proc/pid/oom_score_adj = -999).
>
> An example: a process knows its going to use a lot of memory, so it sets
> /proc/self/oom_score_adj to 1000. It wants to be killed first to avoid
> distrupting any other process. If it's attached to the root mem cgroup,
> it will be oom killed. If it's attached to a leaf mem cgroup by an admin
> outside its control, it will never be oom killed unless that cgroup's
> usage is the largest single cgroup usage on the system. The reverse also
> is true for processes that the admin does not want to be oom killed: set
> /proc/pid/oom_score_adj to -999, but it will *always* be oom killed if its
> cgroup has the highest usage on the system.
>
> The result is that both admins and users have lost all control over which
> processes are oom killed. They are left with only one alternative: set
> /proc/pid/oom_score_adj to -1000 to completely disable a process from oom
> kill. It doesn't address the issue at all for memcg-constrained oom
> conditions since no processes are killable anymore, and risks panicking
> the system if it is the only process left on the system. A process
> preferring that it is first in line for oom kill simply cannot volunteer
> anymore.
>
> Solution: allow users and admins to control oom kill selection by
> introducing a memory.oom_score_adj to affect the oom score of that mem
> cgroup, exactly the same as /proc/pid/oom_score_adj affects the oom score
> of a process.
The per-process oom_score_adj interface is not the nicest one, and I'm not
sure we want to replicate it at the cgroup level as is. If you have an idea
of how it should look, please propose a patch; otherwise it's hard to
discuss it without the code.
>
>
> I proposed a solution in
> https://marc.info/?l=linux-kernel&m=150956897302725, which was never
> responded to, for all of these issues. The idea is to do hierarchical
> accounting of mem cgroup hierarchies so that the hierarchy is traversed
> comparing total usage at each level to select target cgroups. Admins and
> users can use memory.oom_score_adj to influence that decisionmaking at
> each level.
>
> This solves #1 because mem cgroups can be compared based on the same
> classes of memory and the root mem cgroup's usage can be fairly compared
> by subtracting top-level mem cgroup usage from system usage. All of the
> criteria used to evaluate a leaf mem cgroup has a reasonable system-wide
> counterpart that can be used to do the simple subtraction.
>
> This solves #2 because evaluation is done hierarchically so that
> distributing processes over a set of child cgroups either intentionally
> or unintentionally no longer evades the oom killer. Total usage is always
> accounted to the parent and there is no escaping this criteria for users.
>
> This solves #3 because it allows admins to protect important processes in
> cgroups that are supposed to use, for example, 75% of system memory
> without it unconditionally being selected for oom kill but still oom kill
> if it exceeds a certain threshold. In this sense, the cgroup aware oom
> killer, as currently implemented, is selling mem cgroups short by
> requiring the user to accept that the important process will be oom killed
> iff it uses mem cgroups and isn't attached to root. It also allows users
> to actually volunteer to be oom killed first without majority usage.
>
> It has come up time and time again that this support can be introduced on
> top of the cgroup oom killer as implemented. It simply cannot. For
> admins and users to have control over decisionmaking, it needs a
> oom_score_adj type tunable that cannot change semantics from kernel
> version to kernel version and without polluting the mem cgroup filesystem.
> That, in my suggestion, is an adjustment on the amount of total
> hierarchical usage of each mem cgroup at each level of the hierarchy.
> That requires that the heuristic uses hierarchical usage rather than
> considering each cgroup as independent consumers as it does today. We
> need to implement that heuristic and introduce userspace influence over
> oom kill selection now rather than later because its implementation
> changes how this patchset is implemented.
>
> I can implement these changes, if preferred, on top of the current
> patchset, but I do not believe we want inconsistencies between kernel
> versions that introduce user visible changes for the sole reason that this
> current implementation is incomplete and unfair. We can implement and
> introduce it once without behavior changing later because the core
> heuristic has necessarily changed.
David, I _had_ hierarchical accounting implemented in one of the previous
versions of this patchset. And there were _reasons_ why we moved away from it.
You can't just ignore them and say "there is a simple solution, which
Roman is not responding to". If you know how to address these issues and
convince everybody that the hierarchical approach is the way to go, please
go on and send your version of the patchset.
Thanks!
Roman
* Re: [PATCH v13 0/7] cgroup-aware OOM killer
2018-01-10 13:11 ` Roman Gushchin
@ 2018-01-10 19:33 ` Andrew Morton
2018-01-11 9:08 ` Michal Hocko
2018-01-13 17:14 ` Johannes Weiner
2018-01-10 20:50 ` David Rientjes
1 sibling, 2 replies; 51+ messages in thread
From: Andrew Morton @ 2018-01-10 19:33 UTC (permalink / raw)
To: Roman Gushchin
Cc: David Rientjes, linux-mm, Michal Hocko, Vladimir Davydov,
Johannes Weiner, Tetsuo Handa, Tejun Heo, kernel-team, cgroups,
linux-doc, linux-kernel, linux-mm
On Wed, 10 Jan 2018 05:11:44 -0800 Roman Gushchin <guro@fb.com> wrote:
> Hello, David!
>
> On Tue, Jan 09, 2018 at 04:57:53PM -0800, David Rientjes wrote:
> > On Thu, 30 Nov 2017, Andrew Morton wrote:
> >
> > > > This patchset makes the OOM killer cgroup-aware.
> > >
> > > Thanks, I'll grab these.
> > >
> > > There has been controversy over this patchset, to say the least. I
> > > can't say that I followed it closely! Could those who still have
> > > reservations please summarise their concerns and hopefully suggest a
> > > way forward?
> > >
> >
> > Yes, I'll summarize what my concerns have been in the past and what they
> > are wrt the patchset as it stands in -mm. None of them originate from my
> > current usecase or anticipated future usecase of the oom killer for
> > system-wide or memcg-constrained oom conditions. They are based purely on
> > the patchset's use of an incomplete and unfair heuristic for deciding
> > which cgroup to target.
> >
> > I'll also suggest simple changes to the patchset, which I have in the
> > past, that can be made to address all of these concerns.
> >
> > 1. The unfair comparison of the root mem cgroup vs leaf mem cgroups
> >
> > The patchset uses two different heuristics to compare root and leaf mem
> > cgroups and scores them based on number of pages. For the root mem
> > cgroup, it totals the /proc/pid/oom_score of all processes attached:
> > that's based on rss, swap, pgtables, and, most importantly, oom_score_adj.
> > For leaf mem cgroups, it's based on that memcg's anonymous, unevictable,
> > unreclaimable slab, kernel stack, and swap counters. These can be wildly
> > different independent of /proc/pid/oom_score_adj, but the most obvious
> > unfairness comes from users who tune oom_score_adj.
> >
> > An example: start a process that faults 1GB of anonymous memory and leave
> > it attached to the root mem cgroup. Start six more processes that each
> > fault 1GB of anonymous memory and attach them to a leaf mem cgroup. Set
> > all processes to have /proc/pid/oom_score_adj of 1000. System oom kill
> > will always kill the 1GB process attached to the root mem cgroup. It's
> > because oom_badness() relies on /proc/pid/oom_score_adj, which is used to
> > evaluate the root mem cgroup, and leaf mem cgroups completely disregard
> > it.
> >
> > In this example, the leaf mem cgroup's score is 1,573,044, the number of
> > pages for the 6GB of faulted memory. The root mem cgroup's score is
> > 12,652,907, eight times larger even though its usage is six times smaller.
> >
> > This is caused by the patchset disregarding oom_score_adj entirely for
> > leaf mem cgroups and relying on it heavily for the root mem cgroup. It's
> > the complete opposite result of what the cgroup aware oom killer
> > advertises.
> >
> > It also works the other way, if a large memory hog is attached to the root
> > mem cgroup but has a negative oom_score_adj it is never killed and random
> > processes are nuked solely because they happened to be attached to a leaf
> > mem cgroup. This behavior wrt oom_score_adj is completely undocumented,
> > so I can't presume that it is either known or tested.
> >
> > Solution: compare the root mem cgroup and leaf mem cgroups equally with
> > the same criteria by doing hierarchical accounting of usage and
> > subtracting from total system usage to find root usage.
>
> I find this problem quite minor, because I haven't seen any practical problems
> caused by accounting of the root cgroup memory.
> If it's a serious problem for you, it can be solved without switching to
> hierarchical accounting: it's possible to sum up all leaf cgroup stats and
> subtract them from the global values. So, it can be a relatively small
> enhancement on top of the current mm tree. This has nothing to do with the
> global victim selection approach.
It sounds like a significant shortcoming to me - the oom-killing
decisions which David describes are clearly incorrect?
If this can be fixed against the -mm patchset with a "relatively small
enhancement" then please let's get that done so it can be reviewed and
tested.
> >
> > 2. Evading the oom killer by attaching processes to child cgroups
> >
> > Any cgroup on the system can attach all their processes to individual
> > child cgroups. This is functionally the same as doing
> >
> > for i in $(cat cgroup.procs); do mkdir $i; echo $i > $i/cgroup.procs; done
> >
> > without the no internal process constraint introduced with cgroup v2. All
> > child cgroups are evaluated based on their own usage: all anon,
> > unevictable, and unreclaimable slab as described previously. It requires
> > an individual cgroup to be the single largest consumer to be targeted by
> > the oom killer.
> >
> > An example: allow users to manage two different mem cgroup hierarchies
> > limited to 100GB each. User A uses 10GB of memory and user B uses 90GB of
> > memory in their respective hierarchies. On a system oom condition, we'd
> > expect at least one process from user B's hierarchy would always be oom
> > killed with the cgroup aware oom killer. In fact, the changelog
> > explicitly states it solves an issue where "1) There is no fairness
> > between containers. A small container with few large processes will be
> > chosen over a large one with huge number of small processes."
> >
> > The opposite becomes true, however, if user B creates child cgroups and
> > distributes its processes such that each child cgroup's usage never
> > exceeds 10GB of memory. This can either be done intentionally to
> > purposefully have a low cgroup memory footprint to evade the oom killer or
> > unintentionally with cgroup v2 to allow those individual processes to be
> > constrained by other cgroups in a single hierarchy model. User A, using
> > 10% of his memory limit, is always oom killed instead of user B, using 90%
> > of his memory limit.
> >
> > Others have commented it's still possible to do this with a per-process
> > model if users split their processes into many subprocesses with small
> > memory footprints.
> >
> > Solution: comparing cgroups must be done hierarchically. Neither user A
> > nor user B can evade the oom killer because targeting is done based on the
> > total hierarchical usage rather than individual cgroups in their
> > hierarchies.
>
> We've discussed this a lot.
> The hierarchical approach has its own issues, which we've discussed during
> previous iterations of the patchset. If you know how to address them
> (I've no idea), please, go on and suggest your version.
Well, if a hierarchical approach isn't a workable fix for the problem
which David has identified then what *is* the fix?
> >
> > 3. Userspace has zero control over oom kill selection in leaf mem cgroups
> >
> > Unlike using /proc/pid/oom_score_adj to bias or prefer certain processes
> > from the oom killer, the cgroup aware oom killer does not provide any
> > solution for the user to protect leaf mem cgroups. This is a result of
> > leaf mem cgroups being evaluated based on their anon, unevictable, and
> > unreclaimable slab usage and disregarding any user tunable.
> >
> > Absent the cgroup aware oom killer, users have the ability to strongly
> > prefer a process is oom killed (/proc/pid/oom_score_adj = 1000) or
> > strongly bias against a process (/proc/pid/oom_score_adj = -999).
> >
> > An example: a process knows it's going to use a lot of memory, so it sets
> > /proc/self/oom_score_adj to 1000. It wants to be killed first to avoid
> > disrupting any other process. If it's attached to the root mem cgroup,
> > it will be oom killed. If it's attached to a leaf mem cgroup by an admin
> > outside its control, it will never be oom killed unless that cgroup's
> > usage is the largest single cgroup usage on the system. The reverse also
> > is true for processes that the admin does not want to be oom killed: set
> > /proc/pid/oom_score_adj to -999, but it will *always* be oom killed if its
> > cgroup has the highest usage on the system.
> >
> > The result is that both admins and users have lost all control over which
> > processes are oom killed. They are left with only one alternative: set
> > /proc/pid/oom_score_adj to -1000 to completely disable a process from oom
> > kill. It doesn't address the issue at all for memcg-constrained oom
> > conditions since no processes are killable anymore, and risks panicking
> > the system if it is the only process left on the system. A process
> > preferring that it is first in line for oom kill simply cannot volunteer
> > anymore.
> >
> > Solution: allow users and admins to control oom kill selection by
> > introducing a memory.oom_score_adj to affect the oom score of that mem
> > cgroup, exactly the same as /proc/pid/oom_score_adj affects the oom score
> > of a process.
>
> The per-process oom_score_adj interface is not the nicest one, and I'm not
> sure we want to replicate it at the cgroup level as is. If you have an idea of
> what it should look like, please propose a patch; otherwise it's hard to discuss
> it without the code.
It does make sense to have some form of per-cgroup tunability. Why is
the oom_score_adj approach inappropriate and what would be better? How
hard is it to graft such a thing onto the -mm patchset?
> >
> >
> > I proposed a solution in
> > https://marc.info/?l=linux-kernel&m=150956897302725, which was never
> > responded to, for all of these issues. The idea is to do hierarchical
> > accounting of mem cgroup hierarchies so that the hierarchy is traversed
> > comparing total usage at each level to select target cgroups. Admins and
> > users can use memory.oom_score_adj to influence that decisionmaking at
> > each level.
> >
> > This solves #1 because mem cgroups can be compared based on the same
> > classes of memory and the root mem cgroup's usage can be fairly compared
> > by subtracting top-level mem cgroup usage from system usage. All of the
> > criteria used to evaluate a leaf mem cgroup have reasonable system-wide
> > counterparts that can be used to do the simple subtraction.
> >
> > This solves #2 because evaluation is done hierarchically so that
> > distributing processes over a set of child cgroups either intentionally
> > or unintentionally no longer evades the oom killer. Total usage is always
> > accounted to the parent and there is no escaping this criteria for users.
> >
> > This solves #3 because it allows admins to protect important processes in
> > cgroups that are supposed to use, for example, 75% of system memory
> > without it unconditionally being selected for oom kill but still oom kill
> > if it exceeds a certain threshold. In this sense, the cgroup aware oom
> > killer, as currently implemented, is selling mem cgroups short by
> > requiring the user to accept that the important process will be oom killed
> > iff it uses mem cgroups and isn't attached to root. It also allows users
> > to actually volunteer to be oom killed first without majority usage.
> >
> > It has come up time and time again that this support can be introduced on
> > top of the cgroup oom killer as implemented. It simply cannot. For
> > admins and users to have control over decisionmaking, it needs an
> > oom_score_adj-type tunable that cannot change semantics from kernel
> > version to kernel version and without polluting the mem cgroup filesystem.
> > That, in my suggestion, is an adjustment on the amount of total
> > hierarchical usage of each mem cgroup at each level of the hierarchy.
> > That requires that the heuristic uses hierarchical usage rather than
> > considering each cgroup as independent consumers as it does today. We
> > need to implement that heuristic and introduce userspace influence over
> > oom kill selection now rather than later because its implementation
> > changes how this patchset is implemented.
> >
> > I can implement these changes, if preferred, on top of the current
> > patchset, but I do not believe we want inconsistencies between kernel
> > versions that introduce user visible changes for the sole reason that this
> > current implementation is incomplete and unfair. We can implement and
> > introduce it once without behavior changing later because the core
> > heuristic has necessarily changed.
>
> David, I _had_ hierarchical accounting implemented in one of the previous
> versions of this patchset. And there were _reasons_ why we moved away from it.
Can you please summarize those issues for my understanding?
> You can't just ignore them and claim that "there is a simple solution which
> Roman is not responding to". If you know how to address these issues and
> convince everybody that the hierarchical approach is the way to go, please
> go on and send your version of the patchset.
>
> Thanks!
>
> Roman
* Re: [PATCH v13 0/7] cgroup-aware OOM killer
2018-01-10 19:33 ` Andrew Morton
@ 2018-01-11 9:08 ` Michal Hocko
2018-01-11 13:18 ` Roman Gushchin
2018-01-11 21:57 ` David Rientjes
2018-01-13 17:14 ` Johannes Weiner
1 sibling, 2 replies; 51+ messages in thread
From: Michal Hocko @ 2018-01-11 9:08 UTC (permalink / raw)
To: Andrew Morton
Cc: Roman Gushchin, David Rientjes, linux-mm, Vladimir Davydov,
Johannes Weiner, Tetsuo Handa, Tejun Heo, kernel-team, cgroups,
linux-doc, linux-kernel, linux-mm
On Wed 10-01-18 11:33:45, Andrew Morton wrote:
> On Wed, 10 Jan 2018 05:11:44 -0800 Roman Gushchin <guro@fb.com> wrote:
>
> > Hello, David!
> >
> > On Tue, Jan 09, 2018 at 04:57:53PM -0800, David Rientjes wrote:
> > > On Thu, 30 Nov 2017, Andrew Morton wrote:
> > >
> > > > > This patchset makes the OOM killer cgroup-aware.
> > > >
> > > > Thanks, I'll grab these.
> > > >
> > > > There has been controversy over this patchset, to say the least. I
> > > > can't say that I followed it closely! Could those who still have
> > > > reservations please summarise their concerns and hopefully suggest a
> > > > way forward?
> > > >
> > >
> > > Yes, I'll summarize what my concerns have been in the past and what they
> > > are wrt the patchset as it stands in -mm. None of them originate from my
> > > current usecase or anticipated future usecase of the oom killer for
> > > system-wide or memcg-constrained oom conditions. They are based purely on
> > > the patchset's use of an incomplete and unfair heuristic for deciding
> > > which cgroup to target.
> > >
> > > I'll also suggest simple changes to the patchset, which I have in the
> > > past, that can be made to address all of these concerns.
> > >
> > > 1. The unfair comparison of the root mem cgroup vs leaf mem cgroups
> > >
> > > The patchset uses two different heuristics to compare root and leaf mem
> > > cgroups and scores them based on number of pages. For the root mem
> > > cgroup, it totals the /proc/pid/oom_score of all processes attached:
> > > that's based on rss, swap, pgtables, and, most importantly, oom_score_adj.
> > > For leaf mem cgroups, it's based on that memcg's anonymous, unevictable,
> > > unreclaimable slab, kernel stack, and swap counters. These can be wildly
> > > different independent of /proc/pid/oom_score_adj, but the most obvious
> > > unfairness comes from users who tune oom_score_adj.
> > >
> > > An example: start a process that faults 1GB of anonymous memory and leave
> > > it attached to the root mem cgroup. Start six more processes that each
> > > fault 1GB of anonymous memory and attach them to a leaf mem cgroup. Set
> > > all processes to have /proc/pid/oom_score_adj of 1000. System oom kill
> > > will always kill the 1GB process attached to the root mem cgroup. It's
> > > because oom_badness() relies on /proc/pid/oom_score_adj, which is used to
> > > evaluate the root mem cgroup, and leaf mem cgroups completely disregard
> > > it.
> > >
> > > In this example, the leaf mem cgroup's score is 1,573,044, the number of
> > > pages for the 6GB of faulted memory. The root mem cgroup's score is
> > > 12,652,907, eight times larger even though its usage is six times smaller.
> > >
> > > This is caused by the patchset disregarding oom_score_adj entirely for
> > > leaf mem cgroups and relying on it heavily for the root mem cgroup. It's
> > > the complete opposite result of what the cgroup aware oom killer
> > > advertises.
> > >
> > > It also works the other way, if a large memory hog is attached to the root
> > > mem cgroup but has a negative oom_score_adj it is never killed and random
> > > processes are nuked solely because they happened to be attached to a leaf
> > > mem cgroup. This behavior wrt oom_score_adj is completely undocumented,
> > > so I can't presume that it is either known or tested.
> > >
> > > Solution: compare the root mem cgroup and leaf mem cgroups equally with
> > > the same criteria by doing hierarchical accounting of usage and
> > > subtracting from total system usage to find root usage.
> >
> > I find this problem quite minor, because I haven't seen any practical problems
> > caused by accounting of the root cgroup memory.
> > If it's a serious problem for you, it can be solved without switching to
> > hierarchical accounting: it's possible to sum up all leaf cgroup stats and
> > subtract them from the global values. So, it can be a relatively small
> > enhancement on top of the current mm tree. This has nothing to do with the
> > global victim selection approach.
>
> It sounds like a significant shortcoming to me - the oom-killing
> decisions which David describes are clearly incorrect?
Well, I would rather look at that from the use case POV. The primary
users of the new OOM killer functionality are containers. I might be
wrong, but I _assume_ that the root cgroup will only contain the basic system
infrastructure there and all the workload will run in containers (aka
cgroups). The only oom tuning inside the root cgroup would be to disable
oom for some of those processes. The current implementation should work
reasonably well for that configuration.
> If this can be fixed against the -mm patchset with a "relatively small
> enhancement" then please let's get that done so it can be reviewed and
> tested.
The root memcg will always be special, and I would rather base the
semantics on usecases than implement something based on
theoretical examples.
> > > 2. Evading the oom killer by attaching processes to child cgroups
> > >
> > > Any cgroup on the system can attach all their processes to individual
> > > child cgroups. This is functionally the same as doing
> > >
> > > for i in $(cat cgroup.procs); do mkdir $i; echo $i > $i/cgroup.procs; done
> > >
> > > without the no internal process constraint introduced with cgroup v2. All
> > > child cgroups are evaluated based on their own usage: all anon,
> > > unevictable, and unreclaimable slab as described previously. It requires
> > > an individual cgroup to be the single largest consumer to be targeted by
> > > the oom killer.
> > >
> > > An example: allow users to manage two different mem cgroup hierarchies
> > > limited to 100GB each. User A uses 10GB of memory and user B uses 90GB of
> > > memory in their respective hierarchies. On a system oom condition, we'd
> > > expect at least one process from user B's hierarchy would always be oom
> > > killed with the cgroup aware oom killer. In fact, the changelog
> > > explicitly states it solves an issue where "1) There is no fairness
> > > between containers. A small container with few large processes will be
> > > chosen over a large one with huge number of small processes."
> > >
> > > The opposite becomes true, however, if user B creates child cgroups and
> > > distributes its processes such that each child cgroup's usage never
> > > exceeds 10GB of memory. This can either be done intentionally to
> > > purposefully have a low cgroup memory footprint to evade the oom killer or
> > > unintentionally with cgroup v2 to allow those individual processes to be
> > > constrained by other cgroups in a single hierarchy model. User A, using
> > > 10% of his memory limit, is always oom killed instead of user B, using 90%
> > > of his memory limit.
> > >
> > > Others have commented it's still possible to do this with a per-process
> > > model if users split their processes into many subprocesses with small
> > > memory footprints.
> > >
> > > Solution: comparing cgroups must be done hierarchically. Neither user A
> > > nor user B can evade the oom killer because targeting is done based on the
> > > total hierarchical usage rather than individual cgroups in their
> > > hierarchies.
> >
> > We've discussed this a lot.
> > > The hierarchical approach has its own issues, which we've discussed during
> > previous iterations of the patchset. If you know how to address them
> > (I've no idea), please, go on and suggest your version.
>
> Well, if a hierarchical approach isn't a workable fix for the problem
> which David has identified then what *is* the fix?
The hierarchical approach basically hardcodes the oom decision into the
hierarchy structure, and that is simply a no-go because it would turn
into a massive configuration PITA (more on that below). I consider the above
example rather artificial to be honest. Anyway, if we _really_ have to
address it in the future we can do that by providing a mechanism to
prioritize cgroups. It seems that this is required for some usecases
anyway.
> > > 3. Userspace has zero control over oom kill selection in leaf mem cgroups
> > >
> > > Unlike using /proc/pid/oom_score_adj to bias or prefer certain processes
> > > from the oom killer, the cgroup aware oom killer does not provide any
> > > solution for the user to protect leaf mem cgroups. This is a result of
> > > leaf mem cgroups being evaluated based on their anon, unevictable, and
> > > unreclaimable slab usage and disregarding any user tunable.
> > >
> > > Absent the cgroup aware oom killer, users have the ability to strongly
> > > prefer a process is oom killed (/proc/pid/oom_score_adj = 1000) or
> > > strongly bias against a process (/proc/pid/oom_score_adj = -999).
> > >
> > > An example: a process knows it's going to use a lot of memory, so it sets
> > > /proc/self/oom_score_adj to 1000. It wants to be killed first to avoid
> > > disrupting any other process. If it's attached to the root mem cgroup,
> > > it will be oom killed. If it's attached to a leaf mem cgroup by an admin
> > > outside its control, it will never be oom killed unless that cgroup's
> > > usage is the largest single cgroup usage on the system. The reverse also
> > > is true for processes that the admin does not want to be oom killed: set
> > > /proc/pid/oom_score_adj to -999, but it will *always* be oom killed if its
> > > cgroup has the highest usage on the system.
> > >
> > > The result is that both admins and users have lost all control over which
> > > processes are oom killed. They are left with only one alternative: set
> > > /proc/pid/oom_score_adj to -1000 to completely disable a process from oom
> > > kill. It doesn't address the issue at all for memcg-constrained oom
> > > conditions since no processes are killable anymore, and risks panicking
> > > the system if it is the only process left on the system. A process
> > > preferring that it is first in line for oom kill simply cannot volunteer
> > > anymore.
> > >
> > > Solution: allow users and admins to control oom kill selection by
> > > introducing a memory.oom_score_adj to affect the oom score of that mem
> > > cgroup, exactly the same as /proc/pid/oom_score_adj affects the oom score
> > > of a process.
> >
> > The per-process oom_score_adj interface is not the nicest one, and I'm not
> > sure we want to replicate it at the cgroup level as is. If you have an idea of
> > what it should look like, please propose a patch; otherwise it's hard to discuss
> > it without the code.
>
> It does make sense to have some form of per-cgroup tunability. Why is
> the oom_score_adj approach inappropriate and what would be better?
oom_score_adj is basically unusable for any fine tuning on the process
level for most setups except for very specialized ones. The only
reasonable usage I've seen so far was to disable OOM killer for a
process or make it a prime candidate. Using the same limited concept for
cgroups sounds like repeating the same error to me.
> How hard is it to graft such a thing onto the -mm patchset?
I think this should be thought through very well before we add another
tuning. Moreover the current usecase doesn't seem to require it so I am
not really sure why we should implement something right away and later
suffer from API mistakes.
> > > I proposed a solution in
> > > https://marc.info/?l=linux-kernel&m=150956897302725, which was never
> > > responded to, for all of these issues. The idea is to do hierarchical
> > > accounting of mem cgroup hierarchies so that the hierarchy is traversed
> > > comparing total usage at each level to select target cgroups. Admins and
> > > users can use memory.oom_score_adj to influence that decisionmaking at
> > > each level.
> > >
> > > This solves #1 because mem cgroups can be compared based on the same
> > > classes of memory and the root mem cgroup's usage can be fairly compared
> > > by subtracting top-level mem cgroup usage from system usage. All of the
> > > criteria used to evaluate a leaf mem cgroup have reasonable system-wide
> > > counterparts that can be used to do the simple subtraction.
> > >
> > > This solves #2 because evaluation is done hierarchically so that
> > > distributing processes over a set of child cgroups either intentionally
> > > or unintentionally no longer evades the oom killer. Total usage is always
> > > accounted to the parent and there is no escaping this criteria for users.
> > >
> > > This solves #3 because it allows admins to protect important processes in
> > > cgroups that are supposed to use, for example, 75% of system memory
> > > without it unconditionally being selected for oom kill but still oom kill
> > > if it exceeds a certain threshold. In this sense, the cgroup aware oom
> > > killer, as currently implemented, is selling mem cgroups short by
> > > requiring the user to accept that the important process will be oom killed
> > > iff it uses mem cgroups and isn't attached to root. It also allows users
> > > to actually volunteer to be oom killed first without majority usage.
> > >
> > > It has come up time and time again that this support can be introduced on
> > > top of the cgroup oom killer as implemented. It simply cannot. For
> > > admins and users to have control over decisionmaking, it needs an
> > > oom_score_adj-type tunable that cannot change semantics from kernel
> > > version to kernel version and without polluting the mem cgroup filesystem.
> > > That, in my suggestion, is an adjustment on the amount of total
> > > hierarchical usage of each mem cgroup at each level of the hierarchy.
> > > That requires that the heuristic uses hierarchical usage rather than
> > > considering each cgroup as independent consumers as it does today. We
> > > need to implement that heuristic and introduce userspace influence over
> > > oom kill selection now rather than later because its implementation
> > > changes how this patchset is implemented.
> > >
> > > I can implement these changes, if preferred, on top of the current
> > > patchset, but I do not believe we want inconsistencies between kernel
> > > versions that introduce user visible changes for the sole reason that this
> > > current implementation is incomplete and unfair. We can implement and
> > > introduce it once without behavior changing later because the core
> > > heuristic has necessarily changed.
> >
> > David, I _had_ hierarchical accounting implemented in one of the previous
> > versions of this patchset. And there were _reasons_ why we moved away from it.
>
> Can you please summarize those issues for my understanding?
Because it makes the oom decision directly hardwired to the hierarchy
structure. Just take a simple example of the cgroup structure which
reflects a higher level organization
        root
       /  |  \
 admins   |   teachers
      students
Now your students group will most likely be the largest one. Why should we
kill tasks/cgroups from that cgroup just because it is cumulatively the
largest one? It might have been one of the teachers blowing up the
memory usage.
Another example is when you need a mid layer cgroups for other
controllers to better control resources.
         root
        /    \
  cpuset1    cpuset2
   /  \     /  |  \
  A    B   C   D   E
You really do not want to select cpuset2 just because it has more
subgroups and potentially larger cumulative usage. The hierarchical
accounting works _only_ if higher-level cgroups are semantically
_comparable_, which might be true for some workloads, but this is by no
means true in general.
That all being said, I can see further improvements to happen on top of
the current work but I also think that the current implementation works
for the usecase which many users can use without those improvements.
--
Michal Hocko
SUSE Labs
* Re: [PATCH v13 0/7] cgroup-aware OOM killer
2018-01-11 9:08 ` Michal Hocko
@ 2018-01-11 13:18 ` Roman Gushchin
2018-01-12 22:03 ` David Rientjes
2018-01-11 21:57 ` David Rientjes
1 sibling, 1 reply; 51+ messages in thread
From: Roman Gushchin @ 2018-01-11 13:18 UTC (permalink / raw)
To: Michal Hocko
Cc: Andrew Morton, David Rientjes, linux-mm, Vladimir Davydov,
Johannes Weiner, Tetsuo Handa, Tejun Heo, kernel-team, cgroups,
linux-doc, linux-kernel, linux-mm
On Thu, Jan 11, 2018 at 10:08:09AM +0100, Michal Hocko wrote:
> On Wed 10-01-18 11:33:45, Andrew Morton wrote:
> > On Wed, 10 Jan 2018 05:11:44 -0800 Roman Gushchin <guro@fb.com> wrote:
> >
> > > The per-process oom_score_adj interface is not the nicest one, and I'm not
> > > sure we want to replicate it at the cgroup level as is. If you have an idea of
> > > what it should look like, please propose a patch; otherwise it's hard to discuss
> > > it without the code.
> >
> > It does make sense to have some form of per-cgroup tunability. Why is
> > the oom_score_adj approach inappropriate and what would be better?
>
> oom_score_adj is basically unusable for any fine tuning on the process
> level for most setups except for very specialized ones. The only
> reasonable usage I've seen so far was to disable OOM killer for a
> process or make it a prime candidate. Using the same limited concept for
> cgroups sounds like repeating the same error to me.
My 2c here: the current oom_score_adj semantics are really non-trivial for
cgroups. It defines an addition/subtraction in 1/1000s of total memory or
of the OOMing cgroup's memory limit, depending on the scope of the OOM event.
This is totally out of the user's control, because they may not even know
the limit of an upper cgroup in the hierarchy. I've provided
an example earlier, in which one or another of two processes in the
same cgroup can be killed, depending on the scope of the OOM event.
>
> > How hard is it to graft such a thing onto the -mm patchset?
>
> I think this should be thought through very well before we add another
> tuning. Moreover the current usecase doesn't seem to require it so I am
> not really sure why we should implement something right away and later
> suffer from API mistakes.
>
> > > > I proposed a solution in
> > > > https://marc.info/?l=linux-kernel&m=150956897302725, which was never
> > > > responded to, for all of these issues. The idea is to do hierarchical
> > > > accounting of mem cgroup hierarchies so that the hierarchy is traversed
> > > > comparing total usage at each level to select target cgroups. Admins and
> > > > users can use memory.oom_score_adj to influence that decisionmaking at
> > > > each level.
> > > >
> > > > This solves #1 because mem cgroups can be compared based on the same
> > > > classes of memory and the root mem cgroup's usage can be fairly compared
> > > > by subtracting top-level mem cgroup usage from system usage. All of the
> > > > criteria used to evaluate a leaf mem cgroup has a reasonable system-wide
> > > > counterpart that can be used to do the simple subtraction.
> > > >
> > > > This solves #2 because evaluation is done hierarchically so that
> > > > distributing processes over a set of child cgroups either intentionally
> > > > or unintentionally no longer evades the oom killer. Total usage is always
> > > > accounted to the parent and there is no escaping this criteria for users.
> > > >
> > > > This solves #3 because it allows admins to protect important processes in
> > > > cgroups that are supposed to use, for example, 75% of system memory
> > > > without it unconditionally being selected for oom kill but still oom kill
> > > > if it exceeds a certain threshold. In this sense, the cgroup aware oom
> > > > killer, as currently implemented, is selling mem cgroups short by
> > > > requiring the user to accept that the important process will be oom killed
> > > > iff it uses mem cgroups and isn't attached to root. It also allows users
> > > > to actually volunteer to be oom killed first without majority usage.
> > > >
> > > > It has come up time and time again that this support can be introduced on
> > > > top of the cgroup oom killer as implemented. It simply cannot. For
> > > > admins and users to have control over decisionmaking, it needs an
> > > > oom_score_adj-type tunable that cannot change semantics from kernel
> > > > version to kernel version and that does not pollute the mem cgroup filesystem.
> > > > That, in my suggestion, is an adjustment on the amount of total
> > > > hierarchical usage of each mem cgroup at each level of the hierarchy.
> > > > That requires that the heuristic uses hierarchical usage rather than
> > > > considering each cgroup as independent consumers as it does today. We
> > > > need to implement that heuristic and introduce userspace influence over
> > > > oom kill selection now rather than later because its implementation
> > > > changes how this patchset is implemented.
> > > >
> > > > I can implement these changes, if preferred, on top of the current
> > > > patchset, but I do not believe we want inconsistencies between kernel
> > > > versions that introduce user visible changes for the sole reason that this
> > > > current implementation is incomplete and unfair. We can implement and
> > > > introduce it once, without the behavior changing later because the core
> > > > heuristic has necessarily changed.
> > >
> > > David, I _had_ hierarchical accounting implemented in one of the previous
> > > versions of this patchset. And there were _reasons_, why we went away from it.
> >
> > Can you please summarize those issues for my understanding?
>
> Because it makes the oom decision directly hardwired to the hierarchy
> structure. Just take a simple example of the cgroup structure which
> reflects a higher level organization
> root
> / | \
> admins | teachers
> students
>
> Now your students group will most likely be the largest one. Why should we
> kill tasks/cgroups from that cgroup just because it is cumulatively the
> largest one? It might have been one of the teachers blowing up the
> memory usage.
>
> Another example is when you need a mid layer cgroups for other
> controllers to better control resources.
> root
> / \
> cpuset1 cpuset2
> / \ / | \
> A B C D E
>
> You really do not want to select cpuset2 just because it has more
> subgroups and potentially larger cumulative usage. The hierarchical
> accounting works _only_ if higher level cgroups are semantically
> _comparable_, which might be true for some workloads, but by no means is
> this true in general.
>
> That all being said, I can see further improvements to happen on top of
> the current work but I also think that the current implementation works
> for the usecase which many users can use without those improvements.
Thank you, Michal, you wrote exactly what I wanted to write here!
Summarizing all this: following the hierarchy is good when it reflects
the "importance" of a cgroup's memory for the user, and bad otherwise.
In the generic case with a unified hierarchy this is not true, so following
the hierarchy unconditionally is bad.
The current version of the patchset allows common evaluation of cgroups
by setting memory.groupoom to true. The only limitation is that it
also changes the OOM action: all processes belonging to the cgroup will
be killed. If we really want to preserve an option to evaluate cgroups
together without forcing the "kill all processes" action, we can
convert memory.groupoom from a boolean into a multi-value knob:
for instance, "disabled" and "killall". This will allow adding
a third state ("evaluate", for example) later without breaking the API.
Re root cgroup evaluation:
I believe that it's discussed here mostly as an argument towards
hierarchical approach. The current heuristics can definitely be improved,
but it doesn't require changing the whole semantics. For example,
we can ignore oom_score_adj (except -1000) in this particular case,
that will fix David's example.
Thank you!
Roman
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH v13 0/7] cgroup-aware OOM killer
2018-01-11 13:18 ` Roman Gushchin
@ 2018-01-12 22:03 ` David Rientjes
2018-01-15 11:54 ` Michal Hocko
0 siblings, 1 reply; 51+ messages in thread
From: David Rientjes @ 2018-01-12 22:03 UTC (permalink / raw)
To: Roman Gushchin
Cc: Michal Hocko, Andrew Morton, linux-mm, Vladimir Davydov,
Johannes Weiner, Tetsuo Handa, Tejun Heo, kernel-team, cgroups,
linux-doc, linux-kernel, linux-mm
On Thu, 11 Jan 2018, Roman Gushchin wrote:
> Summarizing all this: following the hierarchy is good when it reflects
> the "importance" of a cgroup's memory for the user, and bad otherwise.
> In the generic case with a unified hierarchy this is not true, so following
> the hierarchy unconditionally is bad.
>
> The current version of the patchset allows common evaluation of cgroups
> by setting memory.groupoom to true. The only limitation is that it
> also changes the OOM action: all processes belonging to the cgroup will
> be killed. If we really want to preserve an option to evaluate cgroups
> together without forcing the "kill all processes" action, we can
> convert memory.groupoom from a boolean into a multi-value knob:
> for instance, "disabled" and "killall". This will allow adding
> a third state ("evaluate", for example) later without breaking the API.
>
No, this isn't how kernel features get introduced. We don't design a new
kernel feature with its own API for a highly specialized usecase and then
claim we'll fix the problems later. Users will work around the
constraints of the new feature, if possible, and then we can end up
breaking them later. Or, we can pollute the mem cgroup v2 filesystem with
even more tunables to cover up for mistakes in earlier designs.
The key point to all three of my objections: extensibility.
Both you and Michal have acknowledged blatantly obvious shortcomings in
the design. We do not need to merge an incomplete patchset that forces us
into a corner later so that we need to add more tunables and behaviors.
It's also an incompletely documented patchset: the user has *no* insight,
other than by reading the kernel code, that with this functionality their
/proc/pid/oom_score_adj values are effective when attached to the root
cgroup and ineffective when attached to a non-root cgroup, regardless of
whether the memory controller is enabled or not.
Extensibility when designing a new kernel feature is important. If you
see shortcomings, which I've enumerated and which have been acknowledged,
they need to be fixed.
The main consideration here is two choices:
(1) any logic to determine process(es) to be oom killed belongs only in
userspace, and has no place in the kernel, or
(2) such logic is complete, documented, extensible, and is generally
useful.
This patchset obviously is neither. I believe we can move toward (2), but
it needs to be sanely designed and implemented such that it addresses the
concerns I've enumerated that we are all familiar with.
The solution is to define the oom policy as a trait of the mem cgroup
itself. That needs to be complemented with a per mem cgroup value member
to specify the policy for child mem cgroups. We don't need any new
memory.groupoom, which depends on only this policy decision and is
otherwise a complete no-op, and we don't need any new mount option.
It's quite obvious the system cannot have a single cgroup aware oom
policy. That's when you get crazy inconsistencies and functionally broken
behavior such as influencing oom kill selection based on whether a cgroup
distributed processes over a set of child cgroups, even if they do not
have memory in their memory.subtree_control. It's also where you can give
the user the ability to still prefer to protect important cgroups or bias
against non-important cgroups.
The point is that the functionality needs to be complete and it needs to
be extensible.
I'd propose to introduce a new memory.oom_policy and a new
memory.oom_value. All cgroups have these files, but memory.oom_value is a
no-op for the root mem cgroup, just as you can't limit the memory usage on
the root mem cgroup.
memory.oom_policy defines the policy for selection when that mem cgroup is
oom; system oom conditions refer to the root mem cgroup's
memory.oom_policy.
memory.oom_value defines any parameter that child mem cgroups should be
considered with when effecting that policy.
There are three memory.oom_policy values I can envision at this time, but
the set can be extended in the future:
- "adj": cgroup aware just as you have implemented based on how large a
cgroup is. oom_value can be used to define any adjustment made to that
usage so that userspace can protect important cgroups or bias against
non-important cgroups (I'd suggest using the same semantics as
oom_score_adj for consistency, but wouldn't object to using a byte
value).
- "hierarchy": same as "adj" but based on hierarchical usage. oom_value
has the same semantics as "adj". This prevents users from
intentionally or unintentionally affecting oom kill selection by
limiting groups of their processes by the memory controller, or any
other cgroup controller.
- "priority": cgroup aware just as you implemented memory.oom_priority in
the past. oom_value is a strict priority value where usage is not
considered and only the priority of the subtree is compared.
With cgroup v2 semantics of no internal process constraint, this is
extremely straightforward. All of your oom evaluation code can be
reused with a simple comparison based on the policy to score individual
cgroups. In the simplest policy, "priority", this is like a 10 line
function, but extremely useful to userspace.
This allows users to have full power over the decisionmaking in every
subtree wrt oom kill selection and doesn't regress or allow for any
pitfalls of the current patchset. The goal is not to have one single oom
policy for the entire system, but to define the policy that makes useful
sense. This is how an extensible feature is designed and does not require
any further pollution of the mem cgroup filesystem.
If you have additional features such as groupoom, you can make
memory.oom_policy comma delimited, just as vmpressure modes are comma
delimited. You would want to use "adj,groupoom". We don't need another
file that is pointless in other policy decisions. We don't need a mount
option to lock the entire system into a single methodology.
Cgroup v2 is a very clean interface and I think it's the responsibility of
every controller to maintain that. We should not fall into a cgroup v1
mentality which became very difficult to make extensible. Let's make a
feature that is generally useful, complete, and empowers the user rather
than push them into a corner with a system wide policy with obvious
downsides.
For these reasons, and the three concerns that I enumerated earlier which
have been acknowledged of obvious shortcoming with this approach:
Nacked-by: David Rientjes <rientjes@google.com>
I'll be happy to implement the core functionality that allows oom policies
to be written by the user and introduce memory.oom_value, and then rework
the logic defined in your patchset as "adj" by giving the user an optional
way of preferring or biasing that usage in a way that is clearly
documented and extended. Root mem cgroup usage is obviously wrong in this
current patchset since it uses oom_score_adj whereas leaf cgroups do not,
so that will be fixed. But I'll ask that the current patchset is removed
from -mm since it has obvious downsides, pollutes the mem cgroup v2
filesystem, is not extensible, is not documented wrt oom_score_adj,
allows evasion of the heuristic, and doesn't allow the user to have any
power in the important decision of which of their important processes is
oom killed such that this feature is not useful outside very specialized
usecases.
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH v13 0/7] cgroup-aware OOM killer
2018-01-12 22:03 ` David Rientjes
@ 2018-01-15 11:54 ` Michal Hocko
2018-01-16 21:36 ` David Rientjes
0 siblings, 1 reply; 51+ messages in thread
From: Michal Hocko @ 2018-01-15 11:54 UTC (permalink / raw)
To: David Rientjes
Cc: Roman Gushchin, Andrew Morton, linux-mm, Vladimir Davydov,
Johannes Weiner, Tetsuo Handa, Tejun Heo, kernel-team, cgroups,
linux-doc, linux-kernel, linux-mm
On Fri 12-01-18 14:03:03, David Rientjes wrote:
> On Thu, 11 Jan 2018, Roman Gushchin wrote:
>
> > Summarizing all this: following the hierarchy is good when it reflects
> > the "importance" of a cgroup's memory for the user, and bad otherwise.
> > In the generic case with a unified hierarchy this is not true, so following
> > the hierarchy unconditionally is bad.
> >
> > The current version of the patchset allows common evaluation of cgroups
> > by setting memory.groupoom to true. The only limitation is that it
> > also changes the OOM action: all processes belonging to the cgroup will
> > be killed. If we really want to preserve an option to evaluate cgroups
> > together without forcing the "kill all processes" action, we can
> > convert memory.groupoom from a boolean into a multi-value knob:
> > for instance, "disabled" and "killall". This will allow adding
> > a third state ("evaluate", for example) later without breaking the API.
> >
>
> No, this isn't how kernel features get introduced. We don't design a new
> kernel feature with its own API for a highly specialized usecase and then
> claim we'll fix the problems later. Users will work around the
> constraints of the new feature, if possible, and then we can end up
> breaking them later. Or, we can pollute the mem cgroup v2 filesystem with
> even more tunables to cover up for mistakes in earlier designs.
This is a blatant misinterpretation of the proposed changes. I haven't
heard a _single_ argument against the proposed user interface except
for complaints about missing tunables. This is not how kernel
development works or should work. The usecase was clearly described and
is far from limited to a single workload or company.
> The key point to all three of my objections: extensibility.
And it has been argued that further _features_ can be added on top. I am
absolutely fed up discussing those things again and again without any
progress. You simply keep _ignoring_ counter arguments and that is a
major PITA to be honest with you. You are basically blocking a useful
feature because it doesn't solve your particular workload. This is
simply not acceptable in the kernel development.
> > Both you and Michal have acknowledged blatantly obvious shortcomings in
> the design.
What you call blatant shortcomings we do not see affecting any
_existing_ workloads. If they turn out to be real issues then we can fix
them without deprecating any user APIs added by this patchset.
Look, I am not the one who is a heavy container user. The primary reason
I am supporting this is that a) it has a _real_ user and I can see
more emerging, and b) it _makes_ sense in its current form and
doesn't bring a heavy maintenance burden into the OOM code base, which I
do care about.
I am deliberately skipping the rest of your email because it is _yet_
_again_ a form of obstructing the current patchset which is what you
have been doing for quite some time. I will leave the final decision
for merging to Andrew. If you want to build a more fine grained control
on top, you are free to do so. I will be reviewing those like any other
upstream oom changes.
--
Michal Hocko
SUSE Labs
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH v13 0/7] cgroup-aware OOM killer
2018-01-15 11:54 ` Michal Hocko
@ 2018-01-16 21:36 ` David Rientjes
2018-01-16 22:09 ` Michal Hocko
0 siblings, 1 reply; 51+ messages in thread
From: David Rientjes @ 2018-01-16 21:36 UTC (permalink / raw)
To: Michal Hocko
Cc: Roman Gushchin, Andrew Morton, Vladimir Davydov, Johannes Weiner,
Tetsuo Handa, Tejun Heo, kernel-team, cgroups, linux-doc,
linux-kernel, linux-mm
On Mon, 15 Jan 2018, Michal Hocko wrote:
> > No, this isn't how kernel features get introduced. We don't design a new
> > kernel feature with its own API for a highly specialized usecase and then
> > claim we'll fix the problems later. Users will work around the
> > constraints of the new feature, if possible, and then we can end up
> > breaking them later. Or, we can pollute the mem cgroup v2 filesystem with
> > even more tunables to cover up for mistakes in earlier designs.
>
> This is a blatant misinterpretation of the proposed changes. I haven't
> heard a _single_ argument against the proposed user interface except
> for complaints about missing tunables. This is not how kernel
> development works or should work. The usecase was clearly described and
> is far from limited to a single workload or company.
>
The complaint about the user interface is that it is not extensible, as my
next line states. This doesn't need to be opted into with a mount option
locking the entire system into a single oom policy. That, itself, is the
result of a poor design. What is needed is a way for users to define an
oom policy that is generally useful, not something that is locked in for
the whole system. We don't need several different cgroup mount options
only for mem cgroup oom policies. We also don't need random
memory.groupoom files being added to the mem cgroup v2 filesystem only for
one or two particular policies and being no-ops otherwise. It can easily
be specified as part of the policy itself. My suggestion adds two new
files to the mem cgroup v2 filesystem and no mount option, and allows any
policy to be added later that only uses these two files. I see you've
ignored all of that in this email, so perhaps reading it would be
worthwhile so that you can provide constructive feedback.
> > The key point to all three of my objections: extensibility.
>
> And it has been argued that further _features_ can be added on top. I am
> absolutely fed up discussing those things again and again without any
> progress. You simply keep _ignoring_ counter arguments and that is a
> major PITA to be honest with you. You are basically blocking a useful
> feature because it doesn't solve your particular workload. This is
> simply not acceptable in the kernel development.
>
As the thread says, this has nothing to do with my own particular
workload, it has to do with three obvious shortcomings in the design that
the user has no control over. We can't add features on top if the
heuristic itself changes as a result of the proposal, it needs to be
introduced in an extensible way so that additional changes can be made
later, if necessary, while still working around the very obvious problems
with this current implementation. My suggestion is that we introduce a
way to define the oom policy once so that we don't have to change it later
and are left with needless mount options or mem cgroup v2 files that
become no-ops with the suggested design. I hope that you will read the
proposal for that extensible interface and comment on it about any
concerns that you have, because that feedback would generally be useful.
> > Both you and Michal have acknowledged blatantly obvious shortcomings in
> > the design.
>
> What you call blatant shortcomings we do not see affecting any
> _existing_ workloads. If they turn out to be real issues then we can fix
> them without deprecating any user APIs added by this patchset.
>
There are existing workloads that use mem cgroup subcontainers purely for
tracking charging and vmscan stats, which results in this logic being
evaded. It's a real issue, and a perfectly acceptable usecase for mem
cgroup. It's a result of the entire oom policy either being opted into or
opted out of for the entire system and impossible for the user to
configure or avoid. That can be done better by enabling the oom policy
only for a subtree, as I've suggested, but you've ignored. It would also
render both the mount option and the additional file in the mem cgroup v2
filesystem added by this patchset to be no-ops.
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH v13 0/7] cgroup-aware OOM killer
2018-01-16 21:36 ` David Rientjes
@ 2018-01-16 22:09 ` Michal Hocko
0 siblings, 0 replies; 51+ messages in thread
From: Michal Hocko @ 2018-01-16 22:09 UTC (permalink / raw)
To: David Rientjes
Cc: Roman Gushchin, Andrew Morton, Vladimir Davydov, Johannes Weiner,
Tetsuo Handa, Tejun Heo, kernel-team, cgroups, linux-doc,
linux-kernel, linux-mm
On Tue 16-01-18 13:36:21, David Rientjes wrote:
> On Mon, 15 Jan 2018, Michal Hocko wrote:
>
> > > No, this isn't how kernel features get introduced. We don't design a new
> > > kernel feature with its own API for a highly specialized usecase and then
> > > claim we'll fix the problems later. Users will work around the
> > > constraints of the new feature, if possible, and then we can end up
> > > breaking them later. Or, we can pollute the mem cgroup v2 filesystem with
> > > even more tunables to cover up for mistakes in earlier designs.
> >
> > This is a blatant misinterpretation of the proposed changes. I haven't
> > heard a _single_ argument against the proposed user interface except
> > for complaints about missing tunables. This is not how kernel
> > development works or should work. The usecase was clearly described and
> > is far from limited to a single workload or company.
> >
>
> The complaint about the user interface is that it is not extensible, as my
> next line states.
I disagree and will not repeat argument why.
> This doesn't need to be opted into with a mount option
> locking the entire system into a single oom policy. That, itself, is the
> result of a poor design. What is needed is a way for users to define an
> oom policy that is generally useful, not something that is locked in for
> the whole system.
We have been discussing general oom policies for years now and there was
_no_ _single_ useful/acceptable approach suggested. Nor is your sketch,
I am afraid, because we could argue that it doesn't address other
usecases out there which need more specific control. All that without
having _any_ code merged. The current one is a policy that addresses a
reasonably large class of usecases out there based on containers, without
forcing everybody else to use the same policy.
> We don't need several different cgroup mount options
> only for mem cgroup oom policies.
A cgroup mount option sounds like a reasonable approach, already used for
the unified hierarchy in its early stages.
> We also don't need random
> memory.groupoom files being added to the mem cgroup v2 filesystem only for
> one or two particular policies and being no-ops otherwise.
This groupoom is a fundamental API allowing the whole cgroup to be killed,
which is a reasonable thing to do and also a sane user API regardless of
implementation details. Any oom selection policy can be built on top.
> It can easily
> be specified as part of the policy itself.
No it cannot, because it would conflate oom selection _and_ oom action
together. And that would be wrong _semantically_, I believe. And I am quite
sure we can discuss what kind of policies we need to death and won't
move on. Exactly like, ehm, until now.
So let me repeat. There are users for the functionality. Users have to
explicitly opt in, so existing users are not at risk of regressions.
Furthermore, fine-grained oom selection policies can be implemented on top
without breaking new users.
In short: There is no single reason to block this to be merged.
If your usecase is not covered yet then feel free to extend the existing
code/APIs to do so. I will happily review and discuss them like I've
been doing here even though I am myself not a user of this new
functionality.
--
Michal Hocko
SUSE Labs
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH v13 0/7] cgroup-aware OOM killer
2018-01-11 9:08 ` Michal Hocko
2018-01-11 13:18 ` Roman Gushchin
@ 2018-01-11 21:57 ` David Rientjes
1 sibling, 0 replies; 51+ messages in thread
From: David Rientjes @ 2018-01-11 21:57 UTC (permalink / raw)
To: Michal Hocko
Cc: Andrew Morton, Roman Gushchin, Vladimir Davydov, Johannes Weiner,
Tetsuo Handa, Tejun Heo, kernel-team, cgroups, linux-doc,
linux-kernel, linux-mm
On Thu, 11 Jan 2018, Michal Hocko wrote:
> > > I find this problem quite minor, because I haven't seen any practical problems
> > > caused by accounting of the root cgroup memory.
> > > If it's a serious problem for you, it can be solved without switching to the
> > > hierarchical accounting: it's possible to sum up all leaf cgroup stats and
> > > subtract them from global values. So, it can be a relatively small enhancement
> > > on top of the current mm tree. This has nothing to do with global victim selection
> > > approach.
> >
> > It sounds like a significant shortcoming to me - the oom-killing
> > decisions which David describes are clearly incorrect?
>
> Well, I would rather look at that from the use case POV. The primary
> user of the new OOM killer functionality is containers. I might be
> wrong, but I _assume_ that the root cgroup will only contain the basic system
> infrastructure and all the workload will run in containers (aka
> cgroups). The only oom tuning inside the root cgroup would be to disable
> oom for some of those processes. The current implementation should work
> reasonably well for that configuration.
>
It's a decision made on the system level when cgroup2 is mounted, it
affects all processes that get attached to a leaf cgroup, even regardless
of whether or not it has the memory controller enabled for that subtree.
In other words, mounting cgroup2 with the cgroup aware oom killer mount
option immediately:
- makes /proc/pid/oom_score_adj effective for all processes attached
to the root cgroup only, and
- makes /proc/pid/oom_score_adj a no-op for all processes attached to a
non-root cgroup.
Note that there may be no active usage of any memory controller at the
time of oom, yet this tunable inconsistency still exists for any process.
The initial response is correct: it clearly produces incorrect oom killing
decisions. This implementation detail either requires the entire system
is not containerized at all (no processes attached to any leaf cgroup), or
fully containerized (all processes attached to leaf cgroups). It's a
highly specialized usecase with very limited scope and is fraught
with pitfalls if any oom_score_adj is tuned, because it strictly depends on
the cgroup to which those processes are attached at any given time to
determine whether it is effective or not.
> > > We've discussed this a lot.
> > > Hierarchical approach has their own issues, which we've discussed during
> > > previous iterations of the patchset. If you know how to address them
> > > (I've no idea), please, go on and suggest your version.
> >
> > Well, if a hierarchical approach isn't a workable fix for the problem
> > which David has identified then what *is* the fix?
>
> Hierarchical approach basically hardcodes the oom decision into the
> hierarchy structure and that is simply a no go because that would turn
> into a massive configuration PITA (more on that below). I consider the above
> example rather artificial to be honest. Anyway, if we _really_ have to
> address it in the future we can do that by providing a mechanism to
> prioritize cgroups. It seems that this is required for some usecases
> anyway.
>
I'll address the hierarchical accounting suggestion below.
The example is not artificial, it's not theoretical, it's a real-world
example. Users can attach processes to subcontainers purely for memory
stats from a mem cgroup point of view without limiting usage, or to
subcontainers for use with other controllers other than the memory
controller. We do this all the time: it's helpful to assign groups of
processes to subcontainers simply to track statistics while the actual
limitation is enforced by an ancestor.
So without hierarchical accounting, we can extend the above restriction,
that a system is either fully containerized or not containerized at all,
by saying that a fully containerized system must not use subcontainers to
avoid the cgroup aware oom killer heuristic. In other words, using this
feature requires:
- the entire system is not containerized at all, or
- the entire system is fully containerized and no cgroup (any controller,
not just memory) uses subcontainers to intentionally/unintentionally
distribute usage to evade this heuristic.
Of course the second restriction severely limits the flexibility that
cgroup v2 introduces as a whole as a caveat of an implementation detail of
the memory cgroup aware oom killer. Why not simply avoid using the cgroup
aware oom killer? It's not so easy, since it's a property of the
machine itself: users probably have no control over it themselves and, in
the worst case, can trivially evade ever being oom killed if it is used.
> > > The per-process oom_score_adj interface is not the nicest one, and I'm not
> > > sure we want to replicate it at the cgroup level as is. If you have an idea of
> > > what it should look like, please propose a patch; otherwise it's hard to discuss
> > > it without the code.
> >
> > It does make sense to have some form of per-cgroup tunability. Why is
> > the oom_score_adj approach inappropriate and what would be better?
>
> oom_score_adj is basically unusable for any fine tuning on the process
> level for most setups except for very specialized ones. The only
> reasonable usage I've seen so far was to disable OOM killer for a
> process or make it a prime candidate. Using the same limited concept for
> cgroups sounds like repeating the same error to me.
>
oom_score_adj is based on the heuristic that the oom killer uses to decide
which process to kill, it kills the largest memory hogging process. Being
able to tune that based on a proportion of available memory, whether for
the system as a whole or for a memory cgroup hierarchy makes sense for
varying RAM capacities and memory cgroup limits. It works quite well in
practice; I'm not sure what your experience with it has been, given the
needs you have with overcommit.
The ability to protect important cgroups and bias against non-important
cgroups is vital to any selection implementation. Strictly requiring that
important cgroups do not have majority usage is not a solution, which is
what this patchset implements. It's also insufficient to say that this
can be added on later, even though the need is acknowledged, because any
prioritization or userspace influence will require a change to the cgroup
aware oom killer's core heuristic itself. That needs to be decided now
rather than later to avoid differing behavior between kernel versions for
those who adopt this feature and carefully arrange their cgroup
hierarchies to fit with the highly specialized usecase before it's
completely changed out from under them.
> > How hard is it to graft such a thing onto the -mm patchset?
>
> I think this should be thought through very well before we add another
> tuning. Moreover the current usecase doesn't seem to require it so I am
> not really sure why we should implement something right away and later
> suffer from API mistakes.
>
The current usecase is so highly specialized that it's not usable for
anybody else and will break that highly specialized usecase if we make it
usable for anybody else in the future.
It's so highly specialized that you can't even protect an important
process from oom kill, there is no remedy provided to userspace that still
allows local mem cgroup oom to actually work.
# echo "+memory" > cgroup.subtree_control
# mkdir cg1
# echo $$ > cg1/cgroup.procs
<fork important process>
# echo -999 > /proc/$!/oom_score_adj
# echo f > /proc/sysrq-trigger
This kills the important process if cg1 has the highest usage, the user
has no control other than purposefully evading the oom killer, if
possible, by distributing the process's threads over subcontainers.
That a highly specialized user is proposing this, and that it works well
for them right now, does not mean that it is in the best interest of Linux to
merge, especially if it cannot become generally useful to others without
core changes that affect the original highly specialized user. This is
why the patchset, as is, is currently incomplete.
> > > David, I _had_ hierarchical accounting implemented in one of the previous
> > > versions of this patchset. And there were _reasons_, why we went away from it.
> >
> > Can you please summarize those issues for my understanding?
>
> Because it makes the oom decision directly hardwired to the hierarchy
> structure. Just take a simple example of the cgroup structure which
> reflects a higher level organization
> root
> / | \
> admins | teachers
> students
>
> Now your students group will most likely be the largest one. Why should we
> kill tasks/cgroups from that cgroup just because it is cumulatively the
> largest one? It might have been one of the teachers blowing up the
> memory usage.
>
There's one thing missing here: the use of the proposed
memory.oom_score_adj. You can use it to either polarize the decision so
that oom kills always originate from any of /admins, /students, or
/teachers, or bias against them. You can also proportionally prefer
/admins or /teachers, if desired, so they are selected if their usage
exceeds a certain threshold based on the limit of the ancestor, even
though that hierarchical usage does not exceed, or even approach, the
hierarchical usage of /students.
This requires that the entity setting up this hierarchy know only one
thing: the allowed memory usage of /admins or /teachers compared to the
very large usage of /students. It can actually be tuned in great
detail. I only suggest memory.oom_score_adj because it can work for all
/root capacities, regardless of RAM capacity, as a proportion of the
system that should be set aside for /admins, /students, and/or /teachers
rather than hardcoded bytes which would change depending on the amount of
RAM you have.
FWIW, we do this exact thing and it works quite well. This doesn't mean
that I'm advocating for my own usecase, or anticipated usecase for the
cgroup aware oom killer, but that you've picked an example that I have
personally worked with and the bias can work quite well for oom kill
selection.
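A minimal sketch of how such tuning could look, assuming the *proposed*
memory.oom_score_adj knob existed and followed the per-process
/proc/pid/oom_score_adj semantics. None of this exists in any mainline
kernel; a mock temp directory stands in for /sys/fs/cgroup so the
commands can run anywhere:

```shell
# Hypothetical memory.oom_score_adj (proposed interface, not mainline).
# Assumed range mirrors /proc/pid/oom_score_adj: [-1000, 1000].
root=$(mktemp -d)                    # stand-in for /sys/fs/cgroup
mkdir -p "$root/admins" "$root/teachers" "$root/students"
printf '%s\n' -999 > "$root/admins/memory.oom_score_adj"   # strongly protect
printf '%s\n' -500 > "$root/teachers/memory.oom_score_adj" # bias against kill
printf '%s\n'  500 > "$root/students/memory.oom_score_adj" # prefer as victim
cat "$root/students/memory.oom_score_adj"
```

With biases like these, /students would be preferred for oom kill even
when its hierarchical usage is comparable to that of its siblings.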
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH v13 0/7] cgroup-aware OOM killer
2018-01-10 19:33 ` Andrew Morton
2018-01-11 9:08 ` Michal Hocko
@ 2018-01-13 17:14 ` Johannes Weiner
2018-01-14 23:44 ` David Rientjes
1 sibling, 1 reply; 51+ messages in thread
From: Johannes Weiner @ 2018-01-13 17:14 UTC (permalink / raw)
To: Andrew Morton
Cc: Roman Gushchin, David Rientjes, linux-mm, Michal Hocko,
Vladimir Davydov, Tetsuo Handa, Tejun Heo, kernel-team, cgroups,
linux-doc, linux-kernel, linux-mm
On Wed, Jan 10, 2018 at 11:33:45AM -0800, Andrew Morton wrote:
> On Wed, 10 Jan 2018 05:11:44 -0800 Roman Gushchin <guro@fb.com> wrote:
> > On Tue, Jan 09, 2018 at 04:57:53PM -0800, David Rientjes wrote:
> > > On Thu, 30 Nov 2017, Andrew Morton wrote:
> > > > > This patchset makes the OOM killer cgroup-aware.
> > > >
> > > > Thanks, I'll grab these.
> > > >
> > > > There has been controversy over this patchset, to say the least. I
> > > > can't say that I followed it closely! Could those who still have
> > > > reservations please summarise their concerns and hopefully suggest a
> > > > way forward?
> > > >
> > >
> > > Yes, I'll summarize what my concerns have been in the past and what they
> > > are wrt the patchset as it stands in -mm. None of them originate from my
> > > current usecase or anticipated future usecase of the oom killer for
> > > system-wide or memcg-constrained oom conditions. They are based purely on
> > > the patchset's use of an incomplete and unfair heuristic for deciding
> > > which cgroup to target.
> > >
> > > I'll also suggest simple changes to the patchset, which I have in the
> > > past, that can be made to address all of these concerns.
> > >
> > > 1. The unfair comparison of the root mem cgroup vs leaf mem cgroups
> > >
> > > The patchset uses two different heuristics to compare root and leaf mem
> > > cgroups and scores them based on number of pages. For the root mem
> > > cgroup, it totals the /proc/pid/oom_score of all processes attached:
> > > that's based on rss, swap, pgtables, and, most importantly, oom_score_adj.
> > > For leaf mem cgroups, it's based on that memcg's anonymous, unevictable,
> > > unreclaimable slab, kernel stack, and swap counters. These can be wildly
> > > different independent of /proc/pid/oom_score_adj, but the most obvious
> > > unfairness comes from users who tune oom_score_adj.
> > >
> > > An example: start a process that faults 1GB of anonymous memory and leave
> > > it attached to the root mem cgroup. Start six more processes that each
> > > fault 1GB of anonymous memory and attach them to a leaf mem cgroup. Set
> > > all processes to have /proc/pid/oom_score_adj of 1000. System oom kill
> > > will always kill the 1GB process attached to the root mem cgroup. It's
> > > because oom_badness() relies on /proc/pid/oom_score_adj, which is used to
> > > evaluate the root mem cgroup, and leaf mem cgroups completely disregard
> > > it.
> > >
> > > In this example, the leaf mem cgroup's score is 1,573,044, the number of
> > > pages for the 6GB of faulted memory. The root mem cgroup's score is
> > > 12,652,907, eight times larger even though its usage is six times smaller.
> > >
> > > This is caused by the patchset disregarding oom_score_adj entirely for
> > > leaf mem cgroups and relying on it heavily for the root mem cgroup. It's
> > > the complete opposite result of what the cgroup aware oom killer
> > > advertises.
> > >
> > > It also works the other way, if a large memory hog is attached to the root
> > > mem cgroup but has a negative oom_score_adj it is never killed and random
> > > processes are nuked solely because they happened to be attached to a leaf
> > > mem cgroup. This behavior wrt oom_score_adj is completely undocumented,
> > > so I can't presume that it is either known or tested.
> > >
> > > Solution: compare the root mem cgroup and leaf mem cgroups equally with
> > > the same criteria by doing hierarchical accounting of usage and
> > > subtracting from total system usage to find root usage.
> >
> > I find this problem quite minor, because I haven't seen any practical problems
> > caused by accounting of the root cgroup memory.
> > If it's a serious problem for you, it can be solved without switching to the
> > hierarchical accounting: it's possible to sum up all leaf cgroup stats and
> > subtract them from global values. So, it can be a relatively small enhancement
> > on top of the current mm tree. This has nothing to do with the global victim
> > selection approach.
>
> It sounds like a significant shortcoming to me - the oom-killing
> decisions which David describes are clearly incorrect?
As others have pointed out, it's an academic problem.
You don't have any control and no accounting of the stuff situated
inside the root cgroup, so it doesn't make sense to leave anything in
there while also using sophisticated containerization mechanisms like
this group oom setting.
In fact, the laptop I'm writing this email on runs an unmodified
mainstream Linux distribution. The only things in the root cgroup are
kernel threads.
The decisions are good enough for the rare cases you forget something
in there and it explodes.
> If this can be fixed against the -mm patchset with a "relatively small
> enhancement" then please let's get that done so it can be reviewed and
> tested.
You'd have to sum up all the memory consumed by first-level cgroups
and then subtract from the respective system-wide counters. They're
implemented differently than the cgroup tracking, so it's kind of
messy. But it wouldn't be impossible.
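The bookkeeping being described could be sketched roughly as follows.
This uses mock files rather than a real /sys/fs/cgroup mount, and the
single "total" counter is an assumption for illustration; the real
system-wide counters are split across vmstat and memcg in messier ways:

```shell
# Approximate root-only usage: sum memory.current over all first-level
# cgroups and subtract that from the system-wide total.
root=$(mktemp -d)                    # mock first-level cgroup tree
mkdir -p "$root/system" "$root/workload"
echo $((40 * 1024 * 1024 * 1024)) > "$root/system/memory.current"   # 40G
echo $((80 * 1024 * 1024 * 1024)) > "$root/workload/memory.current" # 80G
total=$((128 * 1024 * 1024 * 1024))  # stand-in for the global counter

sum=0
for f in "$root"/*/memory.current; do
    sum=$((sum + $(cat "$f")))
done
root_usage=$((total - sum))          # bytes charged only to the root cgroup
echo "$root_usage"
```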
> > > 2. Evading the oom killer by attaching processes to child cgroups
> > >
> > > Any cgroup on the system can attach all their processes to individual
> > > child cgroups. This is functionally the same as doing
> > >
> > > for i in $(cat cgroup.procs); do mkdir $i; echo $i > $i/cgroup.procs; done
> > >
> > > without the no internal process constraint introduced with cgroup v2. All
> > > child cgroups are evaluated based on their own usage: all anon,
> > > unevictable, and unreclaimable slab as described previously. It requires
> > > an individual cgroup to be the single largest consumer to be targeted by
> > > the oom killer.
> > >
> > > An example: allow users to manage two different mem cgroup hierarchies
> > > limited to 100GB each. User A uses 10GB of memory and user B uses 90GB of
> > > memory in their respective hierarchies. On a system oom condition, we'd
> > > expect that at least one process from user B's hierarchy would always be oom
> > > killed with the cgroup aware oom killer. In fact, the changelog
> > > explicitly states it solves an issue where "1) There is no fairness
> > > between containers. A small container with few large processes will be
> > > chosen over a large one with huge number of small processes."
> > >
> > > The opposite becomes true, however, if user B creates child cgroups and
> > > distributes its processes such that each child cgroup's usage never
> > > exceeds 10GB of memory. This can either be done intentionally to
> > > purposefully have a low cgroup memory footprint to evade the oom killer or
> > > unintentionally with cgroup v2 to allow those individual processes to be
> > > constrained by other cgroups in a single hierarchy model. User A, using
> > > 10% of his memory limit, is always oom killed instead of user B, using 90%
> > > of his memory limit.
> > >
> > > Others have commented it's still possible to do this with a per-process
> > > model if users split their processes into many subprocesses with small
> > > memory footprints.
> > >
> > > Solution: comparing cgroups must be done hierarchically. Neither user A
> > > nor user B can evade the oom killer because targeting is done based on the
> > > total hierarchical usage rather than individual cgroups in their
> > > hierarchies.
> >
> > We've discussed this a lot.
> > Hierarchical approach has their own issues, which we've discussed during
> > previous iterations of the patchset. If you know how to address them
> > (I've no idea), please, go on and suggest your version.
>
> Well, if a hierarchical approach isn't a workable fix for the problem
> which David has identified then what *is* the fix?
This assumes you even need one. Right now, the OOM killer picks the
biggest MM, so you can evade selection by forking your MM. This patch
allows picking the biggest cgroup, so you can evade by forking groups.
It's not a new vector, and clearly nobody cares. This has never been
brought up against the current design that I know of.
Note, however, that there actually *is* a way to guard against it: in
cgroup2 there is a hierarchical limit you can configure for the number
of cgroups that are allowed to be created in the subtree. See
1a926e0bbab8 ("cgroup: implement hierarchy limits").
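For reference, the hierarchical limits mentioned above are the
cgroup.max.descendants and cgroup.max.depth files. A sketch of using
them, written against a mock temp directory since writing the real files
requires a cgroup2 mount and privileges; the delegated-subtree path is
an assumption for illustration:

```shell
# cgroup2 hierarchy limits (commit 1a926e0bbab8) cap how many cgroups a
# delegated subtree may create, blunting "fork more groups" evasion.
# On a real system $d would be a delegated subtree under /sys/fs/cgroup.
d=$(mktemp -d)
echo 100 > "$d/cgroup.max.descendants"  # at most 100 cgroups below here
echo 5   > "$d/cgroup.max.depth"        # nesting limited to 5 levels
cat "$d/cgroup.max.descendants"
```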
> > > 3. Userspace has zero control over oom kill selection in leaf mem cgroups
> > >
> > > Unlike using /proc/pid/oom_score_adj to bias or prefer certain processes
> > > from the oom killer, the cgroup aware oom killer does not provide any
> > > solution for the user to protect leaf mem cgroups. This is a result of
> > > leaf mem cgroups being evaluated based on their anon, unevictable, and
> > > unreclaimable slab usage and disregarding any user tunable.
> > >
> > > Absent the cgroup aware oom killer, users have the ability to strongly
> > > prefer a process is oom killed (/proc/pid/oom_score_adj = 1000) or
> > > strongly bias against a process (/proc/pid/oom_score_adj = -999).
> > >
> > > An example: a process knows it's going to use a lot of memory, so it sets
> > > /proc/self/oom_score_adj to 1000. It wants to be killed first to avoid
> > > disrupting any other process. If it's attached to the root mem cgroup,
> > > it will be oom killed. If it's attached to a leaf mem cgroup by an admin
> > > outside its control, it will never be oom killed unless that cgroup's
> > > usage is the largest single cgroup usage on the system. The reverse also
> > > is true for processes that the admin does not want to be oom killed: set
> > > /proc/pid/oom_score_adj to -999, but it will *always* be oom killed if its
> > > cgroup has the highest usage on the system.
> > >
> > > The result is that both admins and users have lost all control over which
> > > processes are oom killed. They are left with only one alternative: set
> > > /proc/pid/oom_score_adj to -1000 to completely disable a process from oom
> > > kill. It doesn't address the issue at all for memcg-constrained oom
> > > conditions since no processes are killable anymore, and risks panicking
> > > the system if it is the only process left on the system. A process
> > > preferring that it is first in line for oom kill simply cannot volunteer
> > > anymore.
> > >
> > > Solution: allow users and admins to control oom kill selection by
> > > introducing a memory.oom_score_adj to affect the oom score of that mem
> > > cgroup, exactly the same as /proc/pid/oom_score_adj affects the oom score
> > > of a process.
> >
> > The per-process oom_score_adj interface is not the nicest one, and I'm not
> > sure we want to replicate it on the cgroup level as is. If you have an idea of
> > what it should look like, please propose a patch; otherwise it's hard to discuss
> > it without the code.
>
> It does make sense to have some form of per-cgroup tunability. Why is
> the oom_score_adj approach inappropriate and what would be better? How
> hard is it to graft such a thing onto the -mm patchset?
It could be useful, but we have no consensus on the desired
semantics. And it's not clear why we couldn't add it later as long as
the default settings of a new knob maintain the default behavior
(which would have to be preserved anyway, since we rely on it).
> > > I proposed a solution in
> > > https://marc.info/?l=linux-kernel&m=150956897302725, which was never
> > > responded to, for all of these issues. The idea is to do hierarchical
> > > accounting of mem cgroup hierarchies so that the hierarchy is traversed
> > > comparing total usage at each level to select target cgroups. Admins and
> > > users can use memory.oom_score_adj to influence that decisionmaking at
> > > each level.
We did respond repeatedly: this doesn't work for a lot of setups.
Take for example our systems. They have two big top level cgroups:
system stuff - logging, monitoring, package management, etc. - and
then the workload subgroup.
Your scheme would always descend into the workload group by default. Why
would that be obviously correct? Something in the system group could
have blown up.
What you propose could be done, but it would have to be opt-in - just
like this oom group feature is opt-in.
> > > This solves #1 because mem cgroups can be compared based on the same
> > > classes of memory and the root mem cgroup's usage can be fairly compared
> > > by subtracting top-level mem cgroup usage from system usage. All of the
> > > criteria used to evaluate a leaf mem cgroup has a reasonable system-wide
> > > counterpart that can be used to do the simple subtraction.
> > >
> > > This solves #2 because evaluation is done hierarchically so that
> > > distributing processes over a set of child cgroups either intentionally
> > > or unintentionally no longer evades the oom killer. Total usage is always
> > > accounted to the parent and there is no escaping this criteria for users.
> > >
> > > This solves #3 because it allows admins to protect important processes in
> > > cgroups that are supposed to use, for example, 75% of system memory
> > > without it unconditionally being selected for oom kill but still oom kill
> > > if it exceeds a certain threshold. In this sense, the cgroup aware oom
> > > killer, as currently implemented, is selling mem cgroups short by
> > > requiring the user to accept that the important process will be oom killed
> > > if it uses mem cgroups and isn't attached to root. It also allows users
> > > to actually volunteer to be oom killed first without majority usage.
> > >
> > > It has come up time and time again that this support can be introduced on
> > > top of the cgroup oom killer as implemented. It simply cannot. For
> > > admins and users to have control over decisionmaking, it needs an
> > > oom_score_adj-type tunable that cannot change semantics from kernel
> > > version to kernel version and without polluting the mem cgroup filesystem.
> > > That, in my suggestion, is an adjustment on the amount of total
> > > hierarchical usage of each mem cgroup at each level of the hierarchy.
> > > That requires that the heuristic uses hierarchical usage rather than
> > > considering each cgroup an independent consumer, as it does today. We
> > > need to implement that heuristic and introduce userspace influence over
> > > oom kill selection now rather than later because its implementation
> > > changes how this patchset is implemented.
> > >
> > > I can implement these changes, if preferred, on top of the current
> > > patchset, but I do not believe we want inconsistencies between kernel
> > > versions that introduce user visible changes for the sole reason that this
> > > current implementation is incomplete and unfair. We can implement and
> > > introduce it once without behavior changing later because the core
> > > heuristic has necessarily changed.
> >
> > David, I _had_ hierarchical accounting implemented in one of the previous
> > versions of this patchset. And there were _reasons_, why we went away from it.
>
> Can you please summarize those issues for my understanding?
Comparing siblings at each level to find OOM victims is not
universally meaningful. As per above, our workload subgroup will
always be bigger than the system subgroup. Does that mean nothing
should *ever* be killed in the system group by default? That's silly.
In such a scheme, configuring priorities on all cgroups down the tree
is not just something you *can* do; it's something you *must* do to
get reasonable behavior.
That's fine for David, who does that already anyway. But we actually
*like* the way the current OOM killer heuristics work: find the
biggest indivisible memory consumer in the system and kill it (we
merely want to extend the concept of "indivisible memory consumer"
from individual MMs to cgroups of interdependent MMs).
Such a heuristic is mutually exclusive with a hierarchy walk. You
can't know, looking at the consumption of system group vs. workload
group, which subtree has the biggest indivisible consumer. The huge
workload group can consist of hundreds of independent jobs.
And a scoring system can't fix this. Once you're in manual mode, you
cannot configure your way back to letting the OOM killer decide.
A hierarchy mode would simply make the existing OOM heuristics
impossible. And we want to use them. So it's not an alternative
proposal, it's in direct opposition of what we're trying to do here.
A way to bias oom_group configured cgroups in their selection could be
useful. But 1) it's not necessary for these patches to be useful (we
certainly don't care). 2) It's absolutely possible to add such knobs
later, as long as you leave the *default* behavior alone. 3) David's
proposal is something entirely different and conflicts with existing
OOM heuristics and our usecase, so it couldn't possibly be the answer.
* Re: [PATCH v13 0/7] cgroup-aware OOM killer
2018-01-13 17:14 ` Johannes Weiner
@ 2018-01-14 23:44 ` David Rientjes
2018-01-15 16:25 ` Johannes Weiner
0 siblings, 1 reply; 51+ messages in thread
From: David Rientjes @ 2018-01-14 23:44 UTC (permalink / raw)
To: Johannes Weiner
Cc: Andrew Morton, Roman Gushchin, linux-mm, Michal Hocko,
Vladimir Davydov, Tetsuo Handa, Tejun Heo, kernel-team, cgroups,
linux-doc, linux-kernel, linux-mm
On Sat, 13 Jan 2018, Johannes Weiner wrote:
> You don't have any control and no accounting of the stuff situated
> inside the root cgroup, so it doesn't make sense to leave anything in
> there while also using sophisticated containerization mechanisms like
> this group oom setting.
>
> In fact, the laptop I'm writing this email on runs an unmodified
> mainstream Linux distribution. The only thing in the root cgroup are
> kernel threads.
>
> The decisions are good enough for the rare cases you forget something
> in there and it explodes.
>
It's quite trivial to allow the root mem cgroup to be compared exactly the
same as another cgroup. Please see
https://marc.info/?l=linux-kernel&m=151579459920305.
> This assumes you even need one. Right now, the OOM killer picks the
> biggest MM, so you can evade selection by forking your MM. This patch
> allows picking the biggest cgroup, so you can evade by forking groups.
>
It's quite trivial to prevent any cgroup from evading the oom killer by
either forking their mm or attaching all their processes to subcontainers.
Please see https://marc.info/?l=linux-kernel&m=151579459920305.
> It's not a new vector, and clearly nobody cares. This has never been
> brought up against the current design that I know of.
>
As cgroup v2 becomes more popular, people will organize their cgroup
hierarchies for all controllers they need to use. We do this today, for
example, by attaching some individual consumers to child mem cgroups
purely for the rich statistics and vmscan stats that mem cgroup provides
without any limitation on those cgroups.
> Note, however, that there actually *is* a way to guard against it: in
> cgroup2 there is a hierarchical limit you can configure for the number
> of cgroups that are allowed to be created in the subtree. See
> 1a926e0bbab8 ("cgroup: implement hierarchy limits").
>
Not allowing the user to create subcontainers to track statistics to paper
over an obvious and acknowledged shortcoming in the design of the cgroup
aware oom killer seems like a pretty nasty shortcoming itself.
> It could be useful, but we have no concensus on the desired
> semantics. And it's not clear why we couldn't add it later as long as
> the default settings of a new knob maintain the default behavior
> (which would have to be preserved anyway, since we rely on it).
>
The active proposal is
https://marc.info/?l=linux-kernel&m=151579459920305, which describes an
extendable interface and one that covers all the shortcomings of this
patchset without polluting the mem cgroup filesystem. The default oom
policy in that proposal would be "none", i.e. we do what we do today,
based on process usage. You can configure that, without the mount option
this patchset introduces for local or hierarchical cgroup targeting.
> > > > I proposed a solution in
> > > > https://marc.info/?l=linux-kernel&m=150956897302725, which was never
> > > > responded to, for all of these issues. The idea is to do hierarchical
> > > > accounting of mem cgroup hierarchies so that the hierarchy is traversed
> > > > comparing total usage at each level to select target cgroups. Admins and
> > > > users can use memory.oom_score_adj to influence that decisionmaking at
> > > > each level.
>
> We did respond repeatedly: this doesn't work for a lot of setups.
>
We need to move this discussion to the active proposal at
https://marc.info/?l=linux-kernel&m=151579459920305, because it does
address your setup, so it's not a good use of anyone's time to keep
discussing memory.oom_score_adj alone.
Thanks.
* Re: [PATCH v13 0/7] cgroup-aware OOM killer
2018-01-14 23:44 ` David Rientjes
@ 2018-01-15 16:25 ` Johannes Weiner
2018-01-16 21:21 ` David Rientjes
0 siblings, 1 reply; 51+ messages in thread
From: Johannes Weiner @ 2018-01-15 16:25 UTC (permalink / raw)
To: David Rientjes
Cc: Andrew Morton, Roman Gushchin, linux-mm, Michal Hocko,
Vladimir Davydov, Tetsuo Handa, Tejun Heo, kernel-team, cgroups,
linux-doc, linux-kernel, linux-mm
On Sun, Jan 14, 2018 at 03:44:09PM -0800, David Rientjes wrote:
> On Sat, 13 Jan 2018, Johannes Weiner wrote:
>
> > You don't have any control and no accounting of the stuff situated
> > inside the root cgroup, so it doesn't make sense to leave anything in
> > there while also using sophisticated containerization mechanisms like
> > this group oom setting.
> >
> > In fact, the laptop I'm writing this email on runs an unmodified
> > mainstream Linux distribution. The only thing in the root cgroup are
> > kernel threads.
> >
> > The decisions are good enough for the rare cases you forget something
> > in there and it explodes.
>
> It's quite trivial to allow the root mem cgroup to be compared exactly the
> same as another cgroup. Please see
> https://marc.info/?l=linux-kernel&m=151579459920305.
This only says "that will be fixed" and doesn't address why I care.
> > This assumes you even need one. Right now, the OOM killer picks the
> > biggest MM, so you can evade selection by forking your MM. This patch
> > allows picking the biggest cgroup, so you can evade by forking groups.
>
> It's quite trivial to prevent any cgroup from evading the oom killer by
> either forking their mm or attaching all their processes to subcontainers.
> Please see https://marc.info/?l=linux-kernel&m=151579459920305.
This doesn't address anything I wrote.
> > It's not a new vector, and clearly nobody cares. This has never been
> > brought up against the current design that I know of.
>
> As cgroup v2 becomes more popular, people will organize their cgroup
> hierarchies for all controllers they need to use. We do this today, for
> example, by attaching some individual consumers to child mem cgroups
> purely for the rich statistics and vmscan stats that mem cgroup provides
> without any limitation on those cgroups.
There is no connecting tissue between what I wrote and what you wrote.
> > Note, however, that there actually *is* a way to guard against it: in
> > cgroup2 there is a hierarchical limit you can configure for the number
> > of cgroups that are allowed to be created in the subtree. See
> > 1a926e0bbab8 ("cgroup: implement hierarchy limits").
>
> Not allowing the user to create subcontainers to track statistics to paper
> over an obvious and acknowledged shortcoming in the design of the cgroup
> aware oom killer seems like a pretty nasty shortcoming itself.
It's not what I proposed. There is a big difference between cgroup
fork bombs and having a couple of groups for statistics.
> > > > > I proposed a solution in
> > > > > https://marc.info/?l=linux-kernel&m=150956897302725, which was never
> > > > > responded to, for all of these issues. The idea is to do hierarchical
> > > > > accounting of mem cgroup hierarchies so that the hierarchy is traversed
> > > > > comparing total usage at each level to select target cgroups. Admins and
> > > > > users can use memory.oom_score_adj to influence that decisionmaking at
> > > > > each level.
> >
> > We did respond repeatedly: this doesn't work for a lot of setups.
>
> We need to move this discussion to the active proposal at
> https://marc.info/?l=linux-kernel&m=151579459920305, because it does
> address your setup, so it's not good use of anyones time to further
> discuss simply memory.oom_score_adj.
No, we don't.
We have a patch set that was developed, iterated and improved over 10+
revisions, based on evaluating and weighing trade-offs. It's reached a
state where the memcg maintainers are happy with it and don't have any
concern about future extendabilty to cover more specialized setups.
You've had nine months to contribute and shape this patch series
productively, and instead resorted to a cavalcade of polemics,
evasion, and repetition of truisms and refuted points. A ten paragraph
proposal of vague ideas at this point is simply not a valid argument.
You can send patches to replace or improve on Roman's code and make
the case for them.
Thanks
* Re: [PATCH v13 0/7] cgroup-aware OOM killer
2018-01-15 16:25 ` Johannes Weiner
@ 2018-01-16 21:21 ` David Rientjes
0 siblings, 0 replies; 51+ messages in thread
From: David Rientjes @ 2018-01-16 21:21 UTC (permalink / raw)
To: Johannes Weiner
Cc: Andrew Morton, Roman Gushchin, linux-mm, Michal Hocko,
Vladimir Davydov, Tetsuo Handa, Tejun Heo, kernel-team, cgroups,
linux-doc, linux-kernel, linux-mm
On Mon, 15 Jan 2018, Johannes Weiner wrote:
> > It's quite trivial to allow the root mem cgroup to be compared exactly the
> > same as another cgroup. Please see
> > https://marc.info/?l=linux-kernel&m=151579459920305.
>
> This only says "that will be fixed" and doesn't address why I care.
>
Because the design of the cgroup aware oom killer requires the entire
system to be fully containerized to be useful and select the
correct/anticipated victim. If anything is left behind in the root mem
cgroup, or on systems that use mem cgroups in ways that you do not, it
returns wildly unpredictable and downright wrong results; it factors
oom_score_adj into the selection criteria only for processes attached to
the root mem cgroup and ignores it otherwise. If we treated root and leaf
cgroups as comparable, this problem wouldn't arise.
> > > This assumes you even need one. Right now, the OOM killer picks the
> > > biggest MM, so you can evade selection by forking your MM. This patch
> > > allows picking the biggest cgroup, so you can evade by forking groups.
> >
> > It's quite trivial to prevent any cgroup from evading the oom killer by
> > either forking their mm or attaching all their processes to subcontainers.
> > Please see https://marc.info/?l=linux-kernel&m=151579459920305.
>
> This doesn't address anything I wrote.
>
It prevents both problems if you are attached to a mem cgroup subtree.
You can't fork the mm and you can't fork groups to evade the selection
logic. If the root mem cgroup and leaf cgroups were truly comparable, it
also prevents both problems regardless of which cgroup the processes
are attached to.
> > > It's not a new vector, and clearly nobody cares. This has never been
> > > brought up against the current design that I know of.
> >
> > As cgroup v2 becomes more popular, people will organize their cgroup
> > hierarchies for all controllers they need to use. We do this today, for
> > example, by attaching some individual consumers to child mem cgroups
> > purely for the rich statistics and vmscan stats that mem cgroup provides
> > without any limitation on those cgroups.
>
> There is no connecting tissue between what I wrote and what you wrote.
>
You're completely ignoring other usecases other than your own highly
specialized usecase. You may attach every user process on the system to
leaf cgroups and you may not have any users who have control over a
subtree. Other people do. And this patchset, as implemented, returns
very inaccurate results or allows users to intentionally or
unintentionally evade the oom killer just because they want to use
subcontainers.
Creating and attaching a subset of processes to either top-level
containers or subcontainers for either limitation by mem cgroup or for
statistics is a valid usecase, and one that is used in practice. Your
suggestion of avoiding that problem to work around the shortcomings of
this patchset by limiting how many subcontainers can be created by the
user is utterly ridiculous.
> We have a patch set that was developed, iterated and improved over 10+
> revisions, based on evaluating and weighing trade-offs. It's reached a
> state where the memcg maintainers are happy with it and don't have any
> concern about future extendabilty to cover more specialized setups.
>
It's also obviously untested in those 10+ revisions since it uses
oom_badness() for the root mem cgroup but not leaf mem cgroups, which is
why it breaks any system where user processes are attached to the root mem
cgroup. See my example where a 1GB worker attached to the root mem cgroup
is preferred over a cgroup with 6GB workers. It may have been tested with
your own highly specialized usecase, but not with anything else, and the
author obviously admits its shortcomings.
> You've had nine months to contribute and shape this patch series
> productively, and instead resorted to a cavalcade of polemics,
> evasion, and repetition of truisms and refuted points. A ten paragraph
> proposal of vague ideas at this point is simply not a valid argument.
>
The patch series has gone through massive changes over the past nine
months and I've brought up three very specific concerns with its current
form that make it non-extensible. I know the patchset has very
significant changes or rewrites in its history, but I'm only concerned
with its present form, because it's not something that can easily be
extended later. We don't need useless files polluting the cgroup v2
filesystem for things that don't matter with other oom policies, and we
don't need the mount option either.
> You can send patches to replace or improve on Roman's code and make
> the case for them.
>
I volunteered in the email thread where I proposed an alternative to
replace it, and I'm actively seeking feedback on that proposal to address
anybody else's concerns early on. Your participation in that would be
very useful.
Thanks.
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH v13 0/7] cgroup-aware OOM killer
2018-01-10 13:11 ` Roman Gushchin
2018-01-10 19:33 ` Andrew Morton
@ 2018-01-10 20:50 ` David Rientjes
1 sibling, 0 replies; 51+ messages in thread
From: David Rientjes @ 2018-01-10 20:50 UTC (permalink / raw)
To: Roman Gushchin
Cc: Andrew Morton, linux-mm, Michal Hocko, Vladimir Davydov,
Johannes Weiner, Tetsuo Handa, Tejun Heo, kernel-team, cgroups,
linux-doc, linux-kernel, linux-mm
On Wed, 10 Jan 2018, Roman Gushchin wrote:
> > 1. The unfair comparison of the root mem cgroup vs leaf mem cgroups
> >
> > The patchset uses two different heuristics to compare root and leaf mem
> > cgroups and scores them based on number of pages. For the root mem
> > cgroup, it totals the /proc/pid/oom_score of all processes attached:
> > that's based on rss, swap, pgtables, and, most importantly, oom_score_adj.
> > For leaf mem cgroups, it's based on that memcg's anonymous, unevictable,
> > unreclaimable slab, kernel stack, and swap counters. These can be wildly
> > different independent of /proc/pid/oom_score_adj, but the most obvious
> > unfairness comes from users who tune oom_score_adj.
> >
> > An example: start a process that faults 1GB of anonymous memory and leave
> > it attached to the root mem cgroup. Start six more processes that each
> > fault 1GB of anonymous memory and attach them to a leaf mem cgroup. Set
> > all processes to have /proc/pid/oom_score_adj of 1000. System oom kill
> > will always kill the 1GB process attached to the root mem cgroup. It's
> > because oom_badness() relies on /proc/pid/oom_score_adj, which is used to
> > evaluate the root mem cgroup, and leaf mem cgroups completely disregard
> > it.
> >
> > In this example, the leaf mem cgroup's score is 1,573,044, the number of
> > pages for the 6GB of faulted memory. The root mem cgroup's score is
> > 12,652,907, eight times larger even though its usage is six times smaller.
> >
> > This is caused by the patchset disregarding oom_score_adj entirely for
> > leaf mem cgroups and relying on it heavily for the root mem cgroup. It's
> > the complete opposite result of what the cgroup aware oom killer
> > advertises.
> >
> > It also works the other way, if a large memory hog is attached to the root
> > mem cgroup but has a negative oom_score_adj it is never killed and random
> > processes are nuked solely because they happened to be attached to a leaf
> > mem cgroup. This behavior wrt oom_score_adj is completely undocumented,
> > so I can't presume that it is either known or tested.
> >
> > Solution: compare the root mem cgroup and leaf mem cgroups equally with
> > the same criteria by doing hierarchical accounting of usage and
> > subtracting from total system usage to find root usage.
>
> I find this problem quite minor, because I haven't seen any practical problems
> caused by accounting of the root cgroup memory.
> If it's a serious problem for you, it can be solved without switching to the
> hierarchical accounting: it's possible to sum up all leaf cgroup stats and
> subtract them from global values. So, it can be a relatively small enhancement
> on top of the current mm tree. This has nothing to do with global victim selection
> approach.
>
It's not surprising that this problem is considered minor, since it was
quite obviously never tested when developing the series.
/proc/pid/oom_score_adj being effective when the process is attached to
the root mem cgroup and not when attached to a leaf mem cgroup is a major
problem, and one that was never documented or explained. I'm somewhat
stunned this is considered minor.
Allow me to put a stake in the ground: admins and users need to be able to
influence oom kill decisions. Killing a process matters. The ability to
protect important processes matters, such as processes that are really
supposed to use 50% of a system's memory. As implemented, there is no
control over this.
If a process's oom_score_adj = 1000 (always kill it), this always happens
when attached to the root mem cgroup. If attached to a leaf mem cgroup,
it never happens unless that individual mem cgroup is the next largest
single consumer of memory. It makes it impossible for a process to be the
preferred oom victim if it is ever attached to any non-root mem cgroup.
Where it is attached can be outside the control of the application itself.
Perhaps worse, an important process with oom_score_adj = -999 (try to kill
everything before it) is always oom killed first if attached to a leaf mem
cgroup with majority usage.
> > 2. Evading the oom killer by attaching processes to child cgroups
> >
> > Any cgroup on the system can attach all their processes to individual
> > child cgroups. This is functionally the same as doing
> >
> > for i in $(cat cgroup.procs); do mkdir $i; echo $i > $i/cgroup.procs; done
> >
> > without the no internal process constraint introduced with cgroup v2. All
> > child cgroups are evaluated based on their own usage: all anon,
> > unevictable, and unreclaimable slab as described previously. It requires
> > an individual cgroup to be the single largest consumer to be targeted by
> > the oom killer.
> >
> > An example: allow users to manage two different mem cgroup hierarchies
> > limited to 100GB each. User A uses 10GB of memory and user B uses 90GB of
> > memory in their respective hierarchies. On a system oom condition, we'd
> > expect at least one process from user B's hierarchy would always be oom
> > killed with the cgroup aware oom killer. In fact, the changelog
> > explicitly states it solves an issue where "1) There is no fairness
> > between containers. A small container with few large processes will be
> > chosen over a large one with huge number of small processes."
> >
> > The opposite becomes true, however, if user B creates child cgroups and
> > distributes its processes such that each child cgroup's usage never
> > exceeds 10GB of memory. This can either be done intentionally to
> > purposefully have a low cgroup memory footprint to evade the oom killer or
> > unintentionally with cgroup v2 to allow those individual processes to be
> > constrained by other cgroups in a single hierarchy model. User A, using
> > 10% of his memory limit, is always oom killed instead of user B, using 90%
> > of his memory limit.
> >
> > Others have commented it's still possible to do this with a per-process
> > model if users split their processes into many subprocesses with small
> > memory footprints.
> >
> > Solution: comparing cgroups must be done hierarchically. Neither user A
> > nor user B can evade the oom killer because targeting is done based on the
> > total hierarchical usage rather than individual cgroups in their
> > hierarchies.
>
> We've discussed this a lot.
> The hierarchical approach has its own issues, which we've discussed during
> previous iterations of the patchset. If you know how to address them
> (I've no idea), please, go on and suggest your version.
>
I asked in https://marc.info/?l=linux-kernel&m=150956897302725 for a clear
example of these issues when there is userspace control over target
selection available. It has received no response, but I can ask again:
please enumerate the concerns you have with hierarchical accounting in the
presence of userspace control over target selection.
Also note that even though userspace control over target selection is
available (through memory.oom_score_adj in my suggestion), it does not
need to be tuned. It can be left to the default value of 0 if there
should be no preference or bias of mem cgroups, just as
/proc/pid/oom_score_adj does not need to be tuned.
> > 3. Userspace has zero control over oom kill selection in leaf mem cgroups
> >
> > Unlike using /proc/pid/oom_score_adj to bias or prefer certain processes
> > from the oom killer, the cgroup aware oom killer does not provide any
> > solution for the user to protect leaf mem cgroups. This is a result of
> > leaf mem cgroups being evaluated based on their anon, unevictable, and
> > unreclaimable slab usage and disregarding any user tunable.
> >
> > Absent the cgroup aware oom killer, users have the ability to strongly
> > prefer a process is oom killed (/proc/pid/oom_score_adj = 1000) or
> > strongly bias against a process (/proc/pid/oom_score_adj = -999).
> >
> > An example: a process knows it's going to use a lot of memory, so it sets
> > /proc/self/oom_score_adj to 1000. It wants to be killed first to avoid
> > disrupting any other process. If it's attached to the root mem cgroup,
> > it will be oom killed. If it's attached to a leaf mem cgroup by an admin
> > outside its control, it will never be oom killed unless that cgroup's
> > usage is the largest single cgroup usage on the system. The reverse also
> > is true for processes that the admin does not want to be oom killed: set
> > /proc/pid/oom_score_adj to -999, but it will *always* be oom killed if its
> > cgroup has the highest usage on the system.
> >
> > The result is that both admins and users have lost all control over which
> > processes are oom killed. They are left with only one alternative: set
> > /proc/pid/oom_score_adj to -1000 to completely disable a process from oom
> > kill. It doesn't address the issue at all for memcg-constrained oom
> > conditions since no processes are killable anymore, and risks panicking
> > the system if it is the only process left on the system. A process
> > preferring that it is first in line for oom kill simply cannot volunteer
> > anymore.
> >
> > Solution: allow users and admins to control oom kill selection by
> > introducing a memory.oom_score_adj to affect the oom score of that mem
> > cgroup, exactly the same as /proc/pid/oom_score_adj affects the oom score
> > of a process.
>
> The per-process oom_score_adj interface is not the nicest one, and I'm not
> sure we want to replicate it on the cgroup level as is. If you have an idea of what
> it should look like, please propose a patch; otherwise it's hard to discuss
> it without the code.
>
Please read the rest of the email; we cannot introduce
memory.oom_score_adj without hierarchical accounting, which requires a
change to the core heuristic used by this patchset. Because hierarchical
accounting is needed for #1 and #2, the values userspace uses for
memory.oom_score_adj would be radically different for hierarchical usage
vs the current heuristic of local usage. We simply cannot push this
inconsistency onto the end user that depends on kernel version. The core
heuristic needs to change for this to be possible while still addressing
#1 and #2. We do not want to be in a situation where the heuristic
changes after users have constructed their mem cgroup hierarchies to use
the current heuristic; doing so would be a maintenance nightmare and,
frankly, irresponsible.
> > It has come up time and time again that this support can be introduced on
> > top of the cgroup oom killer as implemented. It simply cannot. For
> > admins and users to have control over decisionmaking, it needs a
> > oom_score_adj type tunable that cannot change semantics from kernel
> > version to kernel version and without polluting the mem cgroup filesystem.
> > That, in my suggestion, is an adjustment on the amount of total
> > hierarchical usage of each mem cgroup at each level of the hierarchy.
> > That requires that the heuristic uses hierarchical usage rather than
> > considering each cgroup as independent consumers as it does today. We
> > need to implement that heuristic and introduce userspace influence over
> > oom kill selection now rather than later, because doing so changes how
> > this patchset is implemented.
> >
> > I can implement these changes, if preferred, on top of the current
> > patchset, but I do not believe we want inconsistencies between kernel
> > versions that introduce user visible changes for the sole reason that this
> > current implementation is incomplete and unfair. We can implement and
> > introduce it once without behavior changing later because the core
> > heuristic has necessarily changed.
>
> David, I _had_ hierarchical accounting implemented in one of the previous
> versions of this patchset. And there were _reasons_, why we went away from it.
> You can't just ignore them and say that "there is a simple solution, which
> Roman is not responding to". If you know how to address these issues and
> convince everybody that hierarchical approach is a way to go, please,
> go on and send your version of the patchset.
>
I am very much aware that you had hierarchical accounting implemented and
you'll remember that I strongly advocated for it, because it is the
correct way to address cgroup target selection when mem cgroups are
hierarchical by nature. The reasons you're referring to did not account
for the proposed memory.oom_score_adj, but I cannot continue to ask for
those reasons to be enumerated if there is no response. If there are
real-world concerns to what is proposed in my email, I'd love those
concerns to be described as I've described the obvious shortcomings with
this patchset. I'm eager to debate that.
Absent any response that shows #1, #2, and #3 above are incorrect or that
shows what is proposed in my email makes real-world usecases impossible or
somehow worse than what is proposed here, I'll ask that the cgroup aware
oom killer is removed from -mm and, if you'd like me to do it, I'd
repropose a modified version of your earlier hierarchical accounting
patchset with the added memory.oom_score_adj.
^ permalink raw reply [flat|nested] 51+ messages in thread
* [PATCH] mm, oom: simplify alloc_pages_before_oomkill handling
2017-11-30 15:28 [PATCH v13 0/7] cgroup-aware OOM killer Roman Gushchin
` (7 preceding siblings ...)
2017-11-30 20:39 ` [PATCH v13 0/7] cgroup-aware OOM killer Andrew Morton
@ 2017-12-01 9:14 ` Michal Hocko
2017-12-01 13:26 ` Tetsuo Handa
2017-12-01 13:32 ` Roman Gushchin
2018-06-05 11:47 ` [PATCH v13 0/7] cgroup-aware OOM killer Michal Hocko
9 siblings, 2 replies; 51+ messages in thread
From: Michal Hocko @ 2017-12-01 9:14 UTC (permalink / raw)
To: Roman Gushchin
Cc: linux-mm, Vladimir Davydov, Johannes Weiner, Tetsuo Handa,
David Rientjes, Andrew Morton, Tejun Heo, kernel-team, cgroups,
linux-doc, linux-kernel, linux-mm
The recently added alloc_pages_before_oomkill gained a new caller with this
patchset, and I think it has grown enough to deserve a simpler code flow.
What do you think about this on top of the series?
---
From f1f6035ea0df65e7619860b013f2fabdda65233e Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.com>
Date: Fri, 1 Dec 2017 10:05:25 +0100
Subject: [PATCH] mm, oom: simplify alloc_pages_before_oomkill handling
alloc_pages_before_oomkill is the last attempt to allocate memory before
we go and try to kill a process or a memcg. Its success path always has
to properly clean up the oc state (namely the victim reference count).
Let's pull this into alloc_pages_before_oomkill directly rather than risk
that somebody forgets to do it in the future. Also document that we _know_
alloc_pages_before_oomkill violates proper layering and that this is a
pragmatic decision.
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
include/linux/oom.h | 2 +-
mm/oom_kill.c | 21 +++------------------
mm/page_alloc.c | 24 ++++++++++++++++++++++--
3 files changed, 26 insertions(+), 21 deletions(-)
diff --git a/include/linux/oom.h b/include/linux/oom.h
index 10f495c8454d..7052e0a20e13 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -121,7 +121,7 @@ extern void oom_killer_enable(void);
extern struct task_struct *find_lock_task_mm(struct task_struct *p);
-extern struct page *alloc_pages_before_oomkill(const struct oom_control *oc);
+extern bool alloc_pages_before_oomkill(struct oom_control *oc);
extern int oom_evaluate_task(struct task_struct *task, void *arg);
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 4678468bae17..5c2cd299757b 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -1102,8 +1102,7 @@ bool out_of_memory(struct oom_control *oc)
if (!is_memcg_oom(oc) && sysctl_oom_kill_allocating_task &&
current->mm && !oom_unkillable_task(current, NULL, oc->nodemask) &&
current->signal->oom_score_adj != OOM_SCORE_ADJ_MIN) {
- oc->page = alloc_pages_before_oomkill(oc);
- if (oc->page)
+ if (alloc_pages_before_oomkill(oc))
return true;
get_task_struct(current);
oc->chosen_task = current;
@@ -1112,13 +1111,8 @@ bool out_of_memory(struct oom_control *oc)
}
if (mem_cgroup_select_oom_victim(oc)) {
- oc->page = alloc_pages_before_oomkill(oc);
- if (oc->page) {
- if (oc->chosen_memcg &&
- oc->chosen_memcg != INFLIGHT_VICTIM)
- mem_cgroup_put(oc->chosen_memcg);
+ if (alloc_pages_before_oomkill(oc))
return true;
- }
if (oom_kill_memcg_victim(oc)) {
delay = true;
@@ -1127,17 +1121,8 @@ bool out_of_memory(struct oom_control *oc)
}
select_bad_process(oc);
- /*
- * Try really last second allocation attempt after we selected an OOM
- * victim, for somebody might have managed to free memory while we were
- * selecting an OOM victim which can take quite some time.
- */
- oc->page = alloc_pages_before_oomkill(oc);
- if (oc->page) {
- if (oc->chosen_task && oc->chosen_task != INFLIGHT_VICTIM)
- put_task_struct(oc->chosen_task);
+ if (alloc_pages_before_oomkill(oc))
return true;
- }
/* Found nothing?!?! Either we hang forever, or we panic. */
if (!oc->chosen_task && !is_sysrq_oom(oc) && !is_memcg_oom(oc)) {
dump_header(oc, NULL);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 0d518e9b2ee8..9e65fa06ee10 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4146,7 +4146,17 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
return page;
}
-struct page *alloc_pages_before_oomkill(const struct oom_control *oc)
+/*
+ * Try really last second allocation attempt after we selected an OOM victim,
+ * for somebody might have managed to free memory while we were selecting an
+ * OOM victim which can take quite some time.
+ *
+ * This function is a blatant layer violation example because we cross the page
+ * allocator and reclaim (OOM killer) layers. Doing this right from the design
+ * POV would be much more complex though so let's close our eyes and pretend
+ * everything is all right.
+ */
+bool alloc_pages_before_oomkill(struct oom_control *oc)
{
/*
* Go through the zonelist yet one more time, keep very high watermark
@@ -4167,7 +4177,17 @@ struct page *alloc_pages_before_oomkill(const struct oom_control *oc)
reserve_flags = __gfp_pfmemalloc_flags(gfp_mask);
if (reserve_flags)
alloc_flags = reserve_flags;
- return get_page_from_freelist(gfp_mask, oc->order, alloc_flags, oc->ac);
+ oc->page = get_page_from_freelist(gfp_mask, oc->order, alloc_flags, oc->ac);
+ if (!oc->page)
+ return false;
+
+ /*
+ * we are skipping the remaining oom steps so clean up the pending oc
+ * state
+ */
+ if (oc->chosen_task && oc->chosen_task != INFLIGHT_VICTIM)
+ put_task_struct(oc->chosen_task);
+ return true;
}
static inline bool prepare_alloc_pages(gfp_t gfp_mask, unsigned int order,
--
2.15.0
--
Michal Hocko
SUSE Labs
^ permalink raw reply related [flat|nested] 51+ messages in thread
* Re: [PATCH] mm, oom: simplify alloc_pages_before_oomkill handling
2017-12-01 9:14 ` [PATCH] mm, oom: simplify alloc_pages_before_oomkill handling Michal Hocko
@ 2017-12-01 13:26 ` Tetsuo Handa
2017-12-01 13:32 ` Roman Gushchin
1 sibling, 0 replies; 51+ messages in thread
From: Tetsuo Handa @ 2017-12-01 13:26 UTC (permalink / raw)
To: mhocko, guro
Cc: linux-mm, vdavydov.dev, hannes, rientjes, akpm, tj, kernel-team,
cgroups, linux-doc, linux-kernel, linux-mm
Michal Hocko wrote:
> The recently added alloc_pages_before_oomkill gained a new caller with this
> patchset, and I think it has grown enough to deserve a simpler code flow.
> What do you think about this on top of the series?
I'm planning to post the patch below in order to mitigate an OOM lockup
problem caused by scheduling priority. But I'm OK with your patch, because
it will not conflict with the patch below.
----------
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index b2746a7..ef6e951 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3332,13 +3332,14 @@ void warn_alloc(gfp_t gfp_mask, nodemask_t *nodemask, const char *fmt, ...)
*did_some_progress = 0;
/*
- * Acquire the oom lock. If that fails, somebody else is
- * making progress for us.
+ * Acquire the oom lock. If that fails, give enough CPU time to the
+ * owner of the oom lock in order to help reclaiming memory.
*/
- if (!mutex_trylock(&oom_lock)) {
- *did_some_progress = 1;
+ while (!mutex_trylock(&oom_lock)) {
+ page = alloc_pages_before_oomkill(oc);
+ if (page)
+ return page;
schedule_timeout_uninterruptible(1);
- return NULL;
}
/* Coredumps can quickly deplete all memory reserves */
----------
^ permalink raw reply related [flat|nested] 51+ messages in thread
* Re: [PATCH] mm, oom: simplify alloc_pages_before_oomkill handling
2017-12-01 9:14 ` [PATCH] mm, oom: simplify alloc_pages_before_oomkill handling Michal Hocko
2017-12-01 13:26 ` Tetsuo Handa
@ 2017-12-01 13:32 ` Roman Gushchin
2017-12-01 13:54 ` Michal Hocko
1 sibling, 1 reply; 51+ messages in thread
From: Roman Gushchin @ 2017-12-01 13:32 UTC (permalink / raw)
To: Michal Hocko
Cc: linux-mm, Vladimir Davydov, Johannes Weiner, Tetsuo Handa,
David Rientjes, Andrew Morton, Tejun Heo, kernel-team, cgroups,
linux-doc, linux-kernel, linux-mm
Hi, Michal!
I totally agree that the out_of_memory() function deserves some refactoring.
But I think there is an issue with your patch (see below):
On Fri, Dec 01, 2017 at 10:14:25AM +0100, Michal Hocko wrote:
> The recently added alloc_pages_before_oomkill gained a new caller with this
> patchset, and I think it has grown enough to deserve a simpler code flow.
> What do you think about this on top of the series?
>
> ---
> From f1f6035ea0df65e7619860b013f2fabdda65233e Mon Sep 17 00:00:00 2001
> From: Michal Hocko <mhocko@suse.com>
> Date: Fri, 1 Dec 2017 10:05:25 +0100
> Subject: [PATCH] mm, oom: simplify alloc_pages_before_oomkill handling
>
> alloc_pages_before_oomkill is the last attempt to allocate memory before
> we go and try to kill a process or a memcg. Its success path always has
> to properly clean up the oc state (namely the victim reference count).
> Let's pull this into alloc_pages_before_oomkill directly rather than risk
> that somebody forgets to do it in the future. Also document that we _know_
> alloc_pages_before_oomkill violates proper layering and that this is a
> pragmatic decision.
>
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> ---
> include/linux/oom.h | 2 +-
> mm/oom_kill.c | 21 +++------------------
> mm/page_alloc.c | 24 ++++++++++++++++++++++--
> 3 files changed, 26 insertions(+), 21 deletions(-)
>
> diff --git a/include/linux/oom.h b/include/linux/oom.h
> index 10f495c8454d..7052e0a20e13 100644
> --- a/include/linux/oom.h
> +++ b/include/linux/oom.h
> @@ -121,7 +121,7 @@ extern void oom_killer_enable(void);
>
> extern struct task_struct *find_lock_task_mm(struct task_struct *p);
>
> -extern struct page *alloc_pages_before_oomkill(const struct oom_control *oc);
> +extern bool alloc_pages_before_oomkill(struct oom_control *oc);
>
> extern int oom_evaluate_task(struct task_struct *task, void *arg);
>
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 4678468bae17..5c2cd299757b 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -1102,8 +1102,7 @@ bool out_of_memory(struct oom_control *oc)
> if (!is_memcg_oom(oc) && sysctl_oom_kill_allocating_task &&
> current->mm && !oom_unkillable_task(current, NULL, oc->nodemask) &&
> current->signal->oom_score_adj != OOM_SCORE_ADJ_MIN) {
> - oc->page = alloc_pages_before_oomkill(oc);
> - if (oc->page)
> + if (alloc_pages_before_oomkill(oc))
> return true;
> get_task_struct(current);
> oc->chosen_task = current;
> @@ -1112,13 +1111,8 @@ bool out_of_memory(struct oom_control *oc)
> }
>
> if (mem_cgroup_select_oom_victim(oc)) {
> - oc->page = alloc_pages_before_oomkill(oc);
> - if (oc->page) {
> - if (oc->chosen_memcg &&
> - oc->chosen_memcg != INFLIGHT_VICTIM)
> - mem_cgroup_put(oc->chosen_memcg);
You're removing the chosen_memcg release here, but I don't see where you
do it instead. And I'm not sure that putting mem_cgroup_put() into
alloc_pages_before_oomkill() is a way towards simpler code.
I was thinking about a bit larger refactoring: splitting out_of_memory()
into the following parts (defined as separate functions): victim selection
(per-process, memcg-aware, or just the allocating task), last allocation attempt,
and OOM action (kill process, kill memcg, panic). Hopefully it can simplify things,
but I don't have code yet.
Thanks!
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH] mm, oom: simplify alloc_pages_before_oomkill handling
2017-12-01 13:32 ` Roman Gushchin
@ 2017-12-01 13:54 ` Michal Hocko
0 siblings, 0 replies; 51+ messages in thread
From: Michal Hocko @ 2017-12-01 13:54 UTC (permalink / raw)
To: Roman Gushchin
Cc: linux-mm, Vladimir Davydov, Johannes Weiner, Tetsuo Handa,
David Rientjes, Andrew Morton, Tejun Heo, kernel-team, cgroups,
linux-doc, linux-kernel, linux-mm
On Fri 01-12-17 13:32:15, Roman Gushchin wrote:
> Hi, Michal!
>
> I totally agree that out_of_memory() function deserves some refactoring.
>
> But I think there is an issue with your patch (see below):
>
> On Fri, Dec 01, 2017 at 10:14:25AM +0100, Michal Hocko wrote:
> > The recently added alloc_pages_before_oomkill gained a new caller with this
> > patchset, and I think it has grown enough to deserve a simpler code flow.
> > What do you think about this on top of the series?
> >
> > ---
[...]
> > @@ -1112,13 +1111,8 @@ bool out_of_memory(struct oom_control *oc)
> > }
> >
> > if (mem_cgroup_select_oom_victim(oc)) {
> > - oc->page = alloc_pages_before_oomkill(oc);
> > - if (oc->page) {
> > - if (oc->chosen_memcg &&
> > - oc->chosen_memcg != INFLIGHT_VICTIM)
> > - mem_cgroup_put(oc->chosen_memcg);
>
> You're removing chosen_memcg releasing here, but I don't see where you
> do this instead. And I'm not sure that putting mem_cgroup_put() into
> alloc_pages_before_oomkill() is a way towards simpler code.
Dohh, I thought I did. But obviously it is not there.
> I was thinking about a bit larger refactoring: splitting out_of_memory()
> into the following parts (defined as separate functions): victim selection
> (per-process, memcg-aware, or just the allocating task), last allocation attempt,
> and OOM action (kill process, kill memcg, panic). Hopefully it can simplify things,
> but I don't have code yet.
OK, I will not push if you have further plans, of course. This just caught
my eye...
--
Michal Hocko
SUSE Labs
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH v13 0/7] cgroup-aware OOM killer
2017-11-30 15:28 [PATCH v13 0/7] cgroup-aware OOM killer Roman Gushchin
` (8 preceding siblings ...)
2017-12-01 9:14 ` [PATCH] mm, oom: simplify alloc_pages_before_oomkill handling Michal Hocko
@ 2018-06-05 11:47 ` Michal Hocko
2018-06-05 12:13 ` Michal Hocko
2018-07-13 21:59 ` David Rientjes
9 siblings, 2 replies; 51+ messages in thread
From: Michal Hocko @ 2018-06-05 11:47 UTC (permalink / raw)
To: Roman Gushchin
Cc: linux-mm, Vladimir Davydov, Johannes Weiner, Tetsuo Handa,
David Rientjes, Andrew Morton, Tejun Heo, kernel-team, cgroups,
linux-doc, linux-kernel, linux-mm
It seems that this is still in limbo mostly because of David's concerns.
So let me reiterate them and provide my POV once more (and the last
time) just to help Andrew make a decision:
1) comparing the root with leaf memcgs during OOM victim selection is not
fair because we are comparing tasks with memcgs.
This is true, but I do not think this matters much for workloads which
are going to use the feature. Why? Because the main consumers of the new
feature seem to be containers which really need some fairness when
comparing _workloads_ rather than processes. Those are unlikely to
contain any significant memory consumers in the root memcg. That would
be mostly common infrastructure.
Is this fixable? Yes, we would need to account memory in the root memcg.
Why are we not doing that now? Because it has some non-negligible
performance overhead. Are there other ways? Yes, we can approximate the
root memcg's memory consumption, but I would rather wait for somebody to
see this as a real problem than add hacks now without a strong reason.
2) Evading the oom killer by attaching processes to child cgroups which
basically means that a task can split up the workload into smaller
memcgs to hide their real memory consumption.
Again true but not really anything new. Processes can already fork and
split up the memory consumption. Moreover it doesn't even require any
special privileges to do so unlike creating a sub memcg. Is this
fixable? Yes, untrusted workloads can set up group oom evaluation at the
delegation layer so all subgroups would be considered together.
3) Userspace has zero control over oom kill selection in leaf mem
cgroups.
Again true but this is something that needs a good evaluation to not end
up in the fiasco we have seen with oom_score*. Current users demanding
this feature can live without any prioritization so blocking the whole
feature seems unreasonable.
4) Future extensibility to be backward compatible.
David is wrong here IMHO. Any prioritization or oom selection policy
controls added in future are orthogonal to the oom_group concept added
by this patchset. Allowing memcg to be an oom entity is something that
we really want longterm. Global CGRP_GROUP_OOM is the most restrictive
semantic and softening it will be possible by adding a new knob to
tell whether a memcg/hierarchy is a workload or a set of tasks.
--
Michal Hocko
SUSE Labs
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH v13 0/7] cgroup-aware OOM killer
2018-06-05 11:47 ` [PATCH v13 0/7] cgroup-aware OOM killer Michal Hocko
@ 2018-06-05 12:13 ` Michal Hocko
2018-07-13 21:59 ` David Rientjes
1 sibling, 0 replies; 51+ messages in thread
From: Michal Hocko @ 2018-06-05 12:13 UTC (permalink / raw)
To: Roman Gushchin
Cc: linux-mm, Vladimir Davydov, Johannes Weiner, Tetsuo Handa,
David Rientjes, Andrew Morton, Tejun Heo, kernel-team, cgroups,
linux-doc, linux-kernel, linux-mm
On Tue 05-06-18 13:47:29, Michal Hocko wrote:
> It seems that this is still in limbo mostly because of David's concerns.
> So let me reiterate them and provide my POV once more (and the last
> time) just to help Andrew make a decision:
Sorry, I forgot to add reference to the email with the full David's
reasoning. Here it is http://lkml.kernel.org/r/alpine.DEB.2.10.1801091556490.173445@chino.kir.corp.google.com
> 1) comparing the root with leaf memcgs during OOM victim selection is not
> fair because we are comparing tasks with memcgs.
>
> This is true, but I do not think this matters much for workloads which
> are going to use the feature. Why? Because the main consumers of the new
> feature seem to be containers which really need some fairness when
> comparing _workloads_ rather than processes. Those are unlikely to
> contain any significant memory consumers in the root memcg. That would
> be mostly common infrastructure.
>
> Is this fixable? Yes, we would need to account in the root memcgs.
> Why are we not doing that now? Because it has some negligible
> performance overhead. Are there other ways? Yes we can approximate root
> memcg memory consumption but I would rather wait for somebody seeing
> that as a real problem rather than add hacks now without a strong
> reason.
>
>
> 2) Evading the oom killer by attaching processes to child cgroups which
> basically means that a task can split up the workload into smaller
> memcgs to hide their real memory consumption.
>
> Again true but not really anything new. Processes can already fork and
> split up the memory consumption. Moreover it doesn't even require any
> special privileges to do so unlike creating a sub memcg. Is this
> fixable? Yes, untrusted workloads can setup group oom evaluation at the
> delegation layer so all subgroups would be considered together.
>
> 3) Userspace has zero control over oom kill selection in leaf mem
> cgroups.
>
> Again true, but this is something that needs a good evaluation so we do not end
> up in the fiasco we have seen with oom_score*. Current users demanding
> this feature can live without any prioritization, so blocking the whole
> feature seems unreasonable.
>
> 4) Future extensibility to be backward compatible.
>
> David is wrong here IMHO. Any prioritization or oom selection policy
> controls added in future are orthogonal to the oom_group concept added
> by this patchset. Allowing memcg to be an oom entity is something that
> we really want longterm. Global CGRP_GROUP_OOM is the most restrictive
> semantic, and softening it will be possible by adding a new knob to
> tell whether a memcg/hierarchy is a workload or a set of tasks.
> --
> Michal Hocko
> SUSE Labs
--
Michal Hocko
SUSE Labs
* Re: [PATCH v13 0/7] cgroup-aware OOM killer
2018-06-05 11:47 ` [PATCH v13 0/7] cgroup-aware OOM killer Michal Hocko
2018-06-05 12:13 ` Michal Hocko
@ 2018-07-13 21:59 ` David Rientjes
2018-07-14 1:55 ` Tetsuo Handa
2018-07-16 9:36 ` Michal Hocko
1 sibling, 2 replies; 51+ messages in thread
From: David Rientjes @ 2018-07-13 21:59 UTC (permalink / raw)
To: Michal Hocko
Cc: Roman Gushchin, linux-mm, Vladimir Davydov, Johannes Weiner,
Tetsuo Handa, Andrew Morton, Tejun Heo, kernel-team, cgroups,
linux-doc, linux-kernel, linux-mm
On Tue, 5 Jun 2018, Michal Hocko wrote:
> 1) comparison of root with tail memcgs during the OOM killer is not fair
> because we are comparing tasks with memcgs.
>
> This is true, but I do not think this matters much for workloads which
> are going to use the feature. Why? Because the main consumers of the new
> feature seem to be containers which really need some fairness when
> comparing _workloads_ rather than processes. Those are unlikely to
> contain any significant memory consumers in the root memcg. That would
> be mostly common infrastructure.
>
There are users (us) who want to use the feature and not all processes are
attached to leaf mem cgroups. The functionality can be provided in a
generally useful way that doesn't require any specific hierarchy, and I
implemented this in my patch series at
https://marc.info/?l=linux-mm&m=152175563004458&w=2. That proposal fixes
*all* of my concerns with the cgroup-aware oom killer as it sits in -mm
and, in its entirety, only extends it so it is generally useful; it does not
remove any functionality. I'm not sure why we are discussing ways of
moving forward when that patchset has been waiting for review for almost
four months and, to date, I haven't seen an objection to it.
I don't know why we cannot agree on making solutions generally useful, or
why that patchset has not been merged into -mm so that the whole feature
can be merged. It's baffling. This is the first time I've encountered a
perceived stalemate when there is a patchset sitting, unreviewed, that
fixes all of the concerns raised about the implementation sitting
in -mm.
This isn't a criticism just of comparing processes attached to root
differently than leaf mem cgroups; it's also about how oom_score_adj
influences that decision. It's trivial for a very small consumer (not a
"significant" memory consumer, as you put it) to require an oom kill from
root instead of a leaf mem cgroup. I show in
https://marc.info/?l=linux-mm&m=152175564104468&w=2 that changing the
oom_score_adj of my bash shell attached to the root mem cgroup is
considered equal to a 95GB leaf mem cgroup with the current
implementation.
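The arithmetic behind this example can be sketched with a simplified model of the kernel's oom_badness() heuristic (the real code also counts swap and page-table pages; all sizes below are hypothetical):

```python
# A sketch of why oom_score_adj distorts root-vs-memcg comparison
# (simplified model of the kernel's oom_badness(); real code also adds
# swap and page-table pages). All sizes are hypothetical, in 4 KiB pages.

def badness(rss_pages, oom_score_adj, totalpages):
    """Per-task score: resident pages plus the adj bias, which is
    scaled by *total system memory*, not by the task's own usage."""
    return max(rss_pages + oom_score_adj * totalpages // 1000, 0)

totalpages = 100 * 262144                 # 100 GiB of 4 KiB pages
shell_score = badness(rss_pages=1024,     # a ~4 MiB bash shell...
                      oom_score_adj=1000, # ...with the maximum adj
                      totalpages=totalpages)
memcg_score = 95 * 262144                 # a 95 GiB leaf memcg, scored
                                          # by its total usage
print(shell_score >= memcg_score)         # True: the tiny shell "wins"
```

Because the adj bias is scaled by total system memory, a maximally-adjusted ~4 MiB task can outscore a 95 GiB cgroup, which is the distortion being described here.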
> Is this fixable? Yes, we would need to account in the root memcgs.
> Why are we not doing that now? Because it has some negligible
> performance overhead. Are there other ways? Yes we can approximate root
> memcg memory consumption but I would rather wait for somebody seeing
> that as a real problem rather than add hacks now without a strong
> reason.
>
I fixed this in https://marc.info/?t=152175564500007&r=1&w=2, and from
what I remember Roman actually liked it.
> 2) Evading the oom killer by attaching processes to child cgroups which
> basically means that a task can split up the workload into smaller
> memcgs to hide their real memory consumption.
>
> Again true but not really anything new. Processes can already fork and
> split up the memory consumption. Moreover it doesn't even require any
> special privileges to do so unlike creating a sub memcg. Is this
> fixable? Yes, untrusted workloads can setup group oom evaluation at the
> delegation layer so all subgroups would be considered together.
>
Processes being able to fork to split up memory consumption is also fixed
by https://marc.info/?l=linux-mm&m=152175564104467, just as creating
subcontainers to intentionally or unintentionally subvert the oom
policy is fixed. It solves both problems.
> 3) Userspace has zero control over oom kill selection in leaf mem
> cgroups.
>
> Again true, but this is something that needs a good evaluation so we do not end
> up in the fiasco we have seen with oom_score*. Current users demanding
> this feature can live without any prioritization, so blocking the whole
> feature seems unreasonable.
>
One objection here is how the oom_score_adj of a process means something
or doesn't mean something depending on what cgroup it is attached to. The
cgroup-aware oom killer is cgroup aware. oom_score_adj should play no
part. I fixed this with https://marc.info/?t=152175564500007&r=1&w=2.
The other objection is that users do have cgroups that shouldn't be oom
killed because they are important, either because they are required to
provide a service for a smaller cgroup or because of business goals. We
have cgroups that use more than half of system memory and they are allowed
to do so because they are important. We would love to be able to bias
against that cgroup to prefer others, or prefer cgroups for oom kill
because they are less important. This was done for processes with
oom_score_adj; we need it for a cgroup-aware oom killer for the same
reason.
But note that even in https://marc.info/?l=linux-mm&m=152175563004458&w=2
I said priority or adjustment can be added on top of the feature
after it's merged. This by itself does not preclude anything from being
merged.
> 4) Future extensibility to be backward compatible.
>
> David is wrong here IMHO. Any prioritization or oom selection policy
> controls added in future are orthogonal to the oom_group concept added
> by this patchset. Allowing memcg to be an oom entity is something that
> we really want longterm. Global CGRP_GROUP_OOM is the most restrictive
> semantic, and softening it will be possible by adding a new knob to
> tell whether a memcg/hierarchy is a workload or a set of tasks.
I've always said that the mechanism and policy in this patchset should be
separated. I do that exact thing in
https://marc.info/?l=linux-mm&m=152175564304469&w=2. I suggest that
different subtrees will want (or the admin will require) different
behaviors with regard to the mechanism.
I've stated the problems (and there are others wrt mempolicy selection)
that the current implementation has and given a full solution at
https://marc.info/?l=linux-mm&m=152175563004458&w=2 that has not been
reviewed. I would love feedback from anybody on this thread on that. I'm
not trying to preclude the cgroup-aware oom killer from being merged, I'm
the only person actively trying to get it merged.
Thanks.
* Re: [PATCH v13 0/7] cgroup-aware OOM killer
2018-07-13 21:59 ` David Rientjes
@ 2018-07-14 1:55 ` Tetsuo Handa
2018-07-16 21:13 ` Tetsuo Handa
2018-07-16 9:36 ` Michal Hocko
1 sibling, 1 reply; 51+ messages in thread
From: Tetsuo Handa @ 2018-07-14 1:55 UTC (permalink / raw)
To: David Rientjes, Michal Hocko
Cc: Roman Gushchin, linux-mm, Vladimir Davydov, Johannes Weiner,
Andrew Morton, Tejun Heo, kernel-team, cgroups, linux-doc,
linux-kernel, linux-mm
On 2018/07/14 6:59, David Rientjes wrote:
> I'm not trying to preclude the cgroup-aware oom killer from being merged,
> I'm the only person actively trying to get it merged.
Before merging the cgroup-aware oom killer, can we merge the OOM lockup fixes
and my cleanup? The gap between linux.git and linux-next.git keeps us from
using an agreed baseline.
* Re: [PATCH v13 0/7] cgroup-aware OOM killer
2018-07-14 1:55 ` Tetsuo Handa
@ 2018-07-16 21:13 ` Tetsuo Handa
2018-07-16 22:09 ` Roman Gushchin
0 siblings, 1 reply; 51+ messages in thread
From: Tetsuo Handa @ 2018-07-16 21:13 UTC (permalink / raw)
To: Andrew Morton
Cc: David Rientjes, Michal Hocko, Roman Gushchin, linux-mm,
Vladimir Davydov, Johannes Weiner, Tejun Heo, kernel-team,
cgroups, linux-doc, linux-kernel, linux-mm
No response from Roman and David...
Andrew, will you drop Roman's cgroup-aware OOM killer and David's patches for now?
Roman's series has a bug which I mentioned and which can be avoided by my patch.
David's patch is using MMF_UNSTABLE incorrectly, such that it might start selecting
the next OOM victim without trying to reclaim any memory.
Since they are not responding to my mail, I suggest dropping these from linux-next for now.
https://www.spinics.net/lists/linux-mm/msg153212.html
https://lore.kernel.org/lkml/201807130620.w6D6KiAJ093010@www262.sakura.ne.jp/T/#u
On 2018/07/14 10:55, Tetsuo Handa wrote:
> On 2018/07/14 6:59, David Rientjes wrote:
>> I'm not trying to preclude the cgroup-aware oom killer from being merged,
>> I'm the only person actively trying to get it merged.
>
> Before merging the cgroup-aware oom killer, can we merge the OOM lockup fixes
> and my cleanup? The gap between linux.git and linux-next.git keeps us from
> using an agreed baseline.
>
* Re: [PATCH v13 0/7] cgroup-aware OOM killer
2018-07-16 21:13 ` Tetsuo Handa
@ 2018-07-16 22:09 ` Roman Gushchin
2018-07-17 0:55 ` Tetsuo Handa
0 siblings, 1 reply; 51+ messages in thread
From: Roman Gushchin @ 2018-07-16 22:09 UTC (permalink / raw)
To: Tetsuo Handa
Cc: Andrew Morton, David Rientjes, Michal Hocko, linux-mm,
Vladimir Davydov, Johannes Weiner, Tejun Heo, kernel-team,
cgroups, linux-doc, linux-kernel, linux-mm
On Tue, Jul 17, 2018 at 06:13:47AM +0900, Tetsuo Handa wrote:
> No response from Roman and David...
>
> Andrew, will you drop Roman's cgroup-aware OOM killer and David's patches for now?
> Roman's series has a bug which I mentioned and which can be avoided by my patch.
> David's patch is using MMF_UNSTABLE incorrectly, such that it might start selecting
> the next OOM victim without trying to reclaim any memory.
>
> Since they are not responding to my mail, I suggest dropping these from linux-next for now.
I was in cc, and didn't think that you were expecting something from me.
I don't get why it's necessary to drop the cgroup oom killer to merge your fix.
I'm happy to help with rebasing and everything else.
Thanks,
Roman
* Re: [PATCH v13 0/7] cgroup-aware OOM killer
2018-07-16 22:09 ` Roman Gushchin
@ 2018-07-17 0:55 ` Tetsuo Handa
2018-07-31 14:14 ` Tetsuo Handa
0 siblings, 1 reply; 51+ messages in thread
From: Tetsuo Handa @ 2018-07-17 0:55 UTC (permalink / raw)
To: Roman Gushchin
Cc: Andrew Morton, David Rientjes, Michal Hocko, linux-mm,
Vladimir Davydov, Johannes Weiner, Tejun Heo, kernel-team,
cgroups, linux-doc, linux-kernel, linux-mm
Roman Gushchin wrote:
> On Tue, Jul 17, 2018 at 06:13:47AM +0900, Tetsuo Handa wrote:
> > No response from Roman and David...
> >
> > Andrew, will you drop Roman's cgroup-aware OOM killer and David's patches for now?
> > Roman's series has a bug which I mentioned and which can be avoided by my patch.
> > David's patch is using MMF_UNSTABLE incorrectly, such that it might start selecting
> > the next OOM victim without trying to reclaim any memory.
> >
> > Since they are not responding to my mail, I suggest dropping these from linux-next for now.
>
> I was in cc, and didn't think that you were expecting something from me.
Oops. I was waiting for your response. ;-)
But Roman, my patch conflicts with your "mm, oom: cgroup-aware OOM killer" patch
in linux-next. And it seems to me that your patch contains a bug which leads to
premature memory allocation failure, as explained below.
Can we apply my patch prior to your "mm, oom: cgroup-aware OOM killer" patch
(which eliminates "delay" and "out:" from your patch) so that people can easily
backport my patch? Or, do you want to apply a fix (which eliminates "delay" and
"out:" from linux-next) prior to my patch?
>
> I don't get why it's necessary to drop the cgroup oom killer to merge your fix.
> I'm happy to help with rebasing and everything else.
Yes, I wish you would rebase your series on top of the OOM lockup (CVE-2016-10723) mitigation
patch ( https://marc.info/?l=linux-mm&m=153112243424285&w=4 ). It is a trivial change
and easy to cleanly backport (if applied before your series).
Also, I expect you to check whether my cleanup patch which removes the "abort" path
( [PATCH 1/2] at https://marc.info/?l=linux-mm&m=153119509215026&w=4 ) helps
simplify your series. I don't know the detailed behavior of your series, but I
assume that your series does not kill threads whose MMF_OOM_SKIP the current
thread should not wait for.
* Re: [PATCH v13 0/7] cgroup-aware OOM killer
2018-07-17 0:55 ` Tetsuo Handa
@ 2018-07-31 14:14 ` Tetsuo Handa
2018-08-01 16:37 ` Roman Gushchin
0 siblings, 1 reply; 51+ messages in thread
From: Tetsuo Handa @ 2018-07-31 14:14 UTC (permalink / raw)
To: Roman Gushchin, Andrew Morton
Cc: David Rientjes, Michal Hocko, linux-mm, Vladimir Davydov,
Johannes Weiner, Tejun Heo, kernel-team, cgroups, linux-doc,
linux-kernel, linux-mm
On 2018/07/17 9:55, Tetsuo Handa wrote:
>> I don't get why it's necessary to drop the cgroup oom killer to merge your fix.
>> I'm happy to help with rebasing and everything else.
>
> Yes, I wish you would rebase your series on top of the OOM lockup (CVE-2016-10723) mitigation
> patch ( https://marc.info/?l=linux-mm&m=153112243424285&w=4 ). It is a trivial change
> and easy to cleanly backport (if applied before your series).
>
> Also, I expect you to check whether my cleanup patch which removes the "abort" path
> ( [PATCH 1/2] at https://marc.info/?l=linux-mm&m=153119509215026&w=4 ) helps
> simplify your series. I don't know the detailed behavior of your series, but I
> assume that your series does not kill threads whose MMF_OOM_SKIP the current
> thread should not wait for.
syzbot is hitting WARN(1) due to mem_cgroup_out_of_memory() == false.
https://syzkaller.appspot.com/bug?id=ea8c7912757d253537375e981b61749b2da69258
I can't tell what change is triggering this race. Maybe the removal of oom_lock from
the oom reaper made it more likely to hit. But anyway I suspect that
static bool oom_kill_memcg_victim(struct oom_control *oc)
{
if (oc->chosen_memcg == NULL || oc->chosen_memcg == INFLIGHT_VICTIM)
return oc->chosen_memcg; // <= This line is still broken
because
/* We have one or more terminating processes at this point. */
oc->chosen_task = INFLIGHT_VICTIM;
is not called.
Also, that patch is causing confusion by reviving schedule_timeout_killable(1)
with oom_lock held.
Can we temporarily drop the cgroup-aware OOM killer from linux-next.git and
apply my cleanup patch? Since the merge window is approaching, I really want to
see what the next -rc1 would look like...
* Re: [PATCH v13 0/7] cgroup-aware OOM killer
2018-07-31 14:14 ` Tetsuo Handa
@ 2018-08-01 16:37 ` Roman Gushchin
2018-08-01 22:01 ` Tetsuo Handa
0 siblings, 1 reply; 51+ messages in thread
From: Roman Gushchin @ 2018-08-01 16:37 UTC (permalink / raw)
To: Tetsuo Handa
Cc: Andrew Morton, David Rientjes, Michal Hocko, linux-mm,
Vladimir Davydov, Johannes Weiner, Tejun Heo, kernel-team,
cgroups, linux-doc, linux-kernel, linux-mm
On Tue, Jul 31, 2018 at 11:14:01PM +0900, Tetsuo Handa wrote:
> On 2018/07/17 9:55, Tetsuo Handa wrote:
> >> I don't get why it's necessary to drop the cgroup oom killer to merge your fix.
> >> I'm happy to help with rebasing and everything else.
> >
> > Yes, I wish you would rebase your series on top of the OOM lockup (CVE-2016-10723) mitigation
> > patch ( https://marc.info/?l=linux-mm&m=153112243424285&w=4 ). It is a trivial change
> > and easy to cleanly backport (if applied before your series).
> >
> > Also, I expect you to check whether my cleanup patch which removes the "abort" path
> > ( [PATCH 1/2] at https://marc.info/?l=linux-mm&m=153119509215026&w=4 ) helps
> > simplify your series. I don't know the detailed behavior of your series, but I
> > assume that your series does not kill threads whose MMF_OOM_SKIP the current
> > thread should not wait for.
>
> syzbot is hitting WARN(1) due to mem_cgroup_out_of_memory() == false.
> https://syzkaller.appspot.com/bug?id=ea8c7912757d253537375e981b61749b2da69258
>
> I can't tell what change is triggering this race. Maybe the removal of oom_lock from
> the oom reaper made it more likely to hit. But anyway I suspect that
>
> static bool oom_kill_memcg_victim(struct oom_control *oc)
> {
> if (oc->chosen_memcg == NULL || oc->chosen_memcg == INFLIGHT_VICTIM)
> return oc->chosen_memcg; // <= This line is still broken
>
> because
>
> /* We have one or more terminating processes at this point. */
> oc->chosen_task = INFLIGHT_VICTIM;
>
> is not called.
>
> Also, that patch is causing confusion by reviving schedule_timeout_killable(1)
> with oom_lock held.
>
> Can we temporarily drop the cgroup-aware OOM killer from linux-next.git and
> apply my cleanup patch? Since the merge window is approaching, I really want to
> see what the next -rc1 would look like...
Hi Tetsuo!
Has this cleanup patch been acked by somebody?
Which problem does it solve?
Dropping patches in order to make a cleanup (if it's a cleanup) sounds a bit strange.
Anyway, there is a good chance that the current cgroup-aware OOM killer
implementation will be replaced by a lightweight version (memory.oom.group).
Please take a look at it; your cleanup probably will not conflict with it
at all.
Thanks!
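For context, a usage sketch of the memory.oom.group knob Roman refers to here (cgroup v2; the mount point, group name, and limit below are examples, not part of the thread):

```shell
# Sketch of the memory.oom.group knob on cgroup v2 (paths are examples).
# Writing 1 marks the cgroup as an indivisible workload: if the OOM
# killer picks any task in it, all tasks in the cgroup are killed.
mkdir /sys/fs/cgroup/workload
echo 1 > /sys/fs/cgroup/workload/memory.oom.group
echo $$ > /sys/fs/cgroup/workload/cgroup.procs
# Give the group a memory limit so a memcg OOM can occur inside it:
echo 512M > /sys/fs/cgroup/workload/memory.max
```

Running this requires root and a mounted cgroup v2 hierarchy; it is illustrative only.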
* Re: [PATCH v13 0/7] cgroup-aware OOM killer
2018-08-01 16:37 ` Roman Gushchin
@ 2018-08-01 22:01 ` Tetsuo Handa
2018-08-01 22:55 ` Roman Gushchin
0 siblings, 1 reply; 51+ messages in thread
From: Tetsuo Handa @ 2018-08-01 22:01 UTC (permalink / raw)
To: Roman Gushchin, Andrew Morton
Cc: David Rientjes, Michal Hocko, Vladimir Davydov, Johannes Weiner,
Tejun Heo, kernel-team, cgroups, linux-doc, linux-kernel,
linux-mm
On 2018/08/02 1:37, Roman Gushchin wrote:
> On Tue, Jul 31, 2018 at 11:14:01PM +0900, Tetsuo Handa wrote:
>> Can we temporarily drop the cgroup-aware OOM killer from linux-next.git and
>> apply my cleanup patch? Since the merge window is approaching, I really want to
>> see what the next -rc1 would look like...
>
> Hi Tetsuo!
>
> Has this cleanup patch been acked by somebody?
Not yet. But since Michal considers this cleanup "a nice shortcut"
( https://marc.info/?i=20180607112836.GN32433@dhcp22.suse.cz ), I assume that
I will get an ACK for it.
> Which problem does it solve?
It simplifies the tricky out_of_memory() return value decision, and
it also fixes a bug in your series which syzbot is pointing out.
> Dropping patches for making a cleanup (if it's a cleanup) sounds a bit strange.
What I need is a git tree which I can use as a baseline for making this cleanup.
linux.git is not suitable because it does not include Michal's fix, and
linux-next.git is not suitable because Michal's fix is overwritten by your series.
I want a git tree which includes Michal's fix and does not include your series.
>
> Anyway, there is a good chance that the current cgroup-aware OOM killer
> implementation will be replaced by a lightweight version (memory.oom.group).
> Please take a look at it; your cleanup probably will not conflict with it
> at all.
Then, please drop the current cgroup-aware OOM killer implementation from linux-next.git.
I want to see what the next -rc1 would look like (for testing purposes) and want to use
linux-next.git as a baseline (for making this cleanup).
>
> Thanks!
>
>
* Re: [PATCH v13 0/7] cgroup-aware OOM killer
2018-08-01 22:01 ` Tetsuo Handa
@ 2018-08-01 22:55 ` Roman Gushchin
0 siblings, 0 replies; 51+ messages in thread
From: Roman Gushchin @ 2018-08-01 22:55 UTC (permalink / raw)
To: Tetsuo Handa
Cc: Andrew Morton, David Rientjes, Michal Hocko, Vladimir Davydov,
Johannes Weiner, Tejun Heo, kernel-team, cgroups, linux-doc,
linux-kernel, linux-mm
On Thu, Aug 02, 2018 at 07:01:28AM +0900, Tetsuo Handa wrote:
> On 2018/08/02 1:37, Roman Gushchin wrote:
> > On Tue, Jul 31, 2018 at 11:14:01PM +0900, Tetsuo Handa wrote:
> >> Can we temporarily drop cgroup-aware OOM killer from linux-next.git and
> >> apply my cleanup patch? Since the merge window is approaching, I really want to
> >> see how next -rc1 would look like...
> >
> > Hi Tetsuo!
> >
> > Has this cleanup patch been acked by somebody?
>
> Not yet. But since Michal considers this cleanup as "a nice shortcut"
> ( https://marc.info/?i=20180607112836.GN32433@dhcp22.suse.cz ), I assume that
> I will get an ACK regarding this cleanup.
>
> > Which problem does it solve?
>
> It simplifies tricky out_of_memory() return value decision, and
> it also fixes a bug in your series which syzbot is pointing out.
>
> > Dropping patches for making a cleanup (if it's a cleanup) sounds a bit strange.
>
> What I need is a git tree which I can use as a baseline for making this cleanup.
> linux.git is not suitable because it does not include Michal's fix, but
> linux-next.git is not suitable because Michal's fix is overwritten by your series.
> I want a git tree which includes Michal's fix and does not include your series.
>
> >
> > Anyway, there is a good chance that current cgroup-aware OOM killer
> > implementation will be replaced by a lightweight version (memory.oom.group).
> > Please, take a look at it, probably your cleanup will not conflict with it
> > at all.
>
> Then, please drop current cgroup-aware OOM killer implementation from linux-next.git .
> I want to see how next -rc1 would look like (for testing purpose) and want to use
> linux-next.git as a baseline (for making this cleanup).
I'll post memory.oom.group v2 later today, and if there are no objections,
I'll ask Andrew to drop the current memcg-aware OOM killer and replace it
with the lightweight memory.oom.group.
These changes will be picked up by linux-next in a few days.
Thanks!
* Re: [PATCH v13 0/7] cgroup-aware OOM killer
2018-07-13 21:59 ` David Rientjes
2018-07-14 1:55 ` Tetsuo Handa
@ 2018-07-16 9:36 ` Michal Hocko
2018-07-17 3:59 ` David Rientjes
1 sibling, 1 reply; 51+ messages in thread
From: Michal Hocko @ 2018-07-16 9:36 UTC (permalink / raw)
To: David Rientjes
Cc: Roman Gushchin, linux-mm, Vladimir Davydov, Johannes Weiner,
Tetsuo Handa, Andrew Morton, Tejun Heo, kernel-team, cgroups,
linux-doc, linux-kernel, linux-mm
On Fri 13-07-18 14:59:59, David Rientjes wrote:
> On Tue, 5 Jun 2018, Michal Hocko wrote:
>
> > 1) comparison of root with tail memcgs during the OOM killer is not fair
> > because we are comparing tasks with memcgs.
> >
> > This is true, but I do not think this matters much for workloads which
> > are going to use the feature. Why? Because the main consumers of the new
> > feature seem to be containers which really need some fairness when
> > comparing _workloads_ rather than processes. Those are unlikely to
> > contain any significant memory consumers in the root memcg. That would
> > be mostly common infrastructure.
> >
>
> There are users (us) who want to use the feature and not all processes are
> attached to leaf mem cgroups. The functionality can be provided in a
> generally useful way that doesn't require any specific hierarchy, and I
> implemented this in my patch series at
> https://marc.info/?l=linux-mm&m=152175563004458&w=2. That proposal fixes
> *all* of my concerns with the cgroup-aware oom killer as it sits in -mm
> and, in its entirety, only extends it so it is generally useful; it does not
> remove any functionality. I'm not sure why we are discussing ways of
> moving forward when that patchset has been waiting for review for almost
> four months and, to date, I haven't seen an objection to it.
Well, I didn't really get to your patches yet. The last time I
checked I had some pretty serious concerns about the consistency of your
proposal. Those might have been fixed in the latest version of your
patchset, which I haven't seen. But I still strongly suspect that you are
largely underestimating the complexity of the more generic oom policies
which you are heading toward.
Considering user API failures from the past (oom_*adj fiasco for
example) suggests that we should start with smaller steps and only
provide a clear and simple API. oom_group is such a simple and
semantically consistent thing which is the reason I am OK with it much
more than your "we can be more generic" approach. I simply do not trust
we can agree on sane and consistent api in a reasonable time.
And it is quite mind-boggling that a simpler approach has been basically
blocked for months because there are some concerns for workloads which
are not really asking for the feature.
handle root memcg differently. That is a fair point, but that shouldn't
really block container users who can use the proposed solution without
any further changes. If we ever decide to handle root memcg differently
we are free to do so because the oom selection policy is not carved in
stone by any api.
[...]
--
Michal Hocko
SUSE Labs
* Re: [PATCH v13 0/7] cgroup-aware OOM killer
2018-07-16 9:36 ` Michal Hocko
@ 2018-07-17 3:59 ` David Rientjes
0 siblings, 0 replies; 51+ messages in thread
From: David Rientjes @ 2018-07-17 3:59 UTC (permalink / raw)
To: Michal Hocko
Cc: Roman Gushchin, linux-mm, Vladimir Davydov, Johannes Weiner,
Tetsuo Handa, Andrew Morton, Tejun Heo, kernel-team, cgroups,
linux-doc, linux-kernel, linux-mm
On Mon, 16 Jul 2018, Michal Hocko wrote:
> Well, I didn't really get to your patches yet. The last time I
> checked I had some pretty serious concerns about the consistency of your
> proposal. Those might have been fixed in the latest version of your
> patchset, which I haven't seen. But I still strongly suspect that you are
> largely underestimating the complexity of the more generic oom policies
> which you are heading toward.
>
I don't believe it's underestimated, since it's used. It's perfectly valid
to lock an entire hierarchy or individual subtrees into a single policy
if that's what is preferred. Any use of a different policy at a subtree
root is a conscious decision made by the owner of that subtree. If they
prefer to kill the largest process, the largest descendant cgroup, or the
largest subtree, it is up to them. All three have valid usecases; the
goal is not to lock the entire hierarchy into a single policy: this
introduces the ability for users to subvert the selection policy either
intentionally or unintentionally because they are using a unified single
hierarchy with cgroup v2 and they are using controllers other than mem
cgroup.
> Considering user API failures from the past (oom_*adj fiasco for
> example) suggests that we should start with smaller steps and only
> provide a clear and simple API. oom_group is such a simple and
> semantically consistent thing which is the reason I am OK with it much
> more than your "we can be more generic" approach. I simply do not trust
> we can agree on sane and consistent api in a reasonable time.
>
> And it is quite mind-boggling that a simpler approach has been basically
> blocked for months because there are some concerns for workloads which
> are not really asking for the feature. Sure your usecase might need to
> handle root memcg differently. That is a fair point, but that shouldn't
> really block container users who can use the proposed solution without
> any further changes. If we ever decide to handle root memcg differently
> we are free to do so because the oom selection policy is not carved in
> stone by any api.
>
Please respond directly to the patchset which clearly enumerates the
problems with the current implementation in -mm.