From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pg0-f69.google.com (mail-pg0-f69.google.com [74.125.83.69]) by kanga.kvack.org (Postfix) with ESMTP id 5968E6B025F for ; Mon, 14 Aug 2017 14:32:53 -0400 (EDT) Received: by mail-pg0-f69.google.com with SMTP id t80so149217946pgb.0 for ; Mon, 14 Aug 2017 11:32:53 -0700 (PDT) Received: from mx0a-00082601.pphosted.com (mx0a-00082601.pphosted.com. [67.231.145.42]) by mx.google.com with ESMTPS id f87si4496708pfe.137.2017.08.14.11.32.51 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Mon, 14 Aug 2017 11:32:52 -0700 (PDT) From: Roman Gushchin Subject: [v5 0/4] cgroup-aware OOM killer Date: Mon, 14 Aug 2017 19:32:10 +0100 Message-ID: <20170814183213.12319-2-guro@fb.com> In-Reply-To: <20170814183213.12319-1-guro@fb.com> References: <20170814183213.12319-1-guro@fb.com> MIME-Version: 1.0 Content-Type: text/plain Sender: owner-linux-mm@kvack.org List-ID: To: linux-mm@kvack.org Cc: Roman Gushchin , Michal Hocko , Vladimir Davydov , Johannes Weiner , Tetsuo Handa , David Rientjes , Tejun Heo , kernel-team@fb.com, cgroups@vger.kernel.org, linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org This patchset makes the OOM killer cgroup-aware. v5: - Rebased on top of Michal Hocko's patches, which have changed the way OOM victims get access to the memory reserves. Dropped the corresponding part of this patchset - Separated the oom_kill_process() splitting into a standalone commit - Added debug output (suggested by David Rientjes) - Some minor fixes v4: - Reworked per-cgroup oom_score_adj into oom_priority (based on ideas by David Rientjes) - Tasks with oom_score_adj -1000 are never selected if oom_kill_all_tasks is not set - Memcg victim selection code is reworked, and synchronization is based on finding tasks with the OOM victim marker, rather than on a global counter - Debug output is dropped - Refactored TIF_MEMDIE usage v3: - Merged commits 1-4 into 6 - Separated oom_score_adj logic and debug output into separate commits - Fixed swap accounting v2: - Reworked victim selection based on feedback from Michal Hocko, Vladimir Davydov and Johannes Weiner - "Kill all tasks" is now an opt-in option, by default only one process will be killed - Added per-cgroup oom_score_adj - Refined oom score calculations, suggested by Vladimir Davydov - Converted to a patchset v1: https://lkml.org/lkml/2017/5/18/969 Roman Gushchin (4): mm, oom: refactor the oom_kill_process() function mm, oom: cgroup-aware OOM killer mm, oom: introduce oom_priority for memory cgroups mm, oom, docs: describe the cgroup-aware OOM killer Documentation/cgroup-v2.txt | 62 +++++++++++ include/linux/memcontrol.h | 36 ++++++ include/linux/oom.h | 3 + mm/memcontrol.c | 259 ++++++++++++++++++++++++++++++++++++++++++++ mm/oom_kill.c | 181 +++++++++++++++++++++---------- 5 files changed, 484 insertions(+), 57 deletions(-) -- 2.13.5 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pg0-f70.google.com (mail-pg0-f70.google.com [74.125.83.70]) by kanga.kvack.org (Postfix) with ESMTP id 3F6436B0292 for ; Mon, 14 Aug 2017 14:32:55 -0400 (EDT) Received: by mail-pg0-f70.google.com with SMTP id y192so147618980pgd.12 for ; Mon, 14 Aug 2017 11:32:55 -0700 (PDT) Received: from mx0a-00082601.pphosted.com (mx0a-00082601.pphosted.com. 
[67.231.145.42]) by mx.google.com with ESMTPS id n2si4483331pfj.485.2017.08.14.11.32.53 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Mon, 14 Aug 2017 11:32:53 -0700 (PDT) From: Roman Gushchin Subject: [v5 2/4] mm, oom: cgroup-aware OOM killer Date: Mon, 14 Aug 2017 19:32:11 +0100 Message-ID: <20170814183213.12319-3-guro@fb.com> In-Reply-To: <20170814183213.12319-1-guro@fb.com> References: <20170814183213.12319-1-guro@fb.com> MIME-Version: 1.0 Content-Type: text/plain Sender: owner-linux-mm@kvack.org List-ID: To: linux-mm@kvack.org Cc: Roman Gushchin , Michal Hocko , Vladimir Davydov , Johannes Weiner , Tetsuo Handa , David Rientjes , Tejun Heo , kernel-team@fb.com, cgroups@vger.kernel.org, linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org Traditionally, the OOM killer operates at the process level: under OOM conditions, it finds the process with the highest oom score and kills it. This behavior doesn't suit a system running many containers well: 1) There is no fairness between containers. A small container with a few large processes will be chosen over a large one with a huge number of small processes. 2) Containers often do not expect that some random process inside will be killed. In many cases a much safer behavior is to kill all tasks in the container. Traditionally, this was implemented in userspace, but doing it in the kernel has some advantages, especially in the case of a system-wide OOM. 3) Per-process oom_score_adj affects global OOM, so it's a breach of isolation. To address these issues, a cgroup-aware OOM killer is introduced. Under OOM conditions, it tries to find the biggest memory consumer and frees memory by killing the corresponding task(s). The difference from the "traditional" OOM killer is that it can treat memory cgroups as memory consumers as well as single processes. By default, it will look for the biggest leaf cgroup, and kill the largest task inside. But a user can change this behavior by enabling the per-cgroup oom_kill_all_tasks option. If set, it causes the OOM killer to treat the whole cgroup as an indivisible memory consumer. If it is selected as an OOM victim, all tasks belonging to it will be killed. Tasks in the root cgroup are treated as independent memory consumers, and are compared with other memory consumers (e.g. leaf cgroups). The root cgroup doesn't support the oom_kill_all_tasks feature. 
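For illustration only (not part of this patch): a container manager could opt a workload cgroup into the whole-cgroup kill behavior by writing "1" into the new control file. The cgroup path below is hypothetical, and a cgroup v2 hierarchy mounted at /sys/fs/cgroup with the memory controller enabled is assumed:

#include <stdio.h>

int main(void)
{
	/* Hypothetical cgroup path, for illustration only. */
	const char *knob =
		"/sys/fs/cgroup/workload.slice/job-1/memory.oom_kill_all_tasks";
	FILE *f = fopen(knob, "w");

	if (!f) {
		perror("fopen");
		return 1;
	}
	/* "1" makes the OOM killer treat the cgroup as a single victim */
	fputs("1\n", f);
	return fclose(f) ? 1 : 0;
}
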
Signed-off-by: Roman Gushchin Cc: Michal Hocko Cc: Vladimir Davydov Cc: Johannes Weiner Cc: Tetsuo Handa Cc: David Rientjes Cc: Tejun Heo Cc: kernel-team@fb.com Cc: cgroups@vger.kernel.org Cc: linux-doc@vger.kernel.org Cc: linux-kernel@vger.kernel.org Cc: linux-mm@kvack.org --- include/linux/memcontrol.h | 33 +++++++ include/linux/oom.h | 3 + mm/memcontrol.c | 208 +++++++++++++++++++++++++++++++++++++++++++++ mm/oom_kill.c | 62 +++++++++++++- 4 files changed, 305 insertions(+), 1 deletion(-) diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 8556f1b86d40..796666dc3282 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -35,6 +35,7 @@ struct mem_cgroup; struct page; struct mm_struct; struct kmem_cache; +struct oom_control; /* Cgroup-specific page state, on top of universal node page state */ enum memcg_stat_item { @@ -199,6 +200,12 @@ struct mem_cgroup { /* OOM-Killer disable */ int oom_kill_disable; + /* kill all tasks in the subtree in case of OOM */ + bool oom_kill_all_tasks; + + /* cached OOM score */ + long oom_score; + /* handle for "memory.events" */ struct cgroup_file events_file; @@ -342,6 +349,11 @@ struct mem_cgroup *mem_cgroup_from_css(struct cgroup_subsys_state *css){ return css ? container_of(css, struct mem_cgroup, css) : NULL; } +static inline void mem_cgroup_put(struct mem_cgroup *memcg) +{ + css_put(&memcg->css); +} + #define mem_cgroup_from_counter(counter, member) \ container_of(counter, struct mem_cgroup, member) @@ -480,6 +492,13 @@ static inline bool task_in_memcg_oom(struct task_struct *p) bool mem_cgroup_oom_synchronize(bool wait); +bool mem_cgroup_select_oom_victim(struct oom_control *oc); + +static inline bool mem_cgroup_oom_kill_all_tasks(struct mem_cgroup *memcg) +{ + return memcg->oom_kill_all_tasks; +} + #ifdef CONFIG_MEMCG_SWAP extern int do_swap_account; #endif @@ -743,6 +762,10 @@ static inline bool task_in_mem_cgroup(struct task_struct *task, return true; } +static inline void mem_cgroup_put(struct mem_cgroup *memcg) +{ +} + static inline struct mem_cgroup * mem_cgroup_iter(struct mem_cgroup *root, struct mem_cgroup *prev, @@ -930,6 +953,16 @@ static inline void count_memcg_event_mm(struct mm_struct *mm, enum vm_event_item idx) { } + +static inline bool mem_cgroup_select_oom_victim(struct oom_control *oc) +{ + return false; +} + +static inline bool mem_cgroup_oom_kill_all_tasks(struct mem_cgroup *memcg) +{ + return false; +} #endif /* CONFIG_MEMCG */ /* idx can be of type enum memcg_stat_item or node_stat_item */ diff --git a/include/linux/oom.h b/include/linux/oom.h index 8a266e2be5a6..b7ec3bd441be 100644 --- a/include/linux/oom.h +++ b/include/linux/oom.h @@ -39,6 +39,7 @@ struct oom_control { unsigned long totalpages; struct task_struct *chosen; unsigned long chosen_points; + struct mem_cgroup *chosen_memcg; }; extern struct mutex oom_lock; @@ -79,6 +80,8 @@ extern void oom_killer_enable(void); extern struct task_struct *find_lock_task_mm(struct task_struct *p); +extern int oom_evaluate_task(struct task_struct *task, void *arg); + /* sysctls */ extern int sysctl_oom_dump_tasks; extern int sysctl_oom_kill_allocating_task; diff --git a/mm/memcontrol.c b/mm/memcontrol.c index df6f63ee95d6..0b81dc55c6ac 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -2639,6 +2639,181 @@ static inline bool memcg_has_children(struct mem_cgroup *memcg) return ret; } +static long memcg_oom_badness(struct mem_cgroup *memcg, + const nodemask_t *nodemask) +{ + long points = 0; + int nid; + + for_each_node_state(nid, N_MEMORY) 
{ + if (nodemask && !node_isset(nid, *nodemask)) + continue; + + points += mem_cgroup_node_nr_lru_pages(memcg, nid, + LRU_ALL_ANON | BIT(LRU_UNEVICTABLE)); + } + + points += memcg_page_state(memcg, MEMCG_KERNEL_STACK_KB) / + (PAGE_SIZE / 1024); + points += memcg_page_state(memcg, NR_SLAB_UNRECLAIMABLE); + points += memcg_page_state(memcg, MEMCG_SOCK); + points += memcg_page_state(memcg, MEMCG_SWAP); + + return points; +} + +static long oom_evaluate_memcg(struct mem_cgroup *memcg, + const nodemask_t *nodemask) +{ + struct css_task_iter it; + struct task_struct *task; + int elegible = 0; + + css_task_iter_start(&memcg->css, 0, &it); + while ((task = css_task_iter_next(&it))) { + /* + * If there are no tasks, or all tasks have oom_score_adj set + * to OOM_SCORE_ADJ_MIN and oom_kill_all_tasks is not set, + * don't select this memory cgroup. + */ + if (!elegible && + (memcg->oom_kill_all_tasks || + task->signal->oom_score_adj != OOM_SCORE_ADJ_MIN)) + elegible = 1; + + /* + * If there are previously selected OOM victims, + * abort memcg selection. + */ + if (tsk_is_oom_victim(task) && + !test_bit(MMF_OOM_SKIP, &task->signal->oom_mm->flags)) { + elegible = -1; + break; + } + } + css_task_iter_end(&it); + + return elegible > 0 ? memcg_oom_badness(memcg, nodemask) : elegible; +} + +static void select_victim_memcg(struct mem_cgroup *root, struct oom_control *oc) +{ + struct mem_cgroup *iter, *parent; + + for_each_mem_cgroup_tree(iter, root) { + if (memcg_has_children(iter)) { + iter->oom_score = 0; + continue; + } + + iter->oom_score = oom_evaluate_memcg(iter, oc->nodemask); + if (iter->oom_score == -1) { + oc->chosen_memcg = (void *)-1UL; + mem_cgroup_iter_break(root, iter); + return; + } + + if (!iter->oom_score) + continue; + + for (parent = parent_mem_cgroup(iter); parent && parent != root; + parent = parent_mem_cgroup(parent)) + parent->oom_score += iter->oom_score; + } + + for (;;) { + struct cgroup_subsys_state *css; + struct mem_cgroup *memcg = NULL; + long score = LONG_MIN; + + css_for_each_child(css, &root->css) { + struct mem_cgroup *iter = mem_cgroup_from_css(css); + + if (iter->oom_score > score) { + memcg = iter; + score = iter->oom_score; + } + } + + if (!memcg) { + if (oc->memcg && root == oc->memcg) { + oc->chosen_memcg = oc->memcg; + css_get(&oc->chosen_memcg->css); + oc->chosen_points = oc->memcg->oom_score; + } + break; + } + + if (memcg->oom_kill_all_tasks || !memcg_has_children(memcg)) { + oc->chosen_memcg = memcg; + css_get(&oc->chosen_memcg->css); + oc->chosen_points = score; + break; + } + + root = memcg; + } +} + +static void select_victim_root_cgroup_task(struct oom_control *oc) +{ + struct css_task_iter it; + struct task_struct *task; + int ret = 0; + + css_task_iter_start(&root_mem_cgroup->css, 0, &it); + while (!ret && (task = css_task_iter_next(&it))) + ret = oom_evaluate_task(task, oc); + css_task_iter_end(&it); +} + +bool mem_cgroup_select_oom_victim(struct oom_control *oc) +{ + struct mem_cgroup *root = root_mem_cgroup; + + oc->chosen = NULL; + oc->chosen_memcg = NULL; + + if (mem_cgroup_disabled()) + return false; + + if (!cgroup_subsys_on_dfl(memory_cgrp_subsys)) + return false; + + if (oc->memcg) + root = oc->memcg; + + rcu_read_lock(); + + select_victim_memcg(root, oc); + if (oc->chosen_memcg == (void *)-1UL) { + /* Existing OOM victims are found. */ + rcu_read_unlock(); + return true; + } + + /* + * For system-wide OOMs we should consider tasks in the root cgroup + * with oom_score larger than oc->chosen_points. 
+ */ + if (!oc->memcg) { + select_victim_root_cgroup_task(oc); + + if (oc->chosen && oc->chosen_memcg) { + /* + * If we've decided to kill a task in the root memcg, + * release chosen_memcg. + */ + css_put(&oc->chosen_memcg->css); + oc->chosen_memcg = NULL; + } + } + + rcu_read_unlock(); + + return !!oc->chosen || !!oc->chosen_memcg; +} + /* * Reclaims as many pages from the given memcg as possible. * @@ -5190,6 +5365,33 @@ static ssize_t memory_max_write(struct kernfs_open_file *of, return nbytes; } +static int memory_oom_kill_all_tasks_show(struct seq_file *m, void *v) +{ + struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m)); + bool oom_kill_all_tasks = memcg->oom_kill_all_tasks; + + seq_printf(m, "%d\n", oom_kill_all_tasks); + + return 0; +} + +static ssize_t memory_oom_kill_all_tasks_write(struct kernfs_open_file *of, + char *buf, size_t nbytes, + loff_t off) +{ + struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of)); + int oom_kill_all_tasks; + int err; + + err = kstrtoint(strstrip(buf), 0, &oom_kill_all_tasks); + if (err) + return err; + + memcg->oom_kill_all_tasks = !!oom_kill_all_tasks; + + return nbytes; +} + static int memory_events_show(struct seq_file *m, void *v) { struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m)); @@ -5310,6 +5512,12 @@ static struct cftype memory_files[] = { .write = memory_max_write, }, { + .name = "oom_kill_all_tasks", + .flags = CFTYPE_NOT_ON_ROOT, + .seq_show = memory_oom_kill_all_tasks_show, + .write = memory_oom_kill_all_tasks_write, + }, + { .name = "events", .flags = CFTYPE_NOT_ON_ROOT, .file_offset = offsetof(struct mem_cgroup, events_file), diff --git a/mm/oom_kill.c b/mm/oom_kill.c index 5c29a3dd591b..28e42a0d5eee 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -288,7 +288,7 @@ static enum oom_constraint constrained_alloc(struct oom_control *oc) return CONSTRAINT_NONE; } -static int oom_evaluate_task(struct task_struct *task, void *arg) +int oom_evaluate_task(struct task_struct *task, void *arg) { struct oom_control *oc = arg; unsigned long points; @@ -823,6 +823,9 @@ static void __oom_kill_process(struct task_struct *victim) struct mm_struct *mm; bool can_oom_reap = true; + if (is_global_init(victim) || (victim->flags & PF_KTHREAD)) + return; + p = find_lock_task_mm(victim); if (!p) { put_task_struct(victim); @@ -958,6 +961,60 @@ static void oom_kill_process(struct oom_control *oc, const char *message) put_task_struct(victim); } +static int oom_kill_memcg_member(struct task_struct *task, void *unused) +{ + if (!tsk_is_oom_victim(task)) + __oom_kill_process(task); + return 0; +} + +static bool oom_kill_memcg_victim(struct oom_control *oc) +{ + static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL, + DEFAULT_RATELIMIT_BURST); + + if (oc->chosen) { + if (oc->chosen != (void *)-1UL) { + if (__ratelimit(&oom_rs)) + dump_header(oc, oc->chosen); + + __oom_kill_process(oc->chosen); + put_task_struct(oc->chosen); + schedule_timeout_killable(1); + } + return true; + + } else if (oc->chosen_memcg) { + if (oc->chosen_memcg == (void *)-1UL) + return true; + + /* Always begin with the biggest task */ + oc->chosen_points = 0; + oc->chosen = NULL; + mem_cgroup_scan_tasks(oc->chosen_memcg, oom_evaluate_task, oc); + if (oc->chosen && oc->chosen != (void *)-1UL) { + if (__ratelimit(&oom_rs)) + dump_header(oc, oc->chosen); + + __oom_kill_process(oc->chosen); + put_task_struct(oc->chosen); + + if (mem_cgroup_oom_kill_all_tasks(oc->chosen_memcg)) + mem_cgroup_scan_tasks(oc->chosen_memcg, + oom_kill_memcg_member, + NULL); + } + + 
mem_cgroup_put(oc->chosen_memcg); + oc->chosen_memcg = NULL; + return !!oc->chosen; + + } else { + oc->chosen_points = 0; + return false; + } +} + /* * Determines whether the kernel must panic because of the panic_on_oom sysctl. */ @@ -1059,6 +1116,9 @@ bool out_of_memory(struct oom_control *oc) return true; } + if (mem_cgroup_select_oom_victim(oc) && oom_kill_memcg_victim(oc)) + return true; + select_bad_process(oc); /* Found nothing?!?! Either we hang forever, or we panic. */ if (!oc->chosen && !is_sysrq_oom(oc) && !is_memcg_oom(oc)) { -- 2.13.5 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-io0-f200.google.com (mail-io0-f200.google.com [209.85.223.200]) by kanga.kvack.org (Postfix) with ESMTP id 606476B02B4 for ; Mon, 14 Aug 2017 14:32:55 -0400 (EDT) Received: by mail-io0-f200.google.com with SMTP id m88so112078615iod.0 for ; Mon, 14 Aug 2017 11:32:55 -0700 (PDT) Received: from mx0a-00082601.pphosted.com (mx0a-00082601.pphosted.com. [67.231.145.42]) by mx.google.com with ESMTPS id t3si4510638pfl.88.2017.08.14.11.32.54 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Mon, 14 Aug 2017 11:32:54 -0700 (PDT) From: Roman Gushchin Subject: [v5 3/4] mm, oom: introduce oom_priority for memory cgroups Date: Mon, 14 Aug 2017 19:32:12 +0100 Message-ID: <20170814183213.12319-4-guro@fb.com> In-Reply-To: <20170814183213.12319-1-guro@fb.com> References: <20170814183213.12319-1-guro@fb.com> MIME-Version: 1.0 Content-Type: text/plain Sender: owner-linux-mm@kvack.org List-ID: To: linux-mm@kvack.org Cc: Roman Gushchin , Michal Hocko , Vladimir Davydov , Johannes Weiner , Tetsuo Handa , David Rientjes , Tejun Heo , kernel-team@fb.com, cgroups@vger.kernel.org, linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org Introduce a per-memory-cgroup oom_priority setting: an integer number within the [-10000, 10000] range, which defines the order in which the OOM killer selects victim memory cgroups. OOM killer prefers memory cgroups with larger priority if they are populated with elegible tasks. The oom_priority value is compared within sibling cgroups. The root cgroup has the oom_priority 0, which cannot be changed. 
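A minimal userspace sketch (illustration only, not part of this patch) of how sibling cgroups could be biased against each other; the cgroup names and values are hypothetical, and a cgroup v2 hierarchy mounted at /sys/fs/cgroup with the memory controller enabled is assumed. Because the OOM killer compares oom_priority first and falls back to size only among siblings with equal priority, "batch" below is always preferred as a victim over "critical":

#include <stdio.h>

/* Write an integer into a cgroup control file; return 0 on success. */
static int write_cg_int(const char *path, int val)
{
	FILE *f = fopen(path, "w");

	if (!f)
		return -1;
	fprintf(f, "%d\n", val);
	return fclose(f);
}

int main(void)
{
	int ret = 0;

	/* Higher oom_priority => preferred OOM victim among siblings. */
	ret |= write_cg_int("/sys/fs/cgroup/batch/memory.oom_priority", 100);
	ret |= write_cg_int("/sys/fs/cgroup/critical/memory.oom_priority", -100);
	return ret ? 1 : 0;
}
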
Signed-off-by: Roman Gushchin Cc: Michal Hocko Cc: Vladimir Davydov Cc: Johannes Weiner Cc: David Rientjes Cc: Tejun Heo Cc: Tetsuo Handa Cc: kernel-team@fb.com Cc: cgroups@vger.kernel.org Cc: linux-doc@vger.kernel.org Cc: linux-kernel@vger.kernel.org Cc: linux-mm@kvack.org --- include/linux/memcontrol.h | 3 +++ mm/memcontrol.c | 55 ++++++++++++++++++++++++++++++++++++++++++++-- 2 files changed, 56 insertions(+), 2 deletions(-) diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 796666dc3282..3c1ab3aedebe 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -206,6 +206,9 @@ struct mem_cgroup { /* cached OOM score */ long oom_score; + /* OOM killer priority */ + short oom_priority; + /* handle for "memory.events" */ struct cgroup_file events_file; diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 0b81dc55c6ac..f61e9a9c8bdc 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -2724,12 +2724,21 @@ static void select_victim_memcg(struct mem_cgroup *root, struct oom_control *oc) for (;;) { struct cgroup_subsys_state *css; struct mem_cgroup *memcg = NULL; + short prio = SHRT_MIN; long score = LONG_MIN; css_for_each_child(css, &root->css) { struct mem_cgroup *iter = mem_cgroup_from_css(css); - if (iter->oom_score > score) { + if (iter->oom_score == 0) + continue; + + if (iter->oom_priority > prio) { + memcg = iter; + prio = iter->oom_priority; + score = iter->oom_score; + } else if (iter->oom_priority == prio && + iter->oom_score > score) { memcg = iter; score = iter->oom_score; } @@ -2796,7 +2805,15 @@ bool mem_cgroup_select_oom_victim(struct oom_control *oc) * For system-wide OOMs we should consider tasks in the root cgroup * with oom_score larger than oc->chosen_points. */ - if (!oc->memcg) { + if (!oc->memcg && !(oc->chosen_memcg && + oc->chosen_memcg->oom_priority > 0)) { + /* + * Root memcg has priority 0, so if chosen memcg has lower + * priority, any task in root cgroup is preferable. + */ + if (oc->chosen_memcg && oc->chosen_memcg->oom_priority < 0) + oc->chosen_points = 0; + select_victim_root_cgroup_task(oc); if (oc->chosen && oc->chosen_memcg) { @@ -5392,6 +5409,34 @@ static ssize_t memory_oom_kill_all_tasks_write(struct kernfs_open_file *of, return nbytes; } +static int memory_oom_priority_show(struct seq_file *m, void *v) +{ + struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m)); + + seq_printf(m, "%d\n", memcg->oom_priority); + + return 0; +} + +static ssize_t memory_oom_priority_write(struct kernfs_open_file *of, + char *buf, size_t nbytes, loff_t off) +{ + struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of)); + int oom_priority; + int err; + + err = kstrtoint(strstrip(buf), 0, &oom_priority); + if (err) + return err; + + if (oom_priority < -10000 || oom_priority > 10000) + return -EINVAL; + + memcg->oom_priority = (short)oom_priority; + + return nbytes; +} + static int memory_events_show(struct seq_file *m, void *v) { struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m)); @@ -5518,6 +5563,12 @@ static struct cftype memory_files[] = { .write = memory_oom_kill_all_tasks_write, }, { + .name = "oom_priority", + .flags = CFTYPE_NOT_ON_ROOT, + .seq_show = memory_oom_priority_show, + .write = memory_oom_priority_write, + }, + { .name = "events", .flags = CFTYPE_NOT_ON_ROOT, .file_offset = offsetof(struct mem_cgroup, events_file), -- 2.13.5 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wm0-f70.google.com (mail-wm0-f70.google.com [74.125.82.70]) by kanga.kvack.org (Postfix) with ESMTP id AF4736B02F3 for ; Mon, 14 Aug 2017 14:32:55 -0400 (EDT) Received: by mail-wm0-f70.google.com with SMTP id g71so14832632wmg.13 for ; Mon, 14 Aug 2017 11:32:55 -0700 (PDT) Received: from mx0a-00082601.pphosted.com (mx0b-00082601.pphosted.com. [67.231.153.30]) by mx.google.com with ESMTPS id y33si5920661wrc.553.2017.08.14.11.32.54 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Mon, 14 Aug 2017 11:32:54 -0700 (PDT) From: Roman Gushchin Subject: [v5 1/4] mm, oom: refactor the oom_kill_process() function Date: Mon, 14 Aug 2017 19:32:09 +0100 Message-ID: <20170814183213.12319-1-guro@fb.com> MIME-Version: 1.0 Content-Type: text/plain Sender: owner-linux-mm@kvack.org List-ID: To: linux-mm@kvack.org Cc: Roman Gushchin , Michal Hocko , Vladimir Davydov , Johannes Weiner , Tetsuo Handa , David Rientjes , Tejun Heo , kernel-team@fb.com, cgroups@vger.kernel.org, linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org The oom_kill_process() function consists of two logical parts: the first one is responsible for considering task's children as a potential victim and printing the debug information. The second half is responsible for sending SIGKILL to all tasks sharing the mm struct with the given victim. This commit splits the oom_kill_process() function with an intention to re-use the the second half: __oom_kill_process(). The cgroup-aware OOM killer will kill multiple tasks belonging to the victim cgroup. We don't need to print the debug information for the each task, as well as play with task selection (considering task's children), so we can't use the existing oom_kill_process(). 
Signed-off-by: Roman Gushchin Cc: Michal Hocko Cc: Vladimir Davydov Cc: Johannes Weiner Cc: Tetsuo Handa Cc: David Rientjes Cc: Tejun Heo Cc: kernel-team@fb.com Cc: cgroups@vger.kernel.org Cc: linux-doc@vger.kernel.org Cc: linux-kernel@vger.kernel.org Cc: linux-mm@kvack.org --- mm/oom_kill.c | 123 +++++++++++++++++++++++++++++++--------------------------- 1 file changed, 65 insertions(+), 58 deletions(-) diff --git a/mm/oom_kill.c b/mm/oom_kill.c index 53b44425ef35..5c29a3dd591b 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -817,67 +817,12 @@ static bool task_will_free_mem(struct task_struct *task) return ret; } -static void oom_kill_process(struct oom_control *oc, const char *message) +static void __oom_kill_process(struct task_struct *victim) { - struct task_struct *p = oc->chosen; - unsigned int points = oc->chosen_points; - struct task_struct *victim = p; - struct task_struct *child; - struct task_struct *t; + struct task_struct *p; struct mm_struct *mm; - unsigned int victim_points = 0; - static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL, - DEFAULT_RATELIMIT_BURST); bool can_oom_reap = true; - /* - * If the task is already exiting, don't alarm the sysadmin or kill - * its children or threads, just set TIF_MEMDIE so it can die quickly - */ - task_lock(p); - if (task_will_free_mem(p)) { - mark_oom_victim(p); - wake_oom_reaper(p); - task_unlock(p); - put_task_struct(p); - return; - } - task_unlock(p); - - if (__ratelimit(&oom_rs)) - dump_header(oc, p); - - pr_err("%s: Kill process %d (%s) score %u or sacrifice child\n", - message, task_pid_nr(p), p->comm, points); - - /* - * If any of p's children has a different mm and is eligible for kill, - * the one with the highest oom_badness() score is sacrificed for its - * parent. This attempts to lose the minimal amount of work done while - * still freeing memory. 
- */ - read_lock(&tasklist_lock); - for_each_thread(p, t) { - list_for_each_entry(child, &t->children, sibling) { - unsigned int child_points; - - if (process_shares_mm(child, p->mm)) - continue; - /* - * oom_badness() returns 0 if the thread is unkillable - */ - child_points = oom_badness(child, - oc->memcg, oc->nodemask, oc->totalpages); - if (child_points > victim_points) { - put_task_struct(victim); - victim = child; - victim_points = child_points; - get_task_struct(victim); - } - } - } - read_unlock(&tasklist_lock); - p = find_lock_task_mm(victim); if (!p) { put_task_struct(victim); @@ -947,10 +892,72 @@ static void oom_kill_process(struct oom_control *oc, const char *message) wake_oom_reaper(victim); mmdrop(mm); - put_task_struct(victim); } #undef K +static void oom_kill_process(struct oom_control *oc, const char *message) +{ + struct task_struct *p = oc->chosen; + unsigned int points = oc->chosen_points; + struct task_struct *victim = p; + struct task_struct *child; + struct task_struct *t; + unsigned int victim_points = 0; + static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL, + DEFAULT_RATELIMIT_BURST); + + /* + * If the task is already exiting, don't alarm the sysadmin or kill + * its children or threads, just set TIF_MEMDIE so it can die quickly + */ + task_lock(p); + if (task_will_free_mem(p)) { + mark_oom_victim(p); + wake_oom_reaper(p); + task_unlock(p); + put_task_struct(p); + return; + } + task_unlock(p); + + if (__ratelimit(&oom_rs)) + dump_header(oc, p); + + pr_err("%s: Kill process %d (%s) score %u or sacrifice child\n", + message, task_pid_nr(p), p->comm, points); + + /* + * If any of p's children has a different mm and is eligible for kill, + * the one with the highest oom_badness() score is sacrificed for its + * parent. This attempts to lose the minimal amount of work done while + * still freeing memory. + */ + read_lock(&tasklist_lock); + for_each_thread(p, t) { + list_for_each_entry(child, &t->children, sibling) { + unsigned int child_points; + + if (process_shares_mm(child, p->mm)) + continue; + /* + * oom_badness() returns 0 if the thread is unkillable + */ + child_points = oom_badness(child, + oc->memcg, oc->nodemask, oc->totalpages); + if (child_points > victim_points) { + put_task_struct(victim); + victim = child; + victim_points = child_points; + get_task_struct(victim); + } + } + } + read_unlock(&tasklist_lock); + + __oom_kill_process(victim); + put_task_struct(victim); +} + /* * Determines whether the kernel must panic because of the panic_on_oom sysctl. */ -- 2.13.5 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pg0-f72.google.com (mail-pg0-f72.google.com [74.125.83.72]) by kanga.kvack.org (Postfix) with ESMTP id 4BB3D6B02F3 for ; Mon, 14 Aug 2017 14:33:00 -0400 (EDT) Received: by mail-pg0-f72.google.com with SMTP id 123so148663179pga.5 for ; Mon, 14 Aug 2017 11:33:00 -0700 (PDT) Received: from mx0a-00082601.pphosted.com (mx0a-00082601.pphosted.com. 
[67.231.145.42]) by mx.google.com with ESMTPS id j1si4419193pgf.548.2017.08.14.11.32.59 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Mon, 14 Aug 2017 11:32:59 -0700 (PDT) From: Roman Gushchin Subject: [v5 4/4] mm, oom, docs: describe the cgroup-aware OOM killer Date: Mon, 14 Aug 2017 19:32:13 +0100 Message-ID: <20170814183213.12319-5-guro@fb.com> In-Reply-To: <20170814183213.12319-1-guro@fb.com> References: <20170814183213.12319-1-guro@fb.com> MIME-Version: 1.0 Content-Type: text/plain Sender: owner-linux-mm@kvack.org List-ID: To: linux-mm@kvack.org Cc: Roman Gushchin , Michal Hocko , Vladimir Davydov , Johannes Weiner , Tetsuo Handa , David Rientjes , Tejun Heo , kernel-team@fb.com, cgroups@vger.kernel.org, linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org Update cgroups v2 docs. Signed-off-by: Roman Gushchin Cc: Michal Hocko Cc: Vladimir Davydov Cc: Johannes Weiner Cc: Tetsuo Handa Cc: David Rientjes Cc: Tejun Heo Cc: kernel-team@fb.com Cc: cgroups@vger.kernel.org Cc: linux-doc@vger.kernel.org Cc: linux-kernel@vger.kernel.org Cc: linux-mm@kvack.org --- Documentation/cgroup-v2.txt | 62 +++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 62 insertions(+) diff --git a/Documentation/cgroup-v2.txt b/Documentation/cgroup-v2.txt index dec5afdaa36d..22108f31e09d 100644 --- a/Documentation/cgroup-v2.txt +++ b/Documentation/cgroup-v2.txt @@ -48,6 +48,7 @@ v1 is available under Documentation/cgroup-v1/. 5-2-1. Memory Interface Files 5-2-2. Usage Guidelines 5-2-3. Memory Ownership + 5-2-4. Cgroup-aware OOM Killer 5-3. IO 5-3-1. IO Interface Files 5-3-2. Writeback @@ -1002,6 +1003,37 @@ PAGE_SIZE multiple when read back. high limit is used and monitored properly, this limit's utility is limited to providing the final safety net. + memory.oom_kill_all_tasks + + A read-write single value file which exits on non-root + cgroups. The default is "0". + + Defines whether the OOM killer should treat the cgroup + as a single entity during the victim selection. + + If set, OOM killer will kill all belonging tasks in + corresponding cgroup is selected as an OOM victim. + + Be default, OOM killer respect /proc/pid/oom_score_adj value + -1000, and will never kill the task, unless oom_kill_all_tasks + is set. + + memory.oom_priority + + A read-write single value file which exits on non-root + cgroups. The default is "0". + + An integer number within the [-10000, 10000] range, + which defines the order in which the OOM killer selects victim + memory cgroups. + + OOM killer prefers memory cgroups with larger priority if they + are populated with elegible tasks. + + The oom_priority value is compared within sibling cgroups. + + The root cgroup has the oom_priority 0, which cannot be changed. + memory.events A read-only flat-keyed file which exists on non-root cgroups. The following entries are defined. Unless specified @@ -1206,6 +1238,36 @@ POSIX_FADV_DONTNEED to relinquish the ownership of memory areas belonging to the affected files to ensure correct memory ownership. +Cgroup-aware OOM Killer +~~~~~~~~~~~~~~~~~~~~~~~ + +Cgroup v2 memory controller implements a cgroup-aware OOM killer. +It means that it treats memory cgroups as first class OOM entities. + +Under OOM conditions the memory controller tries to make the best +choise of a victim, hierarchically looking for the largest memory +consumer. By default, it will look for the biggest task in the +biggest leaf cgroup. 
+ +Be default, all cgroups have oom_priority 0, and OOM killer will +chose the largest cgroup recursively on each level. For non-root +cgroups it's possible to change the oom_priority, and it will cause +the OOM killer to look athe the priority value first, and compare +sizes only of cgroups with equal priority. + +But a user can change this behavior by enabling the per-cgroup +oom_kill_all_tasks option. If set, it causes the OOM killer treat +the whole cgroup as an indivisible memory consumer. In case if it's +selected as on OOM victim, all belonging tasks will be killed. + +Tasks in the root cgroup are treated as independent memory consumers, +and are compared with other memory consumers (e.g. leaf cgroups). +The root cgroup doesn't support the oom_kill_all_tasks feature. + +This affects both system- and cgroup-wide OOMs. For a cgroup-wide OOM +the memory controller considers only cgroups belonging to the sub-tree +of the OOM'ing cgroup. + IO -- -- 2.13.5 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pg0-f70.google.com (mail-pg0-f70.google.com [74.125.83.70]) by kanga.kvack.org (Postfix) with ESMTP id 1A1256B025F for ; Mon, 14 Aug 2017 18:00:51 -0400 (EDT) Received: by mail-pg0-f70.google.com with SMTP id y129so158711484pgy.1 for ; Mon, 14 Aug 2017 15:00:51 -0700 (PDT) Received: from mail-pg0-x231.google.com (mail-pg0-x231.google.com. [2607:f8b0:400e:c05::231]) by mx.google.com with ESMTPS id d9si5289931pln.943.2017.08.14.15.00.45 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Mon, 14 Aug 2017 15:00:45 -0700 (PDT) Received: by mail-pg0-x231.google.com with SMTP id u185so55161564pgb.1 for ; Mon, 14 Aug 2017 15:00:45 -0700 (PDT) Date: Mon, 14 Aug 2017 15:00:43 -0700 (PDT) From: David Rientjes Subject: Re: [v5 1/4] mm, oom: refactor the oom_kill_process() function In-Reply-To: <20170814183213.12319-1-guro@fb.com> Message-ID: References: <20170814183213.12319-1-guro@fb.com> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-linux-mm@kvack.org List-ID: To: Roman Gushchin Cc: linux-mm@kvack.org, Michal Hocko , Vladimir Davydov , Johannes Weiner , Tetsuo Handa , Tejun Heo , kernel-team@fb.com, cgroups@vger.kernel.org, linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org On Mon, 14 Aug 2017, Roman Gushchin wrote: > The oom_kill_process() function consists of two logical parts: > the first one is responsible for considering task's children as > a potential victim and printing the debug information. > The second half is responsible for sending SIGKILL to all > tasks sharing the mm struct with the given victim. > > This commit splits the oom_kill_process() function with > an intention to re-use the the second half: __oom_kill_process(). > > The cgroup-aware OOM killer will kill multiple tasks > belonging to the victim cgroup. We don't need to print > the debug information for the each task, as well as play > with task selection (considering task's children), > so we can't use the existing oom_kill_process(). 
> > Signed-off-by: Roman Gushchin > Cc: Michal Hocko > Cc: Vladimir Davydov > Cc: Johannes Weiner > Cc: Tetsuo Handa > Cc: David Rientjes > Cc: Tejun Heo > Cc: kernel-team@fb.com > Cc: cgroups@vger.kernel.org > Cc: linux-doc@vger.kernel.org > Cc: linux-kernel@vger.kernel.org > Cc: linux-mm@kvack.org Acked-by: David Rientjes -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pg0-f72.google.com (mail-pg0-f72.google.com [74.125.83.72]) by kanga.kvack.org (Postfix) with ESMTP id 093386B025F for ; Mon, 14 Aug 2017 18:42:58 -0400 (EDT) Received: by mail-pg0-f72.google.com with SMTP id r133so158659654pgr.6 for ; Mon, 14 Aug 2017 15:42:58 -0700 (PDT) Received: from mail-pg0-x231.google.com (mail-pg0-x231.google.com. [2607:f8b0:400e:c05::231]) by mx.google.com with ESMTPS id p4si4618954pga.958.2017.08.14.15.42.56 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Mon, 14 Aug 2017 15:42:56 -0700 (PDT) Received: by mail-pg0-x231.google.com with SMTP id u185so55709182pgb.1 for ; Mon, 14 Aug 2017 15:42:56 -0700 (PDT) Date: Mon, 14 Aug 2017 15:42:54 -0700 (PDT) From: David Rientjes Subject: Re: [v5 2/4] mm, oom: cgroup-aware OOM killer In-Reply-To: <20170814183213.12319-3-guro@fb.com> Message-ID: References: <20170814183213.12319-1-guro@fb.com> <20170814183213.12319-3-guro@fb.com> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-linux-mm@kvack.org List-ID: To: Roman Gushchin Cc: linux-mm@kvack.org, Michal Hocko , Vladimir Davydov , Johannes Weiner , Tetsuo Handa , Tejun Heo , kernel-team@fb.com, cgroups@vger.kernel.org, linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org On Mon, 14 Aug 2017, Roman Gushchin wrote: > diff --git a/include/linux/oom.h b/include/linux/oom.h > index 8a266e2be5a6..b7ec3bd441be 100644 > --- a/include/linux/oom.h > +++ b/include/linux/oom.h > @@ -39,6 +39,7 @@ struct oom_control { > unsigned long totalpages; > struct task_struct *chosen; > unsigned long chosen_points; > + struct mem_cgroup *chosen_memcg; > }; > > extern struct mutex oom_lock; > @@ -79,6 +80,8 @@ extern void oom_killer_enable(void); > > extern struct task_struct *find_lock_task_mm(struct task_struct *p); > > +extern int oom_evaluate_task(struct task_struct *task, void *arg); > + > /* sysctls */ > extern int sysctl_oom_dump_tasks; > extern int sysctl_oom_kill_allocating_task; > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > index df6f63ee95d6..0b81dc55c6ac 100644 > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c > @@ -2639,6 +2639,181 @@ static inline bool memcg_has_children(struct mem_cgroup *memcg) > return ret; > } > > +static long memcg_oom_badness(struct mem_cgroup *memcg, > + const nodemask_t *nodemask) > +{ > + long points = 0; > + int nid; > + > + for_each_node_state(nid, N_MEMORY) { > + if (nodemask && !node_isset(nid, *nodemask)) > + continue; > + > + points += mem_cgroup_node_nr_lru_pages(memcg, nid, > + LRU_ALL_ANON | BIT(LRU_UNEVICTABLE)); > + } > + > + points += memcg_page_state(memcg, MEMCG_KERNEL_STACK_KB) / > + (PAGE_SIZE / 1024); > + points += memcg_page_state(memcg, NR_SLAB_UNRECLAIMABLE); > + points += memcg_page_state(memcg, MEMCG_SOCK); > + points += memcg_page_state(memcg, MEMCG_SWAP); > + > + return points; > +} I'm indifferent to the memcg evaluation criteria used to determine which memcg should be selected over others with the same 
priority, others may feel differently. > + > +static long oom_evaluate_memcg(struct mem_cgroup *memcg, > + const nodemask_t *nodemask) > +{ > + struct css_task_iter it; > + struct task_struct *task; > + int elegible = 0; > + > + css_task_iter_start(&memcg->css, 0, &it); > + while ((task = css_task_iter_next(&it))) { > + /* > + * If there are no tasks, or all tasks have oom_score_adj set > + * to OOM_SCORE_ADJ_MIN and oom_kill_all_tasks is not set, > + * don't select this memory cgroup. > + */ > + if (!elegible && > + (memcg->oom_kill_all_tasks || > + task->signal->oom_score_adj != OOM_SCORE_ADJ_MIN)) > + elegible = 1; I'm curious about the decision made in this conditional and how oom_kill_memcg_member() ignores task->signal->oom_score_adj. It means that memory.oom_kill_all_tasks overrides /proc/pid/oom_score_adj if it should otherwise be disabled. It's undocumented in the changelog, but I'm questioning whether it's the right decision. Doesn't it make sense to kill all tasks that are not oom disabled, and allow the user to still protect certain processes by their /proc/pid/oom_score_adj setting? Otherwise, there's no way to do that protection without a sibling memcg and its own reservation of memory. I'm thinking about a process that governs jobs inside the memcg and if there is an oom kill, it wants to do logging and any cleanup necessary before exiting itself. It seems like a powerful combination if coupled with oom notification. Also, s/elegible/eligible/ Otherwise, looks good! -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pg0-f72.google.com (mail-pg0-f72.google.com [74.125.83.72]) by kanga.kvack.org (Postfix) with ESMTP id 723076B02B4 for ; Mon, 14 Aug 2017 18:44:07 -0400 (EDT) Received: by mail-pg0-f72.google.com with SMTP id u199so159268013pgb.13 for ; Mon, 14 Aug 2017 15:44:07 -0700 (PDT) Received: from mail-pg0-x234.google.com (mail-pg0-x234.google.com. [2607:f8b0:400e:c05::234]) by mx.google.com with ESMTPS id e8si5253724pln.721.2017.08.14.15.44.06 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Mon, 14 Aug 2017 15:44:06 -0700 (PDT) Received: by mail-pg0-x234.google.com with SMTP id l64so55665248pge.5 for ; Mon, 14 Aug 2017 15:44:06 -0700 (PDT) Date: Mon, 14 Aug 2017 15:44:04 -0700 (PDT) From: David Rientjes Subject: Re: [v5 3/4] mm, oom: introduce oom_priority for memory cgroups In-Reply-To: <20170814183213.12319-4-guro@fb.com> Message-ID: References: <20170814183213.12319-1-guro@fb.com> <20170814183213.12319-4-guro@fb.com> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-linux-mm@kvack.org List-ID: To: Roman Gushchin Cc: linux-mm@kvack.org, Michal Hocko , Vladimir Davydov , Johannes Weiner , Tetsuo Handa , Tejun Heo , kernel-team@fb.com, cgroups@vger.kernel.org, linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org On Mon, 14 Aug 2017, Roman Gushchin wrote: > Introduce a per-memory-cgroup oom_priority setting: an integer number > within the [-10000, 10000] range, which defines the order in which > the OOM killer selects victim memory cgroups. > > OOM killer prefers memory cgroups with larger priority if they are > populated with elegible tasks. > > The oom_priority value is compared within sibling cgroups. > > The root cgroup has the oom_priority 0, which cannot be changed. 
> > Signed-off-by: Roman Gushchin > Cc: Michal Hocko > Cc: Vladimir Davydov > Cc: Johannes Weiner > Cc: David Rientjes > Cc: Tejun Heo > Cc: Tetsuo Handa > Cc: kernel-team@fb.com > Cc: cgroups@vger.kernel.org > Cc: linux-doc@vger.kernel.org > Cc: linux-kernel@vger.kernel.org > Cc: linux-mm@kvack.org Tested-by: David Rientjes -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pg0-f69.google.com (mail-pg0-f69.google.com [74.125.83.69]) by kanga.kvack.org (Postfix) with ESMTP id 3E3A66B025F for ; Mon, 14 Aug 2017 18:52:29 -0400 (EDT) Received: by mail-pg0-f69.google.com with SMTP id s14so158886719pgs.4 for ; Mon, 14 Aug 2017 15:52:29 -0700 (PDT) Received: from mail-pg0-x22f.google.com (mail-pg0-x22f.google.com. [2607:f8b0:400e:c05::22f]) by mx.google.com with ESMTPS id 2si4623891pgb.364.2017.08.14.15.52.28 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Mon, 14 Aug 2017 15:52:28 -0700 (PDT) Received: by mail-pg0-x22f.google.com with SMTP id l64so55768998pge.5 for ; Mon, 14 Aug 2017 15:52:28 -0700 (PDT) Date: Mon, 14 Aug 2017 15:52:26 -0700 (PDT) From: David Rientjes Subject: Re: [v5 4/4] mm, oom, docs: describe the cgroup-aware OOM killer In-Reply-To: <20170814183213.12319-5-guro@fb.com> Message-ID: References: <20170814183213.12319-1-guro@fb.com> <20170814183213.12319-5-guro@fb.com> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-linux-mm@kvack.org List-ID: To: Roman Gushchin Cc: linux-mm@kvack.org, Michal Hocko , Vladimir Davydov , Johannes Weiner , Tetsuo Handa , Tejun Heo , kernel-team@fb.com, cgroups@vger.kernel.org, linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org On Mon, 14 Aug 2017, Roman Gushchin wrote: > diff --git a/Documentation/cgroup-v2.txt b/Documentation/cgroup-v2.txt > index dec5afdaa36d..22108f31e09d 100644 > --- a/Documentation/cgroup-v2.txt > +++ b/Documentation/cgroup-v2.txt > @@ -48,6 +48,7 @@ v1 is available under Documentation/cgroup-v1/. > 5-2-1. Memory Interface Files > 5-2-2. Usage Guidelines > 5-2-3. Memory Ownership > + 5-2-4. Cgroup-aware OOM Killer Random curiousness, why cgroup-aware oom killer and not memcg-aware oom killer? > 5-3. IO > 5-3-1. IO Interface Files > 5-3-2. Writeback > @@ -1002,6 +1003,37 @@ PAGE_SIZE multiple when read back. > high limit is used and monitored properly, this limit's > utility is limited to providing the final safety net. > > + memory.oom_kill_all_tasks > + > + A read-write single value file which exits on non-root s/exits/exists/ > + cgroups. The default is "0". > + > + Defines whether the OOM killer should treat the cgroup > + as a single entity during the victim selection. Isn't this true independent of the memory.oom_kill_all_tasks setting? The cgroup aware oom killer will consider memcg's as logical units when deciding what to kill with or without memory.oom_kill_all_tasks, right? I think you cover this fact in the cgroup aware oom killer section below so this might result in confusion if described alongside a setting of memory.oom_kill_all_tasks. > + > + If set, OOM killer will kill all belonging tasks in > + corresponding cgroup is selected as an OOM victim. Maybe "If set, the OOM killer will kill all threads attached to the memcg if selected as an OOM victim." is better? 
> + > + Be default, OOM killer respect /proc/pid/oom_score_adj value > + -1000, and will never kill the task, unless oom_kill_all_tasks > + is set. > + > + memory.oom_priority > + > + A read-write single value file which exits on non-root s/exits/exists/ > + cgroups. The default is "0". > + > + An integer number within the [-10000, 10000] range, > + which defines the order in which the OOM killer selects victim > + memory cgroups. > + > + OOM killer prefers memory cgroups with larger priority if they > + are populated with elegible tasks. s/elegible/eligible/ > + > + The oom_priority value is compared within sibling cgroups. > + > + The root cgroup has the oom_priority 0, which cannot be changed. > + > memory.events > A read-only flat-keyed file which exists on non-root cgroups. > The following entries are defined. Unless specified > @@ -1206,6 +1238,36 @@ POSIX_FADV_DONTNEED to relinquish the ownership of memory areas > belonging to the affected files to ensure correct memory ownership. > > > +Cgroup-aware OOM Killer > +~~~~~~~~~~~~~~~~~~~~~~~ > + > +Cgroup v2 memory controller implements a cgroup-aware OOM killer. > +It means that it treats memory cgroups as first class OOM entities. > + > +Under OOM conditions the memory controller tries to make the best > +choise of a victim, hierarchically looking for the largest memory > +consumer. By default, it will look for the biggest task in the > +biggest leaf cgroup. > + > +Be default, all cgroups have oom_priority 0, and OOM killer will > +chose the largest cgroup recursively on each level. For non-root > +cgroups it's possible to change the oom_priority, and it will cause > +the OOM killer to look athe the priority value first, and compare > +sizes only of cgroups with equal priority. Maybe some description of "largest" would be helpful here? I think you could briefly describe what is accounted for in the decisionmaking. s/athe/at the/ Reading through this, it makes me wonder if doing s/cgroup/memcg/ over most of it would be better. > + > +But a user can change this behavior by enabling the per-cgroup > +oom_kill_all_tasks option. If set, it causes the OOM killer treat > +the whole cgroup as an indivisible memory consumer. In case if it's > +selected as on OOM victim, all belonging tasks will be killed. > + > +Tasks in the root cgroup are treated as independent memory consumers, > +and are compared with other memory consumers (e.g. leaf cgroups). > +The root cgroup doesn't support the oom_kill_all_tasks feature. > + > +This affects both system- and cgroup-wide OOMs. For a cgroup-wide OOM > +the memory controller considers only cgroups belonging to the sub-tree > +of the OOM'ing cgroup. > + > IO > -- > -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pg0-f69.google.com (mail-pg0-f69.google.com [74.125.83.69]) by kanga.kvack.org (Postfix) with ESMTP id 6D1AC6B025F for ; Tue, 15 Aug 2017 08:16:25 -0400 (EDT) Received: by mail-pg0-f69.google.com with SMTP id 123so13206405pga.5 for ; Tue, 15 Aug 2017 05:16:25 -0700 (PDT) Received: from mx0a-00082601.pphosted.com (mx0a-00082601.pphosted.com. 
[67.231.145.42]) by mx.google.com with ESMTPS id u20si5480284pfa.678.2017.08.15.05.16.23 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Tue, 15 Aug 2017 05:16:24 -0700 (PDT) Date: Tue, 15 Aug 2017 13:15:58 +0100 From: Roman Gushchin Subject: Re: [v5 2/4] mm, oom: cgroup-aware OOM killer Message-ID: <20170815121558.GA15892@castle.dhcp.TheFacebook.com> References: <20170814183213.12319-1-guro@fb.com> <20170814183213.12319-3-guro@fb.com> MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Disposition: inline In-Reply-To: Sender: owner-linux-mm@kvack.org List-ID: To: David Rientjes Cc: linux-mm@kvack.org, Michal Hocko , Vladimir Davydov , Johannes Weiner , Tetsuo Handa , Tejun Heo , kernel-team@fb.com, cgroups@vger.kernel.org, linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org On Mon, Aug 14, 2017 at 03:42:54PM -0700, David Rientjes wrote: > On Mon, 14 Aug 2017, Roman Gushchin wrote: > > + > > +static long oom_evaluate_memcg(struct mem_cgroup *memcg, > > + const nodemask_t *nodemask) > > +{ > > + struct css_task_iter it; > > + struct task_struct *task; > > + int elegible = 0; > > + > > + css_task_iter_start(&memcg->css, 0, &it); > > + while ((task = css_task_iter_next(&it))) { > > + /* > > + * If there are no tasks, or all tasks have oom_score_adj set > > + * to OOM_SCORE_ADJ_MIN and oom_kill_all_tasks is not set, > > + * don't select this memory cgroup. > > + */ > > + if (!elegible && > > + (memcg->oom_kill_all_tasks || > > + task->signal->oom_score_adj != OOM_SCORE_ADJ_MIN)) > > + elegible = 1; > > I'm curious about the decision made in this conditional and how > oom_kill_memcg_member() ignores task->signal->oom_score_adj. It means > that memory.oom_kill_all_tasks overrides /proc/pid/oom_score_adj if it > should otherwise be disabled. > > It's undocumented in the changelog, but I'm questioning whether it's the > right decision. Doesn't it make sense to kill all tasks that are not oom > disabled, and allow the user to still protect certain processes by their > /proc/pid/oom_score_adj setting? Otherwise, there's no way to do that > protection without a sibling memcg and its own reservation of memory. I'm > thinking about a process that governs jobs inside the memcg and if there > is an oom kill, it wants to do logging and any cleanup necessary before > exiting itself. It seems like a powerful combination if coupled with oom > notification. Good question! I think, that an ability to override any oom_score_adj value and get all tasks killed is more important, than an ability to kill all processes with some exceptions. In your example someone still needs to look after the remaining process, and kill it after some timeout, if it will not quit by itself, right? The special treatment of the -1000 value (without oom_kill_all_tasks) is required only to not to break the existing setups. Generally, oom_score_adj should have a meaning only on a cgroup level, so extending it to the system level doesn't sound as a good idea. > > Also, s/elegible/eligible/ Shame on me :) Will fix, thanks! > > Otherwise, looks good! Great! Thank you for the reviewing and testing. Roman -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wm0-f70.google.com (mail-wm0-f70.google.com [74.125.82.70]) by kanga.kvack.org (Postfix) with ESMTP id 634796B02B4 for ; Tue, 15 Aug 2017 08:20:28 -0400 (EDT) Received: by mail-wm0-f70.google.com with SMTP id d24so1395200wmi.0 for ; Tue, 15 Aug 2017 05:20:28 -0700 (PDT) Received: from mx1.suse.de (mx2.suse.de. [195.135.220.15]) by mx.google.com with ESMTPS id x88si1192128wma.200.2017.08.15.05.20.27 for (version=TLS1 cipher=AES128-SHA bits=128/128); Tue, 15 Aug 2017 05:20:27 -0700 (PDT) Subject: Re: [v5 2/4] mm, oom: cgroup-aware OOM killer References: <20170814183213.12319-1-guro@fb.com> <20170814183213.12319-3-guro@fb.com> <20170815121558.GA15892@castle.dhcp.TheFacebook.com> From: Aleksa Sarai Message-ID: Date: Tue, 15 Aug 2017 22:20:18 +1000 MIME-Version: 1.0 In-Reply-To: <20170815121558.GA15892@castle.dhcp.TheFacebook.com> Content-Type: text/plain; charset=utf-8; format=flowed Content-Language: en-US Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Roman Gushchin , David Rientjes Cc: linux-mm@kvack.org, Michal Hocko , Vladimir Davydov , Johannes Weiner , Tetsuo Handa , Tejun Heo , kernel-team@fb.com, cgroups@vger.kernel.org, linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org On 08/15/2017 10:15 PM, Roman Gushchin wrote: > Generally, oom_score_adj should have a meaning only on a cgroup level, > so extending it to the system level doesn't sound as a good idea. But wasn't the original purpose of oom_score (and oom_score_adj) to work on a system level, aka "normal" OOM? Is there some peculiarity about memcg OOM that I'm missing? -- Aleksa Sarai Software Engineer (Containers) SUSE Linux GmbH https://www.cyphar.com/ -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wr0-f200.google.com (mail-wr0-f200.google.com [209.85.128.200]) by kanga.kvack.org (Postfix) with ESMTP id 7CB766B025F for ; Tue, 15 Aug 2017 08:58:23 -0400 (EDT) Received: by mail-wr0-f200.google.com with SMTP id n88so1171995wrb.0 for ; Tue, 15 Aug 2017 05:58:23 -0700 (PDT) Received: from mx0b-00082601.pphosted.com (mx0b-00082601.pphosted.com. [67.231.153.30]) by mx.google.com with ESMTPS id u62si1233928wmb.159.2017.08.15.05.58.21 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Tue, 15 Aug 2017 05:58:22 -0700 (PDT) Date: Tue, 15 Aug 2017 13:57:50 +0100 From: Roman Gushchin Subject: Re: [v5 2/4] mm, oom: cgroup-aware OOM killer Message-ID: <20170815125750.GB15892@castle.dhcp.TheFacebook.com> References: <20170814183213.12319-1-guro@fb.com> <20170814183213.12319-3-guro@fb.com> <20170815121558.GA15892@castle.dhcp.TheFacebook.com> MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Disposition: inline In-Reply-To: Sender: owner-linux-mm@kvack.org List-ID: To: Aleksa Sarai Cc: David Rientjes , linux-mm@kvack.org, Michal Hocko , Vladimir Davydov , Johannes Weiner , Tetsuo Handa , Tejun Heo , kernel-team@fb.com, cgroups@vger.kernel.org, linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org On Tue, Aug 15, 2017 at 10:20:18PM +1000, Aleksa Sarai wrote: > On 08/15/2017 10:15 PM, Roman Gushchin wrote: > > Generally, oom_score_adj should have a meaning only on a cgroup level, > > so extending it to the system level doesn't sound as a good idea. 
> > But wasn't the original purpose of oom_score (and oom_score_adj) to work on > a system level, aka "normal" OOM? Is there some peculiarity about memcg OOM > that I'm missing? I'm sorry, if it wasn't clear from my message, it's not about the system-wide OOM vs the memcg-wide OOM, it's about the isolation. In general, decision is made on memcg level first (based on oom_priority and size), and only then on a task level (based on size and oom_score_adj). Oom_score_adj affects which task inside the cgroup will be killed, but we never compare tasks from different cgroups. This is what I mean, when I'm saying, that oom_score_adj should not have a system-wide meaning. Thanks! -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pg0-f69.google.com (mail-pg0-f69.google.com [74.125.83.69]) by kanga.kvack.org (Postfix) with ESMTP id 5D4406B025F for ; Tue, 15 Aug 2017 10:14:20 -0400 (EDT) Received: by mail-pg0-f69.google.com with SMTP id f23so5736933pgn.15 for ; Tue, 15 Aug 2017 07:14:20 -0700 (PDT) Received: from mx0a-00082601.pphosted.com (mx0a-00082601.pphosted.com. [67.231.145.42]) by mx.google.com with ESMTPS id k1si5595541pfc.321.2017.08.15.07.14.18 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Tue, 15 Aug 2017 07:14:19 -0700 (PDT) Date: Tue, 15 Aug 2017 15:13:50 +0100 From: Roman Gushchin Subject: Re: [v5 4/4] mm, oom, docs: describe the cgroup-aware OOM killer Message-ID: <20170815141350.GA4510@castle.DHCP.thefacebook.com> References: <20170814183213.12319-1-guro@fb.com> <20170814183213.12319-5-guro@fb.com> MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Disposition: inline In-Reply-To: Sender: owner-linux-mm@kvack.org List-ID: To: David Rientjes Cc: linux-mm@kvack.org, Michal Hocko , Vladimir Davydov , Johannes Weiner , Tetsuo Handa , Tejun Heo , kernel-team@fb.com, cgroups@vger.kernel.org, linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org On Mon, Aug 14, 2017 at 03:52:26PM -0700, David Rientjes wrote: > On Mon, 14 Aug 2017, Roman Gushchin wrote: > > > diff --git a/Documentation/cgroup-v2.txt b/Documentation/cgroup-v2.txt > > index dec5afdaa36d..22108f31e09d 100644 > > --- a/Documentation/cgroup-v2.txt > > +++ b/Documentation/cgroup-v2.txt > > @@ -48,6 +48,7 @@ v1 is available under Documentation/cgroup-v1/. > > 5-2-1. Memory Interface Files > > 5-2-2. Usage Guidelines > > 5-2-3. Memory Ownership > > + 5-2-4. Cgroup-aware OOM Killer > > Random curiousness, why cgroup-aware oom killer and not memcg-aware oom > killer? I don't think we use the term "memcg" somewhere in v2 docs. Do you think that "Memory cgroup-aware OOM killer" is better? > > > 5-3. IO > > 5-3-1. IO Interface Files > > 5-3-2. Writeback > > @@ -1002,6 +1003,37 @@ PAGE_SIZE multiple when read back. > > high limit is used and monitored properly, this limit's > > utility is limited to providing the final safety net. > > > > + memory.oom_kill_all_tasks > > + > > + A read-write single value file which exits on non-root > > s/exits/exists/ Fixed. Thanks! > > > + cgroups. The default is "0". > > + > > + Defines whether the OOM killer should treat the cgroup > > + as a single entity during the victim selection. > > Isn't this true independent of the memory.oom_kill_all_tasks setting? 
> The cgroup aware oom killer will consider memcg's as logical units when > deciding what to kill with or without memory.oom_kill_all_tasks, right? > > I think you cover this fact in the cgroup aware oom killer section below > so this might result in confusion if described alongside a setting of > memory.oom_kill_all_tasks. > > > + > > + If set, OOM killer will kill all belonging tasks in > > + corresponding cgroup is selected as an OOM victim. > > Maybe > > "If set, the OOM killer will kill all threads attached to the memcg if > selected as an OOM victim." > > is better? Fixed to the following (to conform with core v2 concepts): If set, OOM killer will kill all processes attached to the cgroup if selected as an OOM victim. > > > + > > + Be default, OOM killer respect /proc/pid/oom_score_adj value > > + -1000, and will never kill the task, unless oom_kill_all_tasks > > + is set. > > + > > + memory.oom_priority > > + > > + A read-write single value file which exits on non-root > > s/exits/exists/ Fixed. > > > + cgroups. The default is "0". > > + > > + An integer number within the [-10000, 10000] range, > > + which defines the order in which the OOM killer selects victim > > + memory cgroups. > > + > > + OOM killer prefers memory cgroups with larger priority if they > > + are populated with elegible tasks. > > s/elegible/eligible/ Fixed. > > > + > > + The oom_priority value is compared within sibling cgroups. > > + > > + The root cgroup has the oom_priority 0, which cannot be changed. > > + > > memory.events > > A read-only flat-keyed file which exists on non-root cgroups. > > The following entries are defined. Unless specified > > @@ -1206,6 +1238,36 @@ POSIX_FADV_DONTNEED to relinquish the ownership of memory areas > > belonging to the affected files to ensure correct memory ownership. > > > > > > +Cgroup-aware OOM Killer > > +~~~~~~~~~~~~~~~~~~~~~~~ > > + > > +Cgroup v2 memory controller implements a cgroup-aware OOM killer. > > +It means that it treats memory cgroups as first class OOM entities. > > + > > +Under OOM conditions the memory controller tries to make the best > > +choise of a victim, hierarchically looking for the largest memory > > +consumer. By default, it will look for the biggest task in the > > +biggest leaf cgroup. > > + > > +Be default, all cgroups have oom_priority 0, and OOM killer will > > +chose the largest cgroup recursively on each level. For non-root > > +cgroups it's possible to change the oom_priority, and it will cause > > +the OOM killer to look athe the priority value first, and compare > > +sizes only of cgroups with equal priority. > > Maybe some description of "largest" would be helpful here? I think you > could briefly describe what is accounted for in the decisionmaking. I'm afraid that it's too implementation-defined to be described. Do you have an idea, how to describe it without going too much into details? > s/athe/at the/ Fixed. > > Reading through this, it makes me wonder if doing s/cgroup/memcg/ over > most of it would be better. I don't think memcg is a good user term, but I agree, that it's necessary to highlight the fact that a user should enable memory controller to get this functionality. Added a corresponding note. Thanks! Roman -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pg0-f70.google.com (mail-pg0-f70.google.com [74.125.83.70]) by kanga.kvack.org (Postfix) with ESMTP id 413526B025F for ; Tue, 15 Aug 2017 16:56:28 -0400 (EDT) Received: by mail-pg0-f70.google.com with SMTP id a186so35905638pge.7 for ; Tue, 15 Aug 2017 13:56:28 -0700 (PDT) Received: from mail-pg0-x236.google.com (mail-pg0-x236.google.com. [2607:f8b0:400e:c05::236]) by mx.google.com with ESMTPS id m15si6754830plk.400.2017.08.15.13.56.26 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Tue, 15 Aug 2017 13:56:26 -0700 (PDT) Received: by mail-pg0-x236.google.com with SMTP id i12so12566656pgr.3 for ; Tue, 15 Aug 2017 13:56:26 -0700 (PDT) Date: Tue, 15 Aug 2017 13:56:24 -0700 (PDT) From: David Rientjes Subject: Re: [v5 4/4] mm, oom, docs: describe the cgroup-aware OOM killer In-Reply-To: <20170815141350.GA4510@castle.DHCP.thefacebook.com> Message-ID: References: <20170814183213.12319-1-guro@fb.com> <20170814183213.12319-5-guro@fb.com> <20170815141350.GA4510@castle.DHCP.thefacebook.com> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-linux-mm@kvack.org List-ID: To: Roman Gushchin Cc: linux-mm@kvack.org, Michal Hocko , Vladimir Davydov , Johannes Weiner , Tetsuo Handa , Tejun Heo , kernel-team@fb.com, cgroups@vger.kernel.org, linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org On Tue, 15 Aug 2017, Roman Gushchin wrote: > > > diff --git a/Documentation/cgroup-v2.txt b/Documentation/cgroup-v2.txt > > > index dec5afdaa36d..22108f31e09d 100644 > > > --- a/Documentation/cgroup-v2.txt > > > +++ b/Documentation/cgroup-v2.txt > > > @@ -48,6 +48,7 @@ v1 is available under Documentation/cgroup-v1/. > > > 5-2-1. Memory Interface Files > > > 5-2-2. Usage Guidelines > > > 5-2-3. Memory Ownership > > > + 5-2-4. Cgroup-aware OOM Killer > > > > Random curiousness, why cgroup-aware oom killer and not memcg-aware oom > > killer? > > I don't think we use the term "memcg" somewhere in v2 docs. > Do you think that "Memory cgroup-aware OOM killer" is better? > I think it would be better to not describe it as its own entity, but rather a part of how the memory cgroup works, so simply describing it in section 5-2, perhaps as its own subsection, as how the oom killer works when using the memory cgroup is sufficient. I wouldn't separate it out as a distinct cgroup feature in the documentation. > > > + cgroups. The default is "0". > > > + > > > + Defines whether the OOM killer should treat the cgroup > > > + as a single entity during the victim selection. > > > > Isn't this true independent of the memory.oom_kill_all_tasks setting? > > The cgroup aware oom killer will consider memcg's as logical units when > > deciding what to kill with or without memory.oom_kill_all_tasks, right? > > > > I think you cover this fact in the cgroup aware oom killer section below > > so this might result in confusion if described alongside a setting of > > memory.oom_kill_all_tasks. > > I assume this is fixed so that it's documented that memory cgroups are considered logical units by the oom killer and that memory.oom_kill_all_tasks is separate? The former defines the policy on how a memory cgroup is targeted and the latter defines the mechanism it uses to free memory. > > > + If set, OOM killer will kill all belonging tasks in > > > + corresponding cgroup is selected as an OOM victim. 
> > > > Maybe > > > > "If set, the OOM killer will kill all threads attached to the memcg if > > selected as an OOM victim." > > > > is better? > > Fixed to the following (to conform with core v2 concepts): > If set, OOM killer will kill all processes attached to the cgroup > if selected as an OOM victim. > Thanks. > > > +Cgroup-aware OOM Killer > > > +~~~~~~~~~~~~~~~~~~~~~~~ > > > + > > > +Cgroup v2 memory controller implements a cgroup-aware OOM killer. > > > +It means that it treats memory cgroups as first class OOM entities. > > > + > > > +Under OOM conditions the memory controller tries to make the best > > > +choise of a victim, hierarchically looking for the largest memory > > > +consumer. By default, it will look for the biggest task in the > > > +biggest leaf cgroup. > > > + > > > +Be default, all cgroups have oom_priority 0, and OOM killer will > > > +chose the largest cgroup recursively on each level. For non-root > > > +cgroups it's possible to change the oom_priority, and it will cause > > > +the OOM killer to look athe the priority value first, and compare > > > +sizes only of cgroups with equal priority. > > > > Maybe some description of "largest" would be helpful here? I think you > > could briefly describe what is accounted for in the decisionmaking. > > I'm afraid that it's too implementation-defined to be described. > Do you have an idea, how to describe it without going too much into details? > The point is that "largest cgroup" is ambiguous here: largest in what sense? The cgroup with the largest number of processes attached? Using the largest amount of memory? I think the documentation should clearly define that the oom killer selects the memory cgroup that has the most memory managed at each level. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pg0-f70.google.com (mail-pg0-f70.google.com [74.125.83.70]) by kanga.kvack.org (Postfix) with ESMTP id D3B9F6B025F for ; Tue, 15 Aug 2017 17:47:17 -0400 (EDT) Received: by mail-pg0-f70.google.com with SMTP id y129so38682431pgy.1 for ; Tue, 15 Aug 2017 14:47:17 -0700 (PDT) Received: from mail-pg0-x230.google.com (mail-pg0-x230.google.com. [2607:f8b0:400e:c05::230]) by mx.google.com with ESMTPS id s2si6623578plk.239.2017.08.15.14.47.13 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Tue, 15 Aug 2017 14:47:13 -0700 (PDT) Received: by mail-pg0-x230.google.com with SMTP id u5so13344040pgn.0 for ; Tue, 15 Aug 2017 14:47:13 -0700 (PDT) Date: Tue, 15 Aug 2017 14:47:10 -0700 (PDT) From: David Rientjes Subject: Re: [v5 2/4] mm, oom: cgroup-aware OOM killer In-Reply-To: <20170815121558.GA15892@castle.dhcp.TheFacebook.com> Message-ID: References: <20170814183213.12319-1-guro@fb.com> <20170814183213.12319-3-guro@fb.com> <20170815121558.GA15892@castle.dhcp.TheFacebook.com> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-linux-mm@kvack.org List-ID: To: Roman Gushchin Cc: linux-mm@kvack.org, Michal Hocko , Vladimir Davydov , Johannes Weiner , Tetsuo Handa , Tejun Heo , kernel-team@fb.com, cgroups@vger.kernel.org, linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org On Tue, 15 Aug 2017, Roman Gushchin wrote: > > I'm curious about the decision made in this conditional and how > > oom_kill_memcg_member() ignores task->signal->oom_score_adj. 
It means > > that memory.oom_kill_all_tasks overrides /proc/pid/oom_score_adj if it > > should otherwise be disabled. > > > > It's undocumented in the changelog, but I'm questioning whether it's the > > right decision. Doesn't it make sense to kill all tasks that are not oom > > disabled, and allow the user to still protect certain processes by their > > /proc/pid/oom_score_adj setting? Otherwise, there's no way to do that > > protection without a sibling memcg and its own reservation of memory. I'm > > thinking about a process that governs jobs inside the memcg and if there > > is an oom kill, it wants to do logging and any cleanup necessary before > > exiting itself. It seems like a powerful combination if coupled with oom > > notification. > > Good question! > I think, that an ability to override any oom_score_adj value and get all tasks > killed is more important, than an ability to kill all processes with some > exceptions. > I'm disagreeing because getting all tasks killed is not necessarily something that only the kernel can do. If all processes are oom disabled, that's a configuration issue done by sysadmin and the kernel should decide to kill the next largest memory cgroup or lower priority memory cgroup. It's not killing things like sshd that intentionally oom disable themselves. You could argue that having an oom disabled process attached to these memcgs in the first place is also a configuration issue, but the problem is that in cgroup v2 with a restriction on processes only being attached at the leaf cgroups that there is no competition for memory in this case. I must assign memory resources to that sshd, or "Activity Manager" described by the cgroup v1 documentation, just to prevent it from being killed. I think the latter of what you describe, killing all processes with some exceptions, is actually quite powerful. I can guarantee that processes that set themselves to oom disabled are really oom disabled and I don't need to work around that in the cgroup hierarchy only because of this caveat. I can also oom disable my Activity Manger that wants to wait on oom notification and collect the oom kill logs, raise notifications, and perhaps restart the process that it manages. > In your example someone still needs to look after the remaining process, > and kill it after some timeout, if it will not quit by itself, right? > No, it can restart the process that was oom killed; or it can be sshd and I can still ssh into my machine. > The special treatment of the -1000 value (without oom_kill_all_tasks) > is required only to not to break the existing setups. > I think as a general principle that allowing an oom disabled process to be oom killed is incorrect and if you really do want these to be killed, then (1) either your oom_score_adj is already wrong or (2) you can wait on oom notification and exit. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wr0-f198.google.com (mail-wr0-f198.google.com [209.85.128.198]) by kanga.kvack.org (Postfix) with ESMTP id 7D17A6B025F for ; Wed, 16 Aug 2017 10:44:17 -0400 (EDT) Received: by mail-wr0-f198.google.com with SMTP id n88so6243031wrb.0 for ; Wed, 16 Aug 2017 07:44:17 -0700 (PDT) Received: from mx0b-00082601.pphosted.com (mx0b-00082601.pphosted.com. 
[67.231.153.30]) by mx.google.com with ESMTPS id d93si979676wma.132.2017.08.16.07.44.15 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Wed, 16 Aug 2017 07:44:16 -0700 (PDT) Date: Wed, 16 Aug 2017 15:43:44 +0100 From: Roman Gushchin Subject: Re: [v5 4/4] mm, oom, docs: describe the cgroup-aware OOM killer Message-ID: <20170816144344.GA29131@castle.DHCP.thefacebook.com> References: <20170814183213.12319-1-guro@fb.com> <20170814183213.12319-5-guro@fb.com> <20170815141350.GA4510@castle.DHCP.thefacebook.com> MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Disposition: inline In-Reply-To: Sender: owner-linux-mm@kvack.org List-ID: To: David Rientjes Cc: linux-mm@kvack.org, Michal Hocko , Vladimir Davydov , Johannes Weiner , Tetsuo Handa , Tejun Heo , kernel-team@fb.com, cgroups@vger.kernel.org, linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org On Tue, Aug 15, 2017 at 01:56:24PM -0700, David Rientjes wrote: > On Tue, 15 Aug 2017, Roman Gushchin wrote: > > > > > diff --git a/Documentation/cgroup-v2.txt b/Documentation/cgroup-v2.txt > > > > index dec5afdaa36d..22108f31e09d 100644 > > > > --- a/Documentation/cgroup-v2.txt > > > > +++ b/Documentation/cgroup-v2.txt > > > > @@ -48,6 +48,7 @@ v1 is available under Documentation/cgroup-v1/. > > > > 5-2-1. Memory Interface Files > > > > 5-2-2. Usage Guidelines > > > > 5-2-3. Memory Ownership > > > > + 5-2-4. Cgroup-aware OOM Killer > > > > > > Random curiousness, why cgroup-aware oom killer and not memcg-aware oom > > > killer? > > > > I don't think we use the term "memcg" somewhere in v2 docs. > > Do you think that "Memory cgroup-aware OOM killer" is better? > > > > I think it would be better to not describe it as its own entity, but > rather a part of how the memory cgroup works, so simply describing it in > section 5-2, perhaps as its own subsection, as how the oom killer works > when using the memory cgroup is sufficient. I wouldn't separate it out as > a distinct cgroup feature in the documentation. Ok I've got the idea, let me look, what I can do. I'll post an updated version soon. > > > > > + cgroups. The default is "0". > > > > + > > > > + Defines whether the OOM killer should treat the cgroup > > > > + as a single entity during the victim selection. > > > > > > Isn't this true independent of the memory.oom_kill_all_tasks setting? > > > The cgroup aware oom killer will consider memcg's as logical units when > > > deciding what to kill with or without memory.oom_kill_all_tasks, right? > > > > > > I think you cover this fact in the cgroup aware oom killer section below > > > so this might result in confusion if described alongside a setting of > > > memory.oom_kill_all_tasks. > > > > > I assume this is fixed so that it's documented that memory cgroups are > considered logical units by the oom killer and that > memory.oom_kill_all_tasks is separate? The former defines the policy on > how a memory cgroup is targeted and the latter defines the mechanism it > uses to free memory. Yes, I've fixed this. Thanks! > > > > +Cgroup-aware OOM Killer > > > > +~~~~~~~~~~~~~~~~~~~~~~~ > > > > + > > > > +Cgroup v2 memory controller implements a cgroup-aware OOM killer. > > > > +It means that it treats memory cgroups as first class OOM entities. > > > > + > > > > +Under OOM conditions the memory controller tries to make the best > > > > +choise of a victim, hierarchically looking for the largest memory > > > > +consumer. By default, it will look for the biggest task in the > > > > +biggest leaf cgroup. 
> > > > + > > > > +Be default, all cgroups have oom_priority 0, and OOM killer will > > > > +chose the largest cgroup recursively on each level. For non-root > > > > +cgroups it's possible to change the oom_priority, and it will cause > > > > +the OOM killer to look athe the priority value first, and compare > > > > +sizes only of cgroups with equal priority. > > > > > > Maybe some description of "largest" would be helpful here? I think you > > > could briefly describe what is accounted for in the decisionmaking. > > > > I'm afraid that it's too implementation-defined to be described. > > Do you have an idea, how to describe it without going too much into details? > > > > The point is that "largest cgroup" is ambiguous here: largest in what > sense? The cgroup with the largest number of processes attached? Using > the largest amount of memory? > > I think the documentation should clearly define that the oom killer > selects the memory cgroup that has the most memory managed at each level. No problems, I'll add a clarification. Thank you! -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pg0-f71.google.com (mail-pg0-f71.google.com [74.125.83.71]) by kanga.kvack.org (Postfix) with ESMTP id B69D96B025F for ; Wed, 16 Aug 2017 11:43:54 -0400 (EDT) Received: by mail-pg0-f71.google.com with SMTP id y129so65508002pgy.1 for ; Wed, 16 Aug 2017 08:43:54 -0700 (PDT) Received: from mx0a-00082601.pphosted.com (mx0a-00082601.pphosted.com. [67.231.145.42]) by mx.google.com with ESMTPS id z125si638203pgb.748.2017.08.16.08.43.52 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Wed, 16 Aug 2017 08:43:53 -0700 (PDT) Date: Wed, 16 Aug 2017 16:43:25 +0100 From: Roman Gushchin Subject: Re: [v5 2/4] mm, oom: cgroup-aware OOM killer Message-ID: <20170816154325.GB29131@castle.DHCP.thefacebook.com> References: <20170814183213.12319-1-guro@fb.com> <20170814183213.12319-3-guro@fb.com> <20170815121558.GA15892@castle.dhcp.TheFacebook.com> MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Disposition: inline In-Reply-To: Sender: owner-linux-mm@kvack.org List-ID: To: David Rientjes Cc: linux-mm@kvack.org, Michal Hocko , Vladimir Davydov , Johannes Weiner , Tetsuo Handa , Tejun Heo , kernel-team@fb.com, cgroups@vger.kernel.org, linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org On Tue, Aug 15, 2017 at 02:47:10PM -0700, David Rientjes wrote: > On Tue, 15 Aug 2017, Roman Gushchin wrote: > > > > I'm curious about the decision made in this conditional and how > > > oom_kill_memcg_member() ignores task->signal->oom_score_adj. It means > > > that memory.oom_kill_all_tasks overrides /proc/pid/oom_score_adj if it > > > should otherwise be disabled. > > > > > > It's undocumented in the changelog, but I'm questioning whether it's the > > > right decision. Doesn't it make sense to kill all tasks that are not oom > > > disabled, and allow the user to still protect certain processes by their > > > /proc/pid/oom_score_adj setting? Otherwise, there's no way to do that > > > protection without a sibling memcg and its own reservation of memory. I'm > > > thinking about a process that governs jobs inside the memcg and if there > > > is an oom kill, it wants to do logging and any cleanup necessary before > > > exiting itself. 
It seems like a powerful combination if coupled with oom > > > notification. > > > > Good question! > > I think, that an ability to override any oom_score_adj value and get all tasks > > killed is more important, than an ability to kill all processes with some > > exceptions. > > > > I'm disagreeing because getting all tasks killed is not necessarily > something that only the kernel can do. If all processes are oom disabled, > that's a configuration issue done by sysadmin and the kernel should decide > to kill the next largest memory cgroup or lower priority memory cgroup. > It's not killing things like sshd that intentionally oom disable > themselves. > > You could argue that having an oom disabled process attached to these > memcgs in the first place is also a configuration issue, but the problem > is that in cgroup v2 with a restriction on processes only being attached > at the leaf cgroups that there is no competition for memory in this case. > I must assign memory resources to that sshd, or "Activity Manager" > described by the cgroup v1 documentation, just to prevent it from being > killed. > > I think the latter of what you describe, killing all processes with some > exceptions, is actually quite powerful. I can guarantee that processes > that set themselves to oom disabled are really oom disabled and I don't > need to work around that in the cgroup hierarchy only because of this > caveat. I can also oom disable my Activity Manger that wants to wait on > oom notification and collect the oom kill logs, raise notifications, and > perhaps restart the process that it manage. > > > In your example someone still needs to look after the remaining process, > > and kill it after some timeout, if it will not quit by itself, right? > > > > No, it can restart the process that was oom killed; or it can be sshd and > I can still ssh into my machine. > > > The special treatment of the -1000 value (without oom_kill_all_tasks) > > is required only to not to break the existing setups. > > > > I think as a general principle that allowing an oom disabled process to be > oom killed is incorrect and if you really do want these to be killed, then > (1) either your oom_score_adj is already wrong or (2) you can wait on oom > notification and exit. It's natural to expect that inside a container there are their own sshd, "activity manager" or some other stuff, which can play with oom_score_adj. If it can override the upper cgroup-level settings, the whole delegation model is broken. You can think about the oom_kill_all_tasks like the panic_on_oom, but on a cgroup level. It should _guarantee_, that in case of oom the whole cgroup will be destroyed completely, and will not remain in a non-consistent state. The model you're describing is based on a trust given to these oom-unkillable processes on system level. But we can't really trust some unknown processes inside a cgroup that they will be able to do some useful work and finish in a reasonable time; especially in case of a global memory shortage. That means someone needs to look at cgroup after each OOM and check if there are remaining tasks. If so, the whole functionality is useless. If some complex post-oom handling is required, it should be performed from another cgroup (protected by the lower oom_priority value). 
So, for example:

      root
       |
       A
      / \
     B   C

B: oom_priority=10, oom_kill_all_tasks=1, contains workload
C: oom_priority=0, oom_kill_all_tasks=0, contains control stuff

If B is killed by OOM, an "activity manager" in C can be notified and perform some actions. Thanks! Roman -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wr0-f200.google.com (mail-wr0-f200.google.com [209.85.128.200]) by kanga.kvack.org (Postfix) with ESMTP id 6CEE46B025F for ; Thu, 17 Aug 2017 08:20:21 -0400 (EDT) Received: by mail-wr0-f200.google.com with SMTP id i59so12766744wri.3 for ; Thu, 17 Aug 2017 05:20:21 -0700 (PDT) Received: from mx0a-00082601.pphosted.com (mx0b-00082601.pphosted.com. [67.231.153.30]) by mx.google.com with ESMTPS id 64si1941671wrd.431.2017.08.17.05.20.19 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Thu, 17 Aug 2017 05:20:19 -0700 (PDT) Date: Thu, 17 Aug 2017 13:16:47 +0100 From: Roman Gushchin Subject: Re: [v5 4/4] mm, oom, docs: describe the cgroup-aware OOM killer Message-ID: <20170817121647.GA26107@castle.DHCP.thefacebook.com> References: <20170814183213.12319-1-guro@fb.com> <20170814183213.12319-5-guro@fb.com> <20170815141350.GA4510@castle.DHCP.thefacebook.com> MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Disposition: inline In-Reply-To: Sender: owner-linux-mm@kvack.org List-ID: To: David Rientjes Cc: linux-mm@kvack.org, Michal Hocko , Vladimir Davydov , Johannes Weiner , Tetsuo Handa , Tejun Heo , kernel-team@fb.com, cgroups@vger.kernel.org, linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org Hi David! Please, find an updated version of docs patch below. Thanks! Roman -- From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pg0-f71.google.com (mail-pg0-f71.google.com [74.125.83.71]) by kanga.kvack.org (Postfix) with ESMTP id 6D3E7280310 for ; Sun, 20 Aug 2017 20:41:28 -0400 (EDT) Received: by mail-pg0-f71.google.com with SMTP id m133so221471288pga.2 for ; Sun, 20 Aug 2017 17:41:28 -0700 (PDT) Received: from mail-pg0-x233.google.com (mail-pg0-x233.google.com. [2607:f8b0:400e:c05::233]) by mx.google.com with ESMTPS id s8si2767919pgc.825.2017.08.20.17.41.27 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Sun, 20 Aug 2017 17:41:27 -0700 (PDT) Received: by mail-pg0-x233.google.com with SMTP id n4so25846130pgn.1 for ; Sun, 20 Aug 2017 17:41:27 -0700 (PDT) Date: Sun, 20 Aug 2017 17:41:25 -0700 (PDT) From: David Rientjes Subject: Re: [v5 4/4] mm, oom, docs: describe the cgroup-aware OOM killer In-Reply-To: <20170817121647.GA26107@castle.DHCP.thefacebook.com> Message-ID: References: <20170814183213.12319-1-guro@fb.com> <20170814183213.12319-5-guro@fb.com> <20170815141350.GA4510@castle.DHCP.thefacebook.com> <20170817121647.GA26107@castle.DHCP.thefacebook.com> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-linux-mm@kvack.org List-ID: To: Roman Gushchin Cc: linux-mm@kvack.org, Michal Hocko , Vladimir Davydov , Johannes Weiner , Tetsuo Handa , Tejun Heo , kernel-team@fb.com, cgroups@vger.kernel.org, linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org On Thu, 17 Aug 2017, Roman Gushchin wrote: > Hi David! > > Please, find an updated version of docs patch below. > Looks much better, thanks!
I think the only pending issue is discussing the relationship of memory.oom_kill_all_tasks with /proc/pid/oom_score_adj == OOM_SCORE_ADJ_MIN. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pg0-f72.google.com (mail-pg0-f72.google.com [74.125.83.72]) by kanga.kvack.org (Postfix) with ESMTP id 4D3616B04BD for ; Sun, 20 Aug 2017 20:50:31 -0400 (EDT) Received: by mail-pg0-f72.google.com with SMTP id b8so20159032pgn.10 for ; Sun, 20 Aug 2017 17:50:31 -0700 (PDT) Received: from mail-pg0-x232.google.com (mail-pg0-x232.google.com. [2607:f8b0:400e:c05::232]) by mx.google.com with ESMTPS id m2si5020151pgt.892.2017.08.20.17.50.29 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Sun, 20 Aug 2017 17:50:29 -0700 (PDT) Received: by mail-pg0-x232.google.com with SMTP id y129so89567076pgy.4 for ; Sun, 20 Aug 2017 17:50:29 -0700 (PDT) Date: Sun, 20 Aug 2017 17:50:27 -0700 (PDT) From: David Rientjes Subject: Re: [v5 2/4] mm, oom: cgroup-aware OOM killer In-Reply-To: <20170816154325.GB29131@castle.DHCP.thefacebook.com> Message-ID: References: <20170814183213.12319-1-guro@fb.com> <20170814183213.12319-3-guro@fb.com> <20170815121558.GA15892@castle.dhcp.TheFacebook.com> <20170816154325.GB29131@castle.DHCP.thefacebook.com> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-linux-mm@kvack.org List-ID: To: Roman Gushchin Cc: linux-mm@kvack.org, Michal Hocko , Vladimir Davydov , Johannes Weiner , Tetsuo Handa , Tejun Heo , kernel-team@fb.com, cgroups@vger.kernel.org, linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org On Wed, 16 Aug 2017, Roman Gushchin wrote: > It's natural to expect that inside a container there are their own sshd, > "activity manager" or some other stuff, which can play with oom_score_adj. > If it can override the upper cgroup-level settings, the whole delegation model > is broken. > I don't think any delegation model related to core cgroups or memory cgroup is broken, I think it's based on how memory.oom_kill_all_tasks is defined. It could very well behave as memory.oom_kill_all_eligible_tasks when enacted upon. > You can think about the oom_kill_all_tasks like the panic_on_oom, > but on a cgroup level. It should _guarantee_, that in case of oom > the whole cgroup will be destroyed completely, and will not remain > in a non-consistent state. > Only CAP_SYS_ADMIN has this ability to set /proc/pid/oom_score_adj to OOM_SCORE_ADJ_MIN, so it preserves the ability to change that setting, if needed, when it sets memory.oom_kill_all_tasks. If a user gains permissions to change memory.oom_kill_all_tasks, I disagree it should override the CAP_SYS_ADMIN setting of /proc/pid/oom_score_adj. I would prefer not to exclude oom disabled processes to their own sibling cgroups because they would require their own reservation with cgroup v2 and it makes the single hierarchy model much more difficult to arrange alongside cpusets, for example. > The model you're describing is based on a trust given to these oom-unkillable > processes on system level. But we can't really trust some unknown processes > inside a cgroup that they will be able to do some useful work and finish > in a reasonable time; especially in case of a global memory shortage. Yes, we prefer to panic instead of sshd, for example, being oom killed. 
We trust that sshd, as well as our own activity manager and security daemons are trusted to do useful work and that we never want the kernel to do this. I'm not sure why you are describing processes that CAP_SYS_ADMIN has set to be oom disabled as unknown processes. I'd be interested in hearing the opinions of others related to a per-memcg knob being allowed to override the setting of the sysadmin. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wr0-f197.google.com (mail-wr0-f197.google.com [209.85.128.197]) by kanga.kvack.org (Postfix) with ESMTP id 073626B04E7 for ; Mon, 21 Aug 2017 05:48:08 -0400 (EDT) Received: by mail-wr0-f197.google.com with SMTP id q49so9485261wrb.14 for ; Mon, 21 Aug 2017 02:48:07 -0700 (PDT) Received: from mx0a-00082601.pphosted.com (mx0b-00082601.pphosted.com. [67.231.153.30]) by mx.google.com with ESMTPS id 204si5624985wmx.93.2017.08.21.02.48.05 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Mon, 21 Aug 2017 02:48:06 -0700 (PDT) Date: Mon, 21 Aug 2017 10:46:56 +0100 From: Roman Gushchin Subject: Re: [v5 2/4] mm, oom: cgroup-aware OOM killer Message-ID: <20170821094656.GA13899@castle.dhcp.TheFacebook.com> References: <20170814183213.12319-1-guro@fb.com> <20170814183213.12319-3-guro@fb.com> <20170815121558.GA15892@castle.dhcp.TheFacebook.com> <20170816154325.GB29131@castle.DHCP.thefacebook.com> MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Disposition: inline In-Reply-To: Sender: owner-linux-mm@kvack.org List-ID: To: David Rientjes Cc: linux-mm@kvack.org, Michal Hocko , Vladimir Davydov , Johannes Weiner , Tetsuo Handa , Tejun Heo , kernel-team@fb.com, cgroups@vger.kernel.org, linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org On Sun, Aug 20, 2017 at 05:50:27PM -0700, David Rientjes wrote: > On Wed, 16 Aug 2017, Roman Gushchin wrote: > > > It's natural to expect that inside a container there are their own sshd, > > "activity manager" or some other stuff, which can play with oom_score_adj. > > If it can override the upper cgroup-level settings, the whole delegation model > > is broken. > > > > I don't think any delegation model related to core cgroups or memory > cgroup is broken, I think it's based on how memory.oom_kill_all_tasks is > defined. It could very well behave as memory.oom_kill_all_eligible_tasks > when enacted upon. > > > You can think about the oom_kill_all_tasks like the panic_on_oom, > > but on a cgroup level. It should _guarantee_, that in case of oom > > the whole cgroup will be destroyed completely, and will not remain > > in a non-consistent state. > > > > Only CAP_SYS_ADMIN has this ability to set /proc/pid/oom_score_adj to CAP_SYS_RESOURCE > OOM_SCORE_ADJ_MIN, so it preserves the ability to change that setting, if > needed, when it sets memory.oom_kill_all_tasks. If a user gains > permissions to change memory.oom_kill_all_tasks, I disagree it should > override the CAP_SYS_ADMIN setting of /proc/pid/oom_score_adj. > > I would prefer not to exclude oom disabled processes to their own sibling > cgroups because they would require their own reservation with cgroup v2 > and it makes the single hierarchy model much more difficult to arrange > alongside cpusets, for example. > > > The model you're describing is based on a trust given to these oom-unkillable > > processes on system level. 
But we can't really trust some unknown processes > > inside a cgroup that they will be able to do some useful work and finish > > in a reasonable time; especially in case of a global memory shortage. > > Yes, we prefer to panic instead of sshd, for example, being oom killed. > We trust that sshd, as well as our own activity manager and security > daemons are trusted to do useful work and that we never want the kernel to > do this. I'm not sure why you are describing processes that CAP_SYS_ADMIN > has set to be oom disabled as unknown processes. > > I'd be interested in hearing the opinions of others related to a per-memcg > knob being allowed to override the setting of the sysadmin. Sure, me too. Thanks! -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wr0-f199.google.com (mail-wr0-f199.google.com [209.85.128.199]) by kanga.kvack.org (Postfix) with ESMTP id 125942806F4 for ; Tue, 22 Aug 2017 13:03:55 -0400 (EDT) Received: by mail-wr0-f199.google.com with SMTP id p14so28151606wrg.8 for ; Tue, 22 Aug 2017 10:03:55 -0700 (PDT) Received: from gum.cmpxchg.org (gum.cmpxchg.org. [85.214.110.215]) by mx.google.com with ESMTPS id q28si13364885edb.390.2017.08.22.10.03.52 for (version=TLS1_2 cipher=ECDHE-RSA-CHACHA20-POLY1305 bits=256/256); Tue, 22 Aug 2017 10:03:52 -0700 (PDT) Date: Tue, 22 Aug 2017 13:03:44 -0400 From: Johannes Weiner Subject: Re: [v5 2/4] mm, oom: cgroup-aware OOM killer Message-ID: <20170822170344.GA13547@cmpxchg.org> References: <20170814183213.12319-1-guro@fb.com> <20170814183213.12319-3-guro@fb.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20170814183213.12319-3-guro@fb.com> Sender: owner-linux-mm@kvack.org List-ID: To: Roman Gushchin Cc: linux-mm@kvack.org, Michal Hocko , Vladimir Davydov , Tetsuo Handa , David Rientjes , Tejun Heo , kernel-team@fb.com, cgroups@vger.kernel.org, linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org Hi Roman, great work! This looks mostly good to me now. Below are some nitpicks concerning naming and code layout, but nothing major. On Mon, Aug 14, 2017 at 07:32:11PM +0100, Roman Gushchin wrote: > @@ -39,6 +39,7 @@ struct oom_control { > unsigned long totalpages; > struct task_struct *chosen; > unsigned long chosen_points; > + struct mem_cgroup *chosen_memcg; > }; Please rename 'chosen' to 'chosen_task' to make the distinction to chosen_memcg clearer. The ordering is a little weird too, with chosen_points in between. chosen_task chosen_memcg chosen_points ? > @@ -2639,6 +2639,181 @@ static inline bool memcg_has_children(struct mem_cgroup *memcg) > return ret; > } > > +static long memcg_oom_badness(struct mem_cgroup *memcg, > + const nodemask_t *nodemask) > +{ > + long points = 0; > + int nid; > + > + for_each_node_state(nid, N_MEMORY) { > + if (nodemask && !node_isset(nid, *nodemask)) > + continue; > + > + points += mem_cgroup_node_nr_lru_pages(memcg, nid, > + LRU_ALL_ANON | BIT(LRU_UNEVICTABLE)); > + } > + > + points += memcg_page_state(memcg, MEMCG_KERNEL_STACK_KB) / > + (PAGE_SIZE / 1024); > + points += memcg_page_state(memcg, NR_SLAB_UNRECLAIMABLE); NR_SLAB_UNRECLAIMABLE is now accounted per-lruvec, which takes nodeness into account, and so would be more accurate here. You can get it with mem_cgroup_lruvec() and lruvec_page_state(). 
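For example, a minimal sketch of folding it into the existing per-node loop (illustrative only, not the posted patch; it assumes the 4.13-era helpers, i.e. mem_cgroup_lruvec() taking a pgdat and lruvec_page_state() taking a node_stat_item):

	for_each_node_state(nid, N_MEMORY) {
		struct lruvec *lruvec;

		if (nodemask && !node_isset(nid, *nodemask))
			continue;

		points += mem_cgroup_node_nr_lru_pages(memcg, nid,
				LRU_ALL_ANON | BIT(LRU_UNEVICTABLE));

		/* per-node unreclaimable slab via the lruvec stats */
		lruvec = mem_cgroup_lruvec(NODE_DATA(nid), memcg);
		points += lruvec_page_state(lruvec, NR_SLAB_UNRECLAIMABLE);
	}

That also keeps the slab contribution consistent with the nodemask filtering already applied to the LRU counters.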
> + points += memcg_page_state(memcg, MEMCG_SOCK); > + points += memcg_page_state(memcg, MEMCG_SWAP); > + > + return points; > +} > + > +static long oom_evaluate_memcg(struct mem_cgroup *memcg, > + const nodemask_t *nodemask) > +{ > + struct css_task_iter it; > + struct task_struct *task; > + int elegible = 0; eligible > + > + css_task_iter_start(&memcg->css, 0, &it); > + while ((task = css_task_iter_next(&it))) { > + /* > + * If there are no tasks, or all tasks have oom_score_adj set > + * to OOM_SCORE_ADJ_MIN and oom_kill_all_tasks is not set, > + * don't select this memory cgroup. > + */ > + if (!elegible && > + (memcg->oom_kill_all_tasks || > + task->signal->oom_score_adj != OOM_SCORE_ADJ_MIN)) > + elegible = 1; This is a little awkward to read. How about something like this: /* * When killing individual tasks, we respect OOM score adjustments: * at least one task in the group needs to be killable for the group * to be oomable. * * Also check that previous OOM kills have finished, and abort if * there are any pending OOM victims. */ oomable = memcg->oom_kill_all_tasks; while ((task = css_task_iter_next(&it))) { if (!oomable && task->signal_oom_score_adj != OOM_SCORE_ADJ_MIN) oomable = 1; > + if (tsk_is_oom_victim(task) && > + !test_bit(MMF_OOM_SKIP, &task->signal->oom_mm->flags)) { > + elegible = -1; > + break; > + } > + } > + css_task_iter_end(&it); etc. > + > + return elegible > 0 ? memcg_oom_badness(memcg, nodemask) : elegible; I find these much easier to read if broken up, even if it's more LOC: if (eligible <= 0) return eligible; return memcg_oom_badness(memcg, nodemask); > +static void select_victim_memcg(struct mem_cgroup *root, struct oom_control *oc) > +{ > + struct mem_cgroup *iter, *parent; > + > + for_each_mem_cgroup_tree(iter, root) { > + if (memcg_has_children(iter)) { > + iter->oom_score = 0; > + continue; > + } > + > + iter->oom_score = oom_evaluate_memcg(iter, oc->nodemask); > + if (iter->oom_score == -1) { Please add comments to document the special returns. Maybe #defines would be clearer, too. > + oc->chosen_memcg = (void *)-1UL; > + mem_cgroup_iter_break(root, iter); > + return; > + } > + > + if (!iter->oom_score) > + continue; Same here. Maybe a switch would be suitable to handle the abort/no-score cases. 
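Something along these lines, as a sketch (the MEMCG_OOM_* names are made up here for illustration, they are not in the patch):

/* special oom_evaluate_memcg() return values */
#define MEMCG_OOM_ABORT		(-1)	/* pending OOM victim found, abort selection */
#define MEMCG_OOM_NO_SCORE	0	/* no eligible tasks, skip this memcg */

and then in the loop in select_victim_memcg():

	switch (iter->oom_score) {
	case MEMCG_OOM_ABORT:
		oc->chosen_memcg = (void *)-1UL;
		mem_cgroup_iter_break(root, iter);
		return;
	case MEMCG_OOM_NO_SCORE:
		/* continue applies to the enclosing for_each_mem_cgroup_tree() loop */
		continue;
	default:
		break;
	}

That makes the two special cases and the fall-through to the real score explicit.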
> + for (parent = parent_mem_cgroup(iter); parent && parent != root; > + parent = parent_mem_cgroup(parent)) > + parent->oom_score += iter->oom_score; > + } > + > + for (;;) { > + struct cgroup_subsys_state *css; > + struct mem_cgroup *memcg = NULL; > + long score = LONG_MIN; > + > + css_for_each_child(css, &root->css) { > + struct mem_cgroup *iter = mem_cgroup_from_css(css); > + > + if (iter->oom_score > score) { > + memcg = iter; > + score = iter->oom_score; > + } > + } > + > + if (!memcg) { > + if (oc->memcg && root == oc->memcg) { > + oc->chosen_memcg = oc->memcg; > + css_get(&oc->chosen_memcg->css); > + oc->chosen_points = oc->memcg->oom_score; > + } > + break; > + } > + > + if (memcg->oom_kill_all_tasks || !memcg_has_children(memcg)) { > + oc->chosen_memcg = memcg; > + css_get(&oc->chosen_memcg->css); > + oc->chosen_points = score; > + break; > + } > + > + root = memcg; > + } > +} > + > +static void select_victim_root_cgroup_task(struct oom_control *oc) > +{ > + struct css_task_iter it; > + struct task_struct *task; > + int ret = 0; > + > + css_task_iter_start(&root_mem_cgroup->css, 0, &it); > + while (!ret && (task = css_task_iter_next(&it))) > + ret = oom_evaluate_task(task, oc); > + css_task_iter_end(&it); > +} > + > +bool mem_cgroup_select_oom_victim(struct oom_control *oc) > +{ > + struct mem_cgroup *root = root_mem_cgroup; > + > + oc->chosen = NULL; > + oc->chosen_memcg = NULL; > + > + if (mem_cgroup_disabled()) > + return false; > + > + if (!cgroup_subsys_on_dfl(memory_cgrp_subsys)) > + return false; > + > + if (oc->memcg) > + root = oc->memcg; > + > + rcu_read_lock(); > + > + select_victim_memcg(root, oc); > + if (oc->chosen_memcg == (void *)-1UL) { > + /* Existing OOM victims are found. */ > + rcu_read_unlock(); > + return true; > + } It would be good to format this branch like the block below, with a newline and the comment before the branch block rather than inside. That would also set apart the call to select_victim_memcg(), which is the main workhorse of this function. > + /* > + * For system-wide OOMs we should consider tasks in the root cgroup > + * with oom_score larger than oc->chosen_points. > + */ > + if (!oc->memcg) { > + select_victim_root_cgroup_task(oc); > + > + if (oc->chosen && oc->chosen_memcg) { > + /* > + * If we've decided to kill a task in the root memcg, > + * release chosen_memcg. > + */ > + css_put(&oc->chosen_memcg->css); > + oc->chosen_memcg = NULL; > + } > + } ^^ like this one. > + > + rcu_read_unlock(); > + > + return !!oc->chosen || !!oc->chosen_memcg; The !! are detrimental to readability and shouldn't be necessary. 
> @@ -5190,6 +5365,33 @@ static ssize_t memory_max_write(struct kernfs_open_file *of, > return nbytes; > } > > +static int memory_oom_kill_all_tasks_show(struct seq_file *m, void *v) > +{ > + struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m)); > + bool oom_kill_all_tasks = memcg->oom_kill_all_tasks; > + > + seq_printf(m, "%d\n", oom_kill_all_tasks); > + > + return 0; > +} > + > +static ssize_t memory_oom_kill_all_tasks_write(struct kernfs_open_file *of, > + char *buf, size_t nbytes, > + loff_t off) > +{ > + struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of)); > + int oom_kill_all_tasks; > + int err; > + > + err = kstrtoint(strstrip(buf), 0, &oom_kill_all_tasks); > + if (err) > + return err; > + > + memcg->oom_kill_all_tasks = !!oom_kill_all_tasks; > + > + return nbytes; > +} > + > static int memory_events_show(struct seq_file *m, void *v) > { > struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m)); > @@ -5310,6 +5512,12 @@ static struct cftype memory_files[] = { > .write = memory_max_write, > }, > { > + .name = "oom_kill_all_tasks", > + .flags = CFTYPE_NOT_ON_ROOT, > + .seq_show = memory_oom_kill_all_tasks_show, > + .write = memory_oom_kill_all_tasks_write, > + }, This name is quite a mouthful and reminiscent of the awkward v1 interface names. It doesn't really go well with the v2 names. How about memory.oom_group? > + { > .name = "events", > .flags = CFTYPE_NOT_ON_ROOT, > .file_offset = offsetof(struct mem_cgroup, events_file), > diff --git a/mm/oom_kill.c b/mm/oom_kill.c > index 5c29a3dd591b..28e42a0d5eee 100644 > --- a/mm/oom_kill.c > +++ b/mm/oom_kill.c > @@ -288,7 +288,7 @@ static enum oom_constraint constrained_alloc(struct oom_control *oc) > return CONSTRAINT_NONE; > } > > -static int oom_evaluate_task(struct task_struct *task, void *arg) > +int oom_evaluate_task(struct task_struct *task, void *arg) > { > struct oom_control *oc = arg; > unsigned long points; > @@ -823,6 +823,9 @@ static void __oom_kill_process(struct task_struct *victim) > struct mm_struct *mm; > bool can_oom_reap = true; > > + if (is_global_init(victim) || (victim->flags & PF_KTHREAD)) > + return; > + > p = find_lock_task_mm(victim); > if (!p) { > put_task_struct(victim); > @@ -958,6 +961,60 @@ static void oom_kill_process(struct oom_control *oc, const char *message) > put_task_struct(victim); > } > > +static int oom_kill_memcg_member(struct task_struct *task, void *unused) > +{ > + if (!tsk_is_oom_victim(task)) > + __oom_kill_process(task); > + return 0; > +} > + > +static bool oom_kill_memcg_victim(struct oom_control *oc) > +{ > + static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL, > + DEFAULT_RATELIMIT_BURST); > + > + if (oc->chosen) { > + if (oc->chosen != (void *)-1UL) { > + if (__ratelimit(&oom_rs)) > + dump_header(oc, oc->chosen); > + > + __oom_kill_process(oc->chosen); > + put_task_struct(oc->chosen); > + schedule_timeout_killable(1); > + } > + return true; > + > + } else if (oc->chosen_memcg) { > + if (oc->chosen_memcg == (void *)-1UL) > + return true; Can you format the above chosen == (void *)-1UL the same way? That makes it easier to see that it's checking the same thing. Thanks -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wr0-f199.google.com (mail-wr0-f199.google.com [209.85.128.199]) by kanga.kvack.org (Postfix) with ESMTP id 79C852806F4 for ; Tue, 22 Aug 2017 13:07:01 -0400 (EDT) Received: by mail-wr0-f199.google.com with SMTP id a110so14992523wrc.1 for ; Tue, 22 Aug 2017 10:07:01 -0700 (PDT) Received: from gum.cmpxchg.org (gum.cmpxchg.org. [85.214.110.215]) by mx.google.com with ESMTPS id 20si12946405edw.555.2017.08.22.10.06.59 for (version=TLS1_2 cipher=ECDHE-RSA-CHACHA20-POLY1305 bits=256/256); Tue, 22 Aug 2017 10:06:59 -0700 (PDT) Date: Tue, 22 Aug 2017 13:06:55 -0400 From: Johannes Weiner Subject: Re: [v5 1/4] mm, oom: refactor the oom_kill_process() function Message-ID: <20170822170655.GB13547@cmpxchg.org> References: <20170814183213.12319-1-guro@fb.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20170814183213.12319-1-guro@fb.com> Sender: owner-linux-mm@kvack.org List-ID: To: Roman Gushchin Cc: linux-mm@kvack.org, Michal Hocko , Vladimir Davydov , Tetsuo Handa , David Rientjes , Tejun Heo , kernel-team@fb.com, cgroups@vger.kernel.org, linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org On Mon, Aug 14, 2017 at 07:32:09PM +0100, Roman Gushchin wrote: > @@ -817,67 +817,12 @@ static bool task_will_free_mem(struct task_struct *task) > return ret; > } > > -static void oom_kill_process(struct oom_control *oc, const char *message) > +static void __oom_kill_process(struct task_struct *victim) oom_kill_task()? -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pf0-f197.google.com (mail-pf0-f197.google.com [209.85.192.197]) by kanga.kvack.org (Postfix) with ESMTP id EE70C6B04FF for ; Wed, 23 Aug 2017 08:30:30 -0400 (EDT) Received: by mail-pf0-f197.google.com with SMTP id a2so13017559pfj.12 for ; Wed, 23 Aug 2017 05:30:30 -0700 (PDT) Received: from mx0a-00082601.pphosted.com (mx0a-00082601.pphosted.com. [67.231.145.42]) by mx.google.com with ESMTPS id w23si1032733plk.942.2017.08.23.05.30.29 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Wed, 23 Aug 2017 05:30:29 -0700 (PDT) Date: Wed, 23 Aug 2017 13:30:04 +0100 From: Roman Gushchin Subject: Re: [v5 1/4] mm, oom: refactor the oom_kill_process() function Message-ID: <20170823123004.GA10095@castle.dhcp.TheFacebook.com> References: <20170814183213.12319-1-guro@fb.com> <20170822170655.GB13547@cmpxchg.org> MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Disposition: inline In-Reply-To: <20170822170655.GB13547@cmpxchg.org> Sender: owner-linux-mm@kvack.org List-ID: To: Johannes Weiner Cc: linux-mm@kvack.org, Michal Hocko , Vladimir Davydov , Tetsuo Handa , David Rientjes , Tejun Heo , kernel-team@fb.com, cgroups@vger.kernel.org, linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org On Tue, Aug 22, 2017 at 01:06:55PM -0400, Johannes Weiner wrote: > On Mon, Aug 14, 2017 at 07:32:09PM +0100, Roman Gushchin wrote: > > @@ -817,67 +817,12 @@ static bool task_will_free_mem(struct task_struct *task) > > return ret; > > } > > > > -static void oom_kill_process(struct oom_control *oc, const char *message) > > +static void __oom_kill_process(struct task_struct *victim) > > oom_kill_task()? Not sure, as it kills all tasks which are sharing mm with the given task. 
Also, it will be confusing to have oom_kill_process() and oom_kill_task(), where the actual difference is in how much verbose they are, and if it's allowed to perfer a child process. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pf0-f200.google.com (mail-pf0-f200.google.com [209.85.192.200]) by kanga.kvack.org (Postfix) with ESMTP id CBA416B04EA for ; Wed, 23 Aug 2017 12:21:06 -0400 (EDT) Received: by mail-pf0-f200.google.com with SMTP id a2so4165584pfj.12 for ; Wed, 23 Aug 2017 09:21:06 -0700 (PDT) Received: from mx0a-00082601.pphosted.com (mx0a-00082601.pphosted.com. [67.231.145.42]) by mx.google.com with ESMTPS id m16si1390952pli.707.2017.08.23.09.21.01 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Wed, 23 Aug 2017 09:21:01 -0700 (PDT) Date: Wed, 23 Aug 2017 17:20:31 +0100 From: Roman Gushchin Subject: Re: [v5 2/4] mm, oom: cgroup-aware OOM killer Message-ID: <20170823162031.GA13578@castle.dhcp.TheFacebook.com> References: <20170814183213.12319-1-guro@fb.com> <20170814183213.12319-3-guro@fb.com> <20170822170344.GA13547@cmpxchg.org> MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Disposition: inline In-Reply-To: <20170822170344.GA13547@cmpxchg.org> Sender: owner-linux-mm@kvack.org List-ID: To: Johannes Weiner Cc: linux-mm@kvack.org, Michal Hocko , Vladimir Davydov , Tetsuo Handa , David Rientjes , Tejun Heo , kernel-team@fb.com, cgroups@vger.kernel.org, linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org Hi Johannes! Thank you for review! I do agree with most of the comments, and I will address them in v6. I'll post it soon. Please, find some comments below. On Tue, Aug 22, 2017 at 01:03:44PM -0400, Johannes Weiner wrote: > Hi Roman, > > great work! This looks mostly good to me now. Below are some nitpicks > concerning naming and code layout, but nothing major. > > > + > > + css_task_iter_start(&memcg->css, 0, &it); > > + while ((task = css_task_iter_next(&it))) { > > + /* > > + * If there are no tasks, or all tasks have oom_score_adj set > > + * to OOM_SCORE_ADJ_MIN and oom_kill_all_tasks is not set, > > + * don't select this memory cgroup. > > + */ > > + if (!elegible && > > + (memcg->oom_kill_all_tasks || > > + task->signal->oom_score_adj != OOM_SCORE_ADJ_MIN)) > > + elegible = 1; > > This is a little awkward to read. How about something like this: > > /* > * When killing individual tasks, we respect OOM score adjustments: > * at least one task in the group needs to be killable for the group > * to be oomable. > * > * Also check that previous OOM kills have finished, and abort if > * there are any pending OOM victims. > */ > oomable = memcg->oom_kill_all_tasks; > while ((task = css_task_iter_next(&it))) { > if (!oomable && task->signal_oom_score_adj != OOM_SCORE_ADJ_MIN) > oomable = 1; > > > + if (tsk_is_oom_victim(task) && > > + !test_bit(MMF_OOM_SKIP, &task->signal->oom_mm->flags)) { > > + elegible = -1; > > + break; > > + } > > + } > > + css_task_iter_end(&it); We ignore oom_score_adj if oom_kill_all_tasks is set, it's not reflected in your version. Anyway, I've moved the comments block outside and rephrased it to make more clear. > > etc. > > > + > > + return elegible > 0 ? 
memcg_oom_badness(memcg, nodemask) : elegible; > > I find these much easier to read if broken up, even if it's more LOC: > > if (eligible <= 0) > return eligible; > > return memcg_oom_badness(memcg, nodemask); > > > +static void select_victim_memcg(struct mem_cgroup *root, struct oom_control *oc) > > +{ > > + struct mem_cgroup *iter, *parent; > > + > > + for_each_mem_cgroup_tree(iter, root) { > > + if (memcg_has_children(iter)) { > > + iter->oom_score = 0; > > + continue; > > + } > > + > > + iter->oom_score = oom_evaluate_memcg(iter, oc->nodemask); > > + if (iter->oom_score == -1) { > > Please add comments to document the special returns. Maybe #defines > would be clearer, too. > > > + oc->chosen_memcg = (void *)-1UL; > > + mem_cgroup_iter_break(root, iter); > > + return; > > + } > > + > > + if (!iter->oom_score) > > + continue; > > Same here. > > Maybe a switch would be suitable to handle the abort/no-score cases. Not sure about switch/defines, but I've added several comment blocks to describe possible return values, as well as their handling. Hope, it will be enough. > > static int memory_events_show(struct seq_file *m, void *v) > > { > > struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m)); > > @@ -5310,6 +5512,12 @@ static struct cftype memory_files[] = { > > .write = memory_max_write, > > }, > > { > > + .name = "oom_kill_all_tasks", > > + .flags = CFTYPE_NOT_ON_ROOT, > > + .seq_show = memory_oom_kill_all_tasks_show, > > + .write = memory_oom_kill_all_tasks_write, > > + }, > > This name is quite a mouthful and reminiscent of the awkward v1 > interface names. It doesn't really go well with the v2 names. > > How about memory.oom_group? I'd prefer to have something more obvious. I've renamed memory.oom_kill_all_tasks to memory.oom_kill_all, which was earlier suggested by Vladimir. Are you ok with it? Thanks! -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wm0-f72.google.com (mail-wm0-f72.google.com [74.125.82.72]) by kanga.kvack.org (Postfix) with ESMTP id 5508D2803E6 for ; Wed, 23 Aug 2017 13:24:51 -0400 (EDT) Received: by mail-wm0-f72.google.com with SMTP id 136so711753wmm.15 for ; Wed, 23 Aug 2017 10:24:51 -0700 (PDT) Received: from gum.cmpxchg.org (gum.cmpxchg.org. 
[85.214.110.215]) by mx.google.com with ESMTPS id e11si2054844eda.29.2017.08.23.10.24.49 for (version=TLS1_2 cipher=ECDHE-RSA-CHACHA20-POLY1305 bits=256/256); Wed, 23 Aug 2017 10:24:49 -0700 (PDT) Date: Wed, 23 Aug 2017 13:24:41 -0400 From: Johannes Weiner Subject: Re: [v5 2/4] mm, oom: cgroup-aware OOM killer Message-ID: <20170823172441.GA29085@cmpxchg.org> References: <20170814183213.12319-1-guro@fb.com> <20170814183213.12319-3-guro@fb.com> <20170822170344.GA13547@cmpxchg.org> <20170823162031.GA13578@castle.dhcp.TheFacebook.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20170823162031.GA13578@castle.dhcp.TheFacebook.com> Sender: owner-linux-mm@kvack.org List-ID: To: Roman Gushchin Cc: linux-mm@kvack.org, Michal Hocko , Vladimir Davydov , Tetsuo Handa , David Rientjes , Tejun Heo , kernel-team@fb.com, cgroups@vger.kernel.org, linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org Hi, On Wed, Aug 23, 2017 at 05:20:31PM +0100, Roman Gushchin wrote: > On Tue, Aug 22, 2017 at 01:03:44PM -0400, Johannes Weiner wrote: > > > + css_task_iter_start(&memcg->css, 0, &it); > > > + while ((task = css_task_iter_next(&it))) { > > > + /* > > > + * If there are no tasks, or all tasks have oom_score_adj set > > > + * to OOM_SCORE_ADJ_MIN and oom_kill_all_tasks is not set, > > > + * don't select this memory cgroup. > > > + */ > > > + if (!elegible && > > > + (memcg->oom_kill_all_tasks || > > > + task->signal->oom_score_adj != OOM_SCORE_ADJ_MIN)) > > > + elegible = 1; > > > > This is a little awkward to read. How about something like this: > > > > /* > > * When killing individual tasks, we respect OOM score adjustments: > > * at least one task in the group needs to be killable for the group > > * to be oomable. > > * > > * Also check that previous OOM kills have finished, and abort if > > * there are any pending OOM victims. > > */ > > oomable = memcg->oom_kill_all_tasks; > > while ((task = css_task_iter_next(&it))) { > > if (!oomable && task->signal_oom_score_adj != OOM_SCORE_ADJ_MIN) > > oomable = 1; > > > > > + if (tsk_is_oom_victim(task) && > > > + !test_bit(MMF_OOM_SKIP, &task->signal->oom_mm->flags)) { > > > + elegible = -1; > > > + break; > > > + } > > > + } > > > + css_task_iter_end(&it); > > We ignore oom_score_adj if oom_kill_all_tasks is set, it's > not reflected in your version. Anyway, I've moved the comments block > outside and rephrased it to make more clear. Yes it is...? We only respect the score if !oomable, which is set to oom_kill_all_tasks. > > > +static void select_victim_memcg(struct mem_cgroup *root, struct oom_control *oc) > > > +{ > > > + struct mem_cgroup *iter, *parent; > > > + > > > + for_each_mem_cgroup_tree(iter, root) { > > > + if (memcg_has_children(iter)) { > > > + iter->oom_score = 0; > > > + continue; > > > + } > > > + > > > + iter->oom_score = oom_evaluate_memcg(iter, oc->nodemask); > > > + if (iter->oom_score == -1) { > > > > Please add comments to document the special returns. Maybe #defines > > would be clearer, too. > > > > > + oc->chosen_memcg = (void *)-1UL; > > > + mem_cgroup_iter_break(root, iter); > > > + return; > > > + } > > > + > > > + if (!iter->oom_score) > > > + continue; > > > > Same here. > > > > Maybe a switch would be suitable to handle the abort/no-score cases. > > Not sure about switch/defines, but I've added several comment blocks > to describe possible return values, as well as their handling. > Hope, it will be enough. Sounds good. 
> > >  static int memory_events_show(struct seq_file *m, void *v)
> > >  {
> > >  	struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m));
> > > @@ -5310,6 +5512,12 @@ static struct cftype memory_files[] = {
> > >  		.write = memory_max_write,
> > >  	},
> > >  	{
> > > +		.name = "oom_kill_all_tasks",
> > > +		.flags = CFTYPE_NOT_ON_ROOT,
> > > +		.seq_show = memory_oom_kill_all_tasks_show,
> > > +		.write = memory_oom_kill_all_tasks_write,
> > > +	},
> >
> > This name is quite a mouthful and reminiscent of the awkward v1
> > interface names. It doesn't really go well with the v2 names.
> >
> > How about memory.oom_group?
>
> I'd prefer to have something more obvious. I've renamed
> memory.oom_kill_all_tasks to memory.oom_kill_all, which was earlier suggested
> by Vladimir. Are you ok with it?

No, we should be striving for short and sweet mnemonics that express a
concept (oom applies to group, not member tasks) instead of underscore
sentences that describe an implementation (upon oom, kill all tasks in
the group).

It's better to have newbies consult the documentation once than making
everybody deal with long and cumbersome names for the rest of time.

Like 'ls' being better than 'read_and_print_directory_contents'.

Date: Wed, 23 Aug 2017 19:04:50 +0100
From: Roman Gushchin
Subject: Re: [v5 2/4] mm, oom: cgroup-aware OOM killer
Message-ID: <20170823174603.GA26190@castle.DHCP.thefacebook.com>
In-Reply-To: <20170823172441.GA29085@cmpxchg.org>
To: Johannes Weiner
Cc: linux-mm@kvack.org, Michal Hocko, Vladimir Davydov, Tetsuo Handa,
    David Rientjes, Tejun Heo, kernel-team@fb.com, cgroups@vger.kernel.org,
    linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org

On Wed, Aug 23, 2017 at 01:24:41PM -0400, Johannes Weiner wrote:
> Hi,
>
> On Wed, Aug 23, 2017 at 05:20:31PM +0100, Roman Gushchin wrote:
> > On Tue, Aug 22, 2017 at 01:03:44PM -0400, Johannes Weiner wrote:
> > > > +	css_task_iter_start(&memcg->css, 0, &it);
> > > > +	while ((task = css_task_iter_next(&it))) {
> > > > +		/*
> > > > +		 * If there are no tasks, or all tasks have oom_score_adj set
> > > > +		 * to OOM_SCORE_ADJ_MIN and oom_kill_all_tasks is not set,
> > > > +		 * don't select this memory cgroup.
> > > > +		 */
> > > > +		if (!elegible &&
> > > > +		    (memcg->oom_kill_all_tasks ||
> > > > +		     task->signal->oom_score_adj != OOM_SCORE_ADJ_MIN))
> > > > +			elegible = 1;
> > >
> > > This is a little awkward to read.
> > > How about something like this:
> > >
> > > 	/*
> > > 	 * When killing individual tasks, we respect OOM score adjustments:
> > > 	 * at least one task in the group needs to be killable for the group
> > > 	 * to be oomable.
> > > 	 *
> > > 	 * Also check that previous OOM kills have finished, and abort if
> > > 	 * there are any pending OOM victims.
> > > 	 */
> > > 	oomable = memcg->oom_kill_all_tasks;
> > > 	while ((task = css_task_iter_next(&it))) {
> > > 		if (!oomable && task->signal->oom_score_adj != OOM_SCORE_ADJ_MIN)
> > > 			oomable = 1;
> > >
> > > > +		if (tsk_is_oom_victim(task) &&
> > > > +		    !test_bit(MMF_OOM_SKIP, &task->signal->oom_mm->flags)) {
> > > > +			elegible = -1;
> > > > +			break;
> > > > +		}
> > > > +	}
> > > > +	css_task_iter_end(&it);
> >
> > We ignore oom_score_adj if oom_kill_all_tasks is set, it's
> > not reflected in your version. Anyway, I've moved the comments block
> > outside and rephrased it to make more clear.
>
> Yes it is...? We only respect the score if !oomable, which is set to
> oom_kill_all_tasks.

Sorry, I hadn't noticed that.

> > > >  static int memory_events_show(struct seq_file *m, void *v)
> > > >  {
> > > >  	struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m));
> > > > @@ -5310,6 +5512,12 @@ static struct cftype memory_files[] = {
> > > >  		.write = memory_max_write,
> > > >  	},
> > > >  	{
> > > > +		.name = "oom_kill_all_tasks",
> > > > +		.flags = CFTYPE_NOT_ON_ROOT,
> > > > +		.seq_show = memory_oom_kill_all_tasks_show,
> > > > +		.write = memory_oom_kill_all_tasks_write,
> > > > +	},
> > >
> > > This name is quite a mouthful and reminiscent of the awkward v1
> > > interface names. It doesn't really go well with the v2 names.
> > >
> > > How about memory.oom_group?
> >
> > I'd prefer to have something more obvious. I've renamed
> > memory.oom_kill_all_tasks to memory.oom_kill_all, which was earlier suggested
> > by Vladimir. Are you ok with it?
>
> No, we should be striving for short and sweet mnemonics that express a
> concept (oom applies to group, not member tasks) instead of underscore
> sentences that describe an implementation (upon oom, kill all tasks in
> the group).

Why do you call it an implementation detail? It's clearly the user's
intention: "if a cgroup is under OOM, all processes belonging to it should
be killed". How else could it be implemented?

>
> It's better to have newbies consult the documentation once than making
> everybody deal with long and cumbersome names for the rest of time.
>
> Like 'ls' being better than 'read_and_print_directory_contents'.

I don't think it's a good argument here: realistically, nobody will type
the knob's name often. Your option is only 3 characters shorter :)

Anyway, I'm ok with memory.oom_group too, if everybody else prefers it.
Michal, David?
What's your opinion?

Thanks!
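For comparison, if the knob were renamed as proposed, the cftype entry quoted
above would simply become something like the sketch below. The handler names
are hypothetical placeholders, not identifiers from the posted patch.

	{
		.name = "oom_group",			/* surfaces as memory.oom_group */
		.flags = CFTYPE_NOT_ON_ROOT,
		.seq_show = memory_oom_group_show,	/* hypothetical handler */
		.write = memory_oom_group_write,	/* hypothetical handler */
	},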
Date: Wed, 23 Aug 2017 16:13:26 -0700 (PDT)
From: David Rientjes
Subject: Re: [v5 2/4] mm, oom: cgroup-aware OOM killer
In-Reply-To: <20170823174603.GA26190@castle.DHCP.thefacebook.com>
To: Roman Gushchin
Cc: Johannes Weiner, linux-mm@kvack.org, Michal Hocko, Vladimir Davydov,
    Tetsuo Handa, Tejun Heo, kernel-team@fb.com, cgroups@vger.kernel.org,
    linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org

On Wed, 23 Aug 2017, Roman Gushchin wrote:

> > It's better to have newbies consult the documentation once than making
> > everybody deal with long and cumbersome names for the rest of time.
> >
> > Like 'ls' being better than 'read_and_print_directory_contents'.
>
> I don't think it's a good argument here: realistically, nobody will type
> the knob's name often. Your option is only 3 characters shorter :)
>
> Anyway, I'm ok with memory.oom_group too, if everybody else prefers it.
> Michal, David?
> What's your opinion?

I'm probably the worst person in the world for succinctly naming stuff,
but I at least think the knob should have the word "kill" in it to
describe the behavior. ("oom_group", out of memory group, what exactly
is that?)