From mboxrd@z Thu Jan 1 00:00:00 1970 From: Roman Gushchin Subject: [v3 0/6] cgroup-aware OOM killer Date: Wed, 21 Jun 2017 22:19:10 +0100 Message-ID: <1498079956-24467-1-git-send-email-guro@fb.com> MIME-Version: 1.0 Content-Type: text/plain Sender: owner-linux-mm@kvack.org List-ID: To: linux-mm@kvack.org Cc: Roman Gushchin This patchset makes the OOM killer cgroup-aware. Patch 1 makes out_of_memory() look at the oom_victims counter to decide whether a new victim is required. Patch 2 is the main patch, which implements the cgroup-aware OOM killer. Patch 3 adds some debug output, which can be refined later. Patch 4 introduces a per-cgroup oom_score_adj knob. Patch 5 fixes a problem with too many processes receiving access to the memory reserves. Patch 6 is a docs update. v1: https://lkml.org/lkml/2017/5/18/969 v2: - Reworked victim selection based on feedback from Michal Hocko, Vladimir Davydov and Johannes Weiner - "Kill all tasks" is now an opt-in option, by default only one process will be killed - Added per-cgroup oom_score_adj - Refined oom score calculations, suggested by Vladimir Davydov - Converted to a patchset v3: - Fixed swap accounting - Switched to using the oom_victims counter to prevent unnecessary kills - TIF_MEMDIE is set only when necessary - Moved all oom victim killing code into oom_kill.c - Merged commits 1-4 into 6 - Separated oom_score_adj logic into a separate commit 4 - Separated debug output into a separate commit 3 Roman Gushchin (6): mm, oom: use oom_victims counter to synchronize oom victim selection mm, oom: cgroup-aware OOM killer mm, oom: cgroup-aware OOM killer debug info mm, oom: introduce oom_score_adj for memory cgroups mm, oom: don't mark all oom victims tasks with TIF_MEMDIE mm,oom,docs: describe the cgroup-aware OOM killer Documentation/cgroup-v2.txt | 44 ++++++++++ include/linux/memcontrol.h | 23 +++++ include/linux/oom.h | 3 + kernel/exit.c | 2 +- mm/memcontrol.c | 209 ++++++++++++++++++++++++++++++++++++++++++++ mm/oom_kill.c | 202 ++++++++++++++++++++++++++++-------------- 6 files changed, 416 insertions(+), 67 deletions(-) -- 2.7.4
From mboxrd@z Thu Jan 1 00:00:00 1970 From: Roman Gushchin Subject: [v3 1/6] mm, oom: use oom_victims counter to synchronize oom victim selection Date: Wed, 21 Jun 2017 22:19:11 +0100 Message-ID: <1498079956-24467-2-git-send-email-guro@fb.com> In-Reply-To: <1498079956-24467-1-git-send-email-guro@fb.com> References: <1498079956-24467-1-git-send-email-guro@fb.com> MIME-Version: 1.0 Content-Type: text/plain Sender: owner-linux-mm@kvack.org List-ID: To: linux-mm@kvack.org Cc: Roman Gushchin , Michal Hocko , Vladimir Davydov , Johannes Weiner , Tejun Heo , Tetsuo Handa , kernel-team@fb.com, cgroups@vger.kernel.org, linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org The OOM killer should avoid unnecessary kills. To prevent them, during the task list traversal we check for a task which was previously selected as an OOM victim. If there is such a task, a new victim is not selected. This approach is sub-optimal (we're doing a costly iteration over the task list every time) and will not work for the cgroup-aware OOM killer. We already have the oom_victims counter, which can be effectively used for this purpose. If there are victims in flight, don't do anything; if the counter falls to 0, there are no more oom victims left, so it's a good time to start looking for a new victim. Signed-off-by: Roman Gushchin Cc: Michal Hocko Cc: Vladimir Davydov Cc: Johannes Weiner Cc: Tejun Heo Cc: Tetsuo Handa Cc: kernel-team@fb.com Cc: cgroups@vger.kernel.org Cc: linux-doc@vger.kernel.org Cc: linux-kernel@vger.kernel.org Cc: linux-mm@kvack.org --- mm/oom_kill.c | 7 +++++++ 1 file changed, 7 insertions(+) diff --git a/mm/oom_kill.c b/mm/oom_kill.c index 0e2c925..e3aaf5c8 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -992,6 +992,13 @@ bool out_of_memory(struct oom_control *oc) if (oom_killer_disabled) return false; + /* + * If there are oom victims in flight, we don't need to select + * a new victim. + */ + if (atomic_read(&oom_victims) > 0) + return true; + if (!is_memcg_oom(oc)) { blocking_notifier_call_chain(&oom_notify_list, 0, &freed); if (freed > 0) -- 2.7.4 From mboxrd@z Thu Jan 1 00:00:00 1970
From: Roman Gushchin Subject: [v3 2/6] mm, oom: cgroup-aware OOM killer Date: Wed, 21 Jun 2017 22:19:12 +0100 Message-ID: <1498079956-24467-3-git-send-email-guro@fb.com> In-Reply-To: <1498079956-24467-1-git-send-email-guro@fb.com> References: <1498079956-24467-1-git-send-email-guro@fb.com> MIME-Version: 1.0 Content-Type: text/plain Sender: owner-linux-mm@kvack.org List-ID: To: linux-mm@kvack.org Cc: Roman Gushchin , Michal Hocko , Vladimir Davydov , Johannes Weiner , Tetsuo Handa , David Rientjes , Tejun Heo , kernel-team@fb.com, cgroups@vger.kernel.org, linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org Traditionally, the OOM killer operates at the process level. Under OOM conditions, it finds the process with the highest oom score and kills it. This behavior doesn't suit systems with many running containers well. There are three main issues: 1) There is no fairness between containers. A small container with a few large processes will be chosen over a large one with a huge number of small processes. 2) Containers often do not expect that some random process inside will be killed. In many cases a much safer behavior is to kill all tasks in the container. Traditionally, this was implemented in userspace, but doing it in the kernel has some advantages, especially in the case of a system-wide OOM. 3) Per-process oom_score_adj affects the global OOM, so it's a breach of isolation. To address these issues, a cgroup-aware OOM killer is introduced. Under OOM conditions, it tries to find the biggest memory consumer and to free memory by killing the corresponding task(s). The difference from the "traditional" OOM killer is that it can treat memory cgroups as memory consumers as well as single processes. By default, it will look for the biggest leaf cgroup and kill the largest task inside. But a user can change this behavior by enabling the per-cgroup oom_kill_all_tasks option. If set, it causes the OOM killer to treat the whole cgroup as an indivisible memory consumer. If it is selected as an OOM victim, all belonging tasks will be killed. Tasks in the root cgroup are treated as independent memory consumers, and are compared with other memory consumers (e.g. leaf cgroups). The root cgroup doesn't support the oom_kill_all_tasks feature.
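For illustration only (not part of the patch), a minimal userspace sketch of how a container manager could opt a cgroup into this mode; the cgroup path is hypothetical and assumes the cgroup v2 hierarchy is mounted at /sys/fs/cgroup:

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	/* Hypothetical container cgroup; adjust to the real path. */
	const char *knob = "/sys/fs/cgroup/container1/memory.oom_kill_all_tasks";
	int fd = open(knob, O_WRONLY);

	if (fd < 0) {
		perror("open");
		return 1;
	}
	/* Writing "1" asks the OOM killer to treat the cgroup as one victim. */
	if (write(fd, "1", 1) != 1) {
		perror("write");
		close(fd);
		return 1;
	}
	close(fd);
	return 0;
}

With this set, the cgroup is either spared entirely or killed as a whole, per the description above.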
Signed-off-by: Roman Gushchin Cc: Michal Hocko Cc: Vladimir Davydov Cc: Johannes Weiner Cc: Tetsuo Handa Cc: David Rientjes Cc: Tejun Heo Cc: kernel-team@fb.com Cc: cgroups@vger.kernel.org Cc: linux-doc@vger.kernel.org Cc: linux-kernel@vger.kernel.org Cc: linux-mm@kvack.org --- include/linux/memcontrol.h | 20 ++++++ include/linux/oom.h | 3 + mm/memcontrol.c | 155 ++++++++++++++++++++++++++++++++++++++++++ mm/oom_kill.c | 164 +++++++++++++++++++++++++++++---------------- 4 files changed, 285 insertions(+), 57 deletions(-) diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 3914e3d..c59926c 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -35,6 +35,7 @@ struct mem_cgroup; struct page; struct mm_struct; struct kmem_cache; +struct oom_control; /* Cgroup-specific page state, on top of universal node page state */ enum memcg_stat_item { @@ -199,6 +200,9 @@ struct mem_cgroup { /* OOM-Killer disable */ int oom_kill_disable; + /* kill all tasks in the subtree in case of OOM */ + bool oom_kill_all_tasks; + /* handle for "memory.events" */ struct cgroup_file events_file; @@ -342,6 +346,11 @@ struct mem_cgroup *mem_cgroup_from_css(struct cgroup_subsys_state *css){ return css ? container_of(css, struct mem_cgroup, css) : NULL; } +static inline void mem_cgroup_put(struct mem_cgroup *memcg) +{ + css_put(&memcg->css); +} + #define mem_cgroup_from_counter(counter, member) \ container_of(counter, struct mem_cgroup, member) @@ -480,6 +489,8 @@ static inline bool task_in_memcg_oom(struct task_struct *p) bool mem_cgroup_oom_synchronize(bool wait); +bool mem_cgroup_select_oom_victim(struct oom_control *oc); + #ifdef CONFIG_MEMCG_SWAP extern int do_swap_account; #endif @@ -739,6 +750,10 @@ static inline bool task_in_mem_cgroup(struct task_struct *task, return true; } +static inline void mem_cgroup_put(struct mem_cgroup *memcg) +{ +} + static inline struct mem_cgroup * mem_cgroup_iter(struct mem_cgroup *root, struct mem_cgroup *prev, @@ -926,6 +941,11 @@ static inline void count_memcg_event_mm(struct mm_struct *mm, enum vm_event_item idx) { } + +static inline bool mem_cgroup_select_oom_victim(struct oom_control *oc) +{ + return false; +} #endif /* CONFIG_MEMCG */ static inline void __inc_memcg_state(struct mem_cgroup *memcg, diff --git a/include/linux/oom.h b/include/linux/oom.h index 8a266e2..b7ec3bd 100644 --- a/include/linux/oom.h +++ b/include/linux/oom.h @@ -39,6 +39,7 @@ struct oom_control { unsigned long totalpages; struct task_struct *chosen; unsigned long chosen_points; + struct mem_cgroup *chosen_memcg; }; extern struct mutex oom_lock; @@ -79,6 +80,8 @@ extern void oom_killer_enable(void); extern struct task_struct *find_lock_task_mm(struct task_struct *p); +extern int oom_evaluate_task(struct task_struct *task, void *arg); + /* sysctls */ extern int sysctl_oom_dump_tasks; extern int sysctl_oom_kill_allocating_task; diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 544d47e..bdb5103 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -2625,6 +2625,128 @@ static inline bool memcg_has_children(struct mem_cgroup *memcg) return ret; } +static long mem_cgroup_oom_badness(struct mem_cgroup *memcg, + const nodemask_t *nodemask) +{ + long points = 0; + int nid; + struct mem_cgroup *iter; + + for_each_mem_cgroup_tree(iter, memcg) { + for_each_node_state(nid, N_MEMORY) { + if (nodemask && !node_isset(nid, *nodemask)) + continue; + + points += mem_cgroup_node_nr_lru_pages(iter, nid, + LRU_ALL_ANON | BIT(LRU_UNEVICTABLE)); + } + + points += 
memcg_page_state(iter, MEMCG_KERNEL_STACK_KB) / + (PAGE_SIZE / 1024); + points += memcg_page_state(iter, NR_SLAB_UNRECLAIMABLE); + points += memcg_page_state(iter, MEMCG_SOCK); + points += memcg_page_state(iter, MEMCG_SWAP); + } + + return points; +} + +bool mem_cgroup_select_oom_victim(struct oom_control *oc) +{ + struct cgroup_subsys_state *css = NULL; + struct mem_cgroup *iter = NULL; + struct mem_cgroup *chosen_memcg = NULL; + struct mem_cgroup *parent = root_mem_cgroup; + unsigned long totalpages = oc->totalpages; + long chosen_memcg_points = 0; + long points = 0; + + oc->chosen = NULL; + oc->chosen_memcg = NULL; + + if (mem_cgroup_disabled()) + return false; + + if (!cgroup_subsys_on_dfl(memory_cgrp_subsys)) + return false; + if (oc->memcg) { + chosen_memcg = oc->memcg; + parent = oc->memcg; + } + + rcu_read_lock(); + + for (;;) { + css = css_next_child(css, &parent->css); + if (css) { + iter = mem_cgroup_from_css(css); + + points = mem_cgroup_oom_badness(iter, oc->nodemask); + + if (points > chosen_memcg_points) { + chosen_memcg = iter; + chosen_memcg_points = points; + oc->chosen_points = points; + } + + continue; + } + + if (chosen_memcg && !chosen_memcg->oom_kill_all_tasks) { + /* Go deeper in the cgroup hierarchy */ + totalpages = chosen_memcg_points; + chosen_memcg_points = 0; + + parent = chosen_memcg; + chosen_memcg = NULL; + + continue; + } + + if (!chosen_memcg && parent != root_mem_cgroup) + chosen_memcg = parent; + + break; + } + + if (!oc->memcg) { + /* + * We should also consider tasks in the root cgroup + * with badness larger than oc->chosen_points + */ + + struct css_task_iter it; + struct task_struct *task; + int ret = 0; + + css_task_iter_start(&root_mem_cgroup->css, &it); + while (!ret && (task = css_task_iter_next(&it))) + ret = oom_evaluate_task(task, oc); + css_task_iter_end(&it); + } + + if (!oc->chosen && chosen_memcg) { + if (chosen_memcg->oom_kill_all_tasks) { + css_get(&chosen_memcg->css); + oc->chosen_memcg = chosen_memcg; + } + + /* + * Even if we have to kill all tasks in the cgroup, + * we need to select the biggest task to start with. + * This task will receive an access to the memory + * reserves. + */ + oc->chosen_points = 0; + mem_cgroup_scan_tasks(chosen_memcg, oom_evaluate_task, oc); + } + + rcu_read_unlock(); + + oc->chosen_points = 0; + return !!oc->chosen || !!oc->chosen_memcg; +} + /* * Reclaims as many pages from the given memcg as possible. 
* @@ -5166,6 +5288,33 @@ static ssize_t memory_max_write(struct kernfs_open_file *of, return nbytes; } +static int memory_oom_kill_all_tasks_show(struct seq_file *m, void *v) +{ + struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m)); + bool oom_kill_all_tasks = memcg->oom_kill_all_tasks; + + seq_printf(m, "%d\n", oom_kill_all_tasks); + + return 0; +} + +static ssize_t memory_oom_kill_all_tasks_write(struct kernfs_open_file *of, + char *buf, size_t nbytes, + loff_t off) +{ + struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of)); + int oom_kill_all_tasks; + int err; + + err = kstrtoint(strstrip(buf), 0, &oom_kill_all_tasks); + if (err) + return err; + + memcg->oom_kill_all_tasks = !!oom_kill_all_tasks; + + return nbytes; +} + static int memory_events_show(struct seq_file *m, void *v) { struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m)); @@ -5286,6 +5435,12 @@ static struct cftype memory_files[] = { .write = memory_max_write, }, { + .name = "oom_kill_all_tasks", + .flags = CFTYPE_NOT_ON_ROOT, + .seq_show = memory_oom_kill_all_tasks_show, + .write = memory_oom_kill_all_tasks_write, + }, + { .name = "events", .flags = CFTYPE_NOT_ON_ROOT, .file_offset = offsetof(struct mem_cgroup, events_file), diff --git a/mm/oom_kill.c b/mm/oom_kill.c index e3aaf5c8..489ab69 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -288,7 +288,7 @@ static enum oom_constraint constrained_alloc(struct oom_control *oc) return CONSTRAINT_NONE; } -static int oom_evaluate_task(struct task_struct *task, void *arg) +int oom_evaluate_task(struct task_struct *task, void *arg) { struct oom_control *oc = arg; unsigned long points; @@ -802,66 +802,14 @@ static bool task_will_free_mem(struct task_struct *task) return ret; } -static void oom_kill_process(struct oom_control *oc, const char *message) +static void __oom_kill_process(struct task_struct *victim) { - struct task_struct *p = oc->chosen; - unsigned int points = oc->chosen_points; - struct task_struct *victim = p; - struct task_struct *child; - struct task_struct *t; + struct task_struct *p; struct mm_struct *mm; - unsigned int victim_points = 0; - static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL, - DEFAULT_RATELIMIT_BURST); bool can_oom_reap = true; - /* - * If the task is already exiting, don't alarm the sysadmin or kill - * its children or threads, just set TIF_MEMDIE so it can die quickly - */ - task_lock(p); - if (task_will_free_mem(p)) { - mark_oom_victim(p); - wake_oom_reaper(p); - task_unlock(p); - put_task_struct(p); + if (is_global_init(victim) || (victim->flags & PF_KTHREAD)) return; - } - task_unlock(p); - - if (__ratelimit(&oom_rs)) - dump_header(oc, p); - - pr_err("%s: Kill process %d (%s) score %u or sacrifice child\n", - message, task_pid_nr(p), p->comm, points); - - /* - * If any of p's children has a different mm and is eligible for kill, - * the one with the highest oom_badness() score is sacrificed for its - * parent. This attempts to lose the minimal amount of work done while - * still freeing memory. 
- */ - read_lock(&tasklist_lock); - for_each_thread(p, t) { - list_for_each_entry(child, &t->children, sibling) { - unsigned int child_points; - - if (process_shares_mm(child, p->mm)) - continue; - /* - * oom_badness() returns 0 if the thread is unkillable - */ - child_points = oom_badness(child, - oc->memcg, oc->nodemask, oc->totalpages); - if (child_points > victim_points) { - put_task_struct(victim); - victim = child; - victim_points = child_points; - get_task_struct(victim); - } - } - } - read_unlock(&tasklist_lock); p = find_lock_task_mm(victim); if (!p) { @@ -932,10 +880,107 @@ static void oom_kill_process(struct oom_control *oc, const char *message) wake_oom_reaper(victim); mmdrop(mm); - put_task_struct(victim); } #undef K +static void oom_kill_process(struct oom_control *oc, const char *message) +{ + struct task_struct *p = oc->chosen; + unsigned int points = oc->chosen_points; + struct task_struct *victim = p; + struct task_struct *child; + struct task_struct *t; + unsigned int victim_points = 0; + static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL, + DEFAULT_RATELIMIT_BURST); + + /* + * If the task is already exiting, don't alarm the sysadmin or kill + * its children or threads, just set TIF_MEMDIE so it can die quickly + */ + task_lock(p); + if (task_will_free_mem(p)) { + mark_oom_victim(p); + wake_oom_reaper(p); + task_unlock(p); + put_task_struct(p); + return; + } + task_unlock(p); + + if (__ratelimit(&oom_rs)) + dump_header(oc, p); + + pr_err("%s: Kill process %d (%s) score %u or sacrifice child\n", + message, task_pid_nr(p), p->comm, points); + + /* + * If any of p's children has a different mm and is eligible for kill, + * the one with the highest oom_badness() score is sacrificed for its + * parent. This attempts to lose the minimal amount of work done while + * still freeing memory. + */ + read_lock(&tasklist_lock); + for_each_thread(p, t) { + list_for_each_entry(child, &t->children, sibling) { + unsigned int child_points; + + if (process_shares_mm(child, p->mm)) + continue; + /* + * oom_badness() returns 0 if the thread is unkillable + */ + child_points = oom_badness(child, + oc->memcg, oc->nodemask, oc->totalpages); + if (child_points > victim_points) { + put_task_struct(victim); + victim = child; + victim_points = child_points; + get_task_struct(victim); + } + } + } + read_unlock(&tasklist_lock); + + __oom_kill_process(victim); + put_task_struct(victim); +} + +static int oom_kill_memcg_member(struct task_struct *task, void *unused) +{ + if (!tsk_is_oom_victim(task)) + __oom_kill_process(task); + return 0; +} + +static bool oom_kill_memcg_victim(struct oom_control *oc) +{ + bool ret = !!oc->chosen; + + if (oc->chosen && oc->chosen != (void *)-1UL) { + __oom_kill_process(oc->chosen); + put_task_struct(oc->chosen); + } + + if (oc->chosen_memcg) { + /* + * Kill all tasks in the cgroup hierarchy + */ + mem_cgroup_scan_tasks(oc->chosen_memcg, oom_kill_memcg_member, + NULL); + mem_cgroup_put(oc->chosen_memcg); + oc->chosen_memcg = NULL; + } + + /* + * Reset points before falling back to the + * old per-process OOM victim selection logic + */ + oc->chosen_points = 0; + oc->chosen = NULL; + return ret; +} + /* * Determines whether the kernel must panic because of the panic_on_oom sysctl. */ @@ -1044,6 +1089,11 @@ bool out_of_memory(struct oom_control *oc) return true; } + if (mem_cgroup_select_oom_victim(oc) && oom_kill_memcg_victim(oc)) { + schedule_timeout_killable(1); + return true; + } + select_bad_process(oc); /* Found nothing?!?! 
Either we hang forever, or we panic. */ if (!oc->chosen && !is_sysrq_oom(oc) && !is_memcg_oom(oc)) { -- 2.7.4 From mboxrd@z Thu Jan 1 00:00:00 1970 From: Roman Gushchin Subject: [v3 4/6] mm, oom: introduce oom_score_adj for memory cgroups Date: Wed, 21 Jun 2017 22:19:14 +0100 Message-ID: <1498079956-24467-5-git-send-email-guro@fb.com> In-Reply-To: <1498079956-24467-1-git-send-email-guro@fb.com> References: <1498079956-24467-1-git-send-email-guro@fb.com> MIME-Version: 1.0 Content-Type: text/plain Sender: owner-linux-mm@kvack.org List-ID: To: linux-mm@kvack.org Cc: Roman Gushchin , Michal Hocko , Vladimir Davydov , Johannes Weiner , Tejun Heo , Tetsuo Handa , kernel-team@fb.com, cgroups@vger.kernel.org, linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org Introduce a per-memory-cgroup oom_score_adj setting. A read-write single value file which exists on non-root cgroups. The default is "0". It will have a similar meaning to the per-process value, available via /proc/<pid>/oom_score_adj. Should be in the range [-1000, 1000].
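As a hedged usage sketch only (the cgroup path below is hypothetical), a small helper that writes an adjustment, mirroring the [-1000, 1000] limit described above:

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	/* Hypothetical cgroup path; the adjustment is passed as argv[1]. */
	const char *knob = "/sys/fs/cgroup/system/memory.oom_score_adj";
	char buf[16];
	long adj;
	int fd;

	if (argc != 2)
		return 1;

	adj = strtol(argv[1], NULL, 10);
	/* Mirror the range check: values outside [-1000, 1000] are rejected. */
	if (adj < -1000 || adj > 1000) {
		fprintf(stderr, "oom_score_adj out of range\n");
		return 1;
	}

	fd = open(knob, O_WRONLY);
	if (fd < 0) {
		perror("open");
		return 1;
	}
	snprintf(buf, sizeof(buf), "%ld", adj);
	if (write(fd, buf, strlen(buf)) < 0)
		perror("write");
	close(fd);
	return 0;
}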
Signed-off-by: Roman Gushchin Cc: Michal Hocko Cc: Vladimir Davydov Cc: Johannes Weiner Cc: Tejun Heo Cc: Tetsuo Handa Cc: kernel-team@fb.com Cc: cgroups@vger.kernel.org Cc: linux-doc@vger.kernel.org Cc: linux-kernel@vger.kernel.org Cc: linux-mm@kvack.org --- include/linux/memcontrol.h | 3 +++ mm/memcontrol.c | 36 ++++++++++++++++++++++++++++++++++++ 2 files changed, 39 insertions(+) diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index c59926c..b84a050 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -203,6 +203,9 @@ struct mem_cgroup { /* kill all tasks in the subtree in case of OOM */ bool oom_kill_all_tasks; + /* OOM kill score adjustment */ + short oom_score_adj; + /* handle for "memory.events" */ struct cgroup_file events_file; diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 4face20..e474eba 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -5333,6 +5333,36 @@ static ssize_t memory_oom_kill_all_tasks_write(struct kernfs_open_file *of, return nbytes; } +static int memory_oom_score_adj_show(struct seq_file *m, void *v) +{ + struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m)); + short oom_score_adj = memcg->oom_score_adj; + + seq_printf(m, "%d\n", oom_score_adj); + + return 0; +} + +static ssize_t memory_oom_score_adj_write(struct kernfs_open_file *of, + char *buf, size_t nbytes, loff_t off) +{ + struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of)); + int oom_score_adj; + int err; + + err = kstrtoint(strstrip(buf), 0, &oom_score_adj); + if (err) + return err; + + if (oom_score_adj < OOM_SCORE_ADJ_MIN || + oom_score_adj > OOM_SCORE_ADJ_MAX) + return -EINVAL; + + memcg->oom_score_adj = (short)oom_score_adj; + + return nbytes; +} + static int memory_events_show(struct seq_file *m, void *v) { struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m)); @@ -5459,6 +5489,12 @@ static struct cftype memory_files[] = { .write = memory_oom_kill_all_tasks_write, }, { + .name = "oom_score_adj", + .flags = CFTYPE_NOT_ON_ROOT, + .seq_show = memory_oom_score_adj_show, + .write = memory_oom_score_adj_write, + }, + { .name = "events", .flags = CFTYPE_NOT_ON_ROOT, .file_offset = offsetof(struct mem_cgroup, events_file), -- 2.7.4 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pf0-f200.google.com (mail-pf0-f200.google.com [209.85.192.200]) by kanga.kvack.org (Postfix) with ESMTP id 2CAEB6B02F3 for ; Wed, 21 Jun 2017 17:20:01 -0400 (EDT) Received: by mail-pf0-f200.google.com with SMTP id p4so169815527pfk.15 for ; Wed, 21 Jun 2017 14:20:01 -0700 (PDT) Received: from mx0a-00082601.pphosted.com (mx0a-00082601.pphosted.com. 
[67.231.145.42]) by mx.google.com with ESMTPS id m21si14869327pli.508.2017.06.21.14.20.00 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Wed, 21 Jun 2017 14:20:00 -0700 (PDT) From: Roman Gushchin Subject: [v3 5/6] mm, oom: don't mark all oom victims tasks with TIF_MEMDIE Date: Wed, 21 Jun 2017 22:19:15 +0100 Message-ID: <1498079956-24467-6-git-send-email-guro@fb.com> In-Reply-To: <1498079956-24467-1-git-send-email-guro@fb.com> References: <1498079956-24467-1-git-send-email-guro@fb.com> MIME-Version: 1.0 Content-Type: text/plain Sender: owner-linux-mm@kvack.org List-ID: To: linux-mm@kvack.org Cc: Roman Gushchin , Michal Hocko , Vladimir Davydov , Johannes Weiner , Tejun Heo , Tetsuo Handa , kernel-team@fb.com, cgroups@vger.kernel.org, linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org We want to limit the number of tasks which have access to the memory reserves. To ensure progress, it's enough to have one such process at a time. If we need to kill the whole cgroup, let's give access to the memory reserves only to the first process in the list, which is (usually) the biggest process. This will give us a good chance that all other processes will be able to quit without access to the memory reserves. Otherwise, to keep going forward, let's grant access to the memory reserves to tasks which can't be reaped by the oom_reaper. As this is done from the oom reaper thread, which handles the oom reaper queue sequentially, there is no high risk of having too many such processes at the same time. To implement this solution, we need to stop using the TIF_MEMDIE flag as a universal marker for oom victim tasks. It's not a big issue, as we have the oom_mm pointer/tsk_is_oom_victim(), which are just better. Signed-off-by: Roman Gushchin Cc: Michal Hocko Cc: Vladimir Davydov Cc: Johannes Weiner Cc: Tejun Heo Cc: Tetsuo Handa Cc: kernel-team@fb.com Cc: cgroups@vger.kernel.org Cc: linux-doc@vger.kernel.org Cc: linux-kernel@vger.kernel.org Cc: linux-mm@kvack.org --- kernel/exit.c | 2 +- mm/oom_kill.c | 31 ++++++++++++++++++++++--------- 2 files changed, 23 insertions(+), 10 deletions(-) diff --git a/kernel/exit.c b/kernel/exit.c index d211425..5b95d74 100644 --- a/kernel/exit.c +++ b/kernel/exit.c @@ -554,7 +554,7 @@ static void exit_mm(void) task_unlock(current); mm_update_next_owner(mm); mmput(mm); - if (test_thread_flag(TIF_MEMDIE)) + if (tsk_is_oom_victim(current)) exit_oom_victim(); } diff --git a/mm/oom_kill.c b/mm/oom_kill.c index 489ab69..b55bd18 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -556,8 +556,18 @@ static void oom_reap_task(struct task_struct *tsk) struct mm_struct *mm = tsk->signal->oom_mm; /* Retry the down_read_trylock(mmap_sem) a few times */ - while (attempts++ < MAX_OOM_REAP_RETRIES && !__oom_reap_task_mm(tsk, mm)) + while (attempts++ < MAX_OOM_REAP_RETRIES && + !__oom_reap_task_mm(tsk, mm)) { + + /* + * If the task has no access to the memory reserves, + * grant it to help the task to exit. + */ + if (!test_tsk_thread_flag(tsk, TIF_MEMDIE)) + set_tsk_thread_flag(tsk, TIF_MEMDIE); + schedule_timeout_idle(HZ/10); + } if (attempts <= MAX_OOM_REAP_RETRIES) goto done; @@ -647,16 +657,13 @@ static inline void wake_oom_reaper(struct task_struct *tsk) */ static void mark_oom_victim(struct task_struct *tsk) { - struct mm_struct *mm = tsk->mm; - WARN_ON(oom_killer_disabled); - /* OOM killer might race with memcg OOM */ - if (test_and_set_tsk_thread_flag(tsk, TIF_MEMDIE)) - return; /* oom_mm is bound to the signal struct life time.
*/ - if (!cmpxchg(&tsk->signal->oom_mm, NULL, mm)) - mmgrab(tsk->signal->oom_mm); + if (cmpxchg(&tsk->signal->oom_mm, NULL, tsk->mm) != NULL) + return; + + mmgrab(tsk->signal->oom_mm); /* * Make sure that the task is woken up from uninterruptible sleep @@ -665,7 +672,13 @@ static void mark_oom_victim(struct task_struct *tsk) * that TIF_MEMDIE tasks should be ignored. */ __thaw_task(tsk); - atomic_inc(&oom_victims); + + /* + * If there are no oom victims in flight, + * give the task an access to the memory reserves. + */ + if (atomic_inc_return(&oom_victims) == 1) + set_tsk_thread_flag(tsk, TIF_MEMDIE); } /** -- 2.7.4 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pf0-f198.google.com (mail-pf0-f198.google.com [209.85.192.198]) by kanga.kvack.org (Postfix) with ESMTP id CCF436B02FA for ; Wed, 21 Jun 2017 17:20:01 -0400 (EDT) Received: by mail-pf0-f198.google.com with SMTP id v9so168080542pfk.5 for ; Wed, 21 Jun 2017 14:20:01 -0700 (PDT) Received: from mx0a-00082601.pphosted.com (mx0a-00082601.pphosted.com. [67.231.145.42]) by mx.google.com with ESMTPS id l11si14142342pgc.375.2017.06.21.14.20.01 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Wed, 21 Jun 2017 14:20:01 -0700 (PDT) From: Roman Gushchin Subject: [v3 6/6] mm,oom,docs: describe the cgroup-aware OOM killer Date: Wed, 21 Jun 2017 22:19:16 +0100 Message-ID: <1498079956-24467-7-git-send-email-guro@fb.com> In-Reply-To: <1498079956-24467-1-git-send-email-guro@fb.com> References: <1498079956-24467-1-git-send-email-guro@fb.com> MIME-Version: 1.0 Content-Type: text/plain Sender: owner-linux-mm@kvack.org List-ID: To: linux-mm@kvack.org Cc: Roman Gushchin , Michal Hocko , Vladimir Davydov , Johannes Weiner , Tetsuo Handa , David Rientjes , Tejun Heo , kernel-team@fb.com, cgroups@vger.kernel.org, linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org Update cgroups v2 docs. Signed-off-by: Roman Gushchin Cc: Michal Hocko Cc: Vladimir Davydov Cc: Johannes Weiner Cc: Tetsuo Handa Cc: David Rientjes Cc: Tejun Heo Cc: kernel-team@fb.com Cc: cgroups@vger.kernel.org Cc: linux-doc@vger.kernel.org Cc: linux-kernel@vger.kernel.org Cc: linux-mm@kvack.org --- Documentation/cgroup-v2.txt | 44 ++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 44 insertions(+) diff --git a/Documentation/cgroup-v2.txt b/Documentation/cgroup-v2.txt index a86f3cb..7a1a1ac 100644 --- a/Documentation/cgroup-v2.txt +++ b/Documentation/cgroup-v2.txt @@ -44,6 +44,7 @@ CONTENTS 5-2-1. Memory Interface Files 5-2-2. Usage Guidelines 5-2-3. Memory Ownership + 5-2-4. Cgroup-aware OOM Killer 5-3. IO 5-3-1. IO Interface Files 5-3-2. Writeback @@ -799,6 +800,26 @@ PAGE_SIZE multiple when read back. high limit is used and monitored properly, this limit's utility is limited to providing the final safety net. + memory.oom_kill_all_tasks + + A read-write single value file which exits on non-root + cgroups. The default is "0". + + Defines whether the OOM killer should treat the cgroup + as a single entity during the victim selection. + + If set, it will cause the OOM killer to kill all belonging + tasks, both in case of a system-wide or cgroup-wide OOM. + + memory.oom_score_adj + + A read-write single value file which exits on non-root + cgroups. The default is "0". 
+ + OOM killer score adjustment, which has a similar meaning + to a per-process value, available via /proc/<pid>/oom_score_adj. + Should be in the range [-1000, 1000]. + memory.events A read-only flat-keyed file which exists on non-root cgroups. @@ -1028,6 +1049,29 @@ POSIX_FADV_DONTNEED to relinquish the ownership of memory areas belonging to the affected files to ensure correct memory ownership. +5-2-4. Cgroup-aware OOM Killer + +The cgroup v2 memory controller implements a cgroup-aware OOM killer. +It means that it treats memory cgroups as first-class OOM entities. + +Under OOM conditions the memory controller tries to make the best +choice of a victim, hierarchically looking for the largest memory +consumer. By default, it will look for the biggest task in the +biggest leaf cgroup. + +But a user can change this behavior by enabling the per-cgroup +oom_kill_all_tasks option. If set, it causes the OOM killer to treat +the whole cgroup as an indivisible memory consumer. If it is +selected as an OOM victim, all belonging tasks will be killed. + +Tasks in the root cgroup are treated as independent memory consumers, +and are compared with other memory consumers (e.g. leaf cgroups). +The root cgroup doesn't support the oom_kill_all_tasks feature. + +This affects both system- and cgroup-wide OOMs. For a cgroup-wide OOM +the memory controller considers only cgroups belonging to the sub-tree +of the OOM'ing cgroup. + 5-3. IO The "io" controller regulates the distribution of IO resources. This -- 2.7.4 From mboxrd@z Thu Jan 1 00:00:00 1970 From: Roman Gushchin Subject: [v3 3/6] mm, oom: cgroup-aware OOM killer debug info Date: Wed, 21 Jun 2017 22:19:13 +0100 Message-ID: <1498079956-24467-4-git-send-email-guro@fb.com> In-Reply-To: <1498079956-24467-1-git-send-email-guro@fb.com> References: <1498079956-24467-1-git-send-email-guro@fb.com> MIME-Version: 1.0 Content-Type: text/plain Sender: owner-linux-mm@kvack.org List-ID: To: linux-mm@kvack.org Cc: Roman Gushchin , Tejun Heo , Johannes Weiner , Li Zefan , Michal Hocko , Vladimir Davydov , Tetsuo Handa , kernel-team@fb.com, cgroups@vger.kernel.org, linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org Dump the cgroup oom badness score, as well as the name of the chosen victim cgroup.
Here how it looks like in dmesg: [ 18.824495] Choosing a victim memcg because of the system-wide OOM [ 18.826911] Cgroup /A1: 200805 [ 18.827996] Cgroup /A2: 273072 [ 18.828937] Cgroup /A2/B3: 51 [ 18.829795] Cgroup /A2/B4: 272969 [ 18.830800] Cgroup /A2/B5: 52 [ 18.831890] Chosen cgroup /A2/B4: 272969 Signed-off-by: Roman Gushchin Cc: Tejun Heo Cc: Johannes Weiner Cc: Li Zefan Cc: Michal Hocko Cc: Vladimir Davydov Cc: Tetsuo Handa Cc: kernel-team@fb.com Cc: cgroups@vger.kernel.org Cc: linux-doc@vger.kernel.org Cc: linux-kernel@vger.kernel.org Cc: linux-mm@kvack.org --- mm/memcontrol.c | 20 +++++++++++++++++++- 1 file changed, 19 insertions(+), 1 deletion(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index bdb5103..4face20 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -2669,7 +2669,15 @@ bool mem_cgroup_select_oom_victim(struct oom_control *oc) if (!cgroup_subsys_on_dfl(memory_cgrp_subsys)) return false; + + pr_info("Choosing a victim memcg because of the %s", + oc->memcg ? + "memory limit reached of cgroup " : + "system-wide OOM\n"); if (oc->memcg) { + pr_cont_cgroup_path(oc->memcg->css.cgroup); + pr_cont("\n"); + chosen_memcg = oc->memcg; parent = oc->memcg; } @@ -2683,6 +2691,10 @@ bool mem_cgroup_select_oom_victim(struct oom_control *oc) points = mem_cgroup_oom_badness(iter, oc->nodemask); + pr_info("Cgroup "); + pr_cont_cgroup_path(iter->css.cgroup); + pr_cont(": %ld\n", points); + if (points > chosen_memcg_points) { chosen_memcg = iter; chosen_memcg_points = points; @@ -2731,6 +2743,10 @@ bool mem_cgroup_select_oom_victim(struct oom_control *oc) oc->chosen_memcg = chosen_memcg; } + pr_info("Chosen cgroup "); + pr_cont_cgroup_path(chosen_memcg->css.cgroup); + pr_cont(": %ld\n", oc->chosen_points); + /* * Even if we have to kill all tasks in the cgroup, * we need to select the biggest task to start with. @@ -2739,7 +2755,9 @@ bool mem_cgroup_select_oom_victim(struct oom_control *oc) */ oc->chosen_points = 0; mem_cgroup_scan_tasks(chosen_memcg, oom_evaluate_task, oc); - } + } else if (oc->chosen) + pr_info("Chosen task %s (%d) in root cgroup: %ld\n", + oc->chosen->comm, oc->chosen->pid, oc->chosen_points); rcu_read_unlock(); -- 2.7.4 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wr0-f197.google.com (mail-wr0-f197.google.com [209.85.128.197]) by kanga.kvack.org (Postfix) with ESMTP id 6C8E86B0279 for ; Thu, 22 Jun 2017 12:59:27 -0400 (EDT) Received: by mail-wr0-f197.google.com with SMTP id z45so6172101wrb.13 for ; Thu, 22 Jun 2017 09:59:27 -0700 (PDT) Received: from mx0a-00082601.pphosted.com (mx0b-00082601.pphosted.com. 
[67.231.153.30]) by mx.google.com with ESMTPS id 69si2102520wra.135.2017.06.22.09.59.25 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Thu, 22 Jun 2017 09:59:26 -0700 (PDT) Date: Thu, 22 Jun 2017 17:58:58 +0100 From: Roman Gushchin Subject: Re: [v3 1/6] mm, oom: use oom_victims counter to synchronize oom victim selection Message-ID: <20170622165858.GA30035@castle> References: <1498079956-24467-1-git-send-email-guro@fb.com> <1498079956-24467-2-git-send-email-guro@fb.com> <201706220040.v5M0eSnK074332@www262.sakura.ne.jp> MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Disposition: inline In-Reply-To: <201706220040.v5M0eSnK074332@www262.sakura.ne.jp> Sender: owner-linux-mm@kvack.org List-ID: To: Tetsuo Handa Cc: linux-mm@kvack.org, Michal Hocko , Vladimir Davydov , Johannes Weiner , Tejun Heo , kernel-team@fb.com, cgroups@vger.kernel.org, linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org On Thu, Jun 22, 2017 at 09:40:28AM +0900, Tetsuo Handa wrote: > Roman Gushchin wrote: > > --- a/mm/oom_kill.c > > +++ b/mm/oom_kill.c > > @@ -992,6 +992,13 @@ bool out_of_memory(struct oom_control *oc) > > if (oom_killer_disabled) > > return false; > > > > + /* > > + * If there are oom victims in flight, we don't need to select > > + * a new victim. > > + */ > > + if (atomic_read(&oom_victims) > 0) > > + return true; > > + > > if (!is_memcg_oom(oc)) { > > blocking_notifier_call_chain(&oom_notify_list, 0, &freed); > > if (freed > 0) > > Above in this patch and below in patch 5 are wrong. > > > --- a/mm/oom_kill.c > > +++ b/mm/oom_kill.c > > @@ -665,7 +672,13 @@ static void mark_oom_victim(struct task_struct *tsk) > > * that TIF_MEMDIE tasks should be ignored. > > */ > > __thaw_task(tsk); > > - atomic_inc(&oom_victims); > > + > > + /* > > + * If there are no oom victims in flight, > > + * give the task an access to the memory reserves. > > + */ > > + if (atomic_inc_return(&oom_victims) == 1) > > + set_tsk_thread_flag(tsk, TIF_MEMDIE); > > } > > > > /** > > The OOM reaper is not available for CONFIG_MMU=n kernels, and timeout based > giveup is not permitted, but a multithreaded process might be selected as > an OOM victim. Not setting TIF_MEMDIE to all threads sharing an OOM victim's > mm increases possibility of preventing some OOM victim thread from terminating > (e.g. one of them cannot leave __alloc_pages_slowpath() with mmap_sem held for > write due to waiting for the TIF_MEMDIE thread to call exit_oom_victim() when > the TIF_MEMDIE thread is waiting for the thread with mmap_sem held for write). I agree, that CONFIG_MMU=n is a special case, and the proposed approach can't be used directly. But can you, please, why do you find the first chunk wrong? Basically, it's exactly what we do now: we increment oom_victims for every oom victim, and every process decrements this counter during it's exit path. If there is at least one existing victim, we will select it again, so it's just an optimization. Am I missing something? Why should we start new victim selection if there processes that will likely quit and release memory soon? Thanks! -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wm0-f72.google.com (mail-wm0-f72.google.com [74.125.82.72]) by kanga.kvack.org (Postfix) with ESMTP id 65E256B02C3 for ; Thu, 29 Jun 2017 04:54:01 -0400 (EDT) Received: by mail-wm0-f72.google.com with SMTP id b11so1071127wmh.0 for ; Thu, 29 Jun 2017 01:54:01 -0700 (PDT) Received: from mx1.suse.de (mx2.suse.de. [195.135.220.15]) by mx.google.com with ESMTPS id s69si658040wma.14.2017.06.29.01.54.00 for (version=TLS1 cipher=AES128-SHA bits=128/128); Thu, 29 Jun 2017 01:54:00 -0700 (PDT) Date: Thu, 29 Jun 2017 10:53:57 +0200 From: Michal Hocko Subject: Re: [v3 5/6] mm, oom: don't mark all oom victims tasks with TIF_MEMDIE Message-ID: <20170629085357.GF31603@dhcp22.suse.cz> References: <1498079956-24467-1-git-send-email-guro@fb.com> <1498079956-24467-6-git-send-email-guro@fb.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1498079956-24467-6-git-send-email-guro@fb.com> Sender: owner-linux-mm@kvack.org List-ID: To: Roman Gushchin Cc: linux-mm@kvack.org, Vladimir Davydov , Johannes Weiner , Tejun Heo , Tetsuo Handa , kernel-team@fb.com, cgroups@vger.kernel.org, linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org On Wed 21-06-17 22:19:15, Roman Gushchin wrote: > We want to limit the number of tasks which are having an access > to the memory reserves. To ensure the progress it's enough > to have one such process at the time. > > If we need to kill the whole cgroup, let's give an access to the > memory reserves only to the first process in the list, which is > (usually) the biggest process. > This will give us good chances that all other processes will be able > to quit without an access to the memory reserves. I don't like this to be honest. Is there any reason to go the reduced memory reserves access to oom victims I was suggesting earlier [1]? [1] http://lkml.kernel.org/r/http://lkml.kernel.org/r/1472723464-22866-2-git-send-email-mhocko@kernel.org -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wm0-f72.google.com (mail-wm0-f72.google.com [74.125.82.72]) by kanga.kvack.org (Postfix) with ESMTP id 7F0DA6B02C3 for ; Thu, 29 Jun 2017 05:04:34 -0400 (EDT) Received: by mail-wm0-f72.google.com with SMTP id z5so1111970wmz.4 for ; Thu, 29 Jun 2017 02:04:34 -0700 (PDT) Received: from mx1.suse.de (mx2.suse.de. 
[195.135.220.15]) by mx.google.com with ESMTPS id z128si645927wmg.186.2017.06.29.02.04.32 for (version=TLS1 cipher=AES128-SHA bits=128/128); Thu, 29 Jun 2017 02:04:32 -0700 (PDT) Date: Thu, 29 Jun 2017 11:04:31 +0200 From: Michal Hocko Subject: Re: [v3 1/6] mm, oom: use oom_victims counter to synchronize oom victim selection Message-ID: <20170629090431.GG31603@dhcp22.suse.cz> References: <1498079956-24467-1-git-send-email-guro@fb.com> <1498079956-24467-2-git-send-email-guro@fb.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1498079956-24467-2-git-send-email-guro@fb.com> Sender: owner-linux-mm@kvack.org List-ID: To: Roman Gushchin Cc: linux-mm@kvack.org, Vladimir Davydov , Johannes Weiner , Tejun Heo , Tetsuo Handa , kernel-team@fb.com, cgroups@vger.kernel.org, linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org On Wed 21-06-17 22:19:11, Roman Gushchin wrote: > Oom killer should avoid unnecessary kills. To prevent them, during > the tasks list traverse we check for task which was previously > selected as oom victims. If there is such a task, new victim > is not selected. > > This approach is sub-optimal (we're doing costly iteration over the task > list every time) and will not work for the cgroup-aware oom killer. > > We already have oom_victims counter, which can be effectively used > for the task. A global counter will not work properly, I am afraid. a) you should consider the oom domain and do not block oom on unrelated domains and b) you have no guarantee that the oom victim will terminate reasonably. That is why we have MMF_OOM_SKIP check in oom_evaluate_task. I think you should have something similar for your memcg victim selection. If you see a memcg in the oom hierarchy with oom victims which are alive and not MMF_OOM_SKIP, you should abort the scanning. -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pf0-f198.google.com (mail-pf0-f198.google.com [209.85.192.198]) by kanga.kvack.org (Postfix) with ESMTP id DF3416B0292 for ; Thu, 29 Jun 2017 14:45:34 -0400 (EDT) Received: by mail-pf0-f198.google.com with SMTP id 86so93638813pfq.11 for ; Thu, 29 Jun 2017 11:45:34 -0700 (PDT) Received: from mx0a-00082601.pphosted.com (mx0a-00082601.pphosted.com. 
[67.231.145.42]) by mx.google.com with ESMTPS id 1si4629210plk.129.2017.06.29.11.45.33 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Thu, 29 Jun 2017 11:45:33 -0700 (PDT) Date: Thu, 29 Jun 2017 14:45:13 -0400 From: Roman Gushchin Subject: Re: [v3 5/6] mm, oom: don't mark all oom victims tasks with TIF_MEMDIE Message-ID: <20170629184513.GA27714@castle> References: <1498079956-24467-1-git-send-email-guro@fb.com> <1498079956-24467-6-git-send-email-guro@fb.com> <20170629085357.GF31603@dhcp22.suse.cz> MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Disposition: inline In-Reply-To: <20170629085357.GF31603@dhcp22.suse.cz> Sender: owner-linux-mm@kvack.org List-ID: To: Michal Hocko Cc: linux-mm@kvack.org, Vladimir Davydov , Johannes Weiner , Tejun Heo , Tetsuo Handa , kernel-team@fb.com, cgroups@vger.kernel.org, linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org On Thu, Jun 29, 2017 at 10:53:57AM +0200, Michal Hocko wrote: > On Wed 21-06-17 22:19:15, Roman Gushchin wrote: > > We want to limit the number of tasks which are having an access > > to the memory reserves. To ensure the progress it's enough > > to have one such process at the time. > > > > If we need to kill the whole cgroup, let's give an access to the > > memory reserves only to the first process in the list, which is > > (usually) the biggest process. > > This will give us good chances that all other processes will be able > > to quit without an access to the memory reserves. > > I don't like this to be honest. Is there any reason to go the reduced > memory reserves access to oom victims I was suggesting earlier [1]? > > [1] http://lkml.kernel.org/r/http://lkml.kernel.org/r/1472723464-22866-2-git-send-email-mhocko@kernel.org I've nothing against your approach. What's the state of this patchset? Do you plan to bring it upstream? Roman -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wm0-f71.google.com (mail-wm0-f71.google.com [74.125.82.71]) by kanga.kvack.org (Postfix) with ESMTP id 20A786B0292 for ; Thu, 29 Jun 2017 14:48:07 -0400 (EDT) Received: by mail-wm0-f71.google.com with SMTP id z5so3611281wmz.4 for ; Thu, 29 Jun 2017 11:48:07 -0700 (PDT) Received: from mx0a-00082601.pphosted.com (mx0b-00082601.pphosted.com. 
[67.231.153.30]) by mx.google.com with ESMTPS id k186si9120555wmg.36.2017.06.29.11.48.05 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Thu, 29 Jun 2017 11:48:05 -0700 (PDT) Date: Thu, 29 Jun 2017 14:47:48 -0400 From: Roman Gushchin Subject: Re: [v3 1/6] mm, oom: use oom_victims counter to synchronize oom victim selection Message-ID: <20170629184748.GB27714@castle> References: <1498079956-24467-1-git-send-email-guro@fb.com> <1498079956-24467-2-git-send-email-guro@fb.com> <201706220040.v5M0eSnK074332@www262.sakura.ne.jp> <20170622165858.GA30035@castle> <201706230537.IDB21366.SQHJVFOOFOMFLt@I-love.SAKURA.ne.jp> <201706230652.FDH69263.OtOLFSFMHFOQJV@I-love.SAKURA.ne.jp> MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Disposition: inline In-Reply-To: <201706230652.FDH69263.OtOLFSFMHFOQJV@I-love.SAKURA.ne.jp> Sender: owner-linux-mm@kvack.org List-ID: To: Tetsuo Handa Cc: linux-mm@kvack.org, mhocko@kernel.org, vdavydov.dev@gmail.com, hannes@cmpxchg.org, tj@kernel.org, kernel-team@fb.com, cgroups@vger.kernel.org, linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org On Fri, Jun 23, 2017 at 06:52:20AM +0900, Tetsuo Handa wrote: > Tetsuo Handa wrote: > > Roman Gushchin wrote: > > > On Thu, Jun 22, 2017 at 09:40:28AM +0900, Tetsuo Handa wrote: > > > > Roman Gushchin wrote: > > > > > --- a/mm/oom_kill.c > > > > > +++ b/mm/oom_kill.c > > > > > @@ -992,6 +992,13 @@ bool out_of_memory(struct oom_control *oc) > > > > > if (oom_killer_disabled) > > > > > return false; > > > > > > > > > > + /* > > > > > + * If there are oom victims in flight, we don't need to select > > > > > + * a new victim. > > > > > + */ > > > > > + if (atomic_read(&oom_victims) > 0) > > > > > + return true; > > > > > + > > > > > if (!is_memcg_oom(oc)) { > > > > > blocking_notifier_call_chain(&oom_notify_list, 0, &freed); > > > > > if (freed > 0) > > > > > > > > The OOM reaper is not available for CONFIG_MMU=n kernels, and timeout based > > > > giveup is not permitted, but a multithreaded process might be selected as > > > > an OOM victim. Not setting TIF_MEMDIE to all threads sharing an OOM victim's > > > > mm increases possibility of preventing some OOM victim thread from terminating > > > > (e.g. one of them cannot leave __alloc_pages_slowpath() with mmap_sem held for > > > > write due to waiting for the TIF_MEMDIE thread to call exit_oom_victim() when > > > > the TIF_MEMDIE thread is waiting for the thread with mmap_sem held for write). > > > > > > I agree, that CONFIG_MMU=n is a special case, and the proposed approach can't > > > be used directly. But can you, please, why do you find the first chunk wrong? > > > > Since you are checking oom_victims before checking task_will_free_mem(current), > > only one thread can get TIF_MEMDIE. This is where a multithreaded OOM victim without > > the OOM reaper can get stuck forever. > > Oops, I misinterpreted. This is where a multithreaded OOM victim with or without > the OOM reaper can get stuck forever. Think about a process with two threads is > selected by the OOM killer and only one of these two threads can get TIF_MEMDIE. > > Thread-1 Thread-2 The OOM killer The OOM reaper > > Calls down_write(¤t->mm->mmap_sem). > Enters __alloc_pages_slowpath(). > Enters __alloc_pages_slowpath(). > Takes oom_lock. > Calls out_of_memory(). > Selects Thread-1 as an OOM victim. > Gets SIGKILL. Gets SIGKILL. > Gets TIF_MEMDIE. > Releases oom_lock. > Leaves __alloc_pages_slowpath() because Thread-1 has TIF_MEMDIE. > Takes oom_lock. 
> Will do nothing because down_read_trylock() fails. > Releases oom_lock. > Gives up and sets MMF_OOM_SKIP after one second. > Takes oom_lock. > Calls out_of_memory(). > Will not check MMF_OOM_SKIP because Thread-1 still has TIF_MEMDIE. // <= get stuck waiting for Thread-1. > Releases oom_lock. > Will not leave __alloc_pages_slowpath() because Thread-2 does not have TIF_MEMDIE. > Will not call up_write(¤t->mm->mmap_sem). > Reaches do_exit(). > Calls down_read(¤t->mm->mmap_sem) in exit_mm() in do_exit(). // <= get stuck waiting for Thread-2. > Will not call up_read(¤t->mm->mmap_sem) in exit_mm() in do_exit(). > Will not clear TIF_MEMDIE in exit_oom_victim() in exit_mm() in do_exit(). That's interesting... Does it mean, that we have to give an access to the reserves to all threads to guarantee the forward progress? What do you think about Michal's approach? He posted a link in the thread. Thank you! Roman -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pf0-f200.google.com (mail-pf0-f200.google.com [209.85.192.200]) by kanga.kvack.org (Postfix) with ESMTP id 1991C6B0292 for ; Thu, 29 Jun 2017 16:13:39 -0400 (EDT) Received: by mail-pf0-f200.google.com with SMTP id y62so96373823pfa.3 for ; Thu, 29 Jun 2017 13:13:39 -0700 (PDT) Received: from www262.sakura.ne.jp (www262.sakura.ne.jp. [2001:e42:101:1:202:181:97:72]) by mx.google.com with ESMTPS id q2si4666210plh.464.2017.06.29.13.13.37 for (version=TLS1 cipher=AES128-SHA bits=128/128); Thu, 29 Jun 2017 13:13:38 -0700 (PDT) Subject: Re: [v3 1/6] mm, oom: use oom_victims counter to synchronize oom victim selection From: Tetsuo Handa References: <201706220040.v5M0eSnK074332@www262.sakura.ne.jp> <20170622165858.GA30035@castle> <201706230537.IDB21366.SQHJVFOOFOMFLt@I-love.SAKURA.ne.jp> <201706230652.FDH69263.OtOLFSFMHFOQJV@I-love.SAKURA.ne.jp> <20170629184748.GB27714@castle> In-Reply-To: <20170629184748.GB27714@castle> Message-Id: <201706300513.BGC60962.LQFJOOtMOFVFSH@I-love.SAKURA.ne.jp> Date: Fri, 30 Jun 2017 05:13:13 +0900 Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Sender: owner-linux-mm@kvack.org List-ID: To: guro@fb.com Cc: linux-mm@kvack.org, mhocko@kernel.org, vdavydov.dev@gmail.com, hannes@cmpxchg.org, tj@kernel.org, kernel-team@fb.com, cgroups@vger.kernel.org, linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org Roman Gushchin wrote: > On Fri, Jun 23, 2017 at 06:52:20AM +0900, Tetsuo Handa wrote: > > Tetsuo Handa wrote: > > Oops, I misinterpreted. This is where a multithreaded OOM victim with or without > > the OOM reaper can get stuck forever. Think about a process with two threads is > > selected by the OOM killer and only one of these two threads can get TIF_MEMDIE. > > > > Thread-1 Thread-2 The OOM killer The OOM reaper > > > > Calls down_write(¤t->mm->mmap_sem). > > Enters __alloc_pages_slowpath(). > > Enters __alloc_pages_slowpath(). > > Takes oom_lock. > > Calls out_of_memory(). > > Selects Thread-1 as an OOM victim. > > Gets SIGKILL. Gets SIGKILL. > > Gets TIF_MEMDIE. > > Releases oom_lock. > > Leaves __alloc_pages_slowpath() because Thread-1 has TIF_MEMDIE. > > Takes oom_lock. > > Will do nothing because down_read_trylock() fails. > > Releases oom_lock. > > Gives up and sets MMF_OOM_SKIP after one second. > > Takes oom_lock. > > Calls out_of_memory(). 
> > Will not check MMF_OOM_SKIP because Thread-1 still has TIF_MEMDIE. // <= get stuck waiting for Thread-1. > > Releases oom_lock. > > Will not leave __alloc_pages_slowpath() because Thread-2 does not have TIF_MEMDIE. > > Will not call up_write(¤t->mm->mmap_sem). > > Reaches do_exit(). > > Calls down_read(¤t->mm->mmap_sem) in exit_mm() in do_exit(). // <= get stuck waiting for Thread-2. > > Will not call up_read(¤t->mm->mmap_sem) in exit_mm() in do_exit(). > > Will not clear TIF_MEMDIE in exit_oom_victim() in exit_mm() in do_exit(). > > That's interesting... Does it mean, that we have to give an access to the reserves > to all threads to guarantee the forward progress? Yes, for we don't have __GFP_KILLABLE flag. > > What do you think about Michal's approach? He posted a link in the thread. Please read that thread. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wr0-f199.google.com (mail-wr0-f199.google.com [209.85.128.199]) by kanga.kvack.org (Postfix) with ESMTP id A3F392802FE for ; Fri, 30 Jun 2017 04:25:55 -0400 (EDT) Received: by mail-wr0-f199.google.com with SMTP id u110so37662996wrb.14 for ; Fri, 30 Jun 2017 01:25:55 -0700 (PDT) Received: from mx1.suse.de (mx2.suse.de. [195.135.220.15]) by mx.google.com with ESMTPS id 17si5787253wrt.177.2017.06.30.01.25.54 for (version=TLS1 cipher=AES128-SHA bits=128/128); Fri, 30 Jun 2017 01:25:54 -0700 (PDT) Date: Fri, 30 Jun 2017 10:25:51 +0200 From: Michal Hocko Subject: Re: [v3 5/6] mm, oom: don't mark all oom victims tasks with TIF_MEMDIE Message-ID: <20170630082551.GB22917@dhcp22.suse.cz> References: <1498079956-24467-1-git-send-email-guro@fb.com> <1498079956-24467-6-git-send-email-guro@fb.com> <20170629085357.GF31603@dhcp22.suse.cz> <20170629184513.GA27714@castle> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20170629184513.GA27714@castle> Sender: owner-linux-mm@kvack.org List-ID: To: Roman Gushchin Cc: linux-mm@kvack.org, Vladimir Davydov , Johannes Weiner , Tejun Heo , Tetsuo Handa , kernel-team@fb.com, cgroups@vger.kernel.org, linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org On Thu 29-06-17 14:45:13, Roman Gushchin wrote: > On Thu, Jun 29, 2017 at 10:53:57AM +0200, Michal Hocko wrote: > > On Wed 21-06-17 22:19:15, Roman Gushchin wrote: > > > We want to limit the number of tasks which are having an access > > > to the memory reserves. To ensure the progress it's enough > > > to have one such process at the time. > > > > > > If we need to kill the whole cgroup, let's give an access to the > > > memory reserves only to the first process in the list, which is > > > (usually) the biggest process. > > > This will give us good chances that all other processes will be able > > > to quit without an access to the memory reserves. > > > > I don't like this to be honest. Is there any reason to go the reduced > > memory reserves access to oom victims I was suggesting earlier [1]? > > > > [1] http://lkml.kernel.org/r/http://lkml.kernel.org/r/1472723464-22866-2-git-send-email-mhocko@kernel.org > > I've nothing against your approach. What's the state of this patchset? > Do you plan to bring it upstream? Just the specific patch I have linked should be sufficient for what you need here. 
The patchset had some issues which I didn't have time to fix and as such the need for the above patch was not a high priority as well. -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Tetsuo Handa Subject: Re: [v3 1/6] mm, oom: use oom_victims counter to synchronize oom victim selection Date: Fri, 23 Jun 2017 05:37:50 +0900 Message-ID: <201706230537.IDB21366.SQHJVFOOFOMFLt@I-love.SAKURA.ne.jp> References: <1498079956-24467-1-git-send-email-guro@fb.com> <1498079956-24467-2-git-send-email-guro@fb.com> <201706220040.v5M0eSnK074332@www262.sakura.ne.jp> <20170622165858.GA30035@castle> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Return-path: In-Reply-To: <20170622165858.GA30035@castle> Sender: linux-kernel-owner@vger.kernel.org To: guro@fb.com Cc: linux-mm@kvack.org, mhocko@kernel.org, vdavydov.dev@gmail.com, hannes@cmpxchg.org, tj@kernel.org, kernel-team@fb.com, cgroups@vger.kernel.org, linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org List-Id: linux-mm.kvack.org Roman Gushchin wrote: > On Thu, Jun 22, 2017 at 09:40:28AM +0900, Tetsuo Handa wrote: > > Roman Gushchin wrote: > > > --- a/mm/oom_kill.c > > > +++ b/mm/oom_kill.c > > > @@ -992,6 +992,13 @@ bool out_of_memory(struct oom_control *oc) > > > if (oom_killer_disabled) > > > return false; > > > > > > + /* > > > + * If there are oom victims in flight, we don't need to select > > > + * a new victim. > > > + */ > > > + if (atomic_read(&oom_victims) > 0) > > > + return true; > > > + > > > if (!is_memcg_oom(oc)) { > > > blocking_notifier_call_chain(&oom_notify_list, 0, &freed); > > > if (freed > 0) > > > > Above in this patch and below in patch 5 are wrong. > > > > > --- a/mm/oom_kill.c > > > +++ b/mm/oom_kill.c > > > @@ -665,7 +672,13 @@ static void mark_oom_victim(struct task_struct *tsk) > > > * that TIF_MEMDIE tasks should be ignored. > > > */ > > > __thaw_task(tsk); > > > - atomic_inc(&oom_victims); > > > + > > > + /* > > > + * If there are no oom victims in flight, > > > + * give the task an access to the memory reserves. > > > + */ > > > + if (atomic_inc_return(&oom_victims) == 1) > > > + set_tsk_thread_flag(tsk, TIF_MEMDIE); > > > } > > > > > > /** > > > > The OOM reaper is not available for CONFIG_MMU=n kernels, and timeout based > > giveup is not permitted, but a multithreaded process might be selected as > > an OOM victim. Not setting TIF_MEMDIE to all threads sharing an OOM victim's > > mm increases possibility of preventing some OOM victim thread from terminating > > (e.g. one of them cannot leave __alloc_pages_slowpath() with mmap_sem held for > > write due to waiting for the TIF_MEMDIE thread to call exit_oom_victim() when > > the TIF_MEMDIE thread is waiting for the thread with mmap_sem held for write). > > I agree, that CONFIG_MMU=n is a special case, and the proposed approach can't > be used directly. But can you, please, why do you find the first chunk wrong? Since you are checking oom_victims before checking task_will_free_mem(current), only one thread can get TIF_MEMDIE. This is where a multithreaded OOM victim without the OOM reaper can get stuck forever. > Basically, it's exactly what we do now: we increment oom_victims for every oom > victim, and every process decrements this counter during it's exit path. 
> If there is at least one existing victim, we will select it again, so it's just > an optimization. Am I missing something? Why should we start new victim selection > if there processes that will likely quit and release memory soon? If you check oom_victims like http://lkml.kernel.org/r/201706220053.v5M0rmOU078764@www262.sakura.ne.jp does, all threads in a multithreaded OOM victim will have a chance to get TIF_MEMDIE. By the way, below in patch 5 is also wrong. You are not permitted to set TIF_MEMDIE if oom_killer_disabled == true. --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -556,8 +556,18 @@ static void oom_reap_task(struct task_struct *tsk) struct mm_struct *mm = tsk->signal->oom_mm; /* Retry the down_read_trylock(mmap_sem) a few times */ - while (attempts++ < MAX_OOM_REAP_RETRIES && !__oom_reap_task_mm(tsk, mm)) + while (attempts++ < MAX_OOM_REAP_RETRIES && + !__oom_reap_task_mm(tsk, mm)) { + + /* + * If the task has no access to the memory reserves, + * grant it to help the task to exit. + */ + if (!test_tsk_thread_flag(tsk, TIF_MEMDIE)) + set_tsk_thread_flag(tsk, TIF_MEMDIE); + schedule_timeout_idle(HZ/10); + } if (attempts <= MAX_OOM_REAP_RETRIES) goto done; From mboxrd@z Thu Jan 1 00:00:00 1970 From: Tetsuo Handa Subject: Re: [v3 1/6] mm, oom: use oom_victims counter to synchronize oom victim selection Date: Fri, 23 Jun 2017 06:52:20 +0900 Message-ID: <201706230652.FDH69263.OtOLFSFMHFOQJV@I-love.SAKURA.ne.jp> References: <1498079956-24467-1-git-send-email-guro@fb.com> <1498079956-24467-2-git-send-email-guro@fb.com> <201706220040.v5M0eSnK074332@www262.sakura.ne.jp> <20170622165858.GA30035@castle> <201706230537.IDB21366.SQHJVFOOFOMFLt@I-love.SAKURA.ne.jp> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Return-path: In-Reply-To: <201706230537.IDB21366.SQHJVFOOFOMFLt-JPay3/Yim36HaxMnTkn67Xf5DAMn2ifp@public.gmane.org> Sender: cgroups-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org To: guro-b10kYP2dOMg@public.gmane.org Cc: linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org, mhocko-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org, vdavydov.dev-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org, hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org, tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org, kernel-team-b10kYP2dOMg@public.gmane.org, cgroups-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-doc-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, penguin-kernel-JPay3/Yim36HaxMnTkn67Xf5DAMn2ifp@public.gmane.org List-Id: linux-mm.kvack.org Tetsuo Handa wrote: > Roman Gushchin wrote: > > On Thu, Jun 22, 2017 at 09:40:28AM +0900, Tetsuo Handa wrote: > > > Roman Gushchin wrote: > > > > --- a/mm/oom_kill.c > > > > +++ b/mm/oom_kill.c > > > > @@ -992,6 +992,13 @@ bool out_of_memory(struct oom_control *oc) > > > > if (oom_killer_disabled) > > > > return false; > > > > > > > > + /* > > > > + * If there are oom victims in flight, we don't need to select > > > > + * a new victim. > > > > + */ > > > > + if (atomic_read(&oom_victims) > 0) > > > > + return true; > > > > + > > > > if (!is_memcg_oom(oc)) { > > > > blocking_notifier_call_chain(&oom_notify_list, 0, &freed); > > > > if (freed > 0) > > > > > > The OOM reaper is not available for CONFIG_MMU=n kernels, and timeout based > > > giveup is not permitted, but a multithreaded process might be selected as > > > an OOM victim. Not setting TIF_MEMDIE to all threads sharing an OOM victim's > > > mm increases possibility of preventing some OOM victim thread from terminating > > > (e.g. 
one of them cannot leave __alloc_pages_slowpath() with mmap_sem held for > > > write due to waiting for the TIF_MEMDIE thread to call exit_oom_victim() when > > > the TIF_MEMDIE thread is waiting for the thread with mmap_sem held for write). > > > > I agree, that CONFIG_MMU=n is a special case, and the proposed approach can't > > be used directly. But can you, please, why do you find the first chunk wrong? > > Since you are checking oom_victims before checking task_will_free_mem(current), > only one thread can get TIF_MEMDIE. This is where a multithreaded OOM victim without > the OOM reaper can get stuck forever. Oops, I misinterpreted. This is where a multithreaded OOM victim with or without the OOM reaper can get stuck forever. Think about a process with two threads that is selected by the OOM killer and only one of these two threads can get TIF_MEMDIE. Thread-1 Thread-2 The OOM killer The OOM reaper Calls down_write(&current->mm->mmap_sem). Enters __alloc_pages_slowpath(). Enters __alloc_pages_slowpath(). Takes oom_lock. Calls out_of_memory(). Selects Thread-1 as an OOM victim. Gets SIGKILL. Gets SIGKILL. Gets TIF_MEMDIE. Releases oom_lock. Leaves __alloc_pages_slowpath() because Thread-1 has TIF_MEMDIE. Takes oom_lock. Will do nothing because down_read_trylock() fails. Releases oom_lock. Gives up and sets MMF_OOM_SKIP after one second. Takes oom_lock. Calls out_of_memory(). Will not check MMF_OOM_SKIP because Thread-1 still has TIF_MEMDIE. // <= get stuck waiting for Thread-1. Releases oom_lock. Will not leave __alloc_pages_slowpath() because Thread-2 does not have TIF_MEMDIE. Will not call up_write(&current->mm->mmap_sem). Reaches do_exit(). Calls down_read(&current->mm->mmap_sem) in exit_mm() in do_exit(). // <= get stuck waiting for Thread-2. Will not call up_read(&current->mm->mmap_sem) in exit_mm() in do_exit(). Will not clear TIF_MEMDIE in exit_oom_victim() in exit_mm() in do_exit(). From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pg0-f69.google.com (mail-pg0-f69.google.com [74.125.83.69]) by kanga.kvack.org (Postfix) with ESMTP id 57CAC44084A for ; Mon, 10 Jul 2017 19:05:52 -0400 (EDT) Received: by mail-pg0-f69.google.com with SMTP id 125so131462595pgi.2 for ; Mon, 10 Jul 2017 16:05:52 -0700 (PDT) Received: from mail-pf0-x22b.google.com (mail-pf0-x22b.google.com. [2607:f8b0:400e:c00::22b]) by mx.google.com with ESMTPS id p91si9773181plb.278.2017.07.10.16.05.51 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Mon, 10 Jul 2017 16:05:51 -0700 (PDT) Received: by mail-pf0-x22b.google.com with SMTP id c73so57061837pfk.2 for ; Mon, 10 Jul 2017 16:05:51 -0700 (PDT) Date: Mon, 10 Jul 2017 16:05:49 -0700 (PDT) From: David Rientjes Subject: Re: [v3 2/6] mm, oom: cgroup-aware OOM killer In-Reply-To: <1498079956-24467-3-git-send-email-guro@fb.com> Message-ID: References: <1498079956-24467-1-git-send-email-guro@fb.com> <1498079956-24467-3-git-send-email-guro@fb.com> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-linux-mm@kvack.org List-ID: To: Roman Gushchin Cc: linux-mm@kvack.org, Michal Hocko , Vladimir Davydov , Johannes Weiner , Tetsuo Handa , Tejun Heo , kernel-team@fb.com, cgroups@vger.kernel.org, linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org On Wed, 21 Jun 2017, Roman Gushchin wrote: > Traditionally, the OOM killer is operating on a process level. > Under oom conditions, it finds a process with the highest oom score > and kills it. > > This behavior doesn't suit well the system with many running > containers.
There are two main issues: > > 1) There is no fairness between containers. A small container with > few large processes will be chosen over a large one with huge > number of small processes. > Yes, the original motivation was to limit killing to a single process, if possible. To do that, we kill the process with the largest rss to free the most memory and rely on the user to configure /proc/pid/oom_score_adj if something else should be prioritized. With containerization and overcommit of system memory, we concur that killing the single largest process isn't always preferable and neglects the priority of its memcg. Your motivation seems to be to provide fairness between one memcg with a large process and one memcg with a large number of small processes; I'm curious if you are concerned about the priority of a memcg hierarchy (how important that "job" is) or whether you are strictly concerned with "largeness" of memcgs relative to each other. > 2) Containers often do not expect that some random process inside > will be killed. In many cases much more safer behavior is to kill > all tasks in the container. Traditionally, this was implemented > in userspace, but doing it in the kernel has some advantages, > especially in a case of a system-wide OOM. > We agree. > 3) Per-process oom_score_adj affects global OOM, so it's a breache > in the isolation. > This should only be a consequence of overcommiting memcgs at the top level so the system oom killer is actually ever invoked, otherwise per-process oom_score_adj works well for memcg oom killing. > To address these issues, cgroup-aware OOM killer is introduced. > > Under OOM conditions, it tries to find the biggest memory consumer, > and free memory by killing corresponding task(s). The difference > the "traditional" OOM killer is that it can treat memory cgroups > as memory consumers as well as single processes. > > By default, it will look for the biggest leaf cgroup, and kill > the largest task inside. > > But a user can change this behavior by enabling the per-cgroup > oom_kill_all_tasks option. If set, it causes the OOM killer treat > the whole cgroup as an indivisible memory consumer. In case if it's > selected as on OOM victim, all belonging tasks will be killed. > These are two different things, right? We can adjust how the system oom killer chooses victims when memcg hierarchies overcommit the system to not strictly prefer the single process with the largest rss without killing everything attached to the memcg. Separately: do you not intend to support memcg priorities at all, but rather strictly consider the "largeness" of a memcg versus other memcgs? In our methodology, each memcg is assigned a priority value and the iteration of the hierarchy simply compares and visits the memcg with the lowest priority at each level and then selects the largest process to kill. This could also support a "kill-all" knob. struct mem_cgroup *memcg = root_mem_cgroup; struct mem_cgroup *low_memcg; unsigned long low_priority; next: low_memcg = NULL; low_priority = ULONG_MAX; for_each_child_of_memcg(memcg) { unsigned long prio = memcg_oom_priority(memcg); if (prio < low_priority) { low_memcg = memcg; low_priority = prio; } } if (low_memcg) goto next; oom_kill_process_from_memcg(memcg); So this is a priority based model that is different than your aggregate usage model but I think it allows userspace to define a more powerful policy. 
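For clarity, here is a slightly expanded sketch of that walk. It is an illustration only: for_each_child_of_memcg(), memcg_oom_priority() and oom_kill_process_from_memcg() are hypothetical helpers as in the snippet above (here with an explicit iterator argument), and the descent into the lowest-priority child before rescanning is spelled out.

/*
 * Illustration only: for_each_child_of_memcg(), memcg_oom_priority()
 * and oom_kill_process_from_memcg() are hypothetical helpers, not
 * existing kernel API.
 */
static void oom_kill_lowest_priority(struct mem_cgroup *root)
{
	struct mem_cgroup *memcg = root;

	for (;;) {
		struct mem_cgroup *iter, *low_memcg = NULL;
		unsigned long low_priority = ULONG_MAX;

		/* Find the lowest-priority child on this level, if any. */
		for_each_child_of_memcg(iter, memcg) {
			unsigned long prio = memcg_oom_priority(iter);

			if (prio < low_priority) {
				low_memcg = iter;
				low_priority = prio;
			}
		}
		if (!low_memcg)
			break;		/* no children: memcg is a leaf */
		memcg = low_memcg;	/* descend and rescan */
	}

	/* Kill the largest task in the chosen leaf. */
	oom_kill_process_from_memcg(memcg);
}

The per-level comparison against tasks attached directly to the memcg, described above, is omitted here for brevity.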
We certainly may want to kill from a memcg with a single large process, or we may want to kill from a memcg with several small processes, it depends on the importance of that job. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wr0-f198.google.com (mail-wr0-f198.google.com [209.85.128.198]) by kanga.kvack.org (Postfix) with ESMTP id 483806B04FC for ; Tue, 11 Jul 2017 08:52:00 -0400 (EDT) Received: by mail-wr0-f198.google.com with SMTP id 77so31560721wrb.11 for ; Tue, 11 Jul 2017 05:52:00 -0700 (PDT) Received: from mx0a-00082601.pphosted.com (mx0b-00082601.pphosted.com. [67.231.153.30]) by mx.google.com with ESMTPS id u10si8706460wmg.100.2017.07.11.05.51.58 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Tue, 11 Jul 2017 05:51:58 -0700 (PDT) Date: Tue, 11 Jul 2017 13:51:24 +0100 From: Roman Gushchin Subject: Re: [v3 2/6] mm, oom: cgroup-aware OOM killer Message-ID: <20170711125124.GA12406@castle> References: <1498079956-24467-1-git-send-email-guro@fb.com> <1498079956-24467-3-git-send-email-guro@fb.com> MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Disposition: inline In-Reply-To: Sender: owner-linux-mm@kvack.org List-ID: To: David Rientjes Cc: linux-mm@kvack.org, Michal Hocko , Vladimir Davydov , Johannes Weiner , Tetsuo Handa , Tejun Heo , kernel-team@fb.com, cgroups@vger.kernel.org, linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org On Mon, Jul 10, 2017 at 04:05:49PM -0700, David Rientjes wrote: > On Wed, 21 Jun 2017, Roman Gushchin wrote: > > > Traditionally, the OOM killer is operating on a process level. > > Under oom conditions, it finds a process with the highest oom score > > and kills it. > > > > This behavior doesn't suit well the system with many running > > containers. There are two main issues: > > > > 1) There is no fairness between containers. A small container with > > few large processes will be chosen over a large one with huge > > number of small processes. > > > > Yes, the original motivation was to limit killing to a single process, if > possible. To do that, we kill the process with the largest rss to free > the most memory and rely on the user to configure /proc/pid/oom_score_adj > if something else should be prioritized. > > With containerization and overcommit of system memory, we concur that > killing the single largest process isn't always preferable and neglects > the priority of its memcg. Your motivation seems to be to provide > fairness between one memcg with a large process and one memcg with a large > number of small processes; I'm curious if you are concerned about the > priority of a memcg hierarchy (how important that "job" is) or whether you > are strictly concerned with "largeness" of memcgs relative to each other. I'm pretty sure we should provide some way to prioritize some cgroups over other (in terms of oom killer preferences), but I'm not 100% sure yet, what's the best way to do it. I've suggested something similar to the existing oom_score_adj for tasks, mostly to folow the existing design. One of the questions to answer in priority-based model is how to compare tasks in the root cgroup with cgroups? > > ... > > By default, it will look for the biggest leaf cgroup, and kill > > the largest task inside. 
> > > > But a user can change this behavior by enabling the per-cgroup > > oom_kill_all_tasks option. If set, it causes the OOM killer treat > > the whole cgroup as an indivisible memory consumer. In case if it's > > selected as on OOM victim, all belonging tasks will be killed. > > > > These are two different things, right? We can adjust how the system oom > killer chooses victims when memcg hierarchies overcommit the system to not > strictly prefer the single process with the largest rss without killing > everything attached to the memcg. They are different, and I thought about providing two independent knobs. But after all I haven't found enough real life examples, where it can be useful. Can you provide something here? Also, they are different only for non-leaf cgroups; leaf cgroups are always treated as indivisible memory consumers during victim selection. I assume, that containerized systems will always set oom_kill_all_tasks for top-level container memory cgroups. By default it's turned off to provide backward compatibility with current behavior and avoid excessive kills and support oom_score_adj==-1000 (I've added this to v4, will post soon). > Separately: do you not intend to support memcg priorities at all, but > rather strictly consider the "largeness" of a memcg versus other memcgs? Answered upper. > In our methodology, each memcg is assigned a priority value and the > iteration of the hierarchy simply compares and visits the memcg with the > lowest priority at each level and then selects the largest process to > kill. This could also support a "kill-all" knob. > > struct mem_cgroup *memcg = root_mem_cgroup; > struct mem_cgroup *low_memcg; > unsigned long low_priority; > > next: > low_memcg = NULL; > low_priority = ULONG_MAX; > for_each_child_of_memcg(memcg) { > unsigned long prio = memcg_oom_priority(memcg); > > if (prio < low_priority) { > low_memcg = memcg; > low_priority = prio; > } > } > if (low_memcg) > goto next; > oom_kill_process_from_memcg(memcg); > > So this is a priority based model that is different than your aggregate > usage model but I think it allows userspace to define a more powerful > policy. We certainly may want to kill from a memcg with a single large > process, or we may want to kill from a memcg with several small processes, > it depends on the importance of that job. I believe, that both models have some advantages. Priority-based model is more powerful, but requires support from the userspace to set up these priorities (and, probably, adjust them dynamically). Size-based model is limited, but provides reasonable behavior without any additional configuration. I will agree here with Michal Hocko, that bpf like mechanism can be used here to provide an option for writing some custom oom policies. After we will have necessary infrastructure to iterate over cgroup tree, select a cgroup with largest oom score, and kill a biggest task/all the tasks there, we can add an ability to customize the cgroup (and maybe tasks) evaluation. Thanks! Roman -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pf0-f199.google.com (mail-pf0-f199.google.com [209.85.192.199]) by kanga.kvack.org (Postfix) with ESMTP id 23A596B04C6 for ; Tue, 11 Jul 2017 16:56:34 -0400 (EDT) Received: by mail-pf0-f199.google.com with SMTP id u17so3709483pfa.6 for ; Tue, 11 Jul 2017 13:56:34 -0700 (PDT) Received: from mail-pf0-x22a.google.com (mail-pf0-x22a.google.com. [2607:f8b0:400e:c00::22a]) by mx.google.com with ESMTPS id j10si271629pfc.13.2017.07.11.13.56.32 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Tue, 11 Jul 2017 13:56:32 -0700 (PDT) Received: by mail-pf0-x22a.google.com with SMTP id q85so1860953pfq.1 for ; Tue, 11 Jul 2017 13:56:32 -0700 (PDT) Date: Tue, 11 Jul 2017 13:56:30 -0700 (PDT) From: David Rientjes Subject: Re: [v3 2/6] mm, oom: cgroup-aware OOM killer In-Reply-To: <20170711125124.GA12406@castle> Message-ID: References: <1498079956-24467-1-git-send-email-guro@fb.com> <1498079956-24467-3-git-send-email-guro@fb.com> <20170711125124.GA12406@castle> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-linux-mm@kvack.org List-ID: To: Roman Gushchin Cc: linux-mm@kvack.org, Michal Hocko , Vladimir Davydov , Johannes Weiner , Tetsuo Handa , Tejun Heo , kernel-team@fb.com, cgroups@vger.kernel.org, linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org On Tue, 11 Jul 2017, Roman Gushchin wrote: > > Yes, the original motivation was to limit killing to a single process, if > > possible. To do that, we kill the process with the largest rss to free > > the most memory and rely on the user to configure /proc/pid/oom_score_adj > > if something else should be prioritized. > > > > With containerization and overcommit of system memory, we concur that > > killing the single largest process isn't always preferable and neglects > > the priority of its memcg. Your motivation seems to be to provide > > fairness between one memcg with a large process and one memcg with a large > > number of small processes; I'm curious if you are concerned about the > > priority of a memcg hierarchy (how important that "job" is) or whether you > > are strictly concerned with "largeness" of memcgs relative to each other. > > I'm pretty sure we should provide some way to prioritize some cgroups > over other (in terms of oom killer preferences), but I'm not 100% sure yet, > what's the best way to do it. I've suggested something similar to the existing > oom_score_adj for tasks, mostly to folow the existing design. > > One of the questions to answer in priority-based model is > how to compare tasks in the root cgroup with cgroups? > We do this with an alternate scoring mechanism, that is purely priority based and tiebreaks based on largest rss. An additional tunable is added for each process, under /proc/pid, and also to the memcg hierarchy, and is enabled via a system-wide sysctl. I way to mesh the two scoring mechanisms together would be helpful, but for our purposes we don't use oom_score_adj at all, other than converting OOM_SCORE_ADJ_MIN to still be oom disabled when written by third party apps. For memcg oom conditions, iteration of the hierarchy begins at the oom memcg. For system oom conditions, this is the root memcg. All processes attached to the oom memcg have their priority based value and this is compared to all child memcg's priority value at that level. 
If a process has the lowest priority, it is killed and we're done; we could implement a "kill all" mechanism for this memcg that is checked before the process is killed. If a memcg has the lowest priority compared to attached processes, it is iterated as well, and so on throughout the memcg hierarchy until we find the lowest priority process in the lowest priority leaf memcg. This way, we can fully control which process is killed for both system and memcg oom conditions. I can easily post patches for this, we have used it for years. > > These are two different things, right? We can adjust how the system oom > > killer chooses victims when memcg hierarchies overcommit the system to not > > strictly prefer the single process with the largest rss without killing > > everything attached to the memcg. > > They are different, and I thought about providing two independent knobs. > But after all I haven't found enough real life examples, where it can be useful. > Can you provide something here? > Yes, we have users who we chown their memcg hierarchy to and have full control over setting up their hierarchy however we want. Our "Activity Manager", using Documentation/cgroup-v1/memory.txt terminology, only is aware of the top level memcg that was chown'd to the user. That user runs a series of batch jobs that are submitted to it and each job is represented as a subcontainer to enforce strict limits on the amount of memory that job can use. When it becomes oom, we have found that it is preferable to oom kill the entire batch job rather than leave it in an inconsistent state, so enabling such a knob here would be helpful. Other top-level jobs are fine with individual processes being oom killed. It can be a low priority process for which they have full control over defining the priority through the new per-process and per-memcg value described above. Easy example is scraping logs periodically or other best-effort tasks like cleanup. They can happily be oom killed and rescheduled without taking down the entire first-class job. > Also, they are different only for non-leaf cgroups; leaf cgroups > are always treated as indivisible memory consumers during victim selection. > > I assume, that containerized systems will always set oom_kill_all_tasks for > top-level container memory cgroups. By default it's turned off > to provide backward compatibility with current behavior and avoid > excessive kills and support oom_score_adj==-1000 (I've added this to v4, > will post soon). > We certainly would not be enabling it for top-level memcgs, there would be no way that we could because we have best-effort processes, but we would like to enable it for small batch jobs that are run on behalf of a user in their own subcontainer. We have had this usecase for ~3 years and solely because of the problem that you pointed out earlier: it is often much more reliable for the kernel to do oom killing of multiple processes rather than userspace. > > In our methodology, each memcg is assigned a priority value and the > > iteration of the hierarchy simply compares and visits the memcg with the > > lowest priority at each level and then selects the largest process to > > kill. This could also support a "kill-all" knob. 
> > > > struct mem_cgroup *memcg = root_mem_cgroup; > > struct mem_cgroup *low_memcg; > > unsigned long low_priority; > > > > next: > > low_memcg = NULL; > > low_priority = ULONG_MAX; > > for_each_child_of_memcg(memcg) { > > unsigned long prio = memcg_oom_priority(memcg); > > > > if (prio < low_priority) { > > low_memcg = memcg; > > low_priority = prio; > > } > > } > > if (low_memcg) > > goto next; > > oom_kill_process_from_memcg(memcg); > > > > So this is a priority based model that is different than your aggregate > > usage model but I think it allows userspace to define a more powerful > > policy. We certainly may want to kill from a memcg with a single large > > process, or we may want to kill from a memcg with several small processes, > > it depends on the importance of that job. > > I believe, that both models have some advantages. > Priority-based model is more powerful, but requires support from the userspace > to set up these priorities (and, probably, adjust them dynamically). It's a no-op if nobody sets up priorities or the system-wide sysctl is disabled. Presumably, as in our model, the Activity Manager sets the sysctl and is responsible for configuring the priorities if present. All memcgs at the sibling level or subcontainer level remain the default if not defined by the chown'd user, so this falls back to an rss model for backwards compatibility. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wr0-f198.google.com (mail-wr0-f198.google.com [209.85.128.198]) by kanga.kvack.org (Postfix) with ESMTP id B0B456B050E for ; Wed, 12 Jul 2017 08:11:56 -0400 (EDT) Received: by mail-wr0-f198.google.com with SMTP id 40so937631wrw.10 for ; Wed, 12 Jul 2017 05:11:56 -0700 (PDT) Received: from mx0a-00082601.pphosted.com (mx0b-00082601.pphosted.com. [67.231.153.30]) by mx.google.com with ESMTPS id 90si1622464wrp.138.2017.07.12.05.11.54 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Wed, 12 Jul 2017 05:11:55 -0700 (PDT) Date: Wed, 12 Jul 2017 13:11:10 +0100 From: Roman Gushchin Subject: Re: [v3 2/6] mm, oom: cgroup-aware OOM killer Message-ID: <20170712121110.GA9017@castle> References: <1498079956-24467-1-git-send-email-guro@fb.com> <1498079956-24467-3-git-send-email-guro@fb.com> <20170711125124.GA12406@castle> MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Disposition: inline In-Reply-To: Sender: owner-linux-mm@kvack.org List-ID: To: David Rientjes Cc: linux-mm@kvack.org, Michal Hocko , Vladimir Davydov , Johannes Weiner , Tetsuo Handa , Tejun Heo , kernel-team@fb.com, cgroups@vger.kernel.org, linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org On Tue, Jul 11, 2017 at 01:56:30PM -0700, David Rientjes wrote: > On Tue, 11 Jul 2017, Roman Gushchin wrote: > > > > Yes, the original motivation was to limit killing to a single process, if > > > possible. To do that, we kill the process with the largest rss to free > > > the most memory and rely on the user to configure /proc/pid/oom_score_adj > > > if something else should be prioritized. > > > > > > With containerization and overcommit of system memory, we concur that > > > killing the single largest process isn't always preferable and neglects > > > the priority of its memcg. 
Your motivation seems to be to provide > > > fairness between one memcg with a large process and one memcg with a large > > > number of small processes; I'm curious if you are concerned about the > > > priority of a memcg hierarchy (how important that "job" is) or whether you > > > are strictly concerned with "largeness" of memcgs relative to each other. > > > > I'm pretty sure we should provide some way to prioritize some cgroups > > over other (in terms of oom killer preferences), but I'm not 100% sure yet, > > what's the best way to do it. I've suggested something similar to the existing > > oom_score_adj for tasks, mostly to folow the existing design. > > > > One of the questions to answer in priority-based model is > > how to compare tasks in the root cgroup with cgroups? > > > > We do this with an alternate scoring mechanism, that is purely priority > based and tiebreaks based on largest rss. An additional tunable is added > for each process, under /proc/pid, and also to the memcg hierarchy, and is > enabled via a system-wide sysctl. I way to mesh the two scoring > mechanisms together would be helpful, but for our purposes we don't use > oom_score_adj at all, other than converting OOM_SCORE_ADJ_MIN to still be > oom disabled when written by third party apps. > > For memcg oom conditions, iteration of the hierarchy begins at the oom > memcg. For system oom conditions, this is the root memcg. > > All processes attached to the oom memcg have their priority based value > and this is compared to all child memcg's priority value at that level. > If a process has the lowest priority, it is killed and we're done; we > could implement a "kill all" mechanism for this memcg that is checked > before the process is killed. > > If a memcg has the lowest priority compared to attached processes, it is > iterated as well, and so on throughout the memcg hierarchy until we find > the lowest priority process in the lowest priority leaf memcg. This way, > we can fully control which process is killed for both system and memcg oom > conditions. I can easily post patches for this, we have used it for > years. > > > > These are two different things, right? We can adjust how the system oom > > > killer chooses victims when memcg hierarchies overcommit the system to not > > > strictly prefer the single process with the largest rss without killing > > > everything attached to the memcg. > > > > They are different, and I thought about providing two independent knobs. > > But after all I haven't found enough real life examples, where it can be useful. > > Can you provide something here? > > > > Yes, we have users who we chown their memcg hierarchy to and have full > control over setting up their hierarchy however we want. Our "Activity > Manager", using Documentation/cgroup-v1/memory.txt terminology, only is > aware of the top level memcg that was chown'd to the user. That user runs > a series of batch jobs that are submitted to it and each job is > represented as a subcontainer to enforce strict limits on the amount of > memory that job can use. When it becomes oom, we have found that it is > preferable to oom kill the entire batch job rather than leave it in an > inconsistent state, so enabling such a knob here would be helpful. > > Other top-level jobs are fine with individual processes being oom killed. > It can be a low priority process for which they have full control over > defining the priority through the new per-process and per-memcg value > described above. 
Easy example is scraping logs periodically or other > best-effort tasks like cleanup. They can happily be oom killed and > rescheduled without taking down the entire first-class job. > > > Also, they are different only for non-leaf cgroups; leaf cgroups > > are always treated as indivisible memory consumers during victim selection. > > > > I assume, that containerized systems will always set oom_kill_all_tasks for > > top-level container memory cgroups. By default it's turned off > > to provide backward compatibility with current behavior and avoid > > excessive kills and support oom_score_adj==-1000 (I've added this to v4, > > will post soon). > > > > We certainly would not be enabling it for top-level memcgs, there would be > no way that we could because we have best-effort processes, but we would > like to enable it for small batch jobs that are run on behalf of a user in > their own subcontainer. We have had this usecase for ~3 years and solely > because of the problem that you pointed out earlier: it is often much more > reliable for the kernel to do oom killing of multiple processes rather > than userspace. > > > > In our methodology, each memcg is assigned a priority value and the > > > iteration of the hierarchy simply compares and visits the memcg with the > > > lowest priority at each level and then selects the largest process to > > > kill. This could also support a "kill-all" knob. > > > > > > struct mem_cgroup *memcg = root_mem_cgroup; > > > struct mem_cgroup *low_memcg; > > > unsigned long low_priority; > > > > > > next: > > > low_memcg = NULL; > > > low_priority = ULONG_MAX; > > > for_each_child_of_memcg(memcg) { > > > unsigned long prio = memcg_oom_priority(memcg); > > > > > > if (prio < low_priority) { > > > low_memcg = memcg; > > > low_priority = prio; > > > } > > > } > > > if (low_memcg) > > > goto next; > > > oom_kill_process_from_memcg(memcg); > > > > > > So this is a priority based model that is different than your aggregate > > > usage model but I think it allows userspace to define a more powerful > > > policy. We certainly may want to kill from a memcg with a single large > > > process, or we may want to kill from a memcg with several small processes, > > > it depends on the importance of that job. > > > > I believe, that both models have some advantages. > > Priority-based model is more powerful, but requires support from the userspace > > to set up these priorities (and, probably, adjust them dynamically). > > It's a no-op if nobody sets up priorities or the system-wide sysctl is > disabled. Presumably, as in our model, the Activity Manager sets the > sysctl and is responsible for configuring the priorities if present. All > memcgs at the sibling level or subcontainer level remain the default if > not defined by the chown'd user, so this falls back to an rss model for > backwards compatibility. Hm, this is interesting... What I'm thinking about, is that we can introduce the following model: each memory cgroup has an integer oom priority value, 0 be default. Root cgroup priority is always 0, other cgroups can have both positive or negative priorities. During OOM victim selection we compare cgroups on each hierarchy level based on priority and size, if there are several cgroups with equal priority. Per-task oom_score_adj will affect task selection inside a cgroup if oom_kill_all_tasks is not set. -1000 special value will also completely protect a task from being killed, if only oom_kill_all_tasks is not set. I wonder, if it will work for you? Thanks! 
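To make the proposed comparison concrete, here is a minimal sketch of ranking sibling cgroups on one level. The assumptions: oom_priority is the proposed per-memcg knob, memcg_oom_size() stands in for the aggregate usage used for victim selection, and for_each_child_of_memcg() is a hypothetical child iterator; none of these exist today. The convention that a larger value makes a cgroup a more likely victim is borrowed from oom_score_adj and is only an assumption, since the sign convention is still open in this thread.

/*
 * Sketch under the assumptions stated above; not existing kernel API.
 */
static struct mem_cgroup *select_victim_memcg(struct mem_cgroup *parent)
{
	struct mem_cgroup *iter, *chosen = NULL;

	for_each_child_of_memcg(iter, parent) {
		/* Higher priority wins; equal priority falls back to size. */
		if (!chosen ||
		    iter->oom_priority > chosen->oom_priority ||
		    (iter->oom_priority == chosen->oom_priority &&
		     memcg_oom_size(iter) > memcg_oom_size(chosen)))
			chosen = iter;
	}

	return chosen;	/* NULL: parent has no children and is itself the victim */
}

Task selection inside the chosen cgroup would then honor per-task oom_score_adj, or kill everything if oom_kill_all_tasks is set, as described above.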
Roman -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pg0-f70.google.com (mail-pg0-f70.google.com [74.125.83.70]) by kanga.kvack.org (Postfix) with ESMTP id 111F6440874 for ; Wed, 12 Jul 2017 16:26:24 -0400 (EDT) Received: by mail-pg0-f70.google.com with SMTP id p15so35824346pgs.7 for ; Wed, 12 Jul 2017 13:26:24 -0700 (PDT) Received: from mail-pf0-x230.google.com (mail-pf0-x230.google.com. [2607:f8b0:400e:c00::230]) by mx.google.com with ESMTPS id y75si2557506pfa.339.2017.07.12.13.26.22 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Wed, 12 Jul 2017 13:26:22 -0700 (PDT) Received: by mail-pf0-x230.google.com with SMTP id q86so18167778pfl.3 for ; Wed, 12 Jul 2017 13:26:22 -0700 (PDT) Date: Wed, 12 Jul 2017 13:26:20 -0700 (PDT) From: David Rientjes Subject: Re: [v3 2/6] mm, oom: cgroup-aware OOM killer In-Reply-To: <20170712121110.GA9017@castle> Message-ID: References: <1498079956-24467-1-git-send-email-guro@fb.com> <1498079956-24467-3-git-send-email-guro@fb.com> <20170711125124.GA12406@castle> <20170712121110.GA9017@castle> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-linux-mm@kvack.org List-ID: To: Roman Gushchin Cc: linux-mm@kvack.org, Michal Hocko , Vladimir Davydov , Johannes Weiner , Tetsuo Handa , Tejun Heo , kernel-team@fb.com, cgroups@vger.kernel.org, linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org On Wed, 12 Jul 2017, Roman Gushchin wrote: > > It's a no-op if nobody sets up priorities or the system-wide sysctl is > > disabled. Presumably, as in our model, the Activity Manager sets the > > sysctl and is responsible for configuring the priorities if present. All > > memcgs at the sibling level or subcontainer level remain the default if > > not defined by the chown'd user, so this falls back to an rss model for > > backwards compatibility. > > Hm, this is interesting... > > What I'm thinking about, is that we can introduce the following model: > each memory cgroup has an integer oom priority value, 0 be default. > Root cgroup priority is always 0, other cgroups can have both positive > or negative priorities. > For our purposes we use a range of [0, 10000] for the per-process oom priority; 10000 implies the process is not oom killable, 5000 is the default. We use a range of [0, 9999] for the per-memcg oom priority since memcgs cannot disable themselves from oom killing (although they could oom disable all attached processes). We can obviously remap our priorities to whatever we decide here, but I think we should give ourselves more room and provide 10000 priorities at the minimum (we have 5000 true priorities plus overlimit bias). I'm not sure that negative priorities make sense in this model, is there a strong reason to prefer [-5000, 5000] over [0, 10000]? And, yes, the root memcg remains a constant oom priority and is never actually checked. > During OOM victim selection we compare cgroups on each hierarchy level > based on priority and size, if there are several cgroups with equal priority. > Per-task oom_score_adj will affect task selection inside a cgroup if > oom_kill_all_tasks is not set. -1000 special value will also completely > protect a task from being killed, if only oom_kill_all_tasks is not set. 
> If there are several cgroups of equal priority, we prefer the one that was created the most recently just to avoid losing work that has been done for a long period of time. But the key in this proposal is that we _always_ continue to iterate the memcg hierarchy until we find a process attached to a memcg with the lowest priority relative to sibling cgroups, if any. To adapt your model to this proposal, memory.oom_kill_all_tasks would only be effective if there are no descendant memcgs. In that case, iteration stops anyway and in my model we kill the process with the lowest per-process priority. This could trivially check memory.oom_kill_all_tasks and kill everything, and I'm happy to support that feature since we have had a need for it in the past as well. We should talk about when this priority-based scoring becomes effective. We enable it by default in our kernel, but it could be guarded with a VM sysctl if necessary to enact a system-wide policy. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org
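As a closing illustration of the leaf-level behavior discussed in this subthread, a rough sketch follows. mem_cgroup_scan_tasks() is the existing helper for iterating tasks attached to a memcg; the oom_kill_all_tasks flag, oom_kill_task_cb(), oom_evaluate_task_cb() and oom_kill_victim() are placeholders standing in for the proposed knob and for this patchset's selection and kill helpers, not existing API.

/*
 * Sketch only: once the hierarchy walk has settled on a leaf memcg,
 * either kill every task attached to it or fall back to a single victim.
 */
static void oom_kill_memcg_victim(struct mem_cgroup *leaf, struct oom_control *oc)
{
	if (leaf->oom_kill_all_tasks) {
		/* Treat the cgroup as one indivisible memory consumer. */
		mem_cgroup_scan_tasks(leaf, oom_kill_task_cb, oc);
		return;
	}

	/* Otherwise pick one task, honoring per-task oom_score_adj. */
	mem_cgroup_scan_tasks(leaf, oom_evaluate_task_cb, oc);
	if (oc->chosen)
		oom_kill_victim(oc->chosen, oc);
}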