From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail137.messagelabs.com (mail137.messagelabs.com [216.82.249.19]) by kanga.kvack.org (Postfix) with ESMTP id 52D256B0071 for ; Sun, 6 Jun 2010 18:34:05 -0400 (EDT) Received: from wpaz29.hot.corp.google.com (wpaz29.hot.corp.google.com [172.24.198.93]) by smtp-out.google.com with ESMTP id o56MY1RG008890 for ; Sun, 6 Jun 2010 15:34:02 -0700 Received: from pwj8 (pwj8.prod.google.com [10.241.219.72]) by wpaz29.hot.corp.google.com with ESMTP id o56MXxFY028846 for ; Sun, 6 Jun 2010 15:34:00 -0700 Received: by pwj8 with SMTP id 8so1813230pwj.26 for ; Sun, 06 Jun 2010 15:33:59 -0700 (PDT) Date: Sun, 6 Jun 2010 15:33:53 -0700 (PDT) From: David Rientjes Subject: [patch 00/18] oom killer rewrite Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-linux-mm@kvack.org To: Andrew Morton Cc: Rik van Riel , Nick Piggin , Oleg Nesterov , Balbir Singh , KAMEZAWA Hiroyuki , KOSAKI Motohiro , linux-mm@kvack.org List-ID: This is the latest update of the oom killer rewrite based on mmotm-2010-06-03-16-36, although it applies cleanly to 2.6.35-rc2 as well. There are two changes in this update, which I hope can now be considered for -mm inclusion and pushed for 2.6.36: - reordered the patches to more accurately separate fixes from enhancements: the order is now very close to how KAMEZAWA Hiroyuki suggested (thanks!), and - the changelog for "oom: badness heuristic rewrite" was slightly expanded to mention how this rewrite improves the oom killer's behavior on the desktop. Many thanks to Nick Piggin for converting the remaining architectures that weren't using the oom killer to handle pagefault oom conditions to do so. His patches have hit mainline, so there is no longer an inconsistency in the semantics of panic_on_oom in such cases! Many thanks to KAMEZAWA Hiroyuki for his help and patience in working with me on this patchset. 
--- Documentation/feature-removal-schedule.txt | 25 + Documentation/filesystems/proc.txt | 100 ++-- Documentation/sysctl/vm.txt | 23 fs/proc/base.c | 107 ++++ include/linux/memcontrol.h | 8 include/linux/mempolicy.h | 13 include/linux/oom.h | 27 + include/linux/sched.h | 3 kernel/fork.c | 1 kernel/sysctl.c | 12 mm/memcontrol.c | 18 mm/mempolicy.c | 44 + mm/oom_kill.c | 675 ++++++++++++++++------------- mm/page_alloc.c | 29 - 14 files changed, 727 insertions(+), 358 deletions(-) -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail202.messagelabs.com (mail202.messagelabs.com [216.82.254.227]) by kanga.kvack.org (Postfix) with ESMTP id 8C5A96B01AC for ; Sun, 6 Jun 2010 18:34:10 -0400 (EDT) Received: from kpbe17.cbf.corp.google.com (kpbe17.cbf.corp.google.com [172.25.105.81]) by smtp-out.google.com with ESMTP id o56MY6Ma023916 for ; Sun, 6 Jun 2010 15:34:06 -0700 Received: from pvg11 (pvg11.prod.google.com [10.241.210.139]) by kpbe17.cbf.corp.google.com with ESMTP id o56MY39X015304 for ; Sun, 6 Jun 2010 15:34:03 -0700 Received: by pvg11 with SMTP id 11so823069pvg.36 for ; Sun, 06 Jun 2010 15:34:03 -0700 (PDT) Date: Sun, 6 Jun 2010 15:34:00 -0700 (PDT) From: David Rientjes Subject: [patch 01/18] oom: check PF_KTHREAD instead of !mm to skip kthreads In-Reply-To: Message-ID: References: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-linux-mm@kvack.org To: Andrew Morton Cc: Rik van Riel , Nick Piggin , Oleg Nesterov , Balbir Singh , KAMEZAWA Hiroyuki , KOSAKI Motohiro , linux-mm@kvack.org List-ID: From: Oleg Nesterov select_bad_process() thinks a kernel thread can't have ->mm != NULL, this is not true due to use_mm(). Change the code to check PF_KTHREAD. 
Reviewed-by: KAMEZAWA Hiroyuki Signed-off-by: Oleg Nesterov Signed-off-by: David Rientjes --- mm/oom_kill.c | 9 +++------ 1 files changed, 3 insertions(+), 6 deletions(-) diff --git a/mm/oom_kill.c b/mm/oom_kill.c --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -256,14 +256,11 @@ static struct task_struct *select_bad_process(unsigned long *ppoints, for_each_process(p) { unsigned long points; - /* - * skip kernel threads and tasks which have already released - * their mm. - */ + /* skip tasks that have already released their mm */ if (!p->mm) continue; - /* skip the init task */ - if (is_global_init(p)) + /* skip the init task and kthreads */ + if (is_global_init(p) || (p->flags & PF_KTHREAD)) continue; if (mem && !task_in_mem_cgroup(p, mem)) continue; -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail143.messagelabs.com (mail143.messagelabs.com [216.82.254.35]) by kanga.kvack.org (Postfix) with ESMTP id 08C316B01B0 for ; Sun, 6 Jun 2010 18:34:14 -0400 (EDT) Received: from kpbe16.cbf.corp.google.com (kpbe16.cbf.corp.google.com [172.25.105.80]) by smtp-out.google.com with ESMTP id o56MYDwN022174 for ; Sun, 6 Jun 2010 15:34:13 -0700 Received: from pvh11 (pvh11.prod.google.com [10.241.210.203]) by kpbe16.cbf.corp.google.com with ESMTP id o56MYCMe011778 for ; Sun, 6 Jun 2010 15:34:12 -0700 Received: by pvh11 with SMTP id 11so1609710pvh.27 for ; Sun, 06 Jun 2010 15:34:12 -0700 (PDT) Date: Sun, 6 Jun 2010 15:34:03 -0700 (PDT) From: David Rientjes Subject: [patch 02/18] oom: introduce find_lock_task_mm() to fix !mm false positives In-Reply-To: Message-ID: References: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-linux-mm@kvack.org To: Andrew Morton Cc: Rik van Riel , Nick Piggin , Oleg Nesterov , Balbir Singh , KAMEZAWA Hiroyuki , KOSAKI Motohiro 
, linux-mm@kvack.org List-ID: From: Oleg Nesterov Almost all ->mm == NULL checks in oom_kill.c are wrong. The current code assumes that the task without ->mm has already released its memory and ignores the process. However this is not necessarily true when this process is multithreaded, other live sub-threads can use this ->mm. - Remove the "if (!p->mm)" check in select_bad_process(), it is just wrong. - Add the new helper, find_lock_task_mm(), which finds the live thread which uses the memory and takes task_lock() to pin ->mm - change oom_badness() to use this helper instead of just checking ->mm != NULL. - As David pointed out, select_bad_process() must never choose the task without ->mm, but no matter what oom_badness() returns the task can be chosen if nothing else has been found yet. Change oom_badness() to return int, change it to return -1 if find_lock_task_mm() fails, and change select_bad_process() to check points >= 0. Note! This patch is not enough, we need more changes. - oom_badness() was fixed, but oom_kill_task() still ignores the task without ->mm - oom_forkbomb_penalty() should use find_lock_task_mm() too, and it also needs other changes to actually find the first first-descendant children This will be addressed later. 
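For readers following along, the false positive described above can be reproduced in a small userspace model. This is only a sketch under stated assumptions: model_mm, model_task, model_find_task_mm() and model_old_has_mm() are hypothetical names that do not exist in the kernel, the thread group is modeled as a plain circular list, and the task_lock()/task_unlock() pinning done by the real helper is omitted.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace stand-ins for mm_struct and task_struct. */
struct model_mm {
	unsigned long total_vm;
};

struct model_task {
	struct model_mm *mm;            /* NULL once this thread released its mm */
	struct model_task *next_thread; /* circular list of the thread group */
};

/*
 * Walk the whole thread group and return the first thread that still has
 * an mm, or NULL if every thread has released it.  The real helper also
 * takes task_lock() on the returned thread; that is left out here.
 */
struct model_task *model_find_task_mm(struct model_task *p)
{
	struct model_task *t = p;

	assert(p != NULL);
	do {
		if (t->mm)
			return t;
		t = t->next_thread;
	} while (t != p);

	return NULL;
}

/* Old-style check: only looks at the one task it was handed. */
int model_old_has_mm(const struct model_task *p)
{
	return p->mm != NULL;
}
```

With a group whose leader has exited but whose worker thread is still running, the old-style check reports "no memory" for the leader while the group walk still finds the worker and its mm.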
[kosaki.motohiro@jp.fujitsu.com: use in badness(), __oom_kill_task()] Signed-off-by: Oleg Nesterov Signed-off-by: David Rientjes --- mm/oom_kill.c | 74 +++++++++++++++++++++++++++++++++------------------------ 1 files changed, 43 insertions(+), 31 deletions(-) diff --git a/mm/oom_kill.c b/mm/oom_kill.c --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -52,6 +52,20 @@ static int has_intersects_mems_allowed(struct task_struct *tsk) return 0; } +static struct task_struct *find_lock_task_mm(struct task_struct *p) +{ + struct task_struct *t = p; + + do { + task_lock(t); + if (likely(t->mm)) + return t; + task_unlock(t); + } while_each_thread(p, t); + + return NULL; +} + /** * badness - calculate a numeric value for how bad this task has been * @p: task struct of which task we should calculate @@ -74,8 +88,8 @@ static int has_intersects_mems_allowed(struct task_struct *tsk) unsigned long badness(struct task_struct *p, unsigned long uptime) { unsigned long points, cpu_time, run_time; - struct mm_struct *mm; struct task_struct *child; + struct task_struct *c, *t; int oom_adj = p->signal->oom_adj; struct task_cputime task_time; unsigned long utime; @@ -84,17 +98,14 @@ unsigned long badness(struct task_struct *p, unsigned long uptime) if (oom_adj == OOM_DISABLE) return 0; - task_lock(p); - mm = p->mm; - if (!mm) { - task_unlock(p); + p = find_lock_task_mm(p); + if (!p) return 0; - } /* * The memory size of the process is the basis for the badness. */ - points = mm->total_vm; + points = p->mm->total_vm; /* * After this unlock we can no longer dereference local variable `mm' @@ -115,12 +126,17 @@ unsigned long badness(struct task_struct *p, unsigned long uptime) * child is eating the vast majority of memory, adding only half * to the parents will make the child our kill candidate of choice. 
*/ - list_for_each_entry(child, &p->children, sibling) { - task_lock(child); - if (child->mm != mm && child->mm) - points += child->mm->total_vm/2 + 1; - task_unlock(child); - } + t = p; + do { + list_for_each_entry(c, &t->children, sibling) { + child = find_lock_task_mm(c); + if (child) { + if (child->mm != p->mm) + points += child->mm->total_vm/2 + 1; + task_unlock(child); + } + } + } while_each_thread(p, t); /* * CPU time is in tens of seconds and run time is in thousands @@ -256,9 +272,6 @@ static struct task_struct *select_bad_process(unsigned long *ppoints, for_each_process(p) { unsigned long points; - /* skip tasks that have already released their mm */ - if (!p->mm) - continue; /* skip the init task and kthreads */ if (is_global_init(p) || (p->flags & PF_KTHREAD)) continue; @@ -385,14 +398,9 @@ static void __oom_kill_task(struct task_struct *p, int verbose) return; } - task_lock(p); - if (!p->mm) { - WARN_ON(1); - printk(KERN_WARNING "tried to kill an mm-less task %d (%s)!\n", - task_pid_nr(p), p->comm); - task_unlock(p); + p = find_lock_task_mm(p); + if (!p) return; - } if (verbose) printk(KERN_ERR "Killed process %d (%s) " @@ -437,6 +445,7 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order, const char *message) { struct task_struct *c; + struct task_struct *t = p; if (printk_ratelimit()) dump_header(p, gfp_mask, order, mem); @@ -454,14 +463,17 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order, message, task_pid_nr(p), p->comm, points); /* Try to kill a child first */ - list_for_each_entry(c, &p->children, sibling) { - if (c->mm == p->mm) - continue; - if (mem && !task_in_mem_cgroup(c, mem)) - continue; - if (!oom_kill_task(c)) - return 0; - } + do { + list_for_each_entry(c, &t->children, sibling) { + if (c->mm == p->mm) + continue; + if (mem && !task_in_mem_cgroup(c, mem)) + continue; + if (!oom_kill_task(c)) + return 0; + } + } while_each_thread(p, t); + return oom_kill_task(p); } -- To 
unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail202.messagelabs.com (mail202.messagelabs.com [216.82.254.227]) by kanga.kvack.org (Postfix) with ESMTP id 203926B01B4 for ; Sun, 6 Jun 2010 18:34:19 -0400 (EDT) Received: from hpaq6.eem.corp.google.com (hpaq6.eem.corp.google.com [172.25.149.6]) by smtp-out.google.com with ESMTP id o56MYG5Y022163 for ; Sun, 6 Jun 2010 15:34:17 -0700 Received: from pzk34 (pzk34.prod.google.com [10.243.19.162]) by hpaq6.eem.corp.google.com with ESMTP id o56MYEW7030966 for ; Sun, 6 Jun 2010 15:34:15 -0700 Received: by pzk34 with SMTP id 34so3205059pzk.26 for ; Sun, 06 Jun 2010 15:34:14 -0700 (PDT) Date: Sun, 6 Jun 2010 15:34:12 -0700 (PDT) From: David Rientjes Subject: [patch 03/18] oom: dump_tasks use find_lock_task_mm too In-Reply-To: Message-ID: References: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-linux-mm@kvack.org To: Andrew Morton Cc: Rik van Riel , Nick Piggin , Oleg Nesterov , Balbir Singh , KAMEZAWA Hiroyuki , KOSAKI Motohiro , linux-mm@kvack.org List-ID: From: KOSAKI Motohiro dump_tasks() should use find_lock_task_mm() too. This is necessary to protect against the task-exiting race. 
Signed-off-by: KOSAKI Motohiro Signed-off-by: David Rientjes --- mm/oom_kill.c | 39 +++++++++++++++++++++------------------ 1 files changed, 21 insertions(+), 18 deletions(-) diff --git a/mm/oom_kill.c b/mm/oom_kill.c --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -336,35 +336,38 @@ static struct task_struct *select_bad_process(unsigned long *ppoints, */ static void dump_tasks(const struct mem_cgroup *mem) { - struct task_struct *g, *p; + struct task_struct *p; + struct task_struct *task; printk(KERN_INFO "[ pid ] uid tgid total_vm rss cpu oom_adj " "name\n"); - do_each_thread(g, p) { - struct mm_struct *mm; - - if (mem && !task_in_mem_cgroup(p, mem)) + for_each_process(p) { + /* + * We don't have is_global_init() check here, because the old + * code do that. printing init process is not big matter. But + * we don't hope to make unnecessary compatibility breaking. + */ + if (p->flags & PF_KTHREAD) continue; - if (!thread_group_leader(p)) + if (mem && !task_in_mem_cgroup(p, mem)) continue; - task_lock(p); - mm = p->mm; - if (!mm) { + task = find_lock_task_mm(p); + if (!task) { /* - * total_vm and rss sizes do not exist for tasks with no - * mm so there's no need to report them; they can't be - * oom killed anyway. + * Probably oom vs task-exiting race was happen and ->mm + * have been detached. thus there's no need to report + * them; they can't be oom killed anyway. */ - task_unlock(p); continue; } + printk(KERN_INFO "[%5d] %5d %5d %8lu %8lu %3d %3d %s\n", - p->pid, __task_cred(p)->uid, p->tgid, mm->total_vm, - get_mm_rss(mm), (int)task_cpu(p), p->signal->oom_adj, - p->comm); - task_unlock(p); - } while_each_thread(g, p); + task->pid, __task_cred(task)->uid, task->tgid, + task->mm->total_vm, get_mm_rss(task->mm), + (int)task_cpu(task), task->signal->oom_adj, p->comm); + task_unlock(task); + } } static void dump_header(struct task_struct *p, gfp_t gfp_mask, int order, -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. 
For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail143.messagelabs.com (mail143.messagelabs.com [216.82.254.35]) by kanga.kvack.org (Postfix) with ESMTP id 10DC06B01B5 for ; Sun, 6 Jun 2010 18:34:21 -0400 (EDT) Received: from hpaq11.eem.corp.google.com (hpaq11.eem.corp.google.com [172.25.149.11]) by smtp-out.google.com with ESMTP id o56MYKwC022212 for ; Sun, 6 Jun 2010 15:34:20 -0700 Received: from pvh1 (pvh1.prod.google.com [10.241.210.193]) by hpaq11.eem.corp.google.com with ESMTP id o56MYIjG006547 for ; Sun, 6 Jun 2010 15:34:19 -0700 Received: by pvh1 with SMTP id 1so890241pvh.20 for ; Sun, 06 Jun 2010 15:34:18 -0700 (PDT) Date: Sun, 6 Jun 2010 15:34:15 -0700 (PDT) From: David Rientjes Subject: [patch 04/18] oom: PF_EXITING check should take mm into account In-Reply-To: Message-ID: References: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-linux-mm@kvack.org To: Andrew Morton Cc: Rik van Riel , Nick Piggin , Oleg Nesterov , Balbir Singh , KAMEZAWA Hiroyuki , KOSAKI Motohiro , linux-mm@kvack.org List-ID: From: Oleg Nesterov select_bad_process() checks PF_EXITING to detect the task which is going to release its memory, but the logic is very wrong. - a single process P with the dead group leader disables select_bad_process() completely, it will always return ERR_PTR() while P can live forever - if the PF_EXITING task has already released its ->mm it doesn't make sense to expect it is going to free more memory (except task_struct/etc) Change the code to ignore the PF_EXITING tasks without ->mm. 
Signed-off-by: Oleg Nesterov Signed-off-by: David Rientjes --- mm/oom_kill.c | 2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/mm/oom_kill.c b/mm/oom_kill.c --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -300,7 +300,7 @@ static struct task_struct *select_bad_process(unsigned long *ppoints, * the process of exiting and releasing its resources. * Otherwise we could get an easy OOM deadlock. */ - if (p->flags & PF_EXITING) { + if ((p->flags & PF_EXITING) && p->mm) { if (p != current) return ERR_PTR(-1UL); -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail172.messagelabs.com (mail172.messagelabs.com [216.82.254.3]) by kanga.kvack.org (Postfix) with ESMTP id 503FB6B01B5 for ; Sun, 6 Jun 2010 18:34:24 -0400 (EDT) Received: from kpbe11.cbf.corp.google.com (kpbe11.cbf.corp.google.com [172.25.105.75]) by smtp-out.google.com with ESMTP id o56MYMSk009229 for ; Sun, 6 Jun 2010 15:34:23 -0700 Received: from pwj6 (pwj6.prod.google.com [10.241.219.70]) by kpbe11.cbf.corp.google.com with ESMTP id o56MYLDx011696 for ; Sun, 6 Jun 2010 15:34:22 -0700 Received: by pwj6 with SMTP id 6so1769296pwj.24 for ; Sun, 06 Jun 2010 15:34:21 -0700 (PDT) Date: Sun, 6 Jun 2010 15:34:18 -0700 (PDT) From: David Rientjes Subject: [patch 05/18] oom: give current access to memory reserves if it has been killed In-Reply-To: Message-ID: References: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-linux-mm@kvack.org To: Andrew Morton Cc: Rik van Riel , Nick Piggin , Oleg Nesterov , Balbir Singh , KAMEZAWA Hiroyuki , KOSAKI Motohiro , linux-mm@kvack.org List-ID: It's possible to livelock the page allocator if a thread has mm->mmap_sem and fails to make forward progress because the oom killer selects another thread sharing the same ->mm to kill that cannot exit 
until the semaphore is dropped. The oom killer will not kill multiple tasks at the same time; each oom killed task must exit before another task may be killed. Thus, if one thread is holding mm->mmap_sem and cannot allocate memory, all threads sharing the same ->mm are blocked from exiting as well. In the oom kill case, that means the thread holding mm->mmap_sem will never free additional memory since it cannot get access to memory reserves and the thread that depends on it with access to memory reserves cannot exit because it cannot acquire the semaphore. Thus, the page allocator livelocks. When the oom killer is called and current happens to have a pending SIGKILL, this patch automatically gives it access to memory reserves and returns. Upon returning to the page allocator, its allocation will hopefully succeed so it can quickly exit and free its memory. If not, the page allocator will fail the allocation if it is not __GFP_NOFAIL. Acked-by: KOSAKI Motohiro Reviewed-by: KAMEZAWA Hiroyuki Signed-off-by: David Rientjes --- mm/oom_kill.c | 10 ++++++++++ 1 files changed, 10 insertions(+), 0 deletions(-) diff --git a/mm/oom_kill.c b/mm/oom_kill.c --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -650,6 +650,16 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask, /* Got some memory back in the last second. */ return; + /* + * If current has a pending SIGKILL, then automatically select it. The + * goal is to allow it to allocate so that it may quickly exit and free + * its memory. + */ + if (fatal_signal_pending(current)) { + set_thread_flag(TIF_MEMDIE); + return; + } + if (sysctl_panic_on_oom == 2) { dump_header(NULL, gfp_mask, order, NULL); panic("out of memory. Compulsory panic_on_oom is selected.\n"); -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail203.messagelabs.com (mail203.messagelabs.com [216.82.254.243]) by kanga.kvack.org (Postfix) with ESMTP id 7B1C26B01BA for ; Sun, 6 Jun 2010 18:34:30 -0400 (EDT) Received: from wpaz29.hot.corp.google.com (wpaz29.hot.corp.google.com [172.24.198.93]) by smtp-out.google.com with ESMTP id o56MYQUB022729 for ; Sun, 6 Jun 2010 15:34:26 -0700 Received: from pxi12 (pxi12.prod.google.com [10.243.27.12]) by wpaz29.hot.corp.google.com with ESMTP id o56MYOAB029229 for ; Sun, 6 Jun 2010 15:34:25 -0700 Received: by pxi12 with SMTP id 12so2928371pxi.0 for ; Sun, 06 Jun 2010 15:34:24 -0700 (PDT) Date: Sun, 6 Jun 2010 15:34:22 -0700 (PDT) From: David Rientjes Subject: [patch 06/18] oom: avoid sending exiting tasks a SIGKILL In-Reply-To: Message-ID: References: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-linux-mm@kvack.org To: Andrew Morton Cc: Rik van Riel , Nick Piggin , Oleg Nesterov , Balbir Singh , KAMEZAWA Hiroyuki , KOSAKI Motohiro , linux-mm@kvack.org List-ID: It's unnecessary to SIGKILL a task that is already PF_EXITING and can actually cause a NULL pointer dereference of the sighand if it has already been detached. Instead, simply set TIF_MEMDIE so it has access to memory reserves and can quickly exit as the comment implies. Reviewed-by: KAMEZAWA Hiroyuki Signed-off-by: David Rientjes --- mm/oom_kill.c | 2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/mm/oom_kill.c b/mm/oom_kill.c --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -458,7 +458,7 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order, * its children or threads, just set TIF_MEMDIE so it can die quickly */ if (p->flags & PF_EXITING) { - __oom_kill_task(p, 0); + set_tsk_thread_flag(p, TIF_MEMDIE); return 0; } -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. 
For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail202.messagelabs.com (mail202.messagelabs.com [216.82.254.227]) by kanga.kvack.org (Postfix) with ESMTP id E44A86B01BE for ; Sun, 6 Jun 2010 18:34:30 -0400 (EDT) Received: from kpbe15.cbf.corp.google.com (kpbe15.cbf.corp.google.com [172.25.105.79]) by smtp-out.google.com with ESMTP id o56MYTmR015906 for ; Sun, 6 Jun 2010 15:34:29 -0700 Received: from pvh11 (pvh11.prod.google.com [10.241.210.203]) by kpbe15.cbf.corp.google.com with ESMTP id o56MYEAc011115 for ; Sun, 6 Jun 2010 15:34:28 -0700 Received: by pvh11 with SMTP id 11so1700083pvh.41 for ; Sun, 06 Jun 2010 15:34:27 -0700 (PDT) Date: Sun, 6 Jun 2010 15:34:25 -0700 (PDT) From: David Rientjes Subject: [patch 07/18] oom: filter tasks not sharing the same cpuset In-Reply-To: Message-ID: References: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-linux-mm@kvack.org To: Andrew Morton Cc: Rik van Riel , Nick Piggin , Oleg Nesterov , Balbir Singh , KAMEZAWA Hiroyuki , KOSAKI Motohiro , linux-mm@kvack.org List-ID: Tasks that do not share the same set of allowed nodes with the task that triggered the oom should not be considered as candidates for oom kill. Tasks in other cpusets with a disjoint set of mems would be unfairly penalized otherwise because of oom conditions elsewhere; an extreme example could unfairly kill all other applications on the system if a single task in a user's cpuset sets itself to OOM_DISABLE and then uses more memory than allowed. Killing tasks outside of current's cpuset rarely would free memory for current anyway. To use a sane heuristic, we must ensure that killing a task would likely free memory for current and avoid needlessly killing others at all costs just because their potential memory freeing is unknown. It is better to kill current than another task needlessly. 
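As an illustration of the difference between penalizing and filtering, the selection rule argued for above might be modeled in userspace roughly as follows. This is a sketch, not the kernel code: cs_task and cs_select() are hypothetical names, each task's allowed memory nodes are reduced to a plain bitmask, and the badness score is taken as a given number.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of a candidate task. */
struct cs_task {
	const char *name;
	unsigned long mems_allowed; /* bit n set: node n is allowed */
	unsigned long points;       /* precomputed badness score */
};

/*
 * Return the highest-scoring task whose allowed nodes intersect
 * current_mems, or NULL if no task qualifies.  Tasks with a disjoint
 * set of nodes are skipped outright instead of having their score
 * scaled down.
 */
const struct cs_task *cs_select(const struct cs_task *tasks, size_t nr,
				unsigned long current_mems)
{
	const struct cs_task *chosen = NULL;

	assert(tasks != NULL || nr == 0);
	for (size_t i = 0; i < nr; i++) {
		if (!(tasks[i].mems_allowed & current_mems))
			continue; /* killing it would likely free nothing for us */
		if (!chosen || tasks[i].points > chosen->points)
			chosen = &tasks[i];
	}

	return chosen;
}
```

In this model a very large task confined to other nodes is never selected, no matter how high its score, which is the behavior the changelog asks for.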
Acked-by: Rik van Riel Acked-by: Nick Piggin Acked-by: Balbir Singh Acked-by: KOSAKI Motohiro Reviewed-by: KAMEZAWA Hiroyuki Signed-off-by: David Rientjes --- mm/oom_kill.c | 10 ++-------- 1 files changed, 2 insertions(+), 8 deletions(-) diff --git a/mm/oom_kill.c b/mm/oom_kill.c --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -184,14 +184,6 @@ unsigned long badness(struct task_struct *p, unsigned long uptime) points /= 4; /* - * If p's nodes don't overlap ours, it may still help to kill p - * because p may have allocated or otherwise mapped memory on - * this node before. However it will be less likely. - */ - if (!has_intersects_mems_allowed(p)) - points /= 8; - - /* * Adjust the score by oom_adj. */ if (oom_adj) { @@ -277,6 +269,8 @@ static struct task_struct *select_bad_process(unsigned long *ppoints, continue; if (mem && !task_in_mem_cgroup(p, mem)) continue; + if (!has_intersects_mems_allowed(p)) + continue; /* * This task already has access to memory reserves and is -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail138.messagelabs.com (mail138.messagelabs.com [216.82.249.35]) by kanga.kvack.org (Postfix) with ESMTP id 162956B01BF for ; Sun, 6 Jun 2010 18:34:34 -0400 (EDT) Received: from kpbe18.cbf.corp.google.com (kpbe18.cbf.corp.google.com [172.25.105.82]) by smtp-out.google.com with ESMTP id o56MYWVN009327 for ; Sun, 6 Jun 2010 15:34:32 -0700 Received: from pzk42 (pzk42.prod.google.com [10.243.19.170]) by kpbe18.cbf.corp.google.com with ESMTP id o56MYVqH018758 for ; Sun, 6 Jun 2010 15:34:31 -0700 Received: by pzk42 with SMTP id 42so1147680pzk.4 for ; Sun, 06 Jun 2010 15:34:31 -0700 (PDT) Date: Sun, 6 Jun 2010 15:34:28 -0700 (PDT) From: David Rientjes Subject: [patch 08/18] oom: sacrifice child with highest badness score for parent In-Reply-To: Message-ID: References: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-linux-mm@kvack.org To: Andrew Morton Cc: Rik van Riel , Nick Piggin , Oleg Nesterov , Balbir Singh , KAMEZAWA Hiroyuki , KOSAKI Motohiro , linux-mm@kvack.org List-ID: When a task is chosen for oom kill, the oom killer first attempts to sacrifice a child not sharing its parent's memory instead. Unfortunately, this often kills in a seemingly random fashion based on the ordering of the selected task's child list. Additionally, it is not guaranteed at all to free a large amount of memory that we need to prevent additional oom killing in the very near future. Instead, we now only attempt to sacrifice the worst child not sharing its parent's memory, if one exists. The worst child is indicated with the highest badness() score. This serves two advantages: we kill a memory-hogging task more often, and we allow the configurable /proc/pid/oom_adj value to be considered as a factor in which child to kill. 
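The "worst child" rule described above can be sketched in userspace as follows. The names sac_task and sac_pick_victim() are hypothetical, memory sharing is modeled by comparing an integer mm_id rather than an mm pointer, and the score is assumed to be precomputed, with 0 standing for an unkillable task as badness() does in this series.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of a task for victim selection. */
struct sac_task {
	int mm_id;            /* tasks with equal mm_id share memory */
	unsigned long points; /* badness score; 0 means unkillable here */
};

/*
 * Pick the child with the highest score that does not share the parent's
 * memory.  If no child scores above 0, the parent itself is the victim.
 */
const struct sac_task *sac_pick_victim(const struct sac_task *parent,
				       const struct sac_task *children,
				       size_t nr_children)
{
	const struct sac_task *victim = parent;
	unsigned long victim_points = 0;

	assert(parent != NULL);
	for (size_t i = 0; i < nr_children; i++) {
		if (children[i].mm_id == parent->mm_id)
			continue; /* shares the parent's memory */
		if (children[i].points > victim_points) {
			victim = &children[i];
			victim_points = children[i].points;
		}
	}

	return victim;
}
```

Unlike the previous first-killable-child behavior, the result here does not depend on the order of the child list.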
Reviewers may observe that the previous implementation would iterate through the children and attempt to kill each until one was successful and then the parent if none were found while the new code simply kills the most memory-hogging task or the parent. Note that the only time oom_kill_task() fails, however, is when a child does not have an mm or has a /proc/pid/oom_adj of OOM_DISABLE. badness() returns 0 for both cases, so the final oom_kill_task() will always succeed. Acked-by: Rik van Riel Acked-by: Nick Piggin Acked-by: Balbir Singh Acked-by: KOSAKI Motohiro Reviewed-by: KAMEZAWA Hiroyuki Reviewed-by: KOSAKI Motohiro Signed-off-by: David Rientjes --- mm/oom_kill.c | 23 +++++++++++++++++------ 1 files changed, 17 insertions(+), 6 deletions(-) diff --git a/mm/oom_kill.c b/mm/oom_kill.c --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -441,8 +441,11 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order, unsigned long points, struct mem_cgroup *mem, const char *message) { + struct task_struct *victim = p; struct task_struct *c; struct task_struct *t = p; + unsigned long victim_points = 0; + struct timespec uptime; if (printk_ratelimit()) dump_header(p, gfp_mask, order, mem); @@ -456,22 +459,30 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order, return 0; } - printk(KERN_ERR "%s: kill process %d (%s) score %li or a child\n", - message, task_pid_nr(p), p->comm, points); + pr_err("%s: Kill process %d (%s) score %lu or sacrifice child\n", + message, task_pid_nr(p), p->comm, points); - /* Try to kill a child first */ + /* Try to sacrifice the worst child first */ + do_posix_clock_monotonic_gettime(&uptime); do { + unsigned long cpoints; + list_for_each_entry(c, &t->children, sibling) { if (c->mm == p->mm) continue; if (mem && !task_in_mem_cgroup(c, mem)) continue; - if (!oom_kill_task(c)) - return 0; + + /* badness() returns 0 if the thread is unkillable */ + cpoints = badness(c, uptime.tv_sec); + if (cpoints > 
victim_points) { + victim = c; + victim_points = cpoints; + } } } while_each_thread(p, t); - return oom_kill_task(p); + return oom_kill_task(victim); } #ifdef CONFIG_CGROUP_MEM_RES_CTLR -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail202.messagelabs.com (mail202.messagelabs.com [216.82.254.227]) by kanga.kvack.org (Postfix) with ESMTP id 7182E6B01C6 for ; Sun, 6 Jun 2010 18:34:40 -0400 (EDT) Received: from wpaz21.hot.corp.google.com (wpaz21.hot.corp.google.com [172.24.198.85]) by smtp-out.google.com with ESMTP id o56MYcCN022556 for ; Sun, 6 Jun 2010 15:34:38 -0700 Received: from pwj3 (pwj3.prod.google.com [10.241.219.67]) by wpaz21.hot.corp.google.com with ESMTP id o56MYbbw031368 for ; Sun, 6 Jun 2010 15:34:38 -0700 Received: by pwj3 with SMTP id 3so2483978pwj.18 for ; Sun, 06 Jun 2010 15:34:37 -0700 (PDT) Date: Sun, 6 Jun 2010 15:34:35 -0700 (PDT) From: David Rientjes Subject: [patch 10/18] oom: enable oom tasklist dump by default In-Reply-To: Message-ID: References: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-linux-mm@kvack.org To: Andrew Morton Cc: Rik van Riel , Nick Piggin , Oleg Nesterov , Balbir Singh , KAMEZAWA Hiroyuki , KOSAKI Motohiro , linux-mm@kvack.org List-ID: The oom killer tasklist dump, enabled with the oom_dump_tasks sysctl, is very helpful information in diagnosing why a user's task has been killed. It emits useful information such as each eligible thread's memory usage that can determine why the system is oom, so it should be enabled by default. 
Acked-by: KOSAKI Motohiro Signed-off-by: David Rientjes --- Documentation/sysctl/vm.txt | 2 +- mm/oom_kill.c | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt --- a/Documentation/sysctl/vm.txt +++ b/Documentation/sysctl/vm.txt @@ -511,7 +511,7 @@ information may not be desired. If this is set to non-zero, this information is shown whenever the OOM killer actually kills a memory-hogging task. -The default value is 0. +The default value is 1 (enabled). ============================================================== diff --git a/mm/oom_kill.c b/mm/oom_kill.c index ef048c1..833de48 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -32,7 +32,7 @@ int sysctl_panic_on_oom; int sysctl_oom_kill_allocating_task; -int sysctl_oom_dump_tasks; +int sysctl_oom_dump_tasks = 1; static DEFINE_SPINLOCK(zone_scan_lock); /* #define DEBUG */ -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail172.messagelabs.com (mail172.messagelabs.com [216.82.254.3]) by kanga.kvack.org (Postfix) with ESMTP id D11C86B01C7 for ; Sun, 6 Jun 2010 18:34:40 -0400 (EDT) Received: from wpaz9.hot.corp.google.com (wpaz9.hot.corp.google.com [172.24.198.73]) by smtp-out.google.com with ESMTP id o56MYacO006844 for ; Sun, 6 Jun 2010 15:34:36 -0700 Received: from pxi15 (pxi15.prod.google.com [10.243.27.15]) by wpaz9.hot.corp.google.com with ESMTP id o56MYYD9007938 for ; Sun, 6 Jun 2010 15:34:35 -0700 Received: by pxi15 with SMTP id 15so958655pxi.30 for ; Sun, 06 Jun 2010 15:34:34 -0700 (PDT) Date: Sun, 6 Jun 2010 15:34:31 -0700 (PDT) From: David Rientjes Subject: [patch 09/18] oom: select task from tasklist for mempolicy ooms In-Reply-To: Message-ID: References: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-linux-mm@kvack.org To: Andrew Morton Cc: Rik van Riel , Nick Piggin , Oleg Nesterov , Balbir Singh , KAMEZAWA Hiroyuki , KOSAKI Motohiro , linux-mm@kvack.org List-ID: The oom killer presently kills current whenever there is no more memory free or reclaimable on its mempolicy's nodes. There is no guarantee that current is a memory-hogging task or that killing it will free any substantial amount of memory, however. In such situations, it is better to scan the tasklist for nodes that are allowed to allocate on current's set of nodes and kill the task with the highest badness() score. This ensures that the most memory-hogging task, or the one configured by the user with /proc/pid/oom_adj, is always selected in such scenarios. 
Reviewed-by: KOSAKI Motohiro Signed-off-by: David Rientjes --- include/linux/mempolicy.h | 13 +++++++- mm/mempolicy.c | 44 ++++++++++++++++++++++++ mm/oom_kill.c | 80 +++++++++++++++++++++++++++----------------- 3 files changed, 105 insertions(+), 32 deletions(-) diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h --- a/include/linux/mempolicy.h +++ b/include/linux/mempolicy.h @@ -210,6 +210,8 @@ extern struct zonelist *huge_zonelist(struct vm_area_struct *vma, unsigned long addr, gfp_t gfp_flags, struct mempolicy **mpol, nodemask_t **nodemask); extern bool init_nodemask_of_mempolicy(nodemask_t *mask); +extern bool mempolicy_nodemask_intersects(struct task_struct *tsk, + const nodemask_t *mask); extern unsigned slab_node(struct mempolicy *policy); extern enum zone_type policy_zone; @@ -338,7 +340,16 @@ static inline struct zonelist *huge_zonelist(struct vm_area_struct *vma, return node_zonelist(0, gfp_flags); } -static inline bool init_nodemask_of_mempolicy(nodemask_t *m) { return false; } +static inline bool init_nodemask_of_mempolicy(nodemask_t *m) +{ + return false; +} + +static inline bool mempolicy_nodemask_intersects(struct task_struct *tsk, + const nodemask_t *mask) +{ + return false; +} static inline int do_migrate_pages(struct mm_struct *mm, const nodemask_t *from_nodes, diff --git a/mm/mempolicy.c b/mm/mempolicy.c --- a/mm/mempolicy.c +++ b/mm/mempolicy.c @@ -1712,6 +1712,50 @@ bool init_nodemask_of_mempolicy(nodemask_t *mask) } #endif +/* + * mempolicy_nodemask_intersects + * + * If tsk's mempolicy is "default" [NULL], return 'true' to indicate default + * policy. Otherwise, check for intersection between mask and the policy + * nodemask for 'bind' or 'interleave' policy. For 'perferred' or 'local' + * policy, always return true since it may allocate elsewhere on fallback. + * + * Takes task_lock(tsk) to prevent freeing of its mempolicy. 
+ */ +bool mempolicy_nodemask_intersects(struct task_struct *tsk, + const nodemask_t *mask) +{ + struct mempolicy *mempolicy; + bool ret = true; + + if (!mask) + return ret; + task_lock(tsk); + mempolicy = tsk->mempolicy; + if (!mempolicy) + goto out; + + switch (mempolicy->mode) { + case MPOL_PREFERRED: + /* + * MPOL_PREFERRED and MPOL_F_LOCAL are only preferred nodes to + * allocate from, they may fallback to other nodes when oom. + * Thus, it's possible for tsk to have allocated memory from + * nodes in mask. + */ + break; + case MPOL_BIND: + case MPOL_INTERLEAVE: + ret = nodes_intersects(mempolicy->v.nodes, *mask); + break; + default: + BUG(); + } +out: + task_unlock(tsk); + return ret; +} + /* Allocate a page in interleaved policy. Own path because it needs to do special accounting. */ static struct page *alloc_page_interleave(gfp_t gfp, unsigned order, diff --git a/mm/oom_kill.c b/mm/oom_kill.c --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -27,6 +27,7 @@ #include #include #include +#include #include int sysctl_panic_on_oom; @@ -36,20 +37,36 @@ static DEFINE_SPINLOCK(zone_scan_lock); /* #define DEBUG */ /* - * Is all threads of the target process nodes overlap ours? + * Do all threads of the target process overlap our allowed nodes? + * @tsk: task struct of which task to consider + * @mask: nodemask passed to page allocator for mempolicy ooms */ -static int has_intersects_mems_allowed(struct task_struct *tsk) +static bool has_intersects_mems_allowed(struct task_struct *tsk, + const nodemask_t *mask) { - struct task_struct *t; + struct task_struct *start = tsk; - t = tsk; do { - if (cpuset_mems_allowed_intersects(current, t)) - return 1; - t = next_thread(t); - } while (t != tsk); - - return 0; + if (mask) { + /* + * If this is a mempolicy constrained oom, tsk's + * cpuset is irrelevant. Only return true if its + * mempolicy intersects current, otherwise it may be + * needlessly killed. 
+ */ + if (mempolicy_nodemask_intersects(tsk, mask)) + return true; + } else { + /* + * This is not a mempolicy constrained oom, so only + * check the mems of tsk's cpuset. + */ + if (cpuset_mems_allowed_intersects(current, tsk)) + return true; + } + tsk = next_thread(tsk); + } while (tsk != start); + return false; } static struct task_struct *find_lock_task_mm(struct task_struct *p) @@ -253,7 +270,8 @@ static enum oom_constraint constrained_alloc(struct zonelist *zonelist, * (not docbooked, we don't want this one cluttering up the manual) */ static struct task_struct *select_bad_process(unsigned long *ppoints, - struct mem_cgroup *mem) + struct mem_cgroup *mem, enum oom_constraint constraint, + const nodemask_t *mask) { struct task_struct *p; struct task_struct *chosen = NULL; @@ -269,7 +287,9 @@ static struct task_struct *select_bad_process(unsigned long *ppoints, continue; if (mem && !task_in_mem_cgroup(p, mem)) continue; - if (!has_intersects_mems_allowed(p)) + if (!has_intersects_mems_allowed(p, + constraint == CONSTRAINT_MEMORY_POLICY ? mask : + NULL)) continue; /* @@ -495,7 +515,7 @@ void mem_cgroup_out_of_memory(struct mem_cgroup *mem, gfp_t gfp_mask) panic("out of memory(memcg). panic_on_oom is selected.\n"); read_lock(&tasklist_lock); retry: - p = select_bad_process(&points, mem); + p = select_bad_process(&points, mem, CONSTRAINT_NONE, NULL); if (!p || PTR_ERR(p) == -1UL) goto out; @@ -574,7 +594,8 @@ void clear_zonelist_oom(struct zonelist *zonelist, gfp_t gfp_mask) /* * Must be called with tasklist_lock held for read. */ -static void __out_of_memory(gfp_t gfp_mask, int order) +static void __out_of_memory(gfp_t gfp_mask, int order, + enum oom_constraint constraint, const nodemask_t *mask) { struct task_struct *p; unsigned long points; @@ -588,7 +609,7 @@ retry: * Rambo mode: Shoot down a process and hope it solves whatever * issues we may have. 
*/ - p = select_bad_process(&points, NULL); + p = select_bad_process(&points, NULL, constraint, mask); if (PTR_ERR(p) == -1UL) return; @@ -622,7 +643,8 @@ void pagefault_out_of_memory(void) panic("out of memory from page fault. panic_on_oom is selected.\n"); read_lock(&tasklist_lock); - __out_of_memory(0, 0); /* unknown gfp_mask and order */ + /* unknown gfp_mask and order */ + __out_of_memory(0, 0, CONSTRAINT_NONE, NULL); read_unlock(&tasklist_lock); /* @@ -638,6 +660,7 @@ void pagefault_out_of_memory(void) * @zonelist: zonelist pointer * @gfp_mask: memory allocation flags * @order: amount of memory being requested as a power of 2 + * @nodemask: nodemask passed to page allocator * * If we run out of memory, we have the choice between either * killing a random task (bad), letting the system crash (worse) @@ -676,24 +699,19 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask, */ constraint = constrained_alloc(zonelist, gfp_mask, nodemask); read_lock(&tasklist_lock); - - switch (constraint) { - case CONSTRAINT_MEMORY_POLICY: - oom_kill_process(current, gfp_mask, order, 0, NULL, - "No available memory (MPOL_BIND)"); - break; - - case CONSTRAINT_NONE: - if (sysctl_panic_on_oom) { + if (unlikely(sysctl_panic_on_oom)) { + /* + * panic_on_oom only affects CONSTRAINT_NONE, the kernel + * should not panic for cpuset or mempolicy induced memory + * failures. + */ + if (constraint == CONSTRAINT_NONE) { dump_header(NULL, gfp_mask, order, NULL); - panic("out of memory. panic_on_oom is selected\n"); + read_unlock(&tasklist_lock); + panic("Out of memory: panic_on_oom is enabled\n"); } - /* Fall-through */ - case CONSTRAINT_CPUSET: - __out_of_memory(gfp_mask, order); - break; } - + __out_of_memory(gfp_mask, order, constraint, nodemask); read_unlock(&tasklist_lock); /* -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail203.messagelabs.com (mail203.messagelabs.com [216.82.254.243]) by kanga.kvack.org (Postfix) with ESMTP id 8838B6B01C8 for ; Sun, 6 Jun 2010 18:34:44 -0400 (EDT) Received: from hpaq3.eem.corp.google.com (hpaq3.eem.corp.google.com [172.25.149.3]) by smtp-out.google.com with ESMTP id o56MYgqt009441 for ; Sun, 6 Jun 2010 15:34:42 -0700 Received: from pxi7 (pxi7.prod.google.com [10.243.27.7]) by hpaq3.eem.corp.google.com with ESMTP id o56MYYdu017757 for ; Sun, 6 Jun 2010 15:34:41 -0700 Received: by pxi7 with SMTP id 7so1265448pxi.41 for ; Sun, 06 Jun 2010 15:34:40 -0700 (PDT) Date: Sun, 6 Jun 2010 15:34:38 -0700 (PDT) From: David Rientjes Subject: [patch 11/18] oom: avoid oom killer for lowmem allocations In-Reply-To: Message-ID: References: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-linux-mm@kvack.org To: Andrew Morton Cc: Rik van Riel , Nick Piggin , Oleg Nesterov , Balbir Singh , KAMEZAWA Hiroyuki , KOSAKI Motohiro , linux-mm@kvack.org List-ID: If memory has been depleted in lowmem zones even with the protection afforded to it by /proc/sys/vm/lowmem_reserve_ratio, it is unlikely that killing current users will help. The memory is either reclaimable (or migratable) already, in which case we should not invoke the oom killer at all, or it is pinned by an application for I/O. Killing such an application may leave the hardware in an unspecified state and there is no guarantee that it will be able to make a timely exit. Lowmem allocations are now failed in oom conditions when __GFP_NOFAIL is not used so that the task can perhaps recover or try again later. Previously, the heuristic provided some protection for those tasks with CAP_SYS_RAWIO, but this is no longer necessary since we will not be killing tasks for the purposes of ISA allocations. 
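The allocator-side decision can be modelled in a few lines of userspace C, as a sketch only. The enum and both function names are hypothetical, MODEL_COSTLY_ORDER is assumed to equal PAGE_ALLOC_COSTLY_ORDER (3), and the zone ordering is assumed to be DMA < DMA32 < NORMAL < HIGHMEM. The first function follows the check added to __alloc_pages_may_oom() as the hunk reads in this patch; the second follows the rewritten retry logic in the slow path.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical zone indices, lowest first (assumed ordering). */
enum model_zone {
	MODEL_ZONE_DMA,
	MODEL_ZONE_DMA32,
	MODEL_ZONE_NORMAL,
	MODEL_ZONE_HIGHMEM,
};

#define MODEL_COSTLY_ORDER 3	/* assumed value of PAGE_ALLOC_COSTLY_ORDER */

/* Should the oom killer be invoked at all for this allocation? */
static bool may_invoke_oom_killer(unsigned int order, enum model_zone high_zoneidx)
{
	assert(high_zoneidx <= MODEL_ZONE_HIGHMEM);
	if (order > MODEL_COSTLY_ORDER)
		return false;	/* killing will not help high-order allocations */
	if (high_zoneidx < MODEL_ZONE_NORMAL)
		return false;	/* lowmem: do not kill for DMA/DMA32 requests */
	return true;
}

/* After no progress was made: fail the allocation instead of restarting? */
static bool should_fail_allocation(unsigned int order, enum model_zone high_zoneidx,
				   bool nofail)
{
	if (nofail)
		return false;	/* __GFP_NOFAIL callers keep retrying */
	return order > MODEL_COSTLY_ORDER || high_zoneidx < MODEL_ZONE_NORMAL;
}
```

For example, an order-0 request limited to the DMA32 zone without __GFP_NOFAIL never reaches the oom killer in this model and is failed rather than retried.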
high_zoneidx is gfp_zone(gfp_flags), meaning that ZONE_NORMAL will be the default for all allocations that are not __GFP_DMA, __GFP_DMA32, __GFP_HIGHMEM, and __GFP_MOVABLE on kernels configured to support those flags. Testing for high_zoneidx being less than ZONE_NORMAL will only return true for allocations that have either __GFP_DMA or __GFP_DMA32. Acked-by: KOSAKI Motohiro Signed-off-by: David Rientjes --- mm/page_alloc.c | 29 ++++++++++++++++++++--------- 1 files changed, 20 insertions(+), 9 deletions(-) diff --git a/mm/page_alloc.c b/mm/page_alloc.c --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -1759,6 +1759,9 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order, /* The OOM killer will not help higher order allocs */ if (order > PAGE_ALLOC_COSTLY_ORDER) goto out; + /* The OOM killer does not needlessly kill tasks for lowmem */ + if (high_zoneidx < ZONE_NORMAL) + goto out; /* * GFP_THISNODE contains __GFP_NORETRY and we never hit this. * Sanity check for bare calls of __GFP_THISNODE, not real OOM. @@ -2052,15 +2055,23 @@ rebalance: if (page) goto got_pg; - /* - * The OOM killer does not trigger for high-order - * ~__GFP_NOFAIL allocations so if no progress is being - * made, there are no other options and retrying is - * unlikely to help. - */ - if (order > PAGE_ALLOC_COSTLY_ORDER && - !(gfp_mask & __GFP_NOFAIL)) - goto nopage; + if (!(gfp_mask & __GFP_NOFAIL)) { + /* + * The oom killer is not called for high-order + * allocations that may fail, so if no progress + * is being made, there are no other options and + * retrying is unlikely to help. + */ + if (order > PAGE_ALLOC_COSTLY_ORDER) + goto nopage; + /* + * The oom killer is not called for lowmem + * allocations to prevent needlessly killing + * innocent tasks. + */ + if (high_zoneidx < ZONE_NORMAL) + goto nopage; + } goto restart; } -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail137.messagelabs.com (mail137.messagelabs.com [216.82.249.19]) by kanga.kvack.org (Postfix) with ESMTP id 6F7126B01CC for ; Sun, 6 Jun 2010 18:34:46 -0400 (EDT) Received: from kpbe15.cbf.corp.google.com (kpbe15.cbf.corp.google.com [172.25.105.79]) by smtp-out.google.com with ESMTP id o56MYjmr009476 for ; Sun, 6 Jun 2010 15:34:45 -0700 Received: from pxi8 (pxi8.prod.google.com [10.243.27.8]) by kpbe15.cbf.corp.google.com with ESMTP id o56MYho9011426 for ; Sun, 6 Jun 2010 15:34:44 -0700 Received: by pxi8 with SMTP id 8so2215587pxi.5 for ; Sun, 06 Jun 2010 15:34:43 -0700 (PDT) Date: Sun, 6 Jun 2010 15:34:41 -0700 (PDT) From: David Rientjes Subject: [patch 12/18] oom: extract panic helper function In-Reply-To: Message-ID: References: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-linux-mm@kvack.org To: Andrew Morton Cc: Rik van Riel , Nick Piggin , Oleg Nesterov , Balbir Singh , KAMEZAWA Hiroyuki , KOSAKI Motohiro , linux-mm@kvack.org List-ID: There are various points in the oom killer where the kernel must determine whether to panic or not. It's better to extract this to a helper function to remove all the confusion as to its semantics. Also fix a call to dump_header() where tasklist_lock is not read-locked, as required. There's no functional change with this patch.
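The semantics the helper is meant to pin down fit in a small truth table. The following is a userspace sketch of only the decision made by check_panic_on_oom(), leaving out dump_header() and panic(); would_panic_on_oom() and the MODEL_* constraint names are hypothetical stand-ins for the kernel's enum oom_constraint.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical mirror of enum oom_constraint, including the new memcg case. */
enum model_constraint {
	MODEL_NONE,
	MODEL_CPUSET,
	MODEL_MEMORY_POLICY,
	MODEL_MEMCG,
};

/* Returns true where the helper in this patch would go on to panic. */
static bool would_panic_on_oom(int panic_on_oom, enum model_constraint constraint)
{
	/* this model only covers the values 0, 1 and 2 */
	assert(panic_on_oom >= 0 && panic_on_oom <= 2);

	if (panic_on_oom == 0)
		return false;		/* sysctl disabled */
	if (panic_on_oom != 2 && constraint != MODEL_NONE)
		return false;		/* 1: only system-wide ooms panic */
	return true;			/* 2: compulsory, any constraint */
}
```

Read as a table: 0 never panics, 1 panics only for unconstrained ooms, and 2 panics for cpuset, mempolicy and memcg ooms as well.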
Acked-by: KOSAKI Motohiro Signed-off-by: David Rientjes --- include/linux/oom.h | 1 + mm/oom_kill.c | 53 +++++++++++++++++++++++++++----------------------- 2 files changed, 30 insertions(+), 24 deletions(-) diff --git a/include/linux/oom.h b/include/linux/oom.h --- a/include/linux/oom.h +++ b/include/linux/oom.h @@ -22,6 +22,7 @@ enum oom_constraint { CONSTRAINT_NONE, CONSTRAINT_CPUSET, CONSTRAINT_MEMORY_POLICY, + CONSTRAINT_MEMCG, }; extern int try_set_zone_oom(struct zonelist *zonelist, gfp_t gfp_flags); diff --git a/mm/oom_kill.c b/mm/oom_kill.c --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -505,17 +505,40 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order, return oom_kill_task(victim); } +/* + * Determines whether the kernel must panic because of the panic_on_oom sysctl. + */ +static void check_panic_on_oom(enum oom_constraint constraint, gfp_t gfp_mask, + int order) +{ + if (likely(!sysctl_panic_on_oom)) + return; + if (sysctl_panic_on_oom != 2) { + /* + * panic_on_oom == 1 only affects CONSTRAINT_NONE, the kernel + * does not panic for cpuset, mempolicy, or memcg allocation + * failures. + */ + if (constraint != CONSTRAINT_NONE) + return; + } + read_lock(&tasklist_lock); + dump_header(NULL, gfp_mask, order, NULL); + read_unlock(&tasklist_lock); + panic("Out of memory: %s panic_on_oom is enabled\n", + sysctl_panic_on_oom == 2 ? "compulsory" : "system-wide"); +} + #ifdef CONFIG_CGROUP_MEM_RES_CTLR void mem_cgroup_out_of_memory(struct mem_cgroup *mem, gfp_t gfp_mask) { unsigned long points = 0; struct task_struct *p; - if (sysctl_panic_on_oom == 2) - panic("out of memory(memcg). panic_on_oom is selected.\n"); + check_panic_on_oom(CONSTRAINT_MEMCG, gfp_mask, 0); read_lock(&tasklist_lock); retry: - p = select_bad_process(&points, mem, CONSTRAINT_NONE, NULL); + p = select_bad_process(&points, mem, CONSTRAINT_MEMCG, NULL); if (!p || PTR_ERR(p) == -1UL) goto out; @@ -616,8 +639,8 @@ retry: /* Found nothing?!?! 
Either we hang forever, or we panic. */ if (!p) { - read_unlock(&tasklist_lock); dump_header(NULL, gfp_mask, order, NULL); + read_unlock(&tasklist_lock); panic("Out of memory and no killable processes...\n"); } @@ -639,9 +662,7 @@ void pagefault_out_of_memory(void) /* Got some memory back in the last second. */ return; - if (sysctl_panic_on_oom) - panic("out of memory from page fault. panic_on_oom is selected.\n"); - + check_panic_on_oom(CONSTRAINT_NONE, 0, 0); read_lock(&tasklist_lock); /* unknown gfp_mask and order */ __out_of_memory(0, 0, CONSTRAINT_NONE, NULL); @@ -688,29 +709,13 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask, return; } - if (sysctl_panic_on_oom == 2) { - dump_header(NULL, gfp_mask, order, NULL); - panic("out of memory. Compulsory panic_on_oom is selected.\n"); - } - /* * Check if there were limitations on the allocation (only relevant for * NUMA) that may require different handling. */ constraint = constrained_alloc(zonelist, gfp_mask, nodemask); + check_panic_on_oom(constraint, gfp_mask, order); read_lock(&tasklist_lock); - if (unlikely(sysctl_panic_on_oom)) { - /* - * panic_on_oom only affects CONSTRAINT_NONE, the kernel - * should not panic for cpuset or mempolicy induced memory - * failures. - */ - if (constraint == CONSTRAINT_NONE) { - dump_header(NULL, gfp_mask, order, NULL); - read_unlock(&tasklist_lock); - panic("Out of memory: panic_on_oom is enabled\n"); - } - } __out_of_memory(gfp_mask, order, constraint, nodemask); read_unlock(&tasklist_lock); -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail203.messagelabs.com (mail203.messagelabs.com [216.82.254.243]) by kanga.kvack.org (Postfix) with ESMTP id 361296B01CD for ; Sun, 6 Jun 2010 18:34:50 -0400 (EDT) Received: from kpbe20.cbf.corp.google.com (kpbe20.cbf.corp.google.com [172.25.105.84]) by smtp-out.google.com with ESMTP id o56MYmrj009516 for ; Sun, 6 Jun 2010 15:34:48 -0700 Received: from pwi7 (pwi7.prod.google.com [10.241.219.7]) by kpbe20.cbf.corp.google.com with ESMTP id o56MYlLx001625 for ; Sun, 6 Jun 2010 15:34:47 -0700 Received: by pwi7 with SMTP id 7so11767pwi.35 for ; Sun, 06 Jun 2010 15:34:47 -0700 (PDT) Date: Sun, 6 Jun 2010 15:34:44 -0700 (PDT) From: David Rientjes Subject: [patch 13/18] oom: remove special handling for pagefault ooms In-Reply-To: Message-ID: References: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-linux-mm@kvack.org To: Andrew Morton Cc: Rik van Riel , Nick Piggin , Oleg Nesterov , Balbir Singh , KAMEZAWA Hiroyuki , KOSAKI Motohiro , linux-mm@kvack.org List-ID: It is possible to remove the special pagefault oom handler by simply oom locking all system zones and then calling directly into out_of_memory(). All populated zones must have ZONE_OOM_LOCKED set, otherwise there is a parallel oom killing in progress that will lead to eventual memory freeing so it's not necessary to needlessly kill another task. The context in which the pagefault is allocating memory is unknown to the oom killer, so this is done on a system-wide level. If a task has already been oom killed and hasn't fully exited yet, this will be a no-op since select_bad_process() recognizes tasks across the system with TIF_MEMDIE set. 
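The all-or-nothing locking described above can be sketched in userspace C. This is a single-threaded model under stated assumptions: a plain bool array stands in for the per-zone ZONE_OOM_LOCKED flag, zone_scan_lock is omitted, MODEL_NR_ZONES is arbitrary, and simulated_kills merely counts where out_of_memory() would have been called. All model_* names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_NR_ZONES 4	/* arbitrary number of populated zones */

static bool zone_oom_locked[MODEL_NR_ZONES];	/* models ZONE_OOM_LOCKED */
static int simulated_kills;			/* counts modelled oom kills */

/* Lock every zone, or lock none if any zone is already locked. */
static bool model_try_set_system_oom(void)
{
	size_t i;

	for (i = 0; i < MODEL_NR_ZONES; i++)
		if (zone_oom_locked[i])
			return false;	/* a parallel oom kill is in progress */
	for (i = 0; i < MODEL_NR_ZONES; i++)
		zone_oom_locked[i] = true;
	return true;
}

/* Clear the flag on every zone so later ooms may call the killer again. */
static void model_clear_system_oom(void)
{
	size_t i;

	for (i = 0; i < MODEL_NR_ZONES; i++)
		zone_oom_locked[i] = false;
}

/* Shape of the new pagefault path: kill only if the system-wide lock is won. */
static void model_pagefault_oom(void)
{
	if (model_try_set_system_oom()) {
		assert(zone_oom_locked[0]);	/* zones stay locked during the kill */
		simulated_kills++;		/* stands in for out_of_memory() */
		model_clear_system_oom();
	}
}
```

If any single zone is already locked, model_pagefault_oom() does nothing, which corresponds to the case where a parallel oom kill is expected to free memory.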
Acked-by: Nick Piggin Acked-by: KOSAKI Motohiro Signed-off-by: David Rientjes --- mm/oom_kill.c | 86 +++++++++++++++++++++++++++++++++++++------------------- 1 files changed, 57 insertions(+), 29 deletions(-) diff --git a/mm/oom_kill.c b/mm/oom_kill.c --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -615,6 +615,44 @@ void clear_zonelist_oom(struct zonelist *zonelist, gfp_t gfp_mask) } /* + * Try to acquire the oom killer lock for all system zones. Returns zero if a + * parallel oom killing is taking place, otherwise locks all zones and returns + * non-zero. + */ +static int try_set_system_oom(void) +{ + struct zone *zone; + int ret = 1; + + spin_lock(&zone_scan_lock); + for_each_populated_zone(zone) + if (zone_is_oom_locked(zone)) { + ret = 0; + goto out; + } + for_each_populated_zone(zone) + zone_set_flag(zone, ZONE_OOM_LOCKED); +out: + spin_unlock(&zone_scan_lock); + return ret; +} + +/* + * Clears ZONE_OOM_LOCKED for all system zones so that failed allocation + * attempts or page faults may now recall the oom killer, if necessary. + */ +static void clear_system_oom(void) +{ + struct zone *zone; + + spin_lock(&zone_scan_lock); + for_each_populated_zone(zone) + zone_clear_flag(zone, ZONE_OOM_LOCKED); + spin_unlock(&zone_scan_lock); +} + + +/* * Must be called with tasklist_lock held for read. */ static void __out_of_memory(gfp_t gfp_mask, int order, @@ -649,33 +687,6 @@ retry: goto retry; } -/* - * pagefault handler calls into here because it is out of memory but - * doesn't know exactly how or why. - */ -void pagefault_out_of_memory(void) -{ - unsigned long freed = 0; - - blocking_notifier_call_chain(&oom_notify_list, 0, &freed); - if (freed > 0) - /* Got some memory back in the last second. 
*/ - return; - - check_panic_on_oom(CONSTRAINT_NONE, 0, 0); - read_lock(&tasklist_lock); - /* unknown gfp_mask and order */ - __out_of_memory(0, 0, CONSTRAINT_NONE, NULL); - read_unlock(&tasklist_lock); - - /* - * Give "p" a good chance of killing itself before we - * retry to allocate memory. - */ - if (!test_thread_flag(TIF_MEMDIE)) - schedule_timeout_uninterruptible(1); -} - /** * out_of_memory - kill the "best" process when we run out of memory * @zonelist: zonelist pointer @@ -692,7 +703,7 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask, int order, nodemask_t *nodemask) { unsigned long freed = 0; - enum oom_constraint constraint; + enum oom_constraint constraint = CONSTRAINT_NONE; blocking_notifier_call_chain(&oom_notify_list, 0, &freed); if (freed > 0) @@ -713,7 +724,8 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask, * Check if there were limitations on the allocation (only relevant for * NUMA) that may require different handling. */ - constraint = constrained_alloc(zonelist, gfp_mask, nodemask); + if (zonelist) + constraint = constrained_alloc(zonelist, gfp_mask, nodemask); check_panic_on_oom(constraint, gfp_mask, order); read_lock(&tasklist_lock); __out_of_memory(gfp_mask, order, constraint, nodemask); @@ -726,3 +738,19 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask, if (!test_thread_flag(TIF_MEMDIE)) schedule_timeout_uninterruptible(1); } + +/* + * The pagefault handler calls here because it is out of memory, so kill a + * memory-hogging task. If a populated zone has ZONE_OOM_LOCKED set, a parallel + * oom killing is already in progress so do nothing. If a task is found with + * TIF_MEMDIE set, it has been killed so do nothing and allow it to exit. 
+ */ +void pagefault_out_of_memory(void) +{ + if (try_set_system_oom()) { + out_of_memory(NULL, 0, 0, NULL); + clear_system_oom(); + } + if (!test_thread_flag(TIF_MEMDIE)) + schedule_timeout_uninterruptible(1); +} -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail191.messagelabs.com (mail191.messagelabs.com [216.82.242.19]) by kanga.kvack.org (Postfix) with ESMTP id C3F516B01D0 for ; Sun, 6 Jun 2010 18:34:52 -0400 (EDT) Received: from wpaz33.hot.corp.google.com (wpaz33.hot.corp.google.com [172.24.198.97]) by smtp-out.google.com with ESMTP id o56MYqOF023856 for ; Sun, 6 Jun 2010 15:34:52 -0700 Received: from pwj8 (pwj8.prod.google.com [10.241.219.72]) by wpaz33.hot.corp.google.com with ESMTP id o56MYo6t015668 for ; Sun, 6 Jun 2010 15:34:51 -0700 Received: by pwj8 with SMTP id 8so1402614pwj.12 for ; Sun, 06 Jun 2010 15:34:50 -0700 (PDT) Date: Sun, 6 Jun 2010 15:34:47 -0700 (PDT) From: David Rientjes Subject: [patch 14/18] oom: move sysctl declarations to oom.h In-Reply-To: Message-ID: References: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-linux-mm@kvack.org To: Andrew Morton Cc: Rik van Riel , Nick Piggin , Oleg Nesterov , Balbir Singh , KAMEZAWA Hiroyuki , KOSAKI Motohiro , linux-mm@kvack.org List-ID: The three oom killer sysctl variables (sysctl_oom_dump_tasks, sysctl_oom_kill_allocating_task, and sysctl_panic_on_oom) are better declared in include/linux/oom.h rather than kernel/sysctl.c. 
Acked-by: KOSAKI Motohiro Signed-off-by: David Rientjes --- include/linux/oom.h | 5 +++++ kernel/sysctl.c | 4 +--- 2 files changed, 6 insertions(+), 3 deletions(-) diff --git a/include/linux/oom.h b/include/linux/oom.h --- a/include/linux/oom.h +++ b/include/linux/oom.h @@ -44,5 +44,10 @@ static inline void oom_killer_enable(void) { oom_killer_disabled = false; } + +/* sysctls */ +extern int sysctl_oom_dump_tasks; +extern int sysctl_oom_kill_allocating_task; +extern int sysctl_panic_on_oom; #endif /* __KERNEL__*/ #endif /* _INCLUDE_LINUX_OOM_H */ diff --git a/kernel/sysctl.c b/kernel/sysctl.c --- a/kernel/sysctl.c +++ b/kernel/sysctl.c @@ -55,6 +55,7 @@ #include #include #include +#include #include #include @@ -87,9 +88,6 @@ /* External variables not in a header file. */ extern int sysctl_overcommit_memory; extern int sysctl_overcommit_ratio; -extern int sysctl_panic_on_oom; -extern int sysctl_oom_kill_allocating_task; -extern int sysctl_oom_dump_tasks; extern int max_threads; extern int core_uses_pid; extern int suid_dumpable; -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail144.messagelabs.com (mail144.messagelabs.com [216.82.254.51]) by kanga.kvack.org (Postfix) with ESMTP id 54C256B01D4 for ; Sun, 6 Jun 2010 18:34:56 -0400 (EDT) Received: from kpbe11.cbf.corp.google.com (kpbe11.cbf.corp.google.com [172.25.105.75]) by smtp-out.google.com with ESMTP id o56MYsYY016341 for ; Sun, 6 Jun 2010 15:34:54 -0700 Received: from pvg4 (pvg4.prod.google.com [10.241.210.132]) by kpbe11.cbf.corp.google.com with ESMTP id o56MYrRv012038 for ; Sun, 6 Jun 2010 15:34:53 -0700 Received: by pvg4 with SMTP id 4so634133pvg.9 for ; Sun, 06 Jun 2010 15:34:53 -0700 (PDT) Date: Sun, 6 Jun 2010 15:34:51 -0700 (PDT) From: David Rientjes Subject: [patch 15/18] oom: remove unnecessary code and cleanup In-Reply-To: Message-ID: References: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-linux-mm@kvack.org To: Andrew Morton Cc: Rik van Riel , Nick Piggin , Oleg Nesterov , Balbir Singh , KAMEZAWA Hiroyuki , KOSAKI Motohiro , linux-mm@kvack.org List-ID: Remove the redundancy in __oom_kill_task() since: - init can never be passed to this function: it will never be PF_EXITING or selectable from select_bad_process(), and - it will never be passed a task from oom_kill_task() without an ->mm and we're unconcerned about detachment from exiting tasks, there's no reason to protect them against SIGKILL or access to memory reserves. Also moves the kernel log message to a higher level since the verbosity is not always emitted here; we need not print an error message if an exiting task is given a longer timeslice. __oom_kill_task() only has a single caller, so it can be merged into that function at the same time. 
Reviewed-by: KAMEZAWA Hiroyuki Signed-off-by: David Rientjes --- mm/oom_kill.c | 56 ++++++++++---------------------------------------------- 1 files changed, 10 insertions(+), 46 deletions(-) diff --git a/mm/oom_kill.c b/mm/oom_kill.c --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -401,61 +401,25 @@ static void dump_header(struct task_struct *p, gfp_t gfp_mask, int order, } #define K(x) ((x) << (PAGE_SHIFT-10)) - -/* - * Send SIGKILL to the selected process irrespective of CAP_SYS_RAW_IO - * flag though it's unlikely that we select a process with CAP_SYS_RAW_IO - * set. - */ -static void __oom_kill_task(struct task_struct *p, int verbose) +static int oom_kill_task(struct task_struct *p) { - if (is_global_init(p)) { - WARN_ON(1); - printk(KERN_WARNING "tried to kill init!\n"); - return; - } - p = find_lock_task_mm(p); - if (!p) - return; - - if (verbose) - printk(KERN_ERR "Killed process %d (%s) " - "vsz:%lukB, anon-rss:%lukB, file-rss:%lukB\n", - task_pid_nr(p), p->comm, - K(p->mm->total_vm), - K(get_mm_counter(p->mm, MM_ANONPAGES)), - K(get_mm_counter(p->mm, MM_FILEPAGES))); + if (!p || p->signal->oom_adj == OOM_DISABLE) { + task_unlock(p); + return 1; + } + pr_err("Killed process %d (%s) total-vm:%lukB, anon-rss:%lukB, file-rss:%lukB\n", + task_pid_nr(p), p->comm, K(p->mm->total_vm), + K(get_mm_counter(p->mm, MM_ANONPAGES)), + K(get_mm_counter(p->mm, MM_FILEPAGES))); task_unlock(p); - /* - * We give our sacrificial lamb high priority and access to - * all the memory it needs. That way it should be able to - * exit() and clear out its resources quickly... - */ p->rt.time_slice = HZ; set_tsk_thread_flag(p, TIF_MEMDIE); - force_sig(SIGKILL, p); -} - -static int oom_kill_task(struct task_struct *p) -{ - /* WARNING: mm may not be dereferenced since we did not obtain its - * value from get_task_mm(p). This is OK since all we need to do is - * compare mm to q->mm below. 
- * - * Furthermore, even if mm contains a non-NULL value, p->mm may - * change to NULL at any time since we do not hold task_lock(p). - * However, this is of no concern to us. - */ - if (!p->mm || p->signal->oom_adj == OOM_DISABLE) - return 1; - - __oom_kill_task(p, 1); - return 0; } +#undef K static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order, unsigned long points, struct mem_cgroup *mem, -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail190.messagelabs.com (mail190.messagelabs.com [216.82.249.51]) by kanga.kvack.org (Postfix) with ESMTP id 2F93E6B01D5 for ; Sun, 6 Jun 2010 18:35:04 -0400 (EDT) Received: from kpbe15.cbf.corp.google.com (kpbe15.cbf.corp.google.com [172.25.105.79]) by smtp-out.google.com with ESMTP id o56MYxsY024676 for ; Sun, 6 Jun 2010 15:34:59 -0700 Received: from pzk7 (pzk7.prod.google.com [10.243.19.135]) by kpbe15.cbf.corp.google.com with ESMTP id o56MYVnr011328 for ; Sun, 6 Jun 2010 15:34:57 -0700 Received: by pzk7 with SMTP id 7so1564211pzk.30 for ; Sun, 06 Jun 2010 15:34:57 -0700 (PDT) Date: Sun, 6 Jun 2010 15:34:54 -0700 (PDT) From: David Rientjes Subject: [patch 16/18] oom: badness heuristic rewrite In-Reply-To: Message-ID: References: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-linux-mm@kvack.org To: Andrew Morton Cc: Rik van Riel , Nick Piggin , Oleg Nesterov , Balbir Singh , KAMEZAWA Hiroyuki , KOSAKI Motohiro , linux-mm@kvack.org List-ID: This is a complete rewrite of the oom killer's badness() heuristic which is used to determine which task to kill in oom conditions.
The goal is to make it as simple and predictable as possible so the results are better understood and we end up killing the task which will lead to the most memory freeing while still respecting the fine-tuning from userspace.

Rather than basing the heuristic on mm->total_vm for each task, the task's rss and swap space are used. This is a better indication of the amount of memory that will be freeable if the oom killed task is chosen and subsequently exits. This helps specifically in cases where KDE or GNOME is chosen for oom kill on desktop systems instead of a memory-hogging task.

The baseline for the heuristic is a proportion of memory that each task is currently using in memory plus swap compared to the amount of "allowable" memory. "Allowable," in this sense, means the system-wide resources for unconstrained oom conditions, the set of mempolicy nodes, the mems attached to current's cpuset, or a memory controller's limit. The proportion is given on a scale of 0 (never kill) to 1000 (always kill), roughly meaning that a task with a badness() score of 500 consumes approximately 50% of allowable memory, resident in RAM or in swap space.

The proportion is always relative to the amount of "allowable" memory and not the total amount of RAM systemwide so that mempolicies and cpusets may operate in isolation; they shall not need to know the true size of the machine on which they are running if they are bound to a specific set of nodes or mems, respectively.

Root tasks are given 3% extra memory just like __vm_enough_memory() provides in LSMs. In the event of two tasks consuming similar amounts of memory, it is generally better to save root's task.

Because of the change in the badness() heuristic's baseline, it is also necessary to introduce a new user interface to tune it. It's not possible to redefine the meaning of /proc/pid/oom_adj with a new scale since the ABI cannot be changed for backward compatibility.
Instead, a new tunable, /proc/pid/oom_score_adj, is added that ranges from -1000 to +1000. It may be used to polarize the heuristic such that certain tasks are never considered for oom kill while others may always be considered. The value is added directly into the badness() score so a value of -500, for example, means to discount 50% of its memory consumption in comparison to other tasks either on the system, bound to the mempolicy, in the cpuset, or sharing the same memory controller. /proc/pid/oom_adj is changed so that its meaning is rescaled into the units used by /proc/pid/oom_score_adj, and vice versa. Changing one of these per-task tunables will rescale the value of the other to an equivalent meaning. Although /proc/pid/oom_adj was originally defined as a bitshift on the badness score, it now shares the same linear growth as /proc/pid/oom_score_adj but with different granularity. This is required so the ABI is not broken with userspace applications and allows oom_adj to be deprecated for future removal. Signed-off-by: David Rientjes --- Documentation/filesystems/proc.txt | 94 ++++++++----- fs/proc/base.c | 99 ++++++++++++- include/linux/memcontrol.h | 8 + include/linux/oom.h | 14 ++- include/linux/sched.h | 3 +- kernel/fork.c | 1 + mm/memcontrol.c | 18 +++ mm/oom_kill.c | 279 ++++++++++++++++-------------------- 8 files changed, 316 insertions(+), 200 deletions(-) diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt --- a/Documentation/filesystems/proc.txt +++ b/Documentation/filesystems/proc.txt @@ -33,7 +33,8 @@ Table of Contents 2 Modifying System Parameters 3 Per-Process Parameters - 3.1 /proc//oom_adj - Adjust the oom-killer score + 3.1 /proc//oom_adj & /proc//oom_score_adj - Adjust the oom-killer + score 3.2 /proc//oom_score - Display current oom-killer score 3.3 /proc//io - Display the IO accounting fields 3.4 /proc//coredump_filter - Core dump filtering settings @@ -1234,42 +1235,61 @@ of the kernel. 
CHAPTER 3: PER-PROCESS PARAMETERS ------------------------------------------------------------------------------ -3.1 /proc//oom_adj - Adjust the oom-killer score ------------------------------------------------------- - -This file can be used to adjust the score used to select which processes -should be killed in an out-of-memory situation. Giving it a high score will -increase the likelihood of this process being killed by the oom-killer. Valid -values are in the range -16 to +15, plus the special value -17, which disables -oom-killing altogether for this process. - -The process to be killed in an out-of-memory situation is selected among all others -based on its badness score. This value equals the original memory size of the process -and is then updated according to its CPU time (utime + stime) and the -run time (uptime - start time). The longer it runs the smaller is the score. -Badness score is divided by the square root of the CPU time and then by -the double square root of the run time. - -Swapped out tasks are killed first. Half of each child's memory size is added to -the parent's score if they do not share the same memory. Thus forking servers -are the prime candidates to be killed. Having only one 'hungry' child will make -parent less preferable than the child. - -/proc//oom_score shows process' current badness score. - -The following heuristics are then applied: - * if the task was reniced, its score doubles - * superuser or direct hardware access tasks (CAP_SYS_ADMIN, CAP_SYS_RESOURCE - or CAP_SYS_RAWIO) have their score divided by 4 - * if oom condition happened in one cpuset and checked process does not belong - to it, its score is divided by 8 - * the resulting score is multiplied by two to the power of oom_adj, i.e. 
- points <<= oom_adj when it is positive and - points >>= -(oom_adj) otherwise - -The task with the highest badness score is then selected and its children -are killed, process itself will be killed in an OOM situation when it does -not have children or some of them disabled oom like described above. +3.1 /proc//oom_adj & /proc//oom_score_adj- Adjust the oom-killer score +-------------------------------------------------------------------------------- + +These file can be used to adjust the badness heuristic used to select which +process gets killed in out of memory conditions. + +The badness heuristic assigns a value to each candidate task ranging from 0 +(never kill) to 1000 (always kill) to determine which process is targeted. The +units are roughly a proportion along that range of allowed memory the process +may allocate from based on an estimation of its current memory and swap use. +For example, if a task is using all allowed memory, its badness score will be +1000. If it is using half of its allowed memory, its score will be 500. + +There is an additional factor included in the badness score: root +processes are given 3% extra memory over other tasks. + +The amount of "allowed" memory depends on the context in which the oom killer +was called. If it is due to the memory assigned to the allocating task's cpuset +being exhausted, the allowed memory represents the set of mems assigned to that +cpuset. If it is due to a mempolicy's node(s) being exhausted, the allowed +memory represents the set of mempolicy nodes. If it is due to a memory +limit (or swap limit) being reached, the allowed memory is that configured +limit. Finally, if it is due to the entire system being out of memory, the +allowed memory represents all allocatable resources. + +The value of /proc//oom_score_adj is added to the badness score before it +is used to determine which task to kill. Acceptable values range from -1000 +(OOM_SCORE_ADJ_MIN) to +1000 (OOM_SCORE_ADJ_MAX). 
This allows userspace to +polarize the preference for oom killing either by always preferring a certain +task or completely disabling it. The lowest possible value, -1000, is +equivalent to disabling oom killing entirely for that task since it will always +report a badness score of 0. + +Consequently, it is very simple for userspace to define the amount of memory to +consider for each task. Setting a /proc//oom_score_adj value of +500, for +example, is roughly equivalent to allowing the remainder of tasks sharing the +same system, cpuset, mempolicy, or memory controller resources to use at least +50% more memory. A value of -500, on the other hand, would be roughly +equivalent to discounting 50% of the task's allowed memory from being considered +as scoring against the task. + +For backwards compatibility with previous kernels, /proc//oom_adj may also +be used to tune the badness score. Its acceptable values range from -16 +(OOM_ADJUST_MIN) to +15 (OOM_ADJUST_MAX) and a special value of -17 +(OOM_DISABLE) to disable oom killing entirely for that task. Its value is +scaled linearly with /proc//oom_score_adj. + +Writing to /proc//oom_score_adj or /proc//oom_adj will change the +other with its scaled value. + +Caveat: when a parent task is selected, the oom killer will sacrifice any first +generation children with seperate address spaces instead, if possible. This +avoids servers and important system daemons from being killed and loses the +minimal amount of work. 
+ 3.2 /proc//oom_score - Display current oom-killer score ------------------------------------------------------------- diff --git a/fs/proc/base.c b/fs/proc/base.c --- a/fs/proc/base.c +++ b/fs/proc/base.c @@ -63,6 +63,7 @@ #include #include #include +#include #include #include #include @@ -428,16 +429,18 @@ static const struct file_operations proc_lstats_operations = { #endif /* The badness from the OOM killer */ -unsigned long badness(struct task_struct *p, unsigned long uptime); static int proc_oom_score(struct task_struct *task, char *buffer) { unsigned long points = 0; - struct timespec uptime; - do_posix_clock_monotonic_gettime(&uptime); read_lock(&tasklist_lock); if (pid_alive(task)) - points = badness(task, uptime.tv_sec); + points = oom_badness(task->group_leader, + global_page_state(NR_INACTIVE_ANON) + + global_page_state(NR_ACTIVE_ANON) + + global_page_state(NR_INACTIVE_FILE) + + global_page_state(NR_ACTIVE_FILE) + + total_swap_pages); read_unlock(&tasklist_lock); return sprintf(buffer, "%lu\n", points); } @@ -1042,7 +1045,15 @@ static ssize_t oom_adjust_write(struct file *file, const char __user *buf, } task->signal->oom_adj = oom_adjust; - + /* + * Scale /proc/pid/oom_score_adj appropriately ensuring that a maximum + * value is always attainable. 
+ */ + if (task->signal->oom_adj == OOM_ADJUST_MAX) + task->signal->oom_score_adj = OOM_SCORE_ADJ_MAX; + else + task->signal->oom_score_adj = (oom_adjust * OOM_SCORE_ADJ_MAX) / + -OOM_DISABLE; unlock_task_sighand(task, &flags); put_task_struct(task); @@ -1055,6 +1066,82 @@ static const struct file_operations proc_oom_adjust_operations = { .llseek = generic_file_llseek, }; +static ssize_t oom_score_adj_read(struct file *file, char __user *buf, + size_t count, loff_t *ppos) +{ + struct task_struct *task = get_proc_task(file->f_path.dentry->d_inode); + char buffer[PROC_NUMBUF]; + int oom_score_adj = OOM_SCORE_ADJ_MIN; + unsigned long flags; + size_t len; + + if (!task) + return -ESRCH; + if (lock_task_sighand(task, &flags)) { + oom_score_adj = task->signal->oom_score_adj; + unlock_task_sighand(task, &flags); + } + put_task_struct(task); + len = snprintf(buffer, sizeof(buffer), "%d\n", oom_score_adj); + return simple_read_from_buffer(buf, count, ppos, buffer, len); +} + +static ssize_t oom_score_adj_write(struct file *file, const char __user *buf, + size_t count, loff_t *ppos) +{ + struct task_struct *task; + char buffer[PROC_NUMBUF]; + unsigned long flags; + long oom_score_adj; + int err; + + memset(buffer, 0, sizeof(buffer)); + if (count > sizeof(buffer) - 1) + count = sizeof(buffer) - 1; + if (copy_from_user(buffer, buf, count)) + return -EFAULT; + + err = strict_strtol(strstrip(buffer), 0, &oom_score_adj); + if (err) + return -EINVAL; + if (oom_score_adj < OOM_SCORE_ADJ_MIN || + oom_score_adj > OOM_SCORE_ADJ_MAX) + return -EINVAL; + + task = get_proc_task(file->f_path.dentry->d_inode); + if (!task) + return -ESRCH; + if (!lock_task_sighand(task, &flags)) { + put_task_struct(task); + return -ESRCH; + } + if (oom_score_adj < task->signal->oom_score_adj && + !capable(CAP_SYS_RESOURCE)) { + unlock_task_sighand(task, &flags); + put_task_struct(task); + return -EACCES; + } + + task->signal->oom_score_adj = oom_score_adj; + /* + * Scale /proc/pid/oom_adj appropriately 
ensuring that OOM_DISABLE is + * always attainable. + */ + if (task->signal->oom_score_adj == OOM_SCORE_ADJ_MIN) + task->signal->oom_adj = OOM_DISABLE; + else + task->signal->oom_adj = (oom_score_adj * OOM_ADJUST_MAX) / + OOM_SCORE_ADJ_MAX; + unlock_task_sighand(task, &flags); + put_task_struct(task); + return count; +} + +static const struct file_operations proc_oom_score_adj_operations = { + .read = oom_score_adj_read, + .write = oom_score_adj_write, +}; + #ifdef CONFIG_AUDITSYSCALL #define TMPBUFLEN 21 static ssize_t proc_loginuid_read(struct file * file, char __user * buf, @@ -2627,6 +2714,7 @@ static const struct pid_entry tgid_base_stuff[] = { #endif INF("oom_score", S_IRUGO, proc_oom_score), REG("oom_adj", S_IRUGO|S_IWUSR, proc_oom_adjust_operations), + REG("oom_score_adj", S_IRUGO|S_IWUSR, proc_oom_score_adj_operations), #ifdef CONFIG_AUDITSYSCALL REG("loginuid", S_IWUSR|S_IRUGO, proc_loginuid_operations), REG("sessionid", S_IRUGO, proc_sessionid_operations), @@ -2961,6 +3049,7 @@ static const struct pid_entry tid_base_stuff[] = { #endif INF("oom_score", S_IRUGO, proc_oom_score), REG("oom_adj", S_IRUGO|S_IWUSR, proc_oom_adjust_operations), + REG("oom_score_adj", S_IRUGO|S_IWUSR, proc_oom_score_adj_operations), #ifdef CONFIG_AUDITSYSCALL REG("loginuid", S_IWUSR|S_IRUGO, proc_loginuid_operations), REG("sessionid", S_IRUSR, proc_sessionid_operations), diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -130,6 +130,8 @@ void mem_cgroup_update_file_mapped(struct page *page, int val); unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order, gfp_t gfp_mask, int nid, int zid); +u64 mem_cgroup_get_limit(struct mem_cgroup *mem); + #else /* CONFIG_CGROUP_MEM_RES_CTLR */ struct mem_cgroup; @@ -309,6 +311,12 @@ unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order, return 0; } +static inline +u64 mem_cgroup_get_limit(struct mem_cgroup *mem) +{ + 
return 0; +} + #endif /* CONFIG_CGROUP_MEM_CONT */ #endif /* _LINUX_MEMCONTROL_H */ diff --git a/include/linux/oom.h b/include/linux/oom.h --- a/include/linux/oom.h +++ b/include/linux/oom.h @@ -1,14 +1,24 @@ #ifndef __INCLUDE_LINUX_OOM_H #define __INCLUDE_LINUX_OOM_H -/* /proc//oom_adj set to -17 protects from the oom-killer */ +/* + * /proc//oom_adj set to -17 protects from the oom-killer + */ #define OOM_DISABLE (-17) /* inclusive */ #define OOM_ADJUST_MIN (-16) #define OOM_ADJUST_MAX 15 +/* + * /proc//oom_score_adj set to OOM_SCORE_ADJ_MIN disables oom killing for + * pid. + */ +#define OOM_SCORE_ADJ_MIN (-1000) +#define OOM_SCORE_ADJ_MAX 1000 + #ifdef __KERNEL__ +#include #include #include @@ -25,6 +35,8 @@ enum oom_constraint { CONSTRAINT_MEMCG, }; +extern unsigned int oom_badness(struct task_struct *p, + unsigned long totalpages); extern int try_set_zone_oom(struct zonelist *zonelist, gfp_t gfp_flags); extern void clear_zonelist_oom(struct zonelist *zonelist, gfp_t gfp_flags); diff --git a/include/linux/sched.h b/include/linux/sched.h --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -629,7 +629,8 @@ struct signal_struct { struct tty_audit_buf *tty_audit_buf; #endif - int oom_adj; /* OOM kill score adjustment (bit shift) */ + int oom_adj; /* OOM kill score adjustment (bit shift) */ + int oom_score_adj; /* OOM kill score adjustment */ }; /* Context switch must be unlocked if interrupts are to be enabled */ diff --git a/kernel/fork.c b/kernel/fork.c --- a/kernel/fork.c +++ b/kernel/fork.c @@ -899,6 +899,7 @@ static int copy_signal(unsigned long clone_flags, struct task_struct *tsk) tty_audit_fork(sig); sig->oom_adj = current->signal->oom_adj; + sig->oom_score_adj = current->signal->oom_score_adj; return 0; } diff --git a/mm/memcontrol.c b/mm/memcontrol.c --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -1158,6 +1158,24 @@ static int mem_cgroup_count_children(struct mem_cgroup *mem) } /* + * Return the memory (and swap, if configured) limit for a memcg. 
+ */ +u64 mem_cgroup_get_limit(struct mem_cgroup *memcg) +{ + u64 limit; + u64 memsw; + + limit = res_counter_read_u64(&memcg->res, RES_LIMIT) + + total_swap_pages; + memsw = res_counter_read_u64(&memcg->memsw, RES_LIMIT); + /* + * If memsw is finite and limits the amount of swap space available + * to this memcg, return that limit. + */ + return min(limit, memsw); +} + +/* * Visit the first child (need not be the first child as per the ordering * of the cgroup list, since we track last_scanned_child) of @mem and use * that to reclaim free pages from. diff --git a/mm/oom_kill.c b/mm/oom_kill.c --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -4,6 +4,8 @@ * Copyright (C) 1998,2000 Rik van Riel * Thanks go out to Claus Fischer for some serious inspiration and * for goading me into coding this file... + * Copyright (C) 2010 Google, Inc. + * Rewritten by David Rientjes * * The routines in this file are used to kill a process when * we're seriously out of memory. This gets called from __alloc_pages() @@ -34,7 +36,6 @@ int sysctl_panic_on_oom; int sysctl_oom_kill_allocating_task; int sysctl_oom_dump_tasks = 1; static DEFINE_SPINLOCK(zone_scan_lock); -/* #define DEBUG */ /* * Do all threads of the target process overlap our allowed nodes? @@ -84,139 +85,72 @@ static struct task_struct *find_lock_task_mm(struct task_struct *p) } /** - * badness - calculate a numeric value for how bad this task has been + * oom_badness - heuristic function to determine which candidate task to kill * @p: task struct of which task we should calculate - * @uptime: current uptime in seconds + * @totalpages: total present RAM allowed for page allocation * - * The formula used is relatively simple and documented inline in the - * function. The main rationale is that we want to select a good task - * to kill when we run out of memory. 
- * - * Good in this context means that: - * 1) we lose the minimum amount of work done - * 2) we recover a large amount of memory - * 3) we don't kill anything innocent of eating tons of memory - * 4) we want to kill the minimum amount of processes (one) - * 5) we try to kill the process the user expects us to kill, this - * algorithm has been meticulously tuned to meet the principle - * of least surprise ... (be careful when you change it) + * The heuristic for determining which task to kill is made to be as simple and + * predictable as possible. The goal is to return the highest value for the + * task consuming the most memory to avoid subsequent oom failures. */ - -unsigned long badness(struct task_struct *p, unsigned long uptime) +unsigned int oom_badness(struct task_struct *p, unsigned long totalpages) { - unsigned long points, cpu_time, run_time; - struct task_struct *child; - struct task_struct *c, *t; - int oom_adj = p->signal->oom_adj; - struct task_cputime task_time; - unsigned long utime; - unsigned long stime; - - if (oom_adj == OOM_DISABLE) - return 0; + int points; p = find_lock_task_mm(p); if (!p) return 0; /* - * The memory size of the process is the basis for the badness. - */ - points = p->mm->total_vm; - - /* - * After this unlock we can no longer dereference local variable `mm' - */ - task_unlock(p); - - /* - * swapoff can easily use up all memory, so kill those first. + * Shortcut check for OOM_SCORE_ADJ_MIN so the entire heuristic doesn't + * need to be executed for something that cannot be killed. */ - if (p->flags & PF_OOM_ORIGIN) - return ULONG_MAX; - - /* - * Processes which fork a lot of child processes are likely - * a good choice. We add half the vmsize of the children if they - * have an own mm. This prevents forking servers to flood the - * machine with an endless amount of children. In case a single - * child is eating the vast majority of memory, adding only half - * to the parents will make the child our kill candidate of choice. 
- */ - t = p; - do { - list_for_each_entry(c, &t->children, sibling) { - child = find_lock_task_mm(c); - if (child) { - if (child->mm != p->mm) - points += child->mm->total_vm/2 + 1; - task_unlock(child); - } - } - } while_each_thread(p, t); + if (p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN) { + task_unlock(p); + return 0; + } /* - * CPU time is in tens of seconds and run time is in thousands - * of seconds. There is no particular reason for this other than - * that it turned out to work very well in practice. + * When the PF_OOM_ORIGIN bit is set, it indicates the task should have + * priority for oom killing. */ - thread_group_cputime(p, &task_time); - utime = cputime_to_jiffies(task_time.utime); - stime = cputime_to_jiffies(task_time.stime); - cpu_time = (utime + stime) >> (SHIFT_HZ + 3); - - - if (uptime >= p->start_time.tv_sec) - run_time = (uptime - p->start_time.tv_sec) >> 10; - else - run_time = 0; - - if (cpu_time) - points /= int_sqrt(cpu_time); - if (run_time) - points /= int_sqrt(int_sqrt(run_time)); + if (p->flags & PF_OOM_ORIGIN) { + task_unlock(p); + return 1000; + } /* - * Niced processes are most likely less important, so double - * their badness points. + * The memory controller may have a limit of 0 bytes, so avoid a divide + * by zero if necessary. */ - if (task_nice(p) > 0) - points *= 2; + if (!totalpages) + totalpages = 1; /* - * Superuser processes are usually more important, so we make it - * less likely that we kill those. + * The baseline for the badness score is the proportion of RAM that each + * task's rss and swap space use. */ - if (has_capability_noaudit(p, CAP_SYS_ADMIN) || - has_capability_noaudit(p, CAP_SYS_RESOURCE)) - points /= 4; + points = (get_mm_rss(p->mm) + get_mm_counter(p->mm, MM_SWAPENTS)) * 1000 / + totalpages; + task_unlock(p); /* - * We don't want to kill a process with direct hardware access. 
- * Not only could that mess up the hardware, but usually users - * tend to only have this flag set on applications they think - * of as important. + * Root processes get 3% bonus, just like the __vm_enough_memory() + * implementation used by LSMs. */ - if (has_capability_noaudit(p, CAP_SYS_RAWIO)) - points /= 4; + if (has_capability_noaudit(p, CAP_SYS_ADMIN)) + points -= 30; /* - * Adjust the score by oom_adj. + * /proc/pid/oom_score_adj ranges from -1000 to +1000 such that it may + * either completely disable oom killing or always prefer a certain + * task. */ - if (oom_adj) { - if (oom_adj > 0) { - if (!points) - points = 1; - points <<= oom_adj; - } else - points >>= -(oom_adj); - } + points += p->signal->oom_score_adj; -#ifdef DEBUG - printk(KERN_DEBUG "OOMkill: task %d (%s) got %lu points\n", - p->pid, p->comm, points); -#endif - return points; + if (points < 0) + return 0; + return (points < 1000) ? points : 1000; } /* @@ -224,12 +158,24 @@ unsigned long badness(struct task_struct *p, unsigned long uptime) */ #ifdef CONFIG_NUMA static enum oom_constraint constrained_alloc(struct zonelist *zonelist, - gfp_t gfp_mask, nodemask_t *nodemask) + gfp_t gfp_mask, nodemask_t *nodemask, + unsigned long *totalpages) { struct zone *zone; struct zoneref *z; enum zone_type high_zoneidx = gfp_zone(gfp_mask); + bool cpuset_limited = false; + int nid; + /* Default to all anonymous memory, page cache, and swap */ + *totalpages = global_page_state(NR_INACTIVE_ANON) + + global_page_state(NR_ACTIVE_ANON) + + global_page_state(NR_INACTIVE_FILE) + + global_page_state(NR_ACTIVE_FILE) + + total_swap_pages; + + if (!zonelist) + return CONSTRAINT_NONE; /* * Reach here only when __GFP_NOFAIL is used. So, we should avoid * to kill current.We have to random task kill in this case. @@ -239,26 +185,47 @@ static enum oom_constraint constrained_alloc(struct zonelist *zonelist, return CONSTRAINT_NONE; /* - * The nodemask here is a nodemask passed to alloc_pages(). 
Now, - * cpuset doesn't use this nodemask for its hardwall/softwall/hierarchy - * feature. mempolicy is an only user of nodemask here. - * check mempolicy's nodemask contains all N_HIGH_MEMORY + * This is not a __GFP_THISNODE allocation, so a truncated nodemask in + * the page allocator means a mempolicy is in effect. Cpuset policy + * is enforced in get_page_from_freelist(). */ - if (nodemask && !nodes_subset(node_states[N_HIGH_MEMORY], *nodemask)) + if (nodemask && !nodes_subset(node_states[N_HIGH_MEMORY], *nodemask)) { + *totalpages = total_swap_pages; + for_each_node_mask(nid, *nodemask) + *totalpages += node_page_state(nid, NR_INACTIVE_ANON) + + node_page_state(nid, NR_ACTIVE_ANON) + + node_page_state(nid, NR_INACTIVE_FILE) + + node_page_state(nid, NR_ACTIVE_FILE); return CONSTRAINT_MEMORY_POLICY; + } /* Check this allocation failure is caused by cpuset's wall function */ for_each_zone_zonelist_nodemask(zone, z, zonelist, high_zoneidx, nodemask) if (!cpuset_zone_allowed_softwall(zone, gfp_mask)) - return CONSTRAINT_CPUSET; - + cpuset_limited = true; + + if (cpuset_limited) { + *totalpages = total_swap_pages; + for_each_node_mask(nid, cpuset_current_mems_allowed) + *totalpages += node_page_state(nid, NR_INACTIVE_ANON) + + node_page_state(nid, NR_ACTIVE_ANON) + + node_page_state(nid, NR_INACTIVE_FILE) + + node_page_state(nid, NR_ACTIVE_FILE); + return CONSTRAINT_CPUSET; + } return CONSTRAINT_NONE; } #else static enum oom_constraint constrained_alloc(struct zonelist *zonelist, - gfp_t gfp_mask, nodemask_t *nodemask) + gfp_t gfp_mask, nodemask_t *nodemask, + unsigned long *totalpages) { + *totalpages = global_page_state(NR_INACTIVE_ANON) + + global_page_state(NR_ACTIVE_ANON) + + global_page_state(NR_INACTIVE_FILE) + + global_page_state(NR_ACTIVE_FILE) + + total_swap_pages; return CONSTRAINT_NONE; } #endif @@ -269,18 +236,16 @@ static enum oom_constraint constrained_alloc(struct zonelist *zonelist, * * (not docbooked, we don't want this one cluttering up the 
manual) */ -static struct task_struct *select_bad_process(unsigned long *ppoints, - struct mem_cgroup *mem, enum oom_constraint constraint, - const nodemask_t *mask) +static struct task_struct *select_bad_process(unsigned int *ppoints, + unsigned long totalpages, struct mem_cgroup *mem, + enum oom_constraint constraint, const nodemask_t *mask) { struct task_struct *p; struct task_struct *chosen = NULL; - struct timespec uptime; *ppoints = 0; - do_posix_clock_monotonic_gettime(&uptime); for_each_process(p) { - unsigned long points; + unsigned int points; /* skip the init task and kthreads */ if (is_global_init(p) || (p->flags & PF_KTHREAD)) @@ -319,14 +284,11 @@ static struct task_struct *select_bad_process(unsigned long *ppoints, return ERR_PTR(-1UL); chosen = p; - *ppoints = ULONG_MAX; + *ppoints = 1000; } - if (p->signal->oom_adj == OOM_DISABLE) - continue; - - points = badness(p, uptime.tv_sec); - if (points > *ppoints || !chosen) { + points = oom_badness(p, totalpages); + if (points > *ppoints) { chosen = p; *ppoints = points; } @@ -341,7 +303,7 @@ static struct task_struct *select_bad_process(unsigned long *ppoints, * * Dumps the current memory state of all system tasks, excluding kernel threads. * State information includes task's pid, uid, tgid, vm size, rss, cpu, oom_adj - * score, and name. + * value, oom_score_adj value, and name. * * If the actual is non-NULL, only tasks that are a member of the mem_cgroup are * shown. 
@@ -354,7 +316,7 @@ static void dump_tasks(const struct mem_cgroup *mem) struct task_struct *task; printk(KERN_INFO "[ pid ] uid tgid total_vm rss cpu oom_adj " - "name\n"); + "oom_score_adj name\n"); for_each_process(p) { /* * We don't have is_global_init() check here, because the old @@ -376,10 +338,11 @@ static void dump_tasks(const struct mem_cgroup *mem) continue; } - printk(KERN_INFO "[%5d] %5d %5d %8lu %8lu %3d %3d %s\n", + pr_info("[%5d] %5d %5d %8lu %8lu %3d %3d %4d %s\n", task->pid, __task_cred(task)->uid, task->tgid, task->mm->total_vm, get_mm_rss(task->mm), - (int)task_cpu(task), task->signal->oom_adj, p->comm); + (int)task_cpu(task), task->signal->oom_adj, + task->signal->oom_score_adj, p->comm); task_unlock(task); } } @@ -388,8 +351,9 @@ static void dump_header(struct task_struct *p, gfp_t gfp_mask, int order, struct mem_cgroup *mem) { pr_warning("%s invoked oom-killer: gfp_mask=0x%x, order=%d, " - "oom_adj=%d\n", - current->comm, gfp_mask, order, current->signal->oom_adj); + "oom_adj=%d, oom_score_adj=%d\n", + current->comm, gfp_mask, order, current->signal->oom_adj, + current->signal->oom_score_adj); task_lock(current); cpuset_print_task_mems_allowed(current); task_unlock(current); @@ -404,7 +368,7 @@ static void dump_header(struct task_struct *p, gfp_t gfp_mask, int order, static int oom_kill_task(struct task_struct *p) { p = find_lock_task_mm(p); - if (!p || p->signal->oom_adj == OOM_DISABLE) { + if (!p || p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN) { task_unlock(p); return 1; } @@ -422,14 +386,13 @@ static int oom_kill_task(struct task_struct *p) #undef K static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order, - unsigned long points, struct mem_cgroup *mem, - const char *message) + unsigned int points, unsigned long totalpages, + struct mem_cgroup *mem, const char *message) { struct task_struct *victim = p; struct task_struct *c; struct task_struct *t = p; - unsigned long victim_points = 0; - struct timespec uptime; + 
unsigned int victim_points = 0; if (printk_ratelimit()) dump_header(p, gfp_mask, order, mem); @@ -443,13 +406,12 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order, return 0; } - pr_err("%s: Kill process %d (%s) score %lu or sacrifice child\n", + pr_err("%s: Kill process %d (%s) score %d or sacrifice child\n", message, task_pid_nr(p), p->comm, points); /* Try to sacrifice the worst child first */ - do_posix_clock_monotonic_gettime(&uptime); do { - unsigned long cpoints; + unsigned int cpoints; list_for_each_entry(c, &t->children, sibling) { if (c->mm == p->mm) @@ -457,8 +419,10 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order, if (mem && !task_in_mem_cgroup(c, mem)) continue; - /* badness() returns 0 if the thread is unkillable */ - cpoints = badness(c, uptime.tv_sec); + /* + * oom_badness() returns 0 if the thread is unkillable + */ + cpoints = oom_badness(c, totalpages); if (cpoints > victim_points) { victim = c; victim_points = cpoints; @@ -496,17 +460,19 @@ static void check_panic_on_oom(enum oom_constraint constraint, gfp_t gfp_mask, #ifdef CONFIG_CGROUP_MEM_RES_CTLR void mem_cgroup_out_of_memory(struct mem_cgroup *mem, gfp_t gfp_mask) { - unsigned long points = 0; + unsigned long limit; + unsigned int points = 0; struct task_struct *p; check_panic_on_oom(CONSTRAINT_MEMCG, gfp_mask, 0); + limit = mem_cgroup_get_limit(mem) >> PAGE_SHIFT; read_lock(&tasklist_lock); retry: - p = select_bad_process(&points, mem, CONSTRAINT_MEMCG, NULL); + p = select_bad_process(&points, limit, mem, CONSTRAINT_MEMCG, NULL); if (!p || PTR_ERR(p) == -1UL) goto out; - if (oom_kill_process(p, gfp_mask, 0, points, mem, + if (oom_kill_process(p, gfp_mask, 0, points, limit, mem, "Memory cgroup out of memory")) goto retry; out: @@ -619,22 +585,22 @@ static void clear_system_oom(void) /* * Must be called with tasklist_lock held for read. 
*/ -static void __out_of_memory(gfp_t gfp_mask, int order, +static void __out_of_memory(gfp_t gfp_mask, int order, unsigned long totalpages, enum oom_constraint constraint, const nodemask_t *mask) { struct task_struct *p; - unsigned long points; + unsigned int points; if (sysctl_oom_kill_allocating_task) - if (!oom_kill_process(current, gfp_mask, order, 0, NULL, - "Out of memory (oom_kill_allocating_task)")) + if (!oom_kill_process(current, gfp_mask, order, 0, totalpages, + NULL, "Out of memory (oom_kill_allocating_task)")) return; retry: /* * Rambo mode: Shoot down a process and hope it solves whatever * issues we may have. */ - p = select_bad_process(&points, NULL, constraint, mask); + p = select_bad_process(&points, totalpages, NULL, constraint, mask); if (PTR_ERR(p) == -1UL) return; @@ -646,7 +612,7 @@ retry: panic("Out of memory and no killable processes...\n"); } - if (oom_kill_process(p, gfp_mask, order, points, NULL, + if (oom_kill_process(p, gfp_mask, order, points, totalpages, NULL, "Out of memory")) goto retry; } @@ -666,6 +632,7 @@ retry: void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask, int order, nodemask_t *nodemask) { + unsigned long totalpages; unsigned long freed = 0; enum oom_constraint constraint = CONSTRAINT_NONE; @@ -688,11 +655,11 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask, * Check if there were limitations on the allocation (only relevant for * NUMA) that may require different handling. */ - if (zonelist) - constraint = constrained_alloc(zonelist, gfp_mask, nodemask); + constraint = constrained_alloc(zonelist, gfp_mask, nodemask, + &totalpages); check_panic_on_oom(constraint, gfp_mask, order); read_lock(&tasklist_lock); - __out_of_memory(gfp_mask, order, constraint, nodemask); + __out_of_memory(gfp_mask, order, totalpages, constraint, nodemask); read_unlock(&tasklist_lock); /* -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. 
From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail202.messagelabs.com (mail202.messagelabs.com [216.82.254.227]) by kanga.kvack.org (Postfix) with ESMTP id 131246B01D7 for ; Sun, 6 Jun 2010 18:35:08 -0400 (EDT) Received: from wpaz17.hot.corp.google.com (wpaz17.hot.corp.google.com [172.24.198.81]) by smtp-out.google.com with ESMTP id o56MZ5N3020603 for ; Sun, 6 Jun 2010 15:35:05 -0700 Received: from pzk4 (pzk4.prod.google.com [10.243.19.132]) by wpaz17.hot.corp.google.com with ESMTP id o56MYvwD032081 for ; Sun, 6 Jun 2010 15:35:04 -0700 Received: by pzk4 with SMTP id 4so1704255pzk.7 for ; Sun, 06 Jun 2010 15:35:03 -0700 (PDT) Date: Sun, 6 Jun 2010 15:35:01 -0700 (PDT) From: David Rientjes Subject: [patch 18/18] oom: deprecate oom_adj tunable In-Reply-To: Message-ID: References: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-linux-mm@kvack.org To: Andrew Morton Cc: Rik van Riel , Nick Piggin , Oleg Nesterov , Balbir Singh , KAMEZAWA Hiroyuki , KOSAKI Motohiro , linux-mm@kvack.org List-ID:

/proc/pid/oom_adj is now deprecated so that it may eventually be removed. The target date for removal is June 2012. A warning will be printed to the kernel log if a task attempts to use this interface. Future warnings will be suppressed until the kernel is rebooted to prevent spamming the kernel log.
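For userspace that still writes /proc/pid/oom_adj, the equivalent /proc/pid/oom_score_adj value can be derived with the same scaling that the oom_adj write handler applies in "oom: badness heuristic rewrite". A minimal userspace sketch follows, assuming the constants from include/linux/oom.h; the helper name oom_adj_to_score_adj() is hypothetical and not part of this patch.

```c
#include <stdio.h>

/* Values as defined in include/linux/oom.h by this series. */
#define OOM_DISABLE		(-17)
#define OOM_ADJUST_MAX		15
#define OOM_SCORE_ADJ_MAX	1000

/*
 * Hypothetical helper: map a legacy oom_adj value (-17..15) to the
 * oom_score_adj value the kernel would store for it. The maximum is
 * special-cased so that +1000 stays attainable; everything else is
 * scaled linearly, with C integer division truncating toward zero.
 */
static int oom_adj_to_score_adj(int oom_adj)
{
	if (oom_adj == OOM_ADJUST_MAX)
		return OOM_SCORE_ADJ_MAX;
	return (oom_adj * OOM_SCORE_ADJ_MAX) / -OOM_DISABLE;
}
```

Under this scaling, OOM_DISABLE (-17) maps to -1000, so a task that was exempt from oom killing stays exempt when the new value is written to /proc/pid/oom_score_adj.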
Signed-off-by: David Rientjes --- Documentation/feature-removal-schedule.txt | 25 +++++++++++++++++++++++++ Documentation/filesystems/proc.txt | 3 +++ fs/proc/base.c | 8 ++++++++ include/linux/oom.h | 3 +++ 4 files changed, 39 insertions(+), 0 deletions(-) diff --git a/Documentation/feature-removal-schedule.txt b/Documentation/feature-removal-schedule.txt --- a/Documentation/feature-removal-schedule.txt +++ b/Documentation/feature-removal-schedule.txt @@ -174,6 +174,31 @@ Who: Eric Biederman --------------------------- +What: /proc//oom_adj +When: June 2012 +Why: /proc//oom_adj allows userspace to influence the oom killer's + badness heuristic used to determine which task to kill when the kernel + is out of memory. + + The badness heuristic has since been rewritten since the introduction of + this tunable such that its meaning is deprecated. The value was + implemented as a bitshift on a score generated by the badness() + function that did not have any precise units of measure. With the + rewrite, the score is given as a proportion of available memory to the + task allocating pages, so using a bitshift which grows the score + exponentially is, thus, impossible to tune with fine granularity. + + A much more powerful interface, /proc//oom_score_adj, was + introduced with the oom killer rewrite that allows users to increase or + decrease the badness() score linearly. This interface will replace + /proc//oom_adj. + + A warning will be emitted to the kernel log if an application uses this + deprecated interface. After it is printed once, future warnings will be + suppressed until the kernel is rebooted. + +--------------------------- + What: remove EXPORT_SYMBOL(kernel_thread) When: August 2006 Files: arch/*/kernel/*_ksyms.c diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt --- a/Documentation/filesystems/proc.txt +++ b/Documentation/filesystems/proc.txt @@ -1288,6 +1288,9 @@ scaled linearly with /proc//oom_score_adj. 
Writing to /proc//oom_score_adj or /proc//oom_adj will change the other with its scaled value. +NOTICE: /proc//oom_adj is deprecated and will be removed, please see +Documentation/feature-removal-schedule.txt. + Caveat: when a parent task is selected, the oom killer will sacrifice any first generation children with seperate address spaces instead, if possible. This avoids servers and important system daemons from being killed and loses the diff --git a/fs/proc/base.c b/fs/proc/base.c --- a/fs/proc/base.c +++ b/fs/proc/base.c @@ -1044,6 +1044,14 @@ static ssize_t oom_adjust_write(struct file *file, const char __user *buf, return -EACCES; } + /* + * Warn that /proc/pid/oom_adj is deprecated, see + * Documentation/feature-removal-schedule.txt. + */ + printk_once(KERN_WARNING "%s (%d): /proc/%d/oom_adj is deprecated, " + "please use /proc/%d/oom_score_adj instead.\n", + current->comm, task_pid_nr(current), + task_pid_nr(task), task_pid_nr(task)); task->signal->oom_adj = oom_adjust; /* * Scale /proc/pid/oom_score_adj appropriately ensuring that a maximum diff --git a/include/linux/oom.h b/include/linux/oom.h --- a/include/linux/oom.h +++ b/include/linux/oom.h @@ -2,6 +2,9 @@ #define __INCLUDE_LINUX_OOM_H /* + * /proc//oom_adj is deprecated, see + * Documentation/feature-removal-schedule.txt. + * * /proc//oom_adj set to -17 protects from the oom-killer */ #define OOM_DISABLE (-17) -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail203.messagelabs.com (mail203.messagelabs.com [216.82.254.243]) by kanga.kvack.org (Postfix) with ESMTP id 373026B01DA for ; Sun, 6 Jun 2010 18:35:10 -0400 (EDT) Received: from wpaz1.hot.corp.google.com (wpaz1.hot.corp.google.com [172.24.198.65]) by smtp-out.google.com with ESMTP id o56MZ7O8009804 for ; Sun, 6 Jun 2010 15:35:08 -0700 Received: from pvh11 (pvh11.prod.google.com [10.241.210.203]) by wpaz1.hot.corp.google.com with ESMTP id o56MZ1xk021673 for ; Sun, 6 Jun 2010 15:35:01 -0700 Received: by pvh11 with SMTP id 11so1700285pvh.41 for ; Sun, 06 Jun 2010 15:35:01 -0700 (PDT) Date: Sun, 6 Jun 2010 15:34:58 -0700 (PDT) From: David Rientjes Subject: [patch 17/18] oom: add forkbomb penalty to badness heuristic In-Reply-To: Message-ID: References: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-linux-mm@kvack.org To: Andrew Morton Cc: Rik van Riel , Nick Piggin , Oleg Nesterov , Balbir Singh , KAMEZAWA Hiroyuki , KOSAKI Motohiro , linux-mm@kvack.org List-ID: Add a forkbomb penalty for processes that fork an excessively large number of children to penalize that group of tasks and not others. A threshold is configurable from userspace to determine how many first-generation execve children (those with their own address spaces) a task may have before it is considered a forkbomb. This can be tuned by altering the value in /proc/sys/vm/oom_forkbomb_thres, which defaults to 1000. When a task has more than 1000 first-generation children with different address spaces than itself, a penalty of

	(average rss of children) * (# of 1st generation execve children)
	-----------------------------------------------------------------
			oom_forkbomb_thres

is assessed. So, for example, using the default oom_forkbomb_thres of 1000, the penalty is twice the average rss of all its execve children if there are 2000 such tasks.
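To make the arithmetic concrete, here is a minimal user-space sketch of the penalty described above. It is an illustration only, not the code in this patch: forkbomb_penalty() and its parameters are hypothetical names, and the caller is assumed to have already summed the rss of the qualifying children. Only the formula, the "0 disables detection" rule, and the default threshold of 1000 come from the patch.

```c
/*
 * Hypothetical helper illustrating the forkbomb penalty formula.
 *
 * total_child_rss: summed rss (in pages) of the qualifying children
 * forkcount:       number of qualifying first-generation children
 * thres:           threshold (the patch defaults to 1000; 0 disables)
 *
 * (average rss) * forkcount / thres reduces to total_child_rss / thres,
 * because average rss is total_child_rss / forkcount.
 */
static unsigned long forkbomb_penalty(unsigned long total_child_rss,
				      int forkcount, int thres)
{
	if (thres <= 0)
		return 0;	/* detection disabled */
	if (forkcount <= thres)
		return 0;	/* not enough children to count as a forkbomb */
	return total_child_rss / (unsigned long)thres;
}
```

With 2000 children averaging 50 pages each and a threshold of 1000, the summed rss is 100000 pages and the penalty is 100 pages, which is twice the average, matching the example in the changelog. At exactly 1000 children no penalty is assessed, since the threshold must be exceeded.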
A task is considered to count toward the threshold if its total runtime is less than one second; for 1000 such tasks to exist, the parent process must be forking at an extremely high rate either erroneously or maliciously. Even though a particular task may be designated a forkbomb and selected as the victim, the oom killer will still kill the 1st generation execve child with the highest badness() score in its place. This avoids killing important servers or system daemons. When a web server forks a very large number of threads for client connections, for example, it is much better to kill one of those threads than to kill the server and make it unresponsive. Signed-off-by: David Rientjes --- Documentation/filesystems/proc.txt | 7 +++- Documentation/sysctl/vm.txt | 21 +++++++++++ include/linux/oom.h | 4 ++ kernel/sysctl.c | 8 ++++ mm/oom_kill.c | 66 ++++++++++++++++++++++++++++++++++++ 5 files changed, 104 insertions(+), 2 deletions(-) diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt --- a/Documentation/filesystems/proc.txt +++ b/Documentation/filesystems/proc.txt @@ -1248,8 +1248,11 @@ may allocate from based on an estimation of its current memory and swap use. For example, if a task is using all allowed memory, its badness score will be 1000. If it is using half of its allowed memory, its score will be 500. -There is an additional factor included in the badness score: root -processes are given 3% extra memory over other tasks. +There are a couple of additional factor included in the badness score: root +processes are given 3% extra memory over other tasks, and tasks which forkbomb +an excessive number of child processes are penalized by their average size. +The number of child processes considered to be a forkbomb is configurable +via /proc/sys/vm/oom_forkbomb_thres (see Documentation/sysctl/vm.txt). The amount of "allowed" memory depends on the context in which the oom killer was called.
If it is due to the memory assigned to the allocating task's cpuset diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt --- a/Documentation/sysctl/vm.txt +++ b/Documentation/sysctl/vm.txt @@ -46,6 +46,7 @@ Currently, these files are in /proc/sys/vm: - nr_trim_pages (only if CONFIG_MMU=n) - numa_zonelist_order - oom_dump_tasks +- oom_forkbomb_thres - oom_kill_allocating_task - overcommit_memory - overcommit_ratio @@ -515,6 +516,26 @@ The default value is 1 (enabled). ============================================================== +oom_forkbomb_thres + +This value defines how many children with a seperate address space a specific +task may have before being considered as a possible forkbomb. Tasks with more +children not sharing the same address space as the parent will be penalized by a +quantity of memory equaling + + (average rss of execve children) * (# of 1st generation execve children) + ------------------------------------------------------------------------ + oom_forkbomb_thres + +in the oom killer's badness heuristic. Such tasks may be protected with a lower +oom_adj value (see Documentation/filesystems/proc.txt) if necessary. + +A value of 0 will disable forkbomb detection. + +The default value is 1000. 
+ +============================================================== + oom_kill_allocating_task This enables or disables killing the OOM-triggering task in diff --git a/include/linux/oom.h b/include/linux/oom.h --- a/include/linux/oom.h +++ b/include/linux/oom.h @@ -16,6 +16,9 @@ #define OOM_SCORE_ADJ_MIN (-1000) #define OOM_SCORE_ADJ_MAX 1000 +/* See Documentation/sysctl/vm.txt */ +#define DEFAULT_OOM_FORKBOMB_THRES 1000 + #ifdef __KERNEL__ #include @@ -59,6 +62,7 @@ static inline void oom_killer_enable(void) /* sysctls */ extern int sysctl_oom_dump_tasks; +extern int sysctl_oom_forkbomb_thres; extern int sysctl_oom_kill_allocating_task; extern int sysctl_panic_on_oom; #endif /* __KERNEL__*/ diff --git a/kernel/sysctl.c b/kernel/sysctl.c --- a/kernel/sysctl.c +++ b/kernel/sysctl.c @@ -1001,6 +1001,14 @@ static struct ctl_table vm_table[] = { .proc_handler = proc_dointvec, }, { + .procname = "oom_forkbomb_thres", + .data = &sysctl_oom_forkbomb_thres, + .maxlen = sizeof(sysctl_oom_forkbomb_thres), + .mode = 0644, + .proc_handler = proc_dointvec_minmax, + .extra1 = &zero, + }, + { .procname = "overcommit_ratio", .data = &sysctl_overcommit_ratio, .maxlen = sizeof(sysctl_overcommit_ratio), diff --git a/mm/oom_kill.c b/mm/oom_kill.c --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -35,6 +35,7 @@ int sysctl_panic_on_oom; int sysctl_oom_kill_allocating_task; int sysctl_oom_dump_tasks = 1; +int sysctl_oom_forkbomb_thres = DEFAULT_OOM_FORKBOMB_THRES; static DEFINE_SPINLOCK(zone_scan_lock); /* @@ -84,6 +85,70 @@ static struct task_struct *find_lock_task_mm(struct task_struct *p) return NULL; } +/* + * Tasks that fork a very large number of children with seperate address spaces + * may be the result of a bug, user error, malicious applications, or even those + * with a very legitimate purpose such as a webserver. 
The oom killer assesses + * a penalty equaling + * + * (average rss of children) * (# of 1st generation execve children) + * ----------------------------------------------------------------- + * sysctl_oom_forkbomb_thres + * + * for such tasks to target the parent. oom_kill_process() will attempt to + * first kill a child, so there's no risk of killing an important system daemon + * via this method. A web server, for example, may fork a very large number of + * threads to respond to client connections; it's much better to kill a child + * than to kill the parent, making the server unresponsive. The goal here is + * to give the user a chance to recover from the error rather than deplete all + * memory such that the system is unusable, it's not meant to effect a forkbomb + * policy. + */ +static unsigned long oom_forkbomb_penalty(struct task_struct *tsk) +{ + struct task_struct *child; + struct task_struct *c, *t; + unsigned long child_rss = 0; + int forkcount = 0; + + if (!sysctl_oom_forkbomb_thres) + return 0; + + t = tsk; + do { + struct task_cputime task_time; + unsigned long runtime; + unsigned long rss; + + list_for_each_entry(c, &t->children, sibling) { + child = find_lock_task_mm(c); + if (!child) + continue; + if (child->mm == tsk->mm) { + task_unlock(child); + continue; + } + rss = get_mm_rss(child->mm); + task_unlock(child); + + thread_group_cputime(child, &task_time); + runtime = cputime_to_jiffies(task_time.utime) + + cputime_to_jiffies(task_time.stime); + /* + * Only threads that have run for less than a second are + * considered toward the forkbomb penalty, these threads + * rarely get to execute at all in such cases anyway. + */ + if (runtime < HZ) { + child_rss += rss; + forkcount++; + } + } + } while_each_thread(tsk, t); + return forkcount > sysctl_oom_forkbomb_thres ? 
+ (child_rss / sysctl_oom_forkbomb_thres) : 0; +} + /** * oom_badness - heuristic function to determine which candidate task to kill * @p: task struct of which task we should calculate @@ -133,6 +198,7 @@ unsigned int oom_badness(struct task_struct *p, unsigned long totalpages) points = (get_mm_rss(p->mm) + get_mm_counter(p->mm, MM_SWAPENTS)) * 1000 / totalpages; task_unlock(p); + points += oom_forkbomb_penalty(p); /* * Root processes get 3% bonus, just like the __vm_enough_memory() -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail191.messagelabs.com (mail191.messagelabs.com [216.82.242.19]) by kanga.kvack.org (Postfix) with ESMTP id 4B5906B0071 for ; Mon, 7 Jun 2010 08:12:29 -0400 (EDT) Received: from d03relay02.boulder.ibm.com (d03relay02.boulder.ibm.com [9.17.195.227]) by e37.co.us.ibm.com (8.14.4/8.13.1) with ESMTP id o57CAfEP009308 for ; Mon, 7 Jun 2010 06:10:41 -0600 Received: from d03av02.boulder.ibm.com (d03av02.boulder.ibm.com [9.17.195.168]) by d03relay02.boulder.ibm.com (8.13.8/8.13.8/NCO v9.1) with ESMTP id o57CCHJV169212 for ; Mon, 7 Jun 2010 06:12:21 -0600 Received: from d03av02.boulder.ibm.com (loopback [127.0.0.1]) by d03av02.boulder.ibm.com (8.14.4/8.13.1/NCO v10.0 AVout) with ESMTP id o57CCGbP009022 for ; Mon, 7 Jun 2010 06:12:17 -0600 Date: Mon, 7 Jun 2010 17:42:04 +0530 From: Balbir Singh Subject: Re: [patch 01/18] oom: check PF_KTHREAD instead of !mm to skip kthreads Message-ID: <20100607121204.GV4603@balbir.in.ibm.com> Reply-To: balbir@linux.vnet.ibm.com References: MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-1 Content-Disposition: inline In-Reply-To: Sender: owner-linux-mm@kvack.org To: David Rientjes Cc: Andrew Morton , Rik van Riel , Nick Piggin , Oleg Nesterov , KAMEZAWA Hiroyuki , KOSAKI Motohiro , linux-mm@kvack.org 
List-ID: * David Rientjes [2010-06-06 15:34:00]: > From: Oleg Nesterov > > select_bad_process() thinks a kernel thread can't have ->mm != NULL, this > is not true due to use_mm(). > > Change the code to check PF_KTHREAD. > Quick check are all kernel threads marked with PF_KTHREAD? daemonize() marks threads as kernel threads and I suppose children of init_task inherit the flag on fork. I suppose both should cover all kernel threads, but just checking to see if we missed anything. > Reviewed-by: KAMEZAWA Hiroyuki > Signed-off-by: Oleg Nesterov > Signed-off-by: David Rientjes -- Three Cheers, Balbir -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail143.messagelabs.com (mail143.messagelabs.com [216.82.254.35]) by kanga.kvack.org (Postfix) with ESMTP id 8ADBD6B0071 for ; Mon, 7 Jun 2010 09:15:40 -0400 (EDT) Received: from d03relay04.boulder.ibm.com (d03relay04.boulder.ibm.com [9.17.195.106]) by e37.co.us.ibm.com (8.14.4/8.13.1) with ESMTP id o57DDtHs003190 for ; Mon, 7 Jun 2010 07:13:55 -0600 Received: from d03av04.boulder.ibm.com (d03av04.boulder.ibm.com [9.17.195.170]) by d03relay04.boulder.ibm.com (8.13.8/8.13.8/NCO v10.0) with ESMTP id o57DFPRX100076 for ; Mon, 7 Jun 2010 07:15:26 -0600 Received: from d03av04.boulder.ibm.com (loopback [127.0.0.1]) by d03av04.boulder.ibm.com (8.14.4/8.13.1/NCO v10.0 AVout) with ESMTP id o57DFKxF025667 for ; Mon, 7 Jun 2010 07:15:21 -0600 Date: Mon, 7 Jun 2010 18:28:28 +0530 From: Balbir Singh Subject: Re: [patch 02/18] oom: introduce find_lock_task_mm() to fix !mm false positives Message-ID: <20100607125828.GW4603@balbir.in.ibm.com> Reply-To: balbir@linux.vnet.ibm.com References: MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-1 Content-Disposition: inline In-Reply-To: Sender: owner-linux-mm@kvack.org To: David 
Rientjes Cc: Andrew Morton , Rik van Riel , Nick Piggin , Oleg Nesterov , KAMEZAWA Hiroyuki , KOSAKI Motohiro , linux-mm@kvack.org List-ID: * David Rientjes [2010-06-06 15:34:03]: > From: Oleg Nesterov > > Almost all ->mm == NUL checks in oom_kill.c are wrong. typo should be NULL > > The current code assumes that the task without ->mm has already > released its memory and ignores the process. However this is not > necessarily true when this process is multithreaded, other live > sub-threads can use this ->mm. > > - Remove the "if (!p->mm)" check in select_bad_process(), it is > just wrong. > > - Add the new helper, find_lock_task_mm(), which finds the live > thread which uses the memory and takes task_lock() to pin ->mm > > - change oom_badness() to use this helper instead of just checking > ->mm != NULL. > > - As David pointed out, select_bad_process() must never choose the > task without ->mm, but no matter what oom_badness() returns the > task can be chosen if nothing else has been found yet. > > Change oom_badness() to return int, change it to return -1 if > find_lock_task_mm() fails, and change select_bad_process() to > check points >= 0. > > Note! This patch is not enough, we need more changes. > > - oom_badness() was fixed, but oom_kill_task() still ignores > the task without ->mm > > - oom_forkbomb_penalty() should use find_lock_task_mm() too, > and it also needs other changes to actually find the first > first-descendant children > > This will be addressed later. 
> > [kosaki.motohiro@jp.fujitsu.com: use in badness(), __oom_kill_task()] > Signed-off-by: Oleg Nesterov > Signed-off-by: David Rientjes > --- > mm/oom_kill.c | 74 +++++++++++++++++++++++++++++++++------------------------ > 1 files changed, 43 insertions(+), 31 deletions(-) > > diff --git a/mm/oom_kill.c b/mm/oom_kill.c > --- a/mm/oom_kill.c > +++ b/mm/oom_kill.c > @@ -52,6 +52,20 @@ static int has_intersects_mems_allowed(struct task_struct *tsk) > return 0; > } > > +static struct task_struct *find_lock_task_mm(struct task_struct *p) > +{ > + struct task_struct *t = p; > + > + do { > + task_lock(t); > + if (likely(t->mm)) > + return t; > + task_unlock(t); > + } while_each_thread(p, t); > + > + return NULL; > +} > + Even if we miss this mm via p->mm, won't for_each_process actually catch it? Are you suggesting that the main thread could have detached the mm and a thread might still have it mapped? -- Three Cheers, Balbir -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail202.messagelabs.com (mail202.messagelabs.com [216.82.254.227]) by kanga.kvack.org (Postfix) with SMTP id 715676B0071 for ; Mon, 7 Jun 2010 09:49:31 -0400 (EDT) Received: by iwn2 with SMTP id 2so1520408iwn.14 for ; Mon, 07 Jun 2010 06:49:29 -0700 (PDT) MIME-Version: 1.0 In-Reply-To: <20100607125828.GW4603@balbir.in.ibm.com> References: <20100607125828.GW4603@balbir.in.ibm.com> Date: Mon, 7 Jun 2010 22:49:29 +0900 Message-ID: Subject: Re: [patch 02/18] oom: introduce find_lock_task_mm() to fix !mm false positives From: Minchan Kim Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: quoted-printable Sender: owner-linux-mm@kvack.org To: balbir@linux.vnet.ibm.com Cc: David Rientjes , Andrew Morton , Rik van Riel , Nick Piggin , Oleg Nesterov , KAMEZAWA Hiroyuki , KOSAKI Motohiro , linux-mm@kvack.org List-ID: Hi, Balbir. On Mon, Jun 7, 2010 at 9:58 PM, Balbir Singh wrote: > * David Rientjes [2010-06-06 15:34:03]: > >> From: Oleg Nesterov >> >> Almost all ->mm == NUL checks in oom_kill.c are wrong. > > typo should be NULL > >> >> The current code assumes that the task without ->mm has already >> released its memory and ignores the process. However this is not >> necessarily true when this process is multithreaded, other live >> sub-threads can use this ->mm. >> >> - Remove the "if (!p->mm)" check in select_bad_process(), it is >> just wrong. >> >> - Add the new helper, find_lock_task_mm(), which finds the live >> thread which uses the memory and takes task_lock() to pin ->mm >> >> - change oom_badness() to use this helper instead of just checking >> ->mm != NULL. >> >> - As David pointed out, select_bad_process() must never choose the >> task without ->mm, but no matter what oom_badness() returns the >> task can be chosen if nothing else has been found yet.
>> >> Change oom_badness() to return int, change it to return -1 if >> find_lock_task_mm() fails, and change select_bad_process() to >> check points >= 0. >> >> Note! This patch is not enough, we need more changes. >> >> - oom_badness() was fixed, but oom_kill_task() still ignores >> the task without ->mm >> >> - oom_forkbomb_penalty() should use find_lock_task_mm() too, >> and it also needs other changes to actually find the first >> first-descendant children >> >> This will be addressed later. >> >> [kosaki.motohiro@jp.fujitsu.com: use in badness(), __oom_kill_task()] >> Signed-off-by: Oleg Nesterov >> Signed-off-by: David Rientjes >> --- >> mm/oom_kill.c | 74 +++++++++++++++++++++++++++++++++------------------------ >> 1 files changed, 43 insertions(+), 31 deletions(-) >> >> diff --git a/mm/oom_kill.c b/mm/oom_kill.c >> --- a/mm/oom_kill.c >> +++ b/mm/oom_kill.c >> @@ -52,6 +52,20 @@ static int has_intersects_mems_allowed(struct task_struct *tsk) >> return 0; >> } >> >> +static struct task_struct *find_lock_task_mm(struct task_struct *p) >> +{ >> + struct task_struct *t = p; >> + >> + do { >> + task_lock(t); >> + if (likely(t->mm)) >> + return t; >> + task_unlock(t); >> + } while_each_thread(p, t); >> + >> + return NULL; >> +} >> + > > Even if we miss this mm via p->mm, won't for_each_process actually > catch it? Are you suggesting that the main thread could have detached > the mm and a thread might still have it mapped?
Yes. Although the main thread may have detached its mm, a sub-thread may still have the mm. Since this confused you, I think the function name isn't good, so I suggested the following: http://lkml.org/lkml/2010/6/2/325 Anyway, the patch makes sense to me. -- Kind regards, Minchan Kim -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail144.messagelabs.com (mail144.messagelabs.com [216.82.254.51]) by kanga.kvack.org (Postfix) with ESMTP id 6DBFF6B0071 for ; Mon, 7 Jun 2010 15:49:12 -0400 (EDT) Received: from wpaz24.hot.corp.google.com (wpaz24.hot.corp.google.com [172.24.198.88]) by smtp-out.google.com with ESMTP id o57Jn9PG030056 for ; Mon, 7 Jun 2010 12:49:09 -0700 Received: from pwi7 (pwi7.prod.google.com [10.241.219.7]) by wpaz24.hot.corp.google.com with ESMTP id o57Jn7Lj008764 for ; Mon, 7 Jun 2010 12:49:08 -0700 Received: by pwi7 with SMTP id 7so546296pwi.7 for ; Mon, 07 Jun 2010 12:49:07 -0700 (PDT) Date: Mon, 7 Jun 2010 12:49:03 -0700 (PDT) From: David Rientjes Subject: Re: [patch 02/18] oom: introduce find_lock_task_mm() to fix !mm false positives In-Reply-To: Message-ID: References: <20100607125828.GW4603@balbir.in.ibm.com> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-linux-mm@kvack.org To: Minchan Kim Cc: balbir@linux.vnet.ibm.com, Andrew Morton , Rik van Riel , Nick Piggin , Oleg Nesterov , KAMEZAWA Hiroyuki , KOSAKI Motohiro , linux-mm@kvack.org List-ID: On Mon, 7 Jun 2010, Minchan Kim wrote: > Yes. Although the main thread may have detached its mm, a sub-thread may > still have the mm. Since this confused you, I think the function name > isn't good, so I suggested the following: > I think the function name is fine; it describes exactly what it does: it finds the relevant mm for the task and returns it with task_lock() held. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org.
For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail138.messagelabs.com (mail138.messagelabs.com [216.82.249.35]) by kanga.kvack.org (Postfix) with ESMTP id BD2196B01AD for ; Mon, 7 Jun 2010 15:50:22 -0400 (EDT) Received: from wpaz37.hot.corp.google.com (wpaz37.hot.corp.google.com [172.24.198.101]) by smtp-out.google.com with ESMTP id o57JoH5q004810 for ; Mon, 7 Jun 2010 12:50:18 -0700 Received: from pzk38 (pzk38.prod.google.com [10.243.19.166]) by wpaz37.hot.corp.google.com with ESMTP id o57JoGKn013429 for ; Mon, 7 Jun 2010 12:50:16 -0700 Received: by pzk38 with SMTP id 38so30807pzk.28 for ; Mon, 07 Jun 2010 12:50:16 -0700 (PDT) Date: Mon, 7 Jun 2010 12:50:13 -0700 (PDT) From: David Rientjes Subject: Re: [patch 01/18] oom: check PF_KTHREAD instead of !mm to skip kthreads In-Reply-To: <20100607121204.GV4603@balbir.in.ibm.com> Message-ID: References: <20100607121204.GV4603@balbir.in.ibm.com> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-linux-mm@kvack.org To: Balbir Singh Cc: Andrew Morton , Rik van Riel , Nick Piggin , Oleg Nesterov , KAMEZAWA Hiroyuki , KOSAKI Motohiro , linux-mm@kvack.org List-ID: On Mon, 7 Jun 2010, Balbir Singh wrote: > > select_bad_process() thinks a kernel thread can't have ->mm != NULL, this > > is not true due to use_mm(). > > > > Change the code to check PF_KTHREAD. > > > > Quick check are all kernel threads marked with PF_KTHREAD? daemonize() > marks threads as kernel threads and I suppose children of init_task > inherit the flag on fork. I suppose both should cover all kernel > threads, but just checking to see if we missed anything. > Right, it's the inheritance from init_task that is the key which gets cleared on exec for all user threads. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail202.messagelabs.com (mail202.messagelabs.com [216.82.254.227]) by kanga.kvack.org (Postfix) with SMTP id 466AB6B01E3 for ; Tue, 8 Jun 2010 07:42:01 -0400 (EDT) Received: from m3.gw.fujitsu.co.jp ([10.0.50.73]) by fgwmail6.fujitsu.co.jp (Fujitsu Gateway) with ESMTP id o58Bfxvb012363 for (envelope-from kosaki.motohiro@jp.fujitsu.com); Tue, 8 Jun 2010 20:41:59 +0900 Received: from smail (m3 [127.0.0.1]) by outgoing.m3.gw.fujitsu.co.jp (Postfix) with ESMTP id 1554945DE4E for ; Tue, 8 Jun 2010 20:41:59 +0900 (JST) Received: from s3.gw.fujitsu.co.jp (s3.gw.fujitsu.co.jp [10.0.50.93]) by m3.gw.fujitsu.co.jp (Postfix) with ESMTP id D889845DE4D for ; Tue, 8 Jun 2010 20:41:58 +0900 (JST) Received: from s3.gw.fujitsu.co.jp (localhost.localdomain [127.0.0.1]) by s3.gw.fujitsu.co.jp (Postfix) with ESMTP id B137E1DB803C for ; Tue, 8 Jun 2010 20:41:58 +0900 (JST) Received: from m106.s.css.fujitsu.com (m106.s.css.fujitsu.com [10.249.87.106]) by s3.gw.fujitsu.co.jp (Postfix) with ESMTP id 6C8B41DB8043 for ; Tue, 8 Jun 2010 20:41:58 +0900 (JST) From: KOSAKI Motohiro Subject: Re: [patch 05/18] oom: give current access to memory reserves if it has been killed In-Reply-To: References: Message-Id: <20100608203216.765D.A69D9226@jp.fujitsu.com> MIME-Version: 1.0 Content-Type: text/plain; charset="ISO-2022-JP" Content-Transfer-Encoding: 7bit Date: Tue, 8 Jun 2010 20:41:57 +0900 (JST) Sender: owner-linux-mm@kvack.org To: David Rientjes Cc: kosaki.motohiro@jp.fujitsu.com, Andrew Morton , Rik van Riel , Nick Piggin , Oleg Nesterov , Balbir Singh , KAMEZAWA Hiroyuki , linux-mm@kvack.org List-ID: > It's possible to livelock the page allocator if a thread has mm->mmap_sem > and fails to make forward progress because the oom killer selects another > thread sharing the same ->mm to kill that cannot exit until the semaphore > is dropped.
> > The oom killer will not kill multiple tasks at the same time; each oom > killed task must exit before another task may be killed. Thus, if one > thread is holding mm->mmap_sem and cannot allocate memory, all threads > sharing the same ->mm are blocked from exiting as well. In the oom kill > case, that means the thread holding mm->mmap_sem will never free > additional memory since it cannot get access to memory reserves and the > thread that depends on it with access to memory reserves cannot exit > because it cannot acquire the semaphore. Thus, the page allocators > livelocks. > > When the oom killer is called and current happens to have a pending > SIGKILL, this patch automatically gives it access to memory reserves and > returns. Upon returning to the page allocator, its allocation will > hopefully succeed so it can quickly exit and free its memory. If not, the > page allocator will fail the allocation if it is not __GFP_NOFAIL. > > Acked-by: KOSAKI Motohiro > Reviewed-by: KAMEZAWA Hiroyuki > Signed-off-by: David Rientjes > --- > mm/oom_kill.c | 10 ++++++++++ > 1 files changed, 10 insertions(+), 0 deletions(-) > > diff --git a/mm/oom_kill.c b/mm/oom_kill.c > --- a/mm/oom_kill.c > +++ b/mm/oom_kill.c > @@ -650,6 +650,16 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask, > /* Got some memory back in the last second. */ > return; > > + /* > + * If current has a pending SIGKILL, then automatically select it. The > + * goal is to allow it to allocate so that it may quickly exit and free > + * its memory. > + */ > + if (fatal_signal_pending(current)) { > + set_thread_flag(TIF_MEMDIE); > + return; > + } > + > if (sysctl_panic_on_oom == 2) { > dump_header(NULL, gfp_mask, order, NULL); > panic("out of memory. Compulsory panic_on_oom is selected.\n");
Sorry, I found that this patch works incorrectly, so I have not pulled it. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org.
For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail143.messagelabs.com (mail143.messagelabs.com [216.82.254.35]) by kanga.kvack.org (Postfix) with SMTP id A90236B01E7 for ; Tue, 8 Jun 2010 07:42:02 -0400 (EDT) Received: from m2.gw.fujitsu.co.jp ([10.0.50.72]) by fgwmail6.fujitsu.co.jp (Fujitsu Gateway) with ESMTP id o58Bg0sM012374 for (envelope-from kosaki.motohiro@jp.fujitsu.com); Tue, 8 Jun 2010 20:42:00 +0900 Received: from smail (m2 [127.0.0.1]) by outgoing.m2.gw.fujitsu.co.jp (Postfix) with ESMTP id 247C845DE55 for ; Tue, 8 Jun 2010 20:42:00 +0900 (JST) Received: from s2.gw.fujitsu.co.jp (s2.gw.fujitsu.co.jp [10.0.50.92]) by m2.gw.fujitsu.co.jp (Postfix) with ESMTP id E4BCF45DE51 for ; Tue, 8 Jun 2010 20:41:59 +0900 (JST) Received: from s2.gw.fujitsu.co.jp (localhost.localdomain [127.0.0.1]) by s2.gw.fujitsu.co.jp (Postfix) with ESMTP id C640A1DB803A for ; Tue, 8 Jun 2010 20:41:59 +0900 (JST) Received: from m106.s.css.fujitsu.com (m106.s.css.fujitsu.com [10.249.87.106]) by s2.gw.fujitsu.co.jp (Postfix) with ESMTP id 5B6C31DB803C for ; Tue, 8 Jun 2010 20:41:59 +0900 (JST) From: KOSAKI Motohiro Subject: Re: [patch 07/18] oom: filter tasks not sharing the same cpuset In-Reply-To: References: Message-Id: <20100608203342.7663.A69D9226@jp.fujitsu.com> MIME-Version: 1.0 Content-Type: text/plain; charset="ISO-2022-JP" Content-Transfer-Encoding: 7bit Date: Tue, 8 Jun 2010 20:41:58 +0900 (JST) Sender: owner-linux-mm@kvack.org To: David Rientjes Cc: kosaki.motohiro@jp.fujitsu.com, Andrew Morton , Rik van Riel , Nick Piggin , Oleg Nesterov , Balbir Singh , KAMEZAWA Hiroyuki , linux-mm@kvack.org List-ID: > Tasks that do not share the same set of allowed nodes with the task that > triggered the oom should not be considered as candidates for oom kill. 
> > Tasks in other cpusets with a disjoint set of mems would be unfairly > penalized otherwise because of oom conditions elsewhere; an extreme > example could unfairly kill all other applications on the system if a > single task in a user's cpuset sets itself to OOM_DISABLE and then uses > more memory than allowed. > > Killing tasks outside of current's cpuset rarely would free memory for > current anyway. To use a sane heuristic, we must ensure that killing a > task would likely free memory for current and avoid needlessly killing > others at all costs just because their potential memory freeing is > unknown. It is better to kill current than another task needlessly. > > Acked-by: Rik van Riel > Acked-by: Nick Piggin > Acked-by: Balbir Singh > Acked-by: KOSAKI Motohiro > Reviewed-by: KAMEZAWA Hiroyuki > Signed-off-by: David Rientjes > --- > mm/oom_kill.c | 10 ++-------- > 1 files changed, 2 insertions(+), 8 deletions(-) > > diff --git a/mm/oom_kill.c b/mm/oom_kill.c > --- a/mm/oom_kill.c > +++ b/mm/oom_kill.c > @@ -184,14 +184,6 @@ unsigned long badness(struct task_struct *p, unsigned long uptime) > points /= 4; > > /* > - * If p's nodes don't overlap ours, it may still help to kill p > - * because p may have allocated or otherwise mapped memory on > - * this node before. However it will be less likely. > - */ > - if (!has_intersects_mems_allowed(p)) > - points /= 8; > - > - /* > * Adjust the score by oom_adj. > */ > if (oom_adj) { > @@ -277,6 +269,8 @@ static struct task_struct *select_bad_process(unsigned long *ppoints, > continue; > if (mem && !task_in_mem_cgroup(p, mem)) > continue; > + if (!has_intersects_mems_allowed(p)) > + continue; > > /* > * This task already has access to memory reserves and is

Pulled, but I'll merge my fix and append a historical remark.

-- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ .
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail172.messagelabs.com (mail172.messagelabs.com [216.82.254.3]) by kanga.kvack.org (Postfix) with SMTP id B2B7F6B01E8 for ; Tue, 8 Jun 2010 07:42:03 -0400 (EDT) Received: from m6.gw.fujitsu.co.jp ([10.0.50.76]) by fgwmail5.fujitsu.co.jp (Fujitsu Gateway) with ESMTP id o58Bg1BC014526 for (envelope-from kosaki.motohiro@jp.fujitsu.com); Tue, 8 Jun 2010 20:42:01 +0900 Received: from smail (m6 [127.0.0.1]) by outgoing.m6.gw.fujitsu.co.jp (Postfix) with ESMTP id BE76C45DE51 for ; Tue, 8 Jun 2010 20:42:00 +0900 (JST) Received: from s6.gw.fujitsu.co.jp (s6.gw.fujitsu.co.jp [10.0.50.96]) by m6.gw.fujitsu.co.jp (Postfix) with ESMTP id 92BE445DE4F for ; Tue, 8 Jun 2010 20:42:00 +0900 (JST) Received: from s6.gw.fujitsu.co.jp (localhost.localdomain [127.0.0.1]) by s6.gw.fujitsu.co.jp (Postfix) with ESMTP id 5D2A71DB801A for ; Tue, 8 Jun 2010 20:42:00 +0900 (JST) Received: from m106.s.css.fujitsu.com (m106.s.css.fujitsu.com [10.249.87.106]) by s6.gw.fujitsu.co.jp (Postfix) with ESMTP id D9F091DB8016 for ; Tue, 8 Jun 2010 20:41:59 +0900 (JST) From: KOSAKI Motohiro Subject: Re: [patch 08/18] oom: sacrifice child with highest badness score for parent In-Reply-To: References: Message-Id: <20100608203443.7666.A69D9226@jp.fujitsu.com> MIME-Version: 1.0 Content-Type: text/plain; charset="ISO-2022-JP" Content-Transfer-Encoding: 7bit Date: Tue, 8 Jun 2010 20:41:59 +0900 (JST) Sender: owner-linux-mm@kvack.org To: David Rientjes Cc: kosaki.motohiro@jp.fujitsu.com, Andrew Morton , Rik van Riel , Nick Piggin , Oleg Nesterov , Balbir Singh , KAMEZAWA Hiroyuki , linux-mm@kvack.org List-ID: > When a task is chosen for oom kill, the oom killer first attempts to > sacrifice a child not sharing its parent's memory instead. Unfortunately, > this often kills in a seemingly random fashion based on the ordering of > the selected task's child list. 
Additionally, it is not guaranteed at all > to free a large amount of memory that we need to prevent additional oom > killing in the very near future. > > Instead, we now only attempt to sacrifice the worst child not sharing its > parent's memory, if one exists. The worst child is indicated with the > highest badness() score. This serves two advantages: we kill a > memory-hogging task more often, and we allow the configurable > /proc/pid/oom_adj value to be considered as a factor in which child to > kill. > > Reviewers may observe that the previous implementation would iterate > through the children and attempt to kill each until one was successful and > then the parent if none were found while the new code simply kills the > most memory-hogging task or the parent. Note that the only time > oom_kill_task() fails, however, is when a child does not have an mm or has > a /proc/pid/oom_adj of OOM_DISABLE. badness() returns 0 for both cases, > so the final oom_kill_task() will always succeed. 
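The selection rule described above can be sketched in userspace C. This is only an illustration under stated assumptions, not code from the patch: `struct toy_task`, its `mm_id` and `badness` fields, and `pick_victim()` are hypothetical stand-ins for `task_struct`, `->mm` and the `badness()` result, and the memcg filter is left out.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the few task_struct fields used here. */
struct toy_task {
	int mm_id;             /* tasks with equal mm_id share an address space */
	unsigned long badness; /* precomputed score; 0 means unkillable */
};

/*
 * Return the task to kill: the child with the highest nonzero badness
 * that does not share the parent's address space, or the parent itself
 * if no such child exists.
 */
const struct toy_task *pick_victim(const struct toy_task *parent,
				   const struct toy_task *children,
				   size_t nr_children)
{
	const struct toy_task *victim = parent;
	unsigned long victim_points = 0;
	size_t i;

	assert(parent != NULL);
	for (i = 0; i < nr_children; i++) {
		const struct toy_task *c = &children[i];

		/* A child sharing the parent's memory is not a candidate. */
		if (c->mm_id == parent->mm_id)
			continue;
		/* Children scoring 0 never beat the initial victim_points. */
		if (c->badness > victim_points) {
			victim = c;
			victim_points = c->badness;
		}
	}
	return victim;
}
```

Because `victim` starts out as the parent and a score of 0 never replaces it, the function always returns a task that can be killed, which is the property the last paragraph above relies on.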
> > Acked-by: Rik van Riel > Acked-by: Nick Piggin > Acked-by: Balbir Singh > Acked-by: KOSAKI Motohiro > Reviewed-by: KAMEZAWA Hiroyuki > Reviewed-by: KOSAKI Motohiro > Signed-off-by: David Rientjes > --- > mm/oom_kill.c | 23 +++++++++++++++++------ > 1 files changed, 17 insertions(+), 6 deletions(-) > > diff --git a/mm/oom_kill.c b/mm/oom_kill.c > --- a/mm/oom_kill.c > +++ b/mm/oom_kill.c > @@ -441,8 +441,11 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order, > unsigned long points, struct mem_cgroup *mem, > const char *message) > { > + struct task_struct *victim = p; > struct task_struct *c; > struct task_struct *t = p; > + unsigned long victim_points = 0; > + struct timespec uptime; > > if (printk_ratelimit()) > dump_header(p, gfp_mask, order, mem); > @@ -456,22 +459,30 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order, > return 0; > } > > - printk(KERN_ERR "%s: kill process %d (%s) score %li or a child\n", > - message, task_pid_nr(p), p->comm, points); > + pr_err("%s: Kill process %d (%s) score %lu or sacrifice child\n", > + message, task_pid_nr(p), p->comm, points); > > - /* Try to kill a child first */ > + /* Try to sacrifice the worst child first */ > + do_posix_clock_monotonic_gettime(&uptime); > do { > + unsigned long cpoints; > + > list_for_each_entry(c, &t->children, sibling) { > if (c->mm == p->mm) > continue; > if (mem && !task_in_mem_cgroup(c, mem)) > continue; > - if (!oom_kill_task(c)) > - return 0; > + > + /* badness() returns 0 if the thread is unkillable */ > + cpoints = badness(c, uptime.tv_sec); > + if (cpoints > victim_points) { > + victim = c; > + victim_points = cpoints; > + } > } > } while_each_thread(p, t); > > - return oom_kill_task(p); > + return oom_kill_task(victim); > } > > #ifdef CONFIG_CGROUP_MEM_RES_CTLR

A better version is already in my patch kit.

-- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org.
For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail137.messagelabs.com (mail137.messagelabs.com [216.82.249.19]) by kanga.kvack.org (Postfix) with SMTP id 2D38D6B01E3 for ; Tue, 8 Jun 2010 07:42:04 -0400 (EDT) Received: from m3.gw.fujitsu.co.jp ([10.0.50.73]) by fgwmail7.fujitsu.co.jp (Fujitsu Gateway) with ESMTP id o58Bg1nN008040 for (envelope-from kosaki.motohiro@jp.fujitsu.com); Tue, 8 Jun 2010 20:42:02 +0900 Received: from smail (m3 [127.0.0.1]) by outgoing.m3.gw.fujitsu.co.jp (Postfix) with ESMTP id 7781E45DE56 for ; Tue, 8 Jun 2010 20:42:01 +0900 (JST) Received: from s3.gw.fujitsu.co.jp (s3.gw.fujitsu.co.jp [10.0.50.93]) by m3.gw.fujitsu.co.jp (Postfix) with ESMTP id 3A02845DE4E for ; Tue, 8 Jun 2010 20:42:01 +0900 (JST) Received: from s3.gw.fujitsu.co.jp (localhost.localdomain [127.0.0.1]) by s3.gw.fujitsu.co.jp (Postfix) with ESMTP id 0D6391DB8043 for ; Tue, 8 Jun 2010 20:42:01 +0900 (JST) Received: from m106.s.css.fujitsu.com (m106.s.css.fujitsu.com [10.249.87.106]) by s3.gw.fujitsu.co.jp (Postfix) with ESMTP id B4DD21DB8037 for ; Tue, 8 Jun 2010 20:42:00 +0900 (JST) From: KOSAKI Motohiro Subject: Re: [patch 10/18] oom: enable oom tasklist dump by default In-Reply-To: References: Message-Id: <20100608203540.766C.A69D9226@jp.fujitsu.com> MIME-Version: 1.0 Content-Type: text/plain; charset="ISO-2022-JP" Content-Transfer-Encoding: 7bit Date: Tue, 8 Jun 2010 20:42:00 +0900 (JST) Sender: owner-linux-mm@kvack.org To: David Rientjes Cc: kosaki.motohiro@jp.fujitsu.com, Andrew Morton , Rik van Riel , Nick Piggin , Oleg Nesterov , Balbir Singh , KAMEZAWA Hiroyuki , linux-mm@kvack.org List-ID: > The oom killer tasklist dump, enabled with the oom_dump_tasks sysctl, is > very helpful information in diagnosing why a user's task has been killed. 
> It emits useful information such as each eligible thread's memory usage > that can determine why the system is oom, so it should be enabled by > default. > > Acked-by: KOSAKI Motohiro > Signed-off-by: David Rientjes > --- > Documentation/sysctl/vm.txt | 2 +- > mm/oom_kill.c | 2 +- > 2 files changed, 2 insertions(+), 2 deletions(-) > > diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt > --- a/Documentation/sysctl/vm.txt > +++ b/Documentation/sysctl/vm.txt > @@ -511,7 +511,7 @@ information may not be desired. > If this is set to non-zero, this information is shown whenever the > OOM killer actually kills a memory-hogging task. > > -The default value is 0. > +The default value is 1 (enabled). > > ============================================================== > > diff --git a/mm/oom_kill.c b/mm/oom_kill.c > index ef048c1..833de48 100644 > --- a/mm/oom_kill.c > +++ b/mm/oom_kill.c > @@ -32,7 +32,7 @@ > > int sysctl_panic_on_oom; > int sysctl_oom_kill_allocating_task; > -int sysctl_oom_dump_tasks; > +int sysctl_oom_dump_tasks = 1; > static DEFINE_SPINLOCK(zone_scan_lock); > /* #define DEBUG */ > pulled. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail137.messagelabs.com (mail137.messagelabs.com [216.82.249.19]) by kanga.kvack.org (Postfix) with SMTP id 743A56B01E6 for ; Tue, 8 Jun 2010 07:42:04 -0400 (EDT) Received: from m5.gw.fujitsu.co.jp ([10.0.50.75]) by fgwmail6.fujitsu.co.jp (Fujitsu Gateway) with ESMTP id o58Bg2nT012449 for (envelope-from kosaki.motohiro@jp.fujitsu.com); Tue, 8 Jun 2010 20:42:02 +0900 Received: from smail (m5 [127.0.0.1]) by outgoing.m5.gw.fujitsu.co.jp (Postfix) with ESMTP id 1C25C45DE58 for ; Tue, 8 Jun 2010 20:42:02 +0900 (JST) Received: from s5.gw.fujitsu.co.jp (s5.gw.fujitsu.co.jp [10.0.50.95]) by m5.gw.fujitsu.co.jp (Postfix) with ESMTP id CEC6C45DE52 for ; Tue, 8 Jun 2010 20:42:01 +0900 (JST) Received: from s5.gw.fujitsu.co.jp (localhost.localdomain [127.0.0.1]) by s5.gw.fujitsu.co.jp (Postfix) with ESMTP id A2B1F1DB8043 for ; Tue, 8 Jun 2010 20:42:01 +0900 (JST) Received: from m106.s.css.fujitsu.com (m106.s.css.fujitsu.com [10.249.87.106]) by s5.gw.fujitsu.co.jp (Postfix) with ESMTP id 3514E1DB8040 for ; Tue, 8 Jun 2010 20:42:01 +0900 (JST) From: KOSAKI Motohiro Subject: Re: [patch 11/18] oom: avoid oom killer for lowmem allocations In-Reply-To: References: Message-Id: <20100608203551.766F.A69D9226@jp.fujitsu.com> MIME-Version: 1.0 Content-Type: text/plain; charset="ISO-2022-JP" Content-Transfer-Encoding: 7bit Date: Tue, 8 Jun 2010 20:42:00 +0900 (JST) Sender: owner-linux-mm@kvack.org To: David Rientjes Cc: kosaki.motohiro@jp.fujitsu.com, Andrew Morton , Rik van Riel , Nick Piggin , Oleg Nesterov , Balbir Singh , KAMEZAWA Hiroyuki , linux-mm@kvack.org List-ID: > If memory has been depleted in lowmem zones even with the protection > afforded to it by /proc/sys/vm/lowmem_reserve_ratio, it is unlikely that > killing current users will help. 
The memory is either reclaimable (or > migratable) already, in which case we should not invoke the oom killer at > all, or it is pinned by an application for I/O. Killing such an > application may leave the hardware in an unspecified state and there is no > guarantee that it will be able to make a timely exit. > > Lowmem allocations are now failed in oom conditions when __GFP_NOFAIL is > not used so that the task can perhaps recover or try again later. > > Previously, the heuristic provided some protection for those tasks with > CAP_SYS_RAWIO, but this is no longer necessary since we will not be > killing tasks for the purposes of ISA allocations. > > high_zoneidx is gfp_zone(gfp_flags), meaning that ZONE_NORMAL will be the > default for all allocations that are not __GFP_DMA, __GFP_DMA32, > __GFP_HIGHMEM, and __GFP_MOVABLE on kernels configured to support those > flags. Testing for high_zoneidx being less than ZONE_NORMAL will only > return true for allocations that have either __GFP_DMA or __GFP_DMA32. > > Acked-by: KOSAKI Motohiro > Signed-off-by: David Rientjes > --- > mm/page_alloc.c | 29 ++++++++++++++++++++--------- > 1 files changed, 20 insertions(+), 9 deletions(-) > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > --- a/mm/page_alloc.c > +++ b/mm/page_alloc.c > @@ -1759,6 +1759,9 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order, > /* The OOM killer will not help higher order allocs */ > if (order > PAGE_ALLOC_COSTLY_ORDER) > goto out; > + /* The OOM killer does not needlessly kill tasks for lowmem */ > + if (high_zoneidx < ZONE_NORMAL) > + goto out; > /* > * GFP_THISNODE contains __GFP_NORETRY and we never hit this. > * Sanity check for bare calls of __GFP_THISNODE, not real OOM. > @@ -2052,15 +2055,23 @@ rebalance: > if (page) > goto got_pg; > > - /* > - * The OOM killer does not trigger for high-order > - * ~__GFP_NOFAIL allocations so if no progress is being > - * made, there are no other options and retrying is > - * unlikely to help. 
> - */ > - if (order > PAGE_ALLOC_COSTLY_ORDER && > - !(gfp_mask & __GFP_NOFAIL)) > - goto nopage; > + if (!(gfp_mask & __GFP_NOFAIL)) { > + /* > + * The oom killer is not called for high-order > + * allocations that may fail, so if no progress > + * is being made, there are no other options and > + * retrying is unlikely to help. > + */ > + if (order > PAGE_ALLOC_COSTLY_ORDER) > + goto nopage; > + /* > + * The oom killer is not called for lowmem > + * allocations to prevent needlessly killing > + * innocent tasks. > + */ > + if (high_zoneidx < ZONE_NORMAL) > + goto nopage; > + } > > goto restart; > } pulled. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail190.messagelabs.com (mail190.messagelabs.com [216.82.249.51]) by kanga.kvack.org (Postfix) with SMTP id 835896B01ED for ; Tue, 8 Jun 2010 07:42:04 -0400 (EDT) Received: from m5.gw.fujitsu.co.jp ([10.0.50.75]) by fgwmail6.fujitsu.co.jp (Fujitsu Gateway) with ESMTP id o58Bg1RL012427 for (envelope-from kosaki.motohiro@jp.fujitsu.com); Tue, 8 Jun 2010 20:42:02 +0900 Received: from smail (m5 [127.0.0.1]) by outgoing.m5.gw.fujitsu.co.jp (Postfix) with ESMTP id 49B3345DE51 for ; Tue, 8 Jun 2010 20:42:01 +0900 (JST) Received: from s5.gw.fujitsu.co.jp (s5.gw.fujitsu.co.jp [10.0.50.95]) by m5.gw.fujitsu.co.jp (Postfix) with ESMTP id 2266345DE4E for ; Tue, 8 Jun 2010 20:42:01 +0900 (JST) Received: from s5.gw.fujitsu.co.jp (localhost.localdomain [127.0.0.1]) by s5.gw.fujitsu.co.jp (Postfix) with ESMTP id 06D511DB803F for ; Tue, 8 Jun 2010 20:42:01 +0900 (JST) Received: from m106.s.css.fujitsu.com (m106.s.css.fujitsu.com [10.249.87.106]) by s5.gw.fujitsu.co.jp (Postfix) with ESMTP id 9AE311DB8038 for ; Tue, 8 Jun 2010 20:41:57 +0900 (JST) From: KOSAKI Motohiro Subject: Re: [patch 16/18] oom: badness heuristic rewrite 
In-Reply-To: References: Message-Id: <20100608194533.7657.A69D9226@jp.fujitsu.com> MIME-Version: 1.0 Content-Type: text/plain; charset="ISO-2022-JP" Content-Transfer-Encoding: 7bit Date: Tue, 8 Jun 2010 20:41:56 +0900 (JST) Sender: owner-linux-mm@kvack.org To: David Rientjes Cc: kosaki.motohiro@jp.fujitsu.com, Andrew Morton , Rik van Riel , Nick Piggin , Oleg Nesterov , Balbir Singh , KAMEZAWA Hiroyuki , linux-mm@kvack.org List-ID: Hi > This a complete rewrite of the oom killer's badness() heuristic which is > used to determine which task to kill in oom conditions. The goal is to > make it as simple and predictable as possible so the results are better > understood and we end up killing the task which will lead to the most > memory freeing while still respecting the fine-tuning from userspace. > > Instead of basing the heuristic on mm->total_vm for each task, the task's > rss and swap space is used instead. This is a better indication of the > amount of memory that will be freeable if the oom killed task is chosen > and subsequently exits. This helps specifically in cases where KDE or > GNOME is chosen for oom kill on desktop systems instead of a memory > hogging task. > > The baseline for the heuristic is a proportion of memory that each task is > currently using in memory plus swap compared to the amount of "allowable" > memory. "Allowable," in this sense, means the system-wide resources for > unconstrained oom conditions, the set of mempolicy nodes, the mems > attached to current's cpuset, or a memory controller's limit. The > proportion is given on a scale of 0 (never kill) to 1000 (always kill), > roughly meaning that if a task has a badness() score of 500 that the task > consumes approximately 50% of allowable memory resident in RAM or in swap > space. 
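The baseline proportion described above can be sketched as follows. This is a hedged userspace illustration only: `badness_proportion()` and its parameters are hypothetical names, not taken from the patch, and it covers just the 0..1000 scaling (no root bonus and no userspace tuning).

```c
#include <assert.h>

/*
 * Scale the pages a task holds in RAM and in swap to a 0..1000
 * proportion of the pages it is allowed to use. Results above 1000
 * (usage beyond the allowed amount) are clamped to 1000.
 */
unsigned int badness_proportion(unsigned long rss_pages,
				unsigned long swap_pages,
				unsigned long allowed_pages)
{
	unsigned long long used = (unsigned long long)rss_pages + swap_pages;
	unsigned long long points;

	assert(allowed_pages > 0);
	points = used * 1000 / allowed_pages;
	return points > 1000 ? 1000 : (unsigned int)points;
}
```

With 250 resident pages, 250 swapped pages and 1000 allowed pages the sketch returns 500, which corresponds to the "approximately 50% of allowable memory" example in the changelog.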
> > The proportion is always relative to the amount of "allowable" memory and > not the total amount of RAM systemwide so that mempolicies and cpusets may > operate in isolation; they shall not need to know the true size of the > machine on which they are running if they are bound to a specific set of > nodes or mems, respectively. > > Root tasks are given 3% extra memory just like __vm_enough_memory() > provides in LSMs. In the event of two tasks consuming similar amounts of > memory, it is generally better to save root's task. > > Because of the change in the badness() heuristic's baseline, it is also > necessary to introduce a new user interface to tune it. It's not possible > to redefine the meaning of /proc/pid/oom_adj with a new scale since the > ABI cannot be changed for backward compatability. Instead, a new tunable, > /proc/pid/oom_score_adj, is added that ranges from -1000 to +1000. It may > be used to polarize the heuristic such that certain tasks are never > considered for oom kill while others may always be considered. The value > is added directly into the badness() score so a value of -500, for > example, means to discount 50% of its memory consumption in comparison to > other tasks either on the system, bound to the mempolicy, in the cpuset, > or sharing the same memory controller. > > /proc/pid/oom_adj is changed so that its meaning is rescaled into the > units used by /proc/pid/oom_score_adj, and vice versa. Changing one of > these per-task tunables will rescale the value of the other to an > equivalent meaning. Although /proc/pid/oom_adj was originally defined as > a bitshift on the badness score, it now shares the same linear growth as > /proc/pid/oom_score_adj but with different granularity. This is required > so the ABI is not broken with userspace applications and allows oom_adj to > be deprecated for future removal. 
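The rescaling between the two tunables can be tried out in userspace with the sketch below. The constants and the integer arithmetic follow the include/linux/oom.h and fs/proc/base.c hunks quoted in this message; the function names `oom_adj_to_score_adj()` and `score_adj_to_oom_adj()` are hypothetical, and the range asserts are an addition for the sketch.

```c
#include <assert.h>

#define OOM_DISABLE		(-17)
#define OOM_ADJUST_MAX		15
#define OOM_SCORE_ADJ_MIN	(-1000)
#define OOM_SCORE_ADJ_MAX	1000

/* oom_adj (-17..15) -> oom_score_adj; the maximum is always attainable. */
int oom_adj_to_score_adj(int oom_adj)
{
	assert(oom_adj >= OOM_DISABLE && oom_adj <= OOM_ADJUST_MAX);
	if (oom_adj == OOM_ADJUST_MAX)
		return OOM_SCORE_ADJ_MAX;
	return (oom_adj * OOM_SCORE_ADJ_MAX) / -OOM_DISABLE;
}

/* oom_score_adj (-1000..1000) -> oom_adj; OOM_DISABLE is always attainable. */
int score_adj_to_oom_adj(int score_adj)
{
	assert(score_adj >= OOM_SCORE_ADJ_MIN && score_adj <= OOM_SCORE_ADJ_MAX);
	if (score_adj == OOM_SCORE_ADJ_MIN)
		return OOM_DISABLE;
	return (score_adj * OOM_ADJUST_MAX) / OOM_SCORE_ADJ_MAX;
}
```

Since both directions use truncating integer division, a round trip is not exact in this sketch: an oom_adj of 8 maps to 470, which maps back to 7.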
> > Signed-off-by: David Rientjes > --- > Documentation/filesystems/proc.txt | 94 ++++++++----- > fs/proc/base.c | 99 ++++++++++++- > include/linux/memcontrol.h | 8 + > include/linux/oom.h | 14 ++- > include/linux/sched.h | 3 +- > kernel/fork.c | 1 + > mm/memcontrol.c | 18 +++ > mm/oom_kill.c | 279 ++++++++++++++++-------------------- > 8 files changed, 316 insertions(+), 200 deletions(-) > > diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt > --- a/Documentation/filesystems/proc.txt > +++ b/Documentation/filesystems/proc.txt > @@ -33,7 +33,8 @@ Table of Contents > 2 Modifying System Parameters > > 3 Per-Process Parameters > - 3.1 /proc//oom_adj - Adjust the oom-killer score > + 3.1 /proc//oom_adj & /proc//oom_score_adj - Adjust the oom-killer > + score > 3.2 /proc//oom_score - Display current oom-killer score > 3.3 /proc//io - Display the IO accounting fields > 3.4 /proc//coredump_filter - Core dump filtering settings > @@ -1234,42 +1235,61 @@ of the kernel. > CHAPTER 3: PER-PROCESS PARAMETERS > ------------------------------------------------------------------------------ > > -3.1 /proc//oom_adj - Adjust the oom-killer score > ------------------------------------------------------- > - > -This file can be used to adjust the score used to select which processes > -should be killed in an out-of-memory situation. Giving it a high score will > -increase the likelihood of this process being killed by the oom-killer. Valid > -values are in the range -16 to +15, plus the special value -17, which disables > -oom-killing altogether for this process. > - > -The process to be killed in an out-of-memory situation is selected among all others > -based on its badness score. This value equals the original memory size of the process > -and is then updated according to its CPU time (utime + stime) and the > -run time (uptime - start time). The longer it runs the smaller is the score. 
> -Badness score is divided by the square root of the CPU time and then by > -the double square root of the run time. > - > -Swapped out tasks are killed first. Half of each child's memory size is added to > -the parent's score if they do not share the same memory. Thus forking servers > -are the prime candidates to be killed. Having only one 'hungry' child will make > -parent less preferable than the child. > - > -/proc//oom_score shows process' current badness score. > - > -The following heuristics are then applied: > - * if the task was reniced, its score doubles > - * superuser or direct hardware access tasks (CAP_SYS_ADMIN, CAP_SYS_RESOURCE > - or CAP_SYS_RAWIO) have their score divided by 4 > - * if oom condition happened in one cpuset and checked process does not belong > - to it, its score is divided by 8 > - * the resulting score is multiplied by two to the power of oom_adj, i.e. > - points <<= oom_adj when it is positive and > - points >>= -(oom_adj) otherwise > - > -The task with the highest badness score is then selected and its children > -are killed, process itself will be killed in an OOM situation when it does > -not have children or some of them disabled oom like described above. > +3.1 /proc//oom_adj & /proc//oom_score_adj- Adjust the oom-killer score > +-------------------------------------------------------------------------------- > + > +These file can be used to adjust the badness heuristic used to select which > +process gets killed in out of memory conditions. > + > +The badness heuristic assigns a value to each candidate task ranging from 0 > +(never kill) to 1000 (always kill) to determine which process is targeted. The > +units are roughly a proportion along that range of allowed memory the process > +may allocate from based on an estimation of its current memory and swap use. > +For example, if a task is using all allowed memory, its badness score will be > +1000. If it is using half of its allowed memory, its score will be 500. 
> + > +There is an additional factor included in the badness score: root > +processes are given 3% extra memory over other tasks. > + > +The amount of "allowed" memory depends on the context in which the oom killer > +was called. If it is due to the memory assigned to the allocating task's cpuset > +being exhausted, the allowed memory represents the set of mems assigned to that > +cpuset. If it is due to a mempolicy's node(s) being exhausted, the allowed > +memory represents the set of mempolicy nodes. If it is due to a memory > +limit (or swap limit) being reached, the allowed memory is that configured > +limit. Finally, if it is due to the entire system being out of memory, the > +allowed memory represents all allocatable resources. > + > +The value of /proc//oom_score_adj is added to the badness score before it > +is used to determine which task to kill. Acceptable values range from -1000 > +(OOM_SCORE_ADJ_MIN) to +1000 (OOM_SCORE_ADJ_MAX). This allows userspace to > +polarize the preference for oom killing either by always preferring a certain > +task or completely disabling it. The lowest possible value, -1000, is > +equivalent to disabling oom killing entirely for that task since it will always > +report a badness score of 0. > + > +Consequently, it is very simple for userspace to define the amount of memory to > +consider for each task. Setting a /proc//oom_score_adj value of +500, for > +example, is roughly equivalent to allowing the remainder of tasks sharing the > +same system, cpuset, mempolicy, or memory controller resources to use at least > +50% more memory. A value of -500, on the other hand, would be roughly > +equivalent to discounting 50% of the task's allowed memory from being considered > +as scoring against the task. > + > +For backwards compatibility with previous kernels, /proc//oom_adj may also > +be used to tune the badness score. 
Its acceptable values range from -16 > +(OOM_ADJUST_MIN) to +15 (OOM_ADJUST_MAX) and a special value of -17 > +(OOM_DISABLE) to disable oom killing entirely for that task. Its value is > +scaled linearly with /proc//oom_score_adj. > + > +Writing to /proc//oom_score_adj or /proc//oom_adj will change the > +other with its scaled value. > + > +Caveat: when a parent task is selected, the oom killer will sacrifice any first > +generation children with seperate address spaces instead, if possible. This > +avoids servers and important system daemons from being killed and loses the > +minimal amount of work. > + > > 3.2 /proc//oom_score - Display current oom-killer score > ------------------------------------------------------------- > diff --git a/fs/proc/base.c b/fs/proc/base.c > --- a/fs/proc/base.c > +++ b/fs/proc/base.c > @@ -63,6 +63,7 @@ > #include > #include > #include > +#include > #include > #include > #include > @@ -428,16 +429,18 @@ static const struct file_operations proc_lstats_operations = { > #endif > > /* The badness from the OOM killer */ > -unsigned long badness(struct task_struct *p, unsigned long uptime); > static int proc_oom_score(struct task_struct *task, char *buffer) > { > unsigned long points = 0; > - struct timespec uptime; > > - do_posix_clock_monotonic_gettime(&uptime); > read_lock(&tasklist_lock); > if (pid_alive(task)) > - points = badness(task, uptime.tv_sec); > + points = oom_badness(task->group_leader, > + global_page_state(NR_INACTIVE_ANON) + > + global_page_state(NR_ACTIVE_ANON) + > + global_page_state(NR_INACTIVE_FILE) + > + global_page_state(NR_ACTIVE_FILE) + > + total_swap_pages);

Sorry, I can't ack this. I have tried again and again to explain why this is wrong (hopefully this is the last time).

1) Incompatibility

oom_score is part of the ABI, so we can't change it. From the end user's point of view, this change has no merit. In general, an incompatibility is allowed only in very limited situations, such as when the end user gains much more from the change than from compatibility.
In other words, only when the old-style ABI does not work well from the end user's point of view. That is not the case here.

2) Technically incorrect

This math is not correct. It does not represent "allowed memory". For example:

 1) It does not include mlocked memory, even though that memory can be freed by killing the task.
 2) Whether SHM_LOCKED memory can be freed depends on whether IPC_RMID was done. If it was not, killing the task doesn't free the SysV IPC memory.

In addition,

3) This normalization doesn't work on asymmetric NUMA. Total pages and oom are almost unrelated.

4) Scalability. If the system has 10TB of memory, 1 point of oom score means 10GB of memory consumption. That seems too coarse. In general, compressing a value like this is bad for scalable software.

So we can't merge this into our kernel. If your workload really needs this, we could consider the following simplest hook instead:

    if (badness_hook_fn)
        points = badness_hook_fn(p);
    else
        points = oom_badness(p);

Please implement your specific oom score in your hook function.

> read_unlock(&tasklist_lock); > return sprintf(buffer, "%lu\n", points); > } > @@ -1042,7 +1045,15 @@ static ssize_t oom_adjust_write(struct file *file, const char __user *buf, > } > > task->signal->oom_adj = oom_adjust; > - > + /* > + * Scale /proc/pid/oom_score_adj appropriately ensuring that a maximum > + * value is always attainable. > + */ > + if (task->signal->oom_adj == OOM_ADJUST_MAX) > + task->signal->oom_score_adj = OOM_SCORE_ADJ_MAX; > + else > + task->signal->oom_score_adj = (oom_adjust * OOM_SCORE_ADJ_MAX) / > + -OOM_DISABLE; > unlock_task_sighand(task, &flags); > put_task_struct(task);

In general, I am not against features for rare use cases. But sorry, as far as I have investigated, I haven't found any actual user. So I won't give an ack, because my reviews are basically based on 1) how many users use this, 2) how strongly users require it, 3) how many side effects there are, and so on, not on whether it is cool. A feature with zero users is basically out of my scope. Please split this feature out and discuss it with other reviewers (e.g.
Nick, Kamezawa-san). If you can get an ack from one or more reviewers, I won't object. I don't want to discuss this topic with you anymore. I can't imagine that you and I will reach agreement on this.

> @@ -1055,6 +1066,82 @@ static const struct file_operations proc_oom_adjust_operations = { > .llseek = generic_file_llseek, > }; > > +static ssize_t oom_score_adj_read(struct file *file, char __user *buf, > + size_t count, loff_t *ppos) > +{ > + struct task_struct *task = get_proc_task(file->f_path.dentry->d_inode); > + char buffer[PROC_NUMBUF]; > + int oom_score_adj = OOM_SCORE_ADJ_MIN; > + unsigned long flags; > + size_t len; > + > + if (!task) > + return -ESRCH; > + if (lock_task_sighand(task, &flags)) { > + oom_score_adj = task->signal->oom_score_adj; > + unlock_task_sighand(task, &flags); > + } > + put_task_struct(task); > + len = snprintf(buffer, sizeof(buffer), "%d\n", oom_score_adj); > + return simple_read_from_buffer(buf, count, ppos, buffer, len); > +} > + > +static ssize_t oom_score_adj_write(struct file *file, const char __user *buf, > + size_t count, loff_t *ppos) > +{ > + struct task_struct *task; > + char buffer[PROC_NUMBUF]; > + unsigned long flags; > + long oom_score_adj; > + int err; > + > + memset(buffer, 0, sizeof(buffer)); > + if (count > sizeof(buffer) - 1) > + count = sizeof(buffer) - 1; > + if (copy_from_user(buffer, buf, count)) > + return -EFAULT; > + > + err = strict_strtol(strstrip(buffer), 0, &oom_score_adj); > + if (err) > + return -EINVAL; > + if (oom_score_adj < OOM_SCORE_ADJ_MIN || > + oom_score_adj > OOM_SCORE_ADJ_MAX) > + return -EINVAL; > + > + task = get_proc_task(file->f_path.dentry->d_inode); > + if (!task) > + return -ESRCH; > + if (!lock_task_sighand(task, &flags)) { > + put_task_struct(task); > + return -ESRCH; > + } > + if (oom_score_adj < task->signal->oom_score_adj && > + !capable(CAP_SYS_RESOURCE)) { > + unlock_task_sighand(task, &flags); > + put_task_struct(task); > + return -EACCES; > + } > + > + task->signal->oom_score_adj = oom_score_adj; > +
/* > + * Scale /proc/pid/oom_adj appropriately ensuring that OOM_DISABLE is > + * always attainable. > + */ > + if (task->signal->oom_score_adj == OOM_SCORE_ADJ_MIN) > + task->signal->oom_adj = OOM_DISABLE; > + else > + task->signal->oom_adj = (oom_score_adj * OOM_ADJUST_MAX) / > + OOM_SCORE_ADJ_MAX; > + unlock_task_sighand(task, &flags); > + put_task_struct(task); > + return count; > +} > + > +static const struct file_operations proc_oom_score_adj_operations = { > + .read = oom_score_adj_read, > + .write = oom_score_adj_write, > +}; > + > #ifdef CONFIG_AUDITSYSCALL > #define TMPBUFLEN 21 > static ssize_t proc_loginuid_read(struct file * file, char __user * buf, > @@ -2627,6 +2714,7 @@ static const struct pid_entry tgid_base_stuff[] = { > #endif > INF("oom_score", S_IRUGO, proc_oom_score), > REG("oom_adj", S_IRUGO|S_IWUSR, proc_oom_adjust_operations), > + REG("oom_score_adj", S_IRUGO|S_IWUSR, proc_oom_score_adj_operations), > #ifdef CONFIG_AUDITSYSCALL > REG("loginuid", S_IWUSR|S_IRUGO, proc_loginuid_operations), > REG("sessionid", S_IRUGO, proc_sessionid_operations), > @@ -2961,6 +3049,7 @@ static const struct pid_entry tid_base_stuff[] = { > #endif > INF("oom_score", S_IRUGO, proc_oom_score), > REG("oom_adj", S_IRUGO|S_IWUSR, proc_oom_adjust_operations), > + REG("oom_score_adj", S_IRUGO|S_IWUSR, proc_oom_score_adj_operations), > #ifdef CONFIG_AUDITSYSCALL > REG("loginuid", S_IWUSR|S_IRUGO, proc_loginuid_operations), > REG("sessionid", S_IRUSR, proc_sessionid_operations), > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h > --- a/include/linux/memcontrol.h > +++ b/include/linux/memcontrol.h > @@ -130,6 +130,8 @@ void mem_cgroup_update_file_mapped(struct page *page, int val); > unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order, > gfp_t gfp_mask, int nid, > int zid); > +u64 mem_cgroup_get_limit(struct mem_cgroup *mem); > + > #else /* CONFIG_CGROUP_MEM_RES_CTLR */ > struct mem_cgroup; > > @@ -309,6 +311,12 @@ unsigned long 
mem_cgroup_soft_limit_reclaim(struct zone *zone, int order, > return 0; > } > > +static inline > +u64 mem_cgroup_get_limit(struct mem_cgroup *mem) > +{ > + return 0; > +} > + > #endif /* CONFIG_CGROUP_MEM_CONT */ > > #endif /* _LINUX_MEMCONTROL_H */ > diff --git a/include/linux/oom.h b/include/linux/oom.h > --- a/include/linux/oom.h > +++ b/include/linux/oom.h > @@ -1,14 +1,24 @@ > #ifndef __INCLUDE_LINUX_OOM_H > #define __INCLUDE_LINUX_OOM_H > > -/* /proc//oom_adj set to -17 protects from the oom-killer */ > +/* > + * /proc//oom_adj set to -17 protects from the oom-killer > + */ > #define OOM_DISABLE (-17) > /* inclusive */ > #define OOM_ADJUST_MIN (-16) > #define OOM_ADJUST_MAX 15 > > +/* > + * /proc//oom_score_adj set to OOM_SCORE_ADJ_MIN disables oom killing for > + * pid. > + */ > +#define OOM_SCORE_ADJ_MIN (-1000) > +#define OOM_SCORE_ADJ_MAX 1000 > + > #ifdef __KERNEL__ > > +#include > #include > #include > > @@ -25,6 +35,8 @@ enum oom_constraint { > CONSTRAINT_MEMCG, > }; > > +extern unsigned int oom_badness(struct task_struct *p, > + unsigned long totalpages); > extern int try_set_zone_oom(struct zonelist *zonelist, gfp_t gfp_flags); > extern void clear_zonelist_oom(struct zonelist *zonelist, gfp_t gfp_flags); > > diff --git a/include/linux/sched.h b/include/linux/sched.h > --- a/include/linux/sched.h > +++ b/include/linux/sched.h > @@ -629,7 +629,8 @@ struct signal_struct { > struct tty_audit_buf *tty_audit_buf; > #endif > > - int oom_adj; /* OOM kill score adjustment (bit shift) */ > + int oom_adj; /* OOM kill score adjustment (bit shift) */ > + int oom_score_adj; /* OOM kill score adjustment */ > }; > > /* Context switch must be unlocked if interrupts are to be enabled */ > diff --git a/kernel/fork.c b/kernel/fork.c > --- a/kernel/fork.c > +++ b/kernel/fork.c > @@ -899,6 +899,7 @@ static int copy_signal(unsigned long clone_flags, struct task_struct *tsk) > tty_audit_fork(sig); > > sig->oom_adj = current->signal->oom_adj; > + sig->oom_score_adj = 
current->signal->oom_score_adj; > > return 0; > } > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c > @@ -1158,6 +1158,24 @@ static int mem_cgroup_count_children(struct mem_cgroup *mem) > } > > /* > + * Return the memory (and swap, if configured) limit for a memcg. > + */ > +u64 mem_cgroup_get_limit(struct mem_cgroup *memcg) > +{ > + u64 limit; > + u64 memsw; > + > + limit = res_counter_read_u64(&memcg->res, RES_LIMIT) + > + total_swap_pages; > + memsw = res_counter_read_u64(&memcg->memsw, RES_LIMIT); > + /* > + * If memsw is finite and limits the amount of swap space available > + * to this memcg, return that limit. > + */ > + return min(limit, memsw); > +} > + > +/* > * Visit the first child (need not be the first child as per the ordering > * of the cgroup list, since we track last_scanned_child) of @mem and use > * that to reclaim free pages from. > diff --git a/mm/oom_kill.c b/mm/oom_kill.c > --- a/mm/oom_kill.c > +++ b/mm/oom_kill.c > @@ -4,6 +4,8 @@ > * Copyright (C) 1998,2000 Rik van Riel > * Thanks go out to Claus Fischer for some serious inspiration and > * for goading me into coding this file... > + * Copyright (C) 2010 Google, Inc. > + * Rewritten by David Rientjes

Please don't add this.

> * > * The routines in this file are used to kill a process when > * we're seriously out of memory. This gets called from __alloc_pages() > @@ -34,7 +36,6 @@ int sysctl_panic_on_oom; > int sysctl_oom_kill_allocating_task; > int sysctl_oom_dump_tasks = 1; > static DEFINE_SPINLOCK(zone_scan_lock); > -/* #define DEBUG */ > > /* > * Do all threads of the target process overlap our allowed nodes?
> @@ -84,139 +85,72 @@ static struct task_struct *find_lock_task_mm(struct task_struct *p) > } > > /** > - * badness - calculate a numeric value for how bad this task has been > + * oom_badness - heuristic function to determine which candidate task to kill > * @p: task struct of which task we should calculate > - * @uptime: current uptime in seconds > + * @totalpages: total present RAM allowed for page allocation > * > - * The formula used is relatively simple and documented inline in the > - * function. The main rationale is that we want to select a good task > - * to kill when we run out of memory. > - * > - * Good in this context means that: > - * 1) we lose the minimum amount of work done > - * 2) we recover a large amount of memory > - * 3) we don't kill anything innocent of eating tons of memory > - * 4) we want to kill the minimum amount of processes (one) > - * 5) we try to kill the process the user expects us to kill, this > - * algorithm has been meticulously tuned to meet the principle > - * of least surprise ... (be careful when you change it) > + * The heuristic for determining which task to kill is made to be as simple and > + * predictable as possible. The goal is to return the highest value for the > + * task consuming the most memory to avoid subsequent oom failures. > */ > - > -unsigned long badness(struct task_struct *p, unsigned long uptime) > +unsigned int oom_badness(struct task_struct *p, unsigned long totalpages) > { > - unsigned long points, cpu_time, run_time; > - struct task_struct *child; > - struct task_struct *c, *t; > - int oom_adj = p->signal->oom_adj; > - struct task_cputime task_time; > - unsigned long utime; > - unsigned long stime; > - > - if (oom_adj == OOM_DISABLE) > - return 0; > + int points; > > p = find_lock_task_mm(p); > if (!p) > return 0; > > /* > - * The memory size of the process is the basis for the badness. 
> - */ > - points = p->mm->total_vm; > - > - /* > - * After this unlock we can no longer dereference local variable `mm' > - */ > - task_unlock(p); > - > - /* > - * swapoff can easily use up all memory, so kill those first. > + * Shortcut check for OOM_SCORE_ADJ_MIN so the entire heuristic doesn't > + * need to be executed for something that cannot be killed. > */ > - if (p->flags & PF_OOM_ORIGIN) > - return ULONG_MAX; > - > - /* > - * Processes which fork a lot of child processes are likely > - * a good choice. We add half the vmsize of the children if they > - * have an own mm. This prevents forking servers to flood the > - * machine with an endless amount of children. In case a single > - * child is eating the vast majority of memory, adding only half > - * to the parents will make the child our kill candidate of choice. > - */ > - t = p; > - do { > - list_for_each_entry(c, &t->children, sibling) { > - child = find_lock_task_mm(c); > - if (child) { > - if (child->mm != p->mm) > - points += child->mm->total_vm/2 + 1; > - task_unlock(child); > - } > - } > - } while_each_thread(p, t); > + if (p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN) { > + task_unlock(p); > + return 0; > + } > > /* > - * CPU time is in tens of seconds and run time is in thousands > - * of seconds. There is no particular reason for this other than > - * that it turned out to work very well in practice. > + * When the PF_OOM_ORIGIN bit is set, it indicates the task should have > + * priority for oom killing. 
> */ > - thread_group_cputime(p, &task_time); > - utime = cputime_to_jiffies(task_time.utime); > - stime = cputime_to_jiffies(task_time.stime); > - cpu_time = (utime + stime) >> (SHIFT_HZ + 3); > - > - > - if (uptime >= p->start_time.tv_sec) > - run_time = (uptime - p->start_time.tv_sec) >> 10; > - else > - run_time = 0; > - > - if (cpu_time) > - points /= int_sqrt(cpu_time); > - if (run_time) > - points /= int_sqrt(int_sqrt(run_time)); > + if (p->flags & PF_OOM_ORIGIN) { > + task_unlock(p); > + return 1000; > + } > > /* > - * Niced processes are most likely less important, so double > - * their badness points. > + * The memory controller may have a limit of 0 bytes, so avoid a divide > + * by zero if necessary. > */ > - if (task_nice(p) > 0) > - points *= 2;

You removed
 - the run time check
 - the cpu time check
 - the nice check
but did not describe the reason. Reviewers are puzzled. How can we review this if we don't get your point? Please write:
 - What is the benefit?
 - Why do you think there is no bad effect?
 - How did you confirm it?

> + if (!totalpages) > + totalpages = 1; > > /* > - * Superuser processes are usually more important, so we make it > - * less likely that we kill those. > + * The baseline for the badness score is the proportion of RAM that each > + * task's rss and swap space use. > */ > - if (has_capability_noaudit(p, CAP_SYS_ADMIN) || > - has_capability_noaudit(p, CAP_SYS_RESOURCE)) > - points /= 4; > + points = (get_mm_rss(p->mm) + get_mm_counter(p->mm, MM_SWAPENTS)) * 1000 / > + totalpages; > + task_unlock(p); > > /* > - * We don't want to kill a process with direct hardware access. > - * Not only could that mess up the hardware, but usually users > - * tend to only have this flag set on applications they think > - * of as important. > + * Root processes get 3% bonus, just like the __vm_enough_memory() > + * implementation used by LSMs.
> */ > - if (has_capability_noaudit(p, CAP_SYS_RAWIO)) > - points /= 4; > + if (has_capability_noaudit(p, CAP_SYS_ADMIN)) > + points -= 30;

CAP_SYS_ADMIN does not seem like a good idea. CAP_SYS_ADMIN implies an admin's interactive process, but killing an interactive process only forces a logout, whereas killing a system daemon can cause a far more catastrophic failure.

Finally, I'll pull this one, but only as a cherry-pick.

> > /* > - * Adjust the score by oom_adj. > + * /proc/pid/oom_score_adj ranges from -1000 to +1000 such that it may > + * either completely disable oom killing or always prefer a certain > + * task. > */ > - if (oom_adj) { > - if (oom_adj > 0) { > - if (!points) > - points = 1; > - points <<= oom_adj; > - } else > - points >>= -(oom_adj); > - } > + points += p->signal->oom_score_adj; > > -#ifdef DEBUG > - printk(KERN_DEBUG "OOMkill: task %d (%s) got %lu points\n", > - p->pid, p->comm, points); > -#endif > - return points; > + if (points < 0) > + return 0; > + return (points < 1000) ? points : 1000; > } > > /* > @@ -224,12 +158,24 @@ unsigned long badness(struct task_struct *p, unsigned long uptime) > */ > #ifdef CONFIG_NUMA > static enum oom_constraint constrained_alloc(struct zonelist *zonelist, > - gfp_t gfp_mask, nodemask_t *nodemask) > + gfp_t gfp_mask, nodemask_t *nodemask, > + unsigned long *totalpages) > { > struct zone *zone; > struct zoneref *z; > enum zone_type high_zoneidx = gfp_zone(gfp_mask); > + bool cpuset_limited = false; > + int nid; > > + /* Default to all anonymous memory, page cache, and swap */ > + *totalpages = global_page_state(NR_INACTIVE_ANON) + > + global_page_state(NR_ACTIVE_ANON) + > + global_page_state(NR_INACTIVE_FILE) + > + global_page_state(NR_ACTIVE_FILE) + > + total_swap_pages; > + > + if (!zonelist) > + return CONSTRAINT_NONE; > /* > * Reach here only when __GFP_NOFAIL is used. So, we should avoid > * to kill current.We have to random task kill in this case.
> @@ -239,26 +185,47 @@ static enum oom_constraint constrained_alloc(struct zonelist *zonelist, > return CONSTRAINT_NONE; > > /* > - * The nodemask here is a nodemask passed to alloc_pages(). Now, > - * cpuset doesn't use this nodemask for its hardwall/softwall/hierarchy > - * feature. mempolicy is an only user of nodemask here. > - * check mempolicy's nodemask contains all N_HIGH_MEMORY > + * This is not a __GFP_THISNODE allocation, so a truncated nodemask in > + * the page allocator means a mempolicy is in effect. Cpuset policy > + * is enforced in get_page_from_freelist(). > */ > - if (nodemask && !nodes_subset(node_states[N_HIGH_MEMORY], *nodemask)) > + if (nodemask && !nodes_subset(node_states[N_HIGH_MEMORY], *nodemask)) { > + *totalpages = total_swap_pages; > + for_each_node_mask(nid, *nodemask) > + *totalpages += node_page_state(nid, NR_INACTIVE_ANON) + > + node_page_state(nid, NR_ACTIVE_ANON) + > + node_page_state(nid, NR_INACTIVE_FILE) + > + node_page_state(nid, NR_ACTIVE_FILE); > return CONSTRAINT_MEMORY_POLICY; > + } > > /* Check this allocation failure is caused by cpuset's wall function */ > for_each_zone_zonelist_nodemask(zone, z, zonelist, > high_zoneidx, nodemask) > if (!cpuset_zone_allowed_softwall(zone, gfp_mask)) > - return CONSTRAINT_CPUSET; > - > + cpuset_limited = true; > + > + if (cpuset_limited) { > + *totalpages = total_swap_pages; > + for_each_node_mask(nid, cpuset_current_mems_allowed) > + *totalpages += node_page_state(nid, NR_INACTIVE_ANON) + > + node_page_state(nid, NR_ACTIVE_ANON) + > + node_page_state(nid, NR_INACTIVE_FILE) + > + node_page_state(nid, NR_ACTIVE_FILE); > + return CONSTRAINT_CPUSET; > + } > return CONSTRAINT_NONE; > } > #else > static enum oom_constraint constrained_alloc(struct zonelist *zonelist, > - gfp_t gfp_mask, nodemask_t *nodemask) > + gfp_t gfp_mask, nodemask_t *nodemask, > + unsigned long *totalpages) > { > + *totalpages = global_page_state(NR_INACTIVE_ANON) + > + global_page_state(NR_ACTIVE_ANON) + > + 
global_page_state(NR_INACTIVE_FILE) + > + global_page_state(NR_ACTIVE_FILE) + > + total_swap_pages; > return CONSTRAINT_NONE; > } > #endif > @@ -269,18 +236,16 @@ static enum oom_constraint constrained_alloc(struct zonelist *zonelist, > * > * (not docbooked, we don't want this one cluttering up the manual) > */ > -static struct task_struct *select_bad_process(unsigned long *ppoints, > - struct mem_cgroup *mem, enum oom_constraint constraint, > - const nodemask_t *mask) > +static struct task_struct *select_bad_process(unsigned int *ppoints, > + unsigned long totalpages, struct mem_cgroup *mem, > + enum oom_constraint constraint, const nodemask_t *mask) > { > struct task_struct *p; > struct task_struct *chosen = NULL; > - struct timespec uptime; > *ppoints = 0; > > - do_posix_clock_monotonic_gettime(&uptime); > for_each_process(p) { > - unsigned long points; > + unsigned int points; > > /* skip the init task and kthreads */ > if (is_global_init(p) || (p->flags & PF_KTHREAD)) > @@ -319,14 +284,11 @@ static struct task_struct *select_bad_process(unsigned long *ppoints, > return ERR_PTR(-1UL); > > chosen = p; > - *ppoints = ULONG_MAX; > + *ppoints = 1000; > } > > - if (p->signal->oom_adj == OOM_DISABLE) > - continue; > - > - points = badness(p, uptime.tv_sec); > - if (points > *ppoints || !chosen) { > + points = oom_badness(p, totalpages); > + if (points > *ppoints) { > chosen = p; > *ppoints = points; > } > @@ -341,7 +303,7 @@ static struct task_struct *select_bad_process(unsigned long *ppoints, > * > * Dumps the current memory state of all system tasks, excluding kernel threads. > * State information includes task's pid, uid, tgid, vm size, rss, cpu, oom_adj > - * score, and name. > + * value, oom_score_adj value, and name. > * > * If the actual is non-NULL, only tasks that are a member of the mem_cgroup are > * shown. 
> @@ -354,7 +316,7 @@ static void dump_tasks(const struct mem_cgroup *mem) > struct task_struct *task; > > printk(KERN_INFO "[ pid ] uid tgid total_vm rss cpu oom_adj " > - "name\n"); > + "oom_score_adj name\n"); > for_each_process(p) { > /* > * We don't have is_global_init() check here, because the old > @@ -376,10 +338,11 @@ static void dump_tasks(const struct mem_cgroup *mem) > continue; > } > > - printk(KERN_INFO "[%5d] %5d %5d %8lu %8lu %3d %3d %s\n", > + pr_info("[%5d] %5d %5d %8lu %8lu %3d %3d %4d %s\n", > task->pid, __task_cred(task)->uid, task->tgid, > task->mm->total_vm, get_mm_rss(task->mm), > - (int)task_cpu(task), task->signal->oom_adj, p->comm); > + (int)task_cpu(task), task->signal->oom_adj, > + task->signal->oom_score_adj, p->comm); > task_unlock(task); > } > } > @@ -388,8 +351,9 @@ static void dump_header(struct task_struct *p, gfp_t gfp_mask, int order, > struct mem_cgroup *mem) > { > pr_warning("%s invoked oom-killer: gfp_mask=0x%x, order=%d, " > - "oom_adj=%d\n", > - current->comm, gfp_mask, order, current->signal->oom_adj); > + "oom_adj=%d, oom_score_adj=%d\n", > + current->comm, gfp_mask, order, current->signal->oom_adj, > + current->signal->oom_score_adj); > task_lock(current); > cpuset_print_task_mems_allowed(current); > task_unlock(current); > @@ -404,7 +368,7 @@ static void dump_header(struct task_struct *p, gfp_t gfp_mask, int order, > static int oom_kill_task(struct task_struct *p) > { > p = find_lock_task_mm(p); > - if (!p || p->signal->oom_adj == OOM_DISABLE) { > + if (!p || p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN) { > task_unlock(p); > return 1; > } > @@ -422,14 +386,13 @@ static int oom_kill_task(struct task_struct *p) > #undef K > > static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order, > - unsigned long points, struct mem_cgroup *mem, > - const char *message) > + unsigned int points, unsigned long totalpages, > + struct mem_cgroup *mem, const char *message) > { > struct task_struct *victim = p; > struct 
task_struct *c; > struct task_struct *t = p; > - unsigned long victim_points = 0; > - struct timespec uptime; > + unsigned int victim_points = 0; > > if (printk_ratelimit()) > dump_header(p, gfp_mask, order, mem); > @@ -443,13 +406,12 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order, > return 0; > } > > - pr_err("%s: Kill process %d (%s) score %lu or sacrifice child\n", > + pr_err("%s: Kill process %d (%s) score %d or sacrifice child\n", > message, task_pid_nr(p), p->comm, points); > > /* Try to sacrifice the worst child first */ > - do_posix_clock_monotonic_gettime(&uptime); > do { > - unsigned long cpoints; > + unsigned int cpoints; > > list_for_each_entry(c, &t->children, sibling) { > if (c->mm == p->mm) > @@ -457,8 +419,10 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order, > if (mem && !task_in_mem_cgroup(c, mem)) > continue; > > - /* badness() returns 0 if the thread is unkillable */ > - cpoints = badness(c, uptime.tv_sec); > + /* > + * oom_badness() returns 0 if the thread is unkillable > + */ > + cpoints = oom_badness(c, totalpages); > if (cpoints > victim_points) { > victim = c; > victim_points = cpoints; > @@ -496,17 +460,19 @@ static void check_panic_on_oom(enum oom_constraint constraint, gfp_t gfp_mask, > #ifdef CONFIG_CGROUP_MEM_RES_CTLR > void mem_cgroup_out_of_memory(struct mem_cgroup *mem, gfp_t gfp_mask) > { > - unsigned long points = 0; > + unsigned long limit; > + unsigned int points = 0; > struct task_struct *p; > > check_panic_on_oom(CONSTRAINT_MEMCG, gfp_mask, 0); > + limit = mem_cgroup_get_limit(mem) >> PAGE_SHIFT; > read_lock(&tasklist_lock); > retry: > - p = select_bad_process(&points, mem, CONSTRAINT_MEMCG, NULL); > + p = select_bad_process(&points, limit, mem, CONSTRAINT_MEMCG, NULL); > if (!p || PTR_ERR(p) == -1UL) > goto out; > > - if (oom_kill_process(p, gfp_mask, 0, points, mem, > + if (oom_kill_process(p, gfp_mask, 0, points, limit, mem, > "Memory cgroup out of memory")) > 
goto retry; > out: > @@ -619,22 +585,22 @@ static void clear_system_oom(void) > /* > * Must be called with tasklist_lock held for read. > */ > -static void __out_of_memory(gfp_t gfp_mask, int order, > +static void __out_of_memory(gfp_t gfp_mask, int order, unsigned long totalpages, > enum oom_constraint constraint, const nodemask_t *mask) > { > struct task_struct *p; > - unsigned long points; > + unsigned int points; > > if (sysctl_oom_kill_allocating_task) > - if (!oom_kill_process(current, gfp_mask, order, 0, NULL, > - "Out of memory (oom_kill_allocating_task)")) > + if (!oom_kill_process(current, gfp_mask, order, 0, totalpages, > + NULL, "Out of memory (oom_kill_allocating_task)")) > return; > retry: > /* > * Rambo mode: Shoot down a process and hope it solves whatever > * issues we may have. > */ > - p = select_bad_process(&points, NULL, constraint, mask); > + p = select_bad_process(&points, totalpages, NULL, constraint, mask); > > if (PTR_ERR(p) == -1UL) > return; > @@ -646,7 +612,7 @@ retry: > panic("Out of memory and no killable processes...\n"); > } > > - if (oom_kill_process(p, gfp_mask, order, points, NULL, > + if (oom_kill_process(p, gfp_mask, order, points, totalpages, NULL, > "Out of memory")) > goto retry; > } > @@ -666,6 +632,7 @@ retry: > void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask, > int order, nodemask_t *nodemask) > { > + unsigned long totalpages; > unsigned long freed = 0; > enum oom_constraint constraint = CONSTRAINT_NONE; > > @@ -688,11 +655,11 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask, > * Check if there were limitations on the allocation (only relevant for > * NUMA) that may require different handling. 
> */ > - if (zonelist) > - constraint = constrained_alloc(zonelist, gfp_mask, nodemask); > + constraint = constrained_alloc(zonelist, gfp_mask, nodemask, > + &totalpages); > check_panic_on_oom(constraint, gfp_mask, order); > read_lock(&tasklist_lock); > - __out_of_memory(gfp_mask, order, constraint, nodemask); > + __out_of_memory(gfp_mask, order, totalpages, constraint, nodemask); > read_unlock(&tasklist_lock); > > /* -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail203.messagelabs.com (mail203.messagelabs.com [216.82.254.243]) by kanga.kvack.org (Postfix) with SMTP id DFFAE6B01E2 for ; Tue, 8 Jun 2010 07:42:04 -0400 (EDT) Received: from m2.gw.fujitsu.co.jp ([10.0.50.72]) by fgwmail6.fujitsu.co.jp (Fujitsu Gateway) with ESMTP id o58Bg3V5012476 for (envelope-from kosaki.motohiro@jp.fujitsu.com); Tue, 8 Jun 2010 20:42:03 +0900 Received: from smail (m2 [127.0.0.1]) by outgoing.m2.gw.fujitsu.co.jp (Postfix) with ESMTP id BFF0845DE57 for ; Tue, 8 Jun 2010 20:42:02 +0900 (JST) Received: from s2.gw.fujitsu.co.jp (s2.gw.fujitsu.co.jp [10.0.50.92]) by m2.gw.fujitsu.co.jp (Postfix) with ESMTP id 8EA5245DE51 for ; Tue, 8 Jun 2010 20:42:02 +0900 (JST) Received: from s2.gw.fujitsu.co.jp (localhost.localdomain [127.0.0.1]) by s2.gw.fujitsu.co.jp (Postfix) with ESMTP id 209621DB803E for ; Tue, 8 Jun 2010 20:42:02 +0900 (JST) Received: from m106.s.css.fujitsu.com (m106.s.css.fujitsu.com [10.249.87.106]) by s2.gw.fujitsu.co.jp (Postfix) with ESMTP id A4C7E1DB803C for ; Tue, 8 Jun 2010 20:42:01 +0900 (JST) From: KOSAKI Motohiro Subject: Re: [patch 12/18] oom: extract panic helper function In-Reply-To: References: Message-Id: <20100608203611.7672.A69D9226@jp.fujitsu.com> MIME-Version: 1.0 Content-Type: text/plain; charset="ISO-2022-JP" Content-Transfer-Encoding: 7bit Date: Tue, 8 
Jun 2010 20:42:00 +0900 (JST) Sender: owner-linux-mm@kvack.org To: David Rientjes Cc: kosaki.motohiro@jp.fujitsu.com, Andrew Morton , Rik van Riel , Nick Piggin , Oleg Nesterov , Balbir Singh , KAMEZAWA Hiroyuki , linux-mm@kvack.org List-ID: > There are various points in the oom killer where the kernel must > determine whether to panic or not. It's better to extract this to a > helper function to remove all the confusion as to its semantics. > > Also fix a call to dump_header() where tasklist_lock is not read- > locked, as required. > > There's no functional change with this patch. > > Acked-by: KOSAKI Motohiro > Signed-off-by: David Rientjes > --- > include/linux/oom.h | 1 + > mm/oom_kill.c | 53 +++++++++++++++++++++++++++----------------------- > 2 files changed, 30 insertions(+), 24 deletions(-) > > diff --git a/include/linux/oom.h b/include/linux/oom.h > --- a/include/linux/oom.h > +++ b/include/linux/oom.h > @@ -22,6 +22,7 @@ enum oom_constraint { > CONSTRAINT_NONE, > CONSTRAINT_CPUSET, > CONSTRAINT_MEMORY_POLICY, > + CONSTRAINT_MEMCG, > }; > > extern int try_set_zone_oom(struct zonelist *zonelist, gfp_t gfp_flags); > diff --git a/mm/oom_kill.c b/mm/oom_kill.c > --- a/mm/oom_kill.c > +++ b/mm/oom_kill.c > @@ -505,17 +505,40 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order, > return oom_kill_task(victim); > } > > +/* > + * Determines whether the kernel must panic because of the panic_on_oom sysctl. > + */ > +static void check_panic_on_oom(enum oom_constraint constraint, gfp_t gfp_mask, > + int order) > +{ > + if (likely(!sysctl_panic_on_oom)) > + return; > + if (sysctl_panic_on_oom != 2) { > + /* > + * panic_on_oom == 1 only affects CONSTRAINT_NONE, the kernel > + * does not panic for cpuset, mempolicy, or memcg allocation > + * failures. 
> + */ > + if (constraint != CONSTRAINT_NONE) > + return; > + } > + read_lock(&tasklist_lock); > + dump_header(NULL, gfp_mask, order, NULL); > + read_unlock(&tasklist_lock); > + panic("Out of memory: %s panic_on_oom is enabled\n", > + sysctl_panic_on_oom == 2 ? "compulsory" : "system-wide"); > +} > + > #ifdef CONFIG_CGROUP_MEM_RES_CTLR > void mem_cgroup_out_of_memory(struct mem_cgroup *mem, gfp_t gfp_mask) > { > unsigned long points = 0; > struct task_struct *p; > > - if (sysctl_panic_on_oom == 2) > - panic("out of memory(memcg). panic_on_oom is selected.\n"); > + check_panic_on_oom(CONSTRAINT_MEMCG, gfp_mask, 0); > read_lock(&tasklist_lock); > retry: > - p = select_bad_process(&points, mem, CONSTRAINT_NONE, NULL); > + p = select_bad_process(&points, mem, CONSTRAINT_MEMCG, NULL); > if (!p || PTR_ERR(p) == -1UL) > goto out; > > @@ -616,8 +639,8 @@ retry: > > /* Found nothing?!?! Either we hang forever, or we panic. */ > if (!p) { > - read_unlock(&tasklist_lock); > dump_header(NULL, gfp_mask, order, NULL); > + read_unlock(&tasklist_lock); > panic("Out of memory and no killable processes...\n"); > } > > @@ -639,9 +662,7 @@ void pagefault_out_of_memory(void) > /* Got some memory back in the last second. */ > return; > > - if (sysctl_panic_on_oom) > - panic("out of memory from page fault. panic_on_oom is selected.\n"); > - > + check_panic_on_oom(CONSTRAINT_NONE, 0, 0); > read_lock(&tasklist_lock); > /* unknown gfp_mask and order */ > __out_of_memory(0, 0, CONSTRAINT_NONE, NULL); > @@ -688,29 +709,13 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask, > return; > } > > - if (sysctl_panic_on_oom == 2) { > - dump_header(NULL, gfp_mask, order, NULL); > - panic("out of memory. Compulsory panic_on_oom is selected.\n"); > - } > - > /* > * Check if there were limitations on the allocation (only relevant for > * NUMA) that may require different handling. 
> */ > constraint = constrained_alloc(zonelist, gfp_mask, nodemask); > + check_panic_on_oom(constraint, gfp_mask, order); > read_lock(&tasklist_lock); > - if (unlikely(sysctl_panic_on_oom)) { > - /* > - * panic_on_oom only affects CONSTRAINT_NONE, the kernel > - * should not panic for cpuset or mempolicy induced memory > - * failures. > - */ > - if (constraint == CONSTRAINT_NONE) { > - dump_header(NULL, gfp_mask, order, NULL); > - read_unlock(&tasklist_lock); > - panic("Out of memory: panic_on_oom is enabled\n"); > - } > - } > __out_of_memory(gfp_mask, order, constraint, nodemask); > read_unlock(&tasklist_lock); > pulled. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail144.messagelabs.com (mail144.messagelabs.com [216.82.254.51]) by kanga.kvack.org (Postfix) with SMTP id 08C226B01EE for ; Tue, 8 Jun 2010 07:42:04 -0400 (EDT) Received: from m5.gw.fujitsu.co.jp ([10.0.50.75]) by fgwmail7.fujitsu.co.jp (Fujitsu Gateway) with ESMTP id o58Bg2hP008093 for (envelope-from kosaki.motohiro@jp.fujitsu.com); Tue, 8 Jun 2010 20:42:03 +0900 Received: from smail (m5 [127.0.0.1]) by outgoing.m5.gw.fujitsu.co.jp (Postfix) with ESMTP id 8E9BE45DE56 for ; Tue, 8 Jun 2010 20:42:02 +0900 (JST) Received: from s5.gw.fujitsu.co.jp (s5.gw.fujitsu.co.jp [10.0.50.95]) by m5.gw.fujitsu.co.jp (Postfix) with ESMTP id 5C75945DE51 for ; Tue, 8 Jun 2010 20:42:02 +0900 (JST) Received: from s5.gw.fujitsu.co.jp (localhost.localdomain [127.0.0.1]) by s5.gw.fujitsu.co.jp (Postfix) with ESMTP id 23EC0E08002 for ; Tue, 8 Jun 2010 20:42:02 +0900 (JST) Received: from m106.s.css.fujitsu.com (m106.s.css.fujitsu.com [10.249.87.106]) by s5.gw.fujitsu.co.jp (Postfix) with ESMTP id D2140E38001 for ; Tue, 8 Jun 2010 20:41:58 +0900 (JST) From: KOSAKI Motohiro Subject: Re: [patch 06/18] oom: avoid sending 
exiting tasks a SIGKILL In-Reply-To: References: Message-Id: <20100608203250.7660.A69D9226@jp.fujitsu.com> MIME-Version: 1.0 Content-Type: text/plain; charset="ISO-2022-JP" Content-Transfer-Encoding: 7bit Date: Tue, 8 Jun 2010 20:41:58 +0900 (JST) Sender: owner-linux-mm@kvack.org To: David Rientjes Cc: kosaki.motohiro@jp.fujitsu.com, Andrew Morton , Rik van Riel , Nick Piggin , Oleg Nesterov , Balbir Singh , KAMEZAWA Hiroyuki , linux-mm@kvack.org List-ID:

> It's unnecessary to SIGKILL a task that is already PF_EXITING and can > actually cause a NULL pointer dereference of the sighand if it has already > been detached. Instead, simply set TIF_MEMDIE so it has access to memory > reserves and can quickly exit as the comment implies. > > Reviewed-by: KAMEZAWA Hiroyuki > Signed-off-by: David Rientjes > --- > mm/oom_kill.c | 2 +- > 1 files changed, 1 insertions(+), 1 deletions(-) > > diff --git a/mm/oom_kill.c b/mm/oom_kill.c > --- a/mm/oom_kill.c > +++ b/mm/oom_kill.c > @@ -458,7 +458,7 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order, > * its children or threads, just set TIF_MEMDIE so it can die quickly > */ > if (p->flags & PF_EXITING) { > - __oom_kill_task(p, 0); > + set_tsk_thread_flag(p, TIF_MEMDIE); > return 0; > } >

I didn't pull the PF_EXITING related changes.

-- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ .
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail203.messagelabs.com (mail203.messagelabs.com [216.82.254.243]) by kanga.kvack.org (Postfix) with SMTP id E7C996B01EC for ; Tue, 8 Jun 2010 07:42:04 -0400 (EDT) Received: from m6.gw.fujitsu.co.jp ([10.0.50.76]) by fgwmail7.fujitsu.co.jp (Fujitsu Gateway) with ESMTP id o58Bg2uX008051 for (envelope-from kosaki.motohiro@jp.fujitsu.com); Tue, 8 Jun 2010 20:42:02 +0900 Received: from smail (m6 [127.0.0.1]) by outgoing.m6.gw.fujitsu.co.jp (Postfix) with ESMTP id 8926E45DE56 for ; Tue, 8 Jun 2010 20:42:01 +0900 (JST) Received: from s6.gw.fujitsu.co.jp (s6.gw.fujitsu.co.jp [10.0.50.96]) by m6.gw.fujitsu.co.jp (Postfix) with ESMTP id 518AE45DE52 for ; Tue, 8 Jun 2010 20:42:01 +0900 (JST) Received: from s6.gw.fujitsu.co.jp (localhost.localdomain [127.0.0.1]) by s6.gw.fujitsu.co.jp (Postfix) with ESMTP id 142281DB8015 for ; Tue, 8 Jun 2010 20:42:01 +0900 (JST) Received: from m106.s.css.fujitsu.com (m106.s.css.fujitsu.com [10.249.87.106]) by s6.gw.fujitsu.co.jp (Postfix) with ESMTP id 497A21DB8017 for ; Tue, 8 Jun 2010 20:42:00 +0900 (JST) From: KOSAKI Motohiro Subject: Re: [patch 09/18] oom: select task from tasklist for mempolicy ooms In-Reply-To: References: Message-Id: <20100608203529.7669.A69D9226@jp.fujitsu.com> MIME-Version: 1.0 Content-Type: text/plain; charset="ISO-2022-JP" Content-Transfer-Encoding: 7bit Date: Tue, 8 Jun 2010 20:41:59 +0900 (JST) Sender: owner-linux-mm@kvack.org To: David Rientjes Cc: kosaki.motohiro@jp.fujitsu.com, Andrew Morton , Rik van Riel , Nick Piggin , Oleg Nesterov , Balbir Singh , KAMEZAWA Hiroyuki , linux-mm@kvack.org List-ID: > The oom killer presently kills current whenever there is no more memory > free or reclaimable on its mempolicy's nodes. There is no guarantee that > current is a memory-hogging task or that killing it will free any > substantial amount of memory, however. 
> > In such situations, it is better to scan the tasklist for nodes that are > allowed to allocate on current's set of nodes and kill the task with the > highest badness() score. This ensures that the most memory-hogging task, > or the one configured by the user with /proc/pid/oom_adj, is always > selected in such scenarios. > > Reviewed-by: KOSAKI Motohiro > Signed-off-by: David Rientjes > --- > include/linux/mempolicy.h | 13 +++++++- > mm/mempolicy.c | 44 ++++++++++++++++++++++++ > mm/oom_kill.c | 80 +++++++++++++++++++++++++++----------------- > 3 files changed, 105 insertions(+), 32 deletions(-) > > diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h > --- a/include/linux/mempolicy.h > +++ b/include/linux/mempolicy.h > @@ -210,6 +210,8 @@ extern struct zonelist *huge_zonelist(struct vm_area_struct *vma, > unsigned long addr, gfp_t gfp_flags, > struct mempolicy **mpol, nodemask_t **nodemask); > extern bool init_nodemask_of_mempolicy(nodemask_t *mask); > +extern bool mempolicy_nodemask_intersects(struct task_struct *tsk, > + const nodemask_t *mask); > extern unsigned slab_node(struct mempolicy *policy); > > extern enum zone_type policy_zone; > @@ -338,7 +340,16 @@ static inline struct zonelist *huge_zonelist(struct vm_area_struct *vma, > return node_zonelist(0, gfp_flags); > } > > -static inline bool init_nodemask_of_mempolicy(nodemask_t *m) { return false; } > +static inline bool init_nodemask_of_mempolicy(nodemask_t *m) > +{ > + return false; > +} > + > +static inline bool mempolicy_nodemask_intersects(struct task_struct *tsk, > + const nodemask_t *mask) > +{ > + return false; > +} > > static inline int do_migrate_pages(struct mm_struct *mm, > const nodemask_t *from_nodes, > diff --git a/mm/mempolicy.c b/mm/mempolicy.c > --- a/mm/mempolicy.c > +++ b/mm/mempolicy.c > @@ -1712,6 +1712,50 @@ bool init_nodemask_of_mempolicy(nodemask_t *mask) > } > #endif > > +/* > + * mempolicy_nodemask_intersects > + * > + * If tsk's mempolicy is "default" [NULL], 
return 'true' to indicate default > + * policy. Otherwise, check for intersection between mask and the policy > + * nodemask for 'bind' or 'interleave' policy. For 'perferred' or 'local' > + * policy, always return true since it may allocate elsewhere on fallback. > + * > + * Takes task_lock(tsk) to prevent freeing of its mempolicy. > + */ > +bool mempolicy_nodemask_intersects(struct task_struct *tsk, > + const nodemask_t *mask) > +{ > + struct mempolicy *mempolicy; > + bool ret = true; > + > + if (!mask) > + return ret; > + task_lock(tsk); > + mempolicy = tsk->mempolicy; > + if (!mempolicy) > + goto out; > + > + switch (mempolicy->mode) { > + case MPOL_PREFERRED: > + /* > + * MPOL_PREFERRED and MPOL_F_LOCAL are only preferred nodes to > + * allocate from, they may fallback to other nodes when oom. > + * Thus, it's possible for tsk to have allocated memory from > + * nodes in mask. > + */ > + break; > + case MPOL_BIND: > + case MPOL_INTERLEAVE: > + ret = nodes_intersects(mempolicy->v.nodes, *mask); > + break; > + default: > + BUG(); > + } > +out: > + task_unlock(tsk); > + return ret; > +} > + > /* Allocate a page in interleaved policy. > Own path because it needs to do special accounting. */ > static struct page *alloc_page_interleave(gfp_t gfp, unsigned order, > diff --git a/mm/oom_kill.c b/mm/oom_kill.c > --- a/mm/oom_kill.c > +++ b/mm/oom_kill.c > @@ -27,6 +27,7 @@ > #include > #include > #include > +#include > #include > > int sysctl_panic_on_oom; > @@ -36,20 +37,36 @@ static DEFINE_SPINLOCK(zone_scan_lock); > /* #define DEBUG */ > > /* > - * Is all threads of the target process nodes overlap ours? > + * Do all threads of the target process overlap our allowed nodes? 
> + * @tsk: task struct of which task to consider > + * @mask: nodemask passed to page allocator for mempolicy ooms > */ > -static int has_intersects_mems_allowed(struct task_struct *tsk) > +static bool has_intersects_mems_allowed(struct task_struct *tsk, > + const nodemask_t *mask) > { > - struct task_struct *t; > + struct task_struct *start = tsk; > > - t = tsk; > do { > - if (cpuset_mems_allowed_intersects(current, t)) > - return 1; > - t = next_thread(t); > - } while (t != tsk); > - > - return 0; > + if (mask) { > + /* > + * If this is a mempolicy constrained oom, tsk's > + * cpuset is irrelevant. Only return true if its > + * mempolicy intersects current, otherwise it may be > + * needlessly killed. > + */ > + if (mempolicy_nodemask_intersects(tsk, mask)) > + return true; > + } else { > + /* > + * This is not a mempolicy constrained oom, so only > + * check the mems of tsk's cpuset. > + */ > + if (cpuset_mems_allowed_intersects(current, tsk)) > + return true; > + } > + tsk = next_thread(tsk); > + } while (tsk != start); > + return false; > } > > static struct task_struct *find_lock_task_mm(struct task_struct *p) > @@ -253,7 +270,8 @@ static enum oom_constraint constrained_alloc(struct zonelist *zonelist, > * (not docbooked, we don't want this one cluttering up the manual) > */ > static struct task_struct *select_bad_process(unsigned long *ppoints, > - struct mem_cgroup *mem) > + struct mem_cgroup *mem, enum oom_constraint constraint, > + const nodemask_t *mask) > { > struct task_struct *p; > struct task_struct *chosen = NULL; > @@ -269,7 +287,9 @@ static struct task_struct *select_bad_process(unsigned long *ppoints, > continue; > if (mem && !task_in_mem_cgroup(p, mem)) > continue; > - if (!has_intersects_mems_allowed(p)) > + if (!has_intersects_mems_allowed(p, > + constraint == CONSTRAINT_MEMORY_POLICY ? 
mask : > + NULL)) > continue; > > /* > @@ -495,7 +515,7 @@ void mem_cgroup_out_of_memory(struct mem_cgroup *mem, gfp_t gfp_mask) > panic("out of memory(memcg). panic_on_oom is selected.\n"); > read_lock(&tasklist_lock); > retry: > - p = select_bad_process(&points, mem); > + p = select_bad_process(&points, mem, CONSTRAINT_NONE, NULL); > if (!p || PTR_ERR(p) == -1UL) > goto out; > > @@ -574,7 +594,8 @@ void clear_zonelist_oom(struct zonelist *zonelist, gfp_t gfp_mask) > /* > * Must be called with tasklist_lock held for read. > */ > -static void __out_of_memory(gfp_t gfp_mask, int order) > +static void __out_of_memory(gfp_t gfp_mask, int order, > + enum oom_constraint constraint, const nodemask_t *mask) > { > struct task_struct *p; > unsigned long points; > @@ -588,7 +609,7 @@ retry: > * Rambo mode: Shoot down a process and hope it solves whatever > * issues we may have. > */ > - p = select_bad_process(&points, NULL); > + p = select_bad_process(&points, NULL, constraint, mask); > > if (PTR_ERR(p) == -1UL) > return; > @@ -622,7 +643,8 @@ void pagefault_out_of_memory(void) > panic("out of memory from page fault. 
panic_on_oom is selected.\n"); > > read_lock(&tasklist_lock); > - __out_of_memory(0, 0); /* unknown gfp_mask and order */ > + /* unknown gfp_mask and order */ > + __out_of_memory(0, 0, CONSTRAINT_NONE, NULL); > read_unlock(&tasklist_lock); > > /* > @@ -638,6 +660,7 @@ void pagefault_out_of_memory(void) > * @zonelist: zonelist pointer > * @gfp_mask: memory allocation flags > * @order: amount of memory being requested as a power of 2 > + * @nodemask: nodemask passed to page allocator > * > * If we run out of memory, we have the choice between either > * killing a random task (bad), letting the system crash (worse) > @@ -676,24 +699,19 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask, > */ > constraint = constrained_alloc(zonelist, gfp_mask, nodemask); > read_lock(&tasklist_lock); > - > - switch (constraint) { > - case CONSTRAINT_MEMORY_POLICY: > - oom_kill_process(current, gfp_mask, order, 0, NULL, > - "No available memory (MPOL_BIND)"); > - break; > - > - case CONSTRAINT_NONE: > - if (sysctl_panic_on_oom) { > + if (unlikely(sysctl_panic_on_oom)) { > + /* > + * panic_on_oom only affects CONSTRAINT_NONE, the kernel > + * should not panic for cpuset or mempolicy induced memory > + * failures. > + */ > + if (constraint == CONSTRAINT_NONE) { > dump_header(NULL, gfp_mask, order, NULL); > - panic("out of memory. panic_on_oom is selected\n"); > + read_unlock(&tasklist_lock); > + panic("Out of memory: panic_on_oom is enabled\n"); > } > - /* Fall-through */ > - case CONSTRAINT_CPUSET: > - __out_of_memory(gfp_mask, order); > - break; > } > - > + __out_of_memory(gfp_mask, order, constraint, nodemask); > read_unlock(&tasklist_lock); > > /* pulled. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
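The node check that patch 09/18 adds as mempolicy_nodemask_intersects() can be sketched in userspace C as below. This is only an illustration under stated assumptions, not the kernel code: `nodemask`, `enum policy_mode`, `struct policy`, and `policy_intersects()` are hypothetical stand-ins for nodemask_t, the MPOL_* modes, struct mempolicy, and the patched function, and the task_lock() the patch takes around the check is omitted.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for nodemask_t: bit n set means node n is allowed. */
typedef unsigned long nodemask;

/* Hypothetical stand-ins for MPOL_PREFERRED, MPOL_BIND, MPOL_INTERLEAVE. */
enum policy_mode { POLICY_PREFERRED, POLICY_BIND, POLICY_INTERLEAVE };

struct policy {
	enum policy_mode mode;
	nodemask nodes;
};

/*
 * Returns true if a task with policy 'pol' may hold memory on the nodes in
 * 'mask'. A NULL 'pol' models the default policy; a NULL 'mask' models an
 * oom that is not mempolicy-constrained.
 */
bool policy_intersects(const struct policy *pol, const nodemask *mask)
{
	if (!mask || !pol)
		return true;

	switch (pol->mode) {
	case POLICY_BIND:
	case POLICY_INTERLEAVE:
		/* Strict policies: eligible only if the node sets overlap. */
		return (pol->nodes & *mask) != 0;
	case POLICY_PREFERRED:
		/* Preferred nodes are a hint; allocations may fall back. */
		return true;
	}
	/* Out-of-range mode: the quoted patch calls BUG() here instead. */
	return true;
}
```

A task bound to node 2 only is skipped for an oom on nodes 0-1, while a task that merely prefers node 2 remains a candidate, since its allocations may have fallen back to the constrained nodes.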
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail172.messagelabs.com (mail172.messagelabs.com [216.82.254.3]) by kanga.kvack.org (Postfix) with SMTP id C5EE16B0215 for ; Tue, 8 Jun 2010 07:42:05 -0400 (EDT) Received: from m5.gw.fujitsu.co.jp ([10.0.50.75]) by fgwmail7.fujitsu.co.jp (Fujitsu Gateway) with ESMTP id o58Bg3AO008097 for (envelope-from kosaki.motohiro@jp.fujitsu.com); Tue, 8 Jun 2010 20:42:03 +0900 Received: from smail (m5 [127.0.0.1]) by outgoing.m5.gw.fujitsu.co.jp (Postfix) with ESMTP id 99D4745DE52 for ; Tue, 8 Jun 2010 20:42:03 +0900 (JST) Received: from s5.gw.fujitsu.co.jp (s5.gw.fujitsu.co.jp [10.0.50.95]) by m5.gw.fujitsu.co.jp (Postfix) with ESMTP id 6FD5845DE56 for ; Tue, 8 Jun 2010 20:42:03 +0900 (JST) Received: from s5.gw.fujitsu.co.jp (localhost.localdomain [127.0.0.1]) by s5.gw.fujitsu.co.jp (Postfix) with ESMTP id 4A27B1DB8040 for ; Tue, 8 Jun 2010 20:42:03 +0900 (JST) Received: from m106.s.css.fujitsu.com (m106.s.css.fujitsu.com [10.249.87.106]) by s5.gw.fujitsu.co.jp (Postfix) with ESMTP id D541E1DB805B for ; Tue, 8 Jun 2010 20:42:02 +0900 (JST) From: KOSAKI Motohiro Subject: Re: [patch 18/18] oom: deprecate oom_adj tunable In-Reply-To: References: Message-Id: <20100608194514.7654.A69D9226@jp.fujitsu.com> MIME-Version: 1.0 Content-Type: text/plain; charset="ISO-2022-JP" Content-Transfer-Encoding: 7bit Date: Tue, 8 Jun 2010 20:42:02 +0900 (JST) Sender: owner-linux-mm@kvack.org To: David Rientjes Cc: kosaki.motohiro@jp.fujitsu.com, Andrew Morton , Rik van Riel , Nick Piggin , Oleg Nesterov , Balbir Singh , KAMEZAWA Hiroyuki , linux-mm@kvack.org List-ID: > + /* > + * Warn that /proc/pid/oom_adj is deprecated, see > + * Documentation/feature-removal-schedule.txt. 
> +	 */
> +	printk_once(KERN_WARNING "%s (%d): /proc/%d/oom_adj is deprecated, "
> +			"please use /proc/%d/oom_score_adj instead.\n",
> +			current->comm, task_pid_nr(current),
> +			task_pid_nr(task), task_pid_nr(task));
>  	task->signal->oom_adj = oom_adjust;

Sorry, we can't accept this. oom_adj is one of the most frequently used
tuning knobs, and adding this warning will cause a lot of confusion.

In addition, this knob is used by some applications (please search with
Google Code Search or something similar). That means an end user can't
stop the warning, which will cause a lot of frustration. NO.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ .
Don't email: email@kvack.org

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail138.messagelabs.com (mail138.messagelabs.com [216.82.249.35]) by kanga.kvack.org (Postfix) with SMTP id B0F336B0211 for ; Tue, 8 Jun 2010 07:42:06 -0400 (EDT)
Received: from m4.gw.fujitsu.co.jp ([10.0.50.74]) by fgwmail7.fujitsu.co.jp (Fujitsu Gateway) with ESMTP id o58Bg420008131 for (envelope-from kosaki.motohiro@jp.fujitsu.com); Tue, 8 Jun 2010 20:42:04 +0900
Received: from smail (m4 [127.0.0.1]) by outgoing.m4.gw.fujitsu.co.jp (Postfix) with ESMTP id E571C45DE7D for ; Tue, 8 Jun 2010 20:42:02 +0900 (JST)
Received: from s4.gw.fujitsu.co.jp (s4.gw.fujitsu.co.jp [10.0.50.94]) by m4.gw.fujitsu.co.jp (Postfix) with ESMTP id 1EC5845DE6F for ; Tue, 8 Jun 2010 20:42:02 +0900 (JST)
Received: from s4.gw.fujitsu.co.jp (localhost.localdomain [127.0.0.1]) by s4.gw.fujitsu.co.jp (Postfix) with ESMTP id CECB7E38005 for ; Tue, 8 Jun 2010 20:41:59 +0900 (JST)
Received: from m106.s.css.fujitsu.com (m106.s.css.fujitsu.com [10.249.87.106]) by s4.gw.fujitsu.co.jp (Postfix) with ESMTP id 051D81DB8043 for ; Tue, 8 Jun 2010 20:41:58 +0900 (JST)
From: KOSAKI Motohiro
Subject: Re: [patch 17/18] oom: add forkbomb penalty to badness heuristic
In-Reply-To:
References:
Message-Id: <20100608203146.765A.A69D9226@jp.fujitsu.com>
MIME-Version: 1.0
Content-Type: text/plain; charset="ISO-2022-JP"
Content-Transfer-Encoding: 7bit
Date: Tue, 8 Jun 2010 20:41:57 +0900 (JST)
Sender: owner-linux-mm@kvack.org
To: David Rientjes
Cc: kosaki.motohiro@jp.fujitsu.com, Andrew Morton, Rik van Riel, Nick Piggin, Oleg Nesterov, Balbir Singh, KAMEZAWA Hiroyuki, linux-mm@kvack.org
List-ID:

> Add a forkbomb penalty for processes that fork an excessively large
> number of children to penalize that group of tasks and not others. A
> threshold is configurable from userspace to determine how many
> first-generation execve children (those with their own address spaces)
> a task may have before it is considered a forkbomb. This can be tuned by
> altering the value in /proc/sys/vm/oom_forkbomb_thres, which defaults to
> 1000.
>
> When a task has more than 1000 first-generation children with different
> address spaces than itself, a penalty of
>
>	(average rss of children) * (# of 1st generation execve children)
>	-----------------------------------------------------------------
>				oom_forkbomb_thres
>
> is assessed. So, for example, using the default oom_forkbomb_thres of
> 1000, the penalty is twice the average rss of all its execve children if
> there are 2000 such tasks. A task is considered to count toward the
> threshold if its total runtime is less than one second; for 1000 of such
> tasks to exist, the parent process must be forking at an extremely high
> rate either erroneously or maliciously.
>
> Even though a particular task may be designated a forkbomb and selected as
> the victim, the oom killer will still kill the 1st generation execve child
> with the highest badness() score in its place. This avoids killing
> important servers or system daemons. When a web server forks a very large
> number of threads for client connections, for example, it is much better
> to kill one of those threads than to kill the server and make it
> unresponsive.
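The penalty formula in the quoted changelog can be worked through with a small userspace C sketch. This is an illustration only, not the patch's code: the function name `forkbomb_penalty` and its parameters are assumptions made for the example, and the caller is assumed to have already counted only the first-generation execve children that qualify (those running for less than one second). Only the formula and the default threshold of 1000 come from the changelog.

```c
/*
 * Hypothetical helper illustrating the changelog's formula:
 *
 *   penalty = (average rss of children) * (number of children) / thres
 *
 * The penalty applies only when the number of qualifying children exceeds
 * the threshold; otherwise it is zero.
 */
unsigned long forkbomb_penalty(unsigned long total_child_rss,
			       unsigned long nr_children,
			       unsigned long thres)
{
	unsigned long avg_rss;

	if (thres == 0 || nr_children <= thres)
		return 0;

	avg_rss = total_child_rss / nr_children;
	return avg_rss * nr_children / thres;
}
```

With the default threshold of 1000 and 2000 qualifying children averaging 100 units of rss each, `forkbomb_penalty(200000, 2000, 1000)` yields 200, twice the average rss, which matches the example given in the changelog.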

Today I tested this patch, but I could not observe it working.

Test procedure:
  prepare:  create a 500M memory cgroup
  console1: run memtoy (consumes 100M of memory)
  console2: run the forkbomb bash script ":(){ :|:& };:"

AFAIK, this is the most typical forkbomb; see
http://en.wikipedia.org/wiki/Fork_bomb

Each bash consumes about 100KB, and about 4000 bash processes consume the
remaining 400M.

The oom_score list is below.

1) Almost no bash process gets the forkbomb bonus at all.
2) At most, the root bash gets a 2x bonus, and its score changes from 90
   to 180. But memtoy (the 100MB process) has a score of 25840, so there
   is still a 143x difference in score.

    pid   uid total_vs anonrss(kb) filerss(kb)   oom_adj oom_score comm
-----------------------------------------------------------------------------------
[ 1865]     0     2880         448        1264 |       0       415 bash
[ 1887]     0    12076         284        1056 |       0       325 su
[ 1889]  1264     6313         992        1604 |       0       649 zsh
[ 1906]  1264    29317      102660         700 |       0     25840 memtoy
[ 2006]     0    26999         448        1376 |       0       442 bash
[ 2024]     0    36195         292        1160 |       0       352 su
[ 2025]  1268    26968         360        1380 |       0       435 bash
[ 5555]  1268    26968         364         300 |       0       166 bash
[ 5623]  1268    26968         364         300 |       0       166 bash
[ 5688]  1268    26968         364         300 |       0       166 bash
[ 5711]  1268    26968         364         300 |       0       166 bash
[ 5742]  1268    26968         364         300 |       0       166 bash
[ 5749]  1268    26968         364         300 |       0       166 bash
[ 5752]  1268    26968         364         388 |       0       188 bash
[ 5755]  1268    26968         364         300 |       0       166 bash
[ 5765]  1268    26968         364         300 |       0       166 bash
[ 5791]  1268    26968         364         300 |       0       166 bash
[ 5808]  1268    26968         364         300 |       0       166 bash
[ 5819]  1268    26968         364         324 |       0       172 bash
[ 5835]  1268    26968         364         300 |       0       166 bash
[ 5889]  1268    26968         364         300 |       0       166 bash
[ 5903]  1268    26968         364         300 |       0       166 bash
[ 5924]  1268    26968         364         424 |       0       197 bash
.....
(many more bash processes follow)
[10198]  1268    26968         368          20 |       0        97 bash
[10199]  1268    26968         368          20 |       0        97 bash
[10200]  1268    26968         368          20 |       0        97 bash
[10201]  1268    26968         368          20 |       0        97 bash
[10202]  1268    26968         368          20 |       0        97 bash
[10203]  1268    26968         368          20 |       0        97 bash
[10204]  1268    26968         368          20 |       0        97 bash
[10205]  1268    26968         368          20 |       0        97 bash
[10206]  1268    26968         368          20 |       0        97 bash
[10207]  1268    26968         364          20 |       0        96 bash
[10208]  1268    26968         364          20 |       0        96 bash
[10209]  1268    26968         368          20 |       0        97 bash
[10210]  1268    26968         368          20 |       0        97 bash
[10211]  1268    26968         368          20 |       0        97 bash
[10212]  1268    26968         368          20 |       0        97 bash
[10213]  1268    26968         368          20 |       0        97 bash
[10214]  1268    26968         368          20 |       0        97 bash
[10215]  1268    26968         368          20 |       0        97 bash
[10216]  1268    26968         368          20 |       0        97 bash
[10217]  1268    26968         368          20 |       0        97 bash
[10218]  1268    26968         368          20 |       0        97 bash
Memory cgroup out of memory: Kill process 1906 (memtoy) with score 25840 or sacrifice child
Killed process 1906 (memtoy) vsz:117268kB, anon-rss:102660kB, file-rss:700kB

At the very least, the patch author must define what problem is meant by
"forkbomb" in this description.

I did not pull this one.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ .
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail191.messagelabs.com (mail191.messagelabs.com [216.82.242.19]) by kanga.kvack.org (Postfix) with SMTP id 26BDD6B0210 for ; Tue, 8 Jun 2010 07:42:08 -0400 (EDT) Received: from m4.gw.fujitsu.co.jp ([10.0.50.74]) by fgwmail7.fujitsu.co.jp (Fujitsu Gateway) with ESMTP id o58Bg5lx008138 for (envelope-from kosaki.motohiro@jp.fujitsu.com); Tue, 8 Jun 2010 20:42:05 +0900 Received: from smail (m4 [127.0.0.1]) by outgoing.m4.gw.fujitsu.co.jp (Postfix) with ESMTP id B704A45DE70 for ; Tue, 8 Jun 2010 20:42:04 +0900 (JST) Received: from s4.gw.fujitsu.co.jp (s4.gw.fujitsu.co.jp [10.0.50.94]) by m4.gw.fujitsu.co.jp (Postfix) with ESMTP id 612EA45DE87 for ; Tue, 8 Jun 2010 20:42:04 +0900 (JST) Received: from s4.gw.fujitsu.co.jp (localhost.localdomain [127.0.0.1]) by s4.gw.fujitsu.co.jp (Postfix) with ESMTP id E5A441DB8043 for ; Tue, 8 Jun 2010 20:42:03 +0900 (JST) Received: from m106.s.css.fujitsu.com (m106.s.css.fujitsu.com [10.249.87.106]) by s4.gw.fujitsu.co.jp (Postfix) with ESMTP id 2A471E38010 for ; Tue, 8 Jun 2010 20:42:02 +0900 (JST) From: KOSAKI Motohiro Subject: Re: [patch 13/18] oom: remove special handling for pagefault ooms In-Reply-To: References: Message-Id: <20100608203659.7675.A69D9226@jp.fujitsu.com> MIME-Version: 1.0 Content-Type: text/plain; charset="ISO-2022-JP" Content-Transfer-Encoding: 7bit Date: Tue, 8 Jun 2010 20:42:01 +0900 (JST) Sender: owner-linux-mm@kvack.org To: David Rientjes Cc: kosaki.motohiro@jp.fujitsu.com, Andrew Morton , Rik van Riel , Nick Piggin , Oleg Nesterov , Balbir Singh , KAMEZAWA Hiroyuki , linux-mm@kvack.org List-ID: > It is possible to remove the special pagefault oom handler by simply oom > locking all system zones and then calling directly into out_of_memory(). 
> > All populated zones must have ZONE_OOM_LOCKED set, otherwise there is a > parallel oom killing in progress that will lead to eventual memory freeing > so it's not necessary to needlessly kill another task. The context in > which the pagefault is allocating memory is unknown to the oom killer, so > this is done on a system-wide level. > > If a task has already been oom killed and hasn't fully exited yet, this > will be a no-op since select_bad_process() recognizes tasks across the > system with TIF_MEMDIE set. > > Acked-by: Nick Piggin > Acked-by: KOSAKI Motohiro > Signed-off-by: David Rientjes > --- > mm/oom_kill.c | 86 +++++++++++++++++++++++++++++++++++++------------------- > 1 files changed, 57 insertions(+), 29 deletions(-) > > diff --git a/mm/oom_kill.c b/mm/oom_kill.c > --- a/mm/oom_kill.c > +++ b/mm/oom_kill.c > @@ -615,6 +615,44 @@ void clear_zonelist_oom(struct zonelist *zonelist, gfp_t gfp_mask) > } > > /* > + * Try to acquire the oom killer lock for all system zones. Returns zero if a > + * parallel oom killing is taking place, otherwise locks all zones and returns > + * non-zero. > + */ > +static int try_set_system_oom(void) > +{ > + struct zone *zone; > + int ret = 1; > + > + spin_lock(&zone_scan_lock); > + for_each_populated_zone(zone) > + if (zone_is_oom_locked(zone)) { > + ret = 0; > + goto out; > + } > + for_each_populated_zone(zone) > + zone_set_flag(zone, ZONE_OOM_LOCKED); > +out: > + spin_unlock(&zone_scan_lock); > + return ret; > +} > + > +/* > + * Clears ZONE_OOM_LOCKED for all system zones so that failed allocation > + * attempts or page faults may now recall the oom killer, if necessary. > + */ > +static void clear_system_oom(void) > +{ > + struct zone *zone; > + > + spin_lock(&zone_scan_lock); > + for_each_populated_zone(zone) > + zone_clear_flag(zone, ZONE_OOM_LOCKED); > + spin_unlock(&zone_scan_lock); > +} > + > + > +/* > * Must be called with tasklist_lock held for read. 
> */ > static void __out_of_memory(gfp_t gfp_mask, int order, > @@ -649,33 +687,6 @@ retry: > goto retry; > } > > -/* > - * pagefault handler calls into here because it is out of memory but > - * doesn't know exactly how or why. > - */ > -void pagefault_out_of_memory(void) > -{ > - unsigned long freed = 0; > - > - blocking_notifier_call_chain(&oom_notify_list, 0, &freed); > - if (freed > 0) > - /* Got some memory back in the last second. */ > - return; > - > - check_panic_on_oom(CONSTRAINT_NONE, 0, 0); > - read_lock(&tasklist_lock); > - /* unknown gfp_mask and order */ > - __out_of_memory(0, 0, CONSTRAINT_NONE, NULL); > - read_unlock(&tasklist_lock); > - > - /* > - * Give "p" a good chance of killing itself before we > - * retry to allocate memory. > - */ > - if (!test_thread_flag(TIF_MEMDIE)) > - schedule_timeout_uninterruptible(1); > -} > - > /** > * out_of_memory - kill the "best" process when we run out of memory > * @zonelist: zonelist pointer > @@ -692,7 +703,7 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask, > int order, nodemask_t *nodemask) > { > unsigned long freed = 0; > - enum oom_constraint constraint; > + enum oom_constraint constraint = CONSTRAINT_NONE; > > blocking_notifier_call_chain(&oom_notify_list, 0, &freed); > if (freed > 0) > @@ -713,7 +724,8 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask, > * Check if there were limitations on the allocation (only relevant for > * NUMA) that may require different handling. 
> */ > - constraint = constrained_alloc(zonelist, gfp_mask, nodemask); > + if (zonelist) > + constraint = constrained_alloc(zonelist, gfp_mask, nodemask); > check_panic_on_oom(constraint, gfp_mask, order); > read_lock(&tasklist_lock); > __out_of_memory(gfp_mask, order, constraint, nodemask); > @@ -726,3 +738,19 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask, > if (!test_thread_flag(TIF_MEMDIE)) > schedule_timeout_uninterruptible(1); > } > + > +/* > + * The pagefault handler calls here because it is out of memory, so kill a > + * memory-hogging task. If a populated zone has ZONE_OOM_LOCKED set, a parallel > + * oom killing is already in progress so do nothing. If a task is found with > + * TIF_MEMDIE set, it has been killed so do nothing and allow it to exit. > + */ > +void pagefault_out_of_memory(void) > +{ > + if (try_set_system_oom()) { > + out_of_memory(NULL, 0, 0, NULL); > + clear_system_oom(); > + } > + if (!test_thread_flag(TIF_MEMDIE)) > + schedule_timeout_uninterruptible(1); > +} this one is already there in my patch kit. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
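The all-or-nothing zone locking that patch 13/18 introduces with try_set_system_oom() and clear_system_oom() can be modelled in userspace C roughly as follows. This is a single-threaded sketch under stated assumptions, not the kernel code: `NR_ZONES`, `zone_oom_locked`, `try_lock_all_zones()`, and `unlock_all_zones()` are hypothetical names, and the zone_scan_lock spinlock that serializes both loops in the patch is omitted.

```c
#include <stdbool.h>
#include <stddef.h>

#define NR_ZONES 4	/* hypothetical number of populated zones */

/* Hypothetical stand-in for the per-zone ZONE_OOM_LOCKED flag. */
bool zone_oom_locked[NR_ZONES];

/*
 * Returns false without changing anything if any zone is already locked,
 * which models a parallel oom kill in progress. Otherwise locks every zone
 * and returns true.
 */
bool try_lock_all_zones(void)
{
	size_t i;

	for (i = 0; i < NR_ZONES; i++)
		if (zone_oom_locked[i])
			return false;

	for (i = 0; i < NR_ZONES; i++)
		zone_oom_locked[i] = true;
	return true;
}

/* Clears the flag on every zone so the oom killer may be called again. */
void unlock_all_zones(void)
{
	size_t i;

	for (i = 0; i < NR_ZONES; i++)
		zone_oom_locked[i] = false;
}
```

Checking every zone before setting any flag is what makes the operation all-or-nothing: a caller that finds even one zone locked backs off without leaving other zones marked.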
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail144.messagelabs.com (mail144.messagelabs.com [216.82.254.51]) by kanga.kvack.org (Postfix) with SMTP id 7394C6B01E3 for ; Tue, 8 Jun 2010 07:42:08 -0400 (EDT) Received: from m4.gw.fujitsu.co.jp ([10.0.50.74]) by fgwmail6.fujitsu.co.jp (Fujitsu Gateway) with ESMTP id o58Bg6Rh012501 for (envelope-from kosaki.motohiro@jp.fujitsu.com); Tue, 8 Jun 2010 20:42:06 +0900 Received: from smail (m4 [127.0.0.1]) by outgoing.m4.gw.fujitsu.co.jp (Postfix) with ESMTP id 3673845DE7A for ; Tue, 8 Jun 2010 20:42:05 +0900 (JST) Received: from s4.gw.fujitsu.co.jp (s4.gw.fujitsu.co.jp [10.0.50.94]) by m4.gw.fujitsu.co.jp (Postfix) with ESMTP id 086C945DE80 for ; Tue, 8 Jun 2010 20:42:05 +0900 (JST) Received: from s4.gw.fujitsu.co.jp (localhost.localdomain [127.0.0.1]) by s4.gw.fujitsu.co.jp (Postfix) with ESMTP id 99681E3800B for ; Tue, 8 Jun 2010 20:42:04 +0900 (JST) Received: from m106.s.css.fujitsu.com (m106.s.css.fujitsu.com [10.249.87.106]) by s4.gw.fujitsu.co.jp (Postfix) with ESMTP id CC811E38014 for ; Tue, 8 Jun 2010 20:42:02 +0900 (JST) From: KOSAKI Motohiro Subject: Re: [patch 14/18] oom: move sysctl declarations to oom.h In-Reply-To: References: Message-Id: <20100608203733.7678.A69D9226@jp.fujitsu.com> MIME-Version: 1.0 Content-Type: text/plain; charset="ISO-2022-JP" Content-Transfer-Encoding: 7bit Date: Tue, 8 Jun 2010 20:42:01 +0900 (JST) Sender: owner-linux-mm@kvack.org To: David Rientjes Cc: kosaki.motohiro@jp.fujitsu.com, Andrew Morton , Rik van Riel , Nick Piggin , Oleg Nesterov , Balbir Singh , KAMEZAWA Hiroyuki , linux-mm@kvack.org List-ID: > The three oom killer sysctl variables (sysctl_oom_dump_tasks, > sysctl_oom_kill_allocating_task, and sysctl_panic_on_oom) are better > declared in include/linux/oom.h rather than kernel/sysctl.c. 
> > Acked-by: KOSAKI Motohiro > Signed-off-by: David Rientjes > --- > include/linux/oom.h | 5 +++++ > kernel/sysctl.c | 4 +--- > 2 files changed, 6 insertions(+), 3 deletions(-) > > diff --git a/include/linux/oom.h b/include/linux/oom.h > --- a/include/linux/oom.h > +++ b/include/linux/oom.h > @@ -44,5 +44,10 @@ static inline void oom_killer_enable(void) > { > oom_killer_disabled = false; > } > + > +/* sysctls */ > +extern int sysctl_oom_dump_tasks; > +extern int sysctl_oom_kill_allocating_task; > +extern int sysctl_panic_on_oom; > #endif /* __KERNEL__*/ > #endif /* _INCLUDE_LINUX_OOM_H */ > diff --git a/kernel/sysctl.c b/kernel/sysctl.c > --- a/kernel/sysctl.c > +++ b/kernel/sysctl.c > @@ -55,6 +55,7 @@ > #include > #include > #include > +#include > > #include > #include > @@ -87,9 +88,6 @@ > /* External variables not in a header file. */ > extern int sysctl_overcommit_memory; > extern int sysctl_overcommit_ratio; > -extern int sysctl_panic_on_oom; > -extern int sysctl_oom_kill_allocating_task; > -extern int sysctl_oom_dump_tasks; > extern int max_threads; > extern int core_uses_pid; > extern int suid_dumpable; pulled. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail191.messagelabs.com (mail191.messagelabs.com [216.82.242.19]) by kanga.kvack.org (Postfix) with ESMTP id A46E76B01C4 for ; Tue, 8 Jun 2010 14:47:45 -0400 (EDT) Received: from kpbe12.cbf.corp.google.com (kpbe12.cbf.corp.google.com [172.25.105.76]) by smtp-out.google.com with ESMTP id o58IlOIN008371 for ; Tue, 8 Jun 2010 11:47:24 -0700 Received: from pxi19 (pxi19.prod.google.com [10.243.27.19]) by kpbe12.cbf.corp.google.com with ESMTP id o58IlJao015160 for ; Tue, 8 Jun 2010 11:47:23 -0700 Received: by pxi19 with SMTP id 19so2324335pxi.17 for ; Tue, 08 Jun 2010 11:47:23 -0700 (PDT) Date: Tue, 8 Jun 2010 11:47:17 -0700 (PDT) From: David Rientjes Subject: Re: [patch 05/18] oom: give current access to memory reserves if it has been killed In-Reply-To: <20100608203216.765D.A69D9226@jp.fujitsu.com> Message-ID: References: <20100608203216.765D.A69D9226@jp.fujitsu.com> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-linux-mm@kvack.org To: KOSAKI Motohiro Cc: Andrew Morton , Rik van Riel , Nick Piggin , Oleg Nesterov , Balbir Singh , KAMEZAWA Hiroyuki , linux-mm@kvack.org List-ID: On Tue, 8 Jun 2010, KOSAKI Motohiro wrote: > > It's possible to livelock the page allocator if a thread has mm->mmap_sem > > and fails to make forward progress because the oom killer selects another > > thread sharing the same ->mm to kill that cannot exit until the semaphore > > is dropped. > > > > The oom killer will not kill multiple tasks at the same time; each oom > > killed task must exit before another task may be killed. Thus, if one > > thread is holding mm->mmap_sem and cannot allocate memory, all threads > > sharing the same ->mm are blocked from exiting as well. 
In the oom kill > > case, that means the thread holding mm->mmap_sem will never free > > additional memory since it cannot get access to memory reserves and the > > thread that depends on it with access to memory reserves cannot exit > > because it cannot acquire the semaphore. Thus, the page allocators > > livelocks. > > > > When the oom killer is called and current happens to have a pending > > SIGKILL, this patch automatically gives it access to memory reserves and > > returns. Upon returning to the page allocator, its allocation will > > hopefully succeed so it can quickly exit and free its memory. If not, the > > page allocator will fail the allocation if it is not __GFP_NOFAIL. > > > > Acked-by: KOSAKI Motohiro > > Reviewed-by: KAMEZAWA Hiroyuki > > Signed-off-by: David Rientjes > > --- > > mm/oom_kill.c | 10 ++++++++++ > > 1 files changed, 10 insertions(+), 0 deletions(-) > > > > diff --git a/mm/oom_kill.c b/mm/oom_kill.c > > --- a/mm/oom_kill.c > > +++ b/mm/oom_kill.c > > @@ -650,6 +650,16 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask, > > /* Got some memory back in the last second. */ > > return; > > > > + /* > > + * If current has a pending SIGKILL, then automatically select it. The > > + * goal is to allow it to allocate so that it may quickly exit and free > > + * its memory. > > + */ > > + if (fatal_signal_pending(current)) { > > + set_thread_flag(TIF_MEMDIE); > > + return; > > + } > > + > > if (sysctl_panic_on_oom == 2) { > > dump_header(NULL, gfp_mask, order, NULL); > > panic("out of memory. Compulsory panic_on_oom is selected.\n"); > > Sorry, I had found this patch works incorrect. I don't pulled. > You're taking back your ack? Why does this not work? It's not killing a potentially immune task, the task is already dying. We're simply giving it access to memory reserves so that it may quickly exit and die. 
OOM_DISABLE does not imply that a task cannot exit on its own or be killed by another application or user, we simply don't want to needlessly kill another task when current is dying in the first place without being able to allocate memory. Please reconsider your thought. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail138.messagelabs.com (mail138.messagelabs.com [216.82.249.35]) by kanga.kvack.org (Postfix) with ESMTP id 0C6426B01CA for ; Tue, 8 Jun 2010 14:48:46 -0400 (EDT) Received: from kpbe17.cbf.corp.google.com (kpbe17.cbf.corp.google.com [172.25.105.81]) by smtp-out.google.com with ESMTP id o58Imhg7022973 for ; Tue, 8 Jun 2010 11:48:43 -0700 Received: from pxi19 (pxi19.prod.google.com [10.243.27.19]) by kpbe17.cbf.corp.google.com with ESMTP id o58ImgUs026553 for ; Tue, 8 Jun 2010 11:48:42 -0700 Received: by pxi19 with SMTP id 19so1834574pxi.3 for ; Tue, 08 Jun 2010 11:48:41 -0700 (PDT) Date: Tue, 8 Jun 2010 11:48:37 -0700 (PDT) From: David Rientjes Subject: Re: [patch 06/18] oom: avoid sending exiting tasks a SIGKILL In-Reply-To: <20100608203250.7660.A69D9226@jp.fujitsu.com> Message-ID: References: <20100608203250.7660.A69D9226@jp.fujitsu.com> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-linux-mm@kvack.org To: KOSAKI Motohiro Cc: Andrew Morton , Rik van Riel , Nick Piggin , Oleg Nesterov , Balbir Singh , KAMEZAWA Hiroyuki , linux-mm@kvack.org List-ID: On Tue, 8 Jun 2010, KOSAKI Motohiro wrote: > > It's unnecessary to SIGKILL a task that is already PF_EXITING and can > > actually cause a NULL pointer dereference of the sighand if it has already > > been detached. Instead, simply set TIF_MEMDIE so it has access to memory > > reserves and can quickly exit as the comment implies. 
> > > > Reviewed-by: KAMEZAWA Hiroyuki > > Signed-off-by: David Rientjes > > --- > > mm/oom_kill.c | 2 +- > > 1 files changed, 1 insertions(+), 1 deletions(-) > > > > diff --git a/mm/oom_kill.c b/mm/oom_kill.c > > --- a/mm/oom_kill.c > > +++ b/mm/oom_kill.c > > @@ -458,7 +458,7 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order, > > * its children or threads, just set TIF_MEMDIE so it can die quickly > > */ > > if (p->flags & PF_EXITING) { > > - __oom_kill_task(p, 0); > > + set_tsk_thread_flag(p, TIF_MEMDIE); > > return 0; > > } > > > > I don't pulled PF_EXITING related thing. > What are you pulling? You're not a maintainer! -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail143.messagelabs.com (mail143.messagelabs.com [216.82.254.35]) by kanga.kvack.org (Postfix) with ESMTP id 9BFD26B01D1 for ; Tue, 8 Jun 2010 14:51:52 -0400 (EDT) Received: from wpaz29.hot.corp.google.com (wpaz29.hot.corp.google.com [172.24.198.93]) by smtp-out.google.com with ESMTP id o58IpkUh026908 for ; Tue, 8 Jun 2010 11:51:47 -0700 Received: from pwi6 (pwi6.prod.google.com [10.241.219.6]) by wpaz29.hot.corp.google.com with ESMTP id o58IpUEC024526 for ; Tue, 8 Jun 2010 11:51:45 -0700 Received: by pwi6 with SMTP id 6so1833113pwi.28 for ; Tue, 08 Jun 2010 11:51:45 -0700 (PDT) Date: Tue, 8 Jun 2010 11:51:32 -0700 (PDT) From: David Rientjes Subject: Re: [patch 07/18] oom: filter tasks not sharing the same cpuset In-Reply-To: <20100608203342.7663.A69D9226@jp.fujitsu.com> Message-ID: References: <20100608203342.7663.A69D9226@jp.fujitsu.com> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-linux-mm@kvack.org To: KOSAKI Motohiro , Andrew Morton Cc: Rik van Riel , Nick Piggin , Oleg Nesterov , Balbir Singh , KAMEZAWA Hiroyuki , linux-mm@kvack.org 
List-ID: On Tue, 8 Jun 2010, KOSAKI Motohiro wrote: > > diff --git a/mm/oom_kill.c b/mm/oom_kill.c > > --- a/mm/oom_kill.c > > +++ b/mm/oom_kill.c > > @@ -184,14 +184,6 @@ unsigned long badness(struct task_struct *p, unsigned long uptime) > > points /= 4; > > > > /* > > - * If p's nodes don't overlap ours, it may still help to kill p > > - * because p may have allocated or otherwise mapped memory on > > - * this node before. However it will be less likely. > > - */ > > - if (!has_intersects_mems_allowed(p)) > > - points /= 8; > > - > > - /* > > * Adjust the score by oom_adj. > > */ > > if (oom_adj) { > > @@ -277,6 +269,8 @@ static struct task_struct *select_bad_process(unsigned long *ppoints, > > continue; > > if (mem && !task_in_mem_cgroup(p, mem)) > > continue; > > + if (!has_intersects_mems_allowed(p)) > > + continue; > > > > /* > > * This task already has access to memory reserves and is > > pulled. but I'll merge my fix. and append historical remark. > Andrew, are you the maintainer for these fixes or is KOSAKI? I've been posting this particular patch for at least three months with five acks: Acked-by: Rik van Riel Acked-by: Nick Piggin Acked-by: Balbir Singh Acked-by: KOSAKI Motohiro Reviewed-by: KAMEZAWA Hiroyuki and now he's saying he'll merge his own fix and rewrite the changelog and pull it? -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
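The behavioural change in patch 07 can be modelled in userspace. What follows is a minimal sketch, not kernel code: `struct toy_task`, `toy_intersects()` and `toy_select()` are hypothetical stand-ins, and the real select_bad_process() applies several other checks that are left out here. With `filter == 0` a task whose nodes do not overlap keeps one eighth of its score, as the removed badness() hunk did; with `filter != 0` it is skipped outright, as the added `continue` does.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for task_struct; only what the sketch needs. */
struct toy_task {
    unsigned long mems_allowed; /* bitmask of memory nodes the task may use */
    unsigned long points;       /* badness score before any cpuset handling */
};

/* Models has_intersects_mems_allowed(): do the two node masks overlap? */
static int toy_intersects(const struct toy_task *p, unsigned long cur_mems)
{
    return (p->mems_allowed & cur_mems) != 0;
}

/*
 * filter == 0: old behaviour, a non-overlapping task keeps 1/8 of its score.
 * filter != 0: patched behaviour, a non-overlapping task is never a candidate.
 */
static const struct toy_task *toy_select(const struct toy_task *tasks, size_t n,
                                         unsigned long cur_mems, int filter)
{
    const struct toy_task *chosen = NULL;
    unsigned long best = 0;

    assert(tasks != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        unsigned long points = tasks[i].points;

        if (!toy_intersects(&tasks[i], cur_mems)) {
            if (filter)
                continue;       /* patched: not a candidate at all */
            points /= 8;        /* old: penalised but still eligible */
        }
        if (points > best) {
            best = points;
            chosen = &tasks[i];
        }
    }
    return chosen;
}
```

With one large task confined to another node and one small task on the allocating node, the old rule can still pick the large task even though killing it may free nothing useful, while the patched rule picks the small local one.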
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail172.messagelabs.com (mail172.messagelabs.com [216.82.254.3]) by kanga.kvack.org (Postfix) with ESMTP id 323AC6B01D2 for ; Tue, 8 Jun 2010 14:53:34 -0400 (EDT) Received: from wpaz13.hot.corp.google.com (wpaz13.hot.corp.google.com [172.24.198.77]) by smtp-out.google.com with ESMTP id o58IrUAL029234 for ; Tue, 8 Jun 2010 11:53:30 -0700 Received: from pxi1 (pxi1.prod.google.com [10.243.27.1]) by wpaz13.hot.corp.google.com with ESMTP id o58IpuS1020160 for ; Tue, 8 Jun 2010 11:53:29 -0700 Received: by pxi1 with SMTP id 1so2329792pxi.8 for ; Tue, 08 Jun 2010 11:53:29 -0700 (PDT) Date: Tue, 8 Jun 2010 11:53:17 -0700 (PDT) From: David Rientjes Subject: Re: [patch 08/18] oom: sacrifice child with highest badness score for parent In-Reply-To: <20100608203443.7666.A69D9226@jp.fujitsu.com> Message-ID: References: <20100608203443.7666.A69D9226@jp.fujitsu.com> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-linux-mm@kvack.org To: KOSAKI Motohiro Cc: Andrew Morton , Rik van Riel , Nick Piggin , Oleg Nesterov , Balbir Singh , KAMEZAWA Hiroyuki , linux-mm@kvack.org List-ID: On Tue, 8 Jun 2010, KOSAKI Motohiro wrote: > > diff --git a/mm/oom_kill.c b/mm/oom_kill.c > > --- a/mm/oom_kill.c > > +++ b/mm/oom_kill.c > > @@ -441,8 +441,11 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order, > > unsigned long points, struct mem_cgroup *mem, > > const char *message) > > { > > + struct task_struct *victim = p; > > struct task_struct *c; > > struct task_struct *t = p; > > + unsigned long victim_points = 0; > > + struct timespec uptime; > > > > if (printk_ratelimit()) > > dump_header(p, gfp_mask, order, mem); > > @@ -456,22 +459,30 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order, > > return 0; > > } > > > > - printk(KERN_ERR "%s: kill process %d (%s) score %li or a child\n", > > - message, task_pid_nr(p), 
p->comm, points); > > + pr_err("%s: Kill process %d (%s) score %lu or sacrifice child\n", > > + message, task_pid_nr(p), p->comm, points); > > > > - /* Try to kill a child first */ > > + /* Try to sacrifice the worst child first */ > > + do_posix_clock_monotonic_gettime(&uptime); > > do { > > + unsigned long cpoints; > > + > > list_for_each_entry(c, &t->children, sibling) { > > if (c->mm == p->mm) > > continue; > > if (mem && !task_in_mem_cgroup(c, mem)) > > continue; > > - if (!oom_kill_task(c)) > > - return 0; > > + > > + /* badness() returns 0 if the thread is unkillable */ > > + cpoints = badness(c, uptime.tv_sec); > > + if (cpoints > victim_points) { > > + victim = c; > > + victim_points = cpoints; > > + } > > } > > } while_each_thread(p, t); > > > > - return oom_kill_task(p); > > + return oom_kill_task(victim); > > } > > > > #ifdef CONFIG_CGROUP_MEM_RES_CTLR > > better version already is there in my patch kit. > Would you like to review this one? -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
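The selection rule introduced by patch 08 can be sketched in userspace as follows. This is an illustration under stated assumptions, not the kernel implementation: `struct toy_task` and `toy_pick_victim()` are hypothetical, `points` stands in for the value badness() would return, and the children are given as one flat array rather than being gathered from every thread of the parent.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for task_struct. */
struct toy_task {
    int pid;
    int mm_id;            /* tasks with equal mm_id share an address space */
    unsigned long points; /* stand-in for badness(); 0 means unkillable */
};

/*
 * Return the child with the highest score that does not share the parent's
 * mm, or the parent itself when no child scores above zero.
 */
static const struct toy_task *toy_pick_victim(const struct toy_task *parent,
                                              const struct toy_task *children,
                                              size_t n)
{
    const struct toy_task *victim = parent;
    unsigned long victim_points = 0;

    assert(parent != NULL);
    for (size_t i = 0; i < n; i++) {
        const struct toy_task *c = &children[i];

        if (c->mm_id == parent->mm_id)
            continue;     /* killing it would take the parent's memory too */
        if (c->points > victim_points) {
            victim = c;
            victim_points = c->points;
        }
    }
    return victim;
}
```

The old loop killed the first child it could; the sketch shows why the patched loop has to visit every child before deciding.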
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail138.messagelabs.com (mail138.messagelabs.com [216.82.249.35]) by kanga.kvack.org (Postfix) with ESMTP id 5C4CD6B01D4 for ; Tue, 8 Jun 2010 14:56:40 -0400 (EDT) Received: from wpaz37.hot.corp.google.com (wpaz37.hot.corp.google.com [172.24.198.101]) by smtp-out.google.com with ESMTP id o58IubQ7001933 for ; Tue, 8 Jun 2010 11:56:37 -0700 Received: from pwj9 (pwj9.prod.google.com [10.241.219.73]) by wpaz37.hot.corp.google.com with ESMTP id o58IuGt8013391 for ; Tue, 8 Jun 2010 11:56:36 -0700 Received: by pwj9 with SMTP id 9so431887pwj.32 for ; Tue, 08 Jun 2010 11:56:36 -0700 (PDT) Date: Tue, 8 Jun 2010 11:56:30 -0700 (PDT) From: David Rientjes Subject: Re: [patch 10/18] oom: enable oom tasklist dump by default In-Reply-To: <20100608203540.766C.A69D9226@jp.fujitsu.com> Message-ID: References: <20100608203540.766C.A69D9226@jp.fujitsu.com> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-linux-mm@kvack.org To: KOSAKI Motohiro Cc: Andrew Morton , Rik van Riel , Nick Piggin , Oleg Nesterov , Balbir Singh , KAMEZAWA Hiroyuki , linux-mm@kvack.org List-ID: On Tue, 8 Jun 2010, KOSAKI Motohiro wrote: > > diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt > > --- a/Documentation/sysctl/vm.txt > > +++ b/Documentation/sysctl/vm.txt > > @@ -511,7 +511,7 @@ information may not be desired. > > If this is set to non-zero, this information is shown whenever the > > OOM killer actually kills a memory-hogging task. > > > > -The default value is 0. > > +The default value is 1 (enabled). 
> > > > ============================================================== > > > > diff --git a/mm/oom_kill.c b/mm/oom_kill.c > > index ef048c1..833de48 100644 > > --- a/mm/oom_kill.c > > +++ b/mm/oom_kill.c > > @@ -32,7 +32,7 @@ > > > > int sysctl_panic_on_oom; > > int sysctl_oom_kill_allocating_task; > > -int sysctl_oom_dump_tasks; > > +int sysctl_oom_dump_tasks = 1; > > static DEFINE_SPINLOCK(zone_scan_lock); > > /* #define DEBUG */ > > > > pulled. > What the heck? You're not a maintainer, what are you pulling? -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail202.messagelabs.com (mail202.messagelabs.com [216.82.254.227]) by kanga.kvack.org (Postfix) with ESMTP id 594FF6B01D7 for ; Tue, 8 Jun 2010 14:57:54 -0400 (EDT) Received: from hpaq13.eem.corp.google.com (hpaq13.eem.corp.google.com [172.25.149.13]) by smtp-out.google.com with ESMTP id o58IvlUg019520 for ; Tue, 8 Jun 2010 11:57:47 -0700 Received: from pxi10 (pxi10.prod.google.com [10.243.27.10]) by hpaq13.eem.corp.google.com with ESMTP id o58IvjdL007721 for ; Tue, 8 Jun 2010 11:57:46 -0700 Received: by pxi10 with SMTP id 10so2742822pxi.7 for ; Tue, 08 Jun 2010 11:57:45 -0700 (PDT) Date: Tue, 8 Jun 2010 11:57:38 -0700 (PDT) From: David Rientjes Subject: Re: [patch 13/18] oom: remove special handling for pagefault ooms In-Reply-To: <20100608203659.7675.A69D9226@jp.fujitsu.com> Message-ID: References: <20100608203659.7675.A69D9226@jp.fujitsu.com> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-linux-mm@kvack.org To: KOSAKI Motohiro Cc: Andrew Morton , Rik van Riel , Nick Piggin , Oleg Nesterov , Balbir Singh , KAMEZAWA Hiroyuki , linux-mm@kvack.org List-ID: On Tue, 8 Jun 2010, KOSAKI Motohiro wrote: > this one is already there in my patch kit. 
> I think you need a reality check in your position as a kernel hacker and not a kernel maintainer. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail137.messagelabs.com (mail137.messagelabs.com [216.82.249.19]) by kanga.kvack.org (Postfix) with ESMTP id 2963E6B01D8 for ; Tue, 8 Jun 2010 15:00:13 -0400 (EDT) Received: from wpaz17.hot.corp.google.com (wpaz17.hot.corp.google.com [172.24.198.81]) by smtp-out.google.com with ESMTP id o58J0AqP023484 for ; Tue, 8 Jun 2010 12:00:10 -0700 Received: from pxi12 (pxi12.prod.google.com [10.243.27.12]) by wpaz17.hot.corp.google.com with ESMTP id o58J09l0005274 for ; Tue, 8 Jun 2010 12:00:09 -0700 Received: by pxi12 with SMTP id 12so1975983pxi.28 for ; Tue, 08 Jun 2010 12:00:09 -0700 (PDT) Date: Tue, 8 Jun 2010 12:00:04 -0700 (PDT) From: David Rientjes Subject: Re: [patch 18/18] oom: deprecate oom_adj tunable In-Reply-To: <20100608194514.7654.A69D9226@jp.fujitsu.com> Message-ID: References: <20100608194514.7654.A69D9226@jp.fujitsu.com> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-linux-mm@kvack.org To: KOSAKI Motohiro Cc: Andrew Morton , Rik van Riel , Nick Piggin , Oleg Nesterov , Balbir Singh , KAMEZAWA Hiroyuki , linux-mm@kvack.org List-ID: On Tue, 8 Jun 2010, KOSAKI Motohiro wrote: > > + /* > > + * Warn that /proc/pid/oom_adj is deprecated, see > > + * Documentation/feature-removal-schedule.txt. > > + */ > > + printk_once(KERN_WARNING "%s (%d): /proc/%d/oom_adj is deprecated, " > > + "please use /proc/%d/oom_score_adj instead.\n", > > + current->comm, task_pid_nr(current), > > + task_pid_nr(task), task_pid_nr(task)); > > task->signal->oom_adj = oom_adjust; > > Sorry, we can't accept this. oom_adj is one of the most frequently used > tuning knobs. Putting this warning in will cause a lot of confusion. > We? 
Who are you representing? The deprecation of this tunable was suggested by Andrew since it is replaced with a more powerful and finer-grained tunable, oom_score_adj. The deprecation date is two years from now, which gives plenty of opportunity for users to use the new, well-documented interface. > In addition, this knob is used by some applications (please search > with Google Code Search or something else). That is, an end user can't > stop the warning, which will cause a lot of frustration. NO. > They can report it over the two-year period and hopefully get it fixed up; this isn't a BUG(), it's a printk_once(). From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail190.messagelabs.com (mail190.messagelabs.com [216.82.249.51]) by kanga.kvack.org (Postfix) with ESMTP id 64F0A6B01D4 for ; Tue, 8 Jun 2010 15:27:55 -0400 (EDT) Date: Tue, 8 Jun 2010 12:27:40 -0700 From: Andrew Morton Subject: Re: [patch 07/18] oom: filter tasks not sharing the same cpuset Message-Id: <20100608122740.8f045c78.akpm@linux-foundation.org> In-Reply-To: References: <20100608203342.7663.A69D9226@jp.fujitsu.com> Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org To: David Rientjes Cc: KOSAKI Motohiro , Rik van Riel , Nick Piggin , Oleg Nesterov , Balbir Singh , KAMEZAWA Hiroyuki , linux-mm@kvack.org List-ID: On Tue, 8 Jun 2010 11:51:32 -0700 (PDT) David Rientjes wrote: > Andrew, are you the maintainer for these fixes or is KOSAKI? I am, thanks. Kosaki-san, you're making this harder than it should be. Please either ack David's patches or promptly work with him on finalising them. I realise that you have additional oom-killer patches but it's too complex to try to work on two patch series concurrently.
So let's concentrate on getting David's work sorted out and merged and then please rebase yours on the result. I certainly don't have the time or inclination to go through two patchsets and work out what the similarities and differences are so I'll be concentrating on David's ones first. The order in which we do this doesn't really matter. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail138.messagelabs.com (mail138.messagelabs.com [216.82.249.35]) by kanga.kvack.org (Postfix) with ESMTP id 1FDEF6B01D6 for ; Tue, 8 Jun 2010 15:33:31 -0400 (EDT) Date: Tue, 8 Jun 2010 12:33:20 -0700 From: Andrew Morton Subject: Re: [patch 01/18] oom: check PF_KTHREAD instead of !mm to skip kthreads Message-Id: <20100608123320.11e501a4.akpm@linux-foundation.org> In-Reply-To: References: Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org To: David Rientjes Cc: Rik van Riel , Nick Piggin , Oleg Nesterov , Balbir Singh , KAMEZAWA Hiroyuki , KOSAKI Motohiro , linux-mm@kvack.org List-ID: On Sun, 6 Jun 2010 15:34:00 -0700 (PDT) David Rientjes wrote: > From: Oleg Nesterov > > select_bad_process() thinks a kernel thread can't have ->mm != NULL, this > is not true due to use_mm(). > > Change the code to check PF_KTHREAD. > > Reviewed-by: KAMEZAWA Hiroyuki > Signed-off-by: Oleg Nesterov > Signed-off-by: David Rientjes > --- > mm/oom_kill.c | 9 +++------ > 1 files changed, 3 insertions(+), 6 deletions(-) > > diff --git a/mm/oom_kill.c b/mm/oom_kill.c > --- a/mm/oom_kill.c > +++ b/mm/oom_kill.c > @@ -256,14 +256,11 @@ static struct task_struct *select_bad_process(unsigned long *ppoints, > for_each_process(p) { > unsigned long points; > > - /* > - * skip kernel threads and tasks which have already released > - * their mm. 
> - */ > + /* skip tasks that have already released their mm */ > if (!p->mm) > continue; > - /* skip the init task */ > - if (is_global_init(p)) > + /* skip the init task and kthreads */ > + if (is_global_init(p) || (p->flags & PF_KTHREAD)) > continue; > if (mem && !task_in_mem_cgroup(p, mem)) > continue; Applied, thanks. A minor bugfix. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail203.messagelabs.com (mail203.messagelabs.com [216.82.254.243]) by kanga.kvack.org (Postfix) with ESMTP id C613B6B01DB for ; Tue, 8 Jun 2010 15:42:54 -0400 (EDT) Date: Tue, 8 Jun 2010 12:42:46 -0700 From: Andrew Morton Subject: Re: [patch 02/18] oom: introduce find_lock_task_mm() to fix !mm false positives Message-Id: <20100608124246.9258ccab.akpm@linux-foundation.org> In-Reply-To: References: Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org To: David Rientjes Cc: Rik van Riel , Nick Piggin , Oleg Nesterov , Balbir Singh , KAMEZAWA Hiroyuki , KOSAKI Motohiro , linux-mm@kvack.org List-ID: On Sun, 6 Jun 2010 15:34:03 -0700 (PDT) David Rientjes wrote: > From: Oleg Nesterov > > Almost all ->mm == NULL checks in oom_kill.c are wrong. > > The current code assumes that the task without ->mm has already > released its memory and ignores the process. However this is not > necessarily true when this process is multithreaded, other live > sub-threads can use this ->mm. > > - Remove the "if (!p->mm)" check in select_bad_process(), it is > just wrong. > > - Add the new helper, find_lock_task_mm(), which finds the live > thread which uses the memory and takes task_lock() to pin ->mm > > - change oom_badness() to use this helper instead of just checking > ->mm != NULL. 
> > - As David pointed out, select_bad_process() must never choose the > task without ->mm, but no matter what oom_badness() returns the > task can be chosen if nothing else has been found yet. > > Change oom_badness() to return int, change it to return -1 if > find_lock_task_mm() fails, and change select_bad_process() to > check points >= 0. > > Note! This patch is not enough, we need more changes. > > - oom_badness() was fixed, but oom_kill_task() still ignores > the task without ->mm > > - oom_forkbomb_penalty() should use find_lock_task_mm() too, > and it also needs other changes to actually find the first > first-descendant children > > This will be addressed later. > > [kosaki.motohiro@jp.fujitsu.com: use in badness(), __oom_kill_task()] > Signed-off-by: Oleg Nesterov > Signed-off-by: David Rientjes I assume from the above that we should have a Signed-off-by:kosaki here. I didn't make that change yet - please advise. > mm/oom_kill.c | 74 +++++++++++++++++++++++++++++++++------------------------ > 1 files changed, 43 insertions(+), 31 deletions(-) > > diff --git a/mm/oom_kill.c b/mm/oom_kill.c > --- a/mm/oom_kill.c > +++ b/mm/oom_kill.c > @@ -52,6 +52,20 @@ static int has_intersects_mems_allowed(struct task_struct *tsk) > return 0; > } > > +static struct task_struct *find_lock_task_mm(struct task_struct *p) > +{ > + struct task_struct *t = p; > + > + do { > + task_lock(t); > + if (likely(t->mm)) > + return t; > + task_unlock(t); > + } while_each_thread(p, t); > + > + return NULL; > +} What pins `p'? Ah, caller must hold tasklist_lock. 
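The locking contract of the helper quoted above (return the first thread of the group that still has an ->mm, with that thread's lock held, or NULL with nothing held) can be modelled in userspace. This is a sketch only: `struct toy_thread`, `toy_lock()`, `toy_unlock()` and `toy_find_lock_task_mm()` are hypothetical, the lock is a plain counter rather than task_lock(), and there is no concurrency, so the tasklist_lock requirement noted above has no counterpart here.

```c
#include <assert.h>
#include <stddef.h>

struct toy_mm {
    unsigned long total_vm;
};

/* Hypothetical stand-in for one thread of a thread group. */
struct toy_thread {
    struct toy_mm *mm; /* NULL once the thread has released its mm */
    int locked;        /* models task_lock() depth; 0 means unlocked */
};

static void toy_lock(struct toy_thread *t)   { t->locked++; }
static void toy_unlock(struct toy_thread *t) { t->locked--; }

/*
 * Walk the group and return the first thread that still has an mm.
 * The returned thread is left locked; the caller must unlock it.
 * Threads without an mm are unlocked again before moving on.
 */
static struct toy_thread *toy_find_lock_task_mm(struct toy_thread *group,
                                                size_t n)
{
    assert(group != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        struct toy_thread *t = &group[i];

        toy_lock(t);
        if (t->mm)
            return t;
        toy_unlock(t);
    }
    return NULL;
}
```

A caller would read `t->mm->total_vm` only between the successful return and its own `toy_unlock(t)`, which mirrors how the patched badness() uses the real helper.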
> /** > * badness - calculate a numeric value for how bad this task has been > * @p: task struct of which task we should calculate > @@ -74,8 +88,8 @@ static int has_intersects_mems_allowed(struct task_struct *tsk) > unsigned long badness(struct task_struct *p, unsigned long uptime) > { > unsigned long points, cpu_time, run_time; > - struct mm_struct *mm; > struct task_struct *child; > + struct task_struct *c, *t; > int oom_adj = p->signal->oom_adj; > struct task_cputime task_time; > unsigned long utime; > @@ -84,17 +98,14 @@ unsigned long badness(struct task_struct *p, unsigned long uptime) > if (oom_adj == OOM_DISABLE) > return 0; > > - task_lock(p); > - mm = p->mm; > - if (!mm) { > - task_unlock(p); > + p = find_lock_task_mm(p); > + if (!p) > return 0; > - } > > /* > * The memory size of the process is the basis for the badness. > */ > - points = mm->total_vm; > + points = p->mm->total_vm; > > /* > * After this unlock we can no longer dereference local variable `mm' This comment is stale. Replace with p->mm. > @@ -115,12 +126,17 @@ unsigned long badness(struct task_struct *p, unsigned long uptime) > * child is eating the vast majority of memory, adding only half > * to the parents will make the child our kill candidate of choice. > */ > - list_for_each_entry(child, &p->children, sibling) { > - task_lock(child); > - if (child->mm != mm && child->mm) > - points += child->mm->total_vm/2 + 1; > - task_unlock(child); > - } > + t = p; > + do { > + list_for_each_entry(c, &t->children, sibling) { > + child = find_lock_task_mm(c); > + if (child) { > + if (child->mm != p->mm) > + points += child->mm->total_vm/2 + 1; What if 1000 children share the same mm? Doesn't this give a grossly wrong result? 
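The concern about children sharing an mm can be made concrete with a small userspace model of the accounting loop. This is a sketch under assumptions: `struct toy_child` and `toy_badness()` are hypothetical, only the child-accounting term is modelled, and the sizes are invented for illustration. Every child whose mm differs from the parent's adds `total_vm/2 + 1`, so 1000 children sharing one 10000-page mm add 1000 * (10000/2 + 1) = 5001000 points for an address space of only 10000 pages.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a child task. */
struct toy_child {
    int mm_id;              /* children with equal mm_id share one mm */
    unsigned long total_vm; /* size of that mm in pages */
};

/*
 * Models only the child-accounting part of badness(): start from the
 * parent's own size and add half of each child's mm (plus one) whenever
 * the child does not share the parent's mm. Children that share an mm
 * with each other are still counted once per child.
 */
static unsigned long toy_badness(int parent_mm_id, unsigned long parent_vm,
                                 const struct toy_child *kids, size_t n)
{
    unsigned long points = parent_vm;

    assert(kids != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (kids[i].mm_id != parent_mm_id)
            points += kids[i].total_vm / 2 + 1;
    }
    return points;
}
```

Deduplicating by mm before summing would be one way to avoid the inflation, but the thread does not settle on what the right result should be.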
> + task_unlock(child); > + } > + } > + } while_each_thread(p, t); > > /* > * CPU time is in tens of seconds and run time is in thousands > @@ -256,9 +272,6 @@ static struct task_struct *select_bad_process(unsigned long *ppoints, > for_each_process(p) { > unsigned long points; > > - /* skip tasks that have already released their mm */ > - if (!p->mm) > - continue; > /* skip the init task and kthreads */ > if (is_global_init(p) || (p->flags & PF_KTHREAD)) > continue; > @@ -385,14 +398,9 @@ static void __oom_kill_task(struct task_struct *p, int verbose) > return; > } > > - task_lock(p); > - if (!p->mm) { > - WARN_ON(1); > - printk(KERN_WARNING "tried to kill an mm-less task %d (%s)!\n", > - task_pid_nr(p), p->comm); > - task_unlock(p); > + p = find_lock_task_mm(p); > + if (!p) > return; > - } > > if (verbose) > printk(KERN_ERR "Killed process %d (%s) " > @@ -437,6 +445,7 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order, > const char *message) > { > struct task_struct *c; > + struct task_struct *t = p; > > if (printk_ratelimit()) > dump_header(p, gfp_mask, order, mem); > @@ -454,14 +463,17 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order, > message, task_pid_nr(p), p->comm, points); > > /* Try to kill a child first */ It'd be nice to improve the comments a bit. This one tells us the "what" (which is usually obvious) but didn't tell us "why", which is often the unobvious. > - list_for_each_entry(c, &p->children, sibling) { > - if (c->mm == p->mm) > - continue; > - if (mem && !task_in_mem_cgroup(c, mem)) > - continue; > - if (!oom_kill_task(c)) > - return 0; > - } > + do { > + list_for_each_entry(c, &t->children, sibling) { > + if (c->mm == p->mm) > + continue; > + if (mem && !task_in_mem_cgroup(c, mem)) > + continue; > + if (!oom_kill_task(c)) > + return 0; > + } > + } while_each_thread(p, t); > + > return oom_kill_task(p); > } I'll apply this for now.. 
-- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail138.messagelabs.com (mail138.messagelabs.com [216.82.249.35]) by kanga.kvack.org (Postfix) with ESMTP id A07056B01DB for ; Tue, 8 Jun 2010 15:55:43 -0400 (EDT) Date: Tue, 8 Jun 2010 12:55:33 -0700 From: Andrew Morton Subject: Re: [patch 03/18] oom: dump_tasks use find_lock_task_mm too Message-Id: <20100608125533.086a4191.akpm@linux-foundation.org> In-Reply-To: References: Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org To: David Rientjes Cc: Rik van Riel , Nick Piggin , Oleg Nesterov , Balbir Singh , KAMEZAWA Hiroyuki , KOSAKI Motohiro , linux-mm@kvack.org List-ID: On Sun, 6 Jun 2010 15:34:12 -0700 (PDT) David Rientjes wrote: > From: KOSAKI Motohiro > > dump_task() should use find_lock_task_mm() too. It is necessary for > protecting task-exiting race. A full description of the race would help people understand the code and the change. > Signed-off-by: KOSAKI Motohiro > Signed-off-by: David Rientjes > --- > mm/oom_kill.c | 39 +++++++++++++++++++++------------------ > 1 files changed, 21 insertions(+), 18 deletions(-) > > diff --git a/mm/oom_kill.c b/mm/oom_kill.c > --- a/mm/oom_kill.c > +++ b/mm/oom_kill.c > @@ -336,35 +336,38 @@ static struct task_struct *select_bad_process(unsigned long *ppoints, > */ > static void dump_tasks(const struct mem_cgroup *mem) The comment over this function needs to be updated to describe the role of incoming argument `mem'. 
> { > - struct task_struct *g, *p; > + struct task_struct *p; > + struct task_struct *task; > > printk(KERN_INFO "[ pid ] uid tgid total_vm rss cpu oom_adj " > "name\n"); > - do_each_thread(g, p) { > - struct mm_struct *mm; > - > - if (mem && !task_in_mem_cgroup(p, mem)) > + for_each_process(p) { The switch from do_each_thread() to for_each_process() is unchangelogged. It looks like a little cleanup to me. > + /* > + * We don't have is_global_init() check here, because the old > + * code do that. printing init process is not big matter. But > + * we don't hope to make unnecessary compatibility breaking. > + */ When merging others' patches, please do review and if necessary fix or enhance the comments and the changelog. I don't think people take offense. Also, I don't think it's really valuable to document *changes* within the code comments. This comment is referring to what the old code did versus the new code. Generally it's best to just document the code as it presently stands and leave the documentation of the delta to the changelog. That's not always true, of course - we should document oddball code which is left there for userspace-visible back-compatibility reasons. > + if (p->flags & PF_KTHREAD) > continue; > - if (!thread_group_leader(p)) > + if (mem && !task_in_mem_cgroup(p, mem)) > continue; > > - task_lock(p); > - mm = p->mm; > - if (!mm) { > + task = find_lock_task_mm(p); > + if (!task) { > /* > - * total_vm and rss sizes do not exist for tasks with no > - * mm so there's no need to report them; they can't be > - * oom killed anyway. > + * Probably oom vs task-exiting race was happen and ->mm > + * have been detached. thus there's no need to report > + * them; they can't be oom killed anyway. > */ OK, that hinted at the race but still didn't really tell readers what it is. 
> - task_unlock(p); > continue; > } > + > printk(KERN_INFO "[%5d] %5d %5d %8lu %8lu %3d %3d %s\n", > - p->pid, __task_cred(p)->uid, p->tgid, mm->total_vm, > - get_mm_rss(mm), (int)task_cpu(p), p->signal->oom_adj, > - p->comm); > - task_unlock(p); > - } while_each_thread(g, p); > + task->pid, __task_cred(task)->uid, task->tgid, > + task->mm->total_vm, get_mm_rss(task->mm), > + (int)task_cpu(task), task->signal->oom_adj, p->comm); No need to cast the task_cpu() return value - just use %u. > + task_unlock(task); > + } > } > > static void dump_header(struct task_struct *p, gfp_t gfp_mask, int order, -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail202.messagelabs.com (mail202.messagelabs.com [216.82.254.227]) by kanga.kvack.org (Postfix) with ESMTP id 582E96B01D9 for ; Tue, 8 Jun 2010 16:00:38 -0400 (EDT) Date: Tue, 8 Jun 2010 13:00:30 -0700 From: Andrew Morton Subject: Re: [patch 04/18] oom: PF_EXITING check should take mm into account Message-Id: <20100608130030.0ed9f4f4.akpm@linux-foundation.org> In-Reply-To: References: Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org To: David Rientjes Cc: Rik van Riel , Nick Piggin , Oleg Nesterov , Balbir Singh , KAMEZAWA Hiroyuki , KOSAKI Motohiro , linux-mm@kvack.org List-ID: On Sun, 6 Jun 2010 15:34:15 -0700 (PDT) David Rientjes wrote: > From: Oleg Nesterov > > select_bad_process() checks PF_EXITING to detect the task which is going > to release its memory, but the logic is very wrong. 
> > - a single process P with the dead group leader disables > select_bad_process() completely, it will always return > ERR_PTR() while P can live forever > > - if the PF_EXITING task has already released its ->mm > it doesn't make sense to expect it is going to free > more memory (except task_struct/etc) > > Change the code to ignore the PF_EXITING tasks without ->mm. > > Signed-off-by: Oleg Nesterov > Signed-off-by: David Rientjes > --- > mm/oom_kill.c | 2 +- > 1 files changed, 1 insertions(+), 1 deletions(-) > > diff --git a/mm/oom_kill.c b/mm/oom_kill.c > --- a/mm/oom_kill.c > +++ b/mm/oom_kill.c > @@ -300,7 +300,7 @@ static struct task_struct *select_bad_process(unsigned long *ppoints, > * the process of exiting and releasing its resources. > * Otherwise we could get an easy OOM deadlock. > */ > - if (p->flags & PF_EXITING) { > + if ((p->flags & PF_EXITING) && p->mm) { > if (p != current) > return ERR_PTR(-1UL); Looks good to me. 
From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail172.messagelabs.com (mail172.messagelabs.com [216.82.254.3]) by kanga.kvack.org (Postfix) with ESMTP id EE80B6B01DD for ; Tue, 8 Jun 2010 16:08:42 -0400 (EDT) Date: Tue, 8 Jun 2010 13:08:04 -0700 From: Andrew Morton Subject: Re: [patch 05/18] oom: give current access to memory reserves if it has been killed Message-Id: <20100608130804.8794d029.akpm@linux-foundation.org> In-Reply-To: References: Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org To: David Rientjes Cc: Rik van Riel , Nick Piggin , Oleg Nesterov , Balbir Singh , KAMEZAWA Hiroyuki , KOSAKI Motohiro , linux-mm@kvack.org List-ID: On Sun, 6 Jun 2010 15:34:18 -0700 (PDT) David Rientjes wrote: > It's possible to livelock the page allocator if a thread has mm->mmap_sem What is the state of this thread? Trying to allocate memory, I assume. > and fails to make forward progress because the oom killer selects another > thread sharing the same ->mm to kill that cannot exit until the semaphore > is dropped. > > The oom killer will not kill multiple tasks at the same time; each oom > killed task must exit before another task may be killed. This sounds like quite a risky design. The possibility that we'll cause other dead/livelocks similar to this one seems pretty high. It applies to all sleeping locks in the entire kernel, doesn't it? If so: it's unfortunate that the kernel doesn't distinguish between D-state-for-locks and D-state-for-disk-io. Otherwise we could just skip over D-state-for-locks processes. Or maybe I'm wrong ;) > Thus, if one > thread is holding mm->mmap_sem and cannot allocate memory, all threads > sharing the same ->mm are blocked from exiting as well. 
In the oom kill > case, that means the thread holding mm->mmap_sem will never free > additional memory since it cannot get access to memory reserves and the > thread that depends on it with access to memory reserves cannot exit > because it cannot acquire the semaphore. Thus, the page allocators > livelocks. > > When the oom killer is called and current happens to have a pending > SIGKILL, this patch automatically gives it access to memory reserves and > returns. Upon returning to the page allocator, its allocation will > hopefully succeed so it can quickly exit and free its memory. If not, the > page allocator will fail the allocation if it is not __GFP_NOFAIL. You said "hopefully". Does it actually work? Any real-world testing results? If so, they'd be a useful addition to the changelog. > Acked-by: KOSAKI Motohiro > Reviewed-by: KAMEZAWA Hiroyuki > Signed-off-by: David Rientjes > --- > mm/oom_kill.c | 10 ++++++++++ > 1 files changed, 10 insertions(+), 0 deletions(-) > > diff --git a/mm/oom_kill.c b/mm/oom_kill.c > --- a/mm/oom_kill.c > +++ b/mm/oom_kill.c > @@ -650,6 +650,16 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask, > /* Got some memory back in the last second. */ > return; > > + /* > + * If current has a pending SIGKILL, then automatically select it. The > + * goal is to allow it to allocate so that it may quickly exit and free > + * its memory. > + */ > + if (fatal_signal_pending(current)) { > + set_thread_flag(TIF_MEMDIE); > + return; > + } > + > if (sysctl_panic_on_oom == 2) { > dump_header(NULL, gfp_mask, order, NULL); > panic("out of memory. Compulsory panic_on_oom is selected.\n"); -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
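The ordering that patch 05 adds to out_of_memory() can be sketched in userspace as follows. This is a model under assumptions, not the kernel function: `struct toy_current`, `enum toy_action` and `toy_out_of_memory()` are hypothetical, `sigkill_pending` stands in for fatal_signal_pending(current), and `memdie` stands in for the TIF_MEMDIE thread flag. The point shown is that the pending-SIGKILL shortcut runs before the panic_on_oom == 2 check and before any victim selection.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the allocating task. */
struct toy_current {
    int sigkill_pending; /* models fatal_signal_pending(current) */
    int memdie;          /* models TIF_MEMDIE: may use memory reserves */
};

enum toy_action {
    TOY_GRANT_RESERVES, /* current was already killed: let it allocate */
    TOY_PANIC,          /* compulsory panic_on_oom */
    TOY_SELECT_VICTIM   /* fall through to normal victim selection */
};

static enum toy_action toy_out_of_memory(struct toy_current *cur,
                                         int panic_on_oom)
{
    assert(cur != NULL);

    /* Added by the patch: a dying task gets reserves and we return. */
    if (cur->sigkill_pending) {
        cur->memdie = 1;
        return TOY_GRANT_RESERVES;
    }
    if (panic_on_oom == 2)
        return TOY_PANIC;
    return TOY_SELECT_VICTIM;
}
```

Whether the allocation then actually succeeds depends on how much is left in the reserves, which is the open question raised in the review above; the sketch does not model it.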
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail203.messagelabs.com (mail203.messagelabs.com [216.82.254.243]) by kanga.kvack.org (Postfix) with ESMTP id 100176B01DD for ; Tue, 8 Jun 2010 16:12:20 -0400 (EDT) Date: Tue, 8 Jun 2010 13:12:11 -0700 From: Andrew Morton Subject: Re: [patch 05/18] oom: give current access to memory reserves if it has been killed Message-Id: <20100608131211.e769e3a1.akpm@linux-foundation.org> In-Reply-To: <20100608203216.765D.A69D9226@jp.fujitsu.com> References: <20100608203216.765D.A69D9226@jp.fujitsu.com> Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org To: KOSAKI Motohiro Cc: David Rientjes , Rik van Riel , Nick Piggin , Oleg Nesterov , Balbir Singh , KAMEZAWA Hiroyuki , linux-mm@kvack.org List-ID: On Tue, 8 Jun 2010 20:41:57 +0900 (JST) KOSAKI Motohiro wrote: > > + > > if (sysctl_panic_on_oom == 2) { > > dump_header(NULL, gfp_mask, order, NULL); > > panic("out of memory. Compulsory panic_on_oom is selected.\n"); > > Sorry, I had found this patch works incorrect. I don't pulled. Saying "it doesn't work and I'm not telling you why" is unhelpful. In fact it's the opposite of helpful because it blocks merging of the fix and doesn't give us any way to move forward. So what can I do? Hard. What I shall do is to merge the patch in the hope that someone else will discover the undescribed problem and we will fix it then. That's very inefficient. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail143.messagelabs.com (mail143.messagelabs.com [216.82.254.35]) by kanga.kvack.org (Postfix) with SMTP id 440796B01DD for ; Tue, 8 Jun 2010 16:15:32 -0400 (EDT) Date: Tue, 8 Jun 2010 22:14:03 +0200 From: Oleg Nesterov Subject: Re: [patch 02/18] oom: introduce find_lock_task_mm() to fix !mm false positives Message-ID: <20100608201403.GA10264@redhat.com> References: <20100608124246.9258ccab.akpm@linux-foundation.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20100608124246.9258ccab.akpm@linux-foundation.org> Sender: owner-linux-mm@kvack.org To: Andrew Morton Cc: David Rientjes , Rik van Riel , Nick Piggin , Balbir Singh , KAMEZAWA Hiroyuki , KOSAKI Motohiro , linux-mm@kvack.org List-ID: On 06/08, Andrew Morton wrote: > > On Sun, 6 Jun 2010 15:34:03 -0700 (PDT) > David Rientjes wrote: > > > [kosaki.motohiro@jp.fujitsu.com: use in badness(), __oom_kill_task()] > > Signed-off-by: Oleg Nesterov > > Signed-off-by: David Rientjes > > I assume from the above that we should have a Signed-off-by:kosaki > here. I didn't make that change yet - please advise. Yes. The patch mixes 2 changes: find_lock_task_mm patch + "do not forget about the sub-thread's children". The changelog doesn't match the actual changes. > > @@ -115,12 +126,17 @@ unsigned long badness(struct task_struct *p, unsigned long uptime) > > * child is eating the vast majority of memory, adding only half > > * to the parents will make the child our kill candidate of choice. 
> > */ > > - list_for_each_entry(child, &p->children, sibling) { > > - task_lock(child); > > - if (child->mm != mm && child->mm) > > - points += child->mm->total_vm/2 + 1; > > - task_unlock(child); > > - } > > + t = p; > > + do { > > + list_for_each_entry(c, &t->children, sibling) { > > + child = find_lock_task_mm(c); > > + if (child) { > > + if (child->mm != p->mm) > > + points += child->mm->total_vm/2 + 1; > > What if 1000 children share the same mm? Doesn't this give a grossly > wrong result? Can't answer. Obviously it is hard to explain what the "right" result is here. But otoh, without this change we can't account children. Kosaki sent this as a separate change. > > @@ -256,9 +272,6 @@ static struct task_struct *select_bad_process(unsigned long *ppoints, > > for_each_process(p) { > > unsigned long points; > > > > - /* skip tasks that have already released their mm */ > > - if (!p->mm) > > - continue; We shouldn't remove this without removing OR updating the PF_EXITING check below. That is why we had another patch. This change alone allows to trivially disable oom-kill. If we have a process with the dead leader, select_bad_process() will always return -1. We either need another patch from Kosaki's series - if (p->flags & PF_EXITING) + if (p->flags & PF_EXITING && p->mm) or remove this check (David objects). Oleg.
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail143.messagelabs.com (mail143.messagelabs.com [216.82.254.35]) by kanga.kvack.org (Postfix) with ESMTP id 20EA16B01DE for ; Tue, 8 Jun 2010 16:17:39 -0400 (EDT) Date: Tue, 8 Jun 2010 13:17:19 -0700 From: Andrew Morton Subject: Re: [patch 06/18] oom: avoid sending exiting tasks a SIGKILL Message-Id: <20100608131719.226b62ef.akpm@linux-foundation.org> In-Reply-To: References: Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org To: David Rientjes Cc: Rik van Riel , Nick Piggin , Oleg Nesterov , Balbir Singh , KAMEZAWA Hiroyuki , KOSAKI Motohiro , linux-mm@kvack.org List-ID: On Sun, 6 Jun 2010 15:34:22 -0700 (PDT) David Rientjes wrote: > It's unnecessary to SIGKILL a task that is already PF_EXITING and can > actually cause a NULL pointer dereference of the sighand if it has already > been detached. Instead, simply set TIF_MEMDIE so it has access to memory > reserves and can quickly exit as the comment implies. > > Reviewed-by: KAMEZAWA Hiroyuki > Signed-off-by: David Rientjes > --- > mm/oom_kill.c | 2 +- > 1 files changed, 1 insertions(+), 1 deletions(-) > > diff --git a/mm/oom_kill.c b/mm/oom_kill.c > --- a/mm/oom_kill.c > +++ b/mm/oom_kill.c > @@ -458,7 +458,7 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order, > * its children or threads, just set TIF_MEMDIE so it can die quickly > */ > if (p->flags & PF_EXITING) { > - __oom_kill_task(p, 0); > + set_tsk_thread_flag(p, TIF_MEMDIE); > return 0; > } Well, we lose a lot of other stuff here. We can set TIF_MEMDIE on the is_global_init() task (how can that get PF_EXITING?). We don't print the "Killed process %d" info. We don't bump the task's timeslice. These are unchangelogged alterations and I for one can't tell whether or not they were deliberate. 
-- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail172.messagelabs.com (mail172.messagelabs.com [216.82.254.3]) by kanga.kvack.org (Postfix) with SMTP id A286C6B01DE for ; Tue, 8 Jun 2010 16:19:06 -0400 (EDT) Date: Tue, 8 Jun 2010 22:17:39 +0200 From: Oleg Nesterov Subject: Re: [patch 02/18] oom: introduce find_lock_task_mm() to fix !mm false positives Message-ID: <20100608201739.GA11028@redhat.com> References: <20100608124246.9258ccab.akpm@linux-foundation.org> <20100608201403.GA10264@redhat.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20100608201403.GA10264@redhat.com> Sender: owner-linux-mm@kvack.org To: Andrew Morton Cc: David Rientjes , Rik van Riel , Nick Piggin , Balbir Singh , KAMEZAWA Hiroyuki , KOSAKI Motohiro , linux-mm@kvack.org List-ID: On 06/08, Oleg Nesterov wrote: > > On 06/08, Andrew Morton wrote: > > > > > - /* skip tasks that have already released their mm */ > > > - if (!p->mm) > > > - continue; > > We shouldn't remove this without removing OR updating the PF_EXITING check > below. That is why we had another patch. > > This change alone allows to trivially disable oom-kill. If we have a process > with the dead leader, select_bad_process() will always return -1. > > We either need another patch from Kosaki's series > > - if (p->flags & PF_EXITING) > + if (p->flags & PF_EXITING && p->mm) OOPS, sorry. I didn't understand you are going to merge this change too. Probably oom-pf_exiting-check-should-take-mm-into-account.patch should go ahead of this one for bisecting. Oleg. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail138.messagelabs.com (mail138.messagelabs.com [216.82.249.35]) by kanga.kvack.org (Postfix) with ESMTP id 860BE6B01E1 for ; Tue, 8 Jun 2010 16:24:15 -0400 (EDT) Date: Tue, 8 Jun 2010 13:23:39 -0700 From: Andrew Morton Subject: Re: [patch 07/18] oom: filter tasks not sharing the same cpuset Message-Id: <20100608132339.54db2317.akpm@linux-foundation.org> In-Reply-To: References: Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org To: David Rientjes Cc: Rik van Riel , Nick Piggin , Oleg Nesterov , Balbir Singh , KAMEZAWA Hiroyuki , KOSAKI Motohiro , linux-mm@kvack.org List-ID: On Sun, 6 Jun 2010 15:34:25 -0700 (PDT) David Rientjes wrote: > Tasks that do not share the same set of allowed nodes with the task that > triggered the oom should not be considered as candidates for oom kill. > > Tasks in other cpusets with a disjoint set of mems would be unfairly > penalized otherwise because of oom conditions elsewhere; an extreme > example could unfairly kill all other applications on the system if a > single task in a user's cpuset sets itself to OOM_DISABLE and then uses > more memory than allowed. > > Killing tasks outside of current's cpuset rarely would free memory for > current anyway. To use a sane heuristic, we must ensure that killing a > task would likely free memory for current and avoid needlessly killing > others at all costs just because their potential memory freeing is > unknown. It is better to kill current than another task needlessly. This is all a bit arbitrary, isn't it? The key word here is "rarely". If indeed this task had allocated gobs of memory from `current's nodes and then sneakily switched nodes, this will be a big regression! So.. It's not completely clear to me how we justify this decision. Are we erring too far on the side of keep-tasks-running? 
Is failing to clear the oom a lot bigger problem than killing an innocent task? I think so. In which case we should err towards slaughtering the innocent? From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail202.messagelabs.com (mail202.messagelabs.com [216.82.254.227]) by kanga.kvack.org (Postfix) with SMTP id 093636B01E1 for ; Tue, 8 Jun 2010 16:27:38 -0400 (EDT) Date: Tue, 8 Jun 2010 22:26:11 +0200 From: Oleg Nesterov Subject: Re: [patch 06/18] oom: avoid sending exiting tasks a SIGKILL Message-ID: <20100608202611.GA11284@redhat.com> References: MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: Sender: owner-linux-mm@kvack.org To: David Rientjes Cc: Andrew Morton , Rik van Riel , Nick Piggin , Balbir Singh , KAMEZAWA Hiroyuki , KOSAKI Motohiro , linux-mm@kvack.org List-ID: To clarify, I am not going to review this patch ;) As I said many times I can only understand what oom_kill.c does, but not why. On 06/06, David Rientjes wrote: > > It's unnecessary to SIGKILL a task that is already PF_EXITING This probably needs some explanation. PF_EXITING doesn't necessarily mean this process is exiting. > and can > actually cause a NULL pointer dereference of the sighand Yes. Another reason to avoid force_sig(). Oleg.
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail191.messagelabs.com (mail191.messagelabs.com [216.82.242.19]) by kanga.kvack.org (Postfix) with ESMTP id 599D56B01D9 for ; Tue, 8 Jun 2010 16:34:17 -0400 (EDT) Date: Tue, 8 Jun 2010 13:33:56 -0700 From: Andrew Morton Subject: Re: [patch 08/18] oom: sacrifice child with highest badness score for parent Message-Id: <20100608133356.6e941d20.akpm@linux-foundation.org> In-Reply-To: References: Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org To: David Rientjes Cc: Rik van Riel , Nick Piggin , Oleg Nesterov , Balbir Singh , KAMEZAWA Hiroyuki , KOSAKI Motohiro , linux-mm@kvack.org List-ID: On Sun, 6 Jun 2010 15:34:28 -0700 (PDT) David Rientjes wrote: > When a task is chosen for oom kill, the oom killer first attempts to > sacrifice a child not sharing its parent's memory instead. Unfortunately, > this often kills in a seemingly random fashion based on the ordering of > the selected task's child list. Additionally, it is not guaranteed at all > to free a large amount of memory that we need to prevent additional oom > killing in the very near future. > > Instead, we now only attempt to sacrifice the worst child not sharing its > parent's memory, if one exists. The worst child is indicated with the > highest badness() score. This serves two advantages: we kill a > memory-hogging task more often, and we allow the configurable > /proc/pid/oom_adj value to be considered as a factor in which child to > kill. > > Reviewers may observe that the previous implementation would iterate > through the children and attempt to kill each until one was successful and > then the parent if none were found while the new code simply kills the > most memory-hogging task or the parent. Note that the only time > oom_kill_task() fails, however, is when a child does not have an mm or has > a /proc/pid/oom_adj of OOM_DISABLE. 
badness() returns 0 for both cases, > so the final oom_kill_task() will always succeed. > > Acked-by: Rik van Riel > Acked-by: Nick Piggin > Acked-by: Balbir Singh > Acked-by: KOSAKI Motohiro > Reviewed-by: KAMEZAWA Hiroyuki > Reviewed-by: KOSAKI Motohiro > Signed-off-by: David Rientjes > --- > mm/oom_kill.c | 23 +++++++++++++++++------ > 1 files changed, 17 insertions(+), 6 deletions(-) > > diff --git a/mm/oom_kill.c b/mm/oom_kill.c > --- a/mm/oom_kill.c > +++ b/mm/oom_kill.c > @@ -441,8 +441,11 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order, > unsigned long points, struct mem_cgroup *mem, > const char *message) > { > + struct task_struct *victim = p; > struct task_struct *c; > struct task_struct *t = p; > + unsigned long victim_points = 0; > + struct timespec uptime; > > if (printk_ratelimit()) > dump_header(p, gfp_mask, order, mem); > @@ -456,22 +459,30 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order, > return 0; > } > > - printk(KERN_ERR "%s: kill process %d (%s) score %li or a child\n", > - message, task_pid_nr(p), p->comm, points); > + pr_err("%s: Kill process %d (%s) score %lu or sacrifice child\n", > + message, task_pid_nr(p), p->comm, points); fyi, access to another task's ->comm is racy against prctl(). Fixable with get_task_comm(). But that takes task_lock(), which is risky in this code. The world wouldn't end if we didn't fix this ;) > - /* Try to kill a child first */ > + /* Try to sacrifice the worst child first */ > + do_posix_clock_monotonic_gettime(&uptime); > do { > + unsigned long cpoints; This could be local to the list_for_each_entry() block. What does "cpoints" mean? > list_for_each_entry(c, &t->children, sibling) { I'm surprised we don't have a sched.h helper for this. Maybe it's not a very common thing to do. 
> if (c->mm == p->mm) > continue; > if (mem && !task_in_mem_cgroup(c, mem)) > continue; > - if (!oom_kill_task(c)) > - return 0; > + > + /* badness() returns 0 if the thread is unkillable */ > + cpoints = badness(c, uptime.tv_sec); > + if (cpoints > victim_points) { > + victim = c; > + victim_points = cpoints; > + } > } > } while_each_thread(p, t); > > - return oom_kill_task(p); > + return oom_kill_task(victim); > } And this function is secretly called under tasklist_lock, which is what pins *victim, yes? -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail202.messagelabs.com (mail202.messagelabs.com [216.82.254.227]) by kanga.kvack.org (Postfix) with ESMTP id CAC2A6B01DD for ; Tue, 8 Jun 2010 17:08:26 -0400 (EDT) Date: Tue, 8 Jun 2010 14:08:18 -0700 From: Andrew Morton Subject: Re: [patch 09/18] oom: select task from tasklist for mempolicy ooms Message-Id: <20100608140818.b413c335.akpm@linux-foundation.org> In-Reply-To: References: Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org To: David Rientjes Cc: Rik van Riel , Nick Piggin , Oleg Nesterov , Balbir Singh , KAMEZAWA Hiroyuki , KOSAKI Motohiro , linux-mm@kvack.org List-ID: On Sun, 6 Jun 2010 15:34:31 -0700 (PDT) David Rientjes wrote: > The oom killer presently kills current whenever there is no more memory > free or reclaimable on its mempolicy's nodes. There is no guarantee that > current is a memory-hogging task or that killing it will free any > substantial amount of memory, however. > > In such situations, it is better to scan the tasklist for nodes that are > allowed to allocate on current's set of nodes and kill the task with the > highest badness() score. 
This ensures that the most memory-hogging task, > or the one configured by the user with /proc/pid/oom_adj, is always > selected in such scenarios. > > > ... > > --- a/mm/oom_kill.c > +++ b/mm/oom_kill.c > @@ -27,6 +27,7 @@ > #include > #include > #include > +#include > #include > > int sysctl_panic_on_oom; > @@ -36,20 +37,36 @@ static DEFINE_SPINLOCK(zone_scan_lock); > /* #define DEBUG */ > > /* > - * Is all threads of the target process nodes overlap ours? > + * Do all threads of the target process overlap our allowed nodes? > + * @tsk: task struct of which task to consider > + * @mask: nodemask passed to page allocator for mempolicy ooms The comment uses kerneldoc annotation but isn't a kerneldoc comment. > */ > -static int has_intersects_mems_allowed(struct task_struct *tsk) > +static bool has_intersects_mems_allowed(struct task_struct *tsk, > + const nodemask_t *mask) > { > - struct task_struct *t; > + struct task_struct *start = tsk; > > - t = tsk; > do { > - if (cpuset_mems_allowed_intersects(current, t)) > - return 1; > - t = next_thread(t); > - } while (t != tsk); > - > - return 0; > + if (mask) { > + /* > + * If this is a mempolicy constrained oom, tsk's > + * cpuset is irrelevant. Only return true if its > + * mempolicy intersects current, otherwise it may be > + * needlessly killed. > + */ > + if (mempolicy_nodemask_intersects(tsk, mask)) > + return true; The comment refers to `current' but the code does not? > + } else { > + /* > + * This is not a mempolicy constrained oom, so only > + * check the mems of tsk's cpuset. > + */ The comment doesn't refer to `current', but the code does. Confused. > + if (cpuset_mems_allowed_intersects(current, tsk)) > + return true; > + } > + tsk = next_thread(tsk); hm, next_thread() uses list_entry_rcu(). What are the locking rules here? It's one or both of rcu_read_lock() and read_lock(&tasklist_lock), I think? > + } while (tsk != start); > + return false; > } This is all bloat and overhead for non-NUMA builds.
I doubt if gcc is able to eliminate the task_struct walk (although I didn't check). The function isn't oom-killer-specific at all - give it a better name then move it to mempolicy.c or similar? If so, the text "oom" shouldn't appear in the comments. > > ... > > @@ -676,24 +699,19 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask, > */ > constraint = constrained_alloc(zonelist, gfp_mask, nodemask); > read_lock(&tasklist_lock); > - > - switch (constraint) { > - case CONSTRAINT_MEMORY_POLICY: > - oom_kill_process(current, gfp_mask, order, 0, NULL, > - "No available memory (MPOL_BIND)"); > - break; > - > - case CONSTRAINT_NONE: > - if (sysctl_panic_on_oom) { > + if (unlikely(sysctl_panic_on_oom)) { > + /* > + * panic_on_oom only affects CONSTRAINT_NONE, the kernel > + * should not panic for cpuset or mempolicy induced memory > + * failures. > + */ This wasn't changelogged? > + if (constraint == CONSTRAINT_NONE) { > dump_header(NULL, gfp_mask, order, NULL); > - panic("out of memory. panic_on_oom is selected\n"); > + read_unlock(&tasklist_lock); > + panic("Out of memory: panic_on_oom is enabled\n"); > } > - /* Fall-through */ > - case CONSTRAINT_CPUSET: > - __out_of_memory(gfp_mask, order); > - break; > } > - > + __out_of_memory(gfp_mask, order, constraint, nodemask); > read_unlock(&tasklist_lock); > > /* -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail143.messagelabs.com (mail143.messagelabs.com [216.82.254.35]) by kanga.kvack.org (Postfix) with ESMTP id A4D216B01DD for ; Tue, 8 Jun 2010 17:14:03 -0400 (EDT) Date: Tue, 8 Jun 2010 14:13:42 -0700 From: Andrew Morton Subject: Re: [patch 10/18] oom: enable oom tasklist dump by default Message-Id: <20100608141342.114156ac.akpm@linux-foundation.org> In-Reply-To: References: Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org To: David Rientjes Cc: Rik van Riel , Nick Piggin , Oleg Nesterov , Balbir Singh , KAMEZAWA Hiroyuki , KOSAKI Motohiro , linux-mm@kvack.org List-ID: On Sun, 6 Jun 2010 15:34:35 -0700 (PDT) David Rientjes wrote: > The oom killer tasklist dump, enabled with the oom_dump_tasks sysctl, is > very helpful information in diagnosing why a user's task has been killed. > It emits useful information such as each eligible thread's memory usage > that can determine why the system is oom, so it should be enabled by > default. Unclear. On a large system the poor thing will now spend half an hour squirting junk out the diagnostic port. Probably interspersed with the occasional whine from the softlockup detector. And for many applications, spending a long time stuck in the kernel printing diagnostics is equivalent to an outage. I guess people can turn it off again if this happens, but they'll get justifiably grumpy at us. I wonder if this change is too developer-friendly and insufficiently operator-friendly. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail137.messagelabs.com (mail137.messagelabs.com [216.82.249.19]) by kanga.kvack.org (Postfix) with SMTP id AFD646B01DD for ; Tue, 8 Jun 2010 17:19:30 -0400 (EDT) Date: Tue, 8 Jun 2010 23:17:48 +0200 From: Oleg Nesterov Subject: Re: [patch 09/18] oom: select task from tasklist for mempolicy ooms Message-ID: <20100608211748.GA13542@redhat.com> References: <20100608140818.b413c335.akpm@linux-foundation.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20100608140818.b413c335.akpm@linux-foundation.org> Sender: owner-linux-mm@kvack.org To: Andrew Morton Cc: David Rientjes , Rik van Riel , Nick Piggin , Balbir Singh , KAMEZAWA Hiroyuki , KOSAKI Motohiro , linux-mm@kvack.org List-ID: On 06/08, Andrew Morton wrote: > > On Sun, 6 Jun 2010 15:34:31 -0700 (PDT) > David Rientjes wrote: > > > + if (cpuset_mems_allowed_intersects(current, tsk)) > > + return true; > > + } > > + tsk = next_thread(tsk); > > hm, next_thread() uses list_entry_rcu(). What are the locking rules > here? It's one or both of rcu_read_lock() and read_lock(&tasklist_lock), > I think? Yes, next_thread() is safe under tasklist/rcu/siglock. > > + } while (tsk != start); > > + return false; > > } > > This is all bloat and overhead for non-NUMA builds. I doubt if gcc is > able to eliminate the task_struct walk (although I didn't check). I'd also suggest while_each_thread() instead of next_thread() + "tsk != start", but this is really a minor nit. Oleg.
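Editorial sketch: the thread-group walk under discussion in has_intersects_mems_allowed() can be modelled in userspace roughly as follows. This is only an illustration under stated assumptions, not the kernel code. struct thread_model, group_intersects() and the 64-bit node bitmasks are hypothetical stand-ins for struct task_struct, the patched function and nodemask_t; locking (tasklist_lock or RCU) is left out, and a thread without a mempolicy is not modelled.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical userspace stand-in for one thread; not struct task_struct. */
struct thread_model {
	struct thread_model *next;  /* circular list linking the thread group */
	uint64_t mempolicy_nodes;   /* nodes allowed by the thread's mempolicy */
	uint64_t cpuset_nodes;      /* nodes allowed by the thread's cpuset */
};

/*
 * Returns true if any thread in start's group may allocate from the nodes
 * of interest. A non-NULL policy_mask models a mempolicy constrained oom,
 * where only mempolicies are compared. Otherwise the cpuset nodes of each
 * thread are compared against those of the allocating task.
 */
bool group_intersects(const struct thread_model *start,
		      const uint64_t *policy_mask,
		      uint64_t current_cpuset_nodes)
{
	const struct thread_model *t = start;

	/* Same shape as the do { ... } while (tsk != start) loop in the patch. */
	do {
		if (policy_mask) {
			if (t->mempolicy_nodes & *policy_mask)
				return true;
		} else if (t->cpuset_nodes & current_cpuset_nodes) {
			return true;
		}
		t = t->next;
	} while (t != start);

	return false;
}
```

The loop visits every thread exactly once because the list is circular and the walk stops when it returns to the starting thread, which is the same termination condition that while_each_thread() expresses in the kernel.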
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail138.messagelabs.com (mail138.messagelabs.com [216.82.249.35]) by kanga.kvack.org (Postfix) with ESMTP id 263086B01DF for ; Tue, 8 Jun 2010 17:20:00 -0400 (EDT) Date: Tue, 8 Jun 2010 14:19:28 -0700 From: Andrew Morton Subject: Re: [patch 11/18] oom: avoid oom killer for lowmem allocations Message-Id: <20100608141928.113a89b2.akpm@linux-foundation.org> In-Reply-To: References: Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org To: David Rientjes Cc: Rik van Riel , Nick Piggin , Oleg Nesterov , Balbir Singh , KAMEZAWA Hiroyuki , KOSAKI Motohiro , linux-mm@kvack.org List-ID: > oom: avoid oom killer for lowmem allocations I think the terminology is poor. My 256MB test box only has lowmem! In the past we've used the term "lower zone" here, which is I think what you want? On Sun, 6 Jun 2010 15:34:38 -0700 (PDT) David Rientjes wrote: > If memory has been depleted in lowmem zones even with the protection > afforded to it by /proc/sys/vm/lowmem_reserve_ratio, it is unlikely that > killing current users will help. The memory is either reclaimable (or > migratable) already, in which case we should not invoke the oom killer at > all, or it is pinned by an application for I/O. Killing such an > application may leave the hardware in an unspecified state and there is no > guarantee that it will be able to make a timely exit. Killing an application can leave hardware in an unspecified state? How so? That means a ^C kills the box! > Lowmem allocations are now failed in oom conditions when __GFP_NOFAIL is > not used so that the task can perhaps recover or try again later. > > Previously, the heuristic provided some protection for those tasks with > CAP_SYS_RAWIO, but this is no longer necessary since we will not be > killing tasks for the purposes of ISA allocations. 
> > high_zoneidx is gfp_zone(gfp_flags), meaning that ZONE_NORMAL will be the > default for all allocations that are not __GFP_DMA, __GFP_DMA32, > __GFP_HIGHMEM, and __GFP_MOVABLE on kernels configured to support those > flags. Testing for high_zoneidx being less than ZONE_NORMAL will only > return true for allocations that have either __GFP_DMA or __GFP_DMA32. > > Acked-by: KOSAKI Motohiro > Signed-off-by: David Rientjes > --- > mm/page_alloc.c | 29 ++++++++++++++++++++--------- > 1 files changed, 20 insertions(+), 9 deletions(-) > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > --- a/mm/page_alloc.c > +++ b/mm/page_alloc.c > @@ -1759,6 +1759,9 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order, > /* The OOM killer will not help higher order allocs */ > if (order > PAGE_ALLOC_COSTLY_ORDER) > goto out; > + /* The OOM killer does not needlessly kill tasks for lowmem */ a) terminology is scary b) comment doesn't explain _why_, which is the most important thing to explain. > + if (high_zoneidx < ZONE_NORMAL) > + goto out; > /* > * GFP_THISNODE contains __GFP_NORETRY and we never hit this. > * Sanity check for bare calls of __GFP_THISNODE, not real OOM. > @@ -2052,15 +2055,23 @@ rebalance: > if (page) > goto got_pg; > > - /* > - * The OOM killer does not trigger for high-order > - * ~__GFP_NOFAIL allocations so if no progress is being > - * made, there are no other options and retrying is > - * unlikely to help. > - */ > - if (order > PAGE_ALLOC_COSTLY_ORDER && > - !(gfp_mask & __GFP_NOFAIL)) > - goto nopage; > + if (!(gfp_mask & __GFP_NOFAIL)) { > + /* > + * The oom killer is not called for high-order > + * allocations that may fail, so if no progress > + * is being made, there are no other options and > + * retrying is unlikely to help. > + */ > + if (order > PAGE_ALLOC_COSTLY_ORDER) > + goto nopage; > + /* > + * The oom killer is not called for lowmem > + * allocations to prevent needlessly killing > + * innocent tasks. 
> + */ s/lowmem/somethingelse/ > + if (high_zoneidx < ZONE_NORMAL) > + goto nopage; > + } > > goto restart; > } -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail143.messagelabs.com (mail143.messagelabs.com [216.82.254.35]) by kanga.kvack.org (Postfix) with ESMTP id 17A0D6B01C4 for ; Tue, 8 Jun 2010 17:27:25 -0400 (EDT) Date: Tue, 8 Jun 2010 14:27:19 -0700 From: Andrew Morton Subject: Re: [patch 13/18] oom: remove special handling for pagefault ooms Message-Id: <20100608142719.02d4f61a.akpm@linux-foundation.org> In-Reply-To: References: Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org To: David Rientjes Cc: Rik van Riel , Nick Piggin , Oleg Nesterov , Balbir Singh , KAMEZAWA Hiroyuki , KOSAKI Motohiro , linux-mm@kvack.org List-ID: On Sun, 6 Jun 2010 15:34:44 -0700 (PDT) David Rientjes wrote: > It is possible to remove the special pagefault oom handler It'd be useful to describe what services that handler provides and to then describe how these services are retained in the new version. > by simply oom > locking all system zones and then calling directly into out_of_memory(). > > All populated zones must have ZONE_OOM_LOCKED set, otherwise there is a > parallel oom killing in progress that will lead to eventual memory freeing > so it's not necessary to needlessly kill another task. Should that have read "otherwise if there is"? (the code comments actually clarify all this) > The context in > which the pagefault is allocating memory is unknown to the oom killer, so > this is done on a system-wide level. > > If a task has already been oom killed and hasn't fully exited yet, this > will be a no-op since select_bad_process() recognizes tasks across the > system with TIF_MEMDIE set. 
> -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail203.messagelabs.com (mail203.messagelabs.com [216.82.254.243]) by kanga.kvack.org (Postfix) with ESMTP id 97CE66B01C4 for ; Tue, 8 Jun 2010 17:34:46 -0400 (EDT) Date: Tue, 8 Jun 2010 14:34:01 -0700 From: Andrew Morton Subject: Re: [patch 02/18] oom: introduce find_lock_task_mm() to fix !mm false positives Message-Id: <20100608143401.65d7c932.akpm@linux-foundation.org> In-Reply-To: <20100608201739.GA11028@redhat.com> References: <20100608124246.9258ccab.akpm@linux-foundation.org> <20100608201403.GA10264@redhat.com> <20100608201739.GA11028@redhat.com> Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org To: Oleg Nesterov Cc: David Rientjes , Rik van Riel , Nick Piggin , Balbir Singh , KAMEZAWA Hiroyuki , KOSAKI Motohiro , linux-mm@kvack.org List-ID: On Tue, 8 Jun 2010 22:17:39 +0200 Oleg Nesterov wrote: > On 06/08, Oleg Nesterov wrote: > > > > On 06/08, Andrew Morton wrote: > > > > > > > - /* skip tasks that have already released their mm */ > > > > - if (!p->mm) > > > > - continue; > > > > We shouldn't remove this without removing OR updating the PF_EXITING check > > below. That is why we had another patch. > > > > This change alone allows to trivially disable oom-kill. If we have a process > > with the dead leader, select_bad_process() will always return -1. > > > > We either need another patch from Kosaki's series > > > > - if (p->flags & PF_EXITING) > > + if (p->flags & PF_EXITING && p->mm) > > OOPS, sorry. > > I didn't understand you are going to merge this change too. > > Probably oom-pf_exiting-check-should-take-mm-into-account.patch should > go ahead of this one for bisecting. OK, thanks, I did that. 
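Editorial sketch: the failure mode Oleg describes in the quoted text above (a process with a dead leader making select_bad_process() always return -1) can be reproduced with a small userspace model. Everything here is an assumption made for illustration: struct task_model, select_victim_model(), the flag value and the return codes are hypothetical, the score is a plain number standing in for badness(), and the special case for p == current and all locking are omitted.

```c
#include <stdbool.h>
#include <stddef.h>

#define PF_EXITING_MODEL 0x4u  /* illustrative flag bit, not taken from sched.h */

enum { SCAN_ABORTED = -1, NO_VICTIM = -2 };

/* Hypothetical stand-in for one process as seen by the scan. */
struct task_model {
	unsigned int flags;    /* PF_EXITING_MODEL or 0 */
	bool leader_has_mm;    /* models p->mm != NULL for the group leader */
	unsigned long points;  /* stand-in for the process's badness() score */
};

/*
 * Returns the index of the highest scoring task, SCAN_ABORTED if an exiting
 * task makes the scan give up, or NO_VICTIM if nothing scored above zero.
 * check_mm == false models "if (p->flags & PF_EXITING)";
 * check_mm == true models "if (p->flags & PF_EXITING && p->mm)".
 */
int select_victim_model(const struct task_model *tasks, size_t n, bool check_mm)
{
	int chosen = NO_VICTIM;
	unsigned long best = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		const struct task_model *p = &tasks[i];

		/* An exiting task is expected to free memory soon: stop here. */
		if ((p->flags & PF_EXITING_MODEL) &&
		    (!check_mm || p->leader_has_mm))
			return SCAN_ABORTED;

		if (p->points > best) {
			best = p->points;
			chosen = (int)i;
		}
	}
	return chosen;
}
```

With a task list that contains an exited leader whose sub-threads still hold memory (flags set, leader_has_mm false, non-zero points), the model aborts on every scan when check_mm is false and selects that process when check_mm is true.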
From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail191.messagelabs.com (mail191.messagelabs.com [216.82.242.19]) by kanga.kvack.org (Postfix) with ESMTP id 2D6366B01C1 for ; Tue, 8 Jun 2010 18:58:15 -0400 (EDT) Date: Tue, 8 Jun 2010 15:58:02 -0700 From: Andrew Morton Subject: Re: [patch 16/18] oom: badness heuristic rewrite Message-Id: <20100608155802.cdd4aff3.akpm@linux-foundation.org> In-Reply-To: References: Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org To: David Rientjes Cc: Rik van Riel , Nick Piggin , Oleg Nesterov , Balbir Singh , KAMEZAWA Hiroyuki , KOSAKI Motohiro , linux-mm@kvack.org List-ID: On Sun, 6 Jun 2010 15:34:54 -0700 (PDT) David Rientjes wrote: > This is a complete rewrite of the oom killer's badness() heuristic which is > used to determine which task to kill in oom conditions. The goal is to > make it as simple and predictable as possible so the results are better > understood and we end up killing the task which will lead to the most > memory freeing while still respecting the fine-tuning from userspace. It's not obvious from this description that the end result is better! Have you any testcases or scenarios which got improved? > Instead of basing the heuristic on mm->total_vm for each task, the task's > rss and swap space are used. This is a better indication of the > amount of memory that will be freeable if the oom killed task is chosen > and subsequently exits. Again, why should we optimise for the amount of memory which a killing will yield (if that's what you mean)? We only need to free enough memory to unblock the oom condition then proceed.
The last thing we want to do is to kill a process which has consumed 1000 CPU hours, or which is providing some system-critical service or whatever. Amount-of-memory-freeable is a relatively minor criterion. > This helps specifically in cases where KDE or > GNOME is chosen for oom kill on desktop systems instead of a memory > hogging task. It helps how? Examples and test cases? > The baseline for the heuristic is a proportion of memory that each task is > currently using in memory plus swap compared to the amount of "allowable" > memory. What does "swap" mean? swapspace includes swap-backed swapcache, un-swap-backed swapcache and non-resident swap. Which of all these is being used here and for what reason? > "Allowable," in this sense, means the system-wide resources for > unconstrained oom conditions, the set of mempolicy nodes, the mems > attached to current's cpuset, or a memory controller's limit. The > proportion is given on a scale of 0 (never kill) to 1000 (always kill), > roughly meaning that if a task has a badness() score of 500 that the task > consumes approximately 50% of allowable memory resident in RAM or in swap > space. So is a new aim of this code to also free up swap space? Confused. > The proportion is always relative to the amount of "allowable" memory and > not the total amount of RAM systemwide so that mempolicies and cpusets may > operate in isolation; they shall not need to know the true size of the > machine on which they are running if they are bound to a specific set of > nodes or mems, respectively. > > Root tasks are given 3% extra memory just like __vm_enough_memory() > provides in LSMs. In the event of two tasks consuming similar amounts of > memory, it is generally better to save root's task. > > Because of the change in the badness() heuristic's baseline, it is also > necessary to introduce a new user interface to tune it. 
It's not possible > to redefine the meaning of /proc/pid/oom_adj with a new scale since the > ABI cannot be changed for backward compatibility. Instead, a new tunable, > /proc/pid/oom_score_adj, is added that ranges from -1000 to +1000. It may > be used to polarize the heuristic such that certain tasks are never > considered for oom kill while others may always be considered. The value > is added directly into the badness() score so a value of -500, for > example, means to discount 50% of its memory consumption in comparison to > other tasks either on the system, bound to the mempolicy, in the cpuset, > or sharing the same memory controller. > > /proc/pid/oom_adj is changed so that its meaning is rescaled into the > units used by /proc/pid/oom_score_adj, and vice versa. Changing one of > these per-task tunables will rescale the value of the other to an > equivalent meaning. Although /proc/pid/oom_adj was originally defined as > a bitshift on the badness score, it now shares the same linear growth as > /proc/pid/oom_score_adj but with different granularity. This is required > so the ABI is not broken with userspace applications and allows oom_adj to > be deprecated for future removal. It was a mistake to add oom_adj in the first place, because it's a user-visible knob which is tied to a particular in-kernel implementation. As we're seeing now, the presence of that knob locks us into a particular implementation. Given that oom_score_adj is just a rescaled version of oom_adj (correct?), I guess things haven't got a lot worse on that front as a result of these changes. General observation regarding the patch description: I'm not seeing a lot of reason for merging the patch! What value does it bring to our users? What problems got solved? Some of Kosaki's observations sounded fairly serious so I'll go into wait-and-see mode on this patch. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org.
For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail144.messagelabs.com (mail144.messagelabs.com [216.82.254.51]) by kanga.kvack.org (Postfix) with ESMTP id B11FE6B01C3 for ; Tue, 8 Jun 2010 19:02:34 -0400 (EDT) Date: Tue, 8 Jun 2010 16:02:16 -0700 From: Andrew Morton Subject: Re: [patch 16/18] oom: badness heuristic rewrite Message-Id: <20100608160216.bc52112b.akpm@linux-foundation.org> In-Reply-To: <20100608194533.7657.A69D9226@jp.fujitsu.com> References: <20100608194533.7657.A69D9226@jp.fujitsu.com> Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org To: KOSAKI Motohiro Cc: David Rientjes , Rik van Riel , Nick Piggin , Oleg Nesterov , Balbir Singh , KAMEZAWA Hiroyuki , linux-mm@kvack.org List-ID: On Tue, 8 Jun 2010 20:41:56 +0900 (JST) KOSAKI Motohiro wrote: > > ... > > > --- a/mm/oom_kill.c > > +++ b/mm/oom_kill.c > > @@ -4,6 +4,8 @@ > > * Copyright (C) 1998,2000 Rik van Riel > > * Thanks go out to Claus Fischer for some serious inspiration and > > * for goading me into coding this file... > > + * Copyright (C) 2010 Google, Inc. > > + * Rewritten by David Rientjes > > don't put it. > Seems OK to me. It's a fairly substantial change and people have added their (c) in the past for smaller kernel changes. I guess one could even do this for a one-liner. > > ... > > > /* > > - * Niced processes are most likely less important, so double > > - * their badness points. > > + * The memory controller may have a limit of 0 bytes, so avoid a divide > > + * by zero if necessary. > > */ > > - if (task_nice(p) > 0) > > - points *= 2; > > You removed > - run time check > - cpu time check > - nice check > > but no described the reason. reviewers are puzzled. How do we review > this though we don't get your point? please write > > - What benerit is there? > - Why do you think no bad effect? 
> - How confirm do you? yup. > > > + if (!totalpages) > > + totalpages = 1; > > > > /* > > - * Superuser processes are usually more important, so we make it > > - * less likely that we kill those. > > + * The baseline for the badness score is the proportion of RAM that each > > + * task's rss and swap space use. > > */ > > - if (has_capability_noaudit(p, CAP_SYS_ADMIN) || > > - has_capability_noaudit(p, CAP_SYS_RESOURCE)) > > - points /= 4; > > + points = (get_mm_rss(p->mm) + get_mm_counter(p->mm, MM_SWAPENTS)) * 1000 / > > + totalpages; > > + task_unlock(p); > > > > /* > > - * We don't want to kill a process with direct hardware access. > > - * Not only could that mess up the hardware, but usually users > > - * tend to only have this flag set on applications they think > > - * of as important. > > + * Root processes get 3% bonus, just like the __vm_enough_memory() > > + * implementation used by LSMs. > > */ > > - if (has_capability_noaudit(p, CAP_SYS_RAWIO)) > > - points /= 4; > > + if (has_capability_noaudit(p, CAP_SYS_ADMIN)) > > + points -= 30; > > > CAP_SYS_ADMIN seems no good idea. CAP_SYS_ADMIN imply admin's interactive > process. but killing interactive process only cause force logout. but > killing system daemon can makes more catastrophic disaster. > > > Last of all, I'll pulled this one. but only do cherry-pick. > This change was unchangelogged, I don't know what it's for and I don't understand your comment about it. Apart from that, I'm doing great! -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail172.messagelabs.com (mail172.messagelabs.com [216.82.254.3]) by kanga.kvack.org (Postfix) with ESMTP id 06C406B01C3 for ; Tue, 8 Jun 2010 19:16:19 -0400 (EDT) Date: Tue, 8 Jun 2010 16:15:41 -0700 From: Andrew Morton Subject: Re: [patch 17/18] oom: add forkbomb penalty to badness heuristic Message-Id: <20100608161541.6b43b48c.akpm@linux-foundation.org> In-Reply-To: References: Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org To: David Rientjes Cc: Rik van Riel , Nick Piggin , Oleg Nesterov , Balbir Singh , KAMEZAWA Hiroyuki , KOSAKI Motohiro , linux-mm@kvack.org List-ID: On Sun, 6 Jun 2010 15:34:58 -0700 (PDT) David Rientjes wrote: > Add a forkbomb penalty for processes that fork an excessively large > number of children to penalize that group of tasks and not others. A > threshold is configurable from userspace to determine how many first- > generation execve children (those with their own address spaces) a task > may have before it is considered a forkbomb. This can be tuned by > altering the value in /proc/sys/vm/oom_forkbomb_thres, which defaults to > 1000. > > When a task has more than 1000 first-generation children with different > address spaces than itself, a penalty of > > (average rss of children) * (# of 1st generation execve children) > ----------------------------------------------------------------- > oom_forkbomb_thres > > is assessed. So, for example, using the default oom_forkbomb_thres of > 1000, the penalty is twice the average rss of all its execve children if > there are 2000 such tasks. A task is considered to count toward the > threshold if its total runtime is less than one second; for 1000 of such > tasks to exist, the parent process must be forking at an extremely high > rate either erroneously or maliciously. 
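For illustration, the penalty formula quoted above can be worked through in a small userspace C sketch. It is not the patch's code: struct child_info, its fields, and forkbomb_penalty() are hypothetical names, and averaging rss over only the children that count toward the threshold is an assumption, since the description does not say which children enter the average.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace model of one first-generation child. */
struct child_info {
	unsigned long rss_pages;	/* resident pages of this child */
	int own_mm;			/* nonzero: address space differs from the parent's */
	unsigned long runtime_ms;	/* total runtime in milliseconds */
};

/*
 * Sketch of the quoted formula:
 *   (average rss of children) * (# of counted children) / thres
 * A child counts only if it has its own address space and has run for
 * less than one second. No penalty unless the count exceeds thres.
 */
unsigned long forkbomb_penalty(const struct child_info *c, size_t n,
			       unsigned long thres)
{
	unsigned long total_rss = 0;
	unsigned long count = 0;
	size_t i;

	assert(thres > 0);

	for (i = 0; i < n; i++) {
		if (!c[i].own_mm)
			continue;	/* shares the parent's mm: not an execve child */
		if (c[i].runtime_ms >= 1000)
			continue;	/* long-lived child: not counted */
		total_rss += c[i].rss_pages;
		count++;
	}

	if (count <= thres)
		return 0;

	return (total_rss / count) * count / thres;
}
```

With 2000 counted children of 100 resident pages each and a threshold of 1000, the sketch gives a penalty of 200 pages, i.e. twice the average rss, which agrees with the example in the description.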
> > Even though a particular task may be designated a forkbomb and selected as > the victim, the oom killer will still kill the 1st generation execve child > with the highest badness() score in its place. This avoids killing > important servers or system daemons. When a web server forks a very large > number of threads for client connections, for example, it is much better > to kill one of those threads than to kill the server and make it > unresponsive. > - "oom_forkbomb_thresh" or "oom_forkbomb_threshold", please. - No new proc knobs! They lock us into implementation details. - Let's go outside the box: forkbomb is just a workload. Why does one particular workload need special-casing in the oom-killer? If the oom-kill was working well then when a forkbomb causes an oom, the oom-killer would kill whatever is necessary to unlock the system and will then let things proceed. IOW, if the oom-killer can't handle this particular workload gracefully without special-casing then it isn't working well enough. Now, maybe there is an argument that a forkbomb is sufficiently damaging to warrant adding special-case handling in the kernel. But if so, it should be detected and handled at sys_fork() (RLIMIT_NPROC?), not in the oom-killer. Or, better, the kernel should be fixed so that whatever damage the forkbomb causes doesn't get caused any more. (otoh, the oom-killer is already stuffed full of heuristics and this is just another one. But it should work correctly without it, dammit!) -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ .
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail138.messagelabs.com (mail138.messagelabs.com [216.82.249.35]) by kanga.kvack.org (Postfix) with ESMTP id BE4546B01B2 for ; Tue, 8 Jun 2010 19:19:30 -0400 (EDT) Date: Tue, 8 Jun 2010 16:18:44 -0700 From: Andrew Morton Subject: Re: [patch 18/18] oom: deprecate oom_adj tunable Message-Id: <20100608161844.04d2f2a1.akpm@linux-foundation.org> In-Reply-To: <20100608194514.7654.A69D9226@jp.fujitsu.com> References: <20100608194514.7654.A69D9226@jp.fujitsu.com> Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org To: KOSAKI Motohiro Cc: David Rientjes , Rik van Riel , Nick Piggin , Oleg Nesterov , Balbir Singh , KAMEZAWA Hiroyuki , linux-mm@kvack.org List-ID: On Tue, 8 Jun 2010 20:42:02 +0900 (JST) KOSAKI Motohiro wrote: > > + /* > > + * Warn that /proc/pid/oom_adj is deprecated, see > > + * Documentation/feature-removal-schedule.txt. > > + */ > > + printk_once(KERN_WARNING "%s (%d): /proc/%d/oom_adj is deprecated, " > > + "please use /proc/%d/oom_score_adj instead.\n", > > + current->comm, task_pid_nr(current), > > + task_pid_nr(task), task_pid_nr(task)); > > task->signal->oom_adj = oom_adjust; > > Sorry, we can't accept this. oom_adj is one of most freqently used > tuning knob. putting this one makes a lot of confusion. > > In addition, this knob is used from some applications (please google > by google code search or something else). that said, an enduser can't > stop the warning. that makes a lot of frustration. NO. > I think it's OK. We made a mistake in adding oom_adj in the first place and now we get to live with the consequences. We'll be stuck with oom_adj for the next 200 years if we don't tell people to stop using it, and a printk_once() is a good way of doing that. It could be that in two years time we decide that we can't remove oom_adj yet because too many people are still using it. 
Maybe it will take ten years - but unless we add the above printk, oom_adj will remain forever. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail137.messagelabs.com (mail137.messagelabs.com [216.82.249.19]) by kanga.kvack.org (Postfix) with ESMTP id 210366B01C1 for ; Tue, 8 Jun 2010 19:40:15 -0400 (EDT) Received: from kpbe12.cbf.corp.google.com (kpbe12.cbf.corp.google.com [172.25.105.76]) by smtp-out.google.com with ESMTP id o58NeB9Y027941 for ; Tue, 8 Jun 2010 16:40:11 -0700 Received: from pvh1 (pvh1.prod.google.com [10.241.210.193]) by kpbe12.cbf.corp.google.com with ESMTP id o58NeA1g021454 for ; Tue, 8 Jun 2010 16:40:10 -0700 Received: by pvh1 with SMTP id 1so186857pvh.1 for ; Tue, 08 Jun 2010 16:40:10 -0700 (PDT) Date: Tue, 8 Jun 2010 16:40:07 -0700 (PDT) From: David Rientjes Subject: Re: [patch 01/18] oom: check PF_KTHREAD instead of !mm to skip kthreads In-Reply-To: <20100608123320.11e501a4.akpm@linux-foundation.org> Message-ID: References: <20100608123320.11e501a4.akpm@linux-foundation.org> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-linux-mm@kvack.org To: Andrew Morton Cc: Rik van Riel , Nick Piggin , Oleg Nesterov , Balbir Singh , KAMEZAWA Hiroyuki , KOSAKI Motohiro , linux-mm@kvack.org List-ID: On Tue, 8 Jun 2010, Andrew Morton wrote: > > From: Oleg Nesterov > > > > select_bad_process() thinks a kernel thread can't have ->mm != NULL, this > > is not true due to use_mm(). > > > > Change the code to check PF_KTHREAD. 
> > > > Reviewed-by: KAMEZAWA Hiroyuki > > Signed-off-by: Oleg Nesterov > > Signed-off-by: David Rientjes > > --- > > mm/oom_kill.c | 9 +++------ > > 1 files changed, 3 insertions(+), 6 deletions(-) > > > > diff --git a/mm/oom_kill.c b/mm/oom_kill.c > > --- a/mm/oom_kill.c > > +++ b/mm/oom_kill.c > > @@ -256,14 +256,11 @@ static struct task_struct *select_bad_process(unsigned long *ppoints, > > for_each_process(p) { > > unsigned long points; > > > > - /* > > - * skip kernel threads and tasks which have already released > > - * their mm. > > - */ > > + /* skip tasks that have already released their mm */ > > if (!p->mm) > > continue; > > - /* skip the init task */ > > - if (is_global_init(p)) > > + /* skip the init task and kthreads */ > > + if (is_global_init(p) || (p->flags & PF_KTHREAD)) > > continue; > > if (mem && !task_in_mem_cgroup(p, mem)) > > continue; > > Applied, thanks. A minor bugfix. > Thanks! I didn't see it added to -mm, though, so I'll assume it's being queued for 2.6.35-rc3 instead. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail203.messagelabs.com (mail203.messagelabs.com [216.82.254.243]) by kanga.kvack.org (Postfix) with ESMTP id 294226B01C6 for ; Tue, 8 Jun 2010 19:44:15 -0400 (EDT) Date: Tue, 8 Jun 2010 16:43:25 -0700 From: Andrew Morton Subject: Re: [patch 09/18] oom: select task from tasklist for mempolicy ooms Message-Id: <20100608164325.a5fcdb39.akpm@linux-foundation.org> In-Reply-To: References: Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org To: David Rientjes Cc: Rik van Riel , Nick Piggin , Oleg Nesterov , Balbir Singh , KAMEZAWA Hiroyuki , KOSAKI Motohiro , linux-mm@kvack.org, Andrea Arcangeli List-ID: On Sun, 6 Jun 2010 15:34:31 -0700 (PDT) David Rientjes wrote: > The oom killer presently kills current whenever there is no more memory > free or reclaimable on its mempolicy's nodes. There is no guarantee that > current is a memory-hogging task or that killing it will free any > substantial amount of memory, however. Well OK. But we don't necessarily *want* to "free a substantial amount of memory". We want to resolve the oom within `current'. That's the sole responsibility of the oom-killer. It doesn't have to free up large amounts of additional memory in the expectation that sometime in the future some other task will get an oom as well. If the oom-killer is working well, we can defer those actions until the problem actually occurs. Plus: if `current' isn't using much memory then it's probably a short-lived or not-very-important process anyway. > In such situations, it is better to scan the tasklist for nodes that are > allowed to allocate on current's set of nodes and kill the task with the > highest badness() score. This ensures that the most memory-hogging task, > or the one configured by the user with /proc/pid/oom_adj, is always > selected in such scenarios. Well... *why* is it better?
Needs more justification/explanation IMO. A long time ago Andrea changed the oom-killer so that it basically always killed `current', iirc. I think that shipped in the Suse kernel. Maybe it was only in the case where `current' got an oom when satisfying a pagefault, I forget the details. But according to Andrea, this design provided a simple and practical solution to ooms. So I think this policy change would benefit from a more convincing justification. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail190.messagelabs.com (mail190.messagelabs.com [216.82.249.51]) by kanga.kvack.org (Postfix) with ESMTP id AC2A76B01D0 for ; Tue, 8 Jun 2010 19:50:13 -0400 (EDT) Received: from wpaz1.hot.corp.google.com (wpaz1.hot.corp.google.com [172.24.198.65]) by smtp-out.google.com with ESMTP id o58No981024300 for ; Tue, 8 Jun 2010 16:50:09 -0700 Received: from pvg3 (pvg3.prod.google.com [10.241.210.131]) by wpaz1.hot.corp.google.com with ESMTP id o58NnfAK028183 for ; Tue, 8 Jun 2010 16:50:08 -0700 Received: by pvg3 with SMTP id 3so1406880pvg.18 for ; Tue, 08 Jun 2010 16:50:08 -0700 (PDT) Date: Tue, 8 Jun 2010 16:50:02 -0700 (PDT) From: David Rientjes Subject: Re: [patch 02/18] oom: introduce find_lock_task_mm() to fix !mm false positives In-Reply-To: <20100608124246.9258ccab.akpm@linux-foundation.org> Message-ID: References: <20100608124246.9258ccab.akpm@linux-foundation.org> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-linux-mm@kvack.org To: Andrew Morton Cc: Rik van Riel , Nick Piggin , Oleg Nesterov , Balbir Singh , KAMEZAWA Hiroyuki , KOSAKI Motohiro , linux-mm@kvack.org List-ID: On Tue, 8 Jun 2010, Andrew Morton wrote: > > From: Oleg Nesterov > > > > Almost all ->mm == NULL checks in oom_kill.c are wrong.
> > > > The current code assumes that the task without ->mm has already > > released its memory and ignores the process. However this is not > > necessarily true when this process is multithreaded, other live > > sub-threads can use this ->mm. > > > > - Remove the "if (!p->mm)" check in select_bad_process(), it is > > just wrong. > > > > - Add the new helper, find_lock_task_mm(), which finds the live > > thread which uses the memory and takes task_lock() to pin ->mm > > > > - change oom_badness() to use this helper instead of just checking > > ->mm != NULL. > > > > - As David pointed out, select_bad_process() must never choose the > > task without ->mm, but no matter what oom_badness() returns the > > task can be chosen if nothing else has been found yet. > > > > Change oom_badness() to return int, change it to return -1 if > > find_lock_task_mm() fails, and change select_bad_process() to > > check points >= 0. > > > > Note! This patch is not enough, we need more changes. > > > > - oom_badness() was fixed, but oom_kill_task() still ignores > > the task without ->mm > > > > - oom_forkbomb_penalty() should use find_lock_task_mm() too, > > and it also needs other changes to actually find the first > > first-descendant children > > > > This will be addressed later. > > > > [kosaki.motohiro@jp.fujitsu.com: use in badness(), __oom_kill_task()] > > Signed-off-by: Oleg Nesterov > > Signed-off-by: David Rientjes > > I assume from the above that we should have a Signed-off-by:kosaki > here. I didn't make that change yet - please advise. > Oops, that was accidentally dropped, sorry about that. I folded two of his patches into this one since it introduces find_lock_task_mm() and it needs to be used in the places KOSAKI fixed as well. His original patches are at http://marc.info/?l=linux-mm&m=127537136419677 http://marc.info/?l=linux-mm&m=127537153619893 along with his sign-off.
> > > mm/oom_kill.c | 74 +++++++++++++++++++++++++++++++++------------------------ > > 1 files changed, 43 insertions(+), 31 deletions(-) > > > > diff --git a/mm/oom_kill.c b/mm/oom_kill.c > > --- a/mm/oom_kill.c > > +++ b/mm/oom_kill.c > > @@ -52,6 +52,20 @@ static int has_intersects_mems_allowed(struct task_struct *tsk) > > return 0; > > } > > > > +static struct task_struct *find_lock_task_mm(struct task_struct *p) > > +{ > > + struct task_struct *t = p; > > + > > + do { > > + task_lock(t); > > + if (likely(t->mm)) > > + return t; > > + task_unlock(t); > > + } while_each_thread(p, t); > > + > > + return NULL; > > +} > > What pins `p'? Ah, caller must hold tasklist_lock. > I'll add a comment about this in a followup patch; it should remove the confusion others have had about the naming of the function as well, which I think is good but could use some explanation. > > /** > > * badness - calculate a numeric value for how bad this task has been > > * @p: task struct of which task we should calculate > > @@ -74,8 +88,8 @@ static int has_intersects_mems_allowed(struct task_struct *tsk) > > unsigned long badness(struct task_struct *p, unsigned long uptime) > > { > > unsigned long points, cpu_time, run_time; > > - struct mm_struct *mm; > > struct task_struct *child; > > + struct task_struct *c, *t; > > int oom_adj = p->signal->oom_adj; > > struct task_cputime task_time; > > unsigned long utime; > > @@ -84,17 +98,14 @@ unsigned long badness(struct task_struct *p, unsigned long uptime) > > if (oom_adj == OOM_DISABLE) > > return 0; > > > > - task_lock(p); > > - mm = p->mm; > > - if (!mm) { > > - task_unlock(p); > > + p = find_lock_task_mm(p); > > + if (!p) > > return 0; > > - } > > > > /* > > * The memory size of the process is the basis for the badness. > > */ > > - points = mm->total_vm; > > + points = p->mm->total_vm; > > > > /* > > * After this unlock we can no longer dereference local variable `mm' > > This comment is stale. Replace with p->mm.
> Indeed, find_lock_task_mm() returns with task_lock() held for p->mm here so the dereference is always safe. I'll send a followup. > > @@ -115,12 +126,17 @@ unsigned long badness(struct task_struct *p, unsigned long uptime) > > * child is eating the vast majority of memory, adding only half > > * to the parents will make the child our kill candidate of choice. > > */ > > - list_for_each_entry(child, &p->children, sibling) { > > - task_lock(child); > > - if (child->mm != mm && child->mm) > > - points += child->mm->total_vm/2 + 1; > > - task_unlock(child); > > - } > > + t = p; > > + do { > > + list_for_each_entry(c, &t->children, sibling) { > > + child = find_lock_task_mm(c); > > + if (child) { > > + if (child->mm != p->mm) > > + points += child->mm->total_vm/2 + 1; > > What if 1000 children share the same mm? Doesn't this give a grossly > wrong result? > It does, and that's why there has been large criticism about this particular part of the heuristic over the past few months. It gets removed in my badness() rewrite, but the change here is concerned solely about the use_mm() race so closes a gap that currently exists.
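The overcounting acknowledged above can be seen in a small userspace model of the quoted loop. This is only a sketch: struct mm, struct task and child_points() are hypothetical stand-ins for the kernel structures, locking and the walk over the parent's threads are left out, and only the arithmetic from the diff (total_vm/2 + 1 for each child whose mm differs from the parent's) is kept.

```c
#include <stddef.h>

/* Hypothetical stand-ins for mm_struct and task_struct. */
struct mm {
	unsigned long total_vm;		/* pages mapped by this address space */
};

struct task {
	const struct mm *mm;		/* NULL once the task has dropped its mm */
};

/*
 * Mirrors the quoted per-child accounting: every child whose mm differs
 * from the parent's adds half of that mm's total_vm plus one, whether or
 * not another child has already been charged for the same mm.
 */
unsigned long child_points(const struct mm *parent_mm,
			   const struct task *children, size_t n)
{
	unsigned long points = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		const struct mm *cmm = children[i].mm;

		if (cmm && cmm != parent_mm)
			points += cmm->total_vm / 2 + 1;
	}
	return points;
}
```

If 1000 children share a single mm of 1000 pages, the model charges 501 points per child, 501000 in total, although only 1000 pages stand behind all of them.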
> > + task_unlock(child); > > + } > > + } > > + } while_each_thread(p, t); > > > > /* > > * CPU time is in tens of seconds and run time is in thousands > > @@ -256,9 +272,6 @@ static struct task_struct *select_bad_process(unsigned long *ppoints, > > for_each_process(p) { > > unsigned long points; > > > > - /* skip tasks that have already released their mm */ > > - if (!p->mm) > > - continue; > > /* skip the init task and kthreads */ > > if (is_global_init(p) || (p->flags & PF_KTHREAD)) > > continue; > > @@ -385,14 +398,9 @@ static void __oom_kill_task(struct task_struct *p, int verbose) > > return; > > } > > > > - task_lock(p); > > - if (!p->mm) { > > - WARN_ON(1); > > - printk(KERN_WARNING "tried to kill an mm-less task %d (%s)!\n", > > - task_pid_nr(p), p->comm); > > - task_unlock(p); > > + p = find_lock_task_mm(p); > > + if (!p) > > return; > > - } > > > > if (verbose) > > printk(KERN_ERR "Killed process %d (%s) " > > @@ -437,6 +445,7 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order, > > const char *message) > > { > > struct task_struct *c; > > + struct task_struct *t = p; > > > > if (printk_ratelimit()) > > dump_header(p, gfp_mask, order, mem); > > @@ -454,14 +463,17 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order, > > message, task_pid_nr(p), p->comm, points); > > > > /* Try to kill a child first */ > > It'd be nice to improve the comments a bit. This one tells us the > "what" (which is usually obvious) but didn't tell us "why", which is > often the unobvious. > This gets modified in oom-sacrifice-child-with-highest-badness-score-for-parent.patch, so I'll expand upon it there and post a followup patch since it's already merged. Thanks! -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail202.messagelabs.com (mail202.messagelabs.com [216.82.254.227]) by kanga.kvack.org (Postfix) with ESMTP id 81D526B01D2 for ; Tue, 8 Jun 2010 19:53:01 -0400 (EDT) Date: Tue, 8 Jun 2010 16:52:53 -0700 From: Andrew Morton Subject: Re: [patch 01/18] oom: check PF_KTHREAD instead of !mm to skip kthreads Message-Id: <20100608165253.fc4871cb.akpm@linux-foundation.org> In-Reply-To: References: <20100608123320.11e501a4.akpm@linux-foundation.org> Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org To: David Rientjes Cc: Rik van Riel , Nick Piggin , Oleg Nesterov , Balbir Singh , KAMEZAWA Hiroyuki , KOSAKI Motohiro , linux-mm@kvack.org List-ID: On Tue, 8 Jun 2010 16:40:07 -0700 (PDT) David Rientjes wrote: > > > > Applied, thanks. A minor bugfix. > > > > Thanks! I didn't see it added to -mm, though, doh. > so I'll assume it's being > queued for 2.6.35-rc3 instead. Linus is being all strict - "regression and oops fixes only", and I don't think a fix of this magnitude passes the test. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail191.messagelabs.com (mail191.messagelabs.com [216.82.242.19]) by kanga.kvack.org (Postfix) with ESMTP id 43D6C6B01DD for ; Tue, 8 Jun 2010 20:06:43 -0400 (EDT) Received: from kpbe15.cbf.corp.google.com (kpbe15.cbf.corp.google.com [172.25.105.79]) by smtp-out.google.com with ESMTP id o5906dsK012739 for ; Tue, 8 Jun 2010 17:06:40 -0700 Received: from pva18 (pva18.prod.google.com [10.241.209.18]) by kpbe15.cbf.corp.google.com with ESMTP id o5906a00019882 for ; Tue, 8 Jun 2010 17:06:38 -0700 Received: by pva18 with SMTP id 18so6116027pva.0 for ; Tue, 08 Jun 2010 17:06:36 -0700 (PDT) Date: Tue, 8 Jun 2010 17:06:34 -0700 (PDT) From: David Rientjes Subject: Re: [patch 03/18] oom: dump_tasks use find_lock_task_mm too In-Reply-To: <20100608125533.086a4191.akpm@linux-foundation.org> Message-ID: References: <20100608125533.086a4191.akpm@linux-foundation.org> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-linux-mm@kvack.org To: Andrew Morton , KOSAKI Motohiro Cc: Rik van Riel , Nick Piggin , Oleg Nesterov , Balbir Singh , KAMEZAWA Hiroyuki , linux-mm@kvack.org List-ID: On Tue, 8 Jun 2010, Andrew Morton wrote: > > From: KOSAKI Motohiro > > > > dump_task() should use find_lock_task_mm() too. It is necessary for > > protecting task-exiting race. > > A full description of the race would help people understand the code > and the change. > Ok, here's a description of it that you can add to KOSAKI's changelog if you'd like: dump_tasks() currently filters any task that does not have an attached ->mm since it incorrectly assumes that it must either be in process of exiting and has detached its memory or that it's a kernel thread; multithreaded tasks may actually have subthreads that have a valid ->mm pointer and thus those threads should actually be displayed. 
This change finds those threads, if they exist, and emits their information along with the rest of the candidate tasks for kill. > > Signed-off-by: KOSAKI Motohiro > > Signed-off-by: David Rientjes > > --- > > mm/oom_kill.c | 39 +++++++++++++++++++++------------------ > > 1 files changed, 21 insertions(+), 18 deletions(-) > > > > diff --git a/mm/oom_kill.c b/mm/oom_kill.c > > --- a/mm/oom_kill.c > > +++ b/mm/oom_kill.c > > @@ -336,35 +336,38 @@ static struct task_struct *select_bad_process(unsigned long *ppoints, > > */ > > static void dump_tasks(const struct mem_cgroup *mem) > > The comment over this function needs to be updated to describe the role > of incoming argument `mem'. > Ok, I can take care of this as another comment cleanup in a followup patch. > > { > > - struct task_struct *g, *p; > > + struct task_struct *p; > > + struct task_struct *task; > > > > printk(KERN_INFO "[ pid ] uid tgid total_vm rss cpu oom_adj " > > "name\n"); > > - do_each_thread(g, p) { > > - struct mm_struct *mm; > > - > > - if (mem && !task_in_mem_cgroup(p, mem)) > > + for_each_process(p) { > > The switch from do_each_thread() to for_each_process() is > unchangelogged. It looks like a little cleanup to me. > > > + /* > > + * We don't have is_global_init() check here, because the old > > + * code do that. printing init process is not big matter. But > > + * we don't hope to make unnecessary compatibility breaking. > > + */ > > When merging others' patches, please do review and if necessary fix or > enhance the comments and the changelog. I don't think people take > offense. > Ok, I wasn't sure of the etiquette and I didn't want anything else holding this work up. > Also, I don't think it's really valuable to document *changes* within > the code comments. This comment is referring to what the old code did > versus the new code. Generally it's best to just document the code as > it presently stands and leave the documentation of the delta to the > changelog.
> > That's not always true, of course - we should document oddball code > which is left there for userspace-visible back-compatibility reasons. > Agreed, I think KOSAKI might be working on a patch that moves all of this tasklist filtering logic to a helper function and would probably fix this up. KOSAKI? > > > + if (p->flags & PF_KTHREAD) > > continue; > > - if (!thread_group_leader(p)) > > + if (mem && !task_in_mem_cgroup(p, mem)) > > continue; > > > > - task_lock(p); > > - mm = p->mm; > > - if (!mm) { > > + task = find_lock_task_mm(p); > > + if (!task) { > > /* > > - * total_vm and rss sizes do not exist for tasks with no > > - * mm so there's no need to report them; they can't be > > - * oom killed anyway. > > + * Probably oom vs task-exiting race was happen and ->mm > > + * have been detached. thus there's no need to report > > + * them; they can't be oom killed anyway. > > */ > > OK, that hinted at the race but still didn't really tell readers what it is. > It's actually mostly incorrect, it does short-circuit the iteration when a task is found to have already exited or detached its memory while we're holding tasklist_lock, but the old comment was probably better. The scenario where this condition will be true 99% of the time is when iterating through the tasklist and finding a kthread. I'll fix this up. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail191.messagelabs.com (mail191.messagelabs.com [216.82.242.19]) by kanga.kvack.org (Postfix) with ESMTP id DB2AF6B01DF for ; Tue, 8 Jun 2010 20:14:48 -0400 (EDT) Received: from wpaz5.hot.corp.google.com (wpaz5.hot.corp.google.com [172.24.198.69]) by smtp-out.google.com with ESMTP id o590Ekmu024643 for ; Tue, 8 Jun 2010 17:14:47 -0700 Received: from pzk7 (pzk7.prod.google.com [10.243.19.135]) by wpaz5.hot.corp.google.com with ESMTP id o590EiHp012507 for ; Tue, 8 Jun 2010 17:14:45 -0700 Received: by pzk7 with SMTP id 7so2938428pzk.30 for ; Tue, 08 Jun 2010 17:14:44 -0700 (PDT) Date: Tue, 8 Jun 2010 17:14:42 -0700 (PDT) From: David Rientjes Subject: Re: [patch 05/18] oom: give current access to memory reserves if it has been killed In-Reply-To: <20100608130804.8794d029.akpm@linux-foundation.org> Message-ID: References: <20100608130804.8794d029.akpm@linux-foundation.org> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-linux-mm@kvack.org To: Andrew Morton Cc: Rik van Riel , Nick Piggin , Oleg Nesterov , Balbir Singh , KAMEZAWA Hiroyuki , KOSAKI Motohiro , linux-mm@kvack.org List-ID: On Tue, 8 Jun 2010, Andrew Morton wrote: > > It's possible to livelock the page allocator if a thread has mm->mmap_sem > > What is the state of this thread? Trying to allocate memory, I assume. > Right, which I agree is a bad scenario to be in but indeed does happen (and we have a workaround at Google that identifies these particular cases and kills the holder of the writelock on mm->mmap_sem). We have one thread holding a readlock on mm->mmap_sem while trying to allocate memory so the oom killer becomes a no-op to prevent needless task killing while waiting for the killed task to exit, but that killed task can't exit because it requires a writelock on the same semaphore. 
> > and fails to make forward progress because the oom killer selects another > > thread sharing the same ->mm to kill that cannot exit until the semaphore > > is dropped. > > > > The oom killer will not kill multiple tasks at the same time; each oom > > killed task must exit before another task may be killed. > > This sounds like a quite risky design. The possibility that we'll > cause other dead/livelocks similar to this one seems pretty high. It > applies to all sleeping locks in the entire kernel, doesn't it? > It applies to any writelock that is taken during the exitpath of an oom killed task if a thread holding a readlock is trying to allocate memory itself. This is how it's always been done at least within the past few years and we haven't had a problem other than with mm->mmap_sem. At one point we used an oom killer timeout to kill other tasks after a period of time had elapsed, but that hasn't been required since we've been killing the thread holding the writelock on mm->mmap_sem. > > Thus, if one > > thread is holding mm->mmap_sem and cannot allocate memory, all threads > > sharing the same ->mm are blocked from exiting as well. In the oom kill > > case, that means the thread holding mm->mmap_sem will never free > > additional memory since it cannot get access to memory reserves and the > > thread that depends on it with access to memory reserves cannot exit > > because it cannot acquire the semaphore. Thus, the page allocators > > livelocks. > > > > When the oom killer is called and current happens to have a pending > > SIGKILL, this patch automatically gives it access to memory reserves and > > returns. Upon returning to the page allocator, its allocation will > > hopefully succeed so it can quickly exit and free its memory. If not, the > > page allocator will fail the allocation if it is not __GFP_NOFAIL. > > You said "hopefully". 
> "hopefully" in this case means that the allocation had better succeed or we've depleted all memory reserves and we're deadlocked; it doesn't mean that this is a speculative change that may or may not work. > Does it actually work? Any real-world testing results? If so, they'd > be a useful addition to the changelog. > It certainly does, and it prevents needlessly killing another task when we know current is exiting. The nice thing about that is that we don't need to do anything like checking if a child should be sacrificed or if current is OOM_DISABLE: we already know it's dying so it should simply get access to memory reserves either to return and handle its pending SIGKILL or continue down the exitpath. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail138.messagelabs.com (mail138.messagelabs.com [216.82.249.35]) by kanga.kvack.org (Postfix) with ESMTP id B0BFA6B01DF for ; Tue, 8 Jun 2010 20:25:35 -0400 (EDT) Received: from wpaz1.hot.corp.google.com (wpaz1.hot.corp.google.com [172.24.198.65]) by smtp-out.google.com with ESMTP id o590PXQn026009 for ; Tue, 8 Jun 2010 17:25:33 -0700 Received: from pzk33 (pzk33.prod.google.com [10.243.19.161]) by wpaz1.hot.corp.google.com with ESMTP id o590PEoV027716 for ; Tue, 8 Jun 2010 17:25:32 -0700 Received: by pzk33 with SMTP id 33so5554763pzk.17 for ; Tue, 08 Jun 2010 17:25:31 -0700 (PDT) Date: Tue, 8 Jun 2010 17:25:26 -0700 (PDT) From: David Rientjes Subject: Re: [patch 07/18] oom: filter tasks not sharing the same cpuset In-Reply-To: <20100608132339.54db2317.akpm@linux-foundation.org> Message-ID: References: <20100608132339.54db2317.akpm@linux-foundation.org> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-linux-mm@kvack.org To: Andrew Morton Cc: Rik van Riel , Nick Piggin , Oleg Nesterov , Balbir Singh
, KAMEZAWA Hiroyuki , KOSAKI Motohiro , linux-mm@kvack.org List-ID: On Tue, 8 Jun 2010, Andrew Morton wrote: > > Tasks that do not share the same set of allowed nodes with the task that > > triggered the oom should not be considered as candidates for oom kill. > > > > Tasks in other cpusets with a disjoint set of mems would be unfairly > > penalized otherwise because of oom conditions elsewhere; an extreme > > example could unfairly kill all other applications on the system if a > > single task in a user's cpuset sets itself to OOM_DISABLE and then uses > > more memory than allowed. > > > > Killing tasks outside of current's cpuset rarely would free memory for > > current anyway. To use a sane heuristic, we must ensure that killing a > > task would likely free memory for current and avoid needlessly killing > > others at all costs just because their potential memory freeing is > > unknown. It is better to kill current than another task needlessly. > > This is all a bit arbitrary, isn't it? The key word here is "rarely". "rarely" certainly is an arbitrary term in this case because it depends heavily on the memory usage of other cpusets on the system. Consider a cpuset with 16G of memory and a single task which consumes most of that memory. Then consider a cpuset with a single 1G node and a task that ooms within it; the 16G task in the other cpuset gets killed. There must either be a complete exclusion or inclusion of a task for candidacy if the scale of memory usage amongst our cpusets cannot be properly attributed with a single heuristic (such as divide by 4, divide by 8, etc). To me, it never seems appropriate to penalize another cpuset's tasks by the small chance that it may have allocated atomic memory elsewhere or the nodes have been recently changed. The goal is to be more predictable about oom killing decisions without negatively impacting other cpusets, and this is a step in that direction.
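To make the 16G/1G example above concrete, here is a minimal userspace sketch, not kernel code. The `struct cand` type, the two `pick_*` helpers, the `example` array, and the use of resident size in megabytes as the score are all assumptions made for illustration; the divisor 8 is only one of the scalar penalties mentioned above.

```c
#include <stddef.h>

/* Hypothetical model of an oom candidate; none of this is kernel API. */
struct cand {
	unsigned long mems_allowed;	/* bitmask of nodes the task may use */
	unsigned long rss_mb;		/* stand-in for its badness score */
};

/* Scalar penalty: tasks on disjoint nodes stay eligible at score / 8. */
static int pick_with_penalty(const struct cand *c, size_t n,
			     unsigned long cur_mems)
{
	unsigned long best = 0;
	int chosen = -1;

	for (size_t i = 0; i < n; i++) {
		unsigned long pts = c[i].rss_mb;

		if (!(c[i].mems_allowed & cur_mems))
			pts /= 8;
		if (pts > best) {
			best = pts;
			chosen = (int)i;
		}
	}
	return chosen;
}

/* Complete exclusion: tasks on disjoint nodes are never candidates. */
static int pick_with_exclusion(const struct cand *c, size_t n,
			       unsigned long cur_mems)
{
	unsigned long best = 0;
	int chosen = -1;

	for (size_t i = 0; i < n; i++) {
		if (!(c[i].mems_allowed & cur_mems))
			continue;
		if (c[i].rss_mb > best) {
			best = c[i].rss_mb;
			chosen = (int)i;
		}
	}
	return chosen;
}

/* Cpuset A: nodes 0-3, one ~15G task. Cpuset B: node 4, one ~900M task. */
static const struct cand example[] = {
	{ 0x0fUL, 15000 },
	{ 0x10UL,   900 },
};
```

With an oom on node 4 (`cur_mems == 0x10UL`), the penalized score of the large task is 15000 / 8 = 1875, which still beats 900, so `pick_with_penalty()` selects the task in the other cpuset even though killing it frees nothing on node 4. `pick_with_exclusion()` selects the local task instead.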
> If indeed this task had allocated gobs of memory from `current's nodes > and then sneakily switched nodes, this will be a big regression! > It could be, but that's the fault of userspace for allocating a node that is almost full to a new cpuset and expecting it to be completely free. In other words, we can arrange our cpusets with mems however we want but we need some guarantee that giving a cpuset completely free memory and then killing a task within it because another cpuset went oom doesn't happen. > So.. It's not completely clear to me how we justify this decision. > Are we erring too far on the side of keep-tasks-running? Is failing to > clear the oom a lot bigger problem than killing an innocent task? I > think so. In which case we should err towards slaughtering the > innocent? > The one thing we know is that if the victim's mems_allowed is truly disjoint from current that there's no guarantee we'll be freeing memory at all. And if we free any, it's the result of the GFP_ATOMIC allocations that are allowed anywhere or was previously allocated on one of current's mems. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail172.messagelabs.com (mail172.messagelabs.com [216.82.254.3]) by kanga.kvack.org (Postfix) with ESMTP id 360CE6B01D2 for ; Tue, 8 Jun 2010 20:30:32 -0400 (EDT) Received: from hpaq7.eem.corp.google.com (hpaq7.eem.corp.google.com [172.25.149.7]) by smtp-out.google.com with ESMTP id o590URMt024073 for ; Tue, 8 Jun 2010 17:30:29 -0700 Received: from pvc7 (pvc7.prod.google.com [10.241.209.135]) by hpaq7.eem.corp.google.com with ESMTP id o590UPaO015672 for ; Tue, 8 Jun 2010 17:30:26 -0700 Received: by pvc7 with SMTP id 7so38533pvc.34 for ; Tue, 08 Jun 2010 17:30:25 -0700 (PDT) Date: Tue, 8 Jun 2010 17:30:21 -0700 (PDT) From: David Rientjes Subject: Re: [patch 08/18] oom: sacrifice child with highest badness score for parent In-Reply-To: <20100608133356.6e941d20.akpm@linux-foundation.org> Message-ID: References: <20100608133356.6e941d20.akpm@linux-foundation.org> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-linux-mm@kvack.org To: Andrew Morton Cc: Rik van Riel , Nick Piggin , Oleg Nesterov , Balbir Singh , KAMEZAWA Hiroyuki , KOSAKI Motohiro , linux-mm@kvack.org List-ID: On Tue, 8 Jun 2010, Andrew Morton wrote: > > diff --git a/mm/oom_kill.c b/mm/oom_kill.c > > --- a/mm/oom_kill.c > > +++ b/mm/oom_kill.c > > @@ -441,8 +441,11 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order, > > unsigned long points, struct mem_cgroup *mem, > > const char *message) > > { > > + struct task_struct *victim = p; > > struct task_struct *c; > > struct task_struct *t = p; > > + unsigned long victim_points = 0; > > + struct timespec uptime; > > > > if (printk_ratelimit()) > > dump_header(p, gfp_mask, order, mem); > > @@ -456,22 +459,30 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order, > > return 0; > > } > > > > - printk(KERN_ERR "%s: kill process %d (%s) score %li or a child\n", > > - message, task_pid_nr(p), 
p->comm, points); > > + pr_err("%s: Kill process %d (%s) score %lu or sacrifice child\n", > > + message, task_pid_nr(p), p->comm, points); > > fyi, access to another task's ->comm is racy against prctl(). Fixable > with get_task_comm(). But that takes task_lock(), which is risky in > this code. The world wouldn't end if we didn't fix this ;) > I'll look into doing that, thanks! > > - /* Try to kill a child first */ > > + /* Try to sacrifice the worst child first */ > > + do_posix_clock_monotonic_gettime(&uptime); > > do { > > + unsigned long cpoints; > > This could be local to the list_for_each_entry() block. > Ok. > What does "cpoints" mean? > child points :) I'll send an incremental patch. > > list_for_each_entry(c, &t->children, sibling) { > > I'm surprised we don't have a sched.h helper for this. Maybe it's not > a very common thing to do. > > > if (c->mm == p->mm) > > continue; > > if (mem && !task_in_mem_cgroup(c, mem)) > > continue; > > - if (!oom_kill_task(c)) > > - return 0; > > + > > + /* badness() returns 0 if the thread is unkillable */ > > + cpoints = badness(c, uptime.tv_sec); > > + if (cpoints > victim_points) { > > + victim = c; > > + victim_points = cpoints; > > + } > > } > > } while_each_thread(p, t); > > > > - return oom_kill_task(p); > > + return oom_kill_task(victim); > > } > > And this function is secretly called under tasklist_lock, which is what > pins *victim, yes? > All of the out_of_memory() helper functions are called under tasklist_lock, which is what makes all these iterations safe. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail138.messagelabs.com (mail138.messagelabs.com [216.82.249.35]) by kanga.kvack.org (Postfix) with ESMTP id 560D96B01D2 for ; Tue, 8 Jun 2010 20:40:54 -0400 (EDT) Received: from kpbe11.cbf.corp.google.com (kpbe11.cbf.corp.google.com [172.25.105.75]) by smtp-out.google.com with ESMTP id o590eoDH019988 for ; Tue, 8 Jun 2010 17:40:50 -0700 Received: from pvg2 (pvg2.prod.google.com [10.241.210.130]) by kpbe11.cbf.corp.google.com with ESMTP id o590emt5018412 for ; Tue, 8 Jun 2010 17:40:49 -0700 Received: by pvg2 with SMTP id 2so1589826pvg.30 for ; Tue, 08 Jun 2010 17:40:48 -0700 (PDT) Date: Tue, 8 Jun 2010 17:40:45 -0700 (PDT) From: David Rientjes Subject: Re: [patch 09/18] oom: select task from tasklist for mempolicy ooms In-Reply-To: <20100608164325.a5fcdb39.akpm@linux-foundation.org> Message-ID: References: <20100608164325.a5fcdb39.akpm@linux-foundation.org> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-linux-mm@kvack.org To: Andrew Morton Cc: Rik van Riel , Nick Piggin , Oleg Nesterov , Balbir Singh , KAMEZAWA Hiroyuki , KOSAKI Motohiro , linux-mm@kvack.org, Andrea Arcangeli List-ID: On Tue, 8 Jun 2010, Andrew Morton wrote: > > The oom killer presently kills current whenever there is no more memory > > free or reclaimable on its mempolicy's nodes. There is no guarantee that > > current is a memory-hogging task or that killing it will free any > > substantial amount of memory, however. > > Well OK. But we don't necesarily *want* to "free a substantial amount > of memory". We want to resolve the oom within `current'. That's the > sole responsibility of the oom-killer. It doesn't have to free up > large amounts of additional memory in the expectation that sometime in > the future some other task will get an oom as well. if the oom-killer > is working well, we can defer those actions until the problem actually > occurs. 
> The oom killer has always attempted to kill a task that frees a large amount of memory: look at goal #2 in today's badness() heuristic (we recover a large amount of memory). By doing this, we avoid endless loops where anything we fork or our bash shell is constantly being oom killed or a large number of tasks that only free minimal amounts of memory get killed. The current behavior of killing current rarely works as a single remedy without being followed up by additional kills or user intervention. > Plus: if `current' isn't using much memory then it's probably a > short-lived or not-very-important process anyway. > That potentially prevents anything bound to that mempolicy from ever getting forked. > > In such situations, it is better to scan the tasklist for nodes that are > > allowed to allocate on current's set of nodes and kill the task with the > > highest badness() score. This ensures that the most memory-hogging task, > > or the one configured by the user with /proc/pid/oom_adj, is always > > selected in such scenarios. > > Well... *why* is it better? Needs more justification/explanation IMO. > This unifies mempolicy oom conditions with the same behavior of cpuset or memcg oom conditions: we want to utilize the badness() heuristic to kill the best candidate task and not nuke tons of processes for little benefit or, for instance, kill all other tasks sharing those same mempolicy nodes at the benefit of a memory hogger. Userspace has the ability to influence this heuristic (and even more powerfully with my heuristic rewrite coming later in this series) so it can better tune how the kernel reacts to mempolicy ooms, which is a key objective of this work. Simply killing current leaves no userspace intervention and can kill meaningful (and innocent) tasks which loses work for no reason. > A long time ago Andrea changed the oom-killer so that it basically > always killed `current', iirc. I think that shipped in the Suse > kernel. 
You can do that for the entire oom killer by enabling /proc/sys/vm/oom_kill_allocating_task. SGI wanted that to avoid these lengthy tasklist scans. > Maybe it was only in the case where `current' got an oom when > satisfying a pagefault, I forget the details. But according to Andrea, > this design provided a simple and practical solution to ooms. > Right, VM_FAULT_OOM always killed current and that was recently changed to invoke the pagefault oom handler. Nick has now converted the remaining architectures which were not using it to do so, so there is actually no difference for pagefaults anymore. In an earlier revision of this rewrite, I wanted pagefault ooms to try killing current first if it were killable and then fall back to the tasklist scan and heuristic use, but that was argued against for not conforming to other memory allocation failures. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ .
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail190.messagelabs.com (mail190.messagelabs.com [216.82.249.51]) by kanga.kvack.org (Postfix) with ESMTP id 9DCD76B01CC for ; Tue, 8 Jun 2010 20:46:54 -0400 (EDT) Received: from kpbe19.cbf.corp.google.com (kpbe19.cbf.corp.google.com [172.25.105.83]) by smtp-out.google.com with ESMTP id o590kpju017376 for ; Tue, 8 Jun 2010 17:46:51 -0700 Received: from pzk1 (pzk1.prod.google.com [10.243.19.129]) by kpbe19.cbf.corp.google.com with ESMTP id o590kok0002836 for ; Tue, 8 Jun 2010 17:46:50 -0700 Received: by pzk1 with SMTP id 1so3769149pzk.8 for ; Tue, 08 Jun 2010 17:46:50 -0700 (PDT) Date: Tue, 8 Jun 2010 17:46:45 -0700 (PDT) From: David Rientjes Subject: Re: [patch 09/18] oom: select task from tasklist for mempolicy ooms In-Reply-To: <20100608140818.b413c335.akpm@linux-foundation.org> Message-ID: References: <20100608140818.b413c335.akpm@linux-foundation.org> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-linux-mm@kvack.org To: Andrew Morton Cc: Rik van Riel , Nick Piggin , Oleg Nesterov , Balbir Singh , KAMEZAWA Hiroyuki , KOSAKI Motohiro , linux-mm@kvack.org List-ID: On Tue, 8 Jun 2010, Andrew Morton wrote: > > The oom killer presently kills current whenever there is no more memory > > free or reclaimable on its mempolicy's nodes. There is no guarantee that > > current is a memory-hogging task or that killing it will free any > > substantial amount of memory, however. > > > > In such situations, it is better to scan the tasklist for nodes that are > > allowed to allocate on current's set of nodes and kill the task with the > > highest badness() score. This ensures that the most memory-hogging task, > > or the one configured by the user with /proc/pid/oom_adj, is always > > selected in such scenarios. > > > > > > ... 
> > > > --- a/mm/oom_kill.c > > +++ b/mm/oom_kill.c > > @@ -27,6 +27,7 @@ > > #include > > #include > > #include > > +#include > > #include > > > > int sysctl_panic_on_oom; > > @@ -36,20 +37,36 @@ static DEFINE_SPINLOCK(zone_scan_lock); > > /* #define DEBUG */ > > > > /* > > - * Is all threads of the target process nodes overlap ours? > > + * Do all threads of the target process overlap our allowed nodes? > > + * @tsk: task struct of which task to consider > > + * @mask: nodemask passed to page allocator for mempolicy ooms > > The comment uses kerneldoc annotation but isn't a kerneldoc comment. > I'll fix it. > > */ > > -static int has_intersects_mems_allowed(struct task_struct *tsk) > > +static bool has_intersects_mems_allowed(struct task_struct *tsk, > > + const nodemask_t *mask) > > { > > - struct task_struct *t; > > + struct task_struct *start = tsk; > > > > - t = tsk; > > do { > > - if (cpuset_mems_allowed_intersects(current, t)) > > - return 1; > > - t = next_thread(t); > > - } while (t != tsk); > > - > > - return 0; > > + if (mask) { > > + /* > > + * If this is a mempolicy constrained oom, tsk's > > + * cpuset is irrelevant. Only return true if its > > + * mempolicy intersects current, otherwise it may be > > + * needlessly killed. > > + */ > > + if (mempolicy_nodemask_intersects(tsk, mask)) > > + return true; > > The comment refers to `current' but the code does not? > mempolicy_nodemask_intersects() compares tsk's mempolicy to current's, we don't need to pass current into the function (and we optimize for that since we don't need to do task_lock(current): nothing else can change its mempolicy). > > + } else { > > + /* > > + * This is not a mempolicy constrained oom, so only > > + * check the mems of tsk's cpuset. > > + */ > > The comment doesn't refer to `current', but the code does. Confused. > This simply compares the cpuset mems_allowed of both tasks passed into the function. 
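As a rough userspace model of the two branches discussed above, here is a sketch only. The `struct task` fields, the `cur` parameter (standing in for `current`), and the example tasks are hypothetical; nodemasks are reduced to a single unsigned long, and the kernel's mempolicy_nodemask_intersects() and cpuset_mems_allowed_intersects() take more into account than these plain bitwise tests.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for task_struct; not the kernel layout. */
struct task {
	unsigned long cpuset_mems;	/* nodes allowed by the task's cpuset */
	unsigned long mpol_nodes;	/* nodes allowed by its mempolicy */
	struct task *next_thread;	/* circular list of the thread group */
};

/*
 * Returns true if any thread in tsk's group could be holding memory that
 * matters for this oom. A non-NULL mask means a mempolicy-constrained oom,
 * where only the mempolicy is compared; otherwise only cpuset mems are.
 */
static bool has_intersects_mems_allowed(const struct task *cur,
					const struct task *tsk,
					const unsigned long *mask)
{
	const struct task *start = tsk;

	do {
		if (mask) {
			/* mempolicy oom: tsk's cpuset is irrelevant */
			if (tsk->mpol_nodes & *mask)
				return true;
		} else {
			/* cpuset (or unconstrained) oom */
			if (cur->cpuset_mems & tsk->cpuset_mems)
				return true;
		}
		tsk = tsk->next_thread;
	} while (tsk != start);
	return false;
}

/* Example data: cur lives on node 0; a two-thread group lives on node 1. */
static struct task cur = { 0x1UL, 0x1UL, &cur };
static struct task thread_b;
static struct task thread_a = { 0x2UL, 0x2UL, &thread_b };
static struct task thread_b = { 0x2UL, 0x1UL, &thread_a };
static const unsigned long mpol_mask = 0x1UL;
```

In this example the group is a candidate for a mempolicy oom on node 0, because `thread_b` has a mempolicy covering that node, but not for a cpuset oom, because neither thread's cpuset overlaps `cur`'s.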
> > + if (cpuset_mems_allowed_intersects(current, tsk)) > > + return true; > > + } > > + tsk = next_thread(tsk); > > hm, next_thread() uses list_entry_rcu(). What are the locking rules > here? It's one of both of rcu_read_lock() and read_lock(&tasklist_lock), > I think? > Oleg addressed this in his response. > > + } while (tsk != start); > > + return false; > > } > > This is all bloat and overhead for non-NUMA builds. I doubt if gcc is > able to eliminate the task_struct walk (although I didn't check). > > The function isn't oom-killer-specific at all - give it a better name > then move it to mempolicy.c or similar? If so, the text "oom" > shouldn't appear in the comments. > It's the only place where we want to filter tasks based on whether they share mempolicy nodes or cpuset mems, though, so I think it's appropriately placed in mm/oom_kill.c. I agree that we can add a #ifndef CONFIG_NUMA variant and I'll do so, thanks. > > > > ... > > > > @@ -676,24 +699,19 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask, > > */ > > constraint = constrained_alloc(zonelist, gfp_mask, nodemask); > > read_lock(&tasklist_lock); > > - > > - switch (constraint) { > > - case CONSTRAINT_MEMORY_POLICY: > > - oom_kill_process(current, gfp_mask, order, 0, NULL, > > - "No available memory (MPOL_BIND)"); > > - break; > > - > > - case CONSTRAINT_NONE: > > - if (sysctl_panic_on_oom) { > > + if (unlikely(sysctl_panic_on_oom)) { > > + /* > > + * panic_on_oom only affects CONSTRAINT_NONE, the kernel > > + * should not panic for cpuset or mempolicy induced memory > > + * failures. > > + */ > > This wasn't changelogged? > It's not a functional change, sysctl_panic_on_oom == 2 is already handled earlier in the function. This was intended to elaborate on why we're only concerned about CONSTRAINT_NONE here since the switch statement was removed. > > + if (constraint == CONSTRAINT_NONE) { > > dump_header(NULL, gfp_mask, order, NULL); > > - panic("out of memory. 
panic_on_oom is selected\n"); > > + read_unlock(&tasklist_lock); > > + panic("Out of memory: panic_on_oom is enabled\n"); > > } > > - /* Fall-through */ > > - case CONSTRAINT_CPUSET: > > - __out_of_memory(gfp_mask, order); > > - break; > > } > > - > > + __out_of_memory(gfp_mask, order, constraint, nodemask); > > read_unlock(&tasklist_lock); > > > > /* > > -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail203.messagelabs.com (mail203.messagelabs.com [216.82.254.243]) by kanga.kvack.org (Postfix) with ESMTP id 97BAF6B01D5 for ; Tue, 8 Jun 2010 20:52:51 -0400 (EDT) Received: from kpbe16.cbf.corp.google.com (kpbe16.cbf.corp.google.com [172.25.105.80]) by smtp-out.google.com with ESMTP id o590qkrk005955 for ; Tue, 8 Jun 2010 17:52:46 -0700 Received: from pzk13 (pzk13.prod.google.com [10.243.19.141]) by kpbe16.cbf.corp.google.com with ESMTP id o590q71Z010371 for ; Tue, 8 Jun 2010 17:52:45 -0700 Received: by pzk13 with SMTP id 13so914288pzk.13 for ; Tue, 08 Jun 2010 17:52:44 -0700 (PDT) Date: Tue, 8 Jun 2010 17:52:42 -0700 (PDT) From: David Rientjes Subject: Re: [patch 10/18] oom: enable oom tasklist dump by default In-Reply-To: <20100608141342.114156ac.akpm@linux-foundation.org> Message-ID: References: <20100608141342.114156ac.akpm@linux-foundation.org> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-linux-mm@kvack.org To: Andrew Morton Cc: Rik van Riel , Nick Piggin , Oleg Nesterov , Balbir Singh , KAMEZAWA Hiroyuki , KOSAKI Motohiro , linux-mm@kvack.org List-ID: On Tue, 8 Jun 2010, Andrew Morton wrote: > > The oom killer tasklist dump, enabled with the oom_dump_tasks sysctl, is > > very helpful information in diagnosing why a user's task has been killed. 
> > It emits useful information such as each eligible thread's memory usage > > that can determine why the system is oom, so it should be enabled by > > default. > > Unclear. On a large system the poor thing will now spend half an hour > squirting junk out the diagnostic port. Probably interspersed with the > occasional whine from the softlockup detector. And for many > applications, spending a long time stuck in the kernel printing > diagnostics is equivalent to an outage. > > I guess people can turn it off again if this happens, but they'll get > justifiably grumpy at us. I wonder if this change is too > developer-friendly and insufficiently operator-friendly. > This is one of the main reasons why I wanted to unify both oom_kill_allocating_task and oom_dump_tasks into a single sysctl: oom_kill_quick, but that was nacked. Both of the former sysctls have the same audience: those that want to avoid lengthy tasklist scans, namely companies like SGI, by enabling the first and disabling the second. If we were to extend the oom killer in the future and need to add special handling for these customers, it would have been easy with the unified sysctl, but I'm not going to wage that war again. I think this is more helpful than harmful, however, solely because it gives users a better indication of what caused their system to be oom in the first place and can be disabled at runtime. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail144.messagelabs.com (mail144.messagelabs.com [216.82.254.51]) by kanga.kvack.org (Postfix) with ESMTP id 615416B01D2 for ; Wed, 9 Jun 2010 02:32:41 -0400 (EDT) Received: from kpbe18.cbf.corp.google.com (kpbe18.cbf.corp.google.com [172.25.105.82]) by smtp-out.google.com with ESMTP id o596Wbq2028101 for ; Tue, 8 Jun 2010 23:32:37 -0700 Received: from pxi6 (pxi6.prod.google.com [10.243.27.6]) by kpbe18.cbf.corp.google.com with ESMTP id o596WZUn032079 for ; Tue, 8 Jun 2010 23:32:36 -0700 Received: by pxi6 with SMTP id 6so2449971pxi.1 for ; Tue, 08 Jun 2010 23:32:35 -0700 (PDT) Date: Tue, 8 Jun 2010 23:32:27 -0700 (PDT) From: David Rientjes Subject: Re: [patch 06/18] oom: avoid sending exiting tasks a SIGKILL In-Reply-To: <20100608202611.GA11284@redhat.com> Message-ID: References: <20100608202611.GA11284@redhat.com> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-linux-mm@kvack.org To: Oleg Nesterov Cc: Andrew Morton , Rik van Riel , Nick Piggin , Balbir Singh , KAMEZAWA Hiroyuki , KOSAKI Motohiro , linux-mm@kvack.org List-ID: On Tue, 8 Jun 2010, Oleg Nesterov wrote: > > It's unnecessary to SIGKILL a task that is already PF_EXITING > > This probably needs some explanation. PF_EXITING doesn't necessarily > mean this process is exiting. > I hope that my sentence didn't imply that it was, the point is that sending a SIGKILL to a PF_EXITING task isn't necessary to make it exit, it's already along the right path. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail203.messagelabs.com (mail203.messagelabs.com [216.82.254.243]) by kanga.kvack.org (Postfix) with SMTP id 3F2246B01B5 for ; Wed, 9 Jun 2010 12:27:12 -0400 (EDT) Date: Wed, 9 Jun 2010 18:25:23 +0200 From: Oleg Nesterov Subject: Re: [patch 06/18] oom: avoid sending exiting tasks a SIGKILL Message-ID: <20100609162523.GA30464@redhat.com> References: <20100608202611.GA11284@redhat.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: Sender: owner-linux-mm@kvack.org To: David Rientjes Cc: Andrew Morton , Rik van Riel , Nick Piggin , Balbir Singh , KAMEZAWA Hiroyuki , KOSAKI Motohiro , linux-mm@kvack.org List-ID: On 06/08, David Rientjes wrote: > > On Tue, 8 Jun 2010, Oleg Nesterov wrote: > > > > It's unnecessary to SIGKILL a task that is already PF_EXITING > > > > This probably needs some explanation. PF_EXITING doesn't necessarily > > mean this process is exiting. > > I hope that my sentence didn't imply that it was, the point is that > sending a SIGKILL to a PF_EXITING task isn't necessary to make it exit, > it's already along the right path. Well, probably this is right... David, currently I do not know how the code looks with all patches applied, could you please confirm there is no problem here? I am looking at Linus's tree, mem_cgroup_out_of_memory: p = select_bad_process(); oom_kill_process(p); Now, again, select_bad_process() can return the dead group-leader of the memory-hog-thread-group. In that case set_tsk_thread_flag(TIF_MEMDIE) buys nothing, this thread has already exited, but we do want to kill this process. If this is not true due to other changes - great. Otherwise, perhaps this needs - if (PF_EXITING) + if (PF_EXITING && mm) too? Oleg. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ .
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail138.messagelabs.com (mail138.messagelabs.com [216.82.249.35]) by kanga.kvack.org (Postfix) with ESMTP id 5CEDF6B01AD for ; Wed, 9 Jun 2010 15:44:48 -0400 (EDT) Received: from wpaz9.hot.corp.google.com (wpaz9.hot.corp.google.com [172.24.198.73]) by smtp-out.google.com with ESMTP id o59JigbY012414 for ; Wed, 9 Jun 2010 12:44:42 -0700 Received: from pzk33 (pzk33.prod.google.com [10.243.19.161]) by wpaz9.hot.corp.google.com with ESMTP id o59JiXtJ032679 for ; Wed, 9 Jun 2010 12:44:41 -0700 Received: by pzk33 with SMTP id 33so6105490pzk.17 for ; Wed, 09 Jun 2010 12:44:40 -0700 (PDT) Date: Wed, 9 Jun 2010 12:44:34 -0700 (PDT) From: David Rientjes Subject: Re: [patch 06/18] oom: avoid sending exiting tasks a SIGKILL In-Reply-To: <20100609162523.GA30464@redhat.com> Message-ID: References: <20100608202611.GA11284@redhat.com> <20100609162523.GA30464@redhat.com> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-linux-mm@kvack.org To: Oleg Nesterov Cc: Andrew Morton , Rik van Riel , Nick Piggin , Balbir Singh , KAMEZAWA Hiroyuki , KOSAKI Motohiro , linux-mm@kvack.org List-ID: On Wed, 9 Jun 2010, Oleg Nesterov wrote: > > I hope that my sentence didn't imply that it was, the point is that > > sending a SIGKILL to a PF_EXITING task isn't necessary to make it exit, > > it's already along the right path. > > Well, probably this is right... > > David, currently I do not know how the code looks with all patches > applied, could you please confirm there is no problem here? I am > looking at Linus's tree, > > mem_cgroup_out_of_memory: > > p = select_bad_process(); > oom_kill_process(p); > mem_cgroup_out_of_memory() does this under tasklist_lock: retry: p = select_bad_process(&points, mem, CONSTRAINT_MEMCG, NULL); if (!p || PTR_ERR(p) == -1UL) goto out; if (oom_kill_process(p, gfp_mask, 0, points, mem, "Memory cgroup out of memory")) goto retry; out: ... 
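The caller above distinguishes three results: no candidate, an abort sentinel, and a victim. A minimal userspace sketch of that contract follows. It is an assumption-laden model, not the kernel implementation: `struct task`, `OOM_ABORT` (standing in for ERR_PTR(-1UL)), `select_bad()`, and the `example` array are hypothetical names, and the selection rule is a simplified reading of the PF_EXITING check in select_bad_process().

```c
#include <limits.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical userspace model; field names are illustrative only. */
struct task {
	int exiting;		/* models PF_EXITING */
	int has_mm;		/* models p->mm != NULL */
	unsigned long points;	/* models the badness() score */
};

/* Models ERR_PTR(-1UL): a task is already exiting, so kill nothing. */
#define OOM_ABORT ((struct task *)(uintptr_t)-1)

static struct task *select_bad(struct task *tasks, size_t n,
			       const struct task *cur)
{
	struct task *chosen = NULL;
	unsigned long best = 0;

	for (size_t i = 0; i < n; i++) {
		struct task *p = &tasks[i];

		if (p->exiting && p->has_mm) {
			/* Someone else is about to free memory: wait. */
			if (p != cur)
				return OOM_ABORT;
			/* cur itself is exiting: nothing can outscore it. */
			chosen = p;
			best = ULONG_MAX;
			continue;
		}
		if (p->points > best) {
			chosen = p;
			best = p->points;
		}
	}
	return chosen;
}

/* Example data: the middle task is exiting but still has its mm. */
static struct task example[] = {
	{ 0, 1, 50 },
	{ 1, 1, 10 },
	{ 0, 1, 90 },
};
```

A caller in this model would stop on either NULL or `OOM_ABORT` and only try to kill a real task pointer, mirroring the `!p || PTR_ERR(p) == -1UL` test. Note that an exiting task with `has_mm == 0` falls through to the ordinary score comparison.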
> Now, again, select_bad_process() can return the dead group-leader > of the memory-hog-thread-group. > select_bad_process() already has: if ((p->flags & PF_EXITING) && p->mm) { if (p != current) return ERR_PTR(-1UL); chosen = p; *ppoints = ULONG_MAX; } so we can disregard the check for p == current in this case since it would not be allocating memory without p->mm. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail144.messagelabs.com (mail144.messagelabs.com [216.82.254.51]) by kanga.kvack.org (Postfix) with SMTP id 65A0B6B0071 for ; Wed, 9 Jun 2010 16:16:01 -0400 (EDT) Date: Wed, 9 Jun 2010 22:14:30 +0200 From: Oleg Nesterov Subject: Re: [patch 06/18] oom: avoid sending exiting tasks a SIGKILL Message-ID: <20100609201430.GA8210@redhat.com> References: <20100608202611.GA11284@redhat.com> <20100609162523.GA30464@redhat.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: Sender: owner-linux-mm@kvack.org To: David Rientjes Cc: Andrew Morton , Rik van Riel , Nick Piggin , Balbir Singh , KAMEZAWA Hiroyuki , KOSAKI Motohiro , linux-mm@kvack.org List-ID: On 06/09, David Rientjes wrote: > > On Wed, 9 Jun 2010, Oleg Nesterov wrote: > > > David, currently I do not know how the code looks with all patches > > applied, could you please confirm there is no problem here? I am > > looking at Linus's tree, > > > > mem_cgroup_out_of_memory: > > > > p = select_bad_process(); > > oom_kill_process(p); > > > > mem_cgroup_out_of_memory() does this under tasklist_lock: > > retry: > p = select_bad_process(&points, mem, CONSTRAINT_MEMCG, NULL); > if (!p || PTR_ERR(p) == -1UL) > goto out; > > if (oom_kill_process(p, gfp_mask, 0, points, mem, > "Memory cgroup out of memory")) > goto retry; > out: > ... 
> > > Now, again, select_bad_process() can return the dead group-leader > > of the memory-hog-thread-group. > > > > select_bad_process() already has: > > if ((p->flags & PF_EXITING) && p->mm) { > if (p != current) > return ERR_PTR(-1UL); > > chosen = p; > *ppoints = ULONG_MAX; > } > > so we can disregard the check for p == current Not sure I understand... We can just ignore this check, in this case p->mm == NULL. > in this case since it would > not be allocating memory without p->mm. This thread will not allocate the memory, yes. But its sub-threads can. And select_bad_process() can constantly return the same (dead) thread P, badness() inspects ->mm under find_lock_task_mm() which finds the thread with the valid ->mm. OK. Probably this doesn't matter. I don't know if task_in_mem_cgroup(task) was fixed or not, but currently it also looks at task->mm and thus have the same boring problem: it is trivial to make the memory-hog process invisible to oom. Unless I missed something, of course. Oleg. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail191.messagelabs.com (mail191.messagelabs.com [216.82.242.19])
	by kanga.kvack.org (Postfix) with SMTP id 645906B0071
	for ; Wed, 9 Jun 2010 20:20:30 -0400 (EDT)
Received: from m4.gw.fujitsu.co.jp ([10.0.50.74])
	by fgwmail6.fujitsu.co.jp (Fujitsu Gateway) with ESMTP id o5A0KR00030260
	for (envelope-from kamezawa.hiroyu@jp.fujitsu.com); Thu, 10 Jun 2010 09:20:27 +0900
Received: from smail (m4 [127.0.0.1])
	by outgoing.m4.gw.fujitsu.co.jp (Postfix) with ESMTP id 2F99045DE70
	for ; Thu, 10 Jun 2010 09:20:27 +0900 (JST)
Received: from s4.gw.fujitsu.co.jp (s4.gw.fujitsu.co.jp [10.0.50.94])
	by m4.gw.fujitsu.co.jp (Postfix) with ESMTP id D990945DE60
	for ; Thu, 10 Jun 2010 09:20:26 +0900 (JST)
Received: from s4.gw.fujitsu.co.jp (localhost.localdomain [127.0.0.1])
	by s4.gw.fujitsu.co.jp (Postfix) with ESMTP id 9DF9E1DB8042
	for ; Thu, 10 Jun 2010 09:20:26 +0900 (JST)
Received: from m108.s.css.fujitsu.com (m108.s.css.fujitsu.com [10.249.87.108])
	by s4.gw.fujitsu.co.jp (Postfix) with ESMTP id 1FC371DB803E
	for ; Thu, 10 Jun 2010 09:20:26 +0900 (JST)
Date: Thu, 10 Jun 2010 09:15:47 +0900
From: KAMEZAWA Hiroyuki
Subject: Re: [patch 06/18] oom: avoid sending exiting tasks a SIGKILL
Message-Id: <20100610091547.d2c88d4c.kamezawa.hiroyu@jp.fujitsu.com>
In-Reply-To: <20100609201430.GA8210@redhat.com>
References: <20100608202611.GA11284@redhat.com> <20100609162523.GA30464@redhat.com>
	<20100609201430.GA8210@redhat.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Sender: owner-linux-mm@kvack.org
To: Oleg Nesterov
Cc: David Rientjes, Andrew Morton, Rik van Riel, Nick Piggin, Balbir Singh,
	KOSAKI Motohiro, linux-mm@kvack.org
List-ID:

On Wed, 9 Jun 2010 22:14:30 +0200
Oleg Nesterov wrote:

> > in this case since it would
> > not be allocating memory without p->mm.
>
> This thread will not allocate the memory, yes. But its sub-threads can.
> And select_bad_process() can constantly return the same (dead) thread P,
> badness() inspects ->mm under find_lock_task_mm() which finds the thread
> with the valid ->mm.
>
> OK. Probably this doesn't matter. I don't know if task_in_mem_cgroup(task)
> was fixed or not, but currently it also looks at task->mm and thus have
> the same boring problem: it is trivial to make the memory-hog process
> invisible to oom. Unless I missed something, of course.
>

HmHm... your concern is that there is a case where mem_cgroup_out_of_memory()
can't kill anything?

Now, memcg doesn't return -ENOMEM in the usual case.
So, it loops until there is some available memory under its limit.
Then, if mem_cgroup_out_of_memory() can kill a process within several tries,
we won't have a terrible problem (even if it's slow).

Hmm. What I can't understand is whether there is a case where a PF_EXITING
thread never exits. If so, we need some care (in memcg?)

Thanks,
-Kame
From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail137.messagelabs.com (mail137.messagelabs.com [216.82.249.19])
	by kanga.kvack.org (Postfix) with SMTP id 6188F6B0071
	for ; Wed, 9 Jun 2010 21:22:30 -0400 (EDT)
Date: Thu, 10 Jun 2010 03:21:01 +0200
From: Oleg Nesterov
Subject: Re: [patch 06/18] oom: avoid sending exiting tasks a SIGKILL
Message-ID: <20100610012101.GA5412@redhat.com>
References: <20100608202611.GA11284@redhat.com> <20100609162523.GA30464@redhat.com>
	<20100609201430.GA8210@redhat.com>
	<20100610091547.d2c88d4c.kamezawa.hiroyu@jp.fujitsu.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20100610091547.d2c88d4c.kamezawa.hiroyu@jp.fujitsu.com>
Sender: owner-linux-mm@kvack.org
To: KAMEZAWA Hiroyuki
Cc: David Rientjes, Andrew Morton, Rik van Riel, Nick Piggin, Balbir Singh,
	KOSAKI Motohiro, linux-mm@kvack.org
List-ID:

On 06/10, KAMEZAWA Hiroyuki wrote:
>
> On Wed, 9 Jun 2010 22:14:30 +0200
> Oleg Nesterov wrote:
>
> > > in this case since it would
> > > not be allocating memory without p->mm.
> >
> > This thread will not allocate the memory, yes. But its sub-threads can.
> > And select_bad_process() can constantly return the same (dead) thread P,
> > badness() inspects ->mm under find_lock_task_mm() which finds the thread
> > with the valid ->mm.
> >
> > OK. Probably this doesn't matter. I don't know if task_in_mem_cgroup(task)
> > was fixed or not, but currently it also looks at task->mm and thus have
> > the same boring problem: it is trivial to make the memory-hog process
> > invisible to oom. Unless I missed something, of course.
>
> HmHm... your concern is that there is a case where mem_cgroup_out_of_memory()
> can't kill anything?

Or it can kill the wrong task. But once again, I am only speculating
looking at the current code.

> Now, memcg doesn't return -ENOMEM in the usual case.
> So, it loops until there is some available memory under its limit.
> Then, if mem_cgroup_out_of_memory() can kill a process within several tries,
> we won't have a terrible problem (even if it's slow).
>
> Hmm. What I can't understand is whether there is a case where a PF_EXITING
> thread never exits. If so, we need some care (in memcg?)

	void *thread_func(void *)
	{
		for (;;)
			malloc();
	}

	int main(void)
	{
		pthread_create(..., thread_func, ...);
		pthread_exit();
	}

This process runs with the dead group-leader (PF_EXITING is set, ->mm == NULL).
mem_cgroup_out_of_memory()->select_bad_process() can't see it due to
task_in_mem_cgroup() check.

Afaics

	- task_in_mem_cgroup() should use find_lock_task_mm() too

	- oom_kill_process() should check "PF_EXITING && p->mm",
	  like select_bad_process() does.

Oleg.

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail138.messagelabs.com (mail138.messagelabs.com [216.82.249.35])
	by kanga.kvack.org (Postfix) with SMTP id AB0706B01AD
	for ; Wed, 9 Jun 2010 21:47:36 -0400 (EDT)
Received: from m3.gw.fujitsu.co.jp ([10.0.50.73])
	by fgwmail7.fujitsu.co.jp (Fujitsu Gateway) with ESMTP id o5A1lULM018009
	for (envelope-from kamezawa.hiroyu@jp.fujitsu.com); Thu, 10 Jun 2010 10:47:30 +0900
Received: from smail (m3 [127.0.0.1])
	by outgoing.m3.gw.fujitsu.co.jp (Postfix) with ESMTP id F057C45DE50
	for ; Thu, 10 Jun 2010 10:47:29 +0900 (JST)
Received: from s3.gw.fujitsu.co.jp (s3.gw.fujitsu.co.jp [10.0.50.93])
	by m3.gw.fujitsu.co.jp (Postfix) with ESMTP id C971D45DE4F
	for ; Thu, 10 Jun 2010 10:47:29 +0900 (JST)
Received: from s3.gw.fujitsu.co.jp (localhost.localdomain [127.0.0.1])
	by s3.gw.fujitsu.co.jp (Postfix) with ESMTP id B2B7D1DB803B
	for ; Thu, 10 Jun 2010 10:47:29 +0900 (JST)
Received: from m108.s.css.fujitsu.com (m108.s.css.fujitsu.com [10.249.87.108])
	by s3.gw.fujitsu.co.jp (Postfix) with ESMTP id 5E0A01DB8037
	for ; Thu, 10 Jun 2010 10:47:29 +0900 (JST)
Date: Thu, 10 Jun 2010 10:43:09 +0900
From: KAMEZAWA Hiroyuki
Subject: Re: [patch 06/18] oom: avoid sending exiting tasks a SIGKILL
Message-Id: <20100610104309.f7559f31.kamezawa.hiroyu@jp.fujitsu.com>
In-Reply-To: <20100610012101.GA5412@redhat.com>
References: <20100608202611.GA11284@redhat.com> <20100609162523.GA30464@redhat.com>
	<20100609201430.GA8210@redhat.com>
	<20100610091547.d2c88d4c.kamezawa.hiroyu@jp.fujitsu.com>
	<20100610012101.GA5412@redhat.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Sender: owner-linux-mm@kvack.org
To: Oleg Nesterov
Cc: David Rientjes, Andrew Morton, Rik van Riel, Nick Piggin, Balbir Singh,
	KOSAKI Motohiro, linux-mm@kvack.org
List-ID:

On Thu, 10 Jun 2010 03:21:01 +0200
Oleg Nesterov wrote:

> On 06/10, KAMEZAWA Hiroyuki wrote:
> >
> > On Wed, 9 Jun 2010 22:14:30 +0200
> > Oleg Nesterov wrote:
> >
> > > > in this case since it would
> > > > not be allocating memory without p->mm.
> > >
> > > This thread will not allocate the memory, yes. But its sub-threads can.
> > > And select_bad_process() can constantly return the same (dead) thread P,
> > > badness() inspects ->mm under find_lock_task_mm() which finds the thread
> > > with the valid ->mm.
> > >
> > > OK. Probably this doesn't matter. I don't know if task_in_mem_cgroup(task)
> > > was fixed or not, but currently it also looks at task->mm and thus have
> > > the same boring problem: it is trivial to make the memory-hog process
> > > invisible to oom. Unless I missed something, of course.
> >
> > HmHm... your concern is that there is a case where mem_cgroup_out_of_memory()
> > can't kill anything?
>
> Or it can kill the wrong task. But once again, I am only speculating
> looking at the current code.
>
> > Now, memcg doesn't return -ENOMEM in the usual case.
> > So, it loops until there is some available memory under its limit.
> > Then, if mem_cgroup_out_of_memory() can kill a process within several tries,
> > we won't have a terrible problem (even if it's slow).
> >
> > Hmm. What I can't understand is whether there is a case where a PF_EXITING
> > thread never exits. If so, we need some care (in memcg?)
>
> 	void *thread_func(void *)
> 	{
> 		for (;;)
> 			malloc();
> 	}
>
> 	int main(void)
> 	{
> 		pthread_create(..., thread_func, ...);
> 		pthread_exit();
> 	}
>
> This process runs with the dead group-leader (PF_EXITING is set, ->mm == NULL).
> mem_cgroup_out_of_memory()->select_bad_process() can't see it due to
> task_in_mem_cgroup() check.
>
> Afaics
>
> 	- task_in_mem_cgroup() should use find_lock_task_mm() too
>
> 	- oom_kill_process() should check "PF_EXITING && p->mm",
> 	  like select_bad_process() does.
>

Hm. I'd like to look into that when the next mmotm is shipped.
(too many patches in flight..)

The problem is

	for (walking each 'process')
		if (task_in_mem_cgroup(p, memcg))

can't check whether 'p' contains threads belonging to the given memcg, because
p->mm can be NULL. So, task_in_mem_cgroup() should call find_lock_task_mm()
when getting the "mm" struct.

Thanks,
-Kame
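
The membership problem described in this exchange can be sketched in user space. What follows is a toy model under loud assumptions, not kernel code: `struct task`, `struct mm`, `naive_in_memcg()` and `fixed_in_memcg()` are hypothetical stand-ins invented for illustration, and `find_task_with_mm()` only imitates what the thread says find_lock_task_mm() does (find a thread in the group that still has a valid ->mm), with no locking at all.

```c
#include <stddef.h>

/* Toy stand-ins for kernel structures; every name here is hypothetical. */
struct mm {
	int memcg_id;              /* which memcg owns this address space */
};

struct task {
	struct mm *mm;             /* NULL once this thread has exited */
	struct task *next_thread;  /* circular list of threads in the group */
};

/* Walk the thread group and return the first thread that still has an mm. */
static struct task *find_task_with_mm(struct task *p)
{
	struct task *t = p;

	do {
		if (t->mm)
			return t;
		t = t->next_thread;
	} while (t != p);

	return NULL;
}

/* Looks only at p->mm, so a dead group leader hides the whole process. */
static int naive_in_memcg(struct task *p, int memcg_id)
{
	return p->mm && p->mm->memcg_id == memcg_id;
}

/* Checks whichever thread in the group still has an mm. */
static int fixed_in_memcg(struct task *p, int memcg_id)
{
	struct task *t = find_task_with_mm(p);

	return t && t->mm->memcg_id == memcg_id;
}
```

With a leader whose mm is NULL and one live sub-thread charged to memcg 7, naive_in_memcg(&leader, 7) reports 0 while fixed_in_memcg(&leader, 7) reports 1, which is the "invisible to oom" case in miniature.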
From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail137.messagelabs.com (mail137.messagelabs.com [216.82.249.19])
	by kanga.kvack.org (Postfix) with SMTP id E32436B01AD
	for ; Wed, 9 Jun 2010 21:53:09 -0400 (EDT)
Date: Thu, 10 Jun 2010 03:51:36 +0200
From: Oleg Nesterov
Subject: Re: [patch 06/18] oom: avoid sending exiting tasks a SIGKILL
Message-ID: <20100610015136.GA7595@redhat.com>
References: <20100608202611.GA11284@redhat.com> <20100609162523.GA30464@redhat.com>
	<20100609201430.GA8210@redhat.com>
	<20100610091547.d2c88d4c.kamezawa.hiroyu@jp.fujitsu.com>
	<20100610012101.GA5412@redhat.com>
	<20100610104309.f7559f31.kamezawa.hiroyu@jp.fujitsu.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20100610104309.f7559f31.kamezawa.hiroyu@jp.fujitsu.com>
Sender: owner-linux-mm@kvack.org
To: KAMEZAWA Hiroyuki
Cc: David Rientjes, Andrew Morton, Rik van Riel, Nick Piggin, Balbir Singh,
	KOSAKI Motohiro, linux-mm@kvack.org
List-ID:

On 06/10, KAMEZAWA Hiroyuki wrote:
>
> > Afaics
> >
> > 	- task_in_mem_cgroup() should use find_lock_task_mm() too
> >
> > 	- oom_kill_process() should check "PF_EXITING && p->mm",
> > 	  like select_bad_process() does.
> >
>
> Hm. I'd like to look into that when the next mmotm is shipped.
> (too many patches in flight..)

Me too ;)

> The problem is
>
> 	for (walking each 'process')
> 		if (task_in_mem_cgroup(p, memcg))
>
> can't check whether 'p' contains threads belonging to the given memcg, because
> p->mm can be NULL. So, task_in_mem_cgroup() should call find_lock_task_mm()
> when getting the "mm" struct.

Yes, this is what I meant.

And after we do this change we should tweak oom_kill_process() too,
otherwise we have another problem.

Oleg.
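
The oom_kill_process() tweak mentioned above can be sketched as a small decision function. This is a hedged toy model, not the kernel's code: `TOY_PF_EXITING`, `struct task`, `enum action` and both functions are hypothetical names made up for this illustration, and the "without the check" variant is only my reading of the `if (PF_EXITING)` line quoted earlier in the thread.

```c
#include <stddef.h>

#define TOY_PF_EXITING 0x4  /* arbitrary bit for this sketch, not taken from the kernel */

/* Minimal stand-in for a task; hypothetical, for illustration only. */
struct task {
	unsigned int flags;
	void *mm;           /* NULL once this thread has dropped its address space */
};

enum action {
	GRANT_RESERVES,     /* only set TIF_MEMDIE and let the task finish exiting */
	SEND_SIGKILL        /* actually kill the process */
};

/* Takes the shortcut for any PF_EXITING task, including a dead group leader. */
static enum action action_without_mm_check(const struct task *p)
{
	return (p->flags & TOY_PF_EXITING) ? GRANT_RESERVES : SEND_SIGKILL;
}

/* Takes the shortcut only while the exiting task still has an mm. */
static enum action action_with_mm_check(const struct task *p)
{
	if ((p->flags & TOY_PF_EXITING) && p->mm)
		return GRANT_RESERVES;
	return SEND_SIGKILL;
}
```

For a dead leader (PF_EXITING set, mm == NULL) the first variant returns GRANT_RESERVES, which frees nothing because the sub-threads keep running, while the second falls through to SEND_SIGKILL.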
From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail138.messagelabs.com (mail138.messagelabs.com [216.82.249.35])
	by kanga.kvack.org (Postfix) with SMTP id AB14F6B01BA
	for ; Sun, 13 Jun 2010 07:24:58 -0400 (EDT)
Received: from m2.gw.fujitsu.co.jp ([10.0.50.72])
	by fgwmail6.fujitsu.co.jp (Fujitsu Gateway) with ESMTP id o5DBOue1022697
	for (envelope-from kosaki.motohiro@jp.fujitsu.com); Sun, 13 Jun 2010 20:24:56 +0900
Received: from smail (m2 [127.0.0.1])
	by outgoing.m2.gw.fujitsu.co.jp (Postfix) with ESMTP id 58B2F45DE55
	for ; Sun, 13 Jun 2010 20:24:56 +0900 (JST)
Received: from s2.gw.fujitsu.co.jp (s2.gw.fujitsu.co.jp [10.0.50.92])
	by m2.gw.fujitsu.co.jp (Postfix) with ESMTP id 2C94C45DD77
	for ; Sun, 13 Jun 2010 20:24:56 +0900 (JST)
Received: from s2.gw.fujitsu.co.jp (localhost.localdomain [127.0.0.1])
	by s2.gw.fujitsu.co.jp (Postfix) with ESMTP id 109201DB803B
	for ; Sun, 13 Jun 2010 20:24:56 +0900 (JST)
Received: from m106.s.css.fujitsu.com (m106.s.css.fujitsu.com [10.249.87.106])
	by s2.gw.fujitsu.co.jp (Postfix) with ESMTP id C1A631DB803A
	for ; Sun, 13 Jun 2010 20:24:55 +0900 (JST)
From: KOSAKI Motohiro
Subject: Re: [patch 16/18] oom: badness heuristic rewrite
In-Reply-To: <20100608160216.bc52112b.akpm@linux-foundation.org>
References: <20100608194533.7657.A69D9226@jp.fujitsu.com>
	<20100608160216.bc52112b.akpm@linux-foundation.org>
Message-Id: <20100613193529.618D.A69D9226@jp.fujitsu.com>
MIME-Version: 1.0
Content-Type: text/plain; charset="ISO-2022-JP"
Content-Transfer-Encoding: 7bit
Date: Sun, 13 Jun 2010 20:24:55 +0900 (JST)
Sender: owner-linux-mm@kvack.org
To: Andrew Morton
Cc: kosaki.motohiro@jp.fujitsu.com, David Rientjes, Rik van Riel, Nick Piggin,
	Oleg Nesterov, Balbir Singh, KAMEZAWA Hiroyuki, linux-mm@kvack.org
List-ID:

> > >  * Copyright (C) 1998,2000 Rik van Riel
> > >  * Thanks go out to Claus Fischer for some serious inspiration and
> > >  * for goading me into coding this file...
> > > + * Copyright (C) 2010 Google, Inc.
> > > + * Rewritten by David Rientjes
> >
> > don't put it.
> >
>
> Seems OK to me. It's a fairly substantial change and people have added
> their (c) in the past for smaller kernel changes. I guess one could even
> do this for a one-liner.

If you are OK, I have no objection. I'm not a lawyer. But, at least in
Japan, co-developers are usually included in the author notice.
(Of course, that's not me...)

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail191.messagelabs.com (mail191.messagelabs.com [216.82.242.19])
	by kanga.kvack.org (Postfix) with SMTP id 37D456B01B5
	for ; Sun, 13 Jun 2010 07:24:59 -0400 (EDT)
Received: from m3.gw.fujitsu.co.jp ([10.0.50.73])
	by fgwmail7.fujitsu.co.jp (Fujitsu Gateway) with ESMTP id o5DBOvpQ021763
	for (envelope-from kosaki.motohiro@jp.fujitsu.com); Sun, 13 Jun 2010 20:24:57 +0900
Received: from smail (m3 [127.0.0.1])
	by outgoing.m3.gw.fujitsu.co.jp (Postfix) with ESMTP id CB0D545DE50
	for ; Sun, 13 Jun 2010 20:24:56 +0900 (JST)
Received: from s3.gw.fujitsu.co.jp (s3.gw.fujitsu.co.jp [10.0.50.93])
	by m3.gw.fujitsu.co.jp (Postfix) with ESMTP id A95AD45DE4D
	for ; Sun, 13 Jun 2010 20:24:56 +0900 (JST)
Received: from s3.gw.fujitsu.co.jp (localhost.localdomain [127.0.0.1])
	by s3.gw.fujitsu.co.jp (Postfix) with ESMTP id 804C41DB8038
	for ; Sun, 13 Jun 2010 20:24:56 +0900 (JST)
Received: from m106.s.css.fujitsu.com (m106.s.css.fujitsu.com [10.249.87.106])
	by s3.gw.fujitsu.co.jp (Postfix) with ESMTP id 2E2F71DB803E
	for ; Sun, 13 Jun 2010 20:24:56 +0900 (JST)
From: KOSAKI Motohiro
Subject: Re: [patch 07/18] oom: filter tasks not sharing the same cpuset
In-Reply-To: <20100608122740.8f045c78.akpm@linux-foundation.org>
References: <20100608122740.8f045c78.akpm@linux-foundation.org>
Message-Id: <20100613201257.6199.A69D9226@jp.fujitsu.com>
MIME-Version: 1.0
Content-Type: text/plain; charset="ISO-2022-JP"
Content-Transfer-Encoding: 7bit
Date: Sun, 13 Jun 2010 20:24:55 +0900 (JST)
Sender: owner-linux-mm@kvack.org
To: Andrew Morton
Cc: kosaki.motohiro@jp.fujitsu.com, David Rientjes, Rik van Riel, Nick Piggin,
	Oleg Nesterov, Balbir Singh, KAMEZAWA Hiroyuki, linux-mm@kvack.org
List-ID:

Sorry for the delay.

> On Tue, 8 Jun 2010 11:51:32 -0700 (PDT)
> David Rientjes wrote:
>
> > Andrew, are you the maintainer for these fixes or is KOSAKI?
>
> I am, thanks. Kosaki-san, you're making this harder than it should be.
> Please either ack David's patches or promptly work with him on
> finalising them.

Thanks, Andrew, David. I agree with you. I don't find any end-user harm
or regressions in David's latest patch series. So, I'm glad to join his work.

Unfortunately, I don't have enough time right now, so I expect my next
review will not be very soon. But I promise I'll do it.

thanks.

>
> I realise that you have additional oom-killer patches but it's too
> complex to try to work on two patch series concurrently. So let's
> concentrate on get David's work sorted out and merged and then please
> rebase yours on the result.
>
> I certainly don't have the time or inclination to go through two
> patchsets and work out what the similarities and differences are so
> I'll be concentrating on David's ones first. The order in which we
> do this doesn't really matter.
From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail191.messagelabs.com (mail191.messagelabs.com [216.82.242.19])
	by kanga.kvack.org (Postfix) with SMTP id 9FC566B01B5
	for ; Sun, 13 Jun 2010 07:25:02 -0400 (EDT)
Received: from m2.gw.fujitsu.co.jp ([10.0.50.72])
	by fgwmail6.fujitsu.co.jp (Fujitsu Gateway) with ESMTP id o5DBP0Il022723
	for (envelope-from kosaki.motohiro@jp.fujitsu.com); Sun, 13 Jun 2010 20:25:00 +0900
Received: from smail (m2 [127.0.0.1])
	by outgoing.m2.gw.fujitsu.co.jp (Postfix) with ESMTP id 307FA45DE51
	for ; Sun, 13 Jun 2010 20:25:00 +0900 (JST)
Received: from s2.gw.fujitsu.co.jp (s2.gw.fujitsu.co.jp [10.0.50.92])
	by m2.gw.fujitsu.co.jp (Postfix) with ESMTP id 0E69D45DD77
	for ; Sun, 13 Jun 2010 20:25:00 +0900 (JST)
Received: from s2.gw.fujitsu.co.jp (localhost.localdomain [127.0.0.1])
	by s2.gw.fujitsu.co.jp (Postfix) with ESMTP id EAE8C1DB8038
	for ; Sun, 13 Jun 2010 20:24:59 +0900 (JST)
Received: from m106.s.css.fujitsu.com (m106.s.css.fujitsu.com [10.249.87.106])
	by s2.gw.fujitsu.co.jp (Postfix) with ESMTP id 9861E1DB803E
	for ; Sun, 13 Jun 2010 20:24:56 +0900 (JST)
From: KOSAKI Motohiro
Subject: Re: [patch 18/18] oom: deprecate oom_adj tunable
In-Reply-To: <20100608161844.04d2f2a1.akpm@linux-foundation.org>
References: <20100608194514.7654.A69D9226@jp.fujitsu.com>
	<20100608161844.04d2f2a1.akpm@linux-foundation.org>
Message-Id: <20100613201922.619C.A69D9226@jp.fujitsu.com>
MIME-Version: 1.0
Content-Type: text/plain; charset="ISO-2022-JP"
Content-Transfer-Encoding: 7bit
Date: Sun, 13 Jun 2010 20:24:55 +0900 (JST)
Sender: owner-linux-mm@kvack.org
To: Andrew Morton
Cc: kosaki.motohiro@jp.fujitsu.com, David Rientjes, Rik van Riel, Nick Piggin,
	Oleg Nesterov, Balbir Singh, KAMEZAWA Hiroyuki, linux-mm@kvack.org
List-ID:

> On Tue, 8 Jun 2010 20:42:02 +0900 (JST)
> KOSAKI Motohiro wrote:
>
> > > +	/*
> > > +	 * Warn that /proc/pid/oom_adj is deprecated, see
> > > +	 * Documentation/feature-removal-schedule.txt.
> > > +	 */
> > > +	printk_once(KERN_WARNING "%s (%d): /proc/%d/oom_adj is deprecated, "
> > > +			"please use /proc/%d/oom_score_adj instead.\n",
> > > +			current->comm, task_pid_nr(current),
> > > +			task_pid_nr(task), task_pid_nr(task));
> > >  	task->signal->oom_adj = oom_adjust;
> >
> > Sorry, we can't accept this. oom_adj is one of most frequently used
> > tuning knob. putting this one makes a lot of confusion.
> >
> > In addition, this knob is used from some applications (please google
> > by google code search or something else). that said, an enduser can't
> > stop the warning. that makes a lot of frustration. NO.
> >
>
> I think it's OK. We made a mistake in adding oom_adj in the first
> place and now we get to live with the consequences.
>
> We'll be stuck with oom_adj for the next 200 years if we don't tell
> people to stop using it, and a printk_once() is a good way of doing
> that.
>
> It could be that in two years time we decide that we can't remove oom_adj
> yet because too many people are still using it. Maybe it will take ten
> years - but unless we add the above printk, oom_adj will remain
> forever.

But oom_score_adj has no benefit from the end user's view. That's the problem.
Please consider making an end-user friendly patch first.

I mean, I'm not against a better knob deprecating the old one, but I require
that 'better' means better for end users.
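
For readers weighing the two knobs, here is a small sketch of how a legacy oom_adj value might be mapped onto the oom_score_adj scale, and of what a score adjustment could mean in bytes. Everything in it is an assumption made for illustration: the ranges (-17..15 for oom_adj, -1000..1000 for oom_score_adj), the linear mapping, and the "one unit is 1/1000 of memory" reading; the helper names are hypothetical. Only the 4G / 250 / 1G figures come from this thread (David's reply on patch 18/18).

```c
/* Assumed ranges for this sketch; see the lead-in for what is and is not sourced. */
#define OOM_ADJ_MIN        (-17)  /* legacy "never kill" value */
#define OOM_ADJ_MAX        15
#define OOM_SCORE_ADJ_MAX  1000

/* Map a legacy oom_adj value onto the oom_score_adj scale (assumed linear). */
static int oom_adj_to_score_adj(int oom_adj)
{
	/* The top of the legacy range maps to the top of the new range. */
	if (oom_adj == OOM_ADJ_MAX)
		return OOM_SCORE_ADJ_MAX;
	return (oom_adj * OOM_SCORE_ADJ_MAX) / -OOM_ADJ_MIN;
}

/* Bytes of bias implied by a score adjustment, at 1/1000 of memory per unit. */
static long long score_adj_to_bytes(long long total_bytes, int score_adj)
{
	return total_bytes * score_adj / OOM_SCORE_ADJ_MAX;
}
```

Under these assumptions score_adj_to_bytes(4LL << 30, 250) is 1LL << 30, i.e. 1G on a 4G machine, and an oom_adj of -17 lands on -1000.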
From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail202.messagelabs.com (mail202.messagelabs.com [216.82.254.227])
	by kanga.kvack.org (Postfix) with SMTP id E86DD6B01BE
	for ; Sun, 13 Jun 2010 07:25:02 -0400 (EDT)
Received: from m3.gw.fujitsu.co.jp ([10.0.50.73])
	by fgwmail7.fujitsu.co.jp (Fujitsu Gateway) with ESMTP id o5DBP0xD021786
	for (envelope-from kosaki.motohiro@jp.fujitsu.com); Sun, 13 Jun 2010 20:25:00 +0900
Received: from smail (m3 [127.0.0.1])
	by outgoing.m3.gw.fujitsu.co.jp (Postfix) with ESMTP id 9225C45DE4D
	for ; Sun, 13 Jun 2010 20:25:00 +0900 (JST)
Received: from s3.gw.fujitsu.co.jp (s3.gw.fujitsu.co.jp [10.0.50.93])
	by m3.gw.fujitsu.co.jp (Postfix) with ESMTP id 6F34045DE4E
	for ; Sun, 13 Jun 2010 20:25:00 +0900 (JST)
Received: from s3.gw.fujitsu.co.jp (localhost.localdomain [127.0.0.1])
	by s3.gw.fujitsu.co.jp (Postfix) with ESMTP id 5AA0C1DB8038
	for ; Sun, 13 Jun 2010 20:25:00 +0900 (JST)
Received: from m106.s.css.fujitsu.com (m106.s.css.fujitsu.com [10.249.87.106])
	by s3.gw.fujitsu.co.jp (Postfix) with ESMTP id 0BB441DB803B
	for ; Sun, 13 Jun 2010 20:24:57 +0900 (JST)
From: KOSAKI Motohiro
Subject: Re: [patch 05/18] oom: give current access to memory reserves if it has been killed
In-Reply-To: <20100608131211.e769e3a1.akpm@linux-foundation.org>
References: <20100608203216.765D.A69D9226@jp.fujitsu.com>
	<20100608131211.e769e3a1.akpm@linux-foundation.org>
Message-Id: <20100613202009.619F.A69D9226@jp.fujitsu.com>
MIME-Version: 1.0
Content-Type: text/plain; charset="ISO-2022-JP"
Content-Transfer-Encoding: 7bit
Date: Sun, 13 Jun 2010 20:24:56 +0900 (JST)
Sender: owner-linux-mm@kvack.org
To: Andrew Morton
Cc: kosaki.motohiro@jp.fujitsu.com, David Rientjes, Rik van Riel, Nick Piggin,
	Oleg Nesterov, Balbir Singh, KAMEZAWA Hiroyuki, linux-mm@kvack.org
List-ID:

> On Tue, 8 Jun 2010 20:41:57 +0900 (JST)
> KOSAKI Motohiro wrote:
>
> > > +
> > >  	if (sysctl_panic_on_oom == 2) {
> > >  		dump_header(NULL, gfp_mask, order, NULL);
> > >  		panic("out of memory. Compulsory panic_on_oom is selected.\n");
> >
> > Sorry, I had found this patch works incorrect. I don't pulled.
>
> Saying "it doesn't work and I'm not telling you why" is unhelpful. In
> fact it's the opposite of helpful because it blocks merging of the fix
> and doesn't give us any way to move forward.
>
> So what can I do? Hard.
>
> What I shall do is to merge the patch in the hope that someone else will
> discover the undescribed problem and we will fix it then. That's very
> inefficient.

Please see 5 minute before posting e-mail. thanks.

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail143.messagelabs.com (mail143.messagelabs.com [216.82.254.35])
	by kanga.kvack.org (Postfix) with SMTP id 8435D6B01C4
	for ; Mon, 14 Jun 2010 07:08:27 -0400 (EDT)
Received: from m5.gw.fujitsu.co.jp ([10.0.50.75])
	by fgwmail5.fujitsu.co.jp (Fujitsu Gateway) with ESMTP id o5EB8OOO007551
	for (envelope-from kosaki.motohiro@jp.fujitsu.com); Mon, 14 Jun 2010 20:08:24 +0900
Received: from smail (m5 [127.0.0.1])
	by outgoing.m5.gw.fujitsu.co.jp (Postfix) with ESMTP id DAACD45DE51
	for ; Mon, 14 Jun 2010 20:08:23 +0900 (JST)
Received: from s5.gw.fujitsu.co.jp (s5.gw.fujitsu.co.jp [10.0.50.95])
	by m5.gw.fujitsu.co.jp (Postfix) with ESMTP id B8E3645DE4F
	for ; Mon, 14 Jun 2010 20:08:23 +0900 (JST)
Received: from s5.gw.fujitsu.co.jp (localhost.localdomain [127.0.0.1])
	by s5.gw.fujitsu.co.jp (Postfix) with ESMTP id 631AE1DB8038
	for ; Mon, 14 Jun 2010 20:08:23 +0900 (JST)
Received: from m105.s.css.fujitsu.com (m105.s.css.fujitsu.com [10.249.87.105])
	by s5.gw.fujitsu.co.jp (Postfix) with ESMTP id B281E1DB805B
	for ; Mon, 14 Jun 2010 20:08:22 +0900 (JST)
From: KOSAKI Motohiro
Subject: Re: [patch 05/18] oom: give current access to memory reserves if it has been killed
In-Reply-To:
References: <20100608203216.765D.A69D9226@jp.fujitsu.com>
Message-Id: <20100614195055.9DAE.A69D9226@jp.fujitsu.com>
MIME-Version: 1.0
Content-Type: text/plain; charset="ISO-2022-JP"
Content-Transfer-Encoding: 7bit
Date: Mon, 14 Jun 2010 20:08:21 +0900 (JST)
Sender: owner-linux-mm@kvack.org
To: David Rientjes
Cc: kosaki.motohiro@jp.fujitsu.com, Andrew Morton, Rik van Riel, Nick Piggin,
	Oleg Nesterov, Balbir Singh, KAMEZAWA Hiroyuki, linux-mm@kvack.org
List-ID:

> > > +	/*
> > > +	 * If current has a pending SIGKILL, then automatically select it. The
> > > +	 * goal is to allow it to allocate so that it may quickly exit and free
> > > +	 * its memory.
> > > +	 */
> > > +	if (fatal_signal_pending(current)) {
> > > +		set_thread_flag(TIF_MEMDIE);
> > > +		return;
> > > +	}
> > > +
> > >  	if (sysctl_panic_on_oom == 2) {
> > >  		dump_header(NULL, gfp_mask, order, NULL);
> > >  		panic("out of memory. Compulsory panic_on_oom is selected.\n");
> >
> > Sorry, I had found this patch works incorrect. I don't pulled.
> >
>
> You're taking back your ack?
>
> Why does this not work? It's not killing a potentially immune task, the
> task is already dying. We're simply giving it access to memory reserves
> so that it may quickly exit and die. OOM_DISABLE does not imply that a
> task cannot exit on its own or be killed by another application or user,
> we simply don't want to needlessly kill another task when current is dying
> in the first place without being able to allocate memory.
>
> Please reconsider your thought.

Oh, I wasn't talking about OOM_DISABLE. Probably my explanation was too poor.

My point is, the above code assumes that a pending SIGKILL is a good sign
that the task is going to exit soon, but that is not always true. It is
only true if the task is a regular userland process. Kernel module authors
are free to create very strange kernel threads. Note: Linux is one of the
most popular general-purpose OSes in the world and we have a million funny
drivers out there.

Plus, if a false positive occurs, setting TIF_MEMDIE is very dangerous,
because while there is a TIF_MEMDIE task our kernel doesn't send the next
OOM kill. That means the system can reach deadlock. On the other hand, a
false negative is relatively safe: it causes one innocent task to be killed,
but the system doesn't lock up.

So we have a strong motivation to avoid false positives. I hope you add
some conservative check. I don't disagree with your patch's concept; I only
worry about the danger.

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail138.messagelabs.com (mail138.messagelabs.com [216.82.249.35])
	by kanga.kvack.org (Postfix) with ESMTP id 132196B01D0
	for ; Wed, 16 Jun 2010 23:36:24 -0400 (EDT)
Received: from hpaq3.eem.corp.google.com (hpaq3.eem.corp.google.com [172.25.149.3])
	by smtp-out.google.com with ESMTP id o5H3aMGn006816
	for ; Wed, 16 Jun 2010 20:36:22 -0700
Received: from pxi18 (pxi18.prod.google.com [10.243.27.18])
	by hpaq3.eem.corp.google.com with ESMTP id o5H3aK9h023458
	for ; Wed, 16 Jun 2010 20:36:20 -0700
Received: by pxi18 with SMTP id 18so707378pxi.40
	for ; Wed, 16 Jun 2010 20:36:19 -0700 (PDT)
Date: Wed, 16 Jun 2010 20:36:18 -0700 (PDT)
From: David Rientjes
Subject: Re: [patch 18/18] oom: deprecate oom_adj tunable
In-Reply-To: <20100613201922.619C.A69D9226@jp.fujitsu.com>
Message-ID:
References: <20100608194514.7654.A69D9226@jp.fujitsu.com>
	<20100608161844.04d2f2a1.akpm@linux-foundation.org>
	<20100613201922.619C.A69D9226@jp.fujitsu.com>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-linux-mm@kvack.org
To: KOSAKI Motohiro
Cc: Andrew Morton, Rik van Riel, Nick Piggin, Oleg Nesterov, Balbir Singh,
	KAMEZAWA Hiroyuki, linux-mm@kvack.org
List-ID:

On Sun, 13 Jun 2010, KOSAKI Motohiro wrote:

> But oom_score_adj has no benefit from
end-uses view. That's problem. > Please consider to make end-user friendly good patch at first. > Of course it does, it actually has units whereas oom_adj only grows or shrinks the badness score exponentially. oom_score_adj's units are well understood: on a machine with 4G of memory, 250 means we're trying to prejudice it by 1G of memory so that can be used by other tasks, -250 means other tasks should be prejudiced by 1G in comparison to this task, etc. It's actually quite powerful. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail172.messagelabs.com (mail172.messagelabs.com [216.82.254.3]) by kanga.kvack.org (Postfix) with ESMTP id 18ECC6B01B8 for ; Thu, 17 Jun 2010 01:13:06 -0400 (EDT) Received: from hpaq3.eem.corp.google.com (hpaq3.eem.corp.google.com [172.25.149.3]) by smtp-out.google.com with ESMTP id o5H5D1l0004295 for ; Wed, 16 Jun 2010 22:13:01 -0700 Received: from pwi5 (pwi5.prod.google.com [10.241.219.5]) by hpaq3.eem.corp.google.com with ESMTP id o5H5CuLQ028340 for ; Wed, 16 Jun 2010 22:13:00 -0700 Received: by pwi5 with SMTP id 5so1198585pwi.38 for ; Wed, 16 Jun 2010 22:12:56 -0700 (PDT) Date: Wed, 16 Jun 2010 22:12:53 -0700 (PDT) From: David Rientjes Subject: Re: [patch 16/18] oom: badness heuristic rewrite In-Reply-To: <20100608194533.7657.A69D9226@jp.fujitsu.com> Message-ID: References: <20100608194533.7657.A69D9226@jp.fujitsu.com> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-linux-mm@kvack.org To: KOSAKI Motohiro Cc: Andrew Morton , Rik van Riel , Nick Piggin , Oleg Nesterov , Balbir Singh , KAMEZAWA Hiroyuki , linux-mm@kvack.org List-ID: On Tue, 8 Jun 2010, KOSAKI Motohiro wrote: > > diff --git a/fs/proc/base.c b/fs/proc/base.c > > --- a/fs/proc/base.c > > +++ b/fs/proc/base.c > > @@ -63,6 +63,7 @@ > > #include > > 
#include > > #include > > +#include > > #include > > #include > > #include > > @@ -428,16 +429,18 @@ static const struct file_operations proc_lstats_operations = { > > #endif > > > > /* The badness from the OOM killer */ > > -unsigned long badness(struct task_struct *p, unsigned long uptime); > > static int proc_oom_score(struct task_struct *task, char *buffer) > > { > > unsigned long points = 0; > > - struct timespec uptime; > > > > - do_posix_clock_monotonic_gettime(&uptime); > > read_lock(&tasklist_lock); > > if (pid_alive(task)) > > - points = badness(task, uptime.tv_sec); > > + points = oom_badness(task->group_leader, > > + global_page_state(NR_INACTIVE_ANON) + > > + global_page_state(NR_ACTIVE_ANON) + > > + global_page_state(NR_INACTIVE_FILE) + > > + global_page_state(NR_ACTIVE_FILE) + > > + total_swap_pages); > > Sorry I can't ack this. again and again, I try to explain why this is wrong > (hopefully last) > > 1) incompatibility > oom_score is one of ABI. then, we can't change this. from enduser view, > this change is no merit. In general, an incompatibility is allowed on very > limited situation such as that an end-user get much benefit than compatibility. > In other word, old style ABI doesn't works fine from end user view. > But, in this case, it isn't. > There is no incompatibility here, /proc/pid/oom_score has no meaningful units because of the old heuristic. The _only_ thing it represents is a score in comparison with other eligible tasks to decide which task to kill. Thus, oom_score by itself means nothing if not compared to other eligible tasks. Although deprecated, /proc/pid/oom_adj still changes /proc/pid/oom_score_adj with a different scale (-17 maps to -1000 and +15 maps to +1000), so there is absolutely no userspace imcompatibility with this change. > 2) technically incorrect > this math is not correct math. this is not represented "allowed memory". 
> example, 1) this is not accumulated mlocked memory, but it can be freed
> task kill 2) SHM_LOCKED memory freeablility depend on IPC_RMID did or not.
> if not, task killing doesn't free SYSV IPC memory.

Ah, very good point. We should be using totalram_pages + total_swap_pages
here to represent global normalization, memcg limit for CONSTRAINT_MEMCG,
and a total of node_spanned_pages for mempolicy nodes or cpuset mems for
CONSTRAINT_MEMORY_POLICY and CONSTRAINT_CPUSET, respectively. I'll make
that switch in the next revision, thanks!

> In additon, 3) This normalization doesn't works on asymmetric numa.
> total pages and oom are not related almostly.

What this does is represent the heuristic baseline, rss and swap, as a
proportion depending on the type of oom constraint. This works when
comparing eligible tasks amongst each other because the task with the
highest rss and swap is the one we (normally) want to kill, minus the 3%
privilege given to root and outside influence of /proc/pid/oom_score_adj.

We want to represent this as a proportion and not as a sheer value simply
because the task may be attached to a cpuset, a memcg, or bound to a
mempolicy out from under the task's knowledge. That is, we compare tasks
sharing the same constraint for oom kill and normalize the heuristic based
on that. We don't want to expose a userspace interface that takes memory
quantities directly since the task may be bound to a mempolicy, for
instance, later and the oom_score_adj is then rendered obsolete.

> 4) scalability. if the
> system 10TB memory, 1 point oom score mean 10GB memory consumption.

Well, sure, a 10TB system would have a large granularity such as that :)
But in such cases we don't necessarily care if one task is using 5GB more
than another task using 1TB, for example.
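To make the granularity point concrete, here is a minimal userspace sketch (not the kernel code) of the proportional baseline from the patch, (rss + swap) * 1000 / totalpages. The function name baseline_points, the 4 KiB page size, and the use of binary TiB/GiB sizes are assumptions made only for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed for illustration: 4 KiB pages, binary size units. */
#define PAGES_PER_GIB (UINT64_C(1) << 18)
#define PAGES_PER_TIB (UINT64_C(1) << 28)

/*
 * Hypothetical helper mirroring the patch's baseline: the share of
 * allowed memory, in units of 1/1000, held in a task's rss and swap.
 */
unsigned int baseline_points(uint64_t rss_pages, uint64_t swap_pages,
                             uint64_t totalpages)
{
        if (!totalpages)        /* same divide-by-zero guard as the patch */
                totalpages = 1;
        return (unsigned int)((rss_pages + swap_pages) * 1000 / totalpages);
}
```

With a 10 TiB total, a task holding 1 TiB and a task holding 1 TiB plus 5 GiB both come out at 100 points under these assumptions, which is the coarseness being discussed.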
> > read_unlock(&tasklist_lock); > > return sprintf(buffer, "%lu\n", points); > > } > > @@ -1042,7 +1045,15 @@ static ssize_t oom_adjust_write(struct file *file, const char __user *buf, > > } > > > > task->signal->oom_adj = oom_adjust; > > - > > + /* > > + * Scale /proc/pid/oom_score_adj appropriately ensuring that a maximum > > + * value is always attainable. > > + */ > > + if (task->signal->oom_adj == OOM_ADJUST_MAX) > > + task->signal->oom_score_adj = OOM_SCORE_ADJ_MAX; > > + else > > + task->signal->oom_score_adj = (oom_adjust * OOM_SCORE_ADJ_MAX) / > > + -OOM_DISABLE; > > unlock_task_sighand(task, &flags); > > put_task_struct(task); > > Generically, I wasn't against the feature for rare use-case. but sorry, > as far as I investigated, I haven't find any actual user. then, I don't > put ack, because my reviewing basically stand on 1) how much user use this > 2) how strongly required this from an users 3) how much side effect is there > etc etc. not cool or not. oom_score_adj is much more powerful than oom_adj simply because it (i) is in units that are understood, not a bitshift on a widely unpredictable heuristic, and (ii) the granularity is _much_ finer than oom_adj. We have many use cases for this internally especially when we bind tasks to cpusets or memcg and they change in size. 
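As a rough illustration of the rescaling in the oom_adjust_write() hunk quoted above, and of its inverse in oom_score_adj_write() from the same patch, the following userspace sketch reproduces the integer arithmetic. The helper names are hypothetical, and the constant values are assumed from the mapping described earlier in this thread (-17 maps to -1000, +15 maps to +1000).

```c
#include <assert.h>

/* Values assumed from the mapping described in this thread. */
#define OOM_DISABLE        (-17)
#define OOM_ADJUST_MAX     15
#define OOM_SCORE_ADJ_MIN  (-1000)
#define OOM_SCORE_ADJ_MAX  1000

/* Hypothetical helper: legacy oom_adj -> oom_score_adj units. */
int oom_adj_to_score_adj(int oom_adj)
{
        assert(oom_adj >= OOM_DISABLE && oom_adj <= OOM_ADJUST_MAX);
        if (oom_adj == OOM_ADJUST_MAX)
                return OOM_SCORE_ADJ_MAX;       /* keep the maximum attainable */
        return (oom_adj * OOM_SCORE_ADJ_MAX) / -OOM_DISABLE;
}

/* Hypothetical helper: oom_score_adj -> legacy oom_adj units. */
int score_adj_to_oom_adj(int score_adj)
{
        assert(score_adj >= OOM_SCORE_ADJ_MIN && score_adj <= OOM_SCORE_ADJ_MAX);
        if (score_adj == OOM_SCORE_ADJ_MIN)
                return OOM_DISABLE;             /* keep OOM_DISABLE attainable */
        return (score_adj * OOM_ADJUST_MAX) / OOM_SCORE_ADJ_MAX;
}
```

Because both directions use truncating integer division, a round trip is lossy in this sketch: an oom_adj of 8 becomes 470, and 470 maps back to 7. That is consistent with the point that oom_score_adj has much finer granularity than oom_adj.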
> > @@ -1055,6 +1066,82 @@ static const struct file_operations proc_oom_adjust_operations = { > > .llseek = generic_file_llseek, > > }; > > > > +static ssize_t oom_score_adj_read(struct file *file, char __user *buf, > > + size_t count, loff_t *ppos) > > +{ > > + struct task_struct *task = get_proc_task(file->f_path.dentry->d_inode); > > + char buffer[PROC_NUMBUF]; > > + int oom_score_adj = OOM_SCORE_ADJ_MIN; > > + unsigned long flags; > > + size_t len; > > + > > + if (!task) > > + return -ESRCH; > > + if (lock_task_sighand(task, &flags)) { > > + oom_score_adj = task->signal->oom_score_adj; > > + unlock_task_sighand(task, &flags); > > + } > > + put_task_struct(task); > > + len = snprintf(buffer, sizeof(buffer), "%d\n", oom_score_adj); > > + return simple_read_from_buffer(buf, count, ppos, buffer, len); > > +} > > + > > +static ssize_t oom_score_adj_write(struct file *file, const char __user *buf, > > + size_t count, loff_t *ppos) > > +{ > > + struct task_struct *task; > > + char buffer[PROC_NUMBUF]; > > + unsigned long flags; > > + long oom_score_adj; > > + int err; > > + > > + memset(buffer, 0, sizeof(buffer)); > > + if (count > sizeof(buffer) - 1) > > + count = sizeof(buffer) - 1; > > + if (copy_from_user(buffer, buf, count)) > > + return -EFAULT; > > + > > + err = strict_strtol(strstrip(buffer), 0, &oom_score_adj); > > + if (err) > > + return -EINVAL; > > + if (oom_score_adj < OOM_SCORE_ADJ_MIN || > > + oom_score_adj > OOM_SCORE_ADJ_MAX) > > + return -EINVAL; > > + > > + task = get_proc_task(file->f_path.dentry->d_inode); > > + if (!task) > > + return -ESRCH; > > + if (!lock_task_sighand(task, &flags)) { > > + put_task_struct(task); > > + return -ESRCH; > > + } > > + if (oom_score_adj < task->signal->oom_score_adj && > > + !capable(CAP_SYS_RESOURCE)) { > > + unlock_task_sighand(task, &flags); > > + put_task_struct(task); > > + return -EACCES; > > + } > > + > > + task->signal->oom_score_adj = oom_score_adj; > > + /* > > + * Scale /proc/pid/oom_adj appropriately 
ensuring that OOM_DISABLE is > > + * always attainable. > > + */ > > + if (task->signal->oom_score_adj == OOM_SCORE_ADJ_MIN) > > + task->signal->oom_adj = OOM_DISABLE; > > + else > > + task->signal->oom_adj = (oom_score_adj * OOM_ADJUST_MAX) / > > + OOM_SCORE_ADJ_MAX; > > + unlock_task_sighand(task, &flags); > > + put_task_struct(task); > > + return count; > > +} > > + > > +static const struct file_operations proc_oom_score_adj_operations = { > > + .read = oom_score_adj_read, > > + .write = oom_score_adj_write, > > +}; > > + > > #ifdef CONFIG_AUDITSYSCALL > > #define TMPBUFLEN 21 > > static ssize_t proc_loginuid_read(struct file * file, char __user * buf, > > @@ -2627,6 +2714,7 @@ static const struct pid_entry tgid_base_stuff[] = { > > #endif > > INF("oom_score", S_IRUGO, proc_oom_score), > > REG("oom_adj", S_IRUGO|S_IWUSR, proc_oom_adjust_operations), > > + REG("oom_score_adj", S_IRUGO|S_IWUSR, proc_oom_score_adj_operations), > > #ifdef CONFIG_AUDITSYSCALL > > REG("loginuid", S_IWUSR|S_IRUGO, proc_loginuid_operations), > > REG("sessionid", S_IRUGO, proc_sessionid_operations), > > @@ -2961,6 +3049,7 @@ static const struct pid_entry tid_base_stuff[] = { > > #endif > > INF("oom_score", S_IRUGO, proc_oom_score), > > REG("oom_adj", S_IRUGO|S_IWUSR, proc_oom_adjust_operations), > > + REG("oom_score_adj", S_IRUGO|S_IWUSR, proc_oom_score_adj_operations), > > #ifdef CONFIG_AUDITSYSCALL > > REG("loginuid", S_IWUSR|S_IRUGO, proc_loginuid_operations), > > REG("sessionid", S_IRUSR, proc_sessionid_operations), > > diff --git a/mm/oom_kill.c b/mm/oom_kill.c > > --- a/mm/oom_kill.c > > +++ b/mm/oom_kill.c > > @@ -4,6 +4,8 @@ > > * Copyright (C) 1998,2000 Rik van Riel > > * Thanks go out to Claus Fischer for some serious inspiration and > > * for goading me into coding this file... > > + * Copyright (C) 2010 Google, Inc. > > + * Rewritten by David Rientjes > > don't put it. 
> > > > > * > > * The routines in this file are used to kill a process when > > * we're seriously out of memory. This gets called from __alloc_pages() > > @@ -34,7 +36,6 @@ int sysctl_panic_on_oom; > > int sysctl_oom_kill_allocating_task; > > int sysctl_oom_dump_tasks = 1; > > static DEFINE_SPINLOCK(zone_scan_lock); > > -/* #define DEBUG */ > > > > /* > > * Do all threads of the target process overlap our allowed nodes? > > @@ -84,139 +85,72 @@ static struct task_struct *find_lock_task_mm(struct task_struct *p) > > } > > > > /** > > - * badness - calculate a numeric value for how bad this task has been > > + * oom_badness - heuristic function to determine which candidate task to kill > > * @p: task struct of which task we should calculate > > - * @uptime: current uptime in seconds > > + * @totalpages: total present RAM allowed for page allocation > > * > > - * The formula used is relatively simple and documented inline in the > > - * function. The main rationale is that we want to select a good task > > - * to kill when we run out of memory. > > - * > > - * Good in this context means that: > > - * 1) we lose the minimum amount of work done > > - * 2) we recover a large amount of memory > > - * 3) we don't kill anything innocent of eating tons of memory > > - * 4) we want to kill the minimum amount of processes (one) > > - * 5) we try to kill the process the user expects us to kill, this > > - * algorithm has been meticulously tuned to meet the principle > > - * of least surprise ... (be careful when you change it) > > + * The heuristic for determining which task to kill is made to be as simple and > > + * predictable as possible. The goal is to return the highest value for the > > + * task consuming the most memory to avoid subsequent oom failures. 
> > */ > > - > > -unsigned long badness(struct task_struct *p, unsigned long uptime) > > +unsigned int oom_badness(struct task_struct *p, unsigned long totalpages) > > { > > - unsigned long points, cpu_time, run_time; > > - struct task_struct *child; > > - struct task_struct *c, *t; > > - int oom_adj = p->signal->oom_adj; > > - struct task_cputime task_time; > > - unsigned long utime; > > - unsigned long stime; > > - > > - if (oom_adj == OOM_DISABLE) > > - return 0; > > + int points; > > > > p = find_lock_task_mm(p); > > if (!p) > > return 0; > > > > /* > > - * The memory size of the process is the basis for the badness. > > - */ > > - points = p->mm->total_vm; > > - > > - /* > > - * After this unlock we can no longer dereference local variable `mm' > > - */ > > - task_unlock(p); > > - > > - /* > > - * swapoff can easily use up all memory, so kill those first. > > + * Shortcut check for OOM_SCORE_ADJ_MIN so the entire heuristic doesn't > > + * need to be executed for something that cannot be killed. > > */ > > - if (p->flags & PF_OOM_ORIGIN) > > - return ULONG_MAX; > > - > > - /* > > - * Processes which fork a lot of child processes are likely > > - * a good choice. We add half the vmsize of the children if they > > - * have an own mm. This prevents forking servers to flood the > > - * machine with an endless amount of children. In case a single > > - * child is eating the vast majority of memory, adding only half > > - * to the parents will make the child our kill candidate of choice. 
> > - */ > > - t = p; > > - do { > > - list_for_each_entry(c, &t->children, sibling) { > > - child = find_lock_task_mm(c); > > - if (child) { > > - if (child->mm != p->mm) > > - points += child->mm->total_vm/2 + 1; > > - task_unlock(child); > > - } > > - } > > - } while_each_thread(p, t); > > + if (p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN) { > > + task_unlock(p); > > + return 0; > > + } > > > > /* > > - * CPU time is in tens of seconds and run time is in thousands > > - * of seconds. There is no particular reason for this other than > > - * that it turned out to work very well in practice. > > + * When the PF_OOM_ORIGIN bit is set, it indicates the task should have > > + * priority for oom killing. > > */ > > - thread_group_cputime(p, &task_time); > > - utime = cputime_to_jiffies(task_time.utime); > > - stime = cputime_to_jiffies(task_time.stime); > > - cpu_time = (utime + stime) >> (SHIFT_HZ + 3); > > - > > - > > - if (uptime >= p->start_time.tv_sec) > > - run_time = (uptime - p->start_time.tv_sec) >> 10; > > - else > > - run_time = 0; > > - > > - if (cpu_time) > > - points /= int_sqrt(cpu_time); > > - if (run_time) > > - points /= int_sqrt(int_sqrt(run_time)); > > + if (p->flags & PF_OOM_ORIGIN) { > > + task_unlock(p); > > + return 1000; > > + } > > > > /* > > - * Niced processes are most likely less important, so double > > - * their badness points. > > + * The memory controller may have a limit of 0 bytes, so avoid a divide > > + * by zero if necessary. > > */ > > - if (task_nice(p) > 0) > > - points *= 2; > > You removed > - run time check > - cpu time check > - nice check > > but no described the reason. reviewers are puzzled. How do we review > this though we don't get your point? please write > The comment for oom_badness() reflects these changes: our goal is to make the heuristic as simple and _predictable_ as possible, we can't allow runtime and cputime, for example, to avoid freeing more memory by biasing against those tasks. 
A long cputime does not indicate the importance of a task, nor does it
avoid subsequent oom kills in the future because we've freed less memory by
killing other tasks as a result.

> - What benerit is there?

It's predictable and users understand exactly what the heuristic is.

> - Why do you think no bad effect?

These heuristics seem to have been misplaced from the beginning and there
was a _lot_ of desire to remove them dating back a couple of years: we
simply can't convert runtime or nice levels into potential for memory
freeing. It's much better to have a sane and predictable heuristic that
will react in similar circumstances to do exactly what the oom killer
intends to do: oom kill a task that will free a large amount of memory to
avoid subsequent failures that will result in an even greater amount of
work.

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail203.messagelabs.com (mail203.messagelabs.com [216.82.254.243]) by kanga.kvack.org (Postfix) with ESMTP id 2BACC6B01B8 for ; Thu, 17 Jun 2010 01:14:52 -0400 (EDT)
Received: from kpbe14.cbf.corp.google.com (kpbe14.cbf.corp.google.com [172.25.105.78]) by smtp-out.google.com with ESMTP id o5H5ElNl027967 for ; Wed, 16 Jun 2010 22:14:47 -0700
Received: from pva18 (pva18.prod.google.com [10.241.209.18]) by kpbe14.cbf.corp.google.com with ESMTP id o5H5EP0u027227 for ; Wed, 16 Jun 2010 22:14:46 -0700
Received: by pva18 with SMTP id 18so123013pva.32 for ; Wed, 16 Jun 2010 22:14:45 -0700 (PDT)
Date: Wed, 16 Jun 2010 22:14:44 -0700 (PDT)
From: David Rientjes
Subject: Re: [patch 16/18] oom: badness heuristic rewrite
In-Reply-To: <20100608160216.bc52112b.akpm@linux-foundation.org>
Message-ID:
References: <20100608194533.7657.A69D9226@jp.fujitsu.com> <20100608160216.bc52112b.akpm@linux-foundation.org>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-linux-mm@kvack.org To: Andrew Morton Cc: KOSAKI Motohiro , Rik van Riel , Nick Piggin , Oleg Nesterov , Balbir Singh , KAMEZAWA Hiroyuki , linux-mm@kvack.org List-ID: On Tue, 8 Jun 2010, Andrew Morton wrote: > > > + if (!totalpages) > > > + totalpages = 1; > > > > > > /* > > > - * Superuser processes are usually more important, so we make it > > > - * less likely that we kill those. > > > + * The baseline for the badness score is the proportion of RAM that each > > > + * task's rss and swap space use. > > > */ > > > - if (has_capability_noaudit(p, CAP_SYS_ADMIN) || > > > - has_capability_noaudit(p, CAP_SYS_RESOURCE)) > > > - points /= 4; > > > + points = (get_mm_rss(p->mm) + get_mm_counter(p->mm, MM_SWAPENTS)) * 1000 / > > > + totalpages; > > > + task_unlock(p); > > > > > > /* > > > - * We don't want to kill a process with direct hardware access. > > > - * Not only could that mess up the hardware, but usually users > > > - * tend to only have this flag set on applications they think > > > - * of as important. > > > + * Root processes get 3% bonus, just like the __vm_enough_memory() > > > + * implementation used by LSMs. > > > */ > > > - if (has_capability_noaudit(p, CAP_SYS_RAWIO)) > > > - points /= 4; > > > + if (has_capability_noaudit(p, CAP_SYS_ADMIN)) > > > + points -= 30; > > > > > > CAP_SYS_ADMIN seems no good idea. CAP_SYS_ADMIN imply admin's interactive > > process. but killing interactive process only cause force logout. but > > killing system daemon can makes more catastrophic disaster. > > > > > > Last of all, I'll pulled this one. but only do cherry-pick. > > > > This change was unchangelogged, I don't know what it's for and I don't > understand your comment about it. 
>

It was in the changelog (recall that the badness() function represents a
proportion of available memory used by a task, so subtracting 30 is the
equivalent of 3% of available memory):

        Root tasks are given 3% extra memory just like __vm_enough_memory()
        provides in LSMs. In the event of two tasks consuming similar
        amounts of memory, it is generally better to save root's task.

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail203.messagelabs.com (mail203.messagelabs.com [216.82.254.243]) by kanga.kvack.org (Postfix) with ESMTP id 960FD6B01CC for ; Thu, 17 Jun 2010 01:32:20 -0400 (EDT)
Received: from kpbe19.cbf.corp.google.com (kpbe19.cbf.corp.google.com [172.25.105.83]) by smtp-out.google.com with ESMTP id o5H5WFgA016344 for ; Wed, 16 Jun 2010 22:32:15 -0700
Received: from pvc7 (pvc7.prod.google.com [10.241.209.135]) by kpbe19.cbf.corp.google.com with ESMTP id o5H5WCNf023810 for ; Wed, 16 Jun 2010 22:32:13 -0700
Received: by pvc7 with SMTP id 7so238048pvc.14 for ; Wed, 16 Jun 2010 22:32:12 -0700 (PDT)
Date: Wed, 16 Jun 2010 22:32:09 -0700 (PDT)
From: David Rientjes
Subject: Re: [patch 16/18] oom: badness heuristic rewrite
In-Reply-To: <20100608155802.cdd4aff3.akpm@linux-foundation.org>
Message-ID:
References: <20100608155802.cdd4aff3.akpm@linux-foundation.org>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-linux-mm@kvack.org
To: Andrew Morton
Cc: Rik van Riel, Nick Piggin, Oleg Nesterov, Balbir Singh, KAMEZAWA Hiroyuki, KOSAKI Motohiro, linux-mm@kvack.org
List-ID:

On Tue, 8 Jun 2010, Andrew Morton wrote:

> > This a complete rewrite of the oom killer's badness() heuristic which is
> > used to determine which task to kill in oom conditions.
The goal is to
> > make it as simple and predictable as possible so the results are better
> > understood and we end up killing the task which will lead to the most
> > memory freeing while still respecting the fine-tuning from userspace.
>
> It's not obvious from this description that then end result is better!

I think it's fairly obvious that predictability is an important part of
any heuristic that will determine whether your task survives or dies.

> Have you any testcases or scenarios which got improved?
>

Yes, as cited below in the changelog with the KDE example.

> > Instead of basing the heuristic on mm->total_vm for each task, the task's
> > rss and swap space is used instead. This is a better indication of the
> > amount of memory that will be freeable if the oom killed task is chosen
> > and subsequently exits.
>
> Again, why should we optimise for the amount of memory which a killing
> will yield (if that's what you mean). We only need to free enough
> memory to unblock the oom condition then proceed.
>

That's what the oom killer has always done simply because we want to avoid
subsequent oom conditions in the near future that will require additional
tasks to be killed. It seems far better to kill a large memory-hogging
task[*] than ten smaller tasks that total the same amount of memory usage.

[*] And, with this rewrite, "memory-hogging" can be defined for the first
time from userspace with a tunable, oom_score_adj, that actually has
units so that within a cpuset, for example, we can bias a task by 25%
of available memory or bias other tasks against it by 25%. For the
first time ever, we can say "this task should be able to use 25% more
memory than other tasks without getting killed first."

> The last thing we want to do is to kill a process which has consumed
> 1000 CPU hours, or which is providing some system-critical service or
> whatever. Amount-of-memory-freeable is a relatively minor criterion.
>

What would you suggest otherwise? Cputime?
Then we may never be able to fork our bash shell or ssh into our machines. > > This helps specifically in cases where KDE or > > GNOME is chosen for oom kill on desktop systems instead of a memory > > hogging task. > > It helps how? Examples and test cases? > Because KDE and GNOME typically have very large mm->total_vm values but the amount of resident memory in RAM is consumed by other tasks, even memory leakers. mm->total_vm is agreed to be a very poor heursitic baseline by just about everyone. > > The baseline for the heuristic is a proportion of memory that each task is > > currently using in memory plus swap compared to the amount of "allowable" > > memory. > > What does "swap" mean? swapspace includes swap-backed swapcache, > un-swap-backed swapcache and non-resident swap. Which of all these is > being used here and for what reason? > This is swap cache, the number of swap entries for the task which could be freeable if the task is killed that could subsequently be used for page allocations that triggered the oom killer. We want to add hints to the oom killer so that memory which cannot be used on blockable memory allocations may be freed so we don't call into the oom killer again in the near future. > > /proc/pid/oom_adj is changed so that its meaning is rescaled into the > > units used by /proc/pid/oom_score_adj, and vice versa. Changing one of > > these per-task tunables will rescale the value of the other to an > > equivalent meaning. Although /proc/pid/oom_adj was originally defined as > > a bitshift on the badness score, it now shares the same linear growth as > > /proc/pid/oom_score_adj but with different granularity. This is required > > so the ABI is not broken with userspace applications and allows oom_adj to > > be deprecated for future removal. > > It was a mistake to add oom_adj in the first place. Because it's a > user-visible knob which us tied to a particular in-kernel > implementation. 
As we're seeing now, the presence of that knob locks > us into a particular implementation. > Agreed. > Given that oom_score_adj is just a rescaled version of oom_adj > (correct?), I guess things haven't got a lot worse on that front as a > result of these changes. > No, it's not a rescaled version at all, we merely rescale oom_adj to oom_score_adj units because everyone objected to removing oom_adj without deprecation first. oom_score_adj has units: a proportion of memory available to the application, meaning how much of the system, memcg, cpuset, or mempolicy it should be biased or favored by. Please see the change to Documentation/filesystems/proc.txt which explain this pretty elaborately. > General observation regarding the patch description: I'm not seeing a > lot of reason for merging the patch! What value does it bring to our > users? What problems got solved? > It significantly improves the oom killer's predictability, it protects vital system tasks like KDE and GNOME on the desktop, it allows users to tune each task with a bias or preference in units they understand to affect its score, and it allows that interface to remain constant and valid even when those tasks are subsequently attached to a cgroup or bound to a mempolicy (or their limits or set of allowed nodes are changed). -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail191.messagelabs.com (mail191.messagelabs.com [216.82.242.19]) by kanga.kvack.org (Postfix) with SMTP id 80A396B01AC for ; Mon, 21 Jun 2010 07:45:51 -0400 (EDT) Received: from m2.gw.fujitsu.co.jp ([10.0.50.72]) by fgwmail6.fujitsu.co.jp (Fujitsu Gateway) with ESMTP id o5LBjnKR003977 for (envelope-from kosaki.motohiro@jp.fujitsu.com); Mon, 21 Jun 2010 20:45:49 +0900 Received: from smail (m2 [127.0.0.1]) by outgoing.m2.gw.fujitsu.co.jp (Postfix) with ESMTP id 1DC4A45DE61 for ; Mon, 21 Jun 2010 20:45:49 +0900 (JST) Received: from s2.gw.fujitsu.co.jp (s2.gw.fujitsu.co.jp [10.0.50.92]) by m2.gw.fujitsu.co.jp (Postfix) with ESMTP id EDF1745DD76 for ; Mon, 21 Jun 2010 20:45:48 +0900 (JST) Received: from s2.gw.fujitsu.co.jp (localhost.localdomain [127.0.0.1]) by s2.gw.fujitsu.co.jp (Postfix) with ESMTP id CBDFD1DB8040 for ; Mon, 21 Jun 2010 20:45:48 +0900 (JST) Received: from m108.s.css.fujitsu.com (m108.s.css.fujitsu.com [10.249.87.108]) by s2.gw.fujitsu.co.jp (Postfix) with ESMTP id 781E71DB803C for ; Mon, 21 Jun 2010 20:45:48 +0900 (JST) From: KOSAKI Motohiro Subject: Re: [patch 18/18] oom: deprecate oom_adj tunable In-Reply-To: References: <20100613201922.619C.A69D9226@jp.fujitsu.com> Message-Id: <20100621194943.B536.A69D9226@jp.fujitsu.com> MIME-Version: 1.0 Content-Type: text/plain; charset="ISO-2022-JP" Content-Transfer-Encoding: 7bit Date: Mon, 21 Jun 2010 20:45:47 +0900 (JST) Sender: owner-linux-mm@kvack.org To: David Rientjes Cc: kosaki.motohiro@jp.fujitsu.com, Andrew Morton , Rik van Riel , Nick Piggin , Oleg Nesterov , Balbir Singh , KAMEZAWA Hiroyuki , linux-mm@kvack.org List-ID: > On Sun, 13 Jun 2010, KOSAKI Motohiro wrote: > > > But oom_score_adj have no benefit form end-uses view. That's problem. > > Please consider to make end-user friendly good patch at first. 
> >
>
> Of course it does, it actually has units whereas oom_adj only grows or
> shrinks the badness score exponentially. oom_score_adj's units are well
> understood: on a machine with 4G of memory, 250 means we're trying to
> prejudice it by 1G of memory so that can be used by other tasks, -250
> means other tasks should be prejudiced by 1G in comparison to this task,
> etc. It's actually quite powerful.

And no real user wants such power.

Consider the desktop use case: end users don't set oom_adj themselves;
their applications do. That means oom_adj now behaves like a syscall-like
system interface rather than a kernel knob.

Application developers don't need oom_score_adj either, because they don't
know the memory size of the end user's machine.

So you would get the benefit of the change while end users bear the cost.
That is out of balance.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ .
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail203.messagelabs.com (mail203.messagelabs.com [216.82.254.243]) by kanga.kvack.org (Postfix) with SMTP id DACCD6B01B5 for ; Mon, 21 Jun 2010 07:45:53 -0400 (EDT) Received: from m2.gw.fujitsu.co.jp ([10.0.50.72]) by fgwmail7.fujitsu.co.jp (Fujitsu Gateway) with ESMTP id o5LBjpjS024563 for (envelope-from kosaki.motohiro@jp.fujitsu.com); Mon, 21 Jun 2010 20:45:51 +0900 Received: from smail (m2 [127.0.0.1]) by outgoing.m2.gw.fujitsu.co.jp (Postfix) with ESMTP id B69AB45DE51 for ; Mon, 21 Jun 2010 20:45:50 +0900 (JST) Received: from s2.gw.fujitsu.co.jp (s2.gw.fujitsu.co.jp [10.0.50.92]) by m2.gw.fujitsu.co.jp (Postfix) with ESMTP id 8FDEB45DD76 for ; Mon, 21 Jun 2010 20:45:50 +0900 (JST) Received: from s2.gw.fujitsu.co.jp (localhost.localdomain [127.0.0.1]) by s2.gw.fujitsu.co.jp (Postfix) with ESMTP id A29251DB803F for ; Mon, 21 Jun 2010 20:45:49 +0900 (JST) Received: from m108.s.css.fujitsu.com (m108.s.css.fujitsu.com [10.249.87.108]) by s2.gw.fujitsu.co.jp (Postfix) with ESMTP id 543041DB803B for ; Mon, 21 Jun 2010 20:45:49 +0900 (JST) From: KOSAKI Motohiro Subject: Re: [patch 16/18] oom: badness heuristic rewrite In-Reply-To: References: <20100608160216.bc52112b.akpm@linux-foundation.org> Message-Id: <20100621200549.B53C.A69D9226@jp.fujitsu.com> MIME-Version: 1.0 Content-Type: text/plain; charset="ISO-2022-JP" Content-Transfer-Encoding: 7bit Date: Mon, 21 Jun 2010 20:45:48 +0900 (JST) Sender: owner-linux-mm@kvack.org To: David Rientjes Cc: kosaki.motohiro@jp.fujitsu.com, Andrew Morton , Rik van Riel , Nick Piggin , Oleg Nesterov , Balbir Singh , KAMEZAWA Hiroyuki , linux-mm@kvack.org List-ID: > > This change was unchangelogged, I don't know what it's for and I don't > > understand your comment about it. 
> >
>
> It was in the changelog (recall that the badness() function represents a
> proportion of available memory used by a task, so subtracting 30 is the
> equivalent of 3% of available memory):
>
>         Root tasks are given 3% extra memory just like __vm_enough_memory()
>         provides in LSMs. In the event of two tasks consuming similar amounts of
>         memory, it is generally better to save root's task.

LSMs have an obvious reason to prioritize an admin's operations over
root-privileged daemons: otherwise admins could not recover from trouble.

But in this case, why do we need to prioritize an admin shell over daemons?

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail144.messagelabs.com (mail144.messagelabs.com [216.82.254.51]) by kanga.kvack.org (Postfix) with SMTP id 27AE06B01AF for ; Mon, 21 Jun 2010 07:45:56 -0400 (EDT)
Received: from m6.gw.fujitsu.co.jp ([10.0.50.76]) by fgwmail7.fujitsu.co.jp (Fujitsu Gateway) with ESMTP id o5LBjrlX024568 for (envelope-from kosaki.motohiro@jp.fujitsu.com); Mon, 21 Jun 2010 20:45:54 +0900
Received: from smail (m6 [127.0.0.1]) by outgoing.m6.gw.fujitsu.co.jp (Postfix) with ESMTP id A9DCD45DE52 for ; Mon, 21 Jun 2010 20:45:53 +0900 (JST)
Received: from s6.gw.fujitsu.co.jp (s6.gw.fujitsu.co.jp [10.0.50.96]) by m6.gw.fujitsu.co.jp (Postfix) with ESMTP id 79A2D45DE4E for ; Mon, 21 Jun 2010 20:45:53 +0900 (JST)
Received: from s6.gw.fujitsu.co.jp (localhost.localdomain [127.0.0.1]) by s6.gw.fujitsu.co.jp (Postfix) with ESMTP id 5FE241DB8017 for ; Mon, 21 Jun 2010 20:45:53 +0900 (JST)
Received: from m108.s.css.fujitsu.com (m108.s.css.fujitsu.com [10.249.87.108]) by s6.gw.fujitsu.co.jp (Postfix) with ESMTP id 0876CE08003 for ; Mon, 21 Jun 2010 20:45:50 +0900 (JST)
From: KOSAKI Motohiro
Subject: Re: [patch 16/18] oom: badness heuristic rewrite
In-Reply-To:
References: <20100608194533.7657.A69D9226@jp.fujitsu.com> Message-Id: <20100621203838.B542.A69D9226@jp.fujitsu.com> MIME-Version: 1.0 Content-Type: text/plain; charset="ISO-2022-JP" Content-Transfer-Encoding: 7bit Date: Mon, 21 Jun 2010 20:45:49 +0900 (JST) Sender: owner-linux-mm@kvack.org To: David Rientjes Cc: kosaki.motohiro@jp.fujitsu.com, Andrew Morton , Rik van Riel , Nick Piggin , Oleg Nesterov , Balbir Singh , KAMEZAWA Hiroyuki , linux-mm@kvack.org List-ID: > > Sorry I can't ack this. again and again, I try to explain why this is wrong > > (hopefully last) > > > > 1) incompatibility > > oom_score is one of ABI. then, we can't change this. from enduser view, > > this change is no merit. In general, an incompatibility is allowed on very > > limited situation such as that an end-user get much benefit than compatibility. > > In other word, old style ABI doesn't works fine from end user view. > > But, in this case, it isn't. > > > > There is no incompatibility here, /proc/pid/oom_score has no meaningful > units because of the old heuristic. The _only_ thing it represents is a > score in comparison with other eligible tasks to decide which task to > kill. Thus, oom_score by itself means nothing if not compared to other > eligible tasks. > > Although deprecated, /proc/pid/oom_adj still changes > /proc/pid/oom_score_adj with a different scale (-17 maps to -1000 and +15 > maps to +1000), so there is absolutely no userspace imcompatibility with > this change. I sympathize your burden. Yes, oom_adj is suck. but it is still an abi. we (kernel developers) can't define it as no meaningful. that's defined by userland folks. If you want to change the world, you need to discuss userland folks. > > > 2) technically incorrect > > this math is not correct math. this is not represented "allowed memory". > > example, 1) this is not accumulated mlocked memory, but it can be freed > > task kill 2) SHM_LOCKED memory freeablility depend on IPC_RMID did or not. 
> > if not, task killing doesn't free SYSV IPC memory. > > Ah, very good point. We should be using totalram_pages + total_swap_pages > here to represent global normalization, memcg limit for CONSTRAINT_MEMCG, > and a total of node_spanned_pages for mempolicy nodes or cpuset mems for > CONSTRAINT_MEMORY_POLICY and CONSTRAINT_CPUSET, respectively. I'll make > that switch in the next revision, thanks! I don't understand. What problem does this solve? > > > In addition, 3) This normalization doesn't work on asymmetric NUMA. > > Total pages and oom are almost unrelated. > > What this does is represent the heuristic baseline, rss and swap, as a > proportion depending on the type of oom constraint. This works when > comparing eligible tasks amongst each other because the task with the > highest rss and swap is the one we (normally) want to kill, minus the 3% > privilege given to root and outside influence of /proc/pid/oom_score_adj. > > We want to represent this as a proportion and not as a sheer value simply > because the task may be attached to a cpuset, a memcg, or bound to a > mempolicy out from under the task's knowledge. That is, we compare tasks > sharing the same constraint for oom kill and normalize the heuristic based > on that. We don't want to expose a userspace interface that takes memory > quantities directly since the task may be bound to a mempolicy, for > instance, later and the oom_score_adj is then rendered obsolete. I don't understand. Are you suggesting that we ignore this issue? I feel you are talking about an unrelated thing. Plus, the fact is, if you think "We don't want to expose a userspace interface that takes memory quantities directly", that was already done 5 years ago. Your proposal is 5 years too late. (look at andrea) > > 4) scalability. If the > > system has 10TB of memory, 1 point of oom score means 10GB of memory consumption.
> > Well, sure, a 10TB system would have a large granularity such as that :) > But in such cases we don't necessarily care if one task is using 5GB more > than another task using 1TB, for example. Probably not. When we are thinking common DB server workload, DB process consume almost memory, but it's OOM_DISABLEed. OOM victims are typically selected from some assistant JVM process. So, I don't think this is good idea. Instead, To enhance memcg oom notification looks promising. And other piece of this patch looks promising rather than this. please resend them. (of cource, test result too) -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail143.messagelabs.com (mail143.messagelabs.com [216.82.254.35]) by kanga.kvack.org (Postfix) with ESMTP id B2AE66B01B9 for ; Mon, 21 Jun 2010 16:47:47 -0400 (EDT) Received: from kpbe20.cbf.corp.google.com (kpbe20.cbf.corp.google.com [172.25.105.84]) by smtp-out.google.com with ESMTP id o5LKlhDn023454 for ; Mon, 21 Jun 2010 13:47:44 -0700 Received: from pwj7 (pwj7.prod.google.com [10.241.219.71]) by kpbe20.cbf.corp.google.com with ESMTP id o5LKlYF3030975 for ; Mon, 21 Jun 2010 13:47:42 -0700 Received: by pwj7 with SMTP id 7so1357417pwj.32 for ; Mon, 21 Jun 2010 13:47:42 -0700 (PDT) Date: Mon, 21 Jun 2010 13:47:35 -0700 (PDT) From: David Rientjes Subject: Re: [patch 16/18] oom: badness heuristic rewrite In-Reply-To: <20100621200549.B53C.A69D9226@jp.fujitsu.com> Message-ID: References: <20100608160216.bc52112b.akpm@linux-foundation.org> <20100621200549.B53C.A69D9226@jp.fujitsu.com> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-linux-mm@kvack.org To: KOSAKI Motohiro Cc: Andrew Morton , Rik van Riel , Nick Piggin , Oleg Nesterov , Balbir Singh , KAMEZAWA Hiroyuki , linux-mm@kvack.org List-ID: On Mon, 21 Jun 
2010, KOSAKI Motohiro wrote: > > It was in the changelog (recall that the badness() function represents a > > proportion of available memory used by a task, so subtracting 30 is the > > equivalent of 3% of available memory): > > > > Root tasks are given 3% extra memory just like __vm_enough_memory() > > provides in LSMs. In the event of two tasks consuming similar amounts of > > memory, it is generally better to save root's task. > > LSMs have obvious reason to tend to priotize admin's operation than root > privilege daemon. otherwise admins can't restore troubles. > > But in this case, why do need priotize admin shell than daemons? > For the same reason. We want to slightly bias admin shells and their processes from being oom killed because they are typically in the business of administering the machine and resolving issues that may arise. It would be irresponsible to consider them to have the same killing preference as user tasks in the case of a tie. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
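The arithmetic behind the 3% root bonus in this message can be sketched in a few lines. This is only an illustration of the proportion described in the thread (rss plus swap as a share of available memory, on a 0..1000 scale), not the kernel's badness() code. The function names, the clamping to 0..1000, and the exact rounding are assumptions made for the example.

```python
# Illustrative sketch only; names and clamping are assumptions,
# not the kernel's actual badness() implementation.

def badness_points(rss_pages, swap_pages, total_pages,
                   is_root=False, score_adj=0):
    """Score a task as a proportion (0..1000) of available memory."""
    # Baseline: rss + swap as thousandths of the memory available
    # under the current oom constraint.
    points = (rss_pages + swap_pages) * 1000 // total_pages
    # Root tasks get a 3% bonus, i.e. 30 points on a 1000-point scale.
    if is_root:
        points -= 30
    # Userspace influence via /proc/pid/oom_score_adj.
    points += score_adj
    # Assumed clamp so the result stays on the 0..1000 scale.
    return max(0, min(points, 1000))


def bytes_per_point(total_bytes):
    """Granularity of one point for a given amount of memory."""
    return total_bytes // 1000


# Two tasks using the same memory: the root task scores 30 points lower.
user_score = badness_points(250, 0, 1000)
root_score = badness_points(250, 0, 1000, is_root=True)

# On a 10TB machine one point corresponds to 10GB, as noted in the thread.
granularity = bytes_per_point(10 * 10**12)
```

With equal memory usage the root task ends up 30 points below the user task, which is the tie-breaking bias argued over in this thread.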
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail172.messagelabs.com (mail172.messagelabs.com [216.82.254.3]) by kanga.kvack.org (Postfix) with ESMTP id A2A526B01CC for ; Mon, 21 Jun 2010 16:54:24 -0400 (EDT) Received: from wpaz1.hot.corp.google.com (wpaz1.hot.corp.google.com [172.24.198.65]) by smtp-out.google.com with ESMTP id o5LKsJ1H023625 for ; Mon, 21 Jun 2010 13:54:19 -0700 Received: from pxi15 (pxi15.prod.google.com [10.243.27.15]) by wpaz1.hot.corp.google.com with ESMTP id o5LKsIRb025532 for ; Mon, 21 Jun 2010 13:54:18 -0700 Received: by pxi15 with SMTP id 15so452705pxi.16 for ; Mon, 21 Jun 2010 13:54:17 -0700 (PDT) Date: Mon, 21 Jun 2010 13:54:14 -0700 (PDT) From: David Rientjes Subject: Re: [patch 18/18] oom: deprecate oom_adj tunable In-Reply-To: <20100621194943.B536.A69D9226@jp.fujitsu.com> Message-ID: References: <20100613201922.619C.A69D9226@jp.fujitsu.com> <20100621194943.B536.A69D9226@jp.fujitsu.com> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-linux-mm@kvack.org To: KOSAKI Motohiro Cc: Andrew Morton , Rik van Riel , Nick Piggin , Oleg Nesterov , Balbir Singh , KAMEZAWA Hiroyuki , linux-mm@kvack.org List-ID: On Mon, 21 Jun 2010, KOSAKI Motohiro wrote: > > Of course it does, it actually has units whereas oom_adj only grows or > > shrinks the badness score exponentially. oom_score_adj's units are well > > understood: on a machine with 4G of memory, 250 means we're trying to > > prejudice it by 1G of memory so that can be used by other tasks, -250 > > means other tasks should be prejudiced by 1G in comparison to this task, > > etc. It's actually quite powerful. > > And, no real user want such power. > Google does, and I imagine other users will want to be able to normalize each task's memory usage against the others. It's perfectly legitimate for one task to consume 3G while another consumes 1G and want to select the 1G task to kill. 
Setting the 3G task's oom_score_adj value in this case to be -250, for example, depending on the memory capacity of the machine, makes much more sense than influencing it as a bitshift on top of a vastly unpredictable heuristic with oom_adj. This seems rather trivial to understand. > When we consider the desktop user case, end users don't use oom_adj themselves. > Their applications are using it. That means oom_adj now behaves like a syscall-like > system interface rather than a kernel knob. Application developers also don't > need oom_score_adj because application developers don't know the end user's > machine memory size. > I agree, oom_score_adj isn't targeted at the desktop nor is it targeted at application developers (unless they are setting it to OOM_SCORE_ADJ_MIN to disable oom killing for that task, for example). It's targeted at sysadmins and daemons that partition a machine to run a number of concurrent jobs. It's fine to use memcg, for example, to do such partitioning, but memcg can also cause oom conditions within the cgroup. We want to be able to tell the kernel, through an interface such as this, that one task shouldn't be killed because it's expected to use 3G of memory but should be killed when it's using 8G, for example. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ .
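The units described in this message can be worked through with a small sketch. It assumes each oom_score_adj point stands for 1/1000 of the memory available to the task, which is what the 4G example above implies. The oom_adj conversion uses only the two endpoints stated earlier in the thread (-17 maps to -1000, +15 maps to +1000); the linear scaling in between and the helper names are assumptions for illustration, not a statement of what the patch does.

```python
# Illustrative sketch only; helper names and the scaling between the
# two stated endpoints are assumptions.

OOM_SCORE_ADJ_MIN = -1000
OOM_SCORE_ADJ_MAX = 1000
OOM_DISABLE = -17       # legacy oom_adj value that disables oom killing
OOM_ADJUST_MAX = 15     # largest legacy oom_adj value


def score_adj_to_bytes(score_adj, total_bytes):
    """Amount of memory a given oom_score_adj value represents."""
    return score_adj * total_bytes // 1000


def legacy_adj_to_score_adj(oom_adj):
    """Map a legacy oom_adj value (-17..15) onto the -1000..1000 scale."""
    if oom_adj == OOM_ADJUST_MAX:
        # +15 is pinned to the maximum, as stated in the thread.
        return OOM_SCORE_ADJ_MAX
    # Assumed: linear scaling so that -17 lands exactly on -1000.
    return int(oom_adj * OOM_SCORE_ADJ_MAX / -OOM_DISABLE)


GIB = 2**30
# On a 4G machine, 250 corresponds to 1G, and -250 to -1G.
bias = score_adj_to_bytes(250, 4 * GIB)
discount = score_adj_to_bytes(-250, 4 * GIB)
```

The same oom_score_adj value represents a different absolute amount on a machine, memcg, or cpuset of a different size, which is the property both sides of this thread are arguing about.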
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail172.messagelabs.com (mail172.messagelabs.com [216.82.254.3]) by kanga.kvack.org (Postfix) with SMTP id C66906B01BA for ; Wed, 30 Jun 2010 05:26:25 -0400 (EDT) Received: from m2.gw.fujitsu.co.jp ([10.0.50.72]) by fgwmail7.fujitsu.co.jp (Fujitsu Gateway) with ESMTP id o5U9QMgk015308 for (envelope-from kosaki.motohiro@jp.fujitsu.com); Wed, 30 Jun 2010 18:26:22 +0900 Received: from smail (m2 [127.0.0.1]) by outgoing.m2.gw.fujitsu.co.jp (Postfix) with ESMTP id 35C8D45DE55 for ; Wed, 30 Jun 2010 18:26:22 +0900 (JST) Received: from s2.gw.fujitsu.co.jp (s2.gw.fujitsu.co.jp [10.0.50.92]) by m2.gw.fujitsu.co.jp (Postfix) with ESMTP id 1696745DE51 for ; Wed, 30 Jun 2010 18:26:22 +0900 (JST) Received: from s2.gw.fujitsu.co.jp (localhost.localdomain [127.0.0.1]) by s2.gw.fujitsu.co.jp (Postfix) with ESMTP id 003271DB8038 for ; Wed, 30 Jun 2010 18:26:22 +0900 (JST) Received: from m106.s.css.fujitsu.com (m106.s.css.fujitsu.com [10.249.87.106]) by s2.gw.fujitsu.co.jp (Postfix) with ESMTP id AB43B1DB803A for ; Wed, 30 Jun 2010 18:26:21 +0900 (JST) From: KOSAKI Motohiro Subject: Re: [patch 16/18] oom: badness heuristic rewrite In-Reply-To: References: <20100621200549.B53C.A69D9226@jp.fujitsu.com> Message-Id: <20100630181153.AA45.A69D9226@jp.fujitsu.com> MIME-Version: 1.0 Content-Type: text/plain; charset="ISO-2022-JP" Content-Transfer-Encoding: 7bit Date: Wed, 30 Jun 2010 18:26:20 +0900 (JST) Sender: owner-linux-mm@kvack.org To: David Rientjes Cc: kosaki.motohiro@jp.fujitsu.com, Andrew Morton , Rik van Riel , Nick Piggin , Oleg Nesterov , Balbir Singh , KAMEZAWA Hiroyuki , linux-mm@kvack.org List-ID: > On Mon, 21 Jun 2010, KOSAKI Motohiro wrote: > > > > It was in the changelog (recall that the badness() function represents a > > > proportion of available memory used by a task, so subtracting 30 is the > > > equivalent of 3% of available memory): > > > > > > Root tasks are given 3% 
extra memory just like __vm_enough_memory() > > > provides in LSMs. In the event of two tasks consuming similar amounts of > > > memory, it is generally better to save root's task. > > > > LSMs have an obvious reason to prioritize an admin's operations over a root > > privileged daemon: otherwise admins can't recover from trouble. > > > > But in this case, why do we need to prioritize an admin shell over daemons? > > > > For the same reason. We want to slightly bias admin shells and their > processes from being oom killed because they are typically in the business > of administering the machine and resolving issues that may arise. It > would be irresponsible to consider them to have the same killing > preference as user tasks in the case of a tie. It is not the same. An administrator can freely log in again. Typically, killing the login process also kills other processes in the same session, so plenty of memory is available afterwards. In the very few remaining cases, they can press SysRq+F as a last resort. On the other hand, a system daemon crash can bring down the whole system. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ .
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail191.messagelabs.com (mail191.messagelabs.com [216.82.242.19]) by kanga.kvack.org (Postfix) with ESMTP id 0A7BC6B01AC for ; Fri, 2 Jul 2010 18:35:18 -0400 (EDT) Date: Fri, 2 Jul 2010 15:35:08 -0700 From: Andrew Morton Subject: Re: [patch 07/18] oom: filter tasks not sharing the same cpuset Message-Id: <20100702153508.fda82eb9.akpm@linux-foundation.org> In-Reply-To: <20100613201257.6199.A69D9226@jp.fujitsu.com> References: <20100608122740.8f045c78.akpm@linux-foundation.org> <20100613201257.6199.A69D9226@jp.fujitsu.com> Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org To: KOSAKI Motohiro Cc: David Rientjes , Rik van Riel , Nick Piggin , Oleg Nesterov , Balbir Singh , KAMEZAWA Hiroyuki , linux-mm@kvack.org List-ID: On Sun, 13 Jun 2010 20:24:55 +0900 (JST) KOSAKI Motohiro wrote: > Sorry for the delay. > > > On Tue, 8 Jun 2010 11:51:32 -0700 (PDT) > > David Rientjes wrote: > > > > > Andrew, are you the maintainer for these fixes or is KOSAKI? > > > > I am, thanks. Kosaki-san, you're making this harder than it should be. > > Please either ack David's patches or promptly work with him on > > finalising them. > > Thanks, Andrew, David. I agree with you. I don't find any end users harm > and regressions in latest David's patch series. So, I'm glad to join his work. whew ;) > Unfortunatelly, I don't have enough time now. then, I expect my next review > is not quite soon. but I'll promise I'll do. So where do we go from here? I have about 12,000 oom-killer related emails saved up in my todo folder, ready for me to read next time I have an oom-killer session. What would happen if I just deleted them all? -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail191.messagelabs.com (mail191.messagelabs.com [216.82.242.19]) by kanga.kvack.org (Postfix) with ESMTP id B82A56B01AC for ; Sun, 4 Jul 2010 18:08:13 -0400 (EDT) Received: from wpaz33.hot.corp.google.com (wpaz33.hot.corp.google.com [172.24.198.97]) by smtp-out.google.com with ESMTP id o64M89Ae013687 for ; Sun, 4 Jul 2010 15:08:10 -0700 Received: from pxi6 (pxi6.prod.google.com [10.243.27.6]) by wpaz33.hot.corp.google.com with ESMTP id o64M87sX026496 for ; Sun, 4 Jul 2010 15:08:08 -0700 Received: by pxi6 with SMTP id 6so168512pxi.15 for ; Sun, 04 Jul 2010 15:08:07 -0700 (PDT) Date: Sun, 4 Jul 2010 15:08:01 -0700 (PDT) From: David Rientjes Subject: Re: [patch 07/18] oom: filter tasks not sharing the same cpuset In-Reply-To: <20100702153508.fda82eb9.akpm@linux-foundation.org> Message-ID: References: <20100608122740.8f045c78.akpm@linux-foundation.org> <20100613201257.6199.A69D9226@jp.fujitsu.com> <20100702153508.fda82eb9.akpm@linux-foundation.org> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-linux-mm@kvack.org To: Andrew Morton Cc: KOSAKI Motohiro , Rik van Riel , Nick Piggin , Oleg Nesterov , Balbir Singh , KAMEZAWA Hiroyuki , linux-mm@kvack.org List-ID: On Fri, 2 Jul 2010, Andrew Morton wrote: > So where do we go from here? I have about 12,000 oom-killer related > emails saved up in my todo folder, ready for me to read next time I > have an oom-killer session. > I'll be proposing my second revision of the badness heuristic rewrite in the next couple of days. That said, I don't know of any other outstanding patches that haven't yet been merged. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail202.messagelabs.com (mail202.messagelabs.com [216.82.254.227]) by kanga.kvack.org (Postfix) with SMTP id 716A06B02A3 for ; Thu, 8 Jul 2010 23:00:46 -0400 (EDT) Received: from m5.gw.fujitsu.co.jp ([10.0.50.75]) by fgwmail5.fujitsu.co.jp (Fujitsu Gateway) with ESMTP id o6930gBd011300 for (envelope-from kosaki.motohiro@jp.fujitsu.com); Fri, 9 Jul 2010 12:00:43 +0900 Received: from smail (m5 [127.0.0.1]) by outgoing.m5.gw.fujitsu.co.jp (Postfix) with ESMTP id 854DE45DE57 for ; Fri, 9 Jul 2010 12:00:40 +0900 (JST) Received: from s5.gw.fujitsu.co.jp (s5.gw.fujitsu.co.jp [10.0.50.95]) by m5.gw.fujitsu.co.jp (Postfix) with ESMTP id 1845545DE53 for ; Fri, 9 Jul 2010 12:00:40 +0900 (JST) Received: from s5.gw.fujitsu.co.jp (localhost.localdomain [127.0.0.1]) by s5.gw.fujitsu.co.jp (Postfix) with ESMTP id 004671DB8038 for ; Fri, 9 Jul 2010 12:00:39 +0900 (JST) Received: from ml14.s.css.fujitsu.com (ml14.s.css.fujitsu.com [10.249.87.104]) by s5.gw.fujitsu.co.jp (Postfix) with ESMTP id C26331DB805A for ; Fri, 9 Jul 2010 12:00:37 +0900 (JST) From: KOSAKI Motohiro Subject: Re: [patch 07/18] oom: filter tasks not sharing the same cpuset In-Reply-To: <20100702153508.fda82eb9.akpm@linux-foundation.org> References: <20100613201257.6199.A69D9226@jp.fujitsu.com> <20100702153508.fda82eb9.akpm@linux-foundation.org> Message-Id: <20100705110018.CC9F.A69D9226@jp.fujitsu.com> MIME-Version: 1.0 Content-Type: text/plain; charset="US-ASCII" Content-Transfer-Encoding: 7bit Date: Fri, 9 Jul 2010 12:00:36 +0900 (JST) Sender: owner-linux-mm@kvack.org To: Andrew Morton Cc: kosaki.motohiro@jp.fujitsu.com, David Rientjes , Rik van Riel , Nick Piggin , Oleg Nesterov , Balbir Singh , KAMEZAWA Hiroyuki , linux-mm@kvack.org List-ID: > > Unfortunatelly, I don't have enough time now. then, I expect my next review > > is not quite soon. but I'll promise I'll do. > > So where do we go from here? 
I have about 12,000 oom-killer related > emails saved up in my todo folder, ready for me to read next time I > have an oom-killer session. At the least, all deadlock issues should be fixed. I don't know whether Michel's problem is still there. I also think all desktop-related issues should be fixed. But I'm not eager to include domain-specific OOM tendencies. Those should be handled by a userland callback and a userland daemon, because any usecase-specific change can be seen as a regression by people with another usecase. About David's patch, I dunno. He didn't explain what change his patch makes. If he explains its worth and people agree with it, it can be merged. But otherwise..... > What would happen if I just deleted them all? Probably, no problem. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org