From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail203.messagelabs.com (mail203.messagelabs.com [216.82.254.243]) by kanga.kvack.org (Postfix) with ESMTP id C4CCC6B01B7 for ; Tue, 1 Jun 2010 03:18:19 -0400 (EDT) Received: from wpaz9.hot.corp.google.com (wpaz9.hot.corp.google.com [172.24.198.73]) by smtp-out.google.com with ESMTP id o517IGii024703 for ; Tue, 1 Jun 2010 00:18:17 -0700 Received: from pwj1 (pwj1.prod.google.com [10.241.219.65]) by wpaz9.hot.corp.google.com with ESMTP id o517IErY012499 for ; Tue, 1 Jun 2010 00:18:15 -0700 Received: by pwj1 with SMTP id 1so441364pwj.27 for ; Tue, 01 Jun 2010 00:18:13 -0700 (PDT) Date: Tue, 1 Jun 2010 00:18:07 -0700 (PDT) From: David Rientjes Subject: [patch -mm 00/18] oom killer rewrite Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-linux-mm@kvack.org To: Andrew Morton Cc: Rik van Riel , Nick Piggin , Oleg Nesterov , KAMEZAWA Hiroyuki , KOSAKI Motohiro , Balbir Singh , linux-mm@kvack.org List-ID: This is yet another version of my oom killer rewrite, now rebased to mmotm-2010-05-21-16-05. This version removes the consolidation of the two existing sysctls, oom_kill_allocating_task and oom_dump_tasks, as recommended by a couple different people. This version also makes pagefault oom handling consistent with panic_on_oom behavior now that all architectures have been converted to using the oom killer instead of simply issuing a SIGKILL for current. Many thanks to Nick Piggin for converting the existing archs. 
---
 Documentation/feature-removal-schedule.txt |   25 +
 Documentation/filesystems/proc.txt         |  100 +++-
 Documentation/sysctl/vm.txt                |   23 +
 fs/proc/base.c                             |  107 ++++-
 include/linux/memcontrol.h                 |    8
 include/linux/mempolicy.h                  |   13
 include/linux/oom.h                        |   26 +
 include/linux/sched.h                      |    3
 kernel/fork.c                              |    1
 kernel/sysctl.c                            |   12
 mm/memcontrol.c                            |   18
 mm/mempolicy.c                             |   44 ++
 mm/oom_kill.c                              |  603 +++++++++++++++--------------
 mm/page_alloc.c                            |   29 -
 14 files changed, 680 insertions(+), 332 deletions(-)

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail190.messagelabs.com (mail190.messagelabs.com [216.82.249.51]) by kanga.kvack.org (Postfix) with ESMTP id 7A7416B01B7 for ; Tue, 1 Jun 2010 03:18:21 -0400 (EDT)
Received: from kpbe14.cbf.corp.google.com (kpbe14.cbf.corp.google.com [172.25.105.78]) by smtp-out.google.com with ESMTP id o517IJV0022023 for ; Tue, 1 Jun 2010 00:18:19 -0700
Received: from pxi17 (pxi17.prod.google.com [10.243.27.17]) by kpbe14.cbf.corp.google.com with ESMTP id o517IHwh008932 for ; Tue, 1 Jun 2010 00:18:18 -0700
Received: by pxi17 with SMTP id 17so2071393pxi.25 for ; Tue, 01 Jun 2010 00:18:17 -0700 (PDT)
Date: Tue, 1 Jun 2010 00:18:15 -0700 (PDT)
From: David Rientjes
Subject: [patch -mm 01/18] oom: filter tasks not sharing the same cpuset
In-Reply-To:
Message-ID:
References:
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-linux-mm@kvack.org
To: Andrew Morton
Cc: Rik van Riel, Nick Piggin, Oleg Nesterov, KAMEZAWA Hiroyuki, KOSAKI Motohiro, Balbir Singh, linux-mm@kvack.org
List-ID:

Tasks that do not share the same set of allowed nodes with the task that triggered the oom should not be considered as candidates for oom kill.
Tasks in other cpusets with a disjoint set of mems would be unfairly penalized otherwise because of oom conditions elsewhere; an extreme example could unfairly kill all other applications on the system if a single task in a user's cpuset sets itself to OOM_DISABLE and then uses more memory than allowed. Killing tasks outside of current's cpuset rarely would free memory for current anyway. To use a sane heuristic, we must ensure that killing a task would likely free memory for current and avoid needlessly killing others at all costs just because their potential memory freeing is unknown. It is better to kill current than another task needlessly. Acked-by: Rik van Riel Acked-by: Nick Piggin Acked-by: Balbir Singh Reviewed-by: KAMEZAWA Hiroyuki Signed-off-by: David Rientjes --- mm/oom_kill.c | 12 +++--------- 1 files changed, 3 insertions(+), 9 deletions(-) diff --git a/mm/oom_kill.c b/mm/oom_kill.c --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -36,7 +36,7 @@ static DEFINE_SPINLOCK(zone_scan_lock); /* #define DEBUG */ /* - * Is all threads of the target process nodes overlap ours? + * Do all threads of the target process overlap our allowed nodes? */ static int has_intersects_mems_allowed(struct task_struct *tsk) { @@ -168,14 +168,6 @@ unsigned long badness(struct task_struct *p, unsigned long uptime) points /= 4; /* - * If p's nodes don't overlap ours, it may still help to kill p - * because p may have allocated or otherwise mapped memory on - * this node before. However it will be less likely. - */ - if (!has_intersects_mems_allowed(p)) - points /= 8; - - /* * Adjust the score by oom_adj. */ if (oom_adj) { @@ -267,6 +259,8 @@ static struct task_struct *select_bad_process(unsigned long *ppoints, continue; if (mem && !task_in_mem_cgroup(p, mem)) continue; + if (!has_intersects_mems_allowed(p)) + continue; /* * This task already has access to memory reserves and is -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. 
For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail172.messagelabs.com (mail172.messagelabs.com [216.82.254.3]) by kanga.kvack.org (Postfix) with ESMTP id E15B16B01C4 for ; Tue, 1 Jun 2010 03:18:27 -0400 (EDT) Received: from wpaz17.hot.corp.google.com (wpaz17.hot.corp.google.com [172.24.198.81]) by smtp-out.google.com with ESMTP id o517IM0F032564 for ; Tue, 1 Jun 2010 00:18:23 -0700 Received: from pzk29 (pzk29.prod.google.com [10.243.19.157]) by wpaz17.hot.corp.google.com with ESMTP id o517ILuR031825 for ; Tue, 1 Jun 2010 00:18:21 -0700 Received: by pzk29 with SMTP id 29so1119367pzk.3 for ; Tue, 01 Jun 2010 00:18:21 -0700 (PDT) Date: Tue, 1 Jun 2010 00:18:18 -0700 (PDT) From: David Rientjes Subject: [patch -mm 02/18] oom: sacrifice child with highest badness score for parent In-Reply-To: Message-ID: References: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-linux-mm@kvack.org To: Andrew Morton Cc: Rik van Riel , Nick Piggin , Oleg Nesterov , KAMEZAWA Hiroyuki , KOSAKI Motohiro , Balbir Singh , linux-mm@kvack.org List-ID: When a task is chosen for oom kill, the oom killer first attempts to sacrifice a child not sharing its parent's memory instead. Unfortunately, this often kills in a seemingly random fashion based on the ordering of the selected task's child list. Additionally, it is not guaranteed at all to free a large amount of memory that we need to prevent additional oom killing in the very near future. Instead, we now only attempt to sacrifice the worst child not sharing its parent's memory, if one exists. The worst child is indicated with the highest badness() score. This serves two advantages: we kill a memory-hogging task more often, and we allow the configurable /proc/pid/oom_adj value to be considered as a factor in which child to kill. 
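The selection rule described above can be sketched in userspace C. This is only an illustration under stated assumptions, not the kernel code: struct child, its mm_id and points fields, and pick_victim() are hypothetical stand-ins for task_struct, the mm pointer comparison, and the badness() score.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the task_struct fields that matter here. */
struct child {
	int mm_id;		/* equal ids mean a shared address space */
	unsigned long points;	/* badness-style score; 0 means unkillable */
};

/*
 * Return the index of the child with the highest score that does not
 * share the parent's address space, or -1 if no such child is killable
 * and the parent itself should be killed.
 */
static int pick_victim(int parent_mm_id, const struct child *c, size_t n)
{
	unsigned long victim_points = 0;
	int victim = -1;
	size_t i;

	assert(c != NULL || n == 0);
	for (i = 0; i < n; i++) {
		/* a child sharing the parent's memory is never sacrificed */
		if (c[i].mm_id == parent_mm_id)
			continue;
		/* a score of 0 never beats the initial victim_points */
		if (c[i].points > victim_points) {
			victim = (int)i;
			victim_points = c[i].points;
		}
	}
	return victim;
}
```

Because a score of 0 can never exceed the initial victim_points, an unkillable child is skipped without a separate check, which is the property the next paragraph relies on.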
Reviewers may observe that the previous implementation would iterate through the children and attempt to kill each until one was successful and then the parent if none were found while the new code simply kills the most memory-hogging task or the parent. Note that the only time oom_kill_task() fails, however, is when a child does not have an mm or has a /proc/pid/oom_adj of OOM_DISABLE. badness() returns 0 for both cases, so the final oom_kill_task() will always succeed. Acked-by: Rik van Riel Acked-by: Nick Piggin Acked-by: Balbir Singh Reviewed-by: KAMEZAWA Hiroyuki Reviewed-by: Minchan Kim Reviewed-by: KOSAKI Motohiro Signed-off-by: David Rientjes --- mm/oom_kill.c | 23 +++++++++++++++++------ 1 files changed, 17 insertions(+), 6 deletions(-) diff --git a/mm/oom_kill.c b/mm/oom_kill.c --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -433,7 +433,10 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order, unsigned long points, struct mem_cgroup *mem, const char *message) { + struct task_struct *victim = p; struct task_struct *c; + unsigned long victim_points = 0; + struct timespec uptime; if (printk_ratelimit()) dump_header(p, gfp_mask, order, mem); @@ -447,19 +450,27 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order, return 0; } - printk(KERN_ERR "%s: kill process %d (%s) score %li or a child\n", - message, task_pid_nr(p), p->comm, points); + pr_err("%s: Kill process %d (%s) with score %lu or sacrifice child\n", + message, task_pid_nr(p), p->comm, points); - /* Try to kill a child first */ + do_posix_clock_monotonic_gettime(&uptime); + /* Try to sacrifice the worst child first */ list_for_each_entry(c, &p->children, sibling) { + unsigned long cpoints; + if (c->mm == p->mm) continue; if (mem && !task_in_mem_cgroup(c, mem)) continue; - if (!oom_kill_task(c)) - return 0; + + /* badness() returns 0 if the thread is unkillable */ + cpoints = badness(c, uptime.tv_sec); + if (cpoints > victim_points) { + victim = c; + 
victim_points = cpoints; + } } - return oom_kill_task(p); + return oom_kill_task(victim); } #ifdef CONFIG_CGROUP_MEM_RES_CTLR -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail138.messagelabs.com (mail138.messagelabs.com [216.82.249.35]) by kanga.kvack.org (Postfix) with ESMTP id 3B3576B01C6 for ; Tue, 1 Jun 2010 03:18:28 -0400 (EDT) Received: from kpbe20.cbf.corp.google.com (kpbe20.cbf.corp.google.com [172.25.105.84]) by smtp-out.google.com with ESMTP id o517IPXZ024852 for ; Tue, 1 Jun 2010 00:18:26 -0700 Received: from pzk40 (pzk40.prod.google.com [10.243.19.168]) by kpbe20.cbf.corp.google.com with ESMTP id o517IAGV030019 for ; Tue, 1 Jun 2010 00:18:24 -0700 Received: by pzk40 with SMTP id 40so2371472pzk.23 for ; Tue, 01 Jun 2010 00:18:24 -0700 (PDT) Date: Tue, 1 Jun 2010 00:18:21 -0700 (PDT) From: David Rientjes Subject: [patch -mm 03/18] oom: select task from tasklist for mempolicy ooms In-Reply-To: Message-ID: References: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-linux-mm@kvack.org To: Andrew Morton Cc: Rik van Riel , Nick Piggin , Oleg Nesterov , KAMEZAWA Hiroyuki , KOSAKI Motohiro , Balbir Singh , linux-mm@kvack.org List-ID: The oom killer presently kills current whenever there is no more memory free or reclaimable on its mempolicy's nodes. There is no guarantee that current is a memory-hogging task or that killing it will free any substantial amount of memory, however. In such situations, it is better to scan the tasklist for nodes that are allowed to allocate on current's set of nodes and kill the task with the highest badness() score. This ensures that the most memory-hogging task, or the one configured by the user with /proc/pid/oom_adj, is always selected in such scenarios. 
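As a rough userspace sketch of this filtering (an illustration only, not the kernel implementation): the nodemask is modelled as a 64-bit bitmap, and struct task, enum policy_mode, policy_intersects() and select_victim() are hypothetical names. The rule that bind and interleave policies must intersect the mask, while default and preferred policies always qualify, follows the mempolicy_nodemask_intersects() helper added in the patch below.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified policy modes. */
enum policy_mode {
	POLICY_DEFAULT,
	POLICY_PREFERRED,
	POLICY_BIND,
	POLICY_INTERLEAVE
};

struct task {
	enum policy_mode mode;
	uint64_t nodes;		/* bit n set: node n is in the policy nodemask */
	unsigned long points;	/* badness-style score; 0 means unkillable */
};

/* Could this task have allocated memory on any node in mask? */
static bool policy_intersects(const struct task *t, uint64_t mask)
{
	switch (t->mode) {
	case POLICY_BIND:
	case POLICY_INTERLEAVE:
		return (t->nodes & mask) != 0;
	default:
		/* default and preferred policies may fall back to any node */
		return true;
	}
}

/* Index of the highest-scoring eligible task, or -1 if there is none. */
static int select_victim(const struct task *tasks, size_t n, uint64_t mask)
{
	unsigned long best = 0;
	int chosen = -1;
	size_t i;

	assert(tasks != NULL || n == 0);
	for (i = 0; i < n; i++) {
		if (!policy_intersects(&tasks[i], mask))
			continue;
		if (tasks[i].points > best) {
			best = tasks[i].points;
			chosen = (int)i;
		}
	}
	return chosen;
}
```

With this shape, a large task bound to nodes outside the mask is never chosen, however high its score, because killing it would be unlikely to free memory on the nodes that are actually exhausted.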
Reviewed-by: KOSAKI Motohiro Signed-off-by: David Rientjes --- include/linux/mempolicy.h | 13 +++++++- mm/mempolicy.c | 44 +++++++++++++++++++++++++ mm/oom_kill.c | 77 +++++++++++++++++++++++++++----------------- 3 files changed, 103 insertions(+), 31 deletions(-) diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h --- a/include/linux/mempolicy.h +++ b/include/linux/mempolicy.h @@ -210,6 +210,8 @@ extern struct zonelist *huge_zonelist(struct vm_area_struct *vma, unsigned long addr, gfp_t gfp_flags, struct mempolicy **mpol, nodemask_t **nodemask); extern bool init_nodemask_of_mempolicy(nodemask_t *mask); +extern bool mempolicy_nodemask_intersects(struct task_struct *tsk, + const nodemask_t *mask); extern unsigned slab_node(struct mempolicy *policy); extern enum zone_type policy_zone; @@ -338,7 +340,16 @@ static inline struct zonelist *huge_zonelist(struct vm_area_struct *vma, return node_zonelist(0, gfp_flags); } -static inline bool init_nodemask_of_mempolicy(nodemask_t *m) { return false; } +static inline bool init_nodemask_of_mempolicy(nodemask_t *m) +{ + return false; +} + +static inline bool mempolicy_nodemask_intersects(struct task_struct *tsk, + const nodemask_t *mask) +{ + return false; +} static inline int do_migrate_pages(struct mm_struct *mm, const nodemask_t *from_nodes, diff --git a/mm/mempolicy.c b/mm/mempolicy.c --- a/mm/mempolicy.c +++ b/mm/mempolicy.c @@ -1712,6 +1712,50 @@ bool init_nodemask_of_mempolicy(nodemask_t *mask) } #endif +/* + * mempolicy_nodemask_intersects + * + * If tsk's mempolicy is "default" [NULL], return 'true' to indicate default + * policy. Otherwise, check for intersection between mask and the policy + * nodemask for 'bind' or 'interleave' policy. For 'perferred' or 'local' + * policy, always return true since it may allocate elsewhere on fallback. + * + * Takes task_lock(tsk) to prevent freeing of its mempolicy. 
+ */ +bool mempolicy_nodemask_intersects(struct task_struct *tsk, + const nodemask_t *mask) +{ + struct mempolicy *mempolicy; + bool ret = true; + + if (!mask) + return ret; + task_lock(tsk); + mempolicy = tsk->mempolicy; + if (!mempolicy) + goto out; + + switch (mempolicy->mode) { + case MPOL_PREFERRED: + /* + * MPOL_PREFERRED and MPOL_F_LOCAL are only preferred nodes to + * allocate from, they may fallback to other nodes when oom. + * Thus, it's possible for tsk to have allocated memory from + * nodes in mask. + */ + break; + case MPOL_BIND: + case MPOL_INTERLEAVE: + ret = nodes_intersects(mempolicy->v.nodes, *mask); + break; + default: + BUG(); + } +out: + task_unlock(tsk); + return ret; +} + /* Allocate a page in interleaved policy. Own path because it needs to do special accounting. */ static struct page *alloc_page_interleave(gfp_t gfp, unsigned order, diff --git a/mm/oom_kill.c b/mm/oom_kill.c --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -27,6 +27,7 @@ #include #include #include +#include #include int sysctl_panic_on_oom; @@ -37,19 +38,35 @@ static DEFINE_SPINLOCK(zone_scan_lock); /* * Do all threads of the target process overlap our allowed nodes? + * @tsk: task struct of which task to consider + * @mask: nodemask passed to page allocator for mempolicy ooms */ -static int has_intersects_mems_allowed(struct task_struct *tsk) +static bool has_intersects_mems_allowed(struct task_struct *tsk, + const nodemask_t *mask) { - struct task_struct *t; + struct task_struct *start = tsk; - t = tsk; do { - if (cpuset_mems_allowed_intersects(current, t)) - return 1; - t = next_thread(t); - } while (t != tsk); - - return 0; + if (mask) { + /* + * If this is a mempolicy constrained oom, tsk's + * cpuset is irrelevant. Only return true if its + * mempolicy intersects current, otherwise it may be + * needlessly killed. 
+ */ + if (mempolicy_nodemask_intersects(tsk, mask)) + return true; + } else { + /* + * This is not a mempolicy constrained oom, so only + * check the mems of tsk's cpuset. + */ + if (cpuset_mems_allowed_intersects(current, tsk)) + return true; + } + tsk = next_thread(tsk); + } while (tsk != start); + return false; } /** @@ -237,7 +254,8 @@ static enum oom_constraint constrained_alloc(struct zonelist *zonelist, * (not docbooked, we don't want this one cluttering up the manual) */ static struct task_struct *select_bad_process(unsigned long *ppoints, - struct mem_cgroup *mem) + struct mem_cgroup *mem, enum oom_constraint constraint, + const nodemask_t *mask) { struct task_struct *p; struct task_struct *chosen = NULL; @@ -259,7 +277,9 @@ static struct task_struct *select_bad_process(unsigned long *ppoints, continue; if (mem && !task_in_mem_cgroup(p, mem)) continue; - if (!has_intersects_mems_allowed(p)) + if (!has_intersects_mems_allowed(p, + constraint == CONSTRAINT_MEMORY_POLICY ? mask : + NULL)) continue; /* @@ -483,7 +503,7 @@ void mem_cgroup_out_of_memory(struct mem_cgroup *mem, gfp_t gfp_mask) panic("out of memory(memcg). panic_on_oom is selected.\n"); read_lock(&tasklist_lock); retry: - p = select_bad_process(&points, mem); + p = select_bad_process(&points, mem, CONSTRAINT_NONE, NULL); if (!p || PTR_ERR(p) == -1UL) goto out; @@ -562,7 +582,8 @@ void clear_zonelist_oom(struct zonelist *zonelist, gfp_t gfp_mask) /* * Must be called with tasklist_lock held for read. */ -static void __out_of_memory(gfp_t gfp_mask, int order) +static void __out_of_memory(gfp_t gfp_mask, int order, + enum oom_constraint constraint, const nodemask_t *mask) { struct task_struct *p; unsigned long points; @@ -576,7 +597,7 @@ retry: * Rambo mode: Shoot down a process and hope it solves whatever * issues we may have. 
*/ - p = select_bad_process(&points, NULL); + p = select_bad_process(&points, NULL, constraint, mask); if (PTR_ERR(p) == -1UL) return; @@ -610,7 +631,8 @@ void pagefault_out_of_memory(void) panic("out of memory from page fault. panic_on_oom is selected.\n"); read_lock(&tasklist_lock); - __out_of_memory(0, 0); /* unknown gfp_mask and order */ + /* unknown gfp_mask and order */ + __out_of_memory(0, 0, CONSTRAINT_NONE, NULL); read_unlock(&tasklist_lock); /* @@ -626,6 +648,7 @@ void pagefault_out_of_memory(void) * @zonelist: zonelist pointer * @gfp_mask: memory allocation flags * @order: amount of memory being requested as a power of 2 + * @nodemask: nodemask passed to page allocator * * If we run out of memory, we have the choice between either * killing a random task (bad), letting the system crash (worse) @@ -654,24 +677,18 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask, */ constraint = constrained_alloc(zonelist, gfp_mask, nodemask); read_lock(&tasklist_lock); - - switch (constraint) { - case CONSTRAINT_MEMORY_POLICY: - oom_kill_process(current, gfp_mask, order, 0, NULL, - "No available memory (MPOL_BIND)"); - break; - - case CONSTRAINT_NONE: - if (sysctl_panic_on_oom) { + if (unlikely(sysctl_panic_on_oom)) { + /* + * panic_on_oom only affects CONSTRAINT_NONE, the kernel + * should not panic for cpuset or mempolicy induced memory + * failures. + */ + if (constraint == CONSTRAINT_NONE) { dump_header(NULL, gfp_mask, order, NULL); - panic("out of memory. panic_on_oom is selected\n"); + panic("Out of memory: panic_on_oom is enabled\n"); } - /* Fall-through */ - case CONSTRAINT_CPUSET: - __out_of_memory(gfp_mask, order); - break; } - + __out_of_memory(gfp_mask, order, constraint, nodemask); read_unlock(&tasklist_lock); /* -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail190.messagelabs.com (mail190.messagelabs.com [216.82.249.51]) by kanga.kvack.org (Postfix) with ESMTP id BEFEB6B01CA for ; Tue, 1 Jun 2010 03:18:38 -0400 (EDT) Received: from hpaq7.eem.corp.google.com (hpaq7.eem.corp.google.com [172.25.149.7]) by smtp-out.google.com with ESMTP id o517IajV023386 for ; Tue, 1 Jun 2010 00:18:36 -0700 Received: from pwi9 (pwi9.prod.google.com [10.241.219.9]) by hpaq7.eem.corp.google.com with ESMTP id o517IYVF006412 for ; Tue, 1 Jun 2010 00:18:34 -0700 Received: by pwi9 with SMTP id 9so620987pwi.16 for ; Tue, 01 Jun 2010 00:18:33 -0700 (PDT) Date: Tue, 1 Jun 2010 00:18:25 -0700 (PDT) From: David Rientjes Subject: [patch -mm 04/18] oom: extract panic helper function In-Reply-To: Message-ID: References: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-linux-mm@kvack.org To: Andrew Morton Cc: Rik van Riel , Nick Piggin , Oleg Nesterov , KAMEZAWA Hiroyuki , KOSAKI Motohiro , Balbir Singh , linux-mm@kvack.org List-ID: There are various points in the oom killer where the kernel must determine whether to panic or not. It's better to extract this to a helper function to remove all the confusion as to its semantics. There's no functional change with this patch. 
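The decision the helper makes can be restated as a small predicate. This is a userspace sketch only: should_panic() and the OOM_* enumerators are hypothetical names, and the logic mirrors the check_panic_on_oom() body in the patch below (0 never panics, 1 panics only for unconstrained ooms, 2 always panics).

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical mirror of enum oom_constraint. */
enum constraint {
	OOM_NONE,
	OOM_CPUSET,
	OOM_MEMORY_POLICY,
	OOM_MEMCG
};

/* Would an oom with this constraint panic under this sysctl value? */
static bool should_panic(int panic_on_oom, enum constraint c)
{
	/* sketch-only precondition: the sysctl is documented as 0, 1 or 2 */
	assert(panic_on_oom >= 0 && panic_on_oom <= 2);

	if (panic_on_oom == 0)
		return false;
	/* mode 1 is system-wide only: constrained ooms do not panic */
	if (panic_on_oom != 2 && c != OOM_NONE)
		return false;
	return true;
}
```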
Signed-off-by: David Rientjes --- include/linux/oom.h | 1 + mm/oom_kill.c | 50 ++++++++++++++++++++++++++++---------------------- 2 files changed, 29 insertions(+), 22 deletions(-) diff --git a/include/linux/oom.h b/include/linux/oom.h --- a/include/linux/oom.h +++ b/include/linux/oom.h @@ -22,6 +22,7 @@ enum oom_constraint { CONSTRAINT_NONE, CONSTRAINT_CPUSET, CONSTRAINT_MEMORY_POLICY, + CONSTRAINT_MEMCG, }; extern int try_set_zone_oom(struct zonelist *zonelist, gfp_t gfp_flags); diff --git a/mm/oom_kill.c b/mm/oom_kill.c --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -493,17 +493,40 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order, return oom_kill_task(victim); } +/* + * Determines whether the kernel must panic because of the panic_on_oom sysctl. + */ +static void check_panic_on_oom(enum oom_constraint constraint, gfp_t gfp_mask, + int order) +{ + if (likely(!sysctl_panic_on_oom)) + return; + if (sysctl_panic_on_oom != 2) { + /* + * panic_on_oom == 1 only affects CONSTRAINT_NONE, the kernel + * does not panic for cpuset, mempolicy, or memcg allocation + * failures. + */ + if (constraint != CONSTRAINT_NONE) + return; + } + read_lock(&tasklist_lock); + dump_header(NULL, gfp_mask, order, NULL); + read_unlock(&tasklist_lock); + panic("Out of memory: %s panic_on_oom is enabled\n", + sysctl_panic_on_oom == 2 ? "compulsory" : "system-wide"); +} + #ifdef CONFIG_CGROUP_MEM_RES_CTLR void mem_cgroup_out_of_memory(struct mem_cgroup *mem, gfp_t gfp_mask) { unsigned long points = 0; struct task_struct *p; - if (sysctl_panic_on_oom == 2) - panic("out of memory(memcg). panic_on_oom is selected.\n"); + check_panic_on_oom(CONSTRAINT_MEMCG, gfp_mask, 0); read_lock(&tasklist_lock); retry: - p = select_bad_process(&points, mem, CONSTRAINT_NONE, NULL); + p = select_bad_process(&points, mem, CONSTRAINT_MEMCG, NULL); if (!p || PTR_ERR(p) == -1UL) goto out; @@ -627,9 +650,7 @@ void pagefault_out_of_memory(void) /* Got some memory back in the last second. 
*/ return; - if (sysctl_panic_on_oom) - panic("out of memory from page fault. panic_on_oom is selected.\n"); - + check_panic_on_oom(CONSTRAINT_NONE, 0, 0); read_lock(&tasklist_lock); /* unknown gfp_mask and order */ __out_of_memory(0, 0, CONSTRAINT_NONE, NULL); @@ -666,28 +687,13 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask, /* Got some memory back in the last second. */ return; - if (sysctl_panic_on_oom == 2) { - dump_header(NULL, gfp_mask, order, NULL); - panic("out of memory. Compulsory panic_on_oom is selected.\n"); - } - /* * Check if there were limitations on the allocation (only relevant for * NUMA) that may require different handling. */ constraint = constrained_alloc(zonelist, gfp_mask, nodemask); + check_panic_on_oom(constraint, gfp_mask, order); read_lock(&tasklist_lock); - if (unlikely(sysctl_panic_on_oom)) { - /* - * panic_on_oom only affects CONSTRAINT_NONE, the kernel - * should not panic for cpuset or mempolicy induced memory - * failures. - */ - if (constraint == CONSTRAINT_NONE) { - dump_header(NULL, gfp_mask, order, NULL); - panic("Out of memory: panic_on_oom is enabled\n"); - } - } __out_of_memory(gfp_mask, order, constraint, nodemask); read_unlock(&tasklist_lock); -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail190.messagelabs.com (mail190.messagelabs.com [216.82.249.51]) by kanga.kvack.org (Postfix) with ESMTP id 20C2E6B01CA for ; Tue, 1 Jun 2010 03:18:41 -0400 (EDT) Received: from hpaq12.eem.corp.google.com (hpaq12.eem.corp.google.com [172.25.149.12]) by smtp-out.google.com with ESMTP id o517IbQQ025124 for ; Tue, 1 Jun 2010 00:18:38 -0700 Received: from pzk4 (pzk4.prod.google.com [10.243.19.132]) by hpaq12.eem.corp.google.com with ESMTP id o517IZkm005068 for ; Tue, 1 Jun 2010 00:18:36 -0700 Received: by pzk4 with SMTP id 4so1048410pzk.7 for ; Tue, 01 Jun 2010 00:18:35 -0700 (PDT) Date: Tue, 1 Jun 2010 00:18:32 -0700 (PDT) From: David Rientjes Subject: [patch -mm 05/18] oom: remove special handling for pagefault ooms In-Reply-To: Message-ID: References: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-linux-mm@kvack.org To: Andrew Morton Cc: Rik van Riel , Nick Piggin , Oleg Nesterov , KAMEZAWA Hiroyuki , KOSAKI Motohiro , Balbir Singh , linux-mm@kvack.org List-ID: It is possible to remove the special pagefault oom handler by simply oom locking all system zones and then calling directly into out_of_memory(). All populated zones must have ZONE_OOM_LOCKED set, otherwise there is a parallel oom killing in progress that will lead to eventual memory freeing so it's not necessary to needlessly kill another task. The context in which the pagefault is allocating memory is unknown to the oom killer, so this is done on a system-wide level. If a task has already been oom killed and hasn't fully exited yet, this will be a no-op since select_bad_process() recognizes tasks across the system with TIF_MEMDIE set. 
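The all-or-nothing zone locking can be sketched as follows. This is a single-threaded userspace illustration with hypothetical names (NR_ZONES, zone_oom_locked, try_lock_all_zones() and friends); in the patch below both loops run under zone_scan_lock, which this sketch leaves out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NR_ZONES 4	/* hypothetical number of populated zones */

/* One flag per zone, standing in for ZONE_OOM_LOCKED. */
static bool zone_oom_locked[NR_ZONES];

/* Simulate another oom kill in progress on a single zone. */
static void mark_zone_locked(size_t i)
{
	assert(i < NR_ZONES);
	zone_oom_locked[i] = true;
}

/*
 * Lock every zone, but only if none is locked already. Returns false and
 * changes nothing when a parallel oom kill holds any zone.
 */
static bool try_lock_all_zones(void)
{
	size_t i;

	for (i = 0; i < NR_ZONES; i++)
		if (zone_oom_locked[i])
			return false;
	for (i = 0; i < NR_ZONES; i++)
		zone_oom_locked[i] = true;
	return true;
}

/* Clear the flag on every zone so later ooms can proceed. */
static void unlock_all_zones(void)
{
	size_t i;

	for (i = 0; i < NR_ZONES; i++)
		zone_oom_locked[i] = false;
}
```

Checking every zone before setting any flag is what keeps a failed attempt from leaving some zones locked.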
Acked-by: Nick Piggin Signed-off-by: David Rientjes --- mm/oom_kill.c | 86 +++++++++++++++++++++++++++++++++++++------------------- 1 files changed, 57 insertions(+), 29 deletions(-) diff --git a/mm/oom_kill.c b/mm/oom_kill.c --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -603,6 +603,44 @@ void clear_zonelist_oom(struct zonelist *zonelist, gfp_t gfp_mask) } /* + * Try to acquire the oom killer lock for all system zones. Returns zero if a + * parallel oom killing is taking place, otherwise locks all zones and returns + * non-zero. + */ +static int try_set_system_oom(void) +{ + struct zone *zone; + int ret = 1; + + spin_lock(&zone_scan_lock); + for_each_populated_zone(zone) + if (zone_is_oom_locked(zone)) { + ret = 0; + goto out; + } + for_each_populated_zone(zone) + zone_set_flag(zone, ZONE_OOM_LOCKED); +out: + spin_unlock(&zone_scan_lock); + return ret; +} + +/* + * Clears ZONE_OOM_LOCKED for all system zones so that failed allocation + * attempts or page faults may now recall the oom killer, if necessary. + */ +static void clear_system_oom(void) +{ + struct zone *zone; + + spin_lock(&zone_scan_lock); + for_each_populated_zone(zone) + zone_clear_flag(zone, ZONE_OOM_LOCKED); + spin_unlock(&zone_scan_lock); +} + + +/* * Must be called with tasklist_lock held for read. */ static void __out_of_memory(gfp_t gfp_mask, int order, @@ -637,33 +675,6 @@ retry: goto retry; } -/* - * pagefault handler calls into here because it is out of memory but - * doesn't know exactly how or why. - */ -void pagefault_out_of_memory(void) -{ - unsigned long freed = 0; - - blocking_notifier_call_chain(&oom_notify_list, 0, &freed); - if (freed > 0) - /* Got some memory back in the last second. */ - return; - - check_panic_on_oom(CONSTRAINT_NONE, 0, 0); - read_lock(&tasklist_lock); - /* unknown gfp_mask and order */ - __out_of_memory(0, 0, CONSTRAINT_NONE, NULL); - read_unlock(&tasklist_lock); - - /* - * Give "p" a good chance of killing itself before we - * retry to allocate memory. 
- */ - if (!test_thread_flag(TIF_MEMDIE)) - schedule_timeout_uninterruptible(1); -} - /** * out_of_memory - kill the "best" process when we run out of memory * @zonelist: zonelist pointer @@ -680,7 +691,7 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask, int order, nodemask_t *nodemask) { unsigned long freed = 0; - enum oom_constraint constraint; + enum oom_constraint constraint = CONSTRAINT_NONE; blocking_notifier_call_chain(&oom_notify_list, 0, &freed); if (freed > 0) @@ -691,7 +702,8 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask, * Check if there were limitations on the allocation (only relevant for * NUMA) that may require different handling. */ - constraint = constrained_alloc(zonelist, gfp_mask, nodemask); + if (zonelist) + constraint = constrained_alloc(zonelist, gfp_mask, nodemask); check_panic_on_oom(constraint, gfp_mask, order); read_lock(&tasklist_lock); __out_of_memory(gfp_mask, order, constraint, nodemask); @@ -704,3 +716,19 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask, if (!test_thread_flag(TIF_MEMDIE)) schedule_timeout_uninterruptible(1); } + +/* + * The pagefault handler calls here because it is out of memory, so kill a + * memory-hogging task. If a populated zone has ZONE_OOM_LOCKED set, a parallel + * oom killing is already in progress so do nothing. If a task is found with + * TIF_MEMDIE set, it has been killed so do nothing and allow it to exit. + */ +void pagefault_out_of_memory(void) +{ + if (try_set_system_oom()) { + out_of_memory(NULL, 0, 0, NULL); + clear_system_oom(); + } + if (!test_thread_flag(TIF_MEMDIE)) + schedule_timeout_uninterruptible(1); +} -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail202.messagelabs.com (mail202.messagelabs.com [216.82.254.227]) by kanga.kvack.org (Postfix) with ESMTP id 4ED0F6B01CD for ; Tue, 1 Jun 2010 03:18:44 -0400 (EDT) Received: from kpbe14.cbf.corp.google.com (kpbe14.cbf.corp.google.com [172.25.105.78]) by smtp-out.google.com with ESMTP id o517IetS018690 for ; Tue, 1 Jun 2010 00:18:40 -0700 Received: from pxi17 (pxi17.prod.google.com [10.243.27.17]) by kpbe14.cbf.corp.google.com with ESMTP id o517IHwi008932 for ; Tue, 1 Jun 2010 00:18:38 -0700 Received: by pxi17 with SMTP id 17so2071393pxi.25 for ; Tue, 01 Jun 2010 00:18:38 -0700 (PDT) Date: Tue, 1 Jun 2010 00:18:35 -0700 (PDT) From: David Rientjes Subject: [patch -mm 06/18] oom: move sysctl declarations to oom.h In-Reply-To: Message-ID: References: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-linux-mm@kvack.org To: Andrew Morton Cc: Rik van Riel , Nick Piggin , Oleg Nesterov , KAMEZAWA Hiroyuki , KOSAKI Motohiro , Balbir Singh , linux-mm@kvack.org List-ID: The three oom killer sysctl variables (sysctl_oom_dump_tasks, sysctl_oom_kill_allocating_task, and sysctl_panic_on_oom) are better declared in include/linux/oom.h rather than kernel/sysctl.c. 
Signed-off-by: David Rientjes --- include/linux/oom.h | 5 +++++ kernel/sysctl.c | 4 +--- 2 files changed, 6 insertions(+), 3 deletions(-) diff --git a/include/linux/oom.h b/include/linux/oom.h --- a/include/linux/oom.h +++ b/include/linux/oom.h @@ -44,5 +44,10 @@ static inline void oom_killer_enable(void) { oom_killer_disabled = false; } + +/* sysctls */ +extern int sysctl_oom_dump_tasks; +extern int sysctl_oom_kill_allocating_task; +extern int sysctl_panic_on_oom; #endif /* __KERNEL__*/ #endif /* _INCLUDE_LINUX_OOM_H */ diff --git a/kernel/sysctl.c b/kernel/sysctl.c --- a/kernel/sysctl.c +++ b/kernel/sysctl.c @@ -55,6 +55,7 @@ #include #include #include +#include #include #include @@ -87,9 +88,6 @@ /* External variables not in a header file. */ extern int sysctl_overcommit_memory; extern int sysctl_overcommit_ratio; -extern int sysctl_panic_on_oom; -extern int sysctl_oom_kill_allocating_task; -extern int sysctl_oom_dump_tasks; extern int max_threads; extern int core_uses_pid; extern int suid_dumpable; -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail191.messagelabs.com (mail191.messagelabs.com [216.82.242.19]) by kanga.kvack.org (Postfix) with ESMTP id 5D9A06B01CD for ; Tue, 1 Jun 2010 03:18:46 -0400 (EDT) Received: from hpaq5.eem.corp.google.com (hpaq5.eem.corp.google.com [172.25.149.5]) by smtp-out.google.com with ESMTP id o517IiFJ022581 for ; Tue, 1 Jun 2010 00:18:45 -0700 Received: from pxi7 (pxi7.prod.google.com [10.243.27.7]) by hpaq5.eem.corp.google.com with ESMTP id o517Ig3g025706 for ; Tue, 1 Jun 2010 00:18:43 -0700 Received: by pxi7 with SMTP id 7so2303628pxi.13 for ; Tue, 01 Jun 2010 00:18:42 -0700 (PDT) Date: Tue, 1 Jun 2010 00:18:39 -0700 (PDT) From: David Rientjes Subject: [patch -mm 07/18] oom: enable oom tasklist dump by default In-Reply-To: Message-ID: References: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-linux-mm@kvack.org To: Andrew Morton Cc: Rik van Riel , Nick Piggin , Oleg Nesterov , KAMEZAWA Hiroyuki , KOSAKI Motohiro , Balbir Singh , linux-mm@kvack.org List-ID: The oom killer tasklist dump, enabled with the oom_dump_tasks sysctl, is very helpful information in diagnosing why a user's task has been killed. It emits useful information such as each eligible thread's memory usage that can determine why the system is oom, so it should be enabled by default. Signed-off-by: David Rientjes --- Documentation/sysctl/vm.txt | 2 +- mm/oom_kill.c | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt --- a/Documentation/sysctl/vm.txt +++ b/Documentation/sysctl/vm.txt @@ -511,7 +511,7 @@ information may not be desired. If this is set to non-zero, this information is shown whenever the OOM killer actually kills a memory-hogging task. -The default value is 0. +The default value is 1 (enabled). 
============================================================== diff --git a/mm/oom_kill.c b/mm/oom_kill.c --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -32,7 +32,7 @@ int sysctl_panic_on_oom; int sysctl_oom_kill_allocating_task; -int sysctl_oom_dump_tasks; +int sysctl_oom_dump_tasks = 1; static DEFINE_SPINLOCK(zone_scan_lock); /* #define DEBUG */ -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail190.messagelabs.com (mail190.messagelabs.com [216.82.249.51]) by kanga.kvack.org (Postfix) with ESMTP id AF6B86B01D2 for ; Tue, 1 Jun 2010 03:18:53 -0400 (EDT) Received: from kpbe15.cbf.corp.google.com (kpbe15.cbf.corp.google.com [172.25.105.79]) by smtp-out.google.com with ESMTP id o517Ipcw025600 for ; Tue, 1 Jun 2010 00:18:51 -0700 Received: from pwi6 (pwi6.prod.google.com [10.241.219.6]) by kpbe15.cbf.corp.google.com with ESMTP id o517InRX031830 for ; Tue, 1 Jun 2010 00:18:50 -0700 Received: by pwi6 with SMTP id 6so5161086pwi.0 for ; Tue, 01 Jun 2010 00:18:49 -0700 (PDT) Date: Tue, 1 Jun 2010 00:18:47 -0700 (PDT) From: David Rientjes Subject: [patch -mm 09/18] oom: add forkbomb penalty to badness heuristic In-Reply-To: Message-ID: References: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-linux-mm@kvack.org To: Andrew Morton Cc: Rik van Riel , Nick Piggin , Oleg Nesterov , KAMEZAWA Hiroyuki , KOSAKI Motohiro , Balbir Singh , linux-mm@kvack.org List-ID: Add a forkbomb penalty for processes that fork an excessively large number of children to penalize that group of tasks and not others. A threshold is configurable from userspace to determine how many first- generation execve children (those with their own address spaces) a task may have before it is considered a forkbomb. 
This can be tuned by altering the value in /proc/sys/vm/oom_forkbomb_thres, which defaults to 1000. When a task has more than 1000 first-generation children with different address spaces than itself, a penalty of

	(average rss of children) * (# of 1st generation execve children)
	-----------------------------------------------------------------
			oom_forkbomb_thres

is assessed. So, for example, using the default oom_forkbomb_thres of 1000, the penalty is twice the average rss of all its execve children if there are 2000 such tasks. A child counts toward the threshold only if its total runtime is less than one second; for 1000 such tasks to exist, the parent process must be forking at an extremely high rate, either erroneously or maliciously.

Even though a particular task may be designated a forkbomb and selected as the victim, the oom killer will still kill the 1st generation execve child with the highest badness() score in its place. This avoids killing important servers or system daemons. When a web server forks a very large number of threads for client connections, for example, it is much better to kill one of those threads than to kill the server and make it unresponsive.

[oleg@redhat.com: optimize task_lock when iterating children]
Signed-off-by: David Rientjes --- Documentation/filesystems/proc.txt | 7 +++- Documentation/sysctl/vm.txt | 21 ++++++++++++ include/linux/oom.h | 3 ++ kernel/sysctl.c | 8 +++++ mm/oom_kill.c | 60 ++++++++++++++++++++++++++++++++++++ 5 files changed, 97 insertions(+), 2 deletions(-) diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt --- a/Documentation/filesystems/proc.txt +++ b/Documentation/filesystems/proc.txt @@ -1248,8 +1248,11 @@ may allocate from based on an estimation of its current memory and swap use. For example, if a task is using all allowed memory, its badness score will be 1000. If it is using half of its allowed memory, its score will be 500. 
-There is an additional factor included in the badness score: root -processes are given 3% extra memory over other tasks. +There are a couple of additional factors included in the badness score: root +processes are given 3% extra memory over other tasks, and tasks which forkbomb +an excessive number of child processes are penalized by their average size. +The number of child processes considered to be a forkbomb is configurable +via /proc/sys/vm/oom_forkbomb_thres (see Documentation/sysctl/vm.txt). The amount of "allowed" memory depends on the context in which the oom killer was called. If it is due to the memory assigned to the allocating task's cpuset diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt --- a/Documentation/sysctl/vm.txt +++ b/Documentation/sysctl/vm.txt @@ -46,6 +46,7 @@ Currently, these files are in /proc/sys/vm: - nr_trim_pages (only if CONFIG_MMU=n) - numa_zonelist_order - oom_dump_tasks +- oom_forkbomb_thres - oom_kill_allocating_task - overcommit_memory - overcommit_ratio @@ -515,6 +516,26 @@ The default value is 1 (enabled). ============================================================== +oom_forkbomb_thres + +This value defines how many children with a separate address space a specific +task may have before being considered as a possible forkbomb. Tasks with more +children not sharing the same address space as the parent will be penalized by a +quantity of memory equaling + + (average rss of execve children) * (# of 1st generation execve children) + ------------------------------------------------------------------------ + oom_forkbomb_thres + +in the oom killer's badness heuristic. Such tasks may be protected with a lower +oom_adj value (see Documentation/filesystems/proc.txt) if necessary. + +A value of 0 will disable forkbomb detection. + +The default value is 1000. 
+ +============================================================== + oom_kill_allocating_task This enables or disables killing the OOM-triggering task in diff --git a/include/linux/oom.h b/include/linux/oom.h --- a/include/linux/oom.h +++ b/include/linux/oom.h @@ -16,6 +16,9 @@ #define OOM_SCORE_ADJ_MIN (-1000) #define OOM_SCORE_ADJ_MAX 1000 +/* See Documentation/sysctl/vm.txt */ +#define DEFAULT_OOM_FORKBOMB_THRES 1000 + #ifdef __KERNEL__ #include diff --git a/kernel/sysctl.c b/kernel/sysctl.c --- a/kernel/sysctl.c +++ b/kernel/sysctl.c @@ -1001,6 +1001,14 @@ static struct ctl_table vm_table[] = { .proc_handler = proc_dointvec, }, { + .procname = "oom_forkbomb_thres", + .data = &sysctl_oom_forkbomb_thres, + .maxlen = sizeof(sysctl_oom_forkbomb_thres), + .mode = 0644, + .proc_handler = proc_dointvec_minmax, + .extra1 = &zero, + }, + { .procname = "overcommit_ratio", .data = &sysctl_overcommit_ratio, .maxlen = sizeof(sysctl_overcommit_ratio), diff --git a/mm/oom_kill.c b/mm/oom_kill.c --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -35,6 +35,7 @@ int sysctl_panic_on_oom; int sysctl_oom_kill_allocating_task; int sysctl_oom_dump_tasks = 1; +int sysctl_oom_forkbomb_thres = DEFAULT_OOM_FORKBOMB_THRES; static DEFINE_SPINLOCK(zone_scan_lock); /* @@ -94,6 +95,64 @@ static void check_panic_on_oom(enum oom_constraint constraint, gfp_t gfp_mask, return false; } +/* + * Tasks that fork a very large number of children with seperate address spaces + * may be the result of a bug, user error, malicious applications, or even those + * with a very legitimate purpose such as a webserver. The oom killer assesses + * a penalty equaling + * + * (average rss of children) * (# of 1st generation execve children) + * ----------------------------------------------------------------- + * sysctl_oom_forkbomb_thres + * + * for such tasks to target the parent. oom_kill_process() will attempt to + * first kill a child, so there's no risk of killing an important system daemon + * via this method. 
A web server, for example, may fork a very large number of + * threads to respond to client connections; it's much better to kill a child + * than to kill the parent, making the server unresponsive. The goal here is + * to give the user a chance to recover from the error rather than deplete all + * memory such that the system is unusable, it's not meant to effect a forkbomb + * policy. + */ +static unsigned long oom_forkbomb_penalty(struct task_struct *tsk) +{ + struct task_struct *child; + unsigned long child_rss = 0; + int forkcount = 0; + + if (!sysctl_oom_forkbomb_thres) + return 0; + list_for_each_entry(child, &tsk->children, sibling) { + struct task_cputime task_time; + unsigned long runtime; + unsigned long rss; + + task_lock(child); + if (!child->mm || child->mm == tsk->mm) { + task_unlock(child); + continue; + } + rss = get_mm_rss(child->mm); + task_unlock(child); + + thread_group_cputime(child, &task_time); + runtime = cputime_to_jiffies(task_time.utime) + + cputime_to_jiffies(task_time.stime); + /* + * Only threads that have run for less than a second are + * considered toward the forkbomb penalty, these threads rarely + * get to execute at all in such cases anyway. + */ + if (runtime < HZ) { + child_rss += rss; + forkcount++; + } + } + + return forkcount > sysctl_oom_forkbomb_thres ? + (child_rss / sysctl_oom_forkbomb_thres) : 0; +} + /** * oom_badness - heuristic function to determine which candidate task to kill * @p: task struct of which task we should calculate @@ -143,6 +202,7 @@ unsigned int oom_badness(struct task_struct *p, unsigned long totalpages) points = (get_mm_rss(mm) + get_mm_counter(mm, MM_SWAPENTS)) * 1000 / totalpages; task_unlock(p); + points += oom_forkbomb_penalty(p); /* * Root processes get 3% bonus, just like the __vm_enough_memory() used -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail144.messagelabs.com (mail144.messagelabs.com [216.82.254.51]) by kanga.kvack.org (Postfix) with ESMTP id C03AF6B01D4 for ; Tue, 1 Jun 2010 03:18:53 -0400 (EDT) Received: from kpbe16.cbf.corp.google.com (kpbe16.cbf.corp.google.com [172.25.105.80]) by smtp-out.google.com with ESMTP id o517Intg023799 for ; Tue, 1 Jun 2010 00:18:49 -0700 Received: from pva4 (pva4.prod.google.com [10.241.209.4]) by kpbe16.cbf.corp.google.com with ESMTP id o517Ilmg002144 for ; Tue, 1 Jun 2010 00:18:48 -0700 Received: by pva4 with SMTP id 4so602952pva.30 for ; Tue, 01 Jun 2010 00:18:47 -0700 (PDT) Date: Tue, 1 Jun 2010 00:18:43 -0700 (PDT) From: David Rientjes Subject: [patch -mm 08/18] oom: badness heuristic rewrite In-Reply-To: Message-ID: References: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-linux-mm@kvack.org To: Andrew Morton Cc: Rik van Riel , Nick Piggin , Oleg Nesterov , KAMEZAWA Hiroyuki , KOSAKI Motohiro , Balbir Singh , linux-mm@kvack.org List-ID:

This is a complete rewrite of the oom killer's badness() heuristic, which is used to determine which task to kill in oom conditions. The goal is to make it as simple and predictable as possible so the results are better understood and we end up killing the task that frees the most memory while still respecting the fine-tuning from userspace.

The baseline for the heuristic is the proportion of "allowable" memory that each task is currently using in RAM plus swap. "Allowable," in this sense, means the system-wide resources for unconstrained oom conditions, the set of mempolicy nodes, the mems attached to current's cpuset, or a memory controller's limit.
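To make the baseline concrete, here is a minimal userspace sketch in C of the proportion described above. It is an illustration, not the kernel code: the name baseline_points() and its three parameters are hypothetical stand-ins for a task's resident pages, its swap entries, and the amount of allowable memory. The scaling to a 0..1000 range and the guard against a zero limit mirror what the oom_badness() function in this patch does.

```c
#include <stdint.h>

/*
 * Illustrative sketch only (assumed helper, not part of the patch).
 * rss_pages:     pages the task has resident in RAM
 * swap_pages:    pages the task has in swap
 * allowed_pages: "allowable" memory for this oom context, in pages
 */
static unsigned int baseline_points(uint64_t rss_pages, uint64_t swap_pages,
				    uint64_t allowed_pages)
{
	uint64_t points;

	/* A memory controller limit may be 0; avoid dividing by zero. */
	if (allowed_pages == 0)
		allowed_pages = 1;

	/* Proportion of allowable memory in use, scaled to 0..1000. */
	points = (rss_pages + swap_pages) * 1000 / allowed_pages;

	/* Clamp to the top of the scale. */
	return points > 1000 ? 1000 : (unsigned int)points;
}
```

Under these assumptions, a task with 256 pages resident and 256 pages in swap, measured against 1024 allowable pages, scores 500.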
The proportion is given on a scale of 0 (never kill) to 1000 (always kill), roughly meaning that a task with a badness() score of 500 consumes approximately 50% of allowable memory, whether resident in RAM or in swap space.

The proportion is always relative to the amount of "allowable" memory and not the total amount of RAM systemwide so that mempolicies and cpusets may operate in isolation; they do not need to know the true size of the machine on which they are running if they are bound to a specific set of nodes or mems, respectively.

Root tasks are given 3% extra memory, just as __vm_enough_memory() provides in LSMs. In the event of two tasks consuming similar amounts of memory, it is generally better to save root's task.

Because of the change in the badness() heuristic's baseline, it is also necessary to introduce a new user interface to tune it. It's not possible to redefine the meaning of /proc/pid/oom_adj with a new scale since the ABI cannot be changed, for backward compatibility. Instead, a new tunable, /proc/pid/oom_score_adj, is added that ranges from -1000 to +1000. It may be used to polarize the heuristic such that certain tasks are never considered for oom kill while others may always be considered. The value is added directly into the badness() score, so a value of -500, for example, means to discount 50% of its memory consumption in comparison to other tasks either on the system, bound to the mempolicy, in the cpuset, or sharing the same memory controller.

/proc/pid/oom_adj is changed so that its meaning is rescaled into the units used by /proc/pid/oom_score_adj, and vice versa. Changing one of these per-task tunables will rescale the value of the other to an equivalent meaning. Although /proc/pid/oom_adj was originally defined as a bitshift on the badness score, it now shares the same linear growth as /proc/pid/oom_score_adj but with different granularity.
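The rescaling between the two tunables can be sketched in plain C as below. The constants are the ones this patch defines in include/linux/oom.h, and the arithmetic follows the writes to oom_adj and oom_score_adj in fs/proc/base.c; the helper names oom_adj_to_score_adj() and score_adj_to_oom_adj() are hypothetical and exist only for this illustration.

```c
/* Values as defined by this patch in include/linux/oom.h. */
#define OOM_DISABLE		(-17)
#define OOM_ADJUST_MAX		15
#define OOM_SCORE_ADJ_MIN	(-1000)
#define OOM_SCORE_ADJ_MAX	1000

/* Hypothetical helper: value oom_score_adj takes after a write to oom_adj. */
static int oom_adj_to_score_adj(int oom_adj)
{
	/* Special case so that the maximum is always attainable. */
	if (oom_adj == OOM_ADJUST_MAX)
		return OOM_SCORE_ADJ_MAX;
	return (oom_adj * OOM_SCORE_ADJ_MAX) / -OOM_DISABLE;
}

/* Hypothetical helper: value oom_adj takes after a write to oom_score_adj. */
static int score_adj_to_oom_adj(int score_adj)
{
	/* Special case so that OOM_DISABLE is always attainable. */
	if (score_adj == OOM_SCORE_ADJ_MIN)
		return OOM_DISABLE;
	return (score_adj * OOM_ADJUST_MAX) / OOM_SCORE_ADJ_MAX;
}
```

With this arithmetic, writing 8 to oom_adj would yield an oom_score_adj of 470, and writing -500 to oom_score_adj would yield an oom_adj of -7; integer division truncates, which is where the difference in granularity shows up.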
This is required so the ABI is not broken with userspace applications and allows oom_adj to be deprecated for future removal. Signed-off-by: David Rientjes --- Documentation/filesystems/proc.txt | 94 ++++++++----- fs/proc/base.c | 99 +++++++++++++- include/linux/memcontrol.h | 8 + include/linux/oom.h | 14 ++- include/linux/sched.h | 3 +- kernel/fork.c | 1 + mm/memcontrol.c | 18 +++ mm/oom_kill.c | 267 ++++++++++++++++-------------------- 8 files changed, 311 insertions(+), 193 deletions(-) diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt --- a/Documentation/filesystems/proc.txt +++ b/Documentation/filesystems/proc.txt @@ -33,7 +33,8 @@ Table of Contents 2 Modifying System Parameters 3 Per-Process Parameters - 3.1 /proc//oom_adj - Adjust the oom-killer score + 3.1 /proc//oom_adj & /proc//oom_score_adj - Adjust the oom-killer + score 3.2 /proc//oom_score - Display current oom-killer score 3.3 /proc//io - Display the IO accounting fields 3.4 /proc//coredump_filter - Core dump filtering settings @@ -1234,42 +1235,61 @@ of the kernel. CHAPTER 3: PER-PROCESS PARAMETERS ------------------------------------------------------------------------------ -3.1 /proc//oom_adj - Adjust the oom-killer score ------------------------------------------------------- - -This file can be used to adjust the score used to select which processes -should be killed in an out-of-memory situation. Giving it a high score will -increase the likelihood of this process being killed by the oom-killer. Valid -values are in the range -16 to +15, plus the special value -17, which disables -oom-killing altogether for this process. - -The process to be killed in an out-of-memory situation is selected among all others -based on its badness score. This value equals the original memory size of the process -and is then updated according to its CPU time (utime + stime) and the -run time (uptime - start time). The longer it runs the smaller is the score. 
-Badness score is divided by the square root of the CPU time and then by -the double square root of the run time. - -Swapped out tasks are killed first. Half of each child's memory size is added to -the parent's score if they do not share the same memory. Thus forking servers -are the prime candidates to be killed. Having only one 'hungry' child will make -parent less preferable than the child. - -/proc//oom_score shows process' current badness score. - -The following heuristics are then applied: - * if the task was reniced, its score doubles - * superuser or direct hardware access tasks (CAP_SYS_ADMIN, CAP_SYS_RESOURCE - or CAP_SYS_RAWIO) have their score divided by 4 - * if oom condition happened in one cpuset and checked process does not belong - to it, its score is divided by 8 - * the resulting score is multiplied by two to the power of oom_adj, i.e. - points <<= oom_adj when it is positive and - points >>= -(oom_adj) otherwise - -The task with the highest badness score is then selected and its children -are killed, process itself will be killed in an OOM situation when it does -not have children or some of them disabled oom like described above. +3.1 /proc//oom_adj & /proc//oom_score_adj- Adjust the oom-killer score +-------------------------------------------------------------------------------- + +These file can be used to adjust the badness heuristic used to select which +process gets killed in out of memory conditions. + +The badness heuristic assigns a value to each candidate task ranging from 0 +(never kill) to 1000 (always kill) to determine which process is targeted. The +units are roughly a proportion along that range of allowed memory the process +may allocate from based on an estimation of its current memory and swap use. +For example, if a task is using all allowed memory, its badness score will be +1000. If it is using half of its allowed memory, its score will be 500. 
+ +There is an additional factor included in the badness score: root +processes are given 3% extra memory over other tasks. + +The amount of "allowed" memory depends on the context in which the oom killer +was called. If it is due to the memory assigned to the allocating task's cpuset +being exhausted, the allowed memory represents the set of mems assigned to that +cpuset. If it is due to a mempolicy's node(s) being exhausted, the allowed +memory represents the set of mempolicy nodes. If it is due to a memory +limit (or swap limit) being reached, the allowed memory is that configured +limit. Finally, if it is due to the entire system being out of memory, the +allowed memory represents all allocatable resources. + +The value of /proc//oom_score_adj is added to the badness score before it +is used to determine which task to kill. Acceptable values range from -1000 +(OOM_SCORE_ADJ_MIN) to +1000 (OOM_SCORE_ADJ_MAX). This allows userspace to +polarize the preference for oom killing either by always preferring a certain +task or completely disabling it. The lowest possible value, -1000, is +equivalent to disabling oom killing entirely for that task since it will always +report a badness score of 0. + +Consequently, it is very simple for userspace to define the amount of memory to +consider for each task. Setting a /proc//oom_score_adj value of +500, for +example, is roughly equivalent to allowing the remainder of tasks sharing the +same system, cpuset, mempolicy, or memory controller resources to use at least +50% more memory. A value of -500, on the other hand, would be roughly +equivalent to discounting 50% of the task's allowed memory from being considered +as scoring against the task. + +For backwards compatibility with previous kernels, /proc//oom_adj may also +be used to tune the badness score. Its acceptable values range from -16 +(OOM_ADJUST_MIN) to +15 (OOM_ADJUST_MAX) and a special value of -17 +(OOM_DISABLE) to disable oom killing entirely for that task. 
Its value is +scaled linearly with /proc//oom_score_adj. + +Writing to /proc//oom_score_adj or /proc//oom_adj will change the +other with its scaled value. + +Caveat: when a parent task is selected, the oom killer will sacrifice any first +generation children with seperate address spaces instead, if possible. This +avoids servers and important system daemons from being killed and loses the +minimal amount of work. + 3.2 /proc//oom_score - Display current oom-killer score ------------------------------------------------------------- diff --git a/fs/proc/base.c b/fs/proc/base.c --- a/fs/proc/base.c +++ b/fs/proc/base.c @@ -63,6 +63,7 @@ #include #include #include +#include #include #include #include @@ -428,16 +429,18 @@ static const struct file_operations proc_lstats_operations = { #endif /* The badness from the OOM killer */ -unsigned long badness(struct task_struct *p, unsigned long uptime); static int proc_oom_score(struct task_struct *task, char *buffer) { unsigned long points = 0; - struct timespec uptime; - do_posix_clock_monotonic_gettime(&uptime); read_lock(&tasklist_lock); if (pid_alive(task)) - points = badness(task, uptime.tv_sec); + points = oom_badness(task->group_leader, + global_page_state(NR_INACTIVE_ANON) + + global_page_state(NR_ACTIVE_ANON) + + global_page_state(NR_INACTIVE_FILE) + + global_page_state(NR_ACTIVE_FILE) + + total_swap_pages); read_unlock(&tasklist_lock); return sprintf(buffer, "%lu\n", points); } @@ -1042,7 +1045,15 @@ static ssize_t oom_adjust_write(struct file *file, const char __user *buf, } task->signal->oom_adj = oom_adjust; - + /* + * Scale /proc/pid/oom_score_adj appropriately ensuring that a maximum + * value is always attainable. 
+ */ + if (task->signal->oom_adj == OOM_ADJUST_MAX) + task->signal->oom_score_adj = OOM_SCORE_ADJ_MAX; + else + task->signal->oom_score_adj = (oom_adjust * OOM_SCORE_ADJ_MAX) / + -OOM_DISABLE; unlock_task_sighand(task, &flags); put_task_struct(task); @@ -1055,6 +1066,82 @@ static const struct file_operations proc_oom_adjust_operations = { .llseek = generic_file_llseek, }; +static ssize_t oom_score_adj_read(struct file *file, char __user *buf, + size_t count, loff_t *ppos) +{ + struct task_struct *task = get_proc_task(file->f_path.dentry->d_inode); + char buffer[PROC_NUMBUF]; + int oom_score_adj = OOM_SCORE_ADJ_MIN; + unsigned long flags; + size_t len; + + if (!task) + return -ESRCH; + if (lock_task_sighand(task, &flags)) { + oom_score_adj = task->signal->oom_score_adj; + unlock_task_sighand(task, &flags); + } + put_task_struct(task); + len = snprintf(buffer, sizeof(buffer), "%d\n", oom_score_adj); + return simple_read_from_buffer(buf, count, ppos, buffer, len); +} + +static ssize_t oom_score_adj_write(struct file *file, const char __user *buf, + size_t count, loff_t *ppos) +{ + struct task_struct *task; + char buffer[PROC_NUMBUF]; + unsigned long flags; + long oom_score_adj; + int err; + + memset(buffer, 0, sizeof(buffer)); + if (count > sizeof(buffer) - 1) + count = sizeof(buffer) - 1; + if (copy_from_user(buffer, buf, count)) + return -EFAULT; + + err = strict_strtol(strstrip(buffer), 0, &oom_score_adj); + if (err) + return -EINVAL; + if (oom_score_adj < OOM_SCORE_ADJ_MIN || + oom_score_adj > OOM_SCORE_ADJ_MAX) + return -EINVAL; + + task = get_proc_task(file->f_path.dentry->d_inode); + if (!task) + return -ESRCH; + if (!lock_task_sighand(task, &flags)) { + put_task_struct(task); + return -ESRCH; + } + if (oom_score_adj < task->signal->oom_score_adj && + !capable(CAP_SYS_RESOURCE)) { + unlock_task_sighand(task, &flags); + put_task_struct(task); + return -EACCES; + } + + task->signal->oom_score_adj = oom_score_adj; + /* + * Scale /proc/pid/oom_adj appropriately 
ensuring that OOM_DISABLE is + * always attainable. + */ + if (task->signal->oom_score_adj == OOM_SCORE_ADJ_MIN) + task->signal->oom_adj = OOM_DISABLE; + else + task->signal->oom_adj = (oom_score_adj * OOM_ADJUST_MAX) / + OOM_SCORE_ADJ_MAX; + unlock_task_sighand(task, &flags); + put_task_struct(task); + return count; +} + +static const struct file_operations proc_oom_score_adj_operations = { + .read = oom_score_adj_read, + .write = oom_score_adj_write, +}; + #ifdef CONFIG_AUDITSYSCALL #define TMPBUFLEN 21 static ssize_t proc_loginuid_read(struct file * file, char __user * buf, @@ -2627,6 +2714,7 @@ static const struct pid_entry tgid_base_stuff[] = { #endif INF("oom_score", S_IRUGO, proc_oom_score), REG("oom_adj", S_IRUGO|S_IWUSR, proc_oom_adjust_operations), + REG("oom_score_adj", S_IRUGO|S_IWUSR, proc_oom_score_adj_operations), #ifdef CONFIG_AUDITSYSCALL REG("loginuid", S_IWUSR|S_IRUGO, proc_loginuid_operations), REG("sessionid", S_IRUGO, proc_sessionid_operations), @@ -2961,6 +3049,7 @@ static const struct pid_entry tid_base_stuff[] = { #endif INF("oom_score", S_IRUGO, proc_oom_score), REG("oom_adj", S_IRUGO|S_IWUSR, proc_oom_adjust_operations), + REG("oom_score_adj", S_IRUGO|S_IWUSR, proc_oom_score_adj_operations), #ifdef CONFIG_AUDITSYSCALL REG("loginuid", S_IWUSR|S_IRUGO, proc_loginuid_operations), REG("sessionid", S_IRUSR, proc_sessionid_operations), diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -130,6 +130,8 @@ void mem_cgroup_update_file_mapped(struct page *page, int val); unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order, gfp_t gfp_mask, int nid, int zid); +u64 mem_cgroup_get_limit(struct mem_cgroup *mem); + #else /* CONFIG_CGROUP_MEM_RES_CTLR */ struct mem_cgroup; @@ -309,6 +311,12 @@ unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order, return 0; } +static inline +u64 mem_cgroup_get_limit(struct mem_cgroup *mem) +{ + 
return 0; +} + #endif /* CONFIG_CGROUP_MEM_CONT */ #endif /* _LINUX_MEMCONTROL_H */ diff --git a/include/linux/oom.h b/include/linux/oom.h --- a/include/linux/oom.h +++ b/include/linux/oom.h @@ -1,14 +1,24 @@ #ifndef __INCLUDE_LINUX_OOM_H #define __INCLUDE_LINUX_OOM_H -/* /proc//oom_adj set to -17 protects from the oom-killer */ +/* + * /proc//oom_adj set to -17 protects from the oom-killer + */ #define OOM_DISABLE (-17) /* inclusive */ #define OOM_ADJUST_MIN (-16) #define OOM_ADJUST_MAX 15 +/* + * /proc//oom_score_adj set to OOM_SCORE_ADJ_MIN disables oom killing for + * pid. + */ +#define OOM_SCORE_ADJ_MIN (-1000) +#define OOM_SCORE_ADJ_MAX 1000 + #ifdef __KERNEL__ +#include #include #include @@ -25,6 +35,8 @@ enum oom_constraint { CONSTRAINT_MEMCG, }; +extern unsigned int oom_badness(struct task_struct *p, + unsigned long totalpages); extern int try_set_zone_oom(struct zonelist *zonelist, gfp_t gfp_flags); extern void clear_zonelist_oom(struct zonelist *zonelist, gfp_t gfp_flags); diff --git a/include/linux/sched.h b/include/linux/sched.h --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -629,7 +629,8 @@ struct signal_struct { struct tty_audit_buf *tty_audit_buf; #endif - int oom_adj; /* OOM kill score adjustment (bit shift) */ + int oom_adj; /* OOM kill score adjustment (bit shift) */ + int oom_score_adj; /* OOM kill score adjustment */ }; /* Context switch must be unlocked if interrupts are to be enabled */ diff --git a/kernel/fork.c b/kernel/fork.c --- a/kernel/fork.c +++ b/kernel/fork.c @@ -899,6 +899,7 @@ static int copy_signal(unsigned long clone_flags, struct task_struct *tsk) tty_audit_fork(sig); sig->oom_adj = current->signal->oom_adj; + sig->oom_score_adj = current->signal->oom_score_adj; return 0; } diff --git a/mm/memcontrol.c b/mm/memcontrol.c --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -1158,6 +1158,24 @@ static int mem_cgroup_count_children(struct mem_cgroup *mem) } /* + * Return the memory (and swap, if configured) limit for a memcg. 
+ */ +u64 mem_cgroup_get_limit(struct mem_cgroup *memcg) +{ + u64 limit; + u64 memsw; + + limit = res_counter_read_u64(&memcg->res, RES_LIMIT) + + total_swap_pages; + memsw = res_counter_read_u64(&memcg->memsw, RES_LIMIT); + /* + * If memsw is finite and limits the amount of swap space available + * to this memcg, return that limit. + */ + return min(limit, memsw); +} + +/* * Visit the first child (need not be the first child as per the ordering * of the cgroup list, since we track last_scanned_child) of @mem and use * that to reclaim free pages from. diff --git a/mm/oom_kill.c b/mm/oom_kill.c --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -4,6 +4,8 @@ * Copyright (C) 1998,2000 Rik van Riel * Thanks go out to Claus Fischer for some serious inspiration and * for goading me into coding this file... + * Copyright (C) 2010 Google, Inc + * Rewritten by David Rientjes * * The routines in this file are used to kill a process when * we're seriously out of memory. This gets called from __alloc_pages() @@ -34,7 +36,6 @@ int sysctl_panic_on_oom; int sysctl_oom_kill_allocating_task; int sysctl_oom_dump_tasks = 1; static DEFINE_SPINLOCK(zone_scan_lock); -/* #define DEBUG */ /* * Do all threads of the target process overlap our allowed nodes? @@ -94,37 +95,33 @@ static void check_panic_on_oom(enum oom_constraint constraint, gfp_t gfp_mask, } /** - * badness - calculate a numeric value for how bad this task has been + * oom_badness - heuristic function to determine which candidate task to kill * @p: task struct of which task we should calculate - * @uptime: current uptime in seconds + * @totalpages: total present RAM allowed for page allocation * - * The formula used is relatively simple and documented inline in the - * function. The main rationale is that we want to select a good task - * to kill when we run out of memory. 
- * - * Good in this context means that: - * 1) we lose the minimum amount of work done - * 2) we recover a large amount of memory - * 3) we don't kill anything innocent of eating tons of memory - * 4) we want to kill the minimum amount of processes (one) - * 5) we try to kill the process the user expects us to kill, this - * algorithm has been meticulously tuned to meet the principle - * of least surprise ... (be careful when you change it) + * The heuristic for determining which task to kill is made to be as simple and + * predictable as possible. The goal is to return the highest value for the + * task consuming the most memory to avoid subsequent oom conditions. */ - -unsigned long badness(struct task_struct *p, unsigned long uptime) +unsigned int oom_badness(struct task_struct *p, unsigned long totalpages) { - unsigned long points, cpu_time, run_time; struct mm_struct *mm; - struct task_struct *child; - int oom_adj = p->signal->oom_adj; - struct task_cputime task_time; - unsigned long utime; - unsigned long stime; + int points; - if (oom_adj == OOM_DISABLE) + /* + * Shortcut check for OOM_SCORE_ADJ_MIN so the entire heuristic doesn't + * need to be executed for something that can't be killed. + */ + if (p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN) return 0; + /* + * When the PF_OOM_ORIGIN bit is set, it indicates the task should have + * priority for oom killing. + */ + if (p->flags & PF_OOM_ORIGIN) + return 1000; + task_lock(p); mm = p->mm; if (!mm) { @@ -133,98 +130,37 @@ unsigned long badness(struct task_struct *p, unsigned long uptime) } /* - * The memory size of the process is the basis for the badness. + * The memory controller may have a limit of 0 bytes, so avoid a divide + * by zero if necessary. */ - points = mm->total_vm; + if (!totalpages) + totalpages = 1; /* - * After this unlock we can no longer dereference local variable `mm' + * The baseline for the badness score is the proportion of RAM that each + * task's rss and swap space use. 
*/ + points = (get_mm_rss(mm) + get_mm_counter(mm, MM_SWAPENTS)) * 1000 / + totalpages; task_unlock(p); /* - * swapoff can easily use up all memory, so kill those first. - */ - if (p->flags & PF_OOM_ORIGIN) - return ULONG_MAX; - - /* - * Processes which fork a lot of child processes are likely - * a good choice. We add half the vmsize of the children if they - * have an own mm. This prevents forking servers to flood the - * machine with an endless amount of children. In case a single - * child is eating the vast majority of memory, adding only half - * to the parents will make the child our kill candidate of choice. - */ - list_for_each_entry(child, &p->children, sibling) { - task_lock(child); - if (child->mm != mm && child->mm) - points += child->mm->total_vm/2 + 1; - task_unlock(child); - } - - /* - * CPU time is in tens of seconds and run time is in thousands - * of seconds. There is no particular reason for this other than - * that it turned out to work very well in practice. - */ - thread_group_cputime(p, &task_time); - utime = cputime_to_jiffies(task_time.utime); - stime = cputime_to_jiffies(task_time.stime); - cpu_time = (utime + stime) >> (SHIFT_HZ + 3); - - - if (uptime >= p->start_time.tv_sec) - run_time = (uptime - p->start_time.tv_sec) >> 10; - else - run_time = 0; - - if (cpu_time) - points /= int_sqrt(cpu_time); - if (run_time) - points /= int_sqrt(int_sqrt(run_time)); - - /* - * Niced processes are most likely less important, so double - * their badness points. - */ - if (task_nice(p) > 0) - points *= 2; - - /* - * Superuser processes are usually more important, so we make it - * less likely that we kill those. - */ - if (has_capability_noaudit(p, CAP_SYS_ADMIN) || - has_capability_noaudit(p, CAP_SYS_RESOURCE)) - points /= 4; - - /* - * We don't want to kill a process with direct hardware access. - * Not only could that mess up the hardware, but usually users - * tend to only have this flag set on applications they think - * of as important. 
+ * Root processes get 3% bonus, just like the __vm_enough_memory() used + * by LSMs. */ - if (has_capability_noaudit(p, CAP_SYS_RAWIO)) - points /= 4; + if (has_capability_noaudit(p, CAP_SYS_ADMIN)) + points -= 30; /* - * Adjust the score by oom_adj. + * /proc/pid/oom_score_adj ranges from -1000 to +1000 such that the + * range may either completely disable oom killing or always prefer a + * certain task. */ - if (oom_adj) { - if (oom_adj > 0) { - if (!points) - points = 1; - points <<= oom_adj; - } else - points >>= -(oom_adj); - } + points += p->signal->oom_score_adj; -#ifdef DEBUG - printk(KERN_DEBUG "OOMkill: task %d (%s) got %lu points\n", - p->pid, p->comm, points); -#endif - return points; + if (points < 0) + return 0; + return (points <= 1000) ? points : 1000; } /* @@ -232,12 +168,24 @@ unsigned long badness(struct task_struct *p, unsigned long uptime) */ #ifdef CONFIG_NUMA static enum oom_constraint constrained_alloc(struct zonelist *zonelist, - gfp_t gfp_mask, nodemask_t *nodemask) + gfp_t gfp_mask, nodemask_t *nodemask, + unsigned long *totalpages) { struct zone *zone; struct zoneref *z; enum zone_type high_zoneidx = gfp_zone(gfp_mask); + bool cpuset_limited = false; + int nid; + + /* Default to all anonymous memory, page cache, and swap */ + *totalpages = global_page_state(NR_INACTIVE_ANON) + + global_page_state(NR_ACTIVE_ANON) + + global_page_state(NR_INACTIVE_FILE) + + global_page_state(NR_ACTIVE_FILE) + + total_swap_pages; + if (!zonelist) + return CONSTRAINT_NONE; /* * Reach here only when __GFP_NOFAIL is used. So, we should avoid * to kill current.We have to random task kill in this case. @@ -247,26 +195,47 @@ static enum oom_constraint constrained_alloc(struct zonelist *zonelist, return CONSTRAINT_NONE; /* - * The nodemask here is a nodemask passed to alloc_pages(). Now, - * cpuset doesn't use this nodemask for its hardwall/softwall/hierarchy - * feature. mempolicy is an only user of nodemask here. 
- * check mempolicy's nodemask contains all N_HIGH_MEMORY + * This is not a __GFP_THISNODE allocation, so a truncated nodemask in + * the page allocator means a mempolicy is in effect. Cpuset policy + * is enforced in get_page_from_freelist(). */ - if (nodemask && !nodes_subset(node_states[N_HIGH_MEMORY], *nodemask)) + if (nodemask && !nodes_subset(node_states[N_HIGH_MEMORY], *nodemask)) { + *totalpages = total_swap_pages; + for_each_node_mask(nid, *nodemask) + *totalpages += node_page_state(nid, NR_INACTIVE_ANON) + + node_page_state(nid, NR_ACTIVE_ANON) + + node_page_state(nid, NR_INACTIVE_FILE) + + node_page_state(nid, NR_ACTIVE_FILE); return CONSTRAINT_MEMORY_POLICY; + } /* Check this allocation failure is caused by cpuset's wall function */ for_each_zone_zonelist_nodemask(zone, z, zonelist, high_zoneidx, nodemask) if (!cpuset_zone_allowed_softwall(zone, gfp_mask)) - return CONSTRAINT_CPUSET; - + cpuset_limited = true; + + if (cpuset_limited) { + *totalpages = total_swap_pages; + for_each_node_mask(nid, cpuset_current_mems_allowed) + *totalpages += node_page_state(nid, NR_INACTIVE_ANON) + + node_page_state(nid, NR_ACTIVE_ANON) + + node_page_state(nid, NR_INACTIVE_FILE) + + node_page_state(nid, NR_ACTIVE_FILE); + return CONSTRAINT_CPUSET; + } return CONSTRAINT_NONE; } #else static enum oom_constraint constrained_alloc(struct zonelist *zonelist, - gfp_t gfp_mask, nodemask_t *nodemask) + gfp_t gfp_mask, nodemask_t *nodemask, + unsigned long *totalpages) { + *totalpages = global_page_state(NR_INACTIVE_ANON) + + global_page_state(NR_ACTIVE_ANON) + + global_page_state(NR_INACTIVE_FILE) + + global_page_state(NR_ACTIVE_FILE) + + total_swap_pages; return CONSTRAINT_NONE; } #endif @@ -277,18 +246,16 @@ static enum oom_constraint constrained_alloc(struct zonelist *zonelist, * * (not docbooked, we don't want this one cluttering up the manual) */ -static struct task_struct *select_bad_process(unsigned long *ppoints, - struct mem_cgroup *mem, enum oom_constraint constraint, - 
const nodemask_t *mask) +static struct task_struct *select_bad_process(unsigned int *ppoints, + unsigned long totalpages, struct mem_cgroup *mem, + enum oom_constraint constraint, const nodemask_t *mask) { struct task_struct *p; struct task_struct *chosen = NULL; - struct timespec uptime; *ppoints = 0; - do_posix_clock_monotonic_gettime(&uptime); for_each_process(p) { - unsigned long points; + unsigned int points; /* * skip kernel threads and tasks which have already released @@ -333,13 +300,13 @@ static struct task_struct *select_bad_process(unsigned long *ppoints, return ERR_PTR(-1UL); chosen = p; - *ppoints = ULONG_MAX; + *ppoints = 1000; } - if (p->signal->oom_adj == OOM_DISABLE) + if (p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN) continue; - points = badness(p, uptime.tv_sec); + points = oom_badness(p, totalpages); if (points > *ppoints || !chosen) { chosen = p; *ppoints = points; @@ -355,7 +322,7 @@ static struct task_struct *select_bad_process(unsigned long *ppoints, * * Dumps the current memory state of all system tasks, excluding kernel threads. * State information includes task's pid, uid, tgid, vm size, rss, cpu, oom_adj - * score, and name. + * value, oom_score_adj value, and name. * * If the actual is non-NULL, only tasks that are a member of the mem_cgroup are * shown. 
@@ -367,7 +334,7 @@ static void dump_tasks(const struct mem_cgroup *mem) struct task_struct *g, *p; printk(KERN_INFO "[ pid ] uid tgid total_vm rss cpu oom_adj " - "name\n"); + "oom_score_adj name\n"); do_each_thread(g, p) { struct mm_struct *mm; @@ -387,10 +354,10 @@ static void dump_tasks(const struct mem_cgroup *mem) task_unlock(p); continue; } - printk(KERN_INFO "[%5d] %5d %5d %8lu %8lu %3d %3d %s\n", + pr_info("[%5d] %5d %5d %8lu %8lu %3d %3d %4d %s\n", p->pid, __task_cred(p)->uid, p->tgid, mm->total_vm, get_mm_rss(mm), (int)task_cpu(p), p->signal->oom_adj, - p->comm); + p->signal->oom_score_adj, p->comm); task_unlock(p); } while_each_thread(g, p); } @@ -399,8 +366,9 @@ static void dump_header(struct task_struct *p, gfp_t gfp_mask, int order, struct mem_cgroup *mem) { pr_warning("%s invoked oom-killer: gfp_mask=0x%x, order=%d, " - "oom_adj=%d\n", - current->comm, gfp_mask, order, current->signal->oom_adj); + "oom_adj=%d, oom_score_adj=%d\n", + current->comm, gfp_mask, order, current->signal->oom_adj, + current->signal->oom_score_adj); task_lock(current); cpuset_print_task_mems_allowed(current); task_unlock(current); @@ -465,7 +433,7 @@ static int oom_kill_task(struct task_struct *p) * change to NULL at any time since we do not hold task_lock(p). * However, this is of no concern to us. 
*/ - if (!p->mm || p->signal->oom_adj == OOM_DISABLE) + if (!p->mm || p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN) return 1; __oom_kill_task(p, 1); @@ -474,13 +442,12 @@ static int oom_kill_task(struct task_struct *p) } static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order, - unsigned long points, struct mem_cgroup *mem, - const char *message) + unsigned int points, unsigned long totalpages, + struct mem_cgroup *mem, const char *message) { struct task_struct *victim = p; struct task_struct *c; - unsigned long victim_points = 0; - struct timespec uptime; + unsigned int victim_points = 0; if (printk_ratelimit()) dump_header(p, gfp_mask, order, mem); @@ -494,21 +461,20 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order, return 0; } - pr_err("%s: Kill process %d (%s) with score %lu or sacrifice child\n", + pr_err("%s: Kill process %d (%s) with score %d or sacrifice child\n", message, task_pid_nr(p), p->comm, points); - do_posix_clock_monotonic_gettime(&uptime); /* Try to sacrifice the worst child first */ list_for_each_entry(c, &p->children, sibling) { - unsigned long cpoints; + unsigned int cpoints; if (c->mm == p->mm) continue; if (mem && !task_in_mem_cgroup(c, mem)) continue; - /* badness() returns 0 if the thread is unkillable */ - cpoints = badness(c, uptime.tv_sec); + /* oom_badness() returns 0 if the thread is unkillable */ + cpoints = oom_badness(c, totalpages); if (cpoints > victim_points) { victim = c; victim_points = cpoints; @@ -520,17 +486,19 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order, #ifdef CONFIG_CGROUP_MEM_RES_CTLR void mem_cgroup_out_of_memory(struct mem_cgroup *mem, gfp_t gfp_mask) { + unsigned long limit; unsigned long points = 0; struct task_struct *p; check_panic_on_oom(CONSTRAINT_MEMCG, gfp_mask, 0); + limit = mem_cgroup_get_limit(mem) >> PAGE_SHIFT; read_lock(&tasklist_lock); retry: - p = select_bad_process(&points, mem, CONSTRAINT_MEMCG, NULL); + p = 
select_bad_process(&points, limit, mem, CONSTRAINT_MEMCG, NULL); if (!p || PTR_ERR(p) == -1UL) goto out; - if (oom_kill_process(p, gfp_mask, 0, points, mem, + if (oom_kill_process(p, gfp_mask, 0, points, limit, mem, "Memory cgroup out of memory")) goto retry; out: @@ -643,22 +611,22 @@ static void clear_system_oom(void) /* * Must be called with tasklist_lock held for read. */ -static void __out_of_memory(gfp_t gfp_mask, int order, +static void __out_of_memory(gfp_t gfp_mask, int order, unsigned long totalpages, enum oom_constraint constraint, const nodemask_t *mask) { struct task_struct *p; - unsigned long points; + unsigned int points; if (sysctl_oom_kill_allocating_task) - if (!oom_kill_process(current, gfp_mask, order, 0, NULL, - "Out of memory (oom_kill_allocating_task)")) + if (!oom_kill_process(current, gfp_mask, order, 0, totalpages, + NULL, "Out of memory (oom_kill_allocating_task)")) return; retry: /* * Rambo mode: Shoot down a process and hope it solves whatever * issues we may have. */ - p = select_bad_process(&points, NULL, constraint, mask); + p = select_bad_process(&points, totalpages, NULL, constraint, mask); if (PTR_ERR(p) == -1UL) return; @@ -670,7 +638,7 @@ retry: panic("Out of memory and no killable processes...\n"); } - if (oom_kill_process(p, gfp_mask, order, points, NULL, + if (oom_kill_process(p, gfp_mask, order, points, totalpages, NULL, "Out of memory")) goto retry; } @@ -690,6 +658,7 @@ retry: void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask, int order, nodemask_t *nodemask) { + unsigned long totalpages; unsigned long freed = 0; enum oom_constraint constraint = CONSTRAINT_NONE; @@ -702,11 +671,11 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask, * Check if there were limitations on the allocation (only relevant for * NUMA) that may require different handling. 
*/ - if (zonelist) - constraint = constrained_alloc(zonelist, gfp_mask, nodemask); + constraint = constrained_alloc(zonelist, gfp_mask, nodemask, + &totalpages); check_panic_on_oom(constraint, gfp_mask, order); read_lock(&tasklist_lock); - __out_of_memory(gfp_mask, order, constraint, nodemask); + __out_of_memory(gfp_mask, order, totalpages, constraint, nodemask); read_unlock(&tasklist_lock); /* -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail143.messagelabs.com (mail143.messagelabs.com [216.82.254.35]) by kanga.kvack.org (Postfix) with ESMTP id 6A4206B01D4 for ; Tue, 1 Jun 2010 03:18:57 -0400 (EDT) Received: from wpaz5.hot.corp.google.com (wpaz5.hot.corp.google.com [172.24.198.69]) by smtp-out.google.com with ESMTP id o517ItE6025330 for ; Tue, 1 Jun 2010 00:18:55 -0700 Received: from pvc7 (pvc7.prod.google.com [10.241.209.135]) by wpaz5.hot.corp.google.com with ESMTP id o517IriP025066 for ; Tue, 1 Jun 2010 00:18:54 -0700 Received: by pvc7 with SMTP id 7so238439pvc.1 for ; Tue, 01 Jun 2010 00:18:53 -0700 (PDT) Date: Tue, 1 Jun 2010 00:18:50 -0700 (PDT) From: David Rientjes Subject: [patch -mm 10/18] oom: deprecate oom_adj tunable In-Reply-To: Message-ID: References: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-linux-mm@kvack.org To: Andrew Morton Cc: Rik van Riel , Nick Piggin , Oleg Nesterov , KAMEZAWA Hiroyuki , KOSAKI Motohiro , Balbir Singh , linux-mm@kvack.org List-ID: /proc/pid/oom_adj is now deprecated so that it may eventually be removed. The target date for removal is May 2012. A warning will be printed to the kernel log the first time a task attempts to use this interface. Future warnings will be suppressed until the kernel is rebooted to prevent spamming the kernel log.
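The warn-once behavior described above can be illustrated with a small userspace sketch. This is only an analogue of the idea, not the kernel's printk_once() implementation: warn_once_oom_adj() and its static flag are hypothetical names, and a single pid is used for brevity where the patch distinguishes the writer from the target task.

```c
#include <stdbool.h>
#include <stdio.h>

/*
 * Hypothetical userspace analogue of a warn-once helper: the first call
 * prints the deprecation notice, every later call stays silent until the
 * process (standing in for the kernel here) is restarted.
 *
 * Returns true if this call printed the warning.
 */
static bool warn_once_oom_adj(const char *comm, int pid)
{
	static bool warned;	/* zero-initialized, persists across calls */

	if (warned)
		return false;
	warned = true;
	fprintf(stderr,
		"%s (%d): /proc/%d/oom_adj is deprecated, "
		"please use /proc/%d/oom_score_adj instead.\n",
		comm, pid, pid, pid);
	return true;
}
```

Only the first writer is named in the log under this scheme, so the message identifies one offending application rather than all of them.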
Signed-off-by: David Rientjes --- Documentation/feature-removal-schedule.txt | 25 +++++++++++++++++++++++++ Documentation/filesystems/proc.txt | 3 +++ fs/proc/base.c | 8 ++++++++ include/linux/oom.h | 3 +++ 4 files changed, 39 insertions(+), 0 deletions(-) diff --git a/Documentation/feature-removal-schedule.txt b/Documentation/feature-removal-schedule.txt --- a/Documentation/feature-removal-schedule.txt +++ b/Documentation/feature-removal-schedule.txt @@ -174,6 +174,31 @@ Who: Eric Biederman --------------------------- +What: /proc//oom_adj +When: May 2012 +Why: /proc//oom_adj allows userspace to influence the oom killer's + badness heuristic used to determine which task to kill when the kernel + is out of memory. + + The badness heuristic has since been rewritten since the introduction of + this tunable such that its meaning is deprecated. The value was + implemented as a bitshift on a score generated by the badness() + function that did not have any precise units of measure. With the + rewrite, the score is given as a proportion of available memory to the + task allocating pages, so using a bitshift which grows the score + exponentially is, thus, impossible to tune with fine granularity. + + A much more powerful interface, /proc//oom_score_adj, was + introduced with the oom killer rewrite that allows users to increase or + decrease the badness() score linearly. This interface will replace + /proc//oom_adj. + + A warning will be emitted to the kernel log if an application uses this + deprecated interface. After it is printed once, future warnings will be + suppressed until the kernel is rebooted. + +--------------------------- + What: remove EXPORT_SYMBOL(kernel_thread) When: August 2006 Files: arch/*/kernel/*_ksyms.c diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt --- a/Documentation/filesystems/proc.txt +++ b/Documentation/filesystems/proc.txt @@ -1288,6 +1288,9 @@ scaled linearly with /proc//oom_score_adj. 
Writing to /proc//oom_score_adj or /proc//oom_adj will change the other with its scaled value. +NOTICE: /proc//oom_adj is deprecated and will be removed, please see +Documentation/feature-removal-schedule.txt. + Caveat: when a parent task is selected, the oom killer will sacrifice any first generation children with seperate address spaces instead, if possible. This avoids servers and important system daemons from being killed and loses the diff --git a/fs/proc/base.c b/fs/proc/base.c --- a/fs/proc/base.c +++ b/fs/proc/base.c @@ -1044,6 +1044,14 @@ static ssize_t oom_adjust_write(struct file *file, const char __user *buf, return -EACCES; } + /* + * Warn that /proc/pid/oom_adj is deprecated, see + * Documentation/feature-removal-schedule.txt. + */ + printk_once(KERN_WARNING "%s (%d): /proc/%d/oom_adj is deprecated, " + "please use /proc/%d/oom_score_adj instead.\n", + current->comm, task_pid_nr(current), + task_pid_nr(current), task_pid_nr(task)); task->signal->oom_adj = oom_adjust; /* * Scale /proc/pid/oom_score_adj appropriately ensuring that a maximum diff --git a/include/linux/oom.h b/include/linux/oom.h --- a/include/linux/oom.h +++ b/include/linux/oom.h @@ -2,6 +2,9 @@ #define __INCLUDE_LINUX_OOM_H /* + * /proc//oom_adj is deprecated, see + * Documentation/feature-removal-schedule.txt. + * * /proc//oom_adj set to -17 protects from the oom-killer */ #define OOM_DISABLE (-17) -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail191.messagelabs.com (mail191.messagelabs.com [216.82.242.19]) by kanga.kvack.org (Postfix) with ESMTP id 4E5D06B01D8 for ; Tue, 1 Jun 2010 03:19:01 -0400 (EDT) Received: from kpbe15.cbf.corp.google.com (kpbe15.cbf.corp.google.com [172.25.105.79]) by smtp-out.google.com with ESMTP id o517IwPA001150 for ; Tue, 1 Jun 2010 00:18:59 -0700 Received: from pwi5 (pwi5.prod.google.com [10.241.219.5]) by kpbe15.cbf.corp.google.com with ESMTP id o517IvVO031963 for ; Tue, 1 Jun 2010 00:18:57 -0700 Received: by pwi5 with SMTP id 5so1749811pwi.34 for ; Tue, 01 Jun 2010 00:18:56 -0700 (PDT) Date: Tue, 1 Jun 2010 00:18:54 -0700 (PDT) From: David Rientjes Subject: [patch -mm 11/18] oom: avoid oom killer for lowmem allocations In-Reply-To: Message-ID: References: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-linux-mm@kvack.org To: Andrew Morton Cc: Rik van Riel , Nick Piggin , Oleg Nesterov , KAMEZAWA Hiroyuki , KOSAKI Motohiro , Balbir Singh , linux-mm@kvack.org List-ID: If memory has been depleted in lowmem zones even with the protection afforded to it by /proc/sys/vm/lowmem_reserve_ratio, it is unlikely that killing current users will help. The memory is either reclaimable (or migratable) already, in which case we should not invoke the oom killer at all, or it is pinned by an application for I/O. Killing such an application may leave the hardware in an unspecified state and there is no guarantee that it will be able to make a timely exit. Lowmem allocations are now failed in oom conditions when __GFP_NOFAIL is not used so that the task can perhaps recover or try again later. Previously, the heuristic provided some protection for those tasks with CAP_SYS_RAWIO, but this is no longer necessary since we will not be killing tasks for the purposes of ISA allocations. 
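As a rough illustration of the policy described above, the decision can be sketched in standalone C. This is a simplified model under stated assumptions, not the page allocator code: fail_instead_of_oom() and enum zone_idx are hypothetical names, the zone ordering assumes a kernel configured with all zone types, and COSTLY_ORDER stands in for PAGE_ALLOC_COSTLY_ORDER.

```c
#include <stdbool.h>

/* Illustrative zone ordering, lowest zone first (assumes all zones exist). */
enum zone_idx { ZONE_DMA, ZONE_DMA32, ZONE_NORMAL, ZONE_HIGHMEM, ZONE_MOVABLE };

#define COSTLY_ORDER 3	/* stand-in for PAGE_ALLOC_COSTLY_ORDER */

/*
 * Hypothetical helper: returns true when an allocation that made no
 * reclaim progress should simply fail instead of invoking the oom killer.
 */
static bool fail_instead_of_oom(bool nofail, unsigned int order,
				enum zone_idx high_zoneidx)
{
	/* __GFP_NOFAIL allocations are never failed here */
	if (nofail)
		return false;
	/* killing a task is unlikely to produce a high-order page */
	if (order > COSTLY_ORDER)
		return true;
	/* lowmem (DMA/DMA32) is probably pinned for I/O; killing won't help */
	if (high_zoneidx < ZONE_NORMAL)
		return true;
	return false;
}
```

For example, an order-0 request restricted to ZONE_DMA without the no-fail flag returns true, while the same request against ZONE_NORMAL returns false and remains eligible for the oom killer.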
high_zoneidx is gfp_zone(gfp_flags), meaning that ZONE_NORMAL will be the default for all allocations that are not __GFP_DMA, __GFP_DMA32, __GFP_HIGHMEM, and __GFP_MOVABLE on kernels configured to support those flags. Testing for high_zoneidx being less than ZONE_NORMAL will only return true for allocations that have either __GFP_DMA or __GFP_DMA32. Signed-off-by: David Rientjes --- mm/page_alloc.c | 29 ++++++++++++++++++++--------- 1 files changed, 20 insertions(+), 9 deletions(-) diff --git a/mm/page_alloc.c b/mm/page_alloc.c --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -1759,6 +1759,9 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order, /* The OOM killer will not help higher order allocs */ if (order > PAGE_ALLOC_COSTLY_ORDER) goto out; + /* The OOM killer does not needlessly kill tasks for lowmem */ + if (high_zoneidx < ZONE_NORMAL) + goto out; /* * GFP_THISNODE contains __GFP_NORETRY and we never hit this. * Sanity check for bare calls of __GFP_THISNODE, not real OOM. @@ -2052,15 +2055,23 @@ rebalance: if (page) goto got_pg; - /* - * The OOM killer does not trigger for high-order - * ~__GFP_NOFAIL allocations so if no progress is being - * made, there are no other options and retrying is - * unlikely to help. - */ - if (order > PAGE_ALLOC_COSTLY_ORDER && - !(gfp_mask & __GFP_NOFAIL)) - goto nopage; + if (!(gfp_mask & __GFP_NOFAIL)) { + /* + * The oom killer is not called for high-order + * allocations that may fail, so if no progress + * is being made, there are no other options and + * retrying is unlikely to help. + */ + if (order > PAGE_ALLOC_COSTLY_ORDER) + goto nopage; + /* + * The oom killer is not called for lowmem + * allocations to prevent needlessly killing + * innocent tasks. + */ + if (high_zoneidx < ZONE_NORMAL) + goto nopage; + } goto restart; } -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail203.messagelabs.com (mail203.messagelabs.com [216.82.254.243]) by kanga.kvack.org (Postfix) with ESMTP id 71A996B01D9 for ; Tue, 1 Jun 2010 03:19:04 -0400 (EDT) Received: from kpbe11.cbf.corp.google.com (kpbe11.cbf.corp.google.com [172.25.105.75]) by smtp-out.google.com with ESMTP id o517J21p024400 for ; Tue, 1 Jun 2010 00:19:02 -0700 Received: from pxi10 (pxi10.prod.google.com [10.243.27.10]) by kpbe11.cbf.corp.google.com with ESMTP id o517IJZb022950 for ; Tue, 1 Jun 2010 00:19:01 -0700 Received: by pxi10 with SMTP id 10so2973638pxi.35 for ; Tue, 01 Jun 2010 00:19:00 -0700 (PDT) Date: Tue, 1 Jun 2010 00:18:57 -0700 (PDT) From: David Rientjes Subject: [patch -mm 12/18] oom: remove unnecessary code and cleanup In-Reply-To: Message-ID: References: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-linux-mm@kvack.org To: Andrew Morton Cc: Rik van Riel , Nick Piggin , Oleg Nesterov , KAMEZAWA Hiroyuki , KOSAKI Motohiro , Balbir Singh , linux-mm@kvack.org List-ID: Remove the redundancy in __oom_kill_task() since: - init can never be passed to this function: it will never be PF_EXITING or selectable from select_bad_process(), and - it will never be passed a task from oom_kill_task() without an ->mm and we're unconcerned about detachment from exiting tasks, there's no reason to protect them against SIGKILL or access to memory reserves. Also moves the kernel log message to a higher level since the verbosity is not always emitted here; we need not print an error message if an exiting task is given a longer timeslice. 
Reviewed-by: KAMEZAWA Hiroyuki Reviewed-by: KOSAKI Motohiro Signed-off-by: David Rientjes --- mm/oom_kill.c | 64 ++++++++++++++------------------------------------------ 1 files changed, 16 insertions(+), 48 deletions(-) diff --git a/mm/oom_kill.c b/mm/oom_kill.c --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -439,67 +439,35 @@ static void dump_header(struct task_struct *p, gfp_t gfp_mask, int order, dump_tasks(mem); } -#define K(x) ((x) << (PAGE_SHIFT-10)) - /* - * Send SIGKILL to the selected process irrespective of CAP_SYS_RAW_IO - * flag though it's unlikely that we select a process with CAP_SYS_RAW_IO - * set. + * Give the oom killed task high priority and access to memory reserves so that + * it may quickly exit and free its memory. */ -static void __oom_kill_task(struct task_struct *p, int verbose) +static void __oom_kill_task(struct task_struct *p) { - if (is_global_init(p)) { - WARN_ON(1); - printk(KERN_WARNING "tried to kill init!\n"); - return; - } - - task_lock(p); - if (!p->mm) { - WARN_ON(1); - printk(KERN_WARNING "tried to kill an mm-less task %d (%s)!\n", - task_pid_nr(p), p->comm); - task_unlock(p); - return; - } - - if (verbose) - printk(KERN_ERR "Killed process %d (%s) " - "vsz:%lukB, anon-rss:%lukB, file-rss:%lukB\n", - task_pid_nr(p), p->comm, - K(p->mm->total_vm), - K(get_mm_counter(p->mm, MM_ANONPAGES)), - K(get_mm_counter(p->mm, MM_FILEPAGES))); - task_unlock(p); - - /* - * We give our sacrificial lamb high priority and access to - * all the memory it needs. That way it should be able to - * exit() and clear out its resources quickly... - */ p->rt.time_slice = HZ; set_tsk_thread_flag(p, TIF_MEMDIE); - force_sig(SIGKILL, p); } +#define K(x) ((x) << (PAGE_SHIFT-10)) static int oom_kill_task(struct task_struct *p) { - /* WARNING: mm may not be dereferenced since we did not obtain its - * value from get_task_mm(p). This is OK since all we need to do is - * compare mm to q->mm below. 
- * - * Furthermore, even if mm contains a non-NULL value, p->mm may - * change to NULL at any time since we do not hold task_lock(p). - * However, this is of no concern to us. - */ - if (!p->mm || p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN) + task_lock(p); + if (!p->mm || p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN) { + task_unlock(p); return 1; + } + pr_err("Killed process %d (%s) total-vm:%lukB, anon-rss:%lukB, file-rss:%lukB\n", + task_pid_nr(p), p->comm, K(p->mm->total_vm), + K(get_mm_counter(p->mm, MM_ANONPAGES)), + K(get_mm_counter(p->mm, MM_FILEPAGES))); + task_unlock(p); - __oom_kill_task(p, 1); - + __oom_kill_task(p); return 0; } +#undef K static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order, unsigned int points, unsigned long totalpages, @@ -517,7 +485,7 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order, * its children or threads, just set TIF_MEMDIE so it can die quickly */ if (p->flags & PF_EXITING) { - __oom_kill_task(p, 0); + __oom_kill_task(p); return 0; } -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail172.messagelabs.com (mail172.messagelabs.com [216.82.254.3]) by kanga.kvack.org (Postfix) with ESMTP id B25A16B01DB for ; Tue, 1 Jun 2010 03:19:08 -0400 (EDT) Received: from hpaq2.eem.corp.google.com (hpaq2.eem.corp.google.com [172.25.149.2]) by smtp-out.google.com with ESMTP id o517J62Z026188 for ; Tue, 1 Jun 2010 00:19:06 -0700 Received: from pwj1 (pwj1.prod.google.com [10.241.219.65]) by hpaq2.eem.corp.google.com with ESMTP id o517J3Rd005936 for ; Tue, 1 Jun 2010 00:19:04 -0700 Received: by pwj1 with SMTP id 1so455310pwj.41 for ; Tue, 01 Jun 2010 00:19:03 -0700 (PDT) Date: Tue, 1 Jun 2010 00:19:01 -0700 (PDT) From: David Rientjes Subject: [patch -mm 13/18] oom: avoid race for oom killed tasks detaching mm prior to exit In-Reply-To: Message-ID: References: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-linux-mm@kvack.org To: Andrew Morton Cc: Rik van Riel , Nick Piggin , Oleg Nesterov , KAMEZAWA Hiroyuki , KOSAKI Motohiro , Balbir Singh , linux-mm@kvack.org List-ID: Tasks detach their ->mm prior to exiting, so it's possible that in-progress oom kills or already-exiting tasks may be missed during the oom killer's tasklist scan. When an eligible task is found with either TIF_MEMDIE or PF_EXITING set, the oom killer is supposed to be a no-op to avoid needlessly killing additional tasks. This closes the race between a task detaching its ->mm and being removed from the tasklist. Out of memory conditions as the result of memory controllers will automatically filter tasks that have detached their ->mm (since task_in_mem_cgroup() will return 0). This is acceptable, however, since memcg constrained ooms aren't the result of a lack of memory resources but rather a limit imposed by userspace that requires a task be killed regardless.
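The effect of reordering the checks in the tasklist scan can be sketched as follows. This is a simplified userspace model based on a reading of the description above, not the code in select_bad_process(): struct task_view, enum scan_action, and classify_task() are hypothetical, and real details such as locking, cpuset and memcg filtering are omitted.

```c
#include <stdbool.h>

/* Hypothetical, flattened view of the task state the scan inspects. */
struct task_view {
	bool memdie;		/* TIF_MEMDIE: already chosen by the oom killer */
	bool exiting;		/* PF_EXITING */
	bool has_mm;		/* ->mm != NULL */
	bool is_current;	/* task is the one that triggered the oom */
};

enum scan_action {
	SCAN_SKIP,	/* ignore this task, keep scanning */
	SCAN_ABORT,	/* a kill or exit is in progress: oom killer is a no-op */
	SCAN_PREFER,	/* exiting current: select it so it can exit quickly */
	SCAN_SCORE,	/* ordinary candidate: compute its badness */
};

static enum scan_action classify_task(const struct task_view *t)
{
	/* Checked before the !mm test so a detached victim is not missed */
	if (t->memdie)
		return SCAN_ABORT;
	/* An exiting task only matters while it still holds memory */
	if (t->exiting && t->has_mm)
		return t->is_current ? SCAN_PREFER : SCAN_ABORT;
	/* The !mm skip now comes last */
	if (!t->has_mm)
		return SCAN_SKIP;
	return SCAN_SCORE;
}
```

With the !mm test first, an oom killed task that had already detached its ->mm would have been skipped and a second victim could have been chosen; in this ordering it yields SCAN_ABORT instead.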
[oleg@redhat.com: fix PF_EXITING check for !p->mm tasks] Acked-by: Nick Piggin Signed-off-by: David Rientjes --- mm/oom_kill.c | 14 +++++++------- 1 files changed, 7 insertions(+), 7 deletions(-) diff --git a/mm/oom_kill.c b/mm/oom_kill.c --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -317,12 +317,6 @@ static struct task_struct *select_bad_process(unsigned int *ppoints, for_each_process(p) { unsigned int points; - /* - * skip kernel threads and tasks which have already released - * their mm. - */ - if (!p->mm) - continue; /* skip the init task */ if (is_global_init(p)) continue; @@ -355,7 +349,7 @@ static struct task_struct *select_bad_process(unsigned int *ppoints, * the process of exiting and releasing its resources. * Otherwise we could get an easy OOM deadlock. */ - if (p->flags & PF_EXITING) { + if (p->flags & PF_EXITING && p->mm) { if (p != current) return ERR_PTR(-1UL); @@ -363,6 +357,12 @@ static struct task_struct *select_bad_process(unsigned int *ppoints, *ppoints = 1000; } + /* + * skip kernel threads and tasks which have already released + * their mm. + */ + if (!p->mm) + continue; if (p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN) continue; -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail190.messagelabs.com (mail190.messagelabs.com [216.82.249.51]) by kanga.kvack.org (Postfix) with ESMTP id 84E246B01DF for ; Tue, 1 Jun 2010 03:19:13 -0400 (EDT) Received: from kpbe14.cbf.corp.google.com (kpbe14.cbf.corp.google.com [172.25.105.78]) by smtp-out.google.com with ESMTP id o517JAbi020019 for ; Tue, 1 Jun 2010 00:19:10 -0700 Received: from pxi5 (pxi5.prod.google.com [10.243.27.5]) by kpbe14.cbf.corp.google.com with ESMTP id o517J8Ce009889 for ; Tue, 1 Jun 2010 00:19:08 -0700 Received: by pxi5 with SMTP id 5so2658519pxi.37 for ; Tue, 01 Jun 2010 00:19:08 -0700 (PDT) Date: Tue, 1 Jun 2010 00:19:04 -0700 (PDT) From: David Rientjes Subject: [patch -mm 14/18] oom: check PF_KTHREAD instead of !mm to skip kthreads In-Reply-To: Message-ID: References: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-linux-mm@kvack.org To: Andrew Morton Cc: Rik van Riel , Nick Piggin , Oleg Nesterov , KAMEZAWA Hiroyuki , KOSAKI Motohiro , Balbir Singh , linux-mm@kvack.org List-ID: From: Oleg Nesterov select_bad_process() thinks a kernel thread can't have ->mm != NULL, this is not true due to use_mm(). Change the code to check PF_KTHREAD. Signed-off-by: Oleg Nesterov Signed-off-by: David Rientjes --- mm/oom_kill.c | 4 ++-- 1 files changed, 2 insertions(+), 2 deletions(-) diff --git a/mm/oom_kill.c b/mm/oom_kill.c --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -317,8 +317,8 @@ static struct task_struct *select_bad_process(unsigned int *ppoints, for_each_process(p) { unsigned int points; - /* skip the init task */ - if (is_global_init(p)) + /* skip the init task and kthreads */ + if (is_global_init(p) || (p->flags & PF_KTHREAD)) continue; if (mem && !task_in_mem_cgroup(p, mem)) continue; -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail191.messagelabs.com (mail191.messagelabs.com [216.82.242.19]) by kanga.kvack.org (Postfix) with ESMTP id 4E9D96B01E1 for ; Tue, 1 Jun 2010 03:19:14 -0400 (EDT) Received: from kpbe11.cbf.corp.google.com (kpbe11.cbf.corp.google.com [172.25.105.75]) by smtp-out.google.com with ESMTP id o517JDif026513 for ; Tue, 1 Jun 2010 00:19:13 -0700 Received: from pwj10 (pwj10.prod.google.com [10.241.219.74]) by kpbe11.cbf.corp.google.com with ESMTP id o517J8aV023435 for ; Tue, 1 Jun 2010 00:19:11 -0700 Received: by pwj10 with SMTP id 10so2382525pwj.7 for ; Tue, 01 Jun 2010 00:19:11 -0700 (PDT) Date: Tue, 1 Jun 2010 00:19:09 -0700 (PDT) From: David Rientjes Subject: [patch -mm 15/18] oom: introduce find_lock_task_mm() to fix !mm false positives In-Reply-To: Message-ID: References: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-linux-mm@kvack.org To: Andrew Morton Cc: Rik van Riel , Nick Piggin , Oleg Nesterov , KAMEZAWA Hiroyuki , KOSAKI Motohiro , Balbir Singh , linux-mm@kvack.org List-ID: From: Oleg Nesterov Almost all ->mm == NULL checks in oom_kill.c are wrong. The current code assumes that a task without ->mm has already released its memory and ignores the process. However, this is not necessarily true when the process is multithreaded: other live sub-threads can still use this ->mm. - Remove the "if (!p->mm)" check in select_bad_process(), it is just wrong. - Add the new helper, find_lock_task_mm(), which finds the live thread which uses the memory and takes task_lock() to pin ->mm - change oom_badness() to use this helper instead of just checking ->mm != NULL. - As David pointed out, select_bad_process() must never choose the task without ->mm, but no matter what oom_badness() returns the task can be chosen if nothing else has been found yet.
Change oom_badness() to return int, change it to return -1 if find_lock_task_mm() fails, and change select_bad_process() to check points >= 0. Note! This patch is not enough, we need more changes. - oom_badness() was fixed, but oom_kill_task() still ignores the task without ->mm - oom_forkbomb_penalty() should use find_lock_task_mm() too, and it also needs other changes to actually find the first first-descendant children This will be addressed later. Signed-off-by: Oleg Nesterov Signed-off-by: David Rientjes --- mm/oom_kill.c | 37 +++++++++++++++++++------------------ 1 files changed, 19 insertions(+), 18 deletions(-) diff --git a/mm/oom_kill.c b/mm/oom_kill.c --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -95,6 +95,20 @@ static void check_panic_on_oom(enum oom_constraint constraint, gfp_t gfp_mask, return false; } +static struct task_struct *find_lock_task_mm(struct task_struct *p) +{ + struct task_struct *t = p; + + do { + task_lock(t); + if (likely(t->mm)) + return t; + task_unlock(t); + } while_each_thread(p, t); + + return NULL; +} + /* * Tasks that fork a very large number of children with seperate address spaces * may be the result of a bug, user error, malicious applications, or even those @@ -164,7 +178,6 @@ static unsigned long oom_forkbomb_penalty(struct task_struct *tsk) */ unsigned int oom_badness(struct task_struct *p, unsigned long totalpages) { - struct mm_struct *mm; int points; /* @@ -181,12 +194,9 @@ unsigned int oom_badness(struct task_struct *p, unsigned long totalpages) if (p->flags & PF_OOM_ORIGIN) return 1000; - task_lock(p); - mm = p->mm; - if (!mm) { - task_unlock(p); + p = find_lock_task_mm(p); + if (!p) return 0; - } /* * The memory controller may have a limit of 0 bytes, so avoid a divide @@ -199,8 +209,8 @@ unsigned int oom_badness(struct task_struct *p, unsigned long totalpages) * The baseline for the badness score is the proportion of RAM that each * task's rss and swap space use. 
*/ - points = (get_mm_rss(mm) + get_mm_counter(mm, MM_SWAPENTS)) * 1000 / - totalpages; + points = (get_mm_rss(p->mm) + get_mm_counter(p->mm, MM_SWAPENTS)) * + 1000 / totalpages; task_unlock(p); points += oom_forkbomb_penalty(p); @@ -357,17 +367,8 @@ static struct task_struct *select_bad_process(unsigned int *ppoints, *ppoints = 1000; } - /* - * skip kernel threads and tasks which have already released - * their mm. - */ - if (!p->mm) - continue; - if (p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN) - continue; - points = oom_badness(p, totalpages); - if (points > *ppoints || !chosen) { + if (points > *ppoints) { chosen = p; *ppoints = points; } -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail172.messagelabs.com (mail172.messagelabs.com [216.82.254.3]) by kanga.kvack.org (Postfix) with ESMTP id 460BA6B01E1 for ; Tue, 1 Jun 2010 03:19:21 -0400 (EDT) Received: from wpaz5.hot.corp.google.com (wpaz5.hot.corp.google.com [172.24.198.69]) by smtp-out.google.com with ESMTP id o517JGti020141 for ; Tue, 1 Jun 2010 00:19:17 -0700 Received: from pxi18 (pxi18.prod.google.com [10.243.27.18]) by wpaz5.hot.corp.google.com with ESMTP id o517JBT5025646 for ; Tue, 1 Jun 2010 00:19:15 -0700 Received: by pxi18 with SMTP id 18so1778793pxi.19 for ; Tue, 01 Jun 2010 00:19:15 -0700 (PDT) Date: Tue, 1 Jun 2010 00:19:12 -0700 (PDT) From: David Rientjes Subject: [patch -mm 16/18] oom: give current access to memory reserves if it has been killed In-Reply-To: Message-ID: References: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-linux-mm@kvack.org To: Andrew Morton Cc: Rik van Riel , Nick Piggin , Oleg Nesterov , KAMEZAWA Hiroyuki , KOSAKI Motohiro , Balbir Singh , linux-mm@kvack.org List-ID: It's possible to livelock the page allocator if a thread has 
mm->mmap_sem and fails to make forward progress because the oom killer selects another thread sharing the same ->mm to kill that cannot exit until the semaphore is dropped.

The oom killer will not kill multiple tasks at the same time; each oom killed task must exit before another task may be killed. Thus, if one thread is holding mm->mmap_sem and cannot allocate memory, all threads sharing the same ->mm are blocked from exiting as well. In the oom kill case, that means the thread holding mm->mmap_sem will never free additional memory since it cannot get access to memory reserves and the thread that depends on it with access to memory reserves cannot exit because it cannot acquire the semaphore. Thus, the page allocator livelocks.

When the oom killer is called and current happens to have a pending SIGKILL, this patch automatically gives it access to memory reserves and returns. Upon returning to the page allocator, its allocation will hopefully succeed so it can quickly exit and free its memory. If not, the page allocator will fail the allocation if it is not __GFP_NOFAIL.

Reviewed-by: KAMEZAWA Hiroyuki
Signed-off-by: David Rientjes
---
 mm/oom_kill.c |   10 ++++++++++
 1 files changed, 10 insertions(+), 0 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -697,6 +697,16 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
 		return;
 
 	/*
+	 * If current has a pending SIGKILL, then automatically select it. The
+	 * goal is to allow it to allocate so that it may quickly exit and free
+	 * its memory.
+	 */
+	if (fatal_signal_pending(current)) {
+		set_tsk_thread_flag(current, TIF_MEMDIE);
+		return;
+	}
+
+	/*
 	 * Check if there were limitations on the allocation (only relevant for
 	 * NUMA) that may require different handling.
 	 */
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ .
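The early return described in patch 16 can be modeled in userspace. What follows is a minimal sketch, not kernel code: `toy_task`, `toy_zone`, `toy_out_of_memory()` and `toy_alloc_page()` are hypothetical names invented for illustration, and the allocator's watermarks are reduced to a single reserve threshold. It only shows why marking a task that already has a pending SIGKILL lets its next allocation dip into the reserves instead of waiting on another victim.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the per-task state the patch consults. */
struct toy_task {
	bool sigkill_pending;	/* models fatal_signal_pending(current) */
	bool memdie;		/* models the TIF_MEMDIE thread flag */
};

/* Hypothetical model of free memory: pages at or below `reserve` are off
 * limits to ordinary allocations. */
struct toy_zone {
	unsigned long free_pages;
	unsigned long reserve;
};

/*
 * Returns true if the oom killer would still have to pick a victim, false
 * if it returned early after granting current access to the reserves.
 */
static bool toy_out_of_memory(struct toy_task *current_task)
{
	assert(current_task);
	if (current_task->sigkill_pending) {
		current_task->memdie = true;
		return false;
	}
	return true;
}

/* Try to take one page; only memdie tasks may dip below the reserve. */
static bool toy_alloc_page(struct toy_zone *zone, const struct toy_task *t)
{
	unsigned long floor = t->memdie ? 0 : zone->reserve;

	if (zone->free_pages <= floor)
		return false;
	zone->free_pages--;
	return true;
}
```

In this model a task that is already dying fails its first allocation, passes through `toy_out_of_memory()` without any victim being selected, and then succeeds on retry because the reserve floor no longer applies to it.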
Don't email: email@kvack.org
From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail143.messagelabs.com (mail143.messagelabs.com [216.82.254.35]) by kanga.kvack.org (Postfix) with ESMTP id 0CD376B01E4 for ; Tue, 1 Jun 2010 03:19:23 -0400 (EDT)
Received: from wpaz17.hot.corp.google.com (wpaz17.hot.corp.google.com [172.24.198.81]) by smtp-out.google.com with ESMTP id o517JKi2020178 for ; Tue, 1 Jun 2010 00:19:20 -0700
Received: from pzk13 (pzk13.prod.google.com [10.243.19.141]) by wpaz17.hot.corp.google.com with ESMTP id o517JIhh032715 for ; Tue, 1 Jun 2010 00:19:18 -0700
Received: by pzk13 with SMTP id 13so2537943pzk.13 for ; Tue, 01 Jun 2010 00:19:18 -0700 (PDT)
Date: Tue, 1 Jun 2010 00:19:15 -0700 (PDT)
From: David Rientjes
Subject: [patch -mm 17/18] oom: avoid sending exiting tasks a SIGKILL
In-Reply-To:
Message-ID:
References:
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-linux-mm@kvack.org
To: Andrew Morton
Cc: Rik van Riel , Nick Piggin , Oleg Nesterov , KAMEZAWA Hiroyuki , KOSAKI Motohiro , Balbir Singh , linux-mm@kvack.org
List-ID:

It's unnecessary to SIGKILL a task that is already PF_EXITING, and doing so can actually cause a NULL pointer dereference of the sighand if it has already been detached. Instead, simply set TIF_MEMDIE so it has access to memory reserves and can quickly exit as the comment implies.

Reviewed-by: KAMEZAWA Hiroyuki
Signed-off-by: David Rientjes
---
 mm/oom_kill.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -486,7 +486,7 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
 	 * its children or threads, just set TIF_MEMDIE so it can die quickly
 	 */
 	if (p->flags & PF_EXITING) {
-		__oom_kill_task(p);
+		set_tsk_thread_flag(p, TIF_MEMDIE);
 		return 0;
 	}
 
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org.
For more info on Linux MM, see: http://www.linux-mm.org/ .
Don't email: email@kvack.org
From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail191.messagelabs.com (mail191.messagelabs.com [216.82.242.19]) by kanga.kvack.org (Postfix) with ESMTP id 5FE0F6B01E6 for ; Tue, 1 Jun 2010 03:19:25 -0400 (EDT)
Received: from wpaz5.hot.corp.google.com (wpaz5.hot.corp.google.com [172.24.198.69]) by smtp-out.google.com with ESMTP id o517JO0f023805 for ; Tue, 1 Jun 2010 00:19:24 -0700
Received: from pzk36 (pzk36.prod.google.com [10.243.19.164]) by wpaz5.hot.corp.google.com with ESMTP id o517JMnq025906 for ; Tue, 1 Jun 2010 00:19:23 -0700
Received: by pzk36 with SMTP id 36so1186215pzk.32 for ; Tue, 01 Jun 2010 00:19:22 -0700 (PDT)
Date: Tue, 1 Jun 2010 00:19:19 -0700 (PDT)
From: David Rientjes
Subject: [patch -mm 18/18] oom: clean up oom_kill_task()
In-Reply-To:
Message-ID:
References:
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-linux-mm@kvack.org
To: Andrew Morton
Cc: Rik van Riel , Nick Piggin , Oleg Nesterov , KAMEZAWA Hiroyuki , KOSAKI Motohiro , Balbir Singh , linux-mm@kvack.org
List-ID:

__oom_kill_task() only has a single caller, so merge it into that function.

Reviewed-by: KAMEZAWA Hiroyuki
Signed-off-by: David Rientjes
---
 mm/oom_kill.c |   15 +++------------
 1 files changed, 3 insertions(+), 12 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -440,17 +440,6 @@ static void dump_header(struct task_struct *p, gfp_t gfp_mask, int order,
 		dump_tasks(mem);
 }
 
-/*
- * Give the oom killed task high priority and access to memory reserves so that
- * it may quickly exit and free its memory.
- */
-static void __oom_kill_task(struct task_struct *p)
-{
-	p->rt.time_slice = HZ;
-	set_tsk_thread_flag(p, TIF_MEMDIE);
-	force_sig(SIGKILL, p);
-}
-
 #define K(x) ((x) << (PAGE_SHIFT-10))
 static int oom_kill_task(struct task_struct *p)
 {
@@ -465,7 +454,9 @@ static int oom_kill_task(struct task_struct *p)
 		K(get_mm_counter(p->mm, MM_FILEPAGES)));
 	task_unlock(p);
 
-	__oom_kill_task(p);
+	p->rt.time_slice = HZ;
+	set_tsk_thread_flag(p, TIF_MEMDIE);
+	force_sig(SIGKILL, p);
 	return 0;
 }
 #undef K
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ .
Don't email: email@kvack.org
From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail144.messagelabs.com (mail144.messagelabs.com [216.82.254.51]) by kanga.kvack.org (Postfix) with SMTP id EB7ED6B01EA for ; Tue, 1 Jun 2010 03:20:55 -0400 (EDT)
Received: from m5.gw.fujitsu.co.jp ([10.0.50.75]) by fgwmail7.fujitsu.co.jp (Fujitsu Gateway) with ESMTP id o517KruN007255 for (envelope-from kosaki.motohiro@jp.fujitsu.com); Tue, 1 Jun 2010 16:20:53 +0900
Received: from smail (m5 [127.0.0.1]) by outgoing.m5.gw.fujitsu.co.jp (Postfix) with ESMTP id D461145DE58 for ; Tue, 1 Jun 2010 16:20:51 +0900 (JST)
Received: from s5.gw.fujitsu.co.jp (s5.gw.fujitsu.co.jp [10.0.50.95]) by m5.gw.fujitsu.co.jp (Postfix) with ESMTP id 65B5945DE52 for ; Tue, 1 Jun 2010 16:20:51 +0900 (JST)
Received: from s5.gw.fujitsu.co.jp (localhost.localdomain [127.0.0.1]) by s5.gw.fujitsu.co.jp (Postfix) with ESMTP id 41CCFE08006 for ; Tue, 1 Jun 2010 16:20:51 +0900 (JST)
Received: from m106.s.css.fujitsu.com (m106.s.css.fujitsu.com [10.249.87.106]) by s5.gw.fujitsu.co.jp (Postfix) with ESMTP id C7B0FE08008 for ; Tue, 1 Jun 2010 16:20:50 +0900 (JST)
From: KOSAKI Motohiro
Subject: Re: [patch -mm 01/18] oom: filter tasks not sharing the same cpuset
In-Reply-To:
References:
Message-Id: <20100601162030.244B.A69D9226@jp.fujitsu.com>
MIME-Version: 1.0
Content-Type: text/plain; charset="US-ASCII" Content-Transfer-Encoding: 7bit Date: Tue, 1 Jun 2010 16:20:50 +0900 (JST) Sender: owner-linux-mm@kvack.org To: David Rientjes Cc: kosaki.motohiro@jp.fujitsu.com, Andrew Morton , Rik van Riel , Nick Piggin , Oleg Nesterov , KAMEZAWA Hiroyuki , Balbir Singh , linux-mm@kvack.org List-ID: > Tasks that do not share the same set of allowed nodes with the task that > triggered the oom should not be considered as candidates for oom kill. > > Tasks in other cpusets with a disjoint set of mems would be unfairly > penalized otherwise because of oom conditions elsewhere; an extreme > example could unfairly kill all other applications on the system if a > single task in a user's cpuset sets itself to OOM_DISABLE and then uses > more memory than allowed. > > Killing tasks outside of current's cpuset rarely would free memory for > current anyway. To use a sane heuristic, we must ensure that killing a > task would likely free memory for current and avoid needlessly killing > others at all costs just because their potential memory freeing is > unknown. It is better to kill current than another task needlessly. > > Acked-by: Rik van Riel > Acked-by: Nick Piggin > Acked-by: Balbir Singh > Reviewed-by: KAMEZAWA Hiroyuki > Signed-off-by: David Rientjes ack -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
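The filtering rule quoted in the message above comes down to an intersection test on sets of allowed memory nodes. Here is a hedged userspace sketch of that idea; the 64-bit mask, `toy_task` and `toy_is_oom_candidate()` are assumptions made for the example and are not taken from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical task model: bit n set means memory node n is allowed. */
struct toy_task {
	uint64_t mems_allowed;
};

/*
 * A task is an oom-kill candidate only if killing it could free memory on
 * at least one node the triggering task is allowed to allocate from.
 */
static bool toy_is_oom_candidate(const struct toy_task *candidate,
				 const struct toy_task *trigger)
{
	assert(candidate && trigger);
	return (candidate->mems_allowed & trigger->mems_allowed) != 0;
}
```

With the triggering task bound to nodes 0 and 1 (mask 0x3), a task bound to node 1 (0x2) stays eligible, while a task bound to nodes 2 and 3 (0xc) is skipped because killing it would be unlikely to free memory the trigger can use.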
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail137.messagelabs.com (mail137.messagelabs.com [216.82.249.19]) by kanga.kvack.org (Postfix) with SMTP id D31C16B01EA for ; Tue, 1 Jun 2010 03:33:13 -0400 (EDT) Received: from m5.gw.fujitsu.co.jp ([10.0.50.75]) by fgwmail5.fujitsu.co.jp (Fujitsu Gateway) with ESMTP id o517XBnA012113 for (envelope-from kosaki.motohiro@jp.fujitsu.com); Tue, 1 Jun 2010 16:33:11 +0900 Received: from smail (m5 [127.0.0.1]) by outgoing.m5.gw.fujitsu.co.jp (Postfix) with ESMTP id 2F58645DE57 for ; Tue, 1 Jun 2010 16:33:11 +0900 (JST) Received: from s5.gw.fujitsu.co.jp (s5.gw.fujitsu.co.jp [10.0.50.95]) by m5.gw.fujitsu.co.jp (Postfix) with ESMTP id 076FE45DE4E for ; Tue, 1 Jun 2010 16:33:11 +0900 (JST) Received: from s5.gw.fujitsu.co.jp (localhost.localdomain [127.0.0.1]) by s5.gw.fujitsu.co.jp (Postfix) with ESMTP id D8EF91DB8061 for ; Tue, 1 Jun 2010 16:33:10 +0900 (JST) Received: from m108.s.css.fujitsu.com (m108.s.css.fujitsu.com [10.249.87.108]) by s5.gw.fujitsu.co.jp (Postfix) with ESMTP id 8B54B1DB8043 for ; Tue, 1 Jun 2010 16:33:10 +0900 (JST) From: KOSAKI Motohiro Subject: Re: [patch -mm 04/18] oom: extract panic helper function In-Reply-To: References: Message-Id: <20100601163252.244E.A69D9226@jp.fujitsu.com> MIME-Version: 1.0 Content-Type: text/plain; charset="US-ASCII" Content-Transfer-Encoding: 7bit Date: Tue, 1 Jun 2010 16:33:09 +0900 (JST) Sender: owner-linux-mm@kvack.org To: David Rientjes Cc: kosaki.motohiro@jp.fujitsu.com, Andrew Morton , Rik van Riel , Nick Piggin , Oleg Nesterov , KAMEZAWA Hiroyuki , Balbir Singh , linux-mm@kvack.org List-ID: > There are various points in the oom killer where the kernel must > determine whether to panic or not. It's better to extract this to a > helper function to remove all the confusion as to its semantics. > > There's no functional change with this patch. 
> > Signed-off-by: David Rientjes ack -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail203.messagelabs.com (mail203.messagelabs.com [216.82.254.243]) by kanga.kvack.org (Postfix) with SMTP id 23E346B0210 for ; Tue, 1 Jun 2010 03:34:46 -0400 (EDT) Received: from m5.gw.fujitsu.co.jp ([10.0.50.75]) by fgwmail5.fujitsu.co.jp (Fujitsu Gateway) with ESMTP id o517YhOn012660 for (envelope-from kosaki.motohiro@jp.fujitsu.com); Tue, 1 Jun 2010 16:34:43 +0900 Received: from smail (m5 [127.0.0.1]) by outgoing.m5.gw.fujitsu.co.jp (Postfix) with ESMTP id 58E0A45DE59 for ; Tue, 1 Jun 2010 16:34:43 +0900 (JST) Received: from s5.gw.fujitsu.co.jp (s5.gw.fujitsu.co.jp [10.0.50.95]) by m5.gw.fujitsu.co.jp (Postfix) with ESMTP id 14E1245DE51 for ; Tue, 1 Jun 2010 16:34:43 +0900 (JST) Received: from s5.gw.fujitsu.co.jp (localhost.localdomain [127.0.0.1]) by s5.gw.fujitsu.co.jp (Postfix) with ESMTP id A9183E08004 for ; Tue, 1 Jun 2010 16:34:42 +0900 (JST) Received: from m108.s.css.fujitsu.com (m108.s.css.fujitsu.com [10.249.87.108]) by s5.gw.fujitsu.co.jp (Postfix) with ESMTP id 43C0C1DB8038 for ; Tue, 1 Jun 2010 16:34:42 +0900 (JST) From: KOSAKI Motohiro Subject: Re: [patch -mm 05/18] oom: remove special handling for pagefault ooms In-Reply-To: References: Message-Id: <20100601163420.2451.A69D9226@jp.fujitsu.com> MIME-Version: 1.0 Content-Type: text/plain; charset="US-ASCII" Content-Transfer-Encoding: 7bit Date: Tue, 1 Jun 2010 16:34:33 +0900 (JST) Sender: owner-linux-mm@kvack.org To: David Rientjes Cc: kosaki.motohiro@jp.fujitsu.com, Andrew Morton , Rik van Riel , Nick Piggin , Oleg Nesterov , KAMEZAWA Hiroyuki , Balbir Singh , linux-mm@kvack.org List-ID: > It is possible to remove the special pagefault oom handler by simply oom > locking all system zones and then calling directly 
into out_of_memory(). > > All populated zones must have ZONE_OOM_LOCKED set, otherwise there is a > parallel oom killing in progress that will lead to eventual memory freeing > so it's not necessary to needlessly kill another task. The context in > which the pagefault is allocating memory is unknown to the oom killer, so > this is done on a system-wide level. > > If a task has already been oom killed and hasn't fully exited yet, this > will be a no-op since select_bad_process() recognizes tasks across the > system with TIF_MEMDIE set. > > Acked-by: Nick Piggin > Signed-off-by: David Rientjes ack -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail138.messagelabs.com (mail138.messagelabs.com [216.82.249.35]) by kanga.kvack.org (Postfix) with SMTP id AA69C6B0210 for ; Tue, 1 Jun 2010 03:34:59 -0400 (EDT) Received: from m3.gw.fujitsu.co.jp ([10.0.50.73]) by fgwmail7.fujitsu.co.jp (Fujitsu Gateway) with ESMTP id o517Yv0P013311 for (envelope-from kosaki.motohiro@jp.fujitsu.com); Tue, 1 Jun 2010 16:34:57 +0900 Received: from smail (m3 [127.0.0.1]) by outgoing.m3.gw.fujitsu.co.jp (Postfix) with ESMTP id 47D4945DE57 for ; Tue, 1 Jun 2010 16:34:57 +0900 (JST) Received: from s3.gw.fujitsu.co.jp (s3.gw.fujitsu.co.jp [10.0.50.93]) by m3.gw.fujitsu.co.jp (Postfix) with ESMTP id 0576245DE4F for ; Tue, 1 Jun 2010 16:34:57 +0900 (JST) Received: from s3.gw.fujitsu.co.jp (localhost.localdomain [127.0.0.1]) by s3.gw.fujitsu.co.jp (Postfix) with ESMTP id 99895E18002 for ; Tue, 1 Jun 2010 16:34:56 +0900 (JST) Received: from m105.s.css.fujitsu.com (m105.s.css.fujitsu.com [10.249.87.105]) by s3.gw.fujitsu.co.jp (Postfix) with ESMTP id 3F77E1DB8038 for ; Tue, 1 Jun 2010 16:34:56 +0900 (JST) From: KOSAKI Motohiro Subject: Re: [patch -mm 06/18] oom: move sysctl declarations to oom.h 
In-Reply-To: References: Message-Id: <20100601163441.2454.A69D9226@jp.fujitsu.com> MIME-Version: 1.0 Content-Type: text/plain; charset="US-ASCII" Content-Transfer-Encoding: 7bit Date: Tue, 1 Jun 2010 16:34:53 +0900 (JST) Sender: owner-linux-mm@kvack.org To: David Rientjes Cc: kosaki.motohiro@jp.fujitsu.com, Andrew Morton , Rik van Riel , Nick Piggin , Oleg Nesterov , KAMEZAWA Hiroyuki , Balbir Singh , linux-mm@kvack.org List-ID: > The three oom killer sysctl variables (sysctl_oom_dump_tasks, > sysctl_oom_kill_allocating_task, and sysctl_panic_on_oom) are better > declared in include/linux/oom.h rather than kernel/sysctl.c. > > Signed-off-by: David Rientjes ack -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail203.messagelabs.com (mail203.messagelabs.com [216.82.254.243]) by kanga.kvack.org (Postfix) with SMTP id 278436B0215 for ; Tue, 1 Jun 2010 03:36:06 -0400 (EDT) Received: from m4.gw.fujitsu.co.jp ([10.0.50.74]) by fgwmail6.fujitsu.co.jp (Fujitsu Gateway) with ESMTP id o517a2Fv010989 for (envelope-from kosaki.motohiro@jp.fujitsu.com); Tue, 1 Jun 2010 16:36:03 +0900 Received: from smail (m4 [127.0.0.1]) by outgoing.m4.gw.fujitsu.co.jp (Postfix) with ESMTP id 9FA6E45DE7B for ; Tue, 1 Jun 2010 16:36:02 +0900 (JST) Received: from s4.gw.fujitsu.co.jp (s4.gw.fujitsu.co.jp [10.0.50.94]) by m4.gw.fujitsu.co.jp (Postfix) with ESMTP id 7672445DE6F for ; Tue, 1 Jun 2010 16:36:02 +0900 (JST) Received: from s4.gw.fujitsu.co.jp (localhost.localdomain [127.0.0.1]) by s4.gw.fujitsu.co.jp (Postfix) with ESMTP id 59EE71DB803F for ; Tue, 1 Jun 2010 16:36:02 +0900 (JST) Received: from m108.s.css.fujitsu.com (m108.s.css.fujitsu.com [10.249.87.108]) by s4.gw.fujitsu.co.jp (Postfix) with ESMTP id 0E05B1DB803A for ; Tue, 1 Jun 2010 16:36:02 +0900 (JST) From: KOSAKI Motohiro Subject: 
Re: [patch -mm 07/18] oom: enable oom tasklist dump by default In-Reply-To: References: Message-Id: <20100601163545.2457.A69D9226@jp.fujitsu.com> MIME-Version: 1.0 Content-Type: text/plain; charset="US-ASCII" Content-Transfer-Encoding: 7bit Date: Tue, 1 Jun 2010 16:36:01 +0900 (JST) Sender: owner-linux-mm@kvack.org To: David Rientjes Cc: kosaki.motohiro@jp.fujitsu.com, Andrew Morton , Rik van Riel , Nick Piggin , Oleg Nesterov , KAMEZAWA Hiroyuki , Balbir Singh , linux-mm@kvack.org List-ID: > The oom killer tasklist dump, enabled with the oom_dump_tasks sysctl, is > very helpful information in diagnosing why a user's task has been killed. > It emits useful information such as each eligible thread's memory usage > that can determine why the system is oom, so it should be enabled by > default. > > Signed-off-by: David Rientjes ack -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail144.messagelabs.com (mail144.messagelabs.com [216.82.254.51]) by kanga.kvack.org (Postfix) with SMTP id 5FA6E6B0217 for ; Tue, 1 Jun 2010 03:36:55 -0400 (EDT) Received: from m6.gw.fujitsu.co.jp ([10.0.50.76]) by fgwmail6.fujitsu.co.jp (Fujitsu Gateway) with ESMTP id o517ar7A011354 for (envelope-from kosaki.motohiro@jp.fujitsu.com); Tue, 1 Jun 2010 16:36:53 +0900 Received: from smail (m6 [127.0.0.1]) by outgoing.m6.gw.fujitsu.co.jp (Postfix) with ESMTP id C38E945DE54 for ; Tue, 1 Jun 2010 16:36:52 +0900 (JST) Received: from s6.gw.fujitsu.co.jp (s6.gw.fujitsu.co.jp [10.0.50.96]) by m6.gw.fujitsu.co.jp (Postfix) with ESMTP id 9C53445DE53 for ; Tue, 1 Jun 2010 16:36:52 +0900 (JST) Received: from s6.gw.fujitsu.co.jp (localhost.localdomain [127.0.0.1]) by s6.gw.fujitsu.co.jp (Postfix) with ESMTP id 7BA8F1DB8016 for ; Tue, 1 Jun 2010 16:36:52 +0900 (JST) Received: from ml14.s.css.fujitsu.com 
(ml14.s.css.fujitsu.com [10.249.87.104]) by s6.gw.fujitsu.co.jp (Postfix) with ESMTP id 2104D1DB8013 for ; Tue, 1 Jun 2010 16:36:49 +0900 (JST)
From: KOSAKI Motohiro
Subject: Re: [patch -mm 08/18] oom: badness heuristic rewrite
In-Reply-To:
References:
Message-Id: <20100601163627.245D.A69D9226@jp.fujitsu.com>
MIME-Version: 1.0
Content-Type: text/plain; charset="US-ASCII"
Content-Transfer-Encoding: 7bit
Date: Tue, 1 Jun 2010 16:36:48 +0900 (JST)
Sender: owner-linux-mm@kvack.org
To: David Rientjes
Cc: kosaki.motohiro@jp.fujitsu.com, Andrew Morton , Rik van Riel , Nick Piggin , Oleg Nesterov , KAMEZAWA Hiroyuki , Balbir Singh , linux-mm@kvack.org
List-ID:

> This is a complete rewrite of the oom killer's badness() heuristic which is
> used to determine which task to kill in oom conditions. The goal is to
> make it as simple and predictable as possible so the results are better
> understood and we end up killing the task which will lead to the most
> memory freeing while still respecting the fine-tuning from userspace.
>
> The baseline for the heuristic is a proportion of memory that each task is
> currently using in memory plus swap compared to the amount of "allowable"
> memory. "Allowable," in this sense, means the system-wide resources for
> unconstrained oom conditions, the set of mempolicy nodes, the mems
> attached to current's cpuset, or a memory controller's limit. The
> proportion is given on a scale of 0 (never kill) to 1000 (always kill),
> roughly meaning that if a task has a badness() score of 500 that the task
> consumes approximately 50% of allowable memory resident in RAM or in swap
> space.
>
> The proportion is always relative to the amount of "allowable" memory and
> not the total amount of RAM systemwide so that mempolicies and cpusets may
> operate in isolation; they shall not need to know the true size of the
> machine on which they are running if they are bound to a specific set of
> nodes or mems, respectively.
>
> Root tasks are given 3% extra memory just like __vm_enough_memory()
> provides in LSMs. In the event of two tasks consuming similar amounts of
> memory, it is generally better to save root's task.
>
> Because of the change in the badness() heuristic's baseline, it is also
> necessary to introduce a new user interface to tune it. It's not possible
> to redefine the meaning of /proc/pid/oom_adj with a new scale since the
> ABI cannot be changed for backward compatibility. Instead, a new tunable,
> /proc/pid/oom_score_adj, is added that ranges from -1000 to +1000. It may
> be used to polarize the heuristic such that certain tasks are never
> considered for oom kill while others may always be considered. The value
> is added directly into the badness() score so a value of -500, for
> example, means to discount 50% of its memory consumption in comparison to
> other tasks either on the system, bound to the mempolicy, in the cpuset,
> or sharing the same memory controller.
>
> /proc/pid/oom_adj is changed so that its meaning is rescaled into the
> units used by /proc/pid/oom_score_adj, and vice versa. Changing one of
> these per-task tunables will rescale the value of the other to an
> equivalent meaning. Although /proc/pid/oom_adj was originally defined as
> a bitshift on the badness score, it now shares the same linear growth as
> /proc/pid/oom_score_adj but with different granularity. This is required
> so the ABI is not broken with userspace applications and allows oom_adj to
> be deprecated for future removal.
>
> Signed-off-by: David Rientjes

nack

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ .
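As a worked example of the heuristic quoted above, the sketch below computes a score from the quantities the description names. It is an illustration under stated assumptions rather than the patch's code: `toy_badness()` and its parameters are hypothetical, the 3% allowance for root is modeled as a flat 30-point discount, and the result is clamped to the 0..1000 scale the description gives.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical model of the badness score: the proportion of allowable
 * memory in use (rss + swap) on a 0..1000 scale, adjusted by oom_score_adj.
 */
static int toy_badness(unsigned long rss_pages, unsigned long swap_pages,
		       unsigned long allowable_pages, bool is_root,
		       int oom_score_adj)
{
	long points;

	assert(oom_score_adj >= -1000 && oom_score_adj <= 1000);

	/* Avoid dividing by zero if the allowable set is empty. */
	if (allowable_pages == 0)
		allowable_pages = 1;

	points = (long)((rss_pages + swap_pages) * 1000 / allowable_pages);

	/* Assumption: "3% extra memory" for root is a 30-point discount. */
	if (is_root)
		points -= 30;

	/* The tunable is added directly to the score. */
	points += oom_score_adj;

	/* Assumption: clamp to the documented 0..1000 range. */
	if (points < 0)
		return 0;
	return points > 1000 ? 1000 : (int)points;
}
```

With 1000 allowable pages, a task holding 400 resident pages and 100 swap entries scores 500 in this model. Setting oom_score_adj to -500 brings that to 0, and the same task owned by root scores 470.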
Don't email: email@kvack.org
From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail172.messagelabs.com (mail172.messagelabs.com [216.82.254.3]) by kanga.kvack.org (Postfix) with SMTP id C6B526B0219 for ; Tue, 1 Jun 2010 03:37:27 -0400 (EDT)
Received: from m6.gw.fujitsu.co.jp ([10.0.50.76]) by fgwmail6.fujitsu.co.jp (Fujitsu Gateway) with ESMTP id o517bNtH011734 for (envelope-from kosaki.motohiro@jp.fujitsu.com); Tue, 1 Jun 2010 16:37:23 +0900
Received: from smail (m6 [127.0.0.1]) by outgoing.m6.gw.fujitsu.co.jp (Postfix) with ESMTP id F413645DE54 for ; Tue, 1 Jun 2010 16:37:22 +0900 (JST)
Received: from s6.gw.fujitsu.co.jp (s6.gw.fujitsu.co.jp [10.0.50.96]) by m6.gw.fujitsu.co.jp (Postfix) with ESMTP id C28AA45DE4C for ; Tue, 1 Jun 2010 16:37:22 +0900 (JST)
Received: from s6.gw.fujitsu.co.jp (localhost.localdomain [127.0.0.1]) by s6.gw.fujitsu.co.jp (Postfix) with ESMTP id A178A1DB8016 for ; Tue, 1 Jun 2010 16:37:22 +0900 (JST)
Received: from m108.s.css.fujitsu.com (m108.s.css.fujitsu.com [10.249.87.108]) by s6.gw.fujitsu.co.jp (Postfix) with ESMTP id 5031A1DB801A for ; Tue, 1 Jun 2010 16:37:22 +0900 (JST)
From: KOSAKI Motohiro
Subject: Re: [patch -mm 09/18] oom: add forkbomb penalty to badness heuristic
In-Reply-To:
References:
Message-Id: <20100601163705.2460.A69D9226@jp.fujitsu.com>
MIME-Version: 1.0
Content-Type: text/plain; charset="US-ASCII"
Content-Transfer-Encoding: 7bit
Date: Tue, 1 Jun 2010 16:37:21 +0900 (JST)
Sender: owner-linux-mm@kvack.org
To: David Rientjes
Cc: kosaki.motohiro@jp.fujitsu.com, Andrew Morton , Rik van Riel , Nick Piggin , Oleg Nesterov , KAMEZAWA Hiroyuki , Balbir Singh , linux-mm@kvack.org
List-ID:

> Add a forkbomb penalty for processes that fork an excessively large
> number of children to penalize that group of tasks and not others. A
> threshold is configurable from userspace to determine how many first-
> generation execve children (those with their own address spaces) a task
> may have before it is considered a forkbomb. This can be tuned by
> altering the value in /proc/sys/vm/oom_forkbomb_thres, which defaults to
> 1000.
>
> When a task has more than 1000 first-generation children with different
> address spaces than itself, a penalty of
>
>	(average rss of children) * (# of 1st generation execve children)
>	-----------------------------------------------------------------
>			oom_forkbomb_thres
>
> is assessed. So, for example, using the default oom_forkbomb_thres of
> 1000, the penalty is twice the average rss of all its execve children if
> there are 2000 such tasks. A task is considered to count toward the
> threshold if its total runtime is less than one second; for 1000 of such
> tasks to exist, the parent process must be forking at an extremely high
> rate either erroneously or maliciously.
>
> Even though a particular task may be designated a forkbomb and selected as
> the victim, the oom killer will still kill the 1st generation execve child
> with the highest badness() score in its place. This avoids killing
> important servers or system daemons. When a web server forks a very large
> number of threads for client connections, for example, it is much better
> to kill one of those threads than to kill the server and make it
> unresponsive.
>
> [oleg@redhat.com: optimize task_lock when iterating children]
> Signed-off-by: David Rientjes

nack

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ .
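The penalty formula quoted above is easy to check by hand. The following sketch evaluates it directly; `toy_forkbomb_penalty()` is a hypothetical helper written for this example, it takes the average child rss and the child count as inputs, and it leaves out the one-second runtime test that decides which children count toward the threshold.

```c
#include <assert.h>

/*
 * Hypothetical helper: penalty, in the same units as avg_child_rss, for a
 * task with n_children first-generation execve children.
 */
static unsigned long toy_forkbomb_penalty(unsigned long avg_child_rss,
					  unsigned long n_children,
					  unsigned long thres)
{
	assert(thres > 0);

	/* No penalty unless the task has more children than the threshold. */
	if (n_children <= thres)
		return 0;

	return avg_child_rss * n_children / thres;
}
```

Using the default threshold of 1000, 2000 children averaging 100 pages of rss give a penalty of 200 pages, which is twice the average rss as the description states. Exactly 1000 children give no penalty, since the rule applies only above the threshold.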
Don't email: email@kvack.org
From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail137.messagelabs.com (mail137.messagelabs.com [216.82.249.19]) by kanga.kvack.org (Postfix) with SMTP id 396CB6B021B for ; Tue, 1 Jun 2010 03:37:57 -0400 (EDT)
Received: from m6.gw.fujitsu.co.jp ([10.0.50.76]) by fgwmail5.fujitsu.co.jp (Fujitsu Gateway) with ESMTP id o517btJB014113 for (envelope-from kosaki.motohiro@jp.fujitsu.com); Tue, 1 Jun 2010 16:37:55 +0900
Received: from smail (m6 [127.0.0.1]) by outgoing.m6.gw.fujitsu.co.jp (Postfix) with ESMTP id E53E545DE50 for ; Tue, 1 Jun 2010 16:37:54 +0900 (JST)
Received: from s6.gw.fujitsu.co.jp (s6.gw.fujitsu.co.jp [10.0.50.96]) by m6.gw.fujitsu.co.jp (Postfix) with ESMTP id C939645DE4E for ; Tue, 1 Jun 2010 16:37:54 +0900 (JST)
Received: from s6.gw.fujitsu.co.jp (localhost.localdomain [127.0.0.1]) by s6.gw.fujitsu.co.jp (Postfix) with ESMTP id B135FE08005 for ; Tue, 1 Jun 2010 16:37:54 +0900 (JST)
Received: from m108.s.css.fujitsu.com (m108.s.css.fujitsu.com [10.249.87.108]) by s6.gw.fujitsu.co.jp (Postfix) with ESMTP id 6CE4BE08001 for ; Tue, 1 Jun 2010 16:37:54 +0900 (JST)
From: KOSAKI Motohiro
Subject: Re: [patch -mm 10/18] oom: deprecate oom_adj tunable
In-Reply-To:
References:
Message-Id: <20100601163739.2463.A69D9226@jp.fujitsu.com>
MIME-Version: 1.0
Content-Type: text/plain; charset="US-ASCII"
Content-Transfer-Encoding: 7bit
Date: Tue, 1 Jun 2010 16:37:53 +0900 (JST)
Sender: owner-linux-mm@kvack.org
To: David Rientjes
Cc: kosaki.motohiro@jp.fujitsu.com, Andrew Morton , Rik van Riel , Nick Piggin , Oleg Nesterov , KAMEZAWA Hiroyuki , Balbir Singh , linux-mm@kvack.org
List-ID:

> /proc/pid/oom_adj is now deprecated so that it may eventually be
> removed. The target date for removal is May 2012.
>
> A warning will be printed to the kernel log if a task attempts to use this
> interface. Future warnings will be suppressed until the kernel is rebooted
> to prevent spamming the kernel log.
> > Signed-off-by: David Rientjes > --- > Documentation/feature-removal-schedule.txt | 25 +++++++++++++++++++++++++ > Documentation/filesystems/proc.txt | 3 +++ > fs/proc/base.c | 8 ++++++++ > include/linux/oom.h | 3 +++ > 4 files changed, 39 insertions(+), 0 deletions(-) nack -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail203.messagelabs.com (mail203.messagelabs.com [216.82.254.243]) by kanga.kvack.org (Postfix) with SMTP id 354056B021D for ; Tue, 1 Jun 2010 03:38:38 -0400 (EDT) Received: from m2.gw.fujitsu.co.jp ([10.0.50.72]) by fgwmail5.fujitsu.co.jp (Fujitsu Gateway) with ESMTP id o517cZos014557 for (envelope-from kosaki.motohiro@jp.fujitsu.com); Tue, 1 Jun 2010 16:38:36 +0900 Received: from smail (m2 [127.0.0.1]) by outgoing.m2.gw.fujitsu.co.jp (Postfix) with ESMTP id AC09A45DE61 for ; Tue, 1 Jun 2010 16:38:35 +0900 (JST) Received: from s2.gw.fujitsu.co.jp (s2.gw.fujitsu.co.jp [10.0.50.92]) by m2.gw.fujitsu.co.jp (Postfix) with ESMTP id 7817545DE51 for ; Tue, 1 Jun 2010 16:38:35 +0900 (JST) Received: from s2.gw.fujitsu.co.jp (localhost.localdomain [127.0.0.1]) by s2.gw.fujitsu.co.jp (Postfix) with ESMTP id 5D0681DB803F for ; Tue, 1 Jun 2010 16:38:35 +0900 (JST) Received: from ml13.s.css.fujitsu.com (ml13.s.css.fujitsu.com [10.249.87.103]) by s2.gw.fujitsu.co.jp (Postfix) with ESMTP id 136041DB803A for ; Tue, 1 Jun 2010 16:38:35 +0900 (JST) From: KOSAKI Motohiro Subject: Re: [patch -mm 11/18] oom: avoid oom killer for lowmem allocations In-Reply-To: References: Message-Id: <20100601163813.2466.A69D9226@jp.fujitsu.com> MIME-Version: 1.0 Content-Type: text/plain; charset="US-ASCII" Content-Transfer-Encoding: 7bit Date: Tue, 1 Jun 2010 16:38:34 +0900 (JST) Sender: owner-linux-mm@kvack.org To: David Rientjes Cc: kosaki.motohiro@jp.fujitsu.com, Andrew Morton , 
Rik van Riel , Nick Piggin , Oleg Nesterov , KAMEZAWA Hiroyuki , Balbir Singh , linux-mm@kvack.org List-ID: > If memory has been depleted in lowmem zones even with the protection > afforded to it by /proc/sys/vm/lowmem_reserve_ratio, it is unlikely that > killing current users will help. The memory is either reclaimable (or > migratable) already, in which case we should not invoke the oom killer at > all, or it is pinned by an application for I/O. Killing such an > application may leave the hardware in an unspecified state and there is no > guarantee that it will be able to make a timely exit. > > Lowmem allocations are now failed in oom conditions when __GFP_NOFAIL is > not used so that the task can perhaps recover or try again later. > > Previously, the heuristic provided some protection for those tasks with > CAP_SYS_RAWIO, but this is no longer necessary since we will not be > killing tasks for the purposes of ISA allocations. > > high_zoneidx is gfp_zone(gfp_flags), meaning that ZONE_NORMAL will be the > default for all allocations that are not __GFP_DMA, __GFP_DMA32, > __GFP_HIGHMEM, and __GFP_MOVABLE on kernels configured to support those > flags. Testing for high_zoneidx being less than ZONE_NORMAL will only > return true for allocations that have either __GFP_DMA or __GFP_DMA32. > > Signed-off-by: David Rientjes ack -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
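The lowmem rule in the quoted description of patch 11/18 can be modeled in userspace. This is a minimal sketch, not the kernel code: the Zone numbering, the simplified gfp_zone() stand-in, and the helper name may_invoke_oom_killer() are assumptions made for illustration, and the __GFP_NOFAIL branch reflects one reading of "failed in oom conditions when __GFP_NOFAIL is not used".

```python
from enum import IntEnum


class Zone(IntEnum):
    # Assumed ordering for a kernel built with these zones enabled;
    # real zone numbering depends on the kernel configuration.
    DMA = 0
    DMA32 = 1
    NORMAL = 2
    HIGHMEM = 3


def gfp_zone(dma=False, dma32=False, highmem=False):
    """Simplified, hypothetical stand-in for gfp_zone(gfp_flags)."""
    if dma:
        return Zone.DMA
    if dma32:
        return Zone.DMA32
    if highmem:
        return Zone.HIGHMEM
    # No zone modifier: ZONE_NORMAL is the default.
    return Zone.NORMAL


def may_invoke_oom_killer(high_zoneidx, nofail=False):
    """Return False when a lowmem allocation should fail instead of
    triggering an oom kill (assumed reading of the patch description)."""
    if not nofail and high_zoneidx < Zone.NORMAL:
        return False
    return True
```

Only the DMA and DMA32 cases fall below Zone.NORMAL, which matches the last paragraph of the description.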
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail144.messagelabs.com (mail144.messagelabs.com [216.82.254.51]) by kanga.kvack.org (Postfix) with SMTP id E184B6B021E for ; Tue, 1 Jun 2010 03:39:08 -0400 (EDT) Received: from m1.gw.fujitsu.co.jp ([10.0.50.71]) by fgwmail5.fujitsu.co.jp (Fujitsu Gateway) with ESMTP id o517d3ZY014815 for (envelope-from kosaki.motohiro@jp.fujitsu.com); Tue, 1 Jun 2010 16:39:03 +0900 Received: from smail (m1 [127.0.0.1]) by outgoing.m1.gw.fujitsu.co.jp (Postfix) with ESMTP id 2C21045DE53 for ; Tue, 1 Jun 2010 16:39:03 +0900 (JST) Received: from s1.gw.fujitsu.co.jp (s1.gw.fujitsu.co.jp [10.0.50.91]) by m1.gw.fujitsu.co.jp (Postfix) with ESMTP id 0243245DE4F for ; Tue, 1 Jun 2010 16:39:03 +0900 (JST) Received: from s1.gw.fujitsu.co.jp (localhost.localdomain [127.0.0.1]) by s1.gw.fujitsu.co.jp (Postfix) with ESMTP id D1EA3E18005 for ; Tue, 1 Jun 2010 16:39:02 +0900 (JST) Received: from m106.s.css.fujitsu.com (m106.s.css.fujitsu.com [10.249.87.106]) by s1.gw.fujitsu.co.jp (Postfix) with ESMTP id 8E932E18003 for ; Tue, 1 Jun 2010 16:39:02 +0900 (JST) From: KOSAKI Motohiro Subject: Re: [patch -mm 02/18] oom: sacrifice child with highest badness score for parent In-Reply-To: References: Message-Id: <20100601163842.2469.A69D9226@jp.fujitsu.com> MIME-Version: 1.0 Content-Type: text/plain; charset="US-ASCII" Content-Transfer-Encoding: 7bit Date: Tue, 1 Jun 2010 16:39:01 +0900 (JST) Sender: owner-linux-mm@kvack.org To: David Rientjes Cc: kosaki.motohiro@jp.fujitsu.com, Andrew Morton , Rik van Riel , Nick Piggin , Oleg Nesterov , KAMEZAWA Hiroyuki , Balbir Singh , linux-mm@kvack.org List-ID: > When a task is chosen for oom kill, the oom killer first attempts to > sacrifice a child not sharing its parent's memory instead. Unfortunately, > this often kills in a seemingly random fashion based on the ordering of > the selected task's child list. 
Additionally, it is not guaranteed at all > to free a large amount of memory that we need to prevent additional oom > killing in the very near future. > > Instead, we now only attempt to sacrifice the worst child not sharing its > parent's memory, if one exists. The worst child is indicated with the > highest badness() score. This serves two advantages: we kill a > memory-hogging task more often, and we allow the configurable > /proc/pid/oom_adj value to be considered as a factor in which child to > kill. > > Reviewers may observe that the previous implementation would iterate > through the children and attempt to kill each until one was successful and > then the parent if none were found while the new code simply kills the > most memory-hogging task or the parent. Note that the only time > oom_kill_task() fails, however, is when a child does not have an mm or has > a /proc/pid/oom_adj of OOM_DISABLE. badness() returns 0 for both cases, > so the final oom_kill_task() will always succeed. > > Acked-by: Rik van Riel > Acked-by: Nick Piggin > Acked-by: Balbir Singh > Reviewed-by: KAMEZAWA Hiroyuki > Reviewed-by: Minchan Kim > Reviewed-by: KOSAKI Motohiro > Signed-off-by: David Rientjes ack -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
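The child selection described in patch 02/18 can be sketched as follows. This is a userspace model under stated assumptions: the Task type, its fields, and pick_victim() are hypothetical names, and badness scores are taken as precomputed integers where 0 stands for a child with no mm or with oom_adj set to OOM_DISABLE.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    mm: int        # hypothetical address-space identifier
    badness: int   # precomputed badness() score; 0 means "not killable"


def pick_victim(parent, children):
    """Return the child with the highest badness that does not share the
    parent's mm, or the parent itself if no such child qualifies."""
    victim, best = parent, 0
    for child in children:
        if child.mm == parent.mm:
            continue  # shares the parent's memory: killing it frees nothing extra
        if child.badness > best:
            victim, best = child, child.badness
    return victim
```

Because a child scoring 0 is never selected, the final kill cannot fail for the two reasons named in the description.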
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail137.messagelabs.com (mail137.messagelabs.com [216.82.249.19]) by kanga.kvack.org (Postfix) with SMTP id 929516B021F for ; Tue, 1 Jun 2010 03:39:29 -0400 (EDT) Received: from m2.gw.fujitsu.co.jp ([10.0.50.72]) by fgwmail7.fujitsu.co.jp (Fujitsu Gateway) with ESMTP id o517dRCY015421 for (envelope-from kosaki.motohiro@jp.fujitsu.com); Tue, 1 Jun 2010 16:39:27 +0900 Received: from smail (m2 [127.0.0.1]) by outgoing.m2.gw.fujitsu.co.jp (Postfix) with ESMTP id 3B02C45DE63 for ; Tue, 1 Jun 2010 16:39:27 +0900 (JST) Received: from s2.gw.fujitsu.co.jp (s2.gw.fujitsu.co.jp [10.0.50.92]) by m2.gw.fujitsu.co.jp (Postfix) with ESMTP id 186B145DE57 for ; Tue, 1 Jun 2010 16:39:27 +0900 (JST) Received: from s2.gw.fujitsu.co.jp (localhost.localdomain [127.0.0.1]) by s2.gw.fujitsu.co.jp (Postfix) with ESMTP id EE1C21DB803C for ; Tue, 1 Jun 2010 16:39:26 +0900 (JST) Received: from m108.s.css.fujitsu.com (m108.s.css.fujitsu.com [10.249.87.108]) by s2.gw.fujitsu.co.jp (Postfix) with ESMTP id 8DDE41DB803F for ; Tue, 1 Jun 2010 16:39:26 +0900 (JST) From: KOSAKI Motohiro Subject: Re: [patch -mm 03/18] oom: select task from tasklist for mempolicy ooms In-Reply-To: References: Message-Id: <20100601163912.246C.A69D9226@jp.fujitsu.com> MIME-Version: 1.0 Content-Type: text/plain; charset="US-ASCII" Content-Transfer-Encoding: 7bit Date: Tue, 1 Jun 2010 16:39:25 +0900 (JST) Sender: owner-linux-mm@kvack.org To: David Rientjes Cc: kosaki.motohiro@jp.fujitsu.com, Andrew Morton , Rik van Riel , Nick Piggin , Oleg Nesterov , KAMEZAWA Hiroyuki , Balbir Singh , linux-mm@kvack.org List-ID: > The oom killer presently kills current whenever there is no more memory > free or reclaimable on its mempolicy's nodes. There is no guarantee that > current is a memory-hogging task or that killing it will free any > substantial amount of memory, however. 
> > In such situations, it is better to scan the tasklist for nodes that are > allowed to allocate on current's set of nodes and kill the task with the > highest badness() score. This ensures that the most memory-hogging task, > or the one configured by the user with /proc/pid/oom_adj, is always > selected in such scenarios. > > Reviewed-by: KOSAKI Motohiro > Signed-off-by: David Rientjes ack -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail138.messagelabs.com (mail138.messagelabs.com [216.82.249.35]) by kanga.kvack.org (Postfix) with SMTP id 4418A6B0222 for ; Tue, 1 Jun 2010 03:40:17 -0400 (EDT) Received: from m5.gw.fujitsu.co.jp ([10.0.50.75]) by fgwmail6.fujitsu.co.jp (Fujitsu Gateway) with ESMTP id o517eFj8013270 for (envelope-from kosaki.motohiro@jp.fujitsu.com); Tue, 1 Jun 2010 16:40:15 +0900 Received: from smail (m5 [127.0.0.1]) by outgoing.m5.gw.fujitsu.co.jp (Postfix) with ESMTP id E2B7B45DE52 for ; Tue, 1 Jun 2010 16:40:14 +0900 (JST) Received: from s5.gw.fujitsu.co.jp (s5.gw.fujitsu.co.jp [10.0.50.95]) by m5.gw.fujitsu.co.jp (Postfix) with ESMTP id B5F3A45DE51 for ; Tue, 1 Jun 2010 16:40:14 +0900 (JST) Received: from s5.gw.fujitsu.co.jp (localhost.localdomain [127.0.0.1]) by s5.gw.fujitsu.co.jp (Postfix) with ESMTP id 9A6171DB8038 for ; Tue, 1 Jun 2010 16:40:14 +0900 (JST) Received: from m105.s.css.fujitsu.com (m105.s.css.fujitsu.com [10.249.87.105]) by s5.gw.fujitsu.co.jp (Postfix) with ESMTP id EDB7EE0801F for ; Tue, 1 Jun 2010 16:40:10 +0900 (JST) From: KOSAKI Motohiro Subject: Re: [patch -mm 12/18] oom: remove unnecessary code and cleanup In-Reply-To: References: Message-Id: <20100601163954.246F.A69D9226@jp.fujitsu.com> MIME-Version: 1.0 Content-Type: text/plain; charset="US-ASCII" Content-Transfer-Encoding: 7bit Date: Tue, 1 Jun 2010 16:40:10 
+0900 (JST) Sender: owner-linux-mm@kvack.org To: David Rientjes Cc: kosaki.motohiro@jp.fujitsu.com, Andrew Morton , Rik van Riel , Nick Piggin , Oleg Nesterov , KAMEZAWA Hiroyuki , Balbir Singh , linux-mm@kvack.org List-ID: > Remove the redundancy in __oom_kill_task() since: > > - init can never be passed to this function: it will never be PF_EXITING > or selectable from select_bad_process(), and > > - it will never be passed a task from oom_kill_task() without an ->mm > and we're unconcerned about detachment from exiting tasks, there's no > reason to protect them against SIGKILL or access to memory reserves. > > Also moves the kernel log message to a higher level since the verbosity is > not always emitted here; we need not print an error message if an exiting > task is given a longer timeslice. > > Reviewed-by: KAMEZAWA Hiroyuki > Reviewed-by: KOSAKI Motohiro > Signed-off-by: David Rientjes need respin. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail172.messagelabs.com (mail172.messagelabs.com [216.82.254.3]) by kanga.kvack.org (Postfix) with SMTP id 6C0FE6B0224 for ; Tue, 1 Jun 2010 03:40:48 -0400 (EDT) Received: from m2.gw.fujitsu.co.jp ([10.0.50.72]) by fgwmail5.fujitsu.co.jp (Fujitsu Gateway) with ESMTP id o517ejpT016022 for (envelope-from kosaki.motohiro@jp.fujitsu.com); Tue, 1 Jun 2010 16:40:45 +0900 Received: from smail (m2 [127.0.0.1]) by outgoing.m2.gw.fujitsu.co.jp (Postfix) with ESMTP id 063C045DE51 for ; Tue, 1 Jun 2010 16:40:45 +0900 (JST) Received: from s2.gw.fujitsu.co.jp (s2.gw.fujitsu.co.jp [10.0.50.92]) by m2.gw.fujitsu.co.jp (Postfix) with ESMTP id D966E45DE4F for ; Tue, 1 Jun 2010 16:40:44 +0900 (JST) Received: from s2.gw.fujitsu.co.jp (localhost.localdomain [127.0.0.1]) by s2.gw.fujitsu.co.jp (Postfix) with ESMTP id AC0BE1DB803E for ; Tue, 1 Jun 2010 16:40:44 +0900 (JST) Received: from ml13.s.css.fujitsu.com (ml13.s.css.fujitsu.com [10.249.87.103]) by s2.gw.fujitsu.co.jp (Postfix) with ESMTP id 0A05D1DB803F for ; Tue, 1 Jun 2010 16:40:44 +0900 (JST) From: KOSAKI Motohiro Subject: Re: [patch -mm 13/18] oom: avoid race for oom killed tasks detaching mm prior to exit In-Reply-To: References: Message-Id: <20100601164026.2472.A69D9226@jp.fujitsu.com> MIME-Version: 1.0 Content-Type: text/plain; charset="US-ASCII" Content-Transfer-Encoding: 7bit Date: Tue, 1 Jun 2010 16:40:43 +0900 (JST) Sender: owner-linux-mm@kvack.org To: David Rientjes Cc: kosaki.motohiro@jp.fujitsu.com, Andrew Morton , Rik van Riel , Nick Piggin , Oleg Nesterov , KAMEZAWA Hiroyuki , Balbir Singh , linux-mm@kvack.org List-ID: > Tasks detach its ->mm prior to exiting so it's possible that in progress > oom kills or already exiting tasks may be missed during the oom killer's > tasklist scan. 
When an eligible task is found with either TIF_MEMDIE or > PF_EXITING set, the oom killer is supposed to be a no-op to avoid > needlessly killing additional tasks. This closes the race between a task > detaching its ->mm and being removed from the tasklist. > > Out of memory conditions as the result of memory controllers will > automatically filter tasks that have detached their ->mm (since > task_in_mem_cgroup() will return 0). This is acceptable, however, since > memcg constrained ooms aren't the result of a lack of memory resources but > rather a limit imposed by userspace that requires a task be killed > regardless. > > [oleg@redhat.com: fix PF_EXITING check for !p->mm tasks] > Acked-by: Nick Piggin > Signed-off-by: David Rientjes need respin. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail203.messagelabs.com (mail203.messagelabs.com [216.82.254.243]) by kanga.kvack.org (Postfix) with SMTP id 7398B6B0225 for ; Tue, 1 Jun 2010 03:41:07 -0400 (EDT) Received: from m2.gw.fujitsu.co.jp ([10.0.50.72]) by fgwmail7.fujitsu.co.jp (Fujitsu Gateway) with ESMTP id o517f5Xv016015 for (envelope-from kosaki.motohiro@jp.fujitsu.com); Tue, 1 Jun 2010 16:41:05 +0900 Received: from smail (m2 [127.0.0.1]) by outgoing.m2.gw.fujitsu.co.jp (Postfix) with ESMTP id B3EDA45DE57 for ; Tue, 1 Jun 2010 16:41:04 +0900 (JST) Received: from s2.gw.fujitsu.co.jp (s2.gw.fujitsu.co.jp [10.0.50.92]) by m2.gw.fujitsu.co.jp (Postfix) with ESMTP id 8050445DE4F for ; Tue, 1 Jun 2010 16:41:04 +0900 (JST) Received: from s2.gw.fujitsu.co.jp (localhost.localdomain [127.0.0.1]) by s2.gw.fujitsu.co.jp (Postfix) with ESMTP id 65EC31DB803E for ; Tue, 1 Jun 2010 16:41:04 +0900 (JST) Received: from m106.s.css.fujitsu.com (m106.s.css.fujitsu.com [10.249.87.106]) by s2.gw.fujitsu.co.jp (Postfix) with ESMTP 
id 1600B1DB8038 for ; Tue, 1 Jun 2010 16:41:04 +0900 (JST) From: KOSAKI Motohiro Subject: Re: [patch -mm 14/18] oom: check PF_KTHREAD instead of !mm to skip kthreads In-Reply-To: References: Message-Id: <20100601164047.2475.A69D9226@jp.fujitsu.com> MIME-Version: 1.0 Content-Type: text/plain; charset="US-ASCII" Content-Transfer-Encoding: 7bit Date: Tue, 1 Jun 2010 16:41:03 +0900 (JST) Sender: owner-linux-mm@kvack.org To: David Rientjes Cc: kosaki.motohiro@jp.fujitsu.com, Andrew Morton , Rik van Riel , Nick Piggin , Oleg Nesterov , KAMEZAWA Hiroyuki , Balbir Singh , linux-mm@kvack.org List-ID: > From: Oleg Nesterov > > select_bad_process() thinks a kernel thread can't have ->mm != NULL, this > is not true due to use_mm(). > > Change the code to check PF_KTHREAD. > > Signed-off-by: Oleg Nesterov > Signed-off-by: David Rientjes > --- > mm/oom_kill.c | 4 ++-- > 1 files changed, 2 insertions(+), 2 deletions(-) need respin. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail144.messagelabs.com (mail144.messagelabs.com [216.82.254.51]) by kanga.kvack.org (Postfix) with SMTP id 2C6BF6B0215 for ; Tue, 1 Jun 2010 03:41:37 -0400 (EDT) Received: from m4.gw.fujitsu.co.jp ([10.0.50.74]) by fgwmail7.fujitsu.co.jp (Fujitsu Gateway) with ESMTP id o517fYLD016328 for (envelope-from kosaki.motohiro@jp.fujitsu.com); Tue, 1 Jun 2010 16:41:35 +0900 Received: from smail (m4 [127.0.0.1]) by outgoing.m4.gw.fujitsu.co.jp (Postfix) with ESMTP id C278E45DE6E for ; Tue, 1 Jun 2010 16:41:34 +0900 (JST) Received: from s4.gw.fujitsu.co.jp (s4.gw.fujitsu.co.jp [10.0.50.94]) by m4.gw.fujitsu.co.jp (Postfix) with ESMTP id 8F81045DE60 for ; Tue, 1 Jun 2010 16:41:34 +0900 (JST) Received: from s4.gw.fujitsu.co.jp (localhost.localdomain [127.0.0.1]) by s4.gw.fujitsu.co.jp (Postfix) with ESMTP id 72D831DB8037 for ; Tue, 1 Jun 2010 16:41:34 +0900 (JST) Received: from m108.s.css.fujitsu.com (m108.s.css.fujitsu.com [10.249.87.108]) by s4.gw.fujitsu.co.jp (Postfix) with ESMTP id 18AE91DB8042 for ; Tue, 1 Jun 2010 16:41:31 +0900 (JST) From: KOSAKI Motohiro Subject: Re: [patch -mm 15/18] oom: introduce find_lock_task_mm() to fix !mm false positives In-Reply-To: References: Message-Id: <20100601164114.2478.A69D9226@jp.fujitsu.com> MIME-Version: 1.0 Content-Type: text/plain; charset="US-ASCII" Content-Transfer-Encoding: 7bit Date: Tue, 1 Jun 2010 16:41:30 +0900 (JST) Sender: owner-linux-mm@kvack.org To: David Rientjes Cc: kosaki.motohiro@jp.fujitsu.com, Andrew Morton , Rik van Riel , Nick Piggin , Oleg Nesterov , KAMEZAWA Hiroyuki , Balbir Singh , linux-mm@kvack.org List-ID: > From: Oleg Nesterov > > Almost all ->mm == NUL checks in oom_kill.c are wrong. > > The current code assumes that the task without ->mm has already > released its memory and ignores the process. 
However this is not > necessarily true when this process is multithreaded, other live > sub-threads can use this ->mm. > > - Remove the "if (!p->mm)" check in select_bad_process(), it is > just wrong. > > - Add the new helper, find_lock_task_mm(), which finds the live > thread which uses the memory and takes task_lock() to pin ->mm > > - change oom_badness() to use this helper instead of just checking > ->mm != NULL. > > - As David pointed out, select_bad_process() must never choose the > task without ->mm, but no matter what oom_badness() returns the > task can be chosen if nothing else has been found yet. > > Change oom_badness() to return int, change it to return -1 if > find_lock_task_mm() fails, and change select_bad_process() to > check points >= 0. > > Note! This patch is not enough, we need more changes. > > - oom_badness() was fixed, but oom_kill_task() still ignores > the task without ->mm > > - oom_forkbomb_penalty() should use find_lock_task_mm() too, > and it also needs other changes to actually find the first > first-descendant children > > This will be addressed later. > > Signed-off-by: Oleg Nesterov > Signed-off-by: David Rientjes need respin. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
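The interaction between find_lock_task_mm(), oom_badness() and select_bad_process() described in patch 15/18 can be modeled roughly as below. This is a sketch under assumptions: processes are plain dicts of threads, locking (task_lock()) is omitted entirely, and the score is just a placeholder rss value rather than the real heuristic.

```python
def find_task_mm(threads):
    """Model of find_lock_task_mm(): first thread that still has an mm."""
    for thread in threads:
        if thread["mm"] is not None:
            return thread
    return None


def oom_badness(threads):
    """Return -1 when no thread in the group has an mm, else a score.
    The score here is a placeholder (rss), not the kernel's formula."""
    thread = find_task_mm(threads)
    if thread is None:
        return -1
    return thread["mm"]["rss"]


def select_bad_process(processes):
    """Pick the process name with the highest non-negative score."""
    chosen, chosen_points = None, -1
    for name, threads in processes.items():
        points = oom_badness(threads)
        if points < 0:
            continue  # no mm anywhere in the thread group: never choose it
        if points > chosen_points:
            chosen, chosen_points = name, points
    return chosen
```

A process whose group leader has dropped its mm is still scored through a live sub-thread, while a process with no mm at all is skipped even when nothing else has been found.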
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail138.messagelabs.com (mail138.messagelabs.com [216.82.249.35]) by kanga.kvack.org (Postfix) with SMTP id 908CA6B021D for ; Tue, 1 Jun 2010 03:44:36 -0400 (EDT) Received: from m5.gw.fujitsu.co.jp ([10.0.50.75]) by fgwmail5.fujitsu.co.jp (Fujitsu Gateway) with ESMTP id o517iT7R017817 for (envelope-from kosaki.motohiro@jp.fujitsu.com); Tue, 1 Jun 2010 16:44:29 +0900 Received: from smail (m5 [127.0.0.1]) by outgoing.m5.gw.fujitsu.co.jp (Postfix) with ESMTP id 1D57D45DE52 for ; Tue, 1 Jun 2010 16:44:29 +0900 (JST) Received: from s5.gw.fujitsu.co.jp (s5.gw.fujitsu.co.jp [10.0.50.95]) by m5.gw.fujitsu.co.jp (Postfix) with ESMTP id F123245DE51 for ; Tue, 1 Jun 2010 16:44:28 +0900 (JST) Received: from s5.gw.fujitsu.co.jp (localhost.localdomain [127.0.0.1]) by s5.gw.fujitsu.co.jp (Postfix) with ESMTP id D81901DB803F for ; Tue, 1 Jun 2010 16:44:28 +0900 (JST) Received: from m106.s.css.fujitsu.com (m106.s.css.fujitsu.com [10.249.87.106]) by s5.gw.fujitsu.co.jp (Postfix) with ESMTP id 836211DB803C for ; Tue, 1 Jun 2010 16:44:25 +0900 (JST) From: KOSAKI Motohiro Subject: Re: [patch -mm 16/18] oom: give current access to memory reserves if it has been killed In-Reply-To: References: Message-Id: <20100601164147.247B.A69D9226@jp.fujitsu.com> MIME-Version: 1.0 Content-Type: text/plain; charset="US-ASCII" Content-Transfer-Encoding: 7bit Date: Tue, 1 Jun 2010 16:44:24 +0900 (JST) Sender: owner-linux-mm@kvack.org To: David Rientjes Cc: kosaki.motohiro@jp.fujitsu.com, Andrew Morton , Rik van Riel , Nick Piggin , Oleg Nesterov , KAMEZAWA Hiroyuki , Balbir Singh , linux-mm@kvack.org List-ID: > It's possible to livelock the page allocator if a thread has mm->mmap_sem > and fails to make forward progress because the oom killer selects another > thread sharing the same ->mm to kill that cannot exit until the semaphore > is dropped. 
>
> The oom killer will not kill multiple tasks at the same time; each oom
> killed task must exit before another task may be killed. Thus, if one
> thread is holding mm->mmap_sem and cannot allocate memory, all threads
> sharing the same ->mm are blocked from exiting as well. In the oom kill
> case, that means the thread holding mm->mmap_sem will never free
> additional memory since it cannot get access to memory reserves and the
> thread that depends on it with access to memory reserves cannot exit
> because it cannot acquire the semaphore. Thus, the page allocator
> livelocks.
>
> When the oom killer is called and current happens to have a pending
> SIGKILL, this patch automatically gives it access to memory reserves and
> returns. Upon returning to the page allocator, its allocation will
> hopefully succeed so it can quickly exit and free its memory. If not, the
> page allocator will fail the allocation if it is not __GFP_NOFAIL.
>
> Reviewed-by: KAMEZAWA Hiroyuki
> Signed-off-by: David Rientjes

ack.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail190.messagelabs.com (mail190.messagelabs.com [216.82.249.51]) by kanga.kvack.org (Postfix) with SMTP id 0EBEE6B0224 for ; Tue, 1 Jun 2010 03:46:27 -0400 (EDT) Date: Tue, 1 Jun 2010 17:46:20 +1000 From: Nick Piggin Subject: Re: [patch -mm 08/18] oom: badness heuristic rewrite Message-ID: <20100601074620.GR9453@laptop> References: MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: Sender: owner-linux-mm@kvack.org To: David Rientjes Cc: Andrew Morton , Rik van Riel , Oleg Nesterov , KAMEZAWA Hiroyuki , KOSAKI Motohiro , Balbir Singh , linux-mm@kvack.org List-ID: On Tue, Jun 01, 2010 at 12:18:43AM -0700, David Rientjes wrote: > This a complete rewrite of the oom killer's badness() heuristic which is > used to determine which task to kill in oom conditions. The goal is to > make it as simple and predictable as possible so the results are better > understood and we end up killing the task which will lead to the most > memory freeing while still respecting the fine-tuning from userspace. Do you have particular ways of testing this (and other heuristics changes such as the forkbomb detector)? Such that you can look at your test case or workload and see that it is really improved? -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail137.messagelabs.com (mail137.messagelabs.com [216.82.249.19]) by kanga.kvack.org (Postfix) with ESMTP id 271376B01DD for ; Tue, 1 Jun 2010 14:44:36 -0400 (EDT) Received: from hpaq12.eem.corp.google.com (hpaq12.eem.corp.google.com [172.25.149.12]) by smtp-out.google.com with ESMTP id o51IiURp023949 for ; Tue, 1 Jun 2010 11:44:30 -0700 Received: from pzk31 (pzk31.prod.google.com [10.243.19.159]) by hpaq12.eem.corp.google.com with ESMTP id o51IiMQU011036 for ; Tue, 1 Jun 2010 11:44:29 -0700 Received: by pzk31 with SMTP id 31so2601624pzk.16 for ; Tue, 01 Jun 2010 11:44:29 -0700 (PDT) Date: Tue, 1 Jun 2010 11:44:23 -0700 (PDT) From: David Rientjes Subject: Re: [patch -mm 08/18] oom: badness heuristic rewrite In-Reply-To: <20100601163627.245D.A69D9226@jp.fujitsu.com> Message-ID: References: <20100601163627.245D.A69D9226@jp.fujitsu.com> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-linux-mm@kvack.org To: KOSAKI Motohiro Cc: Andrew Morton , Rik van Riel , Nick Piggin , Oleg Nesterov , KAMEZAWA Hiroyuki , Balbir Singh , linux-mm@kvack.org List-ID: On Tue, 1 Jun 2010, KOSAKI Motohiro wrote: > > This a complete rewrite of the oom killer's badness() heuristic which is > > used to determine which task to kill in oom conditions. The goal is to > > make it as simple and predictable as possible so the results are better > > understood and we end up killing the task which will lead to the most > > memory freeing while still respecting the fine-tuning from userspace. > > > > The baseline for the heuristic is a proportion of memory that each task is > > currently using in memory plus swap compared to the amount of "allowable" > > memory. "Allowable," in this sense, means the system-wide resources for > > unconstrained oom conditions, the set of mempolicy nodes, the mems > > attached to current's cpuset, or a memory controller's limit. 
The > > proportion is given on a scale of 0 (never kill) to 1000 (always kill), > > roughly meaning that if a task has a badness() score of 500 that the task > > consumes approximately 50% of allowable memory resident in RAM or in swap > > space. > > > > The proportion is always relative to the amount of "allowable" memory and > > not the total amount of RAM systemwide so that mempolicies and cpusets may > > operate in isolation; they shall not need to know the true size of the > > machine on which they are running if they are bound to a specific set of > > nodes or mems, respectively. > > > > Root tasks are given 3% extra memory just like __vm_enough_memory() > > provides in LSMs. In the event of two tasks consuming similar amounts of > > memory, it is generally better to save root's task. > > > > Because of the change in the badness() heuristic's baseline, it is also > > necessary to introduce a new user interface to tune it. It's not possible > > to redefine the meaning of /proc/pid/oom_adj with a new scale since the > > ABI cannot be changed for backward compatibility. Instead, a new tunable, > > /proc/pid/oom_score_adj, is added that ranges from -1000 to +1000. It may > > be used to polarize the heuristic such that certain tasks are never > > considered for oom kill while others may always be considered. The value > > is added directly into the badness() score so a value of -500, for > > example, means to discount 50% of its memory consumption in comparison to > > other tasks either on the system, bound to the mempolicy, in the cpuset, > > or sharing the same memory controller. > > > > /proc/pid/oom_adj is changed so that its meaning is rescaled into the > > units used by /proc/pid/oom_score_adj, and vice versa. Changing one of > > these per-task tunables will rescale the value of the other to an > > equivalent meaning. 
Although /proc/pid/oom_adj was originally defined as > > a bitshift on the badness score, it now shares the same linear growth as > > /proc/pid/oom_score_adj but with different granularity. This is required > > so the ABI is not broken with userspace applications and allows oom_adj to > > be deprecated for future removal. > > > > Signed-off-by: David Rientjes > > nack > Why? If it's because the patch is too big, I've explained a few times that functionally you can't break it apart into anything meaningful. I do not believe it is better to break functional changes into smaller patches that simply change function signatures to pass additional arguments that are unused in the first patch, for example. If it's because it adds /proc/pid/oom_score_adj in the same patch, that's allowed since otherwise it would be useless with the old heuristic. In other words, you cannot apply oom_score_adj's meaning to the bitshift in any sane way. I'll suggest what I have multiple times: the easiest way to review the functional change here is to merge the patch into your own tree and then review oom_badness(). I agree that the way the diff comes out it is a little difficult to read just from the patch form, so merging it and reviewing the actual heuristic function is the easiest way. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
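The scoring described in the quoted patch 08/18 description can be sketched in userspace. This is a rough model under assumptions, not the kernel's oom_badness(): the function name and parameters are hypothetical, the 3% root bonus is read as 30 points on the 0 to 1000 scale, and the final clamp to [0, 1000] is an assumption about how out-of-range values are handled.

```python
def badness_score(rss_pages, swap_pages, allowable_pages,
                  oom_score_adj=0, is_root=False):
    """Proportion of allowable memory in use, on a 0..1000 scale,
    adjusted by oom_score_adj (range -1000..+1000)."""
    if allowable_pages <= 0:
        raise ValueError("allowable_pages must be positive")
    if not -1000 <= oom_score_adj <= 1000:
        raise ValueError("oom_score_adj must be within -1000..1000")

    # Baseline: resident memory plus swap, relative to "allowable" memory.
    points = (rss_pages + swap_pages) * 1000 // allowable_pages

    # Assumed reading of "root tasks are given 3% extra memory".
    if is_root:
        points -= 30

    # The tunable is added directly into the score.
    points += oom_score_adj

    # Assumed clamp onto the documented scale.
    return max(0, min(1000, points))
```

With these assumptions, a task using half of allowable memory scores 500, and an oom_score_adj of -500 discounts that to 0, in line with the example in the description.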
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail137.messagelabs.com (mail137.messagelabs.com [216.82.249.19]) by kanga.kvack.org (Postfix) with ESMTP id 681046B01CC for ; Tue, 1 Jun 2010 14:57:01 -0400 (EDT) Received: from kpbe15.cbf.corp.google.com (kpbe15.cbf.corp.google.com [172.25.105.79]) by smtp-out.google.com with ESMTP id o51IuugW030657 for ; Tue, 1 Jun 2010 11:56:57 -0700 Received: from pwi8 (pwi8.prod.google.com [10.241.219.8]) by kpbe15.cbf.corp.google.com with ESMTP id o51IutcW006226 for ; Tue, 1 Jun 2010 11:56:55 -0700 Received: by pwi8 with SMTP id 8so2624941pwi.17 for ; Tue, 01 Jun 2010 11:56:55 -0700 (PDT) Date: Tue, 1 Jun 2010 11:56:48 -0700 (PDT) From: David Rientjes Subject: Re: [patch -mm 08/18] oom: badness heuristic rewrite In-Reply-To: <20100601074620.GR9453@laptop> Message-ID: References: <20100601074620.GR9453@laptop> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-linux-mm@kvack.org To: Nick Piggin Cc: Andrew Morton , Rik van Riel , Oleg Nesterov , KAMEZAWA Hiroyuki , KOSAKI Motohiro , Balbir Singh , linux-mm@kvack.org List-ID: On Tue, 1 Jun 2010, Nick Piggin wrote: > > This a complete rewrite of the oom killer's badness() heuristic which is > > used to determine which task to kill in oom conditions. The goal is to > > make it as simple and predictable as possible so the results are better > > understood and we end up killing the task which will lead to the most > > memory freeing while still respecting the fine-tuning from userspace. > > Do you have particular ways of testing this (and other heuristics > changes such as the forkbomb detector)? > Yes, the patch prior to this one in the series, "oom: enable oom tasklist dump by default", allows you to examine the oom_score_adj of all eligible tasks. 
Used in combination with /proc/pid/oom_score, which reports the result of the badness heuristic to userspace, I tested the result of the change by ensuring that it worked as intended. Since we'll now see a tasklist dump of all eligible tasks whenever someone reports an oom problem (hopefully fewer reports as a result of this rewrite than currently!), it's much easier to determine (i) why the oom killer was called, and (ii) why a particular task was chosen for kill. That's been my testing philosophy.

The forkbomb detector does add a minimal bias to tasks that have a large number of execve children, just as the current oom killer does (although the bias is much smaller with my heuristic). Rik and I had a lengthy conversation on linux-mm about that when it was first proposed. The key to that particular bias is that you must remember that even though a task is selected for oom kill, the oom killer still attempts to kill an execve child first. So the end result is that an important system daemon, such as a webserver, doesn't actually get oom killed when it's selected as a result of this, but it's more of a bias toward the children to be killed (a client) instead. We're guaranteed that a child will be killed if a task is chosen as the result of a tiebreaker because of the forkbomb detector because it surely has a child with a different mm that is eligible.

This isn't meant to enforce a kernel-wide forkbomb policy, which would obviously be better implemented elsewhere, but rather bias the children when a parent is forking an egregiously large number of tasks. "Egregious" in this case is defined as whatever the user uses for oom_forkbomb_thres, which I believe defaults to a sane value of 1000.

> Such that you can look at your test case or workload and see that
> it is really improved?
> I'm glad you asked that because some recent conversation has been slightly confusing to me about how this affects the desktop; this rewrite significantly improves the oom killer's response for desktop users. The core ideas were developed in the thread from this mailing list back in February called "Improving OOM killer" at http://marc.info/?t=126506191200004&r=4&w=2 -- users constantly report that vital system tasks such as kdeinit are killed whenever a memory hogging task is forked either intentionally or unintentionally. I argued for a while that KDE should be taking proper precautions by adjusting its own oom_adj score and that of its forked children as it's an inherited value, but I was eventually convinced that an overall improvement to the heuristic must be made to kill a task that was known to free a large amount of memory that is resident in RAM and that we have a consistent way of defining oom priorities when a task is run uncontained and when it is a member of a memcg or cpuset (or even mempolicy now), even in the case when it's contained out from under the task's knowledge. When faced with memory pressure from an out of control or memory hogging task on the desktop, the oom killer now kills it instead of a vital task such as an X server (and oracle, webserver, etc on server platforms) because of the use of the task's rss instead of total_vm statistic. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail172.messagelabs.com (mail172.messagelabs.com [216.82.254.3]) by kanga.kvack.org (Postfix) with ESMTP id C2AD86B01CD for ; Tue, 1 Jun 2010 14:57:18 -0400 (EDT)
Received: from wpaz1.hot.corp.google.com (wpaz1.hot.corp.google.com [172.24.198.65]) by smtp-out.google.com with ESMTP id o51IvGp8000452 for ; Tue, 1 Jun 2010 11:57:16 -0700
Received: from pwi6 (pwi6.prod.google.com [10.241.219.6]) by wpaz1.hot.corp.google.com with ESMTP id o51IuICr005694 for ; Tue, 1 Jun 2010 11:57:15 -0700
Received: by pwi6 with SMTP id 6so5830864pwi.0 for ; Tue, 01 Jun 2010 11:57:15 -0700 (PDT)
Date: Tue, 1 Jun 2010 11:57:10 -0700 (PDT)
From: David Rientjes
Subject: Re: [patch -mm 09/18] oom: add forkbomb penalty to badness heuristic
In-Reply-To: <20100601163705.2460.A69D9226@jp.fujitsu.com>
Message-ID:
References: <20100601163705.2460.A69D9226@jp.fujitsu.com>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-linux-mm@kvack.org
To: KOSAKI Motohiro
Cc: Andrew Morton, Rik van Riel, Nick Piggin, Oleg Nesterov, KAMEZAWA Hiroyuki, Balbir Singh, linux-mm@kvack.org
List-ID:

On Tue, 1 Jun 2010, KOSAKI Motohiro wrote:

> > Add a forkbomb penalty for processes that fork an excessively large
> > number of children to penalize that group of tasks and not others. A
> > threshold is configurable from userspace to determine how many first-
> > generation execve children (those with their own address spaces) a task
> > may have before it is considered a forkbomb. This can be tuned by
> > altering the value in /proc/sys/vm/oom_forkbomb_thres, which defaults to
> > 1000.
> >
> > When a task has more than 1000 first-generation children with different
> > address spaces than itself, a penalty of
> >
> >   (average rss of children) * (# of 1st generation execve children)
> >   -----------------------------------------------------------------
> >                          oom_forkbomb_thres
> >
> > is assessed. So, for example, using the default oom_forkbomb_thres of
> > 1000, the penalty is twice the average rss of all its execve children if
> > there are 2000 such tasks. A task is considered to count toward the
> > threshold if its total runtime is less than one second; for 1000 of such
> > tasks to exist, the parent process must be forking at an extremely high
> > rate either erroneously or maliciously.
> >
> > Even though a particular task may be designated a forkbomb and selected as
> > the victim, the oom killer will still kill the 1st generation execve child
> > with the highest badness() score in its place. This avoids killing
> > important servers or system daemons. When a web server forks a very large
> > number of threads for client connections, for example, it is much better
> > to kill one of those threads than to kill the server and make it
> > unresponsive.
> >
> > [oleg@redhat.com: optimize task_lock when iterating children]
> > Signed-off-by: David Rientjes
>
> nack
>

Why?

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ .
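The penalty arithmetic from the changelog quoted above can be checked with a short Python sketch. This is an assumption-laden illustration, not the patch's C code: the function name `forkbomb_penalty`, the list-of-rss input, and the integer division are all hypothetical choices made here. The input is assumed to contain only the first-generation execve children that count toward the threshold.

```python
def forkbomb_penalty(child_rss_pages, forkbomb_thres=1000):
    """Penalty for a parent, given the rss (in pages) of each qualifying child.

    Hypothetical helper for illustration only; it mirrors the formula in the
    quoted changelog, not the actual kernel implementation.
    """
    nr_children = len(child_rss_pages)
    # No penalty unless the task has more children than the threshold.
    if nr_children <= forkbomb_thres:
        return 0
    avg_rss = sum(child_rss_pages) // nr_children
    # (average rss of children) * (# of children) / oom_forkbomb_thres
    return avg_rss * nr_children // forkbomb_thres


# 2000 children of 50 pages each, default threshold of 1000:
# the penalty is twice the average child rss, as the changelog states.
print(forkbomb_penalty([50] * 2000))  # 100
# Exactly at the threshold there is no penalty.
print(forkbomb_penalty([50] * 1000))  # 0
```

Because the count of children is divided by the threshold, the penalty grows linearly with how far past oom_forkbomb_thres the parent has gone.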
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail203.messagelabs.com (mail203.messagelabs.com [216.82.254.243]) by kanga.kvack.org (Postfix) with ESMTP id B5CB26B01D0 for ; Tue, 1 Jun 2010 14:58:23 -0400 (EDT) Received: from hpaq1.eem.corp.google.com (hpaq1.eem.corp.google.com [172.25.149.1]) by smtp-out.google.com with ESMTP id o51IwJs2028945 for ; Tue, 1 Jun 2010 11:58:19 -0700 Received: from pzk17 (pzk17.prod.google.com [10.243.19.145]) by hpaq1.eem.corp.google.com with ESMTP id o51IwH7B007642 for ; Tue, 1 Jun 2010 11:58:18 -0700 Received: by pzk17 with SMTP id 17so2823231pzk.5 for ; Tue, 01 Jun 2010 11:58:17 -0700 (PDT) Date: Tue, 1 Jun 2010 11:58:13 -0700 (PDT) From: David Rientjes Subject: Re: [patch -mm 12/18] oom: remove unnecessary code and cleanup In-Reply-To: <20100601163954.246F.A69D9226@jp.fujitsu.com> Message-ID: References: <20100601163954.246F.A69D9226@jp.fujitsu.com> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-linux-mm@kvack.org To: KOSAKI Motohiro Cc: Andrew Morton , Rik van Riel , Nick Piggin , Oleg Nesterov , KAMEZAWA Hiroyuki , Balbir Singh , linux-mm@kvack.org List-ID: On Tue, 1 Jun 2010, KOSAKI Motohiro wrote: > > Remove the redundancy in __oom_kill_task() since: > > > > - init can never be passed to this function: it will never be PF_EXITING > > or selectable from select_bad_process(), and > > > > - it will never be passed a task from oom_kill_task() without an ->mm > > and we're unconcerned about detachment from exiting tasks, there's no > > reason to protect them against SIGKILL or access to memory reserves. > > > > Also moves the kernel log message to a higher level since the verbosity is > > not always emitted here; we need not print an error message if an exiting > > task is given a longer timeslice. > > > > Reviewed-by: KAMEZAWA Hiroyuki > > Reviewed-by: KOSAKI Motohiro > > Signed-off-by: David Rientjes > > need respin. 
> This is a duplicate of the same patch that you earlier added your Reviewed-by line as cited above, what has changed? This applies fine. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail191.messagelabs.com (mail191.messagelabs.com [216.82.242.19]) by kanga.kvack.org (Postfix) with ESMTP id 77AF26B01CC for ; Tue, 1 Jun 2010 14:59:31 -0400 (EDT) Received: from wpaz13.hot.corp.google.com (wpaz13.hot.corp.google.com [172.24.198.77]) by smtp-out.google.com with ESMTP id o51IxTBE003093 for ; Tue, 1 Jun 2010 11:59:29 -0700 Received: from pxi12 (pxi12.prod.google.com [10.243.27.12]) by wpaz13.hot.corp.google.com with ESMTP id o51IxR2q023225 for ; Tue, 1 Jun 2010 11:59:28 -0700 Received: by pxi12 with SMTP id 12so5742805pxi.0 for ; Tue, 01 Jun 2010 11:59:27 -0700 (PDT) Date: Tue, 1 Jun 2010 11:59:24 -0700 (PDT) From: David Rientjes Subject: Re: [patch -mm 13/18] oom: avoid race for oom killed tasks detaching mm prior to exit In-Reply-To: <20100601164026.2472.A69D9226@jp.fujitsu.com> Message-ID: References: <20100601164026.2472.A69D9226@jp.fujitsu.com> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-linux-mm@kvack.org To: KOSAKI Motohiro Cc: Andrew Morton , Rik van Riel , Nick Piggin , Oleg Nesterov , KAMEZAWA Hiroyuki , Balbir Singh , linux-mm@kvack.org List-ID: On Tue, 1 Jun 2010, KOSAKI Motohiro wrote: > > Tasks detach its ->mm prior to exiting so it's possible that in progress > > oom kills or already exiting tasks may be missed during the oom killer's > > tasklist scan. When an eligible task is found with either TIF_MEMDIE or > > PF_EXITING set, the oom killer is supposed to be a no-op to avoid > > needlessly killing additional tasks. This closes the race between a task > > detaching its ->mm and being removed from the tasklist. 
> > > > Out of memory conditions as the result of memory controllers will > > automatically filter tasks that have detached their ->mm (since > > task_in_mem_cgroup() will return 0). This is acceptable, however, since > > memcg constrained ooms aren't the result of a lack of memory resources but > > rather a limit imposed by userspace that requires a task be killed > > regardless. > > > > [oleg@redhat.com: fix PF_EXITING check for !p->mm tasks] > > Acked-by: Nick Piggin > > Signed-off-by: David Rientjes > > need respin. > No, it applies to mmotm-2010-05-21-16-05 as all of these patches do. I know you've pushed Oleg's patches but they are also included here so no respin is necessary unless they are merged first (and I think that should only happen if Andrew considers them to be rc material). I'll base my patchsets on the -mm tree. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail203.messagelabs.com (mail203.messagelabs.com [216.82.254.243]) by kanga.kvack.org (Postfix) with SMTP id 5676F6B01D6 for ; Tue, 1 Jun 2010 16:45:03 -0400 (EDT) Date: Tue, 1 Jun 2010 22:43:42 +0200 From: Oleg Nesterov Subject: Re: [patch -mm 13/18] oom: avoid race for oom killed tasks detaching mm prior to exit Message-ID: <20100601204342.GC20732@redhat.com> References: <20100601164026.2472.A69D9226@jp.fujitsu.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: Sender: owner-linux-mm@kvack.org To: David Rientjes Cc: KOSAKI Motohiro , Andrew Morton , Rik van Riel , Nick Piggin , KAMEZAWA Hiroyuki , Balbir Singh , linux-mm@kvack.org List-ID: On 06/01, David Rientjes wrote: > > No, it applies to mmotm-2010-05-21-16-05 as all of these patches do. 
I > know you've pushed Oleg's patches (plus other fixes) > but they are also included here so no > respin is necessary unless they are merged first (and I think that should > only happen if Andrew considers them to be rc material). Well, I disagree. I think it is always better to push the simple bugfixes first, then change/improve the logic. Oleg. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail190.messagelabs.com (mail190.messagelabs.com [216.82.249.51]) by kanga.kvack.org (Postfix) with ESMTP id 9AA676B01CD for ; Tue, 1 Jun 2010 17:19:46 -0400 (EDT) Received: from wpaz29.hot.corp.google.com (wpaz29.hot.corp.google.com [172.24.198.93]) by smtp-out.google.com with ESMTP id o51LJhe7000776 for ; Tue, 1 Jun 2010 14:19:43 -0700 Received: from pzk32 (pzk32.prod.google.com [10.243.19.160]) by wpaz29.hot.corp.google.com with ESMTP id o51LJfBH027847 for ; Tue, 1 Jun 2010 14:19:42 -0700 Received: by pzk32 with SMTP id 32so2667891pzk.21 for ; Tue, 01 Jun 2010 14:19:41 -0700 (PDT) Date: Tue, 1 Jun 2010 14:19:38 -0700 (PDT) From: David Rientjes Subject: Re: [patch -mm 13/18] oom: avoid race for oom killed tasks detaching mm prior to exit In-Reply-To: <20100601204342.GC20732@redhat.com> Message-ID: References: <20100601164026.2472.A69D9226@jp.fujitsu.com> <20100601204342.GC20732@redhat.com> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-linux-mm@kvack.org To: Oleg Nesterov Cc: KOSAKI Motohiro , Andrew Morton , Rik van Riel , Nick Piggin , KAMEZAWA Hiroyuki , Balbir Singh , linux-mm@kvack.org List-ID: On Tue, 1 Jun 2010, Oleg Nesterov wrote: > On 06/01, David Rientjes wrote: > > > > No, it applies to mmotm-2010-05-21-16-05 as all of these patches do. 
I > > know you've pushed Oleg's patches > > (plus other fixes) > You're suggesting that I should develop my patches on top of what I speculate that Andrew will eventually merge in -mm? I don't have that kind of time, sorry. > > but they are also included here so no > > respin is necessary unless they are merged first (and I think that should > > only happen if Andrew considers them to be rc material). > > Well, I disagree. > > I think it is always better to push the simple bugfixes first, then > change/improve the logic. > Unless your fixes, which seem to still be under development considering your discussion with KOSAKI in those threads, are going into 2.6.35 during the rc cycle, then there's no difference in them being merged as part of this patchset since they are duplicated here. So you'll need to convince Andrew they are rc material otherwise it doesn't matter. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail138.messagelabs.com (mail138.messagelabs.com [216.82.249.35]) by kanga.kvack.org (Postfix) with SMTP id 5BA026B01B0 for ; Tue, 1 Jun 2010 20:32:44 -0400 (EDT) Received: from m6.gw.fujitsu.co.jp ([10.0.50.76]) by fgwmail7.fujitsu.co.jp (Fujitsu Gateway) with ESMTP id o520Wfhe018252 for (envelope-from kamezawa.hiroyu@jp.fujitsu.com); Wed, 2 Jun 2010 09:32:41 +0900 Received: from smail (m6 [127.0.0.1]) by outgoing.m6.gw.fujitsu.co.jp (Postfix) with ESMTP id B203445DE54 for ; Wed, 2 Jun 2010 09:32:41 +0900 (JST) Received: from s6.gw.fujitsu.co.jp (s6.gw.fujitsu.co.jp [10.0.50.96]) by m6.gw.fujitsu.co.jp (Postfix) with ESMTP id 4235945DE62 for ; Wed, 2 Jun 2010 09:32:41 +0900 (JST) Received: from s6.gw.fujitsu.co.jp (localhost.localdomain [127.0.0.1]) by s6.gw.fujitsu.co.jp (Postfix) with ESMTP id E5C20E08020 for ; Wed, 2 Jun 2010 09:32:40 +0900 (JST) Received: from ml13.s.css.fujitsu.com (ml13.s.css.fujitsu.com [10.249.87.103]) by s6.gw.fujitsu.co.jp (Postfix) with ESMTP id 99FC1E0800A for ; Wed, 2 Jun 2010 09:32:37 +0900 (JST) Date: Wed, 2 Jun 2010 09:28:19 +0900 From: KAMEZAWA Hiroyuki Subject: Re: [patch -mm 13/18] oom: avoid race for oom killed tasks detaching mm prior to exit Message-Id: <20100602092819.58579806.kamezawa.hiroyu@jp.fujitsu.com> In-Reply-To: <20100601204342.GC20732@redhat.com> References: <20100601164026.2472.A69D9226@jp.fujitsu.com> <20100601204342.GC20732@redhat.com> Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org To: Oleg Nesterov Cc: David Rientjes , KOSAKI Motohiro , Andrew Morton , Rik van Riel , Nick Piggin , Balbir Singh , linux-mm@kvack.org List-ID: On Tue, 1 Jun 2010 22:43:42 +0200 Oleg Nesterov wrote: > On 06/01, David Rientjes wrote: > > > > No, it applies to mmotm-2010-05-21-16-05 as all of these patches do. 
I > > know you've pushed Oleg's patches > > (plus other fixes) > > > but they are also included here so no > > respin is necessary unless they are merged first (and I think that should > > only happen if Andrew considers them to be rc material). > > Well, I disagree. > > I think it is always better to push the simple bugfixes first, then > change/improve the logic. > yes..yes...I hope David finish easy-to-be-merged ones and go to new stage. IOW, please reduce size of patches sent at once. Thanks, -Kame -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail190.messagelabs.com (mail190.messagelabs.com [216.82.249.51]) by kanga.kvack.org (Postfix) with ESMTP id 6CD066B01AC for ; Wed, 2 Jun 2010 05:50:03 -0400 (EDT) Received: from wpaz21.hot.corp.google.com (wpaz21.hot.corp.google.com [172.24.198.85]) by smtp-out.google.com with ESMTP id o529ntcH004889 for ; Wed, 2 Jun 2010 02:49:57 -0700 Received: from pzk31 (pzk31.prod.google.com [10.243.19.159]) by wpaz21.hot.corp.google.com with ESMTP id o529nrDa027351 for ; Wed, 2 Jun 2010 02:49:54 -0700 Received: by pzk31 with SMTP id 31so2919110pzk.16 for ; Wed, 02 Jun 2010 02:49:53 -0700 (PDT) Date: Wed, 2 Jun 2010 02:49:49 -0700 (PDT) From: David Rientjes Subject: Re: [patch -mm 13/18] oom: avoid race for oom killed tasks detaching mm prior to exit In-Reply-To: <20100602092819.58579806.kamezawa.hiroyu@jp.fujitsu.com> Message-ID: References: <20100601164026.2472.A69D9226@jp.fujitsu.com> <20100601204342.GC20732@redhat.com> <20100602092819.58579806.kamezawa.hiroyu@jp.fujitsu.com> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-linux-mm@kvack.org To: KAMEZAWA Hiroyuki Cc: Oleg Nesterov , KOSAKI Motohiro , Andrew Morton , Rik van Riel , Nick Piggin , Balbir Singh , linux-mm@kvack.org List-ID: On Wed, 2 Jun 
2010, KAMEZAWA Hiroyuki wrote: > > > No, it applies to mmotm-2010-05-21-16-05 as all of these patches do. I > > > know you've pushed Oleg's patches > > > > (plus other fixes) > > > > > but they are also included here so no > > > respin is necessary unless they are merged first (and I think that should > > > only happen if Andrew considers them to be rc material). > > > > Well, I disagree. > > > > I think it is always better to push the simple bugfixes first, then > > change/improve the logic. > > > yes..yes...I hope David finish easy-to-be-merged ones and go to new stage. > IOW, please reduce size of patches sent at once. > How do you define "easy-to-be-merged"? We've been through several iterations of this patchset where the end result is that it's been merged in -mm once, removed from -mm six weeks later, and nobody providing any feedback that I can work from. Providing simple "nack" emails does nothing for the development of the patchset unless you actively get involved in the review process and subsequent discussion on how to move forward. Listen, I want to hear everybody's ideas and suggestions on improvements. In fact, I think I've responded in a way that demonstrates that quite well: I've dropped the consolidation of sysctls, I've avoided deprecation of existing sysctls, I've unified the semantics of panic_on_oom, and I've split out patches where possible. All of those were at the requests of people whom I've asked to review this patchset time and time again. Kame, you've been very helpful in your feedback with regards to this patchset and I've valued your feedback from the first revision. We had some differing views of how to handle task selection early on in other threads, but I sincerely enjoy hearing your feedback because it's interesting and challenging; you find things that I've missed and challenge me to defend decisions that were made. 
I really, really like doing that type of development; I just wish we all could make some forward progress on this thing instead of stalling out all the time.

I'm asking everyone to please review this work and comment on what you don't like or provide suggestions on how to improve it. It's been posted in its various forms about eight times now over the course of a few months, so I really hope there are no big surprises in it to anyone anymore. Sure, there are cleanups here that possibly could be considered rc material even though they admittedly aren't critical, but that isn't a reason to just stall out all of this work. I'm sure Andrew can decide what he wants to merge into 2.6.35-rc2 after looking at the discussion and analyzing the impact; let us please focus on the actual implementation and design choices of the new oom killer presented here rather than get sidetracked.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ .
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail202.messagelabs.com (mail202.messagelabs.com [216.82.254.227]) by kanga.kvack.org (Postfix) with SMTP id 6E7BB6B01AC for ; Wed, 2 Jun 2010 06:46:55 -0400 (EDT) Date: Wed, 2 Jun 2010 20:46:21 +1000 From: Nick Piggin Subject: Re: [patch -mm 13/18] oom: avoid race for oom killed tasks detaching mm prior to exit Message-ID: <20100602104621.GA6152@laptop> References: <20100601164026.2472.A69D9226@jp.fujitsu.com> <20100601204342.GC20732@redhat.com> <20100602092819.58579806.kamezawa.hiroyu@jp.fujitsu.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: Sender: owner-linux-mm@kvack.org To: David Rientjes Cc: KAMEZAWA Hiroyuki , Oleg Nesterov , KOSAKI Motohiro , Andrew Morton , Rik van Riel , Balbir Singh , linux-mm@kvack.org List-ID: On Wed, Jun 02, 2010 at 02:49:49AM -0700, David Rientjes wrote: > On Wed, 2 Jun 2010, KAMEZAWA Hiroyuki wrote: > > > > > No, it applies to mmotm-2010-05-21-16-05 as all of these patches do. I > > > > know you've pushed Oleg's patches > > > > > > (plus other fixes) > > > > > > > but they are also included here so no > > > > respin is necessary unless they are merged first (and I think that should > > > > only happen if Andrew considers them to be rc material). > > > > > > Well, I disagree. > > > > > > I think it is always better to push the simple bugfixes first, then > > > change/improve the logic. > > > > > yes..yes...I hope David finish easy-to-be-merged ones and go to new stage. > > IOW, please reduce size of patches sent at once. > > > > How do you define "easy-to-be-merged"? We've been through several > iterations of this patchset where the end result is that it's been merged > in -mm once, removed from -mm six weeks later, and nobody providing any > feedback that I can work from. 
Providing simple "nack" emails does
> nothing for the development of the patchset unless you actively get
> involved in the review process and subsequent discussion on how to move
> forward.
>
> Listen, I want to hear everybody's ideas and suggestions on improvements.
> In fact, I think I've responded in a way that demonstrates that quite
> well: I've dropped the consolidation of sysctls, I've avoided deprecation
> of existing sysctls, I've unified the semantics of panic_on_oom, and I've
> split out patches where possible. All of those were at the requests of
> people whom I've asked to review this patchset time and time again.
>
> Kame, you've been very helpful in your feedback with regards to this
> patchset and I've valued your feedback from the first revision. We had
> some differing views of how to handle task selection early on in other
> threads, but I sincerely enjoy hearing your feedback because it's
> interesting and challenging; you find things that I've missed and
> challenge me to defend decisions that were made. I really, really like
> doing that type of development, I just wish we all could make some forward
> progress on this thing instead of stalling out all the time.

Well there are a large number of patches with no objections, some of which are bug-fixes which may need to be backported to earlier kernels. It would be nice if the patchset were rearranged so all these can be merged soon (I don't want the situation where a couple of patches hold up your entire patchset again). When you are reduced to a few patches changing major functionality, it could be easier to get those reviewed and merged on their own.

> I'm asking everyone to please review this work and comment on what you
> don't like or provide suggestions on how to improve it. It's been posted
> in its various forms about eight times now over the course of a few
> months, I really hope there's no big surprises in it to anyone anymore.
> Sure, there are cleanups here that possibly could be considered rc > material even though they admittedly aren't critical, but that isn't a > reason to just stall out all of this work. I'm sure Andrew can decide > what he wants to merge into 2.6.35-rc2 after looking at the discussion and > analyzing the impact; let us please focus on the actual implementation and > design choices of the new oom killer presented here rather than get > sidetracked. Well the merge window is closed and even if it wasn't the patches would be better to sit in -mm for a bit. So I don't think there is a big rush now, let's just get it right so everything is lined up to get into the next merge window. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail202.messagelabs.com (mail202.messagelabs.com [216.82.254.227]) by kanga.kvack.org (Postfix) with SMTP id A19A56B01B0 for ; Wed, 2 Jun 2010 09:54:06 -0400 (EDT) Received: from m2.gw.fujitsu.co.jp ([10.0.50.72]) by fgwmail5.fujitsu.co.jp (Fujitsu Gateway) with ESMTP id o52Ds4XH016905 for (envelope-from kosaki.motohiro@jp.fujitsu.com); Wed, 2 Jun 2010 22:54:04 +0900 Received: from smail (m2 [127.0.0.1]) by outgoing.m2.gw.fujitsu.co.jp (Postfix) with ESMTP id 0B59C45DE51 for ; Wed, 2 Jun 2010 22:54:04 +0900 (JST) Received: from s2.gw.fujitsu.co.jp (s2.gw.fujitsu.co.jp [10.0.50.92]) by m2.gw.fujitsu.co.jp (Postfix) with ESMTP id DA39545DE4F for ; Wed, 2 Jun 2010 22:54:03 +0900 (JST) Received: from s2.gw.fujitsu.co.jp (localhost.localdomain [127.0.0.1]) by s2.gw.fujitsu.co.jp (Postfix) with ESMTP id C75C31DB803B for ; Wed, 2 Jun 2010 22:54:03 +0900 (JST) Received: from m106.s.css.fujitsu.com (m106.s.css.fujitsu.com [10.249.87.106]) by s2.gw.fujitsu.co.jp (Postfix) with ESMTP id 7A34C1DB803E for ; Wed, 2 Jun 2010 22:54:03 +0900 (JST) From: KOSAKI 
Motohiro Subject: Re: [patch -mm 08/18] oom: badness heuristic rewrite In-Reply-To: References: <20100601074620.GR9453@laptop> Message-Id: <20100602222347.F527.A69D9226@jp.fujitsu.com> MIME-Version: 1.0 Content-Type: text/plain; charset="ISO-2022-JP" Content-Transfer-Encoding: 7bit Date: Wed, 2 Jun 2010 22:54:02 +0900 (JST) Sender: owner-linux-mm@kvack.org To: David Rientjes Cc: kosaki.motohiro@jp.fujitsu.com, Nick Piggin , Andrew Morton , Rik van Riel , Oleg Nesterov , KAMEZAWA Hiroyuki , Balbir Singh , linux-mm@kvack.org List-ID: > > Such that you can look at your test case or workload and see that > > it is really improved? > > > > I'm glad you asked that because some recent conversation has been > slightly confusing to me about how this affects the desktop; this rewrite > significantly improves the oom killer's response for desktop users. The > core ideas were developed in the thread from this mailing list back in > February called "Improving OOM killer" at > http://marc.info/?t=126506191200004&r=4&w=2 -- users constantly report > that vital system tasks such as kdeinit are killed whenever a memory > hogging task is forked either intentionally or unintentionally. I argued > for a while that KDE should be taking proper precautions by adjusting its > own oom_adj score and that of its forked children as it's an inherited > value, but I was eventually convinced that an overall improvement to the > heuristic must be made to kill a task that was known to free a large > amount of memory that is resident in RAM and that we have a consistent way > of defining oom priorities when a task is run uncontained and when it is a > member of a memcg or cpuset (or even mempolicy now), even in the case when > it's contained out from under the task's knowledge. 
When faced with
> memory pressure from an out of control or memory hogging task on the
> desktop, the oom killer now kills it instead of a vital task such as an X
> server (and oracle, webserver, etc on server platforms) because of the use
> of the task's rss instead of total_vm statistic.

The above story teaches us that the oom killer needs some improvement, but it doesn't prove that your patches are the correct solution. That's why you were asked how you tested them. Nobody objects to fixing the KDE OOM issue.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ .
Don't email: email@kvack.org

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail137.messagelabs.com (mail137.messagelabs.com [216.82.249.19]) by kanga.kvack.org (Postfix) with SMTP id 392586B01B5 for ; Wed, 2 Jun 2010 09:54:07 -0400 (EDT)
Received: from m6.gw.fujitsu.co.jp ([10.0.50.76]) by fgwmail7.fujitsu.co.jp (Fujitsu Gateway) with ESMTP id o52Ds53i021508 for (envelope-from kosaki.motohiro@jp.fujitsu.com); Wed, 2 Jun 2010 22:54:05 +0900
Received: from smail (m6 [127.0.0.1]) by outgoing.m6.gw.fujitsu.co.jp (Postfix) with ESMTP id E3D9045DE4F for ; Wed, 2 Jun 2010 22:54:04 +0900 (JST)
Received: from s6.gw.fujitsu.co.jp (s6.gw.fujitsu.co.jp [10.0.50.96]) by m6.gw.fujitsu.co.jp (Postfix) with ESMTP id C857645DE4E for ; Wed, 2 Jun 2010 22:54:04 +0900 (JST)
Received: from s6.gw.fujitsu.co.jp (localhost.localdomain [127.0.0.1]) by s6.gw.fujitsu.co.jp (Postfix) with ESMTP id B37FE1DB8014 for ; Wed, 2 Jun 2010 22:54:04 +0900 (JST)
Received: from m106.s.css.fujitsu.com (m106.s.css.fujitsu.com [10.249.87.106]) by s6.gw.fujitsu.co.jp (Postfix) with ESMTP id 6A6951DB8013 for ; Wed, 2 Jun 2010 22:54:04 +0900 (JST)
From: KOSAKI Motohiro
Subject: Re: [patch -mm 08/18] oom: badness heuristic rewrite
In-Reply-To:
References: <20100601163627.245D.A69D9226@jp.fujitsu.com>
Message-Id: <20100602225252.F536.A69D9226@jp.fujitsu.com>
MIME-Version: 1.0
Content-Type: text/plain; charset="ISO-2022-JP"
Content-Transfer-Encoding: 7bit
Date: Wed, 2 Jun 2010 22:54:03 +0900 (JST)
Sender: owner-linux-mm@kvack.org
To: David Rientjes
Cc: kosaki.motohiro@jp.fujitsu.com, Andrew Morton, Rik van Riel, Nick Piggin, Oleg Nesterov, KAMEZAWA Hiroyuki, Balbir Singh, linux-mm@kvack.org
List-ID:

> Why?
>
> If it's because the patch is too big, I've explained a few times that
> functionally you can't break it apart into anything meaningful. I do not
> believe it is better to break functional changes into smaller patches that
> simply change function signatures to pass additional arguments that are
> unused in the first patch, for example.
>
> If it's because it adds /proc/pid/oom_score_adj in the same patch, that's
> allowed since otherwise it would be useless with the old heuristic. In
> other words, you cannot apply oom_score_adj's meaning to the bitshift in
> any sane way.
>
> I'll suggest what I have multiple times: the easiest way to review the
> functional change here is to merge the patch into your own tree and then
> review oom_badness(). I agree that the way the diff comes out it is a
> little difficult to read just from the patch form, so merging it and
> reviewing the actual heuristic function is the easiest way.

I've already explained the reason.

1) Patches that rewrite everything at once are always unacceptable. That prevents our code maintenance.
2) Patches with no justification are also unacceptable. You need to write a more proper patch description at least.

We don't need pointless suggestions. You only need to fix the patch.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ .
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail202.messagelabs.com (mail202.messagelabs.com [216.82.254.227]) by kanga.kvack.org (Postfix) with SMTP id 7BBEA6B01B5 for ; Wed, 2 Jun 2010 09:54:10 -0400 (EDT) Received: from m4.gw.fujitsu.co.jp ([10.0.50.74]) by fgwmail6.fujitsu.co.jp (Fujitsu Gateway) with ESMTP id o52Ds6qh006335 for (envelope-from kosaki.motohiro@jp.fujitsu.com); Wed, 2 Jun 2010 22:54:07 +0900 Received: from smail (m4 [127.0.0.1]) by outgoing.m4.gw.fujitsu.co.jp (Postfix) with ESMTP id 4A30945DE79 for ; Wed, 2 Jun 2010 22:54:06 +0900 (JST) Received: from s4.gw.fujitsu.co.jp (s4.gw.fujitsu.co.jp [10.0.50.94]) by m4.gw.fujitsu.co.jp (Postfix) with ESMTP id 0BE7E45DE6E for ; Wed, 2 Jun 2010 22:54:06 +0900 (JST) Received: from s4.gw.fujitsu.co.jp (localhost.localdomain [127.0.0.1]) by s4.gw.fujitsu.co.jp (Postfix) with ESMTP id DFDA01DB803B for ; Wed, 2 Jun 2010 22:54:05 +0900 (JST) Received: from m106.s.css.fujitsu.com (m106.s.css.fujitsu.com [10.249.87.106]) by s4.gw.fujitsu.co.jp (Postfix) with ESMTP id 64A801DB8040 for ; Wed, 2 Jun 2010 22:54:02 +0900 (JST) From: KOSAKI Motohiro Subject: Re: [patch -mm 13/18] oom: avoid race for oom killed tasks detaching mm prior to exit In-Reply-To: <20100601204342.GC20732@redhat.com> References: <20100601204342.GC20732@redhat.com> Message-Id: <20100602221633.F521.A69D9226@jp.fujitsu.com> MIME-Version: 1.0 Content-Type: text/plain; charset="ISO-2022-JP" Content-Transfer-Encoding: 7bit Date: Wed, 2 Jun 2010 22:54:01 +0900 (JST) Sender: owner-linux-mm@kvack.org To: Oleg Nesterov Cc: kosaki.motohiro@jp.fujitsu.com, David Rientjes , Andrew Morton , Rik van Riel , Nick Piggin , KAMEZAWA Hiroyuki , Balbir Singh , linux-mm@kvack.org List-ID: > On 06/01, David Rientjes wrote: > > > > No, it applies to mmotm-2010-05-21-16-05 as all of these patches do. 
I > > know you've pushed Oleg's patches > > (plus other fixes) > > > but they are also included here so no > > respin is necessary unless they are merged first (and I think that should > > only happen if Andrew considers them to be rc material). > > Well, I disagree. > > I think it is always better to push the simple bugfixes first, then > change/improve the logic. Yep. That's exactly the reason why I would push his patch series at first. Thanks. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail202.messagelabs.com (mail202.messagelabs.com [216.82.254.227]) by kanga.kvack.org (Postfix) with ESMTP id 3E2DD6B01AC for ; Wed, 2 Jun 2010 17:21:04 -0400 (EDT) Received: from kpbe20.cbf.corp.google.com (kpbe20.cbf.corp.google.com [172.25.105.84]) by smtp-out.google.com with ESMTP id o52LKxOJ009724 for ; Wed, 2 Jun 2010 14:21:00 -0700 Received: from pxi6 (pxi6.prod.google.com [10.243.27.6]) by kpbe20.cbf.corp.google.com with ESMTP id o52LKwKd020850 for ; Wed, 2 Jun 2010 14:20:58 -0700 Received: by pxi6 with SMTP id 6so2703165pxi.15 for ; Wed, 02 Jun 2010 14:20:58 -0700 (PDT) Date: Wed, 2 Jun 2010 14:20:53 -0700 (PDT) From: David Rientjes Subject: Re: [patch -mm 08/18] oom: badness heuristic rewrite In-Reply-To: <20100602225252.F536.A69D9226@jp.fujitsu.com> Message-ID: References: <20100601163627.245D.A69D9226@jp.fujitsu.com> <20100602225252.F536.A69D9226@jp.fujitsu.com> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-linux-mm@kvack.org To: KOSAKI Motohiro Cc: Andrew Morton , Rik van Riel , Nick Piggin , Oleg Nesterov , KAMEZAWA Hiroyuki , Balbir Singh , linux-mm@kvack.org List-ID: On Wed, 2 Jun 2010, KOSAKI Motohiro wrote: > I've already explained the reason. 1) all-of-rewrite patches are > always unacceptable. 
that prevents our code maintenance. How else would you propose to completely change a heuristic?? By doing it in steps where the intermediate changes make an absolute mess of it first and then slowly work toward the end result? This is a complete rewrite of the badness() heuristic; it introduces a new userspace interface, oom_score_adj, which it heavily relies upon (otherwise it'd be impossible to disable oom killing completely for certain tasks, for example), so naturally that needs to be included. I've followed your suggestion of splitting out the forkbomb detector into the next patch, which you don't even have any feedback for either other than "nack", so what else do you want from me?? Please follow my suggestion that I've repeatedly made: merge the patch locally and check out the new oom_badness() function and see if there's anything you're concerned with. In other words, please actually review the implementation and design. > 2) no justification > patches are also unacceptable. you need to write more proper patch description > at least. > What needs to be included in the patch description that isn't already? I think its intention and implementation are clearly spelled out. > We don't need pointless suggestions. you only need to fix the patch. > It's a review tip to make it easier to read the patch since the complete rewrite of oom_badness() is difficult to read in patch form because of the breaks. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail138.messagelabs.com (mail138.messagelabs.com [216.82.249.35]) by kanga.kvack.org (Postfix) with ESMTP id 445186B01AD for ; Wed, 2 Jun 2010 17:24:01 -0400 (EDT) Received: from wpaz13.hot.corp.google.com (wpaz13.hot.corp.google.com [172.24.198.77]) by smtp-out.google.com with ESMTP id o52LNwda015871 for ; Wed, 2 Jun 2010 14:23:58 -0700 Received: from pzk32 (pzk32.prod.google.com [10.243.19.160]) by wpaz13.hot.corp.google.com with ESMTP id o52LNv7j030226 for ; Wed, 2 Jun 2010 14:23:57 -0700 Received: by pzk32 with SMTP id 32so3288100pzk.21 for ; Wed, 02 Jun 2010 14:23:57 -0700 (PDT) Date: Wed, 2 Jun 2010 14:23:53 -0700 (PDT) From: David Rientjes Subject: Re: [patch -mm 08/18] oom: badness heuristic rewrite In-Reply-To: <20100602222347.F527.A69D9226@jp.fujitsu.com> Message-ID: References: <20100601074620.GR9453@laptop> <20100602222347.F527.A69D9226@jp.fujitsu.com> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-linux-mm@kvack.org To: KOSAKI Motohiro Cc: Nick Piggin , Andrew Morton , Rik van Riel , Oleg Nesterov , KAMEZAWA Hiroyuki , Balbir Singh , linux-mm@kvack.org List-ID: On Wed, 2 Jun 2010, KOSAKI Motohiro wrote: > > I'm glad you asked that because some recent conversation has been > > slightly confusing to me about how this affects the desktop; this rewrite > > significantly improves the oom killer's response for desktop users. The > > core ideas were developed in the thread from this mailing list back in > > February called "Improving OOM killer" at > > http://marc.info/?t=126506191200004&r=4&w=2 -- users constantly report > > that vital system tasks such as kdeinit are killed whenever a memory > > hogging task is forked either intentionally or unintentionally. 
I argued > > for a while that KDE should be taking proper precautions by adjusting its > > own oom_adj score and that of its forked children as it's an inherited > > value, but I was eventually convinced that an overall improvement to the > > heuristic must be made to kill a task that was known to free a large > > amount of memory that is resident in RAM and that we have a consistent way > > of defining oom priorities when a task is run uncontained and when it is a > > member of a memcg or cpuset (or even mempolicy now), even in the case when > > it's contained out from under the task's knowledge. When faced with > > memory pressure from an out of control or memory hogging task on the > > desktop, the oom killer now kills it instead of a vital task such as an X > > server (and oracle, webserver, etc on server platforms) because of the use > > of the task's rss instead of total_vm statistic. > > The above story teach us oom-killer need some improvement. but it haven't > prove your patches are correct solution. that's why you got to ask testing way. > I would consider what I said above, "when faced with memory pressure from an out of control or memory hogging task on the desktop, the oom killer now kills it instead of a vital task such as an X server because of the use of the task's rss instead of total_vm statistic" as an improvement over killing X in those cases which it currently does. How do you disagree? -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
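[Editorial sketch, not part of the thread.] The message above mentions a session manager such as KDE adjusting its own oom score so that it and its forked children are less likely to be selected. A minimal userspace sketch of that idea follows, assuming the /proc/pid/oom_score_adj file proposed in this series. The helper name write_oom_score_adj() is hypothetical, and the -1000..1000 range is taken from mainline kernel documentation rather than from anything stated in this thread.

```c
#include <errno.h>
#include <stdio.h>

/*
 * Range of /proc/<pid>/oom_score_adj as documented for mainline Linux.
 * Assumption: the thread itself does not spell these values out.
 */
#define OOM_SCORE_ADJ_MIN (-1000)
#define OOM_SCORE_ADJ_MAX 1000

/*
 * Hypothetical helper (not a kernel or libc API): write an adjustment
 * value to an oom_score_adj file.  The path is a parameter so it can
 * point at /proc/self/oom_score_adj or at another pid's file.
 * Returns 0 on success, -1 on an out-of-range value or an I/O error.
 */
int write_oom_score_adj(const char *path, int value)
{
	FILE *f;
	int ok;

	if (value < OOM_SCORE_ADJ_MIN || value > OOM_SCORE_ADJ_MAX) {
		errno = EINVAL;
		return -1;
	}

	f = fopen(path, "w");
	if (!f)
		return -1;

	/* The kernel parses the value as a decimal integer. */
	ok = fprintf(f, "%d\n", value) > 0;
	if (fclose(f) != 0)
		ok = 0;

	return ok ? 0 : -1;
}
```

A session manager could call write_oom_score_adj("/proc/self/oom_score_adj", -500) before forking its children, relying on the value being inherited as described above. Lowering the value may require extra privileges, so callers should check the return value.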
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail202.messagelabs.com (mail202.messagelabs.com [216.82.254.227]) by kanga.kvack.org (Postfix) with ESMTP id 27C086B01AD for ; Wed, 2 Jun 2010 17:35:14 -0400 (EDT) Received: from hpaq1.eem.corp.google.com (hpaq1.eem.corp.google.com [172.25.149.1]) by smtp-out.google.com with ESMTP id o52LZ8n0032510 for ; Wed, 2 Jun 2010 14:35:08 -0700 Received: from pzk2 (pzk2.prod.google.com [10.243.19.130]) by hpaq1.eem.corp.google.com with ESMTP id o52LZ6Rd029579 for ; Wed, 2 Jun 2010 14:35:07 -0700 Received: by pzk2 with SMTP id 2so2120370pzk.25 for ; Wed, 02 Jun 2010 14:35:06 -0700 (PDT) Date: Wed, 2 Jun 2010 14:35:01 -0700 (PDT) From: David Rientjes Subject: Re: [patch -mm 13/18] oom: avoid race for oom killed tasks detaching mm prior to exit In-Reply-To: <20100602104621.GA6152@laptop> Message-ID: References: <20100601164026.2472.A69D9226@jp.fujitsu.com> <20100601204342.GC20732@redhat.com> <20100602092819.58579806.kamezawa.hiroyu@jp.fujitsu.com> <20100602104621.GA6152@laptop> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-linux-mm@kvack.org To: Nick Piggin Cc: KAMEZAWA Hiroyuki , Oleg Nesterov , KOSAKI Motohiro , Andrew Morton , Rik van Riel , Balbir Singh , linux-mm@kvack.org List-ID: On Wed, 2 Jun 2010, Nick Piggin wrote: > Well there are a large number of patches with no objections, some of > which are bug-fixes which may need to be backported to earlier kernels. > It would be nice if the patchset would be rearranged so all these can > be merged soon (I don't want the situation where a couple of patches > hold up your entire patchset again). 
> I've written fixes in this patchset and have merged Oleg's work into it, but I would stress that none of these are really bugfixes that fix an unstable condition: killing a task outside of current's cpuset even though it was needless isn't a bugfix, recalling the oom killer once a kthread has called unuse_mm() isn't a bugfix, etc. So while they definitely are fixes that we'd like to see upstream at some point, hence they were merged here as well, their impact is not as severe as it may have been described outside of this thread. I definitely don't want that situation where a couple of patches hold it up either; I'm waiting for something to work on. > When you are reduced to a few patches changing major functionality, it > could be easier to get those reviewed and merged on their own. > What patches specifically do you think are 2.6.35-rc2 material? Otherwise, in my opinion, holding up this entire thing from being merged doesn't make a lot of sense based on order of patches. > Well the merge window is closed and even if it wasn't the patches would > be better to sit in -mm for a bit. So I don't think there is a big rush > now, let's just get it right so everything is lined up to get into the > next merge window. > They already sat in -mm for six weeks, so I had stopped my work thinking they already had a path upstream; then they were abruptly removed, with the only alternative left to me being to fold incremental fixes into one another and repost. There have been no changes to what was sitting in -mm for six weeks other than dropping the consolidation of sysctls, the unifying of the panic_on_oom semantics for pagefault ooms, and refactoring of the patchset. 
I'm left in the position where people want certain patches merged first even though they won't say it's rc material, they want me to base my patchset off what they speculatively believe Andrew will eventually merge in -mm in the first place from others, and they refuse to review both the implementation and design of the new heuristic. It compounds my work every day with absolutely no forward progress being made and we've stalled out on all this work because nobody is actually getting involved in reviewing the patchset for Andrew. I honestly don't understand why this entire patchset cannot be merged right now with a target of 2.6.36. If you disagree, please show me the patches that you believe are rc material and the problems that they fix that are either regressions from current code or have a severe enough impact to warrant that type of consideration. Thanks. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail190.messagelabs.com (mail190.messagelabs.com [216.82.249.51]) by kanga.kvack.org (Postfix) with SMTP id 509EA6B01AD for ; Wed, 2 Jun 2010 20:10:13 -0400 (EDT) Received: from m2.gw.fujitsu.co.jp ([10.0.50.72]) by fgwmail7.fujitsu.co.jp (Fujitsu Gateway) with ESMTP id o530A91B025787 for (envelope-from kamezawa.hiroyu@jp.fujitsu.com); Thu, 3 Jun 2010 09:10:10 +0900 Received: from smail (m2 [127.0.0.1]) by outgoing.m2.gw.fujitsu.co.jp (Postfix) with ESMTP id 68F1745DE57 for ; Thu, 3 Jun 2010 09:10:09 +0900 (JST) Received: from s2.gw.fujitsu.co.jp (s2.gw.fujitsu.co.jp [10.0.50.92]) by m2.gw.fujitsu.co.jp (Postfix) with ESMTP id 19C641EF081 for ; Thu, 3 Jun 2010 09:10:09 +0900 (JST) Received: from s2.gw.fujitsu.co.jp (localhost.localdomain [127.0.0.1]) by s2.gw.fujitsu.co.jp (Postfix) with ESMTP id F34991DB8040 for ; Thu, 3 Jun 2010 09:10:08 +0900 (JST) Received: from m105.s.css.fujitsu.com (m105.s.css.fujitsu.com [10.249.87.105]) by s2.gw.fujitsu.co.jp (Postfix) with ESMTP id BD3451DB803B for ; Thu, 3 Jun 2010 09:10:07 +0900 (JST) Date: Thu, 3 Jun 2010 09:05:52 +0900 From: KAMEZAWA Hiroyuki Subject: Re: [patch -mm 08/18] oom: badness heuristic rewrite Message-Id: <20100603090552.1206dfb4.kamezawa.hiroyu@jp.fujitsu.com> In-Reply-To: References: <20100601074620.GR9453@laptop> <20100602222347.F527.A69D9226@jp.fujitsu.com> Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org To: David Rientjes Cc: KOSAKI Motohiro , Nick Piggin , Andrew Morton , Rik van Riel , Oleg Nesterov , Balbir Singh , linux-mm@kvack.org List-ID: On Wed, 2 Jun 2010 14:23:53 -0700 (PDT) David Rientjes wrote: > On Wed, 2 Jun 2010, KOSAKI Motohiro wrote: > > > > I'm glad you asked that because some recent conversation has been > > > slightly confusing to me about how this affects the desktop; this rewrite > > > significantly improves the 
oom killer's response for desktop users. The > > > core ideas were developed in the thread from this mailing list back in > > > February called "Improving OOM killer" at > > > http://marc.info/?t=126506191200004&r=4&w=2 -- users constantly report > > > that vital system tasks such as kdeinit are killed whenever a memory > > > hogging task is forked either intentionally or unintentionally. I argued > > > for a while that KDE should be taking proper precautions by adjusting its > > > own oom_adj score and that of its forked children as it's an inherited > > > value, but I was eventually convinced that an overall improvement to the > > > heuristic must be made to kill a task that was known to free a large > > > amount of memory that is resident in RAM and that we have a consistent way > > > of defining oom priorities when a task is run uncontained and when it is a > > > member of a memcg or cpuset (or even mempolicy now), even in the case when > > > it's contained out from under the task's knowledge. When faced with > > > memory pressure from an out of control or memory hogging task on the > > > desktop, the oom killer now kills it instead of a vital task such as an X > > > server (and oracle, webserver, etc on server platforms) because of the use > > > of the task's rss instead of total_vm statistic. > > > > The above story teach us oom-killer need some improvement. but it haven't > > prove your patches are correct solution. that's why you got to ask testing way. > > > > I would consider what I said above, "when faced with memory pressure from > an out of control or memory hogging task on the desktop, the oom killer > now kills it instead of a vital task such as an X server because of the > use of the task's rss instead of total_vm statistic" as an improvement > over killing X in those cases which it currently does. How do you > disagree? > It was you who disagree using RSS for oom killing in the last winter. By what observation did you change your mind ? 
(Don't take this as criticism. I'm just curious.) My standpoint: I don't like the new interface at all but welcome the concept of using RSS. And my customer and I will never use the new interface other than OOM_DISABLE. So, I say neither ack nor nack. Thanks, -Kame -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail143.messagelabs.com (mail143.messagelabs.com [216.82.254.35]) by kanga.kvack.org (Postfix) with SMTP id E43B36B01AD for ; Wed, 2 Jun 2010 23:07:54 -0400 (EDT) Received: from m6.gw.fujitsu.co.jp ([10.0.50.76]) by fgwmail6.fujitsu.co.jp (Fujitsu Gateway) with ESMTP id o5337q3B006827 for (envelope-from kosaki.motohiro@jp.fujitsu.com); Thu, 3 Jun 2010 12:07:52 +0900 Received: from smail (m6 [127.0.0.1]) by outgoing.m6.gw.fujitsu.co.jp (Postfix) with ESMTP id BEF2245DE50 for ; Thu, 3 Jun 2010 12:07:51 +0900 (JST) Received: from s6.gw.fujitsu.co.jp (s6.gw.fujitsu.co.jp [10.0.50.96]) by m6.gw.fujitsu.co.jp (Postfix) with ESMTP id 9481045DE4E for ; Thu, 3 Jun 2010 12:07:51 +0900 (JST) Received: from s6.gw.fujitsu.co.jp (localhost.localdomain [127.0.0.1]) by s6.gw.fujitsu.co.jp (Postfix) with ESMTP id 477991DB8015 for ; Thu, 3 Jun 2010 12:07:51 +0900 (JST) Received: from m108.s.css.fujitsu.com (m108.s.css.fujitsu.com [10.249.87.108]) by s6.gw.fujitsu.co.jp (Postfix) with ESMTP id F2C0E1DB8012 for ; Thu, 3 Jun 2010 12:07:50 +0900 (JST) From: KOSAKI Motohiro Subject: Re: [patch -mm 08/18] oom: badness heuristic rewrite In-Reply-To: References: <20100602222347.F527.A69D9226@jp.fujitsu.com> Message-Id: <20100603104314.723D.A69D9226@jp.fujitsu.com> MIME-Version: 1.0 Content-Type: text/plain; charset="US-ASCII" Content-Transfer-Encoding: 7bit Date: Thu, 3 Jun 2010 12:07:50 +0900 (JST) Sender: owner-linux-mm@kvack.org To: David Rientjes Cc: 
kosaki.motohiro@jp.fujitsu.com, Nick Piggin , Andrew Morton , Rik van Riel , Oleg Nesterov , KAMEZAWA Hiroyuki , Balbir Singh , linux-mm@kvack.org List-ID: > > The above story teach us oom-killer need some improvement. but it haven't > > prove your patches are correct solution. that's why you got to ask testing way. > > I would consider what I said above, "when faced with memory pressure from > an out of control or memory hogging task on the desktop, the oom killer > now kills it instead of a vital task such as an X server because of the > use of the task's rss instead of total_vm statistic" as an improvement > over killing X in those cases which it currently does. How do you > disagree? People observed that a simple s/total_vm/rss/ patch solves the X issue. Then, the other additional pieces need to explain why they are necessary and how to confirm it. In other words, I'm sure I'll continue to get OOM bug reports in the future. I'll need to decide whether or not to revert each patch. Having no information is unwelcome. Also, that's the reason why an all-of-rewrite patch is wrong: if it is merged, a small bug report is eventually going to force reverting all of it. That doesn't fit our development process. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail143.messagelabs.com (mail143.messagelabs.com [216.82.254.35]) by kanga.kvack.org (Postfix) with ESMTP id BA77C6B01E7 for ; Thu, 3 Jun 2010 02:44:46 -0400 (EDT) Received: from kpbe19.cbf.corp.google.com (kpbe19.cbf.corp.google.com [172.25.105.83]) by smtp-out.google.com with ESMTP id o536igL3012989 for ; Wed, 2 Jun 2010 23:44:42 -0700 Received: from pxi10 (pxi10.prod.google.com [10.243.27.10]) by kpbe19.cbf.corp.google.com with ESMTP id o536ieT9010793 for ; Wed, 2 Jun 2010 23:44:41 -0700 Received: by pxi10 with SMTP id 10so3280280pxi.21 for ; Wed, 02 Jun 2010 23:44:40 -0700 (PDT) Date: Wed, 2 Jun 2010 23:44:33 -0700 (PDT) From: David Rientjes Subject: Re: [patch -mm 08/18] oom: badness heuristic rewrite In-Reply-To: <20100603090552.1206dfb4.kamezawa.hiroyu@jp.fujitsu.com> Message-ID: References: <20100601074620.GR9453@laptop> <20100602222347.F527.A69D9226@jp.fujitsu.com> <20100603090552.1206dfb4.kamezawa.hiroyu@jp.fujitsu.com> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-linux-mm@kvack.org To: KAMEZAWA Hiroyuki Cc: KOSAKI Motohiro , Nick Piggin , Andrew Morton , Rik van Riel , Oleg Nesterov , Balbir Singh , linux-mm@kvack.org List-ID: On Thu, 3 Jun 2010, KAMEZAWA Hiroyuki wrote: > > > > I'm glad you asked that because some recent conversation has been > > > > slightly confusing to me about how this affects the desktop; this rewrite > > > > significantly improves the oom killer's response for desktop users. The > > > > core ideas were developed in the thread from this mailing list back in > > > > February called "Improving OOM killer" at > > > > http://marc.info/?t=126506191200004&r=4&w=2 -- users constantly report > > > > that vital system tasks such as kdeinit are killed whenever a memory > > > > hogging task is forked either intentionally or unintentionally. 
I argued > > > > for a while that KDE should be taking proper precautions by adjusting its > > > > own oom_adj score and that of its forked children as it's an inherited > > > > value, but I was eventually convinced that an overall improvement to the > > > > heuristic must be made to kill a task that was known to free a large > > > > amount of memory that is resident in RAM and that we have a consistent way > > > > of defining oom priorities when a task is run uncontained and when it is a > > > > member of a memcg or cpuset (or even mempolicy now), even in the case when > > > > it's contained out from under the task's knowledge. When faced with > > > > memory pressure from an out of control or memory hogging task on the > > > > desktop, the oom killer now kills it instead of a vital task such as an X > > > > server (and oracle, webserver, etc on server platforms) because of the use > > > > of the task's rss instead of total_vm statistic. > > > > > > The above story teach us oom-killer need some improvement. but it haven't > > > prove your patches are correct solution. that's why you got to ask testing way. > > > > > > > I would consider what I said above, "when faced with memory pressure from > > an out of control or memory hogging task on the desktop, the oom killer > > now kills it instead of a vital task such as an X server because of the > > use of the task's rss instead of total_vm statistic" as an improvement > > over killing X in those cases which it currently does. How do you > > disagree? > > > > It was you who disagree using RSS for oom killing in the last winter. > By what observation did you change your mind ? (Don't take this as criticism. > I'm just curious.) > The fact that when I ran the new heuristic it improved the oom killer on my desktop to save KDE and kill a memory-hogging task that stressed it. 
I became supportive of the idea through the discussion that went on specifically about using total_vm as a baseline and was convinced that it was better to use rss as well as a more powerful user interface so that admins could more accurately set their oom kill priorities even when their cpuset, memcg, or mempolicy placement was changed out from under it. > My stand point: > I don't like the new interface at all but welcome the concept for using RSS . Using rss is not a new interface. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail172.messagelabs.com (mail172.messagelabs.com [216.82.254.3]) by kanga.kvack.org (Postfix) with ESMTP id BAF0D6B01E9 for ; Thu, 3 Jun 2010 02:49:19 -0400 (EDT) Received: from kpbe16.cbf.corp.google.com (kpbe16.cbf.corp.google.com [172.25.105.80]) by smtp-out.google.com with ESMTP id o536nHO5016227 for ; Wed, 2 Jun 2010 23:49:17 -0700 Received: from pvf33 (pvf33.prod.google.com [10.241.210.97]) by kpbe16.cbf.corp.google.com with ESMTP id o536nF4m003449 for ; Wed, 2 Jun 2010 23:49:15 -0700 Received: by pvf33 with SMTP id 33so373292pvf.3 for ; Wed, 02 Jun 2010 23:49:15 -0700 (PDT) Date: Wed, 2 Jun 2010 23:48:55 -0700 (PDT) From: David Rientjes Subject: Re: [patch -mm 08/18] oom: badness heuristic rewrite In-Reply-To: <20100603104314.723D.A69D9226@jp.fujitsu.com> Message-ID: References: <20100602222347.F527.A69D9226@jp.fujitsu.com> <20100603104314.723D.A69D9226@jp.fujitsu.com> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-linux-mm@kvack.org To: KOSAKI Motohiro Cc: Nick Piggin , Andrew Morton , Rik van Riel , Oleg Nesterov , KAMEZAWA Hiroyuki , Balbir Singh , linux-mm@kvack.org List-ID: On Thu, 3 Jun 2010, KOSAKI Motohiro wrote: > > I would consider what I said above, "when faced with memory pressure from > > 
an out of control or memory hogging task on the desktop, the oom killer > > now kills it instead of a vital task such as an X server because of the > > use of the task's rss instead of total_vm statistic" as an improvement > > over killing X in those cases which it currently does. How do you > > disagree? > > People observed that a simple s/total_vm/rss/ patch solves the X issue. It doesn't; you need to consider swap as well. > Then, > the other additional pieces need to explain why they are necessary and > how to confirm it. > Are you talking about oom_score_adj? Please read the patch description. > In other words, I'm sure I'll continue to get OOM bug reports in the future. > I'll need to decide whether or not to revert each patch. Having no information is > unwelcome. Also, that's the reason why an all-of-rewrite patch is wrong: > if it is merged, a small bug report is eventually going to force > reverting all of it. That doesn't fit our development process. > You're speculating that a new problem will be introduced with this change that you cannot describe but are concerned that you won't be able to debug that unknown issue without simply reverting the entire change? These "nack"ing reasons of yours are getting more and more interesting, I must say. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail203.messagelabs.com (mail203.messagelabs.com [216.82.254.243]) by kanga.kvack.org (Postfix) with ESMTP id 9C90D6B01B2 for ; Thu, 3 Jun 2010 16:33:28 -0400 (EDT) Received: from kpbe14.cbf.corp.google.com (kpbe14.cbf.corp.google.com [172.25.105.78]) by smtp-out.google.com with ESMTP id o53KXOSV022933 for ; Thu, 3 Jun 2010 13:33:24 -0700 Received: from pvg16 (pvg16.prod.google.com [10.241.210.144]) by kpbe14.cbf.corp.google.com with ESMTP id o53KWspE025307 for ; Thu, 3 Jun 2010 13:33:22 -0700 Received: by pvg16 with SMTP id 16so298514pvg.33 for ; Thu, 03 Jun 2010 13:33:22 -0700 (PDT) Date: Thu, 3 Jun 2010 13:33:19 -0700 (PDT) From: David Rientjes Subject: Re: [patch -mm 09/18] oom: add forkbomb penalty to badness heuristic In-Reply-To: Message-ID: References: <20100601163705.2460.A69D9226@jp.fujitsu.com> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-linux-mm@kvack.org To: KOSAKI Motohiro Cc: Andrew Morton , Rik van Riel , Nick Piggin , Oleg Nesterov , KAMEZAWA Hiroyuki , Balbir Singh , linux-mm@kvack.org List-ID: On Tue, 1 Jun 2010, David Rientjes wrote: > On Tue, 1 Jun 2010, KOSAKI Motohiro wrote: > > > > Add a forkbomb penalty for processes that fork an excessively large > > > number of children to penalize that group of tasks and not others. A > > > threshold is configurable from userspace to determine how many first- > > > generation execve children (those with their own address spaces) a task > > > may have before it is considered a forkbomb. This can be tuned by > > > altering the value in /proc/sys/vm/oom_forkbomb_thres, which defaults to > > > 1000. 
> > >
> > > When a task has more than 1000 first-generation children with different
> > > address spaces than itself, a penalty of
> > >
> > >     (average rss of children) * (# of 1st generation execve children)
> > >     -----------------------------------------------------------------
> > >                           oom_forkbomb_thres
> > >
> > > is assessed. So, for example, using the default oom_forkbomb_thres of
> > > 1000, the penalty is twice the average rss of all its execve children if
> > > there are 2000 such tasks. A task is considered to count toward the
> > > threshold if its total runtime is less than one second; for 1000 of such
> > > tasks to exist, the parent process must be forking at an extremely high
> > > rate either erroneously or maliciously.
> > >
> > > Even though a particular task may be designated a forkbomb and selected as
> > > the victim, the oom killer will still kill the 1st generation execve child
> > > with the highest badness() score in its place. This avoids killing
> > > important servers or system daemons. When a web server forks a very large
> > > number of threads for client connections, for example, it is much better
> > > to kill one of those threads than to kill the server and make it
> > > unresponsive.
> > >
> > > [oleg@redhat.com: optimize task_lock when iterating children]
> > > Signed-off-by: David Rientjes
> >
> > nack
> >
>
> Why?
>

Still waiting for an answer to this, KOSAKI.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ .
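[Editorial sketch, not part of the thread.] The forkbomb penalty quoted above can be worked through with a small userspace model. This is only an illustration of the arithmetic in the patch description, not the kernel code from the patch: struct child_info and forkbomb_penalty() are hypothetical names, rss is counted in pages, and it is an assumption that the children counted toward the threshold are the same ones whose rss is averaged.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified view of one first-generation child. */
struct child_info {
	unsigned long rss_pages;	/* resident set size, in pages */
	double runtime_sec;		/* total runtime so far */
	int own_mm;			/* nonzero if its address space differs from the parent's */
};

/*
 * Model of the penalty described above:
 *
 *   (average rss of children) * (number of counted children) / thres
 *
 * A child is counted only if it has its own address space and has run
 * for less than one second.  No penalty is assessed unless more than
 * `thres` children are counted.
 */
unsigned long forkbomb_penalty(const struct child_info *kids, size_t nr,
			       unsigned long thres)
{
	unsigned long count = 0, total_rss = 0, avg_rss;
	size_t i;

	assert(thres > 0);	/* a zero threshold would divide by zero */

	for (i = 0; i < nr; i++) {
		if (!kids[i].own_mm || kids[i].runtime_sec >= 1.0)
			continue;
		count++;
		total_rss += kids[i].rss_pages;
	}

	if (count <= thres)
		return 0;

	avg_rss = total_rss / count;
	return avg_rss * count / thres;
}
```

With the default threshold of 1000 and 2000 short-lived children of 100 pages each, this yields a penalty of 200 pages, which is twice the average child rss and matches the example in the description. Exactly 1000 such children yield no penalty, because the description requires more than the threshold.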
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail143.messagelabs.com (mail143.messagelabs.com [216.82.254.35]) by kanga.kvack.org (Postfix) with ESMTP id 655266B01AD for ; Thu, 3 Jun 2010 19:11:23 -0400 (EDT) Date: Thu, 3 Jun 2010 16:10:30 -0700 From: Andrew Morton Subject: Re: [patch -mm 08/18] oom: badness heuristic rewrite Message-Id: <20100603161030.074d9b98.akpm@linux-foundation.org> In-Reply-To: <20100602225252.F536.A69D9226@jp.fujitsu.com> References: <20100601163627.245D.A69D9226@jp.fujitsu.com> <20100602225252.F536.A69D9226@jp.fujitsu.com> Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org To: KOSAKI Motohiro Cc: David Rientjes , Rik van Riel , Nick Piggin , Oleg Nesterov , KAMEZAWA Hiroyuki , Balbir Singh , linux-mm@kvack.org List-ID: On Wed, 2 Jun 2010 22:54:03 +0900 (JST) KOSAKI Motohiro wrote: > > Why? > > > > If it's because the patch is too big, I've explained a few times that > > functionally you can't break it apart into anything meaningful. I do not > > believe it is better to break functional changes into smaller patches that > > simply change function signatures to pass additional arguments that are > > unused in the first patch, for example. > > > > If it's because it adds /proc/pid/oom_score_adj in the same patch, that's > > allowed since otherwise it would be useless with the old heuristic. In > > other words, you cannot apply oom_score_adj's meaning to the bitshift in > > any sane way. > > > > I'll suggest what I have multiple times: the easiest way to review the > > functional change here is to merge the patch into your own tree and then > > review oom_badness(). I agree that the way the diff comes out it is a > > little difficult to read just from the patch form, so merging it and > > reviewing the actual heuristic function is the easiest way. > > I've already explained the reason. 
1) all-of-rewrite patches are > always unacceptable. that prevents our code maintenance. No, we'll sometimes completely replace implementations. There's no hard rule apart from "whatever makes sense". If wholesale replacement makes sense as a patch-presentation method then we'll do that. > 2) no justification > patches are also unacceptable. you need to write more proper patch description > at least. The descriptions look better than usual from a quick scan. I haven't really got into them yet. And I'm going to have to get into it because of you guys' seeming inability to get your act together. The unsubstantiated "nack"s are of no use and I shall just be ignoring them and making my own decisions. If you have specific objections then let's hear them. In detail, please - don't refer to previous conversations because that's all too confusing - there is benefit in starting again. I expect I'll be looking at the oom-killer situation in depth early next week. It would be useful if between now and then you can send any specific, detailed and actionable comments which you have. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail138.messagelabs.com (mail138.messagelabs.com [216.82.249.35]) by kanga.kvack.org (Postfix) with ESMTP id 3EF7F6B01AF for ; Thu, 3 Jun 2010 19:15:50 -0400 (EDT) Date: Thu, 3 Jun 2010 16:15:32 -0700 From: Andrew Morton Subject: Re: [patch -mm 08/18] oom: badness heuristic rewrite Message-Id: <20100603161532.8e41b42a.akpm@linux-foundation.org> In-Reply-To: <20100603104314.723D.A69D9226@jp.fujitsu.com> References: <20100602222347.F527.A69D9226@jp.fujitsu.com> <20100603104314.723D.A69D9226@jp.fujitsu.com> Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org To: KOSAKI Motohiro Cc: David Rientjes , Nick Piggin , Rik van Riel , Oleg Nesterov , KAMEZAWA Hiroyuki , Balbir Singh , linux-mm@kvack.org List-ID: On Thu, 3 Jun 2010 12:07:50 +0900 (JST) KOSAKI Motohiro wrote: > In other word, I'm sure I'll continue to get OOM bug report in future. You must have some reason for believing that. Please share it with us. Even better: apply the patches and run some tests. If you believe there are new failure modes then surely you can quickly prepare a testcase which demonstrates them. Or just suggest a test case - I expect David will be able to test it. Again: without hard, tangible engineering facts I cannot take comments such as the above into account. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail143.messagelabs.com (mail143.messagelabs.com [216.82.254.35]) by kanga.kvack.org (Postfix) with SMTP id 069806B01AD for ; Thu, 3 Jun 2010 19:58:12 -0400 (EDT) Received: from m1.gw.fujitsu.co.jp ([10.0.50.71]) by fgwmail6.fujitsu.co.jp (Fujitsu Gateway) with ESMTP id o53Nw9hK008290 for (envelope-from kamezawa.hiroyu@jp.fujitsu.com); Fri, 4 Jun 2010 08:58:09 +0900 Received: from smail (m1 [127.0.0.1]) by outgoing.m1.gw.fujitsu.co.jp (Postfix) with ESMTP id 621B845DE4E for ; Fri, 4 Jun 2010 08:58:09 +0900 (JST) Received: from s1.gw.fujitsu.co.jp (s1.gw.fujitsu.co.jp [10.0.50.91]) by m1.gw.fujitsu.co.jp (Postfix) with ESMTP id 41E6F45DE53 for ; Fri, 4 Jun 2010 08:58:09 +0900 (JST) Received: from s1.gw.fujitsu.co.jp (localhost.localdomain [127.0.0.1]) by s1.gw.fujitsu.co.jp (Postfix) with ESMTP id 27AAD1DB8048 for ; Fri, 4 Jun 2010 08:58:09 +0900 (JST) Received: from m105.s.css.fujitsu.com (m105.s.css.fujitsu.com [10.249.87.105]) by s1.gw.fujitsu.co.jp (Postfix) with ESMTP id D14031DB804F for ; Fri, 4 Jun 2010 08:58:05 +0900 (JST) Date: Fri, 4 Jun 2010 08:53:47 +0900 From: KAMEZAWA Hiroyuki Subject: Re: [patch -mm 08/18] oom: badness heuristic rewrite Message-Id: <20100604085347.80c7b43f.kamezawa.hiroyu@jp.fujitsu.com> In-Reply-To: <20100603161030.074d9b98.akpm@linux-foundation.org> References: <20100601163627.245D.A69D9226@jp.fujitsu.com> <20100602225252.F536.A69D9226@jp.fujitsu.com> <20100603161030.074d9b98.akpm@linux-foundation.org> Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org To: Andrew Morton Cc: KOSAKI Motohiro , David Rientjes , Rik van Riel , Nick Piggin , Oleg Nesterov , Balbir Singh , linux-mm@kvack.org List-ID: On Thu, 3 Jun 2010 16:10:30 -0700 Andrew Morton wrote: > On Wed, 2 Jun 2010 22:54:03 +0900 (JST) > KOSAKI Motohiro wrote: > > > > Why? 
> > > > > > If it's because the patch is too big, I've explained a few times that > > > functionally you can't break it apart into anything meaningful. I do not > > > believe it is better to break functional changes into smaller patches that > > > simply change function signatures to pass additional arguments that are > > > unused in the first patch, for example. > > > > > > If it's because it adds /proc/pid/oom_score_adj in the same patch, that's > > > allowed since otherwise it would be useless with the old heuristic. In > > > other words, you cannot apply oom_score_adj's meaning to the bitshift in > > > any sane way. > > > > > > I'll suggest what I have multiple times: the easiest way to review the > > > functional change here is to merge the patch into your own tree and then > > > review oom_badness(). I agree that the way the diff comes out it is a > > > little difficult to read just from the patch form, so merging it and > > > reviewing the actual heuristic function is the easiest way. > > > > I've already explained the reason. 1) all-of-rewrite patches are > > always unacceptable. that's prevent our code maintainance. > > No, we'll sometime completely replace implementations. There's no hard > rule apart from "whatever makes sense". If wholesale replacement makes > sense as a patch-presentation method then we'll do that. > I agree, IMHO. But this series mixes bug fixes and new features at random, so even small bugfixes that don't require refactoring appear to depend on it. That irritates people (at least me) because it looks as if he is trying to sneak his own new logic in with the bugfixes and, moreover, it makes backporting to distros difficult. I'd like to ask him to separate them into two series: bugfixes and new work. Thanks, -Kame -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ .
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail191.messagelabs.com (mail191.messagelabs.com [216.82.242.19]) by kanga.kvack.org (Postfix) with ESMTP id 341526B01AD for ; Thu, 3 Jun 2010 20:05:03 -0400 (EDT) Date: Thu, 3 Jun 2010 17:04:43 -0700 From: Andrew Morton Subject: Re: [patch -mm 08/18] oom: badness heuristic rewrite Message-Id: <20100603170443.011fdf7c.akpm@linux-foundation.org> In-Reply-To: <20100604085347.80c7b43f.kamezawa.hiroyu@jp.fujitsu.com> References: <20100601163627.245D.A69D9226@jp.fujitsu.com> <20100602225252.F536.A69D9226@jp.fujitsu.com> <20100603161030.074d9b98.akpm@linux-foundation.org> <20100604085347.80c7b43f.kamezawa.hiroyu@jp.fujitsu.com> Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org To: KAMEZAWA Hiroyuki Cc: KOSAKI Motohiro , David Rientjes , Rik van Riel , Nick Piggin , Oleg Nesterov , Balbir Singh , linux-mm@kvack.org List-ID: On Fri, 4 Jun 2010 08:53:47 +0900 KAMEZAWA Hiroyuki wrote: > On Thu, 3 Jun 2010 16:10:30 -0700 > Andrew Morton wrote: > > > On Wed, 2 Jun 2010 22:54:03 +0900 (JST) > > KOSAKI Motohiro wrote: > > > > > > Why? > > > > > > > > If it's because the patch is too big, I've explained a few times that > > > > functionally you can't break it apart into anything meaningful. I do not > > > > believe it is better to break functional changes into smaller patches that > > > > simply change function signatures to pass additional arguments that are > > > > unused in the first patch, for example. > > > > > > > > If it's because it adds /proc/pid/oom_score_adj in the same patch, that's > > > > allowed since otherwise it would be useless with the old heuristic. In > > > > other words, you cannot apply oom_score_adj's meaning to the bitshift in > > > > any sane way. 
> > > > > > > > I'll suggest what I have multiple times: the easiest way to review the > > > > functional change here is to merge the patch into your own tree and then > > > > review oom_badness(). I agree that the way the diff comes out it is a > > > > little difficult to read just from the patch form, so merging it and > > > > reviewing the actual heuristic function is the easiest way. > > > > > > I've already explained the reason. 1) all-of-rewrite patches are > > > always unacceptable. that's prevent our code maintainance. > > > > No, we'll sometime completely replace implementations. There's no hard > > rule apart from "whatever makes sense". If wholesale replacement makes > > sense as a patch-presentation method then we'll do that. > > > I agree. > > IMHO. > > But this series includes both of bug fixes and new features at random. > Then, a small bugfixes, which doens't require refactoring, seems to do that. > That's irritating guys (at least me) because it seems that he tries to sneak > his own new logic into bugfix and moreover, it makes backport to distro difficult. > I'd like to beg him separate them into 2 series as bugfix and something new. > Sure, bugfixes should come separately and first. For a number of reasons: - people (including the -stable maintainers) might want to backport them - we might end up not merging the larger, bugfix-including patches at all - the large bugfix-including patches might blow up and need reverting. If we do that, we accidentally revert bugfixes! Have we identified specifically which bugfixes should be separated out in this fashion? -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail202.messagelabs.com (mail202.messagelabs.com [216.82.254.227]) by kanga.kvack.org (Postfix) with SMTP id 34BEB6B01AD for ; Thu, 3 Jun 2010 20:25:09 -0400 (EDT) Received: from m6.gw.fujitsu.co.jp ([10.0.50.76]) by fgwmail5.fujitsu.co.jp (Fujitsu Gateway) with ESMTP id o540P6Lo008092 for (envelope-from kamezawa.hiroyu@jp.fujitsu.com); Fri, 4 Jun 2010 09:25:06 +0900 Received: from smail (m6 [127.0.0.1]) by outgoing.m6.gw.fujitsu.co.jp (Postfix) with ESMTP id 3F84145DE50 for ; Fri, 4 Jun 2010 09:25:06 +0900 (JST) Received: from s6.gw.fujitsu.co.jp (s6.gw.fujitsu.co.jp [10.0.50.96]) by m6.gw.fujitsu.co.jp (Postfix) with ESMTP id 2557345DE4C for ; Fri, 4 Jun 2010 09:25:06 +0900 (JST) Received: from s6.gw.fujitsu.co.jp (localhost.localdomain [127.0.0.1]) by s6.gw.fujitsu.co.jp (Postfix) with ESMTP id F30051DB8015 for ; Fri, 4 Jun 2010 09:25:05 +0900 (JST) Received: from m106.s.css.fujitsu.com (m106.s.css.fujitsu.com [10.249.87.106]) by s6.gw.fujitsu.co.jp (Postfix) with ESMTP id AC5CC1DB8012 for ; Fri, 4 Jun 2010 09:25:05 +0900 (JST) Date: Fri, 4 Jun 2010 09:20:47 +0900 From: KAMEZAWA Hiroyuki Subject: Re: [patch -mm 08/18] oom: badness heuristic rewrite Message-Id: <20100604092047.7b7d7bb1.kamezawa.hiroyu@jp.fujitsu.com> In-Reply-To: <20100603170443.011fdf7c.akpm@linux-foundation.org> References: <20100601163627.245D.A69D9226@jp.fujitsu.com> <20100602225252.F536.A69D9226@jp.fujitsu.com> <20100603161030.074d9b98.akpm@linux-foundation.org> <20100604085347.80c7b43f.kamezawa.hiroyu@jp.fujitsu.com> <20100603170443.011fdf7c.akpm@linux-foundation.org> Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org To: Andrew Morton Cc: KOSAKI Motohiro , David Rientjes , Rik van Riel , Nick Piggin , Oleg Nesterov , Balbir Singh , linux-mm@kvack.org List-ID: On Thu, 3 Jun 2010 17:04:43 -0700 Andrew Morton wrote: > Sure, 
bugfixes should come separately and first. For a number of > reasons: > > - people (including the -stable maintainers) might want to backport them > > - we might end up not merging the larger, bugfix-including patches at all > > - the large bugfix-including patches might blow up and need > reverting. If we do that, we accidentally revert bugfixes! > > Have we identified specifically which bugfixes should be separated out > in this fashion? > In my personal observation: [1/18] for better behavior under cpuset. [2/18] for better behavior under cpuset. [3/18] for better behavior under mempolicy. [4/18] refactoring. [5/18] refactoring. [6/18] clean up. [7/18] changing the default sysctl value. [8/18] completely new logic. [9/18] completely new logic. [10/18] a supplement for 8,9. [11/18] for better behavior under lowmem oom (disable oom kill) [12/18] clean up [13/18] bugfix for a possible race condition. (I'm not sure about details) [14/18] bugfix [15/18] bugfix [16/18] bugfix [17/18] bugfix [18/18] clean up. If distro admins are aggressive, they may backport 1,2,3,7,11, but that changes the current logic. So it's the distro's decision. Thanks, -Kame -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ .
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail138.messagelabs.com (mail138.messagelabs.com [216.82.249.35]) by kanga.kvack.org (Postfix) with SMTP id C02416B01AD for ; Fri, 4 Jun 2010 02:01:48 -0400 (EDT) Received: from m4.gw.fujitsu.co.jp ([10.0.50.74]) by fgwmail7.fujitsu.co.jp (Fujitsu Gateway) with ESMTP id o5461ktJ023676 for (envelope-from kamezawa.hiroyu@jp.fujitsu.com); Fri, 4 Jun 2010 15:01:46 +0900 Received: from smail (m4 [127.0.0.1]) by outgoing.m4.gw.fujitsu.co.jp (Postfix) with ESMTP id 58CDF45DE6F for ; Fri, 4 Jun 2010 15:01:46 +0900 (JST) Received: from s4.gw.fujitsu.co.jp (s4.gw.fujitsu.co.jp [10.0.50.94]) by m4.gw.fujitsu.co.jp (Postfix) with ESMTP id 2938245DE60 for ; Fri, 4 Jun 2010 15:01:46 +0900 (JST) Received: from s4.gw.fujitsu.co.jp (localhost.localdomain [127.0.0.1]) by s4.gw.fujitsu.co.jp (Postfix) with ESMTP id 0ED831DB8037 for ; Fri, 4 Jun 2010 15:01:46 +0900 (JST) Received: from ml13.s.css.fujitsu.com (ml13.s.css.fujitsu.com [10.249.87.103]) by s4.gw.fujitsu.co.jp (Postfix) with ESMTP id A7E20E38004 for ; Fri, 4 Jun 2010 15:01:42 +0900 (JST) Date: Fri, 4 Jun 2010 14:57:23 +0900 From: KAMEZAWA Hiroyuki Subject: Re: [patch -mm 08/18] oom: badness heuristic rewrite Message-Id: <20100604145723.e16d7fe0.kamezawa.hiroyu@jp.fujitsu.com> In-Reply-To: <20100604092047.7b7d7bb1.kamezawa.hiroyu@jp.fujitsu.com> References: <20100601163627.245D.A69D9226@jp.fujitsu.com> <20100602225252.F536.A69D9226@jp.fujitsu.com> <20100603161030.074d9b98.akpm@linux-foundation.org> <20100604085347.80c7b43f.kamezawa.hiroyu@jp.fujitsu.com> <20100603170443.011fdf7c.akpm@linux-foundation.org> <20100604092047.7b7d7bb1.kamezawa.hiroyu@jp.fujitsu.com> Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org To: KAMEZAWA Hiroyuki Cc: Andrew Morton , KOSAKI Motohiro , David Rientjes , Rik van Riel , Nick Piggin , Oleg Nesterov , Balbir Singh , 
linux-mm@kvack.org List-ID: On Fri, 4 Jun 2010 09:20:47 +0900 KAMEZAWA Hiroyuki wrote: > On Thu, 3 Jun 2010 17:04:43 -0700 > Andrew Morton wrote: > > > Sure, bugfixes should come separately and first. For a number of > > reasons: > > > > - people (including the -stable maintainers) might want to backport them > > > > - we might end up not merging the larger, bugfix-including patches at all > > > > - the large bugfix-including patches might blow up and need > > reverting. If we do that, we accidentally revert bugfixes! > > > > Have we identified specifically which bugfixes should be separated out > > in this fashion? > > > > In my personal observation > > [1/18] for better behavior under cpuset. > [2/18] for better behavior under cpuset. > [3/18] for better behavior under mempolicy. > [4/18] refactoring. > [5/18] refactoring. > [6/18] clean up. > [7/18] changing the deault sysctl value. > [8/18] completely new logic. > [9/18] completely new logic. > [10/18] a supplement for 8,9. > [11/18] for better behavior under lowmem oom (disable oom kill) > [12/18] clean up > [13/18] bugfix for a possible race condition. (I'm not sure about details) > [14/18] bugfix > [15/18] bugfix > [16/18] bugfix > [17/18] bugfix > [18/18] clean up. > > If distro admins are aggresive, them may backport 1,2,3,7,11 but > it changes current logic. So, it's distro's decision. > IMHO, without considering the individual hunks, the patch order should be 13,14,15,16,17,1,2,3,7,11,4,5,6,18,12,8,9,10, that is: bugfixes -> patches that improve behavior -> refactoring -> the new implementation. David, I have no objection to the functionality itself. But please start with the small, good things. "Refactoring" is good, but it tends to make backporting less straightforward, so I think it should be done once there are no known issues. I think you can do it. Bye, -Kame -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ .
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail138.messagelabs.com (mail138.messagelabs.com [216.82.249.35]) by kanga.kvack.org (Postfix) with ESMTP id B0EBD6B01AD for ; Fri, 4 Jun 2010 05:19:14 -0400 (EDT) Received: from kpbe19.cbf.corp.google.com (kpbe19.cbf.corp.google.com [172.25.105.83]) by smtp-out.google.com with ESMTP id o549J8G8018617 for ; Fri, 4 Jun 2010 02:19:09 -0700 Received: from pvh11 (pvh11.prod.google.com [10.241.210.203]) by kpbe19.cbf.corp.google.com with ESMTP id o549J7Hb022126 for ; Fri, 4 Jun 2010 02:19:07 -0700 Received: by pvh11 with SMTP id 11so627433pvh.13 for ; Fri, 04 Jun 2010 02:19:07 -0700 (PDT) Date: Fri, 4 Jun 2010 02:19:01 -0700 (PDT) From: David Rientjes Subject: Re: [patch -mm 08/18] oom: badness heuristic rewrite In-Reply-To: <20100604085347.80c7b43f.kamezawa.hiroyu@jp.fujitsu.com> Message-ID: References: <20100601163627.245D.A69D9226@jp.fujitsu.com> <20100602225252.F536.A69D9226@jp.fujitsu.com> <20100603161030.074d9b98.akpm@linux-foundation.org> <20100604085347.80c7b43f.kamezawa.hiroyu@jp.fujitsu.com> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-linux-mm@kvack.org To: KAMEZAWA Hiroyuki Cc: Andrew Morton , KOSAKI Motohiro , Rik van Riel , Nick Piggin , Oleg Nesterov , Balbir Singh , linux-mm@kvack.org List-ID: On Fri, 4 Jun 2010, KAMEZAWA Hiroyuki wrote: > > No, we'll sometime completely replace implementations. There's no hard > > rule apart from "whatever makes sense". If wholesale replacement makes > > sense as a patch-presentation method then we'll do that. > > > I agree. > > IMHO. > > But this series includes both of bug fixes and new features at random. > Then, a small bugfixes, which doens't require refactoring, seems to do that. > That's irritating guys (at least me) because it seems that he tries to sneak > his own new logic into bugfix and moreover, it makes backport to distro difficult. 
I'll reply to your proposed patch order in your other email, but please don't think that I'm trying to sneak anything in with this series :) It's been posted here for months and everything has been fully open to review and comment. Most of the patches that have been added on after the heuristic rewrite were things that came up later in testing and inspection, so I understand how the series has a somewhat awkward flow. I'll fix that. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail144.messagelabs.com (mail144.messagelabs.com [216.82.254.51]) by kanga.kvack.org (Postfix) with ESMTP id A84526B01B0 for ; Fri, 4 Jun 2010 05:22:56 -0400 (EDT) Received: from wpaz5.hot.corp.google.com (wpaz5.hot.corp.google.com [172.24.198.69]) by smtp-out.google.com with ESMTP id o549MqeJ032676 for ; Fri, 4 Jun 2010 02:22:52 -0700 Received: from pzk9 (pzk9.prod.google.com [10.243.19.137]) by wpaz5.hot.corp.google.com with ESMTP id o549MpqR008537 for ; Fri, 4 Jun 2010 02:22:51 -0700 Received: by pzk9 with SMTP id 9so628240pzk.18 for ; Fri, 04 Jun 2010 02:22:51 -0700 (PDT) Date: Fri, 4 Jun 2010 02:22:48 -0700 (PDT) From: David Rientjes Subject: Re: [patch -mm 08/18] oom: badness heuristic rewrite In-Reply-To: <20100604145723.e16d7fe0.kamezawa.hiroyu@jp.fujitsu.com> Message-ID: References: <20100601163627.245D.A69D9226@jp.fujitsu.com> <20100602225252.F536.A69D9226@jp.fujitsu.com> <20100603161030.074d9b98.akpm@linux-foundation.org> <20100604085347.80c7b43f.kamezawa.hiroyu@jp.fujitsu.com> <20100603170443.011fdf7c.akpm@linux-foundation.org> <20100604092047.7b7d7bb1.kamezawa.hiroyu@jp.fujitsu.com> <20100604145723.e16d7fe0.kamezawa.hiroyu@jp.fujitsu.com> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-linux-mm@kvack.org To: KAMEZAWA Hiroyuki Cc: Andrew Morton , 
KOSAKI Motohiro , Rik van Riel , Nick Piggin , Oleg Nesterov , Balbir Singh , linux-mm@kvack.org List-ID: On Fri, 4 Jun 2010, KAMEZAWA Hiroyuki wrote: > > In my personal observation > > > > [1/18] for better behavior under cpuset. > > [2/18] for better behavior under cpuset. > > [3/18] for better behavior under mempolicy. > > [4/18] refactoring. > > [5/18] refactoring. > > [6/18] clean up. > > [7/18] changing the deault sysctl value. > > [8/18] completely new logic. > > [9/18] completely new logic. > > [10/18] a supplement for 8,9. > > [11/18] for better behavior under lowmem oom (disable oom kill) > > [12/18] clean up > > [13/18] bugfix for a possible race condition. (I'm not sure about details) > > [14/18] bugfix > > [15/18] bugfix > > [16/18] bugfix > > [17/18] bugfix > > [18/18] clean up. > > > > If distro admins are aggresive, them may backport 1,2,3,7,11 but > > it changes current logic. So, it's distro's decision. > > > > IMHO, without considering HUNKs, the patch order should be > > 13,14,15,16,17,1,2,3,7,11,4,5,6,18,12,8,9,10. > > bugfix -> patches for things making better -> refactoring -> the new implementation. > Thank you very much for taking the time to look through each individual patch and suggest a different order. If the ordering of the patches will help move us forward, then I'd be extremely happy to do it :) > David, I have no objections to functions itself. But please start from small > good things. "Refactoring" is good but it tend to make backporting > not-straightforward. So, I think it should be done when there is no known issues. > I think you can do. > I'll reorganize the patchset itself without any implementation changes so it flows better and is more appropriately separated as you suggest. I still believe there is no -rc material within this series (implying there is no -stable material either), but if you believe there is, then please reply to those patches with the new posting so Andrew can consider pushing it to Linus. Thanks, Kame.
-- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail172.messagelabs.com (mail172.messagelabs.com [216.82.254.3]) by kanga.kvack.org (Postfix) with SMTP id E49E06B01B6 for ; Fri, 4 Jun 2010 05:44:58 -0400 (EDT) Date: Fri, 4 Jun 2010 11:43:32 +0200 From: Oleg Nesterov Subject: Re: [patch -mm 08/18] oom: badness heuristic rewrite Message-ID: <20100604094332.GA8569@redhat.com> References: <20100601163627.245D.A69D9226@jp.fujitsu.com> <20100602225252.F536.A69D9226@jp.fujitsu.com> <20100603161030.074d9b98.akpm@linux-foundation.org> <20100604085347.80c7b43f.kamezawa.hiroyu@jp.fujitsu.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20100604085347.80c7b43f.kamezawa.hiroyu@jp.fujitsu.com> Sender: owner-linux-mm@kvack.org To: KAMEZAWA Hiroyuki Cc: Andrew Morton , KOSAKI Motohiro , David Rientjes , Rik van Riel , Nick Piggin , Balbir Singh , linux-mm@kvack.org List-ID: On 06/04, KAMEZAWA Hiroyuki wrote: > > On Thu, 3 Jun 2010 16:10:30 -0700 > Andrew Morton wrote: > > But this series includes both of bug fixes and new features at random. > Then, a small bugfixes, which doens't require refactoring, seems to do that. > That's irritating guys (at least me) Me too. And Kosaki tries to fix these long-standing (and obvious) bugs first, before refactoring. So far (iiuc) David technically disagrees with the single patch which removes the PF_EXITING check. OK, probably it needs more discussion (once again: I can't judge, but I understand why Kosaki removed it). Oleg. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail190.messagelabs.com (mail190.messagelabs.com [216.82.249.51]) by kanga.kvack.org (Postfix) with SMTP id CE4206B01AF for ; Fri, 4 Jun 2010 06:54:49 -0400 (EDT) Received: from m1.gw.fujitsu.co.jp ([10.0.50.71]) by fgwmail6.fujitsu.co.jp (Fujitsu Gateway) with ESMTP id o54AsjKq002730 for (envelope-from kosaki.motohiro@jp.fujitsu.com); Fri, 4 Jun 2010 19:54:45 +0900 Received: from smail (m1 [127.0.0.1]) by outgoing.m1.gw.fujitsu.co.jp (Postfix) with ESMTP id 83F1F45DE4F for ; Fri, 4 Jun 2010 19:54:45 +0900 (JST) Received: from s1.gw.fujitsu.co.jp (s1.gw.fujitsu.co.jp [10.0.50.91]) by m1.gw.fujitsu.co.jp (Postfix) with ESMTP id 6216445DE4E for ; Fri, 4 Jun 2010 19:54:45 +0900 (JST) Received: from s1.gw.fujitsu.co.jp (localhost.localdomain [127.0.0.1]) by s1.gw.fujitsu.co.jp (Postfix) with ESMTP id 4EED01DB8050 for ; Fri, 4 Jun 2010 19:54:45 +0900 (JST) Received: from ml14.s.css.fujitsu.com (ml14.s.css.fujitsu.com [10.249.87.104]) by s1.gw.fujitsu.co.jp (Postfix) with ESMTP id 036A61DB8048 for ; Fri, 4 Jun 2010 19:54:45 +0900 (JST) From: KOSAKI Motohiro Subject: Re: [patch -mm 08/18] oom: badness heuristic rewrite In-Reply-To: <20100603161030.074d9b98.akpm@linux-foundation.org> References: <20100602225252.F536.A69D9226@jp.fujitsu.com> <20100603161030.074d9b98.akpm@linux-foundation.org> Message-Id: <20100604195328.72D9.A69D9226@jp.fujitsu.com> MIME-Version: 1.0 Content-Type: text/plain; charset="ISO-2022-JP" Content-Transfer-Encoding: 7bit Date: Fri, 4 Jun 2010 19:54:44 +0900 (JST) Sender: owner-linux-mm@kvack.org To: Andrew Morton Cc: kosaki.motohiro@jp.fujitsu.com, David Rientjes , Rik van Riel , Nick Piggin , Oleg Nesterov , KAMEZAWA Hiroyuki , Balbir Singh , linux-mm@kvack.org List-ID: Hi Andrew, > > I've already explained the reason. 1) all-of-rewrite patches are > > always unacceptable. that's prevent our code maintainance. 
> > No, we'll sometime completely replace implementations. There's no hard > rule apart from "whatever makes sense". If wholesale replacement makes > sense as a patch-presentation method then we'll do that. Have you reviewed the actual patches? And no, I don't think "complete replacement with no test results" is an adequate way to develop. When developers post a large patch set, usually _you_ ask them to show results that demonstrate it. I haven't seen such results in this effort. I agree that the OOM killer is invoked from various call sites (because the page allocator is called from various places), is triggered by various kinds of memory starvation, and that the killable userland processes also vary widely. So I don't think the patch author must do a 100% coverage test. And I can say that I made some brief test cases to confirm this and I haven't seen a critical fault. However, that doesn't give any reason to avoid code review and violate our development process. > > 2) no justification > > patches are also unacceptable. you need to write more proper patch descriptaion > > at least. > > The descriptions look better than usual from a quick scan. I haven't > really got into them yet. > > > And I'm going to have to get into it because of you guys' seeming > inability to get your act together. Inability? What do you mean by inability? Almost all developers cooperate to make a stable kernel. Is this effort inability? Or meaningless? Actually, the descriptions don't really look better. We have sometimes asked him - which problem occurs? how do you reproduce it? - which piece solves which issue? - how do you measure side effects? - how do you measure or consider users of other workloads? But the only answer I got was "My patch is best. Resistance is futile". That's purely Baaaad. At the least, every patch author must write down the intention of the code. Otherwise how do we review such code? Guessing the intention often leads to misreading the code and lets bugs slip in. If the patch is small enough, it is not a big problem; we don't misread so often.
But if it is large, it is a big problem. Again, I don't think it is impossible to separate the patch into individual parts, and I don't think it is impossible to write down the intention of each change. > The unsubstantiated "nack"s are of no use and I shall just be ignoring > them and making my own decisions. If you have specific objections then > let's hear them. In detail, please - don't refer to previous > conversations because that's all too confusing - there is benefit in > starting again. OK. I have no reason to confuse you. I'll correct myself. My point is really simple. The majority of OOM users are on the desktop. We must not ignore them. That means: - Any regression from the desktop point of view is unacceptable - Any incompatibility that brings no desktop improvement is unacceptable - Any refusal to fix bugs is unacceptable - Any refusal of review is unacceptable (IOW, the patches must get some developer's ack; I'm OK even if that doesn't include mine) In other words, every heuristic change has to come with an explanation of why the patch improves the desktop or has no side effects on it. (Ah, OK, the cpuset changes are an exception; desktop users definitely don't use cpusets.) I and the other reviewers only want to confirm that the patches have no significant regression. All patch authors have to help with this, I think. > I expect I'll be looking at the oom-killer situation in depth early > next week. It would be useful if between now and then you can send > any specific, detailed and actionable comments which you have. 1) fix bugs first, before making new features (a.k.a. new bugs) 2) don't mix bugfixes and new features 3) separate the series into natural, individual pieces 4) keep patches small and reviewable 5) stop making ugly excuses; instead, rewrite repeatedly until someone acks 6) don't ignore other developers' bug reports Which of these is unactionable? I just don't understand :/ I didn't want to say the same thing twice, and he repeatedly ignored my opinion; thus, he got short answers. I didn't think that was inadequate because he can google the past mail.
The fact is, I and (I'm guessing) all the other developers don't get any pressure from our companies, because enterprise vendors aren't interested in oom. We are making time by cutting into our private time to help improve his patches, because we know the current oom logic doesn't fit today's modern desktop environment and we sincerely hope to remove that harm. However, he repeatedly attacks our goodwill and blames our tolerance, and also repeatedly says "My workload is more important than others!". So I got really upset. The fact is, no good developer ever says "my workload is the most important in the world"; it makes no sense and is insane. I really hate such selfishness. And no, I don't want to continue a full review while the author refuses to listen. You must be kidding me. Instead, I'll cherry-pick the good pieces from the sludge of random patches and push them to you. I think that makes everybody happy: people get the improvements, DaveR gets the merge, and I'll be free from this source of frustration. Of course, I'll reflect your review results if you can find time to review. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ .
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail203.messagelabs.com (mail203.messagelabs.com [216.82.254.243]) by kanga.kvack.org (Postfix) with SMTP id 3853B6B01B9 for ; Fri, 4 Jun 2010 06:54:50 -0400 (EDT) Received: from m6.gw.fujitsu.co.jp ([10.0.50.76]) by fgwmail5.fujitsu.co.jp (Fujitsu Gateway) with ESMTP id o54AslQB025940 for (envelope-from kosaki.motohiro@jp.fujitsu.com); Fri, 4 Jun 2010 19:54:47 +0900 Received: from smail (m6 [127.0.0.1]) by outgoing.m6.gw.fujitsu.co.jp (Postfix) with ESMTP id 922D045DE4F for ; Fri, 4 Jun 2010 19:54:47 +0900 (JST) Received: from s6.gw.fujitsu.co.jp (s6.gw.fujitsu.co.jp [10.0.50.96]) by m6.gw.fujitsu.co.jp (Postfix) with ESMTP id 6DCD445DD70 for ; Fri, 4 Jun 2010 19:54:47 +0900 (JST) Received: from s6.gw.fujitsu.co.jp (localhost.localdomain [127.0.0.1]) by s6.gw.fujitsu.co.jp (Postfix) with ESMTP id 4F508E08002 for ; Fri, 4 Jun 2010 19:54:47 +0900 (JST) Received: from ml14.s.css.fujitsu.com (ml14.s.css.fujitsu.com [10.249.87.104]) by s6.gw.fujitsu.co.jp (Postfix) with ESMTP id 0B773E08001 for ; Fri, 4 Jun 2010 19:54:44 +0900 (JST) From: KOSAKI Motohiro Subject: Re: [patch -mm 08/18] oom: badness heuristic rewrite In-Reply-To: <20100603161532.8e41b42a.akpm@linux-foundation.org> References: <20100603104314.723D.A69D9226@jp.fujitsu.com> <20100603161532.8e41b42a.akpm@linux-foundation.org> Message-Id: <20100604172614.72BB.A69D9226@jp.fujitsu.com> MIME-Version: 1.0 Content-Type: text/plain; charset="ISO-2022-JP" Content-Transfer-Encoding: 7bit Date: Fri, 4 Jun 2010 19:54:43 +0900 (JST) Sender: owner-linux-mm@kvack.org To: Andrew Morton Cc: kosaki.motohiro@jp.fujitsu.com, David Rientjes , Nick Piggin , Rik van Riel , Oleg Nesterov , KAMEZAWA Hiroyuki , Balbir Singh , linux-mm@kvack.org List-ID: Hi > > In other word, I'm sure I'll continue to get OOM bug report in future. > > You must have some reason for believing that. Please share it with us. 
In the past, OOM bug reports have never stopped. Why should I believe that a miracle will occur? The fact is, any heuristic change carries risk, because we can't know every use case in the world. So I am not saying that we must not change anything, nor that we must never make a mistake. I only want us to take care to keep things traceable.

> Even better: apply the patches and run some tests. If you believe
> there are new failure modes then surely you can quickly prepare a
> testcase which demonstrates them.
>
> Or just suggest a test case - I expect David will be able to test it.
>
> Again: without hard, tangible engineering facts I cannot take comments
> such as the above into account.

OK. I will also aim to provide good and productive information. But I have a request too. Recently Oleg, mainly, has pointed out some races and heuristic failures. I don't want your engineer to ignore such bug reports. Please help with the bug fixes too. Otherwise, I'll get upset again.

-- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ .
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail190.messagelabs.com (mail190.messagelabs.com [216.82.249.51]) by kanga.kvack.org (Postfix) with ESMTP id 359BC6B01AD for ; Fri, 4 Jun 2010 16:57:31 -0400 (EDT) Received: from hpaq14.eem.corp.google.com (hpaq14.eem.corp.google.com [172.25.149.14]) by smtp-out.google.com with ESMTP id o54KvP8B011275 for ; Fri, 4 Jun 2010 13:57:25 -0700 Received: from pxi6 (pxi6.prod.google.com [10.243.27.6]) by hpaq14.eem.corp.google.com with ESMTP id o54Kv88t024185 for ; Fri, 4 Jun 2010 13:57:24 -0700 Received: by pxi6 with SMTP id 6so484021pxi.29 for ; Fri, 04 Jun 2010 13:57:22 -0700 (PDT) Date: Fri, 4 Jun 2010 13:57:16 -0700 (PDT) From: David Rientjes Subject: Re: [patch -mm 08/18] oom: badness heuristic rewrite In-Reply-To: <20100604195328.72D9.A69D9226@jp.fujitsu.com> Message-ID: References: <20100602225252.F536.A69D9226@jp.fujitsu.com> <20100603161030.074d9b98.akpm@linux-foundation.org> <20100604195328.72D9.A69D9226@jp.fujitsu.com> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-linux-mm@kvack.org To: KOSAKI Motohiro Cc: Andrew Morton , Rik van Riel , Nick Piggin , Oleg Nesterov , KAMEZAWA Hiroyuki , Balbir Singh , linux-mm@kvack.org List-ID: On Fri, 4 Jun 2010, KOSAKI Motohiro wrote: > Have you review the actual patches? And No, I don't think "complete > replace with no test result" is adequate development way. > I have repeatedly said that the oom killer no longer kills KDE when run on my desktop in the presence of a memory hogging task that was written specifically to oom the machine. That's a better result than the current implementation and was discussed thoroughly during the discussion on this mailing list back in February that inspired this rewrite to begin with. I don't think there's any mystery there since you've referred to that change specifically for KDE in this thread yourself. 
> And, When developers post large patch set, Usually _you_ request show
> demonstrate result. I haven't seen such result in this activity.
>

You want to see a log that says "Killed process 1234 (memory-hogger)..." instead of "Killed process 1234 (kdeinit)..."? You've supported the change from total_vm to rss as a baseline to begin with. And after all this discussion, this is the first time you've ever said you wanted to see that type of log or anything like it.

> However, It doesn't give any reason to avoid code review and violate
> our development process.
>

Nobody is avoiding code review here, that's pretty obvious, and I have no idea what you're referring to when you're saying I'm violating the development process because this happens to rewrite an entire function and requires a new user interface and callsite fixups to be meaningful. You specifically asked me to push the forkbomb detector in a different patch and I did that because it makes sense to separate that heuristic, but even then you just wrote "nack" and haven't responded with why even after I've replied twice asking. I'm really confused by this behavior.

> > And I'm going to have to get into it because of you guys' seeming
> > inability to get your act together.
>
> Inability? What do you mean inability? Almost all developers cooperate
> for making stabilized kernel. Is this effort inability? or meaningless?
>

I think he's saying that he expects that we should be able to work cooperatively in resolving any differences that we have in a respectful and technical manner on this list. But I'll also add my two cents in that and say that we should probably be leaving maintainer duties up to the actual -mm tree maintainer, he knows the development process you're talking about pretty well.

> Actually, the descriptions doesn't looks better really. We sometimes
> ask him
> - which problem occur? how do you reproduce it?

KDE gets killed, memory hogger doesn't. Run memory hogger on your desktop.
KOSAKI, this isn't a surprise to you. If this is your objection, I can certainly elaborate more in the changelog but up until yesterday you've never said you have a problem with it so how am I supposed to make any forward progress on this? I can't read your mind when you say "nack" and I'd like to resolve any issues that people have, but that requires that they get involved.

> - which piece solve which issue?

Mostly the baseline heuristic change to rss and swap, as you well know.

> - how do you measure side effect?

As far as the objective of the oom killer is concerned as listed in mm/oom_kill.c's header, there are no side effects. We're trying to kill a task that will free the largest amount of memory and clearly rss and swap are a better indication of that than total_vm.

> - how do you mesure or consider other workload user
>

The objective of the oom killer is not different for different workloads.

> But I got only the answer, "My patch is best. resistance is futile". that's
> purely Baaaad.
>

I haven't said anything new in the above, KOSAKI, you already knew all this. I'll update the changelog to include some of this information for the next posting, but I'd really hope that this isn't the major problem that you've had the entire time that we've stalled weeks on.

> OK. I don't have any reason to confuse you. I'll fix me. My point is
> really simple. The majority OOM user are in desktop. We must not ignore
> them. such as
>
> - Any regression from desktop view are unacceptable

This patchset was specifically designed to improve the oom killer's behavior on the desktop!

> - Any incompatibility of no desktop improvement are unacceptable

I don't understand this.

> - Any refusing bugfix are unacceptable

I've merged most of Oleg's work into this patchset, the problem that we're having is deciding whether any of it is -rc material or not and should be pushed first.
I don't think any of it is, Oleg certainly wasn't pushing it and to date I don't believe he has said it's rc material, so that's something you can talk about but I'm not refusing any bugfix.

> - Any refusing reviewing are unacceptable (IOW, must get any developers ack.
> I'm ok even if they don't include me)
>

I've been begging for you to review this.

> 1) fix bugs at fist before making new feature (a.k.a new bugs)

Kame already suggested a new order to the patchset that I'll be restructuring. I'm curious as to why this was removed from -mm though on your suggestion before any of this became an issue. We've yet to hear that mysterious information.

> 2) don't mix bugfix and new feature

Andrew said bugfixes should come first, they will in the reposting, but I don't consider any of it to be -rc material.

> 3) make separate as natural and individual piece

I can't keep having this conversation, the patch is broken down into one functional unit as much as possible. Please leave the maintainership of this code to Andrew who has already said entire implementation changes (in this case, a single function rewrite) are allowed if they make sense.

> 4) keep small and reviewable patch size

Same as above.

> 5) stop ugly excuse, instead repeatedly rewrite until get anyone ack

I don't know what my ugly excuse is, but I'll be reordering the patches and sending them with an updated changelog on the badness heuristic rewrite. I hope that will satisfy all your concerns.

> 6) don't ignore another developers bug report
>

If you have a bug report that is the result of this rewrite, please come forward with it and don't carry this out by making me guess again.

> I didn't hope says the same thing twice and he repeatedly ignore
> my opinion, thus, he got short answer. I didn't think this is inadequate
> beucase he can google past mail.
>

No, you've never said this is the reason why it was dropped from -mm or why it was "nack"'d early on.
> However he repeatedly attach our goodwill and blame our tolerance. > but also repeatedly said "My workload is important than other!". > Then, I got upset really. > What?? I don't even have a specific workload that I'm targeting with this change, I have no idea what you're referring to, we don't run much stuff on the desktop :) > The fact is, all of good developer never says "my workload is most > important in the world", it makes no sense and insane. I really hate > such selfish. Again, this is just a ridiculous accusation. I have no idea what you're referring to since this rewrite is specifically addressed to fix the oom killer problems on the desktop. I work on servers and systems software, I don't have a desktop workload that I'm advocating for here, so perhaps you got me confused with someone else. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail202.messagelabs.com (mail202.messagelabs.com [216.82.254.227]) by kanga.kvack.org (Postfix) with SMTP id 9CC096B01D5 for ; Tue, 8 Jun 2010 07:41:53 -0400 (EDT) Received: from m6.gw.fujitsu.co.jp ([10.0.50.76]) by fgwmail5.fujitsu.co.jp (Fujitsu Gateway) with ESMTP id o58BfpCa014399 for (envelope-from kosaki.motohiro@jp.fujitsu.com); Tue, 8 Jun 2010 20:41:51 +0900 Received: from smail (m6 [127.0.0.1]) by outgoing.m6.gw.fujitsu.co.jp (Postfix) with ESMTP id D951545DE4F for ; Tue, 8 Jun 2010 20:41:50 +0900 (JST) Received: from s6.gw.fujitsu.co.jp (s6.gw.fujitsu.co.jp [10.0.50.96]) by m6.gw.fujitsu.co.jp (Postfix) with ESMTP id C098845DD71 for ; Tue, 8 Jun 2010 20:41:50 +0900 (JST) Received: from s6.gw.fujitsu.co.jp (localhost.localdomain [127.0.0.1]) by s6.gw.fujitsu.co.jp (Postfix) with ESMTP id 9B81BE08001 for ; Tue, 8 Jun 2010 20:41:50 +0900 (JST) Received: from m106.s.css.fujitsu.com (m106.s.css.fujitsu.com 
[10.249.87.106]) by s6.gw.fujitsu.co.jp (Postfix) with ESMTP id 506911DB8014 for ; Tue, 8 Jun 2010 20:41:50 +0900 (JST) From: KOSAKI Motohiro Subject: Re: [patch -mm 01/18] oom: filter tasks not sharing the same cpuset In-Reply-To: References: Message-Id: <20100606170713.8718.A69D9226@jp.fujitsu.com> MIME-Version: 1.0 Content-Type: text/plain; charset="ISO-2022-JP" Content-Transfer-Encoding: 7bit Date: Tue, 8 Jun 2010 20:41:49 +0900 (JST) Sender: owner-linux-mm@kvack.org To: David Rientjes Cc: kosaki.motohiro@jp.fujitsu.com, Andrew Morton , Rik van Riel , Nick Piggin , Oleg Nesterov , KAMEZAWA Hiroyuki , Balbir Singh , linux-mm@kvack.org List-ID:

> @@ -267,6 +259,8 @@ static struct task_struct *select_bad_process(unsigned long *ppoints,
>  			continue;
>  		if (mem && !task_in_mem_cgroup(p, mem))
>  			continue;
> +		if (!has_intersects_mems_allowed(p))
> +			continue;
>
>  		/*
>  		 * This task already has access to memory reserves and is

Now we have three places that do oom filtering:

  (1) select_bad_process
  (2) dump_tasks
  (3) oom_kill_task (when oom_kill_allocating_task==1 only)

This patch only adds the check to (1). I think we need it in (2) and (3) too.

-- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ .
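The point about the three filtering sites can be illustrated with a minimal userspace sketch in C. This is not kernel code: `toy_task`, `toy_oom_unkillable()`, the bitmask representation of allowed nodes and the `oom_nodes` parameter are all assumptions made for illustration. It only shows the idea of routing victim selection, task dumping and the kill-the-allocating-task path through one shared predicate, so that the three sites cannot disagree about who is eligible.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy nodemask: bit n set means node n is allowed (assumption, not the kernel type). */
typedef unsigned long toy_nodemask;

/* Hypothetical, heavily simplified stand-in for struct task_struct. */
struct toy_task {
	int pid;
	bool has_mm;			/* false for tasks without an address space */
	toy_nodemask mems_allowed;
};

/* Single predicate shared by every site that filters oom candidates. */
bool toy_oom_unkillable(const struct toy_task *p, toy_nodemask oom_nodes)
{
	assert(p != NULL);
	if (!p->has_mm)
		return true;
	/* Killing a task with no allowed node in common is unlikely to help. */
	if ((p->mems_allowed & oom_nodes) == 0)
		return true;
	return false;
}

/* Sites (1) and (2): victim selection and task dumping walk the same set. */
size_t toy_count_candidates(const struct toy_task *tasks, size_t n,
			    toy_nodemask oom_nodes)
{
	size_t i, count = 0;

	for (i = 0; i < n; i++)
		if (!toy_oom_unkillable(&tasks[i], oom_nodes))
			count++;
	return count;
}

/* Site (3): the kill-the-allocating-task shortcut applies the same check. */
bool toy_may_kill_allocating_task(const struct toy_task *cur,
				  toy_nodemask oom_nodes)
{
	return !toy_oom_unkillable(cur, oom_nodes);
}
```

With a shared helper of this kind, adding a new exclusion rule in one place would automatically apply to all three paths, which is the consistency being asked for above.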
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail143.messagelabs.com (mail143.messagelabs.com [216.82.254.35]) by kanga.kvack.org (Postfix) with SMTP id 02B386B01D6 for ; Tue, 8 Jun 2010 07:41:53 -0400 (EDT) Received: from m3.gw.fujitsu.co.jp ([10.0.50.73]) by fgwmail5.fujitsu.co.jp (Fujitsu Gateway) with ESMTP id o58BfpkL014406 for (envelope-from kosaki.motohiro@jp.fujitsu.com); Tue, 8 Jun 2010 20:41:51 +0900 Received: from smail (m3 [127.0.0.1]) by outgoing.m3.gw.fujitsu.co.jp (Postfix) with ESMTP id 784D145DE4E for ; Tue, 8 Jun 2010 20:41:51 +0900 (JST) Received: from s3.gw.fujitsu.co.jp (s3.gw.fujitsu.co.jp [10.0.50.93]) by m3.gw.fujitsu.co.jp (Postfix) with ESMTP id 4C5F545DE4D for ; Tue, 8 Jun 2010 20:41:51 +0900 (JST) Received: from s3.gw.fujitsu.co.jp (localhost.localdomain [127.0.0.1]) by s3.gw.fujitsu.co.jp (Postfix) with ESMTP id 303661DB803F for ; Tue, 8 Jun 2010 20:41:51 +0900 (JST) Received: from m106.s.css.fujitsu.com (m106.s.css.fujitsu.com [10.249.87.106]) by s3.gw.fujitsu.co.jp (Postfix) with ESMTP id C046F1DB8040 for ; Tue, 8 Jun 2010 20:41:50 +0900 (JST) From: KOSAKI Motohiro Subject: Re: [patch -mm 02/18] oom: sacrifice child with highest badness score for parent In-Reply-To: References: Message-Id: <20100606175117.8721.A69D9226@jp.fujitsu.com> MIME-Version: 1.0 Content-Type: text/plain; charset="ISO-2022-JP" Content-Transfer-Encoding: 7bit Date: Tue, 8 Jun 2010 20:41:50 +0900 (JST) Sender: owner-linux-mm@kvack.org To: David Rientjes Cc: kosaki.motohiro@jp.fujitsu.com, Andrew Morton , Rik van Riel , Nick Piggin , Oleg Nesterov , KAMEZAWA Hiroyuki , Balbir Singh , linux-mm@kvack.org List-ID: > Reviewers may observe that the previous implementation would iterate > through the children and attempt to kill each until one was successful and > then the parent if none were found while the new code simply kills the > most memory-hogging task or the parent. 
Note that the only time
> oom_kill_task() fails, however, is when a child does not have an mm or has
> a /proc/pid/oom_adj of OOM_DISABLE. badness() returns 0 for both cases,
> so the final oom_kill_task() will always succeed.

We probably need to call has_intersects_mems_allowed() in this loop, like this:

	/* Try to sacrifice the worst child first */
	do {
		list_for_each_entry(c, &t->children, sibling) {
			unsigned long cpoints;

			if (c->mm == p->mm)
				continue;
			if (oom_unkillable(c, mem, nodemask))
				continue;

			/* oom_badness() returns 0 if the thread is unkillable */
			cpoints = oom_badness(c);
			if (cpoints > victim_points) {
				victim = c;
				victim_points = cpoints;
			}
		}
	} while_each_thread(p, t);

That is, perhaps we shouldn't assume that parent and child have the same mems_allowed.

-- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail143.messagelabs.com (mail143.messagelabs.com [216.82.254.35]) by kanga.kvack.org (Postfix) with SMTP id 480296B01D8 for ; Tue, 8 Jun 2010 07:41:56 -0400 (EDT) Received: from m6.gw.fujitsu.co.jp ([10.0.50.76]) by fgwmail7.fujitsu.co.jp (Fujitsu Gateway) with ESMTP id o58BfrN9007944 for (envelope-from kosaki.motohiro@jp.fujitsu.com); Tue, 8 Jun 2010 20:41:53 +0900 Received: from smail (m6 [127.0.0.1]) by outgoing.m6.gw.fujitsu.co.jp (Postfix) with ESMTP id 409E145DE4E for ; Tue, 8 Jun 2010 20:41:53 +0900 (JST) Received: from s6.gw.fujitsu.co.jp (s6.gw.fujitsu.co.jp [10.0.50.96]) by m6.gw.fujitsu.co.jp (Postfix) with ESMTP id 1E1ED45DD71 for ; Tue, 8 Jun 2010 20:41:53 +0900 (JST) Received: from s6.gw.fujitsu.co.jp (localhost.localdomain [127.0.0.1]) by s6.gw.fujitsu.co.jp (Postfix) with ESMTP id F35C61DB8019 for ; Tue, 8 Jun 2010 20:41:52 +0900 (JST) Received: from m106.s.css.fujitsu.com (m106.s.css.fujitsu.com [10.249.87.106]) by s6.gw.fujitsu.co.jp (Postfix) with
ESMTP id 9AAEA1DB8013 for ; Tue, 8 Jun 2010 20:41:52 +0900 (JST) From: KOSAKI Motohiro Subject: Re: [patch -mm 01/18] oom: filter tasks not sharing the same cpuset In-Reply-To: References: Message-Id: <20100607084024.873B.A69D9226@jp.fujitsu.com> MIME-Version: 1.0 Content-Type: text/plain; charset="ISO-2022-JP" Content-Transfer-Encoding: 7bit Date: Tue, 8 Jun 2010 20:41:51 +0900 (JST) Sender: owner-linux-mm@kvack.org To: David Rientjes Cc: kosaki.motohiro@jp.fujitsu.com, Andrew Morton , Rik van Riel , Nick Piggin , Oleg Nesterov , KAMEZAWA Hiroyuki , Balbir Singh , linux-mm@kvack.org List-ID:

> Tasks that do not share the same set of allowed nodes with the task that
> triggered the oom should not be considered as candidates for oom kill.
>
> Tasks in other cpusets with a disjoint set of mems would be unfairly
> penalized otherwise because of oom conditions elsewhere; an extreme
> example could unfairly kill all other applications on the system if a
> single task in a user's cpuset sets itself to OOM_DISABLE and then uses
> more memory than allowed.
>
> Killing tasks outside of current's cpuset rarely would free memory for
> current anyway. To use a sane heuristic, we must ensure that killing a
> task would likely free memory for current and avoid needlessly killing
> others at all costs just because their potential memory freeing is
> unknown. It is better to kill current than another task needlessly.

I've put the following historical remark in the description of the patch.

We applied exactly the same patch in 2005:

: commit ef08e3b4981aebf2ba9bd7025ef7210e8eec07ce
: Author: Paul Jackson
: Date: Tue Sep 6 15:18:13 2005 -0700
:
: [PATCH] cpusets: confine oom_killer to mem_exclusive cpuset
:
: Now the real motivation for this cpuset mem_exclusive patch series seems
: trivial.
:
: This patch keeps a task in or under one mem_exclusive cpuset from provoking an
: oom kill of a task under a non-overlapping mem_exclusive cpuset. Since only
: interrupt and GFP_ATOMIC allocations are allowed to escape mem_exclusive
: containment, there is little to gain from oom killing a task under a
: non-overlapping mem_exclusive cpuset, as almost all kernel and user memory
: allocation must come from disjoint memory nodes.
:
: This patch enables configuring a system so that a runaway job under one
: mem_exclusive cpuset cannot cause the killing of a job in another such cpuset
: that might be using very high compute and memory resources for a prolonged
: time.

And we changed it to the current logic in 2006:

: commit 7887a3da753e1ba8244556cc9a2b38c815bfe256
: Author: Nick Piggin
: Date: Mon Sep 25 23:31:29 2006 -0700
:
: [PATCH] oom: cpuset hint
:
: cpuset_excl_nodes_overlap does not always indicate that killing a task will
: not free any memory we for us. For example, we may be asking for an
: allocation from _anywhere_ in the machine, or the task in question may be
: pinning memory that is outside its cpuset. Fix this by just causing
: cpuset_excl_nodes_overlap to reduce the badness rather than disallow it.

And we haven't yet got an explanation of why this patch doesn't reintroduce the old issue.

I won't refuse a patch if it has multiple acks. But if you have any material or numbers, please show us soon.

-- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ .
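The difference between the two historical behaviours quoted above (exclude a task with disjoint mems entirely, versus merely reduce its badness) can be made concrete with a small userspace sketch in C. Everything here is an illustrative assumption rather than kernel code: `toy_task`, `toy_badness()`, the bitmask nodemask, the convention that a score of 0 means "never selected", and in particular the divisor used for the penalty.

```c
#include <assert.h>
#include <stddef.h>

/* Toy nodemask: bit n set means node n is allowed (assumption). */
typedef unsigned long toy_nodemask;

/* Hypothetical simplified task: a precomputed score plus its allowed nodes. */
struct toy_task {
	unsigned long base_points;
	toy_nodemask mems_allowed;
};

enum toy_overlap_policy {
	TOY_OVERLAP_FILTER,	/* 2005-style: disjoint tasks are never candidates */
	TOY_OVERLAP_PENALIZE	/* 2006-style: disjoint tasks only score lower */
};

/* Illustrative penalty only; the real value is not taken from the thread. */
#define TOY_DISJOINT_DIVISOR 8UL

unsigned long toy_badness(const struct toy_task *p, toy_nodemask current_mems,
			  enum toy_overlap_policy policy)
{
	assert(p != NULL);

	/* Overlapping tasks are scored normally under both policies. */
	if (p->mems_allowed & current_mems)
		return p->base_points;

	if (policy == TOY_OVERLAP_FILTER)
		return 0;	/* toy convention: 0 means "not eligible" */

	return p->base_points / TOY_DISJOINT_DIVISOR;
}
```

Under the filter policy a large task pinning memory outside its cpuset can never be chosen, which is the concern the 2006 commit message raises; under the penalty policy it can still be chosen if it is big enough.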
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail144.messagelabs.com (mail144.messagelabs.com [216.82.254.51]) by kanga.kvack.org (Postfix) with SMTP id BF1226B01DB for ; Tue, 8 Jun 2010 07:41:56 -0400 (EDT) Received: from m5.gw.fujitsu.co.jp ([10.0.50.75]) by fgwmail6.fujitsu.co.jp (Fujitsu Gateway) with ESMTP id o58BfrCI012299 for (envelope-from kosaki.motohiro@jp.fujitsu.com); Tue, 8 Jun 2010 20:41:54 +0900 Received: from smail (m5 [127.0.0.1]) by outgoing.m5.gw.fujitsu.co.jp (Postfix) with ESMTP id C0FE145DE53 for ; Tue, 8 Jun 2010 20:41:53 +0900 (JST) Received: from s5.gw.fujitsu.co.jp (s5.gw.fujitsu.co.jp [10.0.50.95]) by m5.gw.fujitsu.co.jp (Postfix) with ESMTP id 95B5045DE4E for ; Tue, 8 Jun 2010 20:41:53 +0900 (JST) Received: from s5.gw.fujitsu.co.jp (localhost.localdomain [127.0.0.1]) by s5.gw.fujitsu.co.jp (Postfix) with ESMTP id 757C4E08002 for ; Tue, 8 Jun 2010 20:41:53 +0900 (JST) Received: from m106.s.css.fujitsu.com (m106.s.css.fujitsu.com [10.249.87.106]) by s5.gw.fujitsu.co.jp (Postfix) with ESMTP id 11FB41DB8038 for ; Tue, 8 Jun 2010 20:41:53 +0900 (JST) From: KOSAKI Motohiro Subject: Re: [patch -mm 03/18] oom: select task from tasklist for mempolicy ooms In-Reply-To: References: Message-Id: <20100607085714.8750.A69D9226@jp.fujitsu.com> MIME-Version: 1.0 Content-Type: text/plain; charset="ISO-2022-JP" Content-Transfer-Encoding: 7bit Date: Tue, 8 Jun 2010 20:41:52 +0900 (JST) Sender: owner-linux-mm@kvack.org To: David Rientjes Cc: kosaki.motohiro@jp.fujitsu.com, Andrew Morton , Rik van Riel , Nick Piggin , Oleg Nesterov , KAMEZAWA Hiroyuki , Balbir Singh , linux-mm@kvack.org List-ID: > The oom killer presently kills current whenever there is no more memory > free or reclaimable on its mempolicy's nodes. There is no guarantee that > current is a memory-hogging task or that killing it will free any > substantial amount of memory, however. 
>
> In such situations, it is better to scan the tasklist for nodes that are
> allowed to allocate on current's set of nodes and kill the task with the
> highest badness() score. This ensures that the most memory-hogging task,
> or the one configured by the user with /proc/pid/oom_adj, is always
> selected in such scenarios.
>
> Reviewed-by: KOSAKI Motohiro
> Signed-off-by: David Rientjes
> ---
>  include/linux/mempolicy.h |   13 +++++++-
>  mm/mempolicy.c            |   44 +++++++++++++++++++++++++
>  mm/oom_kill.c             |   77 +++++++++++++++++++++++++++-----------------
>  3 files changed, 103 insertions(+), 31 deletions(-)
>
> diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
> --- a/include/linux/mempolicy.h
> +++ b/include/linux/mempolicy.h
> @@ -210,6 +210,8 @@ extern struct zonelist *huge_zonelist(struct vm_area_struct *vma,
>  				unsigned long addr, gfp_t gfp_flags,
>  				struct mempolicy **mpol, nodemask_t **nodemask);
>  extern bool init_nodemask_of_mempolicy(nodemask_t *mask);
> +extern bool mempolicy_nodemask_intersects(struct task_struct *tsk,
> +				const nodemask_t *mask);
>  extern unsigned slab_node(struct mempolicy *policy);
>
>  extern enum zone_type policy_zone;
> @@ -338,7 +340,16 @@ static inline struct zonelist *huge_zonelist(struct vm_area_struct *vma,
>  	return node_zonelist(0, gfp_flags);
>  }
>
> -static inline bool init_nodemask_of_mempolicy(nodemask_t *m) { return false; }
> +static inline bool init_nodemask_of_mempolicy(nodemask_t *m)
> +{
> +	return false;
> +}
> +
> +static inline bool mempolicy_nodemask_intersects(struct task_struct *tsk,
> +			const nodemask_t *mask)
> +{
> +	return false;
> +}
>
>  static inline int do_migrate_pages(struct mm_struct *mm,
>  			const nodemask_t *from_nodes,
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -1712,6 +1712,50 @@ bool init_nodemask_of_mempolicy(nodemask_t *mask)
>  }
>  #endif
>
> +/*
> + * mempolicy_nodemask_intersects
> + *
> + * If tsk's mempolicy is "default" [NULL], return 'true' to indicate default
> + * policy. Otherwise, check for intersection between mask and the policy
> + * nodemask for 'bind' or 'interleave' policy. For 'perferred' or 'local'
> + * policy, always return true since it may allocate elsewhere on fallback.
> + *
> + * Takes task_lock(tsk) to prevent freeing of its mempolicy.
> + */
> +bool mempolicy_nodemask_intersects(struct task_struct *tsk,
> +					const nodemask_t *mask)
> +{
> +	struct mempolicy *mempolicy;
> +	bool ret = true;
> +
> +	if (!mask)
> +		return ret;
> +	task_lock(tsk);
> +	mempolicy = tsk->mempolicy;
> +	if (!mempolicy)
> +		goto out;
> +
> +	switch (mempolicy->mode) {
> +	case MPOL_PREFERRED:
> +		/*
> +		 * MPOL_PREFERRED and MPOL_F_LOCAL are only preferred nodes to
> +		 * allocate from, they may fallback to other nodes when oom.
> +		 * Thus, it's possible for tsk to have allocated memory from
> +		 * nodes in mask.
> +		 */
> +		break;
> +	case MPOL_BIND:
> +	case MPOL_INTERLEAVE:
> +		ret = nodes_intersects(mempolicy->v.nodes, *mask);
> +		break;
> +	default:
> +		BUG();
> +	}
> +out:
> +	task_unlock(tsk);
> +	return ret;
> +}
> +
>  /* Allocate a page in interleaved policy.
>     Own path because it needs to do special accounting. */
>  static struct page *alloc_page_interleave(gfp_t gfp, unsigned order,
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -27,6 +27,7 @@
>  #include
>  #include
>  #include
> +#include
>  #include
>
>  int sysctl_panic_on_oom;
> @@ -37,19 +38,35 @@ static DEFINE_SPINLOCK(zone_scan_lock);
>
>  /*
>   * Do all threads of the target process overlap our allowed nodes?
> + * @tsk: task struct of which task to consider
> + * @mask: nodemask passed to page allocator for mempolicy ooms
>   */
> -static int has_intersects_mems_allowed(struct task_struct *tsk)
> +static bool has_intersects_mems_allowed(struct task_struct *tsk,
> +					const nodemask_t *mask)

"nodemask" is a better name than plain "mask".

>  {
> -	struct task_struct *t;
> +	struct task_struct *start = tsk;
>
> -	t = tsk;
>  	do {
> -		if (cpuset_mems_allowed_intersects(current, t))
> -			return 1;
> -		t = next_thread(t);
> -	} while (t != tsk);
> -
> -	return 0;
> +		if (mask) {
> +			/*
> +			 * If this is a mempolicy constrained oom, tsk's
> +			 * cpuset is irrelevant. Only return true if its
> +			 * mempolicy intersects current, otherwise it may be
> +			 * needlessly killed.
> +			 */
> +			if (mempolicy_nodemask_intersects(tsk, mask))
> +				return true;
> +		} else {
> +			/*
> +			 * This is not a mempolicy constrained oom, so only
> +			 * check the mems of tsk's cpuset.
> +			 */
> +			if (cpuset_mems_allowed_intersects(current, tsk))
> +				return true;
> +		}
> +		tsk = next_thread(tsk);
> +	} while (tsk != start);
> +	return false;

I have rewritten this to use while_each_thread(). Please take a look at it.

>  }
>
>  /**
> @@ -237,7 +254,8 @@ static enum oom_constraint constrained_alloc(struct zonelist *zonelist,
>   * (not docbooked, we don't want this one cluttering up the manual)
>   */
>  static struct task_struct *select_bad_process(unsigned long *ppoints,
> -						struct mem_cgroup *mem)
> +		struct mem_cgroup *mem, enum oom_constraint constraint,
> +		const nodemask_t *mask)

We don't need the constraint argument. In the !CONSTRAINT_MEMORY_POLICY case, we can just pass mask==NULL. And here too, "nodemask" is the better name.

>  {
>  	struct task_struct *p;
>  	struct task_struct *chosen = NULL;
> @@ -259,7 +277,9 @@ static struct task_struct *select_bad_process(unsigned long *ppoints,
>  			continue;
>  		if (mem && !task_in_mem_cgroup(p, mem))
>  			continue;
> -		if (!has_intersects_mems_allowed(p))
> +		if (!has_intersects_mems_allowed(p,
> +				constraint == CONSTRAINT_MEMORY_POLICY ? mask :
> +									 NULL))
>  			continue;
>
>  		/*
> @@ -483,7 +503,7 @@ void mem_cgroup_out_of_memory(struct mem_cgroup *mem, gfp_t gfp_mask)
>  		panic("out of memory(memcg). panic_on_oom is selected.\n");
>  	read_lock(&tasklist_lock);
>  retry:
> -	p = select_bad_process(&points, mem);
> +	p = select_bad_process(&points, mem, CONSTRAINT_NONE, NULL);
>  	if (!p || PTR_ERR(p) == -1UL)
>  		goto out;
>
> @@ -562,7 +582,8 @@ void clear_zonelist_oom(struct zonelist *zonelist, gfp_t gfp_mask)
>  /*
>   * Must be called with tasklist_lock held for read.
>   */
> -static void __out_of_memory(gfp_t gfp_mask, int order)
> +static void __out_of_memory(gfp_t gfp_mask, int order,
> +			enum oom_constraint constraint, const nodemask_t *mask)
>  {
>  	struct task_struct *p;
>  	unsigned long points;
> @@ -576,7 +597,7 @@ retry:
>  	 * Rambo mode: Shoot down a process and hope it solves whatever
>  	 * issues we may have.
>  	 */
> -	p = select_bad_process(&points, NULL);
> +	p = select_bad_process(&points, NULL, constraint, mask);
>
>  	if (PTR_ERR(p) == -1UL)
>  		return;
> @@ -610,7 +631,8 @@ void pagefault_out_of_memory(void)
>  		panic("out of memory from page fault. panic_on_oom is selected.\n");
>
>  	read_lock(&tasklist_lock);
> -	__out_of_memory(0, 0); /* unknown gfp_mask and order */
> +	/* unknown gfp_mask and order */
> +	__out_of_memory(0, 0, CONSTRAINT_NONE, NULL);
>  	read_unlock(&tasklist_lock);
>
>  	/*
> @@ -626,6 +648,7 @@ void pagefault_out_of_memory(void)
>   * @zonelist: zonelist pointer
>   * @gfp_mask: memory allocation flags
>   * @order: amount of memory being requested as a power of 2
> + * @nodemask: nodemask passed to page allocator
>   *
>   * If we run out of memory, we have the choice between either
>   * killing a random task (bad), letting the system crash (worse)
> @@ -654,24 +677,18 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
>  	 */
>  	constraint = constrained_alloc(zonelist, gfp_mask, nodemask);
>  	read_lock(&tasklist_lock);
> -
> -	switch (constraint) {
> -	case CONSTRAINT_MEMORY_POLICY:
> -		oom_kill_process(current, gfp_mask, order, 0, NULL,
> -				"No available memory (MPOL_BIND)");
> -		break;
> -
> -	case CONSTRAINT_NONE:
> -		if (sysctl_panic_on_oom) {
> +	if (unlikely(sysctl_panic_on_oom)) {
> +		/*
> +		 * panic_on_oom only affects CONSTRAINT_NONE, the kernel
> +		 * should not panic for cpuset or mempolicy induced memory
> +		 * failures.
> +		 */
> +		if (constraint == CONSTRAINT_NONE) {
>  			dump_header(NULL, gfp_mask, order, NULL);
> -			panic("out of memory. panic_on_oom is selected\n");
> +			panic("Out of memory: panic_on_oom is enabled\n");

You shouldn't mix in undocumented and unnecessary changes.

>  		}
> -		/* Fall-through */
> -	case CONSTRAINT_CPUSET:
> -		__out_of_memory(gfp_mask, order);
> -		break;
>  	}
> -
> +	__out_of_memory(gfp_mask, order, constraint, nodemask);
>  	read_unlock(&tasklist_lock);
>
>  	/*

-- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail190.messagelabs.com (mail190.messagelabs.com [216.82.249.51]) by kanga.kvack.org (Postfix) with SMTP id 52A696B01DE for ; Tue, 8 Jun 2010 07:41:57 -0400 (EDT) Received: from m5.gw.fujitsu.co.jp ([10.0.50.75]) by fgwmail7.fujitsu.co.jp (Fujitsu Gateway) with ESMTP id o58BftmJ007957 for (envelope-from kosaki.motohiro@jp.fujitsu.com); Tue, 8 Jun 2010 20:41:55 +0900 Received: from smail (m5 [127.0.0.1]) by outgoing.m5.gw.fujitsu.co.jp (Postfix) with ESMTP id 0AC1845DE57 for ; Tue, 8 Jun 2010 20:41:55 +0900 (JST) Received: from s5.gw.fujitsu.co.jp (s5.gw.fujitsu.co.jp [10.0.50.95]) by m5.gw.fujitsu.co.jp (Postfix) with ESMTP id CEF3F45DE52 for ; Tue, 8 Jun 2010 20:41:54 +0900 (JST) Received: from s5.gw.fujitsu.co.jp (localhost.localdomain [127.0.0.1]) by s5.gw.fujitsu.co.jp (Postfix) with ESMTP id 9FBE0E08003 for ; Tue, 8 Jun 2010 20:41:54 +0900 (JST) Received: from m106.s.css.fujitsu.com (m106.s.css.fujitsu.com [10.249.87.106]) by s5.gw.fujitsu.co.jp (Postfix) with ESMTP id 517751DB8040 for ; Tue, 8 Jun 2010 20:41:51 +0900 (JST) From: KOSAKI Motohiro Subject: Re: [patch
-mm 11/18] oom: avoid oom killer for lowmem allocations In-Reply-To: References: Message-Id: <20100606184014.8727.A69D9226@jp.fujitsu.com> MIME-Version: 1.0 Content-Type: text/plain; charset="ISO-2022-JP" Content-Transfer-Encoding: 7bit Date: Tue, 8 Jun 2010 20:41:50 +0900 (JST) Sender: owner-linux-mm@kvack.org To: David Rientjes Cc: kosaki.motohiro@jp.fujitsu.com, Andrew Morton , Rik van Riel , Nick Piggin , Oleg Nesterov , KAMEZAWA Hiroyuki , Balbir Singh , linux-mm@kvack.org List-ID:

> Previously, the heuristic provided some protection for those tasks with
> CAP_SYS_RAWIO, but this is no longer necessary since we will not be
> killing tasks for the purposes of ISA allocations.

This seems incorrect. CAP_SYS_RAWIO tasks usually use both GFP_KERNEL and GFP_DMA. Even if the last allocation was GFP_KERNEL, that gives no guarantee that the process has no in-flight I/O. So we can't remove the RAWIO protection from the oom heuristics.

The code itself seems OK, though.

-- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ .
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail137.messagelabs.com (mail137.messagelabs.com [216.82.249.19]) by kanga.kvack.org (Postfix) with SMTP id 3887F6B01D6 for ; Tue, 8 Jun 2010 07:41:57 -0400 (EDT) Received: from m5.gw.fujitsu.co.jp ([10.0.50.75]) by fgwmail5.fujitsu.co.jp (Fujitsu Gateway) with ESMTP id o58BfsE5014431 for (envelope-from kosaki.motohiro@jp.fujitsu.com); Tue, 8 Jun 2010 20:41:54 +0900 Received: from smail (m5 [127.0.0.1]) by outgoing.m5.gw.fujitsu.co.jp (Postfix) with ESMTP id 501B145DE51 for ; Tue, 8 Jun 2010 20:41:54 +0900 (JST) Received: from s5.gw.fujitsu.co.jp (s5.gw.fujitsu.co.jp [10.0.50.95]) by m5.gw.fujitsu.co.jp (Postfix) with ESMTP id 1163A45DE4E for ; Tue, 8 Jun 2010 20:41:54 +0900 (JST) Received: from s5.gw.fujitsu.co.jp (localhost.localdomain [127.0.0.1]) by s5.gw.fujitsu.co.jp (Postfix) with ESMTP id E2DB4E08002 for ; Tue, 8 Jun 2010 20:41:53 +0900 (JST) Received: from m106.s.css.fujitsu.com (m106.s.css.fujitsu.com [10.249.87.106]) by s5.gw.fujitsu.co.jp (Postfix) with ESMTP id 9B11BE08003 for ; Tue, 8 Jun 2010 20:41:53 +0900 (JST) From: KOSAKI Motohiro Subject: Re: [patch -mm 09/18] oom: add forkbomb penalty to badness heuristic In-Reply-To: References: Message-Id: <20100607091034.875C.A69D9226@jp.fujitsu.com> MIME-Version: 1.0 Content-Type: text/plain; charset="ISO-2022-JP" Content-Transfer-Encoding: 7bit Date: Tue, 8 Jun 2010 20:41:52 +0900 (JST) Sender: owner-linux-mm@kvack.org To: David Rientjes Cc: kosaki.motohiro@jp.fujitsu.com, Andrew Morton , Rik van Riel , Nick Piggin , Oleg Nesterov , KAMEZAWA Hiroyuki , Balbir Singh , linux-mm@kvack.org List-ID: > + list_for_each_entry(child, &tsk->children, sibling) { This loop only checks children created by the main thread. We also need to iterate over children created by sub-threads. 
> + struct task_cputime task_time; > + unsigned long runtime; > + unsigned long rss; > + > + task_lock(child); > + if (!child->mm || child->mm == tsk->mm) { > + task_unlock(child); > + continue; > + } This needs to use find_lock_task_mm(). > + rss = get_mm_rss(child->mm); This needs rss + swap for consistency, I think. > + task_unlock(child); > + > + thread_group_cputime(child, &task_time); > + runtime = cputime_to_jiffies(task_time.utime) + > + cputime_to_jiffies(task_time.stime); > + /* > + * Only threads that have run for less than a second are > + * considered toward the forkbomb penalty, these threads rarely > + * get to execute at all in such cases anyway. > + */ > + if (runtime < HZ) { > + child_rss += rss; > + forkcount++; > + } > + } > + > + return forkcount > sysctl_oom_forkbomb_thres ? > + (child_rss / sysctl_oom_forkbomb_thres) : 0; There is a divide-by-zero risk here. The correct style is: thres = sysctl_oom_forkbomb_thres; if (!thres) return 0; return child_rss / thres; Copying the sysctl into a local variable is a must. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
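The divide-by-zero concern raised in this review can be illustrated outside the kernel. The following is a minimal userspace sketch, not the patch's code: `sketch_forkbomb_thres`, `struct sketch_child`, `SKETCH_HZ` and `sketch_forkbomb_penalty()` are hypothetical names, and treating a threshold of 0 as "penalty disabled" is an assumption. It reads the tunable once into a local variable, so the comparison and the division see the same value even if the tunable is changed concurrently.

```c
#include <stddef.h>

/* Assumed tick rate; a stand-in for the kernel's HZ. */
#define SKETCH_HZ 1000UL

/* Hypothetical stand-in for the sysctl; it may be changed at any time. */
volatile unsigned long sketch_forkbomb_thres = 1000;

/* Hypothetical summary of one first-generation child. */
struct sketch_child {
    unsigned long rss;      /* resident pages */
    unsigned long runtime;  /* utime + stime, in ticks */
};

unsigned long sketch_forkbomb_penalty(const struct sketch_child *children,
                                      size_t nr)
{
    /* Read the tunable once so the comparison and the division agree. */
    unsigned long thres = sketch_forkbomb_thres;
    unsigned long child_rss = 0;
    unsigned long forkcount = 0;
    size_t i;

    /* Assumption: a threshold of 0 simply disables the penalty. */
    if (!thres)
        return 0;

    for (i = 0; i < nr; i++) {
        /* Only children that ran for less than a second count. */
        if (children[i].runtime < SKETCH_HZ) {
            child_rss += children[i].rss;
            forkcount++;
        }
    }

    /* (average rss) * forkcount / thres reduces to child_rss / thres. */
    return forkcount > thres ? child_rss / thres : 0;
}
```

With the threshold set to 2 and four short-lived children of 10 pages each, the sketch returns 40 / 2 = 20; a long-running child does not contribute.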
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail172.messagelabs.com (mail172.messagelabs.com [216.82.254.3]) by kanga.kvack.org (Postfix) with SMTP id 219F46B01DB for ; Tue, 8 Jun 2010 07:41:57 -0400 (EDT) Received: from m4.gw.fujitsu.co.jp ([10.0.50.74]) by fgwmail6.fujitsu.co.jp (Fujitsu Gateway) with ESMTP id o58BftJW012317 for (envelope-from kosaki.motohiro@jp.fujitsu.com); Tue, 8 Jun 2010 20:41:55 +0900 Received: from smail (m4 [127.0.0.1]) by outgoing.m4.gw.fujitsu.co.jp (Postfix) with ESMTP id D8F0145DE6F for ; Tue, 8 Jun 2010 20:41:54 +0900 (JST) Received: from s4.gw.fujitsu.co.jp (s4.gw.fujitsu.co.jp [10.0.50.94]) by m4.gw.fujitsu.co.jp (Postfix) with ESMTP id A8F6E45DE4D for ; Tue, 8 Jun 2010 20:41:54 +0900 (JST) Received: from s4.gw.fujitsu.co.jp (localhost.localdomain [127.0.0.1]) by s4.gw.fujitsu.co.jp (Postfix) with ESMTP id 77461E38008 for ; Tue, 8 Jun 2010 20:41:54 +0900 (JST) Received: from m106.s.css.fujitsu.com (m106.s.css.fujitsu.com [10.249.87.106]) by s4.gw.fujitsu.co.jp (Postfix) with ESMTP id 1972CE38004 for ; Tue, 8 Jun 2010 20:41:54 +0900 (JST) From: KOSAKI Motohiro Subject: Re: [patch -mm 09/18] oom: add forkbomb penalty to badness heuristic In-Reply-To: References: Message-Id: <20100607105832.876B.A69D9226@jp.fujitsu.com> MIME-Version: 1.0 Content-Type: text/plain; charset="ISO-2022-JP" Content-Transfer-Encoding: 7bit Date: Tue, 8 Jun 2010 20:41:53 +0900 (JST) Sender: owner-linux-mm@kvack.org To: David Rientjes Cc: kosaki.motohiro@jp.fujitsu.com, Andrew Morton , Rik van Riel , Nick Piggin , Oleg Nesterov , KAMEZAWA Hiroyuki , Balbir Singh , linux-mm@kvack.org List-ID: > Add a forkbomb penalty for processes that fork an excessively large > number of children to penalize that group of tasks and not others. 
A > threshold is configurable from userspace to determine how many first- > generation execve children (those with their own address spaces) a task > may have before it is considered a forkbomb. This can be tuned by > altering the value in /proc/sys/vm/oom_forkbomb_thres, which defaults to > 1000. > > When a task has more than 1000 first-generation children with different > address spaces than itself, a penalty of > > (average rss of children) * (# of 1st generation execve children) > ----------------------------------------------------------------- > oom_forkbomb_thres > > is assessed. So, for example, using the default oom_forkbomb_thres of > 1000, the penalty is twice the average rss of all its execve children if > there are 2000 such tasks. A task is considered to count toward the > threshold if its total runtime is less than one second; for 1000 of such > tasks to exist, the parent process must be forking at an extremely high > rate either erroneously or maliciously. > > Even though a particular task may be designated a forkbomb and selected as > the victim, the oom killer will still kill the 1st generation execve child > with the highest badness() score in its place. The avoids killing > important servers or system daemons. When a web server forks a very large > number of threads for client connections, for example, it is much better > to kill one of those threads than to kill the server and make it > unresponsive. Reviewers need to trace the patch author's intention, but this description seems to focus only on "how to implement". Reviewers need an explanation of the big picture. The old strategy was: (1) accumulate half of the children's vsz (2) kill a child first instead. Your strategy is: (a) usually don't accumulate children's memory (b) but do accumulate short-lived children (c) kill a child first. I think at least two explanations are necessary. - Usually, a legitimate process (e.g. web server, rdb) creates a lot of 1st generation children. 
But a forkbomb usually creates children across multiple generations. Why do you only care about the 1st generation? - In the usual case, you don't take the child rsz into account, but you still kill a child. That seems inconsistent with the old behavior. Why did you choose this technique? For now, I don't have any objection at all, because I haven't understood your point. OK, the concept of forkbomb detection is good, but you need to describe - why did you choose this way? - how did you confirm that your way works fine? No heuristic can be perfect in practice. That's OK. But unclear code intention easily makes code unmaintainable. Please avoid it. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail138.messagelabs.com (mail138.messagelabs.com [216.82.249.35]) by kanga.kvack.org (Postfix) with SMTP id 07B566B01D8 for ; Tue, 8 Jun 2010 07:41:57 -0400 (EDT) Received: from m2.gw.fujitsu.co.jp ([10.0.50.72]) by fgwmail6.fujitsu.co.jp (Fujitsu Gateway) with ESMTP id o58BftdR012327 for (envelope-from kosaki.motohiro@jp.fujitsu.com); Tue, 8 Jun 2010 20:41:56 +0900 Received: from smail (m2 [127.0.0.1]) by outgoing.m2.gw.fujitsu.co.jp (Postfix) with ESMTP id 6E04A45DE57 for ; Tue, 8 Jun 2010 20:41:55 +0900 (JST) Received: from s2.gw.fujitsu.co.jp (s2.gw.fujitsu.co.jp [10.0.50.92]) by m2.gw.fujitsu.co.jp (Postfix) with ESMTP id 4A74945DE51 for ; Tue, 8 Jun 2010 20:41:55 +0900 (JST) Received: from s2.gw.fujitsu.co.jp (localhost.localdomain [127.0.0.1]) by s2.gw.fujitsu.co.jp (Postfix) with ESMTP id 0F9BAE08008 for ; Tue, 8 Jun 2010 20:41:55 +0900 (JST) Received: from m106.s.css.fujitsu.com (m106.s.css.fujitsu.com [10.249.87.106]) by s2.gw.fujitsu.co.jp (Postfix) with ESMTP id 8095CE08001 for ; Tue, 8 Jun 2010 20:41:54 +0900 (JST) From: KOSAKI Motohiro Subject: Re: [patch -mm 02/18] oom: sacrifice child with highest badness score for parent 
In-Reply-To: References: Message-Id: <20100607221121.8781.A69D9226@jp.fujitsu.com> MIME-Version: 1.0 Content-Type: text/plain; charset="ISO-2022-JP" Content-Transfer-Encoding: 7bit Date: Tue, 8 Jun 2010 20:41:53 +0900 (JST) Sender: owner-linux-mm@kvack.org To: David Rientjes Cc: kosaki.motohiro@jp.fujitsu.com, Andrew Morton , Rik van Riel , Nick Piggin , Oleg Nesterov , KAMEZAWA Hiroyuki , Balbir Singh , linux-mm@kvack.org List-ID: > @@ -447,19 +450,27 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order, > return 0; > } > > - printk(KERN_ERR "%s: kill process %d (%s) score %li or a child\n", > - message, task_pid_nr(p), p->comm, points); > + pr_err("%s: Kill process %d (%s) with score %lu or sacrifice child\n", > + message, task_pid_nr(p), p->comm, points); > > - /* Try to kill a child first */ > + do_posix_clock_monotonic_gettime(&uptime); > + /* Try to sacrifice the worst child first */ > list_for_each_entry(c, &p->children, sibling) { > + unsigned long cpoints; > + > if (c->mm == p->mm) > continue; > if (mem && !task_in_mem_cgroup(c, mem)) > continue; > - if (!oom_kill_task(c)) > - return 0; > + This probably needs a check for cpuset (and mempolicy) memory intersection here. Otherwise, it may select an innocent task. Also, is an OOM_DISABLE check necessary? > + /* badness() returns 0 if the thread is unkillable */ > + cpoints = badness(c, uptime.tv_sec); > + if (cpoints > victim_points) { > + victim = c; > + victim_points = cpoints; > + } > } > - return oom_kill_task(p); > + return oom_kill_task(victim); > } > > #ifdef CONFIG_CGROUP_MEM_RES_CTLR -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
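The child-selection loop under review, together with the extra checks the reviewer asks about, can be modelled in userspace roughly as follows. This is a hedged sketch only: `struct sketch_task`, its fields and `sketch_pick_victim()` are hypothetical, the score is precomputed rather than obtained from badness(), node sets are plain bitmasks, and starting from the parent with `victim_points = 0` is an assumption about the surrounding code that the quoted hunk does not show.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for struct task_struct. */
struct sketch_task {
    int mm_id;              /* identifies the address space */
    unsigned long nodes;    /* bitmask of nodes the task may allocate from */
    bool oom_disabled;      /* models an OOM_DISABLE setting */
    unsigned long points;   /* precomputed badness score */
};

/*
 * Return the task to kill: the parent by default, or the eligible child
 * with the highest score. A child is skipped when it shares the parent's
 * address space, cannot allocate from the nodes that ran out of memory,
 * or is exempt from oom killing.
 */
const struct sketch_task *sketch_pick_victim(const struct sketch_task *parent,
                                             const struct sketch_task *children,
                                             size_t nr,
                                             unsigned long oom_nodes)
{
    const struct sketch_task *victim = parent;
    unsigned long victim_points = 0;
    size_t i;

    for (i = 0; i < nr; i++) {
        const struct sketch_task *c = &children[i];

        if (c->mm_id == parent->mm_id)
            continue;       /* killing it would free nothing extra */
        if (!(c->nodes & oom_nodes))
            continue;       /* innocent: no memory on the affected nodes */
        if (c->oom_disabled)
            continue;       /* must never be selected */
        if (c->points > victim_points) {
            victim = c;
            victim_points = c->points;
        }
    }
    return victim;
}
```

If no child passes the filters, the parent itself is returned, which mirrors the fallback to the originally selected task.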
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail144.messagelabs.com (mail144.messagelabs.com [216.82.254.51]) by kanga.kvack.org (Postfix) with SMTP id 295336B01E2 for ; Tue, 8 Jun 2010 07:42:01 -0400 (EDT) Received: from m4.gw.fujitsu.co.jp ([10.0.50.74]) by fgwmail5.fujitsu.co.jp (Fujitsu Gateway) with ESMTP id o58Bfv50014489 for (envelope-from kosaki.motohiro@jp.fujitsu.com); Tue, 8 Jun 2010 20:41:57 +0900 Received: from smail (m4 [127.0.0.1]) by outgoing.m4.gw.fujitsu.co.jp (Postfix) with ESMTP id 80FAC45DE79 for ; Tue, 8 Jun 2010 20:41:57 +0900 (JST) Received: from s4.gw.fujitsu.co.jp (s4.gw.fujitsu.co.jp [10.0.50.94]) by m4.gw.fujitsu.co.jp (Postfix) with ESMTP id 5DACF45DE6F for ; Tue, 8 Jun 2010 20:41:57 +0900 (JST) Received: from s4.gw.fujitsu.co.jp (localhost.localdomain [127.0.0.1]) by s4.gw.fujitsu.co.jp (Postfix) with ESMTP id 5254AE38005 for ; Tue, 8 Jun 2010 20:41:56 +0900 (JST) Received: from m106.s.css.fujitsu.com (m106.s.css.fujitsu.com [10.249.87.106]) by s4.gw.fujitsu.co.jp (Postfix) with ESMTP id D837D1DB803A for ; Tue, 8 Jun 2010 20:41:55 +0900 (JST) From: KOSAKI Motohiro Subject: Re: [patch -mm 08/18] oom: badness heuristic rewrite In-Reply-To: References: <20100604195328.72D9.A69D9226@jp.fujitsu.com> Message-Id: <20100608172820.7645.A69D9226@jp.fujitsu.com> MIME-Version: 1.0 Content-Type: text/plain; charset="ISO-2022-JP" Content-Transfer-Encoding: 7bit Date: Tue, 8 Jun 2010 20:41:55 +0900 (JST) Sender: owner-linux-mm@kvack.org To: David Rientjes Cc: kosaki.motohiro@jp.fujitsu.com, Andrew Morton , Rik van Riel , Nick Piggin , Oleg Nesterov , KAMEZAWA Hiroyuki , Balbir Singh , linux-mm@kvack.org List-ID: Hi > > Have you review the actual patches? And No, I don't think "complete > > replace with no test result" is adequate development way. 
> > I have repeatedly said that the oom killer no longer kills KDE when run on > my desktop in the presence of a memory hogging task that was written > specifically to oom the machine. That's a better result than the > current implementation and was discussed thoroughly during the discussion > on this mailing list back in February that inspired this rewrite to begin > with. I don't think there's any mystery there since you've referred to > that change specifically for KDE in this thread yourself. And reviewers have repeatedly said that your patches contain more than is needed to save KDE, and have asked you for the reason. We haven't said KDE is unimportant. > > And, When developers post large patch set, Usually _you_ request show > > demonstrate result. I haven't seen such result in this activity. > > You want to see a log that says "Killed process 1234 (memory-hogger)..." > instead of "Killed process 1234 (kdeinit)..."? You've supported the > change from total_vm to rss as a baseline to begin with. And after all > this discussion, this is the first time you've ever said you wanted to see > that type of log or anything like it. Did you only test the above crazy, meaningless case?? We aren't asking you for anything acrobatic or impractical. Please simply show what you did. > > However, It doesn't give any reason to avoid code review and violate > > our development process. > > > > Nobody is avoiding code review here, that's pretty obvious, and I have no > idea you're referring to when you're saying I'm violating the development > process because this happens to rewrite an entire function and requires a > new user interface and callsite fixups to be meaningful. You specifically > asked me to push the forkbomb detector in a different patch and I did that > because it makes sense to seperate that heuristic, but even then you just > wrote "nack" and haven't responded with why even after I've replied twice > asking. I'm really confused this behavior. That's not exactly correct. 
I also asked you to separate adding the forkbomb feature from adding the forkbomb knob. I often make the same request of patch authors, again and again. Why? First of all, here is the patch description of your forkbomb detection: > Add a forkbomb penalty for processes that fork an excessively large > number of children to penalize that group of tasks and not others. A > threshold is configurable from userspace to determine how many first- > generation execve children (those with their own address spaces) a task > may have before it is considered a forkbomb. This can be tuned by > altering the value in /proc/sys/vm/oom_forkbomb_thres, which defaults to > 1000. > > When a task has more than 1000 first-generation children with different > address spaces than itself, a penalty of > > (average rss of children) * (# of 1st generation execve children) > ----------------------------------------------------------------- > oom_forkbomb_thres > > is assessed. So, for example, using the default oom_forkbomb_thres of > 1000, the penalty is twice the average rss of all its execve children if > there are 2000 such tasks. A task is considered to count toward the > threshold if its total runtime is less than one second; for 1000 of such > tasks to exist, the parent process must be forking at an extremely high > rate either erroneously or maliciously. > > Even though a particular task may be designated a forkbomb and selected as > the victim, the oom killer will still kill the 1st generation execve child > with the highest badness() score in its place. The avoids killing > important servers or system daemons. When a web server forks a very large > number of threads for client connections, for example, it is much better > to kill one of those threads than to kill the server and make it > unresponsive. This has two bad smells. 1) The text is an unnecessary mess, which smells like a patch that doesn't concentrate on one thing. 2) It concentrates heavily on "what and how to implement". 
But reviewers don't need that information so much, because they can read C. Reviewers need the following information. - background - why did the author choose this way? - why did the author choose this default value? - how was the concept and implementation confirmed to be correct? - etc. With that, reviewers can follow the author's thinking and give good advice and judgement. For example, in this case you wrote - default threshold is 1000 - only accumulate 1st generation execve children - time threshold is a second but did not write why. A messy description hides such gaps in the documentation. That is why I usually insist on splitting a patch: a split naturally reduces the description to "what changes where" and exposes what is missing. So far I haven't understood your intention. The lack of a test suite makes it even harder to see which workload the author considers the problem workload. Btw, a nit: a typical web server doesn't create that many threads, because almost all web servers have a feature to limit the number of connections. (Otherwise the server would easily be taken down by a DoS.) > > > And I'm going to have to get into it because of you guys' seeming > > > inability to get your act together. > > > > Inability? What do you mean inability? Almost all developers cooperate > > for making stabilized kernel. Is this effort inability? or meaningless? > > > > I think he's saying that he expects that we should be able to work > cooperateively in resolving any differences that we have in a respectful > and technical manner on this list. > > But I'll also add my two cents in that and say that we should probably be > leaving maintainer duties up to the actual -mm tree maintainer, he knows > the development process you're talking about pretty well. It seems he and I have some disagreement. Ho hum. Of course, you can seek another reviewer and another ack. But as long as patches come before my eyes, I enforce a bugfix-first policy on everybody. > > > Actually, the descriptions doesn't looks better really. We sometimes > > ask him > > - which problem occur? how do you reproduce it? 
> > KDE gets killed, memory hogger doesn't. Run memory hogger on your > desktop. KOSAKI, this isn't a surprise to you. > > If this is your objection, I can certainly elaborate more in the changelog > but up until yesterday you've never said you have a problem with it so how > am I supposed to make any forward progress on this? I can't read your > mind when you say "nack" and I'd like to resolve any issues that people > have, but that requires that they get involved. And I, too, can only read your mind from your description. I'm not an esper either. > > - which piece solve which issue? > > Mostly the baseline heuristic change to rss and swap, as you well know. Agreed. > > > - how do you measure side effect? > > As far as the objective of the oom killer is concerned as listed in > mm/oom_kill.c's header, there is no side effects. We're trying to kill a > task that will free the largest amount of memory and clearly rss and swap > is a better indication fo that then total_vm. Wait, wait. With this, you are saying that you haven't considered many workloads in depth. Really? I guess not; perhaps you wrote this sentence in a hurry. So I only hope that you update your patch description. > > - how do you mesure or consider other workload user > > The objective of the oom killer is not different for different workloads. Was my question too short or unclear? Usually, we run 5-6 mental simulations: embedded, desktop, web server, db server, hpc, finance. Different workloads certainly make a big difference, because the oom killer traverses the _processes_ in the workload. That affects how to choose the badness() heuristics. Why wouldn't it? > > But I got only the answer, "My patch is best. resistance is futile". that's > > purely Baaaad. > > > > I haven't said anything new in the above, KOSAKI, you already knew all > this. I'll update the changelog to include some of this information for > the next posting, but I'd really hope that this isn't the major problem > that you've had the entire time that we've stalled weeks on. Ho Hum. OK. 
> > > OK. I don't have any reason to confuse you. I'll fix me. My point is > > really simple. The majority OOM user are in desktop. We must not ignore > > them. such as > > > > - Any regression from desktop view are unacceptable > > This patchset was specifically designed to improve the oom killer's > behavior on the desktop! Again, a feature that cannot be evaluated is mixed in, and reviewers are stalling. > > - Any incompatibility of no desktop improvement are unacceptable > > I don't understand this. In other words, - Any incompatibility is unacceptable, because your new feature has no users. > > - Any refusing bugfix are unacceptable > > I've merged most of Oleg's work into this patchset, the problem that we're > having is deciding whether any of it is -rc material or not and should be > pushed first. I don't think any of it is, Oleg certainly wasn't pushing > it and to date I don't believe has said it's rc material, so that's > something you can talk about but I'm not refusing any bugfix. Good developers always handle other developers' and users' bug reports first. And I'm going to push the kill-PF_EXITING patch and the dying-task-higher-priority patch even though they don't help your workload. I don't believe your reason for opposing them is logical. (But if you make an alternative patch, I'll give its review priority.) > > 1) fix bugs at fist before making new feature (a.k.a new bugs) > > Kame already suggested a new order to the patchset that I'll be > restructuring. I'm curious as to why this was removed from -mm though on > your suggestion before any of this became an issue. We've yet to hear > that mysterious information. Again and again and again: you have to get someone's ack when you are pushing a new feature. And your series still has bugs and will usually need 3-5 review iterations. OK, that is partly Andrew's and us reviewers' fault. These patches should have been dropped earlier. Your patches got NAKs 4 times, each from a different developer, and each time the patches had to be dropped. Sigh. 
> > 2) don't mix bugfix and new feature > > Andrew said bugfixes should come first, they will in the reposting, but I > don't consider any of it to be -rc material. Oleg's material can be merged now, but yours cannot. > > 3) make separate as natural and individual piece > > I can't keep having this conversation, the patch is broken down into one > functional unit as much as possible. Please leave the maintainership of > this code to Andrew who has already said entire implementation changes (in > this case, a single function rewrite) is allowed if it makes sense. As I said, I'll split them if you don't. > > 4) keep small and reviewable patch size > > Same as above. > > > 5) stop ugly excuse, instead repeatedly rewrite until get anyone ack > > I don't know what my ugly excuse is, but I'll be reordering the patches > and sending them with an updated changelog on the badness heuristic > rewrite. I hope that will satisfy all your concerns. I won't discuss generalities here. Instead, I've sent a new bug report and new review results. I hope I get a productive response. > > 6) don't ignore another developers bug report > > > > If you have a bug report that is the result of this rewrite, please come > forward with it and don't carry this out by making me guess again. > > > I didn't hope says the same thing twice and he repeatedly ignore > > my opinion, thus, he got short answer. I didn't think this is inadequate > > beucase he can google past mail. > > > > No, you've never said this is the reason why it was dropped from -mm or > why it was "nack"'d early on. > > > However he repeatedly attach our goodwill and blame our tolerance. > > but also repeatedly said "My workload is important than other!". > > Then, I got upset really. > > > > What?? 
I don't even have a specific workload that I'm targeting with this > change, I have no idea what you're referring to, we don't run much stuff > on the desktop :) > > > The fact is, all of good developer never says "my workload is most > > important in the world", it makes no sense and insane. I really hate > > such selfish. > > Again, this is just a ridiculous accusation. I have no idea what you're > referring to since this rewrite is specifically addressed to fix the oom > killer problems on the desktop. I work on servers and systems software, I > don't have a desktop workload that I'm advocating for here, so perhaps you > got me confused with someone else. David, do you know how much time other kernel engineers spend on understanding real workloads and talking with various open source communities, Linux user companies and user groups? At the very least, every developer must make an _effort_ to spend some time investigating userland use cases when they want to introduce a new feature or an incompatibility. Almost all developers do. Please read the git logs of various new features. A few commit logs are ridiculously terse (probably the author couldn't be bothered to cut-n-paste from the ML bug report), but almost all of them describe what the problem is. Thus, we can double-check that the problem and the code match. And if you can't test your patch on various platforms, you must at least write up the theoretical background of your patch. It definitely helps the engineers of each area confirm that your patch doesn't harm their area. As a matter of principle, though, if you want to introduce any incompatibility, you must investigate how much impact it has. Remark: if you think you need a mathematical proof or a proof with 100% coverage, that's not correct. You don't need to do such impossible work. We just need to confirm that you have investigated and considered a large enough coverage. Usually, this isn't required of the author of a small patch, because reviewers can work out the affected use cases from the code. Most reviewers have more use-case knowledge than typical kernel developers. 
But now you are attempting a full rewrite. We don't have enough information to finish reviewing. Lastly, I've sent various review results in separate mails. Could you please read them? Thanks. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail191.messagelabs.com (mail191.messagelabs.com [216.82.242.19]) by kanga.kvack.org (Postfix) with SMTP id 607336B0239 for ; Tue, 8 Jun 2010 07:42:52 -0400 (EDT) Received: from m6.gw.fujitsu.co.jp ([10.0.50.76]) by fgwmail6.fujitsu.co.jp (Fujitsu Gateway) with ESMTP id o58BfudJ012347 for (envelope-from kosaki.motohiro@jp.fujitsu.com); Tue, 8 Jun 2010 20:41:56 +0900 Received: from smail (m6 [127.0.0.1]) by outgoing.m6.gw.fujitsu.co.jp (Postfix) with ESMTP id 4931045DE52 for ; Tue, 8 Jun 2010 20:41:56 +0900 (JST) Received: from s6.gw.fujitsu.co.jp (s6.gw.fujitsu.co.jp [10.0.50.96]) by m6.gw.fujitsu.co.jp (Postfix) with ESMTP id 27B8645DE4E for ; Tue, 8 Jun 2010 20:41:56 +0900 (JST) Received: from s6.gw.fujitsu.co.jp (localhost.localdomain [127.0.0.1]) by s6.gw.fujitsu.co.jp (Postfix) with ESMTP id CB6521DB8012 for ; Tue, 8 Jun 2010 20:41:55 +0900 (JST) Received: from m106.s.css.fujitsu.com (m106.s.css.fujitsu.com [10.249.87.106]) by s6.gw.fujitsu.co.jp (Postfix) with ESMTP id 73DD41DB801D for ; Tue, 8 Jun 2010 20:41:55 +0900 (JST) From: KOSAKI Motohiro Subject: Re: [patch -mm 03/18] oom: select task from tasklist for mempolicy ooms In-Reply-To: References: Message-Id: <20100608165055.8799.A69D9226@jp.fujitsu.com> MIME-Version: 1.0 Content-Type: text/plain; charset="ISO-2022-JP" Content-Transfer-Encoding: 7bit Date: Tue, 8 Jun 2010 20:41:54 +0900 (JST) Sender: owner-linux-mm@kvack.org To: David Rientjes Cc: kosaki.motohiro@jp.fujitsu.com, Andrew Morton , Rik van Riel , Nick Piggin , Oleg Nesterov , KAMEZAWA Hiroyuki , Balbir 
Singh , linux-mm@kvack.org List-ID: > The oom killer presently kills current whenever there is no more memory > free or reclaimable on its mempolicy's nodes. There is no guarantee that > current is a memory-hogging task or that killing it will free any > substantial amount of memory, however. > > In such situations, it is better to scan the tasklist for nodes that are > allowed to allocate on current's set of nodes and kill the task with the > highest badness() score. This ensures that the most memory-hogging task, > or the one configured by the user with /proc/pid/oom_adj, is always > selected in such scenarios. > > Reviewed-by: KOSAKI Motohiro > Signed-off-by: David Rientjes > --- > include/linux/mempolicy.h | 13 +++++++- > mm/mempolicy.c | 44 +++++++++++++++++++++++++ > mm/oom_kill.c | 77 +++++++++++++++++++++++++++----------------- > 3 files changed, 103 insertions(+), 31 deletions(-) > > diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h > --- a/include/linux/mempolicy.h > +++ b/include/linux/mempolicy.h > @@ -210,6 +210,8 @@ extern struct zonelist *huge_zonelist(struct vm_area_struct *vma, > unsigned long addr, gfp_t gfp_flags, > struct mempolicy **mpol, nodemask_t **nodemask); > extern bool init_nodemask_of_mempolicy(nodemask_t *mask); > +extern bool mempolicy_nodemask_intersects(struct task_struct *tsk, > + const nodemask_t *mask); > extern unsigned slab_node(struct mempolicy *policy); > > extern enum zone_type policy_zone; > @@ -338,7 +340,16 @@ static inline struct zonelist *huge_zonelist(struct vm_area_struct *vma, > return node_zonelist(0, gfp_flags); > } > > -static inline bool init_nodemask_of_mempolicy(nodemask_t *m) { return false; } > +static inline bool init_nodemask_of_mempolicy(nodemask_t *m) > +{ > + return false; > +} > + > +static inline bool mempolicy_nodemask_intersects(struct task_struct *tsk, > + const nodemask_t *mask) > +{ > + return false; > +} > > static inline int do_migrate_pages(struct mm_struct *mm, > const 
nodemask_t *from_nodes, > diff --git a/mm/mempolicy.c b/mm/mempolicy.c > --- a/mm/mempolicy.c > +++ b/mm/mempolicy.c > @@ -1712,6 +1712,50 @@ bool init_nodemask_of_mempolicy(nodemask_t *mask) > } > #endif > > +/* > + * mempolicy_nodemask_intersects > + * > + * If tsk's mempolicy is "default" [NULL], return 'true' to indicate default > + * policy. Otherwise, check for intersection between mask and the policy > + * nodemask for 'bind' or 'interleave' policy. For 'perferred' or 'local' > + * policy, always return true since it may allocate elsewhere on fallback. > + * > + * Takes task_lock(tsk) to prevent freeing of its mempolicy. > + */ > +bool mempolicy_nodemask_intersects(struct task_struct *tsk, > + const nodemask_t *mask) > +{ > + struct mempolicy *mempolicy; > + bool ret = true; > + > + if (!mask) > + return ret; > + task_lock(tsk); > + mempolicy = tsk->mempolicy; > + if (!mempolicy) > + goto out; > + > + switch (mempolicy->mode) { > + case MPOL_PREFERRED: > + /* > + * MPOL_PREFERRED and MPOL_F_LOCAL are only preferred nodes to > + * allocate from, they may fallback to other nodes when oom. > + * Thus, it's possible for tsk to have allocated memory from > + * nodes in mask. > + */ > + break; > + case MPOL_BIND: > + case MPOL_INTERLEAVE: > + ret = nodes_intersects(mempolicy->v.nodes, *mask); > + break; > + default: > + BUG(); > + } > +out: > + task_unlock(tsk); > + return ret; > +} > + > /* Allocate a page in interleaved policy. > Own path because it needs to do special accounting. */ > static struct page *alloc_page_interleave(gfp_t gfp, unsigned order, > diff --git a/mm/oom_kill.c b/mm/oom_kill.c > --- a/mm/oom_kill.c > +++ b/mm/oom_kill.c > @@ -27,6 +27,7 @@ > #include > #include > #include > +#include > #include > > int sysctl_panic_on_oom; > @@ -37,19 +38,35 @@ static DEFINE_SPINLOCK(zone_scan_lock); > > /* > * Do all threads of the target process overlap our allowed nodes? 
> + * @tsk: task struct of which task to consider > + * @mask: nodemask passed to page allocator for mempolicy ooms > */ > -static int has_intersects_mems_allowed(struct task_struct *tsk) > +static bool has_intersects_mems_allowed(struct task_struct *tsk, > + const nodemask_t *mask) nodemask is a better name than plain "mask". > { > - struct task_struct *t; > + struct task_struct *start = tsk; > > - t = tsk; > do { > - if (cpuset_mems_allowed_intersects(current, t)) > - return 1; > - t = next_thread(t); > - } while (t != tsk); > - > - return 0; > + if (mask) { > + /* > + * If this is a mempolicy constrained oom, tsk's > + * cpuset is irrelevant. Only return true if its > + * mempolicy intersects current, otherwise it may be > + * needlessly killed. > + */ > + if (mempolicy_nodemask_intersects(tsk, mask)) > + return true; > + } else { > + /* > + * This is not a mempolicy constrained oom, so only > + * check the mems of tsk's cpuset. > + */ > + if (cpuset_mems_allowed_intersects(current, tsk)) > + return true; > + } > + tsk = next_thread(tsk); > + } while (tsk != start); > + return false; I have rewritten this to use while_each_thread(). Please take a look. > } > > /** > @@ -237,7 +254,8 @@ static enum oom_constraint constrained_alloc(struct zonelist *zonelist, > * (not docbooked, we don't want this one cluttering up the manual) > */ > static struct task_struct *select_bad_process(unsigned long *ppoints, > - struct mem_cgroup *mem) > + struct mem_cgroup *mem, enum oom_constraint constraint, > + const nodemask_t *mask) The constraint argument isn't needed. In the !CONSTRAINT_MEMORY_POLICY case, we can just pass mask==NULL. And here, too, nodemask is a better name. 
>  {
>  	struct task_struct *p;
>  	struct task_struct *chosen = NULL;
> @@ -259,7 +277,9 @@ static struct task_struct *select_bad_process(unsigned long *ppoints,
>  			continue;
>  		if (mem && !task_in_mem_cgroup(p, mem))
>  			continue;
> -		if (!has_intersects_mems_allowed(p))
> +		if (!has_intersects_mems_allowed(p,
> +				constraint == CONSTRAINT_MEMORY_POLICY ? mask :
> +									 NULL))
>  			continue;
>
>  		/*
> @@ -483,7 +503,7 @@ void mem_cgroup_out_of_memory(struct mem_cgroup *mem, gfp_t gfp_mask)
>  		panic("out of memory(memcg). panic_on_oom is selected.\n");
>  	read_lock(&tasklist_lock);
>  retry:
> -	p = select_bad_process(&points, mem);
> +	p = select_bad_process(&points, mem, CONSTRAINT_NONE, NULL);
>  	if (!p || PTR_ERR(p) == -1UL)
>  		goto out;
>
> @@ -562,7 +582,8 @@ void clear_zonelist_oom(struct zonelist *zonelist, gfp_t gfp_mask)
>  /*
>   * Must be called with tasklist_lock held for read.
>   */
> -static void __out_of_memory(gfp_t gfp_mask, int order)
> +static void __out_of_memory(gfp_t gfp_mask, int order,
> +			enum oom_constraint constraint, const nodemask_t *mask)
>  {
>  	struct task_struct *p;
>  	unsigned long points;
> @@ -576,7 +597,7 @@ retry:
>  	 * Rambo mode: Shoot down a process and hope it solves whatever
>  	 * issues we may have.
>  	 */
> -	p = select_bad_process(&points, NULL);
> +	p = select_bad_process(&points, NULL, constraint, mask);
>
>  	if (PTR_ERR(p) == -1UL)
>  		return;
> @@ -610,7 +631,8 @@ void pagefault_out_of_memory(void)
>  		panic("out of memory from page fault.
panic_on_oom is selected.\n");
>
>  	read_lock(&tasklist_lock);
> -	__out_of_memory(0, 0); /* unknown gfp_mask and order */
> +	/* unknown gfp_mask and order */
> +	__out_of_memory(0, 0, CONSTRAINT_NONE, NULL);
>  	read_unlock(&tasklist_lock);
>
>  	/*
> @@ -626,6 +648,7 @@ void pagefault_out_of_memory(void)
>   * @zonelist: zonelist pointer
>   * @gfp_mask: memory allocation flags
>   * @order: amount of memory being requested as a power of 2
> + * @nodemask: nodemask passed to page allocator
>   *
>   * If we run out of memory, we have the choice between either
>   * killing a random task (bad), letting the system crash (worse)
> @@ -654,24 +677,18 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
>  	 */
>  	constraint = constrained_alloc(zonelist, gfp_mask, nodemask);
>  	read_lock(&tasklist_lock);
> -
> -	switch (constraint) {
> -	case CONSTRAINT_MEMORY_POLICY:
> -		oom_kill_process(current, gfp_mask, order, 0, NULL,
> -				"No available memory (MPOL_BIND)");
> -		break;
> -
> -	case CONSTRAINT_NONE:
> -		if (sysctl_panic_on_oom) {
> +	if (unlikely(sysctl_panic_on_oom)) {
> +		/*
> +		 * panic_on_oom only affects CONSTRAINT_NONE, the kernel
> +		 * should not panic for cpuset or mempolicy induced memory
> +		 * failures.
> +		 */
> +		if (constraint == CONSTRAINT_NONE) {
>  			dump_header(NULL, gfp_mask, order, NULL);
> -			panic("out of memory. panic_on_oom is selected\n");
> +			panic("Out of memory: panic_on_oom is enabled\n");

You shouldn't mix in unrelated changes.

>  		}
> -		/* Fall-through */
> -	case CONSTRAINT_CPUSET:
> -		__out_of_memory(gfp_mask, order);
> -		break;
>  	}
> -
> +	__out_of_memory(gfp_mask, order, constraint, nodemask);
>  	read_unlock(&tasklist_lock);
>
>  	/*
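For reference, the candidate test that the quoted hunks build up can be modeled in a few lines of userspace C. This is only a sketch under assumptions: model_nodemask_t (a 64-bit bitmask), struct model_task and the model_* functions are hypothetical stand-ins rather than kernel APIs, and the per-thread walk and task_lock() from the patch are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical userspace stand-ins; none of these are the kernel's types. */
typedef uint64_t model_nodemask_t;	/* bit n set = node n is allowed */

enum model_mpol {
	MODEL_MPOL_DEFAULT,
	MODEL_MPOL_PREFERRED,
	MODEL_MPOL_BIND,
	MODEL_MPOL_INTERLEAVE
};

struct model_task {
	enum model_mpol mode;
	model_nodemask_t policy_nodes;	/* used for BIND and INTERLEAVE only */
	model_nodemask_t cpuset_mems;	/* nodes allowed by the task's cpuset */
};

/* Follows the rule stated in the quoted mempolicy_nodemask_intersects() comment. */
static bool model_policy_intersects(const struct model_task *t,
				    model_nodemask_t mask)
{
	assert(t != NULL);
	switch (t->mode) {
	case MODEL_MPOL_BIND:
	case MODEL_MPOL_INTERLEAVE:
		return (t->policy_nodes & mask) != 0;
	default:
		/* default and preferred policies may fall back to other nodes */
		return true;
	}
}

/*
 * mask == NULL models an oom that is not mempolicy constrained, in which
 * case only the cpusets of the two tasks are compared.
 */
static bool model_is_candidate(const struct model_task *cur,
			       const struct model_task *t,
			       const model_nodemask_t *mask)
{
	assert(cur != NULL && t != NULL);
	if (mask)
		return model_policy_intersects(t, *mask);
	return (cur->cpuset_mems & t->cpuset_mems) != 0;
}
```

With mask pointing at 0x1, a task bound only to node 1 is skipped, while a task with a preferred or default policy stays eligible. With mask == NULL, only the cpuset masks are compared.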
From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail172.messagelabs.com (mail172.messagelabs.com [216.82.254.3]) by kanga.kvack.org (Postfix) with ESMTP id 07DD76B01C4 for ; Tue, 8 Jun 2010 14:38:05 -0400 (EDT)
Received: from kpbe16.cbf.corp.google.com (kpbe16.cbf.corp.google.com [172.25.105.80]) by smtp-out.google.com with ESMTP id o58Ibx3e007308 for ; Tue, 8 Jun 2010 11:38:00 -0700
Received: from pwj6 (pwj6.prod.google.com [10.241.219.70]) by kpbe16.cbf.corp.google.com with ESMTP id o58IbwR3031940 for ; Tue, 8 Jun 2010 11:37:58 -0700
Received: by pwj6 with SMTP id 6so226050pwj.8 for ; Tue, 08 Jun 2010 11:37:58 -0700 (PDT)
Date: Tue, 8 Jun 2010 11:37:52 -0700 (PDT)
From: David Rientjes
Subject: Re: [patch -mm 01/18] oom: filter tasks not sharing the same cpuset
In-Reply-To: <20100606170713.8718.A69D9226@jp.fujitsu.com>
Message-ID:
References: <20100606170713.8718.A69D9226@jp.fujitsu.com>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-linux-mm@kvack.org
To: KOSAKI Motohiro
Cc: Andrew Morton, Rik van Riel, Nick Piggin, Oleg Nesterov, KAMEZAWA Hiroyuki, Balbir Singh, linux-mm@kvack.org
List-ID:

On Tue, 8 Jun 2010, KOSAKI Motohiro wrote:

> > @@ -267,6 +259,8 @@ static struct task_struct *select_bad_process(unsigned long *ppoints,
> >  			continue;
> >  		if (mem && !task_in_mem_cgroup(p, mem))
> >  			continue;
> > +		if (!has_intersects_mems_allowed(p))
> > +			continue;
> >
> >  		/*
> >  		 * This task already has access to memory reserves and is
>
> now we have three places of oom filtering
> (1) select_bad_process

Done.

> (2) dump_tasks

dump_tasks() has never filtered on this; it's possible for tasks in other
cpusets to allocate memory on our nodes.

> (3) oom_kill_task (when oom_kill_allocating_task==1 only)
>

Why would we care about cpuset attachment in oom_kill_task()? Do you mean
oom_kill_process(), to filter the children list?
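
As a point of reference for item (1), the effect of the added check on victim selection can be modeled in userspace C. This is a minimal sketch under assumptions, not the kernel's select_bad_process(): struct model_candidate, its 64-bit mems_allowed bitmask and model_select_bad_process() are hypothetical names, and a score of 0 stands for a task that badness() treats as unkillable.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a task; not the kernel's task_struct. */
struct model_candidate {
	const char *comm;	/* name, for illustration only */
	uint64_t mems_allowed;	/* bit n set = node n allowed by the cpuset */
	unsigned long points;	/* badness score; 0 means unkillable */
};

/*
 * Return the highest-scoring task whose allowed nodes intersect cur_mems,
 * or NULL if no task qualifies.
 */
static const struct model_candidate *
model_select_bad_process(uint64_t cur_mems,
			 const struct model_candidate *tasks, size_t n)
{
	const struct model_candidate *chosen = NULL;

	assert(tasks != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		const struct model_candidate *p = &tasks[i];

		/* the filter under discussion: skip disjoint cpusets */
		if (!(cur_mems & p->mems_allowed))
			continue;
		/* unkillable tasks are never chosen */
		if (p->points == 0)
			continue;
		if (!chosen || p->points > chosen->points)
			chosen = p;
	}
	return chosen;
}
```

In this model a large task confined to nodes that current cannot use is never returned, however high its score, which is the behaviour the patch argues for.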
-- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail190.messagelabs.com (mail190.messagelabs.com [216.82.249.51]) by kanga.kvack.org (Postfix) with ESMTP id 4E6C46B01C6 for ; Tue, 8 Jun 2010 14:39:03 -0400 (EDT) Received: from hpaq7.eem.corp.google.com (hpaq7.eem.corp.google.com [172.25.149.7]) by smtp-out.google.com with ESMTP id o58IcxnU016494 for ; Tue, 8 Jun 2010 11:38:59 -0700 Received: from pzk30 (pzk30.prod.google.com [10.243.19.158]) by hpaq7.eem.corp.google.com with ESMTP id o58IcR1B017091 for ; Tue, 8 Jun 2010 11:38:58 -0700 Received: by pzk30 with SMTP id 30so4279452pzk.6 for ; Tue, 08 Jun 2010 11:38:58 -0700 (PDT) Date: Tue, 8 Jun 2010 11:38:54 -0700 (PDT) From: David Rientjes Subject: Re: [patch -mm 11/18] oom: avoid oom killer for lowmem allocations In-Reply-To: <20100606184014.8727.A69D9226@jp.fujitsu.com> Message-ID: References: <20100606184014.8727.A69D9226@jp.fujitsu.com> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-linux-mm@kvack.org To: KOSAKI Motohiro Cc: Andrew Morton , Rik van Riel , Nick Piggin , Oleg Nesterov , KAMEZAWA Hiroyuki , Balbir Singh , linux-mm@kvack.org List-ID: On Tue, 8 Jun 2010, KOSAKI Motohiro wrote: > > Previously, the heuristic provided some protection for those tasks with > > CAP_SYS_RAWIO, but this is no longer necessary since we will not be > > killing tasks for the purposes of ISA allocations. > > Seems incorrect. CAP_SYS_RAWIO tasks usually both use GFP_KERNEL and GFP_DMA. > Even if last allocation is GFP_KERNEL, it doesn't provide any gurantee the > process doesn't have any in flight I/O. > Right, that's why I said it "provided some protection". > Then, we can't remove for RAWIO protection from oom heuristics. but the code > itself seems ok though. 
> It's removed with my heuristic rewrite. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail138.messagelabs.com (mail138.messagelabs.com [216.82.249.35]) by kanga.kvack.org (Postfix) with ESMTP id 9E5976B01C4 for ; Tue, 8 Jun 2010 14:41:21 -0400 (EDT) Received: from kpbe17.cbf.corp.google.com (kpbe17.cbf.corp.google.com [172.25.105.81]) by smtp-out.google.com with ESMTP id o58IfIrA020362 for ; Tue, 8 Jun 2010 11:41:18 -0700 Received: from pvc21 (pvc21.prod.google.com [10.241.209.149]) by kpbe17.cbf.corp.google.com with ESMTP id o58IfH3V020016 for ; Tue, 8 Jun 2010 11:41:17 -0700 Received: by pvc21 with SMTP id 21so420090pvc.20 for ; Tue, 08 Jun 2010 11:41:17 -0700 (PDT) Date: Tue, 8 Jun 2010 11:41:12 -0700 (PDT) From: David Rientjes Subject: Re: [patch -mm 02/18] oom: sacrifice child with highest badness score for parent In-Reply-To: <20100606175117.8721.A69D9226@jp.fujitsu.com> Message-ID: References: <20100606175117.8721.A69D9226@jp.fujitsu.com> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-linux-mm@kvack.org To: KOSAKI Motohiro Cc: Andrew Morton , Rik van Riel , Nick Piggin , Oleg Nesterov , KAMEZAWA Hiroyuki , Balbir Singh , linux-mm@kvack.org List-ID: On Tue, 8 Jun 2010, KOSAKI Motohiro wrote: > > Reviewers may observe that the previous implementation would iterate > > through the children and attempt to kill each until one was successful and > > then the parent if none were found while the new code simply kills the > > most memory-hogging task or the parent. Note that the only time > > oom_kill_task() fails, however, is when a child does not have an mm or has > > a /proc/pid/oom_adj of OOM_DISABLE. badness() returns 0 for both cases, > > so the final oom_kill_task() will always succeed. 
> > probably we need to call has_intersects_mems_allowed() in this loop. likes > > /* Try to sacrifice the worst child first */ > do { > list_for_each_entry(c, &t->children, sibling) { > unsigned long cpoints; > > if (c->mm == p->mm) > continue; > if (oom_unkillable(c, mem, nodemask)) > continue; > > /* oom_badness() returns 0 if the thread is unkillable */ > cpoints = oom_badness(c); > if (cpoints > victim_points) { > victim = c; > victim_points = cpoints; > } > } > } while_each_thread(p, t); > > > It mean we shouldn't assume parent and child have the same mems_allowed, > perhaps. > I'd be happy to have that in oom_kill_process() if you pass the enum oom_constraint and only do it for CONSTRAINT_CPUSET. Please add a followup patch to my latest patch series. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail144.messagelabs.com (mail144.messagelabs.com [216.82.254.51]) by kanga.kvack.org (Postfix) with ESMTP id 783306B01C4 for ; Tue, 8 Jun 2010 14:43:23 -0400 (EDT) Received: from hpaq2.eem.corp.google.com (hpaq2.eem.corp.google.com [172.25.149.2]) by smtp-out.google.com with ESMTP id o58IhJAk015486 for ; Tue, 8 Jun 2010 11:43:19 -0700 Received: from pxi19 (pxi19.prod.google.com [10.243.27.19]) by hpaq2.eem.corp.google.com with ESMTP id o58IgJN0029021 for ; Tue, 8 Jun 2010 11:43:18 -0700 Received: by pxi19 with SMTP id 19so1830460pxi.3 for ; Tue, 08 Jun 2010 11:43:17 -0700 (PDT) Date: Tue, 8 Jun 2010 11:43:13 -0700 (PDT) From: David Rientjes Subject: Re: [patch -mm 01/18] oom: filter tasks not sharing the same cpuset In-Reply-To: <20100607084024.873B.A69D9226@jp.fujitsu.com> Message-ID: References: <20100607084024.873B.A69D9226@jp.fujitsu.com> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-linux-mm@kvack.org To: KOSAKI Motohiro Cc: Andrew 
Morton , Rik van Riel , Nick Piggin , Oleg Nesterov , KAMEZAWA Hiroyuki , Balbir Singh , linux-mm@kvack.org List-ID: On Tue, 8 Jun 2010, KOSAKI Motohiro wrote: > I've put following historically remark in the description of the patch. > > > We applied the exactly same patch in 2005: > > : commit ef08e3b4981aebf2ba9bd7025ef7210e8eec07ce > : Author: Paul Jackson > : Date: Tue Sep 6 15:18:13 2005 -0700 > : > : [PATCH] cpusets: confine oom_killer to mem_exclusive cpuset > : > : Now the real motivation for this cpuset mem_exclusive patch series seems > : trivial. > : > : This patch keeps a task in or under one mem_exclusive cpuset from provoking an > : oom kill of a task under a non-overlapping mem_exclusive cpuset. Since only > : interrupt and GFP_ATOMIC allocations are allowed to escape mem_exclusive > : containment, there is little to gain from oom killing a task under a > : non-overlapping mem_exclusive cpuset, as almost all kernel and user memory > : allocation must come from disjoint memory nodes. > : > : This patch enables configuring a system so that a runaway job under one > : mem_exclusive cpuset cannot cause the killing of a job in another such cpuset > : that might be using very high compute and memory resources for a prolonged > : time. > > And we changed it to current logic in 2006 > > : commit 7887a3da753e1ba8244556cc9a2b38c815bfe256 > : Author: Nick Piggin > : Date: Mon Sep 25 23:31:29 2006 -0700 > : > : [PATCH] oom: cpuset hint > : > : cpuset_excl_nodes_overlap does not always indicate that killing a task will > : not free any memory we for us. For example, we may be asking for an > : allocation from _anywhere_ in the machine, or the task in question may be > : pinning memory that is outside its cpuset. Fix this by just causing > : cpuset_excl_nodes_overlap to reduce the badness rather than disallow it. > > And we haven't get the explanation why this patch doesn't reintroduced > an old issue. > > I don't refuse a patch if it have multiple ack. 
But if you have any > material or number, please show us soon. > And this patch is acked by the 2006 patch's author, Nick Piggin. There's obviously not going to be any "number" to show that this means anything, but we've run it internally for three years to prevent needless oom killing in other cpusets that don't have any indication that it will free memory that current needs. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail172.messagelabs.com (mail172.messagelabs.com [216.82.254.3]) by kanga.kvack.org (Postfix) with ESMTP id 21BF06B01C4 for ; Tue, 8 Jun 2010 14:45:30 -0400 (EDT) Received: from hpaq12.eem.corp.google.com (hpaq12.eem.corp.google.com [172.25.149.12]) by smtp-out.google.com with ESMTP id o58IjPIV025984 for ; Tue, 8 Jun 2010 11:45:25 -0700 Received: from pxi19 (pxi19.prod.google.com [10.243.27.19]) by hpaq12.eem.corp.google.com with ESMTP id o58IjOc7030478 for ; Tue, 8 Jun 2010 11:45:24 -0700 Received: by pxi19 with SMTP id 19so2322545pxi.17 for ; Tue, 08 Jun 2010 11:45:23 -0700 (PDT) Date: Tue, 8 Jun 2010 11:45:18 -0700 (PDT) From: David Rientjes Subject: Re: [patch -mm 02/18] oom: sacrifice child with highest badness score for parent In-Reply-To: <20100607221121.8781.A69D9226@jp.fujitsu.com> Message-ID: References: <20100607221121.8781.A69D9226@jp.fujitsu.com> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-linux-mm@kvack.org To: KOSAKI Motohiro Cc: Andrew Morton , Rik van Riel , Nick Piggin , Oleg Nesterov , KAMEZAWA Hiroyuki , Balbir Singh , linux-mm@kvack.org List-ID: On Tue, 8 Jun 2010, KOSAKI Motohiro wrote: > > @@ -447,19 +450,27 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order, > > return 0; > > } > > > > - printk(KERN_ERR "%s: kill process %d (%s) score %li or a child\n", > > - 
message, task_pid_nr(p), p->comm, points); > > + pr_err("%s: Kill process %d (%s) with score %lu or sacrifice child\n", > > + message, task_pid_nr(p), p->comm, points); > > > > - /* Try to kill a child first */ > > + do_posix_clock_monotonic_gettime(&uptime); > > + /* Try to sacrifice the worst child first */ > > list_for_each_entry(c, &p->children, sibling) { > > + unsigned long cpoints; > > + > > if (c->mm == p->mm) > > continue; > > if (mem && !task_in_mem_cgroup(c, mem)) > > continue; > > - if (!oom_kill_task(c)) > > - return 0; > > + > > need to the check of cpuset (and memplicy) memory intersection here, probably. > otherwise, this may selected innocence task. > I'll do this, then, if you don't want to post your own patch. Fine. > also, OOM_DISABL check is necessary? > No, badness() is 0 for tasks that are OOM_DISABLE. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail202.messagelabs.com (mail202.messagelabs.com [216.82.254.227]) by kanga.kvack.org (Postfix) with ESMTP id 7A55F6B01C3 for ; Tue, 8 Jun 2010 19:25:23 -0400 (EDT) Date: Tue, 8 Jun 2010 16:25:13 -0700 From: Andrew Morton Subject: Re: [patch -mm 01/18] oom: filter tasks not sharing the same cpuset Message-Id: <20100608162513.c633439e.akpm@linux-foundation.org> In-Reply-To: References: <20100607084024.873B.A69D9226@jp.fujitsu.com> Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org To: David Rientjes Cc: KOSAKI Motohiro , Rik van Riel , Nick Piggin , Oleg Nesterov , KAMEZAWA Hiroyuki , Balbir Singh , linux-mm@kvack.org List-ID: On Tue, 8 Jun 2010 11:43:13 -0700 (PDT) David Rientjes wrote: > On Tue, 8 Jun 2010, KOSAKI Motohiro wrote: > > > I've put following historically remark in the description of the patch. 
> > > > > > We applied the exactly same patch in 2005: > > > > : commit ef08e3b4981aebf2ba9bd7025ef7210e8eec07ce > > : Author: Paul Jackson > > : Date: Tue Sep 6 15:18:13 2005 -0700 > > : > > : [PATCH] cpusets: confine oom_killer to mem_exclusive cpuset > > : > > : Now the real motivation for this cpuset mem_exclusive patch series seems > > : trivial. > > : > > : This patch keeps a task in or under one mem_exclusive cpuset from provoking an > > : oom kill of a task under a non-overlapping mem_exclusive cpuset. Since only > > : interrupt and GFP_ATOMIC allocations are allowed to escape mem_exclusive > > : containment, there is little to gain from oom killing a task under a > > : non-overlapping mem_exclusive cpuset, as almost all kernel and user memory > > : allocation must come from disjoint memory nodes. > > : > > : This patch enables configuring a system so that a runaway job under one > > : mem_exclusive cpuset cannot cause the killing of a job in another such cpuset > > : that might be using very high compute and memory resources for a prolonged > > : time. > > > > And we changed it to current logic in 2006 > > > > : commit 7887a3da753e1ba8244556cc9a2b38c815bfe256 > > : Author: Nick Piggin > > : Date: Mon Sep 25 23:31:29 2006 -0700 > > : > > : [PATCH] oom: cpuset hint > > : > > : cpuset_excl_nodes_overlap does not always indicate that killing a task will > > : not free any memory we for us. For example, we may be asking for an > > : allocation from _anywhere_ in the machine, or the task in question may be > > : pinning memory that is outside its cpuset. Fix this by just causing > > : cpuset_excl_nodes_overlap to reduce the badness rather than disallow it. > > > > And we haven't get the explanation why this patch doesn't reintroduced > > an old issue. hm, that was some good kernel archeological research. > > I don't refuse a patch if it have multiple ack. But if you have any > > material or number, please show us soon. 
> >
>
> And this patch is acked by the 2006 patch's author, Nick Piggin.
>
> There's obviously not going to be any "number" to show that this means
> anything, but we've run it internally for three years to prevent needless
> oom killing in other cpusets that don't have any indication that it will
> free memory that current needs.

Well I wonder if Nick had observed some problem which the 2006 change
fixed.

And I wonder if David has observed some problem which the 2010 change
fixes!

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail172.messagelabs.com (mail172.messagelabs.com [216.82.254.3]) by kanga.kvack.org (Postfix) with ESMTP id 8E3CB6B01C8 for ; Tue, 8 Jun 2010 19:28:15 -0400 (EDT)
Date: Tue, 8 Jun 2010 16:28:07 -0700
From: Andrew Morton
Subject: Re: [patch -mm 03/18] oom: select task from tasklist for mempolicy ooms
Message-Id: <20100608162807.d73d02ef.akpm@linux-foundation.org>
In-Reply-To: <20100607085714.8750.A69D9226@jp.fujitsu.com>
References: <20100607085714.8750.A69D9226@jp.fujitsu.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Sender: owner-linux-mm@kvack.org
To: KOSAKI Motohiro
Cc: David Rientjes, Rik van Riel, Nick Piggin, Oleg Nesterov, KAMEZAWA Hiroyuki, Balbir Singh, linux-mm@kvack.org
List-ID:

On Tue, 8 Jun 2010 20:41:52 +0900 (JST) KOSAKI Motohiro wrote:

> > -			panic("out of memory. panic_on_oom is selected\n");
> > +			panic("Out of memory: panic_on_oom is enabled\n");
>
> you shouldn't mix in undocumented and unnecessary changes.

Well... strictly true. But there's not a lot of benefit in being all
dogmatic about these things. If the change is simple and is of some
benefit and doesn't muck up the patch too much, I just let it go, shrug.
-- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail143.messagelabs.com (mail143.messagelabs.com [216.82.254.35]) by kanga.kvack.org (Postfix) with ESMTP id 3EB6D6B01CC for ; Tue, 8 Jun 2010 19:47:30 -0400 (EDT) Date: Tue, 8 Jun 2010 16:47:22 -0700 From: Andrew Morton Subject: Re: [patch -mm 08/18] oom: badness heuristic rewrite Message-Id: <20100608164722.9724baf9.akpm@linux-foundation.org> In-Reply-To: <20100608172820.7645.A69D9226@jp.fujitsu.com> References: <20100604195328.72D9.A69D9226@jp.fujitsu.com> <20100608172820.7645.A69D9226@jp.fujitsu.com> Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org To: KOSAKI Motohiro Cc: David Rientjes , Rik van Riel , Nick Piggin , Oleg Nesterov , KAMEZAWA Hiroyuki , Balbir Singh , linux-mm@kvack.org List-ID: On Tue, 8 Jun 2010 20:41:55 +0900 (JST) KOSAKI Motohiro wrote: > of the patch don't concentrate one thing. 2) That is strongly concentrate > "what and how to implement". But reviewers don't want such imformation so much > because they can read C language. reviewers need following information. > - background > - why do the author choose this way? > - why do the author choose this default value? > - how to confirm your concept and implementation correct? > - etc etc > > thus, reviewers can trace the author thinking and makes good advise and judgement. > example in this case, you wrote > - default threshold is 1000 > - only accumurate 1st generation execve children > - time threshold is a second > > but not wrote why? mess sentence hide such lack of document. then, I usually enforce > a divide, because a divide naturally reduce to "which place change" document and > expose what lacking. > > Now I haven't get your intention. 
no test suite accelerate to can't get > author think which workload is a problem workload. hey, you're starting to sound like me. > > ... > > David, do you know other kernel engineer spent how much time for understanding > a real workload and dialog various open source community and linux user company > and user group? > > At least, All developers must make _effort_ to spent some time to investigate > userland use case when they want to introduce new feature and incompatibility. > Almost developers do. please read various new feature git log. few commit log > are ridiculous quiet (probably the author bother cut-n-paste from ML bug report) > but almost are wrote what is problem. > thus, we can double check the problem and the code are matched correctly. > > And, if you can't test your patch on various platform, at least you must to > write theorical background of your patch. it definitely help each are engineer > confirm your patch don't harm their area. However, for principal, if you > want to introduce any imcompatibility, you must investigate how much affect this. > > remark: if you think you need mathematical proof or 100% coveraged proof, > it's not correct. you don't need such impossible work. We just require to > confirm you investigate and consider enough large coverage. > > Usually, the author of small patch aren't required this. because reviewers can > think affected use-case from the code. almost reviewer have much use case knowledge > than typical kernel developers. but now, you are challenging full > of rewrite. We don't have enough information to finish reviewing. > > Last of all, I've send various review result by another mail. Can you please > read it? > I think I'm beginning to understand your concerns with these patches. Finally. Yes, it's a familiar one. 
I do fairly commonly see patches where the description can be summarised as "change lots and lots of stuff to no apparent end" and one does have to push and poke to squeeze out the thinking and the reasons. It's a useful exercise and will sometimes cause the originator to have a rethink, and sometimes reveals that it just wasn't a good change. Maybe if we'd been more diligent about all this around 2.6.12, we wouldn't have wrecked dirty-page writeout off the tail of the LRU. Which is STILL wrecked, btw. I think I read somewhere in one of David's emails that some of this code has been floating around in Google for several years? If so, the reasons for making certain changes might even be lost and forgotten. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail138.messagelabs.com (mail138.messagelabs.com [216.82.249.35]) by kanga.kvack.org (Postfix) with ESMTP id D8C7E6B01D7 for ; Tue, 8 Jun 2010 19:54:37 -0400 (EDT) Received: from hpaq13.eem.corp.google.com (hpaq13.eem.corp.google.com [172.25.149.13]) by smtp-out.google.com with ESMTP id o58NsYAK008308 for ; Tue, 8 Jun 2010 16:54:34 -0700 Received: from pwi5 (pwi5.prod.google.com [10.241.219.5]) by hpaq13.eem.corp.google.com with ESMTP id o58NsXnP031221 for ; Tue, 8 Jun 2010 16:54:33 -0700 Received: by pwi5 with SMTP id 5so400425pwi.12 for ; Tue, 08 Jun 2010 16:54:32 -0700 (PDT) Date: Tue, 8 Jun 2010 16:54:31 -0700 (PDT) From: David Rientjes Subject: Re: [patch -mm 01/18] oom: filter tasks not sharing the same cpuset In-Reply-To: <20100608162513.c633439e.akpm@linux-foundation.org> Message-ID: References: <20100607084024.873B.A69D9226@jp.fujitsu.com> <20100608162513.c633439e.akpm@linux-foundation.org> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-linux-mm@kvack.org To: Andrew Morton Cc: 
KOSAKI Motohiro , Rik van Riel , Nick Piggin , Oleg Nesterov , KAMEZAWA Hiroyuki , Balbir Singh , linux-mm@kvack.org List-ID: On Tue, 8 Jun 2010, Andrew Morton wrote: > And I wonder if David has observed some problem which the 2010 change > fixes! > Yes, as explained in my changelog. I'll paste it: Tasks that do not share the same set of allowed nodes with the task that triggered the oom should not be considered as candidates for oom kill. Tasks in other cpusets with a disjoint set of mems would be unfairly penalized otherwise because of oom conditions elsewhere; an extreme example could unfairly kill all other applications on the system if a single task in a user's cpuset sets itself to OOM_DISABLE and then uses more memory than allowed. Killing tasks outside of current's cpuset rarely would free memory for current anyway. To use a sane heuristic, we must ensure that killing a task would likely free memory for current and avoid needlessly killing others at all costs just because their potential memory freeing is unknown. It is better to kill current than another task needlessly. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail172.messagelabs.com (mail172.messagelabs.com [216.82.254.3]) by kanga.kvack.org (Postfix) with ESMTP id 1DC8F6B01D9 for ; Tue, 8 Jun 2010 20:06:40 -0400 (EDT) Date: Tue, 8 Jun 2010 17:06:30 -0700 From: Andrew Morton Subject: Re: [patch -mm 01/18] oom: filter tasks not sharing the same cpuset Message-Id: <20100608170630.80753ed1.akpm@linux-foundation.org> In-Reply-To: References: <20100607084024.873B.A69D9226@jp.fujitsu.com> <20100608162513.c633439e.akpm@linux-foundation.org> Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org To: David Rientjes Cc: KOSAKI Motohiro , Rik van Riel , Nick Piggin , Oleg Nesterov , KAMEZAWA Hiroyuki , Balbir Singh , linux-mm@kvack.org List-ID: On Tue, 8 Jun 2010 16:54:31 -0700 (PDT) David Rientjes wrote: > On Tue, 8 Jun 2010, Andrew Morton wrote: > > > And I wonder if David has observed some problem which the 2010 change > > fixes! > > > > Yes, as explained in my changelog. I'll paste it: > > Tasks that do not share the same set of allowed nodes with the task that > triggered the oom should not be considered as candidates for oom kill. > > Tasks in other cpusets with a disjoint set of mems would be unfairly > penalized otherwise because of oom conditions elsewhere; an extreme > example could unfairly kill all other applications on the system if a > single task in a user's cpuset sets itself to OOM_DISABLE and then uses > more memory than allowed. OK, so Nick's change didn't anticipate things being set to OOM_DISABLE? OOM_DISABLE seems pretty dangerous really - allows malicious unprivileged users to go homicidal? -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail191.messagelabs.com (mail191.messagelabs.com [216.82.242.19]) by kanga.kvack.org (Postfix) with ESMTP id 91F926B01E1 for ; Tue, 8 Jun 2010 21:07:10 -0400 (EDT) Received: from wpaz21.hot.corp.google.com (wpaz21.hot.corp.google.com [172.24.198.85]) by smtp-out.google.com with ESMTP id o59176QI021326 for ; Tue, 8 Jun 2010 18:07:07 -0700 Received: from pvg7 (pvg7.prod.google.com [10.241.210.135]) by wpaz21.hot.corp.google.com with ESMTP id o59175U7003550 for ; Tue, 8 Jun 2010 18:07:05 -0700 Received: by pvg7 with SMTP id 7so2244140pvg.39 for ; Tue, 08 Jun 2010 18:07:04 -0700 (PDT) Date: Tue, 8 Jun 2010 18:07:03 -0700 (PDT) From: David Rientjes Subject: Re: [patch -mm 01/18] oom: filter tasks not sharing the same cpuset In-Reply-To: <20100608170630.80753ed1.akpm@linux-foundation.org> Message-ID: References: <20100607084024.873B.A69D9226@jp.fujitsu.com> <20100608162513.c633439e.akpm@linux-foundation.org> <20100608170630.80753ed1.akpm@linux-foundation.org> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-linux-mm@kvack.org To: Andrew Morton Cc: KOSAKI Motohiro , Rik van Riel , Nick Piggin , Oleg Nesterov , KAMEZAWA Hiroyuki , Balbir Singh , linux-mm@kvack.org List-ID: On Tue, 8 Jun 2010, Andrew Morton wrote: > > Tasks that do not share the same set of allowed nodes with the task that > > triggered the oom should not be considered as candidates for oom kill. > > > > Tasks in other cpusets with a disjoint set of mems would be unfairly > > penalized otherwise because of oom conditions elsewhere; an extreme > > example could unfairly kill all other applications on the system if a > > single task in a user's cpuset sets itself to OOM_DISABLE and then uses > > more memory than allowed. > > OK, so Nick's change didn't anticipate things being set to OOM_DISABLE? 
>

I wrote out a more elaborate rebuttal to this in response to your reply to
my latest patchset. In short, not strictly eliminating these tasks from
consideration has three problems:

 - it unfairly penalizes tasks in other cpusets simply because they're
   big,

 - there's no way to understand the scale of other cpusets compared to
   current's with a single divide in the heuristic (in this case, divide
   by 8), and

 - there's no guarantee that killing such a task would free any memory.

The last point has two results: (i) we need to reinvoke the oom killer to
kill yet another task, and (ii) we've now unnecessarily killed a task
simply because it was large, and probably lost a substantial amount of
work.

> OOM_DISABLE seems pretty dangerous really - allows malicious
> unprivileged users to go homicidal?
>

OOM_DISABLE doesn't get set without CAP_SYS_RESOURCE; you need that
capability to decrease an oom_adj value. So my changelog could probably
benefit from s/user/job/.
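
The difference between the two heuristics can be illustrated with a small userspace C model. This is a sketch under assumptions, not kernel code: struct model_proc, the 64-bit mems_allowed bitmask and the model_* functions are hypothetical, and the only detail taken from the thread is that the older heuristic divides the score of a task in a disjoint cpuset by 8 while the proposed one drops such tasks entirely.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a task; not the kernel's task_struct. */
struct model_proc {
	uint64_t mems_allowed;	/* bit n set = node n allowed by the cpuset */
	unsigned long points;	/* raw badness score */
};

/* Older heuristic: a disjoint cpuset only reduces the score (divide by 8). */
static unsigned long model_score_penalized(uint64_t cur_mems,
					   const struct model_proc *p)
{
	assert(p != NULL);
	return (cur_mems & p->mems_allowed) ? p->points : p->points / 8;
}

/* Proposed heuristic: a disjoint cpuset removes the task from consideration. */
static unsigned long model_score_filtered(uint64_t cur_mems,
					  const struct model_proc *p)
{
	assert(p != NULL);
	return (cur_mems & p->mems_allowed) ? p->points : 0;
}

typedef unsigned long (*model_score_fn)(uint64_t, const struct model_proc *);

/* Return the index of the highest-scoring task, or -1 if every score is 0. */
static ptrdiff_t model_pick(uint64_t cur_mems, const struct model_proc *procs,
			    size_t n, model_score_fn score)
{
	ptrdiff_t best = -1;
	unsigned long best_points = 0;

	for (size_t i = 0; i < n; i++) {
		unsigned long s = score(cur_mems, &procs[i]);

		if (s > best_points) {
			best = (ptrdiff_t)i;
			best_points = s;
		}
	}
	return best;
}
```

With current on node 0, a local task scoring 1000 and a task on node 1 scoring 16000, the penalized score of the remote task is still 2000, so it is picked even though killing it may free nothing that current can use. The filtered variant picks the local task.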
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail138.messagelabs.com (mail138.messagelabs.com [216.82.249.35]) by kanga.kvack.org (Postfix) with SMTP id 100686B01B6 for ; Sun, 13 Jun 2010 07:24:57 -0400 (EDT) Received: from m1.gw.fujitsu.co.jp ([10.0.50.71]) by fgwmail7.fujitsu.co.jp (Fujitsu Gateway) with ESMTP id o5DBOtRg021755 for (envelope-from kosaki.motohiro@jp.fujitsu.com); Sun, 13 Jun 2010 20:24:56 +0900 Received: from smail (m1 [127.0.0.1]) by outgoing.m1.gw.fujitsu.co.jp (Postfix) with ESMTP id AAB6845DE52 for ; Sun, 13 Jun 2010 20:24:55 +0900 (JST) Received: from s1.gw.fujitsu.co.jp (s1.gw.fujitsu.co.jp [10.0.50.91]) by m1.gw.fujitsu.co.jp (Postfix) with ESMTP id 8787745DE4D for ; Sun, 13 Jun 2010 20:24:55 +0900 (JST) Received: from s1.gw.fujitsu.co.jp (localhost.localdomain [127.0.0.1]) by s1.gw.fujitsu.co.jp (Postfix) with ESMTP id 656911DB8051 for ; Sun, 13 Jun 2010 20:24:55 +0900 (JST) Received: from m106.s.css.fujitsu.com (m106.s.css.fujitsu.com [10.249.87.106]) by s1.gw.fujitsu.co.jp (Postfix) with ESMTP id 13EA01DB8045 for ; Sun, 13 Jun 2010 20:24:55 +0900 (JST) From: KOSAKI Motohiro Subject: Re: [patch -mm 01/18] oom: filter tasks not sharing the same cpuset In-Reply-To: <20100608170630.80753ed1.akpm@linux-foundation.org> References: <20100608170630.80753ed1.akpm@linux-foundation.org> Message-Id: <20100613184604.6184.A69D9226@jp.fujitsu.com> MIME-Version: 1.0 Content-Type: text/plain; charset="ISO-2022-JP" Content-Transfer-Encoding: 7bit Date: Sun, 13 Jun 2010 20:24:54 +0900 (JST) Sender: owner-linux-mm@kvack.org To: Andrew Morton Cc: kosaki.motohiro@jp.fujitsu.com, David Rientjes , Rik van Riel , Nick Piggin , Oleg Nesterov , KAMEZAWA Hiroyuki , Balbir Singh , linux-mm@kvack.org List-ID: > On Tue, 8 Jun 2010 16:54:31 -0700 (PDT) > David Rientjes wrote: > > > On Tue, 8 Jun 2010, Andrew Morton wrote: > > > > > And I wonder if David has observed some problem which the 2010 change > > > fixes! 
> > >
>
> > Yes, as explained in my changelog.  I'll paste it:
> >
> > Tasks that do not share the same set of allowed nodes with the task that
> > triggered the oom should not be considered as candidates for oom kill.
> >
> > Tasks in other cpusets with a disjoint set of mems would be unfairly
> > penalized otherwise because of oom conditions elsewhere; an extreme
> > example could unfairly kill all other applications on the system if a
> > single task in a user's cpuset sets itself to OOM_DISABLE and then uses
> > more memory than allowed.
>
> OK, so Nick's change didn't anticipate things being set to OOM_DISABLE?
>
> OOM_DISABLE seems pretty dangerous really - allows malicious
> unprivileged users to go homicidal?

Just to clarify, David's patch has the following pros and cons.

Pros
 - The 1/8 badness scaling was inaccurate, and it was unclear why 1/8
   was chosen.
 - Usually, almost no process changes its cpuset mask during its
   lifetime, so cpuset_mems_allowed_intersects() is a fairly good
   heuristic.

Cons
 - But processes can change their CPUSET mask, so we can't assume
   cpuset_mems_allowed_intersects() always gives a correct picture of
   memory usage.
 - The task may have mlocked page cache outside its CPUSET mask
   (probably when using cpuset.memory_spread_page).

I don't think this is an OOM_DISABLE-related issue. I think it is just a matter of heuristic choice. Both approaches obviously have corner cases. That is why I asked which typical workload is the concern and for test results.
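The two heuristics being compared can be sketched as a small userspace model. Everything here is an assumption made for illustration: `struct model_task`, `model_mems_intersect()` and `model_select()` are hypothetical names, node sets are modelled as plain bitmasks, and the scores are made-up numbers rather than real badness() output.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical task model; bit n of mems_allowed means node n is allowed. */
struct model_task {
	const char *comm;
	unsigned long mems_allowed;
	unsigned long points;
};

/* Stand-in for the intersection test discussed in the thread. */
static bool model_mems_intersect(const struct model_task *a,
				 const struct model_task *b)
{
	return (a->mems_allowed & b->mems_allowed) != 0;
}

/*
 * Pick the highest-scoring candidate relative to the task that hit the oom.
 * strict_filter == false: older heuristic, non-intersecting tasks stay
 *                         eligible with their score divided by 8.
 * strict_filter == true:  non-intersecting tasks are skipped entirely.
 */
static const struct model_task *
model_select(const struct model_task *tasks, size_t n,
	     const struct model_task *cur, bool strict_filter)
{
	const struct model_task *victim = NULL;
	unsigned long best = 0;
	size_t i;

	assert(tasks != NULL && cur != NULL);

	for (i = 0; i < n; i++) {
		unsigned long points = tasks[i].points;

		if (!model_mems_intersect(&tasks[i], cur)) {
			if (strict_filter)
				continue;
			points /= 8;
		}
		if (points > best) {
			best = points;
			victim = &tasks[i];
		}
	}
	return victim;
}
```

With a task scoring 100 on node 0 and a task scoring 1000 on node 1, an oom on node 0 selects the node 1 task under the divide-by-8 rule (1000 / 8 = 125 > 100) even though killing it frees nothing on node 0. The strict filter selects the local task instead. Neither model addresses the corner cases listed under Cons.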
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail144.messagelabs.com (mail144.messagelabs.com [216.82.254.51]) by kanga.kvack.org (Postfix) with SMTP id 5C4716B01B9 for ; Sun, 13 Jun 2010 07:24:58 -0400 (EDT) Received: from m4.gw.fujitsu.co.jp ([10.0.50.74]) by fgwmail6.fujitsu.co.jp (Fujitsu Gateway) with ESMTP id o5DBOsuU022656 for (envelope-from kosaki.motohiro@jp.fujitsu.com); Sun, 13 Jun 2010 20:24:55 +0900 Received: from smail (m4 [127.0.0.1]) by outgoing.m4.gw.fujitsu.co.jp (Postfix) with ESMTP id BCC8945DE6E for ; Sun, 13 Jun 2010 20:24:54 +0900 (JST) Received: from s4.gw.fujitsu.co.jp (s4.gw.fujitsu.co.jp [10.0.50.94]) by m4.gw.fujitsu.co.jp (Postfix) with ESMTP id 8342545DE70 for ; Sun, 13 Jun 2010 20:24:54 +0900 (JST) Received: from s4.gw.fujitsu.co.jp (localhost.localdomain [127.0.0.1]) by s4.gw.fujitsu.co.jp (Postfix) with ESMTP id 679111DB803A for ; Sun, 13 Jun 2010 20:24:54 +0900 (JST) Received: from m106.s.css.fujitsu.com (m106.s.css.fujitsu.com [10.249.87.106]) by s4.gw.fujitsu.co.jp (Postfix) with ESMTP id 23D1E1DB803B for ; Sun, 13 Jun 2010 20:24:54 +0900 (JST) From: KOSAKI Motohiro Subject: Re: [patch -mm 02/18] oom: sacrifice child with highest badness score for parent In-Reply-To: References: <20100606175117.8721.A69D9226@jp.fujitsu.com> Message-Id: <20100613184150.617E.A69D9226@jp.fujitsu.com> MIME-Version: 1.0 Content-Type: text/plain; charset="ISO-2022-JP" Content-Transfer-Encoding: 7bit Date: Sun, 13 Jun 2010 20:24:53 +0900 (JST) Sender: owner-linux-mm@kvack.org To: David Rientjes Cc: kosaki.motohiro@jp.fujitsu.com, Andrew Morton , Rik van Riel , Nick Piggin , Oleg Nesterov , KAMEZAWA Hiroyuki , Balbir Singh , linux-mm@kvack.org List-ID: > > It mean we shouldn't assume parent and child have the same mems_allowed, > > perhaps. > > > > I'd be happy to have that in oom_kill_process() if you pass the > enum oom_constraint and only do it for CONSTRAINT_CPUSET. 
Please add a > followup patch to my latest patch series. Please clarify. Why do we need CONSTRAINT_CPUSET filter? -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail190.messagelabs.com (mail190.messagelabs.com [216.82.249.51]) by kanga.kvack.org (Postfix) with SMTP id 851B96B01B8 for ; Sun, 13 Jun 2010 07:24:58 -0400 (EDT) Received: from m2.gw.fujitsu.co.jp ([10.0.50.72]) by fgwmail5.fujitsu.co.jp (Fujitsu Gateway) with ESMTP id o5DBOsWp007416 for (envelope-from kosaki.motohiro@jp.fujitsu.com); Sun, 13 Jun 2010 20:24:54 +0900 Received: from smail (m2 [127.0.0.1]) by outgoing.m2.gw.fujitsu.co.jp (Postfix) with ESMTP id 722A545DE51 for ; Sun, 13 Jun 2010 20:24:54 +0900 (JST) Received: from s2.gw.fujitsu.co.jp (s2.gw.fujitsu.co.jp [10.0.50.92]) by m2.gw.fujitsu.co.jp (Postfix) with ESMTP id 36EF745DD77 for ; Sun, 13 Jun 2010 20:24:54 +0900 (JST) Received: from s2.gw.fujitsu.co.jp (localhost.localdomain [127.0.0.1]) by s2.gw.fujitsu.co.jp (Postfix) with ESMTP id 1932B1DB803B for ; Sun, 13 Jun 2010 20:24:54 +0900 (JST) Received: from m106.s.css.fujitsu.com (m106.s.css.fujitsu.com [10.249.87.106]) by s2.gw.fujitsu.co.jp (Postfix) with ESMTP id AF8FF1DB803A for ; Sun, 13 Jun 2010 20:24:53 +0900 (JST) From: KOSAKI Motohiro Subject: Re: [patch -mm 01/18] oom: filter tasks not sharing the same cpuset In-Reply-To: References: <20100606170713.8718.A69D9226@jp.fujitsu.com> Message-Id: <20100613180405.6178.A69D9226@jp.fujitsu.com> MIME-Version: 1.0 Content-Type: text/plain; charset="ISO-2022-JP" Content-Transfer-Encoding: 7bit Date: Sun, 13 Jun 2010 20:24:52 +0900 (JST) Sender: owner-linux-mm@kvack.org To: David Rientjes Cc: kosaki.motohiro@jp.fujitsu.com, Andrew Morton , Rik van Riel , Nick Piggin , Oleg Nesterov , KAMEZAWA Hiroyuki , Balbir Singh , linux-mm@kvack.org List-ID: 
> On Tue, 8 Jun 2010, KOSAKI Motohiro wrote:
>
> > > @@ -267,6 +259,8 @@ static struct task_struct *select_bad_process(unsigned long *ppoints,
> > >  			continue;
> > >  		if (mem && !task_in_mem_cgroup(p, mem))
> > >  			continue;
> > > +		if (!has_intersects_mems_allowed(p))
> > > +			continue;
> > >
> > >  		/*
> > >  		 * This task already has access to memory reserves and is
> >
> > now we have three places of oom filtering
> >   (1) select_bad_process
>
> Done.
>
> >   (2) dump_tasks
>
> dump_tasks() has never filtered on this, it's possible for tasks in other
> cpusets to allocate memory on our nodes.

I have no objection, because it's a policy matter. But if so, dump_tasks() should probably display the mem_allowed mask too. Otherwise, the end user can't understand why a task with a high badness score but no intersecting mems wasn't killed.

> >   (3) oom_kill_task (when oom_kill_allocating_task==1 only)
> >
>
> Why would care about cpuset attachment in oom_kill_task()?  You mean
> oom_kill_process() to filter the children list?

Ah, interesting question. OK, we have to discuss the oom_kill_allocating_task design first.

First of all, filtering the children list in oom_kill_process() and this issue are independent and unrelated. My patch was not correct either.

Now, the basic oom_kill_allocating_task logic is below. It means that if oom_kill_process() returns 0, the oom kill finished successfully, but if oom_kill_process() returns 1, we fall back to the normal __out_of_memory() path.

===================================================
static void __out_of_memory(gfp_t gfp_mask, int order, nodemask_t *nodemask)
{
        struct task_struct *p;
        unsigned long points;

        if (sysctl_oom_kill_allocating_task)
                if (!oom_kill_process(current, gfp_mask, order, 0, NULL, nodemask,
                                      "Out of memory (oom_kill_allocating_task)"))
                        return;
retry:

When should oom_kill_process() return 1? I think it should be when
  - current is OOM_DISABLE
  - current has no intersecting CPUSET
  - current is a KTHREAD
  - etc.

That is, the rules should be consistent with the !oom_kill_allocating_task case.
So, my previous patch did not take into account the conflict with the "oom: sacrifice child with highest badness score for parent" patch. Probably the right way is

 static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
                             unsigned long points, struct mem_cgroup *mem,
                             nodemask_t *nodemask, const char *message)
 {
         struct task_struct *c;
         struct task_struct *t = p;
         struct task_struct *victim = p;
         unsigned long victim_points = 0;
         struct timespec uptime;

+        /*
+         * This process is not oom killable, we need to retry to select
+         * a bad process.
+         */
+        if (oom_unkillable(p, mem, nodemask))
+                return 1;

         if (printk_ratelimit())
                 dump_header(p, gfp_mask, order, mem, nodemask);

         pr_err("%s: Kill process %d (%s) with score %lu or sacrifice child\n",
                message, task_pid_nr(p), p->comm, points);

or something else. What do you think?

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail138.messagelabs.com (mail138.messagelabs.com [216.82.249.35]) by kanga.kvack.org (Postfix) with ESMTP id E3AA16B01BF for ; Mon, 14 Jun 2010 04:55:09 -0400 (EDT)
Received: from kpbe12.cbf.corp.google.com (kpbe12.cbf.corp.google.com [172.25.105.76]) by smtp-out.google.com with ESMTP id o5E8t4kg010755 for ; Mon, 14 Jun 2010 01:55:04 -0700
Received: from pxi5 (pxi5.prod.google.com [10.243.27.5]) by kpbe12.cbf.corp.google.com with ESMTP id o5E8t2sd007837 for ; Mon, 14 Jun 2010 01:55:03 -0700
Received: by pxi5 with SMTP id 5so497699pxi.31 for ; Mon, 14 Jun 2010 01:55:02 -0700 (PDT)
Date: Mon, 14 Jun 2010 01:54:58 -0700 (PDT)
From: David Rientjes
Subject: Re: [patch -mm 02/18] oom: sacrifice child with highest badness score for parent
In-Reply-To: <20100613184150.617E.A69D9226@jp.fujitsu.com>
Message-ID:
References: <20100606175117.8721.A69D9226@jp.fujitsu.com> <20100613184150.617E.A69D9226@jp.fujitsu.com>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-linux-mm@kvack.org To: KOSAKI Motohiro Cc: Andrew Morton , Rik van Riel , Nick Piggin , Oleg Nesterov , KAMEZAWA Hiroyuki , Balbir Singh , linux-mm@kvack.org List-ID: On Sun, 13 Jun 2010, KOSAKI Motohiro wrote: > > > It mean we shouldn't assume parent and child have the same mems_allowed, > > > perhaps. > > > > > > > I'd be happy to have that in oom_kill_process() if you pass the > > enum oom_constraint and only do it for CONSTRAINT_CPUSET. Please add a > > followup patch to my latest patch series. > > Please clarify. > Why do we need CONSTRAINT_CPUSET filter? > Because we don't care about intersecting mems_allowed unless it's a cpuset constrained oom. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail203.messagelabs.com (mail203.messagelabs.com [216.82.254.243]) by kanga.kvack.org (Postfix) with SMTP id 686436B01C1 for ; Mon, 14 Jun 2010 07:08:27 -0400 (EDT) Received: from m5.gw.fujitsu.co.jp ([10.0.50.75]) by fgwmail6.fujitsu.co.jp (Fujitsu Gateway) with ESMTP id o5EB8N5g010067 for (envelope-from kosaki.motohiro@jp.fujitsu.com); Mon, 14 Jun 2010 20:08:23 +0900 Received: from smail (m5 [127.0.0.1]) by outgoing.m5.gw.fujitsu.co.jp (Postfix) with ESMTP id 55EDD45DE52 for ; Mon, 14 Jun 2010 20:08:23 +0900 (JST) Received: from s5.gw.fujitsu.co.jp (s5.gw.fujitsu.co.jp [10.0.50.95]) by m5.gw.fujitsu.co.jp (Postfix) with ESMTP id 3123445DE51 for ; Mon, 14 Jun 2010 20:08:23 +0900 (JST) Received: from s5.gw.fujitsu.co.jp (localhost.localdomain [127.0.0.1]) by s5.gw.fujitsu.co.jp (Postfix) with ESMTP id 7DD751DB8038 for ; Mon, 14 Jun 2010 20:08:22 +0900 (JST) Received: from m105.s.css.fujitsu.com (m105.s.css.fujitsu.com [10.249.87.105]) by s5.gw.fujitsu.co.jp (Postfix) with ESMTP id 162C21DB803F for ; Mon, 14 Jun 
2010 20:08:22 +0900 (JST)
From: KOSAKI Motohiro
Subject: Re: [patch -mm 02/18] oom: sacrifice child with highest badness score for parent
In-Reply-To:
References: <20100613184150.617E.A69D9226@jp.fujitsu.com>
Message-Id: <20100614194045.9DAB.A69D9226@jp.fujitsu.com>
MIME-Version: 1.0
Content-Type: text/plain; charset="ISO-2022-JP"
Content-Transfer-Encoding: 7bit
Date: Mon, 14 Jun 2010 20:08:21 +0900 (JST)
Sender: owner-linux-mm@kvack.org
To: David Rientjes
Cc: kosaki.motohiro@jp.fujitsu.com, Andrew Morton , Rik van Riel , Nick Piggin , Oleg Nesterov , KAMEZAWA Hiroyuki , Balbir Singh , linux-mm@kvack.org
List-ID:

> > > > It mean we shouldn't assume parent and child have the same mems_allowed,
> > > > perhaps.
> > > >
> > >
> > > I'd be happy to have that in oom_kill_process() if you pass the
> > > enum oom_constraint and only do it for CONSTRAINT_CPUSET.  Please add a
> > > followup patch to my latest patch series.
> >
> > Please clarify.
> > Why do we need CONSTRAINT_CPUSET filter?
> >
>
> Because we don't care about intersecting mems_allowed unless it's a cpuset
> constrained oom.

OK, I see your point. My version has the following hunk. I think a simple nodemask != NULL check is cleaner.

====================================================
void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
                int order, nodemask_t *nodemask)
{
(snip)
        if (constraint != CONSTRAINT_MEMORY_POLICY)
                nodemask = NULL;
(snip)
        read_lock(&tasklist_lock);
        __out_of_memory(gfp_mask, order, nodemask);
        read_unlock(&tasklist_lock);
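What the quoted hunk implies for later checks can be sketched as follows. This is a userspace model under stated assumptions: the `model_` names and the `unsigned long` nodemask type are hypothetical, and the enum only borrows the constraint names used in this thread (its numeric values are arbitrary here).

```c
#include <assert.h>
#include <stddef.h>

/* Constraint names as used in the thread; values here are arbitrary. */
enum model_oom_constraint {
	CONSTRAINT_NONE,
	CONSTRAINT_CPUSET,
	CONSTRAINT_MEMORY_POLICY,
};

/* Hypothetical stand-in for nodemask_t. */
typedef unsigned long model_nodemask_t;

/* Mirrors the quoted hunk: only a mempolicy oom keeps its nodemask. */
static const model_nodemask_t *
model_effective_nodemask(enum model_oom_constraint constraint,
			 const model_nodemask_t *nodemask)
{
	assert(constraint <= CONSTRAINT_MEMORY_POLICY);

	if (constraint != CONSTRAINT_MEMORY_POLICY)
		return NULL;
	return nodemask;
}

/* After normalization, a non-NULL mask implies a mempolicy-constrained oom. */
static int model_is_mempolicy_oom(const model_nodemask_t *nodemask)
{
	return nodemask != NULL;
}
```

Note that in this model a CONSTRAINT_CPUSET oom also ends up with a NULL nodemask, so a check keyed on the nodemask alone cannot tell a cpuset-constrained oom from an unconstrained one.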
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail138.messagelabs.com (mail138.messagelabs.com [216.82.249.35]) by kanga.kvack.org (Postfix) with ESMTP id 80ADC6B01D0 for ; Wed, 16 Jun 2010 23:28:23 -0400 (EDT) Received: from wpaz37.hot.corp.google.com (wpaz37.hot.corp.google.com [172.24.198.101]) by smtp-out.google.com with ESMTP id o5H3SJjt012752 for ; Wed, 16 Jun 2010 20:28:20 -0700 Received: from pva18 (pva18.prod.google.com [10.241.209.18]) by wpaz37.hot.corp.google.com with ESMTP id o5H3SITm026945 for ; Wed, 16 Jun 2010 20:28:18 -0700 Received: by pva18 with SMTP id 18so90124pva.32 for ; Wed, 16 Jun 2010 20:28:18 -0700 (PDT) Date: Wed, 16 Jun 2010 20:28:13 -0700 (PDT) From: David Rientjes Subject: Re: [patch -mm 08/18] oom: badness heuristic rewrite In-Reply-To: <20100608164722.9724baf9.akpm@linux-foundation.org> Message-ID: References: <20100604195328.72D9.A69D9226@jp.fujitsu.com> <20100608172820.7645.A69D9226@jp.fujitsu.com> <20100608164722.9724baf9.akpm@linux-foundation.org> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-linux-mm@kvack.org To: Andrew Morton Cc: KOSAKI Motohiro , Rik van Riel , Nick Piggin , Oleg Nesterov , KAMEZAWA Hiroyuki , Balbir Singh , linux-mm@kvack.org List-ID: On Tue, 8 Jun 2010, Andrew Morton wrote: > > of the patch don't concentrate one thing. 2) That is strongly concentrate > > "what and how to implement". But reviewers don't want such imformation so much > > because they can read C language. reviewers need following information. > > - background > > - why do the author choose this way? > > - why do the author choose this default value? > > - how to confirm your concept and implementation correct? > > - etc etc > > > > thus, reviewers can trace the author thinking and makes good advise and judgement. 
> > example in this case, you wrote
> >   - default threshold is 1000
> >   - only accumulate 1st generation execve children
> >   - time threshold is a second
> >
> > but not wrote why? mess sentence hide such lack of document. then, I usually enforce
> > a divide, because a divide naturally reduce to "which place change" document and
> > expose what lacking.
> >
> > Now I haven't get your intention. no test suite accelerate to can't get
> > author think which workload is a problem workload.
>
> hey, you're starting to sound like me.
>

I can certainly elaborate on the forkbomb detector's patch description, but it would be helpful if people would bring this up as their concern rather than obfuscating it with a bunch of "nack"s and guessing. I had _thought_ that the intent was quite clear in the comments that the patch added:

/*
 * Tasks that fork a very large number of children with separate address spaces
 * may be the result of a bug, user error, malicious applications, or even those
 * with a very legitimate purpose such as a webserver.  The oom killer assesses
 * a penalty equaling
 *
 *	(average rss of children) * (# of 1st generation execve children)
 *	-----------------------------------------------------------------
 *			sysctl_oom_forkbomb_thres
 *
 * for such tasks to target the parent.  oom_kill_process() will attempt to
 * first kill a child, so there's no risk of killing an important system daemon
 * via this method.  A web server, for example, may fork a very large number of
 * threads to respond to client connections; it's much better to kill a child
 * than to kill the parent, making the server unresponsive.  The goal here is
 * to give the user a chance to recover from the error rather than deplete all
 * memory such that the system is unusable, it's not meant to effect a forkbomb
 * policy.
 */

I didn't think it had to be duplicated in the changelog. I'll do that.

> I think I'm beginning to understand your concerns with these patches.
> Finally.
>
> Yes, it's a familiar one.
I do fairly commonly see patches where the > description can be summarised as "change lots and lots of stuff to no > apparent end" and one does have to push and poke to squeeze out the > thinking and the reasons. It's a useful exercise and will sometimes > cause the originator to have a rethink, and sometimes reveals that it > just wasn't a good change. > Show me where I have a single undocumented change in the forkbomb detector patch, please. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail203.messagelabs.com (mail203.messagelabs.com [216.82.254.243]) by kanga.kvack.org (Postfix) with ESMTP id 6EE296B01D0 for ; Wed, 16 Jun 2010 23:33:39 -0400 (EDT) Received: from hpaq2.eem.corp.google.com (hpaq2.eem.corp.google.com [172.25.149.2]) by smtp-out.google.com with ESMTP id o5H3XXMj031030 for ; Wed, 16 Jun 2010 20:33:34 -0700 Received: from pxi18 (pxi18.prod.google.com [10.243.27.18]) by hpaq2.eem.corp.google.com with ESMTP id o5H3XVgh029132 for ; Wed, 16 Jun 2010 20:33:32 -0700 Received: by pxi18 with SMTP id 18so1070844pxi.26 for ; Wed, 16 Jun 2010 20:33:31 -0700 (PDT) Date: Wed, 16 Jun 2010 20:33:28 -0700 (PDT) From: David Rientjes Subject: Re: [patch -mm 01/18] oom: filter tasks not sharing the same cpuset In-Reply-To: <20100613180405.6178.A69D9226@jp.fujitsu.com> Message-ID: References: <20100606170713.8718.A69D9226@jp.fujitsu.com> <20100613180405.6178.A69D9226@jp.fujitsu.com> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-linux-mm@kvack.org To: KOSAKI Motohiro Cc: Andrew Morton , Rik van Riel , Nick Piggin , Oleg Nesterov , KAMEZAWA Hiroyuki , Balbir Singh , linux-mm@kvack.org List-ID: On Sun, 13 Jun 2010, KOSAKI Motohiro wrote: > I have no objection because it's policy matter. 
but if so, dump_tasks() > should display mem_allowed mask too, probably. You could, but we'd want to do that all under cpuset_buffer_lock so we don't have to allocate it on the stack, which can be particularly lengthy when the page allocator is called. > > > (3) oom_kill_task (when oom_kill_allocating_task==1 only) > > > > > > > Why would care about cpuset attachment in oom_kill_task()? You mean > > oom_kill_process() to filter the children list? > > Ah, intersting question. OK, we have to discuss oom_kill_allocating_task > design at first. > > First of All, oom_kill_process() to filter the children list and this issue > are independent and unrelated. My patch was not correct too. > > Now, oom_kill_allocating_task basic logic is here. It mean, if oom_kill_process() > return 0, oom kill finished successfully. but if oom_kill_process() return 1, > fallback to normall __out_of_memory(). > Right. > > =================================================== > static void __out_of_memory(gfp_t gfp_mask, int order, nodemask_t *nodemask) > { > struct task_struct *p; > unsigned long points; > > if (sysctl_oom_kill_allocating_task) > if (!oom_kill_process(current, gfp_mask, order, 0, NULL, nodemask, > "Out of memory (oom_kill_allocating_task)")) > return; > retry: > > When oom_kill_process() return 1? > I think It should be > - current is OOM_DISABLE In this case, oom_kill_task() returns 1, which causes oom_kill_process() to return 1 if current (and not one of its children) is actually selected to die. > - current have no intersected CPUSET current will always intersect its own cpuset's mems. > - current is KTHREAD find_lock_task_mm() should take care of that in oom_kill_task() just like it does for OOM_DISABLE, although we can still race with use_mm(), in which case this would be a good chance. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
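The return-value convention discussed above (0 means a task was killed, 1 means fall back to the normal selection) can be modelled in a few lines of userspace C. This is a deliberately simplified sketch: all `model_` names are hypothetical, the child-sacrifice logic of oom_kill_process() is left out, scores are supplied by hand, and OOM_DISABLE is assumed to be -17 as in include/linux/oom.h of that era.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumed to match include/linux/oom.h at the time of this thread. */
#define OOM_DISABLE	(-17)

/* Hypothetical task model; not a kernel structure. */
struct model_task {
	int pid;
	int oom_adj;
	unsigned long points;
	bool killed;
};

/* Returns 0 if the task was killed, 1 if it is exempt (OOM_DISABLE). */
static int model_oom_kill_task(struct model_task *t)
{
	if (t->oom_adj == OOM_DISABLE)
		return 1;
	t->killed = true;
	return 0;
}

/*
 * Model of the oom_kill_allocating_task fast path with fallback.
 * Returns the task that was killed, or NULL if nothing was killable.
 */
static struct model_task *
model_out_of_memory(struct model_task *tasks, size_t n,
		    struct model_task *current_task,
		    bool kill_allocating_task)
{
	struct model_task *victim = NULL;
	size_t i;

	assert(tasks != NULL && current_task != NULL);

	/* Fast path: try the allocating task first; 0 means it was killed. */
	if (kill_allocating_task && !model_oom_kill_task(current_task))
		return current_task;

	/* Fallback: highest-scoring task that is not exempt. */
	for (i = 0; i < n; i++) {
		if (tasks[i].oom_adj == OOM_DISABLE)
			continue;
		if (!victim || tasks[i].points > victim->points)
			victim = &tasks[i];
	}
	if (victim)
		model_oom_kill_task(victim);
	return victim;
}
```

When the allocating task is OOM_DISABLE, the fast path reports 1 and the model falls through to the scan, which is the fallback behaviour both sides agree on in this exchange.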
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail138.messagelabs.com (mail138.messagelabs.com [216.82.249.35]) by kanga.kvack.org (Postfix) with SMTP id 89BD36B01AC for ; Mon, 21 Jun 2010 07:45:50 -0400 (EDT) Received: from m1.gw.fujitsu.co.jp ([10.0.50.71]) by fgwmail6.fujitsu.co.jp (Fujitsu Gateway) with ESMTP id o5LBjm0n003955 for (envelope-from kosaki.motohiro@jp.fujitsu.com); Mon, 21 Jun 2010 20:45:48 +0900 Received: from smail (m1 [127.0.0.1]) by outgoing.m1.gw.fujitsu.co.jp (Postfix) with ESMTP id E948B45DE4D for ; Mon, 21 Jun 2010 20:45:47 +0900 (JST) Received: from s1.gw.fujitsu.co.jp (s1.gw.fujitsu.co.jp [10.0.50.91]) by m1.gw.fujitsu.co.jp (Postfix) with ESMTP id A7FA845DE4F for ; Mon, 21 Jun 2010 20:45:47 +0900 (JST) Received: from s1.gw.fujitsu.co.jp (localhost.localdomain [127.0.0.1]) by s1.gw.fujitsu.co.jp (Postfix) with ESMTP id 689F71DB804F for ; Mon, 21 Jun 2010 20:45:47 +0900 (JST) Received: from m108.s.css.fujitsu.com (m108.s.css.fujitsu.com [10.249.87.108]) by s1.gw.fujitsu.co.jp (Postfix) with ESMTP id 122B11DB8046 for ; Mon, 21 Jun 2010 20:45:47 +0900 (JST) From: KOSAKI Motohiro Subject: Re: [patch -mm 01/18] oom: filter tasks not sharing the same cpuset In-Reply-To: References: <20100613180405.6178.A69D9226@jp.fujitsu.com> Message-Id: <20100617150450.FBBC.A69D9226@jp.fujitsu.com> MIME-Version: 1.0 Content-Type: text/plain; charset="ISO-2022-JP" Content-Transfer-Encoding: 7bit Date: Mon, 21 Jun 2010 20:45:46 +0900 (JST) Sender: owner-linux-mm@kvack.org To: David Rientjes Cc: kosaki.motohiro@jp.fujitsu.com, Andrew Morton , Rik van Riel , Nick Piggin , Oleg Nesterov , KAMEZAWA Hiroyuki , Balbir Singh , linux-mm@kvack.org List-ID: > > > > (3) oom_kill_task (when oom_kill_allocating_task==1 only) > > > > > > > > > > Why would care about cpuset attachment in oom_kill_task()? You mean > > > oom_kill_process() to filter the children list? > > > > Ah, intersting question. 
OK, we have to discuss oom_kill_allocating_task
> > design at first.
> >
> > First of All, oom_kill_process() to filter the children list and this issue
> > are independent and unrelated. My patch was not correct too.
> >
> > Now, oom_kill_allocating_task basic logic is here. It mean, if oom_kill_process()
> > return 0, oom kill finished successfully. but if oom_kill_process() return 1,
> > fallback to normal __out_of_memory().
> >
>
> Right.
>
> >
> > ===================================================
> > static void __out_of_memory(gfp_t gfp_mask, int order, nodemask_t *nodemask)
> > {
> >         struct task_struct *p;
> >         unsigned long points;
> >
> >         if (sysctl_oom_kill_allocating_task)
> >                 if (!oom_kill_process(current, gfp_mask, order, 0, NULL, nodemask,
> >                                       "Out of memory (oom_kill_allocating_task)"))
> >                         return;
> > retry:
> >
> > When oom_kill_process() return 1?
> > I think It should be
> >   - current is OOM_DISABLE
>
> In this case, oom_kill_task() returns 1, which causes oom_kill_process()
> to return 1 if current (and not one of its children) is actually selected
> to die.

Right.

>
> >   - current have no intersected CPUSET
>
> current will always intersect its own cpuset's mems.

Oops, it was my mistake.

>
> >   - current is KTHREAD
>
> find_lock_task_mm() should take care of that in oom_kill_task() just like
> it does for OOM_DISABLE, although we can still race with use_mm(), in
> which case this would be a good chance.

The find_lock_task_mm() implementation is below. It only checks ->mm. Other places use both a KTHREAD check and find_lock_task_mm().

----------------------------------------------------------------------
/*
 * The process p may have detached its own ->mm while exiting or through
 * use_mm(), but one or more of its subthreads may still have a valid
 * pointer.  Return p, or any of its subthreads with a valid ->mm, with
 * task_lock() held.
 */
static struct task_struct *find_lock_task_mm(struct task_struct *p)
{
        struct task_struct *t = p;

        do {
                task_lock(t);
                if (likely(t->mm))
                        return t;
                task_unlock(t);
        } while_each_thread(p, t);

        return NULL;
}

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail202.messagelabs.com (mail202.messagelabs.com [216.82.254.227]) by kanga.kvack.org (Postfix) with SMTP id CFA036B01AF for ; Mon, 21 Jun 2010 07:45:51 -0400 (EDT)
Received: from m3.gw.fujitsu.co.jp ([10.0.50.73]) by fgwmail6.fujitsu.co.jp (Fujitsu Gateway) with ESMTP id o5LBjnPN003988 for (envelope-from kosaki.motohiro@jp.fujitsu.com); Mon, 21 Jun 2010 20:45:49 +0900
Received: from smail (m3 [127.0.0.1]) by outgoing.m3.gw.fujitsu.co.jp (Postfix) with ESMTP id 8A7C445DE50 for ; Mon, 21 Jun 2010 20:45:49 +0900 (JST)
Received: from s3.gw.fujitsu.co.jp (s3.gw.fujitsu.co.jp [10.0.50.93]) by m3.gw.fujitsu.co.jp (Postfix) with ESMTP id 509A045DE4E for ; Mon, 21 Jun 2010 20:45:49 +0900 (JST)
Received: from s3.gw.fujitsu.co.jp (localhost.localdomain [127.0.0.1]) by s3.gw.fujitsu.co.jp (Postfix) with ESMTP id 36EEE1DB8038 for ; Mon, 21 Jun 2010 20:45:49 +0900 (JST)
Received: from m108.s.css.fujitsu.com (m108.s.css.fujitsu.com [10.249.87.108]) by s3.gw.fujitsu.co.jp (Postfix) with ESMTP id DB0D31DB8037 for ; Mon, 21 Jun 2010 20:45:48 +0900 (JST)
From: KOSAKI Motohiro
Subject: Re: [patch -mm 01/18] oom: filter tasks not sharing the same cpuset
In-Reply-To:
References: <20100613180405.6178.A69D9226@jp.fujitsu.com>
Message-Id: <20100621193224.B530.A69D9226@jp.fujitsu.com>
MIME-Version: 1.0
Content-Type: text/plain; charset="ISO-2022-JP"
Content-Transfer-Encoding: 7bit
Date: Mon, 21 Jun 2010 20:45:48 +0900 (JST)
Sender: owner-linux-mm@kvack.org
To: David Rientjes
Cc: kosaki.motohiro@jp.fujitsu.com, Andrew Morton , Rik van Riel ,
Nick Piggin , Oleg Nesterov , KAMEZAWA Hiroyuki , Balbir Singh , linux-mm@kvack.org
List-ID:

> On Sun, 13 Jun 2010, KOSAKI Motohiro wrote:
>
> > I have no objection because it's policy matter. but if so, dump_tasks()
> > should display mem_allowed mask too, probably.
>
> You could, but we'd want to do that all under cpuset_buffer_lock so we
> don't have to allocate it on the stack, which can be particularly lengthy
> when the page allocator is called.

We probably don't need to worry about that, because the stack overflow risk depends on the deepest call path. That is, if out_of_memory() was called, the page allocator already called try_to_free_pages() first, and try_to_free_pages() has a much deeper stack than out_of_memory().