* [patch -mm 00/18] oom killer rewrite
@ 2010-06-01  7:18 David Rientjes
  2010-06-01  7:18 ` [patch -mm 01/18] oom: filter tasks not sharing the same cpuset David Rientjes
                   ` (17 more replies)
  0 siblings, 18 replies; 99+ messages in thread
From: David Rientjes @ 2010-06-01  7:18 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Rik van Riel, Nick Piggin, Oleg Nesterov, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro, Balbir Singh, linux-mm

This is yet another version of my oom killer rewrite, now rebased to 
mmotm-2010-05-21-16-05.

This version removes the consolidation of the two existing sysctls, 
oom_kill_allocating_task and oom_dump_tasks, as recommended by a couple of
different people.

This version also makes pagefault oom handling consistent with 
panic_on_oom behavior now that all architectures have been converted to 
using the oom killer instead of simply issuing a SIGKILL for current.  
Many thanks to Nick Piggin for converting the existing archs.
---
 Documentation/feature-removal-schedule.txt |   25 +
 Documentation/filesystems/proc.txt         |  100 +++-
 Documentation/sysctl/vm.txt                |   23 +
 fs/proc/base.c                             |  107 ++++-
 include/linux/memcontrol.h                 |    8 
 include/linux/mempolicy.h                  |   13 
 include/linux/oom.h                        |   26 +
 include/linux/sched.h                      |    3 
 kernel/fork.c                              |    1 
 kernel/sysctl.c                            |   12 
 mm/memcontrol.c                            |   18 
 mm/mempolicy.c                             |   44 ++
 mm/oom_kill.c                              |  603 +++++++++++++++--------------
 mm/page_alloc.c                            |   29 -
 14 files changed, 680 insertions(+), 332 deletions(-)

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>


* [patch -mm 01/18] oom: filter tasks not sharing the same cpuset
  2010-06-01  7:18 [patch -mm 00/18] oom killer rewrite David Rientjes
@ 2010-06-01  7:18 ` David Rientjes
  2010-06-01  7:20   ` KOSAKI Motohiro
                     ` (2 more replies)
  2010-06-01  7:18 ` [patch -mm 02/18] oom: sacrifice child with highest badness score for parent David Rientjes
                   ` (16 subsequent siblings)
  17 siblings, 3 replies; 99+ messages in thread
From: David Rientjes @ 2010-06-01  7:18 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Rik van Riel, Nick Piggin, Oleg Nesterov, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro, Balbir Singh, linux-mm

Tasks that do not share the same set of allowed nodes with the task that
triggered the oom should not be considered as candidates for oom kill.

Tasks in other cpusets with a disjoint set of mems would be unfairly
penalized otherwise because of oom conditions elsewhere; an extreme
example could unfairly kill all other applications on the system if a
single task in a user's cpuset sets itself to OOM_DISABLE and then uses
more memory than allowed.

Killing tasks outside of current's cpuset would rarely free memory for
current anyway.  To use a sane heuristic, we must ensure that killing a
task would likely free memory for current and avoid needlessly killing
others at all costs just because their potential memory freeing is
unknown.  It is better to kill current than another task needlessly.
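
The difference between penalizing and filtering can be sketched in
userspace C.  This is only an illustration under stated assumptions:
struct toy_task, the unsigned long node bitmask, and the toy_* helpers
are hypothetical stand-ins, not the kernel's task_struct, nodemask_t, or
badness().

```c
#include <stddef.h>

/* Hypothetical userspace stand-in for a task; not the kernel's task_struct. */
struct toy_task {
	const char *name;
	unsigned long mems_allowed;	/* bit n set: node n is allowed */
	unsigned long points;		/* precomputed badness-like score */
};

/* Nonzero if the two node sets share at least one node. */
static int toy_mems_intersect(unsigned long a, unsigned long b)
{
	return (a & b) != 0;
}

/* Old behaviour: a disjoint task stays a candidate at 1/8 of its score. */
static const struct toy_task *toy_select_penalize(const struct toy_task *tasks,
				size_t n, unsigned long current_mems)
{
	const struct toy_task *chosen = NULL;
	unsigned long best = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		unsigned long points = tasks[i].points;

		if (!toy_mems_intersect(tasks[i].mems_allowed, current_mems))
			points /= 8;
		if (points > best) {
			best = points;
			chosen = &tasks[i];
		}
	}
	return chosen;
}

/* New behaviour: a disjoint task is never considered. */
static const struct toy_task *toy_select_filter(const struct toy_task *tasks,
				size_t n, unsigned long current_mems)
{
	const struct toy_task *chosen = NULL;
	unsigned long best = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		if (!toy_mems_intersect(tasks[i].mems_allowed, current_mems))
			continue;
		if (tasks[i].points > best) {
			best = tasks[i].points;
			chosen = &tasks[i];
		}
	}
	return chosen;
}
```

With one task scoring 8000 on a disjoint node and another scoring 500 on
current's node, the penalizing version still picks the former
(8000 / 8 = 1000), while the filtering version picks the latter.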

Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Nick Piggin <npiggin@suse.de>
Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: David Rientjes <rientjes@google.com>
---
 mm/oom_kill.c |   12 +++---------
 1 files changed, 3 insertions(+), 9 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -36,7 +36,7 @@ static DEFINE_SPINLOCK(zone_scan_lock);
 /* #define DEBUG */
 
 /*
- * Is all threads of the target process nodes overlap ours?
+ * Do all threads of the target process overlap our allowed nodes?
  */
 static int has_intersects_mems_allowed(struct task_struct *tsk)
 {
@@ -168,14 +168,6 @@ unsigned long badness(struct task_struct *p, unsigned long uptime)
 		points /= 4;
 
 	/*
-	 * If p's nodes don't overlap ours, it may still help to kill p
-	 * because p may have allocated or otherwise mapped memory on
-	 * this node before. However it will be less likely.
-	 */
-	if (!has_intersects_mems_allowed(p))
-		points /= 8;
-
-	/*
 	 * Adjust the score by oom_adj.
 	 */
 	if (oom_adj) {
@@ -267,6 +259,8 @@ static struct task_struct *select_bad_process(unsigned long *ppoints,
 			continue;
 		if (mem && !task_in_mem_cgroup(p, mem))
 			continue;
+		if (!has_intersects_mems_allowed(p))
+			continue;
 
 		/*
 		 * This task already has access to memory reserves and is


* [patch -mm 02/18] oom: sacrifice child with highest badness score for parent
  2010-06-01  7:18 [patch -mm 00/18] oom killer rewrite David Rientjes
  2010-06-01  7:18 ` [patch -mm 01/18] oom: filter tasks not sharing the same cpuset David Rientjes
@ 2010-06-01  7:18 ` David Rientjes
  2010-06-01  7:39   ` KOSAKI Motohiro
                     ` (2 more replies)
  2010-06-01  7:18 ` [patch -mm 03/18] oom: select task from tasklist for mempolicy ooms David Rientjes
                   ` (15 subsequent siblings)
  17 siblings, 3 replies; 99+ messages in thread
From: David Rientjes @ 2010-06-01  7:18 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Rik van Riel, Nick Piggin, Oleg Nesterov, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro, Balbir Singh, linux-mm

When a task is chosen for oom kill, the oom killer first attempts to
sacrifice a child not sharing its parent's memory instead.  Unfortunately,
this often kills in a seemingly random fashion based on the ordering of
the selected task's child list.  Additionally, it is not guaranteed at all
to free a large amount of memory that we need to prevent additional oom
killing in the very near future.

Instead, we now only attempt to sacrifice the worst child not sharing its
parent's memory, if one exists.  The worst child is the one with the
highest badness() score.  This has two advantages: we kill a
memory-hogging task more often, and we allow the configurable
/proc/pid/oom_adj value to be considered as a factor in which child to
kill.

Reviewers may observe that the previous implementation iterated through
the children and attempted to kill each one until a kill succeeded,
falling back to the parent if none did, while the new code simply kills
the most memory-hogging child or the parent.  Note, however, that
oom_kill_task() only fails when a child does not have an mm or has a
/proc/pid/oom_adj of OOM_DISABLE.  badness() returns 0 in both cases, so
such a child is never selected and the final oom_kill_task() will always
succeed.
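
As a rough userspace sketch of the selection described above (not the
kernel code): struct toy_task, its mm_id field, and toy_pick_victim() are
hypothetical names, and the badness field stands in for a precomputed
badness() result where 0 means unkillable.

```c
#include <stddef.h>

/* Hypothetical stand-in for a task; not the kernel's task_struct. */
struct toy_task {
	int mm_id;		/* tasks with equal mm_id share memory */
	unsigned long badness;	/* 0 means the task is unkillable */
};

/*
 * Return the child with the highest nonzero score that does not share
 * the parent's memory, or the parent itself if no such child exists.
 */
static const struct toy_task *toy_pick_victim(const struct toy_task *parent,
				const struct toy_task *children, size_t n)
{
	const struct toy_task *victim = parent;
	unsigned long victim_points = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		/* Killing a child that shares the parent's mm frees nothing extra. */
		if (children[i].mm_id == parent->mm_id)
			continue;
		/* A score of 0 never beats the initial victim_points of 0. */
		if (children[i].badness > victim_points) {
			victim = &children[i];
			victim_points = children[i].badness;
		}
	}
	return victim;
}
```

Because the comparison is strict, a child scoring 0 can never replace the
parent as the victim, which is why the fallback needs no retry loop.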

Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Nick Piggin <npiggin@suse.de>
Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: David Rientjes <rientjes@google.com>
---
 mm/oom_kill.c |   23 +++++++++++++++++------
 1 files changed, 17 insertions(+), 6 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -433,7 +433,10 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
 			    unsigned long points, struct mem_cgroup *mem,
 			    const char *message)
 {
+	struct task_struct *victim = p;
 	struct task_struct *c;
+	unsigned long victim_points = 0;
+	struct timespec uptime;
 
 	if (printk_ratelimit())
 		dump_header(p, gfp_mask, order, mem);
@@ -447,19 +450,27 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
 		return 0;
 	}
 
-	printk(KERN_ERR "%s: kill process %d (%s) score %li or a child\n",
-					message, task_pid_nr(p), p->comm, points);
+	pr_err("%s: Kill process %d (%s) with score %lu or sacrifice child\n",
+		message, task_pid_nr(p), p->comm, points);
 
-	/* Try to kill a child first */
+	do_posix_clock_monotonic_gettime(&uptime);
+	/* Try to sacrifice the worst child first */
 	list_for_each_entry(c, &p->children, sibling) {
+		unsigned long cpoints;
+
 		if (c->mm == p->mm)
 			continue;
 		if (mem && !task_in_mem_cgroup(c, mem))
 			continue;
-		if (!oom_kill_task(c))
-			return 0;
+
+		/* badness() returns 0 if the thread is unkillable */
+		cpoints = badness(c, uptime.tv_sec);
+		if (cpoints > victim_points) {
+			victim = c;
+			victim_points = cpoints;
+		}
 	}
-	return oom_kill_task(p);
+	return oom_kill_task(victim);
 }
 
 #ifdef CONFIG_CGROUP_MEM_RES_CTLR


* [patch -mm 03/18] oom: select task from tasklist for mempolicy ooms
  2010-06-01  7:18 [patch -mm 00/18] oom killer rewrite David Rientjes
  2010-06-01  7:18 ` [patch -mm 01/18] oom: filter tasks not sharing the same cpuset David Rientjes
  2010-06-01  7:18 ` [patch -mm 02/18] oom: sacrifice child with highest badness score for parent David Rientjes
@ 2010-06-01  7:18 ` David Rientjes
  2010-06-01  7:39   ` KOSAKI Motohiro
                     ` (2 more replies)
  2010-06-01  7:18 ` [patch -mm 04/18] oom: extract panic helper function David Rientjes
                   ` (14 subsequent siblings)
  17 siblings, 3 replies; 99+ messages in thread
From: David Rientjes @ 2010-06-01  7:18 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Rik van Riel, Nick Piggin, Oleg Nesterov, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro, Balbir Singh, linux-mm

The oom killer presently kills current whenever there is no more memory
free or reclaimable on its mempolicy's nodes.  There is no guarantee that
current is a memory-hogging task or that killing it will free any
substantial amount of memory, however.

In such situations, it is better to scan the tasklist for tasks that are
allowed to allocate on current's set of nodes and kill the task with the
highest badness() score.  This ensures that the most memory-hogging task,
or the one configured by the user with /proc/pid/oom_adj, is always
selected in such scenarios.
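
The eligibility test for a mempolicy-constrained oom can be sketched in
userspace C as follows.  This is a simplified model, assuming a plain
unsigned long bitmask in place of nodemask_t; the toy_* names are
hypothetical, task locking is omitted, and unknown modes are treated as
eligible here rather than triggering BUG().

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical policy modes modelled on preferred, bind and interleave. */
enum toy_mpol_mode {
	TOY_MPOL_PREFERRED,
	TOY_MPOL_BIND,
	TOY_MPOL_INTERLEAVE,
};

struct toy_mempolicy {
	enum toy_mpol_mode mode;
	unsigned long nodes;	/* bit n set: node n is in the policy */
};

/*
 * Could a task with policy @pol have allocated from the nodes in @mask?
 * A NULL mask or a NULL (default) policy always counts as a match.
 */
static bool toy_policy_intersects(const struct toy_mempolicy *pol,
				  const unsigned long *mask)
{
	if (!mask)
		return true;
	if (!pol)
		return true;

	switch (pol->mode) {
	case TOY_MPOL_BIND:
	case TOY_MPOL_INTERLEAVE:
		/* Strict policies: eligible only if the node sets overlap. */
		return (pol->nodes & *mask) != 0;
	case TOY_MPOL_PREFERRED:
	default:
		/* Preferred nodes are a hint; fallback may reach any node. */
		return true;
	}
}
```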

Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: David Rientjes <rientjes@google.com>
---
 include/linux/mempolicy.h |   13 +++++++-
 mm/mempolicy.c            |   44 +++++++++++++++++++++++++
 mm/oom_kill.c             |   77 +++++++++++++++++++++++++++-----------------
 3 files changed, 103 insertions(+), 31 deletions(-)

diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
--- a/include/linux/mempolicy.h
+++ b/include/linux/mempolicy.h
@@ -210,6 +210,8 @@ extern struct zonelist *huge_zonelist(struct vm_area_struct *vma,
 				unsigned long addr, gfp_t gfp_flags,
 				struct mempolicy **mpol, nodemask_t **nodemask);
 extern bool init_nodemask_of_mempolicy(nodemask_t *mask);
+extern bool mempolicy_nodemask_intersects(struct task_struct *tsk,
+				const nodemask_t *mask);
 extern unsigned slab_node(struct mempolicy *policy);
 
 extern enum zone_type policy_zone;
@@ -338,7 +340,16 @@ static inline struct zonelist *huge_zonelist(struct vm_area_struct *vma,
 	return node_zonelist(0, gfp_flags);
 }
 
-static inline bool init_nodemask_of_mempolicy(nodemask_t *m) { return false; }
+static inline bool init_nodemask_of_mempolicy(nodemask_t *m)
+{
+	return false;
+}
+
+static inline bool mempolicy_nodemask_intersects(struct task_struct *tsk,
+			const nodemask_t *mask)
+{
+	return false;
+}
 
 static inline int do_migrate_pages(struct mm_struct *mm,
 			const nodemask_t *from_nodes,
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1712,6 +1712,50 @@ bool init_nodemask_of_mempolicy(nodemask_t *mask)
 }
 #endif
 
+/*
+ * mempolicy_nodemask_intersects
+ *
+ * If tsk's mempolicy is "default" [NULL], return 'true' to indicate default
+ * policy.  Otherwise, check for intersection between mask and the policy
+ * nodemask for 'bind' or 'interleave' policy.  For 'preferred' or 'local'
+ * policy, always return true since it may allocate elsewhere on fallback.
+ *
+ * Takes task_lock(tsk) to prevent freeing of its mempolicy.
+ */
+bool mempolicy_nodemask_intersects(struct task_struct *tsk,
+					const nodemask_t *mask)
+{
+	struct mempolicy *mempolicy;
+	bool ret = true;
+
+	if (!mask)
+		return ret;
+	task_lock(tsk);
+	mempolicy = tsk->mempolicy;
+	if (!mempolicy)
+		goto out;
+
+	switch (mempolicy->mode) {
+	case MPOL_PREFERRED:
+		/*
+		 * MPOL_PREFERRED and MPOL_F_LOCAL are only preferred nodes to
+		 * allocate from, they may fallback to other nodes when oom.
+		 * Thus, it's possible for tsk to have allocated memory from
+		 * nodes in mask.
+		 */
+		break;
+	case MPOL_BIND:
+	case MPOL_INTERLEAVE:
+		ret = nodes_intersects(mempolicy->v.nodes, *mask);
+		break;
+	default:
+		BUG();
+	}
+out:
+	task_unlock(tsk);
+	return ret;
+}
+
 /* Allocate a page in interleaved policy.
    Own path because it needs to do special accounting. */
 static struct page *alloc_page_interleave(gfp_t gfp, unsigned order,
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -27,6 +27,7 @@
 #include <linux/module.h>
 #include <linux/notifier.h>
 #include <linux/memcontrol.h>
+#include <linux/mempolicy.h>
 #include <linux/security.h>
 
 int sysctl_panic_on_oom;
@@ -37,19 +38,35 @@ static DEFINE_SPINLOCK(zone_scan_lock);
 
 /*
  * Do all threads of the target process overlap our allowed nodes?
+ * @tsk: task struct of which task to consider
+ * @mask: nodemask passed to page allocator for mempolicy ooms
  */
-static int has_intersects_mems_allowed(struct task_struct *tsk)
+static bool has_intersects_mems_allowed(struct task_struct *tsk,
+						const nodemask_t *mask)
 {
-	struct task_struct *t;
+	struct task_struct *start = tsk;
 
-	t = tsk;
 	do {
-		if (cpuset_mems_allowed_intersects(current, t))
-			return 1;
-		t = next_thread(t);
-	} while (t != tsk);
-
-	return 0;
+		if (mask) {
+			/*
+			 * If this is a mempolicy constrained oom, tsk's
+			 * cpuset is irrelevant.  Only return true if its
+			 * mempolicy intersects current, otherwise it may be
+			 * needlessly killed.
+			 */
+			if (mempolicy_nodemask_intersects(tsk, mask))
+				return true;
+		} else {
+			/*
+			 * This is not a mempolicy constrained oom, so only
+			 * check the mems of tsk's cpuset.
+			 */
+			if (cpuset_mems_allowed_intersects(current, tsk))
+				return true;
+		}
+		tsk = next_thread(tsk);
+	} while (tsk != start);
+	return false;
 }
 
 /**
@@ -237,7 +254,8 @@ static enum oom_constraint constrained_alloc(struct zonelist *zonelist,
  * (not docbooked, we don't want this one cluttering up the manual)
  */
 static struct task_struct *select_bad_process(unsigned long *ppoints,
-						struct mem_cgroup *mem)
+		struct mem_cgroup *mem, enum oom_constraint constraint,
+		const nodemask_t *mask)
 {
 	struct task_struct *p;
 	struct task_struct *chosen = NULL;
@@ -259,7 +277,9 @@ static struct task_struct *select_bad_process(unsigned long *ppoints,
 			continue;
 		if (mem && !task_in_mem_cgroup(p, mem))
 			continue;
-		if (!has_intersects_mems_allowed(p))
+		if (!has_intersects_mems_allowed(p,
+				constraint == CONSTRAINT_MEMORY_POLICY ? mask :
+									 NULL))
 			continue;
 
 		/*
@@ -483,7 +503,7 @@ void mem_cgroup_out_of_memory(struct mem_cgroup *mem, gfp_t gfp_mask)
 		panic("out of memory(memcg). panic_on_oom is selected.\n");
 	read_lock(&tasklist_lock);
 retry:
-	p = select_bad_process(&points, mem);
+	p = select_bad_process(&points, mem, CONSTRAINT_NONE, NULL);
 	if (!p || PTR_ERR(p) == -1UL)
 		goto out;
 
@@ -562,7 +582,8 @@ void clear_zonelist_oom(struct zonelist *zonelist, gfp_t gfp_mask)
 /*
  * Must be called with tasklist_lock held for read.
  */
-static void __out_of_memory(gfp_t gfp_mask, int order)
+static void __out_of_memory(gfp_t gfp_mask, int order,
+			enum oom_constraint constraint, const nodemask_t *mask)
 {
 	struct task_struct *p;
 	unsigned long points;
@@ -576,7 +597,7 @@ retry:
 	 * Rambo mode: Shoot down a process and hope it solves whatever
 	 * issues we may have.
 	 */
-	p = select_bad_process(&points, NULL);
+	p = select_bad_process(&points, NULL, constraint, mask);
 
 	if (PTR_ERR(p) == -1UL)
 		return;
@@ -610,7 +631,8 @@ void pagefault_out_of_memory(void)
 		panic("out of memory from page fault. panic_on_oom is selected.\n");
 
 	read_lock(&tasklist_lock);
-	__out_of_memory(0, 0); /* unknown gfp_mask and order */
+	/* unknown gfp_mask and order */
+	__out_of_memory(0, 0, CONSTRAINT_NONE, NULL);
 	read_unlock(&tasklist_lock);
 
 	/*
@@ -626,6 +648,7 @@ void pagefault_out_of_memory(void)
  * @zonelist: zonelist pointer
  * @gfp_mask: memory allocation flags
  * @order: amount of memory being requested as a power of 2
+ * @nodemask: nodemask passed to page allocator
  *
  * If we run out of memory, we have the choice between either
  * killing a random task (bad), letting the system crash (worse)
@@ -654,24 +677,18 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
 	 */
 	constraint = constrained_alloc(zonelist, gfp_mask, nodemask);
 	read_lock(&tasklist_lock);
-
-	switch (constraint) {
-	case CONSTRAINT_MEMORY_POLICY:
-		oom_kill_process(current, gfp_mask, order, 0, NULL,
-				"No available memory (MPOL_BIND)");
-		break;
-
-	case CONSTRAINT_NONE:
-		if (sysctl_panic_on_oom) {
+	if (unlikely(sysctl_panic_on_oom)) {
+		/*
+		 * panic_on_oom only affects CONSTRAINT_NONE, the kernel
+		 * should not panic for cpuset or mempolicy induced memory
+		 * failures.
+		 */
+		if (constraint == CONSTRAINT_NONE) {
 			dump_header(NULL, gfp_mask, order, NULL);
-			panic("out of memory. panic_on_oom is selected\n");
+			panic("Out of memory: panic_on_oom is enabled\n");
 		}
-		/* Fall-through */
-	case CONSTRAINT_CPUSET:
-		__out_of_memory(gfp_mask, order);
-		break;
 	}
-
+	__out_of_memory(gfp_mask, order, constraint, nodemask);
 	read_unlock(&tasklist_lock);
 
 	/*


* [patch -mm 04/18] oom: extract panic helper function
  2010-06-01  7:18 [patch -mm 00/18] oom killer rewrite David Rientjes
                   ` (2 preceding siblings ...)
  2010-06-01  7:18 ` [patch -mm 03/18] oom: select task from tasklist for mempolicy ooms David Rientjes
@ 2010-06-01  7:18 ` David Rientjes
  2010-06-01  7:33   ` KOSAKI Motohiro
  2010-06-01  7:18 ` [patch -mm 05/18] oom: remove special handling for pagefault ooms David Rientjes
                   ` (13 subsequent siblings)
  17 siblings, 1 reply; 99+ messages in thread
From: David Rientjes @ 2010-06-01  7:18 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Rik van Riel, Nick Piggin, Oleg Nesterov, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro, Balbir Singh, linux-mm

There are various points in the oom killer where the kernel must
determine whether to panic or not.  It's better to extract this to a
helper function to remove all the confusion as to its semantics.

There's no functional change with this patch.
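
The decision the helper centralizes can be written as a pure predicate.
The following is a userspace sketch only: toy_should_panic() and the
toy_constraint enum are hypothetical names, and the real helper also
dumps state and calls panic() rather than returning a value.

```c
#include <stdbool.h>

/* Hypothetical mirror of the constraint kinds used by the oom killer. */
enum toy_constraint {
	TOY_CONSTRAINT_NONE,
	TOY_CONSTRAINT_CPUSET,
	TOY_CONSTRAINT_MEMORY_POLICY,
	TOY_CONSTRAINT_MEMCG,
};

/*
 * panic_on_oom == 0: never panic.
 * panic_on_oom == 2: always panic, whatever the constraint.
 * any other value:   panic only for unconstrained (system-wide) ooms.
 */
static bool toy_should_panic(int panic_on_oom, enum toy_constraint constraint)
{
	if (!panic_on_oom)
		return false;
	if (panic_on_oom != 2 && constraint != TOY_CONSTRAINT_NONE)
		return false;
	return true;
}
```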

Signed-off-by: David Rientjes <rientjes@google.com>
---
 include/linux/oom.h |    1 +
 mm/oom_kill.c       |   50 ++++++++++++++++++++++++++++----------------------
 2 files changed, 29 insertions(+), 22 deletions(-)

diff --git a/include/linux/oom.h b/include/linux/oom.h
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -22,6 +22,7 @@ enum oom_constraint {
 	CONSTRAINT_NONE,
 	CONSTRAINT_CPUSET,
 	CONSTRAINT_MEMORY_POLICY,
+	CONSTRAINT_MEMCG,
 };
 
 extern int try_set_zone_oom(struct zonelist *zonelist, gfp_t gfp_flags);
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -493,17 +493,40 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
 	return oom_kill_task(victim);
 }
 
+/*
+ * Determines whether the kernel must panic because of the panic_on_oom sysctl.
+ */
+static void check_panic_on_oom(enum oom_constraint constraint, gfp_t gfp_mask,
+				int order)
+{
+	if (likely(!sysctl_panic_on_oom))
+		return;
+	if (sysctl_panic_on_oom != 2) {
+		/*
+		 * panic_on_oom == 1 only affects CONSTRAINT_NONE, the kernel
+		 * does not panic for cpuset, mempolicy, or memcg allocation
+		 * failures.
+		 */
+		if (constraint != CONSTRAINT_NONE)
+			return;
+	}
+	read_lock(&tasklist_lock);
+	dump_header(NULL, gfp_mask, order, NULL);
+	read_unlock(&tasklist_lock);
+	panic("Out of memory: %s panic_on_oom is enabled\n",
+		sysctl_panic_on_oom == 2 ? "compulsory" : "system-wide");
+}
+
 #ifdef CONFIG_CGROUP_MEM_RES_CTLR
 void mem_cgroup_out_of_memory(struct mem_cgroup *mem, gfp_t gfp_mask)
 {
 	unsigned long points = 0;
 	struct task_struct *p;
 
-	if (sysctl_panic_on_oom == 2)
-		panic("out of memory(memcg). panic_on_oom is selected.\n");
+	check_panic_on_oom(CONSTRAINT_MEMCG, gfp_mask, 0);
 	read_lock(&tasklist_lock);
 retry:
-	p = select_bad_process(&points, mem, CONSTRAINT_NONE, NULL);
+	p = select_bad_process(&points, mem, CONSTRAINT_MEMCG, NULL);
 	if (!p || PTR_ERR(p) == -1UL)
 		goto out;
 
@@ -627,9 +650,7 @@ void pagefault_out_of_memory(void)
 		/* Got some memory back in the last second. */
 		return;
 
-	if (sysctl_panic_on_oom)
-		panic("out of memory from page fault. panic_on_oom is selected.\n");
-
+	check_panic_on_oom(CONSTRAINT_NONE, 0, 0);
 	read_lock(&tasklist_lock);
 	/* unknown gfp_mask and order */
 	__out_of_memory(0, 0, CONSTRAINT_NONE, NULL);
@@ -666,28 +687,13 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
 		/* Got some memory back in the last second. */
 		return;
 
-	if (sysctl_panic_on_oom == 2) {
-		dump_header(NULL, gfp_mask, order, NULL);
-		panic("out of memory. Compulsory panic_on_oom is selected.\n");
-	}
-
 	/*
 	 * Check if there were limitations on the allocation (only relevant for
 	 * NUMA) that may require different handling.
 	 */
 	constraint = constrained_alloc(zonelist, gfp_mask, nodemask);
+	check_panic_on_oom(constraint, gfp_mask, order);
 	read_lock(&tasklist_lock);
-	if (unlikely(sysctl_panic_on_oom)) {
-		/*
-		 * panic_on_oom only affects CONSTRAINT_NONE, the kernel
-		 * should not panic for cpuset or mempolicy induced memory
-		 * failures.
-		 */
-		if (constraint == CONSTRAINT_NONE) {
-			dump_header(NULL, gfp_mask, order, NULL);
-			panic("Out of memory: panic_on_oom is enabled\n");
-		}
-	}
 	__out_of_memory(gfp_mask, order, constraint, nodemask);
 	read_unlock(&tasklist_lock);
 


* [patch -mm 05/18] oom: remove special handling for pagefault ooms
  2010-06-01  7:18 [patch -mm 00/18] oom killer rewrite David Rientjes
                   ` (3 preceding siblings ...)
  2010-06-01  7:18 ` [patch -mm 04/18] oom: extract panic helper function David Rientjes
@ 2010-06-01  7:18 ` David Rientjes
  2010-06-01  7:34   ` KOSAKI Motohiro
  2010-06-01  7:18 ` [patch -mm 06/18] oom: move sysctl declarations to oom.h David Rientjes
                   ` (12 subsequent siblings)
  17 siblings, 1 reply; 99+ messages in thread
From: David Rientjes @ 2010-06-01  7:18 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Rik van Riel, Nick Piggin, Oleg Nesterov, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro, Balbir Singh, linux-mm

It is possible to remove the special pagefault oom handler by simply oom
locking all system zones and then calling directly into out_of_memory().

The handler must be able to set ZONE_OOM_LOCKED on all populated zones;
if any zone is already locked, a parallel oom killing is in progress that
will lead to eventual memory freeing, so there is no need to kill another
task.  The context in which the pagefault is allocating memory is unknown
to the oom killer, so this is done on a system-wide level.

If a task has already been oom killed and hasn't fully exited yet, this
will be a no-op since select_bad_process() recognizes tasks across the
system with TIF_MEMDIE set.
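
The all-or-nothing zone locking can be modelled in userspace C roughly as
below.  This is a single-threaded sketch under stated assumptions: struct
toy_zone and the toy_* functions are hypothetical, and the spinlock that
serializes the scan in the kernel is left out.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a populated zone and its oom-lock flag. */
struct toy_zone {
	bool oom_locked;
};

/*
 * Lock every zone, or none of them.  Returns 0 without changing any zone
 * if one is already locked (a parallel oom kill is in progress),
 * otherwise locks all zones and returns 1.
 */
static int toy_try_set_system_oom(struct toy_zone *zones, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (zones[i].oom_locked)
			return 0;
	for (i = 0; i < n; i++)
		zones[i].oom_locked = true;
	return 1;
}

/* Unlock every zone so that later failures may invoke the oom killer again. */
static void toy_clear_system_oom(struct toy_zone *zones, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++)
		zones[i].oom_locked = false;
}
```

Checking all zones before setting any flag means a failed attempt leaves
no partial state behind to undo.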

Acked-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: David Rientjes <rientjes@google.com>
---
 mm/oom_kill.c |   86 +++++++++++++++++++++++++++++++++++++-------------------
 1 files changed, 57 insertions(+), 29 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -603,6 +603,44 @@ void clear_zonelist_oom(struct zonelist *zonelist, gfp_t gfp_mask)
 }
 
 /*
+ * Try to acquire the oom killer lock for all system zones.  Returns zero if a
+ * parallel oom killing is taking place, otherwise locks all zones and returns
+ * non-zero.
+ */
+static int try_set_system_oom(void)
+{
+	struct zone *zone;
+	int ret = 1;
+
+	spin_lock(&zone_scan_lock);
+	for_each_populated_zone(zone)
+		if (zone_is_oom_locked(zone)) {
+			ret = 0;
+			goto out;
+		}
+	for_each_populated_zone(zone)
+		zone_set_flag(zone, ZONE_OOM_LOCKED);
+out:
+	spin_unlock(&zone_scan_lock);
+	return ret;
+}
+
+/*
+ * Clears ZONE_OOM_LOCKED for all system zones so that failed allocation
+ * attempts or page faults may now recall the oom killer, if necessary.
+ */
+static void clear_system_oom(void)
+{
+	struct zone *zone;
+
+	spin_lock(&zone_scan_lock);
+	for_each_populated_zone(zone)
+		zone_clear_flag(zone, ZONE_OOM_LOCKED);
+	spin_unlock(&zone_scan_lock);
+}
+
+
+/*
  * Must be called with tasklist_lock held for read.
  */
 static void __out_of_memory(gfp_t gfp_mask, int order,
@@ -637,33 +675,6 @@ retry:
 		goto retry;
 }
 
-/*
- * pagefault handler calls into here because it is out of memory but
- * doesn't know exactly how or why.
- */
-void pagefault_out_of_memory(void)
-{
-	unsigned long freed = 0;
-
-	blocking_notifier_call_chain(&oom_notify_list, 0, &freed);
-	if (freed > 0)
-		/* Got some memory back in the last second. */
-		return;
-
-	check_panic_on_oom(CONSTRAINT_NONE, 0, 0);
-	read_lock(&tasklist_lock);
-	/* unknown gfp_mask and order */
-	__out_of_memory(0, 0, CONSTRAINT_NONE, NULL);
-	read_unlock(&tasklist_lock);
-
-	/*
-	 * Give "p" a good chance of killing itself before we
-	 * retry to allocate memory.
-	 */
-	if (!test_thread_flag(TIF_MEMDIE))
-		schedule_timeout_uninterruptible(1);
-}
-
 /**
  * out_of_memory - kill the "best" process when we run out of memory
  * @zonelist: zonelist pointer
@@ -680,7 +691,7 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
 		int order, nodemask_t *nodemask)
 {
 	unsigned long freed = 0;
-	enum oom_constraint constraint;
+	enum oom_constraint constraint = CONSTRAINT_NONE;
 
 	blocking_notifier_call_chain(&oom_notify_list, 0, &freed);
 	if (freed > 0)
@@ -691,7 +702,8 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
 	 * Check if there were limitations on the allocation (only relevant for
 	 * NUMA) that may require different handling.
 	 */
-	constraint = constrained_alloc(zonelist, gfp_mask, nodemask);
+	if (zonelist)
+		constraint = constrained_alloc(zonelist, gfp_mask, nodemask);
 	check_panic_on_oom(constraint, gfp_mask, order);
 	read_lock(&tasklist_lock);
 	__out_of_memory(gfp_mask, order, constraint, nodemask);
@@ -704,3 +716,19 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
 	if (!test_thread_flag(TIF_MEMDIE))
 		schedule_timeout_uninterruptible(1);
 }
+
+/*
+ * The pagefault handler calls here because it is out of memory, so kill a
+ * memory-hogging task.  If a populated zone has ZONE_OOM_LOCKED set, a parallel
+ * oom killing is already in progress so do nothing.  If a task is found with
+ * TIF_MEMDIE set, it has been killed so do nothing and allow it to exit.
+ */
+void pagefault_out_of_memory(void)
+{
+	if (try_set_system_oom()) {
+		out_of_memory(NULL, 0, 0, NULL);
+		clear_system_oom();
+	}
+	if (!test_thread_flag(TIF_MEMDIE))
+		schedule_timeout_uninterruptible(1);
+}


* [patch -mm 06/18] oom: move sysctl declarations to oom.h
  2010-06-01  7:18 [patch -mm 00/18] oom killer rewrite David Rientjes
                   ` (4 preceding siblings ...)
  2010-06-01  7:18 ` [patch -mm 05/18] oom: remove special handling for pagefault ooms David Rientjes
@ 2010-06-01  7:18 ` David Rientjes
  2010-06-01  7:34   ` KOSAKI Motohiro
  2010-06-01  7:18 ` [patch -mm 07/18] oom: enable oom tasklist dump by default David Rientjes
                   ` (11 subsequent siblings)
  17 siblings, 1 reply; 99+ messages in thread
From: David Rientjes @ 2010-06-01  7:18 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Rik van Riel, Nick Piggin, Oleg Nesterov, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro, Balbir Singh, linux-mm

The three oom killer sysctl variables (sysctl_oom_dump_tasks,
sysctl_oom_kill_allocating_task, and sysctl_panic_on_oom) are better
declared in include/linux/oom.h rather than kernel/sysctl.c.

Signed-off-by: David Rientjes <rientjes@google.com>
---
 include/linux/oom.h |    5 +++++
 kernel/sysctl.c     |    4 +---
 2 files changed, 6 insertions(+), 3 deletions(-)

diff --git a/include/linux/oom.h b/include/linux/oom.h
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -44,5 +44,10 @@ static inline void oom_killer_enable(void)
 {
 	oom_killer_disabled = false;
 }
+
+/* sysctls */
+extern int sysctl_oom_dump_tasks;
+extern int sysctl_oom_kill_allocating_task;
+extern int sysctl_panic_on_oom;
 #endif /* __KERNEL__*/
 #endif /* _INCLUDE_LINUX_OOM_H */
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -55,6 +55,7 @@
 #include <linux/perf_event.h>
 #include <linux/kprobes.h>
 #include <linux/pipe_fs_i.h>
+#include <linux/oom.h>
 
 #include <asm/uaccess.h>
 #include <asm/processor.h>
@@ -87,9 +88,6 @@
 /* External variables not in a header file. */
 extern int sysctl_overcommit_memory;
 extern int sysctl_overcommit_ratio;
-extern int sysctl_panic_on_oom;
-extern int sysctl_oom_kill_allocating_task;
-extern int sysctl_oom_dump_tasks;
 extern int max_threads;
 extern int core_uses_pid;
 extern int suid_dumpable;


* [patch -mm 07/18] oom: enable oom tasklist dump by default
  2010-06-01  7:18 [patch -mm 00/18] oom killer rewrite David Rientjes
                   ` (5 preceding siblings ...)
  2010-06-01  7:18 ` [patch -mm 06/18] oom: move sysctl declarations to oom.h David Rientjes
@ 2010-06-01  7:18 ` David Rientjes
  2010-06-01  7:36   ` KOSAKI Motohiro
  2010-06-01  7:18 ` [patch -mm 08/18] oom: badness heuristic rewrite David Rientjes
                   ` (10 subsequent siblings)
  17 siblings, 1 reply; 99+ messages in thread
From: David Rientjes @ 2010-06-01  7:18 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Rik van Riel, Nick Piggin, Oleg Nesterov, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro, Balbir Singh, linux-mm

The oom killer tasklist dump, enabled with the oom_dump_tasks sysctl, is
very helpful in diagnosing why a user's task has been killed.  It emits
information such as each eligible thread's memory usage, which can help
determine why the system is oom, so it should be enabled by default.

Signed-off-by: David Rientjes <rientjes@google.com>
---
 Documentation/sysctl/vm.txt |    2 +-
 mm/oom_kill.c               |    2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
--- a/Documentation/sysctl/vm.txt
+++ b/Documentation/sysctl/vm.txt
@@ -511,7 +511,7 @@ information may not be desired.
 If this is set to non-zero, this information is shown whenever the
 OOM killer actually kills a memory-hogging task.
 
-The default value is 0.
+The default value is 1 (enabled).
 
 ==============================================================
 
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -32,7 +32,7 @@
 
 int sysctl_panic_on_oom;
 int sysctl_oom_kill_allocating_task;
-int sysctl_oom_dump_tasks;
+int sysctl_oom_dump_tasks = 1;
 static DEFINE_SPINLOCK(zone_scan_lock);
 /* #define DEBUG */
 


^ permalink raw reply	[flat|nested] 99+ messages in thread

* [patch -mm 08/18] oom: badness heuristic rewrite
  2010-06-01  7:18 [patch -mm 00/18] oom killer rewrite David Rientjes
                   ` (6 preceding siblings ...)
  2010-06-01  7:18 ` [patch -mm 07/18] oom: enable oom tasklist dump by default David Rientjes
@ 2010-06-01  7:18 ` David Rientjes
  2010-06-01  7:36   ` KOSAKI Motohiro
  2010-06-01  7:46   ` Nick Piggin
  2010-06-01  7:18 ` [patch -mm 09/18] oom: add forkbomb penalty to badness heuristic David Rientjes
                   ` (9 subsequent siblings)
  17 siblings, 2 replies; 99+ messages in thread
From: David Rientjes @ 2010-06-01  7:18 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Rik van Riel, Nick Piggin, Oleg Nesterov, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro, Balbir Singh, linux-mm

This is a complete rewrite of the oom killer's badness() heuristic which is
used to determine which task to kill in oom conditions.  The goal is to
make it as simple and predictable as possible so the results are better
understood and we end up killing the task which will lead to the most
memory freeing while still respecting the fine-tuning from userspace.

The baseline for the heuristic is a proportion of memory that each task is
currently using in memory plus swap compared to the amount of "allowable"
memory.  "Allowable," in this sense, means the system-wide resources for
unconstrained oom conditions, the set of mempolicy nodes, the mems
attached to current's cpuset, or a memory controller's limit.  The
proportion is given on a scale of 0 (never kill) to 1000 (always kill),
so a task with a badness() score of 500 consumes approximately 50% of
allowable memory, counting pages resident in RAM or in swap space.
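
As a rough illustration of the baseline described above, here is a userspace
sketch.  It is not the kernel code: baseline_score() and its parameter names
are hypothetical, and all quantities are assumed to be page counts.

```c
/*
 * Hypothetical userspace model of the baseline proportion: resident plus
 * swapped pages, expressed on a 0..1000 scale of the allowable pages.
 */
unsigned long baseline_score(unsigned long rss_pages,
			     unsigned long swap_pages,
			     unsigned long allowable_pages)
{
	/* Mirror the patch's guard against a limit of zero pages. */
	if (!allowable_pages)
		allowable_pages = 1;
	return (rss_pages + swap_pages) * 1000 / allowable_pages;
}
```

A task with 400 resident pages and 100 swapped pages out of 1000 allowable
pages would score 500 in this model.  The raw proportion can exceed 1000; the
patch clamps the final score after the adjustments are applied.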

The proportion is always relative to the amount of "allowable" memory and
not the total amount of RAM systemwide so that mempolicies and cpusets may
operate in isolation; they shall not need to know the true size of the
machine on which they are running if they are bound to a specific set of
nodes or mems, respectively.

Root tasks are given 3% extra memory just like __vm_enough_memory()
provides in LSMs.  In the event of two tasks consuming similar amounts of
memory, it is generally better to save root's task.

Because of the change in the badness() heuristic's baseline, it is also
necessary to introduce a new user interface to tune it.  It's not possible
to redefine the meaning of /proc/pid/oom_adj with a new scale since the
ABI cannot be changed for backward compatibility.  Instead, a new tunable,
/proc/pid/oom_score_adj, is added that ranges from -1000 to +1000.  It may
be used to polarize the heuristic such that certain tasks are never
considered for oom kill while others may always be considered.  The value
is added directly into the badness() score so a value of -500, for
example, means to discount 50% of its memory consumption in comparison to
other tasks either on the system, bound to the mempolicy, in the cpuset,
or sharing the same memory controller.
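
The adjustments above can be sketched in userspace as follows.  This is a
simplified model under stated assumptions, not the kernel implementation:
adjusted_score() and its parameters are hypothetical names, the baseline is
assumed to be already computed on the 0..1000 scale, and the PF_OOM_ORIGIN
special case is left out.

```c
/* Values taken from the patch's include/linux/oom.h. */
#define OOM_SCORE_ADJ_MIN	(-1000)
#define OOM_SCORE_ADJ_MAX	1000

/*
 * Hypothetical userspace model of the score after adjustment.
 *   baseline:      proportion of allowable memory in use, 0..1000 scale
 *   is_root:       nonzero if the task would receive the 3% root bonus
 *   oom_score_adj: value of /proc/pid/oom_score_adj
 */
int adjusted_score(int baseline, int is_root, int oom_score_adj)
{
	int points = baseline;

	/* The minimum value disables oom killing for the task. */
	if (oom_score_adj == OOM_SCORE_ADJ_MIN)
		return 0;
	/* 3% of allowable memory is 30 points on the 0..1000 scale. */
	if (is_root)
		points -= 30;
	points += oom_score_adj;

	/* Clamp the result to the documented 0..1000 range. */
	if (points < 0)
		return 0;
	return points <= 1000 ? points : 1000;
}
```

In this model a task using half of allowable memory (baseline 500) with an
oom_score_adj of -500 scores 0, and the same task run as root with no
adjustment scores 470.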

/proc/pid/oom_adj is changed so that its meaning is rescaled into the
units used by /proc/pid/oom_score_adj, and vice versa.  Changing one of
these per-task tunables will rescale the value of the other to an
equivalent meaning.  Although /proc/pid/oom_adj was originally defined as
a bitshift on the badness score, it now shares the same linear growth as
/proc/pid/oom_score_adj but with different granularity.  This is required
so the ABI is not broken with userspace applications and allows oom_adj to
be deprecated for future removal.
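
The rescaling between the two tunables can be sketched in userspace as
follows.  The arithmetic follows the fs/proc/base.c hunks in this patch, but
the function names are hypothetical and the constants are copied here only to
keep the example self-contained.

```c
/* Values taken from the patch's include/linux/oom.h. */
#define OOM_DISABLE		(-17)
#define OOM_ADJUST_MAX		15
#define OOM_SCORE_ADJ_MIN	(-1000)
#define OOM_SCORE_ADJ_MAX	1000

/* Rescale an oom_adj value (-17..15) into oom_score_adj units. */
int oom_adj_to_score_adj(int oom_adj)
{
	/* Special case so the maximum value is always attainable. */
	if (oom_adj == OOM_ADJUST_MAX)
		return OOM_SCORE_ADJ_MAX;
	return (oom_adj * OOM_SCORE_ADJ_MAX) / -OOM_DISABLE;
}

/* Rescale an oom_score_adj value (-1000..1000) into oom_adj units. */
int score_adj_to_oom_adj(int score_adj)
{
	/* Special case so OOM_DISABLE is always attainable. */
	if (score_adj == OOM_SCORE_ADJ_MIN)
		return OOM_DISABLE;
	return (score_adj * OOM_ADJUST_MAX) / OOM_SCORE_ADJ_MAX;
}
```

Because the two scales have different granularity and integer division
truncates, the mapping does not round-trip exactly: an oom_adj of 8 becomes
an oom_score_adj of 470, which maps back to an oom_adj of 7.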

Signed-off-by: David Rientjes <rientjes@google.com>
---
 Documentation/filesystems/proc.txt |   94 ++++++++-----
 fs/proc/base.c                     |   99 +++++++++++++-
 include/linux/memcontrol.h         |    8 +
 include/linux/oom.h                |   14 ++-
 include/linux/sched.h              |    3 +-
 kernel/fork.c                      |    1 +
 mm/memcontrol.c                    |   18 +++
 mm/oom_kill.c                      |  267 ++++++++++++++++--------------------
 8 files changed, 311 insertions(+), 193 deletions(-)

diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -33,7 +33,8 @@ Table of Contents
   2	Modifying System Parameters
 
   3	Per-Process Parameters
-  3.1	/proc/<pid>/oom_adj - Adjust the oom-killer score
+  3.1	/proc/<pid>/oom_adj & /proc/<pid>/oom_score_adj - Adjust the oom-killer
+								score
   3.2	/proc/<pid>/oom_score - Display current oom-killer score
   3.3	/proc/<pid>/io - Display the IO accounting fields
   3.4	/proc/<pid>/coredump_filter - Core dump filtering settings
@@ -1234,42 +1235,61 @@ of the kernel.
 CHAPTER 3: PER-PROCESS PARAMETERS
 ------------------------------------------------------------------------------
 
-3.1 /proc/<pid>/oom_adj - Adjust the oom-killer score
-------------------------------------------------------
-
-This file can be used to adjust the score used to select which processes
-should be killed in an  out-of-memory  situation.  Giving it a high score will
-increase the likelihood of this process being killed by the oom-killer.  Valid
-values are in the range -16 to +15, plus the special value -17, which disables
-oom-killing altogether for this process.
-
-The process to be killed in an out-of-memory situation is selected among all others
-based on its badness score. This value equals the original memory size of the process
-and is then updated according to its CPU time (utime + stime) and the
-run time (uptime - start time). The longer it runs the smaller is the score.
-Badness score is divided by the square root of the CPU time and then by
-the double square root of the run time.
-
-Swapped out tasks are killed first. Half of each child's memory size is added to
-the parent's score if they do not share the same memory. Thus forking servers
-are the prime candidates to be killed. Having only one 'hungry' child will make
-parent less preferable than the child.
-
-/proc/<pid>/oom_score shows process' current badness score.
-
-The following heuristics are then applied:
- * if the task was reniced, its score doubles
- * superuser or direct hardware access tasks (CAP_SYS_ADMIN, CAP_SYS_RESOURCE
- 	or CAP_SYS_RAWIO) have their score divided by 4
- * if oom condition happened in one cpuset and checked process does not belong
- 	to it, its score is divided by 8
- * the resulting score is multiplied by two to the power of oom_adj, i.e.
-	points <<= oom_adj when it is positive and
-	points >>= -(oom_adj) otherwise
-
-The task with the highest badness score is then selected and its children
-are killed, process itself will be killed in an OOM situation when it does
-not have children or some of them disabled oom like described above.
+3.1 /proc/<pid>/oom_adj & /proc/<pid>/oom_score_adj - Adjust the oom-killer score
+---------------------------------------------------------------------------------
+
+These files can be used to adjust the badness heuristic used to select which
+process gets killed in out of memory conditions.
+
+The badness heuristic assigns a value to each candidate task ranging from 0
+(never kill) to 1000 (always kill) to determine which process is targeted.  The
+units are roughly a proportion along that range of allowed memory the process
+may allocate from based on an estimation of its current memory and swap use.
+For example, if a task is using all allowed memory, its badness score will be
+1000.  If it is using half of its allowed memory, its score will be 500.
+
+There is an additional factor included in the badness score: root
+processes are given 3% extra memory over other tasks.
+
+The amount of "allowed" memory depends on the context in which the oom killer
+was called.  If it is due to the memory assigned to the allocating task's cpuset
+being exhausted, the allowed memory represents the set of mems assigned to that
+cpuset.  If it is due to a mempolicy's node(s) being exhausted, the allowed
+memory represents the set of mempolicy nodes.  If it is due to a memory
+limit (or swap limit) being reached, the allowed memory is that configured
+limit.  Finally, if it is due to the entire system being out of memory, the
+allowed memory represents all allocatable resources.
+
+The value of /proc/<pid>/oom_score_adj is added to the badness score before it
+is used to determine which task to kill.  Acceptable values range from -1000
+(OOM_SCORE_ADJ_MIN) to +1000 (OOM_SCORE_ADJ_MAX).  This allows userspace to
+polarize the preference for oom killing either by always preferring a certain
+task or completely disabling it.  The lowest possible value, -1000, is
+equivalent to disabling oom killing entirely for that task since it will always
+report a badness score of 0.
+
+Consequently, it is very simple for userspace to define the amount of memory to
+consider for each task.  Setting a /proc/<pid>/oom_score_adj value of +500, for
+example, is roughly equivalent to allowing the remainder of tasks sharing the
+same system, cpuset, mempolicy, or memory controller resources to use at least
+50% more memory.  A value of -500, on the other hand, would be roughly
+equivalent to discounting 50% of the task's allowed memory from being considered
+as scoring against the task.
+
+For backwards compatibility with previous kernels, /proc/<pid>/oom_adj may also
+be used to tune the badness score.  Its acceptable values range from -16
+(OOM_ADJUST_MIN) to +15 (OOM_ADJUST_MAX) and a special value of -17
+(OOM_DISABLE) to disable oom killing entirely for that task.  Its value is
+scaled linearly with /proc/<pid>/oom_score_adj.
+
+Writing to /proc/<pid>/oom_score_adj or /proc/<pid>/oom_adj will change the
+other with its scaled value.
+
+Caveat: when a parent task is selected, the oom killer will sacrifice any first
+generation children with separate address spaces instead, if possible.  This
+prevents servers and important system daemons from being killed and loses the
+minimal amount of work.
+
 
 3.2 /proc/<pid>/oom_score - Display current oom-killer score
 -------------------------------------------------------------
diff --git a/fs/proc/base.c b/fs/proc/base.c
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -63,6 +63,7 @@
 #include <linux/namei.h>
 #include <linux/mnt_namespace.h>
 #include <linux/mm.h>
+#include <linux/swap.h>
 #include <linux/rcupdate.h>
 #include <linux/kallsyms.h>
 #include <linux/stacktrace.h>
@@ -428,16 +429,18 @@ static const struct file_operations proc_lstats_operations = {
 #endif
 
 /* The badness from the OOM killer */
-unsigned long badness(struct task_struct *p, unsigned long uptime);
 static int proc_oom_score(struct task_struct *task, char *buffer)
 {
 	unsigned long points = 0;
-	struct timespec uptime;
 
-	do_posix_clock_monotonic_gettime(&uptime);
 	read_lock(&tasklist_lock);
 	if (pid_alive(task))
-		points = badness(task, uptime.tv_sec);
+		points = oom_badness(task->group_leader,
+					global_page_state(NR_INACTIVE_ANON) +
+					global_page_state(NR_ACTIVE_ANON) +
+					global_page_state(NR_INACTIVE_FILE) +
+					global_page_state(NR_ACTIVE_FILE) +
+					total_swap_pages);
 	read_unlock(&tasklist_lock);
 	return sprintf(buffer, "%lu\n", points);
 }
@@ -1042,7 +1045,15 @@ static ssize_t oom_adjust_write(struct file *file, const char __user *buf,
 	}
 
 	task->signal->oom_adj = oom_adjust;
-
+	/*
+	 * Scale /proc/pid/oom_score_adj appropriately ensuring that a maximum
+	 * value is always attainable.
+	 */
+	if (task->signal->oom_adj == OOM_ADJUST_MAX)
+		task->signal->oom_score_adj = OOM_SCORE_ADJ_MAX;
+	else
+		task->signal->oom_score_adj = (oom_adjust * OOM_SCORE_ADJ_MAX) /
+								-OOM_DISABLE;
 	unlock_task_sighand(task, &flags);
 	put_task_struct(task);
 
@@ -1055,6 +1066,82 @@ static const struct file_operations proc_oom_adjust_operations = {
 	.llseek		= generic_file_llseek,
 };
 
+static ssize_t oom_score_adj_read(struct file *file, char __user *buf,
+					size_t count, loff_t *ppos)
+{
+	struct task_struct *task = get_proc_task(file->f_path.dentry->d_inode);
+	char buffer[PROC_NUMBUF];
+	int oom_score_adj = OOM_SCORE_ADJ_MIN;
+	unsigned long flags;
+	size_t len;
+
+	if (!task)
+		return -ESRCH;
+	if (lock_task_sighand(task, &flags)) {
+		oom_score_adj = task->signal->oom_score_adj;
+		unlock_task_sighand(task, &flags);
+	}
+	put_task_struct(task);
+	len = snprintf(buffer, sizeof(buffer), "%d\n", oom_score_adj);
+	return simple_read_from_buffer(buf, count, ppos, buffer, len);
+}
+
+static ssize_t oom_score_adj_write(struct file *file, const char __user *buf,
+					size_t count, loff_t *ppos)
+{
+	struct task_struct *task;
+	char buffer[PROC_NUMBUF];
+	unsigned long flags;
+	long oom_score_adj;
+	int err;
+
+	memset(buffer, 0, sizeof(buffer));
+	if (count > sizeof(buffer) - 1)
+		count = sizeof(buffer) - 1;
+	if (copy_from_user(buffer, buf, count))
+		return -EFAULT;
+
+	err = strict_strtol(strstrip(buffer), 0, &oom_score_adj);
+	if (err)
+		return -EINVAL;
+	if (oom_score_adj < OOM_SCORE_ADJ_MIN ||
+			oom_score_adj > OOM_SCORE_ADJ_MAX)
+		return -EINVAL;
+
+	task = get_proc_task(file->f_path.dentry->d_inode);
+	if (!task)
+		return -ESRCH;
+	if (!lock_task_sighand(task, &flags)) {
+		put_task_struct(task);
+		return -ESRCH;
+	}
+	if (oom_score_adj < task->signal->oom_score_adj &&
+			!capable(CAP_SYS_RESOURCE)) {
+		unlock_task_sighand(task, &flags);
+		put_task_struct(task);
+		return -EACCES;
+	}
+
+	task->signal->oom_score_adj = oom_score_adj;
+	/*
+	 * Scale /proc/pid/oom_adj appropriately ensuring that OOM_DISABLE is
+	 * always attainable.
+	 */
+	if (task->signal->oom_score_adj == OOM_SCORE_ADJ_MIN)
+		task->signal->oom_adj = OOM_DISABLE;
+	else
+		task->signal->oom_adj = (oom_score_adj * OOM_ADJUST_MAX) /
+							OOM_SCORE_ADJ_MAX;
+	unlock_task_sighand(task, &flags);
+	put_task_struct(task);
+	return count;
+}
+
+static const struct file_operations proc_oom_score_adj_operations = {
+	.read		= oom_score_adj_read,
+	.write		= oom_score_adj_write,
+};
+
 #ifdef CONFIG_AUDITSYSCALL
 #define TMPBUFLEN 21
 static ssize_t proc_loginuid_read(struct file * file, char __user * buf,
@@ -2627,6 +2714,7 @@ static const struct pid_entry tgid_base_stuff[] = {
 #endif
 	INF("oom_score",  S_IRUGO, proc_oom_score),
 	REG("oom_adj",    S_IRUGO|S_IWUSR, proc_oom_adjust_operations),
+	REG("oom_score_adj", S_IRUGO|S_IWUSR, proc_oom_score_adj_operations),
 #ifdef CONFIG_AUDITSYSCALL
 	REG("loginuid",   S_IWUSR|S_IRUGO, proc_loginuid_operations),
 	REG("sessionid",  S_IRUGO, proc_sessionid_operations),
@@ -2961,6 +3049,7 @@ static const struct pid_entry tid_base_stuff[] = {
 #endif
 	INF("oom_score", S_IRUGO, proc_oom_score),
 	REG("oom_adj",   S_IRUGO|S_IWUSR, proc_oom_adjust_operations),
+	REG("oom_score_adj", S_IRUGO|S_IWUSR, proc_oom_score_adj_operations),
 #ifdef CONFIG_AUDITSYSCALL
 	REG("loginuid",  S_IWUSR|S_IRUGO, proc_loginuid_operations),
 	REG("sessionid",  S_IRUSR, proc_sessionid_operations),
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -130,6 +130,8 @@ void mem_cgroup_update_file_mapped(struct page *page, int val);
 unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
 						gfp_t gfp_mask, int nid,
 						int zid);
+u64 mem_cgroup_get_limit(struct mem_cgroup *mem);
+
 #else /* CONFIG_CGROUP_MEM_RES_CTLR */
 struct mem_cgroup;
 
@@ -309,6 +311,12 @@ unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
 	return 0;
 }
 
+static inline
+u64 mem_cgroup_get_limit(struct mem_cgroup *mem)
+{
+	return 0;
+}
+
 #endif /* CONFIG_CGROUP_MEM_CONT */
 
 #endif /* _LINUX_MEMCONTROL_H */
diff --git a/include/linux/oom.h b/include/linux/oom.h
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -1,14 +1,24 @@
 #ifndef __INCLUDE_LINUX_OOM_H
 #define __INCLUDE_LINUX_OOM_H
 
-/* /proc/<pid>/oom_adj set to -17 protects from the oom-killer */
+/*
+ * /proc/<pid>/oom_adj set to -17 protects from the oom-killer
+ */
 #define OOM_DISABLE (-17)
 /* inclusive */
 #define OOM_ADJUST_MIN (-16)
 #define OOM_ADJUST_MAX 15
 
+/*
+ * /proc/<pid>/oom_score_adj set to OOM_SCORE_ADJ_MIN disables oom killing for
+ * pid.
+ */
+#define OOM_SCORE_ADJ_MIN	(-1000)
+#define OOM_SCORE_ADJ_MAX	1000
+
 #ifdef __KERNEL__
 
+#include <linux/sched.h>
 #include <linux/types.h>
 #include <linux/nodemask.h>
 
@@ -25,6 +35,8 @@ enum oom_constraint {
 	CONSTRAINT_MEMCG,
 };
 
+extern unsigned int oom_badness(struct task_struct *p,
+					unsigned long totalpages);
 extern int try_set_zone_oom(struct zonelist *zonelist, gfp_t gfp_flags);
 extern void clear_zonelist_oom(struct zonelist *zonelist, gfp_t gfp_flags);
 
diff --git a/include/linux/sched.h b/include/linux/sched.h
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -629,7 +629,8 @@ struct signal_struct {
 	struct tty_audit_buf *tty_audit_buf;
 #endif
 
-	int oom_adj;	/* OOM kill score adjustment (bit shift) */
+	int oom_adj;		/* OOM kill score adjustment (bit shift) */
+	int oom_score_adj;	/* OOM kill score adjustment */
 };
 
 /* Context switch must be unlocked if interrupts are to be enabled */
diff --git a/kernel/fork.c b/kernel/fork.c
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -899,6 +899,7 @@ static int copy_signal(unsigned long clone_flags, struct task_struct *tsk)
 	tty_audit_fork(sig);
 
 	sig->oom_adj = current->signal->oom_adj;
+	sig->oom_score_adj = current->signal->oom_score_adj;
 
 	return 0;
 }
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1158,6 +1158,24 @@ static int mem_cgroup_count_children(struct mem_cgroup *mem)
 }
 
 /*
+ * Return the memory (and swap, if configured) limit for a memcg.
+ */
+u64 mem_cgroup_get_limit(struct mem_cgroup *memcg)
+{
+	u64 limit;
+	u64 memsw;
+
+	limit = res_counter_read_u64(&memcg->res, RES_LIMIT) +
+			total_swap_pages;
+	memsw = res_counter_read_u64(&memcg->memsw, RES_LIMIT);
+	/*
+	 * If memsw is finite and limits the amount of swap space available
+	 * to this memcg, return that limit.
+	 */
+	return min(limit, memsw);
+}
+
+/*
  * Visit the first child (need not be the first child as per the ordering
  * of the cgroup list, since we track last_scanned_child) of @mem and use
  * that to reclaim free pages from.
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -4,6 +4,8 @@
  *  Copyright (C)  1998,2000  Rik van Riel
  *	Thanks go out to Claus Fischer for some serious inspiration and
  *	for goading me into coding this file...
+ *  Copyright (C)  2010  Google, Inc
+ *	Rewritten by David Rientjes
  *
  *  The routines in this file are used to kill a process when
  *  we're seriously out of memory. This gets called from __alloc_pages()
@@ -34,7 +36,6 @@ int sysctl_panic_on_oom;
 int sysctl_oom_kill_allocating_task;
 int sysctl_oom_dump_tasks = 1;
 static DEFINE_SPINLOCK(zone_scan_lock);
-/* #define DEBUG */
 
 /*
  * Do all threads of the target process overlap our allowed nodes?
@@ -94,37 +95,33 @@ static void check_panic_on_oom(enum oom_constraint constraint, gfp_t gfp_mask,
 }
 
 /**
- * badness - calculate a numeric value for how bad this task has been
+ * oom_badness - heuristic function to determine which candidate task to kill
  * @p: task struct of which task we should calculate
- * @uptime: current uptime in seconds
+ * @totalpages: total present RAM allowed for page allocation
  *
- * The formula used is relatively simple and documented inline in the
- * function. The main rationale is that we want to select a good task
- * to kill when we run out of memory.
- *
- * Good in this context means that:
- * 1) we lose the minimum amount of work done
- * 2) we recover a large amount of memory
- * 3) we don't kill anything innocent of eating tons of memory
- * 4) we want to kill the minimum amount of processes (one)
- * 5) we try to kill the process the user expects us to kill, this
- *    algorithm has been meticulously tuned to meet the principle
- *    of least surprise ... (be careful when you change it)
+ * The heuristic for determining which task to kill is made to be as simple and
+ * predictable as possible.  The goal is to return the highest value for the
+ * task consuming the most memory to avoid subsequent oom conditions.
  */
-
-unsigned long badness(struct task_struct *p, unsigned long uptime)
+unsigned int oom_badness(struct task_struct *p, unsigned long totalpages)
 {
-	unsigned long points, cpu_time, run_time;
 	struct mm_struct *mm;
-	struct task_struct *child;
-	int oom_adj = p->signal->oom_adj;
-	struct task_cputime task_time;
-	unsigned long utime;
-	unsigned long stime;
+	int points;
 
-	if (oom_adj == OOM_DISABLE)
+	/*
+	 * Shortcut check for OOM_SCORE_ADJ_MIN so the entire heuristic doesn't
+	 * need to be executed for something that can't be killed.
+	 */
+	if (p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN)
 		return 0;
 
+	/*
+	 * When the PF_OOM_ORIGIN bit is set, it indicates the task should have
+	 * priority for oom killing.
+	 */
+	if (p->flags & PF_OOM_ORIGIN)
+		return 1000;
+
 	task_lock(p);
 	mm = p->mm;
 	if (!mm) {
@@ -133,98 +130,37 @@ unsigned long badness(struct task_struct *p, unsigned long uptime)
 	}
 
 	/*
-	 * The memory size of the process is the basis for the badness.
+	 * The memory controller may have a limit of 0 bytes, so avoid a divide
+	 * by zero if necessary.
 	 */
-	points = mm->total_vm;
+	if (!totalpages)
+		totalpages = 1;
 
 	/*
-	 * After this unlock we can no longer dereference local variable `mm'
+	 * The baseline for the badness score is the proportion of RAM that each
+	 * task's rss and swap space use.
 	 */
+	points = (get_mm_rss(mm) + get_mm_counter(mm, MM_SWAPENTS)) * 1000 /
+			totalpages;
 	task_unlock(p);
 
 	/*
-	 * swapoff can easily use up all memory, so kill those first.
-	 */
-	if (p->flags & PF_OOM_ORIGIN)
-		return ULONG_MAX;
-
-	/*
-	 * Processes which fork a lot of child processes are likely
-	 * a good choice. We add half the vmsize of the children if they
-	 * have an own mm. This prevents forking servers to flood the
-	 * machine with an endless amount of children. In case a single
-	 * child is eating the vast majority of memory, adding only half
-	 * to the parents will make the child our kill candidate of choice.
-	 */
-	list_for_each_entry(child, &p->children, sibling) {
-		task_lock(child);
-		if (child->mm != mm && child->mm)
-			points += child->mm->total_vm/2 + 1;
-		task_unlock(child);
-	}
-
-	/*
-	 * CPU time is in tens of seconds and run time is in thousands
-         * of seconds. There is no particular reason for this other than
-         * that it turned out to work very well in practice.
-	 */
-	thread_group_cputime(p, &task_time);
-	utime = cputime_to_jiffies(task_time.utime);
-	stime = cputime_to_jiffies(task_time.stime);
-	cpu_time = (utime + stime) >> (SHIFT_HZ + 3);
-
-
-	if (uptime >= p->start_time.tv_sec)
-		run_time = (uptime - p->start_time.tv_sec) >> 10;
-	else
-		run_time = 0;
-
-	if (cpu_time)
-		points /= int_sqrt(cpu_time);
-	if (run_time)
-		points /= int_sqrt(int_sqrt(run_time));
-
-	/*
-	 * Niced processes are most likely less important, so double
-	 * their badness points.
-	 */
-	if (task_nice(p) > 0)
-		points *= 2;
-
-	/*
-	 * Superuser processes are usually more important, so we make it
-	 * less likely that we kill those.
-	 */
-	if (has_capability_noaudit(p, CAP_SYS_ADMIN) ||
-	    has_capability_noaudit(p, CAP_SYS_RESOURCE))
-		points /= 4;
-
-	/*
-	 * We don't want to kill a process with direct hardware access.
-	 * Not only could that mess up the hardware, but usually users
-	 * tend to only have this flag set on applications they think
-	 * of as important.
+	 * Root processes get 3% bonus, just like the __vm_enough_memory() used
+	 * by LSMs.
 	 */
-	if (has_capability_noaudit(p, CAP_SYS_RAWIO))
-		points /= 4;
+	if (has_capability_noaudit(p, CAP_SYS_ADMIN))
+		points -= 30;
 
 	/*
-	 * Adjust the score by oom_adj.
+	 * /proc/pid/oom_score_adj ranges from -1000 to +1000 such that the
+	 * range may either completely disable oom killing or always prefer a
+	 * certain task.
 	 */
-	if (oom_adj) {
-		if (oom_adj > 0) {
-			if (!points)
-				points = 1;
-			points <<= oom_adj;
-		} else
-			points >>= -(oom_adj);
-	}
+	points += p->signal->oom_score_adj;
 
-#ifdef DEBUG
-	printk(KERN_DEBUG "OOMkill: task %d (%s) got %lu points\n",
-	p->pid, p->comm, points);
-#endif
-	return points;
+	if (points < 0)
+		return 0;
+	return (points <= 1000) ? points : 1000;
 }
 
 /*
@@ -232,12 +168,24 @@ unsigned long badness(struct task_struct *p, unsigned long uptime)
  */
 #ifdef CONFIG_NUMA
 static enum oom_constraint constrained_alloc(struct zonelist *zonelist,
-				    gfp_t gfp_mask, nodemask_t *nodemask)
+				gfp_t gfp_mask, nodemask_t *nodemask,
+				unsigned long *totalpages)
 {
 	struct zone *zone;
 	struct zoneref *z;
 	enum zone_type high_zoneidx = gfp_zone(gfp_mask);
+	bool cpuset_limited = false;
+	int nid;
+
+	/* Default to all anonymous memory, page cache, and swap */
+	*totalpages = global_page_state(NR_INACTIVE_ANON) +
+			global_page_state(NR_ACTIVE_ANON) +
+			global_page_state(NR_INACTIVE_FILE) +
+			global_page_state(NR_ACTIVE_FILE) +
+			total_swap_pages;
 
+	if (!zonelist)
+		return CONSTRAINT_NONE;
 	/*
 	 * Reach here only when __GFP_NOFAIL is used. So, we should avoid
 	 * to kill current.We have to random task kill in this case.
@@ -247,26 +195,47 @@ static enum oom_constraint constrained_alloc(struct zonelist *zonelist,
 		return CONSTRAINT_NONE;
 
 	/*
-	 * The nodemask here is a nodemask passed to alloc_pages(). Now,
-	 * cpuset doesn't use this nodemask for its hardwall/softwall/hierarchy
-	 * feature. mempolicy is an only user of nodemask here.
-	 * check mempolicy's nodemask contains all N_HIGH_MEMORY
+	 * This is not a __GFP_THISNODE allocation, so a truncated nodemask in
+	 * the page allocator means a mempolicy is in effect.  Cpuset policy
+	 * is enforced in get_page_from_freelist().
 	 */
-	if (nodemask && !nodes_subset(node_states[N_HIGH_MEMORY], *nodemask))
+	if (nodemask && !nodes_subset(node_states[N_HIGH_MEMORY], *nodemask)) {
+		*totalpages = total_swap_pages;
+		for_each_node_mask(nid, *nodemask)
+			*totalpages += node_page_state(nid, NR_INACTIVE_ANON) +
+					node_page_state(nid, NR_ACTIVE_ANON) +
+					node_page_state(nid, NR_INACTIVE_FILE) +
+					node_page_state(nid, NR_ACTIVE_FILE);
 		return CONSTRAINT_MEMORY_POLICY;
+	}
 
 	/* Check this allocation failure is caused by cpuset's wall function */
 	for_each_zone_zonelist_nodemask(zone, z, zonelist,
 			high_zoneidx, nodemask)
 		if (!cpuset_zone_allowed_softwall(zone, gfp_mask))
-			return CONSTRAINT_CPUSET;
-
+			cpuset_limited = true;
+
+	if (cpuset_limited) {
+		*totalpages = total_swap_pages;
+		for_each_node_mask(nid, cpuset_current_mems_allowed)
+			*totalpages += node_page_state(nid, NR_INACTIVE_ANON) +
+					node_page_state(nid, NR_ACTIVE_ANON) +
+					node_page_state(nid, NR_INACTIVE_FILE) +
+					node_page_state(nid, NR_ACTIVE_FILE);
+		return CONSTRAINT_CPUSET;
+	}
 	return CONSTRAINT_NONE;
 }
 #else
 static enum oom_constraint constrained_alloc(struct zonelist *zonelist,
-				gfp_t gfp_mask, nodemask_t *nodemask)
+				gfp_t gfp_mask, nodemask_t *nodemask,
+				unsigned long *totalpages)
 {
+	*totalpages = global_page_state(NR_INACTIVE_ANON) +
+			global_page_state(NR_ACTIVE_ANON) +
+			global_page_state(NR_INACTIVE_FILE) +
+			global_page_state(NR_ACTIVE_FILE) +
+			total_swap_pages;
 	return CONSTRAINT_NONE;
 }
 #endif
@@ -277,18 +246,16 @@ static enum oom_constraint constrained_alloc(struct zonelist *zonelist,
  *
  * (not docbooked, we don't want this one cluttering up the manual)
  */
-static struct task_struct *select_bad_process(unsigned long *ppoints,
-		struct mem_cgroup *mem, enum oom_constraint constraint,
-		const nodemask_t *mask)
+static struct task_struct *select_bad_process(unsigned int *ppoints,
+		unsigned long totalpages, struct mem_cgroup *mem,
+		enum oom_constraint constraint, const nodemask_t *mask)
 {
 	struct task_struct *p;
 	struct task_struct *chosen = NULL;
-	struct timespec uptime;
 	*ppoints = 0;
 
-	do_posix_clock_monotonic_gettime(&uptime);
 	for_each_process(p) {
-		unsigned long points;
+		unsigned int points;
 
 		/*
 		 * skip kernel threads and tasks which have already released
@@ -333,13 +300,13 @@ static struct task_struct *select_bad_process(unsigned long *ppoints,
 				return ERR_PTR(-1UL);
 
 			chosen = p;
-			*ppoints = ULONG_MAX;
+			*ppoints = 1000;
 		}
 
-		if (p->signal->oom_adj == OOM_DISABLE)
+		if (p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN)
 			continue;
 
-		points = badness(p, uptime.tv_sec);
+		points = oom_badness(p, totalpages);
 		if (points > *ppoints || !chosen) {
 			chosen = p;
 			*ppoints = points;
@@ -355,7 +322,7 @@ static struct task_struct *select_bad_process(unsigned long *ppoints,
  *
  * Dumps the current memory state of all system tasks, excluding kernel threads.
  * State information includes task's pid, uid, tgid, vm size, rss, cpu, oom_adj
- * score, and name.
+ * value, oom_score_adj value, and name.
  *
  * If the actual is non-NULL, only tasks that are a member of the mem_cgroup are
  * shown.
@@ -367,7 +334,7 @@ static void dump_tasks(const struct mem_cgroup *mem)
 	struct task_struct *g, *p;
 
 	printk(KERN_INFO "[ pid ]   uid  tgid total_vm      rss cpu oom_adj "
-	       "name\n");
+	       "oom_score_adj name\n");
 	do_each_thread(g, p) {
 		struct mm_struct *mm;
 
@@ -387,10 +354,10 @@ static void dump_tasks(const struct mem_cgroup *mem)
 			task_unlock(p);
 			continue;
 		}
-		printk(KERN_INFO "[%5d] %5d %5d %8lu %8lu %3d     %3d %s\n",
+		pr_info("[%5d] %5d %5d %8lu %8lu %3d     %3d          %4d %s\n",
 		       p->pid, __task_cred(p)->uid, p->tgid, mm->total_vm,
 		       get_mm_rss(mm), (int)task_cpu(p), p->signal->oom_adj,
-		       p->comm);
+		       p->signal->oom_score_adj, p->comm);
 		task_unlock(p);
 	} while_each_thread(g, p);
 }
@@ -399,8 +366,9 @@ static void dump_header(struct task_struct *p, gfp_t gfp_mask, int order,
 							struct mem_cgroup *mem)
 {
 	pr_warning("%s invoked oom-killer: gfp_mask=0x%x, order=%d, "
-		"oom_adj=%d\n",
-		current->comm, gfp_mask, order, current->signal->oom_adj);
+		"oom_adj=%d, oom_score_adj=%d\n",
+		current->comm, gfp_mask, order, current->signal->oom_adj,
+		current->signal->oom_score_adj);
 	task_lock(current);
 	cpuset_print_task_mems_allowed(current);
 	task_unlock(current);
@@ -465,7 +433,7 @@ static int oom_kill_task(struct task_struct *p)
 	 * change to NULL at any time since we do not hold task_lock(p).
 	 * However, this is of no concern to us.
 	 */
-	if (!p->mm || p->signal->oom_adj == OOM_DISABLE)
+	if (!p->mm || p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN)
 		return 1;
 
 	__oom_kill_task(p, 1);
@@ -474,13 +442,12 @@ static int oom_kill_task(struct task_struct *p)
 }
 
 static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
-			    unsigned long points, struct mem_cgroup *mem,
-			    const char *message)
+			    unsigned int points, unsigned long totalpages,
+			    struct mem_cgroup *mem, const char *message)
 {
 	struct task_struct *victim = p;
 	struct task_struct *c;
-	unsigned long victim_points = 0;
-	struct timespec uptime;
+	unsigned int victim_points = 0;
 
 	if (printk_ratelimit())
 		dump_header(p, gfp_mask, order, mem);
@@ -494,21 +461,20 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
 		return 0;
 	}
 
-	pr_err("%s: Kill process %d (%s) with score %lu or sacrifice child\n",
+	pr_err("%s: Kill process %d (%s) with score %d or sacrifice child\n",
 		message, task_pid_nr(p), p->comm, points);
 
-	do_posix_clock_monotonic_gettime(&uptime);
 	/* Try to sacrifice the worst child first */
 	list_for_each_entry(c, &p->children, sibling) {
-		unsigned long cpoints;
+		unsigned int cpoints;
 
 		if (c->mm == p->mm)
 			continue;
 		if (mem && !task_in_mem_cgroup(c, mem))
 			continue;
 
-		/* badness() returns 0 if the thread is unkillable */
-		cpoints = badness(c, uptime.tv_sec);
+		/* oom_badness() returns 0 if the thread is unkillable */
+		cpoints = oom_badness(c, totalpages);
 		if (cpoints > victim_points) {
 			victim = c;
 			victim_points = cpoints;
@@ -520,17 +486,19 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
 #ifdef CONFIG_CGROUP_MEM_RES_CTLR
 void mem_cgroup_out_of_memory(struct mem_cgroup *mem, gfp_t gfp_mask)
 {
+	unsigned long limit;
 	unsigned long points = 0;
 	struct task_struct *p;
 
 	check_panic_on_oom(CONSTRAINT_MEMCG, gfp_mask, 0);
+	limit = mem_cgroup_get_limit(mem) >> PAGE_SHIFT;
 	read_lock(&tasklist_lock);
 retry:
-	p = select_bad_process(&points, mem, CONSTRAINT_MEMCG, NULL);
+	p = select_bad_process(&points, limit, mem, CONSTRAINT_MEMCG, NULL);
 	if (!p || PTR_ERR(p) == -1UL)
 		goto out;
 
-	if (oom_kill_process(p, gfp_mask, 0, points, mem,
+	if (oom_kill_process(p, gfp_mask, 0, points, limit, mem,
 				"Memory cgroup out of memory"))
 		goto retry;
 out:
@@ -643,22 +611,22 @@ static void clear_system_oom(void)
 /*
  * Must be called with tasklist_lock held for read.
  */
-static void __out_of_memory(gfp_t gfp_mask, int order,
+static void __out_of_memory(gfp_t gfp_mask, int order, unsigned long totalpages,
 			enum oom_constraint constraint, const nodemask_t *mask)
 {
 	struct task_struct *p;
-	unsigned long points;
+	unsigned int points;
 
 	if (sysctl_oom_kill_allocating_task)
-		if (!oom_kill_process(current, gfp_mask, order, 0, NULL,
-				"Out of memory (oom_kill_allocating_task)"))
+		if (!oom_kill_process(current, gfp_mask, order, 0, totalpages,
+			NULL, "Out of memory (oom_kill_allocating_task)"))
 			return;
 retry:
 	/*
 	 * Rambo mode: Shoot down a process and hope it solves whatever
 	 * issues we may have.
 	 */
-	p = select_bad_process(&points, NULL, constraint, mask);
+	p = select_bad_process(&points, totalpages, NULL, constraint, mask);
 
 	if (PTR_ERR(p) == -1UL)
 		return;
@@ -670,7 +638,7 @@ retry:
 		panic("Out of memory and no killable processes...\n");
 	}
 
-	if (oom_kill_process(p, gfp_mask, order, points, NULL,
+	if (oom_kill_process(p, gfp_mask, order, points, totalpages, NULL,
 			     "Out of memory"))
 		goto retry;
 }
@@ -690,6 +658,7 @@ retry:
 void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
 		int order, nodemask_t *nodemask)
 {
+	unsigned long totalpages;
 	unsigned long freed = 0;
 	enum oom_constraint constraint = CONSTRAINT_NONE;
 
@@ -702,11 +671,11 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
 	 * Check if there were limitations on the allocation (only relevant for
 	 * NUMA) that may require different handling.
 	 */
-	if (zonelist)
-		constraint = constrained_alloc(zonelist, gfp_mask, nodemask);
+	constraint = constrained_alloc(zonelist, gfp_mask, nodemask,
+						&totalpages);
 	check_panic_on_oom(constraint, gfp_mask, order);
 	read_lock(&tasklist_lock);
-	__out_of_memory(gfp_mask, order, constraint, nodemask);
+	__out_of_memory(gfp_mask, order, totalpages, constraint, nodemask);
 	read_unlock(&tasklist_lock);
 
 	/*


* [patch -mm 09/18] oom: add forkbomb penalty to badness heuristic
  2010-06-01  7:18 [patch -mm 00/18] oom killer rewrite David Rientjes
                   ` (7 preceding siblings ...)
  2010-06-01  7:18 ` [patch -mm 08/18] oom: badness heuristic rewrite David Rientjes
@ 2010-06-01  7:18 ` David Rientjes
  2010-06-01  7:37   ` KOSAKI Motohiro
                     ` (2 more replies)
  2010-06-01  7:18 ` [patch -mm 10/18] oom: deprecate oom_adj tunable David Rientjes
                   ` (8 subsequent siblings)
  17 siblings, 3 replies; 99+ messages in thread
From: David Rientjes @ 2010-06-01  7:18 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Rik van Riel, Nick Piggin, Oleg Nesterov, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro, Balbir Singh, linux-mm

Add a forkbomb penalty for processes that fork an excessively large
number of children to penalize that group of tasks and not others.  A
threshold is configurable from userspace to determine how many first-
generation execve children (those with their own address spaces) a task
may have before it is considered a forkbomb.  This can be tuned by
altering the value in /proc/sys/vm/oom_forkbomb_thres, which defaults to
1000.

When a task has more than 1000 first-generation children with different
address spaces than itself, a penalty of

	(average rss of children) * (# of 1st generation execve children)
	-----------------------------------------------------------------
			oom_forkbomb_thres

is assessed.  So, for example, using the default oom_forkbomb_thres of
1000, the penalty is twice the average rss of all its execve children if
there are 2000 such tasks.  A child counts toward the threshold only if
its total runtime is less than one second; for 1000 such tasks to exist,
the parent process must be forking at an extremely high rate, either
erroneously or maliciously.
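
The penalty arithmetic can be modeled in userspace as a rough sketch.  The
struct child_info type and forkbomb_penalty() are hypothetical stand-ins
for illustration, not kernel interfaces, and runtime is given in
milliseconds rather than jiffies:

```c
#include <stddef.h>

/* Hypothetical userspace stand-in for a child task; not a kernel structure. */
struct child_info {
	unsigned long rss_pages;	/* resident set size, in pages */
	unsigned long runtime_ms;	/* utime + stime, in milliseconds */
	int shares_parent_mm;		/* nonzero if it shares the parent's mm */
};

/*
 * Children with their own address space that have run for less than one
 * second are counted.  If more than `thres` are counted, the penalty is
 * their summed rss divided by `thres`, which equals
 * (average rss) * (count) / thres.  A threshold of 0 disables detection.
 */
unsigned long forkbomb_penalty(const struct child_info *children, size_t n,
			       unsigned long thres)
{
	unsigned long child_rss = 0;
	unsigned long forkcount = 0;
	size_t i;

	if (!thres)
		return 0;
	for (i = 0; i < n; i++) {
		if (children[i].shares_parent_mm)
			continue;
		if (children[i].runtime_ms < 1000) {
			child_rss += children[i].rss_pages;
			forkcount++;
		}
	}
	return forkcount > thres ? child_rss / thres : 0;
}
```

With a threshold of 2 and three short-lived children of 50 pages each, the
model yields 150 / 2 = 75 pages; with exactly 2 such children it yields 0
because the count must exceed the threshold.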

Even though a particular task may be designated a forkbomb and selected as
the victim, the oom killer will still kill the 1st generation execve child
with the highest badness() score in its place.  This avoids killing
important servers or system daemons.  When a web server forks a very large
number of threads for client connections, for example, it is much better
to kill one of those threads than to kill the server and make it
unresponsive.
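
A minimal sketch of that child-first selection, loosely following the
loop in oom_kill_process().  The struct task_model type, its mm_id field
and pick_victim() are hypothetical names used only for illustration:

```c
#include <stddef.h>

/* Hypothetical stand-in for a task; `mm_id` identifies its address space. */
struct task_model {
	int pid;
	int mm_id;
	unsigned int points;	/* badness score; 0 means unkillable */
};

/*
 * Return the task to kill: the first-generation child with the highest
 * nonzero score among those whose address space differs from the parent's,
 * or the parent itself if no such child exists.
 */
const struct task_model *pick_victim(const struct task_model *parent,
				     const struct task_model *children,
				     size_t n)
{
	const struct task_model *victim = parent;
	unsigned int victim_points = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		if (children[i].mm_id == parent->mm_id)
			continue;	/* shares the parent's address space */
		if (children[i].points > victim_points) {
			victim = &children[i];
			victim_points = children[i].points;
		}
	}
	return victim;
}
```

Note that the parent's own score plays no part in the comparison: any
killable child with a separate address space is preferred over the parent.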

[oleg@redhat.com: optimize task_lock when iterating children]
Signed-off-by: David Rientjes <rientjes@google.com>
---
 Documentation/filesystems/proc.txt |    7 +++-
 Documentation/sysctl/vm.txt        |   21 ++++++++++++
 include/linux/oom.h                |    3 ++
 kernel/sysctl.c                    |    8 +++++
 mm/oom_kill.c                      |   60 ++++++++++++++++++++++++++++++++++++
 5 files changed, 97 insertions(+), 2 deletions(-)

diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -1248,8 +1248,11 @@ may allocate from based on an estimation of its current memory and swap use.
 For example, if a task is using all allowed memory, its badness score will be
 1000.  If it is using half of its allowed memory, its score will be 500.
 
-There is an additional factor included in the badness score: root
-processes are given 3% extra memory over other tasks.
+There are a couple of additional factors included in the badness score: root
+processes are given 3% extra memory over other tasks, and tasks which forkbomb
+an excessive number of child processes are penalized by their average size.
+The number of child processes considered to be a forkbomb is configurable
+via /proc/sys/vm/oom_forkbomb_thres (see Documentation/sysctl/vm.txt).
 
 The amount of "allowed" memory depends on the context in which the oom killer
 was called.  If it is due to the memory assigned to the allocating task's cpuset
diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
--- a/Documentation/sysctl/vm.txt
+++ b/Documentation/sysctl/vm.txt
@@ -46,6 +46,7 @@ Currently, these files are in /proc/sys/vm:
 - nr_trim_pages         (only if CONFIG_MMU=n)
 - numa_zonelist_order
 - oom_dump_tasks
+- oom_forkbomb_thres
 - oom_kill_allocating_task
 - overcommit_memory
 - overcommit_ratio
@@ -515,6 +516,26 @@ The default value is 1 (enabled).
 
 ==============================================================
 
+oom_forkbomb_thres
+
+This value defines how many children with a separate address space a specific
+task may have before being considered as a possible forkbomb.  Tasks with more
+children not sharing the same address space as the parent will be penalized by a
+quantity of memory equaling
+
+	(average rss of execve children) * (# of 1st generation execve children)
+	------------------------------------------------------------------------
+				oom_forkbomb_thres
+
+in the oom killer's badness heuristic.  Such tasks may be protected with a lower
+oom_adj value (see Documentation/filesystems/proc.txt) if necessary.
+
+A value of 0 will disable forkbomb detection.
+
+The default value is 1000.
+
+==============================================================
+
 oom_kill_allocating_task
 
 This enables or disables killing the OOM-triggering task in
diff --git a/include/linux/oom.h b/include/linux/oom.h
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -16,6 +16,9 @@
 #define OOM_SCORE_ADJ_MIN	(-1000)
 #define OOM_SCORE_ADJ_MAX	1000
 
+/* See Documentation/sysctl/vm.txt */
+#define DEFAULT_OOM_FORKBOMB_THRES	1000
+
 #ifdef __KERNEL__
 
 #include <linux/sched.h>
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1001,6 +1001,14 @@ static struct ctl_table vm_table[] = {
 		.proc_handler	= proc_dointvec,
 	},
 	{
+		.procname	= "oom_forkbomb_thres",
+		.data		= &sysctl_oom_forkbomb_thres,
+		.maxlen		= sizeof(sysctl_oom_forkbomb_thres),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= &zero,
+	},
+	{
 		.procname	= "overcommit_ratio",
 		.data		= &sysctl_overcommit_ratio,
 		.maxlen		= sizeof(sysctl_overcommit_ratio),
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -35,6 +35,7 @@
 int sysctl_panic_on_oom;
 int sysctl_oom_kill_allocating_task;
 int sysctl_oom_dump_tasks = 1;
+int sysctl_oom_forkbomb_thres = DEFAULT_OOM_FORKBOMB_THRES;
 static DEFINE_SPINLOCK(zone_scan_lock);
 
 /*
@@ -94,6 +95,64 @@ static void check_panic_on_oom(enum oom_constraint constraint, gfp_t gfp_mask,
 	return false;
 }
 
+/*
+ * Tasks that fork a very large number of children with separate address spaces
+ * may be the result of a bug, user error, malicious applications, or even those
+ * with a very legitimate purpose such as a webserver.  The oom killer assesses
+ * a penalty equaling
+ *
+ *	(average rss of children) * (# of 1st generation execve children)
+ *	-----------------------------------------------------------------
+ *			sysctl_oom_forkbomb_thres
+ *
+ * for such tasks to target the parent.  oom_kill_process() will attempt to
+ * first kill a child, so there's no risk of killing an important system daemon
+ * via this method.  A web server, for example, may fork a very large number of
+ * threads to respond to client connections; it's much better to kill a child
+ * than to kill the parent, making the server unresponsive.  The goal here is
+ * to give the user a chance to recover from the error rather than deplete all
+ * memory such that the system is unusable, it's not meant to effect a forkbomb
+ * policy.
+ */
+static unsigned long oom_forkbomb_penalty(struct task_struct *tsk)
+{
+	struct task_struct *child;
+	unsigned long child_rss = 0;
+	int forkcount = 0;
+
+	if (!sysctl_oom_forkbomb_thres)
+		return 0;
+	list_for_each_entry(child, &tsk->children, sibling) {
+		struct task_cputime task_time;
+		unsigned long runtime;
+		unsigned long rss;
+
+		task_lock(child);
+		if (!child->mm || child->mm == tsk->mm) {
+			task_unlock(child);
+			continue;
+		}
+		rss = get_mm_rss(child->mm);
+		task_unlock(child);
+
+		thread_group_cputime(child, &task_time);
+		runtime = cputime_to_jiffies(task_time.utime) +
+			  cputime_to_jiffies(task_time.stime);
+		/*
+		 * Only threads that have run for less than a second are
+		 * considered toward the forkbomb penalty, these threads rarely
+		 * get to execute at all in such cases anyway.
+		 */
+		if (runtime < HZ) {
+			child_rss += rss;
+			forkcount++;
+		}
+	}
+
+	return forkcount > sysctl_oom_forkbomb_thres ?
+				(child_rss / sysctl_oom_forkbomb_thres) : 0;
+}
+
 /**
  * oom_badness - heuristic function to determine which candidate task to kill
  * @p: task struct of which task we should calculate
@@ -143,6 +202,7 @@ unsigned int oom_badness(struct task_struct *p, unsigned long totalpages)
 	points = (get_mm_rss(mm) + get_mm_counter(mm, MM_SWAPENTS)) * 1000 /
 			totalpages;
 	task_unlock(p);
+	points += oom_forkbomb_penalty(p);
 
 	/*
 	 * Root processes get 3% bonus, just like the __vm_enough_memory() used


* [patch -mm 10/18] oom: deprecate oom_adj tunable
  2010-06-01  7:18 [patch -mm 00/18] oom killer rewrite David Rientjes
                   ` (8 preceding siblings ...)
  2010-06-01  7:18 ` [patch -mm 09/18] oom: add forkbomb penalty to badness heuristic David Rientjes
@ 2010-06-01  7:18 ` David Rientjes
  2010-06-01  7:37   ` KOSAKI Motohiro
  2010-06-01  7:18 ` [patch -mm 11/18] oom: avoid oom killer for lowmem allocations David Rientjes
                   ` (7 subsequent siblings)
  17 siblings, 1 reply; 99+ messages in thread
From: David Rientjes @ 2010-06-01  7:18 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Rik van Riel, Nick Piggin, Oleg Nesterov, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro, Balbir Singh, linux-mm

/proc/pid/oom_adj is now deprecated so that it may eventually be
removed.  The target date for removal is May 2012.

A warning will be printed to the kernel log if a task attempts to use this
interface.  Future warnings will be suppressed until the kernel is rebooted
to prevent spamming the kernel log.
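
For callers moving off the deprecated file, the following sketch shows how
a legacy oom_adj value might map onto the oom_score_adj scale.  It assumes
linear scaling with the maximum pinned so that it stays attainable; the
scaling code itself is not part of this patch, OOM_ADJUST_MAX is an
assumed value, and oom_adj_to_score_adj() is a hypothetical helper:

```c
/* Values from include/linux/oom.h as quoted in this series. */
#define OOM_DISABLE		(-17)
#define OOM_SCORE_ADJ_MIN	(-1000)
#define OOM_SCORE_ADJ_MAX	1000
/* Assumed upper bound of the legacy oom_adj range; not shown in this patch. */
#define OOM_ADJUST_MAX		15

/*
 * Hypothetical helper: map a legacy oom_adj value onto the oom_score_adj
 * scale, assuming linear scaling with the maximum pinned to
 * OOM_SCORE_ADJ_MAX.
 */
int oom_adj_to_score_adj(int oom_adj)
{
	if (oom_adj == OOM_ADJUST_MAX)
		return OOM_SCORE_ADJ_MAX;
	return (oom_adj * OOM_SCORE_ADJ_MAX) / -OOM_DISABLE;
}
```

Under these assumptions OOM_DISABLE maps to OOM_SCORE_ADJ_MIN, so a task
that was fully protected through the old interface stays protected.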

Signed-off-by: David Rientjes <rientjes@google.com>
---
 Documentation/feature-removal-schedule.txt |   25 +++++++++++++++++++++++++
 Documentation/filesystems/proc.txt         |    3 +++
 fs/proc/base.c                             |    8 ++++++++
 include/linux/oom.h                        |    3 +++
 4 files changed, 39 insertions(+), 0 deletions(-)

diff --git a/Documentation/feature-removal-schedule.txt b/Documentation/feature-removal-schedule.txt
--- a/Documentation/feature-removal-schedule.txt
+++ b/Documentation/feature-removal-schedule.txt
@@ -174,6 +174,31 @@ Who:	Eric Biederman <ebiederm@xmission.com>
 
 ---------------------------
 
+What:	/proc/<pid>/oom_adj
+When:	May 2012
+Why:	/proc/<pid>/oom_adj allows userspace to influence the oom killer's
+	badness heuristic used to determine which task to kill when the kernel
+	is out of memory.
+
+	The badness heuristic has been rewritten since the introduction of
+	this tunable such that its meaning is deprecated.  The value was
+	implemented as a bitshift on a score generated by the badness()
+	function that did not have any precise units of measure.  With the
+	rewrite, the score is given as a proportion of available memory to the
+	task allocating pages, so a bitshift, which grows the score
+	exponentially, is impossible to tune with fine granularity.
+
+	A much more powerful interface, /proc/<pid>/oom_score_adj, was
+	introduced with the oom killer rewrite that allows users to increase or
+	decrease the badness() score linearly.  This interface will replace
+	/proc/<pid>/oom_adj.
+
+	A warning will be emitted to the kernel log if an application uses this
+	deprecated interface.  After it is printed once, future warnings will be
+	suppressed until the kernel is rebooted.
+
+---------------------------
+
 What:	remove EXPORT_SYMBOL(kernel_thread)
 When:	August 2006
 Files:	arch/*/kernel/*_ksyms.c
diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -1288,6 +1288,9 @@ scaled linearly with /proc/<pid>/oom_score_adj.
 Writing to /proc/<pid>/oom_score_adj or /proc/<pid>/oom_adj will change the
 other with its scaled value.
 
+NOTICE: /proc/<pid>/oom_adj is deprecated and will be removed, please see
+Documentation/feature-removal-schedule.txt.
+
 Caveat: when a parent task is selected, the oom killer will sacrifice any first
 generation children with seperate address spaces instead, if possible.  This
 avoids servers and important system daemons from being killed and loses the
diff --git a/fs/proc/base.c b/fs/proc/base.c
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -1044,6 +1044,14 @@ static ssize_t oom_adjust_write(struct file *file, const char __user *buf,
 		return -EACCES;
 	}
 
+	/*
+	 * Warn that /proc/pid/oom_adj is deprecated, see
+	 * Documentation/feature-removal-schedule.txt.
+	 */
+	printk_once(KERN_WARNING "%s (%d): /proc/%d/oom_adj is deprecated, "
+			"please use /proc/%d/oom_score_adj instead.\n",
+			current->comm, task_pid_nr(current),
+			task_pid_nr(task), task_pid_nr(task));
 	task->signal->oom_adj = oom_adjust;
 	/*
 	 * Scale /proc/pid/oom_score_adj appropriately ensuring that a maximum
diff --git a/include/linux/oom.h b/include/linux/oom.h
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -2,6 +2,9 @@
 #define __INCLUDE_LINUX_OOM_H
 
 /*
+ * /proc/<pid>/oom_adj is deprecated, see
+ * Documentation/feature-removal-schedule.txt.
+ *
  * /proc/<pid>/oom_adj set to -17 protects from the oom-killer
  */
 #define OOM_DISABLE (-17)


* [patch -mm 11/18] oom: avoid oom killer for lowmem allocations
  2010-06-01  7:18 [patch -mm 00/18] oom killer rewrite David Rientjes
                   ` (9 preceding siblings ...)
  2010-06-01  7:18 ` [patch -mm 10/18] oom: deprecate oom_adj tunable David Rientjes
@ 2010-06-01  7:18 ` David Rientjes
  2010-06-01  7:38   ` KOSAKI Motohiro
  2010-06-08 11:41   ` KOSAKI Motohiro
  2010-06-01  7:18 ` [patch -mm 12/18] oom: remove unnecessary code and cleanup David Rientjes
                   ` (6 subsequent siblings)
  17 siblings, 2 replies; 99+ messages in thread
From: David Rientjes @ 2010-06-01  7:18 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Rik van Riel, Nick Piggin, Oleg Nesterov, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro, Balbir Singh, linux-mm

If memory has been depleted in lowmem zones even with the protection
afforded to it by /proc/sys/vm/lowmem_reserve_ratio, it is unlikely that
killing current users will help.  The memory is either reclaimable (or
migratable) already, in which case we should not invoke the oom killer at
all, or it is pinned by an application for I/O.  Killing such an
application may leave the hardware in an unspecified state and there is no
guarantee that it will be able to make a timely exit.

Lowmem allocations are now failed in oom conditions when __GFP_NOFAIL is
not used so that the task can perhaps recover or try again later.

Previously, the heuristic provided some protection for those tasks with
CAP_SYS_RAWIO, but this is no longer necessary since we will not be
killing tasks for the purposes of ISA allocations.

high_zoneidx is gfp_zone(gfp_flags), meaning that ZONE_NORMAL will be the
default for all allocations that are not __GFP_DMA, __GFP_DMA32,
__GFP_HIGHMEM, and __GFP_MOVABLE on kernels configured to support those
flags.  Testing for high_zoneidx being less than ZONE_NORMAL will only
return true for allocations that have either __GFP_DMA or __GFP_DMA32.
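
As a simplified sketch of the resulting decision, assuming a kernel with
all of these zones configured.  The enum zone_model type and
oom_killer_allowed() are hypothetical, `nofail` stands in for
(gfp_mask & __GFP_NOFAIL), and other checks in the allocator path are
omitted:

```c
#include <stdbool.h>

/* Zone indices in ascending order, assuming every zone is configured. */
enum zone_model {
	ZONE_DMA,
	ZONE_DMA32,
	ZONE_NORMAL,
	ZONE_HIGHMEM,
	ZONE_MOVABLE
};

#define PAGE_ALLOC_COSTLY_ORDER 3

/*
 * Hypothetical predicate: may the oom killer be invoked for this
 * allocation?  Only the order and lowmem tests are modeled here.
 */
bool oom_killer_allowed(unsigned int order, enum zone_model high_zoneidx,
			bool nofail)
{
	if (nofail)
		return true;
	if (order > PAGE_ALLOC_COSTLY_ORDER)
		return false;	/* the oom killer will not help high orders */
	if (high_zoneidx < ZONE_NORMAL)
		return false;	/* do not kill tasks for lowmem allocations */
	return true;
}
```

In this model a __GFP_DMA or __GFP_DMA32 allocation without __GFP_NOFAIL
simply fails, leaving the caller to recover or retry later.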

Signed-off-by: David Rientjes <rientjes@google.com>
---
 mm/page_alloc.c |   29 ++++++++++++++++++++---------
 1 files changed, 20 insertions(+), 9 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1759,6 +1759,9 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
 		/* The OOM killer will not help higher order allocs */
 		if (order > PAGE_ALLOC_COSTLY_ORDER)
 			goto out;
+		/* The OOM killer does not needlessly kill tasks for lowmem */
+		if (high_zoneidx < ZONE_NORMAL)
+			goto out;
 		/*
 		 * GFP_THISNODE contains __GFP_NORETRY and we never hit this.
 		 * Sanity check for bare calls of __GFP_THISNODE, not real OOM.
@@ -2052,15 +2055,23 @@ rebalance:
 			if (page)
 				goto got_pg;
 
-			/*
-			 * The OOM killer does not trigger for high-order
-			 * ~__GFP_NOFAIL allocations so if no progress is being
-			 * made, there are no other options and retrying is
-			 * unlikely to help.
-			 */
-			if (order > PAGE_ALLOC_COSTLY_ORDER &&
-						!(gfp_mask & __GFP_NOFAIL))
-				goto nopage;
+			if (!(gfp_mask & __GFP_NOFAIL)) {
+				/*
+				 * The oom killer is not called for high-order
+				 * allocations that may fail, so if no progress
+				 * is being made, there are no other options and
+				 * retrying is unlikely to help.
+				 */
+				if (order > PAGE_ALLOC_COSTLY_ORDER)
+					goto nopage;
+				/*
+				 * The oom killer is not called for lowmem
+				 * allocations to prevent needlessly killing
+				 * innocent tasks.
+				 */
+				if (high_zoneidx < ZONE_NORMAL)
+					goto nopage;
+			}
 
 			goto restart;
 		}


* [patch -mm 12/18] oom: remove unnecessary code and cleanup
  2010-06-01  7:18 [patch -mm 00/18] oom killer rewrite David Rientjes
                   ` (10 preceding siblings ...)
  2010-06-01  7:18 ` [patch -mm 11/18] oom: avoid oom killer for lowmem allocations David Rientjes
@ 2010-06-01  7:18 ` David Rientjes
  2010-06-01  7:40   ` KOSAKI Motohiro
  2010-06-01  7:19 ` [patch -mm 13/18] oom: avoid race for oom killed tasks detaching mm prior to exit David Rientjes
                   ` (5 subsequent siblings)
  17 siblings, 1 reply; 99+ messages in thread
From: David Rientjes @ 2010-06-01  7:18 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Rik van Riel, Nick Piggin, Oleg Nesterov, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro, Balbir Singh, linux-mm

Remove the redundancy in __oom_kill_task() since:

 - init can never be passed to this function: it will never be PF_EXITING
   or selectable from select_bad_process(), and

 - it will never be passed a task from oom_kill_task() without an ->mm
   and we're unconcerned about detachment from exiting tasks, there's no
   reason to protect them against SIGKILL or access to memory reserves.

Also moves the kernel log message to a higher level since the verbosity is
not always emitted here; we need not print an error message if an exiting
task is given a longer timeslice.

Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: David Rientjes <rientjes@google.com>
---
 mm/oom_kill.c |   64 ++++++++++++++------------------------------------------
 1 files changed, 16 insertions(+), 48 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -439,67 +439,35 @@ static void dump_header(struct task_struct *p, gfp_t gfp_mask, int order,
 		dump_tasks(mem);
 }
 
-#define K(x) ((x) << (PAGE_SHIFT-10))
-
 /*
- * Send SIGKILL to the selected  process irrespective of  CAP_SYS_RAW_IO
- * flag though it's unlikely that  we select a process with CAP_SYS_RAW_IO
- * set.
+ * Give the oom killed task high priority and access to memory reserves so that
+ * it may quickly exit and free its memory.
  */
-static void __oom_kill_task(struct task_struct *p, int verbose)
+static void __oom_kill_task(struct task_struct *p)
 {
-	if (is_global_init(p)) {
-		WARN_ON(1);
-		printk(KERN_WARNING "tried to kill init!\n");
-		return;
-	}
-
-	task_lock(p);
-	if (!p->mm) {
-		WARN_ON(1);
-		printk(KERN_WARNING "tried to kill an mm-less task %d (%s)!\n",
-			task_pid_nr(p), p->comm);
-		task_unlock(p);
-		return;
-	}
-
-	if (verbose)
-		printk(KERN_ERR "Killed process %d (%s) "
-		       "vsz:%lukB, anon-rss:%lukB, file-rss:%lukB\n",
-		       task_pid_nr(p), p->comm,
-		       K(p->mm->total_vm),
-		       K(get_mm_counter(p->mm, MM_ANONPAGES)),
-		       K(get_mm_counter(p->mm, MM_FILEPAGES)));
-	task_unlock(p);
-
-	/*
-	 * We give our sacrificial lamb high priority and access to
-	 * all the memory it needs. That way it should be able to
-	 * exit() and clear out its resources quickly...
-	 */
 	p->rt.time_slice = HZ;
 	set_tsk_thread_flag(p, TIF_MEMDIE);
-
 	force_sig(SIGKILL, p);
 }
 
+#define K(x) ((x) << (PAGE_SHIFT-10))
 static int oom_kill_task(struct task_struct *p)
 {
-	/* WARNING: mm may not be dereferenced since we did not obtain its
-	 * value from get_task_mm(p).  This is OK since all we need to do is
-	 * compare mm to q->mm below.
-	 *
-	 * Furthermore, even if mm contains a non-NULL value, p->mm may
-	 * change to NULL at any time since we do not hold task_lock(p).
-	 * However, this is of no concern to us.
-	 */
-	if (!p->mm || p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN)
+	task_lock(p);
+	if (!p->mm || p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN) {
+		task_unlock(p);
 		return 1;
+	}
+	pr_err("Killed process %d (%s) total-vm:%lukB, anon-rss:%lukB, file-rss:%lukB\n",
+		task_pid_nr(p), p->comm, K(p->mm->total_vm),
+		K(get_mm_counter(p->mm, MM_ANONPAGES)),
+		K(get_mm_counter(p->mm, MM_FILEPAGES)));
+	task_unlock(p);
 
-	__oom_kill_task(p, 1);
-
+	__oom_kill_task(p);
 	return 0;
 }
+#undef K
 
 static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
 			    unsigned int points, unsigned long totalpages,
@@ -517,7 +485,7 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
 	 * its children or threads, just set TIF_MEMDIE so it can die quickly
 	 */
 	if (p->flags & PF_EXITING) {
-		__oom_kill_task(p, 0);
+		__oom_kill_task(p);
 		return 0;
 	}
 


* [patch -mm 13/18] oom: avoid race for oom killed tasks detaching mm prior to exit
  2010-06-01  7:18 [patch -mm 00/18] oom killer rewrite David Rientjes
                   ` (11 preceding siblings ...)
  2010-06-01  7:18 ` [patch -mm 12/18] oom: remove unnecessary code and cleanup David Rientjes
@ 2010-06-01  7:19 ` David Rientjes
  2010-06-01  7:40   ` KOSAKI Motohiro
  2010-06-01  7:19 ` [patch -mm 14/18] oom: check PF_KTHREAD instead of !mm to skip kthreads David Rientjes
                   ` (4 subsequent siblings)
  17 siblings, 1 reply; 99+ messages in thread
From: David Rientjes @ 2010-06-01  7:19 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Rik van Riel, Nick Piggin, Oleg Nesterov, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro, Balbir Singh, linux-mm

Tasks detach their ->mm prior to exiting, so it's possible that in-progress
oom kills or already exiting tasks may be missed during the oom killer's
tasklist scan.  When an eligible task is found with either TIF_MEMDIE or
PF_EXITING set, the oom killer is supposed to be a no-op to avoid
needlessly killing additional tasks.  This closes the race between a task
detaching its ->mm and being removed from the tasklist.

Out of memory conditions as the result of memory controllers will
automatically filter tasks that have detached their ->mm (since
task_in_mem_cgroup() will return 0).  This is acceptable, however, since
memcg constrained ooms aren't the result of a lack of memory resources but
rather a limit imposed by userspace that requires a task be killed
regardless.
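
The effect of the reordering can be sketched with a small model of the
tasklist scan.  The struct scan_task type and scan_should_abort() are
hypothetical, and the position of the TIF_MEMDIE test ahead of the
PF_EXITING test is an assumption about code not shown in this diff:

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the per-task state the scan looks at. */
struct scan_task {
	bool has_mm;		/* false once the task has detached its ->mm */
	bool memdie;		/* stands in for TIF_MEMDIE */
	bool exiting;		/* stands in for PF_EXITING */
	bool is_current;
};

/*
 * Return true if the scan should abort without selecting a victim because
 * a task is already being killed or is exiting.  The !has_mm skip comes
 * after the memdie test, so a killed task that has detached its mm but is
 * still on the tasklist continues to suppress further kills.
 */
bool scan_should_abort(const struct scan_task *tasks, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++) {
		if (tasks[i].memdie)
			return true;
		if (tasks[i].exiting && tasks[i].has_mm && !tasks[i].is_current)
			return true;
		if (!tasks[i].has_mm)
			continue;
		/* scoring of eligible tasks would happen here */
	}
	return false;
}
```

Had the !has_mm skip come first, the first case in the model (a killed
task with no mm) would be passed over and another victim could be chosen.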

[oleg@redhat.com: fix PF_EXITING check for !p->mm tasks]
Acked-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: David Rientjes <rientjes@google.com>
---
 mm/oom_kill.c |   14 +++++++-------
 1 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -317,12 +317,6 @@ static struct task_struct *select_bad_process(unsigned int *ppoints,
 	for_each_process(p) {
 		unsigned int points;
 
-		/*
-		 * skip kernel threads and tasks which have already released
-		 * their mm.
-		 */
-		if (!p->mm)
-			continue;
 		/* skip the init task */
 		if (is_global_init(p))
 			continue;
@@ -355,7 +349,7 @@ static struct task_struct *select_bad_process(unsigned int *ppoints,
 		 * the process of exiting and releasing its resources.
 		 * Otherwise we could get an easy OOM deadlock.
 		 */
-		if (p->flags & PF_EXITING) {
+		if (p->flags & PF_EXITING && p->mm) {
 			if (p != current)
 				return ERR_PTR(-1UL);
 
@@ -363,6 +357,12 @@ static struct task_struct *select_bad_process(unsigned int *ppoints,
 			*ppoints = 1000;
 		}
 
+		/*
+		 * skip kernel threads and tasks which have already released
+		 * their mm.
+		 */
+		if (!p->mm)
+			continue;
 		if (p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN)
 			continue;
 


* [patch -mm 14/18] oom: check PF_KTHREAD instead of !mm to skip kthreads
  2010-06-01  7:18 [patch -mm 00/18] oom killer rewrite David Rientjes
                   ` (12 preceding siblings ...)
  2010-06-01  7:19 ` [patch -mm 13/18] oom: avoid race for oom killed tasks detaching mm prior to exit David Rientjes
@ 2010-06-01  7:19 ` David Rientjes
  2010-06-01  7:41   ` KOSAKI Motohiro
  2010-06-01  7:19 ` [patch -mm 15/18] oom: introduce find_lock_task_mm() to fix !mm false positives David Rientjes
                   ` (3 subsequent siblings)
  17 siblings, 1 reply; 99+ messages in thread
From: David Rientjes @ 2010-06-01  7:19 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Rik van Riel, Nick Piggin, Oleg Nesterov, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro, Balbir Singh, linux-mm

From: Oleg Nesterov <oleg@redhat.com>

select_bad_process() assumes that a kernel thread can't have ->mm != NULL,
but this is not true due to use_mm().

Change the code to check PF_KTHREAD.
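
A small illustration of why the two tests differ.  The struct kt_task type
and both helpers are hypothetical, and the PF_KTHREAD value is reproduced
here only for the sake of a self-contained example:

```c
#include <stdbool.h>
#include <stddef.h>

/* Flag value assumed for illustration; see include/linux/sched.h. */
#define PF_KTHREAD 0x00200000

/* Hypothetical stand-in for the two task fields involved. */
struct kt_task {
	unsigned int flags;
	const void *mm;		/* non-NULL while an address space is attached */
};

/* Old test: treats any task without an mm as a kernel thread. */
bool skip_by_mm(const struct kt_task *t)
{
	return t->mm == NULL;
}

/* New test: a kernel thread is identified by its flag, not by its mm. */
bool skip_by_flag(const struct kt_task *t)
{
	return (t->flags & PF_KTHREAD) != 0;
}
```

A kernel thread that has temporarily borrowed an address space has a
non-NULL mm, so only the flag test reliably excludes it.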

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: David Rientjes <rientjes@google.com>
---
 mm/oom_kill.c |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -317,8 +317,8 @@ static struct task_struct *select_bad_process(unsigned int *ppoints,
 	for_each_process(p) {
 		unsigned int points;
 
-		/* skip the init task */
-		if (is_global_init(p))
+		/* skip the init task and kthreads */
+		if (is_global_init(p) || (p->flags & PF_KTHREAD))
 			continue;
 		if (mem && !task_in_mem_cgroup(p, mem))
 			continue;


* [patch -mm 15/18] oom: introduce find_lock_task_mm() to fix !mm false positives
  2010-06-01  7:18 [patch -mm 00/18] oom killer rewrite David Rientjes
                   ` (13 preceding siblings ...)
  2010-06-01  7:19 ` [patch -mm 14/18] oom: check PF_KTHREAD instead of !mm to skip kthreads David Rientjes
@ 2010-06-01  7:19 ` David Rientjes
  2010-06-01  7:41   ` KOSAKI Motohiro
  2010-06-01  7:19 ` [patch -mm 16/18] oom: give current access to memory reserves if it has been killed David Rientjes
                   ` (2 subsequent siblings)
  17 siblings, 1 reply; 99+ messages in thread
From: David Rientjes @ 2010-06-01  7:19 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Rik van Riel, Nick Piggin, Oleg Nesterov, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro, Balbir Singh, linux-mm

From: Oleg Nesterov <oleg@redhat.com>

Almost all ->mm == NULL checks in oom_kill.c are wrong.

The current code assumes that a task without ->mm has already
released its memory and ignores the process. However, this is not
necessarily true when the process is multithreaded: other live
sub-threads can still use this ->mm.

- Remove the "if (!p->mm)" check in select_bad_process(), it is
  just wrong.

- Add the new helper, find_lock_task_mm(), which finds the live
  thread which uses the memory and takes task_lock() to pin ->mm

- change oom_badness() to use this helper instead of just checking
  ->mm != NULL.

- As David pointed out, select_bad_process() must never choose the
  task without ->mm, but no matter what oom_badness() returns the
  task can be chosen if nothing else has been found yet.

  Change oom_badness() to return int, change it to return -1 if
  find_lock_task_mm() fails, and change select_bad_process() to
  check points >= 0.

Note! This patch is not enough, we need more changes.

	- oom_badness() was fixed, but oom_kill_task() still ignores
	  the task without ->mm

	- oom_forkbomb_penalty() should use find_lock_task_mm() too,
	  and it also needs other changes to actually find the
	  first-descendant children

This will be addressed later.
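
The intended behaviour described above (a group leader without ->mm is
still a candidate if any live sub-thread holds the mm, and a score of -1
is never selected) can be sketched in user space as follows.  This is
only a model under stated assumptions: every name prefixed with model_ is
hypothetical, a thread group is flattened into an array, and no locking
is shown, whereas the real helper takes task_lock() on the thread it
returns.

```c
#include <stddef.h>

/* Hypothetical user-space model; none of these names are kernel APIs. */
struct model_mm { unsigned long rss_pages; };

struct model_thread { struct model_mm *mm; };

struct model_group {
	struct model_thread *threads;
	size_t nr_threads;
};

/* Return the first thread that still has an mm, or NULL if none does. */
static struct model_thread *model_find_task_mm(const struct model_group *g)
{
	for (size_t i = 0; i < g->nr_threads; i++)
		if (g->threads[i].mm)
			return &g->threads[i];
	return NULL;
}

/*
 * -1 means "no live mm anywhere in the group, never select this".
 * Refusing to score when totalpages is 0 is a choice of this sketch.
 */
static int model_badness(const struct model_group *g, unsigned long totalpages)
{
	const struct model_thread *t = model_find_task_mm(g);

	if (!t || totalpages == 0)
		return -1;
	return (int)(t->mm->rss_pages * 1000 / totalpages);
}

/* Pick the group with the highest non-negative score, or NULL. */
static const struct model_group *model_select(const struct model_group *groups,
					      size_t n, unsigned long totalpages)
{
	const struct model_group *chosen = NULL;
	int best = -1;

	for (size_t i = 0; i < n; i++) {
		int points = model_badness(&groups[i], totalpages);

		if (points >= 0 && (points > best || !chosen)) {
			chosen = &groups[i];
			best = points;
		}
	}
	return chosen;
}
```

With one group whose threads have all dropped their mm and another whose
leader has exited but whose sub-thread still holds it, model_select()
returns the second group; with only the first group it returns NULL.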

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: David Rientjes <rientjes@google.com>
---
 mm/oom_kill.c |   37 +++++++++++++++++++------------------
 1 files changed, 19 insertions(+), 18 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -95,6 +95,20 @@ static void check_panic_on_oom(enum oom_constraint constraint, gfp_t gfp_mask,
 	return false;
 }
 
+static struct task_struct *find_lock_task_mm(struct task_struct *p)
+{
+	struct task_struct *t = p;
+
+	do {
+		task_lock(t);
+		if (likely(t->mm))
+			return t;
+		task_unlock(t);
+	} while_each_thread(p, t);
+
+	return NULL;
+}
+
 /*
  * Tasks that fork a very large number of children with seperate address spaces
  * may be the result of a bug, user error, malicious applications, or even those
@@ -164,7 +178,6 @@ static unsigned long oom_forkbomb_penalty(struct task_struct *tsk)
  */
 unsigned int oom_badness(struct task_struct *p, unsigned long totalpages)
 {
-	struct mm_struct *mm;
 	int points;
 
 	/*
@@ -181,12 +194,9 @@ unsigned int oom_badness(struct task_struct *p, unsigned long totalpages)
 	if (p->flags & PF_OOM_ORIGIN)
 		return 1000;
 
-	task_lock(p);
-	mm = p->mm;
-	if (!mm) {
-		task_unlock(p);
+	p = find_lock_task_mm(p);
+	if (!p)
 		return 0;
-	}
 
 	/*
 	 * The memory controller may have a limit of 0 bytes, so avoid a divide
@@ -199,8 +209,8 @@ unsigned int oom_badness(struct task_struct *p, unsigned long totalpages)
 	 * The baseline for the badness score is the proportion of RAM that each
 	 * task's rss and swap space use.
 	 */
-	points = (get_mm_rss(mm) + get_mm_counter(mm, MM_SWAPENTS)) * 1000 /
-			totalpages;
+	points = (get_mm_rss(p->mm) + get_mm_counter(p->mm, MM_SWAPENTS)) *
+			1000 / totalpages;
 	task_unlock(p);
 	points += oom_forkbomb_penalty(p);
 
@@ -357,17 +367,8 @@ static struct task_struct *select_bad_process(unsigned int *ppoints,
 			*ppoints = 1000;
 		}
 
-		/*
-		 * skip kernel threads and tasks which have already released
-		 * their mm.
-		 */
-		if (!p->mm)
-			continue;
-		if (p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN)
-			continue;
-
 		points = oom_badness(p, totalpages);
-		if (points > *ppoints || !chosen) {
+		if (points > *ppoints) {
 			chosen = p;
 			*ppoints = points;
 		}


* [patch -mm 16/18] oom: give current access to memory reserves if it has been killed
  2010-06-01  7:18 [patch -mm 00/18] oom killer rewrite David Rientjes
                   ` (14 preceding siblings ...)
  2010-06-01  7:19 ` [patch -mm 15/18] oom: introduce find_lock_task_mm() to fix !mm false positives David Rientjes
@ 2010-06-01  7:19 ` David Rientjes
  2010-06-01  7:44   ` KOSAKI Motohiro
  2010-06-01  7:19 ` [patch -mm 17/18] oom: avoid sending exiting tasks a SIGKILL David Rientjes
  2010-06-01  7:19 ` [patch -mm 18/18] oom: clean up oom_kill_task() David Rientjes
  17 siblings, 1 reply; 99+ messages in thread
From: David Rientjes @ 2010-06-01  7:19 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Rik van Riel, Nick Piggin, Oleg Nesterov, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro, Balbir Singh, linux-mm

It's possible to livelock the page allocator if a thread has mm->mmap_sem
and fails to make forward progress because the oom killer selects another
thread sharing the same ->mm to kill that cannot exit until the semaphore
is dropped.

The oom killer will not kill multiple tasks at the same time; each oom
killed task must exit before another task may be killed.  Thus, if one
thread is holding mm->mmap_sem and cannot allocate memory, all threads
sharing the same ->mm are blocked from exiting as well.  In the oom kill
case, that means the thread holding mm->mmap_sem will never free
additional memory since it cannot get access to memory reserves and the
thread that depends on it with access to memory reserves cannot exit
because it cannot acquire the semaphore.  Thus, the page allocator
livelocks.

When the oom killer is called and current happens to have a pending
SIGKILL, this patch automatically gives it access to memory reserves and
returns.  Upon returning to the page allocator, its allocation will
hopefully succeed so it can quickly exit and free its memory.  If not, the
page allocator will fail the allocation if it is not __GFP_NOFAIL.

Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: David Rientjes <rientjes@google.com>
---
 mm/oom_kill.c |   10 ++++++++++
 1 files changed, 10 insertions(+), 0 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -697,6 +697,16 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
 		return;
 
 	/*
+	 * If current has a pending SIGKILL, then automatically select it.  The
+	 * goal is to allow it to allocate so that it may quickly exit and free
+	 * its memory.
+	 */
+	if (fatal_signal_pending(current)) {
+		set_tsk_thread_flag(current, TIF_MEMDIE);
+		return;
+	}
+
+	/*
 	 * Check if there were limitations on the allocation (only relevant for
 	 * NUMA) that may require different handling.
 	 */


* [patch -mm 17/18] oom: avoid sending exiting tasks a SIGKILL
  2010-06-01  7:18 [patch -mm 00/18] oom killer rewrite David Rientjes
                   ` (15 preceding siblings ...)
  2010-06-01  7:19 ` [patch -mm 16/18] oom: give current access to memory reserves if it has been killed David Rientjes
@ 2010-06-01  7:19 ` David Rientjes
  2010-06-01  7:19 ` [patch -mm 18/18] oom: clean up oom_kill_task() David Rientjes
  17 siblings, 0 replies; 99+ messages in thread
From: David Rientjes @ 2010-06-01  7:19 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Rik van Riel, Nick Piggin, Oleg Nesterov, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro, Balbir Singh, linux-mm

It's unnecessary to SIGKILL a task that is already PF_EXITING, and doing
so can actually cause a NULL pointer dereference of the sighand if it has
already been detached.  Instead, simply set TIF_MEMDIE so it has access to
memory reserves and can quickly exit as the comment implies.

Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: David Rientjes <rientjes@google.com>
---
 mm/oom_kill.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -486,7 +486,7 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
 	 * its children or threads, just set TIF_MEMDIE so it can die quickly
 	 */
 	if (p->flags & PF_EXITING) {
-		__oom_kill_task(p);
+		set_tsk_thread_flag(p, TIF_MEMDIE);
 		return 0;
 	}
 


* [patch -mm 18/18] oom: clean up oom_kill_task()
  2010-06-01  7:18 [patch -mm 00/18] oom killer rewrite David Rientjes
                   ` (16 preceding siblings ...)
  2010-06-01  7:19 ` [patch -mm 17/18] oom: avoid sending exiting tasks a SIGKILL David Rientjes
@ 2010-06-01  7:19 ` David Rientjes
  17 siblings, 0 replies; 99+ messages in thread
From: David Rientjes @ 2010-06-01  7:19 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Rik van Riel, Nick Piggin, Oleg Nesterov, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro, Balbir Singh, linux-mm

__oom_kill_task() only has a single caller, so merge it into that
function.

Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: David Rientjes <rientjes@google.com>
---
 mm/oom_kill.c |   15 +++------------
 1 files changed, 3 insertions(+), 12 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -440,17 +440,6 @@ static void dump_header(struct task_struct *p, gfp_t gfp_mask, int order,
 		dump_tasks(mem);
 }
 
-/*
- * Give the oom killed task high priority and access to memory reserves so that
- * it may quickly exit and free its memory.
- */
-static void __oom_kill_task(struct task_struct *p)
-{
-	p->rt.time_slice = HZ;
-	set_tsk_thread_flag(p, TIF_MEMDIE);
-	force_sig(SIGKILL, p);
-}
-
 #define K(x) ((x) << (PAGE_SHIFT-10))
 static int oom_kill_task(struct task_struct *p)
 {
@@ -465,7 +454,9 @@ static int oom_kill_task(struct task_struct *p)
 		K(get_mm_counter(p->mm, MM_FILEPAGES)));
 	task_unlock(p);
 
-	__oom_kill_task(p);
+	p->rt.time_slice = HZ;
+	set_tsk_thread_flag(p, TIF_MEMDIE);
+	force_sig(SIGKILL, p);
 	return 0;
 }
 #undef K


* Re: [patch -mm 01/18] oom: filter tasks not sharing the same cpuset
  2010-06-01  7:18 ` [patch -mm 01/18] oom: filter tasks not sharing the same cpuset David Rientjes
@ 2010-06-01  7:20   ` KOSAKI Motohiro
  2010-06-08 11:41   ` KOSAKI Motohiro
  2010-06-08 11:41   ` KOSAKI Motohiro
  2 siblings, 0 replies; 99+ messages in thread
From: KOSAKI Motohiro @ 2010-06-01  7:20 UTC (permalink / raw)
  To: David Rientjes
  Cc: kosaki.motohiro, Andrew Morton, Rik van Riel, Nick Piggin,
	Oleg Nesterov, KAMEZAWA Hiroyuki, Balbir Singh, linux-mm

> Tasks that do not share the same set of allowed nodes with the task that
> triggered the oom should not be considered as candidates for oom kill.
> 
> Tasks in other cpusets with a disjoint set of mems would be unfairly
> penalized otherwise because of oom conditions elsewhere; an extreme
> example could unfairly kill all other applications on the system if a
> single task in a user's cpuset sets itself to OOM_DISABLE and then uses
> more memory than allowed.
> 
> Killing tasks outside of current's cpuset rarely would free memory for
> current anyway.  To use a sane heuristic, we must ensure that killing a
> task would likely free memory for current and avoid needlessly killing
> others at all costs just because their potential memory freeing is
> unknown.  It is better to kill current than another task needlessly.
> 
> Acked-by: Rik van Riel <riel@redhat.com>
> Acked-by: Nick Piggin <npiggin@suse.de>
> Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> Signed-off-by: David Rientjes <rientjes@google.com>

ack



* Re: [patch -mm 04/18] oom: extract panic helper function
  2010-06-01  7:18 ` [patch -mm 04/18] oom: extract panic helper function David Rientjes
@ 2010-06-01  7:33   ` KOSAKI Motohiro
  0 siblings, 0 replies; 99+ messages in thread
From: KOSAKI Motohiro @ 2010-06-01  7:33 UTC (permalink / raw)
  To: David Rientjes
  Cc: kosaki.motohiro, Andrew Morton, Rik van Riel, Nick Piggin,
	Oleg Nesterov, KAMEZAWA Hiroyuki, Balbir Singh, linux-mm

> There are various points in the oom killer where the kernel must
> determine whether to panic or not.  It's better to extract this to a
> helper function to remove all the confusion as to its semantics.
> 
> There's no functional change with this patch.
> 
> Signed-off-by: David Rientjes <rientjes@google.com>

ack




* Re: [patch -mm 05/18] oom: remove special handling for pagefault ooms
  2010-06-01  7:18 ` [patch -mm 05/18] oom: remove special handling for pagefault ooms David Rientjes
@ 2010-06-01  7:34   ` KOSAKI Motohiro
  0 siblings, 0 replies; 99+ messages in thread
From: KOSAKI Motohiro @ 2010-06-01  7:34 UTC (permalink / raw)
  To: David Rientjes
  Cc: kosaki.motohiro, Andrew Morton, Rik van Riel, Nick Piggin,
	Oleg Nesterov, KAMEZAWA Hiroyuki, Balbir Singh, linux-mm

> It is possible to remove the special pagefault oom handler by simply oom
> locking all system zones and then calling directly into out_of_memory().
> 
> All populated zones must have ZONE_OOM_LOCKED set, otherwise there is a
> parallel oom killing in progress that will lead to eventual memory freeing
> so it's not necessary to needlessly kill another task.  The context in
> which the pagefault is allocating memory is unknown to the oom killer, so
> this is done on a system-wide level.
> 
> If a task has already been oom killed and hasn't fully exited yet, this
> will be a no-op since select_bad_process() recognizes tasks across the
> system with TIF_MEMDIE set.
> 
> Acked-by: Nick Piggin <npiggin@suse.de>
> Signed-off-by: David Rientjes <rientjes@google.com>

ack



* Re: [patch -mm 06/18] oom: move sysctl declarations to oom.h
  2010-06-01  7:18 ` [patch -mm 06/18] oom: move sysctl declarations to oom.h David Rientjes
@ 2010-06-01  7:34   ` KOSAKI Motohiro
  0 siblings, 0 replies; 99+ messages in thread
From: KOSAKI Motohiro @ 2010-06-01  7:34 UTC (permalink / raw)
  To: David Rientjes
  Cc: kosaki.motohiro, Andrew Morton, Rik van Riel, Nick Piggin,
	Oleg Nesterov, KAMEZAWA Hiroyuki, Balbir Singh, linux-mm

> The three oom killer sysctl variables (sysctl_oom_dump_tasks,
> sysctl_oom_kill_allocating_task, and sysctl_panic_on_oom) are better
> declared in include/linux/oom.h rather than kernel/sysctl.c.
> 
> Signed-off-by: David Rientjes <rientjes@google.com>

ack



* Re: [patch -mm 07/18] oom: enable oom tasklist dump by default
  2010-06-01  7:18 ` [patch -mm 07/18] oom: enable oom tasklist dump by default David Rientjes
@ 2010-06-01  7:36   ` KOSAKI Motohiro
  0 siblings, 0 replies; 99+ messages in thread
From: KOSAKI Motohiro @ 2010-06-01  7:36 UTC (permalink / raw)
  To: David Rientjes
  Cc: kosaki.motohiro, Andrew Morton, Rik van Riel, Nick Piggin,
	Oleg Nesterov, KAMEZAWA Hiroyuki, Balbir Singh, linux-mm

> The oom killer tasklist dump, enabled with the oom_dump_tasks sysctl, is
> very helpful information in diagnosing why a user's task has been killed.
> It emits useful information such as each eligible thread's memory usage
> that can determine why the system is oom, so it should be enabled by
> default.
> 
> Signed-off-by: David Rientjes <rientjes@google.com>

ack




* Re: [patch -mm 08/18] oom: badness heuristic rewrite
  2010-06-01  7:18 ` [patch -mm 08/18] oom: badness heuristic rewrite David Rientjes
@ 2010-06-01  7:36   ` KOSAKI Motohiro
  2010-06-01 18:44     ` David Rientjes
  2010-06-01  7:46   ` Nick Piggin
  1 sibling, 1 reply; 99+ messages in thread
From: KOSAKI Motohiro @ 2010-06-01  7:36 UTC (permalink / raw)
  To: David Rientjes
  Cc: kosaki.motohiro, Andrew Morton, Rik van Riel, Nick Piggin,
	Oleg Nesterov, KAMEZAWA Hiroyuki, Balbir Singh, linux-mm

> This is a complete rewrite of the oom killer's badness() heuristic which is
> used to determine which task to kill in oom conditions.  The goal is to
> make it as simple and predictable as possible so the results are better
> understood and we end up killing the task which will lead to the most
> memory freeing while still respecting the fine-tuning from userspace.
> 
> The baseline for the heuristic is a proportion of memory that each task is
> currently using in memory plus swap compared to the amount of "allowable"
> memory.  "Allowable," in this sense, means the system-wide resources for
> unconstrained oom conditions, the set of mempolicy nodes, the mems
> attached to current's cpuset, or a memory controller's limit.  The
> proportion is given on a scale of 0 (never kill) to 1000 (always kill),
> roughly meaning that a task with a badness() score of 500 consumes
> approximately 50% of allowable memory resident in RAM or in swap
> space.
> 
> The proportion is always relative to the amount of "allowable" memory and
> not the total amount of RAM systemwide so that mempolicies and cpusets may
> operate in isolation; they shall not need to know the true size of the
> machine on which they are running if they are bound to a specific set of
> nodes or mems, respectively.
> 
> Root tasks are given 3% extra memory just like __vm_enough_memory()
> provides in LSMs.  In the event of two tasks consuming similar amounts of
> memory, it is generally better to save root's task.
> 
> Because of the change in the badness() heuristic's baseline, it is also
> necessary to introduce a new user interface to tune it.  It's not possible
> to redefine the meaning of /proc/pid/oom_adj with a new scale since the
> ABI cannot be changed for backward compatibility.  Instead, a new tunable,
> /proc/pid/oom_score_adj, is added that ranges from -1000 to +1000.  It may
> be used to polarize the heuristic such that certain tasks are never
> considered for oom kill while others may always be considered.  The value
> is added directly into the badness() score so a value of -500, for
> example, means to discount 50% of its memory consumption in comparison to
> other tasks either on the system, bound to the mempolicy, in the cpuset,
> or sharing the same memory controller.
> 
> /proc/pid/oom_adj is changed so that its meaning is rescaled into the
> units used by /proc/pid/oom_score_adj, and vice versa.  Changing one of
> these per-task tunables will rescale the value of the other to an
> equivalent meaning.  Although /proc/pid/oom_adj was originally defined as
> a bitshift on the badness score, it now shares the same linear growth as
> /proc/pid/oom_score_adj but with different granularity.  This is required
> so the ABI is not broken with userspace applications and allows oom_adj to
> be deprecated for future removal.
> 
> Signed-off-by: David Rientjes <rientjes@google.com>
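
The arithmetic in the quoted changelog can be worked through with a
small user-space sketch.  It is not the patch's code: the function name
and parameters are hypothetical, reading "3% extra memory" as 30 points
on the 0..1000 scale is an interpretation of the text, and clamping the
result to 0..1000 is an assumption of the sketch.

```c
#include <stdbool.h>

/*
 * Hypothetical model of the score described above:
 * (rss + swap) as a proportion of allowable memory on a 0..1000 scale,
 * minus a 3% bonus for root, plus the user-supplied oom_score_adj.
 */
static int model_badness_score(unsigned long rss_pages,
			       unsigned long swap_pages,
			       unsigned long allowable_pages,
			       bool is_root, int oom_score_adj)
{
	long points;

	if (allowable_pages == 0)
		allowable_pages = 1;	/* avoid dividing by zero in the sketch */

	points = (long)((rss_pages + swap_pages) * 1000 / allowable_pages);
	if (is_root)
		points -= 30;		/* 3% of the 0..1000 scale */
	points += oom_score_adj;	/* added directly, range -1000..+1000 */

	/* Clamp to the documented scale (assumption of this sketch). */
	if (points < 0)
		points = 0;
	if (points > 1000)
		points = 1000;
	return (int)points;
}
```

Under these assumptions, a task with 400 resident pages and 100 swapped
pages out of 1000 allowable pages scores 500; the same task owned by root
scores 470, and with an oom_score_adj of -500 it scores 0.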

nack



* Re: [patch -mm 09/18] oom: add forkbomb penalty to badness heuristic
  2010-06-01  7:18 ` [patch -mm 09/18] oom: add forkbomb penalty to badness heuristic David Rientjes
@ 2010-06-01  7:37   ` KOSAKI Motohiro
  2010-06-01 18:57     ` David Rientjes
  2010-06-08 11:41   ` KOSAKI Motohiro
  2010-06-08 11:41   ` KOSAKI Motohiro
  2 siblings, 1 reply; 99+ messages in thread
From: KOSAKI Motohiro @ 2010-06-01  7:37 UTC (permalink / raw)
  To: David Rientjes
  Cc: kosaki.motohiro, Andrew Morton, Rik van Riel, Nick Piggin,
	Oleg Nesterov, KAMEZAWA Hiroyuki, Balbir Singh, linux-mm

> Add a forkbomb penalty for processes that fork an excessively large
> number of children to penalize that group of tasks and not others.  A
> threshold is configurable from userspace to determine how many first-
> generation execve children (those with their own address spaces) a task
> may have before it is considered a forkbomb.  This can be tuned by
> altering the value in /proc/sys/vm/oom_forkbomb_thres, which defaults to
> 1000.
> 
> When a task has more than 1000 first-generation children with different
> address spaces than itself, a penalty of
> 
> 	(average rss of children) * (# of 1st generation execve children)
> 	-----------------------------------------------------------------
> 			oom_forkbomb_thres
> 
> is assessed.  So, for example, using the default oom_forkbomb_thres of
> 1000, the penalty is twice the average rss of all its execve children if
> there are 2000 such tasks.  A task is considered to count toward the
> threshold if its total runtime is less than one second; for 1000 of such
> tasks to exist, the parent process must be forking at an extremely high
> rate either erroneously or maliciously.
> 
> Even though a particular task may be designated a forkbomb and selected as
> the victim, the oom killer will still kill the 1st generation execve child
> with the highest badness() score in its place.  This avoids killing
> important servers or system daemons.  When a web server forks a very large
> number of threads for client connections, for example, it is much better
> to kill one of those threads than to kill the server and make it
> unresponsive.
> 
> [oleg@redhat.com: optimize task_lock when iterating children]
> Signed-off-by: David Rientjes <rientjes@google.com>
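
The penalty formula quoted above can be checked with a short user-space
sketch.  The function and parameter names are hypothetical, the result is
in the same unit as the rss input, and returning 0 for a threshold of 0
is simply this sketch's way of avoiding a division by zero.

```c
/*
 * Hypothetical model of the quoted formula:
 *   penalty = (average rss of children) * (number of children) / threshold
 * applied only when the number of qualifying children exceeds the threshold.
 */
static unsigned long model_forkbomb_penalty(unsigned long avg_child_rss,
					    unsigned long nr_children,
					    unsigned long thres)
{
	if (thres == 0 || nr_children <= thres)
		return 0;
	return avg_child_rss * nr_children / thres;
}
```

With the default threshold of 1000 and an average child rss of 100
pages, 2000 qualifying children give a penalty of 200 pages (twice the
average), 1500 give 150, and exactly 1000 give no penalty.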

nack



* Re: [patch -mm 10/18] oom: deprecate oom_adj tunable
  2010-06-01  7:18 ` [patch -mm 10/18] oom: deprecate oom_adj tunable David Rientjes
@ 2010-06-01  7:37   ` KOSAKI Motohiro
  0 siblings, 0 replies; 99+ messages in thread
From: KOSAKI Motohiro @ 2010-06-01  7:37 UTC (permalink / raw)
  To: David Rientjes
  Cc: kosaki.motohiro, Andrew Morton, Rik van Riel, Nick Piggin,
	Oleg Nesterov, KAMEZAWA Hiroyuki, Balbir Singh, linux-mm

> /proc/pid/oom_adj is now deprecated so that it may eventually be
> removed.  The target date for removal is May 2012.
> 
> A warning will be printed to the kernel log if a task attempts to use this
> interface.  Future warnings will be suppressed until the kernel is rebooted
> to prevent spamming the kernel log.
> 
> Signed-off-by: David Rientjes <rientjes@google.com>
> ---
>  Documentation/feature-removal-schedule.txt |   25 +++++++++++++++++++++++++
>  Documentation/filesystems/proc.txt         |    3 +++
>  fs/proc/base.c                             |    8 ++++++++
>  include/linux/oom.h                        |    3 +++
>  4 files changed, 39 insertions(+), 0 deletions(-)

nack




* Re: [patch -mm 11/18] oom: avoid oom killer for lowmem allocations
  2010-06-01  7:18 ` [patch -mm 11/18] oom: avoid oom killer for lowmem allocations David Rientjes
@ 2010-06-01  7:38   ` KOSAKI Motohiro
  2010-06-08 11:41   ` KOSAKI Motohiro
  1 sibling, 0 replies; 99+ messages in thread
From: KOSAKI Motohiro @ 2010-06-01  7:38 UTC (permalink / raw)
  To: David Rientjes
  Cc: kosaki.motohiro, Andrew Morton, Rik van Riel, Nick Piggin,
	Oleg Nesterov, KAMEZAWA Hiroyuki, Balbir Singh, linux-mm

> If memory has been depleted in lowmem zones even with the protection
> afforded to it by /proc/sys/vm/lowmem_reserve_ratio, it is unlikely that
> killing current users will help.  The memory is either reclaimable (or
> migratable) already, in which case we should not invoke the oom killer at
> all, or it is pinned by an application for I/O.  Killing such an
> application may leave the hardware in an unspecified state and there is no
> guarantee that it will be able to make a timely exit.
> 
> Lowmem allocations are now failed in oom conditions when __GFP_NOFAIL is
> not used so that the task can perhaps recover or try again later.
> 
> Previously, the heuristic provided some protection for those tasks with
> CAP_SYS_RAWIO, but this is no longer necessary since we will not be
> killing tasks for the purposes of ISA allocations.
> 
> high_zoneidx is gfp_zone(gfp_flags), meaning that ZONE_NORMAL will be the
> default for all allocations that are not __GFP_DMA, __GFP_DMA32,
> __GFP_HIGHMEM, and __GFP_MOVABLE on kernels configured to support those
> flags.  Testing for high_zoneidx being less than ZONE_NORMAL will only
> return true for allocations that have either __GFP_DMA or __GFP_DMA32.
> 
> Signed-off-by: David Rientjes <rientjes@google.com>
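
The zone comparison in the last quoted paragraph relies on the ordering
of the zone indices.  A minimal sketch, assuming a kernel with all of the
listed zones configured; the enum and function names are hypothetical
stand-ins that mirror the usual ZONE_* ordering:

```c
#include <stdbool.h>

/* Stand-in for the kernel's zone ordering, lowest zone first. */
enum model_zone {
	MODEL_ZONE_DMA,
	MODEL_ZONE_DMA32,
	MODEL_ZONE_NORMAL,
	MODEL_ZONE_HIGHMEM,
	MODEL_ZONE_MOVABLE,
};

/* True only for requests restricted to zones below ZONE_NORMAL. */
static bool model_is_lowmem_request(enum model_zone high_zoneidx)
{
	return high_zoneidx < MODEL_ZONE_NORMAL;
}
```

In this model only the DMA and DMA32 requests count as lowmem, which
matches the statement that the test is true only for __GFP_DMA or
__GFP_DMA32 allocations.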

ack



* Re: [patch -mm 02/18] oom: sacrifice child with highest badness score for parent
  2010-06-01  7:18 ` [patch -mm 02/18] oom: sacrifice child with highest badness score for parent David Rientjes
@ 2010-06-01  7:39   ` KOSAKI Motohiro
  2010-06-08 11:41   ` KOSAKI Motohiro
  2010-06-08 11:41   ` KOSAKI Motohiro
  2 siblings, 0 replies; 99+ messages in thread
From: KOSAKI Motohiro @ 2010-06-01  7:39 UTC (permalink / raw)
  To: David Rientjes
  Cc: kosaki.motohiro, Andrew Morton, Rik van Riel, Nick Piggin,
	Oleg Nesterov, KAMEZAWA Hiroyuki, Balbir Singh, linux-mm

> When a task is chosen for oom kill, the oom killer first attempts to
> sacrifice a child not sharing its parent's memory instead.  Unfortunately,
> this often kills in a seemingly random fashion based on the ordering of
> the selected task's child list.  Additionally, it is not guaranteed at all
> to free a large amount of memory that we need to prevent additional oom
> killing in the very near future.
> 
> Instead, we now only attempt to sacrifice the worst child not sharing its
> parent's memory, if one exists.  The worst child is indicated with the
> highest badness() score.  This serves two advantages: we kill a
> memory-hogging task more often, and we allow the configurable
> /proc/pid/oom_adj value to be considered as a factor in which child to
> kill.
> 
> Reviewers may observe that the previous implementation would iterate
> through the children and attempt to kill each until one was successful and
> then the parent if none were found while the new code simply kills the
> most memory-hogging task or the parent.  Note that the only time
> oom_kill_task() fails, however, is when a child does not have an mm or has
> a /proc/pid/oom_adj of OOM_DISABLE.  badness() returns 0 for both cases,
> so the final oom_kill_task() will always succeed.
> 
> Acked-by: Rik van Riel <riel@redhat.com>
> Acked-by: Nick Piggin <npiggin@suse.de>
> Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> Signed-off-by: David Rientjes <rientjes@google.com>

ack



* Re: [patch -mm 03/18] oom: select task from tasklist for mempolicy ooms
  2010-06-01  7:18 ` [patch -mm 03/18] oom: select task from tasklist for mempolicy ooms David Rientjes
@ 2010-06-01  7:39   ` KOSAKI Motohiro
  2010-06-08 11:41   ` KOSAKI Motohiro
  2010-06-08 11:41   ` KOSAKI Motohiro
  2 siblings, 0 replies; 99+ messages in thread
From: KOSAKI Motohiro @ 2010-06-01  7:39 UTC (permalink / raw)
  To: David Rientjes
  Cc: kosaki.motohiro, Andrew Morton, Rik van Riel, Nick Piggin,
	Oleg Nesterov, KAMEZAWA Hiroyuki, Balbir Singh, linux-mm

> The oom killer presently kills current whenever there is no more memory
> free or reclaimable on its mempolicy's nodes.  There is no guarantee that
> current is a memory-hogging task or that killing it will free any
> substantial amount of memory, however.
> 
> In such situations, it is better to scan the tasklist for tasks that are
> allowed to allocate on current's set of nodes and kill the task with the
> highest badness() score.  This ensures that the most memory-hogging task,
> or the one configured by the user with /proc/pid/oom_adj, is always
> selected in such scenarios.
> 
> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> Signed-off-by: David Rientjes <rientjes@google.com>

ack
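
The selection described in the quoted changelog can be sketched in
userspace as follows.  This is only a model under stated assumptions:
struct model_task, its bitmask of allowed nodes, and
select_mempolicy_victim() are hypothetical names for illustration and
are not the kernel's data structures or functions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of a task; not the kernel's task_struct. */
struct model_task {
	unsigned long allowed_nodes;	/* bit n set: may allocate on node n */
	long badness;			/* precomputed badness() score */
};

/*
 * Return the index of the task with the highest badness among those
 * that may allocate on at least one of the nodes that are out of
 * memory, or -1 if no task is eligible.  A score of 0 is never chosen.
 */
static int select_mempolicy_victim(const struct model_task *tasks, size_t n,
				   unsigned long oom_nodes)
{
	int victim = -1;
	long best = 0;

	assert(tasks != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		/* Skip tasks that cannot allocate on any of the oom nodes. */
		if (!(tasks[i].allowed_nodes & oom_nodes))
			continue;
		if (tasks[i].badness > best) {
			best = tasks[i].badness;
			victim = (int)i;
		}
	}
	return victim;
}
```

With three tasks allowed on nodes {0}, {1} and {1,2} and an oom on
node 1, only the last two are candidates, regardless of how large the
first task's score is.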




^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [patch -mm 12/18] oom: remove unnecessary code and cleanup
  2010-06-01  7:18 ` [patch -mm 12/18] oom: remove unnecessary code and cleanup David Rientjes
@ 2010-06-01  7:40   ` KOSAKI Motohiro
  2010-06-01 18:58     ` David Rientjes
  0 siblings, 1 reply; 99+ messages in thread
From: KOSAKI Motohiro @ 2010-06-01  7:40 UTC (permalink / raw)
  To: David Rientjes
  Cc: kosaki.motohiro, Andrew Morton, Rik van Riel, Nick Piggin,
	Oleg Nesterov, KAMEZAWA Hiroyuki, Balbir Singh, linux-mm

> Remove the redundancy in __oom_kill_task() since:
> 
>  - init can never be passed to this function: it will never be PF_EXITING
>    or selectable from select_bad_process(), and
> 
>  - it will never be passed a task from oom_kill_task() without an ->mm
>    and we're unconcerned about detachment from exiting tasks, there's no
>    reason to protect them against SIGKILL or access to memory reserves.
> 
> Also moves the kernel log message to a higher level since the verbosity is
> not always emitted here; we need not print an error message if an exiting
> task is given a longer timeslice.
> 
> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> Signed-off-by: David Rientjes <rientjes@google.com>

need respin.





^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [patch -mm 13/18] oom: avoid race for oom killed tasks detaching mm prior to exit
  2010-06-01  7:19 ` [patch -mm 13/18] oom: avoid race for oom killed tasks detaching mm prior to exit David Rientjes
@ 2010-06-01  7:40   ` KOSAKI Motohiro
  2010-06-01 18:59     ` David Rientjes
  0 siblings, 1 reply; 99+ messages in thread
From: KOSAKI Motohiro @ 2010-06-01  7:40 UTC (permalink / raw)
  To: David Rientjes
  Cc: kosaki.motohiro, Andrew Morton, Rik van Riel, Nick Piggin,
	Oleg Nesterov, KAMEZAWA Hiroyuki, Balbir Singh, linux-mm

> A task detaches its ->mm prior to exiting, so it's possible that in-progress
> oom kills or already exiting tasks may be missed during the oom killer's
> tasklist scan.  When an eligible task is found with either TIF_MEMDIE or
> PF_EXITING set, the oom killer is supposed to be a no-op to avoid
> needlessly killing additional tasks.  This closes the race between a task
> detaching its ->mm and being removed from the tasklist.
> 
> Out of memory conditions as the result of memory controllers will
> automatically filter tasks that have detached their ->mm (since
> task_in_mem_cgroup() will return 0).  This is acceptable, however, since
> memcg constrained ooms aren't the result of a lack of memory resources but
> rather a limit imposed by userspace that requires a task be killed
> regardless.
> 
> [oleg@redhat.com: fix PF_EXITING check for !p->mm tasks]
> Acked-by: Nick Piggin <npiggin@suse.de>
> Signed-off-by: David Rientjes <rientjes@google.com>

need respin.



^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [patch -mm 14/18] oom: check PF_KTHREAD instead of !mm to skip kthreads
  2010-06-01  7:19 ` [patch -mm 14/18] oom: check PF_KTHREAD instead of !mm to skip kthreads David Rientjes
@ 2010-06-01  7:41   ` KOSAKI Motohiro
  0 siblings, 0 replies; 99+ messages in thread
From: KOSAKI Motohiro @ 2010-06-01  7:41 UTC (permalink / raw)
  To: David Rientjes
  Cc: kosaki.motohiro, Andrew Morton, Rik van Riel, Nick Piggin,
	Oleg Nesterov, KAMEZAWA Hiroyuki, Balbir Singh, linux-mm

> From: Oleg Nesterov <oleg@redhat.com>
> 
> select_bad_process() thinks a kernel thread can't have ->mm != NULL, this
> is not true due to use_mm().
> 
> Change the code to check PF_KTHREAD.
> 
> Signed-off-by: Oleg Nesterov <oleg@redhat.com>
> Signed-off-by: David Rientjes <rientjes@google.com>
> ---
>  mm/oom_kill.c |    4 ++--
>  1 files changed, 2 insertions(+), 2 deletions(-)

need respin.




^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [patch -mm 15/18] oom: introduce find_lock_task_mm() to fix !mm false positives
  2010-06-01  7:19 ` [patch -mm 15/18] oom: introduce find_lock_task_mm() to fix !mm false positives David Rientjes
@ 2010-06-01  7:41   ` KOSAKI Motohiro
  0 siblings, 0 replies; 99+ messages in thread
From: KOSAKI Motohiro @ 2010-06-01  7:41 UTC (permalink / raw)
  To: David Rientjes
  Cc: kosaki.motohiro, Andrew Morton, Rik van Riel, Nick Piggin,
	Oleg Nesterov, KAMEZAWA Hiroyuki, Balbir Singh, linux-mm

> From: Oleg Nesterov <oleg@redhat.com>
> 
> Almost all ->mm == NULL checks in oom_kill.c are wrong.
> 
> The current code assumes that the task without ->mm has already
> released its memory and ignores the process. However this is not
> necessarily true when this process is multithreaded, other live
> sub-threads can use this ->mm.
> 
> - Remove the "if (!p->mm)" check in select_bad_process(), it is
>   just wrong.
> 
> - Add the new helper, find_lock_task_mm(), which finds the live
>   thread which uses the memory and takes task_lock() to pin ->mm
> 
> - change oom_badness() to use this helper instead of just checking
>   ->mm != NULL.
> 
> - As David pointed out, select_bad_process() must never choose the
>   task without ->mm, but no matter what oom_badness() returns the
>   task can be chosen if nothing else has been found yet.
> 
>   Change oom_badness() to return int, change it to return -1 if
>   find_lock_task_mm() fails, and change select_bad_process() to
>   check points >= 0.
> 
> Note! This patch is not enough, we need more changes.
> 
> 	- oom_badness() was fixed, but oom_kill_task() still ignores
> 	  the task without ->mm
> 
> 	- oom_forkbomb_penalty() should use find_lock_task_mm() too,
> 	  and it also needs other changes to actually find the first
> 	  first-descendant children
> 
> This will be addressed later.
> 
> Signed-off-by: Oleg Nesterov <oleg@redhat.com>
> Signed-off-by: David Rientjes <rientjes@google.com>

need respin.
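
For readers following the quoted changelog, the idea behind the helper
can be modelled in userspace roughly as below.  This is a sketch under
assumptions, not the patch itself: the structs and function names are
hypothetical, no locking is modelled (the real helper takes
task_lock() to pin ->mm), and the score is simply the rss of the mm
that was found.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for kernel structures. */
struct model_mm { unsigned long rss_pages; };
struct model_thread { struct model_mm *mm; };	/* NULL once detached */
struct model_process {
	struct model_thread *threads;
	size_t nr_threads;
};

/* Find any live thread of the process that still has an mm attached. */
static struct model_thread *model_find_task_mm(struct model_process *p)
{
	assert(p != NULL);
	for (size_t i = 0; i < p->nr_threads; i++)
		if (p->threads[i].mm)
			return &p->threads[i];
	return NULL;
}

/* Return -1 when no thread uses the memory, otherwise a score >= 0. */
static long model_badness(struct model_process *p)
{
	struct model_thread *t = model_find_task_mm(p);

	if (!t)
		return -1;
	return (long)t->mm->rss_pages;
}

/* Only processes with points >= 0 may be chosen as the victim. */
static struct model_process *model_select(struct model_process *procs,
					   size_t n)
{
	struct model_process *chosen = NULL;
	long best = -1;

	for (size_t i = 0; i < n; i++) {
		long points = model_badness(&procs[i]);

		if (points < 0)
			continue;
		if (points > best) {
			best = points;
			chosen = &procs[i];
		}
	}
	return chosen;
}
```

A process whose group leader has already detached its mm is still
scored as long as one sub-thread uses the memory, while a process with
no mm at all can no longer be picked just because nothing else has
been found yet.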




^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [patch -mm 16/18] oom: give current access to memory reserves if it has been killed
  2010-06-01  7:19 ` [patch -mm 16/18] oom: give current access to memory reserves if it has been killed David Rientjes
@ 2010-06-01  7:44   ` KOSAKI Motohiro
  0 siblings, 0 replies; 99+ messages in thread
From: KOSAKI Motohiro @ 2010-06-01  7:44 UTC (permalink / raw)
  To: David Rientjes
  Cc: kosaki.motohiro, Andrew Morton, Rik van Riel, Nick Piggin,
	Oleg Nesterov, KAMEZAWA Hiroyuki, Balbir Singh, linux-mm

> It's possible to livelock the page allocator if a thread has mm->mmap_sem
> and fails to make forward progress because the oom killer selects another
> thread sharing the same ->mm to kill that cannot exit until the semaphore
> is dropped.
> 
> The oom killer will not kill multiple tasks at the same time; each oom
> killed task must exit before another task may be killed.  Thus, if one
> thread is holding mm->mmap_sem and cannot allocate memory, all threads
> sharing the same ->mm are blocked from exiting as well.  In the oom kill
> case, that means the thread holding mm->mmap_sem will never free
> additional memory since it cannot get access to memory reserves and the
> thread that depends on it with access to memory reserves cannot exit
> because it cannot acquire the semaphore.  Thus, the page allocator
> livelocks.
> 
> When the oom killer is called and current happens to have a pending
> SIGKILL, this patch automatically gives it access to memory reserves and
> returns.  Upon returning to the page allocator, its allocation will
> hopefully succeed so it can quickly exit and free its memory.  If not, the
> page allocator will fail the allocation if it is not __GFP_NOFAIL.
> 
> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> Signed-off-by: David Rientjes <rientjes@google.com>

ack.




^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [patch -mm 08/18] oom: badness heuristic rewrite
  2010-06-01  7:18 ` [patch -mm 08/18] oom: badness heuristic rewrite David Rientjes
  2010-06-01  7:36   ` KOSAKI Motohiro
@ 2010-06-01  7:46   ` Nick Piggin
  2010-06-01 18:56     ` David Rientjes
  1 sibling, 1 reply; 99+ messages in thread
From: Nick Piggin @ 2010-06-01  7:46 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Rik van Riel, Oleg Nesterov, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro, Balbir Singh, linux-mm

On Tue, Jun 01, 2010 at 12:18:43AM -0700, David Rientjes wrote:
> This is a complete rewrite of the oom killer's badness() heuristic which is
> used to determine which task to kill in oom conditions.  The goal is to
> make it as simple and predictable as possible so the results are better
> understood and we end up killing the task which will lead to the most
> memory freeing while still respecting the fine-tuning from userspace.

Do you have particular ways of testing this (and other heuristics
changes such as the forkbomb detector)?

Such that you can look at your test case or workload and see that
it is really improved?



^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [patch -mm 08/18] oom: badness heuristic rewrite
  2010-06-01  7:36   ` KOSAKI Motohiro
@ 2010-06-01 18:44     ` David Rientjes
  2010-06-02 13:54       ` KOSAKI Motohiro
  0 siblings, 1 reply; 99+ messages in thread
From: David Rientjes @ 2010-06-01 18:44 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Andrew Morton, Rik van Riel, Nick Piggin, Oleg Nesterov,
	KAMEZAWA Hiroyuki, Balbir Singh, linux-mm

On Tue, 1 Jun 2010, KOSAKI Motohiro wrote:

> > This is a complete rewrite of the oom killer's badness() heuristic which is
> > used to determine which task to kill in oom conditions.  The goal is to
> > make it as simple and predictable as possible so the results are better
> > understood and we end up killing the task which will lead to the most
> > memory freeing while still respecting the fine-tuning from userspace.
> > 
> > The baseline for the heuristic is a proportion of memory that each task is
> > currently using in memory plus swap compared to the amount of "allowable"
> > memory.  "Allowable," in this sense, means the system-wide resources for
> > unconstrained oom conditions, the set of mempolicy nodes, the mems
> > attached to current's cpuset, or a memory controller's limit.  The
> > proportion is given on a scale of 0 (never kill) to 1000 (always kill),
> > roughly meaning that if a task has a badness() score of 500 that the task
> > consumes approximately 50% of allowable memory resident in RAM or in swap
> > space.
> > 
> > The proportion is always relative to the amount of "allowable" memory and
> > not the total amount of RAM systemwide so that mempolicies and cpusets may
> > operate in isolation; they shall not need to know the true size of the
> > machine on which they are running if they are bound to a specific set of
> > nodes or mems, respectively.
> > 
> > Root tasks are given 3% extra memory just like __vm_enough_memory()
> > provides in LSMs.  In the event of two tasks consuming similar amounts of
> > memory, it is generally better to save root's task.
> > 
> > Because of the change in the badness() heuristic's baseline, it is also
> > necessary to introduce a new user interface to tune it.  It's not possible
> > to redefine the meaning of /proc/pid/oom_adj with a new scale since the
> > ABI cannot be changed for backward compatibility.  Instead, a new tunable,
> > /proc/pid/oom_score_adj, is added that ranges from -1000 to +1000.  It may
> > be used to polarize the heuristic such that certain tasks are never
> > considered for oom kill while others may always be considered.  The value
> > is added directly into the badness() score so a value of -500, for
> > example, means to discount 50% of its memory consumption in comparison to
> > other tasks either on the system, bound to the mempolicy, in the cpuset,
> > or sharing the same memory controller.
> > 
> > /proc/pid/oom_adj is changed so that its meaning is rescaled into the
> > units used by /proc/pid/oom_score_adj, and vice versa.  Changing one of
> > these per-task tunables will rescale the value of the other to an
> > equivalent meaning.  Although /proc/pid/oom_adj was originally defined as
> > a bitshift on the badness score, it now shares the same linear growth as
> > /proc/pid/oom_score_adj but with different granularity.  This is required
> > so the ABI is not broken with userspace applications and allows oom_adj to
> > be deprecated for future removal.
> > 
> > Signed-off-by: David Rientjes <rientjes@google.com>
> 
> nack
> 

Why?

If it's because the patch is too big, I've explained a few times that 
functionally you can't break it apart into anything meaningful.  I do not 
believe it is better to break functional changes into smaller patches that 
simply change function signatures to pass additional arguments that are 
unused in the first patch, for example.

If it's because it adds /proc/pid/oom_score_adj in the same patch, that's 
allowed since otherwise it would be useless with the old heuristic.  In 
other words, you cannot apply oom_score_adj's meaning to the bitshift in 
any sane way.

I'll suggest what I have multiple times: the easiest way to review the 
functional change here is to merge the patch into your own tree and then 
review oom_badness().  I agree that the way the diff comes out it is a 
little difficult to read just from the patch form, so merging it and 
reviewing the actual heuristic function is the easiest way.
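
To illustrate why a linear scale is needed, here is a minimal sketch of 
how an oom_adj value could be rescaled into oom_score_adj units.  The 
constants (OOM_DISABLE of -17, OOM_ADJUST_MAX of 15) are the 
long-standing oom_adj limits; the exact formula and the helper name 
oom_adj_to_score_adj() are assumptions for illustration and may not 
match this version of the patch line for line.

```c
#include <assert.h>

/* Legacy oom_adj range and the new oom_score_adj range. */
#define OOM_DISABLE		(-17)
#define OOM_ADJUST_MAX		15
#define OOM_SCORE_ADJ_MIN	(-1000)
#define OOM_SCORE_ADJ_MAX	1000

/*
 * Assumed linear rescale of oom_adj into oom_score_adj units.
 * OOM_DISABLE maps to OOM_SCORE_ADJ_MIN; the top of the old range is
 * pinned to OOM_SCORE_ADJ_MAX because 15 * 1000 / 17 would fall short.
 */
static int oom_adj_to_score_adj(int oom_adj)
{
	assert(oom_adj >= OOM_DISABLE && oom_adj <= OOM_ADJUST_MAX);
	if (oom_adj == OOM_ADJUST_MAX)
		return OOM_SCORE_ADJ_MAX;
	return (oom_adj * OOM_SCORE_ADJ_MAX) / -OOM_DISABLE;
}
```

Under these assumptions an oom_adj of 8 becomes 470, i.e. roughly 47% 
of allowable memory added to the score, which has no equivalent 
expressed as a bitshift.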


^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [patch -mm 08/18] oom: badness heuristic rewrite
  2010-06-01  7:46   ` Nick Piggin
@ 2010-06-01 18:56     ` David Rientjes
  2010-06-02 13:54       ` KOSAKI Motohiro
  0 siblings, 1 reply; 99+ messages in thread
From: David Rientjes @ 2010-06-01 18:56 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Andrew Morton, Rik van Riel, Oleg Nesterov, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro, Balbir Singh, linux-mm

On Tue, 1 Jun 2010, Nick Piggin wrote:

> > This is a complete rewrite of the oom killer's badness() heuristic which is
> > used to determine which task to kill in oom conditions.  The goal is to
> > make it as simple and predictable as possible so the results are better
> > understood and we end up killing the task which will lead to the most
> > memory freeing while still respecting the fine-tuning from userspace.
> 
> Do you have particular ways of testing this (and other heuristics
> changes such as the forkbomb detector)?
> 

Yes, the patch prior to this one in the series, "oom: enable oom tasklist 
dump by default", allows you to examine the oom_score_adj of all eligible 
tasks.  Used in combination with /proc/pid/oom_score, which reports the 
result of the badness heuristic to userspace, I tested the result of the 
change by ensuring that it worked as intended.  Since we'll now see a 
tasklist dump of all eligible tasks whenever someone reports an oom 
problem (hopefully fewer reports as a result of this rewrite than 
currently!), it's much easier to determine (i) why the oom killer was 
called, and (ii) why a particular task was chosen for kill.  That's been 
my testing philosophy.
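
As a rough illustration of inspecting the heuristic's result from 
userspace, the sketch below reads /proc/<pid>/oom_score.  Only the 
procfs path comes from the discussion above; the helper name 
read_oom_score() and its -1 error convention are hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/types.h>

/*
 * Return the value reported in /proc/<pid>/oom_score, or -1 if the pid
 * is invalid or the file cannot be opened or parsed.
 */
static long read_oom_score(pid_t pid)
{
	char path[64];
	long score;
	FILE *f;
	int n;

	if (pid <= 0)
		return -1;

	n = snprintf(path, sizeof(path), "/proc/%ld/oom_score", (long)pid);
	assert(n > 0 && (size_t)n < sizeof(path));

	f = fopen(path, "r");
	if (!f)
		return -1;
	if (fscanf(f, "%ld", &score) != 1)
		score = -1;
	fclose(f);
	return score;
}
```

Calling read_oom_score(getpid()) before and after writing to the 
task's oom_score_adj is one simple way to check that the tunable 
shifts the score as expected.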

The forkbomb detector does add a minimal bias to tasks that have a large 
number of execve children, just as the current oom killer does (although 
the bias is much smaller with my heuristic).  Rik and I had a lengthy 
conversation on linux-mm about that when it was first proposed.  The key 
to that particular bias is that you must remember that even though a task 
is selected for oom kill that the oom killer still attempts to kill an 
execve child first.  So the end result is that an important system daemon, 
such as a webserver, doesn't actually get oom killed when it's selected as 
a result of this, but it's more of a bias toward the children to be killed 
(a client) instead.  We're guaranteed that a child will be killed if a 
task is chosen as the result of a tiebreaker because of the forkbomb 
detector because it surely has a child with a different mm that is 
eligible.  This isn't meant to enforce a kernel-wide forkbomb policy, 
which would obviously be better implemented elsewhere, but rather bias the 
children when a parent is forking an egregiously large number of tasks.  
"Egregious" in this case is defined as whatever the user uses for 
oom_forkbomb_thres, which I believe defaults to a sane value of 1000.

> Such that you can look at your test case or workload and see that
> it is really improved?
> 

I'm glad you asked that because some recent conversation has been 
slightly confusing to me about how this affects the desktop; this rewrite 
significantly improves the oom killer's response for desktop users.  The 
core ideas were developed in the thread from this mailing list back in 
February called "Improving OOM killer" at 
http://marc.info/?t=126506191200004&r=4&w=2 -- users constantly report 
that vital system tasks such as kdeinit are killed whenever a memory 
hogging task is forked either intentionally or unintentionally.  I argued 
for a while that KDE should be taking proper precautions by adjusting its 
own oom_adj score and that of its forked children as it's an inherited 
value, but I was eventually convinced that an overall improvement to the 
heuristic must be made to kill a task that was known to free a large 
amount of memory that is resident in RAM and that we have a consistent way 
of defining oom priorities when a task is run uncontained and when it is a 
member of a memcg or cpuset (or even mempolicy now), even in the case when 
it's contained out from under the task's knowledge.  When faced with 
memory pressure from an out of control or memory hogging task on the 
desktop, the oom killer now kills it instead of a vital task such as an X 
server (and oracle, webserver, etc on server platforms) because of the use 
of the task's rss instead of total_vm statistic.
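
For reference, the proportional score from the patch description can 
be modelled in userspace along these lines.  It is a sketch under 
assumptions, not the kernel's oom_badness(): the function name, the 
clamping to 0..1000, and treating an oom_score_adj of -1000 as "never 
kill" are illustrative choices based on the description.

```c
#include <assert.h>

/*
 * Hypothetical model: score a task by the share of "allowable" memory
 * it holds in RAM plus swap, on a scale of 0 (never kill) to 1000.
 */
static long model_badness_score(unsigned long rss_pages,
				unsigned long swap_pages,
				unsigned long allowable_pages,
				int is_root, int oom_score_adj)
{
	unsigned long long used;
	long points;

	assert(allowable_pages > 0);

	/* Assumption: the minimum adjustment exempts the task entirely. */
	if (oom_score_adj == -1000)
		return 0;

	used = (unsigned long long)rss_pages + swap_pages;
	points = (long)(used * 1000ULL / allowable_pages);

	/* Root tasks get 3% of allowable memory for free. */
	if (is_root)
		points -= 30;

	/* The userspace tunable is added directly to the score. */
	points += oom_score_adj;

	if (points < 0)
		return 0;
	if (points > 1000)
		return 1000;
	return points;
}
```

A task holding half of allowable memory scores 500; the same task 
owned by root scores 470, and with an oom_score_adj of -500 it scores 
0.  Because only rss and swap enter the calculation, a task with a 
large but mostly untouched address space no longer ranks highly.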


^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [patch -mm 09/18] oom: add forkbomb penalty to badness heuristic
  2010-06-01  7:37   ` KOSAKI Motohiro
@ 2010-06-01 18:57     ` David Rientjes
  2010-06-03 20:33       ` David Rientjes
  0 siblings, 1 reply; 99+ messages in thread
From: David Rientjes @ 2010-06-01 18:57 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Andrew Morton, Rik van Riel, Nick Piggin, Oleg Nesterov,
	KAMEZAWA Hiroyuki, Balbir Singh, linux-mm

On Tue, 1 Jun 2010, KOSAKI Motohiro wrote:

> > Add a forkbomb penalty for processes that fork an excessively large
> > number of children to penalize that group of tasks and not others.  A
> > threshold is configurable from userspace to determine how many first-
> > generation execve children (those with their own address spaces) a task
> > may have before it is considered a forkbomb.  This can be tuned by
> > altering the value in /proc/sys/vm/oom_forkbomb_thres, which defaults to
> > 1000.
> > 
> > When a task has more than 1000 first-generation children with different
> > address spaces than itself, a penalty of
> > 
> > 	(average rss of children) * (# of 1st generation execve children)
> > 	-----------------------------------------------------------------
> > 			oom_forkbomb_thres
> > 
> > is assessed.  So, for example, using the default oom_forkbomb_thres of
> > 1000, the penalty is twice the average rss of all its execve children if
> > there are 2000 such tasks.  A task is considered to count toward the
> > threshold if its total runtime is less than one second; for 1000 of such
> > tasks to exist, the parent process must be forking at an extremely high
> > rate either erroneously or maliciously.
> > 
> > Even though a particular task may be designated a forkbomb and selected as
> > the victim, the oom killer will still kill the 1st generation execve child
> > with the highest badness() score in its place.  This avoids killing
> > important servers or system daemons.  When a web server forks a very large
> > number of threads for client connections, for example, it is much better
> > to kill one of those threads than to kill the server and make it
> > unresponsive.
> > 
> > [oleg@redhat.com: optimize task_lock when iterating children]
> > Signed-off-by: David Rientjes <rientjes@google.com>
> 
> nack
> 

Why?
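
For reference, the penalty formula in the quoted changelog works out 
as in the sketch below.  The function name forkbomb_penalty() is 
hypothetical, and nr_children is assumed to already count only the 
first-generation execve children that qualify toward the threshold.

```c
#include <assert.h>

/*
 * Penalty = (average rss of children) * (# of children) / threshold,
 * assessed only when the number of children exceeds the threshold.
 */
static unsigned long forkbomb_penalty(unsigned long avg_child_rss,
				      unsigned long nr_children,
				      unsigned long thres)
{
	assert(thres > 0);
	if (nr_children <= thres)
		return 0;
	return avg_child_rss * nr_children / thres;
}
```

With the default threshold of 1000, a parent with 2000 such children 
averaging 100 pages each is penalized by 200 pages, twice the average 
rss, matching the example in the changelog.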


^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [patch -mm 12/18] oom: remove unnecessary code and cleanup
  2010-06-01  7:40   ` KOSAKI Motohiro
@ 2010-06-01 18:58     ` David Rientjes
  0 siblings, 0 replies; 99+ messages in thread
From: David Rientjes @ 2010-06-01 18:58 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Andrew Morton, Rik van Riel, Nick Piggin, Oleg Nesterov,
	KAMEZAWA Hiroyuki, Balbir Singh, linux-mm

On Tue, 1 Jun 2010, KOSAKI Motohiro wrote:

> > Remove the redundancy in __oom_kill_task() since:
> > 
> >  - init can never be passed to this function: it will never be PF_EXITING
> >    or selectable from select_bad_process(), and
> > 
> >  - it will never be passed a task from oom_kill_task() without an ->mm
> >    and we're unconcerned about detachment from exiting tasks, there's no
> >    reason to protect them against SIGKILL or access to memory reserves.
> > 
> > Also moves the kernel log message to a higher level since the verbosity is
> > not always emitted here; we need not print an error message if an exiting
> > task is given a longer timeslice.
> > 
> > Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> > Signed-off-by: David Rientjes <rientjes@google.com>
> 
> need respin.
> 

This is a duplicate of the same patch to which you earlier added your 
Reviewed-by line, as cited above; what has changed?  This applies fine.


^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [patch -mm 13/18] oom: avoid race for oom killed tasks detaching mm prior to exit
  2010-06-01  7:40   ` KOSAKI Motohiro
@ 2010-06-01 18:59     ` David Rientjes
  2010-06-01 20:43       ` Oleg Nesterov
  0 siblings, 1 reply; 99+ messages in thread
From: David Rientjes @ 2010-06-01 18:59 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Andrew Morton, Rik van Riel, Nick Piggin, Oleg Nesterov,
	KAMEZAWA Hiroyuki, Balbir Singh, linux-mm

On Tue, 1 Jun 2010, KOSAKI Motohiro wrote:

> > A task detaches its ->mm prior to exiting, so it's possible that in-progress
> > oom kills or already exiting tasks may be missed during the oom killer's
> > tasklist scan.  When an eligible task is found with either TIF_MEMDIE or
> > PF_EXITING set, the oom killer is supposed to be a no-op to avoid
> > needlessly killing additional tasks.  This closes the race between a task
> > detaching its ->mm and being removed from the tasklist.
> > 
> > Out of memory conditions as the result of memory controllers will
> > automatically filter tasks that have detached their ->mm (since
> > task_in_mem_cgroup() will return 0).  This is acceptable, however, since
> > memcg constrained ooms aren't the result of a lack of memory resources but
> > rather a limit imposed by userspace that requires a task be killed
> > regardless.
> > 
> > [oleg@redhat.com: fix PF_EXITING check for !p->mm tasks]
> > Acked-by: Nick Piggin <npiggin@suse.de>
> > Signed-off-by: David Rientjes <rientjes@google.com>
> 
> need respin.
> 

No, it applies to mmotm-2010-05-21-16-05 as all of these patches do.  I 
know you've pushed Oleg's patches but they are also included here so no 
respin is necessary unless they are merged first (and I think that should 
only happen if Andrew considers them to be rc material).  I'll base my 
patchsets on the -mm tree.


^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [patch -mm 13/18] oom: avoid race for oom killed tasks detaching mm prior to exit
  2010-06-01 18:59     ` David Rientjes
@ 2010-06-01 20:43       ` Oleg Nesterov
  2010-06-01 21:19         ` David Rientjes
                           ` (2 more replies)
  0 siblings, 3 replies; 99+ messages in thread
From: Oleg Nesterov @ 2010-06-01 20:43 UTC (permalink / raw)
  To: David Rientjes
  Cc: KOSAKI Motohiro, Andrew Morton, Rik van Riel, Nick Piggin,
	KAMEZAWA Hiroyuki, Balbir Singh, linux-mm

On 06/01, David Rientjes wrote:
>
> No, it applies to mmotm-2010-05-21-16-05 as all of these patches do. I
> know you've pushed Oleg's patches

(plus other fixes)

> but they are also included here so no
> respin is necessary unless they are merged first (and I think that should
> only happen if Andrew considers them to be rc material).

Well, I disagree.

I think it is always better to push the simple bugfixes first, then
change/improve the logic.

Oleg.


^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [patch -mm 13/18] oom: avoid race for oom killed tasks detaching mm prior to exit
  2010-06-01 20:43       ` Oleg Nesterov
@ 2010-06-01 21:19         ` David Rientjes
  2010-06-02  0:28         ` KAMEZAWA Hiroyuki
  2010-06-02 13:54         ` KOSAKI Motohiro
  2 siblings, 0 replies; 99+ messages in thread
From: David Rientjes @ 2010-06-01 21:19 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: KOSAKI Motohiro, Andrew Morton, Rik van Riel, Nick Piggin,
	KAMEZAWA Hiroyuki, Balbir Singh, linux-mm

On Tue, 1 Jun 2010, Oleg Nesterov wrote:

> On 06/01, David Rientjes wrote:
> >
> > No, it applies to mmotm-2010-05-21-16-05 as all of these patches do. I
> > know you've pushed Oleg's patches
> 
> (plus other fixes)
> 

You're suggesting that I should develop my patches on top of what I 
speculate that Andrew will eventually merge in -mm?  I don't have that 
kind of time, sorry.

> > but they are also included here so no
> > respin is necessary unless they are merged first (and I think that should
> > only happen if Andrew considers them to be rc material).
> 
> Well, I disagree.
> 
> I think it is always better to push the simple bugfixes first, then
> change/improve the logic.
> 

Unless your fixes, which seem to still be under development considering 
your discussion with KOSAKI in those threads, are going into 2.6.35 during 
the rc cycle, then there's no difference in them being merged as part of 
this patchset since they are duplicated here.  So you'll need to convince 
Andrew they are rc material otherwise it doesn't matter.


^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [patch -mm 13/18] oom: avoid race for oom killed tasks detaching mm prior to exit
  2010-06-01 20:43       ` Oleg Nesterov
  2010-06-01 21:19         ` David Rientjes
@ 2010-06-02  0:28         ` KAMEZAWA Hiroyuki
  2010-06-02  9:49           ` David Rientjes
  2010-06-02 13:54         ` KOSAKI Motohiro
  2 siblings, 1 reply; 99+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-06-02  0:28 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: David Rientjes, KOSAKI Motohiro, Andrew Morton, Rik van Riel,
	Nick Piggin, Balbir Singh, linux-mm

On Tue, 1 Jun 2010 22:43:42 +0200
Oleg Nesterov <oleg@redhat.com> wrote:

> On 06/01, David Rientjes wrote:
> >
> > No, it applies to mmotm-2010-05-21-16-05 as all of these patches do. I
> > know you've pushed Oleg's patches
> 
> (plus other fixes)
> 
> > but they are also included here so no
> > respin is necessary unless they are merged first (and I think that should
> > only happen if Andrew considers them to be rc material).
> 
> Well, I disagree.
> 
> I think it is always better to push the simple bugfixes first, then
> change/improve the logic.
> 
yes..yes...I hope David finish easy-to-be-merged ones and go to new stage.
IOW, please reduce size of patches sent at once.

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [patch -mm 13/18] oom: avoid race for oom killed tasks detaching mm prior to exit
  2010-06-02  0:28         ` KAMEZAWA Hiroyuki
@ 2010-06-02  9:49           ` David Rientjes
  2010-06-02 10:46             ` Nick Piggin
  0 siblings, 1 reply; 99+ messages in thread
From: David Rientjes @ 2010-06-02  9:49 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Oleg Nesterov, KOSAKI Motohiro, Andrew Morton, Rik van Riel,
	Nick Piggin, Balbir Singh, linux-mm

On Wed, 2 Jun 2010, KAMEZAWA Hiroyuki wrote:

> > > No, it applies to mmotm-2010-05-21-16-05 as all of these patches do. I
> > > know you've pushed Oleg's patches
> > 
> > (plus other fixes)
> > 
> > > but they are also included here so no
> > > respin is necessary unless they are merged first (and I think that should
> > > only happen if Andrew considers them to be rc material).
> > 
> > Well, I disagree.
> > 
> > I think it is always better to push the simple bugfixes first, then
> > change/improve the logic.
> > 
> Yes, yes... I hope David finishes the easy-to-be-merged ones first and then
> goes on to the next stage. IOW, please reduce the size of the patch set sent
> at once.
> 

How do you define "easy-to-be-merged"?  We've been through several 
iterations of this patchset where the end result is that it's been merged 
in -mm once, removed from -mm six weeks later, and nobody providing any 
feedback that I can work from.  Providing simple "nack" emails does 
nothing for the development of the patchset unless you actively get 
involved in the review process and subsequent discussion on how to move 
forward.

Listen, I want to hear everybody's ideas and suggestions on improvements.  
In fact, I think I've responded in a way that demonstrates that quite 
well: I've dropped the consolidation of sysctls, I've avoided deprecation 
of existing sysctls, I've unified the semantics of panic_on_oom, and I've 
split out patches where possible.  All of those were at the requests of 
people whom I've asked to review this patchset time and time again.

Kame, you've been very helpful in your feedback with regards to this 
patchset and I've valued your feedback from the first revision.  We had 
some differing views of how to handle task selection early on in other 
threads, but I sincerely enjoy hearing your feedback because it's 
interesting and challenging; you find things that I've missed and 
challenge me to defend decisions that were made.  I really, really like 
doing that type of development, I just wish we all could make some forward 
progress on this thing instead of stalling out all the time.

I'm asking everyone to please review this work and comment on what you 
don't like or provide suggestions on how to improve it.  It's been posted 
in its various forms about eight times now over the course of a few 
months, I really hope there are no big surprises in it to anyone anymore.  
Sure, there are cleanups here that possibly could be considered rc 
material even though they admittedly aren't critical, but that isn't a 
reason to just stall out all of this work.  I'm sure Andrew can decide 
what he wants to merge into 2.6.35-rc2 after looking at the discussion and 
analyzing the impact; let us please focus on the actual implementation and 
design choices of the new oom killer presented here rather than get 
sidetracked.


^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [patch -mm 13/18] oom: avoid race for oom killed tasks detaching mm prior to exit
  2010-06-02  9:49           ` David Rientjes
@ 2010-06-02 10:46             ` Nick Piggin
  2010-06-02 21:35               ` David Rientjes
  0 siblings, 1 reply; 99+ messages in thread
From: Nick Piggin @ 2010-06-02 10:46 UTC (permalink / raw)
  To: David Rientjes
  Cc: KAMEZAWA Hiroyuki, Oleg Nesterov, KOSAKI Motohiro, Andrew Morton,
	Rik van Riel, Balbir Singh, linux-mm

On Wed, Jun 02, 2010 at 02:49:49AM -0700, David Rientjes wrote:
> On Wed, 2 Jun 2010, KAMEZAWA Hiroyuki wrote:
> 
> > > > No, it applies to mmotm-2010-05-21-16-05 as all of these patches do. I
> > > > know you've pushed Oleg's patches
> > > 
> > > (plus other fixes)
> > > 
> > > > but they are also included here so no
> > > > respin is necessary unless they are merged first (and I think that should
> > > > only happen if Andrew considers them to be rc material).
> > > 
> > > Well, I disagree.
> > > 
> > > I think it is always better to push the simple bugfixes first, then
> > > change/improve the logic.
> > > 
> > Yes, yes... I hope David finishes the easy-to-be-merged ones first and then
> > goes on to the next stage. IOW, please reduce the size of the patch set sent
> > at once.
> > 
> 
> How do you define "easy-to-be-merged"?  We've been through several 
> iterations of this patchset where the end result is that it's been merged 
> in -mm once, removed from -mm six weeks later, and nobody providing any 
> feedback that I can work from.  Providing simple "nack" emails does 
> nothing for the development of the patchset unless you actively get 
> involved in the review process and subsequent discussion on how to move 
> forward.
> 
> Listen, I want to hear everybody's ideas and suggestions on improvements.  
> In fact, I think I've responded in a way that demonstrates that quite 
> well: I've dropped the consolidation of sysctls, I've avoided deprecation 
> of existing sysctls, I've unified the semantics of panic_on_oom, and I've 
> split out patches where possible.  All of those were at the requests of 
> people whom I've asked to review this patchset time and time again.
> 
> Kame, you've been very helpful in your feedback with regards to this 
> patchset and I've valued your feedback from the first revision.  We had 
> some differing views of how to handle task selection early on in other 
> threads, but I sincerely enjoy hearing your feedback because it's 
> interesting and challenging; you find things that I've missed and 
> challenge me to defend decisions that were made.  I really, really like 
> doing that type of development, I just wish we all could make some forward 
> progress on this thing instead of stalling out all the time.

Well there are a large number of patches with no objections, some of
which are bug-fixes which may need to be backported to earlier kernels.
It would be nice if the patchset would be rearranged so all these can
be merged soon (I don't want the situation where a couple of patches
hold up your entire patchset again).

When you are reduced to a few patches changing major functionality, it
could be easier to get those reviewed and merged on their own.

 
> I'm asking everyone to please review this work and comment on what you 
> don't like or provide suggestions on how to improve it.  It's been posted 
> in its various forms about eight times now over the course of a few 
> months, I really hope there are no big surprises in it to anyone anymore.  
> Sure, there are cleanups here that possibly could be considered rc 
> material even though they admittedly aren't critical, but that isn't a 
> reason to just stall out all of this work.  I'm sure Andrew can decide 
> what he wants to merge into 2.6.35-rc2 after looking at the discussion and 
> analyzing the impact; let us please focus on the actual implementation and 
> design choices of the new oom killer presented here rather than get 
> sidetracked.

Well the merge window is closed and even if it wasn't the patches would
be better to sit in -mm for a bit. So I don't think there is a big rush
now, let's just get it right so everything is lined up to get into the
next merge window.


^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [patch -mm 13/18] oom: avoid race for oom killed tasks detaching mm prior to exit
  2010-06-01 20:43       ` Oleg Nesterov
  2010-06-01 21:19         ` David Rientjes
  2010-06-02  0:28         ` KAMEZAWA Hiroyuki
@ 2010-06-02 13:54         ` KOSAKI Motohiro
  2 siblings, 0 replies; 99+ messages in thread
From: KOSAKI Motohiro @ 2010-06-02 13:54 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: kosaki.motohiro, David Rientjes, Andrew Morton, Rik van Riel,
	Nick Piggin, KAMEZAWA Hiroyuki, Balbir Singh, linux-mm

> On 06/01, David Rientjes wrote:
> >
> > No, it applies to mmotm-2010-05-21-16-05 as all of these patches do. I
> > know you've pushed Oleg's patches
> 
> (plus other fixes)
> 
> > but they are also included here so no
> > respin is necessary unless they are merged first (and I think that should
> > only happen if Andrew considers them to be rc material).
> 
> Well, I disagree.
> 
> I think it is always better to push the simple bugfixes first, then
> change/improve the logic.

Yep. That's exactly the reason why I would push his patch series first.

Thanks.



^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [patch -mm 08/18] oom: badness heuristic rewrite
  2010-06-01 18:56     ` David Rientjes
@ 2010-06-02 13:54       ` KOSAKI Motohiro
  2010-06-02 21:23         ` David Rientjes
  0 siblings, 1 reply; 99+ messages in thread
From: KOSAKI Motohiro @ 2010-06-02 13:54 UTC (permalink / raw)
  To: David Rientjes
  Cc: kosaki.motohiro, Nick Piggin, Andrew Morton, Rik van Riel,
	Oleg Nesterov, KAMEZAWA Hiroyuki, Balbir Singh, linux-mm

> > Such that you can look at your test case or workload and see that
> > it is really improved?
> > 
> 
> I'm glad you asked that because some recent conversation has been 
> slightly confusing to me about how this affects the desktop; this rewrite 
> significantly improves the oom killer's response for desktop users.  The 
> core ideas were developed in the thread from this mailing list back in 
> February called "Improving OOM killer" at 
> http://marc.info/?t=126506191200004&r=4&w=2 -- users constantly report 
> that vital system tasks such as kdeinit are killed whenever a memory 
> hogging task is forked either intentionally or unintentionally.  I argued 
> for a while that KDE should be taking proper precautions by adjusting its 
> own oom_adj score and that of its forked children as it's an inherited 
> value, but I was eventually convinced that an overall improvement to the 
> heuristic must be made to kill a task that was known to free a large 
> amount of memory that is resident in RAM and that we have a consistent way 
> of defining oom priorities when a task is run uncontained and when it is a 
> member of a memcg or cpuset (or even mempolicy now), even in the case when 
> it's contained out from under the task's knowledge.  When faced with 
> memory pressure from an out of control or memory hogging task on the 
> desktop, the oom killer now kills it instead of a vital task such as an X 
> server (and oracle, webserver, etc on server platforms) because of the use 
> of the task's rss instead of total_vm statistic.

The above story teaches us that the oom-killer needs some improvement, but it
doesn't prove that your patches are the correct solution. That's why you were
asked about a way to test it.

Nobody objects to fixing the KDE OOM issue.



^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [patch -mm 08/18] oom: badness heuristic rewrite
  2010-06-01 18:44     ` David Rientjes
@ 2010-06-02 13:54       ` KOSAKI Motohiro
  2010-06-02 21:20         ` David Rientjes
  2010-06-03 23:10         ` Andrew Morton
  0 siblings, 2 replies; 99+ messages in thread
From: KOSAKI Motohiro @ 2010-06-02 13:54 UTC (permalink / raw)
  To: David Rientjes
  Cc: kosaki.motohiro, Andrew Morton, Rik van Riel, Nick Piggin,
	Oleg Nesterov, KAMEZAWA Hiroyuki, Balbir Singh, linux-mm

> Why?
> 
> If it's because the patch is too big, I've explained a few times that 
> functionally you can't break it apart into anything meaningful.  I do not 
> believe it is better to break functional changes into smaller patches that 
> simply change function signatures to pass additional arguments that are 
> unused in the first patch, for example.
> 
> If it's because it adds /proc/pid/oom_score_adj in the same patch, that's 
> allowed since otherwise it would be useless with the old heuristic.  In 
> other words, you cannot apply oom_score_adj's meaning to the bitshift in 
> any sane way.
> 
> I'll suggest what I have multiple times: the easiest way to review the 
> functional change here is to merge the patch into your own tree and then 
> review oom_badness().  I agree that the way the diff comes out it is a 
> little difficult to read just from the patch form, so merging it and 
> reviewing the actual heuristic function is the easiest way.

I've already explained the reason. 1) All-of-rewrite patches are 
always unacceptable; they prevent us from maintaining the code. 2) Patches with
no justification are also unacceptable. You need to write a more thorough patch
description at least.

We don't need pointless suggestions. You only need to fix the patch.




^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [patch -mm 08/18] oom: badness heuristic rewrite
  2010-06-02 13:54       ` KOSAKI Motohiro
@ 2010-06-02 21:20         ` David Rientjes
  2010-06-03 23:10         ` Andrew Morton
  1 sibling, 0 replies; 99+ messages in thread
From: David Rientjes @ 2010-06-02 21:20 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Andrew Morton, Rik van Riel, Nick Piggin, Oleg Nesterov,
	KAMEZAWA Hiroyuki, Balbir Singh, linux-mm

On Wed, 2 Jun 2010, KOSAKI Motohiro wrote:

> I've already explained the reason. 1) All-of-rewrite patches are 
> always unacceptable; they prevent us from maintaining the code.

How else would you propose to completely change a heuristic??  By doing it 
in steps where the intermediate changes make an absolute mess of it first 
and then slowly work toward the end result?

This is a complete rewrite of the badness() heuristic, it introduces a new 
userspace interface, oom_score_adj, which it heavily relies upon 
(otherwise it'd be impossible to disable oom killing completely for 
certain tasks, for example), so naturally that needs to be included.
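
For illustration only, here is a minimal userspace sketch of how a task
might use the proposed /proc/pid/oom_score_adj file.  The helper names are
hypothetical, and the -1000..1000 range (with -1000 disabling oom killing
for the task) is an assumption about the interface rather than something
spelled out in this message.

```c
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Assumed bounds for oom_score_adj; -1000 is taken to mean "never kill". */
#define OOM_SCORE_ADJ_MIN (-1000)
#define OOM_SCORE_ADJ_MAX 1000

/*
 * Hypothetical helper: format "/proc/<pid>/oom_score_adj" into buf.
 * Returns 0 on success, -1 if buf is too small.
 */
int oom_score_adj_path(char *buf, size_t len, pid_t pid)
{
	int n = snprintf(buf, len, "/proc/%ld/oom_score_adj", (long)pid);

	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/*
 * Hypothetical helper: write an adjustment for the given pid.
 * Returns 0 on success, -1 on an out-of-range value or any I/O failure.
 */
int set_oom_score_adj(pid_t pid, int value)
{
	char path[64];
	FILE *f;
	int ok;

	if (value < OOM_SCORE_ADJ_MIN || value > OOM_SCORE_ADJ_MAX)
		return -1;
	if (oom_score_adj_path(path, sizeof(path), pid) < 0)
		return -1;

	f = fopen(path, "w");
	if (!f)
		return -1;	/* file missing or permission denied */

	ok = fprintf(f, "%d\n", value) > 0;
	if (fclose(f) != 0)
		ok = 0;		/* the write may only fail at flush time */
	return ok ? 0 : -1;
}
```

A caller would use something like set_oom_score_adj(getpid(), 500) to
volunteer itself as a preferred victim.  Lowering the value typically
requires privilege, so a failed write should be treated as non-fatal.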

I've followed your suggestion of splitting out the forkbomb detector into 
the next patch, which you don't even have any feedback for either other 
than "nack", so what else do you want from me??

Please follow my suggestion that I've repeatedly made: merge the patch 
locally and check out the new oom_badness() function and see if there's 
anything you're concerned with.  In other words, please actually review 
the implementation and design.

> 2) Patches with no justification are also unacceptable. You need to write a
> more thorough patch description at least.
> 

What needs to be included in the patch description that isn't already?  I 
think its intention and implementation are clearly spelled out.

> We don't need pointless suggestions. You only need to fix the patch.
> 

It's a review tip to make it easier to read the patch since the complete 
rewrite of oom_badness() is difficult to read in patch form because of the 
breaks.


^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [patch -mm 08/18] oom: badness heuristic rewrite
  2010-06-02 13:54       ` KOSAKI Motohiro
@ 2010-06-02 21:23         ` David Rientjes
  2010-06-03  0:05           ` KAMEZAWA Hiroyuki
  2010-06-03  3:07           ` KOSAKI Motohiro
  0 siblings, 2 replies; 99+ messages in thread
From: David Rientjes @ 2010-06-02 21:23 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Nick Piggin, Andrew Morton, Rik van Riel, Oleg Nesterov,
	KAMEZAWA Hiroyuki, Balbir Singh, linux-mm

On Wed, 2 Jun 2010, KOSAKI Motohiro wrote:

> > I'm glad you asked that because some recent conversation has been 
> > slightly confusing to me about how this affects the desktop; this rewrite 
> > significantly improves the oom killer's response for desktop users.  The 
> > core ideas were developed in the thread from this mailing list back in 
> > February called "Improving OOM killer" at 
> > http://marc.info/?t=126506191200004&r=4&w=2 -- users constantly report 
> > that vital system tasks such as kdeinit are killed whenever a memory 
> > hogging task is forked either intentionally or unintentionally.  I argued 
> > for a while that KDE should be taking proper precautions by adjusting its 
> > own oom_adj score and that of its forked children as it's an inherited 
> > value, but I was eventually convinced that an overall improvement to the 
> > heuristic must be made to kill a task that was known to free a large 
> > amount of memory that is resident in RAM and that we have a consistent way 
> > of defining oom priorities when a task is run uncontained and when it is a 
> > member of a memcg or cpuset (or even mempolicy now), even in the case when 
> > it's contained out from under the task's knowledge.  When faced with 
> > memory pressure from an out of control or memory hogging task on the 
> > desktop, the oom killer now kills it instead of a vital task such as an X 
> > server (and oracle, webserver, etc on server platforms) because of the use 
> > of the task's rss instead of total_vm statistic.
> 
> The above story teaches us that the oom-killer needs some improvement, but it
> doesn't prove that your patches are the correct solution. That's why you were
> asked about a way to test it.
> 

I would consider what I said above, "when faced with memory pressure from 
an out of control or memory hogging task on the desktop, the oom killer 
now kills it instead of a vital task such as an X server because of the 
use of the task's rss instead of total_vm statistic" as an improvement 
over killing X in those cases which it currently does.  How do you 
disagree?


^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [patch -mm 13/18] oom: avoid race for oom killed tasks detaching mm prior to exit
  2010-06-02 10:46             ` Nick Piggin
@ 2010-06-02 21:35               ` David Rientjes
  0 siblings, 0 replies; 99+ messages in thread
From: David Rientjes @ 2010-06-02 21:35 UTC (permalink / raw)
  To: Nick Piggin
  Cc: KAMEZAWA Hiroyuki, Oleg Nesterov, KOSAKI Motohiro, Andrew Morton,
	Rik van Riel, Balbir Singh, linux-mm

On Wed, 2 Jun 2010, Nick Piggin wrote:

> Well there are a large number of patches with no objections, some of
> which are bug-fixes which may need to be backported to earlier kernels.
> It would be nice if the patchset would be rearranged so all these can
> be merged soon (I don't want the situation where a couple of patches
> hold up your entire patchset again).
> 

I've written fixes in this patchset and have merged Oleg's work into it, 
but I would stress that none of these are really bugfixes that fix an 
unstable condition: killing a task outside of current's cpuset even though 
it was needless isn't a bugfix, recalling the oom killer once a kthread 
has called unuse_mm() isn't a bugfix, etc.  So while they definitely are 
fixes that we'd like to see upstream at some point, hence they were merged 
here as well, their impact is not as severe as it may have been described 
outside of this thread.

I definitely don't want that situation where a couple of patches hold it 
up either, I'm waiting for something to work on.

> When you are reduced to a few patches changing major functionality, it
> could be easier to get those reviewed and merged on their own.
> 

What patches specifically do you think are 2.6.35-rc2 material?  
Otherwise, in my opinion, holding up this entire thing from being merged 
doesn't make a lot of sense based on order of patches.

> Well the merge window is closed and even if it wasn't the patches would
> be better to sit in -mm for a bit. So I don't think there is a big rush
> now, let's just get it right so everything is lined up to get into the
> next merge window.
> 

They already sat in -mm for six weeks, so I had stopped my work thinking 
they already had a path upstream then were abruptly removed with the only 
alternative left to me being to fold incremental fixes into one another 
and repost.  There have been no changes to what was sitting in -mm for 
six weeks other than dropping the consolidation of sysctls, the unifying 
of the panic_on_oom semantics for pagefault ooms, and refactoring of the 
patchset.

I'm left in the position where people want certain patches merged first 
even though they won't say it's rc material, they want me to base my 
patchset off what they speculatively believe Andrew will eventually merge 
in -mm in the first place from others, and they refuse to review both the 
implementation and design of the new heuristic.  It compounds my work 
every day with absolutely no forward progress being made and we've stalled 
out on all this work because nobody is actually getting involved in 
reviewing the patchset for Andrew.

I honestly don't understand why this entire patchset cannot be merged 
right now with a target of 2.6.36.  If you disagree, please show me the 
patches that you believe are rc material and the problems that they fix 
that are either regressions from current code or have a severe enough 
impact to warrant that type of consideration.

Thanks.


^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [patch -mm 08/18] oom: badness heuristic rewrite
  2010-06-02 21:23         ` David Rientjes
@ 2010-06-03  0:05           ` KAMEZAWA Hiroyuki
  2010-06-03  6:44             ` David Rientjes
  2010-06-03  3:07           ` KOSAKI Motohiro
  1 sibling, 1 reply; 99+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-06-03  0:05 UTC (permalink / raw)
  To: David Rientjes
  Cc: KOSAKI Motohiro, Nick Piggin, Andrew Morton, Rik van Riel,
	Oleg Nesterov, Balbir Singh, linux-mm

On Wed, 2 Jun 2010 14:23:53 -0700 (PDT)
David Rientjes <rientjes@google.com> wrote:

> On Wed, 2 Jun 2010, KOSAKI Motohiro wrote:
> 
> > > I'm glad you asked that because some recent conversation has been 
> > > slightly confusing to me about how this affects the desktop; this rewrite 
> > > significantly improves the oom killer's response for desktop users.  The 
> > > core ideas were developed in the thread from this mailing list back in 
> > > February called "Improving OOM killer" at 
> > > http://marc.info/?t=126506191200004&r=4&w=2 -- users constantly report 
> > > that vital system tasks such as kdeinit are killed whenever a memory 
> > > hogging task is forked either intentionally or unintentionally.  I argued 
> > > for a while that KDE should be taking proper precautions by adjusting its 
> > > own oom_adj score and that of its forked children as it's an inherited 
> > > value, but I was eventually convinced that an overall improvement to the 
> > > heuristic must be made to kill a task that was known to free a large 
> > > amount of memory that is resident in RAM and that we have a consistent way 
> > > of defining oom priorities when a task is run uncontained and when it is a 
> > > member of a memcg or cpuset (or even mempolicy now), even in the case when 
> > > it's contained out from under the task's knowledge.  When faced with 
> > > memory pressure from an out of control or memory hogging task on the 
> > > desktop, the oom killer now kills it instead of a vital task such as an X 
> > > server (and oracle, webserver, etc on server platforms) because of the use 
> > > of the task's rss instead of total_vm statistic.
> > 
> > > The above story teaches us that the oom-killer needs some improvement, but it
> > > doesn't prove that your patches are the correct solution. That's why you were
> > > asked about a way to test it.
> > 
> 
> I would consider what I said above, "when faced with memory pressure from 
> an out of control or memory hogging task on the desktop, the oom killer 
> now kills it instead of a vital task such as an X server because of the 
> use of the task's rss instead of total_vm statistic" as an improvement 
> over killing X in those cases which it currently does.  How do you 
> disagree?
> 

It was you who disagreed with using RSS for oom killing last winter.
By what observation did you change your mind? (Don't take this as criticism.
I'm just curious.)

My standpoint:
I don't like the new interface at all but welcome the concept of using RSS.
My customers and I will never use the new interface other than for OOM_DISABLE.
So, I say neither ack nor nack.

Thanks,
-Kame




^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [patch -mm 08/18] oom: badness heuristic rewrite
  2010-06-02 21:23         ` David Rientjes
  2010-06-03  0:05           ` KAMEZAWA Hiroyuki
@ 2010-06-03  3:07           ` KOSAKI Motohiro
  2010-06-03  6:48             ` David Rientjes
  2010-06-03 23:15             ` Andrew Morton
  1 sibling, 2 replies; 99+ messages in thread
From: KOSAKI Motohiro @ 2010-06-03  3:07 UTC (permalink / raw)
  To: David Rientjes
  Cc: kosaki.motohiro, Nick Piggin, Andrew Morton, Rik van Riel,
	Oleg Nesterov, KAMEZAWA Hiroyuki, Balbir Singh, linux-mm

> > The above story teaches us that the oom-killer needs some improvement, but it
> > doesn't prove that your patches are the correct solution. That's why you were
> > asked about a way to test it.
> 
> I would consider what I said above, "when faced with memory pressure from 
> an out of control or memory hogging task on the desktop, the oom killer 
> now kills it instead of a vital task such as an X server because of the 
> use of the task's rss instead of total_vm statistic" as an improvement 
> over killing X in those cases which it currently does.  How do you 
> disagree?

People have observed that a simple s/total_vm/rss/ patch solves the X issue.
So the other additional pieces need an explanation of why they are necessary
and how to confirm that.

In other words, I'm sure I'll continue to get OOM bug reports in the future,
and I'll need to decide whether or not to revert each patch. Having no
information is unwelcome. That's also the reason why an all-of-rewrite patch
is wrong: if it is merged, a small bug report will eventually force a revert
of everything. That doesn't fit our development process.





^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [patch -mm 08/18] oom: badness heuristic rewrite
  2010-06-03  0:05           ` KAMEZAWA Hiroyuki
@ 2010-06-03  6:44             ` David Rientjes
  0 siblings, 0 replies; 99+ messages in thread
From: David Rientjes @ 2010-06-03  6:44 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: KOSAKI Motohiro, Nick Piggin, Andrew Morton, Rik van Riel,
	Oleg Nesterov, Balbir Singh, linux-mm

On Thu, 3 Jun 2010, KAMEZAWA Hiroyuki wrote:

> > > > I'm glad you asked that because some recent conversation has been 
> > > > slightly confusing to me about how this affects the desktop; this rewrite 
> > > > significantly improves the oom killer's response for desktop users.  The 
> > > > core ideas were developed in the thread from this mailing list back in 
> > > > February called "Improving OOM killer" at 
> > > > http://marc.info/?t=126506191200004&r=4&w=2 -- users constantly report 
> > > > that vital system tasks such as kdeinit are killed whenever a memory 
> > > > hogging task is forked either intentionally or unintentionally.  I argued 
> > > > for a while that KDE should be taking proper precautions by adjusting its 
> > > > own oom_adj score and that of its forked children as it's an inherited 
> > > > value, but I was eventually convinced that an overall improvement to the 
> > > > heuristic must be made to kill a task that was known to free a large 
> > > > amount of memory that is resident in RAM and that we have a consistent way 
> > > > of defining oom priorities when a task is run uncontained and when it is a 
> > > > member of a memcg or cpuset (or even mempolicy now), even in the case when 
> > > > it's contained out from under the task's knowledge.  When faced with 
> > > > memory pressure from an out of control or memory hogging task on the 
> > > > desktop, the oom killer now kills it instead of a vital task such as an X 
> > > > server (and oracle, webserver, etc on server platforms) because of the use 
> > > > of the task's rss instead of total_vm statistic.
> > > 
> > > > The above story teaches us that the oom-killer needs some improvement, but it
> > > > doesn't prove that your patches are the correct solution. That's why you were
> > > > asked about a way to test it.
> > > 
> > 
> > I would consider what I said above, "when faced with memory pressure from 
> > an out of control or memory hogging task on the desktop, the oom killer 
> > now kills it instead of a vital task such as an X server because of the 
> > use of the task's rss instead of total_vm statistic" as an improvement 
> > over killing X in those cases which it currently does.  How do you 
> > disagree?
> > 
> 
> It was you who disagreed with using RSS for oom killing last winter.
> By what observation did you change your mind? (Don't take this as criticism.
> I'm just curious.)
> 

The fact that when I ran the new heuristic it improved the oom killer on 
my desktop to save KDE and kill a memory-hogging task that stressed it.  I 
became supportive of the idea through the discussion that went on 
specifically about using total_vm as a baseline and was convinced that it 
was better to use rss as well as a more powerful user interface so that 
admins could more accurately set their oom kill priorities even when their 
cpuset, memcg, or mempolicy placement was changed out from under it.

> My standpoint:
> I don't like the new interface at all but welcome the concept of using RSS.

Using rss is not a new interface.
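
As a rough sketch of the idea discussed above (not the oom_badness() code
from the patch), a score can be computed from a task's resident plus
swapped pages as a proportion of the memory it is allowed to use, with a
userspace adjustment added afterwards.  The function name, the 0..1000
scale, and the treatment of -1000 as "never kill" are assumptions made for
illustration.

```c
/*
 * Illustrative only: a proportional badness score.
 *
 * rss_pages      pages of the task resident in RAM
 * swap_pages     pages of the task currently in swap
 * allowed_pages  pages the task may allocate from (whole system, or its
 *                cpuset, memcg, or mempolicy nodes)
 * score_adj      userspace adjustment, assumed to range over -1000..1000
 *
 * Returns a value in 0..1000; higher means a more likely victim.
 */
long sketch_badness(unsigned long rss_pages, unsigned long swap_pages,
		    unsigned long allowed_pages, int score_adj)
{
	unsigned long long used;
	long points;

	if (allowed_pages == 0)
		return 0;
	if (score_adj == -1000)
		return 0;	/* assumed: task is exempt from oom killing */

	/* Widen before multiplying so large page counts cannot overflow. */
	used = (unsigned long long)rss_pages + swap_pages;
	points = (long)(used * 1000 / allowed_pages);

	points += score_adj;
	if (points < 0)
		return 0;
	if (points > 1000)
		return 1000;
	return points;
}
```

Because the score is a proportion of allowed memory, the same task scores
higher when it is confined to a smaller cpuset or memcg, which is what lets
a single adjustment value keep its meaning when the placement changes.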


^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [patch -mm 08/18] oom: badness heuristic rewrite
  2010-06-03  3:07           ` KOSAKI Motohiro
@ 2010-06-03  6:48             ` David Rientjes
  2010-06-03 23:15             ` Andrew Morton
  1 sibling, 0 replies; 99+ messages in thread
From: David Rientjes @ 2010-06-03  6:48 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Nick Piggin, Andrew Morton, Rik van Riel, Oleg Nesterov,
	KAMEZAWA Hiroyuki, Balbir Singh, linux-mm

On Thu, 3 Jun 2010, KOSAKI Motohiro wrote:

> > I would consider what I said above, "when faced with memory pressure from 
> > an out of control or memory hogging task on the desktop, the oom killer 
> > now kills it instead of a vital task such as an X server because of the 
> > use of the task's rss instead of total_vm statistic" as an improvement 
> > over killing X in those cases which it currently does.  How do you 
> > disagree?
> 
> People have observed that a simple s/total_vm/rss/ patch solves the X issue.

It doesn't, you need to consider swap as well.

> So the other additional pieces need an explanation of why they are necessary
> and how to confirm that.
> 

Are you talking about oom_score_adj?  Please read the patch description.

> In other words, I'm sure I'll continue to get OOM bug reports in the future.
> I'll need to decide whether or not to revert each patch. No information is
> unwelcome. Also, that's the reason why an all-of-rewrite patch is wrong.
> If it is merged, a small bug report is eventually going to cause a revert
> of everything. That doesn't fit our development process.
> 

You're speculating that a new problem will be introduced with this change 
that you cannot describe but are concerned that you won't be able to debug 
that unknown issue without simply reverting the entire change?  These 
"nack"ing reasons of yours are getting more and more interesting, I must 
say.
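
As a rough illustration of why swap matters here, the following is a
minimal sketch, not the code from the patch.  It assumes a score equal to
a task's resident plus swapped-out pages, expressed as parts per thousand
of the memory the task is allowed to use, with a user-supplied adjustment
added on top.  The function name, the parameter names, the 0..1000 scale
and the clamping are all assumptions made for the example; only the use of
rss, swap and an oom_score_adj-style adjustment comes from the discussion.

```c
#include <assert.h>

/*
 * Hypothetical sketch: score a task by the share of its allowed memory
 * that it occupies, counting both resident and swapped-out pages.
 * The 0..1000 scale is an assumption for illustration.  The
 * multiplication may overflow for very large page counts on 32-bit
 * systems; a real implementation would have to guard against that.
 */
long badness_sketch(unsigned long rss_pages, unsigned long swap_pages,
		    unsigned long total_pages, long score_adj)
{
	long points;

	assert(total_pages > 0);

	/* Proportion of allowed memory in use, in parts per thousand. */
	points = (long)((rss_pages + swap_pages) * 1000 / total_pages);

	/* Apply the userspace adjustment, then clamp to the assumed range. */
	points += score_adj;
	if (points < 0)
		points = 0;
	if (points > 1000)
		points = 1000;
	return points;
}
```

With rss alone, a task whose pages have mostly been swapped out would
score close to zero here even though it still accounts for a large share
of memory plus swap.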


^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [patch -mm 09/18] oom: add forkbomb penalty to badness heuristic
  2010-06-01 18:57     ` David Rientjes
@ 2010-06-03 20:33       ` David Rientjes
  0 siblings, 0 replies; 99+ messages in thread
From: David Rientjes @ 2010-06-03 20:33 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Andrew Morton, Rik van Riel, Nick Piggin, Oleg Nesterov,
	KAMEZAWA Hiroyuki, Balbir Singh, linux-mm

On Tue, 1 Jun 2010, David Rientjes wrote:

> On Tue, 1 Jun 2010, KOSAKI Motohiro wrote:
> 
> > > Add a forkbomb penalty for processes that fork an excessively large
> > > number of children to penalize that group of tasks and not others.  A
> > > threshold is configurable from userspace to determine how many first-
> > > generation execve children (those with their own address spaces) a task
> > > may have before it is considered a forkbomb.  This can be tuned by
> > > altering the value in /proc/sys/vm/oom_forkbomb_thres, which defaults to
> > > 1000.
> > > 
> > > When a task has more than 1000 first-generation children with different
> > > address spaces than itself, a penalty of
> > > 
> > > 	(average rss of children) * (# of 1st generation execve children)
> > > 	-----------------------------------------------------------------
> > > 			oom_forkbomb_thres
> > > 
> > > is assessed.  So, for example, using the default oom_forkbomb_thres of
> > > 1000, the penalty is twice the average rss of all its execve children if
> > > there are 2000 such tasks.  A task is considered to count toward the
> > > threshold if its total runtime is less than one second; for 1000 of such
> > > tasks to exist, the parent process must be forking at an extremely high
> > > rate either erroneously or maliciously.
> > > 
> > > Even though a particular task may be designated a forkbomb and selected as
> > > the victim, the oom killer will still kill the 1st generation execve child
> > > with the highest badness() score in its place.  This avoids killing
> > > important servers or system daemons.  When a web server forks a very large
> > > number of threads for client connections, for example, it is much better
> > > to kill one of those threads than to kill the server and make it
> > > unresponsive.
> > > 
> > > [oleg@redhat.com: optimize task_lock when iterating children]
> > > Signed-off-by: David Rientjes <rientjes@google.com>
> > 
> > nack
> > 
> 
> Why?
> 

Still waiting for an answer to this, KOSAKI.
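
The penalty described in the quoted changelog can be sketched as follows.
This is a minimal userspace illustration, not the patch's code: struct
child_info, its fields and forkbomb_penalty() are hypothetical names, and
it assumes that the average rss is taken over the same children that count
toward the threshold (those with their own address space and less than one
second of runtime).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-child record; the field names are illustrative only. */
struct child_info {
	unsigned long rss_pages;	/* resident set size, in pages */
	unsigned long runtime_ms;	/* total runtime, in milliseconds */
	int shares_parent_mm;		/* nonzero if it shares the parent's mm */
};

/*
 * Returns (average rss of counted children) * (number counted) / thres
 * when more than thres children qualify, and 0 otherwise.
 */
unsigned long forkbomb_penalty(const struct child_info *children, size_t n,
			       unsigned long thres)
{
	unsigned long count = 0, total_rss = 0;
	size_t i;

	assert(children != NULL || n == 0);

	for (i = 0; i < n; i++) {
		/* Threads sharing the parent's address space do not count. */
		if (children[i].shares_parent_mm)
			continue;
		/* Only children that have run for less than one second count. */
		if (children[i].runtime_ms >= 1000)
			continue;
		count++;
		total_rss += children[i].rss_pages;
	}

	/* No penalty at or below the threshold (also avoids dividing by 0). */
	if (thres == 0 || count <= thres)
		return 0;

	return (total_rss / count) * count / thres;
}
```

With 2000 qualifying children and a threshold of 1000, this returns twice
the children's average rss, matching the example in the changelog.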


^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [patch -mm 08/18] oom: badness heuristic rewrite
  2010-06-02 13:54       ` KOSAKI Motohiro
  2010-06-02 21:20         ` David Rientjes
@ 2010-06-03 23:10         ` Andrew Morton
  2010-06-03 23:53           ` KAMEZAWA Hiroyuki
  2010-06-04 10:54           ` KOSAKI Motohiro
  1 sibling, 2 replies; 99+ messages in thread
From: Andrew Morton @ 2010-06-03 23:10 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: David Rientjes, Rik van Riel, Nick Piggin, Oleg Nesterov,
	KAMEZAWA Hiroyuki, Balbir Singh, linux-mm

On Wed,  2 Jun 2010 22:54:03 +0900 (JST)
KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote:

> > Why?
> > 
> > If it's because the patch is too big, I've explained a few times that 
> > functionally you can't break it apart into anything meaningful.  I do not 
> > believe it is better to break functional changes into smaller patches that 
> > simply change function signatures to pass additional arguments that are 
> > unused in the first patch, for example.
> > 
> > If it's because it adds /proc/pid/oom_score_adj in the same patch, that's 
> > allowed since otherwise it would be useless with the old heuristic.  In 
> > other words, you cannot apply oom_score_adj's meaning to the bitshift in 
> > any sane way.
> > 
> > I'll suggest what I have multiple times: the easiest way to review the 
> > functional change here is to merge the patch into your own tree and then 
> > review oom_badness().  I agree that the way the diff comes out it is a 
> > little difficult to read just from the patch form, so merging it and 
> > reviewing the actual heuristic function is the easiest way.
> 
> I've already explained the reason. 1) all-of-rewrite patches are
> always unacceptable. That prevents our code maintenance.

No, we'll sometimes completely replace implementations.  There's no hard
rule apart from "whatever makes sense".  If wholesale replacement makes
sense as a patch-presentation method then we'll do that.

> 2) patches with no justification
> are also unacceptable. You need to write a more proper patch description
> at least.

The descriptions look better than usual from a quick scan.  I haven't
really got into them yet.


And I'm going to have to get into it because of you guys' seeming
inability to get your act together.

The unsubstantiated "nack"s are of no use and I shall just be ignoring
them and making my own decisions.  If you have specific objections then
let's hear them.  In detail, please - don't refer to previous
conversations because that's all too confusing - there is benefit in
starting again.

I expect I'll be looking at the oom-killer situation in depth early
next week.  It would be useful if between now and then you can send
any specific, detailed and actionable comments which you have.


^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [patch -mm 08/18] oom: badness heuristic rewrite
  2010-06-03  3:07           ` KOSAKI Motohiro
  2010-06-03  6:48             ` David Rientjes
@ 2010-06-03 23:15             ` Andrew Morton
  2010-06-04 10:54               ` KOSAKI Motohiro
  1 sibling, 1 reply; 99+ messages in thread
From: Andrew Morton @ 2010-06-03 23:15 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: David Rientjes, Nick Piggin, Rik van Riel, Oleg Nesterov,
	KAMEZAWA Hiroyuki, Balbir Singh, linux-mm

On Thu,  3 Jun 2010 12:07:50 +0900 (JST)
KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote:

> In other words, I'm sure I'll continue to get OOM bug reports in the future.

You must have some reason for believing that.  Please share it with us.

Even better: apply the patches and run some tests.  If you believe
there are new failure modes then surely you can quickly prepare a
testcase which demonstrates them.

Or just suggest a test case - I expect David will be able to test it.

Again: without hard, tangible engineering facts I cannot take comments
such as the above into account.


^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [patch -mm 08/18] oom: badness heuristic rewrite
  2010-06-03 23:10         ` Andrew Morton
@ 2010-06-03 23:53           ` KAMEZAWA Hiroyuki
  2010-06-04  0:04             ` Andrew Morton
                               ` (2 more replies)
  2010-06-04 10:54           ` KOSAKI Motohiro
  1 sibling, 3 replies; 99+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-06-03 23:53 UTC (permalink / raw)
  To: Andrew Morton
  Cc: KOSAKI Motohiro, David Rientjes, Rik van Riel, Nick Piggin,
	Oleg Nesterov, Balbir Singh, linux-mm

On Thu, 3 Jun 2010 16:10:30 -0700
Andrew Morton <akpm@linux-foundation.org> wrote:

> On Wed,  2 Jun 2010 22:54:03 +0900 (JST)
> KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote:
> 
> > > Why?
> > > 
> > > If it's because the patch is too big, I've explained a few times that 
> > > functionally you can't break it apart into anything meaningful.  I do not 
> > > believe it is better to break functional changes into smaller patches that 
> > > simply change function signatures to pass additional arguments that are 
> > > unused in the first patch, for example.
> > > 
> > > If it's because it adds /proc/pid/oom_score_adj in the same patch, that's 
> > > allowed since otherwise it would be useless with the old heuristic.  In 
> > > other words, you cannot apply oom_score_adj's meaning to the bitshift in 
> > > any sane way.
> > > 
> > > I'll suggest what I have multiple times: the easiest way to review the 
> > > functional change here is to merge the patch into your own tree and then 
> > > review oom_badness().  I agree that the way the diff comes out it is a 
> > > little difficult to read just from the patch form, so merging it and 
> > > reviewing the actual heuristic function is the easiest way.
> > 
> > I've already explained the reason. 1) all-of-rewrite patches are
> > always unacceptable. That prevents our code maintenance.
> 
> No, we'll sometimes completely replace implementations.  There's no hard
> rule apart from "whatever makes sense".  If wholesale replacement makes
> sense as a patch-presentation method then we'll do that.
> 
I agree. 

IMHO.

But this series includes both bug fixes and new features at random.
Then small bugfixes, which don't require refactoring, seem to do that.
That irritates people (at least me) because it seems that he is trying to sneak
his own new logic into bugfixes and, moreover, it makes backporting to distros difficult.
I'd like to beg him to separate them into 2 series: bugfixes and something new.


Thanks,
-Kame

 




^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [patch -mm 08/18] oom: badness heuristic rewrite
  2010-06-03 23:53           ` KAMEZAWA Hiroyuki
@ 2010-06-04  0:04             ` Andrew Morton
  2010-06-04  0:20               ` KAMEZAWA Hiroyuki
  2010-06-04  9:19             ` David Rientjes
  2010-06-04  9:43             ` Oleg Nesterov
  2 siblings, 1 reply; 99+ messages in thread
From: Andrew Morton @ 2010-06-04  0:04 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: KOSAKI Motohiro, David Rientjes, Rik van Riel, Nick Piggin,
	Oleg Nesterov, Balbir Singh, linux-mm

On Fri, 4 Jun 2010 08:53:47 +0900
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:

> On Thu, 3 Jun 2010 16:10:30 -0700
> Andrew Morton <akpm@linux-foundation.org> wrote:
> 
> > On Wed,  2 Jun 2010 22:54:03 +0900 (JST)
> > KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote:
> > 
> > > > Why?
> > > > 
> > > > If it's because the patch is too big, I've explained a few times that 
> > > > functionally you can't break it apart into anything meaningful.  I do not 
> > > > believe it is better to break functional changes into smaller patches that 
> > > > simply change function signatures to pass additional arguments that are 
> > > > unused in the first patch, for example.
> > > > 
> > > > If it's because it adds /proc/pid/oom_score_adj in the same patch, that's 
> > > > allowed since otherwise it would be useless with the old heuristic.  In 
> > > > other words, you cannot apply oom_score_adj's meaning to the bitshift in 
> > > > any sane way.
> > > > 
> > > > I'll suggest what I have multiple times: the easiest way to review the 
> > > > functional change here is to merge the patch into your own tree and then 
> > > > review oom_badness().  I agree that the way the diff comes out it is a 
> > > > little difficult to read just from the patch form, so merging it and 
> > > > reviewing the actual heuristic function is the easiest way.
> > > 
> > > I've already explained the reason. 1) all-of-rewrite patches are
> > > always unacceptable. That prevents our code maintenance.
> > 
> > No, we'll sometimes completely replace implementations.  There's no hard
> > rule apart from "whatever makes sense".  If wholesale replacement makes
> > sense as a patch-presentation method then we'll do that.
> > 
> I agree. 
> 
> IMHO.
> 
> But this series includes both bug fixes and new features at random.
> Then small bugfixes, which don't require refactoring, seem to do that.
> That irritates people (at least me) because it seems that he is trying to sneak
> his own new logic into bugfixes and, moreover, it makes backporting to distros difficult.
> I'd like to beg him to separate them into 2 series: bugfixes and something new.
> 

Sure, bugfixes should come separately and first.  For a number of
reasons:

- people (including the -stable maintainers) might want to backport them

- we might end up not merging the larger, bugfix-including patches at all

- the large bugfix-including patches might blow up and need
  reverting.  If we do that, we accidentally revert bugfixes!

Have we identified specifically which bugfixes should be separated out
in this fashion?


^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [patch -mm 08/18] oom: badness heuristic rewrite
  2010-06-04  0:04             ` Andrew Morton
@ 2010-06-04  0:20               ` KAMEZAWA Hiroyuki
  2010-06-04  5:57                 ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 99+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-06-04  0:20 UTC (permalink / raw)
  To: Andrew Morton
  Cc: KOSAKI Motohiro, David Rientjes, Rik van Riel, Nick Piggin,
	Oleg Nesterov, Balbir Singh, linux-mm

On Thu, 3 Jun 2010 17:04:43 -0700
Andrew Morton <akpm@linux-foundation.org> wrote:

> Sure, bugfixes should come separately and first.  For a number of
> reasons:
> 
> - people (including the -stable maintainers) might want to backport them
> 
> - we might end up not merging the larger, bugfix-including patches at all
> 
> - the large bugfix-including patches might blow up and need
>   reverting.  If we do that, we accidentally revert bugfixes!
> 
> Have we identified specifically which bugfixes should be separated out
> in this fashion?
> 

In my personal observation

 [1/18]  for better behavior under cpuset.
 [2/18]  for better behavior under cpuset.
 [3/18]  for better behavior under mempolicy.
 [4/18]  refactoring.
 [5/18]  refactoring.
 [6/18]  clean up.
 [7/18]  changing the default sysctl value.
 [8/18]  completely new logic.
 [9/18]  completely new logic.
 [10/18] a supplement for 8,9.
 [11/18] for better behavior under lowmem oom (disable oom kill)
 [12/18] clean up
 [13/18] bugfix for a possible race condition. (I'm not sure about details)
 [14/18] bugfix
 [15/18] bugfix
 [16/18] bugfix
 [17/18] bugfix
 [18/18] clean up.

If distro admins are aggressive, they may backport 1,2,3,7,11, but
that changes the current logic. So, it's the distro's decision.

Thanks,
-Kame




^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [patch -mm 08/18] oom: badness heuristic rewrite
  2010-06-04  0:20               ` KAMEZAWA Hiroyuki
@ 2010-06-04  5:57                 ` KAMEZAWA Hiroyuki
  2010-06-04  9:22                   ` David Rientjes
  0 siblings, 1 reply; 99+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-06-04  5:57 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Andrew Morton, KOSAKI Motohiro, David Rientjes, Rik van Riel,
	Nick Piggin, Oleg Nesterov, Balbir Singh, linux-mm

On Fri, 4 Jun 2010 09:20:47 +0900
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:

> On Thu, 3 Jun 2010 17:04:43 -0700
> Andrew Morton <akpm@linux-foundation.org> wrote:
> 
> > Sure, bugfixes should come separately and first.  For a number of
> > reasons:
> > 
> > - people (including the -stable maintainers) might want to backport them
> > 
> > - we might end up not merging the larger, bugfix-including patches at all
> > 
> > - the large bugfix-including patches might blow up and need
> >   reverting.  If we do that, we accidentally revert bugfixes!
> > 
> > Have we identified specifically which bugfixes should be separated out
> > in this fashion?
> > 
> 
> In my personal observation
> 
>  [1/18]  for better behavior under cpuset.
>  [2/18]  for better behavior under cpuset.
>  [3/18]  for better behavior under mempolicy.
>  [4/18]  refactoring.
>  [5/18]  refactoring.
>  [6/18]  clean up.
>  [7/18]  changing the default sysctl value.
>  [8/18]  completely new logic.
>  [9/18]  completely new logic.
>  [10/18] a supplement for 8,9.
>  [11/18] for better behavior under lowmem oom (disable oom kill)
>  [12/18] clean up
>  [13/18] bugfix for a possible race condition. (I'm not sure about details)
>  [14/18] bugfix
>  [15/18] bugfix
>  [16/18] bugfix
>  [17/18] bugfix
>  [18/18] clean up.
> 
> If distro admins are aggressive, they may backport 1,2,3,7,11, but
> that changes the current logic. So, it's the distro's decision.
> 

IMHO, without considering HUNKs, the patch order should be

  13,14,15,16,17,1,2,3,7,11,4,5,6,18,12,8,9,10.

bugfix -> patches that make things better -> refactoring -> the new implementation.

David, I have no objections to the functions themselves. But please start from small
good things. "Refactoring" is good, but it tends to make backporting
less straightforward. So, I think it should be done when there are no known issues.
I think you can do it.

Bye,
-Kame



^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [patch -mm 08/18] oom: badness heuristic rewrite
  2010-06-03 23:53           ` KAMEZAWA Hiroyuki
  2010-06-04  0:04             ` Andrew Morton
@ 2010-06-04  9:19             ` David Rientjes
  2010-06-04  9:43             ` Oleg Nesterov
  2 siblings, 0 replies; 99+ messages in thread
From: David Rientjes @ 2010-06-04  9:19 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Andrew Morton, KOSAKI Motohiro, Rik van Riel, Nick Piggin,
	Oleg Nesterov, Balbir Singh, linux-mm

On Fri, 4 Jun 2010, KAMEZAWA Hiroyuki wrote:

> > No, we'll sometimes completely replace implementations.  There's no hard
> > rule apart from "whatever makes sense".  If wholesale replacement makes
> > sense as a patch-presentation method then we'll do that.
> > 
> I agree. 
> 
> IMHO.
> 
> But this series includes both bug fixes and new features at random.
> Then small bugfixes, which don't require refactoring, seem to do that.
> That irritates people (at least me) because it seems that he is trying to sneak
> his own new logic into bugfixes and, moreover, it makes backporting to distros difficult.

I'll reply to your proposed patch order in your other email, but please 
don't think that I'm trying to sneak anything in with this series :)  It's 
been posted here for months and everything has been fully open to review 
and comment.  Most of the patches that have been added on after the 
heuristic rewrite were things that came up later in testing and 
inspection, so I understand how the series has a somewhat awkward flow.  
I'll fix that.


^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [patch -mm 08/18] oom: badness heuristic rewrite
  2010-06-04  5:57                 ` KAMEZAWA Hiroyuki
@ 2010-06-04  9:22                   ` David Rientjes
  0 siblings, 0 replies; 99+ messages in thread
From: David Rientjes @ 2010-06-04  9:22 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Andrew Morton, KOSAKI Motohiro, Rik van Riel, Nick Piggin,
	Oleg Nesterov, Balbir Singh, linux-mm

On Fri, 4 Jun 2010, KAMEZAWA Hiroyuki wrote:

> > In my personal observation
> > 
> >  [1/18]  for better behavior under cpuset.
> >  [2/18]  for better behavior under cpuset.
> >  [3/18]  for better behavior under mempolicy.
> >  [4/18]  refactoring.
> >  [5/18]  refactoring.
> >  [6/18]  clean up.
> >  [7/18]  changing the default sysctl value.
> >  [8/18]  completely new logic.
> >  [9/18]  completely new logic.
> >  [10/18] a supplement for 8,9.
> >  [11/18] for better behavior under lowmem oom (disable oom kill)
> >  [12/18] clean up
> >  [13/18] bugfix for a possible race condition. (I'm not sure about details)
> >  [14/18] bugfix
> >  [15/18] bugfix
> >  [16/18] bugfix
> >  [17/18] bugfix
> >  [18/18] clean up.
> > 
> > If distro admins are aggressive, they may backport 1,2,3,7,11, but
> > that changes the current logic. So, it's the distro's decision.
> > 
> 
> IMHO, without considering HUNKs, the patch order should be
> 
>   13,14,15,16,17,1,2,3,7,11,4,5,6,18,12,8,9,10.
> 
> bugfix -> patches that make things better -> refactoring -> the new implementation.
> 

Thank you very much for taking the time to look through each 
individual patch and suggest a different order.  If the ordering of the 
patches will help move us forward, then I'd be extremely happy to do it :)

> David, I have no objections to the functions themselves. But please start from small
> good things. "Refactoring" is good, but it tends to make backporting
> less straightforward. So, I think it should be done when there are no known issues.
> I think you can do it.
> 

I'll reorganize the patchset itself without any implementation changes so 
it flows better and is more appropriately separated as you suggest.  I 
still believe there is no -rc material within this series (implying there 
is no -stable material either), but if you believe so then please reply to 
those patches with the new posting so Andrew can consider pushing it to 
Linus.

Thanks Kame.


^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [patch -mm 08/18] oom: badness heuristic rewrite
  2010-06-03 23:53           ` KAMEZAWA Hiroyuki
  2010-06-04  0:04             ` Andrew Morton
  2010-06-04  9:19             ` David Rientjes
@ 2010-06-04  9:43             ` Oleg Nesterov
  2 siblings, 0 replies; 99+ messages in thread
From: Oleg Nesterov @ 2010-06-04  9:43 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Andrew Morton, KOSAKI Motohiro, David Rientjes, Rik van Riel,
	Nick Piggin, Balbir Singh, linux-mm

On 06/04, KAMEZAWA Hiroyuki wrote:
>
> On Thu, 3 Jun 2010 16:10:30 -0700
> Andrew Morton <akpm@linux-foundation.org> wrote:
>
> But this series includes both bug fixes and new features at random.
> Then small bugfixes, which don't require refactoring, seem to do that.
> That irritates people (at least me)

Me too.

And Kosaki tries to fix these long-standing (and obvious) bugs first,
before refactoring.

So far (iiuc) David technically disagrees with the single patch which
removes the PF_EXITING check. OK, probably it needs more discussion
(once again: I can't judge, but I understand why Kosaki removed it).

Oleg.


^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [patch -mm 08/18] oom: badness heuristic rewrite
  2010-06-03 23:15             ` Andrew Morton
@ 2010-06-04 10:54               ` KOSAKI Motohiro
  0 siblings, 0 replies; 99+ messages in thread
From: KOSAKI Motohiro @ 2010-06-04 10:54 UTC (permalink / raw)
  To: Andrew Morton
  Cc: kosaki.motohiro, David Rientjes, Nick Piggin, Rik van Riel,
	Oleg Nesterov, KAMEZAWA Hiroyuki, Balbir Singh, linux-mm

Hi


> > In other words, I'm sure I'll continue to get OOM bug reports in the future.
> 
> You must have some reason for believing that.  Please share it with us.

In the past, OOM bug reports haven't stopped. Why would I believe that a
miracle will occur?

The fact is, any heuristic change carries a risk, because we can't know
every use case in the world. So I don't think that we must not change
anything, nor that we must never make a mistake. I only want to take care
to keep trackability.



> Even better: apply the patches and run some tests.  If you believe
> there are new failure modes then surely you can quickly prepare a
> testcase which demonstrates them.
> 
> Or just suggest a test case - I expect David will be able to test it.
> 
> Again: without hard, tangible engineering facts I cannot take comments
> such as the above into account.

OK. I also aim to provide good and productive information. But I also 
have requests.
Recently, mainly Oleg has pointed out some races and heuristic failures. I don't
want your engineer to ignore such bug reports. Please help with bugfixes too.
Otherwise, I'll get upset again.



^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [patch -mm 08/18] oom: badness heuristic rewrite
  2010-06-03 23:10         ` Andrew Morton
  2010-06-03 23:53           ` KAMEZAWA Hiroyuki
@ 2010-06-04 10:54           ` KOSAKI Motohiro
  2010-06-04 20:57             ` David Rientjes
  1 sibling, 1 reply; 99+ messages in thread
From: KOSAKI Motohiro @ 2010-06-04 10:54 UTC (permalink / raw)
  To: Andrew Morton
  Cc: kosaki.motohiro, David Rientjes, Rik van Riel, Nick Piggin,
	Oleg Nesterov, KAMEZAWA Hiroyuki, Balbir Singh, linux-mm

Hi Andrew,

> > I've already explained the reason. 1) all-of-rewrite patches are
> > always unacceptable. That prevents our code maintenance.
> 
> No, we'll sometimes completely replace implementations.  There's no hard
> rule apart from "whatever makes sense".  If wholesale replacement makes
> sense as a patch-presentation method then we'll do that.

Have you reviewed the actual patches? And no, I don't think "complete
replacement with no test results" is an adequate way to develop.

And when developers post a large patch set, usually _you_ ask them to show
demonstrated results. I haven't seen such results in this activity.

I agree that OOM is invoked from various callsites (because the page
allocator is called from various places), is triggered by various kinds of
memory starvation, and that killable userland processes also vary widely.
So I don't think the patch author must do a 100% coverage test.

And I can say that I made some brief test cases to confirm this, and
I haven't seen a critical fault.

However, that doesn't give any reason to avoid code review and violate
our development process.


> > 2) patches with no justification
> > are also unacceptable. You need to write a more proper patch description
> > at least.
> 
> The descriptions look better than usual from a quick scan.  I haven't
> really got into them yet.
> 
> 
> And I'm going to have to get into it because of you guys' seeming
> inability to get your act together.

Inability? What do you mean by inability? Almost all developers cooperate
to make a stable kernel. Is this effort inability? Or meaningless?

Actually, the descriptions really don't look better. We have sometimes
asked him
 - which problem occurs? how do you reproduce it?
 - which piece solves which issue?
 - how do you measure side effects?
 - how do you measure or consider users of other workloads?

But the only answer I got was "My patch is best. Resistance is futile". That's
purely Baaaad.

At the least, every patch author must write down the intention of the code.
Otherwise, how do we review such code? Guessing the intention often leads to
misreading the code and lets bugs slip in. If the patch is small enough, it is
not a big problem; we don't misread so often. But if it's large, it is a big
problem.

Again, I don't think that the patch can't be separated into individual parts,
and I don't think that the intention of each change can't be written down.


> The unsubstantiated "nack"s are of no use and I shall just be ignoring
> them and making my own decisions.  If you have specific objections then
> let's hear them.  In detail, please - don't refer to previous
> conversations because that's all too confusing - there is benefit in
> starting again.

OK. I don't have any reason to confuse you. I'll fix that on my side. My
point is really simple. The majority of OOM users are on the desktop. We
must not ignore them. For example:

 - Any regression from the desktop view is unacceptable
 - Any incompatibility without desktop improvement is unacceptable
 - Any refusal to fix bugs is unacceptable
 - Any refusal of review is unacceptable (IOW, the patches must get an ack
   from some developer. I'm OK even if that doesn't include me)

In other words, every heuristic change has to come with an explanation of
why the patch improves the desktop or has no side effect on the desktop.
(ah, ok. cpuset changes are one exception. desktop users definitely
don't use it)

I and any other reviewer only want to confirm that the patches have no
significant regression. Every patch author has to help with this, I think.


> I expect I'll be looking at the oom-killer situation in depth early
> next week.  It would be useful if between now and then you can send
> any specific, detailed and actionable comments which you have.

1) fix bugs first before making new features (a.k.a. new bugs)
2) don't mix bugfixes and new features
3) separate the series into natural, individual pieces
4) keep patch sizes small and reviewable
5) stop the ugly excuses; instead, rewrite repeatedly until someone acks
6) don't ignore other developers' bug reports

Which of these is unactionable? I just don't understand :/
I didn't want to say the same thing twice, and he repeatedly ignored
my opinion; thus, he got a short answer. I didn't think this was inadequate
because he can google past mail.

The fact is, I and (I'm guessing) all the other developers don't get any
pressure from our companies, because enterprise vendors aren't interested
in oom. We are making time by cutting into our private time to help improve
his patch. That's because we know the current oom logic doesn't fit today's
modern desktop environment, and we surely hope to remove such harm.

However, he repeatedly attacks our goodwill and blames our tolerance,
and also repeatedly says "My workload is more important than others!".
Then I really got upset.

The fact is, no good developer ever says "my workload is the most
important in the world"; it makes no sense and is insane. I really hate
such selfishness.


And no, I don't want to continue a full review while the author refuses
to listen. You must be kidding me. Instead, I'll cherry-pick the good pieces
from the sludge of at-random patches and push them to you. I think that makes
everybody happy: people get the improvement, DaveR gets the merge, and I'll
be free from this source of frustration. Of course, I'll reflect your review
results if you can find time to review.



* Re: [patch -mm 08/18] oom: badness heuristic rewrite
  2010-06-04 10:54           ` KOSAKI Motohiro
@ 2010-06-04 20:57             ` David Rientjes
  2010-06-08 11:41               ` KOSAKI Motohiro
  0 siblings, 1 reply; 99+ messages in thread
From: David Rientjes @ 2010-06-04 20:57 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Andrew Morton, Rik van Riel, Nick Piggin, Oleg Nesterov,
	KAMEZAWA Hiroyuki, Balbir Singh, linux-mm

On Fri, 4 Jun 2010, KOSAKI Motohiro wrote:

> Have you reviewed the actual patches? And no, I don't think "complete 
> replacement with no test results" is an adequate way of development.
> 

I have repeatedly said that the oom killer no longer kills KDE when run on 
my desktop in the presence of a memory hogging task that was written 
specifically to oom the machine.  That's a better result than the 
current implementation and was discussed thoroughly during the discussion 
on this mailing list back in February that inspired this rewrite to begin 
with.  I don't think there's any mystery there since you've referred to 
that change specifically for KDE in this thread yourself.

> And, when developers post a large patch set, usually _you_ request that
> they demonstrate results. I haven't seen such results in this activity.
> 

You want to see a log that says "Killed process 1234 (memory-hogger)..." 
instead of "Killed process 1234 (kdeinit)..."?  You've supported the 
change from total_vm to rss as a baseline to begin with.  And after all 
this discussion, this is the first time you've ever said you wanted to see 
that type of log or anything like it.

> However, It doesn't give any reason to avoid code review and violate
> our development process.
> 

Nobody is avoiding code review here, that's pretty obvious, and I have no 
idea what you're referring to when you say I'm violating the development 
process because this happens to rewrite an entire function and requires a 
new user interface and callsite fixups to be meaningful.  You specifically 
asked me to push the forkbomb detector in a different patch and I did that 
because it makes sense to separate that heuristic, but even then you just 
wrote "nack" and haven't responded with why even after I've replied twice 
asking.  I'm really confused by this behavior.

> > And I'm going to have to get into it because of you guys' seeming
> > inability to get your act together.
> 
> Inability? What do you mean by inability? Almost all developers cooperate 
> to make a stable kernel. Is this effort inability? Or meaningless?
> 

I think he's saying that he expects that we should be able to work 
cooperatively in resolving any differences that we have in a respectful 
and technical manner on this list.

But I'll also add my two cents in that and say that we should probably be 
leaving maintainer duties up to the actual -mm tree maintainer, he knows 
the development process you're talking about pretty well.

> Actually, the descriptions don't really look better. We sometimes
> ask him
>  - which problem occurs? how do you reproduce it?

KDE gets killed, memory hogger doesn't.  Run memory hogger on your 
desktop.  KOSAKI, this isn't a surprise to you.

If this is your objection, I can certainly elaborate more in the changelog 
but up until yesterday you've never said you have a problem with it so how 
am I supposed to make any forward progress on this?  I can't read your 
mind when you say "nack" and I'd like to resolve any issues that people 
have, but that requires that they get involved.

>  - which piece solves which issue?

Mostly the baseline heuristic change to rss and swap, as you well know.

>  - how do you measure side effects?

As far as the objective of the oom killer is concerned as listed in 
mm/oom_kill.c's header, there are no side effects.  We're trying to kill a 
task that will free the largest amount of memory and clearly rss and swap 
is a better indication of that than total_vm.

>  - how do you measure or consider users of other workloads
> 

The objective of the oom killer is not different for different workloads.

> But the only answer I got was "My patch is best. Resistance is futile".
> That's purely Baaaad.
> 

I haven't said anything new in the above, KOSAKI, you already knew all 
this.  I'll update the changelog to include some of this information for 
the next posting, but I'd really hope that this isn't the major problem 
that you've had the entire time that we've stalled weeks on.

> OK. I don't have any reason to confuse you. I'll correct myself. My point
> is really simple. The majority of OOM users are on the desktop. We must
> not ignore them. For example:
> 
>  - Any regression from the desktop point of view is unacceptable

This patchset was specifically designed to improve the oom killer's 
behavior on the desktop!

>  - Any incompatibility without a desktop improvement is unacceptable

I don't understand this.

>  - Any refusal of a bugfix is unacceptable

I've merged most of Oleg's work into this patchset, the problem that we're 
having is deciding whether any of it is -rc material or not and should be 
pushed first.  I don't think any of it is, Oleg certainly wasn't pushing 
it and to date I don't believe has said it's rc material, so that's 
something you can talk about but I'm not refusing any bugfix.

>  - Any refusal of review is unacceptable (IOW, the patches must get some
>    developer's ack. I'm OK even if that doesn't include me)
> 

I've been begging for you to review this.

> 1) fix bugs first before making new features (a.k.a. new bugs)

Kame already suggested a new order to the patchset that I'll be 
restructuring.  I'm curious as to why this was removed from -mm though on 
your suggestion before any of this became an issue.  We've yet to hear 
that mysterious information.

> 2) don't mix bugfixes and new features

Andrew said bugfixes should come first, they will in the reposting, but I 
don't consider any of it to be -rc material.

> 3) separate the patches into natural, individual pieces

I can't keep having this conversation, the patch is broken down into one 
functional unit as much as possible.  Please leave the maintainership of 
this code to Andrew who has already said entire implementation changes (in 
this case, a single function rewrite) are allowed if they make sense.

> 4) keep patches small and reviewable

Same as above.

> 5) stop the ugly excuses; instead, rewrite repeatedly until you get someone's ack

I don't know what my ugly excuse is, but I'll be reordering the patches 
and sending them with an updated changelog on the badness heuristic 
rewrite.  I hope that will satisfy all your concerns.

> 6) don't ignore other developers' bug reports
> 

If you have a bug report that is the result of this rewrite, please come 
forward with it and don't carry this out by making me guess again.

> I didn't want to say the same thing twice, and he repeatedly ignored
> my opinion; thus, he got a short answer. I didn't think this was inadequate
> because he can google the past mails.
> 

No, you've never said this is the reason why it was dropped from -mm or 
why it was "nack"'d early on.

> However, he repeatedly attacked our goodwill and blamed our tolerance,
> and also repeatedly said "My workload is more important than others!".
> Then I really got upset.
> 

What??  I don't even have a specific workload that I'm targeting with this 
change, I have no idea what you're referring to, we don't run much stuff 
on the desktop :)

> The fact is, no good developer ever says "my workload is the most
> important in the world"; it makes no sense and is insane. I really hate
> such selfishness.

Again, this is just a ridiculous accusation.  I have no idea what you're 
referring to since this rewrite is specifically addressed to fix the oom 
killer problems on the desktop.  I work on servers and systems software, I 
don't have a desktop workload that I'm advocating for here, so perhaps you 
got me confused with someone else.


* Re: [patch -mm 01/18] oom: filter tasks not sharing the same cpuset
  2010-06-01  7:18 ` [patch -mm 01/18] oom: filter tasks not sharing the same cpuset David Rientjes
  2010-06-01  7:20   ` KOSAKI Motohiro
@ 2010-06-08 11:41   ` KOSAKI Motohiro
  2010-06-08 18:37     ` David Rientjes
  2010-06-08 11:41   ` KOSAKI Motohiro
  2 siblings, 1 reply; 99+ messages in thread
From: KOSAKI Motohiro @ 2010-06-08 11:41 UTC (permalink / raw)
  To: David Rientjes
  Cc: kosaki.motohiro, Andrew Morton, Rik van Riel, Nick Piggin,
	Oleg Nesterov, KAMEZAWA Hiroyuki, Balbir Singh, linux-mm

> @@ -267,6 +259,8 @@ static struct task_struct *select_bad_process(unsigned long *ppoints,
>  			continue;
>  		if (mem && !task_in_mem_cgroup(p, mem))
>  			continue;
> +		if (!has_intersects_mems_allowed(p))
> +			continue;
>  
>  		/*
>  		 * This task already has access to memory reserves and is

Now we have three places that do oom filtering:
  (1) select_bad_process
  (2) dump_tasks
  (3) oom_kill_task (when oom_kill_allocating_task==1 only)

This patch only adds the check to (1). I think we need it in (2) and (3) too.
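
A minimal userspace sketch of the point above; it is not kernel code. It
assumes a hypothetical struct task with a bitmask of allowed memory nodes
and a hypothetical predicate oom_eligible(). The only idea it shows is that
victim selection and the task dump can share one check so that they cannot
disagree; the kill path in (3) could call the same predicate.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a task: one bit per allowed memory node. */
struct task {
	const char *name;
	unsigned long mems_allowed;
	unsigned long points;	/* precomputed badness, for illustration */
};

/* Single eligibility check shared by every oom filtering site. */
static bool oom_eligible(const struct task *cur, const struct task *p)
{
	return (cur->mems_allowed & p->mems_allowed) != 0;
}

/* Site (1): pick the eligible task with the highest score. */
static const struct task *select_victim(const struct task *cur,
					const struct task *tasks, size_t n)
{
	const struct task *chosen = NULL;
	size_t i;

	for (i = 0; i < n; i++) {
		if (!oom_eligible(cur, &tasks[i]))
			continue;
		if (!chosen || tasks[i].points > chosen->points)
			chosen = &tasks[i];
	}
	return chosen;
}

/* Site (2): count the tasks a dump would list, using the same check. */
static size_t count_dumped(const struct task *cur,
			   const struct task *tasks, size_t n)
{
	size_t i, listed = 0;

	for (i = 0; i < n; i++)
		if (oom_eligible(cur, &tasks[i]))
			listed++;
	return listed;
}
```

With this shape, a task that can never be selected also never shows up in
the dump, which is the consistency asked for here.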




* Re: [patch -mm 02/18] oom: sacrifice child with highest badness score for parent
  2010-06-01  7:18 ` [patch -mm 02/18] oom: sacrifice child with highest badness score for parent David Rientjes
  2010-06-01  7:39   ` KOSAKI Motohiro
@ 2010-06-08 11:41   ` KOSAKI Motohiro
  2010-06-08 18:41     ` David Rientjes
  2010-06-08 11:41   ` KOSAKI Motohiro
  2 siblings, 1 reply; 99+ messages in thread
From: KOSAKI Motohiro @ 2010-06-08 11:41 UTC (permalink / raw)
  To: David Rientjes
  Cc: kosaki.motohiro, Andrew Morton, Rik van Riel, Nick Piggin,
	Oleg Nesterov, KAMEZAWA Hiroyuki, Balbir Singh, linux-mm

> Reviewers may observe that the previous implementation would iterate
> through the children and attempt to kill each until one was successful and
> then the parent if none were found while the new code simply kills the
> most memory-hogging task or the parent.  Note that the only time
> oom_kill_task() fails, however, is when a child does not have an mm or has
> a /proc/pid/oom_adj of OOM_DISABLE.  badness() returns 0 for both cases,
> so the final oom_kill_task() will always succeed.

We probably need to call has_intersects_mems_allowed() in this loop, like this:

        /* Try to sacrifice the worst child first */
        do {
                list_for_each_entry(c, &t->children, sibling) {
                        unsigned long cpoints;

                        if (c->mm == p->mm)
                                continue;
                        if (oom_unkillable(c, mem, nodemask))
                                continue;

                        /* oom_badness() returns 0 if the thread is unkillable */
                        cpoints = oom_badness(c);
                        if (cpoints > victim_points) {
                                victim = c;
                                victim_points = cpoints;
                        }
                }
        } while_each_thread(p, t);


That means we perhaps shouldn't assume that parent and child have the same
mems_allowed.




* Re: [patch -mm 11/18] oom: avoid oom killer for lowmem allocations
  2010-06-01  7:18 ` [patch -mm 11/18] oom: avoid oom killer for lowmem allocations David Rientjes
  2010-06-01  7:38   ` KOSAKI Motohiro
@ 2010-06-08 11:41   ` KOSAKI Motohiro
  2010-06-08 18:38     ` David Rientjes
  1 sibling, 1 reply; 99+ messages in thread
From: KOSAKI Motohiro @ 2010-06-08 11:41 UTC (permalink / raw)
  To: David Rientjes
  Cc: kosaki.motohiro, Andrew Morton, Rik van Riel, Nick Piggin,
	Oleg Nesterov, KAMEZAWA Hiroyuki, Balbir Singh, linux-mm

> Previously, the heuristic provided some protection for those tasks with
> CAP_SYS_RAWIO, but this is no longer necessary since we will not be
> killing tasks for the purposes of ISA allocations.

This seems incorrect. CAP_SYS_RAWIO tasks usually use both GFP_KERNEL and
GFP_DMA. Even if the last allocation is GFP_KERNEL, that doesn't provide any
guarantee that the process doesn't have any in-flight I/O.

So we can't remove the RAWIO protection from the oom heuristics. The code
itself seems OK, though.




* Re: [patch -mm 01/18] oom: filter tasks not sharing the same cpuset
  2010-06-01  7:18 ` [patch -mm 01/18] oom: filter tasks not sharing the same cpuset David Rientjes
  2010-06-01  7:20   ` KOSAKI Motohiro
  2010-06-08 11:41   ` KOSAKI Motohiro
@ 2010-06-08 11:41   ` KOSAKI Motohiro
  2010-06-08 18:43     ` David Rientjes
  2 siblings, 1 reply; 99+ messages in thread
From: KOSAKI Motohiro @ 2010-06-08 11:41 UTC (permalink / raw)
  To: David Rientjes
  Cc: kosaki.motohiro, Andrew Morton, Rik van Riel, Nick Piggin,
	Oleg Nesterov, KAMEZAWA Hiroyuki, Balbir Singh, linux-mm

> Tasks that do not share the same set of allowed nodes with the task that
> triggered the oom should not be considered as candidates for oom kill.
> 
> Tasks in other cpusets with a disjoint set of mems would be unfairly
> penalized otherwise because of oom conditions elsewhere; an extreme
> example could unfairly kill all other applications on the system if a
> single task in a user's cpuset sets itself to OOM_DISABLE and then uses
> more memory than allowed.
> 
> Killing tasks outside of current's cpuset rarely would free memory for
> current anyway.  To use a sane heuristic, we must ensure that killing a
> task would likely free memory for current and avoid needlessly killing
> others at all costs just because their potential memory freeing is
> unknown.  It is better to kill current than another task needlessly.

I've put the following historical remark in the description of the patch.


    We applied exactly the same patch in 2005:

        : commit ef08e3b4981aebf2ba9bd7025ef7210e8eec07ce
        : Author: Paul Jackson <pj@sgi.com>
        : Date:   Tue Sep 6 15:18:13 2005 -0700
        :
        : [PATCH] cpusets: confine oom_killer to mem_exclusive cpuset
        :
        : Now the real motivation for this cpuset mem_exclusive patch series seems
        : trivial.
        :
        : This patch keeps a task in or under one mem_exclusive cpuset from provoking an
        : oom kill of a task under a non-overlapping mem_exclusive cpuset.  Since only
        : interrupt and GFP_ATOMIC allocations are allowed to escape mem_exclusive
        : containment, there is little to gain from oom killing a task under a
        : non-overlapping mem_exclusive cpuset, as almost all kernel and user memory
        : allocation must come from disjoint memory nodes.
        :
        : This patch enables configuring a system so that a runaway job under one
        : mem_exclusive cpuset cannot cause the killing of a job in another such cpuset
        : that might be using very high compute and memory resources for a prolonged
        : time.

    And we changed it to the current logic in 2006:

        : commit 7887a3da753e1ba8244556cc9a2b38c815bfe256
        : Author: Nick Piggin <npiggin@suse.de>
        : Date:   Mon Sep 25 23:31:29 2006 -0700
        :
        : [PATCH] oom: cpuset hint
        :
        : cpuset_excl_nodes_overlap does not always indicate that killing a task will
        : not free any memory we for us.  For example, we may be asking for an
        : allocation from _anywhere_ in the machine, or the task in question may be
        : pinning memory that is outside its cpuset.  Fix this by just causing
        : cpuset_excl_nodes_overlap to reduce the badness rather than disallow it.

    And we haven't got an explanation of why this patch doesn't reintroduce
    the old issue.

I won't refuse a patch if it has multiple acks. But if you have any
material or numbers, please show us soon.
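
To illustrate the difference between the two historical behaviors, here is
a small userspace sketch; it is not the kernel implementation. The struct
and function names are hypothetical, skipping a task is modeled as a score
of 0, and the divisor 8 in the "hint" variant is an assumption for
illustration only.

```c
#include <stdbool.h>

/* Hypothetical inputs: raw badness and whether the candidate's allowed
 * nodes overlap those of the task that triggered the oom. */
struct candidate {
	unsigned long points;
	bool mems_overlap;
};

/* 2005-style behavior (and this patch): a non-overlapping task is never
 * a candidate, modeled here as a score of 0. */
static unsigned long score_hard_filter(const struct candidate *c)
{
	return c->mems_overlap ? c->points : 0;
}

/* 2006-style behavior: the task stays a candidate, but its score is
 * scaled down.  The divisor 8 is an assumption for illustration. */
static unsigned long score_soft_penalty(const struct candidate *c)
{
	return c->mems_overlap ? c->points : c->points / 8;
}
```

With a large task outside the cpuset (8000 points) and a small one inside
it (500 points), the hard filter can only pick the small task, while the
soft penalty still picks the large one (1000 vs. 500). That is the case
the 2006 change was written for, so the changelog should say why it no
longer matters.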




* Re: [patch -mm 03/18] oom: select task from tasklist for mempolicy ooms
  2010-06-01  7:18 ` [patch -mm 03/18] oom: select task from tasklist for mempolicy ooms David Rientjes
  2010-06-01  7:39   ` KOSAKI Motohiro
@ 2010-06-08 11:41   ` KOSAKI Motohiro
  2010-06-08 23:28     ` Andrew Morton
  2010-06-08 11:41   ` KOSAKI Motohiro
  2 siblings, 1 reply; 99+ messages in thread
From: KOSAKI Motohiro @ 2010-06-08 11:41 UTC (permalink / raw)
  To: David Rientjes
  Cc: kosaki.motohiro, Andrew Morton, Rik van Riel, Nick Piggin,
	Oleg Nesterov, KAMEZAWA Hiroyuki, Balbir Singh, linux-mm

> The oom killer presently kills current whenever there is no more memory
> free or reclaimable on its mempolicy's nodes.  There is no guarantee that
> current is a memory-hogging task or that killing it will free any
> substantial amount of memory, however.
> 
> In such situations, it is better to scan the tasklist for nodes that are
> allowed to allocate on current's set of nodes and kill the task with the
> highest badness() score.  This ensures that the most memory-hogging task,
> or the one configured by the user with /proc/pid/oom_adj, is always
> selected in such scenarios.
> 
> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> Signed-off-by: David Rientjes <rientjes@google.com>
> ---
>  include/linux/mempolicy.h |   13 +++++++-
>  mm/mempolicy.c            |   44 +++++++++++++++++++++++++
>  mm/oom_kill.c             |   77 +++++++++++++++++++++++++++-----------------
>  3 files changed, 103 insertions(+), 31 deletions(-)
> 
> diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
> --- a/include/linux/mempolicy.h
> +++ b/include/linux/mempolicy.h
> @@ -210,6 +210,8 @@ extern struct zonelist *huge_zonelist(struct vm_area_struct *vma,
>  				unsigned long addr, gfp_t gfp_flags,
>  				struct mempolicy **mpol, nodemask_t **nodemask);
>  extern bool init_nodemask_of_mempolicy(nodemask_t *mask);
> +extern bool mempolicy_nodemask_intersects(struct task_struct *tsk,
> +				const nodemask_t *mask);
>  extern unsigned slab_node(struct mempolicy *policy);
>  
>  extern enum zone_type policy_zone;
> @@ -338,7 +340,16 @@ static inline struct zonelist *huge_zonelist(struct vm_area_struct *vma,
>  	return node_zonelist(0, gfp_flags);
>  }
>  
> -static inline bool init_nodemask_of_mempolicy(nodemask_t *m) { return false; }
> +static inline bool init_nodemask_of_mempolicy(nodemask_t *m)
> +{
> +	return false;
> +}
> +
> +static inline bool mempolicy_nodemask_intersects(struct task_struct *tsk,
> +			const nodemask_t *mask)
> +{
> +	return false;
> +}
>  
>  static inline int do_migrate_pages(struct mm_struct *mm,
>  			const nodemask_t *from_nodes,
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -1712,6 +1712,50 @@ bool init_nodemask_of_mempolicy(nodemask_t *mask)
>  }
>  #endif
>  
> +/*
> + * mempolicy_nodemask_intersects
> + *
> + * If tsk's mempolicy is "default" [NULL], return 'true' to indicate default
> + * policy.  Otherwise, check for intersection between mask and the policy
> + * nodemask for 'bind' or 'interleave' policy.  For 'perferred' or 'local'
> + * policy, always return true since it may allocate elsewhere on fallback.
> + *
> + * Takes task_lock(tsk) to prevent freeing of its mempolicy.
> + */
> +bool mempolicy_nodemask_intersects(struct task_struct *tsk,
> +					const nodemask_t *mask)
> +{
> +	struct mempolicy *mempolicy;
> +	bool ret = true;
> +
> +	if (!mask)
> +		return ret;
> +	task_lock(tsk);
> +	mempolicy = tsk->mempolicy;
> +	if (!mempolicy)
> +		goto out;
> +
> +	switch (mempolicy->mode) {
> +	case MPOL_PREFERRED:
> +		/*
> +		 * MPOL_PREFERRED and MPOL_F_LOCAL are only preferred nodes to
> +		 * allocate from, they may fallback to other nodes when oom.
> +		 * Thus, it's possible for tsk to have allocated memory from
> +		 * nodes in mask.
> +		 */
> +		break;
> +	case MPOL_BIND:
> +	case MPOL_INTERLEAVE:
> +		ret = nodes_intersects(mempolicy->v.nodes, *mask);
> +		break;
> +	default:
> +		BUG();
> +	}
> +out:
> +	task_unlock(tsk);
> +	return ret;
> +}
> +
>  /* Allocate a page in interleaved policy.
>     Own path because it needs to do special accounting. */
>  static struct page *alloc_page_interleave(gfp_t gfp, unsigned order,
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -27,6 +27,7 @@
>  #include <linux/module.h>
>  #include <linux/notifier.h>
>  #include <linux/memcontrol.h>
> +#include <linux/mempolicy.h>
>  #include <linux/security.h>
>  
>  int sysctl_panic_on_oom;
> @@ -37,19 +38,35 @@ static DEFINE_SPINLOCK(zone_scan_lock);
>  
>  /*
>   * Do all threads of the target process overlap our allowed nodes?
> + * @tsk: task struct of which task to consider
> + * @mask: nodemask passed to page allocator for mempolicy ooms
>   */
> -static int has_intersects_mems_allowed(struct task_struct *tsk)
> +static bool has_intersects_mems_allowed(struct task_struct *tsk,
> +						const nodemask_t *mask)

nodemask is a better name than plain "mask".

>  {
> -	struct task_struct *t;
> +	struct task_struct *start = tsk;
>  
> -	t = tsk;
>  	do {
> -		if (cpuset_mems_allowed_intersects(current, t))
> -			return 1;
> -		t = next_thread(t);
> -	} while (t != tsk);
> -
> -	return 0;
> +		if (mask) {
> +			/*
> +			 * If this is a mempolicy constrained oom, tsk's
> +			 * cpuset is irrelevant.  Only return true if its
> +			 * mempolicy intersects current, otherwise it may be
> +			 * needlessly killed.
> +			 */
> +			if (mempolicy_nodemask_intersects(tsk, mask))
> +				return true;
> +		} else {
> +			/*
> +			 * This is not a mempolicy constrained oom, so only
> +			 * check the mems of tsk's cpuset.
> +			 */
> +			if (cpuset_mems_allowed_intersects(current, tsk))
> +				return true;
> +		}
> +		tsk = next_thread(tsk);
> +	} while (tsk != start);
> +	return false;

I have rewritten this to use while_each_thread(). Please take a look at it.


>  }
>  
>  /**
> @@ -237,7 +254,8 @@ static enum oom_constraint constrained_alloc(struct zonelist *zonelist,
>   * (not docbooked, we don't want this one cluttering up the manual)
>   */
>  static struct task_struct *select_bad_process(unsigned long *ppoints,
> -						struct mem_cgroup *mem)
> +		struct mem_cgroup *mem, enum oom_constraint constraint,
> +		const nodemask_t *mask)

We don't need the constraint argument. In the !CONSTRAINT_MEMORY_POLICY case,
we can just pass mask==NULL.

And here, too, nodemask is a better name.
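
The NULL convention could look roughly like the following userspace sketch.
It is only an illustration under simplifying assumptions: nodemask_t is a
plain bitmask here, struct task and has_intersecting_mems() are
hypothetical names, and default/preferred policies and thread iteration
are left out.

```c
#include <stdbool.h>
#include <stddef.h>

typedef unsigned long nodemask_t;	/* hypothetical: one bit per node */

struct task {
	nodemask_t cpuset_mems;		/* nodes allowed by the cpuset */
	nodemask_t mempolicy_nodes;	/* nodes of a bind/interleave policy */
};

/*
 * nodemask != NULL: mempolicy-constrained oom, compare against the policy.
 * nodemask == NULL: any other oom, compare cpuset mems with current's.
 */
static bool has_intersecting_mems(const struct task *cur,
				  const struct task *tsk,
				  const nodemask_t *nodemask)
{
	if (nodemask)
		return (tsk->mempolicy_nodes & *nodemask) != 0;
	return (tsk->cpuset_mems & cur->cpuset_mems) != 0;
}
```

The top-level caller decides once whether to pass the nodemask or NULL, so
the constraint does not have to travel through select_bad_process().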


>  {
>  	struct task_struct *p;
>  	struct task_struct *chosen = NULL;
> @@ -259,7 +277,9 @@ static struct task_struct *select_bad_process(unsigned long *ppoints,
>  			continue;
>  		if (mem && !task_in_mem_cgroup(p, mem))
>  			continue;
> -		if (!has_intersects_mems_allowed(p))
> +		if (!has_intersects_mems_allowed(p,
> +				constraint == CONSTRAINT_MEMORY_POLICY ? mask :
> +									 NULL))
>  			continue;
>  
>  		/*
> @@ -483,7 +503,7 @@ void mem_cgroup_out_of_memory(struct mem_cgroup *mem, gfp_t gfp_mask)
>  		panic("out of memory(memcg). panic_on_oom is selected.\n");
>  	read_lock(&tasklist_lock);
>  retry:
> -	p = select_bad_process(&points, mem);
> +	p = select_bad_process(&points, mem, CONSTRAINT_NONE, NULL);
>  	if (!p || PTR_ERR(p) == -1UL)
>  		goto out;
>  
> @@ -562,7 +582,8 @@ void clear_zonelist_oom(struct zonelist *zonelist, gfp_t gfp_mask)
>  /*
>   * Must be called with tasklist_lock held for read.
>   */
> -static void __out_of_memory(gfp_t gfp_mask, int order)
> +static void __out_of_memory(gfp_t gfp_mask, int order,
> +			enum oom_constraint constraint, const nodemask_t *mask)
>  {
>  	struct task_struct *p;
>  	unsigned long points;
> @@ -576,7 +597,7 @@ retry:
>  	 * Rambo mode: Shoot down a process and hope it solves whatever
>  	 * issues we may have.
>  	 */
> -	p = select_bad_process(&points, NULL);
> +	p = select_bad_process(&points, NULL, constraint, mask);
>  
>  	if (PTR_ERR(p) == -1UL)
>  		return;
> @@ -610,7 +631,8 @@ void pagefault_out_of_memory(void)
>  		panic("out of memory from page fault. panic_on_oom is selected.\n");
>  
>  	read_lock(&tasklist_lock);
> -	__out_of_memory(0, 0); /* unknown gfp_mask and order */
> +	/* unknown gfp_mask and order */
> +	__out_of_memory(0, 0, CONSTRAINT_NONE, NULL);
>  	read_unlock(&tasklist_lock);
>  
>  	/*
> @@ -626,6 +648,7 @@ void pagefault_out_of_memory(void)
>   * @zonelist: zonelist pointer
>   * @gfp_mask: memory allocation flags
>   * @order: amount of memory being requested as a power of 2
> + * @nodemask: nodemask passed to page allocator
>   *
>   * If we run out of memory, we have the choice between either
>   * killing a random task (bad), letting the system crash (worse)
> @@ -654,24 +677,18 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
>  	 */
>  	constraint = constrained_alloc(zonelist, gfp_mask, nodemask);
>  	read_lock(&tasklist_lock);
> -
> -	switch (constraint) {
> -	case CONSTRAINT_MEMORY_POLICY:
> -		oom_kill_process(current, gfp_mask, order, 0, NULL,
> -				"No available memory (MPOL_BIND)");
> -		break;
> -
> -	case CONSTRAINT_NONE:
> -		if (sysctl_panic_on_oom) {
> +	if (unlikely(sysctl_panic_on_oom)) {
> +		/*
> +		 * panic_on_oom only affects CONSTRAINT_NONE, the kernel
> +		 * should not panic for cpuset or mempolicy induced memory
> +		 * failures.
> +		 */
> +		if (constraint == CONSTRAINT_NONE) {
>  			dump_header(NULL, gfp_mask, order, NULL);
> -			panic("out of memory. panic_on_oom is selected\n");
> +			panic("Out of memory: panic_on_oom is enabled\n");

You shouldn't mix in undocumented and unnecessary changes.

>  		}
> -		/* Fall-through */
> -	case CONSTRAINT_CPUSET:
> -		__out_of_memory(gfp_mask, order);
> -		break;
>  	}
> -
> +	__out_of_memory(gfp_mask, order, constraint, nodemask);
>  	read_unlock(&tasklist_lock);
>  
>  	/*




* Re: [patch -mm 09/18] oom: add forkbomb penalty to badness heuristic
  2010-06-01  7:18 ` [patch -mm 09/18] oom: add forkbomb penalty to badness heuristic David Rientjes
  2010-06-01  7:37   ` KOSAKI Motohiro
@ 2010-06-08 11:41   ` KOSAKI Motohiro
  2010-06-08 11:41   ` KOSAKI Motohiro
  2 siblings, 0 replies; 99+ messages in thread
From: KOSAKI Motohiro @ 2010-06-08 11:41 UTC (permalink / raw)
  To: David Rientjes
  Cc: kosaki.motohiro, Andrew Morton, Rik van Riel, Nick Piggin,
	Oleg Nesterov, KAMEZAWA Hiroyuki, Balbir Singh, linux-mm

> +	list_for_each_entry(child, &tsk->children, sibling) {

This loop only checks children created by the main thread.
We need to iterate over children created by sub-threads too.

> +		struct task_cputime task_time;
> +		unsigned long runtime;
> +		unsigned long rss;
> +
> +		task_lock(child);
> +		if (!child->mm || child->mm == tsk->mm) {
> +			task_unlock(child);
> +			continue;
> +		}

We need to use find_lock_task_mm() here.



> +		rss = get_mm_rss(child->mm);

We need rss+swap here to stay consistent, I think.

> +		task_unlock(child);
> +
> +		thread_group_cputime(child, &task_time);
> +		runtime = cputime_to_jiffies(task_time.utime) +
> +			  cputime_to_jiffies(task_time.stime);
> +		/*
> +		 * Only threads that have run for less than a second are
> +		 * considered toward the forkbomb penalty, these threads rarely
> +		 * get to execute at all in such cases anyway.
> +		 */
> +		if (runtime < HZ) {
> +			child_rss += rss;
> +			forkcount++;
> +		}
> +	}
> +
> +	return forkcount > sysctl_oom_forkbomb_thres ?
> +				(child_rss / sysctl_oom_forkbomb_thres) : 0;

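There is a divide-by-zero risk here.
The correct style is

	thres = sysctl_oom_forkbomb_thres;
	if (!thres)
		return 0;
	return forkcount > thres ? child_rss / thres : 0;

Copying the sysctl to a local variable is a must, because its value can
change between the check and the division.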
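
As a self-contained illustration of the read-once pattern, here is a
userspace sketch. The global oom_forkbomb_thres below is only a stand-in
for the sysctl, and treating a threshold of 0 as "no penalty" is an
assumption, not something the patch states.

```c
/* Hypothetical stand-in for the sysctl; volatile because it may be
 * changed from elsewhere at any time. */
static volatile unsigned long oom_forkbomb_thres = 1000;

static unsigned long forkbomb_penalty(unsigned long child_rss,
				      unsigned long forkcount)
{
	/* Read the tunable exactly once and use only the local copy. */
	unsigned long thres = oom_forkbomb_thres;

	if (!thres)
		return 0;	/* assumed meaning of 0: no penalty */
	return forkcount > thres ? child_rss / thres : 0;
}
```

Because every later use goes through thres, a concurrent write of 0 to the
tunable cannot slip in between the check and the division.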




* Re: [patch -mm 09/18] oom: add forkbomb penalty to badness heuristic
  2010-06-01  7:18 ` [patch -mm 09/18] oom: add forkbomb penalty to badness heuristic David Rientjes
  2010-06-01  7:37   ` KOSAKI Motohiro
  2010-06-08 11:41   ` KOSAKI Motohiro
@ 2010-06-08 11:41   ` KOSAKI Motohiro
  2 siblings, 0 replies; 99+ messages in thread
From: KOSAKI Motohiro @ 2010-06-08 11:41 UTC (permalink / raw)
  To: David Rientjes
  Cc: kosaki.motohiro, Andrew Morton, Rik van Riel, Nick Piggin,
	Oleg Nesterov, KAMEZAWA Hiroyuki, Balbir Singh, linux-mm

> Add a forkbomb penalty for processes that fork an excessively large
> number of children to penalize that group of tasks and not others.  A
> threshold is configurable from userspace to determine how many first-
> generation execve children (those with their own address spaces) a task
> may have before it is considered a forkbomb.  This can be tuned by
> altering the value in /proc/sys/vm/oom_forkbomb_thres, which defaults to
> 1000.
> 
> When a task has more than 1000 first-generation children with different
> address spaces than itself, a penalty of
> 
> 	(average rss of children) * (# of 1st generation execve children)
> 	-----------------------------------------------------------------
> 			oom_forkbomb_thres
> 
> is assessed.  So, for example, using the default oom_forkbomb_thres of
> 1000, the penalty is twice the average rss of all its execve children if
> there are 2000 such tasks.  A task is considered to count toward the
> threshold if its total runtime is less than one second; for 1000 of such
> tasks to exist, the parent process must be forking at an extremely high
> rate either erroneously or maliciously.
> 
> Even though a particular task may be designated a forkbomb and selected as
> the victim, the oom killer will still kill the 1st generation execve child
> with the highest badness() score in its place.  The avoids killing
> important servers or system daemons.  When a web server forks a very large
> number of threads for client connections, for example, it is much better
> to kill one of those threads than to kill the server and make it
> unresponsive.

Reviewers need to trace patch author's intention, this description seems
only focus "how to implement". but reviewers need the explaination of the 
big picture.

The old strategy is:
  (1) accumulate half of each child's vsz
  (2) kill a child first, instead of the parent

Your strategy is:
  (a) usually don't accumulate child memory
  (b) but short-lived children are accumulated
  (c) kill a child first

I think at least two explanations are necessary.

 - Usually, a legitimate process (e.g. web server, rdb) creates a lot of
   1st generation children, but a forkbomb usually creates multiple
   generations of children. Why do you only care about the 1st generation?
 - In the usual case, you don't care about the child's rsz, but you still
   kill the child. That seems inconsistent with the old behavior. Why did you
   choose this technique?

For now, I don't have any objection, because I haven't understood your point yet.
OK, the concept of forkbomb detection is good, but you need to describe

 - why did you choose this way?
 - how did you confirm that your way works fine?

No heuristic can be perfect in practice; that's OK. But unclear code
intention easily makes code unmaintainable. Please avoid it.







^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [patch -mm 02/18] oom: sacrifice child with highest badness score for parent
  2010-06-01  7:18 ` [patch -mm 02/18] oom: sacrifice child with highest badness score for parent David Rientjes
  2010-06-01  7:39   ` KOSAKI Motohiro
  2010-06-08 11:41   ` KOSAKI Motohiro
@ 2010-06-08 11:41   ` KOSAKI Motohiro
  2010-06-08 18:45     ` David Rientjes
  2 siblings, 1 reply; 99+ messages in thread
From: KOSAKI Motohiro @ 2010-06-08 11:41 UTC (permalink / raw)
  To: David Rientjes
  Cc: kosaki.motohiro, Andrew Morton, Rik van Riel, Nick Piggin,
	Oleg Nesterov, KAMEZAWA Hiroyuki, Balbir Singh, linux-mm

> @@ -447,19 +450,27 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
>  		return 0;
>  	}
>  
> -	printk(KERN_ERR "%s: kill process %d (%s) score %li or a child\n",
> -					message, task_pid_nr(p), p->comm, points);
> +	pr_err("%s: Kill process %d (%s) with score %lu or sacrifice child\n",
> +		message, task_pid_nr(p), p->comm, points);
>  
> -	/* Try to kill a child first */
> +	do_posix_clock_monotonic_gettime(&uptime);
> +	/* Try to sacrifice the worst child first */
>  	list_for_each_entry(c, &p->children, sibling) {
> +		unsigned long cpoints;
> +
>  		if (c->mm == p->mm)
>  			continue;
>  		if (mem && !task_in_mem_cgroup(c, mem))
>  			continue;
> -		if (!oom_kill_task(c))
> -			return 0;
> +

We probably need to check the cpuset (and mempolicy) memory intersection here.
Otherwise, this may select an innocent task.

Also, is an OOM_DISABLE check necessary?

> +		/* badness() returns 0 if the thread is unkillable */
> +		cpoints = badness(c, uptime.tv_sec);
> +		if (cpoints > victim_points) {
> +			victim = c;
> +			victim_points = cpoints;
> +		}
>  	}
> -	return oom_kill_task(p);
> +	return oom_kill_task(victim);
>  }
>  
>  #ifdef CONFIG_CGROUP_MEM_RES_CTLR
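
The selection loop in the hunk above can be modelled in userspace as
follows. This is a sketch under stated assumptions, not the kernel code:
struct model_task and pick_victim() are hypothetical names, the badness
score is taken as a precomputed field, and victim_points is assumed to
start at 0 because its initialisation is not visible in the quoted hunk.

```c
#include <stddef.h>

/* Hypothetical stand-in for task_struct, reduced to what the loop needs. */
struct model_task {
	int mm_id;		/* identifies the address space */
	unsigned long points;	/* precomputed badness score, 0 if unkillable */
};

/*
 * Return the task to kill: the child with the highest score among those
 * that do not share the parent's address space, or the parent itself
 * when no child has a non-zero score.
 */
const struct model_task *pick_victim(const struct model_task *parent,
				     const struct model_task *children,
				     size_t nr_children)
{
	const struct model_task *victim = parent;
	unsigned long victim_points = 0;	/* assumed initial value */
	size_t i;

	for (i = 0; i < nr_children; i++) {
		const struct model_task *c = &children[i];

		/* Children sharing the parent's mm are skipped. */
		if (c->mm_id == parent->mm_id)
			continue;
		if (c->points > victim_points) {
			victim = c;
			victim_points = c->points;
		}
	}
	return victim;
}
```

The extra filters raised in this review (cpuset/mempolicy intersection,
OOM_DISABLE) would be additional "continue" conditions inside the loop;
they are left out of the sketch because the quoted patch does not contain
them.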




^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [patch -mm 03/18] oom: select task from tasklist for mempolicy ooms
  2010-06-01  7:18 ` [patch -mm 03/18] oom: select task from tasklist for mempolicy ooms David Rientjes
  2010-06-01  7:39   ` KOSAKI Motohiro
  2010-06-08 11:41   ` KOSAKI Motohiro
@ 2010-06-08 11:41   ` KOSAKI Motohiro
  2 siblings, 0 replies; 99+ messages in thread
From: KOSAKI Motohiro @ 2010-06-08 11:41 UTC (permalink / raw)
  To: David Rientjes
  Cc: kosaki.motohiro, Andrew Morton, Rik van Riel, Nick Piggin,
	Oleg Nesterov, KAMEZAWA Hiroyuki, Balbir Singh, linux-mm

> The oom killer presently kills current whenever there is no more memory
> free or reclaimable on its mempolicy's nodes.  There is no guarantee that
> current is a memory-hogging task or that killing it will free any
> substantial amount of memory, however.
> 
> In such situations, it is better to scan the tasklist for nodes that are
> allowed to allocate on current's set of nodes and kill the task with the
> highest badness() score.  This ensures that the most memory-hogging task,
> or the one configured by the user with /proc/pid/oom_adj, is always
> selected in such scenarios.
> 
> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> Signed-off-by: David Rientjes <rientjes@google.com>
> ---
>  include/linux/mempolicy.h |   13 +++++++-
>  mm/mempolicy.c            |   44 +++++++++++++++++++++++++
>  mm/oom_kill.c             |   77 +++++++++++++++++++++++++++-----------------
>  3 files changed, 103 insertions(+), 31 deletions(-)
> 
> diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
> --- a/include/linux/mempolicy.h
> +++ b/include/linux/mempolicy.h
> @@ -210,6 +210,8 @@ extern struct zonelist *huge_zonelist(struct vm_area_struct *vma,
>  				unsigned long addr, gfp_t gfp_flags,
>  				struct mempolicy **mpol, nodemask_t **nodemask);
>  extern bool init_nodemask_of_mempolicy(nodemask_t *mask);
> +extern bool mempolicy_nodemask_intersects(struct task_struct *tsk,
> +				const nodemask_t *mask);
>  extern unsigned slab_node(struct mempolicy *policy);
>  
>  extern enum zone_type policy_zone;
> @@ -338,7 +340,16 @@ static inline struct zonelist *huge_zonelist(struct vm_area_struct *vma,
>  	return node_zonelist(0, gfp_flags);
>  }
>  
> -static inline bool init_nodemask_of_mempolicy(nodemask_t *m) { return false; }
> +static inline bool init_nodemask_of_mempolicy(nodemask_t *m)
> +{
> +	return false;
> +}
> +
> +static inline bool mempolicy_nodemask_intersects(struct task_struct *tsk,
> +			const nodemask_t *mask)
> +{
> +	return false;
> +}
>  
>  static inline int do_migrate_pages(struct mm_struct *mm,
>  			const nodemask_t *from_nodes,
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -1712,6 +1712,50 @@ bool init_nodemask_of_mempolicy(nodemask_t *mask)
>  }
>  #endif
>  
> +/*
> + * mempolicy_nodemask_intersects
> + *
> + * If tsk's mempolicy is "default" [NULL], return 'true' to indicate default
> + * policy.  Otherwise, check for intersection between mask and the policy
> + * nodemask for 'bind' or 'interleave' policy.  For 'perferred' or 'local'
> + * policy, always return true since it may allocate elsewhere on fallback.
> + *
> + * Takes task_lock(tsk) to prevent freeing of its mempolicy.
> + */
> +bool mempolicy_nodemask_intersects(struct task_struct *tsk,
> +					const nodemask_t *mask)
> +{
> +	struct mempolicy *mempolicy;
> +	bool ret = true;
> +
> +	if (!mask)
> +		return ret;
> +	task_lock(tsk);
> +	mempolicy = tsk->mempolicy;
> +	if (!mempolicy)
> +		goto out;
> +
> +	switch (mempolicy->mode) {
> +	case MPOL_PREFERRED:
> +		/*
> +		 * MPOL_PREFERRED and MPOL_F_LOCAL are only preferred nodes to
> +		 * allocate from, they may fallback to other nodes when oom.
> +		 * Thus, it's possible for tsk to have allocated memory from
> +		 * nodes in mask.
> +		 */
> +		break;
> +	case MPOL_BIND:
> +	case MPOL_INTERLEAVE:
> +		ret = nodes_intersects(mempolicy->v.nodes, *mask);
> +		break;
> +	default:
> +		BUG();
> +	}
> +out:
> +	task_unlock(tsk);
> +	return ret;
> +}
> +
>  /* Allocate a page in interleaved policy.
>     Own path because it needs to do special accounting. */
>  static struct page *alloc_page_interleave(gfp_t gfp, unsigned order,
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -27,6 +27,7 @@
>  #include <linux/module.h>
>  #include <linux/notifier.h>
>  #include <linux/memcontrol.h>
> +#include <linux/mempolicy.h>
>  #include <linux/security.h>
>  
>  int sysctl_panic_on_oom;
> @@ -37,19 +38,35 @@ static DEFINE_SPINLOCK(zone_scan_lock);
>  
>  /*
>   * Do all threads of the target process overlap our allowed nodes?
> + * @tsk: task struct of which task to consider
> + * @mask: nodemask passed to page allocator for mempolicy ooms
>   */
> -static int has_intersects_mems_allowed(struct task_struct *tsk)
> +static bool has_intersects_mems_allowed(struct task_struct *tsk,
> +						const nodemask_t *mask)

"nodemask" is a better name than plain "mask".

>  {
> -	struct task_struct *t;
> +	struct task_struct *start = tsk;
>  
> -	t = tsk;
>  	do {
> -		if (cpuset_mems_allowed_intersects(current, t))
> -			return 1;
> -		t = next_thread(t);
> -	} while (t != tsk);
> -
> -	return 0;
> +		if (mask) {
> +			/*
> +			 * If this is a mempolicy constrained oom, tsk's
> +			 * cpuset is irrelevant.  Only return true if its
> +			 * mempolicy intersects current, otherwise it may be
> +			 * needlessly killed.
> +			 */
> +			if (mempolicy_nodemask_intersects(tsk, mask))
> +				return true;
> +		} else {
> +			/*
> +			 * This is not a mempolicy constrained oom, so only
> +			 * check the mems of tsk's cpuset.
> +			 */
> +			if (cpuset_mems_allowed_intersects(current, tsk))
> +				return true;
> +		}
> +		tsk = next_thread(tsk);
> +	} while (tsk != start);
> +	return false;

I have rewritten this to use while_each_thread(). Please see it.


>  }
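
A userspace sketch of the check in the hunk above may help reviewers follow
the two cases. It is only a model: node sets are plain unsigned long
bitmasks instead of nodemask_t, the thread group is an array instead of the
next_thread() ring, no locking is shown, and every model_* name is
hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

enum model_mpol {
	MODEL_MPOL_DEFAULT,
	MODEL_MPOL_PREFERRED,
	MODEL_MPOL_BIND,
	MODEL_MPOL_INTERLEAVE,
};

/* Hypothetical per-thread state; bit n set means node n is in the set. */
struct model_thread {
	enum model_mpol mode;
	unsigned long policy_nodes;	/* nodes named by the mempolicy */
	unsigned long cpuset_mems;	/* nodes allowed by the cpuset */
};

/* Only bind and interleave policies restrict where memory may come from. */
static bool model_mempolicy_intersects(const struct model_thread *t,
				       unsigned long mask)
{
	switch (t->mode) {
	case MODEL_MPOL_BIND:
	case MODEL_MPOL_INTERLEAVE:
		return (t->policy_nodes & mask) != 0;
	default:
		/* default and preferred policies may fall back to any node */
		return true;
	}
}

/*
 * mask != NULL: mempolicy-constrained oom, compare mempolicies only.
 * mask == NULL: compare each thread's cpuset mems with current's mems.
 */
bool model_has_intersects_mems_allowed(const struct model_thread *threads,
				       size_t nr_threads,
				       const unsigned long *mask,
				       unsigned long current_mems)
{
	size_t i;

	for (i = 0; i < nr_threads; i++) {
		if (mask) {
			if (model_mempolicy_intersects(&threads[i], *mask))
				return true;
		} else if (threads[i].cpuset_mems & current_mems) {
			return true;
		}
	}
	return false;
}
```

A thread bound to nodes 0-1 is not eligible for an oom constrained to
node 2, while a thread with a preferred policy always is, since it may have
fallen back to node 2.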
>  
>  /**
> @@ -237,7 +254,8 @@ static enum oom_constraint constrained_alloc(struct zonelist *zonelist,
>   * (not docbooked, we don't want this one cluttering up the manual)
>   */
>  static struct task_struct *select_bad_process(unsigned long *ppoints,
> -						struct mem_cgroup *mem)
> +		struct mem_cgroup *mem, enum oom_constraint constraint,
> +		const nodemask_t *mask)

We don't need the constraint argument. In the !CONSTRAINT_MEMORY_POLICY case,
we can just pass mask == NULL.

Also, nodemask is a better name here too.


>  {
>  	struct task_struct *p;
>  	struct task_struct *chosen = NULL;
> @@ -259,7 +277,9 @@ static struct task_struct *select_bad_process(unsigned long *ppoints,
>  			continue;
>  		if (mem && !task_in_mem_cgroup(p, mem))
>  			continue;
> -		if (!has_intersects_mems_allowed(p))
> +		if (!has_intersects_mems_allowed(p,
> +				constraint == CONSTRAINT_MEMORY_POLICY ? mask :
> +									 NULL))
>  			continue;
>  
>  		/*
> @@ -483,7 +503,7 @@ void mem_cgroup_out_of_memory(struct mem_cgroup *mem, gfp_t gfp_mask)
>  		panic("out of memory(memcg). panic_on_oom is selected.\n");
>  	read_lock(&tasklist_lock);
>  retry:
> -	p = select_bad_process(&points, mem);
> +	p = select_bad_process(&points, mem, CONSTRAINT_NONE, NULL);
>  	if (!p || PTR_ERR(p) == -1UL)
>  		goto out;
>  
> @@ -562,7 +582,8 @@ void clear_zonelist_oom(struct zonelist *zonelist, gfp_t gfp_mask)
>  /*
>   * Must be called with tasklist_lock held for read.
>   */
> -static void __out_of_memory(gfp_t gfp_mask, int order)
> +static void __out_of_memory(gfp_t gfp_mask, int order,
> +			enum oom_constraint constraint, const nodemask_t *mask)
>  {
>  	struct task_struct *p;
>  	unsigned long points;
> @@ -576,7 +597,7 @@ retry:
>  	 * Rambo mode: Shoot down a process and hope it solves whatever
>  	 * issues we may have.
>  	 */
> -	p = select_bad_process(&points, NULL);
> +	p = select_bad_process(&points, NULL, constraint, mask);
>  
>  	if (PTR_ERR(p) == -1UL)
>  		return;
> @@ -610,7 +631,8 @@ void pagefault_out_of_memory(void)
>  		panic("out of memory from page fault. panic_on_oom is selected.\n");
>  
>  	read_lock(&tasklist_lock);
> -	__out_of_memory(0, 0); /* unknown gfp_mask and order */
> +	/* unknown gfp_mask and order */
> +	__out_of_memory(0, 0, CONSTRAINT_NONE, NULL);
>  	read_unlock(&tasklist_lock);
>  
>  	/*
> @@ -626,6 +648,7 @@ void pagefault_out_of_memory(void)
>   * @zonelist: zonelist pointer
>   * @gfp_mask: memory allocation flags
>   * @order: amount of memory being requested as a power of 2
> + * @nodemask: nodemask passed to page allocator
>   *
>   * If we run out of memory, we have the choice between either
>   * killing a random task (bad), letting the system crash (worse)
> @@ -654,24 +677,18 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
>  	 */
>  	constraint = constrained_alloc(zonelist, gfp_mask, nodemask);
>  	read_lock(&tasklist_lock);
> -
> -	switch (constraint) {
> -	case CONSTRAINT_MEMORY_POLICY:
> -		oom_kill_process(current, gfp_mask, order, 0, NULL,
> -				"No available memory (MPOL_BIND)");
> -		break;
> -
> -	case CONSTRAINT_NONE:
> -		if (sysctl_panic_on_oom) {
> +	if (unlikely(sysctl_panic_on_oom)) {
> +		/*
> +		 * panic_on_oom only affects CONSTRAINT_NONE, the kernel
> +		 * should not panic for cpuset or mempolicy induced memory
> +		 * failures.
> +		 */
> +		if (constraint == CONSTRAINT_NONE) {
>  			dump_header(NULL, gfp_mask, order, NULL);
> -			panic("out of memory. panic_on_oom is selected\n");
> +			panic("Out of memory: panic_on_oom is enabled\n");

You shouldn't mix in an unrelated change.

>  		}
> -		/* Fall-through */
> -	case CONSTRAINT_CPUSET:
> -		__out_of_memory(gfp_mask, order);
> -		break;
>  	}
> -
> +	__out_of_memory(gfp_mask, order, constraint, nodemask);
>  	read_unlock(&tasklist_lock);
>  
>  	/*
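
The control flow after this hunk, together with the suggestion above to
drop the constraint argument, can be summarised in a small userspace
sketch. The model_* names are hypothetical and only the logic visible in
the quoted diff is modelled; any panic_on_oom handling outside this hunk is
not covered.

```c
#include <stdbool.h>
#include <stddef.h>

enum model_oom_constraint {
	MODEL_CONSTRAINT_NONE,
	MODEL_CONSTRAINT_CPUSET,
	MODEL_CONSTRAINT_MEMORY_POLICY,
};

/*
 * Per the comment in the hunk: panic_on_oom only affects unconstrained
 * ooms, not cpuset or mempolicy induced memory failures.
 */
bool model_should_panic(int panic_on_oom,
			enum model_oom_constraint constraint)
{
	return panic_on_oom != 0 && constraint == MODEL_CONSTRAINT_NONE;
}

/*
 * The nodemask the task scan effectively uses.  If the caller computes
 * this once, the scan itself no longer needs the constraint argument:
 * a NULL mask simply means "not a mempolicy constrained oom".
 */
const unsigned long *
model_effective_nodemask(enum model_oom_constraint constraint,
			 const unsigned long *nodemask)
{
	return constraint == MODEL_CONSTRAINT_MEMORY_POLICY ? nodemask : NULL;
}
```

In this model the patch's select_bad_process(&points, NULL, constraint,
mask) call would become a call taking only the result of
model_effective_nodemask().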




^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [patch -mm 08/18] oom: badness heuristic rewrite
  2010-06-04 20:57             ` David Rientjes
@ 2010-06-08 11:41               ` KOSAKI Motohiro
  2010-06-08 23:47                 ` Andrew Morton
  0 siblings, 1 reply; 99+ messages in thread
From: KOSAKI Motohiro @ 2010-06-08 11:41 UTC (permalink / raw)
  To: David Rientjes
  Cc: kosaki.motohiro, Andrew Morton, Rik van Riel, Nick Piggin,
	Oleg Nesterov, KAMEZAWA Hiroyuki, Balbir Singh, linux-mm

Hi

> > Have you review the actual patches? And No, I don't think "complete 
> > replace with no test result" is adequate development way.
> 
> I have repeatedly said that the oom killer no longer kills KDE when run on 
> my desktop in the presence of a memory hogging task that was written 
> specifically to oom the machine.  That's a better result than the 
> current implementation and was discussed thoroughly during the discussion 
> on this mailing list back in February that inspired this rewrite to begin 
> with.  I don't think there's any mystery there since you've referred to 
> that change specifically for KDE in this thread yourself.

And reviewers have repeatedly said that your patches contain more than is
needed to save KDE, and have asked you for the reason. We haven't said KDE
is unimportant.


> > And, When developers post large patch set, Usually _you_ request show
> > demonstrate result. I haven't seen such result in this activity.
> 
> You want to see a log that says "Killed process 1234 (memory-hogger)..." 
> instead of "Killed process 1234 (kdeinit)..."?  You've supported the 
> change from total_vm to rss as a baseline to begin with.  And after all 
> this discussion, this is the first time you've ever said you wanted to see 
> that type of log or anything like it.

Did you only test the above crazy, meaningless case?
We aren't asking you for anything acrobatic or unactionable. Please simply
show what you did.

> > However, It doesn't give any reason to avoid code review and violate
> > our development process.
> > 
> 
> Nobody is avoiding code review here, that's pretty obvious, and I have no 
> idea what you're referring to when you're saying I'm violating the development 
> process because this happens to rewrite an entire function and requires a 
> new user interface and callsite fixups to be meaningful.  You specifically 
> asked me to push the forkbomb detector in a different patch and I did that 
> because it makes sense to separate that heuristic, but even then you just 
> wrote "nack" and haven't responded with why even after I've replied twice 
> asking.  I'm really confused by this behavior.

That is not exactly correct.
I also requested that adding the forkbomb feature and adding the forkbomb knob
be separated. I often make the same request of patch authors, over and over.

Why?

First of all, the patch description of your forkbomb detection is here:

	> Add a forkbomb penalty for processes that fork an excessively large
	> number of children to penalize that group of tasks and not others.  A
	> threshold is configurable from userspace to determine how many first-
	> generation execve children (those with their own address spaces) a task
	> may have before it is considered a forkbomb.  This can be tuned by
	> altering the value in /proc/sys/vm/oom_forkbomb_thres, which defaults to
	> 1000.
	> 
	> When a task has more than 1000 first-generation children with different
	> address spaces than itself, a penalty of
	> 
	> 	(average rss of children) * (# of 1st generation execve children)
	> 	-----------------------------------------------------------------
	> 			oom_forkbomb_thres
	> 
	> is assessed.  So, for example, using the default oom_forkbomb_thres of
	> 1000, the penalty is twice the average rss of all its execve children if
	> there are 2000 such tasks.  A task is considered to count toward the
	> threshold if its total runtime is less than one second; for 1000 of such
	> tasks to exist, the parent process must be forking at an extremely high
	> rate either erroneously or maliciously.
	> 
	> Even though a particular task may be designated a forkbomb and selected as
	> the victim, the oom killer will still kill the 1st generation execve child
	> with the highest badness() score in its place.  This avoids killing
	> important servers or system daemons.  When a web server forks a very large
	> number of threads for client connections, for example, it is much better
	> to kill one of those threads than to kill the server and make it
	> unresponsive.

This has two bad smells. 1) The text is an unnecessary mess, which is a sign
that the patch does not concentrate on one thing. 2) It concentrates strongly
on "what and how to implement". But reviewers don't need that information so
much, because they can read C. Reviewers need the following information:
  - background
  - why did the author choose this way?
  - why did the author choose this default value?
  - how were the concept and the implementation confirmed to be correct?
  - etc.

With that, reviewers can trace the author's thinking and give good advice and
judgement. For example, in this case you wrote
 - the default threshold is 1000
 - only 1st generation execve children are accumulated
 - the time threshold is one second

but you did not write why. Messy text hides such gaps in the documentation.
That is why I usually insist on splitting: a split naturally reduces the
"which place changed" documentation and exposes what is lacking.

So far I haven't understood your intention. The lack of a test suite makes it
even harder to see which workload the author considers a problem workload.

BTW, a nit: web servers typically don't create that many threads, because
almost all web servers have a limit on the number of connections. (Otherwise
the server could easily be brought down by a DoS.)


> > > And I'm going to have to get into it because of you guys' seeming
> > > inability to get your act together.
> > 
> > Inability? What do you mean inability? Almost all developers cooperate 
> > for making stabilized kernel. Is this effort inability? or meaningless?
> > 
> 
> I think he's saying that he expects that we should be able to work 
> cooperatively in resolving any differences that we have in a respectful 
> and technical manner on this list.
> 
> But I'll also add my two cents in that and say that we should probably be 
> leaving maintainer duties up to the actual -mm tree maintainer, he knows 
> the development process you're talking about pretty well.

It seems he and I have some disagreement. Ho hum. Of course, you can seek
another reviewer and another ack. But as long as patches come before my eyes,
I enforce a bugfix-first policy for everybody.

> 
> > Actually, the descriptions doesn't looks better really. We sometimes
> > ask him
> >  - which problem occur? how do you reproduce it?
> 
> KDE gets killed, memory hogger doesn't.  Run memory hogger on your 
> desktop.  KOSAKI, this isn't a surprise to you.
> 
> If this is your objection, I can certainly elaborate more in the changelog 
> but up until yesterday you've never said you have a problem with it so how 
> am I supposed to make any forward progress on this?  I can't read your 
> mind when you say "nack" and I'd like to resolve any issues that people 
> have, but that requires that they get involved.

And I, too, have to read your mind from your description. I'm not an esper.


> >  - which piece solve which issue?
> 
> Mostly the baseline heuristic change to rss and swap, as you well know.

agreed.

> 
> >  - how do you measure side effect?
> 
> As far as the objective of the oom killer is concerned as listed in 
> mm/oom_kill.c's header, there are no side effects.  We're trying to kill a 
> task that will free the largest amount of memory and clearly rss and swap 
> is a better indication of that than total_vm.

Wait, wait.
With this, you are saying you haven't considered many workloads deeply. Really?
I guess not.

Perhaps you wrote this sentence quickly. So I only hope that you will update
your patch description.


> >  - how do you mesure or consider other workload user
> 
> The objective of the oom killer is not different for different workloads.

Was my question too short or unclear?

Usually, we run 5-6 mental simulations: embedded, desktop, web server,
db server, HPC, finance. Different workloads certainly make a big impact,
because the oom killer traverses the _processes_ in the workload. That affects
how to choose the badness() heuristics. Why wouldn't it?


> > But I got only the answer, "My patch is best. resistance is futile". that's
> > purely Baaaad.
> > 
> 
> I haven't said anything new in the above, KOSAKI, you already knew all 
> this.  I'll update the changelog to include some of this information for 
> the next posting, but I'd really hope that this isn't the major problem 
> that you've had the entire time that we've stalled weeks on.

Ho Hum. OK.

> 
> > OK. I don't have any reason to confuse you. I'll fix me. My point is
> > really simple. The majority OOM user are in desktop. We must not ignore
> > them. such as
> > 
> >  - Any regression from desktop view are unacceptable
> 
> This patchset was specifically designed to improve the oom killer's 
> behavior on the desktop!

Again, features that cannot be evaluated are mixed in, and reviewers are stalling.


> >  - Any incompatibility of no desktop improvement are unacceptable
> 
> I don't understand this.

In other words,
 - Any incompatibility is unacceptable

because your new feature has no users.


> >  - Any refusing bugfix are unacceptable
> 
> I've merged most of Oleg's work into this patchset, the problem that we're 
> having is deciding whether any of it is -rc material or not and should be 
> pushed first.  I don't think any of it is, Oleg certainly wasn't pushing 
> it and to date I don't believe has said it's rc material, so that's 
> something you can talk about but I'm not refusing any bugfix.

Good developers always take other developers' and users' bug reports first.
And I'm going to push the kill-PF_EXITING patch and the
dying-task-higher-priority patch, although they don't help your workload.
I don't believe the reason for your opposition is logical.
(But if you make an alternative patch, I'll review it with priority.)

> > 1) fix bugs at fist before making new feature (a.k.a new bugs)
> 
> Kame already suggested a new order to the patchset that I'll be 
> restructuring.  I'm curious as to why this was removed from -mm though on 
> your suggestion before any of this became an issue.  We've yet to hear 
> that mysterious information.

Again and again and again: you have to get someone's ack when you are pushing
a new feature. Your series still has bugs, and such a series usually needs 3-5
review iterations. OK, that is partly Andrew's and us reviewers' fault. These
patches should have been dropped much earlier. Your patches got NAKed 4 times
by different developers; each time, the patches had to be dropped. Sigh.


> > 2) don't mix bugfix and new feature
> 
> Andrew said bugfixes should come first, they will in the reposting, but I 
> don't consider any of it to be -rc material.

Oleg's material can be merged now, but yours cannot.


> > 3) make separate as natural and individual piece
> 
> I can't keep having this conversation, the patch is broken down into one 
> functional unit as much as possible.  Please leave the maintainership of 
> this code to Andrew who has already said entire implementation changes (in 
> this case, a single function rewrite) is allowed if it makes sense.

I said I'll split them if you don't.


> > 4) keep small and reviewable patch size
> 
> Same as above.
> 
> > 5) stop ugly excuse, instead repeatedly rewrite until get anyone ack
> 
> I don't know what my ugly excuse is, but I'll be reordering the patches 
> and sending them with an updated changelog on the badness heuristic 
> rewrite.  I hope that will satisfy all your concerns.

I won't talk in generalities here. Instead, I've sent new bug reports
and new review results. I hope I get a productive response.

> > 6) don't ignore another developers bug report
> > 
> 
> If you have a bug report that is the result of this rewrite, please come 
> forward with it and don't carry this out by making me guess again.
> 
> > I didn't hope says the same thing twice and he repeatedly ignore
> > my opinion, thus, he got short answer. I didn't think this is inadequate
> > beucase he can google past mail.
> > 
> 
> No, you've never said this is the reason why it was dropped from -mm or 
> why it was "nack"'d early on.
> 
> > However he repeatedly attach our goodwill and blame our tolerance. 
> > but also repeatedly said "My workload is important than other!".
> > Then, I got upset really.
> > 
> 
> What??  I don't even have a specific workload that I'm targeting with this 
> change, I have no idea what you're referring to, we don't run much stuff 
> on the desktop :)
>
> > The fact is, all of good developer never says "my workload is most
> > important in the world", it makes no sense and insane. I really hate
> > such selfish.
> 
> Again, this is just a ridiculous accusation.  I have no idea what you're 
> referring to since this rewrite is specifically addressed to fix the oom 
> killer problems on the desktop.  I work on servers and systems software, I 
> don't have a desktop workload that I'm advocating for here, so perhaps you 
> got me confused with someone else.

David, do you know how much time other kernel engineers spend understanding
real workloads and talking with various open source communities, Linux user
companies and user groups?

At the very least, all developers must make an _effort_ to spend some time
investigating userland use cases when they want to introduce a new feature or
an incompatibility. Almost all developers do. Please read the git logs of
various new features. A few commit logs are ridiculously terse (probably the
author couldn't be bothered to cut-n-paste from the ML bug report), but almost
all of them describe what the problem is.
Thus, we can double-check that the problem and the code match.

And if you can't test your patch on various platforms, you must at least
write up the theoretical background of your patch. That definitely helps the
engineers in each area confirm that your patch doesn't harm their area. In
principle, however, if you want to introduce any incompatibility, you must
investigate how much it affects users.

Remark: if you think you need a mathematical proof or a proof with 100%
coverage, that is not correct. You don't need to do such impossible work. We
only need to confirm that you have investigated and considered a large enough
range of cases.

Usually, the authors of small patches aren't required to do this, because
reviewers can work out the affected use cases from the code. Most reviewers
have more use-case knowledge than typical kernel developers. But now you are
attempting a full rewrite. We don't have enough information to finish reviewing.

Last of all, I've sent various review results in other mails. Can you please
read them?

Thanks.



^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [patch -mm 01/18] oom: filter tasks not sharing the same cpuset
  2010-06-08 11:41   ` KOSAKI Motohiro
@ 2010-06-08 18:37     ` David Rientjes
  2010-06-13 11:24       ` KOSAKI Motohiro
  0 siblings, 1 reply; 99+ messages in thread
From: David Rientjes @ 2010-06-08 18:37 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Andrew Morton, Rik van Riel, Nick Piggin, Oleg Nesterov,
	KAMEZAWA Hiroyuki, Balbir Singh, linux-mm

On Tue, 8 Jun 2010, KOSAKI Motohiro wrote:

> > @@ -267,6 +259,8 @@ static struct task_struct *select_bad_process(unsigned long *ppoints,
> >  			continue;
> >  		if (mem && !task_in_mem_cgroup(p, mem))
> >  			continue;
> > +		if (!has_intersects_mems_allowed(p))
> > +			continue;
> >  
> >  		/*
> >  		 * This task already has access to memory reserves and is
> 
> now we have three places of oom filtering
>   (1) select_bad_process

Done.

>   (2) dump_tasks

dump_tasks() has never filtered on this; it's possible for tasks in other 
cpusets to allocate memory on our nodes.

>   (3) oom_kill_task (when oom_kill_allocating_task==1 only)
> 

Why would we care about cpuset attachment in oom_kill_task()?  Do you mean 
oom_kill_process(), to filter the children list?


^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [patch -mm 11/18] oom: avoid oom killer for lowmem allocations
  2010-06-08 11:41   ` KOSAKI Motohiro
@ 2010-06-08 18:38     ` David Rientjes
  0 siblings, 0 replies; 99+ messages in thread
From: David Rientjes @ 2010-06-08 18:38 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Andrew Morton, Rik van Riel, Nick Piggin, Oleg Nesterov,
	KAMEZAWA Hiroyuki, Balbir Singh, linux-mm

On Tue, 8 Jun 2010, KOSAKI Motohiro wrote:

> > Previously, the heuristic provided some protection for those tasks with
> > CAP_SYS_RAWIO, but this is no longer necessary since we will not be
> > killing tasks for the purposes of ISA allocations.
> 
> Seems incorrect. CAP_SYS_RAWIO tasks usually both use GFP_KERNEL and GFP_DMA.
> Even if last allocation is GFP_KERNEL, it doesn't provide any gurantee the
> process doesn't have any in flight I/O.
> 

Right, that's why I said it "provided some protection".

> Then, we can't remove for RAWIO protection from oom heuristics. but the code
> itself seems ok though.
> 

It's removed with my heuristic rewrite.


^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [patch -mm 02/18] oom: sacrifice child with highest badness score for parent
  2010-06-08 11:41   ` KOSAKI Motohiro
@ 2010-06-08 18:41     ` David Rientjes
  2010-06-13 11:24       ` KOSAKI Motohiro
  0 siblings, 1 reply; 99+ messages in thread
From: David Rientjes @ 2010-06-08 18:41 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Andrew Morton, Rik van Riel, Nick Piggin, Oleg Nesterov,
	KAMEZAWA Hiroyuki, Balbir Singh, linux-mm

On Tue, 8 Jun 2010, KOSAKI Motohiro wrote:

> > Reviewers may observe that the previous implementation would iterate
> > through the children and attempt to kill each until one was successful and
> > then the parent if none were found while the new code simply kills the
> > most memory-hogging task or the parent.  Note that the only time
> > oom_kill_task() fails, however, is when a child does not have an mm or has
> > a /proc/pid/oom_adj of OOM_DISABLE.  badness() returns 0 for both cases,
> > so the final oom_kill_task() will always succeed.
> 
> We probably need to call has_intersects_mems_allowed() in this loop, like:
> 
>         /* Try to sacrifice the worst child first */
>         do {
>                 list_for_each_entry(c, &t->children, sibling) {
>                         unsigned long cpoints;
> 
>                         if (c->mm == p->mm)
>                                 continue;
>                         if (oom_unkillable(c, mem, nodemask))
>                                 continue;
> 
>                         /* oom_badness() returns 0 if the thread is unkillable */
>                         cpoints = oom_badness(c);
>                         if (cpoints > victim_points) {
>                                 victim = c;
>                                 victim_points = cpoints;
>                         }
>                 }
>         } while_each_thread(p, t);
> 
> 
> It means we perhaps shouldn't assume the parent and child have the same
> mems_allowed.
> 

I'd be happy to have that in oom_kill_process() if you pass the
enum oom_constraint and only do it for CONSTRAINT_CPUSET.  Please add a 
followup patch to my latest patch series.


^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [patch -mm 01/18] oom: filter tasks not sharing the same cpuset
  2010-06-08 11:41   ` KOSAKI Motohiro
@ 2010-06-08 18:43     ` David Rientjes
  2010-06-08 23:25       ` Andrew Morton
  0 siblings, 1 reply; 99+ messages in thread
From: David Rientjes @ 2010-06-08 18:43 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Andrew Morton, Rik van Riel, Nick Piggin, Oleg Nesterov,
	KAMEZAWA Hiroyuki, Balbir Singh, linux-mm

On Tue, 8 Jun 2010, KOSAKI Motohiro wrote:

> I've put the following historical remark in the description of the patch.
> 
> 
>     We applied exactly the same patch in 2005:
> 
>         : commit ef08e3b4981aebf2ba9bd7025ef7210e8eec07ce
>         : Author: Paul Jackson <pj@sgi.com>
>         : Date:   Tue Sep 6 15:18:13 2005 -0700
>         :
>         : [PATCH] cpusets: confine oom_killer to mem_exclusive cpuset
>         :
>         : Now the real motivation for this cpuset mem_exclusive patch series seems
>         : trivial.
>         :
>         : This patch keeps a task in or under one mem_exclusive cpuset from provoking an
>         : oom kill of a task under a non-overlapping mem_exclusive cpuset.  Since only
>         : interrupt and GFP_ATOMIC allocations are allowed to escape mem_exclusive
>         : containment, there is little to gain from oom killing a task under a
>         : non-overlapping mem_exclusive cpuset, as almost all kernel and user memory
>         : allocation must come from disjoint memory nodes.
>         :
>         : This patch enables configuring a system so that a runaway job under one
>         : mem_exclusive cpuset cannot cause the killing of a job in another such cpuset
>         : that might be using very high compute and memory resources for a prolonged
>         : time.
> 
>     And we changed it to the current logic in 2006:
> 
>         : commit 7887a3da753e1ba8244556cc9a2b38c815bfe256
>         : Author: Nick Piggin <npiggin@suse.de>
>         : Date:   Mon Sep 25 23:31:29 2006 -0700
>         :
>         : [PATCH] oom: cpuset hint
>         :
>         : cpuset_excl_nodes_overlap does not always indicate that killing a task will
>         : not free any memory we for us.  For example, we may be asking for an
>         : allocation from _anywhere_ in the machine, or the task in question may be
>         : pinning memory that is outside its cpuset.  Fix this by just causing
>         : cpuset_excl_nodes_overlap to reduce the badness rather than disallow it.
> 
>     And we haven't got an explanation of why this patch doesn't reintroduce
>     the old issue.
> 
> I don't refuse a patch if it has multiple acks. But if you have any
> material or number, please show us soon.
> 

And this patch is acked by the 2006 patch's author, Nick Piggin.

There's obviously not going to be any "number" to show that this means 
anything, but we've run it internally for three years to prevent needless 
oom killing in other cpusets that don't have any indication that it will 
free memory that current needs.


^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [patch -mm 02/18] oom: sacrifice child with highest badness score for parent
  2010-06-08 11:41   ` KOSAKI Motohiro
@ 2010-06-08 18:45     ` David Rientjes
  0 siblings, 0 replies; 99+ messages in thread
From: David Rientjes @ 2010-06-08 18:45 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Andrew Morton, Rik van Riel, Nick Piggin, Oleg Nesterov,
	KAMEZAWA Hiroyuki, Balbir Singh, linux-mm

On Tue, 8 Jun 2010, KOSAKI Motohiro wrote:

> > @@ -447,19 +450,27 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
> >  		return 0;
> >  	}
> >  
> > -	printk(KERN_ERR "%s: kill process %d (%s) score %li or a child\n",
> > -					message, task_pid_nr(p), p->comm, points);
> > +	pr_err("%s: Kill process %d (%s) with score %lu or sacrifice child\n",
> > +		message, task_pid_nr(p), p->comm, points);
> >  
> > -	/* Try to kill a child first */
> > +	do_posix_clock_monotonic_gettime(&uptime);
> > +	/* Try to sacrifice the worst child first */
> >  	list_for_each_entry(c, &p->children, sibling) {
> > +		unsigned long cpoints;
> > +
> >  		if (c->mm == p->mm)
> >  			continue;
> >  		if (mem && !task_in_mem_cgroup(c, mem))
> >  			continue;
> > -		if (!oom_kill_task(c))
> > -			return 0;
> > +
> 
> We probably need to check the cpuset (and mempolicy) memory intersection here.
> Otherwise, this may select an innocent task.
> 

I'll do this, then, if you don't want to post your own patch.  Fine.

> Also, is an OOM_DISABLE check necessary?
> 

No, badness() is 0 for tasks that are OOM_DISABLE.
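
The point can be illustrated with a minimal userspace sketch. Everything in it
(struct toy_task, toy_badness(), toy_pick_victim()) is a hypothetical stand-in
invented for illustration, not the kernel's code; it only assumes, as stated
above, that the score is 0 for a task without an mm or with OOM_DISABLE set.

```c
#include <stddef.h>

/* Hypothetical stand-in for a task; not the kernel's task_struct. */
struct toy_task {
	int oom_disabled;        /* models /proc/pid/oom_adj == OOM_DISABLE */
	int has_mm;              /* 0 models a task without an mm */
	unsigned long rss_pages; /* toy memory footprint */
};

/* Toy score: 0 means "never select", mirroring the two cases above. */
static unsigned long toy_badness(const struct toy_task *t)
{
	if (!t->has_mm || t->oom_disabled)
		return 0;
	return t->rss_pages;
}

/*
 * Return the index of the child with the highest score, or -1 to mean
 * "fall back to the parent".  The strict '>' against an initial 0 is what
 * skips unkillable children without a separate OOM_DISABLE test.
 */
static int toy_pick_victim(const struct toy_task *children, size_t n)
{
	unsigned long victim_points = 0;
	int victim = -1;
	size_t i;

	for (i = 0; i < n; i++) {
		unsigned long cpoints = toy_badness(&children[i]);

		if (cpoints > victim_points) {
			victim = (int)i;
			victim_points = cpoints;
		}
	}
	return victim;
}
```

A child that is OOM_DISABLE never beats the initial score of 0, so it can only
be "selected" by never being selected at all.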


^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [patch -mm 01/18] oom: filter tasks not sharing the same cpuset
  2010-06-08 18:43     ` David Rientjes
@ 2010-06-08 23:25       ` Andrew Morton
  2010-06-08 23:54         ` David Rientjes
  0 siblings, 1 reply; 99+ messages in thread
From: Andrew Morton @ 2010-06-08 23:25 UTC (permalink / raw)
  To: David Rientjes
  Cc: KOSAKI Motohiro, Rik van Riel, Nick Piggin, Oleg Nesterov,
	KAMEZAWA Hiroyuki, Balbir Singh, linux-mm

On Tue, 8 Jun 2010 11:43:13 -0700 (PDT)
David Rientjes <rientjes@google.com> wrote:

> On Tue, 8 Jun 2010, KOSAKI Motohiro wrote:
> 
> > I've put the following historical remark in the description of the patch.
> > 
> > 
> >     We applied exactly the same patch in 2005:
> > 
> >         : commit ef08e3b4981aebf2ba9bd7025ef7210e8eec07ce
> >         : Author: Paul Jackson <pj@sgi.com>
> >         : Date:   Tue Sep 6 15:18:13 2005 -0700
> >         :
> >         : [PATCH] cpusets: confine oom_killer to mem_exclusive cpuset
> >         :
> >         : Now the real motivation for this cpuset mem_exclusive patch series seems
> >         : trivial.
> >         :
> >         : This patch keeps a task in or under one mem_exclusive cpuset from provoking an
> >         : oom kill of a task under a non-overlapping mem_exclusive cpuset.  Since only
> >         : interrupt and GFP_ATOMIC allocations are allowed to escape mem_exclusive
> >         : containment, there is little to gain from oom killing a task under a
> >         : non-overlapping mem_exclusive cpuset, as almost all kernel and user memory
> >         : allocation must come from disjoint memory nodes.
> >         :
> >         : This patch enables configuring a system so that a runaway job under one
> >         : mem_exclusive cpuset cannot cause the killing of a job in another such cpuset
> >         : that might be using very high compute and memory resources for a prolonged
> >         : time.
> > 
> >     And we changed it to the current logic in 2006:
> > 
> >         : commit 7887a3da753e1ba8244556cc9a2b38c815bfe256
> >         : Author: Nick Piggin <npiggin@suse.de>
> >         : Date:   Mon Sep 25 23:31:29 2006 -0700
> >         :
> >         : [PATCH] oom: cpuset hint
> >         :
> >         : cpuset_excl_nodes_overlap does not always indicate that killing a task will
> >         : not free any memory we for us.  For example, we may be asking for an
> >         : allocation from _anywhere_ in the machine, or the task in question may be
> >         : pinning memory that is outside its cpuset.  Fix this by just causing
> >         : cpuset_excl_nodes_overlap to reduce the badness rather than disallow it.
> > 
> >     And we haven't got an explanation of why this patch doesn't reintroduce
> >     the old issue.

hm, that was some good kernel archeological research.

> > I don't refuse a patch if it has multiple acks. But if you have any
> > material or number, please show us soon.
> > 
> 
> And this patch is acked by the 2006 patch's author, Nick Piggin.
> 
> There's obviously not going to be any "number" to show that this means 
> anything, but we've run it internally for three years to prevent needless 
> oom killing in other cpusets that don't have any indication that it will 
> free memory that current needs.

Well I wonder if Nick had observed some problem which the 2006 change
fixed.

And I wonder if David has observed some problem which the 2010 change
fixes!


^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [patch -mm 03/18] oom: select task from tasklist for mempolicy ooms
  2010-06-08 11:41   ` KOSAKI Motohiro
@ 2010-06-08 23:28     ` Andrew Morton
  0 siblings, 0 replies; 99+ messages in thread
From: Andrew Morton @ 2010-06-08 23:28 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: David Rientjes, Rik van Riel, Nick Piggin, Oleg Nesterov,
	KAMEZAWA Hiroyuki, Balbir Singh, linux-mm

On Tue,  8 Jun 2010 20:41:52 +0900 (JST)
KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote:

> > -			panic("out of memory. panic_on_oom is selected\n");
> > +			panic("Out of memory: panic_on_oom is enabled\n");
> 
> You shouldn't mix in undocumented and unnecessary changes.

Well...  strictly true.  But there's not a lot of benefit in being all
dogmatic about these things.  If the change is simple and is of some
benefit and doesn't muck up the patch too much, I just let it go, shrug.


^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [patch -mm 08/18] oom: badness heuristic rewrite
  2010-06-08 11:41               ` KOSAKI Motohiro
@ 2010-06-08 23:47                 ` Andrew Morton
  2010-06-17  3:28                   ` David Rientjes
  0 siblings, 1 reply; 99+ messages in thread
From: Andrew Morton @ 2010-06-08 23:47 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: David Rientjes, Rik van Riel, Nick Piggin, Oleg Nesterov,
	KAMEZAWA Hiroyuki, Balbir Singh, linux-mm

On Tue,  8 Jun 2010 20:41:55 +0900 (JST)
KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote:

> of the patch don't concentrate on one thing. 2) It concentrates strongly on
> "what and how to implement". But reviewers don't need that information so much
> because they can read C. Reviewers need the following information:
>   - background
>   - why did the author choose this way?
>   - why did the author choose this default value?
>   - how to confirm that the concept and implementation are correct?
>   - etc.
> 
> With that, reviewers can follow the author's thinking and give good advice and judgement.
> For example, in this case you wrote
>  - default threshold is 1000
>  - only accumulate 1st generation execve children
>  - time threshold is a second
> 
> but did not write why. Messy sentences hide such a lack of documentation. So I usually insist on
> splitting the patch, because a split naturally reduces the description to "which place changes"
> and exposes what is lacking.
> 
> Now I haven't got your intention. Having no test suite makes it even harder to tell
> which workload the author thinks is a problem workload.

hey, you're starting to sound like me.

>
> ...
>
> David, do you know how much time other kernel engineers spend understanding
> real workloads and talking with various open source communities, Linux user companies
> and user groups?
> 
> At the least, all developers must make an _effort_ to spend some time investigating
> userland use cases when they want to introduce a new feature or an incompatibility.
> Almost all developers do. Please read the git logs of various new features. A few commit logs
> are ridiculously quiet (probably the author found it a bother to cut-n-paste from the ML bug
> report), but almost all describe what the problem is.
> Thus, we can double check that the problem and the code match.
> 
> And if you can't test your patch on various platforms, you must at least
> write up the theoretical background of your patch. It definitely helps the engineers in each
> area confirm that your patch doesn't harm their area. However, as a matter of principle, if you
> want to introduce any incompatibility, you must investigate how much it affects users.
> 
> Remark: if you think you need a mathematical proof or a 100% coverage proof,
> that's not correct. You don't need such impossible work. We just require you to
> confirm that you investigated and considered large enough coverage.
> 
> Usually, the author of a small patch isn't required to do this, because reviewers can
> work out the affected use cases from the code. Almost all reviewers have more use-case
> knowledge than typical kernel developers. But now you are attempting a full
> rewrite. We don't have enough information to finish reviewing.
> 
> Last of all, I've sent various review results in another mail. Can you please
> read it?
> 

I think I'm beginning to understand your concerns with these patches. 
Finally.

Yes, it's a familiar one.  I do fairly commonly see patches where the
description can be summarised as "change lots and lots of stuff to no
apparent end" and one does have to push and poke to squeeze out the
thinking and the reasons.  It's a useful exercise and will sometimes
cause the originator to have a rethink, and sometimes reveals that it
just wasn't a good change.

Maybe if we'd been more diligent about all this around 2.6.12, we
wouldn't have wrecked dirty-page writeout off the tail of the LRU. 
Which is STILL wrecked, btw.

I think I read somewhere in one of David's emails that some of this
code has been floating around in Google for several years?  If so, the
reasons for making certain changes might even be lost and forgotten.


^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [patch -mm 01/18] oom: filter tasks not sharing the same cpuset
  2010-06-08 23:25       ` Andrew Morton
@ 2010-06-08 23:54         ` David Rientjes
  2010-06-09  0:06           ` Andrew Morton
  0 siblings, 1 reply; 99+ messages in thread
From: David Rientjes @ 2010-06-08 23:54 UTC (permalink / raw)
  To: Andrew Morton
  Cc: KOSAKI Motohiro, Rik van Riel, Nick Piggin, Oleg Nesterov,
	KAMEZAWA Hiroyuki, Balbir Singh, linux-mm

On Tue, 8 Jun 2010, Andrew Morton wrote:

> And I wonder if David has observed some problem which the 2010 change
> fixes!
> 

Yes, as explained in my changelog.  I'll paste it:

Tasks that do not share the same set of allowed nodes with the task that
triggered the oom should not be considered as candidates for oom kill.

Tasks in other cpusets with a disjoint set of mems would be unfairly
penalized otherwise because of oom conditions elsewhere; an extreme
example could unfairly kill all other applications on the system if a
single task in a user's cpuset sets itself to OOM_DISABLE and then uses
more memory than allowed.

Killing tasks outside of current's cpuset rarely would free memory for
current anyway.  To use a sane heuristic, we must ensure that killing a
task would likely free memory for current and avoid needlessly killing
others at all costs just because their potential memory freeing is
unknown.  It is better to kill current than another task needlessly.
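
As a rough userspace sketch of the filter described in this changelog: the
names below (toy_nodemask, toy_mems_intersect(), toy_is_candidate()) are
hypothetical, and a 64-bit word with one bit per memory node stands in for the
kernel's nodemask_t. It is an illustration of the idea, not the patch itself.

```c
#include <stdint.h>

/* Hypothetical: one bit per memory node, standing in for nodemask_t. */
typedef uint64_t toy_nodemask;

/* Nonzero if the two tasks may allocate from at least one common node. */
static int toy_mems_intersect(toy_nodemask a, toy_nodemask b)
{
	return (a & b) != 0;
}

/*
 * A task is only worth considering as a victim if killing it could
 * plausibly free memory on a node that the task which triggered the oom
 * is allowed to use.  Tasks with disjoint mems are skipped entirely.
 */
static int toy_is_candidate(toy_nodemask current_mems, toy_nodemask task_mems)
{
	return toy_mems_intersect(current_mems, task_mems);
}
```

With current confined to nodes 0-1 (mask 0x3), a task on nodes 2-3 (mask 0xc)
is never a candidate, while a task on node 1 (mask 0x2) is.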


^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [patch -mm 01/18] oom: filter tasks not sharing the same cpuset
  2010-06-08 23:54         ` David Rientjes
@ 2010-06-09  0:06           ` Andrew Morton
  2010-06-09  1:07             ` David Rientjes
  2010-06-13 11:24             ` KOSAKI Motohiro
  0 siblings, 2 replies; 99+ messages in thread
From: Andrew Morton @ 2010-06-09  0:06 UTC (permalink / raw)
  To: David Rientjes
  Cc: KOSAKI Motohiro, Rik van Riel, Nick Piggin, Oleg Nesterov,
	KAMEZAWA Hiroyuki, Balbir Singh, linux-mm

On Tue, 8 Jun 2010 16:54:31 -0700 (PDT)
David Rientjes <rientjes@google.com> wrote:

> On Tue, 8 Jun 2010, Andrew Morton wrote:
> 
> > And I wonder if David has observed some problem which the 2010 change
> > fixes!
> > 
> 
> Yes, as explained in my changelog.  I'll paste it:
> 
> Tasks that do not share the same set of allowed nodes with the task that
> triggered the oom should not be considered as candidates for oom kill.
> 
> Tasks in other cpusets with a disjoint set of mems would be unfairly
> penalized otherwise because of oom conditions elsewhere; an extreme
> example could unfairly kill all other applications on the system if a
> single task in a user's cpuset sets itself to OOM_DISABLE and then uses
> more memory than allowed.

OK, so Nick's change didn't anticipate things being set to OOM_DISABLE?

OOM_DISABLE seems pretty dangerous really - allows malicious
unprivileged users to go homicidal?


^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [patch -mm 01/18] oom: filter tasks not sharing the same cpuset
  2010-06-09  0:06           ` Andrew Morton
@ 2010-06-09  1:07             ` David Rientjes
  2010-06-13 11:24             ` KOSAKI Motohiro
  1 sibling, 0 replies; 99+ messages in thread
From: David Rientjes @ 2010-06-09  1:07 UTC (permalink / raw)
  To: Andrew Morton
  Cc: KOSAKI Motohiro, Rik van Riel, Nick Piggin, Oleg Nesterov,
	KAMEZAWA Hiroyuki, Balbir Singh, linux-mm

On Tue, 8 Jun 2010, Andrew Morton wrote:

> > Tasks that do not share the same set of allowed nodes with the task that
> > triggered the oom should not be considered as candidates for oom kill.
> > 
> > Tasks in other cpusets with a disjoint set of mems would be unfairly
> > penalized otherwise because of oom conditions elsewhere; an extreme
> > example could unfairly kill all other applications on the system if a
> > single task in a user's cpuset sets itself to OOM_DISABLE and then uses
> > more memory than allowed.
> 
> OK, so Nick's change didn't anticipate things being set to OOM_DISABLE?
> 

I wrote out a more elaborate rebuttal to this in your reply to my latest 
patchset, but not strictly eliminating these tasks from consideration 
unfairly penalizes tasks in other cpusets simply because they're big, 
there's no way to understand the scale of other cpusets compared to 
current's with a single divide in the heuristic (in this case, divide by 
8), and there's no guarantee that killing such a task would free any 
memory which would have two results: (i) we need to reinvoke the oom 
killer to kill yet another task, and (ii) we've now unnecessarily killed a 
task simply because it was large and probably lost a substantial amount of 
work.
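
The difference between the two approaches can be sketched in a few lines of
userspace C. The function names are hypothetical and the scores are made-up
numbers; the only thing taken from the discussion is that the existing
heuristic divides the score by 8 for a task with disjoint mems, while the
proposed change removes such a task from consideration.

```c
/* Existing approach as described above: keep the task, divide its score by 8. */
static unsigned long toy_score_divide(unsigned long points, int intersects)
{
	return intersects ? points : points / 8;
}

/* Proposed approach: a task with disjoint mems scores 0 and is never picked. */
static unsigned long toy_score_filter(unsigned long points, int intersects)
{
	return intersects ? points : 0;
}
```

For example, a task in another cpuset with 80000 points still scores 10000
after the divide, which beats a 5000-point task in current's own cpuset even
though killing it may free nothing current can use. With the filter it scores
0 and the local task is chosen instead.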

> OOM_DISABLE seems pretty dangerous really - allows malicious
> unprivileged users to go homicidal?
> 

OOM_DISABLE doesn't get set without CAP_SYS_RESOURCE, you need that 
capability to decrease an oom_adj value.  So my changelog could probably 
benefit from s/user/job/.
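
A minimal sketch of that rule, assuming a hypothetical helper
toy_set_oom_adj() and a plain flag in place of a real capability check:
raising oom_adj is always allowed, lowering it requires the capability.
OOM_DISABLE is -17 in kernels of this era; the -1 return value is a
simplification for illustration only.

```c
#define TOY_OOM_DISABLE (-17)	/* value of OOM_DISABLE in include/linux/oom.h */

/*
 * Hypothetical model of writing /proc/pid/oom_adj.
 * Returns 0 on success, -1 if the caller lacks the needed capability.
 */
static int toy_set_oom_adj(int *oom_adj, int new_value, int has_cap_sys_resource)
{
	/* Only a decrease (becoming less killable) needs the capability. */
	if (new_value < *oom_adj && !has_cap_sys_resource)
		return -1;
	*oom_adj = new_value;
	return 0;
}
```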


^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [patch -mm 01/18] oom: filter tasks not sharing the same cpuset
  2010-06-08 18:37     ` David Rientjes
@ 2010-06-13 11:24       ` KOSAKI Motohiro
  2010-06-17  3:33         ` David Rientjes
  0 siblings, 1 reply; 99+ messages in thread
From: KOSAKI Motohiro @ 2010-06-13 11:24 UTC (permalink / raw)
  To: David Rientjes
  Cc: kosaki.motohiro, Andrew Morton, Rik van Riel, Nick Piggin,
	Oleg Nesterov, KAMEZAWA Hiroyuki, Balbir Singh, linux-mm

> On Tue, 8 Jun 2010, KOSAKI Motohiro wrote:
> 
> > > @@ -267,6 +259,8 @@ static struct task_struct *select_bad_process(unsigned long *ppoints,
> > >  			continue;
> > >  		if (mem && !task_in_mem_cgroup(p, mem))
> > >  			continue;
> > > +		if (!has_intersects_mems_allowed(p))
> > > +			continue;
> > >  
> > >  		/*
> > >  		 * This task already has access to memory reserves and is
> > 
> > now we have three places of oom filtering
> >   (1) select_bad_process
> 
> Done.
> 
> >   (2) dump_tasks
> 
> dump_tasks() has never filtered on this; it's possible for tasks in other 
> cpusets to allocate memory on our nodes.

I have no objection because it's a policy matter. But if so, dump_tasks()
should probably display the mems_allowed mask too.
Otherwise, the end user can't understand why a task with high badness but no
memory intersection wasn't killed.


> >   (3) oom_kill_task (when oom_kill_allocating_task==1 only)
> > 
> 
> Why would we care about cpuset attachment in oom_kill_task()?  You mean 
> oom_kill_process() to filter the children list?

Ah, interesting question. OK, we have to discuss the oom_kill_allocating_task
design first.

First of all, filtering the children list in oom_kill_process() and this issue
are independent and unrelated. My patch was not correct either.

Now, the basic oom_kill_allocating_task logic is shown below. If oom_kill_process()
returns 0, the oom kill finished successfully; if oom_kill_process() returns 1,
we fall back to the normal __out_of_memory() path.


	===================================================
	static void __out_of_memory(gfp_t gfp_mask, int order, nodemask_t *nodemask)
	{
	        struct task_struct *p;
	        unsigned long points;
	
	        if (sysctl_oom_kill_allocating_task)
	                if (!oom_kill_process(current, gfp_mask, order, 0, NULL, nodemask,
	                                      "Out of memory (oom_kill_allocating_task)"))
	                        return;
	retry:

When should oom_kill_process() return 1?
I think it should be when
	- current is OOM_DISABLE
	- current has no intersecting cpuset
	- current is a kthread
	- etc.

That is, the rules should be consistent with the !oom_kill_allocating_task case.
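
A small userspace sketch of the fallback rule listed above. The struct and
function names are hypothetical, and the three flags simply model the listed
conditions; the return convention (0 = killed, 1 = could not kill) follows the
description of oom_kill_process() in this mail.

```c
/* Hypothetical flags describing the allocating task. */
struct toy_current {
	int oom_disabled;   /* models OOM_DISABLE */
	int is_kthread;     /* models a kernel thread */
	int mems_intersect; /* models cpuset intersection with the oom context */
};

/* Returns 0 if the task could be killed, 1 if it is unkillable. */
static int toy_kill_process(const struct toy_current *t)
{
	if (t->oom_disabled || t->is_kthread || !t->mems_intersect)
		return 1;
	return 0;
}

/*
 * Returns 1 if the allocating task itself was chosen, 0 if the caller
 * has to fall back to scanning the tasklist for the worst task.
 */
static int toy_out_of_memory(const struct toy_current *cur,
			     int kill_allocating_task)
{
	if (kill_allocating_task && !toy_kill_process(cur))
		return 1;
	/* Fall through: the full scan would happen here. */
	return 0;
}
```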

My previous patch didn't take the conflict with the "oom: sacrifice child with 
highest badness score for parent" patch into account. Probably the right way is

static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
                            unsigned long points, struct mem_cgroup *mem,
                            nodemask_t *nodemask, const char *message)
{
        struct task_struct *c;
        struct task_struct *t = p;
        struct task_struct *victim = p;
        unsigned long victim_points = 0;
        struct timespec uptime;

+	/* This process is not oom killable, we need to retry to select
+	   bad process */
+	if (oom_unkillable(c, mem, nodemask))
+		return 1;

        if (printk_ratelimit())
                dump_header(p, gfp_mask, order, mem, nodemask);

        pr_err("%s: Kill process %d (%s) with score %lu or sacrifice child\n",
               message, task_pid_nr(p), p->comm, points);


or something else.

What do you think?




^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [patch -mm 02/18] oom: sacrifice child with highest badness score for parent
  2010-06-08 18:41     ` David Rientjes
@ 2010-06-13 11:24       ` KOSAKI Motohiro
  2010-06-14  8:54         ` David Rientjes
  0 siblings, 1 reply; 99+ messages in thread
From: KOSAKI Motohiro @ 2010-06-13 11:24 UTC (permalink / raw)
  To: David Rientjes
  Cc: kosaki.motohiro, Andrew Morton, Rik van Riel, Nick Piggin,
	Oleg Nesterov, KAMEZAWA Hiroyuki, Balbir Singh, linux-mm

> > It means we perhaps shouldn't assume the parent and child have the same
> > mems_allowed.
> > 
> 
> I'd be happy to have that in oom_kill_process() if you pass the
> enum oom_constraint and only do it for CONSTRAINT_CPUSET.  Please add a 
> followup patch to my latest patch series.

Please clarify.
Why do we need CONSTRAINT_CPUSET filter?




^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [patch -mm 01/18] oom: filter tasks not sharing the same cpuset
  2010-06-09  0:06           ` Andrew Morton
  2010-06-09  1:07             ` David Rientjes
@ 2010-06-13 11:24             ` KOSAKI Motohiro
  1 sibling, 0 replies; 99+ messages in thread
From: KOSAKI Motohiro @ 2010-06-13 11:24 UTC (permalink / raw)
  To: Andrew Morton
  Cc: kosaki.motohiro, David Rientjes, Rik van Riel, Nick Piggin,
	Oleg Nesterov, KAMEZAWA Hiroyuki, Balbir Singh, linux-mm

> On Tue, 8 Jun 2010 16:54:31 -0700 (PDT)
> David Rientjes <rientjes@google.com> wrote:
> 
> > On Tue, 8 Jun 2010, Andrew Morton wrote:
> > 
> > > And I wonder if David has observed some problem which the 2010 change
> > > fixes!
> > > 
> > 
> > Yes, as explained in my changelog.  I'll paste it:
> > 
> > Tasks that do not share the same set of allowed nodes with the task that
> > triggered the oom should not be considered as candidates for oom kill.
> > 
> > Tasks in other cpusets with a disjoint set of mems would be unfairly
> > penalized otherwise because of oom conditions elsewhere; an extreme
> > example could unfairly kill all other applications on the system if a
> > single task in a user's cpuset sets itself to OOM_DISABLE and then uses
> > more memory than allowed.
> 
> OK, so Nick's change didn't anticipate things being set to OOM_DISABLE?
> 
> OOM_DISABLE seems pretty dangerous really - allows malicious
> unprivileged users to go homicidal?

Just to clarify.

David's patch has the following pros and cons.

Pros
	- The 1/8 badness was inaccurate, and it was a bit unclear why 1/8.
	- Usually, almost all processes don't change their cpuset mask
	  during their lifetime, so cpuset_mems_allowed_intersects()
	  is a reasonably good heuristic.

Cons
	- But they can change the cpuset mask, so we can't assume that
	  cpuset_mems_allowed_intersects() always reflects the real
	  memory usage.
	- The task may have mlocked page cache outside the cpuset mask
	  (perhaps because it is using cpuset.memory_spread_page).


I don't think this is an OOM_DISABLE-related issue. I think it is just a matter
of heuristic choice. Both approaches obviously have corner cases. That is why I
asked about the most typical workload concerns and test results.





^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [patch -mm 02/18] oom: sacrifice child with highest badness score for parent
  2010-06-13 11:24       ` KOSAKI Motohiro
@ 2010-06-14  8:54         ` David Rientjes
  2010-06-14 11:08           ` KOSAKI Motohiro
  0 siblings, 1 reply; 99+ messages in thread
From: David Rientjes @ 2010-06-14  8:54 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Andrew Morton, Rik van Riel, Nick Piggin, Oleg Nesterov,
	KAMEZAWA Hiroyuki, Balbir Singh, linux-mm

On Sun, 13 Jun 2010, KOSAKI Motohiro wrote:

> > > It means we perhaps shouldn't assume the parent and child have the same
> > > mems_allowed.
> > > 
> > 
> > I'd be happy to have that in oom_kill_process() if you pass the
> > enum oom_constraint and only do it for CONSTRAINT_CPUSET.  Please add a 
> > followup patch to my latest patch series.
> 
> Please clarify.
> Why do we need CONSTRAINT_CPUSET filter?
> 

Because we don't care about intersecting mems_allowed unless it's a cpuset 
constrained oom.


^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [patch -mm 02/18] oom: sacrifice child with highest badness score for parent
  2010-06-14  8:54         ` David Rientjes
@ 2010-06-14 11:08           ` KOSAKI Motohiro
  0 siblings, 0 replies; 99+ messages in thread
From: KOSAKI Motohiro @ 2010-06-14 11:08 UTC (permalink / raw)
  To: David Rientjes
  Cc: kosaki.motohiro, Andrew Morton, Rik van Riel, Nick Piggin,
	Oleg Nesterov, KAMEZAWA Hiroyuki, Balbir Singh, linux-mm

> > > > It means we perhaps shouldn't assume the parent and child have the same
> > > > mems_allowed.
> > > > 
> > > 
> > > I'd be happy to have that in oom_kill_process() if you pass the
> > > enum oom_constraint and only do it for CONSTRAINT_CPUSET.  Please add a 
> > > followup patch to my latest patch series.
> > 
> > Please clarify.
> > Why do we need CONSTRAINT_CPUSET filter?
> > 
> 
> Because we don't care about intersecting mems_allowed unless it's a cpuset 
> constrained oom.

OK, I got your point. My version has the following hunk. 
I think a simple nodemask != NULL check is cleaner.



====================================================
void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
                int order, nodemask_t *nodemask)
{
(snip)
        if (constraint != CONSTRAINT_MEMORY_POLICY)
                nodemask = NULL;
(snip)
        read_lock(&tasklist_lock);
        __out_of_memory(gfp_mask, order, nodemask);
        read_unlock(&tasklist_lock);


^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [patch -mm 08/18] oom: badness heuristic rewrite
  2010-06-08 23:47                 ` Andrew Morton
@ 2010-06-17  3:28                   ` David Rientjes
  0 siblings, 0 replies; 99+ messages in thread
From: David Rientjes @ 2010-06-17  3:28 UTC (permalink / raw)
  To: Andrew Morton
  Cc: KOSAKI Motohiro, Rik van Riel, Nick Piggin, Oleg Nesterov,
	KAMEZAWA Hiroyuki, Balbir Singh, linux-mm

On Tue, 8 Jun 2010, Andrew Morton wrote:

> > of the patch don't concentrate on one thing. 2) It concentrates strongly on 
> > "what and how to implement". But reviewers don't need that information so much, 
> > because they can read C. Reviewers need the following information:
> >   - background
> >   - why did the author choose this way?
> >   - why did the author choose this default value?
> >   - how can the concept and implementation be confirmed correct?
> >   - etc etc
> > 
> > With that, reviewers can trace the author's thinking and give good advice and judgement.
> > For example, in this case you wrote
> >  - default threshold is 1000
> >  - only accumulate 1st generation execve children
> >  - time threshold is a second
> > 
> > but did not write why. Messy sentences hide such gaps in documentation. So I usually insist
> > on splitting a patch, because a split naturally reduces the description to "which place changes" and 
> > exposes what is lacking. 
> > 
> > I still haven't understood your intention. Without a test suite it is even harder to tell
> > which workload the author considers the problem workload.
> 
> hey, you're starting to sound like me.
> 

I can certainly elaborate on the forkbomb detector's patch description, 
but it would be helpful if people would bring this up as their concern 
rather than obfuscating it with a bunch of "nack"s and guessing.  I had 
_thought_ that the intent was quite clear in the comments that the patch 
added:

/*
 * Tasks that fork a very large number of children with separate address spaces
 * may be the result of a bug, user error, malicious applications, or even those
 * with a very legitimate purpose such as a webserver.  The oom killer assesses
 * a penalty equaling
 *
 *	(average rss of children) * (# of 1st generation execve children)
 *	-----------------------------------------------------------------
 *			sysctl_oom_forkbomb_thres
 *
 * for such tasks to target the parent.  oom_kill_process() will attempt to
 * first kill a child, so there's no risk of killing an important system daemon
 * via this method.  A web server, for example, may fork a very large number of
 * threads to respond to client connections; it's much better to kill a child
 * than to kill the parent, making the server unresponsive.  The goal here is
 * to give the user a chance to recover from the error rather than deplete all
 * memory such that the system is unusable, it's not meant to effect a forkbomb
 * policy.
 */

I didn't think it had to be duplicated in the changelog.  I'll do that.
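
As a worked illustration of the formula in that comment, here is a small userspace sketch. The function name forkbomb_penalty(), the use of pages as the RSS unit, and the plain array of child sizes are assumptions for the example; the conditions under which the patch actually applies the penalty (for instance, which children are counted) are not modeled.

```c
#include <assert.h>
#include <stddef.h>

/*
 * penalty = (average rss of children) * (number of children)
 *           -------------------------------------------------
 *                          forkbomb_thres
 *
 * child_rss holds one RSS value per counted child (assumed to be in pages).
 */
static unsigned long forkbomb_penalty(const unsigned long *child_rss,
				      size_t nr_children,
				      unsigned long forkbomb_thres)
{
	unsigned long total = 0;
	size_t i;

	assert(forkbomb_thres > 0);
	if (nr_children == 0)
		return 0;

	for (i = 0; i < nr_children; i++)
		total += child_rss[i];

	/* Integer arithmetic throughout, as kernel code would use. */
	return (total / nr_children) * nr_children / forkbomb_thres;
}
```

For example, 2000 children averaging 500 pages each with a threshold of 1000 gives 500 * 2000 / 1000 = 1000, so the penalty grows with both the number of children and their size.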

> I think I'm beginning to understand your concerns with these patches. 
> Finally.
> 
> Yes, it's a familiar one.  I do fairly commonly see patches where the
> description can be summarised as "change lots and lots of stuff to no
> apparent end" and one does have to push and poke to squeeze out the
> thinking and the reasons.  It's a useful exercise and will sometimes
> cause the originator to have a rethink, and sometimes reveals that it
> just wasn't a good change.
> 

Show me where I have a single undocumented change in the forkbomb detector 
patch, please.


^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [patch -mm 01/18] oom: filter tasks not sharing the same cpuset
  2010-06-13 11:24       ` KOSAKI Motohiro
@ 2010-06-17  3:33         ` David Rientjes
  2010-06-21 11:45           ` KOSAKI Motohiro
  2010-06-21 11:45           ` KOSAKI Motohiro
  0 siblings, 2 replies; 99+ messages in thread
From: David Rientjes @ 2010-06-17  3:33 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Andrew Morton, Rik van Riel, Nick Piggin, Oleg Nesterov,
	KAMEZAWA Hiroyuki, Balbir Singh, linux-mm

On Sun, 13 Jun 2010, KOSAKI Motohiro wrote:

> I have no objection because it's a policy matter. But if so, dump_tasks()
> should probably display the mems_allowed mask too.

You could, but we'd want to do that all under cpuset_buffer_lock so we 
don't have to allocate the buffer on the stack, which can already be 
particularly deep when the page allocator is called.

> > >   (3) oom_kill_task (when oom_kill_allocating_task==1 only)
> > > 
> > 
> > Why would we care about cpuset attachment in oom_kill_task()?  You mean 
> > oom_kill_process() to filter the children list?
> 
> Ah, interesting question. OK, we have to discuss the oom_kill_allocating_task
> design first.
> 
> First of all, filtering the children list in oom_kill_process() and this issue
> are independent and unrelated. My patch was not correct either.
> 
> Now, the basic oom_kill_allocating_task logic is below. It means that if oom_kill_process()
> returns 0, the oom kill finished successfully, but if oom_kill_process() returns 1,
> we fall back to the normal __out_of_memory().
> 
> 

Right.

> 
> 	===================================================
> 	static void __out_of_memory(gfp_t gfp_mask, int order, nodemask_t *nodemask)
> 	{
> 	        struct task_struct *p;
> 	        unsigned long points;
> 	
> 	        if (sysctl_oom_kill_allocating_task)
> 	                if (!oom_kill_process(current, gfp_mask, order, 0, NULL, nodemask,
> 	                                      "Out of memory (oom_kill_allocating_task)"))
> 	                        return;
> 	retry:
> 
> When does oom_kill_process() return 1?
> I think it should be when
> 	- current is OOM_DISABLE

In this case, oom_kill_task() returns 1, which causes oom_kill_process() 
to return 1 if current (and not one of its children) is actually selected 
to die.

> 	- current has no intersecting cpuset

current will always intersect its own cpuset's mems.

> 	- current is KTHREAD

find_lock_task_mm() should take care of that in oom_kill_task() just like 
it does for OOM_DISABLE, although we can still race with use_mm(), in 
which case this would be a good chance.
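
The return-value convention discussed above (0 means a task was killed, 1 means fall back to the normal scan) can be sketched in userspace as follows. struct task, try_kill() and handle_oom() are hypothetical simplifications; in particular, the real oom_kill_process() may sacrifice a child instead of the task itself, which is not modeled here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, heavily simplified task; not the kernel's task_struct. */
struct task {
	bool oom_disabled;	/* stands in for oom_adj == OOM_DISABLE */
	bool killed;
};

/* Returns 0 if the task was killed, 1 if it could not be killed. */
static int try_kill(struct task *t)
{
	if (t->oom_disabled)
		return 1;
	t->killed = true;
	return 0;
}

/*
 * Returns the task that ended up being killed, or NULL if none could be.
 * 'worst' stands in for whatever the normal badness scan would select.
 */
static struct task *handle_oom(bool kill_allocating_task,
			       struct task *current_task, struct task *worst)
{
	assert(current_task && worst);

	/* Fast path: try the allocating task first if the sysctl is set. */
	if (kill_allocating_task && !try_kill(current_task))
		return current_task;

	/* Fallback: the normal victim selection. */
	return try_kill(worst) ? NULL : worst;
}
```

If current is OOM_DISABLE, the fast path returns 1 and the fallback picks the victim the normal scan would have chosen.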


^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [patch -mm 01/18] oom: filter tasks not sharing the same cpuset
  2010-06-17  3:33         ` David Rientjes
@ 2010-06-21 11:45           ` KOSAKI Motohiro
  2010-06-21 11:45           ` KOSAKI Motohiro
  1 sibling, 0 replies; 99+ messages in thread
From: KOSAKI Motohiro @ 2010-06-21 11:45 UTC (permalink / raw)
  To: David Rientjes
  Cc: kosaki.motohiro, Andrew Morton, Rik van Riel, Nick Piggin,
	Oleg Nesterov, KAMEZAWA Hiroyuki, Balbir Singh, linux-mm

> > > >   (3) oom_kill_task (when oom_kill_allocating_task==1 only)
> > > > 
> > > 
> > > Why would we care about cpuset attachment in oom_kill_task()?  You mean 
> > > oom_kill_process() to filter the children list?
> > 
> > Ah, interesting question. OK, we have to discuss the oom_kill_allocating_task
> > design first.
> > 
> > First of all, filtering the children list in oom_kill_process() and this issue
> > are independent and unrelated. My patch was not correct either.
> > 
> > Now, the basic oom_kill_allocating_task logic is below. It means that if oom_kill_process()
> > returns 0, the oom kill finished successfully, but if oom_kill_process() returns 1,
> > we fall back to the normal __out_of_memory().
> > 
> > 
> 
> Right.
> 
> > 
> > 	===================================================
> > 	static void __out_of_memory(gfp_t gfp_mask, int order, nodemask_t *nodemask)
> > 	{
> > 	        struct task_struct *p;
> > 	        unsigned long points;
> > 	
> > 	        if (sysctl_oom_kill_allocating_task)
> > 	                if (!oom_kill_process(current, gfp_mask, order, 0, NULL, nodemask,
> > 	                                      "Out of memory (oom_kill_allocating_task)"))
> > 	                        return;
> > 	retry:
> > 
> > When does oom_kill_process() return 1?
> > I think it should be when
> > 	- current is OOM_DISABLE
> 
> In this case, oom_kill_task() returns 1, which causes oom_kill_process() 
> to return 1 if current (and not one of its children) is actually selected 
> to die.

Right.

> 
> > 	- current has no intersecting cpuset
> 
> current will always intersect its own cpuset's mems.

Oops, that was my mistake.


> 
> > 	- current is KTHREAD
> 
> find_lock_task_mm() should take care of that in oom_kill_task() just like 
> it does for OOM_DISABLE, although we can still race with use_mm(), in 
> which case this would be a good chance.

The find_lock_task_mm() implementation is below; it only checks ->mm.
Other places use both a KTHREAD check and find_lock_task_mm().

----------------------------------------------------------------------
/*
 * The process p may have detached its own ->mm while exiting or through
 * use_mm(), but one or more of its subthreads may still have a valid
 * pointer.  Return p, or any of its subthreads with a valid ->mm, with
 * task_lock() held.
 */
static struct task_struct *find_lock_task_mm(struct task_struct *p)
{
        struct task_struct *t = p;

        do {
                task_lock(t);
                if (likely(t->mm))
                        return t;
                task_unlock(t);
        } while_each_thread(p, t);

        return NULL;
}
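
To make the point concrete, here is a userspace sketch of why an ->mm test alone does not exclude a kernel thread that has temporarily borrowed an mm, and how pairing it with a flag check does. Locking is omitted, a thread group is modeled as a plain array with the leader at index 0, and find_task_mm() and oom_eligible() are hypothetical names; the PF_KTHREAD value is only a stand-in for the kernel's flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define PF_KTHREAD 0x00200000	/* stand-in for the kernel thread flag */

struct mm { int unused; };

struct task {
	unsigned int flags;
	struct mm *mm;		/* NULL once detached or for idle kthreads */
};

/* Like find_lock_task_mm() without locking: first thread with an mm. */
static struct task *find_task_mm(struct task *threads, size_t nr)
{
	size_t i;

	for (i = 0; i < nr; i++)
		if (threads[i].mm)
			return &threads[i];
	return NULL;
}

/* Both checks together: skip kthreads, then require some thread with an mm. */
static bool oom_eligible(struct task *threads, size_t nr)
{
	assert(nr > 0);
	if (threads[0].flags & PF_KTHREAD)
		return false;
	return find_task_mm(threads, nr) != NULL;
}
```

A kthread in the middle of use_mm() passes find_task_mm() but is rejected by oom_eligible(), while a user process whose leader has already detached its mm is still found through a subthread.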


^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [patch -mm 01/18] oom: filter tasks not sharing the same cpuset
  2010-06-17  3:33         ` David Rientjes
  2010-06-21 11:45           ` KOSAKI Motohiro
@ 2010-06-21 11:45           ` KOSAKI Motohiro
  1 sibling, 0 replies; 99+ messages in thread
From: KOSAKI Motohiro @ 2010-06-21 11:45 UTC (permalink / raw)
  To: David Rientjes
  Cc: kosaki.motohiro, Andrew Morton, Rik van Riel, Nick Piggin,
	Oleg Nesterov, KAMEZAWA Hiroyuki, Balbir Singh, linux-mm

> On Sun, 13 Jun 2010, KOSAKI Motohiro wrote:
> 
> > I have no objection because it's a policy matter. But if so, dump_tasks()
> > should probably display the mems_allowed mask too.
> 
> You could, but we'd want to do that all under cpuset_buffer_lock so we 
> don't have to allocate the buffer on the stack, which can already be 
> particularly deep when the page allocator is called.

We probably don't need to worry about that, because the stack overflow risk
depends on the deepest call path.
That is, if out_of_memory() is called, the page allocator has already called
try_to_free_pages() first, and try_to_free_pages() has a much deeper stack
than out_of_memory().




^ permalink raw reply	[flat|nested] 99+ messages in thread

end of thread, other threads:[~2010-06-21 11:45 UTC | newest]

Thread overview: 99+ messages
2010-06-01  7:18 [patch -mm 00/18] oom killer rewrite David Rientjes
2010-06-01  7:18 ` [patch -mm 01/18] oom: filter tasks not sharing the same cpuset David Rientjes
2010-06-01  7:20   ` KOSAKI Motohiro
2010-06-08 11:41   ` KOSAKI Motohiro
2010-06-08 18:37     ` David Rientjes
2010-06-13 11:24       ` KOSAKI Motohiro
2010-06-17  3:33         ` David Rientjes
2010-06-21 11:45           ` KOSAKI Motohiro
2010-06-21 11:45           ` KOSAKI Motohiro
2010-06-08 11:41   ` KOSAKI Motohiro
2010-06-08 18:43     ` David Rientjes
2010-06-08 23:25       ` Andrew Morton
2010-06-08 23:54         ` David Rientjes
2010-06-09  0:06           ` Andrew Morton
2010-06-09  1:07             ` David Rientjes
2010-06-13 11:24             ` KOSAKI Motohiro
2010-06-01  7:18 ` [patch -mm 02/18] oom: sacrifice child with highest badness score for parent David Rientjes
2010-06-01  7:39   ` KOSAKI Motohiro
2010-06-08 11:41   ` KOSAKI Motohiro
2010-06-08 18:41     ` David Rientjes
2010-06-13 11:24       ` KOSAKI Motohiro
2010-06-14  8:54         ` David Rientjes
2010-06-14 11:08           ` KOSAKI Motohiro
2010-06-08 11:41   ` KOSAKI Motohiro
2010-06-08 18:45     ` David Rientjes
2010-06-01  7:18 ` [patch -mm 03/18] oom: select task from tasklist for mempolicy ooms David Rientjes
2010-06-01  7:39   ` KOSAKI Motohiro
2010-06-08 11:41   ` KOSAKI Motohiro
2010-06-08 23:28     ` Andrew Morton
2010-06-08 11:41   ` KOSAKI Motohiro
2010-06-01  7:18 ` [patch -mm 04/18] oom: extract panic helper function David Rientjes
2010-06-01  7:33   ` KOSAKI Motohiro
2010-06-01  7:18 ` [patch -mm 05/18] oom: remove special handling for pagefault ooms David Rientjes
2010-06-01  7:34   ` KOSAKI Motohiro
2010-06-01  7:18 ` [patch -mm 06/18] oom: move sysctl declarations to oom.h David Rientjes
2010-06-01  7:34   ` KOSAKI Motohiro
2010-06-01  7:18 ` [patch -mm 07/18] oom: enable oom tasklist dump by default David Rientjes
2010-06-01  7:36   ` KOSAKI Motohiro
2010-06-01  7:18 ` [patch -mm 08/18] oom: badness heuristic rewrite David Rientjes
2010-06-01  7:36   ` KOSAKI Motohiro
2010-06-01 18:44     ` David Rientjes
2010-06-02 13:54       ` KOSAKI Motohiro
2010-06-02 21:20         ` David Rientjes
2010-06-03 23:10         ` Andrew Morton
2010-06-03 23:53           ` KAMEZAWA Hiroyuki
2010-06-04  0:04             ` Andrew Morton
2010-06-04  0:20               ` KAMEZAWA Hiroyuki
2010-06-04  5:57                 ` KAMEZAWA Hiroyuki
2010-06-04  9:22                   ` David Rientjes
2010-06-04  9:19             ` David Rientjes
2010-06-04  9:43             ` Oleg Nesterov
2010-06-04 10:54           ` KOSAKI Motohiro
2010-06-04 20:57             ` David Rientjes
2010-06-08 11:41               ` KOSAKI Motohiro
2010-06-08 23:47                 ` Andrew Morton
2010-06-17  3:28                   ` David Rientjes
2010-06-01  7:46   ` Nick Piggin
2010-06-01 18:56     ` David Rientjes
2010-06-02 13:54       ` KOSAKI Motohiro
2010-06-02 21:23         ` David Rientjes
2010-06-03  0:05           ` KAMEZAWA Hiroyuki
2010-06-03  6:44             ` David Rientjes
2010-06-03  3:07           ` KOSAKI Motohiro
2010-06-03  6:48             ` David Rientjes
2010-06-03 23:15             ` Andrew Morton
2010-06-04 10:54               ` KOSAKI Motohiro
2010-06-01  7:18 ` [patch -mm 09/18] oom: add forkbomb penalty to badness heuristic David Rientjes
2010-06-01  7:37   ` KOSAKI Motohiro
2010-06-01 18:57     ` David Rientjes
2010-06-03 20:33       ` David Rientjes
2010-06-08 11:41   ` KOSAKI Motohiro
2010-06-08 11:41   ` KOSAKI Motohiro
2010-06-01  7:18 ` [patch -mm 10/18] oom: deprecate oom_adj tunable David Rientjes
2010-06-01  7:37   ` KOSAKI Motohiro
2010-06-01  7:18 ` [patch -mm 11/18] oom: avoid oom killer for lowmem allocations David Rientjes
2010-06-01  7:38   ` KOSAKI Motohiro
2010-06-08 11:41   ` KOSAKI Motohiro
2010-06-08 18:38     ` David Rientjes
2010-06-01  7:18 ` [patch -mm 12/18] oom: remove unnecessary code and cleanup David Rientjes
2010-06-01  7:40   ` KOSAKI Motohiro
2010-06-01 18:58     ` David Rientjes
2010-06-01  7:19 ` [patch -mm 13/18] oom: avoid race for oom killed tasks detaching mm prior to exit David Rientjes
2010-06-01  7:40   ` KOSAKI Motohiro
2010-06-01 18:59     ` David Rientjes
2010-06-01 20:43       ` Oleg Nesterov
2010-06-01 21:19         ` David Rientjes
2010-06-02  0:28         ` KAMEZAWA Hiroyuki
2010-06-02  9:49           ` David Rientjes
2010-06-02 10:46             ` Nick Piggin
2010-06-02 21:35               ` David Rientjes
2010-06-02 13:54         ` KOSAKI Motohiro
2010-06-01  7:19 ` [patch -mm 14/18] oom: check PF_KTHREAD instead of !mm to skip kthreads David Rientjes
2010-06-01  7:41   ` KOSAKI Motohiro
2010-06-01  7:19 ` [patch -mm 15/18] oom: introduce find_lock_task_mm() to fix !mm false positives David Rientjes
2010-06-01  7:41   ` KOSAKI Motohiro
2010-06-01  7:19 ` [patch -mm 16/18] oom: give current access to memory reserves if it has been killed David Rientjes
2010-06-01  7:44   ` KOSAKI Motohiro
2010-06-01  7:19 ` [patch -mm 17/18] oom: avoid sending exiting tasks a SIGKILL David Rientjes
2010-06-01  7:19 ` [patch -mm 18/18] oom: clean up oom_kill_task() David Rientjes
