* [patch -mm v2 00/10] oom killer rewrite
@ 2010-02-26 23:52 David Rientjes
  2010-02-26 23:53 ` [patch -mm v2 01/10] oom: filter tasks not sharing the same cpuset David Rientjes
                   ` (9 more replies)
  0 siblings, 10 replies; 36+ messages in thread
From: David Rientjes @ 2010-02-26 23:52 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Rik van Riel, Nick Piggin, Balbir Singh, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro, linux-mm

This patchset is a rewrite of the out of memory killer to address several
issues that have been raised recently.  The most notable change is a
complete rewrite of the badness heuristic that determines which task is
killed; the goal was to make it as simple and predictable as possible
while still addressing issues that plague the VM.

Changes from version 1:

 - updated to 2.6.33, no longer based on mmotm

 - removed "oom: remove compulsory panic_on_oom mode"

 - mempolicy detachment is now protected by task_lock(); otherwise, it is
   possible for a mempolicy to be freed out from under code that is using
   it.  This isn't necessary for current->mempolicy, but all other tasks
   require it.

 - tasks that have mempolicies of MPOL_PREFERRED (or MPOL_F_LOCAL) are now
   always considered for oom kill when the oom is mempolicy constrained,
   since they may fall back to allocating elsewhere when their preferred
   (or local) node is oom.

 - lowmem allocations that are __GFP_NOFAIL are now retried in the page
   allocator instead of returning NULL.

 - added: [patch 4/10] oom: remove special handling for pagefault ooms

 - added: [patch 10/10] oom: default to killing current for pagefault ooms

This patchset has two dependencies from the -mm tree:

	[patch 5/10] oom: badness heuristic rewrite:
		mm-count-swap-usage.patch

	[patch 7/10] oom: replace sysctls with quick mode:
		sysctl-clean-up-vm-related-variable-declarations.patch

To apply to mainline, download 2.6.33 and apply

	mm-clean-up-mm_counter.patch
	mm-avoid-false-sharing-of-mm_counter.patch
	mm-avoid-false-sharing-of-mm_counter-checkpatch-fixes.patch
	mm-count-swap-usage.patch
	mm-count-swap-usage-checkpatch-fixes.patch
	mm-introduce-dump_page-and-print-symbolic-flag-names.patch
	sysctl-clean-up-vm-related-variable-declarations.patch
	sysctl-clean-up-vm-related-variable-declarations-fix.patch

from http://userweb.kernel.org/~akpm/mmotm/broken-out.tar.gz first.

This patchset is also available for each kernel release from:

	http://www.kernel.org/pub/linux/kernel/people/rientjes/oom-killer-rewrite/

including broken out patches and the prerequisite patches listed above.
---
 Documentation/feature-removal-schedule.txt |   30 +
 Documentation/filesystems/proc.txt         |  100 +++--
 Documentation/sysctl/vm.txt                |   51 +-
 fs/proc/base.c                             |  106 +++++
 include/linux/memcontrol.h                 |   14 
 include/linux/mempolicy.h                  |   13 
 include/linux/oom.h                        |   24 +
 include/linux/sched.h                      |    3 
 kernel/exit.c                              |    8 
 kernel/fork.c                              |    1 
 kernel/sysctl.c                            |   15 
 mm/memcontrol.c                            |   43 --
 mm/mempolicy.c                             |   44 ++
 mm/oom_kill.c                              |  572 +++++++++++++++--------------
 mm/page_alloc.c                            |   29 +
 15 files changed, 653 insertions(+), 400 deletions(-)


* [patch -mm v2 01/10] oom: filter tasks not sharing the same cpuset
  2010-02-26 23:52 [patch -mm v2 00/10] oom killer rewrite David Rientjes
@ 2010-02-26 23:53 ` David Rientjes
  2010-03-02  4:54   ` Balbir Singh
  2010-02-26 23:53 ` [patch -mm v2 02/10] oom: sacrifice child with highest badness score for parent David Rientjes
                   ` (8 subsequent siblings)
  9 siblings, 1 reply; 36+ messages in thread
From: David Rientjes @ 2010-02-26 23:53 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Rik van Riel, Nick Piggin, Balbir Singh, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro, linux-mm

Tasks that do not share the same set of allowed nodes with the task that
triggered the oom should not be considered as candidates for oom kill.

Tasks in other cpusets with a disjoint set of mems would otherwise be
unfairly penalized because of oom conditions elsewhere; in an extreme
example, all other applications on the system could be killed if a single
task in a user's cpuset sets itself to OOM_DISABLE and then uses more
memory than allowed.

Killing tasks outside of current's cpuset would rarely free memory for
current anyway.  To keep the heuristic sane, we must ensure that killing a
task is likely to free memory for current and avoid needlessly killing
others whose potential for freeing memory is unknown.  It is better to
kill current than to kill another task needlessly.

Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Nick Piggin <npiggin@suse.de>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: David Rientjes <rientjes@google.com>
---
 mm/oom_kill.c |   12 +++---------
 1 files changed, 3 insertions(+), 9 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -35,7 +35,7 @@ static DEFINE_SPINLOCK(zone_scan_lock);
 /* #define DEBUG */
 
 /*
- * Is all threads of the target process nodes overlap ours?
+ * Do all threads of the target process overlap our allowed nodes?
  */
 static int has_intersects_mems_allowed(struct task_struct *tsk)
 {
@@ -167,14 +167,6 @@ unsigned long badness(struct task_struct *p, unsigned long uptime)
 		points /= 4;
 
 	/*
-	 * If p's nodes don't overlap ours, it may still help to kill p
-	 * because p may have allocated or otherwise mapped memory on
-	 * this node before. However it will be less likely.
-	 */
-	if (!has_intersects_mems_allowed(p))
-		points /= 8;
-
-	/*
 	 * Adjust the score by oom_adj.
 	 */
 	if (oom_adj) {
@@ -266,6 +258,8 @@ static struct task_struct *select_bad_process(unsigned long *ppoints,
 			continue;
 		if (mem && !task_in_mem_cgroup(p, mem))
 			continue;
+		if (!has_intersects_mems_allowed(p))
+			continue;
 
 		/*
 		 * This task already has access to memory reserves and is


* [patch -mm v2 02/10] oom: sacrifice child with highest badness score for parent
  2010-02-26 23:52 [patch -mm v2 00/10] oom killer rewrite David Rientjes
  2010-02-26 23:53 ` [patch -mm v2 01/10] oom: filter tasks not sharing the same cpuset David Rientjes
@ 2010-02-26 23:53 ` David Rientjes
  2010-03-02  4:54   ` Balbir Singh
  2010-02-26 23:53 ` [patch -mm v2 03/10] oom: select task from tasklist for mempolicy ooms David Rientjes
                   ` (7 subsequent siblings)
  9 siblings, 1 reply; 36+ messages in thread
From: David Rientjes @ 2010-02-26 23:53 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Rik van Riel, Nick Piggin, Balbir Singh, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro, linux-mm

When a task is chosen for oom kill, the oom killer first attempts to
sacrifice a child not sharing its parent's memory instead.  Unfortunately,
this often kills in a seemingly random fashion based on the ordering of
the selected task's child list.  Additionally, it is not guaranteed to
free the large amount of memory needed to prevent further oom killing in
the very near future.

Instead, we now only attempt to sacrifice the worst child not sharing its
parent's memory, if one exists.  The worst child is the one with the
highest badness() score.  This has two advantages: we kill a
memory-hogging task more often, and we allow the configurable
/proc/pid/oom_adj value to be considered as a factor in which child is
killed.

Reviewers may observe that the previous implementation iterated through
the children, attempting to kill each until one succeeded, and then killed
the parent if none were found; the new code simply kills the most
memory-hogging task or the parent.  Note, however, that the only time
oom_kill_task() fails is when a child has no mm or has a /proc/pid/oom_adj
of OOM_DISABLE.  badness() returns 0 in both cases, so the final
oom_kill_task() will always succeed.

Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Nick Piggin <npiggin@suse.de>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: David Rientjes <rientjes@google.com>
---
 mm/oom_kill.c |   23 +++++++++++++++++------
 1 files changed, 17 insertions(+), 6 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -432,7 +432,10 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
 			    unsigned long points, struct mem_cgroup *mem,
 			    const char *message)
 {
+	struct task_struct *victim = p;
 	struct task_struct *c;
+	unsigned long victim_points = 0;
+	struct timespec uptime;
 
 	if (printk_ratelimit())
 		dump_header(p, gfp_mask, order, mem);
@@ -446,19 +449,27 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
 		return 0;
 	}
 
-	printk(KERN_ERR "%s: kill process %d (%s) score %li or a child\n",
-					message, task_pid_nr(p), p->comm, points);
+	pr_err("%s: Kill process %d (%s) with score %lu or sacrifice child\n",
+		message, task_pid_nr(p), p->comm, points);
 
-	/* Try to kill a child first */
+	do_posix_clock_monotonic_gettime(&uptime);
+	/* Try to sacrifice the worst child first */
 	list_for_each_entry(c, &p->children, sibling) {
+		unsigned long cpoints;
+
 		if (c->mm == p->mm)
 			continue;
 		if (mem && !task_in_mem_cgroup(c, mem))
 			continue;
-		if (!oom_kill_task(c))
-			return 0;
+
+		/* badness() returns 0 if the thread is unkillable */
+		cpoints = badness(c, uptime.tv_sec);
+		if (cpoints > victim_points) {
+			victim = c;
+			victim_points = cpoints;
+		}
 	}
-	return oom_kill_task(p);
+	return oom_kill_task(victim);
 }
 
 #ifdef CONFIG_CGROUP_MEM_RES_CTLR


* [patch -mm v2 03/10] oom: select task from tasklist for mempolicy ooms
  2010-02-26 23:52 [patch -mm v2 00/10] oom killer rewrite David Rientjes
  2010-02-26 23:53 ` [patch -mm v2 01/10] oom: filter tasks not sharing the same cpuset David Rientjes
  2010-02-26 23:53 ` [patch -mm v2 02/10] oom: sacrifice child with highest badness score for parent David Rientjes
@ 2010-02-26 23:53 ` David Rientjes
  2010-02-26 23:53 ` [patch -mm v2 04/10] oom: remove special handling for pagefault ooms David Rientjes
                   ` (6 subsequent siblings)
  9 siblings, 0 replies; 36+ messages in thread
From: David Rientjes @ 2010-02-26 23:53 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Rik van Riel, Nick Piggin, Balbir Singh, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro, linux-mm

The oom killer presently kills current whenever there is no more memory
free or reclaimable on its mempolicy's nodes.  There is no guarantee that
current is a memory-hogging task or that killing it will free any
substantial amount of memory, however.

In such situations, it is better to scan the tasklist for tasks that are
allowed to allocate on current's set of nodes and kill the one with the
highest badness() score.  This ensures that the most memory-hogging task,
or the one configured by the user with /proc/pid/oom_adj, is always
selected in such scenarios.

It is necessary to synchronize the detachment of task->mempolicy with
task_lock(task) to ensure it is not prematurely destroyed while a user is
operating on it.
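
To illustrate the pattern (a condensed sketch only; the complete function,
mempolicy_nodemask_intersects(), is added in the mm/mempolicy.c hunk
below), any reader of another task's mempolicy must hold task_lock() so
that exit_mm() cannot drop the final reference underneath it:

	struct mempolicy *mempolicy;
	bool ret = true;	/* default policy and MPOL_PREFERRED intersect */

	task_lock(tsk);
	mempolicy = tsk->mempolicy;
	if (mempolicy && (mempolicy->mode == MPOL_BIND ||
			  mempolicy->mode == MPOL_INTERLEAVE))
		ret = nodes_intersects(mempolicy->v.nodes, *mask);
	task_unlock(tsk);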

Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: David Rientjes <rientjes@google.com>
---
 include/linux/mempolicy.h |   13 +++++++-
 kernel/exit.c             |    8 ++--
 mm/mempolicy.c            |   44 +++++++++++++++++++++++++
 mm/oom_kill.c             |   77 +++++++++++++++++++++++++++-----------------
 4 files changed, 107 insertions(+), 35 deletions(-)

diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
--- a/include/linux/mempolicy.h
+++ b/include/linux/mempolicy.h
@@ -202,6 +202,8 @@ extern struct zonelist *huge_zonelist(struct vm_area_struct *vma,
 				unsigned long addr, gfp_t gfp_flags,
 				struct mempolicy **mpol, nodemask_t **nodemask);
 extern bool init_nodemask_of_mempolicy(nodemask_t *mask);
+extern bool mempolicy_nodemask_intersects(struct task_struct *tsk,
+				const nodemask_t *mask);
 extern unsigned slab_node(struct mempolicy *policy);
 
 extern enum zone_type policy_zone;
@@ -329,7 +331,16 @@ static inline struct zonelist *huge_zonelist(struct vm_area_struct *vma,
 	return node_zonelist(0, gfp_flags);
 }
 
-static inline bool init_nodemask_of_mempolicy(nodemask_t *m) { return false; }
+static inline bool init_nodemask_of_mempolicy(nodemask_t *m)
+{
+	return false;
+}
+
+static inline bool mempolicy_nodemask_intersects(struct task_struct *tsk,
+			const nodemask_t *mask)
+{
+	return false;
+}
 
 static inline int do_migrate_pages(struct mm_struct *mm,
 			const nodemask_t *from_nodes,
diff --git a/kernel/exit.c b/kernel/exit.c
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -689,6 +689,10 @@ static void exit_mm(struct task_struct * tsk)
 	enter_lazy_tlb(mm, current);
 	/* We don't want this task to be frozen prematurely */
 	clear_freeze_flag(tsk);
+#ifdef CONFIG_NUMA
+	mpol_put(tsk->mempolicy);
+	tsk->mempolicy = NULL;
+#endif
 	task_unlock(tsk);
 	mm_update_next_owner(mm);
 	mmput(mm);
@@ -993,10 +997,6 @@ NORET_TYPE void do_exit(long code)
 	perf_event_exit_task(tsk);
 
 	exit_notify(tsk, group_dead);
-#ifdef CONFIG_NUMA
-	mpol_put(tsk->mempolicy);
-	tsk->mempolicy = NULL;
-#endif
 #ifdef CONFIG_FUTEX
 	if (unlikely(current->pi_state_cache))
 		kfree(current->pi_state_cache);
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1612,6 +1612,50 @@ bool init_nodemask_of_mempolicy(nodemask_t *mask)
 }
 #endif
 
+/*
+ * mempolicy_nodemask_intersects
+ *
+ * If tsk's mempolicy is "default" [NULL], return 'true' to indicate default
+ * policy.  Otherwise, check for intersection between mask and the policy
+ * nodemask for 'bind' or 'interleave' policy.  For 'preferred' or 'local'
+ * policy, always return true since it may allocate elsewhere on fallback.
+ *
+ * Takes task_lock(tsk) to prevent freeing of its mempolicy.
+ */
+bool mempolicy_nodemask_intersects(struct task_struct *tsk,
+					const nodemask_t *mask)
+{
+	struct mempolicy *mempolicy;
+	bool ret = true;
+
+	if (!mask)
+		return ret;
+	task_lock(tsk);
+	mempolicy = tsk->mempolicy;
+	if (!mempolicy)
+		goto out;
+
+	switch (mempolicy->mode) {
+	case MPOL_PREFERRED:
+		/*
+		 * MPOL_PREFERRED and MPOL_F_LOCAL are only preferred nodes to
+		 * allocate from, they may fallback to other nodes when oom.
+		 * Thus, it's possible for tsk to have allocated memory from
+		 * nodes in mask.
+		 */
+		break;
+	case MPOL_BIND:
+	case MPOL_INTERLEAVE:
+		ret = nodes_intersects(mempolicy->v.nodes, *mask);
+		break;
+	default:
+		BUG();
+	}
+out:
+	task_unlock(tsk);
+	return ret;
+}
+
 /* Allocate a page in interleaved policy.
    Own path because it needs to do special accounting. */
 static struct page *alloc_page_interleave(gfp_t gfp, unsigned order,
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -26,6 +26,7 @@
 #include <linux/module.h>
 #include <linux/notifier.h>
 #include <linux/memcontrol.h>
+#include <linux/mempolicy.h>
 #include <linux/security.h>
 
 int sysctl_panic_on_oom;
@@ -36,19 +37,35 @@ static DEFINE_SPINLOCK(zone_scan_lock);
 
 /*
  * Do all threads of the target process overlap our allowed nodes?
+ * @tsk: task struct of the task to consider
+ * @mask: nodemask passed to page allocator for mempolicy ooms
  */
-static int has_intersects_mems_allowed(struct task_struct *tsk)
+static bool has_intersects_mems_allowed(struct task_struct *tsk,
+						const nodemask_t *mask)
 {
-	struct task_struct *t;
+	struct task_struct *start = tsk;
 
-	t = tsk;
 	do {
-		if (cpuset_mems_allowed_intersects(current, t))
-			return 1;
-		t = next_thread(t);
-	} while (t != tsk);
-
-	return 0;
+		if (mask) {
+			/*
+			 * If this is a mempolicy constrained oom, tsk's
+			 * cpuset is irrelevant.  Only return true if its
+			 * mempolicy intersects current, otherwise it may be
+			 * needlessly killed.
+			 */
+			if (mempolicy_nodemask_intersects(tsk, mask))
+				return true;
+		} else {
+			/*
+			 * This is not a mempolicy constrained oom, so only
+			 * check the mems of tsk's cpuset.
+			 */
+			if (cpuset_mems_allowed_intersects(current, tsk))
+				return true;
+		}
+		tsk = next_thread(tsk);
+	} while (tsk != start);
+	return false;
 }
 
 /**
@@ -236,7 +253,8 @@ static enum oom_constraint constrained_alloc(struct zonelist *zonelist,
  * (not docbooked, we don't want this one cluttering up the manual)
  */
 static struct task_struct *select_bad_process(unsigned long *ppoints,
-						struct mem_cgroup *mem)
+		struct mem_cgroup *mem, enum oom_constraint constraint,
+		const nodemask_t *mask)
 {
 	struct task_struct *p;
 	struct task_struct *chosen = NULL;
@@ -258,7 +276,9 @@ static struct task_struct *select_bad_process(unsigned long *ppoints,
 			continue;
 		if (mem && !task_in_mem_cgroup(p, mem))
 			continue;
-		if (!has_intersects_mems_allowed(p))
+		if (!has_intersects_mems_allowed(p,
+				constraint == CONSTRAINT_MEMORY_POLICY ? mask :
+									 NULL))
 			continue;
 
 		/*
@@ -480,7 +500,7 @@ void mem_cgroup_out_of_memory(struct mem_cgroup *mem, gfp_t gfp_mask)
 
 	read_lock(&tasklist_lock);
 retry:
-	p = select_bad_process(&points, mem);
+	p = select_bad_process(&points, mem, CONSTRAINT_NONE, NULL);
 	if (PTR_ERR(p) == -1UL)
 		goto out;
 
@@ -562,7 +582,8 @@ void clear_zonelist_oom(struct zonelist *zonelist, gfp_t gfp_mask)
 /*
  * Must be called with tasklist_lock held for read.
  */
-static void __out_of_memory(gfp_t gfp_mask, int order)
+static void __out_of_memory(gfp_t gfp_mask, int order,
+			enum oom_constraint constraint, const nodemask_t *mask)
 {
 	struct task_struct *p;
 	unsigned long points;
@@ -576,7 +597,7 @@ retry:
 	 * Rambo mode: Shoot down a process and hope it solves whatever
 	 * issues we may have.
 	 */
-	p = select_bad_process(&points, NULL);
+	p = select_bad_process(&points, NULL, constraint, mask);
 
 	if (PTR_ERR(p) == -1UL)
 		return;
@@ -617,7 +638,8 @@ void pagefault_out_of_memory(void)
 		panic("out of memory from page fault. panic_on_oom is selected.\n");
 
 	read_lock(&tasklist_lock);
-	__out_of_memory(0, 0); /* unknown gfp_mask and order */
+	/* unknown gfp_mask and order */
+	__out_of_memory(0, 0, CONSTRAINT_NONE, NULL);
 	read_unlock(&tasklist_lock);
 
 	/*
@@ -634,6 +656,7 @@ rest_and_return:
  * @zonelist: zonelist pointer
  * @gfp_mask: memory allocation flags
  * @order: amount of memory being requested as a power of 2
+ * @nodemask: nodemask passed to page allocator
  *
  * If we run out of memory, we have the choice between either
  * killing a random task (bad), letting the system crash (worse)
@@ -662,24 +685,18 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
 	 */
 	constraint = constrained_alloc(zonelist, gfp_mask, nodemask);
 	read_lock(&tasklist_lock);
-
-	switch (constraint) {
-	case CONSTRAINT_MEMORY_POLICY:
-		oom_kill_process(current, gfp_mask, order, 0, NULL,
-				"No available memory (MPOL_BIND)");
-		break;
-
-	case CONSTRAINT_NONE:
-		if (sysctl_panic_on_oom) {
+	if (unlikely(sysctl_panic_on_oom)) {
+		/*
+		 * panic_on_oom only affects CONSTRAINT_NONE, the kernel
+		 * should not panic for cpuset or mempolicy induced memory
+		 * failures.
+		 */
+		if (constraint == CONSTRAINT_NONE) {
 			dump_header(NULL, gfp_mask, order, NULL);
-			panic("out of memory. panic_on_oom is selected\n");
+			panic("Out of memory: panic_on_oom is enabled\n");
 		}
-		/* Fall-through */
-	case CONSTRAINT_CPUSET:
-		__out_of_memory(gfp_mask, order);
-		break;
 	}
-
+	__out_of_memory(gfp_mask, order, constraint, nodemask);
 	read_unlock(&tasklist_lock);
 
 	/*


* [patch -mm v2 04/10] oom: remove special handling for pagefault ooms
  2010-02-26 23:52 [patch -mm v2 00/10] oom killer rewrite David Rientjes
                   ` (2 preceding siblings ...)
  2010-02-26 23:53 ` [patch -mm v2 03/10] oom: select task from tasklist for mempolicy ooms David Rientjes
@ 2010-02-26 23:53 ` David Rientjes
  2010-03-01  1:12   ` KAMEZAWA Hiroyuki
                     ` (2 more replies)
  2010-02-26 23:53 ` [patch -mm v2 05/10] oom: badness heuristic rewrite David Rientjes
                   ` (5 subsequent siblings)
  9 siblings, 3 replies; 36+ messages in thread
From: David Rientjes @ 2010-02-26 23:53 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Rik van Riel, Nick Piggin, Balbir Singh, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro, linux-mm

It is possible to remove the special pagefault oom handler by simply
oom locking all system zones and then calling directly into
out_of_memory().

All populated zones must have ZONE_OOM_LOCKED set; otherwise, a parallel
oom kill is already in progress and will eventually free memory, so there
is no need to kill another task.  The context in which the pagefault is
allocating memory is unknown to the oom killer, so this is done on a
system-wide level.

If a task has already been oom killed and hasn't fully exited yet, this
will be a no-op since select_bad_process() recognizes tasks across the
system with TIF_MEMDIE set.

The special handling to determine whether a parallel memcg is currently
oom is removed since we can detect future memory freeing with TIF_MEMDIE.
The memcg has already reached its memory limit, so it will still need to
kill a task regardless of the pagefault oom.

Signed-off-by: David Rientjes <rientjes@google.com>
---
 include/linux/memcontrol.h |    6 ---
 mm/memcontrol.c            |   35 +---------------
 mm/oom_kill.c              |   97 ++++++++++++++++++++++++++------------------
 3 files changed, 58 insertions(+), 80 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -124,7 +124,6 @@ static inline bool mem_cgroup_disabled(void)
 	return false;
 }
 
-extern bool mem_cgroup_oom_called(struct task_struct *task);
 void mem_cgroup_update_file_mapped(struct page *page, int val);
 unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
 						gfp_t gfp_mask, int nid,
@@ -258,11 +257,6 @@ static inline bool mem_cgroup_disabled(void)
 	return true;
 }
 
-static inline bool mem_cgroup_oom_called(struct task_struct *task)
-{
-	return false;
-}
-
 static inline int
 mem_cgroup_inactive_anon_is_low(struct mem_cgroup *memcg)
 {
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -217,7 +217,6 @@ struct mem_cgroup {
 	 * Should the accounting and control be hierarchical, per subtree?
 	 */
 	bool use_hierarchy;
-	unsigned long	last_oom_jiffies;
 	atomic_t	refcnt;
 
 	unsigned int	swappiness;
@@ -1205,34 +1204,6 @@ static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_mem,
 	return total;
 }
 
-bool mem_cgroup_oom_called(struct task_struct *task)
-{
-	bool ret = false;
-	struct mem_cgroup *mem;
-	struct mm_struct *mm;
-
-	rcu_read_lock();
-	mm = task->mm;
-	if (!mm)
-		mm = &init_mm;
-	mem = mem_cgroup_from_task(rcu_dereference(mm->owner));
-	if (mem && time_before(jiffies, mem->last_oom_jiffies + HZ/10))
-		ret = true;
-	rcu_read_unlock();
-	return ret;
-}
-
-static int record_last_oom_cb(struct mem_cgroup *mem, void *data)
-{
-	mem->last_oom_jiffies = jiffies;
-	return 0;
-}
-
-static void record_last_oom(struct mem_cgroup *mem)
-{
-	mem_cgroup_walk_tree(mem, NULL, record_last_oom_cb);
-}
-
 /*
  * Currently used to update mapped file statistics, but the routine can be
  * generalized to update other statistics as well.
@@ -1484,10 +1455,8 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm,
 			continue;
 
 		if (!nr_retries--) {
-			if (oom) {
+			if (oom)
 				mem_cgroup_out_of_memory(mem_over_limit, gfp_mask);
-				record_last_oom(mem_over_limit);
-			}
 			goto nomem;
 		}
 	}
@@ -2284,8 +2253,6 @@ void mem_cgroup_end_migration(struct mem_cgroup *mem,
 
 /*
  * A call to try to shrink memory usage on charge failure at shmem's swapin.
- * Calling hierarchical_reclaim is not enough because we should update
- * last_oom_jiffies to prevent pagefault_out_of_memory from invoking global OOM.
  * Moreover considering hierarchy, we should reclaim from the mem_over_limit,
  * not from the memcg which this page would be charged to.
  * try_charge_swapin does all of these works properly.
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -580,6 +580,44 @@ void clear_zonelist_oom(struct zonelist *zonelist, gfp_t gfp_mask)
 }
 
 /*
+ * Try to acquire the oom killer lock for all system zones.  Returns zero if a
+ * parallel oom killing is taking place, otherwise locks all zones and returns
+ * non-zero.
+ */
+static int try_set_system_oom(void)
+{
+	struct zone *zone;
+	int ret = 1;
+
+	spin_lock(&zone_scan_lock);
+	for_each_populated_zone(zone)
+		if (zone_is_oom_locked(zone)) {
+			ret = 0;
+			goto out;
+		}
+	for_each_populated_zone(zone)
+		zone_set_flag(zone, ZONE_OOM_LOCKED);
+out:
+	spin_unlock(&zone_scan_lock);
+	return ret;
+}
+
+/*
+ * Clears ZONE_OOM_LOCKED for all system zones so that failed allocation
+ * attempts or page faults may now recall the oom killer, if necessary.
+ */
+static void clear_system_oom(void)
+{
+	struct zone *zone;
+
+	spin_lock(&zone_scan_lock);
+	for_each_populated_zone(zone)
+		zone_clear_flag(zone, ZONE_OOM_LOCKED);
+	spin_unlock(&zone_scan_lock);
+}
+
+
+/*
  * Must be called with tasklist_lock held for read.
  */
 static void __out_of_memory(gfp_t gfp_mask, int order,
@@ -614,46 +652,9 @@ retry:
 		goto retry;
 }
 
-/*
- * pagefault handler calls into here because it is out of memory but
- * doesn't know exactly how or why.
- */
-void pagefault_out_of_memory(void)
-{
-	unsigned long freed = 0;
-
-	blocking_notifier_call_chain(&oom_notify_list, 0, &freed);
-	if (freed > 0)
-		/* Got some memory back in the last second. */
-		return;
-
-	/*
-	 * If this is from memcg, oom-killer is already invoked.
-	 * and not worth to go system-wide-oom.
-	 */
-	if (mem_cgroup_oom_called(current))
-		goto rest_and_return;
-
-	if (sysctl_panic_on_oom)
-		panic("out of memory from page fault. panic_on_oom is selected.\n");
-
-	read_lock(&tasklist_lock);
-	/* unknown gfp_mask and order */
-	__out_of_memory(0, 0, CONSTRAINT_NONE, NULL);
-	read_unlock(&tasklist_lock);
-
-	/*
-	 * Give "p" a good chance of killing itself before we
-	 * retry to allocate memory.
-	 */
-rest_and_return:
-	if (!test_thread_flag(TIF_MEMDIE))
-		schedule_timeout_uninterruptible(1);
-}
-
 /**
  * out_of_memory - kill the "best" process when we run out of memory
- * @zonelist: zonelist pointer
+ * @zonelist: zonelist pointer passed to page allocator
  * @gfp_mask: memory allocation flags
  * @order: amount of memory being requested as a power of 2
  * @nodemask: nodemask passed to page allocator
@@ -667,7 +668,7 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
 		int order, nodemask_t *nodemask)
 {
 	unsigned long freed = 0;
-	enum oom_constraint constraint;
+	enum oom_constraint constraint = CONSTRAINT_NONE;
 
 	blocking_notifier_call_chain(&oom_notify_list, 0, &freed);
 	if (freed > 0)
@@ -683,7 +684,8 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
 	 * Check if there were limitations on the allocation (only relevant for
 	 * NUMA) that may require different handling.
 	 */
-	constraint = constrained_alloc(zonelist, gfp_mask, nodemask);
+	if (zonelist)
+		constraint = constrained_alloc(zonelist, gfp_mask, nodemask);
 	read_lock(&tasklist_lock);
 	if (unlikely(sysctl_panic_on_oom)) {
 		/*
@@ -693,6 +695,7 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
 		 */
 		if (constraint == CONSTRAINT_NONE) {
 			dump_header(NULL, gfp_mask, order, NULL);
+			read_unlock(&tasklist_lock);
 			panic("Out of memory: panic_on_oom is enabled\n");
 		}
 	}
@@ -706,3 +709,17 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
 	if (!test_thread_flag(TIF_MEMDIE))
 		schedule_timeout_uninterruptible(1);
 }
+
+/*
+ * The pagefault handler calls here because it is out of memory, so kill a
+ * memory-hogging task.  If a populated zone has ZONE_OOM_LOCKED set, a parallel
+ * oom killing is already in progress so do nothing.  If a task is found with
+ * TIF_MEMDIE set, it has been killed so do nothing and allow it to exit.
+ */
+void pagefault_out_of_memory(void)
+{
+	if (!try_set_system_oom())
+		return;
+	out_of_memory(NULL, 0, 0, NULL);
+	clear_system_oom();
+}


* [patch -mm v2 05/10] oom: badness heuristic rewrite
  2010-02-26 23:52 [patch -mm v2 00/10] oom killer rewrite David Rientjes
                   ` (3 preceding siblings ...)
  2010-02-26 23:53 ` [patch -mm v2 04/10] oom: remove special handling for pagefault ooms David Rientjes
@ 2010-02-26 23:53 ` David Rientjes
  2010-02-26 23:53 ` [patch -mm v2 06/10] oom: deprecate oom_adj tunable David Rientjes
                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 36+ messages in thread
From: David Rientjes @ 2010-02-26 23:53 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Rik van Riel, Nick Piggin, Balbir Singh, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro, linux-mm

This is a complete rewrite of the oom killer's badness() heuristic, which
is used to determine which task is killed in oom conditions.  The goal is
to make it as simple and predictable as possible so the results are better
understood and we end up killing the task which will lead to the most
memory freeing while still respecting the fine-tuning from userspace.

The baseline for the heuristic is the proportion of each task's current
memory and swap usage relative to the amount of "allowable" memory.
"Allowable," in this sense, means the system-wide resources for
unconstrained oom conditions, the set of mempolicy nodes, the mems
attached to current's cpuset, or a memory controller's limit.  The
proportion is given on a scale of 0 (never kill) to 1000 (always kill),
roughly meaning that a task with a badness() score of 500 consumes
approximately 50% of allowable memory resident in RAM or in swap space.

The proportion is always relative to the amount of "allowable" memory and
not the total amount of RAM systemwide so that mempolicies and cpusets
may operate in isolation; they shall not need to know the true size of
the machine on which they are running if they are bound to a specific set
of nodes or mems, respectively.
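
As an illustrative example (numbers invented here for clarity): with an
allowable memory of 1000000 pages, a task with 200000 pages resident in
RAM and 50000 pages in swap receives a baseline score of

	(200000 + 50000) * 1000 / 1000000 = 250

before the root bonus and oom_score_adj, described below, are applied.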

Forkbomb detection is done in a completely different way: a threshold is
configurable from userspace to determine how many first-generation execve
children (those with their own address spaces) a task may have before it
is considered a forkbomb.  This can be tuned by altering the value in
/proc/sys/vm/oom_forkbomb_thres, which defaults to 1000.

When a task has more than 1000 first-generation children with address
spaces different from its own, a penalty of

	(average rss of children) * (# of 1st generation execve children)
	-----------------------------------------------------------------
			oom_forkbomb_thres

is assessed.  So, for example, using the default oom_forkbomb_thres of
1000, the penalty is twice the average rss of all its execve children if
there are 2000 such tasks.  A task is considered to count toward the
threshold if its total runtime is less than one second; for 1000 such
tasks to exist, the parent process must be forking at an extremely high
rate either erroneously or maliciously.

Even though a particular task may be designated a forkbomb and selected
as the victim, the oom killer will still kill the 1st generation execve
child with the highest badness() score in its place.  This avoids killing
important servers or system daemons.  When a web server forks a very
large number of threads for client connections, for example, it is much
better to kill one of those threads than to kill the server and make it
unresponsive.

Root tasks are given 3% extra memory just like __vm_enough_memory()
provides in LSMs.  In the event of two tasks consuming similar amounts of
memory, it is generally better to save root's task.

Because of the change in the badness() heuristic's baseline, it is also
necessary to introduce a new user interface to tune it.  It's not
possible to redefine the meaning of /proc/pid/oom_adj with a new scale
since the ABI cannot be changed for backward compatibility.  Instead, a
new tunable, /proc/pid/oom_score_adj, is added that ranges from -1000 to
+1000.  It may be used to polarize the heuristic such that certain tasks
are never considered for oom kill while others may always be considered.
The value is added directly into the badness() score so a value of -500,
for example, means to discount 50% of its memory consumption in
comparison to other tasks either on the system, bound to the mempolicy,
in the cpuset, or sharing the same memory controller.
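
As a userspace usage sketch (illustrative only; note that lowering the
value requires CAP_SYS_RESOURCE, which this patch enforces), a daemon
could exempt itself from oom killing entirely with:

	#include <stdio.h>

	static int oom_protect_self(void)
	{
		FILE *f = fopen("/proc/self/oom_score_adj", "w");

		if (!f)
			return -1;
		/* OOM_SCORE_ADJ_MIN: badness() will always report 0 */
		fprintf(f, "%d\n", -1000);
		return fclose(f);
	}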

/proc/pid/oom_adj is changed so that its meaning is rescaled into the
units used by /proc/pid/oom_score_adj, and vice versa.  Changing one of
these per-task tunables will rescale the value of the other to an
equivalent meaning.  Although /proc/pid/oom_adj was originally defined as
a bitshift on the badness score, it now shares the same linear growth as
/proc/pid/oom_score_adj but with different granularity.  This is required
so the ABI is not broken for existing userspace applications and allows
oom_adj to be deprecated for future removal.
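
For reference, the linear scaling works out to (a sketch of the
conversions added to fs/proc/base.c by this patch):

	oom_score_adj = oom_adj * 1000 / 17;	/* OOM_SCORE_ADJ_MAX / -OOM_DISABLE */
	oom_adj = oom_score_adj * 15 / 1000;	/* OOM_ADJUST_MAX / OOM_SCORE_ADJ_MAX */

with the endpoints special-cased so that oom_adj == +15 maps exactly to
oom_score_adj == +1000 and oom_score_adj == -1000 maps exactly to
oom_adj == -17 (OOM_DISABLE).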

Signed-off-by: David Rientjes <rientjes@google.com>
---
 Documentation/filesystems/proc.txt |   97 +++++++----
 Documentation/sysctl/vm.txt        |   21 +++
 fs/proc/base.c                     |   98 +++++++++++-
 include/linux/memcontrol.h         |    8 +
 include/linux/oom.h                |   18 ++-
 include/linux/sched.h              |    3 +-
 kernel/fork.c                      |    1 +
 kernel/sysctl.c                    |    8 +
 mm/memcontrol.c                    |    8 +
 mm/oom_kill.c                      |  318 ++++++++++++++++++++----------------
 10 files changed, 394 insertions(+), 186 deletions(-)

diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -33,7 +33,8 @@ Table of Contents
   2	Modifying System Parameters
 
   3	Per-Process Parameters
-  3.1	/proc/<pid>/oom_adj - Adjust the oom-killer score
+  3.1	/proc/<pid>/oom_adj & /proc/<pid>/oom_score_adj - Adjust the oom-killer
+								score
   3.2	/proc/<pid>/oom_score - Display current oom-killer score
   3.3	/proc/<pid>/io - Display the IO accounting fields
   3.4	/proc/<pid>/coredump_filter - Core dump filtering settings
@@ -1188,42 +1189,64 @@ of the kernel.
 CHAPTER 3: PER-PROCESS PARAMETERS
 ------------------------------------------------------------------------------
 
-3.1 /proc/<pid>/oom_adj - Adjust the oom-killer score
-------------------------------------------------------
-
-This file can be used to adjust the score used to select which processes
-should be killed in an  out-of-memory  situation.  Giving it a high score will
-increase the likelihood of this process being killed by the oom-killer.  Valid
-values are in the range -16 to +15, plus the special value -17, which disables
-oom-killing altogether for this process.
-
-The process to be killed in an out-of-memory situation is selected among all others
-based on its badness score. This value equals the original memory size of the process
-and is then updated according to its CPU time (utime + stime) and the
-run time (uptime - start time). The longer it runs the smaller is the score.
-Badness score is divided by the square root of the CPU time and then by
-the double square root of the run time.
-
-Swapped out tasks are killed first. Half of each child's memory size is added to
-the parent's score if they do not share the same memory. Thus forking servers
-are the prime candidates to be killed. Having only one 'hungry' child will make
-parent less preferable than the child.
-
-/proc/<pid>/oom_score shows process' current badness score.
-
-The following heuristics are then applied:
- * if the task was reniced, its score doubles
- * superuser or direct hardware access tasks (CAP_SYS_ADMIN, CAP_SYS_RESOURCE
- 	or CAP_SYS_RAWIO) have their score divided by 4
- * if oom condition happened in one cpuset and checked process does not belong
- 	to it, its score is divided by 8
- * the resulting score is multiplied by two to the power of oom_adj, i.e.
-	points <<= oom_adj when it is positive and
-	points >>= -(oom_adj) otherwise
-
-The task with the highest badness score is then selected and its children
-are killed, process itself will be killed in an OOM situation when it does
-not have children or some of them disabled oom like described above.
+3.1 /proc/<pid>/oom_adj & /proc/<pid>/oom_score_adj - Adjust the oom-killer score
+--------------------------------------------------------------------------------
+
+These files can be used to adjust the badness heuristic used to select which
+process gets killed in out of memory conditions.
+
+The badness heuristic assigns a value to each candidate task ranging from 0
+(never kill) to 1000 (always kill) to determine which process is targeted.  The
+units are roughly a proportion along that range of allowed memory the process
+may allocate from based on an estimation of its current memory and swap use.
+For example, if a task is using all allowed memory, its badness score will be
+1000.  If it is using half of its allowed memory, its score will be 500.
+
+There are a couple of additional factors included in the badness score: root
+processes are given 3% extra memory over other tasks, and tasks which forkbomb
+an excessive number of child processes are penalized by their average size.
+The number of child processes considered to be a forkbomb is configurable
+via /proc/sys/vm/oom_forkbomb_thres (see Documentation/sysctl/vm.txt).
+
+The amount of "allowed" memory depends on the context in which the oom killer
+was called.  If it is due to the memory assigned to the allocating task's cpuset
+being exhausted, the allowed memory represents the set of mems assigned to that
+cpuset.  If it is due to a mempolicy's node(s) being exhausted, the allowed
+memory represents the set of mempolicy nodes.  If it is due to a memory
+limit (or swap limit) being reached, the allowed memory is that configured
+limit.  Finally, if it is due to the entire system being out of memory, the
+allowed memory represents all allocatable resources.
+
+The value of /proc/<pid>/oom_score_adj is added to the badness score before it
+is used to determine which task to kill.  Acceptable values range from -1000
+(OOM_SCORE_ADJ_MIN) to +1000 (OOM_SCORE_ADJ_MAX).  This allows userspace to
+polarize the preference for oom killing either by always preferring a certain
+task or completely disabling it.  The lowest possible value, -1000, is
+equivalent to disabling oom killing entirely for that task since it will always
+report a badness score of 0.
+
+Consequently, it is very simple for userspace to define the amount of memory to
+consider for each task.  Setting a /proc/<pid>/oom_score_adj value of +500, for
+example, is roughly equivalent to allowing the remainder of tasks sharing the
+same system, cpuset, mempolicy, or memory controller resources to use at least
+50% more memory.  A value of -500, on the other hand, would be roughly
+equivalent to discounting 50% of the task's allowed memory from being considered
+as scoring against the task.
+
+For backwards compatibility with previous kernels, /proc/<pid>/oom_adj may also
+be used to tune the badness score.  Its acceptable values range from -16
+(OOM_ADJUST_MIN) to +15 (OOM_ADJUST_MAX) and a special value of -17
+(OOM_DISABLE) to disable oom killing entirely for that task.  Its value is
+scaled linearly with /proc/<pid>/oom_score_adj.
+
+Writing to /proc/<pid>/oom_score_adj or /proc/<pid>/oom_adj will change the
+other with its scaled value.
+
+Caveat: when a parent task is selected, the oom killer will sacrifice any first
+generation children with separate address spaces instead, if possible.  This
+prevents servers and important system daemons from being killed and loses the
+minimal amount of work.
+
 
 3.2 /proc/<pid>/oom_score - Display current oom-killer score
 -------------------------------------------------------------
diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
--- a/Documentation/sysctl/vm.txt
+++ b/Documentation/sysctl/vm.txt
@@ -44,6 +44,7 @@ Currently, these files are in /proc/sys/vm:
 - nr_trim_pages         (only if CONFIG_MMU=n)
 - numa_zonelist_order
 - oom_dump_tasks
+- oom_forkbomb_thres
 - oom_kill_allocating_task
 - overcommit_memory
 - overcommit_ratio
@@ -490,6 +491,26 @@ The default value is 0.
 
 ==============================================================
 
+oom_forkbomb_thres
+
+This value defines how many children with a separate address space a specific
+task may have before being considered as a possible forkbomb.  Tasks with more
+children not sharing the same address space as the parent will be penalized by a
+quantity of memory equaling
+
+	(average rss of execve children) * (# of 1st generation execve children)
+	------------------------------------------------------------------------
+				oom_forkbomb_thres
+
+in the oom killer's badness heuristic.  Such tasks may be protected with a lower
+oom_adj value (see Documentation/filesystems/proc.txt) if necessary.
+
+A value of 0 will disable forkbomb detection.
+
+The default value is 1000.
+
+==============================================================
+
 oom_kill_allocating_task
 
 This enables or disables killing the OOM-triggering task in
diff --git a/fs/proc/base.c b/fs/proc/base.c
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -81,6 +81,7 @@
 #include <linux/elf.h>
 #include <linux/pid_namespace.h>
 #include <linux/fs_struct.h>
+#include <linux/swap.h>
 #include "internal.h"
 
 /* NOTE:
@@ -439,7 +440,6 @@ static const struct file_operations proc_lstats_operations = {
 #endif
 
 /* The badness from the OOM killer */
-unsigned long badness(struct task_struct *p, unsigned long uptime);
 static int proc_oom_score(struct task_struct *task, char *buffer)
 {
 	unsigned long points;
@@ -447,7 +447,13 @@ static int proc_oom_score(struct task_struct *task, char *buffer)
 
 	do_posix_clock_monotonic_gettime(&uptime);
 	read_lock(&tasklist_lock);
-	points = badness(task->group_leader, uptime.tv_sec);
+	points = oom_badness(task->group_leader,
+				global_page_state(NR_INACTIVE_ANON) +
+				global_page_state(NR_ACTIVE_ANON) +
+				global_page_state(NR_INACTIVE_FILE) +
+				global_page_state(NR_ACTIVE_FILE) +
+				total_swap_pages,
+				uptime.tv_sec);
 	read_unlock(&tasklist_lock);
 	return sprintf(buffer, "%lu\n", points);
 }
@@ -1054,7 +1060,15 @@ static ssize_t oom_adjust_write(struct file *file, const char __user *buf,
 	}
 
 	task->signal->oom_adj = oom_adjust;
-
+	/*
+	 * Scale /proc/pid/oom_score_adj appropriately ensuring that a maximum
+	 * value is always attainable.
+	 */
+	if (task->signal->oom_adj == OOM_ADJUST_MAX)
+		task->signal->oom_score_adj = OOM_SCORE_ADJ_MAX;
+	else
+		task->signal->oom_score_adj = (oom_adjust * OOM_SCORE_ADJ_MAX) /
+								-OOM_DISABLE;
 	unlock_task_sighand(task, &flags);
 	put_task_struct(task);
 
@@ -1066,6 +1080,82 @@ static const struct file_operations proc_oom_adjust_operations = {
 	.write		= oom_adjust_write,
 };
 
+static ssize_t oom_score_adj_read(struct file *file, char __user *buf,
+					size_t count, loff_t *ppos)
+{
+	struct task_struct *task = get_proc_task(file->f_path.dentry->d_inode);
+	char buffer[PROC_NUMBUF];
+	int oom_score_adj = OOM_SCORE_ADJ_MIN;
+	unsigned long flags;
+	size_t len;
+
+	if (!task)
+		return -ESRCH;
+	if (lock_task_sighand(task, &flags)) {
+		oom_score_adj = task->signal->oom_score_adj;
+		unlock_task_sighand(task, &flags);
+	}
+	put_task_struct(task);
+	len = snprintf(buffer, sizeof(buffer), "%d\n", oom_score_adj);
+	return simple_read_from_buffer(buf, count, ppos, buffer, len);
+}
+
+static ssize_t oom_score_adj_write(struct file *file, const char __user *buf,
+					size_t count, loff_t *ppos)
+{
+	struct task_struct *task;
+	char buffer[PROC_NUMBUF];
+	unsigned long flags;
+	long oom_score_adj;
+	int err;
+
+	memset(buffer, 0, sizeof(buffer));
+	if (count > sizeof(buffer) - 1)
+		count = sizeof(buffer) - 1;
+	if (copy_from_user(buffer, buf, count))
+		return -EFAULT;
+
+	err = strict_strtol(strstrip(buffer), 0, &oom_score_adj);
+	if (err)
+		return -EINVAL;
+	if (oom_score_adj < OOM_SCORE_ADJ_MIN ||
+			oom_score_adj > OOM_SCORE_ADJ_MAX)
+		return -EINVAL;
+
+	task = get_proc_task(file->f_path.dentry->d_inode);
+	if (!task)
+		return -ESRCH;
+	if (!lock_task_sighand(task, &flags)) {
+		put_task_struct(task);
+		return -ESRCH;
+	}
+	if (oom_score_adj < task->signal->oom_score_adj &&
+			!capable(CAP_SYS_RESOURCE)) {
+		unlock_task_sighand(task, &flags);
+		put_task_struct(task);
+		return -EACCES;
+	}
+
+	task->signal->oom_score_adj = oom_score_adj;
+	/*
+	 * Scale /proc/pid/oom_adj appropriately ensuring that OOM_DISABLE is
+	 * always attainable.
+	 */
+	if (task->signal->oom_score_adj == OOM_SCORE_ADJ_MIN)
+		task->signal->oom_adj = OOM_DISABLE;
+	else
+		task->signal->oom_adj = (oom_score_adj * OOM_ADJUST_MAX) /
+							OOM_SCORE_ADJ_MAX;
+	unlock_task_sighand(task, &flags);
+	put_task_struct(task);
+	return count;
+}
+
+static const struct file_operations proc_oom_score_adj_operations = {
+	.read		= oom_score_adj_read,
+	.write		= oom_score_adj_write,
+};
+
 #ifdef CONFIG_AUDITSYSCALL
 #define TMPBUFLEN 21
 static ssize_t proc_loginuid_read(struct file * file, char __user * buf,
@@ -2629,6 +2719,7 @@ static const struct pid_entry tgid_base_stuff[] = {
 #endif
 	INF("oom_score",  S_IRUGO, proc_oom_score),
 	REG("oom_adj",    S_IRUGO|S_IWUSR, proc_oom_adjust_operations),
+	REG("oom_score_adj", S_IRUGO|S_IWUSR, proc_oom_score_adj_operations),
 #ifdef CONFIG_AUDITSYSCALL
 	REG("loginuid",   S_IWUSR|S_IRUGO, proc_loginuid_operations),
 	REG("sessionid",  S_IRUGO, proc_sessionid_operations),
@@ -2963,6 +3054,7 @@ static const struct pid_entry tid_base_stuff[] = {
 #endif
 	INF("oom_score", S_IRUGO, proc_oom_score),
 	REG("oom_adj",   S_IRUGO|S_IWUSR, proc_oom_adjust_operations),
+	REG("oom_score_adj", S_IRUGO|S_IWUSR, proc_oom_score_adj_operations),
 #ifdef CONFIG_AUDITSYSCALL
 	REG("loginuid",  S_IWUSR|S_IRUGO, proc_loginuid_operations),
 	REG("sessionid",  S_IRUSR, proc_sessionid_operations),
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -128,6 +128,8 @@ void mem_cgroup_update_file_mapped(struct page *page, int val);
 unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
 						gfp_t gfp_mask, int nid,
 						int zid);
+u64 mem_cgroup_get_limit(struct mem_cgroup *mem);
+
 #else /* CONFIG_CGROUP_MEM_RES_CTLR */
 struct mem_cgroup;
 
@@ -306,6 +308,12 @@ unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
 	return 0;
 }
 
+static inline
+u64 mem_cgroup_get_limit(struct mem_cgroup *mem)
+{
+	return 0;
+}
+
 #endif /* CONFIG_CGROUP_MEM_CONT */
 
 #endif /* _LINUX_MEMCONTROL_H */
diff --git a/include/linux/oom.h b/include/linux/oom.h
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -1,14 +1,27 @@
 #ifndef __INCLUDE_LINUX_OOM_H
 #define __INCLUDE_LINUX_OOM_H
 
-/* /proc/<pid>/oom_adj set to -17 protects from the oom-killer */
+/*
+ * /proc/<pid>/oom_adj set to -17 protects from the oom-killer
+ */
 #define OOM_DISABLE (-17)
 /* inclusive */
 #define OOM_ADJUST_MIN (-16)
 #define OOM_ADJUST_MAX 15
 
+/*
+ * /proc/<pid>/oom_score_adj set to OOM_SCORE_ADJ_MIN disables oom killing for
+ * pid.
+ */
+#define OOM_SCORE_ADJ_MIN	(-1000)
+#define OOM_SCORE_ADJ_MAX	1000
+
+/* See Documentation/sysctl/vm.txt */
+#define DEFAULT_OOM_FORKBOMB_THRES	1000
+
 #ifdef __KERNEL__
 
+#include <linux/sched.h>
 #include <linux/types.h>
 #include <linux/nodemask.h>
 
@@ -24,6 +37,8 @@ enum oom_constraint {
 	CONSTRAINT_MEMORY_POLICY,
 };
 
+extern unsigned int oom_badness(struct task_struct *p,
+			unsigned long totalpages, unsigned long uptime);
 extern int try_set_zone_oom(struct zonelist *zonelist, gfp_t gfp_flags);
 extern void clear_zonelist_oom(struct zonelist *zonelist, gfp_t gfp_flags);
 
@@ -47,6 +62,7 @@ static inline void oom_killer_enable(void)
 extern int sysctl_panic_on_oom;
 extern int sysctl_oom_kill_allocating_task;
 extern int sysctl_oom_dump_tasks;
+extern int sysctl_oom_forkbomb_thres;
 
 #endif /* __KERNEL__*/
 #endif /* _INCLUDE_LINUX_OOM_H */
diff --git a/include/linux/sched.h b/include/linux/sched.h
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -624,7 +624,8 @@ struct signal_struct {
 	struct tty_audit_buf *tty_audit_buf;
 #endif
 
-	int oom_adj;	/* OOM kill score adjustment (bit shift) */
+	int oom_adj;		/* OOM kill score adjustment (bit shift) */
+	int oom_score_adj;	/* OOM kill score adjustment */
 };
 
 /* Context switch must be unlocked if interrupts are to be enabled */
diff --git a/kernel/fork.c b/kernel/fork.c
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -907,6 +907,7 @@ static int copy_signal(unsigned long clone_flags, struct task_struct *tsk)
 	tty_audit_fork(sig);
 
 	sig->oom_adj = current->signal->oom_adj;
+	sig->oom_score_adj = current->signal->oom_score_adj;
 
 	return 0;
 }
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -961,6 +961,14 @@ static struct ctl_table vm_table[] = {
 		.proc_handler	= proc_dointvec,
 	},
 	{
+		.procname	= "oom_forkbomb_thres",
+		.data		= &sysctl_oom_forkbomb_thres,
+		.maxlen		= sizeof(sysctl_oom_forkbomb_thres),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= &zero,
+	},
+	{
 		.procname	= "overcommit_ratio",
 		.data		= &sysctl_overcommit_ratio,
 		.maxlen		= sizeof(sysctl_overcommit_ratio),
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1077,6 +1077,14 @@ static int mem_cgroup_count_children(struct mem_cgroup *mem)
 }
 
 /*
+ * Return the memory + swap limit for a memcg.
+ */
+u64 mem_cgroup_get_limit(struct mem_cgroup *memcg)
+{
+	return res_counter_read_u64(&memcg->memsw, RES_LIMIT);
+}
+
+/*
  * Visit the first child (need not be the first child as per the ordering
  * of the cgroup list, since we track last_scanned_child) of @mem and use
  * that to reclaim free pages from.
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -4,6 +4,8 @@
  *  Copyright (C)  1998,2000  Rik van Riel
  *	Thanks go out to Claus Fischer for some serious inspiration and
  *	for goading me into coding this file...
+ *  Copyright (C)  2010  Google, Inc
+ *	Rewritten by David Rientjes
  *
  *  The routines in this file are used to kill a process when
  *  we're seriously out of memory. This gets called from __alloc_pages()
@@ -32,8 +34,8 @@
 int sysctl_panic_on_oom;
 int sysctl_oom_kill_allocating_task;
 int sysctl_oom_dump_tasks;
+int sysctl_oom_forkbomb_thres = DEFAULT_OOM_FORKBOMB_THRES;
 static DEFINE_SPINLOCK(zone_scan_lock);
-/* #define DEBUG */
 
 /*
  * Do all threads of the target process overlap our allowed nodes?
@@ -68,138 +70,129 @@ static bool has_intersects_mems_allowed(struct task_struct *tsk,
 	return false;
 }
 
-/**
- * badness - calculate a numeric value for how bad this task has been
- * @p: task struct of which task we should calculate
- * @uptime: current uptime in seconds
+/*
+ * Tasks that fork a very large number of children with separate address spaces
+ * may be the result of a bug, user error, malicious applications, or even those
+ * with a very legitimate purpose.  The oom killer assesses a penalty equaling
  *
- * The formula used is relatively simple and documented inline in the
- * function. The main rationale is that we want to select a good task
- * to kill when we run out of memory.
+ *	(average rss of children) * (# of 1st generation execve children)
+ *	-----------------------------------------------------------------
+ *			sysctl_oom_forkbomb_thres
  *
- * Good in this context means that:
- * 1) we lose the minimum amount of work done
- * 2) we recover a large amount of memory
- * 3) we don't kill anything innocent of eating tons of memory
- * 4) we want to kill the minimum amount of processes (one)
- * 5) we try to kill the process the user expects us to kill, this
- *    algorithm has been meticulously tuned to meet the principle
- *    of least surprise ... (be careful when you change it)
+ * for such tasks to target the parent.  oom_kill_process() will attempt to
+ * first kill a child, so there's no risk of killing an important system daemon
+ * via this method.  A web server, for example, may fork a very large number of
+ * threads to respond to client connections; it's much better to kill a child
+ * than to kill the parent, making the server unresponsive.  The goal here is
+ * to give the user a chance to recover from the error rather than deplete all
+ * memory such that the system is unusable; it's not meant to effect a forkbomb
+ * policy.
  */
-
-unsigned long badness(struct task_struct *p, unsigned long uptime)
+static unsigned long oom_forkbomb_penalty(struct task_struct *tsk)
 {
-	unsigned long points, cpu_time, run_time;
-	struct mm_struct *mm;
 	struct task_struct *child;
-	int oom_adj = p->signal->oom_adj;
-	struct task_cputime task_time;
-	unsigned long utime;
-	unsigned long stime;
+	unsigned long child_rss = 0;
+	int forkcount = 0;
 
-	if (oom_adj == OOM_DISABLE)
+	if (!sysctl_oom_forkbomb_thres)
 		return 0;
+	list_for_each_entry(child, &tsk->children, sibling) {
+		struct task_cputime task_time;
+		unsigned long runtime;
 
-	task_lock(p);
-	mm = p->mm;
-	if (!mm) {
-		task_unlock(p);
-		return 0;
+		task_lock(child);
+		if (!child->mm || child->mm == tsk->mm) {
+			task_unlock(child);
+			continue;
+		}
+		thread_group_cputime(child, &task_time);
+		runtime = cputime_to_jiffies(task_time.utime) +
+			  cputime_to_jiffies(task_time.stime);
+		/*
+		 * Only threads that have run for less than a second are
+		 * considered toward the forkbomb penalty; these threads rarely
+		 * get to execute at all in such cases anyway.
+		 */
+		if (runtime < HZ) {
+			child_rss += get_mm_rss(child->mm);
+			forkcount++;
+		}
+		task_unlock(child);
 	}
 
 	/*
-	 * The memory size of the process is the basis for the badness.
+	 * Forkbombs get penalized by:
+	 *
+	 * (average rss of children) * (# of first-generation execve children) /
+	 *			sysctl_oom_forkbomb_thres
 	 */
-	points = mm->total_vm;
+	return forkcount > sysctl_oom_forkbomb_thres ?
+				(child_rss / sysctl_oom_forkbomb_thres) : 0;
+}
+
+/**
+ * oom_badness - heuristic function to determine which candidate task to kill
+ * @p: task struct of which task we should calculate
+ * @totalpages: total present RAM allowed for page allocation
+ * @uptime: current uptime in seconds
+ *
+ * The heuristic for determining which task to kill is made to be as simple and
+ * predictable as possible.  The goal is to return the highest value for the
+ * task consuming the most memory to avoid subsequent oom conditions.
+ */
+unsigned int oom_badness(struct task_struct *p, unsigned long totalpages,
+							unsigned long uptime)
+{
+	struct mm_struct *mm;
+	int points;
 
 	/*
-	 * After this unlock we can no longer dereference local variable `mm'
+	 * Shortcut check for OOM_SCORE_ADJ_MIN so the entire heuristic doesn't
+	 * need to be executed for something that can't be killed.
 	 */
-	task_unlock(p);
+	if (p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN)
+		return 0;
 
 	/*
-	 * swapoff can easily use up all memory, so kill those first.
+	 * When the PF_OOM_ORIGIN bit is set, it indicates the task should have
+	 * priority for oom killing.
 	 */
 	if (p->flags & PF_OOM_ORIGIN)
-		return ULONG_MAX;
+		return 1000;
 
-	/*
-	 * Processes which fork a lot of child processes are likely
-	 * a good choice. We add half the vmsize of the children if they
-	 * have an own mm. This prevents forking servers to flood the
-	 * machine with an endless amount of children. In case a single
-	 * child is eating the vast majority of memory, adding only half
-	 * to the parents will make the child our kill candidate of choice.
-	 */
-	list_for_each_entry(child, &p->children, sibling) {
-		task_lock(child);
-		if (child->mm != mm && child->mm)
-			points += child->mm->total_vm/2 + 1;
-		task_unlock(child);
+	task_lock(p);
+	mm = p->mm;
+	if (!mm) {
+		task_unlock(p);
+		return 0;
 	}
 
 	/*
-	 * CPU time is in tens of seconds and run time is in thousands
-         * of seconds. There is no particular reason for this other than
-         * that it turned out to work very well in practice.
-	 */
-	thread_group_cputime(p, &task_time);
-	utime = cputime_to_jiffies(task_time.utime);
-	stime = cputime_to_jiffies(task_time.stime);
-	cpu_time = (utime + stime) >> (SHIFT_HZ + 3);
-
-
-	if (uptime >= p->start_time.tv_sec)
-		run_time = (uptime - p->start_time.tv_sec) >> 10;
-	else
-		run_time = 0;
-
-	if (cpu_time)
-		points /= int_sqrt(cpu_time);
-	if (run_time)
-		points /= int_sqrt(int_sqrt(run_time));
-
-	/*
-	 * Niced processes are most likely less important, so double
-	 * their badness points.
+	 * The baseline for the badness score is the proportion of RAM that each
+	 * task's rss and swap space use.
 	 */
-	if (task_nice(p) > 0)
-		points *= 2;
-
-	/*
-	 * Superuser processes are usually more important, so we make it
-	 * less likely that we kill those.
-	 */
-	if (has_capability_noaudit(p, CAP_SYS_ADMIN) ||
-	    has_capability_noaudit(p, CAP_SYS_RESOURCE))
-		points /= 4;
+	points = (get_mm_rss(mm) + get_mm_counter(mm, MM_SWAPENTS)) * 1000 /
+			totalpages;
+	task_unlock(p);
+	points += oom_forkbomb_penalty(p);
 
 	/*
-	 * We don't want to kill a process with direct hardware access.
-	 * Not only could that mess up the hardware, but usually users
-	 * tend to only have this flag set on applications they think
-	 * of as important.
+	 * Root processes get a 3% bonus, just like the __vm_enough_memory()
+	 * heuristic used by LSMs.
 	 */
-	if (has_capability_noaudit(p, CAP_SYS_RAWIO))
-		points /= 4;
+	if (has_capability_noaudit(p, CAP_SYS_ADMIN))
+		points -= 30;
 
 	/*
-	 * Adjust the score by oom_adj.
+	 * /proc/pid/oom_score_adj ranges from -1000 to +1000; the extremes of
+	 * the range may either completely disable oom killing for a task or
+	 * always prefer it.
 	 */
-	if (oom_adj) {
-		if (oom_adj > 0) {
-			if (!points)
-				points = 1;
-			points <<= oom_adj;
-		} else
-			points >>= -(oom_adj);
-	}
+	points += p->signal->oom_score_adj;
 
-#ifdef DEBUG
-	printk(KERN_DEBUG "OOMkill: task %d (%s) got %lu points\n",
-	p->pid, p->comm, points);
-#endif
-	return points;
+	if (points < 0)
+		return 0;
+	return (points <= 1000) ? points : 1000;
 }
 
 /*
@@ -207,12 +200,24 @@ unsigned long badness(struct task_struct *p, unsigned long uptime)
  */
 #ifdef CONFIG_NUMA
 static enum oom_constraint constrained_alloc(struct zonelist *zonelist,
-				    gfp_t gfp_mask, nodemask_t *nodemask)
+				gfp_t gfp_mask, nodemask_t *nodemask,
+				unsigned long *totalpages)
 {
 	struct zone *zone;
 	struct zoneref *z;
 	enum zone_type high_zoneidx = gfp_zone(gfp_mask);
+	bool cpuset_limited = false;
+	int nid;
 
+	/* Default to all anonymous memory, page cache, and swap */
+	*totalpages = global_page_state(NR_INACTIVE_ANON) +
+			global_page_state(NR_ACTIVE_ANON) +
+			global_page_state(NR_INACTIVE_FILE) +
+			global_page_state(NR_ACTIVE_FILE) +
+			total_swap_pages;
+
+	if (!zonelist)
+		return CONSTRAINT_NONE;
 	/*
 	 * Reach here only when __GFP_NOFAIL is used.  So, we should avoid
 	 * killing current.  We have to kill a random task in this case.
@@ -222,26 +227,47 @@ static enum oom_constraint constrained_alloc(struct zonelist *zonelist,
 		return CONSTRAINT_NONE;
 
 	/*
-	 * The nodemask here is a nodemask passed to alloc_pages(). Now,
-	 * cpuset doesn't use this nodemask for its hardwall/softwall/hierarchy
-	 * feature. mempolicy is an only user of nodemask here.
-	 * check mempolicy's nodemask contains all N_HIGH_MEMORY
+	 * This is not a __GFP_THISNODE allocation, so a truncated nodemask in
+	 * the page allocator means a mempolicy is in effect.  Cpuset policy
+	 * is enforced in get_page_from_freelist().
 	 */
-	if (nodemask && !nodes_subset(node_states[N_HIGH_MEMORY], *nodemask))
+	if (nodemask && !nodes_subset(node_states[N_HIGH_MEMORY], *nodemask)) {
+		*totalpages = total_swap_pages;
+		for_each_node_mask(nid, *nodemask)
+			*totalpages += node_page_state(nid, NR_INACTIVE_ANON) +
+					node_page_state(nid, NR_ACTIVE_ANON) +
+					node_page_state(nid, NR_INACTIVE_FILE) +
+					node_page_state(nid, NR_ACTIVE_FILE);
 		return CONSTRAINT_MEMORY_POLICY;
+	}
 
 	/* Check this allocation failure is caused by cpuset's wall function */
 	for_each_zone_zonelist_nodemask(zone, z, zonelist,
 			high_zoneidx, nodemask)
 		if (!cpuset_zone_allowed_softwall(zone, gfp_mask))
-			return CONSTRAINT_CPUSET;
-
+			cpuset_limited = true;
+
+	if (cpuset_limited) {
+		*totalpages = total_swap_pages;
+		for_each_node_mask(nid, cpuset_current_mems_allowed)
+			*totalpages += node_page_state(nid, NR_INACTIVE_ANON) +
+					node_page_state(nid, NR_ACTIVE_ANON) +
+					node_page_state(nid, NR_INACTIVE_FILE) +
+					node_page_state(nid, NR_ACTIVE_FILE);
+		return CONSTRAINT_CPUSET;
+	}
 	return CONSTRAINT_NONE;
 }
 #else
 static enum oom_constraint constrained_alloc(struct zonelist *zonelist,
-				gfp_t gfp_mask, nodemask_t *nodemask)
+				gfp_t gfp_mask, nodemask_t *nodemask,
+				unsigned long *totalpages)
 {
+	*totalpages = global_page_state(NR_INACTIVE_ANON) +
+			global_page_state(NR_ACTIVE_ANON) +
+			global_page_state(NR_INACTIVE_FILE) +
+			global_page_state(NR_ACTIVE_FILE) +
+			total_swap_pages;
 	return CONSTRAINT_NONE;
 }
 #endif
@@ -252,9 +278,9 @@ static enum oom_constraint constrained_alloc(struct zonelist *zonelist,
  *
  * (not docbooked, we don't want this one cluttering up the manual)
  */
-static struct task_struct *select_bad_process(unsigned long *ppoints,
-		struct mem_cgroup *mem, enum oom_constraint constraint,
-		const nodemask_t *mask)
+static struct task_struct *select_bad_process(unsigned int *ppoints,
+		unsigned long totalpages, struct mem_cgroup *mem,
+		enum oom_constraint constraint, const nodemask_t *mask)
 {
 	struct task_struct *p;
 	struct task_struct *chosen = NULL;
@@ -263,7 +289,7 @@ static struct task_struct *select_bad_process(unsigned long *ppoints,
 
 	do_posix_clock_monotonic_gettime(&uptime);
 	for_each_process(p) {
-		unsigned long points;
+		unsigned int points;
 
 		/*
 		 * skip kernel threads and tasks which have already released
@@ -308,13 +334,13 @@ static struct task_struct *select_bad_process(unsigned long *ppoints,
 				return ERR_PTR(-1UL);
 
 			chosen = p;
-			*ppoints = ULONG_MAX;
+			*ppoints = 1000;
 		}
 
-		if (p->signal->oom_adj == OOM_DISABLE)
+		if (p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN)
 			continue;
 
-		points = badness(p, uptime.tv_sec);
+		points = oom_badness(p, totalpages, uptime.tv_sec);
 		if (points > *ppoints || !chosen) {
 			chosen = p;
 			*ppoints = points;
@@ -330,7 +356,7 @@ static struct task_struct *select_bad_process(unsigned long *ppoints,
  *
  * Dumps the current memory state of all system tasks, excluding kernel threads.
  * State information includes task's pid, uid, tgid, vm size, rss, cpu, oom_adj
- * score, and name.
+ * value, oom_score_adj value, and name.
  *
 * If @mem is non-NULL, only tasks that are a member of the mem_cgroup are
  * shown.
@@ -342,7 +368,7 @@ static void dump_tasks(const struct mem_cgroup *mem)
 	struct task_struct *g, *p;
 
 	printk(KERN_INFO "[ pid ]   uid  tgid total_vm      rss cpu oom_adj "
-	       "name\n");
+	       "oom_score_adj name\n");
 	do_each_thread(g, p) {
 		struct mm_struct *mm;
 
@@ -362,10 +388,10 @@ static void dump_tasks(const struct mem_cgroup *mem)
 			task_unlock(p);
 			continue;
 		}
-		printk(KERN_INFO "[%5d] %5d %5d %8lu %8lu %3d     %3d %s\n",
+		pr_info("[%5d] %5d %5d %8lu %8lu %3d     %3d          %4d %s\n",
 		       p->pid, __task_cred(p)->uid, p->tgid, mm->total_vm,
 		       get_mm_rss(mm), (int)task_cpu(p), p->signal->oom_adj,
-		       p->comm);
+		       p->signal->oom_score_adj, p->comm);
 		task_unlock(p);
 	} while_each_thread(g, p);
 }
@@ -374,8 +400,9 @@ static void dump_header(struct task_struct *p, gfp_t gfp_mask, int order,
 							struct mem_cgroup *mem)
 {
 	pr_warning("%s invoked oom-killer: gfp_mask=0x%x, order=%d, "
-		"oom_adj=%d\n",
-		current->comm, gfp_mask, order, current->signal->oom_adj);
+		"oom_adj=%d, oom_score_adj=%d\n",
+		current->comm, gfp_mask, order, current->signal->oom_adj,
+		current->signal->oom_score_adj);
 	task_lock(current);
 	cpuset_print_task_mems_allowed(current);
 	task_unlock(current);
@@ -440,7 +467,7 @@ static int oom_kill_task(struct task_struct *p)
 	 * change to NULL at any time since we do not hold task_lock(p).
 	 * However, this is of no concern to us.
 	 */
-	if (!p->mm || p->signal->oom_adj == OOM_DISABLE)
+	if (!p->mm || p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN)
 		return 1;
 
 	__oom_kill_task(p, 1);
@@ -449,12 +476,12 @@ static int oom_kill_task(struct task_struct *p)
 }
 
 static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
-			    unsigned long points, struct mem_cgroup *mem,
-			    const char *message)
+			    unsigned int points, unsigned long totalpages,
+			    struct mem_cgroup *mem, const char *message)
 {
 	struct task_struct *victim = p;
 	struct task_struct *c;
-	unsigned long victim_points = 0;
+	unsigned int victim_points = 0;
 	struct timespec uptime;
 
 	if (printk_ratelimit())
@@ -469,21 +496,21 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
 		return 0;
 	}
 
-	pr_err("%s: Kill process %d (%s) with score %lu or sacrifice child\n",
+	pr_err("%s: Kill process %d (%s) with score %d or sacrifice child\n",
 		message, task_pid_nr(p), p->comm, points);
 
 	do_posix_clock_monotonic_gettime(&uptime);
 	/* Try to sacrifice the worst child first */
 	list_for_each_entry(c, &p->children, sibling) {
-		unsigned long cpoints;
+		unsigned int cpoints;
 
 		if (c->mm == p->mm)
 			continue;
 		if (mem && !task_in_mem_cgroup(c, mem))
 			continue;
 
-		/* badness() returns 0 if the thread is unkillable */
-		cpoints = badness(c, uptime.tv_sec);
+		/* oom_badness() returns 0 if the thread is unkillable */
+		cpoints = oom_badness(c, totalpages, uptime.tv_sec);
 		if (cpoints > victim_points) {
 			victim = c;
 			victim_points = cpoints;
@@ -495,19 +522,21 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
 #ifdef CONFIG_CGROUP_MEM_RES_CTLR
 void mem_cgroup_out_of_memory(struct mem_cgroup *mem, gfp_t gfp_mask)
 {
-	unsigned long points = 0;
+	unsigned long limit;
+	unsigned int points = 0;
 	struct task_struct *p;
 
+	limit = mem_cgroup_get_limit(mem) >> PAGE_SHIFT;
 	read_lock(&tasklist_lock);
 retry:
-	p = select_bad_process(&points, mem, CONSTRAINT_NONE, NULL);
+	p = select_bad_process(&points, limit, mem, CONSTRAINT_NONE, NULL);
 	if (PTR_ERR(p) == -1UL)
 		goto out;
 
 	if (!p)
 		p = current;
 
-	if (oom_kill_process(p, gfp_mask, 0, points, mem,
+	if (oom_kill_process(p, gfp_mask, 0, points, limit, mem,
 				"Memory cgroup out of memory"))
 		goto retry;
 out:
@@ -620,22 +649,22 @@ static void clear_system_oom(void)
 /*
  * Must be called with tasklist_lock held for read.
  */
-static void __out_of_memory(gfp_t gfp_mask, int order,
+static void __out_of_memory(gfp_t gfp_mask, int order, unsigned long totalpages,
 			enum oom_constraint constraint, const nodemask_t *mask)
 {
 	struct task_struct *p;
-	unsigned long points;
+	unsigned int points;
 
 	if (sysctl_oom_kill_allocating_task)
-		if (!oom_kill_process(current, gfp_mask, order, 0, NULL,
-				"Out of memory (oom_kill_allocating_task)"))
+		if (!oom_kill_process(current, gfp_mask, order, 0, totalpages,
+			NULL, "Out of memory (oom_kill_allocating_task)"))
 			return;
 retry:
 	/*
 	 * Rambo mode: Shoot down a process and hope it solves whatever
 	 * issues we may have.
 	 */
-	p = select_bad_process(&points, NULL, constraint, mask);
+	p = select_bad_process(&points, totalpages, NULL, constraint, mask);
 
 	if (PTR_ERR(p) == -1UL)
 		return;
@@ -647,7 +676,7 @@ retry:
 		panic("Out of memory and no killable processes...\n");
 	}
 
-	if (oom_kill_process(p, gfp_mask, order, points, NULL,
+	if (oom_kill_process(p, gfp_mask, order, points, totalpages, NULL,
 			     "Out of memory"))
 		goto retry;
 }
@@ -667,6 +696,7 @@ retry:
 void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
 		int order, nodemask_t *nodemask)
 {
+	unsigned long totalpages;
 	unsigned long freed = 0;
 	enum oom_constraint constraint = CONSTRAINT_NONE;
 
@@ -684,8 +714,8 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
 	 * Check if there were limitations on the allocation (only relevant for
 	 * NUMA) that may require different handling.
 	 */
-	if (zonelist)
-		constraint = constrained_alloc(zonelist, gfp_mask, nodemask);
+	constraint = constrained_alloc(zonelist, gfp_mask, nodemask,
+						&totalpages);
 	read_lock(&tasklist_lock);
 	if (unlikely(sysctl_panic_on_oom)) {
 		/*
@@ -699,7 +729,7 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
 			panic("Out of memory: panic_on_oom is enabled\n");
 		}
 	}
-	__out_of_memory(gfp_mask, order, constraint, nodemask);
+	__out_of_memory(gfp_mask, order, totalpages, constraint, nodemask);
 	read_unlock(&tasklist_lock);
 
 	/*

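As a worked illustration of the heuristic, here is a minimal stand-alone
userspace sketch of the oom_badness() arithmetic, using invented numbers.  It
is not part of the patch: it omits task locking, the !mm check, and the
forkbomb scan, and every value in main() is hypothetical.

/* Minimal userspace sketch of the oom_badness() arithmetic; all numbers
 * below are hypothetical. */
#include <stdio.h>

#define OOM_SCORE_ADJ_MIN	(-1000)

static unsigned int sketch_badness(unsigned long rss_pages,
				   unsigned long swap_pages,
				   unsigned long totalpages,
				   int oom_score_adj, int is_root)
{
	long points;

	/* OOM_SCORE_ADJ_MIN means the task is never killed */
	if (oom_score_adj == OOM_SCORE_ADJ_MIN)
		return 0;

	/* baseline: rss + swap as a proportion of allowable memory, 0..1000 */
	points = (rss_pages + swap_pages) * 1000 / totalpages;

	/* the forkbomb penalty (child_rss / sysctl_oom_forkbomb_thres) would
	 * be added here for tasks with many short-lived execve children */

	/* root tasks receive a 3% bonus: 30 of 1000 */
	if (is_root)
		points -= 30;

	/* userspace bias in [-1000, 1000] */
	points += oom_score_adj;

	if (points < 0)
		return 0;
	return points <= 1000 ? points : 1000;
}

int main(void)
{
	unsigned long totalpages = 1048576;	/* 4GB of 4KB pages */
	unsigned long rss = 262144;		/* 1GB resident */
	unsigned long swap = 65536;		/* 256MB of swap used */

	/* (262144 + 65536) * 1000 / 1048576 = 312 */
	printf("unprivileged: %u\n", sketch_badness(rss, swap, totalpages, 0, 0));
	printf("root:         %u\n", sketch_badness(rss, swap, totalpages, 0, 1));
	return 0;
}

With these numbers a task using roughly 31% of allowable memory scores 312 of
a possible 1000, the same task running as root scores 282, and either score
can then be biased linearly with oom_score_adj.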

* [patch -mm v2 06/10] oom: deprecate oom_adj tunable
  2010-02-26 23:52 [patch -mm v2 00/10] oom killer rewrite David Rientjes
                   ` (4 preceding siblings ...)
  2010-02-26 23:53 ` [patch -mm v2 05/10] oom: badness heuristic rewrite David Rientjes
@ 2010-02-26 23:53 ` David Rientjes
  2010-02-26 23:53 ` [patch -mm v2 07/10] oom: replace sysctls with quick mode David Rientjes
                   ` (3 subsequent siblings)
  9 siblings, 0 replies; 36+ messages in thread
From: David Rientjes @ 2010-02-26 23:53 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Rik van Riel, Nick Piggin, Balbir Singh, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro, linux-mm

/proc/pid/oom_adj is now deprecated so that it may eventually be
removed.  The target date for removal is December 2011.

A warning will be printed to the kernel log if a task attempts to use
this interface.  Future warnings will be suppressed until the kernel is
rebooted to prevent spamming the kernel log.

Signed-off-by: David Rientjes <rientjes@google.com>
---
 Documentation/feature-removal-schedule.txt |   30 ++++++++++++++++++++++++++++
 Documentation/filesystems/proc.txt         |    3 ++
 fs/proc/base.c                             |    8 +++++++
 include/linux/oom.h                        |    3 ++
 4 files changed, 44 insertions(+), 0 deletions(-)

diff --git a/Documentation/feature-removal-schedule.txt b/Documentation/feature-removal-schedule.txt
--- a/Documentation/feature-removal-schedule.txt
+++ b/Documentation/feature-removal-schedule.txt
@@ -193,6 +193,36 @@ Who:	Eric Biederman <ebiederm@xmission.com>
 
 ---------------------------
 
+What:	/proc/<pid>/oom_adj
+When:	December 2011
+Why:	/proc/<pid>/oom_adj allows userspace to influence the oom killer's
+	badness heuristic used to determine which task to kill when the kernel
+	is out of memory.
+
+	The badness heuristic has been rewritten since the introduction of
+	this tunable, so its original meaning is obsolete.  The value was
+	implemented as a bitshift on a score generated by the badness()
+	function that did not have any precise units of measure.  With the
+	rewrite, the score is given as a proportion of available memory to the
+	task allocating pages, so a bitshift which grows the score
+	exponentially is impossible to tune with fine granularity.
+
+	A much more powerful interface, /proc/<pid>/oom_score_adj, was
+	introduced with the oom killer rewrite that allows users to increase or
+	decrease the badness() score linearly.  This interface will replace
+	/proc/<pid>/oom_adj.
+
+	See Documentation/filesystems/proc.txt for information on how to use the
+	new tunable.
+
+	A warning will be emitted to the kernel log if an application uses this
+	deprecated interface.  After it is printed once, future warnings will
+	suppressed until the kernel is rebooted.
+
+Who:	David Rientjes <rientjes@google.com>
+
+---------------------------
+
 What:	remove EXPORT_SYMBOL(kernel_thread)
 When:	August 2006
 Files:	arch/*/kernel/*_ksyms.c
diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -1242,6 +1242,9 @@ scaled linearly with /proc/<pid>/oom_score_adj.
 Writing to /proc/<pid>/oom_score_adj or /proc/<pid>/oom_adj will change the
 other with its scaled value.
 
+NOTICE: /proc/<pid>/oom_adj is deprecated and will be removed, please see
+Documentation/feature-removal-schedule.txt.
+
 Caveat: when a parent task is selected, the oom killer will sacrifice any first
 generation children with separate address spaces instead, if possible.  This
 prevents servers and important system daemons from being killed and loses the
diff --git a/fs/proc/base.c b/fs/proc/base.c
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -1059,6 +1059,14 @@ static ssize_t oom_adjust_write(struct file *file, const char __user *buf,
 		return -EACCES;
 	}
 
+	/*
+	 * Warn that /proc/pid/oom_adj is deprecated, see
+	 * Documentation/feature-removal-schedule.txt.
+	 */
+	printk_once(KERN_WARNING "%s (%d): /proc/%d/oom_adj is deprecated, "
+			"please use /proc/%d/oom_score_adj instead.\n",
+			current->comm, task_pid_nr(current),
+			task_pid_nr(task), task_pid_nr(task));
 	task->signal->oom_adj = oom_adjust;
 	/*
 	 * Scale /proc/pid/oom_score_adj appropriately ensuring that a maximum
diff --git a/include/linux/oom.h b/include/linux/oom.h
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -2,6 +2,9 @@
 #define __INCLUDE_LINUX_OOM_H
 
 /*
+ * /proc/<pid>/oom_adj is deprecated, see
+ * Documentation/feature-removal-schedule.txt.
+ *
  * /proc/<pid>/oom_adj set to -17 protects from the oom-killer
  */
 #define OOM_DISABLE (-17)

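As a reference for the scaling mentioned in the oom_adjust_write() hunk above,
a plausible linear mapping between the two ranges might look like the sketch
below.  This is illustrative only and assumes a proportional mapping; the
authoritative conversion is whatever oom_adjust_write() actually implements.

/* Illustrative sketch: an assumed proportional mapping from the legacy
 * oom_adj range onto oom_score_adj [-1000, 1000]; not taken from the
 * patch itself. */
#define OOM_DISABLE		(-17)
#define OOM_SCORE_ADJ_MIN	(-1000)
#define OOM_SCORE_ADJ_MAX	1000

static int oom_adj_to_score_adj(int oom_adj)
{
	if (oom_adj == OOM_DISABLE)
		return OOM_SCORE_ADJ_MIN;	/* still means "never kill" */
	return oom_adj * OOM_SCORE_ADJ_MAX / -OOM_DISABLE;
	/* e.g. oom_adj = 15 maps to 15 * 1000 / 17 = 882 */
}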

* [patch -mm v2 07/10] oom: replace sysctls with quick mode
  2010-02-26 23:52 [patch -mm v2 00/10] oom killer rewrite David Rientjes
                   ` (5 preceding siblings ...)
  2010-02-26 23:53 ` [patch -mm v2 06/10] oom: deprecate oom_adj tunable David Rientjes
@ 2010-02-26 23:53 ` David Rientjes
  2010-02-26 23:53 ` [patch -mm v2 08/10] oom: avoid oom killer for lowmem allocations David Rientjes
                   ` (2 subsequent siblings)
  9 siblings, 0 replies; 36+ messages in thread
From: David Rientjes @ 2010-02-26 23:53 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Rik van Riel, Nick Piggin, Balbir Singh, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro, linux-mm

Two VM sysctls, oom_dump_tasks and oom_kill_allocating_task, were
implemented for very large systems to avoid excessively long tasklist
scans.  The former suppresses the helpful diagnostic messages that are
emitted for each thread group leader that is a candidate for oom kill,
including its pid, uid, vm size, rss, oom_adj value, and name; this
information is very helpful to users in understanding why a particular
task was chosen for kill over others.  The latter simply kills current,
the task triggering the oom condition, instead of iterating through the
tasklist looking for the worst offender.

Both of these sysctls are combined into one for use on the aforementioned
large systems: oom_kill_quick.  This disables the now-default
oom_dump_tasks and kills current whenever the oom killer is called.

This consolidation is possible because the audience for both tunables is
the same and there is no backwards compatibility issue in removing
oom_dump_tasks since its behavior is now the default.  Since mempolicy ooms
now scan the tasklist, oom_kill_allocating_task may find more users wanting
to avoid the performance penalty, so it's better to unite the two under one
sysctl than to carry both for legacy purposes.
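
As a usage sketch, enabling the consolidated knob from userspace is a single
procfs write; the path follows from the sysctl table entry in the patch below
(equivalent to "echo 1 > /proc/sys/vm/oom_kill_quick"):

/* Minimal sketch: enable oom_kill_quick from userspace. */
#include <stdio.h>

int main(void)
{
	FILE *f = fopen("/proc/sys/vm/oom_kill_quick", "w");

	if (!f) {
		perror("oom_kill_quick");
		return 1;
	}
	fputs("1\n", f);
	return fclose(f) ? 1 : 0;
}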

Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: David Rientjes <rientjes@google.com>
---
 Documentation/sysctl/vm.txt |   44 +++++-------------------------------------
 include/linux/oom.h         |    3 +-
 kernel/sysctl.c             |   13 ++---------
 mm/oom_kill.c               |    9 +++----
 4 files changed, 14 insertions(+), 55 deletions(-)

diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
--- a/Documentation/sysctl/vm.txt
+++ b/Documentation/sysctl/vm.txt
@@ -43,9 +43,8 @@ Currently, these files are in /proc/sys/vm:
 - nr_pdflush_threads
 - nr_trim_pages         (only if CONFIG_MMU=n)
 - numa_zonelist_order
-- oom_dump_tasks
 - oom_forkbomb_thres
-- oom_kill_allocating_task
+- oom_kill_quick
 - overcommit_memory
 - overcommit_ratio
 - page-cluster
@@ -470,27 +469,6 @@ this is causing problems for your system/application.
 
 ==============================================================
 
-oom_dump_tasks
-
-Enables a system-wide task dump (excluding kernel threads) to be
-produced when the kernel performs an OOM-killing and includes such
-information as pid, uid, tgid, vm size, rss, cpu, oom_adj score, and
-name.  This is helpful to determine why the OOM killer was invoked
-and to identify the rogue task that caused it.
-
-If this is set to zero, this information is suppressed.  On very
-large systems with thousands of tasks it may not be feasible to dump
-the memory state information for each one.  Such systems should not
-be forced to incur a performance penalty in OOM conditions when the
-information may not be desired.
-
-If this is set to non-zero, this information is shown whenever the
-OOM killer actually kills a memory-hogging task.
-
-The default value is 0.
-
-==============================================================
-
 oom_forkbomb_thres
 
 This value defines how many children with a separate address space a specific
@@ -511,22 +489,12 @@ The default value is 1000.
 
 ==============================================================
 
-oom_kill_allocating_task
-
-This enables or disables killing the OOM-triggering task in
-out-of-memory situations.
-
-If this is set to zero, the OOM killer will scan through the entire
-tasklist and select a task based on heuristics to kill.  This normally
-selects a rogue memory-hogging task that frees up a large amount of
-memory when killed.
-
-If this is set to non-zero, the OOM killer simply kills the task that
-triggered the out-of-memory condition.  This avoids the expensive
-tasklist scan.
+oom_kill_quick
 
-If panic_on_oom is selected, it takes precedence over whatever value
-is used in oom_kill_allocating_task.
+When enabled, this will always kill the task that triggered the oom killer, i.e.
+the task whose allocation attempt could not be satisfied.  It also
+suppresses the tasklist dump to the kernel log whenever the oom killer is
+called.  Typically set on systems with an extremely large number of tasks.
 
 The default value is 0.
 
diff --git a/include/linux/oom.h b/include/linux/oom.h
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -63,8 +63,7 @@ static inline void oom_killer_enable(void)
 }
 /* for sysctl */
 extern int sysctl_panic_on_oom;
-extern int sysctl_oom_kill_allocating_task;
-extern int sysctl_oom_dump_tasks;
+extern int sysctl_oom_kill_quick;
 extern int sysctl_oom_forkbomb_thres;
 
 #endif /* __KERNEL__*/
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -947,16 +947,9 @@ static struct ctl_table vm_table[] = {
 		.proc_handler	= proc_dointvec,
 	},
 	{
-		.procname	= "oom_kill_allocating_task",
-		.data		= &sysctl_oom_kill_allocating_task,
-		.maxlen		= sizeof(sysctl_oom_kill_allocating_task),
-		.mode		= 0644,
-		.proc_handler	= proc_dointvec,
-	},
-	{
-		.procname	= "oom_dump_tasks",
-		.data		= &sysctl_oom_dump_tasks,
-		.maxlen		= sizeof(sysctl_oom_dump_tasks),
+		.procname	= "oom_kill_quick",
+		.data		= &sysctl_oom_kill_quick,
+		.maxlen		= sizeof(sysctl_oom_kill_quick),
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec,
 	},
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -32,9 +32,8 @@
 #include <linux/security.h>
 
 int sysctl_panic_on_oom;
-int sysctl_oom_kill_allocating_task;
-int sysctl_oom_dump_tasks;
 int sysctl_oom_forkbomb_thres = DEFAULT_OOM_FORKBOMB_THRES;
+int sysctl_oom_kill_quick;
 static DEFINE_SPINLOCK(zone_scan_lock);
 
 /*
@@ -409,7 +408,7 @@ static void dump_header(struct task_struct *p, gfp_t gfp_mask, int order,
 	dump_stack();
 	mem_cgroup_print_oom_info(mem, p);
 	show_mem();
-	if (sysctl_oom_dump_tasks)
+	if (!sysctl_oom_kill_quick)
 		dump_tasks(mem);
 }
 
@@ -655,9 +654,9 @@ static void __out_of_memory(gfp_t gfp_mask, int order, unsigned long totalpages,
 	struct task_struct *p;
 	unsigned int points;
 
-	if (sysctl_oom_kill_allocating_task)
+	if (sysctl_oom_kill_quick)
 		if (!oom_kill_process(current, gfp_mask, order, 0, totalpages,
-			NULL, "Out of memory (oom_kill_allocating_task)"))
+			NULL, "Out of memory (quick mode)"))
 			return;
 retry:
 	/*


* [patch -mm v2 08/10] oom: avoid oom killer for lowmem allocations
  2010-02-26 23:52 [patch -mm v2 00/10] oom killer rewrite David Rientjes
                   ` (6 preceding siblings ...)
  2010-02-26 23:53 ` [patch -mm v2 07/10] oom: replace sysctls with quick mode David Rientjes
@ 2010-02-26 23:53 ` David Rientjes
  2010-02-26 23:53 ` [patch -mm v2 09/10] oom: remove unnecessary code and cleanup David Rientjes
  2010-02-26 23:53 ` [patch -mm v2 10/10] oom: default to killing current for pagefault ooms David Rientjes
  9 siblings, 0 replies; 36+ messages in thread
From: David Rientjes @ 2010-02-26 23:53 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Rik van Riel, Nick Piggin, Balbir Singh, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro, linux-mm

If memory has been depleted in lowmem zones even with the protection
afforded to it by /proc/sys/vm/lowmem_reserve_ratio, it is unlikely that
killing current users will help.  The memory is either reclaimable (or
migratable) already, in which case we should not invoke the oom killer at
all, or it is pinned by an application for I/O.  Killing such an
application may leave the hardware in an unspecified state and there is
no guarantee that it will be able to make a timely exit.

Lowmem allocations are now failed in oom conditions when __GFP_NOFAIL is
not used so that the task can perhaps recover or try again later.

Previously, the heuristic provided some protection for those tasks with 
CAP_SYS_RAWIO, but this is no longer necessary since we will not be
killing tasks for the purposes of ISA allocations.

high_zoneidx is gfp_zone(gfp_flags), meaning that ZONE_NORMAL will be the
default for all allocations that are not __GFP_DMA, __GFP_DMA32,
__GFP_HIGHMEM, and __GFP_MOVABLE on kernels configured to support those
flags.  Testing for high_zoneidx being less than ZONE_NORMAL will only
return true for allocations that have either __GFP_DMA or __GFP_DMA32.
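
To make the zone test concrete, the sketch below enumerates which allocation
contexts the new high_zoneidx check treats as lowmem.  It is a userspace
model that assumes the conventional zone ordering on a kernel with all zones
configured; the real mapping is performed by gfp_zone():

/* Userspace model of the high_zoneidx < ZONE_NORMAL test; assumes the
 * usual zone ordering, illustrative only. */
#include <stdio.h>

enum zone_type { ZONE_DMA, ZONE_DMA32, ZONE_NORMAL, ZONE_HIGHMEM, ZONE_MOVABLE };

static const char * const names[] = {
	"ZONE_DMA", "ZONE_DMA32", "ZONE_NORMAL", "ZONE_HIGHMEM", "ZONE_MOVABLE"
};

int main(void)
{
	int idx;

	for (idx = ZONE_DMA; idx <= ZONE_MOVABLE; idx++)
		printf("%-12s -> %s\n", names[idx],
		       idx < ZONE_NORMAL ?
				"fail the allocation, skip the oom killer" :
				"oom killer may be invoked");
	return 0;
}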

Signed-off-by: David Rientjes <rientjes@google.com>
---
 mm/page_alloc.c |   29 ++++++++++++++++++++---------
 1 files changed, 20 insertions(+), 9 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1658,6 +1658,9 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
 		/* The OOM killer will not help higher order allocs */
 		if (order > PAGE_ALLOC_COSTLY_ORDER)
 			goto out;
+		/* The OOM killer does not needlessly kill tasks for lowmem */
+		if (high_zoneidx < ZONE_NORMAL)
+			goto out;
 		/*
 		 * GFP_THISNODE contains __GFP_NORETRY and we never hit this.
 		 * Sanity check for bare calls of __GFP_THISNODE, not real OOM.
@@ -1886,15 +1889,23 @@ rebalance:
 			if (page)
 				goto got_pg;
 
-			/*
-			 * The OOM killer does not trigger for high-order
-			 * ~__GFP_NOFAIL allocations so if no progress is being
-			 * made, there are no other options and retrying is
-			 * unlikely to help.
-			 */
-			if (order > PAGE_ALLOC_COSTLY_ORDER &&
-						!(gfp_mask & __GFP_NOFAIL))
-				goto nopage;
+			if (!(gfp_mask & __GFP_NOFAIL)) {
+				/*
+				 * The oom killer is not called for high-order
+				 * allocations that may fail, so if no progress
+				 * is being made, there are no other options and
+				 * retrying is unlikely to help.
+				 */
+				if (order > PAGE_ALLOC_COSTLY_ORDER)
+					goto nopage;
+				/*
+				 * The oom killer is not called for lowmem
+				 * allocations to prevent needlessly killing
+				 * innocent tasks.
+				 */
+				if (high_zoneidx < ZONE_NORMAL)
+					goto nopage;
+			}
 
 			goto restart;
 		}


* [patch -mm v2 09/10] oom: remove unnecessary code and cleanup
  2010-02-26 23:52 [patch -mm v2 00/10] oom killer rewrite David Rientjes
                   ` (7 preceding siblings ...)
  2010-02-26 23:53 ` [patch -mm v2 08/10] oom: avoid oom killer for lowmem allocations David Rientjes
@ 2010-02-26 23:53 ` David Rientjes
  2010-02-26 23:53 ` [patch -mm v2 10/10] oom: default to killing current for pagefault ooms David Rientjes
  9 siblings, 0 replies; 36+ messages in thread
From: David Rientjes @ 2010-02-26 23:53 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Rik van Riel, Nick Piggin, Balbir Singh, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro, linux-mm

Remove the redundancy in __oom_kill_task() since:

 - init can never be passed to this function: it will never be PF_EXITING
   or selectable from select_bad_process(), and

 - it will never be passed a task from oom_kill_task() without an ->mm;
   since we're unconcerned about mm detachment from exiting tasks, there's
   no reason to protect them against SIGKILL or access to memory reserves.

Also moves the kernel log message to a higher level since it is not
always emitted here; we need not print an error message if an exiting
task is merely given a longer timeslice.

Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: David Rientjes <rientjes@google.com>
---
 mm/oom_kill.c |   64 ++++++++++++++------------------------------------------
 1 files changed, 16 insertions(+), 48 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -412,67 +412,35 @@ static void dump_header(struct task_struct *p, gfp_t gfp_mask, int order,
 		dump_tasks(mem);
 }
 
-#define K(x) ((x) << (PAGE_SHIFT-10))
-
 /*
- * Send SIGKILL to the selected  process irrespective of  CAP_SYS_RAW_IO
- * flag though it's unlikely that  we select a process with CAP_SYS_RAW_IO
- * set.
+ * Give the oom killed task high priority and access to memory reserves so that
+ * it may quickly exit and free its memory.
  */
-static void __oom_kill_task(struct task_struct *p, int verbose)
+static void __oom_kill_task(struct task_struct *p)
 {
-	if (is_global_init(p)) {
-		WARN_ON(1);
-		printk(KERN_WARNING "tried to kill init!\n");
-		return;
-	}
-
-	task_lock(p);
-	if (!p->mm) {
-		WARN_ON(1);
-		printk(KERN_WARNING "tried to kill an mm-less task %d (%s)!\n",
-			task_pid_nr(p), p->comm);
-		task_unlock(p);
-		return;
-	}
-
-	if (verbose)
-		printk(KERN_ERR "Killed process %d (%s) "
-		       "vsz:%lukB, anon-rss:%lukB, file-rss:%lukB\n",
-		       task_pid_nr(p), p->comm,
-		       K(p->mm->total_vm),
-		       K(get_mm_counter(p->mm, MM_ANONPAGES)),
-		       K(get_mm_counter(p->mm, MM_FILEPAGES)));
-	task_unlock(p);
-
-	/*
-	 * We give our sacrificial lamb high priority and access to
-	 * all the memory it needs. That way it should be able to
-	 * exit() and clear out its resources quickly...
-	 */
 	p->rt.time_slice = HZ;
 	set_tsk_thread_flag(p, TIF_MEMDIE);
-
 	force_sig(SIGKILL, p);
 }
 
+#define K(x) ((x) << (PAGE_SHIFT-10))
 static int oom_kill_task(struct task_struct *p)
 {
-	/* WARNING: mm may not be dereferenced since we did not obtain its
-	 * value from get_task_mm(p).  This is OK since all we need to do is
-	 * compare mm to q->mm below.
-	 *
-	 * Furthermore, even if mm contains a non-NULL value, p->mm may
-	 * change to NULL at any time since we do not hold task_lock(p).
-	 * However, this is of no concern to us.
-	 */
-	if (!p->mm || p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN)
+	task_lock(p);
+	if (!p->mm || p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN) {
+		task_unlock(p);
 		return 1;
+	}
+	pr_err("Killed process %d (%s) total-vm:%lukB, anon-rss:%lukB, file-rss:%lukB\n",
+		task_pid_nr(p), p->comm, K(p->mm->total_vm),
+	       K(get_mm_counter(p->mm, MM_ANONPAGES)),
+	       K(get_mm_counter(p->mm, MM_FILEPAGES)));
+	task_unlock(p);
 
-	__oom_kill_task(p, 1);
-
+	__oom_kill_task(p);
 	return 0;
 }
+#undef K
 
 static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
 			    unsigned int points, unsigned long totalpages,
@@ -491,7 +459,7 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
 	 * its children or threads, just set TIF_MEMDIE so it can die quickly
 	 */
 	if (p->flags & PF_EXITING) {
-		__oom_kill_task(p, 0);
+		__oom_kill_task(p);
 		return 0;
 	}
 


* [patch -mm v2 10/10] oom: default to killing current for pagefault ooms
  2010-02-26 23:52 [patch -mm v2 00/10] oom killer rewrite David Rientjes
                   ` (8 preceding siblings ...)
  2010-02-26 23:53 ` [patch -mm v2 09/10] oom: remove unnecessary code and cleanup David Rientjes
@ 2010-02-26 23:53 ` David Rientjes
  9 siblings, 0 replies; 36+ messages in thread
From: David Rientjes @ 2010-02-26 23:53 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Rik van Riel, Nick Piggin, Balbir Singh, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro, linux-mm

The pagefault oom handler does not know the context (gfp_mask, order,
etc.) in which the allocation failed when a VM_FAULT_OOM is generated.  The
only information known is that current is trying to allocate in that
context, so killing it is a legitimate response (and is the default for
architectures that do not even use the pagefault oom handler such as ia64
and powerpc).

When a VM_FAULT_OOM occurs, the pagefault oom handler will now attempt to
kill current by default.  If it is unkillable, the oom killer is called
to find and kill a memory-hogging task instead, which will lead to future
memory freeing.

Signed-off-by: David Rientjes <rientjes@google.com>
---
 mm/oom_kill.c |   18 +++++++++++++-----
 1 files changed, 13 insertions(+), 5 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -708,15 +708,23 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
 }
 
 /*
- * The pagefault handler calls here because it is out of memory, so kill a
- * memory-hogging task.  If a populated zone has ZONE_OOM_LOCKED set, a parallel
- * oom killing is already in progress so do nothing.  If a task is found with
- * TIF_MEMDIE set, it has been killed so do nothing and allow it to exit.
+ * The pagefault handler calls here because it is out of memory, so kill current
+ * by default.  If it's unkillable, then fall back to killing a memory-hogging
+ * task.  If a populated zone has ZONE_OOM_LOCKED set, a parallel oom killing is
+ * already in progress so do nothing.  If a task is found with TIF_MEMDIE set,
+ * it has been killed so do nothing and allow it to exit.
  */
 void pagefault_out_of_memory(void)
 {
+	unsigned long totalpages;
+	int err;
+
 	if (!try_set_system_oom())
 		return;
-	out_of_memory(NULL, 0, 0, NULL);
+	constrained_alloc(NULL, 0, NULL, &totalpages);
+	err = oom_kill_process(current, 0, 0, 0, totalpages, NULL,
+				"Out of memory (pagefault)");
+	if (err)
+		out_of_memory(NULL, 0, 0, NULL);
 	clear_system_oom();
 }


* Re: [patch -mm v2 04/10] oom: remove special handling for pagefault ooms
  2010-02-26 23:53 ` [patch -mm v2 04/10] oom: remove special handling for pagefault ooms David Rientjes
@ 2010-03-01  1:12   ` KAMEZAWA Hiroyuki
  2010-03-01 10:13     ` David Rientjes
  2010-03-01  5:23   ` Balbir Singh
  2010-03-02  2:21   ` KAMEZAWA Hiroyuki
  2 siblings, 1 reply; 36+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-03-01  1:12 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Rik van Riel, Nick Piggin, Balbir Singh,
	KOSAKI Motohiro, linux-mm

On Fri, 26 Feb 2010 15:53:11 -0800 (PST)
David Rientjes <rientjes@google.com> wrote:

> It is possible to remove the special pagefault oom handler by simply
> oom locking all system zones and then calling directly into
> out_of_memory().
> 
> All populated zones must have ZONE_OOM_LOCKED set, otherwise there is a
> parallel oom killing in progress that will lead to eventual memory
> freeing so it's not necessary to needlessly kill another task.  The
> context in which the pagefault is allocating memory is unknown to the oom
> killer, so this is done on a system-wide level.
> 
> If a task has already been oom killed and hasn't fully exited yet, this
> will be a no-op since select_bad_process() recognizes tasks across the
> system with TIF_MEMDIE set.
> 
> The special handling to determine whether a parallel memcg is currently
> oom is removed since we can detect future memory freeing with TIF_MEMDIE.
> The memcg has already reached its memory limit, so it will still need to
> kill a task regardless of the pagefault oom.
> 
> Signed-off-by: David Rientjes <rientjes@google.com>

NACK.  Please leave memcg's oom as it is; we're rewriting it now.
This is not the core of your patch set, so please skip it.

Thanks,
-Kame


> ---
>  include/linux/memcontrol.h |    6 ---
>  mm/memcontrol.c            |   35 +---------------
>  mm/oom_kill.c              |   97 ++++++++++++++++++++++++++------------------
>  3 files changed, 58 insertions(+), 80 deletions(-)
> 
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -124,7 +124,6 @@ static inline bool mem_cgroup_disabled(void)
>  	return false;
>  }
>  
> -extern bool mem_cgroup_oom_called(struct task_struct *task);
>  void mem_cgroup_update_file_mapped(struct page *page, int val);
>  unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
>  						gfp_t gfp_mask, int nid,
> @@ -258,11 +257,6 @@ static inline bool mem_cgroup_disabled(void)
>  	return true;
>  }
>  
> -static inline bool mem_cgroup_oom_called(struct task_struct *task)
> -{
> -	return false;
> -}
> -
>  static inline int
>  mem_cgroup_inactive_anon_is_low(struct mem_cgroup *memcg)
>  {
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -217,7 +217,6 @@ struct mem_cgroup {
>  	 * Should the accounting and control be hierarchical, per subtree?
>  	 */
>  	bool use_hierarchy;
> -	unsigned long	last_oom_jiffies;
>  	atomic_t	refcnt;
>  
>  	unsigned int	swappiness;
> @@ -1205,34 +1204,6 @@ static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_mem,
>  	return total;
>  }
>  
> -bool mem_cgroup_oom_called(struct task_struct *task)
> -{
> -	bool ret = false;
> -	struct mem_cgroup *mem;
> -	struct mm_struct *mm;
> -
> -	rcu_read_lock();
> -	mm = task->mm;
> -	if (!mm)
> -		mm = &init_mm;
> -	mem = mem_cgroup_from_task(rcu_dereference(mm->owner));
> -	if (mem && time_before(jiffies, mem->last_oom_jiffies + HZ/10))
> -		ret = true;
> -	rcu_read_unlock();
> -	return ret;
> -}
> -
> -static int record_last_oom_cb(struct mem_cgroup *mem, void *data)
> -{
> -	mem->last_oom_jiffies = jiffies;
> -	return 0;
> -}
> -
> -static void record_last_oom(struct mem_cgroup *mem)
> -{
> -	mem_cgroup_walk_tree(mem, NULL, record_last_oom_cb);
> -}
> -
>  /*
>   * Currently used to update mapped file statistics, but the routine can be
>   * generalized to update other statistics as well.
> @@ -1484,10 +1455,8 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm,
>  			continue;
>  
>  		if (!nr_retries--) {
> -			if (oom) {
> +			if (oom)
>  				mem_cgroup_out_of_memory(mem_over_limit, gfp_mask);
> -				record_last_oom(mem_over_limit);
> -			}
>  			goto nomem;
>  		}
>  	}
> @@ -2284,8 +2253,6 @@ void mem_cgroup_end_migration(struct mem_cgroup *mem,
>  
>  /*
>   * A call to try to shrink memory usage on charge failure at shmem's swapin.
> - * Calling hierarchical_reclaim is not enough because we should update
> - * last_oom_jiffies to prevent pagefault_out_of_memory from invoking global OOM.
>   * Moreover considering hierarchy, we should reclaim from the mem_over_limit,
>   * not from the memcg which this page would be charged to.
>   * try_charge_swapin does all of these works properly.
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -580,6 +580,44 @@ void clear_zonelist_oom(struct zonelist *zonelist, gfp_t gfp_mask)
>  }
>  
>  /*
> + * Try to acquire the oom killer lock for all system zones.  Returns zero if a
> + * parallel oom killing is taking place, otherwise locks all zones and returns
> + * non-zero.
> + */
> +static int try_set_system_oom(void)
> +{
> +	struct zone *zone;
> +	int ret = 1;
> +
> +	spin_lock(&zone_scan_lock);
> +	for_each_populated_zone(zone)
> +		if (zone_is_oom_locked(zone)) {
> +			ret = 0;
> +			goto out;
> +		}
> +	for_each_populated_zone(zone)
> +		zone_set_flag(zone, ZONE_OOM_LOCKED);
> +out:
> +	spin_unlock(&zone_scan_lock);
> +	return ret;
> +}
> +
> +/*
> + * Clears ZONE_OOM_LOCKED for all system zones so that failed allocation
> + * attempts or page faults may now recall the oom killer, if necessary.
> + */
> +static void clear_system_oom(void)
> +{
> +	struct zone *zone;
> +
> +	spin_lock(&zone_scan_lock);
> +	for_each_populated_zone(zone)
> +		zone_clear_flag(zone, ZONE_OOM_LOCKED);
> +	spin_unlock(&zone_scan_lock);
> +}
> +
> +
> +/*
>   * Must be called with tasklist_lock held for read.
>   */
>  static void __out_of_memory(gfp_t gfp_mask, int order,
> @@ -614,46 +652,9 @@ retry:
>  		goto retry;
>  }
>  
> -/*
> - * pagefault handler calls into here because it is out of memory but
> - * doesn't know exactly how or why.
> - */
> -void pagefault_out_of_memory(void)
> -{
> -	unsigned long freed = 0;
> -
> -	blocking_notifier_call_chain(&oom_notify_list, 0, &freed);
> -	if (freed > 0)
> -		/* Got some memory back in the last second. */
> -		return;
> -
> -	/*
> -	 * If this is from memcg, oom-killer is already invoked.
> -	 * and not worth to go system-wide-oom.
> -	 */
> -	if (mem_cgroup_oom_called(current))
> -		goto rest_and_return;
> -
> -	if (sysctl_panic_on_oom)
> -		panic("out of memory from page fault. panic_on_oom is selected.\n");
> -
> -	read_lock(&tasklist_lock);
> -	/* unknown gfp_mask and order */
> -	__out_of_memory(0, 0, CONSTRAINT_NONE, NULL);
> -	read_unlock(&tasklist_lock);
> -
> -	/*
> -	 * Give "p" a good chance of killing itself before we
> -	 * retry to allocate memory.
> -	 */
> -rest_and_return:
> -	if (!test_thread_flag(TIF_MEMDIE))
> -		schedule_timeout_uninterruptible(1);
> -}
> -
>  /**
>   * out_of_memory - kill the "best" process when we run out of memory
> - * @zonelist: zonelist pointer
> + * @zonelist: zonelist pointer passed to page allocator
>   * @gfp_mask: memory allocation flags
>   * @order: amount of memory being requested as a power of 2
>   * @nodemask: nodemask passed to page allocator
> @@ -667,7 +668,7 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
>  		int order, nodemask_t *nodemask)
>  {
>  	unsigned long freed = 0;
> -	enum oom_constraint constraint;
> +	enum oom_constraint constraint = CONSTRAINT_NONE;
>  
>  	blocking_notifier_call_chain(&oom_notify_list, 0, &freed);
>  	if (freed > 0)
> @@ -683,7 +684,8 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
>  	 * Check if there were limitations on the allocation (only relevant for
>  	 * NUMA) that may require different handling.
>  	 */
> -	constraint = constrained_alloc(zonelist, gfp_mask, nodemask);
> +	if (zonelist)
> +		constraint = constrained_alloc(zonelist, gfp_mask, nodemask);
>  	read_lock(&tasklist_lock);
>  	if (unlikely(sysctl_panic_on_oom)) {
>  		/*
> @@ -693,6 +695,7 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
>  		 */
>  		if (constraint == CONSTRAINT_NONE) {
>  			dump_header(NULL, gfp_mask, order, NULL);
> +			read_unlock(&tasklist_lock);
>  			panic("Out of memory: panic_on_oom is enabled\n");
>  		}
>  	}
> @@ -706,3 +709,17 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
>  	if (!test_thread_flag(TIF_MEMDIE))
>  		schedule_timeout_uninterruptible(1);
>  }
> +
> +/*
> + * The pagefault handler calls here because it is out of memory, so kill a
> + * memory-hogging task.  If a populated zone has ZONE_OOM_LOCKED set, a parallel
> + * oom killing is already in progress so do nothing.  If a task is found with
> + * TIF_MEMDIE set, it has been killed so do nothing and allow it to exit.
> + */
> +void pagefault_out_of_memory(void)
> +{
> +	if (!try_set_system_oom())
> +		return;
> +	out_of_memory(NULL, 0, 0, NULL);
> +	clear_system_oom();
> +}
> 


* Re: [patch -mm v2 04/10] oom: remove special handling for pagefault ooms
  2010-02-26 23:53 ` [patch -mm v2 04/10] oom: remove special handling for pagefault ooms David Rientjes
  2010-03-01  1:12   ` KAMEZAWA Hiroyuki
@ 2010-03-01  5:23   ` Balbir Singh
  2010-03-01 10:04     ` David Rientjes
  2010-03-02  2:21   ` KAMEZAWA Hiroyuki
  2 siblings, 1 reply; 36+ messages in thread
From: Balbir Singh @ 2010-03-01  5:23 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Rik van Riel, Nick Piggin, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro, linux-mm

* David Rientjes <rientjes@google.com> [2010-02-26 15:53:11]:

> It is possible to remove the special pagefault oom handler by simply
> oom locking all system zones and then calling directly into
> out_of_memory().
> 
> All populated zones must have ZONE_OOM_LOCKED set, otherwise there is a
> parallel oom killing in progress that will lead to eventual memory
> freeing so it's not necessary to needlessly kill another task.  The
> context in which the pagefault is allocating memory is unknown to the oom
> killer, so this is done on a system-wide level.
> 
> If a task has already been oom killed and hasn't fully exited yet, this
> will be a no-op since select_bad_process() recognizes tasks across the
> system with TIF_MEMDIE set.
> 
> The special handling to determine whether a parallel memcg is currently
> oom is removed since we can detect future memory freeing with TIF_MEMDIE.
> The memcg has already reached its memory limit, so it will still need to
> kill a task regardless of the pagefault oom.
> 
> Signed-off-by: David Rientjes <rientjes@google.com>
> ---
>  include/linux/memcontrol.h |    6 ---
>  mm/memcontrol.c            |   35 +---------------
>  mm/oom_kill.c              |   97 ++++++++++++++++++++++++++------------------
>  3 files changed, 58 insertions(+), 80 deletions(-)
> 
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -124,7 +124,6 @@ static inline bool mem_cgroup_disabled(void)
>  	return false;
>  }
> 
> -extern bool mem_cgroup_oom_called(struct task_struct *task);
>  void mem_cgroup_update_file_mapped(struct page *page, int val);
>  unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
>  						gfp_t gfp_mask, int nid,
> @@ -258,11 +257,6 @@ static inline bool mem_cgroup_disabled(void)
>  	return true;
>  }
> 
> -static inline bool mem_cgroup_oom_called(struct task_struct *task)
> -{
> -	return false;
> -}
> -
>  static inline int
>  mem_cgroup_inactive_anon_is_low(struct mem_cgroup *memcg)
>  {
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -217,7 +217,6 @@ struct mem_cgroup {
>  	 * Should the accounting and control be hierarchical, per subtree?
>  	 */
>  	bool use_hierarchy;
> -	unsigned long	last_oom_jiffies;
>  	atomic_t	refcnt;
> 
>  	unsigned int	swappiness;
> @@ -1205,34 +1204,6 @@ static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_mem,
>  	return total;
>  }
> 
> -bool mem_cgroup_oom_called(struct task_struct *task)
> -{
> -	bool ret = false;
> -	struct mem_cgroup *mem;
> -	struct mm_struct *mm;
> -
> -	rcu_read_lock();
> -	mm = task->mm;
> -	if (!mm)
> -		mm = &init_mm;
> -	mem = mem_cgroup_from_task(rcu_dereference(mm->owner));
> -	if (mem && time_before(jiffies, mem->last_oom_jiffies + HZ/10))
> -		ret = true;
> -	rcu_read_unlock();
> -	return ret;
> -}
> -
> -static int record_last_oom_cb(struct mem_cgroup *mem, void *data)
> -{
> -	mem->last_oom_jiffies = jiffies;
> -	return 0;
> -}
> -
> -static void record_last_oom(struct mem_cgroup *mem)
> -{
> -	mem_cgroup_walk_tree(mem, NULL, record_last_oom_cb);
> -}
> -
>  /*
>   * Currently used to update mapped file statistics, but the routine can be
>   * generalized to update other statistics as well.
> @@ -1484,10 +1455,8 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm,
>  			continue;
> 
>  		if (!nr_retries--) {
> -			if (oom) {
> +			if (oom)
>  				mem_cgroup_out_of_memory(mem_over_limit, gfp_mask);
> -				record_last_oom(mem_over_limit);
> -			}
>  			goto nomem;
>  		}
>  	}
> @@ -2284,8 +2253,6 @@ void mem_cgroup_end_migration(struct mem_cgroup *mem,
> 
>  /*
>   * A call to try to shrink memory usage on charge failure at shmem's swapin.
> - * Calling hierarchical_reclaim is not enough because we should update
> - * last_oom_jiffies to prevent pagefault_out_of_memory from invoking global OOM.
>   * Moreover considering hierarchy, we should reclaim from the mem_over_limit,
>   * not from the memcg which this page would be charged to.
>   * try_charge_swapin does all of these works properly.
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -580,6 +580,44 @@ void clear_zonelist_oom(struct zonelist *zonelist, gfp_t gfp_mask)
>  }
> 
>  /*
> + * Try to acquire the oom killer lock for all system zones.  Returns zero if a
> + * parallel oom killing is taking place, otherwise locks all zones and returns
> + * non-zero.
> + */
> +static int try_set_system_oom(void)
> +{
> +	struct zone *zone;
> +	int ret = 1;
> +
> +	spin_lock(&zone_scan_lock);
> +	for_each_populated_zone(zone)
> +		if (zone_is_oom_locked(zone)) {
> +			ret = 0;
> +			goto out;
> +		}
> +	for_each_populated_zone(zone)
> +		zone_set_flag(zone, ZONE_OOM_LOCKED);
> +out:
> +	spin_unlock(&zone_scan_lock);
> +	return ret;
> +}

Isn't this overkill?  If pagefault_out_of_memory() does nothing and the
oom takes longer than anticipated, we might end up looping, no?
Aren't we better off waiting for the OOM to finish and retrying the
pagefault?

And like Kame said, the pagefault code in memcg is undergoing churn;
we should revisit those parts later.  I have yet to review that
patchset, though.

-- 
	Three Cheers,
	Balbir


* Re: [patch -mm v2 04/10] oom: remove special handling for pagefault ooms
  2010-03-01  5:23   ` Balbir Singh
@ 2010-03-01 10:04     ` David Rientjes
  2010-03-01 23:55       ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 36+ messages in thread
From: David Rientjes @ 2010-03-01 10:04 UTC (permalink / raw)
  To: Balbir Singh
  Cc: Andrew Morton, Rik van Riel, Nick Piggin, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro, linux-mm

On Mon, 1 Mar 2010, Balbir Singh wrote:

> > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > --- a/mm/oom_kill.c
> > +++ b/mm/oom_kill.c
> > @@ -580,6 +580,44 @@ void clear_zonelist_oom(struct zonelist *zonelist, gfp_t gfp_mask)
> >  }
> > 
> >  /*
> > + * Try to acquire the oom killer lock for all system zones.  Returns zero if a
> > + * parallel oom killing is taking place, otherwise locks all zones and returns
> > + * non-zero.
> > + */
> > +static int try_set_system_oom(void)
> > +{
> > +	struct zone *zone;
> > +	int ret = 1;
> > +
> > +	spin_lock(&zone_scan_lock);
> > +	for_each_populated_zone(zone)
> > +		if (zone_is_oom_locked(zone)) {
> > +			ret = 0;
> > +			goto out;
> > +		}
> > +	for_each_populated_zone(zone)
> > +		zone_set_flag(zone, ZONE_OOM_LOCKED);
> > +out:
> > +	spin_unlock(&zone_scan_lock);
> > +	return ret;
> > +}
> 
> Isn't this an overkill, if pagefault_out_of_memory() does nothing and
> oom takes longer than anticipated, we might end up looping, no?
> Aren't we better off waiting for OOM to finish and retry the
> pagefault?
> 

I agree; I can add schedule_timeout_uninterruptible(1) so we back off
while waiting for the parallel oom kill to happen.  It's not overkill
because we want to avoid needlessly killing tasks when killing one will
already free memory which is hopefully usable by the pagefault.  This
merely covers the race between a parallel oom kill calling out_of_memory()
and setting TIF_MEMDIE for a task, which would make the following
out_of_memory() call in pagefault_out_of_memory() a no-op anyway.
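
A sketch of what that would look like, built from the helpers introduced in
this patch (this only renders the suggestion; it is not code from the posted
series):

void pagefault_out_of_memory(void)
{
	if (!try_set_system_oom()) {
		/* a parallel oom kill is underway; back off briefly rather
		 * than spinning on repeated page faults */
		schedule_timeout_uninterruptible(1);
		return;
	}
	out_of_memory(NULL, 0, 0, NULL);
	clear_system_oom();
}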

> And like Kame said the pagefault code in memcg is undergoing a churn,
> we should revisit those parts later. I am yet to review that
> patchset though.
> 

Kame said earlier it would be no problem to rebase his memcg oom work on 
mmotm if my patches were merged.


* Re: [patch -mm v2 04/10] oom: remove special handling for pagefault ooms
  2010-03-01  1:12   ` KAMEZAWA Hiroyuki
@ 2010-03-01 10:13     ` David Rientjes
  2010-03-01 23:59       ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 36+ messages in thread
From: David Rientjes @ 2010-03-01 10:13 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Andrew Morton, Rik van Riel, Nick Piggin, Balbir Singh,
	KOSAKI Motohiro, linux-mm

On Mon, 1 Mar 2010, KAMEZAWA Hiroyuki wrote:

> On Fri, 26 Feb 2010 15:53:11 -0800 (PST)
> David Rientjes <rientjes@google.com> wrote:
> 
> > It is possible to remove the special pagefault oom handler by simply
> > oom locking all system zones and then calling directly into
> > out_of_memory().
> > 
> > All populated zones must have ZONE_OOM_LOCKED set, otherwise there is a
> > parallel oom killing in progress that will lead to eventual memory
> > freeing so it's not necessary to needlessly kill another task.  The
> > context in which the pagefault is allocating memory is unknown to the oom
> > killer, so this is done on a system-wide level.
> > 
> > If a task has already been oom killed and hasn't fully exited yet, this
> > will be a no-op since select_bad_process() recognizes tasks across the
> > system with TIF_MEMDIE set.
> > 
> > The special handling to determine whether a parallel memcg is currently
> > oom is removed since we can detect future memory freeing with TIF_MEMDIE.
> > The memcg has already reached its memory limit, so it will still need to
> > kill a task regardless of the pagefault oom.
> > 
> > Signed-off-by: David Rientjes <rientjes@google.com>
> 
> NACK. Please leave memcg's oom handling as it is; we're rewriting it now.
> This is not the core of your patch set, so please skip it.
> 

Your nack is completely unjustified, we're not going to stop oom killer 
development so memcg can catch up.  This patch allows pagefaults to go 
through the typical out_of_memory() interface so we don't have any 
ambiguity in how situations such as panic_on_oom are handled or whether 
current's memcg recently called the oom killer and it PREVENTS needlessly 
killing tasks when a parallel oom condition exists but a task hasn't been 
killed yet.

mem_cgroup_oom_called() is completely and utterly BOGUS since we can 
detect the EXACT same conditions via a tasklist scan filtered on current's 
memcg by looking for parallel oom kills, which out_of_memory() does, and 
locking the zonelists to prevent racing in calling out_of_memory() and 
actually setting the TIF_MEMDIE bit for the selected task.

You said earlier that you would wait for the next mmotm to be released and 
could easily rebase on my patchset and now you're stopping development 
entirely and allowing tasks to be needlessly oom killed via the old 
pagefault_out_of_memory() which does not synchronize on parallel oom 
kills.

I'm completely sure that you'll remove mem_cgroup_oom_called() entirely 
yourself since it doesn't do anything but encourage VM_FAULT_OOM loops 
itself, so please come up with some constructive criticism of my patch 
that Andrew can use to decide whether to merge my work or not instead of 
thinking you're the only one that can touch memcg.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [patch -mm v2 04/10] oom: remove special handling for pagefault ooms
  2010-03-01 10:04     ` David Rientjes
@ 2010-03-01 23:55       ` KAMEZAWA Hiroyuki
  2010-03-03  0:01         ` David Rientjes
  0 siblings, 1 reply; 36+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-03-01 23:55 UTC (permalink / raw)
  To: David Rientjes
  Cc: Balbir Singh, Andrew Morton, Rik van Riel, Nick Piggin,
	KOSAKI Motohiro, linux-mm

On Mon, 1 Mar 2010 02:04:07 -0800 (PST)
David Rientjes <rientjes@google.com> wrote:

> On Mon, 1 Mar 2010, Balbir Singh wrote:
> 
> > > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > > --- a/mm/oom_kill.c
> > > +++ b/mm/oom_kill.c
> > > @@ -580,6 +580,44 @@ void clear_zonelist_oom(struct zonelist *zonelist, gfp_t gfp_mask)
> > >  }
> > > 
> > >  /*
> > > + * Try to acquire the oom killer lock for all system zones.  Returns zero if a
> > > + * parallel oom killing is taking place, otherwise locks all zones and returns
> > > + * non-zero.
> > > + */
> > > +static int try_set_system_oom(void)
> > > +{
> > > +	struct zone *zone;
> > > +	int ret = 1;
> > > +
> > > +	spin_lock(&zone_scan_lock);
> > > +	for_each_populated_zone(zone)
> > > +		if (zone_is_oom_locked(zone)) {
> > > +			ret = 0;
> > > +			goto out;
> > > +		}
> > > +	for_each_populated_zone(zone)
> > > +		zone_set_flag(zone, ZONE_OOM_LOCKED);
> > > +out:
> > > +	spin_unlock(&zone_scan_lock);
> > > +	return ret;
> > > +}
> > 
> > Isn't this overkill?  If pagefault_out_of_memory() does nothing and
> > oom takes longer than anticipated, we might end up looping, no?
> > Aren't we better off waiting for the OOM to finish and retrying the
> > pagefault?
> > 
> 
> I agree; I can add schedule_timeout_uninterruptible(1) so we reduce the 
> looping while waiting for the parallel oom kill to happen.  It's not 
> overkill because we want to avoid needlessly killing tasks when killing 
> one will already free memory that is hopefully usable by the pagefault.  
> This merely covers the race between a parallel oom kill calling 
> out_of_memory() and setting TIF_MEMDIE for a task, which would make the 
> following out_of_memory() call in pagefault_out_of_memory() a no-op anyway.
> 
> > And, like Kame said, the pagefault code in memcg is undergoing churn;
> > we should revisit those parts later. I have yet to review that
> > patchset, though.
> > 
> 
> Kame said earlier it would be no problem to rebase his memcg oom work on 
> mmotm if my patches were merged.
> 

But I also said this patch causes a regression.
I said it's OK to rebase onto your series of patches. But about this patch,
no.


Thanks,
-Kame


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [patch -mm v2 04/10] oom: remove special handling for pagefault ooms
  2010-03-01 10:13     ` David Rientjes
@ 2010-03-01 23:59       ` KAMEZAWA Hiroyuki
  2010-03-02 23:55         ` David Rientjes
  0 siblings, 1 reply; 36+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-03-01 23:59 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Rik van Riel, Nick Piggin, Balbir Singh,
	KOSAKI Motohiro, linux-mm

On Mon, 1 Mar 2010 02:13:28 -0800 (PST)
David Rientjes <rientjes@google.com> wrote:

> On Mon, 1 Mar 2010, KAMEZAWA Hiroyuki wrote:
> 
> > On Fri, 26 Feb 2010 15:53:11 -0800 (PST)
> > David Rientjes <rientjes@google.com> wrote:
> > 
> > > It is possible to remove the special pagefault oom handler by simply
> > > oom locking all system zones and then calling directly into
> > > out_of_memory().
> > > 
> > > All populated zones must have ZONE_OOM_LOCKED set, otherwise there is a
> > > parallel oom killing in progress that will lead to eventual memory
> > > freeing so it's not necessary to needlessly kill another task.  The
> > > context in which the pagefault is allocating memory is unknown to the oom
> > > killer, so this is done on a system-wide level.
> > > 
> > > If a task has already been oom killed and hasn't fully exited yet, this
> > > will be a no-op since select_bad_process() recognizes tasks across the
> > > system with TIF_MEMDIE set.
> > > 
> > > The special handling to determine whether a parallel memcg is currently
> > > oom is removed since we can detect future memory freeing with TIF_MEMDIE.
> > > The memcg has already reached its memory limit, so it will still need to
> > > kill a task regardless of the pagefault oom.
> > > 
> > > Signed-off-by: David Rientjes <rientjes@google.com>
> > 
> > NACK. Please leave memcg's oom handling as it is; we're rewriting it now.
> > This is not the core of your patch set, so please skip it.
> > 
> 
> Your nack is completely unjustified, we're not going to stop oom killer 
> development so memcg can catch up.  This patch allows pagefaults to go 
> through the typical out_of_memory() interface so we don't have any 
> ambiguity in how situations such as panic_on_oom are handled or whether 
> current's memcg recently called the oom killer and it PREVENTS needlessly 
> killing tasks when a parallel oom condition exists but a task hasn't been 
> killed yet.
> 
> mem_cgroup_oom_called() is completely and utterly BOGUS since we can 
> detect the EXACT same conditions via a tasklist scan filtered on current's 
> memcg by looking for parallel oom kills, which out_of_memory() does, and 
> locking the zonelists to prevent racing in calling out_of_memory() and 
> actually setting the TIF_MEMDIE bit for the selected task.
> 
> You said earlier that you would wait for the next mmotm to be released and 
> could easily rebase on my patchset and now you're stopping development 
> entirely and allowing tasks to be needlessly oom killed via the old 
> pagefault_out_of_memory() which does not synchronize on parallel oom 
> kills.
> 
> I'm completely sure that you'll remove mem_cgroup_oom_called() entirely 
> yourself since it doesn't do anything but encourage VM_FAULT_OOM loops 
> itself, so please come up with some constructive criticism of my patch 
> that Andrew can use to decide whether to merge my work or not instead of 
> thinking you're the only one that can touch memcg.
> 

Your patch seems unlikely to go in earlier than mine.
And please avoid zone-based locking; memcg requires memcg-based locking.
I pointed this out before, but you ignored it, as usual.
That's why I said I'll do it myself.

Bye,
-Kame


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [patch -mm v2 04/10] oom: remove special handling for pagefault ooms
  2010-02-26 23:53 ` [patch -mm v2 04/10] oom: remove special handling for pagefault ooms David Rientjes
  2010-03-01  1:12   ` KAMEZAWA Hiroyuki
  2010-03-01  5:23   ` Balbir Singh
@ 2010-03-02  2:21   ` KAMEZAWA Hiroyuki
  2010-03-02 23:59     ` David Rientjes
  2 siblings, 1 reply; 36+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-03-02  2:21 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Rik van Riel, Nick Piggin, Balbir Singh,
	KOSAKI Motohiro, linux-mm

On Fri, 26 Feb 2010 15:53:11 -0800 (PST)
David Rientjes <rientjes@google.com> wrote:

> It is possible to remove the special pagefault oom handler by simply
> oom locking all system zones and then calling directly into
> out_of_memory().
> 
> All populated zones must have ZONE_OOM_LOCKED set, otherwise there is a
> parallel oom killing in progress that will lead to eventual memory
> freeing so it's not necessary to needlessly kill another task.  The
> context in which the pagefault is allocating memory is unknown to the oom
> killer, so this is done on a system-wide level.
> 
> If a task has already been oom killed and hasn't fully exited yet, this
> will be a no-op since select_bad_process() recognizes tasks across the
> system with TIF_MEMDIE set.
> 
> The special handling to determine whether a parallel memcg is currently
> oom is removed since we can detect future memory freeing with TIF_MEMDIE.
> The memcg has already reached its memory limit, so it will still need to
> kill a task regardless of the pagefault oom.
> 
> Signed-off-by: David Rientjes <rientjes@google.com>

Reviewed with a refreshed mind.

1. mem_cgroup_oom_called() should be removed. OK.
2. I think mem_cgroup should not return -ENOMEM when charging.
   Then, there is nothing complicated in pagefault_out_of_memory().
   I'll add such changes soon.
3. This patch includes too many things. Please divide it.
   At least, please put the memcg part out of this patch.

Maybe my own patch will remove mem_cgroup_oom_called() and change
memcg not to return -ENOMEM in the page fault path.
You can ignore all the memcg parts. Then, there will be no conflicting hunks.

Bye.
-Kame





> ---
>  include/linux/memcontrol.h |    6 ---
>  mm/memcontrol.c            |   35 +---------------
>  mm/oom_kill.c              |   97 ++++++++++++++++++++++++++------------------
>  3 files changed, 58 insertions(+), 80 deletions(-)
> 
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -124,7 +124,6 @@ static inline bool mem_cgroup_disabled(void)
>  	return false;
>  }
>  
> -extern bool mem_cgroup_oom_called(struct task_struct *task);
>  void mem_cgroup_update_file_mapped(struct page *page, int val);
>  unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
>  						gfp_t gfp_mask, int nid,
> @@ -258,11 +257,6 @@ static inline bool mem_cgroup_disabled(void)
>  	return true;
>  }
>  
> -static inline bool mem_cgroup_oom_called(struct task_struct *task)
> -{
> -	return false;
> -}
> -
>  static inline int
>  mem_cgroup_inactive_anon_is_low(struct mem_cgroup *memcg)
>  {
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -217,7 +217,6 @@ struct mem_cgroup {
>  	 * Should the accounting and control be hierarchical, per subtree?
>  	 */
>  	bool use_hierarchy;
> -	unsigned long	last_oom_jiffies;
>  	atomic_t	refcnt;
>  
>  	unsigned int	swappiness;
> @@ -1205,34 +1204,6 @@ static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_mem,
>  	return total;
>  }
>  
> -bool mem_cgroup_oom_called(struct task_struct *task)
> -{
> -	bool ret = false;
> -	struct mem_cgroup *mem;
> -	struct mm_struct *mm;
> -
> -	rcu_read_lock();
> -	mm = task->mm;
> -	if (!mm)
> -		mm = &init_mm;
> -	mem = mem_cgroup_from_task(rcu_dereference(mm->owner));
> -	if (mem && time_before(jiffies, mem->last_oom_jiffies + HZ/10))
> -		ret = true;
> -	rcu_read_unlock();
> -	return ret;
> -}
> -
> -static int record_last_oom_cb(struct mem_cgroup *mem, void *data)
> -{
> -	mem->last_oom_jiffies = jiffies;
> -	return 0;
> -}
> -
> -static void record_last_oom(struct mem_cgroup *mem)
> -{
> -	mem_cgroup_walk_tree(mem, NULL, record_last_oom_cb);
> -}
> -
>  /*
>   * Currently used to update mapped file statistics, but the routine can be
>   * generalized to update other statistics as well.
> @@ -1484,10 +1455,8 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm,
>  			continue;
>  
>  		if (!nr_retries--) {
> -			if (oom) {
> +			if (oom)
>  				mem_cgroup_out_of_memory(mem_over_limit, gfp_mask);
> -				record_last_oom(mem_over_limit);
> -			}
>  			goto nomem;
>  		}
>  	}
> @@ -2284,8 +2253,6 @@ void mem_cgroup_end_migration(struct mem_cgroup *mem,
>  
>  /*
>   * A call to try to shrink memory usage on charge failure at shmem's swapin.
> - * Calling hierarchical_reclaim is not enough because we should update
> - * last_oom_jiffies to prevent pagefault_out_of_memory from invoking global OOM.
>   * Moreover considering hierarchy, we should reclaim from the mem_over_limit,
>   * not from the memcg which this page would be charged to.
>   * try_charge_swapin does all of these works properly.
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -580,6 +580,44 @@ void clear_zonelist_oom(struct zonelist *zonelist, gfp_t gfp_mask)
>  }
>  
>  /*
> + * Try to acquire the oom killer lock for all system zones.  Returns zero if a
> + * parallel oom killing is taking place, otherwise locks all zones and returns
> + * non-zero.
> + */
> +static int try_set_system_oom(void)
> +{
> +	struct zone *zone;
> +	int ret = 1;
> +
> +	spin_lock(&zone_scan_lock);
> +	for_each_populated_zone(zone)
> +		if (zone_is_oom_locked(zone)) {
> +			ret = 0;
> +			goto out;
> +		}
> +	for_each_populated_zone(zone)
> +		zone_set_flag(zone, ZONE_OOM_LOCKED);
> +out:
> +	spin_unlock(&zone_scan_lock);
> +	return ret;
> +}
> +
> +/*
> + * Clears ZONE_OOM_LOCKED for all system zones so that failed allocation
> + * attempts or page faults may now recall the oom killer, if necessary.
> + */
> +static void clear_system_oom(void)
> +{
> +	struct zone *zone;
> +
> +	spin_lock(&zone_scan_lock);
> +	for_each_populated_zone(zone)
> +		zone_clear_flag(zone, ZONE_OOM_LOCKED);
> +	spin_unlock(&zone_scan_lock);
> +}
> +
> +
> +/*
>   * Must be called with tasklist_lock held for read.
>   */
>  static void __out_of_memory(gfp_t gfp_mask, int order,
> @@ -614,46 +652,9 @@ retry:
>  		goto retry;
>  }
>  
> -/*
> - * pagefault handler calls into here because it is out of memory but
> - * doesn't know exactly how or why.
> - */
> -void pagefault_out_of_memory(void)
> -{
> -	unsigned long freed = 0;
> -
> -	blocking_notifier_call_chain(&oom_notify_list, 0, &freed);
> -	if (freed > 0)
> -		/* Got some memory back in the last second. */
> -		return;
> -
> -	/*
> -	 * If this is from memcg, oom-killer is already invoked.
> -	 * and not worth to go system-wide-oom.
> -	 */
> -	if (mem_cgroup_oom_called(current))
> -		goto rest_and_return;
> -
> -	if (sysctl_panic_on_oom)
> -		panic("out of memory from page fault. panic_on_oom is selected.\n");
> -
> -	read_lock(&tasklist_lock);
> -	/* unknown gfp_mask and order */
> -	__out_of_memory(0, 0, CONSTRAINT_NONE, NULL);
> -	read_unlock(&tasklist_lock);
> -
> -	/*
> -	 * Give "p" a good chance of killing itself before we
> -	 * retry to allocate memory.
> -	 */
> -rest_and_return:
> -	if (!test_thread_flag(TIF_MEMDIE))
> -		schedule_timeout_uninterruptible(1);
> -}
> -
>  /**
>   * out_of_memory - kill the "best" process when we run out of memory
> - * @zonelist: zonelist pointer
> + * @zonelist: zonelist pointer passed to page allocator
>   * @gfp_mask: memory allocation flags
>   * @order: amount of memory being requested as a power of 2
>   * @nodemask: nodemask passed to page allocator
> @@ -667,7 +668,7 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
>  		int order, nodemask_t *nodemask)
>  {
>  	unsigned long freed = 0;
> -	enum oom_constraint constraint;
> +	enum oom_constraint constraint = CONSTRAINT_NONE;
>  
>  	blocking_notifier_call_chain(&oom_notify_list, 0, &freed);
>  	if (freed > 0)
> @@ -683,7 +684,8 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
>  	 * Check if there were limitations on the allocation (only relevant for
>  	 * NUMA) that may require different handling.
>  	 */
> -	constraint = constrained_alloc(zonelist, gfp_mask, nodemask);
> +	if (zonelist)
> +		constraint = constrained_alloc(zonelist, gfp_mask, nodemask);
>  	read_lock(&tasklist_lock);
>  	if (unlikely(sysctl_panic_on_oom)) {
>  		/*
> @@ -693,6 +695,7 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
>  		 */
>  		if (constraint == CONSTRAINT_NONE) {
>  			dump_header(NULL, gfp_mask, order, NULL);
> +			read_unlock(&tasklist_lock);
>  			panic("Out of memory: panic_on_oom is enabled\n");
>  		}
>  	}
> @@ -706,3 +709,17 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
>  	if (!test_thread_flag(TIF_MEMDIE))
>  		schedule_timeout_uninterruptible(1);
>  }
> +
> +/*
> + * The pagefault handler calls here because it is out of memory, so kill a
> + * memory-hogging task.  If a populated zone has ZONE_OOM_LOCKED set, a parallel
> + * oom killing is already in progress so do nothing.  If a task is found with
> + * TIF_MEMDIE set, it has been killed so do nothing and allow it to exit.
> + */
> +void pagefault_out_of_memory(void)
> +{
> +	if (!try_set_system_oom())
> +		return;
> +	out_of_memory(NULL, 0, 0, NULL);
> +	clear_system_oom();
> +}
> 

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [patch -mm v2 02/10] oom: sacrifice child with highest badness score for parent
  2010-02-26 23:53 ` [patch -mm v2 02/10] oom: sacrifice child with highest badness score for parent David Rientjes
@ 2010-03-02  4:54   ` Balbir Singh
  0 siblings, 0 replies; 36+ messages in thread
From: Balbir Singh @ 2010-03-02  4:54 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Rik van Riel, Nick Piggin, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro, linux-mm

* David Rientjes <rientjes@google.com> [2010-02-26 15:53:05]:

> When a task is chosen for oom kill, the oom killer first attempts to
> sacrifice a child not sharing its parent's memory instead.
> Unfortunately, this often kills in a seemingly random fashion based on
> the ordering of the selected task's child list.  Additionally, it is not
> guaranteed at all to free a large amount of memory that we need to
> prevent additional oom killing in the very near future.
> 
> Instead, we now only attempt to sacrifice the worst child not sharing its
> parent's memory, if one exists.  The worst child is indicated with the
> highest badness() score.  This serves two advantages: we kill a
> memory-hogging task more often, and we allow the configurable
> /proc/pid/oom_adj value to be considered as a factor in which child to
> kill.
> 
> Reviewers may observe that the previous implementation would iterate
> through the children and attempt to kill each until one was successful
> and then the parent if none were found while the new code simply kills
> the most memory-hogging task or the parent.  Note that the only time
> oom_kill_task() fails, however, is when a child does not have an mm or
> has a /proc/pid/oom_adj of OOM_DISABLE.  badness() returns 0 for both
> cases, so the final oom_kill_task() will always succeed.
> 
> Acked-by: Rik van Riel <riel@redhat.com>
> Acked-by: Nick Piggin <npiggin@suse.de>
> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> Signed-off-by: David Rientjes <rientjes@google.com>


You've got too many of these already; here goes:

Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
 
-- 
	Three Cheers,
	Balbir

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [patch -mm v2 01/10] oom: filter tasks not sharing the same cpuset
  2010-02-26 23:53 ` [patch -mm v2 01/10] oom: filter tasks not sharing the same cpuset David Rientjes
@ 2010-03-02  4:54   ` Balbir Singh
  0 siblings, 0 replies; 36+ messages in thread
From: Balbir Singh @ 2010-03-02  4:54 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Rik van Riel, Nick Piggin, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro, linux-mm

* David Rientjes <rientjes@google.com> [2010-02-26 15:53:02]:

> Tasks that do not share the same set of allowed nodes with the task that
> triggered the oom should not be considered as candidates for oom kill.
> 
> Tasks in other cpusets with a disjoint set of mems would be unfairly
> penalized otherwise because of oom conditions elsewhere; an extreme
> example could unfairly kill all other applications on the system if a
> single task in a user's cpuset sets itself to OOM_DISABLE and then uses
> more memory than allowed.
> 
> Killing tasks outside of current's cpuset rarely would free memory for
> current anyway.  To use a sane heuristic, we must ensure that killing a
> task would likely free memory for current and avoid needlessly killing
> others at all costs just because their potential memory freeing is
> unknown.  It is better to kill current than another task needlessly.
> 
> Acked-by: Rik van Riel <riel@redhat.com>
> Acked-by: Nick Piggin <npiggin@suse.de>
> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> Signed-off-by: David Rientjes <rientjes@google.com>


Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
 

-- 
	Three Cheers,
	Balbir

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [patch -mm v2 04/10] oom: remove special handling for pagefault ooms
  2010-03-01 23:59       ` KAMEZAWA Hiroyuki
@ 2010-03-02 23:55         ` David Rientjes
  2010-03-03  0:24           ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 36+ messages in thread
From: David Rientjes @ 2010-03-02 23:55 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Andrew Morton, Rik van Riel, Nick Piggin, Balbir Singh,
	KOSAKI Motohiro, linux-mm

On Tue, 2 Mar 2010, KAMEZAWA Hiroyuki wrote:

> > Your nack is completely unjustified, we're not going to stop oom killer 
> > development so memcg can catch up.  This patch allows pagefaults to go 
> > through the typical out_of_memory() interface so we don't have any 
> > ambiguity in how situations such as panic_on_oom are handled or whether 
> > current's memcg recently called the oom killer and it PREVENTS needlessly 
> > killing tasks when a parallel oom condition exists but a task hasn't been 
> > killed yet.
> > 
> > mem_cgroup_oom_called() is completely and utterly BOGUS since we can 
> > detect the EXACT same conditions via a tasklist scan filtered on current's 
> > memcg by looking for parallel oom kills, which out_of_memory() does, and 
> > locking the zonelists to prevent racing in calling out_of_memory() and 
> > actually setting the TIF_MEMDIE bit for the selected task.
> > 
> > You said earlier that you would wait for the next mmotm to be released and 
> > could easily rebase on my patchset and now you're stopping development 
> > entirely and allowing tasks to be needlessly oom killed via the old 
> > pagefault_out_of_memory() which does not synchronize on parallel oom 
> > kills.
> > 
> > I'm completely sure that you'll remove mem_cgroup_oom_called() entirely 
> > yourself since it doesn't do anything but encourage VM_FAULT_OOM loops 
> > itself, so please come up with some constructive criticism of my patch 
> > that Andrew can use to decide whether to merge my work or not instead of 
> > thinking you're the only one that can touch memcg.
> > 
> 
> Your patch seems unlikely to go in earlier than mine.

Your latest patch, "memcg: fix oom killer behavior v2" at 
http://marc.info/?l=linux-kernel&m=126750597522101 removes the same code 
that this patch removes from memcg.  You're convoluting the issue by saying 
they have any dependency on each other at all, and that's why it's 
extremely frustrating for you to go around nacking other people's work 
when you really don't understand what it does.  You could trivially rebase 
on my patch at any time and I could trivially rebase on yours; it's that 
simple.  Your nack, on the grounds that you're going to rewrite how memcg 
handles conditions when it reaches its limit, is completely irrelevant to my 
patchset, so if you're going to review a patch, I ask that you keep 
irrelevant discussion out of it, because it just makes things 
unnecessarily confusing.

> And please avoid zone-based locking; memcg requires memcg-based locking.

Trying to set ZONE_OOM_LOCKED for all populated zones is fundamentally the 
correct thing to do on VM_FAULT_OOM when you don't know the context in 
which we're trying to allocate pages.  The _only_ thing that does is close 
the race where another thread has called out_of_memory(), which is likely 
under such conditions, but the oom killer hasn't yet killed a task, so we 
can't detect the TIF_MEMDIE bit during the tasklist scan.  Memcg is 
completely irrelevant with respect to this zone locking, and that's why I 
didn't touch mem_cgroup_out_of_memory().  Did you seriously even read this 
patch?
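
For reference, this mirrors the serialization the page allocator already
does per zonelist (a rough sketch of the 2.6.33-era __alloc_pages_may_oom()
flow, heavily abbreviated; the allocation retry details are elided):

	static inline struct page *
	__alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
		struct zonelist *zonelist, enum zone_type high_zoneidx,
		nodemask_t *nodemask, struct zone *preferred_zone,
		int migratetype)
	{
		struct page *page = NULL;

		/* only one thread per zonelist may call the oom killer */
		if (!try_set_zone_oom(zonelist, gfp_mask)) {
			schedule_timeout_uninterruptible(1);
			return NULL;
		}

		/* [one last high-watermark allocation attempt elided] */

		out_of_memory(zonelist, gfp_mask, order, nodemask);
		clear_zonelist_oom(zonelist, gfp_mask);
		return page;
	}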

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [patch -mm v2 04/10] oom: remove special handling for pagefault ooms
  2010-03-02  2:21   ` KAMEZAWA Hiroyuki
@ 2010-03-02 23:59     ` David Rientjes
  0 siblings, 0 replies; 36+ messages in thread
From: David Rientjes @ 2010-03-02 23:59 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Andrew Morton, Rik van Riel, Nick Piggin, Balbir Singh,
	KOSAKI Motohiro, linux-mm

On Tue, 2 Mar 2010, KAMEZAWA Hiroyuki wrote:

> Reviewed with a refreshed mind.
> 
> 1. mem_cgroup_oom_called() should be removed. OK.
> 2. I think mem_cgroup should not return -ENOMEM when charging.
>    Then, there is nothing complicated in pagefault_out_of_memory().
>    I'll add such changes soon.

pagefault_out_of_memory() needs to do the zone locking to prevent 
needlessly killing tasks when VM_FAULT_OOM races with another cpu trying 
to allocate pages.  This has nothing to do with memcg.

> 3. This patch includes too many things. Please divide it.
>    At least, please put the memcg part out of this patch.
> 

No, it doesn't.  The patch rewrites pagefault_out_of_memory() and that 
includes removing a call to mem_cgroup_oom_called(), which you just 
agreed to removing.  I'm not going to leave behind unused code anywhere in 
the kernel that you want to remove yourself anyway.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [patch -mm v2 04/10] oom: remove special handling for pagefault ooms
  2010-03-01 23:55       ` KAMEZAWA Hiroyuki
@ 2010-03-03  0:01         ` David Rientjes
  2010-03-03  0:22           ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 36+ messages in thread
From: David Rientjes @ 2010-03-03  0:01 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Balbir Singh, Andrew Morton, Rik van Riel, Nick Piggin,
	KOSAKI Motohiro, linux-mm

On Tue, 2 Mar 2010, KAMEZAWA Hiroyuki wrote:

> > Kame said earlier it would be no problem to rebase his memcg oom work on 
> > mmotm if my patches were merged.
> > 
> 
> But I also said this patch causes a regression.

This patch causes a regression???  You never said that in any of your 
reviews, and I have no idea what you're talking about; this patch simply 
cleans up the code and closes a race where VM_FAULT_OOM could needlessly 
kill tasks in parallel oom conditions.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [patch -mm v2 04/10] oom: remove special handling for pagefault ooms
  2010-03-03  0:01         ` David Rientjes
@ 2010-03-03  0:22           ` KAMEZAWA Hiroyuki
  2010-03-03  0:38             ` David Rientjes
  0 siblings, 1 reply; 36+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-03-03  0:22 UTC (permalink / raw)
  To: David Rientjes
  Cc: Balbir Singh, Andrew Morton, Rik van Riel, Nick Piggin,
	KOSAKI Motohiro, linux-mm

On Tue, 2 Mar 2010 16:01:41 -0800 (PST)
David Rientjes <rientjes@google.com> wrote:

> On Tue, 2 Mar 2010, KAMEZAWA Hiroyuki wrote:
> 
> > > Kame said earlier it would be no problem to rebase his memcg oom work on 
> > > mmotm if my patches were merged.
> > > 
> > 
> > But I also said this patch causes a regression.
> 
> This patch causes a regression???  You never said that in any of your 
> reviews, and I have no idea what you're talking about; this patch simply 
> cleans up the code and closes a race where VM_FAULT_OOM could needlessly 
> kill tasks in parallel oom conditions.
> 
try_set_system_oom() is not called in the mem_cgroup_out_of_memory() path.
Then, we oom kill twice.


Thanks,
-Kame

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [patch -mm v2 04/10] oom: remove special handling for pagefault ooms
  2010-03-02 23:55         ` David Rientjes
@ 2010-03-03  0:24           ` KAMEZAWA Hiroyuki
  2010-03-03  0:44             ` David Rientjes
  0 siblings, 1 reply; 36+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-03-03  0:24 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Rik van Riel, Nick Piggin, Balbir Singh,
	KOSAKI Motohiro, linux-mm

On Tue, 2 Mar 2010 15:55:47 -0800 (PST)
David Rientjes <rientjes@google.com> wrote:

> On Tue, 2 Mar 2010, KAMEZAWA Hiroyuki wrote:
> 
> > > Your nack is completely unjustified, we're not going to stop oom killer 
> > > development so memcg can catch up.  This patch allows pagefaults to go 
> > > through the typical out_of_memory() interface so we don't have any 
> > > ambiguity in how situations such as panic_on_oom are handled or whether 
> > > current's memcg recently called the oom killer and it PREVENTS needlessly 
> > > killing tasks when a parallel oom condition exists but a task hasn't been 
> > > killed yet.
> > > 
> > > mem_cgroup_oom_called() is completely and utterly BOGUS since we can 
> > > detect the EXACT same conditions via a tasklist scan filtered on current's 
> > > memcg by looking for parallel oom kills, which out_of_memory() does, and 
> > > locking the zonelists to prevent racing in calling out_of_memory() and 
> > > actually setting the TIF_MEMDIE bit for the selected task.
> > > 
> > > You said earlier that you would wait for the next mmotm to be released and 
> > > could easily rebase on my patchset and now you're stopping development 
> > > entirely and allowing tasks to be needlessly oom killed via the old 
> > > pagefault_out_of_memory() which does not synchronize on parallel oom 
> > > kills.
> > > 
> > > I'm completely sure that you'll remove mem_cgroup_oom_called() entirely 
> > > yourself since it doesn't do anything but encourage VM_FAULT_OOM loops 
> > > itself, so please come up with some constructive criticism of my patch 
> > > that Andrew can use to decide whether to merge my work or not instead of 
> > > thinking you're the only one that can touch memcg.
> > > 
> > 
> > Your patch seems unlikely to go in earlier than mine.
> 
> Your latest patch, "memcg: fix oom killer behavior v2" at 
> http://marc.info/?l=linux-kernel&m=126750597522101 removes the same code 
> that this patch removes from memcg.  You're convoluting the issue by saying 
> they have any dependency on each other at all, and that's why it's 
> extremely frustrating for you to go around nacking other people's work 
> when you really don't understand what it does.  You could trivially rebase 
> on my patch at any time and I could trivially rebase on yours; it's that 
> simple.
Ok.


 
> > And please avoid zone-based locking; memcg requires memcg-based locking.
> 
> Trying to set ZONE_OOM_LOCKED for all populated zones is fundamentally the 
> correct thing to do on VM_FAULT_OOM when you don't know the context in 
> which we're trying to allocate pages.  The _only_ thing that does is close 
> the race where another thread has called out_of_memory(), which is likely 
> under such conditions, but the oom killer hasn't yet killed a task, so we 
> can't detect the TIF_MEMDIE bit during the tasklist scan.  Memcg is 
> completely irrelevant with respect to this zone locking, and that's why I 
> didn't touch mem_cgroup_out_of_memory().  Did you seriously even read this 
> patch?
> 

Then, memcg will see a second oom kill.


Thanks,
-Kame




^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [patch -mm v2 04/10] oom: remove special handling for pagefault ooms
  2010-03-03  0:22           ` KAMEZAWA Hiroyuki
@ 2010-03-03  0:38             ` David Rientjes
  2010-03-03  0:44               ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 36+ messages in thread
From: David Rientjes @ 2010-03-03  0:38 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Balbir Singh, Andrew Morton, Rik van Riel, Nick Piggin,
	KOSAKI Motohiro, linux-mm

On Wed, 3 Mar 2010, KAMEZAWA Hiroyuki wrote:

> > This patch causes a regression???  You never said that in any of your 
> > reviews and I have no idea what you're talking about, this patch simply 
> > cleans up the code and closes a race where VM_FAULT_OOM could needlessly 
> > kill tasks in parallel oom conditions.
> > 
> try_set_system_oom() is not called in the mem_cgroup_out_of_memory() path.
> Then, we oom kill twice.
> 

So how does this cause a regression AT ALL?  Calling try_set_system_oom() 
in pagefault_out_of_memory() protects against concurrent out_of_memory() 
from the page allocator before a task is actually killed, so this patch 
closes that race entirely.  It most certainly does not introduce a 
regression.

You said earlier that mem_cgroup_out_of_memory() need not serialize 
against parallel oom killings because in that scenario we must kill 
something anyway; memory freeing from other ooms won't help if a memcg is 
over its limit.  So, yeah, we may kill two tasks if both the system and a 
memcg are oom in parallel and neither has actually killed a task yet, but 
that's much more justifiable since we shouldn't rely on a memcg oom to free 
memory for the entire system.

So, again, there's absolutely no regression introduced by this patch.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [patch -mm v2 04/10] oom: remove special handling for pagefault ooms
  2010-03-03  0:24           ` KAMEZAWA Hiroyuki
@ 2010-03-03  0:44             ` David Rientjes
  0 siblings, 0 replies; 36+ messages in thread
From: David Rientjes @ 2010-03-03  0:44 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Andrew Morton, Rik van Riel, Nick Piggin, Balbir Singh,
	KOSAKI Motohiro, linux-mm

On Wed, 3 Mar 2010, KAMEZAWA Hiroyuki wrote:

> > Trying to set ZONE_OOM_LOCKED for all populated zones is fundamentally the 
> > correct thing to do on VM_FAULT_OOM when you don't know the context in 
> > which we're trying to allocate pages.  The _only_ thing that does is close 
> > the race where another thread has called out_of_memory(), which is likely 
> > under such conditions, but the oom killer hasn't yet killed a task, so we 
> > can't detect the TIF_MEMDIE bit during the tasklist scan.  Memcg is 
> > completely irrelevant with respect to this zone locking, and that's why I 
> > didn't touch mem_cgroup_out_of_memory().  Did you seriously even read this 
> > patch?
> > 
> 
> Then, memcg will see a second oom kill.
> 

Sigh.  Memcg will only kill a second task if the parallel oom hasn't 
killed anything yet or the parallel oom kills a task that is not in the 
memcg, and that's because memcg needs to enforce its limit by killing 
something; freeing memory for the system won't help that.  We may kill 
another task for the system-wide oom after the memcg has already killed a 
task if the system-wide oom is iterating through the tasklist and the 
memcg kill sets TIF_MEMDIE too late.  That's independent of this patch; we 
can't go and rescind an oom kill.  Labeling that as a regression is just 
not truthful; it's outside the scope of closing the VM_FAULT_OOM race.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [patch -mm v2 04/10] oom: remove special handling for pagefault ooms
  2010-03-03  0:38             ` David Rientjes
@ 2010-03-03  0:44               ` KAMEZAWA Hiroyuki
  2010-03-03  0:53                 ` David Rientjes
  0 siblings, 1 reply; 36+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-03-03  0:44 UTC (permalink / raw)
  To: David Rientjes
  Cc: Balbir Singh, Andrew Morton, Rik van Riel, Nick Piggin,
	KOSAKI Motohiro, linux-mm

On Tue, 2 Mar 2010 16:38:16 -0800 (PST)
David Rientjes <rientjes@google.com> wrote:

> On Wed, 3 Mar 2010, KAMEZAWA Hiroyuki wrote:
> 
> > > This patch causes a regression???  You never said that in any of your 
> > > reviews, and I have no idea what you're talking about; this patch simply 
> > > cleans up the code and closes a race where VM_FAULT_OOM could needlessly 
> > > kill tasks in parallel oom conditions.
> > > 
> > try_set_system_oom() is not called in the mem_cgroup_out_of_memory() path.
> > Then, we oom kill twice.
> > 
> 
> So how does this cause a regression AT ALL?  Calling try_set_system_oom() 
> in pagefault_out_of_memory() protects against concurrent out_of_memory() 
> from the page allocator before a task is actually killed, so this patch 
> closes that race entirely.  It most certainly does not introduce a 
> regression.
> 
> You said earlier that mem_cgroup_out_of_memory() need not serialize 
> against parallel oom killings because in that scenario we must kill 
> something anyway; memory freeing from other ooms won't help if a memcg is 
> over its limit.  So, yeah, we may kill two tasks if both the system and a 
> memcg are oom in parallel and neither has actually killed a task yet, but 
> that's much more justifiable since we shouldn't rely on a memcg oom to free 
> memory for the entire system.
> 
> So, again, there's absolutely no regression introduced by this patch.
> 
I'm sorry if I'm missing something.

mem_cgroup_out_of_memory() kills a task and returns VM_FAULT_OOM; then
pagefault_out_of_memory() kills another task,
and causes a panic if panic_on_oom=1.

Then, if we remove mem_cgroup_oom_called(), we have to make sure that
memcg doesn't cause VM_FAULT_OOM.
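
That is, the path being described (sketch):

	handle_mm_fault()
	  -> memcg charge fails, mem_cgroup_out_of_memory() kills task A
	  -> charge returns -ENOMEM, the fault returns VM_FAULT_OOM
	arch fault handler sees VM_FAULT_OOM
	  -> pagefault_out_of_memory()
	  -> out_of_memory(NULL, 0, 0, NULL) kills a second task B,
	     or panics immediately if panic_on_oom=1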

Thanks,
-Kame



^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [patch -mm v2 04/10] oom: remove special handling for pagefault ooms
  2010-03-03  0:44               ` KAMEZAWA Hiroyuki
@ 2010-03-03  0:53                 ` David Rientjes
  2010-03-03  0:58                   ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 36+ messages in thread
From: David Rientjes @ 2010-03-03  0:53 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Balbir Singh, Andrew Morton, Rik van Riel, Nick Piggin,
	KOSAKI Motohiro, linux-mm

On Wed, 3 Mar 2010, KAMEZAWA Hiroyuki wrote:

> mem_cgroup_out_of_memory() kills a task and returns VM_FAULT_OOM; then
> pagefault_out_of_memory() kills another task,
> and causes a panic if panic_on_oom=1.
> 

If mem_cgroup_out_of_memory() has returned, then it has already killed a 
task that will have TIF_MEMDIE set, and therefore the VM_FAULT_OOM oom 
becomes a no-op.  If the oom-killed task subsequently returns VM_FAULT_OOM, 
we had better panic, because we've fully depleted memory reserves and no 
future memory freeing is guaranteed.
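
The no-op comes from the existing TIF_MEMDIE check in select_bad_process()
(2.6.33-era code, abbreviated):

	/* from select_bad_process(), scanning the tasklist */
	for_each_process(p) {
		/*
		 * The previously killed task already has access to memory
		 * reserves and is on its way out; abort the scan instead
		 * of picking another victim.
		 */
		if (test_tsk_thread_flag(p, TIF_MEMDIE))
			return ERR_PTR(-1UL);
		/* [badness scoring of remaining candidates elided] */
	}

The out_of_memory() path treats the ERR_PTR(-1UL) return as "a kill is
already in flight" and returns without killing anything else.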

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [patch -mm v2 04/10] oom: remove special handling for pagefault ooms
  2010-03-03  0:53                 ` David Rientjes
@ 2010-03-03  0:58                   ` KAMEZAWA Hiroyuki
  2010-03-03 23:27                     ` David Rientjes
  0 siblings, 1 reply; 36+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-03-03  0:58 UTC (permalink / raw)
  To: David Rientjes
  Cc: Balbir Singh, Andrew Morton, Rik van Riel, Nick Piggin,
	KOSAKI Motohiro, linux-mm

On Tue, 2 Mar 2010 16:53:00 -0800 (PST)
David Rientjes <rientjes@google.com> wrote:

> On Wed, 3 Mar 2010, KAMEZAWA Hiroyuki wrote:
> 
> > mem_cgroup_out_of_memory() kills a task and returns VM_FAULT_OOM; then
> > pagefault_out_of_memory() kills another task,
> > and causes a panic if panic_on_oom=1.
> > 
> 
> If mem_cgroup_out_of_memory() has returned, then it has already killed a 
> task that will have TIF_MEMDIE set, and therefore the VM_FAULT_OOM oom 
> becomes a no-op.  If the oom-killed task subsequently returns VM_FAULT_OOM, 
> we had better panic, because we've fully depleted memory reserves and no 
> future memory freeing is guaranteed.
> 
In patches 01-03, you didn't modify the panic_on_oom implementation.
And in this patch, you didn't modify the return code of memcg's charge path.
It still returns -ENOMEM.

Then, VM_FAULT_OOM is returned and pagefault_out_of_memory() calls this
and hits this:

       case CONSTRAINT_NONE:
                if (sysctl_panic_on_oom) {
                        dump_header(NULL, gfp_mask, order, NULL);
                        panic("out of memory. panic_on_oom is selected\n");
                }

The system will panic.  The mem_cgroup_oom_called() hook is for avoiding this;
a memcg oom doesn't mean a memory shortage, it just means the memcg hit its limit.

Thanks,
-Kame

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [patch -mm v2 04/10] oom: remove special handling for pagefault ooms
  2010-03-03  0:58                   ` KAMEZAWA Hiroyuki
@ 2010-03-03 23:27                     ` David Rientjes
  2010-03-04  3:59                       ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 36+ messages in thread
From: David Rientjes @ 2010-03-03 23:27 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Balbir Singh, Andrew Morton, Rik van Riel, Nick Piggin,
	KOSAKI Motohiro, linux-mm

On Wed, 3 Mar 2010, KAMEZAWA Hiroyuki wrote:

> In patches 01-03, you didn't modify the panic_on_oom implementation.
> And in this patch, you didn't modify the return code of memcg's charge path.
> It still returns -ENOMEM.
> 
> Then, VM_FAULT_OOM is returned and pagefault_out_of_memory() calls this
> and hits this:
> 
>        case CONSTRAINT_NONE:
>                 if (sysctl_panic_on_oom) {
>                         dump_header(NULL, gfp_mask, order, NULL);
>                         panic("out of memory. panic_on_oom is selected\n");
>                 }
> 
> The system will panic.  The mem_cgroup_oom_called() hook is for avoiding this;
> a memcg oom doesn't mean a memory shortage, it just means the memcg hit its limit.
> 

And this is fixed by memcg-fix-oom-kill-behavior-v3.patch in -mm, right?

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [patch -mm v2 04/10] oom: remove special handling for pagefault ooms
  2010-03-03 23:27                     ` David Rientjes
@ 2010-03-04  3:59                       ` KAMEZAWA Hiroyuki
  2010-03-04  6:50                         ` David Rientjes
  0 siblings, 1 reply; 36+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-03-04  3:59 UTC (permalink / raw)
  To: David Rientjes
  Cc: Balbir Singh, Andrew Morton, Rik van Riel, Nick Piggin,
	KOSAKI Motohiro, linux-mm

On Wed, 3 Mar 2010 15:27:53 -0800 (PST)
David Rientjes <rientjes@google.com> wrote:

> On Wed, 3 Mar 2010, KAMEZAWA Hiroyuki wrote:
> 
> > In patches 01-03, you didn't modify the panic_on_oom implementation.
> > And in this patch, you didn't modify the return code of memcg's charge path.
> > It still returns -ENOMEM.
> > 
> > Then, VM_FAULT_OOM is returned and pagefault_out_of_memory() calls this
> > and hits this:
> > 
> >        case CONSTRAINT_NONE:
> >                 if (sysctl_panic_on_oom) {
> >                         dump_header(NULL, gfp_mask, order, NULL);
> >                         panic("out of memory. panic_on_oom is selected\n");
> >                 }
> > 
> > The system will panic.  The mem_cgroup_oom_called() hook is for avoiding this;
> > a memcg oom doesn't mean a memory shortage, it just means the memcg hit its limit.
> > 
> 
> And this is fixed by memcg-fix-oom-kill-behavior-v3.patch in -mm, right?
> 
yes.

Thanks,
-Kame

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [patch -mm v2 04/10] oom: remove special handling for pagefault ooms
  2010-03-04  3:59                       ` KAMEZAWA Hiroyuki
@ 2010-03-04  6:50                         ` David Rientjes
  2010-03-04  7:00                           ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 36+ messages in thread
From: David Rientjes @ 2010-03-04  6:50 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Balbir Singh, Andrew Morton, Rik van Riel, Nick Piggin,
	KOSAKI Motohiro, linux-mm

On Thu, 4 Mar 2010, KAMEZAWA Hiroyuki wrote:

> > And this is fixed by memcg-fix-oom-kill-behavior-v3.patch in -mm, right?
> > 
> yes.
> 

Good.  This patch can easily be rebased on top of the next mmotm release, 
then, as I mentioned before.  Do you have time to review the actual oom 
killer part of this patch?

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [patch -mm v2 04/10] oom: remove special handling for pagefault ooms
  2010-03-04  6:50                         ` David Rientjes
@ 2010-03-04  7:00                           ` KAMEZAWA Hiroyuki
  2010-03-04  9:50                             ` David Rientjes
  0 siblings, 1 reply; 36+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-03-04  7:00 UTC (permalink / raw)
  To: David Rientjes
  Cc: Balbir Singh, Andrew Morton, Rik van Riel, Nick Piggin,
	KOSAKI Motohiro, linux-mm

On Wed, 3 Mar 2010 22:50:53 -0800 (PST)
David Rientjes <rientjes@google.com> wrote:

> On Thu, 4 Mar 2010, KAMEZAWA Hiroyuki wrote:
> 
> > > And this is fixed by memcg-fix-oom-kill-behavior-v3.patch in -mm, right?
> > > 
> > yes.
> > 
> 
> Good.  This patch can easily be rebased on top of the next mmotm release, 
> then, as I mentioned before.  Do you have time to review the actual oom 
> killer part of this patch?
> 
About the _changes_ to the generic part itself, I have no concerns.

But I'm not sure whether the TIF_MEMDIE task has already been killed (left the
tasklist) before the VM_FAULT_OOM task comes here.

Thanks,
-Kame

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [patch -mm v2 04/10] oom: remove special handling for pagefault ooms
  2010-03-04  7:00                           ` KAMEZAWA Hiroyuki
@ 2010-03-04  9:50                             ` David Rientjes
  2010-03-05  0:58                               ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 36+ messages in thread
From: David Rientjes @ 2010-03-04  9:50 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Balbir Singh, Andrew Morton, Rik van Riel, Nick Piggin,
	KOSAKI Motohiro, linux-mm

On Thu, 4 Mar 2010, KAMEZAWA Hiroyuki wrote:

> About the _changes_ to the generic part itself, I have no concerns.
> 

Is that your acked-by?

> But I'm not sure whether the TIF_MEMDIE task has already been killed (left the
> tasklist) before the VM_FAULT_OOM task comes here.
> 

If it's no longer a member of the tasklist, then it has freed its memory, 
and thus returning VM_FAULT_OOM again would mean that we are still oom.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [patch -mm v2 04/10] oom: remove special handling for pagefault ooms
  2010-03-04  9:50                             ` David Rientjes
@ 2010-03-05  0:58                               ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 36+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-03-05  0:58 UTC (permalink / raw)
  To: David Rientjes
  Cc: Balbir Singh, Andrew Morton, Rik van Riel, Nick Piggin,
	KOSAKI Motohiro, linux-mm

On Thu, 4 Mar 2010 01:50:04 -0800 (PST)
David Rientjes <rientjes@google.com> wrote:

> On Thu, 4 Mar 2010, KAMEZAWA Hiroyuki wrote:
> 
> > About the _changes_ to the generic part itself, I have no concerns.
> > 
> 
> Is that your acked-by?
> 
Anyway, I think you'll repost; I'll review it again.  Feel free to CC me.


> > But I'm not sure whether the TIF_MEMDIE task has already been killed (left the
> > tasklist) before the VM_FAULT_OOM task comes here.
> > 
> 
> If it's no longer a member of the tasklist, then it has freed its memory, 
> and thus returning VM_FAULT_OOM again would mean that we are still oom.
> 
My concern was a multi-threaded task ignoring SIGCHLD... but after looking into
the page allocator again, I think you're right.  Thanks.

Thanks,
-Kame

^ permalink raw reply	[flat|nested] 36+ messages in thread

end of thread, other threads:[~2010-03-05  1:02 UTC | newest]

Thread overview: 36+ messages
2010-02-26 23:52 [patch -mm v2 00/10] oom killer rewrite David Rientjes
2010-02-26 23:53 ` [patch -mm v2 01/10] oom: filter tasks not sharing the same cpuset David Rientjes
2010-03-02  4:54   ` Balbir Singh
2010-02-26 23:53 ` [patch -mm v2 02/10] oom: sacrifice child with highest badness score for parent David Rientjes
2010-03-02  4:54   ` Balbir Singh
2010-02-26 23:53 ` [patch -mm v2 03/10] oom: select task from tasklist for mempolicy ooms David Rientjes
2010-02-26 23:53 ` [patch -mm v2 04/10] oom: remove special handling for pagefault ooms David Rientjes
2010-03-01  1:12   ` KAMEZAWA Hiroyuki
2010-03-01 10:13     ` David Rientjes
2010-03-01 23:59       ` KAMEZAWA Hiroyuki
2010-03-02 23:55         ` David Rientjes
2010-03-03  0:24           ` KAMEZAWA Hiroyuki
2010-03-03  0:44             ` David Rientjes
2010-03-01  5:23   ` Balbir Singh
2010-03-01 10:04     ` David Rientjes
2010-03-01 23:55       ` KAMEZAWA Hiroyuki
2010-03-03  0:01         ` David Rientjes
2010-03-03  0:22           ` KAMEZAWA Hiroyuki
2010-03-03  0:38             ` David Rientjes
2010-03-03  0:44               ` KAMEZAWA Hiroyuki
2010-03-03  0:53                 ` David Rientjes
2010-03-03  0:58                   ` KAMEZAWA Hiroyuki
2010-03-03 23:27                     ` David Rientjes
2010-03-04  3:59                       ` KAMEZAWA Hiroyuki
2010-03-04  6:50                         ` David Rientjes
2010-03-04  7:00                           ` KAMEZAWA Hiroyuki
2010-03-04  9:50                             ` David Rientjes
2010-03-05  0:58                               ` KAMEZAWA Hiroyuki
2010-03-02  2:21   ` KAMEZAWA Hiroyuki
2010-03-02 23:59     ` David Rientjes
2010-02-26 23:53 ` [patch -mm v2 05/10] oom: badness heuristic rewrite David Rientjes
2010-02-26 23:53 ` [patch -mm v2 06/10] oom: deprecate oom_adj tunable David Rientjes
2010-02-26 23:53 ` [patch -mm v2 07/10] oom: replace sysctls with quick mode David Rientjes
2010-02-26 23:53 ` [patch -mm v2 08/10] oom: avoid oom killer for lowmem allocations David Rientjes
2010-02-26 23:53 ` [patch -mm v2 09/10] oom: remove unnecessary code and cleanup David Rientjes
2010-02-26 23:53 ` [patch -mm v2 10/10] oom: default to killing current for pagefault ooms David Rientjes
