All of lore.kernel.org
* [patch -mm 0/9 v2] oom killer rewrite
From: David Rientjes @ 2010-02-15 22:19 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Rik van Riel, KAMEZAWA Hiroyuki, Nick Piggin, Andrea Arcangeli,
	Balbir Singh, Lubos Lunak, KOSAKI Motohiro, linux-kernel,
	linux-mm

This patchset is a rewrite of the out of memory killer to address several
issues that have been raised recently.  The most notable change is a
complete rewrite of the badness heuristic that determines which task is
killed; the goal was to make it as simple and predictable as possible
while still addressing issues that plague the VM.

Changes from version 1:

 - updated to mmotm-2010-02-11-21-55

 - when iterating the tasklist for mempolicy-constrained oom conditions,
   the node of the cpu that a MPOL_F_LOCAL task is running on is now
   intersected with the page allocator's nodemask to determine whether it
   should be a candidate for oom kill.

 - added: [patch 4/9] oom: remove compulsory panic_on_oom mode

 - /proc/pid/oom_score_adj was added to prevent ABI breakage for
   applications using /proc/pid/oom_adj.  /proc/pid/oom_adj may still be
   used with the old range but it is then scaled to oom_score_adj units
   for a rough linear approximation.  There is no loss in functionality
   from the old interface.

 - added: [patch 6/9] oom: deprecate oom_adj tunable

This patchset is based on mmotm-2010-02-11-21-55 because of the following
dependencies:

	[patch 5/9] oom: badness heuristic rewrite:
		mm-count-swap-usage.patch

	[patch 7/9] oom: replace sysctls with quick mode:
		sysctl-clean-up-vm-related-variable-declarations.patch

To apply to mainline, download 2.6.33-rc8 and apply

	mm-clean-up-mm_counter.patch
	mm-avoid-false-sharing-of-mm_counter.patch
	mm-avoid-false_sharing-of-mm_counter-checkpatch-fixes.patch
	mm-count-swap-usage.patch
	mm-count-swap-usage-checkpatch-fixes.patch
	mm-introduce-dump_page-and-print-symbolic-flag-names.patch
	sysctl-clean-up-vm-related-variable-declarations.patch
	sysctl-clean-up-vm-related-variable-declarations-fix.patch

from http://userweb.kernel.org/~akpm/mmotm/broken-out.tar.gz first.
---
 Documentation/feature-removal-schedule.txt |   30 +
 Documentation/filesystems/proc.txt         |  100 +++---
 Documentation/sysctl/vm.txt                |   71 +---
 fs/proc/base.c                             |  106 ++++++
 include/linux/mempolicy.h                  |   13 
 include/linux/oom.h                        |   24 +
 include/linux/sched.h                      |    3 
 kernel/fork.c                              |    1 
 kernel/sysctl.c                            |   15 
 mm/mempolicy.c                             |   39 ++
 mm/oom_kill.c                              |  479 ++++++++++++++---------------
 mm/page_alloc.c                            |    3 
 12 files changed, 553 insertions(+), 331 deletions(-)

* [patch -mm 1/9 v2] oom: filter tasks not sharing the same cpuset
From: David Rientjes @ 2010-02-15 22:20 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Rik van Riel, KAMEZAWA Hiroyuki, Nick Piggin, Andrea Arcangeli,
	Balbir Singh, Lubos Lunak, KOSAKI Motohiro, linux-kernel,
	linux-mm

Tasks that do not share the same set of allowed nodes as the task that
triggered the oom should not be considered candidates for oom kill.

Tasks in other cpusets with a disjoint set of mems would otherwise be
unfairly penalized for oom conditions elsewhere; in an extreme case, a
single task in a user's cpuset that sets itself to OOM_DISABLE and then
uses more memory than allowed could cause every other application on
the system to be killed.

Killing tasks outside of current's cpuset would rarely free memory for
current anyway.

Acked-by: Rik van Riel <riel@redhat.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: David Rientjes <rientjes@google.com>
---
 mm/oom_kill.c |   12 +++---------
 1 files changed, 3 insertions(+), 9 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -35,7 +35,7 @@ static DEFINE_SPINLOCK(zone_scan_lock);
 /* #define DEBUG */
 
 /*
- * Is all threads of the target process nodes overlap ours?
+ * Do all threads of the target process overlap our allowed nodes?
  */
 static int has_intersects_mems_allowed(struct task_struct *tsk)
 {
@@ -167,14 +167,6 @@ unsigned long badness(struct task_struct *p, unsigned long uptime)
 		points /= 4;
 
 	/*
-	 * If p's nodes don't overlap ours, it may still help to kill p
-	 * because p may have allocated or otherwise mapped memory on
-	 * this node before. However it will be less likely.
-	 */
-	if (!has_intersects_mems_allowed(p))
-		points /= 8;
-
-	/*
 	 * Adjust the score by oom_adj.
 	 */
 	if (oom_adj) {
@@ -266,6 +258,8 @@ static struct task_struct *select_bad_process(unsigned long *ppoints,
 			continue;
 		if (mem && !task_in_mem_cgroup(p, mem))
 			continue;
+		if (!has_intersects_mems_allowed(p))
+			continue;
 
 		/*
 		 * This task already has access to memory reserves and is

* [patch -mm 2/9 v2] oom: sacrifice child with highest badness score for parent
From: David Rientjes @ 2010-02-15 22:20 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Rik van Riel, KAMEZAWA Hiroyuki, Nick Piggin, Andrea Arcangeli,
	Balbir Singh, Lubos Lunak, KOSAKI Motohiro, linux-kernel,
	linux-mm

When a task is chosen for oom kill, the oom killer first attempts to
sacrifice a child not sharing its parent's memory instead.
Unfortunately, this often kills in a seemingly random fashion based on
the ordering of the selected task's child list.  Additionally, it is
not guaranteed to free the large amount of memory needed to prevent
further oom killing in the very near future.

Instead, we now only attempt to sacrifice the worst child not sharing
its parent's memory, if one exists.  The worst child is the one with
the highest badness() score.  This has two advantages: we kill a
memory-hogging task more often, and we allow the configurable
/proc/pid/oom_adj value to be considered as a factor in deciding which
child to kill.

Reviewers may observe that the previous implementation iterated
through the children, attempting to kill each until one succeeded, and
then fell back to the parent if none were found; the new code simply
kills the most memory-hogging child or the parent.  Note, however,
that oom_kill_task() only fails when a child has no mm or has a
/proc/pid/oom_adj of OOM_DISABLE.  badness() returns 0 in both cases,
so the final oom_kill_task() will always succeed.

Acked-by: Rik van Riel <riel@redhat.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: David Rientjes <rientjes@google.com>
---
 mm/oom_kill.c |   23 +++++++++++++++++------
 1 files changed, 17 insertions(+), 6 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -432,7 +432,10 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
 			    unsigned long points, struct mem_cgroup *mem,
 			    const char *message)
 {
+	struct task_struct *victim = p;
 	struct task_struct *c;
+	unsigned long victim_points = 0;
+	struct timespec uptime;
 
 	if (printk_ratelimit())
 		dump_header(p, gfp_mask, order, mem);
@@ -446,17 +449,25 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
 		return 0;
 	}
 
-	printk(KERN_ERR "%s: kill process %d (%s) score %li or a child\n",
-					message, task_pid_nr(p), p->comm, points);
+	pr_err("%s: Kill process %d (%s) with score %lu or sacrifice child\n",
+		message, task_pid_nr(p), p->comm, points);
 
-	/* Try to kill a child first */
+	/* Try to sacrifice the worst child first */
+	do_posix_clock_monotonic_gettime(&uptime);
 	list_for_each_entry(c, &p->children, sibling) {
+		unsigned long cpoints;
+
 		if (c->mm == p->mm)
 			continue;
-		if (!oom_kill_task(c))
-			return 0;
+
+		/* badness() returns 0 if the thread is unkillable */
+		cpoints = badness(c, uptime.tv_sec);
+		if (cpoints > victim_points) {
+			victim = c;
+			victim_points = cpoints;
+		}
 	}
-	return oom_kill_task(p);
+	return oom_kill_task(victim);
 }
 
 #ifdef CONFIG_CGROUP_MEM_RES_CTLR

* [patch -mm 3/9 v2] oom: select task from tasklist for mempolicy ooms
From: David Rientjes @ 2010-02-15 22:20 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Rik van Riel, KAMEZAWA Hiroyuki, Nick Piggin, Andrea Arcangeli,
	Balbir Singh, Lubos Lunak, KOSAKI Motohiro, linux-kernel,
	linux-mm

The oom killer presently kills current whenever there is no more memory
free or reclaimable on its mempolicy's nodes.  There is no guarantee that
current is a memory-hogging task or that killing it will free any
substantial amount of memory, however.

In such situations, it is better to scan the tasklist for tasks that
are allowed to allocate on current's set of nodes and kill the one
with the highest badness() score.  This ensures that the most
memory-hogging task, or the one configured by the user with
/proc/pid/oom_adj, is always selected in such scenarios.

Signed-off-by: David Rientjes <rientjes@google.com>
---
 include/linux/mempolicy.h |   13 +++++++-
 mm/mempolicy.c            |   39 +++++++++++++++++++++++
 mm/oom_kill.c             |   77 +++++++++++++++++++++++++++-----------------
 3 files changed, 98 insertions(+), 31 deletions(-)

diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
--- a/include/linux/mempolicy.h
+++ b/include/linux/mempolicy.h
@@ -202,6 +202,8 @@ extern struct zonelist *huge_zonelist(struct vm_area_struct *vma,
 				unsigned long addr, gfp_t gfp_flags,
 				struct mempolicy **mpol, nodemask_t **nodemask);
 extern bool init_nodemask_of_mempolicy(nodemask_t *mask);
+extern bool mempolicy_nodemask_intersects(struct task_struct *tsk,
+				const nodemask_t *mask);
 extern unsigned slab_node(struct mempolicy *policy);
 
 extern enum zone_type policy_zone;
@@ -329,7 +331,16 @@ static inline struct zonelist *huge_zonelist(struct vm_area_struct *vma,
 	return node_zonelist(0, gfp_flags);
 }
 
-static inline bool init_nodemask_of_mempolicy(nodemask_t *m) { return false; }
+static inline bool init_nodemask_of_mempolicy(nodemask_t *m)
+{
+	return false;
+}
+
+static inline bool mempolicy_nodemask_intersects(struct task_struct *tsk,
+			const nodemask_t *mask)
+{
+	return false;
+}
 
 static inline int do_migrate_pages(struct mm_struct *mm,
 			const nodemask_t *from_nodes,
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1638,6 +1638,45 @@ bool init_nodemask_of_mempolicy(nodemask_t *mask)
 }
 #endif
 
+/*
+ * mempolicy_nodemask_intersects
+ *
+ * If tsk's mempolicy is "default" [NULL], return 'true' to indicate default
+ * policy.  Otherwise, check for intersection between mask and the policy
+ * nodemask for 'bind' or 'interleave' policy, or mask to contain the single
+ * node for 'preferred' or 'local' policy.
+ */
+bool mempolicy_nodemask_intersects(struct task_struct *tsk,
+					const nodemask_t *mask)
+{
+	struct mempolicy *mempolicy;
+	bool ret = true;
+
+	mempolicy = tsk->mempolicy;
+	mpol_get(mempolicy);
+	if (!mask || !mempolicy)
+		goto out;
+
+	switch (mempolicy->mode) {
+	case MPOL_PREFERRED:
+		if (mempolicy->flags & MPOL_F_LOCAL)
+			ret = node_isset(cpu_to_node(task_cpu(tsk)), *mask);
+		else
+			ret = node_isset(mempolicy->v.preferred_node,
+					 *mask);
+		break;
+	case MPOL_BIND:
+	case MPOL_INTERLEAVE:
+		ret = nodes_intersects(mempolicy->v.nodes, *mask);
+		break;
+	default:
+		BUG();
+	}
+out:
+	mpol_put(mempolicy);
+	return ret;
+}
+
 /* Allocate a page in interleaved policy.
    Own path because it needs to do special accounting. */
 static struct page *alloc_page_interleave(gfp_t gfp, unsigned order,
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -26,6 +26,7 @@
 #include <linux/module.h>
 #include <linux/notifier.h>
 #include <linux/memcontrol.h>
+#include <linux/mempolicy.h>
 #include <linux/security.h>
 
 int sysctl_panic_on_oom;
@@ -36,19 +37,35 @@ static DEFINE_SPINLOCK(zone_scan_lock);
 
 /*
  * Do all threads of the target process overlap our allowed nodes?
+ * @tsk: task struct of which task to consider
+ * @mask: nodemask passed to page allocator for mempolicy ooms
  */
-static int has_intersects_mems_allowed(struct task_struct *tsk)
+static bool has_intersects_mems_allowed(struct task_struct *tsk,
+						const nodemask_t *mask)
 {
-	struct task_struct *t;
+	struct task_struct *start = tsk;
 
-	t = tsk;
 	do {
-		if (cpuset_mems_allowed_intersects(current, t))
-			return 1;
-		t = next_thread(t);
-	} while (t != tsk);
-
-	return 0;
+		if (mask) {
+			/*
+			 * If this is a mempolicy constrained oom, tsk's
+			 * cpuset is irrelevant.  Only return true if its
+			 * mempolicy intersects current, otherwise it may be
+			 * needlessly killed.
+			 */
+			if (mempolicy_nodemask_intersects(tsk, mask))
+				return true;
+		} else {
+			/*
+			 * This is not a mempolicy constrained oom, so only
+			 * check the mems of tsk's cpuset.
+			 */
+			if (cpuset_mems_allowed_intersects(current, tsk))
+				return true;
+		}
+		tsk = next_thread(tsk);
+	} while (tsk != start);
+	return false;
 }
 
 /**
@@ -236,7 +253,8 @@ static enum oom_constraint constrained_alloc(struct zonelist *zonelist,
  * (not docbooked, we don't want this one cluttering up the manual)
  */
 static struct task_struct *select_bad_process(unsigned long *ppoints,
-						struct mem_cgroup *mem)
+		struct mem_cgroup *mem, enum oom_constraint constraint,
+		const nodemask_t *mask)
 {
 	struct task_struct *p;
 	struct task_struct *chosen = NULL;
@@ -258,7 +276,9 @@ static struct task_struct *select_bad_process(unsigned long *ppoints,
 			continue;
 		if (mem && !task_in_mem_cgroup(p, mem))
 			continue;
-		if (!has_intersects_mems_allowed(p))
+		if (!has_intersects_mems_allowed(p,
+				constraint == CONSTRAINT_MEMORY_POLICY ? mask :
+									 NULL))
 			continue;
 
 		/*
@@ -478,7 +498,7 @@ void mem_cgroup_out_of_memory(struct mem_cgroup *mem, gfp_t gfp_mask)
 
 	read_lock(&tasklist_lock);
 retry:
-	p = select_bad_process(&points, mem);
+	p = select_bad_process(&points, mem, CONSTRAINT_NONE, NULL);
 	if (PTR_ERR(p) == -1UL)
 		goto out;
 
@@ -560,7 +580,8 @@ void clear_zonelist_oom(struct zonelist *zonelist, gfp_t gfp_mask)
 /*
  * Must be called with tasklist_lock held for read.
  */
-static void __out_of_memory(gfp_t gfp_mask, int order)
+static void __out_of_memory(gfp_t gfp_mask, int order,
+			enum oom_constraint constraint, const nodemask_t *mask)
 {
 	struct task_struct *p;
 	unsigned long points;
@@ -574,7 +595,7 @@ retry:
 	 * Rambo mode: Shoot down a process and hope it solves whatever
 	 * issues we may have.
 	 */
-	p = select_bad_process(&points, NULL);
+	p = select_bad_process(&points, NULL, constraint, mask);
 
 	if (PTR_ERR(p) == -1UL)
 		return;
@@ -615,7 +636,8 @@ void pagefault_out_of_memory(void)
 		panic("out of memory from page fault. panic_on_oom is selected.\n");
 
 	read_lock(&tasklist_lock);
-	__out_of_memory(0, 0); /* unknown gfp_mask and order */
+	/* unknown gfp_mask and order */
+	__out_of_memory(0, 0, CONSTRAINT_NONE, NULL);
 	read_unlock(&tasklist_lock);
 
 	/*
@@ -632,6 +654,7 @@ rest_and_return:
  * @zonelist: zonelist pointer
  * @gfp_mask: memory allocation flags
  * @order: amount of memory being requested as a power of 2
+ * @nodemask: nodemask passed to page allocator
  *
  * If we run out of memory, we have the choice between either
  * killing a random task (bad), letting the system crash (worse)
@@ -660,24 +683,18 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
 	 */
 	constraint = constrained_alloc(zonelist, gfp_mask, nodemask);
 	read_lock(&tasklist_lock);
-
-	switch (constraint) {
-	case CONSTRAINT_MEMORY_POLICY:
-		oom_kill_process(current, gfp_mask, order, 0, NULL,
-				"No available memory (MPOL_BIND)");
-		break;
-
-	case CONSTRAINT_NONE:
-		if (sysctl_panic_on_oom) {
+	if (unlikely(sysctl_panic_on_oom)) {
+		/*
+		 * panic_on_oom only affects CONSTRAINT_NONE, the kernel
+		 * should not panic for cpuset or mempolicy induced memory
+		 * failures.
+		 */
+		if (constraint == CONSTRAINT_NONE) {
 			dump_header(NULL, gfp_mask, order, NULL);
-			panic("out of memory. panic_on_oom is selected\n");
+			panic("Out of memory: panic_on_oom is enabled\n");
 		}
-		/* Fall-through */
-	case CONSTRAINT_CPUSET:
-		__out_of_memory(gfp_mask, order);
-		break;
 	}
-
+	__out_of_memory(gfp_mask, order, constraint, nodemask);
 	read_unlock(&tasklist_lock);
 
 	/*

+ * @nodemask: nodemask passed to page allocator
  *
  * If we run out of memory, we have the choice between either
  * killing a random task (bad), letting the system crash (worse)
@@ -660,24 +683,18 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
 	 */
 	constraint = constrained_alloc(zonelist, gfp_mask, nodemask);
 	read_lock(&tasklist_lock);
-
-	switch (constraint) {
-	case CONSTRAINT_MEMORY_POLICY:
-		oom_kill_process(current, gfp_mask, order, 0, NULL,
-				"No available memory (MPOL_BIND)");
-		break;
-
-	case CONSTRAINT_NONE:
-		if (sysctl_panic_on_oom) {
+	if (unlikely(sysctl_panic_on_oom)) {
+		/*
+		 * panic_on_oom only affects CONSTRAINT_NONE, the kernel
+		 * should not panic for cpuset or mempolicy induced memory
+		 * failures.
+		 */
+		if (constraint == CONSTRAINT_NONE) {
 			dump_header(NULL, gfp_mask, order, NULL);
-			panic("out of memory. panic_on_oom is selected\n");
+			panic("Out of memory: panic_on_oom is enabled\n");
 		}
-		/* Fall-through */
-	case CONSTRAINT_CPUSET:
-		__out_of_memory(gfp_mask, order);
-		break;
 	}
-
+	__out_of_memory(gfp_mask, order, constraint, nodemask);
 	read_unlock(&tasklist_lock);
 
 	/*

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <dont@kvack.org>


* [patch -mm 4/9 v2] oom: remove compulsory panic_on_oom mode
  2010-02-15 22:19 ` David Rientjes
@ 2010-02-15 22:20   ` David Rientjes
  -1 siblings, 0 replies; 145+ messages in thread
From: David Rientjes @ 2010-02-15 22:20 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Rik van Riel, KAMEZAWA Hiroyuki, Nick Piggin, Andrea Arcangeli,
	Balbir Singh, Lubos Lunak, KOSAKI Motohiro, linux-kernel,
	linux-mm

If /proc/sys/vm/panic_on_oom is set to 2, the kernel will panic
regardless of whether the memory allocation is constrained by either a
mempolicy or cpuset.

Since mempolicy-constrained out of memory conditions now iterate through
the tasklist and select a task to kill, it is possible to panic the
machine if all tasks sharing the same mempolicy nodes (including those
with default policy, since they may allocate anywhere) or cpuset mems have
/proc/pid/oom_adj values of OOM_DISABLE.  This is functionally equivalent
to the compulsory panic_on_oom setting of 2, so the mode is removed.

Signed-off-by: David Rientjes <rientjes@google.com>
---
 Documentation/sysctl/vm.txt |   20 ++++----------------
 mm/oom_kill.c               |    5 -----
 2 files changed, 4 insertions(+), 21 deletions(-)

diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
--- a/Documentation/sysctl/vm.txt
+++ b/Documentation/sysctl/vm.txt
@@ -559,25 +559,13 @@ swap-intensive.
 
 panic_on_oom
 
-This enables or disables panic on out-of-memory feature.
+If this is set to zero, the oom killer will be invoked when the kernel is out of
+memory and direct reclaim cannot free any pages.  It will select a
+memory-hogging task to kill so that a large amount of memory is freed.
 
-If this is set to 0, the kernel will kill some rogue process,
-called oom_killer.  Usually, oom_killer can kill rogue processes and
-system will survive.
-
-If this is set to 1, the kernel panics when out-of-memory happens.
-However, if a process limits using nodes by mempolicy/cpusets,
-and those nodes become memory exhaustion status, one process
-may be killed by oom-killer. No panic occurs in this case.
-Because other nodes' memory may be free. This means system total status
-may be not fatal yet.
-
-If this is set to 2, the kernel panics compulsorily even on the
-above-mentioned.
+If this is set to non-zero, the machine will panic when out of memory.
 
 The default value is 0.
-1 and 2 are for failover of clustering. Please select either
-according to your policy of failover.
 
 =============================================================
 
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -672,11 +672,6 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
 		/* Got some memory back in the last second. */
 		return;
 
-	if (sysctl_panic_on_oom == 2) {
-		dump_header(NULL, gfp_mask, order, NULL);
-		panic("out of memory. Compulsory panic_on_oom is selected.\n");
-	}
-
 	/*
 	 * Check if there were limitations on the allocation (only relevant for
 	 * NUMA) that may require different handling.



* [patch -mm 5/9 v2] oom: badness heuristic rewrite
  2010-02-15 22:19 ` David Rientjes
@ 2010-02-15 22:20   ` David Rientjes
  -1 siblings, 0 replies; 145+ messages in thread
From: David Rientjes @ 2010-02-15 22:20 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Rik van Riel, KAMEZAWA Hiroyuki, Nick Piggin, Andrea Arcangeli,
	Balbir Singh, Lubos Lunak, KOSAKI Motohiro, linux-kernel,
	linux-mm

This is a complete rewrite of the oom killer's badness() heuristic, which is
used to determine which task to kill in oom conditions.  The goal is to
make it as simple and predictable as possible so the results are better
understood and we end up killing the task which will lead to the most
memory freeing while still respecting the fine-tuning from userspace.

The baseline for the heuristic is a proportion of memory that each task
is currently using in memory plus swap compared to the amount of
"allowable" memory.  "Allowable," in this sense, means the system-wide
resources for unconstrained oom conditions, the set of mempolicy nodes,
the mems attached to current's cpuset, or a memory controller's limit.
The proportion is given on a scale of 0 (never kill) to 1000 (always
kill), roughly meaning that a task with a badness() score of 500 consumes
approximately 50% of allowable memory resident in RAM or in swap space.

The proportion is always relative to the amount of "allowable" memory and
not the total amount of RAM systemwide so that mempolicies and cpusets
may operate in isolation; they shall not need to know the true size of
the machine on which they are running if they are bound to a specific set
of nodes or mems, respectively.

Forkbomb detection is done in a completely different way: a threshold is
configurable from userspace to determine how many first-generation execve
children (those with their own address spaces) a task may have before it
is considered a forkbomb.  This can be tuned by altering the value in
/proc/sys/vm/oom_forkbomb_thres, which defaults to 1000.

When a task has more than 1000 first-generation children with different
address spaces than itself, a penalty of

	(average rss of children) * (# of 1st generation execve children)
	-----------------------------------------------------------------
			oom_forkbomb_thres

is assessed.  So, for example, using the default oom_forkbomb_thres of
1000, the penalty is twice the average rss of all its execve children if
there are 2000 such tasks.  A task is considered to count toward the
threshold if its total runtime is less than one second; for 1000 of such
tasks to exist, the parent process must be forking at an extremely high

Even though a particular task may be designated a forkbomb and selected
as the victim, the oom killer will still kill the 1st generation execve
child with the highest badness() score in its place.  This avoids killing
important servers or system daemons.  When a web server forks a very
large number of threads for client connections, for example, it is much
better to kill one of those threads than to kill the server and make it
unresponsive.

Root tasks are given 3% extra memory just like __vm_enough_memory()
provides in LSMs.  In the event of two tasks consuming similar amounts of
memory, it is generally better to save root's task.

Because of the change in the badness() heuristic's baseline, it is also
necessary to introduce a new user interface to tune it.  It's not
possible to redefine the meaning of /proc/pid/oom_adj with a new scale
since the ABI cannot be changed for backward compatibility.  Instead, a
new tunable, /proc/pid/oom_score_adj, is added that ranges from -1000 to
+1000.  It may be used to polarize the heuristic such that certain tasks
are never considered for oom kill while others may always be considered.
The value is added directly into the badness() score, so a value of -500,
for example, means to discount 50% of its memory consumption in
comparison to other tasks either on the system, bound to the mempolicy,
in the cpuset, or sharing the same memory controller.

/proc/pid/oom_adj is changed so that its meaning is rescaled into the
units used by /proc/pid/oom_score_adj, and vice versa.  Changing one of
these per-task tunables will rescale the value of the other to an
equivalent meaning.  Although /proc/pid/oom_adj was originally defined as
a bitshift on the badness score, it now shares the same linear growth as
/proc/pid/oom_score_adj but with different granularity.  This is required
so the ABI is not broken with userspace applications and allows oom_adj
to be deprecated for future removal.

Signed-off-by: David Rientjes <rientjes@google.com>
---
 Documentation/filesystems/proc.txt |   97 +++++++----
 Documentation/sysctl/vm.txt        |   21 +++
 fs/proc/base.c                     |   98 +++++++++++-
 include/linux/oom.h                |   18 ++-
 include/linux/sched.h              |    3 +-
 kernel/fork.c                      |    1 +
 kernel/sysctl.c                    |    8 +
 mm/oom_kill.c                      |  319 ++++++++++++++++++++----------------
 8 files changed, 379 insertions(+), 186 deletions(-)

diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -33,7 +33,8 @@ Table of Contents
   2	Modifying System Parameters
 
   3	Per-Process Parameters
-  3.1	/proc/<pid>/oom_adj - Adjust the oom-killer score
+  3.1	/proc/<pid>/oom_adj & /proc/<pid>/oom_score_adj - Adjust the oom-killer
+								score
   3.2	/proc/<pid>/oom_score - Display current oom-killer score
   3.3	/proc/<pid>/io - Display the IO accounting fields
   3.4	/proc/<pid>/coredump_filter - Core dump filtering settings
@@ -1193,42 +1194,64 @@ of the kernel.
 CHAPTER 3: PER-PROCESS PARAMETERS
 ------------------------------------------------------------------------------
 
-3.1 /proc/<pid>/oom_adj - Adjust the oom-killer score
-------------------------------------------------------
-
-This file can be used to adjust the score used to select which processes
-should be killed in an  out-of-memory  situation.  Giving it a high score will
-increase the likelihood of this process being killed by the oom-killer.  Valid
-values are in the range -16 to +15, plus the special value -17, which disables
-oom-killing altogether for this process.
-
-The process to be killed in an out-of-memory situation is selected among all others
-based on its badness score. This value equals the original memory size of the process
-and is then updated according to its CPU time (utime + stime) and the
-run time (uptime - start time). The longer it runs the smaller is the score.
-Badness score is divided by the square root of the CPU time and then by
-the double square root of the run time.
-
-Swapped out tasks are killed first. Half of each child's memory size is added to
-the parent's score if they do not share the same memory. Thus forking servers
-are the prime candidates to be killed. Having only one 'hungry' child will make
-parent less preferable than the child.
-
-/proc/<pid>/oom_score shows process' current badness score.
-
-The following heuristics are then applied:
- * if the task was reniced, its score doubles
- * superuser or direct hardware access tasks (CAP_SYS_ADMIN, CAP_SYS_RESOURCE
- 	or CAP_SYS_RAWIO) have their score divided by 4
- * if oom condition happened in one cpuset and checked process does not belong
- 	to it, its score is divided by 8
- * the resulting score is multiplied by two to the power of oom_adj, i.e.
-	points <<= oom_adj when it is positive and
-	points >>= -(oom_adj) otherwise
-
-The task with the highest badness score is then selected and its children
-are killed, process itself will be killed in an OOM situation when it does
-not have children or some of them disabled oom like described above.
+3.1 /proc/<pid>/oom_adj & /proc/<pid>/oom_score_adj - Adjust the oom-killer score
+--------------------------------------------------------------------------------
+
+These files can be used to adjust the badness heuristic used to select which
+process gets killed in out of memory conditions.
+
+The badness heuristic assigns a value to each candidate task ranging from 0
+(never kill) to 1000 (always kill) to determine which process is targeted.  The
+units are roughly a proportion along that range of allowed memory the process
+may allocate from based on an estimation of its current memory and swap use.
+For example, if a task is using all allowed memory, its badness score will be
+1000.  If it is using half of its allowed memory, its score will be 500.
+
+There are a couple of additional factors included in the badness score: root
+processes are given 3% extra memory over other tasks, and tasks which forkbomb
+an excessive number of child processes are penalized by their average size.
+The number of child processes considered to be a forkbomb is configurable
+via /proc/sys/vm/oom_forkbomb_thres (see Documentation/sysctl/vm.txt).
+
+The amount of "allowed" memory depends on the context in which the oom killer
+was called.  If it is due to the memory assigned to the allocating task's cpuset
+being exhausted, the allowed memory represents the set of mems assigned to that
+cpuset.  If it is due to a mempolicy's node(s) being exhausted, the allowed
+memory represents the set of mempolicy nodes.  If it is due to a memory
+limit (or swap limit) being reached, the allowed memory is that configured
+limit.  Finally, if it is due to the entire system being out of memory, the
+allowed memory represents all allocatable resources.
+
+The value of /proc/<pid>/oom_score_adj is added to the badness score before it
+is used to determine which task to kill.  Acceptable values range from -1000
+(OOM_SCORE_ADJ_MIN) to +1000 (OOM_SCORE_ADJ_MAX).  This allows userspace to
+polarize the preference for oom killing either by always preferring a certain
+task or completely disabling it.  The lowest possible value, -1000, is
+equivalent to disabling oom killing entirely for that task since it will always
+report a badness score of 0.
+
+Consequently, it is very simple for userspace to define the amount of memory to
+consider for each task.  Setting a /proc/<pid>/oom_score_adj value of +500, for
+example, is roughly equivalent to allowing the remainder of tasks sharing the
+same system, cpuset, mempolicy, or memory controller resources to use at least
+50% more memory.  A value of -500, on the other hand, would be roughly
+equivalent to discounting 50% of the task's allowed memory from being considered
+as scoring against the task.
+
+For backwards compatibility with previous kernels, /proc/<pid>/oom_adj may also
+be used to tune the badness score.  Its acceptable values range from -16
+(OOM_ADJUST_MIN) to +15 (OOM_ADJUST_MAX) and a special value of -17
+(OOM_DISABLE) to disable oom killing entirely for that task.  Its value is
+scaled linearly with /proc/<pid>/oom_score_adj.
+
+Writing to either /proc/<pid>/oom_score_adj or /proc/<pid>/oom_adj updates the
+other with its scaled equivalent value.
+
+Caveat: when a parent task is selected, the oom killer will sacrifice any first
+generation children with separate address spaces instead, if possible.  This
+prevents servers and important system daemons from being killed and minimizes
+the amount of work lost.
+
 
 3.2 /proc/<pid>/oom_score - Display current oom-killer score
 -------------------------------------------------------------
diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
--- a/Documentation/sysctl/vm.txt
+++ b/Documentation/sysctl/vm.txt
@@ -44,6 +44,7 @@ Currently, these files are in /proc/sys/vm:
 - nr_trim_pages         (only if CONFIG_MMU=n)
 - numa_zonelist_order
 - oom_dump_tasks
+- oom_forkbomb_thres
 - oom_kill_allocating_task
 - overcommit_memory
 - overcommit_ratio
@@ -490,6 +491,26 @@ The default value is 0.
 
 ==============================================================
 
+oom_forkbomb_thres
+
+This value defines how many children with a separate address space a specific
+task may have before being considered as a possible forkbomb.  Tasks with more
+children not sharing the same address space as the parent will be penalized by a
+quantity of memory equaling
+
+	(average rss of execve children) * (# of 1st generation execve children)
+	------------------------------------------------------------------------
+				oom_forkbomb_thres
+
+in the oom killer's badness heuristic.  Such tasks may be protected with a lower
+oom_adj value (see Documentation/filesystems/proc.txt) if necessary.
+
+A value of 0 will disable forkbomb detection.
+
+The default value is 1000.
+
+==============================================================
+
 oom_kill_allocating_task
 
 This enables or disables killing the OOM-triggering task in
diff --git a/fs/proc/base.c b/fs/proc/base.c
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -81,6 +81,7 @@
 #include <linux/elf.h>
 #include <linux/pid_namespace.h>
 #include <linux/fs_struct.h>
+#include <linux/swap.h>
 #include "internal.h"
 
 /* NOTE:
@@ -458,7 +459,6 @@ static const struct file_operations proc_lstats_operations = {
 #endif
 
 /* The badness from the OOM killer */
-unsigned long badness(struct task_struct *p, unsigned long uptime);
 static int proc_oom_score(struct task_struct *task, char *buffer)
 {
 	unsigned long points;
@@ -466,7 +466,13 @@ static int proc_oom_score(struct task_struct *task, char *buffer)
 
 	do_posix_clock_monotonic_gettime(&uptime);
 	read_lock(&tasklist_lock);
-	points = badness(task->group_leader, uptime.tv_sec);
+	points = oom_badness(task->group_leader,
+				global_page_state(NR_INACTIVE_ANON) +
+				global_page_state(NR_ACTIVE_ANON) +
+				global_page_state(NR_INACTIVE_FILE) +
+				global_page_state(NR_ACTIVE_FILE) +
+				total_swap_pages,
+				uptime.tv_sec);
 	read_unlock(&tasklist_lock);
 	return sprintf(buffer, "%lu\n", points);
 }
@@ -1152,7 +1158,15 @@ static ssize_t oom_adjust_write(struct file *file, const char __user *buf,
 	}
 
 	task->signal->oom_adj = oom_adjust;
-
+	/*
+	 * Scale /proc/pid/oom_score_adj appropriately ensuring that a maximum
+	 * value is always attainable.
+	 */
+	if (task->signal->oom_adj == OOM_ADJUST_MAX)
+		task->signal->oom_score_adj = OOM_SCORE_ADJ_MAX;
+	else
+		task->signal->oom_score_adj = (oom_adjust * OOM_SCORE_ADJ_MAX) /
+								-OOM_DISABLE;
 	unlock_task_sighand(task, &flags);
 	put_task_struct(task);
 
@@ -1164,6 +1178,82 @@ static const struct file_operations proc_oom_adjust_operations = {
 	.write		= oom_adjust_write,
 };
 
+static ssize_t oom_score_adj_read(struct file *file, char __user *buf,
+					size_t count, loff_t *ppos)
+{
+	struct task_struct *task = get_proc_task(file->f_path.dentry->d_inode);
+	char buffer[PROC_NUMBUF];
+	int oom_score_adj = OOM_SCORE_ADJ_MIN;
+	unsigned long flags;
+	size_t len;
+
+	if (!task)
+		return -ESRCH;
+	if (lock_task_sighand(task, &flags)) {
+		oom_score_adj = task->signal->oom_score_adj;
+		unlock_task_sighand(task, &flags);
+	}
+	put_task_struct(task);
+	len = snprintf(buffer, sizeof(buffer), "%d\n", oom_score_adj);
+	return simple_read_from_buffer(buf, count, ppos, buffer, len);
+}
+
+static ssize_t oom_score_adj_write(struct file *file, const char __user *buf,
+					size_t count, loff_t *ppos)
+{
+	struct task_struct *task;
+	char buffer[PROC_NUMBUF];
+	unsigned long flags;
+	long oom_score_adj;
+	int err;
+
+	memset(buffer, 0, sizeof(buffer));
+	if (count > sizeof(buffer) - 1)
+		count = sizeof(buffer) - 1;
+	if (copy_from_user(buffer, buf, count))
+		return -EFAULT;
+
+	err = strict_strtol(strstrip(buffer), 0, &oom_score_adj);
+	if (err)
+		return -EINVAL;
+	if (oom_score_adj < OOM_SCORE_ADJ_MIN ||
+			oom_score_adj > OOM_SCORE_ADJ_MAX)
+		return -EINVAL;
+
+	task = get_proc_task(file->f_path.dentry->d_inode);
+	if (!task)
+		return -ESRCH;
+	if (!lock_task_sighand(task, &flags)) {
+		put_task_struct(task);
+		return -ESRCH;
+	}
+	if (oom_score_adj < task->signal->oom_score_adj &&
+			!capable(CAP_SYS_RESOURCE)) {
+		unlock_task_sighand(task, &flags);
+		put_task_struct(task);
+		return -EACCES;
+	}
+
+	task->signal->oom_score_adj = oom_score_adj;
+	/*
+	 * Scale /proc/pid/oom_adj appropriately ensuring that OOM_DISABLE is
+	 * always attainable.
+	 */
+	if (task->signal->oom_score_adj == OOM_SCORE_ADJ_MIN)
+		task->signal->oom_adj = OOM_DISABLE;
+	else
+		task->signal->oom_adj = (oom_score_adj * OOM_ADJUST_MAX) /
+							OOM_SCORE_ADJ_MAX;
+	unlock_task_sighand(task, &flags);
+	put_task_struct(task);
+	return count;
+}
+
+static const struct file_operations proc_oom_score_adj_operations = {
+	.read		= oom_score_adj_read,
+	.write		= oom_score_adj_write,
+};
+
 #ifdef CONFIG_AUDITSYSCALL
 #define TMPBUFLEN 21
 static ssize_t proc_loginuid_read(struct file * file, char __user * buf,
@@ -2727,6 +2817,7 @@ static const struct pid_entry tgid_base_stuff[] = {
 #endif
 	INF("oom_score",  S_IRUGO, proc_oom_score),
 	REG("oom_adj",    S_IRUGO|S_IWUSR, proc_oom_adjust_operations),
+	REG("oom_score_adj", S_IRUGO|S_IWUSR, proc_oom_score_adj_operations),
 #ifdef CONFIG_AUDITSYSCALL
 	REG("loginuid",   S_IWUSR|S_IRUGO, proc_loginuid_operations),
 	REG("sessionid",  S_IRUGO, proc_sessionid_operations),
@@ -3061,6 +3152,7 @@ static const struct pid_entry tid_base_stuff[] = {
 #endif
 	INF("oom_score", S_IRUGO, proc_oom_score),
 	REG("oom_adj",   S_IRUGO|S_IWUSR, proc_oom_adjust_operations),
+	REG("oom_score_adj", S_IRUGO|S_IWUSR, proc_oom_score_adj_operations),
 #ifdef CONFIG_AUDITSYSCALL
 	REG("loginuid",  S_IWUSR|S_IRUGO, proc_loginuid_operations),
 	REG("sessionid",  S_IRUSR, proc_sessionid_operations),
diff --git a/include/linux/oom.h b/include/linux/oom.h
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -1,14 +1,27 @@
 #ifndef __INCLUDE_LINUX_OOM_H
 #define __INCLUDE_LINUX_OOM_H
 
-/* /proc/<pid>/oom_adj set to -17 protects from the oom-killer */
+/*
+ * /proc/<pid>/oom_adj set to -17 protects from the oom-killer
+ */
 #define OOM_DISABLE (-17)
 /* inclusive */
 #define OOM_ADJUST_MIN (-16)
 #define OOM_ADJUST_MAX 15
 
+/*
+ * /proc/<pid>/oom_score_adj set to OOM_SCORE_ADJ_MIN disables oom killing for
+ * pid.
+ */
+#define OOM_SCORE_ADJ_MIN	(-1000)
+#define OOM_SCORE_ADJ_MAX	1000
+
+/* See Documentation/sysctl/vm.txt */
+#define DEFAULT_OOM_FORKBOMB_THRES	1000
+
 #ifdef __KERNEL__
 
+#include <linux/sched.h>
 #include <linux/types.h>
 #include <linux/nodemask.h>
 
@@ -24,6 +37,8 @@ enum oom_constraint {
 	CONSTRAINT_MEMORY_POLICY,
 };
 
+extern unsigned int oom_badness(struct task_struct *p,
+			unsigned long totalpages, unsigned long uptime);
 extern int try_set_zone_oom(struct zonelist *zonelist, gfp_t gfp_flags);
 extern void clear_zonelist_oom(struct zonelist *zonelist, gfp_t gfp_flags);
 
@@ -47,6 +62,7 @@ static inline void oom_killer_enable(void)
 extern int sysctl_panic_on_oom;
 extern int sysctl_oom_kill_allocating_task;
 extern int sysctl_oom_dump_tasks;
+extern int sysctl_oom_forkbomb_thres;
 
 #endif /* __KERNEL__*/
 #endif /* _INCLUDE_LINUX_OOM_H */
diff --git a/include/linux/sched.h b/include/linux/sched.h
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -624,7 +624,8 @@ struct signal_struct {
 	struct tty_audit_buf *tty_audit_buf;
 #endif
 
-	int oom_adj;	/* OOM kill score adjustment (bit shift) */
+	int oom_adj;		/* OOM kill score adjustment (bit shift) */
+	int oom_score_adj;	/* OOM kill score adjustment */
 };
 
 /* Context switch must be unlocked if interrupts are to be enabled */
diff --git a/kernel/fork.c b/kernel/fork.c
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -878,6 +878,7 @@ static int copy_signal(unsigned long clone_flags, struct task_struct *tsk)
 	tty_audit_fork(sig);
 
 	sig->oom_adj = current->signal->oom_adj;
+	sig->oom_score_adj = current->signal->oom_score_adj;
 
 	return 0;
 }
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -955,6 +955,14 @@ static struct ctl_table vm_table[] = {
 		.proc_handler	= proc_dointvec,
 	},
 	{
+		.procname	= "oom_forkbomb_thres",
+		.data		= &sysctl_oom_forkbomb_thres,
+		.maxlen		= sizeof(sysctl_oom_forkbomb_thres),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= &zero,
+	},
+	{
 		.procname	= "overcommit_ratio",
 		.data		= &sysctl_overcommit_ratio,
 		.maxlen		= sizeof(sysctl_overcommit_ratio),
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -4,6 +4,8 @@
  *  Copyright (C)  1998,2000  Rik van Riel
  *	Thanks go out to Claus Fischer for some serious inspiration and
  *	for goading me into coding this file...
+ *  Copyright (C)  2010  Google, Inc
+ *	Rewritten by David Rientjes
  *
  *  The routines in this file are used to kill a process when
  *  we're seriously out of memory. This gets called from __alloc_pages()
@@ -32,8 +34,8 @@
 int sysctl_panic_on_oom;
 int sysctl_oom_kill_allocating_task;
 int sysctl_oom_dump_tasks;
+int sysctl_oom_forkbomb_thres = DEFAULT_OOM_FORKBOMB_THRES;
 static DEFINE_SPINLOCK(zone_scan_lock);
-/* #define DEBUG */
 
 /*
  * Do all threads of the target process overlap our allowed nodes?
@@ -68,138 +70,129 @@ static bool has_intersects_mems_allowed(struct task_struct *tsk,
 	return false;
 }
 
-/**
- * badness - calculate a numeric value for how bad this task has been
- * @p: task struct of which task we should calculate
- * @uptime: current uptime in seconds
+/*
+ * Tasks that fork a very large number of children with separate address spaces
+ * may be the result of a bug, user error, malicious applications, or even those
+ * with a very legitimate purpose.  The oom killer assesses a penalty equaling
  *
- * The formula used is relatively simple and documented inline in the
- * function. The main rationale is that we want to select a good task
- * to kill when we run out of memory.
+ *	(average rss of children) * (# of 1st generation execve children)
+ *	-----------------------------------------------------------------
+ *			sysctl_oom_forkbomb_thres
  *
- * Good in this context means that:
- * 1) we lose the minimum amount of work done
- * 2) we recover a large amount of memory
- * 3) we don't kill anything innocent of eating tons of memory
- * 4) we want to kill the minimum amount of processes (one)
- * 5) we try to kill the process the user expects us to kill, this
- *    algorithm has been meticulously tuned to meet the principle
- *    of least surprise ... (be careful when you change it)
+ * for such tasks to target the parent.  oom_kill_process() will attempt to
+ * first kill a child, so there's no risk of killing an important system daemon
+ * via this method.  A web server, for example, may fork a very large number of
+ * threads to respond to client connections; it's much better to kill a child
+ * than to kill the parent and make the server unresponsive.  The goal here is
+ * to give the user a chance to recover from the error rather than deplete all
+ * memory such that the system is unusable; it's not meant to effect a forkbomb
+ * policy.
  */
-
-unsigned long badness(struct task_struct *p, unsigned long uptime)
+static unsigned long oom_forkbomb_penalty(struct task_struct *tsk)
 {
-	unsigned long points, cpu_time, run_time;
-	struct mm_struct *mm;
 	struct task_struct *child;
-	int oom_adj = p->signal->oom_adj;
-	struct task_cputime task_time;
-	unsigned long utime;
-	unsigned long stime;
+	unsigned long child_rss = 0;
+	int forkcount = 0;
 
-	if (oom_adj == OOM_DISABLE)
+	if (!sysctl_oom_forkbomb_thres)
 		return 0;
+	list_for_each_entry(child, &tsk->children, sibling) {
+		struct task_cputime task_time;
+		unsigned long runtime;
 
-	task_lock(p);
-	mm = p->mm;
-	if (!mm) {
-		task_unlock(p);
-		return 0;
+		task_lock(child);
+		if (!child->mm || child->mm == tsk->mm) {
+			task_unlock(child);
+			continue;
+		}
+		thread_group_cputime(child, &task_time);
+		runtime = cputime_to_jiffies(task_time.utime) +
+			  cputime_to_jiffies(task_time.stime);
+		/*
+		 * Only threads that have run for less than a second are
+		 * considered toward the forkbomb penalty; such threads rarely
+		 * get to execute at all in such cases anyway.
+		 */
+		if (runtime < HZ) {
+			child_rss += get_mm_rss(child->mm);
+			forkcount++;
+		}
+		task_unlock(child);
 	}
 
 	/*
-	 * The memory size of the process is the basis for the badness.
+	 * Forkbombs get penalized by:
+	 *
+	 * (average rss of children) * (# of first-generation execve children) /
+	 *			sysctl_oom_forkbomb_thres
 	 */
-	points = mm->total_vm;
+	return forkcount > sysctl_oom_forkbomb_thres ?
+				(child_rss / sysctl_oom_forkbomb_thres) : 0;
+}
+
+/**
+ * oom_badness - heuristic function to determine which candidate task to kill
+ * @p: task struct of the task whose badness we should calculate
+ * @totalpages: total present RAM allowed for page allocation
+ * @uptime: current uptime in seconds
+ *
+ * The heuristic for determining which task to kill is made to be as simple and
+ * predictable as possible.  The goal is to return the highest value for the
+ * task consuming the most memory to avoid subsequent oom conditions.
+ */
+unsigned int oom_badness(struct task_struct *p, unsigned long totalpages,
+							unsigned long uptime)
+{
+	struct mm_struct *mm;
+	int points;
 
 	/*
-	 * After this unlock we can no longer dereference local variable `mm'
+	 * Shortcut check for OOM_SCORE_ADJ_MIN so the entire heuristic doesn't
+	 * need to be executed for something that can't be killed.
 	 */
-	task_unlock(p);
+	if (p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN)
+		return 0;
 
 	/*
-	 * swapoff can easily use up all memory, so kill those first.
+	 * When the PF_OOM_ORIGIN bit is set, it indicates the task should have
+	 * priority for oom killing.
 	 */
 	if (p->flags & PF_OOM_ORIGIN)
-		return ULONG_MAX;
+		return 1000;
 
-	/*
-	 * Processes which fork a lot of child processes are likely
-	 * a good choice. We add half the vmsize of the children if they
-	 * have an own mm. This prevents forking servers to flood the
-	 * machine with an endless amount of children. In case a single
-	 * child is eating the vast majority of memory, adding only half
-	 * to the parents will make the child our kill candidate of choice.
-	 */
-	list_for_each_entry(child, &p->children, sibling) {
-		task_lock(child);
-		if (child->mm != mm && child->mm)
-			points += child->mm->total_vm/2 + 1;
-		task_unlock(child);
+	task_lock(p);
+	mm = p->mm;
+	if (!mm) {
+		task_unlock(p);
+		return 0;
 	}
 
 	/*
-	 * CPU time is in tens of seconds and run time is in thousands
-         * of seconds. There is no particular reason for this other than
-         * that it turned out to work very well in practice.
-	 */
-	thread_group_cputime(p, &task_time);
-	utime = cputime_to_jiffies(task_time.utime);
-	stime = cputime_to_jiffies(task_time.stime);
-	cpu_time = (utime + stime) >> (SHIFT_HZ + 3);
-
-
-	if (uptime >= p->start_time.tv_sec)
-		run_time = (uptime - p->start_time.tv_sec) >> 10;
-	else
-		run_time = 0;
-
-	if (cpu_time)
-		points /= int_sqrt(cpu_time);
-	if (run_time)
-		points /= int_sqrt(int_sqrt(run_time));
-
-	/*
-	 * Niced processes are most likely less important, so double
-	 * their badness points.
+	 * The baseline for the badness score is the proportion of RAM that each
+	 * task's rss and swap space use.
 	 */
-	if (task_nice(p) > 0)
-		points *= 2;
-
-	/*
-	 * Superuser processes are usually more important, so we make it
-	 * less likely that we kill those.
-	 */
-	if (has_capability_noaudit(p, CAP_SYS_ADMIN) ||
-	    has_capability_noaudit(p, CAP_SYS_RESOURCE))
-		points /= 4;
+	points = (get_mm_rss(mm) + get_mm_counter(mm, MM_SWAPENTS)) * 1000 /
+			totalpages;
+	task_unlock(p);
+	points += oom_forkbomb_penalty(p);
 
 	/*
-	 * We don't want to kill a process with direct hardware access.
-	 * Not only could that mess up the hardware, but usually users
-	 * tend to only have this flag set on applications they think
-	 * of as important.
+	 * Root processes get 3% bonus, just like the __vm_enough_memory() used
+	 * by LSMs.
 	 */
-	if (has_capability_noaudit(p, CAP_SYS_RAWIO))
-		points /= 4;
+	if (has_capability_noaudit(p, CAP_SYS_ADMIN))
+		points -= 30;
 
 	/*
-	 * Adjust the score by oom_adj.
+	 * /proc/pid/oom_score_adj ranges from -1000 to +1000 such that the
+	 * range may either completely disable oom killing or always prefer a
+	 * certain task.
 	 */
-	if (oom_adj) {
-		if (oom_adj > 0) {
-			if (!points)
-				points = 1;
-			points <<= oom_adj;
-		} else
-			points >>= -(oom_adj);
-	}
+	points += p->signal->oom_score_adj;
 
-#ifdef DEBUG
-	printk(KERN_DEBUG "OOMkill: task %d (%s) got %lu points\n",
-	p->pid, p->comm, points);
-#endif
-	return points;
+	if (points < 0)
+		return 0;
+	return (points <= 1000) ? points : 1000;
 }
 
 /*
@@ -207,11 +200,21 @@ unsigned long badness(struct task_struct *p, unsigned long uptime)
  */
 #ifdef CONFIG_NUMA
 static enum oom_constraint constrained_alloc(struct zonelist *zonelist,
-				    gfp_t gfp_mask, nodemask_t *nodemask)
+				gfp_t gfp_mask, nodemask_t *nodemask,
+				unsigned long *totalpages)
 {
 	struct zone *zone;
 	struct zoneref *z;
 	enum zone_type high_zoneidx = gfp_zone(gfp_mask);
+	bool cpuset_limited = false;
+	int nid;
+
+	/* Default to all anonymous memory, page cache, and swap */
+	*totalpages = global_page_state(NR_INACTIVE_ANON) +
+			global_page_state(NR_ACTIVE_ANON) +
+			global_page_state(NR_INACTIVE_FILE) +
+			global_page_state(NR_ACTIVE_FILE) +
+			total_swap_pages;
 
 	/*
 	 * Reach here only when __GFP_NOFAIL is used. So, we should avoid
@@ -222,25 +225,41 @@ static enum oom_constraint constrained_alloc(struct zonelist *zonelist,
 		return CONSTRAINT_NONE;
 
 	/*
-	 * The nodemask here is a nodemask passed to alloc_pages(). Now,
-	 * cpuset doesn't use this nodemask for its hardwall/softwall/hierarchy
-	 * feature. mempolicy is an only user of nodemask here.
-	 * check mempolicy's nodemask contains all N_HIGH_MEMORY
+	 * This is not a __GFP_THISNODE allocation, so a truncated nodemask in
+	 * the page allocator means a mempolicy is in effect.  Cpuset policy
+	 * is enforced in get_page_from_freelist().
 	 */
-	if (nodemask && !nodes_subset(node_states[N_HIGH_MEMORY], *nodemask))
+	if (nodemask && !nodes_subset(node_states[N_HIGH_MEMORY], *nodemask)) {
+		*totalpages = total_swap_pages;
+		for_each_node_mask(nid, *nodemask)
+			*totalpages += node_page_state(nid, NR_INACTIVE_ANON) +
+					node_page_state(nid, NR_ACTIVE_ANON) +
+					node_page_state(nid, NR_INACTIVE_FILE) +
+					node_page_state(nid, NR_ACTIVE_FILE);
 		return CONSTRAINT_MEMORY_POLICY;
+	}
 
 	/* Check this allocation failure is caused by cpuset's wall function */
 	for_each_zone_zonelist_nodemask(zone, z, zonelist,
 			high_zoneidx, nodemask)
 		if (!cpuset_zone_allowed_softwall(zone, gfp_mask))
-			return CONSTRAINT_CPUSET;
-
+			cpuset_limited = true;
+
+	if (cpuset_limited) {
+		*totalpages = total_swap_pages;
+		for_each_node_mask(nid, cpuset_current_mems_allowed)
+			*totalpages += node_page_state(nid, NR_INACTIVE_ANON) +
+					node_page_state(nid, NR_ACTIVE_ANON) +
+					node_page_state(nid, NR_INACTIVE_FILE) +
+					node_page_state(nid, NR_ACTIVE_FILE);
+		return CONSTRAINT_CPUSET;
+	}
 	return CONSTRAINT_NONE;
 }
 #else
 static enum oom_constraint constrained_alloc(struct zonelist *zonelist,
-				gfp_t gfp_mask, nodemask_t *nodemask)
+				gfp_t gfp_mask, nodemask_t *nodemask,
+				unsigned long *totalpages)
 {
 	return CONSTRAINT_NONE;
 }
@@ -252,9 +271,9 @@ static enum oom_constraint constrained_alloc(struct zonelist *zonelist,
  *
  * (not docbooked, we don't want this one cluttering up the manual)
  */
-static struct task_struct *select_bad_process(unsigned long *ppoints,
-		struct mem_cgroup *mem, enum oom_constraint constraint,
-		const nodemask_t *mask)
+static struct task_struct *select_bad_process(unsigned int *ppoints,
+		unsigned long totalpages, struct mem_cgroup *mem,
+		enum oom_constraint constraint, const nodemask_t *mask)
 {
 	struct task_struct *p;
 	struct task_struct *chosen = NULL;
@@ -263,7 +282,7 @@ static struct task_struct *select_bad_process(unsigned long *ppoints,
 
 	do_posix_clock_monotonic_gettime(&uptime);
 	for_each_process(p) {
-		unsigned long points;
+		unsigned int points;
 
 		/*
 		 * skip kernel threads and tasks which have already released
@@ -308,13 +327,13 @@ static struct task_struct *select_bad_process(unsigned long *ppoints,
 				return ERR_PTR(-1UL);
 
 			chosen = p;
-			*ppoints = ULONG_MAX;
+			*ppoints = 1000;
 		}
 
-		if (p->signal->oom_adj == OOM_DISABLE)
+		if (p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN)
 			continue;
 
-		points = badness(p, uptime.tv_sec);
+		points = oom_badness(p, totalpages, uptime.tv_sec);
 		if (points > *ppoints || !chosen) {
 			chosen = p;
 			*ppoints = points;
@@ -330,7 +349,7 @@ static struct task_struct *select_bad_process(unsigned long *ppoints,
  *
  * Dumps the current memory state of all system tasks, excluding kernel threads.
  * State information includes task's pid, uid, tgid, vm size, rss, cpu, oom_adj
- * score, and name.
+ * value, oom_score_adj value, and name.
  *
  * If the actual is non-NULL, only tasks that are a member of the mem_cgroup are
  * shown.
@@ -342,7 +361,7 @@ static void dump_tasks(const struct mem_cgroup *mem)
 	struct task_struct *g, *p;
 
 	printk(KERN_INFO "[ pid ]   uid  tgid total_vm      rss cpu oom_adj "
-	       "name\n");
+	       "oom_score_adj name\n");
 	do_each_thread(g, p) {
 		struct mm_struct *mm;
 
@@ -362,10 +381,10 @@ static void dump_tasks(const struct mem_cgroup *mem)
 			task_unlock(p);
 			continue;
 		}
-		printk(KERN_INFO "[%5d] %5d %5d %8lu %8lu %3d     %3d %s\n",
+		pr_info("[%5d] %5d %5d %8lu %8lu %3d     %3d          %4d %s\n",
 		       p->pid, __task_cred(p)->uid, p->tgid, mm->total_vm,
 		       get_mm_rss(mm), (int)task_cpu(p), p->signal->oom_adj,
-		       p->comm);
+		       p->signal->oom_score_adj, p->comm);
 		task_unlock(p);
 	} while_each_thread(g, p);
 }
@@ -374,8 +393,9 @@ static void dump_header(struct task_struct *p, gfp_t gfp_mask, int order,
 							struct mem_cgroup *mem)
 {
 	pr_warning("%s invoked oom-killer: gfp_mask=0x%x, order=%d, "
-		"oom_adj=%d\n",
-		current->comm, gfp_mask, order, current->signal->oom_adj);
+		"oom_adj=%d, oom_score_adj=%d\n",
+		current->comm, gfp_mask, order, current->signal->oom_adj,
+		current->signal->oom_score_adj);
 	task_lock(current);
 	cpuset_print_task_mems_allowed(current);
 	task_unlock(current);
@@ -440,7 +460,7 @@ static int oom_kill_task(struct task_struct *p)
 	 * change to NULL at any time since we do not hold task_lock(p).
 	 * However, this is of no concern to us.
 	 */
-	if (!p->mm || p->signal->oom_adj == OOM_DISABLE)
+	if (!p->mm || p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN)
 		return 1;
 
 	__oom_kill_task(p, 1);
@@ -449,12 +469,12 @@ static int oom_kill_task(struct task_struct *p)
 }
 
 static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
-			    unsigned long points, struct mem_cgroup *mem,
-			    const char *message)
+			    unsigned int points, unsigned long totalpages,
+			    struct mem_cgroup *mem, const char *message)
 {
 	struct task_struct *victim = p;
 	struct task_struct *c;
-	unsigned long victim_points = 0;
+	unsigned int victim_points = 0;
 	struct timespec uptime;
 
 	if (printk_ratelimit())
@@ -469,19 +489,19 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
 		return 0;
 	}
 
-	pr_err("%s: Kill process %d (%s) with score %lu or sacrifice child\n",
+	pr_err("%s: Kill process %d (%s) with score %d or sacrifice child\n",
 		message, task_pid_nr(p), p->comm, points);
 
 	/* Try to sacrifice the worst child first */
 	do_posix_clock_monotonic_gettime(&uptime);
 	list_for_each_entry(c, &p->children, sibling) {
-		unsigned long cpoints;
+		unsigned int cpoints;
 
 		if (c->mm == p->mm)
 			continue;
 
-		/* badness() returns 0 if the thread is unkillable */
-		cpoints = badness(c, uptime.tv_sec);
+		/* oom_badness() returns 0 if the thread is unkillable */
+		cpoints = oom_badness(c, totalpages, uptime.tv_sec);
 		if (cpoints > victim_points) {
 			victim = c;
 			victim_points = cpoints;
@@ -493,19 +513,22 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
 #ifdef CONFIG_CGROUP_MEM_RES_CTLR
 void mem_cgroup_out_of_memory(struct mem_cgroup *mem, gfp_t gfp_mask)
 {
-	unsigned long points = 0;
+	unsigned int points = 0;
 	struct task_struct *p;
+	unsigned long limit;
 
+	limit = (res_counter_read_u64(&mem->res, RES_LIMIT) >> PAGE_SHIFT) +
+		   (res_counter_read_u64(&mem->memsw, RES_LIMIT) >> PAGE_SHIFT);
 	read_lock(&tasklist_lock);
 retry:
-	p = select_bad_process(&points, mem, CONSTRAINT_NONE, NULL);
+	p = select_bad_process(&points, limit, mem, CONSTRAINT_NONE, NULL);
 	if (PTR_ERR(p) == -1UL)
 		goto out;
 
 	if (!p)
 		p = current;
 
-	if (oom_kill_process(p, gfp_mask, 0, points, mem,
+	if (oom_kill_process(p, gfp_mask, 0, points, limit, mem,
 				"Memory cgroup out of memory"))
 		goto retry;
 out:
@@ -580,22 +603,22 @@ void clear_zonelist_oom(struct zonelist *zonelist, gfp_t gfp_mask)
 /*
  * Must be called with tasklist_lock held for read.
  */
-static void __out_of_memory(gfp_t gfp_mask, int order,
+static void __out_of_memory(gfp_t gfp_mask, int order, unsigned long totalpages,
 			enum oom_constraint constraint, const nodemask_t *mask)
 {
 	struct task_struct *p;
-	unsigned long points;
+	unsigned int points;
 
 	if (sysctl_oom_kill_allocating_task)
-		if (!oom_kill_process(current, gfp_mask, order, 0, NULL,
-				"Out of memory (oom_kill_allocating_task)"))
+		if (!oom_kill_process(current, gfp_mask, order, 0, totalpages,
+			NULL, "Out of memory (oom_kill_allocating_task)"))
 			return;
 retry:
 	/*
 	 * Rambo mode: Shoot down a process and hope it solves whatever
 	 * issues we may have.
 	 */
-	p = select_bad_process(&points, NULL, constraint, mask);
+	p = select_bad_process(&points, totalpages, NULL, constraint, mask);
 
 	if (PTR_ERR(p) == -1UL)
 		return;
@@ -607,7 +630,7 @@ retry:
 		panic("Out of memory and no killable processes...\n");
 	}
 
-	if (oom_kill_process(p, gfp_mask, order, points, NULL,
+	if (oom_kill_process(p, gfp_mask, order, points, totalpages, NULL,
 			     "Out of memory"))
 		goto retry;
 }
@@ -618,6 +641,7 @@ retry:
  */
 void pagefault_out_of_memory(void)
 {
+	unsigned long totalpages;
 	unsigned long freed = 0;
 
 	blocking_notifier_call_chain(&oom_notify_list, 0, &freed);
@@ -635,9 +659,14 @@ void pagefault_out_of_memory(void)
 	if (sysctl_panic_on_oom)
 		panic("out of memory from page fault. panic_on_oom is selected.\n");
 
+	totalpages = global_page_state(NR_INACTIVE_ANON) +
+			global_page_state(NR_ACTIVE_ANON) +
+			global_page_state(NR_INACTIVE_FILE) +
+			global_page_state(NR_ACTIVE_FILE) +
+			total_swap_pages;
 	read_lock(&tasklist_lock);
 	/* unknown gfp_mask and order */
-	__out_of_memory(0, 0, CONSTRAINT_NONE, NULL);
+	__out_of_memory(0, 0, totalpages, CONSTRAINT_NONE, NULL);
 	read_unlock(&tasklist_lock);
 
 	/*
@@ -665,6 +694,7 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
 		int order, nodemask_t *nodemask)
 {
 	unsigned long freed = 0;
+	unsigned long totalpages = 0;
 	enum oom_constraint constraint;
 
 	blocking_notifier_call_chain(&oom_notify_list, 0, &freed);
@@ -676,7 +706,8 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
 	 * Check if there were limitations on the allocation (only relevant for
 	 * NUMA) that may require different handling.
 	 */
-	constraint = constrained_alloc(zonelist, gfp_mask, nodemask);
+	constraint = constrained_alloc(zonelist, gfp_mask, nodemask,
+								&totalpages);
 	read_lock(&tasklist_lock);
 	if (unlikely(sysctl_panic_on_oom)) {
 		/*
@@ -689,7 +720,7 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
 			panic("Out of memory: panic_on_oom is enabled\n");
 		}
 	}
-	__out_of_memory(gfp_mask, order, constraint, nodemask);
+	__out_of_memory(gfp_mask, order, totalpages, constraint, nodemask);
 	read_unlock(&tasklist_lock);
 
 	/*


* [patch -mm 5/9 v2] oom: badness heuristic rewrite
@ 2010-02-15 22:20   ` David Rientjes
  0 siblings, 0 replies; 145+ messages in thread
From: David Rientjes @ 2010-02-15 22:20 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Rik van Riel, KAMEZAWA Hiroyuki, Nick Piggin, Andrea Arcangeli,
	Balbir Singh, Lubos Lunak, KOSAKI Motohiro, linux-kernel,
	linux-mm

This is a complete rewrite of the oom killer's badness() heuristic, which is
used to determine which task to kill in oom conditions.  The goal is to
make it as simple and predictable as possible so the results are better
understood and we end up killing the task which will lead to the most
memory freeing while still respecting the fine-tuning from userspace.

The baseline for the heuristic is a proportion of memory that each task
is currently using in memory plus swap compared to the amount of
"allowable" memory.  "Allowable," in this sense, means the system-wide
resources for unconstrained oom conditions, the set of mempolicy nodes,
the mems attached to current's cpuset, or a memory controller's limit.
The proportion is given on a scale of 0 (never kill) to 1000 (always
kill), roughly meaning that a task with a badness() score of 500
consumes approximately 50% of allowable memory resident in RAM or in
swap space.
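For illustration, the baseline can be modelled in a few lines of userspace
C.  This is only a sketch of the arithmetic with illustrative names
(badness_baseline and the page-count parameters are not kernel symbols):

```c
/*
 * Sketch of the new baseline: the proportion of allowable memory
 * (in pages) that a task's rss plus swap use represents, on a
 * 0..1000 scale.
 */
unsigned int badness_baseline(unsigned long rss_pages,
			      unsigned long swap_pages,
			      unsigned long totalpages)
{
	unsigned long points = (rss_pages + swap_pages) * 1000 / totalpages;

	/* clamp to the top of the scale */
	return points <= 1000 ? (unsigned int)points : 1000;
}
```

A task using half of its allowable memory, whether resident or swapped,
scores 500 under this model.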

The proportion is always relative to the amount of "allowable" memory and
not the total amount of RAM systemwide so that mempolicies and cpusets
may operate in isolation; they shall not need to know the true size of
the machine on which they are running if they are bound to a specific set
of nodes or mems, respectively.

Forkbomb detection is done in a completely different way: a threshold is
configurable from userspace to determine how many first-generation execve
children (those with their own address spaces) a task may have before it
is considered a forkbomb.  This can be tuned by altering the value in
/proc/sys/vm/oom_forkbomb_thres, which defaults to 1000.

When a task has more than 1000 first-generation children with address
spaces different from its own, a penalty of

	(average rss of children) * (# of 1st generation execve children)
	-----------------------------------------------------------------
			oom_forkbomb_thres

is assessed.  So, for example, using the default oom_forkbomb_thres of
1000, the penalty is twice the average rss of all its execve children if
there are 2000 such tasks.  A task is considered to count toward the
threshold if its total runtime is less than one second; for 1000 of such
tasks to exist, the parent process must be forking at an extremely high
rate either erroneously or maliciously.
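The penalty arithmetic can be sketched in userspace C as follows (a
hedged model, not kernel code; total_child_rss stands for the summed rss,
in pages, of the qualifying first-generation execve children):

```c
/*
 * Sketch of the forkbomb penalty: total_child_rss / thres is
 * equivalent to (average rss of children) * forkcount / thres.
 */
unsigned long forkbomb_penalty(unsigned long total_child_rss,
			       int forkcount, int thres)
{
	if (!thres)
		return 0;	/* a threshold of 0 disables detection */
	return forkcount > thres ? total_child_rss / thres : 0;
}
```

With the default threshold of 1000, a task with 2000 qualifying children
averaging 100 pages each is penalized 200 pages, twice the average child
rss, matching the example above.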

Even though a particular task may be designated a forkbomb and selected
as the victim, the oom killer will still kill the 1st generation execve
child with the highest badness() score in its place.  This avoids killing
important servers or system daemons.  When a web server forks a very
large number of threads for client connections, for example, it is much
better to kill one of those threads than to kill the server and make it
unresponsive.

Root tasks are given 3% extra memory just like __vm_enough_memory()
provides in LSMs.  In the event of two tasks consuming similar amounts of
memory, it is generally better to save root's task.

Because of the change in the badness() heuristic's baseline, it is also
necessary to introduce a new user interface to tune it.  It's not
possible to redefine the meaning of /proc/pid/oom_adj with a new scale
since the ABI cannot be changed for backward compatibility.  Instead, a
new tunable, /proc/pid/oom_score_adj, is added that ranges from -1000 to
+1000.  It may be used to polarize the heuristic such that certain tasks
are never considered for oom kill while others may always be considered.
The value is added directly into the badness() score so a value of -500,
for example, means to discount 50% of its memory consumption in
comparison to other tasks either on the system, bound to the mempolicy,
in the cpuset, or sharing the same memory controller.
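How the baseline, root bonus, and oom_score_adj combine can be sketched in
userspace C (an illustrative model of the patch's arithmetic; final_score
is not a kernel symbol):

```c
/*
 * Sketch of how the adjustments combine: subtract the 3% root
 * bonus, add oom_score_adj (-1000..+1000), clamp to 0..1000.
 */
unsigned int final_score(unsigned int baseline, int is_root,
			 int oom_score_adj)
{
	int points = (int)baseline;

	if (is_root)
		points -= 30;		/* 3% of the 1000-point scale */
	points += oom_score_adj;
	if (points < 0)
		return 0;
	return points <= 1000 ? (unsigned int)points : 1000;
}
```

Since the baseline never exceeds 1000, an oom_score_adj of -1000 always
yields 0, which is why OOM_SCORE_ADJ_MIN disables oom killing for a task.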

/proc/pid/oom_adj is changed so that its meaning is rescaled into the
units used by /proc/pid/oom_score_adj, and vice versa.  Changing one of
these per-task tunables will rescale the value of the other to an
equivalent meaning.  Although /proc/pid/oom_adj was originally defined as
a bitshift on the badness score, it now shares the same linear growth as
/proc/pid/oom_score_adj but with different granularity.  This is required
so the ABI is not broken with userspace applications and allows oom_adj
to be deprecated for future removal.
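The linear rescaling can be sketched in userspace C; the function below
mirrors the scaling done in oom_adjust_write() in this patch, using the
constant values the patch defines:

```c
#define OOM_DISABLE		(-17)
#define OOM_ADJUST_MAX		15
#define OOM_SCORE_ADJ_MAX	1000

/* Sketch of the oom_adj -> oom_score_adj rescaling */
int oom_adj_to_score_adj(int oom_adj)
{
	/* special-cased so that +15 still reaches the maximum */
	if (oom_adj == OOM_ADJUST_MAX)
		return OOM_SCORE_ADJ_MAX;
	return oom_adj * OOM_SCORE_ADJ_MAX / -OOM_DISABLE;
}
```

OOM_DISABLE (-17) maps to OOM_SCORE_ADJ_MIN (-1000), preserving the "never
kill" semantics across both interfaces.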

Signed-off-by: David Rientjes <rientjes@google.com>
---
 Documentation/filesystems/proc.txt |   97 +++++++----
 Documentation/sysctl/vm.txt        |   21 +++
 fs/proc/base.c                     |   98 +++++++++++-
 include/linux/oom.h                |   18 ++-
 include/linux/sched.h              |    3 +-
 kernel/fork.c                      |    1 +
 kernel/sysctl.c                    |    8 +
 mm/oom_kill.c                      |  319 ++++++++++++++++++++----------------
 8 files changed, 379 insertions(+), 186 deletions(-)

diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -33,7 +33,8 @@ Table of Contents
   2	Modifying System Parameters
 
   3	Per-Process Parameters
-  3.1	/proc/<pid>/oom_adj - Adjust the oom-killer score
+  3.1	/proc/<pid>/oom_adj & /proc/<pid>/oom_score_adj - Adjust the oom-killer
+								score
   3.2	/proc/<pid>/oom_score - Display current oom-killer score
   3.3	/proc/<pid>/io - Display the IO accounting fields
   3.4	/proc/<pid>/coredump_filter - Core dump filtering settings
@@ -1193,42 +1194,64 @@ of the kernel.
 CHAPTER 3: PER-PROCESS PARAMETERS
 ------------------------------------------------------------------------------
 
-3.1 /proc/<pid>/oom_adj - Adjust the oom-killer score
-------------------------------------------------------
-
-This file can be used to adjust the score used to select which processes
-should be killed in an  out-of-memory  situation.  Giving it a high score will
-increase the likelihood of this process being killed by the oom-killer.  Valid
-values are in the range -16 to +15, plus the special value -17, which disables
-oom-killing altogether for this process.
-
-The process to be killed in an out-of-memory situation is selected among all others
-based on its badness score. This value equals the original memory size of the process
-and is then updated according to its CPU time (utime + stime) and the
-run time (uptime - start time). The longer it runs the smaller is the score.
-Badness score is divided by the square root of the CPU time and then by
-the double square root of the run time.
-
-Swapped out tasks are killed first. Half of each child's memory size is added to
-the parent's score if they do not share the same memory. Thus forking servers
-are the prime candidates to be killed. Having only one 'hungry' child will make
-parent less preferable than the child.
-
-/proc/<pid>/oom_score shows process' current badness score.
-
-The following heuristics are then applied:
- * if the task was reniced, its score doubles
- * superuser or direct hardware access tasks (CAP_SYS_ADMIN, CAP_SYS_RESOURCE
- 	or CAP_SYS_RAWIO) have their score divided by 4
- * if oom condition happened in one cpuset and checked process does not belong
- 	to it, its score is divided by 8
- * the resulting score is multiplied by two to the power of oom_adj, i.e.
-	points <<= oom_adj when it is positive and
-	points >>= -(oom_adj) otherwise
-
-The task with the highest badness score is then selected and its children
-are killed, process itself will be killed in an OOM situation when it does
-not have children or some of them disabled oom like described above.
+3.1 /proc/<pid>/oom_adj & /proc/<pid>/oom_score_adj - Adjust the oom-killer score
+--------------------------------------------------------------------------------
+
+These files can be used to adjust the badness heuristic used to select which
+process gets killed in out of memory conditions.
+
+The badness heuristic assigns a value to each candidate task ranging from 0
+(never kill) to 1000 (always kill) to determine which process is targeted.  The
+units are roughly a proportion of the allowed memory that the process is
+using, based on an estimate of its current memory and swap consumption.
+For example, if a task is using all allowed memory, its badness score will be
+1000.  If it is using half of its allowed memory, its score will be 500.
+
+There are a couple of additional factors included in the badness score: root
+processes are given 3% extra memory over other tasks, and tasks which fork an
+excessive number of child processes are penalized by their average size.
+The number of child processes considered to be a forkbomb is configurable
+via /proc/sys/vm/oom_forkbomb_thres (see Documentation/sysctl/vm.txt).
+
+The amount of "allowed" memory depends on the context in which the oom killer
+was called.  If it is due to the memory assigned to the allocating task's cpuset
+being exhausted, the allowed memory represents the set of mems assigned to that
+cpuset.  If it is due to a mempolicy's node(s) being exhausted, the allowed
+memory represents the set of mempolicy nodes.  If it is due to a memory
+limit (or swap limit) being reached, the allowed memory is that configured
+limit.  Finally, if it is due to the entire system being out of memory, the
+allowed memory represents all allocatable resources.
+
+The value of /proc/<pid>/oom_score_adj is added to the badness score before it
+is used to determine which task to kill.  Acceptable values range from -1000
+(OOM_SCORE_ADJ_MIN) to +1000 (OOM_SCORE_ADJ_MAX).  This allows userspace to
+polarize the preference for oom killing either by always preferring a certain
+task or completely disabling it.  The lowest possible value, -1000, is
+equivalent to disabling oom killing entirely for that task since it will always
+report a badness score of 0.
+
+Consequently, it is very simple for userspace to define the amount of memory to
+consider for each task.  Setting a /proc/<pid>/oom_score_adj value of +500, for
+example, is roughly equivalent to allowing the remainder of tasks sharing the
+same system, cpuset, mempolicy, or memory controller resources to use at least
+50% more memory.  A value of -500, on the other hand, would be roughly
+equivalent to discounting 50% of the task's allowed memory from being considered
+as scoring against the task.
+
+For backwards compatibility with previous kernels, /proc/<pid>/oom_adj may also
+be used to tune the badness score.  Its acceptable values range from -16
+(OOM_ADJUST_MIN) to +15 (OOM_ADJUST_MAX) and a special value of -17
+(OOM_DISABLE) to disable oom killing entirely for that task.  Its value is
+scaled linearly with /proc/<pid>/oom_score_adj.
+
+Writing to /proc/<pid>/oom_score_adj or /proc/<pid>/oom_adj will change the
+other with its scaled value.
+
+Caveat: when a parent task is selected, the oom killer will sacrifice any first
+generation children with separate address spaces instead, if possible.  This
+prevents servers and important system daemons from being killed and loses the
+minimum amount of work.
+
 
 3.2 /proc/<pid>/oom_score - Display current oom-killer score
 -------------------------------------------------------------
diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
--- a/Documentation/sysctl/vm.txt
+++ b/Documentation/sysctl/vm.txt
@@ -44,6 +44,7 @@ Currently, these files are in /proc/sys/vm:
 - nr_trim_pages         (only if CONFIG_MMU=n)
 - numa_zonelist_order
 - oom_dump_tasks
+- oom_forkbomb_thres
 - oom_kill_allocating_task
 - overcommit_memory
 - overcommit_ratio
@@ -490,6 +491,26 @@ The default value is 0.
 
 ==============================================================
 
+oom_forkbomb_thres
+
+This value defines how many children with a separate address space a specific
+task may have before being considered as a possible forkbomb.  Tasks with more
+children not sharing the same address space as the parent will be penalized by a
+quantity of memory equal to
+
+	(average rss of execve children) * (# of 1st generation execve children)
+	------------------------------------------------------------------------
+				oom_forkbomb_thres
+
+in the oom killer's badness heuristic.  Such tasks may be protected with a lower
+oom_adj value (see Documentation/filesystems/proc.txt) if necessary.
+
+A value of 0 will disable forkbomb detection.
+
+The default value is 1000.
+
+==============================================================
+
 oom_kill_allocating_task
 
 This enables or disables killing the OOM-triggering task in
diff --git a/fs/proc/base.c b/fs/proc/base.c
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -81,6 +81,7 @@
 #include <linux/elf.h>
 #include <linux/pid_namespace.h>
 #include <linux/fs_struct.h>
+#include <linux/swap.h>
 #include "internal.h"
 
 /* NOTE:
@@ -458,7 +459,6 @@ static const struct file_operations proc_lstats_operations = {
 #endif
 
 /* The badness from the OOM killer */
-unsigned long badness(struct task_struct *p, unsigned long uptime);
 static int proc_oom_score(struct task_struct *task, char *buffer)
 {
 	unsigned long points;
@@ -466,7 +466,13 @@ static int proc_oom_score(struct task_struct *task, char *buffer)
 
 	do_posix_clock_monotonic_gettime(&uptime);
 	read_lock(&tasklist_lock);
-	points = badness(task->group_leader, uptime.tv_sec);
+	points = oom_badness(task->group_leader,
+				global_page_state(NR_INACTIVE_ANON) +
+				global_page_state(NR_ACTIVE_ANON) +
+				global_page_state(NR_INACTIVE_FILE) +
+				global_page_state(NR_ACTIVE_FILE) +
+				total_swap_pages,
+				uptime.tv_sec);
 	read_unlock(&tasklist_lock);
 	return sprintf(buffer, "%lu\n", points);
 }
@@ -1152,7 +1158,15 @@ static ssize_t oom_adjust_write(struct file *file, const char __user *buf,
 	}
 
 	task->signal->oom_adj = oom_adjust;
-
+	/*
+	 * Scale /proc/pid/oom_score_adj appropriately ensuring that a maximum
+	 * value is always attainable.
+	 */
+	if (task->signal->oom_adj == OOM_ADJUST_MAX)
+		task->signal->oom_score_adj = OOM_SCORE_ADJ_MAX;
+	else
+		task->signal->oom_score_adj = (oom_adjust * OOM_SCORE_ADJ_MAX) /
+								-OOM_DISABLE;
 	unlock_task_sighand(task, &flags);
 	put_task_struct(task);
 
@@ -1164,6 +1178,82 @@ static const struct file_operations proc_oom_adjust_operations = {
 	.write		= oom_adjust_write,
 };
 
+static ssize_t oom_score_adj_read(struct file *file, char __user *buf,
+					size_t count, loff_t *ppos)
+{
+	struct task_struct *task = get_proc_task(file->f_path.dentry->d_inode);
+	char buffer[PROC_NUMBUF];
+	int oom_score_adj = OOM_SCORE_ADJ_MIN;
+	unsigned long flags;
+	size_t len;
+
+	if (!task)
+		return -ESRCH;
+	if (lock_task_sighand(task, &flags)) {
+		oom_score_adj = task->signal->oom_score_adj;
+		unlock_task_sighand(task, &flags);
+	}
+	put_task_struct(task);
+	len = snprintf(buffer, sizeof(buffer), "%d\n", oom_score_adj);
+	return simple_read_from_buffer(buf, count, ppos, buffer, len);
+}
+
+static ssize_t oom_score_adj_write(struct file *file, const char __user *buf,
+					size_t count, loff_t *ppos)
+{
+	struct task_struct *task;
+	char buffer[PROC_NUMBUF];
+	unsigned long flags;
+	long oom_score_adj;
+	int err;
+
+	memset(buffer, 0, sizeof(buffer));
+	if (count > sizeof(buffer) - 1)
+		count = sizeof(buffer) - 1;
+	if (copy_from_user(buffer, buf, count))
+		return -EFAULT;
+
+	err = strict_strtol(strstrip(buffer), 0, &oom_score_adj);
+	if (err)
+		return -EINVAL;
+	if (oom_score_adj < OOM_SCORE_ADJ_MIN ||
+			oom_score_adj > OOM_SCORE_ADJ_MAX)
+		return -EINVAL;
+
+	task = get_proc_task(file->f_path.dentry->d_inode);
+	if (!task)
+		return -ESRCH;
+	if (!lock_task_sighand(task, &flags)) {
+		put_task_struct(task);
+		return -ESRCH;
+	}
+	if (oom_score_adj < task->signal->oom_score_adj &&
+			!capable(CAP_SYS_RESOURCE)) {
+		unlock_task_sighand(task, &flags);
+		put_task_struct(task);
+		return -EACCES;
+	}
+
+	task->signal->oom_score_adj = oom_score_adj;
+	/*
+	 * Scale /proc/pid/oom_adj appropriately ensuring that OOM_DISABLE is
+	 * always attainable.
+	 */
+	if (task->signal->oom_score_adj == OOM_SCORE_ADJ_MIN)
+		task->signal->oom_adj = OOM_DISABLE;
+	else
+		task->signal->oom_adj = (oom_score_adj * OOM_ADJUST_MAX) /
+							OOM_SCORE_ADJ_MAX;
+	unlock_task_sighand(task, &flags);
+	put_task_struct(task);
+	return count;
+}
+
+static const struct file_operations proc_oom_score_adj_operations = {
+	.read		= oom_score_adj_read,
+	.write		= oom_score_adj_write,
+};
+
 #ifdef CONFIG_AUDITSYSCALL
 #define TMPBUFLEN 21
 static ssize_t proc_loginuid_read(struct file * file, char __user * buf,
@@ -2727,6 +2817,7 @@ static const struct pid_entry tgid_base_stuff[] = {
 #endif
 	INF("oom_score",  S_IRUGO, proc_oom_score),
 	REG("oom_adj",    S_IRUGO|S_IWUSR, proc_oom_adjust_operations),
+	REG("oom_score_adj", S_IRUGO|S_IWUSR, proc_oom_score_adj_operations),
 #ifdef CONFIG_AUDITSYSCALL
 	REG("loginuid",   S_IWUSR|S_IRUGO, proc_loginuid_operations),
 	REG("sessionid",  S_IRUGO, proc_sessionid_operations),
@@ -3061,6 +3152,7 @@ static const struct pid_entry tid_base_stuff[] = {
 #endif
 	INF("oom_score", S_IRUGO, proc_oom_score),
 	REG("oom_adj",   S_IRUGO|S_IWUSR, proc_oom_adjust_operations),
+	REG("oom_score_adj", S_IRUGO|S_IWUSR, proc_oom_score_adj_operations),
 #ifdef CONFIG_AUDITSYSCALL
 	REG("loginuid",  S_IWUSR|S_IRUGO, proc_loginuid_operations),
 	REG("sessionid",  S_IRUSR, proc_sessionid_operations),
diff --git a/include/linux/oom.h b/include/linux/oom.h
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -1,14 +1,27 @@
 #ifndef __INCLUDE_LINUX_OOM_H
 #define __INCLUDE_LINUX_OOM_H
 
-/* /proc/<pid>/oom_adj set to -17 protects from the oom-killer */
+/*
+ * /proc/<pid>/oom_adj set to -17 protects from the oom-killer
+ */
 #define OOM_DISABLE (-17)
 /* inclusive */
 #define OOM_ADJUST_MIN (-16)
 #define OOM_ADJUST_MAX 15
 
+/*
+ * /proc/<pid>/oom_score_adj set to OOM_SCORE_ADJ_MIN disables oom killing for
+ * pid.
+ */
+#define OOM_SCORE_ADJ_MIN	(-1000)
+#define OOM_SCORE_ADJ_MAX	1000
+
+/* See Documentation/sysctl/vm.txt */
+#define DEFAULT_OOM_FORKBOMB_THRES	1000
+
 #ifdef __KERNEL__
 
+#include <linux/sched.h>
 #include <linux/types.h>
 #include <linux/nodemask.h>
 
@@ -24,6 +37,8 @@ enum oom_constraint {
 	CONSTRAINT_MEMORY_POLICY,
 };
 
+extern unsigned int oom_badness(struct task_struct *p,
+			unsigned long totalpages, unsigned long uptime);
 extern int try_set_zone_oom(struct zonelist *zonelist, gfp_t gfp_flags);
 extern void clear_zonelist_oom(struct zonelist *zonelist, gfp_t gfp_flags);
 
@@ -47,6 +62,7 @@ static inline void oom_killer_enable(void)
 extern int sysctl_panic_on_oom;
 extern int sysctl_oom_kill_allocating_task;
 extern int sysctl_oom_dump_tasks;
+extern int sysctl_oom_forkbomb_thres;
 
 #endif /* __KERNEL__*/
 #endif /* _INCLUDE_LINUX_OOM_H */
diff --git a/include/linux/sched.h b/include/linux/sched.h
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -624,7 +624,8 @@ struct signal_struct {
 	struct tty_audit_buf *tty_audit_buf;
 #endif
 
-	int oom_adj;	/* OOM kill score adjustment (bit shift) */
+	int oom_adj;		/* OOM kill score adjustment (bit shift) */
+	int oom_score_adj;	/* OOM kill score adjustment */
 };
 
 /* Context switch must be unlocked if interrupts are to be enabled */
diff --git a/kernel/fork.c b/kernel/fork.c
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -878,6 +878,7 @@ static int copy_signal(unsigned long clone_flags, struct task_struct *tsk)
 	tty_audit_fork(sig);
 
 	sig->oom_adj = current->signal->oom_adj;
+	sig->oom_score_adj = current->signal->oom_score_adj;
 
 	return 0;
 }
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -955,6 +955,14 @@ static struct ctl_table vm_table[] = {
 		.proc_handler	= proc_dointvec,
 	},
 	{
+		.procname	= "oom_forkbomb_thres",
+		.data		= &sysctl_oom_forkbomb_thres,
+		.maxlen		= sizeof(sysctl_oom_forkbomb_thres),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= &zero,
+	},
+	{
 		.procname	= "overcommit_ratio",
 		.data		= &sysctl_overcommit_ratio,
 		.maxlen		= sizeof(sysctl_overcommit_ratio),
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -4,6 +4,8 @@
  *  Copyright (C)  1998,2000  Rik van Riel
  *	Thanks go out to Claus Fischer for some serious inspiration and
  *	for goading me into coding this file...
+ *  Copyright (C)  2010  Google, Inc
+ *	Rewritten by David Rientjes
  *
  *  The routines in this file are used to kill a process when
  *  we're seriously out of memory. This gets called from __alloc_pages()
@@ -32,8 +34,8 @@
 int sysctl_panic_on_oom;
 int sysctl_oom_kill_allocating_task;
 int sysctl_oom_dump_tasks;
+int sysctl_oom_forkbomb_thres = DEFAULT_OOM_FORKBOMB_THRES;
 static DEFINE_SPINLOCK(zone_scan_lock);
-/* #define DEBUG */
 
 /*
  * Do all threads of the target process overlap our allowed nodes?
@@ -68,138 +70,129 @@ static bool has_intersects_mems_allowed(struct task_struct *tsk,
 	return false;
 }
 
-/**
- * badness - calculate a numeric value for how bad this task has been
- * @p: task struct of which task we should calculate
- * @uptime: current uptime in seconds
+/*
+ * Tasks that fork a very large number of children with separate address spaces
+ * may be the result of a bug, user error, or a malicious application, or may
+ * even serve a legitimate purpose.  The oom killer assesses a penalty equaling
  *
- * The formula used is relatively simple and documented inline in the
- * function. The main rationale is that we want to select a good task
- * to kill when we run out of memory.
+ *	(average rss of children) * (# of 1st generation execve children)
+ *	-----------------------------------------------------------------
+ *			sysctl_oom_forkbomb_thres
  *
- * Good in this context means that:
- * 1) we lose the minimum amount of work done
- * 2) we recover a large amount of memory
- * 3) we don't kill anything innocent of eating tons of memory
- * 4) we want to kill the minimum amount of processes (one)
- * 5) we try to kill the process the user expects us to kill, this
- *    algorithm has been meticulously tuned to meet the principle
- *    of least surprise ... (be careful when you change it)
+ * for such tasks to target the parent.  oom_kill_process() will attempt to
+ * first kill a child, so there's no risk of killing an important system daemon
+ * via this method.  A web server, for example, may fork a very large number of
+ * threads to respond to client connections; it's much better to kill a child
+ * than to kill the parent, making the server unresponsive.  The goal here is
+ * to give the user a chance to recover from the error rather than deplete all
+ * memory such that the system is unusable; it's not meant to effect a forkbomb
+ * policy.
  */
-
-unsigned long badness(struct task_struct *p, unsigned long uptime)
+static unsigned long oom_forkbomb_penalty(struct task_struct *tsk)
 {
-	unsigned long points, cpu_time, run_time;
-	struct mm_struct *mm;
 	struct task_struct *child;
-	int oom_adj = p->signal->oom_adj;
-	struct task_cputime task_time;
-	unsigned long utime;
-	unsigned long stime;
+	unsigned long child_rss = 0;
+	int forkcount = 0;
 
-	if (oom_adj == OOM_DISABLE)
+	if (!sysctl_oom_forkbomb_thres)
 		return 0;
+	list_for_each_entry(child, &tsk->children, sibling) {
+		struct task_cputime task_time;
+		unsigned long runtime;
 
-	task_lock(p);
-	mm = p->mm;
-	if (!mm) {
-		task_unlock(p);
-		return 0;
+		task_lock(child);
+		if (!child->mm || child->mm == tsk->mm) {
+			task_unlock(child);
+			continue;
+		}
+		thread_group_cputime(child, &task_time);
+		runtime = cputime_to_jiffies(task_time.utime) +
+			  cputime_to_jiffies(task_time.stime);
+		/*
+		 * Only threads that have run for less than a second are
+		 * considered toward the forkbomb penalty; these threads rarely
+		 * get to execute at all in such cases anyway.
+		 */
+		if (runtime < HZ) {
+			child_rss += get_mm_rss(child->mm);
+			forkcount++;
+		}
+		task_unlock(child);
 	}
 
 	/*
-	 * The memory size of the process is the basis for the badness.
+	 * Forkbombs get penalized by:
+	 *
+	 * (average rss of children) * (# of first-generation execve children) /
+	 *			sysctl_oom_forkbomb_thres
 	 */
-	points = mm->total_vm;
+	return forkcount > sysctl_oom_forkbomb_thres ?
+				(child_rss / sysctl_oom_forkbomb_thres) : 0;
+}
+
+/**
+ * oom_badness - heuristic function to determine which candidate task to kill
+ * @p: task struct of which task we should calculate
+ * @totalpages: total present RAM allowed for page allocation
+ * @uptime: current uptime in seconds
+ *
+ * The heuristic for determining which task to kill is made to be as simple and
+ * predictable as possible.  The goal is to return the highest value for the
+ * task consuming the most memory to avoid subsequent oom conditions.
+ */
+unsigned int oom_badness(struct task_struct *p, unsigned long totalpages,
+							unsigned long uptime)
+{
+	struct mm_struct *mm;
+	int points;
 
 	/*
-	 * After this unlock we can no longer dereference local variable `mm'
+	 * Shortcut check for OOM_SCORE_ADJ_MIN so the entire heuristic doesn't
+	 * need to be executed for something that can't be killed.
 	 */
-	task_unlock(p);
+	if (p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN)
+		return 0;
 
 	/*
-	 * swapoff can easily use up all memory, so kill those first.
+	 * When the PF_OOM_ORIGIN bit is set, it indicates the task should have
+	 * priority for oom killing.
 	 */
 	if (p->flags & PF_OOM_ORIGIN)
-		return ULONG_MAX;
+		return 1000;
 
-	/*
-	 * Processes which fork a lot of child processes are likely
-	 * a good choice. We add half the vmsize of the children if they
-	 * have an own mm. This prevents forking servers to flood the
-	 * machine with an endless amount of children. In case a single
-	 * child is eating the vast majority of memory, adding only half
-	 * to the parents will make the child our kill candidate of choice.
-	 */
-	list_for_each_entry(child, &p->children, sibling) {
-		task_lock(child);
-		if (child->mm != mm && child->mm)
-			points += child->mm->total_vm/2 + 1;
-		task_unlock(child);
+	task_lock(p);
+	mm = p->mm;
+	if (!mm) {
+		task_unlock(p);
+		return 0;
 	}
 
 	/*
-	 * CPU time is in tens of seconds and run time is in thousands
-         * of seconds. There is no particular reason for this other than
-         * that it turned out to work very well in practice.
-	 */
-	thread_group_cputime(p, &task_time);
-	utime = cputime_to_jiffies(task_time.utime);
-	stime = cputime_to_jiffies(task_time.stime);
-	cpu_time = (utime + stime) >> (SHIFT_HZ + 3);
-
-
-	if (uptime >= p->start_time.tv_sec)
-		run_time = (uptime - p->start_time.tv_sec) >> 10;
-	else
-		run_time = 0;
-
-	if (cpu_time)
-		points /= int_sqrt(cpu_time);
-	if (run_time)
-		points /= int_sqrt(int_sqrt(run_time));
-
-	/*
-	 * Niced processes are most likely less important, so double
-	 * their badness points.
+	 * The baseline for the badness score is the proportion of RAM that each
+	 * task's rss and swap space use.
 	 */
-	if (task_nice(p) > 0)
-		points *= 2;
-
-	/*
-	 * Superuser processes are usually more important, so we make it
-	 * less likely that we kill those.
-	 */
-	if (has_capability_noaudit(p, CAP_SYS_ADMIN) ||
-	    has_capability_noaudit(p, CAP_SYS_RESOURCE))
-		points /= 4;
+	points = (get_mm_rss(mm) + get_mm_counter(mm, MM_SWAPENTS)) * 1000 /
+			totalpages;
+	task_unlock(p);
+	points += oom_forkbomb_penalty(p);
 
 	/*
-	 * We don't want to kill a process with direct hardware access.
-	 * Not only could that mess up the hardware, but usually users
-	 * tend to only have this flag set on applications they think
-	 * of as important.
+	 * Root processes get 3% bonus, just like the __vm_enough_memory() used
+	 * by LSMs.
 	 */
-	if (has_capability_noaudit(p, CAP_SYS_RAWIO))
-		points /= 4;
+	if (has_capability_noaudit(p, CAP_SYS_ADMIN))
+		points -= 30;
 
 	/*
-	 * Adjust the score by oom_adj.
+	 * /proc/pid/oom_score_adj ranges from -1000 to +1000 such that the
+	 * range may either completely disable oom killing or always prefer a
+	 * certain task.
 	 */
-	if (oom_adj) {
-		if (oom_adj > 0) {
-			if (!points)
-				points = 1;
-			points <<= oom_adj;
-		} else
-			points >>= -(oom_adj);
-	}
+	points += p->signal->oom_score_adj;
 
-#ifdef DEBUG
-	printk(KERN_DEBUG "OOMkill: task %d (%s) got %lu points\n",
-	p->pid, p->comm, points);
-#endif
-	return points;
+	if (points < 0)
+		return 0;
+	return (points <= 1000) ? points : 1000;
 }
 
 /*
@@ -207,11 +200,21 @@ unsigned long badness(struct task_struct *p, unsigned long uptime)
  */
 #ifdef CONFIG_NUMA
 static enum oom_constraint constrained_alloc(struct zonelist *zonelist,
-				    gfp_t gfp_mask, nodemask_t *nodemask)
+				gfp_t gfp_mask, nodemask_t *nodemask,
+				unsigned long *totalpages)
 {
 	struct zone *zone;
 	struct zoneref *z;
 	enum zone_type high_zoneidx = gfp_zone(gfp_mask);
+	bool cpuset_limited = false;
+	int nid;
+
+	/* Default to all anonymous memory, page cache, and swap */
+	*totalpages = global_page_state(NR_INACTIVE_ANON) +
+			global_page_state(NR_ACTIVE_ANON) +
+			global_page_state(NR_INACTIVE_FILE) +
+			global_page_state(NR_ACTIVE_FILE) +
+			total_swap_pages;
 
 	/*
 	 * Reach here only when __GFP_NOFAIL is used. So, we should avoid
@@ -222,25 +225,41 @@ static enum oom_constraint constrained_alloc(struct zonelist *zonelist,
 		return CONSTRAINT_NONE;
 
 	/*
-	 * The nodemask here is a nodemask passed to alloc_pages(). Now,
-	 * cpuset doesn't use this nodemask for its hardwall/softwall/hierarchy
-	 * feature. mempolicy is an only user of nodemask here.
-	 * check mempolicy's nodemask contains all N_HIGH_MEMORY
+	 * This is not a __GFP_THISNODE allocation, so a truncated nodemask in
+	 * the page allocator means a mempolicy is in effect.  Cpuset policy
+	 * is enforced in get_page_from_freelist().
 	 */
-	if (nodemask && !nodes_subset(node_states[N_HIGH_MEMORY], *nodemask))
+	if (nodemask && !nodes_subset(node_states[N_HIGH_MEMORY], *nodemask)) {
+		*totalpages = total_swap_pages;
+		for_each_node_mask(nid, *nodemask)
+			*totalpages += node_page_state(nid, NR_INACTIVE_ANON) +
+					node_page_state(nid, NR_ACTIVE_ANON) +
+					node_page_state(nid, NR_INACTIVE_FILE) +
+					node_page_state(nid, NR_ACTIVE_FILE);
 		return CONSTRAINT_MEMORY_POLICY;
+	}
 
 	/* Check this allocation failure is caused by cpuset's wall function */
 	for_each_zone_zonelist_nodemask(zone, z, zonelist,
 			high_zoneidx, nodemask)
 		if (!cpuset_zone_allowed_softwall(zone, gfp_mask))
-			return CONSTRAINT_CPUSET;
-
+			cpuset_limited = true;
+
+	if (cpuset_limited) {
+		*totalpages = total_swap_pages;
+		for_each_node_mask(nid, cpuset_current_mems_allowed)
+			*totalpages += node_page_state(nid, NR_INACTIVE_ANON) +
+					node_page_state(nid, NR_ACTIVE_ANON) +
+					node_page_state(nid, NR_INACTIVE_FILE) +
+					node_page_state(nid, NR_ACTIVE_FILE);
+		return CONSTRAINT_CPUSET;
+	}
 	return CONSTRAINT_NONE;
 }
 #else
 static enum oom_constraint constrained_alloc(struct zonelist *zonelist,
-				gfp_t gfp_mask, nodemask_t *nodemask)
+				gfp_t gfp_mask, nodemask_t *nodemask,
+				unsigned long *totalpages)
 {
 	return CONSTRAINT_NONE;
 }
@@ -252,9 +271,9 @@ static enum oom_constraint constrained_alloc(struct zonelist *zonelist,
  *
  * (not docbooked, we don't want this one cluttering up the manual)
  */
-static struct task_struct *select_bad_process(unsigned long *ppoints,
-		struct mem_cgroup *mem, enum oom_constraint constraint,
-		const nodemask_t *mask)
+static struct task_struct *select_bad_process(unsigned int *ppoints,
+		unsigned long totalpages, struct mem_cgroup *mem,
+		enum oom_constraint constraint, const nodemask_t *mask)
 {
 	struct task_struct *p;
 	struct task_struct *chosen = NULL;
@@ -263,7 +282,7 @@ static struct task_struct *select_bad_process(unsigned long *ppoints,
 
 	do_posix_clock_monotonic_gettime(&uptime);
 	for_each_process(p) {
-		unsigned long points;
+		unsigned int points;
 
 		/*
 		 * skip kernel threads and tasks which have already released
@@ -308,13 +327,13 @@ static struct task_struct *select_bad_process(unsigned long *ppoints,
 				return ERR_PTR(-1UL);
 
 			chosen = p;
-			*ppoints = ULONG_MAX;
+			*ppoints = 1000;
 		}
 
-		if (p->signal->oom_adj == OOM_DISABLE)
+		if (p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN)
 			continue;
 
-		points = badness(p, uptime.tv_sec);
+		points = oom_badness(p, totalpages, uptime.tv_sec);
 		if (points > *ppoints || !chosen) {
 			chosen = p;
 			*ppoints = points;
@@ -330,7 +349,7 @@ static struct task_struct *select_bad_process(unsigned long *ppoints,
  *
  * Dumps the current memory state of all system tasks, excluding kernel threads.
  * State information includes task's pid, uid, tgid, vm size, rss, cpu, oom_adj
- * score, and name.
+ * value, oom_score_adj value, and name.
  *
  * If the actual is non-NULL, only tasks that are a member of the mem_cgroup are
  * shown.
@@ -342,7 +361,7 @@ static void dump_tasks(const struct mem_cgroup *mem)
 	struct task_struct *g, *p;
 
 	printk(KERN_INFO "[ pid ]   uid  tgid total_vm      rss cpu oom_adj "
-	       "name\n");
+	       "oom_score_adj name\n");
 	do_each_thread(g, p) {
 		struct mm_struct *mm;
 
@@ -362,10 +381,10 @@ static void dump_tasks(const struct mem_cgroup *mem)
 			task_unlock(p);
 			continue;
 		}
-		printk(KERN_INFO "[%5d] %5d %5d %8lu %8lu %3d     %3d %s\n",
+		pr_info("[%5d] %5d %5d %8lu %8lu %3d     %3d          %4d %s\n",
 		       p->pid, __task_cred(p)->uid, p->tgid, mm->total_vm,
 		       get_mm_rss(mm), (int)task_cpu(p), p->signal->oom_adj,
-		       p->comm);
+		       p->signal->oom_score_adj, p->comm);
 		task_unlock(p);
 	} while_each_thread(g, p);
 }
@@ -374,8 +393,9 @@ static void dump_header(struct task_struct *p, gfp_t gfp_mask, int order,
 							struct mem_cgroup *mem)
 {
 	pr_warning("%s invoked oom-killer: gfp_mask=0x%x, order=%d, "
-		"oom_adj=%d\n",
-		current->comm, gfp_mask, order, current->signal->oom_adj);
+		"oom_adj=%d, oom_score_adj=%d\n",
+		current->comm, gfp_mask, order, current->signal->oom_adj,
+		current->signal->oom_score_adj);
 	task_lock(current);
 	cpuset_print_task_mems_allowed(current);
 	task_unlock(current);
@@ -440,7 +460,7 @@ static int oom_kill_task(struct task_struct *p)
 	 * change to NULL at any time since we do not hold task_lock(p).
 	 * However, this is of no concern to us.
 	 */
-	if (!p->mm || p->signal->oom_adj == OOM_DISABLE)
+	if (!p->mm || p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN)
 		return 1;
 
 	__oom_kill_task(p, 1);
@@ -449,12 +469,12 @@ static int oom_kill_task(struct task_struct *p)
 }
 
 static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
-			    unsigned long points, struct mem_cgroup *mem,
-			    const char *message)
+			    unsigned int points, unsigned long totalpages,
+			    struct mem_cgroup *mem, const char *message)
 {
 	struct task_struct *victim = p;
 	struct task_struct *c;
-	unsigned long victim_points = 0;
+	unsigned int victim_points = 0;
 	struct timespec uptime;
 
 	if (printk_ratelimit())
@@ -469,19 +489,19 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
 		return 0;
 	}
 
-	pr_err("%s: Kill process %d (%s) with score %lu or sacrifice child\n",
+	pr_err("%s: Kill process %d (%s) with score %d or sacrifice child\n",
 		message, task_pid_nr(p), p->comm, points);
 
 	/* Try to sacrifice the worst child first */
 	do_posix_clock_monotonic_gettime(&uptime);
 	list_for_each_entry(c, &p->children, sibling) {
-		unsigned long cpoints;
+		unsigned int cpoints;
 
 		if (c->mm == p->mm)
 			continue;
 
-		/* badness() returns 0 if the thread is unkillable */
-		cpoints = badness(c, uptime.tv_sec);
+		/* oom_badness() returns 0 if the thread is unkillable */
+		cpoints = oom_badness(c, totalpages, uptime.tv_sec);
 		if (cpoints > victim_points) {
 			victim = c;
 			victim_points = cpoints;
@@ -493,19 +513,22 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
 #ifdef CONFIG_CGROUP_MEM_RES_CTLR
 void mem_cgroup_out_of_memory(struct mem_cgroup *mem, gfp_t gfp_mask)
 {
-	unsigned long points = 0;
+	unsigned int points = 0;
 	struct task_struct *p;
+	unsigned long limit;
 
+	limit = (res_counter_read_u64(&mem->res, RES_LIMIT) >> PAGE_SHIFT) +
+		   (res_counter_read_u64(&mem->memsw, RES_LIMIT) >> PAGE_SHIFT);
 	read_lock(&tasklist_lock);
 retry:
-	p = select_bad_process(&points, mem, CONSTRAINT_NONE, NULL);
+	p = select_bad_process(&points, limit, mem, CONSTRAINT_NONE, NULL);
 	if (PTR_ERR(p) == -1UL)
 		goto out;
 
 	if (!p)
 		p = current;
 
-	if (oom_kill_process(p, gfp_mask, 0, points, mem,
+	if (oom_kill_process(p, gfp_mask, 0, points, limit, mem,
 				"Memory cgroup out of memory"))
 		goto retry;
 out:
@@ -580,22 +603,22 @@ void clear_zonelist_oom(struct zonelist *zonelist, gfp_t gfp_mask)
 /*
  * Must be called with tasklist_lock held for read.
  */
-static void __out_of_memory(gfp_t gfp_mask, int order,
+static void __out_of_memory(gfp_t gfp_mask, int order, unsigned long totalpages,
 			enum oom_constraint constraint, const nodemask_t *mask)
 {
 	struct task_struct *p;
-	unsigned long points;
+	unsigned int points;
 
 	if (sysctl_oom_kill_allocating_task)
-		if (!oom_kill_process(current, gfp_mask, order, 0, NULL,
-				"Out of memory (oom_kill_allocating_task)"))
+		if (!oom_kill_process(current, gfp_mask, order, 0, totalpages,
+			NULL, "Out of memory (oom_kill_allocating_task)"))
 			return;
 retry:
 	/*
 	 * Rambo mode: Shoot down a process and hope it solves whatever
 	 * issues we may have.
 	 */
-	p = select_bad_process(&points, NULL, constraint, mask);
+	p = select_bad_process(&points, totalpages, NULL, constraint, mask);
 
 	if (PTR_ERR(p) == -1UL)
 		return;
@@ -607,7 +630,7 @@ retry:
 		panic("Out of memory and no killable processes...\n");
 	}
 
-	if (oom_kill_process(p, gfp_mask, order, points, NULL,
+	if (oom_kill_process(p, gfp_mask, order, points, totalpages, NULL,
 			     "Out of memory"))
 		goto retry;
 }
@@ -618,6 +641,7 @@ retry:
  */
 void pagefault_out_of_memory(void)
 {
+	unsigned long totalpages;
 	unsigned long freed = 0;
 
 	blocking_notifier_call_chain(&oom_notify_list, 0, &freed);
@@ -635,9 +659,14 @@ void pagefault_out_of_memory(void)
 	if (sysctl_panic_on_oom)
 		panic("out of memory from page fault. panic_on_oom is selected.\n");
 
+	totalpages = global_page_state(NR_INACTIVE_ANON) +
+			global_page_state(NR_ACTIVE_ANON) +
+			global_page_state(NR_INACTIVE_FILE) +
+			global_page_state(NR_ACTIVE_FILE) +
+			total_swap_pages;
 	read_lock(&tasklist_lock);
 	/* unknown gfp_mask and order */
-	__out_of_memory(0, 0, CONSTRAINT_NONE, NULL);
+	__out_of_memory(0, 0, totalpages, CONSTRAINT_NONE, NULL);
 	read_unlock(&tasklist_lock);
 
 	/*
@@ -665,6 +694,7 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
 		int order, nodemask_t *nodemask)
 {
 	unsigned long freed = 0;
+	unsigned long totalpages = 0;
 	enum oom_constraint constraint;
 
 	blocking_notifier_call_chain(&oom_notify_list, 0, &freed);
@@ -676,7 +706,8 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
 	 * Check if there were limitations on the allocation (only relevant for
 	 * NUMA) that may require different handling.
 	 */
-	constraint = constrained_alloc(zonelist, gfp_mask, nodemask);
+	constraint = constrained_alloc(zonelist, gfp_mask, nodemask,
+								&totalpages);
 	read_lock(&tasklist_lock);
 	if (unlikely(sysctl_panic_on_oom)) {
 		/*
@@ -689,7 +720,7 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
 			panic("Out of memory: panic_on_oom is enabled\n");
 		}
 	}
-	__out_of_memory(gfp_mask, order, constraint, nodemask);
+	__out_of_memory(gfp_mask, order, totalpages, constraint, nodemask);
 	read_unlock(&tasklist_lock);
 
 	/*

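The arithmetic of the rewritten heuristic can be modeled in userspace.  The
following is a sketch, not the kernel code: model_oom_badness() and its
flattened arguments are illustrative, and the forkbomb penalty, uptime
argument, and task locking are omitted.

```c
/* Constants mirrored from include/linux/oom.h in this series. */
#define OOM_SCORE_ADJ_MIN	(-1000)
#define OOM_SCORE_ADJ_MAX	1000

/*
 * Userspace model of the rewritten badness score: rss + swap as a
 * proportion of totalpages scaled to 0..1000, a 3% bonus for root,
 * then the linear oom_score_adj, clamped to [0, 1000].
 */
static unsigned int model_oom_badness(unsigned long rss_pages,
				      unsigned long swap_pages,
				      unsigned long totalpages,
				      int oom_score_adj,
				      int has_cap_sys_admin)
{
	long points;

	/* OOM_SCORE_ADJ_MIN marks the task unkillable */
	if (oom_score_adj == OOM_SCORE_ADJ_MIN)
		return 0;

	points = (long)((rss_pages + swap_pages) * 1000 / totalpages);

	/* root bonus, analogous to the 3% in __vm_enough_memory() */
	if (has_cap_sys_admin)
		points -= 30;

	points += oom_score_adj;

	if (points < 0)
		return 0;
	return points <= OOM_SCORE_ADJ_MAX ?
			(unsigned int)points : OOM_SCORE_ADJ_MAX;
}
```

In this model a task using half of the available memory scores 500, a root
task 30 less, and oom_score_adj shifts the result linearly within [0, 1000].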

^ permalink raw reply	[flat|nested] 145+ messages in thread

* [patch -mm 6/9 v2] oom: deprecate oom_adj tunable
  2010-02-15 22:19 ` David Rientjes
@ 2010-02-15 22:20   ` David Rientjes
  -1 siblings, 0 replies; 145+ messages in thread
From: David Rientjes @ 2010-02-15 22:20 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Rik van Riel, KAMEZAWA Hiroyuki, Nick Piggin, Andrea Arcangeli,
	Balbir Singh, Lubos Lunak, KOSAKI Motohiro, linux-kernel,
	linux-mm

/proc/pid/oom_adj is now deprecated so that it may eventually be
removed.  The target date for removal is December 2011.

A warning will be printed to the kernel log if a task attempts to use
this interface.  Future warnings will be suppressed until the kernel is
rebooted to prevent spamming the kernel log.
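
The linear mapping this series installs between the old and new tunables can
be sketched in plain C; the constants are from include/linux/oom.h, and the
function names below are illustrative rather than kernel symbols:

```c
#define OOM_DISABLE		(-17)
#define OOM_ADJUST_MAX		15
#define OOM_SCORE_ADJ_MIN	(-1000)
#define OOM_SCORE_ADJ_MAX	1000

/* Writing oom_adj scales oom_score_adj so the maximum stays attainable. */
static int oom_adj_to_score_adj(int oom_adj)
{
	if (oom_adj == OOM_ADJUST_MAX)
		return OOM_SCORE_ADJ_MAX;
	return oom_adj * OOM_SCORE_ADJ_MAX / -OOM_DISABLE;
}

/* Writing oom_score_adj scales oom_adj so OOM_DISABLE stays attainable. */
static int score_adj_to_oom_adj(int oom_score_adj)
{
	if (oom_score_adj == OOM_SCORE_ADJ_MIN)
		return OOM_DISABLE;
	return oom_score_adj * OOM_ADJUST_MAX / OOM_SCORE_ADJ_MAX;
}
```

The approximation is rough: an oom_adj of 8, for example, maps to an
oom_score_adj of 470, which maps back to 7, so round trips are not exact.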

Signed-off-by: David Rientjes <rientjes@google.com>
---
 Documentation/feature-removal-schedule.txt |   30 ++++++++++++++++++++++++++++
 Documentation/filesystems/proc.txt         |    3 ++
 fs/proc/base.c                             |    8 +++++++
 include/linux/oom.h                        |    3 ++
 4 files changed, 44 insertions(+), 0 deletions(-)

diff --git a/Documentation/feature-removal-schedule.txt b/Documentation/feature-removal-schedule.txt
--- a/Documentation/feature-removal-schedule.txt
+++ b/Documentation/feature-removal-schedule.txt
@@ -168,6 +168,36 @@ Who:	Eric Biederman <ebiederm@xmission.com>
 
 ---------------------------
 
+What:	/proc/<pid>/oom_adj
+When:	December 2011
+Why:	/proc/<pid>/oom_adj allows userspace to influence the oom killer's
+	badness heuristic used to determine which task to kill when the kernel
+	is out of memory.
+
+	The badness heuristic has been rewritten since the introduction of
+	this tunable such that its meaning is deprecated.  The value was
+	implemented as a bitshift on a score generated by the badness()
+	function that did not have any precise units of measure.  With the
+	rewrite, the score is given as a proportion of available memory to the
+	task allocating pages, so a bitshift which grows the score
+	exponentially is impossible to tune with fine granularity.
+
+	A much more powerful interface, /proc/<pid>/oom_score_adj, was
+	introduced with the oom killer rewrite that allows users to increase or
+	decrease the badness() score linearly.  This interface will replace
+	/proc/<pid>/oom_adj.
+
+	See Documentation/filesystems/proc.txt for information on how to use the
+	new tunable.
+
+	A warning will be emitted to the kernel log if an application uses this
+	deprecated interface.  After it is printed once, future warnings will be
+	suppressed until the kernel is rebooted.
+
+Who:	David Rientjes <rientjes@google.com>
+
+---------------------------
+
 What:	remove EXPORT_SYMBOL(kernel_thread)
 When:	August 2006
 Files:	arch/*/kernel/*_ksyms.c
diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -1247,6 +1247,9 @@ scaled linearly with /proc/<pid>/oom_score_adj.
 Writing to /proc/<pid>/oom_score_adj or /proc/<pid>/oom_adj will change the
 other with its scaled value.
 
+NOTICE: /proc/<pid>/oom_adj is deprecated and will be removed, please see
+Documentation/feature-removal-schedule.txt.
+
 Caveat: when a parent task is selected, the oom killer will sacrifice any first
 generation children with seperate address spaces instead, if possible.  This
 avoids servers and important system daemons from being killed and loses the
diff --git a/fs/proc/base.c b/fs/proc/base.c
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -1157,6 +1157,14 @@ static ssize_t oom_adjust_write(struct file *file, const char __user *buf,
 		return -EACCES;
 	}
 
+	/*
+	 * Warn that /proc/pid/oom_adj is deprecated, see
+	 * Documentation/feature-removal-schedule.txt.
+	 */
+	printk_once(KERN_WARNING "%s (%d): /proc/%d/oom_adj is deprecated, "
+			"please use /proc/%d/oom_score_adj instead.\n",
+			current->comm, task_pid_nr(current),
+			task_pid_nr(task), task_pid_nr(task));
 	task->signal->oom_adj = oom_adjust;
 	/*
 	 * Scale /proc/pid/oom_score_adj appropriately ensuring that a maximum
diff --git a/include/linux/oom.h b/include/linux/oom.h
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -2,6 +2,9 @@
 #define __INCLUDE_LINUX_OOM_H
 
 /*
+ * /proc/<pid>/oom_adj is deprecated, see
+ * Documentation/feature-removal-schedule.txt.
+ *
  * /proc/<pid>/oom_adj set to -17 protects from the oom-killer
  */
 #define OOM_DISABLE (-17)

^ permalink raw reply	[flat|nested] 145+ messages in thread

* [patch -mm 6/9 v2] oom: deprecate oom_adj tunable
@ 2010-02-15 22:20   ` David Rientjes
  0 siblings, 0 replies; 145+ messages in thread
From: David Rientjes @ 2010-02-15 22:20 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Rik van Riel, KAMEZAWA Hiroyuki, Nick Piggin, Andrea Arcangeli,
	Balbir Singh, Lubos Lunak, KOSAKI Motohiro, linux-kernel,
	linux-mm

/proc/pid/oom_adj is now deprecated so that that it may eventually be
removed.  The target date for removal is December 2011.

A warning will be printed to the kernel log if a task attempts to use
this interface.  Future warning will be suppressed until the kernel is
rebooted to prevent spamming the kernel log.

Signed-off-by: David Rientjes <rientjes@google.com>
---
 Documentation/feature-removal-schedule.txt |   30 ++++++++++++++++++++++++++++
 Documentation/filesystems/proc.txt         |    3 ++
 fs/proc/base.c                             |    8 +++++++
 include/linux/oom.h                        |    3 ++
 4 files changed, 44 insertions(+), 0 deletions(-)

diff --git a/Documentation/feature-removal-schedule.txt b/Documentation/feature-removal-schedule.txt
--- a/Documentation/feature-removal-schedule.txt
+++ b/Documentation/feature-removal-schedule.txt
@@ -168,6 +168,36 @@ Who:	Eric Biederman <ebiederm@xmission.com>
 
 ---------------------------
 
+What:	/proc/<pid>/oom_adj
+When:	December 2011
+Why:	/proc/<pid>/oom_adj allows userspace to influence the oom killer's
+	badness heuristic used to determine which task to kill when the kernel
+	is out of memory.
+
+	The badness heuristic has been rewritten since the introduction of
+	this tunable such that its meaning is deprecated.  The value was
+	implemented as a bitshift on a score generated by the badness()
+	function that did not have any precise units of measure.  With the
+	rewrite, the score is given as a proportion of available memory to the
+	task allocating pages, so using a bitshift which grows the score
+	exponentially is impossible to tune with fine granularity.
+
+	A much more powerful interface, /proc/<pid>/oom_score_adj, was
+	introduced with the oom killer rewrite that allows users to increase or
+	decrease the badness() score linearly.  This interface will replace
+	/proc/<pid>/oom_adj.
+
+	See Documentation/filesystems/proc.txt for information on how to use the
+	new tunable.
+
+	A warning will be emitted to the kernel log if an application uses this
+	deprecated interface.  After it is printed once, future warnings will be
+	suppressed until the kernel is rebooted.
+
+Who:	David Rientjes <rientjes@google.com>
+
+---------------------------
+
 What:	remove EXPORT_SYMBOL(kernel_thread)
 When:	August 2006
 Files:	arch/*/kernel/*_ksyms.c
diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -1247,6 +1247,9 @@ scaled linearly with /proc/<pid>/oom_score_adj.
 Writing to /proc/<pid>/oom_score_adj or /proc/<pid>/oom_adj will change the
 other with its scaled value.
 
+NOTICE: /proc/<pid>/oom_adj is deprecated and will be removed, please see
+Documentation/feature-removal-schedule.txt.
+
 Caveat: when a parent task is selected, the oom killer will sacrifice any first
 generation children with seperate address spaces instead, if possible.  This
 avoids servers and important system daemons from being killed and loses the
diff --git a/fs/proc/base.c b/fs/proc/base.c
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -1157,6 +1157,14 @@ static ssize_t oom_adjust_write(struct file *file, const char __user *buf,
 		return -EACCES;
 	}
 
+	/*
+	 * Warn that /proc/pid/oom_adj is deprecated, see
+	 * Documentation/feature-removal-schedule.txt.
+	 */
+	printk_once(KERN_WARNING "%s (%d): /proc/%d/oom_adj is deprecated, "
+			"please use /proc/%d/oom_score_adj instead.\n",
+			current->comm, task_pid_nr(current),
+			task_pid_nr(task), task_pid_nr(task));
 	task->signal->oom_adj = oom_adjust;
 	/*
 	 * Scale /proc/pid/oom_score_adj appropriately ensuring that a maximum
diff --git a/include/linux/oom.h b/include/linux/oom.h
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -2,6 +2,9 @@
 #define __INCLUDE_LINUX_OOM_H
 
 /*
+ * /proc/<pid>/oom_adj is deprecated, see
+ * Documentation/feature-removal-schedule.txt.
+ *
  * /proc/<pid>/oom_adj set to -17 protects from the oom-killer
  */
 #define OOM_DISABLE (-17)

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 145+ messages in thread

* [patch -mm 7/9 v2] oom: replace sysctls with quick mode
  2010-02-15 22:19 ` David Rientjes
@ 2010-02-15 22:20   ` David Rientjes
  -1 siblings, 0 replies; 145+ messages in thread
From: David Rientjes @ 2010-02-15 22:20 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Rik van Riel, KAMEZAWA Hiroyuki, Nick Piggin, Andrea Arcangeli,
	Balbir Singh, Lubos Lunak, KOSAKI Motohiro, linux-kernel,
	linux-mm

Two VM sysctls, oom_dump_tasks and oom_kill_allocating_task, were
implemented for very large systems to avoid excessively long tasklist
scans.  The former suppresses the helpful diagnostic messages that are
emitted for each thread group leader that is a candidate for oom kill,
including its pid, uid, vm size, rss, oom_adj value, and name; this
information is very helpful to users in understanding why a particular
task was chosen for kill over others.  The latter simply kills current,
the task triggering the oom condition, instead of iterating through the
tasklist looking for the worst offender.

Both of these sysctls are combined into one for use on the aforementioned
large systems: oom_kill_quick.  This disables the now-default
oom_dump_tasks and kills current whenever the oom killer is called.

The oom killer rewrite is the perfect opportunity to combine both sysctls
into one instead of carrying them around for years to come for nothing
other than legacy purposes.
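
As a rough model of the combined semantics described above (illustrative
only; the variable name mirrors the patch, but the helper functions are
hypothetical and this is not the kernel code):

```c
#include <assert.h>

int sysctl_oom_kill_quick;	/* 0 by default, like the old sysctls */

/* The tasklist dump is now default behavior; quick mode suppresses it. */
int oom_should_dump_tasks(void)
{
	return !sysctl_oom_kill_quick;
}

/* Quick mode kills current instead of scanning the tasklist. */
int oom_should_kill_allocating_task(void)
{
	return sysctl_oom_kill_quick;
}
```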

Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: David Rientjes <rientjes@google.com>
---
 Documentation/sysctl/vm.txt |   44 +++++-------------------------------------
 include/linux/oom.h         |    3 +-
 kernel/sysctl.c             |   13 ++---------
 mm/oom_kill.c               |    9 +++----
 4 files changed, 14 insertions(+), 55 deletions(-)

diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
--- a/Documentation/sysctl/vm.txt
+++ b/Documentation/sysctl/vm.txt
@@ -43,9 +43,8 @@ Currently, these files are in /proc/sys/vm:
 - nr_pdflush_threads
 - nr_trim_pages         (only if CONFIG_MMU=n)
 - numa_zonelist_order
-- oom_dump_tasks
 - oom_forkbomb_thres
-- oom_kill_allocating_task
+- oom_kill_quick
 - overcommit_memory
 - overcommit_ratio
 - page-cluster
@@ -470,27 +469,6 @@ this is causing problems for your system/application.
 
 ==============================================================
 
-oom_dump_tasks
-
-Enables a system-wide task dump (excluding kernel threads) to be
-produced when the kernel performs an OOM-killing and includes such
-information as pid, uid, tgid, vm size, rss, cpu, oom_adj score, and
-name.  This is helpful to determine why the OOM killer was invoked
-and to identify the rogue task that caused it.
-
-If this is set to zero, this information is suppressed.  On very
-large systems with thousands of tasks it may not be feasible to dump
-the memory state information for each one.  Such systems should not
-be forced to incur a performance penalty in OOM conditions when the
-information may not be desired.
-
-If this is set to non-zero, this information is shown whenever the
-OOM killer actually kills a memory-hogging task.
-
-The default value is 0.
-
-==============================================================
-
 oom_forkbomb_thres
 
 This value defines how many children with a seperate address space a specific
@@ -511,22 +489,12 @@ The default value is 1000.
 
 ==============================================================
 
-oom_kill_allocating_task
-
-This enables or disables killing the OOM-triggering task in
-out-of-memory situations.
-
-If this is set to zero, the OOM killer will scan through the entire
-tasklist and select a task based on heuristics to kill.  This normally
-selects a rogue memory-hogging task that frees up a large amount of
-memory when killed.
-
-If this is set to non-zero, the OOM killer simply kills the task that
-triggered the out-of-memory condition.  This avoids the expensive
-tasklist scan.
+oom_kill_quick
 
-If panic_on_oom is selected, it takes precedence over whatever value
-is used in oom_kill_allocating_task.
+When enabled, this will always kill the task that triggered the oom killer, i.e.
+the task that attempted an allocation that could not be satisfied.  It also
+suppresses the tasklist dump to the kernel log whenever the oom killer is
+called.  Typically set on systems with an extremely large number of tasks.
 
 The default value is 0.
 
diff --git a/include/linux/oom.h b/include/linux/oom.h
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -63,8 +63,7 @@ static inline void oom_killer_enable(void)
 }
 /* for sysctl */
 extern int sysctl_panic_on_oom;
-extern int sysctl_oom_kill_allocating_task;
-extern int sysctl_oom_dump_tasks;
+extern int sysctl_oom_kill_quick;
 extern int sysctl_oom_forkbomb_thres;
 
 #endif /* __KERNEL__*/
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -941,16 +941,9 @@ static struct ctl_table vm_table[] = {
 		.proc_handler	= proc_dointvec,
 	},
 	{
-		.procname	= "oom_kill_allocating_task",
-		.data		= &sysctl_oom_kill_allocating_task,
-		.maxlen		= sizeof(sysctl_oom_kill_allocating_task),
-		.mode		= 0644,
-		.proc_handler	= proc_dointvec,
-	},
-	{
-		.procname	= "oom_dump_tasks",
-		.data		= &sysctl_oom_dump_tasks,
-		.maxlen		= sizeof(sysctl_oom_dump_tasks),
+		.procname	= "oom_kill_quick",
+		.data		= &sysctl_oom_kill_quick,
+		.maxlen		= sizeof(sysctl_oom_kill_quick),
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec,
 	},
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -32,9 +32,8 @@
 #include <linux/security.h>
 
 int sysctl_panic_on_oom;
-int sysctl_oom_kill_allocating_task;
-int sysctl_oom_dump_tasks;
 int sysctl_oom_forkbomb_thres = DEFAULT_OOM_FORKBOMB_THRES;
+int sysctl_oom_kill_quick;
 static DEFINE_SPINLOCK(zone_scan_lock);
 
 /*
@@ -402,7 +401,7 @@ static void dump_header(struct task_struct *p, gfp_t gfp_mask, int order,
 	dump_stack();
 	mem_cgroup_print_oom_info(mem, p);
 	show_mem();
-	if (sysctl_oom_dump_tasks)
+	if (!sysctl_oom_kill_quick)
 		dump_tasks(mem);
 }
 
@@ -609,9 +608,9 @@ static void __out_of_memory(gfp_t gfp_mask, int order, unsigned long totalpages,
 	struct task_struct *p;
 	unsigned int points;
 
-	if (sysctl_oom_kill_allocating_task)
+	if (sysctl_oom_kill_quick)
 		if (!oom_kill_process(current, gfp_mask, order, 0, totalpages,
-			NULL, "Out of memory (oom_kill_allocating_task)"))
+			NULL, "Out of memory (quick mode)"))
 			return;
 retry:
 	/*

^ permalink raw reply	[flat|nested] 145+ messages in thread

* [patch -mm 8/9 v2] oom: avoid oom killer for lowmem allocations
  2010-02-15 22:19 ` David Rientjes
@ 2010-02-15 22:20   ` David Rientjes
  -1 siblings, 0 replies; 145+ messages in thread
From: David Rientjes @ 2010-02-15 22:20 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Rik van Riel, KAMEZAWA Hiroyuki, Nick Piggin, Andrea Arcangeli,
	Balbir Singh, Lubos Lunak, KOSAKI Motohiro, linux-kernel,
	linux-mm

If memory has been depleted in lowmem zones even with the protection
afforded to it by /proc/sys/vm/lowmem_reserve_ratio, it is unlikely that
killing current users will help.  The memory is either reclaimable (or
migratable) already, in which case we should not invoke the oom killer at
all, or it is pinned by an application for I/O.  Killing such an
application may leave the hardware in an unspecified state and there is
no guarantee that it will be able to make a timely exit.

Lowmem allocations are now failed in oom conditions so that the task can
perhaps recover or try again later.  Killing current is an excessive
response to a simple GFP_DMA or GFP_DMA32 page allocation, and since no
lowmem allocations use the now-deprecated __GFP_NOFAIL bit, retrying is
unnecessary.

Previously, the heuristic provided some protection for those tasks with 
CAP_SYS_RAWIO, but this is no longer necessary since we will not be
killing tasks for the purposes of ISA allocations.

high_zoneidx is gfp_zone(gfp_flags), meaning that ZONE_NORMAL will be the
default for all allocations that are not __GFP_DMA, __GFP_DMA32,
__GFP_HIGHMEM, and __GFP_MOVABLE on kernels configured to support those
flags.  Testing for high_zoneidx being less than ZONE_NORMAL will only
return true for allocations that have either __GFP_DMA or __GFP_DMA32.
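
The zone comparison above can be illustrated with a simplified model of
the zone ordering (the real enum in include/linux/mmzone.h is
config-dependent; this sketch assumes a typical x86_64-style layout and
the function name is hypothetical):

```c
#include <assert.h>

/* Simplified zone ordering; lower indices are "lower" memory. */
enum zone_type { ZONE_DMA, ZONE_DMA32, ZONE_NORMAL, ZONE_HIGHMEM, ZONE_MOVABLE };

/* True only when the highest zone usable by the allocation sits below
 * ZONE_NORMAL, i.e. for __GFP_DMA or __GFP_DMA32 requests. */
int oom_is_lowmem_request(enum zone_type high_zoneidx)
{
	return high_zoneidx < ZONE_NORMAL;
}
```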

Acked-by: Rik van Riel <riel@redhat.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: David Rientjes <rientjes@google.com>
---
 mm/page_alloc.c |    3 +++
 1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1914,6 +1914,9 @@ rebalance:
 	 * running out of options and have to consider going OOM
 	 */
 	if (!did_some_progress) {
+		/* The oom killer won't necessarily free lowmem */
+		if (high_zoneidx < ZONE_NORMAL)
+			goto nopage;
 		if ((gfp_mask & __GFP_FS) && !(gfp_mask & __GFP_NORETRY)) {
 			if (oom_killer_disabled)
 				goto nopage;

^ permalink raw reply	[flat|nested] 145+ messages in thread

* [patch -mm 9/9 v2] oom: remove unnecessary code and cleanup
  2010-02-15 22:19 ` David Rientjes
@ 2010-02-15 22:20   ` David Rientjes
  -1 siblings, 0 replies; 145+ messages in thread
From: David Rientjes @ 2010-02-15 22:20 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Rik van Riel, KAMEZAWA Hiroyuki, Nick Piggin, Andrea Arcangeli,
	Balbir Singh, Lubos Lunak, KOSAKI Motohiro, linux-kernel,
	linux-mm

Remove the redundancy in __oom_kill_task() since:

 - init can never be passed to this function: it will never be PF_EXITING
   or selectable from select_bad_process(), and

 - it will never be passed a task from oom_kill_task() without an ->mm;
   since we're unconcerned about detachment from exiting tasks, there's no
   reason to protect them against SIGKILL or access to memory reserves.

Also move the kernel log message up to the caller, since the message is
not always wanted here; we need not print an error message if an exiting
task is simply being given a longer timeslice.

Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: David Rientjes <rientjes@google.com>
---
 mm/oom_kill.c |   64 ++++++++++++++------------------------------------------
 1 files changed, 16 insertions(+), 48 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -405,67 +405,35 @@ static void dump_header(struct task_struct *p, gfp_t gfp_mask, int order,
 		dump_tasks(mem);
 }
 
-#define K(x) ((x) << (PAGE_SHIFT-10))
-
 /*
- * Send SIGKILL to the selected  process irrespective of  CAP_SYS_RAW_IO
- * flag though it's unlikely that  we select a process with CAP_SYS_RAW_IO
- * set.
+ * Give the oom killed task high priority and access to memory reserves so that
+ * it may quickly exit and free its memory.
  */
-static void __oom_kill_task(struct task_struct *p, int verbose)
+static void __oom_kill_task(struct task_struct *p)
 {
-	if (is_global_init(p)) {
-		WARN_ON(1);
-		printk(KERN_WARNING "tried to kill init!\n");
-		return;
-	}
-
-	task_lock(p);
-	if (!p->mm) {
-		WARN_ON(1);
-		printk(KERN_WARNING "tried to kill an mm-less task %d (%s)!\n",
-			task_pid_nr(p), p->comm);
-		task_unlock(p);
-		return;
-	}
-
-	if (verbose)
-		printk(KERN_ERR "Killed process %d (%s) "
-		       "vsz:%lukB, anon-rss:%lukB, file-rss:%lukB\n",
-		       task_pid_nr(p), p->comm,
-		       K(p->mm->total_vm),
-		       K(get_mm_counter(p->mm, MM_ANONPAGES)),
-		       K(get_mm_counter(p->mm, MM_FILEPAGES)));
-	task_unlock(p);
-
-	/*
-	 * We give our sacrificial lamb high priority and access to
-	 * all the memory it needs. That way it should be able to
-	 * exit() and clear out its resources quickly...
-	 */
 	p->rt.time_slice = HZ;
 	set_tsk_thread_flag(p, TIF_MEMDIE);
-
 	force_sig(SIGKILL, p);
 }
 
+#define K(x) ((x) << (PAGE_SHIFT-10))
 static int oom_kill_task(struct task_struct *p)
 {
-	/* WARNING: mm may not be dereferenced since we did not obtain its
-	 * value from get_task_mm(p).  This is OK since all we need to do is
-	 * compare mm to q->mm below.
-	 *
-	 * Furthermore, even if mm contains a non-NULL value, p->mm may
-	 * change to NULL at any time since we do not hold task_lock(p).
-	 * However, this is of no concern to us.
-	 */
-	if (!p->mm || p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN)
+	task_lock(p);
+	if (!p->mm || p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN) {
+		task_unlock(p);
 		return 1;
+	}
+	pr_err("Killed process %d (%s) total-vm:%lukB, anon-rss:%lukB, file-rss:%lukB\n",
+		task_pid_nr(p), p->comm, K(p->mm->total_vm),
+	       K(get_mm_counter(p->mm, MM_ANONPAGES)),
+	       K(get_mm_counter(p->mm, MM_FILEPAGES)));
+	task_unlock(p);
 
-	__oom_kill_task(p, 1);
-
+	__oom_kill_task(p);
 	return 0;
 }
+#undef K
 
 static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
 			    unsigned int points, unsigned long totalpages,
@@ -484,7 +452,7 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
 	 * its children or threads, just set TIF_MEMDIE so it can die quickly
 	 */
 	if (p->flags & PF_EXITING) {
-		__oom_kill_task(p, 0);
+		__oom_kill_task(p);
 		return 0;
 	}
 

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [patch -mm 6/9 v2] oom: deprecate oom_adj tunable
  2010-02-15 22:20   ` David Rientjes
@ 2010-02-15 22:28     ` Alan Cox
  -1 siblings, 0 replies; 145+ messages in thread
From: Alan Cox @ 2010-02-15 22:28 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Rik van Riel, KAMEZAWA Hiroyuki, Nick Piggin,
	Andrea Arcangeli, Balbir Singh, Lubos Lunak, KOSAKI Motohiro,
	linux-kernel, linux-mm

On Mon, 15 Feb 2010 14:20:16 -0800 (PST)
David Rientjes <rientjes@google.com> wrote:

> /proc/pid/oom_adj is now deprecated so that that it may eventually be
> removed.  The target date for removal is December 2011.

There are systems that rely on this feature. It's ABI, it's sacred. We are
committed to it and it has users. That doesn't really detract from the
good/bad of the rest of the proposal, it's just one step we can't quite
make.

Alan

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [patch -mm 6/9 v2] oom: deprecate oom_adj tunable
  2010-02-15 22:28     ` Alan Cox
@ 2010-02-15 22:35       ` David Rientjes
  -1 siblings, 0 replies; 145+ messages in thread
From: David Rientjes @ 2010-02-15 22:35 UTC (permalink / raw)
  To: Alan Cox
  Cc: Andrew Morton, Rik van Riel, KAMEZAWA Hiroyuki, Nick Piggin,
	Andrea Arcangeli, Balbir Singh, Lubos Lunak, KOSAKI Motohiro,
	linux-kernel, linux-mm

On Mon, 15 Feb 2010, Alan Cox wrote:

> > /proc/pid/oom_adj is now deprecated so that it may eventually be
> > removed.  The target date for removal is December 2011.
> 
> There are systems that rely on this feature. It's ABI, it's sacred. We are
> committed to it and it has users. That doesn't really detract from the
> good/bad of the rest of the proposal, it's just one step we can't quite
> make.
> 

Andrew suggested that it be deprecated in this way, so that's what was 
done.  I don't have any strong opinions about leaving it around forever 
now that it's otherwise unused beyond simply converting itself into units 
for /proc/pid/oom_score_adj at a much higher granularity.


* Re: [patch -mm 8/9 v2] oom: avoid oom killer for lowmem allocations
  2010-02-15 22:20   ` David Rientjes
@ 2010-02-15 23:57     ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 145+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-02-15 23:57 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Rik van Riel, Nick Piggin, Andrea Arcangeli,
	Balbir Singh, Lubos Lunak, KOSAKI Motohiro, linux-kernel,
	linux-mm

On Mon, 15 Feb 2010 14:20:21 -0800 (PST)
David Rientjes <rientjes@google.com> wrote:

> If memory has been depleted in lowmem zones even with the protection
> afforded to it by /proc/sys/vm/lowmem_reserve_ratio, it is unlikely that
> killing current users will help.  The memory is either reclaimable (or
> migratable) already, in which case we should not invoke the oom killer at
> all, or it is pinned by an application for I/O.  Killing such an
> application may leave the hardware in an unspecified state and there is
> no guarantee that it will be able to make a timely exit.
> 
> Lowmem allocations are now failed in oom conditions so that the task can
> perhaps recover or try again later.  Killing current is an unnecessary
> result for simply making a GFP_DMA or GFP_DMA32 page allocation and no
> lowmem allocations use the now-deprecated __GFP_NOFAIL bit so retrying is
> unnecessary.
> 
> Previously, the heuristic provided some protection for those tasks with 
> CAP_SYS_RAWIO, but this is no longer necessary since we will not be
> killing tasks for the purposes of ISA allocations.
> 
> high_zoneidx is gfp_zone(gfp_flags), meaning that ZONE_NORMAL will be the
> default for all allocations that are not __GFP_DMA, __GFP_DMA32,
> __GFP_HIGHMEM, and __GFP_MOVABLE on kernels configured to support those
> flags.  Testing for high_zoneidx being less than ZONE_NORMAL will only
> return true for allocations that have either __GFP_DMA or __GFP_DMA32.
> 
> Acked-by: Rik van Riel <riel@redhat.com>
> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> Signed-off-by: David Rientjes <rientjes@google.com>
> ---
>  mm/page_alloc.c |    3 +++
>  1 files changed, 3 insertions(+), 0 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1914,6 +1914,9 @@ rebalance:
>  	 * running out of options and have to consider going OOM
>  	 */
>  	if (!did_some_progress) {
> +		/* The oom killer won't necessarily free lowmem */
> +		if (high_zoneidx < ZONE_NORMAL)
> +			goto nopage;
>  		if ((gfp_mask & __GFP_FS) && !(gfp_mask & __GFP_NORETRY)) {
>  			if (oom_killer_disabled)
>  				goto nopage;

WARN_ON((high_zoneidx < ZONE_NORMAL) && (gfp_mask & __GFP_NOFAIL))
plz.

Thanks,
-Kame



* Re: [patch -mm 4/9 v2] oom: remove compulsory panic_on_oom mode
  2010-02-15 22:20   ` David Rientjes
@ 2010-02-16  0:00     ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 145+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-02-16  0:00 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Rik van Riel, Nick Piggin, Andrea Arcangeli,
	Balbir Singh, Lubos Lunak, KOSAKI Motohiro, linux-kernel,
	linux-mm

On Mon, 15 Feb 2010 14:20:09 -0800 (PST)
David Rientjes <rientjes@google.com> wrote:

> If /proc/sys/vm/panic_on_oom is set to 2, the kernel will panic
> regardless of whether the memory allocation is constrained by either a
> mempolicy or cpuset.
> 
> Since mempolicy-constrained out of memory conditions now iterate through
> the tasklist and select a task to kill, it is possible to panic the
> machine if all tasks sharing the same mempolicy nodes (including those
> with default policy, they may allocate anywhere) or cpuset mems have
> /proc/pid/oom_adj values of OOM_DISABLE.  This is functionally equivalent
> to the compulsory panic_on_oom setting of 2, so the mode is removed.
> 
> Signed-off-by: David Rientjes <rientjes@google.com>

NACK. In an environment which depends on cluster fail-over, this is
useful even in such a situation.

Thanks,
-Kame



* Re: [patch -mm 8/9 v2] oom: avoid oom killer for lowmem allocations
  2010-02-15 23:57     ` KAMEZAWA Hiroyuki
@ 2010-02-16  0:10       ` David Rientjes
  -1 siblings, 0 replies; 145+ messages in thread
From: David Rientjes @ 2010-02-16  0:10 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Andrew Morton, Rik van Riel, Nick Piggin, Andrea Arcangeli,
	Balbir Singh, Lubos Lunak, KOSAKI Motohiro, linux-kernel,
	linux-mm

On Tue, 16 Feb 2010, KAMEZAWA Hiroyuki wrote:

> > If memory has been depleted in lowmem zones even with the protection
> > afforded to it by /proc/sys/vm/lowmem_reserve_ratio, it is unlikely that
> > killing current users will help.  The memory is either reclaimable (or
> > migratable) already, in which case we should not invoke the oom killer at
> > all, or it is pinned by an application for I/O.  Killing such an
> > application may leave the hardware in an unspecified state and there is
> > no guarantee that it will be able to make a timely exit.
> > 
> > Lowmem allocations are now failed in oom conditions so that the task can
> > perhaps recover or try again later.  Killing current is an unnecessary
> > result for simply making a GFP_DMA or GFP_DMA32 page allocation and no
> > lowmem allocations use the now-deprecated __GFP_NOFAIL bit so retrying is
> > unnecessary.
> > 
> > Previously, the heuristic provided some protection for those tasks with 
> > CAP_SYS_RAWIO, but this is no longer necessary since we will not be
> > killing tasks for the purposes of ISA allocations.
> > 
> > high_zoneidx is gfp_zone(gfp_flags), meaning that ZONE_NORMAL will be the
> > default for all allocations that are not __GFP_DMA, __GFP_DMA32,
> > __GFP_HIGHMEM, and __GFP_MOVABLE on kernels configured to support those
> > flags.  Testing for high_zoneidx being less than ZONE_NORMAL will only
> > return true for allocations that have either __GFP_DMA or __GFP_DMA32.
> > 
> > Acked-by: Rik van Riel <riel@redhat.com>
> > Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> > Signed-off-by: David Rientjes <rientjes@google.com>
> > ---
> >  mm/page_alloc.c |    3 +++
> >  1 files changed, 3 insertions(+), 0 deletions(-)
> > 
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -1914,6 +1914,9 @@ rebalance:
> >  	 * running out of options and have to consider going OOM
> >  	 */
> >  	if (!did_some_progress) {
> > +		/* The oom killer won't necessarily free lowmem */
> > +		if (high_zoneidx < ZONE_NORMAL)
> > +			goto nopage;
> >  		if ((gfp_mask & __GFP_FS) && !(gfp_mask & __GFP_NORETRY)) {
> >  			if (oom_killer_disabled)
> >  				goto nopage;
> 
> WARN_ON((high_zoneidx < ZONE_NORMAL) && (gfp_mask & __GFP_NOFAIL))
> plz.
> 

As I already explained when you first brought this up, the possibility of 
not invoking the oom killer is not unique to GFP_DMA; it is also possible 
for GFP_NOFS.  Since __GFP_NOFAIL is deprecated and there are no current 
users of GFP_DMA | __GFP_NOFAIL, that warning is completely unnecessary.  
We're not adding any additional __GFP_NOFAIL allocations.


* Re: [patch -mm 4/9 v2] oom: remove compulsory panic_on_oom mode
  2010-02-16  0:00     ` KAMEZAWA Hiroyuki
@ 2010-02-16  0:14       ` David Rientjes
  -1 siblings, 0 replies; 145+ messages in thread
From: David Rientjes @ 2010-02-16  0:14 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Andrew Morton, Rik van Riel, Nick Piggin, Andrea Arcangeli,
	Balbir Singh, Lubos Lunak, KOSAKI Motohiro, linux-kernel,
	linux-mm

On Tue, 16 Feb 2010, KAMEZAWA Hiroyuki wrote:

> > If /proc/sys/vm/panic_on_oom is set to 2, the kernel will panic
> > regardless of whether the memory allocation is constrained by either a
> > mempolicy or cpuset.
> > 
> > Since mempolicy-constrained out of memory conditions now iterate through
> > the tasklist and select a task to kill, it is possible to panic the
> > machine if all tasks sharing the same mempolicy nodes (including those
> > with default policy, they may allocate anywhere) or cpuset mems have
> > /proc/pid/oom_adj values of OOM_DISABLE.  This is functionally equivalent
> > to the compulsory panic_on_oom setting of 2, so the mode is removed.
> > 
> > Signed-off-by: David Rientjes <rientjes@google.com>
> 
> NACK. In an environment which depends on cluster fail-over, this is
> useful even in such a situation.
> 

You don't seem to understand that the behavior has changed: 
mempolicy-constrained oom conditions are now affected by the compulsory 
panic_on_oom mode, please see the patch description.  It's absolutely 
insane for a single sysctl mode to panic the machine anytime a cpuset or 
mempolicy runs out of memory; it is more prone to user error, from being 
set without a full understanding of the ramifications, than it will ever 
be useful.  The kernel already provides a mechanism for doing this, 
OOM_DISABLE.  If you want your cpuset or mempolicy to risk panicking the 
machine, set all tasks that share its mems or nodes, respectively, to 
OOM_DISABLE.  This is no different from the memory controller being 
immune to such panic_on_oom conditions; stop believing that it is the 
only mechanism used in the kernel to do memory isolation.


* Re: [patch -mm 8/9 v2] oom: avoid oom killer for lowmem allocations
  2010-02-16  0:10       ` David Rientjes
@ 2010-02-16  0:21         ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 145+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-02-16  0:21 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Rik van Riel, Nick Piggin, Andrea Arcangeli,
	Balbir Singh, Lubos Lunak, KOSAKI Motohiro, linux-kernel,
	linux-mm

On Mon, 15 Feb 2010 16:10:15 -0800 (PST)
David Rientjes <rientjes@google.com> wrote:

> On Tue, 16 Feb 2010, KAMEZAWA Hiroyuki wrote:
> 
> > > If memory has been depleted in lowmem zones even with the protection
> > > afforded to it by /proc/sys/vm/lowmem_reserve_ratio, it is unlikely that
> > > killing current users will help.  The memory is either reclaimable (or
> > > migratable) already, in which case we should not invoke the oom killer at
> > > all, or it is pinned by an application for I/O.  Killing such an
> > > application may leave the hardware in an unspecified state and there is
> > > no guarantee that it will be able to make a timely exit.
> > > 
> > > Lowmem allocations are now failed in oom conditions so that the task can
> > > perhaps recover or try again later.  Killing current is an unnecessary
> > > result for simply making a GFP_DMA or GFP_DMA32 page allocation and no
> > > lowmem allocations use the now-deprecated __GFP_NOFAIL bit so retrying is
> > > unnecessary.
> > > 
> > > Previously, the heuristic provided some protection for those tasks with 
> > > CAP_SYS_RAWIO, but this is no longer necessary since we will not be
> > > killing tasks for the purposes of ISA allocations.
> > > 
> > > high_zoneidx is gfp_zone(gfp_flags), meaning that ZONE_NORMAL will be the
> > > default for all allocations that are not __GFP_DMA, __GFP_DMA32,
> > > __GFP_HIGHMEM, and __GFP_MOVABLE on kernels configured to support those
> > > flags.  Testing for high_zoneidx being less than ZONE_NORMAL will only
> > > return true for allocations that have either __GFP_DMA or __GFP_DMA32.
> > > 
> > > Acked-by: Rik van Riel <riel@redhat.com>
> > > Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> > > Signed-off-by: David Rientjes <rientjes@google.com>
> > > ---
> > >  mm/page_alloc.c |    3 +++
> > >  1 files changed, 3 insertions(+), 0 deletions(-)
> > > 
> > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > > --- a/mm/page_alloc.c
> > > +++ b/mm/page_alloc.c
> > > @@ -1914,6 +1914,9 @@ rebalance:
> > >  	 * running out of options and have to consider going OOM
> > >  	 */
> > >  	if (!did_some_progress) {
> > > +		/* The oom killer won't necessarily free lowmem */
> > > +		if (high_zoneidx < ZONE_NORMAL)
> > > +			goto nopage;
> > >  		if ((gfp_mask & __GFP_FS) && !(gfp_mask & __GFP_NORETRY)) {
> > >  			if (oom_killer_disabled)
> > >  				goto nopage;
> > 
> > WARN_ON((high_zoneidx < ZONE_NORMAL) && (gfp_mask & __GFP_NOFAIL))
> > plz.
> > 
> 
> As I already explained when you first brought this up, the possibility of 
> not invoking the oom killer is not unique to GFP_DMA, it is also possible 
> for GFP_NOFS.  Since __GFP_NOFAIL is deprecated and there are no current 
> users of GFP_DMA | __GFP_NOFAIL, that warning is completely unnecessary.  
> We're not adding any additional __GFP_NOFAIL allocations.
>

Please add documentation about that to gfp.h before doing this.
Doing this without writing any documentation is laziness.
(A WARNING is a form of documentation.)

Thanks,
-Kame




* Re: [patch -mm 4/9 v2] oom: remove compulsory panic_on_oom mode
  2010-02-16  0:14       ` David Rientjes
@ 2010-02-16  0:23         ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 145+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-02-16  0:23 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Rik van Riel, Nick Piggin, Andrea Arcangeli,
	Balbir Singh, Lubos Lunak, KOSAKI Motohiro, linux-kernel,
	linux-mm

On Mon, 15 Feb 2010 16:14:22 -0800 (PST)
David Rientjes <rientjes@google.com> wrote:

> On Tue, 16 Feb 2010, KAMEZAWA Hiroyuki wrote:
> 
> > > If /proc/sys/vm/panic_on_oom is set to 2, the kernel will panic
> > > regardless of whether the memory allocation is constrained by either a
> > > mempolicy or cpuset.
> > > 
> > > Since mempolicy-constrained out of memory conditions now iterate through
> > > the tasklist and select a task to kill, it is possible to panic the
> > > machine if all tasks sharing the same mempolicy nodes (including those
> > > with default policy, they may allocate anywhere) or cpuset mems have
> > > /proc/pid/oom_adj values of OOM_DISABLE.  This is functionally equivalent
> > > to the compulsory panic_on_oom setting of 2, so the mode is removed.
> > > 
> > > Signed-off-by: David Rientjes <rientjes@google.com>
> > 
> > NACK. In an environment which depends on cluster fail-over, this is
> > useful even in such a situation.
> > 
> 
> You don't seem to understand that the behavior has changed: 
> mempolicy-constrained oom conditions are now affected by the compulsory 
> panic_on_oom mode, please see the patch description.  It's absolutely 
> insane for a single sysctl mode to panic the machine anytime a cpuset or 
> mempolicy runs out of memory; it is more prone to user error, from being 
> set without a full understanding of the ramifications, than it will ever 
> be useful.  The kernel already provides a mechanism for doing this, 
> OOM_DISABLE.  If you want your cpuset or mempolicy to risk panicking the 
> machine, set all tasks that share its mems or nodes, respectively, to 
> OOM_DISABLE.  This is no different from the memory controller being 
> immune to such panic_on_oom conditions; stop believing that it is the 
> only mechanism used in the kernel to do memory isolation.
> 
You still have not explained why we _have to_ remove an API which is in use.

Thanks,
-Kame




^ permalink raw reply	[flat|nested] 145+ messages in thread


* [patch] mm: add comment about deprecation of __GFP_NOFAIL
  2010-02-16  0:21         ` KAMEZAWA Hiroyuki
@ 2010-02-16  1:13           ` David Rientjes
  0 siblings, 0 replies; 145+ messages in thread
From: David Rientjes @ 2010-02-16  1:13 UTC (permalink / raw)
  To: Andrew Morton, KAMEZAWA Hiroyuki
  Cc: Rik van Riel, Nick Piggin, Andrea Arcangeli, Balbir Singh,
	Lubos Lunak, KOSAKI Motohiro, linux-kernel, linux-mm

On Tue, 16 Feb 2010, KAMEZAWA Hiroyuki wrote:

> > As I already explained when you first brought this up, the possibility of 
> > not invoking the oom killer is not unique to GFP_DMA, it is also possible 
> > for GFP_NOFS.  Since __GFP_NOFAIL is deprecated and there are no current 
> > users of GFP_DMA | __GFP_NOFAIL, that warning is completely unnecessary.  
> > We're not adding any additional __GFP_NOFAIL allocations.
> >
> 
> Please add documentation about that to gfp.h before doing this.
> Doing this without writing any documentation is laziness.
> (WARNING is a style of documentation.)
> 

This is already documented in the page allocator, but I guess doing it in 
include/linux/gfp.h as well doesn't hurt.



mm: add comment about deprecation of __GFP_NOFAIL

__GFP_NOFAIL was deprecated in dab48dab, so add a comment that no new 
users should be added.

Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: David Rientjes <rientjes@google.com>
---
 include/linux/gfp.h |    3 ++-
 1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -30,7 +30,8 @@ struct vm_area_struct;
  * _might_ fail.  This depends upon the particular VM implementation.
  *
  * __GFP_NOFAIL: The VM implementation _must_ retry infinitely: the caller
- * cannot handle allocation failures.
+ * cannot handle allocation failures.  This modifier is deprecated and no new
+ * users should be added.
  *
  * __GFP_NORETRY: The VM implementation must not retry indefinitely.
  *

^ permalink raw reply	[flat|nested] 145+ messages in thread


* Re: [patch] mm: add comment about deprecation of __GFP_NOFAIL
  2010-02-16  1:13           ` David Rientjes
@ 2010-02-16  1:26             ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 145+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-02-16  1:26 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Rik van Riel, Nick Piggin, Andrea Arcangeli,
	Balbir Singh, Lubos Lunak, KOSAKI Motohiro, linux-kernel,
	linux-mm

On Mon, 15 Feb 2010 17:13:57 -0800 (PST)
David Rientjes <rientjes@google.com> wrote:

> On Tue, 16 Feb 2010, KAMEZAWA Hiroyuki wrote:
> 
> > > As I already explained when you first brought this up, the possibility of 
> > > not invoking the oom killer is not unique to GFP_DMA, it is also possible 
> > > for GFP_NOFS.  Since __GFP_NOFAIL is deprecated and there are no current 
> > > users of GFP_DMA | __GFP_NOFAIL, that warning is completely unnecessary.  
> > > We're not adding any additional __GFP_NOFAIL allocations.
> > >
> > 
> > Please add documentation about that to gfp.h before doing this.
> > Doing this without writing any documentation is laziness.
> > (WARNING is a style of documentation.)
> > 
> 
> This is already documented in the page allocator, but I guess doing it in 
> include/linux/gfp.h as well doesn't hurt.
> 
I want a warning when someone uses an OBSOLETE interface, but...

Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

I hope no third-party (proprietary) driver uses __GFP_NOFAIL; vendors tend to
believe the API is trustworthy and will never change.

> 
> 
> mm: add comment about deprecation of __GFP_NOFAIL
> 
> __GFP_NOFAIL was deprecated in dab48dab, so add a comment that no new 
> users should be added.
> 
> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> Signed-off-by: David Rientjes <rientjes@google.com>
> ---
>  include/linux/gfp.h |    3 ++-
>  1 files changed, 2 insertions(+), 1 deletions(-)
> 
> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> --- a/include/linux/gfp.h
> +++ b/include/linux/gfp.h
> @@ -30,7 +30,8 @@ struct vm_area_struct;
>   * _might_ fail.  This depends upon the particular VM implementation.
>   *
>   * __GFP_NOFAIL: The VM implementation _must_ retry infinitely: the caller
> - * cannot handle allocation failures.
> + * cannot handle allocation failures.  This modifier is deprecated and no new
> + * users should be added.
>   *
>   * __GFP_NORETRY: The VM implementation must not retry indefinitely.
>   *
> 
> 


^ permalink raw reply	[flat|nested] 145+ messages in thread


* Re: [patch -mm 8/9 v2] oom: avoid oom killer for lowmem allocations
  2010-02-16  0:10       ` David Rientjes
@ 2010-02-16  5:32         ` KOSAKI Motohiro
  0 siblings, 0 replies; 145+ messages in thread
From: KOSAKI Motohiro @ 2010-02-16  5:32 UTC (permalink / raw)
  To: David Rientjes
  Cc: kosaki.motohiro, KAMEZAWA Hiroyuki, Andrew Morton, Rik van Riel,
	Nick Piggin, Andrea Arcangeli, Balbir Singh, Lubos Lunak,
	linux-kernel, linux-mm

> On Tue, 16 Feb 2010, KAMEZAWA Hiroyuki wrote:
> 
> > > If memory has been depleted in lowmem zones even with the protection
> > > afforded to it by /proc/sys/vm/lowmem_reserve_ratio, it is unlikely that
> > > killing current users will help.  The memory is either reclaimable (or
> > > migratable) already, in which case we should not invoke the oom killer at
> > > all, or it is pinned by an application for I/O.  Killing such an
> > > application may leave the hardware in an unspecified state and there is
> > > no guarantee that it will be able to make a timely exit.
> > > 
> > > Lowmem allocations are now failed in oom conditions so that the task can
> > > perhaps recover or try again later.  Killing current is an unnecessary
> > > result for simply making a GFP_DMA or GFP_DMA32 page allocation and no
> > > lowmem allocations use the now-deprecated __GFP_NOFAIL bit so retrying is
> > > unnecessary.
> > > 
> > > Previously, the heuristic provided some protection for those tasks with 
> > > CAP_SYS_RAWIO, but this is no longer necessary since we will not be
> > > killing tasks for the purposes of ISA allocations.
> > > 
> > > high_zoneidx is gfp_zone(gfp_flags), meaning that ZONE_NORMAL will be the
> > > default for all allocations that are not __GFP_DMA, __GFP_DMA32,
> > > __GFP_HIGHMEM, and __GFP_MOVABLE on kernels configured to support those
> > > flags.  Testing for high_zoneidx being less than ZONE_NORMAL will only
> > > return true for allocations that have either __GFP_DMA or __GFP_DMA32.
> > > 
> > > Acked-by: Rik van Riel <riel@redhat.com>
> > > Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> > > Signed-off-by: David Rientjes <rientjes@google.com>
> > > ---
> > >  mm/page_alloc.c |    3 +++
> > >  1 files changed, 3 insertions(+), 0 deletions(-)
> > > 
> > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > > --- a/mm/page_alloc.c
> > > +++ b/mm/page_alloc.c
> > > @@ -1914,6 +1914,9 @@ rebalance:
> > >  	 * running out of options and have to consider going OOM
> > >  	 */
> > >  	if (!did_some_progress) {
> > > +		/* The oom killer won't necessarily free lowmem */
> > > +		if (high_zoneidx < ZONE_NORMAL)
> > > +			goto nopage;
> > >  		if ((gfp_mask & __GFP_FS) && !(gfp_mask & __GFP_NORETRY)) {
> > >  			if (oom_killer_disabled)
> > >  				goto nopage;
> > 
> > WARN_ON((high_zoneidx < ZONE_NORMAL) && (gfp_mask & __GFP_NOFAIL))
> > plz.
> > 
> 
> As I already explained when you first brought this up, the possibility of 
> not invoking the oom killer is not unique to GFP_DMA, it is also possible 
> for GFP_NOFS.  Since __GFP_NOFAIL is deprecated and there are no current 
> users of GFP_DMA | __GFP_NOFAIL, that warning is completely unnecessary.  
> We're not adding any additional __GFP_NOFAIL allocations.

No current user? I don't think so.

	int bio_integrity_prep(struct bio *bio)
	{
	(snip)
	        buf = kmalloc(len, GFP_NOIO | __GFP_NOFAIL | q->bounce_gfp);

and 

	void blk_queue_bounce_limit(struct request_queue *q, u64 dma_mask)
	{
	(snip)
	        if (dma) {
	                init_emergency_isa_pool();
	                q->bounce_gfp = GFP_NOIO | GFP_DMA;
	                q->limits.bounce_pfn = b_pfn;
	        }



I don't like rumor-based discussion; I prefer a fact-based one.

Thanks.




^ permalink raw reply	[flat|nested] 145+ messages in thread


* Re: [patch -mm 1/9 v2] oom: filter tasks not sharing the same cpuset
  2010-02-15 22:20   ` David Rientjes
@ 2010-02-16  6:14     ` Nick Piggin
  0 siblings, 0 replies; 145+ messages in thread
From: Nick Piggin @ 2010-02-16  6:14 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Rik van Riel, KAMEZAWA Hiroyuki, Andrea Arcangeli,
	Balbir Singh, Lubos Lunak, KOSAKI Motohiro, linux-kernel,
	linux-mm

On Mon, Feb 15, 2010 at 02:20:01PM -0800, David Rientjes wrote:
> Tasks that do not share the same set of allowed nodes with the task that
> triggered the oom should not be considered as candidates for oom kill.
> 
> Tasks in other cpusets with a disjoint set of mems would be unfairly
> penalized otherwise because of oom conditions elsewhere; an extreme
> example could unfairly kill all other applications on the system if a
> single task in a user's cpuset sets itself to OOM_DISABLE and then uses
> more memory than allowed.
> 
> Killing tasks outside of current's cpuset rarely would free memory for
> current anyway.
> 
> Acked-by: Rik van Riel <riel@redhat.com>
> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> Signed-off-by: David Rientjes <rientjes@google.com>

Acked-by: Nick Piggin <npiggin@suse.de>

> ---
>  mm/oom_kill.c |   12 +++---------
>  1 files changed, 3 insertions(+), 9 deletions(-)
> 
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -35,7 +35,7 @@ static DEFINE_SPINLOCK(zone_scan_lock);
>  /* #define DEBUG */
>  
>  /*
> - * Is all threads of the target process nodes overlap ours?
> + * Do all threads of the target process overlap our allowed nodes?
>   */
>  static int has_intersects_mems_allowed(struct task_struct *tsk)
>  {
> @@ -167,14 +167,6 @@ unsigned long badness(struct task_struct *p, unsigned long uptime)
>  		points /= 4;
>  
>  	/*
> -	 * If p's nodes don't overlap ours, it may still help to kill p
> -	 * because p may have allocated or otherwise mapped memory on
> -	 * this node before. However it will be less likely.
> -	 */
> -	if (!has_intersects_mems_allowed(p))
> -		points /= 8;
> -
> -	/*
>  	 * Adjust the score by oom_adj.
>  	 */
>  	if (oom_adj) {
> @@ -266,6 +258,8 @@ static struct task_struct *select_bad_process(unsigned long *ppoints,
>  			continue;
>  		if (mem && !task_in_mem_cgroup(p, mem))
>  			continue;
> +		if (!has_intersects_mems_allowed(p))
> +			continue;
>  
>  		/*
>  		 * This task already has access to memory reserves and is

^ permalink raw reply	[flat|nested] 145+ messages in thread


* Re: [patch -mm 2/9 v2] oom: sacrifice child with highest badness score for parent
  2010-02-15 22:20   ` David Rientjes
@ 2010-02-16  6:15     ` Nick Piggin
  0 siblings, 0 replies; 145+ messages in thread
From: Nick Piggin @ 2010-02-16  6:15 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Rik van Riel, KAMEZAWA Hiroyuki, Andrea Arcangeli,
	Balbir Singh, Lubos Lunak, KOSAKI Motohiro, linux-kernel,
	linux-mm

On Mon, Feb 15, 2010 at 02:20:03PM -0800, David Rientjes wrote:
> When a task is chosen for oom kill, the oom killer first attempts to
> sacrifice a child not sharing its parent's memory instead.
> Unfortunately, this often kills in a seemingly random fashion based on
> the ordering of the selected task's child list.  Additionally, it is not
> guaranteed at all to free a large amount of memory that we need to
> prevent additional oom killing in the very near future.
> 
> Instead, we now only attempt to sacrifice the worst child not sharing its
> parent's memory, if one exists.  The worst child is indicated with the
> highest badness() score.  This serves two advantages: we kill a
> memory-hogging task more often, and we allow the configurable
> /proc/pid/oom_adj value to be considered as a factor in which child to
> kill.
> 
> Reviewers may observe that the previous implementation would iterate
> through the children and attempt to kill each until one was successful
> and then the parent if none were found while the new code simply kills
> the most memory-hogging task or the parent.  Note that the only time
> oom_kill_task() fails, however, is when a child does not have an mm or
> has a /proc/pid/oom_adj of OOM_DISABLE.  badness() returns 0 for both
> cases, so the final oom_kill_task() will always succeed.
> 
> Acked-by: Rik van Riel <riel@redhat.com>
> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> Signed-off-by: David Rientjes <rientjes@google.com>
Acked-by: Nick Piggin <npiggin@suse.de>

^ permalink raw reply	[flat|nested] 145+ messages in thread


* Re: [patch -mm 4/9 v2] oom: remove compulsory panic_on_oom mode
  2010-02-15 22:20   ` David Rientjes
@ 2010-02-16  6:20     ` Nick Piggin
  0 siblings, 0 replies; 145+ messages in thread
From: Nick Piggin @ 2010-02-16  6:20 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Rik van Riel, KAMEZAWA Hiroyuki, Andrea Arcangeli,
	Balbir Singh, Lubos Lunak, KOSAKI Motohiro, linux-kernel,
	linux-mm

On Mon, Feb 15, 2010 at 02:20:09PM -0800, David Rientjes wrote:
> If /proc/sys/vm/panic_on_oom is set to 2, the kernel will panic
> regardless of whether the memory allocation is constrained by either a
> mempolicy or cpuset.
> 
> Since mempolicy-constrained out of memory conditions now iterate through
> the tasklist and select a task to kill, it is possible to panic the
> machine if all tasks sharing the same mempolicy nodes (including those
> with default policy, they may allocate anywhere) or cpuset mems have
> /proc/pid/oom_adj values of OOM_DISABLE.  This is functionally equivalent
> to the compulsory panic_on_oom setting of 2, so the mode is removed.
> 
> Signed-off-by: David Rientjes <rientjes@google.com>

What is the point of removing it, though? If it doesn't significantly
help some future patch, just leave it in. It's not worth breaking the
user/kernel interface just to remove 3 trivial lines of code.


^ permalink raw reply	[flat|nested] 145+ messages in thread


* Re: [patch -mm 7/9 v2] oom: replace sysctls with quick mode
  2010-02-15 22:20   ` David Rientjes
@ 2010-02-16  6:28     ` Nick Piggin
  -1 siblings, 0 replies; 145+ messages in thread
From: Nick Piggin @ 2010-02-16  6:28 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Rik van Riel, KAMEZAWA Hiroyuki, Andrea Arcangeli,
	Balbir Singh, Lubos Lunak, KOSAKI Motohiro, linux-kernel,
	linux-mm

On Mon, Feb 15, 2010 at 02:20:18PM -0800, David Rientjes wrote:
> Two VM sysctls, oom_dump_tasks and oom_kill_allocating_task, were
> implemented for very large systems to avoid excessively long tasklist
> scans.  The former, when disabled, suppresses the helpful diagnostic
> messages emitted for each thread group leader that is a candidate for oom
> kill, including its pid, uid, vm size, rss, oom_adj value, and name; this
> information is very helpful to users in understanding why a particular
> task was chosen for kill over others.  The latter simply kills current,
> the task triggering the oom condition, instead of iterating through the
> tasklist looking for the worst offender.
> 
> Both of these sysctls are combined into one for use on the aforementioned
> large systems: oom_kill_quick.  This disables the now-default
> oom_dump_tasks and kills current whenever the oom killer is called.
> 
> The oom killer rewrite is the perfect opportunity to combine both sysctls
> into one instead of carrying the old ones around for years to come for
> nothing other than legacy purposes.

I just don't understand this either. There appears to be simply no
performance or maintainability reason to change this.

> 
> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> Signed-off-by: David Rientjes <rientjes@google.com>
> ---
>  Documentation/sysctl/vm.txt |   44 +++++-------------------------------------
>  include/linux/oom.h         |    3 +-
>  kernel/sysctl.c             |   13 ++---------
>  mm/oom_kill.c               |    9 +++----
>  4 files changed, 14 insertions(+), 55 deletions(-)
> 
> diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
> --- a/Documentation/sysctl/vm.txt
> +++ b/Documentation/sysctl/vm.txt
> @@ -43,9 +43,8 @@ Currently, these files are in /proc/sys/vm:
>  - nr_pdflush_threads
>  - nr_trim_pages         (only if CONFIG_MMU=n)
>  - numa_zonelist_order
> -- oom_dump_tasks
>  - oom_forkbomb_thres
> -- oom_kill_allocating_task
> +- oom_kill_quick
>  - overcommit_memory
>  - overcommit_ratio
>  - page-cluster
> @@ -470,27 +469,6 @@ this is causing problems for your system/application.
>  
>  ==============================================================
>  
> -oom_dump_tasks
> -
> -Enables a system-wide task dump (excluding kernel threads) to be
> -produced when the kernel performs an OOM-killing and includes such
> -information as pid, uid, tgid, vm size, rss, cpu, oom_adj score, and
> -name.  This is helpful to determine why the OOM killer was invoked
> -and to identify the rogue task that caused it.
> -
> -If this is set to zero, this information is suppressed.  On very
> -large systems with thousands of tasks it may not be feasible to dump
> -the memory state information for each one.  Such systems should not
> -be forced to incur a performance penalty in OOM conditions when the
> -information may not be desired.
> -
> -If this is set to non-zero, this information is shown whenever the
> -OOM killer actually kills a memory-hogging task.
> -
> -The default value is 0.
> -
> -==============================================================
> -
>  oom_forkbomb_thres
>  
>  This value defines how many children with a seperate address space a specific
> @@ -511,22 +489,12 @@ The default value is 1000.
>  
>  ==============================================================
>  
> -oom_kill_allocating_task
> -
> -This enables or disables killing the OOM-triggering task in
> -out-of-memory situations.
> -
> -If this is set to zero, the OOM killer will scan through the entire
> -tasklist and select a task based on heuristics to kill.  This normally
> -selects a rogue memory-hogging task that frees up a large amount of
> -memory when killed.
> -
> -If this is set to non-zero, the OOM killer simply kills the task that
> -triggered the out-of-memory condition.  This avoids the expensive
> -tasklist scan.
> +oom_kill_quick
>  
> -If panic_on_oom is selected, it takes precedence over whatever value
> -is used in oom_kill_allocating_task.
> +When enabled, this will always kill the task that triggered the oom killer, i.e.
> +the task that attempted to allocate memory that could not be found.  It also
> +suppresses the tasklist dump to the kernel log whenever the oom killer is
> +called.  Typically set on systems with an extremely large number of tasks.
>  
>  The default value is 0.
>  
> diff --git a/include/linux/oom.h b/include/linux/oom.h
> --- a/include/linux/oom.h
> +++ b/include/linux/oom.h
> @@ -63,8 +63,7 @@ static inline void oom_killer_enable(void)
>  }
>  /* for sysctl */
>  extern int sysctl_panic_on_oom;
> -extern int sysctl_oom_kill_allocating_task;
> -extern int sysctl_oom_dump_tasks;
> +extern int sysctl_oom_kill_quick;
>  extern int sysctl_oom_forkbomb_thres;
>  
>  #endif /* __KERNEL__*/
> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> --- a/kernel/sysctl.c
> +++ b/kernel/sysctl.c
> @@ -941,16 +941,9 @@ static struct ctl_table vm_table[] = {
>  		.proc_handler	= proc_dointvec,
>  	},
>  	{
> -		.procname	= "oom_kill_allocating_task",
> -		.data		= &sysctl_oom_kill_allocating_task,
> -		.maxlen		= sizeof(sysctl_oom_kill_allocating_task),
> -		.mode		= 0644,
> -		.proc_handler	= proc_dointvec,
> -	},
> -	{
> -		.procname	= "oom_dump_tasks",
> -		.data		= &sysctl_oom_dump_tasks,
> -		.maxlen		= sizeof(sysctl_oom_dump_tasks),
> +		.procname	= "oom_kill_quick",
> +		.data		= &sysctl_oom_kill_quick,
> +		.maxlen		= sizeof(sysctl_oom_kill_quick),
>  		.mode		= 0644,
>  		.proc_handler	= proc_dointvec,
>  	},
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -32,9 +32,8 @@
>  #include <linux/security.h>
>  
>  int sysctl_panic_on_oom;
> -int sysctl_oom_kill_allocating_task;
> -int sysctl_oom_dump_tasks;
>  int sysctl_oom_forkbomb_thres = DEFAULT_OOM_FORKBOMB_THRES;
> +int sysctl_oom_kill_quick;
>  static DEFINE_SPINLOCK(zone_scan_lock);
>  
>  /*
> @@ -402,7 +401,7 @@ static void dump_header(struct task_struct *p, gfp_t gfp_mask, int order,
>  	dump_stack();
>  	mem_cgroup_print_oom_info(mem, p);
>  	show_mem();
> -	if (sysctl_oom_dump_tasks)
> +	if (!sysctl_oom_kill_quick)
>  		dump_tasks(mem);
>  }
>  
> @@ -609,9 +608,9 @@ static void __out_of_memory(gfp_t gfp_mask, int order, unsigned long totalpages,
>  	struct task_struct *p;
>  	unsigned int points;
>  
> -	if (sysctl_oom_kill_allocating_task)
> +	if (sysctl_oom_kill_quick)
>  		if (!oom_kill_process(current, gfp_mask, order, 0, totalpages,
> -			NULL, "Out of memory (oom_kill_allocating_task)"))
> +			NULL, "Out of memory (quick mode)"))
>  			return;
>  retry:
>  	/*

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [patch -mm 8/9 v2] oom: avoid oom killer for lowmem allocations
  2010-02-16  0:10       ` David Rientjes
@ 2010-02-16  6:44         ` Nick Piggin
  -1 siblings, 0 replies; 145+ messages in thread
From: Nick Piggin @ 2010-02-16  6:44 UTC (permalink / raw)
  To: David Rientjes
  Cc: KAMEZAWA Hiroyuki, Andrew Morton, Rik van Riel, Andrea Arcangeli,
	Balbir Singh, Lubos Lunak, KOSAKI Motohiro, linux-kernel,
	linux-mm

On Mon, Feb 15, 2010 at 04:10:15PM -0800, David Rientjes wrote:
> On Tue, 16 Feb 2010, KAMEZAWA Hiroyuki wrote:
> 
> > > If memory has been depleted in lowmem zones even with the protection
> > > afforded to it by /proc/sys/vm/lowmem_reserve_ratio, it is unlikely that
> > > killing current users will help.  The memory is either reclaimable (or
> > > migratable) already, in which case we should not invoke the oom killer at
> > > all, or it is pinned by an application for I/O.  Killing such an
> > > application may leave the hardware in an unspecified state and there is
> > > no guarantee that it will be able to make a timely exit.
> > > 
> > > Lowmem allocations are now failed in oom conditions so that the task can
> > > perhaps recover or try again later.  Killing current is an unnecessary
> > > result for simply making a GFP_DMA or GFP_DMA32 page allocation and no
> > > lowmem allocations use the now-deprecated __GFP_NOFAIL bit so retrying is
> > > unnecessary.
> > > 
> > > Previously, the heuristic provided some protection for those tasks with 
> > > CAP_SYS_RAWIO, but this is no longer necessary since we will not be
> > > killing tasks for the purposes of ISA allocations.
> > > 
> > > high_zoneidx is gfp_zone(gfp_flags), meaning that ZONE_NORMAL will be the
> > > default for all allocations that are not __GFP_DMA, __GFP_DMA32,
> > > __GFP_HIGHMEM, and __GFP_MOVABLE on kernels configured to support those
> > > flags.  Testing for high_zoneidx being less than ZONE_NORMAL will only
> > > return true for allocations that have either __GFP_DMA or __GFP_DMA32.
> > > 
> > > Acked-by: Rik van Riel <riel@redhat.com>
> > > Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> > > Signed-off-by: David Rientjes <rientjes@google.com>
> > > ---
> > >  mm/page_alloc.c |    3 +++
> > >  1 files changed, 3 insertions(+), 0 deletions(-)
> > > 
> > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > > --- a/mm/page_alloc.c
> > > +++ b/mm/page_alloc.c
> > > @@ -1914,6 +1914,9 @@ rebalance:
> > >  	 * running out of options and have to consider going OOM
> > >  	 */
> > >  	if (!did_some_progress) {
> > > +		/* The oom killer won't necessarily free lowmem */
> > > +		if (high_zoneidx < ZONE_NORMAL)
> > > +			goto nopage;
> > >  		if ((gfp_mask & __GFP_FS) && !(gfp_mask & __GFP_NORETRY)) {
> > >  			if (oom_killer_disabled)
> > >  				goto nopage;
> > 
> > WARN_ON((high_zoneidx < ZONE_NORMAL) && (gfp_mask & __GFP_NOFAIL))
> > plz.
> > 
> 
> As I already explained when you first brought this up, the possibility of 
> not invoking the oom killer is not unique to GFP_DMA, it is also possible 
> for GFP_NOFS.  Since __GFP_NOFAIL is deprecated and there are no current 
> users of GFP_DMA | __GFP_NOFAIL, that warning is completely unnecessary.  
> We're not adding any additional __GFP_NOFAIL allocations.

Completely agree with this request. Actually, I think even better you
should just add && !(gfp_mask & __GFP_NOFAIL). Deprecated doesn't mean
it is OK to break the API (callers *will* oops or corrupt memory if
__GFP_NOFAIL returns NULL).


^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [patch -mm 4/9 v2] oom: remove compulsory panic_on_oom mode
  2010-02-16  6:20     ` Nick Piggin
@ 2010-02-16  6:59       ` David Rientjes
  -1 siblings, 0 replies; 145+ messages in thread
From: David Rientjes @ 2010-02-16  6:59 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Andrew Morton, Rik van Riel, KAMEZAWA Hiroyuki, Andrea Arcangeli,
	Balbir Singh, Lubos Lunak, KOSAKI Motohiro, linux-kernel,
	linux-mm

On Tue, 16 Feb 2010, Nick Piggin wrote:

> What is the point of removing it, though? If it doesn't significantly
> help some future patch, just leave it in. It's not worth breaking the
> user/kernel interface just to remove 3 trivial lines of code.
> 

Because it is inconsistent at the user's expense: it has never panicked 
the machine for memory controller ooms, so why are cpuset or 
mempolicy constrained oom conditions any different?  It also panics the 
machine even on VM_FAULT_OOM, which is ridiculous; the tunable is certainly 
not being used how it was documented.  And given that mempolicy 
constrained ooms are now much smarter with my rewrite and we never simply 
kill current unless oom_kill_quick is enabled anymore, the compulsory 
panic_on_oom == 2 mode is no longer required.  Simply set all tasks 
attached to a cpuset or bound to a specific mempolicy to be OOM_DISABLE, 
the kernel need not provide confusing alternative modes to sysctls for 
this behavior.  Before panic_on_oom == 2 was introduced, it would have 
only panicked the machine if panic_on_oom was set to a non-zero integer; 
defining it to be something different for '2' after it has held the same 
semantics for years is inappropriate.  There is just no concrete example 
that anyone can give where they want a cpuset-constrained oom to panic the 
machine when other tasks on a disjoint set of mems can continue to do 
work and the cpuset of interest cannot have its tasks set to OOM_DISABLE.

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [patch] mm: add comment about deprecation of __GFP_NOFAIL
  2010-02-16  1:26             ` KAMEZAWA Hiroyuki
@ 2010-02-16  7:03               ` David Rientjes
  -1 siblings, 0 replies; 145+ messages in thread
From: David Rientjes @ 2010-02-16  7:03 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Andrew Morton, Rik van Riel, Nick Piggin, Andrea Arcangeli,
	Balbir Singh, Lubos Lunak, KOSAKI Motohiro, linux-kernel,
	linux-mm

On Tue, 16 Feb 2010, KAMEZAWA Hiroyuki wrote:

> I hope no third-party (proprietary) driver uses __GFP_NOFAIL; such vendors
> tend to believe the API is stable and unchanged.
> 

I hope they don't use it with GFP_ATOMIC, either, because it's never been 
respected in that context.  We can easily audit the handful of cases in 
the kernel that use __GFP_NOFAIL (it takes five minutes at the max) and 
prove that none use it with GFP_ATOMIC or GFP_NOFS.  We don't need to add 
multitudes of warnings about using a deprecated flag in ludicrous 
combinations (does anyone really expect GFP_ATOMIC | __GFP_NOFAIL to work 
gracefully?).

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [patch -mm 4/9 v2] oom: remove compulsory panic_on_oom mode
  2010-02-16  6:59       ` David Rientjes
@ 2010-02-16  7:20         ` Nick Piggin
  -1 siblings, 0 replies; 145+ messages in thread
From: Nick Piggin @ 2010-02-16  7:20 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Rik van Riel, KAMEZAWA Hiroyuki, Andrea Arcangeli,
	Balbir Singh, Lubos Lunak, KOSAKI Motohiro, linux-kernel,
	linux-mm

On Mon, Feb 15, 2010 at 10:59:26PM -0800, David Rientjes wrote:
> On Tue, 16 Feb 2010, Nick Piggin wrote:
> 
> > What is the point of removing it, though? If it doesn't significantly
> > help some future patch, just leave it in. It's not worth breaking the
> > user/kernel interface just to remove 3 trivial lines of code.
> > 
> 
> Because it is inconsistent at the user's expense, it has never panicked 
> the machine for memory controller ooms, so why are cpuset or mempolicy 
> constrained oom conditions any different?

Well memory controller was added later, wasn't it? So if you think
that's a bug then a fix to panic on memory controller ooms might
be in order.

>  It also panics the machine even 
> on VM_FAULT_OOM which is ridiculous,

Why?

> the tunable is certainly not being 
> used how it was documented

Why not? The documentation seems to match the implementation.

> and so given the fact that mempolicy 
> constrained ooms are now much smarter with my rewrite and we never simply 
> kill current unless oom_kill_quick is enabled anymore, the compulsory 
> panic_on_oom == 2 mode is no longer required.  Simply set all tasks 
> attached to a cpuset or bound to a specific mempolicy to be OOM_DISABLE, 
> the kernel need not provide confusing alternative modes to sysctls for 
> this behavior.  Before panic_on_oom == 2 was introduced, it would have 
> only panicked the machine if panic_on_oom was set to a non-zero integer, 
> defining it to be something different for '2' after it has held the same 
> semantics for years is inappropriate.

Well it was always defined in the documentation that it should be 0
or 1. Just that the limit wasn't enforced. I agree that's not ideal,
but anyway the existing and documented 0/1/2 has been there for 3 years
and so now removing the 2 is even worse.

>  There is just no concrete example 
> that anyone can give where they want a cpuset-constrained oom to panic the 
> machine when other tasks on a disjoint set of mems can continue to do 
> work and the cpuset of interest cannot have its tasks set to OOM_DISABLE.

But this changes the way the environment is required to be set up, so
a kernel upgrade can break previously working setups. We don't do that
without a really good reason.


^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [patch] mm: add comment about deprecation of __GFP_NOFAIL
  2010-02-16  7:03               ` David Rientjes
@ 2010-02-16  7:23                 ` Nick Piggin
  -1 siblings, 0 replies; 145+ messages in thread
From: Nick Piggin @ 2010-02-16  7:23 UTC (permalink / raw)
  To: David Rientjes
  Cc: KAMEZAWA Hiroyuki, Andrew Morton, Rik van Riel, Andrea Arcangeli,
	Balbir Singh, Lubos Lunak, KOSAKI Motohiro, linux-kernel,
	linux-mm

On Mon, Feb 15, 2010 at 11:03:50PM -0800, David Rientjes wrote:
> On Tue, 16 Feb 2010, KAMEZAWA Hiroyuki wrote:
> 
> > I hope no 3rd vendor (proprietary) driver uses __GFP_NOFAIL, they tend to
> > believe API is trustable and unchanged.
> > 
> 
> I hope they don't use it with GFP_ATOMIC, either, because it's never been 
> respected in that context.  We can easily audit the handful of cases in 
> the kernel that use __GFP_NOFAIL (it takes five minutes at most) and
> prove that none use it with GFP_ATOMIC or GFP_NOFS.  We don't need to add
> multitudes of warnings about using a deprecated flag with ludicrous
> combinations (does anyone really expect GFP_ATOMIC | __GFP_NOFAIL to work
> gracefully?).

You don't need to add warnings, just don't break existing working
combinations and nobody has anything to complain about.


^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [patch -mm 8/9 v2] oom: avoid oom killer for lowmem allocations
  2010-02-16  5:32         ` KOSAKI Motohiro
@ 2010-02-16  7:29           ` David Rientjes
  -1 siblings, 0 replies; 145+ messages in thread
From: David Rientjes @ 2010-02-16  7:29 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: KAMEZAWA Hiroyuki, Andrew Morton, Rik van Riel, Nick Piggin,
	Andrea Arcangeli, Balbir Singh, Lubos Lunak, linux-kernel,
	linux-mm

On Tue, 16 Feb 2010, KOSAKI Motohiro wrote:

> No current user? I don't think so.
> 
> 	int bio_integrity_prep(struct bio *bio)
> 	{
> 	(snip)
> 	        buf = kmalloc(len, GFP_NOIO | __GFP_NOFAIL | q->bounce_gfp);
> 
> and 
> 
> 	void blk_queue_bounce_limit(struct request_queue *q, u64 dma_mask)
> 	{
> 	(snip)
> 	        if (dma) {
> 	                init_emergency_isa_pool();
> 	                q->bounce_gfp = GFP_NOIO | GFP_DMA;
> 	                q->limits.bounce_pfn = b_pfn;
> 	        }
> 
> 
> 
> I don't like rumor based discussion, I like fact based one.
> 

GFP_NOIO will prevent the oom killer from being called; invoking the
oom killer requires __GFP_FS.

I can change this to invoke the should_alloc_retry() logic by testing for 
!(gfp_mask & __GFP_NOFAIL), but there's nothing else the page allocator 
can currently do to increase its probability of allocating pages; the 
memory compaction patchset might be particularly helpful for these types 
of scenarios.

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [patch -mm 8/9 v2] oom: avoid oom killer for lowmem allocations
  2010-02-16  6:44         ` Nick Piggin
@ 2010-02-16  7:41           ` David Rientjes
  -1 siblings, 0 replies; 145+ messages in thread
From: David Rientjes @ 2010-02-16  7:41 UTC (permalink / raw)
  To: Nick Piggin
  Cc: KAMEZAWA Hiroyuki, Andrew Morton, Rik van Riel, Andrea Arcangeli,
	Balbir Singh, Lubos Lunak, KOSAKI Motohiro, linux-kernel,
	linux-mm

On Tue, 16 Feb 2010, Nick Piggin wrote:

> > As I already explained when you first brought this up, the possibility of 
> > not invoking the oom killer is not unique to GFP_DMA, it is also possible 
> > for GFP_NOFS.  Since __GFP_NOFAIL is deprecated and there are no current 
> > users of GFP_DMA | __GFP_NOFAIL, that warning is completely unnecessary.  
> > We're not adding any additional __GFP_NOFAIL allocations.
> 
> Completely agree with this request. Actually, I think even better you
> should just add && !(gfp_mask & __GFP_NOFAIL). Deprecated doesn't mean
> it is OK to break the API (callers *will* oops or corrupt memory if
> __GFP_NOFAIL returns NULL).
> 

... unless it's used with GFP_ATOMIC, which we've always returned NULL 
for when even ALLOC_HARDER can't find pages, right?

I'm wondering where this strong argument in favor of continuing to support 
__GFP_NOFAIL was when I insisted we call the oom killer for them even for 
allocations over PAGE_ALLOC_COSTLY_ORDER when __alloc_pages_nodemask() was 
refactored back in 2.6.31.  The argument was that nobody allocates
__GFP_NOFAIL pages at orders that high, so we don't need to free memory
for them, and that's where the deprecation of the modifier happened in the
first place.  Ultimately, we did invoke the oom killer for those 
allocations because there's no chance of forward progress otherwise and, 
unlike __GFP_DMA, GFP_KERNEL | __GFP_NOFAIL actually is popular.  

I'll add this check to __alloc_pages_may_oom() for the !(gfp_mask & 
__GFP_NOFAIL) path since we're all content with endlessly looping.

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [patch -mm 8/9 v2] oom: avoid oom killer for lowmem allocations
  2010-02-16  7:41           ` David Rientjes
@ 2010-02-16  7:53             ` Nick Piggin
  -1 siblings, 0 replies; 145+ messages in thread
From: Nick Piggin @ 2010-02-16  7:53 UTC (permalink / raw)
  To: David Rientjes
  Cc: KAMEZAWA Hiroyuki, Andrew Morton, Rik van Riel, Andrea Arcangeli,
	Balbir Singh, Lubos Lunak, KOSAKI Motohiro, linux-kernel,
	linux-mm

On Mon, Feb 15, 2010 at 11:41:49PM -0800, David Rientjes wrote:
> On Tue, 16 Feb 2010, Nick Piggin wrote:
> 
> > > As I already explained when you first brought this up, the possibility of 
> > > not invoking the oom killer is not unique to GFP_DMA, it is also possible 
> > > for GFP_NOFS.  Since __GFP_NOFAIL is deprecated and there are no current 
> > > users of GFP_DMA | __GFP_NOFAIL, that warning is completely unnecessary.  
> > > We're not adding any additional __GFP_NOFAIL allocations.
> > 
> > Completely agree with this request. Actually, I think even better you
> > should just add && !(gfp_mask & __GFP_NOFAIL). Deprecated doesn't mean
> > it is OK to break the API (callers *will* oops or corrupt memory if
> > __GFP_NOFAIL returns NULL).
> > 
> 
> ... unless it's used with GFP_ATOMIC, which we've always returned NULL 
> for when even ALLOC_HARDER can't find pages, right?

Yes, it's never worked with GFP_ATOMIC.


> I'm wondering where this strong argument in favor of continuing to support 
> __GFP_NOFAIL was when I insisted we call the oom killer for them even for 
> allocations over PAGE_ALLOC_COSTLY_ORDER when __alloc_pages_nodemask() was 
> refactored back in 2.6.31.  The argument was that nobody allocates
> __GFP_NOFAIL pages at orders that high, so we don't need to free memory
> for them, and that's where the deprecation of the modifier happened in the
> first place.  Ultimately, we did invoke the oom killer for those 
> allocations because there's no chance of forward progress otherwise and, 
> unlike __GFP_DMA, GFP_KERNEL | __GFP_NOFAIL actually is popular.  

I don't know. IMO we should never just randomly weaken or break a flag
in the page allocator API like this.

> 
> I'll add this check to __alloc_pages_may_oom() for the !(gfp_mask & 
> __GFP_NOFAIL) path since we're all content with endlessly looping.

Thanks. Yes endlessly looping is far preferable to randomly oopsing
or corrupting memory.


^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [patch -mm 4/9 v2] oom: remove compulsory panic_on_oom mode
  2010-02-16  7:20         ` Nick Piggin
@ 2010-02-16  7:53           ` David Rientjes
  -1 siblings, 0 replies; 145+ messages in thread
From: David Rientjes @ 2010-02-16  7:53 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Andrew Morton, Rik van Riel, KAMEZAWA Hiroyuki, Andrea Arcangeli,
	Balbir Singh, Lubos Lunak, KOSAKI Motohiro, linux-kernel,
	linux-mm

On Tue, 16 Feb 2010, Nick Piggin wrote:

> > Because it is inconsistent at the user's expense, it has never panicked 
> > the machine for memory controller ooms, so why is a cpuset or mempolicy 
> > constrained oom conditions any different?
> 
> Well memory controller was added later, wasn't it? So if you think
> that's a bug then a fix to panic on memory controller ooms might
> be in order.
> 

But what about the existing memcg users who set panic_on_oom == 2 and 
don't expect the memory controller to be influenced by that?

> >  It also panics the machine even 
> > on VM_FAULT_OOM which is ridiculous,
> 
> Why?
> 

Because the oom killer was never called for VM_FAULT_OOM before, we simply 
sent a SIGKILL to current, i.e. the original panic_on_oom semantics were 
not even enforced.

> > the tunable is certainly not being 
> > used how it was documented
> 
> Why not? The documentation seems to match the implementation.
> 

It was meant to panic the machine anytime it was out of memory, regardless 
of the constraint, but that obviously doesn't match the memory controller 
case.  Just because cpusets and mempolicies decide to use the oom killer 
as a mechanism for enforcing a user-defined policy does not mean that we 
want to panic for them: mempolicies, for example, are user created and do 
not require any special capability.  Does it seem reasonable that an oom 
condition on those mempolicy nodes should panic the machine when killing 
the offender is possible (and perhaps even encouraged if the user sets a 
high /proc/pid/oom_score_adj)?  In other words, is an admin setting 
panic_on_oom == 2 really expecting that no application will use 
set_mempolicy() or do an mbind()?  This is a very error-prone interface 
that needs to be dealt with on a case-by-case basis and the perfect way to 
do that is by setting the affected tasks to be OOM_DISABLE; that 
interface, unlike panic_on_oom == 2, is very well understood by those with 
CAP_SYS_RESOURCE.

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [patch -mm 4/9 v2] oom: remove compulsory panic_on_oom mode
  2010-02-16  7:53           ` David Rientjes
@ 2010-02-16  8:08             ` Nick Piggin
  -1 siblings, 0 replies; 145+ messages in thread
From: Nick Piggin @ 2010-02-16  8:08 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Rik van Riel, KAMEZAWA Hiroyuki, Andrea Arcangeli,
	Balbir Singh, Lubos Lunak, KOSAKI Motohiro, linux-kernel,
	linux-mm

On Mon, Feb 15, 2010 at 11:53:33PM -0800, David Rientjes wrote:
> On Tue, 16 Feb 2010, Nick Piggin wrote:
> 
> > > Because it is inconsistent at the user's expense, it has never panicked 
> > > the machine for memory controller ooms, so why is a cpuset or mempolicy 
> > > constrained oom conditions any different?
> > 
> > Well memory controller was added later, wasn't it? So if you think
> > that's a bug then a fix to panic on memory controller ooms might
> > be in order.
> > 
> 
> But what about the existing memcg users who set panic_on_oom == 2 and 
> don't expect the memory controller to be influenced by that?

But that was a bug in the addition of the memory controller. Either the
documentation should be fixed, or the implementation should be fixed.

 
> > >  It also panics the machine even 
> > > on VM_FAULT_OOM which is ridiculous,
> > 
> > Why?
> > 
> 
> Because the oom killer was never called for VM_FAULT_OOM before, we simply 
> sent a SIGKILL to current, i.e. the original panic_on_oom semantics were 
> not even enforced.

No, but now they are. I don't know what your point is here, because there
is no way the users of this interface can be expected to know about
VM_FAULT_OOM versus pagefault_out_of_memory let alone do anything useful
with that.

> 
> > > the tunable is certainly not being 
> > > used how it was documented
> > 
> > Why not? The documentation seems to match the implementation.
> > 
> 
> It was meant to panic the machine anytime it was out of memory, regardless 
> of the constraint, but that obviously doesn't match the memory controller 
> case.

Right, and it's been like that for 3 years and people who don't use
the memory controller will be using that tunable.

Let's fix the memory controller case.

>  Just because cpusets and mempolicies decide to use the oom killer 
> as a mechanism for enforcing a user-defined policy does not mean that we 
> want to panic for them: mempolicies, for example, are user created and do 
> not require any special capability.  Does it seem reasonable that an oom 
> condition on those mempolicy nodes should panic the machine when killing 
> the offender is possible (and perhaps even encouraged if the user sets a 
> high /proc/pid/oom_score_adj?)  In other words, is an admin setting 
> panic_on_oom == 2 really expecting that no application will use 
> set_mempolicy() or do an mbind()?  This is a very error-prone interface 
> that needs to be dealt with on a case-by-case basis and the perfect way to 
> do that is by setting the affected tasks to be OOM_DISABLE; that 
> interface, unlike panic_on_oom == 2, is very well understood by those with 
> CAP_SYS_RESOURCE.

I assume it is reasonable to want to panic on any OOM if you're after
fail-stop kind of behaviour. I guess that is why it was added. I see
more use for that case than for the panic_on_oom == 1 case myself.


^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [patch -mm 4/9 v2] oom: remove compulsory panic_on_oom mode
  2010-02-16  8:08             ` Nick Piggin
@ 2010-02-16  8:10               ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 145+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-02-16  8:10 UTC (permalink / raw)
  To: Nick Piggin
  Cc: David Rientjes, Andrew Morton, Rik van Riel, Andrea Arcangeli,
	Balbir Singh, Lubos Lunak, KOSAKI Motohiro, linux-kernel,
	linux-mm

On Tue, 16 Feb 2010 19:08:17 +1100
Nick Piggin <npiggin@suse.de> wrote:

> On Mon, Feb 15, 2010 at 11:53:33PM -0800, David Rientjes wrote:
> > On Tue, 16 Feb 2010, Nick Piggin wrote:
> > 
> > > > Because it is inconsistent at the user's expense, it has never panicked 
> > > > the machine for memory controller ooms, so why is a cpuset or mempolicy 
> > > > constrained oom conditions any different?
> > > 
> > > Well memory controller was added later, wasn't it? So if you think
> > > that's a bug then a fix to panic on memory controller ooms might
> > > be in order.
> > > 
> > 
> > But what about the existing memcg users who set panic_on_oom == 2 and 
> > don't expect the memory controller to be influenced by that?
> 
> But that was a bug in the addition of the memory controller. Either the
> documentation should be fixed, or the implementation should be fixed.
> 
I'll add documentation to memcg, along the lines of:

"When you exhaust the memory resource under a memcg, the oom-killer may be
 invoked.  But in this case, the system never panics, even when
 panic_on_oom is set."

Maybe I should also add a "memcg_oom_notify" (a netlink message, a file
descriptor, or some such).  Because a memcg oom is a virtual oom, automatic
management software can report it to users and do fail-over.  I'll consider
something useful for memcg oom fail-over instead of panic.  In the simplest
case, cgroup's notifier file descriptor can be used.

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [patch -mm 8/9 v2] oom: avoid oom killer for lowmem allocations
  2010-02-16  7:53             ` Nick Piggin
@ 2010-02-16  8:25               ` David Rientjes
  -1 siblings, 0 replies; 145+ messages in thread
From: David Rientjes @ 2010-02-16  8:25 UTC (permalink / raw)
  To: Nick Piggin
  Cc: KAMEZAWA Hiroyuki, Andrew Morton, Rik van Riel, Andrea Arcangeli,
	Balbir Singh, Lubos Lunak, KOSAKI Motohiro, linux-kernel,
	linux-mm

On Tue, 16 Feb 2010, Nick Piggin wrote:

> > I'll add this check to __alloc_pages_may_oom() for the !(gfp_mask & 
> > __GFP_NOFAIL) path since we're all content with endlessly looping.
> 
> Thanks. Yes endlessly looping is far preferable to randomly oopsing
> or corrupting memory.
> 

Here's the new patch for your consideration.


oom: avoid oom killer for lowmem allocations

If memory has been depleted in lowmem zones even with the protection
afforded to it by /proc/sys/vm/lowmem_reserve_ratio, it is unlikely that
killing current users will help.  The memory is either reclaimable (or
migratable) already, in which case we should not invoke the oom killer at
all, or it is pinned by an application for I/O.  Killing such an
application may leave the hardware in an unspecified state and there is
no guarantee that it will be able to make a timely exit.

Lowmem allocations are now failed in oom conditions when __GFP_NOFAIL is
not used so that the task can perhaps recover or try again later.

Previously, the heuristic provided some protection for those tasks with 
CAP_SYS_RAWIO, but this is no longer necessary since we will not be
killing tasks for the purposes of ISA allocations.

high_zoneidx is gfp_zone(gfp_flags), meaning that ZONE_NORMAL will be the
default for all allocations that are not __GFP_DMA, __GFP_DMA32,
__GFP_HIGHMEM, and __GFP_MOVABLE on kernels configured to support those
flags.  Testing for high_zoneidx being less than ZONE_NORMAL will only
return true for allocations that have either __GFP_DMA or __GFP_DMA32.

Signed-off-by: David Rientjes <rientjes@google.com>
---
 mm/page_alloc.c |    3 +++
 1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1705,6 +1705,9 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
 		 */
 		if (gfp_mask & __GFP_THISNODE)
 			goto out;
+		/* The oom killer won't necessarily free lowmem */
+		if (high_zoneidx < ZONE_NORMAL)
+			goto out;
 	}
 	/* Exhausted what can be done so it's blamo time */
 	out_of_memory(zonelist, gfp_mask, order, nodemask);

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [patch -mm 4/9 v2] oom: remove compulsory panic_on_oom mode
  2010-02-16  8:08             ` Nick Piggin
@ 2010-02-16  8:42               ` David Rientjes
  -1 siblings, 0 replies; 145+ messages in thread
From: David Rientjes @ 2010-02-16  8:42 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Andrew Morton, Rik van Riel, KAMEZAWA Hiroyuki, Andrea Arcangeli,
	Balbir Singh, Lubos Lunak, KOSAKI Motohiro, linux-kernel,
	linux-mm

On Tue, 16 Feb 2010, Nick Piggin wrote:

> > > > Because it is inconsistent at the user's expense, it has never panicked 
> > > > the machine for memory controller ooms, so why are cpuset or mempolicy 
> > > > constrained oom conditions any different?
> > > 
> > > Well memory controller was added later, wasn't it? So if you think
> > > that's a bug then a fix to panic on memory controller ooms might
> > > be in order.
> > > 
> > 
> > But what about the existing memcg users who set panic_on_oom == 2 and 
> > don't expect the memory controller to be influenced by that?
> 
> But that was a bug in the addition of the memory controller. Either the
> documentation should be fixed, or the implementation should be fixed.
> 

The memory controller behavior seems intentional because it prevents 
panicking in two places: mem_cgroup_out_of_memory() never considers it and 
sysctl_panic_on_oom is preempted in pagefault_out_of_memory() if current's 
memcg is oom.

The documentation is currently right because it only mentions an 
application to cpusets and mempolicies.

That's the reason why I think we should eliminate it: it is completely 
bogus as it stands because it allows tasks to be killed in memory 
controller environments if their hard limit is reached unless they are set 
to OOM_DISABLE.  That doesn't have fail-stop behavior and trying to make 
exceptions to the rule is not true "fail-stop" that we need to preserve 
with this interface.

> > Because the oom killer was never called for VM_FAULT_OOM before, we simply 
> > sent a SIGKILL to current, i.e. the original panic_on_oom semantics were 
> > not even enforced.
> 
> No but now they are. I don't know what your point is here because there
> is no way the users of this interface can be expected to know about
> VM_FAULT_OOM versus pagefault_out_of_memory let alone do anything useful
> with that.
> 

I think VM_FAULT_OOM should panic the machine for panic_on_oom == 1 as it 
presently does, it needs no special handling otherwise.  But this is an 
example of where semantics of panic_on_oom have changed in the past where 
OOM_DISABLE would remove any ambiguity.  Instead of redefining the 
sysctl's semantics every time we add another use case for the oom killer, 
why can't we just use a single interface that has been around for years 
when a certain task shouldn't be killed?
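
For reference, the long-standing interface in question is set per task from
userspace (an illustrative config fragment; $PID is a stand-in for the task to
protect, and -17 is the value of OOM_DISABLE in this era's oom_adj interface):

```shell
# Exempt a task from the oom killer via the old oom_adj interface.
# OOM_DISABLE == -17; tasks at this value are never selected for kill.
echo -17 > /proc/$PID/oom_adj
```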

> Let's fix the memory controller case.
> 

I doubt you'll find much support from the memory controller folks on that 
since they probably won't agree this is fail-stop behavior and killing a 
task when constrained by a memcg is appropriate because the user asked for 
a hard limit.

Again, OOM_DISABLE would remove all ambiguity and we wouldn't need to 
concern ourselves of what the semantics of a poorly chosen interface such 
as panic_on_oom == 2 is whenever we change the oom killer.

> I assume it is reasonable to want to panic on any OOM if you're after
> fail-stop kind of behaviour. I guess that is why it was added. I see
> more use for that case than panic_on_oom==1 case myself.
> 

panic_on_oom == 1 is reasonable since no system task can make forward 
progress in allocating memory, that isn't necessarily true of cpuset or 
mempolicy (or memcg) constrained applications.  Other cpusets, for 
instance, can continue to do work uninterrupted and without threat of 
having one of their tasks being oom killed.

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [patch -mm 7/9 v2] oom: replace sysctls with quick mode
  2010-02-16  6:28     ` Nick Piggin
@ 2010-02-16  8:58       ` David Rientjes
  -1 siblings, 0 replies; 145+ messages in thread
From: David Rientjes @ 2010-02-16  8:58 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Andrew Morton, Rik van Riel, KAMEZAWA Hiroyuki, Andrea Arcangeli,
	Balbir Singh, Lubos Lunak, KOSAKI Motohiro, linux-kernel,
	linux-mm

On Tue, 16 Feb 2010, Nick Piggin wrote:

> > Two VM sysctls, oom_dump_tasks and oom_kill_allocating_task, were
> > implemented for very large systems to avoid excessively long tasklist
> > scans.  The former suppresses helpful diagnostic messages that are
> > emitted for each thread group leader that are candidates for oom kill
> > including their pid, uid, vm size, rss, oom_adj value, and name; this
> > information is very helpful to users in understanding why a particular
> > task was chosen for kill over others.  The latter simply kills current,
> > the task triggering the oom condition, instead of iterating through the
> > tasklist looking for the worst offender.
> > 
> > Both of these sysctls are combined into one for use on the aforementioned
> > large systems: oom_kill_quick.  This disables the now-default
> > oom_dump_tasks and kills current whenever the oom killer is called.
> > 
> > The oom killer rewrite is the perfect opportunity to combine both sysctls
> > into one instead of carrying around the others for years to come for
> > nothing else than legacy purposes.
> 
> I just don't understand this either. There appears to be simply no
> performance or maintainability reason to change this.
> 

When the oom_dump_tasks output is always emitted for out-of-memory conditions, as my 
patch does, then these two tunables have the exact same audience: users 
with large systems that have extremely long tasklists.  They want to avoid 
tasklist scanning (either to select a bad process to kill or dump their 
information) in oom conditions and simply kill the allocating task.  I 
chose to combine the two: we're not concerned about breaking the 
oom_dump_tasks ABI since it's now the default behavior and since we scan 
the tasklist for mempolicy-constrained ooms, users may now choose to 
enable oom_kill_allocating_task when they previously wouldn't have.  To do 
that, they can either use the old sysctl or convert to this new sysctl 
with the benefit that we've removed one unnecessary sysctl from 
/proc/sys/vm.

As far as I know, oom_kill_allocating_task is only used by SGI, anyway, 
since they are the ones who asked for it when I implemented cpuset 
tasklist scanning.  It's certainly not widely used and since the semantics 
for mempolicies have changed, oom_kill_quick may find more users.

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [patch -mm 4/9 v2] oom: remove compulsory panic_on_oom mode
  2010-02-16  0:23         ` KAMEZAWA Hiroyuki
@ 2010-02-16  9:02           ` David Rientjes
  -1 siblings, 0 replies; 145+ messages in thread
From: David Rientjes @ 2010-02-16  9:02 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Andrew Morton, Rik van Riel, Nick Piggin, Andrea Arcangeli,
	Balbir Singh, Lubos Lunak, KOSAKI Motohiro, linux-kernel,
	linux-mm

On Tue, 16 Feb 2010, KAMEZAWA Hiroyuki wrote:

> > You don't understand that the behavior has changed ever since 
> > mempolicy-constrained oom conditions are now affected by a compulsory 
> > panic_on_oom mode, please see the patch description.  It's absolutely 
> > insane for a single sysctl mode to panic the machine anytime a cpuset or 
> > mempolicy runs out of memory and is more prone to user error from setting 
> > it without fully understanding the ramifications than any use it will ever 
> > do.  The kernel already provides a mechanism for doing this, OOM_DISABLE.  
> > > If you want your cpuset or mempolicy to risk panicking the machine, set 
> > all tasks that share its mems or nodes, respectively, to OOM_DISABLE.  
> > This is no different from the memory controller being immune to such 
> > panic_on_oom conditions, stop believing that it is the only mechanism used 
> > in the kernel to do memory isolation.
> > 
> You don't explain why "we _have to_ remove API which is used"
> 

First, I'm not stating that we _have_ to remove anything, this is a patch 
proposal that is open for review.

Second, I believe we _should_ remove panic_on_oom == 2 because it's no 
longer being used as it was documented: as we've increased the exposure of 
the oom killer (memory controller, pagefault ooms, now mempolicy tasklist 
scanning), we constantly have to re-evaluate the semantics of this option 
while a well-understood tunable with a long history, OOM_DISABLE, already 
does the equivalent.  The downside of getting this wrong is that the 
machine panics when it shouldn't have because of an unintended consequence 
of the mode being enabled (a mempolicy ooms, for example, that was created 
by the user).  When reconsidering its semantics, I'd personally opt on the 
safe side and make sure the machine doesn't panic unnecessarily and 
instead require users to use OOM_DISABLE for tasks they do not want to be 
oom killed.

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [patch -mm 4/9 v2] oom: remove compulsory panic_on_oom mode
  2010-02-16  9:02           ` David Rientjes
@ 2010-02-16 23:42             ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 145+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-02-16 23:42 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Rik van Riel, Nick Piggin, Andrea Arcangeli,
	Balbir Singh, Lubos Lunak, KOSAKI Motohiro, linux-kernel,
	linux-mm

On Tue, 16 Feb 2010 01:02:28 -0800 (PST)
David Rientjes <rientjes@google.com> wrote:

> On Tue, 16 Feb 2010, KAMEZAWA Hiroyuki wrote:
> 
> > > You don't understand that the behavior has changed ever since 
> > > mempolicy-constrained oom conditions are now affected by a compulsory 
> > > panic_on_oom mode, please see the patch description.  It's absolutely 
> > > insane for a single sysctl mode to panic the machine anytime a cpuset or 
> > > mempolicy runs out of memory and is more prone to user error from setting 
> > > it without fully understanding the ramifications than any use it will ever 
> > > do.  The kernel already provides a mechanism for doing this, OOM_DISABLE.  
> > > > If you want your cpuset or mempolicy to risk panicking the machine, set 
> > > all tasks that share its mems or nodes, respectively, to OOM_DISABLE.  
> > > This is no different from the memory controller being immune to such 
> > > panic_on_oom conditions, stop believing that it is the only mechanism used 
> > > in the kernel to do memory isolation.
> > > 
> > You don't explain why "we _have to_ remove API which is used"
> > 
> 
> First, I'm not stating that we _have_ to remove anything, this is a patch 
> proposal that is open for review.
> 
> Second, I believe we _should_ remove panic_on_oom == 2 because it's no 
> longer being used as it was documented: as we've increased the exposure of 
> the oom killer (memory controller, pagefault ooms, now mempolicy tasklist 
> scanning), we constantly have to re-evaluate the semantics of this option 
> while a well-understood tunable with a long history, OOM_DISABLE, already 
> does the equivalent.  The downside of getting this wrong is that the 
> machine panics when it shouldn't have because of an unintended consequence 
> of the mode being enabled (a mempolicy ooms, for example, that was created 
> by the user).  When reconsidering its semantics, I'd personally opt on the 
> safe side and make sure the machine doesn't panic unnecessarily and 
> instead require users to use OOM_DISABLE for tasks they do not want to be 
> oom killed.
> 

Please don't. I had a chance to talk with our customer support team, and we
talked briefly about panic_on_oom. I understood that panic_on_oom=always + kdump
is the strongest tool for investigating a customer's OOM situation and giving
them the best advice: panic_on_oom=always + kdump captures 100% of the
information as a snapshot when the oom-killer happens. Then it's easy to
investigate and explain what is wrong. They sometimes discover a memory leak
(in some proprietary driver) or a misconfiguration of the system (such as use
of an unnecessary bounce buffer).

So please leave panic_on_oom=always.
Even with mempolicy or cpuset OOMs, we need the panic_on_oom=always option.
And yes, I'll add something similar to memcg: freeze_at_oom or something.
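
The panic_on_oom=always + kdump workflow described here would be configured
roughly as follows (a sketch; the crashkernel reservation size and the kdump
tooling vary by distribution):

```shell
# Panic on any OOM so that kdump captures a complete memory snapshot.
echo 2 > /proc/sys/vm/panic_on_oom                # "always" mode
echo 'vm.panic_on_oom = 2' >> /etc/sysctl.conf    # persist across reboots

# kdump itself needs memory reserved on the kernel command line, e.g.:
#   crashkernel=128M
# plus the distribution's kdump service to load the capture kernel.
```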

Thanks,
-Kame

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [patch -mm 4/9 v2] oom: remove compulsory panic_on_oom mode
@ 2010-02-16 23:42             ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 145+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-02-16 23:42 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Rik van Riel, Nick Piggin, Andrea Arcangeli,
	Balbir Singh, Lubos Lunak, KOSAKI Motohiro, linux-kernel,
	linux-mm

On Tue, 16 Feb 2010 01:02:28 -0800 (PST)
David Rientjes <rientjes@google.com> wrote:

> On Tue, 16 Feb 2010, KAMEZAWA Hiroyuki wrote:
> 
> > > You don't understand that the behavior has changed ever since 
> > > mempolicy-constrained oom conditions are now affected by a compulsory 
> > > panic_on_oom mode, please see the patch description.  It's absolutely 
> > > insane for a single sysctl mode to panic the machine anytime a cpuset or 
> > > mempolicy runs out of memory and is more prone to user error from setting 
> > > it without fully understanding the ramifications than any use it will ever 
> > > do.  The kernel already provides a mechanism for doing this, OOM_DISABLE.  
> > > if you want your cpuset or mempolicy to risk panicking the machine, set 
> > > all tasks that share its mems or nodes, respectively, to OOM_DISABLE.  
> > > This is no different from the memory controller being immune to such 
> > > panic_on_oom conditions, stop believing that it is the only mechanism used 
> > > in the kernel to do memory isolation.
> > > 
> > You don't explain why "we _have to_ remove API which is used"
> > 
> 
> First, I'm not stating that we _have_ to remove anything, this is a patch 
> proposal that is open for review.
> 
> Second, I believe we _should_ remove panic_on_oom == 2 because it's no 
> longer being used as it was documented: as we've increased the exposure of 
> the oom killer (memory controller, pagefault ooms, now mempolicy tasklist 
> scanning), we constantly have to re-evaluate the semantics of this option 
> while a well-understood tunable with a long history, OOM_DISABLE, already 
> does the equivalent.  The downside of getting this wrong is that the 
> machine panics when it shouldn't have because of an unintended consequence 
> of the mode being enabled (a mempolicy ooms, for example, that was created 
> by the user).  When reconsidering its semantics, I'd personally opt on the 
> safe side and make sure the machine doesn't panic unnecessarily and 
> instead require users to use OOM_DISABLE for tasks they do not want to be 
> oom killed.
> 

Please don't. I had a chance to talk with our customer support team and we
discussed panic_on_oom briefly. I understood that panic_on_oom=always + kdump
is the strongest tool for investigating a customer's OOM situation and giving
them the best advice. panic_on_oom=always + kdump captures a complete
snapshot of the system at the moment the oom killer fires, so it's easy to
investigate and explain what went wrong. They sometimes discover a memory
leak (in some proprietary driver) or a misconfiguration of the system (such
as an unnecessary bounce buffer being used).

So, please leave panic_on_oom=always.
Even for a mempolicy or cpuset OOM, we need the panic_on_oom=always option.
And yes, I'll add something similar to memcg: freeze_at_oom or something.

Thanks,
-Kame




--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [patch -mm 8/9 v2] oom: avoid oom killer for lowmem allocations
  2010-02-16  8:25               ` David Rientjes
@ 2010-02-16 23:48                 ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 145+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-02-16 23:48 UTC (permalink / raw)
  To: David Rientjes
  Cc: Nick Piggin, Andrew Morton, Rik van Riel, Andrea Arcangeli,
	Balbir Singh, Lubos Lunak, KOSAKI Motohiro, linux-kernel,
	linux-mm

On Tue, 16 Feb 2010 00:25:22 -0800 (PST)
David Rientjes <rientjes@google.com> wrote:

> On Tue, 16 Feb 2010, Nick Piggin wrote:
> 
> > > I'll add this check to __alloc_pages_may_oom() for the !(gfp_mask & 
> > > __GFP_NOFAIL) path since we're all content with endlessly looping.
> > 
> > Thanks. Yes endlessly looping is far preferable to randomly oopsing
> > or corrupting memory.
> > 
> 
> Here's the new patch for your consideration.
> 

Then, can we take a kdump in this endless-loop situation?

panic_on_oom=always + kdump can do that.

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [patch -mm 4/9 v2] oom: remove compulsory panic_on_oom mode
  2010-02-16 23:42             ` KAMEZAWA Hiroyuki
@ 2010-02-16 23:54               ` David Rientjes
  -1 siblings, 0 replies; 145+ messages in thread
From: David Rientjes @ 2010-02-16 23:54 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Andrew Morton, Rik van Riel, Nick Piggin, Andrea Arcangeli,
	Balbir Singh, Lubos Lunak, KOSAKI Motohiro, linux-kernel,
	linux-mm

On Wed, 17 Feb 2010, KAMEZAWA Hiroyuki wrote:

> Please don't. I had a chance to talk with our customer support team and we
> discussed panic_on_oom briefly. I understood that panic_on_oom=always + kdump
> is the strongest tool for investigating a customer's OOM situation and giving
> them the best advice. panic_on_oom=always + kdump captures a complete
> snapshot of the system at the moment the oom killer fires, so it's easy to
> investigate and explain what went wrong. They sometimes discover a memory
> leak (in some proprietary driver) or a misconfiguration of the system (such
> as an unnecessary bounce buffer being used).
> 

Ok, I'm not looking to cause your customers unnecessary grief by removing 
an option that they use, even though the same effect is possible by 
setting all tasks to OOM_DISABLE.  I'll remove this patch in the next 
revision.

> So, please leave panic_on_oom=always.
> Even for a mempolicy or cpuset OOM, we need the panic_on_oom=always option.
> And yes, I'll add something similar to memcg: freeze_at_oom or something.
> 

Memcg isn't a special case here; it should also panic the machine if 
panic_on_oom == 2, so if we aren't going to remove this option then I 
agree with Nick that we need to panic from mem_cgroup_out_of_memory() as 
well.  Some users use cpusets, for example, for the same memory isolation 
effect as you use memcg, so panicking in one scenario and not the other 
is inconsistent.

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [patch -mm 4/9 v2] oom: remove compulsory panic_on_oom mode
  2010-02-16 23:54               ` David Rientjes
@ 2010-02-17  0:01                 ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 145+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-02-17  0:01 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Rik van Riel, Nick Piggin, Andrea Arcangeli,
	Balbir Singh, Lubos Lunak, KOSAKI Motohiro, linux-kernel,
	linux-mm

On Tue, 16 Feb 2010 15:54:50 -0800 (PST)
David Rientjes <rientjes@google.com> wrote:

> On Wed, 17 Feb 2010, KAMEZAWA Hiroyuki wrote:
> > So, please leave panic_on_oom=always.
> > Even for a mempolicy or cpuset OOM, we need the panic_on_oom=always option.
> > And yes, I'll add something similar to memcg: freeze_at_oom or something.
> > 
> 
> Memcg isn't a special case here, it should also panic the machine if 
> panic_on_oom == 2, so if we aren't going to remove this option then I 
> agree with Nick that we need to panic from mem_cgroup_out_of_memory() as 
> well.  Some users use cpusets, for example, for the same effect of memory 
> isolation as you use memcg, so panicking in one scenario and not the other 
> is inconsistent.
> 
Hmm, I have a few reasons to add special behavior to memcg rather than panicking.

 - freeze_at_oom is enough.
   If the OOM can be notified, a management daemon can do useful jobs:
   shut down all other cgroups, or migrate them to another host and take
   a kdump.

 - memcg's oom is not very complicated,
   because we just count RSS + file cache.

But, hmm... I'd like to go this way:

 1. First, support panic_on_oom=2 in memcg.

 2. Second, I'll add an OOM notifier and freeze_at_oom to memcg, and not
    call mem_cgroup_out_of_memory() from oom_kill.c in that case, because
    we don't kill anything. Taking coredumps of all processes in the memcg
    is not very difficult.

I need to discuss this with the memcg guys, but I think this is the way to go.

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [patch -mm 8/9 v2] oom: avoid oom killer for lowmem allocations
  2010-02-17  0:03                   ` David Rientjes
@ 2010-02-17  0:03                     ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 145+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-02-17  0:03 UTC (permalink / raw)
  To: David Rientjes
  Cc: Nick Piggin, Andrew Morton, Rik van Riel, Andrea Arcangeli,
	Balbir Singh, Lubos Lunak, KOSAKI Motohiro, linux-kernel,
	linux-mm

On Tue, 16 Feb 2010 16:03:23 -0800 (PST)
David Rientjes <rientjes@google.com> wrote:

> On Wed, 17 Feb 2010, KAMEZAWA Hiroyuki wrote:
> 
> > > > > I'll add this check to __alloc_pages_may_oom() for the !(gfp_mask & 
> > > > > __GFP_NOFAIL) path since we're all content with endlessly looping.
> > > > 
> > > > Thanks. Yes endlessly looping is far preferable to randomly oopsing
> > > > or corrupting memory.
> > > > 
> > > 
> > > Here's the new patch for your consideration.
> > > 
> > 
> > Then, can we take a kdump in this endless-loop situation?
> > 
> > panic_on_oom=always + kdump can do that.
> > 
> 
> The endless loop is only helpful if something is going to free memory 
> external to the current page allocation: either another task with 
> __GFP_WAIT | __GFP_FS that invokes the oom killer, a task that frees 
> memory, or a task that exits.
> 
> The most notable endless loop in the page allocator is the one when a task 
> has been oom killed, gets access to memory reserves, and then cannot find 
> a page for a __GFP_NOFAIL allocation:
> 
> 	do {
> 		page = get_page_from_freelist(gfp_mask, nodemask, order,
> 			zonelist, high_zoneidx, ALLOC_NO_WATERMARKS,
> 			preferred_zone, migratetype);
> 
> 		if (!page && gfp_mask & __GFP_NOFAIL)
> 			congestion_wait(BLK_RW_ASYNC, HZ/50);
> 	} while (!page && (gfp_mask & __GFP_NOFAIL));
> 
> We don't expect any such allocations to happen during the exit path, but 
> we could probably find some in the fs layer.
> 
> I don't want to check sysctl_panic_on_oom in the page allocator because it 
> would start panicking the machine unnecessarily for the integrity 
> metadata GFP_NOIO | __GFP_NOFAIL allocation, for any 
> order > PAGE_ALLOC_COSTLY_ORDER, or for users who can't lock the zonelist 
> for oom kill that wouldn't have panicked before.
> 

Then, why don't you check high_zoneidx in oom_kill.c?

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [patch -mm 8/9 v2] oom: avoid oom killer for lowmem allocations
  2010-02-16 23:48                 ` KAMEZAWA Hiroyuki
@ 2010-02-17  0:03                   ` David Rientjes
  -1 siblings, 0 replies; 145+ messages in thread
From: David Rientjes @ 2010-02-17  0:03 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Nick Piggin, Andrew Morton, Rik van Riel, Andrea Arcangeli,
	Balbir Singh, Lubos Lunak, KOSAKI Motohiro, linux-kernel,
	linux-mm

On Wed, 17 Feb 2010, KAMEZAWA Hiroyuki wrote:

> > > > I'll add this check to __alloc_pages_may_oom() for the !(gfp_mask & 
> > > > __GFP_NOFAIL) path since we're all content with endlessly looping.
> > > 
> > > Thanks. Yes endlessly looping is far preferable to randomly oopsing
> > > or corrupting memory.
> > > 
> > 
> > Here's the new patch for your consideration.
> > 
> 
> Then, can we take a kdump in this endless-loop situation?
> 
> panic_on_oom=always + kdump can do that.
> 

The endless loop is only helpful if something is going to free memory 
external to the current page allocation: either another task with 
__GFP_WAIT | __GFP_FS that invokes the oom killer, a task that frees 
memory, or a task that exits.

The most notable endless loop in the page allocator is the one when a task 
has been oom killed, gets access to memory reserves, and then cannot find 
a page for a __GFP_NOFAIL allocation:

	do {
		page = get_page_from_freelist(gfp_mask, nodemask, order,
			zonelist, high_zoneidx, ALLOC_NO_WATERMARKS,
			preferred_zone, migratetype);

		if (!page && gfp_mask & __GFP_NOFAIL)
			congestion_wait(BLK_RW_ASYNC, HZ/50);
	} while (!page && (gfp_mask & __GFP_NOFAIL));

We don't expect any such allocations to happen during the exit path, but 
we could probably find some in the fs layer.

I don't want to check sysctl_panic_on_oom in the page allocator because it 
would start panicking the machine unnecessarily for the integrity 
metadata GFP_NOIO | __GFP_NOFAIL allocation, for any 
order > PAGE_ALLOC_COSTLY_ORDER, or for users who can't lock the zonelist 
for oom kill that wouldn't have panicked before.

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [patch -mm 8/9 v2] oom: avoid oom killer for lowmem allocations
  2010-02-17  0:03                     ` KAMEZAWA Hiroyuki
@ 2010-02-17  0:21                       ` David Rientjes
  -1 siblings, 0 replies; 145+ messages in thread
From: David Rientjes @ 2010-02-17  0:21 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Nick Piggin, Andrew Morton, Rik van Riel, Andrea Arcangeli,
	Balbir Singh, Lubos Lunak, KOSAKI Motohiro, linux-kernel,
	linux-mm

On Wed, 17 Feb 2010, KAMEZAWA Hiroyuki wrote:

> > On Wed, 17 Feb 2010, KAMEZAWA Hiroyuki wrote:
> > 
> > > > > > I'll add this check to __alloc_pages_may_oom() for the !(gfp_mask & 
> > > > > > __GFP_NOFAIL) path since we're all content with endlessly looping.
> > > > > 
> > > > > Thanks. Yes endlessly looping is far preferable to randomly oopsing
> > > > > or corrupting memory.
> > > > > 
> > > > 
> > > > Here's the new patch for your consideration.
> > > > 
> > > 
> > > Then, can we take a kdump in this endless-loop situation?
> > > 
> > > panic_on_oom=always + kdump can do that.
> > > 
> > 
> > The endless loop is only helpful if something is going to free memory 
> > external to the current page allocation: either another task with 
> > __GFP_WAIT | __GFP_FS that invokes the oom killer, a task that frees 
> > memory, or a task that exits.
> > 
> > The most notable endless loop in the page allocator is the one when a task 
> > has been oom killed, gets access to memory reserves, and then cannot find 
> > a page for a __GFP_NOFAIL allocation:
> > 
> > 	do {
> > 		page = get_page_from_freelist(gfp_mask, nodemask, order,
> > 			zonelist, high_zoneidx, ALLOC_NO_WATERMARKS,
> > 			preferred_zone, migratetype);
> > 
> > 		if (!page && gfp_mask & __GFP_NOFAIL)
> > 			congestion_wait(BLK_RW_ASYNC, HZ/50);
> > 	} while (!page && (gfp_mask & __GFP_NOFAIL));
> > 
> > We don't expect any such allocations to happen during the exit path, but 
> > we could probably find some in the fs layer.
> > 
> > I don't want to check sysctl_panic_on_oom in the page allocator because it 
> > would start panicking the machine unnecessarily for the integrity 
> > metadata GFP_NOIO | __GFP_NOFAIL allocation, for any 
> > order > PAGE_ALLOC_COSTLY_ORDER, or for users who can't lock the zonelist 
> > for oom kill that wouldn't have panicked before.
> > 
> 
> Then, why don't you check high_zoneidx in oom_kill.c?
> 

out_of_memory() doesn't return a value to specify whether the page 
allocator should retry the allocation or just return NULL; all of that 
policy is kept in mm/page_alloc.c.  For high_zoneidx < ZONE_NORMAL, we want 
to fail the allocation when !(gfp_mask & __GFP_NOFAIL) and call the oom 
killer when it's __GFP_NOFAIL.
---
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1696,6 +1696,9 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
 		/* The OOM killer will not help higher order allocs */
 		if (order > PAGE_ALLOC_COSTLY_ORDER)
 			goto out;
+		/* The OOM killer does not needlessly kill tasks for lowmem */
+		if (high_zoneidx < ZONE_NORMAL)
+			goto out;
 		/*
 		 * GFP_THISNODE contains __GFP_NORETRY and we never hit this.
 		 * Sanity check for bare calls of __GFP_THISNODE, not real OOM.
@@ -1924,15 +1927,23 @@ rebalance:
 			if (page)
 				goto got_pg;
 
-			/*
-			 * The OOM killer does not trigger for high-order
-			 * ~__GFP_NOFAIL allocations so if no progress is being
-			 * made, there are no other options and retrying is
-			 * unlikely to help.
-			 */
-			if (order > PAGE_ALLOC_COSTLY_ORDER &&
-						!(gfp_mask & __GFP_NOFAIL))
-				goto nopage;
+			if (!(gfp_mask & __GFP_NOFAIL)) {
+				/*
+				 * The oom killer is not called for high-order
+				 * allocations that may fail, so if no progress
+				 * is being made, there are no other options and
+				 * retrying is unlikely to help.
+				 */
+				if (order > PAGE_ALLOC_COSTLY_ORDER)
+					goto nopage;
+				/*
+				 * The oom killer is not called for lowmem
+				 * allocations to prevent needlessly killing
+				 * innocent tasks.
+				 */
+				if (high_zoneidx < ZONE_NORMAL)
+					goto nopage;
+			}
 
 			goto restart;
 		}

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [patch -mm 4/9 v2] oom: remove compulsory panic_on_oom mode
  2010-02-17  0:01                 ` KAMEZAWA Hiroyuki
@ 2010-02-17  0:31                   ` David Rientjes
  -1 siblings, 0 replies; 145+ messages in thread
From: David Rientjes @ 2010-02-17  0:31 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Andrew Morton, Rik van Riel, Nick Piggin, Andrea Arcangeli,
	Balbir Singh, Lubos Lunak, KOSAKI Motohiro, linux-kernel,
	linux-mm

On Wed, 17 Feb 2010, KAMEZAWA Hiroyuki wrote:

> Hmm, I have a few reasons to add special behavior to memcg rather than panicking.
> 
>  - freeze_at_oom is enough.
>    If the OOM can be notified, a management daemon can do useful jobs:
>    shut down all other cgroups, or migrate them to another host and take
>    a kdump.
> 

The same could be said for cpusets if users use that for memory isolation.

> But, hmm... I'd like to go this way:
> 
>  1. First, support panic_on_oom=2 in memcg.
> 

This should panic in mem_cgroup_out_of_memory() and the documentation 
should be added to Documentation/sysctl/vm.txt.

The memory controller also has some protection in the pagefault oom 
handler that seems like it could be made more general: instead of checking 
mem_cgroup_oom_called(), I'd rather do a tasklist scan to check for an 
already oom killed task (checking for the TIF_MEMDIE bit) and check all 
zones for ZONE_OOM_LOCKED.  If no oom killed tasks are found and no zones 
are locked, we can check sysctl_panic_on_oom and invoke the system-wide 
oom.

>  2. Second, I'll add an OOM notifier and freeze_at_oom to memcg, and not
>     call mem_cgroup_out_of_memory() from oom_kill.c in that case, because
>     we don't kill anything. Taking coredumps of all processes in the memcg
>     is not very difficult.
> 

The oom notifier would be at a higher level than the oom killer; the oom 
killer's job is simply to kill a task when it is called.  So for these 
particular cases, you would never even call into out_of_memory() to panic 
the machine in the first place.  Hopefully, the oom notifier can be made 
more generic as its own cgroup rather than being used only by memcg, but 
if such a userspace notifier defers to the kernel oom killer, it should 
panic when panic_on_oom == 2 is selected regardless of whether the oom is 
constrained or not.  Thus, we can keep the sysctl_panic_on_oom logic in 
the oom killer (both in out_of_memory() and mem_cgroup_out_of_memory()) 
without risk of an unnecessary panic whenever an oom notifier or 
freeze_at_oom setting intercepts the condition.
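
The proposed layering can be sketched as a decision function.  This is a 
hypothetical userspace model (the names are illustrative, not kernel 
API): a userspace notifier sits above the oom killer, and only when it 
defers does the kernel consult panic_on_oom:

```c
#include <assert.h>
#include <stdbool.h>

enum oom_action { OOM_HANDLED_BY_USERSPACE, OOM_PANIC, OOM_KILL_TASK };

/*
 * Hypothetical model of the layering discussed above: if a userspace
 * notifier (oom-notifier cgroup or freeze_at_oom) handles the oom,
 * out_of_memory() is never entered.  If it defers to the kernel,
 * panic_on_oom == 2 panics regardless of whether the oom is constrained
 * to a memcg, cpuset or mempolicy; panic_on_oom == 1 panics only for
 * unconstrained, system-wide ooms.
 */
static enum oom_action resolve_oom(bool notifier_handles, int panic_on_oom,
				   bool constrained)
{
	if (notifier_handles)
		return OOM_HANDLED_BY_USERSPACE;
	if (panic_on_oom == 2)
		return OOM_PANIC;		/* 'constrained' ignored */
	if (panic_on_oom == 1 && !constrained)
		return OOM_PANIC;
	return OOM_KILL_TASK;
}
```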

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [patch -mm 4/9 v2] oom: remove compulsory panic_on_oom mode
  2010-02-17  0:31                   ` David Rientjes
@ 2010-02-17  0:41                     ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 145+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-02-17  0:41 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Rik van Riel, Nick Piggin, Andrea Arcangeli,
	Balbir Singh, Lubos Lunak, KOSAKI Motohiro, linux-kernel,
	linux-mm

On Tue, 16 Feb 2010 16:31:39 -0800 (PST)
David Rientjes <rientjes@google.com> wrote:

> On Wed, 17 Feb 2010, KAMEZAWA Hiroyuki wrote:
> 
> > Hmm, I have a few reason to add special behavior to memcg rather than panic.
> > 
> >  - freeze_at_oom is enough.
> >    If OOM can be notified, the management daemon can do useful jobs. Shutdown
> >    all other cgroups or migrate them to other host and do kdump.
> > 
> 
> The same could be said for cpusets if users use that for memory isolation.
> 
The difficulty with cpusets is that several mechanisms share the limitation.

It's not simple because we have
  - cpusets
  - per-task mempolicies
  - per-vma mempolicies

Sigh.. but each exists for its own purpose.
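
The layering problem can be illustrated with a toy classifier.  This is a 
hypothetical userspace model (node sets as bitmasks; the real kernel 
logic is constrained_alloc() in mm/oom_kill.c), showing why the oom 
killer must work out which mechanism restricted the allocation:

```c
#include <assert.h>

enum oom_constraint { CONSTRAINT_NONE, CONSTRAINT_CPUSET,
		      CONSTRAINT_MEMORY_POLICY };

/*
 * Hypothetical model: the nodes an allocation may use are restricted by
 * layered mechanisms.  A mempolicy nodemask (per task or per vma) that
 * excludes some nodes constrains the oom; otherwise a cpuset with fewer
 * mems than the system does; otherwise the oom is system-wide.
 * policy_nodes == 0 means no mempolicy is in effect.
 */
static enum oom_constraint classify(unsigned all_nodes, unsigned cpuset_mems,
				    unsigned policy_nodes)
{
	if (policy_nodes && policy_nodes != all_nodes)
		return CONSTRAINT_MEMORY_POLICY;
	if (cpuset_mems != all_nodes)
		return CONSTRAINT_CPUSET;
	return CONSTRAINT_NONE;
}
```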


> > But, Hmm...I'd like to go this way.
> > 
> >  1. At first, support panic_on_oom=2 in memcg.
> > 
> 
> This should panic in mem_cgroup_out_of_memory() and the documentation 
> should be added to Documentation/sysctl/vm.txt.
> 
> The memory controller also has some protection in the pagefault oom 
> handler that seems like it could be made more general: instead of checking 
> for mem_cgroup_oom_called(), I'd rather do a tasklist scan to check for 
> already oom killed task (checking for the TIF_MEMDIE bit) and check all 
> zones for ZONE_OOM_LOCKED.  If no oom killed tasks are found and no zones 
> are locked, we can check sysctl_panic_on_oom and invoke the system-wide 
> oom.
> 
Please remove memcg's hook after doing that. The current implementation is
designed not to affect other cgroups too much by avoiding unnecessary work.


> >  2. Second, I'll add OOM-notifier and freeze_at_oom to memcg.
> >     and don't call memcg_out_of_memory in oom_kill.c in this case. Because
> >     we don't kill anything. Taking coredumps of all procs in memcg is not
> >     very difficult.
> > 
> 
> The oom notifier would be at a higher level than the oom killer, the oom 
> killer's job is simply to kill a task when it is called. 
> So for these particular cases, you would never even call into out_of_memory() to panic 
> the machine in the first place. 

That's my point. 

Thanks,
-Kame



^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [patch -mm 4/9 v2] oom: remove compulsory panic_on_oom mode
  2010-02-17  0:41                     ` KAMEZAWA Hiroyuki
@ 2010-02-17  0:54                       ` David Rientjes
  0 siblings, 0 replies; 145+ messages in thread
From: David Rientjes @ 2010-02-17  0:54 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Andrew Morton, Rik van Riel, Nick Piggin, Andrea Arcangeli,
	Balbir Singh, Lubos Lunak, KOSAKI Motohiro, linux-kernel,
	linux-mm

On Wed, 17 Feb 2010, KAMEZAWA Hiroyuki wrote:

> > This should panic in mem_cgroup_out_of_memory() and the documentation 
> > should be added to Documentation/sysctl/vm.txt.
> > 
> > The memory controller also has some protection in the pagefault oom 
> > handler that seems like it could be made more general: instead of checking 
> > for mem_cgroup_oom_called(), I'd rather do a tasklist scan to check for 
> > already oom killed task (checking for the TIF_MEMDIE bit) and check all 
> > zones for ZONE_OOM_LOCKED.  If no oom killed tasks are found and no zones 
> > are locked, we can check sysctl_panic_on_oom and invoke the system-wide 
> > oom.
> > 
> plz remove memcg's hook after doing that. Current implemantation is desgined 
> not to affect too much to other cgroups by doing unnecessary jobs.
> 

Ok, I'll eliminate pagefault_out_of_memory() and get it to use 
out_of_memory() by only checking for constrained_alloc() when
gfp_mask != 0.

> > >  2. Second, I'll add OOM-notifier and freeze_at_oom to memcg.
> > >     and don't call memcg_out_of_memory in oom_kill.c in this case. Because
> > >     we don't kill anything. Taking coredumps of all procs in memcg is not
> > >     very difficult.
> > > 
> > 
> > The oom notifier would be at a higher level than the oom killer, the oom 
> > killer's job is simply to kill a task when it is called. 
> > So for these particular cases, you would never even call into out_of_memory() to panic 
> > the machine in the first place. 
> 
> That's my point. 
> 

Great, are you planning on implementing a cgroup based roughly on the 
/dev/mem_notify patchset so userspace can poll() a file and be notified 
of oom events?  It would help beyond just memcg; it also has an 
application to cpusets (adding more mems on large systems).  It could 
even be used purely to preempt the kernel oom killer and move all the 
policy to userspace, even though that would sacrifice TIF_MEMDIE.
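
The userspace side of such an interface might look like the sketch below.  
This is only a hypothetical illustration of the poll() loop a management 
daemon would run; here an ordinary pipe stands in for the notification 
file, since the kernel interface being discussed does not exist yet:

```c
#include <assert.h>
#include <poll.h>
#include <unistd.h>

/*
 * Hypothetical sketch of a /dev/mem_notify-style consumer: the daemon
 * poll()s a file descriptor and wakes when the kernel signals an oom
 * event.  Returns 1 if an event is pending, 0 on timeout or error.
 */
static int wait_for_oom_event(int fd, int timeout_ms)
{
	struct pollfd pfd = { .fd = fd, .events = POLLIN };
	int n = poll(&pfd, 1, timeout_ms);

	if (n <= 0)
		return 0;			/* timed out: no oom event */
	return (pfd.revents & POLLIN) != 0;	/* event ready to read     */
}
```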

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [patch -mm 4/9 v2] oom: remove compulsory panic_on_oom mode
  2010-02-17  0:54                       ` David Rientjes
@ 2010-02-17  1:03                         ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 145+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-02-17  1:03 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Rik van Riel, Nick Piggin, Andrea Arcangeli,
	Balbir Singh, Lubos Lunak, KOSAKI Motohiro, linux-kernel,
	linux-mm

On Tue, 16 Feb 2010 16:54:31 -0800 (PST)
David Rientjes <rientjes@google.com> wrote:

> On Wed, 17 Feb 2010, KAMEZAWA Hiroyuki wrote:

> > > >  2. Second, I'll add OOM-notifier and freeze_at_oom to memcg.
> > > >     and don't call memcg_out_of_memory in oom_kill.c in this case. Because
> > > >     we don't kill anything. Taking coredumps of all procs in memcg is not
> > > >     very difficult.
> > > > 
> > > 
> > > The oom notifier would be at a higher level than the oom killer, the oom 
> > > killer's job is simply to kill a task when it is called. 
> > > So for these particular cases, you would never even call into out_of_memory() to panic 
> > > the machine in the first place. 
> > 
> > That's my point. 
> > 
> 
> Great, are you planning on implementing a cgroup that is based on roughly 
> on the /dev/mem_notify patchset so userspace can poll() a file and be 
> notified of oom events?  It would help beyond just memcg, it has an 
> application to cpusets (adding more mems on large systems) as well.  It 
> can also be used purely to preempt the kernel oom killer and move all the 
> policy to userspace even though it would be sacrificing TIF_MEMDIE.
> 

I start with memcg because that gives us a simple and clean operation with
no heuristics, and we will not have ugly corner cases. And we can _expect_
that memcg has a management daemon for OOM running in another cgroup. Because
memcg's memory shortage never means "memory is exhausted", we can expect that
the daemon can work well. memcg already has a memory-usage-notifier file; an
oom-notifier will not be far different from that.

cpusets should have their own if necessary. cpuset's difficulty is that
the memory on its nodes is _really_ exhausted, and we're not sure whether
the management daemon itself would be affected... it could hang up.

BTW, the concept of /dev/mem_notify is to notify before OOM, not to notify
when OOM occurs. memcg's memory-usage-notifier already implements that in
some sense.
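
The "notify before OOM" idea can be modeled as a usage threshold below the
hard limit.  A hypothetical sketch (illustrative names only; the real
memcg notifier is kernel-side and wakes a poll()ing daemon):

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical model of a memory-usage-notifier: userspace registers a
 * threshold below the limit so the daemon is notified *before* the
 * group actually hits oom, rather than when the oom killer has already
 * been entered.
 */
struct usage_notifier {
	unsigned long threshold;	/* bytes; below the hard limit */
	bool fired;			/* would wake the daemon       */
};

static void charge(struct usage_notifier *n, unsigned long *usage,
		   unsigned long bytes)
{
	*usage += bytes;
	if (!n->fired && *usage >= n->threshold)
		n->fired = true;	/* notify before the limit bites */
}
```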

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [patch -mm 4/9 v2] oom: remove compulsory panic_on_oom mode
  2010-02-17  0:54                       ` David Rientjes
@ 2010-02-17  1:58                         ` David Rientjes
  0 siblings, 0 replies; 145+ messages in thread
From: David Rientjes @ 2010-02-17  1:58 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Andrew Morton, Rik van Riel, Nick Piggin, Andrea Arcangeli,
	Balbir Singh, Lubos Lunak, KOSAKI Motohiro, linux-kernel,
	linux-mm

On Tue, 16 Feb 2010, David Rientjes wrote:

> Ok, I'll eliminate pagefault_out_of_memory() and get it to use 
> out_of_memory() by only checking for constrained_alloc() when
> gfp_mask != 0.
> 

What do you think about making pagefaults use out_of_memory() directly and 
respecting the sysctl_panic_on_oom settings?

This removes the check for a parallel memcg oom killing since we can 
guarantee that's not going to happen if we take ZONE_OOM_LOCKED for all 
populated zones (nobody is currently executing the oom killer) and no 
tasks have TIF_MEMDIE set.

Signed-off-by: David Rientjes <rientjes@google.com>
---
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -124,7 +124,6 @@ static inline bool mem_cgroup_disabled(void)
 	return false;
 }
 
-extern bool mem_cgroup_oom_called(struct task_struct *task);
 void mem_cgroup_update_file_mapped(struct page *page, int val);
 unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
 						gfp_t gfp_mask, int nid,
@@ -258,11 +257,6 @@ static inline bool mem_cgroup_disabled(void)
 	return true;
 }
 
-static inline bool mem_cgroup_oom_called(struct task_struct *task)
-{
-	return false;
-}
-
 static inline int
 mem_cgroup_inactive_anon_is_low(struct mem_cgroup *memcg)
 {
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -200,7 +200,6 @@ struct mem_cgroup {
 	 * Should the accounting and control be hierarchical, per subtree?
 	 */
 	bool use_hierarchy;
-	unsigned long	last_oom_jiffies;
 	atomic_t	refcnt;
 
 	unsigned int	swappiness;
@@ -1234,34 +1233,6 @@ static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_mem,
 	return total;
 }
 
-bool mem_cgroup_oom_called(struct task_struct *task)
-{
-	bool ret = false;
-	struct mem_cgroup *mem;
-	struct mm_struct *mm;
-
-	rcu_read_lock();
-	mm = task->mm;
-	if (!mm)
-		mm = &init_mm;
-	mem = mem_cgroup_from_task(rcu_dereference(mm->owner));
-	if (mem && time_before(jiffies, mem->last_oom_jiffies + HZ/10))
-		ret = true;
-	rcu_read_unlock();
-	return ret;
-}
-
-static int record_last_oom_cb(struct mem_cgroup *mem, void *data)
-{
-	mem->last_oom_jiffies = jiffies;
-	return 0;
-}
-
-static void record_last_oom(struct mem_cgroup *mem)
-{
-	mem_cgroup_walk_tree(mem, NULL, record_last_oom_cb);
-}
-
 /*
  * Currently used to update mapped file statistics, but the routine can be
  * generalized to update other statistics as well.
@@ -1549,10 +1520,8 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm,
 		}
 
 		if (!nr_retries--) {
-			if (oom) {
+			if (oom)
 				mem_cgroup_out_of_memory(mem_over_limit, gfp_mask);
-				record_last_oom(mem_over_limit);
-			}
 			goto nomem;
 		}
 	}
@@ -2408,8 +2377,6 @@ void mem_cgroup_end_migration(struct mem_cgroup *mem,
 
 /*
  * A call to try to shrink memory usage on charge failure at shmem's swapin.
- * Calling hierarchical_reclaim is not enough because we should update
- * last_oom_jiffies to prevent pagefault_out_of_memory from invoking global OOM.
  * Moreover considering hierarchy, we should reclaim from the mem_over_limit,
  * not from the memcg which this page would be charged to.
  * try_charge_swapin does all of these works properly.
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -490,29 +490,6 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
 	return oom_kill_task(victim);
 }
 
-#ifdef CONFIG_CGROUP_MEM_RES_CTLR
-void mem_cgroup_out_of_memory(struct mem_cgroup *mem, gfp_t gfp_mask)
-{
-	unsigned long points = 0;
-	struct task_struct *p;
-
-	read_lock(&tasklist_lock);
-retry:
-	p = select_bad_process(&points, mem, CONSTRAINT_NONE, NULL);
-	if (PTR_ERR(p) == -1UL)
-		goto out;
-
-	if (!p)
-		p = current;
-
-	if (oom_kill_process(p, gfp_mask, 0, points, mem,
-				"Memory cgroup out of memory"))
-		goto retry;
-out:
-	read_unlock(&tasklist_lock);
-}
-#endif
-
 static BLOCKING_NOTIFIER_HEAD(oom_notify_list);
 
 int register_oom_notifier(struct notifier_block *nb)
@@ -578,6 +555,70 @@ void clear_zonelist_oom(struct zonelist *zonelist, gfp_t gfp_mask)
 }
 
 /*
+ * Try to acquire the oom killer lock for all system zones.  Returns zero if a
+ * parallel oom killing is taking place, otherwise locks all zones and returns
+ * non-zero.
+ */
+static int try_set_system_oom(void)
+{
+	struct zone *zone;
+	int ret = 1;
+
+	spin_lock(&zone_scan_lock);
+	for_each_populated_zone(zone)
+		if (zone_is_oom_locked(zone)) {
+			ret = 0;
+			goto out;
+		}
+	for_each_populated_zone(zone)
+		zone_set_flag(zone, ZONE_OOM_LOCKED);
+out:
+	spin_unlock(&zone_scan_lock);
+	return ret;
+}
+
+/*
+ * Clears ZONE_OOM_LOCKED for all system zones so that failed allocation
+ * attempts or page faults may now recall the oom killer, if necessary.
+ */
+static void clear_system_oom(void)
+{
+	struct zone *zone;
+
+	spin_lock(&zone_scan_lock);
+	for_each_populated_zone(zone)
+		zone_clear_flag(zone, ZONE_OOM_LOCKED);
+	spin_unlock(&zone_scan_lock);
+}
+
+
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR
+void mem_cgroup_out_of_memory(struct mem_cgroup *mem, gfp_t gfp_mask)
+{
+	unsigned long points = 0;
+	struct task_struct *p;
+
+	if (!try_set_system_oom())
+		return;
+	read_lock(&tasklist_lock);
+retry:
+	p = select_bad_process(&points, mem, CONSTRAINT_NONE, NULL);
+	if (PTR_ERR(p) == -1UL)
+		goto out;
+
+	if (!p)
+		p = current;
+
+	if (oom_kill_process(p, gfp_mask, 0, points, mem,
+				"Memory cgroup out of memory"))
+		goto retry;
+out:
+	read_unlock(&tasklist_lock);
+	clear_system_oom();
+}
+#endif
+
+/*
  * Must be called with tasklist_lock held for read.
  */
 static void __out_of_memory(gfp_t gfp_mask, int order,
@@ -612,46 +653,9 @@ retry:
 		goto retry;
 }
 
-/*
- * pagefault handler calls into here because it is out of memory but
- * doesn't know exactly how or why.
- */
-void pagefault_out_of_memory(void)
-{
-	unsigned long freed = 0;
-
-	blocking_notifier_call_chain(&oom_notify_list, 0, &freed);
-	if (freed > 0)
-		/* Got some memory back in the last second. */
-		return;
-
-	/*
-	 * If this is from memcg, oom-killer is already invoked.
-	 * and not worth to go system-wide-oom.
-	 */
-	if (mem_cgroup_oom_called(current))
-		goto rest_and_return;
-
-	if (sysctl_panic_on_oom)
-		panic("out of memory from page fault. panic_on_oom is selected.\n");
-
-	read_lock(&tasklist_lock);
-	/* unknown gfp_mask and order */
-	__out_of_memory(0, 0, CONSTRAINT_NONE, NULL);
-	read_unlock(&tasklist_lock);
-
-	/*
-	 * Give "p" a good chance of killing itself before we
-	 * retry to allocate memory.
-	 */
-rest_and_return:
-	if (!test_thread_flag(TIF_MEMDIE))
-		schedule_timeout_uninterruptible(1);
-}
-
 /**
  * out_of_memory - kill the "best" process when we run out of memory
- * @zonelist: zonelist pointer
+ * @zonelist: zonelist pointer passed to page allocator
  * @gfp_mask: memory allocation flags
  * @order: amount of memory being requested as a power of 2
  * @nodemask: nodemask passed to page allocator
@@ -665,7 +669,7 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
 		int order, nodemask_t *nodemask)
 {
 	unsigned long freed = 0;
-	enum oom_constraint constraint;
+	enum oom_constraint constraint = CONSTRAINT_NONE;
 
 	blocking_notifier_call_chain(&oom_notify_list, 0, &freed);
 	if (freed > 0)
@@ -681,7 +685,8 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
 	 * Check if there were limitations on the allocation (only relevant for
 	 * NUMA) that may require different handling.
 	 */
-	constraint = constrained_alloc(zonelist, gfp_mask, nodemask);
+	if (zonelist)
+		constraint = constrained_alloc(zonelist, gfp_mask, nodemask);
 	read_lock(&tasklist_lock);
 	if (unlikely(sysctl_panic_on_oom)) {
 		/*
@@ -691,6 +696,7 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
 		 */
 		if (constraint == CONSTRAINT_NONE) {
 			dump_header(NULL, gfp_mask, order, NULL);
+			read_unlock(&tasklist_lock);
 			panic("Out of memory: panic_on_oom is enabled\n");
 		}
 	}
@@ -704,3 +710,17 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
 	if (!test_thread_flag(TIF_MEMDIE))
 		schedule_timeout_uninterruptible(1);
 }
+
+/*
+ * The pagefault handler calls here because it is out of memory, so kill a
+ * memory-hogging task.  If a populated zone has ZONE_OOM_LOCKED set, a parallel
+ * oom killing is already in progress so do nothing.  If a task is found with
+ * TIF_MEMDIE set, it has been killed so do nothing and allow it to exit.
+ */
+void pagefault_out_of_memory(void)
+{
+	if (!try_set_system_oom())
+		return;
+	out_of_memory(NULL, 0, 0, NULL);
+	clear_system_oom();
+}

^ permalink raw reply	[flat|nested] 145+ messages in thread

  */
 static void __out_of_memory(gfp_t gfp_mask, int order,
@@ -612,46 +653,9 @@ retry:
 		goto retry;
 }
 
-/*
- * pagefault handler calls into here because it is out of memory but
- * doesn't know exactly how or why.
- */
-void pagefault_out_of_memory(void)
-{
-	unsigned long freed = 0;
-
-	blocking_notifier_call_chain(&oom_notify_list, 0, &freed);
-	if (freed > 0)
-		/* Got some memory back in the last second. */
-		return;
-
-	/*
-	 * If this is from memcg, oom-killer is already invoked.
-	 * and not worth to go system-wide-oom.
-	 */
-	if (mem_cgroup_oom_called(current))
-		goto rest_and_return;
-
-	if (sysctl_panic_on_oom)
-		panic("out of memory from page fault. panic_on_oom is selected.\n");
-
-	read_lock(&tasklist_lock);
-	/* unknown gfp_mask and order */
-	__out_of_memory(0, 0, CONSTRAINT_NONE, NULL);
-	read_unlock(&tasklist_lock);
-
-	/*
-	 * Give "p" a good chance of killing itself before we
-	 * retry to allocate memory.
-	 */
-rest_and_return:
-	if (!test_thread_flag(TIF_MEMDIE))
-		schedule_timeout_uninterruptible(1);
-}
-
 /**
  * out_of_memory - kill the "best" process when we run out of memory
- * @zonelist: zonelist pointer
+ * @zonelist: zonelist pointer passed to page allocator
  * @gfp_mask: memory allocation flags
  * @order: amount of memory being requested as a power of 2
  * @nodemask: nodemask passed to page allocator
@@ -665,7 +669,7 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
 		int order, nodemask_t *nodemask)
 {
 	unsigned long freed = 0;
-	enum oom_constraint constraint;
+	enum oom_constraint constraint = CONSTRAINT_NONE;
 
 	blocking_notifier_call_chain(&oom_notify_list, 0, &freed);
 	if (freed > 0)
@@ -681,7 +685,8 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
 	 * Check if there were limitations on the allocation (only relevant for
 	 * NUMA) that may require different handling.
 	 */
-	constraint = constrained_alloc(zonelist, gfp_mask, nodemask);
+	if (zonelist)
+		constraint = constrained_alloc(zonelist, gfp_mask, nodemask);
 	read_lock(&tasklist_lock);
 	if (unlikely(sysctl_panic_on_oom)) {
 		/*
@@ -691,6 +696,7 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
 		 */
 		if (constraint == CONSTRAINT_NONE) {
 			dump_header(NULL, gfp_mask, order, NULL);
+			read_unlock(&tasklist_lock);
 			panic("Out of memory: panic_on_oom is enabled\n");
 		}
 	}
@@ -704,3 +710,17 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
 	if (!test_thread_flag(TIF_MEMDIE))
 		schedule_timeout_uninterruptible(1);
 }
+
+/*
+ * The pagefault handler calls here because it is out of memory, so kill a
+ * memory-hogging task.  If a populated zone has ZONE_OOM_LOCKED set, a parallel
+ * oom killing is already in progress so do nothing.  If a task is found with
+ * TIF_MEMDIE set, it has been killed so do nothing and allow it to exit.
+ */
+void pagefault_out_of_memory(void)
+{
+	if (!try_set_system_oom())
+		return;
+	out_of_memory(NULL, 0, 0, NULL);
+	clear_system_oom();
+}

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [patch -mm 4/9 v2] oom: remove compulsory panic_on_oom mode
  2010-02-17  1:58                         ` David Rientjes
@ 2010-02-17  2:13                           ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 145+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-02-17  2:13 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Rik van Riel, Nick Piggin, Andrea Arcangeli,
	Balbir Singh, Lubos Lunak, KOSAKI Motohiro, linux-kernel,
	linux-mm

On Tue, 16 Feb 2010 17:58:05 -0800 (PST)
David Rientjes <rientjes@google.com> wrote:

> On Tue, 16 Feb 2010, David Rientjes wrote:
> 
> > Ok, I'll eliminate pagefault_out_of_memory() and get it to use 
> > out_of_memory() by only checking for constrained_alloc() when
> > gfp_mask != 0.
> > 
> 
> What do you think about making pagefaults use out_of_memory() directly and 
> respecting the sysctl_panic_on_oom settings?
> 

I don't think this patch is good. Because several memcgs can cause
an oom at the same time, independently, system-wide oom locking is
unsuitable. BTW, what I doubt is a much more fundamental thing.

What I doubt most is: why is VM_FAULT_OOM necessary at all, and why do
we have to call the oom killer when a page fault returns it?
Is there any path that returns VM_FAULT_OOM without calling the page
allocator, where the oom killer would actually help?

If something returns VM_FAULT_OOM without calling the usual page
allocator, the oom killer will never help, I guess.

If we don't have that, we don't have to implement pagefault_out_of_memory.

Hmm ?

Thanks,
-Kame


> This removes the check for a parallel memcg oom killing since we can 
> guarantee that's not going to happen if we take ZONE_OOM_LOCKED for all 
> populated zones (nobody is currently executing the oom killer) and no 
> tasks have TIF_MEMDIE set.
> 
> Signed-off-by: David Rientjes <rientjes@google.com>
> ---
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -124,7 +124,6 @@ static inline bool mem_cgroup_disabled(void)
>  	return false;
>  }
>  
> -extern bool mem_cgroup_oom_called(struct task_struct *task);
>  void mem_cgroup_update_file_mapped(struct page *page, int val);
>  unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
>  						gfp_t gfp_mask, int nid,
> @@ -258,11 +257,6 @@ static inline bool mem_cgroup_disabled(void)
>  	return true;
>  }
>  
> -static inline bool mem_cgroup_oom_called(struct task_struct *task)
> -{
> -	return false;
> -}
> -
>  static inline int
>  mem_cgroup_inactive_anon_is_low(struct mem_cgroup *memcg)
>  {
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -200,7 +200,6 @@ struct mem_cgroup {
>  	 * Should the accounting and control be hierarchical, per subtree?
>  	 */
>  	bool use_hierarchy;
> -	unsigned long	last_oom_jiffies;
>  	atomic_t	refcnt;
>  
>  	unsigned int	swappiness;
> @@ -1234,34 +1233,6 @@ static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_mem,
>  	return total;
>  }
>  
> -bool mem_cgroup_oom_called(struct task_struct *task)
> -{
> -	bool ret = false;
> -	struct mem_cgroup *mem;
> -	struct mm_struct *mm;
> -
> -	rcu_read_lock();
> -	mm = task->mm;
> -	if (!mm)
> -		mm = &init_mm;
> -	mem = mem_cgroup_from_task(rcu_dereference(mm->owner));
> -	if (mem && time_before(jiffies, mem->last_oom_jiffies + HZ/10))
> -		ret = true;
> -	rcu_read_unlock();
> -	return ret;
> -}
> -
> -static int record_last_oom_cb(struct mem_cgroup *mem, void *data)
> -{
> -	mem->last_oom_jiffies = jiffies;
> -	return 0;
> -}
> -
> -static void record_last_oom(struct mem_cgroup *mem)
> -{
> -	mem_cgroup_walk_tree(mem, NULL, record_last_oom_cb);
> -}
> -
>  /*
>   * Currently used to update mapped file statistics, but the routine can be
>   * generalized to update other statistics as well.
> @@ -1549,10 +1520,8 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm,
>  		}
>  
>  		if (!nr_retries--) {
> -			if (oom) {
> +			if (oom)
>  				mem_cgroup_out_of_memory(mem_over_limit, gfp_mask);
> -				record_last_oom(mem_over_limit);
> -			}
>  			goto nomem;
>  		}
>  	}
> @@ -2408,8 +2377,6 @@ void mem_cgroup_end_migration(struct mem_cgroup *mem,
>  
>  /*
>   * A call to try to shrink memory usage on charge failure at shmem's swapin.
> - * Calling hierarchical_reclaim is not enough because we should update
> - * last_oom_jiffies to prevent pagefault_out_of_memory from invoking global OOM.
>   * Moreover considering hierarchy, we should reclaim from the mem_over_limit,
>   * not from the memcg which this page would be charged to.
>   * try_charge_swapin does all of these works properly.
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -490,29 +490,6 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
>  	return oom_kill_task(victim);
>  }
>  
> -#ifdef CONFIG_CGROUP_MEM_RES_CTLR
> -void mem_cgroup_out_of_memory(struct mem_cgroup *mem, gfp_t gfp_mask)
> -{
> -	unsigned long points = 0;
> -	struct task_struct *p;
> -
> -	read_lock(&tasklist_lock);
> -retry:
> -	p = select_bad_process(&points, mem, CONSTRAINT_NONE, NULL);
> -	if (PTR_ERR(p) == -1UL)
> -		goto out;
> -
> -	if (!p)
> -		p = current;
> -
> -	if (oom_kill_process(p, gfp_mask, 0, points, mem,
> -				"Memory cgroup out of memory"))
> -		goto retry;
> -out:
> -	read_unlock(&tasklist_lock);
> -}
> -#endif
> -
>  static BLOCKING_NOTIFIER_HEAD(oom_notify_list);
>  
>  int register_oom_notifier(struct notifier_block *nb)
> @@ -578,6 +555,70 @@ void clear_zonelist_oom(struct zonelist *zonelist, gfp_t gfp_mask)
>  }
>  
>  /*
> + * Try to acquire the oom killer lock for all system zones.  Returns zero if a
> + * parallel oom killing is taking place, otherwise locks all zones and returns
> + * non-zero.
> + */
> +static int try_set_system_oom(void)
> +{
> +	struct zone *zone;
> +	int ret = 1;
> +
> +	spin_lock(&zone_scan_lock);
> +	for_each_populated_zone(zone)
> +		if (zone_is_oom_locked(zone)) {
> +			ret = 0;
> +			goto out;
> +		}
> +	for_each_populated_zone(zone)
> +		zone_set_flag(zone, ZONE_OOM_LOCKED);
> +out:
> +	spin_unlock(&zone_scan_lock);
> +	return ret;
> +}
> +
> +/*
> + * Clears ZONE_OOM_LOCKED for all system zones so that failed allocation
> + * attempts or page faults may now recall the oom killer, if necessary.
> + */
> +static void clear_system_oom(void)
> +{
> +	struct zone *zone;
> +
> +	spin_lock(&zone_scan_lock);
> +	for_each_populated_zone(zone)
> +		zone_clear_flag(zone, ZONE_OOM_LOCKED);
> +	spin_unlock(&zone_scan_lock);
> +}
> +
> +
> +#ifdef CONFIG_CGROUP_MEM_RES_CTLR
> +void mem_cgroup_out_of_memory(struct mem_cgroup *mem, gfp_t gfp_mask)
> +{
> +	unsigned long points = 0;
> +	struct task_struct *p;
> +
> +	if (!try_set_system_oom())
> +		return;
> +	read_lock(&tasklist_lock);
> +retry:
> +	p = select_bad_process(&points, mem, CONSTRAINT_NONE, NULL);
> +	if (PTR_ERR(p) == -1UL)
> +		goto out;
> +
> +	if (!p)
> +		p = current;
> +
> +	if (oom_kill_process(p, gfp_mask, 0, points, mem,
> +				"Memory cgroup out of memory"))
> +		goto retry;
> +out:
> +	read_unlock(&tasklist_lock);
> +	clear_system_oom();
> +}
> +#endif
> +
> +/*
>   * Must be called with tasklist_lock held for read.
>   */
>  static void __out_of_memory(gfp_t gfp_mask, int order,
> @@ -612,46 +653,9 @@ retry:
>  		goto retry;
>  }
>  
> -/*
> - * pagefault handler calls into here because it is out of memory but
> - * doesn't know exactly how or why.
> - */
> -void pagefault_out_of_memory(void)
> -{
> -	unsigned long freed = 0;
> -
> -	blocking_notifier_call_chain(&oom_notify_list, 0, &freed);
> -	if (freed > 0)
> -		/* Got some memory back in the last second. */
> -		return;
> -
> -	/*
> -	 * If this is from memcg, oom-killer is already invoked.
> -	 * and not worth to go system-wide-oom.
> -	 */
> -	if (mem_cgroup_oom_called(current))
> -		goto rest_and_return;
> -
> -	if (sysctl_panic_on_oom)
> -		panic("out of memory from page fault. panic_on_oom is selected.\n");
> -
> -	read_lock(&tasklist_lock);
> -	/* unknown gfp_mask and order */
> -	__out_of_memory(0, 0, CONSTRAINT_NONE, NULL);
> -	read_unlock(&tasklist_lock);
> -
> -	/*
> -	 * Give "p" a good chance of killing itself before we
> -	 * retry to allocate memory.
> -	 */
> -rest_and_return:
> -	if (!test_thread_flag(TIF_MEMDIE))
> -		schedule_timeout_uninterruptible(1);
> -}
> -
>  /**
>   * out_of_memory - kill the "best" process when we run out of memory
> - * @zonelist: zonelist pointer
> + * @zonelist: zonelist pointer passed to page allocator
>   * @gfp_mask: memory allocation flags
>   * @order: amount of memory being requested as a power of 2
>   * @nodemask: nodemask passed to page allocator
> @@ -665,7 +669,7 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
>  		int order, nodemask_t *nodemask)
>  {
>  	unsigned long freed = 0;
> -	enum oom_constraint constraint;
> +	enum oom_constraint constraint = CONSTRAINT_NONE;
>  
>  	blocking_notifier_call_chain(&oom_notify_list, 0, &freed);
>  	if (freed > 0)
> @@ -681,7 +685,8 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
>  	 * Check if there were limitations on the allocation (only relevant for
>  	 * NUMA) that may require different handling.
>  	 */
> -	constraint = constrained_alloc(zonelist, gfp_mask, nodemask);
> +	if (zonelist)
> +		constraint = constrained_alloc(zonelist, gfp_mask, nodemask);
>  	read_lock(&tasklist_lock);
>  	if (unlikely(sysctl_panic_on_oom)) {
>  		/*
> @@ -691,6 +696,7 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
>  		 */
>  		if (constraint == CONSTRAINT_NONE) {
>  			dump_header(NULL, gfp_mask, order, NULL);
> +			read_unlock(&tasklist_lock);
>  			panic("Out of memory: panic_on_oom is enabled\n");
>  		}
>  	}
> @@ -704,3 +710,17 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
>  	if (!test_thread_flag(TIF_MEMDIE))
>  		schedule_timeout_uninterruptible(1);
>  }
> +
> +/*
> + * The pagefault handler calls here because it is out of memory, so kill a
> + * memory-hogging task.  If a populated zone has ZONE_OOM_LOCKED set, a parallel
> + * oom killing is already in progress so do nothing.  If a task is found with
> + * TIF_MEMDIE set, it has been killed so do nothing and allow it to exit.
> + */
> +void pagefault_out_of_memory(void)
> +{
> +	if (!try_set_system_oom())
> +		return;
> +	out_of_memory(NULL, 0, 0, NULL);
> +	clear_system_oom();
> +}
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
> 


^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [patch -mm 4/9 v2] oom: remove compulsory panic_on_oom mode
  2010-02-17  1:58                         ` David Rientjes
@ 2010-02-17  2:19                           ` KOSAKI Motohiro
  -1 siblings, 0 replies; 145+ messages in thread
From: KOSAKI Motohiro @ 2010-02-17  2:19 UTC (permalink / raw)
  To: David Rientjes
  Cc: kosaki.motohiro, KAMEZAWA Hiroyuki, Andrew Morton, Rik van Riel,
	Nick Piggin, Andrea Arcangeli, Balbir Singh, Lubos Lunak,
	linux-kernel, linux-mm

> +/*
> + * The pagefault handler calls here because it is out of memory, so kill a
> + * memory-hogging task.  If a populated zone has ZONE_OOM_LOCKED set, a parallel
> + * oom killing is already in progress so do nothing.  If a task is found with
> + * TIF_MEMDIE set, it has been killed so do nothing and allow it to exit.
> + */
> +void pagefault_out_of_memory(void)
> +{
> +	if (!try_set_system_oom())
> +		return;
> +	out_of_memory(NULL, 0, 0, NULL);
> +	clear_system_oom();
> +}

At least, I agree with the pagefault oom part; it needs ZONE_OOM_LOCKED
too. If you make it a separate patch, I'll ack it. I don't know whether
the memcg part is correct or not.




^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [patch -mm 4/9 v2] oom: remove compulsory panic_on_oom mode
  2010-02-17  2:13                           ` KAMEZAWA Hiroyuki
@ 2010-02-17  2:23                             ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 145+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-02-17  2:23 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: David Rientjes, Andrew Morton, Rik van Riel, Nick Piggin,
	Andrea Arcangeli, Balbir Singh, Lubos Lunak, KOSAKI Motohiro,
	linux-kernel, linux-mm

On Wed, 17 Feb 2010 11:13:19 +0900
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:

> On Tue, 16 Feb 2010 17:58:05 -0800 (PST)
> David Rientjes <rientjes@google.com> wrote:
> 
> > On Tue, 16 Feb 2010, David Rientjes wrote:
> > 
> > > Ok, I'll eliminate pagefault_out_of_memory() and get it to use 
> > > out_of_memory() by only checking for constrained_alloc() when
> > > gfp_mask != 0.
> > > 
> > 
> > What do you think about making pagefaults use out_of_memory() directly and 
> > respecting the sysctl_panic_on_oom settings?
> > 
> 
> I don't think this patch is good. Because several memcgs can cause
> an oom at the same time, independently, system-wide oom locking is
> unsuitable.

And basically, memcg's oom means "the usage is over the limit!!" and
never means "the resource is exhausted!!".

So marking zones as OOM sounds strange: you can cause an oom in a 64MB
memcg on a 64GB system.

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [patch -mm 4/9 v2] oom: remove compulsory panic_on_oom mode
  2010-02-17  2:13                           ` KAMEZAWA Hiroyuki
@ 2010-02-17  2:28                             ` David Rientjes
  -1 siblings, 0 replies; 145+ messages in thread
From: David Rientjes @ 2010-02-17  2:28 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Andrew Morton, Rik van Riel, Nick Piggin, Andrea Arcangeli,
	Balbir Singh, Lubos Lunak, KOSAKI Motohiro, linux-kernel,
	linux-mm

On Wed, 17 Feb 2010, KAMEZAWA Hiroyuki wrote:

> > What do you think about making pagefaults use out_of_memory() directly and 
> > respecting the sysctl_panic_on_oom settings?
> > 
> 
> I don't think this patch is good, because several memcg can
> cause oom at the same time independently; system-wide oom locking is
> unsuitable. BTW, what I doubt is a much more fundamental thing.
> 

We want to lock all populated zones with ZONE_OOM_LOCKED to avoid 
needlessly killing more than one task regardless of how many memcgs are 
oom.

> What I doubt at most is "why is VM_FAULT_OOM necessary? or why do we have
> to call the oom_killer when a page fault returns it".
> Is there anyone who returns VM_FAULT_OOM without calling the page allocator,
> where the oom-killer helps in such a situation?
> 

Before we invoked the oom killer for VM_FAULT_OOM, we simply sent a 
SIGKILL to current because we don't have the memory to fault the page 
in; it's better to select a memory-hogging task to kill based on badness() 
than to repeatedly kill current, which may not help in the long term.
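The selection idea above can be shown as a minimal userspace sketch (the struct, its fields, and the rss-only score are invented for illustration; the kernel's real badness() heuristic weighs many more factors):

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for a task: just the fields the sketch needs. */
struct task {
    const char *comm;
    unsigned long rss;   /* resident pages, a crude memory-usage score */
};

/*
 * Rather than SIGKILLing current, scan all candidates and pick the
 * one with the highest "badness" (here simply rss).  Returns NULL
 * if there are no candidates.
 */
struct task *select_bad_process(struct task *tasks, size_t n)
{
    struct task *victim = NULL;
    size_t i;

    for (i = 0; i < n; i++)
        if (!victim || tasks[i].rss > victim->rss)
            victim = &tasks[i];
    return victim;
}
```

Killing the largest consumer frees memory for everyone, which is why it tends to help more in the long term than repeatedly killing whichever task happens to fault.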

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [patch -mm 4/9 v2] oom: remove compulsory panic_on_oom mode
  2010-02-17  2:28                             ` David Rientjes
@ 2010-02-17  2:34                               ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 145+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-02-17  2:34 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Rik van Riel, Nick Piggin, Andrea Arcangeli,
	Balbir Singh, Lubos Lunak, KOSAKI Motohiro, linux-kernel,
	linux-mm

On Tue, 16 Feb 2010 18:28:05 -0800 (PST)
David Rientjes <rientjes@google.com> wrote:

> On Wed, 17 Feb 2010, KAMEZAWA Hiroyuki wrote:
> 
> > > What do you think about making pagefaults use out_of_memory() directly and 
> > > respecting the sysctl_panic_on_oom settings?
> > > 
> > 
> > I don't think this patch is good, because several memcg can
> > cause oom at the same time independently; system-wide oom locking is
> > unsuitable. BTW, what I doubt is a much more fundamental thing.
> > 
> 
> We want to lock all populated zones with ZONE_OOM_LOCKED to avoid 
> needlessly killing more than one task regardless of how many memcgs are 
> oom.
> 
The current implementation achieves what memcg wants. Why remove it and break memcg?


> > What I doubt at most is "why is VM_FAULT_OOM necessary? or why do we have
> > to call the oom_killer when a page fault returns it".
> > Is there anyone who returns VM_FAULT_OOM without calling the page allocator,
> > where the oom-killer helps in such a situation?
> > 
> 
> Before we invoked the oom killer for VM_FAULT_OOM, we simply sent a 
> SIGKILL to current because we don't have the memory to fault the page 
> in; it's better to select a memory-hogging task to kill based on badness() 
> than to repeatedly kill current, which may not help in the long term.
> 
What I mean is:
 - VM_FAULT_OOM means not "memory is exhausted" but "something is exhausted".

For example, when hugepages are all used, it may return VM_FAULT_OOM.
Especially when nr_overcommit_hugepage == usage_of_hugepage, it returns VM_FAULT_OOM.

Then, how can the oom-killer help? I think it never can, and the requester should die.

Before modifying the current code, I think we have to check all VM_FAULT_OOM sites and distinguish:
 - memory is exhausted (and the page allocator wasn't called.)
 - something other than memory is exhausted.

And in the hugepage case, even for order > PAGE_ALLOC_COSTLY_ORDER, the oom-killer
is called and pagefault_out_of_memory() kills tasks randomly.

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [patch -mm 4/9 v2] oom: remove compulsory panic_on_oom mode
  2010-02-17  2:23                             ` KAMEZAWA Hiroyuki
@ 2010-02-17  2:37                               ` David Rientjes
  -1 siblings, 0 replies; 145+ messages in thread
From: David Rientjes @ 2010-02-17  2:37 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Andrew Morton, Rik van Riel, Nick Piggin, Andrea Arcangeli,
	Balbir Singh, Lubos Lunak, KOSAKI Motohiro, linux-kernel,
	linux-mm

On Wed, 17 Feb 2010, KAMEZAWA Hiroyuki wrote:

> And basically, a memcg oom means "the usage is over the limit!!" and never
> means "resources are exhausted!!".
> 
> Then, marking zones as OOM sounds strange. You can cause an oom in a 64MB memcg
> on a 64GB system.
> 

ZONE_OOM_LOCKED is taken system-wide because the result of a memcg oom is 
that a task will get killed and free memory, so VM_FAULT_OOM doesn't 
require any additional killing if we're oom; it should just retry after 
the task has exited.  If we remove the zone locking for memcg, it is 
possible that pagefaults will race with setting TIF_MEMDIE and two tasks 
get killed instead.  I guess that's acceptable considering it's just as 
likely that the memcg will reallocate to the same limit again and cause 
VM_FAULT_OOM to rekill.
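The all-or-nothing locking being discussed can be modeled in a few lines of userspace C. This is only a sketch of the control flow: ZONE_OOM_LOCKED is reduced to a plain bool per zone, with no atomics, so it is not the real kernel implementation and the function names are invented:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model: one flag per zone standing in for ZONE_OOM_LOCKED. */
struct zone { bool oom_locked; };

/*
 * Try to take the oom lock on every zone; if any zone is already
 * locked, unwind the locks we did take and back off, so that only
 * one oom kill is ever in flight across the whole system.
 */
bool try_lock_all_zones(struct zone *zones, size_t n)
{
    size_t i;

    for (i = 0; i < n; i++) {
        if (zones[i].oom_locked) {
            while (i--)                 /* release only what we took */
                zones[i].oom_locked = false;
            return false;
        }
        zones[i].oom_locked = true;
    }
    return true;
}

void unlock_all_zones(struct zone *zones, size_t n)
{
    size_t i;

    for (i = 0; i < n; i++)
        zones[i].oom_locked = false;
}
```

A caller that fails the trylock simply waits and retries the fault instead of killing another task, which is exactly the "no additional killing" behavior described above.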

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [patch -mm 4/9 v2] oom: remove compulsory panic_on_oom mode
  2010-02-17  2:34                               ` KAMEZAWA Hiroyuki
@ 2010-02-17  2:58                                 ` David Rientjes
  -1 siblings, 0 replies; 145+ messages in thread
From: David Rientjes @ 2010-02-17  2:58 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Andrew Morton, Rik van Riel, Nick Piggin, Andrea Arcangeli,
	Balbir Singh, Lubos Lunak, KOSAKI Motohiro, linux-kernel,
	linux-mm

On Wed, 17 Feb 2010, KAMEZAWA Hiroyuki wrote:

> > We want to lock all populated zones with ZONE_OOM_LOCKED to avoid 
> > needlessly killing more than one task regardless of how many memcgs are 
> > oom.
> > 
> The current implementation achieves what memcg wants. Why remove it and break memcg?
> 

I've updated my patch to not take ZONE_OOM_LOCKED for any zones on memcg 
oom.  I'm hoping that you will add sysctl_panic_on_oom == 2 for this case 
later, however.

> What I mean is
>  - What VM_FAULT_OOM means is not "memory is exhausted" but "something is exhausted".
> 
> For example, when hugepages are all used, it may return VM_FAULT_OOM.
> Especially when nr_overcommit_hugepage == usage_of_hugepage, it returns VM_FAULT_OOM.
> 

The hugetlb case seems to be the only misuse of VM_FAULT_OOM where it 
doesn't mean we simply don't have the memory to handle the page fault, 
i.e. your earlier "memory is exhausted" definition.  That was handled well 
before calling out_of_memory() by simply killing current since we know it 
is faulting hugetlb pages and its resource is limited.

We could pass the vma to pagefault_out_of_memory() and simply kill current 
if it's killable and is_vm_hugetlb_page(vma).
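Sketched as userspace C, that proposed dispatch might look like the following. The struct, the enum, and the dispatch function are all hypothetical; only the name is_vm_hugetlb_page() mirrors a real kernel helper:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy vma: a single flag stands in for the VM_HUGETLB vm_flags bit. */
struct vm_area { bool hugetlb; };

enum oom_action { KILL_CURRENT, RUN_OOM_KILLER };

static bool is_vm_hugetlb_page(const struct vm_area *vma)
{
    return vma && vma->hugetlb;
}

/*
 * If the faulting vma is hugetlb-backed, the shortage is of a fixed
 * resource (the hugepage pool), so killing other tasks cannot help:
 * kill the faulting task itself.  Otherwise fall through to the
 * general oom killer.  A sketch of the proposal, not kernel code.
 */
enum oom_action pagefault_oom_dispatch(const struct vm_area *vma)
{
    if (is_vm_hugetlb_page(vma))
        return KILL_CURRENT;
    return RUN_OOM_KILLER;
}
```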

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [patch -mm 4/9 v2] oom: remove compulsory panic_on_oom mode
  2010-02-17  2:58                                 ` David Rientjes
@ 2010-02-17  3:21                                   ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 145+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-02-17  3:21 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Rik van Riel, Nick Piggin, Andrea Arcangeli,
	Balbir Singh, Lubos Lunak, KOSAKI Motohiro, linux-kernel,
	linux-mm

On Tue, 16 Feb 2010 18:58:17 -0800 (PST)
David Rientjes <rientjes@google.com> wrote:

> On Wed, 17 Feb 2010, KAMEZAWA Hiroyuki wrote:
> 
> > > We want to lock all populated zones with ZONE_OOM_LOCKED to avoid 
> > > needlessly killing more than one task regardless of how many memcgs are 
> > > oom.
> > > 
> > The current implementation achieves what memcg wants. Why remove it and break memcg?
> > 
> 
> I've updated my patch to not take ZONE_OOM_LOCKED for any zones on memcg 
> oom.  I'm hoping that you will add sysctl_panic_on_oom == 2 for this case 
> later, however.
> 
I'll write panic_on_oom for memcg, later. 

> > What I mean is
> >  - What VM_FAULT_OOM means is not "memory is exhausted" but "something is exhausted".
> > 
> > For example, when hugepages are all used, it may return VM_FAULT_OOM.
> > Especially when nr_overcommit_hugepage == usage_of_hugepage, it returns VM_FAULT_OOM.
> > 
> 
> The hugetlb case seems to be the only misuse of VM_FAULT_OOM where it 
> doesn't mean we simply don't have the memory to handle the page fault, 
> i.e. your earlier "memory is exhausted" definition.  That was handled well 
> before calling out_of_memory() by simply killing current since we know it 
> is faulting hugetlb pages and its resource is limited.
> 
> We could pass the vma to pagefault_out_of_memory() and simply kill current 
> if it's killable and is_vm_hugetlb_page(vma).
> 

No, hugepage is not the only case.
You may not have read it, but we were annoyed by an i915 driver bug recently that
was clearly a misuse of VM_FAULT_OOM. We got many reports of the OOM killer in
recent months. (Thanks to Kosaki for this.)

A quick glance around the core code...
 - Hugepage et al. should return some VM_FAULT_NO_RESOURCE rather than VM_FAULT_OOM.
 - filemap.c's VM_FAULT_OOM shouldn't call pagefault_out_of_memory() because it has
   already called the oom_killer if it can.
 - For relayfs, its VM_FAULT_OOM should probably be a BUG_ON()...
 - filemap_xip.c returns VM_FAULT_OOM... but it doesn't seem to be OOM, more
   like VM_FAULT_NO_VALID_PAGE_FOUND. (But I'm not familiar with this area.)
 - fs/buffer.c's VM_FAULT_OOM is returned after the oom-killer is called.
 - shmem.c's VM_FAULT_OOM is returned after the oom-killer is called.

i915's VM_FAULT_OOM is mysterious, but I can't tell whether it's a real OOM or just
a shortage of its own resource. I think VM_FAULT_NO_RESOURCE should be added.
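The proposed VM_FAULT_NO_RESOURCE flag was never merged, but the split being argued for could be sketched like this. The flag values and the outcome enum are invented for illustration, with SIGBUS as the traditional "no resource" signal to userspace:

```c
#include <assert.h>

/* Fault codes as bit flags, in the style of include/linux/mm.h.
 * VM_FAULT_NO_RESOURCE is the hypothetical flag proposed above;
 * the numeric values are illustrative, not the kernel's. */
#define VM_FAULT_OOM          0x0001u
#define VM_FAULT_SIGBUS       0x0002u
#define VM_FAULT_NO_RESOURCE  0x0100u  /* proposed: a bounded resource ran out */

enum fault_outcome { INVOKE_OOM_KILLER, SEND_SIGBUS };

/*
 * With a separate flag, the arch fault handler could route "a fixed
 * pool (hugepages, a driver resource) is exhausted" to SIGBUS on the
 * faulting task, and reserve the oom killer for genuine memory
 * exhaustion.
 */
enum fault_outcome handle_fault_error(unsigned int fault)
{
    if (fault & VM_FAULT_NO_RESOURCE)
        return SEND_SIGBUS;            /* killing other tasks can't help */
    if (fault & VM_FAULT_OOM)
        return INVOKE_OOM_KILLER;
    return SEND_SIGBUS;                /* VM_FAULT_SIGBUS and friends */
}
```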


Thanks,
-Kame



^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [patch -mm 4/9 v2] oom: remove compulsory panic_on_oom mode
  2010-02-17  3:21                                   ` KAMEZAWA Hiroyuki
@ 2010-02-17  9:11                                     ` David Rientjes
  -1 siblings, 0 replies; 145+ messages in thread
From: David Rientjes @ 2010-02-17  9:11 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Andrew Morton, Rik van Riel, Nick Piggin, Andrea Arcangeli,
	Balbir Singh, Lubos Lunak, KOSAKI Motohiro, linux-kernel,
	linux-mm

On Wed, 17 Feb 2010, KAMEZAWA Hiroyuki wrote:

> > The hugetlb case seems to be the only misuse of VM_FAULT_OOM where it 
> > doesn't mean we simply don't have the memory to handle the page fault, 
> > i.e. your earlier "memory is exhausted" definition.  That was handled well 
> > before calling out_of_memory() by simply killing current since we know it 
> > is faulting hugetlb pages and its resource is limited.
> > 
> > We could pass the vma to pagefault_out_of_memory() and simply kill current 
> > if it's killable and is_vm_hugetlb_page(vma).
> > 
> 
> No, hugepage is not the only case.
> You may not have read it, but we were annoyed by an i915 driver bug recently that
> was clearly a misuse of VM_FAULT_OOM. We got many reports of the OOM killer in
> recent months. (Thanks to Kosaki for this.)
> 

That's been fixed, right?

> A quick glance around the core code...
>  - Hugepage et al. should return some VM_FAULT_NO_RESOURCE rather than VM_FAULT_OOM.

We can detect this with is_vm_hugetlb_page() if we pass the vma into 
pagefault_out_of_memory() without adding another VM_FAULT flag.

>  - filemap.c's VM_FAULT_OOM shouldn't call pagefault_out_of_memory() because it has
>    already called the oom_killer if it can.

See below.

>  - For relayfs, its VM_FAULT_OOM should probably be a BUG_ON()...

That looks appropriate at first glance.

>  - filemap_xip.c returns VM_FAULT_OOM... but it doesn't seem to be OOM, more
>    like VM_FAULT_NO_VALID_PAGE_FOUND. (But I'm not familiar with this area.)
>  - fs/buffer.c's VM_FAULT_OOM is returned after the oom-killer is called.
>  - shmem.c's VM_FAULT_OOM is returned after the oom-killer is called.
> 

The filemap, shmem, and block_prepare_write() cases will call the oom 
killer but, depending on the gfp mask, they will retry their allocations 
after the oom killer is called, so they should almost never fail with 
-ENOMEM and return VM_FAULT_OOM.  They allocate either from small-objsize 
slabs or with orders less than PAGE_ALLOC_COSTLY_ORDER, which by default 
continue to retry even if direct reclaim fails.  If we're returning 
VM_FAULT_OOM from these handlers, it should only be because of 
GFP_NOFS | __GFP_NORETRY, or because current has been oom killed and still 
can't find memory (so we don't care if the oom killer is called again 
since it won't kill anything else).

So like I said, I don't really see a case where VM_FAULT_NO_RESOURCE would 
be helpful other than hugetlb, which we can already detect by passing the 
vma into the pagefault oom handler.
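The retry behavior being relied on here can be condensed into one predicate. The values are simplified stand-ins, and the real decision in mm/page_alloc.c also considers __GFP_REPEAT, __GFP_NOFAIL, and whether reclaim is making progress, so treat this as a sketch of the rule, not the kernel's code:

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified gfp flag and order cutoff; the real values live in
 * include/linux/gfp.h and include/linux/mmzone.h. */
#define __GFP_NORETRY            0x1u
#define PAGE_ALLOC_COSTLY_ORDER  3u

/*
 * Low-order allocations without __GFP_NORETRY keep retrying after
 * reclaim and the oom killer run, so their failure (and hence a
 * VM_FAULT_OOM from these fault handlers) should be rare.
 */
bool should_retry_alloc(unsigned int gfp_mask, unsigned int order)
{
    if (gfp_mask & __GFP_NORETRY)
        return false;
    return order <= PAGE_ALLOC_COSTLY_ORDER;
}
```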

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [patch -mm 4/9 v2] oom: remove compulsory panic_on_oom mode
  2010-02-17  9:11                                     ` David Rientjes
@ 2010-02-17  9:52                                       ` Nick Piggin
  -1 siblings, 0 replies; 145+ messages in thread
From: Nick Piggin @ 2010-02-17  9:52 UTC (permalink / raw)
  To: David Rientjes
  Cc: KAMEZAWA Hiroyuki, Andrew Morton, Rik van Riel, Andrea Arcangeli,
	Balbir Singh, Lubos Lunak, KOSAKI Motohiro, linux-kernel,
	linux-mm

On Wed, Feb 17, 2010 at 01:11:30AM -0800, David Rientjes wrote:
> On Wed, 17 Feb 2010, KAMEZAWA Hiroyuki wrote:
> > A quick glance around the core code...
> >  - Hugepage et al. should return some VM_FAULT_NO_RESOURCE rather than VM_FAULT_OOM.
> 
> We can detect this with is_vm_hugetlb_page() if we pass the vma into 
> pagefault_out_of_memory() without adding another VM_FAULT flag.

The real question is what to do when returning to userspace. I don't
think there are many options. SIGBUS is traditionally used for "no
resource".


> >  - filemap.c's VM_FAULT_OOM shouldn't call pagefault_out_of_memory() because it has
> >    already called the oom_killer if it can.
> 
> See below.
> 
> >  - For relayfs, its VM_FAULT_OOM should probably be a BUG_ON()...
> 
> That looks appropriate at first glance.
> 
> >  - filemap_xip.c returns VM_FAULT_OOM... but it doesn't seem to be OOM, more
> >    like VM_FAULT_NO_VALID_PAGE_FOUND. (But I'm not familiar with this area.)

SIGBUS, probably, yes. I questioned this as well, but it must never
have been resolved.


> >  - fs/buffer.c's VM_FAULT_OOM is returned after the oom-killer is called.
> >  - shmem.c's VM_FAULT_OOM is returned after the oom-killer is called.
> > 
> 
> The filemap, shmem, and block_prepare_write() cases will call the oom 
> killer but, depending on the gfp mask, they will retry their allocations 
> after the oom killer is called, so they should almost never fail with 
> -ENOMEM and return VM_FAULT_OOM.  They allocate either from small-objsize 
> slabs or with orders less than PAGE_ALLOC_COSTLY_ORDER, which by default 
> continue to retry even if direct reclaim fails.  If we're returning 
> VM_FAULT_OOM from these handlers, it should only be because of 
> GFP_NOFS | __GFP_NORETRY, or because current has been oom killed and still 
> can't find memory (so we don't care if the oom killer is called again 
> since it won't kill anything else).

Yep. And yes, you are right that we prefer to do the oom killing at the
allocation point, where we know all the context; however, the fact is that
VM_FAULT_OOM is an allowed part of the fault API, so we have to handle it
somehow.

It can theoretically be returned for valid reasons, say if a driver or
arch page table does a high-order allocation, or if the page allocator
implementation were to be changed.

We can't rightly just kill the task at this point, even if it has
invoked the oom killer, because it could have been marked as unkillable.


^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [patch -mm 4/9 v2] oom: remove compulsory panic_on_oom mode
@ 2010-02-17  9:52                                       ` Nick Piggin
  0 siblings, 0 replies; 145+ messages in thread
From: Nick Piggin @ 2010-02-17  9:52 UTC (permalink / raw)
  To: David Rientjes
  Cc: KAMEZAWA Hiroyuki, Andrew Morton, Rik van Riel, Andrea Arcangeli,
	Balbir Singh, Lubos Lunak, KOSAKI Motohiro, linux-kernel,
	linux-mm

On Wed, Feb 17, 2010 at 01:11:30AM -0800, David Rientjes wrote:
> On Wed, 17 Feb 2010, KAMEZAWA Hiroyuki wrote:
> > quick glance around core codes...
> >  - HUGEPAGE at el. should return some VM_FAULT_NO_RESOUECE rather than VM_FAULT_OOM.
> 
> We can detect this with is_vm_hugetlb_page() if we pass the vma into 
> pagefault_out_of_memory() without adding another VM_FAULT flag.

The real question is, what to do when returning to userspace. I don't
think there's a lot of options. SIGBUS is traditionally used for "no
resource".


> >  - filemap.c's VM_FAULT_OOM shoudn't call page_fault_oom_kill because it has already
> >    called oom_killer if it can. 
> 
> See below.
> 
> >  - about relayfs, is VM_FAULT_OOM should be BUG_ON()...
> 
> That looks appropriate at first glance.
> 
> >  - filemap_xip.c return VM_FAULT_OOM....but it doesn't seem to be OOM..
> >    just like VM_FAULT_NO_VALID_PAGE_FOUND. (But I'm not familiar with this area.)

SIGBUS, probably, yes. I questioned this as well but it mustn't
have been resolved.


> >  - fs/buffer.c's VM_FAULT_OOM is returned after the oom-killer is called.
> >  - shmem.c's VM_FAULT_OOM is returned after the oom-killer is called.
> > 
> 
> The filemap, shmem, and block_prepare_write() cases will call the oom 
> killer but, depending on the gfp mask, they will retry their allocations 
> after the oom killer is called so we should never return VM_FAULT_OOM 
> because they return -ENOMEM.  They fail from either small objsize slab 
> allocations or with orders less than PAGE_ALLOC_COSTLY_ORDER which by 
> default continues to retry even if direct reclaim fails.  If we're 
> returning with VM_FAULT_OOM from these handlers, it should only be because 
> of GFP_NOFS | __GFP_NORETRY or current has been oom killed and still can't 
> find memory (so we don't care if the oom killer is called again since it 
> won't kill anything else).

Yep. And yes you are right that we prefer to do the oom killing at the
allocation point where we know all the context, however the fact is that
VM_FAULT_OOM is an allowed part of the fault API so we have to handle it
somehow.

It can theoretically be returned for valid reasons, say if a driver or
arch page table has a high-order allocation, or if the page allocator
implementation were to be changed.

We can't rightly just kill the task at this point, even if it has
invoked the oom killer, because it could have been marked as unkillable.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [patch -mm 4/9 v2] oom: remove compulsory panic_on_oom mode
  2010-02-17  9:52                                       ` Nick Piggin
@ 2010-02-17 22:04                                         ` David Rientjes
  -1 siblings, 0 replies; 145+ messages in thread
From: David Rientjes @ 2010-02-17 22:04 UTC (permalink / raw)
  To: Nick Piggin
  Cc: KAMEZAWA Hiroyuki, Andrew Morton, Rik van Riel, Andrea Arcangeli,
	Balbir Singh, Lubos Lunak, KOSAKI Motohiro, linux-kernel,
	linux-mm

On Wed, 17 Feb 2010, Nick Piggin wrote:

> > > quick glance around core codes...
> > >  - HUGEPAGE et al. should return some VM_FAULT_NO_RESOURCE rather than VM_FAULT_OOM.
> > 
> > We can detect this with is_vm_hugetlb_page() if we pass the vma into 
> > pagefault_out_of_memory() without adding another VM_FAULT flag.
> 
> The real question is, what to do when returning to userspace. I don't
> think there's a lot of options. SIGBUS is traditionally used for "no
> resource".
> 

For is_vm_hugetlb_page() in the pagefault oom handler, I think it should 
default to killing current as we did previously until that's worked out 
(and as some architectures like ia64 and powerpc still do).  In fact, 
pagefault ooms should probably always default to killing current if it's
killable.

> > The filemap, shmem, and block_prepare_write() cases will call the oom 
> > killer but, depending on the gfp mask, they will retry their allocations 
> > after the oom killer is called so we should never return VM_FAULT_OOM 
> > because they return -ENOMEM.  They fail from either small objsize slab 
> > allocations or with orders less than PAGE_ALLOC_COSTLY_ORDER which by 
> > default continues to retry even if direct reclaim fails.  If we're 
> > returning with VM_FAULT_OOM from these handlers, it should only be because 
> > of GFP_NOFS | __GFP_NORETRY or current has been oom killed and still can't 
> > find memory (so we don't care if the oom killer is called again since it 
> > won't kill anything else).
> 
> Yep. And yes you are right that we prefer to do the oom killing at the
> allocation point where we know all the context, however the fact is that
> VM_FAULT_OOM is an allowed part of the fault API so we have to handle it
> somehow.
> 
> It can theoretically be returned for valid reasons, say if a driver or
> arch page table has a high-order allocation, or if the page allocator
> implementation were to be changed.
> 
> We can't rightly just kill the task at this point, even if it has
> invoked the oom killer, because it could have been marked as unkillable.
> 

That's easy to test in the oom handler, we can default to killing current 
but then kill another task if it is unkillable:

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -696,15 +696,23 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
 }
 
 /*
- * The pagefault handler calls here because it is out of memory, so kill a
- * memory-hogging task.  If a populated zone has ZONE_OOM_LOCKED set, a parallel
- * oom killing is already in progress so do nothing.  If a task is found with
- * TIF_MEMDIE set, it has been killed so do nothing and allow it to exit.
+ * The pagefault handler calls here because it is out of memory, so kill current
+ * by default.  If it's unkillable, then fallback to killing a memory-hogging
+ * task.  If a populated zone has ZONE_OOM_LOCKED set, a parallel oom killing is
+ * already in progress so do nothing.  If a task is found with TIF_MEMDIE set,
+ * it has been killed so do nothing and allow it to exit.
  */
 void pagefault_out_of_memory(void)
 {
+	unsigned long totalpages;
+	int err;
+
 	if (!try_set_system_oom())
 		return;
-	out_of_memory(NULL, 0, 0, NULL);
+	constrained_alloc(NULL, 0, NULL, &totalpages);
+	err = oom_kill_process(current, 0, 0, 0, totalpages, NULL,
+				"Out of memory (pagefault)");
+	if (err)
+		out_of_memory(NULL, 0, 0, NULL);
 	clear_system_oom();
 }

We'll need to convert the architectures that still only issue a SIGKILL to 
current to use pagefault_out_of_memory() before OOM_DISABLE is fully 
respected across the kernel, though.

^ permalink raw reply	[flat|nested] 145+ messages in thread


* Re: [patch -mm 4/9 v2] oom: remove compulsory panic_on_oom mode
  2010-02-17  2:34                               ` KAMEZAWA Hiroyuki
@ 2010-02-22  5:31                               ` Daisuke Nishimura
  2010-02-22  6:15                                   ` KAMEZAWA Hiroyuki
  -1 siblings, 1 reply; 145+ messages in thread
From: Daisuke Nishimura @ 2010-02-22  5:31 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: David Rientjes, Andrew Morton, Rik van Riel, Nick Piggin,
	Andrea Arcangeli, Balbir Singh, Lubos Lunak, KOSAKI Motohiro,
	linux-kernel, linux-mm, Daisuke Nishimura

[-- Attachment #1: Type: text/plain, Size: 7891 bytes --]

Hi.

On Wed, 17 Feb 2010 11:34:30 +0900, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> On Tue, 16 Feb 2010 18:28:05 -0800 (PST)
> David Rientjes <rientjes@google.com> wrote:
> 
> > On Wed, 17 Feb 2010, KAMEZAWA Hiroyuki wrote:
> > 
> > > > What do you think about making pagefaults use out_of_memory() directly and 
> > > > respecting the sysctl_panic_on_oom settings?
> > > > 
> > > 
> > > I don't think this patch is good. Because several memcg can
> > > cause oom at the same time independently, system-wide oom locking is
> > > unsuitable. BTW, what I doubt is much more fundamental thing.
> > > 
> > 
> > We want to lock all populated zones with ZONE_OOM_LOCKED to avoid 
> > needlessly killing more than one task regardless of how many memcgs are 
> > oom.
> > 
> Current implementation achieves what memcg wants. Why remove and destroy memcg?
> 
It might be a bit off-topic, but memcg's check for last_oom_jiffies seems
not to work well under heavy load, and pagefault_out_of_memory() causes
global oom.

Step.1 make a memory cgroup directory and set memory.limit_in_bytes to a small value

  > mkdir /cgroup/memory/test
  > echo 1M >/cgroup/memory/test/memory.limit_in_bytes

Step.2 run the attached test program (which allocates memory and forks recursively)

  > ./recursive_fork -c 8 -s `expr 1 \* 1024 \* 1024`

This causes not only memcg's oom, but also global oom (my machine has 8 CPUs).

===
[348090.121808] recursive_fork3 invoked oom-killer: gfp_mask=0xd0, order=0, oom_adj=0
[348090.121821] recursive_fork3 cpuset=/ mems_allowed=0
[348090.121829] Pid: 22744, comm: recursive_fork3 Not tainted 2.6.32.8-00001-gb6cd517 #3
[348090.121832] Call Trace:
[348090.121849]  [<ffffffff810d6015>] oom_kill_process+0x86/0x295
[348090.121855]  [<ffffffff810d64cf>] ? select_bad_process+0x63/0xf0
[348090.121861]  [<ffffffff810d687a>] mem_cgroup_out_of_memory+0x69/0x87
[348090.121870]  [<ffffffff811119c2>] __mem_cgroup_try_charge+0x15f/0x1d4
[348090.121876]  [<ffffffff811126bc>] mem_cgroup_try_charge_swapin+0x104/0x159
[348090.121885]  [<ffffffff810edd9b>] handle_mm_fault+0x4ca/0x76c
[348090.121895]  [<ffffffff8143419f>] ? do_page_fault+0x141/0x2da
[348090.121904]  [<ffffffff81087286>] ? trace_hardirqs_on+0xd/0xf
[348090.121910]  [<ffffffff8143419f>] ? do_page_fault+0x141/0x2da
[348090.121915]  [<ffffffff8143431c>] do_page_fault+0x2be/0x2da
[348090.121922]  [<ffffffff81432115>] page_fault+0x25/0x30
[348090.121929] Task in /test killed as a result of limit of /test
[348090.121936] memory: usage 1024kB, limit 1024kB, failcnt 279335
[348090.121940] memory+swap: usage 4260kB, limit 9007199254740991kB, failcnt 0
[348090.121943] Mem-Info:
[348090.121947] Node 0 DMA per-cpu:
[348090.121952] CPU    0: hi:    0, btch:   1 usd:   0
[348090.121956] CPU    1: hi:    0, btch:   1 usd:   0
[348090.121960] CPU    2: hi:    0, btch:   1 usd:   0
[348090.121963] CPU    3: hi:    0, btch:   1 usd:   0
[348090.121967] CPU    4: hi:    0, btch:   1 usd:   0
[348090.121970] CPU    5: hi:    0, btch:   1 usd:   0
[348090.121974] CPU    6: hi:    0, btch:   1 usd:   0
[348090.121977] CPU    7: hi:    0, btch:   1 usd:   0
[348090.121980] Node 0 DMA32 per-cpu:
[348090.121984] CPU    0: hi:  186, btch:  31 usd:  19
[348090.121988] CPU    1: hi:  186, btch:  31 usd:  11
[348090.121992] CPU    2: hi:  186, btch:  31 usd: 178
[348090.121995] CPU    3: hi:  186, btch:  31 usd:   0
[348090.121999] CPU    4: hi:  186, btch:  31 usd: 182
[348090.122002] CPU    5: hi:  186, btch:  31 usd:  29
[348090.122006] CPU    6: hi:  186, btch:  31 usd:   0
[348090.122009] CPU    7: hi:  186, btch:  31 usd:   0
[348090.122012] Node 0 Normal per-cpu:
[348090.122016] CPU    0: hi:  186, btch:  31 usd:  54
[348090.122020] CPU    1: hi:  186, btch:  31 usd: 109
[348090.122023] CPU    2: hi:  186, btch:  31 usd: 149
[348090.122027] CPU    3: hi:  186, btch:  31 usd: 119
[348090.122030] CPU    4: hi:  186, btch:  31 usd: 123
[348090.122033] CPU    5: hi:  186, btch:  31 usd: 145
[348090.122037] CPU    6: hi:  186, btch:  31 usd:  54
[348090.122041] CPU    7: hi:  186, btch:  31 usd:  95
[348090.122049] active_anon:5354 inactive_anon:805 isolated_anon:0
[348090.122051]  active_file:18317 inactive_file:57785 isolated_file:0
[348090.122053]  unevictable:0 dirty:0 writeback:211 unstable:0
[348090.122054]  free:3324478 slab_reclaimable:18860 slab_unreclaimable:13472
[348090.122056]  mapped:4315 shmem:63 pagetables:1098 bounce:0
[348090.122059] Node 0 DMA free:15676kB min:12kB low:12kB high:16kB active_anon:0kB inacti
ve_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(
file):0kB present:15100kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_re
claimable:0kB slab_unreclaimable:16kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:
0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[348090.122076] lowmem_reserve[]: 0 3204 13932 13932
[348090.122083] Node 0 DMA32 free:2773244kB min:3472kB low:4340kB high:5208kB active_anon:
0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB
 isolated(file):0kB present:3281248kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem
:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:
0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[348090.122100] lowmem_reserve[]: 0 0 10728 10728
[348090.122108] Node 0 Normal free:10508992kB min:11624kB low:14528kB high:17436kB active_
anon:21416kB inactive_anon:3220kB active_file:73268kB inactive_file:231140kB unevictable:0
kB isolated(anon):0kB isolated(file):0kB present:10985984kB mlocked:0kB dirty:0kB writebac
k:844kB mapped:17260kB shmem:252kB slab_reclaimable:75440kB slab_unreclaimable:53872kB ker
nel_stack:1224kB pagetables:4392kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned
:0 all_unreclaimable? no
[348090.122125] lowmem_reserve[]: 0 0 0 0
[348090.122788] Node 0 DMA: 1*4kB 1*8kB 3*16kB 2*32kB 3*64kB 2*128kB 1*256kB 1*512kB 2*102
4kB 2*2048kB 2*4096kB = 15676kB
[348090.122853] Node 0 DMA32: 11*4kB 6*8kB 2*16kB 4*32kB 6*64kB 13*128kB 4*256kB 6*512kB 6
*1024kB 4*2048kB 672*4096kB = 2773244kB
[348090.122915] Node 0 Normal: 188*4kB 128*8kB 214*16kB 409*32kB 107*64kB 18*128kB 4*256kB
 1*512kB 2*1024kB 0*2048kB 2558*4096kB = 10508592kB
[348090.122936] 76936 total pagecache pages
[348090.122940] 816 pages in swap cache
[348090.122943] Swap cache stats: add 7851711, delete 7850894, find 3676243/4307445
[348090.122946] Free swap  = 1995492kB
[348090.122949] Total swap = 2000888kB
[348090.300467] 3670016 pages RAM
[348090.300471] 153596 pages reserved
[348090.300474] 38486 pages shared
[348090.300476] 162081 pages non-shared
[348090.300482] Memory cgroup out of memory: kill process 22072 (recursive_fork3) score 12
48 or a child
[348090.300486] Killed process 22072 (recursive_fork3)
[348090.300524] Kernel panic - not syncing: out of memory from page fault. panic_on_oom is
 selected.
[348090.300526]
[348090.311038] Pid: 22744, comm: recursive_fork3 Not tainted 2.6.32.8-00001-gb6cd517 #3
[348090.311050] Call Trace:
[348090.311073]  [<ffffffff8142efa4>] panic+0x75/0x133
[348090.311090]  [<ffffffff810d67d2>] pagefault_out_of_memory+0x50/0x8f
[348090.311104]  [<ffffffff81036a2d>] mm_fault_error+0x37/0xba
[348090.311117]  [<ffffffff8143428d>] do_page_fault+0x22f/0x2da
[348090.311130]  [<ffffffff81432115>] page_fault+0x25/0x30
===

I took a kdump by enabling panic_on_oom, and compared last_oom_jiffies and jiffies.

crash> struct mem_cgroup.last_oom_jiffies 0xffffc90013514000
  last_oom_jiffies = 4642757419,
crash> p jiffies
jiffies = $10 = 4642757607

I agree this is an extreme example, but this is not a desirable behavior.
Changing "HZ/10" in mem_cgroup_last_oom_called() to "HZ/2" or so would fix
this case, but it's not an essential fix.

Any thoughts?


Regards,
Daisuke Nishimura.


[-- Attachment #2: recursive_fork3.c --]
[-- Type: text/x-csrc, Size: 1394 bytes --]

#define _GNU_SOURCE

#include <stdio.h>
#include <unistd.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <fcntl.h>
#include <libgen.h>
#include <errno.h>

void
recursive_fork(size_t size)
{
	pid_t pid;

	while ((pid = fork()) == 0)
	{
		if (size) {
			void *buf;
			buf = malloc(size);
			if (!buf) {
				perror("malloc error");
				exit(-errno);
			}
			memset(buf, 0, size);
			free(buf);
		}
	}
	if (pid < 0) {
		perror("fork error(child)");
		exit(-errno);
	}
	exit(0);
}

void usage(const char *cmd)
{
	fprintf(stderr, "Usage: %s [-c <num child>] [-s <size in bytes>] [-p <cgroup path>]\n", cmd);
	exit(-1);
}

int
main(int argc, char *argv[])
{
	pid_t pid;
	int opt;
	unsigned int numchild = 1;
	size_t alloc_size = 0;
	char path[64];
	int fd;

	while ((opt = getopt(argc, argv, "c:s:p:")) != -1) {
		switch (opt) {
		case 'c':
			numchild = atoi(optarg);
			break;
		case 's':
			alloc_size = atoi(optarg);
			break;
		case 'p':
			snprintf(path, sizeof(path), "%s/tasks", optarg);
			fd = open(path, O_WRONLY);
			if (fd < 0) {
				perror("open error");
				exit(-errno);
			}
			write(fd, "\0", 1);
			close(fd);
			break;
		default:
			usage(basename(argv[0]));
		}
	}

	while (numchild--) {
		pid = fork();
		if (pid < 0) {
			perror("fork error(parent)");
			exit(-errno);
		}
		if (pid == 0)	/* child */
			recursive_fork(alloc_size);
	}

	return 0;
}



^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [patch -mm 4/9 v2] oom: remove compulsory panic_on_oom mode
  2010-02-22  5:31                               ` Daisuke Nishimura
@ 2010-02-22  6:15                                   ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 145+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-02-22  6:15 UTC (permalink / raw)
  To: Daisuke Nishimura
  Cc: David Rientjes, Andrew Morton, Rik van Riel, Nick Piggin,
	Andrea Arcangeli, Balbir Singh, Lubos Lunak, KOSAKI Motohiro,
	linux-kernel, linux-mm

On Mon, 22 Feb 2010 14:31:51 +0900
Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:

> Hi.
> 
> On Wed, 17 Feb 2010 11:34:30 +0900, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > On Tue, 16 Feb 2010 18:28:05 -0800 (PST)
> > David Rientjes <rientjes@google.com> wrote:
> > 
> > > On Wed, 17 Feb 2010, KAMEZAWA Hiroyuki wrote:
> > > 
> > > > > What do you think about making pagefaults use out_of_memory() directly and 
> > > > > respecting the sysctl_panic_on_oom settings?
> > > > > 
> > > > 
> > > > I don't think this patch is good. Because several memcg can
> > > > cause oom at the same time independently, system-wide oom locking is
> > > > unsuitable. BTW, what I doubt is much more fundamental thing.
> > > > 
> > > 
> > > We want to lock all populated zones with ZONE_OOM_LOCKED to avoid 
> > > needlessly killing more than one task regardless of how many memcgs are 
> > > oom.
> > > 
> > Current implementation achieves what memcg wants. Why remove and destroy memcg?
> > 
> It might be a bit off-topic, but memcg's check for last_oom_jiffies seems
> not to work well under heavy load, and pagefault_out_of_memory() causes
> global oom.
> 
> Step.1 make a memory cgroup directory and set memory.limit_in_bytes to a small value
> 
>   > mkdir /cgroup/memory/test
>   > echo 1M >/cgroup/memory/test/memory.limit_in_bytes
> 
> Step.2 run the attached test program (which allocates memory and forks recursively)
> 
>   > ./recursive_fork -c 8 -s `expr 1 \* 1024 \* 1024`
> 
> This causes not only memcg's oom, but also global oom(My machine has 8 CPUS).
> 
> ===
> [348090.121808] recursive_fork3 invoked oom-killer: gfp_mask=0xd0, order=0, oom_adj=0
> [348090.121821] recursive_fork3 cpuset=/ mems_allowed=0
> [348090.121829] Pid: 22744, comm: recursive_fork3 Not tainted 2.6.32.8-00001-gb6cd517 #3
> [348090.121832] Call Trace:
> [348090.121849]  [<ffffffff810d6015>] oom_kill_process+0x86/0x295
> [348090.121855]  [<ffffffff810d64cf>] ? select_bad_process+0x63/0xf0
> [348090.121861]  [<ffffffff810d687a>] mem_cgroup_out_of_memory+0x69/0x87
> [348090.121870]  [<ffffffff811119c2>] __mem_cgroup_try_charge+0x15f/0x1d4
> [348090.121876]  [<ffffffff811126bc>] mem_cgroup_try_charge_swapin+0x104/0x159
> [348090.121885]  [<ffffffff810edd9b>] handle_mm_fault+0x4ca/0x76c
> [348090.121895]  [<ffffffff8143419f>] ? do_page_fault+0x141/0x2da
> [348090.121904]  [<ffffffff81087286>] ? trace_hardirqs_on+0xd/0xf
> [348090.121910]  [<ffffffff8143419f>] ? do_page_fault+0x141/0x2da
> [348090.121915]  [<ffffffff8143431c>] do_page_fault+0x2be/0x2da
> [348090.121922]  [<ffffffff81432115>] page_fault+0x25/0x30
> [348090.121929] Task in /test killed as a result of limit of /test
> [348090.121936] memory: usage 1024kB, limit 1024kB, failcnt 279335
> [348090.121940] memory+swap: usage 4260kB, limit 9007199254740991kB, failcnt 0
> [348090.121943] Mem-Info:
> [348090.121947] Node 0 DMA per-cpu:
> [348090.121952] CPU    0: hi:    0, btch:   1 usd:   0
> [348090.121956] CPU    1: hi:    0, btch:   1 usd:   0
> [348090.121960] CPU    2: hi:    0, btch:   1 usd:   0
> [348090.121963] CPU    3: hi:    0, btch:   1 usd:   0
> [348090.121967] CPU    4: hi:    0, btch:   1 usd:   0
> [348090.121970] CPU    5: hi:    0, btch:   1 usd:   0
> [348090.121974] CPU    6: hi:    0, btch:   1 usd:   0
> [348090.121977] CPU    7: hi:    0, btch:   1 usd:   0
> [348090.121980] Node 0 DMA32 per-cpu:
> [348090.121984] CPU    0: hi:  186, btch:  31 usd:  19
> [348090.121988] CPU    1: hi:  186, btch:  31 usd:  11
> [348090.121992] CPU    2: hi:  186, btch:  31 usd: 178
> [348090.121995] CPU    3: hi:  186, btch:  31 usd:   0
> [348090.121999] CPU    4: hi:  186, btch:  31 usd: 182
> [348090.122002] CPU    5: hi:  186, btch:  31 usd:  29
> [348090.122006] CPU    6: hi:  186, btch:  31 usd:   0
> [348090.122009] CPU    7: hi:  186, btch:  31 usd:   0
> [348090.122012] Node 0 Normal per-cpu:
> [348090.122016] CPU    0: hi:  186, btch:  31 usd:  54
> [348090.122020] CPU    1: hi:  186, btch:  31 usd: 109
> [348090.122023] CPU    2: hi:  186, btch:  31 usd: 149
> [348090.122027] CPU    3: hi:  186, btch:  31 usd: 119
> [348090.122030] CPU    4: hi:  186, btch:  31 usd: 123
> [348090.122033] CPU    5: hi:  186, btch:  31 usd: 145
> [348090.122037] CPU    6: hi:  186, btch:  31 usd:  54
> [348090.122041] CPU    7: hi:  186, btch:  31 usd:  95
> [348090.122049] active_anon:5354 inactive_anon:805 isolated_anon:0
> [348090.122051]  active_file:18317 inactive_file:57785 isolated_file:0
> [348090.122053]  unevictable:0 dirty:0 writeback:211 unstable:0
> [348090.122054]  free:3324478 slab_reclaimable:18860 slab_unreclaimable:13472
> [348090.122056]  mapped:4315 shmem:63 pagetables:1098 bounce:0
> [348090.122059] Node 0 DMA free:15676kB min:12kB low:12kB high:16kB active_anon:0kB inacti
> ve_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(
> file):0kB present:15100kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_re
> claimable:0kB slab_unreclaimable:16kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:
> 0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
> [348090.122076] lowmem_reserve[]: 0 3204 13932 13932
> [348090.122083] Node 0 DMA32 free:2773244kB min:3472kB low:4340kB high:5208kB active_anon:
> 0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB
>  isolated(file):0kB present:3281248kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem
> :0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:
> 0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
> [348090.122100] lowmem_reserve[]: 0 0 10728 10728
> [348090.122108] Node 0 Normal free:10508992kB min:11624kB low:14528kB high:17436kB active_
> anon:21416kB inactive_anon:3220kB active_file:73268kB inactive_file:231140kB unevictable:0
> kB isolated(anon):0kB isolated(file):0kB present:10985984kB mlocked:0kB dirty:0kB writebac
> k:844kB mapped:17260kB shmem:252kB slab_reclaimable:75440kB slab_unreclaimable:53872kB ker
> nel_stack:1224kB pagetables:4392kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned
> :0 all_unreclaimable? no
> [348090.122125] lowmem_reserve[]: 0 0 0 0
> [348090.122788] Node 0 DMA: 1*4kB 1*8kB 3*16kB 2*32kB 3*64kB 2*128kB 1*256kB 1*512kB 2*102
> 4kB 2*2048kB 2*4096kB = 15676kB
> [348090.122853] Node 0 DMA32: 11*4kB 6*8kB 2*16kB 4*32kB 6*64kB 13*128kB 4*256kB 6*512kB 6
> *1024kB 4*2048kB 672*4096kB = 2773244kB
> [348090.122915] Node 0 Normal: 188*4kB 128*8kB 214*16kB 409*32kB 107*64kB 18*128kB 4*256kB
>  1*512kB 2*1024kB 0*2048kB 2558*4096kB = 10508592kB
> [348090.122936] 76936 total pagecache pages
> [348090.122940] 816 pages in swap cache
> [348090.122943] Swap cache stats: add 7851711, delete 7850894, find 3676243/4307445
> [348090.122946] Free swap  = 1995492kB
> [348090.122949] Total swap = 2000888kB
> [348090.300467] 3670016 pages RAM
> [348090.300471] 153596 pages reserved
> [348090.300474] 38486 pages shared
> [348090.300476] 162081 pages non-shared
> [348090.300482] Memory cgroup out of memory: kill process 22072 (recursive_fork3) score 12
> 48 or a child
> [348090.300486] Killed process 22072 (recursive_fork3)
> [348090.300524] Kernel panic - not syncing: out of memory from page fault. panic_on_oom is
>  selected.
> [348090.300526]
> [348090.311038] Pid: 22744, comm: recursive_fork3 Not tainted 2.6.32.8-00001-gb6cd517 #3
> [348090.311050] Call Trace:
> [348090.311073]  [<ffffffff8142efa4>] panic+0x75/0x133
> [348090.311090]  [<ffffffff810d67d2>] pagefault_out_of_memory+0x50/0x8f
> [348090.311104]  [<ffffffff81036a2d>] mm_fault_error+0x37/0xba
> [348090.311117]  [<ffffffff8143428d>] do_page_fault+0x22f/0x2da
> [348090.311130]  [<ffffffff81432115>] page_fault+0x25/0x30
> ===
> 
> I took a kdump by enabling panic_on_oom, and compared last_oom_jiffies and jiffies.
> 
> crash> struct mem_cgroup.last_oom_jiffies 0xffffc90013514000
>   last_oom_jiffies = 4642757419,
> crash> p jiffies
> jiffies = $10 = 4642757607
> 
> I agree this is an extreme example, but this is not a desirable behavior.
> Changing "HZ/10" in mem_cgroup_last_oom_called() to "HZ/2" or so would fix
> this case, but it's not an essential fix.

Yes, the current design is not the best thing, my bad.
(I had to band-aid against an unexpected panic in pagefault_out_of_memory().)

But tweaking that value seems not promising.

An essential fix is better. The best fix is to not call the oom killer in
pagefault_out_of_memory(). So, returning something other than VM_FAULT_OOM is
the best, I think. But hmm...we don't have VM_FAULT_AGAIN etc.,
so please avoid a quick fix.

One thing I can think of is sleep-and-retry in try_charge() if PF_MEMDIE
is not set. (But by this, memcg will never return failure in page fault,
though it may sound reasonable.)


Thanks,
-Kame

> 
> Any thoughts?
> 
> 
> Regards,
> Daisuke Nishimura.
> 


^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [patch -mm 4/9 v2] oom: remove compulsory panic_on_oom mode
@ 2010-02-22  6:15                                   ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 145+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-02-22  6:15 UTC (permalink / raw)
  To: Daisuke Nishimura
  Cc: David Rientjes, Andrew Morton, Rik van Riel, Nick Piggin,
	Andrea Arcangeli, Balbir Singh, Lubos Lunak, KOSAKI Motohiro,
	linux-kernel, linux-mm

On Mon, 22 Feb 2010 14:31:51 +0900
Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:

> Hi.
> 
> On Wed, 17 Feb 2010 11:34:30 +0900, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > On Tue, 16 Feb 2010 18:28:05 -0800 (PST)
> > David Rientjes <rientjes@google.com> wrote:
> > 
> > > On Wed, 17 Feb 2010, KAMEZAWA Hiroyuki wrote:
> > > 
> > > > > What do you think about making pagefaults use out_of_memory() directly and 
> > > > > respecting the sysctl_panic_on_oom settings?
> > > > > 
> > > > 
> > > > I don't think this patch is good. Because several memcg can
> > > > cause oom at the same time independently, system-wide oom locking is
> > > > unsuitable. BTW, what I doubt is much more fundamental thing.
> > > > 
> > > 
> > > We want to lock all populated zones with ZONE_OOM_LOCKED to avoid 
> > > needlessly killing more than one task regardless of how many memcgs are 
> > > oom.
> > > 
> > Current implementation achieves what memcg wants. Why remove and destroy memcg?
> > 
> It might be a bit off-topic, but memcg's check for last_oom_jiffies seems
> not to work well under heavy load, and pagefault_out_of_memory() causes
> global oom.
> 
> Step.1 make a memory cgroup directory and set memory.limit_in_bytes to a small value
> 
>   > mkdir /cgroup/memory/test
>   > echo 1M >/cgroup/memory/test/memory.limit_in_bytes
> 
> Step.2 run the attached test program (which allocates memory and forks recursively)
> 
>   > ./recursive_fork -c 8 -s `expr 1 \* 1024 \* 1024`
> 
> This causes not only memcg's oom, but also global oom(My machine has 8 CPUS).
> 
> ===
> [348090.121808] recursive_fork3 invoked oom-killer: gfp_mask=0xd0, order=0, oom_adj=0
> [348090.121821] recursive_fork3 cpuset=/ mems_allowed=0
> [348090.121829] Pid: 22744, comm: recursive_fork3 Not tainted 2.6.32.8-00001-gb6cd517 #3
> [348090.121832] Call Trace:
> [348090.121849]  [<ffffffff810d6015>] oom_kill_process+0x86/0x295
> [348090.121855]  [<ffffffff810d64cf>] ? select_bad_process+0x63/0xf0
> [348090.121861]  [<ffffffff810d687a>] mem_cgroup_out_of_memory+0x69/0x87
> [348090.121870]  [<ffffffff811119c2>] __mem_cgroup_try_charge+0x15f/0x1d4
> [348090.121876]  [<ffffffff811126bc>] mem_cgroup_try_charge_swapin+0x104/0x159
> [348090.121885]  [<ffffffff810edd9b>] handle_mm_fault+0x4ca/0x76c
> [348090.121895]  [<ffffffff8143419f>] ? do_page_fault+0x141/0x2da
> [348090.121904]  [<ffffffff81087286>] ? trace_hardirqs_on+0xd/0xf
> [348090.121910]  [<ffffffff8143419f>] ? do_page_fault+0x141/0x2da
> [348090.121915]  [<ffffffff8143431c>] do_page_fault+0x2be/0x2da
> [348090.121922]  [<ffffffff81432115>] page_fault+0x25/0x30
> [348090.121929] Task in /test killed as a result of limit of /test
> [348090.121936] memory: usage 1024kB, limit 1024kB, failcnt 279335
> [348090.121940] memory+swap: usage 4260kB, limit 9007199254740991kB, failcnt 0
> [348090.121943] Mem-Info:
> [348090.121947] Node 0 DMA per-cpu:
> [348090.121952] CPU    0: hi:    0, btch:   1 usd:   0
> [348090.121956] CPU    1: hi:    0, btch:   1 usd:   0
> [348090.121960] CPU    2: hi:    0, btch:   1 usd:   0
> [348090.121963] CPU    3: hi:    0, btch:   1 usd:   0
> [348090.121967] CPU    4: hi:    0, btch:   1 usd:   0
> [348090.121970] CPU    5: hi:    0, btch:   1 usd:   0
> [348090.121974] CPU    6: hi:    0, btch:   1 usd:   0
> [348090.121977] CPU    7: hi:    0, btch:   1 usd:   0
> [348090.121980] Node 0 DMA32 per-cpu:
> [348090.121984] CPU    0: hi:  186, btch:  31 usd:  19
> [348090.121988] CPU    1: hi:  186, btch:  31 usd:  11
> [348090.121992] CPU    2: hi:  186, btch:  31 usd: 178
> [348090.121995] CPU    3: hi:  186, btch:  31 usd:   0
> [348090.121999] CPU    4: hi:  186, btch:  31 usd: 182
> [348090.122002] CPU    5: hi:  186, btch:  31 usd:  29
> [348090.122006] CPU    6: hi:  186, btch:  31 usd:   0
> [348090.122009] CPU    7: hi:  186, btch:  31 usd:   0
> [348090.122012] Node 0 Normal per-cpu:
> [348090.122016] CPU    0: hi:  186, btch:  31 usd:  54
> [348090.122020] CPU    1: hi:  186, btch:  31 usd: 109
> [348090.122023] CPU    2: hi:  186, btch:  31 usd: 149
> [348090.122027] CPU    3: hi:  186, btch:  31 usd: 119
> [348090.122030] CPU    4: hi:  186, btch:  31 usd: 123
> [348090.122033] CPU    5: hi:  186, btch:  31 usd: 145
> [348090.122037] CPU    6: hi:  186, btch:  31 usd:  54
> [348090.122041] CPU    7: hi:  186, btch:  31 usd:  95
> [348090.122049] active_anon:5354 inactive_anon:805 isolated_anon:0
> [348090.122051]  active_file:18317 inactive_file:57785 isolated_file:0
> [348090.122053]  unevictable:0 dirty:0 writeback:211 unstable:0
> [348090.122054]  free:3324478 slab_reclaimable:18860 slab_unreclaimable:13472
> [348090.122056]  mapped:4315 shmem:63 pagetables:1098 bounce:0
> [348090.122059] Node 0 DMA free:15676kB min:12kB low:12kB high:16kB active_anon:0kB inacti
> ve_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(
> file):0kB present:15100kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_re
> claimable:0kB slab_unreclaimable:16kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:
> 0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
> [348090.122076] lowmem_reserve[]: 0 3204 13932 13932
> [348090.122083] Node 0 DMA32 free:2773244kB min:3472kB low:4340kB high:5208kB active_anon:
> 0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB
>  isolated(file):0kB present:3281248kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem
> :0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:
> 0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
> [348090.122100] lowmem_reserve[]: 0 0 10728 10728
> [348090.122108] Node 0 Normal free:10508992kB min:11624kB low:14528kB high:17436kB active_
> anon:21416kB inactive_anon:3220kB active_file:73268kB inactive_file:231140kB unevictable:0
> kB isolated(anon):0kB isolated(file):0kB present:10985984kB mlocked:0kB dirty:0kB writebac
> k:844kB mapped:17260kB shmem:252kB slab_reclaimable:75440kB slab_unreclaimable:53872kB ker
> nel_stack:1224kB pagetables:4392kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned
> :0 all_unreclaimable? no
> [348090.122125] lowmem_reserve[]: 0 0 0 0
> [348090.122788] Node 0 DMA: 1*4kB 1*8kB 3*16kB 2*32kB 3*64kB 2*128kB 1*256kB 1*512kB 2*102
> 4kB 2*2048kB 2*4096kB = 15676kB
> [348090.122853] Node 0 DMA32: 11*4kB 6*8kB 2*16kB 4*32kB 6*64kB 13*128kB 4*256kB 6*512kB 6
> *1024kB 4*2048kB 672*4096kB = 2773244kB
> [348090.122915] Node 0 Normal: 188*4kB 128*8kB 214*16kB 409*32kB 107*64kB 18*128kB 4*256kB
>  1*512kB 2*1024kB 0*2048kB 2558*4096kB = 10508592kB
> [348090.122936] 76936 total pagecache pages
> [348090.122940] 816 pages in swap cache
> [348090.122943] Swap cache stats: add 7851711, delete 7850894, find 3676243/4307445
> [348090.122946] Free swap  = 1995492kB
> [348090.122949] Total swap = 2000888kB
> [348090.300467] 3670016 pages RAM
> [348090.300471] 153596 pages reserved
> [348090.300474] 38486 pages shared
> [348090.300476] 162081 pages non-shared
> [348090.300482] Memory cgroup out of memory: kill process 22072 (recursive_fork3) score 12
> 48 or a child
> [348090.300486] Killed process 22072 (recursive_fork3)
> [348090.300524] Kernel panic - not syncing: out of memory from page fault. panic_on_oom is
>  selected.
> [348090.300526]
> [348090.311038] Pid: 22744, comm: recursive_fork3 Not tainted 2.6.32.8-00001-gb6cd517 #3
> [348090.311050] Call Trace:
> [348090.311073]  [<ffffffff8142efa4>] panic+0x75/0x133
> [348090.311090]  [<ffffffff810d67d2>] pagefault_out_of_memory+0x50/0x8f
> [348090.311104]  [<ffffffff81036a2d>] mm_fault_error+0x37/0xba
> [348090.311117]  [<ffffffff8143428d>] do_page_fault+0x22f/0x2da
> [348090.311130]  [<ffffffff81432115>] page_fault+0x25/0x30
> ===
> 
> I took a kdump by enabling panic_on_oom and compared last_oom_jiffies with jiffies.
> 
> crash> struct mem_cgroup.last_oom_jiffies 0xffffc90013514000
>   last_oom_jiffies = 4642757419,
> crash> p jiffies
> jiffies = $10 = 4642757607
> 
> I agree this is an extreme example, but it is not desirable behavior.
> Changing "HZ/10" in mem_cgroup_last_oom_called() to "HZ/2" or so would fix
> this case, but it's not an essential fix.

Yes, the current design is not the best thing; my bad.
(I had to apply a band-aid against an unexpected panic in pagefault_out_of_memory.)

But tweaking that value seems not promising.

An essential fix is better. The best fix is to not call the oom-killer from
pagefault_out_of_memory() at all. So, returning something other than
VM_FAULT_OOM would be best, I think. But hmm... we don't have
VM_FAULT_AGAIN etc., so please avoid a quick fix.

One thing I can think of is sleep-and-retry in try_charge() if PF_MEMDIE
is not set. (By this, memcg would never return failure from a page fault,
but that may sound reasonable.)


Thanks,
-Kame

> 
> Any thoughts?
> 
> 
> Regards,
> Daisuke Nishimura.
> 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [patch -mm 4/9 v2] oom: remove compulsory panic_on_oom mode
  2010-02-22  6:15                                   ` KAMEZAWA Hiroyuki
@ 2010-02-22 11:42                                     ` Daisuke Nishimura
  -1 siblings, 0 replies; 145+ messages in thread
From: Daisuke Nishimura @ 2010-02-22 11:42 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: David Rientjes, Andrew Morton, Rik van Riel, Nick Piggin,
	Andrea Arcangeli, Balbir Singh, Lubos Lunak, KOSAKI Motohiro,
	linux-kernel, linux-mm, Daisuke Nishimura

On Mon, 22 Feb 2010 15:15:13 +0900
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:

> On Mon, 22 Feb 2010 14:31:51 +0900
> Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:
> 
> > Hi.
> > 
> > On Wed, 17 Feb 2010 11:34:30 +0900, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > > On Tue, 16 Feb 2010 18:28:05 -0800 (PST)
> > > David Rientjes <rientjes@google.com> wrote:
> > > 
> > > > On Wed, 17 Feb 2010, KAMEZAWA Hiroyuki wrote:
> > > > 
> > > > > > What do you think about making pagefaults use out_of_memory() directly and 
> > > > > > respecting the sysctl_panic_on_oom settings?
> > > > > > 
> > > > > 
> > > > > I don't think this patch is good. Because several memcg can
> > > > > cause oom at the same time independently, system-wide oom locking is
> > > > > unsuitable. BTW, what I doubt is much more fundamental thing.
> > > > > 
> > > > 
> > > > We want to lock all populated zones with ZONE_OOM_LOCKED to avoid 
> > > > needlessly killing more than one task regardless of how many memcgs are 
> > > > oom.
> > > > 
> > > Current implementation achieves what memcg wants. Why remove and destroy memcg?
> > > 
> > It might be a bit off-topic, but memcg's check for last_oom_jiffies seems
> > not to work well under heavy load, and pagefault_out_of_memory() causes
> > global oom.
> > 
> > Step 1. Make a memory cgroup directory and set memory.limit_in_bytes to a small value
> > 
> >   > mkdir /cgroup/memory/test
> >   > echo 1M >/cgroup/memory/test/memory.limit_in_bytes
> > 
> > Step 2. Run the attached test program (which allocates memory and forks recursively)
> > 
> >   > ./recursive_fork -c 8 -s `expr 1 \* 1024 \* 1024`
> > 
> > This causes not only memcg's oom but also a global oom (my machine has 8 CPUs).
> > 
> > ===
> > [348090.121808] recursive_fork3 invoked oom-killer: gfp_mask=0xd0, order=0, oom_adj=0
> > [348090.121821] recursive_fork3 cpuset=/ mems_allowed=0
> > [348090.121829] Pid: 22744, comm: recursive_fork3 Not tainted 2.6.32.8-00001-gb6cd517 #3
> > [348090.121832] Call Trace:
> > [348090.121849]  [<ffffffff810d6015>] oom_kill_process+0x86/0x295
> > [348090.121855]  [<ffffffff810d64cf>] ? select_bad_process+0x63/0xf0
> > [348090.121861]  [<ffffffff810d687a>] mem_cgroup_out_of_memory+0x69/0x87
> > [348090.121870]  [<ffffffff811119c2>] __mem_cgroup_try_charge+0x15f/0x1d4
> > [348090.121876]  [<ffffffff811126bc>] mem_cgroup_try_charge_swapin+0x104/0x159
> > [348090.121885]  [<ffffffff810edd9b>] handle_mm_fault+0x4ca/0x76c
> > [348090.121895]  [<ffffffff8143419f>] ? do_page_fault+0x141/0x2da
> > [348090.121904]  [<ffffffff81087286>] ? trace_hardirqs_on+0xd/0xf
> > [348090.121910]  [<ffffffff8143419f>] ? do_page_fault+0x141/0x2da
> > [348090.121915]  [<ffffffff8143431c>] do_page_fault+0x2be/0x2da
> > [348090.121922]  [<ffffffff81432115>] page_fault+0x25/0x30
> > [348090.121929] Task in /test killed as a result of limit of /test
> > [348090.121936] memory: usage 1024kB, limit 1024kB, failcnt 279335
> > [348090.121940] memory+swap: usage 4260kB, limit 9007199254740991kB, failcnt 0
> > [348090.121943] Mem-Info:
> > [348090.121947] Node 0 DMA per-cpu:
> > [348090.121952] CPU    0: hi:    0, btch:   1 usd:   0
> > [348090.121956] CPU    1: hi:    0, btch:   1 usd:   0
> > [348090.121960] CPU    2: hi:    0, btch:   1 usd:   0
> > [348090.121963] CPU    3: hi:    0, btch:   1 usd:   0
> > [348090.121967] CPU    4: hi:    0, btch:   1 usd:   0
> > [348090.121970] CPU    5: hi:    0, btch:   1 usd:   0
> > [348090.121974] CPU    6: hi:    0, btch:   1 usd:   0
> > [348090.121977] CPU    7: hi:    0, btch:   1 usd:   0
> > [348090.121980] Node 0 DMA32 per-cpu:
> > [348090.121984] CPU    0: hi:  186, btch:  31 usd:  19
> > [348090.121988] CPU    1: hi:  186, btch:  31 usd:  11
> > [348090.121992] CPU    2: hi:  186, btch:  31 usd: 178
> > [348090.121995] CPU    3: hi:  186, btch:  31 usd:   0
> > [348090.121999] CPU    4: hi:  186, btch:  31 usd: 182
> > [348090.122002] CPU    5: hi:  186, btch:  31 usd:  29
> > [348090.122006] CPU    6: hi:  186, btch:  31 usd:   0
> > [348090.122009] CPU    7: hi:  186, btch:  31 usd:   0
> > [348090.122012] Node 0 Normal per-cpu:
> > [348090.122016] CPU    0: hi:  186, btch:  31 usd:  54
> > [348090.122020] CPU    1: hi:  186, btch:  31 usd: 109
> > [348090.122023] CPU    2: hi:  186, btch:  31 usd: 149
> > [348090.122027] CPU    3: hi:  186, btch:  31 usd: 119
> > [348090.122030] CPU    4: hi:  186, btch:  31 usd: 123
> > [348090.122033] CPU    5: hi:  186, btch:  31 usd: 145
> > [348090.122037] CPU    6: hi:  186, btch:  31 usd:  54
> > [348090.122041] CPU    7: hi:  186, btch:  31 usd:  95
> > [348090.122049] active_anon:5354 inactive_anon:805 isolated_anon:0
> > [348090.122051]  active_file:18317 inactive_file:57785 isolated_file:0
> > [348090.122053]  unevictable:0 dirty:0 writeback:211 unstable:0
> > [348090.122054]  free:3324478 slab_reclaimable:18860 slab_unreclaimable:13472
> > [348090.122056]  mapped:4315 shmem:63 pagetables:1098 bounce:0
> > [348090.122059] Node 0 DMA free:15676kB min:12kB low:12kB high:16kB active_anon:0kB inacti
> > ve_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(
> > file):0kB present:15100kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_re
> > claimable:0kB slab_unreclaimable:16kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:
> > 0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
> > [348090.122076] lowmem_reserve[]: 0 3204 13932 13932
> > [348090.122083] Node 0 DMA32 free:2773244kB min:3472kB low:4340kB high:5208kB active_anon:
> > 0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB
> >  isolated(file):0kB present:3281248kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem
> > :0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:
> > 0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
> > [348090.122100] lowmem_reserve[]: 0 0 10728 10728
> > [348090.122108] Node 0 Normal free:10508992kB min:11624kB low:14528kB high:17436kB active_
> > anon:21416kB inactive_anon:3220kB active_file:73268kB inactive_file:231140kB unevictable:0
> > kB isolated(anon):0kB isolated(file):0kB present:10985984kB mlocked:0kB dirty:0kB writebac
> > k:844kB mapped:17260kB shmem:252kB slab_reclaimable:75440kB slab_unreclaimable:53872kB ker
> > nel_stack:1224kB pagetables:4392kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned
> > :0 all_unreclaimable? no
> > [348090.122125] lowmem_reserve[]: 0 0 0 0
> > [348090.122788] Node 0 DMA: 1*4kB 1*8kB 3*16kB 2*32kB 3*64kB 2*128kB 1*256kB 1*512kB 2*102
> > 4kB 2*2048kB 2*4096kB = 15676kB
> > [348090.122853] Node 0 DMA32: 11*4kB 6*8kB 2*16kB 4*32kB 6*64kB 13*128kB 4*256kB 6*512kB 6
> > *1024kB 4*2048kB 672*4096kB = 2773244kB
> > [348090.122915] Node 0 Normal: 188*4kB 128*8kB 214*16kB 409*32kB 107*64kB 18*128kB 4*256kB
> >  1*512kB 2*1024kB 0*2048kB 2558*4096kB = 10508592kB
> > [348090.122936] 76936 total pagecache pages
> > [348090.122940] 816 pages in swap cache
> > [348090.122943] Swap cache stats: add 7851711, delete 7850894, find 3676243/4307445
> > [348090.122946] Free swap  = 1995492kB
> > [348090.122949] Total swap = 2000888kB
> > [348090.300467] 3670016 pages RAM
> > [348090.300471] 153596 pages reserved
> > [348090.300474] 38486 pages shared
> > [348090.300476] 162081 pages non-shared
> > [348090.300482] Memory cgroup out of memory: kill process 22072 (recursive_fork3) score 12
> > 48 or a child
> > [348090.300486] Killed process 22072 (recursive_fork3)
> > [348090.300524] Kernel panic - not syncing: out of memory from page fault. panic_on_oom is
> >  selected.
> > [348090.300526]
> > [348090.311038] Pid: 22744, comm: recursive_fork3 Not tainted 2.6.32.8-00001-gb6cd517 #3
> > [348090.311050] Call Trace:
> > [348090.311073]  [<ffffffff8142efa4>] panic+0x75/0x133
> > [348090.311090]  [<ffffffff810d67d2>] pagefault_out_of_memory+0x50/0x8f
> > [348090.311104]  [<ffffffff81036a2d>] mm_fault_error+0x37/0xba
> > [348090.311117]  [<ffffffff8143428d>] do_page_fault+0x22f/0x2da
> > [348090.311130]  [<ffffffff81432115>] page_fault+0x25/0x30
> > ===
> > 
> > I took a kdump by enabling panic_on_oom and compared last_oom_jiffies with jiffies.
> > 
> > crash> struct mem_cgroup.last_oom_jiffies 0xffffc90013514000
> >   last_oom_jiffies = 4642757419,
> > crash> p jiffies
> > jiffies = $10 = 4642757607
> > 
> > I agree this is an extreme example, but it is not desirable behavior.
> > Changing "HZ/10" in mem_cgroup_last_oom_called() to "HZ/2" or so would fix
> > this case, but it's not an essential fix.
> 
> Yes, the current design is not the best thing; my bad.
> (I had to apply a band-aid against an unexpected panic in pagefault_out_of_memory.)
> 
> But tweaking that value seems not promising.
> 
> An essential fix is better. The best fix is to not call the oom-killer from
> pagefault_out_of_memory() at all. So, returning something other than
> VM_FAULT_OOM would be best, I think. But hmm... we don't have
> VM_FAULT_AGAIN etc., so please avoid a quick fix.
> 
> One thing I can think of is sleep-and-retry in try_charge() if PF_MEMDIE
> is not set. (By this, memcg would never return failure from a page fault,
> but that may sound reasonable.)
> 
Hmm, I can agree with you. But I think we need some trick in
mem_cgroup_oom_called() to distinguish a normal VM_FAULT_OOM from memcg's
VM_FAULT_OOM (where current itself was killed by memcg's oom and so exited
the retry loop), to keep the system from panicking when panic_on_oom is
enabled. (Mark the task that is being killed by memcg's oom?)


Thanks,
Daisuke Nishimura.

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [patch -mm 4/9 v2] oom: remove compulsory panic_on_oom mode
  2010-02-22  6:15                                   ` KAMEZAWA Hiroyuki
@ 2010-02-22 20:55                                     ` David Rientjes
  -1 siblings, 0 replies; 145+ messages in thread
From: David Rientjes @ 2010-02-22 20:55 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Daisuke Nishimura, Andrew Morton, Rik van Riel, Nick Piggin,
	Andrea Arcangeli, Balbir Singh, Lubos Lunak, KOSAKI Motohiro,
	linux-kernel, linux-mm

On Mon, 22 Feb 2010, KAMEZAWA Hiroyuki wrote:

> An essential fix is better. The best fix is to not call the oom-killer from
> pagefault_out_of_memory() at all. So, returning something other than
> VM_FAULT_OOM would be best, I think. But hmm... we don't have
> VM_FAULT_AGAIN etc., so please avoid a quick fix.
> 

The last patch in my oom killer series defaults pagefault_out_of_memory()
to always kill current first, if it's killable.  If that is unsuccessful, we
fall back to scanning the entire tasklist.

For tasks that are constrained by a memcg, we could probably use 
mem_cgroup_from_task(current) and if it's non-NULL and non-root, call 
mem_cgroup_out_of_memory() with a gfp_mask of 0.  That would at least 
penalize the same memcg instead of invoking a global oom and would try the 
additional logic that you plan on adding to avoid killing any task at all 
in such conditions.
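The dispatch being proposed can be illustrated with a minimal sketch. The type and helper below are stand-ins, not the real mem_cgroup_from_task()/mem_cgroup_out_of_memory() interfaces: the faulting task's memcg, when present and not the root, is penalized instead of the whole system.

```c
#include <stddef.h>

/* Stand-in for struct mem_cgroup; only the root check matters here. */
struct toy_mem_cgroup {
	int is_root;
};

enum toy_oom_scope {
	TOY_OOM_GLOBAL,	/* fall back to scanning the whole tasklist */
	TOY_OOM_MEMCG	/* analogous to mem_cgroup_out_of_memory(mc, 0) */
};

/* Decide which OOM path a page-fault OOM should take, given the memcg
 * that mem_cgroup_from_task(current) would return for the faulting task. */
static enum toy_oom_scope toy_pagefault_oom_scope(struct toy_mem_cgroup *mc)
{
	if (mc != NULL && !mc->is_root)
		return TOY_OOM_MEMCG;	/* penalize the task's own memcg */
	return TOY_OOM_GLOBAL;		/* unconstrained task: global oom */
}
```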

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [patch -mm 4/9 v2] oom: remove compulsory panic_on_oom mode
  2010-02-22 11:42                                     ` Daisuke Nishimura
@ 2010-02-22 20:59                                       ` David Rientjes
  -1 siblings, 0 replies; 145+ messages in thread
From: David Rientjes @ 2010-02-22 20:59 UTC (permalink / raw)
  To: Daisuke Nishimura
  Cc: KAMEZAWA Hiroyuki, Andrew Morton, Rik van Riel, Nick Piggin,
	Andrea Arcangeli, Balbir Singh, Lubos Lunak, KOSAKI Motohiro,
	linux-kernel, linux-mm

On Mon, 22 Feb 2010, Daisuke Nishimura wrote:

> Hmm, I can agree with you. But I think we need some trick in
> mem_cgroup_oom_called() to distinguish a normal VM_FAULT_OOM from memcg's
> VM_FAULT_OOM (where current itself was killed by memcg's oom and so exited
> the retry loop), to keep the system from panicking when panic_on_oom is
> enabled. (Mark the task that is being killed by memcg's oom?)
> 

pagefault_out_of_memory() should use mem_cgroup_from_task(current) and 
then call mem_cgroup_out_of_memory() when it's non-NULL.  
select_bad_process() will return ERR_PTR(-1UL) if there is an already oom 
killed task attached to the memcg, so we can use that to avoid the 
panic_on_oom panic.  Setting that sysctl doesn't imply that we can't scan 
the tasklist; it simply means we can't kill anything as a result of an 
oom.

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [patch -mm 4/9 v2] oom: remove compulsory panic_on_oom mode
  2010-02-22 11:42                                     ` Daisuke Nishimura
@ 2010-02-22 23:51                                       ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 145+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-02-22 23:51 UTC (permalink / raw)
  To: nishimura
  Cc: Daisuke Nishimura, David Rientjes, Andrew Morton, Rik van Riel,
	Nick Piggin, Andrea Arcangeli, Balbir Singh, Lubos Lunak,
	KOSAKI Motohiro, linux-kernel, linux-mm

On Mon, 22 Feb 2010 20:42:37 +0900
Daisuke Nishimura <d-nishimura@mtf.biglobe.ne.jp> wrote:

> On Mon, 22 Feb 2010 15:15:13 +0900
> KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> 
> > On Mon, 22 Feb 2010 14:31:51 +0900
> > Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:
> > 
> > > Hi.
> > > 
> > > On Wed, 17 Feb 2010 11:34:30 +0900, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > > > On Tue, 16 Feb 2010 18:28:05 -0800 (PST)
> > > > David Rientjes <rientjes@google.com> wrote:
> > > > 
> > > > > On Wed, 17 Feb 2010, KAMEZAWA Hiroyuki wrote:
> > > > > 
> > > > > > > What do you think about making pagefaults use out_of_memory() directly and 
> > > > > > > respecting the sysctl_panic_on_oom settings?
> > > > > > > 
> > > > > > 
> > > > > > I don't think this patch is good, because several memcgs can
> > > > > > cause oom at the same time independently; system-wide oom locking is
> > > > > > unsuitable.  BTW, what I doubt is a much more fundamental thing.
> > > > > > 
> > > > > 
> > > > > We want to lock all populated zones with ZONE_OOM_LOCKED to avoid 
> > > > > needlessly killing more than one task regardless of how many memcgs are 
> > > > > oom.
> > > > > 
> > > > The current implementation achieves what memcg wants.  Why remove and destroy memcg?
> > > > 
> > > It might be a bit off-topic, but memcg's check of last_oom_jiffies seems
> > > not to work well under heavy load, and pagefault_out_of_memory() causes a
> > > global oom.
> > > 
> > > Step.1: make a memory cgroup directory and set memory.limit_in_bytes to a small value
> > > 
> > >   > mkdir /cgroup/memory/test
> > >   > echo 1M >/cgroup/memory/test/memory.limit_in_bytes
> > > 
> > > Step.2: run the attached test program (which allocates memory and forks recursively)
> > > 
> > >   > ./recursive_fork -c 8 -s `expr 1 \* 1024 \* 1024`
> > > 
> > > This causes not only memcg's oom but also a global oom (my machine has 8 CPUs).
> > > 
> > > ===
> > > [348090.121808] recursive_fork3 invoked oom-killer: gfp_mask=0xd0, order=0, oom_adj=0
> > > [348090.121821] recursive_fork3 cpuset=/ mems_allowed=0
> > > [348090.121829] Pid: 22744, comm: recursive_fork3 Not tainted 2.6.32.8-00001-gb6cd517 #3
> > > [348090.121832] Call Trace:
> > > [348090.121849]  [<ffffffff810d6015>] oom_kill_process+0x86/0x295
> > > [348090.121855]  [<ffffffff810d64cf>] ? select_bad_process+0x63/0xf0
> > > [348090.121861]  [<ffffffff810d687a>] mem_cgroup_out_of_memory+0x69/0x87
> > > [348090.121870]  [<ffffffff811119c2>] __mem_cgroup_try_charge+0x15f/0x1d4
> > > [348090.121876]  [<ffffffff811126bc>] mem_cgroup_try_charge_swapin+0x104/0x159
> > > [348090.121885]  [<ffffffff810edd9b>] handle_mm_fault+0x4ca/0x76c
> > > [348090.121895]  [<ffffffff8143419f>] ? do_page_fault+0x141/0x2da
> > > [348090.121904]  [<ffffffff81087286>] ? trace_hardirqs_on+0xd/0xf
> > > [348090.121910]  [<ffffffff8143419f>] ? do_page_fault+0x141/0x2da
> > > [348090.121915]  [<ffffffff8143431c>] do_page_fault+0x2be/0x2da
> > > [348090.121922]  [<ffffffff81432115>] page_fault+0x25/0x30
> > > [348090.121929] Task in /test killed as a result of limit of /test
> > > [348090.121936] memory: usage 1024kB, limit 1024kB, failcnt 279335
> > > [348090.121940] memory+swap: usage 4260kB, limit 9007199254740991kB, failcnt 0
> > > [348090.121943] Mem-Info:
> > > [348090.121947] Node 0 DMA per-cpu:
> > > [348090.121952] CPU    0: hi:    0, btch:   1 usd:   0
> > > [348090.121956] CPU    1: hi:    0, btch:   1 usd:   0
> > > [348090.121960] CPU    2: hi:    0, btch:   1 usd:   0
> > > [348090.121963] CPU    3: hi:    0, btch:   1 usd:   0
> > > [348090.121967] CPU    4: hi:    0, btch:   1 usd:   0
> > > [348090.121970] CPU    5: hi:    0, btch:   1 usd:   0
> > > [348090.121974] CPU    6: hi:    0, btch:   1 usd:   0
> > > [348090.121977] CPU    7: hi:    0, btch:   1 usd:   0
> > > [348090.121980] Node 0 DMA32 per-cpu:
> > > [348090.121984] CPU    0: hi:  186, btch:  31 usd:  19
> > > [348090.121988] CPU    1: hi:  186, btch:  31 usd:  11
> > > [348090.121992] CPU    2: hi:  186, btch:  31 usd: 178
> > > [348090.121995] CPU    3: hi:  186, btch:  31 usd:   0
> > > [348090.121999] CPU    4: hi:  186, btch:  31 usd: 182
> > > [348090.122002] CPU    5: hi:  186, btch:  31 usd:  29
> > > [348090.122006] CPU    6: hi:  186, btch:  31 usd:   0
> > > [348090.122009] CPU    7: hi:  186, btch:  31 usd:   0
> > > [348090.122012] Node 0 Normal per-cpu:
> > > [348090.122016] CPU    0: hi:  186, btch:  31 usd:  54
> > > [348090.122020] CPU    1: hi:  186, btch:  31 usd: 109
> > > [348090.122023] CPU    2: hi:  186, btch:  31 usd: 149
> > > [348090.122027] CPU    3: hi:  186, btch:  31 usd: 119
> > > [348090.122030] CPU    4: hi:  186, btch:  31 usd: 123
> > > [348090.122033] CPU    5: hi:  186, btch:  31 usd: 145
> > > [348090.122037] CPU    6: hi:  186, btch:  31 usd:  54
> > > [348090.122041] CPU    7: hi:  186, btch:  31 usd:  95
> > > [348090.122049] active_anon:5354 inactive_anon:805 isolated_anon:0
> > > [348090.122051]  active_file:18317 inactive_file:57785 isolated_file:0
> > > [348090.122053]  unevictable:0 dirty:0 writeback:211 unstable:0
> > > [348090.122054]  free:3324478 slab_reclaimable:18860 slab_unreclaimable:13472
> > > [348090.122056]  mapped:4315 shmem:63 pagetables:1098 bounce:0
> > > [348090.122059] Node 0 DMA free:15676kB min:12kB low:12kB high:16kB active_anon:0kB inacti
> > > ve_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(
> > > file):0kB present:15100kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_re
> > > claimable:0kB slab_unreclaimable:16kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:
> > > 0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
> > > [348090.122076] lowmem_reserve[]: 0 3204 13932 13932
> > > [348090.122083] Node 0 DMA32 free:2773244kB min:3472kB low:4340kB high:5208kB active_anon:
> > > 0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB
> > >  isolated(file):0kB present:3281248kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem
> > > :0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:
> > > 0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
> > > [348090.122100] lowmem_reserve[]: 0 0 10728 10728
> > > [348090.122108] Node 0 Normal free:10508992kB min:11624kB low:14528kB high:17436kB active_
> > > anon:21416kB inactive_anon:3220kB active_file:73268kB inactive_file:231140kB unevictable:0
> > > kB isolated(anon):0kB isolated(file):0kB present:10985984kB mlocked:0kB dirty:0kB writebac
> > > k:844kB mapped:17260kB shmem:252kB slab_reclaimable:75440kB slab_unreclaimable:53872kB ker
> > > nel_stack:1224kB pagetables:4392kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned
> > > :0 all_unreclaimable? no
> > > [348090.122125] lowmem_reserve[]: 0 0 0 0
> > > [348090.122788] Node 0 DMA: 1*4kB 1*8kB 3*16kB 2*32kB 3*64kB 2*128kB 1*256kB 1*512kB 2*102
> > > 4kB 2*2048kB 2*4096kB = 15676kB
> > > [348090.122853] Node 0 DMA32: 11*4kB 6*8kB 2*16kB 4*32kB 6*64kB 13*128kB 4*256kB 6*512kB 6
> > > *1024kB 4*2048kB 672*4096kB = 2773244kB
> > > [348090.122915] Node 0 Normal: 188*4kB 128*8kB 214*16kB 409*32kB 107*64kB 18*128kB 4*256kB
> > >  1*512kB 2*1024kB 0*2048kB 2558*4096kB = 10508592kB
> > > [348090.122936] 76936 total pagecache pages
> > > [348090.122940] 816 pages in swap cache
> > > [348090.122943] Swap cache stats: add 7851711, delete 7850894, find 3676243/4307445
> > > [348090.122946] Free swap  = 1995492kB
> > > [348090.122949] Total swap = 2000888kB
> > > [348090.300467] 3670016 pages RAM
> > > [348090.300471] 153596 pages reserved
> > > [348090.300474] 38486 pages shared
> > > [348090.300476] 162081 pages non-shared
> > > [348090.300482] Memory cgroup out of memory: kill process 22072 (recursive_fork3) score 12
> > > 48 or a child
> > > [348090.300486] Killed process 22072 (recursive_fork3)
> > > [348090.300524] Kernel panic - not syncing: out of memory from page fault. panic_on_oom is
> > >  selected.
> > > [348090.300526]
> > > [348090.311038] Pid: 22744, comm: recursive_fork3 Not tainted 2.6.32.8-00001-gb6cd517 #3
> > > [348090.311050] Call Trace:
> > > [348090.311073]  [<ffffffff8142efa4>] panic+0x75/0x133
> > > [348090.311090]  [<ffffffff810d67d2>] pagefault_out_of_memory+0x50/0x8f
> > > [348090.311104]  [<ffffffff81036a2d>] mm_fault_error+0x37/0xba
> > > [348090.311117]  [<ffffffff8143428d>] do_page_fault+0x22f/0x2da
> > > [348090.311130]  [<ffffffff81432115>] page_fault+0x25/0x30
> > > ===
> > > 
> > > I took a kdump by enabling panic_on_oom and compared last_oom_jiffies with jiffies.
> > > 
> > > crash> struct mem_cgroup.last_oom_jiffies 0xffffc90013514000
> > >   last_oom_jiffies = 4642757419,
> > > crash> p jiffies
> > > jiffies = $10 = 4642757607
> > > 
> > > I agree this is an extreme example, but it is not desirable behavior.
> > > Changing "HZ/10" in mem_cgroup_last_oom_called() to "HZ/2" or so would fix
> > > this case, but it's not an essential fix.
> > 
> > Yes, the current design is not the best thing, my bad.
> > (I had to band-aid against an unexpected panic in pagefault_out_of_memory.)
> > 
> > But tweaking that value seems not promising.
> > 
> > An essential fix is better.  The best fix is to not call the oom killer in
> > pagefault_out_of_memory() at all, so returning something other than
> > VM_FAULT_OOM would be best, I think.  But hmm... we don't have
> > VM_FAULT_AGAIN or the like, so please avoid a quick fix.
> > 
> > One thing I can think of is sleep-and-retry in try_charge() if PF_MEMDIE
> > is not set.  (By this, memcg would never return failure in a page fault,
> > but that may sound reasonable.)
> > 
> hmm, I can agree with you.  But I think we need some trick to distinguish a
> normal VM_FAULT_OOM from memcg's VM_FAULT_OOM (current itself was killed by
> memcg's oom, so it exited the retry) at mem_cgroup_oom_called(), to prevent
> the system from panicking when panic_on_oom is enabled.
> (Mark the task which is being killed by memcg's oom?)
> 

I'll prepare a patch today in another thread.

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 145+ messages in thread


* Re: [patch -mm 3/9 v2] oom: select task from tasklist for mempolicy ooms
  2010-02-15 22:20   ` David Rientjes
@ 2010-02-23  6:31     ` Balbir Singh
  -1 siblings, 0 replies; 145+ messages in thread
From: Balbir Singh @ 2010-02-23  6:31 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Rik van Riel, KAMEZAWA Hiroyuki, Nick Piggin,
	Andrea Arcangeli, Lubos Lunak, KOSAKI Motohiro, linux-kernel,
	linux-mm

* David Rientjes <rientjes@google.com> [2010-02-15 14:20:06]:

> The oom killer presently kills current whenever there is no more memory
> free or reclaimable on its mempolicy's nodes.  There is no guarantee that
> current is a memory-hogging task or that killing it will free any
> substantial amount of memory, however.
> 
> In such situations, it is better to scan the tasklist for tasks that are
> allowed to allocate on current's set of nodes and kill the one with the
> highest badness() score.  This ensures that the most memory-hogging task,
> or the one configured by the user with /proc/pid/oom_adj, is always
> selected in such scenarios.
> 
> Signed-off-by: David Rientjes <rientjes@google.com>

Seems reasonable, but I think it will require lots of testing.
-- 
	Three Cheers,
	Balbir

^ permalink raw reply	[flat|nested] 145+ messages in thread


* Re: [patch -mm 3/9 v2] oom: select task from tasklist for mempolicy ooms
  2010-02-23  6:31     ` Balbir Singh
@ 2010-02-23  8:17       ` David Rientjes
  -1 siblings, 0 replies; 145+ messages in thread
From: David Rientjes @ 2010-02-23  8:17 UTC (permalink / raw)
  To: Balbir Singh
  Cc: Andrew Morton, Rik van Riel, KAMEZAWA Hiroyuki, Nick Piggin,
	Andrea Arcangeli, Lubos Lunak, KOSAKI Motohiro, linux-kernel,
	linux-mm

On Tue, 23 Feb 2010, Balbir Singh wrote:

> > The oom killer presently kills current whenever there is no more memory
> > free or reclaimable on its mempolicy's nodes.  There is no guarantee that
> > current is a memory-hogging task or that killing it will free any
> > substantial amount of memory, however.
> > 
> > In such situations, it is better to scan the tasklist for tasks that are
> > allowed to allocate on current's set of nodes and kill the one with the
> > highest badness() score.  This ensures that the most memory-hogging task,
> > or the one configured by the user with /proc/pid/oom_adj, is always
> > selected in such scenarios.
> > 
> > Signed-off-by: David Rientjes <rientjes@google.com>
> 
> Seems reasonable, but I think it will require lots of testing.

I already tested it by checking that tasks with very elevated oom_adj 
values don't get killed when they do not share the same MPOL_BIND nodes as 
a memory-hogging task.

What additional testing did you have in mind?

^ permalink raw reply	[flat|nested] 145+ messages in thread


* Re: [patch -mm 8/9 v2] oom: avoid oom killer for lowmem allocations
  2010-02-17  0:21                       ` David Rientjes
@ 2010-02-23 11:24                         ` Balbir Singh
  -1 siblings, 0 replies; 145+ messages in thread
From: Balbir Singh @ 2010-02-23 11:24 UTC (permalink / raw)
  To: David Rientjes
  Cc: KAMEZAWA Hiroyuki, Nick Piggin, Andrew Morton, Rik van Riel,
	Andrea Arcangeli, Lubos Lunak, KOSAKI Motohiro, linux-kernel,
	linux-mm

* David Rientjes <rientjes@google.com> [2010-02-16 16:21:11]:

> On Wed, 17 Feb 2010, KAMEZAWA Hiroyuki wrote:
> 
> > > On Wed, 17 Feb 2010, KAMEZAWA Hiroyuki wrote:
> > > 
> > > > > > > I'll add this check to __alloc_pages_may_oom() for the !(gfp_mask & 
> > > > > > > __GFP_NOFAIL) path since we're all content with endlessly looping.
> > > > > > 
> > > > > > Thanks. Yes endlessly looping is far preferable to randomly oopsing
> > > > > > or corrupting memory.
> > > > > > 
> > > > > 
> > > > > Here's the new patch for your consideration.
> > > > > 
> > > > 
> > > > Then, can we take a kdump in this endlessly looping situation?
> > > > 
> > > > panic_on_oom=always + kdump can do that. 
> > > > 
> > > 
> > > The endless loop is only helpful if something is going to free memory 
> > > external to the current page allocation: either another task with 
> > > __GFP_WAIT | __GFP_FS that invokes the oom killer, a task that frees 
> > > memory, or a task that exits.
> > > 
> > > The most notable endless loop in the page allocator is the one when a task 
> > > has been oom killed, gets access to memory reserves, and then cannot find 
> > > a page for a __GFP_NOFAIL allocation:
> > > 
> > > 	do {
> > > 		page = get_page_from_freelist(gfp_mask, nodemask, order,
> > > 			zonelist, high_zoneidx, ALLOC_NO_WATERMARKS,
> > > 			preferred_zone, migratetype);
> > > 
> > > 		if (!page && gfp_mask & __GFP_NOFAIL)
> > > 			congestion_wait(BLK_RW_ASYNC, HZ/50);
> > > 	} while (!page && (gfp_mask & __GFP_NOFAIL));
> > > 
> > > We don't expect any such allocations to happen during the exit path, but 
> > > we could probably find some in the fs layer.
> > > 
> > > I don't want to check sysctl_panic_on_oom in the page allocator because it 
> > > would start panicking the machine unnecessarily for the integrity 
> > > metadata GFP_NOIO | __GFP_NOFAIL allocation, for any 
> > > order > PAGE_ALLOC_COSTLY_ORDER, or for users who can't lock the zonelist 
> > > for oom kill that wouldn't have panicked before.
> > > 
> > 
> > Then, why don't you check high_zoneidx in oom_kill.c?
> > 
> 
out_of_memory() doesn't return a value to specify whether the page 
allocator should retry the allocation or just return NULL; all of that 
policy is kept in mm/page_alloc.c.  For high_zoneidx < ZONE_NORMAL, we 
want to fail the allocation when !(gfp_mask & __GFP_NOFAIL) and call the 
oom killer when it is __GFP_NOFAIL.
> ---
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1696,6 +1696,9 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
>  		/* The OOM killer will not help higher order allocs */
>  		if (order > PAGE_ALLOC_COSTLY_ORDER)
>  			goto out;
> +		/* The OOM killer does not needlessly kill tasks for lowmem */
> +		if (high_zoneidx < ZONE_NORMAL)
> +			goto out;

I am not sure if this is a good idea, ZONE_DMA could have a lot of
memory on some architectures. IIUC, we return NULL for allocations
from ZONE_DMA? What is the reason for the heuristic?

>  		/*
>  		 * GFP_THISNODE contains __GFP_NORETRY and we never hit this.
>  		 * Sanity check for bare calls of __GFP_THISNODE, not real OOM.
> @@ -1924,15 +1927,23 @@ rebalance:
>  			if (page)
>  				goto got_pg;
> 
> -			/*
> -			 * The OOM killer does not trigger for high-order
> -			 * ~__GFP_NOFAIL allocations so if no progress is being
> -			 * made, there are no other options and retrying is
> -			 * unlikely to help.
> -			 */
> -			if (order > PAGE_ALLOC_COSTLY_ORDER &&
> -						!(gfp_mask & __GFP_NOFAIL))
> -				goto nopage;
> +			if (!(gfp_mask & __GFP_NOFAIL)) {
> +				/*
> +				 * The oom killer is not called for high-order
> +				 * allocations that may fail, so if no progress
> +				 * is being made, there are no other options and
> +				 * retrying is unlikely to help.
> +				 */
> +				if (order > PAGE_ALLOC_COSTLY_ORDER)
> +					goto nopage;
> +				/*
> +				 * The oom killer is not called for lowmem
> +				 * allocations to prevent needlessly killing
> +				 * innocent tasks.
> +				 */
> +				if (high_zoneidx < ZONE_NORMAL)
> +					goto nopage;
> +			}
> 
>  			goto restart;
>  		}

-- 
	Three Cheers,
	Balbir

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [patch -mm 8/9 v2] oom: avoid oom killer for lowmem allocations
@ 2010-02-23 11:24                         ` Balbir Singh
  0 siblings, 0 replies; 145+ messages in thread
From: Balbir Singh @ 2010-02-23 11:24 UTC (permalink / raw)
  To: David Rientjes
  Cc: KAMEZAWA Hiroyuki, Nick Piggin, Andrew Morton, Rik van Riel,
	Andrea Arcangeli, Lubos Lunak, KOSAKI Motohiro, linux-kernel,
	linux-mm

* David Rientjes <rientjes@google.com> [2010-02-16 16:21:11]:

> On Wed, 17 Feb 2010, KAMEZAWA Hiroyuki wrote:
> 
> > > On Wed, 17 Feb 2010, KAMEZAWA Hiroyuki wrote:
> > > 
> > > > > > > I'll add this check to __alloc_pages_may_oom() for the !(gfp_mask & 
> > > > > > > __GFP_NOFAIL) path since we're all content with endlessly looping.
> > > > > > 
> > > > > > Thanks. Yes endlessly looping is far preferable to randomly oopsing
> > > > > > or corrupting memory.
> > > > > > 
> > > > > 
> > > > > Here's the new patch for your consideration.
> > > > > 
> > > > 
> > > > Then, can we take kdump in this endlessly looping situation?
> > > > 
> > > > panic_on_oom=always + kdump can do that. 
> > > > 
> > > 
> > > The endless loop is only helpful if something is going to free memory 
> > > external to the current page allocation: either another task with 
> > > __GFP_WAIT | __GFP_FS that invokes the oom killer, a task that frees 
> > > memory, or a task that exits.
> > > 
> > > The most notable endless loop in the page allocator is the one when a task 
> > > has been oom killed, gets access to memory reserves, and then cannot find 
> > > a page for a __GFP_NOFAIL allocation:
> > > 
> > > 	do {
> > > 		page = get_page_from_freelist(gfp_mask, nodemask, order,
> > > 			zonelist, high_zoneidx, ALLOC_NO_WATERMARKS,
> > > 			preferred_zone, migratetype);
> > > 
> > > 		if (!page && gfp_mask & __GFP_NOFAIL)
> > > 			congestion_wait(BLK_RW_ASYNC, HZ/50);
> > > 	} while (!page && (gfp_mask & __GFP_NOFAIL));
> > > 
> > > We don't expect any such allocations to happen during the exit path, but 
> > > we could probably find some in the fs layer.
> > > 
> > > I don't want to check sysctl_panic_on_oom in the page allocator because it 
> > > would start panicking the machine unnecessarily for the integrity 
> > > metadata GFP_NOIO | __GFP_NOFAIL allocation, for any 
> > > order > PAGE_ALLOC_COSTLY_ORDER, or for users who can't lock the zonelist 
> > > for oom kill that wouldn't have panicked before.
> > > 
> > 
> > Then, why don't you check high_zoneidx in oom_kill.c?
> > 
> 
> out_of_memory() doesn't return a value to specify whether the page 
> allocator should retry the allocation or just return NULL; all that policy 
> is kept in mm/page_alloc.c.  For high_zoneidx < ZONE_NORMAL, we want to 
> fail the allocation when !(gfp_mask & __GFP_NOFAIL) and call the oom 
> killer when it's __GFP_NOFAIL.
> ---
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1696,6 +1696,9 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
>  		/* The OOM killer will not help higher order allocs */
>  		if (order > PAGE_ALLOC_COSTLY_ORDER)
>  			goto out;
> +		/* The OOM killer does not needlessly kill tasks for lowmem */
> +		if (high_zoneidx < ZONE_NORMAL)
> +			goto out;

I am not sure this is a good idea; ZONE_DMA can hold a lot of memory on
some architectures.  IIUC, we would then return NULL for allocations
from ZONE_DMA?  What is the reason for this heuristic?

>  		/*
>  		 * GFP_THISNODE contains __GFP_NORETRY and we never hit this.
>  		 * Sanity check for bare calls of __GFP_THISNODE, not real OOM.
> @@ -1924,15 +1927,23 @@ rebalance:
>  			if (page)
>  				goto got_pg;
> 
> -			/*
> -			 * The OOM killer does not trigger for high-order
> -			 * ~__GFP_NOFAIL allocations so if no progress is being
> -			 * made, there are no other options and retrying is
> -			 * unlikely to help.
> -			 */
> -			if (order > PAGE_ALLOC_COSTLY_ORDER &&
> -						!(gfp_mask & __GFP_NOFAIL))
> -				goto nopage;
> +			if (!(gfp_mask & __GFP_NOFAIL)) {
> +				/*
> +				 * The oom killer is not called for high-order
> +				 * allocations that may fail, so if no progress
> +				 * is being made, there are no other options and
> +				 * retrying is unlikely to help.
> +				 */
> +				if (order > PAGE_ALLOC_COSTLY_ORDER)
> +					goto nopage;
> +				/*
> +				 * The oom killer is not called for lowmem
> +				 * allocations to prevent needlessly killing
> +				 * innocent tasks.
> +				 */
> +				if (high_zoneidx < ZONE_NORMAL)
> +					goto nopage;
> +			}
> 
>  			goto restart;
>  		}

-- 
	Three Cheers,
	Balbir


^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [patch -mm 8/9 v2] oom: avoid oom killer for lowmem allocations
  2010-02-23 11:24                         ` Balbir Singh
@ 2010-02-23 21:12                           ` David Rientjes
  -1 siblings, 0 replies; 145+ messages in thread
From: David Rientjes @ 2010-02-23 21:12 UTC (permalink / raw)
  To: Balbir Singh
  Cc: KAMEZAWA Hiroyuki, Nick Piggin, Andrew Morton, Rik van Riel,
	Andrea Arcangeli, Lubos Lunak, KOSAKI Motohiro, linux-kernel,
	linux-mm

On Tue, 23 Feb 2010, Balbir Singh wrote:

> > out_of_memory() doesn't return a value to specify whether the page 
> > allocator should retry the allocation or just return NULL; all that policy 
> > is kept in mm/page_alloc.c.  For high_zoneidx < ZONE_NORMAL, we want to 
> > fail the allocation when !(gfp_mask & __GFP_NOFAIL) and call the oom 
> > killer when it's __GFP_NOFAIL.
> > ---
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -1696,6 +1696,9 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
> >  		/* The OOM killer will not help higher order allocs */
> >  		if (order > PAGE_ALLOC_COSTLY_ORDER)
> >  			goto out;
> > +		/* The OOM killer does not needlessly kill tasks for lowmem */
> > +		if (high_zoneidx < ZONE_NORMAL)
> > +			goto out;
> 
> I am not sure if this is a good idea, ZONE_DMA could have a lot of
> memory on some architectures. IIUC, we return NULL for allocations
> from ZONE_DMA? What is the reason for the heuristic?
> 

As the patch description says, we would otherwise needlessly kill tasks
that may not be consuming any lowmem, since there is no way to determine
a task's lowmem usage.  Typically, memory in lowmem will be reclaimable
(or migratable via memory compaction) unless it is pinned for I/O, in
which case we shouldn't kill for it anyway at this point.

^ permalink raw reply	[flat|nested] 145+ messages in thread

end of thread, other threads:[~2010-02-23 21:12 UTC | newest]

Thread overview: 145+ messages
-- links below jump to the message on this page --
2010-02-15 22:19 [patch -mm 0/9 v2] oom killer rewrite David Rientjes
2010-02-15 22:19 ` David Rientjes
2010-02-15 22:20 ` [patch -mm 1/9 v2] oom: filter tasks not sharing the same cpuset David Rientjes
2010-02-15 22:20   ` David Rientjes
2010-02-16  6:14   ` Nick Piggin
2010-02-16  6:14     ` Nick Piggin
2010-02-15 22:20 ` [patch -mm 2/9 v2] oom: sacrifice child with highest badness score for parent David Rientjes
2010-02-15 22:20   ` David Rientjes
2010-02-16  6:15   ` Nick Piggin
2010-02-16  6:15     ` Nick Piggin
2010-02-15 22:20 ` [patch -mm 3/9 v2] oom: select task from tasklist for mempolicy ooms David Rientjes
2010-02-15 22:20   ` David Rientjes
2010-02-23  6:31   ` Balbir Singh
2010-02-23  6:31     ` Balbir Singh
2010-02-23  8:17     ` David Rientjes
2010-02-23  8:17       ` David Rientjes
2010-02-15 22:20 ` [patch -mm 4/9 v2] oom: remove compulsory panic_on_oom mode David Rientjes
2010-02-15 22:20   ` David Rientjes
2010-02-16  0:00   ` KAMEZAWA Hiroyuki
2010-02-16  0:00     ` KAMEZAWA Hiroyuki
2010-02-16  0:14     ` David Rientjes
2010-02-16  0:14       ` David Rientjes
2010-02-16  0:23       ` KAMEZAWA Hiroyuki
2010-02-16  0:23         ` KAMEZAWA Hiroyuki
2010-02-16  9:02         ` David Rientjes
2010-02-16  9:02           ` David Rientjes
2010-02-16 23:42           ` KAMEZAWA Hiroyuki
2010-02-16 23:42             ` KAMEZAWA Hiroyuki
2010-02-16 23:54             ` David Rientjes
2010-02-16 23:54               ` David Rientjes
2010-02-17  0:01               ` KAMEZAWA Hiroyuki
2010-02-17  0:01                 ` KAMEZAWA Hiroyuki
2010-02-17  0:31                 ` David Rientjes
2010-02-17  0:31                   ` David Rientjes
2010-02-17  0:41                   ` KAMEZAWA Hiroyuki
2010-02-17  0:41                     ` KAMEZAWA Hiroyuki
2010-02-17  0:54                     ` David Rientjes
2010-02-17  0:54                       ` David Rientjes
2010-02-17  1:03                       ` KAMEZAWA Hiroyuki
2010-02-17  1:03                         ` KAMEZAWA Hiroyuki
2010-02-17  1:58                       ` David Rientjes
2010-02-17  1:58                         ` David Rientjes
2010-02-17  2:13                         ` KAMEZAWA Hiroyuki
2010-02-17  2:13                           ` KAMEZAWA Hiroyuki
2010-02-17  2:23                           ` KAMEZAWA Hiroyuki
2010-02-17  2:23                             ` KAMEZAWA Hiroyuki
2010-02-17  2:37                             ` David Rientjes
2010-02-17  2:37                               ` David Rientjes
2010-02-17  2:28                           ` David Rientjes
2010-02-17  2:28                             ` David Rientjes
2010-02-17  2:34                             ` KAMEZAWA Hiroyuki
2010-02-17  2:34                               ` KAMEZAWA Hiroyuki
2010-02-17  2:58                               ` David Rientjes
2010-02-17  2:58                                 ` David Rientjes
2010-02-17  3:21                                 ` KAMEZAWA Hiroyuki
2010-02-17  3:21                                   ` KAMEZAWA Hiroyuki
2010-02-17  9:11                                   ` David Rientjes
2010-02-17  9:11                                     ` David Rientjes
2010-02-17  9:52                                     ` Nick Piggin
2010-02-17  9:52                                       ` Nick Piggin
2010-02-17 22:04                                       ` David Rientjes
2010-02-17 22:04                                         ` David Rientjes
2010-02-22  5:31                               ` Daisuke Nishimura
2010-02-22  6:15                                 ` KAMEZAWA Hiroyuki
2010-02-22  6:15                                   ` KAMEZAWA Hiroyuki
2010-02-22 11:42                                   ` Daisuke Nishimura
2010-02-22 11:42                                     ` Daisuke Nishimura
2010-02-22 20:59                                     ` David Rientjes
2010-02-22 20:59                                       ` David Rientjes
2010-02-22 23:51                                     ` KAMEZAWA Hiroyuki
2010-02-22 23:51                                       ` KAMEZAWA Hiroyuki
2010-02-22 20:55                                   ` David Rientjes
2010-02-22 20:55                                     ` David Rientjes
2010-02-17  2:19                         ` KOSAKI Motohiro
2010-02-17  2:19                           ` KOSAKI Motohiro
2010-02-16  6:20   ` Nick Piggin
2010-02-16  6:20     ` Nick Piggin
2010-02-16  6:59     ` David Rientjes
2010-02-16  6:59       ` David Rientjes
2010-02-16  7:20       ` Nick Piggin
2010-02-16  7:20         ` Nick Piggin
2010-02-16  7:53         ` David Rientjes
2010-02-16  7:53           ` David Rientjes
2010-02-16  8:08           ` Nick Piggin
2010-02-16  8:08             ` Nick Piggin
2010-02-16  8:10             ` KAMEZAWA Hiroyuki
2010-02-16  8:10               ` KAMEZAWA Hiroyuki
2010-02-16  8:42             ` David Rientjes
2010-02-16  8:42               ` David Rientjes
2010-02-15 22:20 ` [patch -mm 5/9 v2] oom: badness heuristic rewrite David Rientjes
2010-02-15 22:20   ` David Rientjes
2010-02-15 22:20 ` [patch -mm 6/9 v2] oom: deprecate oom_adj tunable David Rientjes
2010-02-15 22:20   ` David Rientjes
2010-02-15 22:28   ` Alan Cox
2010-02-15 22:28     ` Alan Cox
2010-02-15 22:35     ` David Rientjes
2010-02-15 22:35       ` David Rientjes
2010-02-15 22:20 ` [patch -mm 7/9 v2] oom: replace sysctls with quick mode David Rientjes
2010-02-15 22:20   ` David Rientjes
2010-02-16  6:28   ` Nick Piggin
2010-02-16  6:28     ` Nick Piggin
2010-02-16  8:58     ` David Rientjes
2010-02-16  8:58       ` David Rientjes
2010-02-15 22:20 ` [patch -mm 8/9 v2] oom: avoid oom killer for lowmem allocations David Rientjes
2010-02-15 22:20   ` David Rientjes
2010-02-15 23:57   ` KAMEZAWA Hiroyuki
2010-02-15 23:57     ` KAMEZAWA Hiroyuki
2010-02-16  0:10     ` David Rientjes
2010-02-16  0:10       ` David Rientjes
2010-02-16  0:21       ` KAMEZAWA Hiroyuki
2010-02-16  0:21         ` KAMEZAWA Hiroyuki
2010-02-16  1:13         ` [patch] mm: add comment about deprecation of __GFP_NOFAIL David Rientjes
2010-02-16  1:13           ` David Rientjes
2010-02-16  1:26           ` KAMEZAWA Hiroyuki
2010-02-16  1:26             ` KAMEZAWA Hiroyuki
2010-02-16  7:03             ` David Rientjes
2010-02-16  7:03               ` David Rientjes
2010-02-16  7:23               ` Nick Piggin
2010-02-16  7:23                 ` Nick Piggin
2010-02-16  5:32       ` [patch -mm 8/9 v2] oom: avoid oom killer for lowmem allocations KOSAKI Motohiro
2010-02-16  5:32         ` KOSAKI Motohiro
2010-02-16  7:29         ` David Rientjes
2010-02-16  7:29           ` David Rientjes
2010-02-16  6:44       ` Nick Piggin
2010-02-16  6:44         ` Nick Piggin
2010-02-16  7:41         ` David Rientjes
2010-02-16  7:41           ` David Rientjes
2010-02-16  7:53           ` Nick Piggin
2010-02-16  7:53             ` Nick Piggin
2010-02-16  8:25             ` David Rientjes
2010-02-16  8:25               ` David Rientjes
2010-02-16 23:48               ` KAMEZAWA Hiroyuki
2010-02-16 23:48                 ` KAMEZAWA Hiroyuki
2010-02-17  0:03                 ` David Rientjes
2010-02-17  0:03                   ` David Rientjes
2010-02-17  0:03                   ` KAMEZAWA Hiroyuki
2010-02-17  0:03                     ` KAMEZAWA Hiroyuki
2010-02-17  0:21                     ` David Rientjes
2010-02-17  0:21                       ` David Rientjes
2010-02-23 11:24                       ` Balbir Singh
2010-02-23 11:24                         ` Balbir Singh
2010-02-23 21:12                         ` David Rientjes
2010-02-23 21:12                           ` David Rientjes
2010-02-15 22:20 ` [patch -mm 9/9 v2] oom: remove unnecessary code and cleanup David Rientjes
2010-02-15 22:20   ` David Rientjes
