* [patch 00/18] oom killer rewrite
@ 2010-06-06 22:33 David Rientjes
  2010-06-06 22:34 ` [patch 01/18] oom: check PF_KTHREAD instead of !mm to skip kthreads David Rientjes
                   ` (17 more replies)
  0 siblings, 18 replies; 104+ messages in thread
From: David Rientjes @ 2010-06-06 22:33 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Rik van Riel, Nick Piggin, Oleg Nesterov, Balbir Singh,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, linux-mm

This is the latest update of the oom killer rewrite based on
mmotm-2010-06-03-16-36, although it applies cleanly to 2.6.35-rc2 as
well.

There are two changes in this update, which I hope will now be considered
for -mm inclusion and pushed for 2.6.36:

 - reordered the patches to more accurately separate fixes from
   enhancements: the order is now very close to how KAMEZAWA Hiroyuki
   suggested (thanks!), and

 - the changelog for "oom: badness heuristic rewrite" was slightly
   expanded to mention how this rewrite improves the oom killer's
   behavior on the desktop.

Many thanks to Nick Piggin <npiggin@suse.de> for converting the remaining
architectures that weren't using the oom killer to handle pagefault oom
conditions to do so.  His patches have hit mainline, so there is no
longer an inconsistency in the semantics of panic_on_oom in such cases!

Many thanks to KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> for his
help and patience in working with me on this patchset.
---
 Documentation/feature-removal-schedule.txt |   25 +
 Documentation/filesystems/proc.txt         |  100 ++--
 Documentation/sysctl/vm.txt                |   23 
 fs/proc/base.c                             |  107 ++++
 include/linux/memcontrol.h                 |    8 
 include/linux/mempolicy.h                  |   13 
 include/linux/oom.h                        |   27 +
 include/linux/sched.h                      |    3 
 kernel/fork.c                              |    1 
 kernel/sysctl.c                            |   12 
 mm/memcontrol.c                            |   18 
 mm/mempolicy.c                             |   44 +
 mm/oom_kill.c                              |  675 ++++++++++++++++-------------
 mm/page_alloc.c                            |   29 -
 14 files changed, 727 insertions(+), 358 deletions(-)

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <dont@kvack.org>


* [patch 01/18] oom: check PF_KTHREAD instead of !mm to skip kthreads
  2010-06-06 22:33 [patch 00/18] oom killer rewrite David Rientjes
@ 2010-06-06 22:34 ` David Rientjes
  2010-06-07 12:12   ` Balbir Singh
  2010-06-08 19:33   ` Andrew Morton
  2010-06-06 22:34 ` [patch 02/18] oom: introduce find_lock_task_mm() to fix !mm false positives David Rientjes
                   ` (16 subsequent siblings)
  17 siblings, 2 replies; 104+ messages in thread
From: David Rientjes @ 2010-06-06 22:34 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Rik van Riel, Nick Piggin, Oleg Nesterov, Balbir Singh,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, linux-mm

From: Oleg Nesterov <oleg@redhat.com>

select_bad_process() assumes a kernel thread can't have ->mm != NULL;
this is not true due to use_mm().

Change the code to check PF_KTHREAD.
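The difference matters for a kthread that has borrowed a user mm via
use_mm(): it then has ->mm != NULL, so the old !mm test would not skip
it.  A small userspace model of the two filters (the flag value and the
task layout are simplified stand-ins for the kernel's definitions):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-in for the kernel's task_struct and flag. */
#define PF_KTHREAD 0x00200000

struct task {
	unsigned int flags;
	void *mm;		/* non-NULL after use_mm(), even for a kthread */
};

/* Old filter: skips a task only when it has no ->mm. */
static bool old_skip(const struct task *p)
{
	return p->mm == NULL;
}

/* New filter: also skips kernel threads explicitly, since a kthread
 * that called use_mm() temporarily has a non-NULL ->mm. */
static bool new_skip(const struct task *p)
{
	return p->mm == NULL || (p->flags & PF_KTHREAD);
}
```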

Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: David Rientjes <rientjes@google.com>
---
 mm/oom_kill.c |    9 +++------
 1 files changed, 3 insertions(+), 6 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -256,14 +256,11 @@ static struct task_struct *select_bad_process(unsigned long *ppoints,
 	for_each_process(p) {
 		unsigned long points;
 
-		/*
-		 * skip kernel threads and tasks which have already released
-		 * their mm.
-		 */
+		/* skip tasks that have already released their mm */
 		if (!p->mm)
 			continue;
-		/* skip the init task */
-		if (is_global_init(p))
+		/* skip the init task and kthreads */
+		if (is_global_init(p) || (p->flags & PF_KTHREAD))
 			continue;
 		if (mem && !task_in_mem_cgroup(p, mem))
 			continue;


* [patch 02/18] oom: introduce find_lock_task_mm() to fix !mm false positives
  2010-06-06 22:33 [patch 00/18] oom killer rewrite David Rientjes
  2010-06-06 22:34 ` [patch 01/18] oom: check PF_KTHREAD instead of !mm to skip kthreads David Rientjes
@ 2010-06-06 22:34 ` David Rientjes
  2010-06-07 12:58   ` Balbir Singh
  2010-06-08 19:42   ` Andrew Morton
  2010-06-06 22:34 ` [patch 03/18] oom: dump_tasks use find_lock_task_mm too David Rientjes
                   ` (15 subsequent siblings)
  17 siblings, 2 replies; 104+ messages in thread
From: David Rientjes @ 2010-06-06 22:34 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Rik van Riel, Nick Piggin, Oleg Nesterov, Balbir Singh,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, linux-mm

From: Oleg Nesterov <oleg@redhat.com>

Almost all ->mm == NULL checks in oom_kill.c are wrong.

The current code assumes that a task without ->mm has already
released its memory and ignores the process.  However, this is not
necessarily true: when the process is multithreaded, other live
sub-threads can still be using this ->mm.

- Remove the "if (!p->mm)" check in select_bad_process(), it is
  just wrong.

- Add the new helper, find_lock_task_mm(), which finds the live
  thread which uses the memory and takes task_lock() to pin ->mm.

- Change oom_badness() to use this helper instead of just checking
  ->mm != NULL.

- As David pointed out, select_bad_process() must never choose the
  task without ->mm, but no matter what oom_badness() returns the
  task can be chosen if nothing else has been found yet.

  Change oom_badness() to return int, change it to return -1 if
  find_lock_task_mm() fails, and change select_bad_process() to
  check points >= 0.

Note! This patch is not enough, we need more changes.

	- oom_badness() was fixed, but oom_kill_task() still ignores
	  the task without ->mm

	- oom_forkbomb_penalty() should use find_lock_task_mm() too,
	  and it also needs other changes to actually find the
	  first-descendant children

This will be addressed later.
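The search that find_lock_task_mm() performs can be sketched in
userspace as follows (the thread group is modeled as a plain array and
task_lock()/the circular thread list are omitted; the names are
illustrative only):

```c
#include <assert.h>
#include <stddef.h>

/* Minimal stand-in for one thread of a thread group. */
struct thread {
	void *mm;	/* NULL once the thread has released its mm */
};

/* Walk the thread group and return the first live thread that still
 * has an ->mm, or NULL if every thread has released it, i.e. the
 * whole group is truly exiting. */
static struct thread *find_task_mm(struct thread *group, size_t nthreads)
{
	for (size_t i = 0; i < nthreads; i++)
		if (group[i].mm)
			return &group[i];
	return NULL;
}
```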

[kosaki.motohiro@jp.fujitsu.com: use in badness(), __oom_kill_task()]
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: David Rientjes <rientjes@google.com>
---
 mm/oom_kill.c |   74 +++++++++++++++++++++++++++++++++------------------------
 1 files changed, 43 insertions(+), 31 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -52,6 +52,20 @@ static int has_intersects_mems_allowed(struct task_struct *tsk)
 	return 0;
 }
 
+static struct task_struct *find_lock_task_mm(struct task_struct *p)
+{
+	struct task_struct *t = p;
+
+	do {
+		task_lock(t);
+		if (likely(t->mm))
+			return t;
+		task_unlock(t);
+	} while_each_thread(p, t);
+
+	return NULL;
+}
+
 /**
  * badness - calculate a numeric value for how bad this task has been
  * @p: task struct of which task we should calculate
@@ -74,8 +88,8 @@ static int has_intersects_mems_allowed(struct task_struct *tsk)
 unsigned long badness(struct task_struct *p, unsigned long uptime)
 {
 	unsigned long points, cpu_time, run_time;
-	struct mm_struct *mm;
 	struct task_struct *child;
+	struct task_struct *c, *t;
 	int oom_adj = p->signal->oom_adj;
 	struct task_cputime task_time;
 	unsigned long utime;
@@ -84,17 +98,14 @@ unsigned long badness(struct task_struct *p, unsigned long uptime)
 	if (oom_adj == OOM_DISABLE)
 		return 0;
 
-	task_lock(p);
-	mm = p->mm;
-	if (!mm) {
-		task_unlock(p);
+	p = find_lock_task_mm(p);
+	if (!p)
 		return 0;
-	}
 
 	/*
 	 * The memory size of the process is the basis for the badness.
 	 */
-	points = mm->total_vm;
+	points = p->mm->total_vm;
 
 	/*
 	 * After this unlock we can no longer dereference local variable `mm'
@@ -115,12 +126,17 @@ unsigned long badness(struct task_struct *p, unsigned long uptime)
 	 * child is eating the vast majority of memory, adding only half
 	 * to the parents will make the child our kill candidate of choice.
 	 */
-	list_for_each_entry(child, &p->children, sibling) {
-		task_lock(child);
-		if (child->mm != mm && child->mm)
-			points += child->mm->total_vm/2 + 1;
-		task_unlock(child);
-	}
+	t = p;
+	do {
+		list_for_each_entry(c, &t->children, sibling) {
+			child = find_lock_task_mm(c);
+			if (child) {
+				if (child->mm != p->mm)
+					points += child->mm->total_vm/2 + 1;
+				task_unlock(child);
+			}
+		}
+	} while_each_thread(p, t);
 
 	/*
 	 * CPU time is in tens of seconds and run time is in thousands
@@ -256,9 +272,6 @@ static struct task_struct *select_bad_process(unsigned long *ppoints,
 	for_each_process(p) {
 		unsigned long points;
 
-		/* skip tasks that have already released their mm */
-		if (!p->mm)
-			continue;
 		/* skip the init task and kthreads */
 		if (is_global_init(p) || (p->flags & PF_KTHREAD))
 			continue;
@@ -385,14 +398,9 @@ static void __oom_kill_task(struct task_struct *p, int verbose)
 		return;
 	}
 
-	task_lock(p);
-	if (!p->mm) {
-		WARN_ON(1);
-		printk(KERN_WARNING "tried to kill an mm-less task %d (%s)!\n",
-			task_pid_nr(p), p->comm);
-		task_unlock(p);
+	p = find_lock_task_mm(p);
+	if (!p)
 		return;
-	}
 
 	if (verbose)
 		printk(KERN_ERR "Killed process %d (%s) "
@@ -437,6 +445,7 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
 			    const char *message)
 {
 	struct task_struct *c;
+	struct task_struct *t = p;
 
 	if (printk_ratelimit())
 		dump_header(p, gfp_mask, order, mem);
@@ -454,14 +463,17 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
 					message, task_pid_nr(p), p->comm, points);
 
 	/* Try to kill a child first */
-	list_for_each_entry(c, &p->children, sibling) {
-		if (c->mm == p->mm)
-			continue;
-		if (mem && !task_in_mem_cgroup(c, mem))
-			continue;
-		if (!oom_kill_task(c))
-			return 0;
-	}
+	do {
+		list_for_each_entry(c, &t->children, sibling) {
+			if (c->mm == p->mm)
+				continue;
+			if (mem && !task_in_mem_cgroup(c, mem))
+				continue;
+			if (!oom_kill_task(c))
+				return 0;
+		}
+	} while_each_thread(p, t);
+
 	return oom_kill_task(p);
 }
 


* [patch 03/18] oom: dump_tasks use find_lock_task_mm too
  2010-06-06 22:33 [patch 00/18] oom killer rewrite David Rientjes
  2010-06-06 22:34 ` [patch 01/18] oom: check PF_KTHREAD instead of !mm to skip kthreads David Rientjes
  2010-06-06 22:34 ` [patch 02/18] oom: introduce find_lock_task_mm() to fix !mm false positives David Rientjes
@ 2010-06-06 22:34 ` David Rientjes
  2010-06-08 19:55   ` Andrew Morton
  2010-06-06 22:34 ` [patch 04/18] oom: PF_EXITING check should take mm into account David Rientjes
                   ` (14 subsequent siblings)
  17 siblings, 1 reply; 104+ messages in thread
From: David Rientjes @ 2010-06-06 22:34 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Rik van Riel, Nick Piggin, Oleg Nesterov, Balbir Singh,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, linux-mm

From: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>

dump_tasks() should use find_lock_task_mm() too.  It is necessary to
protect against the task-exiting race.

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: David Rientjes <rientjes@google.com>
---
 mm/oom_kill.c |   39 +++++++++++++++++++++------------------
 1 files changed, 21 insertions(+), 18 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -336,35 +336,38 @@ static struct task_struct *select_bad_process(unsigned long *ppoints,
  */
 static void dump_tasks(const struct mem_cgroup *mem)
 {
-	struct task_struct *g, *p;
+	struct task_struct *p;
+	struct task_struct *task;
 
 	printk(KERN_INFO "[ pid ]   uid  tgid total_vm      rss cpu oom_adj "
 	       "name\n");
-	do_each_thread(g, p) {
-		struct mm_struct *mm;
-
-		if (mem && !task_in_mem_cgroup(p, mem))
+	for_each_process(p) {
+		/*
+		 * We don't check is_global_init() here because the old
+		 * code printed the init task too; doing so is harmless,
+		 * and we don't want to break compatibility unnecessarily.
+		 */
+		if (p->flags & PF_KTHREAD)
 			continue;
-		if (!thread_group_leader(p))
+		if (mem && !task_in_mem_cgroup(p, mem))
 			continue;
 
-		task_lock(p);
-		mm = p->mm;
-		if (!mm) {
+		task = find_lock_task_mm(p);
+		if (!task) {
 			/*
-			 * total_vm and rss sizes do not exist for tasks with no
-			 * mm so there's no need to report them; they can't be
-			 * oom killed anyway.
+			 * There was probably a race with task exit and the
+			 * ->mm has been detached, so there's no need to
+			 * report such tasks; they can't be oom killed anyway.
 			 */
-			task_unlock(p);
 			continue;
 		}
+
 		printk(KERN_INFO "[%5d] %5d %5d %8lu %8lu %3d     %3d %s\n",
-		       p->pid, __task_cred(p)->uid, p->tgid, mm->total_vm,
-		       get_mm_rss(mm), (int)task_cpu(p), p->signal->oom_adj,
-		       p->comm);
-		task_unlock(p);
-	} while_each_thread(g, p);
+		       task->pid, __task_cred(task)->uid, task->tgid,
+		       task->mm->total_vm, get_mm_rss(task->mm),
+		       (int)task_cpu(task), task->signal->oom_adj, p->comm);
+		task_unlock(task);
+	}
 }
 
 static void dump_header(struct task_struct *p, gfp_t gfp_mask, int order,


* [patch 04/18] oom: PF_EXITING check should take mm into account
  2010-06-06 22:33 [patch 00/18] oom killer rewrite David Rientjes
                   ` (2 preceding siblings ...)
  2010-06-06 22:34 ` [patch 03/18] oom: dump_tasks use find_lock_task_mm too David Rientjes
@ 2010-06-06 22:34 ` David Rientjes
  2010-06-08 20:00   ` Andrew Morton
  2010-06-06 22:34 ` [patch 05/18] oom: give current access to memory reserves if it has been killed David Rientjes
                   ` (13 subsequent siblings)
  17 siblings, 1 reply; 104+ messages in thread
From: David Rientjes @ 2010-06-06 22:34 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Rik van Riel, Nick Piggin, Oleg Nesterov, Balbir Singh,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, linux-mm

From: Oleg Nesterov <oleg@redhat.com>

select_bad_process() checks PF_EXITING to detect the task which is going
to release its memory, but the logic is very wrong.

	- a single process P with the dead group leader disables
	  select_bad_process() completely, it will always return
	  ERR_PTR() while P can live forever

	- if the PF_EXITING task has already released its ->mm
	  it doesn't make sense to expect it is going to free
	  more memory (except task_struct/etc)

Change the code to ignore the PF_EXITING tasks without ->mm.
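The fixed predicate can be modeled in userspace like this (PF_EXITING's
value mirrors the kernel's, but the task layout is a simplified
stand-in):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define PF_EXITING 0x00000004

struct task {
	unsigned int flags;
	void *mm;
};

/* Old logic treated any PF_EXITING task as "about to free memory".
 * The fixed test only expects that while the task still owns an ->mm;
 * a PF_EXITING task without ->mm has nothing substantial left to free. */
static bool will_free_memory(const struct task *p)
{
	return (p->flags & PF_EXITING) && p->mm != NULL;
}
```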

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: David Rientjes <rientjes@google.com>
---
 mm/oom_kill.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -300,7 +300,7 @@ static struct task_struct *select_bad_process(unsigned long *ppoints,
 		 * the process of exiting and releasing its resources.
 		 * Otherwise we could get an easy OOM deadlock.
 		 */
-		if (p->flags & PF_EXITING) {
+		if ((p->flags & PF_EXITING) && p->mm) {
 			if (p != current)
 				return ERR_PTR(-1UL);
 


* [patch 05/18] oom: give current access to memory reserves if it has been killed
  2010-06-06 22:33 [patch 00/18] oom killer rewrite David Rientjes
                   ` (3 preceding siblings ...)
  2010-06-06 22:34 ` [patch 04/18] oom: PF_EXITING check should take mm into account David Rientjes
@ 2010-06-06 22:34 ` David Rientjes
  2010-06-08 11:41   ` KOSAKI Motohiro
  2010-06-08 20:08   ` Andrew Morton
  2010-06-06 22:34 ` [patch 06/18] oom: avoid sending exiting tasks a SIGKILL David Rientjes
                   ` (12 subsequent siblings)
  17 siblings, 2 replies; 104+ messages in thread
From: David Rientjes @ 2010-06-06 22:34 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Rik van Riel, Nick Piggin, Oleg Nesterov, Balbir Singh,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, linux-mm

It's possible to livelock the page allocator if a thread has mm->mmap_sem
and fails to make forward progress because the oom killer selects another
thread sharing the same ->mm to kill that cannot exit until the semaphore
is dropped.

The oom killer will not kill multiple tasks at the same time; each oom
killed task must exit before another task may be killed.  Thus, if one
thread is holding mm->mmap_sem and cannot allocate memory, all threads
sharing the same ->mm are blocked from exiting as well.  In the oom kill
case, that means the thread holding mm->mmap_sem will never free
additional memory since it cannot get access to memory reserves and the
thread that depends on it with access to memory reserves cannot exit
because it cannot acquire the semaphore.  Thus, the page allocator
livelocks.

When the oom killer is called and current happens to have a pending
SIGKILL, this patch automatically gives it access to memory reserves and
returns.  Upon returning to the page allocator, its allocation will
hopefully succeed so it can quickly exit and free its memory.  If not, the
page allocator will fail the allocation if it is not __GFP_NOFAIL.
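A minimal userspace sketch of the new early exit (the boolean fields
stand in for fatal_signal_pending() and TIF_MEMDIE; nothing here is
real kernel API):

```c
#include <assert.h>
#include <stdbool.h>

struct task {
	bool sigkill_pending;	/* stand-in for fatal_signal_pending() */
	bool memdie;		/* stand-in for TIF_MEMDIE */
};

/* If the caller already has SIGKILL pending, grant it access to memory
 * reserves and return true, meaning the oom killer can bail out
 * without selecting another victim. */
static bool oom_handle_pending_sigkill(struct task *current_task)
{
	if (current_task->sigkill_pending) {
		current_task->memdie = true;
		return true;
	}
	return false;
}
```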

Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: David Rientjes <rientjes@google.com>
---
 mm/oom_kill.c |   10 ++++++++++
 1 files changed, 10 insertions(+), 0 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -650,6 +650,16 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
 		/* Got some memory back in the last second. */
 		return;
 
+	/*
+	 * If current has a pending SIGKILL, then automatically select it.  The
+	 * goal is to allow it to allocate so that it may quickly exit and free
+	 * its memory.
+	 */
+	if (fatal_signal_pending(current)) {
+		set_thread_flag(TIF_MEMDIE);
+		return;
+	}
+
 	if (sysctl_panic_on_oom == 2) {
 		dump_header(NULL, gfp_mask, order, NULL);
 		panic("out of memory. Compulsory panic_on_oom is selected.\n");


* [patch 06/18] oom: avoid sending exiting tasks a SIGKILL
  2010-06-06 22:33 [patch 00/18] oom killer rewrite David Rientjes
                   ` (4 preceding siblings ...)
  2010-06-06 22:34 ` [patch 05/18] oom: give current access to memory reserves if it has been killed David Rientjes
@ 2010-06-06 22:34 ` David Rientjes
  2010-06-08 11:41   ` KOSAKI Motohiro
                     ` (2 more replies)
  2010-06-06 22:34 ` [patch 07/18] oom: filter tasks not sharing the same cpuset David Rientjes
                   ` (11 subsequent siblings)
  17 siblings, 3 replies; 104+ messages in thread
From: David Rientjes @ 2010-06-06 22:34 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Rik van Riel, Nick Piggin, Oleg Nesterov, Balbir Singh,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, linux-mm

It's unnecessary to SIGKILL a task that is already PF_EXITING and can
actually cause a NULL pointer dereference of the sighand if it has already
been detached.  Instead, simply set TIF_MEMDIE so it has access to memory
reserves and can quickly exit as the comment implies.

Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: David Rientjes <rientjes@google.com>
---
 mm/oom_kill.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -458,7 +458,7 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
 	 * its children or threads, just set TIF_MEMDIE so it can die quickly
 	 */
 	if (p->flags & PF_EXITING) {
-		__oom_kill_task(p, 0);
+		set_tsk_thread_flag(p, TIF_MEMDIE);
 		return 0;
 	}
 


* [patch 07/18] oom: filter tasks not sharing the same cpuset
  2010-06-06 22:33 [patch 00/18] oom killer rewrite David Rientjes
                   ` (5 preceding siblings ...)
  2010-06-06 22:34 ` [patch 06/18] oom: avoid sending exiting tasks a SIGKILL David Rientjes
@ 2010-06-06 22:34 ` David Rientjes
  2010-06-08 11:41   ` KOSAKI Motohiro
  2010-06-08 20:23   ` Andrew Morton
  2010-06-06 22:34 ` [patch 08/18] oom: sacrifice child with highest badness score for parent David Rientjes
                   ` (10 subsequent siblings)
  17 siblings, 2 replies; 104+ messages in thread
From: David Rientjes @ 2010-06-06 22:34 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Rik van Riel, Nick Piggin, Oleg Nesterov, Balbir Singh,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, linux-mm

Tasks that do not share the same set of allowed nodes with the task that
triggered the oom should not be considered as candidates for oom kill.

Tasks in other cpusets with a disjoint set of mems would be unfairly
penalized otherwise because of oom conditions elsewhere; an extreme
example could unfairly kill all other applications on the system if a
single task in a user's cpuset sets itself to OOM_DISABLE and then uses
more memory than allowed.

Killing tasks outside of current's cpuset rarely would free memory for
current anyway.  To use a sane heuristic, we must ensure that killing a
task would likely free memory for current and avoid needlessly killing
others at all costs just because their potential memory freeing is
unknown.  It is better to kill current than another task needlessly.

Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Nick Piggin <npiggin@suse.de>
Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: David Rientjes <rientjes@google.com>
---
 mm/oom_kill.c |   10 ++--------
 1 files changed, 2 insertions(+), 8 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -184,14 +184,6 @@ unsigned long badness(struct task_struct *p, unsigned long uptime)
 		points /= 4;
 
 	/*
-	 * If p's nodes don't overlap ours, it may still help to kill p
-	 * because p may have allocated or otherwise mapped memory on
-	 * this node before. However it will be less likely.
-	 */
-	if (!has_intersects_mems_allowed(p))
-		points /= 8;
-
-	/*
 	 * Adjust the score by oom_adj.
 	 */
 	if (oom_adj) {
@@ -277,6 +269,8 @@ static struct task_struct *select_bad_process(unsigned long *ppoints,
 			continue;
 		if (mem && !task_in_mem_cgroup(p, mem))
 			continue;
+		if (!has_intersects_mems_allowed(p))
+			continue;
 
 		/*
 		 * This task already has access to memory reserves and is


* [patch 08/18] oom: sacrifice child with highest badness score for parent
  2010-06-06 22:33 [patch 00/18] oom killer rewrite David Rientjes
                   ` (6 preceding siblings ...)
  2010-06-06 22:34 ` [patch 07/18] oom: filter tasks not sharing the same cpuset David Rientjes
@ 2010-06-06 22:34 ` David Rientjes
  2010-06-08 11:41   ` KOSAKI Motohiro
  2010-06-08 20:33   ` Andrew Morton
  2010-06-06 22:34 ` [patch 09/18] oom: select task from tasklist for mempolicy ooms David Rientjes
                   ` (9 subsequent siblings)
  17 siblings, 2 replies; 104+ messages in thread
From: David Rientjes @ 2010-06-06 22:34 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Rik van Riel, Nick Piggin, Oleg Nesterov, Balbir Singh,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, linux-mm

When a task is chosen for oom kill, the oom killer first attempts to
sacrifice a child not sharing its parent's memory instead.  Unfortunately,
this often kills in a seemingly random fashion based on the ordering of
the selected task's child list.  Additionally, it is not guaranteed at all
to free a large amount of memory that we need to prevent additional oom
killing in the very near future.

Instead, we now only attempt to sacrifice the worst child not sharing its
parent's memory, if one exists.  The worst child is indicated with the
highest badness() score.  This serves two advantages: we kill a
memory-hogging task more often, and we allow the configurable
/proc/pid/oom_adj value to be considered as a factor in which child to
kill.

Reviewers may observe that the previous implementation would iterate
through the children and attempt to kill each until one was successful and
then the parent if none were found while the new code simply kills the
most memory-hogging task or the parent.  Note that the only time
oom_kill_task() fails, however, is when a child does not have an mm or has
a /proc/pid/oom_adj of OOM_DISABLE.  badness() returns 0 for both cases,
so the final oom_kill_task() will always succeed.
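The new selection loop reduces to a max-scan over the children's
badness() scores.  A userspace model, with the scores precomputed into
an array purely for illustration:

```c
#include <assert.h>
#include <stddef.h>

/* Instead of killing the first killable child, scan them all and
 * remember the one with the highest badness() score.  A score of 0
 * means "unkillable"; if no child scores above 0, the parent itself
 * (represented by -1 here) remains the victim. */
static int pick_victim(const unsigned long *child_points, size_t n)
{
	unsigned long best = 0;
	int victim = -1;	/* -1 stands for the parent */

	for (size_t i = 0; i < n; i++)
		if (child_points[i] > best) {
			best = child_points[i];
			victim = (int)i;
		}
	return victim;
}
```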

Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Nick Piggin <npiggin@suse.de>
Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: David Rientjes <rientjes@google.com>
---
 mm/oom_kill.c |   23 +++++++++++++++++------
 1 files changed, 17 insertions(+), 6 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -441,8 +441,11 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
 			    unsigned long points, struct mem_cgroup *mem,
 			    const char *message)
 {
+	struct task_struct *victim = p;
 	struct task_struct *c;
 	struct task_struct *t = p;
+	unsigned long victim_points = 0;
+	struct timespec uptime;
 
 	if (printk_ratelimit())
 		dump_header(p, gfp_mask, order, mem);
@@ -456,22 +459,30 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
 		return 0;
 	}
 
-	printk(KERN_ERR "%s: kill process %d (%s) score %li or a child\n",
-					message, task_pid_nr(p), p->comm, points);
+	pr_err("%s: Kill process %d (%s) score %lu or sacrifice child\n",
+		message, task_pid_nr(p), p->comm, points);
 
-	/* Try to kill a child first */
+	/* Try to sacrifice the worst child first */
+	do_posix_clock_monotonic_gettime(&uptime);
 	do {
+		unsigned long cpoints;
+
 		list_for_each_entry(c, &t->children, sibling) {
 			if (c->mm == p->mm)
 				continue;
 			if (mem && !task_in_mem_cgroup(c, mem))
 				continue;
-			if (!oom_kill_task(c))
-				return 0;
+
+			/* badness() returns 0 if the thread is unkillable */
+			cpoints = badness(c, uptime.tv_sec);
+			if (cpoints > victim_points) {
+				victim = c;
+				victim_points = cpoints;
+			}
 		}
 	} while_each_thread(p, t);
 
-	return oom_kill_task(p);
+	return oom_kill_task(victim);
 }
 
 #ifdef CONFIG_CGROUP_MEM_RES_CTLR


* [patch 09/18] oom: select task from tasklist for mempolicy ooms
  2010-06-06 22:33 [patch 00/18] oom killer rewrite David Rientjes
                   ` (7 preceding siblings ...)
  2010-06-06 22:34 ` [patch 08/18] oom: sacrifice child with highest badness score for parent David Rientjes
@ 2010-06-06 22:34 ` David Rientjes
  2010-06-08 11:41   ` KOSAKI Motohiro
                     ` (2 more replies)
  2010-06-06 22:34 ` [patch 10/18] oom: enable oom tasklist dump by default David Rientjes
                   ` (8 subsequent siblings)
  17 siblings, 3 replies; 104+ messages in thread
From: David Rientjes @ 2010-06-06 22:34 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Rik van Riel, Nick Piggin, Oleg Nesterov, Balbir Singh,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, linux-mm

The oom killer presently kills current whenever there is no more memory
free or reclaimable on its mempolicy's nodes.  There is no guarantee that
current is a memory-hogging task or that killing it will free any
substantial amount of memory, however.

In such situations, it is better to scan the tasklist for tasks that are
allowed to allocate on current's set of nodes and kill the task with the
highest badness() score.  This ensures that the most memory-hogging task,
or the one configured by the user with /proc/pid/oom_adj, is always
selected in such scenarios.
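The per-policy intersection test this patch adds can be modeled with a
single-word bitmap in userspace (the enum and function here are
illustrative stand-ins, not the kernel's types):

```c
#include <assert.h>
#include <stdbool.h>

enum mode { MODE_DEFAULT, MODE_PREFERRED, MODE_BIND, MODE_INTERLEAVE };

/* Default and preferred policies may fall back to any node, so they
 * always intersect the oom nodemask; bind and interleave policies only
 * allocate from their own nodes, so check for actual overlap. */
static bool nodemask_intersects(enum mode mode, unsigned long policy_nodes,
				unsigned long mask)
{
	switch (mode) {
	case MODE_DEFAULT:
	case MODE_PREFERRED:
		return true;
	case MODE_BIND:
	case MODE_INTERLEAVE:
		return (policy_nodes & mask) != 0;
	}
	return true;
}
```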

Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: David Rientjes <rientjes@google.com>
---
 include/linux/mempolicy.h |   13 +++++++-
 mm/mempolicy.c            |   44 ++++++++++++++++++++++++
 mm/oom_kill.c             |   80 +++++++++++++++++++++++++++-----------------
 3 files changed, 105 insertions(+), 32 deletions(-)

diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
--- a/include/linux/mempolicy.h
+++ b/include/linux/mempolicy.h
@@ -210,6 +210,8 @@ extern struct zonelist *huge_zonelist(struct vm_area_struct *vma,
 				unsigned long addr, gfp_t gfp_flags,
 				struct mempolicy **mpol, nodemask_t **nodemask);
 extern bool init_nodemask_of_mempolicy(nodemask_t *mask);
+extern bool mempolicy_nodemask_intersects(struct task_struct *tsk,
+				const nodemask_t *mask);
 extern unsigned slab_node(struct mempolicy *policy);
 
 extern enum zone_type policy_zone;
@@ -338,7 +340,16 @@ static inline struct zonelist *huge_zonelist(struct vm_area_struct *vma,
 	return node_zonelist(0, gfp_flags);
 }
 
-static inline bool init_nodemask_of_mempolicy(nodemask_t *m) { return false; }
+static inline bool init_nodemask_of_mempolicy(nodemask_t *m)
+{
+	return false;
+}
+
+static inline bool mempolicy_nodemask_intersects(struct task_struct *tsk,
+			const nodemask_t *mask)
+{
+	return false;
+}
 
 static inline int do_migrate_pages(struct mm_struct *mm,
 			const nodemask_t *from_nodes,
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1712,6 +1712,50 @@ bool init_nodemask_of_mempolicy(nodemask_t *mask)
 }
 #endif
 
+/*
+ * mempolicy_nodemask_intersects
+ *
+ * If tsk's mempolicy is "default" [NULL], return 'true' to indicate default
+ * policy.  Otherwise, check for intersection between mask and the policy
+ * nodemask for 'bind' or 'interleave' policy.  For 'preferred' or 'local'
+ * policy, always return true since it may allocate elsewhere on fallback.
+ *
+ * Takes task_lock(tsk) to prevent freeing of its mempolicy.
+ */
+bool mempolicy_nodemask_intersects(struct task_struct *tsk,
+					const nodemask_t *mask)
+{
+	struct mempolicy *mempolicy;
+	bool ret = true;
+
+	if (!mask)
+		return ret;
+	task_lock(tsk);
+	mempolicy = tsk->mempolicy;
+	if (!mempolicy)
+		goto out;
+
+	switch (mempolicy->mode) {
+	case MPOL_PREFERRED:
+		/*
+		 * MPOL_PREFERRED and MPOL_F_LOCAL only specify preferred nodes
+		 * to allocate from; they may fall back to other nodes when oom.
+		 * Thus, it's possible for tsk to have allocated memory from
+		 * nodes in mask.
+		 */
+		break;
+	case MPOL_BIND:
+	case MPOL_INTERLEAVE:
+		ret = nodes_intersects(mempolicy->v.nodes, *mask);
+		break;
+	default:
+		BUG();
+	}
+out:
+	task_unlock(tsk);
+	return ret;
+}
+
 /* Allocate a page in interleaved policy.
    Own path because it needs to do special accounting. */
 static struct page *alloc_page_interleave(gfp_t gfp, unsigned order,
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -27,6 +27,7 @@
 #include <linux/module.h>
 #include <linux/notifier.h>
 #include <linux/memcontrol.h>
+#include <linux/mempolicy.h>
 #include <linux/security.h>
 
 int sysctl_panic_on_oom;
@@ -36,20 +37,36 @@ static DEFINE_SPINLOCK(zone_scan_lock);
 /* #define DEBUG */
 
 /*
- * Is all threads of the target process nodes overlap ours?
+ * Do all threads of the target process overlap our allowed nodes?
+ * @tsk: task struct of the task to consider
+ * @mask: nodemask passed to page allocator for mempolicy ooms
  */
-static int has_intersects_mems_allowed(struct task_struct *tsk)
+static bool has_intersects_mems_allowed(struct task_struct *tsk,
+					const nodemask_t *mask)
 {
-	struct task_struct *t;
+	struct task_struct *start = tsk;
 
-	t = tsk;
 	do {
-		if (cpuset_mems_allowed_intersects(current, t))
-			return 1;
-		t = next_thread(t);
-	} while (t != tsk);
-
-	return 0;
+		if (mask) {
+			/*
+			 * If this is a mempolicy constrained oom, tsk's
+			 * cpuset is irrelevant.  Only return true if its
+			 * mempolicy intersects current, otherwise it may be
+			 * needlessly killed.
+			 */
+			if (mempolicy_nodemask_intersects(tsk, mask))
+				return true;
+		} else {
+			/*
+			 * This is not a mempolicy constrained oom, so only
+			 * check the mems of tsk's cpuset.
+			 */
+			if (cpuset_mems_allowed_intersects(current, tsk))
+				return true;
+		}
+		tsk = next_thread(tsk);
+	} while (tsk != start);
+	return false;
 }
 
 static struct task_struct *find_lock_task_mm(struct task_struct *p)
@@ -253,7 +270,8 @@ static enum oom_constraint constrained_alloc(struct zonelist *zonelist,
  * (not docbooked, we don't want this one cluttering up the manual)
  */
 static struct task_struct *select_bad_process(unsigned long *ppoints,
-						struct mem_cgroup *mem)
+		struct mem_cgroup *mem, enum oom_constraint constraint,
+		const nodemask_t *mask)
 {
 	struct task_struct *p;
 	struct task_struct *chosen = NULL;
@@ -269,7 +287,9 @@ static struct task_struct *select_bad_process(unsigned long *ppoints,
 			continue;
 		if (mem && !task_in_mem_cgroup(p, mem))
 			continue;
-		if (!has_intersects_mems_allowed(p))
+		if (!has_intersects_mems_allowed(p,
+				constraint == CONSTRAINT_MEMORY_POLICY ? mask :
+									 NULL))
 			continue;
 
 		/*
@@ -495,7 +515,7 @@ void mem_cgroup_out_of_memory(struct mem_cgroup *mem, gfp_t gfp_mask)
 		panic("out of memory(memcg). panic_on_oom is selected.\n");
 	read_lock(&tasklist_lock);
 retry:
-	p = select_bad_process(&points, mem);
+	p = select_bad_process(&points, mem, CONSTRAINT_NONE, NULL);
 	if (!p || PTR_ERR(p) == -1UL)
 		goto out;
 
@@ -574,7 +594,8 @@ void clear_zonelist_oom(struct zonelist *zonelist, gfp_t gfp_mask)
 /*
  * Must be called with tasklist_lock held for read.
  */
-static void __out_of_memory(gfp_t gfp_mask, int order)
+static void __out_of_memory(gfp_t gfp_mask, int order,
+			enum oom_constraint constraint, const nodemask_t *mask)
 {
 	struct task_struct *p;
 	unsigned long points;
@@ -588,7 +609,7 @@ retry:
 	 * Rambo mode: Shoot down a process and hope it solves whatever
 	 * issues we may have.
 	 */
-	p = select_bad_process(&points, NULL);
+	p = select_bad_process(&points, NULL, constraint, mask);
 
 	if (PTR_ERR(p) == -1UL)
 		return;
@@ -622,7 +643,8 @@ void pagefault_out_of_memory(void)
 		panic("out of memory from page fault. panic_on_oom is selected.\n");
 
 	read_lock(&tasklist_lock);
-	__out_of_memory(0, 0); /* unknown gfp_mask and order */
+	/* unknown gfp_mask and order */
+	__out_of_memory(0, 0, CONSTRAINT_NONE, NULL);
 	read_unlock(&tasklist_lock);
 
 	/*
@@ -638,6 +660,7 @@ void pagefault_out_of_memory(void)
  * @zonelist: zonelist pointer
  * @gfp_mask: memory allocation flags
  * @order: amount of memory being requested as a power of 2
+ * @nodemask: nodemask passed to page allocator
  *
  * If we run out of memory, we have the choice between either
  * killing a random task (bad), letting the system crash (worse)
@@ -676,24 +699,19 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
 	 */
 	constraint = constrained_alloc(zonelist, gfp_mask, nodemask);
 	read_lock(&tasklist_lock);
-
-	switch (constraint) {
-	case CONSTRAINT_MEMORY_POLICY:
-		oom_kill_process(current, gfp_mask, order, 0, NULL,
-				"No available memory (MPOL_BIND)");
-		break;
-
-	case CONSTRAINT_NONE:
-		if (sysctl_panic_on_oom) {
+	if (unlikely(sysctl_panic_on_oom)) {
+		/*
+		 * panic_on_oom only affects CONSTRAINT_NONE, the kernel
+		 * should not panic for cpuset or mempolicy induced memory
+		 * failures.
+		 */
+		if (constraint == CONSTRAINT_NONE) {
 			dump_header(NULL, gfp_mask, order, NULL);
-			panic("out of memory. panic_on_oom is selected\n");
+			read_unlock(&tasklist_lock);
+			panic("Out of memory: panic_on_oom is enabled\n");
 		}
-		/* Fall-through */
-	case CONSTRAINT_CPUSET:
-		__out_of_memory(gfp_mask, order);
-		break;
 	}
-
+	__out_of_memory(gfp_mask, order, constraint, nodemask);
 	read_unlock(&tasklist_lock);
 
 	/*

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 104+ messages in thread

* [patch 10/18] oom: enable oom tasklist dump by default
  2010-06-06 22:33 [patch 00/18] oom killer rewrite David Rientjes
                   ` (8 preceding siblings ...)
  2010-06-06 22:34 ` [patch 09/18] oom: select task from tasklist for mempolicy ooms David Rientjes
@ 2010-06-06 22:34 ` David Rientjes
  2010-06-08 11:42   ` KOSAKI Motohiro
  2010-06-08 21:13   ` Andrew Morton
  2010-06-06 22:34 ` [patch 11/18] oom: avoid oom killer for lowmem allocations David Rientjes
                   ` (7 subsequent siblings)
  17 siblings, 2 replies; 104+ messages in thread
From: David Rientjes @ 2010-06-06 22:34 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Rik van Riel, Nick Piggin, Oleg Nesterov, Balbir Singh,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, linux-mm

The oom killer tasklist dump, enabled with the oom_dump_tasks sysctl, is
very helpful in diagnosing why a user's task has been killed.  It emits
useful information, such as each eligible thread's memory usage, that can
help determine why the system is oom, so it should be enabled by
default.

Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: David Rientjes <rientjes@google.com>
---
 Documentation/sysctl/vm.txt |    2 +-
 mm/oom_kill.c               |    2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
--- a/Documentation/sysctl/vm.txt
+++ b/Documentation/sysctl/vm.txt
@@ -511,7 +511,7 @@ information may not be desired.
 If this is set to non-zero, this information is shown whenever the
 OOM killer actually kills a memory-hogging task.
 
-The default value is 0.
+The default value is 1 (enabled).
 
 ==============================================================
 
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index ef048c1..833de48 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -32,7 +32,7 @@
 
 int sysctl_panic_on_oom;
 int sysctl_oom_kill_allocating_task;
-int sysctl_oom_dump_tasks;
+int sysctl_oom_dump_tasks = 1;
 static DEFINE_SPINLOCK(zone_scan_lock);
 /* #define DEBUG */
 


^ permalink raw reply related	[flat|nested] 104+ messages in thread

* [patch 11/18] oom: avoid oom killer for lowmem allocations
  2010-06-06 22:33 [patch 00/18] oom killer rewrite David Rientjes
                   ` (9 preceding siblings ...)
  2010-06-06 22:34 ` [patch 10/18] oom: enable oom tasklist dump by default David Rientjes
@ 2010-06-06 22:34 ` David Rientjes
  2010-06-08 11:42   ` KOSAKI Motohiro
  2010-06-08 21:19   ` Andrew Morton
  2010-06-06 22:34 ` [patch 12/18] oom: extract panic helper function David Rientjes
                   ` (6 subsequent siblings)
  17 siblings, 2 replies; 104+ messages in thread
From: David Rientjes @ 2010-06-06 22:34 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Rik van Riel, Nick Piggin, Oleg Nesterov, Balbir Singh,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, linux-mm

If memory has been depleted in lowmem zones even with the protection
afforded to it by /proc/sys/vm/lowmem_reserve_ratio, it is unlikely that
killing current users will help.  The memory is either reclaimable (or
migratable) already, in which case we should not invoke the oom killer at
all, or it is pinned by an application for I/O.  Killing such an
application may leave the hardware in an unspecified state and there is no
guarantee that it will be able to make a timely exit.

Lowmem allocations are now failed in oom conditions when __GFP_NOFAIL is
not used so that the task can perhaps recover or try again later.

Previously, the heuristic provided some protection for those tasks with
CAP_SYS_RAWIO, but this is no longer necessary since we will not be
killing tasks for the purposes of ISA allocations.

high_zoneidx is gfp_zone(gfp_flags), meaning that ZONE_NORMAL will be the
default for all allocations that are not __GFP_DMA, __GFP_DMA32,
__GFP_HIGHMEM, and __GFP_MOVABLE on kernels configured to support those
flags.  Testing for high_zoneidx being less than ZONE_NORMAL will only
return true for allocations that have either __GFP_DMA or __GFP_DMA32.

Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: David Rientjes <rientjes@google.com>
---
 mm/page_alloc.c |   29 ++++++++++++++++++++---------
 1 files changed, 20 insertions(+), 9 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1759,6 +1759,9 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
 		/* The OOM killer will not help higher order allocs */
 		if (order > PAGE_ALLOC_COSTLY_ORDER)
 			goto out;
+		/* The OOM killer does not needlessly kill tasks for lowmem */
+		if (high_zoneidx < ZONE_NORMAL)
+			goto out;
 		/*
 		 * GFP_THISNODE contains __GFP_NORETRY and we never hit this.
 		 * Sanity check for bare calls of __GFP_THISNODE, not real OOM.
@@ -2052,15 +2055,23 @@ rebalance:
 			if (page)
 				goto got_pg;
 
-			/*
-			 * The OOM killer does not trigger for high-order
-			 * ~__GFP_NOFAIL allocations so if no progress is being
-			 * made, there are no other options and retrying is
-			 * unlikely to help.
-			 */
-			if (order > PAGE_ALLOC_COSTLY_ORDER &&
-						!(gfp_mask & __GFP_NOFAIL))
-				goto nopage;
+			if (!(gfp_mask & __GFP_NOFAIL)) {
+				/*
+				 * The oom killer is not called for high-order
+				 * allocations that may fail, so if no progress
+				 * is being made, there are no other options and
+				 * retrying is unlikely to help.
+				 */
+				if (order > PAGE_ALLOC_COSTLY_ORDER)
+					goto nopage;
+				/*
+				 * The oom killer is not called for lowmem
+				 * allocations to prevent needlessly killing
+				 * innocent tasks.
+				 */
+				if (high_zoneidx < ZONE_NORMAL)
+					goto nopage;
+			}
 
 			goto restart;
 		}


^ permalink raw reply	[flat|nested] 104+ messages in thread

* [patch 12/18] oom: extract panic helper function
  2010-06-06 22:33 [patch 00/18] oom killer rewrite David Rientjes
                   ` (10 preceding siblings ...)
  2010-06-06 22:34 ` [patch 11/18] oom: avoid oom killer for lowmem allocations David Rientjes
@ 2010-06-06 22:34 ` David Rientjes
  2010-06-08 11:42   ` KOSAKI Motohiro
  2010-06-06 22:34 ` [patch 13/18] oom: remove special handling for pagefault ooms David Rientjes
                   ` (5 subsequent siblings)
  17 siblings, 1 reply; 104+ messages in thread
From: David Rientjes @ 2010-06-06 22:34 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Rik van Riel, Nick Piggin, Oleg Nesterov, Balbir Singh,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, linux-mm

There are various points in the oom killer where the kernel must
determine whether to panic or not.  It's better to extract this into a
helper function to remove any confusion about its semantics.

Also fix a call to dump_header() where tasklist_lock is not read-
locked, as required.

There's no functional change with this patch.

Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: David Rientjes <rientjes@google.com>
---
 include/linux/oom.h |    1 +
 mm/oom_kill.c       |   53 +++++++++++++++++++++++++++-----------------------
 2 files changed, 30 insertions(+), 24 deletions(-)

diff --git a/include/linux/oom.h b/include/linux/oom.h
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -22,6 +22,7 @@ enum oom_constraint {
 	CONSTRAINT_NONE,
 	CONSTRAINT_CPUSET,
 	CONSTRAINT_MEMORY_POLICY,
+	CONSTRAINT_MEMCG,
 };
 
 extern int try_set_zone_oom(struct zonelist *zonelist, gfp_t gfp_flags);
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -505,17 +505,40 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
 	return oom_kill_task(victim);
 }
 
+/*
+ * Determines whether the kernel must panic because of the panic_on_oom sysctl.
+ */
+static void check_panic_on_oom(enum oom_constraint constraint, gfp_t gfp_mask,
+				int order)
+{
+	if (likely(!sysctl_panic_on_oom))
+		return;
+	if (sysctl_panic_on_oom != 2) {
+		/*
+		 * panic_on_oom == 1 only affects CONSTRAINT_NONE, the kernel
+		 * does not panic for cpuset, mempolicy, or memcg allocation
+		 * failures.
+		 */
+		if (constraint != CONSTRAINT_NONE)
+			return;
+	}
+	read_lock(&tasklist_lock);
+	dump_header(NULL, gfp_mask, order, NULL);
+	read_unlock(&tasklist_lock);
+	panic("Out of memory: %s panic_on_oom is enabled\n",
+		sysctl_panic_on_oom == 2 ? "compulsory" : "system-wide");
+}
+
 #ifdef CONFIG_CGROUP_MEM_RES_CTLR
 void mem_cgroup_out_of_memory(struct mem_cgroup *mem, gfp_t gfp_mask)
 {
 	unsigned long points = 0;
 	struct task_struct *p;
 
-	if (sysctl_panic_on_oom == 2)
-		panic("out of memory(memcg). panic_on_oom is selected.\n");
+	check_panic_on_oom(CONSTRAINT_MEMCG, gfp_mask, 0);
 	read_lock(&tasklist_lock);
 retry:
-	p = select_bad_process(&points, mem, CONSTRAINT_NONE, NULL);
+	p = select_bad_process(&points, mem, CONSTRAINT_MEMCG, NULL);
 	if (!p || PTR_ERR(p) == -1UL)
 		goto out;
 
@@ -616,8 +639,8 @@ retry:
 
 	/* Found nothing?!?! Either we hang forever, or we panic. */
 	if (!p) {
-		read_unlock(&tasklist_lock);
 		dump_header(NULL, gfp_mask, order, NULL);
+		read_unlock(&tasklist_lock);
 		panic("Out of memory and no killable processes...\n");
 	}
 
@@ -639,9 +662,7 @@ void pagefault_out_of_memory(void)
 		/* Got some memory back in the last second. */
 		return;
 
-	if (sysctl_panic_on_oom)
-		panic("out of memory from page fault. panic_on_oom is selected.\n");
-
+	check_panic_on_oom(CONSTRAINT_NONE, 0, 0);
 	read_lock(&tasklist_lock);
 	/* unknown gfp_mask and order */
 	__out_of_memory(0, 0, CONSTRAINT_NONE, NULL);
@@ -688,29 +709,13 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
 		return;
 	}
 
-	if (sysctl_panic_on_oom == 2) {
-		dump_header(NULL, gfp_mask, order, NULL);
-		panic("out of memory. Compulsory panic_on_oom is selected.\n");
-	}
-
 	/*
 	 * Check if there were limitations on the allocation (only relevant for
 	 * NUMA) that may require different handling.
 	 */
 	constraint = constrained_alloc(zonelist, gfp_mask, nodemask);
+	check_panic_on_oom(constraint, gfp_mask, order);
 	read_lock(&tasklist_lock);
-	if (unlikely(sysctl_panic_on_oom)) {
-		/*
-		 * panic_on_oom only affects CONSTRAINT_NONE, the kernel
-		 * should not panic for cpuset or mempolicy induced memory
-		 * failures.
-		 */
-		if (constraint == CONSTRAINT_NONE) {
-			dump_header(NULL, gfp_mask, order, NULL);
-			read_unlock(&tasklist_lock);
-			panic("Out of memory: panic_on_oom is enabled\n");
-		}
-	}
 	__out_of_memory(gfp_mask, order, constraint, nodemask);
 	read_unlock(&tasklist_lock);
 


^ permalink raw reply	[flat|nested] 104+ messages in thread

* [patch 13/18] oom: remove special handling for pagefault ooms
  2010-06-06 22:33 [patch 00/18] oom killer rewrite David Rientjes
                   ` (11 preceding siblings ...)
  2010-06-06 22:34 ` [patch 12/18] oom: extract panic helper function David Rientjes
@ 2010-06-06 22:34 ` David Rientjes
  2010-06-08 11:42   ` KOSAKI Motohiro
  2010-06-08 21:27   ` Andrew Morton
  2010-06-06 22:34 ` [patch 14/18] oom: move sysctl declarations to oom.h David Rientjes
                   ` (4 subsequent siblings)
  17 siblings, 2 replies; 104+ messages in thread
From: David Rientjes @ 2010-06-06 22:34 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Rik van Riel, Nick Piggin, Oleg Nesterov, Balbir Singh,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, linux-mm

It is possible to remove the special pagefault oom handler by simply oom
locking all system zones and then calling directly into out_of_memory().

All populated zones must have ZONE_OOM_LOCKED set, otherwise there is a
parallel oom killing in progress that will lead to eventual memory freeing
so it's not necessary to needlessly kill another task.  The context in
which the pagefault is allocating memory is unknown to the oom killer, so
this is done on a system-wide level.

If a task has already been oom killed and hasn't fully exited yet, this
will be a no-op since select_bad_process() recognizes tasks across the
system with TIF_MEMDIE set.

Acked-by: Nick Piggin <npiggin@suse.de>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: David Rientjes <rientjes@google.com>
---
 mm/oom_kill.c |   86 +++++++++++++++++++++++++++++++++++++-------------------
 1 files changed, 57 insertions(+), 29 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -615,6 +615,44 @@ void clear_zonelist_oom(struct zonelist *zonelist, gfp_t gfp_mask)
 }
 
 /*
+ * Try to acquire the oom killer lock for all system zones.  Returns zero if a
+ * parallel oom killing is taking place, otherwise locks all zones and returns
+ * non-zero.
+ */
+static int try_set_system_oom(void)
+{
+	struct zone *zone;
+	int ret = 1;
+
+	spin_lock(&zone_scan_lock);
+	for_each_populated_zone(zone)
+		if (zone_is_oom_locked(zone)) {
+			ret = 0;
+			goto out;
+		}
+	for_each_populated_zone(zone)
+		zone_set_flag(zone, ZONE_OOM_LOCKED);
+out:
+	spin_unlock(&zone_scan_lock);
+	return ret;
+}
+
+/*
+ * Clears ZONE_OOM_LOCKED for all system zones so that failed allocation
+ * attempts or page faults may now recall the oom killer, if necessary.
+ */
+static void clear_system_oom(void)
+{
+	struct zone *zone;
+
+	spin_lock(&zone_scan_lock);
+	for_each_populated_zone(zone)
+		zone_clear_flag(zone, ZONE_OOM_LOCKED);
+	spin_unlock(&zone_scan_lock);
+}
+
+
+/*
  * Must be called with tasklist_lock held for read.
  */
 static void __out_of_memory(gfp_t gfp_mask, int order,
@@ -649,33 +687,6 @@ retry:
 		goto retry;
 }
 
-/*
- * pagefault handler calls into here because it is out of memory but
- * doesn't know exactly how or why.
- */
-void pagefault_out_of_memory(void)
-{
-	unsigned long freed = 0;
-
-	blocking_notifier_call_chain(&oom_notify_list, 0, &freed);
-	if (freed > 0)
-		/* Got some memory back in the last second. */
-		return;
-
-	check_panic_on_oom(CONSTRAINT_NONE, 0, 0);
-	read_lock(&tasklist_lock);
-	/* unknown gfp_mask and order */
-	__out_of_memory(0, 0, CONSTRAINT_NONE, NULL);
-	read_unlock(&tasklist_lock);
-
-	/*
-	 * Give "p" a good chance of killing itself before we
-	 * retry to allocate memory.
-	 */
-	if (!test_thread_flag(TIF_MEMDIE))
-		schedule_timeout_uninterruptible(1);
-}
-
 /**
  * out_of_memory - kill the "best" process when we run out of memory
  * @zonelist: zonelist pointer
@@ -692,7 +703,7 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
 		int order, nodemask_t *nodemask)
 {
 	unsigned long freed = 0;
-	enum oom_constraint constraint;
+	enum oom_constraint constraint = CONSTRAINT_NONE;
 
 	blocking_notifier_call_chain(&oom_notify_list, 0, &freed);
 	if (freed > 0)
@@ -713,7 +724,8 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
 	 * Check if there were limitations on the allocation (only relevant for
 	 * NUMA) that may require different handling.
 	 */
-	constraint = constrained_alloc(zonelist, gfp_mask, nodemask);
+	if (zonelist)
+		constraint = constrained_alloc(zonelist, gfp_mask, nodemask);
 	check_panic_on_oom(constraint, gfp_mask, order);
 	read_lock(&tasklist_lock);
 	__out_of_memory(gfp_mask, order, constraint, nodemask);
@@ -726,3 +738,19 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
 	if (!test_thread_flag(TIF_MEMDIE))
 		schedule_timeout_uninterruptible(1);
 }
+
+/*
+ * The pagefault handler calls here because it is out of memory, so kill a
+ * memory-hogging task.  If a populated zone has ZONE_OOM_LOCKED set, a parallel
+ * oom killing is already in progress so do nothing.  If a task is found with
+ * TIF_MEMDIE set, it has been killed so do nothing and allow it to exit.
+ */
+void pagefault_out_of_memory(void)
+{
+	if (try_set_system_oom()) {
+		out_of_memory(NULL, 0, 0, NULL);
+		clear_system_oom();
+	}
+	if (!test_thread_flag(TIF_MEMDIE))
+		schedule_timeout_uninterruptible(1);
+}


^ permalink raw reply	[flat|nested] 104+ messages in thread

* [patch 14/18] oom: move sysctl declarations to oom.h
  2010-06-06 22:33 [patch 00/18] oom killer rewrite David Rientjes
                   ` (12 preceding siblings ...)
  2010-06-06 22:34 ` [patch 13/18] oom: remove special handling for pagefault ooms David Rientjes
@ 2010-06-06 22:34 ` David Rientjes
  2010-06-08 11:42   ` KOSAKI Motohiro
  2010-06-06 22:34 ` [patch 15/18] oom: remove unnecessary code and cleanup David Rientjes
                   ` (3 subsequent siblings)
  17 siblings, 1 reply; 104+ messages in thread
From: David Rientjes @ 2010-06-06 22:34 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Rik van Riel, Nick Piggin, Oleg Nesterov, Balbir Singh,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, linux-mm

The three oom killer sysctl variables (sysctl_oom_dump_tasks,
sysctl_oom_kill_allocating_task, and sysctl_panic_on_oom) are better
declared in include/linux/oom.h rather than kernel/sysctl.c.

Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: David Rientjes <rientjes@google.com>
---
 include/linux/oom.h |    5 +++++
 kernel/sysctl.c     |    4 +---
 2 files changed, 6 insertions(+), 3 deletions(-)

diff --git a/include/linux/oom.h b/include/linux/oom.h
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -44,5 +44,10 @@ static inline void oom_killer_enable(void)
 {
 	oom_killer_disabled = false;
 }
+
+/* sysctls */
+extern int sysctl_oom_dump_tasks;
+extern int sysctl_oom_kill_allocating_task;
+extern int sysctl_panic_on_oom;
 #endif /* __KERNEL__*/
 #endif /* _INCLUDE_LINUX_OOM_H */
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -55,6 +55,7 @@
 #include <linux/perf_event.h>
 #include <linux/kprobes.h>
 #include <linux/pipe_fs_i.h>
+#include <linux/oom.h>
 
 #include <asm/uaccess.h>
 #include <asm/processor.h>
@@ -87,9 +88,6 @@
 /* External variables not in a header file. */
 extern int sysctl_overcommit_memory;
 extern int sysctl_overcommit_ratio;
-extern int sysctl_panic_on_oom;
-extern int sysctl_oom_kill_allocating_task;
-extern int sysctl_oom_dump_tasks;
 extern int max_threads;
 extern int core_uses_pid;
 extern int suid_dumpable;


^ permalink raw reply	[flat|nested] 104+ messages in thread

* [patch 15/18] oom: remove unnecessary code and cleanup
  2010-06-06 22:33 [patch 00/18] oom killer rewrite David Rientjes
                   ` (13 preceding siblings ...)
  2010-06-06 22:34 ` [patch 14/18] oom: move sysctl declarations to oom.h David Rientjes
@ 2010-06-06 22:34 ` David Rientjes
  2010-06-06 22:34 ` [patch 16/18] oom: badness heuristic rewrite David Rientjes
                   ` (2 subsequent siblings)
  17 siblings, 0 replies; 104+ messages in thread
From: David Rientjes @ 2010-06-06 22:34 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Rik van Riel, Nick Piggin, Oleg Nesterov, Balbir Singh,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, linux-mm

Remove the redundancy in __oom_kill_task() since:

 - init can never be passed to this function: it will never be PF_EXITING
   or selectable from select_bad_process(), and

 - it will never be passed a task from oom_kill_task() without an ->mm,
   and since we're unconcerned about detachment from exiting tasks,
   there's no reason to protect them against SIGKILL or access to memory
   reserves.

Also moves the kernel log message to a higher level since the verbosity is
not always emitted here; we need not print an error message if an exiting
task is given a longer timeslice.

__oom_kill_task() only has a single caller, so it can be merged into that
function at the same time.

Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: David Rientjes <rientjes@google.com>
---
 mm/oom_kill.c |   56 ++++++++++----------------------------------------------
 1 files changed, 10 insertions(+), 46 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -401,61 +401,25 @@ static void dump_header(struct task_struct *p, gfp_t gfp_mask, int order,
 }
 
 #define K(x) ((x) << (PAGE_SHIFT-10))
-
-/*
- * Send SIGKILL to the selected  process irrespective of  CAP_SYS_RAW_IO
- * flag though it's unlikely that  we select a process with CAP_SYS_RAW_IO
- * set.
- */
-static void __oom_kill_task(struct task_struct *p, int verbose)
+static int oom_kill_task(struct task_struct *p)
 {
-	if (is_global_init(p)) {
-		WARN_ON(1);
-		printk(KERN_WARNING "tried to kill init!\n");
-		return;
-	}
-
 	p = find_lock_task_mm(p);
-	if (!p)
-		return;
-
-	if (verbose)
-		printk(KERN_ERR "Killed process %d (%s) "
-		       "vsz:%lukB, anon-rss:%lukB, file-rss:%lukB\n",
-		       task_pid_nr(p), p->comm,
-		       K(p->mm->total_vm),
-		       K(get_mm_counter(p->mm, MM_ANONPAGES)),
-		       K(get_mm_counter(p->mm, MM_FILEPAGES)));
+	if (!p)
+		/* find_lock_task_mm() failed: nothing to unlock */
+		return 1;
+	if (p->signal->oom_adj == OOM_DISABLE) {
+		task_unlock(p);
+		return 1;
+	}
+	pr_err("Killed process %d (%s) total-vm:%lukB, anon-rss:%lukB, file-rss:%lukB\n",
+		task_pid_nr(p), p->comm, K(p->mm->total_vm),
+		K(get_mm_counter(p->mm, MM_ANONPAGES)),
+		K(get_mm_counter(p->mm, MM_FILEPAGES)));
 	task_unlock(p);
 
-	/*
-	 * We give our sacrificial lamb high priority and access to
-	 * all the memory it needs. That way it should be able to
-	 * exit() and clear out its resources quickly...
-	 */
 	p->rt.time_slice = HZ;
 	set_tsk_thread_flag(p, TIF_MEMDIE);
-
 	force_sig(SIGKILL, p);
-}
-
-static int oom_kill_task(struct task_struct *p)
-{
-	/* WARNING: mm may not be dereferenced since we did not obtain its
-	 * value from get_task_mm(p).  This is OK since all we need to do is
-	 * compare mm to q->mm below.
-	 *
-	 * Furthermore, even if mm contains a non-NULL value, p->mm may
-	 * change to NULL at any time since we do not hold task_lock(p).
-	 * However, this is of no concern to us.
-	 */
-	if (!p->mm || p->signal->oom_adj == OOM_DISABLE)
-		return 1;
-
-	__oom_kill_task(p, 1);
-
 	return 0;
 }
+#undef K
 
 static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
 			    unsigned long points, struct mem_cgroup *mem,


^ permalink raw reply	[flat|nested] 104+ messages in thread

* [patch 16/18] oom: badness heuristic rewrite
  2010-06-06 22:33 [patch 00/18] oom killer rewrite David Rientjes
                   ` (14 preceding siblings ...)
  2010-06-06 22:34 ` [patch 15/18] oom: remove unnecessary code and cleanup David Rientjes
@ 2010-06-06 22:34 ` David Rientjes
  2010-06-08 11:41   ` KOSAKI Motohiro
  2010-06-08 22:58   ` Andrew Morton
  2010-06-06 22:34 ` [patch 17/18] oom: add forkbomb penalty to badness heuristic David Rientjes
  2010-06-06 22:35 ` [patch 18/18] oom: deprecate oom_adj tunable David Rientjes
  17 siblings, 2 replies; 104+ messages in thread
From: David Rientjes @ 2010-06-06 22:34 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Rik van Riel, Nick Piggin, Oleg Nesterov, Balbir Singh,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, linux-mm

This is a complete rewrite of the oom killer's badness() heuristic, which is
used to determine which task to kill in oom conditions.  The goal is to
make it as simple and predictable as possible so the results are better
understood and we end up killing the task which will lead to the most
memory freeing while still respecting the fine-tuning from userspace.

Instead of basing the heuristic on mm->total_vm for each task, the task's
rss and swap space is used instead.  This is a better indication of the
amount of memory that will be freeable if the oom killed task is chosen
and subsequently exits.  This helps specifically in cases where KDE or
GNOME is chosen for oom kill on desktop systems instead of a memory
hogging task.

The baseline for the heuristic is a proportion of memory that each task is
currently using in memory plus swap compared to the amount of "allowable"
memory.  "Allowable," in this sense, means the system-wide resources for
unconstrained oom conditions, the set of mempolicy nodes, the mems
attached to current's cpuset, or a memory controller's limit.  The
proportion is given on a scale of 0 (never kill) to 1000 (always kill),
roughly meaning that a task with a badness() score of 500 consumes
approximately 50% of allowable memory resident in RAM or in swap space.

The proportion is always relative to the amount of "allowable" memory and
not the total amount of RAM systemwide so that mempolicies and cpusets may
operate in isolation; they need not know the true size of the
machine on which they are running if they are bound to a specific set of
nodes or mems, respectively.

Root tasks are given 3% extra memory just like __vm_enough_memory()
provides in LSMs.  In the event of two tasks consuming similar amounts of
memory, it is generally better to save root's task.

Because of the change in the badness() heuristic's baseline, it is also
necessary to introduce a new user interface to tune it.  It's not possible
to redefine the meaning of /proc/pid/oom_adj with a new scale since the
ABI cannot be changed for backward compatibility.  Instead, a new tunable,
/proc/pid/oom_score_adj, is added that ranges from -1000 to +1000.  It may
be used to polarize the heuristic such that certain tasks are never
considered for oom kill while others may always be considered.  The value
is added directly into the badness() score so a value of -500, for
example, means to discount 50% of its memory consumption in comparison to
other tasks either on the system, bound to the mempolicy, in the cpuset,
or sharing the same memory controller.

/proc/pid/oom_adj is changed so that its meaning is rescaled into the
units used by /proc/pid/oom_score_adj, and vice versa.  Changing one of
these per-task tunables will rescale the value of the other to an
equivalent meaning.  Although /proc/pid/oom_adj was originally defined as
a bitshift on the badness score, it now shares the same linear growth as
/proc/pid/oom_score_adj but with different granularity.  This is required
so the ABI is not broken with userspace applications and allows oom_adj to
be deprecated for future removal.

Signed-off-by: David Rientjes <rientjes@google.com>
---
 Documentation/filesystems/proc.txt |   94 ++++++++-----
 fs/proc/base.c                     |   99 ++++++++++++-
 include/linux/memcontrol.h         |    8 +
 include/linux/oom.h                |   14 ++-
 include/linux/sched.h              |    3 +-
 kernel/fork.c                      |    1 +
 mm/memcontrol.c                    |   18 +++
 mm/oom_kill.c                      |  279 ++++++++++++++++--------------------
 8 files changed, 316 insertions(+), 200 deletions(-)

diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -33,7 +33,8 @@ Table of Contents
   2	Modifying System Parameters
 
   3	Per-Process Parameters
-  3.1	/proc/<pid>/oom_adj - Adjust the oom-killer score
+  3.1	/proc/<pid>/oom_adj & /proc/<pid>/oom_score_adj - Adjust the oom-killer
+								score
   3.2	/proc/<pid>/oom_score - Display current oom-killer score
   3.3	/proc/<pid>/io - Display the IO accounting fields
   3.4	/proc/<pid>/coredump_filter - Core dump filtering settings
@@ -1234,42 +1235,61 @@ of the kernel.
 CHAPTER 3: PER-PROCESS PARAMETERS
 ------------------------------------------------------------------------------
 
-3.1 /proc/<pid>/oom_adj - Adjust the oom-killer score
-------------------------------------------------------
-
-This file can be used to adjust the score used to select which processes
-should be killed in an  out-of-memory  situation.  Giving it a high score will
-increase the likelihood of this process being killed by the oom-killer.  Valid
-values are in the range -16 to +15, plus the special value -17, which disables
-oom-killing altogether for this process.
-
-The process to be killed in an out-of-memory situation is selected among all others
-based on its badness score. This value equals the original memory size of the process
-and is then updated according to its CPU time (utime + stime) and the
-run time (uptime - start time). The longer it runs the smaller is the score.
-Badness score is divided by the square root of the CPU time and then by
-the double square root of the run time.
-
-Swapped out tasks are killed first. Half of each child's memory size is added to
-the parent's score if they do not share the same memory. Thus forking servers
-are the prime candidates to be killed. Having only one 'hungry' child will make
-parent less preferable than the child.
-
-/proc/<pid>/oom_score shows process' current badness score.
-
-The following heuristics are then applied:
- * if the task was reniced, its score doubles
- * superuser or direct hardware access tasks (CAP_SYS_ADMIN, CAP_SYS_RESOURCE
- 	or CAP_SYS_RAWIO) have their score divided by 4
- * if oom condition happened in one cpuset and checked process does not belong
- 	to it, its score is divided by 8
- * the resulting score is multiplied by two to the power of oom_adj, i.e.
-	points <<= oom_adj when it is positive and
-	points >>= -(oom_adj) otherwise
-
-The task with the highest badness score is then selected and its children
-are killed, process itself will be killed in an OOM situation when it does
-not have children or some of them disabled oom like described above.
+3.1 /proc/<pid>/oom_adj & /proc/<pid>/oom_score_adj - Adjust the oom-killer score
+---------------------------------------------------------------------------------
+
+These files can be used to adjust the badness heuristic used to select which
+process gets killed in out of memory conditions.
+
+The badness heuristic assigns a value to each candidate task ranging from 0
+(never kill) to 1000 (always kill) to determine which process is targeted.  The
+units are roughly a proportion along that range of allowed memory the process
+may allocate from based on an estimation of its current memory and swap use.
+For example, if a task is using all allowed memory, its badness score will be
+1000.  If it is using half of its allowed memory, its score will be 500.
+
+There is an additional factor included in the badness score: root
+processes are given 3% extra memory over other tasks.
+
+The amount of "allowed" memory depends on the context in which the oom killer
+was called.  If it is due to the memory assigned to the allocating task's cpuset
+being exhausted, the allowed memory represents the set of mems assigned to that
+cpuset.  If it is due to a mempolicy's node(s) being exhausted, the allowed
+memory represents the set of mempolicy nodes.  If it is due to a memory
+limit (or swap limit) being reached, the allowed memory is that configured
+limit.  Finally, if it is due to the entire system being out of memory, the
+allowed memory represents all allocatable resources.
+
+The value of /proc/<pid>/oom_score_adj is added to the badness score before it
+is used to determine which task to kill.  Acceptable values range from -1000
+(OOM_SCORE_ADJ_MIN) to +1000 (OOM_SCORE_ADJ_MAX).  This allows userspace to
+polarize the preference for oom killing either by always preferring a certain
+task or completely disabling it.  The lowest possible value, -1000, is
+equivalent to disabling oom killing entirely for that task since it will always
+report a badness score of 0.
+
+Consequently, it is very simple for userspace to define the amount of memory to
+consider for each task.  Setting a /proc/<pid>/oom_score_adj value of +500, for
+example, is roughly equivalent to allowing the remainder of tasks sharing the
+same system, cpuset, mempolicy, or memory controller resources to use at least
+50% more memory.  A value of -500, on the other hand, would be roughly
+equivalent to discounting 50% of the task's allowed memory from being considered
+as scoring against the task.
+
+For backwards compatibility with previous kernels, /proc/<pid>/oom_adj may also
+be used to tune the badness score.  Its acceptable values range from -16
+(OOM_ADJUST_MIN) to +15 (OOM_ADJUST_MAX) and a special value of -17
+(OOM_DISABLE) to disable oom killing entirely for that task.  Its value is
+scaled linearly with /proc/<pid>/oom_score_adj.
+
+Writing to /proc/<pid>/oom_score_adj or /proc/<pid>/oom_adj will change the
+other with its scaled value.
+
+Caveat: when a parent task is selected, the oom killer will sacrifice any first
+generation children with separate address spaces instead, if possible.  This
+prevents servers and important system daemons from being killed and minimizes
+the amount of work lost.
+
 
 3.2 /proc/<pid>/oom_score - Display current oom-killer score
 -------------------------------------------------------------
diff --git a/fs/proc/base.c b/fs/proc/base.c
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -63,6 +63,7 @@
 #include <linux/namei.h>
 #include <linux/mnt_namespace.h>
 #include <linux/mm.h>
+#include <linux/swap.h>
 #include <linux/rcupdate.h>
 #include <linux/kallsyms.h>
 #include <linux/stacktrace.h>
@@ -428,16 +429,18 @@ static const struct file_operations proc_lstats_operations = {
 #endif
 
 /* The badness from the OOM killer */
-unsigned long badness(struct task_struct *p, unsigned long uptime);
 static int proc_oom_score(struct task_struct *task, char *buffer)
 {
 	unsigned long points = 0;
-	struct timespec uptime;
 
-	do_posix_clock_monotonic_gettime(&uptime);
 	read_lock(&tasklist_lock);
 	if (pid_alive(task))
-		points = badness(task, uptime.tv_sec);
+		points = oom_badness(task->group_leader,
+					global_page_state(NR_INACTIVE_ANON) +
+					global_page_state(NR_ACTIVE_ANON) +
+					global_page_state(NR_INACTIVE_FILE) +
+					global_page_state(NR_ACTIVE_FILE) +
+					total_swap_pages);
 	read_unlock(&tasklist_lock);
 	return sprintf(buffer, "%lu\n", points);
 }
@@ -1042,7 +1045,15 @@ static ssize_t oom_adjust_write(struct file *file, const char __user *buf,
 	}
 
 	task->signal->oom_adj = oom_adjust;
-
+	/*
+	 * Scale /proc/pid/oom_score_adj appropriately ensuring that a maximum
+	 * value is always attainable.
+	 */
+	if (task->signal->oom_adj == OOM_ADJUST_MAX)
+		task->signal->oom_score_adj = OOM_SCORE_ADJ_MAX;
+	else
+		task->signal->oom_score_adj = (oom_adjust * OOM_SCORE_ADJ_MAX) /
+								-OOM_DISABLE;
 	unlock_task_sighand(task, &flags);
 	put_task_struct(task);
 
@@ -1055,6 +1066,82 @@ static const struct file_operations proc_oom_adjust_operations = {
 	.llseek		= generic_file_llseek,
 };
 
+static ssize_t oom_score_adj_read(struct file *file, char __user *buf,
+					size_t count, loff_t *ppos)
+{
+	struct task_struct *task = get_proc_task(file->f_path.dentry->d_inode);
+	char buffer[PROC_NUMBUF];
+	int oom_score_adj = OOM_SCORE_ADJ_MIN;
+	unsigned long flags;
+	size_t len;
+
+	if (!task)
+		return -ESRCH;
+	if (lock_task_sighand(task, &flags)) {
+		oom_score_adj = task->signal->oom_score_adj;
+		unlock_task_sighand(task, &flags);
+	}
+	put_task_struct(task);
+	len = snprintf(buffer, sizeof(buffer), "%d\n", oom_score_adj);
+	return simple_read_from_buffer(buf, count, ppos, buffer, len);
+}
+
+static ssize_t oom_score_adj_write(struct file *file, const char __user *buf,
+					size_t count, loff_t *ppos)
+{
+	struct task_struct *task;
+	char buffer[PROC_NUMBUF];
+	unsigned long flags;
+	long oom_score_adj;
+	int err;
+
+	memset(buffer, 0, sizeof(buffer));
+	if (count > sizeof(buffer) - 1)
+		count = sizeof(buffer) - 1;
+	if (copy_from_user(buffer, buf, count))
+		return -EFAULT;
+
+	err = strict_strtol(strstrip(buffer), 0, &oom_score_adj);
+	if (err)
+		return -EINVAL;
+	if (oom_score_adj < OOM_SCORE_ADJ_MIN ||
+			oom_score_adj > OOM_SCORE_ADJ_MAX)
+		return -EINVAL;
+
+	task = get_proc_task(file->f_path.dentry->d_inode);
+	if (!task)
+		return -ESRCH;
+	if (!lock_task_sighand(task, &flags)) {
+		put_task_struct(task);
+		return -ESRCH;
+	}
+	if (oom_score_adj < task->signal->oom_score_adj &&
+			!capable(CAP_SYS_RESOURCE)) {
+		unlock_task_sighand(task, &flags);
+		put_task_struct(task);
+		return -EACCES;
+	}
+
+	task->signal->oom_score_adj = oom_score_adj;
+	/*
+	 * Scale /proc/pid/oom_adj appropriately ensuring that OOM_DISABLE is
+	 * always attainable.
+	 */
+	if (task->signal->oom_score_adj == OOM_SCORE_ADJ_MIN)
+		task->signal->oom_adj = OOM_DISABLE;
+	else
+		task->signal->oom_adj = (oom_score_adj * OOM_ADJUST_MAX) /
+							OOM_SCORE_ADJ_MAX;
+	unlock_task_sighand(task, &flags);
+	put_task_struct(task);
+	return count;
+}
+
+static const struct file_operations proc_oom_score_adj_operations = {
+	.read		= oom_score_adj_read,
+	.write		= oom_score_adj_write,
+};
+
 #ifdef CONFIG_AUDITSYSCALL
 #define TMPBUFLEN 21
 static ssize_t proc_loginuid_read(struct file * file, char __user * buf,
@@ -2627,6 +2714,7 @@ static const struct pid_entry tgid_base_stuff[] = {
 #endif
 	INF("oom_score",  S_IRUGO, proc_oom_score),
 	REG("oom_adj",    S_IRUGO|S_IWUSR, proc_oom_adjust_operations),
+	REG("oom_score_adj", S_IRUGO|S_IWUSR, proc_oom_score_adj_operations),
 #ifdef CONFIG_AUDITSYSCALL
 	REG("loginuid",   S_IWUSR|S_IRUGO, proc_loginuid_operations),
 	REG("sessionid",  S_IRUGO, proc_sessionid_operations),
@@ -2961,6 +3049,7 @@ static const struct pid_entry tid_base_stuff[] = {
 #endif
 	INF("oom_score", S_IRUGO, proc_oom_score),
 	REG("oom_adj",   S_IRUGO|S_IWUSR, proc_oom_adjust_operations),
+	REG("oom_score_adj", S_IRUGO|S_IWUSR, proc_oom_score_adj_operations),
 #ifdef CONFIG_AUDITSYSCALL
 	REG("loginuid",  S_IWUSR|S_IRUGO, proc_loginuid_operations),
 	REG("sessionid",  S_IRUSR, proc_sessionid_operations),
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -130,6 +130,8 @@ void mem_cgroup_update_file_mapped(struct page *page, int val);
 unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
 						gfp_t gfp_mask, int nid,
 						int zid);
+u64 mem_cgroup_get_limit(struct mem_cgroup *mem);
+
 #else /* CONFIG_CGROUP_MEM_RES_CTLR */
 struct mem_cgroup;
 
@@ -309,6 +311,12 @@ unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
 	return 0;
 }
 
+static inline
+u64 mem_cgroup_get_limit(struct mem_cgroup *mem)
+{
+	return 0;
+}
+
 #endif /* CONFIG_CGROUP_MEM_CONT */
 
 #endif /* _LINUX_MEMCONTROL_H */
diff --git a/include/linux/oom.h b/include/linux/oom.h
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -1,14 +1,24 @@
 #ifndef __INCLUDE_LINUX_OOM_H
 #define __INCLUDE_LINUX_OOM_H
 
-/* /proc/<pid>/oom_adj set to -17 protects from the oom-killer */
+/*
+ * /proc/<pid>/oom_adj set to -17 protects from the oom-killer
+ */
 #define OOM_DISABLE (-17)
 /* inclusive */
 #define OOM_ADJUST_MIN (-16)
 #define OOM_ADJUST_MAX 15
 
+/*
+ * /proc/<pid>/oom_score_adj set to OOM_SCORE_ADJ_MIN disables oom killing for
+ * pid.
+ */
+#define OOM_SCORE_ADJ_MIN	(-1000)
+#define OOM_SCORE_ADJ_MAX	1000
+
 #ifdef __KERNEL__
 
+#include <linux/sched.h>
 #include <linux/types.h>
 #include <linux/nodemask.h>
 
@@ -25,6 +35,8 @@ enum oom_constraint {
 	CONSTRAINT_MEMCG,
 };
 
+extern unsigned int oom_badness(struct task_struct *p,
+					unsigned long totalpages);
 extern int try_set_zone_oom(struct zonelist *zonelist, gfp_t gfp_flags);
 extern void clear_zonelist_oom(struct zonelist *zonelist, gfp_t gfp_flags);
 
diff --git a/include/linux/sched.h b/include/linux/sched.h
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -629,7 +629,8 @@ struct signal_struct {
 	struct tty_audit_buf *tty_audit_buf;
 #endif
 
-	int oom_adj;	/* OOM kill score adjustment (bit shift) */
+	int oom_adj;		/* OOM kill score adjustment (bit shift) */
+	int oom_score_adj;	/* OOM kill score adjustment */
 };
 
 /* Context switch must be unlocked if interrupts are to be enabled */
diff --git a/kernel/fork.c b/kernel/fork.c
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -899,6 +899,7 @@ static int copy_signal(unsigned long clone_flags, struct task_struct *tsk)
 	tty_audit_fork(sig);
 
 	sig->oom_adj = current->signal->oom_adj;
+	sig->oom_score_adj = current->signal->oom_score_adj;
 
 	return 0;
 }
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1158,6 +1158,24 @@ static int mem_cgroup_count_children(struct mem_cgroup *mem)
 }
 
 /*
+ * Return the memory (and swap, if configured) limit for a memcg.
+ */
+u64 mem_cgroup_get_limit(struct mem_cgroup *memcg)
+{
+	u64 limit;
+	u64 memsw;
+
+	limit = res_counter_read_u64(&memcg->res, RES_LIMIT) +
+			total_swap_pages;
+	memsw = res_counter_read_u64(&memcg->memsw, RES_LIMIT);
+	/*
+	 * If memsw is finite and limits the amount of swap space available
+	 * to this memcg, return that limit.
+	 */
+	return min(limit, memsw);
+}
+
+/*
  * Visit the first child (need not be the first child as per the ordering
  * of the cgroup list, since we track last_scanned_child) of @mem and use
  * that to reclaim free pages from.
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -4,6 +4,8 @@
  *  Copyright (C)  1998,2000  Rik van Riel
  *	Thanks go out to Claus Fischer for some serious inspiration and
  *	for goading me into coding this file...
+ *  Copyright (C)  2010  Google, Inc.
+ *	Rewritten by David Rientjes
  *
  *  The routines in this file are used to kill a process when
  *  we're seriously out of memory. This gets called from __alloc_pages()
@@ -34,7 +36,6 @@ int sysctl_panic_on_oom;
 int sysctl_oom_kill_allocating_task;
 int sysctl_oom_dump_tasks = 1;
 static DEFINE_SPINLOCK(zone_scan_lock);
-/* #define DEBUG */
 
 /*
  * Do all threads of the target process overlap our allowed nodes?
@@ -84,139 +85,72 @@ static struct task_struct *find_lock_task_mm(struct task_struct *p)
 }
 
 /**
- * badness - calculate a numeric value for how bad this task has been
+ * oom_badness - heuristic function to determine which candidate task to kill
  * @p: task struct of which task we should calculate
- * @uptime: current uptime in seconds
+ * @totalpages: total present RAM allowed for page allocation
  *
- * The formula used is relatively simple and documented inline in the
- * function. The main rationale is that we want to select a good task
- * to kill when we run out of memory.
- *
- * Good in this context means that:
- * 1) we lose the minimum amount of work done
- * 2) we recover a large amount of memory
- * 3) we don't kill anything innocent of eating tons of memory
- * 4) we want to kill the minimum amount of processes (one)
- * 5) we try to kill the process the user expects us to kill, this
- *    algorithm has been meticulously tuned to meet the principle
- *    of least surprise ... (be careful when you change it)
+ * The heuristic for determining which task to kill is made to be as simple and
+ * predictable as possible.  The goal is to return the highest value for the
+ * task consuming the most memory to avoid subsequent oom failures.
  */
-
-unsigned long badness(struct task_struct *p, unsigned long uptime)
+unsigned int oom_badness(struct task_struct *p, unsigned long totalpages)
 {
-	unsigned long points, cpu_time, run_time;
-	struct task_struct *child;
-	struct task_struct *c, *t;
-	int oom_adj = p->signal->oom_adj;
-	struct task_cputime task_time;
-	unsigned long utime;
-	unsigned long stime;
-
-	if (oom_adj == OOM_DISABLE)
-		return 0;
+	int points;
 
 	p = find_lock_task_mm(p);
 	if (!p)
 		return 0;
 
 	/*
-	 * The memory size of the process is the basis for the badness.
-	 */
-	points = p->mm->total_vm;
-
-	/*
-	 * After this unlock we can no longer dereference local variable `mm'
-	 */
-	task_unlock(p);
-
-	/*
-	 * swapoff can easily use up all memory, so kill those first.
+	 * Shortcut check for OOM_SCORE_ADJ_MIN so the entire heuristic doesn't
+	 * need to be executed for something that cannot be killed.
 	 */
-	if (p->flags & PF_OOM_ORIGIN)
-		return ULONG_MAX;
-
-	/*
-	 * Processes which fork a lot of child processes are likely
-	 * a good choice. We add half the vmsize of the children if they
-	 * have an own mm. This prevents forking servers to flood the
-	 * machine with an endless amount of children. In case a single
-	 * child is eating the vast majority of memory, adding only half
-	 * to the parents will make the child our kill candidate of choice.
-	 */
-	t = p;
-	do {
-		list_for_each_entry(c, &t->children, sibling) {
-			child = find_lock_task_mm(c);
-			if (child) {
-				if (child->mm != p->mm)
-					points += child->mm->total_vm/2 + 1;
-				task_unlock(child);
-			}
-		}
-	} while_each_thread(p, t);
+	if (p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN) {
+		task_unlock(p);
+		return 0;
+	}
 
 	/*
-	 * CPU time is in tens of seconds and run time is in thousands
-         * of seconds. There is no particular reason for this other than
-         * that it turned out to work very well in practice.
+	 * When the PF_OOM_ORIGIN bit is set, it indicates the task should have
+	 * priority for oom killing.
 	 */
-	thread_group_cputime(p, &task_time);
-	utime = cputime_to_jiffies(task_time.utime);
-	stime = cputime_to_jiffies(task_time.stime);
-	cpu_time = (utime + stime) >> (SHIFT_HZ + 3);
-
-
-	if (uptime >= p->start_time.tv_sec)
-		run_time = (uptime - p->start_time.tv_sec) >> 10;
-	else
-		run_time = 0;
-
-	if (cpu_time)
-		points /= int_sqrt(cpu_time);
-	if (run_time)
-		points /= int_sqrt(int_sqrt(run_time));
+	if (p->flags & PF_OOM_ORIGIN) {
+		task_unlock(p);
+		return 1000;
+	}
 
 	/*
-	 * Niced processes are most likely less important, so double
-	 * their badness points.
+	 * The memory controller may have a limit of 0 bytes, so avoid a divide
+	 * by zero if necessary.
 	 */
-	if (task_nice(p) > 0)
-		points *= 2;
+	if (!totalpages)
+		totalpages = 1;
 
 	/*
-	 * Superuser processes are usually more important, so we make it
-	 * less likely that we kill those.
+	 * The baseline for the badness score is the proportion of RAM that each
+	 * task's rss and swap space use.
 	 */
-	if (has_capability_noaudit(p, CAP_SYS_ADMIN) ||
-	    has_capability_noaudit(p, CAP_SYS_RESOURCE))
-		points /= 4;
+	points = (get_mm_rss(p->mm) + get_mm_counter(p->mm, MM_SWAPENTS)) * 1000 /
+			totalpages;
+	task_unlock(p);
 
 	/*
-	 * We don't want to kill a process with direct hardware access.
-	 * Not only could that mess up the hardware, but usually users
-	 * tend to only have this flag set on applications they think
-	 * of as important.
+	 * Root processes get 3% bonus, just like the __vm_enough_memory()
+	 * implementation used by LSMs.
 	 */
-	if (has_capability_noaudit(p, CAP_SYS_RAWIO))
-		points /= 4;
+	if (has_capability_noaudit(p, CAP_SYS_ADMIN))
+		points -= 30;
 
 	/*
-	 * Adjust the score by oom_adj.
+	 * /proc/pid/oom_score_adj ranges from -1000 to +1000 such that it may
+	 * either completely disable oom killing or always prefer a certain
+	 * task.
 	 */
-	if (oom_adj) {
-		if (oom_adj > 0) {
-			if (!points)
-				points = 1;
-			points <<= oom_adj;
-		} else
-			points >>= -(oom_adj);
-	}
+	points += p->signal->oom_score_adj;
 
-#ifdef DEBUG
-	printk(KERN_DEBUG "OOMkill: task %d (%s) got %lu points\n",
-	p->pid, p->comm, points);
-#endif
-	return points;
+	if (points < 0)
+		return 0;
+	return (points < 1000) ? points : 1000;
 }
 
 /*
@@ -224,12 +158,24 @@ unsigned long badness(struct task_struct *p, unsigned long uptime)
  */
 #ifdef CONFIG_NUMA
 static enum oom_constraint constrained_alloc(struct zonelist *zonelist,
-				    gfp_t gfp_mask, nodemask_t *nodemask)
+				gfp_t gfp_mask, nodemask_t *nodemask,
+				unsigned long *totalpages)
 {
 	struct zone *zone;
 	struct zoneref *z;
 	enum zone_type high_zoneidx = gfp_zone(gfp_mask);
+	bool cpuset_limited = false;
+	int nid;
 
+	/* Default to all anonymous memory, page cache, and swap */
+	*totalpages = global_page_state(NR_INACTIVE_ANON) +
+			global_page_state(NR_ACTIVE_ANON) +
+			global_page_state(NR_INACTIVE_FILE) +
+			global_page_state(NR_ACTIVE_FILE) +
+			total_swap_pages;
+
+	if (!zonelist)
+		return CONSTRAINT_NONE;
 	/*
 	 * Reach here only when __GFP_NOFAIL is used. So, we should avoid
 	 * to kill current.We have to random task kill in this case.
@@ -239,26 +185,47 @@ static enum oom_constraint constrained_alloc(struct zonelist *zonelist,
 		return CONSTRAINT_NONE;
 
 	/*
-	 * The nodemask here is a nodemask passed to alloc_pages(). Now,
-	 * cpuset doesn't use this nodemask for its hardwall/softwall/hierarchy
-	 * feature. mempolicy is an only user of nodemask here.
-	 * check mempolicy's nodemask contains all N_HIGH_MEMORY
+	 * This is not a __GFP_THISNODE allocation, so a truncated nodemask in
+	 * the page allocator means a mempolicy is in effect.  Cpuset policy
+	 * is enforced in get_page_from_freelist().
 	 */
-	if (nodemask && !nodes_subset(node_states[N_HIGH_MEMORY], *nodemask))
+	if (nodemask && !nodes_subset(node_states[N_HIGH_MEMORY], *nodemask)) {
+		*totalpages = total_swap_pages;
+		for_each_node_mask(nid, *nodemask)
+			*totalpages += node_page_state(nid, NR_INACTIVE_ANON) +
+					node_page_state(nid, NR_ACTIVE_ANON) +
+					node_page_state(nid, NR_INACTIVE_FILE) +
+					node_page_state(nid, NR_ACTIVE_FILE);
 		return CONSTRAINT_MEMORY_POLICY;
+	}
 
 	/* Check this allocation failure is caused by cpuset's wall function */
 	for_each_zone_zonelist_nodemask(zone, z, zonelist,
 			high_zoneidx, nodemask)
 		if (!cpuset_zone_allowed_softwall(zone, gfp_mask))
-			return CONSTRAINT_CPUSET;
-
+			cpuset_limited = true;
+
+	if (cpuset_limited) {
+		*totalpages = total_swap_pages;
+		for_each_node_mask(nid, cpuset_current_mems_allowed)
+			*totalpages += node_page_state(nid, NR_INACTIVE_ANON) +
+					node_page_state(nid, NR_ACTIVE_ANON) +
+					node_page_state(nid, NR_INACTIVE_FILE) +
+					node_page_state(nid, NR_ACTIVE_FILE);
+		return CONSTRAINT_CPUSET;
+	}
 	return CONSTRAINT_NONE;
 }
 #else
 static enum oom_constraint constrained_alloc(struct zonelist *zonelist,
-				gfp_t gfp_mask, nodemask_t *nodemask)
+				gfp_t gfp_mask, nodemask_t *nodemask,
+				unsigned long *totalpages)
 {
+	*totalpages = global_page_state(NR_INACTIVE_ANON) +
+			global_page_state(NR_ACTIVE_ANON) +
+			global_page_state(NR_INACTIVE_FILE) +
+			global_page_state(NR_ACTIVE_FILE) +
+			total_swap_pages;
 	return CONSTRAINT_NONE;
 }
 #endif
@@ -269,18 +236,16 @@ static enum oom_constraint constrained_alloc(struct zonelist *zonelist,
  *
  * (not docbooked, we don't want this one cluttering up the manual)
  */
-static struct task_struct *select_bad_process(unsigned long *ppoints,
-		struct mem_cgroup *mem, enum oom_constraint constraint,
-		const nodemask_t *mask)
+static struct task_struct *select_bad_process(unsigned int *ppoints,
+		unsigned long totalpages, struct mem_cgroup *mem,
+		enum oom_constraint constraint, const nodemask_t *mask)
 {
 	struct task_struct *p;
 	struct task_struct *chosen = NULL;
-	struct timespec uptime;
 	*ppoints = 0;
 
-	do_posix_clock_monotonic_gettime(&uptime);
 	for_each_process(p) {
-		unsigned long points;
+		unsigned int points;
 
 		/* skip the init task and kthreads */
 		if (is_global_init(p) || (p->flags & PF_KTHREAD))
@@ -319,14 +284,11 @@ static struct task_struct *select_bad_process(unsigned long *ppoints,
 				return ERR_PTR(-1UL);
 
 			chosen = p;
-			*ppoints = ULONG_MAX;
+			*ppoints = 1000;
 		}
 
-		if (p->signal->oom_adj == OOM_DISABLE)
-			continue;
-
-		points = badness(p, uptime.tv_sec);
-		if (points > *ppoints || !chosen) {
+		points = oom_badness(p, totalpages);
+		if (points > *ppoints) {
 			chosen = p;
 			*ppoints = points;
 		}
@@ -341,7 +303,7 @@ static struct task_struct *select_bad_process(unsigned long *ppoints,
  *
  * Dumps the current memory state of all system tasks, excluding kernel threads.
  * State information includes task's pid, uid, tgid, vm size, rss, cpu, oom_adj
- * score, and name.
+ * value, oom_score_adj value, and name.
  *
  * If the actual is non-NULL, only tasks that are a member of the mem_cgroup are
  * shown.
@@ -354,7 +316,7 @@ static void dump_tasks(const struct mem_cgroup *mem)
 	struct task_struct *task;
 
 	printk(KERN_INFO "[ pid ]   uid  tgid total_vm      rss cpu oom_adj "
-	       "name\n");
+	       "oom_score_adj name\n");
 	for_each_process(p) {
 		/*
 		 * We don't have is_global_init() check here, because the old
@@ -376,10 +338,11 @@ static void dump_tasks(const struct mem_cgroup *mem)
 			continue;
 		}
 
-		printk(KERN_INFO "[%5d] %5d %5d %8lu %8lu %3d     %3d %s\n",
+		pr_info("[%5d] %5d %5d %8lu %8lu %3d     %3d          %4d %s\n",
 		       task->pid, __task_cred(task)->uid, task->tgid,
 		       task->mm->total_vm, get_mm_rss(task->mm),
-		       (int)task_cpu(task), task->signal->oom_adj, p->comm);
+		       (int)task_cpu(task), task->signal->oom_adj,
+		       task->signal->oom_score_adj, p->comm);
 		task_unlock(task);
 	}
 }
@@ -388,8 +351,9 @@ static void dump_header(struct task_struct *p, gfp_t gfp_mask, int order,
 							struct mem_cgroup *mem)
 {
 	pr_warning("%s invoked oom-killer: gfp_mask=0x%x, order=%d, "
-		"oom_adj=%d\n",
-		current->comm, gfp_mask, order, current->signal->oom_adj);
+		"oom_adj=%d, oom_score_adj=%d\n",
+		current->comm, gfp_mask, order, current->signal->oom_adj,
+		current->signal->oom_score_adj);
 	task_lock(current);
 	cpuset_print_task_mems_allowed(current);
 	task_unlock(current);
@@ -404,7 +368,7 @@ static void dump_header(struct task_struct *p, gfp_t gfp_mask, int order,
 static int oom_kill_task(struct task_struct *p)
 {
 	p = find_lock_task_mm(p);
-	if (!p || p->signal->oom_adj == OOM_DISABLE) {
+	if (!p || p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN) {
 		task_unlock(p);
 		return 1;
 	}
@@ -422,14 +386,13 @@ static int oom_kill_task(struct task_struct *p)
 #undef K
 
 static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
-			    unsigned long points, struct mem_cgroup *mem,
-			    const char *message)
+			    unsigned int points, unsigned long totalpages,
+			    struct mem_cgroup *mem, const char *message)
 {
 	struct task_struct *victim = p;
 	struct task_struct *c;
 	struct task_struct *t = p;
-	unsigned long victim_points = 0;
-	struct timespec uptime;
+	unsigned int victim_points = 0;
 
 	if (printk_ratelimit())
 		dump_header(p, gfp_mask, order, mem);
@@ -443,13 +406,12 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
 		return 0;
 	}
 
-	pr_err("%s: Kill process %d (%s) score %lu or sacrifice child\n",
+	pr_err("%s: Kill process %d (%s) score %d or sacrifice child\n",
 		message, task_pid_nr(p), p->comm, points);
 
 	/* Try to sacrifice the worst child first */
-	do_posix_clock_monotonic_gettime(&uptime);
 	do {
-		unsigned long cpoints;
+		unsigned int cpoints;
 
 		list_for_each_entry(c, &t->children, sibling) {
 			if (c->mm == p->mm)
@@ -457,8 +419,10 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
 			if (mem && !task_in_mem_cgroup(c, mem))
 				continue;
 
-			/* badness() returns 0 if the thread is unkillable */
-			cpoints = badness(c, uptime.tv_sec);
+			/*
+			 * oom_badness() returns 0 if the thread is unkillable
+			 */
+			cpoints = oom_badness(c, totalpages);
 			if (cpoints > victim_points) {
 				victim = c;
 				victim_points = cpoints;
@@ -496,17 +460,19 @@ static void check_panic_on_oom(enum oom_constraint constraint, gfp_t gfp_mask,
 #ifdef CONFIG_CGROUP_MEM_RES_CTLR
 void mem_cgroup_out_of_memory(struct mem_cgroup *mem, gfp_t gfp_mask)
 {
-	unsigned long points = 0;
+	unsigned long limit;
+	unsigned int points = 0;
 	struct task_struct *p;
 
 	check_panic_on_oom(CONSTRAINT_MEMCG, gfp_mask, 0);
+	limit = mem_cgroup_get_limit(mem) >> PAGE_SHIFT;
 	read_lock(&tasklist_lock);
 retry:
-	p = select_bad_process(&points, mem, CONSTRAINT_MEMCG, NULL);
+	p = select_bad_process(&points, limit, mem, CONSTRAINT_MEMCG, NULL);
 	if (!p || PTR_ERR(p) == -1UL)
 		goto out;
 
-	if (oom_kill_process(p, gfp_mask, 0, points, mem,
+	if (oom_kill_process(p, gfp_mask, 0, points, limit, mem,
 				"Memory cgroup out of memory"))
 		goto retry;
 out:
@@ -619,22 +585,22 @@ static void clear_system_oom(void)
 /*
  * Must be called with tasklist_lock held for read.
  */
-static void __out_of_memory(gfp_t gfp_mask, int order,
+static void __out_of_memory(gfp_t gfp_mask, int order, unsigned long totalpages,
 			enum oom_constraint constraint, const nodemask_t *mask)
 {
 	struct task_struct *p;
-	unsigned long points;
+	unsigned int points;
 
 	if (sysctl_oom_kill_allocating_task)
-		if (!oom_kill_process(current, gfp_mask, order, 0, NULL,
-				"Out of memory (oom_kill_allocating_task)"))
+		if (!oom_kill_process(current, gfp_mask, order, 0, totalpages,
+			NULL, "Out of memory (oom_kill_allocating_task)"))
 			return;
 retry:
 	/*
 	 * Rambo mode: Shoot down a process and hope it solves whatever
 	 * issues we may have.
 	 */
-	p = select_bad_process(&points, NULL, constraint, mask);
+	p = select_bad_process(&points, totalpages, NULL, constraint, mask);
 
 	if (PTR_ERR(p) == -1UL)
 		return;
@@ -646,7 +612,7 @@ retry:
 		panic("Out of memory and no killable processes...\n");
 	}
 
-	if (oom_kill_process(p, gfp_mask, order, points, NULL,
+	if (oom_kill_process(p, gfp_mask, order, points, totalpages, NULL,
 			     "Out of memory"))
 		goto retry;
 }
@@ -666,6 +632,7 @@ retry:
 void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
 		int order, nodemask_t *nodemask)
 {
+	unsigned long totalpages;
 	unsigned long freed = 0;
 	enum oom_constraint constraint = CONSTRAINT_NONE;
 
@@ -688,11 +655,11 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
 	 * Check if there were limitations on the allocation (only relevant for
 	 * NUMA) that may require different handling.
 	 */
-	if (zonelist)
-		constraint = constrained_alloc(zonelist, gfp_mask, nodemask);
+	constraint = constrained_alloc(zonelist, gfp_mask, nodemask,
+						&totalpages);
 	check_panic_on_oom(constraint, gfp_mask, order);
 	read_lock(&tasklist_lock);
-	__out_of_memory(gfp_mask, order, constraint, nodemask);
+	__out_of_memory(gfp_mask, order, totalpages, constraint, nodemask);
 	read_unlock(&tasklist_lock);
 
 	/*

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 104+ messages in thread

* [patch 17/18] oom: add forkbomb penalty to badness heuristic
  2010-06-06 22:33 [patch 00/18] oom killer rewrite David Rientjes
                   ` (15 preceding siblings ...)
  2010-06-06 22:34 ` [patch 16/18] oom: badness heuristic rewrite David Rientjes
@ 2010-06-06 22:34 ` David Rientjes
  2010-06-08 11:41   ` KOSAKI Motohiro
  2010-06-08 23:15   ` Andrew Morton
  2010-06-06 22:35 ` [patch 18/18] oom: deprecate oom_adj tunable David Rientjes
  17 siblings, 2 replies; 104+ messages in thread
From: David Rientjes @ 2010-06-06 22:34 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Rik van Riel, Nick Piggin, Oleg Nesterov, Balbir Singh,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, linux-mm

Add a forkbomb penalty for processes that fork an excessively large
number of children to penalize that group of tasks and not others.  A
threshold is configurable from userspace to determine how many first-
generation execve children (those with their own address spaces) a task
may have before it is considered a forkbomb.  This can be tuned by
altering the value in /proc/sys/vm/oom_forkbomb_thres, which defaults to
1000.

When a task has more than 1000 first-generation children with address
spaces different from its own, a penalty of

	(average rss of children) * (# of 1st generation execve children)
	-----------------------------------------------------------------
			oom_forkbomb_thres

is assessed.  So, for example, using the default oom_forkbomb_thres of
1000, the penalty is twice the average rss of all its execve children if
there are 2000 such tasks.  A child task counts toward the threshold only
if its total runtime is less than one second; for 1000 such tasks to
exist, the parent process must be forking at an extremely high rate,
either erroneously or maliciously.

Even though a particular task may be designated a forkbomb and selected as
the victim, the oom killer will still kill the 1st generation execve child
with the highest badness() score in its place.  This avoids killing
important servers or system daemons.  When a web server forks a very large
number of threads for client connections, for example, it is much better
to kill one of those threads than to kill the server and make it
unresponsive.

Signed-off-by: David Rientjes <rientjes@google.com>
---
 Documentation/filesystems/proc.txt |    7 +++-
 Documentation/sysctl/vm.txt        |   21 +++++++++++
 include/linux/oom.h                |    4 ++
 kernel/sysctl.c                    |    8 ++++
 mm/oom_kill.c                      |   66 ++++++++++++++++++++++++++++++++++++
 5 files changed, 104 insertions(+), 2 deletions(-)

diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -1248,8 +1248,11 @@ may allocate from based on an estimation of its current memory and swap use.
 For example, if a task is using all allowed memory, its badness score will be
 1000.  If it is using half of its allowed memory, its score will be 500.
 
-There is an additional factor included in the badness score: root
-processes are given 3% extra memory over other tasks.
+There are a couple of additional factors included in the badness score: root
+processes are given 3% extra memory over other tasks, and tasks which forkbomb
+an excessive number of child processes are penalized by their average size.
+The number of child processes considered to be a forkbomb is configurable
+via /proc/sys/vm/oom_forkbomb_thres (see Documentation/sysctl/vm.txt).
 
 The amount of "allowed" memory depends on the context in which the oom killer
 was called.  If it is due to the memory assigned to the allocating task's cpuset
diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
--- a/Documentation/sysctl/vm.txt
+++ b/Documentation/sysctl/vm.txt
@@ -46,6 +46,7 @@ Currently, these files are in /proc/sys/vm:
 - nr_trim_pages         (only if CONFIG_MMU=n)
 - numa_zonelist_order
 - oom_dump_tasks
+- oom_forkbomb_thres
 - oom_kill_allocating_task
 - overcommit_memory
 - overcommit_ratio
@@ -515,6 +516,26 @@ The default value is 1 (enabled).
 
 ==============================================================
 
+oom_forkbomb_thres
+
+This value defines how many children with a separate address space a specific
+task may have before being considered as a possible forkbomb.  Tasks with more
+children not sharing the same address space as the parent will be penalized by a
+quantity of memory equaling
+
+	(average rss of execve children) * (# of 1st generation execve children)
+	------------------------------------------------------------------------
+				oom_forkbomb_thres
+
+in the oom killer's badness heuristic.  Such tasks may be protected with a lower
+oom_adj value (see Documentation/filesystems/proc.txt) if necessary.
+
+A value of 0 will disable forkbomb detection.
+
+The default value is 1000.
+
+==============================================================
+
 oom_kill_allocating_task
 
 This enables or disables killing the OOM-triggering task in
diff --git a/include/linux/oom.h b/include/linux/oom.h
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -16,6 +16,9 @@
 #define OOM_SCORE_ADJ_MIN	(-1000)
 #define OOM_SCORE_ADJ_MAX	1000
 
+/* See Documentation/sysctl/vm.txt */
+#define DEFAULT_OOM_FORKBOMB_THRES	1000
+
 #ifdef __KERNEL__
 
 #include <linux/sched.h>
@@ -59,6 +62,7 @@ static inline void oom_killer_enable(void)
 
 /* sysctls */
 extern int sysctl_oom_dump_tasks;
+extern int sysctl_oom_forkbomb_thres;
 extern int sysctl_oom_kill_allocating_task;
 extern int sysctl_panic_on_oom;
 #endif /* __KERNEL__*/
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1001,6 +1001,14 @@ static struct ctl_table vm_table[] = {
 		.proc_handler	= proc_dointvec,
 	},
 	{
+		.procname	= "oom_forkbomb_thres",
+		.data		= &sysctl_oom_forkbomb_thres,
+		.maxlen		= sizeof(sysctl_oom_forkbomb_thres),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= &zero,
+	},
+	{
 		.procname	= "overcommit_ratio",
 		.data		= &sysctl_overcommit_ratio,
 		.maxlen		= sizeof(sysctl_overcommit_ratio),
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -35,6 +35,7 @@
 int sysctl_panic_on_oom;
 int sysctl_oom_kill_allocating_task;
 int sysctl_oom_dump_tasks = 1;
+int sysctl_oom_forkbomb_thres = DEFAULT_OOM_FORKBOMB_THRES;
 static DEFINE_SPINLOCK(zone_scan_lock);
 
 /*
@@ -84,6 +85,70 @@ static struct task_struct *find_lock_task_mm(struct task_struct *p)
 	return NULL;
 }
 
+/*
+ * Tasks that fork a very large number of children with separate address spaces
+ * may be the result of a bug, user error, malicious applications, or even those
+ * with a very legitimate purpose such as a webserver.  The oom killer assesses
+ * a penalty equaling
+ *
+ *	(average rss of children) * (# of 1st generation execve children)
+ *	-----------------------------------------------------------------
+ *			sysctl_oom_forkbomb_thres
+ *
+ * for such tasks to target the parent.  oom_kill_process() will attempt to
+ * first kill a child, so there's no risk of killing an important system daemon
+ * via this method.  A web server, for example, may fork a very large number of
+ * threads to respond to client connections; it's much better to kill a child
+ * than to kill the parent, making the server unresponsive.  The goal here is
+ * to give the user a chance to recover from the error rather than deplete all
+ * memory such that the system is unusable; it's not meant to effect a forkbomb
+ * policy.
+ */
+static unsigned long oom_forkbomb_penalty(struct task_struct *tsk)
+{
+	struct task_struct *child;
+	struct task_struct *c, *t;
+	unsigned long child_rss = 0;
+	int forkcount = 0;
+
+	if (!sysctl_oom_forkbomb_thres)
+		return 0;
+
+	t = tsk;
+	do {
+		struct task_cputime task_time;
+		unsigned long runtime;
+		unsigned long rss;
+
+		list_for_each_entry(c, &t->children, sibling) {
+			child = find_lock_task_mm(c);
+			if (!child)
+				continue;
+			if (child->mm == tsk->mm) {
+				task_unlock(child);
+				continue;
+			}
+			rss = get_mm_rss(child->mm);
+			task_unlock(child);
+
+			thread_group_cputime(child, &task_time);
+			runtime = cputime_to_jiffies(task_time.utime) +
+				  cputime_to_jiffies(task_time.stime);
+			/*
+			 * Only threads that have run for less than a second are
+			 * considered toward the forkbomb penalty; these threads
+			 * rarely get to execute at all in such cases anyway.
+			 */
+			if (runtime < HZ) {
+				child_rss += rss;
+				forkcount++;
+			}
+		}
+	} while_each_thread(tsk, t);
+	return forkcount > sysctl_oom_forkbomb_thres ?
+				(child_rss / sysctl_oom_forkbomb_thres) : 0;
+}
+
 /**
  * oom_badness - heuristic function to determine which candidate task to kill
  * @p: task struct of which task we should calculate
@@ -133,6 +198,7 @@ unsigned int oom_badness(struct task_struct *p, unsigned long totalpages)
 	points = (get_mm_rss(p->mm) + get_mm_counter(p->mm, MM_SWAPENTS)) * 1000 /
 			totalpages;
 	task_unlock(p);
+	points += oom_forkbomb_penalty(p);
 
 	/*
 	 * Root processes get 3% bonus, just like the __vm_enough_memory()

* [patch 18/18] oom: deprecate oom_adj tunable
  2010-06-06 22:33 [patch 00/18] oom killer rewrite David Rientjes
                   ` (16 preceding siblings ...)
  2010-06-06 22:34 ` [patch 17/18] oom: add forkbomb penalty to badness heuristic David Rientjes
@ 2010-06-06 22:35 ` David Rientjes
  2010-06-08 11:42   ` KOSAKI Motohiro
  17 siblings, 1 reply; 104+ messages in thread
From: David Rientjes @ 2010-06-06 22:35 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Rik van Riel, Nick Piggin, Oleg Nesterov, Balbir Singh,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, linux-mm

/proc/pid/oom_adj is now deprecated so that it may eventually be
removed.  The target date for removal is June 2012.

A warning will be printed to the kernel log if a task attempts to use this
interface.  Future warnings will be suppressed until the kernel is rebooted
to prevent spamming the kernel log.

Signed-off-by: David Rientjes <rientjes@google.com>
---
 Documentation/feature-removal-schedule.txt |   25 +++++++++++++++++++++++++
 Documentation/filesystems/proc.txt         |    3 +++
 fs/proc/base.c                             |    8 ++++++++
 include/linux/oom.h                        |    3 +++
 4 files changed, 39 insertions(+), 0 deletions(-)

diff --git a/Documentation/feature-removal-schedule.txt b/Documentation/feature-removal-schedule.txt
--- a/Documentation/feature-removal-schedule.txt
+++ b/Documentation/feature-removal-schedule.txt
@@ -174,6 +174,31 @@ Who:	Eric Biederman <ebiederm@xmission.com>
 
 ---------------------------
 
+What:	/proc/<pid>/oom_adj
+When:	June 2012
+Why:	/proc/<pid>/oom_adj allows userspace to influence the oom killer's
+	badness heuristic used to determine which task to kill when the kernel
+	is out of memory.
+
+	The badness heuristic has been rewritten since the introduction of
+	this tunable such that its meaning is deprecated.  The value was
+	implemented as a bitshift on a score generated by the badness()
+	function that did not have any precise units of measure.  With the
+	rewrite, the score is given as a proportion of available memory to the
+	task allocating pages, so using a bitshift which grows the score
+	exponentially is impossible to tune with fine granularity.
+
+	A much more powerful interface, /proc/<pid>/oom_score_adj, was
+	introduced with the oom killer rewrite that allows users to increase or
+	decrease the badness() score linearly.  This interface will replace
+	/proc/<pid>/oom_adj.
+
+	A warning will be emitted to the kernel log if an application uses this
+	deprecated interface.  After it is printed once, future warnings will be
+	suppressed until the kernel is rebooted.
+
+---------------------------
+
 What:	remove EXPORT_SYMBOL(kernel_thread)
 When:	August 2006
 Files:	arch/*/kernel/*_ksyms.c
diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -1288,6 +1288,9 @@ scaled linearly with /proc/<pid>/oom_score_adj.
 Writing to /proc/<pid>/oom_score_adj or /proc/<pid>/oom_adj will change the
 other with its scaled value.
 
+NOTICE: /proc/<pid>/oom_adj is deprecated and will be removed, please see
+Documentation/feature-removal-schedule.txt.
+
 Caveat: when a parent task is selected, the oom killer will sacrifice any first
+generation children with separate address spaces instead, if possible.  This
+prevents servers and important system daemons from being killed and loses the
diff --git a/fs/proc/base.c b/fs/proc/base.c
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -1044,6 +1044,14 @@ static ssize_t oom_adjust_write(struct file *file, const char __user *buf,
 		return -EACCES;
 	}
 
+	/*
+	 * Warn that /proc/pid/oom_adj is deprecated, see
+	 * Documentation/feature-removal-schedule.txt.
+	 */
+	printk_once(KERN_WARNING "%s (%d): /proc/%d/oom_adj is deprecated, "
+			"please use /proc/%d/oom_score_adj instead.\n",
+			current->comm, task_pid_nr(current),
+			task_pid_nr(task), task_pid_nr(task));
 	task->signal->oom_adj = oom_adjust;
 	/*
 	 * Scale /proc/pid/oom_score_adj appropriately ensuring that a maximum
diff --git a/include/linux/oom.h b/include/linux/oom.h
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -2,6 +2,9 @@
 #define __INCLUDE_LINUX_OOM_H
 
 /*
+ * /proc/<pid>/oom_adj is deprecated, see
+ * Documentation/feature-removal-schedule.txt.
+ *
  * /proc/<pid>/oom_adj set to -17 protects from the oom-killer
  */
 #define OOM_DISABLE (-17)

* Re: [patch 01/18] oom: check PF_KTHREAD instead of !mm to skip kthreads
  2010-06-06 22:34 ` [patch 01/18] oom: check PF_KTHREAD instead of !mm to skip kthreads David Rientjes
@ 2010-06-07 12:12   ` Balbir Singh
  2010-06-07 19:50     ` David Rientjes
  2010-06-08 19:33   ` Andrew Morton
  1 sibling, 1 reply; 104+ messages in thread
From: Balbir Singh @ 2010-06-07 12:12 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Rik van Riel, Nick Piggin, Oleg Nesterov,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, linux-mm

* David Rientjes <rientjes@google.com> [2010-06-06 15:34:00]:

> From: Oleg Nesterov <oleg@redhat.com>
> 
> select_bad_process() thinks a kernel thread can't have ->mm != NULL, this
> is not true due to use_mm().
> 
> Change the code to check PF_KTHREAD.
>

Quick check: are all kernel threads marked with PF_KTHREAD?  daemonize()
marks threads as kernel threads, and I suppose children of init_task
inherit the flag on fork.  I suppose both should cover all kernel
threads, but just checking to see if we missed anything.
 
> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> Signed-off-by: Oleg Nesterov <oleg@redhat.com>
> Signed-off-by: David Rientjes <rientjes@google.com>
-- 
	Three Cheers,
	Balbir

* Re: [patch 02/18] oom: introduce find_lock_task_mm() to fix !mm false positives
  2010-06-06 22:34 ` [patch 02/18] oom: introduce find_lock_task_mm() to fix !mm false positives David Rientjes
@ 2010-06-07 12:58   ` Balbir Singh
  2010-06-07 13:49     ` Minchan Kim
  2010-06-08 19:42   ` Andrew Morton
  1 sibling, 1 reply; 104+ messages in thread
From: Balbir Singh @ 2010-06-07 12:58 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Rik van Riel, Nick Piggin, Oleg Nesterov,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, linux-mm

* David Rientjes <rientjes@google.com> [2010-06-06 15:34:03]:

> From: Oleg Nesterov <oleg@redhat.com>
> 
> Almost all ->mm == NUL checks in oom_kill.c are wrong.

typo should be NULL

> 
> The current code assumes that the task without ->mm has already
> released its memory and ignores the process. However this is not
> necessarily true when this process is multithreaded, other live
> sub-threads can use this ->mm.
> 
> - Remove the "if (!p->mm)" check in select_bad_process(), it is
>   just wrong.
> 
> - Add the new helper, find_lock_task_mm(), which finds the live
>   thread which uses the memory and takes task_lock() to pin ->mm
> 
> - change oom_badness() to use this helper instead of just checking
>   ->mm != NULL.
> 
> - As David pointed out, select_bad_process() must never choose the
>   task without ->mm, but no matter what oom_badness() returns the
>   task can be chosen if nothing else has been found yet.
> 
>   Change oom_badness() to return int, change it to return -1 if
>   find_lock_task_mm() fails, and change select_bad_process() to
>   check points >= 0.
> 
> Note! This patch is not enough, we need more changes.
> 
> 	- oom_badness() was fixed, but oom_kill_task() still ignores
> 	  the task without ->mm
> 
> 	- oom_forkbomb_penalty() should use find_lock_task_mm() too,
> 	  and it also needs other changes to actually find the first
> 	  first-descendant children
> 
> This will be addressed later.
> 
> [kosaki.motohiro@jp.fujitsu.com: use in badness(), __oom_kill_task()]
> Signed-off-by: Oleg Nesterov <oleg@redhat.com>
> Signed-off-by: David Rientjes <rientjes@google.com>
> ---
>  mm/oom_kill.c |   74 +++++++++++++++++++++++++++++++++------------------------
>  1 files changed, 43 insertions(+), 31 deletions(-)
> 
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -52,6 +52,20 @@ static int has_intersects_mems_allowed(struct task_struct *tsk)
>  	return 0;
>  }
> 
> +static struct task_struct *find_lock_task_mm(struct task_struct *p)
> +{
> +	struct task_struct *t = p;
> +
> +	do {
> +		task_lock(t);
> +		if (likely(t->mm))
> +			return t;
> +		task_unlock(t);
> +	} while_each_thread(p, t);
> +
> +	return NULL;
> +}
> +

Even if we miss this mm via p->mm, won't for_each_process actually
catch it? Are you suggesting that the main thread could have detached
the mm and a thread might still have it mapped? 


-- 
	Three Cheers,
	Balbir

* Re: [patch 02/18] oom: introduce find_lock_task_mm() to fix !mm false positives
  2010-06-07 12:58   ` Balbir Singh
@ 2010-06-07 13:49     ` Minchan Kim
  2010-06-07 19:49       ` David Rientjes
  0 siblings, 1 reply; 104+ messages in thread
From: Minchan Kim @ 2010-06-07 13:49 UTC (permalink / raw)
  To: balbir
  Cc: David Rientjes, Andrew Morton, Rik van Riel, Nick Piggin,
	Oleg Nesterov, KAMEZAWA Hiroyuki, KOSAKI Motohiro, linux-mm

Hi, Balbir.

On Mon, Jun 7, 2010 at 9:58 PM, Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> * David Rientjes <rientjes@google.com> [2010-06-06 15:34:03]:
>
>> From: Oleg Nesterov <oleg@redhat.com>
>>
>> Almost all ->mm == NUL checks in oom_kill.c are wrong.
>
> typo should be NULL
>
>>
>> The current code assumes that the task without ->mm has already
>> released its memory and ignores the process. However this is not
>> necessarily true when this process is multithreaded, other live
>> sub-threads can use this ->mm.
>>
>> - Remove the "if (!p->mm)" check in select_bad_process(), it is
>>   just wrong.
>>
>> - Add the new helper, find_lock_task_mm(), which finds the live
>>   thread which uses the memory and takes task_lock() to pin ->mm
>>
>> - change oom_badness() to use this helper instead of just checking
>>   ->mm != NULL.
>>
>> - As David pointed out, select_bad_process() must never choose the
>>   task without ->mm, but no matter what oom_badness() returns the
>>   task can be chosen if nothing else has been found yet.
>>
>>   Change oom_badness() to return int, change it to return -1 if
>>   find_lock_task_mm() fails, and change select_bad_process() to
>>   check points >= 0.
>>
>> Note! This patch is not enough, we need more changes.
>>
>>       - oom_badness() was fixed, but oom_kill_task() still ignores
>>         the task without ->mm
>>
>>       - oom_forkbomb_penalty() should use find_lock_task_mm() too,
>>         and it also needs other changes to actually find the first
>>         first-descendant children
>>
>> This will be addressed later.
>>
>> [kosaki.motohiro@jp.fujitsu.com: use in badness(), __oom_kill_task()]
>> Signed-off-by: Oleg Nesterov <oleg@redhat.com>
>> Signed-off-by: David Rientjes <rientjes@google.com>
>> ---
>>  mm/oom_kill.c |   74 +++++++++++++++++++++++++++++++++------------------------
>>  1 files changed, 43 insertions(+), 31 deletions(-)
>>
>> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
>> --- a/mm/oom_kill.c
>> +++ b/mm/oom_kill.c
>> @@ -52,6 +52,20 @@ static int has_intersects_mems_allowed(struct task_struct *tsk)
>>       return 0;
>>  }
>>
>> +static struct task_struct *find_lock_task_mm(struct task_struct *p)
>> +{
>> +     struct task_struct *t = p;
>> +
>> +     do {
>> +             task_lock(t);
>> +             if (likely(t->mm))
>> +                     return t;
>> +             task_unlock(t);
>> +     } while_each_thread(p, t);
>> +
>> +     return NULL;
>> +}
>> +
>
> Even if we miss this mm via p->mm, won't for_each_process actually
> catch it? Are you suggesting that the main thread could have detached
> the mm and a thread might still have it mapped?

Yes.  Although the main thread detaches its mm, a sub-thread may still
have the mm.  Since that caused confusion, I think this function name
isn't good, so I suggested the following:

http://lkml.org/lkml/2010/6/2/325

Anyway, it does make sense to me.
-- 
Kind regards,
Minchan Kim

* Re: [patch 02/18] oom: introduce find_lock_task_mm() to fix !mm false positives
  2010-06-07 13:49     ` Minchan Kim
@ 2010-06-07 19:49       ` David Rientjes
  0 siblings, 0 replies; 104+ messages in thread
From: David Rientjes @ 2010-06-07 19:49 UTC (permalink / raw)
  To: Minchan Kim
  Cc: balbir, Andrew Morton, Rik van Riel, Nick Piggin, Oleg Nesterov,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, linux-mm

On Mon, 7 Jun 2010, Minchan Kim wrote:

> Yes.  Although the main thread detaches its mm, a sub-thread may still
> have the mm.  Since that caused confusion, I think this function name
> isn't good, so I suggested the following:
> 

I think the function name is fine, it describes exactly what it does: it 
finds the relevant mm for the task and returns it with task_lock() held.

* Re: [patch 01/18] oom: check PF_KTHREAD instead of !mm to skip kthreads
  2010-06-07 12:12   ` Balbir Singh
@ 2010-06-07 19:50     ` David Rientjes
  0 siblings, 0 replies; 104+ messages in thread
From: David Rientjes @ 2010-06-07 19:50 UTC (permalink / raw)
  To: Balbir Singh
  Cc: Andrew Morton, Rik van Riel, Nick Piggin, Oleg Nesterov,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, linux-mm

On Mon, 7 Jun 2010, Balbir Singh wrote:

> > select_bad_process() thinks a kernel thread can't have ->mm != NULL, this
> > is not true due to use_mm().
> > 
> > Change the code to check PF_KTHREAD.
> >
> 
> Quick check: are all kernel threads marked with PF_KTHREAD?  daemonize()
> marks threads as kernel threads, and I suppose children of init_task
> inherit the flag on fork.  I suppose both should cover all kernel
> threads, but just checking to see if we missed anything.
>  

Right, it's the inheritance from init_task that is the key which gets 
cleared on exec for all user threads.

* Re: [patch 16/18] oom: badness heuristic rewrite
  2010-06-06 22:34 ` [patch 16/18] oom: badness heuristic rewrite David Rientjes
@ 2010-06-08 11:41   ` KOSAKI Motohiro
  2010-06-08 23:02     ` Andrew Morton
  2010-06-17  5:12     ` David Rientjes
  2010-06-08 22:58   ` Andrew Morton
  1 sibling, 2 replies; 104+ messages in thread
From: KOSAKI Motohiro @ 2010-06-08 11:41 UTC (permalink / raw)
  To: David Rientjes
  Cc: kosaki.motohiro, Andrew Morton, Rik van Riel, Nick Piggin,
	Oleg Nesterov, Balbir Singh, KAMEZAWA Hiroyuki, linux-mm

Hi

> This a complete rewrite of the oom killer's badness() heuristic which is
> used to determine which task to kill in oom conditions.  The goal is to
> make it as simple and predictable as possible so the results are better
> understood and we end up killing the task which will lead to the most
> memory freeing while still respecting the fine-tuning from userspace.
> 
> Instead of basing the heuristic on mm->total_vm for each task, the task's
> rss and swap space is used instead.  This is a better indication of the
> amount of memory that will be freeable if the oom killed task is chosen
> and subsequently exits.  This helps specifically in cases where KDE or
> GNOME is chosen for oom kill on desktop systems instead of a memory
> hogging task.
> 
> The baseline for the heuristic is a proportion of memory that each task is
> currently using in memory plus swap compared to the amount of "allowable"
> memory.  "Allowable," in this sense, means the system-wide resources for
> unconstrained oom conditions, the set of mempolicy nodes, the mems
> attached to current's cpuset, or a memory controller's limit.  The
> proportion is given on a scale of 0 (never kill) to 1000 (always kill),
> roughly meaning that if a task has a badness() score of 500 that the task
> consumes approximately 50% of allowable memory resident in RAM or in swap
> space.
> 
> The proportion is always relative to the amount of "allowable" memory and
> not the total amount of RAM systemwide so that mempolicies and cpusets may
> operate in isolation; they shall not need to know the true size of the
> machine on which they are running if they are bound to a specific set of
> nodes or mems, respectively.
> 
> Root tasks are given 3% extra memory just like __vm_enough_memory()
> provides in LSMs.  In the event of two tasks consuming similar amounts of
> memory, it is generally better to save root's task.
> 
> Because of the change in the badness() heuristic's baseline, it is also
> necessary to introduce a new user interface to tune it.  It's not possible
> to redefine the meaning of /proc/pid/oom_adj with a new scale since the
> ABI cannot be changed for backward compatibility.  Instead, a new tunable,
> /proc/pid/oom_score_adj, is added that ranges from -1000 to +1000.  It may
> be used to polarize the heuristic such that certain tasks are never
> considered for oom kill while others may always be considered.  The value
> is added directly into the badness() score so a value of -500, for
> example, means to discount 50% of its memory consumption in comparison to
> other tasks either on the system, bound to the mempolicy, in the cpuset,
> or sharing the same memory controller.
> 
> /proc/pid/oom_adj is changed so that its meaning is rescaled into the
> units used by /proc/pid/oom_score_adj, and vice versa.  Changing one of
> these per-task tunables will rescale the value of the other to an
> equivalent meaning.  Although /proc/pid/oom_adj was originally defined as
> a bitshift on the badness score, it now shares the same linear growth as
> /proc/pid/oom_score_adj but with different granularity.  This is required
> so the ABI is not broken with userspace applications and allows oom_adj to
> be deprecated for future removal.
> 
> Signed-off-by: David Rientjes <rientjes@google.com>
> ---
>  Documentation/filesystems/proc.txt |   94 ++++++++-----
>  fs/proc/base.c                     |   99 ++++++++++++-
>  include/linux/memcontrol.h         |    8 +
>  include/linux/oom.h                |   14 ++-
>  include/linux/sched.h              |    3 +-
>  kernel/fork.c                      |    1 +
>  mm/memcontrol.c                    |   18 +++
>  mm/oom_kill.c                      |  279 ++++++++++++++++--------------------
>  8 files changed, 316 insertions(+), 200 deletions(-)
> 
> diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
> --- a/Documentation/filesystems/proc.txt
> +++ b/Documentation/filesystems/proc.txt
> @@ -33,7 +33,8 @@ Table of Contents
>    2	Modifying System Parameters
>  
>    3	Per-Process Parameters
> -  3.1	/proc/<pid>/oom_adj - Adjust the oom-killer score
> +  3.1	/proc/<pid>/oom_adj & /proc/<pid>/oom_score_adj - Adjust the oom-killer
> +								score
>    3.2	/proc/<pid>/oom_score - Display current oom-killer score
>    3.3	/proc/<pid>/io - Display the IO accounting fields
>    3.4	/proc/<pid>/coredump_filter - Core dump filtering settings
> @@ -1234,42 +1235,61 @@ of the kernel.
>  CHAPTER 3: PER-PROCESS PARAMETERS
>  ------------------------------------------------------------------------------
>  
> -3.1 /proc/<pid>/oom_adj - Adjust the oom-killer score
> -------------------------------------------------------
> -
> -This file can be used to adjust the score used to select which processes
> -should be killed in an  out-of-memory  situation.  Giving it a high score will
> -increase the likelihood of this process being killed by the oom-killer.  Valid
> -values are in the range -16 to +15, plus the special value -17, which disables
> -oom-killing altogether for this process.
> -
> -The process to be killed in an out-of-memory situation is selected among all others
> -based on its badness score. This value equals the original memory size of the process
> -and is then updated according to its CPU time (utime + stime) and the
> -run time (uptime - start time). The longer it runs the smaller is the score.
> -Badness score is divided by the square root of the CPU time and then by
> -the double square root of the run time.
> -
> -Swapped out tasks are killed first. Half of each child's memory size is added to
> -the parent's score if they do not share the same memory. Thus forking servers
> -are the prime candidates to be killed. Having only one 'hungry' child will make
> -parent less preferable than the child.
> -
> -/proc/<pid>/oom_score shows process' current badness score.
> -
> -The following heuristics are then applied:
> - * if the task was reniced, its score doubles
> - * superuser or direct hardware access tasks (CAP_SYS_ADMIN, CAP_SYS_RESOURCE
> - 	or CAP_SYS_RAWIO) have their score divided by 4
> - * if oom condition happened in one cpuset and checked process does not belong
> - 	to it, its score is divided by 8
> - * the resulting score is multiplied by two to the power of oom_adj, i.e.
> -	points <<= oom_adj when it is positive and
> -	points >>= -(oom_adj) otherwise
> -
> -The task with the highest badness score is then selected and its children
> -are killed, process itself will be killed in an OOM situation when it does
> -not have children or some of them disabled oom like described above.
> +3.1 /proc/<pid>/oom_adj & /proc/<pid>/oom_score_adj - Adjust the oom-killer score
> +---------------------------------------------------------------------------------
> +
> +These files can be used to adjust the badness heuristic used to select which
> +process gets killed in out-of-memory conditions.
> +
> +The badness heuristic assigns a value to each candidate task ranging from 0
> +(never kill) to 1000 (always kill) to determine which process is targeted.  The
> +units are roughly a proportion along that range of allowed memory the process
> +may allocate from based on an estimation of its current memory and swap use.
> +For example, if a task is using all allowed memory, its badness score will be
> +1000.  If it is using half of its allowed memory, its score will be 500.
> +
> +There is an additional factor included in the badness score: root
> +processes are given 3% extra memory over other tasks.
> +
> +The amount of "allowed" memory depends on the context in which the oom killer
> +was called.  If it is due to the memory assigned to the allocating task's cpuset
> +being exhausted, the allowed memory represents the set of mems assigned to that
> +cpuset.  If it is due to a mempolicy's node(s) being exhausted, the allowed
> +memory represents the set of mempolicy nodes.  If it is due to a memory
> +limit (or swap limit) being reached, the allowed memory is that configured
> +limit.  Finally, if it is due to the entire system being out of memory, the
> +allowed memory represents all allocatable resources.
> +
> +The value of /proc/<pid>/oom_score_adj is added to the badness score before it
> +is used to determine which task to kill.  Acceptable values range from -1000
> +(OOM_SCORE_ADJ_MIN) to +1000 (OOM_SCORE_ADJ_MAX).  This allows userspace to
> +polarize the preference for oom killing either by always preferring a certain
> +task or completely disabling it.  The lowest possible value, -1000, is
> +equivalent to disabling oom killing entirely for that task since it will always
> +report a badness score of 0.
> +
> +Consequently, it is very simple for userspace to define the amount of memory to
> +consider for each task.  Setting a /proc/<pid>/oom_score_adj value of +500, for
> +example, is roughly equivalent to allowing the remainder of tasks sharing the
> +same system, cpuset, mempolicy, or memory controller resources to use at least
> +50% more memory.  A value of -500, on the other hand, would be roughly
> +equivalent to discounting 50% of the task's allowed memory from being considered
> +as scoring against the task.
> +
> +For backwards compatibility with previous kernels, /proc/<pid>/oom_adj may also
> +be used to tune the badness score.  Its acceptable values range from -16
> +(OOM_ADJUST_MIN) to +15 (OOM_ADJUST_MAX) and a special value of -17
> +(OOM_DISABLE) to disable oom killing entirely for that task.  Its value is
> +scaled linearly with /proc/<pid>/oom_score_adj.
> +
> +Writing to /proc/<pid>/oom_score_adj or /proc/<pid>/oom_adj will change the
> +other with its scaled value.
> +
> +Caveat: when a parent task is selected, the oom killer will sacrifice any
> +first-generation children with separate address spaces instead, if possible.
> +This prevents servers and important system daemons from being killed and
> +loses the minimum amount of work.
> +
>  
>  3.2 /proc/<pid>/oom_score - Display current oom-killer score
>  -------------------------------------------------------------
> diff --git a/fs/proc/base.c b/fs/proc/base.c
> --- a/fs/proc/base.c
> +++ b/fs/proc/base.c
> @@ -63,6 +63,7 @@
>  #include <linux/namei.h>
>  #include <linux/mnt_namespace.h>
>  #include <linux/mm.h>
> +#include <linux/swap.h>
>  #include <linux/rcupdate.h>
>  #include <linux/kallsyms.h>
>  #include <linux/stacktrace.h>
> @@ -428,16 +429,18 @@ static const struct file_operations proc_lstats_operations = {
>  #endif
>  
>  /* The badness from the OOM killer */
> -unsigned long badness(struct task_struct *p, unsigned long uptime);
>  static int proc_oom_score(struct task_struct *task, char *buffer)
>  {
>  	unsigned long points = 0;
> -	struct timespec uptime;
>  
> -	do_posix_clock_monotonic_gettime(&uptime);
>  	read_lock(&tasklist_lock);
>  	if (pid_alive(task))
> -		points = badness(task, uptime.tv_sec);
> +		points = oom_badness(task->group_leader,
> +					global_page_state(NR_INACTIVE_ANON) +
> +					global_page_state(NR_ACTIVE_ANON) +
> +					global_page_state(NR_INACTIVE_FILE) +
> +					global_page_state(NR_ACTIVE_FILE) +
> +					total_swap_pages);

Sorry, I can't ack this. Again and again I have tried to explain why this
is wrong (hopefully for the last time):

1) Incompatibility
   oom_score is part of the ABI, so we can't change it. From the end user's
   view, this change has no merit. In general, an incompatibility is
   acceptable only in very limited situations, such as when end users gain
   far more benefit than the compatibility is worth -- in other words, when
   the old ABI doesn't work well from the end user's view. That is not the
   case here.

2) Technically incorrect
   The math is not correct; it does not really represent "allowed memory".
   For example: 1) it does not account for mlocked memory, even though that
   memory can be freed by killing the task; 2) whether SHM_LOCKED memory is
   freeable depends on whether IPC_RMID was called -- if not, killing the
   task doesn't free the SysV IPC memory; 3) the normalization doesn't work
   on asymmetric NUMA, where total pages and oom are almost unrelated; and
   4) scalability: on a system with 10TB of memory, one point of oom score
   means 10GB of memory consumption, which seems too coarse. Generally,
   suppressing values this way is bad for scalable software.

So we can't merge this into our kernel. If your workload really needs
this, consider the following simple hook instead:

	if (badness_hook_fn)
		points = badness_hook_fn(p);
	else
		points = oom_badness(p);

Please implement your workload-specific oom score in your hook function.


>  	read_unlock(&tasklist_lock);
>  	return sprintf(buffer, "%lu\n", points);
>  }
> @@ -1042,7 +1045,15 @@ static ssize_t oom_adjust_write(struct file *file, const char __user *buf,
>  	}
>  
>  	task->signal->oom_adj = oom_adjust;
> -
> +	/*
> +	 * Scale /proc/pid/oom_score_adj appropriately ensuring that a maximum
> +	 * value is always attainable.
> +	 */
> +	if (task->signal->oom_adj == OOM_ADJUST_MAX)
> +		task->signal->oom_score_adj = OOM_SCORE_ADJ_MAX;
> +	else
> +		task->signal->oom_score_adj = (oom_adjust * OOM_SCORE_ADJ_MAX) /
> +								-OOM_DISABLE;
>  	unlock_task_sighand(task, &flags);
>  	put_task_struct(task);

Generally, I wouldn't be against this feature for a rare use case, but
sorry, as far as I have investigated, I haven't found any actual user, so
I won't ack it. My reviewing basically rests on 1) how many users would
use this, 2) how strongly users require it, 3) how much side effect there
is, etc. -- not on whether it is cool. A feature with zero users is
basically out of my scope. Please split this feature out and discuss it
with other reviewers (e.g. Nick, Kamezawa-san). If you can get an ack
from one or more reviewers, I won't object.

I don't want to discuss this topic with you anymore. I can't imagine
that you and I will ever reach agreement on it.



> @@ -1055,6 +1066,82 @@ static const struct file_operations proc_oom_adjust_operations = {
>  	.llseek		= generic_file_llseek,
>  };
>  
> +static ssize_t oom_score_adj_read(struct file *file, char __user *buf,
> +					size_t count, loff_t *ppos)
> +{
> +	struct task_struct *task = get_proc_task(file->f_path.dentry->d_inode);
> +	char buffer[PROC_NUMBUF];
> +	int oom_score_adj = OOM_SCORE_ADJ_MIN;
> +	unsigned long flags;
> +	size_t len;
> +
> +	if (!task)
> +		return -ESRCH;
> +	if (lock_task_sighand(task, &flags)) {
> +		oom_score_adj = task->signal->oom_score_adj;
> +		unlock_task_sighand(task, &flags);
> +	}
> +	put_task_struct(task);
> +	len = snprintf(buffer, sizeof(buffer), "%d\n", oom_score_adj);
> +	return simple_read_from_buffer(buf, count, ppos, buffer, len);
> +}
> +
> +static ssize_t oom_score_adj_write(struct file *file, const char __user *buf,
> +					size_t count, loff_t *ppos)
> +{
> +	struct task_struct *task;
> +	char buffer[PROC_NUMBUF];
> +	unsigned long flags;
> +	long oom_score_adj;
> +	int err;
> +
> +	memset(buffer, 0, sizeof(buffer));
> +	if (count > sizeof(buffer) - 1)
> +		count = sizeof(buffer) - 1;
> +	if (copy_from_user(buffer, buf, count))
> +		return -EFAULT;
> +
> +	err = strict_strtol(strstrip(buffer), 0, &oom_score_adj);
> +	if (err)
> +		return -EINVAL;
> +	if (oom_score_adj < OOM_SCORE_ADJ_MIN ||
> +			oom_score_adj > OOM_SCORE_ADJ_MAX)
> +		return -EINVAL;
> +
> +	task = get_proc_task(file->f_path.dentry->d_inode);
> +	if (!task)
> +		return -ESRCH;
> +	if (!lock_task_sighand(task, &flags)) {
> +		put_task_struct(task);
> +		return -ESRCH;
> +	}
> +	if (oom_score_adj < task->signal->oom_score_adj &&
> +			!capable(CAP_SYS_RESOURCE)) {
> +		unlock_task_sighand(task, &flags);
> +		put_task_struct(task);
> +		return -EACCES;
> +	}
> +
> +	task->signal->oom_score_adj = oom_score_adj;
> +	/*
> +	 * Scale /proc/pid/oom_adj appropriately ensuring that OOM_DISABLE is
> +	 * always attainable.
> +	 */
> +	if (task->signal->oom_score_adj == OOM_SCORE_ADJ_MIN)
> +		task->signal->oom_adj = OOM_DISABLE;
> +	else
> +		task->signal->oom_adj = (oom_score_adj * OOM_ADJUST_MAX) /
> +							OOM_SCORE_ADJ_MAX;
> +	unlock_task_sighand(task, &flags);
> +	put_task_struct(task);
> +	return count;
> +}
> +
> +static const struct file_operations proc_oom_score_adj_operations = {
> +	.read		= oom_score_adj_read,
> +	.write		= oom_score_adj_write,
> +};
> +
>  #ifdef CONFIG_AUDITSYSCALL
>  #define TMPBUFLEN 21
>  static ssize_t proc_loginuid_read(struct file * file, char __user * buf,
> @@ -2627,6 +2714,7 @@ static const struct pid_entry tgid_base_stuff[] = {
>  #endif
>  	INF("oom_score",  S_IRUGO, proc_oom_score),
>  	REG("oom_adj",    S_IRUGO|S_IWUSR, proc_oom_adjust_operations),
> +	REG("oom_score_adj", S_IRUGO|S_IWUSR, proc_oom_score_adj_operations),
>  #ifdef CONFIG_AUDITSYSCALL
>  	REG("loginuid",   S_IWUSR|S_IRUGO, proc_loginuid_operations),
>  	REG("sessionid",  S_IRUGO, proc_sessionid_operations),
> @@ -2961,6 +3049,7 @@ static const struct pid_entry tid_base_stuff[] = {
>  #endif
>  	INF("oom_score", S_IRUGO, proc_oom_score),
>  	REG("oom_adj",   S_IRUGO|S_IWUSR, proc_oom_adjust_operations),
> +	REG("oom_score_adj", S_IRUGO|S_IWUSR, proc_oom_score_adj_operations),
>  #ifdef CONFIG_AUDITSYSCALL
>  	REG("loginuid",  S_IWUSR|S_IRUGO, proc_loginuid_operations),
>  	REG("sessionid",  S_IRUSR, proc_sessionid_operations),
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -130,6 +130,8 @@ void mem_cgroup_update_file_mapped(struct page *page, int val);
>  unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
>  						gfp_t gfp_mask, int nid,
>  						int zid);
> +u64 mem_cgroup_get_limit(struct mem_cgroup *mem);
> +
>  #else /* CONFIG_CGROUP_MEM_RES_CTLR */
>  struct mem_cgroup;
>  
> @@ -309,6 +311,12 @@ unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
>  	return 0;
>  }
>  
> +static inline
> +u64 mem_cgroup_get_limit(struct mem_cgroup *mem)
> +{
> +	return 0;
> +}
> +
>  #endif /* CONFIG_CGROUP_MEM_CONT */
>  
>  #endif /* _LINUX_MEMCONTROL_H */
> diff --git a/include/linux/oom.h b/include/linux/oom.h
> --- a/include/linux/oom.h
> +++ b/include/linux/oom.h
> @@ -1,14 +1,24 @@
>  #ifndef __INCLUDE_LINUX_OOM_H
>  #define __INCLUDE_LINUX_OOM_H
>  
> -/* /proc/<pid>/oom_adj set to -17 protects from the oom-killer */
> +/*
> + * /proc/<pid>/oom_adj set to -17 protects from the oom-killer
> + */
>  #define OOM_DISABLE (-17)
>  /* inclusive */
>  #define OOM_ADJUST_MIN (-16)
>  #define OOM_ADJUST_MAX 15
>  
> +/*
> + * /proc/<pid>/oom_score_adj set to OOM_SCORE_ADJ_MIN disables oom killing for
> + * pid.
> + */
> +#define OOM_SCORE_ADJ_MIN	(-1000)
> +#define OOM_SCORE_ADJ_MAX	1000
> +
>  #ifdef __KERNEL__
>  
> +#include <linux/sched.h>
>  #include <linux/types.h>
>  #include <linux/nodemask.h>
>  
> @@ -25,6 +35,8 @@ enum oom_constraint {
>  	CONSTRAINT_MEMCG,
>  };
>  
> +extern unsigned int oom_badness(struct task_struct *p,
> +					unsigned long totalpages);
>  extern int try_set_zone_oom(struct zonelist *zonelist, gfp_t gfp_flags);
>  extern void clear_zonelist_oom(struct zonelist *zonelist, gfp_t gfp_flags);
>  
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -629,7 +629,8 @@ struct signal_struct {
>  	struct tty_audit_buf *tty_audit_buf;
>  #endif
>  
> -	int oom_adj;	/* OOM kill score adjustment (bit shift) */
> +	int oom_adj;		/* OOM kill score adjustment (bit shift) */
> +	int oom_score_adj;	/* OOM kill score adjustment */
>  };
>  
>  /* Context switch must be unlocked if interrupts are to be enabled */
> diff --git a/kernel/fork.c b/kernel/fork.c
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -899,6 +899,7 @@ static int copy_signal(unsigned long clone_flags, struct task_struct *tsk)
>  	tty_audit_fork(sig);
>  
>  	sig->oom_adj = current->signal->oom_adj;
> +	sig->oom_score_adj = current->signal->oom_score_adj;
>  
>  	return 0;
>  }
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1158,6 +1158,24 @@ static int mem_cgroup_count_children(struct mem_cgroup *mem)
>  }
>  
>  /*
> + * Return the memory (and swap, if configured) limit for a memcg.
> + */
> +u64 mem_cgroup_get_limit(struct mem_cgroup *memcg)
> +{
> +	u64 limit;
> +	u64 memsw;
> +
> +	limit = res_counter_read_u64(&memcg->res, RES_LIMIT) +
> +			total_swap_pages;
> +	memsw = res_counter_read_u64(&memcg->memsw, RES_LIMIT);
> +	/*
> +	 * If memsw is finite and limits the amount of swap space available
> +	 * to this memcg, return that limit.
> +	 */
> +	return min(limit, memsw);
> +}
> +
> +/*
>   * Visit the first child (need not be the first child as per the ordering
>   * of the cgroup list, since we track last_scanned_child) of @mem and use
>   * that to reclaim free pages from.
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -4,6 +4,8 @@
>   *  Copyright (C)  1998,2000  Rik van Riel
>   *	Thanks go out to Claus Fischer for some serious inspiration and
>   *	for goading me into coding this file...
> + *  Copyright (C)  2010  Google, Inc.
> + *	Rewritten by David Rientjes

Don't put this in.



>   *
>   *  The routines in this file are used to kill a process when
>   *  we're seriously out of memory. This gets called from __alloc_pages()
> @@ -34,7 +36,6 @@ int sysctl_panic_on_oom;
>  int sysctl_oom_kill_allocating_task;
>  int sysctl_oom_dump_tasks = 1;
>  static DEFINE_SPINLOCK(zone_scan_lock);
> -/* #define DEBUG */
>  
>  /*
>   * Do all threads of the target process overlap our allowed nodes?
> @@ -84,139 +85,72 @@ static struct task_struct *find_lock_task_mm(struct task_struct *p)
>  }
>  
>  /**
> - * badness - calculate a numeric value for how bad this task has been
> + * oom_badness - heuristic function to determine which candidate task to kill
>   * @p: task struct of which task we should calculate
> - * @uptime: current uptime in seconds
> + * @totalpages: total present RAM allowed for page allocation
>   *
> - * The formula used is relatively simple and documented inline in the
> - * function. The main rationale is that we want to select a good task
> - * to kill when we run out of memory.
> - *
> - * Good in this context means that:
> - * 1) we lose the minimum amount of work done
> - * 2) we recover a large amount of memory
> - * 3) we don't kill anything innocent of eating tons of memory
> - * 4) we want to kill the minimum amount of processes (one)
> - * 5) we try to kill the process the user expects us to kill, this
> - *    algorithm has been meticulously tuned to meet the principle
> - *    of least surprise ... (be careful when you change it)
> + * The heuristic for determining which task to kill is made to be as simple and
> + * predictable as possible.  The goal is to return the highest value for the
> + * task consuming the most memory to avoid subsequent oom failures.
>   */
> -
> -unsigned long badness(struct task_struct *p, unsigned long uptime)
> +unsigned int oom_badness(struct task_struct *p, unsigned long totalpages)
>  {
> -	unsigned long points, cpu_time, run_time;
> -	struct task_struct *child;
> -	struct task_struct *c, *t;
> -	int oom_adj = p->signal->oom_adj;
> -	struct task_cputime task_time;
> -	unsigned long utime;
> -	unsigned long stime;
> -
> -	if (oom_adj == OOM_DISABLE)
> -		return 0;
> +	int points;
>  
>  	p = find_lock_task_mm(p);
>  	if (!p)
>  		return 0;
>  
>  	/*
> -	 * The memory size of the process is the basis for the badness.
> -	 */
> -	points = p->mm->total_vm;
> -
> -	/*
> -	 * After this unlock we can no longer dereference local variable `mm'
> -	 */
> -	task_unlock(p);
> -
> -	/*
> -	 * swapoff can easily use up all memory, so kill those first.
> +	 * Shortcut check for OOM_SCORE_ADJ_MIN so the entire heuristic doesn't
> +	 * need to be executed for something that cannot be killed.
>  	 */
> -	if (p->flags & PF_OOM_ORIGIN)
> -		return ULONG_MAX;
> -
> -	/*
> -	 * Processes which fork a lot of child processes are likely
> -	 * a good choice. We add half the vmsize of the children if they
> -	 * have an own mm. This prevents forking servers to flood the
> -	 * machine with an endless amount of children. In case a single
> -	 * child is eating the vast majority of memory, adding only half
> -	 * to the parents will make the child our kill candidate of choice.
> -	 */
> -	t = p;
> -	do {
> -		list_for_each_entry(c, &t->children, sibling) {
> -			child = find_lock_task_mm(c);
> -			if (child) {
> -				if (child->mm != p->mm)
> -					points += child->mm->total_vm/2 + 1;
> -				task_unlock(child);
> -			}
> -		}
> -	} while_each_thread(p, t);
> +	if (p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN) {
> +		task_unlock(p);
> +		return 0;
> +	}
>  
>  	/*
> -	 * CPU time is in tens of seconds and run time is in thousands
> -         * of seconds. There is no particular reason for this other than
> -         * that it turned out to work very well in practice.
> +	 * When the PF_OOM_ORIGIN bit is set, it indicates the task should have
> +	 * priority for oom killing.
>  	 */
> -	thread_group_cputime(p, &task_time);
> -	utime = cputime_to_jiffies(task_time.utime);
> -	stime = cputime_to_jiffies(task_time.stime);
> -	cpu_time = (utime + stime) >> (SHIFT_HZ + 3);
> -
> -
> -	if (uptime >= p->start_time.tv_sec)
> -		run_time = (uptime - p->start_time.tv_sec) >> 10;
> -	else
> -		run_time = 0;
> -
> -	if (cpu_time)
> -		points /= int_sqrt(cpu_time);
> -	if (run_time)
> -		points /= int_sqrt(int_sqrt(run_time));
> +	if (p->flags & PF_OOM_ORIGIN) {
> +		task_unlock(p);
> +		return 1000;
> +	}
>  
>  	/*
> -	 * Niced processes are most likely less important, so double
> -	 * their badness points.
> +	 * The memory controller may have a limit of 0 bytes, so avoid a divide
> +	 * by zero if necessary.
>  	 */
> -	if (task_nice(p) > 0)
> -		points *= 2;

You removed
  - the run time check
  - the cpu time check
  - the nice check

but didn't describe the reason, so reviewers are puzzled. How can we
review this when we don't get your point? Please write:

 - What benefit is there?
 - Why do you think there is no bad effect?
 - How did you confirm that?


> +	if (!totalpages)
> +		totalpages = 1;
>  
>  	/*
> -	 * Superuser processes are usually more important, so we make it
> -	 * less likely that we kill those.
> +	 * The baseline for the badness score is the proportion of RAM that each
> +	 * task's rss and swap space use.
>  	 */
> -	if (has_capability_noaudit(p, CAP_SYS_ADMIN) ||
> -	    has_capability_noaudit(p, CAP_SYS_RESOURCE))
> -		points /= 4;
> +	points = (get_mm_rss(p->mm) + get_mm_counter(p->mm, MM_SWAPENTS)) * 1000 /
> +			totalpages;
> +	task_unlock(p);
>  
>  	/*
> -	 * We don't want to kill a process with direct hardware access.
> -	 * Not only could that mess up the hardware, but usually users
> -	 * tend to only have this flag set on applications they think
> -	 * of as important.
> +	 * Root processes get 3% bonus, just like the __vm_enough_memory()
> +	 * implementation used by LSMs.
>  	 */
> -	if (has_capability_noaudit(p, CAP_SYS_RAWIO))
> -		points /= 4;
> +	if (has_capability_noaudit(p, CAP_SYS_ADMIN))
> +		points -= 30;


CAP_SYS_ADMIN seems like a bad choice. CAP_SYS_ADMIN implies an admin's
interactive process; killing an interactive process only causes a forced
logout, but killing a system daemon can cause a far more catastrophic
disaster.

Last of all, I'll pull this one, but only as a cherry-pick.


>  
>  	/*
> -	 * Adjust the score by oom_adj.
> +	 * /proc/pid/oom_score_adj ranges from -1000 to +1000 such that it may
> +	 * either completely disable oom killing or always prefer a certain
> +	 * task.
>  	 */
> -	if (oom_adj) {
> -		if (oom_adj > 0) {
> -			if (!points)
> -				points = 1;
> -			points <<= oom_adj;
> -		} else
> -			points >>= -(oom_adj);
> -	}
> +	points += p->signal->oom_score_adj;
>  
> -#ifdef DEBUG
> -	printk(KERN_DEBUG "OOMkill: task %d (%s) got %lu points\n",
> -	p->pid, p->comm, points);
> -#endif
> -	return points;
> +	if (points < 0)
> +		return 0;
> +	return (points < 1000) ? points : 1000;
>  }
>  
>  /*
> @@ -224,12 +158,24 @@ unsigned long badness(struct task_struct *p, unsigned long uptime)
>   */
>  #ifdef CONFIG_NUMA
>  static enum oom_constraint constrained_alloc(struct zonelist *zonelist,
> -				    gfp_t gfp_mask, nodemask_t *nodemask)
> +				gfp_t gfp_mask, nodemask_t *nodemask,
> +				unsigned long *totalpages)
>  {
>  	struct zone *zone;
>  	struct zoneref *z;
>  	enum zone_type high_zoneidx = gfp_zone(gfp_mask);
> +	bool cpuset_limited = false;
> +	int nid;
>  
> +	/* Default to all anonymous memory, page cache, and swap */
> +	*totalpages = global_page_state(NR_INACTIVE_ANON) +
> +			global_page_state(NR_ACTIVE_ANON) +
> +			global_page_state(NR_INACTIVE_FILE) +
> +			global_page_state(NR_ACTIVE_FILE) +
> +			total_swap_pages;
> +
> +	if (!zonelist)
> +		return CONSTRAINT_NONE;
>  	/*
>  	 * Reach here only when __GFP_NOFAIL is used. So, we should avoid
>  	 * to kill current.We have to random task kill in this case.
> @@ -239,26 +185,47 @@ static enum oom_constraint constrained_alloc(struct zonelist *zonelist,
>  		return CONSTRAINT_NONE;
>  
>  	/*
> -	 * The nodemask here is a nodemask passed to alloc_pages(). Now,
> -	 * cpuset doesn't use this nodemask for its hardwall/softwall/hierarchy
> -	 * feature. mempolicy is an only user of nodemask here.
> -	 * check mempolicy's nodemask contains all N_HIGH_MEMORY
> +	 * This is not a __GFP_THISNODE allocation, so a truncated nodemask in
> +	 * the page allocator means a mempolicy is in effect.  Cpuset policy
> +	 * is enforced in get_page_from_freelist().
>  	 */
> -	if (nodemask && !nodes_subset(node_states[N_HIGH_MEMORY], *nodemask))
> +	if (nodemask && !nodes_subset(node_states[N_HIGH_MEMORY], *nodemask)) {
> +		*totalpages = total_swap_pages;
> +		for_each_node_mask(nid, *nodemask)
> +			*totalpages += node_page_state(nid, NR_INACTIVE_ANON) +
> +					node_page_state(nid, NR_ACTIVE_ANON) +
> +					node_page_state(nid, NR_INACTIVE_FILE) +
> +					node_page_state(nid, NR_ACTIVE_FILE);
>  		return CONSTRAINT_MEMORY_POLICY;
> +	}
>  
>  	/* Check this allocation failure is caused by cpuset's wall function */
>  	for_each_zone_zonelist_nodemask(zone, z, zonelist,
>  			high_zoneidx, nodemask)
>  		if (!cpuset_zone_allowed_softwall(zone, gfp_mask))
> -			return CONSTRAINT_CPUSET;
> -
> +			cpuset_limited = true;
> +
> +	if (cpuset_limited) {
> +		*totalpages = total_swap_pages;
> +		for_each_node_mask(nid, cpuset_current_mems_allowed)
> +			*totalpages += node_page_state(nid, NR_INACTIVE_ANON) +
> +					node_page_state(nid, NR_ACTIVE_ANON) +
> +					node_page_state(nid, NR_INACTIVE_FILE) +
> +					node_page_state(nid, NR_ACTIVE_FILE);
> +		return CONSTRAINT_CPUSET;
> +	}
>  	return CONSTRAINT_NONE;
>  }
>  #else
>  static enum oom_constraint constrained_alloc(struct zonelist *zonelist,
> -				gfp_t gfp_mask, nodemask_t *nodemask)
> +				gfp_t gfp_mask, nodemask_t *nodemask,
> +				unsigned long *totalpages)
>  {
> +	*totalpages = global_page_state(NR_INACTIVE_ANON) +
> +			global_page_state(NR_ACTIVE_ANON) +
> +			global_page_state(NR_INACTIVE_FILE) +
> +			global_page_state(NR_ACTIVE_FILE) +
> +			total_swap_pages;
>  	return CONSTRAINT_NONE;
>  }
>  #endif
> @@ -269,18 +236,16 @@ static enum oom_constraint constrained_alloc(struct zonelist *zonelist,
>   *
>   * (not docbooked, we don't want this one cluttering up the manual)
>   */
> -static struct task_struct *select_bad_process(unsigned long *ppoints,
> -		struct mem_cgroup *mem, enum oom_constraint constraint,
> -		const nodemask_t *mask)
> +static struct task_struct *select_bad_process(unsigned int *ppoints,
> +		unsigned long totalpages, struct mem_cgroup *mem,
> +		enum oom_constraint constraint, const nodemask_t *mask)
>  {
>  	struct task_struct *p;
>  	struct task_struct *chosen = NULL;
> -	struct timespec uptime;
>  	*ppoints = 0;
>  
> -	do_posix_clock_monotonic_gettime(&uptime);
>  	for_each_process(p) {
> -		unsigned long points;
> +		unsigned int points;
>  
>  		/* skip the init task and kthreads */
>  		if (is_global_init(p) || (p->flags & PF_KTHREAD))
> @@ -319,14 +284,11 @@ static struct task_struct *select_bad_process(unsigned long *ppoints,
>  				return ERR_PTR(-1UL);
>  
>  			chosen = p;
> -			*ppoints = ULONG_MAX;
> +			*ppoints = 1000;
>  		}
>  
> -		if (p->signal->oom_adj == OOM_DISABLE)
> -			continue;
> -
> -		points = badness(p, uptime.tv_sec);
> -		if (points > *ppoints || !chosen) {
> +		points = oom_badness(p, totalpages);
> +		if (points > *ppoints) {
>  			chosen = p;
>  			*ppoints = points;
>  		}
> @@ -341,7 +303,7 @@ static struct task_struct *select_bad_process(unsigned long *ppoints,
>   *
>   * Dumps the current memory state of all system tasks, excluding kernel threads.
>   * State information includes task's pid, uid, tgid, vm size, rss, cpu, oom_adj
> - * score, and name.
> + * value, oom_score_adj value, and name.
>   *
>   * If the actual is non-NULL, only tasks that are a member of the mem_cgroup are
>   * shown.
> @@ -354,7 +316,7 @@ static void dump_tasks(const struct mem_cgroup *mem)
>  	struct task_struct *task;
>  
>  	printk(KERN_INFO "[ pid ]   uid  tgid total_vm      rss cpu oom_adj "
> -	       "name\n");
> +	       "oom_score_adj name\n");
>  	for_each_process(p) {
>  		/*
>  		 * We don't have is_global_init() check here, because the old
> @@ -376,10 +338,11 @@ static void dump_tasks(const struct mem_cgroup *mem)
>  			continue;
>  		}
>  
> -		printk(KERN_INFO "[%5d] %5d %5d %8lu %8lu %3d     %3d %s\n",
> +		pr_info("[%5d] %5d %5d %8lu %8lu %3d     %3d          %4d %s\n",
>  		       task->pid, __task_cred(task)->uid, task->tgid,
>  		       task->mm->total_vm, get_mm_rss(task->mm),
> -		       (int)task_cpu(task), task->signal->oom_adj, p->comm);
> +		       (int)task_cpu(task), task->signal->oom_adj,
> +		       task->signal->oom_score_adj, p->comm);
>  		task_unlock(task);
>  	}
>  }
> @@ -388,8 +351,9 @@ static void dump_header(struct task_struct *p, gfp_t gfp_mask, int order,
>  							struct mem_cgroup *mem)
>  {
>  	pr_warning("%s invoked oom-killer: gfp_mask=0x%x, order=%d, "
> -		"oom_adj=%d\n",
> -		current->comm, gfp_mask, order, current->signal->oom_adj);
> +		"oom_adj=%d, oom_score_adj=%d\n",
> +		current->comm, gfp_mask, order, current->signal->oom_adj,
> +		current->signal->oom_score_adj);
>  	task_lock(current);
>  	cpuset_print_task_mems_allowed(current);
>  	task_unlock(current);
> @@ -404,7 +368,7 @@ static void dump_header(struct task_struct *p, gfp_t gfp_mask, int order,
>  static int oom_kill_task(struct task_struct *p)
>  {
>  	p = find_lock_task_mm(p);
> -	if (!p || p->signal->oom_adj == OOM_DISABLE) {
> +	if (!p || p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN) {
>  		task_unlock(p);
>  		return 1;
>  	}
> @@ -422,14 +386,13 @@ static int oom_kill_task(struct task_struct *p)
>  #undef K
>  
>  static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
> -			    unsigned long points, struct mem_cgroup *mem,
> -			    const char *message)
> +			    unsigned int points, unsigned long totalpages,
> +			    struct mem_cgroup *mem, const char *message)
>  {
>  	struct task_struct *victim = p;
>  	struct task_struct *c;
>  	struct task_struct *t = p;
> -	unsigned long victim_points = 0;
> -	struct timespec uptime;
> +	unsigned int victim_points = 0;
>  
>  	if (printk_ratelimit())
>  		dump_header(p, gfp_mask, order, mem);
> @@ -443,13 +406,12 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
>  		return 0;
>  	}
>  
> -	pr_err("%s: Kill process %d (%s) score %lu or sacrifice child\n",
> +	pr_err("%s: Kill process %d (%s) score %d or sacrifice child\n",
>  		message, task_pid_nr(p), p->comm, points);
>  
>  	/* Try to sacrifice the worst child first */
> -	do_posix_clock_monotonic_gettime(&uptime);
>  	do {
> -		unsigned long cpoints;
> +		unsigned int cpoints;
>  
>  		list_for_each_entry(c, &t->children, sibling) {
>  			if (c->mm == p->mm)
> @@ -457,8 +419,10 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
>  			if (mem && !task_in_mem_cgroup(c, mem))
>  				continue;
>  
> -			/* badness() returns 0 if the thread is unkillable */
> -			cpoints = badness(c, uptime.tv_sec);
> +			/*
> +			 * oom_badness() returns 0 if the thread is unkillable
> +			 */
> +			cpoints = oom_badness(c, totalpages);
>  			if (cpoints > victim_points) {
>  				victim = c;
>  				victim_points = cpoints;
> @@ -496,17 +460,19 @@ static void check_panic_on_oom(enum oom_constraint constraint, gfp_t gfp_mask,
>  #ifdef CONFIG_CGROUP_MEM_RES_CTLR
>  void mem_cgroup_out_of_memory(struct mem_cgroup *mem, gfp_t gfp_mask)
>  {
> -	unsigned long points = 0;
> +	unsigned long limit;
> +	unsigned int points = 0;
>  	struct task_struct *p;
>  
>  	check_panic_on_oom(CONSTRAINT_MEMCG, gfp_mask, 0);
> +	limit = mem_cgroup_get_limit(mem) >> PAGE_SHIFT;
>  	read_lock(&tasklist_lock);
>  retry:
> -	p = select_bad_process(&points, mem, CONSTRAINT_MEMCG, NULL);
> +	p = select_bad_process(&points, limit, mem, CONSTRAINT_MEMCG, NULL);
>  	if (!p || PTR_ERR(p) == -1UL)
>  		goto out;
>  
> -	if (oom_kill_process(p, gfp_mask, 0, points, mem,
> +	if (oom_kill_process(p, gfp_mask, 0, points, limit, mem,
>  				"Memory cgroup out of memory"))
>  		goto retry;
>  out:
> @@ -619,22 +585,22 @@ static void clear_system_oom(void)
>  /*
>   * Must be called with tasklist_lock held for read.
>   */
> -static void __out_of_memory(gfp_t gfp_mask, int order,
> +static void __out_of_memory(gfp_t gfp_mask, int order, unsigned long totalpages,
>  			enum oom_constraint constraint, const nodemask_t *mask)
>  {
>  	struct task_struct *p;
> -	unsigned long points;
> +	unsigned int points;
>  
>  	if (sysctl_oom_kill_allocating_task)
> -		if (!oom_kill_process(current, gfp_mask, order, 0, NULL,
> -				"Out of memory (oom_kill_allocating_task)"))
> +		if (!oom_kill_process(current, gfp_mask, order, 0, totalpages,
> +			NULL, "Out of memory (oom_kill_allocating_task)"))
>  			return;
>  retry:
>  	/*
>  	 * Rambo mode: Shoot down a process and hope it solves whatever
>  	 * issues we may have.
>  	 */
> -	p = select_bad_process(&points, NULL, constraint, mask);
> +	p = select_bad_process(&points, totalpages, NULL, constraint, mask);
>  
>  	if (PTR_ERR(p) == -1UL)
>  		return;
> @@ -646,7 +612,7 @@ retry:
>  		panic("Out of memory and no killable processes...\n");
>  	}
>  
> -	if (oom_kill_process(p, gfp_mask, order, points, NULL,
> +	if (oom_kill_process(p, gfp_mask, order, points, totalpages, NULL,
>  			     "Out of memory"))
>  		goto retry;
>  }
> @@ -666,6 +632,7 @@ retry:
>  void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
>  		int order, nodemask_t *nodemask)
>  {
> +	unsigned long totalpages;
>  	unsigned long freed = 0;
>  	enum oom_constraint constraint = CONSTRAINT_NONE;
>  
> @@ -688,11 +655,11 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
>  	 * Check if there were limitations on the allocation (only relevant for
>  	 * NUMA) that may require different handling.
>  	 */
> -	if (zonelist)
> -		constraint = constrained_alloc(zonelist, gfp_mask, nodemask);
> +	constraint = constrained_alloc(zonelist, gfp_mask, nodemask,
> +						&totalpages);
>  	check_panic_on_oom(constraint, gfp_mask, order);
>  	read_lock(&tasklist_lock);
> -	__out_of_memory(gfp_mask, order, constraint, nodemask);
> +	__out_of_memory(gfp_mask, order, totalpages, constraint, nodemask);
>  	read_unlock(&tasklist_lock);
>  
>  	/*



--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: [patch 05/18] oom: give current access to memory reserves if it has been killed
  2010-06-06 22:34 ` [patch 05/18] oom: give current access to memory reserves if it has been killed David Rientjes
@ 2010-06-08 11:41   ` KOSAKI Motohiro
  2010-06-08 18:47     ` David Rientjes
  2010-06-08 20:12     ` Andrew Morton
  2010-06-08 20:08   ` Andrew Morton
  1 sibling, 2 replies; 104+ messages in thread
From: KOSAKI Motohiro @ 2010-06-08 11:41 UTC (permalink / raw)
  To: David Rientjes
  Cc: kosaki.motohiro, Andrew Morton, Rik van Riel, Nick Piggin,
	Oleg Nesterov, Balbir Singh, KAMEZAWA Hiroyuki, linux-mm

> It's possible to livelock the page allocator if a thread has mm->mmap_sem
> and fails to make forward progress because the oom killer selects another
> thread sharing the same ->mm to kill that cannot exit until the semaphore
> is dropped.
> 
> The oom killer will not kill multiple tasks at the same time; each oom
> killed task must exit before another task may be killed.  Thus, if one
> thread is holding mm->mmap_sem and cannot allocate memory, all threads
> sharing the same ->mm are blocked from exiting as well.  In the oom kill
> case, that means the thread holding mm->mmap_sem will never free
> additional memory since it cannot get access to memory reserves and the
> thread that depends on it with access to memory reserves cannot exit
>  because it cannot acquire the semaphore.  Thus, the page allocator
>  livelocks.
> 
> When the oom killer is called and current happens to have a pending
> SIGKILL, this patch automatically gives it access to memory reserves and
> returns.  Upon returning to the page allocator, its allocation will
> hopefully succeed so it can quickly exit and free its memory.  If not, the
> page allocator will fail the allocation if it is not __GFP_NOFAIL.
> 
> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> Signed-off-by: David Rientjes <rientjes@google.com>
> ---
>  mm/oom_kill.c |   10 ++++++++++
>  1 files changed, 10 insertions(+), 0 deletions(-)
> 
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -650,6 +650,16 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
>  		/* Got some memory back in the last second. */
>  		return;
>  
> +	/*
> +	 * If current has a pending SIGKILL, then automatically select it.  The
> +	 * goal is to allow it to allocate so that it may quickly exit and free
> +	 * its memory.
> +	 */
> +	if (fatal_signal_pending(current)) {
> +		set_thread_flag(TIF_MEMDIE);
> +		return;
> +	}
> +
>  	if (sysctl_panic_on_oom == 2) {
>  		dump_header(NULL, gfp_mask, order, NULL);
>  		panic("out of memory. Compulsory panic_on_oom is selected.\n");

Sorry, I have found that this patch works incorrectly. I did not pull it.




* Re: [patch 17/18] oom: add forkbomb penalty to badness heuristic
  2010-06-06 22:34 ` [patch 17/18] oom: add forkbomb penalty to badness heuristic David Rientjes
@ 2010-06-08 11:41   ` KOSAKI Motohiro
  2010-06-08 23:15   ` Andrew Morton
  1 sibling, 0 replies; 104+ messages in thread
From: KOSAKI Motohiro @ 2010-06-08 11:41 UTC (permalink / raw)
  To: David Rientjes
  Cc: kosaki.motohiro, Andrew Morton, Rik van Riel, Nick Piggin,
	Oleg Nesterov, Balbir Singh, KAMEZAWA Hiroyuki, linux-mm

> Add a forkbomb penalty for processes that fork an excessively large
> number of children to penalize that group of tasks and not others.  A
> threshold is configurable from userspace to determine how many first-
> generation execve children (those with their own address spaces) a task
> may have before it is considered a forkbomb.  This can be tuned by
> altering the value in /proc/sys/vm/oom_forkbomb_thres, which defaults to
> 1000.
> 
> When a task has more than 1000 first-generation children with different
> address spaces than itself, a penalty of
> 
> 	(average rss of children) * (# of 1st generation execve children)
> 	-----------------------------------------------------------------
> 			oom_forkbomb_thres
> 
> is assessed.  So, for example, using the default oom_forkbomb_thres of
> 1000, the penalty is twice the average rss of all its execve children if
> there are 2000 such tasks.  A task is considered to count toward the
> threshold if its total runtime is less than one second; for 1000 of such
> tasks to exist, the parent process must be forking at an extremely high
> rate either erroneously or maliciously.
> 
> Even though a particular task may be designated a forkbomb and selected as
> the victim, the oom killer will still kill the 1st generation execve child
> with the highest badness() score in its place.  This avoids killing
> important servers or system daemons.  When a web server forks a very large
> number of threads for client connections, for example, it is much better
> to kill one of those threads than to kill the server and make it
> unresponsive.

Today, I tested this patch, but I could not observe it working.

Test method:

prepare:
	make a 500M memory cgroup

console1:
	run memtoy (consumes 100M of memory)

console2:
	run the forkbomb bash script ":(){ :|:& };:"
	AFAIK, this is the most typical forkbomb.  See http://en.wikipedia.org/wiki/Fork_bomb

Each bash process consumes about 100KB, and about 4000 bash processes
consume the remaining 400M.  The oom_score list is below.

1) Almost no bash process gets the forkbomb penalty at all.
2) At most, the root bash gets a 2x penalty, changing its score from 90
   to 180.  But memtoy (the 100MB process) has a score of 25840 -- still
   a 143x score difference.



  pid     uid     total_vm anonrss(kb) filerss(kb) oom_adj oom_score comm
-----------------------------------------------------------------------------------
 [ 1865]     0     2880      448     1264 |        0      415 bash
 [ 1887]     0    12076      284     1056 |        0      325 su
 [ 1889]  1264     6313      992     1604 |        0      649 zsh
 [ 1906]  1264    29317   102660      700 |        0    25840 memtoy
 [ 2006]     0    26999      448     1376 |        0      442 bash
 [ 2024]     0    36195      292     1160 |        0      352 su
 [ 2025]  1268    26968      360     1380 |        0      435 bash
 [ 5555]  1268    26968      364      300 |        0      166 bash
 [ 5623]  1268    26968      364      300 |        0      166 bash
 [ 5688]  1268    26968      364      300 |        0      166 bash
 [ 5711]  1268    26968      364      300 |        0      166 bash
 [ 5742]  1268    26968      364      300 |        0      166 bash
 [ 5749]  1268    26968      364      300 |        0      166 bash
 [ 5752]  1268    26968      364      388 |        0      188 bash
 [ 5755]  1268    26968      364      300 |        0      166 bash
 [ 5765]  1268    26968      364      300 |        0      166 bash
 [ 5791]  1268    26968      364      300 |        0      166 bash
 [ 5808]  1268    26968      364      300 |        0      166 bash
 [ 5819]  1268    26968      364      324 |        0      172 bash
 [ 5835]  1268    26968      364      300 |        0      166 bash
 [ 5889]  1268    26968      364      300 |        0      166 bash
 [ 5903]  1268    26968      364      300 |        0      166 bash
 [ 5924]  1268    26968      364      424 |        0      197 bash
..... (continue to very much bash)

[10198]  1268    26968      368       20 |        0       97 bash
[10199]  1268    26968      368       20 |        0       97 bash
[10200]  1268    26968      368       20 |        0       97 bash
[10201]  1268    26968      368       20 |        0       97 bash
[10202]  1268    26968      368       20 |        0       97 bash
[10203]  1268    26968      368       20 |        0       97 bash
[10204]  1268    26968      368       20 |        0       97 bash
[10205]  1268    26968      368       20 |        0       97 bash
[10206]  1268    26968      368       20 |        0       97 bash
[10207]  1268    26968      364       20 |        0       96 bash
[10208]  1268    26968      364       20 |        0       96 bash
[10209]  1268    26968      368       20 |        0       97 bash
[10210]  1268    26968      368       20 |        0       97 bash
[10211]  1268    26968      368       20 |        0       97 bash
[10212]  1268    26968      368       20 |        0       97 bash
[10213]  1268    26968      368       20 |        0       97 bash
[10214]  1268    26968      368       20 |        0       97 bash
[10215]  1268    26968      368       20 |        0       97 bash
[10216]  1268    26968      368       20 |        0       97 bash
[10217]  1268    26968      368       20 |        0       97 bash
[10218]  1268    26968      368       20 |        0       97 bash
Memory cgroup out of memory: Kill process 1906 (memtoy) with score 25840 or sacrifice child
Killed process 1906 (memtoy) vsz:117268kB, anon-rss:102660kB, file-rss:700kB



At the very least, the patch author must define which problem counts as a
"forkbomb" in this description.

I did not pull this one.



* Re: [patch 07/18] oom: filter tasks not sharing the same cpuset
  2010-06-06 22:34 ` [patch 07/18] oom: filter tasks not sharing the same cpuset David Rientjes
@ 2010-06-08 11:41   ` KOSAKI Motohiro
  2010-06-08 18:51     ` David Rientjes
  2010-06-08 20:23   ` Andrew Morton
  1 sibling, 1 reply; 104+ messages in thread
From: KOSAKI Motohiro @ 2010-06-08 11:41 UTC (permalink / raw)
  To: David Rientjes
  Cc: kosaki.motohiro, Andrew Morton, Rik van Riel, Nick Piggin,
	Oleg Nesterov, Balbir Singh, KAMEZAWA Hiroyuki, linux-mm

> Tasks that do not share the same set of allowed nodes with the task that
> triggered the oom should not be considered as candidates for oom kill.
> 
> Tasks in other cpusets with a disjoint set of mems would be unfairly
> penalized otherwise because of oom conditions elsewhere; an extreme
> example could unfairly kill all other applications on the system if a
> single task in a user's cpuset sets itself to OOM_DISABLE and then uses
> more memory than allowed.
> 
> Killing tasks outside of current's cpuset rarely would free memory for
> current anyway.  To use a sane heuristic, we must ensure that killing a
> task would likely free memory for current and avoid needlessly killing
> others at all costs just because their potential memory freeing is
> unknown.  It is better to kill current than another task needlessly.
> 
> Acked-by: Rik van Riel <riel@redhat.com>
> Acked-by: Nick Piggin <npiggin@suse.de>
> Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> Signed-off-by: David Rientjes <rientjes@google.com>
> ---
>  mm/oom_kill.c |   10 ++--------
>  1 files changed, 2 insertions(+), 8 deletions(-)
> 
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -184,14 +184,6 @@ unsigned long badness(struct task_struct *p, unsigned long uptime)
>  		points /= 4;
>  
>  	/*
> -	 * If p's nodes don't overlap ours, it may still help to kill p
> -	 * because p may have allocated or otherwise mapped memory on
> -	 * this node before. However it will be less likely.
> -	 */
> -	if (!has_intersects_mems_allowed(p))
> -		points /= 8;
> -
> -	/*
>  	 * Adjust the score by oom_adj.
>  	 */
>  	if (oom_adj) {
> @@ -277,6 +269,8 @@ static struct task_struct *select_bad_process(unsigned long *ppoints,
>  			continue;
>  		if (mem && !task_in_mem_cgroup(p, mem))
>  			continue;
> +		if (!has_intersects_mems_allowed(p))
> +			continue;
>  
>  		/*
>  		 * This task already has access to memory reserves and is

Pulled, but I'll merge my fix and append a historical remark.



* Re: [patch 06/18] oom: avoid sending exiting tasks a SIGKILL
  2010-06-06 22:34 ` [patch 06/18] oom: avoid sending exiting tasks a SIGKILL David Rientjes
@ 2010-06-08 11:41   ` KOSAKI Motohiro
  2010-06-08 18:48     ` David Rientjes
  2010-06-08 20:17   ` Andrew Morton
  2010-06-08 20:26   ` Oleg Nesterov
  2 siblings, 1 reply; 104+ messages in thread
From: KOSAKI Motohiro @ 2010-06-08 11:41 UTC (permalink / raw)
  To: David Rientjes
  Cc: kosaki.motohiro, Andrew Morton, Rik van Riel, Nick Piggin,
	Oleg Nesterov, Balbir Singh, KAMEZAWA Hiroyuki, linux-mm

> It's unnecessary to SIGKILL a task that is already PF_EXITING and can
> actually cause a NULL pointer dereference of the sighand if it has already
> been detached.  Instead, simply set TIF_MEMDIE so it has access to memory
> reserves and can quickly exit as the comment implies.
> 
> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> Signed-off-by: David Rientjes <rientjes@google.com>
> ---
>  mm/oom_kill.c |    2 +-
>  1 files changed, 1 insertions(+), 1 deletions(-)
> 
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -458,7 +458,7 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
>  	 * its children or threads, just set TIF_MEMDIE so it can die quickly
>  	 */
>  	if (p->flags & PF_EXITING) {
> -		__oom_kill_task(p, 0);
> +		set_tsk_thread_flag(p, TIF_MEMDIE);
>  		return 0;
>  	}
>  

I did not pull the PF_EXITING-related change.




* Re: [patch 08/18] oom: sacrifice child with highest badness score for parent
  2010-06-06 22:34 ` [patch 08/18] oom: sacrifice child with highest badness score for parent David Rientjes
@ 2010-06-08 11:41   ` KOSAKI Motohiro
  2010-06-08 18:53     ` David Rientjes
  2010-06-08 20:33   ` Andrew Morton
  1 sibling, 1 reply; 104+ messages in thread
From: KOSAKI Motohiro @ 2010-06-08 11:41 UTC (permalink / raw)
  To: David Rientjes
  Cc: kosaki.motohiro, Andrew Morton, Rik van Riel, Nick Piggin,
	Oleg Nesterov, Balbir Singh, KAMEZAWA Hiroyuki, linux-mm

> When a task is chosen for oom kill, the oom killer first attempts to
> sacrifice a child not sharing its parent's memory instead.  Unfortunately,
> this often kills in a seemingly random fashion based on the ordering of
> the selected task's child list.  Additionally, it is not guaranteed at all
> to free a large amount of memory that we need to prevent additional oom
> killing in the very near future.
> 
> Instead, we now only attempt to sacrifice the worst child not sharing its
> parent's memory, if one exists.  The worst child is indicated with the
> highest badness() score.  This serves two advantages: we kill a
> memory-hogging task more often, and we allow the configurable
> /proc/pid/oom_adj value to be considered as a factor in which child to
> kill.
> 
> Reviewers may observe that the previous implementation would iterate
> through the children and attempt to kill each until one was successful and
> then the parent if none were found while the new code simply kills the
> most memory-hogging task or the parent.  Note that the only time
> oom_kill_task() fails, however, is when a child does not have an mm or has
> a /proc/pid/oom_adj of OOM_DISABLE.  badness() returns 0 for both cases,
> so the final oom_kill_task() will always succeed.
> 
> Acked-by: Rik van Riel <riel@redhat.com>
> Acked-by: Nick Piggin <npiggin@suse.de>
> Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> Signed-off-by: David Rientjes <rientjes@google.com>
> ---
>  mm/oom_kill.c |   23 +++++++++++++++++------
>  1 files changed, 17 insertions(+), 6 deletions(-)
> 
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -441,8 +441,11 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
>  			    unsigned long points, struct mem_cgroup *mem,
>  			    const char *message)
>  {
> +	struct task_struct *victim = p;
>  	struct task_struct *c;
>  	struct task_struct *t = p;
> +	unsigned long victim_points = 0;
> +	struct timespec uptime;
>  
>  	if (printk_ratelimit())
>  		dump_header(p, gfp_mask, order, mem);
> @@ -456,22 +459,30 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
>  		return 0;
>  	}
>  
> -	printk(KERN_ERR "%s: kill process %d (%s) score %li or a child\n",
> -					message, task_pid_nr(p), p->comm, points);
> +	pr_err("%s: Kill process %d (%s) score %lu or sacrifice child\n",
> +		message, task_pid_nr(p), p->comm, points);
>  
> -	/* Try to kill a child first */
> +	/* Try to sacrifice the worst child first */
> +	do_posix_clock_monotonic_gettime(&uptime);
>  	do {
> +		unsigned long cpoints;
> +
>  		list_for_each_entry(c, &t->children, sibling) {
>  			if (c->mm == p->mm)
>  				continue;
>  			if (mem && !task_in_mem_cgroup(c, mem))
>  				continue;
> -			if (!oom_kill_task(c))
> -				return 0;
> +
> +			/* badness() returns 0 if the thread is unkillable */
> +			cpoints = badness(c, uptime.tv_sec);
> +			if (cpoints > victim_points) {
> +				victim = c;
> +				victim_points = cpoints;
> +			}
>  		}
>  	} while_each_thread(p, t);
>  
> -	return oom_kill_task(p);
> +	return oom_kill_task(victim);
>  }
>  
>  #ifdef CONFIG_CGROUP_MEM_RES_CTLR

A better version is already in my patch kit.



* Re: [patch 09/18] oom: select task from tasklist for mempolicy ooms
  2010-06-06 22:34 ` [patch 09/18] oom: select task from tasklist for mempolicy ooms David Rientjes
@ 2010-06-08 11:41   ` KOSAKI Motohiro
  2010-06-08 21:08   ` Andrew Morton
  2010-06-08 23:43   ` Andrew Morton
  2 siblings, 0 replies; 104+ messages in thread
From: KOSAKI Motohiro @ 2010-06-08 11:41 UTC (permalink / raw)
  To: David Rientjes
  Cc: kosaki.motohiro, Andrew Morton, Rik van Riel, Nick Piggin,
	Oleg Nesterov, Balbir Singh, KAMEZAWA Hiroyuki, linux-mm

> The oom killer presently kills current whenever there is no more memory
> free or reclaimable on its mempolicy's nodes.  There is no guarantee that
> current is a memory-hogging task or that killing it will free any
> substantial amount of memory, however.
> 
> In such situations, it is better to scan the tasklist for tasks that are
> allowed to allocate on current's set of nodes and kill the task with the
> highest badness() score.  This ensures that the most memory-hogging task,
> or the one configured by the user with /proc/pid/oom_adj, is always
> selected in such scenarios.
> 
> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> Signed-off-by: David Rientjes <rientjes@google.com>
> ---
>  include/linux/mempolicy.h |   13 +++++++-
>  mm/mempolicy.c            |   44 ++++++++++++++++++++++++
>  mm/oom_kill.c             |   80 +++++++++++++++++++++++++++-----------------
>  3 files changed, 105 insertions(+), 32 deletions(-)
> 
> diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
> --- a/include/linux/mempolicy.h
> +++ b/include/linux/mempolicy.h
> @@ -210,6 +210,8 @@ extern struct zonelist *huge_zonelist(struct vm_area_struct *vma,
>  				unsigned long addr, gfp_t gfp_flags,
>  				struct mempolicy **mpol, nodemask_t **nodemask);
>  extern bool init_nodemask_of_mempolicy(nodemask_t *mask);
> +extern bool mempolicy_nodemask_intersects(struct task_struct *tsk,
> +				const nodemask_t *mask);
>  extern unsigned slab_node(struct mempolicy *policy);
>  
>  extern enum zone_type policy_zone;
> @@ -338,7 +340,16 @@ static inline struct zonelist *huge_zonelist(struct vm_area_struct *vma,
>  	return node_zonelist(0, gfp_flags);
>  }
>  
> -static inline bool init_nodemask_of_mempolicy(nodemask_t *m) { return false; }
> +static inline bool init_nodemask_of_mempolicy(nodemask_t *m)
> +{
> +	return false;
> +}
> +
> +static inline bool mempolicy_nodemask_intersects(struct task_struct *tsk,
> +			const nodemask_t *mask)
> +{
> +	return false;
> +}
>  
>  static inline int do_migrate_pages(struct mm_struct *mm,
>  			const nodemask_t *from_nodes,
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -1712,6 +1712,50 @@ bool init_nodemask_of_mempolicy(nodemask_t *mask)
>  }
>  #endif
>  
> +/*
> + * mempolicy_nodemask_intersects
> + *
> + * If tsk's mempolicy is "default" [NULL], return 'true' to indicate default
> + * policy.  Otherwise, check for intersection between mask and the policy
> + * nodemask for 'bind' or 'interleave' policy.  For 'preferred' or 'local'
> + * policy, always return true since it may allocate elsewhere on fallback.
> + *
> + * Takes task_lock(tsk) to prevent freeing of its mempolicy.
> + */
> +bool mempolicy_nodemask_intersects(struct task_struct *tsk,
> +					const nodemask_t *mask)
> +{
> +	struct mempolicy *mempolicy;
> +	bool ret = true;
> +
> +	if (!mask)
> +		return ret;
> +	task_lock(tsk);
> +	mempolicy = tsk->mempolicy;
> +	if (!mempolicy)
> +		goto out;
> +
> +	switch (mempolicy->mode) {
> +	case MPOL_PREFERRED:
> +		/*
> +		 * MPOL_PREFERRED and MPOL_F_LOCAL are only preferred nodes to
> +		 * allocate from, they may fallback to other nodes when oom.
> +		 * Thus, it's possible for tsk to have allocated memory from
> +		 * nodes in mask.
> +		 */
> +		break;
> +	case MPOL_BIND:
> +	case MPOL_INTERLEAVE:
> +		ret = nodes_intersects(mempolicy->v.nodes, *mask);
> +		break;
> +	default:
> +		BUG();
> +	}
> +out:
> +	task_unlock(tsk);
> +	return ret;
> +}
> +
>  /* Allocate a page in interleaved policy.
>     Own path because it needs to do special accounting. */
>  static struct page *alloc_page_interleave(gfp_t gfp, unsigned order,
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -27,6 +27,7 @@
>  #include <linux/module.h>
>  #include <linux/notifier.h>
>  #include <linux/memcontrol.h>
> +#include <linux/mempolicy.h>
>  #include <linux/security.h>
>  
>  int sysctl_panic_on_oom;
> @@ -36,20 +37,36 @@ static DEFINE_SPINLOCK(zone_scan_lock);
>  /* #define DEBUG */
>  
>  /*
> - * Is all threads of the target process nodes overlap ours?
> + * Do all threads of the target process overlap our allowed nodes?
> + * @tsk: task struct of which task to consider
> + * @mask: nodemask passed to page allocator for mempolicy ooms
>   */
> -static int has_intersects_mems_allowed(struct task_struct *tsk)
> +static bool has_intersects_mems_allowed(struct task_struct *tsk,
> +					const nodemask_t *mask)
>  {
> -	struct task_struct *t;
> +	struct task_struct *start = tsk;
>  
> -	t = tsk;
>  	do {
> -		if (cpuset_mems_allowed_intersects(current, t))
> -			return 1;
> -		t = next_thread(t);
> -	} while (t != tsk);
> -
> -	return 0;
> +		if (mask) {
> +			/*
> +			 * If this is a mempolicy constrained oom, tsk's
> +			 * cpuset is irrelevant.  Only return true if its
> +			 * mempolicy intersects current, otherwise it may be
> +			 * needlessly killed.
> +			 */
> +			if (mempolicy_nodemask_intersects(tsk, mask))
> +				return true;
> +		} else {
> +			/*
> +			 * This is not a mempolicy constrained oom, so only
> +			 * check the mems of tsk's cpuset.
> +			 */
> +			if (cpuset_mems_allowed_intersects(current, tsk))
> +				return true;
> +		}
> +		tsk = next_thread(tsk);
> +	} while (tsk != start);
> +	return false;
>  }
>  
>  static struct task_struct *find_lock_task_mm(struct task_struct *p)
> @@ -253,7 +270,8 @@ static enum oom_constraint constrained_alloc(struct zonelist *zonelist,
>   * (not docbooked, we don't want this one cluttering up the manual)
>   */
>  static struct task_struct *select_bad_process(unsigned long *ppoints,
> -						struct mem_cgroup *mem)
> +		struct mem_cgroup *mem, enum oom_constraint constraint,
> +		const nodemask_t *mask)
>  {
>  	struct task_struct *p;
>  	struct task_struct *chosen = NULL;
> @@ -269,7 +287,9 @@ static struct task_struct *select_bad_process(unsigned long *ppoints,
>  			continue;
>  		if (mem && !task_in_mem_cgroup(p, mem))
>  			continue;
> -		if (!has_intersects_mems_allowed(p))
> +		if (!has_intersects_mems_allowed(p,
> +				constraint == CONSTRAINT_MEMORY_POLICY ? mask :
> +									 NULL))
>  			continue;
>  
>  		/*
> @@ -495,7 +515,7 @@ void mem_cgroup_out_of_memory(struct mem_cgroup *mem, gfp_t gfp_mask)
>  		panic("out of memory(memcg). panic_on_oom is selected.\n");
>  	read_lock(&tasklist_lock);
>  retry:
> -	p = select_bad_process(&points, mem);
> +	p = select_bad_process(&points, mem, CONSTRAINT_NONE, NULL);
>  	if (!p || PTR_ERR(p) == -1UL)
>  		goto out;
>  
> @@ -574,7 +594,8 @@ void clear_zonelist_oom(struct zonelist *zonelist, gfp_t gfp_mask)
>  /*
>   * Must be called with tasklist_lock held for read.
>   */
> -static void __out_of_memory(gfp_t gfp_mask, int order)
> +static void __out_of_memory(gfp_t gfp_mask, int order,
> +			enum oom_constraint constraint, const nodemask_t *mask)
>  {
>  	struct task_struct *p;
>  	unsigned long points;
> @@ -588,7 +609,7 @@ retry:
>  	 * Rambo mode: Shoot down a process and hope it solves whatever
>  	 * issues we may have.
>  	 */
> -	p = select_bad_process(&points, NULL);
> +	p = select_bad_process(&points, NULL, constraint, mask);
>  
>  	if (PTR_ERR(p) == -1UL)
>  		return;
> @@ -622,7 +643,8 @@ void pagefault_out_of_memory(void)
>  		panic("out of memory from page fault. panic_on_oom is selected.\n");
>  
>  	read_lock(&tasklist_lock);
> -	__out_of_memory(0, 0); /* unknown gfp_mask and order */
> +	/* unknown gfp_mask and order */
> +	__out_of_memory(0, 0, CONSTRAINT_NONE, NULL);
>  	read_unlock(&tasklist_lock);
>  
>  	/*
> @@ -638,6 +660,7 @@ void pagefault_out_of_memory(void)
>   * @zonelist: zonelist pointer
>   * @gfp_mask: memory allocation flags
>   * @order: amount of memory being requested as a power of 2
> + * @nodemask: nodemask passed to page allocator
>   *
>   * If we run out of memory, we have the choice between either
>   * killing a random task (bad), letting the system crash (worse)
> @@ -676,24 +699,19 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
>  	 */
>  	constraint = constrained_alloc(zonelist, gfp_mask, nodemask);
>  	read_lock(&tasklist_lock);
> -
> -	switch (constraint) {
> -	case CONSTRAINT_MEMORY_POLICY:
> -		oom_kill_process(current, gfp_mask, order, 0, NULL,
> -				"No available memory (MPOL_BIND)");
> -		break;
> -
> -	case CONSTRAINT_NONE:
> -		if (sysctl_panic_on_oom) {
> +	if (unlikely(sysctl_panic_on_oom)) {
> +		/*
> +		 * panic_on_oom only affects CONSTRAINT_NONE, the kernel
> +		 * should not panic for cpuset or mempolicy induced memory
> +		 * failures.
> +		 */
> +		if (constraint == CONSTRAINT_NONE) {
>  			dump_header(NULL, gfp_mask, order, NULL);
> -			panic("out of memory. panic_on_oom is selected\n");
> +			read_unlock(&tasklist_lock);
> +			panic("Out of memory: panic_on_oom is enabled\n");
>  		}
> -		/* Fall-through */
> -	case CONSTRAINT_CPUSET:
> -		__out_of_memory(gfp_mask, order);
> -		break;
>  	}
> -
> +	__out_of_memory(gfp_mask, order, constraint, nodemask);
>  	read_unlock(&tasklist_lock);
>  
>  	/*

Pulled.



* Re: [patch 10/18] oom: enable oom tasklist dump by default
  2010-06-06 22:34 ` [patch 10/18] oom: enable oom tasklist dump by default David Rientjes
@ 2010-06-08 11:42   ` KOSAKI Motohiro
  2010-06-08 18:56     ` David Rientjes
  2010-06-08 21:13   ` Andrew Morton
  1 sibling, 1 reply; 104+ messages in thread
From: KOSAKI Motohiro @ 2010-06-08 11:42 UTC (permalink / raw)
  To: David Rientjes
  Cc: kosaki.motohiro, Andrew Morton, Rik van Riel, Nick Piggin,
	Oleg Nesterov, Balbir Singh, KAMEZAWA Hiroyuki, linux-mm

> The oom killer tasklist dump, enabled with the oom_dump_tasks sysctl, is
> very helpful information in diagnosing why a user's task has been killed.
> It emits useful information such as each eligible thread's memory usage
> that can determine why the system is oom, so it should be enabled by
> default.
> 
> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> Signed-off-by: David Rientjes <rientjes@google.com>
> ---
>  Documentation/sysctl/vm.txt |    2 +-
>  mm/oom_kill.c               |    2 +-
>  2 files changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
> --- a/Documentation/sysctl/vm.txt
> +++ b/Documentation/sysctl/vm.txt
> @@ -511,7 +511,7 @@ information may not be desired.
>  If this is set to non-zero, this information is shown whenever the
>  OOM killer actually kills a memory-hogging task.
>  
> -The default value is 0.
> +The default value is 1 (enabled).
>  
>  ==============================================================
>  
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index ef048c1..833de48 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -32,7 +32,7 @@
>  
>  int sysctl_panic_on_oom;
>  int sysctl_oom_kill_allocating_task;
> -int sysctl_oom_dump_tasks;
> +int sysctl_oom_dump_tasks = 1;
>  static DEFINE_SPINLOCK(zone_scan_lock);
>  /* #define DEBUG */
>  

pulled.


* Re: [patch 11/18] oom: avoid oom killer for lowmem allocations
  2010-06-06 22:34 ` [patch 11/18] oom: avoid oom killer for lowmem allocations David Rientjes
@ 2010-06-08 11:42   ` KOSAKI Motohiro
  2010-06-08 21:19   ` Andrew Morton
  1 sibling, 0 replies; 104+ messages in thread
From: KOSAKI Motohiro @ 2010-06-08 11:42 UTC (permalink / raw)
  To: David Rientjes
  Cc: kosaki.motohiro, Andrew Morton, Rik van Riel, Nick Piggin,
	Oleg Nesterov, Balbir Singh, KAMEZAWA Hiroyuki, linux-mm

> If memory has been depleted in lowmem zones even with the protection
> afforded to it by /proc/sys/vm/lowmem_reserve_ratio, it is unlikely that
> killing current users will help.  The memory is either reclaimable (or
> migratable) already, in which case we should not invoke the oom killer at
> all, or it is pinned by an application for I/O.  Killing such an
> application may leave the hardware in an unspecified state and there is no
> guarantee that it will be able to make a timely exit.
> 
> Lowmem allocations are now failed in oom conditions when __GFP_NOFAIL is
> not used so that the task can perhaps recover or try again later.
> 
> Previously, the heuristic provided some protection for those tasks with
> CAP_SYS_RAWIO, but this is no longer necessary since we will not be
> killing tasks for the purposes of ISA allocations.
> 
> high_zoneidx is gfp_zone(gfp_flags), meaning that ZONE_NORMAL will be the
> default for all allocations that are not __GFP_DMA, __GFP_DMA32,
> __GFP_HIGHMEM, and __GFP_MOVABLE on kernels configured to support those
> flags.  Testing for high_zoneidx being less than ZONE_NORMAL will only
> return true for allocations that have either __GFP_DMA or __GFP_DMA32.
> 
> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> Signed-off-by: David Rientjes <rientjes@google.com>
> ---
>  mm/page_alloc.c |   29 ++++++++++++++++++++---------
>  1 files changed, 20 insertions(+), 9 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1759,6 +1759,9 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
>  		/* The OOM killer will not help higher order allocs */
>  		if (order > PAGE_ALLOC_COSTLY_ORDER)
>  			goto out;
> +		/* The OOM killer does not needlessly kill tasks for lowmem */
> +		if (high_zoneidx < ZONE_NORMAL)
> +			goto out;
>  		/*
>  		 * GFP_THISNODE contains __GFP_NORETRY and we never hit this.
>  		 * Sanity check for bare calls of __GFP_THISNODE, not real OOM.
> @@ -2052,15 +2055,23 @@ rebalance:
>  			if (page)
>  				goto got_pg;
>  
> -			/*
> -			 * The OOM killer does not trigger for high-order
> -			 * ~__GFP_NOFAIL allocations so if no progress is being
> -			 * made, there are no other options and retrying is
> -			 * unlikely to help.
> -			 */
> -			if (order > PAGE_ALLOC_COSTLY_ORDER &&
> -						!(gfp_mask & __GFP_NOFAIL))
> -				goto nopage;
> +			if (!(gfp_mask & __GFP_NOFAIL)) {
> +				/*
> +				 * The oom killer is not called for high-order
> +				 * allocations that may fail, so if no progress
> +				 * is being made, there are no other options and
> +				 * retrying is unlikely to help.
> +				 */
> +				if (order > PAGE_ALLOC_COSTLY_ORDER)
> +					goto nopage;
> +				/*
> +				 * The oom killer is not called for lowmem
> +				 * allocations to prevent needlessly killing
> +				 * innocent tasks.
> +				 */
> +				if (high_zoneidx < ZONE_NORMAL)
> +					goto nopage;
> +			}
>  
>  			goto restart;
>  		}
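To see why the `high_zoneidx < ZONE_NORMAL` test only fires for __GFP_DMA and __GFP_DMA32 allocations, here is a minimal userspace sketch of the zone mapping (the zone numbering and flag values are simplified assumptions for illustration, not the kernel's actual table-driven gfp_zone()):

```c
#include <assert.h>

/* Simplified zone indices, lowest (most constrained) first. */
enum zone_type { ZONE_DMA, ZONE_DMA32, ZONE_NORMAL };

/* Hypothetical flag values for the sketch only. */
#define __GFP_DMA     0x01u
#define __GFP_DMA32   0x02u
#define __GFP_HIGHMEM 0x04u

/* Toy gfp_zone(): map allocation flags to the highest usable zone index. */
static enum zone_type gfp_zone(unsigned int flags)
{
	if (flags & __GFP_DMA)
		return ZONE_DMA;
	if (flags & __GFP_DMA32)
		return ZONE_DMA32;
	return ZONE_NORMAL;	/* default; highmem/movable handling elided */
}

/* The patch's rule: invoke the oom killer only when the allocation
 * is allowed to use ZONE_NORMAL or above. */
static int oom_killer_allowed(unsigned int flags)
{
	return gfp_zone(flags) >= ZONE_NORMAL;
}
```

Under this model, only DMA-constrained allocations fail fast instead of killing tasks, matching the changelog's claim about the default for everything else.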

pulled.


* Re: [patch 12/18] oom: extract panic helper function
  2010-06-06 22:34 ` [patch 12/18] oom: extract panic helper function David Rientjes
@ 2010-06-08 11:42   ` KOSAKI Motohiro
  0 siblings, 0 replies; 104+ messages in thread
From: KOSAKI Motohiro @ 2010-06-08 11:42 UTC (permalink / raw)
  To: David Rientjes
  Cc: kosaki.motohiro, Andrew Morton, Rik van Riel, Nick Piggin,
	Oleg Nesterov, Balbir Singh, KAMEZAWA Hiroyuki, linux-mm

> There are various points in the oom killer where the kernel must
> determine whether to panic or not.  It's better to extract this to a
> helper function to remove all the confusion as to its semantics.
> 
> Also fix a call to dump_header() where tasklist_lock is not read-
> locked, as required.
> 
> There's no functional change with this patch.
> 
> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> Signed-off-by: David Rientjes <rientjes@google.com>
> ---
>  include/linux/oom.h |    1 +
>  mm/oom_kill.c       |   53 +++++++++++++++++++++++++++-----------------------
>  2 files changed, 30 insertions(+), 24 deletions(-)
> 
> diff --git a/include/linux/oom.h b/include/linux/oom.h
> --- a/include/linux/oom.h
> +++ b/include/linux/oom.h
> @@ -22,6 +22,7 @@ enum oom_constraint {
>  	CONSTRAINT_NONE,
>  	CONSTRAINT_CPUSET,
>  	CONSTRAINT_MEMORY_POLICY,
> +	CONSTRAINT_MEMCG,
>  };
>  
>  extern int try_set_zone_oom(struct zonelist *zonelist, gfp_t gfp_flags);
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -505,17 +505,40 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
>  	return oom_kill_task(victim);
>  }
>  
> +/*
> + * Determines whether the kernel must panic because of the panic_on_oom sysctl.
> + */
> +static void check_panic_on_oom(enum oom_constraint constraint, gfp_t gfp_mask,
> +				int order)
> +{
> +	if (likely(!sysctl_panic_on_oom))
> +		return;
> +	if (sysctl_panic_on_oom != 2) {
> +		/*
> +		 * panic_on_oom == 1 only affects CONSTRAINT_NONE, the kernel
> +		 * does not panic for cpuset, mempolicy, or memcg allocation
> +		 * failures.
> +		 */
> +		if (constraint != CONSTRAINT_NONE)
> +			return;
> +	}
> +	read_lock(&tasklist_lock);
> +	dump_header(NULL, gfp_mask, order, NULL);
> +	read_unlock(&tasklist_lock);
> +	panic("Out of memory: %s panic_on_oom is enabled\n",
> +		sysctl_panic_on_oom == 2 ? "compulsory" : "system-wide");
> +}
> +
>  #ifdef CONFIG_CGROUP_MEM_RES_CTLR
>  void mem_cgroup_out_of_memory(struct mem_cgroup *mem, gfp_t gfp_mask)
>  {
>  	unsigned long points = 0;
>  	struct task_struct *p;
>  
> -	if (sysctl_panic_on_oom == 2)
> -		panic("out of memory(memcg). panic_on_oom is selected.\n");
> +	check_panic_on_oom(CONSTRAINT_MEMCG, gfp_mask, 0);
>  	read_lock(&tasklist_lock);
>  retry:
> -	p = select_bad_process(&points, mem, CONSTRAINT_NONE, NULL);
> +	p = select_bad_process(&points, mem, CONSTRAINT_MEMCG, NULL);
>  	if (!p || PTR_ERR(p) == -1UL)
>  		goto out;
>  
> @@ -616,8 +639,8 @@ retry:
>  
>  	/* Found nothing?!?! Either we hang forever, or we panic. */
>  	if (!p) {
> -		read_unlock(&tasklist_lock);
>  		dump_header(NULL, gfp_mask, order, NULL);
> +		read_unlock(&tasklist_lock);
>  		panic("Out of memory and no killable processes...\n");
>  	}
>  
> @@ -639,9 +662,7 @@ void pagefault_out_of_memory(void)
>  		/* Got some memory back in the last second. */
>  		return;
>  
> -	if (sysctl_panic_on_oom)
> -		panic("out of memory from page fault. panic_on_oom is selected.\n");
> -
> +	check_panic_on_oom(CONSTRAINT_NONE, 0, 0);
>  	read_lock(&tasklist_lock);
>  	/* unknown gfp_mask and order */
>  	__out_of_memory(0, 0, CONSTRAINT_NONE, NULL);
> @@ -688,29 +709,13 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
>  		return;
>  	}
>  
> -	if (sysctl_panic_on_oom == 2) {
> -		dump_header(NULL, gfp_mask, order, NULL);
> -		panic("out of memory. Compulsory panic_on_oom is selected.\n");
> -	}
> -
>  	/*
>  	 * Check if there were limitations on the allocation (only relevant for
>  	 * NUMA) that may require different handling.
>  	 */
>  	constraint = constrained_alloc(zonelist, gfp_mask, nodemask);
> +	check_panic_on_oom(constraint, gfp_mask, order);
>  	read_lock(&tasklist_lock);
> -	if (unlikely(sysctl_panic_on_oom)) {
> -		/*
> -		 * panic_on_oom only affects CONSTRAINT_NONE, the kernel
> -		 * should not panic for cpuset or mempolicy induced memory
> -		 * failures.
> -		 */
> -		if (constraint == CONSTRAINT_NONE) {
> -			dump_header(NULL, gfp_mask, order, NULL);
> -			read_unlock(&tasklist_lock);
> -			panic("Out of memory: panic_on_oom is enabled\n");
> -		}
> -	}
>  	__out_of_memory(gfp_mask, order, constraint, nodemask);
>  	read_unlock(&tasklist_lock);
>  
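The panic decision that check_panic_on_oom() centralizes can be restated as a small pure predicate (a sketch of the logic only; the constraint names mirror the patch, the sysctl is passed as a plain argument):

```c
#include <assert.h>

enum oom_constraint {
	CONSTRAINT_NONE,
	CONSTRAINT_CPUSET,
	CONSTRAINT_MEMORY_POLICY,
	CONSTRAINT_MEMCG,
};

/* panic_on_oom == 0: never panic.
 * panic_on_oom == 1: panic only for unconstrained (system-wide) ooms.
 * panic_on_oom == 2: compulsory panic, regardless of constraint. */
static int should_panic(int sysctl_panic_on_oom, enum oom_constraint c)
{
	if (!sysctl_panic_on_oom)
		return 0;
	if (sysctl_panic_on_oom != 2 && c != CONSTRAINT_NONE)
		return 0;
	return 1;
}
```

This makes the semantics the helper is meant to unify easy to check at a glance: cpuset, mempolicy, and memcg failures never panic unless the compulsory mode is set.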

pulled.



* Re: [patch 13/18] oom: remove special handling for pagefault ooms
  2010-06-06 22:34 ` [patch 13/18] oom: remove special handling for pagefault ooms David Rientjes
@ 2010-06-08 11:42   ` KOSAKI Motohiro
  2010-06-08 18:57     ` David Rientjes
  2010-06-08 21:27   ` Andrew Morton
  1 sibling, 1 reply; 104+ messages in thread
From: KOSAKI Motohiro @ 2010-06-08 11:42 UTC (permalink / raw)
  To: David Rientjes
  Cc: kosaki.motohiro, Andrew Morton, Rik van Riel, Nick Piggin,
	Oleg Nesterov, Balbir Singh, KAMEZAWA Hiroyuki, linux-mm

> It is possible to remove the special pagefault oom handler by simply oom
> locking all system zones and then calling directly into out_of_memory().
> 
> All populated zones must have ZONE_OOM_LOCKED set, otherwise there is a
> parallel oom killing in progress that will lead to eventual memory freeing
> so it's not necessary to needlessly kill another task.  The context in
> which the pagefault is allocating memory is unknown to the oom killer, so
> this is done on a system-wide level.
> 
> If a task has already been oom killed and hasn't fully exited yet, this
> will be a no-op since select_bad_process() recognizes tasks across the
> system with TIF_MEMDIE set.
> 
> Acked-by: Nick Piggin <npiggin@suse.de>
> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> Signed-off-by: David Rientjes <rientjes@google.com>
> ---
>  mm/oom_kill.c |   86 +++++++++++++++++++++++++++++++++++++-------------------
>  1 files changed, 57 insertions(+), 29 deletions(-)
> 
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -615,6 +615,44 @@ void clear_zonelist_oom(struct zonelist *zonelist, gfp_t gfp_mask)
>  }
>  
>  /*
> + * Try to acquire the oom killer lock for all system zones.  Returns zero if a
> + * parallel oom killing is taking place, otherwise locks all zones and returns
> + * non-zero.
> + */
> +static int try_set_system_oom(void)
> +{
> +	struct zone *zone;
> +	int ret = 1;
> +
> +	spin_lock(&zone_scan_lock);
> +	for_each_populated_zone(zone)
> +		if (zone_is_oom_locked(zone)) {
> +			ret = 0;
> +			goto out;
> +		}
> +	for_each_populated_zone(zone)
> +		zone_set_flag(zone, ZONE_OOM_LOCKED);
> +out:
> +	spin_unlock(&zone_scan_lock);
> +	return ret;
> +}
> +
> +/*
> + * Clears ZONE_OOM_LOCKED for all system zones so that failed allocation
> + * attempts or page faults may now recall the oom killer, if necessary.
> + */
> +static void clear_system_oom(void)
> +{
> +	struct zone *zone;
> +
> +	spin_lock(&zone_scan_lock);
> +	for_each_populated_zone(zone)
> +		zone_clear_flag(zone, ZONE_OOM_LOCKED);
> +	spin_unlock(&zone_scan_lock);
> +}
> +
> +
> +/*
>   * Must be called with tasklist_lock held for read.
>   */
>  static void __out_of_memory(gfp_t gfp_mask, int order,
> @@ -649,33 +687,6 @@ retry:
>  		goto retry;
>  }
>  
> -/*
> - * pagefault handler calls into here because it is out of memory but
> - * doesn't know exactly how or why.
> - */
> -void pagefault_out_of_memory(void)
> -{
> -	unsigned long freed = 0;
> -
> -	blocking_notifier_call_chain(&oom_notify_list, 0, &freed);
> -	if (freed > 0)
> -		/* Got some memory back in the last second. */
> -		return;
> -
> -	check_panic_on_oom(CONSTRAINT_NONE, 0, 0);
> -	read_lock(&tasklist_lock);
> -	/* unknown gfp_mask and order */
> -	__out_of_memory(0, 0, CONSTRAINT_NONE, NULL);
> -	read_unlock(&tasklist_lock);
> -
> -	/*
> -	 * Give "p" a good chance of killing itself before we
> -	 * retry to allocate memory.
> -	 */
> -	if (!test_thread_flag(TIF_MEMDIE))
> -		schedule_timeout_uninterruptible(1);
> -}
> -
>  /**
>   * out_of_memory - kill the "best" process when we run out of memory
>   * @zonelist: zonelist pointer
> @@ -692,7 +703,7 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
>  		int order, nodemask_t *nodemask)
>  {
>  	unsigned long freed = 0;
> -	enum oom_constraint constraint;
> +	enum oom_constraint constraint = CONSTRAINT_NONE;
>  
>  	blocking_notifier_call_chain(&oom_notify_list, 0, &freed);
>  	if (freed > 0)
> @@ -713,7 +724,8 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
>  	 * Check if there were limitations on the allocation (only relevant for
>  	 * NUMA) that may require different handling.
>  	 */
> -	constraint = constrained_alloc(zonelist, gfp_mask, nodemask);
> +	if (zonelist)
> +		constraint = constrained_alloc(zonelist, gfp_mask, nodemask);
>  	check_panic_on_oom(constraint, gfp_mask, order);
>  	read_lock(&tasklist_lock);
>  	__out_of_memory(gfp_mask, order, constraint, nodemask);
> @@ -726,3 +738,19 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
>  	if (!test_thread_flag(TIF_MEMDIE))
>  		schedule_timeout_uninterruptible(1);
>  }
> +
> +/*
> + * The pagefault handler calls here because it is out of memory, so kill a
> + * memory-hogging task.  If a populated zone has ZONE_OOM_LOCKED set, a parallel
> + * oom killing is already in progress so do nothing.  If a task is found with
> + * TIF_MEMDIE set, it has been killed so do nothing and allow it to exit.
> + */
> +void pagefault_out_of_memory(void)
> +{
> +	if (try_set_system_oom()) {
> +		out_of_memory(NULL, 0, 0, NULL);
> +		clear_system_oom();
> +	}
> +	if (!test_thread_flag(TIF_MEMDIE))
> +		schedule_timeout_uninterruptible(1);
> +}
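The all-or-nothing locking in try_set_system_oom() can be modelled in userspace as follows (a simplified single-threaded sketch: the flag array and NR_ZONES are stand-ins, and zone_scan_lock is elided, whereas in the kernel both loops run under that spinlock so the check-then-set is atomic with respect to other lockers):

```c
#include <assert.h>
#include <string.h>

#define NR_ZONES 4			/* arbitrary for the sketch */

static int oom_locked[NR_ZONES];	/* stands in for ZONE_OOM_LOCKED */

/* Fail if any zone is already oom-locked (a parallel oom kill is in
 * progress), otherwise lock every zone and report success. */
static int try_set_system_oom(void)
{
	int i;

	for (i = 0; i < NR_ZONES; i++)
		if (oom_locked[i])
			return 0;
	for (i = 0; i < NR_ZONES; i++)
		oom_locked[i] = 1;
	return 1;
}

/* Unlock all zones so later allocation failures or page faults may
 * invoke the oom killer again. */
static void clear_system_oom(void)
{
	memset(oom_locked, 0, sizeof(oom_locked));
}
```

The key property is that a second caller observes failure until the first caller clears the locks, which is exactly what lets the pagefault path piggyback on out_of_memory() without racing a parallel kill.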

this one is already there in my patch kit.





* Re: [patch 14/18] oom: move sysctl declarations to oom.h
  2010-06-06 22:34 ` [patch 14/18] oom: move sysctl declarations to oom.h David Rientjes
@ 2010-06-08 11:42   ` KOSAKI Motohiro
  0 siblings, 0 replies; 104+ messages in thread
From: KOSAKI Motohiro @ 2010-06-08 11:42 UTC (permalink / raw)
  To: David Rientjes
  Cc: kosaki.motohiro, Andrew Morton, Rik van Riel, Nick Piggin,
	Oleg Nesterov, Balbir Singh, KAMEZAWA Hiroyuki, linux-mm

> The three oom killer sysctl variables (sysctl_oom_dump_tasks,
> sysctl_oom_kill_allocating_task, and sysctl_panic_on_oom) are better
> declared in include/linux/oom.h rather than kernel/sysctl.c.
> 
> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> Signed-off-by: David Rientjes <rientjes@google.com>
> ---
>  include/linux/oom.h |    5 +++++
>  kernel/sysctl.c     |    4 +---
>  2 files changed, 6 insertions(+), 3 deletions(-)
> 
> diff --git a/include/linux/oom.h b/include/linux/oom.h
> --- a/include/linux/oom.h
> +++ b/include/linux/oom.h
> @@ -44,5 +44,10 @@ static inline void oom_killer_enable(void)
>  {
>  	oom_killer_disabled = false;
>  }
> +
> +/* sysctls */
> +extern int sysctl_oom_dump_tasks;
> +extern int sysctl_oom_kill_allocating_task;
> +extern int sysctl_panic_on_oom;
>  #endif /* __KERNEL__*/
>  #endif /* _INCLUDE_LINUX_OOM_H */
> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> --- a/kernel/sysctl.c
> +++ b/kernel/sysctl.c
> @@ -55,6 +55,7 @@
>  #include <linux/perf_event.h>
>  #include <linux/kprobes.h>
>  #include <linux/pipe_fs_i.h>
> +#include <linux/oom.h>
>  
>  #include <asm/uaccess.h>
>  #include <asm/processor.h>
> @@ -87,9 +88,6 @@
>  /* External variables not in a header file. */
>  extern int sysctl_overcommit_memory;
>  extern int sysctl_overcommit_ratio;
> -extern int sysctl_panic_on_oom;
> -extern int sysctl_oom_kill_allocating_task;
> -extern int sysctl_oom_dump_tasks;
>  extern int max_threads;
>  extern int core_uses_pid;
>  extern int suid_dumpable;

pulled.


* Re: [patch 18/18] oom: deprecate oom_adj tunable
  2010-06-06 22:35 ` [patch 18/18] oom: deprecate oom_adj tunable David Rientjes
@ 2010-06-08 11:42   ` KOSAKI Motohiro
  2010-06-08 19:00     ` David Rientjes
  2010-06-08 23:18     ` Andrew Morton
  0 siblings, 2 replies; 104+ messages in thread
From: KOSAKI Motohiro @ 2010-06-08 11:42 UTC (permalink / raw)
  To: David Rientjes
  Cc: kosaki.motohiro, Andrew Morton, Rik van Riel, Nick Piggin,
	Oleg Nesterov, Balbir Singh, KAMEZAWA Hiroyuki, linux-mm

> +	/*
> +	 * Warn that /proc/pid/oom_adj is deprecated, see
> +	 * Documentation/feature-removal-schedule.txt.
> +	 */
> +	printk_once(KERN_WARNING "%s (%d): /proc/%d/oom_adj is deprecated, "
> +			"please use /proc/%d/oom_score_adj instead.\n",
> +			current->comm, task_pid_nr(current),
> +			task_pid_nr(task), task_pid_nr(task));
>  	task->signal->oom_adj = oom_adjust;

Sorry, we can't accept this. oom_adj is one of the most frequently used
tuning knobs; deprecating it creates a lot of confusion.

In addition, this knob is used by some applications (please search
Google Code Search or something similar). That said, an end user can't
stop the warning, and that causes a lot of frustration. NO.



* Re: [patch 05/18] oom: give current access to memory reserves if it has been killed
  2010-06-08 11:41   ` KOSAKI Motohiro
@ 2010-06-08 18:47     ` David Rientjes
  2010-06-14 11:08       ` KOSAKI Motohiro
  2010-06-08 20:12     ` Andrew Morton
  1 sibling, 1 reply; 104+ messages in thread
From: David Rientjes @ 2010-06-08 18:47 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Andrew Morton, Rik van Riel, Nick Piggin, Oleg Nesterov,
	Balbir Singh, KAMEZAWA Hiroyuki, linux-mm

On Tue, 8 Jun 2010, KOSAKI Motohiro wrote:

> > It's possible to livelock the page allocator if a thread has mm->mmap_sem
> > and fails to make forward progress because the oom killer selects another
> > thread sharing the same ->mm to kill that cannot exit until the semaphore
> > is dropped.
> > 
> > The oom killer will not kill multiple tasks at the same time; each oom
> > killed task must exit before another task may be killed.  Thus, if one
> > thread is holding mm->mmap_sem and cannot allocate memory, all threads
> > sharing the same ->mm are blocked from exiting as well.  In the oom kill
> > case, that means the thread holding mm->mmap_sem will never free
> > additional memory since it cannot get access to memory reserves and the
> > thread that depends on it with access to memory reserves cannot exit
> > because it cannot acquire the semaphore.  Thus, the page allocator
> > livelocks.
> > 
> > When the oom killer is called and current happens to have a pending
> > SIGKILL, this patch automatically gives it access to memory reserves and
> > returns.  Upon returning to the page allocator, its allocation will
> > hopefully succeed so it can quickly exit and free its memory.  If not, the
> > page allocator will fail the allocation if it is not __GFP_NOFAIL.
> > 
> > Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> > Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > Signed-off-by: David Rientjes <rientjes@google.com>
> > ---
> >  mm/oom_kill.c |   10 ++++++++++
> >  1 files changed, 10 insertions(+), 0 deletions(-)
> > 
> > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > --- a/mm/oom_kill.c
> > +++ b/mm/oom_kill.c
> > @@ -650,6 +650,16 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
> >  		/* Got some memory back in the last second. */
> >  		return;
> >  
> > +	/*
> > +	 * If current has a pending SIGKILL, then automatically select it.  The
> > +	 * goal is to allow it to allocate so that it may quickly exit and free
> > +	 * its memory.
> > +	 */
> > +	if (fatal_signal_pending(current)) {
> > +		set_thread_flag(TIF_MEMDIE);
> > +		return;
> > +	}
> > +
> >  	if (sysctl_panic_on_oom == 2) {
> >  		dump_header(NULL, gfp_mask, order, NULL);
> >  		panic("out of memory. Compulsory panic_on_oom is selected.\n");
> 
> Sorry, I have found this patch works incorrectly, so I didn't pull it.
> 

You're taking back your ack?

Why does this not work?  It's not killing a potentially immune task, the 
task is already dying.  We're simply giving it access to memory reserves 
so that it may quickly exit and die.  OOM_DISABLE does not imply that a 
task cannot exit on its own or be killed by another application or user, 
we simply don't want to needlessly kill another task when current is dying 
in the first place without being able to allocate memory.

Please reconsider your thought.


* Re: [patch 06/18] oom: avoid sending exiting tasks a SIGKILL
  2010-06-08 11:41   ` KOSAKI Motohiro
@ 2010-06-08 18:48     ` David Rientjes
  0 siblings, 0 replies; 104+ messages in thread
From: David Rientjes @ 2010-06-08 18:48 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Andrew Morton, Rik van Riel, Nick Piggin, Oleg Nesterov,
	Balbir Singh, KAMEZAWA Hiroyuki, linux-mm

On Tue, 8 Jun 2010, KOSAKI Motohiro wrote:

> > It's unnecessary to SIGKILL a task that is already PF_EXITING and can
> > actually cause a NULL pointer dereference of the sighand if it has already
> > been detached.  Instead, simply set TIF_MEMDIE so it has access to memory
> > reserves and can quickly exit as the comment implies.
> > 
> > Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > Signed-off-by: David Rientjes <rientjes@google.com>
> > ---
> >  mm/oom_kill.c |    2 +-
> >  1 files changed, 1 insertions(+), 1 deletions(-)
> > 
> > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > --- a/mm/oom_kill.c
> > +++ b/mm/oom_kill.c
> > @@ -458,7 +458,7 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
> >  	 * its children or threads, just set TIF_MEMDIE so it can die quickly
> >  	 */
> >  	if (p->flags & PF_EXITING) {
> > -		__oom_kill_task(p, 0);
> > +		set_tsk_thread_flag(p, TIF_MEMDIE);
> >  		return 0;
> >  	}
> >  
> 
> I didn't pull the PF_EXITING related things.
> 

What are you pulling?  You're not a maintainer!


* Re: [patch 07/18] oom: filter tasks not sharing the same cpuset
  2010-06-08 11:41   ` KOSAKI Motohiro
@ 2010-06-08 18:51     ` David Rientjes
  2010-06-08 19:27       ` Andrew Morton
  0 siblings, 1 reply; 104+ messages in thread
From: David Rientjes @ 2010-06-08 18:51 UTC (permalink / raw)
  To: KOSAKI Motohiro, Andrew Morton
  Cc: Rik van Riel, Nick Piggin, Oleg Nesterov, Balbir Singh,
	KAMEZAWA Hiroyuki, linux-mm

On Tue, 8 Jun 2010, KOSAKI Motohiro wrote:

> > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > --- a/mm/oom_kill.c
> > +++ b/mm/oom_kill.c
> > @@ -184,14 +184,6 @@ unsigned long badness(struct task_struct *p, unsigned long uptime)
> >  		points /= 4;
> >  
> >  	/*
> > -	 * If p's nodes don't overlap ours, it may still help to kill p
> > -	 * because p may have allocated or otherwise mapped memory on
> > -	 * this node before. However it will be less likely.
> > -	 */
> > -	if (!has_intersects_mems_allowed(p))
> > -		points /= 8;
> > -
> > -	/*
> >  	 * Adjust the score by oom_adj.
> >  	 */
> >  	if (oom_adj) {
> > @@ -277,6 +269,8 @@ static struct task_struct *select_bad_process(unsigned long *ppoints,
> >  			continue;
> >  		if (mem && !task_in_mem_cgroup(p, mem))
> >  			continue;
> > +		if (!has_intersects_mems_allowed(p))
> > +			continue;
> >  
> >  		/*
> >  		 * This task already has access to memory reserves and is
> 
> pulled, but I'll merge my fix and append a historical remark.
> 

Andrew, are you the maintainer for these fixes or is KOSAKI?

I've been posting this particular patch for at least three months with 
five acks:

Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Nick Piggin <npiggin@suse.de>
Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

and now he's saying he'll merge his own fix and rewrite the changelog and 
pull it?


* Re: [patch 08/18] oom: sacrifice child with highest badness score for parent
  2010-06-08 11:41   ` KOSAKI Motohiro
@ 2010-06-08 18:53     ` David Rientjes
  0 siblings, 0 replies; 104+ messages in thread
From: David Rientjes @ 2010-06-08 18:53 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Andrew Morton, Rik van Riel, Nick Piggin, Oleg Nesterov,
	Balbir Singh, KAMEZAWA Hiroyuki, linux-mm

On Tue, 8 Jun 2010, KOSAKI Motohiro wrote:

> > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > --- a/mm/oom_kill.c
> > +++ b/mm/oom_kill.c
> > @@ -441,8 +441,11 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
> >  			    unsigned long points, struct mem_cgroup *mem,
> >  			    const char *message)
> >  {
> > +	struct task_struct *victim = p;
> >  	struct task_struct *c;
> >  	struct task_struct *t = p;
> > +	unsigned long victim_points = 0;
> > +	struct timespec uptime;
> >  
> >  	if (printk_ratelimit())
> >  		dump_header(p, gfp_mask, order, mem);
> > @@ -456,22 +459,30 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
> >  		return 0;
> >  	}
> >  
> > -	printk(KERN_ERR "%s: kill process %d (%s) score %li or a child\n",
> > -					message, task_pid_nr(p), p->comm, points);
> > +	pr_err("%s: Kill process %d (%s) score %lu or sacrifice child\n",
> > +		message, task_pid_nr(p), p->comm, points);
> >  
> > -	/* Try to kill a child first */
> > +	/* Try to sacrifice the worst child first */
> > +	do_posix_clock_monotonic_gettime(&uptime);
> >  	do {
> > +		unsigned long cpoints;
> > +
> >  		list_for_each_entry(c, &t->children, sibling) {
> >  			if (c->mm == p->mm)
> >  				continue;
> >  			if (mem && !task_in_mem_cgroup(c, mem))
> >  				continue;
> > -			if (!oom_kill_task(c))
> > -				return 0;
> > +
> > +			/* badness() returns 0 if the thread is unkillable */
> > +			cpoints = badness(c, uptime.tv_sec);
> > +			if (cpoints > victim_points) {
> > +				victim = c;
> > +				victim_points = cpoints;
> > +			}
> >  		}
> >  	} while_each_thread(p, t);
> >  
> > -	return oom_kill_task(p);
> > +	return oom_kill_task(victim);
> >  }
> >  
> >  #ifdef CONFIG_CGROUP_MEM_RES_CTLR
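The selection loop above — scan every child and sacrifice the one with the highest badness score, falling back to the parent when no child is killable — can be condensed to this userspace sketch (the task struct, sibling list, and precomputed badness field are stand-ins for the kernel's list_for_each_entry walk and badness() call):

```c
#include <assert.h>
#include <stddef.h>

struct task {
	unsigned long badness;	/* 0 means the thread is unkillable */
	struct task *next;	/* sibling list */
};

/* Return the child with the highest nonzero badness score, or the
 * parent itself when no child qualifies. */
static struct task *pick_victim(struct task *parent, struct task *children)
{
	struct task *victim = parent;
	unsigned long victim_points = 0;
	struct task *c;

	for (c = children; c; c = c->next)
		if (c->badness > victim_points) {
			victim = c;
			victim_points = c->badness;
		}
	return victim;
}
```

Compared to the old behavior of killing the first child found, this ensures the sacrificed child is the one most likely to free significant memory.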
> 
> a better version is already there in my patch kit.
> 

Would you like to review this one?


* Re: [patch 10/18] oom: enable oom tasklist dump by default
  2010-06-08 11:42   ` KOSAKI Motohiro
@ 2010-06-08 18:56     ` David Rientjes
  0 siblings, 0 replies; 104+ messages in thread
From: David Rientjes @ 2010-06-08 18:56 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Andrew Morton, Rik van Riel, Nick Piggin, Oleg Nesterov,
	Balbir Singh, KAMEZAWA Hiroyuki, linux-mm

On Tue, 8 Jun 2010, KOSAKI Motohiro wrote:

> > diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
> > --- a/Documentation/sysctl/vm.txt
> > +++ b/Documentation/sysctl/vm.txt
> > @@ -511,7 +511,7 @@ information may not be desired.
> >  If this is set to non-zero, this information is shown whenever the
> >  OOM killer actually kills a memory-hogging task.
> >  
> > -The default value is 0.
> > +The default value is 1 (enabled).
> >  
> >  ==============================================================
> >  
> > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > index ef048c1..833de48 100644
> > --- a/mm/oom_kill.c
> > +++ b/mm/oom_kill.c
> > @@ -32,7 +32,7 @@
> >  
> >  int sysctl_panic_on_oom;
> >  int sysctl_oom_kill_allocating_task;
> > -int sysctl_oom_dump_tasks;
> > +int sysctl_oom_dump_tasks = 1;
> >  static DEFINE_SPINLOCK(zone_scan_lock);
> >  /* #define DEBUG */
> >  
> 
> pulled.
> 

What the heck?  You're not a maintainer, what are you pulling?


* Re: [patch 13/18] oom: remove special handling for pagefault ooms
  2010-06-08 11:42   ` KOSAKI Motohiro
@ 2010-06-08 18:57     ` David Rientjes
  0 siblings, 0 replies; 104+ messages in thread
From: David Rientjes @ 2010-06-08 18:57 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Andrew Morton, Rik van Riel, Nick Piggin, Oleg Nesterov,
	Balbir Singh, KAMEZAWA Hiroyuki, linux-mm

On Tue, 8 Jun 2010, KOSAKI Motohiro wrote:

> this one is already there in my patch kit.
> 

I think you need a reality check in your position as a kernel hacker and 
not a kernel maintainer.

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: [patch 18/18] oom: deprecate oom_adj tunable
  2010-06-08 11:42   ` KOSAKI Motohiro
@ 2010-06-08 19:00     ` David Rientjes
  2010-06-08 23:18     ` Andrew Morton
  1 sibling, 0 replies; 104+ messages in thread
From: David Rientjes @ 2010-06-08 19:00 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Andrew Morton, Rik van Riel, Nick Piggin, Oleg Nesterov,
	Balbir Singh, KAMEZAWA Hiroyuki, linux-mm

On Tue, 8 Jun 2010, KOSAKI Motohiro wrote:

> > +	/*
> > +	 * Warn that /proc/pid/oom_adj is deprecated, see
> > +	 * Documentation/feature-removal-schedule.txt.
> > +	 */
> > +	printk_once(KERN_WARNING "%s (%d): /proc/%d/oom_adj is deprecated, "
> > +			"please use /proc/%d/oom_score_adj instead.\n",
> > +			current->comm, task_pid_nr(current),
> > +			task_pid_nr(task), task_pid_nr(task));
> >  	task->signal->oom_adj = oom_adjust;
> 
> Sorry, we can't accept this.  oom_adj is one of the most frequently used
> tuning knobs.  Deprecating it will cause a lot of confusion.
> 

We?  Who are you representing?

The deprecation of this tunable was suggested by Andrew since it is 
replaced with a more powerful and finer-grained tunable, oom_score_adj.  
The deprecation date is two years from now which gives plenty of 
opportunity for users to use the new, well-documented interface.

> In addition, this knob is used by some applications (see Google Code
> Search or similar).  That said, an end user can't stop the warning,
> which causes a lot of frustration.  NO.
> 

They can report it over the two year period and hopefully get it fixed 
up, this isn't a BUG(), it's a printk_once().

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: [patch 07/18] oom: filter tasks not sharing the same cpuset
  2010-06-08 18:51     ` David Rientjes
@ 2010-06-08 19:27       ` Andrew Morton
  2010-06-13 11:24         ` KOSAKI Motohiro
  0 siblings, 1 reply; 104+ messages in thread
From: Andrew Morton @ 2010-06-08 19:27 UTC (permalink / raw)
  To: David Rientjes
  Cc: KOSAKI Motohiro, Rik van Riel, Nick Piggin, Oleg Nesterov,
	Balbir Singh, KAMEZAWA Hiroyuki, linux-mm

On Tue, 8 Jun 2010 11:51:32 -0700 (PDT)
David Rientjes <rientjes@google.com> wrote:

> Andrew, are you the maintainer for these fixes or is KOSAKI?

I am, thanks.  Kosaki-san, you're making this harder than it should be.
Please either ack David's patches or promptly work with him on
finalising them.

I realise that you have additional oom-killer patches but it's too
complex to try to work on two patch series concurrently.  So let's
concentrate on getting David's work sorted out and merged, and then please
rebase yours on the result.

I certainly don't have the time or inclination to go through two
patchsets and work out what the similarities and differences are so
I'll be concentrating on David's ones first.  The order in which we
do this doesn't really matter.

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: [patch 01/18] oom: check PF_KTHREAD instead of !mm to skip kthreads
  2010-06-06 22:34 ` [patch 01/18] oom: check PF_KTHREAD instead of !mm to skip kthreads David Rientjes
  2010-06-07 12:12   ` Balbir Singh
@ 2010-06-08 19:33   ` Andrew Morton
  2010-06-08 23:40     ` David Rientjes
  1 sibling, 1 reply; 104+ messages in thread
From: Andrew Morton @ 2010-06-08 19:33 UTC (permalink / raw)
  To: David Rientjes
  Cc: Rik van Riel, Nick Piggin, Oleg Nesterov, Balbir Singh,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, linux-mm

On Sun, 6 Jun 2010 15:34:00 -0700 (PDT)
David Rientjes <rientjes@google.com> wrote:

> From: Oleg Nesterov <oleg@redhat.com>
> 
> select_bad_process() thinks a kernel thread can't have ->mm != NULL, but
> this is not true due to use_mm().
> 
> Change the code to check PF_KTHREAD.
> 
> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> Signed-off-by: Oleg Nesterov <oleg@redhat.com>
> Signed-off-by: David Rientjes <rientjes@google.com>
> ---
>  mm/oom_kill.c |    9 +++------
>  1 files changed, 3 insertions(+), 6 deletions(-)
> 
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -256,14 +256,11 @@ static struct task_struct *select_bad_process(unsigned long *ppoints,
>  	for_each_process(p) {
>  		unsigned long points;
>  
> -		/*
> -		 * skip kernel threads and tasks which have already released
> -		 * their mm.
> -		 */
> +		/* skip tasks that have already released their mm */
>  		if (!p->mm)
>  			continue;
> -		/* skip the init task */
> -		if (is_global_init(p))
> +		/* skip the init task and kthreads */
> +		if (is_global_init(p) || (p->flags & PF_KTHREAD))
>  			continue;
>  		if (mem && !task_in_mem_cgroup(p, mem))
>  			continue;

Applied, thanks.  A minor bugfix.

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: [patch 02/18] oom: introduce find_lock_task_mm() to fix !mm false positives
  2010-06-06 22:34 ` [patch 02/18] oom: introduce find_lock_task_mm() to fix !mm false positives David Rientjes
  2010-06-07 12:58   ` Balbir Singh
@ 2010-06-08 19:42   ` Andrew Morton
  2010-06-08 20:14     ` Oleg Nesterov
  2010-06-08 23:50     ` David Rientjes
  1 sibling, 2 replies; 104+ messages in thread
From: Andrew Morton @ 2010-06-08 19:42 UTC (permalink / raw)
  To: David Rientjes
  Cc: Rik van Riel, Nick Piggin, Oleg Nesterov, Balbir Singh,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, linux-mm

On Sun, 6 Jun 2010 15:34:03 -0700 (PDT)
David Rientjes <rientjes@google.com> wrote:

> From: Oleg Nesterov <oleg@redhat.com>
> 
> Almost all ->mm == NULL checks in oom_kill.c are wrong.
> 
> The current code assumes that the task without ->mm has already
> released its memory and ignores the process. However this is not
> necessarily true when this process is multithreaded; other live
> sub-threads can use this ->mm.
> 
> - Remove the "if (!p->mm)" check in select_bad_process(), it is
>   just wrong.
> 
> - Add the new helper, find_lock_task_mm(), which finds the live
>   thread which uses the memory and takes task_lock() to pin ->mm
> 
> - change oom_badness() to use this helper instead of just checking
>   ->mm != NULL.
> 
> - As David pointed out, select_bad_process() must never choose the
>   task without ->mm, but no matter what oom_badness() returns the
>   task can be chosen if nothing else has been found yet.
> 
>   Change oom_badness() to return int, change it to return -1 if
>   find_lock_task_mm() fails, and change select_bad_process() to
>   check points >= 0.
> 
> Note! This patch is not enough, we need more changes.
> 
> 	- oom_badness() was fixed, but oom_kill_task() still ignores
> 	  the task without ->mm
> 
> 	- oom_forkbomb_penalty() should use find_lock_task_mm() too,
> 	  and it also needs other changes to actually find the
> 	  first-descendant children
> 
> This will be addressed later.
> 
> [kosaki.motohiro@jp.fujitsu.com: use in badness(), __oom_kill_task()]
> Signed-off-by: Oleg Nesterov <oleg@redhat.com>
> Signed-off-by: David Rientjes <rientjes@google.com>

I assume from the above that we should have a Signed-off-by:kosaki
here.  I didn't make that change yet - please advise.


>  mm/oom_kill.c |   74 +++++++++++++++++++++++++++++++++------------------------
>  1 files changed, 43 insertions(+), 31 deletions(-)
> 
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -52,6 +52,20 @@ static int has_intersects_mems_allowed(struct task_struct *tsk)
>  	return 0;
>  }
>  
> +static struct task_struct *find_lock_task_mm(struct task_struct *p)
> +{
> +	struct task_struct *t = p;
> +
> +	do {
> +		task_lock(t);
> +		if (likely(t->mm))
> +			return t;
> +		task_unlock(t);
> +	} while_each_thread(p, t);
> +
> +	return NULL;
> +}

What pins `p'?  Ah, caller must hold tasklist_lock.

>  /**
>   * badness - calculate a numeric value for how bad this task has been
>   * @p: task struct of which task we should calculate
> @@ -74,8 +88,8 @@ static int has_intersects_mems_allowed(struct task_struct *tsk)
>  unsigned long badness(struct task_struct *p, unsigned long uptime)
>  {
>  	unsigned long points, cpu_time, run_time;
> -	struct mm_struct *mm;
>  	struct task_struct *child;
> +	struct task_struct *c, *t;
>  	int oom_adj = p->signal->oom_adj;
>  	struct task_cputime task_time;
>  	unsigned long utime;
> @@ -84,17 +98,14 @@ unsigned long badness(struct task_struct *p, unsigned long uptime)
>  	if (oom_adj == OOM_DISABLE)
>  		return 0;
>  
> -	task_lock(p);
> -	mm = p->mm;
> -	if (!mm) {
> -		task_unlock(p);
> +	p = find_lock_task_mm(p);
> +	if (!p)
>  		return 0;
> -	}
>  
>  	/*
>  	 * The memory size of the process is the basis for the badness.
>  	 */
> -	points = mm->total_vm;
> +	points = p->mm->total_vm;
>  
>  	/*
>  	 * After this unlock we can no longer dereference local variable `mm'

This comment is stale.  Replace with p->mm.

> @@ -115,12 +126,17 @@ unsigned long badness(struct task_struct *p, unsigned long uptime)
>  	 * child is eating the vast majority of memory, adding only half
>  	 * to the parents will make the child our kill candidate of choice.
>  	 */
> -	list_for_each_entry(child, &p->children, sibling) {
> -		task_lock(child);
> -		if (child->mm != mm && child->mm)
> -			points += child->mm->total_vm/2 + 1;
> -		task_unlock(child);
> -	}
> +	t = p;
> +	do {
> +		list_for_each_entry(c, &t->children, sibling) {
> +			child = find_lock_task_mm(c);
> +			if (child) {
> +				if (child->mm != p->mm)
> +					points += child->mm->total_vm/2 + 1;

What if 1000 children share the same mm?  Doesn't this give a grossly
wrong result?

> +				task_unlock(child);
> +			}
> +		}
> +	} while_each_thread(p, t);
>  
>  	/*
>  	 * CPU time is in tens of seconds and run time is in thousands
> @@ -256,9 +272,6 @@ static struct task_struct *select_bad_process(unsigned long *ppoints,
>  	for_each_process(p) {
>  		unsigned long points;
>  
> -		/* skip tasks that have already released their mm */
> -		if (!p->mm)
> -			continue;
>  		/* skip the init task and kthreads */
>  		if (is_global_init(p) || (p->flags & PF_KTHREAD))
>  			continue;
> @@ -385,14 +398,9 @@ static void __oom_kill_task(struct task_struct *p, int verbose)
>  		return;
>  	}
>  
> -	task_lock(p);
> -	if (!p->mm) {
> -		WARN_ON(1);
> -		printk(KERN_WARNING "tried to kill an mm-less task %d (%s)!\n",
> -			task_pid_nr(p), p->comm);
> -		task_unlock(p);
> +	p = find_lock_task_mm(p);
> +	if (!p)
>  		return;
> -	}
>  
>  	if (verbose)
>  		printk(KERN_ERR "Killed process %d (%s) "
> @@ -437,6 +445,7 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
>  			    const char *message)
>  {
>  	struct task_struct *c;
> +	struct task_struct *t = p;
>  
>  	if (printk_ratelimit())
>  		dump_header(p, gfp_mask, order, mem);
> @@ -454,14 +463,17 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
>  					message, task_pid_nr(p), p->comm, points);
>  
>  	/* Try to kill a child first */

It'd be nice to improve the comments a bit.  This one tells us the
"what" (which is usually obvious) but didn't tell us "why", which is
often the unobvious.

> -	list_for_each_entry(c, &p->children, sibling) {
> -		if (c->mm == p->mm)
> -			continue;
> -		if (mem && !task_in_mem_cgroup(c, mem))
> -			continue;
> -		if (!oom_kill_task(c))
> -			return 0;
> -	}
> +	do {
> +		list_for_each_entry(c, &t->children, sibling) {
> +			if (c->mm == p->mm)
> +				continue;
> +			if (mem && !task_in_mem_cgroup(c, mem))
> +				continue;
> +			if (!oom_kill_task(c))
> +				return 0;
> +		}
> +	} while_each_thread(p, t);
> +
>  	return oom_kill_task(p);
>  }

I'll apply this for now..

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: [patch 03/18] oom: dump_tasks use find_lock_task_mm too
  2010-06-06 22:34 ` [patch 03/18] oom: dump_tasks use find_lock_task_mm too David Rientjes
@ 2010-06-08 19:55   ` Andrew Morton
  2010-06-09  0:06     ` David Rientjes
  0 siblings, 1 reply; 104+ messages in thread
From: Andrew Morton @ 2010-06-08 19:55 UTC (permalink / raw)
  To: David Rientjes
  Cc: Rik van Riel, Nick Piggin, Oleg Nesterov, Balbir Singh,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, linux-mm

On Sun, 6 Jun 2010 15:34:12 -0700 (PDT)
David Rientjes <rientjes@google.com> wrote:

> From: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> 
> dump_task() should use find_lock_task_mm() too.  It is necessary to
> protect against the task-exiting race.

A full description of the race would help people understand the code
and the change.

> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> Signed-off-by: David Rientjes <rientjes@google.com>
> ---
>  mm/oom_kill.c |   39 +++++++++++++++++++++------------------
>  1 files changed, 21 insertions(+), 18 deletions(-)
> 
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -336,35 +336,38 @@ static struct task_struct *select_bad_process(unsigned long *ppoints,
>   */
>  static void dump_tasks(const struct mem_cgroup *mem)

The comment over this function needs to be updated to describe the role
of incoming argument `mem'.

>  {
> -	struct task_struct *g, *p;
> +	struct task_struct *p;
> +	struct task_struct *task;
>  
>  	printk(KERN_INFO "[ pid ]   uid  tgid total_vm      rss cpu oom_adj "
>  	       "name\n");
> -	do_each_thread(g, p) {
> -		struct mm_struct *mm;
> -
> -		if (mem && !task_in_mem_cgroup(p, mem))
> +	for_each_process(p) {

The switch from do_each_thread() to for_each_process() is
unchangelogged.  It looks like a little cleanup to me.

> +		/*
> +		 * We don't have is_global_init() check here, because the old
> +		 * code do that. printing init process is not big matter. But
> +		 * we don't hope to make unnecessary compatibility breaking.
> +		 */

When merging others' patches, please do review and if necessary fix or
enhance the comments and the changelog.  I don't think people take
offense.


Also, I don't think it's really valuable to document *changes* within
the code comments.  This comment is referring to what the old code did
versus the new code.  Generally it's best to just document the code as
it presently stands and leave the documentation of the delta to the
changelog.

That's not always true, of course - we should document oddball code
which is left there for userspace-visible back-compatibility reasons.


> +		if (p->flags & PF_KTHREAD)
>  			continue;
> -		if (!thread_group_leader(p))
> +		if (mem && !task_in_mem_cgroup(p, mem))
>  			continue;
>  
> -		task_lock(p);
> -		mm = p->mm;
> -		if (!mm) {
> +		task = find_lock_task_mm(p);
> +		if (!task) {
>  			/*
> -			 * total_vm and rss sizes do not exist for tasks with no
> -			 * mm so there's no need to report them; they can't be
> -			 * oom killed anyway.
> +			 * Probably oom vs task-exiting race was happen and ->mm
> +			 * have been detached. thus there's no need to report
> +			 * them; they can't be oom killed anyway.
>  			 */

OK, that hinted at the race but still didn't really tell readers what it is.

> -			task_unlock(p);
>  			continue;
>  		}
> +
>  		printk(KERN_INFO "[%5d] %5d %5d %8lu %8lu %3d     %3d %s\n",
> -		       p->pid, __task_cred(p)->uid, p->tgid, mm->total_vm,
> -		       get_mm_rss(mm), (int)task_cpu(p), p->signal->oom_adj,
> -		       p->comm);
> -		task_unlock(p);
> -	} while_each_thread(g, p);
> +		       task->pid, __task_cred(task)->uid, task->tgid,
> +		       task->mm->total_vm, get_mm_rss(task->mm),
> +		       (int)task_cpu(task), task->signal->oom_adj, p->comm);

No need to cast the task_cpu() return value - just use %u.

> +		task_unlock(task);
> +	}
>  }
>  
>  static void dump_header(struct task_struct *p, gfp_t gfp_mask, int order,

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: [patch 04/18] oom: PF_EXITING check should take mm into account
  2010-06-06 22:34 ` [patch 04/18] oom: PF_EXITING check should take mm into account David Rientjes
@ 2010-06-08 20:00   ` Andrew Morton
  0 siblings, 0 replies; 104+ messages in thread
From: Andrew Morton @ 2010-06-08 20:00 UTC (permalink / raw)
  To: David Rientjes
  Cc: Rik van Riel, Nick Piggin, Oleg Nesterov, Balbir Singh,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, linux-mm

On Sun, 6 Jun 2010 15:34:15 -0700 (PDT)
David Rientjes <rientjes@google.com> wrote:

> From: Oleg Nesterov <oleg@redhat.com>
> 
> select_bad_process() checks PF_EXITING to detect the task which is going
> to release its memory, but the logic is very wrong.
> 
> 	- a single process P with the dead group leader disables
> 	  select_bad_process() completely, it will always return
> 	  ERR_PTR() while P can live forever
> 
> 	- if the PF_EXITING task has already released its ->mm
> 	  it doesn't make sense to expect it is going to free
> 	  more memory (except task_struct/etc)
> 
> Change the code to ignore the PF_EXITING tasks without ->mm.
> 
> Signed-off-by: Oleg Nesterov <oleg@redhat.com>
> Signed-off-by: David Rientjes <rientjes@google.com>
> ---
>  mm/oom_kill.c |    2 +-
>  1 files changed, 1 insertions(+), 1 deletions(-)
> 
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -300,7 +300,7 @@ static struct task_struct *select_bad_process(unsigned long *ppoints,
>  		 * the process of exiting and releasing its resources.
>  		 * Otherwise we could get an easy OOM deadlock.
>  		 */
> -		if (p->flags & PF_EXITING) {
> +		if ((p->flags & PF_EXITING) && p->mm) {
>  			if (p != current)
>  				return ERR_PTR(-1UL);

Looks good to me.

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: [patch 05/18] oom: give current access to memory reserves if it has been killed
  2010-06-06 22:34 ` [patch 05/18] oom: give current access to memory reserves if it has been killed David Rientjes
  2010-06-08 11:41   ` KOSAKI Motohiro
@ 2010-06-08 20:08   ` Andrew Morton
  2010-06-09  0:14     ` David Rientjes
  1 sibling, 1 reply; 104+ messages in thread
From: Andrew Morton @ 2010-06-08 20:08 UTC (permalink / raw)
  To: David Rientjes
  Cc: Rik van Riel, Nick Piggin, Oleg Nesterov, Balbir Singh,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, linux-mm

On Sun, 6 Jun 2010 15:34:18 -0700 (PDT)
David Rientjes <rientjes@google.com> wrote:

> It's possible to livelock the page allocator if a thread has mm->mmap_sem

What is the state of this thread?  Trying to allocate memory, I assume.  

> and fails to make forward progress because the oom killer selects another
> thread sharing the same ->mm to kill that cannot exit until the semaphore
> is dropped.
> 
> The oom killer will not kill multiple tasks at the same time; each oom
> killed task must exit before another task may be killed.

This sounds like a quite risky design.  The possibility that we'll
cause other dead/livelocks similar to this one seems pretty high.  It
applies to all sleeping locks in the entire kernel, doesn't it?

If so: it's unfortunate that the kernel doesn't distinguish between
D-state-for-locks and D-state-for-disk-io.  Otherwise we could just
skip over D-state-for-locks processes.

Or maybe I'm wrong ;)

>  Thus, if one
> thread is holding mm->mmap_sem and cannot allocate memory, all threads
> sharing the same ->mm are blocked from exiting as well.  In the oom kill
> case, that means the thread holding mm->mmap_sem will never free
> additional memory since it cannot get access to memory reserves and the
> thread that depends on it with access to memory reserves cannot exit
> because it cannot acquire the semaphore.  Thus, the page allocator
> livelocks.
> 
> When the oom killer is called and current happens to have a pending
> SIGKILL, this patch automatically gives it access to memory reserves and
> returns.  Upon returning to the page allocator, its allocation will
> hopefully succeed so it can quickly exit and free its memory.  If not, the
> page allocator will fail the allocation if it is not __GFP_NOFAIL.

You said "hopefully".

Does it actually work?  Any real-world testing results?  If so, they'd
be a useful addition to the changelog.

> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> Signed-off-by: David Rientjes <rientjes@google.com>
> ---
>  mm/oom_kill.c |   10 ++++++++++
>  1 files changed, 10 insertions(+), 0 deletions(-)
> 
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -650,6 +650,16 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
>  		/* Got some memory back in the last second. */
>  		return;
>  
> +	/*
> +	 * If current has a pending SIGKILL, then automatically select it.  The
> +	 * goal is to allow it to allocate so that it may quickly exit and free
> +	 * its memory.
> +	 */
> +	if (fatal_signal_pending(current)) {
> +		set_thread_flag(TIF_MEMDIE);
> +		return;
> +	}
> +
>  	if (sysctl_panic_on_oom == 2) {
>  		dump_header(NULL, gfp_mask, order, NULL);
>  		panic("out of memory. Compulsory panic_on_oom is selected.\n");

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: [patch 05/18] oom: give current access to memory reserves if it has been killed
  2010-06-08 11:41   ` KOSAKI Motohiro
  2010-06-08 18:47     ` David Rientjes
@ 2010-06-08 20:12     ` Andrew Morton
  2010-06-13 11:24       ` KOSAKI Motohiro
  1 sibling, 1 reply; 104+ messages in thread
From: Andrew Morton @ 2010-06-08 20:12 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: David Rientjes, Rik van Riel, Nick Piggin, Oleg Nesterov,
	Balbir Singh, KAMEZAWA Hiroyuki, linux-mm

On Tue,  8 Jun 2010 20:41:57 +0900 (JST)
KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote:

> > +
> >  	if (sysctl_panic_on_oom == 2) {
> >  		dump_header(NULL, gfp_mask, order, NULL);
> >  		panic("out of memory. Compulsory panic_on_oom is selected.\n");
> 
> Sorry, I have found this patch works incorrectly.  I didn't pull it.

Saying "it doesn't work and I'm not telling you why" is unhelpful.  In
fact it's the opposite of helpful because it blocks merging of the fix
and doesn't give us any way to move forward.

So what can I do?  Hard.

What I shall do is to merge the patch in the hope that someone else will
discover the undescribed problem and we will fix it then.  That's very
inefficient.

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: [patch 02/18] oom: introduce find_lock_task_mm() to fix !mm false positives
  2010-06-08 19:42   ` Andrew Morton
@ 2010-06-08 20:14     ` Oleg Nesterov
  2010-06-08 20:17       ` Oleg Nesterov
  2010-06-08 23:50     ` David Rientjes
  1 sibling, 1 reply; 104+ messages in thread
From: Oleg Nesterov @ 2010-06-08 20:14 UTC (permalink / raw)
  To: Andrew Morton
  Cc: David Rientjes, Rik van Riel, Nick Piggin, Balbir Singh,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, linux-mm

On 06/08, Andrew Morton wrote:
>
> On Sun, 6 Jun 2010 15:34:03 -0700 (PDT)
> David Rientjes <rientjes@google.com> wrote:
>
> > [kosaki.motohiro@jp.fujitsu.com: use in badness(), __oom_kill_task()]
> > Signed-off-by: Oleg Nesterov <oleg@redhat.com>
> > Signed-off-by: David Rientjes <rientjes@google.com>
>
> I assume from the above that we should have a Signed-off-by:kosaki
> here.  I didn't make that change yet - please advise.

Yes. The patch mixes 2 changes: find_lock_task_mm patch + "do not forget
about the sub-thread's children". The changelog doesn't match the actual
changes.

> > @@ -115,12 +126,17 @@ unsigned long badness(struct task_struct *p, unsigned long uptime)
> >  	 * child is eating the vast majority of memory, adding only half
> >  	 * to the parents will make the child our kill candidate of choice.
> >  	 */
> > -	list_for_each_entry(child, &p->children, sibling) {
> > -		task_lock(child);
> > -		if (child->mm != mm && child->mm)
> > -			points += child->mm->total_vm/2 + 1;
> > -		task_unlock(child);
> > -	}
> > +	t = p;
> > +	do {
> > +		list_for_each_entry(c, &t->children, sibling) {
> > +			child = find_lock_task_mm(c);
> > +			if (child) {
> > +				if (child->mm != p->mm)
> > +					points += child->mm->total_vm/2 + 1;
>
> What if 1000 children share the same mm?  Doesn't this give a grossly
> wrong result?

Can't answer.  Obviously it is hard to explain what the "right" result is here.
But otoh, without this change we can't account children. Kosaki sent this
as a separate change.

> > @@ -256,9 +272,6 @@ static struct task_struct *select_bad_process(unsigned long *ppoints,
> >  	for_each_process(p) {
> >  		unsigned long points;
> >
> > -		/* skip tasks that have already released their mm */
> > -		if (!p->mm)
> > -			continue;

We shouldn't remove this without removing OR updating the PF_EXITING check
below. That is why we had another patch.

This change alone makes it possible to trivially disable oom-kill.  If we have a process
with the dead leader, select_bad_process() will always return -1.

We either need another patch from Kosaki's series

	- if (p->flags & PF_EXITING)
	+ if (p->flags & PF_EXITING && p->mm)

or remove this check (David objects).

Oleg.

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: [patch 06/18] oom: avoid sending exiting tasks a SIGKILL
  2010-06-06 22:34 ` [patch 06/18] oom: avoid sending exiting tasks a SIGKILL David Rientjes
  2010-06-08 11:41   ` KOSAKI Motohiro
@ 2010-06-08 20:17   ` Andrew Morton
  2010-06-08 20:26   ` Oleg Nesterov
  2 siblings, 0 replies; 104+ messages in thread
From: Andrew Morton @ 2010-06-08 20:17 UTC (permalink / raw)
  To: David Rientjes
  Cc: Rik van Riel, Nick Piggin, Oleg Nesterov, Balbir Singh,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, linux-mm

On Sun, 6 Jun 2010 15:34:22 -0700 (PDT)
David Rientjes <rientjes@google.com> wrote:

> It's unnecessary to SIGKILL a task that is already PF_EXITING and can
> actually cause a NULL pointer dereference of the sighand if it has already
> been detached.  Instead, simply set TIF_MEMDIE so it has access to memory
> reserves and can quickly exit as the comment implies.
> 
> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> Signed-off-by: David Rientjes <rientjes@google.com>
> ---
>  mm/oom_kill.c |    2 +-
>  1 files changed, 1 insertions(+), 1 deletions(-)
> 
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -458,7 +458,7 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
>  	 * its children or threads, just set TIF_MEMDIE so it can die quickly
>  	 */
>  	if (p->flags & PF_EXITING) {
> -		__oom_kill_task(p, 0);
> +		set_tsk_thread_flag(p, TIF_MEMDIE);
>  		return 0;
>  	}

Well, we lose a lot of other stuff here.  We can set TIF_MEMDIE on the
is_global_init() task (how can that get PF_EXITING?).  We don't print
the "Killed process %d" info.  We don't bump the task's timeslice.

These are unchangelogged alterations and I for one can't tell whether
or not they were deliberate.

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: [patch 02/18] oom: introduce find_lock_task_mm() to fix !mm false positives
  2010-06-08 20:14     ` Oleg Nesterov
@ 2010-06-08 20:17       ` Oleg Nesterov
  2010-06-08 21:34         ` Andrew Morton
  0 siblings, 1 reply; 104+ messages in thread
From: Oleg Nesterov @ 2010-06-08 20:17 UTC (permalink / raw)
  To: Andrew Morton
  Cc: David Rientjes, Rik van Riel, Nick Piggin, Balbir Singh,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, linux-mm

On 06/08, Oleg Nesterov wrote:
>
> On 06/08, Andrew Morton wrote:
> >
> > > -		/* skip tasks that have already released their mm */
> > > -		if (!p->mm)
> > > -			continue;
>
> We shouldn't remove this without removing OR updating the PF_EXITING check
> below. That is why we had another patch.
>
> This change alone allows to trivially disable oom-kill. If we have a process
> with the dead leader, select_bad_process() will always return -1.
>
> We either need another patch from Kosaki's series
>
> 	- if (p->flags & PF_EXITING)
> 	+ if (p->flags & PF_EXITING && p->mm)

OOPS, sorry.

I didn't realize you were going to merge this change too.

Probably oom-pf_exiting-check-should-take-mm-into-account.patch should
go ahead of this one for bisecting.

Oleg.

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: [patch 07/18] oom: filter tasks not sharing the same cpuset
  2010-06-06 22:34 ` [patch 07/18] oom: filter tasks not sharing the same cpuset David Rientjes
  2010-06-08 11:41   ` KOSAKI Motohiro
@ 2010-06-08 20:23   ` Andrew Morton
  2010-06-09  0:25     ` David Rientjes
  1 sibling, 1 reply; 104+ messages in thread
From: Andrew Morton @ 2010-06-08 20:23 UTC (permalink / raw)
  To: David Rientjes
  Cc: Rik van Riel, Nick Piggin, Oleg Nesterov, Balbir Singh,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, linux-mm

On Sun, 6 Jun 2010 15:34:25 -0700 (PDT)
David Rientjes <rientjes@google.com> wrote:

> Tasks that do not share the same set of allowed nodes with the task that
> triggered the oom should not be considered as candidates for oom kill.
> 
> Tasks in other cpusets with a disjoint set of mems would be unfairly
> penalized otherwise because of oom conditions elsewhere; an extreme
> example could unfairly kill all other applications on the system if a
> single task in a user's cpuset sets itself to OOM_DISABLE and then uses
> more memory than allowed.
> 
> Killing tasks outside of current's cpuset rarely would free memory for
> current anyway.  To use a sane heuristic, we must ensure that killing a
> task would likely free memory for current and avoid needlessly killing
> others at all costs just because their potential memory freeing is
> unknown.  It is better to kill current than another task needlessly.

This is all a bit arbitrary, isn't it?  The key word here is "rarely". 
If indeed this task had allocated gobs of memory from `current's nodes
and then sneakily switched nodes, this will be a big regression!

So..  It's not completely clear to me how we justify this decision. 
Are we erring too far on the side of keep-tasks-running?  Is failing to
clear the oom a lot bigger problem than killing an innocent task?  I
think so.  In which case we should err towards slaughtering the
innocent?




* Re: [patch 06/18] oom: avoid sending exiting tasks a SIGKILL
  2010-06-06 22:34 ` [patch 06/18] oom: avoid sending exiting tasks a SIGKILL David Rientjes
  2010-06-08 11:41   ` KOSAKI Motohiro
  2010-06-08 20:17   ` Andrew Morton
@ 2010-06-08 20:26   ` Oleg Nesterov
  2010-06-09  6:32     ` David Rientjes
  2 siblings, 1 reply; 104+ messages in thread
From: Oleg Nesterov @ 2010-06-08 20:26 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Rik van Riel, Nick Piggin, Balbir Singh,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, linux-mm

To clarify, I am not going to review this patch ;)
As I said many times, I can only understand what oom_kill.c does,
but not why.

On 06/06, David Rientjes wrote:
>
> It's unnecessary to SIGKILL a task that is already PF_EXITING

This probably needs some explanation. PF_EXITING doesn't necessarily
mean this process is exiting.

> and can
> actually cause a NULL pointer dereference of the sighand

Yes. Another reason to avoid force_sig().

Oleg.


* Re: [patch 08/18] oom: sacrifice child with highest badness score for parent
  2010-06-06 22:34 ` [patch 08/18] oom: sacrifice child with highest badness score for parent David Rientjes
  2010-06-08 11:41   ` KOSAKI Motohiro
@ 2010-06-08 20:33   ` Andrew Morton
  2010-06-09  0:30     ` David Rientjes
  1 sibling, 1 reply; 104+ messages in thread
From: Andrew Morton @ 2010-06-08 20:33 UTC (permalink / raw)
  To: David Rientjes
  Cc: Rik van Riel, Nick Piggin, Oleg Nesterov, Balbir Singh,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, linux-mm

On Sun, 6 Jun 2010 15:34:28 -0700 (PDT)
David Rientjes <rientjes@google.com> wrote:

> When a task is chosen for oom kill, the oom killer first attempts to
> sacrifice a child not sharing its parent's memory instead.  Unfortunately,
> this often kills in a seemingly random fashion based on the ordering of
> the selected task's child list.  Additionally, it is not guaranteed at all
> to free a large amount of memory that we need to prevent additional oom
> killing in the very near future.
> 
> Instead, we now only attempt to sacrifice the worst child not sharing its
> parent's memory, if one exists.  The worst child is indicated with the
> highest badness() score.  This serves two advantages: we kill a
> memory-hogging task more often, and we allow the configurable
> /proc/pid/oom_adj value to be considered as a factor in which child to
> kill.
> 
> Reviewers may observe that the previous implementation would iterate
> through the children and attempt to kill each until one was successful and
> then the parent if none were found while the new code simply kills the
> most memory-hogging task or the parent.  Note that the only time
> oom_kill_task() fails, however, is when a child does not have an mm or has
> a /proc/pid/oom_adj of OOM_DISABLE.  badness() returns 0 for both cases,
> so the final oom_kill_task() will always succeed.
> 
> Acked-by: Rik van Riel <riel@redhat.com>
> Acked-by: Nick Piggin <npiggin@suse.de>
> Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> Signed-off-by: David Rientjes <rientjes@google.com>
> ---
>  mm/oom_kill.c |   23 +++++++++++++++++------
>  1 files changed, 17 insertions(+), 6 deletions(-)
> 
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -441,8 +441,11 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
>  			    unsigned long points, struct mem_cgroup *mem,
>  			    const char *message)
>  {
> +	struct task_struct *victim = p;
>  	struct task_struct *c;
>  	struct task_struct *t = p;
> +	unsigned long victim_points = 0;
> +	struct timespec uptime;
>  
>  	if (printk_ratelimit())
>  		dump_header(p, gfp_mask, order, mem);
> @@ -456,22 +459,30 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
>  		return 0;
>  	}
>  
> -	printk(KERN_ERR "%s: kill process %d (%s) score %li or a child\n",
> -					message, task_pid_nr(p), p->comm, points);
> +	pr_err("%s: Kill process %d (%s) score %lu or sacrifice child\n",
> +		message, task_pid_nr(p), p->comm, points);

fyi, access to another task's ->comm is racy against prctl().  Fixable
with get_task_comm().  But that takes task_lock(), which is risky in
this code.  The world wouldn't end if we didn't fix this ;)

> -	/* Try to kill a child first */
> +	/* Try to sacrifice the worst child first */
> +	do_posix_clock_monotonic_gettime(&uptime);
>  	do {
> +		unsigned long cpoints;

This could be local to the list_for_each_entry() block.

What does "cpoints" mean?

>  		list_for_each_entry(c, &t->children, sibling) {

I'm surprised we don't have a sched.h helper for this.  Maybe it's not
a very common thing to do.

>  			if (c->mm == p->mm)
>  				continue;
>  			if (mem && !task_in_mem_cgroup(c, mem))
>  				continue;
> -			if (!oom_kill_task(c))
> -				return 0;
> +
> +			/* badness() returns 0 if the thread is unkillable */
> +			cpoints = badness(c, uptime.tv_sec);
> +			if (cpoints > victim_points) {
> +				victim = c;
> +				victim_points = cpoints;
> +			}
>  		}
>  	} while_each_thread(p, t);
>  
> -	return oom_kill_task(p);
> +	return oom_kill_task(victim);
>  }

And this function is secretly called under tasklist_lock, which is what
pins *victim, yes?


* Re: [patch 09/18] oom: select task from tasklist for mempolicy ooms
  2010-06-06 22:34 ` [patch 09/18] oom: select task from tasklist for mempolicy ooms David Rientjes
  2010-06-08 11:41   ` KOSAKI Motohiro
@ 2010-06-08 21:08   ` Andrew Morton
  2010-06-08 21:17     ` Oleg Nesterov
  2010-06-09  0:46     ` David Rientjes
  2010-06-08 23:43   ` Andrew Morton
  2 siblings, 2 replies; 104+ messages in thread
From: Andrew Morton @ 2010-06-08 21:08 UTC (permalink / raw)
  To: David Rientjes
  Cc: Rik van Riel, Nick Piggin, Oleg Nesterov, Balbir Singh,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, linux-mm

On Sun, 6 Jun 2010 15:34:31 -0700 (PDT)
David Rientjes <rientjes@google.com> wrote:

> The oom killer presently kills current whenever there is no more memory
> free or reclaimable on its mempolicy's nodes.  There is no guarantee that
> current is a memory-hogging task or that killing it will free any
> substantial amount of memory, however.
> 
> In such situations, it is better to scan the tasklist for nodes that are
> allowed to allocate on current's set of nodes and kill the task with the
> highest badness() score.  This ensures that the most memory-hogging task,
> or the one configured by the user with /proc/pid/oom_adj, is always
> selected in such scenarios.
> 
>
> ...
>
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -27,6 +27,7 @@
>  #include <linux/module.h>
>  #include <linux/notifier.h>
>  #include <linux/memcontrol.h>
> +#include <linux/mempolicy.h>
>  #include <linux/security.h>
>  
>  int sysctl_panic_on_oom;
> @@ -36,20 +37,36 @@ static DEFINE_SPINLOCK(zone_scan_lock);
>  /* #define DEBUG */
>  
>  /*
> - * Is all threads of the target process nodes overlap ours?
> + * Do all threads of the target process overlap our allowed nodes?
> + * @tsk: task struct of which task to consider
> + * @mask: nodemask passed to page allocator for mempolicy ooms

The comment uses kerneldoc annotation but isn't a kerneldoc comment.

>   */
> -static int has_intersects_mems_allowed(struct task_struct *tsk)
> +static bool has_intersects_mems_allowed(struct task_struct *tsk,
> +					const nodemask_t *mask)
>  {
> -	struct task_struct *t;
> +	struct task_struct *start = tsk;
>  
> -	t = tsk;
>  	do {
> -		if (cpuset_mems_allowed_intersects(current, t))
> -			return 1;
> -		t = next_thread(t);
> -	} while (t != tsk);
> -
> -	return 0;
> +		if (mask) {
> +			/*
> +			 * If this is a mempolicy constrained oom, tsk's
> +			 * cpuset is irrelevant.  Only return true if its
> +			 * mempolicy intersects current, otherwise it may be
> +			 * needlessly killed.
> +			 */
> +			if (mempolicy_nodemask_intersects(tsk, mask))
> +				return true;

The comment refers to `current' but the code does not?

> +		} else {
> +			/*
> +			 * This is not a mempolicy constrained oom, so only
> +			 * check the mems of tsk's cpuset.
> +			 */

The comment doesn't refer to `current', but the code does.  Confused.

> +			if (cpuset_mems_allowed_intersects(current, tsk))
> +				return true;
> +		}
> +		tsk = next_thread(tsk);

hm, next_thread() uses list_entry_rcu().  What are the locking rules
here?  It's one of both of rcu_read_lock() and read_lock(&tasklist_lock),
I think?

> +	} while (tsk != start);
> +	return false;
>  }

This is all bloat and overhead for non-NUMA builds.  I doubt if gcc is
able to eliminate the task_struct walk (although I didn't check).

The function isn't oom-killer-specific at all - give it a better name
then move it to mempolicy.c or similar?  If so, the text "oom"
shouldn't appear in the comments.

>
> ...
>
> @@ -676,24 +699,19 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
>  	 */
>  	constraint = constrained_alloc(zonelist, gfp_mask, nodemask);
>  	read_lock(&tasklist_lock);
> -
> -	switch (constraint) {
> -	case CONSTRAINT_MEMORY_POLICY:
> -		oom_kill_process(current, gfp_mask, order, 0, NULL,
> -				"No available memory (MPOL_BIND)");
> -		break;
> -
> -	case CONSTRAINT_NONE:
> -		if (sysctl_panic_on_oom) {
> +	if (unlikely(sysctl_panic_on_oom)) {
> +		/*
> +		 * panic_on_oom only affects CONSTRAINT_NONE, the kernel
> +		 * should not panic for cpuset or mempolicy induced memory
> +		 * failures.
> +		 */

This wasn't changelogged?

> +		if (constraint == CONSTRAINT_NONE) {
>  			dump_header(NULL, gfp_mask, order, NULL);
> -			panic("out of memory. panic_on_oom is selected\n");
> +			read_unlock(&tasklist_lock);
> +			panic("Out of memory: panic_on_oom is enabled\n");
>  		}
> -		/* Fall-through */
> -	case CONSTRAINT_CPUSET:
> -		__out_of_memory(gfp_mask, order);
> -		break;
>  	}
> -
> +	__out_of_memory(gfp_mask, order, constraint, nodemask);
>  	read_unlock(&tasklist_lock);
>  
>  	/*


* Re: [patch 10/18] oom: enable oom tasklist dump by default
  2010-06-06 22:34 ` [patch 10/18] oom: enable oom tasklist dump by default David Rientjes
  2010-06-08 11:42   ` KOSAKI Motohiro
@ 2010-06-08 21:13   ` Andrew Morton
  2010-06-09  0:52     ` David Rientjes
  1 sibling, 1 reply; 104+ messages in thread
From: Andrew Morton @ 2010-06-08 21:13 UTC (permalink / raw)
  To: David Rientjes
  Cc: Rik van Riel, Nick Piggin, Oleg Nesterov, Balbir Singh,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, linux-mm

On Sun, 6 Jun 2010 15:34:35 -0700 (PDT)
David Rientjes <rientjes@google.com> wrote:

> The oom killer tasklist dump, enabled with the oom_dump_tasks sysctl, is
> very helpful information in diagnosing why a user's task has been killed.
> It emits useful information such as each eligible thread's memory usage
> that can determine why the system is oom, so it should be enabled by
> default.

Unclear.  On a large system the poor thing will now spend half an hour
squirting junk out the diagnostic port.  Probably interspersed with the
occasional whine from the softlockup detector.  And for many
applications, spending a long time stuck in the kernel printing
diagnostics is equivalent to an outage.

I guess people can turn it off again if this happens, but they'll get
justifiably grumpy at us.  I wonder if this change is too
developer-friendly and insufficiently operator-friendly.


* Re: [patch 09/18] oom: select task from tasklist for mempolicy ooms
  2010-06-08 21:08   ` Andrew Morton
@ 2010-06-08 21:17     ` Oleg Nesterov
  2010-06-09  0:46     ` David Rientjes
  1 sibling, 0 replies; 104+ messages in thread
From: Oleg Nesterov @ 2010-06-08 21:17 UTC (permalink / raw)
  To: Andrew Morton
  Cc: David Rientjes, Rik van Riel, Nick Piggin, Balbir Singh,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, linux-mm

On 06/08, Andrew Morton wrote:
>
> On Sun, 6 Jun 2010 15:34:31 -0700 (PDT)
> David Rientjes <rientjes@google.com> wrote:
>
> > +			if (cpuset_mems_allowed_intersects(current, tsk))
> > +				return true;
> > +		}
> > +		tsk = next_thread(tsk);
>
> hm, next_thread() uses list_entry_rcu().  What are the locking rules
> here?  It's one of both of rcu_read_lock() and read_lock(&tasklist_lock),
> I think?

Yes, next_thread() is safe under tasklist/rcu/siglock.

> > +	} while (tsk != start);
> > +	return false;
> >  }
>
> This is all bloat and overhead for non-NUMA builds.  I doubt if gcc is
> able to eliminate the task_struct walk (although I didn't check).

I'd also suggest while_each_thread() instead of next_thread() +
"tsk != start", but this is a really minor nit.

Oleg.


* Re: [patch 11/18] oom: avoid oom killer for lowmem allocations
  2010-06-06 22:34 ` [patch 11/18] oom: avoid oom killer for lowmem allocations David Rientjes
  2010-06-08 11:42   ` KOSAKI Motohiro
@ 2010-06-08 21:19   ` Andrew Morton
  1 sibling, 0 replies; 104+ messages in thread
From: Andrew Morton @ 2010-06-08 21:19 UTC (permalink / raw)
  To: David Rientjes
  Cc: Rik van Riel, Nick Piggin, Oleg Nesterov, Balbir Singh,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, linux-mm


> oom: avoid oom killer for lowmem allocations

I think the terminology is poor.  My 256MB test box only has lowmem! 
In the past we've used the term "lower zone" here, which I think
is what you want?

On Sun, 6 Jun 2010 15:34:38 -0700 (PDT)
David Rientjes <rientjes@google.com> wrote:

> If memory has been depleted in lowmem zones even with the protection
> afforded to it by /proc/sys/vm/lowmem_reserve_ratio, it is unlikely that
> killing current users will help.  The memory is either reclaimable (or
> migratable) already, in which case we should not invoke the oom killer at
> all, or it is pinned by an application for I/O.  Killing such an
> application may leave the hardware in an unspecified state and there is no
> guarantee that it will be able to make a timely exit.

Killing an application can leave hardware in an unspecified state?  How
so?  That means a ^C kills the box!

> Lowmem allocations are now failed in oom conditions when __GFP_NOFAIL is
> not used so that the task can perhaps recover or try again later.
> 
> Previously, the heuristic provided some protection for those tasks with
> CAP_SYS_RAWIO, but this is no longer necessary since we will not be
> killing tasks for the purposes of ISA allocations.
> 
> high_zoneidx is gfp_zone(gfp_flags), meaning that ZONE_NORMAL will be the
> default for all allocations that are not __GFP_DMA, __GFP_DMA32,
> __GFP_HIGHMEM, and __GFP_MOVABLE on kernels configured to support those
> flags.  Testing for high_zoneidx being less than ZONE_NORMAL will only
> return true for allocations that have either __GFP_DMA or __GFP_DMA32.
> 
> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> Signed-off-by: David Rientjes <rientjes@google.com>
> ---
>  mm/page_alloc.c |   29 ++++++++++++++++++++---------
>  1 files changed, 20 insertions(+), 9 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1759,6 +1759,9 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
>  		/* The OOM killer will not help higher order allocs */
>  		if (order > PAGE_ALLOC_COSTLY_ORDER)
>  			goto out;
> +		/* The OOM killer does not needlessly kill tasks for lowmem */

a) terminology is scary

b) comment doesn't explain _why_, which is the most important thing
   to explain.

> +		if (high_zoneidx < ZONE_NORMAL)
> +			goto out;
>  		/*
>  		 * GFP_THISNODE contains __GFP_NORETRY and we never hit this.
>  		 * Sanity check for bare calls of __GFP_THISNODE, not real OOM.
> @@ -2052,15 +2055,23 @@ rebalance:
>  			if (page)
>  				goto got_pg;
>  
> -			/*
> -			 * The OOM killer does not trigger for high-order
> -			 * ~__GFP_NOFAIL allocations so if no progress is being
> -			 * made, there are no other options and retrying is
> -			 * unlikely to help.
> -			 */
> -			if (order > PAGE_ALLOC_COSTLY_ORDER &&
> -						!(gfp_mask & __GFP_NOFAIL))
> -				goto nopage;
> +			if (!(gfp_mask & __GFP_NOFAIL)) {
> +				/*
> +				 * The oom killer is not called for high-order
> +				 * allocations that may fail, so if no progress
> +				 * is being made, there are no other options and
> +				 * retrying is unlikely to help.
> +				 */
> +				if (order > PAGE_ALLOC_COSTLY_ORDER)
> +					goto nopage;
> +				/*
> +				 * The oom killer is not called for lowmem
> +				 * allocations to prevent needlessly killing
> +				 * innocent tasks.
> +				 */

s/lowmem/somethingelse/

> +				if (high_zoneidx < ZONE_NORMAL)
> +					goto nopage;
> +			}
>  
>  			goto restart;
>  		}


* Re: [patch 13/18] oom: remove special handling for pagefault ooms
  2010-06-06 22:34 ` [patch 13/18] oom: remove special handling for pagefault ooms David Rientjes
  2010-06-08 11:42   ` KOSAKI Motohiro
@ 2010-06-08 21:27   ` Andrew Morton
  1 sibling, 0 replies; 104+ messages in thread
From: Andrew Morton @ 2010-06-08 21:27 UTC (permalink / raw)
  To: David Rientjes
  Cc: Rik van Riel, Nick Piggin, Oleg Nesterov, Balbir Singh,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, linux-mm

On Sun, 6 Jun 2010 15:34:44 -0700 (PDT)
David Rientjes <rientjes@google.com> wrote:

> It is possible to remove the special pagefault oom handler

It'd be useful to describe what services that handler provides and to
then describe how these services are retained in the new version.

> by simply oom
> locking all system zones and then calling directly into out_of_memory().
> 
> All populated zones must have ZONE_OOM_LOCKED set, otherwise there is a
> parallel oom killing in progress that will lead to eventual memory freeing
> so it's not necessary to needlessly kill another task.

Should that have read "otherwise if there is"?

(the code comments actually clarify all this)

>  The context in
> which the pagefault is allocating memory is unknown to the oom killer, so
> this is done on a system-wide level.
> 
> If a task has already been oom killed and hasn't fully exited yet, this
> will be a no-op since select_bad_process() recognizes tasks across the
> system with TIF_MEMDIE set.
> 


* Re: [patch 02/18] oom: introduce find_lock_task_mm() to fix !mm false positives
  2010-06-08 20:17       ` Oleg Nesterov
@ 2010-06-08 21:34         ` Andrew Morton
  0 siblings, 0 replies; 104+ messages in thread
From: Andrew Morton @ 2010-06-08 21:34 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: David Rientjes, Rik van Riel, Nick Piggin, Balbir Singh,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, linux-mm

On Tue, 8 Jun 2010 22:17:39 +0200
Oleg Nesterov <oleg@redhat.com> wrote:

> On 06/08, Oleg Nesterov wrote:
> >
> > On 06/08, Andrew Morton wrote:
> > >
> > > > -		/* skip tasks that have already released their mm */
> > > > -		if (!p->mm)
> > > > -			continue;
> >
> > We shouldn't remove this without removing OR updating the PF_EXITING check
> > below. That is why we had another patch.
> >
> > This change alone allows to trivially disable oom-kill. If we have a process
> > with the dead leader, select_bad_process() will always return -1.
> >
> > We either need another patch from Kosaki's series
> >
> > 	- if (p->flags & PF_EXITING)
> > 	+ if (p->flags & PF_EXITING && p->mm)
> 
> OOPS, sorry.
> 
> I didn't understand you are going to merge this change too.
> 
> Probably oom-pf_exiting-check-should-take-mm-into-account.patch should
> go ahead of this one for bisecting.

OK, thanks, I did that.


* Re: [patch 16/18] oom: badness heuristic rewrite
  2010-06-06 22:34 ` [patch 16/18] oom: badness heuristic rewrite David Rientjes
  2010-06-08 11:41   ` KOSAKI Motohiro
@ 2010-06-08 22:58   ` Andrew Morton
  2010-06-17  5:32     ` David Rientjes
  1 sibling, 1 reply; 104+ messages in thread
From: Andrew Morton @ 2010-06-08 22:58 UTC (permalink / raw)
  To: David Rientjes
  Cc: Rik van Riel, Nick Piggin, Oleg Nesterov, Balbir Singh,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, linux-mm

On Sun, 6 Jun 2010 15:34:54 -0700 (PDT)
David Rientjes <rientjes@google.com> wrote:

> This a complete rewrite of the oom killer's badness() heuristic which is
> used to determine which task to kill in oom conditions.  The goal is to
> make it as simple and predictable as possible so the results are better
> understood and we end up killing the task which will lead to the most
> memory freeing while still respecting the fine-tuning from userspace.

It's not obvious from this description that the end result is better!
Have you any test cases or scenarios which got improved?

> Instead of basing the heuristic on mm->total_vm for each task, the task's
> rss and swap space is used instead.  This is a better indication of the
> amount of memory that will be freeable if the oom killed task is chosen
> and subsequently exits.

Again, why should we optimise for the amount of memory which a killing
will yield (if that's what you mean)?  We only need to free enough
memory to unblock the oom condition and then proceed.

The last thing we want to do is to kill a process which has consumed
1000 CPU hours, or which is providing some system-critical service or
whatever.  Amount-of-memory-freeable is a relatively minor criterion.

>  This helps specifically in cases where KDE or
> GNOME is chosen for oom kill on desktop systems instead of a memory
> hogging task.

It helps how?  Examples and test cases?

> The baseline for the heuristic is a proportion of memory that each task is
> currently using in memory plus swap compared to the amount of "allowable"
> memory.

What does "swap" mean?  swapspace includes swap-backed swapcache,
un-swap-backed swapcache and non-resident swap.  Which of all these is
being used here and for what reason?

>  "Allowable," in this sense, means the system-wide resources for
> unconstrained oom conditions, the set of mempolicy nodes, the mems
> attached to current's cpuset, or a memory controller's limit.  The
> proportion is given on a scale of 0 (never kill) to 1000 (always kill),
> roughly meaning that if a task has a badness() score of 500 that the task
> consumes approximately 50% of allowable memory resident in RAM or in swap
> space.

So is a new aim of this code to also free up swap space?  Confused.

> The proportion is always relative to the amount of "allowable" memory and
> not the total amount of RAM systemwide so that mempolicies and cpusets may
> operate in isolation; they shall not need to know the true size of the
> machine on which they are running if they are bound to a specific set of
> nodes or mems, respectively.
> 
> Root tasks are given 3% extra memory just like __vm_enough_memory()
> provides in LSMs.  In the event of two tasks consuming similar amounts of
> memory, it is generally better to save root's task.
> 
> Because of the change in the badness() heuristic's baseline, it is also
> necessary to introduce a new user interface to tune it.  It's not possible
> to redefine the meaning of /proc/pid/oom_adj with a new scale since the
> ABI cannot be changed for backward compatibility.  Instead, a new tunable,
> /proc/pid/oom_score_adj, is added that ranges from -1000 to +1000.  It may
> be used to polarize the heuristic such that certain tasks are never
> considered for oom kill while others may always be considered.  The value
> is added directly into the badness() score so a value of -500, for
> example, means to discount 50% of its memory consumption in comparison to
> other tasks either on the system, bound to the mempolicy, in the cpuset,
> or sharing the same memory controller.
> 
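As a back-of-envelope model of the heuristic quoted above (the names and
the final clamp are illustrative assumptions, not the kernel code itself):

```c
#include <assert.h>

/*
 * Model of the proposed badness() baseline: the proportion of
 * "allowable" memory (rss + swap) on a 0..1000 scale, a 3% discount
 * for root tasks, and oom_score_adj added in directly.  Purely
 * illustrative; not the kernel implementation.
 */
static long badness_model(long rss_pages, long swap_pages, long totalpages,
			  long oom_score_adj, int is_root)
{
	long points;

	if (totalpages == 0)
		totalpages = 1;	/* a memcg may have a 0-byte limit */

	points = (rss_pages + swap_pages) * 1000 / totalpages;
	if (is_root)
		points -= 30;	/* 3% bonus, as in __vm_enough_memory() */
	points += oom_score_adj;

	if (points < 0)
		points = 0;
	if (points > 1000)
		points = 1000;
	return points;
}
```

A task using half of the allowable memory scores 500, and an
oom_score_adj of -500 would then discount it all the way to 0.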
> /proc/pid/oom_adj is changed so that its meaning is rescaled into the
> units used by /proc/pid/oom_score_adj, and vice versa.  Changing one of
> these per-task tunables will rescale the value of the other to an
> equivalent meaning.  Although /proc/pid/oom_adj was originally defined as
> a bitshift on the badness score, it now shares the same linear growth as
> /proc/pid/oom_score_adj but with different granularity.  This is required
> so the ABI is not broken with userspace applications and allows oom_adj to
> be deprecated for future removal.

It was a mistake to add oom_adj in the first place.  Because it's a
user-visible knob which us tied to a particular in-kernel
implementation.  As we're seeing now, the presence of that knob locks
us into a particular implementation.

Given that oom_score_adj is just a rescaled version of oom_adj
(correct?), I guess things haven't got a lot worse on that front as a
result of these changes.
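For what it's worth, the rescaling under discussion can be modelled as a
simple linear map; the divisor of 17 (from OOM_DISABLE = -17) and the
clamping here are assumptions for illustration, and the patch itself
defines the exact constants and rounding:

```c
#include <assert.h>

/*
 * Illustrative linear map from the legacy oom_adj scale (assumed here
 * to be OOM_DISABLE = -17 .. +15) onto the new oom_score_adj scale
 * (-1000 .. +1000).  Not the kernel's actual conversion.
 */
static int oom_adj_to_score_adj(int oom_adj)
{
	int score = oom_adj * 1000 / 17;	/* assumed divisor */

	if (score < -1000)
		score = -1000;
	if (score > 1000)
		score = 1000;
	return score;
}
```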


General observation regarding the patch description: I'm not seeing a
lot of reason for merging the patch!  What value does it bring to our
users?  What problems got solved?

Some of Kosaki's observations sounded fairly serious so I'll go into
wait-and-see mode on this patch.


* Re: [patch 16/18] oom: badness heuristic rewrite
  2010-06-08 11:41   ` KOSAKI Motohiro
@ 2010-06-08 23:02     ` Andrew Morton
  2010-06-13 11:24       ` KOSAKI Motohiro
  2010-06-17  5:14       ` David Rientjes
  2010-06-17  5:12     ` David Rientjes
  1 sibling, 2 replies; 104+ messages in thread
From: Andrew Morton @ 2010-06-08 23:02 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: David Rientjes, Rik van Riel, Nick Piggin, Oleg Nesterov,
	Balbir Singh, KAMEZAWA Hiroyuki, linux-mm

On Tue,  8 Jun 2010 20:41:56 +0900 (JST)
KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote:

>
> ...
>
> > --- a/mm/oom_kill.c
> > +++ b/mm/oom_kill.c
> > @@ -4,6 +4,8 @@
> >   *  Copyright (C)  1998,2000  Rik van Riel
> >   *	Thanks go out to Claus Fischer for some serious inspiration and
> >   *	for goading me into coding this file...
> > + *  Copyright (C)  2010  Google, Inc.
> > + *	Rewritten by David Rientjes
> 
> don't put it.
> 

Seems OK to me.  It's a fairly substantial change and people have added
their (c) in the past for smaller kernel changes.  I guess one could even
do this for a one-liner.

>
> ...
>
> >  	/*
> > -	 * Niced processes are most likely less important, so double
> > -	 * their badness points.
> > +	 * The memory controller may have a limit of 0 bytes, so avoid a divide
> > +	 * by zero if necessary.
> >  	 */
> > -	if (task_nice(p) > 0)
> > -		points *= 2;
> 
> You removed 
>   - run time check
>   - cpu time check
>   - nice check
> 
> but did not describe the reason. Reviewers are puzzled. How can we review
> this when we don't get your point? Please write:
> 
>  - What benefit is there?
>  - Why do you think there is no bad effect?
>  - How did you confirm it?

yup.

> 
> > +	if (!totalpages)
> > +		totalpages = 1;
> >  
> >  	/*
> > -	 * Superuser processes are usually more important, so we make it
> > -	 * less likely that we kill those.
> > +	 * The baseline for the badness score is the proportion of RAM that each
> > +	 * task's rss and swap space use.
> >  	 */
> > -	if (has_capability_noaudit(p, CAP_SYS_ADMIN) ||
> > -	    has_capability_noaudit(p, CAP_SYS_RESOURCE))
> > -		points /= 4;
> > +	points = (get_mm_rss(p->mm) + get_mm_counter(p->mm, MM_SWAPENTS)) * 1000 /
> > +			totalpages;
> > +	task_unlock(p);
> >  
> >  	/*
> > -	 * We don't want to kill a process with direct hardware access.
> > -	 * Not only could that mess up the hardware, but usually users
> > -	 * tend to only have this flag set on applications they think
> > -	 * of as important.
> > +	 * Root processes get 3% bonus, just like the __vm_enough_memory()
> > +	 * implementation used by LSMs.
> >  	 */
> > -	if (has_capability_noaudit(p, CAP_SYS_RAWIO))
> > -		points /= 4;
> > +	if (has_capability_noaudit(p, CAP_SYS_ADMIN))
> > +		points -= 30;
> 
> 
> CAP_SYS_ADMIN seems like no good idea. CAP_SYS_ADMIN implies an admin's
> interactive process, but killing an interactive process only causes a
> forced logout, while killing a system daemon can cause a far more
> catastrophic disaster.
> 
> 
> Last of all, I'll pull this one, but only as a cherry-pick.
> 

This change was unchangelogged; I don't know what it's for, and I don't
understand your comment about it.

Apart from that, I'm doing great!



* Re: [patch 17/18] oom: add forkbomb penalty to badness heuristic
  2010-06-06 22:34 ` [patch 17/18] oom: add forkbomb penalty to badness heuristic David Rientjes
  2010-06-08 11:41   ` KOSAKI Motohiro
@ 2010-06-08 23:15   ` Andrew Morton
  1 sibling, 0 replies; 104+ messages in thread
From: Andrew Morton @ 2010-06-08 23:15 UTC (permalink / raw)
  To: David Rientjes
  Cc: Rik van Riel, Nick Piggin, Oleg Nesterov, Balbir Singh,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, linux-mm

On Sun, 6 Jun 2010 15:34:58 -0700 (PDT)
David Rientjes <rientjes@google.com> wrote:

> Add a forkbomb penalty for processes that fork an excessively large
> number of children to penalize that group of tasks and not others.  A
> threshold is configurable from userspace to determine how many first-
> generation execve children (those with their own address spaces) a task
> may have before it is considered a forkbomb.  This can be tuned by
> altering the value in /proc/sys/vm/oom_forkbomb_thres, which defaults to
> 1000.
> 
> When a task has more than 1000 first-generation children with different
> address spaces than itself, a penalty of
> 
> 	(average rss of children) * (# of 1st generation execve children)
> 	-----------------------------------------------------------------
> 			oom_forkbomb_thres
> 
> is assessed.  So, for example, using the default oom_forkbomb_thres of
> 1000, the penalty is twice the average rss of all its execve children if
> there are 2000 such tasks.  A task is considered to count toward the
> threshold if its total runtime is less than one second; for 1000 of such
> tasks to exist, the parent process must be forking at an extremely high
> rate either erroneously or maliciously.
> 
> Even though a particular task may be designated a forkbomb and selected as
> the victim, the oom killer will still kill the 1st generation execve child
> with the highest badness() score in its place.  This avoids killing
> important servers or system daemons.  When a web server forks a very large
> number of threads for client connections, for example, it is much better
> to kill one of those threads than to kill the server and make it
> unresponsive.
> 

- "oom_forkbomb_thresh" or "oom_forkbomb_threshold", please.

- No new proc knobs!  They lock us into implementation details.

- Let's go outside the box: forkbomb is just a workload.  Why does
  one particular workload need special-casing in the oom-killer?  If
  the oom-kill was working well then when a forkbomb causes an oom, the
  oom-killer would kill whatever is necessary to unlock the system and
  will then let things proceed.

  IOW, if the oom-killer can't handle this particular workload
  gracefully without special-casing then it isn't working well enough.

  Now, maybe there is an argument that a forkbomb is sufficiently
  damaging to warrant adding special-case handling in the kernel.  But
  if so, it should be detected and handled at sys_fork()
  (RLIMIT_NPROC?), not in the oom-killer.  Or, better, the kernel
  should be fixed so that whatever damage the forkbomb causes doesn't
  get caused any more.

  (otoh, the oom-killer is already stuffed full of heuristics and
  this is just another one.  But it should work correctly without it,
  dammit!)


^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: [patch 18/18] oom: deprecate oom_adj tunable
  2010-06-08 11:42   ` KOSAKI Motohiro
  2010-06-08 19:00     ` David Rientjes
@ 2010-06-08 23:18     ` Andrew Morton
  2010-06-13 11:24       ` KOSAKI Motohiro
  1 sibling, 1 reply; 104+ messages in thread
From: Andrew Morton @ 2010-06-08 23:18 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: David Rientjes, Rik van Riel, Nick Piggin, Oleg Nesterov,
	Balbir Singh, KAMEZAWA Hiroyuki, linux-mm

On Tue,  8 Jun 2010 20:42:02 +0900 (JST)
KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote:

> > +	/*
> > +	 * Warn that /proc/pid/oom_adj is deprecated, see
> > +	 * Documentation/feature-removal-schedule.txt.
> > +	 */
> > +	printk_once(KERN_WARNING "%s (%d): /proc/%d/oom_adj is deprecated, "
> > +			"please use /proc/%d/oom_score_adj instead.\n",
> > +			current->comm, task_pid_nr(current),
> > +			task_pid_nr(task), task_pid_nr(task));
> >  	task->signal->oom_adj = oom_adjust;
> 
> Sorry, we can't accept this. oom_adj is one of the most frequently used
> tuning knobs, and deprecating it will cause a lot of confusion.
> 
> In addition, this knob is used by some applications (please search with
> Google Code Search or something similar). That means an end user can't
> stop the warning, which will cause a lot of frustration. NO.
> 

I think it's OK.  We made a mistake in adding oom_adj in the first
place and now we get to live with the consequences.

We'll be stuck with oom_adj for the next 200 years if we don't tell
people to stop using it, and a printk_once() is a good way of doing
that.

It could be that in two years time we decide that we can't remove oom_adj
yet because too many people are still using it.  Maybe it will take ten
years - but unless we add the above printk, oom_adj will remain
forever.



^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: [patch 01/18] oom: check PF_KTHREAD instead of !mm to skip kthreads
  2010-06-08 19:33   ` Andrew Morton
@ 2010-06-08 23:40     ` David Rientjes
  2010-06-08 23:52       ` Andrew Morton
  0 siblings, 1 reply; 104+ messages in thread
From: David Rientjes @ 2010-06-08 23:40 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Rik van Riel, Nick Piggin, Oleg Nesterov, Balbir Singh,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, linux-mm

On Tue, 8 Jun 2010, Andrew Morton wrote:

> > From: Oleg Nesterov <oleg@redhat.com>
> > 
> > select_bad_process() thinks a kernel thread can't have ->mm != NULL, this
> > is not true due to use_mm().
> > 
> > Change the code to check PF_KTHREAD.
> > 
> > Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > Signed-off-by: Oleg Nesterov <oleg@redhat.com>
> > Signed-off-by: David Rientjes <rientjes@google.com>
> > ---
> >  mm/oom_kill.c |    9 +++------
> >  1 files changed, 3 insertions(+), 6 deletions(-)
> > 
> > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > --- a/mm/oom_kill.c
> > +++ b/mm/oom_kill.c
> > @@ -256,14 +256,11 @@ static struct task_struct *select_bad_process(unsigned long *ppoints,
> >  	for_each_process(p) {
> >  		unsigned long points;
> >  
> > -		/*
> > -		 * skip kernel threads and tasks which have already released
> > -		 * their mm.
> > -		 */
> > +		/* skip tasks that have already released their mm */
> >  		if (!p->mm)
> >  			continue;
> > -		/* skip the init task */
> > -		if (is_global_init(p))
> > +		/* skip the init task and kthreads */
> > +		if (is_global_init(p) || (p->flags & PF_KTHREAD))
> >  			continue;
> >  		if (mem && !task_in_mem_cgroup(p, mem))
> >  			continue;
> 
> Applied, thanks.  A minor bugfix.
> 

Thanks!  I didn't see it added to -mm, though, so I'll assume it's being 
queued for 2.6.35-rc3 instead.


^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: [patch 09/18] oom: select task from tasklist for mempolicy ooms
  2010-06-06 22:34 ` [patch 09/18] oom: select task from tasklist for mempolicy ooms David Rientjes
  2010-06-08 11:41   ` KOSAKI Motohiro
  2010-06-08 21:08   ` Andrew Morton
@ 2010-06-08 23:43   ` Andrew Morton
  2010-06-09  0:40     ` David Rientjes
  2 siblings, 1 reply; 104+ messages in thread
From: Andrew Morton @ 2010-06-08 23:43 UTC (permalink / raw)
  To: David Rientjes
  Cc: Rik van Riel, Nick Piggin, Oleg Nesterov, Balbir Singh,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, linux-mm, Andrea Arcangeli

On Sun, 6 Jun 2010 15:34:31 -0700 (PDT)
David Rientjes <rientjes@google.com> wrote:

> The oom killer presently kills current whenever there is no more memory
> free or reclaimable on its mempolicy's nodes.  There is no guarantee that
> current is a memory-hogging task or that killing it will free any
> substantial amount of memory, however.

Well OK.  But we don't necessarily *want* to "free a substantial amount
of memory".  We want to resolve the oom within `current'.  That's the
sole responsibility of the oom-killer.  It doesn't have to free up
large amounts of additional memory in the expectation that sometime in
the future some other task will get an oom as well.  If the oom-killer
is working well, we can defer those actions until the problem actually
occurs.

Plus: if `current' isn't using much memory then it's probably a
short-lived or not-very-important process anyway.

> In such situations, it is better to scan the tasklist for nodes that are
> allowed to allocate on current's set of nodes and kill the task with the
> highest badness() score.  This ensures that the most memory-hogging task,
> or the one configured by the user with /proc/pid/oom_adj, is always
> selected in such scenarios.

Well... *why* is it better?  Needs more justification/explanation IMO.

A long time ago Andrea changed the oom-killer so that it basically
always killed `current', iirc.  I think that shipped in the Suse
kernel.  Maybe it was only in the case where `current' got an oom when
satisfying a pagefault, I forget the details.  But according to Andrea,
this design provided a simple and practical solution to ooms.

So I think this policy change would benefit from a more convincing
justification.



^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: [patch 02/18] oom: introduce find_lock_task_mm() to fix !mm false positives
  2010-06-08 19:42   ` Andrew Morton
  2010-06-08 20:14     ` Oleg Nesterov
@ 2010-06-08 23:50     ` David Rientjes
  1 sibling, 0 replies; 104+ messages in thread
From: David Rientjes @ 2010-06-08 23:50 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Rik van Riel, Nick Piggin, Oleg Nesterov, Balbir Singh,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, linux-mm

On Tue, 8 Jun 2010, Andrew Morton wrote:

> > From: Oleg Nesterov <oleg@redhat.com>
> > 
> > Almost all ->mm == NULL checks in oom_kill.c are wrong.
> > 
> > The current code assumes that the task without ->mm has already
> > released its memory and ignores the process. However this is not
> > necessarily true when this process is multithreaded, other live
> > sub-threads can use this ->mm.
> > 
> > - Remove the "if (!p->mm)" check in select_bad_process(), it is
> >   just wrong.
> > 
> > - Add the new helper, find_lock_task_mm(), which finds the live
> >   thread which uses the memory and takes task_lock() to pin ->mm
> > 
> > - change oom_badness() to use this helper instead of just checking
> >   ->mm != NULL.
> > 
> > - As David pointed out, select_bad_process() must never choose the
> >   task without ->mm, but no matter what oom_badness() returns the
> >   task can be chosen if nothing else has been found yet.
> > 
> >   Change oom_badness() to return int, change it to return -1 if
> >   find_lock_task_mm() fails, and change select_bad_process() to
> >   check points >= 0.
> > 
> > Note! This patch is not enough, we need more changes.
> > 
> > 	- oom_badness() was fixed, but oom_kill_task() still ignores
> > 	  the task without ->mm
> > 
> > 	- oom_forkbomb_penalty() should use find_lock_task_mm() too,
> > 	  and it also needs other changes to actually find the first
> > 	  first-descendant children
> > 
> > This will be addressed later.
> > 
> > [kosaki.motohiro@jp.fujitsu.com: use in badness(), __oom_kill_task()]
> > Signed-off-by: Oleg Nesterov <oleg@redhat.com>
> > Signed-off-by: David Rientjes <rientjes@google.com>
> 
> I assume from the above that we should have a Signed-off-by:kosaki
> here.  I didn't make that change yet - please advise.
> 

Oops, that was accidentally dropped, sorry about that.  I folded two of his 
patches into this one since it introduces find_lock_task_mm() and it needs 
to be used in the places KOSAKI fixed as well.  His original patches are 
at

	http://marc.info/?l=linux-mm&m=127537136419677
	http://marc.info/?l=linux-mm&m=127537153619893

along with his sign-off.

> 
> >  mm/oom_kill.c |   74 +++++++++++++++++++++++++++++++++------------------------
> >  1 files changed, 43 insertions(+), 31 deletions(-)
> > 
> > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > --- a/mm/oom_kill.c
> > +++ b/mm/oom_kill.c
> > @@ -52,6 +52,20 @@ static int has_intersects_mems_allowed(struct task_struct *tsk)
> >  	return 0;
> >  }
> >  
> > +static struct task_struct *find_lock_task_mm(struct task_struct *p)
> > +{
> > +	struct task_struct *t = p;
> > +
> > +	do {
> > +		task_lock(t);
> > +		if (likely(t->mm))
> > +			return t;
> > +		task_unlock(t);
> > +	} while_each_thread(p, t);
> > +
> > +	return NULL;
> > +}
> 
> What pins `p'?  Ah, caller must hold tasklist_lock.
> 

I'll add a comment about this in a followup patch; it should remove the 
confusion others have had about the naming of the function as well, 
which I think is good but could use some explanation.

> >  /**
> >   * badness - calculate a numeric value for how bad this task has been
> >   * @p: task struct of which task we should calculate
> > @@ -74,8 +88,8 @@ static int has_intersects_mems_allowed(struct task_struct *tsk)
> >  unsigned long badness(struct task_struct *p, unsigned long uptime)
> >  {
> >  	unsigned long points, cpu_time, run_time;
> > -	struct mm_struct *mm;
> >  	struct task_struct *child;
> > +	struct task_struct *c, *t;
> >  	int oom_adj = p->signal->oom_adj;
> >  	struct task_cputime task_time;
> >  	unsigned long utime;
> > @@ -84,17 +98,14 @@ unsigned long badness(struct task_struct *p, unsigned long uptime)
> >  	if (oom_adj == OOM_DISABLE)
> >  		return 0;
> >  
> > -	task_lock(p);
> > -	mm = p->mm;
> > -	if (!mm) {
> > -		task_unlock(p);
> > +	p = find_lock_task_mm(p);
> > +	if (!p)
> >  		return 0;
> > -	}
> >  
> >  	/*
> >  	 * The memory size of the process is the basis for the badness.
> >  	 */
> > -	points = mm->total_vm;
> > +	points = p->mm->total_vm;
> >  
> >  	/*
> >  	 * After this unlock we can no longer dereference local variable `mm'
> 
> This comment is stale.  Replace with p->mm.
> 

Indeed, find_lock_task_mm() returns with task_lock() held for p->mm here, 
so the dereference is always safe.  I'll send a followup.

> > @@ -115,12 +126,17 @@ unsigned long badness(struct task_struct *p, unsigned long uptime)
> >  	 * child is eating the vast majority of memory, adding only half
> >  	 * to the parents will make the child our kill candidate of choice.
> >  	 */
> > -	list_for_each_entry(child, &p->children, sibling) {
> > -		task_lock(child);
> > -		if (child->mm != mm && child->mm)
> > -			points += child->mm->total_vm/2 + 1;
> > -		task_unlock(child);
> > -	}
> > +	t = p;
> > +	do {
> > +		list_for_each_entry(c, &t->children, sibling) {
> > +			child = find_lock_task_mm(c);
> > +			if (child) {
> > +				if (child->mm != p->mm)
> > +					points += child->mm->total_vm/2 + 1;
> 
> What if 1000 children share the same mm?  Doesn't this give a grossly
> wrong result?
> 

It does, and that's why there has been much criticism of this 
particular part of the heuristic over the past few months.  It gets 
removed in my badness() rewrite, but the change here is concerned solely 
with the use_mm() race, so it closes a gap that currently exists.

> > +				task_unlock(child);
> > +			}
> > +		}
> > +	} while_each_thread(p, t);
> >  
> >  	/*
> >  	 * CPU time is in tens of seconds and run time is in thousands
> > @@ -256,9 +272,6 @@ static struct task_struct *select_bad_process(unsigned long *ppoints,
> >  	for_each_process(p) {
> >  		unsigned long points;
> >  
> > -		/* skip tasks that have already released their mm */
> > -		if (!p->mm)
> > -			continue;
> >  		/* skip the init task and kthreads */
> >  		if (is_global_init(p) || (p->flags & PF_KTHREAD))
> >  			continue;
> > @@ -385,14 +398,9 @@ static void __oom_kill_task(struct task_struct *p, int verbose)
> >  		return;
> >  	}
> >  
> > -	task_lock(p);
> > -	if (!p->mm) {
> > -		WARN_ON(1);
> > -		printk(KERN_WARNING "tried to kill an mm-less task %d (%s)!\n",
> > -			task_pid_nr(p), p->comm);
> > -		task_unlock(p);
> > +	p = find_lock_task_mm(p);
> > +	if (!p)
> >  		return;
> > -	}
> >  
> >  	if (verbose)
> >  		printk(KERN_ERR "Killed process %d (%s) "
> > @@ -437,6 +445,7 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
> >  			    const char *message)
> >  {
> >  	struct task_struct *c;
> > +	struct task_struct *t = p;
> >  
> >  	if (printk_ratelimit())
> >  		dump_header(p, gfp_mask, order, mem);
> > @@ -454,14 +463,17 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
> >  					message, task_pid_nr(p), p->comm, points);
> >  
> >  	/* Try to kill a child first */
> 
> It'd be nice to improve the comments a bit.  This one tells us the
> "what" (which is usually obvious) but didn't tell us "why", which is
> often the unobvious.
> 

This gets modified in 
oom-sacrifice-child-with-highest-badness-score-for-parent.patch, so I'll 
expand upon it there and post a followup patch since it's already merged.

Thanks!


^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: [patch 01/18] oom: check PF_KTHREAD instead of !mm to skip kthreads
  2010-06-08 23:40     ` David Rientjes
@ 2010-06-08 23:52       ` Andrew Morton
  0 siblings, 0 replies; 104+ messages in thread
From: Andrew Morton @ 2010-06-08 23:52 UTC (permalink / raw)
  To: David Rientjes
  Cc: Rik van Riel, Nick Piggin, Oleg Nesterov, Balbir Singh,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, linux-mm

On Tue, 8 Jun 2010 16:40:07 -0700 (PDT)
David Rientjes <rientjes@google.com> wrote:

> > 
> > Applied, thanks.  A minor bugfix.
> > 
> 
> Thanks!  I didn't see it added to -mm, though,

doh.

<adds it>

> so I'll assume it's being 
> queued for 2.6.35-rc3 instead.

Linus is being all strict - "regression and oops fixes only", and I
don't think a fix of this magnitude passes the test.


^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: [patch 03/18] oom: dump_tasks use find_lock_task_mm too
  2010-06-08 19:55   ` Andrew Morton
@ 2010-06-09  0:06     ` David Rientjes
  0 siblings, 0 replies; 104+ messages in thread
From: David Rientjes @ 2010-06-09  0:06 UTC (permalink / raw)
  To: Andrew Morton, KOSAKI Motohiro
  Cc: Rik van Riel, Nick Piggin, Oleg Nesterov, Balbir Singh,
	KAMEZAWA Hiroyuki, linux-mm

On Tue, 8 Jun 2010, Andrew Morton wrote:

> > From: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> > 
> > dump_tasks() should use find_lock_task_mm() too.  It is necessary to
> > protect against the task-exiting race.
> 
> A full description of the race would help people understand the code
> and the change.
> 

Ok, here's a description of it that you can add to KOSAKI's changelog if 
you'd like:

dump_tasks() currently filters out any task that does not have an attached 
->mm since it incorrectly assumes that it must either be in the process of 
exiting and has detached its memory or that it's a kernel thread; 
multithreaded tasks may actually have subthreads that have a valid ->mm 
pointer and thus those threads should actually be displayed.  This change 
finds those threads, if they exist, and emits their information along with 
the rest of the candidate tasks for kill.

> > Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> > Signed-off-by: David Rientjes <rientjes@google.com>
> > ---
> >  mm/oom_kill.c |   39 +++++++++++++++++++++------------------
> >  1 files changed, 21 insertions(+), 18 deletions(-)
> > 
> > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > --- a/mm/oom_kill.c
> > +++ b/mm/oom_kill.c
> > @@ -336,35 +336,38 @@ static struct task_struct *select_bad_process(unsigned long *ppoints,
> >   */
> >  static void dump_tasks(const struct mem_cgroup *mem)
> 
> The comment over this function needs to be updated to describe the role
> of incoming argument `mem'.
> 

Ok, I can take care of this as another comment cleanup in a followup 
patch.

> >  {
> > -	struct task_struct *g, *p;
> > +	struct task_struct *p;
> > +	struct task_struct *task;
> >  
> >  	printk(KERN_INFO "[ pid ]   uid  tgid total_vm      rss cpu oom_adj "
> >  	       "name\n");
> > -	do_each_thread(g, p) {
> > -		struct mm_struct *mm;
> > -
> > -		if (mem && !task_in_mem_cgroup(p, mem))
> > +	for_each_process(p) {
> 
> The switch from do_each_thread() to for_each_process() is
> unchangelogged.  It looks like a little cleanup to me.
> 
> > +		/*
> > +		 * We don't have is_global_init() check here, because the old
> > +		 * code do that. printing init process is not big matter. But
> > +		 * we don't hope to make unnecessary compatibility breaking.
> > +		 */
> 
> When merging others' patches, please do review and if necessary fix or
> enhance the comments and the changelog.  I don't think people take
> offense.
> 

Ok, I wasn't sure of the etiquette and I didn't want anything else holding 
this work up.

> Also, I don't think it's really valuable to document *changes* within
> the code comments.  This comment is referring to what the old code did
> versus the new code.  Generally it's best to just document the code as
> it presently stands and leave the documentation of the delta to the
> changelog.
> 
> That's not always true, of course - we should document oddball code
> which is left there for userspace-visible back-compatibility reasons.
> 

Agreed, I think KOSAKI might be working on a patch that moves all of this 
tasklist filtering logic to a helper function and would probably fix this 
up.  KOSAKI?

> 
> > +		if (p->flags & PF_KTHREAD)
> >  			continue;
> > -		if (!thread_group_leader(p))
> > +		if (mem && !task_in_mem_cgroup(p, mem))
> >  			continue;
> >  
> > -		task_lock(p);
> > -		mm = p->mm;
> > -		if (!mm) {
> > +		task = find_lock_task_mm(p);
> > +		if (!task) {
> >  			/*
> > -			 * total_vm and rss sizes do not exist for tasks with no
> > -			 * mm so there's no need to report them; they can't be
> > -			 * oom killed anyway.
> > +			 * Probably oom vs task-exiting race was happen and ->mm
> > +			 * have been detached. thus there's no need to report
> > +			 * them; they can't be oom killed anyway.
> >  			 */
> 
> OK, that hinted at the race but still didn't really tell readers what it is.
> 

It's actually mostly incorrect; it does short-circuit the iteration when a 
task is found to have already exited or detached its memory while we're 
holding tasklist_lock, but the old comment was probably better.  The 
scenario where this condition will be true 99% of the time is when 
iterating through the tasklist and finding a kthread.  I'll fix this up.


^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: [patch 05/18] oom: give current access to memory reserves if it has been killed
  2010-06-08 20:08   ` Andrew Morton
@ 2010-06-09  0:14     ` David Rientjes
  0 siblings, 0 replies; 104+ messages in thread
From: David Rientjes @ 2010-06-09  0:14 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Rik van Riel, Nick Piggin, Oleg Nesterov, Balbir Singh,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, linux-mm

On Tue, 8 Jun 2010, Andrew Morton wrote:

> > It's possible to livelock the page allocator if a thread has mm->mmap_sem
> 
> What is the state of this thread?  Trying to allocate memory, I assume.  
> 

Right, which I agree is a bad scenario to be in but indeed does happen 
(and we have a workaround at Google that identifies these particular cases 
and kills the holder of the writelock on mm->mmap_sem).  We have one 
thread holding a readlock on mm->mmap_sem while trying to allocate memory 
so the oom killer becomes a no-op to prevent needless task killing while 
waiting for the killed task to exit, but that killed task can't exit 
because it requires a writelock on the same semaphore.

> > and fails to make forward progress because the oom killer selects another
> > thread sharing the same ->mm to kill that cannot exit until the semaphore
> > is dropped.
> > 
> > The oom killer will not kill multiple tasks at the same time; each oom
> > killed task must exit before another task may be killed.
> 
> This sounds like a quite risky design.  The possibility that we'll
> cause other dead/livelocks similar to this one seems pretty high.  It
> applies to all sleeping locks in the entire kernel, doesn't it?
> 

It applies to any writelock that is taken during the exitpath of an oom 
killed task if a thread holding a readlock is trying to allocate memory 
itself.  This is how it's always been done at least within the past few 
years and we haven't had a problem other than with mm->mmap_sem.  At one 
point we used an oom killer timeout to kill other tasks after a period of 
time had elapsed, but that hasn't been required since we've been killing 
the thread holding the writelock on mm->mmap_sem.

> >  Thus, if one
> > thread is holding mm->mmap_sem and cannot allocate memory, all threads
> > sharing the same ->mm are blocked from exiting as well.  In the oom kill
> > case, that means the thread holding mm->mmap_sem will never free
> > additional memory since it cannot get access to memory reserves and the
> > thread that depends on it with access to memory reserves cannot exit
> > because it cannot acquire the semaphore.  Thus, the page allocator
> > livelocks.
> > 
> > When the oom killer is called and current happens to have a pending
> > SIGKILL, this patch automatically gives it access to memory reserves and
> > returns.  Upon returning to the page allocator, its allocation will
> > hopefully succeed so it can quickly exit and free its memory.  If not, the
> > page allocator will fail the allocation if it is not __GFP_NOFAIL.
> 
> You said "hopefully".
> 

"hopefully" in this case means that the allocation had better succeed or 
we've depleted all memory reserves and we're deadlocked; it doesn't mean 
that this is a speculative change that may or may not work.

> Does it actually work?  Any real-world testing results?  If so, they'd
> be a useful addition to the changelog.
> 

It certainly does, and it prevents needlessly killing another task when we 
know current is exiting.  The nice thing about that is that we don't need 
to do anything like checking whether a child should be sacrificed or 
whether current is OOM_DISABLE: we already know it's dying, so it should 
simply get access to memory reserves either to return and handle its 
pending SIGKILL or to continue down the exitpath.


^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: [patch 07/18] oom: filter tasks not sharing the same cpuset
  2010-06-08 20:23   ` Andrew Morton
@ 2010-06-09  0:25     ` David Rientjes
  0 siblings, 0 replies; 104+ messages in thread
From: David Rientjes @ 2010-06-09  0:25 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Rik van Riel, Nick Piggin, Oleg Nesterov, Balbir Singh,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, linux-mm

On Tue, 8 Jun 2010, Andrew Morton wrote:

> > Tasks that do not share the same set of allowed nodes with the task that
> > triggered the oom should not be considered as candidates for oom kill.
> > 
> > Tasks in other cpusets with a disjoint set of mems would be unfairly
> > penalized otherwise because of oom conditions elsewhere; an extreme
> > example could unfairly kill all other applications on the system if a
> > single task in a user's cpuset sets itself to OOM_DISABLE and then uses
> > more memory than allowed.
> > 
> > Killing tasks outside of current's cpuset rarely would free memory for
> > current anyway.  To use a sane heuristic, we must ensure that killing a
> > task would likely free memory for current and avoid needlessly killing
> > others at all costs just because their potential memory freeing is
> > unknown.  It is better to kill current than another task needlessly.
> 
> This is all a bit arbitrary, isn't it?  The key word here is "rarely". 

"rarely" certainly is an arbitrary term in this case because it depends 
heavily on the memory usage of other cpusets on the system.  Consider a 
cpuset with 16G of memory and a single task which consumes most of that 
memory.  Then consider a cpuset with a single 1G node and a task that ooms 
within it; the 16G task in the other cpuset gets killed.

There must either be a complete exclusion or inclusion of a task for 
candidacy if the scale of memory usage amongst our cpusets cannot be 
properly attributed with a single heuristic (such as divide by 4, divide 
by 8, etc).  To me, it never seems appropriate to penalize another cpuset's 
tasks by the small chance that it may have allocated atomic memory 
elsewhere or the nodes have been recently changed.  The goal is to be more 
predictable about oom killing decisions without negatively impacting other 
cpusets, and this is a step in that direction.

> If indeed this task had allocated gobs of memory from `current's nodes
> and then sneakily switched nodes, this will be a big regression!
> 

It could be, but that's the fault of userspace for assigning a node that 
is almost full to a new cpuset and expecting it to be completely free.  In 
other words, we can arrange our cpusets with whatever mems we want, but we 
need some guarantee that a task in a cpuset that was given completely free 
memory won't be killed just because another cpuset went oom.

> So..  It's not completely clear to me how we justify this decision. 
> Are we erring too far on the side of keep-tasks-running?  Is failing to
> clear the oom a lot bigger problem than killing an innocent task?  I
> think so.  In which case we should err towards slaughtering the
> innocent?
> 

The one thing we know is that if the victim's mems_allowed is truly 
disjoint from current's, there's no guarantee we'll be freeing memory at 
all.  And if we free any, it's the result of GFP_ATOMIC allocations, 
which are allowed anywhere, or of memory previously allocated on one of 
current's mems.


^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: [patch 08/18] oom: sacrifice child with highest badness score for parent
  2010-06-08 20:33   ` Andrew Morton
@ 2010-06-09  0:30     ` David Rientjes
  0 siblings, 0 replies; 104+ messages in thread
From: David Rientjes @ 2010-06-09  0:30 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Rik van Riel, Nick Piggin, Oleg Nesterov, Balbir Singh,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, linux-mm

On Tue, 8 Jun 2010, Andrew Morton wrote:

> > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > --- a/mm/oom_kill.c
> > +++ b/mm/oom_kill.c
> > @@ -441,8 +441,11 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
> >  			    unsigned long points, struct mem_cgroup *mem,
> >  			    const char *message)
> >  {
> > +	struct task_struct *victim = p;
> >  	struct task_struct *c;
> >  	struct task_struct *t = p;
> > +	unsigned long victim_points = 0;
> > +	struct timespec uptime;
> >  
> >  	if (printk_ratelimit())
> >  		dump_header(p, gfp_mask, order, mem);
> > @@ -456,22 +459,30 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
> >  		return 0;
> >  	}
> >  
> > -	printk(KERN_ERR "%s: kill process %d (%s) score %li or a child\n",
> > -					message, task_pid_nr(p), p->comm, points);
> > +	pr_err("%s: Kill process %d (%s) score %lu or sacrifice child\n",
> > +		message, task_pid_nr(p), p->comm, points);
> 
> fyi, access to another task's ->comm is racy against prctl().  Fixable
> with get_task_comm().  But that takes task_lock(), which is risky in
> this code.  The world wouldn't end if we didn't fix this ;)
> 

I'll look into doing that, thanks!

> > -	/* Try to kill a child first */
> > +	/* Try to sacrifice the worst child first */
> > +	do_posix_clock_monotonic_gettime(&uptime);
> >  	do {
> > +		unsigned long cpoints;
> 
> This could be local to the list_for_each_entry() block.
> 

Ok.

> What does "cpoints" mean?
> 

child points :)  I'll send an incremental patch.

> >  		list_for_each_entry(c, &t->children, sibling) {
> 
> I'm surprised we don't have a sched.h helper for this.  Maybe it's not
> a very common thing to do.
> 
> >  			if (c->mm == p->mm)
> >  				continue;
> >  			if (mem && !task_in_mem_cgroup(c, mem))
> >  				continue;
> > -			if (!oom_kill_task(c))
> > -				return 0;
> > +
> > +			/* badness() returns 0 if the thread is unkillable */
> > +			cpoints = badness(c, uptime.tv_sec);
> > +			if (cpoints > victim_points) {
> > +				victim = c;
> > +				victim_points = cpoints;
> > +			}
> >  		}
> >  	} while_each_thread(p, t);
> >  
> > -	return oom_kill_task(p);
> > +	return oom_kill_task(victim);
> >  }
> 
> And this function is secretly called under tasklist_lock, which is what
> pins *victim, yes?
> 

All of the out_of_memory() helper functions are called under 
tasklist_lock, which is what makes all these iterations safe.

* Re: [patch 09/18] oom: select task from tasklist for mempolicy ooms
  2010-06-08 23:43   ` Andrew Morton
@ 2010-06-09  0:40     ` David Rientjes
  0 siblings, 0 replies; 104+ messages in thread
From: David Rientjes @ 2010-06-09  0:40 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Rik van Riel, Nick Piggin, Oleg Nesterov, Balbir Singh,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, linux-mm, Andrea Arcangeli

On Tue, 8 Jun 2010, Andrew Morton wrote:

> > The oom killer presently kills current whenever there is no more memory
> > free or reclaimable on its mempolicy's nodes.  There is no guarantee that
> > current is a memory-hogging task or that killing it will free any
> > substantial amount of memory, however.
> 
> Well OK.  But we don't necessarily *want* to "free a substantial amount
> of memory".  We want to resolve the oom within `current'.  That's the
> sole responsibility of the oom-killer.  It doesn't have to free up
> large amounts of additional memory in the expectation that sometime in
> the future some other task will get an oom as well.  If the oom-killer
> is working well, we can defer those actions until the problem actually
> occurs.
> 

The oom killer has always attempted to kill a task that frees a large 
amount of memory: look at goal #2 in today's badness() heuristic (we 
recover a large amount of memory).  By doing this, we avoid endless loops 
where anything we fork, or our bash shell, is constantly being oom killed, 
or where a large number of tasks that each free only minimal amounts of 
memory get killed.  The current behavior of killing current rarely works 
as a single remedy without being followed up by additional kills or user 
intervention.

> Plus: if `current' isn't using much memory then it's probably a
> short-lived or not-very-important process anyway.
> 

That potentially prevents anything bound to that mempolicy from ever 
getting forked.

> > In such situations, it is better to scan the tasklist for tasks that are
> > allowed to allocate on current's set of nodes and kill the task with the
> > highest badness() score.  This ensures that the most memory-hogging task,
> > or the one configured by the user with /proc/pid/oom_adj, is always
> > selected in such scenarios.
> 
> Well... *why* is it better?  Needs more justification/explanation IMO.
> 

This unifies mempolicy oom conditions with the behavior of cpuset and 
memcg oom conditions: we want to use the badness() heuristic to kill the 
best candidate task rather than nuke tons of processes for little benefit 
or, for instance, kill all other tasks sharing the same mempolicy nodes 
for the benefit of a memory hogger.  Userspace has the ability to 
influence this heuristic (even more powerfully with my heuristic rewrite 
coming later in this series) so it can better tune how the kernel reacts 
to mempolicy ooms, which is a key objective of this work.  Simply killing 
current allows no userspace intervention and can kill meaningful (and 
innocent) tasks, losing work for no reason.

> A long time ago Andrea changed the oom-killer so that it basically
> always killed `current', iirc.  I think that shipped in the Suse
> kernel.

You can do that for the entire oom killer by enabling 
/proc/sys/vm/oom_kill_allocating_task.  SGI wanted that to avoid these 
lengthy tasklist scans.

> Maybe it was only in the case where `current' got an oom when
> satisfying a pagefault, I forget the details.  But according to Andrea,
> this design provided a simple and practical solution to ooms.
> 

Right, VM_FAULT_OOM always killed current, and that was recently changed 
to invoke the pagefault oom handler.  Nick has now converted the remaining 
architectures that were not using it to do so, so there is actually no 
difference for pagefaults anymore.  In an earlier revision of this 
rewrite, I wanted pagefault ooms to try killing current first if it were 
killable and then fall back to the tasklist scan and heuristic, but that 
was argued against for not conforming to other memory allocation failures.

* Re: [patch 09/18] oom: select task from tasklist for mempolicy ooms
  2010-06-08 21:08   ` Andrew Morton
  2010-06-08 21:17     ` Oleg Nesterov
@ 2010-06-09  0:46     ` David Rientjes
  1 sibling, 0 replies; 104+ messages in thread
From: David Rientjes @ 2010-06-09  0:46 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Rik van Riel, Nick Piggin, Oleg Nesterov, Balbir Singh,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, linux-mm

On Tue, 8 Jun 2010, Andrew Morton wrote:

> > The oom killer presently kills current whenever there is no more memory
> > free or reclaimable on its mempolicy's nodes.  There is no guarantee that
> > current is a memory-hogging task or that killing it will free any
> > substantial amount of memory, however.
> > 
> > In such situations, it is better to scan the tasklist for tasks that are
> > allowed to allocate on current's set of nodes and kill the task with the
> > highest badness() score.  This ensures that the most memory-hogging task,
> > or the one configured by the user with /proc/pid/oom_adj, is always
> > selected in such scenarios.
> > 
> >
> > ...
> >
> > --- a/mm/oom_kill.c
> > +++ b/mm/oom_kill.c
> > @@ -27,6 +27,7 @@
> >  #include <linux/module.h>
> >  #include <linux/notifier.h>
> >  #include <linux/memcontrol.h>
> > +#include <linux/mempolicy.h>
> >  #include <linux/security.h>
> >  
> >  int sysctl_panic_on_oom;
> > @@ -36,20 +37,36 @@ static DEFINE_SPINLOCK(zone_scan_lock);
> >  /* #define DEBUG */
> >  
> >  /*
> > - * Is all threads of the target process nodes overlap ours?
> > + * Do all threads of the target process overlap our allowed nodes?
> > + * @tsk: task struct of which task to consider
> > + * @mask: nodemask passed to page allocator for mempolicy ooms
> 
> The comment uses kerneldoc annotation but isn't a kerneldoc comment.
> 

I'll fix it.

> >   */
> > -static int has_intersects_mems_allowed(struct task_struct *tsk)
> > +static bool has_intersects_mems_allowed(struct task_struct *tsk,
> > +					const nodemask_t *mask)
> >  {
> > -	struct task_struct *t;
> > +	struct task_struct *start = tsk;
> >  
> > -	t = tsk;
> >  	do {
> > -		if (cpuset_mems_allowed_intersects(current, t))
> > -			return 1;
> > -		t = next_thread(t);
> > -	} while (t != tsk);
> > -
> > -	return 0;
> > +		if (mask) {
> > +			/*
> > +			 * If this is a mempolicy constrained oom, tsk's
> > +			 * cpuset is irrelevant.  Only return true if its
> > +			 * mempolicy intersects current, otherwise it may be
> > +			 * needlessly killed.
> > +			 */
> > +			if (mempolicy_nodemask_intersects(tsk, mask))
> > +				return true;
> 
> The comment refers to `current' but the code does not?
> 

mempolicy_nodemask_intersects() compares tsk's mempolicy to current's, so 
we don't need to pass current into the function (and we optimize for that 
since we don't need to do task_lock(current): nothing else can change its 
mempolicy).

> > +		} else {
> > +			/*
> > +			 * This is not a mempolicy constrained oom, so only
> > +			 * check the mems of tsk's cpuset.
> > +			 */
> 
> The comment doesn't refer to `current', but the code does.  Confused.
> 

This simply compares the cpuset mems_allowed of both tasks passed into the 
function.

> > +			if (cpuset_mems_allowed_intersects(current, tsk))
> > +				return true;
> > +		}
> > +		tsk = next_thread(tsk);
> 
> hm, next_thread() uses list_entry_rcu().  What are the locking rules
> here?  It's one of both of rcu_read_lock() and read_lock(&tasklist_lock),
> I think?
> 

Oleg addressed this in his response.

> > +	} while (tsk != start);
> > +	return false;
> >  }
> 
> This is all bloat and overhead for non-NUMA builds.  I doubt if gcc is
> able to eliminate the task_struct walk (although I didn't check).
> 
> The function isn't oom-killer-specific at all - give it a better name
> then move it to mempolicy.c or similar?  If so, the text "oom"
> shouldn't appear in the comments.
> 

It's the only place where we want to filter tasks based on whether they 
share mempolicy nodes or cpuset mems, though, so I think it's 
appropriately placed in mm/oom_kill.c.  I agree that we can add a
#ifndef CONFIG_NUMA variant and I'll do so, thanks.

> >
> > ...
> >
> > @@ -676,24 +699,19 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
> >  	 */
> >  	constraint = constrained_alloc(zonelist, gfp_mask, nodemask);
> >  	read_lock(&tasklist_lock);
> > -
> > -	switch (constraint) {
> > -	case CONSTRAINT_MEMORY_POLICY:
> > -		oom_kill_process(current, gfp_mask, order, 0, NULL,
> > -				"No available memory (MPOL_BIND)");
> > -		break;
> > -
> > -	case CONSTRAINT_NONE:
> > -		if (sysctl_panic_on_oom) {
> > +	if (unlikely(sysctl_panic_on_oom)) {
> > +		/*
> > +		 * panic_on_oom only affects CONSTRAINT_NONE, the kernel
> > +		 * should not panic for cpuset or mempolicy induced memory
> > +		 * failures.
> > +		 */
> 
> This wasn't changelogged?
> 

It's not a functional change; sysctl_panic_on_oom == 2 is already handled 
earlier in the function.  The comment was intended to elaborate on why 
we're only concerned about CONSTRAINT_NONE here now that the switch 
statement has been removed.

> > +		if (constraint == CONSTRAINT_NONE) {
> >  			dump_header(NULL, gfp_mask, order, NULL);
> > -			panic("out of memory. panic_on_oom is selected\n");
> > +			read_unlock(&tasklist_lock);
> > +			panic("Out of memory: panic_on_oom is enabled\n");
> >  		}
> > -		/* Fall-through */
> > -	case CONSTRAINT_CPUSET:
> > -		__out_of_memory(gfp_mask, order);
> > -		break;
> >  	}
> > -
> > +	__out_of_memory(gfp_mask, order, constraint, nodemask);
> >  	read_unlock(&tasklist_lock);
> >  
> >  	/*
> 
> 

* Re: [patch 10/18] oom: enable oom tasklist dump by default
  2010-06-08 21:13   ` Andrew Morton
@ 2010-06-09  0:52     ` David Rientjes
  0 siblings, 0 replies; 104+ messages in thread
From: David Rientjes @ 2010-06-09  0:52 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Rik van Riel, Nick Piggin, Oleg Nesterov, Balbir Singh,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, linux-mm

On Tue, 8 Jun 2010, Andrew Morton wrote:

> > The oom killer tasklist dump, enabled with the oom_dump_tasks sysctl, is
> > very helpful information in diagnosing why a user's task has been killed.
> > It emits useful information such as each eligible thread's memory usage
> > that can determine why the system is oom, so it should be enabled by
> > default.
> 
> Unclear.  On a large system the poor thing will now spend half an hour
> squirting junk out the diagnostic port.  Probably interspersed with the
> occasional whine from the softlockup detector.  And for many
> applications, spending a long time stuck in the kernel printing
> diagnostics is equivalent to an outage.
> 
> I guess people can turn it off again if this happens, but they'll get
> justifiably grumpy at us.  I wonder if this change is too
> developer-friendly and insufficiently operator-friendly.
> 

This is one of the main reasons why I wanted to unify 
oom_kill_allocating_task and oom_dump_tasks into a single sysctl, 
oom_kill_quick, but that was nacked.  Both of the former sysctls have the 
same audience: those, like SGI, who want to avoid lengthy tasklist scans 
by enabling the first and disabling the second.  If we were to extend the 
oom killer in the future and need special handling for these customers, 
it would have been easy with the unified sysctl, but I'm not going to 
wage that war again.

I think this is more helpful than harmful, however, solely because it 
gives users a better indication of what caused their system to be oom in 
the first place and can be disabled at runtime.
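
For operators who do hit the lengthy-console-dump case Andrew describes, the runtime knob looks like this (assuming procfs is mounted and the sysctl tool is available):

```shell
# Check the current setting (1 = dump eligible tasks on oom)
cat /proc/sys/vm/oom_dump_tasks

# Disable the tasklist dump at runtime
sysctl -w vm.oom_dump_tasks=0
# or equivalently:
echo 0 > /proc/sys/vm/oom_dump_tasks
```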

* Re: [patch 06/18] oom: avoid sending exiting tasks a SIGKILL
  2010-06-08 20:26   ` Oleg Nesterov
@ 2010-06-09  6:32     ` David Rientjes
  2010-06-09 16:25       ` Oleg Nesterov
  0 siblings, 1 reply; 104+ messages in thread
From: David Rientjes @ 2010-06-09  6:32 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Andrew Morton, Rik van Riel, Nick Piggin, Balbir Singh,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, linux-mm

On Tue, 8 Jun 2010, Oleg Nesterov wrote:

> > It's unnecessary to SIGKILL a task that is already PF_EXITING
> 
> This probably needs some explanation. PF_EXITING doesn't necessarily
> mean this process is exiting.
> 

I hope that my sentence didn't imply that it was, the point is that 
sending a SIGKILL to a PF_EXITING task isn't necessary to make it exit, 
it's already along the right path.

* Re: [patch 06/18] oom: avoid sending exiting tasks a SIGKILL
  2010-06-09  6:32     ` David Rientjes
@ 2010-06-09 16:25       ` Oleg Nesterov
  2010-06-09 19:44         ` David Rientjes
  0 siblings, 1 reply; 104+ messages in thread
From: Oleg Nesterov @ 2010-06-09 16:25 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Rik van Riel, Nick Piggin, Balbir Singh,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, linux-mm

On 06/08, David Rientjes wrote:
>
> On Tue, 8 Jun 2010, Oleg Nesterov wrote:
>
> > > It's unnecessary to SIGKILL a task that is already PF_EXITING
> >
> > This probably needs some explanation. PF_EXITING doesn't necessarily
> > mean this process is exiting.
>
> I hope that my sentence didn't imply that it was, the point is that
> sending a SIGKILL to a PF_EXITING task isn't necessary to make it exit,
> it's already along the right path.

Well, probably this is right...

David, currently I do not know how the code looks with all patches
applied, could you please confirm there is no problem here? I am
looking at Linus's tree,

	mem_cgroup_out_of_memory:

		 p = select_bad_process();
		 oom_kill_process(p);

Now, again, select_bad_process() can return the dead group-leader
of the memory-hog-thread-group.

In that case set_tsk_thread_flag(TIF_MEMDIE) buys nothing since this 
thread has already exited, but we do want to kill this process.

If this is not true due to other changes - great.

Otherwise, perhaps this needs

	- if (PF_EXITING)
	+ if (PF_EXITING && mm)

too?

Oleg.

* Re: [patch 06/18] oom: avoid sending exiting tasks a SIGKILL
  2010-06-09 16:25       ` Oleg Nesterov
@ 2010-06-09 19:44         ` David Rientjes
  2010-06-09 20:14           ` Oleg Nesterov
  0 siblings, 1 reply; 104+ messages in thread
From: David Rientjes @ 2010-06-09 19:44 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Andrew Morton, Rik van Riel, Nick Piggin, Balbir Singh,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, linux-mm

On Wed, 9 Jun 2010, Oleg Nesterov wrote:

> > I hope that my sentence didn't imply that it was, the point is that
> > sending a SIGKILL to a PF_EXITING task isn't necessary to make it exit,
> > it's already along the right path.
> 
> Well, probably this is right...
> 
> David, currently I do not know how the code looks with all patches
> applied, could you please confirm there is no problem here? I am
> looking at Linus's tree,
> 
> 	mem_cgroup_out_of_memory:
> 
> 		 p = select_bad_process();
> 		 oom_kill_process(p);
> 

mem_cgroup_out_of_memory() does this under tasklist_lock:

retry:
	p = select_bad_process(&points, mem, CONSTRAINT_MEMCG, NULL);
	if (!p || PTR_ERR(p) == -1UL)
		goto out;

	if (oom_kill_process(p, gfp_mask, 0, points, mem,
				"Memory cgroup out of memory"))
		goto retry;
out:
	...

> Now, again, select_bad_process() can return the dead group-leader
> of the memory-hog-thread-group.
> 

select_bad_process() already has:

	if ((p->flags & PF_EXITING) && p->mm) {
		if (p != current)
			return ERR_PTR(-1UL);

		chosen = p;
		*ppoints = ULONG_MAX;
	}

so we can disregard the check for p == current in this case since it would 
not be allocating memory without p->mm.

* Re: [patch 06/18] oom: avoid sending exiting tasks a SIGKILL
  2010-06-09 19:44         ` David Rientjes
@ 2010-06-09 20:14           ` Oleg Nesterov
  2010-06-10  0:15             ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 104+ messages in thread
From: Oleg Nesterov @ 2010-06-09 20:14 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Rik van Riel, Nick Piggin, Balbir Singh,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, linux-mm

On 06/09, David Rientjes wrote:
>
> On Wed, 9 Jun 2010, Oleg Nesterov wrote:
>
> > David, currently I do not know how the code looks with all patches
> > applied, could you please confirm there is no problem here? I am
> > looking at Linus's tree,
> >
> > 	mem_cgroup_out_of_memory:
> >
> > 		 p = select_bad_process();
> > 		 oom_kill_process(p);
> >
>
> mem_cgroup_out_of_memory() does this under tasklist_lock:
>
> retry:
> 	p = select_bad_process(&points, mem, CONSTRAINT_MEMCG, NULL);
> 	if (!p || PTR_ERR(p) == -1UL)
> 		goto out;
>
> 	if (oom_kill_process(p, gfp_mask, 0, points, mem,
> 				"Memory cgroup out of memory"))
> 		goto retry;
> out:
> 	...
>
> > Now, again, select_bad_process() can return the dead group-leader
> > of the memory-hog-thread-group.
> >
>
> select_bad_process() already has:
>
> 	if ((p->flags & PF_EXITING) && p->mm) {
> 		if (p != current)
> 			return ERR_PTR(-1UL);
>
> 		chosen = p;
> 		*ppoints = ULONG_MAX;
> 	}
>
> so we can disregard the check for p == current

Not sure I understand... We can just ignore this check, in this case
p->mm == NULL.

> in this case since it would
> not be allocating memory without p->mm.

This thread will not allocate the memory, yes.  But its sub-threads can.
And select_bad_process() can constantly return the same (dead) thread P;
badness() inspects ->mm under find_lock_task_mm(), which finds the thread
with a valid ->mm.

OK.  Probably this doesn't matter.  I don't know if task_in_mem_cgroup(task)
was fixed or not, but currently it also looks at task->mm and thus has
the same boring problem: it is trivial to make the memory-hog process
invisible to oom.  Unless I missed something, of course.

Oleg.

* Re: [patch 06/18] oom: avoid sending exiting tasks a SIGKILL
  2010-06-09 20:14           ` Oleg Nesterov
@ 2010-06-10  0:15             ` KAMEZAWA Hiroyuki
  2010-06-10  1:21               ` Oleg Nesterov
  0 siblings, 1 reply; 104+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-06-10  0:15 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: David Rientjes, Andrew Morton, Rik van Riel, Nick Piggin,
	Balbir Singh, KOSAKI Motohiro, linux-mm

On Wed, 9 Jun 2010 22:14:30 +0200
Oleg Nesterov <oleg@redhat.com> wrote:

> > in this case since it would
> > not be allocating memory without p->mm.
> 
> This thread will not allocate the memory, yes. But its sub-threads can.
> And select_bad_process() can constantly return the same (dead) thread P,
> badness() inspects ->mm under find_lock_task_mm() which finds the thread
> with the valid ->mm.
> 
> OK. Probably this doesn't matter. I don't know if task_in_mem_cgroup(task)
> was fixed or not, but currently it also looks at task->mm and thus have
> the same boring problem: it is trivial to make the memory-hog process
> invisible to oom. Unless I missed something, of course.
> 

Hmm... your concern is that there is a case where mem_cgroup_out_of_memory()
can't kill anything?  Currently, memcg doesn't return -ENOMEM in the usual
case, so it loops until some memory becomes available under its limit.
Then, as long as mem_cgroup_out_of_memory() can kill a process within
several tries, we won't have a terrible problem (even if it's slow).

Hmm.  What I can't understand is whether there is a case where a PF_EXITING
thread never exits.  If so, we need some care (in memcg?)

Thanks,
-Kame

* Re: [patch 06/18] oom: avoid sending exiting tasks a SIGKILL
  2010-06-10  0:15             ` KAMEZAWA Hiroyuki
@ 2010-06-10  1:21               ` Oleg Nesterov
  2010-06-10  1:43                 ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 104+ messages in thread
From: Oleg Nesterov @ 2010-06-10  1:21 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: David Rientjes, Andrew Morton, Rik van Riel, Nick Piggin,
	Balbir Singh, KOSAKI Motohiro, linux-mm

On 06/10, KAMEZAWA Hiroyuki wrote:
>
> On Wed, 9 Jun 2010 22:14:30 +0200
> Oleg Nesterov <oleg@redhat.com> wrote:
>
> > > in this case since it would
> > > not be allocating memory without p->mm.
> >
> > This thread will not allocate the memory, yes. But its sub-threads can.
> > And select_bad_process() can constantly return the same (dead) thread P,
> > badness() inspects ->mm under find_lock_task_mm() which finds the thread
> > with the valid ->mm.
> >
> > OK. Probably this doesn't matter. I don't know if task_in_mem_cgroup(task)
> > was fixed or not, but currently it also looks at task->mm and thus have
> > the same boring problem: it is trivial to make the memory-hog process
> > invisible to oom. Unless I missed something, of course.
>
> HmHm...your concern is that there is a case when mem_cgroup_out_of_memory()
> can't kill anything ?

Or it can kill the wrong task. But once again, I am only speculating
looking at the current code.

> Now, memcg doesn't return -ENOMEM in usual case.
> So, it loops until there are some available memory under its limit.
> Then, if memory_cgroup_out_of_memory() can kill a process in several trial,
> we'll not have terrible problem. (even if it's slow.)
>
> Hmm. What I can't understand is whether there is a case when PF_EXITING
> thread never exit. If so, we need some care (in memcg?)

	void *thread_func(void *)
	{
		for (;;)
			malloc();
	}

	int main(void)
	{
		pthread_create(..., thread_func, ...);
		pthread_exit();
	}

This process runs with the dead group-leader (PF_EXITING is set, ->mm == NULL).
mem_cgroup_out_of_memory()->select_bad_process() can't see it due to
task_in_mem_cgroup() check.

Afaics

	- task_in_mem_cgroup() should use find_lock_task_mm() too

	- oom_kill_process() should check "PF_EXITING && p->mm",
	  like select_bad_process() does.

Oleg.

* Re: [patch 06/18] oom: avoid sending exiting tasks a SIGKILL
  2010-06-10  1:21               ` Oleg Nesterov
@ 2010-06-10  1:43                 ` KAMEZAWA Hiroyuki
  2010-06-10  1:51                   ` Oleg Nesterov
  0 siblings, 1 reply; 104+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-06-10  1:43 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: David Rientjes, Andrew Morton, Rik van Riel, Nick Piggin,
	Balbir Singh, KOSAKI Motohiro, linux-mm

On Thu, 10 Jun 2010 03:21:01 +0200
Oleg Nesterov <oleg@redhat.com> wrote:

> On 06/10, KAMEZAWA Hiroyuki wrote:
> >
> > On Wed, 9 Jun 2010 22:14:30 +0200
> > Oleg Nesterov <oleg@redhat.com> wrote:
> >
> > > > in this case since it would
> > > > not be allocating memory without p->mm.
> > >
> > > This thread will not allocate the memory, yes. But its sub-threads can.
> > > And select_bad_process() can constantly return the same (dead) thread P,
> > > badness() inspects ->mm under find_lock_task_mm() which finds the thread
> > > with the valid ->mm.
> > >
> > > OK. Probably this doesn't matter. I don't know if task_in_mem_cgroup(task)
> > > was fixed or not, but currently it also looks at task->mm and thus have
> > > the same boring problem: it is trivial to make the memory-hog process
> > > invisible to oom. Unless I missed something, of course.
> >
> > HmHm...your concern is that there is a case when mem_cgroup_out_of_memory()
> > can't kill anything ?
> 
> Or it can kill the wrong task. But once again, I am only speculating
> looking at the current code.
> 
> > Now, memcg doesn't return -ENOMEM in usual case.
> > So, it loops until there are some available memory under its limit.
> > Then, if memory_cgroup_out_of_memory() can kill a process in several trial,
> > we'll not have terrible problem. (even if it's slow.)
> >
> > Hmm. What I can't understand is whether there is a case when PF_EXITING
> > thread never exit. If so, we need some care (in memcg?)
> 
> 	void *thread_func(void *)
> 	{
> 		for (;;)
> 			malloc();
> 	}
> 
> 	int main(void)
> 	{
> 		pthread_create(..., thread_func, ...);
> 		pthread_exit();
> 	}
> 
> This process runs with the dead group-leader (PF_EXITING is set, ->mm == NULL).
> mem_cgroup_out_of_memory()->select_bad_process() can't see it due to
> task_in_mem_cgroup() check.
> 
> Afaics
> 
> 	- task_in_mem_cgroup() should use find_lock_task_mm() too
> 
> 	- oom_kill_process() should check "PF_EXITING && p->mm",
> 	  like select_bad_process() does.
> 

Hm.  I'd like to look into that when the next mmotm is shipped.
(too many patches in flight..)

The problem is

  for (walking each 'process')
	if (task_in_mem_cgroup(p, memcg))

 can't check whether 'p' contains threads belonging to the given memcg,
 because p->mm can be NULL.  So task_in_mem_cgroup() should call
 find_lock_task_mm() when getting the "mm" struct.


Thanks,
-Kame

* Re: [patch 06/18] oom: avoid sending exiting tasks a SIGKILL
  2010-06-10  1:43                 ` KAMEZAWA Hiroyuki
@ 2010-06-10  1:51                   ` Oleg Nesterov
  0 siblings, 0 replies; 104+ messages in thread
From: Oleg Nesterov @ 2010-06-10  1:51 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: David Rientjes, Andrew Morton, Rik van Riel, Nick Piggin,
	Balbir Singh, KOSAKI Motohiro, linux-mm

On 06/10, KAMEZAWA Hiroyuki wrote:
>
> > Afaics
> >
> > 	- task_in_mem_cgroup() should use find_lock_task_mm() too
> >
> > 	- oom_kill_process() should check "PF_EXITING && p->mm",
> > 	  like select_bad_process() does.
> >
>
> Hm. I'd like to look into that when the next mmotm is shipped.
> (too many pactches in flight..)

Me too ;)

> The problem is
>
>   for (walking each 'process')
> 	if (task_in_mem_cgroup(p, memcg))
>
>  can't check 'p' containes threads belongs to given memcg because p->mm can
>  be NULL. So, task_in_mem_cgroup should call find_lock_task_mm() when
>  getting "mm" struct.

Yes, this is what I meant.  And after we make this change we should tweak
oom_kill_process() too; otherwise we have another problem.

Oleg.

* Re: [patch 16/18] oom: badness heuristic rewrite
  2010-06-08 23:02     ` Andrew Morton
@ 2010-06-13 11:24       ` KOSAKI Motohiro
  2010-06-17  5:14       ` David Rientjes
  1 sibling, 0 replies; 104+ messages in thread
From: KOSAKI Motohiro @ 2010-06-13 11:24 UTC (permalink / raw)
  To: Andrew Morton
  Cc: kosaki.motohiro, David Rientjes, Rik van Riel, Nick Piggin,
	Oleg Nesterov, Balbir Singh, KAMEZAWA Hiroyuki, linux-mm

> > >   *  Copyright (C)  1998,2000  Rik van Riel
> > >   *	Thanks go out to Claus Fischer for some serious inspiration and
> > >   *	for goading me into coding this file...
> > > + *  Copyright (C)  2010  Google, Inc.
> > > + *	Rewritten by David Rientjes
> > 
> > don't put it.
> > 
> 
> Seems OK to me.  It's a fairly substantial change and people have added
> their (c) in the past for smaller kernel changes.  I guess one could even
> do this for a one-liner.

If you are OK, I have no objection. I'm not a lawyer.
But, at least in Japan, co-developers are usually included in the author
notice. (Of course, I don't mean myself...)





^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: [patch 07/18] oom: filter tasks not sharing the same cpuset
  2010-06-08 19:27       ` Andrew Morton
@ 2010-06-13 11:24         ` KOSAKI Motohiro
  2010-07-02 22:35           ` Andrew Morton
  0 siblings, 1 reply; 104+ messages in thread
From: KOSAKI Motohiro @ 2010-06-13 11:24 UTC (permalink / raw)
  To: Andrew Morton
  Cc: kosaki.motohiro, David Rientjes, Rik van Riel, Nick Piggin,
	Oleg Nesterov, Balbir Singh, KAMEZAWA Hiroyuki, linux-mm

Sorry for the delay.

> On Tue, 8 Jun 2010 11:51:32 -0700 (PDT)
> David Rientjes <rientjes@google.com> wrote:
> 
> > Andrew, are you the maintainer for these fixes or is KOSAKI?
> 
> I am, thanks.  Kosaki-san, you're making this harder than it should be.
> Please either ack David's patches or promptly work with him on
> finalising them.

Thanks, Andrew, David. I agree with you. I don't find any end-user harm
or regressions in David's latest patch series, so I'm glad to join his work.

Unfortunately, I don't have enough time now, so I expect my next review
will not come soon, but I promise I'll do it.

thanks.


> 
> I realise that you have additional oom-killer patches but it's too
> complex to try to work on two patch series concurrently.  So let's
> concentrate on get David's work sorted out and merged and then please
> rebase yours on the result.
> 
> I certainly don't have the time or inclination to go through two
> patchsets and work out what the similarities and differences are so
> I'll be concentrating on David's ones first.  The order in which we
> do this doesn't really matter.




^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: [patch 18/18] oom: deprecate oom_adj tunable
  2010-06-08 23:18     ` Andrew Morton
@ 2010-06-13 11:24       ` KOSAKI Motohiro
  2010-06-17  3:36         ` David Rientjes
  0 siblings, 1 reply; 104+ messages in thread
From: KOSAKI Motohiro @ 2010-06-13 11:24 UTC (permalink / raw)
  To: Andrew Morton
  Cc: kosaki.motohiro, David Rientjes, Rik van Riel, Nick Piggin,
	Oleg Nesterov, Balbir Singh, KAMEZAWA Hiroyuki, linux-mm

> On Tue,  8 Jun 2010 20:42:02 +0900 (JST)
> KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote:
> 
> > > +	/*
> > > +	 * Warn that /proc/pid/oom_adj is deprecated, see
> > > +	 * Documentation/feature-removal-schedule.txt.
> > > +	 */
> > > +	printk_once(KERN_WARNING "%s (%d): /proc/%d/oom_adj is deprecated, "
> > > +			"please use /proc/%d/oom_score_adj instead.\n",
> > > +			current->comm, task_pid_nr(current),
> > > +			task_pid_nr(task), task_pid_nr(task));
> > >  	task->signal->oom_adj = oom_adjust;
> > 
> > Sorry, we can't accept this. oom_adj is one of the most frequently
> > used tuning knobs; deprecating it creates a lot of confusion.
> > 
> > In addition, this knob is used by some applications (please search
> > Google Code Search or similar). That said, an end user can't stop
> > the warning, which creates a lot of frustration. NO.
> > 
> 
> I think it's OK.  We made a mistake in adding oom_adj in the first
> place and now we get to live with the consequences.
> 
> We'll be stuck with oom_adj for the next 200 years if we don't tell
> people to stop using it, and a printk_once() is a good way of doing
> that.
> 
> It could be that in two years time we decide that we can't remove oom_adj
> yet because too many people are still using it.  Maybe it will take ten
> years - but unless we add the above printk, oom_adj will remain
> forever.

But oom_score_adj has no benefit from the end user's view. That's the
problem. Please consider making an end-user-friendly patch first.

I mean, I'm not against a better knob deprecating the old one, but I
require that 'better' mean better for end users.




^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: [patch 05/18] oom: give current access to memory reserves if it has been killed
  2010-06-08 20:12     ` Andrew Morton
@ 2010-06-13 11:24       ` KOSAKI Motohiro
  0 siblings, 0 replies; 104+ messages in thread
From: KOSAKI Motohiro @ 2010-06-13 11:24 UTC (permalink / raw)
  To: Andrew Morton
  Cc: kosaki.motohiro, David Rientjes, Rik van Riel, Nick Piggin,
	Oleg Nesterov, Balbir Singh, KAMEZAWA Hiroyuki, linux-mm

> On Tue,  8 Jun 2010 20:41:57 +0900 (JST)
> KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote:
> 
> > > +
> > >  	if (sysctl_panic_on_oom == 2) {
> > >  		dump_header(NULL, gfp_mask, order, NULL);
> > >  		panic("out of memory. Compulsory panic_on_oom is selected.\n");
> > 
> > Sorry, I have found that this patch works incorrectly. I didn't pull it.
> 
> Saying "it doesn't work and I'm not telling you why" is unhelpful.  In
> fact it's the opposite of helpful because it blocks merging of the fix
> and doesn't give us any way to move forward.
> 
> So what can I do?  Hard.
> 
> What I shall do is to merge the patch in the hope that someone else will
> discover the undescribed problem and we will fix it then.  That's very
> inefficient.

Please see the e-mail I posted 5 minutes before this one. Thanks.



^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: [patch 05/18] oom: give current access to memory reserves if it has been killed
  2010-06-08 18:47     ` David Rientjes
@ 2010-06-14 11:08       ` KOSAKI Motohiro
  0 siblings, 0 replies; 104+ messages in thread
From: KOSAKI Motohiro @ 2010-06-14 11:08 UTC (permalink / raw)
  To: David Rientjes
  Cc: kosaki.motohiro, Andrew Morton, Rik van Riel, Nick Piggin,
	Oleg Nesterov, Balbir Singh, KAMEZAWA Hiroyuki, linux-mm

> > > +	/*
> > > +	 * If current has a pending SIGKILL, then automatically select it.  The
> > > +	 * goal is to allow it to allocate so that it may quickly exit and free
> > > +	 * its memory.
> > > +	 */
> > > +	if (fatal_signal_pending(current)) {
> > > +		set_thread_flag(TIF_MEMDIE);
> > > +		return;
> > > +	}
> > > +
> > >  	if (sysctl_panic_on_oom == 2) {
> > >  		dump_header(NULL, gfp_mask, order, NULL);
> > >  		panic("out of memory. Compulsory panic_on_oom is selected.\n");
> > 
> > Sorry, I have found that this patch works incorrectly. I didn't pull it.
> > 
> 
> You're taking back your ack?
> 
> Why does this not work?  It's not killing a potentially immune task, the 
> task is already dying.  We're simply giving it access to memory reserves 
> so that it may quickly exit and die.  OOM_DISABLE does not imply that a 
> task cannot exit on its own or be killed by another application or user, 
> we simply don't want to needlessly kill another task when current is dying 
> in the first place without being able to allocate memory.
> 
> Please reconsider your thought.

Oh, I wasn't talking about OOM_DISABLE; probably my explanation was too
poor.

My point is that the above code assumes a pending SIGKILL is a good sign
that the task is going to exit soon, but that is not always true. It is
only true if the task is a regular userland process; a kernel module
author is free to create very strange kernel threads.

Note: Linux is one of the most popular general-purpose OSes in the world,
and we have millions of odd drivers.

Furthermore, a false positive here is very dangerous: while a TIF_MEMDIE
task exists, the kernel doesn't send the next OOM kill, which means the
system can reach deadlock. A false negative, on the other hand, is
relatively safe; it kills one innocent task, but the system doesn't lock
up.

So we have strong motivation to avoid false positives, and I hope you add
some conservative check.

I don't disagree with your patch's concept. I only worry about the danger.




^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: [patch 18/18] oom: deprecate oom_adj tunable
  2010-06-13 11:24       ` KOSAKI Motohiro
@ 2010-06-17  3:36         ` David Rientjes
  2010-06-21 11:45           ` KOSAKI Motohiro
  0 siblings, 1 reply; 104+ messages in thread
From: David Rientjes @ 2010-06-17  3:36 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Andrew Morton, Rik van Riel, Nick Piggin, Oleg Nesterov,
	Balbir Singh, KAMEZAWA Hiroyuki, linux-mm

On Sun, 13 Jun 2010, KOSAKI Motohiro wrote:

> But oom_score_adj has no benefit from the end user's view. That's the
> problem. Please consider making an end-user-friendly patch first.
> 

Of course it does, it actually has units whereas oom_adj only grows or 
shrinks the badness score exponentially.  oom_score_adj's units are well 
understood: on a machine with 4G of memory, 250 means we're trying to 
prejudice it by 1G of memory so that it can be used by other tasks, -250 
means other tasks should be prejudiced by 1G in comparison to this task, 
etc.  It's actually quite powerful.


^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: [patch 16/18] oom: badness heuristic rewrite
  2010-06-08 11:41   ` KOSAKI Motohiro
  2010-06-08 23:02     ` Andrew Morton
@ 2010-06-17  5:12     ` David Rientjes
  2010-06-21 11:45       ` KOSAKI Motohiro
  1 sibling, 1 reply; 104+ messages in thread
From: David Rientjes @ 2010-06-17  5:12 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Andrew Morton, Rik van Riel, Nick Piggin, Oleg Nesterov,
	Balbir Singh, KAMEZAWA Hiroyuki, linux-mm

On Tue, 8 Jun 2010, KOSAKI Motohiro wrote:

> > diff --git a/fs/proc/base.c b/fs/proc/base.c
> > --- a/fs/proc/base.c
> > +++ b/fs/proc/base.c
> > @@ -63,6 +63,7 @@
> >  #include <linux/namei.h>
> >  #include <linux/mnt_namespace.h>
> >  #include <linux/mm.h>
> > +#include <linux/swap.h>
> >  #include <linux/rcupdate.h>
> >  #include <linux/kallsyms.h>
> >  #include <linux/stacktrace.h>
> > @@ -428,16 +429,18 @@ static const struct file_operations proc_lstats_operations = {
> >  #endif
> >  
> >  /* The badness from the OOM killer */
> > -unsigned long badness(struct task_struct *p, unsigned long uptime);
> >  static int proc_oom_score(struct task_struct *task, char *buffer)
> >  {
> >  	unsigned long points = 0;
> > -	struct timespec uptime;
> >  
> > -	do_posix_clock_monotonic_gettime(&uptime);
> >  	read_lock(&tasklist_lock);
> >  	if (pid_alive(task))
> > -		points = badness(task, uptime.tv_sec);
> > +		points = oom_badness(task->group_leader,
> > +					global_page_state(NR_INACTIVE_ANON) +
> > +					global_page_state(NR_ACTIVE_ANON) +
> > +					global_page_state(NR_INACTIVE_FILE) +
> > +					global_page_state(NR_ACTIVE_FILE) +
> > +					total_swap_pages);
> 
> Sorry, I can't ack this. Again and again, I have tried to explain why
> this is wrong (hopefully for the last time):
> 
> 1) incompatibility
>    oom_score is part of the ABI, so we can't change it. From the end
>    user's view this change has no merit. In general, an incompatibility
>    is allowed only in very limited situations, such as when end users
>    get much more benefit than the compatibility is worth; in other
>    words, when the old-style ABI doesn't work fine from the end user's
>    view. But in this case, it does.
> 

There is no incompatibility here, /proc/pid/oom_score has no meaningful 
units because of the old heuristic.  The _only_ thing it represents is a 
score in comparison with other eligible tasks to decide which task to 
kill.  Thus, oom_score by itself means nothing if not compared to other 
eligible tasks.

Although deprecated, /proc/pid/oom_adj still changes 
/proc/pid/oom_score_adj with a different scale (-17 maps to -1000 and +15 
maps to +1000), so there is absolutely no userspace incompatibility with 
this change.

> 2) technically incorrect
>    The math is not correct: it does not represent "allowed memory".
>    For example, 1) it does not accumulate mlocked memory, even though
>    that can be freed by killing the task; 2) the freeability of
>    SHM_LOCKED memory depends on whether IPC_RMID was done; if not,
>    killing the task doesn't free the SysV IPC memory.

Ah, very good point.  We should be using totalram_pages + total_swap_pages 
here to represent global normalization, the memcg limit for 
CONSTRAINT_MEMCG, and a total of node_spanned_pages for mempolicy nodes or 
cpuset mems for CONSTRAINT_MEMORY_POLICY and CONSTRAINT_CPUSET, 
respectively.  I'll make that switch in the next revision, thanks!

>    In addition, 3) this normalization doesn't work on asymmetric NUMA:
>    total pages and oom are almost unrelated.

What this does is represent the heuristic baseline, rss and swap, as a 
proportion depending on the type of oom constraint.  This works when 
comparing eligible tasks amongst each other because the task with the 
highest rss and swap is the one we (normally) want to kill, minus the 3% 
privilege given to root and outside influence of /proc/pid/oom_score_adj.

We want to represent this as a proportion and not as a raw value simply 
because the task may be attached to a cpuset, a memcg, or bound to a 
mempolicy out from under the task's knowledge.  That is, we compare tasks 
sharing the same constraint for oom kill and normalize the heuristic based 
on that.  We don't want to expose a userspace interface that takes memory 
quantities directly since the task may be bound to a mempolicy, for 
instance, later and the oom_score_adj is then rendered obsolete.

> 4) scalability. If the system has 10TB of memory, 1 point of oom score
>    means 10GB of memory consumption.

Well, sure, a 10TB system would have a large granularity such as that :)  
But in such cases we don't necessarily care if one task is using 5GB more 
than another task using 1TB, for example.

> >  	read_unlock(&tasklist_lock);
> >  	return sprintf(buffer, "%lu\n", points);
> >  }
> > @@ -1042,7 +1045,15 @@ static ssize_t oom_adjust_write(struct file *file, const char __user *buf,
> >  	}
> >  
> >  	task->signal->oom_adj = oom_adjust;
> > -
> > +	/*
> > +	 * Scale /proc/pid/oom_score_adj appropriately ensuring that a maximum
> > +	 * value is always attainable.
> > +	 */
> > +	if (task->signal->oom_adj == OOM_ADJUST_MAX)
> > +		task->signal->oom_score_adj = OOM_SCORE_ADJ_MAX;
> > +	else
> > +		task->signal->oom_score_adj = (oom_adjust * OOM_SCORE_ADJ_MAX) /
> > +								-OOM_DISABLE;
> >  	unlock_task_sighand(task, &flags);
> >  	put_task_struct(task);
> 
> Generally, I wasn't against the feature for a rare use case. But sorry,
> as far as I investigated, I haven't found any actual user, so I don't
> ack it, because my reviewing basically stands on 1) how many users use
> this, 2) how strongly users require it, 3) how much side effect there
> is, etc. -- not on whether it is cool.

oom_score_adj is much more powerful than oom_adj simply because it (i) is 
in units that are understood, not a bitshift on a widely unpredictable 
heuristic, and (ii) the granularity is _much_ finer than oom_adj.  We have 
many use cases for this internally especially when we bind tasks to 
cpusets or memcg and they change in size.

> > @@ -1055,6 +1066,82 @@ static const struct file_operations proc_oom_adjust_operations = {
> >  	.llseek		= generic_file_llseek,
> >  };
> >  
> > +static ssize_t oom_score_adj_read(struct file *file, char __user *buf,
> > +					size_t count, loff_t *ppos)
> > +{
> > +	struct task_struct *task = get_proc_task(file->f_path.dentry->d_inode);
> > +	char buffer[PROC_NUMBUF];
> > +	int oom_score_adj = OOM_SCORE_ADJ_MIN;
> > +	unsigned long flags;
> > +	size_t len;
> > +
> > +	if (!task)
> > +		return -ESRCH;
> > +	if (lock_task_sighand(task, &flags)) {
> > +		oom_score_adj = task->signal->oom_score_adj;
> > +		unlock_task_sighand(task, &flags);
> > +	}
> > +	put_task_struct(task);
> > +	len = snprintf(buffer, sizeof(buffer), "%d\n", oom_score_adj);
> > +	return simple_read_from_buffer(buf, count, ppos, buffer, len);
> > +}
> > +
> > +static ssize_t oom_score_adj_write(struct file *file, const char __user *buf,
> > +					size_t count, loff_t *ppos)
> > +{
> > +	struct task_struct *task;
> > +	char buffer[PROC_NUMBUF];
> > +	unsigned long flags;
> > +	long oom_score_adj;
> > +	int err;
> > +
> > +	memset(buffer, 0, sizeof(buffer));
> > +	if (count > sizeof(buffer) - 1)
> > +		count = sizeof(buffer) - 1;
> > +	if (copy_from_user(buffer, buf, count))
> > +		return -EFAULT;
> > +
> > +	err = strict_strtol(strstrip(buffer), 0, &oom_score_adj);
> > +	if (err)
> > +		return -EINVAL;
> > +	if (oom_score_adj < OOM_SCORE_ADJ_MIN ||
> > +			oom_score_adj > OOM_SCORE_ADJ_MAX)
> > +		return -EINVAL;
> > +
> > +	task = get_proc_task(file->f_path.dentry->d_inode);
> > +	if (!task)
> > +		return -ESRCH;
> > +	if (!lock_task_sighand(task, &flags)) {
> > +		put_task_struct(task);
> > +		return -ESRCH;
> > +	}
> > +	if (oom_score_adj < task->signal->oom_score_adj &&
> > +			!capable(CAP_SYS_RESOURCE)) {
> > +		unlock_task_sighand(task, &flags);
> > +		put_task_struct(task);
> > +		return -EACCES;
> > +	}
> > +
> > +	task->signal->oom_score_adj = oom_score_adj;
> > +	/*
> > +	 * Scale /proc/pid/oom_adj appropriately ensuring that OOM_DISABLE is
> > +	 * always attainable.
> > +	 */
> > +	if (task->signal->oom_score_adj == OOM_SCORE_ADJ_MIN)
> > +		task->signal->oom_adj = OOM_DISABLE;
> > +	else
> > +		task->signal->oom_adj = (oom_score_adj * OOM_ADJUST_MAX) /
> > +							OOM_SCORE_ADJ_MAX;
> > +	unlock_task_sighand(task, &flags);
> > +	put_task_struct(task);
> > +	return count;
> > +}
> > +
> > +static const struct file_operations proc_oom_score_adj_operations = {
> > +	.read		= oom_score_adj_read,
> > +	.write		= oom_score_adj_write,
> > +};
> > +
> >  #ifdef CONFIG_AUDITSYSCALL
> >  #define TMPBUFLEN 21
> >  static ssize_t proc_loginuid_read(struct file * file, char __user * buf,
> > @@ -2627,6 +2714,7 @@ static const struct pid_entry tgid_base_stuff[] = {
> >  #endif
> >  	INF("oom_score",  S_IRUGO, proc_oom_score),
> >  	REG("oom_adj",    S_IRUGO|S_IWUSR, proc_oom_adjust_operations),
> > +	REG("oom_score_adj", S_IRUGO|S_IWUSR, proc_oom_score_adj_operations),
> >  #ifdef CONFIG_AUDITSYSCALL
> >  	REG("loginuid",   S_IWUSR|S_IRUGO, proc_loginuid_operations),
> >  	REG("sessionid",  S_IRUGO, proc_sessionid_operations),
> > @@ -2961,6 +3049,7 @@ static const struct pid_entry tid_base_stuff[] = {
> >  #endif
> >  	INF("oom_score", S_IRUGO, proc_oom_score),
> >  	REG("oom_adj",   S_IRUGO|S_IWUSR, proc_oom_adjust_operations),
> > +	REG("oom_score_adj", S_IRUGO|S_IWUSR, proc_oom_score_adj_operations),
> >  #ifdef CONFIG_AUDITSYSCALL
> >  	REG("loginuid",  S_IWUSR|S_IRUGO, proc_loginuid_operations),
> >  	REG("sessionid",  S_IRUSR, proc_sessionid_operations),
> > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > --- a/mm/oom_kill.c
> > +++ b/mm/oom_kill.c
> > @@ -4,6 +4,8 @@
> >   *  Copyright (C)  1998,2000  Rik van Riel
> >   *	Thanks go out to Claus Fischer for some serious inspiration and
> >   *	for goading me into coding this file...
> > + *  Copyright (C)  2010  Google, Inc.
> > + *	Rewritten by David Rientjes
> 
> don't put it.
> 
> 
> 
> >   *
> >   *  The routines in this file are used to kill a process when
> >   *  we're seriously out of memory. This gets called from __alloc_pages()
> > @@ -34,7 +36,6 @@ int sysctl_panic_on_oom;
> >  int sysctl_oom_kill_allocating_task;
> >  int sysctl_oom_dump_tasks = 1;
> >  static DEFINE_SPINLOCK(zone_scan_lock);
> > -/* #define DEBUG */
> >  
> >  /*
> >   * Do all threads of the target process overlap our allowed nodes?
> > @@ -84,139 +85,72 @@ static struct task_struct *find_lock_task_mm(struct task_struct *p)
> >  }
> >  
> >  /**
> > - * badness - calculate a numeric value for how bad this task has been
> > + * oom_badness - heuristic function to determine which candidate task to kill
> >   * @p: task struct of which task we should calculate
> > - * @uptime: current uptime in seconds
> > + * @totalpages: total present RAM allowed for page allocation
> >   *
> > - * The formula used is relatively simple and documented inline in the
> > - * function. The main rationale is that we want to select a good task
> > - * to kill when we run out of memory.
> > - *
> > - * Good in this context means that:
> > - * 1) we lose the minimum amount of work done
> > - * 2) we recover a large amount of memory
> > - * 3) we don't kill anything innocent of eating tons of memory
> > - * 4) we want to kill the minimum amount of processes (one)
> > - * 5) we try to kill the process the user expects us to kill, this
> > - *    algorithm has been meticulously tuned to meet the principle
> > - *    of least surprise ... (be careful when you change it)
> > + * The heuristic for determining which task to kill is made to be as simple and
> > + * predictable as possible.  The goal is to return the highest value for the
> > + * task consuming the most memory to avoid subsequent oom failures.
> >   */
> > -
> > -unsigned long badness(struct task_struct *p, unsigned long uptime)
> > +unsigned int oom_badness(struct task_struct *p, unsigned long totalpages)
> >  {
> > -	unsigned long points, cpu_time, run_time;
> > -	struct task_struct *child;
> > -	struct task_struct *c, *t;
> > -	int oom_adj = p->signal->oom_adj;
> > -	struct task_cputime task_time;
> > -	unsigned long utime;
> > -	unsigned long stime;
> > -
> > -	if (oom_adj == OOM_DISABLE)
> > -		return 0;
> > +	int points;
> >  
> >  	p = find_lock_task_mm(p);
> >  	if (!p)
> >  		return 0;
> >  
> >  	/*
> > -	 * The memory size of the process is the basis for the badness.
> > -	 */
> > -	points = p->mm->total_vm;
> > -
> > -	/*
> > -	 * After this unlock we can no longer dereference local variable `mm'
> > -	 */
> > -	task_unlock(p);
> > -
> > -	/*
> > -	 * swapoff can easily use up all memory, so kill those first.
> > +	 * Shortcut check for OOM_SCORE_ADJ_MIN so the entire heuristic doesn't
> > +	 * need to be executed for something that cannot be killed.
> >  	 */
> > -	if (p->flags & PF_OOM_ORIGIN)
> > -		return ULONG_MAX;
> > -
> > -	/*
> > -	 * Processes which fork a lot of child processes are likely
> > -	 * a good choice. We add half the vmsize of the children if they
> > -	 * have an own mm. This prevents forking servers to flood the
> > -	 * machine with an endless amount of children. In case a single
> > -	 * child is eating the vast majority of memory, adding only half
> > -	 * to the parents will make the child our kill candidate of choice.
> > -	 */
> > -	t = p;
> > -	do {
> > -		list_for_each_entry(c, &t->children, sibling) {
> > -			child = find_lock_task_mm(c);
> > -			if (child) {
> > -				if (child->mm != p->mm)
> > -					points += child->mm->total_vm/2 + 1;
> > -				task_unlock(child);
> > -			}
> > -		}
> > -	} while_each_thread(p, t);
> > +	if (p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN) {
> > +		task_unlock(p);
> > +		return 0;
> > +	}
> >  
> >  	/*
> > -	 * CPU time is in tens of seconds and run time is in thousands
> > -         * of seconds. There is no particular reason for this other than
> > -         * that it turned out to work very well in practice.
> > +	 * When the PF_OOM_ORIGIN bit is set, it indicates the task should have
> > +	 * priority for oom killing.
> >  	 */
> > -	thread_group_cputime(p, &task_time);
> > -	utime = cputime_to_jiffies(task_time.utime);
> > -	stime = cputime_to_jiffies(task_time.stime);
> > -	cpu_time = (utime + stime) >> (SHIFT_HZ + 3);
> > -
> > -
> > -	if (uptime >= p->start_time.tv_sec)
> > -		run_time = (uptime - p->start_time.tv_sec) >> 10;
> > -	else
> > -		run_time = 0;
> > -
> > -	if (cpu_time)
> > -		points /= int_sqrt(cpu_time);
> > -	if (run_time)
> > -		points /= int_sqrt(int_sqrt(run_time));
> > +	if (p->flags & PF_OOM_ORIGIN) {
> > +		task_unlock(p);
> > +		return 1000;
> > +	}
> >  
> >  	/*
> > -	 * Niced processes are most likely less important, so double
> > -	 * their badness points.
> > +	 * The memory controller may have a limit of 0 bytes, so avoid a divide
> > +	 * by zero if necessary.
> >  	 */
> > -	if (task_nice(p) > 0)
> > -		points *= 2;
> 
> You removed 
>   - the run time check
>   - the cpu time check
>   - the nice check
> 
> but didn't describe the reason. Reviewers are puzzled. How can we
> review this when we don't get your point? Please write:
> 

The comment for oom_badness() reflects these changes: our goal is to make 
the heuristic as simple and _predictable_ as possible, we can't allow 
runtime and cputime, for example, to avoid freeing more memory by biasing 
against those tasks.  A long cputime does not indicate the importance of a 
task, nor does it avoid subsequent oom kills in the future because we've 
freed less memory by killing other tasks as a result.

>  - What benefit is there?

It's predictable and users understand exactly what the heuristic is.

>  - Why do you think there is no bad effect?

These heuristics seem to have been misplaced from the beginning and there 
was a _lot_ of desire to remove them dating back a couple years: we simply 
can't convert runtime or nice levels into potential for memory freeing.  
It's much better to have a sane and predictable heuristic that will react 
in similar circumstances to do exactly what the oom killer intends to do: 
oom kill a task that will free a large amount of memory to avoid 
subsequent failures that will result in an even greater amount of work.


^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: [patch 16/18] oom: badness heuristic rewrite
  2010-06-08 23:02     ` Andrew Morton
  2010-06-13 11:24       ` KOSAKI Motohiro
@ 2010-06-17  5:14       ` David Rientjes
  2010-06-21 11:45         ` KOSAKI Motohiro
  1 sibling, 1 reply; 104+ messages in thread
From: David Rientjes @ 2010-06-17  5:14 UTC (permalink / raw)
  To: Andrew Morton
  Cc: KOSAKI Motohiro, Rik van Riel, Nick Piggin, Oleg Nesterov,
	Balbir Singh, KAMEZAWA Hiroyuki, linux-mm

On Tue, 8 Jun 2010, Andrew Morton wrote:

> > > +	if (!totalpages)
> > > +		totalpages = 1;
> > >  
> > >  	/*
> > > -	 * Superuser processes are usually more important, so we make it
> > > -	 * less likely that we kill those.
> > > +	 * The baseline for the badness score is the proportion of RAM that each
> > > +	 * task's rss and swap space use.
> > >  	 */
> > > -	if (has_capability_noaudit(p, CAP_SYS_ADMIN) ||
> > > -	    has_capability_noaudit(p, CAP_SYS_RESOURCE))
> > > -		points /= 4;
> > > +	points = (get_mm_rss(p->mm) + get_mm_counter(p->mm, MM_SWAPENTS)) * 1000 /
> > > +			totalpages;
> > > +	task_unlock(p);
> > >  
> > >  	/*
> > > -	 * We don't want to kill a process with direct hardware access.
> > > -	 * Not only could that mess up the hardware, but usually users
> > > -	 * tend to only have this flag set on applications they think
> > > -	 * of as important.
> > > +	 * Root processes get 3% bonus, just like the __vm_enough_memory()
> > > +	 * implementation used by LSMs.
> > >  	 */
> > > -	if (has_capability_noaudit(p, CAP_SYS_RAWIO))
> > > -		points /= 4;
> > > +	if (has_capability_noaudit(p, CAP_SYS_ADMIN))
> > > +		points -= 30;
> > 
> > 
> > CAP_SYS_ADMIN seems like no good idea. CAP_SYS_ADMIN implies an
> > admin's interactive process, but killing an interactive process only
> > causes a forced logout, while killing a system daemon can make for a
> > far more catastrophic disaster.
> > 
> > Last of all, I'll pull this one, but only as a cherry-pick.
> > 
> 
> This change was unchangelogged, I don't know what it's for and I don't
> understand your comment about it.
> 

It was in the changelog (recall that the badness() function represents a 
proportion of available memory used by a task, so subtracting 30 is the 
equivalent of 3% of available memory):

Root tasks are given 3% extra memory just like __vm_enough_memory()
provides in LSMs.  In the event of two tasks consuming similar amounts of
memory, it is generally better to save root's task.


^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: [patch 16/18] oom: badness heuristic rewrite
  2010-06-08 22:58   ` Andrew Morton
@ 2010-06-17  5:32     ` David Rientjes
  0 siblings, 0 replies; 104+ messages in thread
From: David Rientjes @ 2010-06-17  5:32 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Rik van Riel, Nick Piggin, Oleg Nesterov, Balbir Singh,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, linux-mm

On Tue, 8 Jun 2010, Andrew Morton wrote:

> > This a complete rewrite of the oom killer's badness() heuristic which is
> > used to determine which task to kill in oom conditions.  The goal is to
> > make it as simple and predictable as possible so the results are better
> > understood and we end up killing the task which will lead to the most
> > memory freeing while still respecting the fine-tuning from userspace.
> 
It's not obvious from this description that the end result is better! 

I think it's fairly obvious that predictablility is an important part of 
any heuristic that will determine whether your task survives or dies.

> Have you any testcases or scenarios which got improved?
> 

Yes, as cited below in the changelog with the KDE example.

> > Instead of basing the heuristic on mm->total_vm for each task, the task's
> > rss and swap space is used instead.  This is a better indication of the
> > amount of memory that will be freeable if the oom killed task is chosen
> > and subsequently exits.
> 
> Again, why should we optimise for the amount of memory which a killing
> will yield (if that's what you mean).  We only need to free enough
> memory to unblock the oom condition then proceed.
> 

That's what the oom killer has always done simply because we want to avoid 
subsequent oom conditions in the near future that will require additional 
tasks to be killed.  It seems far better to kill a large memory-hogging 
task[*] than ten smaller tasks that total the same amount of memory usage.

 [*] And, with this rewrite, "memory-hogging" can be defined for the first
     time from userspace with a tunable, oom_score_adj, that actually has
     units so that within a cpuset, for example, we can bias a task by
     25% of available memory or bias other tasks against it by 25%.  For
     the first time ever, we can say "this task should be able to use 25%
     more memory than other tasks without getting killed first."
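As a rough sketch of the heuristic and its units (a standalone illustration, not the kernel code; the function name and clamping are simplifications):

```c
#include <assert.h>

/*
 * Illustrative sketch of the rewritten heuristic: a task's score is its
 * rss + swap in thousandths of the memory "allowable" to it, shifted by
 * the userspace bias oom_score_adj.
 */
static long badness_sketch(unsigned long rss_pages, unsigned long swap_pages,
                           unsigned long allowed_pages, int oom_score_adj)
{
        long points = (long)((rss_pages + swap_pages) * 1000 / allowed_pages);

        points += oom_score_adj;        /* +250 biases by 25% of allowed memory */
        return points > 0 ? points : 1; /* keep every eligible task comparable */
}
```

So a task holding a quarter of its allowed memory scores 250, and an oom_score_adj of -250 cancels that bias entirely.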

> The last thing we want to do is to kill a process which has consumed
> 1000 CPU hours, or which is providing some system-critical service or
> whatever.  Amount-of-memory-freeable is a relatively minor criterion.
> 

What would you suggest otherwise?  Cputime?  Then we may never be able to 
fork our bash shell or ssh into our machines.

> >  This helps specifically in cases where KDE or
> > GNOME is chosen for oom kill on desktop systems instead of a memory
> > hogging task.
> 
> It helps how?  Examples and test cases?
> 

Because KDE and GNOME typically have very large mm->total_vm values, but 
the resident memory in RAM is actually consumed by other tasks, even 
memory leakers.  mm->total_vm is agreed to be a very poor heuristic 
baseline by just about everyone.

> > The baseline for the heuristic is a proportion of memory that each task is
> > currently using in memory plus swap compared to the amount of "allowable"
> > memory.
> 
> What does "swap" mean?  swapspace includes swap-backed swapcache,
> un-swap-backed swapcache and non-resident swap.  Which of all these is
> being used here and for what reason?
> 

This is the swap cache: the number of swap entries for the task which 
would become freeable if the task is killed and could subsequently be used 
for the page allocations that triggered the oom killer.  We want to give 
the oom killer hints so that memory which cannot satisfy blockable memory 
allocations may be freed and we don't call into the oom killer again in 
the near future.

> > /proc/pid/oom_adj is changed so that its meaning is rescaled into the
> > units used by /proc/pid/oom_score_adj, and vice versa.  Changing one of
> > these per-task tunables will rescale the value of the other to an
> > equivalent meaning.  Although /proc/pid/oom_adj was originally defined as
> > a bitshift on the badness score, it now shares the same linear growth as
> > /proc/pid/oom_score_adj but with different granularity.  This is required
> > so the ABI is not broken with userspace applications and allows oom_adj to
> > be deprecated for future removal.
> 
> It was a mistake to add oom_adj in the first place.  Because it's a
> user-visible knob which us tied to a particular in-kernel
> implementation.  As we're seeing now, the presence of that knob locks
> us into a particular implementation.
> 

Agreed.

> Given that oom_score_adj is just a rescaled version of oom_adj
> (correct?), I guess things haven't got a lot worse on that front as a
> result of these changes.
> 

No, it's not a rescaled version at all; we merely rescale oom_adj into 
oom_score_adj units because everyone objected to removing oom_adj without 
deprecating it first.  oom_score_adj has units: a proportion of the memory 
available to the application, meaning how much of the system, memcg, 
cpuset, or mempolicy the task should be biased or favored by.  Please see 
the change to Documentation/filesystems/proc.txt, which explains this 
pretty elaborately.
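For concreteness, the rescaling can be sketched like this (the thread notes that oom_adj -17 maps to oom_score_adj -1000 and +15 to +1000; the helper below is illustrative, not copied from the patch):

```c
#include <assert.h>

/*
 * Sketch of the oom_adj -> oom_score_adj rescaling: endpoints are pinned
 * (-17/OOM_DISABLE maps to -1000, +15 maps to +1000) and values in
 * between scale linearly against the range bound.
 */
#define OOM_DISABLE             (-17)
#define OOM_ADJUST_MIN          (-16)
#define OOM_ADJUST_MAX          15
#define OOM_SCORE_ADJ_MIN       (-1000)
#define OOM_SCORE_ADJ_MAX       1000

static int oom_adj_to_oom_score_adj(int oom_adj)
{
        if (oom_adj == OOM_DISABLE)
                return OOM_SCORE_ADJ_MIN;
        if (oom_adj == OOM_ADJUST_MAX)
                return OOM_SCORE_ADJ_MAX;
        return oom_adj > 0 ?
                oom_adj * OOM_SCORE_ADJ_MAX / OOM_ADJUST_MAX :
                oom_adj * OOM_SCORE_ADJ_MIN / OOM_ADJUST_MIN;
}
```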

> General observation regarding the patch description: I'm not seeing a
> lot of reason for merging the patch!  What value does it bring to our
> users?  What problems got solved?
> 

It significantly improves the oom killer's predictability, it protects 
vital system tasks like KDE and GNOME on the desktop, it allows users to 
tune each task with a bias or preference in units they understand to 
affect its score, and it allows that interface to remain constant and 
valid even when those tasks are subsequently attached to a cgroup or bound 
to a mempolicy (or their limits or set of allowed nodes are changed).
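Mechanically, "remains constant and valid" means the score's normalization base follows the oom constraint while the task's oom_score_adj does not.  A sketch, with enum and parameter names assumed from the constraint types discussed in this thread:

```c
#include <assert.h>

/*
 * Sketch: the same oom_score_adj of -250 means "25% of the memcg limit"
 * under CONSTRAINT_MEMCG and "25% of RAM + swap" globally, because the
 * badness proportion is taken against a constraint-dependent base.
 */
enum oom_constraint {
        CONSTRAINT_NONE,
        CONSTRAINT_CPUSET,
        CONSTRAINT_MEMORY_POLICY,
        CONSTRAINT_MEMCG,
};

static unsigned long constrained_pages(enum oom_constraint c,
                                       unsigned long totalram_pages,
                                       unsigned long total_swap_pages,
                                       unsigned long constrained_node_pages,
                                       unsigned long memcg_limit_pages)
{
        switch (c) {
        case CONSTRAINT_MEMCG:
                return memcg_limit_pages;
        case CONSTRAINT_CPUSET:
        case CONSTRAINT_MEMORY_POLICY:
                return constrained_node_pages;
        default:
                return totalram_pages + total_swap_pages;
        }
}
```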

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org


* Re: [patch 18/18] oom: deprecate oom_adj tunable
  2010-06-17  3:36         ` David Rientjes
@ 2010-06-21 11:45           ` KOSAKI Motohiro
  2010-06-21 20:54             ` David Rientjes
  0 siblings, 1 reply; 104+ messages in thread
From: KOSAKI Motohiro @ 2010-06-21 11:45 UTC (permalink / raw)
  To: David Rientjes
  Cc: kosaki.motohiro, Andrew Morton, Rik van Riel, Nick Piggin,
	Oleg Nesterov, Balbir Singh, KAMEZAWA Hiroyuki, linux-mm

> On Sun, 13 Jun 2010, KOSAKI Motohiro wrote:
> 
> > But oom_score_adj has no benefit from the end-user's view.  That's the
> > problem.  Please consider making an end-user-friendly patch first.
> > 
> 
> Of course it does, it actually has units whereas oom_adj only grows or 
> shrinks the badness score exponentially.  oom_score_adj's units are well 
> understood: on a machine with 4G of memory, 250 means we're trying to 
> prejudice it by 1G of memory so that it can be used by other tasks, -250 
> means other tasks should be prejudiced by 1G in comparison to this task, 
> etc.  It's actually quite powerful.

And no real user wants such power.

Consider the desktop case: end-users don't set oom_adj themselves; their
applications do.  That means oom_adj now behaves like a syscall-style
system interface rather than a kernel knob.  Application developers also
don't need oom_score_adj, because they don't know the memory size of the
end-user's machine.

So you get the change's merit, but end users get the demerit.  That's out
of balance.




* Re: [patch 16/18] oom: badness heuristic rewrite
  2010-06-17  5:14       ` David Rientjes
@ 2010-06-21 11:45         ` KOSAKI Motohiro
  2010-06-21 20:47           ` David Rientjes
  0 siblings, 1 reply; 104+ messages in thread
From: KOSAKI Motohiro @ 2010-06-21 11:45 UTC (permalink / raw)
  To: David Rientjes
  Cc: kosaki.motohiro, Andrew Morton, Rik van Riel, Nick Piggin,
	Oleg Nesterov, Balbir Singh, KAMEZAWA Hiroyuki, linux-mm

> > This change was unchangelogged, I don't know what it's for and I don't
> > understand your comment about it.
> > 
> 
> It was in the changelog (recall that the badness() function represents a 
> proportion of available memory used by a task, so subtracting 30 is the 
> equivalent of 3% of available memory):
> 
> Root tasks are given 3% extra memory just like __vm_enough_memory()
> provides in LSMs.  In the event of two tasks consuming similar amounts of
> memory, it is generally better to save root's task.

LSMs have an obvious reason to prioritize the admin's operations over
root-privileged daemons: otherwise, admins can't recover from trouble.

But in this case, why do we need to prioritize an admin shell over daemons?




* Re: [patch 16/18] oom: badness heuristic rewrite
  2010-06-17  5:12     ` David Rientjes
@ 2010-06-21 11:45       ` KOSAKI Motohiro
  0 siblings, 0 replies; 104+ messages in thread
From: KOSAKI Motohiro @ 2010-06-21 11:45 UTC (permalink / raw)
  To: David Rientjes
  Cc: kosaki.motohiro, Andrew Morton, Rik van Riel, Nick Piggin,
	Oleg Nesterov, Balbir Singh, KAMEZAWA Hiroyuki, linux-mm

> > Sorry I can't ack this. again and again, I try to explain why this is wrong
> > (hopefully last)
> > 
> > 1) incompatibility
> >    oom_score is one of ABI. then, we can't change this. from enduser view,
> >    this change is no merit. In general, an incompatibility is allowed on very
> >    limited situation such as that an end-user get much benefit than compatibility.
> >    In other word, old style ABI doesn't works fine from end user view.
> >    But, in this case, it isn't.
> > 
> 
> There is no incompatibility here, /proc/pid/oom_score has no meaningful 
> units because of the old heuristic.  The _only_ thing it represents is a 
> score in comparison with other eligible tasks to decide which task to 
> kill.  Thus, oom_score by itself means nothing if not compared to other 
> eligible tasks.
> 
> Although deprecated, /proc/pid/oom_adj still changes 
> /proc/pid/oom_score_adj with a different scale (-17 maps to -1000 and +15 
> maps to +1000), so there is absolutely no userspace incompatibility with 
> this change.

I sympathize with your burden.  Yes, oom_adj sucks.

But it is still an ABI; we (kernel developers) can't declare it
meaningless.  Its meaning is defined by the userland folks.

If you want to change the world, you need to discuss it with the userland
folks.

> 
> > 2) technically incorrect
> >    this math is not correct; it does not represent "allowed memory".
> >    For example: 1) mlocked memory is not accumulated, yet it can be freed
> >    by killing the task; 2) whether SHM_LOCKED memory is freeable depends
> >    on whether IPC_RMID was done.  If not, killing the task doesn't free
> >    the SysV IPC memory.
> 
> Ah, very good point.  We should be using totalram_pages + total_swap_pages 
> here to represent global normalization, memcg limit for CONSTRAINT_MEMCG, 
> and a total of node_spanned_pages for mempolicy nodes or cpuset mems for 
> CONSTRAINT_MEMORY_POLICY and CONSTRAINT_CPUSET, respectively.  I'll make 
> that switch in the next revision, thanks!

I can't understand.  What problem does this solve?

> 
> >    In addition, 3) this normalization doesn't work on asymmetric NUMA;
> >    total pages and oom are mostly unrelated.
> 
> What this does is represents the heuristic baseline, rss and swap, as a 
> proportion depending on the type of oom constraint.  This works when 
> comparing eligible tasks amongst each other because the task with the 
> highest rss and swap is the one we (normally) want to kill, minus the 3% 
> privilege given to root and outside influence of /proc/pid/oom_score_adj.
> 
> We want to represent this as a proportion and not as a sheer value simply 
> because the task may be attached to a cpuset, a memcg, or bound to a 
> mempolicy out from under the task's knowledge.  That is, we compare tasks 
> sharing the same constraint for oom kill and normalize the heuristic based 
> on that.  We don't want to expose a userspace interface that takes memory 
> quantities directly since the task may be bound to a mempolicy, for 
> instance, later and the oom_score_adj is then rendered obsolete.

I can't understand.  Do you mean you're suggesting we ignore this issue?
I feel you're talking about an unrelated thing.

Plus, the fact is, if you think "we don't want to expose a userspace
interface that takes memory quantities directly", that was already done 5
years ago.  Your proposal is 5 years too late.  (Look at Andrea's work.)


> > 4) scalability. if the 
> >    system 10TB memory, 1 point oom score mean 10GB memory consumption.
> 
> Well, sure, a 10TB system would have a large granularity such as that :)  
> But in such cases we don't necessarily care if one task is using 5GB more 
> than another task using 1TB, for example.

Probably not.

Consider the common DB server workload: the DB process consumes almost all
of memory, but it is OOM_DISABLEd.  OOM victims are typically selected
from some assistant JVM process.

So I don't think this is a good idea.  Instead, enhancing the memcg oom
notification looks promising.

The other pieces of this patch look more promising than this one; please
resend them (with test results, of course).




* Re: [patch 16/18] oom: badness heuristic rewrite
  2010-06-21 11:45         ` KOSAKI Motohiro
@ 2010-06-21 20:47           ` David Rientjes
  2010-06-30  9:26             ` KOSAKI Motohiro
  0 siblings, 1 reply; 104+ messages in thread
From: David Rientjes @ 2010-06-21 20:47 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Andrew Morton, Rik van Riel, Nick Piggin, Oleg Nesterov,
	Balbir Singh, KAMEZAWA Hiroyuki, linux-mm

On Mon, 21 Jun 2010, KOSAKI Motohiro wrote:

> > It was in the changelog (recall that the badness() function represents a 
> > proportion of available memory used by a task, so subtracting 30 is the 
> > equivalent of 3% of available memory):
> > 
> > Root tasks are given 3% extra memory just like __vm_enough_memory()
> > provides in LSMs.  In the event of two tasks consuming similar amounts of
> > memory, it is generally better to save root's task.
> 
> LSMs have an obvious reason to prioritize the admin's operations over
> root-privileged daemons: otherwise, admins can't recover from trouble.
> 
> But in this case, why do we need to prioritize an admin shell over daemons?
> 

For the same reason.  We want to slightly bias admin shells and their 
processes against being oom killed because they are typically in the 
business of administering the machine and resolving issues that arise.  It 
would be irresponsible to give them the same killing preference as user 
tasks in the case of a tie.
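A sketch of the arithmetic (the helper name and capability parameter are illustrative, not the patch's code):

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Since the badness score is on a thousandths-of-allowed-memory scale,
 * subtracting 30 points discounts root tasks by the same 3% that
 * __vm_enough_memory() reserves in LSMs.
 */
static long apply_root_bonus(long points, bool is_root)
{
        if (is_root)
                points -= 30;           /* 30/1000 == 3% of allowed memory */
        return points > 0 ? points : 1;
}
```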


* Re: [patch 18/18] oom: deprecate oom_adj tunable
  2010-06-21 11:45           ` KOSAKI Motohiro
@ 2010-06-21 20:54             ` David Rientjes
  0 siblings, 0 replies; 104+ messages in thread
From: David Rientjes @ 2010-06-21 20:54 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Andrew Morton, Rik van Riel, Nick Piggin, Oleg Nesterov,
	Balbir Singh, KAMEZAWA Hiroyuki, linux-mm

On Mon, 21 Jun 2010, KOSAKI Motohiro wrote:

> > Of course it does, it actually has units whereas oom_adj only grows or 
> > shrinks the badness score exponentially.  oom_score_adj's units are well 
> > understood: on a machine with 4G of memory, 250 means we're trying to 
> > prejudice it by 1G of memory so that it can be used by other tasks, -250 
> > means other tasks should be prejudiced by 1G in comparison to this task, 
> > etc.  It's actually quite powerful.
> 
> And, no real user want such power.
> 

Google does, and I imagine other users will want to be able to normalize 
each task's memory usage against the others.  It's perfectly legitimate 
for one task to consume 3G while another consumes 1G and want to select 
the 1G task to kill.  Setting the 3G task's oom_score_adj value in this 
case to be -250, for example, depending on the memory capacity of the 
machine, makes much more sense than influencing it as a bitshift on 
top of a vastly unpredictable heuristic with oom_adj.  This seems rather 
trivial to understand.
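The arithmetic behind that -250 example, as a sketch (assuming oom_score_adj is in thousandths of the task's allowed memory; the helper name is hypothetical, not a kernel API):

```c
#include <assert.h>

/* How much memory a given oom_score_adj biases a task by. */
static long long oom_bias_bytes(int oom_score_adj,
                                unsigned long long allowed_bytes)
{
        return oom_score_adj * (long long)allowed_bytes / 1000;
}
```

On a 4G machine, -250 works out to -1G; give the same task an 8G memcg limit and the identical setting now means -2G.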

> Consider the desktop case: end-users don't set oom_adj themselves; their
> applications do.  That means oom_adj now behaves like a syscall-style
> system interface rather than a kernel knob.  Application developers also
> don't need oom_score_adj, because they don't know the memory size of the
> end-user's machine.
> 

I agree, oom_score_adj isn't targeted to the desktop nor is it targeted to 
application developers (unless they are setting it to OOM_SCORE_ADJ_MIN to 
disable oom killing for that task, for example).  It's targeted at 
sysadmins and daemons that partition a machine to run a number of 
concurrent jobs.  It's fine to use memcg, for example, to do such 
partitioning, but memcg can also cause oom conditions within the cgroup.  We 
want to be able to tell the kernel, through an interface such as this, 
that one task shouldn't be killed because it's expected to use 3G of memory 
but should be killed when it's using 8G, for example.


* Re: [patch 16/18] oom: badness heuristic rewrite
  2010-06-21 20:47           ` David Rientjes
@ 2010-06-30  9:26             ` KOSAKI Motohiro
  0 siblings, 0 replies; 104+ messages in thread
From: KOSAKI Motohiro @ 2010-06-30  9:26 UTC (permalink / raw)
  To: David Rientjes
  Cc: kosaki.motohiro, Andrew Morton, Rik van Riel, Nick Piggin,
	Oleg Nesterov, Balbir Singh, KAMEZAWA Hiroyuki, linux-mm

> On Mon, 21 Jun 2010, KOSAKI Motohiro wrote:
> 
> > > It was in the changelog (recall that the badness() function represents a 
> > > proportion of available memory used by a task, so subtracting 30 is the 
> > > equivalent of 3% of available memory):
> > > 
> > > Root tasks are given 3% extra memory just like __vm_enough_memory()
> > > provides in LSMs.  In the event of two tasks consuming similar amounts of
> > > memory, it is generally better to save root's task.
> > 
> > LSMs have an obvious reason to prioritize the admin's operations over
> > root-privileged daemons: otherwise, admins can't recover from trouble.
> > 
> > But in this case, why do we need to prioritize an admin shell over daemons?
> > 
> 
> For the same reason.  We want to slightly bias admin shells and their 
> processes against being oom killed because they are typically in the 
> business of administering the machine and resolving issues that arise.  It 
> would be irresponsible to give them the same killing preference as user 
> tasks in the case of a tie.

Not the same.  An administrator can freely log in again.  Typically,
killing a login process also kills the other processes in the same
session, which by then hold a lot of memory.  In the few remaining cases,
they can press SysRq+F as a last resort.

On the other hand, a system daemon crash can crash the whole system.



* Re: [patch 07/18] oom: filter tasks not sharing the same cpuset
  2010-06-13 11:24         ` KOSAKI Motohiro
@ 2010-07-02 22:35           ` Andrew Morton
  2010-07-04 22:08             ` David Rientjes
  2010-07-09  3:00             ` KOSAKI Motohiro
  0 siblings, 2 replies; 104+ messages in thread
From: Andrew Morton @ 2010-07-02 22:35 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: David Rientjes, Rik van Riel, Nick Piggin, Oleg Nesterov,
	Balbir Singh, KAMEZAWA Hiroyuki, linux-mm

On Sun, 13 Jun 2010 20:24:55 +0900 (JST)
KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote:

> Sorry for the delay.
> 
> > On Tue, 8 Jun 2010 11:51:32 -0700 (PDT)
> > David Rientjes <rientjes@google.com> wrote:
> > 
> > > Andrew, are you the maintainer for these fixes or is KOSAKI?
> > 
> > I am, thanks.  Kosaki-san, you're making this harder than it should be.
> > Please either ack David's patches or promptly work with him on
> > finalising them.
> 
> Thanks, Andrew, David.  I agree with you.  I don't find any end-user harm
> or regressions in David's latest patch series, so I'm glad to join his work.

whew ;)

> Unfortunately, I don't have enough time now, so I expect my next review
> won't be soon, but I promise I'll do it.

So where do we go from here?  I have about 12,000 oom-killer related
emails saved up in my todo folder, ready for me to read next time I
have an oom-killer session.

What would happen if I just deleted them all?


* Re: [patch 07/18] oom: filter tasks not sharing the same cpuset
  2010-07-02 22:35           ` Andrew Morton
@ 2010-07-04 22:08             ` David Rientjes
  2010-07-09  3:00             ` KOSAKI Motohiro
  1 sibling, 0 replies; 104+ messages in thread
From: David Rientjes @ 2010-07-04 22:08 UTC (permalink / raw)
  To: Andrew Morton
  Cc: KOSAKI Motohiro, Rik van Riel, Nick Piggin, Oleg Nesterov,
	Balbir Singh, KAMEZAWA Hiroyuki, linux-mm

On Fri, 2 Jul 2010, Andrew Morton wrote:

> So where do we go from here?  I have about 12,000 oom-killer related
> emails saved up in my todo folder, ready for me to read next time I
> have an oom-killer session.
> 

I'll be proposing my second revision of the badness heuristic rewrite in 
the next couple of days.  That said, I don't know of any other outstanding 
patches that haven't yet been merged.


* Re: [patch 07/18] oom: filter tasks not sharing the same cpuset
  2010-07-02 22:35           ` Andrew Morton
  2010-07-04 22:08             ` David Rientjes
@ 2010-07-09  3:00             ` KOSAKI Motohiro
  1 sibling, 0 replies; 104+ messages in thread
From: KOSAKI Motohiro @ 2010-07-09  3:00 UTC (permalink / raw)
  To: Andrew Morton
  Cc: kosaki.motohiro, David Rientjes, Rik van Riel, Nick Piggin,
	Oleg Nesterov, Balbir Singh, KAMEZAWA Hiroyuki, linux-mm

> > Unfortunately, I don't have enough time now, so I expect my next review
> > won't be soon, but I promise I'll do it.
> 
> So where do we go from here?  I have about 12,000 oom-killer related
> emails saved up in my todo folder, ready for me to read next time I
> have an oom-killer session.

At least, all the deadlock issues should be fixed.  I don't know whether
Michel's problem is still there.  Plus, I think all the desktop-related
issues should also be fixed.

But I'm not eager to include domain-specific OOM tendencies.  That should
be handled by a userland callback and a userland daemon, because any
usecase-specific change can be seen as a regression by people with another
usecase.

About David's patch, I don't know.  He didn't explain which change his
patch makes.  If he explains its worth and anybody agrees, it can be
merged.  But otherwise.....



> What would happen if I just deleted them all?

Probably no problem.





end of thread, other threads:[~2010-07-09  3:00 UTC | newest]

Thread overview: 104+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-06-06 22:33 [patch 00/18] oom killer rewrite David Rientjes
2010-06-06 22:34 ` [patch 01/18] oom: check PF_KTHREAD instead of !mm to skip kthreads David Rientjes
2010-06-07 12:12   ` Balbir Singh
2010-06-07 19:50     ` David Rientjes
2010-06-08 19:33   ` Andrew Morton
2010-06-08 23:40     ` David Rientjes
2010-06-08 23:52       ` Andrew Morton
2010-06-06 22:34 ` [patch 02/18] oom: introduce find_lock_task_mm() to fix !mm false positives David Rientjes
2010-06-07 12:58   ` Balbir Singh
2010-06-07 13:49     ` Minchan Kim
2010-06-07 19:49       ` David Rientjes
2010-06-08 19:42   ` Andrew Morton
2010-06-08 20:14     ` Oleg Nesterov
2010-06-08 20:17       ` Oleg Nesterov
2010-06-08 21:34         ` Andrew Morton
2010-06-08 23:50     ` David Rientjes
2010-06-06 22:34 ` [patch 03/18] oom: dump_tasks use find_lock_task_mm too David Rientjes
2010-06-08 19:55   ` Andrew Morton
2010-06-09  0:06     ` David Rientjes
2010-06-06 22:34 ` [patch 04/18] oom: PF_EXITING check should take mm into account David Rientjes
2010-06-08 20:00   ` Andrew Morton
2010-06-06 22:34 ` [patch 05/18] oom: give current access to memory reserves if it has been killed David Rientjes
2010-06-08 11:41   ` KOSAKI Motohiro
2010-06-08 18:47     ` David Rientjes
2010-06-14 11:08       ` KOSAKI Motohiro
2010-06-08 20:12     ` Andrew Morton
2010-06-13 11:24       ` KOSAKI Motohiro
2010-06-08 20:08   ` Andrew Morton
2010-06-09  0:14     ` David Rientjes
2010-06-06 22:34 ` [patch 06/18] oom: avoid sending exiting tasks a SIGKILL David Rientjes
2010-06-08 11:41   ` KOSAKI Motohiro
2010-06-08 18:48     ` David Rientjes
2010-06-08 20:17   ` Andrew Morton
2010-06-08 20:26   ` Oleg Nesterov
2010-06-09  6:32     ` David Rientjes
2010-06-09 16:25       ` Oleg Nesterov
2010-06-09 19:44         ` David Rientjes
2010-06-09 20:14           ` Oleg Nesterov
2010-06-10  0:15             ` KAMEZAWA Hiroyuki
2010-06-10  1:21               ` Oleg Nesterov
2010-06-10  1:43                 ` KAMEZAWA Hiroyuki
2010-06-10  1:51                   ` Oleg Nesterov
2010-06-06 22:34 ` [patch 07/18] oom: filter tasks not sharing the same cpuset David Rientjes
2010-06-08 11:41   ` KOSAKI Motohiro
2010-06-08 18:51     ` David Rientjes
2010-06-08 19:27       ` Andrew Morton
2010-06-13 11:24         ` KOSAKI Motohiro
2010-07-02 22:35           ` Andrew Morton
2010-07-04 22:08             ` David Rientjes
2010-07-09  3:00             ` KOSAKI Motohiro
2010-06-08 20:23   ` Andrew Morton
2010-06-09  0:25     ` David Rientjes
2010-06-06 22:34 ` [patch 08/18] oom: sacrifice child with highest badness score for parent David Rientjes
2010-06-08 11:41   ` KOSAKI Motohiro
2010-06-08 18:53     ` David Rientjes
2010-06-08 20:33   ` Andrew Morton
2010-06-09  0:30     ` David Rientjes
2010-06-06 22:34 ` [patch 09/18] oom: select task from tasklist for mempolicy ooms David Rientjes
2010-06-08 11:41   ` KOSAKI Motohiro
2010-06-08 21:08   ` Andrew Morton
2010-06-08 21:17     ` Oleg Nesterov
2010-06-09  0:46     ` David Rientjes
2010-06-08 23:43   ` Andrew Morton
2010-06-09  0:40     ` David Rientjes
2010-06-06 22:34 ` [patch 10/18] oom: enable oom tasklist dump by default David Rientjes
2010-06-08 11:42   ` KOSAKI Motohiro
2010-06-08 18:56     ` David Rientjes
2010-06-08 21:13   ` Andrew Morton
2010-06-09  0:52     ` David Rientjes
2010-06-06 22:34 ` [patch 11/18] oom: avoid oom killer for lowmem allocations David Rientjes
2010-06-08 11:42   ` KOSAKI Motohiro
2010-06-08 21:19   ` Andrew Morton
2010-06-06 22:34 ` [patch 12/18] oom: extract panic helper function David Rientjes
2010-06-08 11:42   ` KOSAKI Motohiro
2010-06-06 22:34 ` [patch 13/18] oom: remove special handling for pagefault ooms David Rientjes
2010-06-08 11:42   ` KOSAKI Motohiro
2010-06-08 18:57     ` David Rientjes
2010-06-08 21:27   ` Andrew Morton
2010-06-06 22:34 ` [patch 14/18] oom: move sysctl declarations to oom.h David Rientjes
2010-06-08 11:42   ` KOSAKI Motohiro
2010-06-06 22:34 ` [patch 15/18] oom: remove unnecessary code and cleanup David Rientjes
2010-06-06 22:34 ` [patch 16/18] oom: badness heuristic rewrite David Rientjes
2010-06-08 11:41   ` KOSAKI Motohiro
2010-06-08 23:02     ` Andrew Morton
2010-06-13 11:24       ` KOSAKI Motohiro
2010-06-17  5:14       ` David Rientjes
2010-06-21 11:45         ` KOSAKI Motohiro
2010-06-21 20:47           ` David Rientjes
2010-06-30  9:26             ` KOSAKI Motohiro
2010-06-17  5:12     ` David Rientjes
2010-06-21 11:45       ` KOSAKI Motohiro
2010-06-08 22:58   ` Andrew Morton
2010-06-17  5:32     ` David Rientjes
2010-06-06 22:34 ` [patch 17/18] oom: add forkbomb penalty to badness heuristic David Rientjes
2010-06-08 11:41   ` KOSAKI Motohiro
2010-06-08 23:15   ` Andrew Morton
2010-06-06 22:35 ` [patch 18/18] oom: deprecate oom_adj tunable David Rientjes
2010-06-08 11:42   ` KOSAKI Motohiro
2010-06-08 19:00     ` David Rientjes
2010-06-08 23:18     ` Andrew Morton
2010-06-13 11:24       ` KOSAKI Motohiro
2010-06-17  3:36         ` David Rientjes
2010-06-21 11:45           ` KOSAKI Motohiro
2010-06-21 20:54             ` David Rientjes
