[patch 01/11 -mmotm] lowmemorykiller: Only iterate over process list when needed.

All of lore.kernel.org
 help / color / mirror / Atom feed

* [patch 01/11 -mmotm] lowmemorykiller: Only iterate over process list when needed.
@ 2009-05-10 22:07 David Rientjes
  2009-05-10 22:07 ` [patch 02/11 -mmotm] lowmemorykiller: Don't count free space unless it meets the specified limit by itself David Rientjes
                   ` (10 more replies)
  0 siblings, 11 replies; 95+ messages in thread
From: David Rientjes @ 2009-05-10 22:07 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Greg Kroah-Hartman, Nick Piggin, Mel Gorman, Peter Ziljstra,
	Christoph Lameter, Dave Hansen, San Mehat,
	Arve Hjønnevåg, linux-kernel

[-- Attachment #1: Type: TEXT/PLAIN, Size: 2249 bytes --]

From: Arve Hjønnevåg <arve@android.com>

Use NR_ACTIVE plus NR_INACTIVE as a size estimate for our fake cache
instead the sum of rss. Neither method is accurate.

Also skip the process scan, if the amount of memory available is above
the largest threshold set.

Signed-off-by: Arve Hjønnevåg <arve@android.com>
---
 drivers/staging/android/lowmemorykiller.c |   35 +++++++++++++++++-----------
 1 files changed, 21 insertions(+), 14 deletions(-)

diff --git a/drivers/staging/android/lowmemorykiller.c b/drivers/staging/android/lowmemorykiller.c
--- a/drivers/staging/android/lowmemorykiller.c
+++ b/drivers/staging/android/lowmemorykiller.c
@@ -71,23 +71,30 @@ static int lowmem_shrink(int nr_to_scan, gfp_t gfp_mask)
 	}
 	if(nr_to_scan > 0)
 		lowmem_print(3, "lowmem_shrink %d, %x, ofree %d, ma %d\n", nr_to_scan, gfp_mask, other_free, min_adj);
+	rem = global_page_state(NR_ACTIVE) + global_page_state(NR_INACTIVE);
+	if (nr_to_scan <= 0 || min_adj == OOM_ADJUST_MAX + 1) {
+		lowmem_print(5, "lowmem_shrink %d, %x, return %d\n", nr_to_scan, gfp_mask, rem);
+		return rem;
+	}
+
 	read_lock(&tasklist_lock);
 	for_each_process(p) {
-		if(p->oomkilladj >= 0 && p->mm) {
-			tasksize = get_mm_rss(p->mm);
-			if(nr_to_scan > 0 && tasksize > 0 && p->oomkilladj >= min_adj) {
-				if(selected == NULL ||
-				   p->oomkilladj > selected->oomkilladj ||
-				   (p->oomkilladj == selected->oomkilladj &&
-				    tasksize > selected_tasksize)) {
-					selected = p;
-					selected_tasksize = tasksize;
-					lowmem_print(2, "select %d (%s), adj %d, size %d, to kill\n",
-					             p->pid, p->comm, p->oomkilladj, tasksize);
-				}
-			}
-			rem += tasksize;
+		if (p->oomkilladj < min_adj || !p->mm)
+			continue;
+		tasksize = get_mm_rss(p->mm);
+		if (tasksize <= 0)
+			continue;
+		if (selected) {
+			if (p->oomkilladj < selected->oomkilladj)
+				continue;
+			if (p->oomkilladj == selected->oomkilladj &&
+			    tasksize <= selected_tasksize)
+				continue;
 		}
+		selected = p;
+		selected_tasksize = tasksize;
+		lowmem_print(2, "select %d (%s), adj %d, size %d, to kill\n",
+		             p->pid, p->comm, p->oomkilladj, tasksize);
 	}
 	if(selected != NULL) {
 		lowmem_print(1, "send sigkill to %d (%s), adj %d, size %d\n",

^ permalink raw reply	[flat|nested] 95+ messages in thread

* [patch 02/11 -mmotm] lowmemorykiller: Don't count free space unless it meets the specified limit by itself
  2009-05-10 22:07 [patch 01/11 -mmotm] lowmemorykiller: Only iterate over process list when needed David Rientjes
@ 2009-05-10 22:07 ` David Rientjes
  2009-05-12  9:23   ` Mel Gorman
  2009-05-10 22:07 ` [patch 03/11 -mmotm] oom: cleanup android low memory killer David Rientjes
                   ` (9 subsequent siblings)
  10 siblings, 1 reply; 95+ messages in thread
From: David Rientjes @ 2009-05-10 22:07 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Greg Kroah-Hartman, Nick Piggin, Mel Gorman, Peter Ziljstra,
	Christoph Lameter, Dave Hansen, San Mehat,
	Arve Hjønnevåg, linux-kernel

[-- Attachment #1: Type: TEXT/PLAIN, Size: 1897 bytes --]

From: Arve Hjønnevåg <arve@android.com>

This allows processes to be killed when the kernel evict cache pages in
an attempt to get more contiguous free memory.

Signed-off-by: Arve Hjønnevåg <arve@android.com>
---
 drivers/staging/android/lowmemorykiller.c |   13 +++++++++----
 1 files changed, 9 insertions(+), 4 deletions(-)

diff --git a/drivers/staging/android/lowmemorykiller.c b/drivers/staging/android/lowmemorykiller.c
--- a/drivers/staging/android/lowmemorykiller.c
+++ b/drivers/staging/android/lowmemorykiller.c
@@ -58,20 +58,25 @@ static int lowmem_shrink(int nr_to_scan, gfp_t gfp_mask)
 	int min_adj = OOM_ADJUST_MAX + 1;
 	int selected_tasksize = 0;
 	int array_size = ARRAY_SIZE(lowmem_adj);
-	int other_free = global_page_state(NR_FREE_PAGES) + global_page_state(NR_FILE_PAGES);
+	int other_free = global_page_state(NR_FREE_PAGES);
+	int other_file = global_page_state(NR_FILE_PAGES);
 	if(lowmem_adj_size < array_size)
 		array_size = lowmem_adj_size;
 	if(lowmem_minfree_size < array_size)
 		array_size = lowmem_minfree_size;
 	for(i = 0; i < array_size; i++) {
-		if(other_free < lowmem_minfree[i]) {
+		if (other_free < lowmem_minfree[i] &&
+		    other_file < lowmem_minfree[i]) {
 			min_adj = lowmem_adj[i];
 			break;
 		}
 	}
 	if(nr_to_scan > 0)
-		lowmem_print(3, "lowmem_shrink %d, %x, ofree %d, ma %d\n", nr_to_scan, gfp_mask, other_free, min_adj);
-	rem = global_page_state(NR_ACTIVE) + global_page_state(NR_INACTIVE);
+		lowmem_print(3, "lowmem_shrink %d, %x, ofree %d %d, ma %d\n", nr_to_scan, gfp_mask, other_free, other_file, min_adj);
+	rem = global_page_state(NR_ACTIVE_ANON) +
+		global_page_state(NR_ACTIVE_FILE) +
+		global_page_state(NR_INACTIVE_ANON) +
+		global_page_state(NR_INACTIVE_FILE);
 	if (nr_to_scan <= 0 || min_adj == OOM_ADJUST_MAX + 1) {
 		lowmem_print(5, "lowmem_shrink %d, %x, return %d\n", nr_to_scan, gfp_mask, rem);
 		return rem;

^ permalink raw reply	[flat|nested] 95+ messages in thread

* [patch 03/11 -mmotm] oom: cleanup android low memory killer
  2009-05-10 22:07 [patch 01/11 -mmotm] lowmemorykiller: Only iterate over process list when needed David Rientjes
  2009-05-10 22:07 ` [patch 02/11 -mmotm] lowmemorykiller: Don't count free space unless it meets the specified limit by itself David Rientjes
@ 2009-05-10 22:07 ` David Rientjes
  2009-05-10 22:07 ` [patch 04/11 -mmotm] oom: fix possible android low memory killer NULL pointer David Rientjes
                   ` (8 subsequent siblings)
  10 siblings, 0 replies; 95+ messages in thread
From: David Rientjes @ 2009-05-10 22:07 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Greg Kroah-Hartman, Nick Piggin, Mel Gorman, Peter Ziljstra,
	Christoph Lameter, Dave Hansen, San Mehat,
	Arve Hjønnevåg, linux-kernel

Clean up the code in lowmem_shrink() for the Android low memory killer so
that it follows the kernel coding style.

It's unnecessary to check for p->oomkilladj >= min_adj if the selected
task's oomkilladj score is stored since get_mm_rss() will always be
greater than zero.

Cc: San Mehat <san@android.com>
Signed-off-by: David Rientjes <rientjes@google.com>
---
 drivers/staging/android/lowmemorykiller.c |   39 +++++++++++++++++++---------
 1 files changed, 26 insertions(+), 13 deletions(-)

diff --git a/drivers/staging/android/lowmemorykiller.c b/drivers/staging/android/lowmemorykiller.c
--- a/drivers/staging/android/lowmemorykiller.c
+++ b/drivers/staging/android/lowmemorykiller.c
@@ -57,58 +57,71 @@ static int lowmem_shrink(int nr_to_scan, gfp_t gfp_mask)
 	int i;
 	int min_adj = OOM_ADJUST_MAX + 1;
 	int selected_tasksize = 0;
+	int selected_oom_adj;
 	int array_size = ARRAY_SIZE(lowmem_adj);
 	int other_free = global_page_state(NR_FREE_PAGES);
 	int other_file = global_page_state(NR_FILE_PAGES);
-	if(lowmem_adj_size < array_size)
+
+	if (lowmem_adj_size < array_size)
 		array_size = lowmem_adj_size;
-	if(lowmem_minfree_size < array_size)
+	if (lowmem_minfree_size < array_size)
 		array_size = lowmem_minfree_size;
-	for(i = 0; i < array_size; i++) {
+	for (i = 0; i < array_size; i++) {
 		if (other_free < lowmem_minfree[i] &&
 		    other_file < lowmem_minfree[i]) {
 			min_adj = lowmem_adj[i];
 			break;
 		}
 	}
-	if(nr_to_scan > 0)
-		lowmem_print(3, "lowmem_shrink %d, %x, ofree %d %d, ma %d\n", nr_to_scan, gfp_mask, other_free, other_file, min_adj);
+	if (nr_to_scan > 0)
+		lowmem_print(3, "lowmem_shrink %d, %x, ofree %d %d, ma %d\n",
+			     nr_to_scan, gfp_mask, other_free, other_file,
+			     min_adj);
 	rem = global_page_state(NR_ACTIVE_ANON) +
 		global_page_state(NR_ACTIVE_FILE) +
 		global_page_state(NR_INACTIVE_ANON) +
 		global_page_state(NR_INACTIVE_FILE);
 	if (nr_to_scan <= 0 || min_adj == OOM_ADJUST_MAX + 1) {
-		lowmem_print(5, "lowmem_shrink %d, %x, return %d\n", nr_to_scan, gfp_mask, rem);
+		lowmem_print(5, "lowmem_shrink %d, %x, return %d\n",
+			     nr_to_scan, gfp_mask, rem);
 		return rem;
 	}
+	selected_oom_adj = min_adj;
 
 	read_lock(&tasklist_lock);
 	for_each_process(p) {
-		if (p->oomkilladj < min_adj || !p->mm)
+		int oom_adj;
+
+		if (!p->mm)
+			continue;
+		oom_adj = p->oomkilladj;
+		if (oom_adj < min_adj)
 			continue;
 		tasksize = get_mm_rss(p->mm);
 		if (tasksize <= 0)
 			continue;
 		if (selected) {
-			if (p->oomkilladj < selected->oomkilladj)
+			if (oom_adj < selected_oom_adj)
 				continue;
-			if (p->oomkilladj == selected->oomkilladj &&
+			if (oom_adj == selected_oom_adj &&
 			    tasksize <= selected_tasksize)
 				continue;
 		}
 		selected = p;
 		selected_tasksize = tasksize;
+		selected_oom_adj = oom_adj;
 		lowmem_print(2, "select %d (%s), adj %d, size %d, to kill\n",
-		             p->pid, p->comm, p->oomkilladj, tasksize);
+		             p->pid, p->comm, oom_adj, tasksize);
 	}
-	if(selected != NULL) {
+	if (selected) {
 		lowmem_print(1, "send sigkill to %d (%s), adj %d, size %d\n",
 		             selected->pid, selected->comm,
-		             selected->oomkilladj, selected_tasksize);
+		             selected_oom_adj, selected_tasksize);
 		force_sig(SIGKILL, selected);
 		rem -= selected_tasksize;
 	}
-	lowmem_print(4, "lowmem_shrink %d, %x, return %d\n", nr_to_scan, gfp_mask, rem);
+	lowmem_print(4, "lowmem_shrink %d, %x, return %d\n",
+		     nr_to_scan, gfp_mask, rem);
 	read_unlock(&tasklist_lock);
 	return rem;
 }

^ permalink raw reply	[flat|nested] 95+ messages in thread

* [patch 04/11 -mmotm] oom: fix possible android low memory killer NULL pointer
  2009-05-10 22:07 [patch 01/11 -mmotm] lowmemorykiller: Only iterate over process list when needed David Rientjes
  2009-05-10 22:07 ` [patch 02/11 -mmotm] lowmemorykiller: Don't count free space unless it meets the specified limit by itself David Rientjes
  2009-05-10 22:07 ` [patch 03/11 -mmotm] oom: cleanup android low memory killer David Rientjes
@ 2009-05-10 22:07 ` David Rientjes
  2009-05-10 22:07 ` [patch 05/11 -mmotm] oom: fix possible oom_dump_tasks " David Rientjes
                   ` (7 subsequent siblings)
  10 siblings, 0 replies; 95+ messages in thread
From: David Rientjes @ 2009-05-10 22:07 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Greg Kroah-Hartman, Nick Piggin, Mel Gorman, Peter Ziljstra,
	Christoph Lameter, Dave Hansen, San Mehat,
	Arve Hjønnevåg, linux-kernel

get_mm_rss() atomically dereferences the actual without checking for a
NULL pointer, which is possible since task_lock() is not held.

Cc: San Mehat <san@android.com>
Signed-off-by: David Rientjes <rientjes@google.com>
---
 drivers/staging/android/lowmemorykiller.c |   10 ++++++++--
 1 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/drivers/staging/android/lowmemorykiller.c b/drivers/staging/android/lowmemorykiller.c
--- a/drivers/staging/android/lowmemorykiller.c
+++ b/drivers/staging/android/lowmemorykiller.c
@@ -92,12 +92,18 @@ static int lowmem_shrink(int nr_to_scan, gfp_t gfp_mask)
 	for_each_process(p) {
 		int oom_adj;
 
-		if (!p->mm)
+		task_lock(p);
+		if (!p->mm) {
+			task_unlock(p);
 			continue;
+		}
 		oom_adj = p->oomkilladj;
-		if (oom_adj < min_adj)
+		if (oom_adj < min_adj) {
+			task_unlock(p);
 			continue;
+		}
 		tasksize = get_mm_rss(p->mm);
+		task_unlock(p);
 		if (tasksize <= 0)
 			continue;
 		if (selected) {

^ permalink raw reply	[flat|nested] 95+ messages in thread

* [patch 05/11 -mmotm] oom: fix possible oom_dump_tasks NULL pointer
  2009-05-10 22:07 [patch 01/11 -mmotm] lowmemorykiller: Only iterate over process list when needed David Rientjes
                   ` (2 preceding siblings ...)
  2009-05-10 22:07 ` [patch 04/11 -mmotm] oom: fix possible android low memory killer NULL pointer David Rientjes
@ 2009-05-10 22:07 ` David Rientjes
  2009-05-11 21:11   ` Andrew Morton
  2009-05-12  9:38   ` Mel Gorman
  2009-05-10 22:07 ` [patch 06/11 -mmotm] oom: move oom_adj value from task_struct to mm_struct David Rientjes
                   ` (6 subsequent siblings)
  10 siblings, 2 replies; 95+ messages in thread
From: David Rientjes @ 2009-05-10 22:07 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Greg Kroah-Hartman, Nick Piggin, Mel Gorman, Peter Ziljstra,
	Christoph Lameter, Dave Hansen, San Mehat,
	Arve Hjønnevåg, linux-kernel

When /proc/sys/vm/oom_dump_tasks is enabled, it is possible to get a NULL
pointer for tasks that have detached mm's since task_lock() is not held
during the tasklist scan.

Acked-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: David Rientjes <rientjes@google.com>
---
 mm/oom_kill.c |   24 +++++++++++++++---------
 1 files changed, 15 insertions(+), 9 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -284,22 +284,28 @@ static void dump_tasks(const struct mem_cgroup *mem)
 	printk(KERN_INFO "[ pid ]   uid  tgid total_vm      rss cpu oom_adj "
 	       "name\n");
 	do_each_thread(g, p) {
-		/*
-		 * total_vm and rss sizes do not exist for tasks with a
-		 * detached mm so there's no need to report them.
-		 */
-		if (!p->mm)
-			continue;
+		struct mm_struct *mm;
+
 		if (mem && !task_in_mem_cgroup(p, mem))
 			continue;
 		if (!thread_group_leader(p))
 			continue;
 
 		task_lock(p);
+		mm = p->mm;
+		if (!mm) {
+			/*
+			 * total_vm and rss sizes do not exist for tasks with no
+			 * mm so there's no need to report them; they can't be
+			 * oom killed anyway.
+			 */
+			task_unlock(p);
+			continue;
+		}
 		printk(KERN_INFO "[%5d] %5d %5d %8lu %8lu %3d     %3d %s\n",
-		       p->pid, __task_cred(p)->uid, p->tgid,
-		       p->mm->total_vm, get_mm_rss(p->mm), (int)task_cpu(p),
-		       p->oomkilladj, p->comm);
+		       p->pid, __task_cred(p)->uid, p->tgid, mm->total_vm,
+		       get_mm_rss(mm), (int)task_cpu(p), p->oomkilladj,
+		       p->comm);
 		task_unlock(p);
 	} while_each_thread(g, p);
 }

^ permalink raw reply	[flat|nested] 95+ messages in thread

* [patch 06/11 -mmotm] oom: move oom_adj value from task_struct to mm_struct
  2009-05-10 22:07 [patch 01/11 -mmotm] lowmemorykiller: Only iterate over process list when needed David Rientjes
                   ` (3 preceding siblings ...)
  2009-05-10 22:07 ` [patch 05/11 -mmotm] oom: fix possible oom_dump_tasks " David Rientjes
@ 2009-05-10 22:07 ` David Rientjes
  2009-05-11  0:17   ` KOSAKI Motohiro
  2009-05-12  9:56   ` Mel Gorman
  2009-05-10 22:07 ` [patch 07/11 -mmotm] oom: prevent possible OOM_DISABLE livelock David Rientjes
                   ` (5 subsequent siblings)
  10 siblings, 2 replies; 95+ messages in thread
From: David Rientjes @ 2009-05-10 22:07 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Greg Kroah-Hartman, Nick Piggin, Mel Gorman, Peter Ziljstra,
	Christoph Lameter, Dave Hansen, San Mehat,
	Arve Hjønnevåg, linux-kernel

The per-task oom_adj value is a characteristic of its mm more than the
task itself since it's not possible to oom kill any thread that shares
the mm.  If a task were to be killed while attached to an mm that could
not be freed because another thread were set to OOM_DISABLE, it would
have needlessly been terminated since there is no potential for future
memory freeing.

This patch moves oomkilladj (now more appropriately named oom_adj) from
struct task_struct to struct mm_struct.  This requires task_lock() on a
task to check its oom_adj value to protect against exec, but it's already
necessary to take the lock when dereferencing the mm to find the total VM
size for the badness heuristic.

This fixes a livelock if the oom killer chooses a task and another thread
sharing the same memory has an oom_adj value of OOM_DISABLE.  This occurs
because oom_kill_task() repeatedly returns 1 and refuses to kill the
chosen task while select_bad_process() will repeatedly choose the same
task during the next retry.

Taking task_lock() in select_bad_process() to check for OOM_DISABLE and
in oom_kill_task() to check for threads sharing the same memory will be
removed in the next patch in this series where it will no longer be
necessary.

Writing to /proc/pid/oom_adj for a kthread will now return -EINVAL since
these threads are immune from oom killing already.  They simply report an
oom_adj value of OOM_DISABLE.

Cc: Nick Piggin <npiggin@suse.de>
Cc: San Mehat <san@android.com>
Signed-off-by: David Rientjes <rientjes@google.com>
---
 Documentation/filesystems/proc.txt        |   15 +++++++++----
 drivers/staging/android/lowmemorykiller.c |    2 +-
 fs/proc/base.c                            |   19 ++++++++++++++--
 include/linux/mm_types.h                  |    2 +
 include/linux/sched.h                     |    1 -
 mm/oom_kill.c                             |   32 ++++++++++++++++++----------
 6 files changed, 49 insertions(+), 22 deletions(-)

diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -1003,11 +1003,13 @@ CHAPTER 3: PER-PROCESS PARAMETERS
 3.1 /proc/<pid>/oom_adj - Adjust the oom-killer score
 ------------------------------------------------------
 
-This file can be used to adjust the score used to select which processes
-should be killed in an  out-of-memory  situation.  Giving it a high score will
-increase the likelihood of this process being killed by the oom-killer.  Valid
-values are in the range -16 to +15, plus the special value -17, which disables
-oom-killing altogether for this process.
+This file can be used to adjust the score used to select which processes should
+be killed in an out-of-memory situation.  The oom_adj value is a characteristic
+of the task's mm, so all threads that share an mm with pid will have the same
+oom_adj value.  A high value will increase the liklihood of this process being
+killed by the oom-killer.  Valid values are in the range -16 to +15 as
+explained below and a special value of -17, which disables oom-killing
+altogether for threads sharing pid's mm.
 
 The process to be killed in an out-of-memory situation is selected among all others
 based on its badness score. This value equals the original memory size of the process
@@ -1021,6 +1023,9 @@ the parent's score if they do not share the same memory. Thus forking servers
 are the prime candidates to be killed. Having only one 'hungry' child will make
 parent less preferable than the child.
 
+/proc/<pid>/oom_adj cannot be changed for kthreads since they are immune from
+oom-killing already.
+
 /proc/<pid>/oom_score shows process' current badness score.
 
 The following heuristics are then applied:
diff --git a/drivers/staging/android/lowmemorykiller.c b/drivers/staging/android/lowmemorykiller.c
--- a/drivers/staging/android/lowmemorykiller.c
+++ b/drivers/staging/android/lowmemorykiller.c
@@ -97,7 +97,7 @@ static int lowmem_shrink(int nr_to_scan, gfp_t gfp_mask)
 			task_unlock(p);
 			continue;
 		}
-		oom_adj = p->oomkilladj;
+		oom_adj = p->mm->oom_adj;
 		if (oom_adj < min_adj) {
 			task_unlock(p);
 			continue;
diff --git a/fs/proc/base.c b/fs/proc/base.c
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -1003,7 +1003,12 @@ static ssize_t oom_adjust_read(struct file *file, char __user *buf,
 
 	if (!task)
 		return -ESRCH;
-	oom_adjust = task->oomkilladj;
+	task_lock(task);
+	if (task->mm)
+		oom_adjust = task->mm->oom_adj;
+	else
+		oom_adjust = OOM_DISABLE;
+	task_unlock(task);
 	put_task_struct(task);
 
 	len = snprintf(buffer, sizeof(buffer), "%i\n", oom_adjust);
@@ -1032,11 +1037,19 @@ static ssize_t oom_adjust_write(struct file *file, const char __user *buf,
 	task = get_proc_task(file->f_path.dentry->d_inode);
 	if (!task)
 		return -ESRCH;
-	if (oom_adjust < task->oomkilladj && !capable(CAP_SYS_RESOURCE)) {
+	task_lock(task);
+	if (!task->mm) {
+		task_unlock(task);
+		put_task_struct(task);
+		return -EINVAL;
+	}
+	if (oom_adjust < task->mm->oom_adj && !capable(CAP_SYS_RESOURCE)) {
+		task_unlock(task);
 		put_task_struct(task);
 		return -EACCES;
 	}
-	task->oomkilladj = oom_adjust;
+	task->mm->oom_adj = oom_adjust;
+	task_unlock(task);
 	put_task_struct(task);
 	if (end - buffer == 0)
 		return -EIO;
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -232,6 +232,8 @@ struct mm_struct {
 
 	unsigned long saved_auxv[AT_VECTOR_SIZE]; /* for /proc/PID/auxv */
 
+	s8 oom_adj;	/* OOM kill score adjustment (bit shift) */
+
 	cpumask_t cpu_vm_mask;
 
 	/* Architecture-specific MM context */
diff --git a/include/linux/sched.h b/include/linux/sched.h
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1149,7 +1149,6 @@ struct task_struct {
 	 * a short time
 	 */
 	unsigned char fpu_counter;
-	s8 oomkilladj; /* OOM kill score adjustment (bit shift). */
 #ifdef CONFIG_BLK_DEV_IO_TRACE
 	unsigned int btrace_seq;
 #endif
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -58,6 +58,7 @@ unsigned long badness(struct task_struct *p, unsigned long uptime)
 	unsigned long points, cpu_time, run_time;
 	struct mm_struct *mm;
 	struct task_struct *child;
+	int oom_adj;
 
 	task_lock(p);
 	mm = p->mm;
@@ -65,6 +66,7 @@ unsigned long badness(struct task_struct *p, unsigned long uptime)
 		task_unlock(p);
 		return 0;
 	}
+	oom_adj = mm->oom_adj;
 
 	/*
 	 * The memory size of the process is the basis for the badness.
@@ -148,15 +150,15 @@ unsigned long badness(struct task_struct *p, unsigned long uptime)
 		points /= 8;
 
 	/*
-	 * Adjust the score by oomkilladj.
+	 * Adjust the score by oom_adj.
 	 */
-	if (p->oomkilladj) {
-		if (p->oomkilladj > 0) {
+	if (oom_adj) {
+		if (oom_adj > 0) {
 			if (!points)
 				points = 1;
-			points <<= p->oomkilladj;
+			points <<= oom_adj;
 		} else
-			points >>= -(p->oomkilladj);
+			points >>= -(oom_adj);
 	}
 
 #ifdef DEBUG
@@ -251,8 +253,10 @@ static struct task_struct *select_bad_process(unsigned long *ppoints,
 			*ppoints = ULONG_MAX;
 		}
 
-		if (p->oomkilladj == OOM_DISABLE)
+		task_lock(p);
+		if (p->mm && p->mm->oom_adj == OOM_DISABLE)
 			continue;
+		task_unlock(p);
 
 		points = badness(p, uptime.tv_sec);
 		if (points > *ppoints || !chosen) {
@@ -304,8 +308,7 @@ static void dump_tasks(const struct mem_cgroup *mem)
 		}
 		printk(KERN_INFO "[%5d] %5d %5d %8lu %8lu %3d     %3d %s\n",
 		       p->pid, __task_cred(p)->uid, p->tgid, mm->total_vm,
-		       get_mm_rss(mm), (int)task_cpu(p), p->oomkilladj,
-		       p->comm);
+		       get_mm_rss(mm), (int)task_cpu(p), mm->oom_adj, p->comm);
 		task_unlock(p);
 	} while_each_thread(g, p);
 }
@@ -367,8 +370,12 @@ static int oom_kill_task(struct task_struct *p)
 	 * Don't kill the process if any threads are set to OOM_DISABLE
 	 */
 	do_each_thread(g, q) {
-		if (q->mm == mm && q->oomkilladj == OOM_DISABLE)
+		task_lock(q);
+		if (q->mm == mm && q->mm && q->mm->oom_adj == OOM_DISABLE) {
+			task_unlock(q);
 			return 1;
+		}
+		task_unlock(q);
 	} while_each_thread(g, q);
 
 	__oom_kill_task(p, 1);
@@ -393,10 +400,11 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
 	struct task_struct *c;
 
 	if (printk_ratelimit()) {
-		printk(KERN_WARNING "%s invoked oom-killer: "
-			"gfp_mask=0x%x, order=%d, oomkilladj=%d\n",
-			current->comm, gfp_mask, order, current->oomkilladj);
 		task_lock(current);
+		printk(KERN_WARNING "%s invoked oom-killer: "
+			"gfp_mask=0x%x, order=%d, oom_adj=%d\n",
+			current->comm, gfp_mask, order,
+			current->mm ? current->mm->oom_adj : OOM_DISABLE);
 		cpuset_print_task_mems_allowed(current);
 		task_unlock(current);
 		dump_stack();

^ permalink raw reply	[flat|nested] 95+ messages in thread

* [patch 07/11 -mmotm] oom: prevent possible OOM_DISABLE livelock
  2009-05-10 22:07 [patch 01/11 -mmotm] lowmemorykiller: Only iterate over process list when needed David Rientjes
                   ` (4 preceding siblings ...)
  2009-05-10 22:07 ` [patch 06/11 -mmotm] oom: move oom_adj value from task_struct to mm_struct David Rientjes
@ 2009-05-10 22:07 ` David Rientjes
  2009-05-11 21:22   ` Andrew Morton
  2009-05-10 22:07 ` [patch 08/11 -mmotm] oom: invoke oom killer for __GFP_NOFAIL David Rientjes
                   ` (4 subsequent siblings)
  10 siblings, 1 reply; 95+ messages in thread
From: David Rientjes @ 2009-05-10 22:07 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Greg Kroah-Hartman, Nick Piggin, Mel Gorman, Peter Ziljstra,
	Christoph Lameter, Dave Hansen, San Mehat,
	Arve Hjønnevåg, linux-kernel

This moves the check for OOM_DISABLE to the badness heuristic while
holding task_lock().  Badness scores of 0 are now explicitly prohibited
from being oom killed and since the oom_adj value is a characteristic of
an mm and not a task, it is no longer necessary to check the oom_adj
value for threads sharing the same memory (except when simply issuing
SIGKILLs for threads in other thread groups).

Cc: Nick Piggin <npiggin@suse.de>
Signed-off-by: David Rientjes <rientjes@google.com>
---
 mm/oom_kill.c |   40 ++++++++++------------------------------
 1 files changed, 10 insertions(+), 30 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -67,6 +67,10 @@ unsigned long badness(struct task_struct *p, unsigned long uptime)
 		return 0;
 	}
 	oom_adj = mm->oom_adj;
+	if (oom_adj == OOM_DISABLE) {
+		task_unlock(p);
+		return 0;
+	}
 
 	/*
 	 * The memory size of the process is the basis for the badness.
@@ -253,13 +257,8 @@ static struct task_struct *select_bad_process(unsigned long *ppoints,
 			*ppoints = ULONG_MAX;
 		}
 
-		task_lock(p);
-		if (p->mm && p->mm->oom_adj == OOM_DISABLE)
-			continue;
-		task_unlock(p);
-
 		points = badness(p, uptime.tv_sec);
-		if (points > *ppoints || !chosen) {
+		if (points > *ppoints) {
 			chosen = p;
 			*ppoints = points;
 		}
@@ -352,32 +351,13 @@ static int oom_kill_task(struct task_struct *p)
 	struct mm_struct *mm;
 	struct task_struct *g, *q;
 
+	task_lock(p);
 	mm = p->mm;
-
-	/* WARNING: mm may not be dereferenced since we did not obtain its
-	 * value from get_task_mm(p).  This is OK since all we need to do is
-	 * compare mm to q->mm below.
-	 *
-	 * Furthermore, even if mm contains a non-NULL value, p->mm may
-	 * change to NULL at any time since we do not hold task_lock(p).
-	 * However, this is of no concern to us.
-	 */
-
-	if (mm == NULL)
+	if (!mm || mm->oom_adj == OOM_DISABLE) {
+		task_unlock(p);
 		return 1;
-
-	/*
-	 * Don't kill the process if any threads are set to OOM_DISABLE
-	 */
-	do_each_thread(g, q) {
-		task_lock(q);
-		if (q->mm == mm && q->mm && q->mm->oom_adj == OOM_DISABLE) {
-			task_unlock(q);
-			return 1;
-		}
-		task_unlock(q);
-	} while_each_thread(g, q);
-
+	}
+	task_unlock(p);
 	__oom_kill_task(p, 1);
 
 	/*

^ permalink raw reply	[flat|nested] 95+ messages in thread

* [patch 08/11 -mmotm] oom: invoke oom killer for __GFP_NOFAIL
  2009-05-10 22:07 [patch 01/11 -mmotm] lowmemorykiller: Only iterate over process list when needed David Rientjes
                   ` (5 preceding siblings ...)
  2009-05-10 22:07 ` [patch 07/11 -mmotm] oom: prevent possible OOM_DISABLE livelock David Rientjes
@ 2009-05-10 22:07 ` David Rientjes
  2009-05-10 23:59   ` KOSAKI Motohiro
  2009-05-11 21:29   ` Andrew Morton
  2009-05-10 22:07 ` [patch 09/11 -mmotm] oom: return vm size of oom killed task David Rientjes
                   ` (3 subsequent siblings)
  10 siblings, 2 replies; 95+ messages in thread
From: David Rientjes @ 2009-05-10 22:07 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Greg Kroah-Hartman, Nick Piggin, Mel Gorman, Peter Ziljstra,
	Christoph Lameter, Dave Hansen, San Mehat,
	Arve Hjønnevåg, linux-kernel

The oom killer must be invoked regardless of the order if the allocation
is __GFP_NOFAIL, otherwise it will loop forever when reclaim fails to
free some memory.

Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Nick Piggin <npiggin@suse.de>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Dave Hansen <dave@linux.vnet.ibm.com>
Signed-off-by: David Rientjes <rientjes@google.com>
---
 mm/page_alloc.c |   12 +++++++-----
 1 files changed, 7 insertions(+), 5 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1547,7 +1547,7 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
 		goto out;
 
 	/* The OOM killer will not help higher order allocs */
-	if (order > PAGE_ALLOC_COSTLY_ORDER)
+	if (order > PAGE_ALLOC_COSTLY_ORDER && !(gfp_mask & __GFP_NOFAIL))
 		goto out;
 
 	/* Exhausted what can be done so it's blamo time */
@@ -1765,11 +1765,13 @@ rebalance:
 				goto got_pg;
 
 			/*
-			 * The OOM killer does not trigger for high-order allocations
-			 * but if no progress is being made, there are no other
-			 * options and retrying is unlikely to help
+			 * The OOM killer does not trigger for high-order
+			 * ~__GFP_NOFAIL allocations so if no progress is being
+			 * made, there are no other options and retrying is
+			 * unlikely to help.
 			 */
-			if (order > PAGE_ALLOC_COSTLY_ORDER)
+			if (order > PAGE_ALLOC_COSTLY_ORDER &&
+						!(gfp_mask & __GFP_NOFAIL))
 				goto nopage;
 
 			goto restart;

^ permalink raw reply	[flat|nested] 95+ messages in thread

* [patch 09/11 -mmotm] oom: return vm size of oom killed task
  2009-05-10 22:07 [patch 01/11 -mmotm] lowmemorykiller: Only iterate over process list when needed David Rientjes
                   ` (6 preceding siblings ...)
  2009-05-10 22:07 ` [patch 08/11 -mmotm] oom: invoke oom killer for __GFP_NOFAIL David Rientjes
@ 2009-05-10 22:07 ` David Rientjes
  2009-05-11 21:36   ` Andrew Morton
  2009-05-10 22:07 ` [patch 10/11 -mmotm] oom: avoid oom kill if no interruptible tasks David Rientjes
                   ` (2 subsequent siblings)
  10 siblings, 1 reply; 95+ messages in thread
From: David Rientjes @ 2009-05-10 22:07 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Greg Kroah-Hartman, Nick Piggin, Mel Gorman, Peter Ziljstra,
	Christoph Lameter, Dave Hansen, San Mehat,
	Arve Hjønnevåg, linux-kernel

It's not optimal to continuously loop in the page allocator if the oom
killer fails to kill a task.  Thus, it's necessary to report how many
pages may be freed when the task finally exits to determine if any
progress has been made.

This also changes the TIF_MEMDIE exception in select_bad_process().  If
a task is found with this thread flag set, yet it has already detached
its memory, then an additional task is chosen since we are still out of
memory.

total_vm is used instead of the file and anon rss since this is what the
badness scoring heuristic is based on and it may be possible to oom kill
a task with no rss causing the page allocator to believe no progress has
been made.

Cc: Nick Piggin <npiggin@suse.de>
Signed-off-by: David Rientjes <rientjes@google.com>
---
 include/linux/oom.h |    3 +-
 mm/oom_kill.c       |   76 +++++++++++++++++++++++++++++++++++---------------
 2 files changed, 55 insertions(+), 24 deletions(-)

diff --git a/include/linux/oom.h b/include/linux/oom.h
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -26,7 +26,8 @@ enum oom_constraint {
 extern int try_set_zone_oom(struct zonelist *zonelist, gfp_t gfp_flags);
 extern void clear_zonelist_oom(struct zonelist *zonelist, gfp_t gfp_flags);
 
-extern void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask, int order);
+extern unsigned long out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
+				   int order);
 extern int register_oom_notifier(struct notifier_block *nb);
 extern int unregister_oom_notifier(struct notifier_block *nb);
 
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -204,12 +204,13 @@ static inline enum oom_constraint constrained_alloc(struct zonelist *zonelist,
  * (not docbooked, we don't want this one cluttering up the manual)
  */
 static struct task_struct *select_bad_process(unsigned long *ppoints,
-						struct mem_cgroup *mem)
+				unsigned long *freed, struct mem_cgroup *mem)
 {
 	struct task_struct *g, *p;
 	struct task_struct *chosen = NULL;
 	struct timespec uptime;
 	*ppoints = 0;
+	*freed = 0;
 
 	do_posix_clock_monotonic_gettime(&uptime);
 	do_each_thread(g, p) {
@@ -236,8 +237,14 @@ static struct task_struct *select_bad_process(unsigned long *ppoints,
 		 * blocked waiting for another task which itself is waiting
 		 * for memory. Is there a better alternative?
 		 */
-		if (test_tsk_thread_flag(p, TIF_MEMDIE))
-			return ERR_PTR(-1UL);
+		if (test_tsk_thread_flag(p, TIF_MEMDIE)) {
+			task_lock(p);
+			if (p->mm)
+				*freed = p->mm->total_vm;
+			task_unlock(p);
+			if (*freed)
+				return ERR_PTR(-1UL);
+		}
 
 		/*
 		 * This is in the process of releasing memory so wait for it
@@ -250,8 +257,14 @@ static struct task_struct *select_bad_process(unsigned long *ppoints,
 		 * Otherwise we could get an easy OOM deadlock.
 		 */
 		if (p->flags & PF_EXITING) {
-			if (p != current)
-				return ERR_PTR(-1UL);
+			if (p != current) {
+				task_lock(p);
+				if (p->mm)
+					*freed = p->mm->total_vm;
+				task_unlock(p);
+				if (*freed)
+					return ERR_PTR(-1UL);
+			}
 
 			chosen = p;
 			*ppoints = ULONG_MAX;
@@ -346,7 +359,7 @@ static void __oom_kill_task(struct task_struct *p, int verbose)
 	force_sig(SIGKILL, p);
 }
 
-static int oom_kill_task(struct task_struct *p)
+static int oom_kill_task(struct task_struct *p, unsigned long *freed)
 {
 	struct mm_struct *mm;
 	struct task_struct *g, *q;
@@ -357,6 +370,7 @@ static int oom_kill_task(struct task_struct *p)
 		task_unlock(p);
 		return 1;
 	}
+	*freed = mm->total_vm;
 	task_unlock(p);
 	__oom_kill_task(p, 1);
 
@@ -375,10 +389,12 @@ static int oom_kill_task(struct task_struct *p)
 
 static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
 			    unsigned long points, struct mem_cgroup *mem,
-			    const char *message)
+			    unsigned long *freed, const char *message)
 {
 	struct task_struct *c;
 
+
+	*freed = 0;
 	if (printk_ratelimit()) {
 		task_lock(current);
 		printk(KERN_WARNING "%s invoked oom-killer: "
@@ -399,8 +415,14 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
 	 * its children or threads, just set TIF_MEMDIE so it can die quickly
 	 */
 	if (p->flags & PF_EXITING) {
-		__oom_kill_task(p, 0);
-		return 0;
+		task_lock(p);
+		if (p->mm)
+			*freed = p->mm->total_vm;
+		task_unlock(p);
+		if (*freed) {
+			__oom_kill_task(p, 0);
+			return 0;
+		}
 	}
 
 	printk(KERN_ERR "%s: kill process %d (%s) score %li or a child\n",
@@ -410,28 +432,29 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
 	list_for_each_entry(c, &p->children, sibling) {
 		if (c->mm == p->mm)
 			continue;
-		if (!oom_kill_task(c))
+		if (!oom_kill_task(c, freed))
 			return 0;
 	}
-	return oom_kill_task(p);
+	return oom_kill_task(p, freed);
 }
 
 #ifdef CONFIG_CGROUP_MEM_RES_CTLR
 void mem_cgroup_out_of_memory(struct mem_cgroup *mem, gfp_t gfp_mask)
 {
 	unsigned long points = 0;
+	unsigned long freed;
 	struct task_struct *p;
 
 	read_lock(&tasklist_lock);
 retry:
-	p = select_bad_process(&points, mem);
+	p = select_bad_process(&points, &freed, mem);
 	if (PTR_ERR(p) == -1UL)
 		goto out;
 
 	if (!p)
 		p = current;
 
-	if (oom_kill_process(p, gfp_mask, 0, points, mem,
+	if (oom_kill_process(p, gfp_mask, 0, points, mem, &freed,
 				"Memory cgroup out of memory"))
 		goto retry;
 out:
@@ -506,24 +529,25 @@ void clear_zonelist_oom(struct zonelist *zonelist, gfp_t gfp_mask)
 /*
  * Must be called with tasklist_lock held for read.
  */
-static void __out_of_memory(gfp_t gfp_mask, int order)
+static unsigned long __out_of_memory(gfp_t gfp_mask, int order)
 {
 	struct task_struct *p;
 	unsigned long points;
+	unsigned long freed = 0;
 
 	if (sysctl_oom_kill_allocating_task)
 		if (!oom_kill_process(current, gfp_mask, order, 0, NULL,
-				"Out of memory (oom_kill_allocating_task)"))
-			return;
+			&freed, "Out of memory (oom_kill_allocating_task)"))
+			return freed;
 retry:
 	/*
 	 * Rambo mode: Shoot down a process and hope it solves whatever
 	 * issues we may have.
 	 */
-	p = select_bad_process(&points, NULL);
+	p = select_bad_process(&points, &freed, NULL);
 
 	if (PTR_ERR(p) == -1UL)
-		return;
+		return freed;
 
 	/* Found nothing?!?! Either we hang forever, or we panic. */
 	if (!p) {
@@ -531,9 +555,10 @@ retry:
 		panic("Out of memory and no killable processes...\n");
 	}
 
-	if (oom_kill_process(p, gfp_mask, order, points, NULL,
+	if (oom_kill_process(p, gfp_mask, order, points, NULL, &freed,
 			     "Out of memory"))
 		goto retry;
+	return freed;
 }
 
 /*
@@ -582,8 +607,12 @@ rest_and_return:
  * killing a random task (bad), letting the system crash (worse)
  * OR try to be smart about which process to kill. Note that we
  * don't have to be perfect here, we just have to be good.
+ *
+ * Returns the number of pages that will be freed from a killed
+ * task, if any.
  */
-void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask, int order)
+unsigned long out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
+			    int order)
 {
 	unsigned long freed = 0;
 	enum oom_constraint constraint;
@@ -591,7 +620,7 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask, int order)
 	blocking_notifier_call_chain(&oom_notify_list, 0, &freed);
 	if (freed > 0)
 		/* Got some memory back in the last second. */
-		return;
+		return freed;
 
 	if (sysctl_panic_on_oom == 2)
 		panic("out of memory. Compulsory panic_on_oom is selected.\n");
@@ -605,7 +634,7 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask, int order)
 
 	switch (constraint) {
 	case CONSTRAINT_MEMORY_POLICY:
-		oom_kill_process(current, gfp_mask, order, 0, NULL,
+		oom_kill_process(current, gfp_mask, order, 0, NULL, &freed,
 				"No available memory (MPOL_BIND)");
 		break;
 
@@ -614,7 +643,7 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask, int order)
 			panic("out of memory. panic_on_oom is selected\n");
 		/* Fall-through */
 	case CONSTRAINT_CPUSET:
-		__out_of_memory(gfp_mask, order);
+		freed = __out_of_memory(gfp_mask, order);
 		break;
 	}
 
@@ -626,4 +655,5 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask, int order)
 	 */
 	if (!test_thread_flag(TIF_MEMDIE))
 		schedule_timeout_uninterruptible(1);
+	return freed;
 }

^ permalink raw reply	[flat|nested] 95+ messages in thread

* [patch 10/11 -mmotm] oom: avoid oom kill if no interruptible tasks
  2009-05-10 22:07 [patch 01/11 -mmotm] lowmemorykiller: Only iterate over process list when needed David Rientjes
                   ` (7 preceding siblings ...)
  2009-05-10 22:07 ` [patch 09/11 -mmotm] oom: return vm size of oom killed task David Rientjes
@ 2009-05-10 22:07 ` David Rientjes
  2009-05-11 21:39   ` Andrew Morton
  2009-05-10 22:07 ` [patch 11/11 -mmotm] oom: fail allocations if oom killer can't free memory David Rientjes
  2009-05-12  9:09 ` [patch 01/11 -mmotm] lowmemorykiller: Only iterate over process list when needed Mel Gorman
  10 siblings, 1 reply; 95+ messages in thread
From: David Rientjes @ 2009-05-10 22:07 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Greg Kroah-Hartman, Nick Piggin, Mel Gorman, Peter Ziljstra,
	Christoph Lameter, Dave Hansen, San Mehat,
	Arve Hjønnevåg, linux-kernel

If there are no interruptible system tasks other than kthreads, no task
should be chosen for oom kill since they won't respond to the SIGKILL
anyway.  Instead, we choose to simply fail page allocations if reclaim
cannot free memory and hope for the best.

Signed-off-by: David Rientjes <rientjes@google.com>
---
 mm/oom_kill.c |    5 ++++-
 1 files changed, 4 insertions(+), 1 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -209,6 +209,7 @@ static struct task_struct *select_bad_process(unsigned long *ppoints,
 	struct task_struct *g, *p;
 	struct task_struct *chosen = NULL;
 	struct timespec uptime;
+	unsigned long nr_interruptible = 0;
 	*ppoints = 0;
 	*freed = 0;
 
@@ -225,6 +226,8 @@ static struct task_struct *select_bad_process(unsigned long *ppoints,
 		/* skip the init task */
 		if (is_global_init(p))
 			continue;
+		if (!(p->state & TASK_UNINTERRUPTIBLE))
+			nr_interruptible++;
 		if (mem && !task_in_mem_cgroup(p, mem))
 			continue;
 
@@ -277,7 +280,7 @@ static struct task_struct *select_bad_process(unsigned long *ppoints,
 		}
 	} while_each_thread(g, p);
 
-	return chosen;
+	return nr_interruptible ? chosen : ERR_PTR(-1UL);
 }
 
 /**

^ permalink raw reply	[flat|nested] 95+ messages in thread

* [patch 11/11 -mmotm] oom: fail allocations if oom killer can't free memory
  2009-05-10 22:07 [patch 01/11 -mmotm] lowmemorykiller: Only iterate over process list when needed David Rientjes
                   ` (8 preceding siblings ...)
  2009-05-10 22:07 ` [patch 10/11 -mmotm] oom: avoid oom kill if no interruptible tasks David Rientjes
@ 2009-05-10 22:07 ` David Rientjes
  2009-05-12 21:14   ` Misleading OOM messages Christoph Lameter
  2009-05-12  9:09 ` [patch 01/11 -mmotm] lowmemorykiller: Only iterate over process list when needed Mel Gorman
  10 siblings, 1 reply; 95+ messages in thread
From: David Rientjes @ 2009-05-10 22:07 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Greg Kroah-Hartman, Nick Piggin, Mel Gorman, Peter Ziljstra,
	Christoph Lameter, Dave Hansen, San Mehat,
	Arve Hjønnevåg, linux-kernel

The oom killer now reports the potential number of pages that will be
freed when an oom kill task finally exits.  If the oom killer can't make
any progress in freeing memory, however, there is no reason to continue
looping endlessly in the page allocator.  Instead, the allocation is
failed if it is not __GFP_NOFAIL.

Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Nick Piggin <npiggin@suse.de>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Dave Hansen <dave@linux.vnet.ibm.com>
Signed-off-by: David Rientjes <rientjes@google.com>
---
 mm/page_alloc.c |   35 ++++++++++++++++++++---------------
 1 files changed, 20 insertions(+), 15 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1524,7 +1524,7 @@ static inline struct page *
 __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
 	struct zonelist *zonelist, enum zone_type high_zoneidx,
 	nodemask_t *nodemask, struct zone *preferred_zone,
-	int migratetype)
+	int migratetype, unsigned long *did_some_progress)
 {
 	struct page *page;
 
@@ -1551,7 +1551,7 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
 		goto out;
 
 	/* Exhausted what can be done so it's blamo time */
-	out_of_memory(zonelist, gfp_mask, order);
+	*did_some_progress += out_of_memory(zonelist, gfp_mask, order);
 
 out:
 	clear_zonelist_oom(zonelist, gfp_mask);
@@ -1716,7 +1716,6 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	 */
 	alloc_flags = gfp_to_alloc_flags(gfp_mask);
 
-restart:
 	/* This is the last chance, in general, before the goto nopage. */
 	page = get_page_from_freelist(gfp_mask, nodemask, order, zonelist,
 			high_zoneidx, alloc_flags & ~ALLOC_NO_WATERMARKS,
@@ -1760,21 +1759,27 @@ rebalance:
 			page = __alloc_pages_may_oom(gfp_mask, order,
 					zonelist, high_zoneidx,
 					nodemask, preferred_zone,
-					migratetype);
+					migratetype, &did_some_progress);
 			if (page)
 				goto got_pg;
 
-			/*
-			 * The OOM killer does not trigger for high-order
-			 * ~__GFP_NOFAIL allocations so if no progress is being
-			 * made, there are no other options and retrying is
-			 * unlikely to help.
-			 */
-			if (order > PAGE_ALLOC_COSTLY_ORDER &&
-						!(gfp_mask & __GFP_NOFAIL))
-				goto nopage;
-
-			goto restart;
+			if (!(gfp_mask & __GFP_NOFAIL)) {
+				/*
+				 * The OOM killer does not trigger for
+				 * high-order allocations so if no progress is
+				 * being made, there are no other options and
+				 * retrying is unlikely to help.
+				 */
+				if (order > PAGE_ALLOC_COSTLY_ORDER)
+					goto nopage;
+
+				/*
+				 * If the oom killer could not free any memory,
+				 * there is no reason to endlessly loop.
+				 */
+				if (!did_some_progress)
+					goto nopage;
+			}
 		}
 	}
 

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [patch 08/11 -mmotm] oom: invoke oom killer for __GFP_NOFAIL
  2009-05-10 22:07 ` [patch 08/11 -mmotm] oom: invoke oom killer for __GFP_NOFAIL David Rientjes
@ 2009-05-10 23:59   ` KOSAKI Motohiro
  2009-05-11  0:24     ` David Rientjes
  2009-05-11 21:29   ` Andrew Morton
  1 sibling, 1 reply; 95+ messages in thread
From: KOSAKI Motohiro @ 2009-05-10 23:59 UTC (permalink / raw)
  To: David Rientjes
  Cc: kosaki.motohiro, Andrew Morton, Greg Kroah-Hartman, Nick Piggin,
	Mel Gorman, Peter Ziljstra, Christoph Lameter, Dave Hansen,
	San Mehat, Arve Hjonnevag, linux-kernel

> The oom killer must be invoked regardless of the order if the allocation
> is __GFP_NOFAIL, otherwise it will loop forever when reclaim fails to
> free some memory.

This is intensional behavior. plus you change very widely caller bahavior.
if you don't have good test program, I nak this.


> 
> Cc: Mel Gorman <mel@csn.ul.ie>
> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
> Cc: Nick Piggin <npiggin@suse.de>
> Cc: Christoph Lameter <cl@linux-foundation.org>
> Cc: Dave Hansen <dave@linux.vnet.ibm.com>
> Signed-off-by: David Rientjes <rientjes@google.com>



^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [patch 06/11 -mmotm] oom: move oom_adj value from task_struct to mm_struct
  2009-05-10 22:07 ` [patch 06/11 -mmotm] oom: move oom_adj value from task_struct to mm_struct David Rientjes
@ 2009-05-11  0:17   ` KOSAKI Motohiro
  2009-05-11  0:26     ` David Rientjes
  2009-05-12  9:56   ` Mel Gorman
  1 sibling, 1 reply; 95+ messages in thread
From: KOSAKI Motohiro @ 2009-05-11  0:17 UTC (permalink / raw)
  To: David Rientjes
  Cc: kosaki.motohiro, Andrew Morton, Greg Kroah-Hartman, Nick Piggin,
	Mel Gorman, Peter Ziljstra, Christoph Lameter, Dave Hansen,
	San Mehat, Arve Hjonnevag, linux-kernel


> The per-task oom_adj value is a characteristic of its mm more than the
> task itself since it's not possible to oom kill any thread that shares
> the mm.  If a task were to be killed while attached to an mm that could
> not be freed because another thread were set to OOM_DISABLE, it would
> have needlessly been terminated since there is no potential for future
> memory freeing.
> 
> This patch moves oomkilladj (now more appropriately named oom_adj) from
> struct task_struct to struct mm_struct.  This requires task_lock() on a
> task to check its oom_adj value to protect against exec, but it's already
> necessary to take the lock when dereferencing the mm to find the total VM
> size for the badness heuristic.
> 
> This fixes a livelock if the oom killer chooses a task and another thread
> sharing the same memory has an oom_adj value of OOM_DISABLE.  This occurs
> because oom_kill_task() repeatedly returns 1 and refuses to kill the
> chosen task while select_bad_process() will repeatedly choose the same
> task during the next retry.
> 
> Taking task_lock() in select_bad_process() to check for OOM_DISABLE and
> in oom_kill_task() to check for threads sharing the same memory will be
> removed in the next patch in this series where it will no longer be
> necessary.
> 
> Writing to /proc/pid/oom_adj for a kthread will now return -EINVAL since
> these threads are immune from oom killing already.  They simply report an
> oom_adj value of OOM_DISABLE.

David, staging driver shold be standalone.
Please don't change core kernel for staging.

if you merge it into mainline, I suggest to you help the driver graduate
staging.

Thanks.


> 
> Cc: Nick Piggin <npiggin@suse.de>
> Cc: San Mehat <san@android.com>
> Signed-off-by: David Rientjes <rientjes@google.com>
> ---
>  Documentation/filesystems/proc.txt        |   15 +++++++++----
>  drivers/staging/android/lowmemorykiller.c |    2 +-
>  fs/proc/base.c                            |   19 ++++++++++++++--
>  include/linux/mm_types.h                  |    2 +
>  include/linux/sched.h                     |    1 -
>  mm/oom_kill.c                             |   32 ++++++++++++++++++----------
>  6 files changed, 49 insertions(+), 22 deletions(-)
> 
> diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
> --- a/Documentation/filesystems/proc.txt
> +++ b/Documentation/filesystems/proc.txt
> @@ -1003,11 +1003,13 @@ CHAPTER 3: PER-PROCESS PARAMETERS
>  3.1 /proc/<pid>/oom_adj - Adjust the oom-killer score
>  ------------------------------------------------------
>  
> -This file can be used to adjust the score used to select which processes
> -should be killed in an  out-of-memory  situation.  Giving it a high score will
> -increase the likelihood of this process being killed by the oom-killer.  Valid
> -values are in the range -16 to +15, plus the special value -17, which disables
> -oom-killing altogether for this process.
> +This file can be used to adjust the score used to select which processes should
> +be killed in an out-of-memory situation.  The oom_adj value is a characteristic
> +of the task's mm, so all threads that share an mm with pid will have the same
> +oom_adj value.  A high value will increase the liklihood of this process being
> +killed by the oom-killer.  Valid values are in the range -16 to +15 as
> +explained below and a special value of -17, which disables oom-killing
> +altogether for threads sharing pid's mm.
>  
>  The process to be killed in an out-of-memory situation is selected among all others
>  based on its badness score. This value equals the original memory size of the process
> @@ -1021,6 +1023,9 @@ the parent's score if they do not share the same memory. Thus forking servers
>  are the prime candidates to be killed. Having only one 'hungry' child will make
>  parent less preferable than the child.
>  
> +/proc/<pid>/oom_adj cannot be changed for kthreads since they are immune from
> +oom-killing already.
> +
>  /proc/<pid>/oom_score shows process' current badness score.
>  
>  The following heuristics are then applied:
> diff --git a/drivers/staging/android/lowmemorykiller.c b/drivers/staging/android/lowmemorykiller.c
> --- a/drivers/staging/android/lowmemorykiller.c
> +++ b/drivers/staging/android/lowmemorykiller.c
> @@ -97,7 +97,7 @@ static int lowmem_shrink(int nr_to_scan, gfp_t gfp_mask)
>  			task_unlock(p);
>  			continue;
>  		}
> -		oom_adj = p->oomkilladj;
> +		oom_adj = p->mm->oom_adj;
>  		if (oom_adj < min_adj) {
>  			task_unlock(p);
>  			continue;
> diff --git a/fs/proc/base.c b/fs/proc/base.c
> --- a/fs/proc/base.c
> +++ b/fs/proc/base.c
> @@ -1003,7 +1003,12 @@ static ssize_t oom_adjust_read(struct file *file, char __user *buf,
>  
>  	if (!task)
>  		return -ESRCH;
> -	oom_adjust = task->oomkilladj;
> +	task_lock(task);
> +	if (task->mm)
> +		oom_adjust = task->mm->oom_adj;
> +	else
> +		oom_adjust = OOM_DISABLE;
> +	task_unlock(task);
>  	put_task_struct(task);
>  
>  	len = snprintf(buffer, sizeof(buffer), "%i\n", oom_adjust);
> @@ -1032,11 +1037,19 @@ static ssize_t oom_adjust_write(struct file *file, const char __user *buf,
>  	task = get_proc_task(file->f_path.dentry->d_inode);
>  	if (!task)
>  		return -ESRCH;
> -	if (oom_adjust < task->oomkilladj && !capable(CAP_SYS_RESOURCE)) {
> +	task_lock(task);
> +	if (!task->mm) {
> +		task_unlock(task);
> +		put_task_struct(task);
> +		return -EINVAL;
> +	}
> +	if (oom_adjust < task->mm->oom_adj && !capable(CAP_SYS_RESOURCE)) {
> +		task_unlock(task);
>  		put_task_struct(task);
>  		return -EACCES;
>  	}
> -	task->oomkilladj = oom_adjust;
> +	task->mm->oom_adj = oom_adjust;
> +	task_unlock(task);
>  	put_task_struct(task);
>  	if (end - buffer == 0)
>  		return -EIO;
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -232,6 +232,8 @@ struct mm_struct {
>  
>  	unsigned long saved_auxv[AT_VECTOR_SIZE]; /* for /proc/PID/auxv */
>  
> +	s8 oom_adj;	/* OOM kill score adjustment (bit shift) */
> +
>  	cpumask_t cpu_vm_mask;
>  
>  	/* Architecture-specific MM context */
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1149,7 +1149,6 @@ struct task_struct {
>  	 * a short time
>  	 */
>  	unsigned char fpu_counter;
> -	s8 oomkilladj; /* OOM kill score adjustment (bit shift). */
>  #ifdef CONFIG_BLK_DEV_IO_TRACE
>  	unsigned int btrace_seq;
>  #endif
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -58,6 +58,7 @@ unsigned long badness(struct task_struct *p, unsigned long uptime)
>  	unsigned long points, cpu_time, run_time;
>  	struct mm_struct *mm;
>  	struct task_struct *child;
> +	int oom_adj;
>  
>  	task_lock(p);
>  	mm = p->mm;
> @@ -65,6 +66,7 @@ unsigned long badness(struct task_struct *p, unsigned long uptime)
>  		task_unlock(p);
>  		return 0;
>  	}
> +	oom_adj = mm->oom_adj;
>  
>  	/*
>  	 * The memory size of the process is the basis for the badness.
> @@ -148,15 +150,15 @@ unsigned long badness(struct task_struct *p, unsigned long uptime)
>  		points /= 8;
>  
>  	/*
> -	 * Adjust the score by oomkilladj.
> +	 * Adjust the score by oom_adj.
>  	 */
> -	if (p->oomkilladj) {
> -		if (p->oomkilladj > 0) {
> +	if (oom_adj) {
> +		if (oom_adj > 0) {
>  			if (!points)
>  				points = 1;
> -			points <<= p->oomkilladj;
> +			points <<= oom_adj;
>  		} else
> -			points >>= -(p->oomkilladj);
> +			points >>= -(oom_adj);
>  	}
>  
>  #ifdef DEBUG
> @@ -251,8 +253,10 @@ static struct task_struct *select_bad_process(unsigned long *ppoints,
>  			*ppoints = ULONG_MAX;
>  		}
>  
> -		if (p->oomkilladj == OOM_DISABLE)
> +		task_lock(p);
> +		if (p->mm && p->mm->oom_adj == OOM_DISABLE)
>  			continue;
> +		task_unlock(p);
>  
>  		points = badness(p, uptime.tv_sec);
>  		if (points > *ppoints || !chosen) {
> @@ -304,8 +308,7 @@ static void dump_tasks(const struct mem_cgroup *mem)
>  		}
>  		printk(KERN_INFO "[%5d] %5d %5d %8lu %8lu %3d     %3d %s\n",
>  		       p->pid, __task_cred(p)->uid, p->tgid, mm->total_vm,
> -		       get_mm_rss(mm), (int)task_cpu(p), p->oomkilladj,
> -		       p->comm);
> +		       get_mm_rss(mm), (int)task_cpu(p), mm->oom_adj, p->comm);
>  		task_unlock(p);
>  	} while_each_thread(g, p);
>  }
> @@ -367,8 +370,12 @@ static int oom_kill_task(struct task_struct *p)
>  	 * Don't kill the process if any threads are set to OOM_DISABLE
>  	 */
>  	do_each_thread(g, q) {
> -		if (q->mm == mm && q->oomkilladj == OOM_DISABLE)
> +		task_lock(q);
> +		if (q->mm == mm && q->mm && q->mm->oom_adj == OOM_DISABLE) {
> +			task_unlock(q);
>  			return 1;
> +		}
> +		task_unlock(q);
>  	} while_each_thread(g, q);
>  
>  	__oom_kill_task(p, 1);
> @@ -393,10 +400,11 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
>  	struct task_struct *c;
>  
>  	if (printk_ratelimit()) {
> -		printk(KERN_WARNING "%s invoked oom-killer: "
> -			"gfp_mask=0x%x, order=%d, oomkilladj=%d\n",
> -			current->comm, gfp_mask, order, current->oomkilladj);
>  		task_lock(current);
> +		printk(KERN_WARNING "%s invoked oom-killer: "
> +			"gfp_mask=0x%x, order=%d, oom_adj=%d\n",
> +			current->comm, gfp_mask, order,
> +			current->mm ? current->mm->oom_adj : OOM_DISABLE);
>  		cpuset_print_task_mems_allowed(current);
>  		task_unlock(current);
>  		dump_stack();
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/




^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [patch 08/11 -mmotm] oom: invoke oom killer for __GFP_NOFAIL
  2009-05-10 23:59   ` KOSAKI Motohiro
@ 2009-05-11  0:24     ` David Rientjes
  2009-05-11  1:45       ` KOSAKI Motohiro
  0 siblings, 1 reply; 95+ messages in thread
From: David Rientjes @ 2009-05-11  0:24 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Andrew Morton, Greg Kroah-Hartman, Nick Piggin, Mel Gorman,
	Peter Zijlstra, Christoph Lameter, Dave Hansen, San Mehat,
	Arve Hjonnevag, linux-kernel

On Mon, 11 May 2009, KOSAKI Motohiro wrote:

> > The oom killer must be invoked regardless of the order if the allocation
> > is __GFP_NOFAIL, otherwise it will loop forever when reclaim fails to
> > free some memory.
> 
> This is intensional behavior. plus you change very widely caller bahavior.
> if you don't have good test program, I nak this.
> 

What exactly are you objecting to?  You don't want the oom killer called 
for a __GFP_NOFAIL allocation above PAGE_ALLOC_COSTLY_ORDER that could not 
reclaim any memory and would prefer that it loop endlessly in the page 
allocator?  What's the purpose of that if the oom killer could free a very 
large memory hogging task?

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [patch 06/11 -mmotm] oom: move oom_adj value from task_struct to mm_struct
  2009-05-11  0:17   ` KOSAKI Motohiro
@ 2009-05-11  0:26     ` David Rientjes
  2009-05-11  1:47       ` KOSAKI Motohiro
  0 siblings, 1 reply; 95+ messages in thread
From: David Rientjes @ 2009-05-11  0:26 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Andrew Morton, Greg Kroah-Hartman, Nick Piggin, Mel Gorman,
	Peter Ziljstra, Christoph Lameter, Dave Hansen, San Mehat,
	Arve Hjonnevag, linux-kernel

On Mon, 11 May 2009, KOSAKI Motohiro wrote:

> 
> > The per-task oom_adj value is a characteristic of its mm more than the
> > task itself since it's not possible to oom kill any thread that shares
> > the mm.  If a task were to be killed while attached to an mm that could
> > not be freed because another thread were set to OOM_DISABLE, it would
> > have needlessly been terminated since there is no potential for future
> > memory freeing.
> > 
> > This patch moves oomkilladj (now more appropriately named oom_adj) from
> > struct task_struct to struct mm_struct.  This requires task_lock() on a
> > task to check its oom_adj value to protect against exec, but it's already
> > necessary to take the lock when dereferencing the mm to find the total VM
> > size for the badness heuristic.
> > 
> > This fixes a livelock if the oom killer chooses a task and another thread
> > sharing the same memory has an oom_adj value of OOM_DISABLE.  This occurs
> > because oom_kill_task() repeatedly returns 1 and refuses to kill the
> > chosen task while select_bad_process() will repeatedly choose the same
> > task during the next retry.
> > 
> > Taking task_lock() in select_bad_process() to check for OOM_DISABLE and
> > in oom_kill_task() to check for threads sharing the same memory will be
> > removed in the next patch in this series where it will no longer be
> > necessary.
> > 
> > Writing to /proc/pid/oom_adj for a kthread will now return -EINVAL since
> > these threads are immune from oom killing already.  They simply report an
> > oom_adj value of OOM_DISABLE.
> 
> David, staging driver shold be standalone.
> Please don't change core kernel for staging.
> 
> if you merge it into mainline, I suggest to you help the driver graduate
> staging.
> 

Umm, all my patches apply fine against mainline since the Android 
lowmemorykiller is already in Linus' tree.  This patch moves oomkilladj 
from struct task_struct to struct mm_struct and since that's used for 
procfs, the oom killer, and the Android lowmemorykiller, I need to modify 
all of them so the kernel will compile with 
CONFIG_ANDROID_LOW_MEMORY_KILLER.

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [patch 08/11 -mmotm] oom: invoke oom killer for __GFP_NOFAIL
  2009-05-11  0:24     ` David Rientjes
@ 2009-05-11  1:45       ` KOSAKI Motohiro
  2009-05-11  7:40         ` Minchan Kim
  2009-05-11  8:45         ` David Rientjes
  0 siblings, 2 replies; 95+ messages in thread
From: KOSAKI Motohiro @ 2009-05-11  1:45 UTC (permalink / raw)
  To: David Rientjes
  Cc: kosaki.motohiro, Andrew Morton, Greg Kroah-Hartman, Nick Piggin,
	Mel Gorman, Peter Zijlstra, Christoph Lameter, Dave Hansen,
	San Mehat, Arve Hjonnevag, linux-kernel

> On Mon, 11 May 2009, KOSAKI Motohiro wrote:
> 
> > > The oom killer must be invoked regardless of the order if the allocation
> > > is __GFP_NOFAIL, otherwise it will loop forever when reclaim fails to
> > > free some memory.
> > 
> > This is intensional behavior. plus you change very widely caller bahavior.
> > if you don't have good test program, I nak this.
> > 
> 
> What exactly are you objecting to?  You don't want the oom killer called 
> for a __GFP_NOFAIL allocation above PAGE_ALLOC_COSTLY_ORDER that could not 
> reclaim any memory and would prefer that it loop endlessly in the page 
> allocator?  What's the purpose of that if the oom killer could free a very 
> large memory hogging task?

My point is, if we change gfp-flags meaning, we should change
unintentional affected caller.

Do you oppose this?




^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [patch 06/11 -mmotm] oom: move oom_adj value from task_struct to mm_struct
  2009-05-11  0:26     ` David Rientjes
@ 2009-05-11  1:47       ` KOSAKI Motohiro
  2009-05-11  8:43         ` David Rientjes
  0 siblings, 1 reply; 95+ messages in thread
From: KOSAKI Motohiro @ 2009-05-11  1:47 UTC (permalink / raw)
  To: David Rientjes
  Cc: kosaki.motohiro, Andrew Morton, Greg Kroah-Hartman, Nick Piggin,
	Mel Gorman, Peter Ziljstra, Christoph Lameter, Dave Hansen,
	San Mehat, Arve Hjonnevag, linux-kernel

> > David, staging driver shold be standalone.
> > Please don't change core kernel for staging.
> > 
> > if you merge it into mainline, I suggest to you help the driver graduate
> > staging.
> > 
> 
> Umm, all my patches apply fine against mainline since the Android 
> lowmemorykiller is already in Linus' tree.  This patch moves oomkilladj 
> from struct task_struct to struct mm_struct and since that's used for 
> procfs, the oom killer, and the Android lowmemorykiller, I need to modify 
> all of them so the kernel will compile with 
> CONFIG_ANDROID_LOW_MEMORY_KILLER.

>From technical view, your explanation seems good although I haven't
review your patch.
but this patch break staging driver rule. I don't like it.

IOW, if greg pull this patch into his staging tree, it might make
patch conflict against -mm tree in nearly future.




^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [patch 08/11 -mmotm] oom: invoke oom killer for __GFP_NOFAIL
  2009-05-11  1:45       ` KOSAKI Motohiro
@ 2009-05-11  7:40         ` Minchan Kim
  2009-05-11  8:49           ` David Rientjes
  2009-05-11  8:45         ` David Rientjes
  1 sibling, 1 reply; 95+ messages in thread
From: Minchan Kim @ 2009-05-11  7:40 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: David Rientjes, Andrew Morton, Greg Kroah-Hartman, Nick Piggin,
	Mel Gorman, Peter Zijlstra, Christoph Lameter, Dave Hansen,
	San Mehat, Arve Hjonnevag, linux-kernel

On Mon, 11 May 2009 10:45:05 +0900 (JST)
KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote:

> > On Mon, 11 May 2009, KOSAKI Motohiro wrote:
> > 
> > > > The oom killer must be invoked regardless of the order if the allocation
> > > > is __GFP_NOFAIL, otherwise it will loop forever when reclaim fails to
> > > > free some memory.
> > > 
> > > This is intensional behavior. plus you change very widely caller bahavior.
> > > if you don't have good test program, I nak this.
> > > 
> > 
> > What exactly are you objecting to?  You don't want the oom killer called 
> > for a __GFP_NOFAIL allocation above PAGE_ALLOC_COSTLY_ORDER that could not 
> > reclaim any memory and would prefer that it loop endlessly in the page 
> > allocator?  What's the purpose of that if the oom killer could free a very 
> > large memory hogging task?
> 
> My point is, if we change gfp-flags meaning, we should change
> unintentional affected caller.
> 
> Do you oppose this?
>

I agree KOSAKI's opinion. 
We already have a different flags. 

 * __GFP_REPEAT: Try hard to allocate the memory, but the allocation attempt
 * _might_ fail.  This depends upon the particular VM implementation.
 *
 * __GFP_NOFAIL: The VM implementation _must_ retry infinitely: the caller
 * cannot handle allocation failures.

When we use __GFP_NOFAIL, we always have to use it carefully.
If you change the meaning of __GFP_NOFAIL, the intension of them who have been used it carefully  may be lost. It's my concern.


-- 
Kinds Regards
Minchan Kim

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [patch 06/11 -mmotm] oom: move oom_adj value from task_struct to mm_struct
  2009-05-11  1:47       ` KOSAKI Motohiro
@ 2009-05-11  8:43         ` David Rientjes
  2009-05-11 21:19           ` Andrew Morton
  0 siblings, 1 reply; 95+ messages in thread
From: David Rientjes @ 2009-05-11  8:43 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Andrew Morton, Greg Kroah-Hartman, Nick Piggin, Mel Gorman,
	Peter Zijlstra, Christoph Lameter, Dave Hansen, San Mehat,
	Arve Hjonnevag, linux-kernel

On Mon, 11 May 2009, KOSAKI Motohiro wrote:

> From technical view, your explanation seems good although I haven't
> review your patch.
> but this patch break staging driver rule. I don't like it.
> 
> IOW, if greg pull this patch into his staging tree, it might make
> patch conflict against -mm tree in nearly future.
> 

If I am going to break code that is in mainline (like the Android low 
memory killer is) by moving a member of a data structure, then I'd like to 
provide a fix for it so that it works after my change.  I'll leave the 
merging to Andrew and Greg and if they'd prefer I do this patch in two 
parts, that's fine by me as long as they merge them at the same time.  I 
thought since this code is mainline that it would be easier to go through 
in one series.

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [patch 08/11 -mmotm] oom: invoke oom killer for __GFP_NOFAIL
  2009-05-11  1:45       ` KOSAKI Motohiro
  2009-05-11  7:40         ` Minchan Kim
@ 2009-05-11  8:45         ` David Rientjes
  2009-05-11 16:03           ` Dave Hansen
  1 sibling, 1 reply; 95+ messages in thread
From: David Rientjes @ 2009-05-11  8:45 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Andrew Morton, Greg Kroah-Hartman, Nick Piggin, Mel Gorman,
	Peter Zijlstra, Christoph Lameter, Dave Hansen, San Mehat,
	Arve Hjonnevag, linux-kernel

On Mon, 11 May 2009, KOSAKI Motohiro wrote:

> > What exactly are you objecting to?  You don't want the oom killer called 
> > for a __GFP_NOFAIL allocation above PAGE_ALLOC_COSTLY_ORDER that could not 
> > reclaim any memory and would prefer that it loop endlessly in the page 
> > allocator?  What's the purpose of that if the oom killer could free a very 
> > large memory hogging task?
> 
> My point is, if we change gfp-flags meaning, we should change
> unintentional affected caller.
> 
> Do you oppose this?
> 

include/linux/gfp.h states this:

 * __GFP_NOFAIL: The VM implementation _must_ retry infinitely: the caller
 * cannot handle allocation failures.

That is the only desciption given to users of __GFP_NOFAIL, so they should 
be able to trust it.  The fact is that in mmotm it's possible for such an 
allocation to fail without even attempting to free some memory via the oom 
killer (and I disagree that killing a large memory hogging task will not 
allow large allocations such as those greater than PAGE_ALLOC_COSTLY_ORDER 
to succeed, which is a question of fragmentation and not purely VM size).

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [patch 08/11 -mmotm] oom: invoke oom killer for __GFP_NOFAIL
  2009-05-11  7:40         ` Minchan Kim
@ 2009-05-11  8:49           ` David Rientjes
  2009-05-11 11:23             ` Minchan Kim
  0 siblings, 1 reply; 95+ messages in thread
From: David Rientjes @ 2009-05-11  8:49 UTC (permalink / raw)
  To: Minchan Kim
  Cc: KOSAKI Motohiro, Andrew Morton, Greg Kroah-Hartman, Nick Piggin,
	Mel Gorman, Peter Zijlstra, Christoph Lameter, Dave Hansen,
	San Mehat, Arve Hjonnevag, linux-kernel

On Mon, 11 May 2009, Minchan Kim wrote:

> I agree KOSAKI's opinion. 
> We already have a different flags. 
> 
>  * __GFP_REPEAT: Try hard to allocate the memory, but the allocation attempt
>  * _might_ fail.  This depends upon the particular VM implementation.
>  *
>  * __GFP_NOFAIL: The VM implementation _must_ retry infinitely: the caller
>  * cannot handle allocation failures.
> 
> When we use __GFP_NOFAIL, we always have to use it carefully.
> If you change the meaning of __GFP_NOFAIL, the intension of them who have been used it carefully  may be lost. It's my concern.
> 

You pointed out yourself that __GFP_NOFAIL allocations can fail by way of 
having alloc_pages() return NULL even without attempting to free memory by 
the oom killer for order > PAGE_ALLOC_COSTLY_ORDER.  The definition you 
posted above is unambiguous to me, it means we must retry infinitely.  And 
that's very stupid if we are going to neglect to free memory by killing a 
task and relying solely on reclaim which may not make any progress.

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [patch 08/11 -mmotm] oom: invoke oom killer for __GFP_NOFAIL
  2009-05-11  8:49           ` David Rientjes
@ 2009-05-11 11:23             ` Minchan Kim
  0 siblings, 0 replies; 95+ messages in thread
From: Minchan Kim @ 2009-05-11 11:23 UTC (permalink / raw)
  To: David Rientjes
  Cc: KOSAKI Motohiro, Andrew Morton, Greg Kroah-Hartman, Nick Piggin,
	Mel Gorman, Peter Zijlstra, Christoph Lameter, Dave Hansen,
	San Mehat, Arve Hjonnevag, linux-kernel

On Mon, May 11, 2009 at 5:49 PM, David Rientjes <rientjes@google.com> wrote:
> On Mon, 11 May 2009, Minchan Kim wrote:
>
>> I agree KOSAKI's opinion.
>> We already have a different flags.
>>
>>  * __GFP_REPEAT: Try hard to allocate the memory, but the allocation attempt
>>  * _might_ fail.  This depends upon the particular VM implementation.
>>  *
>>  * __GFP_NOFAIL: The VM implementation _must_ retry infinitely: the caller
>>  * cannot handle allocation failures.
>>
>> When we use __GFP_NOFAIL, we always have to use it carefully.
>> If you change the meaning of __GFP_NOFAIL, the intension of them who have been used it carefully  may be lost. It's my concern.
>>
>
> You pointed out yourself that __GFP_NOFAIL allocations can fail by way of
> having alloc_pages() return NULL even without attempting to free memory by
> the oom killer for order > PAGE_ALLOC_COSTLY_ORDER.  The definition you
> posted above is unambiguous to me, it means we must retry infinitely.  And
> that's very stupid if we are going to neglect to free memory by killing a
> task and relying solely on reclaim which may not make any progress.
>

Sorry for confusing. Ignore me, please.
I misunderstood your description as I said previous post. :)

-- 
Kinds regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [patch 08/11 -mmotm] oom: invoke oom killer for __GFP_NOFAIL
  2009-05-11  8:45         ` David Rientjes
@ 2009-05-11 16:03           ` Dave Hansen
  2009-05-11 19:09             ` David Rientjes
  0 siblings, 1 reply; 95+ messages in thread
From: Dave Hansen @ 2009-05-11 16:03 UTC (permalink / raw)
  To: David Rientjes
  Cc: KOSAKI Motohiro, Andrew Morton, Greg Kroah-Hartman, Nick Piggin,
	Mel Gorman, Peter Zijlstra, Christoph Lameter, San Mehat,
	Arve Hjonnevag, linux-kernel

On Mon, 2009-05-11 at 01:45 -0700, David Rientjes wrote:
> On Mon, 11 May 2009, KOSAKI Motohiro wrote:
> include/linux/gfp.h states this:
> 
>  * __GFP_NOFAIL: The VM implementation _must_ retry infinitely: the caller
>  * cannot handle allocation failures.
> 
> That is the only desciption given to users of __GFP_NOFAIL, so they should 
> be able to trust it.  The fact is that in mmotm it's possible for such an 
> allocation to fail without even attempting to free some memory via the oom 
> killer (and I disagree that killing a large memory hogging task will not 
> allow large allocations such as those greater than PAGE_ALLOC_COSTLY_ORDER 
> to succeed, which is a question of fragmentation and not purely VM size).

I assume that you've actually seen this behavior where OOM-killing a
task will free enough memory to allow a higher-order allocation to
succeed.

Could you explain a little more about why you think this scenario works
for you?  Are large contiguous areas of memory pinned by the task
getting which you want to get killed?  Why wasn't swapping effective
against this task?  Was the task itself taking up a large portion of
total memory?

-- Dave


^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [patch 08/11 -mmotm] oom: invoke oom killer for __GFP_NOFAIL
  2009-05-11 16:03           ` Dave Hansen
@ 2009-05-11 19:09             ` David Rientjes
  2009-05-11 19:45               ` Dave Hansen
  0 siblings, 1 reply; 95+ messages in thread
From: David Rientjes @ 2009-05-11 19:09 UTC (permalink / raw)
  To: Dave Hansen
  Cc: KOSAKI Motohiro, Andrew Morton, Greg Kroah-Hartman, Nick Piggin,
	Mel Gorman, Peter Zijlstra, Christoph Lameter, San Mehat,
	Arve Hjonnevag, linux-kernel

On Mon, 11 May 2009, Dave Hansen wrote:

> Could you explain a little more about why you think this scenario works
> for you?  Are large contiguous areas of memory pinned by the task
> getting which you want to get killed?  Why wasn't swapping effective
> against this task?  Was the task itself taking up a large portion of
> total memory?
> 

We frequently do cpuset-constrained oom kills where the lionshare of 
memory on a set of nodes is allocated by a single task or a group of 
threads all sharing the same memory.  Swapping is largely effective but at 
this point in the code path it's obviously not making any progress in 
freeing pages.  So this change fixes two issues:

 - __GFP_NOFAIL allocations should not be allowed to return NULL, and

 - we should prevent looping endlessly in the page allocator if reclaim
   cannot free the requisite amount of memory.

There is no reason that the oom killer would not be able to kill a task 
that could free 64K of contiguous memory, especially for those that 
mlock() their memory.  You could argue that any __GFP_NOFAIL allocation 
above order 3 is insane and should not kill tasks, but that's an issue 
higher up the stack.  If you'd like to identify such instances, we could 
emit a warning message here and a stack trace.

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [patch 08/11 -mmotm] oom: invoke oom killer for __GFP_NOFAIL
  2009-05-11 19:09             ` David Rientjes
@ 2009-05-11 19:45               ` Dave Hansen
  2009-05-11 20:21                 ` David Rientjes
  0 siblings, 1 reply; 95+ messages in thread
From: Dave Hansen @ 2009-05-11 19:45 UTC (permalink / raw)
  To: David Rientjes
  Cc: KOSAKI Motohiro, Andrew Morton, Greg Kroah-Hartman, Nick Piggin,
	Mel Gorman, Peter Zijlstra, Christoph Lameter, San Mehat,
	Arve Hjonnevag, linux-kernel

On Mon, 2009-05-11 at 12:09 -0700, David Rientjes wrote:
> On Mon, 11 May 2009, Dave Hansen wrote:
> > Could you explain a little more about why you think this scenario works
> > for you?  Are large contiguous areas of memory pinned by the task
> > getting which you want to get killed?  Why wasn't swapping effective
> > against this task?  Was the task itself taking up a large portion of
> > total memory?
> 
> We frequently do cpuset-constrained oom kills where the lionshare of 
> memory on a set of nodes is allocated by a single task or a group of 
> threads all sharing the same memory.  Swapping is largely effective but at 
> this point in the code path it's obviously not making any progress in 
> freeing pages.

So, there are three conditions that have to be met before we get to this
behavior:
1. order > PAGE_ALLOC_COSTLY_ORDER
2. __GFP_NOFAIL set
3. no progress being made

The two hunks of this patch really do different things to me.
Personally, I'd split them up.  

This patch's first hunk effectively says, "When we have __GFP_NOFAIL
set, it is now OK to OOM tasks for 'order > PAGE_ALLOC_COSTLY_ORDER'".
That's a little bit goofy to me.  Either OOMs help large-order allocs or
they don't.  __GFP_NOFAIL doesn't change how effective an OOM is.

> So this change fixes two issues:
> 
>  - __GFP_NOFAIL allocations should not be allowed to return NULL, and
> 
>  - we should prevent looping endlessly in the page allocator if reclaim
>    cannot free the requisite amount of memory.
> 
> There is no reason that the oom killer would not be able to kill a task 
> that could free 64K of contiguous memory, especially for those that 
> mlock() their memory.  You could argue that any __GFP_NOFAIL allocation 
> above order 3 is insane and should not kill tasks, but that's an issue 
> higher up the stack.  If you'd like to identify such instances, we could 
> emit a warning message here and a stack trace.

Yeah, adding a new warning or enhancing the existing "page allocation
failure" warning would be nice.

Aren't you a bit worried that people will keep adding new 'goto nopage'
cases in here?  Would it be more effective to just put a:

        if (gfp_mask & __GFP_NOFAIL)
                goto restart;

at the very end of the function?  That should keep people from screwing
this up in the future a bit better.

Thanks for going to the trouble of trying to sort all this code out a
bit better.  I know it is a mess.  Could you try and beef up the
descriptions a bit in the next pass?  Some of the stuff about where
you're encountering these situations would be really helpful, especially
with 'git blame' these days.  

-- Dave

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [patch 08/11 -mmotm] oom: invoke oom killer for __GFP_NOFAIL
  2009-05-11 19:45               ` Dave Hansen
@ 2009-05-11 20:21                 ` David Rientjes
  0 siblings, 0 replies; 95+ messages in thread
From: David Rientjes @ 2009-05-11 20:21 UTC (permalink / raw)
  To: Dave Hansen
  Cc: KOSAKI Motohiro, Andrew Morton, Greg Kroah-Hartman, Nick Piggin,
	Mel Gorman, Peter Zijlstra, Christoph Lameter, San Mehat,
	Arve Hjonnevag, linux-kernel

On Mon, 11 May 2009, Dave Hansen wrote:

> So, there are three conditions that have to be met before we get to this
> behavior:
> 1. order > PAGE_ALLOC_COSTLY_ORDER
> 2. __GFP_NOFAIL set
> 3. no progress being made
> 
> The two hunks of this patch really do different things to me.
> Personally, I'd split them up.  
> 
> This patch's first hunk effectively says, "When we have __GFP_NOFAIL
> set, it is now OK to OOM tasks for 'order > PAGE_ALLOC_COSTLY_ORDER'".
> That's a little bit goofy to me.  Either OOMs help large-order allocs or
> they don't.  __GFP_NOFAIL doesn't change how effective an OOM is.
> 

Right, but the problem is that we don't know what the oom killer did 
(until you get to patches 10 and 11 in this series).  It would be easier 
to use the __GFP_REPEAT retry logic to determine whether to continue 
looping, but we currently bypass that because reclaim failed and we don't 
know what the oom killer did.  My later patches change that, so 
__GFP_REPEAT actually works in this case.

> Yeah, adding a new warning or enhancing the existing "page allocation
> failure" warning would be nice.
> 

Ok, it's a good idea that I never considered.

> Aren't you a bit worried that people will keep adding new 'goto nopage'
> cases in here?  Would it be more effective to just put a:
> 
>         if (gfp_mask & __GFP_NOFAIL)
>                 goto restart;
> 
> at the very end of the function?  That should keep people from screwing
> this up in the future a bit better.
> 

goto nopage is the only thing we have to worry about since the retry logic 
does get invoked otherwise (and any order less than 
PAGE_ALLOC_COSTLY_ORDER is actually implicitly __GFP_NOFAIL there, 
regardless of reclaim progress).  We do actually need to bypass that logic 
for a couple of cases (GFP_THISNODE and GFP_ATOMIC) where reclaim is 
inherently denied.  We lack BUG_ON()'s for such cases undoubtedly because 
of the performance impact in the very hot path.

> Thanks for going to the trouble of trying to sort all this code out a
> bit better.  I know it is a mess.  Could you try and beef up the
> descriptions a bit in the next pass?  Some of the stuff about where
> you're encountering these situations would be really helpful, especially
> with 'git blame' these days.  
> 

Ok, thanks for the comments.

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [patch 05/11 -mmotm] oom: fix possible oom_dump_tasks NULL pointer
  2009-05-10 22:07 ` [patch 05/11 -mmotm] oom: fix possible oom_dump_tasks " David Rientjes
@ 2009-05-11 21:11   ` Andrew Morton
  2009-05-11 21:28     ` David Rientjes
  2009-05-12  9:38   ` Mel Gorman
  1 sibling, 1 reply; 95+ messages in thread
From: Andrew Morton @ 2009-05-11 21:11 UTC (permalink / raw)
  To: David Rientjes
  Cc: gregkh, npiggin, mel, a.p.ziljstra, cl, dave, san, arve, linux-kernel

On Sun, 10 May 2009 15:07:16 -0700 (PDT)
David Rientjes <rientjes@google.com> wrote:

> When /proc/sys/vm/oom_dump_tasks is enabled, it is possible to get a NULL
> pointer for tasks that have detached mm's since task_lock() is not held
> during the tasklist scan.

ok.  But a better changelog would have told us how the patch fixes the
problem?


This patch series is partially core MM and partially drivers/staging. 
I don't normally handle staging things, so I'd need to be cherrypicking
from this lot.  Is there any interdependency between the two things?



^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [patch 06/11 -mmotm] oom: move oom_adj value from task_struct to mm_struct
  2009-05-11  8:43         ` David Rientjes
@ 2009-05-11 21:19           ` Andrew Morton
  0 siblings, 0 replies; 95+ messages in thread
From: Andrew Morton @ 2009-05-11 21:19 UTC (permalink / raw)
  To: David Rientjes
  Cc: kosaki.motohiro, gregkh, npiggin, mel, a.p.zijlstra, cl, dave,
	san, arve, linux-kernel

On Mon, 11 May 2009 01:43:10 -0700 (PDT)
David Rientjes <rientjes@google.com> wrote:

> On Mon, 11 May 2009, KOSAKI Motohiro wrote:
> 
> > From technical view, your explanation seems good although I haven't
> > review your patch.
> > but this patch break staging driver rule. I don't like it.
> > 
> > IOW, if greg pull this patch into his staging tree, it might make
> > patch conflict against -mm tree in nearly future.
> > 
> 
> If I am going to break code that is in mainline (like the Android low 
> memory killer is) by moving a member of a data structure, then I'd like to 
> provide a fix for it so that it works after my change.  I'll leave the 
> merging to Andrew and Greg and if they'd prefer I do this patch in two 
> parts, that's fine by me as long as they merge them at the same time.  I 
> thought since this code is mainline that it would be easier to go through 
> in one series.

Yep, this all has to be done as a single commit to avoid creating
points in the commit sequence where the kernel doesn't build.

As I'm cherrypicking my way around the android driver changes I
reworked this patch to apply on plain old mainline.


^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [patch 07/11 -mmotm] oom: prevent possible OOM_DISABLE livelock
  2009-05-10 22:07 ` [patch 07/11 -mmotm] oom: prevent possible OOM_DISABLE livelock David Rientjes
@ 2009-05-11 21:22   ` Andrew Morton
  0 siblings, 0 replies; 95+ messages in thread
From: Andrew Morton @ 2009-05-11 21:22 UTC (permalink / raw)
  To: David Rientjes
  Cc: gregkh, npiggin, mel, a.p.ziljstra, cl, dave, san, arve, linux-kernel

On Sun, 10 May 2009 15:07:19 -0700 (PDT)
David Rientjes <rientjes@google.com> wrote:
>

The changelog perplexes me.

> This moves the check for OOM_DISABLE to the badness heuristic while
> holding task_lock().

why?

>  Badness scores of 0 are now explicitly prohibited
> from being oom killed

why?

> and since the oom_adj value is a characteristic of
> an mm and not a task, it is no longer necessary to check the oom_adj
> value for threads sharing the same memory (except when simply issuing
> SIGKILLs for threads in other thread groups).

ok...


The title says that this patch fixes a livelock.  But the changelog
didn't describe that livelock and didn't tell us how the patch fixed
it.

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [patch 05/11 -mmotm] oom: fix possible oom_dump_tasks NULL pointer
  2009-05-11 21:11   ` Andrew Morton
@ 2009-05-11 21:28     ` David Rientjes
  2009-05-11 21:41       ` Greg KH
  0 siblings, 1 reply; 95+ messages in thread
From: David Rientjes @ 2009-05-11 21:28 UTC (permalink / raw)
  To: Andrew Morton
  Cc: gregkh, npiggin, mel, a.p.ziljstra, cl, dave, san, arve, linux-kernel

On Mon, 11 May 2009, Andrew Morton wrote:

> On Sun, 10 May 2009 15:07:16 -0700 (PDT)
> David Rientjes <rientjes@google.com> wrote:
> 
> > When /proc/sys/vm/oom_dump_tasks is enabled, it is possible to get a NULL
> > pointer for tasks that have detached mm's since task_lock() is not held
> > during the tasklist scan.
> 
> ok.  But a better changelog would have told us how the patch fixes the
> problem?
> 

Sure, I suppose "Add the task_lock()." could follow it.

> This patch series is partially core MM and partially drivers/staging. 
> I don't normally handle staging things, so I'd need to be cherrypicking
> from this lot.  Is there any interdependency between the two things?
> 

This was a source of confusion from me when I posted a smaller set earlier 
that made the same changes to the staging driver, and the only reason I 
sent the change here was because drivers/staging/android/lowmemorykiller.c 
is in Linus' tree.

This patch series moves oomkilladj from struct task_struct to struct 
mm_struct where it more appropriately belongs.  The Android 
lowmemorykiller uses oomkilladj, so it will fail to compile in Linus' git 
if you pushed the patchset to him and the Android driver were enabled.

To prevent that breakage, they should probably either all go through one 
series so nothing ever breaks or Greg takes the staging patches and then 
pushes them when you push the other changes.  I thought the former would 
be easier for the maintainers (apparently not :).

I'd be fine with seperating the staging changes out, sending them to Greg, 
and then just assuming that it'll all work out in the end.

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [patch 08/11 -mmotm] oom: invoke oom killer for __GFP_NOFAIL
  2009-05-10 22:07 ` [patch 08/11 -mmotm] oom: invoke oom killer for __GFP_NOFAIL David Rientjes
  2009-05-10 23:59   ` KOSAKI Motohiro
@ 2009-05-11 21:29   ` Andrew Morton
  2009-05-11 21:45     ` David Rientjes
  1 sibling, 1 reply; 95+ messages in thread
From: Andrew Morton @ 2009-05-11 21:29 UTC (permalink / raw)
  To: David Rientjes
  Cc: gregkh, npiggin, mel, a.p.ziljstra, cl, dave, san, arve, linux-kernel

On Sun, 10 May 2009 15:07:21 -0700 (PDT)
David Rientjes <rientjes@google.com> wrote:

> The oom killer must be invoked regardless of the order if the allocation
> is __GFP_NOFAIL, otherwise it will loop forever when reclaim fails to
> free some memory.

We should discourage callers from using __GFP_NOFAIL at all.  We should
electrocute callers for using __GFP_NOFAIL on large allocations.  How's about

	WARN_ON_ONCE(order > PAGE_ALLOC_COSTLY_ORDER &&	
			(gfp_mask & __GFP_NOFAIL));
or, preferably:

	WARN_ON_ONCE(order > 0 && (gfp_mask & __GFP_NOFAIL));

?

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [patch 09/11 -mmotm] oom: return vm size of oom killed task
  2009-05-10 22:07 ` [patch 09/11 -mmotm] oom: return vm size of oom killed task David Rientjes
@ 2009-05-11 21:36   ` Andrew Morton
  0 siblings, 0 replies; 95+ messages in thread
From: Andrew Morton @ 2009-05-11 21:36 UTC (permalink / raw)
  To: David Rientjes
  Cc: gregkh, npiggin, mel, a.p.ziljstra, cl, dave, san, arve, linux-kernel

On Sun, 10 May 2009 15:07:23 -0700 (PDT)
David Rientjes <rientjes@google.com> wrote:

> It's not optimal to continuously loop in the page allocator if the oom
> killer fails to kill a task.

Ambiguous.  This means either "selected a task but failed to kill it"
or "failed to select any task to kill".

>  Thus, it's necessary to report how many
> pages may be freed when the task finally exits to determine if any
> progress has been made.

>From this I infer that you're referring to the first case.

Under which circumstances can this happen?  Should we just go on and
select another process?

> This also changes the TIF_MEMDIE exception in select_bad_process().  If
> a task is found with this thread flag set, yet it has already detached
> its memory, then an additional task is chosen since we are still out of
> memory.

hm, that was fairly dumb of us.

But what state is this task in?  Has it actually returned all its
memory to the page allocator, or is its memory in some limbo state?  If
so, what is that state and why did it happen?

> total_vm is used instead of the file and anon rss since this is what the
> badness scoring heuristic is based on and it may be possible to oom kill
> a task with no rss causing the page allocator to believe no progress has
> been made.

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [patch 10/11 -mmotm] oom: avoid oom kill if no interruptible tasks
  2009-05-10 22:07 ` [patch 10/11 -mmotm] oom: avoid oom kill if no interruptible tasks David Rientjes
@ 2009-05-11 21:39   ` Andrew Morton
  2009-05-11 23:08     ` David Rientjes
  0 siblings, 1 reply; 95+ messages in thread
From: Andrew Morton @ 2009-05-11 21:39 UTC (permalink / raw)
  To: David Rientjes
  Cc: gregkh, npiggin, mel, a.p.ziljstra, cl, dave, san, arve, linux-kernel

On Sun, 10 May 2009 15:07:25 -0700 (PDT)
David Rientjes <rientjes@google.com> wrote:

> If there are no interruptible system tasks other than kthreads,

It's unclear what the term "system task" means.  Just "task"?

> no task
> should be chosen for oom kill since they won't respond to the SIGKILL
> anyway.  Instead, we choose to simply fail page allocations if reclaim
> cannot free memory and hope for the best.

But plain old user processes enter and leave D state all the time.  The
task may well just be sitting there waiting for a disk read to
complete.  It will respond to the SIGKILL shortly after the IO
completes, so an appropriate action here is to just wait for that to
happen.


^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [patch 05/11 -mmotm] oom: fix possible oom_dump_tasks NULL pointer
  2009-05-11 21:28     ` David Rientjes
@ 2009-05-11 21:41       ` Greg KH
  2009-05-11 22:05         ` David Rientjes
  0 siblings, 1 reply; 95+ messages in thread
From: Greg KH @ 2009-05-11 21:41 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, npiggin, mel, a.p.ziljstra, cl, dave, san, arve,
	linux-kernel

On Mon, May 11, 2009 at 02:28:05PM -0700, David Rientjes wrote:
> On Mon, 11 May 2009, Andrew Morton wrote:
> 
> > On Sun, 10 May 2009 15:07:16 -0700 (PDT)
> > David Rientjes <rientjes@google.com> wrote:
> > 
> > > When /proc/sys/vm/oom_dump_tasks is enabled, it is possible to get a NULL
> > > pointer for tasks that have detached mm's since task_lock() is not held
> > > during the tasklist scan.
> > 
> > ok.  But a better changelog would have told us how the patch fixes the
> > problem?
> > 
> 
> Sure, I suppose "Add the task_lock()." could follow it.
> 
> > This patch series is partially core MM and partially drivers/staging. 
> > I don't normally handle staging things, so I'd need to be cherrypicking
> > from this lot.  Is there any interdependency between the two things?
> > 
> 
> This was a source of confusion from me when I posted a smaller set earlier 
> that made the same changes to the staging driver, and the only reason I 
> sent the change here was because drivers/staging/android/lowmemorykiller.c 
> is in Linus' tree.

I have that series in my "to-apply" queue to get through this week.  I
will handle them.

> This patch series moves oomkilladj from struct task_struct to struct 
> mm_struct where it more appropriately belongs.  The Android 
> lowmemorykiller uses oomkilladj, so it will fail to compile in Linus' git 
> if you pushed the patchset to him and the Android driver were enabled.
> 
> To prevent that breakage, they should probably either all go through one 
> series so nothing ever breaks or Greg takes the staging patches and then 
> pushes them when you push the other changes.  I thought the former would 
> be easier for the maintainers (apparently not :).

I think the original lowmemorykiller.c patches have no problem getting
into the tree.  But, it seems that people are still disagreeing with the
core oom changes you have made, so those will probably take a few more
review cycles in order to work themselves out.

I would _really_ like to see the end goal to get this lowmemorykiller
code out of the staging tree, and I think that is what your goal here is
as well.  If, at the end of your series that happens, I think that would
be very good.

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [patch 08/11 -mmotm] oom: invoke oom killer for __GFP_NOFAIL
  2009-05-11 21:29   ` Andrew Morton
@ 2009-05-11 21:45     ` David Rientjes
  2009-05-11 22:11       ` Andrew Morton
  0 siblings, 1 reply; 95+ messages in thread
From: David Rientjes @ 2009-05-11 21:45 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Greg Kroah-Hartman, npiggin, mel, a.p.zijlstra,
	Christoph Lameter, dave, san, arve, linux-kernel

On Mon, 11 May 2009, Andrew Morton wrote:

> > The oom killer must be invoked regardless of the order if the allocation
> > is __GFP_NOFAIL, otherwise it will loop forever when reclaim fails to
> > free some memory.
> 
> We should discourage callers from using __GFP_NOFAIL at all.  We should
> electrocute callers for using __GFP_NOFAIL on large allocations.  How's about
> 
> 	WARN_ON_ONCE(order > PAGE_ALLOC_COSTLY_ORDER &&	
> 			(gfp_mask & __GFP_NOFAIL));
> or, preferably:
> 
> 	WARN_ON_ONCE(order > 0 && (gfp_mask & __GFP_NOFAIL));
> 

Not sure it would help since the oom killer will be now be called for such 
an allocation and that dumps the stack (and will actually show the order 
and gfp flags as well).

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [patch 05/11 -mmotm] oom: fix possible oom_dump_tasks NULL pointer
  2009-05-11 21:41       ` Greg KH
@ 2009-05-11 22:05         ` David Rientjes
  0 siblings, 0 replies; 95+ messages in thread
From: David Rientjes @ 2009-05-11 22:05 UTC (permalink / raw)
  To: Greg KH
  Cc: Andrew Morton, npiggin, mel, a.p.zijlstra, Christoph Lameter,
	dave, san, arve, linux-kernel

On Mon, 11 May 2009, Greg KH wrote:

> > This was a source of confusion from me when I posted a smaller set earlier 
> > that made the same changes to the staging driver, and the only reason I 
> > sent the change here was because drivers/staging/android/lowmemorykiller.c 
> > is in Linus' tree.
> 
> I have that series in my "to-apply" queue to get through this week.  I
> will handle them.
> 

Great!

> I think the original lowmemorykiller.c patches have no problem getting
> into the tree.  But, it seems that people are still disagreeing with the
> core oom changes you have made, so those will probably take a few more
> review cycles in order to work themselves out.
> 

Andrew merged a fixed up version of 
oom-move-oom_adj-value-from-task_struct-to-mm_struct.patch to -mm so that 
it's not dependent on the earlier lowmemorykiller changes.  I assume the 
merge conflicts will be handled by whoever pushes their changes last.

> I would _really_ like to see the end goal to get this lowmemorykiller
> code out of the staging tree, and I think that is what your goal here is
> as well.

Not really, my only real change to it was making sure it would compile 
when I moved ->oomkilladj from task_struct to mm_struct.  In the course of 
that, I found a possible NULL pointer and coding style problems so I 
thought I'd fix them up as well.  Then Arve posted a couple of patches 
he'd been sitting on, so I added them early on to my series so they'd all 
apply cleanly.

> If, at the end of your series that happens, I think that would
> be very good.
> 

Arve, San, it seems like this driver is still being actively developed 
judging by the patches you sent me.  Any ideas?

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [patch 08/11 -mmotm] oom: invoke oom killer for __GFP_NOFAIL
  2009-05-11 21:45     ` David Rientjes
@ 2009-05-11 22:11       ` Andrew Morton
  2009-05-11 22:31         ` David Rientjes
                           ` (2 more replies)
  0 siblings, 3 replies; 95+ messages in thread
From: Andrew Morton @ 2009-05-11 22:11 UTC (permalink / raw)
  To: David Rientjes
  Cc: gregkh, npiggin, mel, a.p.zijlstra, cl, dave, san, arve, linux-kernel

On Mon, 11 May 2009 14:45:18 -0700 (PDT)
David Rientjes <rientjes@google.com> wrote:

> On Mon, 11 May 2009, Andrew Morton wrote:
> 
> > > The oom killer must be invoked regardless of the order if the allocation
> > > is __GFP_NOFAIL, otherwise it will loop forever when reclaim fails to
> > > free some memory.
> > 
> > We should discourage callers from using __GFP_NOFAIL at all.  We should
> > electrocute callers for using __GFP_NOFAIL on large allocations.  How's about
> > 
> > 	WARN_ON_ONCE(order > PAGE_ALLOC_COSTLY_ORDER &&	
> > 			(gfp_mask & __GFP_NOFAIL));
> > or, preferably:
> > 
> > 	WARN_ON_ONCE(order > 0 && (gfp_mask & __GFP_NOFAIL));
> > 
> 
> Not sure it would help since the oom killer will be now be called for such 
> an allocation and that dumps the stack (and will actually show the order 
> and gfp flags as well).

No, the intent of that warning is to find all call sites which use
__GFP_NOFAIL on order>0 so we can hunt down and eliminate them.


please review...

From: Andrew Morton <akpm@linux-foundation.org>

__GFP_NOFAIL is a bad fiction.  Allocations _can_ fail, and callers should
detect and suitably handle this (and not by lamely moving the infinite
loop up to the caller level either).

Attempting to use __GFP_NOFAIL for a higher-order allocation is even
worse, so add a once-off runtime check for this to slap people around for
even thinking about trying it.

Cc: David Rientjes <rientjes@google.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/page_alloc.c |   13 ++++++++++++-
 1 file changed, 12 insertions(+), 1 deletion(-)

diff -puN mm/page_alloc.c~a mm/page_alloc.c
--- a/mm/page_alloc.c~a
+++ a/mm/page_alloc.c
@@ -1201,8 +1201,19 @@ static int should_fail_alloc_page(gfp_t 
 {
 	if (order < fail_page_alloc.min_order)
 		return 0;
-	if (gfp_mask & __GFP_NOFAIL)
+	if (gfp_mask & __GFP_NOFAIL) {
+		/*
+		 * __GFP_NOFAIL is not to be used in new code.
+		 *
+		 * All __GFP_NOFAIL callers should be fixed so that they
+		 * properly detect and handle allocation failures.
+		 *
+		 * We most definitely don't want callers attempting to allocate
+		 * greater than single-page units with __GFP_NOFAIL.
+		 */
+		WARN_ON_ONCE(order > 0);
 		return 0;
+	}
 	if (fail_page_alloc.ignore_gfp_highmem && (gfp_mask & __GFP_HIGHMEM))
 		return 0;
 	if (fail_page_alloc.ignore_gfp_wait && (gfp_mask & __GFP_WAIT))
_



^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [patch 08/11 -mmotm] oom: invoke oom killer for __GFP_NOFAIL
  2009-05-11 22:11       ` Andrew Morton
@ 2009-05-11 22:31         ` David Rientjes
  2009-05-11 22:46           ` Andrew Morton
  2009-05-12  5:39         ` Peter Zijlstra
  2009-05-12 10:05         ` Mel Gorman
  2 siblings, 1 reply; 95+ messages in thread
From: David Rientjes @ 2009-05-11 22:31 UTC (permalink / raw)
  To: Andrew Morton
  Cc: gregkh, npiggin, mel, a.p.zijlstra, cl, dave, san, arve, linux-kernel

On Mon, 11 May 2009, Andrew Morton wrote:

> __GFP_NOFAIL is a bad fiction.  Allocations _can_ fail, and callers should
> detect and suitably handle this (and not by lamely moving the infinite
> loop up to the caller level either).
> 
> Attempting to use __GFP_NOFAIL for a higher-order allocation is even
> worse, so add a once-off runtime check for this to slap people around for
> even thinking about trying it.
> 
> Cc: David Rientjes <rientjes@google.com>
> Cc: Mel Gorman <mel@csn.ul.ie>
> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

This only emits a warning when you have CONFIG_DEBUG_KERNEL, 
CONFIG_FAULT_INJECTION, and CONFIG_FAIL_PAGE_ALLOC, which is all related 
to fault injection (since we never want to inject a fault into anything 
using __GFP_NOFAIL).  So it may be helpful in tracking down such callers, 
but is unrelated to the config options that enable it and it may not get 
the best coverage.

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [patch 08/11 -mmotm] oom: invoke oom killer for __GFP_NOFAIL
  2009-05-11 22:31         ` David Rientjes
@ 2009-05-11 22:46           ` Andrew Morton
  2009-05-11 23:00             ` David Rientjes
  0 siblings, 1 reply; 95+ messages in thread
From: Andrew Morton @ 2009-05-11 22:46 UTC (permalink / raw)
  To: David Rientjes
  Cc: gregkh, npiggin, mel, a.p.zijlstra, cl, dave, san, arve, linux-kernel

On Mon, 11 May 2009 15:31:04 -0700 (PDT)
David Rientjes <rientjes@google.com> wrote:

> On Mon, 11 May 2009, Andrew Morton wrote:
> 
> > __GFP_NOFAIL is a bad fiction.  Allocations _can_ fail, and callers should
> > detect and suitably handle this (and not by lamely moving the infinite
> > loop up to the caller level either).
> > 
> > Attempting to use __GFP_NOFAIL for a higher-order allocation is even
> > worse, so add a once-off runtime check for this to slap people around for
> > even thinking about trying it.
> > 
> > Cc: David Rientjes <rientjes@google.com>
> > Cc: Mel Gorman <mel@csn.ul.ie>
> > Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
> 
> This only emits a warning when you have CONFIG_DEBUG_KERNEL, 
> CONFIG_FAULT_INJECTION, and CONFIG_FAIL_PAGE_ALLOC, which is all related 
> to fault injection (since we never want to inject a fault into anything 
> using __GFP_NOFAIL).  So it may be helpful in tracking down such callers, 
> but is unrelated to the config options that enable it and it may not get 
> the best coverage.

oh, well that was pretty useless then.  I was trying to find a handy
spot where we can avoid adding fastpath cycles.

How about we sneak it into the order>0 leg inside buffered_rmqueue()?


--- a/mm/page_alloc.c~page-allocator-warn-if-__gfp_nofail-is-used-for-a-large-allocation
+++ a/mm/page_alloc.c
@@ -1130,6 +1130,20 @@ again:
 		list_del(&page->lru);
 		pcp->count--;
 	} else {
+		if (unlikely(gfp_mask & __GFP_NOFAIL)) {
+			/*
+			 * __GFP_NOFAIL is not to be used in new code.
+			 *
+			 * All __GFP_NOFAIL callers should be fixed so that they
+			 * properly detect and handle allocation failures.
+			 *
+			 * We most definitely don't want callers attempting to
+			 * allocate greater than single-page units with
+			 * __GFP_NOFAIL.
+			 */
+			WARN_ON_ONCE(order > 0);
+			return 0;
+		}
 		spin_lock_irqsave(&zone->lock, flags);
 		page = __rmqueue(zone, order, migratetype);
 		__mod_zone_page_state(zone, NR_FREE_PAGES, -(1 << order));
_


^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [patch 08/11 -mmotm] oom: invoke oom killer for __GFP_NOFAIL
  2009-05-11 22:46           ` Andrew Morton
@ 2009-05-11 23:00             ` David Rientjes
  2009-05-11 23:14               ` Andrew Morton
  0 siblings, 1 reply; 95+ messages in thread
From: David Rientjes @ 2009-05-11 23:00 UTC (permalink / raw)
  To: Andrew Morton
  Cc: gregkh, npiggin, mel, a.p.zijlstra, cl, dave, san, arve, linux-kernel

On Mon, 11 May 2009, Andrew Morton wrote:

> oh, well that was pretty useless then.  I was trying to find a handy
> spot where we can avoid adding fastpath cycles.
> 
> How about we sneak it into the order>0 leg inside buffered_rmqueue()?
> 

Wouldn't it be easier after my patch is merged to just check the oom 
killer stack traces for such allocations and people complain about 
unnecessary oom killing when memory is available but too fragmented?  The 
gfp_flags and order are shown in the oom killer header.

> 
> --- a/mm/page_alloc.c~page-allocator-warn-if-__gfp_nofail-is-used-for-a-large-allocation
> +++ a/mm/page_alloc.c
> @@ -1130,6 +1130,20 @@ again:
>  		list_del(&page->lru);
>  		pcp->count--;
>  	} else {
> +		if (unlikely(gfp_mask & __GFP_NOFAIL)) {
> +			/*
> +			 * __GFP_NOFAIL is not to be used in new code.
> +			 *
> +			 * All __GFP_NOFAIL callers should be fixed so that they
> +			 * properly detect and handle allocation failures.
> +			 *
> +			 * We most definitely don't want callers attempting to
> +			 * allocate greater than single-page units with
> +			 * __GFP_NOFAIL.
> +			 */
> +			WARN_ON_ONCE(order > 0);
> +			return 0;
> +		}
>  		spin_lock_irqsave(&zone->lock, flags);
>  		page = __rmqueue(zone, order, migratetype);
>  		__mod_zone_page_state(zone, NR_FREE_PAGES, -(1 << order));

That "return 0" definitely needs to be removed, though :)

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [patch 10/11 -mmotm] oom: avoid oom kill if no interruptible tasks
  2009-05-11 21:39   ` Andrew Morton
@ 2009-05-11 23:08     ` David Rientjes
  0 siblings, 0 replies; 95+ messages in thread
From: David Rientjes @ 2009-05-11 23:08 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Greg Kroah-Hartman, npiggin, mel, a.p.zijlstra,
	Christoph Lameter, dave, san, arve, linux-kernel

On Mon, 11 May 2009, Andrew Morton wrote:

> > If there are no interruptible system tasks other than kthreads,
> 
> It's unclear what the term "system task" means.  Just "task"?
> 

Yes.

> > no task
> > should be chosen for oom kill since they won't respond to the SIGKILL
> > anyway.  Instead, we choose to simply fail page allocations if reclaim
> > cannot free memory and hope for the best.
> 
> But plain old user processes enter and leave D state all the time.  The
> task may well just be sitting there waiting for a disk read to
> complete.  It will respond to the SIGKILL shortly after the IO
> completes, so an appropriate action here is to just wait for that to
> happen.
> 

Since current is in the oom killer, it's runnable.  So the only way this 
can fail an allocation is if a kthread is looking for memory and reclaim 
has failed and no user tasks are running.  It's not guaranteed that any 
memory will ever become available (for instance if all userspace tasks are 
waiting on mm->mmap_sem for an oom killed task that will never exit even 
after having been given TIF_MEMDIE), so the best course of action is to 
fail, if allowed.

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [patch 08/11 -mmotm] oom: invoke oom killer for __GFP_NOFAIL
  2009-05-11 23:00             ` David Rientjes
@ 2009-05-11 23:14               ` Andrew Morton
  2009-05-11 23:37                 ` David Rientjes
  0 siblings, 1 reply; 95+ messages in thread
From: Andrew Morton @ 2009-05-11 23:14 UTC (permalink / raw)
  To: David Rientjes
  Cc: gregkh, npiggin, mel, a.p.zijlstra, cl, dave, san, arve, linux-kernel

On Mon, 11 May 2009 16:00:58 -0700 (PDT)
David Rientjes <rientjes@google.com> wrote:

> On Mon, 11 May 2009, Andrew Morton wrote:
> 
> > oh, well that was pretty useless then.  I was trying to find a handy
> > spot where we can avoid adding fastpath cycles.
> > 
> > How about we sneak it into the order>0 leg inside buffered_rmqueue()?
> > 
> 
> Wouldn't it be easier after my patch is merged to just check the oom 
> killer stack traces for such allocations and people complain about 
> unnecessary oom killing when memory is available but too fragmented?  The 
> gfp_flags and order are shown in the oom killer header.

That assumes that the oom-killer is triggered - in the typical
kernel developer testing, that won't happen.

I think what we should do here is to prevent people even attempting to
use __GFP_NOFAIL with higher-order allocations.

Are you aware of any callsite which is presently using __GFP_NOFAIL on
order>0 allocations?

I expect slub might cause this to happen due to its habit of using
larger-than-needed orders for small objects.  For example, cxgb3 is
passing __GFP_NOFAIL into alloc_skb().

> > 
> > --- a/mm/page_alloc.c~page-allocator-warn-if-__gfp_nofail-is-used-for-a-large-allocation
> > +++ a/mm/page_alloc.c
> > @@ -1130,6 +1130,20 @@ again:
> >  		list_del(&page->lru);
> >  		pcp->count--;
> >  	} else {
> > +		if (unlikely(gfp_mask & __GFP_NOFAIL)) {
> > +			/*
> > +			 * __GFP_NOFAIL is not to be used in new code.
> > +			 *
> > +			 * All __GFP_NOFAIL callers should be fixed so that they
> > +			 * properly detect and handle allocation failures.
> > +			 *
> > +			 * We most definitely don't want callers attempting to
> > +			 * allocate greater than single-page units with
> > +			 * __GFP_NOFAIL.
> > +			 */
> > +			WARN_ON_ONCE(order > 0);
> > +			return 0;
> > +		}
> >  		spin_lock_irqsave(&zone->lock, flags);
> >  		page = __rmqueue(zone, order, migratetype);
> >  		__mod_zone_page_state(zone, NR_FREE_PAGES, -(1 << order));
> 
> That "return 0" definitely needs to be removed, though :)

The inventor of copy-n-paste has a lot to answer for.

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [patch 08/11 -mmotm] oom: invoke oom killer for __GFP_NOFAIL
  2009-05-11 23:14               ` Andrew Morton
@ 2009-05-11 23:37                 ` David Rientjes
  0 siblings, 0 replies; 95+ messages in thread
From: David Rientjes @ 2009-05-11 23:37 UTC (permalink / raw)
  To: Andrew Morton
  Cc: gregkh, npiggin, mel, a.p.zijlstra, cl, dave, san, arve, linux-kernel

On Mon, 11 May 2009, Andrew Morton wrote:

> That assumes that the oom-killer is triggered - in the typical
> kernel developer testing, that won't happen.
> 
> I think what we should do here is to prevent people even attempting to
> use __GFP_NOFAIL with higher-order allocations.
> 

You could just add WARN_ON_ONCE(gfp_mask & __GFP_NOFAIL) in the check for 
order != 0 in __alloc_pages_internal() to keep it out of the fastpath and 
make it appear when we have to reclaim.  That still probably has less 
coverage than you want, but these allocations have a high probability of 
triggering reclaim anyway so it should be pretty obvious.

> Are you aware of any callsite which is presently using __GFP_NOFAIL on
> order>0 allocations?
> 

Nope.

> I expect slub might cause this to happen due to its habit of using
> larger-than-needed orders for small objects.  For example, cxgb3 is
> passing __GFP_NOFAIL into alloc_skb().
> 

slub_max_order is set to PAGE_ALLOC_COSTLY_ORDER by default, so this would 
have to be overridden by the user on the command line unless a single 
object can't fit into such a slab.

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [patch 08/11 -mmotm] oom: invoke oom killer for __GFP_NOFAIL
  2009-05-11 22:11       ` Andrew Morton
  2009-05-11 22:31         ` David Rientjes
@ 2009-05-12  5:39         ` Peter Zijlstra
  2009-05-12 11:36           ` KOSAKI Motohiro
  2009-05-12 10:05         ` Mel Gorman
  2 siblings, 1 reply; 95+ messages in thread
From: Peter Zijlstra @ 2009-05-12  5:39 UTC (permalink / raw)
  To: Andrew Morton
  Cc: David Rientjes, gregkh, npiggin, mel, cl, dave, san, arve, linux-kernel

On Mon, 2009-05-11 at 15:11 -0700, Andrew Morton wrote:
> On Mon, 11 May 2009 14:45:18 -0700 (PDT)
> David Rientjes <rientjes@google.com> wrote:
> 
> > On Mon, 11 May 2009, Andrew Morton wrote:
> > 
> > > > The oom killer must be invoked regardless of the order if the allocation
> > > > is __GFP_NOFAIL, otherwise it will loop forever when reclaim fails to
> > > > free some memory.
> > > 
> > > We should discourage callers from using __GFP_NOFAIL at all.  We should
> > > electrocute callers for using __GFP_NOFAIL on large allocations.  How's about
> > > 
> > > 	WARN_ON_ONCE(order > PAGE_ALLOC_COSTLY_ORDER &&	
> > > 			(gfp_mask & __GFP_NOFAIL));
> > > or, preferably:
> > > 
> > > 	WARN_ON_ONCE(order > 0 && (gfp_mask & __GFP_NOFAIL));
> > > 
> > 
> > Not sure it would help since the oom killer will be now be called for such 
> > an allocation and that dumps the stack (and will actually show the order 
> > and gfp flags as well).
> 
> No, the intent of that warning is to find all call sites which use
> __GFP_NOFAIL on order>0 so we can hunt down and eliminate them.
> 
> 
> please review...

Fully agreed, people should use banker's algorithm to guarantee
progress, not create deadlocks with inf loops.

Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>

> From: Andrew Morton <akpm@linux-foundation.org>
> 
> __GFP_NOFAIL is a bad fiction.  Allocations _can_ fail, and callers should
> detect and suitably handle this (and not by lamely moving the infinite
> loop up to the caller level either).
> 
> Attempting to use __GFP_NOFAIL for a higher-order allocation is even
> worse, so add a once-off runtime check for this to slap people around for
> even thinking about trying it.
> 
> Cc: David Rientjes <rientjes@google.com>
> Cc: Mel Gorman <mel@csn.ul.ie>
> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
> ---
> 
>  mm/page_alloc.c |   13 ++++++++++++-
>  1 file changed, 12 insertions(+), 1 deletion(-)
> 
> diff -puN mm/page_alloc.c~a mm/page_alloc.c
> --- a/mm/page_alloc.c~a
> +++ a/mm/page_alloc.c
> @@ -1201,8 +1201,19 @@ static int should_fail_alloc_page(gfp_t 
>  {
>  	if (order < fail_page_alloc.min_order)
>  		return 0;
> -	if (gfp_mask & __GFP_NOFAIL)
> +	if (gfp_mask & __GFP_NOFAIL) {
> +		/*
> +		 * __GFP_NOFAIL is not to be used in new code.
> +		 *
> +		 * All __GFP_NOFAIL callers should be fixed so that they
> +		 * properly detect and handle allocation failures.
> +		 *
> +		 * We most definitely don't want callers attempting to allocate
> +		 * greater than single-page units with __GFP_NOFAIL.
> +		 */
> +		WARN_ON_ONCE(order > 0);
>  		return 0;
> +	}
>  	if (fail_page_alloc.ignore_gfp_highmem && (gfp_mask & __GFP_HIGHMEM))
>  		return 0;
>  	if (fail_page_alloc.ignore_gfp_wait && (gfp_mask & __GFP_WAIT))
> _
> 
> 


^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [patch 01/11 -mmotm] lowmemorykiller: Only iterate over process list when needed.
  2009-05-10 22:07 [patch 01/11 -mmotm] lowmemorykiller: Only iterate over process list when needed David Rientjes
                   ` (9 preceding siblings ...)
  2009-05-10 22:07 ` [patch 11/11 -mmotm] oom: fail allocations if oom killer can't free memory David Rientjes
@ 2009-05-12  9:09 ` Mel Gorman
  2009-05-13  0:43   ` Arve Hjønnevåg
  10 siblings, 1 reply; 95+ messages in thread
From: Mel Gorman @ 2009-05-12  9:09 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Greg Kroah-Hartman, Nick Piggin, Peter Ziljstra,
	Christoph Lameter, Dave Hansen, San Mehat, Arve Hj?nnev?g,
	linux-kernel

On Sun, May 10, 2009 at 03:07:05PM -0700, David Rientjes wrote:
> From: Arve Hjønnevåg <arve@android.com>
> 
> Use NR_ACTIVE plus NR_INACTIVE as a size estimate for our fake cache
> instead the sum of rss. Neither method is accurate.
> 

If neither estimate is accurate (and I agree), then explain why this is
better and why better decisions are made as a result.

I would assume it's because pages can be double accounted when using RSS
due to shared pages and you might over-estimate the amount of memory
really in use. If that reasoning is correct, say that.

> Also skip the process scan, if the amount of memory available is above
> the largest threshold set.
> 

Patches should do one thing, please split if possible. This patch aspect
seems to be the if check early on so the splitting should not be
difficult.

In that statement, you check nr_to_scan <= 0, how can you scan a
negative number of things? Say in the description that the process scan
is set if min_adj is such a high value that it cannot be used to
discriminate between processes for suitability to kill.

> Signed-off-by: Arve Hjønnevåg <arve@android.com>
> ---
>  drivers/staging/android/lowmemorykiller.c |   35 +++++++++++++++++-----------
>  1 files changed, 21 insertions(+), 14 deletions(-)
> 
> diff --git a/drivers/staging/android/lowmemorykiller.c b/drivers/staging/android/lowmemorykiller.c
> --- a/drivers/staging/android/lowmemorykiller.c
> +++ b/drivers/staging/android/lowmemorykiller.c
> @@ -71,23 +71,30 @@ static int lowmem_shrink(int nr_to_scan, gfp_t gfp_mask)
>  	}
>  	if(nr_to_scan > 0)
>  		lowmem_print(3, "lowmem_shrink %d, %x, ofree %d, ma %d\n", nr_to_scan, gfp_mask, other_free, min_adj);
> +	rem = global_page_state(NR_ACTIVE) + global_page_state(NR_INACTIVE);
> +	if (nr_to_scan <= 0 || min_adj == OOM_ADJUST_MAX + 1) {
> +		lowmem_print(5, "lowmem_shrink %d, %x, return %d\n", nr_to_scan, gfp_mask, rem);
> +		return rem;
> +	}
> +
>  	read_lock(&tasklist_lock);
>  	for_each_process(p) {
> -		if(p->oomkilladj >= 0 && p->mm) {
> -			tasksize = get_mm_rss(p->mm);
> -			if(nr_to_scan > 0 && tasksize > 0 && p->oomkilladj >= min_adj) {
> -				if(selected == NULL ||
> -				   p->oomkilladj > selected->oomkilladj ||
> -				   (p->oomkilladj == selected->oomkilladj &&
> -				    tasksize > selected_tasksize)) {
> -					selected = p;
> -					selected_tasksize = tasksize;
> -					lowmem_print(2, "select %d (%s), adj %d, size %d, to kill\n",
> -					             p->pid, p->comm, p->oomkilladj, tasksize);
> -				}
> -			}
> -			rem += tasksize;
> +		if (p->oomkilladj < min_adj || !p->mm)
> +			continue;
> +		tasksize = get_mm_rss(p->mm);
> +		if (tasksize <= 0)
> +			continue;

It's also odd to consider the possibility of a negative tasksize.

That aside, RSS is the sum of two longs (or atomic_long_t depending) and
tasksize here is a signed integer. Are you not at the risk of
overflowing and this check being unable to do anything useful if all the
processes of interest are using large amounts of memory?

> +		if (selected) {
> +			if (p->oomkilladj < selected->oomkilladj)
> +				continue;
> +			if (p->oomkilladj == selected->oomkilladj &&
> +			    tasksize <= selected_tasksize)
> +				continue;
>  		}
> +		selected = p;
> +		selected_tasksize = tasksize;
> +		lowmem_print(2, "select %d (%s), adj %d, size %d, to kill\n",
> +		             p->pid, p->comm, p->oomkilladj, tasksize);
>  	}
>  	if(selected != NULL) {
>  		lowmem_print(1, "send sigkill to %d (%s), adj %d, size %d\n",

I'm hoping it has been discussed elsewhere why the normal OOM killer is not
being used instead of this driver thing.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [patch 02/11 -mmotm] lowmemorykiller: Don't count free space unless it meets the specified limit by itself
  2009-05-10 22:07 ` [patch 02/11 -mmotm] lowmemorykiller: Don't count free space unless it meets the specified limit by itself David Rientjes
@ 2009-05-12  9:23   ` Mel Gorman
  2009-05-13  0:27     ` Arve Hjønnevåg
  0 siblings, 1 reply; 95+ messages in thread
From: Mel Gorman @ 2009-05-12  9:23 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Greg Kroah-Hartman, Nick Piggin, Peter Ziljstra,
	Christoph Lameter, Dave Hansen, San Mehat, Arve Hj?nnev?g,
	linux-kernel

On Sun, May 10, 2009 at 03:07:10PM -0700, David Rientjes wrote:
> From: Arve Hjønnevåg <arve@android.com>
> 
> This allows processes to be killed when the kernel evict cache pages in
> an attempt to get more contiguous free memory.
> 

I'm not seeing what this patch has to do with contiguous free memory. I
see what the patch is doing - lowering the threshold allowing a
particular min_adj value to be used, but not why.


> Signed-off-by: Arve Hjønnevåg <arve@android.com>
> ---
>  drivers/staging/android/lowmemorykiller.c |   13 +++++++++----
>  1 files changed, 9 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/staging/android/lowmemorykiller.c b/drivers/staging/android/lowmemorykiller.c
> --- a/drivers/staging/android/lowmemorykiller.c
> +++ b/drivers/staging/android/lowmemorykiller.c
> @@ -58,20 +58,25 @@ static int lowmem_shrink(int nr_to_scan, gfp_t gfp_mask)
>  	int min_adj = OOM_ADJUST_MAX + 1;
>  	int selected_tasksize = 0;
>  	int array_size = ARRAY_SIZE(lowmem_adj);
> -	int other_free = global_page_state(NR_FREE_PAGES) + global_page_state(NR_FILE_PAGES);
> +	int other_free = global_page_state(NR_FREE_PAGES);
> +	int other_file = global_page_state(NR_FILE_PAGES);
>  	if(lowmem_adj_size < array_size)
>  		array_size = lowmem_adj_size;
>  	if(lowmem_minfree_size < array_size)
>  		array_size = lowmem_minfree_size;
>  	for(i = 0; i < array_size; i++) {

It would appear that lowmem_adj_size is making assumptions on
the number of zones that exist in the system. It's not clear why
sysctl_lowmem_reserve_ratio[] is not being used or how they differ.

> -		if(other_free < lowmem_minfree[i]) {
> +		if (other_free < lowmem_minfree[i] &&
> +		    other_file < lowmem_minfree[i]) {
>  			min_adj = lowmem_adj[i];
>  			break;
>  		}
>  	}
>  	if(nr_to_scan > 0)
> -		lowmem_print(3, "lowmem_shrink %d, %x, ofree %d, ma %d\n", nr_to_scan, gfp_mask, other_free, min_adj);
> -	rem = global_page_state(NR_ACTIVE) + global_page_state(NR_INACTIVE);
> +		lowmem_print(3, "lowmem_shrink %d, %x, ofree %d %d, ma %d\n", nr_to_scan, gfp_mask, other_free, other_file, min_adj);
> +	rem = global_page_state(NR_ACTIVE_ANON) +
> +		global_page_state(NR_ACTIVE_FILE) +
> +		global_page_state(NR_INACTIVE_ANON) +
> +		global_page_state(NR_INACTIVE_FILE);

This looks like it's a compile fix since changes made to 4f98a2fe. This
should have been in a separate patch and prioritised.

>  	if (nr_to_scan <= 0 || min_adj == OOM_ADJUST_MAX + 1) {
>  		lowmem_print(5, "lowmem_shrink %d, %x, return %d\n", nr_to_scan, gfp_mask, rem);
>  		return rem;

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [patch 05/11 -mmotm] oom: fix possible oom_dump_tasks NULL pointer
  2009-05-10 22:07 ` [patch 05/11 -mmotm] oom: fix possible oom_dump_tasks " David Rientjes
  2009-05-11 21:11   ` Andrew Morton
@ 2009-05-12  9:38   ` Mel Gorman
  1 sibling, 0 replies; 95+ messages in thread
From: Mel Gorman @ 2009-05-12  9:38 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Greg Kroah-Hartman, Nick Piggin, Peter Ziljstra,
	Christoph Lameter, Dave Hansen, San Mehat, Arve Hj?nnev?g,
	linux-kernel

On Sun, May 10, 2009 at 03:07:16PM -0700, David Rientjes wrote:
> When /proc/sys/vm/oom_dump_tasks is enabled, it is possible to get a NULL
> pointer for tasks that have detached mm's since task_lock() is not held
> during the tasklist scan.
> 
> Acked-by: Nick Piggin <npiggin@suse.de>
> Signed-off-by: David Rientjes <rientjes@google.com>

Minor nit. The Acked-by should come after the Signed-off-by.

The window during which p->mm was a valid point and become an invalid
one is *extremely* small but sure, it's there.

Acked-by: Mel Gorman <mel@csn.ul.ie>

I would have preferred if the core VM patches were in a different set to
the driver patches. Maybe there is an interdependency later but this patch
at least looks standalone.

> ---
>  mm/oom_kill.c |   24 +++++++++++++++---------
>  1 files changed, 15 insertions(+), 9 deletions(-)
> 
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -284,22 +284,28 @@ static void dump_tasks(const struct mem_cgroup *mem)
>  	printk(KERN_INFO "[ pid ]   uid  tgid total_vm      rss cpu oom_adj "
>  	       "name\n");
>  	do_each_thread(g, p) {
> -		/*
> -		 * total_vm and rss sizes do not exist for tasks with a
> -		 * detached mm so there's no need to report them.
> -		 */
> -		if (!p->mm)
> -			continue;
> +		struct mm_struct *mm;
> +
>  		if (mem && !task_in_mem_cgroup(p, mem))
>  			continue;
>  		if (!thread_group_leader(p))
>  			continue;
>  
>  		task_lock(p);
> +		mm = p->mm;
> +		if (!mm) {
> +			/*
> +			 * total_vm and rss sizes do not exist for tasks with no
> +			 * mm so there's no need to report them; they can't be
> +			 * oom killed anyway.
> +			 */
> +			task_unlock(p);
> +			continue;
> +		}
>  		printk(KERN_INFO "[%5d] %5d %5d %8lu %8lu %3d     %3d %s\n",
> -		       p->pid, __task_cred(p)->uid, p->tgid,
> -		       p->mm->total_vm, get_mm_rss(p->mm), (int)task_cpu(p),
> -		       p->oomkilladj, p->comm);
> +		       p->pid, __task_cred(p)->uid, p->tgid, mm->total_vm,
> +		       get_mm_rss(mm), (int)task_cpu(p), p->oomkilladj,
> +		       p->comm);
>  		task_unlock(p);
>  	} while_each_thread(g, p);
>  }
> 

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [patch 06/11 -mmotm] oom: move oom_adj value from task_struct to mm_struct
  2009-05-10 22:07 ` [patch 06/11 -mmotm] oom: move oom_adj value from task_struct to mm_struct David Rientjes
  2009-05-11  0:17   ` KOSAKI Motohiro
@ 2009-05-12  9:56   ` Mel Gorman
  1 sibling, 0 replies; 95+ messages in thread
From: Mel Gorman @ 2009-05-12  9:56 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Greg Kroah-Hartman, Nick Piggin, Peter Ziljstra,
	Christoph Lameter, Dave Hansen, San Mehat, Arve Hj?nnev?g,
	linux-kernel

On Sun, May 10, 2009 at 03:07:18PM -0700, David Rientjes wrote:
> The per-task oom_adj value is a characteristic of its mm more than the
> task itself since it's not possible to oom kill any thread that shares
> the mm.  If a task were to be killed while attached to an mm that could
> not be freed because another thread were set to OOM_DISABLE, it would
> have needlessly been terminated since there is no potential for future
> memory freeing.
> 

Good point. It also sets up weird races for processes that create/delete
threads a lot where they can have different oom_adj values. Now you only
need one thread to configure for the mm which is probably what users wanted
in the first place.

> This patch moves oomkilladj (now more appropriately named oom_adj) from
> struct task_struct to struct mm_struct.  This requires task_lock() on a
> task to check its oom_adj value to protect against exec, but it's already
> necessary to take the lock when dereferencing the mm to find the total VM
> size for the badness heuristic.
> 
> This fixes a livelock if the oom killer chooses a task and another thread
> sharing the same memory has an oom_adj value of OOM_DISABLE.  This occurs
> because oom_kill_task() repeatedly returns 1 and refuses to kill the
> chosen task while select_bad_process() will repeatedly choose the same
> task during the next retry.
> 
> Taking task_lock() in select_bad_process() to check for OOM_DISABLE and
> in oom_kill_task() to check for threads sharing the same memory will be
> removed in the next patch in this series where it will no longer be
> necessary.
> 
> Writing to /proc/pid/oom_adj for a kthread will now return -EINVAL since
> these threads are immune from oom killing already.  They simply report an
> oom_adj value of OOM_DISABLE.
> 

Makes sense but I see two things at least that may need to be fixed up.
Details below.

> Cc: Nick Piggin <npiggin@suse.de>
> Cc: San Mehat <san@android.com>
> Signed-off-by: David Rientjes <rientjes@google.com>
> ---
>  Documentation/filesystems/proc.txt        |   15 +++++++++----
>  drivers/staging/android/lowmemorykiller.c |    2 +-
>  fs/proc/base.c                            |   19 ++++++++++++++--
>  include/linux/mm_types.h                  |    2 +
>  include/linux/sched.h                     |    1 -
>  mm/oom_kill.c                             |   32 ++++++++++++++++++----------
>  6 files changed, 49 insertions(+), 22 deletions(-)
> 
> diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
> --- a/Documentation/filesystems/proc.txt
> +++ b/Documentation/filesystems/proc.txt
> @@ -1003,11 +1003,13 @@ CHAPTER 3: PER-PROCESS PARAMETERS
>  3.1 /proc/<pid>/oom_adj - Adjust the oom-killer score
>  ------------------------------------------------------
>  
> -This file can be used to adjust the score used to select which processes
> -should be killed in an  out-of-memory  situation.  Giving it a high score will
> -increase the likelihood of this process being killed by the oom-killer.  Valid
> -values are in the range -16 to +15, plus the special value -17, which disables
> -oom-killing altogether for this process.
> +This file can be used to adjust the score used to select which processes should
> +be killed in an out-of-memory situation.  The oom_adj value is a characteristic
> +of the task's mm, so all threads that share an mm with pid will have the same
> +oom_adj value.  A high value will increase the liklihood of this process being

%s/liklihood/likelihood//

> +killed by the oom-killer.  Valid values are in the range -16 to +15 as
> +explained below and a special value of -17, which disables oom-killing
> +altogether for threads sharing pid's mm.
>  
>  The process to be killed in an out-of-memory situation is selected among all others
>  based on its badness score. This value equals the original memory size of the process
> @@ -1021,6 +1023,9 @@ the parent's score if they do not share the same memory. Thus forking servers
>  are the prime candidates to be killed. Having only one 'hungry' child will make
>  parent less preferable than the child.
>  
> +/proc/<pid>/oom_adj cannot be changed for kthreads since they are immune from
> +oom-killing already.
> +
>  /proc/<pid>/oom_score shows process' current badness score.
>  
>  The following heuristics are then applied:
> diff --git a/drivers/staging/android/lowmemorykiller.c b/drivers/staging/android/lowmemorykiller.c
> --- a/drivers/staging/android/lowmemorykiller.c
> +++ b/drivers/staging/android/lowmemorykiller.c
> @@ -97,7 +97,7 @@ static int lowmem_shrink(int nr_to_scan, gfp_t gfp_mask)
>  			task_unlock(p);
>  			continue;
>  		}
> -		oom_adj = p->oomkilladj;
> +		oom_adj = p->mm->oom_adj;
>  		if (oom_adj < min_adj) {
>  			task_unlock(p);
>  			continue;

Ouch, it's a pity that a staging driver and the core VM have a dependency
like this. It makes me wonder again why there is a low memory killer that
is separate from the OOM killer. I should dig through the archives and
see what the reasoning was.

> diff --git a/fs/proc/base.c b/fs/proc/base.c
> --- a/fs/proc/base.c
> +++ b/fs/proc/base.c
> @@ -1003,7 +1003,12 @@ static ssize_t oom_adjust_read(struct file *file, char __user *buf,
>  
>  	if (!task)
>  		return -ESRCH;
> -	oom_adjust = task->oomkilladj;
> +	task_lock(task);
> +	if (task->mm)
> +		oom_adjust = task->mm->oom_adj;
> +	else
> +		oom_adjust = OOM_DISABLE;
> +	task_unlock(task);
>  	put_task_struct(task);
>  
>  	len = snprintf(buffer, sizeof(buffer), "%i\n", oom_adjust);
> @@ -1032,11 +1037,19 @@ static ssize_t oom_adjust_write(struct file *file, const char __user *buf,
>  	task = get_proc_task(file->f_path.dentry->d_inode);
>  	if (!task)
>  		return -ESRCH;
> -	if (oom_adjust < task->oomkilladj && !capable(CAP_SYS_RESOURCE)) {
> +	task_lock(task);
> +	if (!task->mm) {
> +		task_unlock(task);
> +		put_task_struct(task);
> +		return -EINVAL;
> +	}
> +	if (oom_adjust < task->mm->oom_adj && !capable(CAP_SYS_RESOURCE)) {
> +		task_unlock(task);
>  		put_task_struct(task);
>  		return -EACCES;
>  	}
> -	task->oomkilladj = oom_adjust;
> +	task->mm->oom_adj = oom_adjust;
> +	task_unlock(task);
>  	put_task_struct(task);
>  	if (end - buffer == 0)
>  		return -EIO;
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -232,6 +232,8 @@ struct mm_struct {
>  
>  	unsigned long saved_auxv[AT_VECTOR_SIZE]; /* for /proc/PID/auxv */
>  
> +	s8 oom_adj;	/* OOM kill score adjustment (bit shift) */
> +
>  	cpumask_t cpu_vm_mask;
>  
>  	/* Architecture-specific MM context */
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1149,7 +1149,6 @@ struct task_struct {
>  	 * a short time
>  	 */
>  	unsigned char fpu_counter;
> -	s8 oomkilladj; /* OOM kill score adjustment (bit shift). */
>  #ifdef CONFIG_BLK_DEV_IO_TRACE
>  	unsigned int btrace_seq;
>  #endif
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -58,6 +58,7 @@ unsigned long badness(struct task_struct *p, unsigned long uptime)
>  	unsigned long points, cpu_time, run_time;
>  	struct mm_struct *mm;
>  	struct task_struct *child;
> +	int oom_adj;
>  
>  	task_lock(p);
>  	mm = p->mm;
> @@ -65,6 +66,7 @@ unsigned long badness(struct task_struct *p, unsigned long uptime)
>  		task_unlock(p);
>  		return 0;
>  	}
> +	oom_adj = mm->oom_adj;
>  
>  	/*
>  	 * The memory size of the process is the basis for the badness.
> @@ -148,15 +150,15 @@ unsigned long badness(struct task_struct *p, unsigned long uptime)
>  		points /= 8;
>  
>  	/*
> -	 * Adjust the score by oomkilladj.
> +	 * Adjust the score by oom_adj.
>  	 */
> -	if (p->oomkilladj) {
> -		if (p->oomkilladj > 0) {
> +	if (oom_adj) {
> +		if (oom_adj > 0) {
>  			if (!points)
>  				points = 1;
> -			points <<= p->oomkilladj;
> +			points <<= oom_adj;
>  		} else
> -			points >>= -(p->oomkilladj);
> +			points >>= -(oom_adj);
>  	}
>  
>  #ifdef DEBUG
> @@ -251,8 +253,10 @@ static struct task_struct *select_bad_process(unsigned long *ppoints,
>  			*ppoints = ULONG_MAX;
>  		}
>  
> -		if (p->oomkilladj == OOM_DISABLE)
> +		task_lock(p);
> +		if (p->mm && p->mm->oom_adj == OOM_DISABLE)
>  			continue;

Oops, it would appear we didn't unlock the task here. I bet we deadlock
if oom_adj == OOM_DISABLE for a normal process and it then exits.

> +		task_unlock(p);
>  
>  		points = badness(p, uptime.tv_sec);
>  		if (points > *ppoints || !chosen) {
> @@ -304,8 +308,7 @@ static void dump_tasks(const struct mem_cgroup *mem)
>  		}
>  		printk(KERN_INFO "[%5d] %5d %5d %8lu %8lu %3d     %3d %s\n",
>  		       p->pid, __task_cred(p)->uid, p->tgid, mm->total_vm,
> -		       get_mm_rss(mm), (int)task_cpu(p), p->oomkilladj,
> -		       p->comm);
> +		       get_mm_rss(mm), (int)task_cpu(p), mm->oom_adj, p->comm);
>  		task_unlock(p);
>  	} while_each_thread(g, p);
>  }
> @@ -367,8 +370,12 @@ static int oom_kill_task(struct task_struct *p)
>  	 * Don't kill the process if any threads are set to OOM_DISABLE
>  	 */
>  	do_each_thread(g, q) {
> -		if (q->mm == mm && q->oomkilladj == OOM_DISABLE)
> +		task_lock(q);
> +		if (q->mm == mm && q->mm && q->mm->oom_adj == OOM_DISABLE) {
> +			task_unlock(q);
>  			return 1;
> +		}
> +		task_unlock(q);
>  	} while_each_thread(g, q);
>  
>  	__oom_kill_task(p, 1);
> @@ -393,10 +400,11 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
>  	struct task_struct *c;
>  
>  	if (printk_ratelimit()) {
> -		printk(KERN_WARNING "%s invoked oom-killer: "
> -			"gfp_mask=0x%x, order=%d, oomkilladj=%d\n",
> -			current->comm, gfp_mask, order, current->oomkilladj);
>  		task_lock(current);
> +		printk(KERN_WARNING "%s invoked oom-killer: "
> +			"gfp_mask=0x%x, order=%d, oom_adj=%d\n",
> +			current->comm, gfp_mask, order,
> +			current->mm ? current->mm->oom_adj : OOM_DISABLE);
>  		cpuset_print_task_mems_allowed(current);
>  		task_unlock(current);
>  		dump_stack();
> 

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [patch 08/11 -mmotm] oom: invoke oom killer for __GFP_NOFAIL
  2009-05-11 22:11       ` Andrew Morton
  2009-05-11 22:31         ` David Rientjes
  2009-05-12  5:39         ` Peter Zijlstra
@ 2009-05-12 10:05         ` Mel Gorman
  2 siblings, 0 replies; 95+ messages in thread
From: Mel Gorman @ 2009-05-12 10:05 UTC (permalink / raw)
  To: Andrew Morton
  Cc: David Rientjes, gregkh, npiggin, a.p.zijlstra, cl, dave, san,
	arve, linux-kernel

On Mon, May 11, 2009 at 03:11:30PM -0700, Andrew Morton wrote:
> On Mon, 11 May 2009 14:45:18 -0700 (PDT)
> David Rientjes <rientjes@google.com> wrote:
> 
> > On Mon, 11 May 2009, Andrew Morton wrote:
> > 
> > > > The oom killer must be invoked regardless of the order if the allocation
> > > > is __GFP_NOFAIL, otherwise it will loop forever when reclaim fails to
> > > > free some memory.
> > > 
> > > We should discourage callers from using __GFP_NOFAIL at all.  We should
> > > electrocute callers for using __GFP_NOFAIL on large allocations.  How's about
> > > 
> > > 	WARN_ON_ONCE(order > PAGE_ALLOC_COSTLY_ORDER &&	
> > > 			(gfp_mask & __GFP_NOFAIL));
> > > or, preferably:
> > > 
> > > 	WARN_ON_ONCE(order > 0 && (gfp_mask & __GFP_NOFAIL));
> > > 
> > 
> > Not sure it would help since the oom killer will be now be called for such 
> > an allocation and that dumps the stack (and will actually show the order 
> > and gfp flags as well).
> 
> No, the intent of that warning is to find all call sites which use
> __GFP_NOFAIL on order>0 so we can hunt down and eliminate them.
> 
> 
> please review...
> 
> From: Andrew Morton <akpm@linux-foundation.org>
> 
> __GFP_NOFAIL is a bad fiction.  Allocations _can_ fail, and callers should
> detect and suitably handle this (and not by lamely moving the infinite
> loop up to the caller level either).
> 
> Attempting to use __GFP_NOFAIL for a higher-order allocation is even
> worse, so add a once-off runtime check for this to slap people around for
> even thinking about trying it.
> 
> Cc: David Rientjes <rientjes@google.com>
> Cc: Mel Gorman <mel@csn.ul.ie>
> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Acked-by: Mel Gorman <mel@csn.ul.ie>

> ---
> 
>  mm/page_alloc.c |   13 ++++++++++++-
>  1 file changed, 12 insertions(+), 1 deletion(-)
> 
> diff -puN mm/page_alloc.c~a mm/page_alloc.c
> --- a/mm/page_alloc.c~a
> +++ a/mm/page_alloc.c
> @@ -1201,8 +1201,19 @@ static int should_fail_alloc_page(gfp_t 
>  {
>  	if (order < fail_page_alloc.min_order)
>  		return 0;
> -	if (gfp_mask & __GFP_NOFAIL)
> +	if (gfp_mask & __GFP_NOFAIL) {
> +		/*
> +		 * __GFP_NOFAIL is not to be used in new code.
> +		 *
> +		 * All __GFP_NOFAIL callers should be fixed so that they
> +		 * properly detect and handle allocation failures.
> +		 *
> +		 * We most definitely don't want callers attempting to allocate
> +		 * greater than single-page units with __GFP_NOFAIL.
> +		 */
> +		WARN_ON_ONCE(order > 0);
>  		return 0;
> +	}
>  	if (fail_page_alloc.ignore_gfp_highmem && (gfp_mask & __GFP_HIGHMEM))
>  		return 0;
>  	if (fail_page_alloc.ignore_gfp_wait && (gfp_mask & __GFP_WAIT))
> _
> 
> 

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [patch 08/11 -mmotm] oom: invoke oom killer for __GFP_NOFAIL
  2009-05-12  5:39         ` Peter Zijlstra
@ 2009-05-12 11:36           ` KOSAKI Motohiro
  0 siblings, 0 replies; 95+ messages in thread
From: KOSAKI Motohiro @ 2009-05-12 11:36 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrew Morton, David Rientjes, gregkh, npiggin, mel, cl, dave,
	san, arve, linux-kernel

>> > Not sure it would help since the oom killer will be now be called for such
>> > an allocation and that dumps the stack (and will actually show the order
>> > and gfp flags as well).
>>
>> No, the intent of that warning is to find all call sites which use
>> __GFP_NOFAIL on order>0 so we can hunt down and eliminate them.
>>
>>
>> please review...
>
> Fully agreed, people should use banker's algorithm to guarantee
> progress, not create deadlocks with inf loops.
>
> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>

me too.

>
>> From: Andrew Morton <akpm@linux-foundation.org>
>>
>> __GFP_NOFAIL is a bad fiction.  Allocations _can_ fail, and callers should
>> detect and suitably handle this (and not by lamely moving the infinite
>> loop up to the caller level either).
>>
>> Attempting to use __GFP_NOFAIL for a higher-order allocation is even
>> worse, so add a once-off runtime check for this to slap people around for
>> even thinking about trying it.
>>
>> Cc: David Rientjes <rientjes@google.com>
>> Cc: Mel Gorman <mel@csn.ul.ie>
>> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Misleading OOM messages
  2009-05-10 22:07 ` [patch 11/11 -mmotm] oom: fail allocations if oom killer can't free memory David Rientjes
@ 2009-05-12 21:14   ` Christoph Lameter
  2009-05-14  9:29     ` Pavel Machek
  0 siblings, 1 reply; 95+ messages in thread
From: Christoph Lameter @ 2009-05-12 21:14 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Greg Kroah-Hartman, Nick Piggin, Mel Gorman,
	Peter Ziljstra, Dave Hansen, San Mehat, Arve Hjønnevåg,
	linux-kernel

While we are at it: Could we get rid of the name "Out of Memory" and stop
printing texts to that effect? What we call an OOM is a failure to
perform memory reclaim or we are running out of reserves due
to not being able to run reclaim. Mostly this is due to OS internal issues
having nothing to do actual amounts of memory available.

I have seen scores of people put more memory into their boxes because
they thought they did not have enough memory. Then there are those that
lower ulimit values so that users cannot allocate so much memory.

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [patch 02/11 -mmotm] lowmemorykiller: Don't count free space  unless it meets the specified limit by itself
  2009-05-12  9:23   ` Mel Gorman
@ 2009-05-13  0:27     ` Arve Hjønnevåg
  2009-05-13  9:42       ` Mel Gorman
  0 siblings, 1 reply; 95+ messages in thread
From: Arve Hjønnevåg @ 2009-05-13  0:27 UTC (permalink / raw)
  To: Mel Gorman
  Cc: David Rientjes, Andrew Morton, Greg Kroah-Hartman, Nick Piggin,
	Peter Ziljstra, Christoph Lameter, Dave Hansen, San Mehat,
	linux-kernel

On Tue, May 12, 2009 at 2:23 AM, Mel Gorman <mel@csn.ul.ie> wrote:
> On Sun, May 10, 2009 at 03:07:10PM -0700, David Rientjes wrote:
>> From: Arve Hjønnevåg <arve@android.com>
>>
>> This allows processes to be killed when the kernel evict cache pages in
>> an attempt to get more contiguous free memory.
>>
>
> I'm not seeing what this patch has to do with contiguous free memory. I
> see what the patch is doing - lowering the threshold allowing a
> particular min_adj value to be used, but not why.
>

If the memory is fragmented, contiguous allocations can cause all
caches to be freed. The low memory killer would not trigger in this
case.

>
>> Signed-off-by: Arve Hjønnevåg <arve@android.com>
>> ---
>>  drivers/staging/android/lowmemorykiller.c |   13 +++++++++----
>>  1 files changed, 9 insertions(+), 4 deletions(-)
>>
>> diff --git a/drivers/staging/android/lowmemorykiller.c b/drivers/staging/android/lowmemorykiller.c
>> --- a/drivers/staging/android/lowmemorykiller.c
>> +++ b/drivers/staging/android/lowmemorykiller.c
>> @@ -58,20 +58,25 @@ static int lowmem_shrink(int nr_to_scan, gfp_t gfp_mask)
>>       int min_adj = OOM_ADJUST_MAX + 1;
>>       int selected_tasksize = 0;
>>       int array_size = ARRAY_SIZE(lowmem_adj);
>> -     int other_free = global_page_state(NR_FREE_PAGES) + global_page_state(NR_FILE_PAGES);
>> +     int other_free = global_page_state(NR_FREE_PAGES);
>> +     int other_file = global_page_state(NR_FILE_PAGES);
>>       if(lowmem_adj_size < array_size)
>>               array_size = lowmem_adj_size;
>>       if(lowmem_minfree_size < array_size)
>>               array_size = lowmem_minfree_size;
>>       for(i = 0; i < array_size; i++) {
>
> It would appear that lowmem_adj_size is making assumptions on
> the number of zones that exist in the system. It's not clear why
> sysctl_lowmem_reserve_ratio[] is not being used or how they differ.

lowmem_adj_size is the number of elements in the array. Why would the
number of zones in the system affect it? The goal of the
lowmemorykiller is to make more memory available for caches. I would
expect increasing sysctl_lowmem_reserve_ratio to have the opposite
effect, but on our system with only one zone it has no effect.

>
>> -             if(other_free < lowmem_minfree[i]) {
>> +             if (other_free < lowmem_minfree[i] &&
>> +                 other_file < lowmem_minfree[i]) {
>>                       min_adj = lowmem_adj[i];
>>                       break;
>>               }
>>       }
>>       if(nr_to_scan > 0)
>> -             lowmem_print(3, "lowmem_shrink %d, %x, ofree %d, ma %d\n", nr_to_scan, gfp_mask, other_free, min_adj);
>> -     rem = global_page_state(NR_ACTIVE) + global_page_state(NR_INACTIVE);
>> +             lowmem_print(3, "lowmem_shrink %d, %x, ofree %d %d, ma %d\n", nr_to_scan, gfp_mask, other_free, other_file, min_adj);
>> +     rem = global_page_state(NR_ACTIVE_ANON) +
>> +             global_page_state(NR_ACTIVE_FILE) +
>> +             global_page_state(NR_INACTIVE_ANON) +
>> +             global_page_state(NR_INACTIVE_FILE);
>
> This looks like it's a compile fix since changes made to 4f98a2fe. This
> should have been in a separate patch and prioritised.

You are right. These patches were originally on 2.6.27 and I did not
notice that the first patch was also broken when I fixed this patch
for 2.6.29.

>
>>       if (nr_to_scan <= 0 || min_adj == OOM_ADJUST_MAX + 1) {
>>               lowmem_print(5, "lowmem_shrink %d, %x, return %d\n", nr_to_scan, gfp_mask, rem);
>>               return rem;
>
> --
> Mel Gorman
> Part-time Phd Student                          Linux Technology Center
> University of Limerick                         IBM Dublin Software Lab
>



-- 
Arve Hjønnevåg

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [patch 01/11 -mmotm] lowmemorykiller: Only iterate over process  list when needed.
  2009-05-12  9:09 ` [patch 01/11 -mmotm] lowmemorykiller: Only iterate over process list when needed Mel Gorman
@ 2009-05-13  0:43   ` Arve Hjønnevåg
  0 siblings, 0 replies; 95+ messages in thread
From: Arve Hjønnevåg @ 2009-05-13  0:43 UTC (permalink / raw)
  To: Mel Gorman
  Cc: David Rientjes, Andrew Morton, Greg Kroah-Hartman, Nick Piggin,
	Peter Ziljstra, Christoph Lameter, Dave Hansen, San Mehat,
	linux-kernel

On Tue, May 12, 2009 at 2:09 AM, Mel Gorman <mel@csn.ul.ie> wrote:
> On Sun, May 10, 2009 at 03:07:05PM -0700, David Rientjes wrote:
>> From: Arve Hjønnevåg <arve@android.com>
>>
>> Use NR_ACTIVE plus NR_INACTIVE as a size estimate for our fake cache
>> instead the sum of rss. Neither method is accurate.
>>
>
> If neither estimate is accurate (and I agree), then explain why this is
> better and why better decisions are made as a result.
>
It should make similar decisions without iterating over the process
list as often.

> I would assume it's because pages can be double accounted when using RSS
> due to shared pages and you might over-estimate the amount of memory
> really in use. If that reasoning is correct, say that.
>
>> Also skip the process scan, if the amount of memory available is above
>> the largest threshold set.
>>
>
> Patches should do one thing, please split if possible. This patch aspect
> seems to be the if check early on so the splitting should not be
> difficult.
>

This is the whole point of the patch.

> In that statement, you check nr_to_scan <= 0, how can you scan a
> negative number of things? Say in the description that the process scan
> is set if min_adj is such a high value that it cannot be used to
> discriminate between processes for suitability to kill.
>
>> Signed-off-by: Arve Hjønnevåg <arve@android.com>
>> ---
>>  drivers/staging/android/lowmemorykiller.c |   35 +++++++++++++++++-----------
>>  1 files changed, 21 insertions(+), 14 deletions(-)
>>
>> diff --git a/drivers/staging/android/lowmemorykiller.c b/drivers/staging/android/lowmemorykiller.c
>> --- a/drivers/staging/android/lowmemorykiller.c
>> +++ b/drivers/staging/android/lowmemorykiller.c
>> @@ -71,23 +71,30 @@ static int lowmem_shrink(int nr_to_scan, gfp_t gfp_mask)
>>       }
>>       if(nr_to_scan > 0)
>>               lowmem_print(3, "lowmem_shrink %d, %x, ofree %d, ma %d\n", nr_to_scan, gfp_mask, other_free, min_adj);
>> +     rem = global_page_state(NR_ACTIVE) + global_page_state(NR_INACTIVE);
>> +     if (nr_to_scan <= 0 || min_adj == OOM_ADJUST_MAX + 1) {
>> +             lowmem_print(5, "lowmem_shrink %d, %x, return %d\n", nr_to_scan, gfp_mask, rem);
>> +             return rem;
>> +     }
>> +
>>       read_lock(&tasklist_lock);
>>       for_each_process(p) {
>> -             if(p->oomkilladj >= 0 && p->mm) {
>> -                     tasksize = get_mm_rss(p->mm);
>> -                     if(nr_to_scan > 0 && tasksize > 0 && p->oomkilladj >= min_adj) {
>> -                             if(selected == NULL ||
>> -                                p->oomkilladj > selected->oomkilladj ||
>> -                                (p->oomkilladj == selected->oomkilladj &&
>> -                                 tasksize > selected_tasksize)) {
>> -                                     selected = p;
>> -                                     selected_tasksize = tasksize;
>> -                                     lowmem_print(2, "select %d (%s), adj %d, size %d, to kill\n",
>> -                                                  p->pid, p->comm, p->oomkilladj, tasksize);
>> -                             }
>> -                     }
>> -                     rem += tasksize;
>> +             if (p->oomkilladj < min_adj || !p->mm)
>> +                     continue;
>> +             tasksize = get_mm_rss(p->mm);
>> +             if (tasksize <= 0)
>> +                     continue;
>
> It's also odd to consider the possibility of a negative tasksize.

Yes, using an unsigned tasksize would be better, but this patch did
not change any of this.

>
> That aside, RSS is the sum of two longs (or atomic_long_t depending) and
> tasksize here is a signed integer. Are you not at the risk of
> overflowing and this check being unable to do anything useful if all the
> processes of interest are using large amounts of memory?

Yes (The systems we use this on do not have enough memory for this to
be possible though).

>
>> +             if (selected) {
>> +                     if (p->oomkilladj < selected->oomkilladj)
>> +                             continue;
>> +                     if (p->oomkilladj == selected->oomkilladj &&
>> +                         tasksize <= selected_tasksize)
>> +                             continue;
>>               }
>> +             selected = p;
>> +             selected_tasksize = tasksize;
>> +             lowmem_print(2, "select %d (%s), adj %d, size %d, to kill\n",
>> +                          p->pid, p->comm, p->oomkilladj, tasksize);
>>       }
>>       if(selected != NULL) {
>>               lowmem_print(1, "send sigkill to %d (%s), adj %d, size %d\n",
>
> I'm hoping it has been discussed elsewhere why the normal OOM killer is not
> being used instead of this driver thing.
>
> --
> Mel Gorman
> Part-time Phd Student                          Linux Technology Center
> University of Limerick                         IBM Dublin Software Lab
>



-- 
Arve Hjønnevåg

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [patch 02/11 -mmotm] lowmemorykiller: Don't count free space unless it meets the specified limit by itself
  2009-05-13  0:27     ` Arve Hjønnevåg
@ 2009-05-13  9:42       ` Mel Gorman
  2009-05-14 23:25         ` Arve Hjønnevåg
  0 siblings, 1 reply; 95+ messages in thread
From: Mel Gorman @ 2009-05-13  9:42 UTC (permalink / raw)
  To: Arve Hj?nnev?g
  Cc: David Rientjes, Andrew Morton, Greg Kroah-Hartman, Nick Piggin,
	Peter Ziljstra, Christoph Lameter, Dave Hansen, San Mehat,
	linux-kernel

On Tue, May 12, 2009 at 05:27:18PM -0700, Arve Hj?nnev?g wrote:
> On Tue, May 12, 2009 at 2:23 AM, Mel Gorman <mel@csn.ul.ie> wrote:
> > On Sun, May 10, 2009 at 03:07:10PM -0700, David Rientjes wrote:
> >> From: Arve Hjønnevåg <arve@android.com>
> >>
> >> This allows processes to be killed when the kernel evict cache pages in
> >> an attempt to get more contiguous free memory.
> >>
> >
> > I'm not seeing what this patch has to do with contiguous free memory. I
> > see what the patch is doing - lowering the threshold allowing a
> > particular min_adj value to be used, but not why.
> >
> 
> If the memory is fragmented, contiguous allocations can cause all
> caches to be freed. The low memory killer would not trigger in this
> case.
> 

That still doesn't explain why this patch allows processes to be killed
in an effort to get contiguous free memory other than freeing pages in
general may free up contiguous pages. It seems this patch makes the
killer more agressive but I think I know why.

How about the following as a changelog? It might help me understand the
use case better and how this patch changes it if the changelog was better.
Of course, as this is this driver is isolated to the Android platform and
I'm not developing or own one, you're also welcome to tell me to "shut up,
you don't know what you're talking about" :)

=====
The Android low memory killer is responsible for ensuring there is enough
free memory at configured watermarks stored in a lowmem_minfree[] array. At
each watermark, there is an associated oom_adj score. When the watermark is
not met, processes with a higher oom_adj are considered for OOM killing
so that lower-priority processes are killed before the situation becomes
unmanageable.

The problem is that the driver currently sums NR_FREE_PAGES+NR_FILE_PAGES
as an estimate of how much memory is potentially free. This can lead to
a situation where no memory is really free because it's all in the page
cache and corrective action is taken too late with too many processes being
considered. This patch changes the semantics such that both NR_FREE_PAGES
and NR_FILE_PAGES must be above the lowmem_minfree threshold.
====

?

If this is accurate, it also wouldn't hurt to put the first paragraph into a
comment earlier in the driver so the next poor fool like me isn't scratching
his head wondering what this is for.

> >
> >> Signed-off-by: Arve Hjønnevåg <arve@android.com>
> >> ---
> >>  drivers/staging/android/lowmemorykiller.c |   13 +++++++++----
> >>  1 files changed, 9 insertions(+), 4 deletions(-)
> >>
> >> diff --git a/drivers/staging/android/lowmemorykiller.c b/drivers/staging/android/lowmemorykiller.c
> >> --- a/drivers/staging/android/lowmemorykiller.c
> >> +++ b/drivers/staging/android/lowmemorykiller.c
> >> @@ -58,20 +58,25 @@ static int lowmem_shrink(int nr_to_scan, gfp_t gfp_mask)
> >>       int min_adj = OOM_ADJUST_MAX + 1;
> >>       int selected_tasksize = 0;
> >>       int array_size = ARRAY_SIZE(lowmem_adj);
> >> -     int other_free = global_page_state(NR_FREE_PAGES) + global_page_state(NR_FILE_PAGES);
> >> +     int other_free = global_page_state(NR_FREE_PAGES);
> >> +     int other_file = global_page_state(NR_FILE_PAGES);
> >>       if(lowmem_adj_size < array_size)
> >>               array_size = lowmem_adj_size;
> >>       if(lowmem_minfree_size < array_size)
> >>               array_size = lowmem_minfree_size;
> >>       for(i = 0; i < array_size; i++) {
> >
> > It would appear that lowmem_adj_size is making assumptions on
> > the number of zones that exist in the system. It's not clear why
> > sysctl_lowmem_reserve_ratio[] is not being used or how they differ.
> 
> lowmem_adj_size is the number of elements in the array. Why would the
> number of zones in the system affect it?

I misread what the purpose of lowmem_adj[] was. I thought it was
thresholds on a per-zone basis because it looks similar to
lowmem_reserve_ratio[] but that's not what it is at all.

Looking at it closer, this is more about introducing more watermarks to
a zone for the control of the size of the cache - maybe because the OOM
killer takes action too late for you.

Lack of familiarity with the driver and how it's expected to be used is
catching me.

> The goal of the
> lowmemorykiller is to make more memory available for caches. I would
> expect increasing sysctl_lowmem_reserve_ratio to have the opposite
> effect, but on our system with only one zone it has no effect.
> 

Ok.

> >
> >> -             if(other_free < lowmem_minfree[i]) {
> >> +             if (other_free < lowmem_minfree[i] &&
> >> +                 other_file < lowmem_minfree[i]) {
> >>                       min_adj = lowmem_adj[i];
> >>                       break;
> >>               }
> >>       }
> >>       if(nr_to_scan > 0)
> >> -             lowmem_print(3, "lowmem_shrink %d, %x, ofree %d, ma %d\n", nr_to_scan, gfp_mask, other_free, min_adj);
> >> -     rem = global_page_state(NR_ACTIVE) + global_page_state(NR_INACTIVE);
> >> +             lowmem_print(3, "lowmem_shrink %d, %x, ofree %d %d, ma %d\n", nr_to_scan, gfp_mask, other_free, other_file, min_adj);
> >> +     rem = global_page_state(NR_ACTIVE_ANON) +
> >> +             global_page_state(NR_ACTIVE_FILE) +
> >> +             global_page_state(NR_INACTIVE_ANON) +
> >> +             global_page_state(NR_INACTIVE_FILE);
> >
> > This looks like it's a compile fix since changes made to 4f98a2fe. This
> > should have been in a separate patch and prioritised.
> 
> You are right. These patches were originally on 2.6.27 and I did not
> notice that the first patch was also broken when I fixed this patch
> for 2.6.29.
> 

Ok

> >
> >>       if (nr_to_scan <= 0 || min_adj == OOM_ADJUST_MAX + 1) {
> >>               lowmem_print(5, "lowmem_shrink %d, %x, return %d\n", nr_to_scan, gfp_mask, rem);
> >>               return rem;
> >

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Misleading OOM messages
  2009-05-12 21:14   ` Misleading OOM messages Christoph Lameter
@ 2009-05-14  9:29     ` Pavel Machek
  2009-05-14 19:46       ` Christoph Lameter
  0 siblings, 1 reply; 95+ messages in thread
From: Pavel Machek @ 2009-05-14  9:29 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: David Rientjes, Andrew Morton, Greg Kroah-Hartman, Nick Piggin,
	Mel Gorman, Peter Ziljstra, Dave Hansen, San Mehat,
	Arve Hj?nnev?g, linux-kernel

Hi!

> While we are at it: Could we get rid of the name "Out of Memory" and stop
> printing texts to that effect? What we call an OOM is a failure to
> perform memory reclaim or we are running out of reserves due
> to not being able to run reclaim. Mostly this is due to OS internal issues
> having nothing to do actual amounts of memory available.

It can be 'low on memory' if you play with mlock() a bit. 

It is out of memory if you run out of swap (or have no swap to begin with).

I believe message is often correct. What message would you suggest?
									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Misleading OOM messages
  2009-05-14  9:29     ` Pavel Machek
@ 2009-05-14 19:46       ` Christoph Lameter
  2009-05-14 20:38         ` Dave Hansen
  2009-05-14 20:56         ` Pavel Machek
  0 siblings, 2 replies; 95+ messages in thread
From: Christoph Lameter @ 2009-05-14 19:46 UTC (permalink / raw)
  To: Pavel Machek
  Cc: David Rientjes, Andrew Morton, Greg Kroah-Hartman, Nick Piggin,
	Mel Gorman, Peter Ziljstra, Dave Hansen, San Mehat,
	Arve Hj?nnev?g, linux-kernel

On Thu, 14 May 2009, Pavel Machek wrote:

> It can be 'low on memory' if you play with mlock() a bit.

But that is a reclaim failure becuase of mlocking pages.

> It is out of memory if you run out of swap (or have no swap to begin with).

That is a swap config issue.

> I believe message is often correct. What message would you suggest?

"Failure to reclaim memory"



^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Misleading OOM messages
  2009-05-14 19:46       ` Christoph Lameter
@ 2009-05-14 20:38         ` Dave Hansen
  2009-05-14 20:49           ` Christoph Lameter
  2009-05-14 20:49           ` David Rientjes
  2009-05-14 20:56         ` Pavel Machek
  1 sibling, 2 replies; 95+ messages in thread
From: Dave Hansen @ 2009-05-14 20:38 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Pavel Machek, David Rientjes, Andrew Morton, Greg Kroah-Hartman,
	Nick Piggin, Mel Gorman, Peter Ziljstra, San Mehat,
	Arve Hj?nnev?g, linux-kernel

On Thu, 2009-05-14 at 15:46 -0400, Christoph Lameter wrote:
> On Thu, 14 May 2009, Pavel Machek wrote:
> > It can be 'low on memory' if you play with mlock() a bit.
> 
> But that is a reclaim failure becuase of mlocking pages.
> 
> > It is out of memory if you run out of swap (or have no swap to begin with).
> 
> That is a swap config issue.

The other thing that I find confusing myself is that we're almost never
at '0 pages free' (which is what I intrinsically think) when we OOM.
We're just under the watermarks and not apparently making any progress.
But I don't think we want to say "under the watermarks" in our error
message.

> > I believe message is often correct. What message would you suggest?
> 
> "Failure to reclaim memory"

The problem I have with that is that it also doesn't tell the whole.
story.  It's the end symptom when *just* before we OOM, but it doesn't
characterize the whole thing very well.  It's like saying the Titanic
sunk because "too much water onboard." :)  It's true, but it
concentrates a bit too much on the end state.

To me, it's a question of how much information we can get out in a line
or two on the console.  Is something like this better?

        "Unable to satisfy memory allocation request and not making
        progress reclaiming from other sources."

We can't exactly go spitting out an entire tutorial in dmesg, but could
we stick a short URL in there?  Like http://linux-mm.org/OOM perhaps?

-- Dave

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Misleading OOM messages
  2009-05-14 20:38         ` Dave Hansen
@ 2009-05-14 20:49           ` Christoph Lameter
  2009-05-14 20:49           ` David Rientjes
  1 sibling, 0 replies; 95+ messages in thread
From: Christoph Lameter @ 2009-05-14 20:49 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Pavel Machek, David Rientjes, Andrew Morton, Greg Kroah-Hartman,
	Nick Piggin, Mel Gorman, Peter Ziljstra, San Mehat,
	Arve Hj?nnev?g, linux-kernel

On Thu, 14 May 2009, Dave Hansen wrote:

> The other thing that I find confusing myself is that we're almost never
> at '0 pages free' (which is what I intrinsically think) when we OOM.
> We're just under the watermarks and not apparently making any progress.
> But I don't think we want to say "under the watermarks" in our error
> message.

Hmmm... We could say that there was memory available but the kernel thread
was not allowed to use it. But that would cause more confusion.


> > > I believe message is often correct. What message would you suggest?
> >
> > "Failure to reclaim memory"
>
> The problem I have with that is that it also doesn't tell the whole.
> story.  It's the end symptom when *just* before we OOM, but it doesn't
> characterize the whole thing very well.  It's like saying the Titanic
> sunk because "too much water onboard." :)  It's true, but it
> concentrates a bit too much on the end state.
>
> To me, it's a question of how much information we can get out in a line
> or two on the console.  Is something like this better?

I am all for putting out more information if we can. But we ran reclaim
and it did not work. The Titanic sank because there was too much water
onboard after all. How it got in there (Iceberg, torpedoed by the CIA (oh
it did not yet exist) in retaliation for British misbehavior) would be
good to know. But the message should not suggest increasing the size of
the Titanic because it cannot hold all the incoming water.

>         "Unable to satisfy memory allocation request and not making
>         progress reclaiming from other sources."
>
> We can't exactly go spitting out an entire tutorial in dmesg, but could
> we stick a short URL in there?  Like http://linux-mm.org/OOM perhaps?

Good idea.


^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Misleading OOM messages
  2009-05-14 20:38         ` Dave Hansen
  2009-05-14 20:49           ` Christoph Lameter
@ 2009-05-14 20:49           ` David Rientjes
  2009-05-14 21:05             ` Dave Hansen
  1 sibling, 1 reply; 95+ messages in thread
From: David Rientjes @ 2009-05-14 20:49 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Christoph Lameter, Pavel Machek, Andrew Morton,
	Greg Kroah-Hartman, Nick Piggin, Mel Gorman, Peter Ziljstra,
	San Mehat, Arve Hj?nnev?g, linux-kernel

On Thu, 14 May 2009, Dave Hansen wrote:

> The problem I have with that is that it also doesn't tell the whole.
> story.  It's the end symptom when *just* before we OOM, but it doesn't
> characterize the whole thing very well.  It's like saying the Titanic
> sunk because "too much water onboard." :)  It's true, but it
> concentrates a bit too much on the end state.
> 
> To me, it's a question of how much information we can get out in a line
> or two on the console.  Is something like this better?
>         
>         "Unable to satisfy memory allocation request and not making
>         progress reclaiming from other sources."
> 
> We can't exactly go spitting out an entire tutorial in dmesg, but could
> we stick a short URL in there?  Like http://linux-mm.org/OOM perhaps?
> 

I think switching all the oom killer messages to be "no available memory" 
as it is in the MPOL_BIND case would be the best alternative.  We 
currently use "out of memory" even for cpusets, for example, when it 
happens because it cannot accommodate any more hardwall allocations while 
there may be an abundance of memory elsewhere that it cannot access.  I 
also think "no available memory" makes more sense than "out of memory" 
when describing situations where we're at or below the minimum watermarks 
for all allowable zones.  Either that or "no allowable memory".

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Misleading OOM messages
  2009-05-14 19:46       ` Christoph Lameter
  2009-05-14 20:38         ` Dave Hansen
@ 2009-05-14 20:56         ` Pavel Machek
  1 sibling, 0 replies; 95+ messages in thread
From: Pavel Machek @ 2009-05-14 20:56 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: David Rientjes, Andrew Morton, Greg Kroah-Hartman, Nick Piggin,
	Mel Gorman, Peter Ziljstra, Dave Hansen, San Mehat,
	Arve Hj?nnev?g, linux-kernel

On Thu 2009-05-14 15:46:50, Christoph Lameter wrote:
> On Thu, 14 May 2009, Pavel Machek wrote:
> 
> > It can be 'low on memory' if you play with mlock() a bit.
> 
> But that is a reclaim failure becuase of mlocking pages.
> 
> > It is out of memory if you run out of swap (or have no swap to begin with).
> 
> That is a swap config issue.

Well, many machines are _designed_ to run without swap. Swapping on
flash is considered bad.

> > I believe message is often correct. What message would you suggest?
> 
> "Failure to reclaim memory"

That is not a helpful message, I'm afraid. What about

 "Memory allocation failure" ?
 "Out of memory, or failed to allocate memory due to temporary spike"?

									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Misleading OOM messages
  2009-05-14 20:49           ` David Rientjes
@ 2009-05-14 21:05             ` Dave Hansen
  2009-05-14 21:12               ` David Rientjes
  2009-05-14 21:30               ` Christoph Lameter
  0 siblings, 2 replies; 95+ messages in thread
From: Dave Hansen @ 2009-05-14 21:05 UTC (permalink / raw)
  To: David Rientjes
  Cc: Christoph Lameter, Pavel Machek, Andrew Morton,
	Greg Kroah-Hartman, Nick Piggin, Mel Gorman, Peter Ziljstra,
	San Mehat, Arve Hj?nnev?g, linux-kernel

On Thu, 2009-05-14 at 13:49 -0700, David Rientjes wrote:
> I think switching all the oom killer messages to be "no available memory" 
> as it is in the MPOL_BIND case would be the best alternative.  We 
> currently use "out of memory" even for cpusets, for example, when it 
> happens because it cannot accommodate any more hardwall allocations while 
> there may be an abundance of memory elsewhere that it cannot access.  I 
> also think "no available memory" makes more sense than "out of memory" 
> when describing situations where we're at or below the minimum watermarks 
> for all allowable zones.  Either that or "no allowable memory".

How about something like this to start?  We can "mv mm/oom_kill.c
mm/nam_kill.c" later. ;)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 92bcf1d..0ce0d59 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -409,7 +409,8 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
 		return 0;
 	}
 
-	printk(KERN_ERR "%s: kill process %d (%s) score %li or a child\n",
+	printk(KERN_ERR "No available memory %s: "
+			"kill process %d (%s) score %li or a child\n",
 					message, task_pid_nr(p), p->comm, points);
 
 	/* Try to kill a child first */
@@ -438,7 +439,7 @@ retry:
 		p = current;
 
 	if (oom_kill_process(p, gfp_mask, 0, points, mem,
-				"Memory cgroup out of memory"))
+				"(memory cgroup)"))
 		goto retry;
 out:
 	read_unlock(&tasklist_lock);
@@ -519,7 +520,7 @@ static void __out_of_memory(gfp_t gfp_mask, int order)
 
 	if (sysctl_oom_kill_allocating_task)
 		if (!oom_kill_process(current, gfp_mask, order, 0, NULL,
-				"Out of memory (oom_kill_allocating_task)"))
+				"(oom_kill_allocating_task)"))
 			return;
 retry:
 	/*
@@ -537,8 +538,7 @@ retry:
 		panic("Out of memory and no killable processes...\n");
 	}
 
-	if (oom_kill_process(p, gfp_mask, order, points, NULL,
-			     "Out of memory"))
+	if (oom_kill_process(p, gfp_mask, order, points, NULL, ""))
 		goto retry;
 }
 
@@ -612,7 +612,7 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask, int order)
 	switch (constraint) {
 	case CONSTRAINT_MEMORY_POLICY:
 		oom_kill_process(current, gfp_mask, order, 0, NULL,
-				"No available memory (MPOL_BIND)");
+				"(MPOL_BIND)");
 		break;
 
 	case CONSTRAINT_NONE:


-- Dave


^ permalink raw reply related	[flat|nested] 95+ messages in thread

* Re: Misleading OOM messages
  2009-05-14 21:05             ` Dave Hansen
@ 2009-05-14 21:12               ` David Rientjes
  2009-05-14 21:30               ` Christoph Lameter
  1 sibling, 0 replies; 95+ messages in thread
From: David Rientjes @ 2009-05-14 21:12 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Christoph Lameter, Pavel Machek, Andrew Morton,
	Greg Kroah-Hartman, Nick Piggin, Mel Gorman, Peter Zijlstra,
	San Mehat, Arve Hj?nnev?g, linux-kernel

On Thu, 14 May 2009, Dave Hansen wrote:

> On Thu, 2009-05-14 at 13:49 -0700, David Rientjes wrote:
> > I think switching all the oom killer messages to be "no available memory" 
> > as it is in the MPOL_BIND case would be the best alternative.  We 
> > currently use "out of memory" even for cpusets, for example, when it 
> > happens because it cannot accommodate any more hardwall allocations while 
> > there may be an abundance of memory elsewhere that it cannot access.  I 
> > also think "no available memory" makes more sense than "out of memory" 
> > when describing situations where we're at or below the minimum watermarks 
> > for all allowable zones.  Either that or "no allowable memory".
> 
> How about something like this to start?  We can "mv mm/oom_kill.c
> mm/nam_kill.c" later. ;)
> 

I don't think it's that easy, I think we need to indicate what the set of 
available memory is.  For instance, we need to show the nodes that 
MPOL_BIND is allowed to access, what the hard limit for the memory cgroup 
is, what current->mems_allowed is for cpuset ooms, why we can't allocate 
in ZONE_DMA or ZONE_DMA32 because of lowmem_reserve_ratio, etc.

oom_kill_allocating_task, for instance, can indicate all of the above 
scenarios but it doesn't show the cause for the oom like others do, even 
with your patch.  It simply shows that 
/proc/sys/vm/oom_kill_allocating_task was set to avoid the expensive 
tasklist scan.

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Misleading OOM messages
  2009-05-14 21:05             ` Dave Hansen
  2009-05-14 21:12               ` David Rientjes
@ 2009-05-14 21:30               ` Christoph Lameter
  2009-05-14 21:34                 ` Pavel Machek
  2009-05-14 21:37                 ` Dave Hansen
  1 sibling, 2 replies; 95+ messages in thread
From: Christoph Lameter @ 2009-05-14 21:30 UTC (permalink / raw)
  To: Dave Hansen
  Cc: David Rientjes, Pavel Machek, Andrew Morton, Greg Kroah-Hartman,
	Nick Piggin, Mel Gorman, Peter Ziljstra, San Mehat,
	Arve Hj?nnev?g, linux-kernel

On Thu, 14 May 2009, Dave Hansen wrote:

> -	printk(KERN_ERR "%s: kill process %d (%s) score %li or a child\n",
> +	printk(KERN_ERR "No available memory %s: "
> +			"kill process %d (%s) score %li or a child\n",
>  					message, task_pid_nr(p), p->comm, points);

"No available memory" still suggests that plugging in more memory is the
right solution.


^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Misleading OOM messages
  2009-05-14 21:30               ` Christoph Lameter
@ 2009-05-14 21:34                 ` Pavel Machek
  2009-05-14 21:41                   ` Dave Hansen
  2009-05-15 17:57                   ` Christoph Lameter
  2009-05-14 21:37                 ` Dave Hansen
  1 sibling, 2 replies; 95+ messages in thread
From: Pavel Machek @ 2009-05-14 21:34 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Dave Hansen, David Rientjes, Andrew Morton, Greg Kroah-Hartman,
	Nick Piggin, Mel Gorman, Peter Ziljstra, San Mehat,
	Arve Hj?nnev?g, linux-kernel

On Thu 2009-05-14 17:30:02, Christoph Lameter wrote:
> On Thu, 14 May 2009, Dave Hansen wrote:
> 
> > -	printk(KERN_ERR "%s: kill process %d (%s) score %li or a child\n",
> > +	printk(KERN_ERR "No available memory %s: "
> > +			"kill process %d (%s) score %li or a child\n",
> >  					message, task_pid_nr(p), p->comm, points);
> 
> "No available memory" still suggests that plugging in more memory is the
> right solution.

And... on correctly working kernel, it is, right?

If you have no swap space and too many applications, you plug more
memory. (Or invent some swap).

If you misconfigured cgroups, you give more memory to them.

If your applications mlocked 900MB and you have 1GB, you need to plug
more memory.

So... when is plugging more memory _not_ valid answer? AFAICT it is
when it is some kernel problem, resulting in memory not being
reclaimed fast enough....
									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Misleading OOM messages
  2009-05-14 21:30               ` Christoph Lameter
  2009-05-14 21:34                 ` Pavel Machek
@ 2009-05-14 21:37                 ` Dave Hansen
  2009-05-14 22:00                   ` David Rientjes
  1 sibling, 1 reply; 95+ messages in thread
From: Dave Hansen @ 2009-05-14 21:37 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: David Rientjes, Pavel Machek, Andrew Morton, Greg Kroah-Hartman,
	Nick Piggin, Mel Gorman, Peter Ziljstra, San Mehat,
	Arve Hj?nnev?g, linux-kernel

On Thu, 2009-05-14 at 17:30 -0400, Christoph Lameter wrote:
> On Thu, 14 May 2009, Dave Hansen wrote:
> 
> > -	printk(KERN_ERR "%s: kill process %d (%s) score %li or a child\n",
> > +	printk(KERN_ERR "No available memory %s: "
> > +			"kill process %d (%s) score %li or a child\n",
> >  					message, task_pid_nr(p), p->comm, points);
> 
> "No available memory" still suggests that plugging in more memory is the
> right solution.

To me it at least adds the fact that more should be made *available* and
not just that you're out of it.  So, definitely not perfect, but better
than "out".

-- Dave


^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Misleading OOM messages
  2009-05-14 21:34                 ` Pavel Machek
@ 2009-05-14 21:41                   ` Dave Hansen
  2009-05-15 13:05                     ` Pavel Machek
  2009-05-15 17:57                   ` Christoph Lameter
  1 sibling, 1 reply; 95+ messages in thread
From: Dave Hansen @ 2009-05-14 21:41 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Christoph Lameter, David Rientjes, Andrew Morton,
	Greg Kroah-Hartman, Nick Piggin, Mel Gorman, Peter Ziljstra,
	San Mehat, Arve Hj?nnev?g, linux-kernel

On Thu, 2009-05-14 at 23:34 +0200, Pavel Machek wrote:
> On Thu 2009-05-14 17:30:02, Christoph Lameter wrote:
> > On Thu, 14 May 2009, Dave Hansen wrote: 
> > > -	printk(KERN_ERR "%s: kill process %d (%s) score %li or a child\n",
> > > +	printk(KERN_ERR "No available memory %s: "
> > > +			"kill process %d (%s) score %li or a child\n",
> > >  					message, task_pid_nr(p), p->comm, points);
> > 
> > "No available memory" still suggests that plugging in more memory is the
> > right solution.
> 
> And... on correctly working kernel, it is, right?
> 
> If you have no swap space and too many applications, you plug more
> memory. (Or invent some swap).
> 
> If you misconfigured cgroups, you give more memory to them.
> 
> If your applications mlocked 900MB and you have 1GB, you need to plug
> more memory.
> 
> So... when is plugging more memory _not_ valid answer? AFAICT it is
> when it is some kernel problem, resulting in memory not being
> reclaimed fast enough....

I recently had a problem (~2.6.27) where the user did an mlock() of ~95%
of memory then started doing ftp tests.  The machine also had 64k base
pages.  We let you dirty ~30% of memory, so they were able to dirty 6x
more memory than what we even had to work with.  We OOMed pretty fast
every time.

Now, that situation never gets better when you add more memory.  It only
gets worse because that "30% of memory number" takes longer and longer
to write out to the disk.

This is actually a pretty common scenario for the HPC and database
folks.  They go sucking up and locking as much memory as they can get
their hands on.  Adding memory never helps them because they'll use up
whatever is there.

-- Dave

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Misleading OOM messages
  2009-05-14 21:37                 ` Dave Hansen
@ 2009-05-14 22:00                   ` David Rientjes
  2009-05-15 17:58                     ` Christoph Lameter
  0 siblings, 1 reply; 95+ messages in thread
From: David Rientjes @ 2009-05-14 22:00 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Christoph Lameter, Pavel Machek, Andrew Morton,
	Greg Kroah-Hartman, Nick Piggin, Mel Gorman, Peter Ziljstra,
	San Mehat, Arve Hj?nnev?g, linux-kernel

On Thu, 14 May 2009, Dave Hansen wrote:

> > > -	printk(KERN_ERR "%s: kill process %d (%s) score %li or a child\n",
> > > +	printk(KERN_ERR "No available memory %s: "
> > > +			"kill process %d (%s) score %li or a child\n",
> > >  					message, task_pid_nr(p), p->comm, points);
> > 
> > "No available memory" still suggests that plugging in more memory is the
> > right solution.
> 
> To me it at least adds the fact that more should be made *available* and
> not just that you're out of it.  So, definitely not perfect, but better
> than "out".
> 

I think "no allowable memory" followed by information on what is and is 
not allowed in that specific context would remove any ambiguity.

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [patch 02/11 -mmotm] lowmemorykiller: Don't count free space  unless it meets the specified limit by itself
  2009-05-13  9:42       ` Mel Gorman
@ 2009-05-14 23:25         ` Arve Hjønnevåg
  2009-05-15  9:18           ` Mel Gorman
  0 siblings, 1 reply; 95+ messages in thread
From: Arve Hjønnevåg @ 2009-05-14 23:25 UTC (permalink / raw)
  To: Mel Gorman
  Cc: David Rientjes, Andrew Morton, Greg Kroah-Hartman, Nick Piggin,
	Peter Ziljstra, Christoph Lameter, Dave Hansen, San Mehat,
	linux-kernel

On Wed, May 13, 2009 at 2:42 AM, Mel Gorman <mel@csn.ul.ie> wrote:
> On Tue, May 12, 2009 at 05:27:18PM -0700, Arve Hj?nnev?g wrote:
>> On Tue, May 12, 2009 at 2:23 AM, Mel Gorman <mel@csn.ul.ie> wrote:
>> > On Sun, May 10, 2009 at 03:07:10PM -0700, David Rientjes wrote:
>> >> From: Arve Hjønnevåg <arve@android.com>
>> >>
>> >> This allows processes to be killed when the kernel evict cache pages in
>> >> an attempt to get more contiguous free memory.
>> >>
>> >
>> > I'm not seeing what this patch has to do with contiguous free memory. I
>> > see what the patch is doing - lowering the threshold allowing a
>> > particular min_adj value to be used, but not why.
>> >
>>
>> If the memory is fragmented, contiguous allocations can cause all
>> caches to be freed. The low memory killer would not trigger in this
>> case.
>>
>
> That still doesn't explain why this patch allows processes to be killed
> in an effort to get contiguous free memory other than freeing pages in
> general may free up contiguous pages. It seems this patch makes the
> killer more agressive but I think I know why.

Without this patch, the lowmemorykiller would never trigger when the
kernel freed cache memory (since the lowmemorykiller looked at the sum
of cache plus free). For normal allocations this is not a problem
since the kernel stops freeing cache memory when the required amount
of free memory is met. For contiguous allocations, there is no limit
to how much cache memory the kernel will try to free.

>
> How about the following as a changelog? It might help me understand the
> use case better and how this patch changes it if the changelog was better.
> Of course, as this is this driver is isolated to the Android platform and
> I'm not developing or own one, you're also welcome to tell me to "shut up,
> you don't know what you're talking about" :)
>
> =====
> The Android low memory killer is responsible for ensuring there is enough
> free memory at configured watermarks stored in a lowmem_minfree[] array. At
> each watermark, there is an associated oom_adj score. When the watermark is
> not met, processes with a higher oom_adj are considered for OOM killing
> so that lower-priority processes are killed before the situation becomes
> unmanageable.

This is describing the lowmemorykiller with or without this patch and
does not belong in this patch description. Have you looked at
drivers/staging/android/lowmemorykiller.txt?

>
> The problem is that the driver currently sums NR_FREE_PAGES+NR_FILE_PAGES
> as an estimate of how much memory is potentially free. This can lead to
> a situation where no memory is really free because it's all in the page
> cache and corrective action is taken too late with too many processes being
> considered. This patch changes the semantics such that both NR_FREE_PAGES
> and NR_FILE_PAGES must be above the lowmem_minfree threshold.
> ====

I don't think this is correct. To start with, we want memory to be
used for the cache.

>
> ?
>
> If this is accurate, it also wouldn't hurt to put the first paragraph into a
> comment earlier in the driver so the next poor fool like me isn't scratching
> his head wondering what this is for.
>
>> >
>> >> Signed-off-by: Arve Hjønnevåg <arve@android.com>
>> >> ---
>> >>  drivers/staging/android/lowmemorykiller.c |   13 +++++++++----
>> >>  1 files changed, 9 insertions(+), 4 deletions(-)
>> >>
>> >> diff --git a/drivers/staging/android/lowmemorykiller.c b/drivers/staging/android/lowmemorykiller.c
>> >> --- a/drivers/staging/android/lowmemorykiller.c
>> >> +++ b/drivers/staging/android/lowmemorykiller.c
>> >> @@ -58,20 +58,25 @@ static int lowmem_shrink(int nr_to_scan, gfp_t gfp_mask)
>> >>       int min_adj = OOM_ADJUST_MAX + 1;
>> >>       int selected_tasksize = 0;
>> >>       int array_size = ARRAY_SIZE(lowmem_adj);
>> >> -     int other_free = global_page_state(NR_FREE_PAGES) + global_page_state(NR_FILE_PAGES);
>> >> +     int other_free = global_page_state(NR_FREE_PAGES);
>> >> +     int other_file = global_page_state(NR_FILE_PAGES);
>> >>       if(lowmem_adj_size < array_size)
>> >>               array_size = lowmem_adj_size;
>> >>       if(lowmem_minfree_size < array_size)
>> >>               array_size = lowmem_minfree_size;
>> >>       for(i = 0; i < array_size; i++) {
>> >
>> > It would appear that lowmem_adj_size is making assumptions on
>> > the number of zones that exist in the system. It's not clear why
>> > sysctl_lowmem_reserve_ratio[] is not being used or how they differ.
>>
>> lowmem_adj_size is the number of elements in the array. Why would the
>> number of zones in the system affect it?
>
> I misread what the purpose of lowmem_adj[] was. I thought it was
> thresholds on a per-zone basis because it looks similar to
> lowmem_reserve_ratio[] but that's not what it is at all.
>
> Looking at it closer, this is more about introducing more watermarks to
> a zone for the control of the size of the cache - maybe because the OOM
> killer takes action too late for you.
>
> Lack of familiarity with the driver and how it's expected to be used is
> catching me.
>
>> The goal of the
>> lowmemorykiller is to make more memory available for caches. I would
>> expect increasing sysctl_lowmem_reserve_ratio to have the opposite
>> effect, but on our system with only one zone it has no effect.
>>
>
> Ok.
>
>> >
>> >> -             if(other_free < lowmem_minfree[i]) {
>> >> +             if (other_free < lowmem_minfree[i] &&
>> >> +                 other_file < lowmem_minfree[i]) {
>> >>                       min_adj = lowmem_adj[i];
>> >>                       break;
>> >>               }
>> >>       }
>> >>       if(nr_to_scan > 0)
>> >> -             lowmem_print(3, "lowmem_shrink %d, %x, ofree %d, ma %d\n", nr_to_scan, gfp_mask, other_free, min_adj);
>> >> -     rem = global_page_state(NR_ACTIVE) + global_page_state(NR_INACTIVE);
>> >> +             lowmem_print(3, "lowmem_shrink %d, %x, ofree %d %d, ma %d\n", nr_to_scan, gfp_mask, other_free, other_file, min_adj);
>> >> +     rem = global_page_state(NR_ACTIVE_ANON) +
>> >> +             global_page_state(NR_ACTIVE_FILE) +
>> >> +             global_page_state(NR_INACTIVE_ANON) +
>> >> +             global_page_state(NR_INACTIVE_FILE);
>> >
>> > This looks like it's a compile fix since changes made to 4f98a2fe. This
>> > should have been in a separate patch and prioritised.
>>
>> You are right. These patches were originally on 2.6.27 and I did not
>> notice that the first patch was also broken when I fixed this patch
>> for 2.6.29.
>>
>
> Ok
>
>> >
>> >>       if (nr_to_scan <= 0 || min_adj == OOM_ADJUST_MAX + 1) {
>> >>               lowmem_print(5, "lowmem_shrink %d, %x, return %d\n", nr_to_scan, gfp_mask, rem);
>> >>               return rem;
>> >
>
> --
> Mel Gorman
> Part-time Phd Student                          Linux Technology Center
> University of Limerick                         IBM Dublin Software Lab
>



-- 
Arve Hjønnevåg

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [patch 02/11 -mmotm] lowmemorykiller: Don't count free space unless it meets the specified limit by itself
  2009-05-14 23:25         ` Arve Hjønnevåg
@ 2009-05-15  9:18           ` Mel Gorman
  0 siblings, 0 replies; 95+ messages in thread
From: Mel Gorman @ 2009-05-15  9:18 UTC (permalink / raw)
  To: Arve Hj?nnev?g
  Cc: David Rientjes, Andrew Morton, Greg Kroah-Hartman, Nick Piggin,
	Peter Ziljstra, Christoph Lameter, Dave Hansen, San Mehat,
	linux-kernel

On Thu, May 14, 2009 at 04:25:04PM -0700, Arve Hj?nnev?g wrote:
> On Wed, May 13, 2009 at 2:42 AM, Mel Gorman <mel@csn.ul.ie> wrote:
> > On Tue, May 12, 2009 at 05:27:18PM -0700, Arve Hj?nnev?g wrote:
> >> On Tue, May 12, 2009 at 2:23 AM, Mel Gorman <mel@csn.ul.ie> wrote:
> >> > On Sun, May 10, 2009 at 03:07:10PM -0700, David Rientjes wrote:
> >> >> From: Arve Hjønnevåg <arve@android.com>
> >> >>
> >> >> This allows processes to be killed when the kernel evict cache pages in
> >> >> an attempt to get more contiguous free memory.
> >> >>
> >> >
> >> > I'm not seeing what this patch has to do with contiguous free memory. I
> >> > see what the patch is doing - lowering the threshold allowing a
> >> > particular min_adj value to be used, but not why.
> >> >
> >>
> >> If the memory is fragmented, contiguous allocations can cause all
> >> caches to be freed. The low memory killer would not trigger in this
> >> case.
> >>
> >
> > That still doesn't explain why this patch allows processes to be killed
> > in an effort to get contiguous free memory other than freeing pages in
> > general may free up contiguous pages. It seems this patch makes the
> > killer more agressive but I think I know why.
> 
> Without this patch, the lowmemorykiller would never trigger when the
> kernel freed cache memory (since the lowmemorykiller looked at the sum
> of cache plus free). For normal allocations this is not a problem
> since the kernel stops freeing cache memory when the required amount
> of free memory is met. For contiguous allocations, there is no limit
> to how much cache memory the kernel will try to free.
> 

Nice explanation. Please make that the changelog.

> >
> > How about the following as a changelog? It might help me understand the
> > use case better and how this patch changes it if the changelog was better.
> > Of course, as this is this driver is isolated to the Android platform and
> > I'm not developing or own one, you're also welcome to tell me to "shut up,
> > you don't know what you're talking about" :)
> >
> > =====
> > The Android low memory killer is responsible for ensuring there is enough
> > free memory at configured watermarks stored in a lowmem_minfree[] array. At
> > each watermark, there is an associated oom_adj score. When the watermark is
> > not met, processes with a higher oom_adj are considered for OOM killing
> > so that lower-priority processes are killed before the situation becomes
> > unmanageable.
> 
> This is describing the lowmemorykiller with or without this patch and
> does not belong in this patch description.

It was for my own benefit to see was I getting the intent of the code
right.

> Have you looked at
> drivers/staging/android/lowmemorykiller.txt?
> 

No, I hadn't. I glanced through Documentation/ and saw nothing and didn't
spot there was help with the driver itself. I've have been a lot less confused
if I had. Thanks.

> >
> > The problem is that the driver currently sums NR_FREE_PAGES+NR_FILE_PAGES
> > as an estimate of how much memory is potentially free. This can lead to
> > a situation where no memory is really free because it's all in the page
> > cache and corrective action is taken too late with too many processes being
> > considered. This patch changes the semantics such that both NR_FREE_PAGES
> > and NR_FILE_PAGES must be above the lowmem_minfree threshold.
> > ====
> 
> I don't think this is correct. To start with, we want memory to be
> used for the cache.
> 

What you said above should be the changelog explained the intent much better.

After all that, patch now looks good to me. Once you update the
changelog, feel free to add a

Reviewed-by: Mel Gorman <mel@csn.ul.ie>

Thanks.

> >
> > ?
> >
> > If this is accurate, it also wouldn't hurt to put the first paragraph into a
> > comment earlier in the driver so the next poor fool like me isn't scratching
> > his head wondering what this is for.
> >
> >> >
> >> >> Signed-off-by: Arve Hjønnevåg <arve@android.com>
> >> >> ---
> >> >>  drivers/staging/android/lowmemorykiller.c |   13 +++++++++----
> >> >>  1 files changed, 9 insertions(+), 4 deletions(-)
> >> >>
> >> >> diff --git a/drivers/staging/android/lowmemorykiller.c b/drivers/staging/android/lowmemorykiller.c
> >> >> --- a/drivers/staging/android/lowmemorykiller.c
> >> >> +++ b/drivers/staging/android/lowmemorykiller.c
> >> >> @@ -58,20 +58,25 @@ static int lowmem_shrink(int nr_to_scan, gfp_t gfp_mask)
> >> >>       int min_adj = OOM_ADJUST_MAX + 1;
> >> >>       int selected_tasksize = 0;
> >> >>       int array_size = ARRAY_SIZE(lowmem_adj);
> >> >> -     int other_free = global_page_state(NR_FREE_PAGES) + global_page_state(NR_FILE_PAGES);
> >> >> +     int other_free = global_page_state(NR_FREE_PAGES);
> >> >> +     int other_file = global_page_state(NR_FILE_PAGES);
> >> >>       if(lowmem_adj_size < array_size)
> >> >>               array_size = lowmem_adj_size;
> >> >>       if(lowmem_minfree_size < array_size)
> >> >>               array_size = lowmem_minfree_size;
> >> >>       for(i = 0; i < array_size; i++) {
> >> >
> >> > It would appear that lowmem_adj_size is making assumptions on
> >> > the number of zones that exist in the system. It's not clear why
> >> > sysctl_lowmem_reserve_ratio[] is not being used or how they differ.
> >>
> >> lowmem_adj_size is the number of elements in the array. Why would the
> >> number of zones in the system affect it?
> >
> > I misread what the purpose of lowmem_adj[] was. I thought it was
> > thresholds on a per-zone basis because it looks similar to
> > lowmem_reserve_ratio[] but that's not what it is at all.
> >
> > Looking at it closer, this is more about introducing more watermarks to
> > a zone for the control of the size of the cache - maybe because the OOM
> > killer takes action too late for you.
> >
> > Lack of familiarity with the driver and how it's expected to be used is
> > catching me.
> >
> >> The goal of the
> >> lowmemorykiller is to make more memory available for caches. I would
> >> expect increasing sysctl_lowmem_reserve_ratio to have the opposite
> >> effect, but on our system with only one zone it has no effect.
> >>
> >
> > Ok.
> >
> >> >
> >> >> -             if(other_free < lowmem_minfree[i]) {
> >> >> +             if (other_free < lowmem_minfree[i] &&
> >> >> +                 other_file < lowmem_minfree[i]) {
> >> >>                       min_adj = lowmem_adj[i];
> >> >>                       break;
> >> >>               }
> >> >>       }
> >> >>       if(nr_to_scan > 0)
> >> >> -             lowmem_print(3, "lowmem_shrink %d, %x, ofree %d, ma %d\n", nr_to_scan, gfp_mask, other_free, min_adj);
> >> >> -     rem = global_page_state(NR_ACTIVE) + global_page_state(NR_INACTIVE);
> >> >> +             lowmem_print(3, "lowmem_shrink %d, %x, ofree %d %d, ma %d\n", nr_to_scan, gfp_mask, other_free, other_file, min_adj);
> >> >> +     rem = global_page_state(NR_ACTIVE_ANON) +
> >> >> +             global_page_state(NR_ACTIVE_FILE) +
> >> >> +             global_page_state(NR_INACTIVE_ANON) +
> >> >> +             global_page_state(NR_INACTIVE_FILE);
> >> >
> >> > This looks like it's a compile fix since changes made to 4f98a2fe. This
> >> > should have been in a separate patch and prioritised.
> >>
> >> You are right. These patches were originally on 2.6.27 and I did not
> >> notice that the first patch was also broken when I fixed this patch
> >> for 2.6.29.
> >>
> >
> > Ok
> >
> >> >
> >> >>       if (nr_to_scan <= 0 || min_adj == OOM_ADJUST_MAX + 1) {
> >> >>               lowmem_print(5, "lowmem_shrink %d, %x, return %d\n", nr_to_scan, gfp_mask, rem);
> >> >>               return rem;
> >> >
> >
> > --
> > Mel Gorman
> > Part-time Phd Student                          Linux Technology Center
> > University of Limerick                         IBM Dublin Software Lab
> >
> 
> 
> 
> -- 
> Arve Hjønnevåg
> 

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Misleading OOM messages
  2009-05-14 21:41                   ` Dave Hansen
@ 2009-05-15 13:05                     ` Pavel Machek
  2009-05-15 17:59                       ` Christoph Lameter
  0 siblings, 1 reply; 95+ messages in thread
From: Pavel Machek @ 2009-05-15 13:05 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Christoph Lameter, David Rientjes, Andrew Morton,
	Greg Kroah-Hartman, Nick Piggin, Mel Gorman, Peter Ziljstra,
	San Mehat, Arve Hj?nnev?g, linux-kernel

Hi!

> > > > -	printk(KERN_ERR "%s: kill process %d (%s) score %li or a child\n",
> > > > +	printk(KERN_ERR "No available memory %s: "
> > > > +			"kill process %d (%s) score %li or a child\n",
> > > >  					message, task_pid_nr(p), p->comm, points);
> > > 
> > > "No available memory" still suggests that plugging in more memory is the
> > > right solution.
> > 
> > And... on correctly working kernel, it is, right?
> > 
> > If you have no swap space and too many applications, you plug more
> > memory. (Or invent some swap).
> > 
> > If you misconfigured cgroups, you give more memory to them.
> > 
> > If your applications mlocked 900MB and you have 1GB, you need to plug
> > more memory.
> > 
> > So... when is plugging more memory _not_ valid answer? AFAICT it is
> > when it is some kernel problem, resulting in memory not being
> > reclaimed fast enough....
> 
> I recently had a problem (~2.6.27) where the user did an mlock() of ~95%
> of memory then started doing ftp tests.  The machine also had 64k base
> pages.  We let you dirty ~30% of memory, so they were able to dirty 6x
> more memory than what we even had to work with.  We OOMed pretty fast
> every time.

Ok, so kernel should be fixed to make limits 30% of non-mlocked
memory.

> This is actually a pretty common scenario for the HPC and database
> folks.  They go sucking up and locking as much memory as they can get
> their hands on.  Adding memory never helps them because they'll use up
> whatever is there.

Well, but it is uncommon everywhere else. If you have desktop system,
job size is pretty much constant. If you have too little memory, you
OOM.

Even with mlock, if mlock size is constant (like 1GB), adding memory
will help.
									Pavel

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Misleading OOM messages
  2009-05-14 21:34                 ` Pavel Machek
  2009-05-14 21:41                   ` Dave Hansen
@ 2009-05-15 17:57                   ` Christoph Lameter
  2009-05-15 18:15                     ` Dave Hansen
  1 sibling, 1 reply; 95+ messages in thread
From: Christoph Lameter @ 2009-05-15 17:57 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Dave Hansen, David Rientjes, Andrew Morton, Greg Kroah-Hartman,
	Nick Piggin, Mel Gorman, Peter Ziljstra, San Mehat,
	Arve Hj?nnev?g, linux-kernel

On Thu, 14 May 2009, Pavel Machek wrote:

> > "No available memory" still suggests that plugging in more memory is the
> > right solution.
>
> And... on correctly working kernel, it is, right?

Nope. Usually something else is amiss if OOM occurs.

> If you have no swap space and too many applications, you plug more
> memory. (Or invent some swap).

Thats not a usual configuration. OOM there also depends on various OS
knobs. The failure occurred because application did anonymous allocations
and you did not give the OS a way to effectively push these pages out to
disk. Thus it was not able to reclaim memory.

> If you misconfigured cgroups, you give more memory to them.

If you do not have enough memory in a cgroup then your application should
slow down (because of page evictions) but the system should not OOM.
Are cgroups broken or why are you getting OOMs when using them?

> If your applications mlocked 900MB and you have 1GB, you need to plug
> more memory.

IMHO the mlocking is the issue. There are safeguards (ulimit) to prevent
this. Again a typical misconfiguration that requires disabling safeguards.
If you increase memory then more memory is likely going to be mlocked by
whoever went crazy with mlocking in the first place.

> So... when is plugging more memory _not_ valid answer? AFAICT it is
> when it is some kernel problem, resulting in memory not being
> reclaimed fast enough....

Reclaim failures occur typically because memory is not reclaimable due to
mlocking, memory allocation in a context where we cannot perform
effective reclaim (no disk access, atomic context) (device
drivers are prone to that), or when asking for higher order pages and the
defrag logic cannot satisfy your request.

Then there is the issue on 32 bit platforms where certain kernel
allocations must occur in the memory zone under 1G. If you add more memory
then less memory is available e under !G because the kernel needs to
allocate more metadata to manage more memory. Thus you OOM faster.

So I thin that an OOM is about misconfigurations or a kernel bug. If the
application needs more memory then the pageing mechanism of the OS should
create more virtual memory for the process.

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Misleading OOM messages
  2009-05-14 22:00                   ` David Rientjes
@ 2009-05-15 17:58                     ` Christoph Lameter
  2009-05-15 18:23                       ` Dave Hansen
  0 siblings, 1 reply; 95+ messages in thread
From: Christoph Lameter @ 2009-05-15 17:58 UTC (permalink / raw)
  To: David Rientjes
  Cc: Dave Hansen, Pavel Machek, Andrew Morton, Greg Kroah-Hartman,
	Nick Piggin, Mel Gorman, Peter Ziljstra, San Mehat,
	Arve Hj?nnev?g, linux-kernel

On Thu, 14 May 2009, David Rientjes wrote:

> > To me it at least adds the fact that more should be made *available* and
> > not just that you're out of it.  So, definitely not perfect, but better
> > than "out".
> >
>
> I think "no allowable memory" followed by information on what is and is
> not allowed in that specific context would remove any ambiguity.

Useful information to have. If a NUMA or cgroup restriction caused the
failure then we should print that out.


^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Misleading OOM messages
  2009-05-15 13:05                     ` Pavel Machek
@ 2009-05-15 17:59                       ` Christoph Lameter
  2009-05-15 18:22                         ` Dave Hansen
  2009-05-15 20:02                         ` Pavel Machek
  0 siblings, 2 replies; 95+ messages in thread
From: Christoph Lameter @ 2009-05-15 17:59 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Dave Hansen, David Rientjes, Andrew Morton, Greg Kroah-Hartman,
	Nick Piggin, Mel Gorman, Peter Ziljstra, San Mehat,
	Arve Hj?nnev?g, linux-kernel

On Fri, 15 May 2009, Pavel Machek wrote:

> Ok, so kernel should be fixed to make limits 30% of non-mlocked
> memory.

There is already ulimit.

> > folks.  They go sucking up and locking as much memory as they can get
> > their hands on.  Adding memory never helps them because they'll use up
> > whatever is there.
>
> Well, but it is uncommon everywhere else. If you have desktop system,
> job size is pretty much constant. If you have too little memory, you
> OOM.

Nope. If you have too little memory for your app then the kernel pages
portions of the app out to disk. Thats is why you have a VM (VIRTUAL
machine). The app is not running with physical memory.



^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Misleading OOM messages
  2009-05-15 17:57                   ` Christoph Lameter
@ 2009-05-15 18:15                     ` Dave Hansen
  2009-05-15 18:19                       ` Balbir Singh
  2009-05-15 19:26                       ` Christoph Lameter
  0 siblings, 2 replies; 95+ messages in thread
From: Dave Hansen @ 2009-05-15 18:15 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Pavel Machek, David Rientjes, Andrew Morton, Greg Kroah-Hartman,
	Nick Piggin, Mel Gorman, Peter Ziljstra, San Mehat,
	Arve Hj?nnev?g, linux-kernel, Balbir Singh

On Fri, 2009-05-15 at 13:57 -0400, Christoph Lameter wrote:
> > If you misconfigured cgroups, you give more memory to them.
> 
> If you do not have enough memory in a cgroup then your application should
> slow down (because of page evictions) but the system should not OOM.
> Are cgroups broken or why are you getting OOMs when using them?

See mm/oom_kill.c::mem_cgroup_out_of_memory().  A group itself can have
an OOM done on it.  It's not a system-wide oom.  We just need to ensure
that we continue to differentiate the cgroup-specific oom message from
the general one.  Maybe also include some more cgroup info in the debug
outbut.

-- Dave


^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Misleading OOM messages
  2009-05-15 18:15                     ` Dave Hansen
@ 2009-05-15 18:19                       ` Balbir Singh
  2009-05-15 19:26                       ` Christoph Lameter
  1 sibling, 0 replies; 95+ messages in thread
From: Balbir Singh @ 2009-05-15 18:19 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Christoph Lameter, Pavel Machek, David Rientjes, Andrew Morton,
	Greg Kroah-Hartman, Nick Piggin, Mel Gorman, Peter Ziljstra,
	San Mehat, Arve Hj?nnev?g, linux-kernel

* Dave Hansen <dave@linux.vnet.ibm.com> [2009-05-15 11:15:07]:

> On Fri, 2009-05-15 at 13:57 -0400, Christoph Lameter wrote:
> > > If you misconfigured cgroups, you give more memory to them.
> > 
> > If you do not have enough memory in a cgroup then your application should
> > slow down (because of page evictions) but the system should not OOM.
> > Are cgroups broken or why are you getting OOMs when using them?
> 
> See mm/oom_kill.c::mem_cgroup_out_of_memory().  A group itself can have
> an OOM done on it.  It's not a system-wide oom.  We just need to ensure
> that we continue to differentiate the cgroup-specific oom message from
> the general one.  Maybe also include some more cgroup info in the debug
> outbut.
>

A long pending action item is to allow user space to handle cgroup
OOM. A cgroup can OOM, if the tasks in the group is way over their limit
and reclaim fails. I'll look to see if we enhance the output.

-- 
	Balbir

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Misleading OOM messages
  2009-05-15 17:59                       ` Christoph Lameter
@ 2009-05-15 18:22                         ` Dave Hansen
  2009-05-15 19:29                           ` Christoph Lameter
  2009-05-15 20:02                         ` Pavel Machek
  1 sibling, 1 reply; 95+ messages in thread
From: Dave Hansen @ 2009-05-15 18:22 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Pavel Machek, David Rientjes, Andrew Morton, Greg Kroah-Hartman,
	Nick Piggin, Mel Gorman, Peter Ziljstra, San Mehat,
	Arve Hj?nnev?g, linux-kernel

On Fri, 2009-05-15 at 13:59 -0400, Christoph Lameter wrote:
> On Fri, 15 May 2009, Pavel Machek wrote:
> > Ok, so kernel should be fixed to make limits 30% of non-mlocked
> > memory.
> 
> There is already ulimit.

There is, but it gets overridden anyway since there are users that
"need" tons of mlock() space.  We also shouldn't be allowing things to
get into crazy configurations.  Having a max dirty ratio doesn't even
have an effect when you can dirty all unlocked memory several times over
and still not be hitting the max dirty mark.

Perhaps we should ensure that total RAM mlock()ed + vm_dirty_bytes is
always under total ram.  That'll force people to either stop mlocking or
do the safe thing which is drop vm_dirty_bytes.  

-- Dave

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Misleading OOM messages
  2009-05-15 17:58                     ` Christoph Lameter
@ 2009-05-15 18:23                       ` Dave Hansen
  2009-05-15 18:57                         ` Balbir Singh
  2009-05-15 19:37                         ` David Rientjes
  0 siblings, 2 replies; 95+ messages in thread
From: Dave Hansen @ 2009-05-15 18:23 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: David Rientjes, Pavel Machek, Andrew Morton, Greg Kroah-Hartman,
	Nick Piggin, Mel Gorman, San Mehat, Arve Hj?nnev?g, linux-kernel,
	Peter Zijlstra, Balbir Singh

On Fri, 2009-05-15 at 13:58 -0400, Christoph Lameter wrote:
> On Thu, 14 May 2009, David Rientjes wrote:
> > > To me it at least adds the fact that more should be made *available* and
> > > not just that you're out of it.  So, definitely not perfect, but better
> > > than "out".
> > >
> >
> > I think "no allowable memory" followed by information on what is and is
> > not allowed in that specific context would remove any ambiguity.
> 
> Useful information to have. If a NUMA or cgroup restriction caused the
> failure then we should print that out.

We get a wee bit of info out for the cgroups case at least:

void mem_cgroup_out_of_memory(struct mem_cgroup *mem, gfp_t gfp_mask)
{
...
        if (oom_kill_process(p, gfp_mask, 0, points, mem,
                                "Memory cgroup out of memory"))
                goto retry;

That can surely be improved, but it's a decent start.

-- Dave


^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Misleading OOM messages
  2009-05-15 18:23                       ` Dave Hansen
@ 2009-05-15 18:57                         ` Balbir Singh
  2009-05-15 19:37                         ` David Rientjes
  1 sibling, 0 replies; 95+ messages in thread
From: Balbir Singh @ 2009-05-15 18:57 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Christoph Lameter, David Rientjes, Pavel Machek, Andrew Morton,
	Greg Kroah-Hartman, Nick Piggin, Mel Gorman, San Mehat,
	Arve Hj?nnev?g, linux-kernel, Peter Zijlstra

* Dave Hansen <dave@linux.vnet.ibm.com> [2009-05-15 11:23:27]:

> On Fri, 2009-05-15 at 13:58 -0400, Christoph Lameter wrote:
> > On Thu, 14 May 2009, David Rientjes wrote:
> > > > To me it at least adds the fact that more should be made *available* and
> > > > not just that you're out of it.  So, definitely not perfect, but better
> > > > than "out".
> > > >
> > >
> > > I think "no allowable memory" followed by information on what is and is
> > > not allowed in that specific context would remove any ambiguity.
> > 
> > Useful information to have. If a NUMA or cgroup restriction caused the
> > failure then we should print that out.
> 
> We get a wee bit of info out for the cgroups case at least:
> 
> void mem_cgroup_out_of_memory(struct mem_cgroup *mem, gfp_t gfp_mask)
> {
> ...
>         if (oom_kill_process(p, gfp_mask, 0, points, mem,
>                                 "Memory cgroup out of memory"))
>                 goto retry;
> 
> That can surely be improved, but it's a decent start.
>

Also look at mem_cgroup_print_oom_info(). 

-- 
	Balbir

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Misleading OOM messages
  2009-05-15 18:15                     ` Dave Hansen
  2009-05-15 18:19                       ` Balbir Singh
@ 2009-05-15 19:26                       ` Christoph Lameter
  2009-05-15 20:31                         ` Balbir Singh
  1 sibling, 1 reply; 95+ messages in thread
From: Christoph Lameter @ 2009-05-15 19:26 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Pavel Machek, David Rientjes, Andrew Morton, Greg Kroah-Hartman,
	Nick Piggin, Mel Gorman, Peter Ziljstra, San Mehat,
	Arve Hj?nnev?g, linux-kernel, Balbir Singh

On Fri, 15 May 2009, Dave Hansen wrote:

> On Fri, 2009-05-15 at 13:57 -0400, Christoph Lameter wrote:
> > > If you misconfigured cgroups, you give more memory to them.
> >
> > If you do not have enough memory in a cgroup then your application should
> > slow down (because of page evictions) but the system should not OOM.
> > Are cgroups broken or why are you getting OOMs when using them?
>
> See mm/oom_kill.c::mem_cgroup_out_of_memory().  A group itself can have
> an OOM done on it.  It's not a system-wide oom.  We just need to ensure
> that we continue to differentiate the cgroup-specific oom message from
> the general one.  Maybe also include some more cgroup info in the debug
> outbut.

But that is a resource control isssue. Should also not say out of memory
but state that the cgroup memory limit was reached.


^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Misleading OOM messages
  2009-05-15 18:22                         ` Dave Hansen
@ 2009-05-15 19:29                           ` Christoph Lameter
  0 siblings, 0 replies; 95+ messages in thread
From: Christoph Lameter @ 2009-05-15 19:29 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Pavel Machek, David Rientjes, Andrew Morton, Greg Kroah-Hartman,
	Nick Piggin, Mel Gorman, Peter Ziljstra, San Mehat,
	Arve Hj?nnev?g, linux-kernel

On Fri, 15 May 2009, Dave Hansen wrote:

> Perhaps we should ensure that total RAM mlock()ed + vm_dirty_bytes is
> always under total ram.  That'll force people to either stop mlocking or
> do the safe thing which is drop vm_dirty_bytes.

Guess a global mlock ratio would be good? Set it at a reasonable limit of
10% and tell the user that the system may OOM if set too high.


^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Misleading OOM messages
  2009-05-15 18:23                       ` Dave Hansen
  2009-05-15 18:57                         ` Balbir Singh
@ 2009-05-15 19:37                         ` David Rientjes
  1 sibling, 0 replies; 95+ messages in thread
From: David Rientjes @ 2009-05-15 19:37 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Christoph Lameter, Pavel Machek, Andrew Morton,
	Greg Kroah-Hartman, Nick Piggin, Mel Gorman, San Mehat,
	Arve Hj?nnev?g, linux-kernel, Peter Zijlstra, Balbir Singh

On Fri, 15 May 2009, Dave Hansen wrote:

> We get a wee bit of info out for the cgroups case at least:
> 
> void mem_cgroup_out_of_memory(struct mem_cgroup *mem, gfp_t gfp_mask)
> {
> ...
>         if (oom_kill_process(p, gfp_mask, 0, points, mem,
>                                 "Memory cgroup out of memory"))
>                 goto retry;
> 
> That can surely be improved, but it's a decent start.
> 

Cpusets are also cgroups and have their own oom handling logic 
(CONSTRAINT_CPUSET and the penalization of the badness score for not 
sharing memory with current's set of allowed nodes).  In this case, we're 
interested in only the nodes set in cpuset_current_mems_allowed, for 
instance, and not the entire state of the machine for exclusive cpusets.

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Misleading OOM messages
  2009-05-15 17:59                       ` Christoph Lameter
  2009-05-15 18:22                         ` Dave Hansen
@ 2009-05-15 20:02                         ` Pavel Machek
  2009-05-15 21:15                           ` Christoph Lameter
  1 sibling, 1 reply; 95+ messages in thread
From: Pavel Machek @ 2009-05-15 20:02 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Dave Hansen, David Rientjes, Andrew Morton, Greg Kroah-Hartman,
	Nick Piggin, Mel Gorman, Peter Ziljstra, San Mehat,
	Arve Hj?nnev?g, linux-kernel

On Fri 2009-05-15 13:59:50, Christoph Lameter wrote:
> On Fri, 15 May 2009, Pavel Machek wrote:
> 
> > Ok, so kernel should be fixed to make limits 30% of non-mlocked
> > memory.
> 
> There is already ulimit.

...which does not work as described in the message you snipped.

> > > folks.  They go sucking up and locking as much memory as they can get
> > > their hands on.  Adding memory never helps them because they'll use up
> > > whatever is there.
> >
> > Well, but it is uncommon everywhere else. If you have desktop system,
> > job size is pretty much constant. If you have too little memory, you
> > OOM.
> 
> Nope. If you have too little memory for your app then the kernel pages
> portions of the app out to disk. Thats is why you have a VM (VIRTUAL
> machine). The app is not running with physical memory.

Try running your machine with mem=8M, then tell me how virtual memory
works.

If you have too little RAM+swap, you OOM. (Adding memory helps).

If you have *way* too little RAM, you OOM. (Kernel data is
unswappable. Task struct is 8KB. At some point it breaks).
									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Misleading OOM messages
  2009-05-15 19:26                       ` Christoph Lameter
@ 2009-05-15 20:31                         ` Balbir Singh
  2009-05-18 14:34                           ` Christoph Lameter
  0 siblings, 1 reply; 95+ messages in thread
From: Balbir Singh @ 2009-05-15 20:31 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Dave Hansen, Pavel Machek, David Rientjes, Andrew Morton,
	Greg Kroah-Hartman, Nick Piggin, Mel Gorman, Peter Ziljstra,
	San Mehat, Arve Hj?nnev?g, linux-kernel

* Christoph Lameter <cl@linux-foundation.org> [2009-05-15 15:26:43]:

> On Fri, 15 May 2009, Dave Hansen wrote:
> 
> > On Fri, 2009-05-15 at 13:57 -0400, Christoph Lameter wrote:
> > > > If you misconfigured cgroups, you give more memory to them.
> > >
> > > If you do not have enough memory in a cgroup then your application should
> > > slow down (because of page evictions) but the system should not OOM.
> > > Are cgroups broken or why are you getting OOMs when using them?
> >
> > See mm/oom_kill.c::mem_cgroup_out_of_memory().  A group itself can have
> > an OOM done on it.  It's not a system-wide oom.  We just need to ensure
> > that we continue to differentiate the cgroup-specific oom message from
> > the general one.  Maybe also include some more cgroup info in the debug
> > outbut.
> 
> But that is a resource control isssue. Should also not say out of memory
> but state that the cgroup memory limit was reached.
>

But we are out of memory for the cgroup under consideration. 

-- 
	Balbir

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Misleading OOM messages
  2009-05-15 20:02                         ` Pavel Machek
@ 2009-05-15 21:15                           ` Christoph Lameter
  2009-05-19 20:39                             ` Pavel Machek
  0 siblings, 1 reply; 95+ messages in thread
From: Christoph Lameter @ 2009-05-15 21:15 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Dave Hansen, David Rientjes, Andrew Morton, Greg Kroah-Hartman,
	Nick Piggin, Mel Gorman, Peter Ziljstra, San Mehat,
	Arve Hj?nnev?g, linux-kernel

On Fri, 15 May 2009, Pavel Machek wrote:

> > Nope. If you have too little memory for your app then the kernel pages
> > portions of the app out to disk. Thats is why you have a VM (VIRTUAL
> > machine). The app is not running with physical memory.
>
> Try running your machine with mem=8M, then tell me how virtual memory
> works.

Well that is of course not enough memory. And you intentionally screwed up
your setup.

> If you have *way* too little RAM, you OOM. (Kernel data is
> unswappable. Task struct is 8KB. At some point it breaks).

I do not see anyone doubting that this is the case.


^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Misleading OOM messages
  2009-05-15 20:31                         ` Balbir Singh
@ 2009-05-18 14:34                           ` Christoph Lameter
  2009-05-18 15:45                             ` Balbir Singh
  0 siblings, 1 reply; 95+ messages in thread
From: Christoph Lameter @ 2009-05-18 14:34 UTC (permalink / raw)
  To: Balbir Singh
  Cc: Dave Hansen, Pavel Machek, David Rientjes, Andrew Morton,
	Greg Kroah-Hartman, Nick Piggin, Mel Gorman, Peter Ziljstra,
	San Mehat, Arve Hj?nnev?g, linux-kernel

On Sat, 16 May 2009, Balbir Singh wrote:

> > But that is a resource control isssue. Should also not say out of memory
> > but state that the cgroup memory limit was reached.
> >
>
> But we are out of memory for the cgroup under consideration.

Not necessarily. This only occurs if the cgroup cannot swap or if
allocations are done that are not swappable.


^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Misleading OOM messages
  2009-05-18 14:34                           ` Christoph Lameter
@ 2009-05-18 15:45                             ` Balbir Singh
  0 siblings, 0 replies; 95+ messages in thread
From: Balbir Singh @ 2009-05-18 15:45 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Dave Hansen, Pavel Machek, David Rientjes, Andrew Morton,
	Greg Kroah-Hartman, Nick Piggin, Mel Gorman, Peter Ziljstra,
	San Mehat, Arve Hj?nnev?g, linux-kernel

* Christoph Lameter <cl@linux-foundation.org> [2009-05-18 10:34:22]:

> On Sat, 16 May 2009, Balbir Singh wrote:
> 
> > > But that is a resource control isssue. Should also not say out of memory
> > > but state that the cgroup memory limit was reached.
> > >
> >
> > But we are out of memory for the cgroup under consideration.
> 
> Not necessarily. This only occurs if the cgroup cannot swap or if
> allocations are done that are not swappable.
>

Isn't that the equivalent of out of memory for the system? What would
you like the OOM message to state? 

-- 
	Balbir

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Misleading OOM messages
  2009-05-15 21:15                           ` Christoph Lameter
@ 2009-05-19 20:39                             ` Pavel Machek
  2009-05-22 13:53                               ` Christoph Lameter
  2009-05-22 19:01                               ` Misleading OOM messages Christoph Lameter
  0 siblings, 2 replies; 95+ messages in thread
From: Pavel Machek @ 2009-05-19 20:39 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Dave Hansen, David Rientjes, Andrew Morton, Greg Kroah-Hartman,
	Nick Piggin, Mel Gorman, Peter Ziljstra, San Mehat,
	Arve Hj?nnev?g, linux-kernel

On Fri 2009-05-15 17:15:54, Christoph Lameter wrote:
> On Fri, 15 May 2009, Pavel Machek wrote:
> 
> > > Nope. If you have too little memory for your app then the kernel pages
> > > portions of the app out to disk. Thats is why you have a VM (VIRTUAL
> > > machine). The app is not running with physical memory.
> >
> > Try running your machine with mem=8M, then tell me how virtual memory
> > works.
> 
> Well that is of course not enough memory.

Ok, so in the end, there are two reasons for OOM:

1) Out of virtual memory.

   there's simply not enough ram+swap to fit the data. You go OOM.
   This seems to be common on small machines. 8M is pushing it, but
   64M ram + 64M swap + todays gnome would probably do that.

   And maybe the way to hint people would be printing 'out of
   _virtual_ memory'.

2) Something goes very wrong with reclaim

   this seems to be common on very  big machines you have experience
   with.

Perhaps 1 and 2 can be told appart by zero swap free in the 1) case?
And perhaps you can invent some better message for 2) case?
								Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Misleading OOM messages
  2009-05-19 20:39                             ` Pavel Machek
@ 2009-05-22 13:53                               ` Christoph Lameter
  2009-05-22 14:17                                 ` Warn when we run out of swap space (was Re: Misleading OOM messages) Christoph Lameter
  2009-05-22 19:01                               ` Misleading OOM messages Christoph Lameter
  1 sibling, 1 reply; 95+ messages in thread
From: Christoph Lameter @ 2009-05-22 13:53 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Dave Hansen, David Rientjes, Andrew Morton, Greg Kroah-Hartman,
	Nick Piggin, Mel Gorman, Peter Ziljstra, San Mehat,
	Arve Hj?nnev?g, linux-kernel

On Tue, 19 May 2009, Pavel Machek wrote:

> > Well that is of course not enough memory.
>
> Ok, so in the end, there are two reasons for OOM:
>
> 1) Out of virtual memory.
>
>    there's simply not enough ram+swap to fit the data. You go OOM.
>    This seems to be common on small machines. 8M is pushing it, but
>    64M ram + 64M swap + todays gnome would probably do that.
>
>    And maybe the way to hint people would be printing 'out of
>    _virtual_ memory'.

This is only an issue for anonymous page and is therefore load dependent.
Memory can be provided through additional swap space.

> 2) Something goes very wrong with reclaim
>
>    this seems to be common on very  big machines you have experience
>    with.
>
> Perhaps 1 and 2 can be told appart by zero swap free in the 1) case?

We could add a message spitting out a warning in get_swap_page() to cover
the "out of memory" case. Would be triggered once only when we first run
out of swap space.

> And perhaps you can invent some better message for 2) case?

The something-goes-wrong with reclaim occurs for a variety of reasons
on other machines. Even on the small machine that I currently work with.
I am not in the embedded space right now so this likely means that I do
not see the out of swap -> OOM condition. The out of memory issues that
I see are misconfigurations on a varity of levels. On top right now is
running out of memory on 32 bit machines since someone put too much memory
into them. Thus ZONE_NORMAL gets exhausted.

Then they add more memory and therefore OOM occurs faster. Which leaves
them somewhat confused.

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Warn when we run out of swap space (was Re: Misleading OOM messages)
  2009-05-22 13:53                               ` Christoph Lameter
@ 2009-05-22 14:17                                 ` Christoph Lameter
  2009-05-22 14:56                                   ` Pavel Machek
  0 siblings, 1 reply; 95+ messages in thread
From: Christoph Lameter @ 2009-05-22 14:17 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Dave Hansen, David Rientjes, Andrew Morton, Greg Kroah-Hartman,
	Nick Piggin, Mel Gorman, Peter Ziljstra, San Mehat,
	Arve Hj?nnev?g, linux-kernel


Subject: Warn if we run out of swap space

Running out of swap space means that the evicton of anonymous pages may no longer
be possible which can lead to OOM conditions.

Print a warning when swap space first becomes exhausted.

Signed-off-by: Christoph Lameter <cl@linux-foundation.org>

---
 mm/swapfile.c |    1 +
 1 file changed, 1 insertion(+)

Index: linux-2.6/mm/swapfile.c
===================================================================
--- linux-2.6.orig/mm/swapfile.c	2009-05-22 09:07:40.000000000 -0500
+++ linux-2.6/mm/swapfile.c	2009-05-22 09:15:59.000000000 -0500
@@ -412,6 +412,7 @@ swp_entry_t get_swap_page(void)
 	nr_swap_pages++;
 noswap:
 	spin_unlock(&swap_lock);
+	WARN_ONCE(1, "All of swap is in use. Some pages cannot be swapped out.");
 	return (swp_entry_t) {0};
 }


^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Warn when we run out of swap space (was Re: Misleading OOM messages)
  2009-05-22 14:17                                 ` Warn when we run out of swap space (was Re: Misleading OOM messages) Christoph Lameter
@ 2009-05-22 14:56                                   ` Pavel Machek
  0 siblings, 0 replies; 95+ messages in thread
From: Pavel Machek @ 2009-05-22 14:56 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Dave Hansen, David Rientjes, Andrew Morton, Greg Kroah-Hartman,
	Nick Piggin, Mel Gorman, Peter Ziljstra, San Mehat,
	Arve Hj?nnev?g, linux-kernel

On Fri 2009-05-22 10:17:15, Christoph Lameter wrote:
> 
> Subject: Warn if we run out of swap space
> 
> Running out of swap space means that the evicton of anonymous pages may no longer
> be possible which can lead to OOM conditions.
> 
> Print a warning when swap space first becomes exhausted.
> 
> Signed-off-by: Christoph Lameter <cl@linux-foundation.org>

WARN_ONCE... will it mean a backtrace? That's quite an overkill for
something that is not a kernel fault (and where backtrace is useless).

But  yes, I agree in principle.
									Pavel

> @@ -412,6 +412,7 @@ swp_entry_t get_swap_page(void)
>  	nr_swap_pages++;
>  noswap:
>  	spin_unlock(&swap_lock);
> +	WARN_ONCE(1, "All of swap is in use. Some pages cannot be swapped out.");
>  	return (swp_entry_t) {0};
>  }

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Misleading OOM messages
  2009-05-19 20:39                             ` Pavel Machek
  2009-05-22 13:53                               ` Christoph Lameter
@ 2009-05-22 19:01                               ` Christoph Lameter
  2009-05-22 19:40                                 ` Randy Dunlap
  2009-05-22 21:43                                 ` Alan Cox
  1 sibling, 2 replies; 95+ messages in thread
From: Christoph Lameter @ 2009-05-22 19:01 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Dave Hansen, David Rientjes, Andrew Morton, Greg Kroah-Hartman,
	Nick Piggin, Mel Gorman, Peter Ziljstra, San Mehat,
	Arve Hj?nnev?g, linux-kernel

Subject: Remove misleading kernel log entries about "Out of Memory" conditions

What we traditionally call an "out of memory" failure is mostly not really
related to having enough physical memory. "out of memory" occurs when the
memory reclaim attempts fail to provide enough memory for an allocation.

Typically there is a misconfiguration or kernel bug that is at the root
of an out of memory issue. The message suggests that the machine does
not have enough memory which is not true.

People have done strange things as a result of these messages. Some
put more physical memory into their machines others limit the memory
use of their applications with ulimit. Having a clear message avoids
these reactions.

So change the messages to describe what actually went wrong.

Signed-off-by: Christoph Lameter <cl@linux-foundation.org>

---
 mm/oom_kill.c |   14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

Index: linux-2.6/mm/oom_kill.c
===================================================================
--- linux-2.6.orig/mm/oom_kill.c	2009-05-12 12:37:52.000000000 -0500
+++ linux-2.6/mm/oom_kill.c	2009-05-12 12:44:36.000000000 -0500
@@ -387,7 +387,7 @@ static int oom_kill_process(struct task_
 	struct task_struct *c;

 	if (printk_ratelimit()) {
-		printk(KERN_WARNING "%s invoked oom-killer: "
+		printk(KERN_WARNING "%s invoked process-killer: "
 			"gfp_mask=0x%x, order=%d, oomkilladj=%d\n",
 			current->comm, gfp_mask, order, current->oomkilladj);
 		task_lock(current);
@@ -519,7 +519,7 @@ static void __out_of_memory(gfp_t gfp_ma

 	if (sysctl_oom_kill_allocating_task)
 		if (!oom_kill_process(current, gfp_mask, order, 0, NULL,
-				"Out of memory (oom_kill_allocating_task)"))
+				"Failure to reclaim enough memory (oom_kill_allocating_task)"))
 			return;
 retry:
 	/*
@@ -534,11 +534,11 @@ retry:
 	/* Found nothing?!?! Either we hang forever, or we panic. */
 	if (!p) {
 		read_unlock(&tasklist_lock);
-		panic("Out of memory and no killable processes...\n");
+		panic("Failure to reclaim enough memory and no killable processes...\n");
 	}

 	if (oom_kill_process(p, gfp_mask, order, points, NULL,
-			     "Out of memory"))
+			     "Memory reclaim failure"))
 		goto retry;
 }

@@ -563,7 +563,7 @@ void pagefault_out_of_memory(void)
 		goto rest_and_return;

 	if (sysctl_panic_on_oom)
-		panic("out of memory from page fault. panic_on_oom is selected.\n");
+		panic("failure to reclaim enough memory. panic_on_oom is selected.\n");

 	read_lock(&tasklist_lock);
 	__out_of_memory(0, 0); /* unknown gfp_mask and order */
@@ -600,7 +600,7 @@ void out_of_memory(struct zonelist *zone
 		return;

 	if (sysctl_panic_on_oom == 2)
-		panic("out of memory. Compulsory panic_on_oom is selected.\n");
+		panic("failure to reclaim enough memory. Compulsory panic_on_oom is selected.\n");

 	/*
 	 * Check if there were limitations on the allocation (only relevant for
@@ -617,7 +617,7 @@ void out_of_memory(struct zonelist *zone

 	case CONSTRAINT_NONE:
 		if (sysctl_panic_on_oom)
-			panic("out of memory. panic_on_oom is selected\n");
+			panic("failure to enough reclaim memory. panic_on_oom is selected\n");
 		/* Fall-through */
 	case CONSTRAINT_CPUSET:
 		__out_of_memory(gfp_mask, order);

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Misleading OOM messages
  2009-05-22 19:01                               ` Misleading OOM messages Christoph Lameter
@ 2009-05-22 19:40                                 ` Randy Dunlap
  2009-05-22 19:44                                   ` Christoph Lameter
  2009-05-22 21:43                                 ` Alan Cox
  1 sibling, 1 reply; 95+ messages in thread
From: Randy Dunlap @ 2009-05-22 19:40 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Pavel Machek, Dave Hansen, David Rientjes, Andrew Morton,
	Greg Kroah-Hartman, Nick Piggin, Mel Gorman, Peter Ziljstra,
	San Mehat, Arve Hj?nnev?g, linux-kernel

Christoph Lameter wrote:
> Subject: Remove misleading kernel log entries about "Out of Memory" conditions
> 
> What we traditionally call an "out of memory" failure is mostly not really
> related to having enough physical memory. "out of memory" occurs when the
> memory reclaim attempts fail to provide enough memory for an allocation.
> 
> Typically there is a misconfiguration or kernel bug that is at the root
> of an out of memory issue. The message suggests that the machine does
> not have enough memory which is not true.
> 
> People have done strange things as a result of these messages. Some
> put more physical memory into their machines others limit the memory
> use of their applications with ulimit. Having a clear message avoids
> these reactions.
> 
> So change the messages to describe what actually went wrong.
> 
> Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
> 
> ---
>  mm/oom_kill.c |   14 +++++++-------
>  1 file changed, 7 insertions(+), 7 deletions(-)
> 
> Index: linux-2.6/mm/oom_kill.c
> ===================================================================
> --- linux-2.6.orig/mm/oom_kill.c	2009-05-12 12:37:52.000000000 -0500
> +++ linux-2.6/mm/oom_kill.c	2009-05-12 12:44:36.000000000 -0500
> @@ -387,7 +387,7 @@ static int oom_kill_process(struct task_
>  	struct task_struct *c;
> 
>  	if (printk_ratelimit()) {
> -		printk(KERN_WARNING "%s invoked oom-killer: "
> +		printk(KERN_WARNING "%s invoked process-killer: "
>  			"gfp_mask=0x%x, order=%d, oomkilladj=%d\n",
>  			current->comm, gfp_mask, order, current->oomkilladj);
>  		task_lock(current);
> @@ -519,7 +519,7 @@ static void __out_of_memory(gfp_t gfp_ma
> 
>  	if (sysctl_oom_kill_allocating_task)
>  		if (!oom_kill_process(current, gfp_mask, order, 0, NULL,
> -				"Out of memory (oom_kill_allocating_task)"))
> +				"Failure to reclaim enough memory (oom_kill_allocating_task)"))
>  			return;
>  retry:
>  	/*
> @@ -534,11 +534,11 @@ retry:
>  	/* Found nothing?!?! Either we hang forever, or we panic. */
>  	if (!p) {
>  		read_unlock(&tasklist_lock);
> -		panic("Out of memory and no killable processes...\n");
> +		panic("Failure to reclaim enough memory and no killable processes...\n");
>  	}
> 
>  	if (oom_kill_process(p, gfp_mask, order, points, NULL,
> -			     "Out of memory"))
> +			     "Memory reclaim failure"))
>  		goto retry;
>  }
> 
> @@ -563,7 +563,7 @@ void pagefault_out_of_memory(void)
>  		goto rest_and_return;
> 
>  	if (sysctl_panic_on_oom)
> -		panic("out of memory from page fault. panic_on_oom is selected.\n");
> +		panic("failure to reclaim enough memory. panic_on_oom is selected.\n");
> 
>  	read_lock(&tasklist_lock);
>  	__out_of_memory(0, 0); /* unknown gfp_mask and order */
> @@ -600,7 +600,7 @@ void out_of_memory(struct zonelist *zone
>  		return;
> 
>  	if (sysctl_panic_on_oom == 2)
> -		panic("out of memory. Compulsory panic_on_oom is selected.\n");
> +		panic("failure to reclaim enough memory. Compulsory panic_on_oom is selected.\n");
> 
>  	/*
>  	 * Check if there were limitations on the allocation (only relevant for
> @@ -617,7 +617,7 @@ void out_of_memory(struct zonelist *zone
> 
>  	case CONSTRAINT_NONE:
>  		if (sysctl_panic_on_oom)
> -			panic("out of memory. panic_on_oom is selected\n");
> +			panic("failure to enough reclaim memory. panic_on_oom is selected\n");

huh?

But I think that normal users won't know what reclaiming memory is anyway,
so the patch doesn't help IMO.

>  		/* Fall-through */
>  	case CONSTRAINT_CPUSET:
>  		__out_of_memory(gfp_mask, order);


-- 
~Randy
LPC 2009, Sept. 23-25, Portland, Oregon
http://linuxplumbersconf.org/2009/

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Misleading OOM messages
  2009-05-22 19:40                                 ` Randy Dunlap
@ 2009-05-22 19:44                                   ` Christoph Lameter
  2009-05-22 21:45                                     ` Alan Cox
  0 siblings, 1 reply; 95+ messages in thread
From: Christoph Lameter @ 2009-05-22 19:44 UTC (permalink / raw)
  To: Randy Dunlap
  Cc: Pavel Machek, Dave Hansen, David Rientjes, Andrew Morton,
	Greg Kroah-Hartman, Nick Piggin, Mel Gorman, Peter Ziljstra,
	San Mehat, Arve Hj?nnev?g, linux-kernel

On Fri, 22 May 2009, Randy Dunlap wrote:

> But I think that normal users won't know what reclaiming memory is anyway,
> so the patch doesn't help IMO.

If they dont know then they may ask someone who knows and/or report a bug.


^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Misleading OOM messages
  2009-05-22 19:01                               ` Misleading OOM messages Christoph Lameter
  2009-05-22 19:40                                 ` Randy Dunlap
@ 2009-05-22 21:43                                 ` Alan Cox
  1 sibling, 0 replies; 95+ messages in thread
From: Alan Cox @ 2009-05-22 21:43 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Pavel Machek, Dave Hansen, David Rientjes, Andrew Morton,
	Greg Kroah-Hartman, Nick Piggin, Mel Gorman, Peter Ziljstra,
	San Mehat, Arve Hj?nnev?g, linux-kernel

On Fri, 22 May 2009 15:01:23 -0400 (EDT)
Christoph Lameter <cl@linux-foundation.org> wrote:

> Subject: Remove misleading kernel log entries about "Out of Memory" conditions
> 
> What we traditionally call an "out of memory" failure is mostly not really
> related to having enough physical memory. "out of memory" occurs when the
> memory reclaim attempts fail to provide enough memory for an allocation.

That looks even more misleading. It now sounds like something you should
immediately rush out and file in bugzilla.

Not a good change at all

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: Misleading OOM messages
  2009-05-22 19:44                                   ` Christoph Lameter
@ 2009-05-22 21:45                                     ` Alan Cox
  0 siblings, 0 replies; 95+ messages in thread
From: Alan Cox @ 2009-05-22 21:45 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Randy Dunlap, Pavel Machek, Dave Hansen, David Rientjes,
	Andrew Morton, Greg Kroah-Hartman, Nick Piggin, Mel Gorman,
	Peter Ziljstra, San Mehat, Arve Hj?nnev?g, linux-kernel

On Fri, 22 May 2009 15:44:37 -0400 (EDT)
Christoph Lameter <cl@linux-foundation.org> wrote:

> On Fri, 22 May 2009, Randy Dunlap wrote:
> 
> > But I think that normal users won't know what reclaiming memory is anyway,
> > so the patch doesn't help IMO.
> 
> If they dont know then they may ask someone who knows and/or report a bug.

So 200,000 people report it to their distribution mailing lists....

Yeah right - not very clever change really

^ permalink raw reply	[flat|nested] 95+ messages in thread

end of thread, other threads:[~2009-05-22 21:45 UTC | newest]

Thread overview: 95+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-05-10 22:07 [patch 01/11 -mmotm] lowmemorykiller: Only iterate over process list when needed David Rientjes
2009-05-10 22:07 ` [patch 02/11 -mmotm] lowmemorykiller: Don't count free space unless it meets the specified limit by itself David Rientjes
2009-05-12  9:23   ` Mel Gorman
2009-05-13  0:27     ` Arve Hjønnevåg
2009-05-13  9:42       ` Mel Gorman
2009-05-14 23:25         ` Arve Hjønnevåg
2009-05-15  9:18           ` Mel Gorman
2009-05-10 22:07 ` [patch 03/11 -mmotm] oom: cleanup android low memory killer David Rientjes
2009-05-10 22:07 ` [patch 04/11 -mmotm] oom: fix possible android low memory killer NULL pointer David Rientjes
2009-05-10 22:07 ` [patch 05/11 -mmotm] oom: fix possible oom_dump_tasks " David Rientjes
2009-05-11 21:11   ` Andrew Morton
2009-05-11 21:28     ` David Rientjes
2009-05-11 21:41       ` Greg KH
2009-05-11 22:05         ` David Rientjes
2009-05-12  9:38   ` Mel Gorman
2009-05-10 22:07 ` [patch 06/11 -mmotm] oom: move oom_adj value from task_struct to mm_struct David Rientjes
2009-05-11  0:17   ` KOSAKI Motohiro
2009-05-11  0:26     ` David Rientjes
2009-05-11  1:47       ` KOSAKI Motohiro
2009-05-11  8:43         ` David Rientjes
2009-05-11 21:19           ` Andrew Morton
2009-05-12  9:56   ` Mel Gorman
2009-05-10 22:07 ` [patch 07/11 -mmotm] oom: prevent possible OOM_DISABLE livelock David Rientjes
2009-05-11 21:22   ` Andrew Morton
2009-05-10 22:07 ` [patch 08/11 -mmotm] oom: invoke oom killer for __GFP_NOFAIL David Rientjes
2009-05-10 23:59   ` KOSAKI Motohiro
2009-05-11  0:24     ` David Rientjes
2009-05-11  1:45       ` KOSAKI Motohiro
2009-05-11  7:40         ` Minchan Kim
2009-05-11  8:49           ` David Rientjes
2009-05-11 11:23             ` Minchan Kim
2009-05-11  8:45         ` David Rientjes
2009-05-11 16:03           ` Dave Hansen
2009-05-11 19:09             ` David Rientjes
2009-05-11 19:45               ` Dave Hansen
2009-05-11 20:21                 ` David Rientjes
2009-05-11 21:29   ` Andrew Morton
2009-05-11 21:45     ` David Rientjes
2009-05-11 22:11       ` Andrew Morton
2009-05-11 22:31         ` David Rientjes
2009-05-11 22:46           ` Andrew Morton
2009-05-11 23:00             ` David Rientjes
2009-05-11 23:14               ` Andrew Morton
2009-05-11 23:37                 ` David Rientjes
2009-05-12  5:39         ` Peter Zijlstra
2009-05-12 11:36           ` KOSAKI Motohiro
2009-05-12 10:05         ` Mel Gorman
2009-05-10 22:07 ` [patch 09/11 -mmotm] oom: return vm size of oom killed task David Rientjes
2009-05-11 21:36   ` Andrew Morton
2009-05-10 22:07 ` [patch 10/11 -mmotm] oom: avoid oom kill if no interruptible tasks David Rientjes
2009-05-11 21:39   ` Andrew Morton
2009-05-11 23:08     ` David Rientjes
2009-05-10 22:07 ` [patch 11/11 -mmotm] oom: fail allocations if oom killer can't free memory David Rientjes
2009-05-12 21:14   ` Misleading OOM messages Christoph Lameter
2009-05-14  9:29     ` Pavel Machek
2009-05-14 19:46       ` Christoph Lameter
2009-05-14 20:38         ` Dave Hansen
2009-05-14 20:49           ` Christoph Lameter
2009-05-14 20:49           ` David Rientjes
2009-05-14 21:05             ` Dave Hansen
2009-05-14 21:12               ` David Rientjes
2009-05-14 21:30               ` Christoph Lameter
2009-05-14 21:34                 ` Pavel Machek
2009-05-14 21:41                   ` Dave Hansen
2009-05-15 13:05                     ` Pavel Machek
2009-05-15 17:59                       ` Christoph Lameter
2009-05-15 18:22                         ` Dave Hansen
2009-05-15 19:29                           ` Christoph Lameter
2009-05-15 20:02                         ` Pavel Machek
2009-05-15 21:15                           ` Christoph Lameter
2009-05-19 20:39                             ` Pavel Machek
2009-05-22 13:53                               ` Christoph Lameter
2009-05-22 14:17                                 ` Warn when we run out of swap space (was Re: Misleading OOM messages) Christoph Lameter
2009-05-22 14:56                                   ` Pavel Machek
2009-05-22 19:01                               ` Misleading OOM messages Christoph Lameter
2009-05-22 19:40                                 ` Randy Dunlap
2009-05-22 19:44                                   ` Christoph Lameter
2009-05-22 21:45                                     ` Alan Cox
2009-05-22 21:43                                 ` Alan Cox
2009-05-15 17:57                   ` Christoph Lameter
2009-05-15 18:15                     ` Dave Hansen
2009-05-15 18:19                       ` Balbir Singh
2009-05-15 19:26                       ` Christoph Lameter
2009-05-15 20:31                         ` Balbir Singh
2009-05-18 14:34                           ` Christoph Lameter
2009-05-18 15:45                             ` Balbir Singh
2009-05-14 21:37                 ` Dave Hansen
2009-05-14 22:00                   ` David Rientjes
2009-05-15 17:58                     ` Christoph Lameter
2009-05-15 18:23                       ` Dave Hansen
2009-05-15 18:57                         ` Balbir Singh
2009-05-15 19:37                         ` David Rientjes
2009-05-14 20:56         ` Pavel Machek
2009-05-12  9:09 ` [patch 01/11 -mmotm] lowmemorykiller: Only iterate over process list when needed Mel Gorman
2009-05-13  0:43   ` Arve Hjønnevåg

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.