linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [patch 1/7] lowmemorykiller: Only iterate over process list when needed.
@ 2009-05-05  0:26 David Rientjes
  2009-05-05  0:26 ` [patch 2/7] lowmemorykiller: Don't count free space unless it meets the specified limit by itself David Rientjes
                   ` (6 more replies)
  0 siblings, 7 replies; 14+ messages in thread
From: David Rientjes @ 2009-05-05  0:26 UTC (permalink / raw)
  To: Andrew Morton, Greg Kroah-Hartman
  Cc: Nick Piggin, San Mehat, Arve Hjønnevåg, linux-kernel

[-- Attachment #1: Type: TEXT/PLAIN, Size: 2249 bytes --]

From: Arve Hjønnevåg <arve@android.com>

Use NR_ACTIVE plus NR_INACTIVE as a size estimate for our fake cache
instead the sum of rss. Neither method is accurate.

Also skip the process scan, if the amount of memory available is above
the largest threshold set.

Signed-off-by: Arve Hjønnevåg <arve@android.com>
---
 drivers/staging/android/lowmemorykiller.c |   35 +++++++++++++++++-----------
 1 files changed, 21 insertions(+), 14 deletions(-)

diff --git a/drivers/staging/android/lowmemorykiller.c b/drivers/staging/android/lowmemorykiller.c
--- a/drivers/staging/android/lowmemorykiller.c
+++ b/drivers/staging/android/lowmemorykiller.c
@@ -71,23 +71,30 @@ static int lowmem_shrink(int nr_to_scan, gfp_t gfp_mask)
 	}
 	if(nr_to_scan > 0)
 		lowmem_print(3, "lowmem_shrink %d, %x, ofree %d, ma %d\n", nr_to_scan, gfp_mask, other_free, min_adj);
+	rem = global_page_state(NR_ACTIVE) + global_page_state(NR_INACTIVE);
+	if (nr_to_scan <= 0 || min_adj == OOM_ADJUST_MAX + 1) {
+		lowmem_print(5, "lowmem_shrink %d, %x, return %d\n", nr_to_scan, gfp_mask, rem);
+		return rem;
+	}
+
 	read_lock(&tasklist_lock);
 	for_each_process(p) {
-		if(p->oomkilladj >= 0 && p->mm) {
-			tasksize = get_mm_rss(p->mm);
-			if(nr_to_scan > 0 && tasksize > 0 && p->oomkilladj >= min_adj) {
-				if(selected == NULL ||
-				   p->oomkilladj > selected->oomkilladj ||
-				   (p->oomkilladj == selected->oomkilladj &&
-				    tasksize > selected_tasksize)) {
-					selected = p;
-					selected_tasksize = tasksize;
-					lowmem_print(2, "select %d (%s), adj %d, size %d, to kill\n",
-					             p->pid, p->comm, p->oomkilladj, tasksize);
-				}
-			}
-			rem += tasksize;
+		if (p->oomkilladj < min_adj || !p->mm)
+			continue;
+		tasksize = get_mm_rss(p->mm);
+		if (tasksize <= 0)
+			continue;
+		if (selected) {
+			if (p->oomkilladj < selected->oomkilladj)
+				continue;
+			if (p->oomkilladj == selected->oomkilladj &&
+			    tasksize <= selected_tasksize)
+				continue;
 		}
+		selected = p;
+		selected_tasksize = tasksize;
+		lowmem_print(2, "select %d (%s), adj %d, size %d, to kill\n",
+		             p->pid, p->comm, p->oomkilladj, tasksize);
 	}
 	if(selected != NULL) {
 		lowmem_print(1, "send sigkill to %d (%s), adj %d, size %d\n",

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [patch 2/7] lowmemorykiller: Don't count free space unless it meets the specified limit by itself
  2009-05-05  0:26 [patch 1/7] lowmemorykiller: Only iterate over process list when needed David Rientjes
@ 2009-05-05  0:26 ` David Rientjes
  2009-05-05  0:26 ` [patch 3/7] oom: cleanup android low memory killer David Rientjes
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 14+ messages in thread
From: David Rientjes @ 2009-05-05  0:26 UTC (permalink / raw)
  To: Andrew Morton, Greg Kroah-Hartman
  Cc: Nick Piggin, San Mehat, Arve Hjønnevåg, linux-kernel

[-- Attachment #1: Type: TEXT/PLAIN, Size: 1897 bytes --]

From: Arve Hjønnevåg <arve@android.com>

This allows processes to be killed when the kernel evict cache pages in
an attempt to get more contiguous free memory.

Signed-off-by: Arve Hjønnevåg <arve@android.com>
---
 drivers/staging/android/lowmemorykiller.c |   13 +++++++++----
 1 files changed, 9 insertions(+), 4 deletions(-)

diff --git a/drivers/staging/android/lowmemorykiller.c b/drivers/staging/android/lowmemorykiller.c
--- a/drivers/staging/android/lowmemorykiller.c
+++ b/drivers/staging/android/lowmemorykiller.c
@@ -58,20 +58,25 @@ static int lowmem_shrink(int nr_to_scan, gfp_t gfp_mask)
 	int min_adj = OOM_ADJUST_MAX + 1;
 	int selected_tasksize = 0;
 	int array_size = ARRAY_SIZE(lowmem_adj);
-	int other_free = global_page_state(NR_FREE_PAGES) + global_page_state(NR_FILE_PAGES);
+	int other_free = global_page_state(NR_FREE_PAGES);
+	int other_file = global_page_state(NR_FILE_PAGES);
 	if(lowmem_adj_size < array_size)
 		array_size = lowmem_adj_size;
 	if(lowmem_minfree_size < array_size)
 		array_size = lowmem_minfree_size;
 	for(i = 0; i < array_size; i++) {
-		if(other_free < lowmem_minfree[i]) {
+		if (other_free < lowmem_minfree[i] &&
+		    other_file < lowmem_minfree[i]) {
 			min_adj = lowmem_adj[i];
 			break;
 		}
 	}
 	if(nr_to_scan > 0)
-		lowmem_print(3, "lowmem_shrink %d, %x, ofree %d, ma %d\n", nr_to_scan, gfp_mask, other_free, min_adj);
-	rem = global_page_state(NR_ACTIVE) + global_page_state(NR_INACTIVE);
+		lowmem_print(3, "lowmem_shrink %d, %x, ofree %d %d, ma %d\n", nr_to_scan, gfp_mask, other_free, other_file, min_adj);
+	rem = global_page_state(NR_ACTIVE_ANON) +
+		global_page_state(NR_ACTIVE_FILE) +
+		global_page_state(NR_INACTIVE_ANON) +
+		global_page_state(NR_INACTIVE_FILE);
 	if (nr_to_scan <= 0 || min_adj == OOM_ADJUST_MAX + 1) {
 		lowmem_print(5, "lowmem_shrink %d, %x, return %d\n", nr_to_scan, gfp_mask, rem);
 		return rem;

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [patch 3/7] oom: cleanup android low memory killer
  2009-05-05  0:26 [patch 1/7] lowmemorykiller: Only iterate over process list when needed David Rientjes
  2009-05-05  0:26 ` [patch 2/7] lowmemorykiller: Don't count free space unless it meets the specified limit by itself David Rientjes
@ 2009-05-05  0:26 ` David Rientjes
  2009-05-05  0:26 ` [patch 4/7] oom: fix possible android low memory killer NULL pointer David Rientjes
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 14+ messages in thread
From: David Rientjes @ 2009-05-05  0:26 UTC (permalink / raw)
  To: Andrew Morton, Greg Kroah-Hartman
  Cc: Nick Piggin, San Mehat, Arve Hjønnevåg, linux-kernel

Clean up the code in lowmem_shrink() for the Android low memory killer so
that it follows the kernel coding style.

It's unnecessary to check for p->oomkilladj >= min_adj if the selected
task's oomkilladj score is stored since get_mm_rss() will always be
greater than zero.

Cc: San Mehat <san@android.com>
Signed-off-by: David Rientjes <rientjes@google.com>
---
 drivers/staging/android/lowmemorykiller.c |   39 +++++++++++++++++++---------
 1 files changed, 26 insertions(+), 13 deletions(-)

diff --git a/drivers/staging/android/lowmemorykiller.c b/drivers/staging/android/lowmemorykiller.c
--- a/drivers/staging/android/lowmemorykiller.c
+++ b/drivers/staging/android/lowmemorykiller.c
@@ -57,58 +57,71 @@ static int lowmem_shrink(int nr_to_scan, gfp_t gfp_mask)
 	int i;
 	int min_adj = OOM_ADJUST_MAX + 1;
 	int selected_tasksize = 0;
+	int selected_oom_adj;
 	int array_size = ARRAY_SIZE(lowmem_adj);
 	int other_free = global_page_state(NR_FREE_PAGES);
 	int other_file = global_page_state(NR_FILE_PAGES);
-	if(lowmem_adj_size < array_size)
+
+	if (lowmem_adj_size < array_size)
 		array_size = lowmem_adj_size;
-	if(lowmem_minfree_size < array_size)
+	if (lowmem_minfree_size < array_size)
 		array_size = lowmem_minfree_size;
-	for(i = 0; i < array_size; i++) {
+	for (i = 0; i < array_size; i++) {
 		if (other_free < lowmem_minfree[i] &&
 		    other_file < lowmem_minfree[i]) {
 			min_adj = lowmem_adj[i];
 			break;
 		}
 	}
-	if(nr_to_scan > 0)
-		lowmem_print(3, "lowmem_shrink %d, %x, ofree %d %d, ma %d\n", nr_to_scan, gfp_mask, other_free, other_file, min_adj);
+	if (nr_to_scan > 0)
+		lowmem_print(3, "lowmem_shrink %d, %x, ofree %d %d, ma %d\n",
+			     nr_to_scan, gfp_mask, other_free, other_file,
+			     min_adj);
 	rem = global_page_state(NR_ACTIVE_ANON) +
 		global_page_state(NR_ACTIVE_FILE) +
 		global_page_state(NR_INACTIVE_ANON) +
 		global_page_state(NR_INACTIVE_FILE);
 	if (nr_to_scan <= 0 || min_adj == OOM_ADJUST_MAX + 1) {
-		lowmem_print(5, "lowmem_shrink %d, %x, return %d\n", nr_to_scan, gfp_mask, rem);
+		lowmem_print(5, "lowmem_shrink %d, %x, return %d\n",
+			     nr_to_scan, gfp_mask, rem);
 		return rem;
 	}
+	selected_oom_adj = min_adj;
 
 	read_lock(&tasklist_lock);
 	for_each_process(p) {
-		if (p->oomkilladj < min_adj || !p->mm)
+		int oom_adj;
+
+		if (!p->mm)
+			continue;
+		oom_adj = p->oomkilladj;
+		if (oom_adj < min_adj)
 			continue;
 		tasksize = get_mm_rss(p->mm);
 		if (tasksize <= 0)
 			continue;
 		if (selected) {
-			if (p->oomkilladj < selected->oomkilladj)
+			if (oom_adj < selected_oom_adj)
 				continue;
-			if (p->oomkilladj == selected->oomkilladj &&
+			if (oom_adj == selected_oom_adj &&
 			    tasksize <= selected_tasksize)
 				continue;
 		}
 		selected = p;
 		selected_tasksize = tasksize;
+		selected_oom_adj = oom_adj;
 		lowmem_print(2, "select %d (%s), adj %d, size %d, to kill\n",
-		             p->pid, p->comm, p->oomkilladj, tasksize);
+		             p->pid, p->comm, oom_adj, tasksize);
 	}
-	if(selected != NULL) {
+	if (selected) {
 		lowmem_print(1, "send sigkill to %d (%s), adj %d, size %d\n",
 		             selected->pid, selected->comm,
-		             selected->oomkilladj, selected_tasksize);
+		             selected_oom_adj, selected_tasksize);
 		force_sig(SIGKILL, selected);
 		rem -= selected_tasksize;
 	}
-	lowmem_print(4, "lowmem_shrink %d, %x, return %d\n", nr_to_scan, gfp_mask, rem);
+	lowmem_print(4, "lowmem_shrink %d, %x, return %d\n",
+		     nr_to_scan, gfp_mask, rem);
 	read_unlock(&tasklist_lock);
 	return rem;
 }

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [patch 4/7] oom: fix possible android low memory killer NULL pointer
  2009-05-05  0:26 [patch 1/7] lowmemorykiller: Only iterate over process list when needed David Rientjes
  2009-05-05  0:26 ` [patch 2/7] lowmemorykiller: Don't count free space unless it meets the specified limit by itself David Rientjes
  2009-05-05  0:26 ` [patch 3/7] oom: cleanup android low memory killer David Rientjes
@ 2009-05-05  0:26 ` David Rientjes
  2009-05-05  0:27 ` [patch 5/7] oom: fix possible oom_dump_tasks " David Rientjes
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 14+ messages in thread
From: David Rientjes @ 2009-05-05  0:26 UTC (permalink / raw)
  To: Andrew Morton, Greg Kroah-Hartman
  Cc: Nick Piggin, San Mehat, Arve Hjønnevåg, linux-kernel

get_mm_rss() atomically dereferences the actual without checking for a
NULL pointer, which is possible since task_lock() is not held.

Cc: San Mehat <san@android.com>
Signed-off-by: David Rientjes <rientjes@google.com>
---
 drivers/staging/android/lowmemorykiller.c |   10 ++++++++--
 1 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/drivers/staging/android/lowmemorykiller.c b/drivers/staging/android/lowmemorykiller.c
--- a/drivers/staging/android/lowmemorykiller.c
+++ b/drivers/staging/android/lowmemorykiller.c
@@ -92,12 +92,18 @@ static int lowmem_shrink(int nr_to_scan, gfp_t gfp_mask)
 	for_each_process(p) {
 		int oom_adj;
 
-		if (!p->mm)
+		task_lock(p);
+		if (!p->mm) {
+			task_unlock(p);
 			continue;
+		}
 		oom_adj = p->oomkilladj;
-		if (oom_adj < min_adj)
+		if (oom_adj < min_adj) {
+			task_unlock(p);
 			continue;
+		}
 		tasksize = get_mm_rss(p->mm);
+		task_unlock(p);
 		if (tasksize <= 0)
 			continue;
 		if (selected) {

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [patch 5/7] oom: fix possible oom_dump_tasks NULL pointer
  2009-05-05  0:26 [patch 1/7] lowmemorykiller: Only iterate over process list when needed David Rientjes
                   ` (2 preceding siblings ...)
  2009-05-05  0:26 ` [patch 4/7] oom: fix possible android low memory killer NULL pointer David Rientjes
@ 2009-05-05  0:27 ` David Rientjes
  2009-05-05 10:11   ` Nick Piggin
  2009-05-05  0:27 ` [patch 6/7] oom: move oom_adj value from task_struct to mm_struct David Rientjes
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 14+ messages in thread
From: David Rientjes @ 2009-05-05  0:27 UTC (permalink / raw)
  To: Andrew Morton, Greg Kroah-Hartman
  Cc: Nick Piggin, San Mehat, Arve Hjønnevåg, linux-kernel

When /proc/sys/vm/oom_dump_tasks is enabled, it is possible to get a NULL
pointer for tasks that have detached mm's since task_lock() is not held
during the tasklist scan.

Cc: Nick Piggin <npiggin@suse.de>
Signed-off-by: David Rientjes <rientjes@google.com>
---
 mm/oom_kill.c |   24 +++++++++++++++---------
 1 files changed, 15 insertions(+), 9 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -284,22 +284,28 @@ static void dump_tasks(const struct mem_cgroup *mem)
 	printk(KERN_INFO "[ pid ]   uid  tgid total_vm      rss cpu oom_adj "
 	       "name\n");
 	do_each_thread(g, p) {
-		/*
-		 * total_vm and rss sizes do not exist for tasks with a
-		 * detached mm so there's no need to report them.
-		 */
-		if (!p->mm)
-			continue;
+		struct mm_struct *mm;
+
 		if (mem && !task_in_mem_cgroup(p, mem))
 			continue;
 		if (!thread_group_leader(p))
 			continue;
 
 		task_lock(p);
+		mm = p->mm;
+		if (!mm) {
+			/*
+			 * total_vm and rss sizes do not exist for tasks with no
+			 * mm so there's no need to report them; they can't be
+			 * oom killed anyway.
+			 */
+			task_unlock(p);
+			continue;
+		}
 		printk(KERN_INFO "[%5d] %5d %5d %8lu %8lu %3d     %3d %s\n",
-		       p->pid, __task_cred(p)->uid, p->tgid,
-		       p->mm->total_vm, get_mm_rss(p->mm), (int)task_cpu(p),
-		       p->oomkilladj, p->comm);
+		       p->pid, __task_cred(p)->uid, p->tgid, mm->total_vm,
+		       get_mm_rss(mm), (int)task_cpu(p), p->oomkilladj,
+		       p->comm);
 		task_unlock(p);
 	} while_each_thread(g, p);
 }

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [patch 6/7] oom: move oom_adj value from task_struct to mm_struct
  2009-05-05  0:26 [patch 1/7] lowmemorykiller: Only iterate over process list when needed David Rientjes
                   ` (3 preceding siblings ...)
  2009-05-05  0:27 ` [patch 5/7] oom: fix possible oom_dump_tasks " David Rientjes
@ 2009-05-05  0:27 ` David Rientjes
  2009-05-05 10:18   ` Nick Piggin
  2009-05-05  0:27 ` [patch 7/7] oom: prevent possible OOM_DISABLE livelock David Rientjes
  2009-05-05 10:23 ` [patch 1/7] lowmemorykiller: Only iterate over process list when needed Nick Piggin
  6 siblings, 1 reply; 14+ messages in thread
From: David Rientjes @ 2009-05-05  0:27 UTC (permalink / raw)
  To: Andrew Morton, Greg Kroah-Hartman
  Cc: Nick Piggin, San Mehat, Arve Hjønnevåg, linux-kernel

The per-task oom_adj value is a characteristic of its mm more than the
task itself since it's not possible to oom kill any thread that shares
the mm.  If a task were to be killed while attached to an mm that could
not be freed because another thread were set to OOM_DISABLE, it would
have needlessly been terminated since there is no potential for future
memory freeing.

This patch moves oomkilladj (now more appropriately named oom_adj) from
struct task_struct to struct mm_struct.  This requires task_lock() on a
task to check its oom_adj value to protect against exec, but it's already
necessary to take the lock when dereferencing the mm to find the total VM
size for the badness heuristic.

Taking task_lock() in select_bad_process() to check for OOM_DISABLE and
in oom_kill_task() to check for threads sharing the same memory will be
removed in the next patch in this series where it will no longer be
necessary.

Writing to /proc/pid/oom_adj for a kthread will now return -EINVAL since
these threads are immune from oom killing already.  They simply report an
oom_adj value of OOM_DISABLE.

Cc: Nick Piggin <npiggin@suse.de>
Cc: San Mehat <san@android.com>
Signed-off-by: David Rientjes <rientjes@google.com>
---
 Documentation/filesystems/proc.txt        |   15 +++++++++----
 drivers/staging/android/lowmemorykiller.c |    2 +-
 fs/proc/base.c                            |   19 ++++++++++++++--
 include/linux/mm_types.h                  |    2 +
 include/linux/sched.h                     |    1 -
 mm/oom_kill.c                             |   32 ++++++++++++++++++----------
 6 files changed, 49 insertions(+), 22 deletions(-)

diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -1003,11 +1003,13 @@ CHAPTER 3: PER-PROCESS PARAMETERS
 3.1 /proc/<pid>/oom_adj - Adjust the oom-killer score
 ------------------------------------------------------
 
-This file can be used to adjust the score used to select which processes
-should be killed in an  out-of-memory  situation.  Giving it a high score will
-increase the likelihood of this process being killed by the oom-killer.  Valid
-values are in the range -16 to +15, plus the special value -17, which disables
-oom-killing altogether for this process.
+This file can be used to adjust the score used to select which processes should
+be killed in an out-of-memory situation.  The oom_adj value is a characteristic
+of the task's mm, so all threads that share an mm with pid will have the same
+oom_adj value.  A high value will increase the liklihood of this process being
+killed by the oom-killer.  Valid values are in the range -16 to +15 as
+explained below and a special value of -17, which disables oom-killing
+altogether for threads sharing pid's mm.
 
 The process to be killed in an out-of-memory situation is selected among all others
 based on its badness score. This value equals the original memory size of the process
@@ -1021,6 +1023,9 @@ the parent's score if they do not share the same memory. Thus forking servers
 are the prime candidates to be killed. Having only one 'hungry' child will make
 parent less preferable than the child.
 
+/proc/<pid>/oom_adj cannot be changed for kthreads since they are immune from
+oom-killing already.
+
 /proc/<pid>/oom_score shows process' current badness score.
 
 The following heuristics are then applied:
diff --git a/drivers/staging/android/lowmemorykiller.c b/drivers/staging/android/lowmemorykiller.c
--- a/drivers/staging/android/lowmemorykiller.c
+++ b/drivers/staging/android/lowmemorykiller.c
@@ -97,7 +97,7 @@ static int lowmem_shrink(int nr_to_scan, gfp_t gfp_mask)
 			task_unlock(p);
 			continue;
 		}
-		oom_adj = p->oomkilladj;
+		oom_adj = p->mm->oom_adj;
 		if (oom_adj < min_adj) {
 			task_unlock(p);
 			continue;
diff --git a/fs/proc/base.c b/fs/proc/base.c
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -1003,7 +1003,12 @@ static ssize_t oom_adjust_read(struct file *file, char __user *buf,
 
 	if (!task)
 		return -ESRCH;
-	oom_adjust = task->oomkilladj;
+	task_lock(task);
+	if (task->mm)
+		oom_adjust = task->mm->oom_adj;
+	else
+		oom_adjust = OOM_DISABLE;
+	task_unlock(task);
 	put_task_struct(task);
 
 	len = snprintf(buffer, sizeof(buffer), "%i\n", oom_adjust);
@@ -1032,11 +1037,19 @@ static ssize_t oom_adjust_write(struct file *file, const char __user *buf,
 	task = get_proc_task(file->f_path.dentry->d_inode);
 	if (!task)
 		return -ESRCH;
-	if (oom_adjust < task->oomkilladj && !capable(CAP_SYS_RESOURCE)) {
+	task_lock(task);
+	if (!task->mm) {
+		task_unlock(task);
+		put_task_struct(task);
+		return -EINVAL;
+	}
+	if (oom_adjust < task->mm->oom_adj && !capable(CAP_SYS_RESOURCE)) {
+		task_unlock(task);
 		put_task_struct(task);
 		return -EACCES;
 	}
-	task->oomkilladj = oom_adjust;
+	task->mm->oom_adj = oom_adjust;
+	task_unlock(task);
 	put_task_struct(task);
 	if (end - buffer == 0)
 		return -EIO;
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -232,6 +232,8 @@ struct mm_struct {
 
 	unsigned long saved_auxv[AT_VECTOR_SIZE]; /* for /proc/PID/auxv */
 
+	s8 oom_adj;	/* OOM kill score adjustment (bit shift) */
+
 	cpumask_t cpu_vm_mask;
 
 	/* Architecture-specific MM context */
diff --git a/include/linux/sched.h b/include/linux/sched.h
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1149,7 +1149,6 @@ struct task_struct {
 	 * a short time
 	 */
 	unsigned char fpu_counter;
-	s8 oomkilladj; /* OOM kill score adjustment (bit shift). */
 #ifdef CONFIG_BLK_DEV_IO_TRACE
 	unsigned int btrace_seq;
 #endif
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -58,6 +58,7 @@ unsigned long badness(struct task_struct *p, unsigned long uptime)
 	unsigned long points, cpu_time, run_time;
 	struct mm_struct *mm;
 	struct task_struct *child;
+	int oom_adj;
 
 	task_lock(p);
 	mm = p->mm;
@@ -65,6 +66,7 @@ unsigned long badness(struct task_struct *p, unsigned long uptime)
 		task_unlock(p);
 		return 0;
 	}
+	oom_adj = mm->oom_adj;
 
 	/*
 	 * The memory size of the process is the basis for the badness.
@@ -148,15 +150,15 @@ unsigned long badness(struct task_struct *p, unsigned long uptime)
 		points /= 8;
 
 	/*
-	 * Adjust the score by oomkilladj.
+	 * Adjust the score by oom_adj.
 	 */
-	if (p->oomkilladj) {
-		if (p->oomkilladj > 0) {
+	if (oom_adj) {
+		if (oom_adj > 0) {
 			if (!points)
 				points = 1;
-			points <<= p->oomkilladj;
+			points <<= oom_adj;
 		} else
-			points >>= -(p->oomkilladj);
+			points >>= -(oom_adj);
 	}
 
 #ifdef DEBUG
@@ -251,8 +253,10 @@ static struct task_struct *select_bad_process(unsigned long *ppoints,
 			*ppoints = ULONG_MAX;
 		}
 
-		if (p->oomkilladj == OOM_DISABLE)
+		task_lock(p);
+		if (p->mm && p->mm->oom_adj == OOM_DISABLE)
 			continue;
+		task_unlock(p);
 
 		points = badness(p, uptime.tv_sec);
 		if (points > *ppoints || !chosen) {
@@ -304,8 +308,7 @@ static void dump_tasks(const struct mem_cgroup *mem)
 		}
 		printk(KERN_INFO "[%5d] %5d %5d %8lu %8lu %3d     %3d %s\n",
 		       p->pid, __task_cred(p)->uid, p->tgid, mm->total_vm,
-		       get_mm_rss(mm), (int)task_cpu(p), p->oomkilladj,
-		       p->comm);
+		       get_mm_rss(mm), (int)task_cpu(p), mm->oom_adj, p->comm);
 		task_unlock(p);
 	} while_each_thread(g, p);
 }
@@ -367,8 +370,12 @@ static int oom_kill_task(struct task_struct *p)
 	 * Don't kill the process if any threads are set to OOM_DISABLE
 	 */
 	do_each_thread(g, q) {
-		if (q->mm == mm && q->oomkilladj == OOM_DISABLE)
+		task_lock(q);
+		if (q->mm == mm && q->mm && q->mm->oom_adj == OOM_DISABLE) {
+			task_unlock(q);
 			return 1;
+		}
+		task_unlock(q);
 	} while_each_thread(g, q);
 
 	__oom_kill_task(p, 1);
@@ -393,10 +400,11 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
 	struct task_struct *c;
 
 	if (printk_ratelimit()) {
-		printk(KERN_WARNING "%s invoked oom-killer: "
-			"gfp_mask=0x%x, order=%d, oomkilladj=%d\n",
-			current->comm, gfp_mask, order, current->oomkilladj);
 		task_lock(current);
+		printk(KERN_WARNING "%s invoked oom-killer: "
+			"gfp_mask=0x%x, order=%d, oom_adj=%d\n",
+			current->comm, gfp_mask, order,
+			current->mm ? current->mm->oom_adj : OOM_DISABLE);
 		cpuset_print_task_mems_allowed(current);
 		task_unlock(current);
 		dump_stack();

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [patch 7/7] oom: prevent possible OOM_DISABLE livelock
  2009-05-05  0:26 [patch 1/7] lowmemorykiller: Only iterate over process list when needed David Rientjes
                   ` (4 preceding siblings ...)
  2009-05-05  0:27 ` [patch 6/7] oom: move oom_adj value from task_struct to mm_struct David Rientjes
@ 2009-05-05  0:27 ` David Rientjes
  2009-05-05 10:22   ` Nick Piggin
  2009-05-05 10:23 ` [patch 1/7] lowmemorykiller: Only iterate over process list when needed Nick Piggin
  6 siblings, 1 reply; 14+ messages in thread
From: David Rientjes @ 2009-05-05  0:27 UTC (permalink / raw)
  To: Andrew Morton, Greg Kroah-Hartman
  Cc: Nick Piggin, San Mehat, Arve Hjønnevåg, linux-kernel

It is currently possible to livelock the oom killer if a task is chosen
for oom kill and another thread sharing the same memory has an oom_adj
value of OOM_DISABLE.  This occurs because oom_kill_task() repeatedly
returns 1 and refuses to kill the chosen task while select_bad_process()
will repeatedly chooses the same task during the next retry.

This moves the check for OOM_DISABLE to the badness heuristic while
holding task_lock().  Badness scores of 0 are now explicitly prohibited
from being oom killed and since the oom_adj value is a characteristic of
an mm and not a task, it is no longer necessary to check the oom_adj
value for threads sharing the same memory (except when simply issuing
SIGKILLs for threads in other thread groups).

Cc: Nick Piggin <npiggin@suse.de>
Signed-off-by: David Rientjes <rientjes@google.com>
---
 mm/oom_kill.c |   40 ++++++++++------------------------------
 1 files changed, 10 insertions(+), 30 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -67,6 +67,10 @@ unsigned long badness(struct task_struct *p, unsigned long uptime)
 		return 0;
 	}
 	oom_adj = mm->oom_adj;
+	if (oom_adj == OOM_DISABLE) {
+		task_unlock(p);
+		return 0;
+	}
 
 	/*
 	 * The memory size of the process is the basis for the badness.
@@ -253,13 +257,8 @@ static struct task_struct *select_bad_process(unsigned long *ppoints,
 			*ppoints = ULONG_MAX;
 		}
 
-		task_lock(p);
-		if (p->mm && p->mm->oom_adj == OOM_DISABLE)
-			continue;
-		task_unlock(p);
-
 		points = badness(p, uptime.tv_sec);
-		if (points > *ppoints || !chosen) {
+		if (points > *ppoints) {
 			chosen = p;
 			*ppoints = points;
 		}
@@ -352,32 +351,13 @@ static int oom_kill_task(struct task_struct *p)
 	struct mm_struct *mm;
 	struct task_struct *g, *q;
 
+	task_lock(p);
 	mm = p->mm;
-
-	/* WARNING: mm may not be dereferenced since we did not obtain its
-	 * value from get_task_mm(p).  This is OK since all we need to do is
-	 * compare mm to q->mm below.
-	 *
-	 * Furthermore, even if mm contains a non-NULL value, p->mm may
-	 * change to NULL at any time since we do not hold task_lock(p).
-	 * However, this is of no concern to us.
-	 */
-
-	if (mm == NULL)
+	if (!mm || mm->oom_adj == OOM_DISABLE) {
+		task_unlock(p);
 		return 1;
-
-	/*
-	 * Don't kill the process if any threads are set to OOM_DISABLE
-	 */
-	do_each_thread(g, q) {
-		task_lock(q);
-		if (q->mm == mm && q->mm && q->mm->oom_adj == OOM_DISABLE) {
-			task_unlock(q);
-			return 1;
-		}
-		task_unlock(q);
-	} while_each_thread(g, q);
-
+	}
+	task_unlock(p);
 	__oom_kill_task(p, 1);
 
 	/*

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [patch 5/7] oom: fix possible oom_dump_tasks NULL pointer
  2009-05-05  0:27 ` [patch 5/7] oom: fix possible oom_dump_tasks " David Rientjes
@ 2009-05-05 10:11   ` Nick Piggin
  0 siblings, 0 replies; 14+ messages in thread
From: Nick Piggin @ 2009-05-05 10:11 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Greg Kroah-Hartman, San Mehat,
	Arve Hjønnevåg, linux-kernel

On Mon, May 04, 2009 at 05:27:02PM -0700, David Rientjes wrote:
> When /proc/sys/vm/oom_dump_tasks is enabled, it is possible to get a NULL
> pointer for tasks that have detached mm's since task_lock() is not held
> during the tasklist scan.
> 
> Cc: Nick Piggin <npiggin@suse.de>
> Signed-off-by: David Rientjes <rientjes@google.com>

I hope I'm not the oom killer maintainer ;) But anyway,

Acked-by: Nick Piggin <npiggin@suse.de>


> ---
>  mm/oom_kill.c |   24 +++++++++++++++---------
>  1 files changed, 15 insertions(+), 9 deletions(-)
> 
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -284,22 +284,28 @@ static void dump_tasks(const struct mem_cgroup *mem)
>  	printk(KERN_INFO "[ pid ]   uid  tgid total_vm      rss cpu oom_adj "
>  	       "name\n");
>  	do_each_thread(g, p) {
> -		/*
> -		 * total_vm and rss sizes do not exist for tasks with a
> -		 * detached mm so there's no need to report them.
> -		 */
> -		if (!p->mm)
> -			continue;
> +		struct mm_struct *mm;
> +
>  		if (mem && !task_in_mem_cgroup(p, mem))
>  			continue;
>  		if (!thread_group_leader(p))
>  			continue;
>  
>  		task_lock(p);
> +		mm = p->mm;
> +		if (!mm) {
> +			/*
> +			 * total_vm and rss sizes do not exist for tasks with no
> +			 * mm so there's no need to report them; they can't be
> +			 * oom killed anyway.
> +			 */
> +			task_unlock(p);
> +			continue;
> +		}
>  		printk(KERN_INFO "[%5d] %5d %5d %8lu %8lu %3d     %3d %s\n",
> -		       p->pid, __task_cred(p)->uid, p->tgid,
> -		       p->mm->total_vm, get_mm_rss(p->mm), (int)task_cpu(p),
> -		       p->oomkilladj, p->comm);
> +		       p->pid, __task_cred(p)->uid, p->tgid, mm->total_vm,
> +		       get_mm_rss(mm), (int)task_cpu(p), p->oomkilladj,
> +		       p->comm);
>  		task_unlock(p);
>  	} while_each_thread(g, p);
>  }

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [patch 6/7] oom: move oom_adj value from task_struct to mm_struct
  2009-05-05  0:27 ` [patch 6/7] oom: move oom_adj value from task_struct to mm_struct David Rientjes
@ 2009-05-05 10:18   ` Nick Piggin
  2009-05-05 18:29     ` David Rientjes
  0 siblings, 1 reply; 14+ messages in thread
From: Nick Piggin @ 2009-05-05 10:18 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Greg Kroah-Hartman, San Mehat,
	Arve Hjønnevåg, linux-kernel

On Mon, May 04, 2009 at 05:27:04PM -0700, David Rientjes wrote:
> The per-task oom_adj value is a characteristic of its mm more than the
> task itself since it's not possible to oom kill any thread that shares
> the mm.  If a task were to be killed while attached to an mm that could
> not be freed because another thread were set to OOM_DISABLE, it would
> have needlessly been terminated since there is no potential for future
> memory freeing.

True.

 
> This patch moves oomkilladj (now more appropriately named oom_adj) from
> struct task_struct to struct mm_struct.  This requires task_lock() on a
> task to check its oom_adj value to protect against exec, but it's already
> necessary to take the lock when dereferencing the mm to find the total VM
> size for the badness heuristic.
> 
> Taking task_lock() in select_bad_process() to check for OOM_DISABLE and
> in oom_kill_task() to check for threads sharing the same memory will be
> removed in the next patch in this series where it will no longer be
> necessary.
> 
> Writing to /proc/pid/oom_adj for a kthread will now return -EINVAL since
> these threads are immune from oom killing already.  They simply report an
> oom_adj value of OOM_DISABLE.

Hmm, not particularly bad to do any of this, but I guess the way to
do it without changing userspace APIs would be to take the lowest
value of oomkilladj for any thread in the mm and use that.

I guess the previous behaviour was basically quite random anyway if
multiple threads each have different oomkilladj variables so it
probably doesn't matter.

I don't suppose you'd ever get a significantly advanced (aka crazy)
userspace app that would want to deliberately set *one* of its threads
as OOM_DISABLE and others allowed to be killed, would we?

> 
> Cc: Nick Piggin <npiggin@suse.de>
> Cc: San Mehat <san@android.com>
> Signed-off-by: David Rientjes <rientjes@google.com>
> ---
>  Documentation/filesystems/proc.txt        |   15 +++++++++----
>  drivers/staging/android/lowmemorykiller.c |    2 +-
>  fs/proc/base.c                            |   19 ++++++++++++++--
>  include/linux/mm_types.h                  |    2 +
>  include/linux/sched.h                     |    1 -
>  mm/oom_kill.c                             |   32 ++++++++++++++++++----------
>  6 files changed, 49 insertions(+), 22 deletions(-)
> 
> diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
> --- a/Documentation/filesystems/proc.txt
> +++ b/Documentation/filesystems/proc.txt
> @@ -1003,11 +1003,13 @@ CHAPTER 3: PER-PROCESS PARAMETERS
>  3.1 /proc/<pid>/oom_adj - Adjust the oom-killer score
>  ------------------------------------------------------
>  
> -This file can be used to adjust the score used to select which processes
> -should be killed in an  out-of-memory  situation.  Giving it a high score will
> -increase the likelihood of this process being killed by the oom-killer.  Valid
> -values are in the range -16 to +15, plus the special value -17, which disables
> -oom-killing altogether for this process.
> +This file can be used to adjust the score used to select which processes should
> +be killed in an out-of-memory situation.  The oom_adj value is a characteristic
> +of the task's mm, so all threads that share an mm with pid will have the same
> +oom_adj value.  A high value will increase the liklihood of this process being
> +killed by the oom-killer.  Valid values are in the range -16 to +15 as
> +explained below and a special value of -17, which disables oom-killing
> +altogether for threads sharing pid's mm.
>  
>  The process to be killed in an out-of-memory situation is selected among all others
>  based on its badness score. This value equals the original memory size of the process
> @@ -1021,6 +1023,9 @@ the parent's score if they do not share the same memory. Thus forking servers
>  are the prime candidates to be killed. Having only one 'hungry' child will make
>  parent less preferable than the child.
>  
> +/proc/<pid>/oom_adj cannot be changed for kthreads since they are immune from
> +oom-killing already.
> +
>  /proc/<pid>/oom_score shows process' current badness score.
>  
>  The following heuristics are then applied:
> diff --git a/drivers/staging/android/lowmemorykiller.c b/drivers/staging/android/lowmemorykiller.c
> --- a/drivers/staging/android/lowmemorykiller.c
> +++ b/drivers/staging/android/lowmemorykiller.c
> @@ -97,7 +97,7 @@ static int lowmem_shrink(int nr_to_scan, gfp_t gfp_mask)
>  			task_unlock(p);
>  			continue;
>  		}
> -		oom_adj = p->oomkilladj;
> +		oom_adj = p->mm->oom_adj;
>  		if (oom_adj < min_adj) {
>  			task_unlock(p);
>  			continue;
> diff --git a/fs/proc/base.c b/fs/proc/base.c
> --- a/fs/proc/base.c
> +++ b/fs/proc/base.c
> @@ -1003,7 +1003,12 @@ static ssize_t oom_adjust_read(struct file *file, char __user *buf,
>  
>  	if (!task)
>  		return -ESRCH;
> -	oom_adjust = task->oomkilladj;
> +	task_lock(task);
> +	if (task->mm)
> +		oom_adjust = task->mm->oom_adj;
> +	else
> +		oom_adjust = OOM_DISABLE;
> +	task_unlock(task);
>  	put_task_struct(task);
>  
>  	len = snprintf(buffer, sizeof(buffer), "%i\n", oom_adjust);
> @@ -1032,11 +1037,19 @@ static ssize_t oom_adjust_write(struct file *file, const char __user *buf,
>  	task = get_proc_task(file->f_path.dentry->d_inode);
>  	if (!task)
>  		return -ESRCH;
> -	if (oom_adjust < task->oomkilladj && !capable(CAP_SYS_RESOURCE)) {
> +	task_lock(task);
> +	if (!task->mm) {
> +		task_unlock(task);
> +		put_task_struct(task);
> +		return -EINVAL;
> +	}
> +	if (oom_adjust < task->mm->oom_adj && !capable(CAP_SYS_RESOURCE)) {
> +		task_unlock(task);
>  		put_task_struct(task);
>  		return -EACCES;
>  	}
> -	task->oomkilladj = oom_adjust;
> +	task->mm->oom_adj = oom_adjust;
> +	task_unlock(task);
>  	put_task_struct(task);
>  	if (end - buffer == 0)
>  		return -EIO;
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -232,6 +232,8 @@ struct mm_struct {
>  
>  	unsigned long saved_auxv[AT_VECTOR_SIZE]; /* for /proc/PID/auxv */
>  
> +	s8 oom_adj;	/* OOM kill score adjustment (bit shift) */
> +
>  	cpumask_t cpu_vm_mask;
>  
>  	/* Architecture-specific MM context */
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1149,7 +1149,6 @@ struct task_struct {
>  	 * a short time
>  	 */
>  	unsigned char fpu_counter;
> -	s8 oomkilladj; /* OOM kill score adjustment (bit shift). */
>  #ifdef CONFIG_BLK_DEV_IO_TRACE
>  	unsigned int btrace_seq;
>  #endif
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -58,6 +58,7 @@ unsigned long badness(struct task_struct *p, unsigned long uptime)
>  	unsigned long points, cpu_time, run_time;
>  	struct mm_struct *mm;
>  	struct task_struct *child;
> +	int oom_adj;
>  
>  	task_lock(p);
>  	mm = p->mm;
> @@ -65,6 +66,7 @@ unsigned long badness(struct task_struct *p, unsigned long uptime)
>  		task_unlock(p);
>  		return 0;
>  	}
> +	oom_adj = mm->oom_adj;
>  
>  	/*
>  	 * The memory size of the process is the basis for the badness.
> @@ -148,15 +150,15 @@ unsigned long badness(struct task_struct *p, unsigned long uptime)
>  		points /= 8;
>  
>  	/*
> -	 * Adjust the score by oomkilladj.
> +	 * Adjust the score by oom_adj.
>  	 */
> -	if (p->oomkilladj) {
> -		if (p->oomkilladj > 0) {
> +	if (oom_adj) {
> +		if (oom_adj > 0) {
>  			if (!points)
>  				points = 1;
> -			points <<= p->oomkilladj;
> +			points <<= oom_adj;
>  		} else
> -			points >>= -(p->oomkilladj);
> +			points >>= -(oom_adj);
>  	}
>  
>  #ifdef DEBUG
> @@ -251,8 +253,10 @@ static struct task_struct *select_bad_process(unsigned long *ppoints,
>  			*ppoints = ULONG_MAX;
>  		}
>  
> -		if (p->oomkilladj == OOM_DISABLE)
> +		task_lock(p);
> +		if (p->mm && p->mm->oom_adj == OOM_DISABLE)
>  			continue;
> +		task_unlock(p);
>  
>  		points = badness(p, uptime.tv_sec);
>  		if (points > *ppoints || !chosen) {
> @@ -304,8 +308,7 @@ static void dump_tasks(const struct mem_cgroup *mem)
>  		}
>  		printk(KERN_INFO "[%5d] %5d %5d %8lu %8lu %3d     %3d %s\n",
>  		       p->pid, __task_cred(p)->uid, p->tgid, mm->total_vm,
> -		       get_mm_rss(mm), (int)task_cpu(p), p->oomkilladj,
> -		       p->comm);
> +		       get_mm_rss(mm), (int)task_cpu(p), mm->oom_adj, p->comm);
>  		task_unlock(p);
>  	} while_each_thread(g, p);
>  }
> @@ -367,8 +370,12 @@ static int oom_kill_task(struct task_struct *p)
>  	 * Don't kill the process if any threads are set to OOM_DISABLE
>  	 */
>  	do_each_thread(g, q) {
> -		if (q->mm == mm && q->oomkilladj == OOM_DISABLE)
> +		task_lock(q);
> +		if (q->mm == mm && q->mm && q->mm->oom_adj == OOM_DISABLE) {
> +			task_unlock(q);
>  			return 1;
> +		}
> +		task_unlock(q);
>  	} while_each_thread(g, q);
>  
>  	__oom_kill_task(p, 1);
> @@ -393,10 +400,11 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
>  	struct task_struct *c;
>  
>  	if (printk_ratelimit()) {
> -		printk(KERN_WARNING "%s invoked oom-killer: "
> -			"gfp_mask=0x%x, order=%d, oomkilladj=%d\n",
> -			current->comm, gfp_mask, order, current->oomkilladj);
>  		task_lock(current);
> +		printk(KERN_WARNING "%s invoked oom-killer: "
> +			"gfp_mask=0x%x, order=%d, oom_adj=%d\n",
> +			current->comm, gfp_mask, order,
> +			current->mm ? current->mm->oom_adj : OOM_DISABLE);
>  		cpuset_print_task_mems_allowed(current);
>  		task_unlock(current);
>  		dump_stack();

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [patch 7/7] oom: prevent possible OOM_DISABLE livelock
  2009-05-05  0:27 ` [patch 7/7] oom: prevent possible OOM_DISABLE livelock David Rientjes
@ 2009-05-05 10:22   ` Nick Piggin
  2009-05-05 18:33     ` David Rientjes
  0 siblings, 1 reply; 14+ messages in thread
From: Nick Piggin @ 2009-05-05 10:22 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Greg Kroah-Hartman, San Mehat,
	Arve Hjønnevåg, linux-kernel

On Mon, May 04, 2009 at 05:27:07PM -0700, David Rientjes wrote:
> It is currently possible to livelock the oom killer if a task is chosen
> for oom kill and another thread sharing the same memory has an oom_adj
> value of OOM_DISABLE.  This occurs because oom_kill_task() repeatedly

Hmm, but didn't the last patch make it a per-mm value?

> returns 1 and refuses to kill the chosen task while select_bad_process()
> will repeatedly chooses the same task during the next retry.
> 
> This moves the check for OOM_DISABLE to the badness heuristic while
> holding task_lock().  Badness scores of 0 are now explicitly prohibited
> from being oom killed and since the oom_adj value is a characteristic of
> an mm and not a task, it is no longer necessary to check the oom_adj
> value for threads sharing the same memory (except when simply issuing
> SIGKILLs for threads in other thread groups).
> 
> Cc: Nick Piggin <npiggin@suse.de>
> Signed-off-by: David Rientjes <rientjes@google.com>
> ---
>  mm/oom_kill.c |   40 ++++++++++------------------------------
>  1 files changed, 10 insertions(+), 30 deletions(-)
> 
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -67,6 +67,10 @@ unsigned long badness(struct task_struct *p, unsigned long uptime)
>  		return 0;
>  	}
>  	oom_adj = mm->oom_adj;
> +	if (oom_adj == OOM_DISABLE) {
> +		task_unlock(p);
> +		return 0;
> +	}
>  
>  	/*
>  	 * The memory size of the process is the basis for the badness.
> @@ -253,13 +257,8 @@ static struct task_struct *select_bad_process(unsigned long *ppoints,
>  			*ppoints = ULONG_MAX;
>  		}
>  
> -		task_lock(p);
> -		if (p->mm && p->mm->oom_adj == OOM_DISABLE)
> -			continue;
> -		task_unlock(p);
> -
>  		points = badness(p, uptime.tv_sec);
> -		if (points > *ppoints || !chosen) {
> +		if (points > *ppoints) {
>  			chosen = p;
>  			*ppoints = points;
>  		}
> @@ -352,32 +351,13 @@ static int oom_kill_task(struct task_struct *p)
>  	struct mm_struct *mm;
>  	struct task_struct *g, *q;
>  
> +	task_lock(p);
>  	mm = p->mm;
> -
> -	/* WARNING: mm may not be dereferenced since we did not obtain its
> -	 * value from get_task_mm(p).  This is OK since all we need to do is
> -	 * compare mm to q->mm below.
> -	 *
> -	 * Furthermore, even if mm contains a non-NULL value, p->mm may
> -	 * change to NULL at any time since we do not hold task_lock(p).
> -	 * However, this is of no concern to us.
> -	 */
> -
> -	if (mm == NULL)
> +	if (!mm || mm->oom_adj == OOM_DISABLE) {
> +		task_unlock(p);
>  		return 1;
> -
> -	/*
> -	 * Don't kill the process if any threads are set to OOM_DISABLE
> -	 */
> -	do_each_thread(g, q) {
> -		task_lock(q);
> -		if (q->mm == mm && q->mm && q->mm->oom_adj == OOM_DISABLE) {
> -			task_unlock(q);
> -			return 1;
> -		}
> -		task_unlock(q);
> -	} while_each_thread(g, q);
> -
> +	}
> +	task_unlock(p);
>  	__oom_kill_task(p, 1);
>  
>  	/*

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [patch 1/7] lowmemorykiller: Only iterate over process list when needed.
  2009-05-05  0:26 [patch 1/7] lowmemorykiller: Only iterate over process list when needed David Rientjes
                   ` (5 preceding siblings ...)
  2009-05-05  0:27 ` [patch 7/7] oom: prevent possible OOM_DISABLE livelock David Rientjes
@ 2009-05-05 10:23 ` Nick Piggin
  2009-05-05 16:31   ` Greg KH
  6 siblings, 1 reply; 14+ messages in thread
From: Nick Piggin @ 2009-05-05 10:23 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Greg Kroah-Hartman, San Mehat,
	Arve Hjønnevåg, linux-kernel

On Mon, May 04, 2009 at 05:26:50PM -0700, David Rientjes wrote:
> From: Arve Hjønnevåg <arve@android.com>
> 
> Use NR_ACTIVE plus NR_INACTIVE as a size estimate for our fake cache
> instead the sum of rss. Neither method is accurate.
> 
> Also skip the process scan, if the amount of memory available is above
> the largest threshold set.
> 
> Signed-off-by: Arve Hjønnevåg <arve@android.com>

Didn't really look at the android stuff because I didn't even know
it was there before. But I asume it is going to be "submitted" to
the kernel proper at some stage (by some means other than a git rename) :)

> ---
>  drivers/staging/android/lowmemorykiller.c |   35 +++++++++++++++++-----------
>  1 files changed, 21 insertions(+), 14 deletions(-)
> 
> diff --git a/drivers/staging/android/lowmemorykiller.c b/drivers/staging/android/lowmemorykiller.c
> --- a/drivers/staging/android/lowmemorykiller.c
> +++ b/drivers/staging/android/lowmemorykiller.c
> @@ -71,23 +71,30 @@ static int lowmem_shrink(int nr_to_scan, gfp_t gfp_mask)
>  	}
>  	if(nr_to_scan > 0)
>  		lowmem_print(3, "lowmem_shrink %d, %x, ofree %d, ma %d\n", nr_to_scan, gfp_mask, other_free, min_adj);
> +	rem = global_page_state(NR_ACTIVE) + global_page_state(NR_INACTIVE);
> +	if (nr_to_scan <= 0 || min_adj == OOM_ADJUST_MAX + 1) {
> +		lowmem_print(5, "lowmem_shrink %d, %x, return %d\n", nr_to_scan, gfp_mask, rem);
> +		return rem;
> +	}
> +
>  	read_lock(&tasklist_lock);
>  	for_each_process(p) {
> -		if(p->oomkilladj >= 0 && p->mm) {
> -			tasksize = get_mm_rss(p->mm);
> -			if(nr_to_scan > 0 && tasksize > 0 && p->oomkilladj >= min_adj) {
> -				if(selected == NULL ||
> -				   p->oomkilladj > selected->oomkilladj ||
> -				   (p->oomkilladj == selected->oomkilladj &&
> -				    tasksize > selected_tasksize)) {
> -					selected = p;
> -					selected_tasksize = tasksize;
> -					lowmem_print(2, "select %d (%s), adj %d, size %d, to kill\n",
> -					             p->pid, p->comm, p->oomkilladj, tasksize);
> -				}
> -			}
> -			rem += tasksize;
> +		if (p->oomkilladj < min_adj || !p->mm)
> +			continue;
> +		tasksize = get_mm_rss(p->mm);
> +		if (tasksize <= 0)
> +			continue;
> +		if (selected) {
> +			if (p->oomkilladj < selected->oomkilladj)
> +				continue;
> +			if (p->oomkilladj == selected->oomkilladj &&
> +			    tasksize <= selected_tasksize)
> +				continue;
>  		}
> +		selected = p;
> +		selected_tasksize = tasksize;
> +		lowmem_print(2, "select %d (%s), adj %d, size %d, to kill\n",
> +		             p->pid, p->comm, p->oomkilladj, tasksize);
>  	}
>  	if(selected != NULL) {
>  		lowmem_print(1, "send sigkill to %d (%s), adj %d, size %d\n",

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [patch 1/7] lowmemorykiller: Only iterate over process list when needed.
  2009-05-05 10:23 ` [patch 1/7] lowmemorykiller: Only iterate over process list when needed Nick Piggin
@ 2009-05-05 16:31   ` Greg KH
  0 siblings, 0 replies; 14+ messages in thread
From: Greg KH @ 2009-05-05 16:31 UTC (permalink / raw)
  To: Nick Piggin
  Cc: David Rientjes, Andrew Morton, San Mehat,
	Arve Hjønnevåg, linux-kernel

On Tue, May 05, 2009 at 12:23:59PM +0200, Nick Piggin wrote:
> On Mon, May 04, 2009 at 05:26:50PM -0700, David Rientjes wrote:
> > From: Arve Hjønnevåg <arve@android.com>
> > 
> > Use NR_ACTIVE plus NR_INACTIVE as a size estimate for our fake cache
> > instead the sum of rss. Neither method is accurate.
> > 
> > Also skip the process scan, if the amount of memory available is above
> > the largest threshold set.
> > 
> > Signed-off-by: Arve Hjønnevåg <arve@android.com>
> 
> Didn't really look at the android stuff because I didn't even know
> it was there before. But I asume it is going to be "submitted" to
> the kernel proper at some stage (by some means other than a git rename) :)

Yes it will, when everyone agrees that what it is doing is the "right"
think to do :)

But for now, it sits in staging, as there are a few hundred thousand
devices using it...

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [patch 6/7] oom: move oom_adj value from task_struct to mm_struct
  2009-05-05 10:18   ` Nick Piggin
@ 2009-05-05 18:29     ` David Rientjes
  0 siblings, 0 replies; 14+ messages in thread
From: David Rientjes @ 2009-05-05 18:29 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Andrew Morton, Greg Kroah-Hartman, San Mehat,
	Arve Hjønnevåg, linux-kernel

On Tue, 5 May 2009, Nick Piggin wrote:

> Hmm, not particularly bad to do any of this, but I guess the way to
> do it without changing userspace APIs would be to take the lowest
> value of oomkilladj for any thread in the mm and use that.
> 

Then /proc/pid/oom_adj (and, thus, what /proc/pid/oom_score reports) will 
be inconsistent.  In other words, it would be possible for a task to have 
an oom_adj value of +15, meaning to prefer killing it above all others, 
and it turns out to be immune from oom killing because it shares memory 
with an OOM_DISABLE task.  You could argue that's broken in userspace, but 
this patchset actually makes oom_adj values a trait of the memory it 
represents rather than the task itself so not only is it consistent but it 
doesn't cause any surprises when your system (or memcg or cpuset) is oom.

Taking the lowest oom_adj value would also require more than one expensive 
tasklist scan or an internal oom killer stack of eligible tasks to kill so 
we can pop a task that shares ineligible memory and still kill a task that 
has a higher priority as defined by the user.

> I guess the previous behaviour was basically quite random anyway if
> multiple threads each have different oomkilladj variables so it
> probably doesn't matter.
> 

Exactly, and this is what the patchset addresses: oom_adj values become a 
trait of the memory it describes and not tasks that may happen to share 
it.  It also fixes a livelock if the highest priority task actually can't 
be killed because it shares memory with an OOM_DISABLE task.

> I don't suppose you'd ever get a significantly advanced (aka crazy)
> userspace app that would want to deliberately set *one* of its threads
> as OOM_DISABLE and others allowed to be killed, would we?
> 

You could argue that's totally broken in userspace, but there's no reason 
for the kernel to not expose a consistent API that describes the actual 
oom priority.  oomkilladj really is inappropriate as a per task value.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [patch 7/7] oom: prevent possible OOM_DISABLE livelock
  2009-05-05 10:22   ` Nick Piggin
@ 2009-05-05 18:33     ` David Rientjes
  0 siblings, 0 replies; 14+ messages in thread
From: David Rientjes @ 2009-05-05 18:33 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Andrew Morton, Greg Kroah-Hartman, San Mehat,
	Arve Hjønnevåg, linux-kernel

On Tue, 5 May 2009, Nick Piggin wrote:

> On Mon, May 04, 2009 at 05:27:07PM -0700, David Rientjes wrote:
> > It is currently possible to livelock the oom killer if a task is chosen
> > for oom kill and another thread sharing the same memory has an oom_adj
> > value of OOM_DISABLE.  This occurs because oom_kill_task() repeatedly
> 
> Hmm, but didn't the last patch make it a per-mm value?
> 

Yes, but this avoids taking task_lock(p) in select_bad_process() and moves 
all locking to badness().  It also introduces new behavior that anything 
with a badness score of 0 won't be killed, which wasn't explicit before 
(we relied on oom_kill_task() to reject it).

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2009-05-05 18:33 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-05-05  0:26 [patch 1/7] lowmemorykiller: Only iterate over process list when needed David Rientjes
2009-05-05  0:26 ` [patch 2/7] lowmemorykiller: Don't count free space unless it meets the specified limit by itself David Rientjes
2009-05-05  0:26 ` [patch 3/7] oom: cleanup android low memory killer David Rientjes
2009-05-05  0:26 ` [patch 4/7] oom: fix possible android low memory killer NULL pointer David Rientjes
2009-05-05  0:27 ` [patch 5/7] oom: fix possible oom_dump_tasks " David Rientjes
2009-05-05 10:11   ` Nick Piggin
2009-05-05  0:27 ` [patch 6/7] oom: move oom_adj value from task_struct to mm_struct David Rientjes
2009-05-05 10:18   ` Nick Piggin
2009-05-05 18:29     ` David Rientjes
2009-05-05  0:27 ` [patch 7/7] oom: prevent possible OOM_DISABLE livelock David Rientjes
2009-05-05 10:22   ` Nick Piggin
2009-05-05 18:33     ` David Rientjes
2009-05-05 10:23 ` [patch 1/7] lowmemorykiller: Only iterate over process list when needed Nick Piggin
2009-05-05 16:31   ` Greg KH

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).