All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH for 2.6.31 0/4] fix oom_adj regression v2
@ 2009-08-04 10:25 ` KOSAKI Motohiro
  0 siblings, 0 replies; 50+ messages in thread
From: KOSAKI Motohiro @ 2009-08-04 10:25 UTC (permalink / raw)
  To: LKML
  Cc: kosaki.motohiro, Paul Menage, David Rientjes, KAMEZAWA Hiroyuki,
	Rik van Riel, Andrew Morton, Linus Torvalds, Oleg Nesterov,
	linux-mm

The commit 2ff05b2b (oom: move oom_adj value) move oom_adj value to mm_struct.
It is very good first step for sanitize OOM.

However Paul Menage reported the commit makes regression to his job scheduler.
Current OOM logic can kill OOM_DISABLED process.

Why? His program has the code of similar to the following.

	...
	set_oom_adj(OOM_DISABLE); /* The job scheduler never killed by oom */
	...
	if (vfork() == 0) {
		set_oom_adj(0); /* Invoked child can be killed */
		execve("foo-bar-cmd")
	}
	....

vfork() parent and child are shared the same mm_struct. then above set_oom_adj(0) doesn't
only change oom_adj for vfork() child, it's also change oom_adj for vfork() parent.
Then, vfork() parent (job scheduler) lost OOM immune and it was killed.

Actually, fork-setting-exec idiom is very frequently used in userland program. We must
not break this assumption.

This patch series are slightly big, but we must fix any regression soon.



Sorting out OOM requirements:
-----------------------
  - select_bad_process() must select killable process.
    otherwise OOM might makes following livelock.
      1. select_bad_process() select unkillable process
      2. oom_kill_process() do no-op and return.
      3. exit out_of_memory and makes next OOM soon. then, goto 1 again.
  - vfork parent and child must not shared oom_adj.


My proposal
-----------------------
  - oom_adj become per-process property. it have been documented long time.
    but the implementaion was not correct.
  - oom_score also become per-process property. it makes oom logic simpler and faster.
  - remove bogus vfork() parent killing logic






^ permalink raw reply	[flat|nested] 50+ messages in thread

* [PATCH for 2.6.31 0/4] fix oom_adj regression v2
@ 2009-08-04 10:25 ` KOSAKI Motohiro
  0 siblings, 0 replies; 50+ messages in thread
From: KOSAKI Motohiro @ 2009-08-04 10:25 UTC (permalink / raw)
  To: LKML
  Cc: kosaki.motohiro, Paul Menage, David Rientjes, KAMEZAWA Hiroyuki,
	Rik van Riel, Andrew Morton, Linus Torvalds, Oleg Nesterov,
	linux-mm

The commit 2ff05b2b (oom: move oom_adj value) move oom_adj value to mm_struct.
It is very good first step for sanitize OOM.

However Paul Menage reported the commit makes regression to his job scheduler.
Current OOM logic can kill OOM_DISABLED process.

Why? His program has the code of similar to the following.

	...
	set_oom_adj(OOM_DISABLE); /* The job scheduler never killed by oom */
	...
	if (vfork() == 0) {
		set_oom_adj(0); /* Invoked child can be killed */
		execve("foo-bar-cmd")
	}
	....

vfork() parent and child are shared the same mm_struct. then above set_oom_adj(0) doesn't
only change oom_adj for vfork() child, it's also change oom_adj for vfork() parent.
Then, vfork() parent (job scheduler) lost OOM immune and it was killed.

Actually, fork-setting-exec idiom is very frequently used in userland program. We must
not break this assumption.

This patch series are slightly big, but we must fix any regression soon.



Sorting out OOM requirements:
-----------------------
  - select_bad_process() must select killable process.
    otherwise OOM might makes following livelock.
      1. select_bad_process() select unkillable process
      2. oom_kill_process() do no-op and return.
      3. exit out_of_memory and makes next OOM soon. then, goto 1 again.
  - vfork parent and child must not shared oom_adj.


My proposal
-----------------------
  - oom_adj become per-process property. it have been documented long time.
    but the implementaion was not correct.
  - oom_score also become per-process property. it makes oom logic simpler and faster.
  - remove bogus vfork() parent killing logic





--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 50+ messages in thread

* [PATCH 1/4] oom: move oom_adj to signal_struct
  2009-08-04 10:25 ` KOSAKI Motohiro
@ 2009-08-04 10:25   ` KOSAKI Motohiro
  -1 siblings, 0 replies; 50+ messages in thread
From: KOSAKI Motohiro @ 2009-08-04 10:25 UTC (permalink / raw)
  To: LKML
  Cc: kosaki.motohiro, Paul Menage, David Rientjes, KAMEZAWA Hiroyuki,
	Rik van Riel, Andrew Morton, Linus Torvalds, Oleg Nesterov,
	linux-mm

Subject: [PATCH] oom: move oom_adj to signal_struct

The commit 2ff05b2b (oom: move oom_adj value) move oom_adj value to mm_struct.
It is very good first step for sanitize OOM.

However Paul Menage reported the commit makes regression to his job scheduler.
Current OOM logic can kill OOM_DISABLED process.

Why? His program has the code of similar to the following.

	...
	set_oom_adj(OOM_DISABLE); /* The job scheduler never killed by oom */
	...
	if (vfork() == 0) {
		set_oom_adj(0); /* Invoked child can be killed */
		execve("foo-bar-cmd")
	}
	....

vfork() parent and child are shared the same mm_struct. then above set_oom_adj(0) doesn't
only change oom_adj for vfork() child, but also change oom_adj for vfork() parent.
Then, vfork() parent (job scheduler) lost OOM immune and it was killed.

Actually, fork-setting-exec idiom is very frequently used in userland program. We must
not break this assumption.

Therefore, this patch move oom_adj again. it will be moved signal_struct. it is
truth per-process data place.

Plus, Restored writing to /proc/pid/oom_adj for a kernel thread return success again.
changing return value might make confusing to shell-script.

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Paul Menage <menage@google.com>
Cc: David Rientjes <rientjes@google.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
---
 Documentation/filesystems/proc.txt |    9 ++---
 fs/proc/base.c                     |   33 +++++++++++---------
 include/linux/mm_types.h           |    2 -
 include/linux/sched.h              |    2 +
 kernel/fork.c                      |    3 +
 mm/oom_kill.c                      |   60 +++++++++++++++++++++++++++----------
 6 files changed, 70 insertions(+), 39 deletions(-)

Index: b/fs/proc/base.c
===================================================================
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -1002,16 +1002,17 @@ static ssize_t oom_adjust_read(struct fi
 	struct task_struct *task = get_proc_task(file->f_path.dentry->d_inode);
 	char buffer[PROC_NUMBUF];
 	size_t len;
-	int oom_adjust;
+	int oom_adjust = OOM_DISABLE;
+	unsigned long flags;
 
 	if (!task)
 		return -ESRCH;
-	task_lock(task);
-	if (task->mm)
-		oom_adjust = task->mm->oom_adj;
-	else
-		oom_adjust = OOM_DISABLE;
-	task_unlock(task);
+
+	if (lock_task_sighand(task, &flags)) {
+		oom_adjust = task->signal->oom_adj;
+		unlock_task_sighand(task, &flags);
+	}
+
 	put_task_struct(task);
 
 	len = snprintf(buffer, sizeof(buffer), "%i\n", oom_adjust);
@@ -1025,6 +1026,7 @@ static ssize_t oom_adjust_write(struct f
 	struct task_struct *task;
 	char buffer[PROC_NUMBUF], *end;
 	int oom_adjust;
+	unsigned long flags;
 
 	memset(buffer, 0, sizeof(buffer));
 	if (count > sizeof(buffer) - 1)
@@ -1040,19 +1042,20 @@ static ssize_t oom_adjust_write(struct f
 	task = get_proc_task(file->f_path.dentry->d_inode);
 	if (!task)
 		return -ESRCH;
-	task_lock(task);
-	if (!task->mm) {
-		task_unlock(task);
+
+	if (!lock_task_sighand(task, &flags)) {
 		put_task_struct(task);
-		return -EINVAL;
+		return -ESRCH;
 	}
-	if (oom_adjust < task->mm->oom_adj && !capable(CAP_SYS_RESOURCE)) {
-		task_unlock(task);
+
+	if (oom_adjust < task->signal->oom_adj && !capable(CAP_SYS_RESOURCE)) {
+		unlock_task_sighand(task, &flags);
 		put_task_struct(task);
 		return -EACCES;
 	}
-	task->mm->oom_adj = oom_adjust;
-	task_unlock(task);
+
+	task->signal->oom_adj = oom_adjust;
+	unlock_task_sighand(task, &flags);
 	put_task_struct(task);
 	if (end - buffer == 0)
 		return -EIO;
Index: b/include/linux/mm_types.h
===================================================================
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -240,8 +240,6 @@ struct mm_struct {
 
 	unsigned long saved_auxv[AT_VECTOR_SIZE]; /* for /proc/PID/auxv */
 
-	s8 oom_adj;	/* OOM kill score adjustment (bit shift) */
-
 	cpumask_t cpu_vm_mask;
 
 	/* Architecture-specific MM context */
Index: b/include/linux/sched.h
===================================================================
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -629,6 +629,8 @@ struct signal_struct {
 	unsigned audit_tty;
 	struct tty_audit_buf *tty_audit_buf;
 #endif
+
+	int oom_adj;	/* OOM kill score adjustment (bit shift) */
 };
 
 /* Context switch must be unlocked if interrupts are to be enabled */
Index: b/kernel/fork.c
===================================================================
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -426,7 +426,6 @@ static struct mm_struct * mm_init(struct
 	init_rwsem(&mm->mmap_sem);
 	INIT_LIST_HEAD(&mm->mmlist);
 	mm->flags = (current->mm) ? current->mm->flags : default_dump_filter;
-	mm->oom_adj = (current->mm) ? current->mm->oom_adj : 0;
 	mm->core_state = NULL;
 	mm->nr_ptes = 0;
 	set_mm_counter(mm, file_rss, 0);
@@ -868,6 +867,8 @@ static int copy_signal(unsigned long clo
 
 	tty_audit_fork(sig);
 
+	sig->oom_adj = current->signal->oom_adj;
+
 	return 0;
 }
 
Index: b/mm/oom_kill.c
===================================================================
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -34,6 +34,31 @@ int sysctl_oom_dump_tasks;
 static DEFINE_SPINLOCK(zone_scan_lock);
 /* #define DEBUG */
 
+int get_oom_adj(struct task_struct *tsk)
+{
+	unsigned long flags;
+	int oom_adj = OOM_DISABLE;
+
+	if (tsk->mm && lock_task_sighand(tsk, &flags)) {
+		oom_adj = tsk->signal->oom_adj;
+		unlock_task_sighand(tsk, &flags);
+	}
+
+	return oom_adj;
+}
+
+void set_oom_adj(struct task_struct *tsk, int oom_adj)
+{
+	unsigned long flags;
+
+	if (lock_task_sighand(tsk, &flags)) {
+		tsk->signal->oom_adj = oom_adj;
+		unlock_task_sighand(tsk, &flags);
+	}
+}
+
+
+
 /**
  * badness - calculate a numeric value for how bad this task has been
  * @p: task struct of which task we should calculate
@@ -60,17 +85,16 @@ unsigned long badness(struct task_struct
 	struct task_struct *child;
 	int oom_adj;
 
+	oom_adj = get_oom_adj(p);
+	if (oom_adj == OOM_DISABLE)
+		return 0;
+
 	task_lock(p);
 	mm = p->mm;
 	if (!mm) {
 		task_unlock(p);
 		return 0;
 	}
-	oom_adj = mm->oom_adj;
-	if (oom_adj == OOM_DISABLE) {
-		task_unlock(p);
-		return 0;
-	}
 
 	/*
 	 * The memory size of the process is the basis for the badness.
@@ -283,6 +307,8 @@ static struct task_struct *select_bad_pr
 static void dump_tasks(const struct mem_cgroup *mem)
 {
 	struct task_struct *g, *p;
+	unsigned long flags;
+	int oom_adj;
 
 	printk(KERN_INFO "[ pid ]   uid  tgid total_vm      rss cpu oom_adj "
 	       "name\n");
@@ -294,6 +320,12 @@ static void dump_tasks(const struct mem_
 		if (!thread_group_leader(p))
 			continue;
 
+		if (!lock_task_sighand(p, &flags))
+			continue;
+
+		oom_adj = p->signal->oom_adj;
+		unlock_task_sighand(p, &flags);
+
 		task_lock(p);
 		mm = p->mm;
 		if (!mm) {
@@ -307,7 +339,7 @@ static void dump_tasks(const struct mem_
 		}
 		printk(KERN_INFO "[%5d] %5d %5d %8lu %8lu %3d     %3d %s\n",
 		       p->pid, __task_cred(p)->uid, p->tgid, mm->total_vm,
-		       get_mm_rss(mm), (int)task_cpu(p), mm->oom_adj, p->comm);
+		       get_mm_rss(mm), (int)task_cpu(p), oom_adj, p->comm);
 		task_unlock(p);
 	} while_each_thread(g, p);
 }
@@ -345,16 +377,11 @@ static void __oom_kill_task(struct task_
 
 static int oom_kill_task(struct task_struct *p)
 {
-	struct mm_struct *mm;
 	struct task_struct *g, *q;
 
-	task_lock(p);
-	mm = p->mm;
-	if (!mm || mm->oom_adj == OOM_DISABLE) {
-		task_unlock(p);
+	if (get_oom_adj(p) == OOM_DISABLE)
 		return 1;
-	}
-	task_unlock(p);
+
 	__oom_kill_task(p, 1);
 
 	/*
@@ -363,7 +390,7 @@ static int oom_kill_task(struct task_str
 	 * to memory reserves though, otherwise we might deplete all memory.
 	 */
 	do_each_thread(g, q) {
-		if (q->mm == mm && !same_thread_group(q, p))
+		if (q->mm == p->mm && !same_thread_group(q, p))
 			force_sig(SIGKILL, q);
 	} while_each_thread(g, q);
 
@@ -377,11 +404,12 @@ static int oom_kill_process(struct task_
 	struct task_struct *c;
 
 	if (printk_ratelimit()) {
+		int oom_adj = get_oom_adj(current);
+
 		task_lock(current);
 		printk(KERN_WARNING "%s invoked oom-killer: "
 			"gfp_mask=0x%x, order=%d, oom_adj=%d\n",
-			current->comm, gfp_mask, order,
-			current->mm ? current->mm->oom_adj : OOM_DISABLE);
+			current->comm, gfp_mask, order, oom_adj);
 		cpuset_print_task_mems_allowed(current);
 		task_unlock(current);
 		dump_stack();
Index: b/Documentation/filesystems/proc.txt
===================================================================
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -1168,12 +1168,11 @@ CHAPTER 3: PER-PROCESS PARAMETERS
 ------------------------------------------------------
 
 This file can be used to adjust the score used to select which processes should
-be killed in an out-of-memory situation.  The oom_adj value is a characteristic
-of the task's mm, so all threads that share an mm with pid will have the same
+be killed in an out-of-memory situation. All threads in the process will have the same
 oom_adj value.  A high value will increase the likelihood of this process being
 killed by the oom-killer.  Valid values are in the range -16 to +15 as
 explained below and a special value of -17, which disables oom-killing
-altogether for threads sharing pid's mm.
+altogether the process.
 
 The process to be killed in an out-of-memory situation is selected among all others
 based on its badness score. This value equals the original memory size of the process
@@ -1187,8 +1186,8 @@ the parent's score if they do not share 
 are the prime candidates to be killed. Having only one 'hungry' child will make
 parent less preferable than the child.
 
-/proc/<pid>/oom_adj cannot be changed for kthreads since they are immune from
-oom-killing already.
+/proc/<pid>/oom_adj can be changed for kthreads, but it's meaningless. They are immune from
+oom-killing.
 
 /proc/<pid>/oom_score shows process' current badness score.
 



^ permalink raw reply	[flat|nested] 50+ messages in thread

* [PATCH 1/4] oom: move oom_adj to signal_struct
@ 2009-08-04 10:25   ` KOSAKI Motohiro
  0 siblings, 0 replies; 50+ messages in thread
From: KOSAKI Motohiro @ 2009-08-04 10:25 UTC (permalink / raw)
  To: LKML
  Cc: kosaki.motohiro, Paul Menage, David Rientjes, KAMEZAWA Hiroyuki,
	Rik van Riel, Andrew Morton, Linus Torvalds, Oleg Nesterov,
	linux-mm

Subject: [PATCH] oom: move oom_adj to signal_struct

The commit 2ff05b2b (oom: move oom_adj value) move oom_adj value to mm_struct.
It is very good first step for sanitize OOM.

However Paul Menage reported the commit makes regression to his job scheduler.
Current OOM logic can kill OOM_DISABLED process.

Why? His program has the code of similar to the following.

	...
	set_oom_adj(OOM_DISABLE); /* The job scheduler never killed by oom */
	...
	if (vfork() == 0) {
		set_oom_adj(0); /* Invoked child can be killed */
		execve("foo-bar-cmd")
	}
	....

vfork() parent and child are shared the same mm_struct. then above set_oom_adj(0) doesn't
only change oom_adj for vfork() child, but also change oom_adj for vfork() parent.
Then, vfork() parent (job scheduler) lost OOM immune and it was killed.

Actually, fork-setting-exec idiom is very frequently used in userland program. We must
not break this assumption.

Therefore, this patch move oom_adj again. it will be moved signal_struct. it is
truth per-process data place.

Plus, Restored writing to /proc/pid/oom_adj for a kernel thread return success again.
changing return value might make confusing to shell-script.

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Paul Menage <menage@google.com>
Cc: David Rientjes <rientjes@google.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
---
 Documentation/filesystems/proc.txt |    9 ++---
 fs/proc/base.c                     |   33 +++++++++++---------
 include/linux/mm_types.h           |    2 -
 include/linux/sched.h              |    2 +
 kernel/fork.c                      |    3 +
 mm/oom_kill.c                      |   60 +++++++++++++++++++++++++++----------
 6 files changed, 70 insertions(+), 39 deletions(-)

Index: b/fs/proc/base.c
===================================================================
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -1002,16 +1002,17 @@ static ssize_t oom_adjust_read(struct fi
 	struct task_struct *task = get_proc_task(file->f_path.dentry->d_inode);
 	char buffer[PROC_NUMBUF];
 	size_t len;
-	int oom_adjust;
+	int oom_adjust = OOM_DISABLE;
+	unsigned long flags;
 
 	if (!task)
 		return -ESRCH;
-	task_lock(task);
-	if (task->mm)
-		oom_adjust = task->mm->oom_adj;
-	else
-		oom_adjust = OOM_DISABLE;
-	task_unlock(task);
+
+	if (lock_task_sighand(task, &flags)) {
+		oom_adjust = task->signal->oom_adj;
+		unlock_task_sighand(task, &flags);
+	}
+
 	put_task_struct(task);
 
 	len = snprintf(buffer, sizeof(buffer), "%i\n", oom_adjust);
@@ -1025,6 +1026,7 @@ static ssize_t oom_adjust_write(struct f
 	struct task_struct *task;
 	char buffer[PROC_NUMBUF], *end;
 	int oom_adjust;
+	unsigned long flags;
 
 	memset(buffer, 0, sizeof(buffer));
 	if (count > sizeof(buffer) - 1)
@@ -1040,19 +1042,20 @@ static ssize_t oom_adjust_write(struct f
 	task = get_proc_task(file->f_path.dentry->d_inode);
 	if (!task)
 		return -ESRCH;
-	task_lock(task);
-	if (!task->mm) {
-		task_unlock(task);
+
+	if (!lock_task_sighand(task, &flags)) {
 		put_task_struct(task);
-		return -EINVAL;
+		return -ESRCH;
 	}
-	if (oom_adjust < task->mm->oom_adj && !capable(CAP_SYS_RESOURCE)) {
-		task_unlock(task);
+
+	if (oom_adjust < task->signal->oom_adj && !capable(CAP_SYS_RESOURCE)) {
+		unlock_task_sighand(task, &flags);
 		put_task_struct(task);
 		return -EACCES;
 	}
-	task->mm->oom_adj = oom_adjust;
-	task_unlock(task);
+
+	task->signal->oom_adj = oom_adjust;
+	unlock_task_sighand(task, &flags);
 	put_task_struct(task);
 	if (end - buffer == 0)
 		return -EIO;
Index: b/include/linux/mm_types.h
===================================================================
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -240,8 +240,6 @@ struct mm_struct {
 
 	unsigned long saved_auxv[AT_VECTOR_SIZE]; /* for /proc/PID/auxv */
 
-	s8 oom_adj;	/* OOM kill score adjustment (bit shift) */
-
 	cpumask_t cpu_vm_mask;
 
 	/* Architecture-specific MM context */
Index: b/include/linux/sched.h
===================================================================
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -629,6 +629,8 @@ struct signal_struct {
 	unsigned audit_tty;
 	struct tty_audit_buf *tty_audit_buf;
 #endif
+
+	int oom_adj;	/* OOM kill score adjustment (bit shift) */
 };
 
 /* Context switch must be unlocked if interrupts are to be enabled */
Index: b/kernel/fork.c
===================================================================
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -426,7 +426,6 @@ static struct mm_struct * mm_init(struct
 	init_rwsem(&mm->mmap_sem);
 	INIT_LIST_HEAD(&mm->mmlist);
 	mm->flags = (current->mm) ? current->mm->flags : default_dump_filter;
-	mm->oom_adj = (current->mm) ? current->mm->oom_adj : 0;
 	mm->core_state = NULL;
 	mm->nr_ptes = 0;
 	set_mm_counter(mm, file_rss, 0);
@@ -868,6 +867,8 @@ static int copy_signal(unsigned long clo
 
 	tty_audit_fork(sig);
 
+	sig->oom_adj = current->signal->oom_adj;
+
 	return 0;
 }
 
Index: b/mm/oom_kill.c
===================================================================
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -34,6 +34,31 @@ int sysctl_oom_dump_tasks;
 static DEFINE_SPINLOCK(zone_scan_lock);
 /* #define DEBUG */
 
+int get_oom_adj(struct task_struct *tsk)
+{
+	unsigned long flags;
+	int oom_adj = OOM_DISABLE;
+
+	if (tsk->mm && lock_task_sighand(tsk, &flags)) {
+		oom_adj = tsk->signal->oom_adj;
+		unlock_task_sighand(tsk, &flags);
+	}
+
+	return oom_adj;
+}
+
+void set_oom_adj(struct task_struct *tsk, int oom_adj)
+{
+	unsigned long flags;
+
+	if (lock_task_sighand(tsk, &flags)) {
+		tsk->signal->oom_adj = oom_adj;
+		unlock_task_sighand(tsk, &flags);
+	}
+}
+
+
+
 /**
  * badness - calculate a numeric value for how bad this task has been
  * @p: task struct of which task we should calculate
@@ -60,17 +85,16 @@ unsigned long badness(struct task_struct
 	struct task_struct *child;
 	int oom_adj;
 
+	oom_adj = get_oom_adj(p);
+	if (oom_adj == OOM_DISABLE)
+		return 0;
+
 	task_lock(p);
 	mm = p->mm;
 	if (!mm) {
 		task_unlock(p);
 		return 0;
 	}
-	oom_adj = mm->oom_adj;
-	if (oom_adj == OOM_DISABLE) {
-		task_unlock(p);
-		return 0;
-	}
 
 	/*
 	 * The memory size of the process is the basis for the badness.
@@ -283,6 +307,8 @@ static struct task_struct *select_bad_pr
 static void dump_tasks(const struct mem_cgroup *mem)
 {
 	struct task_struct *g, *p;
+	unsigned long flags;
+	int oom_adj;
 
 	printk(KERN_INFO "[ pid ]   uid  tgid total_vm      rss cpu oom_adj "
 	       "name\n");
@@ -294,6 +320,12 @@ static void dump_tasks(const struct mem_
 		if (!thread_group_leader(p))
 			continue;
 
+		if (!lock_task_sighand(p, &flags))
+			continue;
+
+		oom_adj = p->signal->oom_adj;
+		unlock_task_sighand(p, &flags);
+
 		task_lock(p);
 		mm = p->mm;
 		if (!mm) {
@@ -307,7 +339,7 @@ static void dump_tasks(const struct mem_
 		}
 		printk(KERN_INFO "[%5d] %5d %5d %8lu %8lu %3d     %3d %s\n",
 		       p->pid, __task_cred(p)->uid, p->tgid, mm->total_vm,
-		       get_mm_rss(mm), (int)task_cpu(p), mm->oom_adj, p->comm);
+		       get_mm_rss(mm), (int)task_cpu(p), oom_adj, p->comm);
 		task_unlock(p);
 	} while_each_thread(g, p);
 }
@@ -345,16 +377,11 @@ static void __oom_kill_task(struct task_
 
 static int oom_kill_task(struct task_struct *p)
 {
-	struct mm_struct *mm;
 	struct task_struct *g, *q;
 
-	task_lock(p);
-	mm = p->mm;
-	if (!mm || mm->oom_adj == OOM_DISABLE) {
-		task_unlock(p);
+	if (get_oom_adj(p) == OOM_DISABLE)
 		return 1;
-	}
-	task_unlock(p);
+
 	__oom_kill_task(p, 1);
 
 	/*
@@ -363,7 +390,7 @@ static int oom_kill_task(struct task_str
 	 * to memory reserves though, otherwise we might deplete all memory.
 	 */
 	do_each_thread(g, q) {
-		if (q->mm == mm && !same_thread_group(q, p))
+		if (q->mm == p->mm && !same_thread_group(q, p))
 			force_sig(SIGKILL, q);
 	} while_each_thread(g, q);
 
@@ -377,11 +404,12 @@ static int oom_kill_process(struct task_
 	struct task_struct *c;
 
 	if (printk_ratelimit()) {
+		int oom_adj = get_oom_adj(current);
+
 		task_lock(current);
 		printk(KERN_WARNING "%s invoked oom-killer: "
 			"gfp_mask=0x%x, order=%d, oom_adj=%d\n",
-			current->comm, gfp_mask, order,
-			current->mm ? current->mm->oom_adj : OOM_DISABLE);
+			current->comm, gfp_mask, order, oom_adj);
 		cpuset_print_task_mems_allowed(current);
 		task_unlock(current);
 		dump_stack();
Index: b/Documentation/filesystems/proc.txt
===================================================================
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -1168,12 +1168,11 @@ CHAPTER 3: PER-PROCESS PARAMETERS
 ------------------------------------------------------
 
 This file can be used to adjust the score used to select which processes should
-be killed in an out-of-memory situation.  The oom_adj value is a characteristic
-of the task's mm, so all threads that share an mm with pid will have the same
+be killed in an out-of-memory situation. All threads in the process will have the same
 oom_adj value.  A high value will increase the likelihood of this process being
 killed by the oom-killer.  Valid values are in the range -16 to +15 as
 explained below and a special value of -17, which disables oom-killing
-altogether for threads sharing pid's mm.
+altogether the process.
 
 The process to be killed in an out-of-memory situation is selected among all others
 based on its badness score. This value equals the original memory size of the process
@@ -1187,8 +1186,8 @@ the parent's score if they do not share 
 are the prime candidates to be killed. Having only one 'hungry' child will make
 parent less preferable than the child.
 
-/proc/<pid>/oom_adj cannot be changed for kthreads since they are immune from
-oom-killing already.
+/proc/<pid>/oom_adj can be changed for kthreads, but it's meaningless. They are immune from
+oom-killing.
 
 /proc/<pid>/oom_score shows process' current badness score.
 


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 50+ messages in thread

* [PATCH 2/4] oom: make oom_score to per-process value
  2009-08-04 10:25 ` KOSAKI Motohiro
@ 2009-08-04 10:26   ` KOSAKI Motohiro
  -1 siblings, 0 replies; 50+ messages in thread
From: KOSAKI Motohiro @ 2009-08-04 10:26 UTC (permalink / raw)
  To: LKML
  Cc: kosaki.motohiro, Paul Menage, David Rientjes, KAMEZAWA Hiroyuki,
	Rik van Riel, Andrew Morton, Linus Torvalds, Oleg Nesterov,
	linux-mm

Subject: [PATCH] oom: make oom_score to per-process value

oom-killer kill a process, not task. Then oom_score should be
calculated as per-process too. it makes consistency more and
makes speed up select_bad_process().


Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Paul Menage <menage@google.com>
Cc: David Rientjes <rientjes@google.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Rik van Riel <riel@redhat.com>,
Cc: Andrew Morton <akpm@linux-foundation.org>,
---
 Documentation/filesystems/proc.txt |    4 ++--
 fs/proc/base.c                     |    2 +-
 mm/oom_kill.c                      |   36 +++++++++++++++++++++++++++++-------
 3 files changed, 32 insertions(+), 10 deletions(-)

Index: b/mm/oom_kill.c
===================================================================
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -58,6 +58,19 @@ void set_oom_adj(struct task_struct *tsk
 }
 
 
+static int has_intersects_mems_allowed(struct task_struct *tsk)
+{
+	struct task_struct *t;
+
+	t = tsk;
+	do {
+		if (cpuset_mems_allowed_intersects(current, t))
+			return 1;
+		t = next_thread(t);
+	} while (t != tsk);
+
+	return 0;
+}
 
 /**
  * badness - calculate a numeric value for how bad this task has been
@@ -77,18 +90,26 @@ void set_oom_adj(struct task_struct *tsk
  *    algorithm has been meticulously tuned to meet the principle
  *    of least surprise ... (be careful when you change it)
  */
-
 unsigned long badness(struct task_struct *p, unsigned long uptime)
 {
 	unsigned long points, cpu_time, run_time;
 	struct mm_struct *mm;
 	struct task_struct *child;
 	int oom_adj;
+	struct task_cputime task_time;
+	unsigned long flags;
+	unsigned long utime;
+	unsigned long stime;
 
 	oom_adj = get_oom_adj(p);
 	if (oom_adj == OOM_DISABLE)
 		return 0;
 
+	if (!lock_task_sighand(p, &flags))
+		return 0;
+	thread_group_cputime(p, &task_time);
+	unlock_task_sighand(p, &flags);
+
 	task_lock(p);
 	mm = p->mm;
 	if (!mm) {
@@ -132,8 +153,9 @@ unsigned long badness(struct task_struct
          * of seconds. There is no particular reason for this other than
          * that it turned out to work very well in practice.
 	 */
-	cpu_time = (cputime_to_jiffies(p->utime) + cputime_to_jiffies(p->stime))
-		>> (SHIFT_HZ + 3);
+	utime = cputime_to_jiffies(task_time.utime);
+	stime = cputime_to_jiffies(task_time.stime);
+	cpu_time = (utime + stime) >> (SHIFT_HZ + 3);
 
 	if (uptime >= p->start_time.tv_sec)
 		run_time = (uptime - p->start_time.tv_sec) >> 10;
@@ -174,7 +196,7 @@ unsigned long badness(struct task_struct
 	 * because p may have allocated or otherwise mapped memory on
 	 * this node before. However it will be less likely.
 	 */
-	if (!cpuset_mems_allowed_intersects(current, p))
+	if (!has_intersects_mems_allowed(p))
 		points /= 8;
 
 	/*
@@ -230,13 +252,13 @@ static inline enum oom_constraint constr
 static struct task_struct *select_bad_process(unsigned long *ppoints,
 						struct mem_cgroup *mem)
 {
-	struct task_struct *g, *p;
+	struct task_struct *p;
 	struct task_struct *chosen = NULL;
 	struct timespec uptime;
 	*ppoints = 0;
 
 	do_posix_clock_monotonic_gettime(&uptime);
-	do_each_thread(g, p) {
+	for_each_process(p) {
 		unsigned long points;
 
 		/*
@@ -286,7 +308,7 @@ static struct task_struct *select_bad_pr
 			chosen = p;
 			*ppoints = points;
 		}
-	} while_each_thread(g, p);
+	}
 
 	return chosen;
 }
Index: b/fs/proc/base.c
===================================================================
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -450,7 +450,7 @@ static int proc_oom_score(struct task_st
 
 	do_posix_clock_monotonic_gettime(&uptime);
 	read_lock(&tasklist_lock);
-	points = badness(task, uptime.tv_sec);
+	points = badness(task->group_leader, uptime.tv_sec);
 	read_unlock(&tasklist_lock);
 	return sprintf(buffer, "%lu\n", points);
 }
Index: b/Documentation/filesystems/proc.txt
===================================================================
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -1195,13 +1195,13 @@ The following heuristics are then applie
  * if the task was reniced, its score doubles
  * superuser or direct hardware access tasks (CAP_SYS_ADMIN, CAP_SYS_RESOURCE
  	or CAP_SYS_RAWIO) have their score divided by 4
- * if oom condition happened in one cpuset and checked task does not belong
+ * if oom condition happened in one cpuset and checked process does not belong
  	to it, its score is divided by 8
  * the resulting score is multiplied by two to the power of oom_adj, i.e.
 	points <<= oom_adj when it is positive and
 	points >>= -(oom_adj) otherwise
 
-The task with the highest badness score is then selected and its children
+The process with the highest badness score is then selected and its children
 are killed, process itself will be killed in an OOM situation when it does
 not have children or some of them disabled oom like described above.
 



^ permalink raw reply	[flat|nested] 50+ messages in thread

* [PATCH 2/4] oom: make oom_score to per-process value
@ 2009-08-04 10:26   ` KOSAKI Motohiro
  0 siblings, 0 replies; 50+ messages in thread
From: KOSAKI Motohiro @ 2009-08-04 10:26 UTC (permalink / raw)
  To: LKML
  Cc: kosaki.motohiro, Paul Menage, David Rientjes, KAMEZAWA Hiroyuki,
	Rik van Riel, Andrew Morton, Linus Torvalds, Oleg Nesterov,
	linux-mm

Subject: [PATCH] oom: make oom_score to per-process value

oom-killer kill a process, not task. Then oom_score should be
calculated as per-process too. it makes consistency more and
makes speed up select_bad_process().


Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Paul Menage <menage@google.com>
Cc: David Rientjes <rientjes@google.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Rik van Riel <riel@redhat.com>,
Cc: Andrew Morton <akpm@linux-foundation.org>,
---
 Documentation/filesystems/proc.txt |    4 ++--
 fs/proc/base.c                     |    2 +-
 mm/oom_kill.c                      |   36 +++++++++++++++++++++++++++++-------
 3 files changed, 32 insertions(+), 10 deletions(-)

Index: b/mm/oom_kill.c
===================================================================
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -58,6 +58,19 @@ void set_oom_adj(struct task_struct *tsk
 }
 
 
+static int has_intersects_mems_allowed(struct task_struct *tsk)
+{
+	struct task_struct *t;
+
+	t = tsk;
+	do {
+		if (cpuset_mems_allowed_intersects(current, t))
+			return 1;
+		t = next_thread(t);
+	} while (t != tsk);
+
+	return 0;
+}
 
 /**
  * badness - calculate a numeric value for how bad this task has been
@@ -77,18 +90,26 @@ void set_oom_adj(struct task_struct *tsk
  *    algorithm has been meticulously tuned to meet the principle
  *    of least surprise ... (be careful when you change it)
  */
-
 unsigned long badness(struct task_struct *p, unsigned long uptime)
 {
 	unsigned long points, cpu_time, run_time;
 	struct mm_struct *mm;
 	struct task_struct *child;
 	int oom_adj;
+	struct task_cputime task_time;
+	unsigned long flags;
+	unsigned long utime;
+	unsigned long stime;
 
 	oom_adj = get_oom_adj(p);
 	if (oom_adj == OOM_DISABLE)
 		return 0;
 
+	if (!lock_task_sighand(p, &flags))
+		return 0;
+	thread_group_cputime(p, &task_time);
+	unlock_task_sighand(p, &flags);
+
 	task_lock(p);
 	mm = p->mm;
 	if (!mm) {
@@ -132,8 +153,9 @@ unsigned long badness(struct task_struct
          * of seconds. There is no particular reason for this other than
          * that it turned out to work very well in practice.
 	 */
-	cpu_time = (cputime_to_jiffies(p->utime) + cputime_to_jiffies(p->stime))
-		>> (SHIFT_HZ + 3);
+	utime = cputime_to_jiffies(task_time.utime);
+	stime = cputime_to_jiffies(task_time.stime);
+	cpu_time = (utime + stime) >> (SHIFT_HZ + 3);
 
 	if (uptime >= p->start_time.tv_sec)
 		run_time = (uptime - p->start_time.tv_sec) >> 10;
@@ -174,7 +196,7 @@ unsigned long badness(struct task_struct
 	 * because p may have allocated or otherwise mapped memory on
 	 * this node before. However it will be less likely.
 	 */
-	if (!cpuset_mems_allowed_intersects(current, p))
+	if (!has_intersects_mems_allowed(p))
 		points /= 8;
 
 	/*
@@ -230,13 +252,13 @@ static inline enum oom_constraint constr
 static struct task_struct *select_bad_process(unsigned long *ppoints,
 						struct mem_cgroup *mem)
 {
-	struct task_struct *g, *p;
+	struct task_struct *p;
 	struct task_struct *chosen = NULL;
 	struct timespec uptime;
 	*ppoints = 0;
 
 	do_posix_clock_monotonic_gettime(&uptime);
-	do_each_thread(g, p) {
+	for_each_process(p) {
 		unsigned long points;
 
 		/*
@@ -286,7 +308,7 @@ static struct task_struct *select_bad_pr
 			chosen = p;
 			*ppoints = points;
 		}
-	} while_each_thread(g, p);
+	}
 
 	return chosen;
 }
Index: b/fs/proc/base.c
===================================================================
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -450,7 +450,7 @@ static int proc_oom_score(struct task_st
 
 	do_posix_clock_monotonic_gettime(&uptime);
 	read_lock(&tasklist_lock);
-	points = badness(task, uptime.tv_sec);
+	points = badness(task->group_leader, uptime.tv_sec);
 	read_unlock(&tasklist_lock);
 	return sprintf(buffer, "%lu\n", points);
 }
Index: b/Documentation/filesystems/proc.txt
===================================================================
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -1195,13 +1195,13 @@ The following heuristics are then applie
  * if the task was reniced, its score doubles
  * superuser or direct hardware access tasks (CAP_SYS_ADMIN, CAP_SYS_RESOURCE
  	or CAP_SYS_RAWIO) have their score divided by 4
- * if oom condition happened in one cpuset and checked task does not belong
+ * if oom condition happened in one cpuset and checked process does not belong
  	to it, its score is divided by 8
  * the resulting score is multiplied by two to the power of oom_adj, i.e.
 	points <<= oom_adj when it is positive and
 	points >>= -(oom_adj) otherwise
 
-The task with the highest badness score is then selected and its children
+The process with the highest badness score is then selected and its children
 are killed, process itself will be killed in an OOM situation when it does
 not have children or some of them disabled oom like described above.
 


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 50+ messages in thread

* [PATCH 3/4] oom: oom_kill doesn't kill vfork parent(or child)
  2009-08-04 10:25 ` KOSAKI Motohiro
@ 2009-08-04 10:27   ` KOSAKI Motohiro
  -1 siblings, 0 replies; 50+ messages in thread
From: KOSAKI Motohiro @ 2009-08-04 10:27 UTC (permalink / raw)
  To: LKML
  Cc: kosaki.motohiro, Paul Menage, David Rientjes, KAMEZAWA Hiroyuki,
	Rik van Riel, Andrew Morton, Linus Torvalds, Oleg Nesterov,
	linux-mm

Subject: [PATCH] oom: oom_kill doesn't kill vfork parent(or child).

Current oom_kill doesn't only kill victim process, but also kill
mm shread task. it mean vfork parent will be killed.

but, That's bogus. another process have another oom_adj. we shouldn't
ignore their oom_adj (it might have OOM_DISABLE).

following caller hit the minefield.

---------------------------------------
        switch (constraint) {
        case CONSTRAINT_MEMORY_POLICY:
                oom_kill_process(current, gfp_mask, order, 0, NULL,
                                "No available memory (MPOL_BIND)");
                break;


Note: force_sig(SIGKILL) send SIGKILL to all thread in the process.
We don't need to care multi thread in here.


Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Paul Menage <menage@google.com>
Cc: David Rientjes <rientjes@google.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Rik van Riel <riel@redhat.com>,
Cc: Andrew Morton <akpm@linux-foundation.org>,
---
 mm/oom_kill.c |   12 ------------
 1 file changed, 12 deletions(-)

Index: b/mm/oom_kill.c
===================================================================
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -399,23 +399,11 @@ static void __oom_kill_task(struct task_
 
 static int oom_kill_task(struct task_struct *p)
 {
-	struct task_struct *g, *q;
-
 	if (get_oom_adj(p) == OOM_DISABLE)
 		return 1;
 
 	__oom_kill_task(p, 1);
 
-	/*
-	 * kill all processes that share the ->mm (i.e. all threads),
-	 * but are in a different thread group. Don't let them have access
-	 * to memory reserves though, otherwise we might deplete all memory.
-	 */
-	do_each_thread(g, q) {
-		if (q->mm == p->mm && !same_thread_group(q, p))
-			force_sig(SIGKILL, q);
-	} while_each_thread(g, q);
-
 	return 0;
 }
 



^ permalink raw reply	[flat|nested] 50+ messages in thread

* [PATCH 3/4] oom: oom_kill doesn't kill vfork parent(or child)
@ 2009-08-04 10:27   ` KOSAKI Motohiro
  0 siblings, 0 replies; 50+ messages in thread
From: KOSAKI Motohiro @ 2009-08-04 10:27 UTC (permalink / raw)
  To: LKML
  Cc: kosaki.motohiro, Paul Menage, David Rientjes, KAMEZAWA Hiroyuki,
	Rik van Riel, Andrew Morton, Linus Torvalds, Oleg Nesterov,
	linux-mm

Subject: [PATCH] oom: oom_kill doesn't kill vfork parent(or child).

Current oom_kill doesn't only kill victim process, but also kill
mm shread task. it mean vfork parent will be killed.

but, That's bogus. another process have another oom_adj. we shouldn't
ignore their oom_adj (it might have OOM_DISABLE).

following caller hit the minefield.

---------------------------------------
        switch (constraint) {
        case CONSTRAINT_MEMORY_POLICY:
                oom_kill_process(current, gfp_mask, order, 0, NULL,
                                "No available memory (MPOL_BIND)");
                break;


Note: force_sig(SIGKILL) send SIGKILL to all thread in the process.
We don't need to care multi thread in here.


Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Paul Menage <menage@google.com>
Cc: David Rientjes <rientjes@google.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Rik van Riel <riel@redhat.com>,
Cc: Andrew Morton <akpm@linux-foundation.org>,
---
 mm/oom_kill.c |   12 ------------
 1 file changed, 12 deletions(-)

Index: b/mm/oom_kill.c
===================================================================
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -399,23 +399,11 @@ static void __oom_kill_task(struct task_
 
 static int oom_kill_task(struct task_struct *p)
 {
-	struct task_struct *g, *q;
-
 	if (get_oom_adj(p) == OOM_DISABLE)
 		return 1;
 
 	__oom_kill_task(p, 1);
 
-	/*
-	 * kill all processes that share the ->mm (i.e. all threads),
-	 * but are in a different thread group. Don't let them have access
-	 * to memory reserves though, otherwise we might deplete all memory.
-	 */
-	do_each_thread(g, q) {
-		if (q->mm == p->mm && !same_thread_group(q, p))
-			force_sig(SIGKILL, q);
-	} while_each_thread(g, q);
-
 	return 0;
 }
 


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 50+ messages in thread

* [PATCH 4/4] oom: fix oom_adjust_write() input sanity check.
  2009-08-04 10:25 ` KOSAKI Motohiro
@ 2009-08-04 10:28   ` KOSAKI Motohiro
  -1 siblings, 0 replies; 50+ messages in thread
From: KOSAKI Motohiro @ 2009-08-04 10:28 UTC (permalink / raw)
  To: LKML
  Cc: kosaki.motohiro, Paul Menage, David Rientjes, KAMEZAWA Hiroyuki,
	Rik van Riel, Andrew Morton, Linus Torvalds, Oleg Nesterov,
	linux-mm

Subject: [PATCH] oom: fix oom_adjust_write() input sanity check.

Andrew Morton pointed out oom_adjust_write() has very strange EIO
and new line handling. this patch fixes it.


Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Paul Menage <menage@google.com>
Cc: David Rientjes <rientjes@google.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Rik van Riel <riel@redhat.com>,
Cc: Andrew Morton <akpm@linux-foundation.org>,
---
 fs/proc/base.c |   12 +++++++-----
 1 file changed, 7 insertions(+), 5 deletions(-)

Index: b/fs/proc/base.c
===================================================================
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -1033,12 +1033,15 @@ static ssize_t oom_adjust_write(struct f
 		count = sizeof(buffer) - 1;
 	if (copy_from_user(buffer, buf, count))
 		return -EFAULT;
+
+	strstrip(buffer);
 	oom_adjust = simple_strtol(buffer, &end, 0);
+	if (*end)
+		return -EINVAL;
 	if ((oom_adjust < OOM_ADJUST_MIN || oom_adjust > OOM_ADJUST_MAX) &&
 	     oom_adjust != OOM_DISABLE)
 		return -EINVAL;
-	if (*end == '\n')
-		end++;
+
 	task = get_proc_task(file->f_path.dentry->d_inode);
 	if (!task)
 		return -ESRCH;
@@ -1057,9 +1060,8 @@ static ssize_t oom_adjust_write(struct f
 	task->signal->oom_adj = oom_adjust;
 	unlock_task_sighand(task, &flags);
 	put_task_struct(task);
-	if (end - buffer == 0)
-		return -EIO;
-	return end - buffer;
+
+	return count;
 }
 
 static const struct file_operations proc_oom_adjust_operations = {



^ permalink raw reply	[flat|nested] 50+ messages in thread

* [PATCH 4/4] oom: fix oom_adjust_write() input sanity check.
@ 2009-08-04 10:28   ` KOSAKI Motohiro
  0 siblings, 0 replies; 50+ messages in thread
From: KOSAKI Motohiro @ 2009-08-04 10:28 UTC (permalink / raw)
  To: LKML
  Cc: kosaki.motohiro, Paul Menage, David Rientjes, KAMEZAWA Hiroyuki,
	Rik van Riel, Andrew Morton, Linus Torvalds, Oleg Nesterov,
	linux-mm

Subject: [PATCH] oom: fix oom_adjust_write() input sanity check.

Andrew Morton pointed out oom_adjust_write() has very strange EIO
and new line handling. this patch fixes it.


Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Paul Menage <menage@google.com>
Cc: David Rientjes <rientjes@google.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Rik van Riel <riel@redhat.com>,
Cc: Andrew Morton <akpm@linux-foundation.org>,
---
 fs/proc/base.c |   12 +++++++-----
 1 file changed, 7 insertions(+), 5 deletions(-)

Index: b/fs/proc/base.c
===================================================================
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -1033,12 +1033,15 @@ static ssize_t oom_adjust_write(struct f
 		count = sizeof(buffer) - 1;
 	if (copy_from_user(buffer, buf, count))
 		return -EFAULT;
+
+	strstrip(buffer);
 	oom_adjust = simple_strtol(buffer, &end, 0);
+	if (*end)
+		return -EINVAL;
 	if ((oom_adjust < OOM_ADJUST_MIN || oom_adjust > OOM_ADJUST_MAX) &&
 	     oom_adjust != OOM_DISABLE)
 		return -EINVAL;
-	if (*end == '\n')
-		end++;
+
 	task = get_proc_task(file->f_path.dentry->d_inode);
 	if (!task)
 		return -ESRCH;
@@ -1057,9 +1060,8 @@ static ssize_t oom_adjust_write(struct f
 	task->signal->oom_adj = oom_adjust;
 	unlock_task_sighand(task, &flags);
 	put_task_struct(task);
-	if (end - buffer == 0)
-		return -EIO;
-	return end - buffer;
+
+	return count;
 }
 
 static const struct file_operations proc_oom_adjust_operations = {


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 1/4] oom: move oom_adj to signal_struct
  2009-08-04 10:25   ` KOSAKI Motohiro
@ 2009-08-05  0:45     ` Minchan Kim
  -1 siblings, 0 replies; 50+ messages in thread
From: Minchan Kim @ 2009-08-05  0:45 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: LKML, Paul Menage, David Rientjes, KAMEZAWA Hiroyuki,
	Rik van Riel, Andrew Morton, Linus Torvalds, Oleg Nesterov,
	linux-mm


Hi, Kosaki. 

I am so late to invole this thread. 
But let me have a question. 

What's advantage of placing oom_adj in singal rather than task ?
I mean task->oom_adj and task->signal->oom_adj ?

I am sorry if you already discussed it at last threads. 

On Tue,  4 Aug 2009 19:25:57 +0900 (JST)
KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote:

> Subject: [PATCH] oom: move oom_adj to signal_struct
> 
> The commit 2ff05b2b (oom: move oom_adj value) move oom_adj value to mm_struct.
> It is very good first step for sanitize OOM.
> 
> However Paul Menage reported the commit makes regression to his job scheduler.
> Current OOM logic can kill OOM_DISABLED process.
> 
> Why? His program has the code of similar to the following.
> 
> 	...
> 	set_oom_adj(OOM_DISABLE); /* The job scheduler never killed by oom */
> 	...
> 	if (vfork() == 0) {
> 		set_oom_adj(0); /* Invoked child can be killed */
> 		execve("foo-bar-cmd")
> 	}
> 	....
> 
> vfork() parent and child are shared the same mm_struct. then above set_oom_adj(0) doesn't
> only change oom_adj for vfork() child, but also change oom_adj for vfork() parent.
> Then, vfork() parent (job scheduler) lost OOM immune and it was killed.
> 
> Actually, fork-setting-exec idiom is very frequently used in userland program. We must
> not break this assumption.
> 
> Therefore, this patch move oom_adj again. it will be moved signal_struct. it is
> truth per-process data place.
> 
> Plus, Restored writing to /proc/pid/oom_adj for a kernel thread return success again.
> changing return value might make confusing to shell-script.
> 
> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> Cc: Paul Menage <menage@google.com>
> Cc: David Rientjes <rientjes@google.com>
> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> Cc: Rik van Riel <riel@redhat.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> ---
>  Documentation/filesystems/proc.txt |    9 ++---
>  fs/proc/base.c                     |   33 +++++++++++---------
>  include/linux/mm_types.h           |    2 -
>  include/linux/sched.h              |    2 +
>  kernel/fork.c                      |    3 +
>  mm/oom_kill.c                      |   60 +++++++++++++++++++++++++++----------
>  6 files changed, 70 insertions(+), 39 deletions(-)
> 
> Index: b/fs/proc/base.c
> ===================================================================
> --- a/fs/proc/base.c
> +++ b/fs/proc/base.c
> @@ -1002,16 +1002,17 @@ static ssize_t oom_adjust_read(struct fi
>  	struct task_struct *task = get_proc_task(file->f_path.dentry->d_inode);
>  	char buffer[PROC_NUMBUF];
>  	size_t len;
> -	int oom_adjust;
> +	int oom_adjust = OOM_DISABLE;
> +	unsigned long flags;
>  
>  	if (!task)
>  		return -ESRCH;
> -	task_lock(task);
> -	if (task->mm)
> -		oom_adjust = task->mm->oom_adj;
> -	else
> -		oom_adjust = OOM_DISABLE;
> -	task_unlock(task);
> +
> +	if (lock_task_sighand(task, &flags)) {
> +		oom_adjust = task->signal->oom_adj;
> +		unlock_task_sighand(task, &flags);
> +	}
> +
>  	put_task_struct(task);
>  
>  	len = snprintf(buffer, sizeof(buffer), "%i\n", oom_adjust);
> @@ -1025,6 +1026,7 @@ static ssize_t oom_adjust_write(struct f
>  	struct task_struct *task;
>  	char buffer[PROC_NUMBUF], *end;
>  	int oom_adjust;
> +	unsigned long flags;
>  
>  	memset(buffer, 0, sizeof(buffer));
>  	if (count > sizeof(buffer) - 1)
> @@ -1040,19 +1042,20 @@ static ssize_t oom_adjust_write(struct f
>  	task = get_proc_task(file->f_path.dentry->d_inode);
>  	if (!task)
>  		return -ESRCH;
> -	task_lock(task);
> -	if (!task->mm) {
> -		task_unlock(task);
> +
> +	if (!lock_task_sighand(task, &flags)) {
>  		put_task_struct(task);
> -		return -EINVAL;
> +		return -ESRCH;
>  	}
> -	if (oom_adjust < task->mm->oom_adj && !capable(CAP_SYS_RESOURCE)) {
> -		task_unlock(task);
> +
> +	if (oom_adjust < task->signal->oom_adj && !capable(CAP_SYS_RESOURCE)) {
> +		unlock_task_sighand(task, &flags);
>  		put_task_struct(task);
>  		return -EACCES;
>  	}
> -	task->mm->oom_adj = oom_adjust;
> -	task_unlock(task);
> +
> +	task->signal->oom_adj = oom_adjust;
> +	unlock_task_sighand(task, &flags);
>  	put_task_struct(task);
>  	if (end - buffer == 0)
>  		return -EIO;
> Index: b/include/linux/mm_types.h
> ===================================================================
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -240,8 +240,6 @@ struct mm_struct {
>  
>  	unsigned long saved_auxv[AT_VECTOR_SIZE]; /* for /proc/PID/auxv */
>  
> -	s8 oom_adj;	/* OOM kill score adjustment (bit shift) */
> -
>  	cpumask_t cpu_vm_mask;
>  
>  	/* Architecture-specific MM context */
> Index: b/include/linux/sched.h
> ===================================================================
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -629,6 +629,8 @@ struct signal_struct {
>  	unsigned audit_tty;
>  	struct tty_audit_buf *tty_audit_buf;
>  #endif
> +
> +	int oom_adj;	/* OOM kill score adjustment (bit shift) */
>  };
>  
>  /* Context switch must be unlocked if interrupts are to be enabled */
> Index: b/kernel/fork.c
> ===================================================================
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -426,7 +426,6 @@ static struct mm_struct * mm_init(struct
>  	init_rwsem(&mm->mmap_sem);
>  	INIT_LIST_HEAD(&mm->mmlist);
>  	mm->flags = (current->mm) ? current->mm->flags : default_dump_filter;
> -	mm->oom_adj = (current->mm) ? current->mm->oom_adj : 0;
>  	mm->core_state = NULL;
>  	mm->nr_ptes = 0;
>  	set_mm_counter(mm, file_rss, 0);
> @@ -868,6 +867,8 @@ static int copy_signal(unsigned long clo
>  
>  	tty_audit_fork(sig);
>  
> +	sig->oom_adj = current->signal->oom_adj;
> +
>  	return 0;
>  }
>  
> Index: b/mm/oom_kill.c
> ===================================================================
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -34,6 +34,31 @@ int sysctl_oom_dump_tasks;
>  static DEFINE_SPINLOCK(zone_scan_lock);
>  /* #define DEBUG */
>  
> +int get_oom_adj(struct task_struct *tsk)
> +{
> +	unsigned long flags;
> +	int oom_adj = OOM_DISABLE;
> +
> +	if (tsk->mm && lock_task_sighand(tsk, &flags)) {
> +		oom_adj = tsk->signal->oom_adj;
> +		unlock_task_sighand(tsk, &flags);
> +	}
> +
> +	return oom_adj;
> +}
> +
> +void set_oom_adj(struct task_struct *tsk, int oom_adj)
> +{
> +	unsigned long flags;
> +
> +	if (lock_task_sighand(tsk, &flags)) {
> +		tsk->signal->oom_adj = oom_adj;
> +		unlock_task_sighand(tsk, &flags);
> +	}
> +}
> +
> +
> +
>  /**
>   * badness - calculate a numeric value for how bad this task has been
>   * @p: task struct of which task we should calculate
> @@ -60,17 +85,16 @@ unsigned long badness(struct task_struct
>  	struct task_struct *child;
>  	int oom_adj;
>  
> +	oom_adj = get_oom_adj(p);
> +	if (oom_adj == OOM_DISABLE)
> +		return 0;
> +
>  	task_lock(p);
>  	mm = p->mm;
>  	if (!mm) {
>  		task_unlock(p);
>  		return 0;
>  	}
> -	oom_adj = mm->oom_adj;
> -	if (oom_adj == OOM_DISABLE) {
> -		task_unlock(p);
> -		return 0;
> -	}
>  
>  	/*
>  	 * The memory size of the process is the basis for the badness.
> @@ -283,6 +307,8 @@ static struct task_struct *select_bad_pr
>  static void dump_tasks(const struct mem_cgroup *mem)
>  {
>  	struct task_struct *g, *p;
> +	unsigned long flags;
> +	int oom_adj;
>  
>  	printk(KERN_INFO "[ pid ]   uid  tgid total_vm      rss cpu oom_adj "
>  	       "name\n");
> @@ -294,6 +320,12 @@ static void dump_tasks(const struct mem_
>  		if (!thread_group_leader(p))
>  			continue;
>  
> +		if (!lock_task_sighand(p, &flags))
> +			continue;
> +
> +		oom_adj = p->signal->oom_adj;
> +		unlock_task_sighand(p, &flags);
> +
>  		task_lock(p);
>  		mm = p->mm;
>  		if (!mm) {
> @@ -307,7 +339,7 @@ static void dump_tasks(const struct mem_
>  		}
>  		printk(KERN_INFO "[%5d] %5d %5d %8lu %8lu %3d     %3d %s\n",
>  		       p->pid, __task_cred(p)->uid, p->tgid, mm->total_vm,
> -		       get_mm_rss(mm), (int)task_cpu(p), mm->oom_adj, p->comm);
> +		       get_mm_rss(mm), (int)task_cpu(p), oom_adj, p->comm);
>  		task_unlock(p);
>  	} while_each_thread(g, p);
>  }
> @@ -345,16 +377,11 @@ static void __oom_kill_task(struct task_
>  
>  static int oom_kill_task(struct task_struct *p)
>  {
> -	struct mm_struct *mm;
>  	struct task_struct *g, *q;
>  
> -	task_lock(p);
> -	mm = p->mm;
> -	if (!mm || mm->oom_adj == OOM_DISABLE) {
> -		task_unlock(p);
> +	if (get_oom_adj(p) == OOM_DISABLE)
>  		return 1;
> -	}
> -	task_unlock(p);
> +
>  	__oom_kill_task(p, 1);
>  
>  	/*
> @@ -363,7 +390,7 @@ static int oom_kill_task(struct task_str
>  	 * to memory reserves though, otherwise we might deplete all memory.
>  	 */
>  	do_each_thread(g, q) {
> -		if (q->mm == mm && !same_thread_group(q, p))
> +		if (q->mm == p->mm && !same_thread_group(q, p))
>  			force_sig(SIGKILL, q);
>  	} while_each_thread(g, q);
>  
> @@ -377,11 +404,12 @@ static int oom_kill_process(struct task_
>  	struct task_struct *c;
>  
>  	if (printk_ratelimit()) {
> +		int oom_adj = get_oom_adj(current);
> +
>  		task_lock(current);
>  		printk(KERN_WARNING "%s invoked oom-killer: "
>  			"gfp_mask=0x%x, order=%d, oom_adj=%d\n",
> -			current->comm, gfp_mask, order,
> -			current->mm ? current->mm->oom_adj : OOM_DISABLE);
> +			current->comm, gfp_mask, order, oom_adj);
>  		cpuset_print_task_mems_allowed(current);
>  		task_unlock(current);
>  		dump_stack();
> Index: b/Documentation/filesystems/proc.txt
> ===================================================================
> --- a/Documentation/filesystems/proc.txt
> +++ b/Documentation/filesystems/proc.txt
> @@ -1168,12 +1168,11 @@ CHAPTER 3: PER-PROCESS PARAMETERS
>  ------------------------------------------------------
>  
>  This file can be used to adjust the score used to select which processes should
> -be killed in an out-of-memory situation.  The oom_adj value is a characteristic
> -of the task's mm, so all threads that share an mm with pid will have the same
> +be killed in an out-of-memory situation. All threads in the process will have the same
>  oom_adj value.  A high value will increase the likelihood of this process being
>  killed by the oom-killer.  Valid values are in the range -16 to +15 as
>  explained below and a special value of -17, which disables oom-killing
> -altogether for threads sharing pid's mm.
> +altogether the process.
>  
>  The process to be killed in an out-of-memory situation is selected among all others
>  based on its badness score. This value equals the original memory size of the process
> @@ -1187,8 +1186,8 @@ the parent's score if they do not share 
>  are the prime candidates to be killed. Having only one 'hungry' child will make
>  parent less preferable than the child.
>  
> -/proc/<pid>/oom_adj cannot be changed for kthreads since they are immune from
> -oom-killing already.
> +/proc/<pid>/oom_adj can be changed for kthreads, but it's meaningless. They are immune from
> +oom-killing.
>  
>  /proc/<pid>/oom_score shows process' current badness score.
>  
> 
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>


-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 1/4] oom: move oom_adj to signal_struct
@ 2009-08-05  0:45     ` Minchan Kim
  0 siblings, 0 replies; 50+ messages in thread
From: Minchan Kim @ 2009-08-05  0:45 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: LKML, Paul Menage, David Rientjes, KAMEZAWA Hiroyuki,
	Rik van Riel, Andrew Morton, Linus Torvalds, Oleg Nesterov,
	linux-mm


Hi, Kosaki. 

I am so late to invole this thread. 
But let me have a question. 

What's advantage of placing oom_adj in singal rather than task ?
I mean task->oom_adj and task->signal->oom_adj ?

I am sorry if you already discussed it at last threads. 

On Tue,  4 Aug 2009 19:25:57 +0900 (JST)
KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote:

> Subject: [PATCH] oom: move oom_adj to signal_struct
> 
> The commit 2ff05b2b (oom: move oom_adj value) move oom_adj value to mm_struct.
> It is very good first step for sanitize OOM.
> 
> However Paul Menage reported the commit makes regression to his job scheduler.
> Current OOM logic can kill OOM_DISABLED process.
> 
> Why? His program has the code of similar to the following.
> 
> 	...
> 	set_oom_adj(OOM_DISABLE); /* The job scheduler never killed by oom */
> 	...
> 	if (vfork() == 0) {
> 		set_oom_adj(0); /* Invoked child can be killed */
> 		execve("foo-bar-cmd")
> 	}
> 	....
> 
> vfork() parent and child are shared the same mm_struct. then above set_oom_adj(0) doesn't
> only change oom_adj for vfork() child, but also change oom_adj for vfork() parent.
> Then, vfork() parent (job scheduler) lost OOM immune and it was killed.
> 
> Actually, fork-setting-exec idiom is very frequently used in userland program. We must
> not break this assumption.
> 
> Therefore, this patch move oom_adj again. it will be moved signal_struct. it is
> truth per-process data place.
> 
> Plus, Restored writing to /proc/pid/oom_adj for a kernel thread return success again.
> changing return value might make confusing to shell-script.
> 
> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> Cc: Paul Menage <menage@google.com>
> Cc: David Rientjes <rientjes@google.com>
> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> Cc: Rik van Riel <riel@redhat.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> ---
>  Documentation/filesystems/proc.txt |    9 ++---
>  fs/proc/base.c                     |   33 +++++++++++---------
>  include/linux/mm_types.h           |    2 -
>  include/linux/sched.h              |    2 +
>  kernel/fork.c                      |    3 +
>  mm/oom_kill.c                      |   60 +++++++++++++++++++++++++++----------
>  6 files changed, 70 insertions(+), 39 deletions(-)
> 
> Index: b/fs/proc/base.c
> ===================================================================
> --- a/fs/proc/base.c
> +++ b/fs/proc/base.c
> @@ -1002,16 +1002,17 @@ static ssize_t oom_adjust_read(struct fi
>  	struct task_struct *task = get_proc_task(file->f_path.dentry->d_inode);
>  	char buffer[PROC_NUMBUF];
>  	size_t len;
> -	int oom_adjust;
> +	int oom_adjust = OOM_DISABLE;
> +	unsigned long flags;
>  
>  	if (!task)
>  		return -ESRCH;
> -	task_lock(task);
> -	if (task->mm)
> -		oom_adjust = task->mm->oom_adj;
> -	else
> -		oom_adjust = OOM_DISABLE;
> -	task_unlock(task);
> +
> +	if (lock_task_sighand(task, &flags)) {
> +		oom_adjust = task->signal->oom_adj;
> +		unlock_task_sighand(task, &flags);
> +	}
> +
>  	put_task_struct(task);
>  
>  	len = snprintf(buffer, sizeof(buffer), "%i\n", oom_adjust);
> @@ -1025,6 +1026,7 @@ static ssize_t oom_adjust_write(struct f
>  	struct task_struct *task;
>  	char buffer[PROC_NUMBUF], *end;
>  	int oom_adjust;
> +	unsigned long flags;
>  
>  	memset(buffer, 0, sizeof(buffer));
>  	if (count > sizeof(buffer) - 1)
> @@ -1040,19 +1042,20 @@ static ssize_t oom_adjust_write(struct f
>  	task = get_proc_task(file->f_path.dentry->d_inode);
>  	if (!task)
>  		return -ESRCH;
> -	task_lock(task);
> -	if (!task->mm) {
> -		task_unlock(task);
> +
> +	if (!lock_task_sighand(task, &flags)) {
>  		put_task_struct(task);
> -		return -EINVAL;
> +		return -ESRCH;
>  	}
> -	if (oom_adjust < task->mm->oom_adj && !capable(CAP_SYS_RESOURCE)) {
> -		task_unlock(task);
> +
> +	if (oom_adjust < task->signal->oom_adj && !capable(CAP_SYS_RESOURCE)) {
> +		unlock_task_sighand(task, &flags);
>  		put_task_struct(task);
>  		return -EACCES;
>  	}
> -	task->mm->oom_adj = oom_adjust;
> -	task_unlock(task);
> +
> +	task->signal->oom_adj = oom_adjust;
> +	unlock_task_sighand(task, &flags);
>  	put_task_struct(task);
>  	if (end - buffer == 0)
>  		return -EIO;
> Index: b/include/linux/mm_types.h
> ===================================================================
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -240,8 +240,6 @@ struct mm_struct {
>  
>  	unsigned long saved_auxv[AT_VECTOR_SIZE]; /* for /proc/PID/auxv */
>  
> -	s8 oom_adj;	/* OOM kill score adjustment (bit shift) */
> -
>  	cpumask_t cpu_vm_mask;
>  
>  	/* Architecture-specific MM context */
> Index: b/include/linux/sched.h
> ===================================================================
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -629,6 +629,8 @@ struct signal_struct {
>  	unsigned audit_tty;
>  	struct tty_audit_buf *tty_audit_buf;
>  #endif
> +
> +	int oom_adj;	/* OOM kill score adjustment (bit shift) */
>  };
>  
>  /* Context switch must be unlocked if interrupts are to be enabled */
> Index: b/kernel/fork.c
> ===================================================================
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -426,7 +426,6 @@ static struct mm_struct * mm_init(struct
>  	init_rwsem(&mm->mmap_sem);
>  	INIT_LIST_HEAD(&mm->mmlist);
>  	mm->flags = (current->mm) ? current->mm->flags : default_dump_filter;
> -	mm->oom_adj = (current->mm) ? current->mm->oom_adj : 0;
>  	mm->core_state = NULL;
>  	mm->nr_ptes = 0;
>  	set_mm_counter(mm, file_rss, 0);
> @@ -868,6 +867,8 @@ static int copy_signal(unsigned long clo
>  
>  	tty_audit_fork(sig);
>  
> +	sig->oom_adj = current->signal->oom_adj;
> +
>  	return 0;
>  }
>  
> Index: b/mm/oom_kill.c
> ===================================================================
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -34,6 +34,31 @@ int sysctl_oom_dump_tasks;
>  static DEFINE_SPINLOCK(zone_scan_lock);
>  /* #define DEBUG */
>  
> +int get_oom_adj(struct task_struct *tsk)
> +{
> +	unsigned long flags;
> +	int oom_adj = OOM_DISABLE;
> +
> +	if (tsk->mm && lock_task_sighand(tsk, &flags)) {
> +		oom_adj = tsk->signal->oom_adj;
> +		unlock_task_sighand(tsk, &flags);
> +	}
> +
> +	return oom_adj;
> +}
> +
> +void set_oom_adj(struct task_struct *tsk, int oom_adj)
> +{
> +	unsigned long flags;
> +
> +	if (lock_task_sighand(tsk, &flags)) {
> +		tsk->signal->oom_adj = oom_adj;
> +		unlock_task_sighand(tsk, &flags);
> +	}
> +}
> +
> +
> +
>  /**
>   * badness - calculate a numeric value for how bad this task has been
>   * @p: task struct of which task we should calculate
> @@ -60,17 +85,16 @@ unsigned long badness(struct task_struct
>  	struct task_struct *child;
>  	int oom_adj;
>  
> +	oom_adj = get_oom_adj(p);
> +	if (oom_adj == OOM_DISABLE)
> +		return 0;
> +
>  	task_lock(p);
>  	mm = p->mm;
>  	if (!mm) {
>  		task_unlock(p);
>  		return 0;
>  	}
> -	oom_adj = mm->oom_adj;
> -	if (oom_adj == OOM_DISABLE) {
> -		task_unlock(p);
> -		return 0;
> -	}
>  
>  	/*
>  	 * The memory size of the process is the basis for the badness.
> @@ -283,6 +307,8 @@ static struct task_struct *select_bad_pr
>  static void dump_tasks(const struct mem_cgroup *mem)
>  {
>  	struct task_struct *g, *p;
> +	unsigned long flags;
> +	int oom_adj;
>  
>  	printk(KERN_INFO "[ pid ]   uid  tgid total_vm      rss cpu oom_adj "
>  	       "name\n");
> @@ -294,6 +320,12 @@ static void dump_tasks(const struct mem_
>  		if (!thread_group_leader(p))
>  			continue;
>  
> +		if (!lock_task_sighand(p, &flags))
> +			continue;
> +
> +		oom_adj = p->signal->oom_adj;
> +		unlock_task_sighand(p, &flags);
> +
>  		task_lock(p);
>  		mm = p->mm;
>  		if (!mm) {
> @@ -307,7 +339,7 @@ static void dump_tasks(const struct mem_
>  		}
>  		printk(KERN_INFO "[%5d] %5d %5d %8lu %8lu %3d     %3d %s\n",
>  		       p->pid, __task_cred(p)->uid, p->tgid, mm->total_vm,
> -		       get_mm_rss(mm), (int)task_cpu(p), mm->oom_adj, p->comm);
> +		       get_mm_rss(mm), (int)task_cpu(p), oom_adj, p->comm);
>  		task_unlock(p);
>  	} while_each_thread(g, p);
>  }
> @@ -345,16 +377,11 @@ static void __oom_kill_task(struct task_
>  
>  static int oom_kill_task(struct task_struct *p)
>  {
> -	struct mm_struct *mm;
>  	struct task_struct *g, *q;
>  
> -	task_lock(p);
> -	mm = p->mm;
> -	if (!mm || mm->oom_adj == OOM_DISABLE) {
> -		task_unlock(p);
> +	if (get_oom_adj(p) == OOM_DISABLE)
>  		return 1;
> -	}
> -	task_unlock(p);
> +
>  	__oom_kill_task(p, 1);
>  
>  	/*
> @@ -363,7 +390,7 @@ static int oom_kill_task(struct task_str
>  	 * to memory reserves though, otherwise we might deplete all memory.
>  	 */
>  	do_each_thread(g, q) {
> -		if (q->mm == mm && !same_thread_group(q, p))
> +		if (q->mm == p->mm && !same_thread_group(q, p))
>  			force_sig(SIGKILL, q);
>  	} while_each_thread(g, q);
>  
> @@ -377,11 +404,12 @@ static int oom_kill_process(struct task_
>  	struct task_struct *c;
>  
>  	if (printk_ratelimit()) {
> +		int oom_adj = get_oom_adj(current);
> +
>  		task_lock(current);
>  		printk(KERN_WARNING "%s invoked oom-killer: "
>  			"gfp_mask=0x%x, order=%d, oom_adj=%d\n",
> -			current->comm, gfp_mask, order,
> -			current->mm ? current->mm->oom_adj : OOM_DISABLE);
> +			current->comm, gfp_mask, order, oom_adj);
>  		cpuset_print_task_mems_allowed(current);
>  		task_unlock(current);
>  		dump_stack();
> Index: b/Documentation/filesystems/proc.txt
> ===================================================================
> --- a/Documentation/filesystems/proc.txt
> +++ b/Documentation/filesystems/proc.txt
> @@ -1168,12 +1168,11 @@ CHAPTER 3: PER-PROCESS PARAMETERS
>  ------------------------------------------------------
>  
>  This file can be used to adjust the score used to select which processes should
> -be killed in an out-of-memory situation.  The oom_adj value is a characteristic
> -of the task's mm, so all threads that share an mm with pid will have the same
> +be killed in an out-of-memory situation. All threads in the process will have the same
>  oom_adj value.  A high value will increase the likelihood of this process being
>  killed by the oom-killer.  Valid values are in the range -16 to +15 as
>  explained below and a special value of -17, which disables oom-killing
> -altogether for threads sharing pid's mm.
> +altogether the process.
>  
>  The process to be killed in an out-of-memory situation is selected among all others
>  based on its badness score. This value equals the original memory size of the process
> @@ -1187,8 +1186,8 @@ the parent's score if they do not share 
>  are the prime candidates to be killed. Having only one 'hungry' child will make
>  parent less preferable than the child.
>  
> -/proc/<pid>/oom_adj cannot be changed for kthreads since they are immune from
> -oom-killing already.
> +/proc/<pid>/oom_adj can be changed for kthreads, but it's meaningless. They are immune from
> +oom-killing.
>  
>  /proc/<pid>/oom_score shows process' current badness score.
>  
> 
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>


-- 
Kind regards,
Minchan Kim

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 1/4] oom: move oom_adj to signal_struct
  2009-08-05  0:45     ` Minchan Kim
@ 2009-08-05  2:29       ` KOSAKI Motohiro
  -1 siblings, 0 replies; 50+ messages in thread
From: KOSAKI Motohiro @ 2009-08-05  2:29 UTC (permalink / raw)
  To: Minchan Kim
  Cc: kosaki.motohiro, LKML, Paul Menage, David Rientjes,
	KAMEZAWA Hiroyuki, Rik van Riel, Andrew Morton, Linus Torvalds,
	Oleg Nesterov, linux-mm

Hi

> Hi, Kosaki. 
> 
> I am so late to invole this thread. 
> But let me have a question. 
> 
> What's advantage of placing oom_adj in singal rather than task ?
> I mean task->oom_adj and task->signal->oom_adj ?
> 
> I am sorry if you already discussed it at last threads. 

Not sorry. that's very good question.

I'm trying to explain the detailed intention of commit 2ff05b2b4eac
(move oom_adj to mm_struct).

In 2.6.30, OOM logic callflow is here.

__out_of_memory
	select_bad_process		for each task
		badness			calculate badness of one task
	oom_kill_process		search child
		oom_kill_task		kill target task and mm shared tasks with it

example, process-A have two thread, thread-A and thread-B and it 
have very fat memory.
And, each thread have following likes oom property.

	thread-A: oom_adj = OOM_DISABLE, oom_score = 0
	thread-B: oom_adj = 0,           oom_score = very-high

Then, select_bad_process() select thread-B, but oom_kill_task refuse
kill the task because thread-A have OOM_DISABLE.
__out_of_memory() call select_bad_process() again. but select_bad_process()
select the same task. It mean kernel fall in the livelock.

The fact is, select_bad_process() must select killable task. otherwise
OOM logic go into livelock.

Is this enough explanation? thanks.





^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 1/4] oom: move oom_adj to signal_struct
@ 2009-08-05  2:29       ` KOSAKI Motohiro
  0 siblings, 0 replies; 50+ messages in thread
From: KOSAKI Motohiro @ 2009-08-05  2:29 UTC (permalink / raw)
  To: Minchan Kim
  Cc: kosaki.motohiro, LKML, Paul Menage, David Rientjes,
	KAMEZAWA Hiroyuki, Rik van Riel, Andrew Morton, Linus Torvalds,
	Oleg Nesterov, linux-mm

Hi

> Hi, Kosaki. 
> 
> I am so late to invole this thread. 
> But let me have a question. 
> 
> What's advantage of placing oom_adj in singal rather than task ?
> I mean task->oom_adj and task->signal->oom_adj ?
> 
> I am sorry if you already discussed it at last threads. 

Not sorry. that's very good question.

I'm trying to explain the detailed intention of commit 2ff05b2b4eac
(move oom_adj to mm_struct).

In 2.6.30, OOM logic callflow is here.

__out_of_memory
	select_bad_process		for each task
		badness			calculate badness of one task
	oom_kill_process		search child
		oom_kill_task		kill target task and mm shared tasks with it

example, process-A have two thread, thread-A and thread-B and it 
have very fat memory.
And, each thread have following likes oom property.

	thread-A: oom_adj = OOM_DISABLE, oom_score = 0
	thread-B: oom_adj = 0,           oom_score = very-high

Then, select_bad_process() select thread-B, but oom_kill_task refuse
kill the task because thread-A have OOM_DISABLE.
__out_of_memory() call select_bad_process() again. but select_bad_process()
select the same task. It mean kernel fall in the livelock.

The fact is, select_bad_process() must select killable task. otherwise
OOM logic go into livelock.

Is this enough explanation? thanks.




--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 1/4] oom: move oom_adj to signal_struct
  2009-08-05  2:29       ` KOSAKI Motohiro
@ 2009-08-05  2:40         ` Minchan Kim
  -1 siblings, 0 replies; 50+ messages in thread
From: Minchan Kim @ 2009-08-05  2:40 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Minchan Kim, LKML, Paul Menage, David Rientjes,
	KAMEZAWA Hiroyuki, Rik van Riel, Andrew Morton, Linus Torvalds,
	Oleg Nesterov, linux-mm

On Wed,  5 Aug 2009 11:29:34 +0900 (JST)
KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote:

> Hi
> 
> > Hi, Kosaki. 
> > 
> > I am so late to invole this thread. 
> > But let me have a question. 
> > 
> > What's advantage of placing oom_adj in singal rather than task ?
> > I mean task->oom_adj and task->signal->oom_adj ?
> > 
> > I am sorry if you already discussed it at last threads. 
> 
> Not sorry. that's very good question.
> 
> I'm trying to explain the detailed intention of commit 2ff05b2b4eac
> (move oom_adj to mm_struct).
> 
> In 2.6.30, OOM logic callflow is here.
> 
> __out_of_memory
> 	select_bad_process		for each task
> 		badness			calculate badness of one task
> 	oom_kill_process		search child
> 		oom_kill_task		kill target task and mm shared tasks with it
> 
> example, process-A have two thread, thread-A and thread-B and it 
> have very fat memory.
> And, each thread have following likes oom property.
> 
> 	thread-A: oom_adj = OOM_DISABLE, oom_score = 0
> 	thread-B: oom_adj = 0,           oom_score = very-high
> 
> Then, select_bad_process() select thread-B, but oom_kill_task refuse
> kill the task because thread-A have OOM_DISABLE.
> __out_of_memory() call select_bad_process() again. but select_bad_process()
> select the same task. It mean kernel fall in the livelock.
> 
> The fact is, select_bad_process() must select killable task. otherwise
> OOM logic go into livelock.
> 
> Is this enough explanation? thanks.
> 

Thanks for good explanation. :)

It resulted from patch of David which moved task_struct->oom_ajd 
to mm_struct. I understood it. 

It meant oom_adj was not per-process.

AFAIU, you want to make oom_adj per-process, again. 
And you selected the place with task->singal as per-process.

What I have a question is that why do you select task_struct->signal 
rather than task_struct like old?

What's benefit of using task_struct->signal ?


-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 1/4] oom: move oom_adj to signal_struct
@ 2009-08-05  2:40         ` Minchan Kim
  0 siblings, 0 replies; 50+ messages in thread
From: Minchan Kim @ 2009-08-05  2:40 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Minchan Kim, LKML, Paul Menage, David Rientjes,
	KAMEZAWA Hiroyuki, Rik van Riel, Andrew Morton, Linus Torvalds,
	Oleg Nesterov, linux-mm

On Wed,  5 Aug 2009 11:29:34 +0900 (JST)
KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote:

> Hi
> 
> > Hi, Kosaki. 
> > 
> > I am so late to invole this thread. 
> > But let me have a question. 
> > 
> > What's advantage of placing oom_adj in singal rather than task ?
> > I mean task->oom_adj and task->signal->oom_adj ?
> > 
> > I am sorry if you already discussed it at last threads. 
> 
> Not sorry. that's very good question.
> 
> I'm trying to explain the detailed intention of commit 2ff05b2b4eac
> (move oom_adj to mm_struct).
> 
> In 2.6.30, OOM logic callflow is here.
> 
> __out_of_memory
> 	select_bad_process		for each task
> 		badness			calculate badness of one task
> 	oom_kill_process		search child
> 		oom_kill_task		kill target task and mm shared tasks with it
> 
> example, process-A have two thread, thread-A and thread-B and it 
> have very fat memory.
> And, each thread have following likes oom property.
> 
> 	thread-A: oom_adj = OOM_DISABLE, oom_score = 0
> 	thread-B: oom_adj = 0,           oom_score = very-high
> 
> Then, select_bad_process() select thread-B, but oom_kill_task refuse
> kill the task because thread-A have OOM_DISABLE.
> __out_of_memory() call select_bad_process() again. but select_bad_process()
> select the same task. It mean kernel fall in the livelock.
> 
> The fact is, select_bad_process() must select killable task. otherwise
> OOM logic go into livelock.
> 
> Is this enough explanation? thanks.
> 

Thanks for good explanation. :)

It resulted from patch of David which moved task_struct->oom_ajd 
to mm_struct. I understood it. 

It meant oom_adj was not per-process.

AFAIU, you want to make oom_adj per-process, again. 
And you selected the place with task->singal as per-process.

What I have a question is that why do you select task_struct->signal 
rather than task_struct like old?

What's benefit of using task_struct->signal ?


-- 
Kind regards,
Minchan Kim

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 1/4] oom: move oom_adj to signal_struct
  2009-08-05  2:40         ` Minchan Kim
@ 2009-08-05  2:51           ` KOSAKI Motohiro
  -1 siblings, 0 replies; 50+ messages in thread
From: KOSAKI Motohiro @ 2009-08-05  2:51 UTC (permalink / raw)
  To: Minchan Kim
  Cc: kosaki.motohiro, LKML, Paul Menage, David Rientjes,
	KAMEZAWA Hiroyuki, Rik van Riel, Andrew Morton, Linus Torvalds,
	Oleg Nesterov, linux-mm

> On Wed,  5 Aug 2009 11:29:34 +0900 (JST)
> KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote:
> 
> > Hi
> > 
> > > Hi, Kosaki. 
> > > 
> > > I am so late to invole this thread. 
> > > But let me have a question. 
> > > 
> > > What's advantage of placing oom_adj in singal rather than task ?
> > > I mean task->oom_adj and task->signal->oom_adj ?
> > > 
> > > I am sorry if you already discussed it at last threads. 
> > 
> > Not sorry. that's very good question.
> > 
> > I'm trying to explain the detailed intention of commit 2ff05b2b4eac
> > (move oom_adj to mm_struct).
> > 
> > In 2.6.30, OOM logic callflow is here.
> > 
> > __out_of_memory
> > 	select_bad_process		for each task
> > 		badness			calculate badness of one task
> > 	oom_kill_process		search child
> > 		oom_kill_task		kill target task and mm shared tasks with it
> > 
> > example, process-A have two thread, thread-A and thread-B and it 
> > have very fat memory.
> > And, each thread have following likes oom property.
> > 
> > 	thread-A: oom_adj = OOM_DISABLE, oom_score = 0
> > 	thread-B: oom_adj = 0,           oom_score = very-high
> > 
> > Then, select_bad_process() select thread-B, but oom_kill_task refuse
> > kill the task because thread-A have OOM_DISABLE.
> > __out_of_memory() call select_bad_process() again. but select_bad_process()
> > select the same task. It mean kernel fall in the livelock.
> > 
> > The fact is, select_bad_process() must select killable task. otherwise
> > OOM logic go into livelock.
> > 
> > Is this enough explanation? thanks.
> > 
> 
> Thanks for good explanation. :)
> 
> It resulted from patch of David which moved task_struct->oom_ajd 
> to mm_struct. I understood it. 

No. It's very old problem. David's patch fixed it. 
It mean per-process oom_adj prevent select_bad_process() return
a task in unkillable process.

unfortunatelly, his patch can't treat vfork case ideally. I hope to
fix it.

> It meant oom_adj was not per-process.
> 
> AFAIU, you want to make oom_adj per-process, again. 
> And you selected the place with task->singal as per-process.
> 
> What I have a question is that why do you select task_struct->signal 
> rather than task_struct like old?
> 
> What's benefit of using task_struct->signal ?

prior Davied patch (task->oom_adj) might makes livelock.




^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 1/4] oom: move oom_adj to signal_struct
@ 2009-08-05  2:51           ` KOSAKI Motohiro
  0 siblings, 0 replies; 50+ messages in thread
From: KOSAKI Motohiro @ 2009-08-05  2:51 UTC (permalink / raw)
  To: Minchan Kim
  Cc: kosaki.motohiro, LKML, Paul Menage, David Rientjes,
	KAMEZAWA Hiroyuki, Rik van Riel, Andrew Morton, Linus Torvalds,
	Oleg Nesterov, linux-mm

> On Wed,  5 Aug 2009 11:29:34 +0900 (JST)
> KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote:
> 
> > Hi
> > 
> > > Hi, Kosaki. 
> > > 
> > > I am so late to invole this thread. 
> > > But let me have a question. 
> > > 
> > > What's advantage of placing oom_adj in singal rather than task ?
> > > I mean task->oom_adj and task->signal->oom_adj ?
> > > 
> > > I am sorry if you already discussed it at last threads. 
> > 
> > Not sorry. that's very good question.
> > 
> > I'm trying to explain the detailed intention of commit 2ff05b2b4eac
> > (move oom_adj to mm_struct).
> > 
> > In 2.6.30, OOM logic callflow is here.
> > 
> > __out_of_memory
> > 	select_bad_process		for each task
> > 		badness			calculate badness of one task
> > 	oom_kill_process		search child
> > 		oom_kill_task		kill target task and mm shared tasks with it
> > 
> > example, process-A have two thread, thread-A and thread-B and it 
> > have very fat memory.
> > And, each thread have following likes oom property.
> > 
> > 	thread-A: oom_adj = OOM_DISABLE, oom_score = 0
> > 	thread-B: oom_adj = 0,           oom_score = very-high
> > 
> > Then, select_bad_process() select thread-B, but oom_kill_task refuse
> > kill the task because thread-A have OOM_DISABLE.
> > __out_of_memory() call select_bad_process() again. but select_bad_process()
> > select the same task. It mean kernel fall in the livelock.
> > 
> > The fact is, select_bad_process() must select killable task. otherwise
> > OOM logic go into livelock.
> > 
> > Is this enough explanation? thanks.
> > 
> 
> Thanks for good explanation. :)
> 
> It resulted from patch of David which moved task_struct->oom_ajd 
> to mm_struct. I understood it. 

No. It's very old problem. David's patch fixed it. 
It mean per-process oom_adj prevent select_bad_process() return
a task in unkillable process.

unfortunatelly, his patch can't treat vfork case ideally. I hope to
fix it.

> It meant oom_adj was not per-process.
> 
> AFAIU, you want to make oom_adj per-process, again. 
> And you selected the place with task->singal as per-process.
> 
> What I have a question is that why do you select task_struct->signal 
> rather than task_struct like old?
> 
> What's benefit of using task_struct->signal ?

prior Davied patch (task->oom_adj) might makes livelock.



--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 1/4] oom: move oom_adj to signal_struct
  2009-08-05  2:51           ` KOSAKI Motohiro
@ 2009-08-05  5:55             ` Minchan Kim
  -1 siblings, 0 replies; 50+ messages in thread
From: Minchan Kim @ 2009-08-05  5:55 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Minchan Kim, LKML, Paul Menage, David Rientjes,
	KAMEZAWA Hiroyuki, Rik van Riel, Andrew Morton, Linus Torvalds,
	Oleg Nesterov, linux-mm

On Wed,  5 Aug 2009 11:51:31 +0900 (JST)
KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote:

> > On Wed,  5 Aug 2009 11:29:34 +0900 (JST)
> > KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote:
> > 
> > > Hi
> > > 
> > > > Hi, Kosaki. 
> > > > 
> > > > I am so late to invole this thread. 
> > > > But let me have a question. 
> > > > 
> > > > What's advantage of placing oom_adj in singal rather than task ?
> > > > I mean task->oom_adj and task->signal->oom_adj ?
> > > > 
> > > > I am sorry if you already discussed it at last threads. 
> > > 
> > > Not sorry. that's very good question.
> > > 
> > > I'm trying to explain the detailed intention of commit 2ff05b2b4eac
> > > (move oom_adj to mm_struct).
> > > 
> > > In 2.6.30, OOM logic callflow is here.
> > > 
> > > __out_of_memory
> > > 	select_bad_process		for each task
> > > 		badness			calculate badness of one task
> > > 	oom_kill_process		search child
> > > 		oom_kill_task		kill target task and mm shared tasks with it
> > > 
> > > example, process-A have two thread, thread-A and thread-B and it 
> > > have very fat memory.
> > > And, each thread have following likes oom property.
> > > 
> > > 	thread-A: oom_adj = OOM_DISABLE, oom_score = 0
> > > 	thread-B: oom_adj = 0,           oom_score = very-high
> > > 
> > > Then, select_bad_process() select thread-B, but oom_kill_task refuse
> > > kill the task because thread-A have OOM_DISABLE.
> > > __out_of_memory() call select_bad_process() again. but select_bad_process()
> > > select the same task. It mean kernel fall in the livelock.
> > > 
> > > The fact is, select_bad_process() must select killable task. otherwise
> > > OOM logic go into livelock.
> > > 
> > > Is this enough explanation? thanks.
> > > 

The problem resulted from David patch.
It can solve live lock problem but make a new problem like vfork problem. 
I think both can be solved by different approach. 

It's just RFC. 

If some process is selected by OOM killer but it have a child of OOM immune,
We just decrease point of process. It can affect selection of bad process. 
After some trial, at last bad score is drastically low and another process is 
selected by OOM killer. So I think Live lock don't happen. 

New variable adding in task struct is rather high cost. 
But i think we can union it with oomkilladj 
since oomkilladj is used to present just -17 ~ 15. 

What do you think about this approach ?

----

This is based on 2.6.30 which is kernel before applying David Patch. 

diff --git a/include/linux/sched.h b/include/linux/sched.h
index b4c38bc..6e195f7 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1150,6 +1150,11 @@ struct task_struct {
         */
        unsigned char fpu_counter;
        s8 oomkilladj; /* OOM kill score adjustment (bit shift). */
+       /*
+        * If OOM kill happens at one process repeately, 
+        * oom_sacle_down will be increased to prevent OOM live lock 
+        */
+       unsigned int oom_scale_down;
 #ifdef CONFIG_BLK_DEV_IO_TRACE
        unsigned int btrace_seq;
 #endif
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index a7b2460..3592786 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -159,6 +159,11 @@ unsigned long badness(struct task_struct *p, unsigned long uptime)
                        points >>= -(p->oomkilladj);
        }
 
+       /*
+        * adjust the score by number of OOM kill retrial
+        */
+       points >>= p->oom_scale_down;
+
 #ifdef DEBUG
        printk(KERN_DEBUG "OOMkill: task %d (%s) got %lu points\n",
        p->pid, p->comm, points);
@@ -367,8 +372,10 @@ static int oom_kill_task(struct task_struct *p)
         * Don't kill the process if any threads are set to OOM_DISABLE
         */
        do_each_thread(g, q) {
-               if (q->mm == mm && q->oomkilladj == OOM_DISABLE)
+               if (q->mm == mm && q->oomkilladj == OOM_DISABLE) {
+                       p->oom_scale_down++;
                        return 1;
+               }
        } while_each_thread(g, q);
 
        __oom_kill_task(p, 1);



-- 
Kind regards,
Minchan Kim

^ permalink raw reply related	[flat|nested] 50+ messages in thread

* Re: [PATCH 1/4] oom: move oom_adj to signal_struct
@ 2009-08-05  5:55             ` Minchan Kim
  0 siblings, 0 replies; 50+ messages in thread
From: Minchan Kim @ 2009-08-05  5:55 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Minchan Kim, LKML, Paul Menage, David Rientjes,
	KAMEZAWA Hiroyuki, Rik van Riel, Andrew Morton, Linus Torvalds,
	Oleg Nesterov, linux-mm

On Wed,  5 Aug 2009 11:51:31 +0900 (JST)
KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote:

> > On Wed,  5 Aug 2009 11:29:34 +0900 (JST)
> > KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote:
> > 
> > > Hi
> > > 
> > > > Hi, Kosaki. 
> > > > 
> > > > I am so late to invole this thread. 
> > > > But let me have a question. 
> > > > 
> > > > What's advantage of placing oom_adj in singal rather than task ?
> > > > I mean task->oom_adj and task->signal->oom_adj ?
> > > > 
> > > > I am sorry if you already discussed it at last threads. 
> > > 
> > > Not sorry. that's very good question.
> > > 
> > > I'm trying to explain the detailed intention of commit 2ff05b2b4eac
> > > (move oom_adj to mm_struct).
> > > 
> > > In 2.6.30, OOM logic callflow is here.
> > > 
> > > __out_of_memory
> > > 	select_bad_process		for each task
> > > 		badness			calculate badness of one task
> > > 	oom_kill_process		search child
> > > 		oom_kill_task		kill target task and mm shared tasks with it
> > > 
> > > example, process-A have two thread, thread-A and thread-B and it 
> > > have very fat memory.
> > > And, each thread have following likes oom property.
> > > 
> > > 	thread-A: oom_adj = OOM_DISABLE, oom_score = 0
> > > 	thread-B: oom_adj = 0,           oom_score = very-high
> > > 
> > > Then, select_bad_process() select thread-B, but oom_kill_task refuse
> > > kill the task because thread-A have OOM_DISABLE.
> > > __out_of_memory() call select_bad_process() again. but select_bad_process()
> > > select the same task. It mean kernel fall in the livelock.
> > > 
> > > The fact is, select_bad_process() must select killable task. otherwise
> > > OOM logic go into livelock.
> > > 
> > > Is this enough explanation? thanks.
> > > 

The problem resulted from David patch.
It can solve live lock problem but make a new problem like vfork problem. 
I think both can be solved by different approach. 

It's just RFC. 

If some process is selected by OOM killer but it have a child of OOM immune,
We just decrease point of process. It can affect selection of bad process. 
After some trial, at last bad score is drastically low and another process is 
selected by OOM killer. So I think Live lock don't happen. 

New variable adding in task struct is rather high cost. 
But i think we can union it with oomkilladj 
since oomkilladj is used to present just -17 ~ 15. 

What do you think about this approach ?

----

This is based on 2.6.30 which is kernel before applying David Patch. 

diff --git a/include/linux/sched.h b/include/linux/sched.h
index b4c38bc..6e195f7 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1150,6 +1150,11 @@ struct task_struct {
         */
        unsigned char fpu_counter;
        s8 oomkilladj; /* OOM kill score adjustment (bit shift). */
+       /*
+        * If OOM kill happens at one process repeately, 
+        * oom_sacle_down will be increased to prevent OOM live lock 
+        */
+       unsigned int oom_scale_down;
 #ifdef CONFIG_BLK_DEV_IO_TRACE
        unsigned int btrace_seq;
 #endif
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index a7b2460..3592786 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -159,6 +159,11 @@ unsigned long badness(struct task_struct *p, unsigned long uptime)
                        points >>= -(p->oomkilladj);
        }
 
+       /*
+        * adjust the score by number of OOM kill retrial
+        */
+       points >>= p->oom_scale_down;
+
 #ifdef DEBUG
        printk(KERN_DEBUG "OOMkill: task %d (%s) got %lu points\n",
        p->pid, p->comm, points);
@@ -367,8 +372,10 @@ static int oom_kill_task(struct task_struct *p)
         * Don't kill the process if any threads are set to OOM_DISABLE
         */
        do_each_thread(g, q) {
-               if (q->mm == mm && q->oomkilladj == OOM_DISABLE)
+               if (q->mm == mm && q->oomkilladj == OOM_DISABLE) {
+                       p->oom_scale_down++;
                        return 1;
+               }
        } while_each_thread(g, q);
 
        __oom_kill_task(p, 1);



-- 
Kind regards,
Minchan Kim

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 50+ messages in thread

* Re: [PATCH 1/4] oom: move oom_adj to signal_struct
  2009-08-05  5:55             ` Minchan Kim
@ 2009-08-05  6:03               ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 50+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-08-05  6:03 UTC (permalink / raw)
  To: Minchan Kim
  Cc: KOSAKI Motohiro, LKML, Paul Menage, David Rientjes, Rik van Riel,
	Andrew Morton, Linus Torvalds, Oleg Nesterov, linux-mm

On Wed, 5 Aug 2009 14:55:16 +0900
Minchan Kim <minchan.kim@gmail.com> wrote:

> On Wed,  5 Aug 2009 11:51:31 +0900 (JST)
> KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote:
> 
> > > On Wed,  5 Aug 2009 11:29:34 +0900 (JST)
> > > KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote:
> > > 
> > > > Hi
> > > > 
> > > > > Hi, Kosaki. 
> > > > > 
> > > > > I am so late to invole this thread. 
> > > > > But let me have a question. 
> > > > > 
> > > > > What's advantage of placing oom_adj in singal rather than task ?
> > > > > I mean task->oom_adj and task->signal->oom_adj ?
> > > > > 
> > > > > I am sorry if you already discussed it at last threads. 
> > > > 
> > > > Not sorry. that's very good question.
> > > > 
> > > > I'm trying to explain the detailed intention of commit 2ff05b2b4eac
> > > > (move oom_adj to mm_struct).
> > > > 
> > > > In 2.6.30, OOM logic callflow is here.
> > > > 
> > > > __out_of_memory
> > > > 	select_bad_process		for each task
> > > > 		badness			calculate badness of one task
> > > > 	oom_kill_process		search child
> > > > 		oom_kill_task		kill target task and mm shared tasks with it
> > > > 
> > > > example, process-A have two thread, thread-A and thread-B and it 
> > > > have very fat memory.
> > > > And, each thread have following likes oom property.
> > > > 
> > > > 	thread-A: oom_adj = OOM_DISABLE, oom_score = 0
> > > > 	thread-B: oom_adj = 0,           oom_score = very-high
> > > > 
> > > > Then, select_bad_process() select thread-B, but oom_kill_task refuse
> > > > kill the task because thread-A have OOM_DISABLE.
> > > > __out_of_memory() call select_bad_process() again. but select_bad_process()
> > > > select the same task. It mean kernel fall in the livelock.
> > > > 
> > > > The fact is, select_bad_process() must select killable task. otherwise
> > > > OOM logic go into livelock.
> > > > 
> > > > Is this enough explanation? thanks.
> > > > 
> 
> The problem resulted from David patch.
> It can solve live lock problem but make a new problem like vfork problem. 
> I think both can be solved by different approach. 
> 
> It's just RFC. 
> 
> If some process is selected by OOM killer but it have a child of OOM immune,
> We just decrease point of process. It can affect selection of bad process. 
> After some trial, at last bad score is drastically low and another process is 
> selected by OOM killer. So I think Live lock don't happen. 
> 
> New variable adding in task struct is rather high cost. 
> But i think we can union it with oomkilladj 
> since oomkilladj is used to present just -17 ~ 15. 
> 
> What do you think about this approach ?
> 
keeping this in "task" struct is troublesome.
It may not livelock but near-to-livelock state, in bad case.

After applying Kosaki's , oom_kill will use
"for_each_process()" instead of "do_each_thread", I think it's a way to go.

But, yes, your "scale_down" idea itself is interesitng.
Then, hmm, merging two of yours ?

Thanks,
-Kame



> ----
> 
> This is based on 2.6.30 which is kernel before applying David Patch. 
> 
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index b4c38bc..6e195f7 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1150,6 +1150,11 @@ struct task_struct {
>          */
>         unsigned char fpu_counter;
>         s8 oomkilladj; /* OOM kill score adjustment (bit shift). */
> +       /*
> +        * If OOM kill happens at one process repeately, 
> +        * oom_sacle_down will be increased to prevent OOM live lock 
> +        */
> +       unsigned int oom_scale_down;
>  #ifdef CONFIG_BLK_DEV_IO_TRACE
>         unsigned int btrace_seq;
>  #endif
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index a7b2460..3592786 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -159,6 +159,11 @@ unsigned long badness(struct task_struct *p, unsigned long uptime)
>                         points >>= -(p->oomkilladj);
>         }
>  
> +       /*
> +        * adjust the score by number of OOM kill retrial
> +        */
> +       points >>= p->oom_scale_down;
> +
>  #ifdef DEBUG
>         printk(KERN_DEBUG "OOMkill: task %d (%s) got %lu points\n",
>         p->pid, p->comm, points);
> @@ -367,8 +372,10 @@ static int oom_kill_task(struct task_struct *p)
>          * Don't kill the process if any threads are set to OOM_DISABLE
>          */
>         do_each_thread(g, q) {
> -               if (q->mm == mm && q->oomkilladj == OOM_DISABLE)
> +               if (q->mm == mm && q->oomkilladj == OOM_DISABLE) {
> +                       p->oom_scale_down++;
>                         return 1;
> +               }
>         } while_each_thread(g, q);
>  
>         __oom_kill_task(p, 1);
> 
> 
> 
> -- 
> Kind regards,
> Minchan Kim
> 


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 1/4] oom: move oom_adj to signal_struct
@ 2009-08-05  6:03               ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 50+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-08-05  6:03 UTC (permalink / raw)
  To: Minchan Kim
  Cc: KOSAKI Motohiro, LKML, Paul Menage, David Rientjes, Rik van Riel,
	Andrew Morton, Linus Torvalds, Oleg Nesterov, linux-mm

On Wed, 5 Aug 2009 14:55:16 +0900
Minchan Kim <minchan.kim@gmail.com> wrote:

> On Wed,  5 Aug 2009 11:51:31 +0900 (JST)
> KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote:
> 
> > > On Wed,  5 Aug 2009 11:29:34 +0900 (JST)
> > > KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote:
> > > 
> > > > Hi
> > > > 
> > > > > Hi, Kosaki. 
> > > > > 
> > > > > I am so late to invole this thread. 
> > > > > But let me have a question. 
> > > > > 
> > > > > What's advantage of placing oom_adj in singal rather than task ?
> > > > > I mean task->oom_adj and task->signal->oom_adj ?
> > > > > 
> > > > > I am sorry if you already discussed it at last threads. 
> > > > 
> > > > Not sorry. that's very good question.
> > > > 
> > > > I'm trying to explain the detailed intention of commit 2ff05b2b4eac
> > > > (move oom_adj to mm_struct).
> > > > 
> > > > In 2.6.30, OOM logic callflow is here.
> > > > 
> > > > __out_of_memory
> > > > 	select_bad_process		for each task
> > > > 		badness			calculate badness of one task
> > > > 	oom_kill_process		search child
> > > > 		oom_kill_task		kill target task and mm shared tasks with it
> > > > 
> > > > example, process-A have two thread, thread-A and thread-B and it 
> > > > have very fat memory.
> > > > And, each thread have following likes oom property.
> > > > 
> > > > 	thread-A: oom_adj = OOM_DISABLE, oom_score = 0
> > > > 	thread-B: oom_adj = 0,           oom_score = very-high
> > > > 
> > > > Then, select_bad_process() select thread-B, but oom_kill_task refuse
> > > > kill the task because thread-A have OOM_DISABLE.
> > > > __out_of_memory() call select_bad_process() again. but select_bad_process()
> > > > select the same task. It mean kernel fall in the livelock.
> > > > 
> > > > The fact is, select_bad_process() must select killable task. otherwise
> > > > OOM logic go into livelock.
> > > > 
> > > > Is this enough explanation? thanks.
> > > > 
> 
> The problem resulted from David patch.
> It can solve live lock problem but make a new problem like vfork problem. 
> I think both can be solved by different approach. 
> 
> It's just RFC. 
> 
> If some process is selected by OOM killer but it have a child of OOM immune,
> We just decrease point of process. It can affect selection of bad process. 
> After some trial, at last bad score is drastically low and another process is 
> selected by OOM killer. So I think Live lock don't happen. 
> 
> New variable adding in task struct is rather high cost. 
> But i think we can union it with oomkilladj 
> since oomkilladj is used to present just -17 ~ 15. 
> 
> What do you think about this approach ?
> 
keeping this in "task" struct is troublesome.
It may not livelock but near-to-livelock state, in bad case.

After applying Kosaki's , oom_kill will use
"for_each_process()" instead of "do_each_thread", I think it's a way to go.

But, yes, your "scale_down" idea itself is interesitng.
Then, hmm, merging two of yours ?

Thanks,
-Kame



> ----
> 
> This is based on 2.6.30 which is kernel before applying David Patch. 
> 
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index b4c38bc..6e195f7 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1150,6 +1150,11 @@ struct task_struct {
>          */
>         unsigned char fpu_counter;
>         s8 oomkilladj; /* OOM kill score adjustment (bit shift). */
> +       /*
> +        * If OOM kill happens at one process repeately, 
> +        * oom_sacle_down will be increased to prevent OOM live lock 
> +        */
> +       unsigned int oom_scale_down;
>  #ifdef CONFIG_BLK_DEV_IO_TRACE
>         unsigned int btrace_seq;
>  #endif
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index a7b2460..3592786 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -159,6 +159,11 @@ unsigned long badness(struct task_struct *p, unsigned long uptime)
>                         points >>= -(p->oomkilladj);
>         }
>  
> +       /*
> +        * adjust the score by number of OOM kill retrial
> +        */
> +       points >>= p->oom_scale_down;
> +
>  #ifdef DEBUG
>         printk(KERN_DEBUG "OOMkill: task %d (%s) got %lu points\n",
>         p->pid, p->comm, points);
> @@ -367,8 +372,10 @@ static int oom_kill_task(struct task_struct *p)
>          * Don't kill the process if any threads are set to OOM_DISABLE
>          */
>         do_each_thread(g, q) {
> -               if (q->mm == mm && q->oomkilladj == OOM_DISABLE)
> +               if (q->mm == mm && q->oomkilladj == OOM_DISABLE) {
> +                       p->oom_scale_down++;
>                         return 1;
> +               }
>         } while_each_thread(g, q);
>  
>         __oom_kill_task(p, 1);
> 
> 
> 
> -- 
> Kind regards,
> Minchan Kim
> 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 1/4] oom: move oom_adj to signal_struct
  2009-08-05  5:55             ` Minchan Kim
@ 2009-08-05  6:04               ` KOSAKI Motohiro
  -1 siblings, 0 replies; 50+ messages in thread
From: KOSAKI Motohiro @ 2009-08-05  6:04 UTC (permalink / raw)
  To: Minchan Kim
  Cc: kosaki.motohiro, LKML, Paul Menage, David Rientjes,
	KAMEZAWA Hiroyuki, Rik van Riel, Andrew Morton, Linus Torvalds,
	Oleg Nesterov, linux-mm

> On Wed,  5 Aug 2009 11:51:31 +0900 (JST)
> KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote:
> 
> > > On Wed,  5 Aug 2009 11:29:34 +0900 (JST)
> > > KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote:
> > > 
> > > > Hi
> > > > 
> > > > > Hi, Kosaki. 
> > > > > 
> > > > > I am so late to invole this thread. 
> > > > > But let me have a question. 
> > > > > 
> > > > > What's advantage of placing oom_adj in singal rather than task ?
> > > > > I mean task->oom_adj and task->signal->oom_adj ?
> > > > > 
> > > > > I am sorry if you already discussed it at last threads. 
> > > > 
> > > > Not sorry. that's very good question.
> > > > 
> > > > I'm trying to explain the detailed intention of commit 2ff05b2b4eac
> > > > (move oom_adj to mm_struct).
> > > > 
> > > > In 2.6.30, OOM logic callflow is here.
> > > > 
> > > > __out_of_memory
> > > > 	select_bad_process		for each task
> > > > 		badness			calculate badness of one task
> > > > 	oom_kill_process		search child
> > > > 		oom_kill_task		kill target task and mm shared tasks with it
> > > > 
> > > > example, process-A have two thread, thread-A and thread-B and it 
> > > > have very fat memory.
> > > > And, each thread have following likes oom property.
> > > > 
> > > > 	thread-A: oom_adj = OOM_DISABLE, oom_score = 0
> > > > 	thread-B: oom_adj = 0,           oom_score = very-high
> > > > 
> > > > Then, select_bad_process() select thread-B, but oom_kill_task refuse
> > > > kill the task because thread-A have OOM_DISABLE.
> > > > __out_of_memory() call select_bad_process() again. but select_bad_process()
> > > > select the same task. It mean kernel fall in the livelock.
> > > > 
> > > > The fact is, select_bad_process() must select killable task. otherwise
> > > > OOM logic go into livelock.
> > > > 
> > > > Is this enough explanation? thanks.
> > > > 
> 
> The problem resulted from David patch.
> It can solve live lock problem but make a new problem like vfork problem. 
> I think both can be solved by different approach. 
> 
> It's just RFC. 
> 
> If some process is selected by OOM killer but it have a child of OOM immune,
> We just decrease point of process. It can affect selection of bad process. 
> After some trial, at last bad score is drastically low and another process is 
> selected by OOM killer. So I think Live lock don't happen. 
> 
> New variable adding in task struct is rather high cost. 
> But i think we can union it with oomkilladj 
> since oomkilladj is used to present just -17 ~ 15. 
> 
> What do you think about this approach ?

I can ack this. but please re-initialize oom_scale_down at fork and
exec time.
currently oom_scale_down makes too big affect.

and, May I ask which you hate my approach? 

> 
> ----
> 
> This is based on 2.6.30 which is kernel before applying David Patch. 
> 
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index b4c38bc..6e195f7 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1150,6 +1150,11 @@ struct task_struct {
>          */
>         unsigned char fpu_counter;
>         s8 oomkilladj; /* OOM kill score adjustment (bit shift). */
> +       /*
> +        * If OOM kill happens at one process repeately, 
> +        * oom_sacle_down will be increased to prevent OOM live lock 
> +        */
> +       unsigned int oom_scale_down;
>  #ifdef CONFIG_BLK_DEV_IO_TRACE
>         unsigned int btrace_seq;
>  #endif
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index a7b2460..3592786 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -159,6 +159,11 @@ unsigned long badness(struct task_struct *p, unsigned long uptime)
>                         points >>= -(p->oomkilladj);
>         }
>  
> +       /*
> +        * adjust the score by number of OOM kill retrial
> +        */
> +       points >>= p->oom_scale_down;
> +
>  #ifdef DEBUG
>         printk(KERN_DEBUG "OOMkill: task %d (%s) got %lu points\n",
>         p->pid, p->comm, points);
> @@ -367,8 +372,10 @@ static int oom_kill_task(struct task_struct *p)
>          * Don't kill the process if any threads are set to OOM_DISABLE
>          */
>         do_each_thread(g, q) {
> -               if (q->mm == mm && q->oomkilladj == OOM_DISABLE)
> +               if (q->mm == mm && q->oomkilladj == OOM_DISABLE) {
> +                       p->oom_scale_down++;
>                         return 1;
> +               }
>         } while_each_thread(g, q);
>  
>         __oom_kill_task(p, 1);
> 
> 
> 
> -- 
> Kind regards,
> Minchan Kim
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/




^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 1/4] oom: move oom_adj to signal_struct
@ 2009-08-05  6:04               ` KOSAKI Motohiro
  0 siblings, 0 replies; 50+ messages in thread
From: KOSAKI Motohiro @ 2009-08-05  6:04 UTC (permalink / raw)
  To: Minchan Kim
  Cc: kosaki.motohiro, LKML, Paul Menage, David Rientjes,
	KAMEZAWA Hiroyuki, Rik van Riel, Andrew Morton, Linus Torvalds,
	Oleg Nesterov, linux-mm

> On Wed,  5 Aug 2009 11:51:31 +0900 (JST)
> KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote:
> 
> > > On Wed,  5 Aug 2009 11:29:34 +0900 (JST)
> > > KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote:
> > > 
> > > > Hi
> > > > 
> > > > > Hi, Kosaki. 
> > > > > 
> > > > > I am so late to invole this thread. 
> > > > > But let me have a question. 
> > > > > 
> > > > > What's advantage of placing oom_adj in singal rather than task ?
> > > > > I mean task->oom_adj and task->signal->oom_adj ?
> > > > > 
> > > > > I am sorry if you already discussed it at last threads. 
> > > > 
> > > > Not sorry. that's very good question.
> > > > 
> > > > I'm trying to explain the detailed intention of commit 2ff05b2b4eac
> > > > (move oom_adj to mm_struct).
> > > > 
> > > > In 2.6.30, OOM logic callflow is here.
> > > > 
> > > > __out_of_memory
> > > > 	select_bad_process		for each task
> > > > 		badness			calculate badness of one task
> > > > 	oom_kill_process		search child
> > > > 		oom_kill_task		kill target task and mm shared tasks with it
> > > > 
> > > > example, process-A have two thread, thread-A and thread-B and it 
> > > > have very fat memory.
> > > > And, each thread have following likes oom property.
> > > > 
> > > > 	thread-A: oom_adj = OOM_DISABLE, oom_score = 0
> > > > 	thread-B: oom_adj = 0,           oom_score = very-high
> > > > 
> > > > Then, select_bad_process() select thread-B, but oom_kill_task refuse
> > > > kill the task because thread-A have OOM_DISABLE.
> > > > __out_of_memory() call select_bad_process() again. but select_bad_process()
> > > > select the same task. It mean kernel fall in the livelock.
> > > > 
> > > > The fact is, select_bad_process() must select killable task. otherwise
> > > > OOM logic go into livelock.
> > > > 
> > > > Is this enough explanation? thanks.
> > > > 
> 
> The problem resulted from David patch.
> It can solve live lock problem but make a new problem like vfork problem. 
> I think both can be solved by different approach. 
> 
> It's just RFC. 
> 
> If some process is selected by OOM killer but it have a child of OOM immune,
> We just decrease point of process. It can affect selection of bad process. 
> After some trial, at last bad score is drastically low and another process is 
> selected by OOM killer. So I think Live lock don't happen. 
> 
> New variable adding in task struct is rather high cost. 
> But i think we can union it with oomkilladj 
> since oomkilladj is used to present just -17 ~ 15. 
> 
> What do you think about this approach ?

I can ack this. but please re-initialize oom_scale_down at fork and
exec time.
currently oom_scale_down makes too big affect.

and, May I ask which you hate my approach? 

> 
> ----
> 
> This is based on 2.6.30 which is kernel before applying David Patch. 
> 
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index b4c38bc..6e195f7 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1150,6 +1150,11 @@ struct task_struct {
>          */
>         unsigned char fpu_counter;
>         s8 oomkilladj; /* OOM kill score adjustment (bit shift). */
> +       /*
> +        * If OOM kill happens at one process repeately, 
> +        * oom_sacle_down will be increased to prevent OOM live lock 
> +        */
> +       unsigned int oom_scale_down;
>  #ifdef CONFIG_BLK_DEV_IO_TRACE
>         unsigned int btrace_seq;
>  #endif
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index a7b2460..3592786 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -159,6 +159,11 @@ unsigned long badness(struct task_struct *p, unsigned long uptime)
>                         points >>= -(p->oomkilladj);
>         }
>  
> +       /*
> +        * adjust the score by number of OOM kill retrial
> +        */
> +       points >>= p->oom_scale_down;
> +
>  #ifdef DEBUG
>         printk(KERN_DEBUG "OOMkill: task %d (%s) got %lu points\n",
>         p->pid, p->comm, points);
> @@ -367,8 +372,10 @@ static int oom_kill_task(struct task_struct *p)
>          * Don't kill the process if any threads are set to OOM_DISABLE
>          */
>         do_each_thread(g, q) {
> -               if (q->mm == mm && q->oomkilladj == OOM_DISABLE)
> +               if (q->mm == mm && q->oomkilladj == OOM_DISABLE) {
> +                       p->oom_scale_down++;
>                         return 1;
> +               }
>         } while_each_thread(g, q);
>  
>         __oom_kill_task(p, 1);
> 
> 
> 
> -- 
> Kind regards,
> Minchan Kim
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/



--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 1/4] oom: move oom_adj to signal_struct
  2009-08-05  6:04               ` KOSAKI Motohiro
@ 2009-08-05  6:29                 ` Minchan Kim
  -1 siblings, 0 replies; 50+ messages in thread
From: Minchan Kim @ 2009-08-05  6:29 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Minchan Kim, LKML, Paul Menage, David Rientjes,
	KAMEZAWA Hiroyuki, Rik van Riel, Andrew Morton, Linus Torvalds,
	Oleg Nesterov, linux-mm

On Wed,  5 Aug 2009 15:04:48 +0900 (JST)
KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote:

> > On Wed,  5 Aug 2009 11:51:31 +0900 (JST)
> > KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote:
> > 
> > > > On Wed,  5 Aug 2009 11:29:34 +0900 (JST)
> > > > KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote:
> > > > 
> > > > > Hi
> > > > > 
> > > > > > Hi, Kosaki. 
> > > > > > 
> > > > > > I am so late to invole this thread. 
> > > > > > But let me have a question. 
> > > > > > 
> > > > > > What's advantage of placing oom_adj in singal rather than task ?
> > > > > > I mean task->oom_adj and task->signal->oom_adj ?
> > > > > > 
> > > > > > I am sorry if you already discussed it at last threads. 
> > > > > 
> > > > > Not sorry. that's very good question.
> > > > > 
> > > > > I'm trying to explain the detailed intention of commit 2ff05b2b4eac
> > > > > (move oom_adj to mm_struct).
> > > > > 
> > > > > In 2.6.30, OOM logic callflow is here.
> > > > > 
> > > > > __out_of_memory
> > > > > 	select_bad_process		for each task
> > > > > 		badness			calculate badness of one task
> > > > > 	oom_kill_process		search child
> > > > > 		oom_kill_task		kill target task and mm shared tasks with it
> > > > > 
> > > > > example, process-A have two thread, thread-A and thread-B and it 
> > > > > have very fat memory.
> > > > > And, each thread have following likes oom property.
> > > > > 
> > > > > 	thread-A: oom_adj = OOM_DISABLE, oom_score = 0
> > > > > 	thread-B: oom_adj = 0,           oom_score = very-high
> > > > > 
> > > > > Then, select_bad_process() select thread-B, but oom_kill_task refuse
> > > > > kill the task because thread-A have OOM_DISABLE.
> > > > > __out_of_memory() call select_bad_process() again. but select_bad_process()
> > > > > select the same task. It mean kernel fall in the livelock.
> > > > > 
> > > > > The fact is, select_bad_process() must select killable task. otherwise
> > > > > OOM logic go into livelock.
> > > > > 
> > > > > Is this enough explanation? thanks.
> > > > > 
> > 
> > The problem resulted from David patch.
> > It can solve live lock problem but make a new problem like vfork problem. 
> > I think both can be solved by different approach. 
> > 
> > It's just RFC. 
> > 
> > If some process is selected by OOM killer but it have a child of OOM immune,
> > We just decrease point of process. It can affect selection of bad process. 
> > After some trial, at last bad score is drastically low and another process is 
> > selected by OOM killer. So I think Live lock don't happen. 
> > 
> > New variable adding in task struct is rather high cost. 
> > But i think we can union it with oomkilladj 
> > since oomkilladj is used to present just -17 ~ 15. 
> > 
> > What do you think about this approach ?
> 
> I can ack this. but please re-initialize oom_scale_down at fork and
> exec time.
> currently oom_scale_down makes too big affect.


Thanks for carefult review. 
In fact, I didn't care of it 
since it just is RFC for making sure my idea. :)

> and, May I ask which you hate my approach? 
> 

Not at all. I never hate your approach. 
This problem resulted form David's original patch.
I thought if we will fix live lock with different approach, we can remove much pain.

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 1/4] oom: move oom_adj to signal_struct
@ 2009-08-05  6:29                 ` Minchan Kim
  0 siblings, 0 replies; 50+ messages in thread
From: Minchan Kim @ 2009-08-05  6:29 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Minchan Kim, LKML, Paul Menage, David Rientjes,
	KAMEZAWA Hiroyuki, Rik van Riel, Andrew Morton, Linus Torvalds,
	Oleg Nesterov, linux-mm

On Wed,  5 Aug 2009 15:04:48 +0900 (JST)
KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote:

> > On Wed,  5 Aug 2009 11:51:31 +0900 (JST)
> > KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote:
> > 
> > > > On Wed,  5 Aug 2009 11:29:34 +0900 (JST)
> > > > KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote:
> > > > 
> > > > > Hi
> > > > > 
> > > > > > Hi, Kosaki. 
> > > > > > 
> > > > > > I am so late to invole this thread. 
> > > > > > But let me have a question. 
> > > > > > 
> > > > > > What's advantage of placing oom_adj in singal rather than task ?
> > > > > > I mean task->oom_adj and task->signal->oom_adj ?
> > > > > > 
> > > > > > I am sorry if you already discussed it at last threads. 
> > > > > 
> > > > > Not sorry. that's very good question.
> > > > > 
> > > > > I'm trying to explain the detailed intention of commit 2ff05b2b4eac
> > > > > (move oom_adj to mm_struct).
> > > > > 
> > > > > In 2.6.30, OOM logic callflow is here.
> > > > > 
> > > > > __out_of_memory
> > > > > 	select_bad_process		for each task
> > > > > 		badness			calculate badness of one task
> > > > > 	oom_kill_process		search child
> > > > > 		oom_kill_task		kill target task and mm shared tasks with it
> > > > > 
> > > > > example, process-A have two thread, thread-A and thread-B and it 
> > > > > have very fat memory.
> > > > > And, each thread have following likes oom property.
> > > > > 
> > > > > 	thread-A: oom_adj = OOM_DISABLE, oom_score = 0
> > > > > 	thread-B: oom_adj = 0,           oom_score = very-high
> > > > > 
> > > > > Then, select_bad_process() select thread-B, but oom_kill_task refuse
> > > > > kill the task because thread-A have OOM_DISABLE.
> > > > > __out_of_memory() call select_bad_process() again. but select_bad_process()
> > > > > select the same task. It mean kernel fall in the livelock.
> > > > > 
> > > > > The fact is, select_bad_process() must select killable task. otherwise
> > > > > OOM logic go into livelock.
> > > > > 
> > > > > Is this enough explanation? thanks.
> > > > > 
> > 
> > The problem resulted from David patch.
> > It can solve live lock problem but make a new problem like vfork problem. 
> > I think both can be solved by different approach. 
> > 
> > It's just RFC. 
> > 
> > If some process is selected by OOM killer but it have a child of OOM immune,
> > We just decrease point of process. It can affect selection of bad process. 
> > After some trial, at last bad score is drastically low and another process is 
> > selected by OOM killer. So I think Live lock don't happen. 
> > 
> > New variable adding in task struct is rather high cost. 
> > But i think we can union it with oomkilladj 
> > since oomkilladj is used to present just -17 ~ 15. 
> > 
> > What do you think about this approach ?
> 
> I can ack this. but please re-initialize oom_scale_down at fork and
> exec time.
> currently oom_scale_down makes too big affect.


Thanks for carefult review. 
In fact, I didn't care of it 
since it just is RFC for making sure my idea. :)

> and, May I ask which you hate my approach? 
> 

Not at all. I never hate your approach. 
This problem resulted form David's original patch.
I thought if we will fix live lock with different approach, we can remove much pain.

-- 
Kind regards,
Minchan Kim

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 1/4] oom: move oom_adj to signal_struct
  2009-08-05  6:03               ` KAMEZAWA Hiroyuki
@ 2009-08-05  6:37                 ` Minchan Kim
  -1 siblings, 0 replies; 50+ messages in thread
From: Minchan Kim @ 2009-08-05  6:37 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Minchan Kim, KOSAKI Motohiro, LKML, Paul Menage, David Rientjes,
	Rik van Riel, Andrew Morton, Linus Torvalds, Oleg Nesterov,
	linux-mm

On Wed, 5 Aug 2009 15:03:23 +0900
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:

> On Wed, 5 Aug 2009 14:55:16 +0900
> Minchan Kim <minchan.kim@gmail.com> wrote:
> 
> > On Wed,  5 Aug 2009 11:51:31 +0900 (JST)
> > KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote:
> > 
> > > > On Wed,  5 Aug 2009 11:29:34 +0900 (JST)
> > > > KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote:
> > > > 
> > > > > Hi
> > > > > 
> > > > > > Hi, Kosaki. 
> > > > > > 
> > > > > > I am so late to invole this thread. 
> > > > > > But let me have a question. 
> > > > > > 
> > > > > > What's advantage of placing oom_adj in singal rather than task ?
> > > > > > I mean task->oom_adj and task->signal->oom_adj ?
> > > > > > 
> > > > > > I am sorry if you already discussed it at last threads. 
> > > > > 
> > > > > Not sorry. that's very good question.
> > > > > 
> > > > > I'm trying to explain the detailed intention of commit 2ff05b2b4eac
> > > > > (move oom_adj to mm_struct).
> > > > > 
> > > > > In 2.6.30, OOM logic callflow is here.
> > > > > 
> > > > > __out_of_memory
> > > > > 	select_bad_process		for each task
> > > > > 		badness			calculate badness of one task
> > > > > 	oom_kill_process		search child
> > > > > 		oom_kill_task		kill target task and mm shared tasks with it
> > > > > 
> > > > > example, process-A have two thread, thread-A and thread-B and it 
> > > > > have very fat memory.
> > > > > And, each thread have following likes oom property.
> > > > > 
> > > > > 	thread-A: oom_adj = OOM_DISABLE, oom_score = 0
> > > > > 	thread-B: oom_adj = 0,           oom_score = very-high
> > > > > 
> > > > > Then, select_bad_process() select thread-B, but oom_kill_task refuse
> > > > > kill the task because thread-A have OOM_DISABLE.
> > > > > __out_of_memory() call select_bad_process() again. but select_bad_process()
> > > > > select the same task. It mean kernel fall in the livelock.
> > > > > 
> > > > > The fact is, select_bad_process() must select killable task. otherwise
> > > > > OOM logic go into livelock.
> > > > > 
> > > > > Is this enough explanation? thanks.
> > > > > 
> > 
> > The problem resulted from David patch.
> > It can solve live lock problem but make a new problem like vfork problem. 
> > I think both can be solved by different approach. 
> > 
> > It's just RFC. 
> > 
> > If some process is selected by OOM killer but it have a child of OOM immune,
> > We just decrease point of process. It can affect selection of bad process. 
> > After some trial, at last bad score is drastically low and another process is 
> > selected by OOM killer. So I think Live lock don't happen. 
> > 
> > New variable adding in task struct is rather high cost. 
> > But i think we can union it with oomkilladj 
> > since oomkilladj is used to present just -17 ~ 15. 
> > 
> > What do you think about this approach ?
> > 
> keeping this in "task" struct is troublesome.
> It may not livelock but near-to-livelock state, in bad case.

Hmm. I can't understand why it is troublesome. 
I think it's related to moving oom_adj to singal_struct. 
Unfortunately, I can't understand why we have to put oom_adj 
in singal_struct?

That's why I have a question to Kosaki a while ago. 
I can't understand it still. :-(

Could you elaborate it ?

> After applying Kosaki's , oom_kill will use
> "for_each_process()" instead of "do_each_thread", I think it's a way to go.

I didn't review kosaki's approach entirely. 
After reviewing, let's discuss it, again. 

> But, yes, your "scale_down" idea itself is interesitng.
> Then, hmm, merging two of yours ?

If it is possible, I will do so. 

Thnaks for good comment, kame.

> Thanks,
> -Kame
> 
> 
> 
> > ----
> > 
> > This is based on 2.6.30 which is kernel before applying David Patch. 
> > 
> > diff --git a/include/linux/sched.h b/include/linux/sched.h
> > index b4c38bc..6e195f7 100644
> > --- a/include/linux/sched.h
> > +++ b/include/linux/sched.h
> > @@ -1150,6 +1150,11 @@ struct task_struct {
> >          */
> >         unsigned char fpu_counter;
> >         s8 oomkilladj; /* OOM kill score adjustment (bit shift). */
> > +       /*
> > +        * If OOM kill happens at one process repeately, 
> > +        * oom_sacle_down will be increased to prevent OOM live lock 
> > +        */
> > +       unsigned int oom_scale_down;
> >  #ifdef CONFIG_BLK_DEV_IO_TRACE
> >         unsigned int btrace_seq;
> >  #endif
> > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > index a7b2460..3592786 100644
> > --- a/mm/oom_kill.c
> > +++ b/mm/oom_kill.c
> > @@ -159,6 +159,11 @@ unsigned long badness(struct task_struct *p, unsigned long uptime)
> >                         points >>= -(p->oomkilladj);
> >         }
> >  
> > +       /*
> > +        * adjust the score by number of OOM kill retrial
> > +        */
> > +       points >>= p->oom_scale_down;
> > +
> >  #ifdef DEBUG
> >         printk(KERN_DEBUG "OOMkill: task %d (%s) got %lu points\n",
> >         p->pid, p->comm, points);
> > @@ -367,8 +372,10 @@ static int oom_kill_task(struct task_struct *p)
> >          * Don't kill the process if any threads are set to OOM_DISABLE
> >          */
> >         do_each_thread(g, q) {
> > -               if (q->mm == mm && q->oomkilladj == OOM_DISABLE)
> > +               if (q->mm == mm && q->oomkilladj == OOM_DISABLE) {
> > +                       p->oom_scale_down++;
> >                         return 1;
> > +               }
> >         } while_each_thread(g, q);
> >  
> >         __oom_kill_task(p, 1);
> > 
> > 
> > 
> > -- 
> > Kind regards,
> > Minchan Kim
> > 
> 


-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 1/4] oom: move oom_adj to signal_struct
@ 2009-08-05  6:37                 ` Minchan Kim
  0 siblings, 0 replies; 50+ messages in thread
From: Minchan Kim @ 2009-08-05  6:37 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Minchan Kim, KOSAKI Motohiro, LKML, Paul Menage, David Rientjes,
	Rik van Riel, Andrew Morton, Linus Torvalds, Oleg Nesterov,
	linux-mm

On Wed, 5 Aug 2009 15:03:23 +0900
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:

> On Wed, 5 Aug 2009 14:55:16 +0900
> Minchan Kim <minchan.kim@gmail.com> wrote:
> 
> > On Wed,  5 Aug 2009 11:51:31 +0900 (JST)
> > KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote:
> > 
> > > > On Wed,  5 Aug 2009 11:29:34 +0900 (JST)
> > > > KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote:
> > > > 
> > > > > Hi
> > > > > 
> > > > > > Hi, Kosaki. 
> > > > > > 
> > > > > > I am so late to invole this thread. 
> > > > > > But let me have a question. 
> > > > > > 
> > > > > > What's advantage of placing oom_adj in singal rather than task ?
> > > > > > I mean task->oom_adj and task->signal->oom_adj ?
> > > > > > 
> > > > > > I am sorry if you already discussed it at last threads. 
> > > > > 
> > > > > Not sorry. that's very good question.
> > > > > 
> > > > > I'm trying to explain the detailed intention of commit 2ff05b2b4eac
> > > > > (move oom_adj to mm_struct).
> > > > > 
> > > > > In 2.6.30, OOM logic callflow is here.
> > > > > 
> > > > > __out_of_memory
> > > > > 	select_bad_process		for each task
> > > > > 		badness			calculate badness of one task
> > > > > 	oom_kill_process		search child
> > > > > 		oom_kill_task		kill target task and mm shared tasks with it
> > > > > 
> > > > > example, process-A have two thread, thread-A and thread-B and it 
> > > > > have very fat memory.
> > > > > And, each thread have following likes oom property.
> > > > > 
> > > > > 	thread-A: oom_adj = OOM_DISABLE, oom_score = 0
> > > > > 	thread-B: oom_adj = 0,           oom_score = very-high
> > > > > 
> > > > > Then, select_bad_process() select thread-B, but oom_kill_task refuse
> > > > > kill the task because thread-A have OOM_DISABLE.
> > > > > __out_of_memory() call select_bad_process() again. but select_bad_process()
> > > > > select the same task. It mean kernel fall in the livelock.
> > > > > 
> > > > > The fact is, select_bad_process() must select killable task. otherwise
> > > > > OOM logic go into livelock.
> > > > > 
> > > > > Is this enough explanation? thanks.
> > > > > 
> > 
> > The problem resulted from David patch.
> > It can solve live lock problem but make a new problem like vfork problem. 
> > I think both can be solved by different approach. 
> > 
> > It's just RFC. 
> > 
> > If some process is selected by OOM killer but it have a child of OOM immune,
> > We just decrease point of process. It can affect selection of bad process. 
> > After some trial, at last bad score is drastically low and another process is 
> > selected by OOM killer. So I think Live lock don't happen. 
> > 
> > New variable adding in task struct is rather high cost. 
> > But i think we can union it with oomkilladj 
> > since oomkilladj is used to present just -17 ~ 15. 
> > 
> > What do you think about this approach ?
> > 
> keeping this in "task" struct is troublesome.
> It may not livelock but near-to-livelock state, in bad case.

Hmm. I can't understand why it is troublesome. 
I think it's related to moving oom_adj to singal_struct. 
Unfortunately, I can't understand why we have to put oom_adj 
in singal_struct?

That's why I have a question to Kosaki a while ago. 
I can't understand it still. :-(

Could you elaborate it ?

> After applying Kosaki's , oom_kill will use
> "for_each_process()" instead of "do_each_thread", I think it's a way to go.

I didn't review kosaki's approach entirely. 
After reviewing, let's discuss it, again. 

> But, yes, your "scale_down" idea itself is interesitng.
> Then, hmm, merging two of yours ?

If it is possible, I will do so. 

Thnaks for good comment, kame.

> Thanks,
> -Kame
> 
> 
> 
> > ----
> > 
> > This is based on 2.6.30 which is kernel before applying David Patch. 
> > 
> > diff --git a/include/linux/sched.h b/include/linux/sched.h
> > index b4c38bc..6e195f7 100644
> > --- a/include/linux/sched.h
> > +++ b/include/linux/sched.h
> > @@ -1150,6 +1150,11 @@ struct task_struct {
> >          */
> >         unsigned char fpu_counter;
> >         s8 oomkilladj; /* OOM kill score adjustment (bit shift). */
> > +       /*
> > +        * If OOM kill happens at one process repeately, 
> > +        * oom_sacle_down will be increased to prevent OOM live lock 
> > +        */
> > +       unsigned int oom_scale_down;
> >  #ifdef CONFIG_BLK_DEV_IO_TRACE
> >         unsigned int btrace_seq;
> >  #endif
> > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > index a7b2460..3592786 100644
> > --- a/mm/oom_kill.c
> > +++ b/mm/oom_kill.c
> > @@ -159,6 +159,11 @@ unsigned long badness(struct task_struct *p, unsigned long uptime)
> >                         points >>= -(p->oomkilladj);
> >         }
> >  
> > +       /*
> > +        * adjust the score by number of OOM kill retrial
> > +        */
> > +       points >>= p->oom_scale_down;
> > +
> >  #ifdef DEBUG
> >         printk(KERN_DEBUG "OOMkill: task %d (%s) got %lu points\n",
> >         p->pid, p->comm, points);
> > @@ -367,8 +372,10 @@ static int oom_kill_task(struct task_struct *p)
> >          * Don't kill the process if any threads are set to OOM_DISABLE
> >          */
> >         do_each_thread(g, q) {
> > -               if (q->mm == mm && q->oomkilladj == OOM_DISABLE)
> > +               if (q->mm == mm && q->oomkilladj == OOM_DISABLE) {
> > +                       p->oom_scale_down++;
> >                         return 1;
> > +               }
> >         } while_each_thread(g, q);
> >  
> >         __oom_kill_task(p, 1);
> > 
> > 
> > 
> > -- 
> > Kind regards,
> > Minchan Kim
> > 
> 


-- 
Kind regards,
Minchan Kim

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 1/4] oom: move oom_adj to signal_struct
  2009-08-05  6:29                 ` Minchan Kim
@ 2009-08-05  6:47                   ` KOSAKI Motohiro
  -1 siblings, 0 replies; 50+ messages in thread
From: KOSAKI Motohiro @ 2009-08-05  6:47 UTC (permalink / raw)
  To: Minchan Kim
  Cc: kosaki.motohiro, LKML, Paul Menage, David Rientjes,
	KAMEZAWA Hiroyuki, Rik van Riel, Andrew Morton, Linus Torvalds,
	Oleg Nesterov, linux-mm

> > > What do you think about this approach ?
> > 
> > I can ack this. but please re-initialize oom_scale_down at fork and
> > exec time.
> > currently oom_scale_down makes too big affect.
> 
> 
> Thanks for carefult review. 
> In fact, I didn't care of it 
> since it just is RFC for making sure my idea. :)

ok, I see.

> > and, May I ask which you hate my approach? 
> 
> Not at all. I never hate your approach. 
> This problem resulted form David's original patch.
> I thought if we will fix live lock with different approach, we can remove much pain.

I also think your approach is enough acceptable.

ok, Let's wait one night and to hear other developer's opinion.
We can choice more lkml preferred approach :)




^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 1/4] oom: move oom_adj to signal_struct
@ 2009-08-05  6:47                   ` KOSAKI Motohiro
  0 siblings, 0 replies; 50+ messages in thread
From: KOSAKI Motohiro @ 2009-08-05  6:47 UTC (permalink / raw)
  To: Minchan Kim
  Cc: kosaki.motohiro, LKML, Paul Menage, David Rientjes,
	KAMEZAWA Hiroyuki, Rik van Riel, Andrew Morton, Linus Torvalds,
	Oleg Nesterov, linux-mm

> > > What do you think about this approach ?
> > 
> > I can ack this. but please re-initialize oom_scale_down at fork and
> > exec time.
> > currently oom_scale_down makes too big affect.
> 
> 
> Thanks for carefult review. 
> In fact, I didn't care of it 
> since it just is RFC for making sure my idea. :)

ok, I see.

> > and, May I ask which you hate my approach? 
> 
> Not at all. I never hate your approach. 
> This problem resulted form David's original patch.
> I thought if we will fix live lock with different approach, we can remove much pain.

I also think your approach is enough acceptable.

ok, Let's wait one night and to hear other developer's opinion.
We can choice more lkml preferred approach :)



--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 1/4] oom: move oom_adj to signal_struct
  2009-08-05  6:37                 ` Minchan Kim
@ 2009-08-05  6:53                   ` KOSAKI Motohiro
  -1 siblings, 0 replies; 50+ messages in thread
From: KOSAKI Motohiro @ 2009-08-05  6:53 UTC (permalink / raw)
  To: Minchan Kim
  Cc: kosaki.motohiro, KAMEZAWA Hiroyuki, LKML, Paul Menage,
	David Rientjes, Rik van Riel, Andrew Morton, Linus Torvalds,
	Oleg Nesterov, linux-mm

> On Wed, 5 Aug 2009 15:03:23 +0900
> KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> 
> > On Wed, 5 Aug 2009 14:55:16 +0900
> > Minchan Kim <minchan.kim@gmail.com> wrote:
> > 
> > > On Wed,  5 Aug 2009 11:51:31 +0900 (JST)
> > > KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote:
> > > 
> > > > > On Wed,  5 Aug 2009 11:29:34 +0900 (JST)
> > > > > KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote:
> > > > > 
> > > > > > Hi
> > > > > > 
> > > > > > > Hi, Kosaki. 
> > > > > > > 
> > > > > > > I am so late to invole this thread. 
> > > > > > > But let me have a question. 
> > > > > > > 
> > > > > > > What's advantage of placing oom_adj in singal rather than task ?
> > > > > > > I mean task->oom_adj and task->signal->oom_adj ?
> > > > > > > 
> > > > > > > I am sorry if you already discussed it at last threads. 
> > > > > > 
> > > > > > Not sorry. that's very good question.
> > > > > > 
> > > > > > I'm trying to explain the detailed intention of commit 2ff05b2b4eac
> > > > > > (move oom_adj to mm_struct).
> > > > > > 
> > > > > > In 2.6.30, OOM logic callflow is here.
> > > > > > 
> > > > > > __out_of_memory
> > > > > > 	select_bad_process		for each task
> > > > > > 		badness			calculate badness of one task
> > > > > > 	oom_kill_process		search child
> > > > > > 		oom_kill_task		kill target task and mm shared tasks with it
> > > > > > 
> > > > > > example, process-A have two thread, thread-A and thread-B and it 
> > > > > > have very fat memory.
> > > > > > And, each thread have following likes oom property.
> > > > > > 
> > > > > > 	thread-A: oom_adj = OOM_DISABLE, oom_score = 0
> > > > > > 	thread-B: oom_adj = 0,           oom_score = very-high
> > > > > > 
> > > > > > Then, select_bad_process() select thread-B, but oom_kill_task refuse
> > > > > > kill the task because thread-A have OOM_DISABLE.
> > > > > > __out_of_memory() call select_bad_process() again. but select_bad_process()
> > > > > > select the same task. It mean kernel fall in the livelock.
> > > > > > 
> > > > > > The fact is, select_bad_process() must select killable task. otherwise
> > > > > > OOM logic go into livelock.
> > > > > > 
> > > > > > Is this enough explanation? thanks.
> > > > > > 
> > > 
> > > The problem resulted from David patch.
> > > It can solve live lock problem but make a new problem like vfork problem. 
> > > I think both can be solved by different approach. 
> > > 
> > > It's just RFC. 
> > > 
> > > If some process is selected by OOM killer but it have a child of OOM immune,
> > > We just decrease point of process. It can affect selection of bad process. 
> > > After some trial, at last bad score is drastically low and another process is 
> > > selected by OOM killer. So I think Live lock don't happen. 
> > > 
> > > New variable adding in task struct is rather high cost. 
> > > But i think we can union it with oomkilladj 
> > > since oomkilladj is used to present just -17 ~ 15. 
> > > 
> > > What do you think about this approach ?
> > > 
> > keeping this in "task" struct is troublesome.
> > It may not livelock but near-to-livelock state, in bad case.
> 
> Hmm. I can't understand why it is troublesome. 
> I think it's related to moving oom_adj to singal_struct. 
> Unfortunately, I can't understand why we have to put oom_adj 
> in singal_struct?
> 
> That's why I have a question to Kosaki a while ago. 
> I can't understand it still. :-(
> 
> Could you elaborate it ?

Maybe, It's because my explanation is still poor. sorry.
Please give me one more chance.

In my previous mail, I explained select_bad_process() must not
unkillable task, is this ok?
IOW, if all thread have the same oom_adj, the issue gone.

signal_struct is shared all thread in the process. then, the issue gone.


btw, signal_struct is slightly bad name. currently it is used for
process information and almost its member is not signal related.
should we rename this?

> 
> > After applying Kosaki's , oom_kill will use
> > "for_each_process()" instead of "do_each_thread", I think it's a way to go.
> 
> I didn't review kosaki's approach entirely. 
> After reviewing, let's discuss it, again. 
> 
> > But, yes, your "scale_down" idea itself is interesitng.
> > Then, hmm, merging two of yours ?
> 
> If it is possible, I will do so. 
> 
> Thnaks for good comment, kame.





^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 1/4] oom: move oom_adj to signal_struct
@ 2009-08-05  6:53                   ` KOSAKI Motohiro
  0 siblings, 0 replies; 50+ messages in thread
From: KOSAKI Motohiro @ 2009-08-05  6:53 UTC (permalink / raw)
  To: Minchan Kim
  Cc: kosaki.motohiro, KAMEZAWA Hiroyuki, LKML, Paul Menage,
	David Rientjes, Rik van Riel, Andrew Morton, Linus Torvalds,
	Oleg Nesterov, linux-mm

> On Wed, 5 Aug 2009 15:03:23 +0900
> KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> 
> > On Wed, 5 Aug 2009 14:55:16 +0900
> > Minchan Kim <minchan.kim@gmail.com> wrote:
> > 
> > > On Wed,  5 Aug 2009 11:51:31 +0900 (JST)
> > > KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote:
> > > 
> > > > > On Wed,  5 Aug 2009 11:29:34 +0900 (JST)
> > > > > KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote:
> > > > > 
> > > > > > Hi
> > > > > > 
> > > > > > > Hi, Kosaki. 
> > > > > > > 
> > > > > > > I am so late to invole this thread. 
> > > > > > > But let me have a question. 
> > > > > > > 
> > > > > > > What's advantage of placing oom_adj in singal rather than task ?
> > > > > > > I mean task->oom_adj and task->signal->oom_adj ?
> > > > > > > 
> > > > > > > I am sorry if you already discussed it at last threads. 
> > > > > > 
> > > > > > Not sorry. that's very good question.
> > > > > > 
> > > > > > I'm trying to explain the detailed intention of commit 2ff05b2b4eac
> > > > > > (move oom_adj to mm_struct).
> > > > > > 
> > > > > > In 2.6.30, OOM logic callflow is here.
> > > > > > 
> > > > > > __out_of_memory
> > > > > > 	select_bad_process		for each task
> > > > > > 		badness			calculate badness of one task
> > > > > > 	oom_kill_process		search child
> > > > > > 		oom_kill_task		kill target task and mm shared tasks with it
> > > > > > 
> > > > > > example, process-A have two thread, thread-A and thread-B and it 
> > > > > > have very fat memory.
> > > > > > And, each thread have following likes oom property.
> > > > > > 
> > > > > > 	thread-A: oom_adj = OOM_DISABLE, oom_score = 0
> > > > > > 	thread-B: oom_adj = 0,           oom_score = very-high
> > > > > > 
> > > > > > Then, select_bad_process() select thread-B, but oom_kill_task refuse
> > > > > > kill the task because thread-A have OOM_DISABLE.
> > > > > > __out_of_memory() call select_bad_process() again. but select_bad_process()
> > > > > > select the same task. It mean kernel fall in the livelock.
> > > > > > 
> > > > > > The fact is, select_bad_process() must select killable task. otherwise
> > > > > > OOM logic go into livelock.
> > > > > > 
> > > > > > Is this enough explanation? thanks.
> > > > > > 
> > > 
> > > The problem resulted from David patch.
> > > It can solve live lock problem but make a new problem like vfork problem. 
> > > I think both can be solved by different approach. 
> > > 
> > > It's just RFC. 
> > > 
> > > If some process is selected by OOM killer but it have a child of OOM immune,
> > > We just decrease point of process. It can affect selection of bad process. 
> > > After some trial, at last bad score is drastically low and another process is 
> > > selected by OOM killer. So I think Live lock don't happen. 
> > > 
> > > New variable adding in task struct is rather high cost. 
> > > But i think we can union it with oomkilladj 
> > > since oomkilladj is used to present just -17 ~ 15. 
> > > 
> > > What do you think about this approach ?
> > > 
> > keeping this in "task" struct is troublesome.
> > It may not livelock but near-to-livelock state, in bad case.
> 
> Hmm. I can't understand why it is troublesome. 
> I think it's related to moving oom_adj to singal_struct. 
> Unfortunately, I can't understand why we have to put oom_adj 
> in singal_struct?
> 
> That's why I have a question to Kosaki a while ago. 
> I can't understand it still. :-(
> 
> Could you elaborate it ?

Maybe, It's because my explanation is still poor. sorry.
Please give me one more chance.

In my previous mail, I explained select_bad_process() must not
unkillable task, is this ok?
IOW, if all thread have the same oom_adj, the issue gone.

signal_struct is shared all thread in the process. then, the issue gone.


btw, signal_struct is slightly bad name. currently it is used for
process information and almost its member is not signal related.
should we rename this?

> 
> > After applying Kosaki's , oom_kill will use
> > "for_each_process()" instead of "do_each_thread", I think it's a way to go.
> 
> I didn't review kosaki's approach entirely. 
> After reviewing, let's discuss it, again. 
> 
> > But, yes, your "scale_down" idea itself is interesitng.
> > Then, hmm, merging two of yours ?
> 
> If it is possible, I will do so. 
> 
> Thnaks for good comment, kame.




--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 1/4] oom: move oom_adj to signal_struct
  2009-08-05  6:37                 ` Minchan Kim
@ 2009-08-05  6:55                   ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 50+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-08-05  6:55 UTC (permalink / raw)
  To: Minchan Kim
  Cc: KOSAKI Motohiro, LKML, Paul Menage, David Rientjes, Rik van Riel,
	Andrew Morton, Linus Torvalds, Oleg Nesterov, linux-mm

On Wed, 5 Aug 2009 15:37:01 +0900
Minchan Kim <minchan.kim@gmail.com> wrote:
> Hmm. I can't understand why it is troublesome. 
> I think it's related to moving oom_adj to singal_struct. 
> Unfortunately, I can't understand why we have to put oom_adj 
> in singal_struct?
> 
> That's why I have a question to Kosaki a while ago. 
> I can't understand it still. :-(
> 
> Could you elaborate it ?
> 

Current code is as following
==
  do_each_thread(g,p) {
	......
	p = badness();

	record p of highest badness.
  }
  p = higest badness thread.

  Scan all threads which shares mm_struct of p. and check oom_adj
  
==	
Assume a process which has 20000 threads. And 1 of thread has OOM_DISABLE.

Then, at worst, this scan will needs
	(1+2+3+....+20000) * (20000-1) scan. (when ignoring other processes)
even with your patch.

This means the kernel wastes enough long time that Cluster-Management-Software can
detetct this as livelock, and do reboot/cluster-fail-over.

Fixing livelock is not the last goal. I (we) would like to reduct stall time
to reasonable level. If we move oom_adj to signal_struct or mm_struct, scan-cost
will be only 20000. No retry at all.

And, if we can use for_each_process() rather than do_each_thread(),
scan-cost will be 1.

(BTW, "signal" struct is bad name I think, it should be "process" struct ;)


Thanks,
-Kame


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 1/4] oom: move oom_adj to signal_struct
@ 2009-08-05  6:55                   ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 50+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-08-05  6:55 UTC (permalink / raw)
  To: Minchan Kim
  Cc: KOSAKI Motohiro, LKML, Paul Menage, David Rientjes, Rik van Riel,
	Andrew Morton, Linus Torvalds, Oleg Nesterov, linux-mm

On Wed, 5 Aug 2009 15:37:01 +0900
Minchan Kim <minchan.kim@gmail.com> wrote:
> Hmm. I can't understand why it is troublesome. 
> I think it's related to moving oom_adj to singal_struct. 
> Unfortunately, I can't understand why we have to put oom_adj 
> in singal_struct?
> 
> That's why I have a question to Kosaki a while ago. 
> I can't understand it still. :-(
> 
> Could you elaborate it ?
> 

Current code is as following
==
  do_each_thread(g,p) {
	......
	p = badness();

	record p of highest badness.
  }
  p = higest badness thread.

  Scan all threads which shares mm_struct of p. and check oom_adj
  
==	
Assume a process which has 20000 threads. And 1 of thread has OOM_DISABLE.

Then, at worst, this scan will needs
	(1+2+3+....+20000) * (20000-1) scan. (when ignoring other processes)
even with your patch.

This means the kernel wastes enough long time that Cluster-Management-Software can
detetct this as livelock, and do reboot/cluster-fail-over.

Fixing livelock is not the last goal. I (we) would like to reduct stall time
to reasonable level. If we move oom_adj to signal_struct or mm_struct, scan-cost
will be only 20000. No retry at all.

And, if we can use for_each_process() rather than do_each_thread(),
scan-cost will be 1.

(BTW, "signal" struct is bad name I think, it should be "process" struct ;)


Thanks,
-Kame

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 1/4] oom: move oom_adj to signal_struct
  2009-08-05  6:53                   ` KOSAKI Motohiro
@ 2009-08-05  7:20                     ` Minchan Kim
  -1 siblings, 0 replies; 50+ messages in thread
From: Minchan Kim @ 2009-08-05  7:20 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Minchan Kim, KAMEZAWA Hiroyuki, LKML, Paul Menage,
	David Rientjes, Rik van Riel, Andrew Morton, Linus Torvalds,
	Oleg Nesterov, linux-mm

On Wed,  5 Aug 2009 15:53:31 +0900 (JST)
KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote:

> > On Wed, 5 Aug 2009 15:03:23 +0900
> > KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > 
> > > On Wed, 5 Aug 2009 14:55:16 +0900
> > > Minchan Kim <minchan.kim@gmail.com> wrote:
> > > 
> > > > On Wed,  5 Aug 2009 11:51:31 +0900 (JST)
> > > > KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote:
> > > > 
> > > > > > On Wed,  5 Aug 2009 11:29:34 +0900 (JST)
> > > > > > KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote:
> > > > > > 
> > > > > > > Hi
> > > > > > > 
> > > > > > > > Hi, Kosaki. 
> > > > > > > > 
> > > > > > > > I am so late to invole this thread. 
> > > > > > > > But let me have a question. 
> > > > > > > > 
> > > > > > > > What's advantage of placing oom_adj in singal rather than task ?
> > > > > > > > I mean task->oom_adj and task->signal->oom_adj ?
> > > > > > > > 
> > > > > > > > I am sorry if you already discussed it at last threads. 
> > > > > > > 
> > > > > > > Not sorry. that's very good question.
> > > > > > > 
> > > > > > > I'm trying to explain the detailed intention of commit 2ff05b2b4eac
> > > > > > > (move oom_adj to mm_struct).
> > > > > > > 
> > > > > > > In 2.6.30, OOM logic callflow is here.
> > > > > > > 
> > > > > > > __out_of_memory
> > > > > > > 	select_bad_process		for each task
> > > > > > > 		badness			calculate badness of one task
> > > > > > > 	oom_kill_process		search child
> > > > > > > 		oom_kill_task		kill target task and mm shared tasks with it
> > > > > > > 
> > > > > > > example, process-A have two thread, thread-A and thread-B and it 
> > > > > > > have very fat memory.
> > > > > > > And, each thread have following likes oom property.
> > > > > > > 
> > > > > > > 	thread-A: oom_adj = OOM_DISABLE, oom_score = 0
> > > > > > > 	thread-B: oom_adj = 0,           oom_score = very-high
> > > > > > > 
> > > > > > > Then, select_bad_process() select thread-B, but oom_kill_task refuse
> > > > > > > kill the task because thread-A have OOM_DISABLE.
> > > > > > > __out_of_memory() call select_bad_process() again. but select_bad_process()
> > > > > > > select the same task. It mean kernel fall in the livelock.
> > > > > > > 
> > > > > > > The fact is, select_bad_process() must select killable task. otherwise
> > > > > > > OOM logic go into livelock.
> > > > > > > 
> > > > > > > Is this enough explanation? thanks.
> > > > > > > 
> > > > 
> > > > The problem resulted from David patch.
> > > > It can solve live lock problem but make a new problem like vfork problem. 
> > > > I think both can be solved by different approach. 
> > > > 
> > > > It's just RFC. 
> > > > 
> > > > If some process is selected by OOM killer but it have a child of OOM immune,
> > > > We just decrease point of process. It can affect selection of bad process. 
> > > > After some trial, at last bad score is drastically low and another process is 
> > > > selected by OOM killer. So I think Live lock don't happen. 
> > > > 
> > > > New variable adding in task struct is rather high cost. 
> > > > But i think we can union it with oomkilladj 
> > > > since oomkilladj is used to present just -17 ~ 15. 
> > > > 
> > > > What do you think about this approach ?
> > > > 
> > > keeping this in "task" struct is troublesome.
> > > It may not livelock but near-to-livelock state, in bad case.
> > 
> > Hmm. I can't understand why it is troublesome. 
> > I think it's related to moving oom_adj to singal_struct. 
> > Unfortunately, I can't understand why we have to put oom_adj 
> > in singal_struct?
> > 
> > That's why I have a question to Kosaki a while ago. 
> > I can't understand it still. :-(
> > 
> > Could you elaborate it ?
> 
> Maybe, It's because my explanation is still poor. sorry.
> Please give me one more chance.
> 
> In my previous mail, I explained select_bad_process() must not
> unkillable task, is this ok?
> IOW, if all thread have the same oom_adj, the issue gone.
> 
> signal_struct is shared all thread in the process. then, the issue gone.
> 

Your and Kame's good explanation opens my eyes. :)
I realized your approach's benefit. 

Yes. Let's wait to listen others's opinios.



-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 1/4] oom: move oom_adj to signal_struct
@ 2009-08-05  7:20                     ` Minchan Kim
  0 siblings, 0 replies; 50+ messages in thread
From: Minchan Kim @ 2009-08-05  7:20 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Minchan Kim, KAMEZAWA Hiroyuki, LKML, Paul Menage,
	David Rientjes, Rik van Riel, Andrew Morton, Linus Torvalds,
	Oleg Nesterov, linux-mm

On Wed,  5 Aug 2009 15:53:31 +0900 (JST)
KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote:

> > On Wed, 5 Aug 2009 15:03:23 +0900
> > KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > 
> > > On Wed, 5 Aug 2009 14:55:16 +0900
> > > Minchan Kim <minchan.kim@gmail.com> wrote:
> > > 
> > > > On Wed,  5 Aug 2009 11:51:31 +0900 (JST)
> > > > KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote:
> > > > 
> > > > > > On Wed,  5 Aug 2009 11:29:34 +0900 (JST)
> > > > > > KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote:
> > > > > > 
> > > > > > > Hi
> > > > > > > 
> > > > > > > > Hi, Kosaki. 
> > > > > > > > 
> > > > > > > > I am so late to invole this thread. 
> > > > > > > > But let me have a question. 
> > > > > > > > 
> > > > > > > > What's advantage of placing oom_adj in singal rather than task ?
> > > > > > > > I mean task->oom_adj and task->signal->oom_adj ?
> > > > > > > > 
> > > > > > > > I am sorry if you already discussed it at last threads. 
> > > > > > > 
> > > > > > > Not sorry. that's very good question.
> > > > > > > 
> > > > > > > I'm trying to explain the detailed intention of commit 2ff05b2b4eac
> > > > > > > (move oom_adj to mm_struct).
> > > > > > > 
> > > > > > > In 2.6.30, OOM logic callflow is here.
> > > > > > > 
> > > > > > > __out_of_memory
> > > > > > > 	select_bad_process		for each task
> > > > > > > 		badness			calculate badness of one task
> > > > > > > 	oom_kill_process		search child
> > > > > > > 		oom_kill_task		kill target task and mm shared tasks with it
> > > > > > > 
> > > > > > > example, process-A have two thread, thread-A and thread-B and it 
> > > > > > > have very fat memory.
> > > > > > > And, each thread have following likes oom property.
> > > > > > > 
> > > > > > > 	thread-A: oom_adj = OOM_DISABLE, oom_score = 0
> > > > > > > 	thread-B: oom_adj = 0,           oom_score = very-high
> > > > > > > 
> > > > > > > Then, select_bad_process() select thread-B, but oom_kill_task refuse
> > > > > > > kill the task because thread-A have OOM_DISABLE.
> > > > > > > __out_of_memory() call select_bad_process() again. but select_bad_process()
> > > > > > > select the same task. It mean kernel fall in the livelock.
> > > > > > > 
> > > > > > > The fact is, select_bad_process() must select killable task. otherwise
> > > > > > > OOM logic go into livelock.
> > > > > > > 
> > > > > > > Is this enough explanation? thanks.
> > > > > > > 
> > > > 
> > > > The problem resulted from David patch.
> > > > It can solve live lock problem but make a new problem like vfork problem. 
> > > > I think both can be solved by different approach. 
> > > > 
> > > > It's just RFC. 
> > > > 
> > > > If some process is selected by OOM killer but it have a child of OOM immune,
> > > > We just decrease point of process. It can affect selection of bad process. 
> > > > After some trial, at last bad score is drastically low and another process is 
> > > > selected by OOM killer. So I think Live lock don't happen. 
> > > > 
> > > > New variable adding in task struct is rather high cost. 
> > > > But i think we can union it with oomkilladj 
> > > > since oomkilladj is used to present just -17 ~ 15. 
> > > > 
> > > > What do you think about this approach ?
> > > > 
> > > keeping this in "task" struct is troublesome.
> > > It may not livelock but near-to-livelock state, in bad case.
> > 
> > Hmm. I can't understand why it is troublesome. 
> > I think it's related to moving oom_adj to singal_struct. 
> > Unfortunately, I can't understand why we have to put oom_adj 
> > in singal_struct?
> > 
> > That's why I have a question to Kosaki a while ago. 
> > I can't understand it still. :-(
> > 
> > Could you elaborate it ?
> 
> Maybe, It's because my explanation is still poor. sorry.
> Please give me one more chance.
> 
> In my previous mail, I explained select_bad_process() must not
> unkillable task, is this ok?
> IOW, if all thread have the same oom_adj, the issue gone.
> 
> signal_struct is shared all thread in the process. then, the issue gone.
> 

Your and Kame's good explanation opens my eyes. :)
I realized your approach's benefit. 

Yes. Let's wait to listen others's opinios.



-- 
Kind regards,
Minchan Kim

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 4/4] oom: fix oom_adjust_write() input sanity check.
  2009-08-04 10:28   ` KOSAKI Motohiro
@ 2009-08-05 23:33     ` Andrew Morton
  -1 siblings, 0 replies; 50+ messages in thread
From: Andrew Morton @ 2009-08-05 23:33 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: LKML, Paul Menage, David Rientjes, KAMEZAWA Hiroyuki,
	Rik van Riel, Linus Torvalds, Oleg Nesterov, linux-mm

On Tue,  4 Aug 2009 19:28:03 +0900 (JST) KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote:

> Subject: [PATCH] oom: fix oom_adjust_write() input sanity check.
> 
> Andrew Morton pointed out oom_adjust_write() has very strange EIO
> and new line handling. this patch fixes it.
> 
> 
> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> Cc: Paul Menage <menage@google.com>
> Cc: David Rientjes <rientjes@google.com>
> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> Cc: Rik van Riel <riel@redhat.com>,
> Cc: Andrew Morton <akpm@linux-foundation.org>,
> ---
>  fs/proc/base.c |   12 +++++++-----
>  1 file changed, 7 insertions(+), 5 deletions(-)
> 
> Index: b/fs/proc/base.c
> ===================================================================
> --- a/fs/proc/base.c
> +++ b/fs/proc/base.c
> @@ -1033,12 +1033,15 @@ static ssize_t oom_adjust_write(struct f
>  		count = sizeof(buffer) - 1;
>  	if (copy_from_user(buffer, buf, count))
>  		return -EFAULT;
> +
> +	strstrip(buffer);

+1 for using strstrip()

-1 for using it wrongly.  If it strips leading whitespace it will
return a new address for the caller to use.

We could mark it __must_check() to prevent reoccurences of this error.

How does this look?

--- a/fs/proc/base.c~oom-fix-oom_adjust_write-input-sanity-check-fix
+++ a/fs/proc/base.c
@@ -1033,8 +1033,7 @@ static ssize_t oom_adjust_write(struct f
 	if (copy_from_user(buffer, buf, count))
 		return -EFAULT;
 
-	strstrip(buffer);
-	oom_adjust = simple_strtol(buffer, &end, 0);
+	oom_adjust = simple_strtol(strstrip(buffer), &end, 0);
 	if (*end)
 		return -EINVAL;
 	if ((oom_adjust < OOM_ADJUST_MIN || oom_adjust > OOM_ADJUST_MAX) &&


>  	oom_adjust = simple_strtol(buffer, &end, 0);

That should've used strict_strtoul() but it's too late to fix it now.

> +	if (*end)
> +		return -EINVAL;
>  	if ((oom_adjust < OOM_ADJUST_MIN || oom_adjust > OOM_ADJUST_MAX) &&
>  	     oom_adjust != OOM_DISABLE)
>  		return -EINVAL;
> -	if (*end == '\n')
> -		end++;
> +
>  	task = get_proc_task(file->f_path.dentry->d_inode);
>  	if (!task)
>  		return -ESRCH;
> @@ -1057,9 +1060,8 @@ static ssize_t oom_adjust_write(struct f
>  	task->signal->oom_adj = oom_adjust;
>  	unlock_task_sighand(task, &flags);
>  	put_task_struct(task);
> -	if (end - buffer == 0)
> -		return -EIO;
> -	return end - buffer;
> +
> +	return count;
>  }
>  
>  static const struct file_operations proc_oom_adjust_operations = {
> 

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 4/4] oom: fix oom_adjust_write() input sanity check.
@ 2009-08-05 23:33     ` Andrew Morton
  0 siblings, 0 replies; 50+ messages in thread
From: Andrew Morton @ 2009-08-05 23:33 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: LKML, Paul Menage, David Rientjes, KAMEZAWA Hiroyuki,
	Rik van Riel, Linus Torvalds, Oleg Nesterov, linux-mm

On Tue,  4 Aug 2009 19:28:03 +0900 (JST) KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote:

> Subject: [PATCH] oom: fix oom_adjust_write() input sanity check.
> 
> Andrew Morton pointed out oom_adjust_write() has very strange EIO
> and new line handling. this patch fixes it.
> 
> 
> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> Cc: Paul Menage <menage@google.com>
> Cc: David Rientjes <rientjes@google.com>
> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> Cc: Rik van Riel <riel@redhat.com>,
> Cc: Andrew Morton <akpm@linux-foundation.org>,
> ---
>  fs/proc/base.c |   12 +++++++-----
>  1 file changed, 7 insertions(+), 5 deletions(-)
> 
> Index: b/fs/proc/base.c
> ===================================================================
> --- a/fs/proc/base.c
> +++ b/fs/proc/base.c
> @@ -1033,12 +1033,15 @@ static ssize_t oom_adjust_write(struct f
>  		count = sizeof(buffer) - 1;
>  	if (copy_from_user(buffer, buf, count))
>  		return -EFAULT;
> +
> +	strstrip(buffer);

+1 for using strstrip()

-1 for using it wrongly.  If it strips leading whitespace it will
return a new address for the caller to use.

We could mark it __must_check() to prevent reoccurences of this error.

How does this look?

--- a/fs/proc/base.c~oom-fix-oom_adjust_write-input-sanity-check-fix
+++ a/fs/proc/base.c
@@ -1033,8 +1033,7 @@ static ssize_t oom_adjust_write(struct f
 	if (copy_from_user(buffer, buf, count))
 		return -EFAULT;
 
-	strstrip(buffer);
-	oom_adjust = simple_strtol(buffer, &end, 0);
+	oom_adjust = simple_strtol(strstrip(buffer), &end, 0);
 	if (*end)
 		return -EINVAL;
 	if ((oom_adjust < OOM_ADJUST_MIN || oom_adjust > OOM_ADJUST_MAX) &&


>  	oom_adjust = simple_strtol(buffer, &end, 0);

That should've used strict_strtoul() but it's too late to fix it now.

> +	if (*end)
> +		return -EINVAL;
>  	if ((oom_adjust < OOM_ADJUST_MIN || oom_adjust > OOM_ADJUST_MAX) &&
>  	     oom_adjust != OOM_DISABLE)
>  		return -EINVAL;
> -	if (*end == '\n')
> -		end++;
> +
>  	task = get_proc_task(file->f_path.dentry->d_inode);
>  	if (!task)
>  		return -ESRCH;
> @@ -1057,9 +1060,8 @@ static ssize_t oom_adjust_write(struct f
>  	task->signal->oom_adj = oom_adjust;
>  	unlock_task_sighand(task, &flags);
>  	put_task_struct(task);
> -	if (end - buffer == 0)
> -		return -EIO;
> -	return end - buffer;
> +
> +	return count;
>  }
>  
>  static const struct file_operations proc_oom_adjust_operations = {
> 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH for 2.6.31 0/4] fix oom_adj regression v2
  2009-08-04 10:25 ` KOSAKI Motohiro
@ 2009-08-05 23:39   ` Andrew Morton
  -1 siblings, 0 replies; 50+ messages in thread
From: Andrew Morton @ 2009-08-05 23:39 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: LKML, Paul Menage, David Rientjes, KAMEZAWA Hiroyuki,
	Rik van Riel, Linus Torvalds, Oleg Nesterov, linux-mm

On Tue,  4 Aug 2009 19:25:08 +0900 (JST) KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote:

> The commit 2ff05b2b (oom: move oom_adj value) move oom_adj value to mm_struct.
> It is very good first step for sanitize OOM.
> 
> However Paul Menage reported the commit makes regression to his job scheduler.
> Current OOM logic can kill OOM_DISABLED process.
> 
> Why? His program has the code of similar to the following.
> 
> 	...
> 	set_oom_adj(OOM_DISABLE); /* The job scheduler never killed by oom */
> 	...
> 	if (vfork() == 0) {
> 		set_oom_adj(0); /* Invoked child can be killed */
> 		execve("foo-bar-cmd")
> 	}
> 	....
> 
> vfork() parent and child are shared the same mm_struct. then above set_oom_adj(0) doesn't
> only change oom_adj for vfork() child, it's also change oom_adj for vfork() parent.
> Then, vfork() parent (job scheduler) lost OOM immune and it was killed.
> 
> Actually, fork-setting-exec idiom is very frequently used in userland program. We must
> not break this assumption.
> 
> This patch series are slightly big, but we must fix any regression soon.
> 

So I merged these but I have a feeling that this isn't the last I'll be
hearing on the topic ;)

Given the amount of churn, the amount of discussion and the size of the
patches, this doesn't look like something we should push into 2.6.31.  

If we think that the 2ff05b2b regression is sufficiently serious to be
a must-fix for 2.6.31 then can we please find something safer and
smaller?  Like reverting 2ff05b2b?


These patches clash with the controversial
mm-introduce-proc-pid-oom_adj_child.patch, so I've disabled that patch
now.


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH for 2.6.31 0/4] fix oom_adj regression v2
@ 2009-08-05 23:39   ` Andrew Morton
  0 siblings, 0 replies; 50+ messages in thread
From: Andrew Morton @ 2009-08-05 23:39 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: LKML, Paul Menage, David Rientjes, KAMEZAWA Hiroyuki,
	Rik van Riel, Linus Torvalds, Oleg Nesterov, linux-mm

On Tue,  4 Aug 2009 19:25:08 +0900 (JST) KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote:

> The commit 2ff05b2b (oom: move oom_adj value) move oom_adj value to mm_struct.
> It is very good first step for sanitize OOM.
> 
> However Paul Menage reported the commit makes regression to his job scheduler.
> Current OOM logic can kill OOM_DISABLED process.
> 
> Why? His program has the code of similar to the following.
> 
> 	...
> 	set_oom_adj(OOM_DISABLE); /* The job scheduler never killed by oom */
> 	...
> 	if (vfork() == 0) {
> 		set_oom_adj(0); /* Invoked child can be killed */
> 		execve("foo-bar-cmd")
> 	}
> 	....
> 
> vfork() parent and child are shared the same mm_struct. then above set_oom_adj(0) doesn't
> only change oom_adj for vfork() child, it's also change oom_adj for vfork() parent.
> Then, vfork() parent (job scheduler) lost OOM immune and it was killed.
> 
> Actually, fork-setting-exec idiom is very frequently used in userland program. We must
> not break this assumption.
> 
> This patch series are slightly big, but we must fix any regression soon.
> 

So I merged these but I have a feeling that this isn't the last I'll be
hearing on the topic ;)

Given the amount of churn, the amount of discussion and the size of the
patches, this doesn't look like something we should push into 2.6.31.  

If we think that the 2ff05b2b regression is sufficiently serious to be
a must-fix for 2.6.31 then can we please find something safer and
smaller?  Like reverting 2ff05b2b?


These patches clash with the controversial
mm-introduce-proc-pid-oom_adj_child.patch, so I've disabled that patch
now.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 1/4] oom: move oom_adj to signal_struct
  2009-08-04 10:25   ` KOSAKI Motohiro
@ 2009-08-06  1:34     ` Oleg Nesterov
  -1 siblings, 0 replies; 50+ messages in thread
From: Oleg Nesterov @ 2009-08-06  1:34 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: LKML, Paul Menage, David Rientjes, KAMEZAWA Hiroyuki,
	Rik van Riel, Andrew Morton, Linus Torvalds, linux-mm

Sorry for late reply. And sorry, I didn't read these patches carefully yet,
probably missed something...

On 08/04, KOSAKI Motohiro wrote:
>
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -34,6 +34,31 @@ int sysctl_oom_dump_tasks;
>  static DEFINE_SPINLOCK(zone_scan_lock);
>  /* #define DEBUG */
>
> +int get_oom_adj(struct task_struct *tsk)

is it used outside oom_kill.c ?

> +{
> +	unsigned long flags;
> +	int oom_adj = OOM_DISABLE;
> +
> +	if (tsk->mm && lock_task_sighand(tsk, &flags)) {

Minor nit. _Afaics_, unlike proc, oom_kill.c never needs lock_task_sighand()
to access ->signal->oom_adj.

If the task was found under tasklist_lock by for_each_process/do_each_thread
it must have the valid ->signal != NULL and it can't go away.


With these patches I think mm-introduce-proc-pid-oom_adj_child.patch should
be dropped. This is good ;)

Oleg.


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 1/4] oom: move oom_adj to signal_struct
@ 2009-08-06  1:34     ` Oleg Nesterov
  0 siblings, 0 replies; 50+ messages in thread
From: Oleg Nesterov @ 2009-08-06  1:34 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: LKML, Paul Menage, David Rientjes, KAMEZAWA Hiroyuki,
	Rik van Riel, Andrew Morton, Linus Torvalds, linux-mm

Sorry for late reply. And sorry, I didn't read these patches carefully yet,
probably missed something...

On 08/04, KOSAKI Motohiro wrote:
>
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -34,6 +34,31 @@ int sysctl_oom_dump_tasks;
>  static DEFINE_SPINLOCK(zone_scan_lock);
>  /* #define DEBUG */
>
> +int get_oom_adj(struct task_struct *tsk)

is it used outside oom_kill.c ?

> +{
> +	unsigned long flags;
> +	int oom_adj = OOM_DISABLE;
> +
> +	if (tsk->mm && lock_task_sighand(tsk, &flags)) {

Minor nit. _Afaics_, unlike proc, oom_kill.c never needs lock_task_sighand()
to access ->signal->oom_adj.

If the task was found under tasklist_lock by for_each_process/do_each_thread
it must have the valid ->signal != NULL and it can't go away.


With these patches I think mm-introduce-proc-pid-oom_adj_child.patch should
be dropped. This is good ;)

Oleg.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 4/4] oom: fix oom_adjust_write() input sanity check.
  2009-08-05 23:33     ` Andrew Morton
@ 2009-08-06  5:06       ` KOSAKI Motohiro
  -1 siblings, 0 replies; 50+ messages in thread
From: KOSAKI Motohiro @ 2009-08-06  5:06 UTC (permalink / raw)
  To: Andrew Morton
  Cc: LKML, Paul Menage, David Rientjes, KAMEZAWA Hiroyuki,
	Rik van Riel, Linus Torvalds, Oleg Nesterov, linux-mm

>> @@ -1033,12 +1033,15 @@ static ssize_t oom_adjust_write(struct f
>>               count = sizeof(buffer) - 1;
>>       if (copy_from_user(buffer, buf, count))
>>               return -EFAULT;
>> +
>> +     strstrip(buffer);
>
> +1 for using strstrip()
>
> -1 for using it wrongly.  If it strips leading whitespace it will
> return a new address for the caller to use.

Will fix. thanks.

Paul, hehe your kernel/cgroup.c parsing have the same problem. Could you
please fix it too? :-)

> We could mark it __must_check() to prevent reoccurences of this error.
>
> How does this look?

I see.
I'll do and send it.

>
> --- a/fs/proc/base.c~oom-fix-oom_adjust_write-input-sanity-check-fix
> +++ a/fs/proc/base.c
> @@ -1033,8 +1033,7 @@ static ssize_t oom_adjust_write(struct f
>        if (copy_from_user(buffer, buf, count))
>                return -EFAULT;
>
> -       strstrip(buffer);
> -       oom_adjust = simple_strtol(buffer, &end, 0);
> +       oom_adjust = simple_strtol(strstrip(buffer), &end, 0);
>        if (*end)
>                return -EINVAL;
>        if ((oom_adjust < OOM_ADJUST_MIN || oom_adjust > OOM_ADJUST_MAX) &&
>
>
>>       oom_adjust = simple_strtol(buffer, &end, 0);
>
> That should've used strict_strtoul() but it's too late to fix it now.

Perhaps, it's not too late. I never seen userland program pass non number value
into this.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 4/4] oom: fix oom_adjust_write() input sanity check.
@ 2009-08-06  5:06       ` KOSAKI Motohiro
  0 siblings, 0 replies; 50+ messages in thread
From: KOSAKI Motohiro @ 2009-08-06  5:06 UTC (permalink / raw)
  To: Andrew Morton
  Cc: LKML, Paul Menage, David Rientjes, KAMEZAWA Hiroyuki,
	Rik van Riel, Linus Torvalds, Oleg Nesterov, linux-mm

>> @@ -1033,12 +1033,15 @@ static ssize_t oom_adjust_write(struct f
>>               count = sizeof(buffer) - 1;
>>       if (copy_from_user(buffer, buf, count))
>>               return -EFAULT;
>> +
>> +     strstrip(buffer);
>
> +1 for using strstrip()
>
> -1 for using it wrongly.  If it strips leading whitespace it will
> return a new address for the caller to use.

Will fix. thanks.

Paul, hehe your kernel/cgroup.c parsing have the same problem. Could you
please fix it too? :-)

> We could mark it __must_check() to prevent reoccurences of this error.
>
> How does this look?

I see.
I'll do and send it.

>
> --- a/fs/proc/base.c~oom-fix-oom_adjust_write-input-sanity-check-fix
> +++ a/fs/proc/base.c
> @@ -1033,8 +1033,7 @@ static ssize_t oom_adjust_write(struct f
>        if (copy_from_user(buffer, buf, count))
>                return -EFAULT;
>
> -       strstrip(buffer);
> -       oom_adjust = simple_strtol(buffer, &end, 0);
> +       oom_adjust = simple_strtol(strstrip(buffer), &end, 0);
>        if (*end)
>                return -EINVAL;
>        if ((oom_adjust < OOM_ADJUST_MIN || oom_adjust > OOM_ADJUST_MAX) &&
>
>
>>       oom_adjust = simple_strtol(buffer, &end, 0);
>
> That should've used strict_strtoul() but it's too late to fix it now.

Perhaps, it's not too late. I never seen userland program pass non number value
into this.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH for 2.6.31 0/4] fix oom_adj regression v2
  2009-08-05 23:39   ` Andrew Morton
@ 2009-08-06  5:13     ` KOSAKI Motohiro
  -1 siblings, 0 replies; 50+ messages in thread
From: KOSAKI Motohiro @ 2009-08-06  5:13 UTC (permalink / raw)
  To: Andrew Morton
  Cc: LKML, Paul Menage, David Rientjes, KAMEZAWA Hiroyuki,
	Rik van Riel, Linus Torvalds, Oleg Nesterov, linux-mm

> So I merged these but I have a feeling that this isn't the last I'll be
> hearing on the topic ;)
>
> Given the amount of churn, the amount of discussion and the size of the
> patches, this doesn't look like something we should push into 2.6.31.
>
> If we think that the 2ff05b2b regression is sufficiently serious to be
> a must-fix for 2.6.31 then can we please find something safer and
> smaller?  Like reverting 2ff05b2b?

I don't think the serious problem is only  this issue, I oppose to
ignore regression
bug report ;-)

Yes, your point makes sense. then, I'll make two patch series.
1. reverting 2ff05b2b for 2.6.31
2. retry fix oom livelock for -mm

I expect I can do that next sunday.


> These patches clash with the controversial
> mm-introduce-proc-pid-oom_adj_child.patch, so I've disabled that patch
> now.

I think we can drop this because workaround patch is only needed until
the issue not fixed.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH for 2.6.31 0/4] fix oom_adj regression v2
@ 2009-08-06  5:13     ` KOSAKI Motohiro
  0 siblings, 0 replies; 50+ messages in thread
From: KOSAKI Motohiro @ 2009-08-06  5:13 UTC (permalink / raw)
  To: Andrew Morton
  Cc: LKML, Paul Menage, David Rientjes, KAMEZAWA Hiroyuki,
	Rik van Riel, Linus Torvalds, Oleg Nesterov, linux-mm

> So I merged these but I have a feeling that this isn't the last I'll be
> hearing on the topic ;)
>
> Given the amount of churn, the amount of discussion and the size of the
> patches, this doesn't look like something we should push into 2.6.31.
>
> If we think that the 2ff05b2b regression is sufficiently serious to be
> a must-fix for 2.6.31 then can we please find something safer and
> smaller?  Like reverting 2ff05b2b?

I don't think the serious problem is only  this issue, I oppose to
ignore regression
bug report ;-)

Yes, your point makes sense. then, I'll make two patch series.
1. reverting 2ff05b2b for 2.6.31
2. retry fix oom livelock for -mm

I expect I can do that next sunday.


> These patches clash with the controversial
> mm-introduce-proc-pid-oom_adj_child.patch, so I've disabled that patch
> now.

I think we can drop this because workaround patch is only needed until
the issue not fixed.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 1/4] oom: move oom_adj to signal_struct
  2009-08-06  1:34     ` Oleg Nesterov
@ 2009-08-06  5:16       ` KOSAKI Motohiro
  -1 siblings, 0 replies; 50+ messages in thread
From: KOSAKI Motohiro @ 2009-08-06  5:16 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: LKML, Paul Menage, David Rientjes, KAMEZAWA Hiroyuki,
	Rik van Riel, Andrew Morton, Linus Torvalds, linux-mm

2009/8/6 Oleg Nesterov <oleg@redhat.com>:
> Sorry for late reply. And sorry, I didn't read these patches carefully yet,
> probably missed something...
>
> On 08/04, KOSAKI Motohiro wrote:
>>
>> --- a/mm/oom_kill.c
>> +++ b/mm/oom_kill.c
>> @@ -34,6 +34,31 @@ int sysctl_oom_dump_tasks;
>>  static DEFINE_SPINLOCK(zone_scan_lock);
>>  /* #define DEBUG */
>>
>> +int get_oom_adj(struct task_struct *tsk)
>
> is it used outside oom_kill.c ?

Good catch.
Will fix.


>> +{
>> +     unsigned long flags;
>> +     int oom_adj = OOM_DISABLE;
>> +
>> +     if (tsk->mm && lock_task_sighand(tsk, &flags)) {
>
> Minor nit. _Afaics_, unlike proc, oom_kill.c never needs lock_task_sighand()
> to access ->signal->oom_adj.
>
> If the task was found under tasklist_lock by for_each_process/do_each_thread
> it must have the valid ->signal != NULL and it can't go away.

Thanks good suggestion!
Will fix.



> With these patches I think mm-introduce-proc-pid-oom_adj_child.patch should
> be dropped. This is good ;)

I agree, It should be dropped.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 1/4] oom: move oom_adj to signal_struct
@ 2009-08-06  5:16       ` KOSAKI Motohiro
  0 siblings, 0 replies; 50+ messages in thread
From: KOSAKI Motohiro @ 2009-08-06  5:16 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: LKML, Paul Menage, David Rientjes, KAMEZAWA Hiroyuki,
	Rik van Riel, Andrew Morton, Linus Torvalds, linux-mm

2009/8/6 Oleg Nesterov <oleg@redhat.com>:
> Sorry for late reply. And sorry, I didn't read these patches carefully yet,
> probably missed something...
>
> On 08/04, KOSAKI Motohiro wrote:
>>
>> --- a/mm/oom_kill.c
>> +++ b/mm/oom_kill.c
>> @@ -34,6 +34,31 @@ int sysctl_oom_dump_tasks;
>>  static DEFINE_SPINLOCK(zone_scan_lock);
>>  /* #define DEBUG */
>>
>> +int get_oom_adj(struct task_struct *tsk)
>
> is it used outside oom_kill.c ?

Good catch.
Will fix.


>> +{
>> +     unsigned long flags;
>> +     int oom_adj = OOM_DISABLE;
>> +
>> +     if (tsk->mm && lock_task_sighand(tsk, &flags)) {
>
> Minor nit. _Afaics_, unlike proc, oom_kill.c never needs lock_task_sighand()
> to access ->signal->oom_adj.
>
> If the task was found under tasklist_lock by for_each_process/do_each_thread
> it must have the valid ->signal != NULL and it can't go away.

Thanks good suggestion!
Will fix.



> With these patches I think mm-introduce-proc-pid-oom_adj_child.patch should
> be dropped. This is good ;)

I agree, It should be dropped.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH for 2.6.31 0/4] fix oom_adj regression v2
  2009-08-06  5:13     ` KOSAKI Motohiro
@ 2009-08-06  8:07       ` Minchan Kim
  -1 siblings, 0 replies; 50+ messages in thread
From: Minchan Kim @ 2009-08-06  8:07 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Andrew Morton, LKML, Paul Menage, David Rientjes,
	KAMEZAWA Hiroyuki, Rik van Riel, Linus Torvalds, Oleg Nesterov,
	linux-mm

Hi, Kosaki.

On Thu, Aug 6, 2009 at 2:13 PM, KOSAKI
Motohiro<kosaki.motohiro@jp.fujitsu.com> wrote:
>> So I merged these but I have a feeling that this isn't the last I'll be
>> hearing on the topic ;)
>>
>> Given the amount of churn, the amount of discussion and the size of the
>> patches, this doesn't look like something we should push into 2.6.31.
>>
>> If we think that the 2ff05b2b regression is sufficiently serious to be
>> a must-fix for 2.6.31 then can we please find something safer and
>> smaller?  Like reverting 2ff05b2b?
>
> I don't think the serious problem is only  this issue, I oppose to
> ignore regression
> bug report ;-)
>
> Yes, your point makes sense. then, I'll make two patch series.
> 1. reverting 2ff05b2b for 2.6.31
> 2. retry fix oom livelock for -mm
>
> I expect I can do that next sunday.
>
>
>> These patches clash with the controversial
>> mm-introduce-proc-pid-oom_adj_child.patch, so I've disabled that patch
>> now.
>
> I think we can drop this because workaround patch is only needed until
> the issue not fixed.


I looked over your this patches.
I can't find out merge point our both idea.

I think it would be better than mine.

As kame pointed out, you patch can solve long stall time of do_each_thread
as well as livelock.

So I will wait for you to finish this work and review then.
I ask you add me in CC, then. :)

Thanks.

>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a hrefmailto:"dont@kvack.org"> email@kvack.org </a>
>



-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH for 2.6.31 0/4] fix oom_adj regression v2
@ 2009-08-06  8:07       ` Minchan Kim
  0 siblings, 0 replies; 50+ messages in thread
From: Minchan Kim @ 2009-08-06  8:07 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Andrew Morton, LKML, Paul Menage, David Rientjes,
	KAMEZAWA Hiroyuki, Rik van Riel, Linus Torvalds, Oleg Nesterov,
	linux-mm

Hi, Kosaki.

On Thu, Aug 6, 2009 at 2:13 PM, KOSAKI
Motohiro<kosaki.motohiro@jp.fujitsu.com> wrote:
>> So I merged these but I have a feeling that this isn't the last I'll be
>> hearing on the topic ;)
>>
>> Given the amount of churn, the amount of discussion and the size of the
>> patches, this doesn't look like something we should push into 2.6.31.
>>
>> If we think that the 2ff05b2b regression is sufficiently serious to be
>> a must-fix for 2.6.31 then can we please find something safer and
>> smaller?  Like reverting 2ff05b2b?
>
> I don't think the serious problem is only  this issue, I oppose to
> ignore regression
> bug report ;-)
>
> Yes, your point makes sense. then, I'll make two patch series.
> 1. reverting 2ff05b2b for 2.6.31
> 2. retry fix oom livelock for -mm
>
> I expect I can do that next sunday.
>
>
>> These patches clash with the controversial
>> mm-introduce-proc-pid-oom_adj_child.patch, so I've disabled that patch
>> now.
>
> I think we can drop this because workaround patch is only needed until
> the issue not fixed.


I looked over your this patches.
I can't find out merge point our both idea.

I think it would be better than mine.

As kame pointed out, you patch can solve long stall time of do_each_thread
as well as livelock.

So I will wait for you to finish this work and review then.
I ask you add me in CC, then. :)

Thanks.

>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a hrefmailto:"dont@kvack.org"> email@kvack.org </a>
>



-- 
Kind regards,
Minchan Kim

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 50+ messages in thread

end of thread, other threads:[~2009-08-06  8:07 UTC | newest]

Thread overview: 50+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-08-04 10:25 [PATCH for 2.6.31 0/4] fix oom_adj regression v2 KOSAKI Motohiro
2009-08-04 10:25 ` KOSAKI Motohiro
2009-08-04 10:25 ` [PATCH 1/4] oom: move oom_adj to signal_struct KOSAKI Motohiro
2009-08-04 10:25   ` KOSAKI Motohiro
2009-08-05  0:45   ` Minchan Kim
2009-08-05  0:45     ` Minchan Kim
2009-08-05  2:29     ` KOSAKI Motohiro
2009-08-05  2:29       ` KOSAKI Motohiro
2009-08-05  2:40       ` Minchan Kim
2009-08-05  2:40         ` Minchan Kim
2009-08-05  2:51         ` KOSAKI Motohiro
2009-08-05  2:51           ` KOSAKI Motohiro
2009-08-05  5:55           ` Minchan Kim
2009-08-05  5:55             ` Minchan Kim
2009-08-05  6:03             ` KAMEZAWA Hiroyuki
2009-08-05  6:03               ` KAMEZAWA Hiroyuki
2009-08-05  6:37               ` Minchan Kim
2009-08-05  6:37                 ` Minchan Kim
2009-08-05  6:53                 ` KOSAKI Motohiro
2009-08-05  6:53                   ` KOSAKI Motohiro
2009-08-05  7:20                   ` Minchan Kim
2009-08-05  7:20                     ` Minchan Kim
2009-08-05  6:55                 ` KAMEZAWA Hiroyuki
2009-08-05  6:55                   ` KAMEZAWA Hiroyuki
2009-08-05  6:04             ` KOSAKI Motohiro
2009-08-05  6:04               ` KOSAKI Motohiro
2009-08-05  6:29               ` Minchan Kim
2009-08-05  6:29                 ` Minchan Kim
2009-08-05  6:47                 ` KOSAKI Motohiro
2009-08-05  6:47                   ` KOSAKI Motohiro
2009-08-06  1:34   ` Oleg Nesterov
2009-08-06  1:34     ` Oleg Nesterov
2009-08-06  5:16     ` KOSAKI Motohiro
2009-08-06  5:16       ` KOSAKI Motohiro
2009-08-04 10:26 ` [PATCH 2/4] oom: make oom_score to per-process value KOSAKI Motohiro
2009-08-04 10:26   ` KOSAKI Motohiro
2009-08-04 10:27 ` [PATCH 3/4] oom: oom_kill doesn't kill vfork parent(or child) KOSAKI Motohiro
2009-08-04 10:27   ` KOSAKI Motohiro
2009-08-04 10:28 ` [PATCH 4/4] oom: fix oom_adjust_write() input sanity check KOSAKI Motohiro
2009-08-04 10:28   ` KOSAKI Motohiro
2009-08-05 23:33   ` Andrew Morton
2009-08-05 23:33     ` Andrew Morton
2009-08-06  5:06     ` KOSAKI Motohiro
2009-08-06  5:06       ` KOSAKI Motohiro
2009-08-05 23:39 ` [PATCH for 2.6.31 0/4] fix oom_adj regression v2 Andrew Morton
2009-08-05 23:39   ` Andrew Morton
2009-08-06  5:13   ` KOSAKI Motohiro
2009-08-06  5:13     ` KOSAKI Motohiro
2009-08-06  8:07     ` Minchan Kim
2009-08-06  8:07       ` Minchan Kim

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.